Tons of Curation Heuristics Available!
+We have tools that can solve these problems!
-... that can be wrong
+... most of the time
... very wrong
-+ Problem: It's hard to trust tools that can be wrong! +
+
+ In the name of Codd
Thou shalt not give the user a wrong answer.
+
... but when combined with heuristics
- Incomplete and Probabilistic Databases
have existed since the 1980s...
-
T. Imielinski & W. Lipski Jr. (VLDB 1981)
(Typical Heuristics)
+1. ProbDBs Produce Probability Distributions as Outputs
+2. ProbDBs Require Probability Distributions as Inputs
(Probabilistic Query Outputs)
-(Joint work with Boris Glavic, Su Feng, Aaron Huber)
+
|
+ + |
|
+
The relational view
+The functional view
+$$R(1, 2) \mapsto 1$$
+$$R(1, 3) \mapsto 1$$
+$$R(4, 3) \mapsto 1$$
+$$R(4, 5) \mapsto 0$$
+$$S(3, 6) \mapsto 2$$
+$= S(3, 6) + S(3, 6)$
+$= 2 + 2 = 4$
+$= R(4, 3) \times S(3, 6)$
+$= 1 \times 2 = 2$
+$= \sum_{Y} R(Y, 3)$
+$ = R(1, 3) + R(4, 3) + \ldots$
+$= 1 + 1 + 0 = 2$
+$\cup$ | $\approx$ | $+$ |
$\bowtie$ | $\approx$ | $\times$ |
$\pi$ | $\approx$ | $+$ |
T.J. Green & G. Karvounarakis & V. Tannen (PODS 2007)
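The correspondence in the table above can be tried out on the annotated tuples from the earlier slide. The following is a minimal sketch, assuming relations are stored as Python dictionaries that map tuples to natural-number annotations; the helper names `union`, `join`, and `project_b` are hypothetical and are not part of Mimir.

```python
from collections import defaultdict

# Annotated relations from the slides: tuple -> multiplicity.
R = {(1, 2): 1, (1, 3): 1, (4, 3): 1, (4, 5): 0}
S = {(3, 6): 2}

def union(r, s):
    """Union: annotations of identical tuples are added."""
    out = defaultdict(int)
    for rel in (r, s):
        for t, k in rel.items():
            out[t] += k
    return dict(out)

def join(r, s):
    """Join R(A, B) with S(B, C) on B: annotations are multiplied."""
    out = defaultdict(int)
    for (a, b), k1 in r.items():
        for (b2, c), k2 in s.items():
            if b == b2:
                out[(a, b, c)] += k1 * k2
    return dict(out)

def project_b(r):
    """Project R(A, B) onto B: annotations of merged tuples are added."""
    out = defaultdict(int)
    for (_, b), k in r.items():
        out[(b,)] += k
    return dict(out)

union_ss = union(S, S)      # S(3, 6) + S(3, 6)
joined = join(R, S)         # R(4, 3) x S(3, 6), among others
projected = project_b(R)    # sum over Y of R(Y, 3), among others

print(union_ss[(3, 6)], joined[(4, 3, 6)], projected[(3,)])
```

The three printed values correspond to the three worked sums on the slide (union, join, projection).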
+Semiring | Equivalent Query Semantics |
---|---|
$\left<\mathbb N, +, \times, 0, 1\right>$ | Bag Semantics |
$\left<\mathbb B, \vee, \wedge, \bot, \top\right>$ | Set Semantics |
$\left<\mathcal K^W, \vec \oplus, \vec \otimes, \mathbb{\vec 0}, \mathbb{\vec 1}\right>$ | Possible Worlds Semantics |
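To illustrate the first two rows of the table, the sketch below runs the same projection under two semirings and gets bag or set semantics depending only on which one is plugged in. The `Semiring` class and the `NAT` and `BOOL` names are assumptions made for this example, not an API from Mimir or from the cited paper.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

# <N, +, x, 0, 1>: bag semantics
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# <B, or, and, false, true>: set semantics
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)

def project_b(rel, k):
    """Project R(A, B) onto B, combining annotations with the semiring's plus."""
    out = {}
    for (_, b), ann in rel.items():
        out[b] = k.plus(out.get(b, k.zero), ann)
    return out

bag_r = {(1, 2): 1, (1, 3): 1, (4, 3): 1}
set_r = {t: True for t in bag_r}

bag_result = project_b(bag_r, NAT)    # multiplicities are counted
set_result = project_b(set_r, BOOL)   # only membership is tracked
print(bag_result, set_result)
```

The query code never changes; only the annotation domain does, which is the point of the table.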
R | A | B |
---|---|---|
1 | 2 | |
1 | 3 | |
3 |
|
+ + |
|
+
R | A | B | |
---|---|---|---|
1 | 2 | $\mapsto [1,1]$ | |
1 | 3 | $\mapsto [1,1]$ | |
4 | 3 | $\mapsto [1,0]$ | |
9 | 3 | $\mapsto [0,1]$ |
+
|
+
+ $$\texttt{PW}_0(R(1, 2)) = 1$$
+ $$\texttt{PW}_0(R(4, 3)) = 1$$
+ $$\texttt{PW}_1(R(4, 3)) = 0$$
+ $$\mathcal C(R(4, 3)) = 0$$
+ $$\mathcal P(R(4, 3)) = 1$$
+ |
+
A quick step back into reality...
+ +R | A | B |
---|---|---|
1 | 2 | |
1 | 3 | |
4 or 9 | 3 |
+
R | A | B |
---|---|---|
1 | 2 | |
1 | 3 | |
4 or 9 | 3 |
Standard practice: "Just use the best option."
+What's in between these extremes?
+R | A | B | |
---|---|---|---|
1 | 2 | ||
1 | 3 | ||
4 | 3 | * |
Use the best option, but mark potential errors.
$PW_{i}(Q(\mathcal D))$ | The results Alice would have "just used". |
$\mathcal C(Q(\mathcal D))$ | Which of those results are trustworthy. |
(Computing $PW_{i}(Q(\mathcal D))$ is cheap!)
+Can we do the same thing for $\mathcal C(Q(\mathcal D))$?
+R | A | B | $K^W$ | $\mathcal C$ | |
---|---|---|---|---|---|
1 | 2 | $\mapsto$ | $[1,1]$ | 1 | |
1 | 3 | $\mapsto$ | $[1,1]$ | 1 | |
4 | 3 | $\mapsto$ | $[1,0]$ | 0 | |
9 | 3 | $\mapsto$ | $[0,1]$ | 0 |
Compute $\pi_B(R)$
+$\pi_B$R | B | $K^W$ | $\mathcal C$ | |
---|---|---|---|---|
2 | $\mapsto$ | $[1,1]$ | $1$ | |
3 | $\mapsto$ | $[2,2]$ | $1+0+0=1$ |
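The projection above can be reproduced with a short sketch. It assumes each tuple carries a vector with one annotation per possible world plus a certain label, and that the label is combined with the same $+$ as the annotations, which is one reading of the $1+0+0=1$ entry. It also assumes that the exact certain multiplicity is the minimum annotation across worlds, consistent with the $\mathcal C(R(4, 3)) = 0$ example. The function name `project_b` is hypothetical.

```python
# R from the slides: (A, B) -> (annotation per possible world, certain label)
R = {
    (1, 2): ((1, 1), 1),
    (1, 3): ((1, 1), 1),
    (4, 3): ((1, 0), 0),
    (9, 3): ((0, 1), 0),
}

def project_b(rel):
    """Project onto B, adding world vectors pointwise and adding certain labels."""
    out = {}
    for (_, b), (worlds, certain) in rel.items():
        prev_worlds, prev_certain = out.get(b, ((0,) * len(worlds), 0))
        merged = tuple(x + y for x, y in zip(prev_worlds, worlds))
        out[b] = (merged, prev_certain + certain)
    return out

result = project_b(R)

# Exact certain multiplicity under the stated assumption: minimum over worlds.
exact = {b: min(worlds) for b, (worlds, _) in result.items()}

for b, (worlds, label) in result.items():
    print(b, list(worlds), "label:", label, "exact:", exact[b])
```

For $B = 3$ the label is 1 while the minimum over the two worlds is 2, so under these assumptions the label is a cheap lower bound on the certain answer, not the exact value.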
... also attribute level uncertainty
+Mimir allows users to define special UDFs called Models.
+
+ CREATE MODEL TYPE Geocoder AS mimir.models.GeocodingModel;
+
+ CREATE MODEL INSTANCE Text_To_Loc USING Geocoder('Google');
+
+ SELECT C.name, C.id, Text_To_Loc(C.address) AS address
+ FROM Customer C;
+
+ (Not actual Mimir-SQL. Language adapted for your viewing pleasure.)
+Models...
... return one best guess
... define the space of alternatives
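The two responsibilities of a model listed above can be sketched as a small interface. This is an illustration only: the `Model` and `CandidateModel` classes, their method names, and the cell keys are hypothetical and do not reflect Mimir's actual implementation.

```python
import random
from abc import ABC, abstractmethod

class Model(ABC):
    """A guess-producing function over uncertain values (hypothetical interface)."""

    @abstractmethod
    def best_guess(self, key):
        """Return the single value a user would 'just use'."""

    @abstractmethod
    def alternatives(self, key):
        """Return every value the model considers possible."""

    def sample(self, key, rng):
        """Draw one alternative; uniform choice is an assumption of this sketch."""
        return rng.choice(self.alternatives(key))

class CandidateModel(Model):
    def __init__(self, candidates):
        # key -> list of candidate values, best guess first
        self.candidates = candidates

    def best_guess(self, key):
        return self.candidates[key][0]

    def alternatives(self, key):
        return list(self.candidates[key])

    def is_certain(self, key):
        return len(self.candidates[key]) == 1

# The "4 or 9" cell from the earlier example, next to a cell with no ambiguity.
model = CandidateModel({"row3.A": [4, 9], "row1.A": [1]})
print(model.best_guess("row3.A"), model.alternatives("row3.A"))
```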
Lenses instantiate/train a model and wrap a query
+Evaluation handled by a DBMS or Spark via query rewriting.
+
+ SELECT C.name, C.id, Text_To_Loc(C.address) AS address
+ FROM Customer C;
+
+ becomes...
+
+ SELECT C.name, C.id, Text_To_Loc(C.address) AS address,
+ 1 AS name_certain, 1 AS id_certain,
+ 0 AS address_certain, 1 AS row_certain
+ FROM Customer C;
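The rewritten query can be run on an ordinary database engine. The sketch below uses Python's built-in `sqlite3` module, with a stand-in `text_to_loc` function registered as a UDF; the stand-in, the sample row, and the table contents are assumptions for illustration, and no real geocoder is called.

```python
import sqlite3

def text_to_loc(address):
    # Hypothetical stand-in for a geocoding model: returns a best-guess string.
    return f"geocoded({address})"

conn = sqlite3.connect(":memory:")
conn.create_function("Text_To_Loc", 1, text_to_loc)
conn.execute("CREATE TABLE Customer (name TEXT, id INTEGER, address TEXT)")
conn.execute("INSERT INTO Customer VALUES ('Alice', 1, '123 Main St')")

# The rewritten query: ordinary columns plus one certainty flag per column
# and one flag for the row as a whole.
rows = conn.execute("""
    SELECT C.name, C.id, Text_To_Loc(C.address) AS address,
           1 AS name_certain, 1 AS id_certain,
           0 AS address_certain, 1 AS row_certain
    FROM Customer C
""").fetchall()

for row in rows:
    print(row)
```

Because the certainty flags are plain columns, the engine needs no knowledge of uncertainty to evaluate the rewritten query.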
+
+
+ A few more things we're doing with Mimir...
+
+ LOAD 'customers.csv';
+
+ SELECT name FROM customers WHERE last_purchase < LAST_WEEK();
+
+ This is all guesswork!
+Idea: Make the System Catalog a Probabilistic Table
+Sampling from ProbDBs is Sloooow
+Evaluate the query $N$ times.
Plug in samples instead of best guesses.
Merge evaluation to mitigate redundancy.
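The first two steps above can be sketched as a naive sampling loop. It assumes the uncertain cell from the earlier "4 or 9" example is drawn uniformly, and it deliberately omits the merged-evaluation optimization, so every sample re-runs the whole query. All names here are hypothetical.

```python
import random
from collections import Counter

R_CERTAIN = [(1, 2), (1, 3)]
UNCERTAIN_A = [4, 9]  # the "4 or 9" cell; uniform choice is an assumption

def sample_world(rng):
    """Plug in a sampled value instead of the best guess."""
    return R_CERTAIN + [(rng.choice(UNCERTAIN_A), 3)]

def query(rel):
    """SELECT A FROM R WHERE B = 3"""
    return sorted(a for a, b in rel if b == 3)

def estimate(n, seed=0):
    """Evaluate the query n times and report how often each answer appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n):
        for a in set(query(sample_world(rng))):
            counts[a] += 1
    return {a: c / n for a, c in counts.items()}

freq = estimate(1000)
print(freq)
```

The cost grows linearly with the number of samples, which is the redundancy that merged evaluation is meant to reduce.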
+
+
| + ➔ + |
+
|
+
| + ➔ + |
+
|
Idea: Let the compiler pick the right representation
(or combination)
Students |
@@ -364,21 +944,21 @@ Sampling (x10), 300, 242.5666234549135, 300, 119.61607021316885, 162.00108394436
- Aaron Aaron |
Lisa |
-
- Olivia Gourab |
---|
Alumni | +Alumni | |||||||
---|---|---|---|---|---|---|---|---|
@@ -395,7 +975,11 @@ Sampling (x10), 300, 242.5666234549135, 300, 119.61607021316885, 162.00108394436 |
- Shivang Shivang |
+
+
+ Olivia |
+ |
Boris Glavic (IIT) |
- + |
Su Feng (IIT) |
@@ -451,27 +1035,10 @@ Sampling (x10), 300, 242.5666234549135, 300, 119.61607021316885, 162.00108394436 |
Mimir is supported by NSF Award ACI-1640864, NPS Award N00244-16-1-0022, and gifts from Oracle
Thanks!
- -