SELECT on a raw CSV File
UNION two data sources
SELECT on JSON or a Doc Store
{ A: "Bob", B: "Alice" }
We have tools that can solve these problems!
... most of the time
Problem: It's hard to trust tools that can be wrong!
In the name of Codd
Thou shalt not give the user a wrong answer.
... but this assumes that we start with perfect data.
T. Imielinski & W. Lipski Jr. (VLDB 1981)
1. ProbDBs Produce Probability Distributions as Outputs
2. ProbDBs Require Probability Distributions as Inputs
The Uncertainty Management System
At each step, Mimir tracks ambiguity and potential errors.
Declarative uncertainty requires...
(Joint work with Boris Glavic, Su Feng, Aaron Huber)
T.J. Green, G. Karvounarakis & V. Tannen (PODS 2007)
The relational view
The functional view
$$R(1, 2) \mapsto 1$$ $$R(1, 3) \mapsto 1$$ $$R(4, 3) \mapsto 1$$ $$S(2, 5) \mapsto 1$$ $$S(3, 6) \mapsto 2$$
$$R(4, 5) \mapsto 0$$
$= S(3, 6) + S(3, 6)$
$= 2 + 2 = 4$
$= R(4, 3) \times S(3, 6)$
$= 1 \times 2 = 2$
$= \sum_{Y} R(Y, 3)$
$= R(1, 3) + R(4, 3) + \ldots$
$= 1 + 1 + 0 = 2$
Relational operator | | Semiring operation |
---|---|---|
$\cup$ | $\approx$ | $+$ |
$\bowtie$ | $\approx$ | $\times$ |
$\pi$ | $\approx$ | $+$ |
Semiring | Equivalent Query Semantics |
---|---|
$\left<\mathbb N, +, \times, 0, 1\right>$ | Bag Semantics |
$\left<\mathbb B, \vee, \wedge, \bot, \top\right>$ | Set Semantics |
$\left<\mathcal K^W, \vec \oplus, \vec \otimes, \mathbb{\vec 0}, \mathbb{\vec 1}\right>$ | Possible Worlds Semantics |
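The worked values above can be reproduced with a small sketch of semiring-annotated relations. This is an illustrative toy, not Mimir's implementation: the `Semiring` class and the `union`, `join`, and `project` helpers are hypothetical names, and relations are plain dictionaries from tuples to annotations.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring <K, plus, times, zero, one> (illustrative)."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)               # bag semantics
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)   # set semantics

# K-relations from the slides: tuples map to annotations, absent tuples map to zero.
R = {(1, 2): 1, (1, 3): 1, (4, 3): 1}
S = {(2, 5): 1, (3, 6): 2}


def union(k, r, s):
    """Union adds the annotations of matching tuples."""
    out = dict(r)
    for t, ann in s.items():
        out[t] = k.plus(out.get(t, k.zero), ann)
    return out


def join(k, r, s):
    """Join r(A, B) with s(B, C) on B, multiplying annotations."""
    out = {}
    for (a, b), x in r.items():
        for (b2, c), y in s.items():
            if b == b2:
                out[(a, b, c)] = k.times(x, y)
    return out


def project(k, r, cols):
    """Projection sums the annotations of tuples that collapse together."""
    out = {}
    for t, ann in r.items():
        key = tuple(t[i] for i in cols)
        out[key] = k.plus(out.get(key, k.zero), ann)
    return out


print(union(NAT, S, S)[(3, 6)])     # S(3, 6) + S(3, 6)
print(join(NAT, R, S)[(4, 3, 6)])   # R(4, 3) * S(3, 6)
print(project(NAT, R, [1])[(3,)])   # sum over Y of R(Y, 3)
```

Swapping `NAT` for `BOOL` (with annotations converted to booleans) evaluates the same queries under set semantics without changing the operators.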
R | A | B |
---|---|---|
1 | 2 | |
1 | 3 | |
3 |
R | A | B | |
---|---|---|---|
1 | 2 | $\mapsto [1,1]$ | |
1 | 3 | $\mapsto [1,1]$ | |
4 | 3 | $\mapsto [1,0]$ | |
9 | 3 | $\mapsto [0,1]$ |
$$\mathcal K^W \rightarrow \mathcal K$$ (plug in any $K$-Relation-compatible $\mathcal K$)
Correct/Possible mirrors "Correctness of SQL Queries on Databases with Nulls" [Guagliardo, Libkin 2017]
$$\texttt{PW}_0(R(1, 2)) = 1$$ $$\texttt{PW}_0(R(4, 3)) = 1$$ $$\texttt{PW}_1(R(4, 3)) = 0$$ $$\mathcal C(R(4, 3)) = 0$$ $$\mathcal P(R(4, 3)) = 1$$
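A minimal sketch of the possible-worlds encoding, for illustration only. It assumes two worlds over $\mathbb N$ and takes $\mathcal C$ to be the minimum and $\mathcal P$ the maximum of a tuple's annotation vector, which agrees with the values listed above; the function names are hypothetical.

```python
# R annotated with K^W vectors: one entry per possible world.
R_KW = {
    (1, 2): [1, 1],
    (1, 3): [1, 1],
    (4, 3): [1, 0],
    (9, 3): [0, 1],
}


def pw(relation, i):
    """Extract world i as an ordinary bag relation, dropping absent tuples."""
    return {t: v[i] for t, v in relation.items() if v[i] != 0}


def certain(v):
    """Assumed definition: multiplicity guaranteed in every world."""
    return min(v)


def possible(v):
    """Assumed definition: highest multiplicity in any world."""
    return max(v)


print(pw(R_KW, 0))                 # world 0 contains (4, 3) but not (9, 3)
print(pw(R_KW, 1))                 # world 1 contains (9, 3) but not (4, 3)
print(certain(R_KW[(4, 3)]))       # C(R(4, 3))
print(possible(R_KW[(4, 3)]))      # P(R(4, 3))
```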
A quick step back into reality...
R | A | B |
---|---|---|
1 | 2 | |
1 | 3 | |
4 or 9 | 3 |
Standard practice: "Just use the best option."
What's in between these extremes?
R | A | B | |
---|---|---|---|
1 | 2 | ||
1 | 3 | ||
4 | 3 | * |
Use the best option, but mark potential errors.
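As a rough illustration of "best guess plus marks", the sketch below stores each uncertain cell as a list of alternatives. The convention that the first alternative is the best guess, and the name `best_guess_with_marks`, are assumptions made for this example only.

```python
# Each row: (alternatives for A, value of B). The first alternative is
# treated as the best guess (an assumption for this sketch).
ROWS = [([1], 2), ([1], 3), ([4, 9], 3)]


def best_guess_with_marks(rows):
    """Return (A, B, certain) triples: the best guess plus a certainty flag."""
    out = []
    for alternatives, b in rows:
        is_certain = len(alternatives) == 1
        out.append((alternatives[0], b, is_certain))
    return out


for a, b, is_certain in best_guess_with_marks(ROWS):
    # Rows without a certainty guarantee get the '*' mark used in the table.
    print(a, b, "" if is_certain else "*")
```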
$PW_{i}(Q(\mathcal D))$ | The results Alice would have "just used". |
$\mathcal C(Q(\mathcal D))$ | Which of those results are trustworthy. |
(Computing $PW_{i}(Q(\mathcal D))$ is cheap!)
Can we do the same thing for $\mathcal C(Q(\mathcal D))$?
$$\mathcal C(Q(\mathcal D)) \stackrel{?}{=} Q(\mathcal C(\mathcal D))$$
R | A | B | $\mathcal K^W$ | $\mathcal C$ | |
---|---|---|---|---|---|
1 | 2 | $\mapsto$ | $[1,1]$ | 1 | |
1 | 3 | $\mapsto$ | $[1,1]$ | 1 | |
4 | 3 | $\mapsto$ | $[1,0]$ | 0 | |
9 | 3 | $\mapsto$ | $[0,1]$ | 0 |
Compute $\pi_B(R)$
$\pi_B(R)$ | B | $\mathcal K^W$ | $\mathcal C$ | |
---|---|---|---|---|
2 | $\mapsto$ | $[1,1]$ | $\mathcal C([1,1]) = 1$ | |
3 | $\mapsto$ | $[2,2]$ | $\mathcal C([1,1])+\mathcal C([1,0])+\mathcal C([0,1])$ |
$=1+0+0=1$ $\neq \mathcal C([2,2])$
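The mismatch can be checked mechanically. The sketch below evaluates $\pi_B(R)$ both ways, once on the world vectors followed by $\mathcal C$, and once on the $\mathcal C$ labels directly. It assumes $\mathcal C$ is the minimum over worlds, consistent with the table above; all helper names are hypothetical.

```python
R_KW = {
    (1, 2): [1, 1],
    (1, 3): [1, 1],
    (4, 3): [1, 0],
    (9, 3): [0, 1],
}


def certain(v):
    """Assumed definition of C: minimum multiplicity across worlds."""
    return min(v)


def vec_plus(u, v):
    """Pointwise addition of world vectors."""
    return [x + y for x, y in zip(u, v)]


def project_b(relation, plus, zero):
    """pi_B over a relation with schema (A, B), summing annotations."""
    out = {}
    for (_a, b), ann in relation.items():
        out[(b,)] = plus(out.get((b,), zero), ann)
    return out


# C(Q(D)): run the query on world vectors, then take the certain annotation.
q_kw = project_b(R_KW, vec_plus, [0, 0])
lhs = {t: certain(v) for t, v in q_kw.items()}

# Q(C(D)): label the input first, then run the query on the labels.
c_r = {t: certain(v) for t, v in R_KW.items()}
rhs = project_b(c_r, lambda x, y: x + y, 0)

print(lhs)  # B = 3 is certain with multiplicity 2
print(rhs)  # B = 3 only gets multiplicity 1
```

In this example the label-first result never exceeds the true certain multiplicity, so it errs on the side of marking too little as certain rather than too much.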
... also attribute level uncertainty
(but not today)
Mimir allows users to define special UDFs called Models.
CREATE MODEL TYPE Geocoder AS mimir.models.GeocodingModel;
CREATE MODEL INSTANCE Text_To_Loc USING Geocoder('Google');
SELECT C.name, C.id, Text_To_Loc(C.address) AS address
FROM Customer C;
(Not actual Mimir-SQL. Language adapted for your viewing pleasure.)
Models...
... return one best guess
... define the space of alternatives
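One way to picture these two responsibilities is the interface sketch below. It is not Mimir's actual model API: the class and method names (`Model`, `best_guess`, `alternatives`, `sample`, `ChoiceModel`) are hypothetical, and uniform sampling is an assumption.

```python
import random
from abc import ABC, abstractmethod


class Model(ABC):
    """Hypothetical interface: one best guess plus a space of alternatives."""

    @abstractmethod
    def best_guess(self, key):
        """Return the single value a user would 'just use'."""

    @abstractmethod
    def alternatives(self, key):
        """Return every value the model considers possible."""

    def sample(self, key, rng):
        """Draw one alternative (uniformly, as an assumption of this sketch)."""
        return rng.choice(self.alternatives(key))


class ChoiceModel(Model):
    """Toy model backed by explicit option lists; the first option is the best guess."""

    def __init__(self, options):
        self.options = options

    def best_guess(self, key):
        return self.options[key][0]

    def alternatives(self, key):
        return list(self.options[key])


model = ChoiceModel({"row3.A": [4, 9]})
print(model.best_guess("row3.A"))
print(model.alternatives("row3.A"))
print(model.sample("row3.A", random.Random(0)) in (4, 9))
```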
Lenses instantiate/train a model and wrap a query
Evaluation handled by a DBMS or Spark
via query rewriting using GProM.
SELECT C.name, C.id, Text_To_Loc(C.address) AS address
FROM Customer C;
becomes...
SELECT C.name, C.id, Text_To_Loc(C.address) AS address,
1 AS name_certain, 1 AS id_certain,
0 AS address_certain, 1 AS row_certain
FROM Customer C;
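The rewritten query is ordinary SQL, so it can be tried on any engine once the model is exposed as a UDF. The sketch below uses Python's built-in `sqlite3` module with a stand-in `text_to_loc` function; the sample row and the UDF body are placeholders invented for this example, not real geocoder output.

```python
import sqlite3


def text_to_loc(address):
    """Hypothetical stand-in for the geocoding model's best guess."""
    return f"loc({address})"


conn = sqlite3.connect(":memory:")
# Register the model as a one-argument SQL function.
conn.create_function("Text_To_Loc", 1, text_to_loc)
conn.execute("CREATE TABLE Customer (name TEXT, id INTEGER, address TEXT)")
conn.execute("INSERT INTO Customer VALUES ('Alice', 1, '123 Main St')")

rows = conn.execute(
    """
    SELECT C.name, C.id, Text_To_Loc(C.address) AS address,
           1 AS name_certain, 1 AS id_certain,
           0 AS address_certain, 1 AS row_certain
    FROM Customer C
    """
).fetchall()

for row in rows:
    print(row)
```

The extra columns travel with each row, so downstream operators can propagate them without any change to the execution engine.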
A few more things we're doing with Mimir...
LOAD 'customers.csv';
SELECT name FROM customers WHERE last_purchase < LAST_WEEK();
This is all guesswork!
Idea: Make the System Catalog a Probabilistic Table
Sampling from ProbDBs is Sloooow
Evaluate the query $N$ times.
Plug in samples instead of best guesses.
Merge evaluation to mitigate redundancy.
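The first two steps can be sketched as a plain Monte Carlo loop. This toy version assumes uniform sampling over each cell's alternatives and re-runs the query from scratch for every sample; it does not attempt the merged evaluation mentioned above. All names are hypothetical.

```python
import random
from collections import Counter

# Each row: (alternatives for A, value of B).
ROWS = [([1], 2), ([1], 3), ([4, 9], 3)]


def sample_world(rng):
    """Plug in one sampled value per uncertain cell."""
    return [(rng.choice(alternatives), b) for alternatives, b in ROWS]


def query(world):
    """SELECT DISTINCT A FROM R WHERE B = 3, over one sampled world."""
    return frozenset(a for a, b in world if b == 3)


def estimate(n, seed=0):
    """Evaluate the query n times; report how often each answer appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n):
        for a in query(sample_world(rng)):
            counts[a] += 1
    return {a: c / n for a, c in counts.items()}


print(estimate(1000))
```

Every sample repeats the work for the two deterministic rows, which is the redundancy that merging evaluations is meant to remove.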
Idea: Let the compiler pick the right representation
(or combination)
Students |
---|
Poonam |
Will |
Aaron |
Dev |
Mike |
Alumni |
---|
Ying |
Niccolò |
Arindam |
Shivang |
Olivia |
Gourab |
External Collaborators |
---|
Dieter Gawlick (Oracle) |
Zhen Hua Liu (Oracle) |
Ronny Fehling (Airbus) |
Beda Hammerschmidt (Oracle) |
Boris Glavic (IIT) |
Su Feng (IIT) |
Juliana Freire (NYU) |
Wolfgang Gatterbauer (NEU) |
Heiko Mueller (NYU) |
Remi Rampin (NYU) |
Mimir is supported by NSF Award ACI-1640864, NPS Award N00244-16-1-0022, and gifts from Oracle.