SELECT on a raw CSV file
UNION two data sources
SELECT on JSON or a Doc Store
{ A: "Bob", B: "Alice" }
We have tools that can solve these problems!
... most of the time
Problem: It's hard to trust tools that can be wrong!
In the name of Codd
Thou shalt not give the user a wrong answer.
T. Imielinski & W. Lipski Jr. (VLDB 1981)
1. ProbDBs Produce Probability Distributions as Outputs
2. ProbDBs Require Probability Distributions as Inputs
The Uncertainty-Aware Database
Mimir is a vehicle for research on...
(Joint work with Boris Glavic, Su Feng, Aaron Huber)
T.J. Green, G. Karvounarakis & V. Tannen (PODS 2007)
The relational view
The functional view
$$R(1, 2) \mapsto 1$$ $$R(1, 3) \mapsto 1$$ $$R(4, 3) \mapsto 1$$
$$R(4, 5) \mapsto 0$$
$$S(3, 6) \mapsto 2$$
$= S(3, 6) + S(3, 6)$
$= 2 + 2 = 4$
$= R(4, 3) \times S(3, 6)$
$= 1 \times 2 = 2$
$= \sum_{Y} R(Y, 3)$
$= R(1, 3) + R(4, 3) + \ldots$
$= 1 + 1 + 0 = 2$
$\cup$ | $\approx$ | $+$ |
$\bowtie$ | $\approx$ | $\times$ |
$\pi$ | $\approx$ | $+$ |
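The arithmetic above can be replayed in a few lines. This is a minimal sketch, assuming each relation is stored as a dictionary from tuple to its multiplicity in $\mathbb N$; the helper names `union`, `join`, and `project` are illustrative and not part of Mimir.

```python
from collections import defaultdict

# Bag-semantics annotations from the slides: tuple -> multiplicity in N.
R = {(1, 2): 1, (1, 3): 1, (4, 3): 1, (4, 5): 0}
S = {(3, 6): 2}

def union(r, s):
    """Union adds the annotations of matching tuples."""
    out = defaultdict(int)
    for rel in (r, s):
        for t, k in rel.items():
            out[t] += k
    return dict(out)

def join(r, s):
    """Join r's second column with s's first column; annotations multiply."""
    out = {}
    for (a, b), k1 in r.items():
        for (b2, c), k2 in s.items():
            if b == b2:
                out[(a, b, c)] = k1 * k2
    return out

def project(r, cols):
    """Projection sums the annotations of all tuples that collapse together."""
    out = defaultdict(int)
    for t, k in r.items():
        out[tuple(t[i] for i in cols)] += k
    return dict(out)

union_result = union(S, S)
join_result = join(R, S)
proj_result = project(R, [1])

print(union_result[(3, 6)])    # 2 + 2 = 4
print(join_result[(4, 3, 6)])  # 1 * 2 = 2
print(proj_result[(3,)])       # 1 + 1 = 2
```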
Semiring | Equivalent Query Semantics |
---|---|
$\left<\mathbb N, +, \times, 0, 1\right>$ | Bag Semantics |
$\left<\mathbb B, \vee, \wedge, \bot, \top\right>$ | Set Semantics |
$\left<\mathcal K^W, \vec \oplus, \vec \otimes, \mathbb{\vec 0}, \mathbb{\vec 1}\right>$ | Possible Worlds Semantics |
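One way to read the table: the query operator stays the same and only the semiring changes. The sketch below assumes a hypothetical `Semiring` record and a hypothetical `worlds` helper that lifts a semiring pointwise to vectors with one entry per possible world; none of these names come from Mimir.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)           # bag semantics
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # set semantics

def worlds(K, n):
    """Lift K pointwise to vectors of length n (one entry per possible world)."""
    return Semiring(
        plus=lambda a, b: tuple(K.plus(x, y) for x, y in zip(a, b)),
        times=lambda a, b: tuple(K.times(x, y) for x, y in zip(a, b)),
        zero=(K.zero,) * n,
        one=(K.one,) * n,
    )

def project(rel, cols, K):
    """Projection: combine annotations of collapsing tuples with K.plus."""
    out = {}
    for t, k in rel.items():
        key = tuple(t[i] for i in cols)
        out[key] = K.plus(out.get(key, K.zero), k)
    return out

R_bag = {(1, 2): 1, (1, 3): 1, (4, 3): 1}
R_set = {t: k > 0 for t, k in R_bag.items()}
R_pw = {(1, 2): (1, 1), (1, 3): (1, 1), (4, 3): (1, 0), (9, 3): (0, 1)}

bag_result = project(R_bag, [1], NAT)
set_result = project(R_set, [1], BOOL)
pw_result = project(R_pw, [1], worlds(NAT, 2))

print(bag_result)  # {(2,): 1, (3,): 2}
print(set_result)  # {(2,): True, (3,): True}
print(pw_result)   # {(2,): (1, 1), (3,): (2, 2)}
```

`R_pw` is the two-world relation used in the slides that follow.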
R | A | B |
---|---|---|
1 | 2 | |
1 | 3 | |
3 |
R | A | B | |
---|---|---|---|
1 | 2 | $\mapsto [1,1]$ | |
1 | 3 | $\mapsto [1,1]$ | |
4 | 3 | $\mapsto [1,0]$ | |
9 | 3 | $\mapsto [0,1]$ |
$$\texttt{PW}_0(R(1, 2)) = 1$$
$$\texttt{PW}_0(R(4, 3)) = 1$$
$$\texttt{PW}_1(R(4, 3)) = 0$$
$$\mathcal C(R(4, 3)) = 0$$
$$\mathcal P(R(4, 3)) = 1$$
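The three accessors can be sketched directly on the vector annotations. This assumes that $\mathcal C$ is the minimum and $\mathcal P$ the maximum over worlds, which reproduces the values shown for $R(4, 3)$ but is an assumption rather than a definition taken from the slides.

```python
# Two-world annotations from the table: tuple -> (count in world 0, count in world 1).
R = {(1, 2): (1, 1), (1, 3): (1, 1), (4, 3): (1, 0), (9, 3): (0, 1)}

def pw(rel, i):
    """World i: keep only the i-th annotation of every tuple."""
    return {t: ks[i] for t, ks in rel.items()}

def certain(rel):
    """Assumed definition: a tuple is certain to the degree it appears in every world."""
    return {t: min(ks) for t, ks in rel.items()}

def possible(rel):
    """Assumed definition: a tuple is possible to the degree it appears in some world."""
    return {t: max(ks) for t, ks in rel.items()}

print(pw(R, 0)[(1, 2)])      # 1
print(pw(R, 0)[(4, 3)])      # 1
print(pw(R, 1)[(4, 3)])      # 0
print(certain(R)[(4, 3)])    # 0
print(possible(R)[(4, 3)])   # 1
```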
A quick step back into reality...
R | A | B |
---|---|---|
1 | 2 | |
1 | 3 | |
4 or 9 | 3 |
Standard practice: "Just use the best option."
What's in between these extremes?
R | A | B | |
---|---|---|---|
1 | 2 | ||
1 | 3 | ||
4 | 3 | * |
Use the best option, but mark potential errors.
$PW_{i}(Q(\mathcal D))$ | The results Alice would have "just used". |
$\mathcal C(Q(\mathcal D))$ | Which of those results are trustworthy. |
(Computing $PW_{i}(Q(\mathcal D))$ is cheap!)
Can we do the same thing for $\mathcal C(Q(\mathcal D))$?
$$\mathcal C(Q(\mathcal D)) \stackrel{?}{=} Q(\mathcal C(\mathcal D))$$
R | A | B | $K^W$ | $\mathcal C$ | |
---|---|---|---|---|---|
1 | 2 | $\mapsto$ | $[1,1]$ | 1 | |
1 | 3 | $\mapsto$ | $[1,1]$ | 1 | |
4 | 3 | $\mapsto$ | $[1,0]$ | 0 | |
9 | 3 | $\mapsto$ | $[0,1]$ | 0 |
Compute $\pi_B(R)$
$\pi_B(R)$ | B | $K^W$ | $\mathcal C$ | |
---|---|---|---|---|
2 | $\mapsto$ | $[1,1]$ | $1$ | |
3 | $\mapsto$ | $[2,2]$ | $1+0+0=1$ |
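The two columns of this table can be compared mechanically. The sketch below again assumes that $\mathcal C$ is the minimum over worlds; `project_b` and `vec_plus` are illustrative helpers. It evaluates $\pi_B$ once on the $\mathcal C$ labels and once on the full $K^W$ vectors.

```python
# Two-world annotations for R(A, B), as in the table above.
R = {(1, 2): (1, 1), (1, 3): (1, 1), (4, 3): (1, 0), (9, 3): (0, 1)}

def certain(ks):
    """Assumed definition of C: minimum annotation across worlds."""
    return min(ks)

def project_b(rel, plus, zero):
    """pi_B: group by B and combine annotations with the given plus."""
    out = {}
    for (a, b), k in rel.items():
        out[b] = plus(out.get(b, zero), k)
    return out

def vec_plus(x, y):
    """Pointwise addition of per-world annotation vectors."""
    return tuple(p + q for p, q in zip(x, y))

# Label first, then query: run pi_B on the C column only.
label_then_query = project_b(
    {t: certain(ks) for t, ks in R.items()}, lambda x, y: x + y, 0
)

# Query first, then label: run pi_B on the K^W vectors, then take C.
query_then_label = {
    b: certain(ks) for b, ks in project_b(R, vec_plus, (0, 0)).items()
}

print(label_then_query)  # {2: 1, 3: 1}
print(query_then_label)  # {2: 1, 3: 2}
```

Under that assumption, the label computed by the query ($1$ for $B = 3$) does not exceed the label of the full result ($2$) in this example, so here it errs on the side of marking too little as certain.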
... also attribute level uncertainty
(but not today)
Mimir allows users to define special UDFs called Models.
CREATE MODEL TYPE Geocoder AS mimir.models.GeocodingModel;
CREATE MODEL INSTANCE Text_To_Loc USING Geocoder('Google');
SELECT C.name, C.id, Text_To_Loc(C.address) AS address
FROM Customer C;
(Not actual Mimir-SQL. Language adapted for your viewing pleasure.)
Models...
... return one best guess
... define the space of alternatives
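As a rough picture of that contract, here is a hypothetical Python interface. It is not Mimir's actual model API; the class names, method names, and the toy candidate data are all made up for illustration.

```python
import random
from abc import ABC, abstractmethod

class Model(ABC):
    """Hypothetical model contract: one best guess plus a space of alternatives."""

    @abstractmethod
    def alternatives(self, value):
        """All values the model considers possible for this input, best first."""

    def best_guess(self, value):
        return self.alternatives(value)[0]

    def sample(self, value, rng):
        """Draw one alternative (uniformly here, purely for illustration)."""
        return rng.choice(self.alternatives(value))

    def is_certain(self, value):
        return len(self.alternatives(value)) == 1

class ToyGeocoder(Model):
    """Stand-in geocoder backed by a lookup table of placeholder locations."""

    def __init__(self, candidates):
        self.candidates = candidates

    def alternatives(self, address):
        return self.candidates.get(address, [None])

geocoder = ToyGeocoder({
    "ambiguous address": ["loc_a", "loc_b"],
    "clear address": ["loc_c"],
})

print(geocoder.best_guess("ambiguous address"))  # loc_a
print(geocoder.is_certain("ambiguous address"))  # False
print(geocoder.is_certain("clear address"))      # True
```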
Lenses instantiate/train a model and wrap a query
Evaluation handled by a DBMS or Spark
via query rewriting using GProM.
SELECT C.name, C.id, Text_To_Loc(C.address) AS address
FROM Customer C;
becomes...
SELECT C.name, C.id, Text_To_Loc(C.address) AS address,
1 AS name_certain, 1 AS id_certain,
0 AS address_certain, 1 AS row_certain
FROM Customer C;
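The rewritten query is ordinary SQL, so any engine with UDF support can run it. The sketch below uses Python's built-in `sqlite3` purely as a stand-in backend; the `text_to_loc` function, the sample row, and the choice of SQLite are assumptions for illustration, not how Mimir or GProM execute queries.

```python
import sqlite3

def text_to_loc(address):
    # Hypothetical stand-in for the geocoding model's best guess.
    return f"loc({address})"

conn = sqlite3.connect(":memory:")
# Register the best-guess function as a one-argument SQL UDF.
conn.create_function("Text_To_Loc", 1, text_to_loc)
conn.execute("CREATE TABLE Customer (name TEXT, id INTEGER, address TEXT)")
conn.execute("INSERT INTO Customer VALUES ('Alice', 1, '123 Main St')")

# The rewritten query from the slide: data columns plus certainty markers.
rewritten = """
SELECT C.name, C.id, Text_To_Loc(C.address) AS address,
       1 AS name_certain, 1 AS id_certain,
       0 AS address_certain, 1 AS row_certain
FROM Customer C
"""
rows = conn.execute(rewritten).fetchall()
print(rows)  # [('Alice', 1, 'loc(123 Main St)', 1, 1, 0, 1)]
```

The geocoded column comes back as a best guess, and `address_certain = 0` tells the client to mark it as potentially wrong.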
A few more things we're doing with Mimir...
LOAD 'customers.csv';
SELECT name FROM customers WHERE last_purchase < LAST_WEEK();
This is all guesswork!
Idea: Make the System Catalog a Probabilistic Table
Sampling from ProbDBs is Sloooow
Evaluate the query $N$ times.
Plug in samples instead of best guesses.
Merge evaluation to mitigate redundancy.
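A toy version of these three steps, under stated assumptions: the uncertain relation, the counting query, and the reading of "merged evaluation" as a single pass that carries all $N$ samples per row are illustrative guesses, not Mimir's implementation.

```python
import random

# Hypothetical uncertain relation: each row lists alternatives for A; B is certain.
R = [([1], 2), ([1], 3), ([4, 9], 3)]

def query(world):
    """Q: SELECT COUNT(*) FROM R WHERE A < 5, on one concrete world."""
    return sum(1 for a, b in world if a < 5)

def sample_world(rng):
    """Plug in one sampled value per row instead of the best guess."""
    return [(rng.choice(alts), b) for alts, b in R]

def naive(n, seed=0):
    """Evaluate the query n times, once per sampled world."""
    rng = random.Random(seed)
    return [query(sample_world(rng)) for _ in range(n)]

def merged(n, seed=0):
    """One pass over R; certain rows are evaluated once and shared by all samples."""
    rng = random.Random(seed)
    counts = [0] * n
    for alts, b in R:
        if len(alts) == 1:
            if alts[0] < 5:
                counts = [c + 1 for c in counts]
        else:
            for i in range(n):
                if rng.choice(alts) < 5:
                    counts[i] += 1
    return counts

naive_counts = naive(5)
merged_counts = merged(5)
print(naive_counts)
print(merged_counts)
```

Both functions return one count per sample, each either 2 or 3 depending on whether the uncertain row was sampled as 4 or 9. The merged version touches the two certain rows once instead of $N$ times.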
Idea: Let the compiler pick the right representation
(or combination)
Students |
---|
Poonam |
Will |
Aaron |
Lisa |
Gourab |
Alumni |
---|
Ying |
Niccolò |
Arindam |
Shivang |
Olivia |
Dev |
---|
Mike |
External Collaborators |
---|
Dieter Gawlick (Oracle) |
Zhen Hua Liu (Oracle) |
Ronny Fehling (Airbus) |
Beda Hammerschmidt (Oracle) |
Boris Glavic (IIT) |
Su Feng (IIT) |
Juliana Freire (NYU) |
Wolfgang Gatterbauer (NEU) |
Heiko Mueller (NYU) |
Remi Rampin (NYU) |
Mimir is supported by NSF Award ACI-1640864, NPS Award N00244-16-1-0022, and gifts from Oracle.