PhD Students: Ying Yang, Will Spoth, Aaron Huber, Poonam Kumari, Jon Logan
BS Students: Lisa Lu, Jacob P. Verghese
Alums: Arindam Nandi, Niccoló Meneghetti (HPE/Vertica), Vinayak Karuppasamy (Bloomberg)
Collabs: Ronny Fehling (Airbus), Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), Beda Hammerschmidt (Oracle),
Boris Glavic (IIT), Wolfgang Gatterbauer (CMU), Juliana Freire (NYU), Heiko Mueller (NYU), Moises Sudit (UB-ISE)
SELECT
on a raw CSV FileState of the art: External Table Defn + "Manually" edit CSV
UNION
two data sourcesState of the art: Manually map schema
SELECT
on JSON or a Doc Store{ A: "Bob", B: "Alice" }
)State of the art: DataGuide, Wrangler, etc...
PID, RATING, REVIEW_CT
P123, 4.5, 50.0
P2345, A3, 245.0
P124, 4.0, 100.0
P325, 6.4, 30
'A3' is not a valid rating
PID | RATING | REVIEW_CT |
---|---|---|
P123 | 4.5 | 50.0 |
P2345 | NULL | 245.0 |
P124 | 4.0 | 100.0 |
P325 | 6.4 | 30 |
$NULL > 4$ is Unknown
(Same for $<$, $\leq$, $\geq$, $\neq$, $=$)
True | False | Unknown | |
---|---|---|---|
True | True | False | Unknown |
False | False | False | False |
Unknown | Unknown | False | Unknown |
True | False | Unknown | |
---|---|---|---|
True | True | True | True |
False | True | False | Unknown |
Unknown | True | Unknown | Unknown |
True | False | Unknown |
---|---|---|
False | True | Unknown |
PID | RATING | REVIEW_CT |
---|---|---|
P123 | 4.5 | 50.0 |
P2345 | $NULL_1$ | 245.0 |
P124 | 4.0 | 100.0 |
P325 | 6.4 | 30 |
Label nulls with a Subscript
Expression | Truth Value |
---|---|
$NULL_i = 4$ | Unknown |
$NULL_i = NULL_j$ ($i \neq j$) | Unknown |
$NULL_i = NULL_j$ ($i = j$) | True |
Solves some erroneously discarded values: $$(Ratings_1 \bowtie Ratings_1) = Ratings_1$$ (but not all)
PID | RATING | REVIEW_CT | $\phi$ |
---|---|---|---|
P123 | 4.5 | 50.0 | $\top$ |
P2345 | $NULL_1$ | 245.0 | $\top$ |
P124 | 4.0 | 100.0 | $\top$ |
P325 | 6.4 | 30 | $\top$ |
Add a 'local condition' column
$$\sigma_{RATING > 4} RATINGS_1$$
PID | RATING | REVIEW_CT | $\phi$ |
---|---|---|---|
P123 | 4.5 | 50.0 | $\top$ |
P2345 | $NULL_1$ | 245.0 | $NULL_1 > 4$ |
P325 | 6.4 | 30 | $\top$ |
Expression | Evaluates To |
---|---|
$[[\pi_{a_i \leftarrow e_i}(R)]]_{CT}$ | $\{\left< a_i:[[e_i(t)]]_{lazy}, \phi: t.\phi\right>\;|\;t \in [[R]]_{CT}\}$ |
$[[\sigma_\psi(R)]]_{CT}$ | $\{\left< a_i: t.a_i, \phi: [[t.\phi \wedge \psi(t)]]_{lazy}\right>\;|\;t \in [[R]]_{CT} \wedge \left([[t.\phi \wedge \psi(t)]]_{lazy} \not \equiv \bot \right) \}$ |
$[[R\times S]]_{CT}$ | $\{\left< a_i: t_1.a_i, a_j: t_2.a_j, \phi: t_1.\phi \wedge t_2.\phi \right>\;|\;t_1 \in [[R]]_{CT} \wedge t_2 \in [[S]]_{CT}\}$ |
$[[R\uplus S]]_{CT}$ | $\{\left< a_i: t.a_i, \phi: t.\phi\right>\;|\;t \in ([[R]]_{CT} \uplus [[S]]_{CT})\}$ |
'Plug-in' a valuation to get a specific result.
For example, with $\{NULL_1 \rightarrow 5\}$
PID | RATING | REVIEW_CT |
---|---|---|
P123 | 4.5 | 50.0 |
P2345 | 5 | 245.0 |
P325 | 6.4 | 30 |
PID | A | B | $\phi$ |
---|---|---|---|
P123 | 4.5 | 2.25 | $\top$ |
P2345 | $NULL_1$ | $NULL_2$ | $\top$ |
P124 | 4.0 | 2.0 | $\top$ |
P325 | 6.4 | 3.2 | $\top$ |
(Must also record that $NULL_2 = \frac{NULL_1}{2}$)
PID | A | B | $\phi$ |
---|---|---|---|
P123 | 4.5 | 2.25 | $\top$ |
P2345 | $NULL_1$ | $\frac{NULL_1}{2}$ | $\top$ |
P124 | 4.0 | 2.0 | $\top$ |
P325 | 6.4 | 3.2 | $\top$ |
(Labeled Nulls + Lazy Arithmetic Expressions)
$VGTerm(\ldots)$ constructs new variables
(it's a skolem function)
Mimir SQL allows the $VGTerm()$ operator to inlined
SELECT A, VGTerm('X', B)+2 AS C FROM R;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | C |
---|---|
1 | $X_2+2$ |
3 | $X_4+2$ |
5 | $X_6+2$ |
SELECT PID,
CASE WHEN CAN_CAST(RATING AS FLOAT)
THEN CAST(RATING AS FLOAT)
ELSE VGTerm('RATING', ROWID)
END AS RATING,
REVIEW_CT
FROM RATINGS1;
CREATE VIEW FIXED_RATINGS AS
SELECT PID,
CASE WHEN CAN_CAST(RATING AS FLOAT)
THEN CAST(RATING AS FLOAT)
ELSE VGTerm('RATING', ROWID)
END AS RATING,
REVIEW_CT
FROM RATINGS1;
CREATE VIEW FIXED_RATINGS AS SELECT ...
Behind the scenes, we also create a model...
SELECT * FROM RATINGS1;
A distribution for $RATING_{ROWID}$
CREATE VIEW FIXED_RATINGS AS
SELECT PID,
CASE WHEN CAN_CAST(RATING AS FLOAT)
THEN CAST(RATING AS FLOAT)
ELSE VGTerm('RATING', ROWID)
END AS RATING,
REVIEW_CT
FROM RATINGS1;
SELECT PID FROM FIXED_RATINGS WHERE RATING > 4
The query+view+model completely describe the distribution of possible results...
Participants were shown a table of 3 products with 3 ratings (e.g., Amazon, Best Buy, Walmart) each
Part 1: The randomly generated ratings were biased to encourage a predictable, but mildly ambiguous ordering of the three products.
Part 2: We used the same randomization, but this time we marked several of the values as uncertain:
Red Text | value |
Red Background | value |
Asterisk | $value*$ |
Tolerance | $value \pm tolerance$ |
Range | $low – high$ |
UDFs allow guesses to be plugged in
SELECT PID,
CASE WHEN CAN_CAST(RATING AS FLOAT)
THEN CAST(RATING AS FLOAT)
ELSE MIMIR_VG_BESTGUESS('RATING', ROWID)
END AS RATING,
REVIEW_CT
FROM RATINGS1;
... but we lose track of which outputs are uncertain
SELECT PID,
CASE WHEN CAN_CAST(RATING AS FLOAT)
THEN CAST(RATING AS FLOAT)
ELSE MIMIR_VG_BESTGUESS('RATING', ROWID)
END AS RATING,
REVIEW_CT,
TRUE AS MIMIR_IS_DETERMINISTIC_PID,
CAN_CAST(RATING AS FLOAT) AS MIMIR_IS_DETERMINISTIC_RATING,
TRUE AS MIMIR_IS_DETERMINISTIC_REVIEW_CT,
TRUE AS MIMIR_ROW_IS_DETERMINISTIC
FROM RATINGS1;
PDBench: TPC-H Data, but add random FK violations.
Strategy | Q1 | Q2 | Q3 |
---|---|---|---|
Inline | 85.5s | 676.6s | 103.3s |
TupleBundle | 8.2s | 55.2s | 9.8s |
Partition | >1hr | 739.7s | >1hr |
Questions?
My phone is guessing, but is letting me know that it did
Easy interactions to accept, reject, or explain uncertainty
Easy access to: Provenance, Alternatives, and Confidence
Mimir virtualizes uncertainty
$Var(\ldots)$ constructs new variables
Variables can't be evaluated until they are bound.
So, we allow arbitrary expressions to represent data.
A lazy value without variables is deterministic
Mimir SQL allows the $Var()$ operator to inlined
SELECT A, VAR('X', B)+2 AS C FROM R;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | C |
---|---|
1 | $X_2+2$ |
3 | $X_4+2$ |
5 | $X_6+2$ |
Selects on $Var()$ need to be deferred too...
SELECT A FROM R WHERE VAR('X', B) > 2;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | $\phi$ |
---|---|
1 | $X_2>2$ |
3 | $X_4>2$ |
5 | $X_6>2$ |
When evaluating the table, rows where $\phi = \bot$ are dropped.
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
is (almost) the same as the query...
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
ID | Name | ... | Department |
---|---|---|---|
123 | Apple 6s, White | ... | Phone |
34234 | Dell, Intel 4 core | ... | Computer |
34235 | HP, AMD 2 core | ... | $Prod.Dept_3$ |
... | ... | ... | ... |
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
Behind the scenes, a lens also creates a model...
SELECT * FROM PRODUCTS_RAW;
An estimator for $PRODUCTS.DEPARTMENT_{ROWID}$
SELECT A, VAR('X', B)+2 AS C FROM R;
Mimir dispatches this query to the DB:
SELECT A, B FROM R;
And for each row of the result, evaluates:
SELECT A, VAR('X', B)+2 AS C FROM RESULT;
All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.
VAR('X', B)+2
.VAR('X', B)
.
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
Mimir dispatches this query to the DB:
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;
And for each row of the result, evaluates:
SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;
UDFs allow the DB to interpret labeled nulls
SELECT R.A, S.C FROM R, S
WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B);
... but we lose the ability to explain outputs
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
Row-level uncertainty is a boolean formula $\phi$.
For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$
Too many variables! Which is the most important?
Data Cleaning
The ones that keep us from knowing everything
$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$
$$A \wedge (B \vee C)$$
Consider a game between a database and an impartial oracle.
Naive Algorithm: Pick all variables!
Less Naive Algorithm: Minimize $E\left[\sum c_v\right]$.
$$\phi = A \wedge (B \vee C)$$
Entropy is intuitive:
$H = 1$ means we know nothing,
$H = 0$ means we know everything.
$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$
Information gain of $v$: The reduction in entropy from knowing the truth value of a variable $v$.
$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$
Expected information gain of $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.
Combine Information Gain and Cost
$$f(\mathcal I_{A}(\phi), c_A)$$
For example: $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$
Greedy Algorithm: Minimize $f(\mathcal I_{A}(\phi), c_A)$ at each step
Simulate an analyst trying to manually explore correlations.
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order