Ying Yang, Poonam Kumari, William Spoth, Aaron Huber,
Lisa Lu, Jacob Powathikunnil Verghese
Niccolo Meneghetti, Arindam Nandi (both now HPE/Vertica),
Vinayak Karuppasamy (now Bloomberg),
Ronny Fehling (Airbus),
Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle),
Boris Glavic (IIT), Juliana Freire (NYU)
SELECT
on a raw CSV FileState of the art: External Table Defn + "Manually" edit CSV
UNION
two data sourcesState of the art: Manually map schema
SELECT
on JSON or a Doc Store{ A: "Bob", B: "Alice" }
)State of the art: DataGuide, Wrangler, etc...
Alice spends weeks cleaning her data before using it.
Add structure, curation effort On-Demand
My phone is guessing, but is letting me know that it did
Easy interactions to accept, reject, or explain uncertainty
Good Explanations, Alternatives, and Feedback Vectors
Here's a problem with my data. Fix it.
Lenses introduce uncertainty
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
---|---|
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Simple UI: Highlight values that are based on guesses.
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
---|---|
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Allow users to EXPLAIN
uncertain outputs
Explanations include reasons given in English
$PRODUCTS.DEPARTMENT_{3}$
"I guessed 'Computer' for 'Department' on Row '3'"
Email: okennedy@buffalo.edu
Office: Davis 338H
Mimir virtualizes uncertainty
$Var(\ldots)$ constructs new variables
Variables can't be evaluated until they are bound.
So, we allow arbitrary expressions to represent data.
A lazy value without variables is deterministic
Mimir SQL allows the $Var()$ operator to inlined
SELECT A, VAR('X', B)+2 AS C FROM R;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | C |
---|---|
1 | $X_2+2$ |
3 | $X_4+2$ |
5 | $X_6+2$ |
Selects on $Var()$ need to be deferred too...
SELECT A FROM R WHERE VAR('X', B) > 2;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | $\phi$ |
---|---|
1 | $X_2>2$ |
3 | $X_4>2$ |
5 | $X_6>2$ |
When evaluating the table, rows where $\phi = \bot$ are dropped.
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
is (almost) the same as the query...
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
ID | Name | ... | Department |
---|---|---|---|
123 | Apple 6s, White | ... | Phone |
34234 | Dell, Intel 4 core | ... | Computer |
34235 | HP, AMD 2 core | ... | $Prod.Dept_3$ |
... | ... | ... | ... |
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
Behind the scenes, a lens also creates a model...
SELECT * FROM PRODUCTS_RAW;
An estimator for $PRODUCTS.DEPARTMENT_{ROWID}$
SELECT A, VAR('X', B)+2 AS C FROM R;
Mimir dispatches this query to the DB:
SELECT A, B FROM R;
And for each row of the result, evaluates:
SELECT A, VAR('X', B)+2 AS C FROM RESULT;
All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.
VAR('X', B)+2
.VAR('X', B)
.
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
Mimir dispatches this query to the DB:
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;
And for each row of the result, evaluates:
SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;
Helper views allow the DB to interpret labeled nulls
SELECT R.A, S.C FROM R, S
WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);
... but we lose the ability to explain outputs
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
Mimir dispatches this query to the DB:
SELECT R.A, S.C,
R.ROWID AS ID_1, S.ROWID AS ID_2
WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);
Then to explain, Mimir dispatches the query:
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2
WHERE R.ROWID = ID_1 AND S.ROWID = ID_2
TPC-H Data, but replace 0.1% of FK references with NULL. Ask Mimir to fix.
(a worst case from a performance standpoint)
Mimir over SQLite in 4 different execution modes.
100% = Zero overhead
Lenses implement PC-Tables
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
AS
clause defines source data.USING
clause requests repairs.
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
Recall, row-level uncertainty is a boolean formula $\phi$.
For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$
Too many variables! Which is the most important?
Data Cleaning
The ones that keep us from knowing everything
$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$
$$A \wedge (B \vee C)$$
Consider a game between a database and an impartial oracle.
Naive Algorithm: Pick all variables!
Less Naive Algorithm: Minimize $E\left[\sum c_v\right]$.
$$\phi = A \wedge (B \vee C)$$
Entropy is intuitive:
$H = 1$ means we know nothing,
$H = 0$ means we know everything.
$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$
Information gain of $v$: The reduction in entropy from knowing the truth value of a variable $v$.
$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$
Expected information gain of $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.
Combine Information Gain and Cost
$$f(\mathcal I_{A}(\phi), c_A)$$
For example: $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$
Greedy Algorithm: Minimize $f(\mathcal I_{A}(\phi), c_A)$ at each step
Simulate an analyst trying to manually explore correlations.
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order