PhD Students: Ying Yang, Will Spoth, Aaron Huber, Poonam Kumari, Jon Logan
BS Students: Lisa Lu, Jacob P. Verghese
Alums: Arindam Nandi, Niccolò Meneghetti (HPE/Vertica), Vinayak Karuppasamy (Bloomberg)
Collabs: Ronny Fehling (Airbus), Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), Beda Hammerschmidt (Oracle),
Boris Glavic (IIT), Wolfgang Gatterbauer (CMU), Juliana Freire (NYU), Heiko Mueller (NYU), Moises Sudit (UB-ISE)
SELECT on a raw CSV file. State of the art: define an external table + "manually" edit the CSV.
UNION of two data sources. State of the art: manually map the schemas.
SELECT on JSON or a document store (e.g., { A: "Bob", B: "Alice" }). State of the art: DataGuide, Wrangler, etc.
Alice spends weeks cleaning her data before using it.
Make structure and curation effort on-demand.
My phone is guessing, but it lets me know that it did.
Easy interactions to accept, reject, or explain uncertainty
Easy access to: Provenance, Alternatives, and Confidence
Here's a problem with my data. Fix it.
Lenses introduce uncertainty
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
---|---|
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Simple UI: Highlight values that are based on guesses.
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
---|---|
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Allow users to EXPLAIN
uncertain outputs
Explanations include reasons given in English
{
  "grad": {"students": [
    {"name": "Alice", "deg": "PhD", "credits": "10"},
    {"name": "Bob", "deg": "MS"}, ...]},
  "undergrad": {"students": [
    {"name": "Carol"},
    {"name": "Dave", "deg": "U"}, ...]}
}
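The heterogeneous records above are the crux of the problem: no single fixed schema covers them. A minimal sketch of unioning the observed keys into one relational schema (my own illustration, not Mimir's actual JSON shredder):

```python
# Sketch: infer a relational schema from heterogeneous JSON-like
# records by taking the union of observed keys, in first-seen order.
def infer_schema(records):
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    return columns

grads = [{"name": "Alice", "deg": "PhD", "credits": "10"},
         {"name": "Bob", "deg": "MS"}]
undergrads = [{"name": "Carol"}, {"name": "Dave", "deg": "U"}]

print(infer_schema(grads + undergrads))  # ['name', 'deg', 'credits']
```

Records missing a column (like Carol) would then surface as NULLs, exactly the kind of gap a lens can later repair.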
Questions?
Mimir virtualizes uncertainty
$Var(\ldots)$ constructs new variables
Variables can't be evaluated until they are bound.
So, we allow arbitrary expressions to represent data.
A lazy value without variables is deterministic
Mimir SQL allows the $Var()$ operator to be inlined
SELECT A, VAR('X', B)+2 AS C FROM R;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | C |
---|---|
1 | $X_2+2$ |
3 | $X_4+2$ |
5 | $X_6+2$ |
Selects on $Var()$ need to be deferred too...
SELECT A FROM R WHERE VAR('X', B) > 2;
A | B |
---|---|
1 | 2 |
3 | 4 |
5 | 6 |
A | $\phi$ |
---|---|
1 | $X_2>2$ |
3 | $X_4>2$ |
5 | $X_6>2$ |
When evaluating the table, rows where $\phi = \bot$ are dropped.
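The two rewrites above can be sketched with a tiny expression tree: a `Var` stays symbolic until a binding supplies its value, and selection predicates become per-row formulas $\phi$. This is a toy sketch, not Mimir's internals; the binding dictionary stands in for a model.

```python
# Minimal sketch of lazy labeled-null expressions: Var('X', key)
# stays symbolic until a binding for X_key is supplied.
class Var:
    def __init__(self, name, key):
        self.name, self.key = name, key
    def eval(self, binding):
        return binding[(self.name, self.key)]
    def __repr__(self):
        return f"{self.name}_{self.key}"

def project(rows, binding):
    # SELECT A, VAR('X', B)+2 AS C FROM R
    return [(a, Var('X', b).eval(binding) + 2) for a, b in rows]

def select(rows, binding):
    # SELECT A FROM R WHERE VAR('X', B) > 2
    # Rows whose formula phi evaluates to false are dropped.
    return [a for a, b in rows if Var('X', b).eval(binding) > 2]

R = [(1, 2), (3, 4), (5, 6)]
binding = {('X', 2): 0, ('X', 4): 5, ('X', 6): 9}  # assumed model output
print(project(R, binding))  # [(1, 2), (3, 7), (5, 11)]
print(select(R, binding))   # [3, 5]
```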
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
is (almost) the same as the query...
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
ID | Name | ... | Department |
---|---|---|---|
123 | Apple 6s, White | ... | Phone |
34234 | Dell, Intel 4 core | ... | Computer |
34235 | HP, AMD 2 core | ... | $Prod.Dept_3$ |
... | ... | ... | ... |
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
Behind the scenes, a lens also creates a model...
SELECT * FROM PRODUCTS_RAW;
An estimator for $PRODUCTS.DEPARTMENT_{ROWID}$
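A runnable sketch of the lens rewrite, using SQLite from Python. The function name `var_dept` and the most-frequent-value estimator are stand-ins for whatever model the lens actually trains:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products_raw(id, name, department)")
conn.executemany("INSERT INTO products_raw VALUES (?,?,?)", [
    (123,   "Apple 6s, White",      "Phone"),
    (34234, "Dell, Intel 4 core",   "Computer"),
    (34236, "Lenovo, Intel 2 core", "Computer"),
    (34235, "HP, AMD 2 core",       None),  # missing value to repair
])

# The lens's model: here a naive most-frequent-value estimator
# trained on the non-null departments.
depts = [d for (d,) in conn.execute(
    "SELECT department FROM products_raw WHERE department IS NOT NULL")]
best_guess = Counter(depts).most_common(1)[0][0]
conn.create_function("var_dept", 1, lambda rowid: best_guess)

# The lens as a view: deterministic values pass through;
# NULLs become the model's labeled-null best guess.
conn.execute("""
    CREATE VIEW products AS
    SELECT id, name,
           CASE WHEN department IS NOT NULL THEN department
                ELSE var_dept(rowid) END AS department
    FROM products_raw""")

# The NULL for product 34235 is repaired with the model's guess.
print(conn.execute(
    "SELECT department FROM products WHERE id = 34235").fetchone())
```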
SELECT A, VAR('X', B)+2 AS C FROM R;
Mimir dispatches this query to the DB:
SELECT A, B FROM R;
And for each row of the result, evaluates:
SELECT A, VAR('X', B)+2 AS C FROM RESULT;
All uncertainty comes from labeled nulls (terms like VAR('X', B)+2) in the expressions that Mimir evaluates for each row of the output.
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
Mimir dispatches this query to the DB:
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;
And for each row of the result, evaluates:
SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;
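The split can be seen concretely: the database computes the deterministic part with the VAR arguments exposed as extra columns, and Mimir applies the VAR-dependent predicate per row. A toy sketch, with `best_guess` standing in for the model:

```python
# Sketch of Mimir's query split for
#   SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;

# Step 1 (in the DB): the deterministic cross product, with the
# VAR arguments carried along as TEMP_1 and TEMP_2.
R = [(1, 10), (2, 20)]          # rows of R as (A, B)
S = [(10, "x"), (30, "y")]      # rows of S as (B, C)
dispatched = [(a, c, rb, sb) for (a, rb) in R for (sb, c) in S]

# Step 2 (in Mimir): evaluate VAR('X', TEMP_1) = TEMP_2 per row,
# using the model's current best guess for each variable (assumed).
best_guess = {10: 10, 20: 30}   # X_10 -> 10, X_20 -> 30
result = [(a, c) for (a, c, t1, t2) in dispatched if best_guess[t1] == t2]
print(result)  # [(1, 'x'), (2, 'y')]
```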
UDFs allow the DB to interpret labeled nulls
SELECT R.A, S.C FROM R, S
WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B);
... but we lose the ability to explain outputs
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
Mimir dispatches this query to the DB:
SELECT R.A, S.C,
R.ROWID AS ID_1, S.ROWID AS ID_2
FROM R, S
WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B);
Then to explain, Mimir dispatches the query:
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2
FROM R, S
WHERE R.ROWID = ID_1 AND S.ROWID = ID_2;
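A sketch of this fast path in SQLite: a Python UDF stands in for MIMIR_VG_BESTGUESS, and carrying rowids through the query preserves enough provenance to re-fetch the VAR inputs for explanation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R(A, B)")
conn.execute("CREATE TABLE S(B, C)")
conn.executemany("INSERT INTO R VALUES (?,?)", [(1, 10), (2, 20)])
conn.executemany("INSERT INTO S VALUES (?,?)", [(10, "x"), (30, "y")])

# Hypothetical best-guess UDF, standing in for MIMIR_VG_BESTGUESS.
guesses = {10: 10, 20: 30}
conn.create_function("vg_bestguess", 1, lambda b: guesses[b])

# Fast path: evaluate best guesses inside the DB, but carry rowids
# so each output row can be traced back to its inputs.
rows = sorted(conn.execute("""
    SELECT R.A, S.C, R.rowid AS id_1, S.rowid AS id_2
    FROM R, S WHERE S.B = vg_bestguess(R.B)""").fetchall())
print(rows)

# Explain one output row: re-fetch the VAR inputs by rowid.
a, c, id_1, id_2 = rows[0]
temp = conn.execute("""
    SELECT R.B AS temp_1, S.B AS temp_2 FROM R, S
    WHERE R.rowid = ? AND S.rowid = ?""", (id_1, id_2)).fetchone()
print(temp)  # (10, 10): the inputs behind this output row
```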
PDBench: TPC-H Data, but add random FK violations.
Strategy | Q1 | Q2 | Q3 |
---|---|---|---|
Inline | 85.5s | 676.6s | 103.3s |
TupleBundle | 8.2s | 55.2s | 9.8s |
Partition | >1hr | 739.7s | >1hr |
Participants were shown a table of 3 products with 3 ratings (e.g., Amazon, Best Buy, Walmart) each
Part 1: The randomly generated ratings were biased to encourage a predictable, but mildly ambiguous ordering of the three products.
Part 2: We used the same randomization, but this time we marked several of the values as uncertain:
Representation | Example |
---|---|
Red Text | value |
Red Background | value |
Asterisk | $value*$ |
Tolerance | $value \pm tolerance$ |
Range | $low – high$ |
Part 3: We asked participants to verbalize their thought process and tagged specific exclamations in the transcripts.
CONTEXT-DOMAIN
: The participant relied on the 0-5 range of reviews to infer an uncertain rating.
CONTEXT-ROW
: The participant used other reviews for the same product to infer an uncertain rating.
UNCERTAINTY-IGNORED
: The participant explicitly disregarded an uncertain rating.
UNCERTAINTY-IRRELEVANT
: The participant didn't need the uncertain value.
[DIS]COMFORT-*
: The participant expressed a positive or negative emotional response.
*-DATA
: The emotional response pertained to the data itself.
*-UNCERTAINTY
: The emotional response pertained to the uncertain values or representation.
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
Row-level uncertainty is a boolean formula $\phi$.
For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$
Too many variables! Which is the most important?
Data Cleaning
The ones that keep us from knowing everything
$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$
$$A \wedge (B \vee C)$$
Consider a game between a database and an impartial oracle.
Naive Algorithm: Pick all variables!
Less Naive Algorithm: Minimize $E\left[\sum c_v\right]$.
$$\phi = A \wedge (B \vee C)$$
Entropy is intuitive:
$H = 1$ means we know nothing,
$H = 0$ means we know everything.
$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$
Information gain of $v$: The reduction in entropy from knowing the truth value of a variable $v$.
$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$
Expected information gain of $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.
Combine Information Gain and Cost
$$f(\mathcal I_{A}(\phi), c_A)$$
For example: $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$
Greedy Algorithm: Minimize $f(\mathcal I_{A}(\phi), c_A)$ at each step
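The greedy step is easy to make concrete for $\phi = A \wedge (B \vee C)$ by brute-force enumeration. The marginal probabilities and per-variable cleaning costs below are made up for illustration:

```python
from itertools import product
from math import log2

def phi(a, b, c):
    return a and (b or c)

def entropy(p):
    # Binary entropy of "phi is true", in bits.
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def p_true(probs, fixed):
    # P[phi] with some variables pinned to truth values in `fixed`,
    # assuming the remaining variables are independent.
    free = [v for v in "ABC" if v not in fixed]
    total = 0.0
    for vals in product([True, False], repeat=len(free)):
        assign = dict(fixed)
        assign.update(zip(free, vals))
        w = 1.0
        for v, t in zip(free, vals):
            w *= probs[v] if t else 1 - probs[v]
        if phi(assign["A"], assign["B"], assign["C"]):
            total += w
    return total

def expected_gain(probs, v):
    # Probability-weighted reduction in entropy from learning v.
    h = entropy(p_true(probs, {}))
    gain_t = h - entropy(p_true(probs, {v: True}))
    gain_f = h - entropy(p_true(probs, {v: False}))
    return probs[v] * gain_t + (1 - probs[v]) * gain_f

def eg2(probs, costs, v):
    # EG2 trades expected gain against the cost of cleaning v.
    return (2 ** expected_gain(probs, v) - 1) / costs[v]

probs = {"A": 0.5, "B": 0.5, "C": 0.5}  # assumed marginals
costs = {"A": 1.0, "B": 1.0, "C": 3.0}  # assumed cleaning costs
best = max("ABC", key=lambda v: eg2(probs, costs, v))
print(best)  # A: resolving the conjunct tells us the most per unit cost
```

With uniform marginals, $P[\phi] = 0.5 \cdot 0.75 = 0.375$, and learning $A$ dominates: if $A$ is false, $\phi$ is settled outright.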
Simulate an analyst trying to manually explore correlations.
EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order