Embracing uncertainty with

Joint work with:

PhD Students: Ying Yang, Will Spoth, Aaron Huber, Poonam Kumari, Jon Logan
BS Students: Lisa Lu, Jacob P. Verghese
Alums: Arindam Nandi, Niccoló Meneghetti (HPE/Vertica), Vinayak Karuppasamy (Bloomberg)
Collabs: Ronny Fehling (Airbus), Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), Beda Hammerschmidt (Oracle), Boris Glavic (IIT), Wolfgang Gatterbauer (CMU), Juliana Freire (NYU), Heiko Mueller (NYU), Moises Sudit (UB-ISE)

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

→

Alice's store collects sales data

(OpenClipArt.org)

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

→

So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.

(OpenClipArt.org)

+ ?

... asks her question ...

(OpenClipArt.org)

+ ? →

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

→

It's never this easy...

CSV Import

Run a `SELECT` on a raw CSV File

File may not have column headers
CSV does not provide "types"
Lines may be missing fields
Fields may be mistyped (typo, missing comma)
Comment text can be inlined into the file

State of the art: External Table Defn + "Manually" edit CSV

Merge Two Datasets

`UNION` two data sources

Schema matching
Deduplication
Format alignment (GIS coordinates, $ vs €)
Precision alignment (State vs County)

State of the art: Manually map schema

JSON Shredding

Run a `SELECT` on JSON or a Doc Store

Separating fields and record sets:
(e.g., { A: "Bob", B: "Alice" })
Missing fields (Records with no 'address')
Type alignment (Records with 'address' as an array)
Schema matching$^2$

State of the art: DataGuide, Wrangler, etc...

Data Cleaning is Hard!

Lots of potential data errors!

Running Example


                      PID,   RATING, REVIEW_CT
                      P123,  4.5,    50.0	
                      P2345, A3,     245.0
                      P124,  4.0,    100.0
                      P325,  6.4,    30

'A3' is not a valid rating

Enter the `NULL` Value

PID	RATING	REVIEW_CT
P123	4.5	50.0
P2345	NULL	245.0
P124	4.0	100.0
P325	6.4	30

3-Valued Logic

$NULL > 4$ is Unknown

(Same for $<$, $\leq$, $\geq$, $\neq$, $=$)

3-Valued Logic - AND

	True	False	Unknown
True	True	False	Unknown
False	False	False	False
Unknown	Unknown	False	Unknown

3-Valued Logic - OR

	True	False	Unknown
True	True	True	True
False	True	False	Unknown
Unknown	True	Unknown	Unknown

3-Valued Logic - NOT

True	False	Unknown
False	True	Unknown

Query Evaluation

NULL contaminates everything.
Return only deterministically true rows.

Limitations

Unintuitive semantics: $$Unknown \equiv \neg Unknown$$
Silently discarding unknown values: $$(Ratings_1 \bowtie Ratings_1) \subset Ratings_1$$
Null represents a complete lack of information

V-Tables

PID	RATING	REVIEW_CT
P123	4.5	50.0
P2345	$NULL_1$	245.0
P124	4.0	100.0
P325	6.4	30

Label nulls with a Subscript

Labeled Nulls

Expression	Truth Value
$NULL_i = 4$	Unknown
$NULL_i = NULL_j$ ($i \neq j$)	Unknown
$NULL_i = NULL_j$ ($i = j$)	True

Solves some erroneously discarded values: $$(Ratings_1 \bowtie Ratings_1) = Ratings_1$$ (but not all)

C-Tables

PID	RATING	REVIEW_CT	$\phi$
P123	4.5	50.0	$\top$
P2345	$NULL_1$	245.0	$\top$
P124	4.0	100.0	$\top$
P325	6.4	30	$\top$

Add a 'local condition' column

Selection

$$\sigma_{RATING > 4} RATINGS_1$$

PID	RATING	REVIEW_CT	$\phi$
P123	4.5	50.0	$\top$
P2345	$NULL_1$	245.0	$NULL_1 > 4$
P325	6.4	30	$\top$

(Bag-) Evaluation Semantics

Expression	Evaluates To
$[[\pi_{a_i \leftarrow e_i}(R)]]_{CT}$	$\{\left< a_i:[[e_i(t)]]_{lazy}, \phi: t.\phi\right>\;\|\;t \in [[R]]_{CT}\}$
$[[\sigma_\psi(R)]]_{CT}$	$\{\left< a_i: t.a_i, \phi: [[t.\phi \wedge \psi(t)]]_{lazy}\right>\;\|\;t \in [[R]]_{CT} \wedge \left([[t.\phi \wedge \psi(t)]]_{lazy} \not \equiv \bot \right) \}$
$[[R\times S]]_{CT}$	$\{\left< a_i: t_1.a_i, a_j: t_2.a_j, \phi: t_1.\phi \wedge t_2.\phi \right>\;\|\;t_1 \in [[R]]_{CT} \wedge t_2 \in [[S]]_{CT}\}$
$[[R\uplus S]]_{CT}$	$\{\left< a_i: t.a_i, \phi: t.\phi\right>\;\|\;t \in ([[R]]_{CT} \uplus [[S]]_{CT})\}$

C-Tables

'Plug-in' a valuation to get a specific result.

For example, with $\{NULL_1 \rightarrow 5\}$

PID	RATING	REVIEW_CT
P123	4.5	50.0
P2345	5	245.0
P325	6.4	30

C-Tables

Pro: 'Hidden' correlations created by the query preserved
Pro: 'Model' for values decoupled from representation
Con: Expensive (e.g., Joins may degrade to cross-products)
Con: Databases don't support labeled nulls
Con: Generalized projection has side effects

$$\pi_{PID,\; A \leftarrow RATING,\; B \leftarrow \frac{RATING}{2}} (RATINGS_1)$$

PID	A	B	$\phi$
P123	4.5	2.25	$\top$
P2345	$NULL_1$	$NULL_2$	$\top$
P124	4.0	2.0	$\top$
P325	6.4	3.2	$\top$

(Must also record that $NULL_2 = \frac{NULL_1}{2}$)

Generalized C-Tables

aka PIP-Tables

PID	A	B	$\phi$
P123	4.5	2.25	$\top$
P2345	$NULL_1$	$\frac{NULL_1}{2}$	$\top$
P124	4.0	2.0	$\top$
P325	6.4	3.2	$\top$

(Labeled Nulls + Lazy Arithmetic Expressions)

C-Tables

Pro: 'Hidden' correlations created by the query preserved
Pro: 'Model' for values decoupled from representation
Con: Expensive (e.g., Joins may degrade to cross-products)
Con: Databases don't support labeled nulls
Con: Generalized projection has side effects

Back to the beginning

How did the uncertainty arise?

VGTerm

$VGTerm(\ldots)$ constructs new variables
(it's a skolem function)

$VGTerm('X')$ constructs a new variable $X$
$VGTerm('X', 1)$ constructs a new variable $X_{1}$
$VGTerm('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$

Mimir SQL allows the $VGTerm()$ operator to inlined


                  SELECT A, VGTerm('X', B)+2 AS C FROM R;

A	B
1	2
3	4
5	6

A	C
1	$X_2+2$
3	$X_4+2$
5	$X_6+2$



           SELECT PID, 
                  CASE WHEN CAN_CAST(RATING AS FLOAT) 
                       THEN CAST(RATING AS FLOAT)
                       ELSE VGTerm('RATING', ROWID) 
                    END AS RATING,
                  REVIEW_CT
           FROM RATINGS1;


        CREATE VIEW FIXED_RATINGS AS
           SELECT PID, 
                  CASE WHEN CAN_CAST(RATING AS FLOAT) 
                       THEN CAST(RATING AS FLOAT)
                       ELSE VGTerm('RATING', ROWID) 
                    END AS RATING,
                  REVIEW_CT
           FROM RATINGS1;


        CREATE VIEW FIXED_RATINGS AS SELECT ...

Behind the scenes, we also create a model...


                      SELECT * FROM RATINGS1;

↓

A distribution for $RATING_{ROWID}$


        CREATE VIEW FIXED_RATINGS AS
           SELECT PID, 
                  CASE WHEN CAN_CAST(RATING AS FLOAT) 
                       THEN CAST(RATING AS FLOAT)
                       ELSE VGTerm('RATING', ROWID) 
                    END AS RATING,
                  REVIEW_CT
           FROM RATINGS1;

        SELECT PID FROM FIXED_RATINGS WHERE RATING > 4

The query+view+model completely describe the distribution of possible results...

... but how much detail do we actually need?

Uncertainty Representation User Study

Participants were shown a table of 3 products with 3 ratings (e.g., Amazon, Best Buy, Walmart) each

Part 1: The randomly generated ratings were biased to encourage a predictable, but mildly ambiguous ordering of the three products.

Part 2: We used the same randomization, but this time we marked several of the values as uncertain:

Red Text	value
Red Background	value
Asterisk	$value*$
Tolerance	$value \pm tolerance$
Range	$low – high$

Probability of Agreement With Elicited Order

Time Taken

Takeaway...

Small, '1-bit' representations can be sufficient
Small, '1-bit' representations help users make faster decisions

1-Bit Representations

Make a best guess (Maximize prior probability)
Plug in the best guess
Profit?

UDFs allow guesses to be plugged in


           SELECT PID, 
                  CASE WHEN CAN_CAST(RATING AS FLOAT) 
                       THEN CAST(RATING AS FLOAT)
                       ELSE MIMIR_VG_BESTGUESS('RATING', ROWID) 
                    END AS RATING,
                  REVIEW_CT
           FROM RATINGS1;

... but we lose track of which outputs are uncertain

Provenance Recovers Explanations


    SELECT PID, 
       CASE WHEN CAN_CAST(RATING AS FLOAT) 
            THEN CAST(RATING AS FLOAT)
            ELSE MIMIR_VG_BESTGUESS('RATING', ROWID) 
         END AS RATING,
       REVIEW_CT,
       TRUE                      AS MIMIR_IS_DETERMINISTIC_PID,
       CAN_CAST(RATING AS FLOAT) AS MIMIR_IS_DETERMINISTIC_RATING,
       TRUE                      AS MIMIR_IS_DETERMINISTIC_REVIEW_CT,
       TRUE                      AS MIMIR_ROW_IS_DETERMINISTIC
    FROM RATINGS1;

Performance

PDBench: TPC-H Data, but add random FK violations.

Query 1: ~TPC-H Q3; 3-way FK Join with Predicates
Query 2: ~TPC-H Q6; Table Scan with Predicates.
Query 3: ~TPC-H Q7; 5-way Star Join with Predicates.

Partition:: Separate query fragments compute 'certain' results and one or more classes of uncertain results.
TupleBundle:: Compute and summarize 10 sampled results in parallel
Inline:: UDFs dynamically inject best guess values into the query.

Strategy	Q1	Q2	Q3
Inline	85.5s	676.6s	103.3s
TupleBundle	8.2s	55.2s	9.8s
Partition	>1hr	739.7s	>1hr

On-Demand Data Curation makes data exploration easier.
"Best-Guess" results streamline analytics.
... if the DB communicates the resulting uncertainty.

Questions?

Backup Slides

Industry says...

My phone is guessing, but is letting me know that it did

Easy interactions to accept, reject, or explain uncertainty

Easy access to: Provenance, Alternatives, and Confidence

Communication

Why is my data uncertain?
How bad is it?
What can I do about it?

What if a database did the same?

A: Standard SQL.
B: Annotated Output.
C: Subway Diagram.
D: Result Explanations.

Demo

Mimir is a DB Overlay

Mimir virtualizes uncertainty (OpenClipArt.org)

How?

Labeled Nulls

$Var(\ldots)$ constructs new variables

$Var('X')$ constructs a new variable $X$
$Var('X', 1)$ constructs a new variable $X_{1}$
$Var('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$

Lazy Evaluation

Variables can't be evaluated until they are bound.
So, we allow arbitrary expressions to represent data.

$X$ is a legitimate data value.
$X+1$ is a legitimate data value.
$1+1$ is a legitimate data value, but can be reduced to $2$.

A lazy value without variables is deterministic

Mimir SQL allows the $Var()$ operator to inlined


                  SELECT A, VAR('X', B)+2 AS C FROM R;

A	B
1	2
3	4
5	6

A	C
1	$X_2+2$
3	$X_4+2$
5	$X_6+2$

Selects on $Var()$ need to be deferred too...


                  SELECT A FROM R WHERE VAR('X', B) > 2;

A	B
1	2
3	4
5	6

A	$\phi$
1	$X_2>2$
3	$X_4>2$
5	$X_6>2$

When evaluating the table, rows where $\phi = \bot$ are dropped.

C-Tables

Original Formulation [Imielinski, Lipski 1981]
PC-Tables [Green, Tannen 2006]
Systems
- Orchestra [Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]
- MayBMS [Huang, Antova, Koch, Olteanu 2009]
- Pip [Kennedy, Koch 2009]
- Sprout [Fink, Hogue, Olteanu, Rath 2011]
Generalized PC-Tables [Kennedy, Koch 2009]

Labeled nulls capture a lens' uncertainty


  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);

is (almost) the same as the query...


  CREATE VIEW PRODUCTS 
     AS SELECT ID, NAME, ...,
          CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
               ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
          END AS DEPARTMENT
     FROM PRODUCTS_RAW;

ID	Name	...	Department
123	Apple 6s, White	...	Phone
34234	Dell, Intel 4 core	...	Computer
34235	HP, AMD 2 core	...	$Prod.Dept_3$
...	...	...	...


  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);

Behind the scenes, a lens also creates a model...


                      SELECT * FROM PRODUCTS_RAW;

↓

An estimator for $PRODUCTS.DEPARTMENT_{ROWID}$

... but databases don't support labeled nulls

Labeled Nulls Percolate Up


                  SELECT A, VAR('X', B)+2 AS C FROM R;

Mimir dispatches this query to the DB:


                          SELECT A, B FROM R;

And for each row of the result, evaluates:


               SELECT A, VAR('X', B)+2 AS C FROM RESULT;

Generating Explanations

All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.

Why is the data uncertain?: All relevant lenses referenced in VAR('X', B)+2.
How uncertain?: Estimate by sampling from VAR('X', B).
How do I fix it?: Each lens fixes one well-defined type of error.

Lazy evaluation can cause problems


        SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;

Mimir dispatches this query to the DB:


      SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;

And for each row of the result, evaluates:


      SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;

UDFs allow the DB to interpret labeled nulls


      SELECT R.A, S.C FROM R, S
      WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B);

... but we lose the ability to explain outputs

Selection (Filtering)


                  SELECT NAME FROM PRODUCTS
                  WHERE DEPARTMENT='PHONE' 
                     AND ( VENDOR='APPLE' 
                           OR PLATFORM='ANDROID' )

Row-level uncertainty is a boolean formula $\phi$.

For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$

Too many variables! Which is the most important?

What is important?

Data Cleaning

Which variables are important?

The ones that keep us from knowing everything

$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$

⬍

$$A \wedge (B \vee C)$$

Naive Approach

Consider a game between a database and an impartial oracle.

The DB picks a variable $v$ in $\phi$ and pays a cost $c_v$.
The Oracle reveals the truth value of $v$.
The DB updates $\phi$ accordingly and repeats until $\phi$ is deterministic.

Naive Algorithm: Pick all variables!

Less Naive Algorithm: Minimize $E\left[\sum c_v\right]$.

Exponential Time Bad!

The Value of What We Don't Know

$$\phi = A \wedge (B \vee C)$$

Generate Samples for $A$, $B$, $C$
Estimate $p(\phi)$
Compute $H[\phi] = -\log\left(p(\phi) \cdot (1-p(\phi))\right)$

Entropy is intuitive:
$H = 1$ means we know nothing,
$H = 0$ means we know everything.

Information Gain

$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$

Information gain of $v$: The reduction in entropy from knowing the truth value of a variable $v$.

Expected Information Gain

$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$

Expected information gain of $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.

The Cost of Perfect Information

Combine Information Gain and Cost

$$f(\mathcal I_{A}(\phi), c_A)$$

For example: $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$

Greedy Algorithm: Minimize $f(\mathcal I_{A}(\phi), c_A)$ at each step

Experimental Data

Start with a large dataset.
Delete random fields (~50%).

Experimental Queries

Simulate an analyst trying to manually explore correlations.

Train a tree-classifier on the base data.
Convert the decision tree to a query for all rows where the tree predicts a specific value.

Cost vs Entropy: Credit Data

EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order

Cost vs Entropy: Product Data

EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order

Embracing uncertainty with

Joint work with:

A Big Data Fairy Tale

Meet Alice

Alice has a Store

Alice's store collects sales data

Alice wants to use her sales data to run a promotion

So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.

... asks her question ...

... and basks in the limitless possibilities of big data.

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a SELECT on a raw CSV File

Merge Two Datasets

UNION two data sources

JSON Shredding

Run a SELECT on JSON or a Doc Store

Data Cleaning is Hard!

Lots of potential data errors!

Running Example

Enter the NULL Value

3-Valued Logic

3-Valued Logic - AND

3-Valued Logic - OR

3-Valued Logic - NOT

Query Evaluation

Limitations

V-Tables

Labeled Nulls

C-Tables

Selection

(Bag-) Evaluation Semantics

C-Tables

C-Tables

Generalized C-Tables

aka PIP-Tables

C-Tables

Back to the beginning

How did the uncertainty arise?

VGTerm

... but how much detail do we actually need?

Uncertainty Representation User Study

Probability of Agreement With Elicited Order

Time Taken

Takeaway...

1-Bit Representations

Provenance Recovers Explanations

Performance

Backup Slides

Industry says...

Communication

What if a database did the same?

Mimir is a DB Overlay

How?

Labeled Nulls

Lazy Evaluation

C-Tables

Labeled nulls capture a lens' uncertainty

... but databases don't support labeled nulls

Labeled Nulls Percolate Up

Generating Explanations

Lazy evaluation can cause problems

Selection (Filtering)

What is important?

Which variables are important?

Naive Approach

Exponential Time Bad!

The Value of What We Don't Know

Information Gain

Expected Information Gain

The Cost of Perfect Information

Experimental Data

Experimental Queries

Cost vs Entropy: Credit Data

Cost vs Entropy: Product Data

Run a `SELECT` on a raw CSV File

`UNION` two data sources

Run a `SELECT` on JSON or a Doc Store

Enter the `NULL` Value