Embracing Uncertainty

Embracing uncertainty with

Joint work with:

PhD Students: Ying Yang, Will Spoth, Aaron Huber, Poonam Kumari, Jon Logan
BS Students: Lisa Lu, Jacob P. Verghese
Alums: Arindam Nandi, Niccoló Meneghetti (HPE/Vertica), Vinayak Karuppasamy (Bloomberg)
Collabs: Ronny Fehling (Airbus), Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), Beda Hammerschmidt (Oracle), Boris Glavic (IIT), Wolfgang Gatterbauer (CMU), Juliana Freire (NYU), Heiko Mueller (NYU), Moises Sudit (UB-ISE)

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

Alice's store collects sales data

(OpenClipArt.org)
+ =

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.

(OpenClipArt.org)
+ ?

... asks her question ...

(OpenClipArt.org)
+ ? →

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a SELECT on a raw CSV File

  • File may not have column headers
  • CSV does not provide "types"
  • Lines may be missing fields
  • Fields may be mistyped (typo, missing comma)
  • Comment text can be inlined into the file

State of the art: External Table Defn + "Manually" edit CSV

Merge Two Datasets

UNION two data sources

  • Schema matching
  • Deduplication
  • Format alignment (GIS coordinates, $ vs €)
  • Precision alignment (State vs County)

State of the art: Manually map schema

JSON Shredding

Run a SELECT on JSON or a Doc Store

  • Separating fields and record sets:
    (e.g., { A: "Bob", B: "Alice" })
  • Missing fields (Records with no 'address')
  • Type alignment (Records with 'address' as an array)
  • Schema matching$^2$

State of the art: DataGuide, Wrangler, etc...

Data Cleaning is Hard!

Lots of potential data errors!

Running Example


                      PID,   RATING, REVIEW_CT
                      P123,  4.5,    50.0	
                      P2345, A3,     245.0
                      P124,  4.0,    100.0
                      P325,  6.4,    30

'A3' is not a valid rating

Enter the NULL Value

PIDRATINGREVIEW_CT
P1234.550.0
P2345NULL245.0
P1244.0100.0
P3256.430

3-Valued Logic

$NULL > 4$ is Unknown

(Same for $<$, $\leq$, $\geq$, $\neq$, $=$)

3-Valued Logic - AND

TrueFalseUnknown
TrueTrueFalseUnknown
FalseFalseFalseFalse
UnknownUnknownFalseUnknown

3-Valued Logic - OR

TrueFalseUnknown
TrueTrueTrueTrue
FalseTrueFalseUnknown
UnknownTrueUnknownUnknown

3-Valued Logic - NOT

TrueFalseUnknown
FalseTrueUnknown

Query Evaluation

  • NULL contaminates everything.
  • Return only deterministically true rows.

Limitations

  • Unintuitive semantics: $$Unknown \equiv \neg Unknown$$
  • Silently discarding unknown values: $$(Ratings_1 \bowtie Ratings_1) \subset Ratings_1$$
  • Null represents a complete lack of information

V-Tables

PIDRATINGREVIEW_CT
P1234.550.0
P2345$NULL_1$245.0
P1244.0100.0
P3256.430

Label nulls with a Subscript

Labeled Nulls

ExpressionTruth Value
$NULL_i = 4$Unknown
$NULL_i = NULL_j$ ($i \neq j$)Unknown
$NULL_i = NULL_j$ ($i = j$)True

Solves some erroneously discarded values: $$(Ratings_1 \bowtie Ratings_1) = Ratings_1$$ (but not all)

C-Tables

PIDRATINGREVIEW_CT$\phi$
P1234.550.0 $\top$
P2345$NULL_1$245.0$\top$
P1244.0100.0$\top$
P3256.430$\top$

Add a 'local condition' column

Selection

$$\sigma_{RATING > 4} RATINGS_1$$

PIDRATINGREVIEW_CT$\phi$
P1234.550.0 $\top$
P2345$NULL_1$245.0$NULL_1 > 4$
P3256.430$\top$

(Bag-) Evaluation Semantics

ExpressionEvaluates To
$[[\pi_{a_i \leftarrow e_i}(R)]]_{CT}$ $\{\left< a_i:[[e_i(t)]]_{lazy}, \phi: t.\phi\right>\;|\;t \in [[R]]_{CT}\}$
$[[\sigma_\psi(R)]]_{CT}$ $\{\left< a_i: t.a_i, \phi: [[t.\phi \wedge \psi(t)]]_{lazy}\right>\;|\;t \in [[R]]_{CT} \wedge \left([[t.\phi \wedge \psi(t)]]_{lazy} \not \equiv \bot \right) \}$
$[[R\times S]]_{CT}$ $\{\left< a_i: t_1.a_i, a_j: t_2.a_j, \phi: t_1.\phi \wedge t_2.\phi \right>\;|\;t_1 \in [[R]]_{CT} \wedge t_2 \in [[S]]_{CT}\}$
$[[R\uplus S]]_{CT}$ $\{\left< a_i: t.a_i, \phi: t.\phi\right>\;|\;t \in ([[R]]_{CT} \uplus [[S]]_{CT})\}$

C-Tables

'Plug-in' a valuation to get a specific result.

For example, with $\{NULL_1 \rightarrow 5\}$

PIDRATINGREVIEW_CT
P1234.550.0
P23455245.0
P3256.430

C-Tables

  • Pro: 'Hidden' correlations created by the query preserved
  • Pro: 'Model' for values decoupled from representation
  • Con: Expensive (e.g., Joins may degrade to cross-products)
  • Con: Databases don't support labeled nulls
  • Con: Generalized projection has side effects
$$\pi_{PID,\; A \leftarrow RATING,\; B \leftarrow \frac{RATING}{2}} (RATINGS_1)$$
PIDAB$\phi$
P1234.52.25$\top$
P2345$NULL_1$$NULL_2$$\top$
P1244.02.0$\top$
P3256.43.2$\top$

(Must also record that $NULL_2 = \frac{NULL_1}{2}$)

Generalized C-Tables

aka PIP-Tables

PIDAB$\phi$
P1234.52.25$\top$
P2345$NULL_1$$\frac{NULL_1}{2}$$\top$
P1244.02.0$\top$
P3256.43.2$\top$

(Labeled Nulls + Lazy Arithmetic Expressions)

C-Tables

  • Pro: 'Hidden' correlations created by the query preserved
  • Pro: 'Model' for values decoupled from representation
  • Con: Expensive (e.g., Joins may degrade to cross-products)
  • Con: Databases don't support labeled nulls
  • Con: Generalized projection has side effects

Back to the beginning

How did the uncertainty arise?

VGTerm

$VGTerm(\ldots)$ constructs new variables
(it's a skolem function)

  • $VGTerm('X')$ constructs a new variable $X$
  • $VGTerm('X', 1)$ constructs a new variable $X_{1}$
  • $VGTerm('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$

Mimir SQL allows the $VGTerm()$ operator to inlined


                  SELECT A, VGTerm('X', B)+2 AS C FROM R;
					
AB
12
34
56
AC
1$X_2+2$
3$X_4+2$
5$X_6+2$
 


           SELECT PID, 
                  CASE WHEN CAN_CAST(RATING AS FLOAT) 
                       THEN CAST(RATING AS FLOAT)
                       ELSE VGTerm('RATING', ROWID) 
                    END AS RATING,
                  REVIEW_CT
           FROM RATINGS1;


					

        CREATE VIEW FIXED_RATINGS AS
           SELECT PID, 
                  CASE WHEN CAN_CAST(RATING AS FLOAT) 
                       THEN CAST(RATING AS FLOAT)
                       ELSE VGTerm('RATING', ROWID) 
                    END AS RATING,
                  REVIEW_CT
           FROM RATINGS1;


					

        CREATE VIEW FIXED_RATINGS AS SELECT ...
					

Behind the scenes, we also create a model...


                      SELECT * FROM RATINGS1;
						

A distribution for $RATING_{ROWID}$


        CREATE VIEW FIXED_RATINGS AS
           SELECT PID, 
                  CASE WHEN CAN_CAST(RATING AS FLOAT) 
                       THEN CAST(RATING AS FLOAT)
                       ELSE VGTerm('RATING', ROWID) 
                    END AS RATING,
                  REVIEW_CT
           FROM RATINGS1;

        SELECT PID FROM FIXED_RATINGS WHERE RATING > 4
					

The query+view+model completely describe the distribution of possible results...

... but how much detail do we actually need?

Uncertainty Representation User Study

Participants were shown a table of 3 products with 3 ratings (e.g., Amazon, Best Buy, Walmart) each

Part 1: The randomly generated ratings were biased to encourage a predictable, but mildly ambiguous ordering of the three products.

Part 2: We used the same randomization, but this time we marked several of the values as uncertain:

Red Textvalue
Red Backgroundvalue
Asterisk$value*$
Tolerance$value \pm tolerance$
Range$low – high$

Probability of Agreement With Elicited Order

Time Taken

Takeaway...

  1. Small, '1-bit' representations can be sufficient
  2. Small, '1-bit' representations help users make faster decisions

1-Bit Representations

  1. Make a best guess (Maximize prior probability)
  2. Plug in the best guess
  3. Profit?

UDFs allow guesses to be plugged in


           SELECT PID, 
                  CASE WHEN CAN_CAST(RATING AS FLOAT) 
                       THEN CAST(RATING AS FLOAT)
                       ELSE MIMIR_VG_BESTGUESS('RATING', ROWID) 
                    END AS RATING,
                  REVIEW_CT
           FROM RATINGS1;
					

... but we lose track of which outputs are uncertain

Provenance Recovers Explanations


    SELECT PID, 
       CASE WHEN CAN_CAST(RATING AS FLOAT) 
            THEN CAST(RATING AS FLOAT)
            ELSE MIMIR_VG_BESTGUESS('RATING', ROWID) 
         END AS RATING,
       REVIEW_CT,
       TRUE                      AS MIMIR_IS_DETERMINISTIC_PID,
       CAN_CAST(RATING AS FLOAT) AS MIMIR_IS_DETERMINISTIC_RATING,
       TRUE                      AS MIMIR_IS_DETERMINISTIC_REVIEW_CT,
       TRUE                      AS MIMIR_ROW_IS_DETERMINISTIC
    FROM RATINGS1;
					

Performance

PDBench: TPC-H Data, but add random FK violations.

  • Query 1: ~TPC-H Q3; 3-way FK Join with Predicates
  • Query 2: ~TPC-H Q6; Table Scan with Predicates.
  • Query 3: ~TPC-H Q7; 5-way Star Join with Predicates.
Partition:
Separate query fragments compute 'certain' results and one or more classes of uncertain results.
TupleBundle:
Compute and summarize 10 sampled results in parallel
Inline:
UDFs dynamically inject best guess values into the query.
StrategyQ1Q2Q3
Inline85.5s676.6s103.3s
TupleBundle8.2s55.2s9.8s
Partition>1hr739.7s>1hr
  • On-Demand Data Curation makes data exploration easier.
  • "Best-Guess" results streamline analytics.
       ... if the DB communicates the resulting uncertainty.

Questions?

Backup Slides

Industry says...

            

My phone is guessing, but is letting me know that it did

Easy interactions to accept, reject, or explain uncertainty

Easy access to: Provenance, Alternatives, and Confidence

Communication

  • Why is my data uncertain?
  • How bad is it?
  • What can I do about it?

What if a database did the same?

  • A: Standard SQL.
  • B: Annotated Output.
  • C: Subway Diagram.
  • D: Result Explanations.
Demo

Mimir is a DB Overlay

(Any DB) (Lens) (Any DB) SELECT SELECT SELECT UNION UNION

Mimir virtualizes uncertainty (OpenClipArt.org)

How?

Labeled Nulls

$Var(\ldots)$ constructs new variables

  • $Var('X')$ constructs a new variable $X$
  • $Var('X', 1)$ constructs a new variable $X_{1}$
  • $Var('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$

Lazy Evaluation

Variables can't be evaluated until they are bound.
So, we allow arbitrary expressions to represent data.

  • $X$ is a legitimate data value.
  • $X+1$ is a legitimate data value.
  • $1+1$ is a legitimate data value, but can be reduced to $2$.

A lazy value without variables is deterministic

Mimir SQL allows the $Var()$ operator to inlined


                  SELECT A, VAR('X', B)+2 AS C FROM R;
					
AB
12
34
56
AC
1$X_2+2$
3$X_4+2$
5$X_6+2$
 

Selects on $Var()$ need to be deferred too...


                  SELECT A FROM R WHERE VAR('X', B) > 2;
					
AB
12
34
56
A$\phi$
1$X_2>2$
3$X_4>2$
5$X_6>2$
 

When evaluating the table, rows where $\phi = \bot$ are dropped.

C-Tables

  • Original Formulation [Imielinski, Lipski 1981]
  • PC-Tables [Green, Tannen 2006]
  • Systems
    • Orchestra [Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]
    • MayBMS [Huang, Antova, Koch, Olteanu 2009]
    • Pip [Kennedy, Koch 2009]
    • Sprout [Fink, Hogue, Olteanu, Rath 2011]
  • Generalized PC-Tables [Kennedy, Koch 2009]

Labeled nulls capture a lens' uncertainty


  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
					

is (almost) the same as the query...


  CREATE VIEW PRODUCTS 
     AS SELECT ID, NAME, ...,
          CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
               ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
          END AS DEPARTMENT
     FROM PRODUCTS_RAW;
						
IDName...Department
123Apple 6s, White...Phone
34234Dell, Intel 4 core...Computer
34235HP, AMD 2 core...$Prod.Dept_3$
............

  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
					

Behind the scenes, a lens also creates a model...


                      SELECT * FROM PRODUCTS_RAW;
						

An estimator for $PRODUCTS.DEPARTMENT_{ROWID}$

... but databases don't support labeled nulls

Labeled Nulls Percolate Up


                  SELECT A, VAR('X', B)+2 AS C FROM R;
					

Mimir dispatches this query to the DB:


                          SELECT A, B FROM R;
					

And for each row of the result, evaluates:


               SELECT A, VAR('X', B)+2 AS C FROM RESULT;
					

Generating Explanations

All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.

Why is the data uncertain?
All relevant lenses referenced in VAR('X', B)+2.
How uncertain?
Estimate by sampling from VAR('X', B).
How do I fix it?
Each lens fixes one well-defined type of error.

Lazy evaluation can cause problems


        SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
					

Mimir dispatches this query to the DB:


      SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;
					

And for each row of the result, evaluates:


      SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;
					

UDFs allow the DB to interpret labeled nulls


      SELECT R.A, S.C FROM R, S
      WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B);
					

... but we lose the ability to explain outputs

Selection (Filtering)


                  SELECT NAME FROM PRODUCTS
                  WHERE DEPARTMENT='PHONE' 
                     AND ( VENDOR='APPLE' 
                           OR PLATFORM='ANDROID' )
					

Row-level uncertainty is a boolean formula $\phi$.

For this query, $\phi$ can be as complex as: $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$

Too many variables! Which is the most important?

What is important?

Data Cleaning

Which variables are important?

The ones that keep us from knowing everything

$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$

$$A \wedge (B \vee C)$$

Naive Approach

Consider a game between a database and an impartial oracle.

  • The DB picks a variable $v$ in $\phi$ and pays a cost $c_v$.
  • The Oracle reveals the truth value of $v$.
  • The DB updates $\phi$ accordingly and repeats until $\phi$ is deterministic.

Naive Algorithm: Pick all variables!

Less Naive Algorithm: Minimize $E\left[\sum c_v\right]$.

Exponential Time Bad!

The Value of What We Don't Know

$$\phi = A \wedge (B \vee C)$$

  1. Generate Samples for $A$, $B$, $C$
  2. Estimate $p(\phi)$
  3. Compute $H[\phi] = -\log\left(p(\phi) \cdot (1-p(\phi))\right)$

Entropy is intuitive:
$H = 1$ means we know nothing,
$H = 0$ means we know everything.

Information Gain

$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$

Information gain of $v$: The reduction in entropy from knowing the truth value of a variable $v$.

Expected Information Gain

$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$

Expected information gain of $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.

The Cost of Perfect Information

Combine Information Gain and Cost

$$f(\mathcal I_{A}(\phi), c_A)$$

For example: $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$

Greedy Algorithm: Minimize $f(\mathcal I_{A}(\phi), c_A)$ at each step

Experimental Data

  • Start with a large dataset.
  • Delete random fields (~50%).

Experimental Queries

Simulate an analyst trying to manually explore correlations.

  • Train a tree-classifier on the base data.
  • Convert the decision tree to a query for all rows where the tree predicts a specific value.

Cost vs Entropy: Credit Data

EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order

Cost vs Entropy: Product Data

EG2: Greedy Cost/Value Ordering
NMETC: Naive Minimal Expected Total Cost
Random: Completely Random Order