Don't Wrangle, Guess Instead

with Mimir

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store


Alice's store collects sales data


Alice wants to use her sales data to run a promotion


So Alice loads up her sales data in her trusty database/hadoop/spark/etc... server.


... asks her question ...


... and basks in the limitless possibilities of big data.


Why is this a fairy tale?

→

It's never this easy...

CSV Import

Run a `SELECT` on a raw CSV File

File may not have column headers
CSV does not provide "types"
Lines may be missing fields
Fields may be mistyped (typo, missing comma)
Comment text can be inlined into the file

Merge Data From Two Sources

`UNION` two data sources

Schema matching
Deduplication
Format alignment (GIS coordinates, $ vs €)
Precision alignment (State vs County)

JSON Shredding

Run a `SELECT` on JSON or a Doc Store

Separating fields and record sets:
(e.g., { A: "Bob", B: "Alice" })
Missing fields (Records with no 'address')
Type alignment (Records with 'address' as an array)
Schema matching$^2$
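
A minimal sketch of what shredding has to decide, assuming a tiny made-up record set: it flattens JSON records into rows, fills absent fields with `None`, and collapses an array-valued field to its first element as a best guess. The sample data, the `shred` helper, and the first-element policy are all assumptions for illustration, not how any particular tool behaves.

```python
import json

# Hypothetical sample: one record lacks 'address', one stores it as an array.
RAW = '''[
  {"name": "Bob",   "address": "1 Main St"},
  {"name": "Alice"},
  {"name": "Carol", "address": ["2 Oak Ave", "9 Elm Rd"]}
]'''

def shred(records):
    """Flatten records into rows over the union of all observed fields.

    Returns (columns, rows, guesses); guesses lists the (row index, column)
    cells whose value had to be guessed rather than read directly.
    """
    columns = sorted({key for rec in records for key in rec})
    rows, guesses = [], []
    for i, rec in enumerate(records):
        row = []
        for col in columns:
            value = rec.get(col)            # missing field -> None
            if col not in rec:
                guesses.append((i, col))
            elif isinstance(value, list):   # type misalignment: take first element
                value = value[0] if value else None
                guesses.append((i, col))
            row.append(value)
        rows.append(tuple(row))
    return columns, rows, guesses

columns, rows, guesses = shred(json.loads(RAW))
```

The `guesses` list is the interesting part: it records exactly which cells are the tool's opinion rather than the data's.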

We have tools that can solve these problems!

... most of the time

(google.com)

Problem: It's hard to trust tools that can be wrong!

Your Pass Phrase for today is

Mary Wheeler

Options

Ignore the Problem

In the name of Codd
Thou shalt not give the user a wrong answer.

... but this assumes that we start with perfect data.

(Fox News)

Options

Ignore the Problem
Heresy

On representing incomplete information in a relational data base

T. Imielinski & W. Lipski Jr. (VLDB 1981)

But...

1. ProbDBs Produce Probability Distributions as Outputs

2. ProbDBs Require Probability Distributions as Inputs

Probabilistic Databases...

... require probabilities as inputs
... produce probabilities as outputs
... are slow

Options

Ignore the Problem
Heresy
?

Declarative Uncertainty

The Uncertainty Management System

http://mimirdb.info

At each step, Mimir tracks ambiguity and potential errors.

A row that may or may not exist.
An attribute value that is missing or ambiguous.
A table with multiple possible schemas.
A violated constraint.

Declarative uncertainty requires...

... uncertainty capture
... query processing over uncertain data
... intuitive and qualitative presentation of uncertainty

Uncertainty-Annotated Databases

(Joint work with Boris Glavic, Su Feng, Aaron Huber)

Other Projects

Adaptive Schemas
Probabilistic Query Compilers

Background

Possible Worlds
$K$-Relations
$K^W$-Relations

$K$-Relations

Provenance Semirings

T.J. Green, G. Karvounarakis & V. Tannen (PODS 2007)

R	A	B
	1	2
	1	3
	4	3

S	B	C
	2	5
	3	6
	3	6

The relational view

The functional view

$$R(1, 2) \mapsto 1$$
$$R(1, 3) \mapsto 1$$
$$R(4, 3) \mapsto 1$$
$$S(2, 5) \mapsto 1$$
$$S(3, 6) \mapsto 2$$

$$R(4, 5) \mapsto 0$$

$$[R_1 \cup R_2](\vec X) \equiv R_1(\vec X) + R_2(\vec X)$$

$[S \cup S](3, 6)$

$= S(3, 6) + S(3, 6)$

$= 2 + 2 = 4$

$$[R_1 \bowtie R_2](\vec X) \equiv R_1(\vec X) \times R_2(\vec X)$$

$[R \bowtie S](4, 3, 6)$

$= R(4, 3) \times S(3, 6)$

$= 1 \times 2 = 2$

$$[\pi_{\vec A} R](\vec X) \equiv \sum_{\vec Y} R(\vec X \vec Y)$$

$[\pi_{B} R](3)$

$= \sum_{Y} R(Y, 3)$

$= R(1, 3) + R(4, 3) + \ldots$

$= 1 + 1 + 0 = 2$

$\cup$	$\approx$	$+$
$\bowtie$	$\approx$	$\times$
$\pi$	$\approx$	$+$
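
The functional view above can be tried out directly. This is a sketch under bag semantics, using Python's `collections.Counter` as the tuple-to-multiplicity function; the helper names (`union`, `join_on_b`, `project_b`) are made up for this example.

```python
from collections import Counter

# R(A, B) and S(B, C) from the slides, as functions from tuples to multiplicities.
R = Counter({(1, 2): 1, (1, 3): 1, (4, 3): 1})
S = Counter({(2, 5): 1, (3, 6): 2})

def union(r1, r2):
    """[R1 u R2](x) = R1(x) + R2(x)."""
    return r1 + r2

def join_on_b(r, s):
    """Natural join of R(A, B) with S(B, C): multiply the annotations."""
    out = Counter()
    for (a, b), k1 in r.items():
        for (b2, c), k2 in s.items():
            if b == b2:
                out[(a, b, c)] += k1 * k2
    return out

def project_b(r):
    """Projection of R(A, B) onto B: add annotations of tuples that agree on B."""
    out = Counter()
    for (_, b), k in r.items():
        out[(b,)] += k
    return out
```

Looking up a tuple that is not in the relation, such as `R[(4, 5)]`, returns 0, which matches $R(4, 5) \mapsto 0$.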

$$\left<\;\mathcal K,\;\oplus,\;\otimes,\;\mathbb 0,\;\mathbb 1\;\right>$$

Semiring	Equivalent Query Semantics
$\left<\mathbb N, +, \times, 0, 1\right>$	Bag Semantics
$\left<\mathbb B, \vee, \wedge, \bot, \top\right>$	Set Semantics
$\left<\mathcal K^W, \vec \oplus, \vec \otimes, \mathbb{\vec 0}, \mathbb{\vec 1}\right>$	Possible Worlds Semantics
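
To illustrate how swapping the semiring changes the query semantics, here is a small sketch with a hypothetical `Semiring` record and a `project` helper (both invented for this example). The same projection code yields bag or set results depending on which semiring is plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any

# <N, +, x, 0, 1> and <B, or, and, false, true>
BAG = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
SET = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)

def project(rel, keep, K):
    """Project onto the attribute positions in `keep`, combining with K.plus."""
    out = {}
    for tup, k in rel.items():
        key = tuple(tup[i] for i in keep)
        out[key] = K.plus(out.get(key, K.zero), k)
    return out

R_bag = {(1, 2): 1, (1, 3): 1, (4, 3): 1}
R_set = {t: True for t in R_bag}

bag_result = project(R_bag, [1], BAG)   # multiplicities are counted
set_result = project(R_set, [1], SET)   # duplicates collapse to True
```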

$K^W$-Relations

R	A	B
	1	2
	1	3
		3

$R_1$	A	B
	1	2
	1	3
	4	3

$R_2$	A	B
	1	2
	1	3
	9	3

R	A	B
	1	2	$\mapsto [1,1]$
	1	3	$\mapsto [1,1]$
	4	3	$\mapsto [1,0]$
	9	3	$\mapsto [0,1]$

Summarizing Possible Worlds

$$\mathcal K^W \rightarrow \mathcal K$$ (plug in any $K$-Relation-compatible $\mathcal K$)

Annotation in World $i$: $\texttt{PW}_i(\vec k) \equiv \vec k_i$
Certain Annotation: $\mathcal C(\vec k) \equiv \min(\vec k)$
Possible Annotation: $\mathcal P(\vec k) \equiv \max(\vec k)$

Certain/Possible mirrors "Correctness of SQL Queries on Databases with Nulls" [Guagliardo, Libkin 2017]

R	A	B
	1	2	$\mapsto [1,1]$
	1	3	$\mapsto [1,1]$
	4	3	$\mapsto [1,0]$
	9	3	$\mapsto [0,1]$

$$\texttt{PW}_0(R(1, 2)) = 1$$

$$\texttt{PW}_0(R(4, 3)) = 1$$

$$\texttt{PW}_1(R(4, 3)) = 0$$

$$\mathcal C(R(4, 3)) = 0$$

$$\mathcal P(R(4, 3)) = 1$$
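
The three summaries can be checked with a few lines of code. This sketch stores each $K^W$ annotation as a Python list with one entry per world (0-indexed, as in the equations above); the function names are made up for this example.

```python
# Annotation vectors from the slide: one entry per possible world.
R_KW = {
    (1, 2): [1, 1],
    (1, 3): [1, 1],
    (4, 3): [1, 0],
    (9, 3): [0, 1],
}

def pw(rel, i):
    """PW_i: the annotation of every tuple in world i."""
    return {t: k[i] for t, k in rel.items()}

def certain(rel):
    """C(k) = min(k): the multiplicity guaranteed in every world."""
    return {t: min(k) for t, k in rel.items()}

def possible(rel):
    """P(k) = max(k): the largest multiplicity in any world."""
    return {t: max(k) for t, k in rel.items()}
```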

A quick step back into reality...

R	A	B
	1	2
	1	3
	4 or 9	3

Standard practice: "Just use the best option."

What's in between these extremes?

R	A	B
	1	2
	1	3
	4	3	*

Use the best option, but mark potential errors.

To answer $Q(\mathcal D)$ we want...

$\texttt{PW}_i(Q(\mathcal D))$	The results Alice would have "just used".

$\mathcal C(Q(\mathcal D))$	Which of those results are trustworthy.

$$\texttt{PW}_i(Q(\mathcal D)) \equiv Q(\texttt{PW}_i(\mathcal D))$$

(Computing $\texttt{PW}_i(Q(\mathcal D))$ is cheap!)

Can we do the same thing for $\mathcal C(Q(\mathcal D))$?

$$\mathcal C(Q(\mathcal D)) \stackrel{?}{=} Q(\mathcal C(\mathcal D))$$

No.

R	A	B		$K^W$	$\mathcal C$
	1	2	$\mapsto$	$[1,1]$	1
	1	3	$\mapsto$	$[1,1]$	1
	4	3	$\mapsto$	$[1,0]$	0
	9	3	$\mapsto$	$[0,1]$	0

Compute $\pi_B(R)$

$\pi_B$R	B		$K^W$	$\mathcal C$
	2	$\mapsto$	$[1,1]$	$\mathcal C([1,1]) = 1$
	3	$\mapsto$	$[2,2]$	$\mathcal C([1,1])+\mathcal C([1,0])+\mathcal C([0,1])$

$= 1 + 0 + 0 = 1 \neq \mathcal C([2,2]) = 2$
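
The counterexample can be replayed in code. This sketch computes $\pi_B$ both ways: over the full $K^W$ annotations followed by $\mathcal C$, and over the $\mathcal C$-summarized relation. Helper names are invented for this example.

```python
R_KW = {(1, 2): [1, 1], (1, 3): [1, 1], (4, 3): [1, 0], (9, 3): [0, 1]}

def certain(rel):
    """C(k) = min(k), applied to every tuple's annotation vector."""
    return {t: min(k) for t, k in rel.items()}

def project_b_kw(rel):
    """Projection onto B over K^W annotations: add vectors world by world."""
    out = {}
    for (_, b), k in rel.items():
        prev = out.get((b,), [0] * len(k))
        out[(b,)] = [x + y for x, y in zip(prev, k)]
    return out

def project_b(rel):
    """Projection onto B over plain bag annotations."""
    out = {}
    for (_, b), k in rel.items():
        out[(b,)] = out.get((b,), 0) + k
    return out

c_of_q = certain(project_b_kw(R_KW))   # C(Q(D)): exact certain multiplicities
q_of_c = project_b(certain(R_KW))      # Q(C(D)): cheap, but only a lower bound
```

On this input the cheap route reports multiplicity 1 for $B = 3$ where the exact answer is 2. It under-counts but never over-counts.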

So what can we do with $\mathcal C$?

We can Approximate

(C-) Soundness: $Q(\mathcal C(\mathcal D)) \leq \mathcal C(Q(\mathcal D))$, so we can efficiently compute a conservative approximation of $\mathcal C$.
Completeness (for some queries): $Q(\mathcal C(\mathcal D)) = \mathcal C(Q(\mathcal D))$ ...if $Q$ is ??? (work in progress)

... also attribute level uncertainty

(but not today)

We implemented it...

Su Feng

Defining Possible Worlds

Mimir allows users to define special UDFs called Models.


    CREATE MODEL TYPE Geocoder AS mimir.models.GeocodingModel;

    CREATE MODEL INSTANCE Text_To_Loc USING Geocoder('Google');

    SELECT C.name, C.id, Text_To_Loc(C.address) AS address 
      FROM Customer C;

(Not actual Mimir-SQL. Language adapted for your viewing pleasure.)

Models...
... return one best guess
... define the space of alternatives
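
The two responsibilities of a model can be pictured with a toy class. This is a hypothetical interface invented for illustration, not Mimir's actual model API: it imputes a missing value by picking the most frequent observed one, and exposes every observed value as the space of alternatives.

```python
from collections import Counter

class MostFrequentModel:
    """Hypothetical model: imputes a missing value from observed ones."""

    def __init__(self, observed):
        self.counts = Counter(observed)

    def best_guess(self):
        # The single value a query uses by default.
        return self.counts.most_common(1)[0][0]

    def alternatives(self):
        # The space of values the missing cell could take.
        return sorted(self.counts)

model = MostFrequentModel(["NY", "NY", "CA", "TX"])
```

Queries run against `best_guess()`; `alternatives()` is what makes it possible to say afterwards which results depended on a guess.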

Example Models

Geocoding Addresses
Imputation using a SparkML classifier
Heuristic detection of order-by columns for interpolation
Schema matching based on edit-distance
MayBMS-style probabilistic repair-key
And more...

Convenience Operators: Lenses

Lenses instantiate/train a model and wrap a query

Domain Constraint Repair / Missing Value Imputation
Schema Matching
Sequence Repair
Key Repair
Arbitrary Choice
Type Detection *
Header Detection *
JSON Shredder *

Evaluation handled by a DBMS or Spark
via query rewriting using GProM.


    SELECT C.name, C.id, Text_To_Loc(C.address) AS address 
      FROM Customer C;

becomes...


    SELECT C.name, C.id, Text_To_Loc(C.address) AS address,
           1 AS name_certain,    1 AS id_certain, 
           0 AS address_certain, 1 AS row_certain
      FROM Customer C;
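
A rough picture of what the rewritten query returns, sketched in Python over a list of dictionaries. The `text_to_loc` stub and its placeholder coordinates are assumptions standing in for the geocoding model; the real rewriting happens inside the database.

```python
def text_to_loc(address):
    """Stand-in for the geocoding model; returns a made-up best guess."""
    return (42.0, -78.0)  # placeholder coordinates, not a real lookup

def rewritten_query(customers):
    """Mirror the rewritten SELECT: original columns plus certainty flags."""
    for c in customers:
        yield {
            "name": c["name"],
            "id": c["id"],
            "address": text_to_loc(c["address"]),
            "name_certain": 1,     # copied from the input unchanged
            "id_certain": 1,
            "address_certain": 0,  # produced by a model, so only a guess
            "row_certain": 1,      # no uncertain condition decided this row
        }

customers = [{"name": "Alice", "id": 1, "address": "1 Main St"}]
result = list(rewritten_query(customers))
```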

A few more things we're doing with Mimir...

Adaptive Schemas

Domain Constraint Repair / Missing Value Imputation
Schema Matching
Sequence Repair
Key Repair
Arbitrary Choice
Type Detection *
Header Detection *
JSON Shredder *



      LOAD 'customers.csv';

      SELECT name FROM customers WHERE last_purchase < LAST_WEEK();

How does the system know...

... which column is 'name'? Guess that row 1 is headers.
... that 'last_purchase' is a date? All rows look like YYYY-MM-DD.

This is all guesswork!
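
The two guesses above can be written down as simple heuristics. This sketch assumes non-empty CSV text and uses invented rules: row 1 is treated as a header if none of its cells look like a number or date, and a column gets a non-string type only if every data cell agrees. The `guess_schema` name and the sample data are hypothetical.

```python
import csv
import io
import re

DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
INT = re.compile(r"^-?\d+$")

def kind(cell):
    """Classify a single cell as 'date', 'int', or 'string'."""
    if DATE.match(cell):
        return "date"
    if INT.match(cell):
        return "int"
    return "string"

def guess_schema(text):
    """Guess (header, types, data rows) for raw, non-empty CSV text."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    # Guess 1: row 1 is a header if it contains only string-like cells.
    has_header = all(kind(c) == "string" for c in rows[0])
    header = rows[0] if has_header else [f"col{i}" for i in range(len(rows[0]))]
    data = rows[1:] if has_header else rows
    # Guess 2: a column has a type only if all of its cells agree on it.
    types = []
    for i in range(len(header)):
        kinds = {kind(r[i]) for r in data if i < len(r)}
        types.append(kinds.pop() if len(kinds) == 1 else "string")
    return header, types, data

SAMPLE = "name,last_purchase\nAlice,2018-03-01\nBob,2018-02-14\n"
header, types, data = guess_schema(SAMPLE)
```

Both rules can be wrong (a headerless file of names, a date column with one typo), which is the point: each guess is a place where the catalog itself is uncertain.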

Idea: Make the System Catalog a Probabilistic Table

Probabilistic Query Compilers

Sampling from ProbDBs is Sloooow

Trivial Sampling

Evaluate the query $N$ times.
Plug in samples instead of best guesses.
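
Trivial sampling is short to write, which is part of its appeal. In this sketch the uncertain cell is the "4 or 9" value from the earlier example, and the two options are assumed to be equally likely; `trivial_sample`, `sample_world`, and `sum_a` are names invented here.

```python
import random

def trivial_sample(query, sample_world, n, seed=0):
    """Evaluate `query` once on each of n independently sampled worlds."""
    rng = random.Random(seed)
    return [query(sample_world(rng)) for _ in range(n)]

def sample_world(rng):
    # Plug in a sample for the ambiguous cell instead of the best guess.
    return [(1, 2), (1, 3), (rng.choice([4, 9]), 3)]

def sum_a(rows):
    """Example query: SUM(A) over R(A, B)."""
    return sum(a for a, _ in rows)

results = trivial_sample(sum_a, sample_world, n=100)
```

Every run re-reads the two rows that never change, which is the redundancy the better solutions try to remove.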

Better Solutions

Merge evaluation to mitigate redundancy.

Sparse Encoding

$R_1$	A	B
	1	2
	3	4

$R_2$	A	B
	1	5

➔

A	B	S#
1	2	1
3	4	1
1	5	2
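
The encoding step itself is a one-liner. This sketch stacks the sampled relations into a single table with a sample-number column, reproducing the table above; `sparse_encode` is a made-up helper name.

```python
def sparse_encode(samples):
    """Stack sampled relations into one table with a trailing S# column."""
    return [row + (s,) for s, rel in enumerate(samples, start=1) for row in rel]

R1 = [(1, 2), (3, 4)]
R2 = [(1, 5)]
encoded = sparse_encode([R1, R2])
```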

Tuple Bundles

$R_1$	A	B
	1	2
	3	4

$R_2$	A	B
	1	5

➔

$R_{bundle}$	A	B	$\phi$
	1	[2,5]	[T,T]
	3	4	[T,F]
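
A sketch of building tuple bundles from the same two samples. It assumes rows are aligned across samples by attribute A, and it stores `None` for a value in a world where the row is absent (the table above simply shows 4); both choices, and the `bundle` helper, are assumptions for illustration.

```python
def bundle(samples, key=0, value=1):
    """Merge sampled relations into tuple bundles keyed on attribute `key`.

    Returns {key value: (per-world values, per-world presence flags)}.
    """
    n = len(samples)
    out = {}
    for s, rel in enumerate(samples):
        for row in rel:
            values, present = out.setdefault(row[key], ([None] * n, [False] * n))
            values[s] = row[value]
            present[s] = True
    return out

R1 = [(1, 2), (3, 4)]
R2 = [(1, 5)]
bundles = bundle([R1, R2])
```

The presence flags play the role of $\phi$: the row with $A = 3$ exists only in the first sample.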

Idea: Let the compiler pick the right representation
(or combination)

http://mimirdb.info

Students
Poonam (PhD-3Y)	Will (PhD-2Y)	Aaron (PhD-3Y)

Dev
Mike (Sr. Rsrch. Dev.)

Alumni
Ying (PhD 2017)	Niccolò (PhD 2016)	Arindam (MS 2016)	Shivang (MS 2018)	Olivia (BS 2017)	Gourab (MS 2018)

External Collaborators
Dieter Gawlick (Oracle)	Zhen Hua Liu (Oracle)	Ronny Fehling (Airbus)	Beda Hammerschmidt (Oracle)

Boris Glavic
(IIT)

Su Feng
(IIT)

Juliana Freire
(NYU)

Wolfgang Gatterbauer
(NEU)

Heiko Mueller
(NYU)

Remi Rampin
(NYU)

Mimir is supported by NSF Award ACI-1640864, NPS Award N00244-16-1-0022, and gifts from Oracle
