Don't Wrangle, Guess Instead

with A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

Alice's store collects sales data

(OpenClipArt.org)

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

(OpenClipArt.org)

... asks her question ...

(OpenClipArt.org)

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a SELECT on a raw CSV File

  • File may not have column headers
  • CSV does not provide "types"
  • Lines may be missing fields
  • Fields may be mistyped (typo, missing comma)
  • Comment text can be inlined into the file
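To make this concrete, here is a rough Python sketch (not Mimir code) of the guesses a loader ends up making; the file 'sales.csv' and every heuristic below are hypothetical:

    # Sketch of the guesswork in loading a raw CSV file.
    # 'sales.csv' and all heuristics below are hypothetical.
    import csv

    def looks_numeric(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    with open("sales.csv", newline="") as f:
        # Guess: lines starting with '#' are inlined comments.
        rows = [r for r in csv.reader(f) if r and not r[0].startswith("#")]

    # Guess: if the first row has no numeric values, treat it as a header.
    if all(not looks_numeric(v) for v in rows[0]):
        header, data = rows[0], rows[1:]
    else:
        header, data = [f"col_{i}" for i in range(len(rows[0]))], rows

    # Guess: rows with the wrong number of fields are broken; drop them.
    clean = [r for r in data if len(r) == len(header)]
    print(f"kept {len(clean)} of {len(data)} rows")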

Merge Data From Two Sources

UNION two data sources

  • Schema matching
  • Deduplication
  • Format alignment (GIS coordinates, $ vs €)
  • Precision alignment (State vs County)
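A sketch of the alignment work hiding inside that UNION; the table names, column names, and exchange rate below are made up:

    # Sketch: merging two sources is more than concatenation.
    # Column names and the 1.1 USD/EUR rate are hypothetical.
    import pandas as pd

    us = pd.DataFrame({"cust": ["Bob"], "total_usd": [10.0]})
    eu = pd.DataFrame({"customer": ["Alice"], "total_eur": [20.0]})

    # Schema matching: map both sources onto one target schema.
    us = us.rename(columns={"cust": "customer", "total_usd": "total"})
    eu = eu.rename(columns={"total_eur": "total"})

    # Format alignment: convert EUR amounts to USD.
    eu["total"] = eu["total"] * 1.1

    # Deduplication after the union.
    merged = pd.concat([us, eu], ignore_index=True).drop_duplicates()
    print(merged)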

JSON Shredding

Run a SELECT on JSON or a Doc Store

  • Separating fields and record sets:
    (e.g., { A: "Bob", B: "Alice" })
  • Missing fields (Records with no 'address')
  • Type alignment (Records with 'address' as an array)
  • Schema matching
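A sketch of the shredding problem in Python; the records and the "first element wins" rule are hypothetical:

    # Sketch: flattening heterogeneous JSON records into one table.
    # The documents and the repair rules are hypothetical.
    import json

    docs = [
        '{"name": "Bob",   "address": "12 Main St"}',
        '{"name": "Alice"}',                                   # missing field
        '{"name": "Carol", "address": ["1 Elm St", "2 Oak"]}'  # address is an array
    ]

    rows = []
    for doc in docs:
        rec = json.loads(doc)
        addr = rec.get("address")          # missing field -> NULL
        if isinstance(addr, list):         # type alignment: guess the first element
            addr = addr[0] if addr else None
        rows.append((rec.get("name"), addr))

    print(rows)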

We have tools that can solve these problems!

... most of the time

(google.com)

Problem: It's hard to trust tools that can be wrong!

Options

  1. Ignore the Problem

In the name of Codd
Thou shalt not give the user a wrong answer.

... but this assumes that we start with perfect data.

(Fox News)

Options

  1. Ignore the Problem
  2. Heresy

On representing incomplete information in a relational data base

T. Imielinski & W. Lipski Jr. (VLDB 1981)

But...

1. ProbDBs Produce Probability Distributions as Outputs

2. ProbDBs Require Probability Distributions as Inputs

                 PDB-1   PDB-2   PDB-3   TPCH-1   TPCH-3   TPCH-5   TPCH-9
SQLite            9.52    7.59   31.22    19.56    22.84    33.31    51.13
MayBMS-SQLite    22.13    7.29   29.15       -        -        -        -
MayBMS-PGSql     23.44   13.00   20.30       -        -        -        -
Sampling (x10)   300    242.57   300     119.62   162.00   258.74    300

Probabilistic Databases...

  1. ... require probabilities as inputs
  2. ... produce probabilities as outputs
  3. ... are slow

Options

  1. Ignore the Problem
  2. Heresy
  3. ?

Declarative Uncertainty

The Uncertainty Management System

http://mimirdb.info

At each step, Mimir tracks ambiguity and potential errors.

  • A row that may or may not exist.
  • An attribute value that is missing or ambiguous.
  • A table with multiple possible schemas.
  • A violated constraint.

Declarative uncertainty requires...

  1. ... uncertainty capture
  2. ... query processing over uncertain data
  3. ... intuitive and qualitative presentation of uncertainty

Uncertainty-Annotated Databases

(Joint work with Boris Glavic, Su Feng, Aaron Huber)

Other Projects

  • Adaptive Schemas
  • Probabilistic Query Compilers

Background

  1. Possible Worlds
  2. $K$-Relations
  3. $K^W$-Relations

$K$-Relations

Provenance Semirings

T.J. Green, G. Karvounarakis & V. Tannen (PODS 2007)

R:  A  B        S:  B  C
    1  2            2  5
    1  3            3  6
    4  3            3  6

The relational view

The functional view

$$R(1, 2) \mapsto 1$$ $$R(1, 3) \mapsto 1$$ $$R(4, 3) \mapsto 1$$ $$S(2, 5) \mapsto 1$$ $$S(3, 6) \mapsto 2$$

$$R(4, 5) \mapsto 0$$

$$[R_1 \cup R_2](\vec X) \equiv R_1(\vec X) + R_2(\vec X)$$
$[S \cup S](3, 6)$

$= S(3, 6) + S(3, 6)$

$= 2 + 2 = 4$

$$[R_1 \bowtie R_2](\vec X) \equiv R_1(\vec X) \times R_2(\vec X)$$
$[R \bowtie S](4, 3, 6)$

$= R(4, 3) \times S(3, 6)$

$= 1 \times 2 = 2$

$$[\pi_{\vec A} R](\vec X) \equiv \sum_{\vec Y} R(\vec X \vec Y)$$
$[\pi_{B} R](3)$

$= \sum_{Y} R(Y, 3)$

$ = R(1, 3) + R(4, 3) + \ldots$

$= 1 + 1 + 0 = 2$

$\cup \approx +$
$\bowtie \approx \times$
$\pi \approx +$
$$\left<\;\mathcal K,\;\oplus,\;\otimes,\;\mathbb 0,\;\mathbb 1\;\right>$$
Semiring                                                                                   Equivalent Query Semantics
$\left<\mathbb N, +, \times, 0, 1\right>$                                                  Bag Semantics
$\left<\mathbb B, \vee, \wedge, \bot, \top\right>$                                         Set Semantics
$\left<\mathcal K^W, \vec \oplus, \vec \otimes, \vec{\mathbb 0}, \vec{\mathbb 1}\right>$   Possible Worlds Semantics
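For concreteness, a minimal Python sketch of these definitions over the bag semiring $\left<\mathbb N, +, \times, 0, 1\right>$, using the $R$ and $S$ tables above (the dict-of-tuples encoding is just an illustration):

    # Minimal sketch: K-relations over the bag semiring (N, +, x, 0, 1).
    # A relation maps tuples to annotations; missing tuples are annotated 0.
    from collections import defaultdict
    from itertools import product

    R = {(1, 2): 1, (1, 3): 1, (4, 3): 1}   # R(A, B)
    S = {(2, 5): 1, (3, 6): 2}              # S(B, C)

    def union(r1, r2):                      # [R1 u R2](X) = R1(X) + R2(X)
        return {t: r1.get(t, 0) + r2.get(t, 0) for t in set(r1) | set(r2)}

    def join(r, s):                         # [R |x| S](a,b,c) = R(a,b) x S(b,c)
        out = defaultdict(int)
        for (a, b1), (b2, c) in product(r, s):
            if b1 == b2:
                out[(a, b1, c)] += r[(a, b1)] * s[(b2, c)]
        return dict(out)

    def project_b(r):                       # [pi_B R](b) = sum over a of R(a, b)
        out = defaultdict(int)
        for (a, b), k in r.items():
            out[(b,)] += k
        return dict(out)

    print(union(S, S)[(3, 6)])              # 2 + 2 = 4
    print(join(R, S)[(4, 3, 6)])            # 1 x 2 = 2
    print(project_b(R)[(3,)])               # 1 + 1 = 2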

$K^W$-Relations

R:  A  B
    1  2
    1  3
    ?  3

$R_1$:  A  B        $R_2$:  A  B
        1  2                1  2
        1  3                1  3
        4  3                9  3

R:  A  B
    1  2  $\mapsto [1,1]$
    1  3  $\mapsto [1,1]$
    4  3  $\mapsto [1,0]$
    9  3  $\mapsto [0,1]$

Summarizing Possible Worlds

$$\mathcal K^W \rightarrow \mathcal K$$ (plug in any $K$-Relation-compatible $\mathcal K$)
Annotation in World $i$: $\texttt{PW}_i(\vec k) \equiv \vec k_i$
Certain Annotation: $\mathcal C(\vec k) \equiv \min(\vec k)$
Possible Annotation: $\mathcal P(\vec k) \equiv \max(\vec k)$

Certain/Possible mirrors "Correctness of SQL Queries on Databases with Nulls" [Guagliardo & Libkin 2017]

R:  A  B
    1  2  $\mapsto [1,1]$
    1  3  $\mapsto [1,1]$
    4  3  $\mapsto [1,0]$
    9  3  $\mapsto [0,1]$

$$\texttt{PW}_0(R(1, 2)) = 1$$

$$\texttt{PW}_0(R(4, 3)) = 1$$

$$\texttt{PW}_1(R(4, 3)) = 0$$

$$\mathcal C(R(4, 3)) = 0$$

$$\mathcal P(R(4, 3)) = 1$$
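The same summaries, sketched in Python over the annotation vectors from the table above:

    # Sketch: per-world, certain, and possible annotations on the K^W table above.
    R = {(1, 2): [1, 1], (1, 3): [1, 1], (4, 3): [1, 0], (9, 3): [0, 1]}

    def pw(i, k): return k[i]         # PW_i(k) = k_i
    def certain(k): return min(k)     # C(k) = min(k)
    def possible(k): return max(k)    # P(k) = max(k)

    print(pw(0, R[(1, 2)]))           # 1
    print(pw(0, R[(4, 3)]))           # 1
    print(pw(1, R[(4, 3)]))           # 0
    print(certain(R[(4, 3)]))         # 0
    print(possible(R[(4, 3)]))        # 1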

A quick step back into reality...

R:  A       B
    1       2
    1       3
    4 or 9  3

 

R:  A       B
    1       2
    1       3
    4 or 9  3

Standard practice: "Just use the best option."

What's in between these extremes?

R:  A   B
    1   2
    1   3
    4*  3

Use the best option, but mark potential errors.

To answer $Q(\mathcal D)$ we want...

$\texttt{PW}_i(Q(\mathcal D))$: the results Alice would have "just used".
$\mathcal C(Q(\mathcal D))$: which of those results are trustworthy.

$$\texttt{PW}_i(Q(\mathcal D)) \equiv Q(\texttt{PW}_i(\mathcal D))$$

(Computing $\texttt{PW}_i(Q(\mathcal D))$ is cheap!)

Can we do the same thing for $\mathcal C(Q(\mathcal D))$?

$$\mathcal C(Q(\mathcal D)) \stackrel{?}{=} Q(\mathcal C(\mathcal D))$$

No.

R:  A  B               $\mathcal K^W$   $\mathcal C$
    1  2  $\mapsto$    $[1,1]$          1
    1  3  $\mapsto$    $[1,1]$          1
    4  3  $\mapsto$    $[1,0]$          0
    9  3  $\mapsto$    $[0,1]$          0

Compute $\pi_B(R)$

$\pi_B(R)$:  B               $\mathcal K^W$   $\mathcal C$
             2  $\mapsto$    $[1,1]$          $\mathcal C([1,1]) = 1$
             3  $\mapsto$    $[2,2]$          $\mathcal C([1,1]) + \mathcal C([1,0]) + \mathcal C([0,1]) = 1 + 0 + 0 = 1 \;\neq\; \mathcal C([2,2]) = 2$

So what can we do with $\mathcal C$?

We can Approximate

(C-) Soundness
$Q(\mathcal C(\mathcal D)) \leq \mathcal C(Q(\mathcal D))$
We can efficiently compute a conservative approximation of $\mathcal C$.
Completeness (for some queries)
$Q(\mathcal C(\mathcal D)) = \mathcal C(Q(\mathcal D))$ ...if $Q$ is safe
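A quick sketch of what soundness buys us on the $\pi_B$ example above: pushing $\mathcal C$ through the query may under-report certainty (1 instead of 2), but it never over-reports it:

    # Sketch: C-soundness on the pi_B(R) example above.
    R = {(1, 2): [1, 1], (1, 3): [1, 1], (4, 3): [1, 0], (9, 3): [0, 1]}
    b3 = [R[(1, 3)], R[(4, 3)], R[(9, 3)]]     # inputs contributing to B = 3

    # C(Q(D)): project inside every world, then take the per-world minimum.
    c_of_q = min(sum(k[i] for k in b3) for i in (0, 1))   # min([2, 2]) = 2

    # Q(C(D)): summarize each input with C first, then project.
    q_of_c = sum(min(k) for k in b3)                      # 1 + 0 + 0 = 1

    assert q_of_c <= c_of_q           # 1 <= 2: sound, but not complete here
    print(q_of_c, c_of_q)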

... also attribute level uncertainty

(but not today)

We implemented it...

Su Feng

Defining Possible Worlds

Mimir allows users to define special UDFs called Models.


    CREATE MODEL TYPE Geocoder AS mimir.models.GeocodingModel;

    CREATE MODEL INSTANCE Text_To_Loc USING Geocoder('Google');

    SELECT C.name, C.id, Text_To_Loc(C.address) AS address 
      FROM Customer C;
          

(Not actual Mimir-SQL. Language adapted for your viewing pleasure.)

Models...
... return one best guess
... define the space of alternatives
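Roughly, a model exposes both pieces; here is a hypothetical interface sketch (not Mimir's actual API), with made-up geocoding candidates:

    # Hypothetical sketch of a model's contract: one best guess, plus alternatives.
    class GeocoderModel:
        def alternatives(self, address):
            # Define the space of possible values (best guess first).
            # These coordinates are made-up placeholders.
            return [(43.00, -78.79), (42.99, -78.80)]

        def best_guess(self, address):
            # Return the single value the query will actually use.
            return self.alternatives(address)[0]

    model = GeocoderModel()
    print(model.best_guess("123 Hypothetical Ave"))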

Example Models

  • Geocoding Addresses
  • Imputation using a SparkML classifier
  • Heuristic detection of order-by columns for interpolation
  • Schema matching based on edit-distance
  • MayBMS-style probabilistic repair-key
  • And more...

Convenience Operators: Lenses

Lenses instantiate/train a model and wrap a query

  • Domain Constraint Repair / Missing Value Imputation
  • Schema Matching
  • Sequence Repair
  • Key Repair
  • Arbitrary Choice
  • Type Detection *
  • Header Detection *
  • JSON Shredder *

Evaluation handled by a DBMS or Spark
via query rewriting using GProM.


    SELECT C.name, C.id, Text_To_Loc(C.address) AS address 
      FROM Customer C;
          

becomes...


    SELECT C.name, C.id, Text_To_Loc(C.address) AS address,
           1 AS name_certain,    1 AS id_certain, 
           0 AS address_certain, 1 AS row_certain
      FROM Customer C;
          

                     PDB-1   PDB-2   PDB-3
Deterministic         4.714   4.073   5.238
Mimir+GProM+SQLite    4.962   4.257   6.989
MayBMS               21.814   9.171  18.137

A few more things we're doing with Mimir...

Adaptive Schemas

  • Domain Constraint Repair / Missing Value Imputation
  • Schema Matching
  • Sequence Repair
  • Key Repair
  • Arbitrary Choice
  • Type Detection *
  • Header Detection *
  • JSON Shredder *


      LOAD 'customers.csv';

      SELECT name FROM customers WHERE last_purchase < LAST_WEEK();
          

How does the system know...

... which column is 'name'?
Guess that row 1 is headers.
... that 'last_purchase' is a date?
All rows look like YYYY-MM-DD
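For intuition, the flavor of those guesses as code; the heuristics and sample rows below are hypothetical, not Mimir's actual logic:

    # Sketch: header detection and date-type detection heuristics.
    import re

    rows = [
        ["name",  "last_purchase"],
        ["Alice", "2018-03-02"],
        ["Bob",   "2018-04-17"],
    ]
    DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

    # Guess 1: row 1 is a header if none of its values look like data.
    is_header = not any(DATE.match(v) or v.replace(".", "").isdigit() for v in rows[0])
    header = rows[0] if is_header else [f"col_{i}" for i in range(len(rows[0]))]
    data = rows[1:] if is_header else rows

    # Guess 2: a column is a DATE if every value matches YYYY-MM-DD.
    col = header.index("last_purchase") if "last_purchase" in header else 1
    is_date = all(DATE.match(r[col]) for r in data)
    print(header, "last_purchase is a DATE:", is_date)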

This is all guesswork!

Idea: Make the System Catalog a Probabilistic Table

Probabilistic Query Compilers

Sampling from ProbDBs is Sloooow

Trivial Sampling

Evaluate the query $N$ times.
Plug in samples instead of best guesses.
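A minimal sketch of the trivial approach; the toy data, query, and sampling model are all made up:

    # Sketch: trivial sampling runs the same query N times on sampled inputs.
    import random

    data = [("Alice", 4), ("Bob", None)]      # Bob's value is missing
    N = 10

    def sample_world():
        # Plug in a sample instead of the best guess for each missing value.
        return [(n, v if v is not None else random.randint(1, 10)) for n, v in data]

    def query(world):                         # e.g. SELECT SUM(v)
        return sum(v for _, v in world)

    results = [query(sample_world()) for _ in range(N)]
    print(min(results), max(results))         # spread of possible answers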

Better Solutions

Merge evaluation to mitigate redundancy.

Sparse Encoding

$R_1$:  A  B        $R_2$:  A  B
        1  2                1  5
        3  4

$R_{sparse}$:  A  B  S#
               1  2  1
               3  4  1
               1  5  2

Tuple Bundles

$R_1$:  A  B        $R_2$:  A  B
        1  2                1  5
        3  4

$R_{bundle}$:  A  B      $\phi$
               1  [2,5]  [T,T]
               3  4      [T,F]
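A small Python sketch of the two encodings for the $R_1$/$R_2$ example above (the field layout is illustrative, not an actual storage format):

    # Sketch: two encodings of the samples R_1 = {(1,2),(3,4)} and R_2 = {(1,5)}.

    # Sparse encoding: one row per (tuple, sample), tagged with a sample id S#.
    R_sparse = [
        (1, 2, 1),   # (A, B, S#)
        (3, 4, 1),
        (1, 5, 2),
    ]

    # Tuple bundles: one row per tuple; varying attributes and the presence
    # flags phi become vectors with one slot per sample.
    R_bundle = [
        {"A": 1, "B": [2, 5], "phi": [True, True]},
        {"A": 3, "B": [4, 4], "phi": [True, False]},
    ]

    # Reconstruct sample 2 from each encoding.
    from_sparse = [(a, b) for a, b, s in R_sparse if s == 2]
    from_bundle = [(t["A"], t["B"][1]) for t in R_bundle if t["phi"][1]]
    print(from_sparse, from_bundle)   # both give [(1, 5)]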

               TPCH-1   TPCH-3   TPCH-5
Sparse Tables   119.62   162.00   258.74
Tuple Bundles    14.66      300      300

Idea: Let the compiler pick the right representation
(or combination)

http://mimirdb.info

Students

Poonam
(PhD-3Y)

Will
(PhD-2Y)

Aaron
(PhD-3Y)

Gourab
(MS-2Y)

Alumni

Ying
(PhD 2017)

Niccolò
(PhD 2016)

Arindam
(MS 2016)

Shivang
(MS 2018)

Olivia
(BS 2017)

Dev

Mike
(Sr. Rsrch. Dev.)

External Collaborators
Dieter Gawlick
(Oracle)
Zhen Hua Liu
(Oracle)
Ronny Fehling
(Airbus)
Beda Hammerschmidt
(Oracle)
Boris Glavic
(IIT)
Su Feng
(IIT)
Juliana Freire
(NYU)
Wolfgang Gatterbauer
(NEU)
Heiko Mueller
(NYU)
Remi Rampin
(NYU)

Mimir is supported by NSF Award ACI-1640864, NPS Award N00244-16-1-0022, and gifts from Oracle