Embracing Uncertainty

Don't Wrangle, Guess Instead

with

Students
Poonam (PhD-3Y)	Will (PhD-2Y)	Aaron (PhD-3Y)	Shivang (MS-2Y)	Lisa (BS-Sr)	Olivia (BS-Sr)

Alumni
Ying (PhD 2017)	Niccolò (PhD 2016)	Arindam (MS 2016)

Dev
Mike (Sr. Rsrch. Dev.)

External Collaborators
Dieter Gawlick (Oracle)	Zhen Hua Liu (Oracle)	Ronny Fehling (Airbus)	Beda Hammerschmidt (Oracle)

Boris Glavic
(IIT)

Juliana Freire
(NYU)

Wolfgang Gatterbauer
(NEU)

Heiko Mueller
(NYU)

Remi Rampin
(NYU)

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

Alice's store collects sales data

(OpenClipArt.org)

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

(OpenClipArt.org)

... asks her question ...

(OpenClipArt.org)

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a `SELECT` on a raw CSV File

File may not have column headers
CSV does not provide "types"
Lines may be missing fields
Fields may be mistyped (typo, missing comma)
Comment text can be inlined into the file

State of the art: External Table Defn + "Manually" edit CSV
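A minimal sketch of the "guess instead" alternative for CSV import, assuming a header row is present and that `price` and `qty` are meant to be numeric. The sample data, the `load` and `guess_float` helpers, and the `#` comment convention are all hypothetical and only illustrate the idea: load every line, guess where the file is broken, and keep a note of each guess.

```python
import csv
import io

# Hypothetical sample: a mistyped price, an inlined comment, and a short line.
RAW = """name,price,qty
widget,2.50,4
gadget,abc,1
# restocked on Tuesday
gizmo,7.25
"""

def guess_float(text):
    """Return (value, guessed); guessed is True when text is not a valid number."""
    try:
        return float(text), False
    except (TypeError, ValueError):
        return None, True

def load(raw, numeric=("price", "qty")):
    """Load every data line, recording a note for each guess instead of failing."""
    rows, notes = [], []
    reader = csv.reader(io.StringIO(raw))
    header = next(reader)  # assumption: the first line holds column names
    for line_no, fields in enumerate(reader, start=2):
        if not fields or fields[0].startswith("#"):
            notes.append((line_no, "skipped: looks like a comment"))
            continue
        # Pad lines that are missing trailing fields.
        fields = fields + [None] * (len(header) - len(fields))
        row = dict(zip(header, fields))
        for col in numeric:
            original = row[col]
            value, guessed = guess_float(original)
            if guessed:
                notes.append((line_no, f"{col}: could not parse {original!r}, using NULL"))
            row[col] = value
        rows.append(row)
    return rows, notes

rows, notes = load(RAW)
for line_no, note in notes:
    print(f"line {line_no}: {note}")
```

The notes list is the important part: the query still runs, but the user can see which lines were guessed at and why.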

Merge Two Datasets

`UNION` two data sources

Schema matching
Deduplication
Format alignment (GIS coordinates, $ vs €)
Precision alignment (State vs County)

State of the art: Manually map schema
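As an illustration of guessing a schema mapping rather than writing it by hand, the sketch below pairs columns by name similarity using the standard library's `difflib.SequenceMatcher`. The function names, the `0.6` threshold, and the sample columns are assumptions made for this example; real schema matching would also look at values, types, and formats.

```python
from difflib import SequenceMatcher

def match_schemas(left_cols, right_cols, threshold=0.6):
    """For each right-hand column, guess the most similar left-hand column name."""
    mapping = {}
    for right in right_cols:
        scored = [
            (SequenceMatcher(None, right.lower(), left.lower()).ratio(), left)
            for left in left_cols
        ]
        score, best = max(scored)
        # Below the (assumed) threshold, admit that there is no good guess.
        mapping[right] = (best if score >= threshold else None, round(score, 2))
    return mapping

def union(left_rows, right_rows, mapping):
    """UNION the two sources, renaming right-hand columns via the guessed mapping."""
    out = [dict(row) for row in left_rows]
    for row in right_rows:
        out.append({
            target: row.get(src)
            for src, (target, _) in mapping.items()
            if target is not None
        })
    return out

mapping = match_schemas(["customer_name", "zip_code"], ["CustomerName", "zipcode", "qty"])
merged = union(
    [{"customer_name": "Ann", "zip_code": "14260"}],
    [{"CustomerName": "Bob", "zipcode": "60616", "qty": 3}],
    mapping,
)
print(mapping)
print(merged)
```

Keeping the similarity score next to each guessed pair gives the system something to show the user when it explains why two columns were merged.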

JSON Shredding

Run a `SELECT` on JSON or a Doc Store

Separating fields and record sets:
(e.g., { A: "Bob", B: "Alice" })
Missing fields (Records with no 'address')
Type alignment (Records with 'address' as an array)
Schema matching$^2$

State of the art: DataGuide, Wrangler, etc...
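The missing-field and type-alignment cases above can be sketched as follows. This is a hypothetical `shred` helper on made-up documents, and "keep the first array element" is just one possible guess, chosen here for brevity.

```python
import json

# Hypothetical documents: 'address' is a string, an array, or absent.
DOCS = """[
  {"name": "Bob", "address": "12 Main St"},
  {"name": "Alice", "address": ["1 Elm St", "9 Oak Ave"]},
  {"name": "Carol"}
]"""

def shred(docs, field="address"):
    """Flatten documents into (name, field) rows, noting every guess made."""
    rows, notes = [], []
    for i, doc in enumerate(docs):
        value = doc.get(field)
        if value is None:
            notes.append((i, f"{field} missing, using NULL"))
        elif isinstance(value, list):
            notes.append((i, f"{field} is an array of {len(value)}, keeping the first element"))
            value = value[0] if value else None
        rows.append({"name": doc.get("name"), field: value})
    return rows, notes

rows, notes = shred(json.loads(DOCS))
for row in rows:
    print(row)
for index, note in notes:
    print(f"record {index}: {note}")
```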

Data Cleaning is Hard!

State of the Art

(skilledup.com)

Alice spends weeks cleaning her data before using it.

The database is in the way

Why?

In the name of Codd,
thou shalt not give the user a wrong answer.

... but what if we did?

What would it take for that to be ok?

Industry says...

My phone is guessing, but is letting me know that it did

Apple iOS 10; Phone App

Good Explanations, Alternatives, and Feedback Vectors

Bing Translate (ca. 2016)

Communication

What data is uncertain?
Why is my data uncertain?
How bad is it?
What can I do about it?

What if a database did the same?

(they can)

On representing incomplete information in a relational data base

T. Imielinski & W. Lipski Jr. (VLDB 1981)

Incomplete and Probabilistic Databases
have existed since the 1980s

We've gotten good at query processing on uncertain data.
But not at "sourcing" uncertain data ... or communicating results.

Challenges

Where do Probabilities/Possible Worlds Come From?
How do I use the output of a probabilistic DB query?
Probabilistic DB queries are sloooooow.

A small shift in how we think about PDBs addresses all three points.

It's not the data that's uncertain,
it's the interpretation

Time	Sensor Reading	Temp Around Sensor
1	31.6	Roughly 31.6˚C
2	-999	Around 30˚C?
4	28.1	Roughly 28.1˚C?
3	32.2	Roughly 32.2˚C

The reading is deterministic

... but what we care about is what the reading measures

Insight: Treat data as 100% deterministic.

Instead, queries propose alternative interpretations.
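A small sketch of this idea using the sensor table above: the stored readings stay exactly as they are, and a query-side function proposes an interpretation of each one. Treating `-999` as a "no reading" marker and falling back to the mean of the valid readings are assumptions made for illustration; other interpretations are equally possible.

```python
from statistics import mean

# The stored data, exactly as recorded (time, sensor reading).
READINGS = [(1, 31.6), (2, -999), (4, 28.1), (3, 32.2)]
SENTINEL = -999  # assumption: this value means "no reading"

def interpret(readings, sentinel=SENTINEL):
    """Return (time, temperature, is_guess) without modifying the stored readings."""
    valid = [value for _, value in readings if value != sentinel]
    fallback = mean(valid)  # one possible interpretation of a missing reading
    return [
        (time, fallback, True) if value == sentinel else (time, value, False)
        for time, value in readings
    ]

for time, temp, is_guess in interpret(READINGS):
    marker = "?" if is_guess else ""
    print(f"t={time}: roughly {temp:.1f}˚C{marker}")
```

Swapping in a different interpretation (for example, interpolating between neighbouring times) changes only the query, not the data.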

Effects

It's clear where uncertainty comes from.
Results can be communicated through provenance.
Query evaluation is decoupled from physical layout.

Non-Deterministic Queries

Uncertainty as Provenance

Introduce Best-Guess queries and the idea of explanations. Key points:

Best-guess queries
Generating explanations
Ranking explanations
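The three key points can be sketched together in a few lines. This is an illustration only: the row format, the `best_guess_sum` name, and ranking explanations by absolute contribution to the total are all assumptions, not a description of any particular system.

```python
def best_guess_sum(rows):
    """rows: (label, value, reason) where reason is None for values taken as-is.

    Returns the best-guess total plus explanations for every guessed value,
    ranked by how much each guess contributes to the answer.
    """
    total = sum(value for _, value, _ in rows)
    explanations = [
        {"row": label, "reason": reason, "impact": abs(value)}
        for label, value, reason in rows
        if reason is not None
    ]
    # Heuristic ranking (assumed): guesses that move the total most come first.
    explanations.sort(key=lambda e: e["impact"], reverse=True)
    return total, explanations

# Hypothetical sales rows; two of the values were guessed upstream.
sales = [
    ("a", 10.0, None),
    ("b", 2.0, "price missing, used median"),
    ("c", 7.0, "qty unparseable, used 1"),
]
total, explanations = best_guess_sum(sales)
print(total)
for e in explanations:
    print(e)
```

The query returns a single answer, as a deterministic query would, and the explanations tell the user which guesses to review first.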

Demo

Virtualized Uncertainty

Optimizing sampling-based query evaluation

Schema-Level Uncertainty
