Don't Wrangle, Guess Instead

with

Students

Poonam
(PhD-3Y)

Will
(PhD-2Y)

Aaron
(PhD-3Y)

Shivang
(MS-2Y)

Lisa
(BS-Sr)

Olivia
(BS-Sr)

Alumni

Ying
(PhD 2017)

Niccolò
(PhD 2016)

Arindam
(MS 2016)

Dev

Mike
(Sr. Rsrch. Dev.)

External Collaborators
Dieter Gawlick
(Oracle)
Zhen Hua Liu
(Oracle)
Ronny Fehling
(Airbus)
Beda Hammerschmidt
(Oracle)
Boris Glavic
(IIT)
Juliana Freire
(NYU)
Wolfgang Gatterbauer
(NEU)
Heiko Mueller
(NYU)
Remi Rampin
(NYU)

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

Alice's store collects sales data

(OpenClipArt.org)

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

(OpenClipArt.org)

... asks her question ...

(OpenClipArt.org)

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a SELECT on a raw CSV File

  • File may not have column headers
  • CSV does not provide "types"
  • Lines may be missing fields
  • Fields may be mistyped (typo, missing comma)
  • Comment text can be inlined into the file

State of the art: External table definition + "manually" editing the CSV
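As an illustration of the alternative to manual editing, here is a minimal sketch of a loader that guesses instead of failing, and records every guess it makes. The function names (`guess_type`, `load_with_guesses`) and the repair policy (pad short rows, pick the narrowest type that fits) are assumptions made for this example, not the system described in the talk.

```python
import csv
import io

def guess_type(values):
    """Guess a column type (int, float, or str) from its non-empty values."""
    present = [v for v in values if v != ""]
    for caster in (int, float):
        try:
            for v in present:
                caster(v)
            return caster
        except ValueError:
            continue
    return str

def load_with_guesses(text, n_cols):
    """Parse CSV text, repairing row lengths and logging every guess made."""
    guesses = []
    rows = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) < n_cols:
            missing = n_cols - len(row)
            guesses.append(f"line {line_no}: padded {missing} missing field(s)")
            row = row + [""] * missing
        elif len(row) > n_cols:
            guesses.append(f"line {line_no}: dropped {len(row) - n_cols} extra field(s)")
            row = row[:n_cols]
        rows.append(row)
    # CSV carries no types, so guess one per column and say so.
    types = [guess_type([r[i] for r in rows]) for i in range(n_cols)]
    for i, t in enumerate(types):
        guesses.append(f"column {i}: guessed type {t.__name__}")
    typed = [[t(v) if v != "" else None for t, v in zip(types, r)] for r in rows]
    return typed, guesses
```

The point of returning `guesses` next to the data is that the user can query right away and still see which parts of the answer rest on a repair.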

Merge Two Datasets

UNION two data sources

  • Schema matching
  • Deduplication
  • Format alignment (GIS coordinates, $ vs €)
  • Precision alignment (State vs County)

State of the art: Manually map schema
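One way to guess a schema mapping rather than write it by hand is plain string similarity between column names. The sketch below uses Python's standard `difflib`; the function name `guess_schema_mapping` and the 0.6 cutoff are assumptions for illustration, and a real matcher would likely also look at the data values.

```python
import difflib

def guess_schema_mapping(left_cols, right_cols, cutoff=0.6):
    """Map each left column to its most similar right column, or None."""
    mapping = {}
    for col in left_cols:
        # get_close_matches returns the best candidates at or above the cutoff.
        matches = difflib.get_close_matches(col, right_cols, n=1, cutoff=cutoff)
        mapping[col] = matches[0] if matches else None
    return mapping
```

Columns mapped to `None` are the ones where the guess failed and a person still needs to step in.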

JSON Shredding

Run a SELECT on JSON or a Doc Store

  • Separating fields and record sets:
    (e.g., { A: "Bob", B: "Alice" })
  • Missing fields (Records with no 'address')
  • Type alignment (Records with 'address' as an array)
  • Schema matching$^2$

State of the art: DataGuide, Wrangler, etc...
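The missing-field and type-alignment cases above can be handled by guessing and keeping a log. This is a rough sketch under assumed policies (a missing field becomes `None`, an array is collapsed to its first element); the function name `shred` is hypothetical and the policies are only one of several reasonable choices.

```python
import json

def shred(json_text, fields):
    """Flatten a JSON array of objects into rows, logging each guess."""
    records = json.loads(json_text)
    rows, guesses = [], []
    for i, rec in enumerate(records):
        row = {}
        for f in fields:
            if f not in rec:
                guesses.append(f"record {i}: '{f}' missing, guessed None")
                row[f] = None
            elif isinstance(rec[f], list):
                guesses.append(f"record {i}: '{f}' is an array, guessed its first element")
                row[f] = rec[f][0] if rec[f] else None
            else:
                row[f] = rec[f]
        rows.append(row)
    return rows, guesses
```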

Data Cleaning is Hard!

State of the Art

(skilledup.com)

Alice spends weeks cleaning her data before using it.

The database is in the way

Why?

In the name of Codd,
thou shalt not give the user a wrong answer.

... but what if we did?

What would it take for that to be ok?

Industry says...

            

My phone is guessing, but is letting me know that it did

Apple iOS 10; Phone App

Good Explanations, Alternatives, and Feedback Vectors

Bing Translate (ca. 2016)

Communication

  • What data is uncertain?
  • Why is my data uncertain?
  • How bad is it?
  • What can I do about it?

What if a database did the same?

(they can)

On representing incomplete information in a relational data base

T. Imielinski & W. Lipski Jr. (VLDB 1981)

Incomplete and Probabilistic Databases
have existed since the 1980s

[Figure: Q(D) evaluated over several possible worlds, yielding probabilities or certain answers]

We've gotten good at query processing on uncertain data.
But not at "sourcing" uncertain data ... or communicating results.

Challenges

  • Where do Probabilities/Possible Worlds Come From?
  • How do I use the output of a probabilistic DB query?
  • Probabilistic DB queries are sloooooow.

A small shift in how we think about PDBs addresses all three points.

It's not the data that's uncertain,
it's the interpretation

Time  Sensor Reading  Temp Around Sensor
1     31.6            Roughly 31.6˚C
2     -999            Around 30˚C?
4     28.1            Roughly 28.1˚C?
3     32.2            Roughly 32.2˚C

The reading is deterministic

... but what we care about is what the reading measures

[Figure: alternative queries Q1(D), Q2(D), Q3(D), Q4(D) over the same data D]

Insight: Treat data as 100% deterministic.

Instead, queries propose alternative interpretations.
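To make this concrete with the sensor readings from the earlier table: the stored readings never change, and the interpretation lives in the query. The sketch below assumes that -999 is an error code and that the mean of the valid readings is an acceptable guess; both are assumptions for this example, as are the names `Interpreted` and `interpret_temperature`.

```python
from dataclasses import dataclass

SENTINEL = -999  # assumption: the sensor reports -999 when it has no reading

@dataclass
class Interpreted:
    time: int
    value: float
    guessed: bool
    reason: str

def interpret_temperature(readings):
    """Interpret raw (time, reading) pairs as temperatures, flagging guesses."""
    valid = [v for _, v in readings if v != SENTINEL]
    if not valid:
        raise ValueError("no valid readings to guess from")
    fallback = sum(valid) / len(valid)
    out = []
    for t, v in readings:
        if v == SENTINEL:
            out.append(Interpreted(t, fallback, True,
                                   f"reading {v} treated as an error code; "
                                   "used the mean of valid readings"))
        else:
            out.append(Interpreted(t, v, False, "reading taken at face value"))
    return out

# The raw data stays exactly as recorded.
readings = [(1, 31.6), (2, -999), (4, 28.1), (3, 32.2)]
result = interpret_temperature(readings)
```

A different query could interpret the same rows differently, for example by interpolating between neighbours, without touching the stored data.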

Effects

  1. It's clear where uncertainty comes from.
  2. Results can be communicated through provenance.
  3. Query evaluation is decoupled from physical layout.

Non-Deterministic Queries

Uncertainty as Provenance

Introduce Best-Guess queries and the idea of explanations. Key points:

  • Best-guess queries
  • Generating explanations
  • Ranking explanations
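A rough sketch of how the three points could fit together: the query is answered from best-guess values, each guess contributes an explanation, and explanations are ranked by how far the guess could move the result. The `Cell` type, the `low`/`high` bounds, and the ranking heuristic are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    best: float       # best-guess value used to answer the query
    low: float        # smallest plausible value (assumed to be known)
    high: float       # largest plausible value (assumed to be known)
    reason: str = ""  # why the value is a guess; empty if it is certain

def best_guess_sum(cells):
    """Return SUM over best guesses, plus explanations ranked by impact."""
    total = sum(c.best for c in cells)
    # One explanation per guessed cell, scored by the width of its range.
    explanations = [(c.high - c.low, c.reason) for c in cells if c.reason]
    explanations.sort(key=lambda e: e[0], reverse=True)
    return total, explanations
```

The user sees a single answer first, followed by the guesses most worth checking.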

Demo

Virtualized Uncertainty

Optimizing sampling-based query evaluation

Schema-Level Uncertainty