Embracing Uncertainty

Embracing uncertainty with

Student Collaborators (PhD/MS/BS): Poonam Kumari, William Spoth, Aaron Huber,
Lisa Lu, Olivia Alphonce, Shivang Aggarwal
Alumni: Niccolo Meneghetti, Arindam Nandi (both HPE/Vertica),
Vinayak Karuppasamy (Bloomberg), Ying Yang (Oracle)
Other Collaborators: Mike Brachmann (UB), Ronny Fehling (Airbus),
Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle),
Boris Glavic (IIT), Juliana Freire (NYU)

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

Alice's store collects sales data

Alice wants to use her sales data to run a promotion

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

... asks her question ...

... and basks in the limitless possibilities of big data.

Why is this a fairy tale?

It's never this easy...

CSV Import

Run a `SELECT` on a raw CSV File

File may not have column headers
CSV does not provide "types"
Lines may be missing fields
Fields may be mistyped (typo, missing comma)
Comment text can be inlined into the file

State of the art: External table definition + "manually" editing the CSV
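
As a rough illustration of the problems listed above, here is a minimal Python sketch that scans a raw CSV and reports what it finds instead of silently dropping rows. The sample data, the `scan_csv` helper, the `#` comment convention, and the three-field layout are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical raw file: no header row, an inline comment,
# a line with a missing field, and a mistyped price ("4o9.99").
RAW = """\
Apple 6s,Phone,649.00
# prices updated by hand
Dell 4 core,Computer
HP 2 core,Computer,4o9.99
"""

def scan_csv(text, expected_fields=3):
    """Return (clean_rows, problems); nothing is discarded without a note."""
    clean, problems = [], []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if row[0].startswith("#"):
            problems.append((line_no, "inline comment"))
        elif len(row) != expected_fields:
            problems.append(
                (line_no, f"expected {expected_fields} fields, got {len(row)}"))
        else:
            try:
                float(row[2])  # assumed: the third column is a price
            except ValueError:
                problems.append((line_no, f"not a number: {row[2]!r}"))
            else:
                clean.append(row)
    return clean, problems

clean_rows, problems = scan_csv(RAW)
for line_no, reason in problems:
    print(f"line {line_no}: {reason}")
```

Only one of the four lines survives as a clean row; the other three each need a human decision, which is the "manual edit" step.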

Merge Two Datasets

`UNION` two data sources

Schema matching
Deduplication
Format alignment (GIS coordinates, $ vs €)
Precision alignment (State vs County)

State of the art: Manually map schema
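
A minimal Python sketch of what "manually map schema" can look like for a two-source `UNION`. The store data, column names, mapping tables, and the exchange rate are hypothetical and chosen only for illustration.

```python
# Two hypothetical sources with different column names and currencies.
STORE_A = [{"name": "Apple 6s", "price_usd": 649.00}]
STORE_B = [{"product": "Apple 6s", "preis_eur": 599.00}]

EUR_TO_USD = 1.10  # assumed rate, for illustration only

# The hand-written schema mapping: source column -> (target column, converter).
MAP_A = {"name": ("name", str), "price_usd": ("price_usd", float)}
MAP_B = {
    "product": ("name", str),
    "preis_eur": ("price_usd", lambda v: round(float(v) * EUR_TO_USD, 2)),
}

def align(rows, mapping):
    """Rename columns and convert formats so every row uses the target schema."""
    return [
        {target: convert(row[source])
         for source, (target, convert) in mapping.items()}
        for row in rows
    ]

def union_dedup(*sources):
    """UNION semantics: concatenate, dropping exact duplicate rows."""
    seen, out = set(), []
    for rows in sources:
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                out.append(row)
    return out

merged = union_dedup(align(STORE_A, MAP_A), align(STORE_B, MAP_B))
```

Note that exact-match deduplication keeps both "Apple 6s" rows because the converted prices differ. Deciding whether they are the same product is a separate guess.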

JSON Shredding

Run a `SELECT` on JSON or a Doc Store

Separating fields and record sets:
(e.g., { A: "Bob", B: "Alice" })
Missing fields (Records with no 'address')
Type alignment (Records with 'address' as an array)
Schema matching$^2$

State of the art: DataGuide, Wrangler, etc...
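
To make the missing-field and type-alignment cases concrete, here is a small Python sketch that shreds heterogeneous JSON records into flat rows. The records and the `shred` helper are invented for this example, and this is not how DataGuide or Wrangler work.

```python
import json

# Hypothetical documents: 'address' is a string, missing, or an array.
RAW = """[
  {"name": "Bob",   "address": "12 Main St"},
  {"name": "Alice"},
  {"name": "Carol", "address": ["1 Elm St", "PO Box 7"]}
]"""

def shred(records, field):
    """Flatten records into (name, value, note) rows for one field."""
    rows = []
    for rec in records:
        value = rec.get(field)
        if value is None:
            # Missing field: emit a NULL rather than dropping the record.
            rows.append((rec["name"], None, "missing"))
        elif isinstance(value, list):
            # Array-valued field: one output row per element.
            for item in value:
                rows.append((rec["name"], item, "from array"))
        else:
            rows.append((rec["name"], value, "ok"))
    return rows

rows = shred(json.loads(RAW), "address")
for row in rows:
    print(row)
```

Each branch is a design decision (NULL vs. drop, one row per element vs. first element only) that the query author otherwise has to repeat in every query.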

Data Cleaning is Hard!

State of the Art

(skilledup.com)

Alice spends weeks cleaning her data before using it.

Newer State of the Art

(azure.microsoft.com)

(timoelliott.com)

Structure is hard!

Structured models (RelDBs) force curation during loading.
- Problem: All curation costs are upfront.
Unstructured models (NoSQL) force curation into queries.
- Problem: Complexity/redundancy blowup in queries.

Add structure and curation effort on demand

But... you still need some sort of structure?!?

Let the database make a guess!

In the name of Codd,
thou shalt not give the user a wrong answer.

... but what if we did?

What would it take for that to be ok?

Industry says...

My phone is guessing, but is letting me know that it did

Easy interactions to accept, reject, or explain uncertainty

Good Explanations, Alternatives, and Feedback Vectors

Communication

What data is uncertain?
Why is my data uncertain?
How bad is it?
What can I do about it?

What if a database did the same?

A: Standard SQL.
B: Annotated Output.
C: Lens Diagram.
D: Result Explanations.

Lenses

Here's a problem with my data. Fix it.

What type is this column? (majority vote)
How do the columns of these relations line up? (pick your favorite schema matching paper)
How do I query heterogeneous JSON objects? (see above)
What should these missing values be? (learning-based interpolation)
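
The "majority vote" idea for column types can be sketched in a few lines of Python. This is an illustrative toy, not Mimir's implementation; `guess_type`, `type_lens`, and the shape of the result are assumptions made for this example.

```python
from collections import Counter

def guess_type(value):
    """Narrowest type that parses the string: int, then float, else string."""
    for name, parse in (("int", int), ("float", float)):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    return "string"

def type_lens(column):
    """Pick the column type by majority vote and remember who disagreed."""
    votes = Counter(guess_type(v) for v in column)
    winner, count = votes.most_common(1)[0]
    dissent = [i for i, v in enumerate(column) if guess_type(v) != winner]
    return {
        "type": winner,
        "confidence": count / len(column),  # fraction of values that agree
        "uncertain_rows": dissent,          # rows whose cast is a guess
    }

result = type_lens(["12", "7", "n/a", "42"])
print(result)
```

The point is that the guess comes with bookkeeping: the rows that voted against the winner are exactly the ones a user interface would need to flag.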

Lenses introduce uncertainty

The User's View


  SELECT NAME, DEPARTMENT FROM PRODUCTS;

Name	Department
Apple 6s, White	Phone
Dell, Intel 4 core	Computer
HP, AMD 2 core	Computer
...	...

Simple UI: Highlight values that are based on guesses.


  SELECT NAME, DEPARTMENT FROM PRODUCTS;

Name	Department
Apple 6s, White	Phone
Dell, Intel 4 core	Computer
HP, AMD 2 core	Computer
...	...

Allow users to EXPLAIN uncertain outputs

Explanations include reasons given in English

Explanations

Mark uncertain data and results.
Upon request, provide more detail:
- Why is my data uncertain? (provenance)
- How bad is it? (confidence, entropy, bounds)
- What are other possible answers? (samples)
- What can I do to fix it? (repairs)
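
As a hedged sketch of the "how bad is it?" and "other possible answers" items, the Python below summarizes one uncertain cell given a probability distribution over candidate values. The distribution, the `explain` helper, and its output fields are hypothetical; a real system would derive the candidates from the lens that produced the guess.

```python
import math
import random

# Hypothetical distribution over values for one uncertain DEPARTMENT cell.
candidates = {"Phone": 0.7, "Computer": 0.2, "Tablet": 0.1}

def explain(cands, n_samples=3, seed=0):
    """Summarize an uncertain value: best guess, confidence, entropy, samples."""
    best = max(cands, key=cands.get)
    # Shannon entropy in bits: 0 means certain, higher means more spread out.
    entropy = -sum(p * math.log2(p) for p in cands.values() if p > 0)
    rng = random.Random(seed)  # seeded so repeated calls give the same samples
    samples = rng.choices(list(cands), weights=list(cands.values()), k=n_samples)
    return {
        "best_guess": best,
        "confidence": cands[best],
        "entropy_bits": entropy,
        "samples": samples,
    }

report = explain(candidates)
print(report)
```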

Email: okennedy@buffalo.edu

Office: Davis 338H

Web: https://odin.cse.buffalo.edu

Mimir: http://mimirdb.info

Today's password is Frances Allen
