Embracing Uncertainty

ODIn Lab

https://odin.cse.buffalo.edu

Oliver Kennedy

Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy
(UB)

Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick
(Oracle)

Before we begin...

Insider Threats

How do we identify abnormal query behavior from users?
What is normal user behavior?
Multiple gigabytes of query logs from M&T

...with Gokhan Kul, Duc Thanh Anh Luong, Ting Xie, Shambhu, Varun, Hung

Pocket Data

Months of query logs from PhoneLab phones (2 queries per phone per second)
SQLite is inefficient
SQLite is being used inefficiently
Let's develop a benchmark to help shine a light on these inefficiencies

...with Jerry Ajay, Geoff, Luke

Just-in-Time Datastructures

Decouple Physical Structure from Logical Interface.
Express Datastructure Organization through Rewrite Rules.
...allows hybridized datastructures for intermediate tradeoffs.
...allows for semifunctional datastructures with all the benefits but fewer tradeoffs.

...with Luke

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

Alice's store collects sales data

Alice wants to use her sales data to run a promotion

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

... asks her question ...

... and basks in the limitless possibilities of big data.

Why is this a fairy tale?

It's never this easy...

Data Cleaning is Hard!

State of the Art

(skilledup.com)

Alice spends weeks cleaning her data before using it.

Newer State of the Art

(azure.microsoft.com)

(timoelliott.com)

Making Cleaning Easier

Can we start with automation and work our way up?

Automate educated guesses for fast cleaning
- Lenses: A family of simple data-cleaning operators
- ... but what if the guesses are wrong?
Annotate 'best guess' relations with the guesses
- Virtual C-Tables: A lineage model based on views, labeled nulls, and lazy evaluation.
- ... so now the user needs to interpret your guesses?
Rank guesses by their impact on result uncertainty
- CPI: A greedy heuristic for ranking sources of uncertainty.
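
One way to picture "rank guesses by their impact on result uncertainty" is the sketch below. It is not the CPI heuristic itself, which these slides do not spell out. It is a stand-in that scores each uncertain variable by how much the entropy of a query result is expected to drop once that variable is resolved. The variable names, probabilities, and the counting query are all made up for illustration.

```python
import itertools
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy (in bits) of a {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def result_distribution(variables, query, fixed=None):
    """Enumerate every assignment of the variables and accumulate the
    probability of each query result. `fixed` pins chosen variables."""
    fixed = fixed or {}
    names = sorted(variables)
    domains = [
        [(fixed[n], 1.0)] if n in fixed else list(variables[n].items())
        for n in names
    ]
    dist = defaultdict(float)
    for combo in itertools.product(*domains):
        assignment = {n: value for n, (value, _) in zip(names, combo)}
        prob = math.prod(p for _, p in combo)
        dist[query(assignment)] += prob
    return dist


def rank_by_impact(variables, query):
    """Rank variables by expected entropy reduction if they were resolved."""
    base = entropy(result_distribution(variables, query))
    scores = {}
    for name, domain in variables.items():
        expected = sum(
            p * entropy(result_distribution(variables, query, {name: value}))
            for value, p in domain.items()
        )
        scores[name] = base - expected
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores


# Hypothetical guesses: two products with an uncertain department.
variables = {
    "Prod.Dept_3": {"Computer": 0.5, "Phone": 0.5},
    "Prod.Dept_7": {"Computer": 0.9, "Phone": 0.1},
}


def count_computers(assignment):
    # Hypothetical query: how many of the uncertain products are computers?
    return sum(1 for dept in assignment.values() if dept == "Computer")


order, scores = rank_by_impact(variables, count_computers)
```

Here the coin-flip guess ranks ahead of the near-certain one, so that is the guess a user would be asked to confirm first. Exhaustive enumeration only works for a handful of variables, which is presumably why a greedy heuristic is used in practice.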

Lenses

Here's a problem with my data. Fix it.

What type is this column? (majority vote)
How do the columns of these relations line up? (pick your favorite schema matching paper)
How do I query heterogeneous JSON objects? (see above)
What should these missing values be? (learning-based interpolation)
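
As a rough illustration of the "majority vote" answer to the column-type question, the sketch below classifies each value with a few regular expressions and lets the values vote. The pattern list, the type names, and the function names are assumptions made for this example, not part of any lens implementation.

```python
import re
from collections import Counter

# Hypothetical candidate types, tried in order; anything else is a string.
PATTERNS = [
    ("int", re.compile(r"[+-]?\d+")),
    ("float", re.compile(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?")),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
]


def guess_value_type(text):
    """Return the first candidate type whose pattern matches the whole value."""
    text = text.strip()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "string"


def infer_column_type(values):
    """Majority vote over non-empty values.

    Returns (winning type, fraction of votes it received)."""
    votes = Counter(
        guess_value_type(v) for v in values if v is not None and v.strip()
    )
    if not votes:
        return "string", 0.0
    winner, count = votes.most_common(1)[0]
    return winner, count / sum(votes.values())


column_type, support = infer_column_type(["12", "7", "n/a", "42"])
```

The vote share doubles as a crude confidence score: three of the four values parse as integers, so the column is guessed to be an integer column with 75% support, and the stray `n/a` becomes a candidate for repair.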

Lenses

Each lens implements one automated data repair task with minimal configuration or training.

A "SQL" Expression
A Model that defines configuration parameters and best-guesses for data repairs.


  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);

AS clause defines source data.
USING clause requests repairs.


The Query


  CREATE VIEW PRODUCTS 
     AS SELECT ID, NAME, ...,
          CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
               ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
          END AS DEPARTMENT
     FROM PRODUCTS_RAW;

ID	Name	...	Department
123	Apple 6s, White	...	Phone
34234	Dell, Intel 4 core	...	Computer
34235	HP, AMD 2 core	...	$Prod.Dept_3$
...	...	...	...
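
The rewrite above can be approximated in plain SQLite, as in the sketch below. The `VAR(...)` call belongs to the lens dialect and is not a SQLite function, so this example substitutes a string placeholder built from `ROWID`. The three rows are the ones shown on the slide, with the elided columns left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PRODUCTS_RAW (ID INTEGER, NAME TEXT, DEPARTMENT TEXT)"
)
conn.executemany(
    "INSERT INTO PRODUCTS_RAW VALUES (?, ?, ?)",
    [
        (123, "Apple 6s, White", "Phone"),
        (34234, "Dell, Intel 4 core", "Computer"),
        (34235, "HP, AMD 2 core", None),  # missing department
    ],
)

# Known departments pass through unchanged; missing ones become a
# labeled null named after the row (a stand-in for VAR(...)).
rows = conn.execute(
    """
    SELECT ID, NAME,
           CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
                ELSE '$Prod.Dept_' || ROWID || '$'
           END AS DEPARTMENT
    FROM PRODUCTS_RAW
    ORDER BY ROWID
    """
).fetchall()

for row in rows:
    print(row)
```

Because the placeholder is only a name, nothing has been guessed yet at this point. The guess is deferred until a model is asked for the value of that variable.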


The Model


  SELECT * FROM PRODUCTS_RAW;

↓

An estimator for each $Prod.Dept_{ROWID}$
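
To make "an estimator for each variable" concrete, here is a deliberately naive sketch: it guesses a missing department from the rows whose names share the most words with the product in question. The talk mentions learning-based interpolation; this token-overlap vote is only a stand-in, and the function names are hypothetical.

```python
import re


def tokens(name):
    """Lower-cased alphanumeric tokens of a product name."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def estimate_department(name, labeled_rows):
    """Guess a department for `name` from (name, department) pairs.

    Each labeled row votes for its department with weight equal to the
    number of tokens it shares with `name`. Returns (guess, vote share),
    or (None, 0.0) when no labeled row shares any token."""
    target = tokens(name)
    scores = {}
    for other_name, department in labeled_rows:
        overlap = len(target & tokens(other_name))
        scores[department] = scores.get(department, 0) + overlap
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


# Rows from the slide whose department is known.
labeled = [
    ("Apple 6s, White", "Phone"),
    ("Dell, Intel 4 core", "Computer"),
]

guess, share = estimate_department("HP, AMD 2 core", labeled)
```

On the slide's data the only shared token is "core", so the guess for the HP row is "Computer". A real model would be trained on far more rows and would report a calibrated confidence rather than a raw vote share.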

The User's View


  SELECT NAME, DEPARTMENT FROM PRODUCTS;

Name	Department
Apple 6s, White	Phone
Dell, Intel 4 core	Computer
HP, AMD 2 core	Computer
...	...

Simple UI: Highlight values (and rows) based on guesses.


Allow users to EXPLAIN uncertain outputs

Explanations include reasons given in English
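
A minimal sketch of what an English explanation might look like follows. The `Guess` record, its fields, and the wording of the message are all assumptions for illustration. The slides do not show the actual EXPLAIN output.

```python
from dataclasses import dataclass


@dataclass
class Guess:
    """Hypothetical record attached to an uncertain output cell."""
    variable: str      # name of the labeled null, e.g. Prod.Dept_3
    value: str         # the best guess currently shown to the user
    confidence: float  # the model's confidence, between 0 and 1
    reason: str        # short English reason supplied by the model


def explain(column, row_label, guess):
    """Render one uncertain cell as an English sentence."""
    return (
        f"{column} of '{row_label}' is a best guess: {guess.value} "
        f"(confidence {guess.confidence:.0%}, variable {guess.variable}). "
        f"Reason: {guess.reason}."
    )


text = explain(
    "DEPARTMENT",
    "HP, AMD 2 core",
    Guess(
        variable="Prod.Dept_3",
        value="Computer",
        confidence=0.75,
        reason="the value was missing and similar products are computers",
    ),
)
print(text)
```

Keeping the variable name in the message ties the explanation back to the labeled null in the underlying view, so a user who disagrees knows exactly which guess to override.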

Other Lenses

Schema Matching (equivalently JSON/XML import)
Archival (how stale is my data?)
Type Inference
Deduplication / Entity Resolution
Schema Name Inference
And more...