Embracing Uncertainty & ODIn Lab Overview

Embracing Uncertainty

ODIn Lab

https://odin.cse.buffalo.edu

Embracing Uncertainty

Oliver Kennedy

Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy
(UB)

Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick
(Oracle)

Before we begin...

Insider Threats

  • How do we identify abnormal query behavior from users?
  • What is normal user behavior?
  • Multiple gigs of query logs from M&T

...with Gokhan Kul, Duc Thanh Anh Luong, Ting Xie, Shambhu, Varun, Hung

Pocket Data

  • Months of query logs from PhoneLab Phones (2 queries per phone per second)
  • SQLite is inefficient
  • SQLite is being used inefficiently
  • Let's develop a benchmark to help shine a light on these inefficiencies

...with Jerry Ajay, Geoff, Luke

Just-in-Time Datastructures

  • Decouple Physical Structure from Logical Interface.
  • Express Datastructure Organization through Rewrite Rules.
  • ...allows hybridized datastructures for intermediate tradeoffs.
  • ...allows for semifunctional datastructures with all the benefits but fewer tradeoffs.

...with Luke

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

Alice's store collects sales data

Alice wants to use her sales data to run a promotion

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

... asks her question ...

... and basks in the limitless possibilities of big data.

Why is this a fairy tale?

It's never this easy...

Data Cleaning is Hard!

State of the Art

(skilledup.com)

Alice spends weeks cleaning her data before using it.

Newer State of the Art

(azure.microsoft.com)
(timoelliott.com)

Making Cleaning Easier

(Figure labels: Scalability, Reliability, Expert Analysis, Crowdsourcing, Automation)

Can we start with automation and work our way up?

  • Automate educated guesses for fast cleaning
    • Lenses: A family of simple data-cleaning operators
    • ... but what if the guesses are wrong?
  • Annotate 'best guess' relations with the guesses
    • Virtual C-Tables: A lineage model based on views, labeled nulls, and lazy evaluation.
    • ... so now the user needs to interpret your guesses?
  • Rank guesses by their impact on result uncertainty
    • CPI: A greedy heuristic for ranking sources of uncertainty.
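
To make "rank guesses by their impact on result uncertainty" concrete, here is a minimal sketch. It is not the CPI heuristic itself, whose details are not given here. It assumes a COUNT(*) query, independent guesses, and a known probability that each guessed row satisfies the query predicate; the variable names `Prod.Dept_7` and `Prod.Dept_9` and all probabilities are hypothetical.

```python
def rank_by_variance(guesses):
    """Sort uncertain guesses by the variance each adds to a COUNT(*) result.

    guesses maps a variable name to the probability that its row satisfies
    the query predicate. Assuming independence, each row contributes
    p * (1 - p) to the variance of the count.
    """
    impact = {name: p * (1 - p) for name, p in guesses.items()}
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical guesses and probabilities, for illustration only.
guesses = {"Prod.Dept_3": 0.95, "Prod.Dept_7": 0.5, "Prod.Dept_9": 0.8}
ranking = rank_by_variance(guesses)
for name, variance in ranking:
    print(f"{name}: contributes {variance:.4f} to the variance of the count")
```

Under these assumptions the coin-flip guess ranks first, so it would be the one worth asking a user to confirm before the others.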

Lenses

Here's a problem with my data. Fix it.

  • What type is this column? (majority vote)
  • How do the columns of these relations line up? (pick your favorite schema matching paper)
  • How do I query heterogeneous JSON objects? (see above)
  • What should these missing values be? (learning-based interpolation)
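
As an illustration of the first item, majority-vote type inference could be sketched roughly as follows. The candidate types, the regular expressions, and the function names are assumptions made for this example, not a description of any particular lens implementation.

```python
from collections import Counter
import re

# Hypothetical candidate types, checked from most to least specific.
_PATTERNS = [
    ("int", re.compile(r"[+-]?\d+")),
    ("float", re.compile(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?")),
]

def vote(value):
    """Return the most specific type name that one value is consistent with."""
    text = value.strip()
    for name, pattern in _PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "string"

def infer_column_type(values):
    """Guess a column type by majority vote over its non-empty values.

    Returns (type_name, fraction_of_votes); the fraction is a crude
    confidence score for the guess.
    """
    votes = Counter(vote(v) for v in values if v is not None and v.strip())
    if not votes:
        return "string", 0.0
    winner, count = votes.most_common(1)[0]
    return winner, count / sum(votes.values())

column = ["12", "7", "n/a", "42", "", "3.5"]
guess, confidence = infer_column_type(column)
print(guess, confidence)
```

The vote fraction doubles as a confidence score: a column where only 60% of the values agree on a type is a guess worth flagging to the user.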

Lenses

Each lens implements one automated data repair task with minimal configuration or training.

  • A "SQL" Expression
  • A Model that defines configuration parameters and best-guesses for data repairs.

  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
					
  • AS clause defines source data.
  • USING clause requests repairs.

The Query


  CREATE VIEW PRODUCTS 
     AS SELECT ID, NAME, ...,
          CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
               ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
          END AS DEPARTMENT
     FROM PRODUCTS_RAW;
						
  ID     Name                ...  Department
  123    Apple 6s, White     ...  Phone
  34234  Dell, Intel 4 core  ...  Computer
  34235  HP, AMD 2 core      ...  $Prod.Dept_3$
  ...    ...                 ...  ...
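
The rewrite above can be tried out in SQLite from Python. This is a sketch under assumptions: `VAR` is emulated here as a user-defined function that returns a string placeholder, the `...` columns are omitted, and the query is run directly rather than stored as a view. How the real system represents labeled nulls is not specified in these slides.

```python
import sqlite3

def var(name, rowid):
    """Stand-in for VAR(): build a labeled null that names the missing cell."""
    return f"${name}_{rowid}$"

conn = sqlite3.connect(":memory:")
conn.create_function("VAR", 2, var)
conn.execute("CREATE TABLE PRODUCTS_RAW (ID INTEGER, NAME TEXT, DEPARTMENT TEXT)")
conn.executemany(
    "INSERT INTO PRODUCTS_RAW VALUES (?, ?, ?)",
    [
        (123, "Apple 6s, White", "Phone"),
        (34234, "Dell, Intel 4 core", "Computer"),
        (34235, "HP, AMD 2 core", None),  # missing department
    ],
)

# Same shape as the rewritten query: known values pass through,
# missing values become a placeholder keyed by the row's ROWID.
rows = conn.execute(
    """
    SELECT ID, NAME,
           CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
                ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
           END AS DEPARTMENT
    FROM PRODUCTS_RAW
    ORDER BY ROWID
    """
).fetchall()
for row in rows:
    print(row)
```

Only the third row, whose department is missing, receives a placeholder; the other rows are returned unchanged.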

The Model


  SELECT * FROM PRODUCTS_RAW;

An estimator for each $Prod.Dept_{ROWID}$
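
A toy version of such an estimator is sketched below. It assumes a deliberately simple word-overlap vote in place of the learning-based model the lens would actually train; the class name, method names, and row ids are hypothetical.

```python
from collections import Counter
import re

def tokens(name):
    """Lowercased word tokens of a product name."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

class DepartmentModel:
    """Toy estimator: one best guess per missing DEPARTMENT, keyed by row id."""

    def __init__(self, rows):
        # rows: list of (rowid, name, department or None)
        self.rows = rows
        self.known = [(tokens(n), d) for _, n, d in rows if d is not None]

    def estimate(self, rowid):
        """Return (best_guess, probability) for the given row."""
        if not self.known:
            return None, 0.0  # nothing to learn from
        name = next(n for r, n, _ in self.rows if r == rowid)
        words = tokens(name)
        # Each known row votes once per word it shares with the target row.
        votes = Counter()
        for known_words, dept in self.known:
            votes[dept] += len(words & known_words)
        votes = +votes  # drop departments that received zero votes
        if not votes:
            # No overlap at all: fall back to the most common department.
            votes = Counter(d for _, d in self.known)
        guess, count = votes.most_common(1)[0]
        return guess, count / sum(votes.values())

model = DepartmentModel([
    (1, "Apple 6s, White", "Phone"),
    (2, "Dell, Intel 4 core", "Computer"),
    (3, "HP, AMD 2 core", None),
])
print(model.estimate(3))
```

Here row 3 shares the word "core" with the Dell row, so the toy model guesses 'Computer'. The vote fraction is only a stand-in for a real probability.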

The User's View


  SELECT NAME, DEPARTMENT FROM PRODUCTS;
					
  Name                Department
  Apple 6s, White     Phone
  Dell, Intel 4 core  Computer
  HP, AMD 2 core      Computer
  ...                 ...

Simple UI: Highlight values (and rows) based on guesses.


Probability: 95%
Reason: Because I guessed ‘Computer’ for ‘Department’ on Row ‘3’ of ‘PRODUCTS’

Allow users to EXPLAIN uncertain outputs

Explanations include reasons given in English
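
One way to picture this is to record every guess alongside the data and render the record as a sentence on request. The sketch below assumes a hypothetical `Guess` record and `explain_cell` helper; it only mirrors the wording of the example explanation and does not show how the system stores lineage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Guess:
    """One repaired cell and the model's confidence in the repair."""
    table: str
    column: str
    rowid: int
    value: str
    probability: float

    def explain(self):
        """Render the guess as a short English explanation."""
        return (
            f"Probability: {self.probability:.0%}\n"
            f"Reason: Because I guessed '{self.value}' for '{self.column}' "
            f"on Row '{self.rowid}' of '{self.table}'"
        )

# Hypothetical lineage store: (table, column, rowid) -> recorded guess.
guesses = {
    ("PRODUCTS", "Department", 3): Guess("PRODUCTS", "Department", 3, "Computer", 0.95),
}

def explain_cell(table, column, rowid):
    """Explain one output cell, or report that it was not guessed."""
    guess = guesses.get((table, column, rowid))
    return guess.explain() if guess else "This value was not guessed."

print(explain_cell("PRODUCTS", "Department", 3))
```

Cells with no recorded guess are reported as certain, which is what lets a UI highlight only the guessed values.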

Other Lenses

  • Schema Matching (equivalently JSON/XML import)
  • Archival (how stale is my data?)
  • Type Inference
  • Deduplication / Entity Resolution
  • Schema Name Inference
  • And more...

Demo (Mimir)

Intuitive Uncertainty

UB: Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy

Oracle: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick

Thanks to Oracle for multiple gifts that made this research possible