Center for Multisource Information Fusion   Embracing Uncertainty

U @ Buffalo
Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy,
Oliver Kennedy, Jan Chomicki
Oracle
Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

Alice's store collects sales data

Alice wants to use her sales data to run a promotion

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

... asks her question ...

... and basks in the limitless possibilities of big data.

Why is this a fairy tale?

It's never this easy...

Loading Data

  • Validating and Fixing Outliers
  • Handling Missing Data
  • Matching Schemas
  • Fixing Schemas
  • Managing Stale Data
  • Deduplicating Records
  • ... and lots more

Data Cleaning is Hard!

State of the Art

(skilledup.com)

Alice spends weeks cleaning her data before using it.

Newer State of the Art

(azure.microsoft.com)
(timoelliott.com)

Making Cleaning Easier

(Figure: Expert Analysis, Crowdsourcing, and Automation compared on Scalability and Reliability)

Can we start with automation and work our way up?

Mimir

  • Automate educated guesses for fast cleaning
    • Lenses: A family of simple data-cleaning operators
    • ... but what if the guesses are wrong?
  • Annotate 'best guess' relations with the guesses
    • Virtual C-Tables: A lineage model based on views, labeled nulls, and lazy evaluation.
    • ... so now the user needs to interpret your guesses?
  • Rank guesses by their impact on result uncertainty
    • CPI: A greedy heuristic for ranking sources of uncertainty.

Lenses

Here's a problem with my data. Fix it.

  • What types should columns in this table have?
    • Majority Vote of All Castable Types
  • How do the columns of these relations line up?
    • Paygo and Countless Other Papers/Systems
  • How do I query heterogeneous JSON/XML objects?
    • XMorph and Many Others
  • What should these missing values be?
    • Machine Learning + Interpolation
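
As a rough illustration of the "Majority Vote of All Castable Types" idea, the sketch below lets every value in a column vote for each type it can be cast to, then picks the most specific type that wins a majority. The candidate types, their specificity order, the 50% threshold, and the names `castable_types` and `infer_column_type` are all assumptions made for this example, not Mimir's actual implementation.

```python
from collections import Counter
from datetime import date


def castable_types(value: str) -> list[str]:
    """Return the name of every candidate type that `value` can be cast to."""
    types = []
    for name, cast in (("int", int), ("float", float), ("date", date.fromisoformat)):
        try:
            cast(value)
            types.append(name)
        except ValueError:
            pass  # value is not castable to this type
    return types


# Assumed order for this sketch: most specific candidate type first.
SPECIFICITY = ["int", "float", "date"]


def infer_column_type(values, threshold=0.5):
    """Pick the most specific type that more than `threshold` of the values vote for."""
    non_null = [v for v in values if v is not None and v != ""]
    votes = Counter()
    for v in non_null:
        votes.update(castable_types(v))
    for t in SPECIFICITY:
        if non_null and votes[t] / len(non_null) > threshold:
            return t
    return "string"  # fallback: every value can be kept as a string
```

A column such as `["1", "2", "x", "4"]` would be typed `int` under this rule; the stray `"x"` is exactly the kind of value a lens would then have to flag as uncertain.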

Lenses

Each lens implements one automated data repair task with minimal configuration or training.

  • A "SQL" Expression
  • A Model that defines configuration parameters and best guesses for data repairs.

  CREATE LENS PRODUCTS 
     AS SELECT * FROM PRODUCTS_RAW
     USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
					
  • AS clause defines source data.
  • USING clause requests repairs.

The Lens Query


  CREATE VIEW PRODUCTS 
     AS SELECT ID, NAME, ...,
          CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
               ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
          END AS DEPARTMENT
     FROM PRODUCTS_RAW;
						
ID    | Name               | ... | Department
123   | Apple 6s, White    | ... | Phone
34234 | Dell, Intel 4 core | ... | Computer
34235 | HP, AMD 2 core     | ... | $Prod.Dept_3$
...   | ...                | ... | ...
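
The view above can be approximated in plain SQLite to see the labeled nulls appear. This is only a sketch: `VAR` is emulated here by a Python function that returns a string label, the three-column `PRODUCTS_RAW` schema and its rows are taken from the example table (with the elided columns dropped), and nothing here reflects how Mimir itself evaluates `VAR`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def var(name, rowid):
    """Stand-in for VAR: return a readable label for the unknown value."""
    return f"${name}_{rowid}$"


# Register the stand-in so the view can call VAR(name, rowid).
conn.create_function("VAR", 2, var)

conn.executescript("""
CREATE TABLE PRODUCTS_RAW (ID INTEGER, NAME TEXT, DEPARTMENT TEXT);
INSERT INTO PRODUCTS_RAW VALUES
  (123,   'Apple 6s, White',    'Phone'),
  (34234, 'Dell, Intel 4 core', 'Computer'),
  (34235, 'HP, AMD 2 core',     NULL);

-- The lens query: keep known departments, label missing ones.
CREATE VIEW PRODUCTS AS
  SELECT ID, NAME,
         CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
              ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
         END AS DEPARTMENT
  FROM PRODUCTS_RAW;
""")

rows = conn.execute(
    "SELECT NAME, DEPARTMENT FROM PRODUCTS ORDER BY ID"
).fetchall()
for name, department in rows:
    print(f"{name:20} {department}")
```

Only the third row, whose `DEPARTMENT` is `NULL`, receives a label; the other rows pass through unchanged, which is what lets the repair stay lazy until a guess is actually needed.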

The Lens Model


  SELECT * FROM PRODUCTS_RAW;
						

An estimator for each $Prod.Dept_{ROWID}$
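
To make "an estimator for each variable" concrete, here is a deliberately naive sketch: it guesses the most frequent observed department and reports its relative frequency as the probability. This frequency rule is a stand-in chosen for brevity, not the machine-learning model the lens would train, and the `departments` data and the names `estimate` and `model` are hypothetical.

```python
from collections import Counter


def estimate(observed):
    """Return (best_guess, probability) from the non-null observed values."""
    counts = Counter(v for v in observed if v is not None)
    if not counts:
        return None, 0.0  # nothing to learn from
    value, count = counts.most_common(1)[0]
    return value, count / sum(counts.values())


# Hypothetical training data: rowid -> department (None marks a missing value).
departments = {1: "Phone", 2: "Computer", 3: None, 4: "Computer"}

guess, probability = estimate(departments.values())

# One entry per variable, keyed by the ROWID of the row with the missing value.
# In this sketch every variable shares the same distribution.
model = {
    rowid: (guess, probability)
    for rowid, dept in departments.items()
    if dept is None
}
```

A real model would condition on the other columns of the row (for example the product name), so that different variables receive different guesses.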

The User's View


  SELECT NAME, DEPARTMENT FROM PRODUCTS;
					
Name               | Department
Apple 6s, White    | Phone
Dell, Intel 4 core | Computer
HP, AMD 2 core     | Computer
...                | ...

Simple UI: Highlight values (and rows) based on guesses.


  SELECT NAME, DEPARTMENT FROM PRODUCTS;
					
Name               | Department
Apple 6s, White    | Phone
Dell, Intel 4 core | Computer
HP, AMD 2 core     | Computer
...                | ...
Probability: 95%
Reason: Because I guessed ‘Computer’ for ‘Department’ on Row ‘3’ of ‘PRODUCTS’

Allow users to EXPLAIN uncertain outputs

Explanations include reasons given in English
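
A minimal sketch of how such an English explanation could be assembled from a guess, mirroring the wording of the example pop-up. The function name `explain` and its parameters are hypothetical and do not describe Mimir's interface.

```python
def explain(table, column, rowid, guess, probability):
    """Format a guessed value as a probability plus a plain-English reason."""
    return (
        f"Probability: {probability:.0%}\n"
        f"Reason: Because I guessed '{guess}' for '{column}' "
        f"on Row '{rowid}' of '{table}'"
    )


message = explain("PRODUCTS", "Department", 3, "Computer", 0.95)
print(message)
```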

Other Lenses

  • Schema Matching (equivalently JSON/XML import)
  • Archival (how stale is my data?)
  • Type Inference
  • Deduplication / Entity Resolution
  • Schema Name Inference
  • And more...

Mimir Demo

Intuitive Uncertainty

UB: Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy,
Oliver Kennedy, Jan Chomicki

Oracle: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick

Thanks to Oracle for multiple gifts that make this research possible