Embracing Uncertainty with Mimir
http://odin.cse.buffalo.edu

Embracing Uncertainty

UB: Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy,
Oliver Kennedy, Jan Chomicki
Oracle: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

(OpenClipArt.org)

Alice's store collects sales data

(OpenClipArt.org)

Alice wants to use her sales data to run a promotion

(OpenClipArt.org)

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

(OpenClipArt.org)

... asks her question ...

(OpenClipArt.org)

... and basks in the limitless possibilities of big data.

(OpenClipArt.org)

Why is this a fairy tale?

It's never this easy...

Loading Data

  • Validating and Fixing Outliers
  • Handling Missing Data
  • Matching Schemas
  • Fixing Schemas
  • Managing Stale Data
  • Deduplicating Records
  • ... and lots more

Data Cleaning is Hard!

(skilledup.com)

Alice spends weeks cleaning her data before using it.

Newer State of the Art

(azure.microsoft.com)
(timoelliott.com)

Making Cleaning Easier

  • Scalability
  • Reliability
  • Expert Analysis
  • Crowdsourcing
  • Automation

Can we start with automation and work our way up?

Mimir!

Intuitive Uncertainty

UB: Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy,
Oliver Kennedy, Jan Chomicki

Oracle: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick

Thanks to Oracle for multiple gifts that make this research possible