Embracing Uncertainty with Mimir

Embracing Uncertainty

Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy,
Oliver Kennedy, Jan Chomicki,
Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick

A Big Data Fairy Tale

Meet Alice


Alice has a Store


Alice's store collects sales data


Alice wants to use her sales data to run a promotion


So Alice loads her sales data into her trusty database server (or Hadoop, Spark, etc.).


... asks her question ...


... and basks in the limitless possibilities of big data.


Why is this a fairy tale?

It's never this easy...

Loading Data

  • Validating and Fixing Outliers
  • Handling Missing Data
  • Matching Schemas
  • Fixing Schemas
  • Managing Stale Data
  • Deduplicating Records
  • ... and lots more

Data Cleaning is Hard!


Alice spends weeks cleaning her data before using it.

Newer State of the Art


Making Cleaning Easier

[Figure: Expert Analysis, Crowdsourcing, and Automation compared on Scalability and Reliability]

Can we start with automation and work our way up?


Intuitive Uncertainty

UB: Ying Yang, Niccolo Meneghetti,
Arindam Nandi, Vinayak Karuppasamy,
Oliver Kennedy, Jan Chomicki

Oracle: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick

Thanks to Oracle for multiple gifts that make this research possible