Don't Wrangle, Debug

Oliver Kennedy

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store

Alice's store collects sales data

Alice wants to use her sales data to run a promotion

So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.

... asks her question ...

... and basks in the limitless possibilities of big data.

Why is this a fairy tale?

It's never this easy...

File Formats

(JSON/XML, CSV, 1000s of Files in Directories)

  • Manually Explore Examples
  • Automatic Summarization (Oracle Data Guides)
  • Manual Segmentation (Log Data, Files)

Missing Data

(Sensor errors, Survey Data)

  • Discover Outliers
  • Fix (pick one): Don't guess wrong!
    • Impute Missing Values
    • Interpolate Missing Values
    • Drop Rows with Missing Data
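The three fixes listed above give different results on the same data, which is why picking one amounts to a guess. A minimal sketch, assuming a single numeric column in which `None` marks a missing reading; the function names and the sample `readings` are hypothetical and only illustrate the trade-off:

```python
from typing import List, Optional

Series = List[Optional[float]]

def impute_mean(values: Series) -> List[float]:
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in values]

def interpolate_linear(values: Series) -> Series:
    """Fill gaps between two observed values on a straight line.

    Gaps before the first or after the last observed value stay None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result

def drop_missing(values: Series) -> List[float]:
    """Discard missing entries entirely."""
    return [v for v in values if v is not None]

# Hypothetical sensor readings with three missing entries.
readings: Series = [10.0, None, 14.0, None, None, 20.0]

imputed = impute_mean(readings)
interpolated = interpolate_linear(readings)
dropped = drop_missing(readings)
```

Each strategy silently bakes a different assumption into every downstream query: the mean flattens trends, interpolation assumes smooth change, and dropping rows shrinks the sample.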

Documentation

("The CSV is in GIT")

  • Rediscover Column/Variable Meanings
  • Units / Measurement Techniques
  • Caveats on Data Usage / Cleaning Techniques
  • Make & Rediscover Assumptions About Data

Mimir & Family

  • SchemaDrill: JSON/Filesystem Schemas
  • UADBs: Data-Associated Documentation
  • LOKI: Automatic Reverse-Engineered Documentation
  • Vizier: Multi-Modal, Interactive Data Exploration

SchemaDrill


            { "foo" : 1, "bar" : 2 }

            { "foo" : 3, "bar" : 4,  "baz" : 5 }

            { "baz" : 6, "frob" : 7 }
          

What's the schema of these objects?
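A naive first step toward an answer is to count how often each key appears and which sets of keys occur together. The sketch below does that for the three objects on this slide. It is only an illustration of why the question is hard, not SchemaDrill's algorithm, and the helper name `summarize_keys` is hypothetical:

```python
import json
from collections import Counter

# The three objects from the slide, one JSON document per line.
raw = """
{ "foo" : 1, "bar" : 2 }
{ "foo" : 3, "bar" : 4,  "baz" : 5 }
{ "baz" : 6, "frob" : 7 }
"""

def summarize_keys(lines):
    """Return (key frequency, frequency of each distinct key set)."""
    key_counts = Counter()
    shape_counts = Counter()
    for line in lines:
        obj = json.loads(line)
        key_counts.update(obj.keys())
        shape_counts[frozenset(obj.keys())] += 1
    return key_counts, shape_counts

objects = [line for line in raw.splitlines() if line.strip()]
key_counts, shape_counts = summarize_keys(objects)

for key, count in sorted(key_counts.items()):
    print(f"{key}: appears in {count} of {len(objects)} objects")
print(f"{len(shape_counts)} distinct key sets")
```

No key appears in every object and every object has a different key set, so counting alone cannot say whether this is one entity with optional fields or several entities mixed together.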

SchemaDrill


  { "name" : "Alice", "address" : "123 A Street" }

  { "name" : "Bob", "address" : "456 B Street",  "city" : "Buffalo" }
  
  { "city" : "Buffalo", "state" : "New York" }
          

What's the schema of these objects?

SchemaDrill


  { "restaurant" : "10-21", "menu" : ["Wings", "Beer"] }

  { "restaurant" : "11-21", "menu" : ["Gnocchi"],  "bar" : "17-00" }

  { "bar" : "18-02", "patio" : true }
          

What's the schema of these objects?

SchemaDrill

  • Does a collection of objects encode one type of entity or multiple?
    • Non-Negative Matrix Factorization
  • Does a nested array / object represent a tuple or a collection?
    • Key Entropy (higher → more collection-like)
    • Type Entropy (higher → more tuple-like)
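To make the two entropy signals concrete, here is a minimal sketch. It assumes key entropy is the Shannon entropy of how often each key occurs across instances of a nested object, and type entropy is the Shannon entropy of the value types seen. These definitions, the function names, and the sample data are assumptions for illustration and may differ from what SchemaDrill actually computes.

```python
import math
from collections import Counter

def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy (in bits) of an empirical distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c)

def key_entropy(instances) -> float:
    """Entropy of key occurrences across all instances of a nested object."""
    return shannon_entropy(Counter(k for obj in instances for k in obj))

def type_entropy(instances) -> float:
    """Entropy of value types across all instances of a nested object."""
    return shannon_entropy(
        Counter(type(v).__name__ for obj in instances for v in obj.values()))

# Hypothetical nested objects keyed by user id: many keys, one value type.
collection_like = [
    {"u1": 5, "u2": 7},
    {"u3": 1, "u4": 2, "u5": 9},
]

# Hypothetical records: the same few keys every time, mixed value types.
tuple_like = [
    {"name": "Alice", "age": 30, "active": True},
    {"name": "Bob", "age": 41, "active": False},
]

print(key_entropy(collection_like), type_entropy(collection_like))
print(key_entropy(tuple_like), type_entropy(tuple_like))
```

Under these assumptions the id-keyed maps score higher on key entropy and the fixed-shape records score higher on type entropy, matching the direction of the two heuristics above.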

UADBs

Classical Ways to Query Uncertain Data

  • Certain Answers: Answers we're $100\%$ confident in.
  • Possible Answers: Answers we're $>0\%$ confident in.
  • Best-Guess Answer: Some internally self-consistent answer that looks right.

Certain is principled; Best-Guess is fast.
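The three kinds of answers can be illustrated with an explicit set of possible worlds. This is a minimal sketch on a hypothetical two-world example where Bob's city is uncertain; enumerating worlds like this is only practical for toy data, and treating the first world as the best guess is an arbitrary assumption.

```python
# Each possible world is one complete, self-consistent version of the table.
# Hypothetical data: we are unsure whether Bob lives in Buffalo or Albany.
worlds = [
    {("Alice", "Buffalo"), ("Bob", "Buffalo")},
    {("Alice", "Buffalo"), ("Bob", "Albany")},
]

def query(world):
    """Who lives in Buffalo?"""
    return {name for name, city in world if city == "Buffalo"}

answers = [query(w) for w in worlds]

# Certain answers: returned in every possible world.
certain = set.intersection(*answers)

# Possible answers: returned in at least one possible world.
possible = set.union(*answers)

# Best-guess answer: run the query once, on a single chosen world.
best_guess = query(worlds[0])

print("certain:   ", sorted(certain))
print("possible:  ", sorted(possible))
print("best guess:", sorted(best_guess))
```

The certain and possible answers require reasoning about every world, while the best guess costs one ordinary query but gives no hint that Bob's presence in the result is unreliable.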

UADBs

  1. Mark possibly erroneous inputs
  2. Trace marks through queries
  3. Automatically mark outputs
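The three steps above can be sketched with a boolean mark carried on each row. This is a simplified illustration under stated assumptions, not the UADB implementation: the `Row` class, the `select` and `join` helpers, and the sample tables are all hypothetical, and a join output is marked whenever either input row is marked.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Row:
    values: tuple
    uncertain: bool = False  # step 1: mark possibly erroneous inputs

def select(rows: List[Row], pred: Callable[[tuple], bool]) -> List[Row]:
    """Selection keeps each surviving row's mark unchanged."""
    return [r for r in rows if pred(r.values)]

def join(left: List[Row], right: List[Row], lkey: int, rkey: int) -> List[Row]:
    """Step 2: an output row is marked if either contributing row is."""
    return [
        Row(l.values + r.values, l.uncertain or r.uncertain)
        for l in left
        for r in right
        if l.values[lkey] == r.values[rkey]
    ]

# Hypothetical inputs: the gadget sale amount is suspected to be wrong.
sales = [Row(("widget", 10)), Row(("gadget", 250), uncertain=True)]
products = [Row(("widget", "tools")), Row(("gadget", "toys"))]

joined = join(sales, products, 0, 0)
big_sales = select(joined, lambda v: v[1] > 5)

# Step 3: outputs arrive already marked, with no extra work by the analyst.
marked = [r.values[0] for r in big_sales if r.uncertain]
print(marked)
```

The query runs once over ordinary rows, as in the best-guess approach, while the marks travel along with the data and flag which results rest on suspect inputs.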

UADBs

  • How do we efficiently compute which results are based on marked inputs?
  • Can we guarantee that all marked outputs depend on marked inputs?
  • How do we tie these results to existing encodings of uncertain data?

LOKI

Label Once and Keep It

Vizier

Students

Poonam
(PhD-3Y)

Will
(PhD-2Y)

Aaron
(PhD-3Y)

Mercy
(BS)

Dev

Mike
(Sr. Rsrch. Dev.)

Alumni

Ying
(PhD 2017)

Niccolò
(PhD 2016)

Arindam
(MS 2016)

Shivang
(MS 2018)

Olivia
(BS 2017)

Gourab
(MS 2018)

External Collaborators
Dieter Gawlick (Oracle)
Zhen Hua Liu (Oracle)
Ronny Fehling (Airbus)
Beda Hammerschmidt (Oracle)
Boris Glavic (IIT)
Su Feng (IIT)
Juliana Freire (NYU)
Wolfgang Gatterbauer (NEU)
Heiko Mueller (NYU)
Remi Rampin (NYU)
Sonia Castelo Quispe (NYU)

Mimir is supported by NSF Awards ACI-1640864, IIS-1750460, and gifts from Oracle. Prior support from NPS Award N00244-16-1-0022.