Don't Wrangle, Guess Instead

with

Students

Poonam
(PhD-3Y)

Will
(PhD-2Y)

Aaron
(PhD-3Y)

Shivang
(MS-2Y)

Lisa
(BS-Sr)

Olivia
(BS-Sr)

Alumni

Ying
(PhD 2017)

Niccolò
(PhD 2016)

Arindam
(MS 2016)

Dev

Mike
(Sr. Rsrch. Dev.)

External Collaborators
Dieter Gawlick
(Oracle)
Zhen Hua Liu
(Oracle)
Ronny Fehling
(Airbus)
Beda Hammerschmidt
(Oracle)
Boris Glavic
(IIT)
Su Feng
(IIT)
Juliana Freire
(NYU)
Wolfgang Gatterbauer
(NEU)
Heiko Mueller
(NYU)
Remi Rampin
(NYU)

A Big Data Fairy Tale

Meet Alice

(OpenClipArt.org)

Alice has a Store


Alice's store collects sales data


Alice wants to use her sales data to run a promotion


So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.


... asks her question ...


... and basks in the limitless possibilities of big data.


Why is this a fairy tale?

It's never this easy...

CSV Import

Run a SELECT on a raw CSV File

  • File may not have column headers
  • CSV does not provide "types"
  • Lines may be missing fields
  • Fields may be mistyped (typo, missing comma)
  • Comment text can be inlined into the file

State of the art: external table definitions + "manually" editing the CSV
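The failure modes above can all be hit by a few lines of defensive loading code. A minimal sketch (the file contents and the type-inference rules are made up for illustration; real loaders like external table definitions make you declare all of this up front):

```python
import csv
import io

# A hypothetical messy CSV: a header, a missing field, and a mistyped field.
raw = io.StringIO(
    "id,qty,price\n"
    "1,3,9.99\n"
    "2,,4.50\n"        # missing field
    "3,seven,1.25\n"   # mistyped field (text where a number belongs)
)

def coerce(value):
    """Best-effort type inference: try int, then float, else keep the string."""
    if value == "":
        return None            # empty field -> NULL
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

reader = csv.reader(raw)
header = next(reader)          # guess: the first line is a header
rows = [[coerce(v) for v in row] for row in reader]
```

Every line of `coerce` is a guess an analyst would otherwise encode by hand in a schema definition; the point of the slide is that today you must make these guesses before you can even run a SELECT.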

Merge Two Datasets

UNION two data sources

  • Schema matching
  • Deduplication
  • Format alignment (GIS coordinates, $ vs €)
  • Precision alignment (State vs County)

State of the art: Manually map schema
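"Manually map the schema" concretely means writing code like the following sketch. The feeds, the column mapping, and the exchange rate are all invented for illustration:

```python
# Two hypothetical sales feeds with mismatched schemas and currencies.
us_sales = [{"cust": "Alice", "amount_usd": 10.0},
            {"cust": "Bob",   "amount_usd": 20.0}]
eu_sales = [{"customer": "Bob",   "amount_eur": 18.0},
            {"customer": "Carol", "amount_eur": 9.0}]

# Hand-written schema matching: target column -> source column, per feed.
MAPPING = {"us": {"cust": "cust",     "amount": "amount_usd"},
           "eu": {"cust": "customer", "amount": "amount_eur"}}
EUR_TO_USD = 1.10   # assumed rate, for format alignment

def normalize(rows, mapping, rate=1.0):
    """Rename columns to the target schema and convert to a common currency."""
    return [{"cust": r[mapping["cust"]],
             "amount": round(r[mapping["amount"]] * rate, 2)} for r in rows]

merged = (normalize(us_sales, MAPPING["us"])
          + normalize(eu_sales, MAPPING["eu"], EUR_TO_USD))

# Naive deduplication: keep the first record seen per customer.
seen, deduped = set(), []
for r in merged:
    if r["cust"] not in seen:
        seen.add(r["cust"])
        deduped.append(r)
```

Note how many silent judgment calls this buries: which columns correspond, which exchange rate to use, and which duplicate record "wins."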

JSON Shredding

Run a SELECT on JSON or a Doc Store

  • Separating fields and record sets:
    (e.g., { A: "Bob", B: "Alice" })
  • Missing fields (Records with no 'address')
  • Type alignment (Records with 'address' as an array)
  • Schema matching²

State of the art: DataGuide, Wrangler, etc...
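Shredding documents into a fixed relational schema forces guesses at every mismatch. A toy sketch (the documents and the "join array elements" policy are assumptions, not what any particular shredder does):

```python
import json

# Hypothetical document collection: 'address' is missing in one record
# and an array in another -- the two alignment problems from the slide.
docs = json.loads("""[
  {"name": "Bob",   "address": "12 Main St"},
  {"name": "Alice"},
  {"name": "Carol", "address": ["9 Elm St", "Apt 4"]}
]""")

def shred(doc):
    """Flatten one document into a fixed-schema row, guessing at mismatches."""
    addr = doc.get("address")        # missing field -> None (SQL NULL)
    if isinstance(addr, list):       # type alignment: collapse the array
        addr = ", ".join(addr)
    return {"name": doc["name"], "address": addr}

rows = [shred(d) for d in docs]
```

Both branches of `shred` are arbitrary choices (NULL vs. drop the record; join vs. one row per element), which is exactly why tools like DataGuide and Wrangler exist.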

Loading requires curation...

Data Curation is Hard!

Automation: there is plenty of research on automated curation: entity resolution, schema detection, missing-value repair, etc.
Oops: automated guesses can fail spectacularly (e.g., a 12-year-old mistakenly flagged as a terrorist).
Problem: analysts still spend most of their effort on wrangling, not analysis (figure: https://blog.udacity.com/wp-content/uploads/2014/11/data_analysis1.jpg)
ProbDBs are a possible solution, but query performance suffers (runtimes in seconds):

                   PDB-1    PDB-2    PDB-3   TPCH-1   TPCH-3   TPCH-5   TPCH-9
  SQLite            9.52     7.59    31.22    19.56    22.84    33.31    51.13
  MayBMS-SQLite    22.13     7.29    29.15        -        -        -        -
  MayBMS-PGSql     23.44    13.00    20.30        -        -        -        -
  Sampling       TIMEOUT   242.57   300.00   119.62   162.00   258.74   300.00


http://mimirdb.info

  • It's not the data that's uncertain, it's the interpretation.
  • Tagged best-guess evaluation is faster and easier to understand.
  • Not committing to one representation allows faster query processing.
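One way to picture the tagged best-guess idea (a toy sketch only, not Mimir's actual API): fill each hole with a guess, tag it as uncertain, and let the tag propagate to any answer the guess touches.

```python
# Sketch of tagged best-guess evaluation: each cell is (value, certain?),
# and guesses taint every result they contribute to.

def best_guess(value, default):
    """Replace a missing value with a guess, marking it uncertain."""
    return (value, True) if value is not None else (default, False)

raw_sales = [("Alice", 10.0), ("Bob", None), ("Carol", 5.0)]
avg_known = 7.5                      # guess: mean of the known amounts

table = [(cust, best_guess(amt, avg_known)) for cust, amt in raw_sales]

# SUM over the tagged column: the total is certain only if every input was.
total = sum(v for _, (v, _) in table)
certain = all(ok for _, (_, ok) in table)
```

Here `total` comes back immediately as 22.5 with `certain == False`: the analyst gets an answer plus a caveat, instead of a wrangling task before any answer at all.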

Thanks!