Embracing uncertainty with
- Student Collaborators (PhD/MS/BS)
  - Poonam Kumari, William Spoth, Aaron Huber, Lisa Lu, Olivia Alphonce, Shivang Aggarwal
- Alumni
  - Niccolo Meneghetti, Arindam Nandi (both HPE/Vertica), Vinayak Karuppasamy (Bloomberg), Ying Yang (Oracle)
- Other Collaborators
  - Mike Brachmann (UB), Ronny Fehling (Airbus), Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), Boris Glavic (IIT), Juliana Freire (NYU)
Meet Alice
(OpenClipArt.org)
Alice has a Store
Alice's store collects sales data
Alice wants to use her sales data to run a promotion
So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.
... asks her question ...
... and basks in the limitless possibilities of big data.
Why is this a fairy tale?
It's never this easy...
CSV Import
Run a SELECT on a raw CSV file
- File may not have column headers
- CSV does not provide "types"
- Lines may be missing fields
- Fields may be mistyped (typo, missing comma)
- Comment text can be inlined into the file
State of the art: External table definition + "manually" editing the CSV
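To make the CSV problems above concrete, here is a minimal Python sketch that separates usable rows from rows needing curation. The sample data, the `load_raw_csv` helper, and the `#` comment convention are all assumptions made for illustration; they are not taken from any particular system.

```python
import csv
import io

# Hypothetical raw CSV: no header row, an inlined comment, and a short line.
RAW = """\
Apple 6s,Phone,649.00
# prices checked by hand
Dell 4 core,Computer
HP 2 core,Computer,399.99
"""

def load_raw_csv(text, expected_fields):
    """Split parsed rows into clean rows and rows that need curation."""
    clean, suspect = [], []
    for row_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        if row[0].lstrip().startswith("#"):
            # Assumed convention: a leading '#' marks inlined comment text.
            suspect.append((row_no, "inline comment", row))
        elif len(row) != expected_fields:
            suspect.append((row_no, "wrong field count", row))
        else:
            clean.append(row)
    return clean, suspect

clean, suspect = load_raw_csv(RAW, expected_fields=3)
for row_no, reason, row in suspect:
    print(f"row {row_no}: {reason}: {row}")
```

Even this small check needs the expected field count supplied by hand, which is the upfront curation cost the slide complains about.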
Merge Two Datasets
UNION two data sources
- Schema matching
- Deduplication
- Format alignment (GIS coordinates, $ vs €)
- Precision alignment (State vs County)
State of the art: Manually map schema
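As a rough illustration of what "manually map schema" involves, the sketch below unions two hypothetical sources after a hand-written column mapping and a currency conversion. The column names, the `EUR_TO_USD` rate, and the exact-name deduplication are assumptions chosen only to keep the example short.

```python
# Hypothetical sources: different column names and different currencies.
STORE_A = [{"name": "Apple 6s", "price_usd": 649.00}]
STORE_B = [
    {"product": "Apple 6s", "price_eur": 599.00},
    {"product": "HP 2 core", "price_eur": 100.00},
]

# Hand-written schema mapping: source B column -> unified column.
COLUMN_MAP_B = {"product": "name", "price_eur": "price_usd"}
EUR_TO_USD = 1.10  # assumed exchange rate, for illustration only

def align_b(row):
    """Rename source B columns and convert euro prices to dollars."""
    out = {}
    for column, value in row.items():
        if column == "price_eur":
            value = round(value * EUR_TO_USD, 2)  # format alignment
        out[COLUMN_MAP_B.get(column, column)] = value
    return out

def union_dedup(a_rows, b_rows):
    """UNION the two sources, keeping the first row seen for each name."""
    merged, seen = [], set()
    for row in list(a_rows) + [align_b(r) for r in b_rows]:
        if row["name"] not in seen:  # naive dedup: exact name match only
            seen.add(row["name"])
            merged.append(row)
    return merged

merged = union_dedup(STORE_A, STORE_B)
```

Every decision here (which columns match, which rate to use, which duplicate wins) is a guess that a person had to hard-code.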
JSON Shredding
Run a SELECT on JSON or a Doc Store
- Separating fields and record sets (e.g., { A: "Bob", B: "Alice" })
- Missing fields (Records with no 'address')
- Type alignment (Records with 'address' as an array)
- Schema matching$^2$
State of the art: DataGuide, Wrangler, etc...
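The missing-field and type-alignment cases above can be sketched in a few lines of Python. The records and the `shred` helper are hypothetical; the choice to emit one row per address and a `None` for a missing address is one possible policy, not a prescribed one.

```python
import json

# Hypothetical records: 'address' is a string, an array, or missing entirely.
RECORDS = [json.loads(s) for s in (
    '{"name": "Bob", "address": "12 Main St"}',
    '{"name": "Alice", "address": ["1 Oak Ave", "9 Elm Rd"]}',
    '{"name": "Carol"}',
)]

def shred(record):
    """Flatten one JSON object into relational rows, one per address."""
    addresses = record.get("address")
    if addresses is None:
        addresses = [None]        # missing field: keep the record, null address
    elif not isinstance(addresses, list):
        addresses = [addresses]   # type alignment: treat a scalar as a 1-element array
    return [{"name": record.get("name"), "address": a} for a in addresses]

rows = [row for record in RECORDS for row in shred(record)]
for row in rows:
    print(row)
```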
Structure is hard!
- Structured models (RelDBs) force curation during loading.
- Problem: All curation costs are upfront.
- Unstructured models (NoSQL) force curation into queries.
- Problem: Complexity/redundancy blowup in queries.
Add structure and curation effort on demand
But... you still need some sort of structure?!?
Let the database make a guess!
In the name of Codd,
thou shalt not give the user a wrong answer.
... but what if we did?
What would it take for that to be ok?
Lenses
Here's a problem with my data. Fix it.
- What type is this column? (majority vote)
- How do the columns of these relations line up? (pick your favorite schema matching paper)
- How do I query heterogeneous JSON objects? (see above)
- What should these missing values be? (learning-based interpolation)
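The first question above (column type by majority vote) is simple enough to sketch. The `guess_type` and `type_lens` names are hypothetical, and the three candidate types are an assumption; a real lens would presumably consider more types and handle ties and empty columns more carefully.

```python
from collections import Counter

def guess_type(value):
    """Return the narrowest of int, float, str that can parse the string."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "str"

def type_lens(column):
    """Guess a column's type by majority vote over its (non-empty) values.

    Returns the winning type and the fraction of values that voted for it.
    """
    votes = Counter(guess_type(v) for v in column)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(column)

print(type_lens(["649", "399", "n/a", "120"]))
```

The second element of the result is the fraction of values that agreed with the guess, which is the kind of information a user would need when the guess turns out to be wrong.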
Lenses introduce uncertainty
The User's View
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Simple UI: Highlight values that are based on guesses.
SELECT NAME, DEPARTMENT FROM PRODUCTS;
Name | Department |
Apple 6s, White | Phone |
Dell, Intel 4 core | Computer |
HP, AMD 2 core | Computer |
... | ... |
Allow users to EXPLAIN uncertain outputs
Explanations include reasons given in English
Explanations
- Mark uncertain data and results.
- Upon request, provide more detail:
  - Why is my data uncertain? (provenance)
  - How bad is it? (confidence, entropy, bounds)
  - What are other possible answers? (samples)
  - What can I do to fix it? (repairs)
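For the "How bad is it?" and "other possible answers" questions, one simple reading is to summarize a set of candidate guesses for a single uncertain value. The sketch below assumes the candidates are equally weighted samples and uses Shannon entropy in bits; the `explain` function and its output fields are hypothetical, not the system's actual interface.

```python
import math
from collections import Counter

def explain(candidates):
    """Summarize equally weighted candidate guesses for one uncertain value."""
    counts = Counter(candidates)
    total = len(candidates)
    best, best_count = counts.most_common(1)[0]
    probabilities = [c / total for c in counts.values()]
    # Shannon entropy in bits: 0 when every candidate agrees.
    entropy = -sum(p * math.log2(p) for p in probabilities)
    return {
        "best_guess": best,
        "confidence": best_count / total,
        "entropy_bits": entropy,
        "alternatives": sorted(set(candidates) - {best}),
    }

# Hypothetical guesses for one product's DEPARTMENT value.
print(explain(["Phone", "Phone", "Computer", "Phone"]))
```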