Alice wants to analyze two unaligned time series.
| Time | Reading |
|---|---|
| 1575731001 | 0 |
| 1575731014 | 0 |
| 1575731030 | 0 |
| 1575731035 | 0 |
| ... | |
| 1575731219 | 1 |
| 1575731229 | 1 |
| 1575731240 | 1 |

| Time | Reading |
|---|---|
| 1575731011 | 0 |
| 1575731020 | 0 |
| 1575731031 | 0 |
| 1575731039 | 0 |
| ... | |
| 1575731218 | 1 |
| 1575731228 | 1 |
| 1575731237 | 1 |
Step 1: Line up the readings
Lots of active research efforts!
... but Alice is trying to GSD!
INSERT INTO series_one_buckets
SELECT CAST(time / 10 AS int) AS bucket,
FIRST(reading)
FROM series_one
GROUP BY bucket;
Interpolate missing values
Hand-tune around the switchover as needed
Time taken: < 30 minutes
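Alice's two steps (bucket, then interpolate) can be sketched in a few lines of Python. This is only an illustration: the helper names `bucket_first` and `fill_gaps` are hypothetical, and linear interpolation is an assumption, since the slide only says "interpolate". Floor division by 10 matches the `CAST(time / 10 AS int)` truncation for positive timestamps.

```python
def bucket_first(series, width=10):
    """Map each bucket (time // width) to the first reading that falls in it."""
    buckets = {}
    for time, reading in sorted(series):
        # setdefault keeps the earliest reading per bucket, like FIRST(reading)
        buckets.setdefault(time // width, reading)
    return buckets


def fill_gaps(buckets):
    """Linearly interpolate a reading for every empty bucket between two known ones."""
    keys = sorted(buckets)
    filled = dict(buckets)
    for lo, hi in zip(keys, keys[1:]):
        for b in range(lo + 1, hi):
            frac = (b - lo) / (hi - lo)
            filled[b] = buckets[lo] + frac * (buckets[hi] - buckets[lo])
    return filled


# First few rows of series one from the slide
series_one = [(1575731001, 0), (1575731014, 0), (1575731030, 0), (1575731035, 0)]
series_one_buckets = fill_gaps(bucket_first(series_one))
```

Every choice here (bucket width, first-reading-wins, linear fill) is one of the assumptions that Bob would need to know about before re-using the workflow.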
Similar analysis...
... different data
Can Bob re-use Alice's prep+analytics workflow?
Bob needs to know Alice's assumptions
(and how to use the workflow)
... and even then, some manual effort is needed!
Carol gets a dataset from Dave
Dave adds new data to the dataset!
Can Carol re-use her workflow?
Carol needs to remember her assumptions about the data and trust that the new data is like the old data
Eve needs to load a CSV file
→
I'm sorry, I can't do that, Eve.
You have a non-numerical value at position 1252538:24.
Load Successful!
(btw, 175326 records didn't load)
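A third option sits between refusing to load and silently dropping records: keep every record, and attach a note to each cell that could not be parsed. The sketch below uses only Python's standard `csv` module; the function name `load_with_caveats`, the sample data, and the caveat tuple layout are hypothetical, not Vizier's loader.

```python
import csv
import io


def load_with_caveats(text, numeric_columns):
    """Parse CSV text and keep every record.

    Cells in numeric_columns that do not parse become None, and a
    (line number, column, message) caveat is recorded for each of them.
    """
    rows, caveats = [], []
    reader = csv.DictReader(io.StringIO(text))
    for record in reader:
        for col in numeric_columns:
            raw = record[col]
            try:
                record[col] = float(raw)
            except (TypeError, ValueError):
                record[col] = None
                caveats.append((reader.line_num, col, f"non-numerical value {raw!r}"))
        rows.append(record)
    return rows, caveats


# Hypothetical sample: the second data row has a bad reading
sample = "time,reading\n1575731001,0\n1575731014,n/a\n1575731030,1\n"
rows, caveats = load_with_caveats(sample, ["time", "reading"])
```

The load succeeds with all three records, and the caveat list tells Eve exactly which cell to be careful with.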
Heuristics only work most of the time.
Data science is nuanced.
Assumptions can't be avoided!
It's easy to miss an assumption when re-using work.
... this is what Bob saw:
... this is what Carol saw:
⚠ The data included an unexpected value: 'Non-Hispanic White'. The most similar known value is 'White Non-Hispanic'.
Annotate data with warnings.
If you use this value/record,
here's what you need to know!
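A warning like the one above could be produced by a check along these lines. This is a minimal sketch, assuming a token-overlap (Jaccard) similarity; the list of known values other than 'White Non-Hispanic' is made up for illustration, and `check_value` is a hypothetical helper.

```python
import re

# Hypothetical domain of known values; only the first appears in the slides
KNOWN = ["White Non-Hispanic", "Black Non-Hispanic", "Hispanic", "Asian and Pacific Islander"]


def tokens(value):
    """Lowercased set of alphabetic words, ignoring order and punctuation."""
    return set(re.findall(r"[a-z]+", value.lower()))


def similarity(a, b):
    """Jaccard similarity between the word sets of two values."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def check_value(value, known):
    """Return None for a known value, otherwise a warning naming the closest match."""
    if value in known:
        return None
    best = max(known, key=lambda k: similarity(value, k))
    return (f"The data included an unexpected value: '{value}'. "
            f"The most similar known value is '{best}'.")


warning = check_value("Non-Hispanic White", KNOWN)
```

The value is kept rather than rewritten; the warning travels with it so that whoever uses the record can decide what to do.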
... this is what Eve saw:
A brief digression...
One database $D$
Each query gets one answer $R \leftarrow Q(D)$
Multiple possible databases $D \in \mathcal D$
(possible worlds)
Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$
Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$
Uncertain tuples exist in at least one,
but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$
(not limited to set semantics)
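Under set semantics, the two definitions translate directly into set operations. The sketch below uses two made-up possible worlds and a made-up query `query` (select rows whose reading is 1); only the `certain` and `uncertain` formulas come from the slides.

```python
def query(database):
    """Hypothetical Q: keep the rows whose reading equals 1."""
    return {row for row in database if row[1] == 1}


# Two possible worlds that disagree on the reading of row "b"
possible_worlds = [
    {("a", 1), ("b", 1)},
    {("a", 1), ("b", 0)},
]

# One answer per possible world
answers = [query(d) for d in possible_worlds]

# Certain tuples appear in every answer
certain = set.intersection(*answers)

# Uncertain tuples appear in at least one answer, but not in all of them
uncertain = set.union(*answers) - certain
```

Here `("a", 1)` is certain, while `("b", 1)` is uncertain because it is an answer in only one of the two worlds.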
A caveat is an assumption tied to one or more data elements (cells or rows).
If the assumption is wrong, so is the element.
An element has a caveat → The element is uncertain.
... and btw, here's why.
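One way to picture a caveat is as a note that rides along with a value and survives any computation that uses it. The `Caveated` class below is a hypothetical illustration of that idea, not Vizier's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caveated:
    """A value plus the assumptions (caveats) it depends on."""
    value: float
    caveats: tuple = ()

    @property
    def uncertain(self):
        # An element with at least one caveat is treated as uncertain
        return bool(self.caveats)

    def __add__(self, other):
        # A derived value inherits the caveats of everything it was computed from
        return Caveated(self.value + other.value, self.caveats + other.caveats)


measured = Caveated(0.0)
guessed = Caveated(1.0, ("interpolated across the switchover",))
total = measured + guessed
```

`total.uncertain` is true, and `total.caveats` carries the "here's why": the reason the result might be wrong if the assumption does not hold.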
$> pip3 install --user vizier-webapi
$> vizier
Or get an account from me and try it out at https://demo.vizierdb.info.
| Students |
|---|
| Poonam |
| Will |
| Aaron |
| Dev |
| Mike |
| Alumni |
|---|
| Ying |
| Niccolò |
| Arindam |
| Shivang |
| Olivia |
| Lisa |
| Gourab |
| External Collaborators |
|---|
| Zhen Hua Liu (Oracle) |
| Ying Lu (Oracle) |
| Beda Hammerschmidt (Oracle) |
| Boris Glavic (IIT) |
| Su Feng (IIT) |
| Juliana Freire (NYU) |
| Heiko Mueller (NYU) |
| Sonia Castelo Quispe (NYU) |
| Carlos Bautista (NYU) |
| Remi Rampin (NYU) |
Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460, and by gifts from Oracle.