VizierDB

A Notebook with Caveats

Oliver Kennedy okennedy@buffalo.edu

Story Time!

Act 1

Alice wants to analyze two unaligned time series.

Time	Reading
1575731001	0
1575731014	0
1575731030	0
1575731035	0
...
1575731219	1
1575731229	1
1575731240	1

Time	Reading
1575731011	0
1575731020	0
1575731031	0
1575731039	0
...
1575731218	1
1575731228	1
1575731237	1

Step 1: Line up the readings

Option 1: Do it right

Lots of active research efforts!

... but Alice is trying to GSD!

Alice's Observations

Readings every ~10s
Readings are binary
Readings are incredibly stable


            -- Group readings into 10-second buckets; keep one reading per bucket.
            INSERT INTO series_one_buckets
              SELECT CAST(time / 10 AS int) AS bucket,
                     FIRST(reading) AS reading
              FROM   series_one
              GROUP BY bucket;

Interpolate missing values

Hand tune around the switchover as-needed
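
The bucketing and interpolation steps above can be sketched in plain Python. This is a minimal illustration, not Alice's actual script: the function name `bucket_and_fill` is hypothetical, and forward fill (carrying the previous reading into an empty bucket) is an assumed interpolation strategy that only makes sense because the readings are binary and stable.

```python
def bucket_and_fill(rows, width=10):
    """rows: (time, reading) pairs sorted by time. Returns {bucket: reading}."""
    buckets = {}
    for time, reading in rows:
        # Same idea as FIRST(reading) ... GROUP BY bucket: keep the first reading seen.
        buckets.setdefault(time // width, reading)
    filled = {}
    last = None
    for b in range(min(buckets), max(buckets) + 1):
        # ASSUMPTION: an empty bucket inherits the most recent reading (forward fill).
        last = buckets.get(b, last)
        filled[b] = last
    return filled


series_one = [(1575731001, 0), (1575731014, 0), (1575731030, 0), (1575731035, 0)]
aligned = bucket_and_fill(series_one)
```

Bucket 157573102 has no readings in this sample, so it is filled from the bucket before it. That silent guess is exactly the kind of assumption the rest of the talk is about.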

Time taken: < 30 minutes

FreeSVG.org

Enter Bob...

Similar analysis...

... different data

Can Bob re-use Alice's prep+analytics workflow?

Maybe?

Are readings still every ~10s?
Is the data still binary?
Is the data still (relatively) stable?

... and even then, some manual effort is needed!

Bob needs to know Alice's assumptions
(and how to use the workflow).
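
Bob's three questions could be turned into explicit checks. The sketch below is only an illustration: `check_assumptions` and its thresholds (`expected_gap`, `tolerance`, `max_flips`) are hypothetical names and values, and what counts as "stable" would need to be chosen per dataset.

```python
from statistics import median


def check_assumptions(rows, expected_gap=10, tolerance=5, max_flips=1):
    """Return a list of warnings for (time, reading) rows sorted by time."""
    times = [t for t, _ in rows]
    readings = [r for _, r in rows]
    warnings = []

    # Are readings still every ~10s?
    gaps = [b - a for a, b in zip(times, times[1:])]
    if gaps and abs(median(gaps) - expected_gap) > tolerance:
        warnings.append(f"median gap is {median(gaps)}s, expected ~{expected_gap}s")

    # Is the data still binary?
    if not set(readings) <= {0, 1}:
        warnings.append("readings are not binary")

    # Is the data still (relatively) stable? Count changes between neighbors.
    flips = sum(1 for a, b in zip(readings, readings[1:]) if a != b)
    if flips > max_flips:
        warnings.append(f"readings changed {flips} times, expected at most {max_flips}")

    return warnings


alice = [(1575731001, 0), (1575731014, 0), (1575731030, 0), (1575731035, 0)]
bob = [(0, 0.2), (60, 0.9), (120, 0.1), (180, 0.8)]  # made-up data for illustration
```

Checks like these only cover the assumptions someone remembered to write down, which is why Bob still needs to hear them from Alice.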

Act 2

Carol gets a dataset from Dave

Dave adds new data to the dataset!

Can Carol re-use her workflow?

Maybe?

Did the data dictionary change?
Did new errors get introduced?

Carol needs to remember her assumptions about the data and trust that the new data is like the old data

Act 3

Eve needs to load a CSV file

Scenario 1

I'm sorry, I can't do that, Eve.

You have a non-numerical value at position 1252538:24.

Scenario 2

Load Successful!

(btw, 175326 records didn't load)

Heuristics only work most of the time.

Data science is nuanced.

Assumptions can't be avoided!

It's easy to miss an assumption when re-using work.

https://xkcd.com/2239/

Wouldn't it be nice if...

... this is what Bob saw:

Wouldn't it be nice if...

... this is what Carol saw:

⚠	The data included an unexpected value: 'Non-Hispanic White'. The most similar known value is 'White Non-Hispanic'.
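
One way a warning like this could be generated is to compare each incoming value against the values seen so far. The sketch below is an assumption about how such a check might look, not Vizier's implementation: `check_value` and `canonical` are hypothetical helpers, every entry in `known` other than 'White Non-Hispanic' is made up, and the similarity measure (word-order-insensitive string matching via the standard library's `difflib`) is just one reasonable choice.

```python
from difflib import SequenceMatcher


def canonical(value):
    # Lower-case and sort the words so that word order does not matter.
    return " ".join(sorted(value.lower().split()))


def check_value(value, known_values):
    """Return None if value is known, else a warning naming the closest known value."""
    if value in known_values:
        return None
    closest = max(
        known_values,
        key=lambda k: SequenceMatcher(None, canonical(value), canonical(k)).ratio(),
    )
    return (f"The data included an unexpected value: '{value}'. "
            f"The most similar known value is '{closest}'.")


# Hypothetical data dictionary for illustration.
known = ["White Non-Hispanic", "Black Non-Hispanic", "Hispanic", "Asian"]
warning = check_value("Non-Hispanic White", known)
```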

Annotate data with warnings.

If you use this value/record,
here's what you need to know!

Caveat Physicus

Why?

Propagation

Caveats...
... can go where the data goes: Derived values retain the caveats on their source data.
... stop where the data stops: Irrelevant caveats don't get propagated.
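
Both propagation rules can be illustrated with a toy value type that carries its caveats along. This is a simplified sketch, not Vizier's API: `Cell` and `add` are hypothetical, the rows are made up, and the sketch ignores the harder case where the filter condition itself depends on a caveated value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: float
    caveats: frozenset = frozenset()


def add(a, b):
    # A derived value depends on both inputs, so it keeps both sets of caveats.
    return Cell(a.value + b.value, a.caveats | b.caveats)


note = "interpolated: bucket had no readings"
rows = [
    {"id": 1, "reading": Cell(0.0)},
    {"id": 2, "reading": Cell(1.0, frozenset({note}))},
    {"id": 3, "reading": Cell(1.0)},
]

# Caveats go where the data goes: the sum over all rows inherits row 2's caveat.
total_all = add(add(rows[0]["reading"], rows[1]["reading"]), rows[2]["reading"])

# Caveats stop where the data stops: row 2 is filtered out, so its caveat is not propagated.
total_clean = add(rows[0]["reading"], rows[2]["reading"])
```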

Wouldn't it be nice if...

... this is what Eve saw:

What is a Caveat?

A brief digression...

Classical Databases

One database $D$

Each query gets one answer $R \leftarrow Q(D)$

Incomplete Databases

Multiple possible databases $D \in \mathcal D$

(possible worlds)

Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$

Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$

Uncertain tuples exist in at least one,
but not all, possible worlds. $$uncertain(\mathcal R) = \left(\bigcup_{R \in \mathcal R} R\right) - certain(\mathcal R)$$

(not limited to set semantics)
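
The two definitions above can be checked on a tiny example. The sketch below assumes set semantics only, and the two possible worlds and the query are made up for illustration.

```python
def possible_answers(query, worlds):
    # Evaluate the query once per possible world.
    return [query(d) for d in worlds]


def certain(answers):
    # Tuples that appear in the answer for every possible world.
    return set.intersection(*answers)


def uncertain(answers):
    # Tuples that appear in at least one answer, but not in all of them.
    return set.union(*answers) - certain(answers)


# Two hypothetical possible worlds that disagree about the reading for "b".
worlds = [
    {("a", 0), ("b", 1)},
    {("a", 0), ("b", 0)},
]


def query(d):
    # Which keys have a reading of 0?
    return {key for key, reading in d if reading == 0}


answers = possible_answers(query, worlds)
```

Here "a" is certain because it is in the answer in both worlds, while "b" is uncertain because it is in the answer in only one of them.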

A caveat is an assumption tied to one or more data elements (cells or rows).

If the assumption is wrong, so is the element.

Alice / Bob

FIRST may not pick the right value for a bucket with 2+ distinct values.
Interpolation may not pick the right value for a bucket with 0 values.
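
Alice's two caveats could be attached at the point where the bucketing makes each guess. This is a minimal sketch under assumptions: `bucket_with_caveats` is a hypothetical helper, caveats are plain strings, and empty buckets are filled by carrying the previous reading forward.

```python
def bucket_with_caveats(rows, width=10):
    """rows: (time, reading) pairs sorted by time. Returns {bucket: (reading, caveats)}."""
    groups = {}
    for time, reading in rows:
        groups.setdefault(time // width, []).append(reading)

    result = {}
    last = None
    for b in range(min(groups), max(groups) + 1):
        values = groups.get(b, [])
        caveats = []
        if not values:
            # Interpolation guessed a value for a bucket with 0 readings.
            caveats.append("no readings in bucket; value carried forward")
        else:
            if len(set(values)) > 1:
                # FIRST picked one of 2+ distinct values.
                caveats.append(
                    f"FIRST chose {values[0]} from distinct values {sorted(set(values))}"
                )
            last = values[0]
        result[b] = (last, caveats)
    return result


rows = [(1, 0), (4, 1), (25, 1)]  # made-up readings for illustration
out = bucket_with_caveats(rows)
```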

Carol / Dave

The model hyperparameters may not work if the data changes too significantly.
New values could indicate new data errors that Carol's ingest script hasn't accounted for.

Eve / Hal

Replacing a parse error with a NULL might not be what Eve expects.
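
A loader in this spirit would neither refuse the file nor drop records silently: it would substitute NULL and record why. The sketch below assumes a CSV with a header row and all-numeric columns; `load_numeric_csv` and the sample text are hypothetical, and Python's `None` stands in for NULL.

```python
import csv
import io


def load_numeric_csv(text):
    """Parse CSV text into rows of floats; unparseable cells become None plus a caveat."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows, caveats = [], []
    # Line 1 is the header, so data records start at line 2.
    for line_no, record in enumerate(reader, start=2):
        parsed = []
        for col, raw in zip(header, record):
            try:
                parsed.append(float(raw))
            except ValueError:
                parsed.append(None)
                caveats.append(
                    (line_no, col, f"could not parse {raw!r} as a number; replaced with NULL")
                )
        rows.append(parsed)
    return header, rows, caveats


text = "time,reading\n1575731001,0\n1575731014,n/a\n"
header, rows, caveats = load_numeric_csv(text)
```

Every record loads, and each substituted NULL carries a note saying where it came from, so Eve can decide later whether the substitution matters for her analysis.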

An element has a caveat → The element is uncertain.

... and btw, here's why.

Demo

https://vizierdb.info


          $> pip3 install --user vizier-webapi
          $> vizier

Or get an account from me and try it out at https://demo.vizierdb.info.

Students
Poonam (PhD-4Y)	Will (PhD-3Y)	Aaron (PhD-4Y)

Dev
Mike (Sr. Rsrch. Dev.)

Alumni
Ying (PhD 2017)	Niccolò (PhD 2016)	Arindam (MS 2016)	Shivang (MS 2018)	Olivia (BS 2017)	Lisa (BS 2018)	Gourab (MS 2018)

External Collaborators
Zhen Hua Liu (Oracle)	Ying Lu (Oracle)	Beda Hammerschmidt (Oracle)	Boris Glavic (IIT)	Su Feng (IIT)

Juliana Freire
(NYU)

Heiko Mueller
(NYU)

Sonia Castelo Quispe
(NYU)

Carlos Bautista
(NYU)

Remi Rampin
(NYU)

Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460, and by gifts from Oracle.