VizierDB


A Notebook with Caveats


Oliver Kennedy okennedy@buffalo.edu

Story Time!

Act 1

Alice wants to analyze two unaligned time series.

Time        Reading
1575731001  0
1575731014  0
1575731030  0
1575731035  0
...
1575731219  1
1575731229  1
1575731240  1

Time        Reading
1575731011  0
1575731020  0
1575731031  0
1575731039  0
...
1575731218  1
1575731228  1
1575731237  1

Step 1: Line up the readings

Option 1: Do it right

Lots of active research efforts!
... but Alice is trying to GSD!

Alice's Observations

  • Readings every ~10s
  • Readings are binary
  • Readings are incredibly stable

            -- Keep one reading per 10-second bucket
            INSERT INTO series_one_buckets
              SELECT CAST(time / 10 AS int) AS bucket,
                     FIRST(reading) AS reading
              FROM   series_one
              GROUP BY bucket;
          

Interpolate missing values

Hand-tune around the switchover as needed
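The bucketing and gap-filling steps above might look roughly like this in Python. This is a minimal sketch, not Alice's actual code: the function names are hypothetical, and carrying the last reading forward is an assumed interpolation strategy that only makes sense because the readings are binary and stable.

```python
from typing import Dict, List, Optional, Tuple

# Assumption taken from Alice's observations: readings arrive every ~10s.
BUCKET_SECONDS = 10


def bucketize(series: List[Tuple[int, int]]) -> Dict[int, int]:
    """Keep the first reading seen in each 10-second bucket."""
    buckets: Dict[int, int] = {}
    for time, reading in sorted(series):
        # setdefault leaves an existing entry alone, so the earliest reading wins.
        buckets.setdefault(time // BUCKET_SECONDS, reading)
    return buckets


def fill_gaps(buckets: Dict[int, int]) -> Dict[int, Optional[int]]:
    """Fill empty buckets by carrying the last seen reading forward."""
    filled: Dict[int, Optional[int]] = {}
    last: Optional[int] = None
    for bucket in range(min(buckets), max(buckets) + 1):
        if bucket in buckets:
            last = buckets[bucket]
        filled[bucket] = last
    return filled


# The first four rows of the first series shown earlier.
series_one = [(1575731001, 0), (1575731014, 0), (1575731030, 0), (1575731035, 0)]
aligned = fill_gaps(bucketize(series_one))
```

Every assumption here (bucket width, first reading wins, carry forward) is baked into the code without any record that it was made.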

Time taken: < 30 minutes

FreeSVG.org

Enter Bob...

Similar analysis...

... different data

Can Bob re-use Alice's prep+analytics workflow?

Maybe?

  • Are readings still every ~10s?
  • Is the data still binary?
  • Is the data still (relatively) stable?

... and even then, some manual effort is needed!

Bob needs to know Alice's assumptions
(and how to use the workflow).
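If Alice's assumptions were written down, Bob could at least test them against his data before re-using the workflow. The sketch below is one hypothetical way to do that; the function name and the thresholds (`gap_tolerance`, `max_switches`) are illustrative assumptions, not values from the talk.

```python
from statistics import median
from typing import List, Tuple


def check_assumptions(series: List[Tuple[int, int]],
                      expected_gap: float = 10.0,
                      gap_tolerance: float = 5.0,
                      max_switches: int = 1) -> List[str]:
    """Return one warning per assumption that the series appears to violate."""
    warnings: List[str] = []
    ordered = sorted(series)
    times = [t for t, _ in ordered]
    readings = [r for _, r in ordered]

    # Are readings still every ~10s?
    gaps = [b - a for a, b in zip(times, times[1:])]
    if gaps and abs(median(gaps) - expected_gap) > gap_tolerance:
        warnings.append(f"median gap is {median(gaps)}s, expected ~{expected_gap}s")

    # Is the data still binary?
    non_binary = sorted(set(readings) - {0, 1})
    if non_binary:
        warnings.append(f"non-binary readings: {non_binary}")

    # Is the data still (relatively) stable?
    switches = sum(1 for a, b in zip(readings, readings[1:]) if a != b)
    if switches > max_switches:
        warnings.append(f"{switches} switches, expected at most {max_switches}")

    return warnings


alice = [(1575731001, 0), (1575731014, 0), (1575731030, 0), (1575731035, 0)]
bob = [(0, 0), (60, 2), (120, 0), (180, 1), (240, 0)]  # made-up data
alice_warnings = check_assumptions(alice)
bob_warnings = check_assumptions(bob)
```

Checks like these only help if someone remembers to write and run them.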

Act 2

Carol gets a dataset from Dave




Dave adds new data to the dataset!

Can Carol re-use her workflow?

Maybe?

  • Did the data dictionary change?
  • Did new errors get introduced?

Carol needs to remember her assumptions about the data and trust that the new data is like the old data

Act 3

Eve needs to load a CSV file


Scenario 1

I'm sorry, I can't do that, Eve.

You have a non-numerical value at position 1252538:24.


Scenario 2

Load Successful!

(btw, 175326 records didn't load)

Heuristics only work most of the time.

Data science is nuanced.

Assumptions can't be avoided!

It's easy to miss an assumption when re-using work.

https://xkcd.com/2239/

Wouldn't it be nice if...


... this is what Bob saw:

Wouldn't it be nice if...

... this is what Carol saw:

The data included an unexpected value: 'Non-Hispanic White'
The most similar known value is 'White Non-Hispanic'
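A message like the one above could be produced by comparing each incoming value against the set of values seen before. The sketch below is an assumption about how such a check might work, not a description of Vizier's implementation: it uses a simple token-set (Jaccard) similarity, and every known value other than 'White Non-Hispanic' is invented for illustration.

```python
import re
from typing import List, Optional, Set


def tokens(value: str) -> Set[str]:
    """Lower-cased word tokens, ignoring word order, spaces and hyphens."""
    return set(re.findall(r"[a-z0-9]+", value.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def check_value(value: str, known: List[str]) -> Optional[str]:
    """Return a warning for a value outside the known set, else None."""
    if value in known:
        return None
    message = f"The data included an unexpected value: '{value}'"
    best = max(known, key=lambda k: similarity(value, k), default=None)
    if best is not None and similarity(value, best) > 0:
        message += f"\nThe most similar known value is '{best}'"
    return message


# Hypothetical data dictionary; only the first entry appears in the talk.
known_values = ["White Non-Hispanic", "Black Non-Hispanic", "Hispanic"]
warning = check_value("Non-Hispanic White", known_values)
```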

Annotate data with warnings.

If you use this value/record,
here's what you need to know!
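For Eve's CSV file, annotating instead of failing or silently dropping records might look like the following. This is a minimal sketch under stated assumptions: the `Caveat` class and `load_numeric_csv` function are hypothetical names, every column is assumed to be numeric, and no quoted field is assumed to span multiple lines.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Caveat:
    row: int       # line number in the file (the header is line 1)
    column: str
    message: str


def load_numeric_csv(text: str) -> Tuple[List[Dict[str, Optional[float]]], List[Caveat]]:
    """Load every record; cells that fail to parse become None plus a caveat."""
    records: List[Dict[str, Optional[float]]] = []
    caveats: List[Caveat] = []
    reader = csv.DictReader(io.StringIO(text))
    # start=2 because line 1 is the header (assumes one record per line).
    for line_no, raw in enumerate(reader, start=2):
        record: Dict[str, Optional[float]] = {}
        for column, cell in raw.items():
            try:
                record[column] = float(cell)
            except (TypeError, ValueError):
                # Keep the record, but flag the cell rather than guess a value.
                record[column] = None
                caveats.append(
                    Caveat(line_no, column, f"could not parse {cell!r} as a number"))
        records.append(record)
    return records, caveats


records, caveats = load_numeric_csv("a,b\n1,2\n3,n/a\n")
```

The load neither aborts on the first bad cell (Scenario 1) nor hides how much went wrong (Scenario 2): every record is loaded, and each questionable cell carries an explanation.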

Caveat Physicus

Why?

Propagation

Caveats...

  • ... can go where the data goes: derived values retain the caveats on their source data.
  • ... stop where the data stops: irrelevant caveats don't get propagated.
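The two propagation rules can be illustrated with a toy cell type that carries its caveats along. This is a sketch of the idea only, not Vizier's mechanism; the `Cell` class, the column names, and the caveat texts are all made up for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Cell:
    value: float
    caveats: FrozenSet[str] = frozenset()

    def __add__(self, other: "Cell") -> "Cell":
        # A derived value depends on both inputs, so it inherits both caveat sets.
        return Cell(self.value + other.value, self.caveats | other.caveats)


row = {
    "price": Cell(10.0),
    "tax": Cell(0.8, frozenset({"tax rate was guessed"})),
    "qty": Cell(3.0, frozenset({"qty was interpolated"})),
}

# Caveats go where the data goes: total depends on tax, so it keeps tax's caveat.
total = row["price"] + row["tax"]

# Caveats stop where the data stops: qty was never used, so its caveat is absent.
projected = {"price": row["price"], "total": total}
```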

Wouldn't it be nice if...

... this is what Eve saw:

What is a Caveat?

A brief digression...

Classical Databases

One database $D$

Each query gets one answer $R \leftarrow Q(D)$

Incomplete Databases

Multiple possible databases $D \in \mathcal D$

(possible worlds)

Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$

Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$

Uncertain tuples exist in at least one,
but not all, possible worlds. $$uncertain(\mathcal R) = \left(\bigcup_{R \in \mathcal R} R\right) - certain(\mathcal R)$$

(not limited to set semantics)
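As a small worked example of the definitions above, the sketch below evaluates a query in two possible worlds and splits the answers into certain and uncertain tuples. It uses set semantics for simplicity, and the worlds and the query are invented for illustration.

```python
from typing import List, Set, Tuple

World = Set[Tuple[str, int]]


def query(world: World) -> Set[str]:
    """Q(D): names of the people whose reading is 1."""
    return {name for name, reading in world if reading == 1}


def certain(answers: List[Set[str]]) -> Set[str]:
    """Tuples that appear in the answer for every possible world."""
    return set.intersection(*answers)


def uncertain(answers: List[Set[str]]) -> Set[str]:
    """Tuples that appear in some, but not all, possible worlds."""
    return set.union(*answers) - certain(answers)


# Two possible worlds that disagree about Bob's reading.
worlds: List[World] = [
    {("Alice", 1), ("Bob", 0)},
    {("Alice", 1), ("Bob", 1)},
]
answers = [query(d) for d in worlds]

certain_answers = certain(answers)
uncertain_answers = uncertain(answers)
```

Here Alice is a certain answer because she is returned in both worlds, while Bob is uncertain because he is returned in only one.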

A caveat is an assumption tied to one or more data elements (cells or rows).

If the assumption is wrong, so is the element.

An element has a caveat → The element is uncertain.

... and btw, here's why.

Demo

Students

Poonam
(PhD-4Y)

Will
(PhD-3Y)

Aaron
(PhD-4Y)

Dev

Mike
(Sr. Rsrch. Dev.)

Alumni

Ying
(PhD 2017)

Niccolò
(PhD 2016)

Arindam
(MS 2016)

Shivang
(MS 2018)

Olivia
(BS 2017)

Gourab
(MS 2018)

External Collaborators
Zhen Hua Liu
(Oracle)
Ying Lu
(Oracle)
Beda Hammerschmidt
(Oracle)
Boris Glavic
(IIT)
Su Feng
(IIT)
Juliana Freire
(NYU)
Munaf Arshad Qazi
(NYU)
Heiko Mueller
(NYU)
Sonia Castelo Quispe
(NYU)
Carlos Bautista
(NYU)
Remi Rampin
(NYU)

Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460, and by gifts from Oracle.

https://vizierdb.info


          $> pip3 install --user vizier-webapi
          $> vizier
        

Or get an account from me and try it out at https://demo.vizierdb.info