VizierDB

Safe, Reusable Heuristic Data Transformation

(through Caveats)

Oliver Kennedy okennedy@buffalo.edu

Story Time!

Act 1

Alice wants to analyze two unaligned time series.

Time	Reading
1575731001	0
1575731014	0
1575731030	0
1575731035	0
...
1575731219	1
1575731229	1
1575731240	1

Time	Reading
1575731011	0
1575731020	0
1575731031	0
1575731039	0
...
1575731218	1
1575731228	1
1575731237	1

Step 1: Line up the readings

Option 1: Do it right

Lots of active research efforts!

... but Alice is trying to to GSD!

Alice's Observations

Readings every ~10s
Readings are binary
Readings are incredibly stable


            INSERT INTO series_one_buckets
              SELECT CAST(time / 10 AS int) AS bucket, 
                     FIRST(reading)
              FROM   series_one
              GROUP BY bucket;

Interpolate missing values

Hand tune around the switchover as-needed

Time taken: < 30 minutes

FreeSVG.org

Enter Bob...

Similar analysis...

... different data

Can Bob re-use Alice's prep+analytics workflow?

Maybe?

Are readings still every ~10s?
Is the data still binary?
Is the data still (relatively) stable?

... and even then, some manual effort is needed!

Bob needs to know Alice's assumptions
(and how to use the workflow)?

Act 2

Carol gets a dataset from Dave

↓

→

FreeSVG.org

Dave adds new data to the dataset!

Can Carol re-use her workflow?

Maybe?

Did the data dictionary change?
Did new errors get introduced?

Carol needs to remember her assumptions about the data and trust that the new data is like the old data

Act 3

Eve needs to load a CSV file

→

FreeSVG.org

Scenario 1

I'm sorry, I can't do that, Eve.

You have a non-numerical value at position 1252538:24.

FreeSVG.org

Scenario 2

Load Successful!

(btw, 175326 records didn't load)

Heuristics only work most of the time.

Data science is nuanced.

Assumptions can't be avoided!

It's easy to miss an assumption when re-using work.

https://xkcd.com/2239/

Wouldn't it be nice if...

... this is what Bob saw:

Wouldn't it be nice if...

... this is what Carol saw:

⚠	The data included an unexpected value: 'Non-Hispanic White' The most similar known value is 'White Non-Hispanic'

Annotate data with warnings.

If you use this value/record,
here's what you need to know!

Caveat Physicus

Why?

Propagation

Caveats...
... can go where the data goes: Derived values retain caveats on source data.
... stop where the data stops: Irrelevant caveats don't get propagated

Wouldn't it be nice if...

... this is what Eve saw:

What is a Caveat?

A brief digression...

Classical Databases

One database $D$

Each query gets one answer $R \leftarrow Q(D)$

Incomplete Databases

Multiple possible databases $D \in \mathcal D$

(possible worlds)

Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$

Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$

Uncertain tuples exist in at least one,
but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$

(not limited to set semantics)

A caveat is an assumption tied to one or more data elements (cells or rows).

If the assumption is wrong, so is the element.

Alice / Bob

FIRST may not pick the right value for a bucket with 2+ distinct values.
Interpolation may not pick the right value for a bucket with 0 values.

Carol / Dave

The model hyperparameters may not work if the data changes too significantly.
New values could indicate new data errors that Carol's ingest script hasn't accounted for.

Eve / Hal

Replacing a parse error with a NULL might not be what Eve expects.

An element has a caveat → The element is uncertain.

... and btw, here's why.

Caveats

Story Time
What is a Caveat?
The Vizier Notebook
Applying Caveats
Propagating Caveats
Caveats Beyond SQL

Demo

Caveats

Story Time
What is a Caveat?
The Vizier Notebook
Applying Caveats
Propagating Caveats
Caveats Beyond SQL


            SELECT setting_1, setting_2, estimate
            FROM Simulation;

We want to indicate that the estimate column is only accurate if (for example) P ≠ NP.

caveat(value, assumption)

returns value, annotated with assumption.


            SELECT setting_1, setting_2,
                   caveat(estimate, 'Only correct if P ≠ NP')
                     AS estimate
            FROM Simulation;

annotation is just a human-readable string.

Incomplete Databases

caveat() creates 2 sets of possible worlds:

The assumption holds: value is correct.
The assumption does not hold: value is unknown.

Alice / Bob

Mark multi-valued buckets (key repair).


    SELECT bucket, 
           CASE WHEN bucket_size > 1 THEN
                 caveat(reading, 'Picked between two bucket values.')
                ELSE reading END AS reading
    FROM (
      SELECT CAST(time / 10 AS int) AS bucket, 
             FIRST(reading) AS reading
             COUNT(*) AS bucket_size
      FROM sensor
      GROUP BY bucket;
    )

Interpolation is more complex... but similar.

Carol / Dave

Mark unexpected values the model wasn't trained on.


  SELECT
    CASE WHEN race_ethnicity 
      IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
      THEN race_ethnicity

      ELSE caveat(race_ethnicity, 
                    'Unexpected race_ethnicity: ' & race_ethnicity)

    END, /* ... */
  FROM R

This check can be automated.

Eve / Hal


      SELECT /* ... */, 
          CASE WHEN CAST(salary AS float) IS NULL THEN

            caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
            
            ELSE CAST(salary AS float) END AS salary
      FROM raw_csv_data;

Caveats

Story Time
What is a Caveat?
The Vizier Notebook
Applying Caveats
Propagating Caveats
Caveats Beyond SQL

Has anyone asked about "where" provenance?

Another brief digression...

Value Annotations

Provenance in Databases: Why, How, and Where
James Cheney, Laura Chiticariu and Wang-Chiew Tan

MONDRIAN: Annotating and Querying Databases through Colors and Blocks.
Floris Geerts, Anastasios Kementsietsidis, Diego Milano

and more...

Value Annotations


            CREATE VIEW Q AS 
              SELECT R.A     AS X, 
                     R.B+R.C AS Y 
              FROM R

$$annot(\texttt{Q.X}[i]) \leftarrow annot(\texttt{R.A}[i])$$

$$annot(\texttt{Q.Y}[i]) \leftarrow annot(\texttt{R.B}[i]) \cup annot(\texttt{R.C}[i])$$

Value Annotations


            CREATE VIEW Q AS 
              SELECT R.A      AS X, 
                     SUM(R.B) AS Y 
              FROM R

$$annot(\texttt{Q.X}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = Q.A[i]} annot(\texttt{R.A}[j])$$

$$annot(\texttt{Q.Y}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.B}[j] = Q.B[i]} annot(\texttt{R.B}[j])$$

... not the semantics we want

Caveats on $\texttt{R.A}$ also affect $\texttt{Q.B}$.

Caveats ≠ Value Annotations

Certain Data Elements: Elements guaranteed to be in the result in all possible worlds.

... i.e., elements unaffected by the choice of possible world.

If a caveatted element can't affect an output element, don't propagate its caveats!

Propagate caveats to any data elements that could be affected by a change in assumptions.

Challenge: How do we propagate caveats
without penalizing query evaluation?

Don't!

Staged Caveat Discovery

Alongside query evaluation...: Instrument queries to discover which elements are affected by a caveat.
After query evaluation...: Enumerate specific caveats affecting those elements.

Marking Caveatted Elements

Enumerating Caveats

Instrumenting Queries

≅ computing certain answers! (CoNP-Complete)

Conservative Approximation

Correctness of SQL Queries on Databases with Nulls.
Paolo Guagliardo, Leonid Libkin

Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy

Unmarked rows are guaranteed to be caveat-free.
Marked rows might not be caveatted.

Add and maintain a binary "has caveat"
column for each row/column.


    CREATE VIEW survey_responses AS
      SELECT language, 
          CASE WHEN CAST(salary AS float) IS NULL THEN
            caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
            ELSE CAST(salary AS float) END AS salary
      FROM raw_csv_data;

becomes


    CREATE VIEW survey_responses AS
      SELECT language, CAST(salary AS float) AS salary,
             FALSE                         AS _caveat_field_language,
             CAST(salary as float) IS NULL AS _caveat_field_salary
             FALSE                         AS _caveat_row
      FROM raw_csv_data;


            SELECT salary 
            FROM survey_responses
            WHERE language = 'Scala'

becomes


      SELECT salary, 
             _caveat_field_salary AS _caveat_field_salary,
             _caveat_row AND _caveat_field_language AS _caveat_row
      FROM survey_responses
      WHERE language = 'Scala'


            SELECT AVG(salary) AS salary
            FROM survey_responses

becomes


      SELECT AVG(salary), 
             GROUP_OR(_caveat_field_salary
                      OR _caveat_row) AS _caveat_field_salary,
             FALSE AS _caveat_row
      FROM survey_responses


            SELECT language, AVG(salary) AS salary
            FROM survey_responses
            GROUP BY language

... first we evaluate


      SELECT GROUP_OR(_caveat_field_language)
      FROM survey_responses

Can often be evaluated statically.

If GROUP BY has caveats


    SELECT language, AVG(salary) AS salary
           FALSE                  AS _caveat_field_language
           TRUE                   AS _caveat_field_salary
           GROUP_AND(_caveat_field_language OR 
                     _caveat_row) AS _caveat_row
    FROM by_language
    GROUP BY language

If no GROUP BY caveats


        SELECT language, AVG(salary) AS salary
               FALSE                  AS _caveat_field_language
               GROUP_OR(_caveat_field_salary,
                        _caveat_row)  AS _caveat_field_salary
               GROUP_AND(_caveat_row) AS _caveat_row
        FROM by_language
        GROUP BY language

Enumerating Caveats

Static Analysis: Which caveats could possibly affect the element?
Dynamic Analysis: Which specific caveats affect the element?

Static Analysis

What calls to caveat() appear in the derivation of the specified element?

Analogous to program slicing.

Program Slicing

Eliminate lines of code not relevant
to computing a specific value.

This is exactly what a database optimizer does.

Lookup: Caveats on $\texttt{R.A}[i]$


            SELECT A
            FROM R
            WHERE ROWID = i

All calls to caveat() surviving optimization
(probably) affect the target.

Dynamic Analysis

For each call to caveat(), isolate a query to generate the message.
Union the message query results together.

Isolate the message


   WITH data_source AS
     SELECT caveat(A, 'valid if '& B &' is within tolerances.') AS A,
            C, D, E
     FROM R
   
   SELECT C, D, E FROM data_source WHERE ROWID = i

becomes


           SELECT 'valid if '& B &' is within tolerances.' 
                    AS caveat_message
           FROM R WHERE ROWID = i

Caveats

Story Time
What is a Caveat?
The Vizier Notebook
Applying Caveats
Propagating Caveats
Caveats Beyond SQL

Oliver: I have this great tool for tracking assumptions!
Data Scientist: Super! How does it work?
Oliver: Well you just write a SQL query...
Data Scientist: ...

Caveats for the Masses

✔	SQL
✔	R (sort of)
🗶	Spreadsheets
🗶	Python

The Exception That Improves The Rule
Juliana Freire, Boris Glavic, Oliver Kennedy, Heiko Mueller

Vizual

Spreadsheet Operations → SQL DDL / SQL DML

Edit Cell A3 to 'foo': UPDATE R SET A = 'foo' WHERE ROWID = 3;
Insert Row: INSERT INTO R() VALUES ();
Insert Column `bar`: ALTER TABLE R ADD COLUMN `bar`;

Ok... so we have an edit history in DDL/DML.

Caveats on DDL/DML

DDL → SQL

Using Reenactment to Retroactively Capture Provenance for Transactions
Bahareh Sadat Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan, Boris Glavic

DML → SQL

Graceful database schema evolution: the PRISM workbench
Carlo Curino, Hyun Jin Moon, Carlo Zaniolo


            UPDATE R SET A = 'foo' WHERE ROWID = 3;

becomes


            SELECT CASE ROWID
                     WHEN 3 THEN 'foo'
                     ELSE A END AS A, 
                   B, C, /* ... */
            FROM R


            INSERT INTO R() VALUES ();

becomes


              SELECT * FROM R
            UNION ALL
              SELECT NULL AS A, NULL AS B,
                     NULL AS C, /* ... */


            ALTER TABLE R ADD COLUMN `bar`;

becomes


            SELECT *, NULL as `bar` FROM R;

https://vizierdb.info


          $> pip3 install --user vizier-webapi
          $> vizier

Students
Poonam (PhD-4Y)	Will (PhD-3Y)	Aaron (PhD-4Y)

Dev
Mike (Sr. Rsrch. Dev.)

Alumni
Ying (PhD 2017)	Niccolò (PhD 2016)	Arindam (MS 2016)	Shivang (MS 2018)	Olivia (BS 2017)	Gourab (MS 2018)

External Collaborators
Zhen Hua Liu (Oracle)	Ying Lu (Oracle)	Beda Hammerschmidt (Oracle)	Boris Glavic (IIT)	Su Feng (IIT)

Juliana Freire
(NYU)

Heiko Mueller
(NYU)

Sonia Castelo Quispe
(NYU)

Carlos Bautista
(NYU)

Remi Rampin
(NYU)

Vizier is supported by NSF Awards ACI-1640864 and #IIS-1750460 and gifts from Oracle

Optimizations

Too Much Information: Limit the number of messages returned per call.
Unions (on Spark) are Expensive: Execute each query individually in parallel.

VizierDB

Safe, Reusable Heuristic Data Transformation

(through Caveats)

Oliver Kennedy okennedy@buffalo.edu

Story Time!

Act 1

Option 1: Do it right

Alice's Observations

Enter Bob...

Maybe?

Act 2

Maybe?

Act 3

Scenario 1

Scenario 2

Wouldn't it be nice if...

Wouldn't it be nice if...

Wouldn't it be nice if...

Caveat Physicus

Why?

Propagation

Wouldn't it be nice if...

What is a Caveat?

Classical Databases

Incomplete Databases

Alice / Bob

Carol / Dave

Eve / Hal

Caveats

Demo

Caveats

Incomplete Databases

Alice / Bob

Carol / Dave

Eve / Hal

Caveats

Has anyone asked about "where" provenance?

Value Annotations

Value Annotations

Value Annotations

Caveats ≠ Value Annotations

Staged Caveat Discovery

Marking Caveatted Elements

Enumerating Caveats

Instrumenting Queries

Conservative Approximation

If GROUP BY has caveats

If no GROUP BY caveats

Enumerating Caveats

Static Analysis

Program Slicing

Dynamic Analysis

Isolate the message

Caveats

Caveats for the Masses

Vizual

Caveats on DDL/DML

DDL → SQL

DML → SQL

https://vizierdb.info

Optimizations

Graffiti

Shootings