VizierDB


Safe, Reusable Heuristic Data Transformation

(through Caveats)


Oliver Kennedy okennedy@buffalo.edu

Story Time!

Act 1

Alice wants to analyze two unaligned time series.

TimeReading
15757310010
15757310140
15757310300
15757310350
...
15757312191
15757312291
15757312401
TimeReading
15757310110
15757310200
15757310310
15757310390
...
15757312181
15757312281
15757312371

Step 1: Line up the readings

Option 1: Do it right

Lots of active research efforts!
... but Alice is trying to to GSD!

Alice's Observations

  • Readings every ~10s
  • Readings are binary
  • Readings are incredibly stable

            INSERT INTO series_one_buckets
              SELECT CAST(time / 10 AS int) AS bucket, 
                     FIRST(reading)
              FROM   series_one
              GROUP BY bucket;
          

Interpolate missing values

Hand tune around the switchover as-needed

Time taken: < 30 minutes

FreeSVG.org

Enter Bob...

Similar analysis...

... different data

Can Bob re-use Alice's prep+analytics workflow?

Maybe?

  • Are readings still every ~10s?
  • Is the data still binary?
  • Is the data still (relatively) stable?

... and even then, some manual effort is needed!

Bob needs to know Alice's assumptions
(and how to use the workflow)?

Act 2

Carol gets a dataset from Dave



FreeSVG.org

Dave adds new data to the dataset!

Can Carol re-use her workflow?

Maybe?

  • Did the data dictionary change?
  • Did new errors get introduced?

Carol needs to remember her assumptions about the data and trust that the new data is like the old data

Act 3

Eve needs to load a CSV file

FreeSVG.org

Scenario 1

I'm sorry, I can't do that, Eve.

You have a non-numerical value at position 1252538:24.

FreeSVG.org

Scenario 2

Load Successful!

(btw, 175326 records didn't load)

Heuristics only work most of the time.

Data science is nuanced.

Assumptions can't be avoided!

It's easy to miss an assumption when re-using work.

https://xkcd.com/2239/

Wouldn't it be nice if...

Wouldn't it be nice if...

... this is what Bob saw:

Wouldn't it be nice if...

... this is what Carol saw:

The data included an unexpected value: 'Non-Hispanic White'
The most similar known value is 'White Non-Hispanic'

Annotate data with warnings.

If you use this value/record,
here's what you need to know!

Caveat Physicus

Why?

Propagation

Caveats...
... can go where the data goes
Derived values retain caveats on source data.
... stop where the data stops
Irrelevant caveats don't get propagated

Wouldn't it be nice if...

... this is what Eve saw:

What is a Caveat?

A brief digression...

Classical Databases

One database $D$

Each query gets one answer $R \leftarrow Q(D)$

Incomplete Databases

Multiple possible databases $D \in \mathcal D$

(possible worlds)

Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$

Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$

Uncertain tuples exist in at least one,
but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$

(not limited to set semantics)

A caveat is an assumption tied to one or more data elements (cells or rows).

If the assumption is wrong, so is the element.

Alice / Bob

  • FIRST may not pick the right value for a bucket with 2+ distinct values.
  • Interpolation may not pick the right value for a bucket with 0 values.

Carol / Dave

  • The model hyperparameters may not work if the data changes too significantly.
  • New values could indicate new data errors that Carol's ingest script hasn't accounted for.

Eve / Hal

  • Replacing a parse error with a NULL might not be what Eve expects.

An element has a caveat → The element is uncertain.

... and btw, here's why.

Caveats

  1. Story Time
  2. What is a Caveat?
  3. The Vizier Notebook
  4. Applying Caveats
  5. Propagating Caveats
  6. Caveats Beyond SQL

Demo

Caveats

  1. Story Time
  2. What is a Caveat?
  3. The Vizier Notebook
  4. Applying Caveats
  5. Propagating Caveats
  6. Caveats Beyond SQL

            SELECT setting_1, setting_2, estimate
            FROM Simulation;
          

We want to indicate that the estimate column is only accurate if (for example) P ≠ NP.

caveat(value, assumption)

returns value, annotated with assumption.


            SELECT setting_1, setting_2,
                   caveat(estimate, 'Only correct if P ≠ NP')
                     AS estimate
            FROM Simulation;
          

annotation is just a human-readable string.

Incomplete Databases

caveat() creates 2 sets of possible worlds:

  • The assumption holds: value is correct.
  • The assumption does not hold: value is unknown.

Alice / Bob

Mark multi-valued buckets (key repair).


    SELECT bucket, 
           CASE WHEN bucket_size > 1 THEN
                 caveat(reading, 'Picked between two bucket values.')
                ELSE reading END AS reading
    FROM (
      SELECT CAST(time / 10 AS int) AS bucket, 
             FIRST(reading) AS reading
             COUNT(*) AS bucket_size
      FROM sensor
      GROUP BY bucket;
    )
          

Interpolation is more complex... but similar.

Carol / Dave

Mark unexpected values the model wasn't trained on.


  SELECT
    CASE WHEN race_ethnicity 
      IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
      THEN race_ethnicity

      ELSE caveat(race_ethnicity, 
                    'Unexpected race_ethnicity: ' & race_ethnicity)

    END, /* ... */
  FROM R
          

This check can be automated.

Eve / Hal


      SELECT /* ... */, 
          CASE WHEN CAST(salary AS float) IS NULL THEN

            caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
            
            ELSE CAST(salary AS float) END AS salary
      FROM raw_csv_data;
          

Caveats

  1. Story Time
  2. What is a Caveat?
  3. The Vizier Notebook
  4. Applying Caveats
  5. Propagating Caveats
  6. Caveats Beyond SQL

Has anyone asked about "where" provenance?

Another brief digression...

Value Annotations

Provenance in Databases: Why, How, and Where
James Cheney, Laura Chiticariu and Wang-Chiew Tan

MONDRIAN: Annotating and Querying Databases through Colors and Blocks.
Floris Geerts, Anastasios Kementsietsidis, Diego Milano

and more...

Value Annotations


            CREATE VIEW Q AS 
              SELECT R.A     AS X, 
                     R.B+R.C AS Y 
              FROM R
          

$$annot(\texttt{Q.X}[i]) \leftarrow annot(\texttt{R.A}[i])$$

$$annot(\texttt{Q.Y}[i]) \leftarrow annot(\texttt{R.B}[i]) \cup annot(\texttt{R.C}[i])$$

Value Annotations


            CREATE VIEW Q AS 
              SELECT R.A      AS X, 
                     SUM(R.B) AS Y 
              FROM R
          

$$annot(\texttt{Q.X}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = Q.A[i]} annot(\texttt{R.A}[j])$$

$$annot(\texttt{Q.Y}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.B}[j] = Q.B[i]} annot(\texttt{R.B}[j])$$

... not the semantics we want

Caveats on $\texttt{R.A}$ also affect $\texttt{Q.B}$.

Caveats ≠ Value Annotations

Certain Data Elements: Elements guaranteed to be in the result in all possible worlds.

... i.e., elements unaffected by the choice of possible world.

If a caveatted element can't affect an output element, don't propagate its caveats!

Propagate caveats to any data elements that could be affected by a change in assumptions.

Challenge: How do we propagate caveats
without penalizing query evaluation?

Don't!

Staged Caveat Discovery

Alongside query evaluation...
Instrument queries to discover which elements are affected by a caveat.
After query evaluation...
Enumerate specific caveats affecting those elements.

Marking Caveatted Elements

Enumerating Caveats

Instrumenting Queries

≅ computing certain answers! (CoNP-Complete)

Conservative Approximation

Correctness of SQL Queries on Databases with Nulls.
Paolo Guagliardo, Leonid Libkin

Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy

  • Unmarked rows are guaranteed to be caveat-free.
  • Marked rows might not be caveatted.

Add and maintain a binary "has caveat"
column for each row/column.


    CREATE VIEW survey_responses AS
      SELECT language, 
          CASE WHEN CAST(salary AS float) IS NULL THEN
            caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
            ELSE CAST(salary AS float) END AS salary
      FROM raw_csv_data;
          
becomes

    CREATE VIEW survey_responses AS
      SELECT language, CAST(salary AS float) AS salary,
             FALSE                         AS _caveat_field_language,
             CAST(salary as float) IS NULL AS _caveat_field_salary
             FALSE                         AS _caveat_row
      FROM raw_csv_data;
          

            SELECT salary 
            FROM survey_responses
            WHERE language = 'Scala'
          
becomes

      SELECT salary, 
             _caveat_field_salary AS _caveat_field_salary,
             _caveat_row AND _caveat_field_language AS _caveat_row
      FROM survey_responses
      WHERE language = 'Scala'
          

            SELECT AVG(salary) AS salary
            FROM survey_responses
          
becomes

      SELECT AVG(salary), 
             GROUP_OR(_caveat_field_salary
                      OR _caveat_row) AS _caveat_field_salary,
             FALSE AS _caveat_row
      FROM survey_responses
          

            SELECT language, AVG(salary) AS salary
            FROM survey_responses
            GROUP BY language
          
... first we evaluate

      SELECT GROUP_OR(_caveat_field_language)
      FROM survey_responses
          

Can often be evaluated statically.

If GROUP BY has caveats


    SELECT language, AVG(salary) AS salary
           FALSE                  AS _caveat_field_language
           TRUE                   AS _caveat_field_salary
           GROUP_AND(_caveat_field_language OR 
                     _caveat_row) AS _caveat_row
    FROM by_language
    GROUP BY language
          

If no GROUP BY caveats


        SELECT language, AVG(salary) AS salary
               FALSE                  AS _caveat_field_language
               GROUP_OR(_caveat_field_salary,
                        _caveat_row)  AS _caveat_field_salary
               GROUP_AND(_caveat_row) AS _caveat_row
        FROM by_language
        GROUP BY language
          

Enumerating Caveats

Static Analysis
Which caveats could possibly affect the element?
Dynamic Analysis
Which specific caveats affect the element?

Static Analysis

What calls to caveat() appear in the derivation of the specified element?

Analogous to program slicing.

Program Slicing

Eliminate lines of code not relevant
to computing a specific value.

This is exactly what a database optimizer does.

Lookup: Caveats on $\texttt{R.A}[i]$


            SELECT A
            FROM R
            WHERE ROWID = i
          

All calls to caveat() surviving optimization
(probably) affect the target.

Dynamic Analysis

  1. For each call to caveat(), isolate a query to generate the message.
  2. Union the message query results together.

Isolate the message


   WITH data_source AS
     SELECT caveat(A, 'valid if '& B &' is within tolerances.') AS A,
            C, D, E
     FROM R
   
   SELECT C, D, E FROM data_source WHERE ROWID = i
          

becomes


           SELECT 'valid if '& B &' is within tolerances.' 
                    AS caveat_message
           FROM R WHERE ROWID = i
          

Caveats

  1. Story Time
  2. What is a Caveat?
  3. The Vizier Notebook
  4. Applying Caveats
  5. Propagating Caveats
  6. Caveats Beyond SQL
Oliver
I have this great tool for tracking assumptions!
Data Scientist
Super! How does it work?
Oliver
Well you just write a SQL query...
Data Scientist
...

Caveats for the Masses

SQL
R (sort of)
🗶 Spreadsheets
🗶 Python

The Exception That Improves The Rule
Juliana Freire, Boris Glavic, Oliver Kennedy, Heiko Mueller

Vizual

Spreadsheet Operations → SQL DDL / SQL DML

Edit Cell A3 to 'foo'
UPDATE R SET A = 'foo' WHERE ROWID = 3;
Insert Row
INSERT INTO R() VALUES ();
Insert Column `bar`
ALTER TABLE R ADD COLUMN `bar`;

Ok... so we have an edit history in DDL/DML.

Caveats on DDL/DML

DDL → SQL

Using Reenactment to Retroactively Capture Provenance for Transactions
Bahareh Sadat Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan, Boris Glavic

DML → SQL

Graceful database schema evolution: the PRISM workbench
Carlo Curino, Hyun Jin Moon, Carlo Zaniolo


            UPDATE R SET A = 'foo' WHERE ROWID = 3;
          
becomes

            SELECT CASE ROWID
                     WHEN 3 THEN 'foo'
                     ELSE A END AS A, 
                   B, C, /* ... */
            FROM R
          

            INSERT INTO R() VALUES ();
          
becomes

              SELECT * FROM R
            UNION ALL
              SELECT NULL AS A, NULL AS B,
                     NULL AS C, /* ... */
          

            ALTER TABLE R ADD COLUMN `bar`;
          
becomes

            SELECT *, NULL as `bar` FROM R;
          

https://vizierdb.info


          $> pip3 install --user vizier-webapi
          $> vizier
        
Students

Poonam
(PhD-4Y)

Will
(PhD-3Y)

Aaron
(PhD-4Y)

Dev

Mike
(Sr. Rsrch. Dev.)

Alumni

Ying
(PhD 2017)

Niccolò
(PhD 2016)

Arindam
(MS 2016)

Shivang
(MS 2018)

Olivia
(BS 2017)

Gourab
(MS 2018)

External Collaborators
Zhen Hua Liu
(Oracle)
Ying Lu
(Oracle)
Beda Hammerschmidt
(Oracle)
Boris Glavic
(IIT)
Su Feng
(IIT)
Juliana Freire
(NYU)
Heiko Mueller
(NYU)
Sonia Castelo Quispe
(NYU)
Carlos Bautista
(NYU)
Remi Rampin
(NYU)

Vizier is supported by NSF Awards ACI-1640864 and #IIS-1750460 and gifts from Oracle

Optimizations

Too Much Information
Limit the number of messages returned per call.
Unions (on Spark) are Expensive
Execute each query individually in parallel.

Graffiti

Shootings