VizierDB


Your notebook is not crumby enough, REPLace it


Michael Brachmann, William Spoth, Oliver Kennedy, Boris Glavic, Heiko Mueller, Sonia Castelo, Carlos Bautista, Juliana Freire

Demo

VizierDB

A Data-First Notebook Built for Reproducibility


  1. Automatic Refresh & Dependency Management
  2. Caveats
  3. Hybrid Notebook/Spreadsheet
  4. History & Version Management
  5. Polyglot & Multimodal

Data Errors Suck

https://xkcd.com/2239/


Assumption Assumption

freesvg.org
© 20th Century Fox

What is a Caveat?

An assumption tied to a fragment of the dataset.

If the assumption is wrong, so is the fragment.





  SELECT
    CASE WHEN race_ethnicity NOT IN ('Black Non-Hispanic', /* ... */)
      THEN caveat(race_ethnicity,
                    'Unexpected race_ethnicity: ' & race_ethnicity)
      ELSE race_ethnicity
    END, /* ... */
  FROM R
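
To make the idea concrete outside of SQL, here is a minimal Python sketch of a caveat as a value paired with an optional message. The `Caveatted` class, the `check_race_ethnicity` helper, and the one-element `EXPECTED` set are hypothetical names for illustration only (the slide elides the real list of expected values); this is not Vizier's implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical allow-list: the slide elides everything after the first value.
EXPECTED = {"Black Non-Hispanic"}

@dataclass(frozen=True)
class Caveatted:
    value: Any
    message: Optional[str] = None  # None means no caveat is attached

def caveat(value, message):
    """Keep the value as-is, but attach the assumption it depends on."""
    return Caveatted(value, message)

def check_race_ethnicity(value):
    # Mirrors the CASE WHEN ... THEN caveat(...) ELSE ... END expression.
    if value not in EXPECTED:
        return caveat(value, "Unexpected race_ethnicity: " + str(value))
    return Caveatted(value)

checked = [check_race_ethnicity(v)
           for v in ["Black Non-Hispanic", "Blakc Non-Hispanic"]]
```

Note that the unexpected value is not dropped or repaired: it flows on unchanged, carrying its message with it.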
          

Propagation

Can twiddling the caveatted value change the output?

$C \leftarrow (5 \times X) + Y$

Caveats on $X$ and $Y$ propagate to $C$*

Some conditions may apply
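
A rough sketch of this propagation rule, assuming the simplest possible policy: the result of an arithmetic operation is caveatted whenever any operand is. The `Val` class and `lift` helper are hypothetical and only model the formula above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Val:
    value: float
    caveatted: bool = False

    def __add__(self, other):
        other = lift(other)
        return Val(self.value + other.value, self.caveatted or other.caveatted)

    def __mul__(self, other):
        other = lift(other)
        return Val(self.value * other.value, self.caveatted or other.caveatted)

    # Addition and multiplication commute, so the reflected forms can be reused.
    __radd__ = __add__
    __rmul__ = __mul__

def lift(x):
    """Treat plain numbers as caveat-free values."""
    return x if isinstance(x, Val) else Val(x)

y = Val(1.0)
c = 5 * Val(2.0, caveatted=True) + y   # X carries a caveat
c_clean = 5 * Val(2.0) + y             # no caveats anywhere
```

Here `c` ends up caveatted because `X` was, while `c_clean` does not.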

Sloooow!

Is a value caveatted? ≡ Certain answers in incomplete databases (coNP-complete)

Conservative Approximation

Correctness of SQL Queries on Databases with Nulls.
Paolo Guagliardo, Leonid Libkin

Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy

  • Unmarked rows are guaranteed to be caveat-free.
  • Marked rows might not be caveatted.
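
The gap between the two bullets can be illustrated with a small sketch. It compares a cheap conservative marker against an exact "twiddle" check that brute-forces a small, hypothetical candidate domain; both function names and the candidate range are assumptions made for illustration, and the brute-force check only stands in for the expensive exact problem.

```python
from itertools import product

def conservative_mark(flags):
    """Mark the output if any input is caveatted (cheap over-approximation)."""
    return any(flags)

def twiddle_changes_output(f, args, flags, candidates):
    """Exact check over a finite domain: can changing caveatted inputs change f?"""
    baseline = f(*args)
    domains = [candidates if flagged else [arg]
               for arg, flagged in zip(args, flags)]
    return any(f(*alt) != baseline for alt in product(*domains))

def masks_x(x, y):
    return 0 * x + y   # x cannot influence the result

def uses_x(x, y):
    return 5 * x + y   # x does influence the result

args, flags, candidates = (7, 3), (True, False), range(-5, 6)
marked = conservative_mark(flags)
masked_exact = twiddle_changes_output(masks_x, args, flags, candidates)
used_exact = twiddle_changes_output(uses_x, args, flags, candidates)
```

For `masks_x` the output is marked even though no change to `x` can alter it, which is the "marked rows might not be caveatted" case. With no flagged inputs, neither check fires.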

Enumerating Caveats

Static Analysis: Which caveats could possibly affect the element?
Dynamic Analysis: Which specific caveats affect the element?

Static Analysis

What calls to caveat() appear in the derivation of the specified element?

Analogous to program slicing.

Program Slicing

Eliminate lines of code not relevant
to computing a specific value.

This is exactly what a database optimizer does.

Lookup: Caveats on $\texttt{R.A}[i]$


            SELECT A
            FROM R
            WHERE ROWID = i
          

All calls to caveat() surviving optimization
(probably) affect the target.
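
As a loose illustration of the slicing idea, the sketch below represents a view as a mapping from output column to a tiny expression tree and collects the `caveat()` calls reachable from one target column. The tuple encoding, the `view` contents, and both messages are hypothetical; a real optimizer prunes far more than a single projection.

```python
# Expressions are nested tuples:
#   ("col", name) | ("lit", value) | ("caveat", expr, message) | ("op", name, *args)

def caveat_calls(expr):
    """Collect the messages of all caveat() calls inside one expression."""
    kind = expr[0]
    if kind == "caveat":
        _, inner, message = expr
        return [message] + caveat_calls(inner)
    if kind == "op":
        found = []
        for arg in expr[2:]:
            found += caveat_calls(arg)
        return found
    return []  # columns and literals contain no caveat() calls

# Hypothetical view definition with caveats on two of its three columns.
view = {
    "A": ("caveat", ("col", "A"), "A is valid if B is within tolerances"),
    "C": ("op", "+", ("col", "C"), ("lit", 1)),
    "D": ("caveat", ("col", "D"), "D was imputed"),
}

def slice_for(view, target_column):
    """Projection pruning: only the target column's expression survives."""
    return caveat_calls(view[target_column])
```

Asking for column `A` returns only the caveat on `A`; the caveat on `D` is sliced away because `D` plays no part in computing the target.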

Dynamic Analysis

  1. For each call to caveat(), isolate a query to generate the message.
  2. Union the message query results together.

Isolate the message


   WITH data_source AS (
     SELECT caveat(A, 'valid if '& B &' is within tolerances.') AS A,
            C, D, E
     FROM R
   )
   SELECT C, D, E FROM data_source WHERE ROWID = i
          

becomes


           SELECT 'valid if '& B &' is within tolerances.' 
                    AS caveat_message
           FROM R WHERE ROWID = i
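
The two dynamic-analysis steps can be tried out in SQLite, with the caveat that SQLite spells string concatenation `||` rather than `&`. In this sketch the table contents, the `caveat_messages` helper, and the second message query are all made up for illustration; only the first message query corresponds to the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A REAL, B INTEGER, C TEXT)")
conn.executemany("INSERT INTO R (A, B, C) VALUES (?, ?, ?)",
                 [(1.5, 10, "x"), (2.5, 20, "y")])

# Step 1: one isolated message query per call to caveat().
MESSAGE_QUERIES = [
    "SELECT 'valid if ' || B || ' is within tolerances.' AS caveat_message "
    "FROM R WHERE ROWID = ?",
    # Hypothetical second caveat, present only to show the union step.
    "SELECT 'C was entered by hand.' AS caveat_message "
    "FROM R WHERE ROWID = ?",
]

def caveat_messages(conn, rowid):
    # Step 2: union the message query results together.
    sql = " UNION ALL ".join(MESSAGE_QUERIES)
    params = [rowid] * len(MESSAGE_QUERIES)
    return [row[0] for row in conn.execute(sql, params)]

messages = caveat_messages(conn, 2)
```

Each message query touches only the columns its message needs, so looking up the caveats for one row never evaluates the rest of the original query.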
          

Vizual

Spreadsheet Operations → SQL DDL / SQL DML

Edit Cell A3 to 'foo'
    UPDATE R SET A = 'foo' WHERE ROWID = 3;
Insert Row
    INSERT INTO R() VALUES ();
Insert Column `bar`
    ALTER TABLE R ADD COLUMN `bar`;

This gives us an edit history in DDL/DML.

Caveats on DDL/DML

DML → SQL

Using Reenactment to Retroactively Capture Provenance for Transactions
Bahareh Sadat Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan, Boris Glavic

DDL → SQL

Graceful database schema evolution: the PRISM workbench
Carlo Curino, Hyun Jin Moon, Carlo Zaniolo


            UPDATE R SET A = 'foo' WHERE ROWID = 3;
          
becomes

            SELECT CASE ROWID
                     WHEN 3 THEN 'foo'
                     ELSE A END AS A, 
                   B, C, /* ... */
            FROM R
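
A quick way to convince yourself that the two forms agree is to run both in SQLite. The sketch below uses a hypothetical four-row table `R` with columns `A` and `B`; it applies the `UPDATE` to one copy and the reenactment query to an untouched copy.

```python
import sqlite3

def fresh_table():
    """Build an identical in-memory copy of a small, made-up table R."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE R (A TEXT, B INTEGER)")
    conn.executemany("INSERT INTO R (A, B) VALUES (?, ?)",
                     [("a1", 1), ("a2", 2), ("a3", 3), ("a4", 4)])
    return conn

# Destructive version: actually run the UPDATE.
updated = fresh_table()
updated.execute("UPDATE R SET A = 'foo' WHERE ROWID = 3")
after_update = updated.execute("SELECT A, B FROM R ORDER BY ROWID").fetchall()

# Reenacted version: a pure query over the unmodified table.
original = fresh_table()
reenacted = original.execute(
    "SELECT CASE ROWID WHEN 3 THEN 'foo' ELSE A END AS A, B "
    "FROM R ORDER BY ROWID").fetchall()
untouched = original.execute("SELECT A FROM R WHERE ROWID = 3").fetchone()
```

Both paths produce the same rows, but the reenacted one leaves the base table as it was, which is what makes the edit history replayable.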
          

https://vizierdb.info


            $> pip3 install --user vizier-webapi
            $> vizier
          

[https://]VizierDB[.info]


Michael Brachmann, William Spoth, Oliver Kennedy, Boris Glavic, Heiko Mueller, Sonia Castelo, Carlos Bautista, Juliana Freire


Ying Yang, Su Feng, Poonam Kumari, Aaron Huber, Niccolò Meneghetti, Arindam Nandi, Shivang Agarwal, Olivia Alphonse, Lisa Lu, Gourab Malhotra, Remi Rampin


Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460 and gifts from Oracle

Bonus Slides


    CREATE VIEW survey_responses AS
      SELECT language, 
          CASE WHEN CAST(salary AS float) IS NULL THEN
            caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
            ELSE CAST(salary AS float) END AS salary
      FROM raw_csv_data;
          
becomes

    CREATE VIEW survey_responses AS
      SELECT language, CAST(salary AS float) AS salary,
             FALSE                         AS _caveat_field_language,
             CAST(salary AS float) IS NULL AS _caveat_field_salary,
             FALSE                         AS _caveat_row
      FROM raw_csv_data;
          

            SELECT salary 
            FROM survey_responses
            WHERE language = 'Scala'
          
becomes

      SELECT salary, 
             _caveat_field_salary AS _caveat_field_salary,
             _caveat_row OR _caveat_field_language AS _caveat_row
      FROM survey_responses
      WHERE language = 'Scala'
          

            SELECT AVG(salary) AS salary
            FROM survey_responses
          
becomes

      SELECT AVG(salary) AS salary,
             GROUP_OR(_caveat_field_salary
                      OR _caveat_row) AS _caveat_field_salary,
             FALSE AS _caveat_row
      FROM survey_responses
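
The aggregate rewrite can be approximated in SQLite, assuming `MAX` over 0/1 flags as a stand-in for `GROUP_OR` (which SQLite does not provide). The table contents below are invented, and the annotation columns are stored directly rather than derived from a cast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_responses (
        language TEXT, salary REAL,
        _caveat_field_language INTEGER,
        _caveat_field_salary INTEGER,
        _caveat_row INTEGER)
""")
conn.executemany(
    "INSERT INTO survey_responses VALUES (?, ?, ?, ?, ?)",
    [("Scala", 100.0, 0, 0, 0),
     ("Scala", None, 0, 1, 0),   # salary failed to cast, so it is caveatted
     ("Python", 80.0, 0, 0, 0)])

REWRITTEN = """
    SELECT AVG(salary) AS salary,
           MAX(_caveat_field_salary OR _caveat_row) AS _caveat_field_salary,
           0 AS _caveat_row
    FROM survey_responses
"""
avg_salary, salary_caveatted, row_caveatted = conn.execute(REWRITTEN).fetchone()

# Restricting to rows without caveats leaves the aggregate unmarked.
python_only = conn.execute(
    REWRITTEN + " WHERE language = 'Python'").fetchone()
```

One caveatted input is enough to mark the average, while the single result row itself is never caveatted because an ungrouped aggregate always produces exactly one row.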
          

            SELECT language, AVG(salary) AS salary
            FROM survey_responses
            GROUP BY language
          
... first we evaluate

      SELECT GROUP_OR(_caveat_field_language)
      FROM survey_responses
          

Can often be evaluated statically.

If GROUP BY has caveats


    SELECT language, AVG(salary) AS salary,
           FALSE                  AS _caveat_field_language,
           TRUE                   AS _caveat_field_salary,
           GROUP_AND(_caveat_field_language OR 
                     _caveat_row) AS _caveat_row
    FROM by_language
    GROUP BY language
          

If no GROUP BY caveats


        SELECT language, AVG(salary) AS salary,
               FALSE                  AS _caveat_field_language,
               GROUP_OR(_caveat_field_salary OR
                        _caveat_row)  AS _caveat_field_salary,
               GROUP_AND(_caveat_row) AS _caveat_row
        FROM by_language
        GROUP BY language
          

            INSERT INTO R() VALUES ();
          
becomes

              SELECT * FROM R
            UNION ALL
              SELECT NULL AS A, NULL AS B,
                     NULL AS C, /* ... */
          

            ALTER TABLE R ADD COLUMN `bar`;
          
becomes

            SELECT *, NULL as `bar` FROM R;
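
Both reenactments can be checked in SQLite as plain queries over an unmodified table. The two-row table `R` below is hypothetical, and the explicit column list `A, B` stands in for the elided columns in the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [("a1", 1), ("a2", 2)])

# INSERT INTO R() VALUES ();  reenacted as a union with one all-NULL row.
insert_reenacted = conn.execute(
    "SELECT A, B FROM R UNION ALL SELECT NULL AS A, NULL AS B").fetchall()

# ALTER TABLE R ADD COLUMN bar;  reenacted as an extra NULL-valued column.
cursor = conn.execute("SELECT *, NULL AS bar FROM R")
columns = [d[0] for d in cursor.description]
add_column_reenacted = cursor.fetchall()

# The base table is never modified by either reenactment.
base_row_count = conn.execute("SELECT COUNT(*) FROM R").fetchone()[0]
```

Because each edit becomes a query layered over the previous state, the spreadsheet history turns into a stack of views, so the same caveat machinery applies to it.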