diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg b/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg new file mode 100644 index 00000000..a82c7841 --- /dev/null +++ b/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg @@ -0,0 +1,76 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + Openclipart + + + graph + 2009-03-05T13:36:40 + icon for statistics, plot or graph + https://openclipart.org/detail/21839/graph-by-ax11 + + + AX11 + + + + + cartoon + color + gtaph + icon + plot + statistics + symbol + + + + + + + + + + + diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png b/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png new file mode 100644 index 00000000..0b19b7e7 Binary files /dev/null and b/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png differ diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg b/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg index 12387161..3918912c 100644 Binary files a/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg and b/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg differ diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png b/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png new file mode 100644 index 00000000..5a9d54f2 Binary files /dev/null and b/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png differ diff --git a/slides/talks/2020-1-CIDR-Vizier/index.html b/slides/talks/2020-1-CIDR-Vizier/index.html index 19266ee8..d937c288 100644 --- a/slides/talks/2020-1-CIDR-Vizier/index.html +++ b/slides/talks/2020-1-CIDR-Vizier/index.html @@ -77,213 +77,38 @@ VizierDB

Safe, Reusable Heuristic Data Transformation


(through Caveats)


Your notebook is not crumby enough, REPLace it


Oliver Kennedy - okennedy@buffalo.edu


Michael Brachmann, + William Spoth, + Oliver Kennedy, + Boris Glavic, + Heiko Mueller, + Sonia Castelo, + Carlos Bautista, + Juliana Freire

+ -

Story Time!

- - -
- -

Act 1


Alice wants to analyze two unaligned time series.



- - - - - - - - - - -
- - - - - - - - - - -

Step 1: Line up the readings

- -

Option 1: Do it right

- -
- - - - - - - - -
- Lots of active research efforts! -
- ... but Alice is trying to GSD! -
- -

Alice's Observations

- - -
- -

-            INSERT INTO series_one_buckets
-              SELECT CAST(time / 10 AS int) AS bucket, 
-                     FIRST(reading)
-              FROM   series_one
-              GROUP BY bucket;

Interpolate missing values


Hand tune around the switchover as-needed

- -

Time taken: < 30 minutes

- -
- - FreeSVG.org -
- -

Enter Bob...

- -

Similar analysis...


... different data

- -

Can Bob re-use Alice's prep+analytics workflow?

- -


- -

... and even then, some manual effort is needed!

- -

Bob needs to know Alice's assumptions
(and how to use the workflow)?

- -

Act 2


Carol gets a dataset from Dave

- -
- - → - - - → - - - FreeSVG.org -
- -

Dave adds new data to the dataset!


Can Carol re-use her workflow?

- -


- -
- -

Carol needs to remember her assumptions about the data and trust that the new data is like the old data

- -

Act 3


Eve needs to load a CSV file

- - → - - FreeSVG.org -
- -

Scenario 1

- -
- -

- I'm sorry, I can't do that, Eve.


- You have a non-numerical value at position 1252538:24. -

- -
- FreeSVG.org -

Scenario 2

- - -
- -

- Load Successful! -


- (btw, 175326 records didn't load) -

- -

Heuristics only work most of the time.


+ + VizierDB +

A Data-First Notebook Built for Reproducibility

  1. In-Order Execution
  2. +
  3. Inter-Cell Dependency Management
  4. +
  5. Caveats
  6. +
  7. Hybrid Notebook/Spreadsheet
  8. +
  9. History & Version Management
  10. +
  11. Polyglot & Multimodal
  12. +
@@ -305,436 +130,152 @@ - Re-emphasize that enumeration is not required - Propagation Overview --> -

Data science is nuanced.


Assumptions can't be avoided!


It's easy to miss an assumption when re-using work.


Data Errors Suck


Wouldn't it be nice if...

+ +

+ + + + + + + +

+ +

+ + +
+ Assumption + + Assumption +

+ + freesvg.org +
+ +
+ + © 20th Century Fox

Wouldn't it be nice if...


... this is what Bob saw:

- -
- -

Wouldn't it be nice if...


... this is what Carol saw:

- - - - - - -
- The data included an unexpected value: 'Non-Hispanic White'
The most similar known value is 'White Non-Hispanic' -
- -

Annotate data with warnings.

- -

If you use this value/record,
here's what you need to know!

- -

Caveat Physicus

- -




- -
... can go where the data goes
Derived values retain caveats on source data.
- -
... stop where the data stops
Irrelevant caveats don't get propagated
- -

Wouldn't it be nice if...


... this is what Eve saw:

- -
- - -

What is a Caveat?


A brief digression...


An assumption tied to a fragment of the dataset.


If the assumption is wrong, so is the fragment.


Classical Databases

- -

One database $D$

- -

Each query gets one answer $R \leftarrow Q(D)$

- -

Incomplete Databases

- -

Multiple possible databases $D \in \mathcal D$


(possible worlds)


Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$

- -

Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$


Uncertain tuples exist in at least one,
but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$


(not limited to set semantics)
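The two definitions above can be sketched in a few lines. This is only an illustration under set semantics: the possible results are listed explicitly, which a real system would avoid, and the names `certain`, `uncertain`, and `possible_results` are hypothetical.

```python
def certain(worlds):
    """Tuples that appear in every possible result (the intersection)."""
    worlds = [set(w) for w in worlds]
    return set.intersection(*worlds) if worlds else set()


def uncertain(worlds):
    """Tuples that appear in at least one, but not all, possible results."""
    worlds = [set(w) for w in worlds]
    if not worlds:
        return set()
    return set.union(*worlds) - certain(worlds)


# Two possible query results that disagree about one reading (made-up data).
possible_results = [
    {("bucket_0", 10), ("bucket_1", 20)},
    {("bucket_0", 10), ("bucket_1", 25)},
]

certain_rows = certain(possible_results)
uncertain_rows = uncertain(possible_results)
```

Here `("bucket_0", 10)` is certain because both worlds agree on it, while both candidate rows for `bucket_1` are uncertain.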

- -

A caveat is an assumption tied to one or more data elements (cells or rows).


If the assumption is wrong, so is the element.

- -

Alice / Bob

- -
- -

Carol / Dave

- -
- -

Eve / Hal

- -
- -

An element has a caveat → The element is uncertain.

- -

... and btw, here's why.

- -
- -


- -
  1. Story Time
  2. -
  3. What is a Caveat?
  4. -
  5. The Vizier Notebook
  6. -
  7. Applying Caveats
  8. -
  9. Propagating Caveats
  10. -
  11. Caveats Beyond SQL
  12. -
- -


- -


- -
  1. Story Time
  2. -
  3. What is a Caveat?
  4. -
  5. The Vizier Notebook
  6. -
  7. Applying Caveats
  8. -
  9. Propagating Caveats
  10. -
  11. Caveats Beyond SQL
  12. -
- -

-            SELECT setting_1, setting_2, estimate
-            FROM Simulation;
- -

We want to indicate that the estimate column is only accurate if (for example) P ≠ NP.

- -

caveat(value, assumption)

- -

returns value, annotated with assumption.

- -

-            SELECT setting_1, setting_2,
-                   caveat(estimate, 'Only correct if P ≠ NP')
-                     AS estimate
-            FROM Simulation;

annotation is just a human-readable string.
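As a rough mental model of the `caveat(value, assumption)` call above, the sketch below wraps a value together with its human-readable assumptions. It is not Vizier's implementation: the `Caveated` class and the sample row are hypothetical, and only the function name and the message text come from the slide.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet


@dataclass(frozen=True)
class Caveated:
    """A value paired with the assumptions it depends on (hypothetical type)."""
    value: Any
    caveats: FrozenSet[str] = frozenset()


def caveat(value, assumption):
    """Return `value` unchanged, annotated with `assumption`."""
    if isinstance(value, Caveated):
        # Re-annotating keeps the earlier assumptions as well.
        return Caveated(value.value, value.caveats | {assumption})
    return Caveated(value, frozenset({assumption}))


# Stand-in for the Simulation table (made-up row).
simulation = [{"setting_1": 1, "setting_2": 2, "estimate": 0.75}]

annotated = [
    {**row, "estimate": caveat(row["estimate"], "Only correct if P ≠ NP")}
    for row in simulation
]
```

The value itself is untouched; only the annotation travels with it.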

- -

Incomplete Databases


- caveat() creates 2 sets of possible worlds: -


- -

Alice / Bob


Mark multi-valued buckets (key repair).


-    SELECT bucket, 
-           CASE WHEN bucket_size > 1 THEN
-                 caveat(reading, 'Picked between two bucket values.')
-                ELSE reading END AS reading
-    FROM (
-      SELECT CAST(time / 10 AS int) AS bucket, 
-             FIRST(reading) AS reading,
-             COUNT(*) AS bucket_size
-      FROM sensor
-      GROUP BY bucket
-    );

Interpolation is more complex... but similar.
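One way the interpolation case might look, assuming simple linear interpolation between the nearest observed buckets (the slides do not say which method Alice uses). The function name `interpolate` and the message text are hypothetical.

```python
def interpolate(series):
    """Fill gaps between observed buckets by linear interpolation.

    `series` maps bucket -> observed reading. The result maps every bucket
    in the observed range to (reading, caveat_message_or_None).
    """
    known = sorted(series)
    filled = {}
    for lo, hi in zip(known, known[1:]):
        filled[lo] = (series[lo], None)  # observed value: no caveat
        for bucket in range(lo + 1, hi):
            frac = (bucket - lo) / (hi - lo)
            estimate = series[lo] + frac * (series[hi] - series[lo])
            # Guessed value: record the assumption alongside it.
            filled[bucket] = (estimate, f"Interpolated between buckets {lo} and {hi}.")
    if known:
        filled[known[-1]] = (series[known[-1]], None)
    return filled


# Bucket 1 is missing from the observations (made-up data).
filled = interpolate({0: 10.0, 2: 20.0, 3: 21.0})
```

Only the filled-in bucket carries a caveat; observed readings pass through without one.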

- -

Carol / Dave


Mark unexpected values the model wasn't trained on.

     CASE WHEN race_ethnicity 
-      IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
-      THEN race_ethnicity
+      NOT IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
-      ELSE caveat(race_ethnicity, 
+      THEN caveat(race_ethnicity, 
                     'Unexpected race_ethnicity: ' & race_ethnicity)
+      ELSE race_ethnicity
     END, /* ... */

This check can be automated.
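A minimal sketch of what automating this check could involve: compare each value against the known set and, on a miss, suggest the closest known value. The token-overlap similarity is an assumption chosen for illustration, not necessarily what Vizier uses, and `KNOWN_VALUES` lists only the two values visible in the slide (the rest are elided there).

```python
# Only the two values shown on the slide; the full list is elided in the source.
KNOWN_VALUES = ["White Non-Hispanic", "Black Non-Hispanic"]


def token_similarity(a, b):
    """Jaccard similarity over whitespace-separated, lower-cased tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def check_value(value, known_values):
    """Return (value, caveat_message), where the message is None for known values."""
    if value in known_values:
        return value, None
    closest = max(known_values, key=lambda k: token_similarity(value, k))
    message = (
        f"The data included an unexpected value: '{value}'. "
        f"The most similar known value is '{closest}'."
    )
    return value, message


known_result = check_value("White Non-Hispanic", KNOWN_VALUES)
unknown_result = check_value("Non-Hispanic White", KNOWN_VALUES)
```

The unexpected value is kept as-is; the caveat only records that the model was not trained on it and what it probably means.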


Eve / Hal

-      SELECT /* ... */, 
-          CASE WHEN CAST(salary AS float) IS NULL THEN
-            caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
-            ELSE CAST(salary AS float) END AS salary
-      FROM raw_csv_data;
+           caveat(race_ethnicity, 
+                    'Unexpected race_ethnicity: ' & race_ethnicity)
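The Eve / Hal query above (load everything, and caveat the values that fail to cast) can be approximated outside SQL as follows. This is a sketch only: the `name` column and the sample CSV text are made up, while the `salary` column and the message format follow the slide.

```python
import csv
import io


def load_salaries(text):
    """Load CSV text; failed casts become None plus a caveat instead of an error."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        raw = record["salary"]
        try:
            salary, note = float(raw), None
        except (TypeError, ValueError):
            # Keep the record, but remember why the value is missing.
            salary, note = None, f"Could not cast [ {raw} ] to float."
        rows.append({"name": record["name"], "salary": salary, "caveat": note})
    return rows


loaded = load_salaries("name,salary\nann,50000\nbo,n/a\n")
```

Neither of Eve's two scenarios occurs: the load does not abort, and the record that failed is not silently dropped.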
- - -


- -
  1. Story Time
  2. -
  3. What is a Caveat?
  4. -
  5. The Vizier Notebook
  6. -
  7. Applying Caveats
  8. -
  9. Propagating Caveats
  10. -
  11. Caveats Beyond SQL
  12. -
- -

Has anyone asked about "where" provenance?


Another brief digression...


Value Annotations

- -

- Provenance in Databases: Why, How, and Where
- James Cheney, Laura Chiticariu and Wang-Chiew Tan -

- -

- MONDRIAN: Annotating and Querying Databases through Colors and Blocks.
- Floris Geerts, Anastasios Kementsietsidis, Diego Milano -

- -

and more...

- -

Value Annotations

-            CREATE VIEW Q AS 
-              SELECT R.A     AS X, 
-                     R.B+R.C AS Y 
-              FROM R
+    CASE WHEN /*...*/
+      THEN caveat(race_ethnicity, 
+                    'Unexpected race_ethnicity: ' & race_ethnicity)
+      ELSE race_ethnicity

- $$annot(\texttt{Q.X}[i]) \leftarrow annot(\texttt{R.A}[i])$$ -


- $$annot(\texttt{Q.Y}[i]) \leftarrow annot(\texttt{R.B}[i]) \cup annot(\texttt{R.C}[i])$$ -


Value Annotations


-            CREATE VIEW Q AS 
-              SELECT R.A      AS X, 
-                     SUM(R.B) AS Y 
-              FROM R

- $$annot(\texttt{Q.X}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = Q.A[i]} annot(\texttt{R.A}[j])$$ -


- $$annot(\texttt{Q.Y}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.B}[j] = Q.B[i]} annot(\texttt{R.B}[j])$$ -


... not the semantics we want

+ + + + + + + + + + + + + + + + + +
+ + + +
+ freesvg.org

- Caveats on $\texttt{R.A}$ also affect $\texttt{Q.B}$. -




Can twiddling the caveatted value change the output?

+ +

$C \leftarrow (a \times X) + Y$


Caveats on $X$ and $Y$ propagate to $C$
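The rule for $C \leftarrow (a \times X) + Y$ can be sketched as below, assuming $a$ is a certain constant. The `Caveated` type and the function name are hypothetical. The `a == 0` branch illustrates the "twiddling" test: if no change to $X$ can change $C$, the caveats on $X$ are not propagated.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Caveated:
    """A number paired with the assumptions it depends on (hypothetical type)."""
    value: float
    caveats: FrozenSet[str] = frozenset()


def scale_and_add(a, x, y):
    """Compute C = (a * X) + Y and decide which caveats reach C."""
    # X can influence C only when its coefficient is non-zero.
    x_can_affect_output = a != 0
    caveats = y.caveats | (x.caveats if x_can_affect_output else frozenset())
    return Caveated(a * x.value + y.value, caveats)


x = Caveated(3.0, frozenset({"X was interpolated"}))
y = Caveated(1.0, frozenset({"Y was picked from a multi-valued bucket"}))

c = scale_and_add(2.0, x, y)       # both inputs matter
c_zero = scale_and_add(0.0, x, y)  # X is multiplied away
```

With `a = 2.0` the result carries both caveats; with `a = 0.0` only the caveat on `y` survives.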


Caveats ≠ Value Annotations

- -
- -
- -

Certain Data Elements: Elements guaranteed to be in the result in all possible worlds.

- -

... i.e., elements unaffected by the choice of possible world.




If a caveatted element can't affect an output element, don't propagate its caveats!


Propagate caveats to any data elements that could be affected by a change in assumptions.

+ +



Challenge: How do we propagate caveats
without penalizing query evaluation?

- -


- -

Staged Caveat Discovery

- -
Alongside query evaluation...
Instrument queries to discover which elements are affected by a caveat.
- -
After query evaluation...
Enumerate specific caveats affecting those elements.
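The two stages might be sketched like this, reusing the bucketing example from earlier in the talk. Stage 1 carries only a boolean flag per output element while the query runs; stage 2 reconstructs the messages later, and only for an element someone asks about. The function names and the sample readings are hypothetical.

```python
def bucket_with_flags(rows):
    """Stage 1: evaluate the query, carrying only a cheap is-caveated flag."""
    groups = {}
    for time, reading in rows:
        groups.setdefault(int(time / 10), []).append(reading)
    # Keep the first reading; flag buckets where a choice had to be made.
    return {bucket: (vals[0], len(vals) > 1) for bucket, vals in groups.items()}


def enumerate_caveats(rows, bucket):
    """Stage 2: on demand, list the caveats affecting one output element."""
    vals = [reading for time, reading in rows if int(time / 10) == bucket]
    if len(vals) > 1:
        return [f"Picked {vals[0]} out of {len(vals)} readings in bucket {bucket}."]
    return []


# (time, reading) pairs; bucket 0 receives two readings (made-up data).
sensor = [(1, 5.0), (4, 5.5), (12, 6.0), (25, 7.0)]

flagged = bucket_with_flags(sensor)
details = enumerate_caveats(sensor, 0)
```

The design choice is that the common path pays only for one extra boolean per element, and the more expensive enumeration runs only when a user inspects a flagged value.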
- -

Marking Caveatted Elements

- -
- -

Enumerating Caveats

- +

Is a value caveatted?


≡ Certain answers in incomplete databases



- -

Instrumenting Queries


≅ computing certain answers! (CoNP-Complete)


Conservative Approximation

@@ -860,6 +401,7 @@


- -
  1. Story Time
  2. -
  3. What is a Caveat?
  4. -
  5. The Vizier Notebook
  6. -
  7. Applying Caveats
  8. -
  9. Propagating Caveats
  10. -
  11. Caveats Beyond SQL
  12. -
@@ -977,24 +495,6 @@

Caveats for the Masses

- - - - - - - - - - - - - -
R (sort of)

The Exception That Improves The Rule
@@ -1022,7 +522,7 @@


Ok... so we have an edit history in DDL/DML.


This gives us an edit history in DDL/DML.

@@ -1082,153 +582,57 @@

- - https://vizierdb.info -


-          $> pip3 install --user vizier-webapi
-          $> vizier
- - - - - - - - - -
- -


- -


- -


- - - - - - - -
- -

(Sr. Rsrch. Dev.)

- - - - - - - - - - - - - -
- -

(PhD 2017)

- -

(PhD 2016)

- -

(MS 2016)

- -

(MS 2018)

- -

(BS 2017)

- -

(BS 2018)

- -

(MS 2018)

- - - - - - - - - - - -
External Collaborators
- Zhen Hua Liu
(Oracle) -
- Ying Lu
(Oracle) -
- Beda Hammerschmidt
(Oracle) -
- Boris Glavic
(IIT) -
- Su Feng
(IIT) -
- - - - - - - - -
- Juliana Freire
(NYU) -
- Heiko Mueller
(NYU) -
- Sonia Castelo Quispe
(NYU) -
- Carlos Bautista
(NYU) -
- Remi Rampin
(NYU) -

Vizier is supported by NSF Awards ACI-1640864 and #IIS-1750460 and gifts from Oracle

- -
- -


- -
Too Much Information
Limit the number of messages returned per call.
- -
Unions (on Spark) are Expensive
Execute each query individually in parallel.

+ + https://vizierdb.info +


+            $> pip3 install --user vizier-webapi
+            $> vizier


- -

+ + + [https://]VizierDB[.info] + + +


+ Michael Brachmann, + William Spoth, + Oliver Kennedy, + Boris Glavic, + Heiko Mueller, + Sonia Castelo, + Carlos Bautista, + Juliana Freire



- -

+ Ying Yang, + Su Feng, + Poonam Kumari, + Aaron Huber, + Niccolò Meneghetti, + Arindam Nandi, + Shivang Agarwal, + Olivia Alphonse, + Lisa Lu, + Gourab Malhotra, + Remi Rampin


Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460 and gifts from Oracle

+ + +
@@ -1270,5 +674,7 @@ + +