diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg b/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg new file mode 100644 index 00000000..a82c7841 --- /dev/null +++ b/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg @@ -0,0 +1,76 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + Openclipart + + + graph + 2009-03-05T13:36:40 + icon for statistics, plot or graph + https://openclipart.org/detail/21839/graph-by-ax11 + + + AX11 + + + + + cartoon + color + gtaph + icon + plot + statistics + symbol + + + + + + + + + + + diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png b/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png new file mode 100644 index 00000000..0b19b7e7 Binary files /dev/null and b/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png differ diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg b/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg index 12387161..3918912c 100644 Binary files a/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg and b/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg differ diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png b/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png new file mode 100644 index 00000000..5a9d54f2 Binary files /dev/null and b/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png differ diff --git a/slides/talks/2020-1-CIDR-Vizier/index.html b/slides/talks/2020-1-CIDR-Vizier/index.html index 19266ee8..d937c288 100644 --- a/slides/talks/2020-1-CIDR-Vizier/index.html +++ b/slides/talks/2020-1-CIDR-Vizier/index.html @@ -77,213 +77,38 @@ VizierDB
-

Safe, Reusable Heuristic Data Transformation

-

(through Caveats)

+

Your notebook is not crumby enough, REPLace it


-

Oliver Kennedy - okennedy@buffalo.edu

+

Michael Brachmann, + William Spoth, + Oliver Kennedy, + Boris Glavic, + Heiko Mueller, + Sonia Castelo, + Carlos Bautista, + Juliana Freire

+ -
-

Story Time!

-
- - -
- -
-

Act 1

-

Alice wants to analyze two unaligned time series.

+

Demo

- - - - - - - - - - -
TimeReading
15757310010
15757310140
15757310300
15757310350
...
15757312191
15757312291
15757312401
- - - - - - - - - - -
TimeReading
15757310110
15757310200
15757310310
15757310390
...
15757312181
15757312281
15757312371
-

Step 1: Line up the readings

-
- -
-

Option 1: Do it right

-
- -
- - - - - - - - -
- Lots of active research efforts! -
- ... but Alice is trying to GSD! -
-
- -
-

Alice's Observations

- - -
- -
-

-            INSERT INTO series_one_buckets
-              SELECT CAST(time / 10 AS int) AS bucket, 
-                     FIRST(reading)
-              FROM   series_one
-              GROUP BY bucket;
-          
-

Interpolate missing values

-

Hand tune around the switchover as-needed

-
- -
-

Time taken: < 30 minutes

-
- -
- - FreeSVG.org -
- -
-

Enter Bob...

- -

Similar analysis...

-

... different data

- -

Can Bob re-use Alice's prep+analytics workflow?

-
- -
-

Maybe?

- -

... and even then, some manual effort is needed!

-
- -
-

Bob needs to know Alice's assumptions
(and how to use the workflow)?

-
-
- -
-
-

Act 2

-

Carol gets a dataset from Dave

-
- -
-
-
- - → - - - → - - - FreeSVG.org -
- -
-

Dave adds new data to the dataset!

-

Can Carol re-use her workflow?

-
- -
-

Maybe?

- -
- -
-

Carol needs to remember her assumptions about the data and trust that the new data is like the old data

-
-
- -
-
-

Act 3

-

Eve needs to load a CSV file

- - → - - FreeSVG.org -
- -
-

Scenario 1

- -
- -

- I'm sorry, I can't do that, Eve.
-

-

- You have a non-numerical value at position 1252538:24. -

- -
- FreeSVG.org -
-
-

Scenario 2

- - -
- -

- Load Successful! -

-
-

- (btw, 175326 records didn't load) -

-
-
-
- -
-

Heuristics only work most of the time.

+

+ + VizierDB +

A Data-First Notebook Built for Reproducibility
+

+
+
    +
  1. In-Order Execution
  2. +
  3. Inter-Cell Dependency Management
  4. +
  5. Caveats
  6. +
  7. Hybrid Notebook/Spreadsheet
  8. +
  9. History & Version Management
  10. +
  11. Polyglot & Multimodal
  12. +
@@ -305,436 +130,152 @@ - Re-emphasize that enumeration is not required - Propagation Overview --> -
-

Data science is nuanced.

-

Assumptions can't be avoided!

-

It's easy to miss an assumption when re-using work.

-
+

Data Errors Suck

https://xkcd.com/2239/
-

Wouldn't it be nice if...

+ +

+ + + + + + + +

+ +

+ + +
+ Assumption + + Assumption +

+ + freesvg.org +
+ +
+ + © 20th Century Fox
-
-

Wouldn't it be nice if...

-

... this is what Bob saw:

- -
- -
-

Wouldn't it be nice if...

-

... this is what Carol saw:

- - - - - - -
- The data included an unexpected value: 'Non-Hispanic White'
The most similar known value is 'White Non-Hispanic' -
-
- -
-

Annotate data with warnings.

- -

If you use this value/record,
here's what you need to know!

- -

Caveat Physicus

-
- -
-

Why?

-

Propagation

-
-
Caveats...
- -
-
... can go where the data goes
-
Derived values retain caveats on source data.
-
- -
-
... stop where the data stops
-
Irrelevant caveats don't get propagated
-
-
-
- -
-

Wouldn't it be nice if...

-

... this is what Eve saw:

- -
- - -

What is a Caveat?

-

A brief digression...

+
+

An assumption tied to a fragment of the dataset.

+

If the assumption is wrong, so is the fragment.

+
-

Classical Databases

- -

One database $D$

- -

Each query gets one answer $R \leftarrow Q(D)$

-
- -
-

Incomplete Databases

- -

Multiple possible databases $D \in \mathcal D$

-

(possible worlds)

-

Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$

-
- -
-

Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$

-

Uncertain tuples exist in at least one,
but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$

-

(not limited to set semantics)

-
- -
-

A caveat is an assumption tied to one or more data elements (cells or rows).

-

If the assumption is wrong, so is the element.

-
- -
-

Alice / Bob

- -
- -
-

Carol / Dave

- -
- -
-

Eve / Hal

- -
- -
-

An element has a caveat → The element is uncertain.

- -

... and btw, here's why.

-
-
- -
- -
-

Caveats

- -
    -
  1. Story Time
  2. -
  3. What is a Caveat?
  4. -
  5. The Vizier Notebook
  6. -
  7. Applying Caveats
  8. -
  9. Propagating Caveats
  10. -
  11. Caveats Beyond SQL
  12. -
-
- -
-

Demo

-
-
- -
-
-

Caveats

- -
    -
  1. Story Time
  2. -
  3. What is a Caveat?
  4. -
  5. The Vizier Notebook
  6. -
  7. Applying Caveats
  8. -
  9. Propagating Caveats
  10. -
  11. Caveats Beyond SQL
  12. -
-
- -
-

-            SELECT setting_1, setting_2, estimate
-            FROM Simulation;
-          
- -

We want to indicate that the estimate column is only accurate if (for example) P ≠ NP.

-
- -
-

caveat(value, assumption)

- -

returns value, annotated with assumption.

-
- -
-

-            SELECT setting_1, setting_2,
-                   caveat(estimate, 'Only correct if P ≠ NP')
-                     AS estimate
-            FROM Simulation;
-          
-

annotation is just a human-readable string.

-
- -
-

Incomplete Databases

-

- caveat() creates 2 sets of possible worlds: -

-

-
- -
-

Alice / Bob

-

Mark multi-valued buckets (key repair).

-

-    SELECT bucket, 
-           CASE WHEN bucket_size > 1 THEN
-                 caveat(reading, 'Picked between two bucket values.')
-                ELSE reading END AS reading
-    FROM (
-      SELECT CAST(time / 10 AS int) AS bucket, 
-             FIRST(reading) AS reading,
-             COUNT(*) AS bucket_size
-      FROM sensor
-      GROUP BY bucket
-    )
-          
-

Interpolation is more complex... but similar.

-
- -
-

Carol / Dave

-

Mark unexpected values the model wasn't trained on.


   SELECT
     CASE WHEN race_ethnicity 
-      IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
-      THEN race_ethnicity
+      NOT IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
 
-      ELSE caveat(race_ethnicity, 
+      THEN caveat(race_ethnicity, 
                     'Unexpected race_ethnicity: ' & race_ethnicity)
 
+      ELSE race_ethnicity
+
     END, /* ... */
   FROM R
           
-

This check can be automated.

-

Eve / Hal


-      SELECT /* ... */, 
-          CASE WHEN CAST(salary AS float) IS NULL THEN
 
-            caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
-            
-            ELSE CAST(salary AS float) END AS salary
-      FROM raw_csv_data;
+
+
+
+           caveat(race_ethnicity, 
+                    'Unexpected race_ethnicity: ' & race_ethnicity)
+
+
+
+
+
           
-
- - -
-
-

Caveats

- -
    -
  1. Story Time
  2. -
  3. What is a Caveat?
  4. -
  5. The Vizier Notebook
  6. -
  7. Applying Caveats
  8. -
  9. Propagating Caveats
  10. -
  11. Caveats Beyond SQL
  12. -
-
-
- -
-
-

Has anyone asked about "where" provenance?

-

Another brief digression...

-
-
-

Value Annotations

- -

- Provenance in Databases: Why, How, and Where
- James Cheney, Laura Chiticariu and Wang-Chiew Tan -

- -

- MONDRIAN: Annotating and Querying Databases through Colors and Blocks.
- Floris Geerts, Anastasios Kementsietsidis, Diego Milano -

- -

and more...

-
- -
-

Value Annotations


-            CREATE VIEW Q AS 
-              SELECT R.A     AS X, 
-                     R.B+R.C AS Y 
-              FROM R
+
+    CASE WHEN /*...*/
+
+
+      THEN caveat(race_ethnicity, 
+                    'Unexpected race_ethnicity: ' & race_ethnicity)
+
+      ELSE race_ethnicity
+
+
+
           
-

- $$annot(\texttt{Q.X}[i]) \leftarrow annot(\texttt{R.A}[i])$$ -

-

- $$annot(\texttt{Q.Y}[i]) \leftarrow annot(\texttt{R.B}[i]) \cup annot(\texttt{R.C}[i])$$ -

-

Value Annotations

-

-            CREATE VIEW Q AS 
-              SELECT R.A      AS X, 
-                     SUM(R.B) AS Y 
-              FROM R GROUP BY R.A
-          
-

- $$annot(\texttt{Q.X}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = \texttt{Q.X}[i]} annot(\texttt{R.A}[j])$$ -

-

- $$annot(\texttt{Q.Y}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = \texttt{Q.X}[i]} annot(\texttt{R.B}[j])$$ -

-

... not the semantics we want

+ + + + + + + + + + + + + + + + + +
+ + + +
AssumptionsAssumptions
+ freesvg.org
-

- Caveats on $\texttt{R.A}$ also affect $\texttt{Q.Y}$. -

+

Propagation

+

Can twiddling the caveatted value change the output?

+ +

$C \leftarrow (a \times X) + Y$

+

Caveats on $X$ and $Y$ propagate to $C$

-

Caveats ≠ Value Annotations

-
- -
- -
- -
-

Certain Data Elements: Elements guaranteed to be in the result in all possible worlds.

- -

... i.e., elements unaffected by the choice of possible world.

+

Sloooow!

-

If a caveatted element can't affect an output element, don't propagate its caveats!

-

Propagate caveats to any data elements that could be affected by a change in assumptions.

+ +

+

+
-

Challenge: How do we propagate caveats
without penalizing query evaluation?

- -

Don't!

-
- -
-

Staged Caveat Discovery

- -
-
Alongside query evaluation...
-
Instrument queries to discover which elements are affected by a caveat.
- -
After query evaluation...
-
Enumerate specific caveats affecting those elements.
-
-
- -
-

Marking Caveatted Elements

- -
- -
-

Enumerating Caveats

- +

Is a value caveatted?

+

≡ Certain answers in incomplete databases

+

(coNP-complete)

- -
-

Instrumenting Queries

-

≅ computing certain answers! (CoNP-Complete)

-
-

Conservative Approximation

@@ -860,6 +401,7 @@
+
-
-

Caveats

- -
    -
  1. Story Time
  2. -
  3. What is a Caveat?
  4. -
  5. The Vizier Notebook
  6. -
  7. Applying Caveats
  8. -
  9. Propagating Caveats
  10. -
  11. Caveats Beyond SQL
  12. -
-
@@ -977,24 +495,6 @@
-
-

Caveats for the Masses

- - - - - - - - - - - - - -
SQL
R (sort of)
🗶Spreadsheets
🗶Python
-
-

The Exception That Improves The Rule
@@ -1022,7 +522,7 @@

-

Ok... so we have an edit history in DDL/DML.

+

This gives us an edit history in DDL/DML.

@@ -1082,153 +582,57 @@
-
-

- - https://vizierdb.info -

-

-          $> pip3 install --user vizier-webapi
-          $> vizier
-        
-
-
- - - - - - - - - -
Students
- -

Poonam
(PhD-4Y)

-
- -

Will
(PhD-3Y)

-
- -

Aaron
(PhD-4Y)

-
- - - - - - - -
Dev
- -

Mike
(Sr. Rsrch. Dev.)

-
- - - - - - - - - - - - - -
Alumni
- -

Ying
(PhD 2017)

-
- -

Niccolò
(PhD 2016)

-
- -

Arindam
(MS 2016)

-
- -

Shivang
(MS 2018)

-
- -

Olivia
(BS 2017)

-
- -

Lisa
(BS 2018)

-
- -

Gourab
(MS 2018)

-
- - - - - - - - - - - -
External Collaborators
- Zhen Hua Liu
(Oracle) -
- Ying Lu
(Oracle) -
- Beda Hammerschmidt
(Oracle) -
- Boris Glavic
(IIT) -
- Su Feng
(IIT) -
- - - - - - - - -
- Juliana Freire
(NYU) -
- Heiko Mueller
(NYU) -
- Sonia Castelo Quispe
(NYU) -
- Carlos Bautista
(NYU) -
- Remi Rampin
(NYU) -
-

Vizier is supported by NSF Awards ACI-1640864 and #IIS-1750460 and gifts from Oracle

-
- -
- -
-

Optimizations

- -
-
Too Much Information
-
Limit the number of messages returned per call.
- -
Unions (on Spark) are Expensive
-
Execute each query individually in parallel.
-
+

+ + https://vizierdb.info +

+

+            $> pip3 install --user vizier-webapi
+            $> vizier
+          
-

Graffiti

- -
+

+ + + [https://]VizierDB[.info] + + +

+
+

+ Michael Brachmann, + William Spoth, + Oliver Kennedy, + Boris Glavic, + Heiko Mueller, + Sonia Castelo, + Carlos Bautista, + Juliana Freire

+
-
-

Shootings

- -
+

+ Ying Yang, + Su Feng, + Poonam Kumari, + Aaron Huber, + Niccolò Meneghetti, + Arindam Nandi, + Shivang Agarwal, + Olivia Alphonse, + Lisa Lu, + Gourab Malhotra, + Remi Rampin

+
+

Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460 and gifts from Oracle

+ + +
@@ -1270,5 +674,7 @@ + +