diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg b/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg
new file mode 100644
index 00000000..a82c7841
--- /dev/null
+++ b/slides/talks/2020-1-CIDR-Vizier/graphics/AX11-graph.svg
@@ -0,0 +1,76 @@
+
+
+
diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png b/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png
new file mode 100644
index 00000000..0b19b7e7
Binary files /dev/null and b/slides/talks/2020-1-CIDR-Vizier/graphics/male-computer-user.png differ
diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg b/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg
index 12387161..3918912c 100644
Binary files a/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg and b/slides/talks/2020-1-CIDR-Vizier/graphics/montoya.jpeg differ
diff --git a/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png b/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png
new file mode 100644
index 00000000..5a9d54f2
Binary files /dev/null and b/slides/talks/2020-1-CIDR-Vizier/graphics/qr.png differ
diff --git a/slides/talks/2020-1-CIDR-Vizier/index.html b/slides/talks/2020-1-CIDR-Vizier/index.html
index 19266ee8..d937c288 100644
--- a/slides/talks/2020-1-CIDR-Vizier/index.html
+++ b/slides/talks/2020-1-CIDR-Vizier/index.html
@@ -77,213 +77,38 @@ VizierDB
Alice wants to analyze two unaligned time series.
+Time | Reading |
---|---|
1575731001 | 0 |
1575731014 | 0 |
1575731030 | 0 |
1575731035 | 0 |
... | |
1575731219 | 1 |
1575731229 | 1 |
1575731240 | 1 |
Time | Reading |
---|---|
1575731011 | 0 |
1575731020 | 0 |
1575731031 | 0 |
1575731039 | 0 |
... | |
1575731218 | 1 |
1575731228 | 1 |
1575731237 | 1 |
Step 1: Line up the readings
-- Lots of active research efforts! - | -
- ... but Alice is trying to GSD! - | -
- INSERT INTO series_one_buckets
- SELECT CAST(time / 10 AS int) AS bucket,
- FIRST(reading)
- FROM series_one
- GROUP BY bucket;
-
- Interpolate missing values
-Hand tune around the switchover as-needed
-Time taken: < 30 minutes
-Similar analysis...
-... different data
- -Can Bob re-use Alice's prep+analytics workflow?
-... and even then, some manual effort is needed!
-Bob needs to know Alice's assumptions
(and how to use the workflow)?
Carol gets a dataset from Dave
-Dave adds new data to the dataset!
-Can Carol re-use her workflow?
-Carol needs to remember her assumptions about the data and trust that the new data is like the old data
-Eve needs to load a CSV file
-
- I'm sorry, I can't do that, Eve.
-
- You have a non-numerical value at position 1252538:24. -
- -- Load Successful! -
-- (btw, 175326 records didn't load) -
-Heuristics only work most of the time.
+
+
+ VizierDB
+
Data science is nuanced.
-Assumptions can't be avoided!
-It's easy to miss an assumption when re-using work.
-
+
+
+ →
+
+
+ →
+
+
+ ↓
+ ↓
+
+ Assumption
+ ≠
+ Assumption
+
... this is what Bob saw:
-... this is what Carol saw:
-⚠ | -
- The data included an unexpected value: 'Non-Hispanic White' The most similar known value is 'White Non-Hispanic' - |
-
Annotate data with warnings.
- -If you use this value/record,
here's what you need to know!
... this is what Eve saw:
-A brief digression...
+An assumption tied to a fragment of the dataset.
+If the assumption is wrong, so is the fragment.
+One database $D$
- -Each query gets one answer $R \leftarrow Q(D)$
-Multiple possible databases $D \in \mathcal D$
-(possible worlds)
-Queries get a set of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$
-Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$
-Uncertain tuples exist in at least one,
but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$
(not limited to set semantics)
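The certain/uncertain definitions above can be sketched in a few lines of Python. This is only an illustration under set semantics (the slide notes the idea is not limited to sets); the two possible worlds, the `query` function, and all names below are hypothetical and do not come from Vizier.

```python
from functools import reduce


def certain(possible_results):
    """Tuples that appear in the query result of every possible world."""
    results = [set(r) for r in possible_results]
    return reduce(set.intersection, results) if results else set()


def uncertain(possible_results):
    """Tuples that appear in at least one, but not all, possible worlds."""
    results = [set(r) for r in possible_results]
    everything = set().union(*results)
    return everything - certain(possible_results)


# Hypothetical possible worlds: the second row has two candidate repairs.
worlds = [
    {(1, "a"), (2, "b")},
    {(1, "a"), (2, "c")},
]


def query(db):
    """Hypothetical query Q: keep rows whose key is at most 2."""
    return {row for row in db if row[0] <= 2}


# R = { Q(D) | D in worlds }
possible_answers = [query(d) for d in worlds]
certain_rows = certain(possible_answers)
uncertain_rows = uncertain(possible_answers)
```

Enumerating worlds like this is only feasible for toy inputs; the number of possible worlds generally grows exponentially with the number of uncertain elements.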
-A caveat is an assumption tied to one or more data elements (cells or rows).
-If the assumption is wrong, so is the element.
-An element has a caveat → The element is uncertain.
- -... and btw, here's why.
-
- SELECT setting_1, setting_2, estimate
- FROM Simulation;
-
-
- We want to indicate that the estimate column is only accurate if (for example) P ≠ NP.
-caveat(value, assumption)
- -returns value, annotated with assumption.
-
- SELECT setting_1, setting_2,
- caveat(estimate, 'Only correct if P ≠ NP')
- AS estimate
- FROM Simulation;
-
- annotation is just a human-readable string.
-- caveat() creates 2 sets of possible worlds: -
Mark multi-valued buckets (key repair).
-
- SELECT bucket,
- CASE WHEN bucket_size > 1 THEN
- caveat(reading, 'Picked between two bucket values.')
- ELSE reading END AS reading
- FROM (
- SELECT CAST(time / 10 AS int) AS bucket,
- FIRST(reading) AS reading,
- COUNT(*) AS bucket_size
- FROM sensor
- GROUP BY bucket
- )
-
- Interpolation is more complex... but similar.
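The key-repair query above can be mimicked outside the database. The following is a minimal Python sketch, not Vizier's implementation: the `Caveated` class and `bucket_readings` function are hypothetical names, the bucket width of 10 mirrors `CAST(time / 10 AS int)`, and the sample rows are the first four readings of the first series shown earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Caveated:
    """A value plus the human-readable assumptions attached to it (hypothetical)."""
    value: object
    caveats: list = field(default_factory=list)


def bucket_readings(rows, width=10):
    """Group (time, reading) pairs into buckets and keep the first reading.

    Buckets that held more than one reading get a caveat, mirroring the
    CASE WHEN bucket_size > 1 branch of the SQL.
    """
    buckets = {}
    for time, reading in rows:
        buckets.setdefault(time // width, []).append(reading)

    out = {}
    for bucket, readings in buckets.items():
        caveats = []
        if len(readings) > 1:
            caveats.append("Picked between two bucket values.")
        out[bucket] = Caveated(readings[0], caveats)
    return out


rows = [(1575731001, 0), (1575731014, 0), (1575731030, 0), (1575731035, 0)]
result = bucket_readings(rows)
```

Here the readings at 1575731030 and 1575731035 fall into the same bucket, so only that bucket carries a caveat.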
-Mark unexpected values the model wasn't trained on.
SELECT
CASE WHEN race_ethnicity
- IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
- THEN race_ethnicity
+ NOT IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
- ELSE caveat(race_ethnicity,
+ THEN caveat(race_ethnicity,
'Unexpected race_ethnicity: ' & race_ethnicity)
+ ELSE race_ethnicity
+
END, /* ... */
FROM R
- This check can be automated.
- SELECT /* ... */,
- CASE WHEN CAST(salary AS float) IS NULL THEN
- caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
-
- ELSE CAST(salary AS float) END AS salary
- FROM raw_csv_data;
+
+
+
+ caveat(race_ethnicity,
+ 'Unexpected race_ethnicity: ' & race_ethnicity)
+
+
+
+
+
Another brief digression...
-
- Provenance in Databases: Why, How, and Where
- James Cheney, Laura Chiticariu and Wang-Chiew Tan
-
- MONDRIAN: Annotating and Querying Databases through Colors and Blocks.
- Floris Geerts, Anastasios Kementsietsidis, Diego Milano
-
and more...
-
- CREATE VIEW Q AS
- SELECT R.A AS X,
- R.B+R.C AS Y
- FROM R
+
+ CASE WHEN /*...*/
+
+
+ THEN caveat(race_ethnicity,
+ 'Unexpected race_ethnicity: ' & race_ethnicity)
+
+ ELSE race_ethnicity
+
+
+
- - $$annot(\texttt{Q.X}[i]) \leftarrow annot(\texttt{R.A}[i])$$ -
-- $$annot(\texttt{Q.Y}[i]) \leftarrow annot(\texttt{R.B}[i]) \cup annot(\texttt{R.C}[i])$$ -
- CREATE VIEW Q AS
- SELECT R.A AS X,
- SUM(R.B) AS Y
- FROM R
-
- - $$annot(\texttt{Q.X}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = Q.A[i]} annot(\texttt{R.A}[j])$$ -
-- $$annot(\texttt{Q.Y}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.B}[j] = Q.B[i]} annot(\texttt{R.B}[j])$$ -
-... not the semantics we want
+ +
+ →
+ |
+ ||
↑ | ++ | ↑ | +
Assumptions | +→ | +Assumptions | +
- Caveats on $\texttt{R.A}$ also affect $\texttt{Q.B}$. -
+Can twiddling the caveatted value change the output?
+ +$C \leftarrow (a \times X) + Y$
+Caveats on $X$ and $Y$ propagate to $C$
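As a rough illustration of this propagation rule, the sketch below evaluates $C \leftarrow (a \times X) + Y$ over caveatted values in Python. The `Caveated` class, the `linear` function, and the caveat strings are hypothetical, and treating `a == 0` as "X cannot change the output" is an assumption made here to illustrate the "twiddling" test, not a description of Vizier's actual analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caveated:
    """A numeric value with the set of caveats attached to it (hypothetical)."""
    value: float
    caveats: frozenset = frozenset()


def linear(a, x, y):
    """Compute C = a*X + Y and propagate only the caveats that can affect C."""
    caveats = set(y.caveats)   # changing Y always changes C
    if a != 0:                 # changing X changes C only when a is non-zero
        caveats |= x.caveats
    return Caveated(a * x.value + y.value, frozenset(caveats))


x = Caveated(3.0, frozenset({"X was interpolated"}))
y = Caveated(1.0, frozenset({"Y came from a repaired key"}))

c = linear(2.0, x, y)    # both inputs can influence the result
c0 = linear(0.0, x, y)   # X is multiplied away, so its caveat is dropped
```

With `a = 2.0` the result carries both caveats; with `a = 0.0` only the caveat on `y` survives, because no change to `x` could alter the output.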
Certain Data Elements: Elements guaranteed to be in the result in all possible worlds.
- -... i.e., elements unaffected by the choice of possible world.
+If a caveatted element can't affect an output element, don't propagate its caveats!
-Propagate caveats to any data elements that could be affected by a change in assumptions.
++
+Challenge: How do we propagate caveats
without penalizing query evaluation?
Don't!
-Is a value caveatted?
+≡ Certain answers in incomplete databases
+(coNP-complete)
≅ computing certain answers! (CoNP-Complete)
-✔ | -SQL |
✔ | -R (sort of) |
🗶 | -Spreadsheets |
🗶 | -Python |
The Exception That Improves The Rule
@@ -1022,7 +522,7 @@
Ok... so we have an edit history in DDL/DML.
+This gives us an edit history in DDL/DML.
- $> pip3 install --user vizier-webapi
- $> vizier
-
- Students | -||||
---|---|---|---|---|
- ![]() Poonam |
-
- ![]() Will |
-
- ![]() Aaron |
-
Dev | -
---|
- ![]() Mike |
-
Alumni | -||||||
---|---|---|---|---|---|---|
- ![]() Ying |
-
- ![]() Niccolò |
-
- ![]() Arindam |
-
- ![]() Shivang |
-
- ![]() Olivia |
-
- ![]() Lisa |
-
- ![]() Gourab |
-
External Collaborators | -|||||
---|---|---|---|---|---|
- Zhen Hua Liu (Oracle) - |
-
- Ying Lu (Oracle) - |
-
- Beda Hammerschmidt (Oracle) - |
-
- Boris Glavic (IIT) - |
-
- Su Feng (IIT) - |
-
- Juliana Freire (NYU) - |
-
- Heiko Mueller (NYU) - |
-
- Sonia Castelo Quispe (NYU) - |
-
- Carlos Bautista (NYU) - |
-
- Remi Rampin (NYU) - |
-
Vizier is supported by NSF Awards ACI-1640864 and #IIS-1750460 and gifts from Oracle
-
+ $> pip3 install --user vizier-webapi
+ $> vizier
+
Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460 and gifts from Oracle
+ + +