<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Safe, Reusable Heuristic Data Transformation through Caveats</title>
<meta name="description" content="Safe, Reusable Heuristic Data Transformation through Caveats">
<meta name="author" content="Oliver Kennedy">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../../reveal.js-3.7.0/css/reveal.css">
<link rel="stylesheet" href="ubodin.css" id="theme">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../../reveal.js-3.7.0/lib/css/zenburn.css">
<style type="text/css">
.reveal .slides section .fragment.growbig {
opacity: 1;
visibility: inherit; }
.reveal .slides section .fragment.growbig.visible {
-webkit-transform: scale(7);
transform: scale(7); }
</style>
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../../reveal.js-3.7.0/css/print/pdf.css' : '../reveal.js-3.7.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../reveal.js-3.5.0/lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<div class="header">
<!-- Any Talk-Specific Header Content Goes Here -->
<center>
<a href="http://www.buffalo.edu" target="_blank">
<img src="../graphics/logos/ub-1line-ro-white.png" height="20"/>
</a>
</center>
</div>
<div class="footer">
<!-- Any Talk-Specific Footer Content Goes Here -->
<div style="float: left; margin-top: 15px; ">
Exploring <u><b>O</b></u>nline <u><b>D</b></u>ata <u><b>In</b></u>teractions
</div>
<a href="https://odin.cse.buffalo.edu" target="_blank">
<img src="../graphics/logos/odin-1line-white.png" height="40" style="float: right;"/>
</a>
</div>
<div class="slides">
<!-- Any section element inside of this container is displayed as a slide -->
<section>
<!-- Credits... introduce everyone, etc... -->
<section>
<h3>
<img src="graphics/vizier-blue.svg" height="100px" style="vertical-align: middle; margin-right: 20px;" />
<span style="vertical-align: middle;" >VizierDB</span>
</h3>
<hr/>
<h3>Safe, Reusable Heuristic Data Transformation</h3>
<h4 style="font-weight: bold;">(through Caveats)</h4>
<hr/>
<h4>Oliver Kennedy
<a href="mailto:okennedy@buffalo.edu" style="margin-left: 50px;">okennedy@buffalo.edu</a></h4>
</section>
<!--
Let me tell you a story...
- Alice/Bob heuristic alignment
- Carol/Dave data changes
- Eve/HAL data ingestion
-->
<section>
<h2>Story Time!</h2>
</section>
</section>
<section>
<section>
<h3>Act 1</h3>
<p>Alice wants to analyze two unaligned time series.</p>
</section>
<section>
<table style="font-size: 60%; display: inline; padding: 50px;">
<tr><th>Time</th><th>Reading</th></tr>
<tr><td>1575731001</td><td>0</td></tr>
<tr><td>1575731014</td><td>0</td></tr>
<tr><td>1575731030</td><td>0</td></tr>
<tr><td>1575731035</td><td>0</td></tr>
<tr><td colspan="2">...</td></tr>
<tr><td>1575731219</td><td>1</td></tr>
<tr><td>1575731229</td><td>1</td></tr>
<tr><td>1575731240</td><td>1</td></tr>
</table>
<table style="font-size: 60%; display: inline; padding: 50px;">
<tr><th>Time</th><th>Reading</th></tr>
<tr><td>1575731011</td><td>0</td></tr>
<tr><td>1575731020</td><td>0</td></tr>
<tr><td>1575731031</td><td>0</td></tr>
<tr><td>1575731039</td><td>0</td></tr>
<tr><td colspan="2">...</td></tr>
<tr><td>1575731218</td><td>1</td></tr>
<tr><td>1575731228</td><td>1</td></tr>
<tr><td>1575731237</td><td>1</td></tr>
</table>
<p class="fragment">Step 1: Line up the readings</p>
</section>
<section>
<h3>Option 1: Do it right</h3>
</section>
<section>
<img src="graphics/timeseries.png" height="500px" style="display: inline; vertical-align: middle;" />
<table style="display: inline; vertical-align: middle;">
<tr>
<td style="padding-bottom: 50px;" class="fragment">
Lots of active research efforts!
</td>
</tr>
<tr>
<td style="padding-top: 50px;" class="fragment">
... but Alice is trying to GSD!
</td>
</tr>
</table>
</section>
<section>
<h3>Alice's Observations</h3>
<ul>
<li>Readings every ~10s</li>
<li>Readings are binary</li>
<li>Readings are incredibly stable</li>
</ul>
</section>
<section>
<pre><code class="sql">
INSERT INTO series_one_buckets
SELECT CAST(time / 10 AS int) AS bucket,
FIRST(reading)
FROM series_one
GROUP BY bucket;
</code></pre>
<p class="fragment">Interpolate missing values</p>
<p class="fragment">Hand-tune around the switchover as needed</p>
</section>
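<section data-markdown>
<script type="text/template">
The bucketing query and the interpolation step can be sketched in plain Python. This is only an illustration: the sample readings, the function names, and the choice to interpolate by carrying the previous bucket's reading forward are assumptions, not Alice's actual workflow.

```python
# Hypothetical sample data: (unix_time, reading) pairs like Alice's first series.
series_one = [
    (1575731001, 0),
    (1575731014, 0),
    (1575731030, 0),
    (1575731035, 0),
]

def bucket_readings(series, width=10):
    """Group readings into fixed-width buckets, keeping the first reading seen."""
    buckets = {}
    for time, reading in sorted(series):
        bucket = time // width               # same role as CAST(time / 10 AS int)
        buckets.setdefault(bucket, reading)  # FIRST(reading): keep the earliest
    return buckets

def interpolate(buckets):
    """Fill empty buckets by carrying the previous bucket's reading forward."""
    filled = {}
    previous = None
    for bucket in range(min(buckets), max(buckets) + 1):
        previous = buckets.get(bucket, previous)
        filled[bucket] = previous
    return filled

buckets = bucket_readings(series_one)
filled = interpolate(buckets)
```

Bucket 157573102 has no reading of its own, so its entry in `filled` is a guess, and two readings share bucket 157573103, so only the first one survives.
</script>
</section>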
<section>
<p>Time taken: &lt; 30 minutes</p>
</section>
<section>
<img src="graphics/woman-reading-book-on-beach.svg" height="400px" />
<attribution>FreeSVG.org</attribution>
</section>
<section>
<h3>Enter Bob...</h3>
<p class="fragment"><u>Similar</u> analysis...</p>
<p class="fragment">... <u>different</u> data</p>
<p class="fragment">Can Bob re-use Alice's prep+analytics workflow?</p>
</section>
<section>
<h3>Maybe?</h3>
<ul>
<li class="fragment">Are readings still every ~10s?</li>
<li class="fragment">Is the data still binary?</li>
<li class="fragment">Is the data still (relatively) stable?</li>
</ul>
<p class="fragment">... and even then, some manual effort is needed!</p>
</section>
<section>
<p>Bob needs to know Alice's assumptions<br/>(and how to use the workflow).</p>
</section>
</section>
<section>
<section>
<h3>Act 2</h3>
<p>Carol gets a dataset from Dave</p>
</section>
<section>
<img src="graphics/female-computer-user.svg" height="100px"/><br/>
<span class="fragment" data-fragment-index="1"></span><br/>
<img src="graphics/Binary-file-20110715.svg" height="100px" style="vertical-align: middle;">
<span class="fragment" data-fragment-index="1">
<img src="graphics/Prismatic-Cloud-Gears-2.svg" height="100px" style="vertical-align: middle;"/>
</span>
<span class="fragment" data-fragment-index="2">
<img src="graphics/moneybag.svg" height="100px" style="vertical-align: middle;">
</span>
<attribution>FreeSVG.org</attribution>
</section>
<section>
<p>Dave adds new data to the dataset!</p>
<p>Can Carol re-use her workflow?</p>
</section>
<section>
<h3>Maybe?</h3>
<ul>
<li class="fragment">Did the data dictionary change?</li>
<li class="fragment">Did new errors get introduced?</li>
</ul>
</section>
<section>
<p>Carol needs to remember her assumptions about the data and trust that the new data is like the old data</p>
</section>
</section>
<section>
<section>
<h3>Act 3</h3>
<p>Eve needs to load a CSV file</p>
<img src="graphics/Binary-file-20110715.svg" height="100px" style="vertical-align: middle;">
<img src="graphics/db.svg" height="100px" style="vertical-align: middle;">
<attribution>FreeSVG.org</attribution>
</section>
<section>
<h3>Scenario 1</h3>
<div>
<img src="graphics/HAL9000_iconic_eye.svg" height="150px">
<p class="fragment" style="font-family: monospace;">
I'm sorry, I can't do that, Eve.<br/>
</p>
<p class="fragment" style="font-family: monospace;">
You have a non-numerical value at position 1252538:24.
</p>
</div>
<attribution>FreeSVG.org</attribution>
</section>
<section>
<h3>Scenario 2</h3>
<div>
<img src="graphics/HAL9000_iconic_eye.svg" height="150px">
<p class="fragment" style="font-family: monospace;" data-fragment-index="1">
Load Successful!
</p>
<div class="fragment growbig" data-fragment-index="3">
<p class="fragment" style="font-family: monospace; font-size: 10%;" data-fragment-index="2">
(btw, 175326 records didn't load)
</p>
</div>
</div>
</section>
<section>
<p>Heuristics only work <b>most</b> of the time.</p>
</section>
</section>
<section>
<!--
Problem: Documentation Disconnected from Data
Solution: Annotate/propagate data
Formally Define Caveats
- Possible Worlds / Heuristic Choices
- Link to IDBs/PDBs
- Pick One World + Mark "Uncertain" Values
- Ex on TI-/BI-DB/CTables
- Emphasize: Don't need to be able to enumerate all worlds:
- Just need one world and the ability to decide what is certain
- Introduce the Caveat function
- Examples:
- 3 examples above
- Re-emphasize that enumeration is not required
- Propagation Overview
-->
<section>
<p>Data science is <span style="color: lightgrey;">nuanced</span>.</p>
<p class="fragment">Assumptions can't be avoided!</p>
<p class="fragment">It's easy to miss an assumption when re-using work.</p>
</section>
<section>
<img src="graphics/data_error.png">
<attribution><a href="https://xkcd.com/2239/">https://xkcd.com/2239/</a></attribution>
</section>
<section>
<h3>Wouldn't it be nice if...</h3>
<img src="graphics/montoya.jpeg" height="400px" />
</section>
<section>
<h3>Wouldn't it be nice if...</h3>
<p>... this is what Bob saw:</p>
<img src="graphics/time_series_with_errors.svg" />
</section>
<section>
<h3>Wouldn't it be nice if...</h3>
<p>... this is what Carol saw:</p>
<table>
<tr>
<td style="
color: rgb(251, 189, 8);
background-color: #eed;
text-decoration: none;
text-decoration-color: rgb(251, 189, 8);
text-decoration-line: none;
text-decoration-style: solid;
vertical-align: middle;
border-radius: 15px 0px 0px 15px;
font-size: 150%"></td>
<td style="
font-size: 70%;
background-color: #eee;
vertical-align: middle;
border-radius: 0px 15px 15px 0px;
padding: 20px;">
The data included an unexpected value: <b>'Non-Hispanic White'</b><br/>The most similar known value is <b>'White Non-Hispanic'</b>
</td>
</tr>
</table>
</section>
<section>
<p>Annotate data with warnings.</p>
<p class="fragment" data-fragment-index="1">If you use this value/record, <br/>here's what you need to know!</p>
<h3 class="fragment" data-fragment-index="2">Caveat Physicus</h3>
</section>
<section>
<h3>Why?</h3>
<h4 class="fragment" data-fragment-index="1">Propagation</h4>
<dl>
<dd class="fragment" data-fragment-index="2" style="margin-left: -20px;">Caveats...</dd>
<div class="fragment" data-fragment-index="2">
<dt>... can go where the data goes</dt>
<dd>Derived values retain caveats on source data.</dd>
</div>
<div class="fragment" data-fragment-index="3">
<dt>... stop where the data stops</dt>
<dd>Irrelevant caveats don't get propagated</dd>
</div>
</dl>
</section>
<section>
<h3>Wouldn't it be nice if...</h3>
<p>... this is what Eve saw:</p>
<img src="graphics/caveat-spreadsheet.png"/>
</section>
</section>
<section>
<section>
<h3>What is a Caveat?</h3>
<p class="fragment">A brief digression...</p>
</section>
<section>
<h3>Classical Databases</h3>
<p class="fragment">One database $D$</p>
<p class="fragment">Each query gets one answer $R \leftarrow Q(D)$</p>
</section>
<section>
<h3>Incomplete Databases</h3>
<p class="fragment">Multiple <u>possible</u> databases $D \in \mathcal D$</p>
<p class="fragment">(possible worlds)</p>
<p class="fragment">Queries get a <u>set</u> of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$</p>
</section>
<section>
<p class="fragment"><b>Certain</b> tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$</p>
<p class="fragment"><b>Uncertain</b> tuples exist in at least one, <br/>but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$</p>
<p style="font-size: 70%;" class="fragment">(not limited to set semantics)</p>
</section>
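<section data-markdown>
<script type="text/template">
The two definitions above translate directly into set operations. A minimal sketch, assuming each possible world's query result is modeled as a Python set of tuples (the sample results are made up):

```python
from functools import reduce

# Each possible world yields one query result, modeled here as a set of tuples.
possible_results = [
    {("alice", 1), ("bob", 0)},
    {("alice", 1), ("bob", 1)},
    {("alice", 1)},
]

def certain(results):
    """Tuples present in every possible result."""
    return reduce(set.intersection, results)

def uncertain(results):
    """Tuples present in at least one possible result, but not in all of them."""
    return reduce(set.union, results) - certain(results)
```

Note that this sketch enumerates every possible world, which is rarely practical for real data.
</script>
</section>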
<section>
<p>A caveat is an assumption tied to one or more data elements (cells or rows).</p>
<p>If the assumption is wrong, so is the element.</p>
</section>
<section>
<h3>Alice / Bob</h3>
<ul>
<li><span style="font-family: monospace;">FIRST</span> may not pick the right value for a bucket with 2+ distinct values.</li>
<li>Interpolation may not pick the right value for a bucket with 0 values.</li>
</ul>
</section>
<section>
<h3>Carol / Dave</h3>
<ul>
<li>The model hyperparameters may not work if the data changes too significantly.</li>
<li>New values could indicate new data errors that Carol's ingest script hasn't accounted for.</li>
</ul>
</section>
<section>
<h3>Eve / HAL</h3>
<ul>
<li>Replacing a parse error with a NULL might not be what Eve expects.</li>
</ul>
</section>
<section>
<p>An element has a caveat → The element is uncertain.</p>
<p class="fragment">... and btw, here's why.</p>
</section>
</section>
<section>
<!--
Vizier
- Reproducibility-Focused Notebook
- Scripting
- Spreadsheets
- Point+Click
- Key Feature: Caveats
- Demo
-->
<section>
<h3>Caveats</h3>
<ol>
<li style="color: lightgrey;">Story Time</li>
<li style="color: lightgrey;">What is a Caveat?</li>
<li class="fragment grow highlight-blue">The Vizier Notebook</li>
<li>Applying Caveats</li>
<li>Propagating Caveats</li>
<li>Caveats Beyond SQL</li>
</ol>
</section>
<section>
<h1><a href="http://127.0.0.1:5000/vizier-db/api/v1/web-ui/vizier-db" target="_blank">Demo</a></h1>
</section>
</section>
<section>
<section>
<h3>Caveats</h3>
<ol>
<li style="color: lightgrey;">Story Time</li>
<li style="color: lightgrey;">What is a Caveat?</li>
<li style="color: lightgrey;">The Vizier Notebook</li>
<li class="fragment grow highlight-blue">Applying Caveats</li>
<li>Propagating Caveats</li>
<li>Caveats Beyond SQL</li>
</ol>
</section>
<section>
<pre><code class="sql">
SELECT setting_1, setting_2, estimate
FROM Simulation;
</code></pre>
<p>We want to indicate that the estimate column is only accurate if (for example) P ≠ NP.</p>
</section>
<section>
<p style="font-family: monospace;">caveat(value, assumption)</p>
<p>returns <span style="font-family: monospace;">value</span>, annotated with <span style="font-family: monospace; ">assumption</span>.</p>
</section>
<section>
<pre><code class="sql">
SELECT setting_1, setting_2,
caveat(estimate, 'Only correct if P ≠ NP')
AS estimate
FROM Simulation;
</code></pre>
<p><span style="font-family: monospace;">assumption</span> is just a human-readable string.</p>
</section>
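<section data-markdown>
<script type="text/template">
One way to picture what `caveat(value, assumption)` returns is a value that carries its assumptions along with it. The `Caveated` class and this Python `caveat` helper are hypothetical illustrations, not Vizier's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Caveated:
    """A value bundled with the assumptions it depends on (hypothetical type)."""
    value: object
    assumptions: frozenset = frozenset()

def caveat(value, assumption):
    """Return `value`, annotated with one more human-readable assumption."""
    if isinstance(value, Caveated):
        return Caveated(value.value, value.assumptions | {assumption})
    return Caveated(value, frozenset({assumption}))

row = {"setting_1": 3, "setting_2": 7, "estimate": 0.42}
row["estimate"] = caveat(row["estimate"], "Only correct if P ≠ NP")
```

The underlying value is unchanged; only the annotation is new, and applying `caveat` again adds a second assumption to the same value.
</script>
</section>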
<section>
<h3>Incomplete Databases</h3>
<p>
<span style="font-family: monospace;">caveat()</span> creates 2 sets of possible worlds:
</p>
<ul>
<li>The assumption holds: <span style="font-family: monospace;">value</span> is correct.</li>
<li>The assumption does not hold: <span style="font-family: monospace;">value</span> is unknown.</li>
</ul>
</section>
<section>
<h3>Alice / Bob</h3>
<p>Mark multi-valued buckets <span class="fragment">(key repair).</span></p>
<pre><code class="sql" data-line-numbers="2-3">
SELECT bucket,
CASE WHEN bucket_size > 1 THEN
caveat(reading, 'Picked between multiple bucket values.')
ELSE reading END AS reading
FROM (
SELECT CAST(time / 10 AS int) AS bucket,
FIRST(reading) AS reading,
COUNT(*) AS bucket_size
FROM sensor
GROUP BY bucket
);
</code></pre>
<p class="fragment">Interpolation is more complex... but similar.</p>
</section>
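<section data-markdown>
<script type="text/template">
The same key-repair marking can be sketched without a database. The readings and the `bucket_with_caveats` helper below are hypothetical; the caveat is modeled as an optional message next to the chosen reading.

```python
from collections import defaultdict

# Hypothetical sensor readings: two land in the same 10-second bucket and disagree.
sensor = [(1575731219, 1), (1575731211, 0), (1575731229, 1)]

def bucket_with_caveats(readings, width=10):
    """Return {bucket: (reading, message or None)} using the FIRST-style heuristic."""
    groups = defaultdict(list)
    for time, reading in sorted(readings):
        groups[time // width].append(reading)
    result = {}
    for bucket, values in groups.items():
        message = None
        if len(values) > 1:  # mirrors bucket_size > 1 in the query
            message = f"Picked between {len(values)} bucket values."
        result[bucket] = (values[0], message)
    return result

bucketed = bucket_with_caveats(sensor)
```

Only the bucket with more than one reading gets a message; single-reading buckets pass through unannotated.
</script>
</section>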
<section>
<h3>Carol / Dave</h3>
<p>Mark unexpected values the model wasn't trained on.</p>
<pre><code class="sql">
SELECT
CASE WHEN race_ethnicity
IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
THEN race_ethnicity
ELSE caveat(race_ethnicity,
'Unexpected race_ethnicity: ' & race_ethnicity)
END, /* ... */
FROM R
</code></pre>
<p class="fragment">This check can be automated.</p>
</section>
<section>
<h3>Eve / HAL</h3>
<pre><code class="sql">
SELECT /* ... */,
CASE WHEN CAST(salary AS float) IS NULL THEN
caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
ELSE CAST(salary AS float) END AS salary
FROM raw_csv_data;
</code></pre>
</section>
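<section data-markdown>
<script type="text/template">
A loader can follow the same pattern: substitute a null when a field fails to parse, but record why. A minimal sketch, assuming a hypothetical `parse_salary` helper and made-up input values:

```python
def parse_salary(raw):
    """Return (value, caveat_message); the message is None when parsing succeeds."""
    try:
        return float(raw), None
    except (TypeError, ValueError):
        # Same fallback as the query: substitute NULL, but record why.
        return None, f"Could not cast [ {raw} ] to float."

raw_csv_data = ["52000", "61000.50", "n/a"]
parsed = [parse_salary(raw) for raw in raw_csv_data]
skipped = [message for _, message in parsed if message is not None]
```

Every record loads, and `skipped` keeps one message per value that had to be replaced.
</script>
</section>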
</section>
<section>
<section>
<h3>Caveats</h3>
<ol>
<li style="color: lightgrey;">Story Time</li>
<li style="color: lightgrey;">What is a Caveat?</li>
<li style="color: lightgrey;">The Vizier Notebook</li>
<li style="color: lightgrey;">Applying Caveats</li>
<li class="fragment grow highlight-blue">Propagating Caveats</li>
<li>Caveats Beyond SQL</li>
</ol>
</section>
</section>
<section>
<section>
<h3>Has anyone asked about "where" provenance?</h3>
<p class="fragment">Another brief digression...</p>
</section>
<section>
<h3>Value Annotations</h3>
<p style="margin-top: 50px; font-size: 70%;" class="fragment" data-fragment-index="1">
<b>Provenance in Databases: Why, How, and <u>Where</u></b><br/>
James Cheney, Laura Chiticariu and Wang-Chiew Tan
</p>
<p style="margin-top: 50px; font-size: 70%;" class="fragment" data-fragment-index="2">
<b>MONDRIAN: Annotating and Querying Databases through Colors and Blocks.</b><br/>
Floris Geerts, Anastasios Kementsietsidis, Diego Milano
</p>
<p class="fragment" data-fragment-index="2">and more...</p>
</section>
<section>
<h3>Value Annotations</h3>
<pre><code class="sql">
CREATE VIEW Q AS
SELECT R.A AS X,
R.B+R.C AS Y
FROM R
</code></pre>
<p class="fragment" style="font-size: 70%">
$$annot(\texttt{Q.X}[i]) \leftarrow annot(\texttt{R.A}[i])$$
</p>
<p class="fragment" style="font-size: 70%">
$$annot(\texttt{Q.Y}[i]) \leftarrow annot(\texttt{R.B}[i]) \cup annot(\texttt{R.C}[i])$$
</p>
</section>
<section>
<h3>Value Annotations</h3>
<pre><code class="sql">
CREATE VIEW Q AS
SELECT R.A AS X,
SUM(R.B) AS Y
FROM R
GROUP BY R.A
</code></pre>
<p class="fragment" style="font-size: 70%">
$$annot(\texttt{Q.X}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = \texttt{Q.X}[i]} annot(\texttt{R.A}[j])$$
</p>
<p class="fragment" style="font-size: 70%">
$$annot(\texttt{Q.Y}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = \texttt{Q.X}[i]} annot(\texttt{R.B}[j])$$
</p>
<p class="fragment">... not the semantics we want</p>
</section>
<section>
<p>
Caveats on $\texttt{R.A}$ also affect $\texttt{Q.Y}$.
</p>
</section>
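<section data-markdown>
<script type="text/template">
To see why a caveat on the grouping column reaches the aggregate, compare two possible worlds that differ only in one `A` value. The data and the `group_sum` helper are made up for illustration.

```python
from collections import defaultdict

def group_sum(rows):
    """Sum B per distinct A, like a grouped SUM(B) over rows of (A, B)."""
    totals = defaultdict(int)
    for a, b in rows:
        totals[a] += b
    return dict(totals)

# World 1: the (hypothetically caveated) A value of the last row is taken at face value.
world_1 = [("x", 1), ("x", 2), ("y", 5)]
# World 2: the assumption behind that A value fails and the row belongs to group "x".
world_2 = [("x", 1), ("x", 2), ("x", 5)]

q1 = group_sum(world_1)
q2 = group_sum(world_2)
```

Every `B` value is identical in both worlds, yet the sum for group `x` differs, so an annotation scheme that only follows `B` would miss it.
</script>
</section>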
<section>
<h3>Caveats ≠ Value Annotations</h3>
</section>
</section>
<section>
<!--
Overview of Challenges:
- Propagation is Expensive
- Not Everyone uses SQL
- How to present Caveats?
Propagating Caveats is Expensive
- UI solution: Mark some values as "uncertain"
- Summarize Poonam's work
- Solution: 2 step propagation
- Step 1: Which values are affected by a caveat
- Step 2: Which caveats affect those values
-->
<section>
<p><b>Certain Data Elements: </b> Elements guaranteed to be in the result <u>in all possible worlds</u>.</p>
<p class="fragment">... i.e., elements unaffected by the choice of possible world.</p>
</section>
<section>
<p>If a caveatted element can't affect an output element, don't propagate its caveats!</p>
<p class="fragment">Propagate caveats to any data elements that could be affected by a change in assumptions.</p>
</section>
<section>
<p><b>Challenge: </b> How do we propagate caveats<br/>without penalizing query evaluation?</p>
<p class="fragment">Don't!</p>
</section>
<section>
<h3>Staged Caveat Discovery</h3>
<dl>
<dt>Alongside query evaluation...</dt>
<dd>Instrument queries to discover which elements are affected by a caveat.</dd>
<dt>After query evaluation...</dt>
<dd>Enumerate specific caveats affecting those elements.</dd>
</dl>
</section>
<section>
<h3>Marking Caveatted Elements</h3>
<img src="graphics/caveat-spreadsheet.png" width="800px" style="border: 1px solid grey;">
</section>
<section>
<h3>Enumerating Caveats</h3>
<img src="graphics/caveat-list.png" width="800px">
</section>
</section>
<section>
<!--
Propagating Caveat Markings
- Core question: If we change marked values in the input, what output values could change?
- ~= Determining certain answers in an IDB
- Worse: We can't enumerate all possible worlds!
- Conservative approximation (no false negatives)
- Guagliardo/Libkin Labeled NULLs
- ... but with no labels
- Borrow from UPenn UADB talk
- Analogous solution for values.
-->
<section>
<h3>Instrumenting Queries</h3>
<p class="fragment">≅ computing certain answers! (coNP-complete)</p>
</section>
<section>
<h3>Conservative Approximation</h3>
<div class="fragment" data-fragment-index="1">
<p style="margin-top: 20px; font-size: 60%;">
<b>Correctness of SQL Queries on Databases with Nulls.</b><br/>
Paolo Guagliardo, Leonid Libkin
</p>
<p style="margin-top: 20px; font-size: 60%;" class="fragment highlight-blue grow" data-fragment-index="4">
<b>Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers</b><br/>
Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy
</p>
</div>
<ul>
<li class="fragment" data-fragment-index="2">Unmarked rows are guaranteed to be caveat-free.</li>
<li class="fragment" data-fragment-index="3">Marked rows might not be caveatted.</li>
</ul>
</section>
<section>
<p>Add and maintain a binary "has caveat"<br/>column for each row/column.</p>
</section>
<section>
<pre><code class="sql">
CREATE VIEW survey_responses AS
SELECT language,
CASE WHEN CAST(salary AS float) IS NULL THEN
caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
ELSE CAST(salary AS float) END AS salary
FROM raw_csv_data;
</code></pre>
<div class="fragment">
becomes
<pre><code class="sql">
CREATE VIEW survey_responses AS
SELECT language, CAST(salary AS float) AS salary,
FALSE AS _caveat_field_language,
CAST(salary AS float) IS NULL AS _caveat_field_salary,
FALSE AS _caveat_row
FROM raw_csv_data;
</code></pre>
</div>
</section>
<section>
<pre><code class="sql">
SELECT salary
FROM survey_responses
WHERE language = 'Scala'
</code></pre>
<div class="fragment">
becomes
<pre><code class="sql">
SELECT salary,
_caveat_field_salary AS _caveat_field_salary,
_caveat_row OR _caveat_field_language AS _caveat_row
FROM survey_responses
WHERE language = 'Scala'
</code></pre>
</div>
</section>
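<section data-markdown>
<script type="text/template">
An instrumented selection can be tried out in SQLite from Python. This sketch assumes the marker columns are stored as 0/1 integers and that a row is marked when either the row itself or the column tested in the `WHERE` clause carries a caveat; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_responses (
        language TEXT, salary REAL,
        _caveat_field_language INTEGER,
        _caveat_field_salary INTEGER,
        _caveat_row INTEGER
    )
""")
conn.executemany(
    "INSERT INTO survey_responses VALUES (?, ?, ?, ?, ?)",
    [
        ("Scala", 90000.0, 0, 0, 0),   # nothing caveated
        ("Scala", None,    0, 1, 0),   # salary failed to parse
        ("Scala", 70000.0, 1, 0, 0),   # language itself is a guess
        ("Python", 80000.0, 0, 0, 0),  # filtered out by the WHERE clause
    ],
)

# The row flag picks up the caveat on the column used in the WHERE clause.
marked = conn.execute("""
    SELECT salary,
           _caveat_field_salary AS _caveat_field_salary,
           (_caveat_row OR _caveat_field_language) AS _caveat_row
    FROM survey_responses
    WHERE language = 'Scala'
    ORDER BY rowid
""").fetchall()
```

The third row is marked at the row level: if its `language` value is wrong, the row may not belong in the result at all.
</script>
</section>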
<section>
<pre><code class="sql">
SELECT AVG(salary) AS salary
FROM survey_responses
</code></pre>
<div class="fragment">
becomes
<pre><code class="sql">
SELECT AVG(salary) AS salary,
GROUP_OR(_caveat_field_salary
OR _caveat_row) AS _caveat_field_salary,
FALSE AS _caveat_row
FROM survey_responses
</code></pre>
</div>
</section>
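<section data-markdown>
<script type="text/template">
The aggregate rewrite can be approximated in SQLite as well. SQLite has no `GROUP_OR`, so this sketch assumes 0/1 marker columns and uses `MAX` in its place; the data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_responses (
        salary REAL, _caveat_field_salary INTEGER, _caveat_row INTEGER
    )
""")
conn.executemany(
    "INSERT INTO survey_responses VALUES (?, ?, ?)",
    [(90000.0, 0, 0), (70000.0, 0, 0), (None, 1, 0)],
)

# SQLite has no GROUP_OR; MAX over 0/1 flags plays the same role here.
avg_salary, salary_has_caveat, row_has_caveat = conn.execute("""
    SELECT AVG(salary) AS salary,
           MAX(_caveat_field_salary OR _caveat_row) AS _caveat_field_salary,
           0 AS _caveat_row
    FROM survey_responses
""").fetchone()
```

A single caveated input is enough to mark the average, while the output row itself is never marked because an aggregate without grouping always produces one row.
</script>
</section>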
<section>
<pre><code class="sql">
SELECT language, AVG(salary) AS salary
FROM survey_responses
GROUP BY language
</code></pre>
<div class="fragment">
... first we evaluate
<pre><code class="sql">
SELECT GROUP_OR(_caveat_field_language)
FROM survey_responses
</code></pre>
</div>
<p class="fragment">Can often be evaluated statically.</p>
</section>
<section>
<h3>If GROUP BY has caveats</h3>
<pre><code class="sql">
SELECT language, AVG(salary) AS salary,
FALSE AS _caveat_field_language,
TRUE AS _caveat_field_salary,
GROUP_AND(_caveat_field_language OR
_caveat_row) AS _caveat_row
FROM survey_responses
GROUP BY language
</code></pre>
</section>
<section>
<h3>If no GROUP BY caveats</h3>
<pre><code class="sql">
SELECT language, AVG(salary) AS salary,
FALSE AS _caveat_field_language,
GROUP_OR(_caveat_field_salary OR
_caveat_row) AS _caveat_field_salary,
GROUP_AND(_caveat_row) AS _caveat_row
FROM survey_responses
GROUP BY language
</code></pre>
</section>
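<section data-markdown>
<script type="text/template">
The two-step treatment of `GROUP BY` can be sketched in SQLite. As assumptions of this sketch, `MAX` and `MIN` over 0/1 flags stand in for `GROUP_OR` and `GROUP_AND`, the data is made up, and only the branch without caveats on the grouping column is run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_responses (
        language TEXT, salary REAL,
        _caveat_field_language INTEGER,
        _caveat_field_salary INTEGER,
        _caveat_row INTEGER
    )
""")
conn.executemany(
    "INSERT INTO survey_responses VALUES (?, ?, ?, ?, ?)",
    [
        ("Scala", 90000.0, 0, 0, 0),
        ("Scala", None,    0, 1, 0),
        ("Python", 80000.0, 0, 0, 0),
    ],
)

# Step 1: does any grouping value carry a caveat? (MAX stands in for GROUP_OR.)
(group_by_has_caveats,) = conn.execute(
    "SELECT MAX(_caveat_field_language) FROM survey_responses"
).fetchone()

# Step 2: no caveats on the grouping column, so each group is marked on its own.
# MIN over 0/1 flags stands in for GROUP_AND.
groups = conn.execute("""
    SELECT language, AVG(salary) AS salary,
           0 AS _caveat_field_language,
           MAX(_caveat_field_salary OR _caveat_row) AS _caveat_field_salary,
           MIN(_caveat_row) AS _caveat_row
    FROM survey_responses
    GROUP BY language
    ORDER BY language
""").fetchall()
```

Only the group that contains the unparseable salary has its average marked; the other group stays caveat-free.
</script>
</section>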
</section>
<section>
<!--
Enumerating Caveats
-->
<section>
<h3>Enumerating Caveats</h3>
<dl>
<dt>Static Analysis</dt>
<dd>Which caveats could possibly affect the element?</dd>
<dt>Dynamic Analysis</dt>
<dd>Which specific caveats affect the element?</dd>
</dl>
</section>
<section>
<h3>Static Analysis</h3>
<p>What calls to <span style="font-family: monospace;">caveat()</span> appear in the derivation of the specified element?</p>
<p class="fragment">Analogous to <i>program slicing</i>.</p>
</section>
<section>
<h3>Program Slicing</h3>
<p>Eliminate lines of code not relevant<br/>to computing a specific value.</p>
<p class="fragment">This is <i>exactly</i> what a database optimizer does.</p>
</section>
<section>
<p><b>Lookup: </b> Caveats on $\texttt{R.A}[i]$</p>
<pre class="fragment"><code class="sql">
SELECT A
FROM R
WHERE ROWID = i
</code></pre>
<p class="fragment">All calls to <span style="font-family: monospace;">caveat()</span> surviving optimization<br/> (probably) affect the target.</p>
</section>
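<section data-markdown>
<script type="text/template">
The static step can be pictured as a walk over the expression that derives the target column. This toy sketch uses a hypothetical tuple encoding of expressions; a real system would rely on the database optimizer instead.

```python
# Hypothetical expression trees: ("col", name), ("caveat", expr, message),
# or ("op", name, *argument_exprs).
view = {
    "A": ("caveat", ("col", "A"), "valid if B is within tolerances."),
    "C": ("col", "C"),
    "total": ("op", "+", ("col", "C"),
              ("caveat", ("col", "D"), "D was imputed.")),
}

def caveat_calls(expr):
    """Collect the messages of every caveat() call inside one expression."""
    kind = expr[0]
    if kind == "col":
        return []
    if kind == "caveat":
        return [expr[2]] + caveat_calls(expr[1])
    # kind == "op": recurse into the arguments, skipping the operator name
    found = []
    for argument in expr[2:]:
        found.extend(caveat_calls(argument))
    return found

def static_caveats(view, column):
    """Inspect only the expression deriving `column`; the rest is sliced away."""
    return caveat_calls(view[column])
```

Asking about column `C` finds no calls at all, because the expressions for `A` and `total` are never visited.
</script>
</section>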
<section>
<h3>Dynamic Analysis</h3>
<ol>
<li>For each call to <span style="font-family: monospace;">caveat()</span>, isolate a query to generate the message.</li>
<li>Union the message query results together.</li>
</ol>
</section>
<section>
<h3>Isolate the message</h3>
<pre><code class="sql">
WITH data_source AS (
SELECT caveat(A, 'valid if '& B &' is within tolerances.') AS A,
C, D, E
FROM R
)
SELECT A FROM data_source WHERE ROWID = i
</code></pre>
<p>becomes</p>
<pre><code class="sql">
SELECT 'valid if '& B &' is within tolerances.'
AS caveat_message
FROM R WHERE ROWID = i
</code></pre>
</section>
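<section data-markdown>
<script type="text/template">
The isolated message query is an ordinary query and can be run on its own. A sketch in SQLite, assuming a made-up table `R` and using SQLite's concatenation operator; `caveat_messages` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A REAL, B TEXT, C TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(1.5, "sensor-7", "x"), (2.5, "sensor-9", "y")],
)

def caveat_messages(conn, rowid):
    """Evaluate only the message expression of the caveat() call for one row."""
    return [
        message
        for (message,) in conn.execute(
            # || is SQLite's string concatenation operator.
            "SELECT 'valid if ' || B || ' is within tolerances.' AS caveat_message "
            "FROM R WHERE ROWID = ?",
            (rowid,),
        )
    ]

messages = caveat_messages(conn, 2)
```

With several `caveat()` calls there would be one such query per call, and their results would be unioned.
</script>
</section>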
</section>
<section>
<!--
Generalizing to Other Modalities
Spreadsheets
- Vizual
- Imperative (DDL-style)
- Re-enactment (DDL -> SQL)
- Targeting Rows
Python
- Open question
- Reference existing work on Python dependency tracking
-->
<section>
<h3>Caveats</h3>
<ol>
<li style="color: lightgrey;">Story Time</li>
<li style="color: lightgrey;">What is a Caveat?</li>
<li style="color: lightgrey;">The Vizier Notebook</li>
<li style="color: lightgrey;">Applying Caveats</li>
<li style="color: lightgrey;">Propagating Caveats</li>
<li class="fragment grow highlight-blue">Caveats Beyond SQL</li>
</ol>
</section>
<section>
<dl>
<div style="margin-top: 20px;" class="fragment">
<dt>Oliver</dt>
<dd>I have this great tool for tracking assumptions!</dd>
</div>
<div style="margin-top: 20px;" class="fragment">
<dt>Data Scientist</dt>
<dd>Super! How does it work?</dd>
</div>
<div style="margin-top: 20px;" class="fragment">
<dt>Oliver</dt>
<dd>Well you just write a SQL query...</dd>
</div>
<div style="margin-top: 20px;" class="fragment">
<dt>Data Scientist</dt>
<dd>...</dd>
</div>
</dl>
</section>
<section>
<h3>Caveats for the Masses</h3>
<table>
<tr>
<td class="fragment" style="color: green;" data-fragment-index="1">✔</td>
<td style="text-align: left;">SQL</td></tr>
<tr>
<td class="fragment" style="color: green;" data-fragment-index="2">✔</td>
<td style="text-align: left;">R <span class="fragment" data-fragment-index="2">(sort of)</span></td></tr>
<tr>
<td class="fragment" style="color: red;" data-fragment-index="3" >🗶</td>
<td style="text-align: left;" class="fragment highlight-blue grow" data-fragment-index="4">Spreadsheets</td></tr>
<tr>
<td class="fragment" style="color: red;" data-fragment-index="3" >🗶</td>
<td style="text-align: left;">Python</td></tr>
</table>
</section>
<section>
<p>
<b>The Exception That Improves The Rule</b><br/>
Juliana Freire, Boris Glavic, Oliver Kennedy, Heiko Mueller
</p>
</section>
<section>
<h3>Vizual</h3>
<p>Spreadsheet Operations → SQL DDL / SQL DML</p>
<dl>
<div class="fragment">
<dt>Edit Cell A3 to 'foo'</dt>
<dd style="font-family: monospace;">UPDATE R SET A = 'foo' WHERE ROWID = 3;</dd>
</div>
<div class="fragment">
<dt>Insert Row</dt>
<dd style="font-family: monospace;">INSERT INTO R() VALUES ();</dd>
</div>
<div class="fragment">
<dt>Insert Column `bar`</dt>
<dd style="font-family: monospace;">ALTER TABLE R ADD COLUMN `bar`;</dd>
</div>
</dl>
</section>
<section>
<p>Ok... so we have an edit history in DDL/DML.</p>
</section>
<section>
<h3>Caveats on DDL/DML</h3>
<div class="fragment">
<h4 style="margin-top: 50px;">DML → SQL</h4>
<p style="font-size: 70%">
<b>Using Reenactment to Retroactively Capture Provenance for Transactions</b><br/>
Bahareh Sadat Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan, Boris Glavic
</p>
</div>
<div class="fragment">
<h4 style="margin-top: 50px;">DDL → SQL</h4>
<p style="font-size: 70%">
<b>Graceful database schema evolution: the PRISM workbench</b><br/>
Carlo Curino, Hyun Jin Moon, Carlo Zaniolo
</p>
</div>
</section>
<section>
<pre><code class="sql">
UPDATE R SET A = 'foo' WHERE ROWID = 3;
</code></pre>
becomes
<pre><code class="sql">
SELECT CASE ROWID
WHEN 3 THEN 'foo'
ELSE A END AS A,
B, C, /* ... */
FROM R
</code></pre>
</section>
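<section data-markdown>
<script type="text/template">
The rewrite can be checked against a real update. This sketch uses SQLite and a made-up three-row table `R`; it compares the reenactment query with the table after the `UPDATE` has actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [("a", 1), ("b", 2), ("c", 3)])

# Reenactment: compute the post-update table as a query over the original table.
reenacted = conn.execute("""
    SELECT CASE ROWID WHEN 3 THEN 'foo' ELSE A END AS A, B
    FROM R ORDER BY ROWID
""").fetchall()

# Ground truth: actually apply the update, then read the table back.
conn.execute("UPDATE R SET A = 'foo' WHERE ROWID = 3")
updated = conn.execute("SELECT A, B FROM R ORDER BY ROWID").fetchall()
```

Because the edit is now a query, the same caveat propagation used for queries applies to it.
</script>
</section>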
<section>
<pre><code class="sql">
INSERT INTO R() VALUES ();
</code></pre>
becomes
<pre><code class="sql">
SELECT * FROM R
UNION ALL
SELECT NULL AS A, NULL AS B,
NULL AS C, /* ... */
</code></pre>
</section>
<section>
<pre><code class="sql">
ALTER TABLE R ADD COLUMN `bar`;
</code></pre>
becomes
<pre><code class="sql">
SELECT *, NULL as `bar` FROM R;
</code></pre>
</section>
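<section data-markdown>
<script type="text/template">
Row and column insertion can be reenacted the same way. A sketch in SQLite over a made-up two-column table `R`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [("a", 1), ("b", 2)])

# Insert Row: the original rows plus one all-NULL row.
with_new_row = conn.execute("""
    SELECT A, B FROM R
    UNION ALL
    SELECT NULL AS A, NULL AS B
""").fetchall()

# Insert Column: every row gains a NULL-valued column named bar.
with_new_column = conn.execute("SELECT *, NULL AS bar FROM R").fetchall()
```

Neither query modifies `R`; each describes what the spreadsheet would look like after the edit.
</script>
</section>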
</section>
<section>
<h3>
<img src="graphics/vizier-blue.svg" height="100px" style="vertical-align: middle; margin-right: 20px;" />
<span style="vertical-align: middle;" ><a href="https://vizierdb.info/">https://vizierdb.info</a></span>
</h3>
<pre style="margin-top: 50px;"><code class="bash">
$> pip3 install --user vizier-webapi
$> vizier
</code></pre>
</section>
<section>
<table style="display: inline-block; margin-right: 100px">
<tr>
<th colspan="5" style="font-size: 12pt">Students</th>
</tr>
<tr height="80px">
<td width="100px">
<img src="people/poonam.jpg" width="70px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Poonam<br/>(PhD-4Y)</p>
</td>
<td width="100px">
<img src="people/will.png" width="61px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Will<br/>(PhD-3Y)</p>
</td>
<td width="100px">
<img src="people/aaron.jpg" width="64px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Aaron<br/>(PhD-4Y)</p>
</td>
</tr>
</table>
<table style="display: inline-block; margin-left: 100px">
<tr>
<th colspan="1" style="font-size: 12pt">Dev</th>
</tr>
<tr>
<td width="100px">
<img src="people/mike.jpg" width="80px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Mike<br/>(Sr. Rsrch. Dev.)</p>
</td>
</tr>
</table>
<table style="display: inline-block;">
<tr>
<th colspan="7" style="font-size: 12pt">Alumni</th>
</tr>
<tr height="80px">
<td width="100px">
<img src="people/ying.jpg" width="60px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Ying<br/>(PhD 2017)</p>
</td>
<td width="100px">
<img src="people/niccolo.png" width="50px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Niccolò<br/>(PhD 2016)</p>
</td>
<td width="100px">
<img src="people/arindam.jpg" width="80px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Arindam<br/>(MS 2016)</p>
</td>
<td width="100px">
<img src="people/shivang.jpg" width="55px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Shivang<br/>(MS 2018)</p>
</td>
<td width="100px">
<img src="people/olivia.png" width="50px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Olivia<br/>(BS 2017)</p>
</td>
<td width="100px">
<img src="people/lisa.jpg" width="71px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Lisa<br/>(BS 2018)</p>
</td>
<td width="100px">
<img src="people/gourab.jpg" width="80px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Gourab<br/>(MS 2018)</p>
</td>
</tr>
</table>
<table>
<tr>
<th colspan="6" style="font-size: 12pt">External Collaborators</th>
</tr>
<tr>
<td width="130px" style="font-size: 10pt;">
Zhen Hua Liu<br/>(Oracle)
</td>
<td width="130px" style="font-size: 10pt;">
Ying Lu<br/>(Oracle)
</td>
<td width="130px" style="font-size: 10pt;">
Beda Hammerschmidt<br/>(Oracle)
</td>
<td width="140px" style="font-size: 10pt;">
Boris Glavic<br/>(IIT)
</td>
<td width="140px" style="font-size: 10pt;">
Su Feng<br/>(IIT)
</td>
</tr>
</table>
<table style="margin-top: 5px">
<tr>
<td width="140px" style="font-size: 10pt;">
Juliana Freire<br/>(NYU)
</td>
<td width="140px" style="font-size: 10pt;">
Heiko Mueller<br/>(NYU)
</td>
<td width="140px" style="font-size: 10pt;">
Sonia Castelo Quispe<br/>(NYU)
</td>
<td width="140px" style="font-size: 10pt;">
Carlos Bautista<br/>(NYU)
</td>
<td width="140px" style="font-size: 10pt;">
Remi Rampin<br/>(NYU)
</td>
</tr>
</table>
<p style="font-size: 10pt; text-decoration: underline;">Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460 and by gifts from Oracle.</p>
</section>
<section>
<section>
<h3>Optimizations</h3>
<dl>
<dt class="fragment" style="margin-top: 50px;">Too Much Information</dt>
<dd class="fragment">Limit the number of messages returned per call.</dd>
<dt class="fragment" style="margin-top: 50px;">Unions (on Spark) are Expensive</dt>
<dd class="fragment">Execute each query individually in parallel.</dd>
</dl>
</section>
<section>
<h3>Graffiti</h3>
<img src="graphics/waterfall-warm-graffiti.svg" height="400px" />
</section>
<section>
<h3>Shootings</h3>
<img src="graphics/waterfall-warm-shootings.svg" height="400px" />
</section>
</section>
</div></div>
<script src="../reveal.js-3.5.0/lib/js/head.min.js"></script>
<script src="../reveal.js-3.5.0/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../../reveal.js-3.7.0/plugin/svginline/data-src-svg.js' },
{ src: '../reveal.js-3.5.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.5.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.5.0/js/MathJax.js'
},
{ src: '../reveal.js-3.5.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.5.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
//{ src: '../reveal.js-3.5.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'tt code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../../reveal.js-3.7.0/plugin/highlight/highlight-9.16.2.js', async: true,
callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.5.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.5.0/plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>