<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Your notebook is not crumby enough, REPLace it</title>
<meta name="description" content="Your notebook is not crumby enough, REPLace it">
<meta name="author" content="Oliver Kennedy">
<meta name="author" content="Mike Brachmann">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../../reveal.js-3.7.0/css/reveal.css">
<link rel="stylesheet" href="ubodin.css" id="theme">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../../reveal.js-3.7.0/lib/css/zenburn.css">
<style type="text/css">
.reveal .slides section .fragment.growbig {
opacity: 1;
visibility: inherit; }
.reveal .slides section .fragment.growbig.visible {
-webkit-transform: scale(7);
transform: scale(7); }
</style>
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../../reveal.js-3.7.0/css/print/pdf.css' : '../reveal.js-3.7.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../reveal.js-3.5.0/lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<div class="header">
<!-- Any Talk-Specific Header Content Goes Here -->
<center>
<a href="http://www.buffalo.edu" target="_blank">
<img src="../graphics/logos/ub-1line-ro-white.png" height="20"/>
</a>
</center>
</div>
<div class="footer">
<!-- Any Talk-Specific Footer Content Goes Here -->
<div style="float: left; margin-top: 15px; ">
Exploring <u><b>O</b></u>nline <u><b>D</b></u>ata <u><b>In</b></u>teractions
</div>
<a href="https://odin.cse.buffalo.edu" target="_blank">
<img src="../graphics/logos/odin-1line-white.png" height="40" style="float: right;"/>
</a>
</div>
<div class="slides">
<!-- Any section element inside of this container is displayed as a slide -->
<section>
<!-- Credits... introduce everyone, etc... -->
<section>
<h3>
<img src="graphics/vizier-blue.svg" height="100px" style="vertical-align: middle; margin-right: 20px;" />
<span style="vertical-align: middle;" >VizierDB</span>
</h3>
<hr/>
<h3>Your notebook is not crumby enough, REPLace it</h3>
<hr/>
<h4><u>Michael&nbsp;Brachmann</u>,
William&nbsp;Spoth,
Oliver&nbsp;Kennedy,
Boris&nbsp;Glavic,
Heiko&nbsp;Mueller,
Sonia&nbsp;Castelo,
Carlos&nbsp;Bautista,
Juliana&nbsp;Freire</h4>
</section>
<section>
<h2>Demo</h2>
</section>
<section>
<p style="margin-top: -20px">
<img src="graphics/vizier-blue.svg" height="70px" style="vertical-align: middle; margin-right: 20px;" />
<span style="vertical-align: middle; font-weight: bold; font-size: 120%" ><b>VizierDB</b></span>
<div style="font-size: 70%; margin-top: -30px">A Data-First Notebook Built for Reproducibility</div>
</p>
<hr style="margin-bottom: 50px;" />
<ol>
<li>Automatic Refresh &amp; Dependency Management</li>
<li class="fragment highlight-blue" data-fragment-index="1">Caveats</li>
<li>Hybrid Notebook/Spreadsheet</li>
<li>History &amp; Version Management</li>
<li>Polyglot &amp; Multimodal</li>
</ol>
</section>
</section>
<section>
<!--
Problem: Documentation Disconnected from Data
Solution: Annotate/propagate data
Formally Define Caveats
- Possible Worlds / Heuristic Choices
- Link to IDBs/PDBs
- Pick One World + Mark "Uncertain" Values
- Ex on TI-/BI-DB/CTables
- Emphasize: Don't need to be able to enumerate all worlds:
- Just need one world and the ability to decide what is certain
- Introduce the Caveat function
- Examples:
- 3 examples above
- Re-emphasize that enumeration is not required
- Propagation Overview
-->
<section>
<h3>Data Errors Suck</h3>
<img src="graphics/data_error.png">
<attribution><a href="https://xkcd.com/2239/">https://xkcd.com/2239/</a></attribution>
</section>
<section>
<p>
<span class="fragment">
<img src="graphics/female-computer-user.svg" height="70px" style="vertical-align: middle;"/>
<span style="vertical-align: middle; padding-left: 70px; padding-right: 70px"></span>
</span>
<img src="graphics/db.svg" height="70px" style="vertical-align: middle;"/>
<span style="vertical-align: middle; padding-left: 70px; padding-right: 70px"></span>
<img src="graphics/male-computer-user.png" height="70px" style="vertical-align: middle;"/>
</p>
<p class="fragment">
<span style="margin-right: 250px; vertical-align: middle;"></span>
<span style="margin-left: 250px; vertical-align: middle;"></span>
<br/>
<span style="margin-right: 100px; vertical-align: middle;">Assumption</span>
<span style="font-size: 300%; vertical-align: middle;" class="fragment"></span>
<span style="margin-left: 100px; vertical-align: middle;">Assumption</span>
</p>
<attribution>freesvg.org</attribution>
</section>
<section>
<h3>Assumptions?</h3>
<ol style="font-size: 70%">
<li>"This outlier is actually a data error"</li>
<li>"There will always be six values in this column"</li>
<li>"The correct fix is to delete erroneous records"</li>
<li>"Unparseable values should be treated as NULL"</li>
<li>"Nobody will analyze this portion of the dataset"</li>
<li>"These subjective field observations are correct"</li>
</ol>
<p class="fragment">Alice needs to document each and every assumption.</p>
<p class="fragment">Bob needs to understand the implications<br/>on every part of his analysis.</p>
</section>
<section>
<img src="graphics/montoya.jpeg" height="400px" />
<attribution>&copy; 20th Century Fox</attribution>
</section>
<section>
<h3>What is a Caveat?</h3>
<div style="margin-top: 70px;">
<p class="fragment">An assumption tied to a fragment of the dataset.</p>
<p class="fragment">If the assumption is wrong, so is the fragment.</p>
</div>
</section>
<section>
<pre><code class="sql">
caveat(race_ethnicity,
'Unexpected race_ethnicity: ' & race_ethnicity)
</code></pre>
</section>
<section>
<pre><code class="sql">
CASE WHEN race_ethnicity NOT IN ('Black Non-Hispanic', /* ... */)
THEN caveat(race_ethnicity,
'Unexpected race_ethnicity: ' & race_ethnicity)
ELSE race_ethnicity
END
</code></pre>
</section>
<section>
<pre><code class="sql">
SELECT
CASE WHEN race_ethnicity NOT IN ('Black Non-Hispanic', /* ... */)
THEN caveat(race_ethnicity,
'Unexpected race_ethnicity: ' & race_ethnicity)
ELSE race_ethnicity
END, /* ... */
FROM R
</code></pre>
</section>
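<section data-markdown>
<script type="text/template">
A minimal Python sketch of the idea on the preceding slides, not Vizier's implementation. `Caveated` and `check_race_ethnicity` are hypothetical names, and `EXPECTED` is a placeholder for the list of values the slides elide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Caveated:
    """A value plus the messages of the caveats attached to it."""
    value: object
    messages: tuple = ()

def caveat(value, message):
    """Attach an assumption to a value without changing the value."""
    return Caveated(value, (message,))

# Placeholder: the slides do not list the full set of expected values.
EXPECTED = {"Black Non-Hispanic"}

def check_race_ethnicity(race_ethnicity):
    """Mirror of the CASE expression: caveat only unexpected values."""
    if race_ethnicity not in EXPECTED:
        return caveat(race_ethnicity,
                      "Unexpected race_ethnicity: " + race_ethnicity)
    return Caveated(race_ethnicity)

flagged = check_race_ethnicity("Blakc Non-Hispanic")
clean = check_race_ethnicity("Black Non-Hispanic")
```

The value itself is passed through untouched; only the annotation differs.
</script>
</section>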
<section>
<h3>Propagation</h3>
<p>Can twiddling the caveatted value change the output?</p>
<p class="fragment" data-fragment-index="1" style="margin-top: 50px;">$C \leftarrow (5 \times X) + Y$</p>
<p class="fragment" data-fragment-index="1">Caveats on $X$ and $Y$ propagate to $C$<span class="fragment" data-fragment-index="2">*</span></p>
<p class="fragment" data-fragment-index="2" style="font-size:30%">Some conditions may apply</p>
</section>
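<section data-markdown>
<script type="text/template">
One way to picture propagation through the formula on the previous slide, as a hedged Python sketch. The `Caveated` class and the example messages are assumptions made for illustration; this simple version ignores the "some conditions may apply" cases.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Caveated:
    value: float
    messages: frozenset = frozenset()

    def _combine(self, other, op):
        other = other if isinstance(other, Caveated) else Caveated(other)
        # The result inherits the caveats of every input.
        return Caveated(op(self.value, other.value),
                        self.messages | other.messages)

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __rmul__(self, other):
        # Handles constant * Caveated, e.g. 5 * x.
        return self._combine(other, lambda a, b: b * a)

x = Caveated(2.0, frozenset({"X was imputed"}))
y = Caveated(1.0, frozenset({"Y is an estimate"}))
c = (5 * x) + y
```

Here `c` carries the caveats of both `x` and `y`.
</script>
</section>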
</section>
<section>
<section>
<h2>Sloooow!</h2>
</section>
<section>
<img src="graphics/caveat-spreadsheet.png" height="150px" />
<p style="font-size: 200%">+</p>
<img src="graphics/caveat-list.png" height="150px" />
</section>
<section>
<p>Is a value caveatted?</p>
<p class="fragment">≡ Certain answers in incomplete databases</p>
<p class="fragment">(coNP-complete)</p>
</section>
<section>
<h3>Conservative Approximation</h3>
<div>
<p style="margin-top: 20px; font-size: 60%;">
<b>Correctness of SQL Queries on Databases with Nulls.</b><br/>
Paolo Guagliardo, Leonid Libkin
</p>
<p style="margin-top: 20px; font-size: 60%;">
<b>Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers</b><br/>
Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy
</p>
</div>
<ul>
<li class="fragment" data-fragment-index="1">Unmarked rows are guaranteed to be caveat-free.</li>
<li class="fragment" data-fragment-index="2">Marked rows might not be caveatted.</li>
</ul>
</section>
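<section data-markdown>
<script type="text/template">
The two bullets can be illustrated with a toy Python comparison. This is a sketch under stated assumptions, not the algorithm from the cited papers: `conservative_mark` and `can_change` are hypothetical helpers, and `alternatives` is an invented list of other values the caveatted input might take.

```python
def conservative_mark(input_marks):
    """Mark the output if any input is marked (cheap, may over-mark)."""
    return any(input_marks)

def can_change(f, x, alternatives):
    """Exact check over an explicit list of alternative values for x."""
    return any(f(alt) != f(x) for alt in alternatives)

x = 3                           # a caveatted input
alternatives = [0, 1, 2, 3, 4]  # hypothetical other values for x

depends = lambda v: 5 * v       # output really depends on x
ignores = lambda v: 0 * v       # output is 0 whatever x is

marked_depends = conservative_mark([True])
marked_ignores = conservative_mark([True])
exact_depends = can_change(depends, x, alternatives)
exact_ignores = can_change(ignores, x, alternatives)
```

Both outputs are marked, but only the first can actually change: the second is a false positive. An output with no marked inputs is never marked.
</script>
</section>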
</section>
<section>
<!--
Enumerating Caveats
-->
<section>
<h3>Enumerating Caveats</h3>
<dl>
<dt>Static Analysis</dt>
<dd>Which caveats could possibly affect the element?</dd>
<dt>Dynamic Analysis</dt>
<dd>Which specific caveats affect the element?</dd>
</dl>
</section>
<section>
<h3>Static Analysis</h3>
<p>What calls to <span style="font-family: monospace;">caveat()</span> appear in the derivation of the specified element?</p>
<p class="fragment">Analogous to <i>program slicing</i>.</p>
</section>
<section>
<h3>Program Slicing</h3>
<p>Eliminate lines of code not relevant<br/>to computing a specific value.</p>
<p class="fragment">This is <i>exactly</i> what a database optimizer does.</p>
</section>
<section>
<p><b>Lookup: </b> Caveats on $\texttt{R.A}[i]$</p>
<pre class="fragment"><code class="sql">
SELECT A
FROM R
WHERE ROWID = i
</code></pre>
<p class="fragment">All calls to <span style="font-family: monospace;">caveat()</span> surviving optimization<br/> (probably) affect the target.</p>
</section>
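<section data-markdown>
<script type="text/template">
A rough Python sketch of the static step: slice a view down to the derivation of one column, then list the `caveat()` calls that remain. The tuple-based expression format and the names `VIEW`, `caveat_calls`, and `slice_for` are invented for this example; a real system would lean on the database optimizer instead.

```python
# A view as a mapping from output column to a tiny expression tree.
# ("col", name) reads a column; ("caveat", expr, message) wraps an
# expression; ("add", left, right) combines two expressions.
VIEW = {
    "A": ("caveat", ("col", "A"), "A was repaired"),
    "B": ("add", ("col", "B"),
                 ("caveat", ("col", "C"), "C is an estimate")),
    "D": ("col", "D"),
}

def caveat_calls(expr):
    """Return the messages of all caveat() calls inside one expression."""
    kind = expr[0]
    if kind == "col":
        return []
    if kind == "caveat":
        return [expr[2]] + caveat_calls(expr[1])
    if kind == "add":
        return caveat_calls(expr[1]) + caveat_calls(expr[2])
    raise ValueError("unknown node: " + kind)

def slice_for(column):
    """Keep only the derivation of the requested column, then scan it."""
    return caveat_calls(VIEW[column])

on_a = slice_for("A")
on_b = slice_for("B")
on_d = slice_for("D")
```

Column `D` has no surviving calls, so no caveat can affect it.
</script>
</section>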
<section>
<h3>Dynamic Analysis</h3>
<ol>
<li>For each call to <span style="font-family: monospace;">caveat()</span>, isolate a query to generate the message.</li>
<li>Union the message query results together.</li>
</ol>
</section>
<section>
<h3>Isolate the message</h3>
<pre><code class="sql">
WITH data_source AS (
SELECT caveat(A, 'valid if '& B &' is within tolerances.') AS A,
C, D, E
FROM R
)
SELECT A FROM data_source WHERE ROWID = i
</code></pre>
<p>becomes</p>
<pre><code class="sql">
SELECT 'valid if '& B &' is within tolerances.'
AS caveat_message
FROM R WHERE ROWID = i
</code></pre>
</section>
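<section data-markdown>
<script type="text/template">
The two dynamic-analysis steps can be tried out with Python's `sqlite3` module. This is a sketch only: the sample rows are made up, `caveats_for_row` is a hypothetical helper, and SQLite writes string concatenation as `||` where the slides use `&`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A REAL, B INTEGER, C TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)",
                 [(1.5, 10, "x"), (2.5, 20, "y")])

# Step 1: one message query per caveat() call (a single call here).
message_queries = [
    "SELECT 'valid if ' || B || ' is within tolerances.' "
    "AS caveat_message FROM R WHERE ROWID = ?",
]

def caveats_for_row(i):
    """Step 2: run every message query for row i and union the results."""
    messages = []
    for query in message_queries:
        messages += [row[0] for row in conn.execute(query, (i,))]
    return messages

messages = caveats_for_row(2)
```

Only the message expression is evaluated; the data query itself never runs.
</script>
</section>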
</section>
<section>
<section>
<img src="graphics/caveat-spreadsheet.png" />
</section>
<section>
<h3>Vizual</h3>
<p>Spreadsheet Operations → SQL DDL / SQL DML</p>
<dl>
<div class="fragment">
<dt>Edit Cell A3 to 'foo'</dt>
<dd style="font-family: monospace;">UPDATE R SET A = 'foo' WHERE ROWID = 3;</dd>
</div>
<div class="fragment">
<dt>Insert Row</dt>
<dd style="font-family: monospace;">INSERT INTO R() VALUES ();</dd>
</div>
<div class="fragment">
<dt>Insert Column `bar`</dt>
<dd style="font-family: monospace;">ALTER TABLE R ADD COLUMN `bar`;</dd>
</div>
</dl>
</section>
<section>
<p>This gives us an edit history in DDL/DML.</p>
</section>
<section>
<h3>Caveats on DDL/DML</h3>
<div class="fragment">
<h4 style="margin-top: 50px;">DML → SQL</h4>
<p style="font-size: 70%">
<b>Using Reenactment to Retroactively Capture Provenance for Transactions</b><br/>
Bahareh Sadat Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan, Boris Glavic
</p>
</div>
<div class="fragment">
<h4 style="margin-top: 50px;">DDL → SQL</h4>
<p style="font-size: 70%">
<b>Graceful database schema evolution: the PRISM workbench</b><br/>
Carlo Curino, Hyun Jin Moon, Carlo Zaniolo
</p>
</div>
</section>
<section>
<pre><code class="sql">
UPDATE R SET A = 'foo' WHERE ROWID = 3;
</code></pre>
becomes
<pre><code class="sql">
SELECT CASE ROWID
WHEN 3 THEN 'foo'
ELSE A END AS A,
B, C, /* ... */
FROM R
</code></pre>
</section>
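<section data-markdown>
<script type="text/template">
A quick check, using Python's `sqlite3` module, that the query form gives the same table as the update. The table contents are invented for the example, and column `C` is dropped to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER)")
conn.executemany("INSERT INTO R VALUES (?, ?)",
                 [("a1", 1), ("a2", 2), ("a3", 3)])

# Reenactment: compute the post-update table as a query over the old one.
reenacted = conn.execute(
    "SELECT CASE ROWID WHEN 3 THEN 'foo' ELSE A END AS A, B "
    "FROM R ORDER BY ROWID"
).fetchall()

# The actual update, applied afterwards for comparison.
conn.execute("UPDATE R SET A = 'foo' WHERE ROWID = 3")
updated = conn.execute("SELECT A, B FROM R ORDER BY ROWID").fetchall()
```

Because the edit is now an ordinary query, the caveat machinery for queries applies to it unchanged.
</script>
</section>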
</section>
<section>
<section>
<h3>
<img src="graphics/vizier-blue.svg" height="100px" style="vertical-align: middle; margin-right: 20px;" />
<span style="vertical-align: middle;" ><a href="https://vizierdb.info/">https://vizierdb.info</a></span>
</h3>
<pre style="margin-top: 50px;"><code class="bash">
$> pip3 install --user vizier-webapi
$> vizier
</code></pre>
</section>
<section>
<h3>
<img src="graphics/vizier-blue.svg" height="100px" style="vertical-align: middle; margin-right: 50px;" />
<span style="vertical-align: middle;" >
<span style="font-size: 50%;">[https://]</span>VizierDB<span style="font-size: 50%;">[.info]</span>
</span>
<img src="graphics/qr.png" height="150px" style="vertical-align: middle; margin-left: 50px">
</h3>
<hr/>
<h4 style="font-size: 70%">
Michael&nbsp;Brachmann,
William&nbsp;Spoth,
Oliver&nbsp;Kennedy,
Boris&nbsp;Glavic,
Heiko&nbsp;Mueller,
Sonia&nbsp;Castelo,
Carlos&nbsp;Bautista,
Juliana&nbsp;Freire</h4>
<hr/>
<h4 style="font-size: 70%">
Ying&nbsp;Yang,
Su&nbsp;Feng,
Poonam&nbsp;Kumari,
Aaron&nbsp;Huber,
Niccolò&nbsp;Meneghetti,
Arindam&nbsp;Nandi,
Shivang&nbsp;Agarwal,
Olivia&nbsp;Alphonse,
Lisa&nbsp;Lu,
Gourab&nbsp;Malhotra,
Remi&nbsp;Rampin</h4>
<hr/>
<p style="font-size: 16pt">Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460 and gifts from Oracle</p>
</section>
</section>
<section>
<section>
<h2>Bonus Slides</h2>
</section>
<section>
<pre><code class="sql">
CREATE VIEW survey_responses AS
SELECT language,
CASE WHEN CAST(salary AS float) IS NULL THEN
caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
ELSE CAST(salary AS float) END AS salary
FROM raw_csv_data;
</code></pre>
<div class="fragment">
becomes
<pre><code class="sql">
CREATE VIEW survey_responses AS
SELECT language, CAST(salary AS float) AS salary,
FALSE AS _caveat_field_language,
CAST(salary AS float) IS NULL AS _caveat_field_salary,
FALSE AS _caveat_row
FROM raw_csv_data;
</code></pre>
</div>
</section>
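<section data-markdown>
<script type="text/template">
The rewritten query can be approximated in SQLite through Python. Two assumptions to flag: SQLite's `CAST` returns 0.0 rather than NULL for unparseable text, so a hypothetical `try_float` helper is registered instead, and booleans are stored as 0/1. The sample rows are invented.

```python
import sqlite3

def try_float(text):
    """Return text as a float, or None (SQL NULL) if it does not parse."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

conn = sqlite3.connect(":memory:")
conn.create_function("try_float", 1, try_float)
conn.execute("CREATE TABLE raw_csv_data (language TEXT, salary TEXT)")
conn.executemany("INSERT INTO raw_csv_data VALUES (?, ?)",
                 [("Scala", "100000"), ("Python", "n/a")])

# The caveat() call is replaced by boolean annotation columns.
rows = conn.execute("""
    SELECT language,
           try_float(salary) AS salary,
           0 AS _caveat_field_language,
           try_float(salary) IS NULL AS _caveat_field_salary,
           0 AS _caveat_row
    FROM raw_csv_data
    ORDER BY language
""").fetchall()
```

Only the row whose salary failed to parse has `_caveat_field_salary` set.
</script>
</section>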
<section>
<pre><code class="sql">
SELECT salary
FROM survey_responses
WHERE language = 'Scala'
</code></pre>
<div class="fragment">
becomes
<pre><code class="sql">
SELECT salary,
_caveat_field_salary AS _caveat_field_salary,
_caveat_row OR _caveat_field_language AS _caveat_row
FROM survey_responses
WHERE language = 'Scala'
</code></pre>
</div>
</section>
<section>
<pre><code class="sql">
SELECT AVG(salary) AS salary
FROM survey_responses
</code></pre>
<div class="fragment">
becomes
<pre><code class="sql">
SELECT AVG(salary),
GROUP_OR(_caveat_field_salary
OR _caveat_row) AS _caveat_field_salary,
FALSE AS _caveat_row
FROM survey_responses
</code></pre>
</div>
</section>
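<section data-markdown>
<script type="text/template">
SQLite has no `GROUP_OR`, but with 0/1 flags `MAX` plays the same role. A hedged sketch through Python's `sqlite3` module, with invented rows and the annotation columns stored directly in the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE survey_responses (
    language TEXT, salary REAL,
    _caveat_field_salary INTEGER, _caveat_row INTEGER)""")
conn.executemany(
    "INSERT INTO survey_responses VALUES (?, ?, ?, ?)", [
        ("Scala", 100.0, 0, 0),
        ("Scala", 120.0, 0, 0),
        ("Python", None, 1, 0),  # salary failed to parse: caveatted
    ])

# One caveatted input row is enough to caveat the aggregate.
overall = conn.execute("""
    SELECT AVG(salary) AS salary,
           MAX(_caveat_field_salary OR _caveat_row)
               AS _caveat_field_salary
    FROM survey_responses
""").fetchone()

# Restricting the input to caveat-free rows gives a caveat-free result.
scala_only = conn.execute("""
    SELECT AVG(salary),
           MAX(_caveat_field_salary OR _caveat_row)
    FROM survey_responses WHERE language = 'Scala'
""").fetchone()
```

The overall average is flagged because of a single caveatted row, while the Scala-only average is not.
</script>
</section>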
<section>
<pre><code class="sql">
SELECT language, AVG(salary) AS salary
FROM survey_responses
GROUP BY language
</code></pre>
<div class="fragment">
... first we evaluate
<pre><code class="sql">
SELECT GROUP_OR(_caveat_field_language)
FROM survey_responses
</code></pre>
</div>
<p class="fragment">Can often be evaluated statically.</p>
</section>
<section>
<h3>If GROUP BY has caveats</h3>
<pre><code class="sql">
SELECT language, AVG(salary) AS salary,
FALSE AS _caveat_field_language,
TRUE AS _caveat_field_salary,
GROUP_AND(_caveat_field_language OR
_caveat_row) AS _caveat_row
FROM by_language
GROUP BY language
</code></pre>
</section>
<section>
<h3>If no GROUP BY caveats</h3>
<pre><code class="sql">
SELECT language, AVG(salary) AS salary,
FALSE AS _caveat_field_language,
GROUP_OR(_caveat_field_salary OR
_caveat_row) AS _caveat_field_salary,
GROUP_AND(_caveat_row) AS _caveat_row
FROM by_language
GROUP BY language
</code></pre>
</section>
<section>
<pre><code class="sql">
INSERT INTO R() VALUES ();
</code></pre>
becomes
<pre><code class="sql">
SELECT * FROM R
UNION ALL
SELECT NULL AS A, NULL AS B,
NULL AS C, /* ... */
</code></pre>
</section>
<section>
<pre><code class="sql">
ALTER TABLE R ADD COLUMN `bar`;
</code></pre>
becomes
<pre><code class="sql">
SELECT *, NULL as `bar` FROM R;
</code></pre>
</section>
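<section data-markdown>
<script type="text/template">
Both reenactments on the last two slides run as plain queries. A small sketch with Python's `sqlite3` module, assuming an invented two-column table `R`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A TEXT, B INTEGER)")
conn.execute("INSERT INTO R VALUES ('a1', 1)")

# "Insert Row" as a query: the old rows plus one all-NULL row.
after_insert = conn.execute(
    "SELECT A, B FROM R UNION ALL SELECT NULL AS A, NULL AS B"
).fetchall()

# "Insert Column bar" as a query: every row gains a NULL column.
cursor = conn.execute("SELECT *, NULL AS bar FROM R")
after_alter_columns = [d[0] for d in cursor.description]
after_alter = cursor.fetchall()
```

Neither query modifies `R`; each describes what the table would look like after the edit.
</script>
</section>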
</section>
</div></div>
<script src="../reveal.js-3.5.0/lib/js/head.min.js"></script>
<script src="../reveal.js-3.5.0/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../../reveal.js-3.7.0/plugin/svginline/data-src-svg.js' },
{ src: '../reveal.js-3.5.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.5.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.5.0/js/MathJax.js'
},
{ src: '../reveal.js-3.5.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.5.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
//{ src: '../reveal.js-3.5.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'tt code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../../reveal.js-3.7.0/plugin/highlight/highlight-9.16.2.js', async: true,
callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.5.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.5.0/plugin/notes/notes.js', async: true }
]
});
</script>
<script>document.write('<script src="http://' + (location.host || 'localhost').split(':')[0] + ':35729/livereload.js?snipver=1"></' + 'script>')</script>
</body>
</html>