<div class="slides">
<!-- Any section element inside of this container is displayed as a slide -->
<!-- Credits... introduce everyone, etc... -->
<img src="graphics/vizier-blue.svg" height="100px" style="vertical-align: middle; margin-right: 20px;" />
<span style="vertical-align: middle;" >VizierDB</span>
<h3>A Notebook with Caveats</h3>
<h4>Oliver Kennedy
<a href="" style="margin-left: 50px;"></a></h4>
Let me tell you a story...
- Alice/Bob heuristic alignment
- Carol/Dave data changes
- Eve/HAL data ingestion
<h2>Story Time!</h2>
<h3>Act 1</h3>
<p>Alice wants to analyze two unaligned time series.</p>
<table style="font-size: 60%; display: inline; padding: 50px;">
<tr><td colspan="2">...</td></tr>
<table style="font-size: 60%; display: inline; padding: 50px;">
<tr><td colspan="2">...</td></tr>
<p class="fragment">Step 1: Line up the readings</p>
<h3>Option 1: Do it right</h3>
<img src="graphics/timeseries.png" height="500px" style="display: inline; vertical-align: middle;" />
<table style="display: inline; vertical-align: middle;">
<td style="padding-bottom: 50px;" class="fragment">
Lots of active research efforts!
<td style="padding-top: 50px;" class="fragment">
... but Alice is trying to GSD!
<h3>Alice's Observations</h3>
<li>Readings every ~10s</li>
<li>Readings are binary</li>
<li>Readings are incredibly stable</li>
<pre><code class="sql">
INSERT INTO series_one_buckets
SELECT CAST(time / 10 AS int) AS bucket,
       FIRST(value) AS value -- "value" is an assumed column name
FROM series_one
GROUP BY bucket;
</code></pre>
<p class="fragment">Interpolate missing values</p>
<p class="fragment">Hand-tune around the switchover as needed</p>
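<p>A minimal Python sketch of Alice's heuristic, assuming readings arrive as (time, value) pairs. It keeps the first value in each 10-second bucket and uses a simple forward fill as a stand-in for interpolation. The names <code>bucket_first</code> and <code>fill_forward</code> are illustrative only.</p>
<pre><code class="python">
def bucket_first(readings, width=10):
    """Keep the first value seen in each time bucket (like FIRST ... GROUP BY)."""
    buckets = {}
    for time, value in sorted(readings):
        buckets.setdefault(int(time // width), value)
    return buckets


def fill_forward(buckets):
    """Fill empty buckets with the value of the nearest earlier bucket."""
    filled = {}
    last = None
    for b in range(min(buckets), max(buckets) + 1):
        last = buckets.get(b, last)
        filled[b] = last
    return filled


# Hypothetical readings: bucket 1 holds two distinct values, buckets 2-3 are empty.
series_one = [(1, 0), (12, 0), (19, 1), (41, 1)]
aligned = fill_forward(bucket_first(series_one))
print(aligned)
</code></pre>
<p>Both steps silently guess: bucket 1 drops a conflicting reading, and buckets 2 and 3 are invented from bucket 1.</p>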
<p>Time taken: &lt; 30 minutes</p>
<img src="graphics/woman-reading-book-on-beach.svg" height="400px" />
<h3>Enter Bob...</h3>
<p class="fragment"><u>Similar</u> analysis...</p>
<p class="fragment">... <u>different</u> data</p>
<p class="fragment">Can Bob re-use Alice's prep+analytics workflow?</p>
<li class="fragment">Are readings still every ~10s?</li>
<li class="fragment">Is the data still binary?</li>
<li class="fragment">Is the data still (relatively) stable?</li>
<p class="fragment">... and even then, some manual effort is needed!</p>
<p>Bob needs to know Alice's assumptions<br/>(and how to use the workflow).</p>
<h3>Act 2</h3>
<p>Carol gets a dataset from Dave</p>
<img src="graphics/female-computer-user.svg" height="100px"/><br/>
<span class="fragment" data-fragment-index="1"></span><br/>
<img src="graphics/Binary-file-20110715.svg" height="100px" style="vertical-align: middle;">
<span class="fragment" data-fragment-index="1">
<img src="graphics/Prismatic-Cloud-Gears-2.svg" height="100px" style="vertical-align: middle;"/>
<span class="fragment" data-fragment-index="2">
<img src="graphics/moneybag.svg" height="100px" style="vertical-align: middle;">
<p>Dave adds new data to the dataset!</p>
<p>Can Carol re-use her workflow?</p>
<li class="fragment">Did the data dictionary change?</li>
<li class="fragment">Did new errors get introduced?</li>
<p>Carol needs to remember her assumptions about the data, and trust that the new data is like the old data.</p>
<h3>Act 3</h3>
<p>Eve needs to load a CSV file</p>
<img src="graphics/Binary-file-20110715.svg" height="100px" style="vertical-align: middle;">
<img src="graphics/db.svg" height="100px" style="vertical-align: middle;">
<h3>Scenario 1</h3>
<img src="graphics/HAL9000_iconic_eye.svg" height="150px">
<p class="fragment" style="font-family: monospace;">
I'm sorry, I can't do that, Eve.<br/>
<p class="fragment" style="font-family: monospace;">
You have a non-numerical value at position 1252538:24.
<h3>Scenario 2</h3>
<img src="graphics/HAL9000_iconic_eye.svg" height="150px">
<p class="fragment" style="font-family: monospace;" data-fragment-index="1">
Load Successful!
<div class="fragment growbig" data-fragment-index="3">
<p class="fragment" style="font-family: monospace; font-size: 10%;" data-fragment-index="2">
(btw, 175326 records didn't load)
<p>Heuristics only work <b>most</b> of the time.</p>
<p>Data science is <span style="color: lightgrey;">nuanced</span>.</p>
<p class="fragment">Assumptions can't be avoided!</p>
<p class="fragment">It's easy to miss an assumption when re-using work.</p>
<img src="graphics/data_error.png">
<attribution><a href=""></a></attribution>
<h3>Wouldn't it be nice if...</h3>
<img src="graphics/montoya.jpeg" height="400px" />
<h3>Wouldn't it be nice if...</h3>
<p>... this is what Bob saw:</p>
<img src="graphics/time_series_with_errors.svg" />
<h3>Wouldn't it be nice if...</h3>
<p>... this is what Carol saw:</p>
<td style="
color: rgb(251, 189, 8);
background-color: #eed;
text-decoration: none;
text-decoration-color: rgb(251, 189, 8);
text-decoration-line: none;
text-decoration-style: solid;
vertical-align: middle;
border-radius: 15px 0px 0px 15px;
font-size: 150%">⚠</td>
<td style="
font-size: 70%;
background-color: #eee;
vertical-align: middle;
border-radius: 0px 15px 15px 0px;
padding: 20px;">
The data included an unexpected value: <b>'Non-Hispanic White'</b><br/>The most similar known value is <b>'White Non-Hispanic'</b>
<p>Annotate data with warnings.</p>
<p class="fragment" data-fragment-index="1">If you use this value/record, <br/>here's what you need to know!</p>
<h3 class="fragment" data-fragment-index="2">Caveat Physicus</h3>
<h4 class="fragment" data-fragment-index="1">Propagation</h4>
<dd class="fragment" data-fragment-index="2" style="margin-left: -20px;">Caveats...</dd>
<div class="fragment" data-fragment-index="2">
<dt>... can go where the data goes</dt>
<dd>Derived values retain caveats on source data.</dd>
<div class="fragment" data-fragment-index="3">
<dt>... stop where the data stops</dt>
<dd>Irrelevant caveats don't get propagated</dd>
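<p>One way to picture propagation, as a hedged Python sketch rather than Vizier's actual mechanism: each value carries a set of caveats, and a derived value inherits the caveats of exactly the inputs it reads. The <code>Value</code> class and <code>add</code> function are hypothetical names.</p>
<pre><code class="python">
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Value:
    """A data value plus the caveats (assumptions) it depends on."""
    data: float
    caveats: frozenset = field(default_factory=frozenset)


def add(a, b):
    """A derived value keeps the caveats of every input it was computed from."""
    return Value(a.data + b.data, a.caveats | b.caveats)


clean = Value(3.0)
guessed = Value(4.0, frozenset({"interpolated from bucket 2"}))

total = add(clean, guessed)    # reads the guessed value, so the caveat follows
untouched = add(clean, clean)  # never reads the guessed value, so no caveat
print(total.caveats, untouched.caveats)
</code></pre>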
<h3>Wouldn't it be nice if...</h3>
<p>... this is what Eve saw:</p>
<img src="graphics/caveat-spreadsheet.png"/>
<h3>What is a Caveat?</h3>
<p class="fragment">A brief digression...</p>
<h3>Classical Databases</h3>
<p class="fragment">One database $D$</p>
<p class="fragment">Each query gets one answer $R \leftarrow Q(D)$</p>
<h3>Incomplete Databases</h3>
<p class="fragment">Multiple <u>possible</u> databases $D \in \mathcal D$</p>
<p class="fragment">(possible worlds)</p>
<p class="fragment">Queries get a <u>set</u> of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$</p>
<p class="fragment"><b>Certain</b> tuples exist in all possible worlds. $$\mathrm{certain}(\mathcal R) = \bigcap_{R \in \mathcal R} R$$</p>
<p class="fragment"><b>Uncertain</b> tuples exist in at least one, <br/>but not all, possible worlds. $$\mathrm{uncertain}(\mathcal R) = \left(\bigcup_{R \in \mathcal R} R\right) - \mathrm{certain}(\mathcal R)$$</p>
<p style="font-size: 70%;" class="fragment">(not limited to set semantics)</p>
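<p>The two definitions can be checked on a toy example. This sketch assumes set semantics and a non-empty list of possible answers; the tuples are made up for illustration.</p>
<pre><code class="python">
def certain(answers):
    """Tuples that appear in the answer for every possible world."""
    return set.intersection(*answers)


def uncertain(answers):
    """Tuples that appear in some answer, but not in all of them."""
    return set.union(*answers) - certain(answers)


# Q(D) evaluated in three hypothetical possible worlds.
possible_answers = [
    {("alice", 1), ("bob", 0)},
    {("alice", 1), ("bob", 1)},
    {("alice", 1)},
]
print(certain(possible_answers))
print(uncertain(possible_answers))
</code></pre>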
<p>A caveat is an assumption tied to one or more data elements (cells or rows).</p>
<p>If the assumption is wrong, so is the element.</p>
<h3>Alice / Bob</h3>
<li><span style="font-family: monospace;">FIRST</span> may not pick the right value for a bucket with 2+ distinct values.</li>
<li>Interpolation may not pick the right value for a bucket with 0 values.</li>
<h3>Carol / Dave</h3>
<li>The model hyperparameters may not work if the data changes too significantly.</li>
<li>New values could indicate new data errors that Carol's ingest script hasn't accounted for.</li>
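<p>A rough sketch of how a warning like Carol's could be produced, assuming a list of known category values. Similarity here is a simple token-overlap score chosen for illustration; it is not necessarily what Vizier uses, and <code>check_value</code> is a hypothetical helper.</p>
<pre><code class="python">
def token_similarity(a, b):
    """Jaccard similarity over lowercase, whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta.intersection(tb)) / len(ta.union(tb))


def check_value(value, known_values):
    """Return None for a known value, else a warning naming the closest match."""
    if value in known_values:
        return None
    closest = max(known_values, key=lambda k: token_similarity(value, k))
    return f"Unexpected value: {value!r}. Most similar known value: {closest!r}"


# Hypothetical data dictionary built from the old dataset.
known = ["White Non-Hispanic", "Black Non-Hispanic", "Hispanic"]
warning = check_value("Non-Hispanic White", known)
print(warning)
</code></pre>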
<h3>Eve / HAL</h3>
<li>Replacing a parse error with a NULL might not be what Eve expects.</li>
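<p>A small sketch of a loader that takes the middle road between the two scenarios, assuming every column should be numeric: it still loads the file, but records a caveat for each cell it had to replace. <code>load_numeric_csv</code> is a hypothetical helper, not a Vizier API.</p>
<pre><code class="python">
import csv
import io


def load_numeric_csv(text):
    """Parse every cell as a float; on failure store None and record a caveat."""
    rows, caveats = [], []
    for r, record in enumerate(csv.reader(io.StringIO(text))):
        row = []
        for c, cell in enumerate(record):
            try:
                row.append(float(cell))
            except ValueError:
                row.append(None)
                caveats.append((r, c, f"could not parse {cell!r} as a number"))
        rows.append(row)
    return rows, caveats


rows, caveats = load_numeric_csv("1,2.5\n3,n/a\n")
print(rows)
print(caveats)
</code></pre>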
<p>An element has a caveat → The element is uncertain.</p>
<p class="fragment">... and btw, here's why.</p>
- Reproducibility-Focused Notebook
- Scripting
- Spreadsheets
- Point+Click
- Key Feature: Caveats
- Demo
<h1><a href="" target="_blank">Demo</a></h1>
<table style="display: inline-block; margin-right: 100px">
<th colspan="5" style="font-size: 12pt">Students</th>
<tr height="80px">
<td width="100px">
<img src="people/poonam.jpg" width="70px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Poonam<br/>(PhD-4Y)</p>
<td width="100px">
<img src="people/will.png" width="61px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Will<br/>(PhD-3Y)</p>
<td width="100px">
<img src="people/aaron.jpg" width="64px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Aaron<br/>(PhD-4Y)</p>
<table style="display: inline-block; margin-left: 100px">
<th colspan="1" style="font-size: 12pt">Dev</th>
<td width="100px">
<img src="people/mike.jpg" width="80px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Mike<br/>(Sr. Rsrch. Dev.)</p>
<table style="display: inline-block;">
<th colspan="7" style="font-size: 12pt">Alumni</th>
<tr height="80px">
<td width="100px">
<img src="people/ying.jpg" width="60px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Ying<br/>(PhD 2017)</p>
<td width="100px">
<img src="people/niccolo.png" width="50px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Niccolò<br/>(PhD 2016)</p>
<td width="100px">
<img src="people/arindam.jpg" width="80px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Arindam<br/>(MS 2016)</p>
<td width="100px">
<img src="people/shivang.jpg" width="55px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Shivang<br/>(MS 2018)</p>
<td width="100px">
<img src="people/olivia.png" width="50px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Olivia<br/>(BS 2017)</p>
<td width="100px">
<img src="people/gourab.jpg" width="80px" height="80px" style="margin-bottom: 0px"/>
<p style="margin-top: 0px; font-size: 10pt;">Gourab<br/>(MS 2018)</p>
<th colspan="6" style="font-size: 12pt">External Collaborators</th>
<td width="130px" style="font-size: 10pt;">
Zhen Hua Liu<br/>(Oracle)
<td width="130px" style="font-size: 10pt;">
Ying Lu<br/>(Oracle)
<td width="130px" style="font-size: 10pt;">
Beda Hammerschmidt<br/>(Oracle)
<td width="140px" style="font-size: 10pt;">
Boris Glavic<br/>(IIT)
<td width="140px" style="font-size: 10pt;">
Su Feng<br/>(IIT)
<table style="margin-top: 5px">
<td width="140px" style="font-size: 10pt;">
Juliana Freire<br/>(NYU)
<td width="140px" style="font-size: 10pt;">
Munaf Arshad Qazi<br/>(NYU)
<td width="140px" style="font-size: 10pt;">
Heiko Mueller<br/>(NYU)
<td width="140px" style="font-size: 10pt;">
Sonia Castelo Quispe<br/>(NYU)
<td width="140px" style="font-size: 10pt; color: grey;">
Carlos Bautista<br/>(NYU)
<td width="140px" style="font-size: 10pt;">
Remi Rampin<br/>(NYU)
<p style="font-size: 10pt; text-decoration: underline;">Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460 and gifts from Oracle.</p>
<img src="graphics/vizier-blue.svg" height="100px" style="vertical-align: middle; margin-right: 20px;" />
<span style="vertical-align: middle;" ><a href=""></a></span>
<pre style="margin-top: 50px;"><code class="bash">
$> pip3 install --user vizier-webapi
$> vizier
<p>Or get an account from me and try it out at <a href=""></a></p>
