---
template: templates/talk_slides_v1.erb
title: "Caveatting your data"
---
<section>
<h2>Caveatting your data</h2>
<h3>Adding explainability to incomplete datasets</h3>
<h4 style="margin-top: 20px;">Oliver Kennedy</h4>
<h4>University at Buffalo</h4>
<p style="font-size: 60%;">Based on joint work with Olga Wodo, Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...</p>
</section>
<section>
<section>
<h2>Story Time!</h2>
<p class="fragment">Alice wants to analyze two unaligned time series.</p>
</section>
<section>
<table style="font-size: 60%; display: inline; padding: 50px;">
<tr><th>Time</th><th>Reading</th></tr>
<tr><td>1575731001</td><td>0</td></tr>
<tr><td>1575731014</td><td>0</td></tr>
<tr><td>1575731030</td><td>0</td></tr>
<tr><td>1575731035</td><td>0</td></tr>
<tr><td colspan="2">...</td></tr>
<tr><td>1575731219</td><td>1</td></tr>
<tr><td>1575731229</td><td>1</td></tr>
<tr><td>1575731240</td><td>1</td></tr>
</table>
<table style="font-size: 60%; display: inline; padding: 50px;">
<tr><th>Time</th><th>Reading</th></tr>
<tr><td>1575731011</td><td>0</td></tr>
<tr><td>1575731020</td><td>0</td></tr>
<tr><td>1575731031</td><td>0</td></tr>
<tr><td>1575731039</td><td>0</td></tr>
<tr><td colspan="2">...</td></tr>
<tr><td>1575731218</td><td>1</td></tr>
<tr><td>1575731228</td><td>1</td></tr>
<tr><td>1575731237</td><td>1</td></tr>
</table>
<p class="fragment">Step 1: Line up the readings</p>
</section>
<section>
<h3>Option 1: Do it right</h3>
</section>
<section>
<img src="graphics/2022-02-16/timeseries.png" height="500px" style="display: inline; vertical-align: middle;" />
<table style="display: inline; vertical-align: middle;">
<tr>
<td style="padding-bottom: 50px;" class="fragment">
Lots of active research efforts!
</td>
</tr>
<tr>
<td style="padding-top: 50px;" class="fragment">
... but Alice is trying to GSD!
</td>
</tr>
</table>
</section>
<section>
<h3>Alice's Observations</h3>
<ul>
<li>Readings every ~10s</li>
<li>Readings are binary</li>
<li>Readings are incredibly stable</li>
</ul>
</section>
<section>
<pre><code class="python">
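# Assign each reading to a 10-second bucket (integer division)
df1['bucket'] = df1['timestamp'].map(lambda x : x // 10)
df2['bucket'] = df2['timestamp'].map(lambda x : x // 10)
# Keep the first reading per bucket, then line up the two series
df1.groupby('bucket').first().join(
  df2.groupby('bucket').first(),
  how='outer',
  lsuffix='_1',
  rsuffix='_2'
)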
</code></pre>
<p class="fragment">Interpolate missing values</p>
<p class="fragment">Hand tune around the switchover as-needed</p>
</section>
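<section>
<p style="font-size: 60%;">A minimal sketch of the interpolation step, assuming made-up sample readings and a forward fill (one plausible choice for stable binary data; Alice's exact method is not shown):</p>
<pre><code class="python">
import pandas as pd

# Hypothetical sample data: two unaligned series of binary readings
df1 = pd.DataFrame({"timestamp": [1575731001, 1575731014, 1575731030],
                    "reading": [0, 0, 1]})
df2 = pd.DataFrame({"timestamp": [1575731001, 1575731031],
                    "reading": [0, 1]})

def bucketed(df):
    # Keep the first reading in each 10-second bucket
    buckets = df["timestamp"] // 10
    return df.groupby(buckets)["reading"].first()

# Outer-align the two series on their buckets; gaps show up as NaN
aligned = pd.concat([bucketed(df1), bucketed(df2)],
                    axis=1, keys=["reading_1", "reading_2"]).sort_index()
# Fill each gap with the most recent earlier reading
filled = aligned.ffill()
</code></pre>
<p style="font-size: 60%;">The second series has no reading in the middle bucket, so the forward fill copies its previous value there.</p>
</section>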
<section>
<p>Time taken: &lt; 30 minutes</p>
</section>
<section>
<img src="graphics/clipart/woman-reading-book-on-beach.svg" height="400px" />
<attribution>FreeSVG.org</attribution>
</section>
<section>
<h3>Enter Bob...</h3>
<p class="fragment"><u>Similar</u> analysis...</p>
<p class="fragment">... <u>different</u> data</p>
<p class="fragment">Can Bob re-use Alice's prep+analytics workflow?</p>
</section>
<section>
<h3>Maybe?</h3>
<ul>
<li class="fragment">Are readings still every ~10s?</li>
<li class="fragment">Is the data still binary?</li>
<li class="fragment">Is the data still (relatively) stable?</li>
</ul>
<p class="fragment">... and even then, some manual effort is needed!</p>
</section>
<section>
<p>Bob needs to know Alice's assumptions<br/>(and how to use the workflow)</p>
</section>
<section>
<h3>In summary</h3>
<ul>
<li>Shortcuts are unavoidable</li>
<li>... but introduce ambiguity into datasets</li>
<li>... and processes</li>
</ul>
</section>
</section>
<section>
<section>
<h3>Act 2</h3>
<p>Carol gets datasets from Dave, Eve, and Fred</p>
</section>
<section>
<img src="graphics/clipart/female-computer-user.svg" height="100px"/><br/>
<span class="fragment" data-fragment-index="1">↓</span><br/>
<img src="graphics/clipart/Binary-file-20110715.svg" height="100px" style="vertical-align: middle;">
<span class="fragment" data-fragment-index="1">→
<img src="graphics/clipart/Prismatic-Cloud-Gears-2.svg" height="100px" style="vertical-align: middle;"/>
</span>
<span class="fragment" data-fragment-index="2">→
<img src="graphics/clipart/moneybag.svg" height="100px" style="vertical-align: middle;">
</span>
<attribution>FreeSVG.org</attribution>
</section>
<section>
<p>What questions can Carol's model answer?</p>
</section>
<section>
<h3>It depends!</h3>
<ul>
<li>Do the multiple datasets share variables?</li>
<li>Is the entire domain covered?</li>
<li>... by every variable </li>
<li>And more...</li>
</ul>
</section>
</section>
<section>
<section>
<h3>Incomplete Data</h3>
<img src="graphics/2022-02-16/survivorship_bias.svg" height="400px">
<attribution><a href="https://commons.wikimedia.org/wiki/File:Survivorship-bias.svg">Martin Grandjean (vector), McGeddon (picture), Cameron Moll (concept) - Own work</a>; CC BY-SA 4.0</attribution>
</section>
<section>
<h3>Incomplete Data</h3>
<img src="graphics/2022-02-16/hurricane.png" height="400px">
<attribution>Doraiswamy et al., "Using Topological Analysis to Support Event-Guided Exploration in Urban Data"</attribution>
</section>
<section>
<h3>Incomplete Coverage</h3>
<svg data-src="graphics/2022-02-16/small_data.svg"/>
</section>
<section>
<h3>Right now...</h3>
<ul>
<li class="fragment">Retain only a complete subset.</li>
<li class="fragment">Impute the data.</li>
<li class="fragment">Let the dataset users figure it out.</li>
</ul>
<p class="fragment"><b>The Curse of Small Data</b>: Utility vs Trustworthiness.</p>
</section>
<section>
<img src="graphics/2022-02-16/montoya.jpeg" height="400px">
</section>
</section>
<section>
<section>
<h2>Incomplete Data Management</h2>
</section>
<section>
<h3>What is Incomplete Data?</h3>
<ul>
<li><tt>n/a</tt>, <tt>None</tt>, <tt>NULL</tt>, etc...</li>
<li>Records that don't exist.</li>
<li>The results of an expensive simulation.</li>
<li>Excel accidentally interpreting one cell as a date.</li>
<li>1µm resolution images vs 1.2µm resolution images.</li>
</ul>
<p class="fragment">Data that is not known precisely.</p>
</section>
<section>
<h3>Data Model</h3>
<h4 class="fragment">(General Idea)</h4>
</section>
<section>
<ol>
<li class="fragment">A 'placeholder' value or record. <ul style="font-size: 70%;">
<li><tt>n/a</tt> or <tt>None</tt></li>
<li>"The value is probably 3.2"</li>
</ul></li>
<li class="fragment">Constraints describing what we do know. <ul style="font-size: 70%;">
<li>"The value must be between $0$ and $0.7$."</li>
<li>"There are at most 10 records with values from $0.9$ to $1$."</li>
</ul></li>
<li class="fragment">Metadata describing the source of the error. <ul style="font-size: 70%;">
<li>"The experiment hasn't been run yet."</li>
<li>"Alignment issues between datasets $A$ and $B$."</li>
</ul></li>
</ol>
</section>
<section>
<table style="font-size: 70%">
<thead>
<tr>
<th>locale</th>
<th>rate</th>
<th>size</th>
</tr>
</thead>
<tr>
<td>Los Angeles</td>
<td>$[3\%,4\%]^1$</td>
<td>metro</td>
</tr>
<tr>
<td>Austin</td>
<td>$18\%$</td>
<td>[city, metro]$^2$</td>
</tr>
<tr>
<td>Houston</td>
<td>$14\%$</td>
<td>metro</td>
</tr>
<tr>
<td>Berlin</td>
<td colspan="2">$1\%$, town &nbsp;<b>or</b>&nbsp; $3\%$, city$^2$</td>
</tr>
<tr>
<td>Sacramento</td>
<td>$1\%$</td>
<td><b>null</b></td>
</tr>
<tr>
<td>Springfield</td>
<td><b>null</b></td>
<td>town</td>
</tr>
</table>
<p style="font-size: 60%; margin-top: 20px; ">
$1:$ Conflict between CDC and locality-reported statistics.<br/>
$2:$ Multiple localities with this name.
</p>
<attribution>Feng et al., "Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds"</attribution>
</section>
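<section>
<p style="font-size: 60%;">A minimal sketch of the three-part data model as a Python record. The class and field names are illustrative assumptions, and the guess of 3.5% is made up; only the bounds and the note come from the Los Angeles row.</p>
<pre><code class="python">
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncompleteValue:
    guess: Optional[float]  # 1. placeholder; None plays the role of n/a
    low: float              # 2. constraint: smallest possible value
    high: float             # 2. constraint: largest possible value
    reason: str             # 3. metadata: why the value is imprecise

    def is_certain(self):
        # A value is known precisely when its bounds coincide
        return self.low == self.high

# Los Angeles rate: somewhere between 3% and 4%
la_rate = IncompleteValue(
    guess=0.035, low=0.03, high=0.04,
    reason="Conflict between CDC and locality-reported statistics")
# Houston rate: a single known value, so no explanation is needed
houston_rate = IncompleteValue(guess=0.14, low=0.14, high=0.14, reason="")
</code></pre>
</section>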
</section>
<section>
<section>
<h3>Incomplete Databases</h3>
<p>"Incompleteness" as a first-class database primitive.</p>
<ul>
<li class="fragment">Start with a Data Model (e.g., Relations/Data Frames)<ul>
<li class="fragment">Add a (formal) notion of "Incompleteness"</li>
</ul></li>
<li class="fragment">Pick a Query Language (e.g., SQL, Pandas, Spark)<ul>
<li class="fragment">Define semantics for queries under incompleteness.</li>
<li class="fragment">Optimize.</li>
</ul></li>
</ul>
</section>
<section>
<h3>Using Incomplete Data</h3>
<ul>
<li class="fragment">"Certain" and "Possible" answers.</li>
<li class="fragment">Summary Statistics.</li>
<li class="fragment">Presenting Incompleteness.</li>
<li class="fragment">Incompleteness-Aware ML.</li>
</ul>
</section>
</section>
<section>
<section>
<h3>UMAMI</h3>
<img src="graphics/2022-02-16/microstructure_filter.png">
</section>
<section>
<h3>Certain, Possible Answers</h3>
<pre><code class="python">
# Element-wise AND needs &amp; and parentheses, not Python's "and"
result = df[ (df["NORMALIZED_INTERFERENCE"] &lt; 10)
  &amp; (df["ABS_wf_D"] &gt; 2.8) ]
</code></pre>
<p class="fragment">$3.1 > 2.8$ is True</p>
<p class="fragment">$\text{n/a} < 10$ is Uncertain</p>
<p class="fragment">$(3.1 > 2.8) \wedge (\text{n/a} < 10)$ is ???</p>
</section>
<section>
<p class="fragment">Records meeting both conditions are <i>certain</i> matches.</p>
<p class="fragment">Records with <tt>ABS_wf_D &le; 2.8</tt> are definitely not matches.</p>
<p class="fragment">Records with <tt>ABS_wf_D > 2.8</tt> but missing <tt>NORMALIZED_INTERFERENCE</tt> are <i>possible</i> matches.</p>
</section>
<section>
<h3>3-Valued Boolean Logic</h3>
<p>AND</p>
<table>
<tr><td></td><th>True</th><th>Unknown</th><th>False</th></tr>
<tr><th style="border-bottom: none; border-right: solid 1px black;">True</th><td>True</td><td>Unknown</td><td>False</td></tr>
<tr><th style="border-bottom: none; border-right: solid 1px black;">Unknown</th><td>Unknown</td><td>Unknown</td><td>False</td></tr>
<tr><th style="border-bottom: none; border-right: solid 1px black;">False</th><td>False</td><td>False</td><td>False</td></tr>
</table>
<p class="fragment">similar truth tables for OR, NOT, etc...</p>
</section>
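<section>
<p style="font-size: 60%;">A minimal sketch of these truth tables in plain Python, assuming <tt>None</tt> stands for Unknown (the function names are illustrative):</p>
<pre><code class="python">
def and3(a, b):
    # A single False operand decides the result
    if a is False or b is False:
        return False
    # Otherwise any Unknown operand leaves the result Unknown
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    # A single True operand decides the result
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    # Negating Unknown is still Unknown
    return None if a is None else (not a)
</code></pre>
<p style="font-size: 60%;"><tt>and3(True, None)</tt> evaluates to <tt>None</tt>: a condition that holds, combined with a missing value, stays Unknown.</p>
</section>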
<section>
<h3>Certain vs Possible</h3>
<table>
<thead>
<tr>
<th>Result</th>
<th>Includes...</th>
</tr>
</thead>
<tr>
<td>Certain</td>
<td>Filter = True</td>
</tr>
<tr>
<td>Possible</td>
<td>Filter $\in$ { True, Unknown }</td>
</tr>
</table>
</section>
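<section>
<p style="font-size: 60%;">One way to sketch this in pandas, assuming its nullable dtypes (whose comparisons and <tt>&amp;</tt> follow the same 3-valued logic) and four made-up records:</p>
<pre><code class="python">
import pandas as pd

# Hypothetical records; None becomes a missing (NA) measurement
df = pd.DataFrame({
    "NORMALIZED_INTERFERENCE":
        pd.array([5.0, None, 12.0, None], dtype="Float64"),
    "ABS_wf_D":
        pd.array([3.1, 3.1, 3.1, 1.0], dtype="Float64"),
})

# Each record evaluates to True, False, or NA (Unknown)
keep = df["NORMALIZED_INTERFERENCE"].lt(10) & df["ABS_wf_D"].gt(2.8)

certain = df[keep.fillna(False)]   # Filter = True
possible = df[keep.fillna(True)]   # Filter is True or Unknown
</code></pre>
<p style="font-size: 60%;">The last record is missing a value but still fails the other condition, so it is excluded from both results.</p>
</section>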
<section>
<h3>Summary Statistics</h3>
<pre><code class="python">
result.count()
</code></pre>
<p class="fragment">There are at least <i>certain.count()</i> records.</p>
<p class="fragment">There are at most <i>possible.count()</i> records.</p>
<p class="fragment">The output is also incomplete</p>
<p class="fragment">(but in a different way)</p>
</section>
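<section>
<p style="font-size: 60%;">A small sketch of the two bounds, assuming a made-up list of per-record filter outcomes in which <tt>None</tt> stands for Unknown:</p>
<pre><code class="python">
# Hypothetical filter outcome for each record
matches = [True, None, False, True, None]

# Lower bound: records that certainly pass the filter
at_least = sum(1 for m in matches if m is True)
# Upper bound: records that pass or might pass the filter
at_most = sum(1 for m in matches if m is not False)
</code></pre>
<p style="font-size: 60%;">The count is no longer one number but a range, here from 2 to 4.</p>
</section>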
</section>
<section>
<section>
<h3>Presenting Incompleteness</h3>
<p><b>Database: </b> This nanostructure is possibly a result.</p>
<p><b>You: </b> Why isn't it certain?</p>
</section>
<section>
<p>Incompleteness as a provenance problem <br>(aka lineage, pedigree)</p>
<attribution>Yang et al., "Lenses: An On-Demand Approach to ETL"</attribution>
</section>
<section>
<h3>Caveats</h3>
<p>Annotate incomplete values with a small note</p>
<p class="fragment"><b>Challenge: </b> Propagating annotations through queries.</p>
</section>
<section>
<p>Filter = Uncertain $\rightarrow$ The annotation propagates to the row.</p>
<p class="fragment"><tt>df.count()</tt> $\rightarrow$ Row annotations propagate back to the value.</p>
<div class="fragment">
<p>We have caveats working with Apache Spark,<br/><i>with negligible overhead</i>.</p>
<attribution>Brachmann et al., "Your notebook is not crumby enough, REPLace it"</attribution>
<attribution>Feng et al., "Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers"</attribution>
</div>
</section>
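<section>
<p style="font-size: 60%;">A toy sketch of this propagation in plain Python. The names below are hypothetical and are not the Spark implementation; as a simplifying assumption, the filter is evaluated on each value's best guess.</p>
<pre><code class="python">
from dataclasses import dataclass, field

@dataclass
class Caveated:
    value: object                              # best-guess value
    notes: list = field(default_factory=list)  # caveats on the value

def caveated_filter(records, column, predicate):
    kept = []
    for record in records:
        cell = record[column]
        if predicate(cell.value):
            # Notes on the tested value become notes on the row
            kept.append((record, list(cell.notes)))
    return kept

def caveated_count(kept):
    # Row notes flow back onto the aggregate value
    notes = [note for _, row_notes in kept for note in row_notes]
    return Caveated(len(kept), notes)

records = [{"x": Caveated(3.1)},
           {"x": Caveated(3.2, ["imputed"])},
           {"x": Caveated(1.0)}]
total = caveated_count(caveated_filter(records, "x", lambda v: v > 2.8))
</code></pre>
<p style="font-size: 60%;">The count of 2 carries the note from the imputed value, which answers "why isn't it certain?"</p>
</section>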
<section>
<!-- <img src="https://odin.cse.buffalo.edu/assets/logos/mimir_logo_final.png" height="100px"><br> -->
<img src="graphics/2022-02-16/caveat_dataframe.png"><br>
(<a href="https://mimirdb.info">https://mimirdb.info</a> | <a href="https://vizierdb.info">https://vizierdb.info</a>)
</section>
</section>
<section>
<section>
<h2>Incompleteness-Aware ML</h2>
<p class="fragment">... a work in progress</p>
<p class="fragment">... and initially focused on explainable models like Bayes Nets</p>
</section>
<section>
<h3>Bayes Net</h3>
<p>Fitting a model involves repeatedly computing:</p>
<p>
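$P[A | B, C]$ <span class="fragment">$= \frac{\sum_{D, E, \ldots} P[A, B, C, D, E, \ldots]}{\sum_{A', D, E, \ldots} P[A', B, C, D, E, \ldots]}$</span>
</p>
<pre class="fragment"><code class="python">
# Count the records for each (A, B, C) combination
counts = df.groupby(["A", "B", "C"]).size().reset_index(name="count")
# Total records for each (B, C), aligned row-by-row with counts
by_bc = counts.groupby(["B", "C"])["count"].transform("sum")
# P[A | B, C] = count(A, B, C) / count(B, C)
counts["prob"] = counts["count"] / by_bc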
</code></pre>
</section>
<section>
<h3>Bayes Net</h3>
<p>Fitting graphical models is just filtering &amp; counting<sup class="fragment">🤞</sup></p>
<p class="fragment">3 decades of work on incomplete databases come "for free"</p>
</section>
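<section>
<p style="font-size: 60%;">To make "filtering &amp; counting" concrete, a minimal sketch with the standard library, assuming made-up records of three discrete variables (A, B, C):</p>
<pre><code class="python">
from collections import Counter

# Hypothetical observations, one (A, B, C) tuple per record
records = [(0, 0, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0),
           (1, 1, 0), (0, 1, 0)]

# Count every (A, B, C) combination and every (B, C) combination
joint = Counter(records)
given = Counter((b, c) for _, b, c in records)

# P[A | B, C] = count(A, B, C) / count(B, C)
cond = {(a, b, c): n / given[(b, c)]
        for (a, b, c), n in joint.items()}
</code></pre>
<p style="font-size: 60%;">Three of the four records with B = 0 and C = 0 have A = 0, so the estimate of P[A = 0 | B = 0, C = 0] is 0.75.</p>
</section>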
<section>
<h3>Incomplete Bayes Nets</h3>
<ul>
<li class="fragment">"$P[A < 23 | B = 2] \in [0.7, 0.99]$"</li>
<li class="fragment">"The estimate is inaccurate because 100 records are missing attribute C in the following 3 datasets."</li>
<li class="fragment">"Improve precision by at least 50% by running simulation C on the following three materials."</li>
</ul>
</section>
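<section>
<p style="font-size: 60%;">A rough sketch of where such an interval can come from, assuming the condition B is known for every record and only the event on A may be missing (the function name and the data are made up):</p>
<pre><code class="python">
def probability_bounds(outcomes):
    # outcomes: True / False / None (Unknown) for the event on A,
    # one entry per record that satisfies the condition on B
    total = len(outcomes)
    yes = sum(1 for o in outcomes if o is True)
    unknown = sum(1 for o in outcomes if o is None)
    # Lowest estimate: every Unknown turns out False
    # Highest estimate: every Unknown turns out True
    return (yes / total, (yes + unknown) / total)

# 10 hypothetical records: 7 match, 1 does not, 2 are missing A
bounds = probability_bounds([True] * 7 + [False] + [None] * 2)
</code></pre>
<p style="font-size: 60%;">Here the estimate lies between 0.7 and 0.9; filling in the two missing values would close the gap.</p>
</section>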
</section>
<section>
<h3>Open Questions...</h3>
<ul>
<li>Measuring "unknown unknowns"</li>
<li>Combining models of incompleteness</li>
<li>Presenting/summarizing result incompleteness</li>
<li class="fragment">... and your questions too 😀</li>
</ul>
</section>