Materials talk

Oliver Kennedy 2022-02-22 23:52:54 -05:00
parent 5670699fb3
commit efab1d7e7f
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60
3 changed files with 282 additions and 14 deletions

@@ -7,7 +7,7 @@ title: "Caveatting your data"
<h3>Adding explainability to incomplete datasets</h3>
<h4 style="margin-top: 20px;">Oliver Kennedy</h4>
<h4>University at Buffalo</h4>
<p style="font-size: 60%;">(Based on joint work with Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...)</p>
<p style="font-size: 60%;">(Based on joint work with Olga Wodo, Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...)</p>
</section>
<section>
@@ -117,6 +117,16 @@ title: "Caveatting your data"
<section>
<p>Bob needs to know Alice's assumptions<br/>(and how to use the workflow)?</p>
</section>
<section>
<h3>In summary</h3>
<ul>
<li>Shortcuts are unavoidable</li>
<li>... but introduce ambiguity into datasets</li>
<li>... and processes</li>
</ul>
</section>
</section>
<section>
@@ -169,7 +179,7 @@ title: "Caveatting your data"
</section>
<section>
<h3>The curse of small data</h3>
<h3>Incomplete Coverage</h3>
<svg data-src="graphics/2022-02-16/small_data.svg"/>
</section>
@@ -183,13 +193,19 @@ title: "Caveatting your data"
<li class="fragment">Let the dataset users figure it out.</li>
</ul>
<p class="fragment">Small datasets must trade off between utility and trustworthiness.</p>
<p class="fragment"><b>The Curse of Small Data</b>: Utility vs Trustworthiness.</p>
</section>
<section>
<img src="graphics/2022-02-16/montoya.jpeg" height="400px">
</section>
</section>
<section>
<section>
<h2>Incomplete Data Management</h2>
</section>
<section>
<h3>What is Incomplete Data?</h3>
@@ -204,31 +220,283 @@ title: "Caveatting your data"
<p class="fragment">Data that is not known precisely.</p>
</section>
<section>
<h3>Data Model</h3>
<h4 class="fragment">(General Idea)</h4>
</section>
<section>
<ol>
<li>A 'placeholder' value or record. <ul style="font-size: 70%;">
<li class="fragment">A 'placeholder' value or record. <ul style="font-size: 70%;">
<li><tt>n/a</tt> or <tt>None</tt></li>
<li>"The value is probably 3.2"</li>
</ul></li>
<li>Constraints describing what we do know. <ul style="font-size: 70%;">
<li>The value must be between $0$ and $1$.</li>
<li>The value is normally distributed ($\mu = 10, \sigma^2 = 2$).</li>
<li>There are at most 10 records with values from $0.9$ to $1$.</li>
<li class="fragment">Constraints describing what we do know. <ul style="font-size: 70%;">
<li>"The value must be between $0$ and $0.7$."</li>
<li>"There are at most 10 records with values from $0.9$ to $1$."</li>
</ul></li>
<li>Metadata describing the source of the error. <ul style="font-size: 70%;">
<li>The experiment hasn't been run yet.</li>
<li>Alignment issues between datasets $A$ and $B$.</li>
<li class="fragment">Metadata describing the source of the error. <ul style="font-size: 70%;">
<li>"The experiment hasn't been run yet."</li>
<li>"Alignment issues between datasets $A$ and $B$."</li>
</ul></li>
</ol>
</section>
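<section>
<p style="font-size: 60%;">One way to picture the three ingredients above (placeholder, constraints, metadata) is as a single record type. This is a minimal sketch, not the representation used by any particular system; the class name <tt>IncompleteValue</tt> and its fields are hypothetical.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncompleteValue:
    """Hypothetical container for one imprecisely known value."""
    guess: Optional[float]          # placeholder: best guess, or None for n/a
    lower: float = float("-inf")    # constraint: smallest value it could take
    upper: float = float("inf")     # constraint: largest value it could take
    reason: str = ""                # metadata: where the uncertainty comes from

    def is_certain(self) -> bool:
        # The value is known exactly when the bounds coincide.
        return self.lower == self.upper

    def may_exceed(self, threshold: float) -> bool:
        # True if some value allowed by the constraints is above the threshold.
        return self.upper > threshold


# "The value must be between 0 and 0.7", because the experiment hasn't been run.
v = IncompleteValue(guess=None, lower=0.0, upper=0.7,
                    reason="The experiment hasn't been run yet.")
print(v.is_certain(), v.may_exceed(0.5))
```

</section>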
<section>
<table style="font-size: 70%">
<thead>
<tr>
<th>locale</th>
<th>rate</th>
<th>size</th>
</tr>
</thead>
<tr>
<td>Los Angeles</td>
<td>$[3\%,4\%]^1$</td>
<td>metro</td>
</tr>
<tr>
<td>Austin</td>
<td>$18\%$</td>
<td>[city, metro]$^2$</td>
</tr>
<tr>
<td>Houston</td>
<td>$14\%$</td>
<td>metro</td>
</tr>
<tr>
<td>Berlin</td>
<td colspan="2">$1\%$, town &nbsp;<b>or</b>&nbsp; $3\%$, city$^2$</td>
</tr>
<tr>
<td>Sacramento</td>
<td>$1\%$</td>
<td><b>null</b></td>
</tr>
<tr>
<td>Springfield</td>
<td><b>null</b></td>
<td>town</td>
</tr>
</table>
<p style="font-size: 60%; margin-top: 20px; ">
$1:$ Conflict between CDC and locality-reported statistics.<br/>
$2:$ Multiple localities with this name.
</p>
<attribution>Feng et al., "Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds"</attribution>
</section>
</section>
<section>
<section>
<h3>Incomplete Databases</h3>
<p>"Incompleteness" as a first-class database primitive.</p>
<ul>
<li class="fragment">Start with a Data Model (e.g., Relations/Data Frames)<ul>
<li class="fragment">Add a (formal) notion of "Incompleteness"</li>
</ul></li>
<li class="fragment">Pick a Query Language (e.g., SQL, Pandas, Spark)<ul>
<li class="fragment">Define semantics for queries under incompleteness.</li>
<li class="fragment">Optimize.</li>
</ul></li>
</ul>
</section>
<section>
<h3>How is Incomplete Data used?</h3>
<h3>Using Incomplete Data</h3>
<ul>
<li class="fragment">"Certain" and "Possible" answers.</li>
<li class="fragment">Summary Statistics.</li>
<li class="fragment">Presenting Incompleteness.</li>
<li class="fragment">Incompleteness-Aware ML.</li>
</ul>
</section>
</section>
<section>
<section>
<h3>UMAMI</h3>
<img src="graphics/2022-02-16/microstructure_filter.png">
</section>
<section>
<h3>Certain, Possible Answers</h3>
<pre><code class="python">
result = df[ (df["NORMALIZED_INTERFERENCE"] < 10)
           & (df["ABS_wf_D"] > 2.8) ]
</code></pre>
<p class="fragment">$3.1 > 2.8$ is True</p>
<p class="fragment">$\text{n/a} < 10$ is Uncertain</p>
<p class="fragment">$(3.1 > 2.8) \wedge (\text{n/a} < 10)$ is ???</p>
</section>
<section>
<p class="fragment">Records meeting both conditions are <i>certain</i> matches.</p>
<p class="fragment">Records with <tt>ABS_wf_D < 2.8</tt> are definitely not matches.</p>
<p class="fragment">Records with <tt>ABS_wf_D > 2.8</tt> but missing <tt>NORMALIZED_INTERFERENCE</tt> are <i>possible</i> matches.</p>
</section>
<section>
<h3>3-Valued Boolean Logic</h3>
<p>AND</p>
<table>
<tr><td></td><th>True</th><th>Unknown</th><th>False</th></tr>
<tr><th style="border-bottom: none; border-right: solid 1px black;">True</th><td>True</td><td>Unknown</td><td>False</td></tr>
<tr><th style="border-bottom: none; border-right: solid 1px black;">Unknown</th><td>Unknown</td><td>Unknown</td><td>False</td></tr>
<tr><th style="border-bottom: none; border-right: solid 1px black;">False</th><td>False</td><td>False</td><td>False</td></tr>
</table>
<p class="fragment">similar truth tables for OR, NOT, etc...</p>
</section>
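<section>
<p style="font-size: 60%;">The AND table above, and its OR and NOT counterparts, can be sketched in a few lines. This is an illustrative implementation of Kleene's three-valued logic that uses Python's <tt>None</tt> for Unknown; the function names are made up for this example.</p>

```python
from typing import Optional

# None stands in for "Unknown".
Tri = Optional[bool]


def and3(a: Tri, b: Tri) -> Tri:
    # A single False decides the result, even if the other side is Unknown.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a: Tri, b: Tri) -> Tri:
    # A single True decides the result, even if the other side is Unknown.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a: Tri) -> Tri:
    return None if a is None else (not a)


def less_than(value: Optional[float], bound: float) -> Tri:
    # Comparing a missing value yields Unknown.
    return None if value is None else value < bound


# (3.1 > 2.8) AND (n/a < 10)
print(and3(3.1 > 2.8, less_than(None, 10)))
```

</section>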
<section>
<h3>Certain vs Possible</h3>
<table>
<thead>
<tr>
<th>Result</th>
<th>Includes...</th>
</tr>
</thead>
<tr>
<td>Certain</td>
<td>Filter = True</td>
</tr>
<tr>
<td>Possible</td>
<td>Filter $\in$ { True, Unknown }</td>
</tr>
</table>
</section>
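<section>
<p style="font-size: 60%;">The certain and possible results in the table above can be sketched with pandas' nullable dtypes, whose <tt>&amp;</tt> operator follows three-valued logic. The four records and the thresholds are made up for illustration; only the column names come from the earlier filter.</p>

```python
import pandas as pd

# Toy data: None marks a value that is not known.
df = pd.DataFrame({
    "NORMALIZED_INTERFERENCE": pd.array([5.0, None, 12.0, None], dtype="Float64"),
    "ABS_wf_D": pd.array([3.1, 3.1, 3.1, 1.0], dtype="Float64"),
})

# Comparisons against a missing value produce <NA> (Unknown);
# False & <NA> is False, True & <NA> stays <NA>.
mask = (df["NORMALIZED_INTERFERENCE"] < 10) & (df["ABS_wf_D"] > 2.8)

# Certain: Filter = True. Unknown rows are dropped.
certain = df[mask.fillna(False).astype(bool)]

# Possible: Filter in {True, Unknown}. Unknown rows are kept.
possible = df[mask.fillna(True).astype(bool)]

print(list(certain.index), list(possible.index))
```

</section>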
<section>
<h3>Summary Statistics</h3>
<pre><code class="python">
result.count()
</code></pre>
<p class="fragment">There are at least <i>certain.count()</i> records.</p>
<p class="fragment">There are at most <i>possible.count()</i> records.</p>
<p class="fragment">The output is also incomplete</p>
<p class="fragment">(but in a different way)</p>
</section>
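<section>
<p style="font-size: 60%;">The two bounds on <tt>count()</tt> can be read directly off a three-valued filter mask. A minimal sketch, assuming the mask is a pandas nullable-boolean series; <tt>count_bounds</tt> is a hypothetical helper, not a pandas function.</p>

```python
from typing import Tuple

import pandas as pd


def count_bounds(mask: pd.Series) -> Tuple[int, int]:
    """Return (at least, at most) the number of rows passing the filter."""
    at_least = int(mask.fillna(False).sum())  # only rows known to match
    at_most = int(mask.fillna(True).sum())    # also rows that might match
    return at_least, at_most


# True = certain match, <NA> = possible match, False = not a match.
example_mask = pd.Series([True, pd.NA, False, pd.NA, True], dtype="boolean")
bounds = count_bounds(example_mask)
print(bounds)
```

</section>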
</section>
<section>
<section>
<h3>Presenting Incompleteness</h3>
<p><b>Database: </b> This nanostructure is possibly a result.</p>
<p><b>You: </b> Why isn't it certain?</p>
</section>
<section>
<p>Incompleteness as a provenance problem <br>(aka lineage, pedigree)</p>
<attribution>Yang et al., "Lenses: An On-Demand Approach to ETL"</attribution>
</section>
<section>
<h3>Caveats</h3>
<p>Annotate incomplete values with a small note</p>
<p class="fragment"><b>Challenge: </b> Propagating annotations through queries.</p>
</section>
<section>
<p>Filter = Uncertain $\rightarrow$ The annotation propagates to the row.</p>
<p class="fragment"><tt>df.count()</tt> $\rightarrow$ Row annotations propagate back to the value.</p>
<div class="fragment">
<p>We have caveats working with Apache Spark,<br/><i>with negligible overhead</i>.</p>
<attribution>Brachmann et al., "Your notebook is not crumby enough, REPLace it"</attribution>
<attribution>Feng et al., "Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers"</attribution>
</div>
</section>
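<section>
<p style="font-size: 60%;">The two propagation rules above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not the Spark implementation; <tt>Caveated</tt>, <tt>filter_rows</tt>, and <tt>count</tt> are hypothetical names.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Caveated:
    """A value plus the notes (caveats) attached to it."""
    value: Any
    caveats: List[str] = field(default_factory=list)


Row = Dict[str, Caveated]


def filter_rows(rows: List[Row], column: str,
                predicate: Callable[[Any], bool]) -> List[Tuple[Row, List[str]]]:
    """Keep certain and possible matches; return (row, row-level caveats)."""
    kept = []
    for row in rows:
        cell = row[column]
        if cell.value is None:
            # Filter = Unknown: the value's caveats move to the row.
            kept.append((row, list(cell.caveats)))
        elif predicate(cell.value):
            kept.append((row, []))
    return kept


def count(kept: List[Tuple[Row, List[str]]]) -> Caveated:
    # Row caveats move back onto the aggregate value.
    notes = [note for _, row_caveats in kept for note in row_caveats]
    return Caveated(len(kept), notes)


rows = [
    {"x": Caveated(5)},
    {"x": Caveated(None, ["x was not measured"])},
    {"x": Caveated(20)},
]
total = count(filter_rows(rows, "x", lambda v: v < 10))
print(total.value, total.caveats)
```

</section>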
<section>
<!-- <img src="https://odin.cse.buffalo.edu/assets/logos/mimir_logo_final.png" height="100px"><br> -->
<img src="graphics/2022-02-16/caveat_dataframe.png"><br>
(<a href="https://mimirdb.info">https://mimirdb.info</a> | <a href="https://vizierdb.info">https://vizierdb.info</a>)
</section>
</section>
<section>
<section>
<h2>Incompleteness-Aware ML</h2>
<p class="fragment">... a work in progress</p>
<p class="fragment">... and initially focused on explainable models like Bayes Nets</p>
</section>
<section>
<h3>Bayes Net</h3>
<p>Fitting a model involves repeatedly computing:</p>
<p>
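$P[A | B, C]$ <span class="fragment">$= \frac{\sum_{D, E, \ldots} P[A, B, C, D, E, \ldots]}{\sum_{A, D, E, \ldots} P[A, B, C, D, E, \ldots]}$</span>
</p>
<pre class="fragment"><code class="python">
# Count the records for each (A, B, C) combination.
counts = df.groupby(["A", "B", "C"]).size().rename("count").reset_index()
# Count the records for each (B, C) combination (the denominator).
by_bc = counts.groupby(["B", "C"])["count"].transform("sum")
# P[A | B, C] = count(A, B, C) / count(B, C)
counts["prob"] = counts["count"] / by_bc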
</code></pre>
</section>
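<section>
<p style="font-size: 60%;">As a worked example of the counting step, the sketch below estimates $P[A | B, C]$ from a small, fully observed table. The data and the helper name <tt>conditional_table</tt> are invented for illustration.</p>

```python
from typing import List

import pandas as pd


def conditional_table(df: pd.DataFrame, target: str, given: List[str]) -> pd.DataFrame:
    """Estimate P[target | given] by counting observed combinations."""
    # Numerator: number of records per (target, given...) combination.
    counts = df.groupby([target] + given).size().rename("count").reset_index()
    # Denominator: number of records per (given...) combination.
    totals = counts.groupby(given)["count"].transform("sum")
    counts["prob"] = counts["count"] / totals
    return counts


toy = pd.DataFrame({
    "A": [0, 0, 0, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1],
    "C": [0, 0, 0, 0, 0, 0],
})
table = conditional_table(toy, "A", ["B", "C"])
print(table)
```

</section>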
<section>
<h3>Bayes Net</h3>
<p>Fitting graphical models is just filtering &amp; counting<sup class="fragment">🤞</sup></p>
<p class="fragment">3 decades of work on incomplete databases come "for free"</p>
</section>
<section>
<h3>Incomplete Bayes Nets</h3>
<ul>
<li class="fragment">"$P[A < 23 | B = 2] \in [0.7, 0.99]$"</li>
<li class="fragment">"The estimate is inaccurate because 100 records are missing attribute C in the following 3 datasets."</li>
<li class="fragment">"Improve precision by at least 50% by running simulation C on the following three materials."</li>
</ul>
</section>
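<section>
<p style="font-size: 60%;">An interval such as the one in the first bullet can arise from simple counting. The sketch below bounds $P[A < t \mid B = b]$ under the simplifying assumption that $B$ is always known and only $A$ may be missing; the data and the function name <tt>conditional_bounds</tt> are hypothetical.</p>

```python
from typing import List, Optional, Tuple


def conditional_bounds(a_values: List[Optional[float]],
                       threshold: float) -> Tuple[float, float]:
    """Bounds on P[A < threshold] over records that share one value of B.

    None marks a record whose A value is missing.
    """
    n = len(a_values)
    if n == 0:
        raise ValueError("no records with the requested value of B")
    certain = sum(1 for a in a_values if a is not None and a < threshold)
    unknown = sum(1 for a in a_values if a is None)
    # Lower bound: no missing value is below the threshold.
    # Upper bound: every missing value is below the threshold.
    return certain / n, (certain + unknown) / n


# A values of the records where B = 2 (made up for this example).
a_given_b2 = [10.0, 21.5, None, 30.0, 22.0, None, 5.0, 19.0, 12.0, 8.0]
low, high = conditional_bounds(a_given_b2, 23)
print(low, high)
```

</section>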
</section>
<section>
<h3>Open Questions...</h3>
<ul>
<li>Measuring "unknown unknowns"</li>
<li>Combining models of incompleteness</li>
<li>Presenting/summarizing result incompleteness</li>
<li class="fragment">... and your questions too 😀</li>
</ul>
</section>

Binary file not shown.

Binary file not shown.
