Materials talk
parent 5670699fb3
commit efab1d7e7f
|
@ -7,7 +7,7 @@ title: "Caveatting your data"
|
|||
<h3>Adding explainability to incomplete datasets</h3>
|
||||
<h4 style="margin-top: 20px;">Oliver Kennedy</h4>
|
||||
<h4>University at Buffalo</h4>
|
||||
<p style="font-size: 60%;">(Based on joint work with Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...)</p>
|
||||
<p style="font-size: 60%;">(Based on joint work with Olga Wodo, Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...)</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
|
@ -117,6 +117,16 @@ title: "Caveatting your data"
|
|||
<section>
|
||||
<p>Bob needs to know Alice's assumptions<br/>(and how to use the workflow)?</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>In summary</h3>
|
||||
|
||||
<ul>
|
||||
<li>Shortcuts are unavoidable</li>
|
||||
<li>... but introduce ambiguity into datasets</li>
|
||||
<li>... and processes</li>
|
||||
</ul>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
|
@ -169,7 +179,7 @@ title: "Caveatting your data"
|
|||
</section>
|
||||
|
||||
<section>
|
||||
<h3>The curse of small data</h3>
|
||||
<h3>Incomplete Coverage</h3>
|
||||
|
||||
<svg data-src="graphics/2022-02-16/small_data.svg"/>
|
||||
</section>
|
||||
|
@ -183,13 +193,19 @@ title: "Caveatting your data"
|
|||
<li class="fragment">Let the dataset users figure it out.</li>
|
||||
</ul>
|
||||
|
||||
<p class="fragment">Small datasets must trade off between utility and trustworthiness.</p>
|
||||
<p class="fragment"><b>The Curse of Small Data</b>: Utility vs Trustworthiness.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<img src="graphics/2022-02-16/montoya.jpeg" height="400px">
|
||||
</section>
|
||||
|
||||
</section>
|
||||
<section>
|
||||
<section>
|
||||
<h2>Incomplete Data Management</h2>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>What is Incomplete Data?</h3>
|
||||
|
||||
|
@ -204,31 +220,283 @@ title: "Caveatting your data"
|
|||
<p class="fragment">Data that is not known precisely.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Data Model</h3>
|
||||
|
||||
<h4 class="fragment">(General Idea)</h4>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<ol>
|
||||
<li>A 'placeholder' value or record. <ul style="font-size: 70%;">
|
||||
<li class="fragment">A 'placeholder' value or record. <ul style="font-size: 70%;">
|
||||
<li><tt>n/a</tt> or <tt>None</tt></li>
|
||||
<li>"The value is probably 3.2"</li>
|
||||
</ul></li>
|
||||
<li>Constraints describing what we do know. <ul style="font-size: 70%;">
|
||||
<li>The value must be between $0$ and $1$.</li>
|
||||
<li>The value is normally distributed ($\mu = 10, \sigma^2 = 2$).</li>
|
||||
<li>There are at most 10 records with values from $0.9$ to $1$.</li>
|
||||
<li class="fragment">Constraints describing what we do know. <ul style="font-size: 70%;">
|
||||
<li>"The value must be between $0$ and $0.7$."</li>
|
||||
<li>"There are at most 10 records with values from $0.9$ to $1$."</li>
|
||||
</ul></li>
|
||||
<li>Metadata describing the source of the error. <ul style="font-size: 70%;">
|
||||
<li>The experiment hasn't been run yet.</li>
|
||||
<li>Alignment issues between datasets $A$ and $B$.</li>
|
||||
<li class="fragment">Metadata describing the source of the error. <ul style="font-size: 70%;">
|
||||
<li>"The experiment hasn't been run yet."</li>
|
||||
<li>"Alignment issues between datasets $A$ and $B$."</li>
|
||||
</ul></li>
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<table style="font-size: 70%">
|
||||
<thead>
|
||||
<th>locale</th>
|
||||
<th>rate</th>
|
||||
<th>size</th>
|
||||
</thead>
|
||||
<tr>
|
||||
<td>Los Angeles</td>
|
||||
<td>$[3\%,4\%]^1$</td>
|
||||
<td>metro</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Austin</td>
|
||||
<td>$18\%$</td>
|
||||
<td>[city, metro]$^2$</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Houston</td>
|
||||
<td>$14\%$</td>
|
||||
<td>metro</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Berlin</td>
|
||||
<td colspan="2">$1\%$, town <b>or</b> $3\%$, city$^2$</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Sacramento</td>
|
||||
<td>$1\%$</td>
|
||||
<td><b>null</b></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Springfield</td>
|
||||
<td><b>null</b></td>
|
||||
<td>town</td>
|
||||
</tr>
|
||||
</table>
|
||||
<p style="font-size: 60%; margin-top: 20px; ">
|
||||
$1:$ Conflict between CDC and locality-reported statistics.<br/>
|
||||
$2:$ Multiple localities with this name.
|
||||
</p>
|
||||
|
||||
<attribution>Feng et al., "Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds"</attribution>
|
||||
</section>
|
||||
|
||||
|
||||
</section>
|
||||
|
||||
<section>
|
||||
|
||||
<section>
|
||||
<h3>Incomplete Databases</h3>
|
||||
|
||||
<p>"Incompleteness" as a first-class database primitive.</p>
|
||||
|
||||
<ul>
|
||||
<li class="fragment">Start with a Data Model (e.g., Relations/Data Frames)<ul>
|
||||
<li class="fragment">Add a (formal) notion of "Incompleteness"</li>
|
||||
</ul></li>
|
||||
<li class="fragment">Pick a Query Language (e.g., SQL, Pandas, Spark)<ul>
|
||||
<li class="fragment">Define semantics for queries under incompleteness.</li>
|
||||
<li class="fragment">Optimize.</li>
|
||||
</ul></li>
|
||||
</ul>
|
||||
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>How is Incomplete Data used?</h3>
|
||||
|
||||
<h3>Using Incomplete Data</h3>
|
||||
|
||||
<ul>
|
||||
<li class="fragment">"Certain" and "Possible" answers.</li>
|
||||
<li class="fragment">Summary Statistics.</li>
|
||||
<li class="fragment">Presenting Incompleteness.</li>
|
||||
<li class="fragment">Incompleteness-Aware ML.</li>
|
||||
</ul>
|
||||
</section>
|
||||
|
||||
</section>
|
||||
|
||||
<section>
|
||||
|
||||
<section>
|
||||
<h3>UMAMI</h3>
|
||||
|
||||
<img src="graphics/2022-02-16/microstructure_filter.png">
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Certain, Possible Answers</h3>
|
||||
|
||||
<pre><code class="python">
|
||||
result = df[ (df["NORMALIZED_INTERFERENCE"] < 10)
           & (df["ABS_wf_D"] > 2.8) ]
|
||||
</code></pre>
|
||||
<p class="fragment">$3.1 > 2.8$ is True</p>
|
||||
<p class="fragment">$\text{n/a} < 10$ is Uncertain</p>
|
||||
<p class="fragment">$(3.1 > 2.8) \wedge (\text{n/a} < 10)$ is ???</p>
|
||||
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<p class="fragment">Records meeting both conditions are <i>certain</i> matches.</p>
|
||||
<p class="fragment">Records with <tt>ABS_wf_D &le; 2.8</tt> are definitely not matches.</p>
|
||||
<p class="fragment">Records with <tt>ABS_wf_D > 2.8</tt> but missing <tt>NORMALIZED_INTERFERENCE</tt> are <i>possible</i> matches.</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>3-Valued Boolean Logic</h3>
|
||||
|
||||
<p>AND</p>
|
||||
|
||||
<table>
|
||||
<tr><td></td><th>True</th><th>Unknown</th><th>False</th></tr>
|
||||
<tr><th style="border-bottom: none; border-right: solid 1px black;">True</th><td>True</td><td>Unknown</td><td>False</td></tr>
|
||||
<tr><th style="border-bottom: none; border-right: solid 1px black;">Unknown</th><td>Unknown</td><td>Unknown</td><td>False</td></tr>
|
||||
<tr><th style="border-bottom: none; border-right: solid 1px black;">False</th><td>False</td><td>False</td><td>False</td></tr>
|
||||
</table>
|
||||
|
||||
<p class="fragment">similar truth tables for OR, NOT, etc...</p>
|
||||
</section>
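<!-- Markdown slide; assumes the reveal.js Markdown plugin is enabled for this deck -->
<section data-markdown>
<textarea data-template>
A minimal sketch of the AND table above, together with the OR and NOT tables it alludes to, using Python's `None` to stand in for Unknown. The names `and3`, `or3`, and `not3` are made up for illustration and do not come from any library.

```python
from typing import Optional

# A three-valued truth value: True, False, or None (Unknown).
Tri = Optional[bool]


def and3(a: Tri, b: Tri) -> Tri:
    """AND: a single False decides the result, otherwise Unknown is contagious."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a: Tri, b: Tri) -> Tri:
    """OR: a single True decides the result, otherwise Unknown is contagious."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a: Tri) -> Tri:
    """NOT: Unknown stays Unknown."""
    return None if a is None else not a
```

Here `and3(True, None)` stays Unknown, while `and3(None, False)` is False no matter what the missing value turns out to be.
</textarea>
</section>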
|
||||
|
||||
<section>
|
||||
<h3>Certain vs Possible</h3>
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th>Result</th>
|
||||
<th>Includes...</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tr>
|
||||
<td>Certain</td>
|
||||
<td>Filter = True</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Possible</td>
|
||||
<td>Filter $\in$ { True, Unknown }</td>
|
||||
</tr>
|
||||
</table>
|
||||
</section>
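<!-- Markdown slide; assumes the reveal.js Markdown plugin is enabled for this deck -->
<section data-markdown>
<textarea data-template>
One way to read the table above in code: evaluate the filter under 3-valued logic, keep rows that come out True as certain, and keep rows that come out True or Unknown as possible. This is a plain-Python sketch rather than pandas; the helper names and the three sample records are invented for illustration, and only the column names come from the earlier filter.

```python
def less_than(value, bound):
    """3-valued comparison: None (Unknown) when the value is missing."""
    return None if value is None else value < bound


def greater_than(value, bound):
    """3-valued comparison: None (Unknown) when the value is missing."""
    return None if value is None else value > bound


def and3(a, b):
    """3-valued AND, with None standing in for Unknown."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def classify(records):
    """Split records into certain matches and possible matches."""
    certain, possible = [], []
    for record in records:
        verdict = and3(
            less_than(record["NORMALIZED_INTERFERENCE"], 10),
            greater_than(record["ABS_wf_D"], 2.8),
        )
        if verdict is True:
            certain.append(record)
        if verdict is not False:  # True or Unknown
            possible.append(record)
    return certain, possible


# Hypothetical records, made up for this example.
records = [
    {"id": 1, "NORMALIZED_INTERFERENCE": 4.0, "ABS_wf_D": 3.1},   # both conditions hold
    {"id": 2, "NORMALIZED_INTERFERENCE": None, "ABS_wf_D": 3.1},  # first condition Unknown
    {"id": 3, "NORMALIZED_INTERFERENCE": None, "ABS_wf_D": 1.0},  # second condition False
]
certain, possible = classify(records)
certain_ids = [r["id"] for r in certain]
possible_ids = [r["id"] for r in possible]
```

Every certain match is also a possible match, so the certain result is always a subset of the possible one.
</textarea>
</section>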
|
||||
|
||||
<section>
|
||||
<h3>Summary Statistics</h3>
|
||||
|
||||
<pre><code class="python">
|
||||
result.count()
|
||||
</code></pre>
|
||||
|
||||
<p class="fragment">There are at least <i>certain.count()</i> records.</p>
|
||||
<p class="fragment">There are at most <i>possible.count()</i> records.</p>
|
||||
<p class="fragment">The output is also incomplete</p>
|
||||
<p class="fragment">(but in a different way)</p>
|
||||
</section>
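<!-- Markdown slide; assumes the reveal.js Markdown plugin is enabled for this deck -->
<section data-markdown>
<textarea data-template>
The two bounds above can be sketched directly from the per-row filter verdicts. The function name `count_bounds` and the sample verdicts are assumptions made for this example; `None` again stands for Unknown.

```python
def count_bounds(verdicts):
    """Return (lower, upper) bounds on how many rows pass the filter.

    verdicts holds one entry per row: True, False, or None (Unknown).
    """
    lower = sum(1 for v in verdicts if v is True)       # certain matches
    upper = sum(1 for v in verdicts if v is not False)  # possible matches
    return lower, upper


# Hypothetical verdicts for four rows.
bounds = count_bounds([True, None, False, None])
```

The count is no longer a single number but a range, which is the sense in which the output is also incomplete.
</textarea>
</section>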
|
||||
|
||||
</section>
|
||||
|
||||
<section>
|
||||
|
||||
<section>
|
||||
<h3>Presenting Incompleteness</h3>
|
||||
|
||||
<p><b>Database: </b> This nanostructure is possibly a result.</p>
|
||||
<p><b>You: </b> Why isn't it certain?</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<p>Incompleteness as a provenance problem <br>(aka lineage, pedigree)</p>
|
||||
<attribution>Yang et al., "Lenses: An On-Demand Approach to ETL"</attribution>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Caveats</h3>
|
||||
|
||||
<p>Annotate incomplete values with a small note</p>
|
||||
|
||||
<p class="fragment"><b>Challenge: </b> Propagating annotations through queries.</p>
|
||||
</section>
|
||||
|
||||
|
||||
<section>
|
||||
<p>Filter = Uncertain $\rightarrow$ The annotation propagates to the row.</p>
|
||||
|
||||
<p class="fragment"><tt>df.count()</tt> $\rightarrow$ Row annotations propagate back to the value.</p>
|
||||
|
||||
<div class="fragment">
|
||||
<p>We have caveats working with Apache Spark,<br/><i>with negligible overhead</i>.</p>
|
||||
<attribution>Brachmann et al., "Your notebook is not crumby enough, REPLace it"</attribution>
|
||||
<attribution>Feng et al., "Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers"</attribution>
|
||||
</div>
|
||||
</section>
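<!-- Markdown slide; assumes the reveal.js Markdown plugin is enabled for this deck -->
<section data-markdown>
<textarea data-template>
A toy sketch of the two propagation rules above, not the Spark implementation: a caveat on a value that a filter inspects moves onto the row, and a count over caveated rows carries their notes onto its result. The `Caveated` type, both functions, and the sample rows are hypothetical. The filter follows best-guess values only; a fuller treatment would also have to account for caveated rows that the filter drops.

```python
from dataclasses import dataclass, field


@dataclass
class Caveated:
    """A best-guess value plus notes explaining why it may be wrong (hypothetical type)."""
    value: object
    notes: list = field(default_factory=list)


def filter_rows(rows, column, predicate):
    """Keep rows whose best-guess value passes; caveats on that value move to the row."""
    kept = []
    for cells, row_notes in rows:
        cell = cells[column]
        if predicate(cell.value):
            kept.append((cells, row_notes + cell.notes))
    return kept


def count_rows(rows):
    """Count rows; every row-level caveat moves back onto the resulting value."""
    notes = [note for _, row_notes in rows for note in row_notes]
    return Caveated(len(rows), notes)


# Hypothetical rows: (cells, row-level notes).
rows = [
    ({"ABS_wf_D": Caveated(3.1)}, []),
    ({"ABS_wf_D": Caveated(3.2, ["imputed from a neighboring sample"])}, []),
    ({"ABS_wf_D": Caveated(1.0)}, []),
]
kept = filter_rows(rows, "ABS_wf_D", lambda v: v > 2.8)
total = count_rows(kept)
```

Here `total.value` is the count and `total.notes` lists the reasons that count might be off.
</textarea>
</section>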
|
||||
|
||||
<section>
|
||||
<!-- <img src="https://odin.cse.buffalo.edu/assets/logos/mimir_logo_final.png" height="100px"><br> -->
|
||||
<img src="graphics/2022-02-16/caveat_dataframe.png"><br>
|
||||
(<a href="https://mimirdb.info">https://mimirdb.info</a> | <a href="https://vizierdb.info">https://vizierdb.info</a>)
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
|
||||
<section>
|
||||
<h2>Incompleteness-Aware ML</h2>
|
||||
|
||||
<p class="fragment">... a work in progress</p>
|
||||
|
||||
<p class="fragment">... and initially focused on explainable models like Bayes Nets</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Bayes Net</h3>
|
||||
|
||||
<p>Fitting a model involves repeatedly computing:</p>
|
||||
<p>
|
||||
$P[A | B, C]$ <span class="fragment">$= \frac{\sum_{D, E, \ldots} P[A, B, C, D, E, \ldots]}{\sum_{A, D, E, \ldots} P[A, B, C, D, E, \ldots]}$</span>
|
||||
</p>
|
||||
|
||||
<pre class="fragment"><code class="python">
|
||||
# count(A, B, C) for every observed combination
joint = df.groupby(["A", "B", "C"]).size()
# count(B, C), broadcast back onto the (A, B, C) index
by_bc = joint.groupby(level=["B", "C"]).transform("sum")
# P[A | B, C] = count(A, B, C) / count(B, C)
p_a_given_bc = joint / by_bc
|
||||
</code></pre>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Bayes Net</h3>
|
||||
<p>Fitting graphical models is just filtering & counting<sup class="fragment">🤞</sup></p>
|
||||
|
||||
<p class="fragment">3 decades of work on incomplete databases come "for free"</p>
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Incomplete Bayes Nets</h3>
|
||||
|
||||
<ul>
|
||||
<li class="fragment">"$P[A < 23 | B = 2] \in [0.7, 0.99]$"</li>
|
||||
<li class="fragment">"The estimate is inaccurate because 100 records are missing attribute C in the following 3 datasets."</li>
|
||||
<li class="fragment">"Improve precision by at least 50% by running simulation C on the following three materials."</li>
|
||||
</ul>
|
||||
</section>
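<!-- Markdown slide; assumes the reveal.js Markdown plugin is enabled for this deck -->
<section data-markdown>
<textarea data-template>
A rough sketch of where an interval like the first bullet could come from, under the simplifying assumption that only attribute $A$ can be missing and $B$ is always known. Records with a missing $A$ count against the event for the lower bound and in favor of it for the upper bound. The function names and the sample records are invented for illustration.

```python
def conditional_bounds(records, event, given):
    """Bound P[event | given] when the event may be Unknown (None) for some records."""
    yes = no = unknown = 0
    for record in records:
        if not given(record):
            continue
        verdict = event(record)  # True, False, or None when the attribute is missing
        if verdict is None:
            unknown += 1
        elif verdict:
            yes += 1
        else:
            no += 1
    total = yes + no + unknown
    if total == 0:
        raise ValueError("no records satisfy the condition")
    # Lower bound: every Unknown turns out False. Upper bound: every Unknown turns out True.
    return yes / total, (yes + unknown) / total


def a_below_23(record):
    """3-valued test of A < 23."""
    return None if record["A"] is None else record["A"] < 23


# Hypothetical records: 7 certain matches, 1 certain non-match, 2 missing A, 1 with B != 2.
records = (
    [{"A": 10, "B": 2}] * 7
    + [{"A": 30, "B": 2}]
    + [{"A": None, "B": 2}] * 2
    + [{"A": 5, "B": 1}]
)
low, high = conditional_bounds(records, a_below_23, lambda r: r["B"] == 2)
```

The width of the interval, `high - low`, is the fraction of relevant records with a missing value, which is what the second and third bullets propose to explain and to shrink.
</textarea>
</section>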
|
||||
|
||||
</section>
|
||||
|
||||
<section>
|
||||
<h3>Open Questions...</h3>
|
||||
|
||||
<ul>
|
||||
<li>Measuring "unknown unknowns"</li>
|
||||
<li>Combining models of incompleteness</li>
|
||||
<li>Presenting/summarizing result incompleteness</li>
|
||||
<li class="fragment">... and your questions too 😀</li>
|
||||
</ul>
|
||||
|
||||
</section>
|
BIN
src/talks/graphics/2022-02-16/caveat_dataframe.png
Normal file
Binary file not shown. Size: 506 KiB
BIN
src/talks/graphics/2022-02-16/microstructure_filter.png
Normal file
Binary file not shown. Size: 43 KiB