Materials talk

2022-02-22 23:52:54 -05:00 · 2022-02-22 23:52:54 -05:00 · efab1d7e7f
parent 5670699fb3
commit efab1d7e7f
3 changed files with 282 additions and 14 deletions
--- a/src/talks/2022-02-16-MaterialsDB.erb
+++ b/src/talks/2022-02-16-MaterialsDB.erb
@@ -7,7 +7,7 @@ title: "Caveatting your data"
  <h3>Adding explainability to incomplete datasets</h3>
  <h4 style="margin-top: 20px;">Oliver Kennedy</h4>
  <h4>University at Buffalo</h4>
-  <p style="font-size: 60%;">(Based on joint work with Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...</p>
+  <p style="font-size: 60%;">(Based on joint work with Olga Wodo, Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...)</p>
 </section>

 <section>
@@ -117,6 +117,16 @@ title: "Caveatting your data"
  <section>
    <p>Bob needs to know Alice's assumptions<br/>(and how to use the workflow)?</p>
  </section>
+
+  <section>
+    <h3>In summary</h3>
+
+    <ul>
+      <li>Shortcuts are unavoidable</li>
+      <li>... but introduce ambiguity into datasets</li>
+      <li>... and processes</li>
+    </ul>
+  </section>
 </section>

 <section>
@@ -169,7 +179,7 @@ title: "Caveatting your data"
  </section>

  <section>
-    <h3>The curse of small data</h3>
+    <h3>Incomplete Coverage</h3>

    <svg data-src="graphics/2022-02-16/small_data.svg"/>
  </section>
@@ -183,13 +193,19 @@ title: "Caveatting your data"
      <li class="fragment">Let the dataset users figure it out.</li>
    </ul>

-    <p class="fragment">Small datasets must trade off between utility and trustworthiness.</p>
+    <p class="fragment"><b>The Curse of Small Data</b>: Utility vs Trustworthiness.</p>
  </section>

  <section>
    <img src="graphics/2022-02-16/montoya.jpeg" height="400px">
  </section>

+</section>
+<section>
+  <section>
+    <h2>Incomplete Data Management</h2>
+  </section>
+
  <section>
    <h3>What is Incomplete Data?</h3>

@@ -204,31 +220,283 @@ title: "Caveatting your data"
    <p class="fragment">Data that is not known precisely.</p>
  </section>

+  <section>
+    <h3>Data Model</h3>
+
+    <h4 class="fragment">(General Idea)</h4>
+  </section>
+
  <section>
    <ol>
-      <li>A 'placeholder' value or record. <ul style="font-size: 70%;">
+      <li class="fragment">A 'placeholder' value or record. <ul style="font-size: 70%;">
        <li><tt>n/a</tt> or <tt>None</tt></li>
+        <li>"The value is probably 3.2"</li>
      </ul></li>
-      <li>Constraints describing what we do know. <ul style="font-size: 70%;">
-        <li>The value must be between $0$ and $1$.</li>
-        <li>The value is normally distributed ($\mu = 10, \sigma^2 = 2$).</li>
-        <li>There are at most 10 records with values from $0.9$ to $1$.</li>
+      <li class="fragment">Constraints describing what we do know. <ul style="font-size: 70%;">
+        <li>"The value must be between $0$ and $0.7$."</li>
+        <li>"There are at most 10 records with values from $0.9$ to $1$."</li>
      </ul></li>
-      <li>Metadata describing the source of the error. <ul style="font-size: 70%;">
-        <li>The experiment hasn't been run yet.</li>
-        <li>Alignment issues between datasets $A$ and $B$.</li>
+      <li class="fragment">Metadata describing the source of the error. <ul style="font-size: 70%;">
+        <li>"The experiment hasn't been run yet."</li>
+        <li>"Alignment issues between datasets $A$ and $B$."</li>
      </ul></li>
    </ol>
  </section>

  <section>
+    <table style="font-size: 70%">
+      <thead><tr>
+        <th>locale</th>
+        <th>rate</th>
+        <th>size</th>
+      </tr></thead>
+      <tr>
+        <td>Los Angeles</td>
+        <td>$[3\%,4\%]^1$</td>
+        <td>metro</td>
+      </tr>
+      <tr>
+        <td>Austin</td>
+        <td>$18\%$</td>
+        <td>[city, metro]$^2$</td>
+      </tr>
+      <tr>
+        <td>Houston</td>
+        <td>$14\%$</td>
+        <td>metro</td>
+      </tr>
+      <tr>
+        <td>Berlin</td>
+        <td colspan="2">$1\%$, town &nbsp;<b>or</b>&nbsp; $3\%$, city$^2$</td>
+      </tr>
+      <tr>
+        <td>Sacramento</td>
+        <td>$1\%$</td>
+        <td><b>null</b></td>
+      </tr>
+      <tr>
+        <td>Springfield</td>
+        <td><b>null</b></td>
+        <td>town</td>
+      </tr>
+    </table>
+    <p style="font-size: 60%; margin-top: 20px; ">
+      $1:$ Conflict between CDC and locality-reported statistics.<br/>
+      $2:$ Multiple localities with this name.
+    </p>
+
+    <attribution>Feng et al., "Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds"</attribution>
+  </section>
+
+
+</section>
+
+<section>
+  
+  <section>
+    <h3>Incomplete Databases</h3>
+
+    <p>"Incompleteness" as a first-class database primitive.</p>
+
+    <ul>
+      <li class="fragment">Start with a Data Model (e.g., Relations/Data Frames)<ul>
+        <li class="fragment">Add a (formal) notion of "Incompleteness"</li>
+      </ul></li>
+      <li class="fragment">Pick a Query Language (e.g., SQL, Pandas, Spark)<ul>
+        <li class="fragment">Define semantics for queries under incompleteness.</li>
+        <li class="fragment">Optimize.</li>
+      </ul></li>
+    </ul>

  </section>

  <section>
-    <h3>How is Incomplete Data used?</h3>
-
+    <h3>Using Incomplete Data</h3>

+    <ul>
+      <li class="fragment">"Certain" and "Possible" answers.</li>
+      <li class="fragment">Summary Statistics.</li>
+      <li class="fragment">Presenting Incompleteness.</li>
+      <li class="fragment">Incompleteness-Aware ML.</li>
+    </ul>
  </section>

 </section>
+
+<section>
+
+  <section>
+    <h3>UMAMI</h3>
+
+    <img src="graphics/2022-02-16/microstructure_filter.png">
+  </section>
+
+  <section>
+    <h3>Certain, Possible Answers</h3>
+
+    <pre><code class="python">
+      result = df[ (df["NORMALIZED_INTERFERENCE"] < 10)
+                    & (df["ABS_wf_D"] > 2.8) ]
+    </code></pre>
+    <p class="fragment">$3.1 > 2.8$ is True</p>
+    <p class="fragment">$\text{n/a} < 10$ is Uncertain</p>
+    <p class="fragment">$(3.1 > 2.8) \wedge (\text{n/a} < 10)$ is ???</p>
+
+  </section>
+
+  <section>
+    <p class="fragment">Records meeting both conditions are <i>certain</i> matches.</p>
+    <p class="fragment">Records with <tt>ABS_wf_D < 2.8</tt> are definitely not matches.</p>
+    <p class="fragment">Records with <tt>ABS_wf_D > 2.8</tt> but missing <tt>NORMALIZED_INTERFERENCE</tt> are <i>possible</i> matches.</p>
+  </section>
+
+  <section>
+    <h3>3-Valued Boolean Logic</h3>
+
+    <p>AND</p>
+
+    <table>
+      <tr><td></td><th>True</th><th>Unknown</th><th>False</th></tr>
+      <tr><th style="border-bottom: none; border-right: solid 1px black;">True</th><td>True</td><td>Unknown</td><td>False</td></tr>
+      <tr><th style="border-bottom: none; border-right: solid 1px black;">Unknown</th><td>Unknown</td><td>Unknown</td><td>False</td></tr>
+      <tr><th style="border-bottom: none; border-right: solid 1px black;">False</th><td>False</td><td>False</td><td>False</td></tr>
+    </table>
+
+    <p class="fragment">similar truth tables for OR, NOT, etc...</p>
+  </section>
+
+  <section>
+    <h3>Certain vs Possible</h3>
+
+    <table>
+      <thead>
+        <tr>
+          <th>Result</th>
+          <th>Includes...</th>
+        </tr>
+      </thead>
+      <tr>
+        <td>Certain</td>
+        <td>Filter = True</td>
+      </tr>
+      <tr>
+        <td>Possible</td>
+        <td>Filter $\in$ { True, Unknown }</td>
+      </tr>
+    </table>
+  </section>
+
+  <section>
+    <h3>Summary Statistics</h3>
+
+    <pre><code class="python">
+      result.count()
+    </code></pre>
+
+    <p class="fragment">There are at least <i>certain.count()</i> records.</p>
+    <p class="fragment">There are at most <i>possible.count()</i> records.</p>
+    <p class="fragment">The output is also incomplete</p>
+    <p class="fragment">(but in a different way)</p>
+  </section>
+
+</section>
+
+<section>
+  
+  <section>
+    <h3>Presenting Incompleteness</h3>
+
+    <p><b>Database: </b> This nanostructure is possibly a result.</p>
+    <p><b>You: </b> Why isn't it certain?</p>
+  </section>
+
+  <section>
+    <p>Incompleteness as a provenance problem <br>(aka lineage, pedigree)</p>
+      <attribution>Yang et al., "Lenses: An On-Demand Approach to ETL"</attribution>
+  </section>
+  
+  <section>
+    <h3>Caveats</h3>
+
+    <p>Annotate incomplete values with a small note</p>
+
+    <p class="fragment"><b>Challenge: </b> Propagating annotations through queries.</p>
+  </section>
+    
+
+  <section>
+    <p>Filter = Uncertain $\rightarrow$ The annotation propagates to the row.</p>
+
+    <p class="fragment"><tt>df.count()</tt> $\rightarrow$ Row annotations propagate back to the value.</p>
+
+    <div class="fragment">
+      <p>We have caveats working with Apache Spark,<br/><i>with negligible overhead</i>.</p>
+      <attribution>Brachmann et al., "Your notebook is not crumby enough, REPLace it"</attribution>
+      <attribution>Feng et al., "Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers"</attribution>
+    </div>
+  </section>
+
+  <section>
+    <!-- <img src="https://odin.cse.buffalo.edu/assets/logos/mimir_logo_final.png" height="100px"><br> -->
+    <img src="graphics/2022-02-16/caveat_dataframe.png"><br>
+    (<a href="https://mimirdb.info">https://mimirdb.info</a> | <a href="https://vizierdb.info">https://vizierdb.info</a>)
+  </section>
+</section>
+
+<section>
+  
+  <section>
+    <h2>Incompleteness-Aware ML</h2>
+
+    <p class="fragment">... a work in progress</p>
+
+    <p class="fragment">... and initially focused on explainable models like Bayes Nets</p>
+  </section>
+
+  <section>
+    <h3>Bayes Net</h3>
+
+    <p>Fitting a model involves repeatedly computing:</p>
+    <p>
+      $P[A | B, C]$ <span class="fragment">$= \frac{\sum_{D, E, \ldots} P[A, B, C, D, E, \ldots]}{\sum_{A, D, E, \ldots} P[A, B, C, D, E, \ldots]}$</span>
+    </p>
+
+    <pre class="fragment"><code class="python">
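+    # df: one row per observation, with columns A, B, C, ...
+    counts = df.groupby(["A", "B", "C"]).size().rename("count").reset_index()
+    # Normalize by the (B, C) totals so that each row holds P[A | B, C]
+    by_bc = counts.groupby(["B", "C"])["count"].transform("sum")
+    counts["prob"] = counts["count"] / by_bc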
+    </code></pre>
+  </section>
+
+  <section>
+    <h3>Bayes Net</h3>
+    <p>Fitting graphical models is just filtering &amp; counting<sup class="fragment">🤞</sup></p>
+
+    <p class="fragment">30 years of work on incomplete databases come "for free"</p>
+  </section>
+
+  <section>
+    <h3>Incomplete Bayes Nets</h3>
+
+    <ul>
+      <li class="fragment">"$P[A < 23 | B = 2] \in [0.7, 0.99]$"</li>
+      <li class="fragment">"The estimate is inaccurate because 100 records are missing attribute C in the following 3 datasets."</li>
+      <li class="fragment">"Improve precision by at least 50% by running simulation C on the following three materials."</li>
+    </ul>
+  </section>
+
+</section>
+
+<section>
+  <h3>Open Questions...</h3>
+
+  <ul>
+    <li>Measuring "unknown unknowns"</li>
+    <li>Combining models of incompleteness</li>
+    <li>Presenting/summarizing result incompleteness</li>
+    <li class="fragment">... and your questions too 😀</li>
+  </ul>
+
+</section>
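
The "Certain, Possible Answers", "3-Valued Boolean Logic", and "Summary Statistics" slides can be tried out with a minimal sketch in plain pandas. This is not the Mimir/Vizier caveat implementation; it only assumes that missing values are stored in pandas' nullable dtypes, whose `&` operator follows the same Kleene truth table as the AND slide. The column names come from the slides; the four records are made up for illustration.

```python
import pandas as pd

# Hypothetical records: column names are from the slides, values are invented.
df = pd.DataFrame({
    "NORMALIZED_INTERFERENCE": pd.array([5.0, None, None, 12.0], dtype="Float64"),
    "ABS_wf_D":                pd.array([3.1, 3.1, 1.0, 3.1], dtype="Float64"),
})

# Comparing a missing value yields <NA> ("Unknown"); `&` on the nullable
# boolean dtype then applies three-valued AND (Unknown AND False is False).
matches = (df["NORMALIZED_INTERFERENCE"] < 10) & (df["ABS_wf_D"] > 2.8)

certain = df[matches.fillna(False)]   # Filter = True
possible = df[matches.fillna(True)]   # Filter in {True, Unknown}

# result.count() is bounded below by the certain rows and above by the possible rows.
count_bounds = (len(certain), len(possible))
```

Row 1 has `ABS_wf_D > 2.8` but no `NORMALIZED_INTERFERENCE`, so it is only a possible match; row 2 fails the `ABS_wf_D` test, so it is excluded even though its other attribute is missing.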
--- a/src/talks/graphics/2022-02-16/caveat_dataframe.png
+++ b/src/talks/graphics/2022-02-16/caveat_dataframe.png
--- a/src/talks/graphics/2022-02-16/microstructure_filter.png
+++ b/src/talks/graphics/2022-02-16/microstructure_filter.png
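
The "Incomplete Bayes Nets" slide reports a conditional probability as an interval, e.g. $P[A < 23 | B = 2] \in [0.7, 0.99]$. One naive way to get such an interval from filtering and counting is sketched below. It is an illustration rather than the method of the cited papers, and it assumes that every missing value can be filled in independently and that at least one row certainly satisfies the condition. The helper name `conditional_bounds` and the sample data are hypothetical.

```python
import pandas as pd

def conditional_bounds(event, cond):
    """Bound P[event | cond] over all ways of resolving Unknown (<NA>) entries.

    `event` and `cond` are nullable-boolean Series (True / False / <NA>).
    """
    hit = cond & event     # row counts toward numerator and denominator
    miss = cond & ~event   # row counts toward the denominator only
    certain_hit, possible_hit = hit.fillna(False).sum(), hit.fillna(True).sum()
    certain_miss, possible_miss = miss.fillna(False).sum(), miss.fillna(True).sum()
    # Lowest ratio: keep only certain hits, admit every possible miss.
    lower = certain_hit / (certain_hit + possible_miss)
    # Highest ratio: admit every possible hit, keep only certain misses.
    upper = possible_hit / (possible_hit + certain_miss)
    return float(lower), float(upper)

# Hypothetical data with missing entries in both attributes.
df = pd.DataFrame({
    "A": pd.array([10, 30, None, 15, None, 40], dtype="Int64"),
    "B": pd.array([2, 2, 2, None, None, 3], dtype="Int64"),
})

lower, upper = conditional_bounds(df["A"] < 23, df["B"] == 2)
```

Dividing the certain count by the possible count directly would be too loose, because the same unknown row appears in both the numerator and the denominator; splitting rows into hits and misses keeps the two consistent.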