---
template: templates/talk_slides_v1.erb
title: "Caveatting your data"
---
<section>
  <h2>Caveatting your data</h2>
  <h3>Adding explainability to incomplete datasets</h3>
  <h4 style="margin-top: 20px;">Oliver Kennedy</h4>
  <h4>University at Buffalo</h4>
  <p style="font-size: 60%;">Based on joint work with Olga Wodo, Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...</p>
</section>

<section>
  <section>
    <h2>Story Time!</h2>
    <p class="fragment">Alice wants to analyze two unaligned time series.</p>
  </section>

  <section>
    <table style="font-size: 60%; display: inline; padding: 50px;">
      <tr><th>Time</th><th>Reading</th></tr>
      <tr><td>1575731001</td><td>0</td></tr>
      <tr><td>1575731014</td><td>0</td></tr>
      <tr><td>1575731030</td><td>0</td></tr>
      <tr><td>1575731035</td><td>0</td></tr>
      <tr><td colspan="2">...</td></tr>
      <tr><td>1575731219</td><td>1</td></tr>
      <tr><td>1575731229</td><td>1</td></tr>
      <tr><td>1575731240</td><td>1</td></tr>
    </table>
    <table style="font-size: 60%; display: inline; padding: 50px;">
      <tr><th>Time</th><th>Reading</th></tr>
      <tr><td>1575731011</td><td>0</td></tr>
      <tr><td>1575731020</td><td>0</td></tr>
      <tr><td>1575731031</td><td>0</td></tr>
      <tr><td>1575731039</td><td>0</td></tr>
      <tr><td colspan="2">...</td></tr>
      <tr><td>1575731218</td><td>1</td></tr>
      <tr><td>1575731228</td><td>1</td></tr>
      <tr><td>1575731237</td><td>1</td></tr>
    </table>
    <p class="fragment">Step 1: Line up the readings</p>
  </section>

  <section>
    <h3>Option 1: Do it right</h3>
  </section>

  <section>
    <img src="graphics/2022-02-16/timeseries.png" height="500px" style="display: inline; vertical-align: middle;" />
    <table style="display: inline; vertical-align: middle;">
      <tr>
        <td style="padding-bottom: 50px;" class="fragment">
          Lots of active research efforts!
        </td>
      </tr>
      <tr>
        <td style="padding-top: 50px;" class="fragment">
          ... but Alice is trying to GSD!
        </td>
      </tr>
    </table>
  </section>

  <section>
    <h3>Alice's Observations</h3>

    <ul>
      <li>Readings every ~10s</li>
      <li>Readings are binary</li>
      <li>Readings are incredibly stable</li>
    </ul>
  </section>

  <section>
    <pre><code class="python">
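      # Bucket timestamps into 10-second windows (integer division)
      df1['bucket'] = df1['timestamp'].map(lambda x : x // 10)
      df2['bucket'] = df2['timestamp'].map(lambda x : x // 10)
      # One reading per bucket, then align the two series on the bucket index
      df1.groupby('bucket').first().join(
        df2.groupby('bucket').first(),
        how='outer',
        lsuffix='_1',
        rsuffix='_2'
      )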
    </code></pre>
    <p class="fragment">Interpolate missing values</p>
    <p class="fragment">Hand tune around the switchover as-needed</p>
  </section>

  <section>
    <p>Time taken: &lt; 30 minutes</p>
  </section>

  <section>
    <img src="graphics/clipart/woman-reading-book-on-beach.svg" height="400px" />
    <attribution>FreeSVG.org</attribution>
  </section>

  <section>
    <h3>Enter Bob...</h3>

    <p class="fragment"><u>Similar</u> analysis...</p>
    <p class="fragment">... <u>different</u> data</p>

    <p class="fragment">Can Bob re-use Alice's prep+analytics workflow?</p>
  </section>

  <section>
    <h3>Maybe?</h3>
    <ul>
      <li class="fragment">Are readings still every ~10s?</li>
      <li class="fragment">Is the data still binary?</li>
      <li class="fragment">Is the data still (relatively) stable?</li>
    </ul>
    <p class="fragment">... and even then, some manual effort is needed!</p>
  </section>

  <section>
    <p>Bob needs to know Alice's assumptions<br/>(and how to use the workflow)</p>
  </section>

  <section>
    <h3>In summary</h3>

    <ul>
      <li>Shortcuts are unavoidable</li>
      <li>... but introduce ambiguity into datasets</li>
      <li>... and processes</li>
    </ul>
  </section>
</section>

<section>
  <section>
    <h3>Act 2</h3>
    <p>Carol gets datasets from Dave, Eve, and Fred</p>
  </section>

  <section>
    <img src="graphics/clipart/female-computer-user.svg" height="100px"/><br/>
    <span class="fragment" data-fragment-index="1">↓</span><br/>
    <img src="graphics/clipart/Binary-file-20110715.svg" height="100px" style="vertical-align: middle;">
    <span class="fragment" data-fragment-index="1">→
      <img src="graphics/clipart/Prismatic-Cloud-Gears-2.svg" height="100px" style="vertical-align: middle;"/>
    </span>
    <span class="fragment" data-fragment-index="2">→
      <img src="graphics/clipart/moneybag.svg" height="100px" style="vertical-align: middle;">
    </span>
    <attribution>FreeSVG.org</attribution>
  </section>

  <section>
    <p>What questions can Carol's model answer?</p>
  </section>

  <section>
    <h3>It depends!</h3>

    <ul>
      <li>Do the multiple datasets share variables?</li>
      <li>Is the entire domain covered?</li>
      <li>... by every variable?</li>
      <li>And more...</li>
    </ul>
  </section>
</section>

<section>
  <section>
    <h3>Incomplete Data</h3>

    <img src="graphics/2022-02-16/survivorship_bias.svg" height="400px">
    <attribution><a href="https://commons.wikimedia.org/wiki/File:Survivorship-bias.svg">Martin Grandjean (vector), McGeddon (picture), Cameron Moll (concept) - Own work</a>; CC BY-SA 4.0</attribution>
  </section>
  <section>
    <h3>Incomplete Data</h3>

    <img src="graphics/2022-02-16/hurricane.png" height="400px">
    <attribution>Doraiswamy et al., "Using Topological Analysis to Support Event-Guided Exploration in Urban Data"</attribution>
  </section>

  <section>
    <h3>Incomplete Coverage</h3>

    <svg data-src="graphics/2022-02-16/small_data.svg"/>
  </section>

  <section>
    <h3>Right now...</h3>

    <ul>
      <li class="fragment">Retain only a complete subset.</li>
      <li class="fragment">Impute the data.</li>
      <li class="fragment">Let the dataset users figure it out.</li>
    </ul>

    <p class="fragment"><b>The Curse of Small Data</b>: Utility vs Trustworthiness.</p>
  </section>

  <section>
    <img src="graphics/2022-02-16/montoya.jpeg" height="400px">
  </section>

</section>
<section>
  <section>
    <h2>Incomplete Data Management</h2>
  </section>

  <section>
    <h3>What is Incomplete Data?</h3>

    <ul>
      <li><tt>n/a</tt>, <tt>None</tt>, <tt>NULL</tt>, etc...</li>
      <li>Records that don't exist.</li>
      <li>The results of an expensive simulation.</li>
      <li>Excel accidentally interpreting one cell as a date.</li>
      <li>1µm resolution images vs 1.2µm resolution images.</li>
    </ul>

    <p class="fragment">Data that is not known precisely.</p>
  </section>

  <section>
    <h3>Data Model</h3>

    <h4 class="fragment">(General Idea)</h4>
  </section>

  <section>
    <ol>
      <li class="fragment">A 'placeholder' value or record. <ul style="font-size: 70%;">
        <li><tt>n/a</tt> or <tt>None</tt></li>
        <li>"The value is probably 3.2"</li>
      </ul></li>
      <li class="fragment">Constraints describing what we do know. <ul style="font-size: 70%;">
        <li>"The value must be between $0$ and $0.7$."</li>
        <li>"There are at most 10 records with values from $0.9$ to $1$."</li>
      </ul></li>
      <li class="fragment">Metadata describing the source of the error. <ul style="font-size: 70%;">
        <li>"The experiment hasn't been run yet."</li>
        <li>"Alignment issues between datasets $A$ and $B$."</li>
      </ul></li>
    </ol>
  </section>

  <section>
    <table style="font-size: 70%">
      <thead>
        <tr>
          <th>locale</th>
          <th>rate</th>
          <th>size</th>
        </tr>
      </thead>
      <tr>
        <td>Los Angeles</td>
        <td>$[3\%,4\%]^1$</td>
        <td>metro</td>
      </tr>
      <tr>
        <td>Austin</td>
        <td>$18\%$</td>
        <td>[city, metro]$^2$</td>
      </tr>
      <tr>
        <td>Houston</td>
        <td>$14\%$</td>
        <td>metro</td>
      </tr>
      <tr>
        <td>Berlin</td>
        <td colspan="2">$1\%$, town &nbsp;<b>or</b>&nbsp; $3\%$, city$^2$</td>
      </tr>
      <tr>
        <td>Sacramento</td>
        <td>$1\%$</td>
        <td><b>null</b></td>
      </tr>
      <tr>
        <td>Springfield</td>
        <td><b>null</b></td>
        <td>town</td>
      </tr>
    </table>
    <p style="font-size: 60%; margin-top: 20px; ">
      $1:$ Conflict between CDC and locality-reported statistics.<br/>
      $2:$ Multiple localities with this name.
    </p>

    <attribution>Feng et al., "Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds"</attribution>
  </section>


</section>

<section>

  <section>
    <h3>Incomplete Databases</h3>

    <p>"Incompleteness" as a first-class database primitive.</p>

    <ul>
      <li class="fragment">Start with a Data Model (e.g., Relations/Data Frames)<ul>
        <li class="fragment">Add a (formal) notion of "Incompleteness"</li>
      </ul></li>
      <li class="fragment">Pick a Query Language (e.g., SQL, Pandas, Spark)<ul>
        <li class="fragment">Define semantics for queries under incompleteness.</li>
        <li class="fragment">Optimize.</li>
      </ul></li>
    </ul>

  </section>

  <section>
    <h3>Using Incomplete Data</h3>

    <ul>
      <li class="fragment">"Certain" and "Possible" answers.</li>
      <li class="fragment">Summary Statistics.</li>
      <li class="fragment">Presenting Incompleteness.</li>
      <li class="fragment">Incompleteness-Aware ML.</li>
    </ul>
  </section>

</section>

<section>

  <section>
    <h3>UMAMI</h3>

    <img src="graphics/2022-02-16/microstructure_filter.png">
  </section>

  <section>
    <h3>Certain, Possible Answers</h3>

    <pre><code class="python">
      result = df[ (df["NORMALIZED_INTERFERENCE"] &lt; 10)
                    &amp; (df["ABS_wf_D"] &gt; 2.8) ]
    </code></pre>
    <p class="fragment">$3.1 > 2.8$ is True</p>
    <p class="fragment">$\text{n/a} < 10$ is Unknown</p>
    <p class="fragment">$(3.1 > 2.8) \wedge (\text{n/a} < 10)$ is ???</p>

  </section>

  <section>
    <p class="fragment">Records meeting both conditions are <i>certain</i> matches.</p>
    <p class="fragment">Records with <tt>ABS_wf_D &le; 2.8</tt> are definitely not matches.</p>
    <p class="fragment">Records with <tt>ABS_wf_D > 2.8</tt> but missing <tt>NORMALIZED_INTERFERENCE</tt> are <i>possible</i> matches.</p>
  </section>

  <section>
    <h3>3-Valued Boolean Logic</h3>

    <p>AND</p>

    <table>
      <tr><td></td><th>True</th><th>Unknown</th><th>False</th></tr>
      <tr><th style="border-bottom: none; border-right: solid 1px black;">True</th><td>True</td><td>Unknown</td><td>False</td></tr>
      <tr><th style="border-bottom: none; border-right: solid 1px black;">Unknown</th><td>Unknown</td><td>Unknown</td><td>False</td></tr>
      <tr><th style="border-bottom: none; border-right: solid 1px black;">False</th><td>False</td><td>False</td><td>False</td></tr>
    </table>

    <p class="fragment">similar truth tables for OR, NOT, etc...</p>
  </section>
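
  <section data-markdown>
    <textarea data-template>
The AND table, along with the OR and NOT tables it alludes to, can be sketched in a few lines of Python. This is a minimal illustration that assumes `None` stands in for Unknown; `and3`, `or3`, and `not3` are made-up names, not part of any library.

```python
# Three-valued (Kleene) logic, with None standing in for Unknown.
# and3 / or3 / not3 are illustrative names, not a library API.

def and3(a, b):
    # False dominates: one False operand settles the result.
    if a is False or b is False:
        return False
    # Otherwise any Unknown operand leaves the result Unknown.
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    # True dominates: one True operand settles the result.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    # Negating Unknown is still Unknown.
    return None if a is None else not a

print(and3(True, None))   # Unknown: depends on the missing value
print(and3(False, None))  # False, whatever the missing value is
```
    </textarea>
  </section>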

  <section>
    <h3>Certain vs Possible</h3>

    <table>
      <thead>
        <tr>
          <th>Result</th>
          <th>Includes...</th>
        </tr>
      </thead>
      <tr>
        <td>Certain</td>
        <td>Filter = True</td>
      </tr>
      <tr>
        <td>Possible</td>
        <td>Filter $\in$ { True, Unknown }</td>
      </tr>
    </table>
  </section>

  <section>
    <h3>Summary Statistics</h3>

    <pre><code class="python">
      result.count()
    </code></pre>

    <p class="fragment">There are at least <i>certain.count()</i> records.</p>
    <p class="fragment">There are at most <i>possible.count()</i> records.</p>
    <p class="fragment">The output is also incomplete</p>
    <p class="fragment">(but in a different way)</p>
  </section>
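
  <section data-markdown>
    <textarea data-template>
A sketch of how the two bounds on `count()` could be computed, assuming missing values are stored as `None`. The sample records and the helper `classify` are invented for illustration; only the column names and thresholds come from the filter discussed earlier.

```python
# Bounds on count() under missing data; None marks a missing value.
# The records below are made-up sample data.
records = [
    {"NORMALIZED_INTERFERENCE": 4.0,  "ABS_wf_D": 3.1},   # both conditions hold
    {"NORMALIZED_INTERFERENCE": None, "ABS_wf_D": 3.1},   # interference missing
    {"NORMALIZED_INTERFERENCE": None, "ABS_wf_D": 1.0},   # second condition fails
    {"NORMALIZED_INTERFERENCE": 12.0, "ABS_wf_D": 3.5},   # first condition fails
]

def classify(row):
    """Return True, False, or None (Unknown) for the filter."""
    ni, d = row["NORMALIZED_INTERFERENCE"], row["ABS_wf_D"]
    conditions = [
        None if ni is None else ni < 10,
        None if d is None else d > 2.8,
    ]
    if any(c is False for c in conditions):
        return False          # one failed condition rules the row out
    if any(c is None for c in conditions):
        return None           # nothing failed, but something is missing
    return True

outcomes = [classify(r) for r in records]
certain = [r for r, o in zip(records, outcomes) if o is True]
possible = [r for r, o in zip(records, outcomes) if o is not False]

# The true count lies somewhere in this closed interval.
print(f"count is between {len(certain)} and {len(possible)}")
```
    </textarea>
  </section>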

</section>

<section>

  <section>
    <h3>Presenting Incompleteness</h3>

    <p><b>Database: </b> This nanostructure is possibly a result.</p>
    <p><b>You: </b> Why isn't it certain?</p>
  </section>

  <section>
    <p>Incompleteness as a provenance problem <br>(aka lineage, pedigree)</p>
    <attribution>Yang et al., "Lenses: An On-Demand Approach to ETL"</attribution>
  </section>

  <section>
    <h3>Caveats</h3>

    <p>Annotate incomplete values with a small note</p>

    <p class="fragment"><b>Challenge: </b> Propagating annotations through queries.</p>
  </section>


  <section>
    <p>Filter = Unknown $\rightarrow$ The annotation propagates to the row.</p>

    <p class="fragment"><tt>df.count()</tt> $\rightarrow$ Row annotations propagate back to the value.</p>

    <div class="fragment">
      <p>We have caveats working with Apache Spark,<br/><i>with negligible overhead</i>.</p>
      <attribution>Brachmann et al., "Your notebook is not crumby enough, REPLace it"</attribution>
      <attribution>Feng et al., "Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers"</attribution>
    </div>
  </section>
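
  <section data-markdown>
    <textarea data-template>
One way to picture the two propagation rules is the toy model below. It is a sketch only, not the Mimir or Vizier implementation: the `Caveated` class and the `filter_rows` and `count` helpers are hypothetical names, and the sample rates loosely echo the locale table shown earlier.

```python
from dataclasses import dataclass

# A toy model of caveat propagation. Caveated, filter_rows and count
# are illustrative names, not the Mimir or Vizier API.

@dataclass
class Caveated:
    value: object           # best-guess value
    notes: tuple = ()       # caveats attached to this value

def filter_rows(rows, column, predicate):
    """Keep rows whose best guess passes; copy value caveats to the row."""
    kept = []
    for row in rows:
        cell = row["cells"][column]
        if predicate(cell.value):
            # A caveated input makes the row's membership caveated too.
            kept.append({"cells": row["cells"],
                         "notes": row["notes"] + cell.notes})
    return kept

def count(rows):
    """Count rows; row-level caveats flow back onto the resulting value."""
    notes = tuple(n for row in rows for n in row["notes"])
    return Caveated(len(rows), notes)

rows = [
    {"cells": {"rate": Caveated(0.03, ("CDC and locality disagree",))}, "notes": ()},
    {"cells": {"rate": Caveated(0.18)}, "notes": ()},
    {"cells": {"rate": Caveated(0.14)}, "notes": ()},
]

result = count(filter_rows(rows, "rate", lambda r: r < 0.15))
print(result.value, result.notes)
```

This toy only tracks caveats on rows that pass the filter; a row excluded on the strength of a caveated guess would also need to taint the count.
    </textarea>
  </section>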

  <section>
    <!-- <img src="https://odin.cse.buffalo.edu/assets/logos/mimir_logo_final.png" height="100px"><br> -->
    <img src="graphics/2022-02-16/caveat_dataframe.png"><br>
    (<a href="https://mimirdb.info">https://mimirdb.info</a> | <a href="https://vizierdb.info">https://vizierdb.info</a>)
  </section>
</section>

<section>

  <section>
    <h2>Incompleteness-Aware ML</h2>

    <p class="fragment">... a work in progress</p>

    <p class="fragment">... and initially focused on explainable models like Bayes Nets</p>
  </section>

  <section>
    <h3>Bayes Net</h3>

    <p>Fitting a model involves repeatedly computing:</p>
    <p>
      $P[A | B, C]$ <span class="fragment">$= \frac{\sum_{D, E, \ldots} P[A, B, C, D, E, \ldots]}{\sum_{A, D, E, \ldots} P[A, B, C, D, E, \ldots]}$</span>
    </p>

    <pre class="fragment"><code class="python">
    # P[A | B, C] = count(A, B, C) / count(B, C)
    counts = df.groupby(["A", "B", "C"]).size()
    by_bc = counts.groupby(level=["B", "C"]).transform("sum")
    prob = counts / by_bc
    </code></pre>
  </section>

  <section>
    <h3>Bayes Net</h3>
    <p>Fitting graphical models is just filtering &amp; counting<sup class="fragment">🤞</sup></p>

    <p class="fragment">30 years of work on incomplete databases come "for free"</p>
  </section>

  <section>
    <h3>Incomplete Bayes Nets</h3>

    <ul>
      <li class="fragment">"$P[A < 23 | B = 2] \in [0.7, 0.99]$"</li>
      <li class="fragment">"The estimate is inaccurate because 100 records are missing attribute C in the following 3 datasets."</li>
      <li class="fragment">"Improve precision by at least 50% by running simulation C on the following three materials."</li>
    </ul>
  </section>
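
  <section data-markdown>
    <textarea data-template>
An interval like the first one above could come from counting under the two extreme completions of the data. The sketch below assumes only `A` can be missing (stored as `None`) and `B` is always known; `conditional_bounds` and the sample records are hypothetical.

```python
# Bounds on P[A = a | B = b] when some values of A are missing (None).
# Assumes B is always known; the sample records are made up.

def conditional_bounds(records, a, b):
    matching_b = [r for r in records if r["B"] == b]
    if not matching_b:
        raise ValueError("no records with the requested value of B")
    yes = sum(1 for r in matching_b if r["A"] == a)
    unknown = sum(1 for r in matching_b if r["A"] is None)
    total = len(matching_b)
    # Lower bound: no missing A equals a. Upper bound: every one does.
    return yes / total, (yes + unknown) / total

records = [
    {"A": 1, "B": 2},
    {"A": 1, "B": 2},
    {"A": 0, "B": 2},
    {"A": None, "B": 2},   # A was never measured
    {"A": 1, "B": 3},
]

low, high = conditional_bounds(records, a=1, b=2)
print(f"P[A = 1 | B = 2] is in [{low}, {high}]")
```

The width of the interval shrinks as missing values of `A` are filled in, which is what makes "which measurement to run next" a question the model can help answer.
    </textarea>
  </section>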

</section>

<section>
  <h3>Open Questions...</h3>

  <ul>
    <li>Measuring "unknown unknowns"</li>
    <li>Combining models of incompleteness</li>
    <li>Presenting/summarizing result incompleteness</li>
    <li class="fragment">... and your questions too 😀</li>
  </ul>

</section>