---
template: templates/cse4562_2019_slides.erb
title: Incomplete and Probabilistic Databases
date: May 1, 2019
textbook: "<a href='https://github.com/UBOdin/mimir/wiki/Concepts-CTables'>PDB Concepts and C-Tables</a>"
dependencies:
  - lib/slide_utils.rb
---
<%
  require "slide_utils.rb"
%>
<section>
  <section>
    <img src="graphics/2019-04-31-4or9.png" height="300px" />
  </section>

  <section>
    <img src="graphics/2019-04-31-guacamole.png" class="stretch" />
    <attribution><a href="https://www.anishathalye.com/2017/07/25/synthesizing-adversarial-examples/">https://www.anishathalye.com/2017/07/25/synthesizing-adversarial-examples/</a></attribution>
  </section>

  <section>
    <img src="graphics/2019-04-31-catVSdog.jpg" class="stretch" />
    <attribution><a href="https://www.pyimagesearch.com/pyimagesearch-gurus/?src=post-deep-learning-libs">Deep Learning Demystified</a></attribution>
  </section>

  <section>
    <h3>What happens when you don't know your data precisely?</h3>
  </section>

  <section>
    <pre><code>
      SELECT * FROM Posts WHERE image_class = 'Cat';
    </code></pre>
    <pre class="fragment"><code>
      SELECT COUNT(*) FROM Posts WHERE image_class = 'Cat';
    </code></pre>
    <pre class="fragment"><code>
      SELECT user_id FROM Posts
      WHERE image_class = 'Cat'
      GROUP BY user_id HAVING COUNT(*) > 10;
    </code></pre>
  </section>
</section>

<section>
  <section>
    <h3 class="fragment">Incomplete Databases<br/>↓</h3>
    <h3>Probabilistic Databases</h3>
  </section>

  <section>
    <ol>
      <li>Representing Incompleteness</li>
      <li class="fragment">Querying Incomplete Data</li>
      <li class="fragment">Implementing It</li>
    </ol>
  </section>

  <section>
    <table><tr><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","1<span class='fragment highlight-current-red' data-fragment-index='1'>4</span>260"]], name: "$R_1$", rowids: true) %>
    </td><td class="fragment" data-fragment-index="3">or</td>
    <td class="fragment highlight-current-grey" data-fragment-index="2">
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","1<span class='fragment highlight-current-red' data-fragment-index='1'>9</span>260"]], name: "$R_2$", rowids: true) %>
    </td></tr></table>
  </section>

  <section>
    <p><b>Incomplete Database</b> ($\mathcal D$): A set of <i>possible worlds</i></p>
    <p class="fragment"><b>Possible World</b> ($D \in \mathcal D$): One (of many) database instances</p>
    <p class="fragment">(Require all possible worlds to have the same schema)</p>
  </section>

  <section>
    <p>What does it mean to run a query on an incomplete database?</p>
    <p class="fragment" data-fragment-index="1"><span class="fragment fade-out" data-fragment-index="2">$Q(\mathcal D) = ?$</span></p>
    <p class="fragment" data-fragment-index="2">$Q(\mathcal D) = \{\;Q(D)\;|\;D \in \mathcal D \}$</p>
  </section>

  <section>
    <table><tr><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
    </td><td>or</td><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
    </td></tr></table>
    <p class="fragment" style="font-size: 90%">$$Q_1 = \pi_{Name}\big( \sigma_{state = \texttt{'NY'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
    <table class="fragment"><tr>
      <td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
      <td style="vertical-align: middle;">
        <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$", rowids: true) %>
      </td><td style="vertical-align: middle; font-weight: bold;">or</td><td style="vertical-align: middle;">
        <%= data_table(["Name"], [["Alice"]], name: "$Q(R_2)$", rowids: true) %>
      </td>
      <td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
    </tr></table>
    <aside class="notes">
      19260 is Phoenixville, PA
    </aside>
  </section>

  <section>
    <table><tr><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
    </td><td>or</td><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
    </td></tr></table>
    <p class="fragment" style="font-size: 80%">$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
    <table class="fragment"><tr>
      <td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
      <td style="vertical-align: middle;">
        <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$", rowids: true) %>
      </td><td style="vertical-align: middle; font-weight: bold;">or</td><td style="vertical-align: middle;">
        <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_2)$", rowids: true) %>
      </td>
      <td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
    </tr></table>
  </section>

  <section>
    <table><tr><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
    </td><td>or</td><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
    </td></tr></table>
    <p style="font-size: 80%">$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
    <table><tr>
      <td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
      <td style="vertical-align: middle;">
        <%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$ or $Q(R_2)$", rowids: true) %>
      </td>
      <td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
    </tr></table>
  </section>
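  <section data-markdown>
<textarea data-template>
A minimal Python sketch of the possible-worlds semantics from the preceding slides: run the query once per world and keep the distinct results. The `ZIP_LOOKUPS` contents and the helper names (`q1`, `q2`, `query_incomplete`) are assumptions made up for illustration; the states and regions are chosen only to reproduce the outcomes shown on the slides.

```python
# Illustrative lookup table (an assumption for this sketch): zip -> (state, region).
ZIP_LOOKUPS = {
    "10003": ("NY", "Northeast"),
    "14260": ("NY", "Northeast"),
    "19260": ("PA", "Northeast"),
}

# The two possible worlds R_1 and R_2, each a set of (name, zip) tuples.
R1 = frozenset({("Alice", "10003"), ("Bob", "14260")})
R2 = frozenset({("Alice", "10003"), ("Bob", "19260")})
INCOMPLETE_DB = [R1, R2]


def q1(world):
    """Names whose zip code maps to state 'NY' in one world."""
    return frozenset(name for name, zipcode in world
                     if ZIP_LOOKUPS[zipcode][0] == "NY")


def q2(world):
    """Names whose zip code maps to region 'Northeast' in one world."""
    return frozenset(name for name, zipcode in world
                     if ZIP_LOOKUPS[zipcode][1] == "Northeast")


def query_incomplete(query, worlds):
    """Evaluate the query in every world and collect the distinct results."""
    return {query(world) for world in worlds}


q1_results = query_incomplete(q1, INCOMPLETE_DB)
q2_results = query_incomplete(q2, INCOMPLETE_DB)
print(len(q1_results))  # 2: the worlds disagree about Bob
print(len(q2_results))  # 1: both worlds give the same answer
```

Because results are collected into a set, worlds that agree on the answer collapse into a single result, as on the last slide.
</textarea>
  </section>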

  <section>
    <img src="graphics/2019-04-31-NormalDB.svg" /><br/>
    <hr class="fragment" data-fragment-index="1"/>
    <svg data-src="graphics/2019-04-31-IncompleteDB.svg" class="fragment" data-fragment-index="1"/>
  </section>
</section>

<section>
  <section>
    <p><b>Challenge:</b> There can be <u>lots</u> of possible worlds.</p>
  </section>

  <section>
    <p><b>Observation:</b> The set of possible worlds breaks down into many independent choices.</p>

    <p class="fragment"><u>Factorize</u> the database.</p>
  </section>

  <section>
    <table><tr><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "13201"]], name: "$R_1$", rowids: true) %>
    </td><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "18201"]], name: "$R_2$", rowids: true) %>
    </td></tr>
    <tr><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "18201"]], name: "$R_3$", rowids: true) %>
    </td><td>
      <%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "13201"]], name: "$R_4$", rowids: true) %>
    </td></tr></table>
    <p class="fragment">Alice appears in all four databases. <br/>The only differences are Bob and Carol's zip codes.</p>
  </section>

  <section>
    <h3>List Out Choices</h3>

    <ul>
      <li>$\texttt{bob}$<span class="fragment" data-fragment-index="1">$ \in \{ 4, 9 \}$</span> (Bob's zip code digit)</li>
      <li>$\texttt{carol}$<span class="fragment" data-fragment-index="1">$ \in \{ 3, 8 \}$</span> (Carol's zip code digit)</li>
    </ul>
  </section>

  <% [false, true].each do |with_annotations| %>
  <section>
      <%= data_table(
        ["Name", "ZipCode"],
        [ ["Alice", "10003"],
          ["Bob","14260"],
          ["Bob","19260"],
          ["Carol","13201"],
          ["Carol","18201"]
        ],
        name: "$\\mathcal R$",
        rowids: true,
        annotations: if with_annotations then [
          "always",
          "if $\\texttt{bob} = 4$",
          "if $\\texttt{bob} = 9$",
          "if $\\texttt{carol} = 3$",
          "if $\\texttt{carol} = 8$"
        ] else nil end
      ) %>
      <div class="fragment">
        <div style="font-size: 200%">+</div>
        <p>$\big[\;\texttt{bob} \in \{4, 9\},\; \texttt{carol} \in \{3, 8\}\;\big]$</p>
      </div>
  </section>
  <% end %>
  <section>
    <%= data_table(
      ["Name", "ZipCode"],
      [ ["Alice", "10003"],
        ["Bob","14260"],
        ["Bob","19260"],
        ["Carol","13201"],
        ["Carol","18201"]
      ],
      name: "$\\mathcal R$",
      rowids: true,
      annotations: [
        "a",
        "b",
        "c",
        "d",
        "e"
      ]
    ) %>
    <div style="font-size: 200%">+</div>
    <p>Pick one of each: $\big[\;\{a\},\; \{b, c\},\; \{d, e\}\;\big]$</p>
    <p>Set those variables to $T$ and all others to $F$</p>
  </section>

  <section>
    <p>$R_1 \equiv \big[a \rightarrow T, b \rightarrow T, d \rightarrow T, * \rightarrow F\big]$</p>
    <%= data_table(
      ["Name", "ZipCode"],
      [ ["Alice", "10003"],
        ["Bob","14260"],
        ["Bob","19260"],
        ["Carol","13201"],
        ["Carol","18201"]
      ],
      name: "$\\mathcal R$",
      rowids: true,
      annotations: [
        "T (a)",
        "T (b)",
        "F (c)",
        "T (d)",
        "F (e)"
      ]
    ) %>
  </section>
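  <section data-markdown>
<textarea data-template>
A rough Python sketch of the factorized representation: one annotated table plus a list of independent choices stands in for every possible world. The encoding of conditions as `(variable, value)` pairs and the names `CHOICES`, `ANNOTATED_R`, `instantiate`, and `possible_worlds` are assumptions for illustration, not the deck's notation.

```python
from itertools import product

# Independent choices: each variable takes exactly one value from its domain.
CHOICES = {"bob": [4, 9], "carol": [3, 8]}

# Each row carries a condition: None means "always", else (variable, value).
ANNOTATED_R = [
    (("Alice", "10003"), None),
    (("Bob", "14260"), ("bob", 4)),
    (("Bob", "19260"), ("bob", 9)),
    (("Carol", "13201"), ("carol", 3)),
    (("Carol", "18201"), ("carol", 8)),
]


def instantiate(assignment):
    """Keep the rows whose condition holds under one assignment."""
    return frozenset(row for row, cond in ANNOTATED_R
                     if cond is None or assignment[cond[0]] == cond[1])


def possible_worlds():
    """Yield (assignment, world) for every combination of choices."""
    names = list(CHOICES)
    for values in product(*(CHOICES[n] for n in names)):
        assignment = dict(zip(names, values))
        yield assignment, instantiate(assignment)


worlds = [world for _, world in possible_worlds()]
print(len(worlds))  # 4 worlds from 5 annotated rows and 2 binary choices
```

The table grows with the number of alternatives per choice, while the number of worlds grows with their product.
</textarea>
  </section>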

  <section>
    <p>Use provenance as before...</p>
    <p class="fragment">... but what about aggregates?</p>
  </section>

  <section>
    <pre><code>
                SELECT COUNT(*)
                FROM R NATURAL JOIN ZipCodeLookup
                WHERE State = 'NY'
    </code></pre>
    <p style="font-size: 70%" class="fragment">
    $$= \begin{cases}
      1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\
      2 & \textbf{if } (\texttt{bob} = 4 \wedge \texttt{carol} = 8) \\&\; \vee\; (\texttt{bob} = 9 \wedge \texttt{carol} = 3)\\
      3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3
      \end{cases}$$</p>
    <p class="fragment"><b>Problem: </b> A combinatorial explosion of possibilities</p>
  </section>

  <section>
    <p><b>Idea: </b> Simplify the problem</p>
    <ol>
      <li class="fragment">Is a particular tuple <i>Possible</i>?</li>
      <li class="fragment">Is a particular tuple <i>Certain</i>?</li>
    </ol>
  </section>

  <section>
    <dl>
      <div class="fragment">
        <dt>Certain Tuple</dt>
        <dd>A tuple that appears in all possible worlds</dd>
        <dd class="fragment">$\forall D \in \mathcal D : t \in D$</dd>
      </div>

      <div class="fragment">
        <dt>Possible Tuple</dt>
        <dd>A tuple that appears in at least one possible world</dd>
        <dd class="fragment">$\exists D \in \mathcal D : t \in D$</dd>
      </div>
    </dl>
  </section>

  <section>
    <h3>Non-aggregate queries</h3>
    <dl>
      <dt>Is a tuple Certain?</dt>
      <dd class="fragment">Is the provenance polynomial a tautology?</dd>

      <dt>Is a tuple Possible?</dt>
      <dd class="fragment">Is the provenance polynomial satisfiable (i.e., not a contradiction)?</dd>
    </dl>
    <p class="fragment">Pick your favorite SAT solver, plug in and go</p>
  </section>
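  <section data-markdown>
<textarea data-template>
As a hedged illustration of the two checks, the sketch below replaces the SAT solver with brute-force enumeration over the choice variables, which is only practical for tiny examples. Conditions are written as plain Python predicates over an assignment; the names `assignments`, `is_certain`, and `is_possible` are hypothetical.

```python
from itertools import product

# Independent choices: each variable takes exactly one value from its domain.
CHOICES = {"bob": [4, 9], "carol": [3, 8]}


def assignments():
    """Yield every combination of choices as a dict."""
    names = list(CHOICES)
    for values in product(*(CHOICES[n] for n in names)):
        yield dict(zip(names, values))


def is_certain(condition):
    """Certain: the condition holds under every assignment (a tautology)."""
    return all(condition(a) for a in assignments())


def is_possible(condition):
    """Possible: the condition holds under some assignment (satisfiable)."""
    return any(condition(a) for a in assignments())


# Conditions attached to a few result tuples.
alice_in_ny = lambda a: True                            # annotated "always"
bob_in_ny = lambda a: a["bob"] == 4                     # only the 14260 row is in NY
bob_by_name = lambda a: a["bob"] == 4 or a["bob"] == 9  # projecting R onto Name
never = lambda a: a["bob"] == 4 and a["bob"] == 9       # a contradiction

print(is_certain(alice_in_ny), is_possible(bob_in_ny))  # True True
print(is_certain(bob_in_ny), is_possible(never))        # False False
```

A real implementation would hand the same conditions to a solver instead of enumerating every assignment.
</textarea>
  </section>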

  <section>
    <h3>Aggregate queries</h3>

    <p style="margin-top: 50px; margin-bottom: 50px;">
      As before, factorize the possible outcomes
    </p>
    <p class="fragment">
      $$1 + \{\;1\;\textbf{if}\;\texttt{bob} = 4\;\} + \{\;1\;\textbf{if}\;\texttt{carol} = 3\;\}$$
    </p>
    <p style="margin-top: 50px;" class="fragment">
      Not bigger than the aggregate input...
    </p>
    <p class="fragment">
      ...but at least it only reduces to bin-packing <br/>(or a similarly NP-hard problem).
    </p>
  </section>
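  <section data-markdown>
<textarea data-template>
A small Python sketch of the factorized COUNT expression above, evaluated by brute force over the choice variables to recover the possible outcomes and their bounds. The function name `count_ny` and the `outcomes` grouping are assumptions for illustration; enumeration like this does not scale past toy inputs.

```python
from itertools import product

# Independent choices: each variable takes exactly one value from its domain.
CHOICES = {"bob": [4, 9], "carol": [3, 8]}


def count_ny(a):
    """1 for Alice (always counted), plus 1 each if Bob's and Carol's rows qualify."""
    return 1 + (1 if a["bob"] == 4 else 0) + (1 if a["carol"] == 3 else 0)


# Group the assignments by the count they produce.
outcomes = {}
names = list(CHOICES)
for values in product(*(CHOICES[n] for n in names)):
    assignment = dict(zip(names, values))
    outcomes.setdefault(count_ny(assignment), []).append(assignment)

lower, upper = min(outcomes), max(outcomes)
print(sorted(outcomes))  # [1, 2, 3]
print(lower, upper)      # 1 3
```

The grouping matches the case analysis for the COUNT query: one assignment yields 1, two yield 2, and one yields 3.
</textarea>
  </section>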

  <section>
    <p>In short, incomplete databases are limited, but have some uses.</p>
    <p class="fragment">What about probabilities?</p>
  </section>
</section>