---
template: templates/cse4562_2019_slides.erb
title: Incomplete and Probabilistic Databases
date: May 1, 2019
textbook: "<a href='https://github.com/UBOdin/mimir/wiki/Concepts-CTables'>PDB Concepts and C-Tables</a>"
dependencies:
- lib/slide_utils.rb
---
<%
require "slide_utils.rb"
%>
<section>
<section>
<img src="graphics/2019-04-31-4or9.png" height="300px" />
</section>
<section>
<img src="graphics/2019-04-31-guacamole.png" class="stretch" />
<attribution><a href="https://www.anishathalye.com/2017/07/25/synthesizing-adversarial-examples/">https://www.anishathalye.com/2017/07/25/synthesizing-adversarial-examples/</a></attribution>
</section>
<section>
<img src="graphics/2019-04-31-catVSdog.jpg" class="stretch" />
<attribution><a href="https://www.pyimagesearch.com/pyimagesearch-gurus/?src=post-deep-learning-libs">Deep Learning Demystified</a></attribution>
</section>
<section>
<h3>What happens when you don't know your data precisely?</h3>
</section>
<section>
<pre><code>
SELECT * FROM Posts WHERE image_class = 'Cat';
</code></pre>
<pre class="fragment"><code>
SELECT COUNT(*) FROM Posts WHERE image_class = 'Cat';
</code></pre>
<pre class="fragment"><code>
SELECT user_id FROM Posts
WHERE image_class = 'Cat'
GROUP BY user_id HAVING COUNT(*) > 10;
</code></pre>
</section>
</section>
<section>
<section>
<h3 class="fragment">Incomplete Databases<br/>↓</h3>
<h3>Probabilistic Databases</h3>
</section>
<section>
<ol>
<li>Representing Incompleteness</li>
<li class="fragment">Querying Incomplete Data</li>
<li class="fragment">Implementing It</li>
</ol>
</section>
<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","1<span class='fragment highlight-current-red' data-fragment-index='1'>4</span>260"]], name: "$R_1$", rowids: true) %>
</td><td class="fragment" data-fragment-index="3">or</td>
<td class="fragment highlight-current-grey" data-fragment-index="2">
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","1<span class='fragment highlight-current-red' data-fragment-index='1'>9</span>260"]], name: "$R_2$", rowids: true) %>
</td></tr></table>
</section>
<section>
<p><b>Incomplete Database</b> ($\mathcal D$): A set of <i>possible worlds</i></p>
<p class="fragment"><b>Possible World</b> ($D \in \mathcal D$): One (of many) database instances</p>
<p class="fragment">(Require all possible worlds to have the same schema)</p>
</section>
<section>
<p>What does it mean to run a query on an incomplete database?</p>
<p class="fragment" data-fragment-index="1"><span class="fragment fade-out" data-fragment-index="2">$Q(\mathcal D) = ?$</span></p>
<p class="fragment" data-fragment-index="2">$Q(\mathcal D) = \{\;Q(D)\;|\;D \in \mathcal D \}$</p>
</section>
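<section data-markdown>
<textarea data-template>
A minimal sketch of this definition, assuming every possible world is small enough to materialize as a list of rows. The names `worlds`, `zip_state`, `q1`, and `query_incomplete` are hypothetical, and `zip_state` is only a stand-in for a zip code lookup table.

```python
# Hypothetical stand-in for a zip code lookup table (values follow the running example).
zip_state = {"10003": "NY", "14260": "NY", "19260": "PA"}

# An incomplete database: one relation instance per possible world.
worlds = [
    [("Alice", "10003"), ("Bob", "14260")],
    [("Alice", "10003"), ("Bob", "19260")],
]

def q1(relation):
    """Names of the people whose zip code is in New York."""
    return frozenset(name for name, zip_code in relation
                     if zip_state.get(zip_code) == "NY")

def query_incomplete(query, possible_worlds):
    """Run the query in every world; the answer is a set of possible results."""
    return {query(world) for world in possible_worlds}

results = query_incomplete(q1, worlds)
# results holds two possible answers: {Alice, Bob} and {Alice}
```

The cost grows with the number of worlds, which is what the rest of the lecture tries to avoid.
</textarea>
</section>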
<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
</td><td>or</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
</td></tr></table>
<p class="fragment" style="font-size: 90%">$$Q_1 = \pi_{Name}\big( \sigma_{state = \texttt{'NY'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
<table class="fragment"><tr>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
<td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$", rowids: true) %>
</td><td style="vertical-align: middle; font-weight: bold;">or</td><td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"]], name: "$Q(R_2)$", rowids: true) %>
</td>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
</tr></table>
<aside class="notes">
19260 is Phoenixville, PA
</aside>
</section>
<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
</td><td>or</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
</td></tr></table>
<p class="fragment" style="font-size: 80%">$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
<table class="fragment"><tr>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
<td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$", rowids: true) %>
</td><td style="vertical-align: middle; font-weight: bold;">or</td><td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_2)$", rowids: true) %>
</td>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
</tr></table>
</section>
<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
</td><td>or</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
</td></tr></table>
<p style="font-size: 80%">$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
<table><tr>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
<td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q(R_1)$ or $Q(R_2)$", rowids: true) %>
</td>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
</tr></table>
</section>
<section>
<img src="graphics/2019-04-31-NormalDB.svg" /><br/>
<hr class="fragment" data-fragment-index="1"/>
<svg data-src="graphics/2019-04-31-IncompleteDB.svg" class="fragment" data-fragment-index="1"/>
</section>
</section>
<section>
<section>
<p><b>Challenge:</b> There can be <u>lots</u> of possible worlds.</p>
</section>
<section>
<p><b>Observation: </b> Possibilities for database creation break down into lots of independent choices.</p>
<p class="fragment"><u>Factorize</u> the database.</p>
</section>
<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "13201"]], name: "$R_1$", rowids: true) %>
</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "18201"]], name: "$R_2$", rowids: true) %>
</td></tr>
<tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "18201"]], name: "$R_3$", rowids: true) %>
</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "13201"]], name: "$R_4$", rowids: true) %>
</td></tr></table>
<p class="fragment">Alice appears in every possible world. <br/>The only differences are Bob's and Carol's zip codes.</p>
</section>
<section>
<h3>List Out Choices</h3>
<ul>
<li>$\texttt{bob}$<span class="fragment" data-fragment-index="1">$ \in \{ 4, 9 \}$</span> (Bob's zip code digit)</li>
<li>$\texttt{carol}$<span class="fragment" data-fragment-index="1">$ \in \{ 3, 8 \}$</span> (Carol's zip code digit)</li>
</ul>
</section>
<% [false, true].each do |with_annotations| %>
<section>
<%= data_table(
["Name", "ZipCode"],
[ ["Alice", "10003"],
["Bob","14260"],
["Bob","19260"],
["Carol","13201"],
["Carol","18201"]
],
name: "$\\mathcal R$",
rowids: true,
annotations: if with_annotations then [
"always",
"if $\\texttt{bob} = 4$",
"if $\\texttt{bob} = 9$",
"if $\\texttt{carol} = 3$",
"if $\\texttt{carol} = 8$"
] else nil end
) %>
<div class="fragment">
<div style="font-size: 200%">+</div>
<p>$\big[\;\texttt{bob} \in \{4, 9\},\; \texttt{carol} \in \{3, 8\}\;\big]$</p>
</div>
</section>
<% end %>
<section>
<%= data_table(
["Name", "ZipCode"],
[ ["Alice", "10003"],
["Bob","14260"],
["Bob","19260"],
["Carol","13201"],
["Carol","18201"]
],
name: "$\\mathcal R$",
rowids: true,
annotations: [
"a",
"b",
"c",
"d",
"e"
]
) %>
<div style="font-size: 200%">+</div>
<p>Pick one of each: $\big[\;\{a\},\; \{b, c\},\; \{d, e\}\;\big]$</p>
<p>Set those variables to $T$ and all others to $F$</p>
</section>
<section>
<p>$R_1 \equiv \big[a \rightarrow T, b \rightarrow T, d \rightarrow T, * \rightarrow F\big]$</p>
<%= data_table(
["Name", "ZipCode"],
[ ["Alice", "10003"],
["Bob","14260"],
["Bob","19260"],
["Carol","13201"],
["Carol","18201"]
],
name: "$\\mathcal R$",
rowids: true,
annotations: [
"T (a)",
"T (b)",
"F (c)",
"T (d)",
"F (e)"
]
) %>
</section>
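<section data-markdown>
<textarea data-template>
A minimal sketch of how one assignment of the choice variables selects a possible world from the factorized table. The encoding is an assumption made for illustration: each row carries either `None` (always present) or a `(variable, value)` condition, and `instantiate` and `all_worlds` are hypothetical helper names.

```python
from itertools import product

# Each row carries a condition: None means "always", otherwise (variable, value).
c_table = [
    (("Alice", "10003"), None),
    (("Bob", "14260"), ("bob", 4)),
    (("Bob", "19260"), ("bob", 9)),
    (("Carol", "13201"), ("carol", 3)),
    (("Carol", "18201"), ("carol", 8)),
]
domains = {"bob": [4, 9], "carol": [3, 8]}

def instantiate(table, assignment):
    """Keep the rows whose condition holds under one variable assignment."""
    return [row for row, cond in table
            if cond is None or assignment[cond[0]] == cond[1]]

def all_worlds(table, domains):
    """Yield one possible world per combination of independent choices."""
    names = sorted(domains)
    for values in product(*(domains[n] for n in names)):
        yield instantiate(table, dict(zip(names, values)))

r1 = instantiate(c_table, {"bob": 4, "carol": 3})
world_count = len(list(all_worlds(c_table, domains)))
```

Five annotated rows and two variables stand in for four explicit worlds here; each additional binary choice adds rows to the table but doubles the number of worlds.
</textarea>
</section>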
<section>
<p>Use provenance as before...</p>
<p class="fragment">... but what about aggregates?</p>
</section>
<section>
<pre><code>
SELECT COUNT(*)
FROM R NATURAL JOIN ZipCodeLookup
WHERE State = 'NY'
</code></pre>
<p style="font-size: 70%" class="fragment">
$$= \begin{cases}
1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\
2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\
3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3
\end{cases}$$</p>
<p class="fragment"><b>Problem: </b> A combinatorial explosion of possibilities</p>
</section>
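<section data-markdown>
<textarea data-template>
The case analysis above can be reproduced by brute force. This is only a sketch: `ny_zips` is an assumed stand-in for the lookup table, and the loop visits every combination of choices, which is exactly the explosion the slide warns about.

```python
from collections import defaultdict
from itertools import product

# Assumed: the zip codes from the running example that are in New York.
ny_zips = {"10003", "14260", "13201"}

# Zip code implied by each value of the two choice variables.
bob_zip = {4: "14260", 9: "19260"}
carol_zip = {3: "13201", 8: "18201"}

# Group the variable assignments by the COUNT(*) they produce.
count_to_choices = defaultdict(list)
for bob, carol in product(bob_zip, carol_zip):
    rows = ["10003", bob_zip[bob], carol_zip[carol]]  # Alice, Bob, Carol
    count = sum(1 for zip_code in rows if zip_code in ny_zips)
    count_to_choices[count].append((bob, carol))
```

With n independent binary choices the loop body runs 2^n times.
</textarea>
</section>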
<section>
<p><b>Idea: </b> Simplify the problem</p>
<ol>
<li class="fragment">Is a particular tuple <i>Possible</i>?</li>
<li class="fragment">Is a particular tuple <i>Certain</i>?</li>
</ol>
</section>
<section>
<dl>
<div class="fragment">
<dt>Certain Tuple</dt>
<dd>A tuple that appears in all possible worlds</dd>
<dd class="fragment">$\forall D \in \mathcal D : t \in D$</dd>
</div>
<div class="fragment">
<dt>Possible Tuple</dt>
<dd>A tuple that appears in at least one possible world</dd>
<dd class="fragment">$\exists D \in \mathcal D : t \in D$</dd>
</div>
</dl>
</section>
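<section data-markdown>
<textarea data-template>
When the possible worlds are available explicitly, the two definitions reduce to set intersection and set union. A minimal sketch, assuming at least one world and hypothetical helper names `certain_tuples` and `possible_tuples`:

```python
# Two possible worlds, each a set of (name, zip code) tuples.
worlds = [
    {("Alice", "10003"), ("Bob", "14260")},
    {("Alice", "10003"), ("Bob", "19260")},
]

def certain_tuples(possible_worlds):
    """Tuples that appear in every possible world."""
    return set.intersection(*possible_worlds)

def possible_tuples(possible_worlds):
    """Tuples that appear in at least one possible world."""
    return set.union(*possible_worlds)

certain = certain_tuples(worlds)
possible = possible_tuples(worlds)
```

Here only Alice's tuple is certain, while both of Bob's tuples are possible.
</textarea>
</section>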
<section>
<h3>Non-aggregate queries</h3>
<dl>
<dt>Is a tuple Certain?</dt>
<dd class="fragment">Is the provenance polynomial a tautology?</dd>
<dt>Is a tuple Possible?</dt>
<dd class="fragment">Is the provenance polynomial satisfiable (i.e., <i>not</i> a contradiction)?</dd>
</dl>
<p class="fragment">Pick your favorite SAT solver, plug in and go</p>
</section>
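<section data-markdown>
<textarea data-template>
A brute-force stand-in for the solver, useful only for tiny examples. It assumes a tuple's condition is given as a Python function over a variable assignment; `is_certain` and `is_possible` are hypothetical names, and a real system would hand the formula to a SAT solver rather than enumerate assignments.

```python
from itertools import product

domains = {"bob": [4, 9], "carol": [3, 8]}

def assignments(domains):
    """Enumerate every combination of values for the choice variables."""
    names = sorted(domains)
    for values in product(*(domains[n] for n in names)):
        yield dict(zip(names, values))

def is_certain(condition, domains):
    """Tautology check: the condition holds under every assignment."""
    return all(condition(a) for a in assignments(domains))

def is_possible(condition, domains):
    """Satisfiability check: the condition holds under some assignment."""
    return any(condition(a) for a in assignments(domains))

alice = lambda a: True                               # unconditional row
bob_in_ny = lambda a: a["bob"] == 4                  # holds only if bob = 4
nobody = lambda a: a["bob"] == 4 and a["bob"] == 9   # contradictory condition
```

Under these definitions `alice` is certain, `bob_in_ny` is possible but not certain, and `nobody` is not even possible.
</textarea>
</section>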
<section>
<h3>Aggregate queries</h3>
<p style="margin-top: 50px; margin-bottom: 50px;">
As before, factorize the possible outcomes
</p>
<p class="fragment">
$$1 + \{\;1\;\textbf{if}\;\texttt{bob} = 4\;\} + \{\;1\;\textbf{if}\;\texttt{carol} = 3\;\}$$
</p>
<p style="margin-top: 50px;" class="fragment">
Not bigger than the aggregate input...
</p>
<p class="fragment">
...but at least it only reduces to bin-packing <br/>(or a similarly NP-hard problem).
</p>
</section>
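<section data-markdown>
<textarea data-template>
A sketch of the factorized count as a list of conditional terms, one per input row. The encoding and the names `evaluate` and `count_bounds` are assumptions for illustration. The bounds shortcut is valid only when every term is non-negative, each condition uses its own variable, and each condition can be false.

```python
# Each term is (value, condition); a condition of None means the term always counts.
terms = [
    (1, None),           # Alice
    (1, ("bob", 4)),     # Bob, only if bob = 4
    (1, ("carol", 3)),   # Carol, only if carol = 3
]

def evaluate(terms, assignment):
    """Value of the factorized sum in the world picked by one assignment."""
    return sum(value for value, cond in terms
               if cond is None or assignment[cond[0]] == cond[1])

def count_bounds(terms):
    """Smallest and largest sum, under the independence assumptions above."""
    low = sum(value for value, cond in terms if cond is None)
    high = sum(value for value, _ in terms)
    return low, high
```

For example, `evaluate(terms, {"bob": 4, "carol": 8})` gives 2, and `count_bounds(terms)` gives `(1, 3)` without enumerating any worlds. Questions such as "can the count be exactly k?" do not get this shortcut in general.
</textarea>
</section>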
<section>
<p>In short, incomplete databases are limited, but have some uses.</p>
<p class="fragment">What about probabilities?</p>
</section>
</section>