---
template: templates/cse4562_2019_slides.erb
title: Incomplete and Probabilistic Databases
date: May 1, 2019
textbook: "<a href='https://github.com/UBOdin/mimir/wiki/Concepts-CTables'>PDB Concepts and C-Tables</a>"
dependencies:
- lib/slide_utils.rb
---
<%
require "slide_utils.rb"
%>
<section>
<section>
<img src="graphics/2019-04-31-4or9.png" height="300px" />
</section>

<section>
<img src="graphics/2019-04-31-guacamole.png" class="stretch" />
<attribution><a href="https://www.anishathalye.com/2017/07/25/synthesizing-adversarial-examples/">https://www.anishathalye.com/2017/07/25/synthesizing-adversarial-examples/</a></attribution>
</section>

<section>
<img src="graphics/2019-04-31-catVSdog.jpg" class="stretch" />
<attribution><a href="https://www.pyimagesearch.com/pyimagesearch-gurus/?src=post-deep-learning-libs">Deep Learning Demystified</a></attribution>
</section>

<section>
<h3>What happens when you don't know your data precisely?</h3>
</section>

<section>
<pre><code>
SELECT * FROM Posts WHERE image_class = 'Cat';
</code></pre>
<pre class="fragment"><code>
SELECT COUNT(*) FROM Posts WHERE image_class = 'Cat';
</code></pre>
<pre class="fragment"><code>
SELECT user_id FROM Posts
WHERE image_class = 'Cat'
GROUP BY user_id HAVING COUNT(*) > 10;
</code></pre>
</section>
</section>

<section>
<section>
<h3 class="fragment">Incomplete Databases<br/>↓</h3>
<h3>Probabilistic Databases</h3>
</section>

<section>
<ol>
<li>Representing Incompleteness</li>
<li class="fragment">Querying Incomplete Data</li>
<li class="fragment">Implementing It</li>
</ol>
</section>

<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","1<span class='fragment highlight-current-red' data-fragment-index='1'>4</span>260"]], name: "$R_1$", rowids: true) %>
</td><td class="fragment" data-fragment-index="3">or</td>
<td class="fragment highlight-current-grey" data-fragment-index="2">
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","1<span class='fragment highlight-current-red' data-fragment-index='1'>9</span>260"]], name: "$R_2$", rowids: true) %>
</td></tr></table>
</section>

<section>
<p><b>Incomplete Database</b> ($\mathcal D$): A set of <i>possible worlds</i></p>
<p class="fragment"><b>Possible World</b> ($D \in \mathcal D$): One (of many) database instances</p>
<p class="fragment">(Require all possible worlds to have the same schema)</p>
</section>

<section>
<p>What does it mean to run a query on an incomplete database?</p>
<p class="fragment" data-fragment-index="1"><span class="fragment fade-out" data-fragment-index="2">$Q(\mathcal D) = ?$</span></p>
<p class="fragment" data-fragment-index="2">$Q(\mathcal D) = \{\;Q(D)\;|\;D \in \mathcal D \}$</p>
</section>

<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
</td><td>or</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
</td></tr></table>
<p class="fragment" style="font-size: 90%">$$Q_1 = \pi_{Name}\big( \sigma_{state = \texttt{'NY'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
<table class="fragment"><tr>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
<td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q_1(R_1)$", rowids: true) %>
</td><td style="vertical-align: middle; font-weight: bold;">or</td><td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"]], name: "$Q_1(R_2)$", rowids: true) %>
</td>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
</tr></table>
<aside class="notes">
19260 is Phoenixville, PA
</aside>
</section>

<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
</td><td>or</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
</td></tr></table>
<p class="fragment" style="font-size: 80%">$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
<table class="fragment"><tr>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
<td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q_2(R_1)$", rowids: true) %>
</td><td style="vertical-align: middle; font-weight: bold;">or</td><td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q_2(R_2)$", rowids: true) %>
</td>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
</tr></table>
</section>

<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"]], name: "$R_1$", rowids: true) %>
</td><td>or</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"]], name: "$R_2$", rowids: true) %>
</td></tr></table>
<p style="font-size: 80%">$$Q_2 = \pi_{Name}\big( \sigma_{region = \texttt{'Northeast'}} (R \bowtie_{zip} ZipLookups) \big)$$</p>
<table><tr>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">{</td>
<td style="vertical-align: middle;">
<%= data_table(["Name"], [["Alice"], ["Bob"]], name: "$Q_2(R_1)$ or $Q_2(R_2)$", rowids: true) %>
</td>
<td style="font-size: 600%; margin: 0px; padding: 0px; height: 0.5em;">}</td>
</tr></table>
</section>
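
<section>
<p>A minimal Python sketch of possible-worlds query evaluation: run the query once per world and collect the distinct results. The <code>ZIP_LOOKUPS</code> table and the helper names are hypothetical; the states and regions are illustrative values chosen to match the results shown on the slides.</p>

```python
# Hypothetical lookup table (illustrative values, not taken from a real dataset).
ZIP_LOOKUPS = {
    "10003": {"state": "NY", "region": "Northeast"},
    "14260": {"state": "NY", "region": "Northeast"},
    "19260": {"state": "PA", "region": "Northeast"},
}

# The incomplete database: one instance of R per possible world.
R1 = [("Alice", "10003"), ("Bob", "14260")]
R2 = [("Alice", "10003"), ("Bob", "19260")]
WORLDS = [R1, R2]


def select_names(world, attribute, value):
    """Project Name after selecting attribute = value on (R join ZipLookups)."""
    return frozenset(
        name for name, zipcode in world
        if ZIP_LOOKUPS[zipcode][attribute] == value
    )


def query_incomplete(worlds, attribute, value):
    """Apply the query to every world and keep the set of distinct results."""
    return {select_names(world, attribute, value) for world in worlds}


q1 = query_incomplete(WORLDS, "state", "NY")
q2 = query_incomplete(WORLDS, "region", "Northeast")
print(len(q1), len(q2))
```

<p>Under these assumptions the state query yields two distinct results, while the region query yields a single result shared by both worlds.</p>
</section>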

<section>
<img src="graphics/2019-04-31-NormalDB.svg" /><br/>
<hr class="fragment" data-fragment-index="1"/>
<svg data-src="graphics/2019-04-31-IncompleteDB.svg" class="fragment" data-fragment-index="1"/>
</section>
</section>

<section>
<section>
<p><b>Challenge:</b> There can be <u>lots</u> of possible worlds.</p>
</section>

<section>
<p><b>Observation: </b> Possibilities for database creation break down into lots of independent choices.</p>

<p class="fragment"><u>Factorize</u> the database.</p>
</section>

<section>
<table><tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "13201"]], name: "$R_1$", rowids: true) %>
</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "13201"]], name: "$R_2$", rowids: true) %>
</td></tr>
<tr><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","14260"], ["Carol", "18201"]], name: "$R_3$", rowids: true) %>
</td><td>
<%= data_table(["Name", "ZipCode"], [["Alice", "10003"], ["Bob","19260"], ["Carol", "18201"]], name: "$R_4$", rowids: true) %>
</td></tr></table>
<p class="fragment">Alice appears in all four databases. <br/>The only differences are Bob's and Carol's zip codes.</p>
</section>

<section>
<h3>List Out Choices</h3>

<ul>
<li>$\texttt{bob}$<span class="fragment" data-fragment-index="1">$ \in \{ 4, 9 \}$</span> (Bob's zip code digit)</li>
<li>$\texttt{carol}$<span class="fragment" data-fragment-index="1">$ \in \{ 3, 8 \}$</span> (Carol's zip code digit)</li>
</ul>
</section>

<% [false, true].each do |with_annotations| %>
<section>
<%= data_table(
  ["Name", "ZipCode"],
  [ ["Alice", "10003"],
    ["Bob","14260"],
    ["Bob","19260"],
    ["Carol","13201"],
    ["Carol","18201"]
  ],
  name: "$\\mathcal R$",
  rowids: true,
  annotations: if with_annotations then [
    "always",
    "if $\\texttt{bob} = 4$",
    "if $\\texttt{bob} = 9$",
    "if $\\texttt{carol} = 3$",
    "if $\\texttt{carol} = 8$"
  ] else nil end
) %>
<div class="fragment">
<div style="font-size: 200%">+</div>
<p>$\big[\;\texttt{bob} \in \{4, 9\},\; \texttt{carol} \in \{3, 8\}\;\big]$</p>
</div>
</section>
<% end %>
<section>
<%= data_table(
  ["Name", "ZipCode"],
  [ ["Alice", "10003"],
    ["Bob","14260"],
    ["Bob","19260"],
    ["Carol","13201"],
    ["Carol","18201"]
  ],
  name: "$\\mathcal R$",
  rowids: true,
  annotations: [
    "a",
    "b",
    "c",
    "d",
    "e"
  ]
) %>
<div style="font-size: 200%">+</div>
<p>Pick one of each: $\big[\;\{a\},\; \{b, c\},\; \{d, e\}\;\big]$</p>
<p>Set those variables to $T$ and all others to $F$</p>
</section>

<section>
<p>$R_1 \equiv \big[a \rightarrow T, b \rightarrow T, d \rightarrow T, * \rightarrow F\big]$</p>
<%= data_table(
  ["Name", "ZipCode"],
  [ ["Alice", "10003"],
    ["Bob","14260"],
    ["Bob","19260"],
    ["Carol","13201"],
    ["Carol","18201"]
  ],
  name: "$\\mathcal R$",
  rowids: true,
  annotations: [
    "T (a)",
    "T (b)",
    "F (c)",
    "T (d)",
    "F (e)"
  ]
) %>
</section>
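
<section>
<p>The factorized encoding can be sketched in Python roughly as follows: one annotated relation plus a list of choice groups, from which each possible world is recovered by picking one variable per group. The names <code>ANNOTATED_R</code>, <code>CHOICES</code>, <code>instantiate</code>, and <code>possible_worlds</code> are hypothetical, introduced only for illustration.</p>

```python
from itertools import product

# Annotated relation: (annotation variable, Name, ZipCode).
ANNOTATED_R = [
    ("a", "Alice", "10003"),
    ("b", "Bob", "14260"),
    ("c", "Bob", "19260"),
    ("d", "Carol", "13201"),
    ("e", "Carol", "18201"),
]

# Pick exactly one variable from each group; every other variable is False.
CHOICES = [["a"], ["b", "c"], ["d", "e"]]


def instantiate(true_vars):
    """Keep only the tuples whose annotation variable is set to True."""
    return [(name, zipcode) for var, name, zipcode in ANNOTATED_R
            if var in true_vars]


def possible_worlds():
    """Yield one world for every way of picking one variable per group."""
    for picked in product(*CHOICES):
        yield instantiate(set(picked))


worlds = list(possible_worlds())
print(len(worlds))
print(worlds[0])
```

<p>Five annotated rows and three choice groups stand in for all four worlds; each additional independent choice adds rows to the table but multiplies the number of worlds.</p>
</section>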

<section>
<p>Use provenance as before...</p>
<p class="fragment">... but what about aggregates?</p>
</section>

<section>
<pre><code>
SELECT COUNT(*)
FROM R NATURAL JOIN ZipCodeLookup
WHERE State = 'NY'
</code></pre>
<p style="font-size: 70%" class="fragment">
$$= \begin{cases}
1 & \textbf{if } \texttt{bob} = 9 \wedge \texttt{carol} = 8\\
2 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 8 \\&\; \vee\; \texttt{bob} = 9 \wedge \texttt{carol} = 3\\
3 & \textbf{if } \texttt{bob} = 4 \wedge \texttt{carol} = 3
\end{cases}$$</p>
<p class="fragment"><b>Problem: </b> A combinatorial explosion of possibilities</p>
</section>
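
<section>
<p>The case analysis above can be reproduced by brute force. This is only a sketch: <code>NY_ZIPS</code> is a hypothetical stand-in for the lookup table, chosen to match the counts on the slide, and the helper names are invented for illustration.</p>

```python
from collections import defaultdict
from itertools import product

# Hypothetical stand-in for the lookup table: zip codes assumed to be in NY.
NY_ZIPS = {"10003", "14260", "13201"}


def world(bob, carol):
    """The relation R under one assignment of the two digit choices."""
    return [
        ("Alice", "10003"),
        ("Bob", f"1{bob}260"),
        ("Carol", f"1{carol}201"),
    ]


def count_ny(relation):
    """Count the rows of R whose zip code is assumed to be in NY."""
    return sum(1 for _, zipcode in relation if zipcode in NY_ZIPS)


# Group the assignments (bob, carol) by the count they produce.
cases = defaultdict(list)
for bob, carol in product([4, 9], [3, 8]):
    cases[count_ny(world(bob, carol))].append((bob, carol))

for count in sorted(cases):
    print(count, cases[count])
```

<p>The loop visits every combination of choices, so its cost doubles with each additional two-valued choice. That is the combinatorial explosion in miniature.</p>
</section>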

<section>
<p><b>Idea: </b> Simplify the problem</p>
<ol>
<li class="fragment">Is a particular tuple <i>Possible</i>?</li>
<li class="fragment">Is a particular tuple <i>Certain</i>?</li>
</ol>
</section>

<section>
<dl>
<div class="fragment">
<dt>Certain Tuple</dt>
<dd>A tuple that appears in all possible worlds</dd>
<dd class="fragment">$\forall D \in \mathcal D : t \in D$</dd>
</div>

<div class="fragment">
<dt>Possible Tuple</dt>
<dd>A tuple that appears in at least one possible world</dd>
<dd class="fragment">$\exists D \in \mathcal D : t \in D$</dd>
</div>
</dl>
</section>
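
<section>
<p>When the possible worlds are listed explicitly, the two definitions reduce to an intersection and a union. A minimal sketch, with hypothetical function names and the two-world example from earlier:</p>

```python
def certain_tuples(worlds):
    """Tuples that appear in every possible world (set intersection)."""
    return set.intersection(*(set(world) for world in worlds))


def possible_tuples(worlds):
    """Tuples that appear in at least one possible world (set union)."""
    return set.union(*(set(world) for world in worlds))


R1 = [("Alice", "10003"), ("Bob", "14260")]
R2 = [("Alice", "10003"), ("Bob", "19260")]

certain = certain_tuples([R1, R2])
possible = possible_tuples([R1, R2])
print(sorted(certain))
print(sorted(possible))
```

<p>Both functions assume at least one world is given, and both need every world materialized, which is exactly what the factorized representation tries to avoid.</p>
</section>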

<section>
<h3>Non-aggregate queries</h3>
<dl>
<dt>Is a tuple Certain?</dt>
<dd class="fragment">Is the provenance polynomial a tautology?</dd>

<dt>Is a tuple Possible?</dt>
<dd class="fragment">Is the provenance polynomial satisfiable (i.e., not a contradiction)?</dd>
</dl>
<p class="fragment">Pick your favorite SAT solver, plug it in, and go</p>
</section>
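
<section>
<p>A small sketch of the two checks, using exhaustive enumeration as a stand-in for a SAT solver (so it only scales to a handful of variables). Conditions are modeled as Python predicates over an assignment; <code>DOMAINS</code>, <code>is_certain</code>, and <code>is_possible</code> are hypothetical names.</p>

```python
from itertools import product

# Domains of the choice variables from the running example.
DOMAINS = {"bob": [4, 9], "carol": [3, 8]}


def assignments(domains):
    """Enumerate every complete assignment of the choice variables."""
    names = list(domains)
    for values in product(*(domains[name] for name in names)):
        yield dict(zip(names, values))


def is_certain(condition, domains=DOMAINS):
    """Certain: the condition holds under every assignment (a tautology)."""
    return all(condition(a) for a in assignments(domains))


def is_possible(condition, domains=DOMAINS):
    """Possible: the condition holds under some assignment (satisfiable)."""
    return any(condition(a) for a in assignments(domains))


# Conditions under which a name appears in the State = 'NY' result.
def alice(a):
    return True


def bob(a):
    return a["bob"] == 4


def contradiction(a):
    return a["bob"] == 4 and a["bob"] == 9


print(is_certain(alice), is_possible(alice))
print(is_certain(bob), is_possible(bob))
print(is_certain(contradiction), is_possible(contradiction))
```

<p>Alice is certain, Bob is possible but not certain, and a tuple whose condition is a contradiction is neither.</p>
</section>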

<section>
<h3>Aggregate queries</h3>

<p style="margin-top: 50px; margin-bottom: 50px;">
As before, factorize the possible outcomes
</p>
<p class="fragment">
$$1 + \{\;1\;\textbf{if}\;\texttt{bob} = 4\;\} + \{\;1\;\textbf{if}\;\texttt{carol} = 3\;\}$$
</p>
<p style="margin-top: 50px;" class="fragment">
No bigger than the aggregate input...
</p>
<p class="fragment">
...and at least it only reduces to bin-packing <br/>(or a similar NP-hard problem.)
</p>
</section>
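
<section>
<p>One way to work with the factorized count, sketched under the assumption that every choice variable appears in exactly one term (as in the expression above). The names <code>CONSTANT</code>, <code>TERMS</code>, <code>evaluate</code>, and <code>bounds</code> are hypothetical.</p>

```python
# The factorized count: a constant plus one conditional term per uncertain row.
CONSTANT = 1                        # Alice is always counted
TERMS = [("bob", 4), ("carol", 3)]  # each term adds 1 if variable == value
DOMAINS = {"bob": [4, 9], "carol": [3, 8]}


def evaluate(assignment):
    """Value of the expression under one complete assignment."""
    return CONSTANT + sum(1 for var, val in TERMS if assignment[var] == val)


def bounds(domains):
    """Smallest and largest possible value, computed term by term.

    Only valid when each variable appears in exactly one term, so the
    terms can be minimized and maximized independently.
    """
    low = high = CONSTANT
    for var, val in TERMS:
        contributions = [1 if choice == val else 0 for choice in domains[var]]
        low += min(contributions)
        high += max(contributions)
    return low, high


print(evaluate({"bob": 4, "carol": 8}))
print(bounds(DOMAINS))
```

<p>The expression has one term per uncertain input row. The term-by-term shortcut stops working once terms share variables or the question is whether a specific total is reachable, which is where the hard cases come from.</p>
</section>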

<section>
<p>In short, incomplete databases are limited, but have some uses.</p>
<p class="fragment">What about probabilities?</p>
</section>
</section>