<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Leaky Joins</title>
<meta name="description" content="Convergent Interactive Inference with Leaky Joins">
<meta name="author" content="Oliver Kennedy">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../reveal.js-3.1.0/css/reveal.css">
<link rel="stylesheet" href="ubodin.css" id="theme">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../reveal.js-3.1.0/lib/css/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../reveal.js-3.1.0/css/print/pdf.css' : '../reveal.js-3.1.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../reveal.js-3.1.0/lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="header">
<!-- Any Talk-Specific Header Content Goes Here -->
<center>
<a href="http://www.buffalo.edu" target="_blank">
<img src="../graphics/logos/ub-1line-ro-white.png" height="20"/>
</a>
</center>
</div>
<div class="footer">
<!-- Any Talk-Specific Footer Content Goes Here -->
<div style="float: left; margin-top: 15px; ">
Exploring <u><b>O</b></u>nline <u><b>D</b></u>ata <u><b>In</b></u>teractions
</div>
<a href="http://odin.cse.buffalo.edu" target="_blank">
<img src="../graphics/logos/odin-1line-white.png" height="40" style="float: right;"/>
</a>
</div>
<div class="slides">
<section>
<section>
<h1>Leaky Joins</h1>
<h3><u>Ying Yang</u>, Oliver Kennedy</h3>
</section>
<section>
<h1>Leaky Joins</h1>
<h3>Ying Yang, <u>Oliver Kennedy</u></h3>
</section>
<section>
<img src="graphics/yingyang.jpg" style="float: right; margin-left: 20px; padding-top: 20px" />
<h3>Disclaimer</h3>
<p>Ying could not be here today. If you like her ideas, get in touch with her.</p>
<p class="fragment">(If you don't, blame my presentation)</p>
<p class="fragment">(Also, Ying is on the job market)</p>
</section>
</section>
<section>
<section>
<img src="graphics/mimir_logo_final.png" />
<p><a href="http://mimirdb.info">http://mimirdb.info</a></p>
<p style="font-size: smaller" class="fragment">(not immediately relevant to the talk, but you should check it out)</p>
</section>
<section>
<h2>Roughly 1-2 years ago...</h2>
<p><b>Ying</b>: To implement {cool feature in Mimir}, we'll need to be able to perform <u>inference</u> on <u>Graphical Models</u>, but we will <u>not know how complex they are</u>.</p>
</section>
<section>
<h2>Graphical Models</h2>
<p>Joint probability distributions are expensive to store<br/>
$$p(D, I, G, S, J)$$</p>
<p class="fragment">Bayes' rule lets us break apart the distribution<br/>
$$= p(D, I, G, S) \cdot p(J | D, I, G, S)$$</p>
<p class="fragment">And conditional independence lets us further simplify<br/>
$$= p(D, I, G, S) \cdot p(J | G, S)$$</p>
<p class="fragment">This is the basis for a type of graphical model called a "Bayes Net"</p>
</section>
<section>
<h2>Bayesian Networks</h2>
<svg width="500" height="400">
<image xlink:href="graphics/studentBN.svg" x="-125" y="-90"
height="650" width="650" />
<rect
class="fragment" data-fragment-index="1"
x="62" y="78" width="78" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
<rect
class="fragment" data-fragment-index="2"
x="253" y="58.5" width="78" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
<rect
class="fragment" data-fragment-index="3"
x="316" y="120" width="135" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
<rect
class="fragment" data-fragment-index="4"
x="1" y="204" width="169" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
<rect
class="fragment" data-fragment-index="5"
x="276" y="249" width="194" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
</svg>
<p>$p(D=1, I=0, S=0, G=2, J=1)$<br/>
<span class="fragment" data-fragment-index="1">$=\;0.5$</span>
<span class="fragment" data-fragment-index="2">$\cdot\;0.7$</span>
<span class="fragment" data-fragment-index="3">$\cdot\;0.95$</span>
<span class="fragment" data-fragment-index="4">$\cdot\;0.25$</span>
<span class="fragment" data-fragment-index="5">$\cdot\;0.8$</span>
<span class="fragment" data-fragment-index="6">$=\;0.0665$</span>
</p>
</section>
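The factored product on the slide can be checked directly; the five probabilities below are the CPT entries highlighted in the figure:

```javascript
// Joint probability of one full assignment in the student Bayes net:
// p(D=1, I=0, S=0, G=2, J=1)
//   = p(D=1) * p(I=0) * p(S=0 | I=0) * p(G=2 | D=1, I=0) * p(J=1 | G=2, S=0)
const factors = [0.5, 0.7, 0.95, 0.25, 0.8];
const joint = factors.reduce((acc, p) => acc * p, 1);
console.log(joint.toFixed(4)); // 0.0665
```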
<section>
<p>
$p(D,I,S,G,J)$ <br/>
$=$<br/>
$p(D) \bowtie p(I) \bowtie p(S|I) \bowtie p(G|D,I) \bowtie p(J|G,S)$
</p>
</section>
<section>
<h2>Inference</h2>
<p>$p(J) = \sum_{D,I,S,G}p(D,I,S,G,J)$</p>
<p>(a.k.a. computing the marginal probability)</p>
</section>
<section>
<h2>Inference Algorithms</h2>
<dl>
<dt>Exact (e.g., Variable Elimination)</dt>
<dd>Fast and precise, but scales poorly with graph complexity.</dd>
<dt>Approximate (e.g., Gibbs Sampling)</dt>
<dd>Consistent performance, but only asymptotic convergence.</dd>
</dl>
<p class="fragment"><b>Key Challenge</b>: For {really cool feature} we don't know whether we should use exact or approximate inference.</p>
</section>
<section>
<p>Can we gracefully degrade from exact to approximate inference?</p>
</section>
</section>
<section>
<section>
<pre><code>
SELECT J.J, SUM(D.p * I.p * S.p * G.p * J.p) AS p
FROM D NATURAL JOIN I NATURAL JOIN S
NATURAL JOIN G NATURAL JOIN J
GROUP BY J.J
</code></pre>
<p class="fragment">(Inference is essentially a big group-by aggregate join query)</p>
<p class="fragment" style="font-size: smaller">(Variable elimination is Aggregate Pushdown + Join Ordering)</p>
</section>
<section>
<h2>Idea: Online Aggregation</h2>
</section>
<section>
<h2>Online Aggregation (OLA)</h2>
<p style="margin-top: 60px;">$Avg(3,6,10,9,1,3,9,7,9,4,7,9,2,1,2,4,10,8,9,7) = 6$</p>
<p class="fragment">$Avg(3,6,10,9,1) = 5.8$ <span class="fragment">$\approx 6$</span></p>
<p class="fragment">$Sum\left(\frac{k}{N} Samples\right) \cdot \frac{N}{k} \approx Sum(*)$</p>
<p class="fragment" style="font-weight: bold; margin-top: 60px;">Sampling lets you approximate aggregate values with orders of magnitude less data.</p>
</section>
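The rescaling rule on the slide is a one-liner; the data and the 5-element prefix are the numbers from the slide:

```javascript
// Online aggregation: estimate the Sum/Avg of the full data set from the
// first k samples by scaling the partial sum by N/k.
const data = [3, 6, 10, 9, 1, 3, 9, 7, 9, 4, 7, 9, 2, 1, 2, 4, 10, 8, 9, 7];
const N = data.length;

function estimateSum(samples) {
  const partial = samples.reduce((a, x) => a + x, 0);
  return partial * (N / samples.length); // Sum(k/N samples) * N/k
}

const trueSum = estimateSum(data);              // all N samples: exact
const estimate = estimateSum(data.slice(0, 5)); // first 5 samples only
console.log(trueSum / N);  // 6    (Avg of all 20 values)
console.log(estimate / N); // 5.8  (Avg estimate from 5 samples)
```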
<section>
<h2>Typical OLA Challenges</h2>
<dl>
<dt>Birthday Paradox</dt>
<dd>$Sample(R) \bowtie Sample(S)$ is likely to be empty.</dd>
<dt>Stratified Sampling</dt>
<dd>No matter how important they are to the aggregate, rare values are still rarely sampled.</dd>
<dt>Replacement</dt>
<dd>Does the sampling algorithm converge exactly or asymptotically?</dd>
</dl>
</section>
<section>
<h2>Replacement</h2>
<dl>
<dt>Sampling Without Replacement</dt>
<dd>... eventually converges to a precise answer.</dd>
<dt>Sampling With Replacement</dt>
<dd>... doesn't need to track what's been sampled.</dd>
<dd>... produces a better-behaved estimate distribution.</dd>
</dl>
</section>
<section>
<h2>OLA over GMs</h2>
<dl>
<dt>Tables are Small</dt>
<dd>Compute, not I/O, is the bottleneck.</dd>
<dt>Tables are Dense</dt>
<dd>The Birthday Paradox and Stratified Sampling are irrelevant.</dd>
<dt>Queries have High Tree-Width</dt>
<dd>Intermediate tables are large.</dd>
</dl>
<p class="fragment" style="font-weight: bold;">Classical OLA techniques aren't entirely appropriate.</p>
</section>
</section>
<section>
<section>
<h2>(Naive) OLA: Cyclic Sampling</h2>
</section>
<section>
<h2>A Few Quick Insights</h2>
<ol>
<li class="fragment">Small Tables make random access to data possible.</li>
<li class="fragment">Dense Tables mean we can sample directly from join outputs.</li>
<li class="fragment">Cyclic PRNGs like Linear Congruential Generators can be used to generate a <u>randomly ordered</u>, but <u>non-repeating</u> sequence of integers from $0$ to any $N$ in constant memory.</li>
</ol>
</section>
<section>
<h2>Linear Congruential Generators</h2>
<p>If you pick $a$, $b$, and $N$ correctly, then the sequence:</p>
<p>$K_i = (a \cdot K_{i-1} + b) \mod N$</p>
<p>will produce $N$ distinct, pseudorandom integers $K_i \in [0, N)$</p>
</section>
<section>
<h2>Cyclic Sampling</h2>
<p>To marginalize $p(\{X_i\})$...</p>
<ol>
<li>Init an LCG with a cycle of $N = \prod_i |dom(X_i)|$</li>
<li>Use the LCG to sample an assignment $\{x_i\}$ to $\{X_i\}$</li>
<li>Incorporate $p(\{X_i\} = \{x_i\})$ into the OLA estimate</li>
<li>Repeat from 2 until done</li>
</ol>
</section>
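The steps above can be sketched in a few lines of JavaScript. The tiny two-variable joint table is made up for illustration, and the LCG constants $a = 7$, $b = 5$ are just one valid choice for $N = 6$ under the Hull-Dobell conditions:

```javascript
// Cyclic sampling sketch: marginalize p(Y) = sum_X p(X, Y) by visiting
// every joint assignment exactly once, in LCG order, keeping a running
// estimate along the way. The joint table is illustrative only.
const domX = 2, domY = 3;
const N = domX * domY;  // LCG cycle length = size of the joint domain
const joint = [         // joint[x][y] = p(X = x, Y = y); sums to 1
  [0.10, 0.25, 0.15],
  [0.20, 0.05, 0.25],
];

// Full-period LCG over [0, N): gcd(b, N) = 1 and a - 1 divisible by
// every prime factor of N (Hull-Dobell). a = 7, b = 5 work for N = 6.
function* lcg(a, b, n, seed) {
  let k = seed;
  for (let i = 0; i < n; i++) {
    k = (a * k + b) % n;
    yield k;
  }
}

const marginal = [0, 0, 0]; // running estimate of p(Y)
let drawn = 0;
for (const k of lcg(7, 5, N, 0)) {
  const x = Math.floor(k / domY), y = k % domY; // decode the assignment
  marginal[y] += joint[x][y];
  drawn++;
  // Mid-cycle, rescaling marginal by N/drawn gives a partial estimate;
  // after all N draws the sum is exact.
}
console.log(marginal); // exact p(Y) once all N assignments are visited
```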
<section>
<h2>Accuracy</h2>
<dl>
<dt>Sampling with Replacement</dt>
<dd>Chernoff and Hoeffding bounds give an $\epsilon$-$\delta$ guarantee on the sum/avg of a sample <u>with replacement</u>.</dd>
<dt>Without Replacement?</dt>
<dd class="fragment">Serfling et al. have a variant of Hoeffding bounds for sampling without replacement.</dd>
</dl>
</section>
<section>
<h2>Cyclic Sampling</h2>
<dl>
<dt>Advantages</dt>
<dd>Progressively better estimates over time.</dd>
<dd>Converges in bounded time.</dd>
<dt>Disadvantages</dt>
<dd>Exponential time in the number of variables.</dd>
</dl>
</section>
</section>
<section>
<section>
<h2>Better OLA: Leaky Joins</h2>
<p class="fragment">Make Cyclic Sampling into a composable operator</p>
</section>
<section>
<img src="graphics/JoinGraph.svg" height="400" style="float:left"/>
<table style="float:right">
<tr><th>$G$</th><th>#</th><th>$\sum p_{\psi_2}$</th></tr>
<tr class="fragment" data-fragment-index="2"><td>1</td><td>1</td><td>0.126</td></tr>
<tr><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td> </td><td> </td></tr>
</table>
<table>
<tr><th>$I$</th><th>$G$</th><th>#</th><th>$\sum p_{\psi_1}$</th></tr>
<tr class="fragment" data-fragment-index="1"><td>0</td><td>1</td><td>1</td><td>0.18</td></tr>
<tr><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td> </td><td> </td><td> </td><td> </td></tr>
</table>
</section>
<section>
<img src="graphics/JoinGraph.svg" height="400" style="float:left"/>
<table style="float:right">
<tr><th>$G$</th><th>#</th><th>$\sum p_{\psi_2}$</th></tr>
<tr><td>1</td><td>3</td><td>0.348</td></tr>
<tr><td>2</td><td>4</td><td>0.288</td></tr>
<tr><td>3</td><td>4</td><td>0.350</td></tr>
</table>
<table>
<tr><th>$I$</th><th>$G$</th><th>#</th><th>$\sum p_{\psi_1}$</th></tr>
<tr><td>0</td><td>1</td><td>2</td><td>0.140</td></tr>
<tr><td>1</td><td>1</td><td>2</td><td>0.222</td></tr>
<tr><td>0</td><td>2</td><td>2</td><td>0.238</td></tr>
<tr><td>1</td><td>2</td><td>2</td><td>0.050</td></tr>
<tr><td>0</td><td>3</td><td>2</td><td>0.322</td></tr>
<tr><td>1</td><td>3</td><td>2</td><td>0.028</td></tr>
</table>
</section>
<section>
<img src="graphics/JoinGraph.svg" height="400" style="float:left"/>
<table style="float:right">
<tr><th>$G$</th><th>#</th><th>$\sum p_{\psi_2}$</th></tr>
<tr><td>1</td><td>4</td><td>0.362</td></tr>
<tr><td>2</td><td>4</td><td>0.288</td></tr>
<tr><td>3</td><td>4</td><td>0.350</td></tr>
</table>
<table>
<tr><th>$I$</th><th>$G$</th><th>#</th><th>$\sum p_{\psi_1}$</th></tr>
<tr><td>0</td><td>1</td><td>2</td><td>0.140</td></tr>
<tr><td>1</td><td>1</td><td>2</td><td>0.222</td></tr>
<tr><td>0</td><td>2</td><td>2</td><td>0.238</td></tr>
<tr><td>1</td><td>2</td><td>2</td><td>0.050</td></tr>
<tr><td>0</td><td>3</td><td>2</td><td>0.322</td></tr>
<tr><td>1</td><td>3</td><td>2</td><td>0.028</td></tr>
</table>
</section>
<section>
<h2>Leaky Joins</h2>
<ol>
<li>Build a normal join/aggregate graph as in variable elimination: one Cyclic Sampler for each Join+Aggregate.</li>
<li>Keep advancing the Cyclic Samplers in parallel, resetting their output after every cycle so samples "leak" through.</li>
<li>When a sampler completes one full cycle over a complete input, mark it complete and stop sampling it.</li>
<li>Continue until the desired accuracy is reached or all tables are marked complete.</li>
</ol>
</section>
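A loose sketch of the idea with two samplers: the downstream stage re-reads the upstream stage's partial table on every step, rescales it, and resets its own output, so improved upstream estimates "leak" through. All tables, domain sizes, and LCG constants here are illustrative, not the paper's:

```javascript
// Stage 1 enumerates (d, g) pairs in LCG order, accumulating
// psi1[g] += p(d) * p(g | d). Stage 2, run in lockstep, repeatedly
// rebuilds its output from the *partial* psi1, scaled by N1 / drawn1
// to compensate for missing entries, and resets it on every pass.
const pD = [0.6, 0.4];             // p(D), illustrative
const pGgivenD = [                 // p(G | D), illustrative
  [0.3, 0.5, 0.2],
  [0.1, 0.4, 0.5],
];
const domD = 2, domG = 3, N1 = domD * domG;

const psi1 = [0, 0, 0];
let drawn1 = 0;
let k = 0;
for (let step = 0; step < N1; step++) {
  k = (7 * k + 5) % N1;            // full-period LCG for N1 = 6
  const d = Math.floor(k / domG), g = k % domG;
  psi1[g] += pD[d] * pGgivenD[d][g];
  drawn1++;

  // Stage 2 "leaks": rebuild its estimate from the partial psi1.
  const scale = N1 / drawn1;
  const pGestimate = psi1.map((v) => v * scale);
  // pGestimate improves each step; once drawn1 === N1 it is exact.
}
const pG = psi1; // after a full cycle, psi1 is exact: ~[0.22, 0.46, 0.32]
```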
<section>
<p>There's a bit of extra math to compute $\epsilon$-$\delta$ bounds by adapting Serfling's results. It's in the paper.</p>
</section>
</section>
<section>
<section>
<h2>Experiments</h2>
<dl>
<dt>Microbenchmarks</dt>
<dd>Fix time, vary domain size, measure accuracy</dd>
<dd>Fix domain size, vary time, measure accuracy</dd>
<dd>Vary domain size, measure time to completion</dd>
<dt>Macrobenchmarks</dt>
<dd>4 graphs from the bnlearn Repository</dd>
</dl>
</section>
<section>
<h2>Microbenchmarks</h2>
<img src="graphics/extended_student.png" />
<p><b>Student</b>: A common benchmark graph.</p>
</section>
<section>
<h2>Accuracy vs Domain</h2>
<img src="graphics/fixed_time.png" height="400"/>
<p class="fragment" style="font-weight: bold">VE is binary: It completes, or it doesn't.</p>
</section>
<section>
<h2>Accuracy vs Time</h2>
<img src="graphics/student_avg.png" height="400"/>
<p class="fragment" style="font-weight: bold">CS gets early results faster, but is overtaken by LJ.</p>
</section>
<section>
<h2>Domain vs Time to 100%</h2>
<img src="graphics/student_scaling.png" height="400"/>
<p class="fragment" style="font-weight: bold">LJ is only 3-5x slower than VE.</p>
</section>
<section>
<h2>"Child"</h2>
<div>
<img src="graphics/child.png" style="float:left" height="280">
<img src="graphics/child_avg.png" height="350">
</div>
<p class="fragment" style="font-weight: bold; clear: both">LJ converges to an exact result before Gibbs gets an approx.</p>
</section>
<section>
<h2>"Insurance"</h2>
<div>
<img src="graphics/insurance.png" style="float:left" height="280">
<img src="graphics/insurance_avg.png" height="350">
</div>
<p class="fragment" style="font-weight: bold; clear: both">On some graphs Gibbs is better, but only marginally.</p>
</section>
<section>
<h2>More graphs in the paper.</h2>
</section>
</section>
<section>
<h2>Leaky Joins</h2>
<ul>
<li>Classical OLA isn't appropriate for GMs.</li>
<li><b>Idea 1</b>: LCGs can sample <u>without</u> replacement.</li>
<li><b>Idea 2</b>: "Leak" samples through a normal join graph.</li>
<li>Compared to both Variable Elim. and Gibbs Sampling, Leaky Joins are often better and never drastically worse.</li>
</ul>
<p class="fragment" style="font-weight: bold">Questions?</p>
</section>
</div></div>
<script src="../reveal.js-3.1.0/lib/js/head.min.js"></script>
<script src="../reveal.js-3.1.0/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
  controls: false,
  progress: true,
  history: true,
  center: true,
  slideNumber: true,
  transition: 'fade', // none/fade/slide/convex/concave/zoom
  // Optional reveal.js plugins
  dependencies: [
    { src: '../reveal.js-3.1.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
    { src: '../reveal.js-3.1.0/plugin/math/math.js',
      condition: function() { return true; },
      mathjax: '../reveal.js-3.1.0/js/MathJax.js'
    },
    { src: '../reveal.js-3.1.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
    { src: '../reveal.js-3.1.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
    { src: '../reveal.js-3.1.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
    { src: '../reveal.js-3.1.0/plugin/zoom-js/zoom.js', async: true },
    { src: '../reveal.js-3.1.0/plugin/notes/notes.js', async: true }
  ]
});
</script>
</body>
</html>