<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Leaky Joins</title>
<meta name="description" content="Convergent Interactive Inference with Leaky Joins">
<meta name="author" content="Oliver Kennedy">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../reveal.js-3.1.0/css/reveal.css">
<link rel="stylesheet" href="ubodin.css" id="theme">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../reveal.js-3.1.0/lib/css/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../reveal.js-3.1.0/css/print/pdf.css' : '../reveal.js-3.1.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../reveal.js-3.1.0/lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="header">
<!-- Any Talk-Specific Header Content Goes Here -->
<center>
<a href="http://www.buffalo.edu" target="_blank">
<img src="../graphics/logos/ub-1line-ro-white.png" height="20" />
</a>
</center>
</div>
<div class="footer">
<!-- Any Talk-Specific Footer Content Goes Here -->
<div style="float: left; margin-top: 15px;">
Exploring <u><b>O</b></u>nline <u><b>D</b></u>ata <u><b>In</b></u>teractions
</div>
<a href="https://odin.cse.buffalo.edu" target="_blank">
<img src="../graphics/logos/odin-1line-white.png" height="40" style="float: right;" />
</a>
</div>
<div class="slides">
<section>
<section>
<h1>Leaky Joins</h1>
<h3><u>Ying Yang</u>, Oliver Kennedy</h3>
</section>
<section>
<h1>Leaky Joins</h1>
<h3>Ying Yang, <u>Oliver Kennedy</u></h3>
</section>
<section>
<img src="graphics/yingyang.jpg" style="float: right; margin-left: 20px; padding-top: 20px" />
<h3>Disclaimer</h3>
<p>Ying could not be here today. If you like her ideas, get in touch with her.</p>
<p class="fragment">(If you don't, blame my presentation)</p>
<p class="fragment">(Also, Ying is on the job market)</p>
</section>
</section>
<section>
<section>
<img src="graphics/mimir_logo_final.png" />
<p><a href="http://mimirdb.info">http://mimirdb.info</a></p>
<p style="font-size: smaller" class="fragment">(not immediately relevant to the talk, but you should check it out)</p>
</section>
<section>
<h2>Roughly 1-2 years ago...</h2>
<p><b>Ying</b>: To implement {cool feature in Mimir}, we'll need to be able to perform <u>inference</u> on <u>Graphical Models</u>, but we will <u>not know how complex they are</u>.</p>
</section>
<section>
<h2>Graphical Models</h2>
<p>Joint probability distributions are expensive to store<br/>
$$p(D, I, G, S, J)$$</p>
<p class="fragment">Bayes' rule lets us break apart the distribution<br/>
$$= p(D, I, G, S) \cdot p(J | D, I, G, S)$$</p>
<p class="fragment">And conditional independence lets us further simplify<br/>
$$= p(D, I, G, S) \cdot p(J | G, S)$$</p>
<p class="fragment">This is the basis for a type of graphical model called a "Bayes Net"</p>
</section>
<section>
<h2>Bayesian Networks</h2>
<svg width="500" height="400">
<image xlink:href="graphics/studentBN.svg" x="-125" y="-90"
height="650" width="650" />
<rect
class="fragment" data-fragment-index="1"
x="62" y="78" width="78" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
<rect
class="fragment" data-fragment-index="2"
x="253" y="58.5" width="78" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
<rect
class="fragment" data-fragment-index="3"
x="316" y="120" width="135" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
<rect
class="fragment" data-fragment-index="4"
x="1" y="204" width="169" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
<rect
class="fragment" data-fragment-index="5"
x="276" y="249" width="194" height="15"
style="fill: rgba(0,0,0,0); stroke: red; stroke-width: 3"
/>
</svg>
<p>$p(D=1, I=0, S=0, G=2, J=1)$<br/>
<span class="fragment" data-fragment-index="1">$=\;0.5$</span>
<span class="fragment" data-fragment-index="2">$\cdot\;0.7$</span>
<span class="fragment" data-fragment-index="3">$\cdot\;0.95$</span>
<span class="fragment" data-fragment-index="4">$\cdot\;0.25$</span>
<span class="fragment" data-fragment-index="5">$\cdot\;0.8$</span>
<span class="fragment" data-fragment-index="6">$=\;0.0665$</span>
</p>
</section>
<section>
<p>
$p(D,I,S,G,J)$<br/>
$=$<br/>
$p(D) \bowtie p(I) \bowtie p(S|I) \bowtie p(G|D,I) \bowtie p(J|G,S)$
</p>
</section>
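The relational view above can be sketched directly: each factor is a table with a probability column, the joint distribution is their natural join with probabilities multiplied, and marginalizing is a GROUP BY + SUM. A minimal sketch over two made-up factors $p(D)$ and $p(J|D)$ (all numbers here are illustrative, not from the student network):

```javascript
// Factor tables as relations with a probability column `p`.
const pD = [{ D: 0, p: 0.7 }, { D: 1, p: 0.3 }];
const pJgD = [
  { D: 0, J: 0, p: 0.8 }, { D: 0, J: 1, p: 0.2 },
  { D: 1, J: 0, p: 0.4 }, { D: 1, J: 1, p: 0.6 },
];

// NATURAL JOIN on the shared column D, multiplying probabilities:
// this reconstructs the joint distribution p(D, J).
const joint = pD.flatMap((r) =>
  pJgD.filter((s) => s.D === r.D)
      .map((s) => ({ D: r.D, J: s.J, p: r.p * s.p })));

// GROUP BY J, SUM(p): the marginal p(J).
const pJ = [0, 1].map((j) =>
  joint.filter((t) => t.J === j).reduce((a, t) => a + t.p, 0));
// pJ[1] = 0.7*0.2 + 0.3*0.6 = 0.32
```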
<section>
<h2>Inference</h2>
<p>$p(J) = \sum_{D,I,S,G}p(D,I,S,G,J)$</p>
<p>(a.k.a. computing the marginal probability)</p>
</section>
<section>
<h2>Inference Algorithms</h2>
<dl>
<dt>Exact (e.g., Variable Elimination)</dt>
<dd>Fast and precise, but scales poorly with graph complexity.</dd>
<dt>Approximate (e.g., Gibbs Sampling)</dt>
<dd>Consistent performance, but only asymptotic convergence.</dd>
</dl>
<p class="fragment"><b>Key Challenge</b>: For {really cool feature} we don't know whether we should use exact or approximate inference.</p>
</section>
<section>
<p>Can we gracefully degrade from exact to approximate inference?</p>
</section>
</section>
<section>
<section>
<pre><code>
SELECT J.J, SUM(D.p * I.p * S.p * G.p * J.p) AS p
FROM D NATURAL JOIN I NATURAL JOIN S
     NATURAL JOIN G NATURAL JOIN J
GROUP BY J.J
</code></pre>
<p class="fragment">(Inference is essentially a big group-by aggregate join query)</p>
<p class="fragment" style="font-size: smaller">(Variable elimination is Aggregate Pushdown + Join Ordering)</p>
</section>
<section>
<h2>Idea: Online Aggregation</h2>
</section>
<section>
<h2>Online Aggregation (OLA)</h2>
<p style="margin-top: 60px;">$Avg(3,6,10,9,1,3,9,7,9,4,7,9,2,1,2,4,10,8,9,7) = 6$</p>
<p class="fragment">$Avg(3,6,10,9,1) = 5.8$ <span class="fragment">$\approx 6$</span></p>
<p class="fragment">$Sum\left(\frac{k}{N} Samples\right) \cdot \frac{N}{k} \approx Sum(*)$</p>
<p class="fragment" style="font-weight: bold; margin-top: 60px;">Sampling lets you approximate aggregate values with orders of magnitude less data.</p>
</section>
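The arithmetic on this slide checks out directly. A small sketch using the slide's own twenty values: take a $k$-of-$N$ sample, then scale the partial sum by $N/k$ to estimate the full aggregate.

```javascript
// OLA on the slide's numbers: average of all 20 values is 6; the
// average of the first 5 "sampled" values is already 5.8.
const data = [3, 6, 10, 9, 1, 3, 9, 7, 9, 4, 7, 9, 2, 1, 2, 4, 10, 8, 9, 7];
const sum = (xs) => xs.reduce((a, x) => a + x, 0);

const N = data.length;                 // 20
const trueSum = sum(data);             // 120, so Avg = 6

const k = 5;
const sample = data.slice(0, k);       // 3, 6, 10, 9, 1
const estAvg = sum(sample) / k;        // 5.8 ≈ 6
const estSum = sum(sample) * (N / k);  // 29 * 4 = 116 ≈ 120
```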
<section>
<h2>Typical OLA Challenges</h2>
<dl>
<dt>Birthday Paradox</dt>
<dd>$Sample(R) \bowtie Sample(S)$ is likely to be empty.</dd>
<dt>Stratified Sampling</dt>
<dd>No matter how important they are to the aggregate, rare samples are still rare.</dd>
<dt>Replacement</dt>
<dd>Does the sampling algorithm converge exactly or asymptotically?</dd>
</dl>
</section>
<section>
<h2>Replacement</h2>
<dl>
<dt>Sampling Without Replacement</dt>
<dd>... eventually converges to a precise answer.</dd>
<dt>Sampling With Replacement</dt>
<dd>... doesn't need to track what's been sampled.</dd>
<dd>... produces a better-behaved estimate distribution.</dd>
</dl>
</section>
<section>
<h2>OLA over GMs</h2>
<dl>
<dt>Tables are Small</dt>
<dd>Compute, not IO, is the bottleneck.</dd>
<dt>Tables are Dense</dt>
<dd>The Birthday Paradox and Stratified Sampling are irrelevant.</dd>
<dt>Queries have High Tree-Width</dt>
<dd>Intermediate tables are large.</dd>
</dl>
<p class="fragment" style="font-weight: bold;">Classical OLA techniques aren't entirely appropriate.</p>
</section>
</section>
<section>
<section>
<h2>(Naive) OLA: Cyclic Sampling</h2>
</section>
<section>
<h2>A Few Quick Insights</h2>
<ol>
<li class="fragment">Small Tables make random access to data possible.</li>
<li class="fragment">Dense Tables mean we can sample directly from join outputs.</li>
<li class="fragment">Cyclic PRNGs like Linear Congruential Generators can be used to generate a <u>randomly ordered</u>, but <u>non-repeating</u> sequence of integers from $0$ to any $N$ in constant memory.</li>
</ol>
</section>
<section>
<h2>Linear Congruential Generators</h2>
<p>If you pick $a$, $b$, and $N$ correctly, then the sequence:</p>
<p>$K_i = (a\cdot K_{i-1}+b)\;mod\;N$</p>
<p>will produce $N$ distinct, pseudorandom integers $K_i \in [0, N)$</p>
</section>
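A concrete full-cycle example. The constants here are illustrative: $a = 5$, $b = 3$, $N = 16$ satisfy the Hull-Dobell conditions ($b$ coprime to $N$; $a-1$ divisible by every prime factor of $N$, and by 4 when 4 divides $N$), so the generator hits every value in $[0, 16)$ exactly once before repeating.

```javascript
// Full-cycle LCG: K_i = (a * K_{i-1} + b) mod N.
// With a = 5, b = 3, N = 16 (Hull-Dobell conditions hold), one pass
// emits a shuffled, non-repeating enumeration of 0..15 in O(1) memory.
function lcgCycle(a, b, N, seed = 0) {
  const out = [];
  let k = seed;
  for (let i = 0; i < N; i++) {
    k = (a * k + b) % N;
    out.push(k);
  }
  return out;
}

const seq = lcgCycle(5, 3, 16);
// seq has 16 entries and 16 distinct values: every integer in [0, 16).
```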
<section>
<h2>Cyclic Sampling</h2>
<p>To marginalize $p(\{X_i\})$...</p>
<ol>
<li>Init an LCG with a cycle of $N = \prod_i |dom(X_i)|$</li>
<li>Use the LCG to sample $\{x_i\} \in \{X_i\}$</li>
<li>Incorporate $p(x_i = X_i)$ into the OLA estimate</li>
<li>Repeat from 2 until done</li>
</ol>
</section>
<section>
<h2>Accuracy</h2>
<dl>
<dt>Sampling with Replacement</dt>
<dd>Chernoff and Hoeffding bounds give an $\epsilon-\delta$ guarantee on the sum/avg of a sample <u>with replacement</u>.</dd>
<dt>Without Replacement?</dt>
<dd class="fragment">Serfling et al. have a variant of Hoeffding bounds for sampling without replacement.</dd>
</dl>
</section>
<section>
<h2>Cyclic Sampling</h2>
<dl>
<dt>Advantages</dt>
<dd>Progressively better estimates over time.</dd>
<dd>Converges in bounded time.</dd>
<dt>Disadvantages</dt>
<dd>Exponential time in the number of variables.</dd>
</dl>
</section>
</section>
<section>
<section>
<h2>Better OLA: Leaky Joins</h2>
<p class="fragment">Make Cyclic Sampling into a composable operator</p>
</section>
<section>
<img src="graphics/JoinGraph.svg" height="400" style="float:left" />
<table style="float:right">
<tr><th>$G$</th><th>#</th><th>$\sum p_{\psi_2}$</th></tr>
<tr class="fragment" data-fragment-index="2"><td>1</td><td>1</td><td>0.126</td></tr>
<tr><td></td><td></td><td></td></tr>
<tr><td></td><td></td><td></td></tr>
</table>
<table>
<tr><th>$I$</th><th>$G$</th><th>#</th><th>$\sum p_{\psi_1}$</th></tr>
<tr class="fragment" data-fragment-index="1"><td>0</td><td>1</td><td>1</td><td>0.18</td></tr>
<tr><td></td><td></td><td></td><td></td></tr>
<tr><td></td><td></td><td></td><td></td></tr>
<tr><td></td><td></td><td></td><td></td></tr>
<tr><td></td><td></td><td></td><td></td></tr>
<tr><td></td><td></td><td></td><td></td></tr>
</table>
</section>
<section>
<img src="graphics/JoinGraph.svg" height="400" style="float:left" />
<table style="float:right">
<tr><th>$G$</th><th>#</th><th>$\sum p_{\psi_2}$</th></tr>
<tr><td>1</td><td>3</td><td>0.348</td></tr>
<tr><td>2</td><td>4</td><td>0.288</td></tr>
<tr><td>3</td><td>4</td><td>0.350</td></tr>
</table>
<table>
<tr><th>$I$</th><th>$G$</th><th>#</th><th>$\sum p_{\psi_1}$</th></tr>
<tr><td>0</td><td>1</td><td>2</td><td>0.140</td></tr>
<tr><td>1</td><td>1</td><td>2</td><td>0.222</td></tr>
<tr><td>0</td><td>2</td><td>2</td><td>0.238</td></tr>
<tr><td>1</td><td>2</td><td>2</td><td>0.050</td></tr>
<tr><td>0</td><td>3</td><td>2</td><td>0.322</td></tr>
<tr><td>1</td><td>3</td><td>2</td><td>0.028</td></tr>
</table>
</section>
<section>
<img src="graphics/JoinGraph.svg" height="400" style="float:left" />
<table style="float:right">
<tr><th>$G$</th><th>#</th><th>$\sum p_{\psi_2}$</th></tr>
<tr><td>1</td><td>4</td><td>0.362</td></tr>
<tr><td>2</td><td>4</td><td>0.288</td></tr>
<tr><td>3</td><td>4</td><td>0.350</td></tr>
</table>
<table>
<tr><th>$I$</th><th>$G$</th><th>#</th><th>$\sum p_{\psi_1}$</th></tr>
<tr><td>0</td><td>1</td><td>2</td><td>0.140</td></tr>
<tr><td>1</td><td>1</td><td>2</td><td>0.222</td></tr>
<tr><td>0</td><td>2</td><td>2</td><td>0.238</td></tr>
<tr><td>1</td><td>2</td><td>2</td><td>0.050</td></tr>
<tr><td>0</td><td>3</td><td>2</td><td>0.322</td></tr>
<tr><td>1</td><td>3</td><td>2</td><td>0.028</td></tr>
</table>
</section>
<section>
<h2>Leaky Joins</h2>
<ol>
<li>Build a normal join/aggregate graph as in variable elimination: one Cyclic Sampler for each Join+Aggregate.</li>
<li>Keep advancing the Cyclic Samplers in parallel, resetting their outputs after every cycle so that samples "leak" through.</li>
<li>When a sampler completes one full cycle over a complete input, mark it complete and stop sampling it.</li>
<li>Continue until the desired accuracy is reached or all tables are marked complete.</li>
</ol>
</section>
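The loop above can be sketched with two chained samplers. Everything here is a toy: the tables and probabilities are made up, and for readability each sampler walks its cycle in order rather than in LCG order. The point is the "leak": the downstream sampler reads the upstream sampler's *partial* output, and resets its accumulator once per cycle so fresher upstream values flow through; once its input is complete, one clean cycle yields the exact marginal.

```javascript
// Toy chain: psi1(G) = sum_D p(D)*p(G|D), then p(J) = sum_G psi1(G)*p(J|G).
const pD = [0.6, 0.4];
const pGgD = [[0.3, 0.5, 0.2], [0.1, 0.4, 0.5]];   // p(G|D)
const pJgG = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]; // p(J|G)

// Upstream sampler: one (d, g) tuple per step; complete after 6 steps.
const psi1 = { acc: [0, 0, 0], step: 0, complete: false };
function advancePsi1() {
  const d = Math.floor(psi1.step / 3), g = psi1.step % 3;
  psi1.acc[g] += pD[d] * pGgD[d][g];
  psi1.step++;
  if (psi1.step === 6) psi1.complete = true;
}

// Downstream sampler: reads psi1's (possibly partial) table each step.
const pJ = { acc: [0, 0], step: 0, cleanCycle: false, complete: false };
function advancePJ() {
  const g = Math.floor(pJ.step / 2), j = pJ.step % 2;
  pJ.acc[j] += psi1.acc[g] * pJgG[g][j];
  pJ.step++;
  if (pJ.step === 6) {              // end of a full cycle
    if (pJ.cleanCycle) { pJ.complete = true; return; }
    pJ.cleanCycle = psi1.complete;  // next cycle will see a finished input
    pJ.acc = [0, 0];                // reset: fresher values "leak" through
    pJ.step = 0;
  }
}

// Advance all incomplete samplers in parallel until the root is done.
while (!pJ.complete) {
  if (!psi1.complete) advancePsi1();
  advancePJ();
}
// pJ.acc now holds the exact marginal p(J).
```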
<section>
<p>There's a bit of extra math to compute $\epsilon-\delta$ bounds by adapting Serfling's results. It's in the paper.</p>
</section>
</section>
<section>
<section>
<h2>Experiments</h2>
<dl>
<dt>Microbenchmarks</dt>
<dd>Fix time, vary domain size, measure accuracy</dd>
<dd>Fix domain size, vary time, measure accuracy</dd>
<dd>Vary domain size, measure time to completion</dd>
<dt>Macrobenchmarks</dt>
<dd>4 graphs from the bnlearn Repository</dd>
</dl>
</section>
<section>
<h2>Microbenchmarks</h2>
<img src="graphics/extended_student.png" />
<p><b>Student</b>: A common benchmark graph.</p>
</section>
<section>
<h2>Accuracy vs Domain</h2>
<img src="graphics/fixed_time.png" height="400" />
<p class="fragment" style="font-weight: bold">VE is binary: it completes, or it doesn't.</p>
</section>
<section>
<h2>Accuracy vs Time</h2>
<img src="graphics/student_avg.png" height="400" />
<p class="fragment" style="font-weight: bold">CS gets early results faster, but is overtaken by LJ.</p>
</section>
<section>
<h2>Domain vs Time to 100%</h2>
<img src="graphics/student_scaling.png" height="400" />
<p class="fragment" style="font-weight: bold">LJ is only 3-5x slower than VE.</p>
</section>
<section>
<h2>"Child"</h2>
<div>
<img src="graphics/child.png" style="float:left" height="280">
<img src="graphics/child_avg.png" height="350">
</div>
<p class="fragment" style="font-weight: bold; clear: both">LJ converges to an exact result before Gibbs gets an approximation.</p>
</section>
<section>
<h2>"Insurance"</h2>
<div>
<img src="graphics/insurance.png" style="float:left" height="280">
<img src="graphics/insurance_avg.png" height="350">
</div>
<p class="fragment" style="font-weight: bold; clear: both">On some graphs Gibbs is better, but only marginally.</p>
</section>
<section>
<h2>More graphs in the paper.</h2>
</section>
</section>
<section>
<h2>Leaky Joins</h2>
<ul>
<li>Classical OLA isn't appropriate for GMs.</li>
<li><b>Idea 1</b>: LCGs can sample <u>without</u> replacement.</li>
<li><b>Idea 2</b>: "Leak" samples through a normal join graph.</li>
<li>Compared to both Variable Elim. and Gibbs Sampling, Leaky Joins are often better and never drastically worse.</li>
</ul>
<p class="fragment" style="font-weight: bold">Questions?</p>
</section>
</div></div>
<script src="../reveal.js-3.1.0/lib/js/head.min.js"></script>
<script src="../reveal.js-3.1.0/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../reveal.js-3.1.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.1.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.1.0/js/MathJax.js'
},
{ src: '../reveal.js-3.1.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.1.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.1.0/plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>