<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Embracing Uncertainty</title>
<meta name="description" content="Mimir">
<meta name="author" content="Oliver Kennedy">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../../reveal.js-3.1.0/css/reveal.css">
<link rel="stylesheet" href="ubodin.css" id="theme">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../../reveal.js-3.1.0/lib/css/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../../reveal.js-3.1.0/css/print/pdf.css' : '../../reveal.js-3.1.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../../reveal.js-3.1.0/lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="header">
<!-- Any Talk-Specific Header Content Goes Here -->
Embracing Uncertainty
</div>
<div class="footer">
<!-- Any Talk-Specific Footer Content Goes Here -->
<div style="float: left; margin-top: 15px; ">
Exploring <u><b>O</b></u>nline <u><b>D</b></u>ata <u><b>In</b></u>teractions
</div>
<img src="graphics/FullText-white.png" height="40" style="float: right;"/>
</div>
<div class="slides">
<section>
<h4>Embracing uncertainty with</h4>
<img src="graphics/mimir_logo_final.png" />
</section>
<section>
<h4>Joint work with:</h4>
<p>
<i>Ying Yang, Poonam Kumari, William Spoth, Aaron Huber,<br/>
Lisa Lu, Jacob Powathikunnil Verghese</i>
</p><p>
Niccolo Meneghetti, Arindam Nandi (both now HPE/Vertica),<br/>
Vinayak Karuppasamy (now Bloomberg),
</p><p>
Ronny Fehling (Airbus), <br/>
Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), <br/>
Boris Glavic (IIT), Juliana Freire (NYU)
</p>
</section>
<section>
<section>
<h3>A Big Data Fairy Tale</h3>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" />
<h4>Meet Alice</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" />
<img src="graphics/littlestorefront-800px.png" height="300" />
<h4>Alice has a Store</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/littlestorefront-800px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">&rarr;</span>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;" />
<h4>Alice's store collects sales data</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">+</span>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">=</span>
<img src="graphics/saco-800px.png" height="300" style=" vertical-align: middle;" />
<h4>Alice wants to use her sales data to run a promotion</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">&rarr;</span>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<h4>So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">+&nbsp;?</span>
<h4>... asks her question ...</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">+&nbsp;?&nbsp;</span>
<img src="graphics/crystalball-800px.png" height="300" style=" vertical-align: middle;" />
<h4>... and basks in the limitless possibilities of big data.</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
</section>
<section>
<section>
<h2>Why is this a fairy tale?</h2>
</section>
<section>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">&rarr;</span>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<h4>It's never this easy...</h4>
</section>
</section>
<section>
<section>
<h2>CSV Import</h2>
<h4>Run a <code>SELECT</code> on a raw CSV File</h4>
<ul class="fragment">
<li>File may not have column headers</li>
<li>CSV does not provide "types"</li>
<li>Lines may be missing fields</li>
<li>Fields may be mistyped (typo, missing comma)</li>
<li>Comment text can be inlined into the file</li>
</ul>
<p class="fragment">
<b>State of the art</b>: External table definition <span class="fragment">+ "Manually" edit CSV</span>
</p>
</section>
<section>
<h2>Merge Two Datasets</h2>
<h4><code>UNION</code> two data sources</h4>
<ul class="fragment">
<li>Schema matching</li>
<li>Deduplication</li>
<li>Format alignment (GIS coordinates, $ vs €)</li>
<li>Precision alignment (State vs County)</li>
</ul>
<p class="fragment">
<b>State of the art</b>: Manually map schema
</p>
</section>
<section>
<h2>JSON Shredding</h2>
<h4>Run a <code>SELECT</code> on JSON or a Doc Store</h4>
<ul class="fragment">
<li>Separating fields and record sets:<br/>(e.g., <code>{ A: "Bob", B: "Alice" }</code>)</li>
<li>Missing fields (Records with no 'address')</li>
<li>Type alignment (Records with 'address' as an array)</li>
<li>Schema matching$^2$</li>
</ul>
<p class="fragment">
<b>State of the art</b>: DataGuide, Wrangler, etc...
</p>
</section>
</section>
<section>
<section>
<h2>Data Cleaning is Hard!</h2>
</section>
<section>
<h3>State of the Art</h3>
<img src="graphics/BI-Analyst.jpg" height="400" />
<attribution>(skilledup.com)</attribution>
<p>Alice spends weeks cleaning her data before using it.</p>
</section>
<section>
<h3>Newer State of the Art</h3>
<img src="graphics/iu.jpeg" height=500 />
<attribution>(azure.microsoft.com)</attribution>
</section>
<section>
<img src="graphics/data-lake-to-data-swamp.jpg" height=500 />
<attribution>(timoelliott.com)</attribution>
</section>
</section>
<section>
<section>
<h2>Structure is hard!</h2>
<ul>
<li class="fragment">Structured models (RelDBs) force curation during loading.
<ul><li class="fragment"><b>Problem:</b> All curation costs are upfront.</li></ul>
</li>
<li class="fragment">Unstructured models (NoSQL) force curation into queries.
<ul><li class="fragment"><b>Problem:</b> Complexity/redundancy blowup in queries.</li></ul>
</li>
</ul>
<p class="fragment" style="margin-top: 50px;">Add structure, curation effort <b>On-Demand</b></p>
</section>
<section>
<h3>But... you still need some sort of structure?!?</h3>
<h3 class="fragment">Let the database make a guess!</h3>
</section>
<section>
<h3>
In the name of Codd,<br/><span class="fragment grow highlight-current-blue">thou shalt not give the user a wrong answer.</span>
</h3>
<h4 class="fragment">
... but what if we did?
</h4>
<h4 class="fragment">
What would it take for that to be ok?
</h4>
</section>
</section>
<section>
<section>
<h2>Industry says...</h2>
</section>
<section>
<img src="graphics/maybe-screen.png" height="500px" />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="graphics/maybe-detail.png" height="500px" class="fragment" /><br/>
<p class="fragment">My phone is guessing, but it tells me that it guessed.</p>
</section>
<section>
<img src="graphics/Calendar_Base.png" height="500px" />
</section>
<section>
<img src="graphics/Calendar_Explain.png" height="500px" />
<p>Easy interactions to <i>accept</i>, <i>reject</i>, or <i>explain</i> uncertainty</p>
</section>
<section>
<img src="graphics/Bing-Translate.png" height="500px" />
<p class="fragment">Good Explanations, Alternatives, and Feedback Vectors</p>
</section>
<section>
<h2>Communication</h2>
<ul>
<li>What data is uncertain?</li>
<li>Why is my data uncertain?</li>
<li>How bad is it?</li>
<li>What can I do about it?</li>
</ul>
</section>
<section>
<h2>What if a database did the same?</h2>
</section>
<section>
<ul style="width:35%; font-size: 24pt; margin-top: 50px;">
<li class="fragment"><b>A:</b> Standard SQL.</li>
<li class="fragment"><b>B:</b> Annotated Output.</li>
<li class="fragment"><b>C:</b> Lens Diagram.</li>
<li class="fragment"><b>D:</b> Result Explanations.</li>
</ul>
<img src="graphics/UIExample.png" style="width:60%; float:right"/>
</section>
</section>
<section>
<section>
<h3>Lenses</h3>
<p class="fragment">Here's a problem with my data. <span class="fragment">Fix it.</span></p>
<ul>
<li class="fragment">What type is this column? (majority vote)</li>
<li class="fragment">How do the columns of these relations line up? (pick your favorite schema matching paper)</li>
<li class="fragment">How do I query heterogeneous JSON objects? (see above)</li>
<li class="fragment">What should these missing values be? (learning-based interpolation)</li>
</ul>
</section>
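<section data-markdown><script type="text/template">
A minimal sketch of the "majority vote" idea for guessing a column's type, assuming the column arrives as raw strings. The helper names `guess_type` and `infer_column_type` and the three candidate types are illustrative assumptions, not Mimir's actual code.

```python
from collections import Counter

def guess_type(value):
    """Return the narrowest type name that parses `value` (hypothetical helper)."""
    for name, parse in (("int", int), ("float", float)):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    return "string"

def infer_column_type(values):
    """Majority vote over per-cell guesses, plus the share of cells that agree."""
    votes = Counter(guess_type(v) for v in values)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(values)

# Made-up column with one cell that does not fit the majority.
column = ["12", "7", "n/a", "42", "3"]
print(infer_column_type(column))
```

The agreement share gives a rough handle on "how bad is it?" for this guess.
</script></section>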
<section>
<svg width=500 height=350>
<g transform="scale(1.2)">
<text x="0" y="45">View:</text>
<image xlink:href="graphics/db.svg" x="130" y="10" height="50px" width="50px"/>
<text x="225" y="20" style="font-family: courier; font-size: 60%">SELECT</text>
<polygon
points="190,35 340,35 325,30 325,40 340,35"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="350" y="10" height="50px" width="50px"/>
</g>
<g transform="translate(0,150) scale(1.2)" class="fragment">
<text x="0" y="45">Lens:</text>
<image xlink:href="graphics/db.svg" x="130" y="10" height="50px" width="50px"/>
<text x="225" y="20" style="font-family: courier; font-size: 60%">SELECT</text>
<polygon
points="190,35 340,35 325,30 325,40 340,35"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="350" y="10" height="50px" width="50px"/>
<g class="fragment">
<text x="212" y="20" style="font-family: courier; font-size: 60%">[&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;]</text>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="355" y="15" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="360" y="20" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="365" y="25" height="50px" width="50px"/>
<g class="fragment">
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="350" y="110" height="60px" width="60px"/>
<polygon
points="380,80 380,105 385,90 375,90 380,105"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<text x="220" y="142" style="font-size: 60%">(best guess)</text>
</g>
</g>
</g>
</svg>
<p class="fragment">Lenses introduce <i>uncertainty</i></p>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<h2>The User's View</h2>
<pre><code>
SELECT NAME, DEPARTMENT FROM PRODUCTS;
</code></pre>
<table class="fragment" data-fragment-index="1">
<tr><th>Name</th><th>Department</th></tr>
<tr><td>Apple 6s, White</td><td>Phone</td></tr>
<tr><td>Dell, Intel 4 core</td><td>Computer</td></tr>
<tr><td>HP, AMD 2 core</td><td class="fragment highlight-red" data-fragment-index="2">Computer</td></tr>
<tr><td>...</td><td>...</td></tr>
</table>
<p class="fragment" data-fragment-index="2"><b>Simple UI:</b> Highlight values that are based on guesses.</p>
</section>
<section>
<pre><code>
SELECT NAME, DEPARTMENT FROM PRODUCTS;
</code></pre>
<small>
<table>
<tr><th>Name</th><th>Department</th></tr>
<tr><td>Apple 6s, White</td><td>Phone</td></tr>
<tr><td>Dell, Intel 4 core</td><td>Computer</td></tr>
<tr><td>HP, AMD 2 core</td><td style="color: red;">Computer</td></tr>
<tr><td>...</td><td>...</td></tr>
</table>
</small>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xl="http://www.w3.org/1999/xlink" version="1.1" viewBox="241 277 265 125" width="265pt" height="125pt" xmlns:dc="http://purl.org/dc/elements/1.1/" class="fragment" data-fragment-index="1">
<metadata> Produced by OmniGraffle 6.2.5 <dc:date>2015-09-20 14:45:55 +0000</dc:date></metadata>
<defs><font-face font-family="Helvetica Neue" font-size="16" panose-1="2 0 8 3 0 0 0 9 0 4" units-per-em="1000" underline-position="-100" underline-thickness="50" slope="0" x-height="517" cap-height="714" ascent="975.0061" descent="-216.99524" font-weight="bold"><font-face-src><font-face-name name="HelveticaNeue-Bold"/></font-face-src></font-face><font-face font-family="Helvetica Neue" font-size="16" panose-1="2 0 5 3 0 0 0 2 0 4" units-per-em="1000" underline-position="-100" underline-thickness="50" slope="0" x-height="517" cap-height="714" ascent="951.99585" descent="-212.99744" font-weight="500"><font-face-src><font-face-name name="HelveticaNeue"/></font-face-src></font-face></defs>
<g stroke="none" stroke-opacity="1" stroke-dasharray="none" fill="none" fill-opacity="1">
<title>Canvas 1</title>
<g>
<title>Layer 1</title>
<path d="M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" fill="white"/>
<path d="M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" stroke="black" stroke-linecap="round" stroke-linejoin="round" stroke-width="1"/>
<text transform="translate(293 293)" fill="black"><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="0" y="16" textLength="16.896" class="fragment" data-fragment-index="2">Pr</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="16.608" y="16" textLength="69.28" class="fragment" data-fragment-index="2">obability:</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="85.888" y="16" textLength="38.24" class="fragment" data-fragment-index="2"> 95%</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="0" y="53" textLength="62.224" class="fragment" data-fragment-index="3">Reason:</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="62.224" y="53" textLength="144.912" class="fragment" data-fragment-index="3"> Because I guessed </tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="0" y="71" textLength="206.592" class="fragment" data-fragment-index="3">Computer for Department </tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="0" y="89" textLength="196.16" class="fragment" data-fragment-index="3">on Row 3 of PRODUCTS</tspan></text>
</g>
</g>
</svg>
<p class="fragment" data-fragment-index="1">Allow users to <code>EXPLAIN</code> uncertain outputs</p>
<p class="fragment" data-fragment-index="3">Explanations include reasons given in English</p>
</section>
<section>
<div style="padding: 30px;">
<p>$PRODUCTS.DEPARTMENT_{3}$</p>
<div style="font-size: 2em">&darr;</div>
<p>"I guessed 'Computer' for 'Department' on Row '3'"</p>
</div>
</section>
<section>
<h3>Explanations</h3>
<ol>
<li>Mark <i>uncertain</i> data and results.</li>
<li>Upon request, provide more detail:
<ul style="font-size:80%; width: 600px">
<li>Why is my data uncertain? <span style="float:right; font-size:80%; margin-top: 5px">(provenance)</span></li>
<li>How bad is it? <span style="float:right; font-size:80%; margin-top: 5px">(confidence, entropy, bounds)</span></li>
<li>What are other possible answers? <span style="float:right; font-size:80%; margin-top: 5px">(samples)</span></li>
<li>What can I do to fix it? <span style="float:right; font-size:80%; margin-top: 5px">(repairs)</span></li>
</ul></li>
</ol>
</section>
</section>
<section>
<img src="https://odin.cse.buffalo.edu/assets/people/oliver.jpg" height="300px">
<p><b>Email:</b> okennedy@buffalo.edu</p>
<p><b>Office:</b> Davis 338H</p>
</section>
<section>
<h1>Backup Slides</h1>
</section>
<section>
<section>
<h2>Mimir is a DB <u>Overlay</u></h2>
</section>
<section>
<svg width="500px" height="400px">
<g>
<g>
<image xlink:href="graphics/db.svg" x="10" y="5" height="50px" width="50px"/>
<text x="0" y="80" style="font-size:50%">(Any DB)</text>
</g>
<polygon
points="0,0 120,0 105,-5 105,5 120,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<g transform="translate(220,0)">
<image xlink:href="graphics/primary-queries.svg" x="0" y="5" height="50px" width="50px"/>
<text x="0" y="80" style="font-size:50%">(Lens)</text>
</g>
<polygon
points="0,0 100,0 85,-5 85,5 100,0"
transform="translate(290,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<g transform="translate(400,0)">
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="0" y="10" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="5" y="15" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="10" y="20" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="15" y="25" height="50px" width="50px"/>
</g>
</g>
<g class="fragment">
<polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(200,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
<polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(245,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
<polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(290,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
<g transform="translate(0,200)">
<g>
<image xlink:href="graphics/db.svg" x="10" y="5" height="50px" width="50px"/>
<text x="0" y="80" style="font-size:50%">(Any DB)</text>
</g>
<polygon
points="0,0 120,0 105,-5 105,5 120,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<polygon
points="0,0 120,65 105,50 105,62 120,65"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<polygon
points="0,0 120,130 105,105 105,122 120,130"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<g transform="translate(210,0)">
<text x="0" y="45" style="font-size:50%; font-family: courier">SELECT</text>
<polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="200" y="15" height="50px" width="50px"/>
</g>
<g transform="translate(210,65)">
<text x="0" y="45" style="font-size:50%; font-family: courier">SELECT</text>
<polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="200" y="15" height="50px" width="50px"/>
</g>
<g transform="translate(210,130)">
<text x="0" y="45" style="font-size:50%; font-family: courier">SELECT</text>
<polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="200" y="15" height="50px" width="50px"/>
</g>
</g>
</g>
<g transform="translate(220,230)" class="fragment">
<text x="0" y="48" style="font-family: courier; font-size:40%">UNION</text>
<text x="0" y="113" style="font-family: courier; font-size:40%">UNION</text>
</g>
</svg>
<p class="fragment">Mimir <i>virtualizes</i> uncertainty</p>
<attribution>(OpenClipArt.org)</attribution>
</section>
</section>
<section>
<section>
<h3>Labeled Nulls</h3>
<p>$Var(\ldots)$ constructs new variables</p>
<ul>
<li class="fragment">$Var('X')$ constructs a new variable $X$</li>
<li class="fragment">$Var('X', 1)$ constructs a new variable $X_{1}$</li>
<li class="fragment">$Var('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$</li>
</ul>
</section>
<section>
<h3>Lazy Evaluation</h3>
<p>Variables can't be evaluated until they are bound.<br/>So, we allow arbitrary expressions to represent data.</p>
<ul>
<li class="fragment">$X$ is a legitimate data value.</li>
<li class="fragment">$X+1$ is a legitimate data value.</li>
<li class="fragment">$1+1$ is a legitimate data value<span class="fragment">, but can be reduced to $2$.</span></li>
</ul>
<p class="fragment">A lazy value without variables is <b>deterministic</b></p>
</section>
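<section data-markdown><script type="text/template">
A minimal sketch of lazy values, assuming a tiny expression tree with only variables, constants, and addition. The class names `Var` and `Add` and the dictionary-based valuation are illustrative assumptions, not Mimir's internal representation.

```python
class Var:
    """A labeled null such as X or X_1 (unbound until a valuation supplies it)."""
    def __init__(self, name, *args):
        self.key = (name,) + args

class Add:
    """Deferred addition of two lazy values."""
    def __init__(self, left, right):
        self.left, self.right = left, right

def variables(expr):
    """Set of variable keys that `expr` still depends on."""
    if isinstance(expr, Var):
        return {expr.key}
    if isinstance(expr, Add):
        return variables(expr.left) | variables(expr.right)
    return set()

def evaluate(expr, valuation):
    """Evaluate `expr` given a dict from variable keys to values."""
    if isinstance(expr, Var):
        return valuation[expr.key]
    if isinstance(expr, Add):
        return evaluate(expr.left, valuation) + evaluate(expr.right, valuation)
    return expr  # plain constant

x_plus_1 = Add(Var("X"), 1)
one_plus_one = Add(1, 1)

print(variables(one_plus_one) == set())   # deterministic: no variables
print(evaluate(one_plus_one, {}))         # reduces without any bindings
print(evaluate(x_plus_1, {("X",): 41}))   # needs X to be bound first
```
</script></section>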
<section>
<p>Mimir SQL allows the $Var()$ operator to be inlined</p>
<pre><code>
SELECT A, VAR('X', B)+2 AS C FROM R;
</code></pre>
<center><div style="width: 600px" class="fragment">
<table style="float: left">
<thead>
<tr><th>A</th><th>B</th></tr>
</thead><tbody>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
<tr><td>5</td><td>6</td></tr>
</tbody>
</table>
<table style="float: right" class="fragment">
<tr><th>A</th><th>C</th></tr>
<tr><td>1</td><td>$X_2+2$</td></tr>
<tr><td>3</td><td>$X_4+2$</td></tr>
<tr><td>5</td><td>$X_6+2$</td></tr>
</table>
</div></center>
<div style="clear: both;">&nbsp;</div>
</section>
<section>
<p>Selects on $Var()$ need to be deferred too...</p>
<pre><code>
SELECT A FROM R WHERE VAR('X', B) > 2;
</code></pre>
<center><div style="width: 600px">
<table style="float: left">
<thead>
<tr><th>A</th><th>B</th></tr>
</thead><tbody>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
<tr><td>5</td><td>6</td></tr>
</tbody>
</table>
<table style="float: right" class="fragment">
<tr><th>A</th><th>$\phi$</th></tr>
<tr><td>1</td><td>$X_2>2$</td></tr>
<tr><td>3</td><td>$X_4>2$</td></tr>
<tr><td>5</td><td>$X_6>2$</td></tr>
</table>
</div></center>
<div style="clear: both;">&nbsp;</div>
<p class="fragment">When evaluating the table, rows where $\phi = \bot$ are dropped.</p>
</section>
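<section data-markdown><script type="text/template">
A minimal sketch of the deferred selection above, assuming each output row carries its condition as a (variable, predicate) pair. The `instantiate` helper and the values in `world` are made up for illustration.

```python
R = [(1, 2), (3, 4), (5, 6)]  # rows of R as (A, B)

# Deferred query result: each row keeps A plus its condition phi.
# phi is stored as (variable key, predicate) instead of a truth value.
c_table = [(a, (("X", b), lambda v: v > 2)) for a, b in R]

def instantiate(c_table, valuation):
    """Bind every variable and keep only rows whose phi evaluates to True."""
    return [a for a, (key, pred) in c_table if pred(valuation[key])]

# One possible world: X_2 = 0, X_4 = 5, X_6 = 9 (made-up values).
world = {("X", 2): 0, ("X", 4): 5, ("X", 6): 9}
print(instantiate(c_table, world))
```

A different valuation yields a different table, which is why the condition has to travel with the row until the variables are bound.
</script></section>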
<section>
<h3>C-Tables</h3>
<ul>
<li>Original Formulation <small>[Imielinski, Lipski 1981]</small></li>
<li class="fragment">PC-Tables <small>[Green, Tannen 2006]</small></li>
<li class="fragment">Systems<ul>
<li>Orchestra <small>[Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]</small></li>
<li>MayBMS <small>[Huang, Antova, Koch, Olteanu 2009]</small></li>
<li>Pip <small>[Kennedy, Koch 2009]</small></li>
<li>Sprout <small>[Fink, Hogue, Olteanu, Rath 2011]</small></li>
</ul></li>
<li class="fragment">Generalized PC-Tables <small>[Kennedy, Koch 2009]</small></li>
</ul>
</section>
</section>
<section>
<section>
<h2>Labeled nulls capture a lens' uncertainty</h2>
</section>
<section>
<pre><code>
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
</code></pre>
<div class="fragment">
<p>is (almost) the same as the query...</p>
<pre><code>
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
</code></pre>
</div>
<small class="fragment">
<table>
<tr><th>ID</th><th>Name</th><th>...</th><th>Department</th></tr>
<tr><td>123</td><td>Apple 6s, White</td><td>...</td><td>Phone</td></tr>
<tr><td>34234</td><td>Dell, Intel 4 core</td><td>...</td><td>Computer</td></tr>
<tr><td>34235</td><td>HP, AMD 2 core</td><td>...</td><td class="fragment">$Prod.Dept_3$</td></tr>
<tr><td>...</td><td>...</td><td>...</td><td>...</td></tr>
</table>
</small>
</section>
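<section data-markdown><script type="text/template">
A minimal sketch of what the lens above does, assuming rows are plain tuples and `None` stands for SQL NULL. The "most common department" model is a deliberately simple stand-in chosen for this example; the fourth row and all function names are made up.

```python
from collections import Counter

# PRODUCTS_RAW: the first three rows follow the slide, the fourth is made up.
PRODUCTS_RAW = [
    (123,   "Apple 6s, White",    "Phone"),
    (34234, "Dell, Intel 4 core", "Computer"),
    (34235, "HP, AMD 2 core",     None),
    (99999, "Acme, Intel 2 core", "Computer"),
]

def domain_repair_view(rows):
    """Mimic the CASE expression: keep known departments, label the missing ones."""
    view = []
    for rowid, (pid, name, dept) in enumerate(rows, start=1):
        if dept is None:
            dept = ("VAR", "PRODUCTS.DEPARTMENT", rowid)  # labeled null
        view.append((pid, name, dept))
    return view

def best_guess(rows):
    """Stand-in model: the most common known department."""
    known = Counter(dept for _, _, dept in rows if dept is not None)
    return known.most_common(1)[0][0]

view = domain_repair_view(PRODUCTS_RAW)
print(view[2][2])                # the labeled null for row 3
print(best_guess(PRODUCTS_RAW))  # the value that would be shown, highlighted
```
</script></section>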
<section>
<pre><code>
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
</code></pre>
<div>
<p>Behind the scenes, a lens also creates a model...</p>
<pre class="fragment"><code>
SELECT * FROM PRODUCTS_RAW;
</code></pre>
</div>
<div class="fragment">
<div style="font-size: 1em; vertical-align: middle;"></div>
<div>
<img src="graphics/weka.png" />
</div>
</div>
<div class="fragment">
<div style="font-size: 1em; vertical-align: middle;"></div>
<div><p>An estimator for <small style="vertical-align: baseline;">$PRODUCTS.DEPARTMENT_{ROWID}$</small></p></div>
</div>
</section>
</section>
<section>
<section>
<h3>... but databases don't support labeled nulls</h3>
</section>
<section>
<h3>Labeled Nulls Percolate Up</h3>
<pre><code>
SELECT A, VAR('X', B)+2 AS C FROM R;
</code></pre>
<div class="fragment">
<p>Mimir dispatches this query to the DB:</p>
<pre><code>
SELECT A, B FROM R;
</code></pre>
</div>
<div class="fragment">
<p>And for each row of the result, evaluates:</p>
<pre><code>
SELECT A, VAR('X', B)+2 AS C FROM RESULT;
</code></pre>
</div>
</section>
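<section data-markdown><script type="text/template">
A minimal sketch of the two-step evaluation above, assuming an in-memory SQLite database as the backend. The tuple encoding of the lazy expression and the `project` helper are illustrative assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE R (A INTEGER, B INTEGER)")
db.executemany("INSERT INTO R VALUES (?, ?)", [(1, 2), (3, 4), (5, 6)])

# Step 1: only the deterministic part of the query reaches the database.
result = db.execute("SELECT A, B FROM R ORDER BY A").fetchall()

# Step 2: the middleware builds the lazy expression VAR('X', B) + 2 per row.
def project(row):
    a, b = row
    return a, ("+", ("VAR", "X", b), 2)

output = [project(row) for row in result]
for a, c in output:
    print(a, c)
```

The database never sees a labeled null; it only supplies the columns that the per-row expression needs.
</script></section>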
<section>
<h3>Generating Explanations</h3>
<p>All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.</p>
<dl>
<dt>Why is the data uncertain?</dt>
<dd>All relevant lenses referenced in <code>VAR('X', B)+2</code>.</dd>
<dt>How uncertain?</dt>
<dd>Estimate by sampling from <code>VAR('X', B)</code>.</dd>
<dt>How do I fix it?</dt>
<dd>Each lens fixes one well-defined type of error.</dd>
</dl>
</section>
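<section data-markdown><script type="text/template">
A minimal sketch of "estimate by sampling", assuming a made-up model in which `VAR('X', b)` is a noisy copy of `b`. The Gaussian noise, the sample count, and the shape of the report are all assumptions for illustration.

```python
import random
import statistics

def sample_x(b, rng):
    """Hypothetical model for VAR('X', b): a noisy copy of b."""
    return rng.gauss(b, 1.0)

def explain(b, n=1000, seed=0):
    """Sample VAR('X', b) + 2 and summarize the spread of the result."""
    rng = random.Random(seed)
    samples = [sample_x(b, rng) + 2 for _ in range(n)]
    return {
        "variables": [("X", b)],             # why: labeled nulls involved
        "mean": statistics.fmean(samples),   # how uncertain: central value
        "stdev": statistics.stdev(samples),  # how uncertain: spread
    }

report = explain(4)
print(report)
```
</script></section>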
<section>
<h3>Lazy evaluation can cause problems</h3>
<pre><code>
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
</code></pre>
<div class="fragment">
<p>Mimir dispatches this query to the DB:</p>
<pre><code>
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;
</code></pre>
</div>
<div class="fragment">
<p>And for each row of the result, evaluates:</p>
<pre><code>
SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;
</code></pre>
</div>
</section>
<section>
<h3>Helper views allow the DB to interpret labeled nulls</h3>
<pre><code>
SELECT R.A, S.C FROM R, S
WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);
</code></pre>
<p class="fragment">... but we lose the ability to <i>explain</i> outputs</p>
</section>
<section>
<h3>Provenance Recovers Explanations</h3>
<pre><code>
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
</code></pre>
<p>Mimir dispatches this query to the DB:</p>
<pre><code>
SELECT R.A, S.C,
R.ROWID AS ID_1, S.ROWID AS ID_2
FROM R, S
WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);
</code></pre>
<div class="fragment">
<p>Then to explain, Mimir dispatches the query:</p>
<pre><code>
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2
FROM R, S
WHERE R.ROWID = ID_1 AND S.ROWID = ID_2;
</code></pre>
</div>
</section>
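<section data-markdown><script type="text/template">
A minimal sketch of the provenance round trip above, assuming SQLite as the backend and a materialized `VARIABLE_X` table of best guesses. The table contents are made up, and the Python variables `id_1`, `id_2` play the role of `ID_1`, `ID_2`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE R (A INTEGER, B INTEGER);
    CREATE TABLE S (B INTEGER, C TEXT);
    -- Best guesses for X_key, materialized so the database can join on them.
    CREATE TABLE VARIABLE_X (KEY INTEGER, VALUE INTEGER);
    INSERT INTO R VALUES (1, 2), (3, 4);
    INSERT INTO S VALUES (20, 'twenty'), (40, 'forty');
    INSERT INTO VARIABLE_X VALUES (2, 20), (4, 40);
""")

# Answer query: the join runs in the database, ROWIDs ride along as provenance.
answers = db.execute("""
    SELECT R.A, S.C, R.ROWID AS ID_1, S.ROWID AS ID_2
    FROM R, S
    WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B)
    ORDER BY R.A
""").fetchall()

# Explain query: fetch the inputs of VAR('X', R.B) = S.B for one output row.
a, c, id_1, id_2 = answers[0]
temp_1, temp_2 = db.execute(
    "SELECT R.B, S.B FROM R, S WHERE R.ROWID = ? AND S.ROWID = ?",
    (id_1, id_2),
).fetchone()
print(answers)
print(f"row ({a}, {c!r}) depends on X_{temp_1} = {temp_2}")
```
</script></section>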
</section>
<section>
<section>
<h3>Performance</h3>
<p>TPC-H Data, but replace 0.1% of FK references with NULL. Ask Mimir to fix.</p>
<p>(a worst case from a performance standpoint)</p>
<ul>
<li><b>Query 1:</b> Table scan. Overhead for a no-op.</li>
<li><b>Query 3:</b> 3-way join on an FK chain.</li>
<li><b>Query 5:</b> 6-way join on an FK tree.</li>
<li><b>Query 9:</b> 6-way join with cycles.</li>
</ul>
</section>
<section>
<img src="graphics/performance-sqlite1g.png" />
<p>Mimir over SQLite in 4 different execution modes.<br/>100% = Zero overhead</p>
</section>
</section>
<section>
<section>
<h3>C-Tables</h3>
<ul>
<li>Original Formulation <small>[Imielinski, Lipski 1981]</small></li>
<li class="fragment">PC-Tables <small>[Green, Tannen 2006]</small></li>
<li class="fragment">Systems<ul>
<li>Orchestra <small>[Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]</small></li>
<li>MayBMS <small>[Huang, Antova, Koch, Olteanu 2009]</small></li>
<li>Pip <small>[Kennedy, Koch 2009]</small>
<li>Sprout <small>[Fink, Hogue, Olteanu, Rath 2011]</small></li>
</ul></li>
<li class="fragment">Generalized PC-Tables <small>[Kennedy, Koch 2009]</small></li>
</ul>
</section>
<section>
<h3>Lenses</h3>
<ul>
<li class="fragment">A VG-RA Expression</li>
<li class="fragment">A 'Model' <span class="fragment"> that defines for each variable...</span><ul>
<li class="fragment">A sampling process</li>
<li class="fragment">A best guess estimator</li>
<li class="fragment">A human-readable description</li>
</ul></li>
</ul>
<p class="fragment"><b>Lenses implement PC-Tables</b></p>
</section>
<section>
<pre><code>
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
</code></pre>
<ul>
<li><code>AS</code> clause defines source data.</li>
<li><code>USING</code> clause requests repairs.</li>
</ul>
</section>
</section>
<section>
<section>
<h2>Selection (Filtering)</h2>
<pre><code>
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
</code></pre>
<p class="fragment">Recall, row-level uncertainty is a boolean formula $\phi$.</p>
<p class="fragment">
For this query, $\phi$ can be as complex as:
<small>$$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$</small></p>
<p class="fragment"><b>Too many variables! Which is the most important?</b></p>
</section>
<section>
<h2>What is important?</h2>
<p class="fragment">Data Cleaning</p>
<h2 class="fragment">Which variables are important?</h2>
<p class="fragment">The ones that keep us from knowing everything</p>
</section>
<section>
<p><small>$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$</small></p>
<div style="font-size: 2em">&darr;</div>
<p>$$A \wedge (B \vee C)$$</p>
</section>
<section>
<h3>Naive Approach</h3>
<p>Consider a game between a database and an impartial oracle.</p>
<ul>
<li>The DB picks a variable $v$ in $\phi$ and pays a cost $c_v$.</li>
<li>The Oracle reveals the truth value of $v$.</li>
<li>The DB updates $\phi$ accordingly and repeats until $\phi$ is deterministic.</li>
</ul>
<p class="fragment"><b>Naive Algorithm: </b> Pick all variables!</p>
<p class="fragment"><b>Less Naive Algorithm: </b> Minimize $E\left[\sum c_v\right]$.</p>
</section>
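<section data-markdown><script type="text/template">
A minimal sketch of the "less naive" strategy for the formula A and (B or C), assuming independent variables with made-up probabilities and unit costs. It searches every order in which variables could be revealed, so its running time grows exponentially with the number of variables.

```python
from itertools import product

VARS = ("A", "B", "C")
PROB = {"A": 0.5, "B": 0.5, "C": 0.5}  # assumed P(v is true), independent
COST = {"A": 1.0, "B": 1.0, "C": 1.0}  # assumed cost c_v of asking about v

def phi(w):
    return w["A"] and (w["B"] or w["C"])

def is_deterministic(known):
    """True if phi has the same value in every completion of `known`."""
    free = [v for v in VARS if v not in known]
    outcomes = {phi({**known, **dict(zip(free, bits))})
                for bits in product((False, True), repeat=len(free))}
    return len(outcomes) == 1

def min_expected_cost(known=None):
    """Exhaustive search over every order in which variables could be revealed."""
    known = known or {}
    if is_deterministic(known):
        return 0.0
    best = float("inf")
    for v in VARS:
        if v in known:
            continue
        expected = COST[v] + sum(
            p * min_expected_cost({**known, v: value})
            for value, p in ((True, PROB[v]), (False, 1 - PROB[v])))
        best = min(best, expected)
    return best

print(min_expected_cost())
```
</script></section>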
<section>
<h2>Exponential Time Bad!</h2>
</section>
</section>
<section>
<section>
<h3>The Value of What We Don't Know</h3>
<p>$$\phi = A \wedge (B \vee C)$$</p>
<ol>
<li class="fragment" data-fragment-index="1">Generate Samples for $A$, $B$, $C$</li>
<li class="fragment" data-fragment-index="2">Estimate $p(\phi)$</li>
<li class="fragment" data-fragment-index="3">Compute $H[\phi] = -p(\phi)\log_2 p(\phi) - (1-p(\phi))\log_2(1-p(\phi))$</li>
</ol>
<p class="fragment" data-fragment-index="4"><b>Entropy is intuitive: </b><br/> $H = 1$ means we know nothing, <br/>$H = 0$ means we know everything.</p>
</section>
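<section data-markdown><script type="text/template">
A minimal sketch of the three steps above, assuming A, B, and C are independent coin flips with made-up probabilities. It uses the standard binary entropy in bits, which is 1 at p = 0.5 and 0 at p = 0 or p = 1.

```python
import math
import random

def phi(a, b, c):
    return a and (b or c)

def estimate_p(prob, n=10000, seed=0):
    """Monte Carlo estimate of p(phi) from independent samples of A, B, C."""
    rng = random.Random(seed)
    hits = sum(phi(*(rng.random() < prob[v] for v in "ABC")) for _ in range(n))
    return hits / n

def entropy(p):
    """Binary entropy in bits; 0 when p is 0 or 1, 1 when p is 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = estimate_p({"A": 0.5, "B": 0.5, "C": 0.5})
print(p, entropy(p))
```
</script></section>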
<section>
<h3>Information Gain</h3>
<p>$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$</p>
<p><b>Information gain of</b> $v$: The reduction in entropy from knowing the truth value of a variable $v$.</p>
</section>
<section>
<h3>Expected Information Gain</h3>
<p>$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$</p>
<p><b>Expected information gain of</b> $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.</p>
</section>
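<section data-markdown><script type="text/template">
A minimal sketch of expected information gain for the formula A and (B or C), assuming independent variables with made-up probabilities. Probabilities are computed exactly by enumeration rather than by sampling to keep the example short.

```python
import math
from itertools import product

VARS = ("A", "B", "C")
PROB = {"A": 0.5, "B": 0.5, "C": 0.5}  # assumed independent P(v is true)

def phi(w):
    return w["A"] and (w["B"] or w["C"])

def p_phi(known):
    """Exact p(phi | known) by enumerating the unknown variables."""
    free = [v for v in VARS if v not in known]
    total = 0.0
    for bits in product((False, True), repeat=len(free)):
        world = {**known, **dict(zip(free, bits))}
        weight = math.prod(PROB[v] if world[v] else 1 - PROB[v] for v in free)
        total += weight * phi(world)
    return total

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_gain(v):
    """p(v) * I[v <- true] + p(not v) * I[v <- false]."""
    h = entropy(p_phi({}))
    gain_true = h - entropy(p_phi({v: True}))
    gain_false = h - entropy(p_phi({v: False}))
    return PROB[v] * gain_true + (1 - PROB[v]) * gain_false

for v in VARS:
    print(v, round(expected_gain(v), 3))
```

Under these assumptions A is worth far more than B or C, since learning that A is false settles the formula outright.
</script></section>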
<section>
<h3>The Cost of Perfect Information</h3>
<p>Combine Information Gain and Cost</p>
<p>$$f(\mathcal I_{A}(\phi), c_A)$$</p>
<p class="fragment"><b>For example: </b>$EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$</p>
<p class="fragment"><b>Greedy Algorithm: </b> Maximize $f(\mathcal I_{A}(\phi), c_A)$ at each step</p>
</section>
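<section data-markdown><script type="text/template">
A minimal sketch of one greedy step using the EG2 score from the slide. It reads the score as a benefit-per-cost ratio and therefore picks the variable with the largest value; that reading, the function names, and all the numbers are assumptions for illustration.

```python
def eg2(gain, cost):
    """EG2 score from the slide: (2^gain - 1) / cost."""
    return (2 ** gain - 1) / cost

def pick_next(gains, costs):
    """One greedy step: the variable with the best gain-per-cost score."""
    return max(gains, key=lambda v: eg2(gains[v], costs[v]))

# Made-up expected information gains (bits) and costs for A, B, C.
gains = {"A": 0.55, "B": 0.05, "C": 0.05}
costs = {"A": 4.0, "B": 1.0, "C": 0.25}

print(pick_next(gains, costs))
```

A cheap variable with a small gain can outrank an informative but expensive one, which is the trade-off the score is meant to capture.
</script></section>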
<section>
<h3>Experimental Data</h3>
<ul>
<li>Start with a large dataset.</li>
<li>Delete random fields (~50%).</li>
</ul>
</section>
<section>
<h3>Experimental Queries</h3>
<p>Simulate an analyst trying to manually explore correlations.</p>
<ul>
<li>Train a tree-classifier on the base data.</li>
<li>Convert the decision tree to a query for all rows where the tree predicts a specific value.</li>
</ul>
</section>
<section>
<h3>Cost vs Entropy: Credit Data</h3>
<img src="graphics/credit_entropy.png" height=400 />
<p><small>
<b>EG2:</b> Greedy Cost/Value Ordering<br/>
<b>NMETC:</b> Naive Minimal Expected Total Cost<br/>
<b>Random:</b> Completely Random Order
</small></p>
</section>
<section>
<h3>Cost vs Entropy: Product Data</h3>
<img src="graphics/product_entropy.png" height=400 />
<p><small>
<b>EG2:</b> Greedy Cost/Value Ordering<br/>
<b>NMETC:</b> Naive Minimal Expected Total Cost<br/>
<b>Random:</b> Completely Random Order
</small></p>
</section>
</section>
</div></div>
<script src="../../reveal.js-3.1.0/lib/js/head.min.js"></script>
<script src="../../reveal.js-3.1.0/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../../reveal.js-3.1.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../../reveal.js-3.1.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../../reveal.js-3.1.0/js/MathJax.js'
},
{ src: '../../reveal.js-3.1.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../../reveal.js-3.1.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../../reveal.js-3.1.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../../reveal.js-3.1.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../../reveal.js-3.1.0/plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>