<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Embracing Uncertainty</title>
<meta name="description" content="Mimir">
<meta name="author" content="Oliver Kennedy">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../../reveal.js-3.1.0/css/reveal.css">
<link rel="stylesheet" href="ubodin.css" id="theme">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../../reveal.js-3.1.0/lib/css/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../../reveal.js-3.1.0/css/print/pdf.css' : '../../reveal.js-3.1.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../../reveal.js-3.1.0/lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="header">
<!-- Any Talk-Specific Header Content Goes Here -->
Embracing Uncertainty
</div>
<div class="footer">
<!-- Any Talk-Specific Footer Content Goes Here -->
<div style="float: left; margin-top: 15px; ">
Exploring <u><b>O</b></u>nline <u><b>D</b></u>ata <u><b>In</b></u>teractions
</div>
<img src="graphics/FullText-white.png" height="40" style="float: right;"/>
</div>
<div class="slides">
<section>
<h4>Embracing uncertainty with</h4>
<img src="graphics/mimir_logo_final.png" />
</section>
<section>
<h4>Joint work with:</h4>
<p>
<i>Ying Yang, Poonam Kumari, William Spoth, Aaron Huber,<br/>
Lisa Lu, Jacob Powathikunnil Verghese</i>
</p><p>
Niccolo Meneghetti, Arindam Nandi (both now HPE/Vertica),<br/>
Vinayak Karuppasamy (now Bloomberg),
</p><p>
Ronny Fehling (Airbus), <br/>
Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), <br/>
Boris Glavic (IIT), Juliana Freire (NYU)
</p>
</section>
<section>
<section>
<h3>A Big Data Fairy Tale</h3>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" />
<h4>Meet Alice</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" />
<img src="graphics/littlestorefront-800px.png" height="300" />
<h4>Alice has a Store</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/littlestorefront-800px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;"></span>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;" />
<h4>Alice's store collects sales data</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">+</span>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">=</span>
<img src="graphics/saco-800px.png" height="300" style=" vertical-align: middle;" />
<h4>Alice wants to use her sales data to run a promotion</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;"></span>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<h4>So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">+&nbsp;?</span>
<h4>... asks her question ...</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">+&nbsp;?&nbsp;</span>
<img src="graphics/crystalball-800px.png" height="300" style=" vertical-align: middle;" />
<h4>... and basks in the limitless possibilities of big data.</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
</section>
<section>
<section>
<h2>Why is this a fairy tale?</h2>
</section>
<section>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;"></span>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<h4>It's never this easy...</h4>
</section>
</section>
<section>
<section>
<h2>CSV Import</h2>
<h4>Run a <code>SELECT</code> on a raw CSV File</h4>
<ul class="fragment">
<li>File may not have column headers</li>
<li>CSV does not provide "types"</li>
<li>Lines may be missing fields</li>
<li>Fields may be mistyped (typo, missing comma)</li>
<li>Comment text can be inlined into the file</li>
</ul>
<p class="fragment">
<b>State of the art</b>: External Table Defn <span class="fragment">+ "Manually" edit CSV</span>
</p>
</section>
<section>
<h2>Merge Two Datasets</h2>
<h4><code>UNION</code> two data sources</h4>
<ul class="fragment">
<li>Schema matching</li>
<li>Deduplication</li>
<li>Format alignment (GIS coordinates, $ vs €)</li>
<li>Precision alignment (State vs County)</li>
</ul>
<p class="fragment">
<b>State of the art</b>: Manually map schema
</p>
</section>
<section>
<h2>JSON Shredding</h2>
<h4>Run a <code>SELECT</code> on JSON or a Doc Store</h4>
<ul class="fragment">
<li>Separating fields and record sets:<br/>(e.g., <code>{ A: "Bob", B: "Alice" }</code>)</li>
<li>Missing fields (Records with no 'address')</li>
<li>Type alignment (Records with 'address' as an array)</li>
<li>Schema matching$^2$</li>
</ul>
<p class="fragment">
<b>State of the art</b>: DataGuide, Wrangler, etc...
</p>
</section>
</section>
<section>
<section>
<h2>Data Cleaning is Hard!</h2>
</section>
<section>
<h3>State of the Art</h3>
<img src="graphics/BI-Analyst.jpg" height="400" />
<attribution>(skilledup.com)</attribution>
<p>Alice spends weeks cleaning her data before using it.</p>
</section>
<section>
<h3>Newer State of the Art</h3>
<img src="graphics/iu.jpeg" height=500 />
<attribution>(azure.microsoft.com)</attribution>
</section>
<section>
<img src="graphics/data-lake-to-data-swamp.jpg" height=500 />
<attribution>(timoelliott.com)</attribution>
</section>
</section>
<section>
<section>
<h2>Structure is hard!</h2>
<ul>
<li class="fragment">Structured models (RelDBs) force curation during loading.
<ul><li class="fragment"><b>Problem:</b> All curation costs are upfront.</li></ul>
</li>
<li class="fragment">Unstructured models (NoSQL) force curation into queries.
<ul><li class="fragment"><b>Problem:</b> Complexity/redundancy blowup in queries.</li></ul>
</li>
</ul>
<p class="fragment" style="margin-top: 50px;">Add structure, curation effort <b>On-Demand</b></p>
</section>
<section>
<h3>But... you still need some sort of structure?!?</h3>
<h3 class="fragment">Let the database make a guess!</h3>
</section>
<section>
<h3>
In the name of Codd,<br/><span class="fragment grow highlight-current-blue">thou shalt not give the user a wrong answer.</span>
</h3>
<h4 class="fragment">
... but what if we did?
</h4>
<h4 class="fragment">
What would it take for that to be ok?
</h4>
</section>
</section>
<section>
<section>
<h2>Industry says...</h2>
</section>
<section>
<img src="graphics/maybe-screen.png" height="500px" />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="graphics/maybe-detail.png" height="500px" class="fragment" /><br/>
<p class="fragment">My phone is guessing, but is letting me know that it did</p>
</section>
<section>
<img src="graphics/Calendar_Base.png" height="500px" />
</section>
<section>
<img src="graphics/Calendar_Explain.png" height="500px" />
<p>Easy interactions to <i>accept</i>, <i>reject</i>, or <i>explain</i> uncertainty</p>
</section>
<section>
<img src="graphics/Bing-Translate.png" height="500px" />
<p class="fragment">Good Explanations, Alternatives, and Feedback Vectors</p>
</section>
<section>
<h2>Communication</h2>
<ul>
<li>What data is uncertain?</li>
<li>Why is my data uncertain?</li>
<li>How bad is it?</li>
<li>What can I do about it?</li>
</ul>
</section>
<section>
<h2>What if a database did the same?</h2>
</section>
<section>
<ul style="width:35%; font-size: 24pt; margin-top: 50px;">
<li class="fragment"><b>A:</b> Standard SQL.</li>
<li class="fragment"><b>B:</b> Annotated Output.</li>
<li class="fragment"><b>C:</b> Lens Diagram.</li>
<li class="fragment"><b>D:</b> Result Explanations.</li>
</ul>
<img src="graphics/UIExample.png" style="width:60%; float:right"/>
</section>
</section>
<section>
<section>
<h3>Lenses</h3>
<p class="fragment">Here's a problem with my data. <span class="fragment">Fix it.</span></p>
<ul>
<li class="fragment">What type is this column? (majority vote)</li>
<li class="fragment">How do the columns of these relations line up? (pick your favorite schema matching paper)</li>
<li class="fragment">How do I query heterogeneous JSON objects? (see above)</li>
<li class="fragment">What should these missing values be? (learning-based interpolation)</li>
</ul>
</section>
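<section data-markdown>
<script type="text/template">
### Sketch: Majority-Vote Type Guessing

A minimal sketch of the "majority vote" idea for guessing a column's type. The names `guess_type` and `vote_column_type` and the three candidate types are assumptions made for illustration; this is not Mimir's implementation.

```python
from collections import Counter

def guess_type(value):
    """Guess the narrowest type that a single CSV field parses as."""
    for name, parser in (("int", int), ("float", float)):
        try:
            parser(value)
            return name
        except ValueError:
            pass
    return "string"

def vote_column_type(values):
    """Return the most common guessed type and the fraction of fields that agree."""
    votes = Counter(guess_type(v) for v in values)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(values)

# Hypothetical column with one field that does not parse as a number.
column = ["12", "7", "n/a", "42"]
print(vote_column_type(column))
```

The agreeing fraction is one possible signal of how uncertain the guess is.
</script>
</section>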
<section>
<svg width=500 height=350>
<g transform="scale(1.2)">
<text x="0" y="45">View:</text>
<image xlink:href="graphics/db.svg" x="130" y="10" height="50px" width="50px"/>
<text x="225" y="20" style="font-family: courier; font-size: 60%">SELECT</text>
<polygon
points="190,35 340,35 325,30 325,40 340,35"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="350" y="10" height="50px" width="50px"/>
</g>
<g transform="translate(0,150) scale(1.2)" class="fragment">
<text x="0" y="45">Lens:</text>
<image xlink:href="graphics/db.svg" x="130" y="10" height="50px" width="50px"/>
<text x="225" y="20" style="font-family: courier; font-size: 60%">SELECT</text>
<polygon
points="190,35 340,35 325,30 325,40 340,35"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="350" y="10" height="50px" width="50px"/>
<g class="fragment">
<text x="212" y="20" style="font-family: courier; font-size: 60%">[&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;]</text>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="355" y="15" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="360" y="20" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="365" y="25" height="50px" width="50px"/>
<g class="fragment">
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="350" y="110" height="60px" width="60px"/>
<polygon
points="380,80 380,105 385,90 375,90 380,105"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<text x="220" y="142" style="font-size: 60%">(best guess)</text>
</g>
</g>
</g>
</svg>
<p class="fragment">Lenses introduce <i>uncertainty</i></p>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<h2>The User's View</h2>
<pre><code>
SELECT NAME, DEPARTMENT FROM PRODUCTS;
</code></pre>
<table class="fragment" data-fragment-index="1">
<tr><th>Name</th><th>Department</th></tr>
<tr><td>Apple 6s, White</td><td>Phone</td></tr>
<tr><td>Dell, Intel 4 core</td><td>Computer</td></tr>
<tr><td>HP, AMD 2 core</td><td class="fragment highlight-red" data-fragment-index="2">Computer</td></tr>
<tr><td>...</td><td>...</td></tr>
</table>
<p class="fragment" data-fragment-index="2"><b>Simple UI:</b> Highlight values that are based on guesses.</p>
</section>
<section>
<pre><code>
SELECT NAME, DEPARTMENT FROM PRODUCTS;
</code></pre>
<small>
<table>
<tr><th>Name</th><th>Department</th></tr>
<tr><td>Apple 6s, White</td><td>Phone</td></tr>
<tr><td>Dell, Intel 4 core</td><td>Computer</td></tr>
<tr><td>HP, AMD 2 core</td><td style="color: red;">Computer</td></tr>
<tr><td>...</td><td>...</td></tr>
</table>
</small>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xl="http://www.w3.org/1999/xlink" version="1.1" viewBox="241 277 265 125" width="265pt" height="125pt" xmlns:dc="http://purl.org/dc/elements/1.1/" class="fragment" data-fragment-index="1">
<metadata> Produced by OmniGraffle 6.2.5 <dc:date>2015-09-20 14:45:55 +0000</dc:date></metadata>
<defs><font-face font-family="Helvetica Neue" font-size="16" panose-1="2 0 8 3 0 0 0 9 0 4" units-per-em="1000" underline-position="-100" underline-thickness="50" slope="0" x-height="517" cap-height="714" ascent="975.0061" descent="-216.99524" font-weight="bold"><font-face-src><font-face-name name="HelveticaNeue-Bold"/></font-face-src></font-face><font-face font-family="Helvetica Neue" font-size="16" panose-1="2 0 5 3 0 0 0 2 0 4" units-per-em="1000" underline-position="-100" underline-thickness="50" slope="0" x-height="517" cap-height="714" ascent="951.99585" descent="-212.99744" font-weight="500"><font-face-src><font-face-name name="HelveticaNeue"/></font-face-src></font-face></defs>
<g stroke="none" stroke-opacity="1" stroke-dasharray="none" fill="none" fill-opacity="1">
<title>Canvas 1</title>
<g>
<title>Layer 1</title>
<path d="M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" fill="white"/>
<path d="M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" stroke="black" stroke-linecap="round" stroke-linejoin="round" stroke-width="1"/>
<text transform="translate(293 293)" fill="black"><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="0" y="16" textLength="16.896" class="fragment" data-fragment-index="2">Pr</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="16.608" y="16" textLength="69.28" class="fragment" data-fragment-index="2">obability:</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="85.888" y="16" textLength="38.24" class="fragment" data-fragment-index="2"> 95%</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="0" y="53" textLength="62.224" class="fragment" data-fragment-index="3">Reason:</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="62.224" y="53" textLength="144.912" class="fragment" data-fragment-index="3"> Because I guessed </tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="0" y="71" textLength="206.592" class="fragment" data-fragment-index="3">Computer for Department </tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="0" y="89" textLength="196.16" class="fragment" data-fragment-index="3">on Row 3 of PRODUCTS</tspan></text>
</g>
</g>
</svg>
<p class="fragment" data-fragment-index="1">Allow users to <code>EXPLAIN</code> uncertain outputs</p>
<p class="fragment" data-fragment-index="3">Explanations include reasons given in English</p>
</section>
<section>
<div style="padding: 30px;">
<p>$PRODUCTS.DEPARTMENT_{3}$</p>
<div style="font-size: 2em"></div>
<p>"I guessed 'Computer' for 'Department' on Row '3'"</p>
</div>
</section>
<section>
<h3>Explanations</h3>
<ol>
<li>Mark <i>uncertain</i> data and results.</li>
<li>Upon request, provide more detail:
<ul style="font-size:80%; width: 600px">
<li>Why is my data uncertain? <span style="float:right; font-size:80%; margin-top: 5px">(provenance)</span></li>
<li>How bad is it? <span style="float:right; font-size:80%; margin-top: 5px">(confidence, entropy, bounds)</span></li>
<li>What are other possible answers? <span style="float:right; font-size:80%; margin-top: 5px">(samples)</span></li>
<li>What can I do to fix it? <span style="float:right; font-size:80%; margin-top: 5px">(repairs)</span></li>
</ul></li>
</ol>
</section>
</section>
<section>
<img src="https://odin.cse.buffalo.edu/assets/people/oliver.jpg" height="300px">
<p><b>Email:</b> okennedy@buffalo.edu</p>
<p><b>Office:</b> Davis 338H</p>
</section>
<section>
<h1>Backup Slides</h1>
</section>
<section>
<section>
<h2>Mimir is a DB <u>Overlay</u></h2>
</section>
<section>
<svg width="500px" height="400px">
<g>
<g>
<image xlink:href="graphics/db.svg" x="10" y="5" height="50px" width="50px"/>
<text x="0" y="80" style="font-size:50%">(Any DB)</text>
</g>
<polygon
points="0,0 120,0 105,-5 105,5 120,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<g transform="translate(220,0)">
<image xlink:href="graphics/primary-queries.svg" x="0" y="5" height="50px" width="50px"/>
<text x="0" y="80" style="font-size:50%">(Lens)</text>
</g>
<polygon
points="0,0 100,0 85,-5 85,5 100,0"
transform="translate(290,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<g transform="translate(400,0)">
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="0" y="10" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="5" y="15" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="10" y="20" height="50px" width="50px"/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="15" y="25" height="50px" width="50px"/>
</g>
</g>
<g class="fragment">
<polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(200,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
<polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(245,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
<polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(290,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
<g transform="translate(0,200)">
<g>
<image xlink:href="graphics/db.svg" x="10" y="5" height="50px" width="50px"/>
<text x="0" y="80" style="font-size:50%">(Any DB)</text>
</g>
<polygon
points="0,0 120,0 105,-5 105,5 120,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<polygon
points="0,0 120,65 105,50 105,62 120,65"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<polygon
points="0,0 120,130 105,105 105,122 120,130"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<g transform="translate(210,0)">
<text x="0" y="45" style="font-size:50%; font-family: courier">SELECT</text>
<polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="200" y="15" height="50px" width="50px"/>
</g>
<g transform="translate(210,65)">
<text x="0" y="45" style="font-size:50%; font-family: courier">SELECT</text>
<polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="200" y="15" height="50px" width="50px"/>
</g>
<g transform="translate(210,130)">
<text x="0" y="45" style="font-size:50%; font-family: courier">SELECT</text>
<polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<image xlink:href="graphics/jean-victor-balin-icon-table.svg" x="200" y="15" height="50px" width="50px"/>
</g>
</g>
</g>
<g transform="translate(220,230)" class="fragment">
<text x="0" y="48" style="font-family: courier; font-size:40%">UNION</text>
<text x="0" y="113" style="font-family: courier; font-size:40%">UNION</text>
</g>
</svg>
<p class="fragment">Mimir <i>virtualizes</i> uncertainty</p>
<attribution>(OpenClipArt.org)</attribution>
</section>
</section>
<section>
<section>
<h3>Labeled Nulls</h3>
<p>$Var(\ldots)$ constructs new variables</p>
<ul>
<li class="fragment">$Var('X')$ constructs a new variable $X$</li>
<li class="fragment">$Var('X', 1)$ constructs a new variable $X_{1}$</li>
<li class="fragment">$Var('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$</li>
</ul>
</section>
<section>
<h3>Lazy Evaluation</h3>
<p>Variables can't be evaluated until they are bound.<br/>So, we allow arbitrary expressions to represent data.</p>
<ul>
<li class="fragment">$X$ is a legitimate data value.</li>
<li class="fragment">$X+1$ is a legitimate data value.</li>
<li class="fragment">$1+1$ is a legitimate data value<span class="fragment">, but can be reduced to $2$.</span></li>
</ul>
<p class="fragment">A lazy value without variables is <b>deterministic</b></p>
</section>
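<section data-markdown>
<script type="text/template">
### Sketch: Lazy Values

One way to picture lazy values is as small expression trees that are reduced only as far as the known constants allow. The classes `Var` and `Add` and the helper `simplify` are hypothetical names for this sketch, not part of Mimir.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    """A labeled null such as X or X_1 (hypothetical representation)."""
    name: str
    index: object = None

@dataclass(frozen=True)
class Add:
    """A deferred addition of two lazy values."""
    left: object
    right: object

def simplify(expr):
    """Reduce an expression as far as possible without binding variables."""
    if isinstance(expr, Add):
        left, right = simplify(expr.left), simplify(expr.right)
        if isinstance(left, int) and isinstance(right, int):
            return left + right          # no variables: deterministic
        return Add(left, right)          # still depends on a variable
    return expr

def is_deterministic(expr):
    """A lazy value is deterministic when it simplifies to a constant."""
    return isinstance(simplify(expr), int)

print(simplify(Add(1, 1)))               # reduces to the constant 2
print(simplify(Add(Var("X"), 1)))        # stays symbolic
```
</script>
</section>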
<section>
<p>Mimir SQL allows the $Var()$ operator to be inlined</p>
<pre><code>
SELECT A, VAR('X', B)+2 AS C FROM R;
</code></pre>
<center><div style="width: 600px" class="fragment">
<table style="float: left">
<thead>
<tr><th>A</th><th>B</th></tr>
</thead><tbody>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
<tr><td>5</td><td>6</td></tr>
</tbody>
</table>
<table style="float: right" class="fragment">
<tr><th>A</th><th>C</th></tr>
<tr><td>1</td><td>$X_2+2$</td></tr>
<tr><td>3</td><td>$X_4+2$</td></tr>
<tr><td>5</td><td>$X_6+2$</td></tr>
</table>
</div></center>
<div style="clear: both;">&nbsp;</div>
</section>
<section>
<p>Selects on $Var()$ need to be deferred too...</p>
<pre><code>
SELECT A FROM R WHERE VAR('X', B) > 2;
</code></pre>
<center><div style="width: 600px">
<table style="float: left">
<thead>
<tr><th>A</th><th>B</th></tr>
</thead><tbody>
<tr><td>1</td><td>2</td></tr>
<tr><td>3</td><td>4</td></tr>
<tr><td>5</td><td>6</td></tr>
</tbody>
</table>
<table style="float: right" class="fragment">
<tr><th>A</th><th>$\phi$</th></tr>
<tr><td>1</td><td>$X_2>2$</td></tr>
<tr><td>3</td><td>$X_4>2$</td></tr>
<tr><td>5</td><td>$X_6>2$</td></tr>
</table>
</div></center>
<div style="clear: both;">&nbsp;</div>
<p class="fragment">When evaluating the table, rows where $\phi = \bot$ are dropped.</p>
</section>
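<section data-markdown>
<script type="text/template">
### Sketch: Deferred Selection

A rough sketch of the deferred selection above. Each output row carries a condition that stands for "X_B > 2", and rows are dropped only once the variables are bound. The function names and the example binding are assumptions for illustration.

```python
# Table R from the slide, as (A, B) pairs.
R = [(1, 2), (3, 4), (5, 6)]

def deferred_select(rows):
    """Pair each A with a condition phi that stands for X_B > 2."""
    return [(a, ("X", b)) for a, b in rows]

def evaluate(ctable, binding):
    """Bind every variable and keep only rows whose phi is true."""
    return [a for a, var in ctable if binding[var] > 2]

ctable = deferred_select(R)
# One hypothetical assignment of values to X_2, X_4, X_6.
binding = {("X", 2): 1, ("X", 4): 3, ("X", 6): 10}
print(evaluate(ctable, binding))
```

Under this binding the row with A = 1 is dropped, because its condition evaluates to false.
</script>
</section>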
<section>
<h3>C-Tables</h3>
<ul>
<li>Original Formulation <small>[Imielinski, Lipski 1981]</small></li>
<li class="fragment">PC-Tables <small>[Green, Tannen 2006]</small></li>
<li class="fragment">Systems<ul>
<li>Orchestra <small>[Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]</small></li>
<li>MayBMS <small>[Huang, Antova, Koch, Olteanu 2009]</small></li>
<li>Pip <small>[Kennedy, Koch 2009]</small></li>
<li>Sprout <small>[Fink, Hogue, Olteanu, Rath 2011]</small></li>
</ul></li>
<li class="fragment">Generalized PC-Tables <small>[Kennedy, Koch 2009]</small></li>
</ul>
</section>
</section>
<section>
<section>
<h2>Labeled nulls capture a lens' uncertainty</h2>
</section>
<section>
<pre><code>
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
</code></pre>
<div class="fragment">
<p>is (almost) the same as the query...</p>
<pre><code>
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
</code></pre>
</div>
<small class="fragment">
<table>
<tr><th>ID</th><th>Name</th><th>...</th><th>Department</th></tr>
<tr><td>123</td><td>Apple 6s, White</td><td>...</td><td>Phone</td></tr>
<tr><td>34234</td><td>Dell, Intel 4 core</td><td>...</td><td>Computer</td></tr>
<tr><td>34235</td><td>HP, AMD 2 core</td><td>...</td><td class="fragment">$Prod.Dept_3$</td></tr>
<tr><td>...</td><td>...</td><td>...</td><td>...</td></tr>
</table>
</small>
</section>
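<section data-markdown>
<script type="text/template">
### Sketch: Domain Repair as a View

The `CASE` expression above can be mimicked in a few lines: known values pass through, and each missing value becomes a labeled null keyed by its row. The tuple encoding of a labeled null, the 1-based row numbering, and the function name `domain_repair` are assumptions of this sketch.

```python
# Rows of a hypothetical PRODUCTS_RAW table; None plays the role of SQL NULL.
PRODUCTS_RAW = [
    {"ID": 123, "NAME": "Apple 6s, White", "DEPARTMENT": "Phone"},
    {"ID": 34234, "NAME": "Dell, Intel 4 core", "DEPARTMENT": "Computer"},
    {"ID": 34235, "NAME": "HP, AMD 2 core", "DEPARTMENT": None},
]

def domain_repair(rows, column, lens_name):
    """Mimic the CASE expression: keep known values, label the missing ones."""
    repaired = []
    for rowid, row in enumerate(rows, start=1):
        row = dict(row)  # copy so the raw data is left untouched
        if row[column] is None:
            row[column] = ("VAR", f"{lens_name}.{column}", rowid)
        repaired.append(row)
    return repaired

PRODUCTS = domain_repair(PRODUCTS_RAW, "DEPARTMENT", "PRODUCTS")
print(PRODUCTS[2]["DEPARTMENT"])
```
</script>
</section>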
<section>
<pre><code>
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
</code></pre>
<div>
<p>Behind the scenes, a lens also creates a model...</p>
<pre class="fragment"><code>
SELECT * FROM PRODUCTS_RAW;
</code></pre>
</div>
<div class="fragment">
<div style="font-size: 1em; vertical-align: middle;"></div>
<div>
<img src="graphics/weka.png" />
</div>
</div>
<div class="fragment">
<div style="font-size: 1em; vertical-align: middle;"></div>
<div><p>An estimator for <small style="vertical-align: baseline;">$PRODUCTS.DEPARTMENT_{ROWID}$</small></p></div>
</div>
</section>
</section>
<section>
<section>
<h3>... but databases don't support labeled nulls</h3>
</section>
<section>
<h3>Labeled Nulls Percolate Up</h3>
<pre><code>
SELECT A, VAR('X', B)+2 AS C FROM R;
</code></pre>
<div class="fragment">
<p>Mimir dispatches this query to the DB:</p>
<pre><code>
SELECT A, B FROM R;
</code></pre>
</div>
<div class="fragment">
<p>And for each row of the result, evaluates:</p>
<pre><code>
SELECT A, VAR('X', B)+2 AS C FROM RESULT;
</code></pre>
</div>
</section>
<section>
<h3>Generating Explanations</h3>
<p>All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.</p>
<dl>
<dt>Why is the data uncertain?</dt>
<dd>All relevant lenses referenced in <code>VAR('X', B)+2</code>.</dd>
<dt>How uncertain?</dt>
<dd>Estimate by sampling from <code>VAR('X', B)</code>.</dd>
<dt>How do I fix it?</dt>
<dd>Each lens fixes one well-defined type of error.</dd>
</dl>
</section>
<section>
<h3>Lazy evaluation can cause problems</h3>
<pre><code>
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
</code></pre>
<div class="fragment">
<p>Mimir dispatches this query to the DB:</p>
<pre><code>
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;
</code></pre>
</div>
<div class="fragment">
<p>And for each row of the result, evaluates:</p>
<pre><code>
SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;
</code></pre>
</div>
</section>
<section>
<p>Helper views allow the DB to interpret labeled nulls</p>
<pre><code>
SELECT R.A, S.C FROM R, S
WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);
</code></pre>
<p class="fragment">... but we lose the ability to <i>explain</i> outputs</p>
</section>
<section>
<h3>Provenance Recovers Explanations</h3>
<pre><code>
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
</code></pre>
<p>Mimir dispatches this query to the DB:</p>
<pre><code>
SELECT R.A, S.C,
R.ROWID AS ID_1, S.ROWID AS ID_2
FROM R, S
WHERE S.B = (SELECT VALUE FROM VARIABLE_X WHERE KEY = R.B);
</code></pre>
<div class="fragment">
<p>Then to explain, Mimir dispatches the query:</p>
<pre><code>
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2
FROM R, S
WHERE R.ROWID = ID_1 AND S.ROWID = ID_2;
</code></pre>
</div>
</section>
</section>
<section>
<section>
<h3>Performance</h3>
<p>TPC-H Data, but replace 0.1% of FK references with NULL. Ask Mimir to fix.</p>
<p>(a worst case from a performance standpoint)</p>
<ul>
<li><b>Query 1:</b> Table scan. Overhead for a no-op.</li>
<li><b>Query 3:</b> 3-way join on an FK chain.</li>
<li><b>Query 5:</b> 6-way join on an FK tree.</li>
<li><b>Query 9:</b> 6-way join with cycles.</li>
</ul>
</section>
<section>
<img src="graphics/performance-sqlite1g.png" />
<p>Mimir over SQLite in 4 different execution modes.<br/>100% = Zero overhead</p>
</section>
</section>
<section>
<section>
<h3>C-Tables</h3>
<ul>
<li>Original Formulation <small>[Imielinski, Lipski 1981]</small></li>
<li class="fragment">PC-Tables <small>[Green, Tannen 2006]</small></li>
<li class="fragment">Systems<ul>
<li>Orchestra <small>[Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]</small></li>
<li>MayBMS <small>[Huang, Antova, Koch, Olteanu 2009]</small></li>
<li>Pip <small>[Kennedy, Koch 2009]</small></li>
<li>Sprout <small>[Fink, Hogue, Olteanu, Rath 2011]</small></li>
</ul></li>
<li class="fragment">Generalized PC-Tables <small>[Kennedy, Koch 2009]</small></li>
</ul>
</section>
<section>
<h3>Lenses</h3>
<ul>
<li class="fragment">A VG-RA Expression</li>
<li class="fragment">A 'Model' <span class="fragment"> that defines for each variable...</span><ul>
<li class="fragment">A sampling process</li>
<li class="fragment">A best guess estimator</li>
<li class="fragment">A human-readable description</li>
</ul></li>
</ul>
<p class="fragment"><b>Lenses implement PC-Tables</b></p>
</section>
<section>
<pre><code>
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
</code></pre>
<ul>
<li><code>AS</code> clause defines source data.</li>
<li><code>USING</code> clause requests repairs.</li>
</ul>
</section>
</section>
<section>
<section>
<h2>Selection (Filtering)</h2>
<pre><code>
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
</code></pre>
<p class="fragment">Recall, row-level uncertainty is a boolean formula $\phi$.</p>
<p class="fragment">
For this query, $\phi$ can be as complex as:
<small>$$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$</small></p>
<p class="fragment"><b>Too many variables! Which is the most important?</b></p>
</section>
<section>
<h2>What is important?</h2>
<p class="fragment">Data Cleaning</p>
<h2 class="fragment">Which variables are important?</h2>
<p class="fragment">The ones that keep us from knowing everything</p>
</section>
<section>
<p><small>$$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$</small></p>
<div style="font-size: 2em"></div>
<p>$$A \wedge (B \vee C)$$</p>
</section>
<section>
<h3>Naive Approach</h3>
<p>Consider a game between a database and an impartial oracle.</p>
<ul>
<li>The DB picks a variable $v$ in $\phi$ and pays a cost $c_v$.</li>
<li>The Oracle reveals the truth value of $v$.</li>
<li>The DB updates $\phi$ accordingly and repeats until $\phi$ is deterministic.</li>
</ul>
<p class="fragment"><b>Naive Algorithm: </b> Pick all variables!</p>
<p class="fragment"><b>Less Naive Algorithm: </b> Minimize $E\left[\sum c_v\right]$.</p>
</section>
<section>
<h2>Exponential Time Bad!</h2>
</section>
</section>
<section>
<section>
<h3>The Value of What We Don't Know</h3>
<p>$$\phi = A \wedge (B \vee C)$$</p>
<ol>
<li class="fragment" data-fragment-index="1">Generate Samples for $A$, $B$, $C$</li>
<li class="fragment" data-fragment-index="2">Estimate $p(\phi)$</li>
<li class="fragment" data-fragment-index="3">Compute $H[\phi] = -p(\phi)\log_2 p(\phi) - (1-p(\phi))\log_2\left(1-p(\phi)\right)$</li>
</ol>
<p class="fragment" data-fragment-index="4"><b>Entropy is intuitive: </b><br/> $H = 1$ means we know nothing, <br/>$H = 0$ means we know everything.</p>
</section>
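<section data-markdown>
<script type="text/template">
### Sketch: Sampling and Entropy

The three steps can be sketched as follows. The sketch assumes that A, B, and C are independent with known marginal probabilities, and it uses the standard binary entropy in bits, which is 1 at p = 0.5 and 0 at p = 0 or p = 1. The function names are hypothetical.

```python
import math
import random

def phi(a, b, c):
    """The running example: A and (B or C)."""
    return a and (b or c)

def estimate_p(probs, samples=10000, seed=0):
    """Estimate p(phi) by sampling A, B, C as independent coin flips."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        a, b, c = (rng.random() < p for p in probs)
        hits += phi(a, b, c)
    return hits / samples

def binary_entropy(p):
    """Entropy in bits of a true/false outcome with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = estimate_p((0.5, 0.5, 0.5))   # assumed marginals; the exact value is 0.375
print(p, binary_entropy(p))
```
</script>
</section>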
<section>
<h3>Information Gain</h3>
<p>$$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$</p>
<p><b>Information gain of</b> $v$: The reduction in entropy from knowing the truth value of a variable $v$.</p>
</section>
<section>
<h3>Expected Information Gain</h3>
<p>$$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$</p>
<p><b>Expected information gain of</b> $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.</p>
</section>
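<section data-markdown>
<script type="text/template">
### Sketch: Expected Information Gain

A small worked example of the definition above, computed exactly by enumeration instead of sampling. It assumes independent variables with marginal probability 0.5 each; `prob_phi` and `expected_gain` are hypothetical helper names, and `math.prod` needs Python 3.8 or newer.

```python
import math
from itertools import product

def phi(v):
    """The running example: A and (B or C)."""
    return v["A"] and (v["B"] or v["C"])

def prob_phi(probs, fixed):
    """Exact p(phi) with some variables fixed and the rest independent."""
    free = [n for n in probs if n not in fixed]
    total = 0.0
    for bits in product([True, False], repeat=len(free)):
        v = dict(fixed, **dict(zip(free, bits)))
        weight = math.prod(probs[n] if v[n] else 1 - probs[n] for n in free)
        total += weight * phi(v)
    return total

def entropy(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_gain(probs, name):
    """Probability-weighted entropy reduction from learning one variable."""
    before = entropy(prob_phi(probs, {}))
    gain_true = before - entropy(prob_phi(probs, {name: True}))
    gain_false = before - entropy(prob_phi(probs, {name: False}))
    return probs[name] * gain_true + (1 - probs[name]) * gain_false

probs = {"A": 0.5, "B": 0.5, "C": 0.5}   # assumed independent marginals
for name in probs:
    print(name, round(expected_gain(probs, name), 3))
```

Under these assumptions A has the largest expected gain, because learning that A is false settles the formula outright.
</script>
</section>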
<section>
<h3>The Cost of Perfect Information</h3>
<p>Combine Information Gain and Cost</p>
<p>$$f(\mathcal I_{A}(\phi), c_A)$$</p>
<p class="fragment"><b>For example: </b>$EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$</p>
<p class="fragment"><b>Greedy Algorithm: </b> Maximize $f(\mathcal I_{A}(\phi), c_A)$ at each step</p>
</section>
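<section data-markdown>
<script type="text/template">
### Sketch: One Greedy Step

A minimal sketch of a single greedy step with the EG2 score. It assumes the step selects the variable with the highest score, since a high score means more information per unit cost. The gains and costs below are made-up inputs, and `pick_next` is a hypothetical name.

```python
def eg2(gain, cost):
    """The EG2 score from the slide: (2^gain - 1) / cost."""
    return (2 ** gain - 1) / cost

def pick_next(gains, costs):
    """Greedy step: choose the variable with the best score."""
    return max(gains, key=lambda v: eg2(gains[v], costs[v]))

# Hypothetical expected information gains (bits) and per-variable costs.
gains = {"A": 0.55, "B": 0.05, "C": 0.05}
costs = {"A": 4.0, "B": 1.0, "C": 2.0}
print(pick_next(gains, costs))
```

Here A wins despite being the most expensive, because its gain is much larger than that of B or C.
</script>
</section>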
<section>
<h3>Experimental Data</h3>
<ul>
<li>Start with a large dataset.</li>
<li>Delete random fields (~50%).</li>
</ul>
</section>
<section>
<h3>Experimental Queries</h3>
<p>Simulate an analyst trying to manually explore correlations.</p>
<ul>
<li>Train a tree-classifier on the base data.</li>
<li>Convert the decision tree to a query for all rows where the tree predicts a specific value.</li>
</ul>
</section>
<section>
<h3>Cost vs Entropy: Credit Data</h3>
<img src="graphics/credit_entropy.png" height=400 />
<p><small>
<b>EG2:</b> Greedy Cost/Value Ordering<br/>
<b>NMETC:</b> Naive Minimal Expected Total Cost<br/>
<b>Random:</b> Completely Random Order
</small></p>
</section>
<section>
<h3>Cost vs Entropy: Product Data</h3>
<img src="graphics/product_entropy.png" height=400 />
<p><small>
<b>EG2:</b> Greedy Cost/Value Ordering<br/>
<b>NMETC:</b> Naive Minimal Expected Total Cost<br/>
<b>Random:</b> Completely Random Order
</small></p>
</section>
</section>
</div></div>
<script src="../../reveal.js-3.1.0/lib/js/head.min.js"></script>
<script src="../../reveal.js-3.1.0/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../../reveal.js-3.1.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../../reveal.js-3.1.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../../reveal.js-3.1.0/js/MathJax.js'
},
{ src: '../../reveal.js-3.1.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../../reveal.js-3.1.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../../reveal.js-3.1.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../../reveal.js-3.1.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../../reveal.js-3.1.0/plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>