
<!doctype html>
<html lang="en">
<meta charset="utf-8">
<title>Embracing Uncertainty</title>
<meta name="description" content="Mimir, an awesome system for embracing uncertainty">
<meta name="author" content="Oliver Kennedy">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../reveal.js-3.1.0/css/reveal.css">
<link rel="stylesheet" href="ubodin.css" id="theme">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../reveal.js-3.1.0/lib/css/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../reveal.js-3.1.0/css/print/pdf.css' : '../reveal.js-3.1.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../reveal.js-3.1.0/lib/js/html5shiv.js"></script>
<![endif]-->
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="header">
<!-- Any Talk-Specific Header Content Goes Here -->
Embracing Uncertainty &amp; ODIn Lab Overview
<div class="footer">
<!-- Any Talk-Specific Footer Content Goes Here -->
<div style="float: left; margin-top: 15px; ">
Exploring <u><b>O</b></u>nline <u><b>D</b></u>ata <u><b>In</b></u>teractions
<a href="" target="_blank">
<img src="graphics/FullText-white.png" height="40" style="float: right;"/>
<div class="slides">
<img src="graphics/FullText-black.png" height="100"/>
<h5><a href=""></a></h5>
<img src="graphics/qrcode.31361737.png" />
<h2>Embracing Uncertainty</h2>
<div class="headertext" style="float: left; color: #041a9b; height: 3em;">U @ Buffalo</div>
<div class="headertext" style="color: #041a9b">
Ying Yang, Niccolo Meneghetti, <br/>
Arindam Nandi, Vinayak Karuppasamy, <br/>
<u>Oliver Kennedy</u>, Jan Chomicki</div>
<div class="headertext" style="float: left; color: red;">Oracle</div>
<div class="headertext" style="color: red;">Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick</div>
<h2>Before we begin...</h2>
<h2>Insider Threats</h2>
<li>How do we identify <i>abnormal</i> query behavior from users?</li>
<li>What is <i>normal</i> user behavior?</li>
<li>Multiple gigs of query logs from M&amp;T</li>
<p>...with <b>Gokhan Kul, Duc Thanh Anh Luong, Ting Xie</b>, Shambhu, Varun, Hung</p>
<h2>Pocket Data</h2>
<li>Months of query logs from PhoneLab Phones (2 queries per phone per second)</li>
<li>SQLite is inefficient</li>
<li>SQLite is being used inefficiently</li>
<li>Let's develop a benchmark to help shine a light on these inefficiencies</li>
<p>...with <b>Jerry Ajay</b>, Geoff, Luke</p>
<h2>Just-in-Time Datastructures</h2>
<li>Decouple Physical Structure from Logical Interface.</li>
<li>Express Datastructure Organization through Rewrite Rules.</li>
<li>...allows hybridized datastructures for intermediate tradeoffs.</li>
<li>...allows for semifunctional datastructures with all the benefits but fewer tradeoffs.</li>
<p>...with Luke</p>
<h3>A Big Data Fairy Tale</h3>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" />
<h4>Meet Alice</h4>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" />
<img src="graphics/littlestorefront-800px.png" height="300" />
<h4>Alice has a Store</h4>
<img src="graphics/littlestorefront-800px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">&rarr;</span>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;" />
<h4>Alice's store collects sales data</h4>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">+</span>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">=</span>
<img src="graphics/saco-800px.png" height="300" style=" vertical-align: middle;" />
<h4>Alice wants to use her sales data to run a promotion</h4>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">&rarr;</span>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<h4>So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.</h4>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">+&nbsp;?</span>
<h4>... asks her question ...</h4>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">+&nbsp;?&nbsp;</span>
<img src="graphics/crystalball-800px.png" height="300" style=" vertical-align: middle;" />
<h4>... and basks in the limitless possibilities of big data.</h4>
<h2>Why is this a fairy tale?</h2>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">&rarr;</span>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<h4>It's never this easy...</h4>
<h2>Loading Data</h2>
<li class="fragment">Validating and Fixing Outliers</li>
<li class="fragment">Handling Missing Data</li>
<li class="fragment">Matching Schemas</li>
<li class="fragment">Fixing Schemas</li>
<li class="fragment">Managing Stale Data</li>
<li class="fragment">Deduplicating Records</li>
<li class="fragment">... and lots more</li>
<h2>Data Cleaning is Hard!</h2>
<h3>State of the Art</h3>
<img src="graphics/BI-Analyst.jpg" height="400" />
<p>Alice spends weeks cleaning her data before using it.</p>
<h3>Newer State of the Art</h3>
<img src="graphics/azure-data-lake.png" height=500 />
<img src="graphics/data-lake-to-data-swamp.jpg" height=500 />
<h2>Making Cleaning Easier</h2>
<svg width=500 height=300>
<polyline points="60,50 60,60 40,50 60,40 60,50 440,50 440,40 460,50 440,60 440,50" style="
stroke: black;
fill: black;
stroke-width: 2;
"/>
<text x=0 y=30 style="font-size: 0.75em">Scalability</text>
<text x=370 y=30 style="font-size: 0.75em">Reliability</text>
<text class="fragment" x=-220 y=400 style="font-size: 0.75em" transform="rotate(-90 20,20)">Expert Analysis</text>
<text class="fragment" x=-220 y=250 style="font-size: 0.75em" transform="rotate(-90 20,20)">Crowdsourcing</text>
<text class="fragment" x=-180 y=100 style="font-size: 0.75em" transform="rotate(-90 20,20)">Automation</text>
</svg>
<p class="fragment">Can we start with automation and work our way up?</p>
<li>Automate educated guesses for fast cleaning<ul>
<li><b>Lenses</b>: A family of simple data-cleaning operators</li>
<div class="fragment shrink fade-out" data-fragment-index="5">
<li class="fragment" data-fragment-index="1">... but what if the guesses are wrong?</li>
<div class="fragment shrink fade-out" data-fragment-index="5">
<li class="fragment" data-fragment-index="2">Annotate 'best guess' relations with the guesses<ul>
<li><b>Virtual C-Tables</b>: A lineage model based on views, labeled nulls, and lazy evaluation.</li>
<li class="fragment" data-fragment-index="3">... so now the user needs to interpret your guesses?</li>
<li class="fragment" data-fragment-index="4">Rank guesses by their impact on result uncertainty<ul>
<li><b>CPI</b>: A greedy heuristic for ranking sources of uncertainty.</li>
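<section data-markdown>
<script type="text/template">
A minimal sketch of annotating a best-guess relation with its guesses, assuming a toy in-memory table. The `Var` class, the `evaluate` function, and the sample rows are illustrative assumptions, not Mimir's implementation: guessed cells are stored as labeled nulls and are only replaced by the current best guess when the table is read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A labeled null: a named placeholder for a value that had to be guessed."""
    name: str


def evaluate(rows, best_guess):
    """Lazily substitute best guesses, keeping track of which cells were guessed.

    Returns one (values, guessed_vars) pair per row, where guessed_vars holds
    the variable name for every guessed cell and None for every known cell.
    """
    result = []
    for row in rows:
        values, guessed_vars = [], []
        for cell in row:
            if isinstance(cell, Var):
                values.append(best_guess[cell.name])
                guessed_vars.append(cell.name)
            else:
                values.append(cell)
                guessed_vars.append(None)
        result.append((tuple(values), tuple(guessed_vars)))
    return result


# Hypothetical product table; the last department is unknown.
products = [
    ("Apple 6s, White", "Phone"),
    ("Dell, Intel 4 core", "Computer"),
    ("HP, AMD 2 core", Var("Prod.Dept_3")),
]
view = evaluate(products, {"Prod.Dept_3": "Computer"})
```

Because the guess is applied at read time, revising `best_guess` changes the output without rewriting the stored table.
</script>
</section>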
<p class="fragment" data-fragment-index="1">Here's a problem with my data. <span class="fragment" data-fragment-index="2">Fix it.</span></p>
<li class="fragment" data-fragment-index="3">What types should columns in this table have?
<ul class="fragment smalltext" data-fragment-index="7"><li>Majority Vote of All Castable Types</li></ul></li>
<li class="fragment" data-fragment-index="4">How do the columns of these relations line up?
<ul class="fragment smalltext" data-fragment-index="7"><li>Paygo and Countless Other Papers/Systems</li></ul></li>
<li class="fragment" data-fragment-index="5">How do I query heterogeneous JSON/XML objects?
<ul class="fragment smalltext" data-fragment-index="7"><li>XMorph and Many Others</li></ul></li>
<li class="fragment" data-fragment-index="6">What should these missing values be?
<ul class="fragment smalltext" data-fragment-index="7"><li>Machine Learning + Interpolation</li></ul></li>
<p>Each lens implements one automated data repair task with <b>minimal configuration or training</b>.</p>
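<section data-markdown>
<script type="text/template">
One way to read "majority vote of all castable types", sketched under assumptions: values arrive as raw strings, the candidate types are limited to `int`, `float`, and `string`, and a type wins when more than half of the values cast to it. The function names and the `threshold` parameter are hypothetical.

```python
from collections import Counter

# Candidate types, most specific first; "string" always matches.
CANDIDATES = ["int", "float", "string"]


def castable_types(value):
    """Return every candidate type that the raw string `value` can be cast to."""
    types = ["string"]
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            types.append(name)
        except ValueError:
            pass
    return types


def infer_column_type(values, threshold=0.5):
    """Pick the most specific type that more than `threshold` of values cast to.

    Returns (type_name, fraction_of_values_that_cast).
    """
    if not values:
        return "string", 1.0
    votes = Counter(t for v in values for t in castable_types(v))
    for name in CANDIDATES:
        fraction = votes[name] / len(values)
        if fraction > threshold:
            return name, fraction
    return "string", 1.0
```

The returned fraction gives a rough sense of how contested the guess was, which a UI could use when deciding what to highlight.
</script>
</section>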
<li class="fragment">A "SQL" Expression</li>
<li class="fragment">A Model that defines configuration parameters and best-guesses for data repairs.</li>
<li><code>AS</code> clause defines source data.</li>
<li><code>USING</code> clause requests repairs.</li>
<h4>The Lens Query</h4>
<small class="fragment">
<tr><td>123</td><td>Apple 6s, White</td><td>...</td><td>Phone</td></tr>
<tr><td>34234</td><td>Dell, Intel 4 core</td><td>...</td><td>Computer</td></tr>
<tr><td>34235</td><td>HP, AMD 2 core</td><td>...</td><td class="fragment">$Prod.Dept_3$</td></tr>
<h4>The Lens Model</h4>
<div class="fragment">
<div style="font-size: 1em; vertical-align: middle;"></div>
<img src="graphics/weka.png" />
<div class="fragment">
<div style="font-size: 1em; vertical-align: middle;"></div>
<div><p>An estimator for each <small style="vertical-align: baseline;">$Prod.Dept_{ROWID}$</small></p></div>
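<section data-markdown>
<script type="text/template">
A rough stand-in for a per-row estimator, assuming a simple token-voting scheme in place of a trained classifier. The `DepartmentEstimator` class and its methods are hypothetical names used only for illustration, and the sample rows mirror the product table shown earlier.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Split a product name into lowercase tokens."""
    return text.lower().replace(",", " ").split()


class DepartmentEstimator:
    """Guess a missing department from tokens shared with labeled rows."""

    def __init__(self, rows):
        # rows: iterable of (rowid, name, department or None)
        self.token_votes = defaultdict(Counter)
        self.overall = Counter()
        for _, name, dept in rows:
            if dept is None:
                continue  # unlabeled rows carry no training signal
            self.overall[dept] += 1
            for tok in set(tokenize(name)):
                self.token_votes[tok][dept] += 1

    def estimate(self, name):
        """Return (best_guess, share_of_votes) for a product name."""
        votes = Counter()
        for tok in set(tokenize(name)):
            votes.update(self.token_votes.get(tok, {}))
        if not votes:
            votes = self.overall  # no shared tokens: fall back to label counts
        guess, count = votes.most_common(1)[0]
        return guess, count / sum(votes.values())


rows = [
    (123, "Apple 6s, White", "Phone"),
    (34234, "Dell, Intel 4 core", "Computer"),
    (34235, "HP, AMD 2 core", None),
]
estimator = DepartmentEstimator(rows)
guess, share = estimator.estimate("HP, AMD 2 core")
```

Here the only token shared with a labeled row is `core`, so the sketch guesses `Computer`. The vote share is not a calibrated probability.
</script>
</section>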
<h2>The User's View</h2>
<table class="fragment" data-fragment-index="1">
<tr><td>Apple 6s, White</td><td>Phone</td></tr>
<tr><td>Dell, Intel 4 core</td><td>Computer</td></tr>
<tr><td>HP, AMD 2 core</td><td class="fragment highlight-red" data-fragment-index="2">Computer</td></tr>
<p class="fragment" data-fragment-index="2"><b>Simple UI:</b> Highlight values (and rows) based on guesses.</p>
<tr><td>Apple 6s, White</td><td>Phone</td></tr>
<tr><td>Dell, Intel 4 core</td><td>Computer</td></tr>
<tr><td>HP, AMD 2 core</td><td style="color: red;">Computer</td></tr>
<svg xmlns="" xmlns:xl="" version="1.1" viewBox="241 277 265 125" width="265pt" height="125pt" xmlns:dc="" class="fragment" data-fragment-index="1">
<metadata> Produced by OmniGraffle 6.2.5 <dc:date>2015-09-20 14:45:55 +0000</dc:date></metadata>
<defs><font-face font-family="Helvetica Neue" font-size="16" panose-1="2 0 8 3 0 0 0 9 0 4" units-per-em="1000" underline-position="-100" underline-thickness="50" slope="0" x-height="517" cap-height="714" ascent="975.0061" descent="-216.99524" font-weight="bold"><font-face-src><font-face-name name="HelveticaNeue-Bold"/></font-face-src></font-face><font-face font-family="Helvetica Neue" font-size="16" panose-1="2 0 5 3 0 0 0 2 0 4" units-per-em="1000" underline-position="-100" underline-thickness="50" slope="0" x-height="517" cap-height="714" ascent="951.99585" descent="-212.99744" font-weight="500"><font-face-src><font-face-name name="HelveticaNeue"/></font-face-src></font-face></defs>
<g stroke="none" stroke-opacity="1" stroke-dasharray="none" fill="none" fill-opacity="1">
<title>Canvas 1</title>
<title>Layer 1</title>
<path d="M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" fill="white"/>
<path d="M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" stroke="black" stroke-linecap="round" stroke-linejoin="round" stroke-width="1"/>
<text transform="translate(293 293)" fill="black"><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="0" y="16" textLength="16.896" class="fragment" data-fragment-index="2">Pr</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="16.608" y="16" textLength="69.28" class="fragment" data-fragment-index="2">obability:</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="85.888" y="16" textLength="38.24" class="fragment" data-fragment-index="2"> 95%</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="0" y="53" textLength="62.224" class="fragment" data-fragment-index="3">Reason:</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="62.224" y="53" textLength="144.912" class="fragment" data-fragment-index="3"> Because I guessed </tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="0" y="71" textLength="206.592" class="fragment" data-fragment-index="3">Computer for Department </tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="0" y="89" textLength="196.16" class="fragment" data-fragment-index="3">on Row 3 of PRODUCTS</tspan></text>
<p class="fragment" data-fragment-index="1">Allow users to <code>EXPLAIN</code> uncertain outputs</p>
<p class="fragment" data-fragment-index="3">Explanations include reasons given in English</p>
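<section data-markdown>
<script type="text/template">
A small sketch of how an explanation like the one in the callout might be assembled, assuming each guessed cell carries its table, column, row, guess, and probability. The `explain` function and its parameters are hypothetical, not Mimir's `EXPLAIN` interface.

```python
def explain(table, column, row, guess, probability):
    """Render a plain-English explanation for one guessed cell."""
    return (
        f"Probability: {probability:.0%}\n"
        f"Reason: Because I guessed {guess} for {column} "
        f"on Row {row} of {table}"
    )


message = explain("PRODUCTS", "Department", 3, "Computer", 0.95)
```
</script>
</section>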
<h3>Other Lenses</h3>
<li>Schema Matching (equivalently JSON/XML import)</li>
<li>Archival (how stale is my data?)</li>
<li>Type Inference</li>
<li style="color: grey;">Deduplication / Entity Resolution</li>
<li style="color: grey;">Schema Name Inference</li>
<li>And more...</li>
<h2>Mimir Demo</h2>
<p><a href="" target="_blank"><img src="" height="400"/></a></p>
<h2>Intuitive Uncertainty</h2>
<p><b>UB</b>: Ying Yang, Niccolo Meneghetti, <br/> Arindam Nandi, Vinayak Karuppasamy, <br/>Oliver Kennedy, Jan Chomicki</p>
<p><b>Oracle</b>: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick</p>
<h4>Thanks to Oracle for multiple gifts that make this research possible</h4>
<script src="../reveal.js-3.1.0/lib/js/head.min.js"></script>
<script src="../reveal.js-3.1.0/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional ../reveal.js plugins
dependencies: [
{ src: '../reveal.js-3.1.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.1.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.1.0/js/MathJax.js'
},
{ src: '../reveal.js-3.1.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.1.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.1.0/plugin/notes/notes.js', async: true }
]
});
</script>