<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Embracing Uncertainty</title>
<meta name="description" content="Mimir, an awesome system for embracing uncertainty">
<meta name="author" content="Oliver Kennedy">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="../reveal.js-3.1.0/css/reveal.css">
<link rel="stylesheet" href="ubodin.css" id="theme">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="../reveal.js-3.1.0/lib/css/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../reveal.js-3.1.0/css/print/pdf.css' : '../reveal.js-3.1.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<!--[if lt IE 9]>
<script src="../reveal.js-3.1.0/lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="header">
<!-- Any Talk-Specific Header Content Goes Here -->
Embracing Uncertainty &amp; ODIn Lab Overview
</div>
<div class="footer">
<!-- Any Talk-Specific Footer Content Goes Here -->
<div style="float: left; margin-top: 15px; ">
Exploring <u><b>O</b></u>nline <u><b>D</b></u>ata <u><b>In</b></u>teractions
</div>
<a href="https://odin.cse.buffalo.edu" target="_blank">
<img src="graphics/FullText-white.png" height="40" style="float: right;"/>
</a>
</div>
<div class="slides">
<section>
<section>
<img src="graphics/FullText-black.png" height="100"/>
<h5><a href="https://odin.cse.buffalo.edu">https://odin.cse.buffalo.edu</a></h5>
<img src="graphics/qrcode.31361737.png" />
</section>
<section>
<h2>Embracing Uncertainty</h2>
<div class="headertext" style="float: left; color: #041a9b; height: 3em;">U @ Buffalo</div>
<div class="headertext" style="color: #041a9b">
Ying Yang, Niccolo Meneghetti, <br/>
Arindam Nandi, Vinayak Karuppasamy, <br/>
<u>Oliver Kennedy</u>, Jan Chomicki</div>
<div class="headertext" style="float: left; color: red;">Oracle</div>
<div class="headertext" style="color: red;">Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick</div>
</section>
<section>
<h2>Before we begin...</h2>
</section>
<section>
<h2>Insider Threats</h2>
<ul>
<li>How do we identify <i>abnormal</i> query behavior from users?</li>
<li>What is <i>normal</i> user behavior?</li>
<li>Multiple gigs of query logs from M&amp;T</li>
</ul>
<p>...with <b>Gokhan Kul, Duc Thanh Anh Luong, Ting Xie</b>, Shambhu, Varun, Hung</p>
</section>
<section>
<h2>Pocket Data</h2>
<ul>
<li>Months of query logs from PhoneLab Phones (2 queries per phone per second)</li>
<li>SQLite is inefficient</li>
<li>SQLite is being used inefficiently</li>
<li>Let's develop a benchmark to help shine a light on these inefficiencies</li>
</ul>
<p>...with <b>Jerry Ajay</b>, Geoff, Luke</p>
</section>
<section>
<h2>Just-in-Time Datastructures</h2>
<ul>
<li>Decouple Physical Structure from Logical Interface.</li>
<li>Express Datastructure Organization through Rewrite Rules.</li>
<li>...allows hybridized datastructures for intermediate tradeoffs.</li>
<li>...allows for semifunctional datastructures with all the benefits but fewer tradeoffs.</li>
</ul>
<p>...with Luke</p>
</section>
</section>
<section>
<section>
<h3>A Big Data Fairy Tale</h3>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" />
<h4>Meet Alice</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" />
<img src="graphics/littlestorefront-800px.png" height="300" />
<h4>Alice has a Store</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/littlestorefront-800px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;"></span>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;" />
<h4>Alice's store collects sales data</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/dagobert83-female-user-icon-800px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;">+</span>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">=</span>
<img src="graphics/saco-800px.png" height="300" style=" vertical-align: middle;" />
<h4>Alice wants to use her sales data to run a promotion</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;"></span>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<h4>So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">+&nbsp;?</span>
<h4>... asks her question ...</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
<section>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<span style="font-size: 3em; vertical-align: middle;">+&nbsp;?&nbsp;</span>
<img src="graphics/crystalball-800px.png" height="300" style=" vertical-align: middle;" />
<h4>... and basks in the limitless possibilities of big data.</h4>
<attribution>(OpenClipArt.org)</attribution>
</section>
</section>
<section>
<section>
<h2>Why is this a fairy tale?</h2>
</section>
<section>
<img src="graphics/matt-icons_text-x-log-300px.png" height="300" style=" vertical-align: middle;"/>
<span style="font-size: 3em; vertical-align: middle;"></span>
<img src="graphics/database-server-800px.png" height="300" style=" vertical-align: middle;" />
<h4>It's never this easy...</h4>
</section>
<section>
<h2>Loading Data</h2>
<small>
<ul>
<li class="fragment">Validating and Fixing Outliers</li>
<li class="fragment">Handling Missing Data</li>
<li class="fragment">Matching Schemas</li>
<li class="fragment">Fixing Schemas</li>
<li class="fragment">Managing Stale Data</li>
<li class="fragment">Deduplicating Records</li>
<li class="fragment">... and lots more</li>
</ul>
</small>
</section>
</section>
<section>
<section>
<h2>Data Cleaning is Hard!</h2>
</section>
<section>
<h3>State of the Art</h3>
<img src="graphics/BI-Analyst.jpg" height="400" />
<attribution>(skilledup.com)</attribution>
<p>Alice spends weeks cleaning her data before using it.</p>
</section>
<section>
<h3>Newer State of the Art</h3>
<img src="graphics/azure-data-lake.png" height=500 />
<attribution>(azure.microsoft.com)</attribution>
</section>
<section>
<img src="graphics/data-lake-to-data-swamp.jpg" height=500 />
<attribution>(timoelliott.com)</attribution>
</section>
</section>
<section>
<section>
<h2>Making Cleaning Easier</h2>
<svg width=500 height=300>
<polygon
points="60,50 60,60 40,50 60,40 60,50 440,50 440,40 460,50 440,60 440,50"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
<text x=0 y=30 style="font-size: 0.75em">Scalability</text>
<text x=370 y=30 style="font-size: 0.75em">Reliability</text>
<text class="fragment" x=-220 y=400 style="font-size: 0.75em" transform="rotate(-90 20,20)">Expert Analysis</text>
<text class="fragment" x=-220 y=250 style="font-size: 0.75em" transform="rotate(-90 20,20)">Crowdsourcing</text>
<text class="fragment" x=-180 y=100 style="font-size: 0.75em" transform="rotate(-90 20,20)">Automation</text>
</svg>
<p class="fragment">Can we start with automation and work our way up?</p>
</section>
</section>
<section>
<h1>Mimir</h1>
</section>
<section>
<ul>
<li>Automate educated guesses for fast cleaning<ul>
<li><b>Lenses</b>: A family of simple data-cleaning operators</li>
<div class="fragment shrink fade-out" data-fragment-index="5">
<li class="fragment" data-fragment-index="1">... but what if the guesses are wrong?</li>
</div>
</ul></li>
<div class="fragment shrink fade-out" data-fragment-index="5">
<li class="fragment" data-fragment-index="2">Annotate 'best guess' relations with the guesses<ul>
<li><b>Virtual C-Tables</b>: A lineage model based on views, labeled nulls, and lazy evaluation.</li>
<li class="fragment" data-fragment-index="3">... so now the user needs to interpret your guesses?</li>
</ul></li>
<li class="fragment" data-fragment-index="4">Rank guesses by their impact on result uncertainty<ul>
<li><b>CPI</b>: A greedy heuristic for ranking sources of uncertainty.</li>
</ul></li>
</div>
</ul>
</section>
<section>
<section>
<h3>Lenses</h3>
<p class="fragment" data-fragment-index="1">Here's a problem with my data. <span class="fragment" data-fragment-index="2">Fix it.</span></p>
<ul>
<li class="fragment" data-fragment-index="3">What types should columns in this table have?
<ul class="fragment smalltext" data-fragment-index="7"><li>Majority Vote of All Castable Types</li></ul></li>
<li class="fragment" data-fragment-index="4">How do the columns of these relations line up?
<ul class="fragment smalltext" data-fragment-index="7"><li>Paygo and Countless Other Papers/Systems</li></ul></li>
<li class="fragment" data-fragment-index="5">How do I query heterogeneous JSON/XML objects?
<ul class="fragment smalltext" data-fragment-index="7"><li>XMorph and Many Others</li></ul></li>
<li class="fragment" data-fragment-index="6">What should these missing values be?
<ul class="fragment smalltext" data-fragment-index="7"><li>Machine Learning + Interpolation</li></ul></li>
</ul>
</section>
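<section data-markdown>
<script type="text/template">
One plausible reading of "majority vote of all castable types", sketched in JavaScript. The candidate types, their patterns, and the name `guessColumnType` are illustrative assumptions, not Mimir's implementation.

```javascript
// Hypothetical candidate types, ordered from most to least specific.
const CANDIDATE_TYPES = [
  { name: "int",   accepts: (s) => /^[+-]?\d+$/.test(s) },
  { name: "float", accepts: (s) => /^[+-]?(\d+\.?\d*|\.\d+)$/.test(s) },
  { name: "date",  accepts: (s) => /^\d{4}-\d{2}-\d{2}$/.test(s) },
];

// Each non-empty value votes for every type it can be cast to.
// Ties go to the earlier (more specific) candidate.
function guessColumnType(values) {
  const present = values.filter((v) => v !== null && v !== "");
  let best = { type: "string", votes: 0 };
  for (const candidate of CANDIDATE_TYPES) {
    const votes = present.filter((v) => candidate.accepts(String(v))).length;
    if (votes > best.votes) best = { type: candidate.name, votes };
  }
  // Require a strict majority; otherwise keep everything as strings.
  if (best.votes * 2 <= present.length) return { type: "string", support: 1 };
  return { type: best.type, support: best.votes / present.length };
}
```

For example, `guessColumnType(["1", "2", "x", "4"])` picks `int` with a support of 0.75; the minority value `"x"` is what the lens would flag as a guess.
</script>
</section>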
<section>
<h3>Lenses</h3>
<p>Each lens implements one automated data repair task with <b>minimal configuration or training</b>.</p>
<ul>
<li class="fragment">A "SQL" Expression</li>
<li class="fragment">A Model that defines configuration parameters and best-guesses for data repairs.</li>
</ul>
</section>
<section>
<pre><code>
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
</code></pre>
<ul>
<li><code>AS</code> clause defines source data.</li>
<li><code>USING</code> clause requests repairs.</li>
</ul>
</section>
<section>
<div>
<h4>The Lens Query</h4>
<pre><code>
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
</code></pre>
</div>
<small class="fragment">
<table>
<tr><th>ID</th><th>Name</th><th>...</th><th>Department</th></tr>
<tr><td>123</td><td>Apple 6s, White</td><td>...</td><td>Phone</td></tr>
<tr><td>34234</td><td>Dell, Intel 4 core</td><td>...</td><td>Computer</td></tr>
<tr><td>34235</td><td>HP, AMD 2 core</td><td>...</td><td class="fragment">$Prod.Dept_3$</td></tr>
<tr><td>...</td><td>...</td><td>...</td><td>...</td></tr>
</table>
</small>
</section>
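<section data-markdown>
<script type="text/template">
A minimal sketch of what the lens query does to each row, assuming rows are plain objects and that a 1-based array position stands in for `ROWID`. The function name and the placeholder shape are hypothetical.

```javascript
// Mirror the CASE expression: keep DEPARTMENT when it is present,
// otherwise emit a placeholder naming the variable and its row.
function applyDomainRepairLens(rawRows) {
  return rawRows.map((row, index) => ({
    ...row,
    DEPARTMENT:
      row.DEPARTMENT !== null && row.DEPARTMENT !== undefined
        ? row.DEPARTMENT
        : { variable: "PRODUCTS.DEPARTMENT", rowid: index + 1 },
  }));
}
```

The placeholder is not a value yet. It is resolved later by asking the lens model for a best guess, which is what keeps the view lazy.
</script>
</section>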
<section>
<div>
<h4>The Lens Model</h4>
<pre><code>
SELECT * FROM PRODUCTS_RAW;
</code></pre>
</div>
<div class="fragment">
<div style="font-size: 1em; vertical-align: middle;"></div>
<div>
<img src="graphics/weka.png" />
</div>
</div>
<div class="fragment">
<div style="font-size: 1em; vertical-align: middle;"></div>
<div><p>An estimator for each <small style="vertical-align: baseline;">$Prod.Dept_{ROWID}$</small></p></div>
</div>
</section>
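<section data-markdown>
<script type="text/template">
A toy stand-in for the lens model, assuming a simple frequency count in place of a trained classifier. `trainDepartmentModel` and `bestGuess` are hypothetical names used only for illustration.

```javascript
// Count how often each department appears among rows where it is
// known, then offer the most frequent one as the best guess.
function trainDepartmentModel(rawRows) {
  const counts = new Map();
  for (const row of rawRows) {
    if (row.DEPARTMENT === null || row.DEPARTMENT === undefined) continue;
    counts.set(row.DEPARTMENT, (counts.get(row.DEPARTMENT) || 0) + 1);
  }
  const total = [...counts.values()].reduce((a, b) => a + b, 0);
  let best = { value: null, probability: 0 };
  for (const [value, n] of counts) {
    if (n / total > best.probability) best = { value, probability: n / total };
  }
  // This toy model ignores rowid; a real classifier would look at
  // the row's other columns to make a per-row guess.
  return { bestGuess: (rowid) => best };
}
```

The point of the interface is that every variable gets both a guess and a probability, so downstream queries can report how much to trust each cell.
</script>
</section>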
</section>
<section>
<section>
<h2>The User's View</h2>
<pre><code>
SELECT NAME, DEPARTMENT FROM PRODUCTS;
</code></pre>
<table class="fragment" data-fragment-index="1">
<tr><th>Name</th><th>Department</th></tr>
<tr><td>Apple 6s, White</td><td>Phone</td></tr>
<tr><td>Dell, Intel 4 core</td><td>Computer</td></tr>
<tr><td>HP, AMD 2 core</td><td class="fragment highlight-red" data-fragment-index="2">Computer</td></tr>
<tr><td>...</td><td>...</td></tr>
</table>
<p class="fragment" data-fragment-index="2"><b>Simple UI:</b> Highlight values (and rows) based on guesses.</p>
</section>
<section>
<pre><code>
SELECT NAME, DEPARTMENT FROM PRODUCTS;
</code></pre>
<small>
<table>
<tr><th>Name</th><th>Department</th></tr>
<tr><td>Apple 6s, White</td><td>Phone</td></tr>
<tr><td>Dell, Intel 4 core</td><td>Computer</td></tr>
<tr><td>HP, AMD 2 core</td><td style="color: red;">Computer</td></tr>
<tr><td>...</td><td>...</td></tr>
</table>
</small>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xl="http://www.w3.org/1999/xlink" version="1.1" viewBox="241 277 265 125" width="265pt" height="125pt" xmlns:dc="http://purl.org/dc/elements/1.1/" class="fragment" data-fragment-index="1">
<metadata> Produced by OmniGraffle 6.2.5 <dc:date>2015-09-20 14:45:55 +0000</dc:date></metadata>
<defs><font-face font-family="Helvetica Neue" font-size="16" panose-1="2 0 8 3 0 0 0 9 0 4" units-per-em="1000" underline-position="-100" underline-thickness="50" slope="0" x-height="517" cap-height="714" ascent="975.0061" descent="-216.99524" font-weight="bold"><font-face-src><font-face-name name="HelveticaNeue-Bold"/></font-face-src></font-face><font-face font-family="Helvetica Neue" font-size="16" panose-1="2 0 5 3 0 0 0 2 0 4" units-per-em="1000" underline-position="-100" underline-thickness="50" slope="0" x-height="517" cap-height="714" ascent="951.99585" descent="-212.99744" font-weight="500"><font-face-src><font-face-name name="HelveticaNeue"/></font-face-src></font-face></defs>
<g stroke="none" stroke-opacity="1" stroke-dasharray="none" fill="none" fill-opacity="1">
<title>Canvas 1</title>
<g>
<title>Layer 1</title>
<path d="M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" fill="white"/>
<path d="M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" stroke="black" stroke-linecap="round" stroke-linejoin="round" stroke-width="1"/>
<text transform="translate(293 293)" fill="black"><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="0" y="16" textLength="16.896" class="fragment" data-fragment-index="2">Pr</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="16.608" y="16" textLength="69.28" class="fragment" data-fragment-index="2">obability:</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="85.888" y="16" textLength="38.24" class="fragment" data-fragment-index="2"> 95%</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="bold" x="0" y="53" textLength="62.224" class="fragment" data-fragment-index="3">Reason:</tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="62.224" y="53" textLength="144.912" class="fragment" data-fragment-index="3"> Because I guessed </tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="0" y="71" textLength="206.592" class="fragment" data-fragment-index="3">Computer for Department </tspan><tspan font-family="Helvetica Neue" font-size="16" font-weight="500" x="0" y="89" textLength="196.16" class="fragment" data-fragment-index="3">on Row 3 of PRODUCTS</tspan></text>
</g>
</g>
</svg>
<p class="fragment" data-fragment-index="1">Allow users to <code>EXPLAIN</code> uncertain outputs</p>
<p class="fragment" data-fragment-index="3">Explanations include reasons given in English</p>
</section>
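<section data-markdown>
<script type="text/template">
A sketch of how an explanation like the one above could be assembled from a recorded guess. The field names and the function `explainGuess` are assumptions made for illustration, not Mimir's API.

```javascript
// Turn one recorded guess into the probability and English reason
// shown to the user.
function explainGuess(guess) {
  return {
    probability: Math.round(guess.probability * 100) + "%",
    reason:
      "Because I guessed " + guess.value +
      " for " + guess.column +
      " on Row " + guess.rowid +
      " of " + guess.table,
  };
}
```

Called with the third row of `PRODUCTS` (value `Computer`, column `Department`, probability 0.95), this yields the probability `95%` and the reason text shown in the callout.
</script>
</section>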
<section>
<h3>Other Lenses</h3>
<ul>
<li>Schema Matching (equivalently JSON/XML import)</li>
<li>Archival (how stale is my data?)</li>
<li>Type Inference</li>
<li style="color: grey;">Deduplication / Entity Resolution</li>
<li style="color: grey;">Schema Name Inference</li>
<li>And more...</li>
</ul>
</section>
</section>
<section>
<section>
<h2>Mimir Demo</h2>
<p><a href="http://demo.odin.cse.buffalo.edu" target="_blank"><img src="https://odin.cse.buffalo.edu/wp-content/uploads/2015/08/Mimir_Screenshot.png" height="400"/></a></p>
</section>
<section>
<h2>Intuitive Uncertainty</h2>
<p><b>UB</b>: Ying Yang, Niccolo Meneghetti, <br/> Arindam Nandi, Vinayak Karuppasamy, <br/>Oliver Kennedy, Jan Chomicki</p>
<p><b>Oracle</b>: Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick</p>
<h4>Thanks to Oracle for multiple gifts that make this research possible</h4>
</section>
</section>
</div></div>
<script src="../reveal.js-3.1.0/lib/js/head.min.js"></script>
<script src="../reveal.js-3.1.0/js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../reveal.js-3.1.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.1.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.1.0/js/MathJax.js'
},
{ src: '../reveal.js-3.1.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.1.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.1.0/plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>