<!doctype html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< title > Embracing Uncertainty< / title >
< meta name = "description" content = "Mimir, an awesome system for embracing uncertainty" >
< meta name = "author" content = "Oliver Kennedy" >
< meta name = "apple-mobile-web-app-capable" content = "yes" / >
< meta name = "apple-mobile-web-app-status-bar-style" content = "black-translucent" / >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui" >
< link rel = "stylesheet" href = "../reveal.js-3.1.0/css/reveal.css" >
< link rel = "stylesheet" href = "ubodin.css" id = "theme" >
<!-- Code syntax highlighting -->
< link rel = "stylesheet" href = "../reveal.js-3.1.0/lib/css/zenburn.css" >
<!-- Printing and PDF exports -->
< script >
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../reveal.js-3.1.0/css/print/pdf.css' : '../reveal.js-3.1.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
< / script >
<!--[if lt IE 9]>
<script src="../reveal.js-3.1.0/lib/js/html5shiv.js"></script>
<![endif]-->
< / head >
< body >
< div class = "reveal" >
<!-- Any section element inside of this container is displayed as a slide -->
< div class = "header" >
<!-- Any Talk - Specific Header Content Goes Here -->
Center for Multisource Information Fusion
<span style="margin-left: 200px;"></span>
Embracing Uncertainty
< / div >
< div class = "footer" >
<!-- Any Talk - Specific Footer Content Goes Here -->
< div style = "float: left; margin-top: 15px; " >
Exploring < u > < b > O< / b > < / u > nline < u > < b > D< / b > < / u > ata < u > < b > In< / b > < / u > teractions
< / div >
< a href = "https://odin.cse.buffalo.edu" target = "_blank" >
< img src = "graphics/FullText-white.png" height = "40" style = "float: right;" / >
< / a >
< / div >
< div class = "slides" >
< section >
< h2 > Embracing Uncertainty< / h2 >
< div class = "headertext" style = "float: left; color: #041a9b; height: 3em;" > U @ Buffalo< / div >
< div class = "headertext" style = "color: #041a9b" >
Ying Yang, Niccolo Meneghetti, < br / >
Arindam Nandi, Vinayak Karuppasamy, < br / >
< u > Oliver Kennedy< / u > , Jan Chomicki< / div >
< div class = "headertext" style = "float: left; color: red;" > Oracle< / div >
< div class = "headertext" style = "color: red;" > Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick< / div >
< / section >
< section >
< section >
< h3 > A Big Data Fairy Tale< / h3 >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" / >
< h4 > Meet Alice< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" / >
< img src = "graphics/littlestorefront-800px.png" height = "300" / >
< h4 > Alice has a Store< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/littlestorefront-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > Alice's store collects sales data< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > +< / span >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > =< / span >
< img src = "graphics/saco-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > Alice wants to use her sales data to run a promotion< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
<h4>So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.</h4>
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > + ?< / span >
< h4 > ... asks her question ...< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > + ? →< / span >
< img src = "graphics/crystalball-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > ... and basks in the limitless possibilities of big data.< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< / section >
< section >
< section >
< h2 > Why is this a fairy tale?< / h2 >
< / section >
< section >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > It's never this easy...< / h4 >
< / section >
< section >
<h2>Loading Data</h2>
< small >
< ul >
< li class = "fragment" > Validating and Fixing Outliers< / li >
< li class = "fragment" > Handling Missing Data< / li >
< li class = "fragment" > Matching Schemas< / li >
< li class = "fragment" > Fixing Schemas< / li >
< li class = "fragment" > Managing Stale Data< / li >
< li class = "fragment" > Deduplicating Records< / li >
< li class = "fragment" > ... and lots more< / li >
< / ul >
< / small >
< / section >
< / section >
< section >
< section >
< h2 > Data Cleaning is Hard!< / h2 >
< / section >
< section >
< h3 > State of the Art< / h3 >
< img src = "graphics/BI-Analyst.jpg" height = "400" / >
< attribution > (skilledup.com)< / attribution >
< p > Alice spends weeks cleaning her data before using it.< / p >
< / section >
< section >
< h3 > Newer State of the Art< / h3 >
< img src = "graphics/azure-data-lake.png" height = 500 / >
< attribution > (azure.microsoft.com)< / attribution >
< / section >
< section >
< img src = "graphics/data-lake-to-data-swamp.jpg" height = 500 / >
< attribution > (timoelliott.com)< / attribution >
< / section >
< / section >
< section >
< section >
< h2 > Making Cleaning Easier< / h2 >
< svg width = 500 height = 300 >
< polygon
points="60,50 60,60 40,50 60,40 60,50 440,50 440,40 460,50 440,60 440,50"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< text x = 0 y = 30 style = "font-size: 0.75em" > Scalability< / text >
< text x = 370 y = 30 style = "font-size: 0.75em" > Reliability< / text >
< text class = "fragment" x = -220 y = 400 style = "font-size: 0.75em" transform = "rotate(-90 20,20)" > Expert Analysis< / text >
< text class = "fragment" x = -220 y = 250 style = "font-size: 0.75em" transform = "rotate(-90 20,20)" > Crowdsourcing< / text >
< text class = "fragment" x = -180 y = 100 style = "font-size: 0.75em" transform = "rotate(-90 20,20)" > Automation< / text >
< / svg >
< p class = "fragment" > Can we start with automation and work our way up?< / p >
< / section >
< / section >
< section >
< h1 > Mimir< / h1 >
< / section >
< section >
< ul >
< li > Automate educated guesses for fast cleaning< ul >
< li > < b > Lenses< / b > : A family of simple data-cleaning operators< / li >
< div class = "fragment shrink fade-out" data-fragment-index = "5" >
< li class = "fragment" data-fragment-index = "1" > ... but what if the guesses are wrong?< / li >
< / div >
< / ul > < / li >
< div class = "fragment shrink fade-out" data-fragment-index = "5" >
< li class = "fragment" data-fragment-index = "2" > Annotate 'best guess' relations with the guesses< ul >
< li > < b > Virtual C-Tables< / b > : A lineage model based on views, labeled nulls, and lazy evaluation.< / li >
< li class = "fragment" data-fragment-index = "3" > ... so now the user needs to interpret your guesses?< / li >
< / ul > < / li >
< li class = "fragment" data-fragment-index = "4" > Rank guesses by their impact on result uncertainty< ul >
< li > < b > CPI< / b > : A greedy heuristic for ranking sources of uncertainty.< / li >
< / ul > < / li >
< / div >
< / ul >
< / section >
< section >
< section >
< h3 > Lenses< / h3 >
< p class = "fragment" data-fragment-index = "1" > Here's a problem with my data. < span class = "fragment" data-fragment-index = "2" > Fix it.< / span > < / p >
< ul >
< li class = "fragment" data-fragment-index = "3" > What types should columns in this table have?
< ul class = "fragment smalltext" data-fragment-index = "7" > < li > Majority Vote of All Castable Types< / li > < / ul > < / li >
< li class = "fragment" data-fragment-index = "4" > How do the columns of these relations line up?
< ul class = "fragment smalltext" data-fragment-index = "7" > < li > Paygo and Countless Other Papers/Systems< / li > < / ul > < / li >
< li class = "fragment" data-fragment-index = "5" > How do I query heterogeneous JSON/XML objects?
< ul class = "fragment smalltext" data-fragment-index = "7" > < li > XMorph and Many Others< / li > < / ul > < / li >
< li class = "fragment" data-fragment-index = "6" > What should these missing values be?
< ul class = "fragment smalltext" data-fragment-index = "7" > < li > Machine Learning + Interpolation< / li > < / ul > < / li >
</ul>
< / section >
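<section data-markdown>
<script type="text/template">
A minimal Python sketch of the "majority vote of all castable types" idea behind the type-inference question. The candidate type list, the rule that each value votes for its most specific castable type, the tie-breaking order, and the name `infer_column_type` are assumptions for illustration, not Mimir's implementation. Values are assumed to arrive as strings.

```python
from collections import Counter

# Hypothetical candidate types, ordered from most to least specific.
CASTERS = [("int", int), ("float", float), ("string", str)]

def castable_types(value):
    """Return the names of every candidate type that `value` can be cast to."""
    names = []
    for name, cast in CASTERS:
        try:
            cast(value)
            names.append(name)
        except ValueError:
            pass
    return names

def infer_column_type(values):
    """Each value votes for its most specific castable type; the majority wins."""
    votes = Counter(castable_types(v)[0] for v in values)
    # Ties go to the type listed first in CASTERS.
    order = [name for name, _ in CASTERS]
    return max(order, key=lambda name: votes[name])
```

For example, `infer_column_type(["1", "2", "x"])` picks `int`, because two of the three values vote for it.
</script>
</section>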
< section >
< h3 > Lenses< / h3 >
< p > Each lens implements one automated data repair task with < b > minimal configuration or training< / b > .< / p >
< ul >
< li class = "fragment" > A "SQL" Expression< / li >
< li class = "fragment" > A Model that defines configuration parameters and best-guesses for data repairs.< / li >
< / ul >
< / section >
< section >
< pre > < code >
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
< / code > < / pre >
< ul >
< li > < code > AS< / code > clause defines source data.< / li >
< li > < code > USING< / code > clause requests repairs.< / li >
< / ul >
< / section >
< section >
< div >
< h4 > The Lens Query< / h4 >
< pre > < code >
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
< / code > < / pre >
< / div >
< small class = "fragment" >
< table >
< tr > < th > ID< / th > < th > Name< / th > < th > ...< / th > < th > Department< / th > < / tr >
< tr > < td > 123< / td > < td > Apple 6s, White< / td > < td > ...< / td > < td > Phone< / td > < / tr >
< tr > < td > 34234< / td > < td > Dell, Intel 4 core< / td > < td > ...< / td > < td > Computer< / td > < / tr >
< tr > < td > 34235< / td > < td > HP, AMD 2 core< / td > < td > ...< / td > < td class = "fragment" > $Prod.Dept_3$< / td > < / tr >
< tr > < td > ...< / td > < td > ...< / td > < td > ...< / td > < td > ...< / td > < / tr >
< / table >
< / small >
< / section >
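<section data-markdown>
<script type="text/template">
The CASE expression in the lens query can be mimicked in a few lines of Python: a known department passes through, and a missing one becomes a placeholder variable named after the column and row. The `Var` class, the `lens_query` function, and the use of a 1-based list position as the row id are hypothetical choices for this sketch; the sample rows mirror the table on the previous slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    """Placeholder (labeled null) for a value the lens model must guess."""
    name: str
    rowid: int

def lens_query(rows):
    """Mirror of the CASE expression: keep DEPARTMENT if present, else emit a Var."""
    out = []
    for rowid, row in enumerate(rows, start=1):
        dept = row["DEPARTMENT"]
        if dept is None:
            dept = Var("PRODUCTS.DEPARTMENT", rowid)
        out.append({**row, "DEPARTMENT": dept})
    return out

products_raw = [
    {"ID": 123, "NAME": "Apple 6s, White", "DEPARTMENT": "Phone"},
    {"ID": 34234, "NAME": "Dell, Intel 4 core", "DEPARTMENT": "Computer"},
    {"ID": 34235, "NAME": "HP, AMD 2 core", "DEPARTMENT": None},
]
products = lens_query(products_raw)
```

The raw rows are left untouched; only the view carries the placeholder, so the guess can be deferred until a query actually needs it.
</script>
</section>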
< section >
< div >
< h4 > The Lens Model< / h4 >
< pre > < code >
SELECT * FROM PRODUCTS_RAW;
< / code > < / pre >
< / div >
< div class = "fragment" >
< div style = "font-size: 1em; vertical-align: middle;" > ↓< / div >
< div >
< img src = "graphics/weka.png" / >
< / div >
< / div >
< div class = "fragment" >
< div style = "font-size: 1em; vertical-align: middle;" > ↓< / div >
<div><p>An estimator for each <small style="vertical-align: baseline;">$Prod.Dept_{ROWID}$</small></p></div>
< / div >
< / section >
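<section data-markdown>
<script type="text/template">
As a rough illustration of "an estimator for each missing department", the sketch below substitutes a most-frequent-value guess for the trained classifier shown on the slide. The names `fit_estimator` and `best_guess`, the use of observed frequency as a probability, and the extra sample row are all assumptions of this sketch. It also assumes at least one known value exists in the column.

```python
from collections import Counter

def fit_estimator(rows, column):
    """Learn a trivial model: the distribution of known values in `column`."""
    known = [r[column] for r in rows if r[column] is not None]
    counts = Counter(known)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def best_guess(model):
    """Return the most likely value and its estimated probability."""
    value = max(model, key=model.get)
    return value, model[value]

rows = [
    {"NAME": "Apple 6s, White", "DEPARTMENT": "Phone"},
    {"NAME": "Dell, Intel 4 core", "DEPARTMENT": "Computer"},
    {"NAME": "Example laptop", "DEPARTMENT": "Computer"},  # made-up row for the sketch
    {"NAME": "HP, AMD 2 core", "DEPARTMENT": None},
]
model = fit_estimator(rows, "DEPARTMENT")
guess, confidence = best_guess(model)
```

A real lens would likely condition on the other columns of the row; this version ignores them to keep the example short.
</script>
</section>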
< / section >
< section >
< section >
< h2 > The User's View< / h2 >
< pre > < code >
SELECT NAME, DEPARTMENT FROM PRODUCTS;
< / code > < / pre >
< table class = "fragment" data-fragment-index = "1" >
< tr > < th > Name< / th > < th > Department< / th > < / tr >
< tr > < td > Apple 6s, White< / td > < td > Phone< / td > < / tr >
< tr > < td > Dell, Intel 4 core< / td > < td > Computer< / td > < / tr >
< tr > < td > HP, AMD 2 core< / td > < td class = "fragment highlight-red" data-fragment-index = "2" > Computer< / td > < / tr >
< tr > < td > ...< / td > < td > ...< / td > < / tr >
< / table >
< p class = "fragment" data-fragment-index = "2" > < b > Simple UI:< / b > Highlight values (and rows) based on guesses.< / p >
< / section >
< section >
< pre > < code >
SELECT NAME, DEPARTMENT FROM PRODUCTS;
< / code > < / pre >
< small >
< table >
< tr > < th > Name< / th > < th > Department< / th > < / tr >
< tr > < td > Apple 6s, White< / td > < td > Phone< / td > < / tr >
< tr > < td > Dell, Intel 4 core< / td > < td > Computer< / td > < / tr >
< tr > < td > HP, AMD 2 core< / td > < td style = "color: red;" > Computer< / td > < / tr >
< tr > < td > ...< / td > < td > ...< / td > < / tr >
< / table >
< / small >
< svg xmlns = "http://www.w3.org/2000/svg" xmlns:xl = "http://www.w3.org/1999/xlink" version = "1.1" viewBox = "241 277 265 125" width = "265pt" height = "125pt" xmlns:dc = "http://purl.org/dc/elements/1.1/" class = "fragment" data-fragment-index = "1" >
< metadata > Produced by OmniGraffle 6.2.5 < dc:date > 2015-09-20 14:45:55 +0000< / dc:date > < / metadata >
< defs > < font-face font-family = "Helvetica Neue" font-size = "16" panose-1 = "2 0 8 3 0 0 0 9 0 4" units-per-em = "1000" underline-position = "-100" underline-thickness = "50" slope = "0" x-height = "517" cap-height = "714" ascent = "975.0061" descent = "-216.99524" font-weight = "bold" > < font-face-src > < font-face-name name = "HelveticaNeue-Bold" / > < / font-face-src > < / font-face > < font-face font-family = "Helvetica Neue" font-size = "16" panose-1 = "2 0 5 3 0 0 0 2 0 4" units-per-em = "1000" underline-position = "-100" underline-thickness = "50" slope = "0" x-height = "517" cap-height = "714" ascent = "951.99585" descent = "-212.99744" font-weight = "500" > < font-face-src > < font-face-name name = "HelveticaNeue" / > < / font-face-src > < / font-face > < / defs >
< g stroke = "none" stroke-opacity = "1" stroke-dasharray = "none" fill = "none" fill-opacity = "1" >
< title > Canvas 1< / title >
< g >
< title > Layer 1< / title >
< path d = "M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" fill = "white" / >
< path d = "M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" stroke = "black" stroke-linecap = "round" stroke-linejoin = "round" stroke-width = "1" / >
< text transform = "translate(293 293)" fill = "black" > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "bold" x = "0" y = "16" textLength = "16.896" class = "fragment" data-fragment-index = "2" > Pr< / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "bold" x = "16.608" y = "16" textLength = "69.28" class = "fragment" data-fragment-index = "2" > obability:< / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "500" x = "85.888" y = "16" textLength = "38.24" class = "fragment" data-fragment-index = "2" > 95%< / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "bold" x = "0" y = "53" textLength = "62.224" class = "fragment" data-fragment-index = "3" > Reason:< / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "500" x = "62.224" y = "53" textLength = "144.912" class = "fragment" data-fragment-index = "3" > Because I guessed < / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "500" x = "0" y = "71" textLength = "206.592" class = "fragment" data-fragment-index = "3" > ‘ Computer’ for ‘ Department’ < / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "500" x = "0" y = "89" textLength = "196.16" class = "fragment" data-fragment-index = "3" > on Row ‘ 3’ of ‘ PRODUCTS’ < / tspan > < / text >
< / g >
< / g >
< / svg >
< p class = "fragment" data-fragment-index = "1" > Allow users to < code > EXPLAIN< / code > uncertain outputs< / p >
< p class = "fragment" data-fragment-index = "3" > Explanations include reasons given in English< / p >
< / section >
< section >
< h3 > Other Lenses< / h3 >
< ul >
< li > Schema Matching (equivalently JSON/XML import)< / li >
< li > Archival (how stale is my data?)< / li >
< li > Type Inference< / li >
< li style = "color: grey;" > Deduplication / Entity Resolution< / li >
< li style = "color: grey;" > Schema Name Inference< / li >
< li > And more...< / li >
< / ul >
< / section >
< / section >
< section >
< section >
< h2 > Mimir Demo< / h2 >
< p > < a href = "http://demo.odin.cse.buffalo.edu" target = "_blank" > < img src = "https://odin.cse.buffalo.edu/wp-content/uploads/2015/08/Mimir_Screenshot.png" height = "400" / > < / a > < / p >
< / section >
< section >
< h2 > Intuitive Uncertainty< / h2 >
< p > < b > UB< / b > : Ying Yang, Niccolo Meneghetti, < br / > Arindam Nandi, Vinayak Karuppasamy, < br / > Oliver Kennedy, Jan Chomicki< / p >
< p > < b > Oracle< / b > : Ronny Fehling, Zhen-Hua Liu, Dieter Gawlick< / p >
< h4 > Thanks to Oracle for multiple gifts that make this research possible< / h4 >
< / section >
< / section >
< / div > < / div >
< script src = "../reveal.js-3.1.0/lib/js/head.min.js" > < / script >
< script src = "../reveal.js-3.1.0/js/reveal.js" > < / script >
< script >
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../reveal.js-3.1.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.1.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.1.0/js/MathJax.js'
},
{ src: '../reveal.js-3.1.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.1.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.1.0/plugin/notes/notes.js', async: true }
]
});
< / script >
< / body >
< / html >