<!doctype html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< title > Embracing Uncertainty< / title >
< meta name = "description" content = "Mimir" >
< meta name = "author" content = "Oliver Kennedy" >
< meta name = "apple-mobile-web-app-capable" content = "yes" / >
< meta name = "apple-mobile-web-app-status-bar-style" content = "black-translucent" / >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui" >
< link rel = "stylesheet" href = "../reveal.js-3.1.0/css/reveal.css" >
< link rel = "stylesheet" href = "ubodin.css" id = "theme" >
<!-- Code syntax highlighting -->
< link rel = "stylesheet" href = "../reveal.js-3.1.0/lib/css/zenburn.css" >
<!-- Printing and PDF exports -->
< script >
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../reveal.js-3.1.0/css/print/pdf.css' : '../reveal.js-3.1.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
< / script >
<!-- [if lt IE 9]>
< script src = "../reveal.js-3.1.0/lib/js/html5shiv.js" > < / script >
<![endif]-->
< / head >
< body >
< div class = "reveal" >
<!-- Any section element inside of this container is displayed as a slide -->
< div class = "header" >
<!-- Any Talk - Specific Header Content Goes Here -->
Embracing Uncertainty
< / div >
< div class = "footer" >
<!-- Any Talk - Specific Footer Content Goes Here -->
< div style = "float: left; margin-top: 15px; " >
Exploring < u > < b > O< / b > < / u > nline < u > < b > D< / b > < / u > ata < u > < b > In< / b > < / u > teractions
< / div >
< img src = "graphics/FullText-white.png" height = "40" style = "float: right;" / >
< / div >
< div class = "slides" >
< section >
< h4 > Embracing uncertainty with< / h4 >
< img src = "graphics/mimir_logo_final.png" >
< / section >
< section >
< h4 > Joint work with:< / h4 >
< p style = "text-align:left;" > < small >
< b > PhD Students< / b > : Ying Yang, Will Spoth, Aaron Huber, Poonam Kumari, Jon Logan< br / >
< b > BS Students< / b > : Lisa Lu, Jacob P. Verghese< br / >
< b > Alums< / b > : Arindam Nandi, Niccoló Meneghetti (HPE/Vertica), Vinayak Karuppasamy (Bloomberg)< br / >
< b > Collabs< / b > : Ronny Fehling (Airbus), Zhen-Hua Liu (Oracle), Dieter Gawlick (Oracle), Beda Hammerschmidt (Oracle),
Boris Glavic (IIT), Wolfgang Gatterbauer (CMU), Juliana Freire (NYU), Heiko Mueller (NYU), Moises Sudit (UB-ISE)
< / small > < / p >
< / section >
< section >
< section >
< h3 > A Big Data Fairy Tale< / h3 >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" / >
< h4 > Meet Alice< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" / >
< img src = "graphics/littlestorefront-800px.png" height = "300" / >
< h4 > Alice has a Store< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/littlestorefront-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > Alice's store collects sales data< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > +< / span >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > =< / span >
< img src = "graphics/saco-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > Alice wants to use her sales data to run a promotion< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > + ?< / span >
< h4 > ... asks her question ...< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > + ? →< / span >
< img src = "graphics/crystalball-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > ... and basks in the limitless possibilities of big data.< / h4 >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< / section >
< section >
< section >
< h2 > Why is this a fairy tale?< / h2 >
< / section >
< section >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > It's never this easy...< / h4 >
< / section >
< / section >
< section >
< section >
< h2 > CSV Import< / h2 >
< h4 > Run a < code > SELECT< / code > on a raw CSV File< / h4 >
< ul class = "fragment" >
< li > File may not have column headers< / li >
< li > CSV does not provide "types"< / li >
< li > Lines may be missing fields< / li >
< li > Fields may be mistyped (typo, missing comma)< / li >
< li > Comment text can be inlined into the file< / li >
< / ul >
< p class = "fragment" >
< b > State of the art< / b > : External Table Definition < span class = "fragment" > + "Manually" edit CSV< / span >
< / p >
< / section >
< section >
< h2 > Merge Two Datasets< / h2 >
< h4 > < code > UNION< / code > two data sources< / h4 >
< ul class = "fragment" >
< li > Schema matching< / li >
< li > Deduplication< / li >
< li > Format alignment (GIS coordinates, $ vs €)< / li >
< li > Precision alignment (State vs County)< / li >
< / ul >
< p class = "fragment" >
< b > State of the art< / b > : Manually map schema
< / p >
< / section >
< section >
< h2 > JSON Shredding< / h2 >
< h4 > Run a < code > SELECT< / code > on JSON or a Doc Store< / h4 >
< ul class = "fragment" >
< li > Separating fields and record sets:< br / > (e.g., < code > { A: "Bob", B: "Alice" }< / code > )< / li >
< li > Missing fields (Records with no 'address')< / li >
< li > Type alignment (Records with 'address' as an array)< / li >
< li > Schema matching$^2$< / li >
< / ul >
< p class = "fragment" >
< b > State of the art< / b > : DataGuide, Wrangler, etc...
< / p >
< / section >
< / section >
< section >
< section >
< h2 > Data Cleaning is Hard!< / h2 >
< / section >
< section >
< h3 > State of the Art< / h3 >
< img src = "graphics/BI-Analyst.jpg" height = "400" / >
< attribution > (skilledup.com)< / attribution >
< p > Alice spends weeks cleaning her data before using it.< / p >
< / section >
< section >
< h3 > Newer State of the Art< / h3 >
< img src = "graphics/iu.jpeg" height = 500 / >
< attribution > (azure.microsoft.com)< / attribution >
< / section >
< section >
< img src = "graphics/data-lake-to-data-swamp.jpg" height = 500 / >
< attribution > (timoelliott.com)< / attribution >
< / section >
< / section >
< section >
< section >
< h2 > Curation is hard!< / h2 >
< ul >
< li class = "fragment" > Structured models (RelDBs) force curation during loading.
< ul > < li class = "fragment" > < b > Problem:< / b > All curation costs are upfront.< / li > < / ul >
< / li >
< li class = "fragment" > Unstructured models (NoSQL) force curation into queries.
< ul > < li class = "fragment" > < b > Problem:< / b > Complexity/redundancy blowup in queries.< / li > < / ul >
< / li >
< / ul >
< p class = "fragment" style = "margin-top: 50px;" > Make structure and curation effort < b > On-Demand< / b > < / p >
< / section >
< section >
< h3 > Let the database make guesses!< / h3 >
< / section >
< section >
< h3 >
In the name of Codd,< br / > < span class = "fragment grow highlight-current-blue" > thou shalt not give the user a wrong answer.< / span >
< / h3 >
< h4 class = "fragment" >
... but what if we did?
< / h4 >
< h4 class = "fragment" >
What would it take for that to be ok?
< / h4 >
< / section >
< / section >
< section >
< section >
< h2 > Industry says...< / h2 >
< / section >
< section >
< img src = "graphics/maybe-screen.png" height = "500px" / >
< img src = "graphics/maybe-detail.png" height = "500px" class = "fragment" / > < br / >
< p class = "fragment" > My phone is guessing, but it lets me know that it did.< / p >
< / section >
< section >
< img src = "graphics/Calendar_Base.png" height = "500px" / >
< / section >
< section >
< img src = "graphics/Calendar_Explain.png" height = "500px" / >
< p > Easy interactions to < i > accept< / i > , < i > reject< / i > , or < i > explain< / i > uncertainty< / p >
< / section >
< section >
< img src = "graphics/BingTranslate.png" height = "400px" / >
< p > Easy access to: Provenance, Alternatives, and Confidence< / p >
< / section >
< section >
< h2 > Communication< / h2 >
< ul >
< li > Why is my data uncertain?< / li >
< li > How bad is it?< / li >
< li > What can I do about it?< / li >
< / ul >
< / section >
< section >
< h2 > What if a database did the same?< / h2 >
< / section >
< section >
< ul style = "width:35%; font-size: 24pt; margin-top: 50px; margin-bottom: 100px; margin-right: 10px" >
< li class = "fragment" > < b > A:< / b > Standard SQL.< / li >
< li class = "fragment" > < b > B:< / b > Annotated Output.< / li >
< li class = "fragment" > < b > C:< / b > Subway Diagram.< / li >
< li class = "fragment" > < b > D:< / b > Result Explanations.< / li >
< / ul >
< a href = "http://localhost:9000" >
< img src = "graphics/UIExample.png" style = "width:60%; float:right" / >
< / a >
< b > < a href = "http://localhost:9000" class = "fragment" > Demo< / a > < / b >
< / section >
< / section >
< section >
< h2 > Mimir< / h2 >
< ul >
< li > < b > Lenses< / b > : Generic, best-guess data curation operators.< / li >
< li > < b > Explanations< / b > : How certain < b > is< / b > my data?< / li >
< li > < b > Provenance< / b > : What issues still need to be fixed?< / li >
< / ul >
< / section >
< section >
< section >
< h3 > Lenses< / h3 >
< p class = "fragment" > Here's a problem with my data. < span class = "fragment" > Fix it.< / span > < / p >
< ul >
< li class = "fragment" > What type is this column? (majority vote)< / li >
< li class = "fragment" > How do the columns of these relations line up? (pick your favorite schema matching paper)< / li >
< li class = "fragment" > How do I query heterogeneous JSON objects? (see above)< / li >
< li class = "fragment" > What should these missing values be? (learning-based interpolation)< / li >
< / ul >
< / section >
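<!-- Markdown slide: assumes the reveal.js markdown plugin is loaded -->
< section data-markdown >
< script type = "text/template" >
One way to read the "majority vote" idea for type inference, as a minimal sketch: each non-empty value in a column votes for the narrowest type it parses as, and the winner becomes the guessed column type. The function names and the three candidate types (int, float, string) are assumptions made for illustration, not Mimir's actual implementation.

```python
from collections import Counter


def guess_type(value):
    """Return the narrowest type name that `value` parses as (hypothetical helper)."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_column_type(values):
    """Majority vote over non-empty values; returns (type, share of votes)."""
    votes = Counter(guess_type(v) for v in values if v != "")
    if not votes:
        # Nothing to vote on: fall back to the most permissive type.
        return "string", 0.0
    winner, count = votes.most_common(1)[0]
    return winner, count / sum(votes.values())


# One mistyped field ("x") is outvoted by three integers.
print(infer_column_type(["1", "2", "x", "4"]))
```

The share of votes is kept alongside the guess so that the guess can later be flagged as uncertain.
< / script >
< / section >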
< section >
< svg width = 500 height = 350 >
< g transform = "scale(1.2)" >
< text x = "0" y = "45" > View:< / text >
< image xlink:href = "graphics/db.svg" x = "130" y = "10" height = "50px" width = "50px" / >
< text x = "225" y = "20" style = "font-family: courier; font-size: 60%" > SELECT< / text >
< polygon
points="190,35 340,35 325,30 325,40 340,35"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "350" y = "10" height = "50px" width = "50px" / >
< / g >
< g transform = "translate(0,150) scale(1.2)" class = "fragment" >
< text x = "0" y = "45" > Lens:< / text >
< image xlink:href = "graphics/db.svg" x = "130" y = "10" height = "50px" width = "50px" / >
< text x = "225" y = "20" style = "font-family: courier; font-size: 60%" > SELECT< / text >
< polygon
points="190,35 340,35 325,30 325,40 340,35"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "350" y = "10" height = "50px" width = "50px" / >
< g class = "fragment" >
< text x = "212" y = "20" style = "font-family: courier; font-size: 60%" > [ ]< / text >
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "355" y = "15" height = "50px" width = "50px" / >
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "360" y = "20" height = "50px" width = "50px" / >
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "365" y = "25" height = "50px" width = "50px" / >
< g class = "fragment" >
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "350" y = "110" height = "60px" width = "60px" / >
< polygon
points="380,80 380,105 385,90 375,90 380,105"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< text x = "220" y = "142" style = "font-size: 60%" > (best guess)< / text >
< / g >
< / g >
< / g >
< / svg >
< p class = "fragment" > Lenses introduce < i > uncertainty< / i > < / p >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< section >
< h2 > The User's View< / h2 >
< pre > < code >
SELECT NAME, DEPARTMENT FROM PRODUCTS;
< / code > < / pre >
< table class = "fragment" data-fragment-index = "1" >
< tr > < th > Name< / th > < th > Department< / th > < / tr >
< tr > < td > Apple 6s, White< / td > < td > Phone< / td > < / tr >
< tr > < td > Dell, Intel 4 core< / td > < td > Computer< / td > < / tr >
< tr > < td > HP, AMD 2 core< / td > < td class = "fragment highlight-red" data-fragment-index = "2" > Computer< / td > < / tr >
< tr > < td > ...< / td > < td > ...< / td > < / tr >
< / table >
< p class = "fragment" data-fragment-index = "2" > < b > Simple UI:< / b > Highlight values that are based on guesses.< / p >
< / section >
< section >
< pre > < code >
SELECT NAME, DEPARTMENT FROM PRODUCTS;
< / code > < / pre >
< small >
< table >
< tr > < th > Name< / th > < th > Department< / th > < / tr >
< tr > < td > Apple 6s, White< / td > < td > Phone< / td > < / tr >
< tr > < td > Dell, Intel 4 core< / td > < td > Computer< / td > < / tr >
< tr > < td > HP, AMD 2 core< / td > < td style = "color: red;" > Computer< / td > < / tr >
< tr > < td > ...< / td > < td > ...< / td > < / tr >
< / table >
< / small >
< svg xmlns = "http://www.w3.org/2000/svg" xmlns:xl = "http://www.w3.org/1999/xlink" version = "1.1" viewBox = "241 277 265 125" width = "265pt" height = "125pt" xmlns:dc = "http://purl.org/dc/elements/1.1/" class = "fragment" data-fragment-index = "1" >
< metadata > Produced by OmniGraffle 6.2.5 < dc:date > 2015-09-20 14:45:55 +0000< / dc:date > < / metadata >
< defs > < font-face font-family = "Helvetica Neue" font-size = "16" panose-1 = "2 0 8 3 0 0 0 9 0 4" units-per-em = "1000" underline-position = "-100" underline-thickness = "50" slope = "0" x-height = "517" cap-height = "714" ascent = "975.0061" descent = "-216.99524" font-weight = "bold" > < font-face-src > < font-face-name name = "HelveticaNeue-Bold" / > < / font-face-src > < / font-face > < font-face font-family = "Helvetica Neue" font-size = "16" panose-1 = "2 0 5 3 0 0 0 2 0 4" units-per-em = "1000" underline-position = "-100" underline-thickness = "50" slope = "0" x-height = "517" cap-height = "714" ascent = "951.99585" descent = "-212.99744" font-weight = "500" > < font-face-src > < font-face-name name = "HelveticaNeue" / > < / font-face-src > < / font-face > < / defs >
< g stroke = "none" stroke-opacity = "1" stroke-dasharray = "none" fill = "none" fill-opacity = "1" >
< title > Canvas 1< / title >
< g >
< title > Layer 1< / title >
< path d = "M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" fill = "white" / >
< path d = "M 279 351 L 243 369 L 279 387 L 279 389 C 279 394.52285 283.47715 399 289 399 L 494 399 C 499.52285 399 504 394.52285 504 389 L 504 289 C 504 283.47715 499.52285 279 494 279 L 289 279 C 283.47715 279 279 283.47715 279 289 Z" stroke = "black" stroke-linecap = "round" stroke-linejoin = "round" stroke-width = "1" / >
< text transform = "translate(293 293)" fill = "black" > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "bold" x = "0" y = "16" textLength = "16.896" class = "fragment" data-fragment-index = "2" > Pr< / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "bold" x = "16.608" y = "16" textLength = "69.28" class = "fragment" data-fragment-index = "2" > obability:< / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "500" x = "85.888" y = "16" textLength = "38.24" class = "fragment" data-fragment-index = "2" > 95%< / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "bold" x = "0" y = "53" textLength = "62.224" class = "fragment" data-fragment-index = "3" > Reason:< / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "500" x = "62.224" y = "53" textLength = "144.912" class = "fragment" data-fragment-index = "3" > Because I guessed < / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "500" x = "0" y = "71" textLength = "206.592" class = "fragment" data-fragment-index = "3" > ‘ Computer’ for ‘ Department’ < / tspan > < tspan font-family = "Helvetica Neue" font-size = "16" font-weight = "500" x = "0" y = "89" textLength = "196.16" class = "fragment" data-fragment-index = "3" > on Row ‘ 3’ of ‘ PRODUCTS’ < / tspan > < / text >
< / g >
< / g >
< / svg >
< p class = "fragment" data-fragment-index = "1" > Allow users to < code > EXPLAIN< / code > uncertain outputs< / p >
< p class = "fragment" data-fragment-index = "3" > Explanations include reasons given in English< / p >
< / section >
< section >
< h3 > Explanations< / h3 >
< ol >
< li > Mark < i > uncertain< / i > data and results.< / li >
< li > Upon request, provide more detail:
< ul style = "font-size:80%; width: 600px" >
< li > Why is my data uncertain? < span style = "float:right; font-size:80%; margin-top: 5px" > (provenance)< / span > < / li >
< li > How bad is it? < span style = "float:right; font-size:80%; margin-top: 5px" > (confidence, entropy, bounds)< / span > < / li >
< li > What are other possible answers? < span style = "float:right; font-size:80%; margin-top: 5px" > (samples)< / span > < / li >
< li > What can I do to fix it? < span style = "float:right; font-size:80%; margin-top: 5px" > (repairs)< / span > < / li >
< / ul > < / li >
< / ol >
< / section >
< / section >
< section >
< section >
< h2 > Available Lenses< / h2 >
< ul >
< li > Missing Value Repair (using Weka)< / li >
< li > Schema Matching (merging datasets)< / li >
< li > Type Inference (for CSV import)< / li >
< li > Entity Extraction (for JSON import)< / li >
< li > Functional-Dependency Repair< / li >
< / ul >
< / section >
< section >
< h2 > Entity Extraction< / h2 >
< pre > < code >
{
"grad":{"students":[
{"name":"Alice","deg":"PhD","credits":"10"},
{"name":"Bob","deg":"MS"}, ...]},
"undergrad":{"students":[
{"name":"Carol"},
{"name":"Dave","deg":"U"}, ...]}
}
< / code > < / pre >
< / section >
< section >
< h2 > Entity Extraction< / h2 >
< img src = "graphics/extracted_entities.png" / >
< / section >
< section >
< h2 > Entity Extraction Lens< / h2 >
< img src = "graphics/synthesized_entities.png" / >
< / section >
< section >
< h2 > Shared Workspaces< / h2 >
< img src = "graphics/workspaces.png" / >
< / section >
< section >
< h2 > Other Efforts in Progress< / h2 >
< ul >
< li > Priority: Aggregate Queries< / li >
< li > Data Descriptors (DataGuides++)< / li >
< li > User-Interface Studies< / li >
< li > Prioritization of HITL Tasks< / li >
< li > Editable Query Results (< a href = "http://vizierdb.info/" > Vizier< / a > )< / li >
< / ul >
< / section >
< / section >
< section >
< section >
< img src = "graphics/mimir_logo_final.png" height = "200px" >
< ul >
< li > On-Demand Data Curation makes data exploration easier.< / li >
< li > "Best-Guess" results streamline analytics.
< div > ... if the DB communicates the resulting uncertainty.< / div > < / li >
< / ul >
< p class = "fragment" > < b > Questions?< / b > < / p >
< / section >
< / section >
< section >
< section >
< h1 > Backup Slides< / h1 >
< / section >
< / section >
< section >
< section >
< h2 > Mimir is a DB < u > Overlay< / u > < / h2 >
< / section >
< section >
< svg width = "500px" height = "400px" >
< g >
< g >
< image xlink:href = "graphics/db.svg" x = "10" y = "5" height = "50px" width = "50px" / >
< text x = "0" y = "80" style = "font-size:50%" > (Any DB)< / text >
< / g >
< polygon
points="0,0 120,0 105,-5 105,5 120,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< g transform = "translate(220,0)" >
< image xlink:href = "graphics/primary-queries.svg" x = "0" y = "5" height = "50px" width = "50px" / >
< text x = "0" y = "80" style = "font-size:50%" > (Lens)< / text >
< / g >
< polygon
points="0,0 100,0 85,-5 85,5 100,0"
transform="translate(290,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< g transform = "translate(400,0)" >
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "0" y = "10" height = "50px" width = "50px" / >
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "5" y = "15" height = "50px" width = "50px" / >
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "10" y = "20" height = "50px" width = "50px" / >
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "15" y = "25" height = "50px" width = "50px" / >
< / g >
< / g >
< g class = "fragment" >
< polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(200,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
< polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(245,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
< polygon
points="0,0 0,110 -5,95 5,95 0,110"
transform="translate(290,100)"
style="
stroke: red;
fill: red;
stroke-width: 4;
"
/>
< g transform = "translate(0,200)" >
< g >
< image xlink:href = "graphics/db.svg" x = "10" y = "5" height = "50px" width = "50px" / >
< text x = "0" y = "80" style = "font-size:50%" > (Any DB)< / text >
< / g >
< polygon
points="0,0 120,0 105,-5 105,5 120,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< polygon
points="0,0 120,65 105,50 105,62 120,65"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< polygon
points="0,0 120,130 105,105 105,122 120,130"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< g transform = "translate(210,0)" >
< text x = "0" y = "45" style = "font-size:50%; font-family: courier" > SELECT< / text >
< polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "200" y = "15" height = "50px" width = "50px" / >
< / g >
< g transform = "translate(210,65)" >
< text x = "0" y = "45" style = "font-size:50%; font-family: courier" > SELECT< / text >
< polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "200" y = "15" height = "50px" width = "50px" / >
< / g >
< g transform = "translate(210,130)" >
< text x = "0" y = "45" style = "font-size:50%; font-family: courier" > SELECT< / text >
< polygon
points="0,0 110,0 95,-5 95,5 110,0"
transform="translate(80,40)"
style="
stroke: black;
fill: black;
stroke-width: 2;
"
/>
< image xlink:href = "graphics/jean-victor-balin-icon-table.svg" x = "200" y = "15" height = "50px" width = "50px" / >
< / g >
< / g >
< / g >
< g transform = "translate(220,230)" class = "fragment" >
< text x = "0" y = "48" style = "font-family: courier; font-size:40%" > UNION< / text >
< text x = "0" y = "113" style = "font-family: courier; font-size:40%" > UNION< / text >
< / g >
< / svg >
< p class = "fragment" > Mimir < i > virtualizes< / i > uncertainty< / p >
< attribution > (OpenClipArt.org)< / attribution >
< / section >
< / section >
< section >
< section >
< h2 > How?< / h2 >
< / section >
< section >
< h3 > Labeled Nulls< / h3 >
< p > $Var(\ldots)$ constructs new variables< / p >
< ul >
< li class = "fragment" > $Var('X')$ constructs a new variable $X$< / li >
< li class = "fragment" > $Var('X', 1)$ constructs a new variable $X_{1}$< / li >
< li class = "fragment" > $Var('X', ROWID)$ evaluates $ROWID$ and then constructs a new variable $X_{ROWID}$< / li >
< / ul >
< / section >
< section >
< h3 > Lazy Evaluation< / h3 >
< p > Variables can't be evaluated until they are bound.< br / > So, we allow arbitrary expressions to represent data.< / p >
< ul >
< li class = "fragment" > $X$ is a legitimate data value.< / li >
< li class = "fragment" > $X+1$ is a legitimate data value.< / li >
< li class = "fragment" > $1+1$ is a legitimate data value< span class = "fragment" > , but can be reduced to $2$.< / span > < / li >
< / ul >
< p class = "fragment" > A lazy value without variables is < b > deterministic< / b > < / p >
< / section >
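<!-- Markdown slide: assumes the reveal.js markdown plugin is loaded -->
< section data-markdown >
< script type = "text/template" >
The lazy-evaluation idea can be sketched with a tiny expression tree: values are either constants, variables, or sums, and a sum is only reduced once it contains no variables. The class and function names here (`Var`, `Add`, `reduce`, `bind`) are hypothetical and only mirror the $Var(\ldots)$ notation on these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A labeled null such as X or X_1 (the index tuple is optional)."""
    name: str
    index: tuple = ()


@dataclass(frozen=True)
class Add:
    """An unevaluated sum of two expressions."""
    left: object
    right: object


def is_deterministic(expr):
    """An expression without variables is deterministic."""
    if isinstance(expr, Var):
        return False
    if isinstance(expr, Add):
        return is_deterministic(expr.left) and is_deterministic(expr.right)
    return True


def reduce(expr):
    """Evaluate as much as possible; leave variables in place."""
    if isinstance(expr, Add):
        left, right = reduce(expr.left), reduce(expr.right)
        if is_deterministic(left) and is_deterministic(right):
            return left + right
        return Add(left, right)
    return expr


def bind(expr, assignment):
    """Substitute bound variables, then reduce again."""
    if isinstance(expr, Var):
        return assignment.get(expr, expr)
    if isinstance(expr, Add):
        return reduce(Add(bind(expr.left, assignment), bind(expr.right, assignment)))
    return expr


print(reduce(Add(1, 1)))                          # 2
print(reduce(Add(Var("X"), 1)))                   # stays symbolic
print(bind(Add(Var("X"), 1), {Var("X"): 41}))     # 42
```
< / script >
< / section >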
< section >
< p > Mimir SQL allows the $Var()$ operator to be inlined< / p >
< pre > < code >
SELECT A, VAR('X', B)+2 AS C FROM R;
< / code > < / pre >
< center > < div style = "width: 600px" class = "fragment" >
< table style = "float: left" >
< thead >
< tr > < th > A< / th > < th > B< / th > < / tr >
< / thead > < tbody >
< tr > < td > 1< / td > < td > 2< / td > < / tr >
< tr > < td > 3< / td > < td > 4< / td > < / tr >
< tr > < td > 5< / td > < td > 6< / td > < / tr >
< / tbody >
< / table >
< table style = "float: right" class = "fragment" >
< tr > < th > A< / th > < th > C< / th > < / tr >
< tr > < td > 1< / td > < td > $X_2+2$< / td > < / tr >
< tr > < td > 3< / td > < td > $X_4+2$< / td > < / tr >
< tr > < td > 5< / td > < td > $X_6+2$< / td > < / tr >
< / table >
< / div > < / center >
< div style = "clear: both;" > < / div >
< / section >
< section >
< p > Selects on $Var()$ need to be deferred too...< / p >
< pre > < code >
SELECT A FROM R WHERE VAR('X', B) > 2;
< / code > < / pre >
< center > < div style = "width: 600px" >
< table style = "float: left" >
< thead >
< tr > < th > A< / th > < th > B< / th > < / tr >
< / thead > < tbody >
< tr > < td > 1< / td > < td > 2< / td > < / tr >
< tr > < td > 3< / td > < td > 4< / td > < / tr >
< tr > < td > 5< / td > < td > 6< / td > < / tr >
< / tbody >
< / table >
< table style = "float: right" class = "fragment" >
< tr > < th > A< / th > < th > $\phi$< / th > < / tr >
< tr > < td > 1< / td > < td > $X_2>2$< / td > < / tr >
< tr > < td > 3< / td > < td > $X_4>2$< / td > < / tr >
< tr > < td > 5< / td > < td > $X_6>2$< / td > < / tr >
< / table >
< / div > < / center >
< div style = "clear: both;" > < / div >
< p class = "fragment" > When evaluating the table, rows where $\phi = \bot$ are dropped.< / p >
< / section >
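<!-- Markdown slide: assumes the reveal.js markdown plugin is loaded -->
< section data-markdown >
< script type = "text/template" >
A deferred selection can be sketched as follows: instead of filtering, each row carries its condition $\phi$, and rows are dropped only once the variables are bound. The relation `R` matches the table on the previous slide; the helper names and the variable encoding `("X", b)` are assumptions made for illustration.

```python
# Rows of R as (A, B), as on the previous slide.
R = [(1, 2), (3, 4), (5, 6)]


def deferred_select(rows, threshold=2):
    """Keep every row, paired with the condition phi: X_B > threshold."""
    ctable = []
    for a, b in rows:
        var = ("X", b)  # stands for the labeled null X_b
        phi = lambda env, var=var: env[var] > threshold
        ctable.append((a, var, phi))
    return ctable


def evaluate(ctable, env):
    """Bind the variables and drop rows whose condition is false."""
    return [a for a, _var, phi in ctable if phi(env)]


ctable = deferred_select(R)
# Under this binding X_2 = 0 fails the test, X_4 = 3 and X_6 = 9 pass.
print(evaluate(ctable, {("X", 2): 0, ("X", 4): 3, ("X", 6): 9}))  # [3, 5]
```
< / script >
< / section >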
< section >
< h3 > C-Tables< / h3 >
< ul >
< li > Original Formulation < small > [Imielinski, Lipski 1981]< / small > < / li >
< li class = "fragment" > PC-Tables < small > [Green, Tannen 2006]< / small > < / li >
< li class = "fragment" > Systems< ul >
< li > Orchestra < small > [Green, Karvounarakis, Taylor, Biton, Ives, Tannen 2007]< / small > < / li >
< li > MayBMS < small > [Huang, Antova, Koch, Olteanu 2009]< / small > < / li >
< li > Pip < small > [Kennedy, Koch 2009]< / small > < / li >
< li > Sprout < small > [Fink, Hogue, Olteanu, Rath 2011]< / small > < / li >
< / ul > < / li >
< li class = "fragment" > Generalized PC-Tables < small > [Kennedy, Koch 2009]< / small > < / li >
< / ul >
< / section >
< / section >
< section >
< section >
< h2 > Labeled nulls capture a lens' uncertainty< / h2 >
< / section >
< section >
< pre > < code >
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
< / code > < / pre >
< div class = "fragment" >
< p > is (almost) the same as the query...< / p >
< pre > < code >
CREATE VIEW PRODUCTS
AS SELECT ID, NAME, ...,
CASE WHEN DEPARTMENT IS NOT NULL THEN DEPARTMENT
ELSE VAR('PRODUCTS.DEPARTMENT', ROWID)
END AS DEPARTMENT
FROM PRODUCTS_RAW;
< / code > < / pre >
< / div >
< small class = "fragment" >
< table >
< tr > < th > ID< / th > < th > Name< / th > < th > ...< / th > < th > Department< / th > < / tr >
< tr > < td > 123< / td > < td > Apple 6s, White< / td > < td > ...< / td > < td > Phone< / td > < / tr >
< tr > < td > 34234< / td > < td > Dell, Intel 4 core< / td > < td > ...< / td > < td > Computer< / td > < / tr >
< tr > < td > 34235< / td > < td > HP, AMD 2 core< / td > < td > ...< / td > < td class = "fragment" > $Prod.Dept_3$< / td > < / tr >
< tr > < td > ...< / td > < td > ...< / td > < td > ...< / td > < td > ...< / td > < / tr >
< / table >
< / small >
< / section >
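<!-- Markdown slide: assumes the reveal.js markdown plugin is loaded -->
< section data-markdown >
< script type = "text/template" >
The `CASE WHEN` rewrite above can be mimicked in a few lines, as a rough sketch: known values pass through, and each missing value is replaced by a labeled null keyed on the lens, column, and row. Treating ROWID as the 1-based position in the list, and the sample rows themselves, are assumptions for illustration.

```python
def domain_repair(rows, column, lens_name):
    """Replace missing `column` values with labeled nulls (name, rowid)."""
    repaired = []
    for rowid, row in enumerate(rows, start=1):
        out = dict(row)  # copy so the raw table stays untouched
        if out[column] is None:
            # Plays the role of VAR('PRODUCTS.DEPARTMENT', ROWID).
            out[column] = (f"{lens_name}.{column}", rowid)
        repaired.append(out)
    return repaired


products_raw = [
    {"NAME": "Apple 6s, White", "DEPARTMENT": "Phone"},
    {"NAME": "Dell, Intel 4 core", "DEPARTMENT": "Computer"},
    {"NAME": "HP, AMD 2 core", "DEPARTMENT": None},
]

for row in domain_repair(products_raw, "DEPARTMENT", "PRODUCTS"):
    print(row["NAME"], "|", row["DEPARTMENT"])
```

Only the third row ends up with a labeled null; the other rows stay deterministic.
< / script >
< / section >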
< section >
< pre > < code >
CREATE LENS PRODUCTS
AS SELECT * FROM PRODUCTS_RAW
USING DOMAIN_REPAIR(DEPARTMENT NOT NULL);
< / code > < / pre >
< div >
< p > Behind the scenes, a lens also creates a model...< / p >
< pre class = "fragment" > < code >
SELECT * FROM PRODUCTS_RAW;
< / code > < / pre >
< / div >
< div class = "fragment" >
< div style = "font-size: 1em; vertical-align: middle;" > ↓< / div >
< div >
< img src = "graphics/weka.png" / >
< / div >
< / div >
< div class = "fragment" >
< div style = "font-size: 1em; vertical-align: middle;" > ↓< / div >
< div > < p > An estimator for < small style = "vertical-align: baseline;" > $PRODUCTS.DEPARTMENT_{ROWID}$< / small > < / p > < / div >
< / div >
< / section >
< / section >
< section >
< section >
< h3 > ... but databases don't support labeled nulls< / h3 >
< / section >
< section >
< h3 > Labeled Nulls Percolate Up< / h3 >
< pre > < code >
SELECT A, VAR('X', B)+2 AS C FROM R;
< / code > < / pre >
< div class = "fragment" >
< p > Mimir dispatches this query to the DB:< / p >
< pre > < code >
SELECT A, B FROM R;
< / code > < / pre >
< / div >
< div class = "fragment" >
< p > And for each row of the result, evaluates:< / p >
< pre > < code >
SELECT A, VAR('X', B)+2 AS C FROM RESULT;
< / code > < / pre >
< / div >
< / section >
< section >
< h3 > Generating Explanations< / h3 >
< p > All uncertainty comes from labeled nulls in the expressions that Mimir evaluates for each row of the output.< / p >
< dl >
< dt > Why is the data uncertain?< / dt >
< dd > All relevant lenses referenced in < code > VAR('X', B)+2< / code > .< / dd >
< dt > How uncertain?< / dt >
< dd > Estimate by sampling from < code > VAR('X', B)< / code > .< / dd >
< dt > How do I fix it?< / dt >
< dd > Each lens fixes one well-defined type of error.< / dd >
< / dl >
< / section >
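<!-- Markdown slide: assumes the reveal.js markdown plugin is loaded -->
< section data-markdown >
< script type = "text/template" >
"Estimate by sampling" can be sketched as a small Monte Carlo loop: draw values for the labeled null from its model, evaluate the row expression on each draw, and report the most frequent outcome with its share of the samples. The toy model below (95% "Computer") only echoes the probability shown in the earlier explanation bubble; the function names are hypothetical.

```python
import random
from collections import Counter


def estimate_confidence(sample_value, expression, n=1000, seed=0):
    """Return (most frequent result, fraction of samples that produced it)."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    counts = Counter(expression(sample_value(rng)) for _ in range(n))
    best, hits = counts.most_common(1)[0]
    return best, hits / n


def department_model(rng):
    """Toy stand-in for a lens model: guesses 'Computer' 95% of the time."""
    return "Computer" if rng.random() < 0.95 else "Phone"


best, confidence = estimate_confidence(department_model, lambda dept: dept)
print(best, confidence)
```

The same loop works for any row expression: pass, for example, `lambda x: x + 2` to estimate the distribution of `VAR('X', B)+2`.
< / script >
< / section >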
< section >
< h3 > Lazy evaluation can cause problems< / h3 >
< pre > < code >
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
< / code > < / pre >
< div class = "fragment" >
< p > Mimir dispatches this query to the DB:< / p >
< pre > < code >
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S;
< / code > < / pre >
< / div >
< div class = "fragment" >
< p > And for each row of the result, evaluates:< / p >
< pre > < code >
SELECT A, C FROM RESULT WHERE VAR('X', TEMP_1) = TEMP_2;
< / code > < / pre >
< / div >
< / section >
< section >
< h3 > UDFs allow the DB to interpret labeled nulls< / h3 >
< pre > < code >
SELECT R.A, S.C FROM R, S
WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B);
< / code > < / pre >
< p class = "fragment" > ... but we lose the ability to < i > explain< / i > outputs< / p >
< / section >
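< section data-markdown >
< script type = "text/template" >
### Sketch: A Best-Guess UDF

A rough illustration of the idea, using SQLite's user-defined functions from Python. The `BEST_GUESS` table, its contents, and the fallback rule are all assumptions made for this example; only the function name and the query shape come from the previous slide.

```python
import sqlite3

# Hypothetical best-guess table: (variable name, argument) -> guessed value.
# In this sketch the labeled null attached to R.B = 2 is guessed to be 20.
BEST_GUESS = {("VARIABLE_X", 2): 20}

def mimir_vg_bestguess(name, arg):
    # Assumed fallback: use the argument itself when no guess is recorded.
    return BEST_GUESS.get((name, arg), arg)

conn = sqlite3.connect(":memory:")
# Register the Python function so the DB can call it inside the join predicate.
conn.create_function("MIMIR_VG_BESTGUESS", 2, mimir_vg_bestguess)
conn.executescript("""
    CREATE TABLE R (A TEXT, B INTEGER);
    CREATE TABLE S (B INTEGER, C TEXT);
    INSERT INTO R VALUES ('r1', 1), ('r2', 2);
    INSERT INTO S VALUES (1, 's1'), (20, 's20');
""")
rows = conn.execute("""
    SELECT R.A, S.C FROM R, S
    WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B)
    ORDER BY R.A
""").fetchall()
print(rows)
```

The join now runs entirely inside the DB, but the output rows carry no trace of which guesses they depended on.
< / script >
< / section >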
< section >
< h3 > Provenance Recovers Explanations< / h3 >
< pre > < code >
SELECT R.A, S.C FROM R, S WHERE VAR('X', R.B) = S.B;
< / code > < / pre >
< p > Mimir dispatches this query to the DB:< / p >
< pre > < code >
SELECT R.A, S.C,
R.ROWID AS ID_1, S.ROWID AS ID_2
FROM R, S
WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B);
< / code > < / pre >
< div class = "fragment" >
< p > Then to explain, Mimir dispatches the query:< / p >
< pre > < code >
SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2
FROM R, S
WHERE R.ROWID = ID_1 AND S.ROWID = ID_2;
< / code > < / pre >
< / div >
< / section >
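< section data-markdown >
< script type = "text/template" >
### Sketch: Explaining a Row via ROWIDs

A small Python/SQLite sketch of the two-query pattern on the previous slide. The data and the guess (the labeled null on `R.B = 2` resolves to 20) are hypothetical; `explain` is an illustrative helper, not a Mimir API.

```python
import sqlite3

def mimir_vg_bestguess(name, arg):
    # Hypothetical guess: the labeled null attached to R.B = 2 resolves to 20.
    return 20 if (name, arg) == ("VARIABLE_X", 2) else arg

conn = sqlite3.connect(":memory:")
conn.create_function("MIMIR_VG_BESTGUESS", 2, mimir_vg_bestguess)
conn.executescript("""
    CREATE TABLE R (A TEXT, B INTEGER);
    CREATE TABLE S (B INTEGER, C TEXT);
    INSERT INTO R VALUES ('r1', 1), ('r2', 2);
    INSERT INTO S VALUES (1, 's1'), (20, 's20');
""")

# Query 1: answer the query, carrying ROWIDs along as provenance.
answers = conn.execute("""
    SELECT R.A, S.C, R.ROWID AS ID_1, S.ROWID AS ID_2 FROM R, S
    WHERE S.B = MIMIR_VG_BESTGUESS('VARIABLE_X', R.B)
    ORDER BY R.A
""").fetchall()

def explain(id_1, id_2):
    # Query 2: fetch the raw inputs of the uncertain predicate for one output row.
    return conn.execute("""
        SELECT R.A, S.C, R.B AS TEMP_1, S.B AS TEMP_2 FROM R, S
        WHERE R.ROWID = ? AND S.ROWID = ?
    """, (id_1, id_2)).fetchone()

a, c, id_1, id_2 = answers[1]
explanation = explain(id_1, id_2)
print(explanation)
```

In the explanation row, `TEMP_1` and `TEMP_2` differ, which shows that this output exists only because of the guess.
< / script >
< / section >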
< / section >
< section >
< section >
< h3 > Performance< / h3 >
< p > PDBench: TPC-H data with randomly injected FK violations.< / p >
< ul >
< li > < b > Query 1:< / b > ~TPC-H Q3; 3-way FK Join with Predicates.< / li >
< li > < b > Query 2:< / b > ~TPC-H Q6; Table Scan with Predicates.< / li >
< li > < b > Query 3:< / b > ~TPC-H Q7; 5-way Star Join with Predicates.< / li >
< / ul >
< / section >
< section >
< dl >
< dt > Partition:< / dt >
< dd > Separate query fragments compute 'certain' results and one or more classes of uncertain results.< / dd >
< dt > TupleBundle:< / dt >
< dd > Compute and summarize 10 sampled results in parallel< / dd >
< dt > Inline:< / dt >
< dd > UDFs dynamically inject best guess values into the query.< / dd >
< / dl >
< / section >
< section >
< table >
< tr > < th > Strategy< / th > < th > Q1< / th > < th > Q2< / th > < th > Q3< / th > < / tr >
< tr > < td > Inline< / td > < td > 85.5s< / td > < td > 676.6s< / td > < td > 103.3s< / td > < / tr >
< tr > < td > TupleBundle< / td > < td > 8.2s< / td > < td > 55.2s< / td > < td > 9.8s< / td > < / tr >
< tr > < td > Partition< / td > < td > &gt; 1hr< / td > < td > 739.7s< / td > < td > &gt; 1hr< / td > < / tr >
< / table >
< / section >
< / section >
< section >
< section >
< h3 > Presentation< / h3 >
< p > Participants were shown a table of 3 products with 3 ratings (e.g., Amazon, Best Buy, Walmart) each< / p >
< p > < b > Part 1< / b > : The randomly generated ratings were biased to encourage a predictable, but mildly ambiguous ordering of the three products.< / p >
< / section >
< section >
< p > < b > Part 2< / b > : We used the same randomization, but this time we marked several of the values as uncertain:< / p >
< table >
< tr > < td > Red Text< / td > < td > < span style = "color: red" > value< / span > < / td > < / tr >
< tr > < td > Red Background< / td > < td > < span style = "background-color: red" > value< / span > < / td > < / tr >
< tr > < td > Asterisk< / td > < td > $value*$< / td > < / tr >
< tr > < td > Tolerance< / td > < td > $value \pm tolerance$< / td > < / tr >
< tr > < td > Range< / td > < td > $low$ – $high$< / td > < / tr >
< / table >
< / section >
< section >
< h3 > Probability of Agreement With Elicited Order< / h3 >
< img src = "graphics/interfaces.png" / >
< / section >
< section >
< p > < b > Part 3< / b > : We asked participants to verbalize their thought process and tagged specific exclamations in the transcripts.< / p >
< img src = "graphics/contextvsUncertainty.png" height = "350px" / >
< small >
< code > CONTEXT-DOMAIN< / code > : The participant relied on the 0-5 range of reviews to infer an uncertain rating< br / >
< code > CONTEXT-ROW< / code > : The participant used other reviews for the same product to infer an uncertain rating< br / >
< code > UNCERTAINTY-IGNORED< / code > : The participant explicitly disregarded an uncertain rating< br / >
< code > UNCERTAINTY-IRRELEVANT< / code > : The participant didn't need the uncertain value.< br / >
< / small >
< / section >
< section >
< p > < b > Part 3< / b > : We asked participants to verbalize their thought process and tagged specific exclamations in the transcripts.< / p >
< img src = "graphics/ComfortvsDiscomfort.png" height = "350px" / >
< small >
< code > [DIS]COMFORT-*< / code > : The participant expressed a positive or negative emotional response.< br / >
< code > *-DATA< / code > : The emotional response pertained to the data itself.< br / >
< code > *-UNCERTAINTY< / code > : The emotional response pertained to the uncertain values or representation.< br / >
< / small >
< / section >
< / section >
< section >
< section >
< h2 > Selection (Filtering)< / h2 >
< pre > < code >
SELECT NAME FROM PRODUCTS
WHERE DEPARTMENT='PHONE'
AND ( VENDOR='APPLE'
OR PLATFORM='ANDROID' )
< / code > < / pre >
< p class = "fragment" > Row-level uncertainty is a boolean formula $\phi$.< / p >
< p class = "fragment" >
For this query, $\phi$ can be as complex as:
< small > $$DEPT_{ROWID}='P\ldots' \wedge \left( VEND_{ROWID}='Ap\ldots' \vee PLAT_{ROWID} = 'An\ldots' \right)$$< / small > < / p >
< p class = "fragment" > < b > Too many variables! Which is the most important?< / b > < / p >
< / section >
< section >
< h2 > What is important?< / h2 >
< p class = "fragment" > Data Cleaning< / p >
< h2 class = "fragment" > Which variables are important?< / h2 >
< p class = "fragment" > The ones that keep us from knowing everything< / p >
< / section >
< section >
< p > < small > $$D_{ROWID}='P' \wedge \left( V_{ROWID}='Ap' \vee PLAT_{ROWID} = 'An' \right)$$< / small > < / p >
< div style = "font-size: 2em" > ⬍< / div >
< p > $$A \wedge (B \vee C)$$< / p >
< / section >
< section >
< h3 > Naive Approach< / h3 >
< p > Consider a game between a database and an impartial oracle.< / p >
< ul >
< li > The DB picks a variable $v$ in $\phi$ and pays a cost $c_v$.< / li >
< li > The Oracle reveals the truth value of $v$.< / li >
< li > The DB updates $\phi$ accordingly and repeats until $\phi$ is deterministic.< / li >
< / ul >
< p class = "fragment" > < b > Naive Algorithm: < / b > Pick all variables!< / p >
< p class = "fragment" > < b > Less Naive Algorithm: < / b > Minimize $E\left[\sum c_v\right]$.< / p >
< / section >
< section >
< h2 > Exponential Time Bad!< / h2 >
< / section >
< / section >
< section >
< section >
< h3 > The Value of What We Don't Know< / h3 >
< p > $$\phi = A \wedge (B \vee C)$$< / p >
< ol >
< li class = "fragment" data-fragment-index = "1" > Generate Samples for $A$, $B$, $C$< / li >
< li class = "fragment" data-fragment-index = "2" > Estimate $p(\phi)$< / li >
< li class = "fragment" data-fragment-index = "3" > Compute $H[\phi] = -p(\phi)\log_2 p(\phi) - (1-p(\phi))\log_2\left(1-p(\phi)\right)$< / li >
< / ol >
< p class = "fragment" data-fragment-index = "4" > < b > Entropy is intuitive: < / b > < br / > $H = 1$ means we know nothing, < br / > $H = 0$ means we know everything.< / p >
< / section >
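< section data-markdown >
< script type = "text/template" >
### Sketch: Sampling-Based Entropy

A minimal Python sketch of the three steps on the previous slide. It assumes A, B, C are independent with hypothetical marginals of 0.5 each, and it uses the standard binary (Shannon) entropy in bits, which is 1 when p = 0.5 and 0 when p is 0 or 1.

```python
import math
import random

def phi(a, b, c):
    # The example formula: A AND (B OR C).
    return a and (b or c)

def estimate_p(probs, n_samples=20000, seed=0):
    """Estimate p(phi) by sampling independent truth values for A, B, C."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        a, b, c = (rng.random() < p for p in probs)
        hits += phi(a, b, c)
    return hits / n_samples

def binary_entropy(p):
    """Entropy in bits of a Boolean with P(true) = p; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Hypothetical marginals for A, B, C (assumed independent).
p_hat = estimate_p((0.5, 0.5, 0.5))
h = binary_entropy(p_hat)
print(p_hat, h)
```

Under these assumptions the exact probability is 0.5 * 0.75 = 0.375, so the estimate should land close to that value.
< / script >
< / section >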
< section >
< h3 > Information Gain< / h3 >
< p > $$\mathcal I_{A \leftarrow \top} (\phi) = H\left[\phi\right] - H\left[\phi(A \leftarrow \top)\right]$$< / p >
< p > < b > Information gain of< / b > $v$: The reduction in entropy from knowing the truth value of a variable $v$.< / p >
< / section >
< section >
< h3 > Expected Information Gain< / h3 >
< p > $$\mathcal I_{A} (\phi) = \left(p(A)\cdot \mathcal I_{A\leftarrow \top}(\phi)\right) + \left(p(\neg A)\cdot \mathcal I_{A\leftarrow \bot}(\phi)\right)$$< / p >
< p > < b > Expected information gain of< / b > $v$: The probability-weighted average of the information gain for $v$ and $\neg v$.< / p >
< / section >
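< section data-markdown >
< script type = "text/template" >
### Sketch: Expected Information Gain

A small worked example of the definition on the previous slide, computed exactly by enumeration rather than by sampling. The marginals (0.5 each) and the independence of A, B, C are assumptions for illustration; entropy is the standard binary entropy in bits.

```python
import math
from itertools import product

VARS = ("A", "B", "C")

def phi(env):
    # The example formula: A AND (B OR C).
    return env["A"] and (env["B"] or env["C"])

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def prob_true(probs, fixed=None):
    """Exact p(phi) under independent marginals, with some variables fixed."""
    fixed = fixed or {}
    free = [v for v in VARS if v not in fixed]
    total = 0.0
    for values in product([True, False], repeat=len(free)):
        env = dict(fixed, **dict(zip(free, values)))
        weight = 1.0
        for v, val in zip(free, values):
            weight *= probs[v] if val else 1 - probs[v]
        if phi(env):
            total += weight
    return total

def expected_gain(probs, v):
    """Probability-weighted entropy reduction from learning v."""
    h = binary_entropy(prob_true(probs))
    gain_true = h - binary_entropy(prob_true(probs, {v: True}))
    gain_false = h - binary_entropy(prob_true(probs, {v: False}))
    return probs[v] * gain_true + (1 - probs[v]) * gain_false

probs = {"A": 0.5, "B": 0.5, "C": 0.5}  # hypothetical marginals
gains = {v: expected_gain(probs, v) for v in VARS}
print(gains)
```

With these numbers, learning A is worth far more than learning B or C, because A being false settles the formula outright.
< / script >
< / section >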
< section >
< h3 > The Cost of Perfect Information< / h3 >
< p > Combine Information Gain and Cost< / p >
< p > $$f(\mathcal I_{A}(\phi), c_A)$$< / p >
< p class = "fragment" > < b > For example: < / b > $EG2(\mathcal I_{A}(\phi), c_A) = \frac{2^{\mathcal I_{A}(\phi)} - 1}{c_A}$< / p >
< p class = "fragment" > < b > Greedy Algorithm: < / b > Maximize $f(\mathcal I_{A}(\phi), c_A)$ at each step< / p >
< / section >
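< section data-markdown >
< script type = "text/template" >
### Sketch: One Greedy Step with EG2

A minimal sketch of a single greedy step using the EG2 score from the previous slide. The gains and costs are hypothetical numbers, and the sketch assumes the step picks the variable with the largest score, that is, the most information per unit cost.

```python
def eg2(gain, cost):
    """EG2 score from the slide: (2^gain - 1) / cost."""
    return (2.0 ** gain - 1.0) / cost

def pick_next(gains, costs):
    """Greedy step (assumed here): choose the variable with the largest EG2 score."""
    return max(gains, key=lambda v: eg2(gains[v], costs[v]))

# Hypothetical expected information gains (in bits) and per-variable costs.
gains = {"A": 0.55, "B": 0.05, "C": 0.05}
costs = {"A": 20.0, "B": 1.0, "C": 2.0}

scores = {v: eg2(gains[v], costs[v]) for v in gains}
choice = pick_next(gains, costs)
print(scores, choice)
```

Here the cheap variable B wins even though A is more informative, because A's cost outweighs its extra gain.
< / script >
< / section >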
< section >
< h3 > Experimental Data< / h3 >
< ul >
< li > Start with a large dataset.< / li >
< li > Delete random fields (~50%).< / li >
< / ul >
< / section >
< section >
< h3 > Experimental Queries< / h3 >
< p > Simulate an analyst trying to manually explore correlations.< / p >
< ul >
< li > Train a tree-classifier on the base data.< / li >
< li > Convert the decision tree to a query for all rows where the tree predicts a specific value.< / li >
< / ul >
< / section >
< section >
< h3 > Cost vs Entropy: Credit Data< / h3 >
< img src = "graphics/credit_entropy.png" height = 400 / >
< p > < small >
< b > EG2:< / b > Greedy Cost/Value Ordering< br / >
< b > NMETC:< / b > Naive Minimal Expected Total Cost< br / >
< b > Random:< / b > Completely Random Order
< / small > < / p >
< / section >
< section >
< h3 > Cost vs Entropy: Product Data< / h3 >
< img src = "graphics/product_entropy.png" height = 400 / >
< p > < small >
< b > EG2:< / b > Greedy Cost/Value Ordering< br / >
< b > NMETC:< / b > Naive Minimal Expected Total Cost< br / >
< b > Random:< / b > Completely Random Order
< / small > < / p >
< / section >
< / section >
< / div > < / div >
< script src = "../reveal.js-3.1.0/lib/js/head.min.js" > < / script >
< script src = "../reveal.js-3.1.0/js/reveal.js" > < / script >
< script >
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../reveal.js-3.1.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.1.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.1.0/js/MathJax.js'
},
{ src: '../reveal.js-3.1.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.1.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.1.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.1.0/plugin/notes/notes.js', async: true }
]
});
< / script >
< / body >
< / html >