<!doctype html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< title > Embracing Uncertainty< / title >
< meta name = "description" content = "Mimir" >
< meta name = "author" content = "Oliver Kennedy" >
< meta name = "apple-mobile-web-app-capable" content = "yes" / >
< meta name = "apple-mobile-web-app-status-bar-style" content = "black-translucent" / >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui" >
< link rel = "stylesheet" href = "../reveal.js-3.5.0/css/reveal.css" >
< link rel = "stylesheet" href = "ubodin.css" id = "theme" >
<!-- Code syntax highlighting -->
< link rel = "stylesheet" href = "../reveal.js-3.5.0/lib/css/zenburn.css" >
<!-- Printing and PDF exports -->
< script >
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../reveal.js-3.5.0/css/print/pdf.css' : '../reveal.js-3.5.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
< / script >
<!-- [if lt IE 9]>
< script src = "../reveal.js-3.5.0/lib/js/html5shiv.js" > < / script >
<![endif]-->
< / head >
< body >
< div class = "reveal" >
<!-- Any section element inside of this container is displayed as a slide -->
< div class = "header" >
<!-- Any Talk - Specific Header Content Goes Here -->
Don't Wrangle, Guess
< / div >
< div class = "footer" >
<!-- Any Talk - Specific Footer Content Goes Here -->
< div style = "float: left; margin-top: 15px; " >
Exploring < u > < b > O< / b > < / u > nline < u > < b > D< / b > < / u > ata < u > < b > In< / b > < / u > teractions
< / div >
< img src = "graphics/FullText-white.png" height = "40" style = "float: right;" / >
< / div >
< div class = "slides" >
< section >
< h2 > Don't Wrangle, Guess Instead< / h2 >
< h4 > with< / h4 >
< img src = "graphics/mimir_logo_final.png" >
< / section >
< section >
< table >
< tr >
< th colspan = "6" style = "font-size: 12pt" > Students< / th >
< / tr >
< tr height = "80px" >
< td width = "100px" >
< img src = "people/poonam.jpg" width = "70px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Poonam< br / > (PhD-3Y)< / p >
< / td >
< td width = "100px" >
< img src = "people/will.png" width = "61px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Will< br / > (PhD-2Y)< / p >
< / td >
< td width = "100px" >
< img src = "people/aaron.jpg" width = "64px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Aaron< br / > (PhD-3Y)< / p >
< / td >
< td width = "100px" >
< img src = "people/shivang.jpg" width = "55px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Shivang< br / > (MS-2Y)< / p >
< / td >
< td width = "100px" >
< img src = "people/olivia.png" width = "50px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Olivia< br / > (BS-Sr)< / p >
< / td >
< / tr >
< / table >
< table style = "display: inline-block;" >
< tr >
< th colspan = "3" style = "font-size: 12pt" > Alumni< / th >
< / tr >
< tr height = "80px" >
< td width = "100px" >
< img src = "people/ying.jpg" width = "60px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Ying< br / > (PhD 2017)< / p >
< / td >
< td width = "100px" >
< img src = "people/niccolo.png" width = "50px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Niccolò< br / > (PhD 2016)< / p >
< / td >
< td width = "100px" >
< img src = "people/arindam.jpg" width = "80px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Arindam< br / > (MS 2016)< / p >
< / td >
< / tr >
< / table >
< table style = "display: inline-block; margin-left: 100px" >
< tr >
< th colspan = "1" style = "font-size: 12pt" > Dev< / th >
< / tr >
< tr >
< td width = "100px" >
< img src = "people/mike.jpg" width = "80px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Mike< br / > (Sr. Rsrch. Dev.)< / p >
< / td >
< / tr >
< / table >
< table >
< tr >
< th colspan = "4" style = "font-size: 12pt" > External Collaborators< / th >
< / tr >
< tr >
< td width = "130px" style = "font-size: 10pt;" >
Dieter Gawlick< br / > (Oracle)
< / td >
< td width = "130px" style = "font-size: 10pt;" >
Zhen Hua Liu< br / > (Oracle)
< / td >
< td width = "130px" style = "font-size: 10pt;" >
Ronny Fehling< br / > (Airbus)
< / td >
< td width = "130px" style = "font-size: 10pt;" >
Beda Hammerschmidt< br / > (Oracle)
< / td >
< / tr >
< / table >
< table style = "margin-top: 5px" >
< tr >
< td width = "140px" style = "font-size: 10pt;" >
Boris Glavic< br / > (IIT)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Su Feng< br / > (IIT)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Juliana Freire< br / > (NYU)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Wolfgang Gatterbauer< br / > (NEU)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Heiko Mueller< br / > (NYU)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Remi Rampin< br / > (NYU)
< / td >
< / tr >
< / table >
< / section >
< section >
< section >
< h3 > A Big Data Fairy Tale< / h3 >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" / >
< h4 > Meet Alice< / h4 >
< imagecredits > (OpenClipArt.org)< / imagecredits >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" / >
< img src = "graphics/littlestorefront-800px.png" height = "300" / >
< h4 > Alice has a Store< / h4 >
< imagecredits > (OpenClipArt.org)< / imagecredits >
< / section >
< section >
< img src = "graphics/littlestorefront-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > Alice's store collects sales data< / h4 >
< imagecredits > (OpenClipArt.org)< / imagecredits >
< / section >
< section >
< img src = "graphics/dagobert83-female-user-icon-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > +< / span >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > =< / span >
< img src = "graphics/saco-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > Alice wants to use her sales data to run a promotion< / h4 >
< imagecredits > (OpenClipArt.org)< / imagecredits >
< / section >
< section >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > So Alice loads her sales data into her trusty database/Hadoop/Spark/etc. server.< / h4 >
< imagecredits > (OpenClipArt.org)< / imagecredits >
< / section >
< section >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > + ?< / span >
< h4 > ... asks her question ...< / h4 >
< imagecredits > (OpenClipArt.org)< / imagecredits >
< / section >
< section >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > + ? →< / span >
< img src = "graphics/crystalball-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > ... and basks in the limitless possibilities of big data.< / h4 >
< imagecredits > (OpenClipArt.org)< / imagecredits >
< / section >
< / section >
< section >
< section >
< h2 > Why is this a fairy tale?< / h2 >
< / section >
< section >
< img src = "graphics/matt-icons_text-x-log-300px.png" height = "300" style = " vertical-align: middle;" / >
< span style = "font-size: 3em; vertical-align: middle;" > →< / span >
< img src = "graphics/database-server-800px.png" height = "300" style = " vertical-align: middle;" / >
< h4 > It's never this easy...< / h4 >
< / section >
< / section >
< section >
< section >
< h2 > CSV Import< / h2 >
< h4 > Run a < code > SELECT< / code > on a raw CSV File< / h4 >
< ul >
< li > File may not have column headers< / li >
< li > CSV does not provide "types"< / li >
< li > Lines may be missing fields< / li >
< li > Fields may be mistyped (typo, missing comma)< / li >
< li > Comment text can be inlined into the file< / li >
< / ul >
< p >
< b > State of the art< / b > : External table definition < span > + "Manually" edit CSV< / span >
< / p >
< / section >
< section >
< h2 > Merge Two Datasets< / h2 >
< h4 > < code > UNION< / code > two data sources< / h4 >
< ul >
< li > Schema matching< / li >
< li > Deduplication< / li >
< li > Format alignment (GIS coordinates, $ vs €)< / li >
< li > Precision alignment (State vs County)< / li >
< / ul >
< p >
< b > State of the art< / b > : Manually map schema
< / p >
< / section >
< section >
< h2 > JSON Shredding< / h2 >
< h4 > Run a < code > SELECT< / code > on JSON or a Doc Store< / h4 >
< ul >
< li > Separating fields and record sets:< br / > (e.g., < code > { A: "Bob", B: "Alice" }< / code > )< / li >
< li > Missing fields (Records with no 'address')< / li >
< li > Type alignment (Records with 'address' as an array)< / li >
< li > Schema matching$^2$< / li >
< / ul >
< p >
< b > State of the art< / b > : DataGuide, Wrangler, etc...
< / p >
< / section >
< section >
< p > Loading requires curation...< / p >
< h2 class = "fragment" > Data Curation is Hard!< / h2 >
< / section >
< section >
< h3 > State of the Art< / h3 >
< img src = "graphics/BI-Analyst.jpg" height = "400" / >
< imagecredits > (skilledup.com)< / imagecredits >
< p > Alice spends weeks curating her data before using it.< / p >
< / section >
< section >
< h3 > Relational databases make this worse...< / h3 >
< p > The data needs...< / p >
< ul >
< li > ... a complete schema (e.g., Tables, Columns, Types, ...).< / li >
< li > ... to satisfy constraints (e.g., NOT NULL, Key, F-Key).< / li >
< / ul >
< p class = "fragment" > This is all required upfront. < b > Before asking a single question< / b > .< / p >
< / section >
< / section >
< section >
< section >
< p > Relational DBs are useless in early stages of curation.< / p >
< h2 > Why?< / h2 >
< / section >
< section >
< h3 >
In the name of Codd,< br / > < span class = "fragment grow highlight-current-blue" data-fragment-index = "2" > thou shalt not give the user a wrong answer.< / span >
< / h3 >
< p class = "fragment" data-fragment-index = "1" style = "margin-top: 80px;" > There are tons of good heuristics available for < span class = "fragment highlight-current-blue" data-fragment-index = "2" > guessing< / span > how to clean data.< / p >
< / section >
< section >
< p >
Thou shalt not give the user a wrong answer.
< / p >
< h4 class = "fragment" >
... but what if we did?
< / h4 >
< h4 class = "fragment" >
What would it take for that to be ok?
< / h4 >
< / section >
< section >
< h2 > Industry says...< / h2 >
< / section >
< section >
< svg height = "500px" width = "700" >
< image xlink:href = "graphics/maybe-screen.png" x = "0" y = "0" height = "500px" width = "282" / >
< image xlink:href = "graphics/maybe-detail.png"
x="418" y="0" height="500px" width="282"
class="fragment"
data-fragment-index="2"
/>
< g style = "
fill: rgba(0,0,0,0);
stroke-width: 4;
stroke: rgba(200, 50, 50, 1);
">
< rect
x="20px" y="200px"
width="135px" height="35px"
class="fragment"
data-fragment-index="1"
/>
< circle
cx="255" cy="205" r="15"
class="fragment"
data-fragment-index="2"
/>
< polyline
points="270,205 400,205 380,195 400,205 380,215"
class="fragment"
data-fragment-index="2"
/>
< rect
x="420px" y="150px"
width="200px" height="120px"
class="fragment"
data-fragment-index="3"
/>
<!--
< rect
x="670px" y="150px"
width="25px" height="120px"
class="fragment"
data-fragment-index="4"
/>
-->
< / g >
< / svg >
< p class = "fragment" data-fragment-index = "4" > My phone is guessing, but is letting me know that it did< / p >
< imagecredits > Apple iOS 10; Phone App< / imagecredits >
< / section >
< section >
< img src = "graphics/BingTranslate.png" / >
< p class = "fragment" > Good Explanations, Alternatives, and Feedback Vectors< / p >
< imagecredits > Bing Translate (ca. 2016)< / imagecredits >
< / section >
< section >
< h2 > Communication< / h2 >
< ul >
< li > What data is uncertain?< / li >
< li > Why is my data uncertain?< / li >
< li > How bad is it?< / li >
< li > What can I do about it?< / li >
< / ul >
< / section >
< section >
< h3 > What if a database did the same?< / h3 >
< h4 class = "fragment" > (they can)< / h4 >
< / section >
< / section >
< section >
< section >
< h3 > On representing incomplete information in a relational data base< / h3 >
< h4 > T. Imielinski & W. Lipski Jr.< span style = "margin-left: 40px" > (< i > VLDB < span class = "fragment highlight-current-red" data-fragment-index = "1" > 1981< / span > < / i > )< / span > < / h4 >
< p class = "fragment" data-fragment-index = "1" style = "margin-top: 60px" >
Incomplete and Probabilistic Databases< br / > have existed since the 1980s
< / p >
< / section >
< section >
< svg width = "800" height = "500" >
< g transform = "translate(150,0)" >
< image
xlink:href="graphics/db.svg"
width="93" height="103"
x="0" y="10"
/>
< g class = "fragment" data-fragment-index = "2" >
< image
xlink:href="graphics/db.svg"
width="93" height="103"
x="0" y="130"
/>
< image
xlink:href="graphics/db.svg"
width="93" height="103"
x="0" y="250"
/>
< image
xlink:href="graphics/db.svg"
width="93" height="103"
x="0" y="370"
/>
< / g >
< / g >
< g
transform="translate(250, 0)"
class="fragment" data-fragment-index="1"
style="
fill: rgba(200, 50, 50, 0);
stroke-width: 4;
stroke: rgba(150, 150, 150, 1);
">
< polyline
points="0,60 220,60 200,50 220,60 200,70 220,60 0,60"
transform="translate(0,0)"
/>
< text x = "60" y = "50" > Q(D)< / text >
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="230" y="15"
/>
< g class = "fragment" data-fragment-index = "3" >
< polyline
points="0,60 220,60 200,50 220,60 200,70 220,60 0,60"
transform="translate(0,120)"
/>
< polyline
points="0,60 220,60 200,50 220,60 200,70 220,60 0,60"
transform="translate(0,240)"
/>
< polyline
points="0,60 220,60 200,50 220,60 200,70 220,60 0,60"
transform="translate(0,360)"
/>
< text x = "60" y = "170" > Q(D)< / text >
< text x = "60" y = "290" > Q(D)< / text >
< text x = "60" y = "410" > Q(D)< / text >
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="230" y="135"
/>
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="230" y="255"
/>
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="230" y="375"
/>
< / g >
< / g >
< g
transform="translate(0, 0)"
class="fragment" data-fragment-index="6"
style="
fill: rgba(200, 50, 50, 0);
stroke-width: 4;
stroke: rgba(150, 150, 150, 1);
">
< polyline
points="20,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(0,200) rotate(-60)"
/>
< polyline
points="70,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(-15,200) rotate(-20)"
/>
< polyline
points="70,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(25,170) rotate(20)"
/>
< polyline
points="20,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(102,220) rotate(60)"
/>
< text x = "40" y = "250" > ?< / text >
< / g >
< g
transform="translate(540, 0)"
class="fragment" data-fragment-index="4"
style="
fill: rgba(200, 50, 50, 0);
stroke-width: 4;
stroke: rgba(150, 150, 150, 1);
">
< polyline
points="20,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(102,30) rotate(60)"
/>
< polyline
points="70,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(0,120) rotate(20)"
/>
< polyline
points="70,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(-40,240) rotate(-20)"
/>
< polyline
points="20,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(0,390) rotate(-60)"
/>
< g style = "font-size: 18px; stroke-width: 0; fill: rgba(120,120,120,1); " >
< text x = "130" y = "211" > Probability< / text >
< text x = "130" y = "237" > Expectation< / text >
< text x = "130" y = "263" > Variance< / text >
< text x = "130" y = "289" > Histogram< / text >
< / g >
< g class = "fragment" data-fragment-index = "7" >
< image
xlink:href="graphics/dagobert83-female-user-icon-800px.png"
width="100" height="100"
x="110" y="190"
/>
< / g >
< / g >
< / svg >
< p class = "fragment" data-fragment-index = "5" style = "font-size: smaller" >
We've gotten good at query processing on uncertain data.< br / >
< span class = "fragment" data-fragment-index = "6" > But not sourcing uncertain data
< span class = "fragment" data-fragment-index = "7" > ... or communicating results to humans.< / span > < / span >
< / p >
< / section >
< section >
< h3 > Challenges< / h3 >
< ul >
< li > Where do probabilities/possible worlds come from?< / li >
< li > How do humans use the output of probabilistic queries?< / li >
< li class = "fragment" > Probabilistic DB queries are sloooooow.< / li >
< / ul >
< p class = "fragment" style = "font-size: smaller;" > A small shift in how we think about PDBs addresses all three points.< / p >
< / section >
< / section >
< section >
< section >
< h3 > It's not the data that's uncertain,< br / > it's the interpretation< / h3 >
< / section >
< section >
< table >
< tr >
< th > Time< / th > < th > Sensor Reading< / th > < th class = "fragment" data-fragment-index = "2" > Temp Around Sensor< / th >
< / tr >
< tr > < td > 1< / td > < td > 31.6< / td > < td class = "fragment" data-fragment-index = "2" > Roughly 31.6°C< / td > < / tr >
< tr > < td > 2< / td > < td > -999< / td > < td class = "fragment" data-fragment-index = "2" > Around 30°C?< / td > < / tr >
< tr > < td > 3< / td > < td > 28.1< / td > < td class = "fragment" data-fragment-index = "2" > Roughly 28.1°C?< / td > < / tr >
< tr > < td > 4< / td > < td > 32.2< / td > < td class = "fragment" data-fragment-index = "2" > Roughly 32.2°C< / td > < / tr >
< / table >
< p class = "fragment" data-fragment-index = "1" > The < i > reading< / i > is deterministic< / p >
< p class = "fragment" data-fragment-index = "2" > ... but what we care about is what the reading measures< / p >
< / section >
< section >
< svg width = "650" height = "500" >
< image
xlink:href="graphics/db.svg"
width="93" height="103"
x="0" y="190"
/>
< g
transform="translate(30, 0)"
class="fragment" data-fragment-index="2"
style="
fill: rgba(200, 50, 50, 0);
stroke-width: 4;
stroke: rgba(150, 150, 150, 1);
">
< polyline
points="20,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(0,170) rotate(-60)"
/>
< polyline
points="70,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(-15,170) rotate(-20)"
/>
< polyline
points="70,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(25,185) rotate(20)"
/>
< polyline
points="20,60 140,60 120,50 140,60 120,70 140,60"
transform="translate(102,240) rotate(60)"
/>
< text x = "150" y = "70" > Q< tspan style = "font-size: smaller" > 1< / tspan > (D)< / text >
< text x = "150" y = "190" > Q< tspan style = "font-size: smaller" > 2< / tspan > (D)< / text >
< text x = "150" y = "310" > Q< tspan style = "font-size: smaller" > 3< / tspan > (D)< / text >
< text x = "150" y = "430" > Q< tspan style = "font-size: smaller" > 4< / tspan > (D)< / text >
< / g >
< g transform = "translate(500,0)"
class="fragment" data-fragment-index="1"
style="
fill: rgba(200, 50, 50, 0);
stroke-width: 4;
stroke: rgba(150, 150, 150, 1);
">
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="0" y="15"
/>
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="0" y="135"
/>
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="0" y="255"
/>
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="0" y="375"
/>
< / g >
< g transform = "translate(260,0)"
class="fragment" data-fragment-index="2"
style="
fill: rgba(200, 50, 50, 0);
stroke-width: 4;
stroke: rgba(150, 150, 150, 1);
">
< polyline
points="30,60 220,60 200,50 220,60 200,70 220,60"
transform="translate(0,0)"
/>
< polyline
points="30,60 220,60 200,50 220,60 200,70 220,60"
transform="translate(0,120)"
/>
< polyline
points="30,60 220,60 200,50 220,60 200,70 220,60"
transform="translate(0,240)"
/>
< polyline
points="30,60 220,60 200,50 220,60 200,70 220,60"
transform="translate(0,360)"
/>
< / g >
< / svg >
< p style = "font-size: smaller; margin-bottom: 0px; margin-top: 0px;" > < b > Insight:< / b > Treat data as 100% deterministic.< / p >
< p
style="font-size: smaller; margin-top: 0px;"
class="fragment" data-fragment-index="2"
>Instead, queries propose alternative interpretations.< / p >
< / section >
< section >
< p > < i > ratings1.rating< / i > matches either < i > ratings2.num_ratings< / i > or < i > ratings2.evaluation< / i > .< / p >
< pre > < code >
SELECT pid, rating FROM ratings1 UNION ALL
SELECT pid, num_ratings AS rating FROM ratings2;
< / code > < / pre >
or
< pre > < code >
SELECT pid, rating FROM ratings1 UNION ALL
SELECT pid, evaluation AS rating FROM ratings2;
< / code > < / pre >
< / section >
< section >
< p > Repair missing values in < i > rating< / i > < / p >
< pre > < code >
SELECT pid, CASE WHEN rating IS NULL
THEN interpolate(...) ELSE rating END AS rating
FROM ratings;
< / code > < / pre >
or
< pre > < code >
SELECT pid, CASE WHEN rating IS NULL
THEN classifier(...) ELSE rating END AS rating
FROM ratings;
< / code > < / pre >
or ...
< / section >
< section >
< h3 > Effects< / h3 >
< ol >
< li class = "fragment" style = "margin-top: 30px;" > It's clear where uncertainty comes from.< / li >
< li class = "fragment" style = "margin-top: 30px;" > Results can be communicated through provenance.< / li >
< li class = "fragment" style = "margin-top: 30px;" > Query evaluation is decoupled from physical layout.< / li >
< / ol >
< / section >
< / section >
< section >
< section >
< h3 > Non-Deterministic Queries< / h3 >
< / section >
< section >
< svg width = "530" height = "600" >
< g
transform="translate(30, 0)"
style="
fill: rgba(200, 50, 50, 0);
stroke-width: 4;
stroke: rgba(150, 150, 150, 1);
">
< image
xlink:href="graphics/db.svg"
width="93" height="103"
x="0" y="200"
/>
< g class = "fragment" data-fragment-index = "1" >
< text x = "100" y = "240" style = "font-size: smaller;" class = "fragment fade-out" data-fragment-index = "2" > Q(D)< / text >
< polyline
points="100,250 390,250 370,240 390,250 370,260"
class="fragment fade-out" data-fragment-index="3"
/>
< polyline
points="100,250 180,250"
/>
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="400" y="200"
class="fragment fade-out" data-fragment-index="3"
/>
< / g >
< g class = "fragment" data-fragment-index = "2" >
< g class = "fragment fade-out" data-fragment-index = "3" >
< text x = "280" y = "230" > Q< tspan style = "font-size: smaller;" > 1< / tspan > (D)< / text >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="110" y="270"
/>
< text x = "150" y = "327" style = "fill: blue; font-size: larger;" > 1< / text >
< polyline
points="200,275 250,250"
/>
< / g >
< / g >
< g class = "fragment" data-fragment-index = "3" >
< text x = "280" y = "110" > Q< tspan style = "font-size: smaller;" > 2< / tspan > (D)< / text >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="110" y="80"
/>
< text x = "150" y = "137" style = "fill: blue; font-size: larger;" > 2< / text >
< polyline
points="180,250 250,140 390,140 370,130 390,140 370,150 390,140 210,140"
/>
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="400" y="80"
/>
< / g >
< g class = "fragment" data-fragment-index = "3" >
< text x = "280" y = "330" > Q< tspan style = "font-size: smaller;" > 1< / tspan > (D)< / text >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="110" y="320"
/>
< text x = "150" y = "377" style = "fill: blue; font-size: larger;" > 1< / text >
< polyline
points="180,250 250,360 390,360 370,350 390,360 370,370 390,360 210,360"
/>
< image
xlink:href="graphics/jean-victor-balin-icon-table.svg"
width="96" height="96"
x="400" y="320"
/>
< / g >
< / g >
< / svg >
< p class = "fragment" data-fragment-index = "3" >
Non-deterministic queries reference an external configuration.
< / p >
< imagecredits > (OpenClipArt.org)< / imagecredits >
< / section >
<!--
< section >
< h3 > Models for Incomplete and Probabilistic Information< / h3 >
< h4 > Green, Tannen< span style = "margin-left: 40px" > (< i > EDBT 2006< / i > )< / span > < / h4 >
< h3 style = "margin-top: 80px;" > m-tables: Representing Missing Data< / h3 >
< h4 > Bruhathi, Koutris, Lang, Naughton, Tannen< span style = "margin-left: 40px" > (< i > ICDT 2017< / i > )< / span > < / h4 >
< / section >
< section >
< svg width = "600" height = "450" >
< image
xlink:href="graphics/db.svg"
width="93" height="103"
x="100" y="340"
/>
< text x = "150" y = "30" > Q(D, Attr1, Attr2, Attr3)< / text >
< g transform = "translate(0,60)" >
< g class = "fragment" data-fragment-index = "1" >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="220" y="300"
/>
< polyline style = "
stroke-width: 4;
stroke: rgba(0, 0, 0, 1);
" points="270,300 270,260" />
< text x = "220" y = "250" style = "font-size: 60px;" > π< tspan style = "font-size: 20px;" > Attr1< / tspan > < / text >
< text x = "180" y = "180" style = "font-size: 60px;" > ⋈< / text >
< polyline style = "
fill: rgba(0,0,0,0);
stroke-width: 4;
stroke: rgba(0, 0, 0, 1);
" points="150,270 150,220 205,180 250,210" />
< / g >
< g transform = "translate(120,0)" class = "fragment" data-fragment-index = "2" >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="220" y="300"
/>
< polyline style = "
stroke-width: 4;
stroke: rgba(0, 0, 0, 1);
" points="270,300 270,260" />
< text x = "220" y = "250" style = "font-size: 60px;" > π< tspan style = "font-size: 20px;" > Attr2< / tspan > < / text >
< text x = "150" y = "110" style = "font-size: 60px;" > ⋈< / text >
< polyline style = "
fill: rgba(0,0,0,0);
stroke-width: 4;
stroke: rgba(0, 0, 0, 1);
" points="110,140 175,110 270,150 270,210" />
< / g >
< g transform = "translate(240,0)" class = "fragment" data-fragment-index = "2" >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="220" y="300"
/>
< polyline style = "
stroke-width: 4;
stroke: rgba(0, 0, 0, 1);
" points="270,300 270,260" />
< text x = "220" y = "250" style = "font-size: 60px;" > π< tspan style = "font-size: 20px;" > Attr3< / tspan > < / text >
< text x = "150" y = "40" style = "font-size: 60px;" > ⋈< / text >
< polyline style = "
fill: rgba(0,0,0,0);
stroke-width: 4;
stroke: rgba(0, 0, 0, 1);
" points="80,70 175,40 270,80 270,210" />
< polyline style = "
fill: rgba(0,0,0,0);
stroke-width: 4;
stroke: rgba(0, 0, 0, 1);
" points="175,00 175,-20" />
< / g >
< / g >
< / svg >
< p class = "fragment" data-fragment-index = "3" >
< b > Problem< / b > : $|Config| = \text{# Branch Points}$
< / p >
< / section >
< section >
< h3 > Requirements< / h3 >
< ul >
< li > Assume an open-world of branch-points.< / li >
< li > Assign each branch point an identity.< / li >
< / ul >
< / section >
-->
< section >
< h3 > VGTerms< / h3 >
< p > A $VGTerm(\ldots)$ references configuration parameters< br / > (aka "variables").< / p >
< ul >
< li class = "fragment" > $VGTerm('X')$ references the variable $X$< / li >
< li class = "fragment" > $VGTerm('X', 1)$ references the variable $X_{1}$< / li >
< li class = "fragment" > $VGTerm('X', B)$ evaluates $B$ and then references the variable $X_{B}$< / li >
< / ul >
< citation > Lenses: An On-Demand Approach to ETL; Yang et al.; VLDB 2015< / citation >
< aside class = "notes" >
This is basically a Skolem function.
< / aside >
< / section >
< section >
< p > $VGTerm()$s can be used like normal expressions< / p >
< pre > < code >
SELECT A, VGTerm('X', B) AS C FROM R;
< / code > < / pre >
< center > < div style = "width: 600px" class = "fragment" data-fragment-index = "1" >
< table style = "float: left" >
< thead >
< tr > < th style = "border-right: 1px solid;" > R< / th > < th > A< / th > < th > B< / th > < / tr >
< / thead > < tbody >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > 2< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 3< / td > < td > 4< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 5< / td > < td > 4< / td > < / tr >
< / tbody >
< / table >
< table style = "float: right" class = "fragment" data-fragment-index = "2" >
< tr > < th style = "border-right: 1px solid;" > Q(R)< / th > < th > A< / th > < th > C< / th > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > $X_2$< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 3< / td > < td > $X_4$< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 5< / td > < td > $X_4$< / td > < / tr >
< / table >
< / div > < / center >
< div style = "clear: both;" > < / div >
< p class = "fragment" data-fragment-index = "3" >
... variables are identified by a family (e.g., $'X'$),< br / > and optional indexes (e.g., $B$).
< / p >
< / section >
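< section >
< p style = "font-size: smaller;" > A minimal Python sketch of the query above, assuming a VGTerm is nothing more than a symbolic reference made of a family name plus index values. The < tt > VGTerm< / tt > class and < tt > evaluate_query< / tt > helper are hypothetical names used for illustration, not Mimir's API.< / p >

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VGTerm:
    """Hypothetical symbolic reference to one configuration variable."""
    family: str
    index: Tuple = ()

    def __str__(self) -> str:
        if not self.index:
            return self.family
        return self.family + "_" + ",".join(str(i) for i in self.index)


# The relation R(A, B) from the slide above.
R = [{"A": 1, "B": 2}, {"A": 3, "B": 4}, {"A": 5, "B": 4}]


def evaluate_query(rows):
    """SELECT A, VGTerm('X', B) AS C FROM R, keeping C symbolic."""
    return [{"A": row["A"], "C": VGTerm("X", (row["B"],))} for row in rows]


result = evaluate_query(R)
labels = [str(row["C"]) for row in result]
print(labels)  # ['X_2', 'X_4', 'X_4']
```

< p style = "font-size: smaller;" > The last two rows share $B = 4$, so they reference the same variable $X_4$.< / p >
< / section >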
<!--
< section >
< p > Mimir defines a synthetic $ROWID$ value guaranteed to be unique for each row of a query.< / p >
< pre > < code >
SELECT A, VGTerm('X', ROWID) AS C FROM R;
< / code > < / pre >
< center > < div style = "width: 600px" >
< table style = "float: left" >
< thead >
< tr > < th > R |< / th > < th > A< / th > < th > B< / th > < / tr >
< / thead > < tbody >
< tr > < td align = "right" > |< / td > < td > 1< / td > < td > 2< / td > < / tr >
< tr > < td align = "right" > |< / td > < td > 3< / td > < td > 4< / td > < / tr >
< tr > < td align = "right" > |< / td > < td > 5< / td > < td > 6< / td > < / tr >
< / tbody >
< / table >
< table style = "float: right" >
< tr > < th > A< / th > < th > C< / th > < / tr >
< tr > < td > 1< / td > < td > $X_1$< / td > < / tr >
< tr > < td > 3< / td > < td > $X_2$< / td > < / tr >
< tr > < td > 5< / td > < td > $X_3$< / td > < / tr >
< / table >
< / div > < / center >
< / section >
-->
< section >
< h3 > Schema Matching< / h3 >
< div style = "font-size: 16pt" >
$$ratings2(pid, num\_ratings, evaluation) \rightarrow (pid, rating)$$
< / div >
< pre > < code >
SELECT
pid,
CASE VGTerm('MATCH_RATING')
WHEN 'NUM_RATINGS' THEN num_ratings
WHEN 'EVALUATION' THEN evaluation
ELSE null
END AS rating
FROM ratings2;
< / code > < / pre >
< p class = "fragment" style = "font-size: 18pt;" >
One global configuration variable decides which column gets mapped to "rating".
< / p >
< / section >
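< section >
< p style = "font-size: smaller;" > A rough Python sketch of the CASE expression above, assuming a configuration is a dictionary from (family, index) pairs to values. The sample rows and the < tt > match_rating< / tt > helper are made up for illustration.< / p >

```python
# Illustrative rows for ratings2(pid, num_ratings, evaluation); the values are made up.
ratings2 = [
    {"pid": "P1", "num_ratings": 12, "evaluation": 4.5},
    {"pid": "P2", "num_ratings": 3, "evaluation": 2.0},
]


def match_rating(rows, config):
    """Mirror the CASE expression: one global variable picks the source column."""
    choice = config.get(("MATCH_RATING", ()))
    out = []
    for row in rows:
        if choice == "NUM_RATINGS":
            rating = row["num_ratings"]
        elif choice == "EVALUATION":
            rating = row["evaluation"]
        else:
            rating = None  # the ELSE branch of the CASE
        out.append({"pid": row["pid"], "rating": rating})
    return out


as_evaluation = match_rating(ratings2, {("MATCH_RATING", ()): "EVALUATION"})
as_num_ratings = match_rating(ratings2, {("MATCH_RATING", ()): "NUM_RATINGS"})
unassigned = match_rating(ratings2, {})
```

< p style = "font-size: smaller;" > Changing the single assignment flips the < tt > rating< / tt > column for every row at once.< / p >
< / section >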
< section >
< h3 > Missing Value Imputation< / h3 >
< div style = "font-size: 16pt" >
$$ratings1(pid, rating, review\_ct) \text{ s.t. } rating \text{ is not NULL}$$
< / div >
< pre > < code >
SELECT
pid,
CASE WHEN rating IS NULL
THEN VGTerm('RATING', ROWID)
ELSE rating
END AS rating,
review_ct
FROM ratings1;
< / code > < / pre >
< p class = "fragment" style = "font-size: 18pt;" >
A family of variables indexed by < tt > ROWID< / tt > represents the imputed values, one variable per row.
< / p >
< / section >
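< section >
< p style = "font-size: smaller;" > A small Python sketch of the imputation query, assuming < tt > ROWID< / tt > is simply the 1-based position of a row and that a configuration maps (family, index) pairs to values. The rows and the assigned ratings are made up.< / p >

```python
# Illustrative rows for ratings1(pid, rating, review_ct); the values are made up.
ratings1 = [
    {"pid": "P1", "rating": 4.0, "review_ct": 10},
    {"pid": "P2", "rating": None, "review_ct": 7},
    {"pid": "P3", "rating": None, "review_ct": 2},
]


def impute(rows, config):
    """Replace each NULL rating with the variable ('RATING', ROWID)."""
    out = []
    for rowid, row in enumerate(rows, start=1):  # assumption: ROWID = position
        rating = row["rating"]
        if rating is None:
            rating = config[("RATING", rowid)]
        out.append({**row, "rating": rating})
    return out


# One variable per missing value, so each row can be repaired independently.
config = {("RATING", 2): 3.5, ("RATING", 3): 1.0}
cleaned = impute(ratings1, config)
```
< / section >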
< / section >
< section >
< section >
< h2 > Defining Configurations< / h2 >
< svg width = "700" height = "400" >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="0" y="170"
/>
< text x = "10" y = "220" style = "font-size: 22px; font-weight: bold;" > Config.< / text >
< g class = "fragment" data-fragment-index = "1" >
< polyline
style="
fill: rgba(0, 0, 0, 0);
stroke: rgba(0, 0, 0, 1);
stroke-width: 3;
"
points="95,210 200,210 95,210 200,70 95,210 200,350"
/>
< image
xlink:href="graphics/Kliponius-Cardboard-box-package.svg"
width="94" height="99"
x="220" y="20"
/>
< text x = "240" y = "80" style = "font-size: 22px; font-weight: bold;" > Model< / text >
< image
xlink:href="graphics/Kliponius-Cardboard-box-package.svg"
width="94" height="99"
x="220" y="160"
/>
< text x = "240" y = "220" style = "font-size: 22px; font-weight: bold;" > Model< / text >
< image
xlink:href="graphics/Kliponius-Cardboard-box-package.svg"
width="94" height="99"
x="220" y="300"
/>
< text x = "240" y = "360" style = "font-size: 22px; font-weight: bold;" > Model< / text >
< / g >
< g >
< g class = "fragment" data-fragment-index = "2" >
< polyline
style="
fill: rgba(0, 0, 0, 0);
stroke: rgba(0, 0, 0, 1);
stroke-width: 3;
"
points="340,210 380,210"
/>
< text x = "385" y = "216" style = "font-size: 18px; font-weight: bold;" > All assignments for one family.< / text >
< / g >
< g class = "fragment" data-fragment-index = "3" >
< polyline
style="
fill: rgba(0, 0, 0, 0);
stroke: rgba(0, 0, 0, 1);
stroke-width: 3;
"
points="340,210 380,180"
/>
< text x = "385" y = "186" style = "font-size: 18px; font-weight: bold;" > Description of the family in English.< / text >
< / g >
< g class = "fragment" data-fragment-index = "4" >
< polyline
style="
fill: rgba(0, 0, 0, 0);
stroke: rgba(0, 0, 0, 1);
stroke-width: 3;
"
points="340,210 380,240"
/>
< text x = "385" y = "246" style = "font-size: 18px; font-weight: bold;" > Other feasible assignments.< / text >
< / g >
< / g >
< g class = "fragment" data-fragment-index = "4" >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="0" y="20"
/>
< text x = "10" y = "70" style = "font-size: 22px; font-weight: bold;" > Config.< / text >
< image
xlink:href="graphics/hawk88-personal-information.svg"
width="89" height="85"
x="0" y="310"
/>
< text x = "10" y = "360" style = "font-size: 22px; font-weight: bold;" > Config.< / text >
< polyline
style="
fill: rgba(0, 0, 0, 0);
stroke: rgba(0, 0, 0, 1);
stroke-width: 3;
"
points="95,70 200,210 95,350 200,350 95,70 200,70 95,350"
/>
< / g >
< g class = "fragment" data-fragment-index = "5" >
< text x = "15" y = "195" style = "font-size: 22px; font-weight: bold;" > (Best)< / text >
< / g >
< / svg >
< p class = "fragment" data-fragment-index = "5" style = "font-size: 16pt;" >
Models designate one "best-guess" configuration.
< / p >
< / section >
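< section >
< p style = "font-size: smaller;" > One way to picture a model in Python, assuming it exposes the three things in the diagram for its variable family: a best guess, other feasible assignments (here via sampling), and an English description. The < tt > Model< / tt > interface, the < tt > ScoredMatchModel< / tt > class, and the scores are hypothetical, not Mimir's actual classes.< / p >

```python
from abc import ABC, abstractmethod
import random


class Model(ABC):
    """Hypothetical interface: one model owns all variables of one family."""

    @abstractmethod
    def best_guess(self, index=()):
        """Return the best-guess assignment for the variable at `index`."""

    @abstractmethod
    def sample(self, rng, index=()):
        """Draw one feasible assignment for the variable at `index`."""

    @abstractmethod
    def reason(self, index=()):
        """Describe the variable and its current guess in English."""


class ScoredMatchModel(Model):
    """Toy schema-matching model; the candidate scores are made up."""

    def __init__(self, target, scores):
        self.target = target
        self.scores = scores  # candidate column name -> similarity score

    def best_guess(self, index=()):
        return max(self.scores, key=self.scores.get)

    def sample(self, rng, index=()):
        names = list(self.scores)
        weights = [self.scores[name] for name in names]
        return rng.choices(names, weights=weights, k=1)[0]

    def reason(self, index=()):
        return f"I guessed that {self.best_guess(index)} maps to {self.target}."


model = ScoredMatchModel("rating", {"NUM_RATINGS": 0.2, "EVALUATION": 0.8})
print(model.best_guess())
print(model.reason())
print(model.sample(random.Random(0)))
```
< / section >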
< section >
< h3 > Example Models< / h3 >
< ul >
< li > Imputation using a SparkML classifier< / li >
< li > Heuristic detection of order-by columns for interpolation< / li >
< li > Schema matching based on edit-distance< / li >
< li > MayBMS-style probabilistic repair-key< / li >
< li > And more...< / li >
< / ul >
< / section >
< section >
< h3 > Convenience Operators: Lenses< / h3 >
< p > Lenses instantiate/train a model and wrap a query< / p >
< ul style = "font-size: 16pt" >
< li > Domain Constraint Repair / Missing Value Imputation †< / li >
< li > Schema Matching †< / li >
< li > Sequence Repair< / li >
< li > Key Repair< / li >
< li > Arbitrary Choice< / li >
< li > Type Detection *< / li >
< li > Header Detection *< / li >
< li > JSON Shredder *< / li >
< / ul >
< citation >
†Lenses: An On-Demand Approach to ETL; Yang et al.; VLDB 2015< br / >
*Adaptive Schema Databases; Spoth et al.; CIDR 2017
< / citation >
< / section >
< / section >
< section >
< section >
< h3 > Probabilistic ETL< / h3 >
< / section >
< section >
< h3 > ETL: Extract/Transform/Load< / h3 >
< p class = "fragment" > One big query that gets you to a clean dataset< / p >
< p class = "fragment" style = "margin-top: 100px;" > < b > Challenge:< / b > Designing ETL pipelines can be a full-time job.< / p >
< / section >
< section >
< p > Mimir starts with the default "guess" configuration.< / p >
< p > As users explore, they validate or refine guesses for configuration variables as necessary.< / p >
< aside class = "notes" >
It's worth noting here that this allows users to use the same tool (or at least the same backend) for analytics, exploration, ... the entire analytics workflow.
< / aside >
< / section >
< section >
< h3 > Useful Provenance Questions< / h3 >
< ol >
< li > How much of my query result is affected by unvalidated variables?< / li >
< li class = "fragment" > Which variables affect my query results?< / li >
< li class = "fragment" > How bad is the situation?< / li >
< / ol >
< / section >
< / section >
< section >
< section >
< h3 > Provenance Question 1< / h3 >
< p > How much of my query result is affected by unvalidated variables?< / p >
< p class = "fragment" > < b > Idea:< / b > Mark values in query results that depend on unvalidated variables.< / p >
< / section >
< section >
< img src = "graphics/console_results.png" / >
< citation > Communicating Data Quality in On-Demand Curation; Kumari et al.; QDB 2016< / citation >
< / section >
< section >
< img src = "graphics/console_plot.png" / >
< / section >
< section >
< h3 > Non-Determinism Taint< / h3 >
< pre > < code >
SELECT A, VGTerm('X', ROWID) AS B FROM R;
< / code > < / pre >
↓ ↓ ↓ ↓
< pre > < code >
SELECT A, VGTerm('X', ROWID) AS B,
FALSE AS ROW_TAINTED,
FALSE AS A_TAINTED,
TRUE AS B_TAINTED
FROM R;
< / code > < / pre >
< p style = "font-size: smaller;" > The Mimir compiler adds < tt > *_TAINTED< / tt > fields to each row.< / p >
< / section >
< section >
< h3 > Non-Determinism Taint< / h3 >
< dl >
< dt > A row is untainted if...< / dt >
< dd > ... we can guarantee that it (or a counterpart) appears in the result regardless of configuration.< / dd >
< dt > A cell is untainted if...< / dt >
< dd > ... we can guarantee that its value in the result is independent of the configuration.< / dd >
< / dl >
< / section >
< section >
< h3 > Non-Determinism Taint< / h3 >
< pre > < code >
SELECT A, CASE WHEN B IS NULL
THEN VGTerm('X', ROWID)
ELSE B END AS B
FROM R;
< / code > < / pre >
↓ ↓ ↓ ↓
< pre > < code >
SELECT A, CASE WHEN B IS NULL
THEN VGTerm('X', ROWID)
ELSE B END AS B,
FALSE AS ROW_TAINTED, FALSE AS A_TAINTED,
(B IS NULL) AS B_TAINTED
FROM R;
< / code > < / pre >
< p style = "font-size: smaller;" > Expressions with VGTerms can be conditionally tainted.< / p >
< / section >
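< section >
< p style = "font-size: smaller;" > The rewritten query can be mimicked in Python as below. This is only a sketch: the rows are made up, < tt > ROWID< / tt > is assumed to be the row position, and < tt > best_guess< / tt > is a hypothetical stand-in for whatever value the model would supply.< / p >

```python
# Made-up input relation R(A, B) with one missing B value.
R = [{"A": 1, "B": 2}, {"A": 3, "B": None}]


def best_guess(family, index):
    """Hypothetical stand-in for a model's best-guess value."""
    return 0


def clean_with_taint(rows):
    """Evaluate the CASE expression and emit the *_TAINTED columns with it."""
    out = []
    for rowid, row in enumerate(rows, start=1):  # assumption: ROWID = position
        b_is_null = row["B"] is None
        out.append({
            "A": row["A"],
            "B": best_guess("X", rowid) if b_is_null else row["B"],
            "ROW_TAINTED": False,      # no selection, so every row survives
            "A_TAINTED": False,        # A never touches a VGTerm
            "B_TAINTED": b_is_null,    # tainted only where the VGTerm was used
        })
    return out


annotated = clean_with_taint(R)
```

< p style = "font-size: smaller;" > Only the second row's < tt > B< / tt > is marked, because only that row took the VGTerm branch.< / p >
< / section >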
<!--
< section >
< pre > < code >
CREATE VIEW R_CLEANED AS
SELECT A, CASE WHEN B IS NULL
THEN VGTerm('X', ROWID)
ELSE B END AS B
FROM R;
< / code > < / pre >
< / section >
-->
<!--
< section >
< h3 > Non-Determinism Taint< / h3 >
< pre > < code >
SELECT A FROM R_CLEANED WHERE B > 3;
< / code > < / pre >
↓ ↓ ↓ ↓
< pre > < code >
SELECT A, A_TAINTED AS A_TAINTED,
(ROW_TAINTED OR B_TAINTED) AS ROW_TAINTED
FROM R_CLEANED WHERE B > 3;
< / code > < / pre >
< p class = "fragment" style = "font-size: smaller;" > Selections can potentially taint rows.< / p >
< / section >
-->
< section >
< pre > < code >
CREATE VIEW R_CLEANED AS
SELECT A, CASE WHEN B IS NULL
THEN VGTerm('X', ROWID)
ELSE B END AS B
FROM R;
SELECT A, SUM(B) AS B FROM R_CLEANED GROUP BY A;
< / code > < / pre >
↓ ↓ ↓ ↓
< pre > < code >
SELECT A, SUM(B) AS B,
FALSE AS A_TAINTED,
GROUP_OR(B_TAINTED OR ROW_TAINTED)
OR (SELECT GROUP_OR(A_TAINTED) FROM R_CLEANED) AS B_TAINTED,
GROUP_AND(A_TAINTED OR ROW_TAINTED) AS ROW_TAINTED
FROM R_CLEANED GROUP BY A;
< / code > < / pre >
< p style = "font-size: smaller;" > Aggregates work too!< / p >
< / section >
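< section >
< p style = "font-size: smaller;" > The aggregate rule can be sketched in Python with < tt > any< / tt > and < tt > all< / tt > standing in for < tt > GROUP_OR< / tt > and < tt > GROUP_AND< / tt > . The input rows are made up and already carry their taint columns; < tt > sum_b_group_by_a< / tt > is a hypothetical helper.< / p >

```python
from collections import defaultdict

# Made-up rows of R_CLEANED, already carrying taint columns.
R_CLEANED = [
    {"A": 1, "B": 2, "ROW_TAINTED": False, "A_TAINTED": False, "B_TAINTED": False},
    {"A": 1, "B": 7, "ROW_TAINTED": False, "A_TAINTED": False, "B_TAINTED": True},
    {"A": 2, "B": 4, "ROW_TAINTED": False, "A_TAINTED": False, "B_TAINTED": False},
]


def sum_b_group_by_a(rows):
    """SELECT A, SUM(B) ... GROUP BY A, with taint columns computed alongside."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["A"]].append(row)
    # If any group-by value is uncertain, rows could move between groups,
    # so every SUM becomes uncertain (the scalar subquery on the slide).
    any_key_tainted = any(row["A_TAINTED"] for row in rows)
    out = []
    for a, members in sorted(groups.items()):
        out.append({
            "A": a,
            "B": sum(m["B"] for m in members),
            "A_TAINTED": False,
            "B_TAINTED": any(m["B_TAINTED"] or m["ROW_TAINTED"] for m in members)
                         or any_key_tainted,
            "ROW_TAINTED": all(m["A_TAINTED"] or m["ROW_TAINTED"] for m in members),
        })
    return out


result = sum_b_group_by_a(R_CLEANED)
```

< p style = "font-size: smaller;" > A sum is tainted as soon as one contributing value is; a group row is tainted only if every member's presence in the group is uncertain.< / p >
< / section >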
< section >
< h3 > Taint Benefits< / h3 >
< ul >
< li class = "fragment" > Much faster than classical Probabilistic DBs< br / > (comparable to deterministic queries).< / li >
< li class = "fragment" > At-a-glance visual of how bad your data is.< / li >
< li class = "fragment" > Can help to focus subsequent analysis.< / li >
< / ul >
< / section >
< section >
< h3 > Taint Limitations< / h3 >
< ul >
< li class = "fragment" data-fragment-index = "1" > Taint is < span style = "color: grey; font-size: smaller;" > (probably *)< / span > C-Sound, but < span style = "color: grey; font-size: smaller;" > (usually *)< / span > not C-Complete.< br / > < / li >
< li class = "fragment" > Taint on group-by aggregates can be misleading.< / li >
< li class = "fragment" > Taint does not work well with set difference.< / li >
< / ul >
< p class = "fragment" > In spite of this, taint works well in practice.< / p >
< citation class = "fragment" data-fragment-index = "1" > *Ongoing work w/ Su Feng, Aaron Huber, Boris Glavic< / citation >
< / section >
< / section >
< section >
< section >
< h3 > Provenance Question 2< / h3 >
< p > Which variables affect my query results?< / p >
< p class = "fragment" data-fragment-index = "1" > < b > Idea: < / b > Static dependency analysis produces a list of variable families and queries to generate all relevant indexes.< / p >
< citation class = "fragment" data-fragment-index = "1" > Mimir: Bringing CTables into Practice; Nandi et al.; arXiv< / citation >
< / section >
< section >
< img src = "graphics/console_analyze.png" / >
< / section >
< section >
< h3 > Provenance Question 3< / h3 >
< p > How bad is the situation?< / p >
< p class = "fragment" > < b > Idea: < / b > Sample from the space of alternatives to...< / p >
< ul >
< li class = "fragment" > Estimate error, expectations, or other statistical measures.< / li >
< li class = "fragment" > Highlight other possible query results.< / li >
< li class = "fragment" > Compute sensitivity < span style = "font-size: smaller" > (Kanagal &amp; Deshpande; SIGMOD 2011)< / span > < / li >
< / ul >
< / section >
< / section >
< section >
< section >
< h2 > Sampling is slooooow< / h2 >
< / section >
< section >
< h3 > Trivial Sampling< / h3 >
< p > Evaluate the query $N$ times.< br / > Plug in samples instead of best guesses.< / p >
< div class = "fragment" style = "margin-top: 60px;" >
< h3 > Better Solutions< / h3 >
< p > Merge evaluation to mitigate redundancy.< / p >
< / div >
< / section >
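< section >
< p style = "font-size: smaller;" > Trivial sampling, sketched in Python under simplifying assumptions: the rows are made up, < tt > ROWID< / tt > is the row position, and < tt > sample_variable< / tt > is a hypothetical model that draws uniformly from three candidate values.< / p >

```python
import random

# Made-up relation R(A, B) with two missing B values.
R = [{"A": 1, "B": 2}, {"A": 1, "B": None}, {"A": 2, "B": None}]


def sample_variable(rng, family, index):
    """Hypothetical model: draw one candidate value for a variable."""
    return rng.choice([1, 2, 3])


def query_sum_b(rows, rng):
    """SUM(B), plugging a fresh sample in wherever the best guess would go."""
    total = 0
    for rowid, row in enumerate(rows, start=1):
        b = row["B"]
        total += sample_variable(rng, "X", rowid) if b is None else b
    return total


def trivial_sampling(rows, n, seed=0):
    """Evaluate the whole query n times, once per sampled configuration."""
    rng = random.Random(seed)
    return [query_sum_b(rows, rng) for _ in range(n)]


samples = trivial_sampling(R, n=100)
mean = sum(samples) / len(samples)
print(min(samples), mean, max(samples))
```

< p style = "font-size: smaller;" > Every run rescans the deterministic row as well, which is the redundancy that merged evaluation tries to remove.< / p >
< / section >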
< section >
< h3 > Sparse Encoding< / h3 >
< table >
< tr > < td >
< table >
< tr > < th style = "border-right: 1px solid;" > $R_1$< / th > < th > A< / th > < th > B< / th > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > 2< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 3< / td > < td > 4< / td > < / tr >
< tr > < td > < / td > < td > < / td > < td > < / td > < / tr >
< tr > < th style = "border-right: 1px solid;" > $R_2$< / th > < th > A< / th > < th > B< / th > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > 5< / td > < / tr >
< / table >
< / td > < td style = "vertical-align: middle;" >
➔
< / td > < td style = "vertical-align: middle;" >
< table >
< tr > < th style = "border-right: 1px solid;" > $R_{sparse}$< / th > < th > A< / th > < th > B< / th > < th > S#< / th > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > 2< / td > < td > 1< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 3< / td > < td > 4< / td > < td > 1< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > 5< / td > < td > 2< / td > < / tr >
< / table >
< / td > < / tr >
< / table >
< / section >
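< section >
< p style = "font-size: smaller;" > The sparse encoding from the table above, as a short Python sketch. Relations are plain lists of (A, B) tuples and < tt > sparse_encode< / tt > is a hypothetical helper name.< / p >

```python
# The two sampled relations from the slide, as (A, B) tuples.
R_1 = [(1, 2), (3, 4)]
R_2 = [(1, 5)]


def sparse_encode(samples):
    """Union all sampled relations, tagging each tuple with its sample number S#."""
    return [
        row + (sample_no,)
        for sample_no, relation in enumerate(samples, start=1)
        for row in relation
    ]


R_sparse = sparse_encode([R_1, R_2])
print(R_sparse)  # [(1, 2, 1), (3, 4, 1), (1, 5, 2)]
```
< / section >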
< section >
< h3 > Tuple Bundles< / h3 >
< table >
< tr > < td >
< table >
< tr > < th style = "border-right: 1px solid;" > $R_1$< / th > < th > A< / th > < th > B< / th > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > 2< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 3< / td > < td > 4< / td > < / tr >
< tr > < td > < / td > < td > < / td > < td > < / td > < / tr >
< tr > < th style = "border-right: 1px solid;" > $R_2$< / th > < th > A< / th > < th > B< / th > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > 5< / td > < / tr >
< / table >
< / td > < td style = "vertical-align: middle;" >
➔
< / td > < td style = "vertical-align: middle;" >
< table >
< tr > < th style = "border-right: 1px solid;" > $R_{bundle}$< / th > < th > A< / th > < th > B< / th > < th > $\phi$< / th > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 1< / td > < td > [2,5]< / td > < td > [T,T]< / td > < / tr >
< tr > < td style = "border-right: 1px solid;" > < / td > < td > 3< / td > < td > 4< / td > < td > [T,F]< / td > < / tr >
< / table >
< / td > < / tr >
< / table >
< / section >
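< section >
< p style = "font-size: smaller;" > A Python sketch of building the tuple bundles shown above. It assumes rows are aligned across samples by their < tt > A< / tt > value, and it stores < tt > None< / tt > for a sample in which the row is absent (the slide collapses that case to a single value). The < tt > bundle< / tt > helper is a hypothetical name.< / p >

```python
# The two sampled relations from the slide, as (A, B) tuples.
R_1 = [(1, 2), (3, 4)]
R_2 = [(1, 5)]


def bundle(samples):
    """Merge samples into one row per A, with per-sample B values and presence bits."""
    n = len(samples)
    bundles = {}
    for sample_no, relation in enumerate(samples):
        for a, b in relation:
            # Assumption: A identifies the same logical row in every sample.
            values, present = bundles.setdefault(a, ([None] * n, [False] * n))
            values[sample_no] = b
            present[sample_no] = True
    return [(a, values, present) for a, (values, present) in bundles.items()]


R_bundle = bundle([R_1, R_2])
print(R_bundle)
# [(1, [2, 5], [True, True]), (3, [4, None], [True, False])]
```

< p style = "font-size: smaller;" > Each logical row is visited once with all of its sampled values, while the presence bits play the role of $\phi$.< / p >
< / section >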
< section >
< img src = "graphics/sampling_time.png" / >
< / section >
< section >
< ul >
< li > Tuple bundles are faster on aggregates.< / li >
< li > Sparse encoding is faster on non-deterministic joins.< / li >
< / ul >
< p class = "fragment" > Which one to use?< / p >
< / section >
< section >
< h2 > Either!< / h2 >
< p class = "fragment" > Mimir isn't committed to one fixed data representation.< / p >
< p class = "fragment" style = "font-size: smaller;" > (optimization is a work in progress)< / p >
< / section >
< / section >
< section >
< img src = "graphics/mimir_logo_final.png" >
< h2 > Demo< / h2 >
< / section >
< section >
< p style = "font-size: x-large;" > < img src = "graphics/mimir_logo_final.png" height = "150px" > < br / > < a href = "http://mimirdb.info" > http://mimirdb.info< / a > < / p >
< ul style = "font-size: smaller;" >
< li > It's not the data that's uncertain, it's the interpretation.< / li >
< li > Tagged best-guess evaluation is faster and easier to understand.< / li >
< li > Not committing to one representation allows faster query processing.< / li >
< / ul >
< p > < b > Thanks!< / b > < / p >
< / section >
< / div > < / div >
< script src = "../reveal.js-3.5.0/lib/js/head.min.js" > < / script >
< script src = "../reveal.js-3.5.0/js/reveal.js" > < / script >
< script >
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../reveal.js-3.5.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.5.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.5.0/js/MathJax.js'
},
{ src: '../reveal.js-3.5.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.5.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.5.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.5.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.5.0/plugin/notes/notes.js', async: true }
]
});
< / script >
< / body >
< / html >