<!doctype html>
< html lang = "en" >
< head >
< meta charset = "utf-8" >
< title > Safe, Reusable Heuristic Data Transformation through Caveats< / title >
< meta name = "description" content = "Safe, Reusable Heuristic Data Transformation through Caveats" >
< meta name = "author" content = "Oliver Kennedy" >
< meta name = "apple-mobile-web-app-capable" content = "yes" / >
< meta name = "apple-mobile-web-app-status-bar-style" content = "black-translucent" / >
< meta name = "viewport" content = "width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui" >
< link rel = "stylesheet" href = "../../reveal.js-3.7.0/css/reveal.css" >
< link rel = "stylesheet" href = "ubodin.css" id = "theme" >
<!-- Code syntax highlighting -->
< link rel = "stylesheet" href = "../../reveal.js-3.7.0/lib/css/zenburn.css" >
< style type = "text/css" >
.reveal .slides section .fragment.growbig {
opacity: 1;
visibility: inherit; }
.reveal .slides section .fragment.growbig.visible {
-webkit-transform: scale(7);
transform: scale(7); }
< / style >
<!-- Printing and PDF exports -->
< script >
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? '../../reveal.js-3.7.0/css/print/pdf.css' : '../../reveal.js-3.7.0/css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
< / script >
<!-- [if lt IE 9]>
< script src = "../reveal.js-3.5.0/lib/js/html5shiv.js" > < / script >
<![endif]-->
< / head >
< body >
< div class = "reveal" >
< div class = "header" >
<!-- Any Talk - Specific Header Content Goes Here -->
< center >
< a href = "http://www.buffalo.edu" target = "_blank" >
< img src = "../graphics/logos/ub-1line-ro-white.png" height = "20" / >
< / a >
< / center >
< / div >
< div class = "footer" >
<!-- Any Talk - Specific Footer Content Goes Here -->
< div style = "float: left; margin-top: 15px; " >
Exploring < u > < b > O< / b > < / u > nline < u > < b > D< / b > < / u > ata < u > < b > In< / b > < / u > teractions
< / div >
< a href = "https://odin.cse.buffalo.edu" target = "_blank" >
< img src = "../graphics/logos/odin-1line-white.png" height = "40" style = "float: right;" / >
< / a >
< / div >
< div class = "slides" >
<!-- Any section element inside of this container is displayed as a slide -->
< section >
<!-- Credits... introduce everyone, etc... -->
< section >
< h3 >
< img src = "graphics/vizier-blue.svg" height = "100px" style = "vertical-align: middle; margin-right: 20px;" / >
< span style = "vertical-align: middle;" > VizierDB< / span >
< / h3 >
< hr / >
< h3 > A Notebook with Caveats< / h3 >
< hr / >
< h4 > Oliver Kennedy
< a href = "mailto:okennedy@buffalo.edu" style = "margin-left: 50px;" > okennedy@buffalo.edu< / a > < / h4 >
< / section >
<!--
Let me tell you a story...
- Alice/Bob heuristic alignment
- Carol/Dave data changes
- Eve/HAL data ingestion
-->
< section >
< h2 > Story Time!< / h2 >
< / section >
< / section >
< section >
< section >
< h3 > Act 1< / h3 >
< p > Alice wants to analyze two unaligned time series.< / p >
< / section >
< section >
< table style = "font-size: 60%; display: inline; padding: 50px;" >
< tr > < th > Time< / th > < th > Reading< / th > < / tr >
< tr > < td > 1575731001< / td > < td > 0< / td > < / tr >
< tr > < td > 1575731014< / td > < td > 0< / td > < / tr >
< tr > < td > 1575731030< / td > < td > 0< / td > < / tr >
< tr > < td > 1575731035< / td > < td > 0< / td > < / tr >
< tr > < td colspan = "2" > ...< / td > < / tr >
< tr > < td > 1575731219< / td > < td > 1< / td > < / tr >
< tr > < td > 1575731229< / td > < td > 1< / td > < / tr >
< tr > < td > 1575731240< / td > < td > 1< / td > < / tr >
< / table >
< table style = "font-size: 60%; display: inline; padding: 50px;" >
< tr > < th > Time< / th > < th > Reading< / th > < / tr >
< tr > < td > 1575731011< / td > < td > 0< / td > < / tr >
< tr > < td > 1575731020< / td > < td > 0< / td > < / tr >
< tr > < td > 1575731031< / td > < td > 0< / td > < / tr >
< tr > < td > 1575731039< / td > < td > 0< / td > < / tr >
< tr > < td colspan = "2" > ...< / td > < / tr >
< tr > < td > 1575731218< / td > < td > 1< / td > < / tr >
< tr > < td > 1575731228< / td > < td > 1< / td > < / tr >
< tr > < td > 1575731237< / td > < td > 1< / td > < / tr >
< / table >
< p class = "fragment" > Step 1: Line up the readings< / p >
< / section >
< section >
< h3 > Option 1: Do it right< / h3 >
< / section >
< section >
< img src = "graphics/timeseries.png" height = "500px" style = "display: inline; vertical-align: middle;" / >
< table style = "display: inline; vertical-align: middle;" >
< tr >
< td style = "padding-bottom: 50px;" class = "fragment" >
Lots of active research efforts!
< / td >
< / tr >
< tr >
< td style = "padding-top: 50px;" class = "fragment" >
... but Alice is trying to GSD!
< / td >
< / tr >
< / table >
< / section >
< section >
< h3 > Alice's Observations< / h3 >
< ul >
< li > Readings every ~10s< / li >
< li > Readings are binary< / li >
< li > Readings are incredibly stable< / li >
< / ul >
< / section >
< section >
< pre > < code class = "sql" >
INSERT INTO series_one_buckets
SELECT CAST(time / 10 AS int) AS bucket,
FIRST(reading)
FROM series_one
GROUP BY bucket;
< / code > < / pre >
< p class = "fragment" > Interpolate missing values< / p >
< p class = "fragment" > Hand-tune around the switchover as needed< / p >
< / section >
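< section data-markdown >
< script type = "text/template" >
A rough Python sketch of Alice's heuristic, for readers who prefer code to SQL. Assumptions that are not on the slides: rows are ordered by time so that "first" is well-defined, gaps are filled by carrying the previous bucket forward, and the helper names `bucket_first` and `fill_gaps` are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def bucket_first(series: List[Tuple[int, int]], width: int = 10) -> Dict[int, int]:
    """Keep the earliest reading in each `width`-second bucket."""
    buckets: Dict[int, int] = {}
    for time, reading in sorted(series):
        # setdefault only stores the first reading seen for a bucket.
        buckets.setdefault(time // width, reading)
    return buckets


def fill_gaps(buckets: Dict[int, int]) -> Dict[int, Optional[int]]:
    """Fill empty buckets with the previous bucket's reading (assumed interpolation)."""
    filled: Dict[int, Optional[int]] = {}
    last: Optional[int] = None
    for bucket in range(min(buckets), max(buckets) + 1):
        last = buckets.get(bucket, last)
        filled[bucket] = last
    return filled


# The first four rows of the left-hand table.
series_one = [(1575731001, 0), (1575731014, 0), (1575731030, 0), (1575731035, 0)]
aligned = fill_gaps(bucket_first(series_one))
```

A bucket holding two different readings silently keeps only the earlier one, and an empty bucket silently gets a guessed value.
< / script >
< / section >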
< section >
< p > Time taken: &lt; 30 minutes< / p >
< / section >
< section >
< img src = "graphics/woman-reading-book-on-beach.svg" height = "400px" / >
< attribution > FreeSVG.org< / attribution >
< / section >
< section >
< h3 > Enter Bob...< / h3 >
< p class = "fragment" > < u > Similar< / u > analysis...< / p >
< p class = "fragment" > ... < u > different< / u > data< / p >
< p class = "fragment" > Can Bob re-use Alice's prep+analytics workflow?< / p >
< / section >
< section >
< h3 > Maybe?< / h3 >
< ul >
< li class = "fragment" > Are readings still every ~10s?< / li >
< li class = "fragment" > Is the data still binary?< / li >
< li class = "fragment" > Is the data still (relatively) stable?< / li >
< / ul >
< p class = "fragment" > ... and even then, some manual effort is needed!< / p >
< / section >
< section >
< p > Bob needs to know Alice's assumptions< br / > (and how to use the workflow).< / p >
< / section >
< / section >
< section >
< section >
< h3 > Act 2< / h3 >
< p > Carol gets a dataset from Dave< / p >
< / section >
< section >
< img src = "graphics/female-computer-user.svg" height = "100px" / > < br / >
< span class = "fragment" data-fragment-index = "1" > ↓< / span > < br / >
< img src = "graphics/Binary-file-20110715.svg" height = "100px" style = "vertical-align: middle;" >
< span class = "fragment" data-fragment-index = "1" > →
< img src = "graphics/Prismatic-Cloud-Gears-2.svg" height = "100px" style = "vertical-align: middle;" / >
< / span >
< span class = "fragment" data-fragment-index = "2" > →
< img src = "graphics/moneybag.svg" height = "100px" style = "vertical-align: middle;" >
< / span >
< attribution > FreeSVG.org< / attribution >
< / section >
< section >
< p > Dave adds new data to the dataset!< / p >
< p > Can Carol re-use her workflow?< / p >
< / section >
< section >
< h3 > Maybe?< / h3 >
< ul >
< li class = "fragment" > Did the data dictionary change?< / li >
< li class = "fragment" > Did new errors get introduced?< / li >
< / ul >
< / section >
< section >
< p > Carol needs to remember her assumptions about the data, and trust that the new data is like the old data.< / p >
< / section >
< / section >
< section >
< section >
< h3 > Act 3< / h3 >
< p > Eve needs to load a CSV file< / p >
< img src = "graphics/Binary-file-20110715.svg" height = "100px" style = "vertical-align: middle;" >
→
< img src = "graphics/db.svg" height = "100px" style = "vertical-align: middle;" >
< attribution > FreeSVG.org< / attribution >
< / section >
< section >
< h3 > Scenario 1< / h3 >
< div >
< img src = "graphics/HAL9000_iconic_eye.svg" height = "150px" >
< p class = "fragment" style = "font-family: monospace;" >
I'm sorry, I can't do that, Eve.< br / >
< / p >
< p class = "fragment" style = "font-family: monospace;" >
You have a non-numerical value at position 1252538:24.
< / p >
< / div >
< attribution > FreeSVG.org< / attribution >
< / section >
< section >
< h3 > Scenario 2< / h3 >
< div >
< img src = "graphics/HAL9000_iconic_eye.svg" height = "150px" >
< p class = "fragment" style = "font-family: monospace;" data-fragment-index = "1" >
Load Successful!
< / p >
< div class = "fragment growbig" data-fragment-index = "3" >
< p class = "fragment" style = "font-family: monospace; font-size: 10%;" data-fragment-index = "2" >
(btw, 175326 records didn't load)
< / p >
< / div >
< / div >
< / section >
< section >
< p > Heuristics only work < b > most< / b > of the time.< / p >
< / section >
< / section >
< section >
<!--
Problem: Documentation Disconnected from Data
Solution: Annotate/propagate data
Formally Define Caveats
- Possible Worlds / Heuristic Choices
- Link to IDBs/PDBs
- Pick One World + Mark "Uncertain" Values
- Ex on TI-/BI-DB/CTables
- Emphasize: Don't need to be able to enumerate all worlds:
- Just need one world and the ability to decide what is certain
- Introduce the Caveat function
- Examples:
- 3 examples above
- Re-emphasize that enumeration is not required
- Propagation Overview
-->
< section >
< p > Data science is < span style = "color: lightgrey;" > nuanced< / span > .< / p >
< p class = "fragment" > Assumptions can't be avoided!< / p >
< p class = "fragment" > It's easy to miss an assumption when re-using work.< / p >
< / section >
< section >
< img src = "graphics/data_error.png" >
< attribution > < a href = "https://xkcd.com/2239/" > https://xkcd.com/2239/< / a > < / attribution >
< / section >
< section >
< h3 > Wouldn't it be nice if...< / h3 >
< img src = "graphics/montoya.jpeg" height = "400px" / >
< / section >
< section >
< h3 > Wouldn't it be nice if...< / h3 >
< p > ... this is what Bob saw:< / p >
< img src = "graphics/time_series_with_errors.svg" / >
< / section >
< section >
< h3 > Wouldn't it be nice if...< / h3 >
< p > ... this is what Carol saw:< / p >
< table >
< tr >
< td style = "
color: rgb(251, 189, 8);
background-color: #eed;
text-decoration: none;
text-decoration-color: rgb(251, 189, 8);
text-decoration-line: none;
text-decoration-style: solid;
vertical-align: middle;
border-radius: 15px 0px 0px 15px;
font-size: 150%">⚠< / td >
< td style = "
font-size: 70%;
background-color: #eee;
vertical-align: middle;
border-radius: 0px 15px 15px 0px;
padding: 20px;">
The data included an unexpected value: < b > 'Non-Hispanic White'< / b > < br / > The most similar known value is < b > 'White Non-Hispanic'< / b >
< / td >
< / tr >
< / table >
< / section >
< section >
< p > Annotate data with warnings.< / p >
< p class = "fragment" data-fragment-index = "1" > If you use this value/record, < br / > here's what you need to know!< / p >
< h3 class = "fragment" data-fragment-index = "2" > Caveat Physicus< / h3 >
< / section >
< section >
< h3 > Why?< / h3 >
< h4 class = "fragment" data-fragment-index = "1" > Propagation< / h4 >
< dl >
< dd class = "fragment" data-fragment-index = "2" style = "margin-left: -20px;" > Caveats...< / dd >
< div class = "fragment" data-fragment-index = "2" >
< dt > ... can go where the data goes< / dt >
< dd > Derived values retain caveats on source data.< / dd >
< / div >
< div class = "fragment" data-fragment-index = "3" >
< dt > ... stop where the data stops< / dt >
< dd > Irrelevant caveats don't get propagated.< / dd >
< / div >
< / dl >
< / section >
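< section data-markdown >
< script type = "text/template" >
One way to picture propagation, as a minimal sketch rather than Vizier's actual implementation. The `Caveated` type, the `add` helper, and the sample caveat messages are all hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
from typing import FrozenSet


@dataclass(frozen=True)
class Caveated:
    """A value together with the warnings attached to it (hypothetical type)."""
    value: float
    caveats: FrozenSet[str] = frozenset()


def add(a: Caveated, b: Caveated) -> Caveated:
    # A derived value inherits the caveats of every input it was computed from.
    return Caveated(a.value + b.value, a.caveats | b.caveats)


readings = [
    Caveated(1.0),
    Caveated(2.0, frozenset({"interpolated"})),
    Caveated(5.0, frozenset({"unit guessed"})),
]

# Uses all three readings, so both caveats come along.
total = reduce(add, readings)

# Never touches the third reading, so its caveat stays behind.
first_two = reduce(add, readings[:2])
```

Here `total` carries both warnings, while `first_two` carries only `"interpolated"`.
< / script >
< / section >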
< section >
< h3 > Wouldn't it be nice if...< / h3 >
< p > ... this is what Eve saw:< / p >
< img src = "graphics/caveat-spreadsheet.png" / >
< / section >
< / section >
< section >
< section >
< h3 > What is a Caveat?< / h3 >
< p class = "fragment" > A brief digression...< / p >
< / section >
< section >
< h3 > Classical Databases< / h3 >
< p class = "fragment" > One database $D$< / p >
< p class = "fragment" > Each query gets one answer $R \leftarrow Q(D)$< / p >
< / section >
< section >
< h3 > Incomplete Databases< / h3 >
< p class = "fragment" > Multiple < u > possible< / u > databases $D \in \mathcal D$< / p >
< p class = "fragment" > (possible worlds)< / p >
< p class = "fragment" > Queries get a < u > set< / u > of possible answers $\mathcal R \leftarrow \{\; Q(D) \;|\; D \in \mathcal D\;\}$< / p >
< / section >
< section >
< p class = "fragment" > < b > Certain< / b > tuples exist in all possible worlds. $$\mathrm{certain}(\mathcal R) = \bigcap_{R \in \mathcal R} R$$< / p >
< p class = "fragment" > < b > Uncertain< / b > tuples exist in at least one, < br / > but not all possible worlds. $$\mathrm{uncertain}(\mathcal R) = \left(\bigcup_{R \in \mathcal R} R\right) - \mathrm{certain}(\mathcal R)$$< / p >
< p style = "font-size: 70%;" class = "fragment" > (not limited to set semantics)< / p >
< / section >
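< section data-markdown >
< script type = "text/template" >
The two definitions above can be sketched with Python sets, assuming set semantics and a small, fully enumerated set of possible answers. The sample `(bucket, reading)` rows are invented for illustration.

```python
from typing import List, Set, Tuple

Row = Tuple[int, int]  # (bucket, reading)


def certain(results: List[Set[Row]]) -> Set[Row]:
    """Tuples that appear in every possible answer."""
    return set.intersection(*results)


def uncertain(results: List[Set[Row]]) -> Set[Row]:
    """Tuples that appear in at least one possible answer, but not in all of them."""
    return set.union(*results) - certain(results)


# Two possible answers that disagree only about bucket 101.
possible_answers = [
    {(100, 0), (101, 0), (102, 1)},
    {(100, 0), (101, 1), (102, 1)},
]

sure = certain(possible_answers)
unsure = uncertain(possible_answers)
```

Both answers agree on buckets 100 and 102, so those rows are certain; the two candidate rows for bucket 101 are uncertain.
< / script >
< / section >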
< section >
< p > A caveat is an assumption tied to one or more data elements (cells or rows).< / p >
< p > If the assumption is wrong, so is the element.< / p >
< / section >
<!--
< section >
< h3 > Alice / Bob< / h3 >
< ul >
< li > < span style = "font-family: monospace;" > FIRST< / span > may not pick the right value for a bucket with 2+ distinct values.< / li >
< li > Interpolation may not pick the right value for a bucket with 0 values.< / li >
< / ul >
< / section >
< section >
< h3 > Carol / Dave< / h3 >
< ul >
< li > The model hyperparameters may not work if the data changes too significantly.< / li >
< li > New values could indicate new data errors that Carol's ingest script hasn't accounted for.< / li >
< / ul >
< / section >
< section >
< h3 > Eve / HAL< / h3 >
< ul >
< li > Replacing a parse error with a NULL might not be what Eve expects.< / li >
< / ul >
< / section >
-->
< section >
< p > An element has a caveat → The element is uncertain.< / p >
< p class = "fragment" > ... and btw, here's why.< / p >
< / section >
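< section data-markdown >
< script type = "text/template" >
A minimal sketch of "uncertain, and here's why", applied to Eve's CSV load. This is not Vizier's loader: the `load_readings` function, the `reading` column name, and the message text are assumptions made for illustration.

```python
import csv
import io
from typing import List, Optional, Tuple


def load_readings(text: str) -> Tuple[List[Optional[float]], List[Tuple[int, str]]]:
    """Parse the `reading` column, keeping one best-guess value per row.

    Rows that fail to parse become None (NULL), and each such choice is
    recorded as a (row number, explanation) caveat instead of being dropped.
    """
    values: List[Optional[float]] = []
    caveats: List[Tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        raw = row["reading"]
        try:
            values.append(float(raw))
        except ValueError:
            values.append(None)
            caveats.append((row_number, f"could not parse {raw!r} as a number; using NULL"))
    return values, caveats


values, caveats = load_readings("time,reading\n1,0\n2,oops\n3,1\n")
```

The load still succeeds, but row 2 now comes with an explanation that Eve can inspect later.
< / script >
< / section >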
< / section >
< section >
<!--
Vizier
- Reproducibility-Focused Notebook
- Scripting
- Spreadsheets
- Point+Click
- Key Feature: Caveats
- Demo
-->
< section >
< h1 > < a href = "http://127.0.0.1:5000/vizier-db/api/v1/web-ui/vizier-db" target = "_blank" > Demo< / a > < / h1 >
< / section >
< / section >
< section >
< table style = "display: inline-block; margin-right: 100px" >
< tr >
< th colspan = "5" style = "font-size: 12pt" > Students< / th >
< / tr >
< tr height = "80px" >
< td width = "100px" >
< img src = "people/poonam.jpg" width = "70px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Poonam< br / > (PhD-4Y)< / p >
< / td >
< td width = "100px" >
< img src = "people/will.png" width = "61px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Will< br / > (PhD-3Y)< / p >
< / td >
< td width = "100px" >
< img src = "people/aaron.jpg" width = "64px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Aaron< br / > (PhD-4Y)< / p >
< / td >
< / tr >
< / table >
< table style = "display: inline-block; margin-left: 100px" >
< tr >
< th colspan = "1" style = "font-size: 12pt" > Dev< / th >
< / tr >
< tr >
< td width = "100px" >
< img src = "people/mike.jpg" width = "80px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Mike< br / > (Sr. Rsrch. Dev.)< / p >
< / td >
< / tr >
< / table >
< table style = "display: inline-block;" >
< tr >
< th colspan = "7" style = "font-size: 12pt" > Alumni< / th >
< / tr >
< tr height = "80px" >
< td width = "100px" >
< img src = "people/ying.jpg" width = "60px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Ying< br / > (PhD 2017)< / p >
< / td >
< td width = "100px" >
< img src = "people/niccolo.png" width = "50px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Niccolò< br / > (PhD 2016)< / p >
< / td >
< td width = "100px" >
< img src = "people/arindam.jpg" width = "80px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Arindam< br / > (MS 2016)< / p >
< / td >
< td width = "100px" >
< img src = "people/shivang.jpg" width = "55px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Shivang< br / > (MS 2018)< / p >
< / td >
< td width = "100px" >
< img src = "people/olivia.png" width = "50px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Olivia< br / > (BS 2017)< / p >
< / td >
< td width = "100px" >
< img src = "people/lisa.jpg" width = "71px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Lisa< br / > (BS 2018)< / p >
< / td >
< td width = "100px" >
< img src = "people/gourab.jpg" width = "80px" height = "80px" style = "margin-bottom: 0px" / >
< p style = "margin-top: 0px; font-size: 10pt;" > Gourab< br / > (MS 2018)< / p >
< / td >
< / tr >
< / table >
< table >
< tr >
< th colspan = "6" style = "font-size: 12pt" > External Collaborators< / th >
< / tr >
< tr >
< td width = "130px" style = "font-size: 10pt;" >
Zhen Hua Liu< br / > (Oracle)
< / td >
< td width = "130px" style = "font-size: 10pt;" >
Ying Lu< br / > (Oracle)
< / td >
< td width = "130px" style = "font-size: 10pt;" >
Beda Hammerschmidt< br / > (Oracle)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Boris Glavic< br / > (IIT)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Su Feng< br / > (IIT)
< / td >
< / tr >
< / table >
< table style = "margin-top: 5px" >
< tr >
< td width = "140px" style = "font-size: 10pt;" >
Juliana Freire< br / > (NYU)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Munaf Arshad Qazi< br / > (NYU)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Heiko Mueller< br / > (NYU)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Sonia Castelo Quispe< br / > (NYU)
< / td >
< td width = "140px" style = "font-size: 10pt; color: grey;" >
Carlos Bautista< br / > (NYU)
< / td >
< td width = "140px" style = "font-size: 10pt;" >
Remi Rampin< br / > (NYU)
< / td >
< / tr >
< / table >
< p style = "font-size: 10pt; text-decoration: underline;" > Vizier is supported by NSF Awards ACI-1640864 and IIS-1750460 and gifts from Oracle< / p >
< / section >
< section >
< h3 >
< img src = "graphics/vizier-blue.svg" height = "100px" style = "vertical-align: middle; margin-right: 20px;" / >
< span style = "vertical-align: middle;" > < a href = "https://vizierdb.info/" > https://vizierdb.info< / a > < / span >
< / h3 >
< pre style = "margin-top: 50px;" > < code class = "bash" >
$> pip3 install --user vizier-webapi
$> vizier
< / code > < / pre >
< p > Or get an account from me and try it out at < a href = "https://demo.vizierdb.info" > https://demo.vizierdb.info< / a > < / p >
< / section >
< / div > < / div >
< script src = "../reveal.js-3.5.0/lib/js/head.min.js" > < / script >
< script src = "../reveal.js-3.5.0/js/reveal.js" > < / script >
< script >
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: false,
progress: true,
history: true,
center: true,
slideNumber: true,
transition: 'fade', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: '../../reveal.js-3.7.0/plugin/svginline/data-src-svg.js' },
{ src: '../reveal.js-3.5.0/lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: '../reveal.js-3.5.0/plugin/math/math.js',
condition: function() { return true; },
mathjax: '../reveal.js-3.5.0/js/MathJax.js'
},
{ src: '../reveal.js-3.5.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: '../reveal.js-3.5.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
//{ src: '../reveal.js-3.5.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'tt code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../../reveal.js-3.7.0/plugin/highlight/highlight-9.16.2.js', async: true,
callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: '../reveal.js-3.5.0/plugin/zoom-js/zoom.js', async: true },
{ src: '../reveal.js-3.5.0/plugin/notes/notes.js', async: true }
]
});
< / script >
< / body >
< / html >