Slides

2019-12-09 13:22:24 -05:00 · 2019-12-09 13:22:24 -05:00 · d0922d87cd
parent 4a666977a5
commit d0922d87cd
4 changed files with 249 additions and 100 deletions
--- a/slides/reveal.js-3.7.0/plugin/highlight/highlight-9.16.2.js
+++ b/slides/reveal.js-3.7.0/plugin/highlight/highlight-9.16.2.js
--- a/slides/talks/2019-5-VizierCaveats/graphics/caveat-list.png
+++ b/slides/talks/2019-5-VizierCaveats/graphics/caveat-list.png
--- a/slides/talks/2019-5-VizierCaveats/index.html
+++ b/slides/talks/2019-5-VizierCaveats/index.html
@ -14,11 +14,11 @@

 		<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">

-		<link rel="stylesheet" href="../reveal.js-3.5.0/css/reveal.css">
+		<link rel="stylesheet" href="../../reveal.js-3.7.0/css/reveal.css">
 		<link rel="stylesheet" href="ubodin.css" id="theme">

 		<!-- Code syntax highlighting -->
-		<link rel="stylesheet" href="../reveal.js-3.5.0/lib/css/zenburn.css">
+		<link rel="stylesheet" href="../../reveal.js-3.7.0/lib/css/zenburn.css">


    <style type="text/css">
@ -35,7 +35,7 @@
 			var link = document.createElement( 'link' );
 			link.rel = 'stylesheet';
 			link.type = 'text/css';
-			link.href = window.location.search.match( /print-pdf/gi ) ? '../reveal.js-3.5.0/css/print/pdf.css' : '../reveal.js-3.5.0/css/print/paper.css';
+			link.href = window.location.search.match( /print-pdf/gi ) ? '../../reveal.js-3.7.0/css/print/pdf.css' : '../../reveal.js-3.7.0/css/print/paper.css';
 			document.getElementsByTagName( 'head' )[0].appendChild( link );
 		</script>

@ -116,6 +116,7 @@
            <tr><td>1575731228</td><td>1</td></tr>
            <tr><td>1575731237</td><td>1</td></tr>
          </table>
+          <p class="fragment">Step 1: Line up the readings</p>
        </section>

        <section>
@ -149,7 +150,7 @@
        </section>

        <section>
-          <pre><code>
+          <pre><code class="sql">
            INSERT INTO series_one_buckets
              SELECT CAST(time / 10 AS int) AS bucket, 
                     FIRST(reading)
@ -158,7 +159,7 @@
          </code></pre>
          <p class="fragment">Interpolate missing values</p>
          <p class="fragment">Hand tune around the switchover as-needed</p>
-          <pre class="fragment"><code>
+          <pre class="fragment"><code class="sql">
             SELECT a.time, a.reading AS reading_one,
                           b.reading AS reading_two
            FROM series_one_buckets a, series_two_buckets b
@ -201,7 +202,8 @@

      <section>
        <section>
-          <p>Carol gets a dataset from Dave</p>
+          <h3>Act 2</h3>
+          <p class="fragment">Carol gets a dataset from Dave</p>
        </section>

        <section>
@ -301,10 +303,7 @@
        -->
        <section>
          <p>Data science is <span style="color: lightgrey;">nuanced</span>.</p>
-        </section>
-
-        <section>
-          <p>Assumptions can't be avoided!</p>
+          <p class="fragment">Assumptions can't be avoided!</p>
          <p class="fragment">It's easy to miss an assumption when re-using work.</p> 
        </section>

@ -312,16 +311,20 @@
          <img src="graphics/montoya.jpeg" height="400px" />
        </section>

+        <section>
+          <h3>There needs to be a better way!</h3>
+        </section>
+
        <section>
          <p>Annotate data with warnings.</p>

-          <p class="fragment" data-fragment-index="1">If you use this value/record, be warned!</p>
+          <p class="fragment" data-fragment-index="1">If you use this value/record, <br/>here's what you need to know!</p>
          
          <h3 class="fragment" data-fragment-index="2">Caveat Physicus</h3>
        </section>

        <section>
-          <p>Apply a caveat when volating an assumption might...</p>
+          <p>Declare a caveat when violating an assumption might...</p>
          <ul>
            <li class="fragment">... change one or more values</li>
            <li class="fragment">... remove one or more records</li>
@ -333,7 +336,9 @@

      <section>
        <section>
-          <p>A brief digression...</p>
+          <p>So what is a caveat?</p>
+
+          <p class="fragment">A brief digression...</p>
        </section>

        <section>
@ -355,23 +360,23 @@
        <section>
           <p class="fragment"><b>Possible</b> tuples exist in at least one possible world. $$possible(\mathcal R) = \bigcup_{R \in \mathcal R} R$$</p>
          <p class="fragment"><b>Certain</b> tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$</p>
-          <p style="font-size: 70%;" class="fragment">(with generalizations beyond set semantics)</p>
+          <p style="font-size: 70%;" class="fragment">(not limited to set semantics)</p>
        </section>

        <section>
-          <pre><code>
+          <pre><code class="sql">
            SELECT setting_1, setting_2,
                   caveat(estimate, 'Only correct if phi is 42')
                     AS estimate
            FROM Simulation;
          </code></pre>
          is the same as
-          <pre><code>
+          <pre><code class="sql">
            SELECT setting_1, setting_2, estimate
            FROM Simulation;
          </code></pre>
-          <p class="fragment"><b>Caveat: </b>If it turns out that phi ≠ 42, <br/>all estimate values might be wrong.</p>
-          <p style="font-size: 70%" class="fragment">(The query annotates all <span style="font-family: monospace;">`estimate`</span> values with the caveat)</p>
+          <p class="fragment"><b>Caveat: </b>If it turns out that phi ≠ 42, <br/>all <span style="font-family: monospace;">estimate</span> values could be wrong.</p>
+          <p style="font-size: 70%" class="fragment">(The first query annotates all <span style="font-family: monospace;">`estimate`</span> values with the caveat)</p>
        </section>

        <section>
@ -391,56 +396,45 @@
        </section>

        <section>
-          <h3>Why?</h3>
-          <p>Caveats...</p>
-          <dl>
-            <div class="fragment">
-              <dt>... go where the data goes</dt>
-              <dd>Automatic propagation to derived values.</dd>
-            </div>
-
-            <div class="fragment">
-              <dt>... stop where the data stops</dt>
-              <dd>Irrelevant caveats don't get propagated</dd>
-            </div>
-          </dl>
+          <h3>Applying Caveats</h3>
+          <p class="fragment">a few examples...</p>
        </section>

        <section>
-          <p>a few examples...</p>
-        </section>
-
-        <section>
-          <pre><code>
-      SELECT bucket, 
-             CASE WHEN bucket_size > 1 THEN
-                    caveat(reading, 'a reading got offset')
-                  ELSE reading END AS reading
-      FROM (
-        SELECT CAST(time / 10 AS int) AS bucket, 
-               FIRST(reading) AS reading
-               COUNT(*) AS bucket_size
-        FROM sensor
-      )
+          <p>Mark multi-valued buckets <span class="fragment">(key repair).</span></p>
+          <pre><code class="sql" data-line-numbers="2-3">
+    SELECT bucket, 
+           CASE WHEN bucket_size > 1 THEN
+                 caveat(reading, 'Picked between two bucket values.')
+                ELSE reading END AS reading
+    FROM (
+      SELECT CAST(time / 10 AS int) AS bucket, 
+             FIRST(reading) AS reading,
+             COUNT(*) AS bucket_size
+      FROM sensor
+      GROUP BY CAST(time / 10 AS int)
+    )
          </code></pre>
-          <p class="fragment">interpolation is more complex... but similar</p>
+          <p class="fragment">Interpolation is more complex... but similar.</p>
        </section>

        <section>
-          <pre><code>
-  CASE WHEN race_ethnicity 
-    IN ('white non-hispanic', 'black non-hispanic', /* ... */)
-    THEN race_ethnicity
-    ELSE caveat(race_ethnicity, 
-                  'Unexpected race_ethnicity: ' & race_ethnicity)
-  END
+          <p>Mark unexpected values the model wasn't trained on.</p>
+          <pre><code class="sql">
+  SELECT
+    CASE WHEN race_ethnicity 
+      IN ('white non-hispanic', 'black non-hispanic', /* ... */)
+      THEN race_ethnicity
+      ELSE caveat(race_ethnicity, 
+                    'Unexpected race_ethnicity: ' & race_ethnicity)
+    END, /* ... */
+  FROM R
          </code></pre>
-          <p class="fragment">we can automate checks like this</p>
+          <p class="fragment">This check can be automated.</p>
        </section>

        <section>
          <p>Spark's CSV loader can augment tables with a $\texttt{parse_error}$ column.</p>
-          <pre><code>
+          <pre><code class="sql">
        SELECT * FROM csv_file
        WHERE 
          CASE WHEN parse_error IS NULL THEN TRUE ELSE
@ -449,6 +443,26 @@
          </code></pre>
        </section>

+
+        <section>
+          <h3>Why?</h3>
+          <h4 class="fragment" data-fragment-index="1">Propagation</h4>
+            <dl>
+            <dd class="fragment" data-fragment-index="2"  style="margin-left: -20px;">Caveats...</dd>
+
+            <div class="fragment" data-fragment-index="2">
+              <dt>... can go where the data goes</dt>
+              <dd>Derived values retain caveats on source data.</dd>
+            </div>
+
+            <div class="fragment" data-fragment-index="3">
+              <dt>... stop where the data stops</dt>
+              <dd>Irrelevant caveats don't get propagated.</dd>
+            </div>
+          </dl>
+        </section>
+
+
        <section>
          <h3>Caveats</h3>

@ -462,55 +476,61 @@

      <section>
        <section>
-          Another brief digression...
+          <h3>How are caveats propagated?</h3>
+          <p class="fragment">Another brief digression...</p>
        </section>
        <section>
          <h3>Value Annotations</h3>

-          <p style="margin-top: 50px; font-size: 70%;">
-            <b>MONDRIAN: Annotating and Querying Databases through Colors and Blocks.</b><br/>
-            Floris Geerts, Anastasios Kementsietsidis, Diego Milano
-          </p>
-          <p style="margin-top: 50px; font-size: 70%;">
+          <p style="margin-top: 50px; font-size: 70%;" class="fragment" data-fragment-index="1">
            <b>Provenance in Databases: Why, How, and <u>Where</u></b><br/>
            James Cheney, Laura Chiticariu and Wang-Chiew Tan
          </p>

-          <p>and more...</p>
+          <p style="margin-top: 50px; font-size: 70%;" class="fragment" data-fragment-index="2">
+            <b>MONDRIAN: Annotating and Querying Databases through Colors and Blocks.</b><br/>
+            Floris Geerts, Anastasios Kementsietsidis, Diego Milano
+          </p>
+
+          <p class="fragment" data-fragment-index="2">and more...</p>
        </section>

        <section>
          <h3>Value Annotations</h3>
-          <pre><code>
+          <pre><code class="sql">
            CREATE VIEW Q AS 
-              SELECT R.A, R.B+R.C AS B FROM R
+              SELECT R.A     AS X, 
+                     R.B+R.C AS Y 
+              FROM R
          </code></pre>
          <p class="fragment" style="font-size: 70%">
-            $$annot(\texttt{Q.A}, i) \leftarrow annot(\texttt{R.A}, i)$$
+            $$annot(\texttt{Q.X}[i]) \leftarrow annot(\texttt{R.A}[i])$$
          </p>
          <p class="fragment" style="font-size: 70%">
-            $$annot(\texttt{Q.B}, i) \leftarrow annot(\texttt{R.B}, i) \cup annot(\texttt{R.C}, i)$$
+            $$annot(\texttt{Q.Y}[i]) \leftarrow annot(\texttt{R.B}[i]) \cup annot(\texttt{R.C}[i])$$
          </p>
        </section>

        <section>
          <h3>Value Annotations</h3>
-          <pre><code>
+          <pre><code class="sql">
            CREATE VIEW Q AS 
-              SELECT R.A, SUM(R.B) AS B FROM R
+              SELECT R.A      AS X, 
+                     SUM(R.B) AS Y 
+              FROM R
+              GROUP BY R.A
          </code></pre>
          <p class="fragment" style="font-size: 70%">
-            $$annot(\texttt{Q.A}, i) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = Q.A[i]} annot(\texttt{R.A}, j)$$
+            $$annot(\texttt{Q.X}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.A}[j] = \texttt{Q.X}[i]} annot(\texttt{R.A}[j])$$
          </p>
          <p class="fragment" style="font-size: 70%">
-            $$annot(\texttt{Q.B}, i) \leftarrow \bigcup_{j\;:\;\texttt{R.B}[j] = Q.B[i]} annot(\texttt{R.B}, j)$$
+            $$annot(\texttt{Q.Y}[i]) \leftarrow \bigcup_{j\;:\;\texttt{R.B}[j] = \texttt{Q.Y}[i]} annot(\texttt{R.B}[j])$$
          </p>
          <p class="fragment">... not the semantics we want</p>
        </section>

        <section>
          <p>
-            If $\texttt{R.A}$ is wrong, any $\texttt{Q.B}$ could change.
+            Caveats on $\texttt{R.A}$ also affect $\texttt{Q.Y}$.
          </p>
        </section>

@ -549,13 +569,19 @@
        </section>

        <section>
-          <p><b>Certain Data Elements: </b> Elements guaranteed to be in the result as-is <u>in all possible worlds</u>.</p>
+          <p><b>Certain Data Elements: </b> Elements guaranteed to be in the result <u>in all possible worlds</u>.</p>
+
+          <p class="fragment">... i.e., elements unaffected by the choice of possible world.</p>
        </section>

        <section>
          <p>If a caveatted element can't affect an output element, don't propagate its caveats!</p>
          <p class="fragment">Propagate caveats to any data elements that could be affected by a change.</p>
-          <p class="fragment"><b>Problem: </b> This is expensive!</p>
+        </section>
+        <section>
+          <p><b>Challenge: </b> How do we propagate caveats<br/>without penalizing query evaluation?</p>
+
+          <p class="fragment">Don't!</p>
        </section>

        <section>
@ -596,31 +622,135 @@
        <section>
          <h3>Instrumenting Queries</h3>
          <p class="fragment">≅ computing certain answers! (CoNP-Complete)</p>
-          <p class="fragment">Need an approximation</p>
        </section>

        <section>
          <h3>Conservative Approximation</h3>

-          <div class="fragment">
+          <div class="fragment" data-fragment-index="1">
            <p style="margin-top: 20px; font-size: 60%;">
              <b>Correctness of SQL Queries on Databases with Nulls.</b><br/>
              Paolo Guagliardo, Leonid Libkin
            </p>
-            <p style="margin-top: 20px; font-size: 60%;">
+            <p style="margin-top: 20px; font-size: 60%;" class="fragment highlight-blue grow" data-fragment-index="4">
              <b>Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers</b><br/>
              Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy
            </p>
          </div>
          <ul>
-            <li class="fragment">Unmarked rows are guaranteed to be caveat-free.</li>
-            <li class="fragment">Marked rows might not be caveatted.</li>
+            <li class="fragment" data-fragment-index="2">Unmarked rows are guaranteed to be caveat-free.</li>
+            <li class="fragment" data-fragment-index="3">Marked rows might not be caveatted.</li>
          </ul>
        </section>

        <section>
-          <p>Ongoing effort to formalize propagation for values.</p>
-          <p><b>Work with Boris Glavic + Su Feng @ IIT</b></p>
+          <p>Add and maintain a binary "has caveat"<br/>column for each row/column.</p>
+        </section>
+
+        <section>
+          <pre><code class="sql">
+    CREATE VIEW by_language AS
+      SELECT language, 
+          CASE WHEN CAST(salary AS float) IS NULL THEN
+
+            caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
+            
+            ELSE CAST(salary AS float) END AS salary
+      FROM raw_csv_data;
+          </code></pre>
+          <div class="fragment">
+          becomes
+          <pre><code class="sql">
+    CREATE VIEW by_language AS
+      SELECT language, CAST(salary AS float) AS salary,
+             FALSE                         AS _caveat_field_language,
+             CAST(salary as float) IS NULL AS _caveat_field_salary,
+             FALSE                         AS _caveat_row
+      FROM raw_csv_data;
+          </code></pre>
+          </div>
+        </section>
+
+        <section>
+          <pre><code class="sql">
+            SELECT salary 
+            FROM by_language
+            WHERE language = 'Scala'
+          </code></pre>
+          <div class="fragment">
+          becomes
+          <pre><code class="sql">
+      SELECT salary, 
+             _caveat_field_salary AS _caveat_field_salary,
+             _caveat_row OR _caveat_field_language AS _caveat_row
+      FROM by_language
+      WHERE language = 'Scala'
+          </code></pre>
+          </div>
+        </section>
+
+        <section>
+          <pre><code class="sql">
+            SELECT AVG(salary) AS salary
+            FROM by_language
+          </code></pre>
+          <div class="fragment">
+          becomes
+          <pre><code class="sql">
+      SELECT AVG(salary) AS salary,
+             GROUP_OR(_caveat_field_salary) AS _caveat_field_salary,
+             FALSE AS _caveat_row
+      FROM by_language
+          </code></pre>
+          </div>
+        </section>
+
+        <section>
+          <pre><code class="sql">
+            SELECT language, AVG(salary) AS salary
+            FROM by_language
+            GROUP BY language
+          </code></pre>
+          <div class="fragment">
+          ... first we evaluate
+          <pre><code class="sql">
+      SELECT GROUP_OR(_caveat_field_language)
+      FROM by_language
+          </code></pre>
+          </div>
+          <p class="fragment">Can often be evaluated statically.</p>
+        </section>
+
+        <section>
+          <h3>If TRUE</h3>
+
+          <pre><code class="sql">
+    SELECT language, AVG(salary) AS salary,
+           FALSE                  AS _caveat_field_language,
+           TRUE                   AS _caveat_field_salary,
+           GROUP_AND(_caveat_field_language OR 
+                     _caveat_row) AS _caveat_row
+    FROM by_language
+    GROUP BY language
+          </code></pre>
+        </section>
+
+        <section>
+          <h3>If FALSE</h3>
+
+          <pre><code class="sql">
+        SELECT language, AVG(salary) AS salary,
+               FALSE                  AS _caveat_field_language,
+               GROUP_OR(_caveat_field_salary OR
+                        _caveat_row)  AS _caveat_field_salary,
+               GROUP_AND(_caveat_row) AS _caveat_row
+        FROM by_language
+        GROUP BY language
+          </code></pre>
+        </section>
+
+        <section>
+          <p>Ongoing work with Boris Glavic + Su Feng @ IIT</p>
        </section>

      </section>
@ -651,14 +781,14 @@
        <section>
          <h3>Program Slicing</h3>

-          <p>Eliminate lines of code not relevant to computing a specific value.</p>
+          <p>Eliminate lines of code not relevant<br/>to computing a specific value.</p>

          <p class="fragment">This is <i>exactly</i> what a database optimizer does.</p>
        </section>

        <section>
          <p><b>Lookup: </b> Caveats on $\texttt{R.A}[i]$</p>
-          <pre class="fragment"><code>
+          <pre class="fragment"><code class="sql">
            SELECT A
            FROM R
            WHERE ROWID = i
@ -676,17 +806,17 @@

        <section>
          <h3>Isolate the message</h3>
-          <pre><code>
- WITH data_source AS
-   SELECT caveat(A, 'only valid '& B &' is within tolerances.') AS A,
-          C, D, E
-   FROM R
- 
- SELECT C, D, E FROM data_source 
+          <pre><code class="sql">
+   WITH data_source AS (
+     SELECT caveat(A, 'valid if '& B &' is within tolerances.') AS A,
+            C, D, E
+     FROM R
+   )
+   SELECT C, D, E FROM data_source 
          </code></pre>
          <p>becomes</p>
-          <pre><code>
-           SELECT 'only valid '& B &' is within tolerances.' 
+          <pre><code class="sql">
+           SELECT 'valid if '& B &' is within tolerances.' 
                    AS caveat_message
           FROM R
          </code></pre>
@ -828,11 +958,11 @@
        </section>

        <section>
-          <pre><code>
+          <pre><code class="sql">
            UPDATE R SET A = 'foo' WHERE ROWID = 3;
          </code></pre>
          becomes
-          <pre><code>
+          <pre><code class="sql">
            SELECT CASE ROWID
                     WHEN 3 THEN 'foo'
                     ELSE A END AS A, 
@ -842,11 +972,11 @@
        </section>

        <section>
-          <pre><code>
+          <pre><code class="sql">
            INSERT INTO R() VALUES ();
          </code></pre>
          becomes
-          <pre><code>
+          <pre><code class="sql">
              SELECT * FROM R
            UNION ALL
              SELECT NULL AS A, NULL AS B,
@ -855,11 +985,11 @@
        </section>

        <section>
-          <pre><code>
+          <pre><code class="sql">
            ALTER R ADD COLUMN `bar`;
          </code></pre>
          becomes
-          <pre><code>
+          <pre><code class="sql">
            SELECT *, NULL as `bar` FROM R;
          </code></pre>
        </section>
@ -913,7 +1043,7 @@
          <img src="graphics/vizier-blue.svg" height="100px" style="vertical-align: middle; margin-right: 20px;" />
          <span style="vertical-align: middle;" ><a href="https://vizierdb.info/">https://vizierdb.info</a></span>
        </h3>
-        <pre style="margin-top: 50px;"><code>
+        <pre style="margin-top: 50px;"><code class="bash">
          $> pip3 install --user vizier-webapi
          $> vizier
        </code></pre>
@ -1060,7 +1190,9 @@
 					 },
 					{ src: '../reveal.js-3.5.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
 					{ src: '../reveal.js-3.5.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
-					{ src: '../reveal.js-3.5.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'tt code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
+					//{ src: '../reveal.js-3.5.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'tt code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
+          { src: '../../reveal.js-3.7.0/plugin/highlight/highlight-9.16.2.js', async: true,
+              callback: function() { hljs.initHighlightingOnLoad(); } },
 					{ src: '../reveal.js-3.5.0/plugin/zoom-js/zoom.js', async: true },
 					{ src: '../reveal.js-3.5.0/plugin/notes/notes.js', async: true }
 				]
--- a/slides/talks/2019-5-VizierCaveats/notes.txt
+++ b/slides/talks/2019-5-VizierCaveats/notes.txt
@ -0,0 +1,15 @@
+--- Salary --- 
+
+1n9lRY5NxHjmXfqZfXJxytBqznzP_vWOPfSbv2rL-T38/Form Responses 1
+
+--- by_language ---
+
+SELECT PRIMARY_LANGUAGE_TECHNOLOGY_STACK AS LANGUAGE, COUNT(*) as tot,
+       AVG(HOW_MANY_YEARS_HAVE_YOU_WORKED_IN_TECH) as YEARSWORKED, 
+       MIN(HOW_MANY_YEARS_HAVE_YOU_WORKED_IN_TECH) as yearsworked_min, 
+       MAX(HOW_MANY_YEARS_HAVE_YOU_WORKED_IN_TECH) as yearsworked_max,
+       AVG(WHAT_IS_YOUR_ANNUALIZED_BASE_SALARY_IN_USD) as salary
+FROM salaries 
+GROUP BY PRIMARY_LANGUAGE_TECHNOLOGY_STACK 
+HAVING tot > 2;
+
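Review note: the "has caveat" column rewrite added to index.html in this commit can be tried end to end with a small sketch. The one below is only an illustration under stated assumptions, not Vizier's implementation: it uses Python's built-in sqlite3 module, a hypothetical try_float helper in place of CAST (SQLite's CAST does not return NULL on a failed parse), invented sample rows, and MAX over 0/1 flags as a stand-in for the slides' GROUP_OR aggregate.

```python
import sqlite3


def try_float(text):
    """Hypothetical helper: parse text as a float, or return None on failure."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def build_demo():
    """Create an in-memory database with invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("try_float", 1, try_float)
    conn.executescript(
        """
        CREATE TABLE raw_csv_data (language TEXT, salary TEXT);
        INSERT INTO raw_csv_data VALUES
          ('Scala', '100000'), ('Scala', 'n/a'),
          ('Python', '90000'), ('Python', '110000');

        -- Instrumented form of by_language: one 0/1 flag per field and per row.
        CREATE TABLE by_language AS
          SELECT language,
                 try_float(salary)         AS salary,
                 0                         AS _caveat_field_language,
                 try_float(salary) IS NULL AS _caveat_field_salary,
                 0                         AS _caveat_row
          FROM raw_csv_data;
        """
    )
    return conn


def caveated_average(conn):
    """Average salary per language, with a flag marking caveatted averages."""
    # First check whether any group-by key carries a caveat.
    key_flag = conn.execute(
        "SELECT MAX(_caveat_field_language) FROM by_language"
    ).fetchone()[0]
    if key_flag:
        raise NotImplementedError("only the caveat-free group key case is sketched")
    # MAX over 0/1 flags plays the role of GROUP_OR.
    return conn.execute(
        """
        SELECT language,
               AVG(salary) AS salary,
               MAX(_caveat_field_salary OR _caveat_row) AS _caveat_field_salary
        FROM by_language
        GROUP BY language
        ORDER BY language
        """
    ).fetchall()


rows = caveated_average(build_demo())
for row in rows:
    print(row)
```

In this sample the Scala average is flagged because one of its salaries failed to parse, while the Python average stays unflagged.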