Review slides

2019-03-11 12:12:47 -04:00 · 2019-03-11 12:12:47 -04:00 · 28f25b7e82
parent c710d3c0ad
commit 28f25b7e82
3 changed files with 802 additions and 1 deletions
--- a/src/teaching/cse-562/2019sp/slide/2019-03-04-ApproximateQueryProcessing.html
+++ b/src/teaching/cse-562/2019sp/slide/2019-03-04-ApproximateQueryProcessing.html
@@ -6,6 +6,16 @@ textbook: (readings only)
 ---

 <section>
+  <!-- 2019 by OK 
+  
+    This lecture fell a bit flat.  
+
+      - There are a number of bugs in the builds that need to be worked out
+      - More examples would be helpful
+      - A deeper dive into the math might also be useful.
+      - More detail in the solutions to the birthday paradox
+  
+  -->

  <section>
    <p style="font-size: larger;">What is the best, correct technique for task <b>X</b>, when <b>Y</b> is true?</p>
--- a/src/teaching/cse-562/2019sp/slide/2019-03-06-Sketches.erb
+++ b/src/teaching/cse-562/2019sp/slide/2019-03-06-Sketches.erb
@@ -186,7 +186,7 @@ textbook: (readings only)
      </ol></li>
      <li>Find $R$, the lowest index <b>not</b> in the set</li>
      <li>Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)</li>
-      <li>Repeat as needed</li>
+      <li>Repeat (in parallel) as needed</li>
    </ol>
  </section>
 </section>
--- a/src/teaching/cse-562/2019sp/slide/2019-03-11-Review.html
+++ b/src/teaching/cse-562/2019sp/slide/2019-03-11-Review.html
@@ -0,0 +1,791 @@
+---
+template: templates/cse4562_2019_slides.erb
+title: Midterm Review
+date: March 11, 2019
+---
+
+<!-- ============================================ -->
+
+<section>
+
+  <section>
+    <p>What are Databases?</p>
+  </section>
+
+  <section>
+    <dl>
+      <dt>Analysis: Answering user-provided questions about data</dt>
+      <dd>What kind of tools can we give end-users? <ul class="tight">
+        <li>Declarative Languages</li>
+        <li>Organizational Datastructures (e.g., Indexes)</li>
+      </ul></dd>
+
+      <div style="color: grey;">
+        <dt>Manipulation: Safely persisting and sharing data updates</dt>
+        <dd>What kind of tools can we give end-users?<ul class="tight">
+          <li>Consistency Primitives</li>
+          <li>Data Validation Primitives</li>
+        </ul></dd>
+      </div>
+    </dl>
+  </section>
+
+  <section>
+    <dl style="font-size: smaller;">
+      <dt>Primitive</dt>
+      <dd>Basic building blocks like Int, Float, Char, String</dd>
+      <dt>Tuple</dt>
+      <dd>Several ‘fields’ of different types. (N-Tuple = N fields)</dd>
+      <dd>A Tuple has a ‘schema’ defining each field</dd>
+      <dt>Set</dt>
+      <dd>A collection of unique records, all of the same type</dd>
+      <dt>Bag</dt>
+      <dd>An unordered collection of records, all of the same type</dd>
+      <dt>List</dt>
+      <dd>An ordered collection of records, all of the same type</dd>
+    </dl>
+  </section>
+
+  <section>
+    <pre><code class="sql">
+            SELECT  [DISTINCT] targetlist
+            FROM    relationlist
+            WHERE   condition
+    </code></pre>
+    <ol>
+      <li class="fragment">Compute the cross product of all relations appearing in <span style="color: red;">relationlist</span> (every combination of one tuple from each relation)</li>
+      <li class="fragment">Discard tuples that fail the <span style="color: red;">condition</span></li>
+      <li class="fragment">Delete attributes not in <span style="color: red;">targetlist</span></li>
+      <li class="fragment">If <span style="font-family: Courier, fixedwidth;">DISTINCT</span> is specified, eliminate duplicate rows</li>
+    </ol>
+    <p style="font-size: 70%;" class="fragment">
+      This is the least efficient strategy to compute a query!
+      A good optimizer will find <b>more efficient strategies</b> to compute <b>the same answer.</b>
+    </p>
+  </section>
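The four conceptual steps on this slide can be sketched in Python. This is only an illustration of the semantics, not how a real engine evaluates queries. The function `naive_select` and the sample relations `R` and `S` are hypothetical, and tuples are modeled as dicts whose attribute names do not collide.

```python
from itertools import product

def naive_select(relations, condition, targetlist, distinct=False):
    """Evaluate SELECT-FROM-WHERE in the slow, conceptual order."""
    # Step 1: cross product of every relation in the FROM clause.
    rows = []
    for combo in product(*relations):
        merged = {}
        for t in combo:
            merged.update(t)
        rows.append(merged)
    # Step 2: discard rows that fail the WHERE condition.
    rows = [r for r in rows if condition(r)]
    # Step 3: keep only the attributes in the target list.
    rows = [tuple(r[a] for a in targetlist) for r in rows]
    # Step 4: eliminate duplicates if DISTINCT was requested.
    if distinct:
        rows = list(dict.fromkeys(rows))
    return rows

R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"c": 1, "d": "x"}, {"c": 1, "d": "y"}]
join_cond = lambda r: r["a"] == r["c"]

bag_result = naive_select([R, S], join_cond, ["b"])
set_result = naive_select([R, S], join_cond, ["b"], distinct=True)
```

Without `DISTINCT` the result is a bag (the value 10 appears twice); with it, the duplicate row is removed.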
+
+</section>
+
+<section>
+  <section>
+    <h2>Physical Layout</h2>          
+  </section>
+
+  <section>
+    <h3>Record Formats</h3>
+    <dl>
+      <dt>Fixed</dt>
+      <dd>Constant-size fields.  Field $i$ at byte $\sum_{j < i} |Field_j|$</dd>
+      <dt>Delimited</dt>
+      <dd>Special character or string (e.g., <code>,</code>) between fields</dd>
+      <dt>Header</dt>
+      <dd>Fixed-size header points to start of each field</dd>
+      <dt>&nbsp;</dt>
+      <dd>&nbsp;</dd>
+    </dl>
+  </section>
+
+  <section>
+    <h3>File Formats</h3>
+    <dl>
+      <dt>Fixed</dt>
+      <dd>Constant-size records.  Record $i$ at byte $|Record| \times i$</dd>
+      <dt>Delimited</dt>
+      <dd>Special character or string (e.g., <code>\r\n</code>) at record end</dd>
+      <dt>Header</dt>
+      <dd>Index in file points to start of each record</dd>
+      <dt class="fragment" data-fragment-index="1">Paged</dt>
+      <dd class="fragment" data-fragment-index="1">Align records to paging boundaries</dd>
+    </dl>
+  </section>
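The two fixed-format offset formulas (field $i$ at the sum of the preceding field sizes, record $i$ at $|Record| \times i$) can be checked with a small sketch. The schema below (an int32, a float64, and an int16 packed without padding) is an assumption chosen for illustration, as are the helper names.

```python
import struct
from itertools import accumulate

# Hypothetical fixed-format schema: int32, float64, int16 (no padding).
FIELD_SIZES = [4, 8, 2]
RECORD_SIZE = sum(FIELD_SIZES)

# Field i starts at the sum of the sizes of all fields j < i.
FIELD_OFFSETS = [0] + list(accumulate(FIELD_SIZES))[:-1]

def record_offset(i):
    # Record i starts at |Record| * i.
    return RECORD_SIZE * i

def read_field(data, record, field):
    start = record_offset(record) + FIELD_OFFSETS[field]
    return data[start:start + FIELD_SIZES[field]]

# Two records written back to back in a "file".
data = struct.pack("<idh", 7, 2.5, -3) + struct.pack("<idh", 8, 4.0, 5)
second_float = struct.unpack("<d", read_field(data, 1, 1))[0]
```

Because every size is constant, any field of any record is reachable with arithmetic alone, with no scanning for delimiters.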
+
+  <section>
+    <dl>
+      <dt>File</dt>
+      <dd>A collection of pages (or records)</dd>
+      <dt>Page</dt>
+      <dd>A fixed-size collection of records</dd>
+      <dd style="font-size: smaller;">Page size is usually dictated by hardware.<br/>Mem Page $\approx$ 4KB&nbsp;&nbsp;&nbsp;Cache Line $\approx$ 64B</dd>
+      <dt>Record</dt>
+      <dd>One or more fields (for now)</dd>
+      <dt>Field</dt>
+      <dd>A primitive value (for now)</dd>
+    </dl>
+  </section>
+</section>
+
+<section>
+  <section>
+    <p>Relational Algebra</p>
+  </section>
+
+  <section>
+    <h3>Relational Algebra</h3>
+
+    <table style="font-size: 70%">
+      <tr><th>Operation</th><th>Sym</th><th>Meaning</th></tr>
+      <tr><td>Selection</td><td>$\sigma$</td><td>Select a subset of the input rows</td></tr>
+      <tr><td>Projection</td><td>$\pi$</td><td>Delete unwanted columns</td></tr>
+      <tr><td>Cross-product</td><td>$\times$</td><td>Combine two relations</td></tr>
+      <tr><td>Set-difference</td><td>$-$</td><td>Tuples in Rel 1, but not Rel 2</td></tr>
+      <tr><td>Union</td><td>$\cup$</td><td>Tuples either in Rel 1 or in Rel 2</td></tr>
+      <tr><td>Intersection</td><td>$\cap$</td><td>Tuples in both Rel 1 and Rel 2</td></tr>
+      <tr><td>Join</td><td>$\bowtie$</td><td>Pairs of tuples matching a specified condition</td></tr>
+      <tr style="color: lightgrey;"><td>Division</td><td>$/$</td><td>"Inverse" of cross-product</td></tr>
+      <tr><td>Sort</td> <td>$\tau_A$</td><td>Sort records by attribute(s) $A$</td></tr>
+      <tr><td>Limit</td><td>$\texttt{LIMIT}_N$</td><td>Return only the first $N$ records<br/>(according to sort order if paired with sort).</td></tr>
+    </table>
+  </section>
+
+  <section>
+    <h3>Equivalence</h3>
+    $$Q_1 = \pi_{A}\left( \sigma_{c}( R ) \right)$$
+    $$Q_2 = \sigma_{c}\left( \pi_{A}( R ) \right)$$
+
+    <div class="fragment">
+      $$Q_1 \stackrel{?}{\equiv} Q_2$$
+    </div>
+  </section>
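A small sketch can make the question on this slide concrete. It assumes tuples are dicts and uses hypothetical helpers `select` and `project`. The two plans agree when the condition only references attributes kept by the projection; otherwise $Q_2$ cannot even be evaluated.

```python
def select(rows, predicate):
    return [r for r in rows if predicate(r)]

def project(rows, attrs):
    return [{a: r[a] for a in attrs} for r in rows]

R = [{"a": 1, "b": 5}, {"a": 2, "b": 9}]

# c references only a, and a is in A: both orders agree.
c_on_a = lambda r: r["a"] > 1
q1 = project(select(R, c_on_a), ["a"])
q2 = select(project(R, ["a"]), c_on_a)

# c references b, which the projection removed: Q2 is not well defined.
c_on_b = lambda r: r["b"] > 6
try:
    select(project(R, ["a"]), c_on_b)
    q2_valid = True
except KeyError:
    q2_valid = False
```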
+
+  <section>          
+    <table style="font-size: 60%">
+      <tr><th><b>Rule</b></th><th><b>Notes</b> </th></tr>
+      <tr><td>$\sigma_{C_1\wedge C_2}(R) \equiv \sigma_{C_1}(\sigma_{C_2}(R))$</td><td></td></tr>
+      <tr><td>$\sigma_{C_1\vee C_2}(R) \equiv \sigma_{C_1}(R) \cup \sigma_{C_2}(R)$</td><td>Only true for set, not bag union</td></tr>
+      <tr><td>$\sigma_C(R \times S) \equiv R \bowtie_C S$</td><td></td></tr>
+      <tr><td>$\sigma_C(R \times S) \equiv \sigma_C(R) \times S$</td><td>If $C$ references only $R$'s attributes, also works for joins</td></tr>
+      <tr><td>$\pi_{A}(\pi_{A \cup B}(R)) \equiv \pi_{A}(R)$</td><td> </td></tr>
+      <tr><td>$\sigma_C(\pi_{A}(R)) \equiv \pi_A(\sigma_C(R))$</td><td>If $A$ contains all of the attributes referenced by $C$ </td></tr>
+      <tr><td>$\pi_{A\cup B}(R\times S) \equiv \pi_A(R) \times \pi_B(S)$</td><td>Where $A$ (resp., $B$) contains attributes in $R$ (resp., $S$)</td></tr>
+      <tr><td>$R \times (S \times T) \equiv (R \times S) \times T$</td><td>Also works for joins  </td></tr>
+      <tr><td>$R \times S \equiv S \times R$</td><td>Also works for joins  </td></tr>
+      <tr><td>$R \cup (S \cup T) \equiv (R \cup S) \cup T$</td><td>Also works for intersection and bag-union  </td></tr>
+      <tr><td>$R \cup S \equiv S \cup R$</td><td>Also works for intersections and bag-union  </td></tr>
+      <tr><td>$\sigma_{C}(R \cup S) \equiv \sigma_{C}(R) \cup \sigma_{C}(S)$</td><td>Also works for intersections and bag-union  </td></tr>
+      <tr><td>$\pi_{A}(R \cup S) \equiv \pi_{A}(R) \cup \pi_{A}(S)$</td><td>Also works for bag-union, but not (in general) for intersection  </td></tr>
+      <tr><td>$\sigma_{C}(\gamma_{A, AGG}(R)) \equiv \gamma_{A, AGG}(\sigma_{C}(R))$</td><td>If $A$ contains all of the attributes referenced by $C$</td></tr>
+    </table>
+  </section>
+
+  <section>
+    <h3>Algorithms</h3>
+    <dl>
+      <dt>"Volcano" Operators (Iterators)</dt>
+      <dd>Operators "pull" tuples, one-at-a-time, from their children.</dd>
+
+      <dt>2-Pass (External) Sort</dt>
+      <dd>Create sorted runs, then repeatedly merge runs</dd>
+
+      <dt>Join Algorithms</dt>
+      <dd>Quickly picking out <i>specific</i> pairs of tuples.</dd>
+
+      <dt>Aggregation Algorithms</dt>
+      <dd>In-Memory vs 2-Pass, Normal vs Group-By</dd>
+    </dl>
+  </section>
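Python generators are a convenient way to sketch the pull-based iterator model: each operator asks its child for the next tuple only when it needs one. The operator names below are illustrative, and the `pulled` list exists only to show how few rows the scan actually hands upward.

```python
from itertools import islice

pulled = []  # records every row the scan hands upward

def scan(table):
    for row in table:
        pulled.append(row)
        yield row

def select(child, predicate):
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield tuple(row[c] for c in columns)

def limit(child, n):
    # islice stops pulling from the child once n rows were produced.
    yield from islice(child, n)

T = [{"a": 1, "b": "x"}, {"a": 5, "b": "y"}, {"a": 7, "b": "z"}]
plan = limit(project(select(scan(T), lambda r: r["a"] > 2), ["b"]), 1)
result = list(plan)
```

The `LIMIT 1` at the top of the plan means the scan is asked for only two rows, even though the table holds three.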
+
+  <section>
+    <h3>Nested-Loop Join</h3>
+    <svg data-src="graphics/2018-02-12-Join-NLJ.svg" />
+  </section>
+
+  <section>
+    <h3>Block-Nested Loop Join</h3>
+    <svg data-src="graphics/2018-02-12-Join-BNLJ.svg" class="stretch" />
+  </section>
+
+  <section>
+    <h3>Strategies for Implementing $R \bowtie_{R.A = S.A} S$</h3>
+
+    <dl>
+      <dt>Sort/Merge Join</dt>
+      <dd>Sort all of the data upfront, then scan over both sides.</dd>
+
+      <dt>In-Memory Index Join (1-pass Hash; Hash Join)</dt>
+      <dd>Build an in-memory index on one table, scan the other.</dd>
+
+      <dt>Partition Join (2-pass Hash; External Hash Join)</dt>
+      <dd>Partition both sides so that tuples don't join across partitions.</dd>
+    </dl>
+  </section>
+
+  <section>
+    <h3>Sort/Merge Join</h3>
+    <svg data-src="graphics/2018-02-12-Join-SortMerge.svg" />
+  </section>
+
+  <section>
+    <h3>Sort/Merge Join</h3>
+    <svg data-src="graphics/2018-02-12-Join-SortMerge.svg" />
+  </section>
+  
+  <section>
+    <h3>Sort/Merge Join</h3>
+    <img src="graphics/2018-02-12-Join-SortMerge.svg" />
+    <p>Sort/Merge is typically expressed as 3 operators<br/>(2 &times; Sort + Merge)</p>
+  </section>
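A minimal in-memory sketch of the sort/merge strategy for an equi-join follows. It assumes both inputs fit in memory (a real implementation would use an external sort), and the function name and dict-based tuples are illustrative.

```python
def sort_merge_join(R, S, r_key, s_key):
    # The two "Sort" operators.
    R = sorted(R, key=lambda r: r[r_key])
    S = sorted(S, key=lambda s: s[s_key])
    # The "Merge" operator: advance whichever side has the smaller key.
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        a, b = R[i][r_key], S[j][s_key]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Find the run of S rows sharing this key.
            j_end = j
            while j_end < len(S) and S[j_end][s_key] == a:
                j_end += 1
            # Pair every R row with this key against the whole run.
            while i < len(R) and R[i][r_key] == a:
                for s in S[j:j_end]:
                    out.append((R[i], s))
                i += 1
            j = j_end
    return out

R = [{"A": 2, "r": 0}, {"A": 1, "r": 1}, {"A": 2, "r": 2}]
S = [{"A": 2, "v": "x"}, {"A": 3, "v": "y"}, {"A": 2, "v": "z"}]
pairs = sort_merge_join(R, S, "A", "A")
```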
+
+  <section>
+    <h3>1-Pass Hash Join</h3>
+    <svg data-src="graphics/2018-02-12-Join-1PassHash.svg" />
+  </section>
+
+  <section>
+    <h3>2-Pass Hash Join</h3>
+    <dl>
+      <dt>Limited Queries</dt>
+      <dd>Only supports join conditions of the form $R.A = S.B$</dd>
+
+      <dt>Low Memory</dt>
+      <dd>Never need more than 1 pair of partitions in memory</dd>
+
+      <dt>High IO Cost</dt>
+      <dd>Every record gets written out to disk, and back in.</dd>
+    </dl>          
+    <p class="fragment">Can partition on data-values to support other types of queries.</p>
+  </section>
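The partitioning idea can be sketched as follows. Python lists stand in for the on-disk partition files, and `partition` and `partition_join` are hypothetical names. Only one pair of partitions is examined at a time, which is where the low memory requirement comes from.

```python
from collections import defaultdict

def partition(rows, key, n):
    # Pass 1: route each row to a partition by hashing its join key.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def partition_join(R, S, r_key, s_key, n=4):
    r_parts = partition(R, r_key, n)
    s_parts = partition(S, s_key, n)
    out = []
    # Pass 2: matching keys hash to the same partition number, so each
    # pair of partitions can be joined independently with a 1-pass hash join.
    for r_part, s_part in zip(r_parts, s_parts):
        index = defaultdict(list)
        for s in s_part:
            index[s[s_key]].append(s)
        for r in r_part:
            for s in index.get(r[r_key], []):
                out.append((r, s))
    return out

R = [{"A": i} for i in range(10)]
S = [{"B": i % 3, "v": i} for i in range(6)]
pairs = partition_join(R, S, "A", "B")
```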
+
+  <section>
+    <h3>Index Nested Loop Join</h3>
+
+    To compute $R \bowtie_{R.A < S.B} S$ with an index on $S.B$
+
+    <ol>
+      <li>Read One Row of $R$</li>
+      <li>Get the value of $R.A$</li>
+      <li>Start index scan on $S.B > [R.A]$</li>
+      <li>Return rows as normal</li>
+    </ol>
+  </section>
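The four steps above can be sketched with a sorted list standing in for the index on $S.B$; a real system would use a B+Tree, and the function name here is hypothetical. `bisect_right` finds the first position whose key is strictly greater than $R.A$, which is where the index scan starts.

```python
from bisect import bisect_right

def index_nlj_less_than(R, S, r_key, s_key):
    # Stand-in for a tree index on S.B: rows sorted by key.
    index = sorted(S, key=lambda s: s[s_key])
    keys = [s[s_key] for s in index]
    for r in R:                                  # 1. read one row of R
        a = r[r_key]                             # 2. get the value of R.A
        start = bisect_right(keys, a)            # 3. start scan at S.B > R.A
        for s in index[start:]:                  # 4. return rows as normal
            yield (r, s)

R = [{"A": 2}, {"A": 5}]
S = [{"B": 1}, {"B": 3}, {"B": 5}, {"B": 6}]
pairs = [(r["A"], s["B"]) for r, s in index_nlj_less_than(R, S, "A", "B")]
```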
+
+  <section>
+    <h3>Basic Aggregate Pattern</h3>
+    <dl>
+      <dt>Init</dt>
+      <dd>Define a starting value for the accumulator</dd>
+      <dt>Fold(Accum, New)</dt>
+      <dd>Merge a new value into the accumulator</dd>
+      <dt>Finalize(Accum)</dt>
+      <dd>Extract the aggregate from the accumulator.</dd>
+    </dl>
+  </section>
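As a sketch of the Init / Fold / Finalize pattern, here is AVG written as three small functions plus a generic driver. The function names are illustrative. The accumulator for AVG is a (sum, count) pair, which is why it needs a finalize step.

```python
from functools import reduce

def avg_init():
    return (0.0, 0)            # (running sum, running count)

def avg_fold(accum, new):
    total, count = accum
    return (total + new, count + 1)

def avg_finalize(accum):
    total, count = accum
    return total / count if count else None

def aggregate(values, init, fold, finalize):
    return finalize(reduce(fold, values, init()))

average = aggregate([2, 4, 9], avg_init, avg_fold, avg_finalize)

# COUNT needs no real finalize step: the accumulator is the answer.
count = aggregate([2, 4, 9], lambda: 0, lambda acc, _: acc + 1, lambda acc: acc)
```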
+
+  <section>
+    <h3>Basic Aggregate Types</h3>
+    <p class="fragment" style="font-size: 60%">Gray et al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals"</p>
+
+    <dl>
+      <dt>Distributive</dt>
+      <dd>Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)</dd>
+      <dt>Algebraic</dt>
+      <dd>Finite-sized accumulator but needs a finalize (AVG)</dd>
+      <dt>Holistic</dt>
+      <dd>Unbounded accumulator (MEDIAN)</dd>
+    </dl>
+  </section>
+
+  <section>
+    <h3>Grouping Algorithms</h3>
+
+    <dl>
+      <dt>2-pass Hash Aggregate</dt>
+      <dd>Like 2-pass Hash Join: Distribute groups across buckets, then do an in-memory aggregate for each bucket.</dd>
+
+      <dt>Sort-Aggregate</dt>
+      <dd>Like Sort-Merge Join: Sort data by groups, then group elements will be adjacent.</dd>
+    </dl>
+  </section>
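A minimal sketch of sort-aggregate, assuming the data fits in memory and using SUM as the aggregate: once rows are sorted by the group-by attribute, each group is a contiguous run, so only one accumulator is live at a time. The function name and sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def sort_aggregate(rows, group_key, value_key):
    ordered = sorted(rows, key=itemgetter(group_key))
    # After sorting, each group is a contiguous run of rows.
    return [
        (group, sum(r[value_key] for r in run))
        for group, run in groupby(ordered, key=itemgetter(group_key))
    ]

sales = [
    {"state": "NY", "sales": 3},
    {"state": "TX", "sales": 4},
    {"state": "NY", "sales": 5},
]
totals = sort_aggregate(sales, "state", "sales")
```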
+</section>
+
+<section>
+  <section>
+    <p>Indexing</p>
+  </section>
+
+  <section>
+    <h3>Data Organization</h3>
+
+    <dl>
+      <div class="fragment">
+        <dt>Unordered Heap</dt>
+        <dd>No organization at all.  $O(N)$ reads.</dd>
+      </div>
+
+      <div class="fragment">
+        <dt>(Secondary) Index</dt>
+        <dd>Index structure over unorganized data.  $O(\ll N)$ <b>random</b> reads for <b>some</b> queries.</dd>
+      </div>
+
+      <div class="fragment">
+        <dt>Clustered (Primary) Index</dt>
+        <dd>Index structure over clustered data.  $O(\ll N)$ <b>sequential</b> reads for <b>some</b> queries.</dd>
+      </div>
+    </dl>
+  </section>
+
+  <section>
+    <h3>Data Organization</h3>
+    <img src="graphics/2018-02-19-Index-Types.svg" />
+  </section>
+
+  <section>
+    <h3>Data Organization</h3>
+    <img src="graphics/2018-02-19-PrimaryVsSecondary.png" />
+  </section>
+
+  <section>
+    <h3>Tree-Based Indexes</h3>
+
+    <svg data-src="graphics/2018-02-19-Tree-Motivation.svg"/>
+  </section>
+
+  <section>
+    <svg data-src="graphics/2018-02-19-BTree-Reserved.svg" />
+  </section>
+
+  <section>
+    <h3>Rules of B+Trees</h3>
+
+    <dl>
+      <dt>Keep space open for insertions in inner/data nodes.</dt>
+      <dd>‘Split’ nodes when they’re full</dd>
+
+      <dt>Avoid under-using space</dt>
+      <dd>‘Merge’ nodes when they’re under-filled</dd>
+    </dl>
+
+    <p><b>Maintain Invariant:</b> All Nodes ≥ 50% Full</p>
+    <p>(Exception: The Root)</p>
+  </section>
+
+  <section>
+    <svg data-src="graphics/2018-02-23-HashTable.svg" class="stretch"/>
+  </section>
+
+  <section>
+    <h3>Problems</h3>
+    <dl>
+      <dt>$N$ is too small</dt>
+      <dd>Too many overflow pages (slower reads).</dd>
+      <dt>$N$ is too big</dt>
+      <dd>Too many normal pages (wasted space).</dd>
+    </dl>
+  </section>
+
+  <section>
+    <svg data-src="graphics/2018-02-23-HashResize-Naive.svg" class="stretch"/>
+  </section>
+
+  <section>
+    <h3>Problems</h3>
+    <dl>
+      <dt class="fragment" data-fragment-index="1">Changing hash functions reallocates everything</dt>
+      <dd class="fragment" data-fragment-index="1">Only double/halve the size of the hash table</dd>
+
+      <dt class="fragment" data-fragment-index="2">Changing sizes still requires reading everything</dt>
+      <dd class="fragment" data-fragment-index="3"><b>Idea:</b> Only redistribute buckets that are too big</dd>
+    </dl>
+  </section>
+
+  <section>
+    <svg data-src="graphics/2018-02-23-HashResize-Dynamic.svg" class="stretch" />
+  </section>
+</section>
+
+<section>
+  <section>
+    <p>Cost-Based Optimization</p>
+  </section>
+
+
+  <section>
+    <h3>Accounting</h3>
+    <p style="margin-top: 50px;">Figure out the cost of each <b>individual</b> operator.</p>
+    <p style="margin-top: 50px;">Only count the number of IOs <b>added</b> by each operator.</p>
+  </section>
+
+  <section>
+    <table style="font-size: 70%">
+      <tr><th>Operation</th><th>RA</th><th>IOs Added (#pages)</th><th>Memory (#tuples)</th></tr>
+      <tr>
+        <td>Table Scan</td>
+        <td>$R$</td>
+        <td>$\frac{|R|}{\mathcal P}$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td>Projection</td>
+        <td>$\pi(R)$</td>
+        <td>$0$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td>Selection</td>
+        <td>$\sigma(R)$</td>
+        <td>$0$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td>Union</td>
+        <td>$R \uplus S$</td>
+        <td>$0$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td style="vertical-align: middle;">Sort <span>(In-Mem)</span></td>
+        <td style="vertical-align: middle;">$\tau(R)$</td>
+        <td>$0$</td>
+        <td>$O(|R|)$</td>
+      </tr>
+      <tr>
+        <td>Sort (On-Disk)</td>
+        <td>$\tau(R)$</td>
+        <td>$\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$</td>
+        <td>$O(\mathcal B)$</td>
+      </tr>
+      <tr>
+        <td><span>(B+Tree)</span> Index Scan</td>
+        <td>$Index(R, c)$</td>
+        <td>$\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td>(Hash) Index Scan</td>
+        <td>$Index(R, c)$</td>
+        <td>$1$</td>
+        <td>$O(1)$</td>
+      </tr>
+    </table>
+
+    <ol style="font-size: 50%; margin-top: 50px;">
+      <li>Tuples per Page ($\mathcal P$) <span>– Normally defined per-schema</span></li>
+      <li>Size of $R$ ($|R|$)</li>
+      <li>Pages of Buffer ($\mathcal B$)</li>
+      <li>Keys per Index Page ($\mathcal I$)</li>
+    </ol>
+  </section>
+  <section>
+    <table style="font-size: 70%">
+      <tr><th width="300px">Operation</th><th>RA</th><th>IOs Added (#pages)</th><th>Memory (#tuples)</th></tr>
+      <tr>
+        <td style="font-size: 60%">Nested Loop Join <span>(Buffer $S$ in mem)</span></td>
+        <td>$R \times S$</td>
+        <td>$0$</td>
+        <td>$O(|S|)$</td>
+      </tr>
+      <tr>
+        <td style="font-size: 60%">Nested Loop Join (Buffer $S$ on disk)</td>
+        <td>$R \times_{disk} S$</td>
+        <td>$(1+ |R|) \cdot \frac{|S|}{\mathcal P}$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td>1-Pass Hash Join</td>
+        <td>$R \bowtie_{1PH, c} S$</td>
+        <td>$0$</td>
+        <td>$O(|S|)$</td>
+      </tr>
+      <tr>
+        <td>2-Pass Hash Join</td>
+        <td>$R \bowtie_{2PH, c} S$</td>
+        <td>$\frac{2|R| + 2|S|}{\mathcal P}$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td>Sort-Merge Join </td>
+        <td>$R \bowtie_{SM, c} S$</td>
+        <td>[Sort]</td>
+        <td>[Sort]</td>
+      </tr>
+      <tr>
+        <td><span>(Tree)</span> Index NLJ</td>
+        <td>$R \bowtie_{INL, c} S$</td>
+        <td>$|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td>(Hash) Index NLJ</td>
+        <td>$R \bowtie_{INL, c} S$</td>
+        <td>$|R| \cdot 1$</td>
+        <td>$O(1)$</td>
+      </tr>
+      <tr>
+        <td><span>(In-Mem)</span> Aggregate</td>
+        <td>$\gamma_A(R)$</td>
+        <td>$0$</td>
+        <td>$adom(A)$</td>
+      </tr>
+      <tr>
+        <td style="font-size: 90%">(Sort/Merge) Aggregate</td>
+        <td>$\gamma_A(R)$</td>
+        <td>[Sort]</td>
+        <td>[Sort]</td>
+      </tr>
+    </table>
+
+    <ol style="font-size: 50%;">
+      <li>Tuples per Page ($\mathcal P$) <span>– Normally defined per-schema</span></li>
+      <li>Size of $R$ ($|R|$)</li>
+      <li>Pages of Buffer ($\mathcal B$)</li>
+      <li>Keys per Index Page ($\mathcal I$)</li>
+      <li>Number of distinct values of $A$ ($adom(A)$)</li>
+    </ol>
+  </section>
+
+  <section>
+    <p>Estimating IOs requires Estimating $|Q(R)|$</p>
+  </section>
+
+  <section>
+    <table style="font-size: 70%">
+      <tr>
+        <th>Operator</th>
+        <th>RA</th>
+        <th>Estimated Size</th>
+      </tr>
+
+      <tr>
+        <td>Table</td>
+        <td>$R$</td>
+        <td>$|R|$</td>
+      </tr>
+
+      <tr>
+        <td>Projection</td>
+        <td>$\pi(Q)$</td>
+        <td>$|Q|$</td>
+      </tr>
+
+      <tr>
+        <td>Union</td>
+        <td>$Q_1 \uplus Q_2$</td>
+        <td>$|Q_1| + |Q_2|$</td>
+      </tr>
+
+      <tr>
+        <td>Cross Product</td>
+        <td>$Q_1 \times Q_2$</td>
+        <td>$|Q_1| \times |Q_2|$</td>
+      </tr>
+
+      <tr>
+        <td>Sort</td>
+        <td>$\tau(Q)$</td>
+        <td>$|Q|$</td>
+      </tr>
+
+      <tr>
+        <td>Limit</td>
+        <td>$\texttt{LIMIT}_N(Q)$</td>
+        <td>$N$</td>
+      </tr>
+
+      <tr>
+        <td>Selection</td>
+        <td>$\sigma_c(Q)$</td>
+        <td>$|Q| \times \texttt{SEL}(c, Q)$</td>
+      </tr>
+
+      <tr>
+        <td>Join</td>
+        <td>$Q_1 \bowtie_c Q_2$</td>
+        <td>$|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$</td>
+      </tr>
+
+      <tr>
+        <td>Distinct</td>
+        <td>$\delta_A(Q)$</td>
+        <td>$\texttt{UNIQ}(A, Q)$</td>
+      </tr>
+
+      <tr>
+        <td>Aggregate</td>
+        <td>$\gamma_{A, B \leftarrow \Sigma}(Q)$</td>
+        <td>$\texttt{UNIQ}(A, Q)$</td>
+      </tr>
+    </table>
+
+    <ul style="font-size: 50%; margin-top: 20px">
+      <li>$\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$</li>
+      <li>$\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.</li>
+    </ul>
+  </section>
+
+  <section>
+    <h3>(Some) Estimation Techniques</h3>
+
+    <dl style="font-size: 80%">
+      <div class="fragment">
+        <dt>Guess Randomly</dt>
+        <dd>Rules of thumb if you have no other options...</dd>
+      </div>
+
+      <div class="fragment">
+        <dt>Uniform Prior</dt>
+        <dd>Use basic statistics to make a very rough guess.</dd>
+      </div>
+
+      <div class="fragment">
+        <dt>Sampling / History</dt>
+        <dd>Small, Quick Sampling Runs (or prior executions of the query).</dd>
+      </div>
+
+      <div class="fragment">
+        <dt>Histograms</dt>
+        <dd>Using more detailed statistics for improved guesses.</dd>
+      </div>
+
+      <div class="fragment">
+        <dt>Constraints</dt>
+        <dd>Using rules about the data for improved guesses.</dd>
+      </div>
+    </dl>
+  </section>
+</section>
+
+<section>
+  <section>
+    <h3>Sketching</h3>
+  </section>
+
+  <section>
+    <table>
+      <tr><th>Flips</th><th>Score</th><th>Probability</th>
+        <th data-fragment-index="1">E[# Games]</th>
+      </tr>
+      <tr><td>(👽)</td><td>0</td><td>0.5</td>
+        <td data-fragment-index="1">2</td>
+      </tr>
+      <tr><td>(🐕)(👽)</td><td>1</td><td>0.25</td>
+        <td data-fragment-index="1">4</td>
+      </tr>
+      <tr><td>(🐕)(🐕)(👽)</td><td>2</td><td>0.125</td>
+        <td data-fragment-index="1">8</td>
+      </tr>
+      <tr><td>(🐕)$\times N$ &nbsp;&nbsp;(👽)</td><td>$N$</td><td>$\frac{1}{2^{N+1}}$</td>
+        <td>$2^{N+1}$</td>
+      </tr>
+    </table>
+    <p style="margin-top: 50px;">If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.</p>
+    <p style="margin-top: 50px;">To do that, I only need to track my top score!</p>
+  </section>
+
+  <section>
+    <h3>Flajolet-Martin Sketches</h3>
+    <h4>($\approx$ HyperLogLog)</h4>
+
+    <ol>
+      <li>For each record...
+      <ol>
+        <li>Hash the record</li>
+        <li>Find the index of the lowest-order non-zero bit</li>
+        <li>Add the index of the bit to a set</li>
+      </ol></li>
+      <li>Find $R$, the lowest index <b>not</b> in the set</li>
+      <li>Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)</li>
+      <li>Repeat (in parallel) as needed</li>
+    </ol>
+  </section>
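The steps above can be sketched as follows. Several choices here are assumptions rather than part of the slide: SHA-256 truncated to 32 bits is used as the hash, an all-zero hash is mapped to index 32, and the repeated trials are combined by averaging $R$ before exponentiating, which is one common way to do it.

```python
import hashlib

PHI = 0.77351

def hash32(value, seed):
    # Assumed hash: a seeded SHA-256 truncated to 32 bits.
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def lowest_set_bit(x):
    # Index of the lowest-order non-zero bit; an all-zero hash maps to 32.
    return (x & -x).bit_length() - 1 if x else 32

def fm_r(records, seed=0):
    seen = set()
    for rec in records:
        seen.add(lowest_set_bit(hash32(rec, seed)))
    # R is the lowest index not in the set.
    r = 0
    while r in seen:
        r += 1
    return r

def fm_estimate(records, trials=16):
    # Assumed combination rule: average R over independent trials.
    records = list(records)
    mean_r = sum(fm_r(records, seed) for seed in range(trials)) / trials
    return 2 ** mean_r / PHI
```

Because only the set of observed bit indexes is stored, feeding the same record twice cannot change the estimate, which is what makes this a count-distinct sketch.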
+  
+  <section>
+    <h3>Count Sketches</h3>
+
+    <ol>
+      <li>Pick a number of "trials" and a number of "bins"</li>
+      <li>For each record $O_i$
+        <ol>
+          <li>For each "trial" $j$
+            <ol>
+              <li>Use a hash function $h_j(O_i)$ to pick a bin</li>
+              <li>Add a $\pm 1$ value determined by hash function $\delta_j(O_i)$ to the bin</li>
+            </ol>
+          </li>
+        </ol>
+      </li>
+      <li>For each trial $j$, estimate the count of $O_i$ as $\delta_j(O_i)$ times the value of bin $h_j(O_i)$</li>
+      <li>Take the <em>median</em> value for all trials.</li>
+    </ol>
+  </section>
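A minimal Count Sketch, assuming seeded SHA-256 for both the bin hash $h_j$ and the sign hash $\delta_j$ (the class and method names are hypothetical). When reading a count back, this sketch multiplies the bin value by the item's own sign, as in the usual Count Sketch formulation, so that the item's contributions add up while colliding items tend to cancel.

```python
import hashlib
from statistics import median

def _hash(value, salt):
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

class CountSketch:
    def __init__(self, trials=5, bins=64):
        self.trials, self.bins = trials, bins
        self.table = [[0] * bins for _ in range(trials)]

    def _bin(self, j, item):       # h_j(O_i)
        return _hash(item, f"bin{j}") % self.bins

    def _sign(self, j, item):      # delta_j(O_i), either +1 or -1
        return 1 if _hash(item, f"sign{j}") & 1 else -1

    def add(self, item):
        for j in range(self.trials):
            self.table[j][self._bin(j, item)] += self._sign(j, item)

    def estimate(self, item):
        # Median over trials of sign * bin value.
        return median(
            self._sign(j, item) * self.table[j][self._bin(j, item)]
            for j in range(self.trials)
        )
```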
+  
+  <section>
+    <h3>Count-Min Sketches</h3>
+
+    <ol>
+      <li>Pick a number of "trials" and a number of "bins"</li>
+      <li>For each record $O_i$
+        <ol>
+          <li>For each "trial" $j$
+            <ol>
+              <li>Use a hash function $h_j(O_i)$ to pick a bin</li>
+              <li>Add 1 to the bin</li>
+            </ol>
+          </li>
+        </ol>
+      </li>
+      <li>For each trial $j$, estimate the count of $O_i$ by the value of bin $h_j(O_i)$</li>
+      <li>Take the <em>minimum</em> value for all trials.</li>
+    </ol>
+  </section>
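A minimal Count-Min sketch, again assuming seeded SHA-256 as the per-trial hash $h_j$ (the class and method names are hypothetical). Since bins are only ever incremented, each trial can over-count because of collisions but can never under-count, which is why taking the minimum is safe.

```python
import hashlib

def _hash(value, salt):
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

class CountMinSketch:
    def __init__(self, trials=5, bins=64):
        self.trials, self.bins = trials, bins
        self.table = [[0] * bins for _ in range(trials)]

    def _bin(self, j, item):       # h_j(O_i)
        return _hash(item, f"bin{j}") % self.bins

    def add(self, item):
        for j in range(self.trials):
            self.table[j][self._bin(j, item)] += 1

    def estimate(self, item):
        # Minimum over trials: the trial least polluted by collisions.
        return min(
            self.table[j][self._bin(j, item)] for j in range(self.trials)
        )
```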
+
+  <section>
+    <dl>
+      <dt>Flajolet-Martin Sketches (HyperLogLog)</dt>
+      <dd>Estimating Count-Distinct</dd>
+
+      <dt>Count Sketches</dt>
+      <dd>Estimating Count-GroupBy<br/>(roughly uniform counts)</dd>
+
+      <dt>Count-Min Sketches</dt>
+      <dd>Estimating Count-GroupBy<br/>(small number of heavy hitters)</dd>
+    </dl>
+  </section>
+</section>
+
+
+<section>
+  <section>
+    <h3>The <code>WINDOW</code> Operator</h3>
+    <ol>
+      <li class="fragment">Define a Sequence (i.e., sort the relation)</li>
+      <li class="fragment">Compute all subsequences<ul>
+        <li>Fixed <u>Physical</u> Size: N records exactly.</li>
+        <li>Fixed <u>Logical</u> Size: Records within N units of time.</li>
+      </ul></li>
+      <li class="fragment">Compute an aggregate for each subsequence (one output row per subsequence)</li>
+    </ol>
+  </section>
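The three steps above can be sketched for a fixed logical window size. The function name and the sample data are hypothetical, the aggregate is AVG, and the window covers every record within `radius` time units of the current one. This version rescans the sequence for every row, so it is quadratic; it only illustrates the semantics.

```python
def moving_average(rows, time_key, value_key, radius):
    # Step 1: define a sequence by sorting.
    ordered = sorted(rows, key=lambda r: r[time_key])
    out = []
    for row in ordered:
        t = row[time_key]
        # Step 2: the subsequence of records within `radius` units of t.
        window = [r[value_key] for r in ordered if abs(r[time_key] - t) <= radius]
        # Step 3: one aggregate (and one output row) per subsequence.
        out.append((t, sum(window) / len(window)))
    return out

sales = [
    {"month": 1, "sales": 10},
    {"month": 2, "sales": 20},
    {"month": 3, "sales": 30},
    {"month": 5, "sales": 50},
]
movavg = moving_average(sales, "month", "sales", radius=1)
```

Month 5 has no neighbours within one month, so its window holds only its own record.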
+
+  <section>
+    <pre><code class="sql">
+    SELECT L.state, T.month, 
+       AVG(S.sales) OVER W as movavg
+    FROM   Sales S, Times T, Locations L
+    WHERE  S.timeid = T.timeid 
+      AND  S.locid = L.locid
+    WINDOW W AS ( 
+       PARTITION BY L.state
+       ORDER BY T.month
+       RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
+             AND INTERVAL '1' MONTH FOLLOWING
+    )
+    </code></pre>
+  </section>
+
+  <section>
+    <svg data-src="graphics/2019-03-08-Windows.svg" class="stretch" />
+  </section>
+
+  <section>
+    <h3>Summary</h3>
+    <dl>
+      <dt>Push vs Pull Data Flow</dt>
+      <dd>Push is a better fit because sources produce data at different rates.</dd>
+      <dt>Revisit Joins</dt>
+      <dd>Focus on ripple-style <code>WINDOW</code> joins.</dd>
+      <dt>Revisit Indexing</dt>
+      <dd>Linked Hash/Tree Indexes for efficient windowed indexing.</dd>
+      <dt>Revisit Aggregation</dt>
+      <dd>Sliding window aggregates.</dd>
+    </dl>
+  </section>
+</section>