Review slides

This commit is contained in:
Oliver Kennedy 2019-03-11 12:12:47 -04:00
parent c710d3c0ad
commit 28f25b7e82
3 changed files with 802 additions and 1 deletions

@ -6,6 +6,16 @@ textbook: (readings only)
---
<section>
<!-- 2019 by OK
This lecture fell a bit flat.
- There are a number of bugs in the builds that need to be worked out
- More examples would be helpful
- A deeper dive into the math might also be useful.
- More detail in the solutions to the birthday paradox
-->
<section>
<p style="font-size: larger;">What is the best, correct technique for task <b>X</b>, when <b>Y</b> is true?</p>

@ -186,7 +186,7 @@ textbook: (readings only)
</ol></li>
<li>Find $R$, the lowest index <b>not</b> in the set</li>
<li>Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)</li>
<li>Repeat as needed</li>
<li>Repeat (in parallel) as needed</li>
</ol>
</section>
</section>

@ -0,0 +1,791 @@
---
template: templates/cse4562_2019_slides.erb
title: Midterm Review
date: March 11, 2019
---
<!-- ============================================ -->
<section>
<section>
<p>What are Databases?</p>
</section>
<section>
<dl>
<dt>Analysis: Answering user-provided questions about data</dt>
<dd>What kind of tools can we give end-users? <ul class="tight">
<li>Declarative Languages</li>
<li>Organizational Datastructures (e.g., Indexes)</li>
</ul></dd>
<div style="color: grey;">
<dt>Manipulation: Safely persisting and sharing data updates</dt>
<dd>What kind of tools can we give end-users?<ul class="tight">
<li>Consistency Primitives</li>
<li>Data Validation Primitives</li>
</ul></dd>
</div>
</dl>
</section>
<section>
<dl style="font-size: smaller;">
<dt>Primitive</dt>
<dd>Basic building blocks like Int, Float, Char, String</dd>
<dt>Tuple</dt>
<dd>Several fields of different types. (N-Tuple = N fields)</dd>
<dd>A Tuple has a schema defining each field</dd>
<dt>Set</dt>
<dd>A collection of unique records, all of the same type</dd>
<dt>Bag</dt>
<dd>An unordered collection of records, all of the same type</dd>
<dt>List</dt>
<dd>An ordered collection of records, all of the same type</dd>
</dl>
</section>
<section>
<pre><code class="sql">
SELECT [DISTINCT] targetlist
FROM relationlist
WHERE condition
</code></pre>
<ol>
<li class="fragment">Compute the cross product of all relations appearing in <span style="color: red;">relationlist</span> (every combination of one tuple from each relation)</li>
<li class="fragment">Discard tuples that fail the <span style="color: red;">condition</span></li>
<li class="fragment">Delete attributes not in <span style="color: red;">targetlist</span></li>
<li class="fragment">If <span style="font-family: Courier, monospace;">DISTINCT</span> is specified, eliminate duplicate rows</li>
</ol>
<p style="font-size: 70%;" class="fragment">
This is the least efficient strategy to compute a query!
A good optimizer will find <b>more efficient strategies</b> to compute <b>the same answer.</b>
</p>
</section>
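<section>
<p style="font-size: 70%;">A minimal Python sketch of the four conceptual steps above, assuming each relation is a list of dicts with distinct attribute names. The function name <code>naive_select</code> and the sample relations are hypothetical, chosen only for illustration.</p>

```python
from itertools import product

def naive_select(relations, condition, targetlist, distinct=False):
    """Evaluate SELECT-FROM-WHERE by the four conceptual steps."""
    rows = []
    # Step 1: every combination of one tuple from each relation
    for combo in product(*relations):
        merged = {}
        for tup in combo:
            merged.update(tup)
        # Step 2: discard combinations that fail the condition
        if condition(merged):
            # Step 3: keep only the attributes in the target list
            rows.append(tuple(merged[a] for a in targetlist))
    # Step 4: optionally eliminate duplicates (first-seen order is kept)
    if distinct:
        rows = list(dict.fromkeys(rows))
    return rows

# Hypothetical sample data
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}]
S = [{"c": 10, "d": "x"}, {"c": 10, "d": "y"}, {"c": 30, "d": "z"}]

bag_result = naive_select([R, S], lambda t: t["b"] == t["c"], ["a"])
set_result = naive_select([R, S], lambda t: t["b"] == t["c"], ["a"], distinct=True)
```

<p style="font-size: 70%;">Without <code>DISTINCT</code> the result is a bag, so the matching value of <code>a</code> appears once per matching pair.</p>
</section>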
</section>
<section>
<section>
<h2>Physical Layout</h2>
</section>
<section>
<h3>Record Formats</h3>
<dl>
<dt>Fixed</dt>
<dd>Constant-size fields. Field $i$ at byte $\sum_{j < i} |Field_j|$</dd>
<dt>Delimited</dt>
<dd>Special character or string (e.g., <code>,</code>) between fields</dd>
<dt>Header</dt>
<dd>Fixed-size header points to start of each field</dd>
<dt>&nbsp;</dt>
<dd>&nbsp;</dd>
</dl>
</section>
<section>
<h3>File Formats</h3>
<dl>
<dt>Fixed</dt>
<dd>Constant-size records. Record $i$ at byte $|Record| \times i$</dd>
<dt>Delimited</dt>
<dd>Special character or string (e.g., <code>\r\n</code>) at record end</dd>
<dt>Header</dt>
<dd>Index in file points to start of each record</dd>
<dt class="fragment" data-fragment-index="1">Paged</dt>
<dd class="fragment" data-fragment-index="1">Align records to paging boundaries</dd>
</dl>
</section>
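<section>
<p style="font-size: 70%;">A small Python sketch of the "Fixed" record and file formats, assuming a hypothetical three-field schema (32-bit int, 64-bit float, 8-byte string) packed with no padding. Field $i$ starts at the sum of the earlier field sizes, and record $i$ starts at $|Record| \times i$.</p>

```python
import struct

# Hypothetical schema: int32, float64, 8-byte string
FIELD_FORMATS = ["i", "d", "8s"]
# "<" means little-endian with no alignment padding, so sizes add up exactly
FIELD_SIZES = [struct.calcsize("<" + f) for f in FIELD_FORMATS]
RECORD_SIZE = sum(FIELD_SIZES)

def field_offset(i):
    """Byte offset of field i within a record: sum of earlier field sizes."""
    return sum(FIELD_SIZES[:i])

def record_offset(i):
    """Byte offset of record i within the file: |Record| * i."""
    return RECORD_SIZE * i

def read_field(buf, rec, field):
    """Jump straight to one field of one record without scanning."""
    start = record_offset(rec) + field_offset(field)
    (value,) = struct.unpack_from("<" + FIELD_FORMATS[field], buf, start)
    return value

# Build a tiny three-record "file" in memory
buf = b"".join(struct.pack("<id8s", k, k * 1.5, b"row%d" % k) for k in range(3))
```
</section>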
<section>
<dl>
<dt>File</dt>
<dd>A collection of pages (or records)</dd>
<dt>Page</dt>
<dd>A fixed-size collection of records</dd>
<dd style="font-size: smaller;">Page size is usually dictated by hardware.<br/>Mem Page $\approx$ 4KB&nbsp;&nbsp;&nbsp;Cache Line $\approx$ 64B</dd>
<dt>Record</dt>
<dd>One or more fields (for now)</dd>
<dt>Field</dt>
<dd>A primitive value (for now)</dd>
</dl>
</section>
</section>
<section>
<section>
<p>Relational Algebra</p>
</section>
<section>
<h3>Relational Algebra</h3>
<table style="font-size: 70%">
<tr><th>Operation</th><th>Sym</th><th>Meaning</th></tr>
<tr><td>Selection</td><td>$\sigma$</td><td>Select a subset of the input rows</td></tr>
<tr><td>Projection</td><td>$\pi$</td><td>Delete unwanted columns</td></tr>
<tr><td>Cross-product</td><td>$\times$</td><td>Combine two relations</td></tr>
<tr><td>Set-difference</td><td>$-$</td><td>Tuples in Rel 1, but not Rel 2</td></tr>
<tr><td>Union</td><td>$\cup$</td><td>Tuples either in Rel 1 or in Rel 2</td></tr>
<tr><td>Intersection</td><td>$\cap$</td><td>Tuples in both Rel 1 and Rel 2</td></tr>
<tr><td>Join</td><td>$\bowtie$</td><td>Pairs of tuples matching a specified condition</td></tr>
<tr style="color: lightgrey;"><td>Division</td><td>$/$</td><td>"Inverse" of cross-product</td></tr>
<tr><td>Sort</td> <td>$\tau_A$</td><td>Sort records by attribute(s) $A$</td></tr>
<tr><td>Limit</td><td>$\texttt{LIMIT}_N$</td><td>Return only the first $N$ records<br/>(according to sort order if paired with sort).</td></tr>
</table>
</section>
<section>
<h3>Equivalence</h3>
$$Q_1 = \pi_{A}\left( \sigma_{c}( R ) \right)$$
$$Q_2 = \sigma_{c}\left( \pi_{A}( R ) \right)$$
<div class="fragment">
$$Q_1 \stackrel{?}{\equiv} Q_2$$
</div>
</section>
<section>
<table style="font-size: 60%">
<tr><th><b>Rule</b></th><th><b>Notes</b> </th></tr>
<tr><td>$\sigma_{C_1\wedge C_2}(R) \equiv \sigma_{C_1}(\sigma_{C_2}(R))$</td><td></td></tr>
<tr><td>$\sigma_{C_1\vee C_2}(R) \equiv \sigma_{C_1}(R) \cup \sigma_{C_2}(R)$</td><td>Only true for set, not bag union</td></tr>
<tr><td>$\sigma_C(R \times S) \equiv R \bowtie_C S$</td><td></td></tr>
<tr><td>$\sigma_C(R \times S) \equiv \sigma_C(R) \times S$</td><td>If $C$ references only $R$'s attributes, also works for joins</td></tr>
<tr><td>$\pi_{A}(\pi_{A \cup B}(R)) \equiv \pi_{A}(R)$</td><td> </td></tr>
<tr><td>$\sigma_C(\pi_{A}(R)) \equiv \pi_A(\sigma_C(R))$</td><td>If $A$ contains all of the attributes referenced by $C$ </td></tr>
<tr><td>$\pi_{A\cup B}(R\times S) \equiv \pi_A(R) \times \pi_B(S)$</td><td>Where $A$ (resp., $B$) contains attributes in $R$ (resp., $S$)</td></tr>
<tr><td>$R \times (S \times T) \equiv (R \times S) \times T$</td><td>Also works for joins </td></tr>
<tr><td>$R \times S \equiv S \times R$</td><td>Also works for joins </td></tr>
<tr><td>$R \cup (S \cup T) \equiv (R \cup S) \cup T$</td><td>Also works for intersection and bag-union </td></tr>
<tr><td>$R \cup S \equiv S \cup R$</td><td>Also works for intersections and bag-union </td></tr>
<tr><td>$\sigma_{C}(R \cup S) \equiv \sigma_{C}(R) \cup \sigma_{C}(S)$</td><td>Also works for intersections and bag-union </td></tr>
<tr><td>$\pi_{A}(R \cup S) \equiv \pi_{A}(R) \cup \pi_{A}(S)$</td><td>Also works for bag-union, but not for intersection </td></tr>
<tr><td>$\sigma_{C}(\gamma_{A, AGG}(R)) \equiv \gamma_{A, AGG}(\sigma_{C}(R))$</td><td>If $A$ contains all of the attributes referenced by $C$</td></tr>
</table>
</section>
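<section>
<p style="font-size: 70%;">A hedged Python illustration of the selection/projection rule, assuming bag semantics over lists of dicts. The helper names <code>select</code> and <code>project</code> and the sample relation are hypothetical. Swapping $\sigma_C$ and $\pi_A$ is only safe when $A$ keeps every attribute that $C$ references.</p>

```python
def select(cond, rows):
    """Bag selection: keep the rows that satisfy cond."""
    return [r for r in rows if cond(r)]

def project(attrs, rows):
    """Bag projection: keep only the listed attributes (no de-duplication)."""
    return [{a: r[a] for a in attrs} for r in rows]

R = [{"A": 1, "B": 5}, {"A": 2, "B": 7}, {"A": 3, "B": 5}]

# Condition references only A, which the projection keeps: both orders agree
c_on_A = lambda r: r["A"] >= 2
q1 = project(["A"], select(c_on_A, R))
q2 = select(c_on_A, project(["A"], R))

# Condition references B, which the projection drops: Q2 is not even defined
c_on_B = lambda r: r["B"] == 5
q1_b = project(["A"], select(c_on_B, R))
try:
    select(c_on_B, project(["A"], R))
    q2_b_defined = True
except KeyError:
    q2_b_defined = False
```
</section>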
<section>
<h3>Algorithms</h3>
<dl>
<dt>"Volcano" Operators (Iterators)</dt>
<dd>Operators "pull" tuples, one-at-a-time, from their children.</dd>
<dt>2-Pass (External) Sort</dt>
<dd>Create sorted runs, then repeatedly merge runs</dd>
<dt>Join Algorithms</dt>
<dd>Quickly picking out <i>specific</i> pairs of tuples.</dd>
<dt>Aggregation Algorithms</dt>
<dd>In-Memory vs 2-Pass, Normal vs Group-By</dd>
</dl>
</section>
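<section>
<p style="font-size: 70%;">One way to sketch "Volcano"-style operators in Python is with iterators: each operator pulls rows from its child only when its own parent asks for one. The class names and the <code>pulled</code> counter are illustrative assumptions, not part of any particular system.</p>

```python
from itertools import islice

class Scan:
    """Leaf operator: yields stored rows and counts how many were pulled."""
    def __init__(self, rows):
        self.rows = rows
        self.pulled = 0
    def __iter__(self):
        for row in self.rows:
            self.pulled += 1
            yield row

class Select:
    def __init__(self, cond, child):
        self.cond, self.child = cond, child
    def __iter__(self):
        for row in self.child:          # pull one tuple at a time
            if self.cond(row):
                yield row

class Project:
    def __init__(self, attrs, child):
        self.attrs, self.child = attrs, child
    def __iter__(self):
        for row in self.child:
            yield {a: row[a] for a in self.attrs}

class Limit:
    def __init__(self, n, child):
        self.n, self.child = n, child
    def __iter__(self):
        # islice stops asking the child once n rows have been produced
        return islice(iter(self.child), self.n)

scan = Scan([{"A": i, "B": i * i} for i in range(10)])
plan = Limit(2, Project(["A"], Select(lambda r: r["A"] % 2 == 0, scan)))
result = list(plan)
```

<p style="font-size: 70%;">Because every operator is lazy, the scan is asked for only as many rows as the limit needs, not the whole table.</p>
</section>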
<section>
<h3>Nested-Loop Join</h3>
<svg data-src="graphics/2018-02-12-Join-NLJ.svg" />
</section>
<section>
<h3>Block-Nested Loop Join</h3>
<svg data-src="graphics/2018-02-12-Join-BNLJ.svg" class="stretch" />
</section>
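<section>
<p style="font-size: 70%;">A minimal sketch of block-nested loop join in Python, assuming both inputs are in-memory lists and that <code>block_size</code> (a hypothetical parameter) is the number of $R$ tuples that fit in the buffer. The counter shows that $S$ is scanned once per block of $R$ rather than once per tuple.</p>

```python
def block_nested_loop_join(R, S, cond, block_size):
    """Join R and S, scanning S once per block of R."""
    out = []
    s_scans = 0
    for start in range(0, len(R), block_size):
        block = R[start:start + block_size]   # one block of R held in memory
        s_scans += 1                          # one full pass over S per block
        for s in S:
            for r in block:
                if cond(r, s):
                    out.append((r, s))
    return out, s_scans

R = list(range(10))
S = [0, 3, 6, 9, 12]
pairs, scans = block_nested_loop_join(R, S, lambda r, s: r == s, block_size=4)
# block_size=1 degenerates to the plain nested-loop join
_, naive_scans = block_nested_loop_join(R, S, lambda r, s: r == s, block_size=1)
```
</section>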
<section>
<h3>Strategies for Implementing $R \bowtie_{R.A = S.A} S$</h3>
<dl>
<dt>Sort/Merge Join</dt>
<dd>Sort all of the data upfront, then scan over both sides.</dd>
<dt>In-Memory Index Join (1-pass Hash; Hash Join)</dt>
<dd>Build an in-memory index on one table, scan the other.</dd>
<dt>Partition Join (2-pass Hash; External Hash Join)</dt>
<dd>Partition both sides so that tuples don't join across partitions.</dd>
</dl>
</section>
<section>
<h3>Sort/Merge Join</h3>
<svg data-src="graphics/2018-02-12-Join-SortMerge.svg" />
</section>
<section>
<h3>Sort/Merge Join</h3>
<img src="graphics/2018-02-12-Join-SortMerge.svg" />
<p>Sort/Merge is typically expressed as 3 operators<br/>(2xSort + Merge)</p>
</section>
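<section>
<p style="font-size: 70%;">A sketch of sort/merge join for an equality condition, assuming in-memory lists and caller-supplied key functions (<code>key_r</code> and <code>key_s</code> are hypothetical names). Runs of equal keys on both sides are paired up so that duplicates join correctly.</p>

```python
def sort_merge_join(R, S, key_r, key_s):
    """Equi-join by sorting both inputs, then scanning them in lockstep."""
    R = sorted(R, key=key_r)
    S = sorted(S, key=key_s)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        kr, ks = key_r(R[i]), key_s(S[j])
        if kr < ks:
            i += 1                      # R is behind: advance R
        elif kr > ks:
            j += 1                      # S is behind: advance S
        else:
            # Find the run of S tuples sharing this key
            j_end = j
            while j_end < len(S) and key_s(S[j_end]) == kr:
                j_end += 1
            # Pair every R tuple with this key against the whole S run
            while i < len(R) and key_r(R[i]) == kr:
                for s in S[j:j_end]:
                    out.append((R[i], s))
                i += 1
            j = j_end
    return out

R = [(3, "c"), (1, "a"), (2, "b"), (2, "B")]
S = [(2, "x"), (4, "w"), (2, "y"), (1, "z")]
joined = sort_merge_join(R, S, lambda r: r[0], lambda s: s[0])
```
</section>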
<section>
<h3>1-Pass Hash Join</h3>
<svg data-src="graphics/2018-02-12-Join-1PassHash.svg" />
</section>
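<section>
<p style="font-size: 70%;">A minimal sketch of a 1-pass hash join in Python, assuming $S$ fits in memory. A <code>defaultdict</code> stands in for the in-memory index; the function and key names are illustrative.</p>

```python
from collections import defaultdict

def hash_join(R, S, key_r, key_s):
    """Build an in-memory hash index on S, then probe it once per R tuple."""
    index = defaultdict(list)
    for s in S:                                # build phase
        index[key_s(s)].append(s)
    out = []
    for r in R:                                # probe phase
        for s in index.get(key_r(r), []):
            out.append((r, s))
    return out

R = [(1, "a"), (2, "b"), (5, "e")]
S = [(2, "x"), (1, "z"), (2, "y")]
joined = hash_join(R, S, lambda r: r[0], lambda s: s[0])
```
</section>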
<section>
<h3>2-Pass Hash Join</h3>
<dl>
<dt>Limited Queries</dt>
<dd>Only supports join conditions of the form $R.A = S.B$</dd>
<dt>Low Memory</dt>
<dd>Never need more than 1 pair of partitions in memory</dd>
<dt>High IO Cost</dt>
<dd>Every record gets written out to disk, and back in.</dd>
</dl>
<p class="fragment">Can partition on data-values to support other types of queries.</p>
</section>
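<section>
<p style="font-size: 70%;">A sketch of the 2-pass (partition) hash join, under the simplifying assumption that Python lists stand in for the partition files a real system would write to disk. Only one pair of partitions is joined at a time, and tuples in different partitions can never match. The helper names are hypothetical.</p>

```python
from collections import defaultdict

def partition(rows, key, n):
    """Pass 1: route each row to a partition by hashing its join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts

def two_pass_hash_join(R, S, key_r, key_s, n_partitions):
    r_parts = partition(R, key_r, n_partitions)
    s_parts = partition(S, key_s, n_partitions)
    out = []
    # Pass 2: join matching partition pairs one at a time
    for r_part, s_part in zip(r_parts, s_parts):
        index = defaultdict(list)
        for s in s_part:
            index[key_s(s)].append(s)
        for r in r_part:
            for s in index.get(key_r(r), []):
                out.append((r, s))
    return out

R = [(1, "a"), (2, "b"), (5, "e"), (7, "g")]
S = [(2, "x"), (1, "z"), (2, "y"), (7, "w"), (9, "v")]
joined = two_pass_hash_join(R, S, lambda r: r[0], lambda s: s[0], n_partitions=3)
```

<p style="font-size: 70%;">The output order depends on how keys fall into partitions, so compare results as sorted lists.</p>
</section>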
<section>
<h3>Index Nested Loop Join</h3>
To compute $R \bowtie_{R.A < S.B} S$ with an index on $S.B$
<ol>
<li>Read One Row of $R$</li>
<li>Get the value of $R.A$</li>
<li>Start index scan on $S.B > [R.A]$</li>
<li>Return rows as normal</li>
</ol>
</section>
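<section>
<p style="font-size: 70%;">The steps above can be sketched in Python with a sorted list and <code>bisect</code> standing in for a tree index on $S.B$ (an assumption made for illustration; a real B+Tree lookup costs $\log_{\mathcal I}(|S|)$ page reads). The function and accessor names are hypothetical.</p>

```python
from bisect import bisect_right

def index_nested_loop_join(R, S_sorted_by_B, get_a, get_b):
    """Compute R join S on R.A < S.B using an ordered index on S.B."""
    keys = [get_b(s) for s in S_sorted_by_B]   # stands in for the index
    out = []
    for r in R:                                # 1. read one row of R
        a = get_a(r)                           # 2. get the value of R.A
        start = bisect_right(keys, a)          # 3. first position with S.B > R.A
        for s in S_sorted_by_B[start:]:        # 4. return rows as normal
            out.append((r, s))
    return out

R = [{"A": 2}, {"A": 5}]
S = [{"B": 1}, {"B": 2}, {"B": 4}, {"B": 7}]   # already sorted by B
joined = index_nested_loop_join(R, S, lambda r: r["A"], lambda s: s["B"])
```
</section>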
<section>
<h3>Basic Aggregate Pattern</h3>
<dl>
<dt>Init</dt>
<dd>Define a starting value for the accumulator</dd>
<dt>Fold(Accum, New)</dt>
<dd>Merge a new value into the accumulator</dd>
<dt>Finalize(Accum)</dt>
<dd>Extract the aggregate from the accumulator.</dd>
</dl>
</section>
<section>
<h3>Basic Aggregate Types</h3>
<p class="fragment" style="font-size: 60%">Gray et al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals"</p>
<dl>
<dt>Distributive</dt>
<dd>Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)</dd>
<dt>Algebraic</dt>
<dd>Finite-sized accumulator but needs a finalize (AVG)</dd>
<dt>Holistic</dt>
<dd>Unbounded accumulator (MEDIAN)</dd>
</dl>
</section>
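<section>
<p style="font-size: 70%;">A sketch of the Init / Fold / Finalize pattern in Python, with one example per aggregate type. The class layout is an illustrative assumption: <code>Sum</code> needs no real finalize, <code>Avg</code> keeps a fixed-size (sum, count) pair, and <code>Median</code> has to keep every value.</p>

```python
class Sum:                      # distributive
    def init(self):
        return 0
    def fold(self, acc, x):
        return acc + x
    def finalize(self, acc):
        return acc              # accumulator already is the answer

class Avg:                      # algebraic
    def init(self):
        return (0, 0)           # (running sum, running count)
    def fold(self, acc, x):
        return (acc[0] + x, acc[1] + 1)
    def finalize(self, acc):
        return acc[0] / acc[1] if acc[1] else None

class Median:                   # holistic
    def init(self):
        return []               # grows with the input
    def fold(self, acc, x):
        acc.append(x)
        return acc
    def finalize(self, acc):
        if not acc:
            return None
        vals = sorted(acc)
        mid = len(vals) // 2
        if len(vals) % 2:
            return vals[mid]
        return (vals[mid - 1] + vals[mid]) / 2

def aggregate(agg, values):
    acc = agg.init()
    for v in values:
        acc = agg.fold(acc, v)
    return agg.finalize(acc)

data = [3, 1, 4, 1, 5]
```
</section>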
<section>
<h3>Grouping Algorithms</h3>
<dl>
<dt>2-pass Hash Aggregate</dt>
<dd>Like 2-pass Hash Join: Distribute groups across buckets, then do an in-memory aggregate for each bucket.</dd>
<dt>Sort-Aggregate</dt>
<dd>Like Sort-Merge Join: Sort data by the group-by attributes, so that the elements of each group are adjacent.</dd>
</dl>
</section>
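<section>
<p style="font-size: 70%;">A minimal Python sketch of sort-aggregate, assuming the data fits in memory so the built-in sort can stand in for an external sort. Once rows are sorted by the group key, each group is a contiguous run and needs only one accumulator at a time. The parameter names are hypothetical.</p>

```python
from itertools import groupby

def sort_aggregate(rows, group_key, value, fold, init):
    """Sort by group key, then fold each run of adjacent rows."""
    out = []
    ordered = sorted(rows, key=group_key)
    for k, group in groupby(ordered, key=group_key):
        acc = init
        for row in group:
            acc = fold(acc, value(row))
        out.append((k, acc))
    return out

sales = [("NY", 3), ("CA", 5), ("NY", 4), ("CA", 1)]
totals = sort_aggregate(
    sales,
    group_key=lambda r: r[0],
    value=lambda r: r[1],
    fold=lambda acc, x: acc + x,
    init=0,
)
```
</section>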
</section>
<section>
<section>
<p>Indexing</p>
</section>
<section>
<h3>Data Organization</h3>
<dl>
<div class="fragment">
<dt>Unordered Heap</dt>
<dd>No organization at all. $O(N)$ reads.</dd>
</div>
<div class="fragment">
<dt>(Secondary) Index</dt>
<dd>Index structure over unorganized data. $O(\ll N)$ <b>random</b> reads for <b>some</b> queries.</dd>
</div>
<div class="fragment">
<dt>Clustered (Primary) Index</dt>
<dd>Index structure over clustered data. $O(\ll N)$ <b>sequential</b> reads for <b>some</b> queries.</dd>
</div>
</dl>
</section>
<section>
<h3>Data Organization</h3>
<img src="graphics/2018-02-19-Index-Types.svg" />
</section>
<section>
<h3>Data Organization</h3>
<img src="graphics/2018-02-19-PrimaryVsSecondary.png" />
</section>
<section>
<h3>Tree-Based Indexes</h3>
<svg data-src="graphics/2018-02-19-Tree-Motivation.svg"/>
</section>
<section>
<svg data-src="graphics/2018-02-19-BTree-Reserved.svg" />
</section>
<section>
<h3>Rules of B+Trees</h3>
<dl>
<dt>Keep space open for insertions in inner/data nodes.</dt>
<dd>Split nodes when they're full</dd>
<dt>Avoid under-using space</dt>
<dd>Merge nodes when they're under-filled</dd>
</dl>
<p><b>Maintain Invariant:</b> All Nodes ≥ 50% Full</p>
<p>(Exception: The Root)</p>
</section>
<section>
<svg data-src="graphics/2018-02-23-HashTable.svg" class="stretch"/>
</section>
<section>
<h3>Problems</h3>
<dl>
<dt>$N$ is too small</dt>
<dd>Too many overflow pages (slower reads).</dd>
<dt>$N$ is too big</dt>
<dd>Too many normal pages (wasted space).</dd>
</dl>
</section>
<section>
<svg data-src="graphics/2018-02-23-HashResize-Naive.svg" class="stretch"/>
</section>
<section>
<h3>Problems</h3>
<dl>
<dt class="fragment" data-fragment-index="1">Changing hash functions reallocates everything</dt>
<dd class="fragment" data-fragment-index="1">Only double/halve the number of buckets ($N$)</dd>
<dt class="fragment" data-fragment-index="2">Changing sizes still requires reading everything</dt>
<dd class="fragment" data-fragment-index="3"><b>Idea:</b> Only redistribute buckets that are too big</dd>
</dl>
</section>
<section>
<svg data-src="graphics/2018-02-23-HashResize-Dynamic.svg" class="stretch" />
</section>
</section>
<section>
<section>
<p>Cost-Based Optimization</p>
</section>
<section>
<h3>Accounting</h3>
<p style="margin-top: 50px;">Figure out the cost of each <b>individual</b> operator.</p>
<p style="margin-top: 50px;">Only count the number of IOs <b>added</b> by each operator.</p>
</section>
<section>
<table style="font-size: 70%">
<tr><th>Operation</th><th>RA</th><th>IOs Added (#pages)</th><th>Memory (#tuples)</th></tr>
<tr>
<td>Table Scan</td>
<td>$R$</td>
<td>$\frac{|R|}{\mathcal P}$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td>Projection</td>
<td>$\pi(R)$</td>
<td>$0$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td>Selection</td>
<td>$\sigma(R)$</td>
<td>$0$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td>Union</td>
<td>$R \uplus S$</td>
<td>$0$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td style="vertical-align: middle;">Sort <span>(In-Mem)</span></td>
<td style="vertical-align: middle;">$\tau(R)$</td>
<td>$0$</td>
<td>$O(|R|)$</td>
</tr>
<tr>
<td>Sort (On-Disk)</td>
<td>$\tau(R)$</td>
<td>$\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$</td>
<td>$O(\mathcal B)$</td>
</tr>
<tr>
<td><span>(B+Tree)</span> Index Scan</td>
<td>$Index(R, c)$</td>
<td>$\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td>(Hash) Index Scan</td>
<td>$Index(R, c)$</td>
<td>$1$</td>
<td>$O(1)$</td>
</tr>
</table>
<ol style="font-size: 50%; margin-top: 50px;">
<li>Tuples per Page ($\mathcal P$) <span>(normally defined per-schema)</span></li>
<li>Size of $R$ ($|R|$)</li>
<li>Pages of Buffer ($\mathcal B$)</li>
<li>Keys per Index Page ($\mathcal I$)</li>
</ol>
</section>
<section>
<table style="font-size: 70%">
<tr><th width="300px">Operation</th><th>RA</th><th>IOs Added (#pages)</th><th>Memory (#tuples)</th></tr>
<tr>
<td style="font-size: 60%">Nested Loop Join <span>(Buffer $S$ in mem)</span></td>
<td>$R \times S$</td>
<td>$0$</td>
<td>$O(|S|)$</td>
</tr>
<tr>
<td style="font-size: 60%">Nested Loop Join (Buffer $S$ on disk)</td>
<td>$R \times_{disk} S$</td>
<td>$(1+ |R|) \cdot \frac{|S|}{\mathcal P}$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td>1-Pass Hash Join</td>
<td>$R \bowtie_{1PH, c} S$</td>
<td>$0$</td>
<td>$O(|S|)$</td>
</tr>
<tr>
<td>2-Pass Hash Join</td>
<td>$R \bowtie_{2PH, c} S$</td>
<td>$\frac{2|R| + 2|S|}{\mathcal P}$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td>Sort-Merge Join </td>
<td>$R \bowtie_{SM, c} S$</td>
<td>[Sort]</td>
<td>[Sort]</td>
</tr>
<tr>
<td><span>(Tree)</span> Index NLJ</td>
<td>$R \bowtie_{INL, c} S$</td>
<td>$|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td>(Hash) Index NLJ</td>
<td>$R \bowtie_{INL, c} S$</td>
<td>$|R| \cdot 1$</td>
<td>$O(1)$</td>
</tr>
<tr>
<td><span>(In-Mem)</span> Aggregate</td>
<td>$\gamma_A(R)$</td>
<td>$0$</td>
<td>$O(adom(A))$</td>
</tr>
<tr>
<td style="font-size: 90%">(Sort/Merge) Aggregate</td>
<td>$\gamma_A(R)$</td>
<td>[Sort]</td>
<td>[Sort]</td>
</tr>
</table>
<ol style="font-size: 50%;">
<li>Tuples per Page ($\mathcal P$) <span>(normally defined per-schema)</span></li>
<li>Size of $R$ ($|R|$)</li>
<li>Pages of Buffer ($\mathcal B$)</li>
<li>Keys per Index Page ($\mathcal I$)</li>
<li>Number of distinct values of $A$ ($adom(A)$)</li>
</ol>
</section>
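<section>
<p style="font-size: 70%;">A worked example of a few "IOs Added" formulas from the tables above, written as Python functions. The table sizes and tuples-per-page value are made-up numbers for illustration only.</p>

```python
import math

def table_scan(r, p):
    """|R| / P page reads."""
    return r / p

def nlj_disk(r, s, p):
    """(1 + |R|) * |S| / P: write S out once, re-read it for every R tuple."""
    return (1 + r) * s / p

def two_pass_hash_join(r, s, p):
    """(2|R| + 2|S|) / P: every record is written out and read back."""
    return (2 * r + 2 * s) / p

def hash_index_nlj(r):
    """|R| * 1: one index probe per R tuple."""
    return r * 1

def tree_index_nlj(r, s, matches, p, keys_per_page):
    """|R| * (log_I(|S|) + |sigma_c(S)| / P)."""
    return r * (math.log(s, keys_per_page) + matches / p)

# Hypothetical statistics: |R| = 1000, |S| = 500, P = 100 tuples per page
R_SIZE, S_SIZE, P = 1000, 500, 100
costs = {
    "scan R": table_scan(R_SIZE, P),
    "NLJ (S on disk)": nlj_disk(R_SIZE, S_SIZE, P),
    "2-pass hash join": two_pass_hash_join(R_SIZE, S_SIZE, P),
    "hash index NLJ": hash_index_nlj(R_SIZE),
}
```

<p style="font-size: 70%;">With these numbers the on-disk nested loop join adds far more IOs than the 2-pass hash join, which is the kind of comparison a cost-based optimizer makes.</p>
</section>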
<section>
<p>Estimating IOs requires Estimating $|Q(R)|$</p>
</section>
<section>
<table style="font-size: 70%">
<tr>
<th>Operator</th>
<th>RA</th>
<th>Estimated Size</th>
</tr>
<tr>
<td>Table</td>
<td>$R$</td>
<td>$|R|$</td>
</tr>
<tr>
<td>Projection</td>
<td>$\pi(Q)$</td>
<td>$|Q|$</td>
</tr>
<tr>
<td>Union</td>
<td>$Q_1 \uplus Q_2$</td>
<td>$|Q_1| + |Q_2|$</td>
</tr>
<tr>
<td>Cross Product</td>
<td>$Q_1 \times Q_2$</td>
<td>$|Q_1| \times |Q_2|$</td>
</tr>
<tr>
<td>Sort</td>
<td>$\tau(Q)$</td>
<td>$|Q|$</td>
</tr>
<tr>
<td>Limit</td>
<td>$\texttt{LIMIT}_N(Q)$</td>
<td>$N$</td>
</tr>
<tr>
<td>Selection</td>
<td>$\sigma_c(Q)$</td>
<td>$|Q| \times \texttt{SEL}(c, Q)$</td>
</tr>
<tr>
<td>Join</td>
<td>$Q_1 \bowtie_c Q_2$</td>
<td>$|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$</td>
</tr>
<tr>
<td>Distinct</td>
<td>$\delta_A(Q)$</td>
<td>$\texttt{UNIQ}(A, Q)$</td>
</tr>
<tr>
<td>Aggregate</td>
<td>$\gamma_{A, B \leftarrow \Sigma}(Q)$</td>
<td>$\texttt{UNIQ}(A, Q)$</td>
</tr>
</table>
<ul style="font-size: 50%; margin-top: 20px">
<li>$\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$</li>
<li>$\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.</li>
</ul>
</section>
<section>
<h3>(Some) Estimation Techniques</h3>
<dl style="font-size: 80%">
<div class="fragment">
<dt>Guess Randomly</dt>
<dd>Rules of thumb if you have no other options...</dd>
</div>
<div class="fragment">
<dt>Uniform Prior</dt>
<dd>Use basic statistics to make a very rough guess.</dd>
</div>
<div class="fragment">
<dt>Sampling / History</dt>
<dd>Small, Quick Sampling Runs (or prior executions of the query).</dd>
</div>
<div class="fragment">
<dt>Histograms</dt>
<dd>Using more detailed statistics for improved guesses.</dd>
</div>
<div class="fragment">
<dt>Constraints</dt>
<dd>Using rules about the data for improved guesses.</dd>
</div>
</dl>
</section>
</section>
<section>
<section>
<h3>Sketching</h3>
</section>
<section>
<table>
<tr><th>Flips</th><th>Score</th><th>Probability</th>
<th data-fragment-index="1">E[# Games]</th>
</tr>
<tr><td>(👽)</td><td>0</td><td>0.5</td>
<td data-fragment-index="1">2</td>
</tr>
<tr><td>(🐕)(👽)</td><td>1</td><td>0.25</td>
<td data-fragment-index="1">4</td>
</tr>
<tr><td>(🐕)(🐕)(👽)</td><td>2</td><td>0.125</td>
<td data-fragment-index="1">8</td>
</tr>
<tr><td>(🐕)$\times N$ &nbsp;&nbsp;(👽)</td><td>$N$</td><td>$\frac{1}{2^{N+1}}$</td>
<td>$2^{N+1}$</td>
</tr>
</table>
<p style="margin-top: 50px;">If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.</p>
<p style="margin-top: 50px;">To do that, I only need to track my top score!</p>
</section>
<section>
<h3>Flajolet-Martin Sketches</h3>
<h4>($\approx$ HyperLogLog)</h4>
<ol>
<li>For each record...
<ol>
<li>Hash the record</li>
<li>Find the index of the lowest-order non-zero bit</li>
<li>Add the index of the bit to a set</li>
</ol></li>
<li>Find $R$, the lowest index <b>not</b> in the set</li>
<li>Estimate Count-Distinct as $\frac{2^R}{\phi}$ ($\phi \approx 0.77351$)</li>
<li>Repeat (in parallel) as needed</li>
</ol>
</section>
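<section>
<p style="font-size: 70%;">The steps above can be sketched in Python as follows. The choice of SHA-256 truncated to 32 bits as the hash, the <code>seed</code> parameter, and averaging $R$ over several seeds for the "repeat" step are all assumptions made for this sketch, not details given on the slide.</p>

```python
import hashlib

PHI = 0.77351

def hash32(value, seed=0):
    """Hypothetical hash: first 32 bits of SHA-256 over seed and value."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def lowest_set_bit(x):
    """Index of the lowest-order non-zero bit of x (x must be non-zero)."""
    return (x & -x).bit_length() - 1

def fm_R(records, seed=0):
    """Lowest bit index that no record's hash produced."""
    seen = set()
    for rec in records:
        h = hash32(rec, seed)
        if h:                       # an all-zero hash has no set bit
            seen.add(lowest_set_bit(h))
    r = 0
    while r in seen:
        r += 1
    return r

def fm_estimate(records, seed=0):
    return (2 ** fm_R(records, seed)) / PHI

def fm_estimate_repeated(records, trials):
    """Repeat with independent hashes and average R before estimating."""
    records = list(records)
    mean_r = sum(fm_R(records, seed) for seed in range(trials)) / trials
    return (2 ** mean_r) / PHI
```

<p style="font-size: 70%;">The sketch only stores a small set of bit indexes, and duplicate records hash to the same index, so repeats never change the estimate.</p>
</section>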
<section>
<h3>Count Sketches</h3>
<ol>
<li>Pick a number of "trials" and a number of "bins"</li>
<li>For each record $O_i$
<ol>
<li>For each "trial" $j$
<ol>
<li>Use a hash function $h_j(O_i)$ to pick a bin</li>
<li>Add a $\pm 1$ value determined by hash function $\delta_j(O_i)$ to the bin</li>
</ol>
</li>
</ol>
</li>
<li>For each trial $j$, estimate the count of $O_i$ as $\delta_j(O_i)$ times the value of bin $h_j(O_i)$</li>
<li>Take the <em>median</em> value for all trials.</li>
</ol>
</section>
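<section>
<p style="font-size: 70%;">A minimal Count sketch in Python, assuming SHA-256 based stand-ins for the hash functions $h_j$ and $\delta_j$ (the helper <code>_h</code> is hypothetical). Each per-trial estimate multiplies the bin value by the item's own sign, so that other items' contributions tend to cancel.</p>

```python
import hashlib
from statistics import median

def _h(tag, trial, item):
    """Hypothetical hash family keyed by a tag and the trial number."""
    d = hashlib.sha256(f"{tag}:{trial}:{item}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class CountSketch:
    def __init__(self, trials, bins):
        self.trials, self.bins = trials, bins
        self.table = [[0] * bins for _ in range(trials)]

    def _bin(self, j, item):        # h_j
        return _h("bin", j, item) % self.bins

    def _sign(self, j, item):       # delta_j, either +1 or -1
        return 1 if _h("sign", j, item) % 2 else -1

    def add(self, item):
        for j in range(self.trials):
            self.table[j][self._bin(j, item)] += self._sign(j, item)

    def estimate(self, item):
        per_trial = (
            self._sign(j, item) * self.table[j][self._bin(j, item)]
            for j in range(self.trials)
        )
        return median(per_trial)

cs = CountSketch(trials=5, bins=16)
for _ in range(7):
    cs.add("x")
```
</section>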
<section>
<h3>Count-Min Sketches</h3>
<ol>
<li>Pick a number of "trials" and a number of "bins"</li>
<li>For each record $O_i$
<ol>
<li>For each "trial" $j$
<ol>
<li>Use a hash function $h_j(O_i)$ to pick a bin</li>
<li>Add 1 to the bin</li>
</ol>
</li>
</ol>
</li>
<li>For each trial $j$, estimate the count of $O_i$ by the value of bin $h_j(O_i)$</li>
<li>Take the <em>minimum</em> value for all trials.</li>
</ol>
</section>
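<section>
<p style="font-size: 70%;">A minimal Count-Min sketch in Python, again assuming a SHA-256 based stand-in for the hash functions $h_j$ (the helper <code>_bin_hash</code> is hypothetical). Bins only ever grow, so collisions can inflate a count but never shrink it, which is why the minimum over trials is used.</p>

```python
import hashlib

def _bin_hash(trial, item):
    """Hypothetical hash family keyed by the trial number."""
    d = hashlib.sha256(f"{trial}:{item}".encode()).digest()
    return int.from_bytes(d[:8], "big")

class CountMinSketch:
    def __init__(self, trials, bins):
        self.trials, self.bins = trials, bins
        self.table = [[0] * bins for _ in range(trials)]

    def _bin(self, j, item):        # h_j
        return _bin_hash(j, item) % self.bins

    def add(self, item):
        for j in range(self.trials):
            self.table[j][self._bin(j, item)] += 1

    def estimate(self, item):
        return min(
            self.table[j][self._bin(j, item)] for j in range(self.trials)
        )

cms = CountMinSketch(trials=4, bins=32)
for _ in range(7):
    cms.add("x")
for _ in range(3):
    cms.add("y")
```

<p style="font-size: 70%;">The estimate is always at least the true count and at most the total number of records added.</p>
</section>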
<section>
<dl>
<dt>Flajolet-Martin Sketches (HyperLogLog)</dt>
<dd>Estimating Count-Distinct</dd>
<dt>Count Sketches</dt>
<dd>Estimating Count-GroupBy<br/>(roughly uniform counts)</dd>
<dt>Count-Min Sketches</dt>
<dd>Estimating Count-GroupBy<br/>(small number of heavy hitters)</dd>
</dl>
</section>
</section>
<section>
<section>
<h3>The <code>WINDOW</code> Operator</h3>
<ol>
<li class="fragment">Define a Sequence (i.e., sort the relation)</li>
<li class="fragment">Compute all subsequences<ul>
<li>Fixed <u>Physical</u> Size: N records exactly.</li>
<li>Fixed <u>Logical</u> Size: Records within N units of time.</li>
</ul></li>
<li class="fragment">Compute an aggregate for each subsequence (one output row per subsequence)</li>
</ol>
</section>
<section>
<pre><code class="sql">
SELECT L.state, T.month,
       AVG(S.sales) OVER W AS movavg
FROM   Sales S, Times T, Locations L
WHERE  S.timeid = T.timeid
  AND  S.locid = L.locid
WINDOW W AS (
  PARTITION BY L.state
  ORDER BY T.month
  RANGE BETWEEN INTERVAL 1 MONTH PRECEDING
            AND INTERVAL 1 MONTH FOLLOWING
)
</code></pre>
</section>
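<section>
<p style="font-size: 70%;">A Python sketch of the window steps for a single partition, assuming in-memory (month, sales) pairs with made-up values. <code>physical_windows</code> takes exactly N records, <code>logical_windows</code> takes the records within N time units, and <code>moving_avg</code> mirrors the one-month-before to one-month-after range in the query above. All names are hypothetical.</p>

```python
def physical_windows(seq, n):
    """Fixed physical size: every run of exactly n consecutive records."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def logical_windows(seq, time_of, n):
    """Fixed logical size: for each record, the records within n time units."""
    return [
        [s for s in seq if abs(time_of(s) - time_of(r)) <= n]
        for r in seq
    ]

def moving_avg(seq, time_of, value_of, n):
    # Step 1: define a sequence by sorting the relation
    seq = sorted(seq, key=time_of)
    # Step 2: compute the subsequence around each record
    windows = logical_windows(seq, time_of, n)
    # Step 3: one aggregate (and one output row) per subsequence
    return [
        (time_of(r), sum(value_of(s) for s in w) / len(w))
        for r, w in zip(seq, windows)
    ]

# Hypothetical (month, sales) rows; month 4 is missing on purpose
sales = [(3, 30), (1, 10), (5, 50), (2, 20)]
movavg = moving_avg(sales, lambda r: r[0], lambda r: r[1], n=1)
```

<p style="font-size: 70%;">Because month 4 has no row, the logical window around month 5 contains only month 5, while a physical window of the same width would still reach back to earlier records.</p>
</section>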
<section>
<svg data-src="graphics/2019-03-08-Windows.svg" class="stretch" />
</section>
<section>
<h3>Summary</h3>
<dl>
<dt>Push vs Pull Data Flow</dt>
<dd>Push is a better fit because sources produce data at different rates.</dd>
<dt>Revisit Joins</dt>
<dd>Focus on ripple-style <code>WINDOW</code> joins.</dd>
<dt>Revisit Indexing</dt>
<dd>Linked Hash/Tree Indexes for efficient windowed indexing.</dd>
<dt>Revisit Aggregation</dt>
<dd>Sliding window aggregates.</dd>
</dl>
</section>
</section>