---
template: templates/cse4562_2021_slides.erb
title: "Indexing"
date: March 2, 2021
textbook: "Ch. 8.3-8.4, 14.1-14.2, 14.4"
---
<!-- 2019 by OK
This section needs
(1) a more comprehensive discussion of the ISAM and B+Trees. The hastily imported keynote slides from yesteryears past are not doing a great job of conveying how the structures work.
(2) More *stuff*. The slides as-is are about 25-30 minutes out of 50. We had a great discussion working through a few examples to round out the class... that might be a good use of the time. A more thorough discussion of B+Trees with examples of Insert/Delete might also help.
(3) Another thing that would be handy is to show an example of *building* an ISAM index. The Sort => Scan => Left-Deep Build algorithm would help.
(4) Finally... to re-emphasize, more B+Tree examples.
-->
<section>
<section>
<h3>Today</h3>
<p>Leveraging Organization</p>
</section>
<section>
<table>
<tr>
<td>
<img src="graphics/Books/DBSystemsHardcover.jpg" height="200px">
</td>
<td>
<img src="graphics/Books/DBSystemsSoftcover.jpg" height="200px">
</td>
</tr>
<tr class="fragment">
<td>$150</td>
<td>$50</td>
</tr>
<tr class="fragment">
<td>Index<br/>ToC</td>
<td>No Index<br/>ToC Summary</td>
</tr>
</table>
</section>
<section>
<h3>Today's Focus</h3>
<p style="margin: 100px;">
$\sigma_C(R)$ <span style="margin: 50px">and</span> $(\ldots \bowtie_C R)$
</p>
<p class="fragment" style="font-size: 70%">(Finding records in a table <span class="fragment">really fast</span>)</p>
</section>
<section>
<img src="2021-03-02/SortedList.svg">
<p class="fragment">$\sigma_{R.A = 7}(R)$</p>
<p class="fragment">Where is the data for key 7?</p>
</section>
<section>
<p>Option 1: Linear search</p>
<p class="fragment">$O(N)$ IOs</p>
</section>
<section>
<h3>Initial Assumptions</h3>
<p>Data is sorted on an attribute of interest (R.A)</p>
<p>Updates are not relevant</p>
<p></p>
</section>
</section>
<section>
<section>
<p>Option 2: Binary Search</p>
</section>
<section>
<svg data-src="2021-03-02/BinarySearch.svg" width="700px"/>
</section>
<section>
<p>$O(\log_2 N)$ IOs</p>
<p class="fragment">Better, but still not ideal.</p>
</section>
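<section data-markdown><textarea data-template>
A minimal sketch of Option 2, assuming each "page" is just a sorted Python list and that touching a page counts as one IO. The names `binary_search_pages`, `pages`, and `reads` are made up for illustration.

```python
def binary_search_pages(pages, key):
    """Return (page_index, reads) for the page whose key range covers `key`.

    `pages` is a list of non-empty sorted lists standing in for on-disk
    pages. Every access to pages[i] is counted as one simulated IO.
    """
    lo, hi = 0, len(pages) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]          # one simulated page read
        reads += 1
        if key < page[0]:
            hi = mid - 1           # key can only be in an earlier page
        elif key > page[-1]:
            lo = mid + 1           # key can only be in a later page
        else:
            return mid, reads      # key falls inside this page's range
    return None, reads             # no page covers the key


# Hypothetical data: 1024 pages of 4 keys each (keys 0..4095).
pages = [list(range(i * 4, i * 4 + 4)) for i in range(1024)]
page_index, reads = binary_search_pages(pages, 7)
```

With 1024 pages the loop touches roughly log2(1024) = 10 pages instead of up to 1024, but each of those reads lands on a different, non-adjacent page.
</textarea></section>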
<section>
<p>Idea: Precompute several layers of the decision tree and store them together.</p>
</section>
<section>
<svg data-src="2021-03-02/ISAM-Motivation.svg" height="525px" width="675px"/>
</section>
<section>
<h3>Fence Pointers</h3>
<svg data-src="2021-03-02/ISAM-OnePage.svg" height="525px" width="675px"/>
</section>
<section>
<p>... but what if we need more than one page?</p>
<p class="fragment">Add more indirection!</p>
</section>
<section>
<h3>ISAM Trees</h3>
<img src="graphics/2021-03-02-ISAM.png" height="500px">
</section>
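<section data-markdown><textarea data-template>
One way to picture the ISAM tree: a rough sketch, assuming in-memory lists stand in for pages and a made-up `FANOUT` of 4 fence pointers per index page. `build_isam` and `isam_lookup` are illustrative names, not part of any library.

```python
from bisect import bisect_right

FANOUT = 4  # hypothetical number of fence pointers per index page


def build_isam(data_pages):
    """Build index levels bottom-up over already-sorted data pages.

    Each level is a list of index pages; an index page is a list of
    (lowest_key, child_position) pairs. The root level comes last.
    """
    levels = []
    entries = [(page[0], i) for i, page in enumerate(data_pages)]
    while True:
        level = [entries[i:i + FANOUT] for i in range(0, len(entries), FANOUT)]
        levels.append(level)
        if len(level) == 1:        # a single page: this is the root
            return levels
        entries = [(page[0][0], i) for i, page in enumerate(level)]


def isam_lookup(levels, data_pages, key):
    """Return (found, page_reads): one read per index level plus one data page."""
    pos, reads = 0, 0
    for level in reversed(levels):             # walk from the root down
        index_page = level[pos]
        reads += 1
        fences = [low for low, _ in index_page]
        slot = max(bisect_right(fences, key) - 1, 0)
        pos = index_page[slot][1]              # follow the fence pointer
    reads += 1                                 # read the data page itself
    return key in data_pages[pos], reads


# Hypothetical data: 64 data pages of 4 keys each (keys 0..255).
data_pages = [list(range(i * 4, i * 4 + 4)) for i in range(64)]
levels = build_isam(data_pages)
found, reads = isam_lookup(levels, data_pages, 7)
```

Here 64 data pages need three index levels, so a lookup reads 4 pages in total; a larger fanout flattens the tree further.
</textarea></section>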
</section>
<section>
<section>
<p>Which of the following is better?</p>
</section>
<section>
<img src="graphics/2021-03-02-Tree-Motivation.svg"/>
</section>
<section>
<img data-src="graphics/2021-03-02-Tree-Unbalanced.svg" height="500px"/>
</section>
<section>
<h3>Worst-Case Tree?</h3>
<div class="fragment" style="margin: 100px">$O(N)$ with the tree laid out left/right-deep</div>
<h3 class="fragment">Best-Case Tree?</h3>
<div class="fragment" style="margin: 100px">$O(\log N)$ with the tree perfectly balanced</div>
</section>
<section>
<p>It's important that the trees be balanced</p>
<p class="fragment">... but what if we need to update the tree?</p>
</section>
<section>
<h3>Challenges</h3>
<ul>
<li class="fragment">Finding space for new records</li>
<li class="fragment">Keeping the tree balanced as new records are added</li>
</ul>
</section>
<section>
<p><b>Idea 1:</b> Reserve space for new records</p>
</section>
<section>
<svg data-src="graphics/2021-03-02-BTree-Reserved.svg" />
</section>
<section>
<p>Just maintaining open space won't work forever...</p>
</section>
<section>
<h3>Rules of B+Trees</h3>
<dl>
<dt>Keep space open for insertions in inner/data nodes.</dt>
<dd>Split nodes when they're full</dd>
<dt>Avoid under-using space</dt>
<dd>Merge nodes when they're under-filled</dd>
</dl>
<p class="fragment"><b>Maintain Invariant:</b> All Nodes ≥ 50% Full</p>
<p class="fragment">(Exception: The Root)</p>
</section>
<section><img src="graphics/2021-03-02-InsertExample-1.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-2.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-3.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-4.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-5.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-6.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-7.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-8.png" height="300px"/></section>
<section><p>Deletions reverse this process (at 50% fill).</p></section>
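<section data-markdown><textarea data-template>
The split rule can be sketched for a single leaf. This is only an illustration, assuming a made-up `CAPACITY` of 4 keys per leaf and ignoring the parent update, sibling pointers, and inner nodes that a full B+Tree insert also handles.

```python
from bisect import insort

CAPACITY = 4  # hypothetical maximum number of keys per leaf


def insert_into_leaf(leaf, key):
    """Insert `key` into a sorted leaf, splitting the leaf if it overflows.

    Returns (left, right, separator). When no split is needed, `right`
    and `separator` are None. After a split, `separator` is the key the
    parent would use to route lookups to `right`.
    """
    insort(leaf, key)                  # keep the leaf sorted
    if len(leaf) <= CAPACITY:
        return leaf, None, None
    mid = len(leaf) // 2               # each half keeps at least CAPACITY / 2 keys
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]


left, right, separator = insert_into_leaf([10, 20, 30, 40], 25)
```

Both halves end up at least 50% full, which is what keeps the invariant intact after every split.
</textarea></section>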
</section>
<section>
<section>
<h3>Incorporating Trees into Queries</h3>
</section>
<section>
<p style="margin: 100px;">
$\sigma_C(R)$ <span style="margin: 50px">and</span> $(\ldots \bowtie_C R)$
</p>
</section>
<section>
<p>Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$</p>
<p>Possible Implementations:<dl>
<div>
<dt>$\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$</dt>
<dd class="fragment">Always works... but slow</dd>
</div>
<div class="fragment">
<dt>$\pi_A\left(\sigma_{B = 1}( IndexScan(R,\;C < 3) ) \right)$</dt>
<dd class="fragment">Requires a non-hash index on $C$</dd>
</div>
<div class="fragment">
<dt>$\pi_A\left(\sigma_{C < 3}( IndexScan(R,\;B=1) ) \right)$</dt>
<dd class="fragment">Requires any index on $B$</dd>
</div>
<div class="fragment">
<dt>$\pi_A\left( IndexScan(R,\;B = 1, C < 3) \right)$</dt>
<dd class="fragment">Requires any index on $(B, C)$</dd>
</div>
</dl></p>
</section>
<section>
<h3>Lexical Sort (Non-Hash Only)</h3>
<p>Sort data on $(A, B, C, \ldots)$</p>
<p>First sort on $A$, $B$ is a tiebreaker for $A$,<br/> $C$ is a tiebreaker for $B$, etc...</p>
<dl>
<div class="fragment">
<dt>All of the $A$ values are adjacent.</dt>
<dd>Supports $\sigma_{A = a}$ or $\sigma_{A \geq b}$</dd>
</div>
<div class="fragment">
<dt>For a specific $A$, all of the $B$ values are adjacent</dt>
<dd>Supports $\sigma_{A = a \wedge B = b}$ or $\sigma_{A = a \wedge B \geq b}$</dd>
</div>
<div class="fragment">
<dt>For a specific $(A,B)$, all of the $C$ values are adjacent</dt>
<dd>Supports $\sigma_{A = a \wedge B = b \wedge C = c}$ or $\sigma_{A = a \wedge B = b \wedge C \geq c}$</dd>
</div>
<dt class="fragment">...</dt>
</dl>
</section>
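<section data-markdown><textarea data-template>
A small sketch of why lexical order supports these predicates, assuming rows are Python tuples `(A, B, C)` with integer `A` and `B`. Tuples compare lexically, so a sorted list behaves like the leaf level of a tree index on `(A, B, C)`. The function name and the sample rows are made up.

```python
from bisect import bisect_left

# Hypothetical rows (A, B, C), sorted lexically on (A, B, C).
rows = sorted([
    (2, 5, 'x'), (1, 9, 'y'), (2, 1, 'z'),
    (2, 7, 'w'), (3, 0, 'v'), (1, 2, 'u'),
])


def select_a_eq_b_geq(rows, a, b):
    """Return rows with A = a and B >= b as one contiguous slice.

    Assumes integer A, so (a + 1,) sorts before every row with A > a.
    """
    start = bisect_left(rows, (a, b))       # first row with (A, B) >= (a, b)
    end = bisect_left(rows, (a + 1,))       # first row with A > a
    return rows[start:end]


matches = select_a_eq_b_geq(rows, 2, 5)
```

The matching rows form one contiguous run. A predicate on `B` alone would not, because rows with the same `B` are scattered across different `A` values.
</textarea></section>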
<section>
<h3>For a query $\sigma_{c_1 \wedge \ldots \wedge c_N}(R)$</h3>
<ol>
<li class="fragment">For every $c_i \equiv (A = a)$: Do you have any index on $A$?</li>
<li class="fragment">For every $c_i \in \{\; (A \geq a), (A > a), (A \leq a), (A < a)\;\}$: Do you have a tree index on $A$?</li>
<li class="fragment">For every $c_i, c_j$, do you have an appropriate index?</li>
<li class="fragment">A simple table scan is also an option</li>
</ol>
<p class="fragment">Which one do we pick?</p>
<p class="fragment">(You need to know the cost of each plan)</p>
</section>
<section>
<p>These are called "Access Paths"</p>
</section>
<section>
<h3>Strategies for Implementing $(\ldots \bowtie_{c} S)$</h3>
<dl>
<dt>Sort/Merge Join</dt>
<dd>Sort all of the data upfront, then scan over both sides.</dd>
<dt>In-Memory Index Join (1-pass Hash; Hash Join)</dt>
<dd>Build an in-memory index on one table, scan the other.</dd>
<dt>Partition Join (2-pass Hash; External Hash Join)</dt>
<dd>Partition both sides so that tuples don't join across partitions.</dd>
<dt class="fragment" data-fragment-index="1">Index Nested Loop Join</dt>
<dd class="fragment" data-fragment-index="1">Use an <i>existing</i> index instead of building one.</dd>
</dl>
</section>
<section>
<h3>Index Nested Loop Join</h3>
To compute $R \bowtie_{S.B > R.A} S$ with an index on $S.B$
<ol>
<li>Read one row of $R$</li>
<li>Get the value of $a = R.A$</li>
<li>Start index scan on $S.B > a$</li>
<li>Return all rows from the index scan</li>
<li>Read the next row of $R$ and repeat</li>
</ol>
</section>
<section>
<h3>Index Nested Loop Join</h3>
To compute $R \bowtie_{S.B\;[\theta]\;R.A} S$ with an index on $S.B$
<ol>
<li>Read one row of $R$</li>
<li>Get the value of $a = R.A$</li>
<li>Start index scan on $S.B\;[\theta]\;a$</li>
<li>Return all rows from the index scan</li>
<li>Read the next row of $R$ and repeat</li>
</ol>
</section>
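<section data-markdown><textarea data-template>
The five steps can be sketched for the `S.B > R.A` case. This assumes rows are tuples whose first field is the join attribute and that a list of `S` rows sorted on `B` stands in for the existing index; all names are illustrative.

```python
from bisect import bisect_right


def index_nested_loop_join(r_rows, s_index):
    """Yield (r, s) pairs with s.B > r.A.

    `s_index` is a list of S rows sorted on B (field 0), standing in
    for an existing tree index on S.B.
    """
    s_keys = [s[0] for s in s_index]
    for r in r_rows:                         # 1. read one row of R
        a = r[0]                             # 2. get a = R.A
        start = bisect_right(s_keys, a)      # 3. start the scan at the first S.B > a
        for s in s_index[start:]:            # 4. return all rows from the scan
            yield (r, s)
        # 5. the outer loop moves on to the next row of R


# Hypothetical inputs.
R = [(1, 'r1'), (3, 'r2')]
S = [(1, 's1'), (2, 's2'), (3, 's3'), (4, 's4')]
result = list(index_nested_loop_join(R, S))
```

Other comparison operators only change where the scan starts and stops; equality, for example, would scan from `bisect_left` to `bisect_right`.
</textarea></section>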
</section>
<section>
<section>
<p>What if we need multiple sort orders?</p>
</section>
<section>
<h3>Data Organization</h3>
<img src="2021-03-02/PrimaryVsSecondary.png" />
</section>
<section>
<h3>Data Organization</h3>
<img src="2021-03-02/Index-Types.svg" />
</section>
<section>
<h3>Data Organization</h3>
<dl>
<div>
<dt>Unordered Heap</dt>
<dd>$O(N)$ reads.</dd>
</div>
<div>
<dt>Sorted List</dt>
<dd>$O(\log_2 N)$ <b>random</b> reads for <b>some</b> queries.</dd>
</div>
<div>
<dt>Clustered (Primary) Index</dt>
<dd>$O(\ll N)$ <b>sequential</b> reads for <b>some</b> queries.</dd>
</div>
<div>
<dt>(Secondary) Index</dt>
<dd>$O(\ll N)$ <b>random</b> reads for <b>some</b> queries.</dd>
</div>
</dl>
</section>
</section>
<section>
<section>
<h3>Hash Indexes</h3>
<p style="margin-top: 100px; text-align: left;">
A hash function $h(k)$ is ...
<dl>
<dt>... deterministic</dt>
<dd>The same $k$ always produces the same hash value.</dd>
<dt>... (pseudo-)random</dt>
<dd>Different $k$s are unlikely to have the same hash value.</dd>
</dl>
</p>
<p class="fragment">$h(k)\mod N$ gives you a random number in $[0, N)$</p>
</section>
<section>
<svg data-src="graphics/2021-03-02-HashTable.svg" class="stretch"/>
</section>
<section>
<h3>Problems</h3>
<dl>
<dt>$N$ is too small</dt>
<dd>Too many overflow pages (slower reads).</dd>
<dt>$N$ is too big</dt>
<dd>Too many normal pages (wasted space).</dd>
</dl>
</section>
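<section data-markdown><textarea data-template>
A back-of-the-envelope sketch of the trade-off, assuming a made-up `PAGE_CAPACITY` of 4 keys per page and the identity function as a stand-in hash (for illustration only). `pages_used` is a hypothetical helper.

```python
PAGE_CAPACITY = 4  # hypothetical number of keys that fit on one page


def h(k):
    """Identity hash, for illustration only."""
    return k


def pages_used(keys, n_buckets):
    """Return (total_pages, longest_chain) for a static hash table.

    Every bucket owns one page even when empty, plus as many overflow
    pages as its keys require.
    """
    counts = [0] * n_buckets
    for k in keys:
        counts[h(k) % n_buckets] += 1
    chains = [max(1, -(-c // PAGE_CAPACITY)) for c in counts]  # ceiling division
    return sum(chains), max(chains)


keys = range(64)
too_small = pages_used(keys, 2)     # long overflow chains
about_right = pages_used(keys, 16)  # one full page per bucket
too_big = pages_used(keys, 256)     # mostly empty pages
```

With 64 keys, 2 buckets give chains 8 pages long, 16 buckets give one full page each, and 256 buckets allocate 256 pages to hold data that fits in 16.
</textarea></section>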
</section>
<section>
<section>
<p><b>Idea:</b> Resize the structure as needed</p>
</section>
<section>
<p>To keep things simple, let's use $$h(k) = k$$</p>
<p class="fragment" style="font-size: 70%">(you wouldn't actually do this in practice)</p>
</section>
<section>
<svg data-src="graphics/2021-03-02-HashResize-Naive.svg" class="stretch"/>
</section>
<section>
<p>Changing hash functions reallocates everything <b>randomly</b></p>
<p class="fragment">Need to keep the entire source and hash table in memory!</p>
</section>
<section>
$$h(k) \mod N$$
vs
$$h(k) \mod 2N$$
</section>
<section>
<p>if $h(k) \mod N = x$</p>
<p>then</p>
<p>$h(k) \mod 2N = $ either $x$ or $x + N$</p>
<p class="fragment">Each key is moved (or not) to precisely one of two buckets in the resized hash table.</p>
<p class="fragment">Never need more than 3 pages in memory at once.</p>
</section>
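<section data-markdown><textarea data-template>
A quick sketch of what doubling does to a single bucket, again assuming the identity hash for readability. `split_bucket` is a hypothetical helper: every key that lived in bucket `x` of an `N`-bucket table lands in bucket `x` or bucket `x + N` of the `2N`-bucket table.

```python
def h(k):
    """Identity hash, for illustration only."""
    return k


def split_bucket(bucket_keys, x, n):
    """Split bucket `x` of an n-bucket table for a table with 2 * n buckets.

    Returns (stay, move): keys that remain in bucket x, and keys that
    move to bucket x + n.
    """
    stay, move = [], []
    for k in bucket_keys:
        if h(k) % (2 * n) == x:
            stay.append(k)
        else:
            move.append(k)
    return stay, move


# Bucket 1 of a 4-bucket table holds every key with h(k) % 4 == 1.
stay, move = split_bucket([1, 5, 9, 13], x=1, n=4)
```

Only one source page and two destination pages are involved at a time, which is where the three-page memory bound comes from.
</textarea></section>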
<section>
<p>Changing sizes still requires reading everything!</p>
<p class="fragment"><b>Idea:</b> Only redistribute buckets that are too big</p>
</section>
<section>
<p>Add a directory (a level of indirection)</p>
<ul>
<li>$N$ hash buckets = $N$ directory entries<br/>(but $\leq N$ actual pages)</li>
<li>Directory entries point to actual pages on disk.</li>
<li>Multiple directory entries can point to the same page.</li>
<li>When a page fills up, it (and its directory entries) split.</li>
</ul>
</section>
<section>
<svg data-src="graphics/2021-03-02-HashResize-Dynamic.svg" class="stretch" />
</section>
<section>
<h3>Dynamic Hashing</h3>
<ul>
<li>Add a level of indirection (Directory).</li>
<li>A data page $i$ can store data with $h(k) \bmod 2^n = i$ for any $n$.</li>
<li>Double the size of the directory (almost free) by duplicating existing entries.</li>
<li>When bucket $i$ fills up, split on the next power of 2.</li>
<li>Can also merge buckets/halve the directory size. </li>
</ul>
</section>
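<section data-markdown><textarea data-template>
A compact sketch of the directory scheme, assuming the identity hash, a made-up `PAGE_CAPACITY` of 2, and distinct keys (there are no overflow pages here, so inserting many keys with identical hash values would never terminate). The `Page` and `DynamicHash` classes are illustrative, not a reference implementation.

```python
PAGE_CAPACITY = 2  # hypothetical number of keys per page


def h(k):
    """Identity hash, for illustration only."""
    return k


class Page:
    def __init__(self, depth):
        self.depth = depth      # number of low-order hash bits this page covers
        self.keys = []


class DynamicHash:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Page(0)]      # 2 ** global_depth entries

    def _slot(self, key):
        return h(key) % (2 ** self.global_depth)

    def contains(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        page = self.directory[self._slot(key)]
        while len(page.keys) >= PAGE_CAPACITY:   # assumes distinct hash values
            self._split(page)
            page = self.directory[self._slot(key)]
        page.keys.append(key)

    def _split(self, page):
        if page.depth == self.global_depth:
            # Double the directory by duplicating entries; no data moves.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 2 ** page.depth                    # the next hash bit to split on
        page.depth += 1
        sibling = Page(page.depth)
        old_keys, page.keys = page.keys, []
        for k in old_keys:
            target = sibling if h(k) & bit else page
            target.keys.append(k)
        for i, p in enumerate(self.directory):
            if p is page and i & bit:
                self.directory[i] = sibling      # repoint half of the entries


table = DynamicHash()
for k in range(16):
    table.insert(k)
```

Only the page that overflowed is rewritten on a split; every other page stays where it is, and several directory entries may keep pointing at the same page.
</textarea></section>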
</section>
<section>
<p>Next time: LSM Trees and CDF-Indexing</p>
</section>