---
template: templates/cse4562_2021_slides.erb
title: "Indexing"
date: March 2, 2021
textbook: "Ch. 8.3-8.4, 14.1-14.2, 14.4"
---
<!-- 2019 by OK

This section needs
(1) a more comprehensive discussion of the ISAM and B+Trees. The hastily imported keynote slides from yesteryears past are not doing a great job of conveying how the structures work.

(2) More *stuff*. The slides as-is are about 25-30 minutes out of 50. We had a great discussion working through a few examples to round out the class... that might be a good use of the time. A more thorough discussion of B+Trees with examples of Insert/Delete might also help.

(3) Another thing that would be handy is to show an example of *building* an ISAM index. The Sort => Scan => Left-Deep Build algorithm would help.

(4) Finally... to re-emphasize, more B+Tree examples.
-->

<section>
<section>
<h3>Today</h3>

<p>Leveraging Organization</p>
</section>
<section>
<table>
<tr>
<td>
<img src="graphics/Books/DBSystemsHardcover.jpg" height="200px">
</td>
<td>
<img src="graphics/Books/DBSystemsSoftcover.jpg" height="200px">
</td>
</tr>
<tr class="fragment">
<td>$150</td>
<td>$50</td>
</tr>
<tr class="fragment">
<td>Index<br/>ToC</td>
<td>No Index<br/>ToC Summary</td>
</tr>
</table>
</section>
<section>
<h3>Today's Focus</h3>

<p style="margin: 100px;">
$\sigma_C(R)$ <span style="margin: 50px">and</span> $(\ldots \bowtie_C R)$
</p>
<p class="fragment" style="font-size: 70%">(Finding records in a table <span class="fragment">really fast</span>)</p>
</section>

<section>
<img src="2021-03-02/SortedList.svg">
<p class="fragment">$\sigma_{R.A = 7}(R)$</p>
<p class="fragment">Where is the data for key 7?</p>

</section>

<section>
<p>Option 1: Linear search</p>

<p class="fragment">$O(N)$ IOs</p>
</section>

<section>
<h3>Initial Assumptions</h3>
<p>Data is sorted on an attribute of interest (R.A)</p>
<p>Updates are not relevant</p>
<p></p>

</section>
</section>

<section>
<section>
<p>Option 2: Binary Search</p>
</section>

<section>
<svg data-src="2021-03-02/BinarySearch.svg" width="700px"/>
</section>

<section>
<p>$O(\log_2 N)$ IOs</p>

<p class="fragment">Better, but still not ideal.</p>
</section>
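<section>
<p style="font-size: 70%">A minimal sketch of Option 2, counting one IO per page touched. The page layout (a list of sorted key lists) and the name <code>binary_search_pages</code> are assumptions made for illustration, not part of any real system.</p>

```python
def binary_search_pages(pages, key):
    """Search sorted pages for `key`; return (found, page_reads)."""
    lo, hi = 0, len(pages) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]      # simulated page IO
        reads += 1
        if key < page[0]:
            hi = mid - 1       # key can only be on an earlier page
        elif key > page[-1]:
            lo = mid + 1       # key can only be on a later page
        else:
            return key in page, reads
    return False, reads


# Hypothetical data: 1024 pages, each holding two adjacent keys.
pages = [[k, k + 1] for k in range(0, 2048, 2)]
found, reads = binary_search_pages(pages, 7)
```

<p style="font-size: 70%">Each probe lands on a different, non-adjacent page, which is why these reads are random rather than sequential.</p>
</section>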

<section>
<p>Idea: Precompute several layers of the decision tree and store them together.</p>
</section>

<section>
<svg data-src="2021-03-02/ISAM-Motivation.svg" height="525px" width="675px"/>
</section>

<section>
<h3>Fence Pointers</h3>
<svg data-src="2021-03-02/ISAM-OnePage.svg" height="525px" width="675px"/>
</section>

<section>
<p>... but what if we need more than one page?</p>
<p class="fragment">Add more indirection!</p>
</section>

<section>
<h3>ISAM Trees</h3>
<img src="graphics/2021-03-02-ISAM.png" height="500px">
</section>
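<section>
<p style="font-size: 70%">One possible way to build and probe an ISAM-style index: sort, chop into pages, then stack levels of fence keys until a single root page remains. The fanout, the list-of-levels layout, and both function names are hypothetical choices for this sketch.</p>

```python
from bisect import bisect_right

FANOUT = 4  # hypothetical number of entries per page


def build_isam(keys, fanout=FANOUT):
    """Return the index as a list of levels, root level first."""
    keys = sorted(keys)
    # Leaf level: chop the sorted keys into fixed-size data pages.
    leaves = [keys[i:i + fanout] for i in range(0, len(keys), fanout)]
    levels = [leaves]
    # Each index page stores the smallest key (fence) of each child page.
    while len(levels[0]) > 1:
        fences = [page[0] for page in levels[0]]
        parents = [fences[i:i + fanout]
                   for i in range(0, len(fences), fanout)]
        levels.insert(0, parents)
    return levels


def isam_lookup(levels, key, fanout=FANOUT):
    """Read one page per level; return (found, pages_read)."""
    idx = 0  # position of the current page within its level
    for level in levels[:-1]:
        page = level[idx]
        # Follow the rightmost fence that is <= key (clamped for small keys).
        slot = max(bisect_right(page, key) - 1, 0)
        idx = idx * fanout + slot
    return key in levels[-1][idx], len(levels)


keys = list(range(0, 100, 3))   # hypothetical data: 34 keys
levels = build_isam(keys)
```

<p style="font-size: 70%">With 34 keys and a fanout of 4 this produces three levels, so every lookup reads three pages.</p>
</section>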
</section>

<section>

<section>
<p>Which of the following is better?</p>
</section>

<section>
<img src="graphics/2021-03-02-Tree-Motivation.svg"/>
</section>

<section>
<img data-src="graphics/2021-03-02-Tree-Unbalanced.svg" height="500px"/>
</section>

<section>
<h3>Worst-Case Tree?</h3>
<div class="fragment" style="margin: 100px">$O(N)$ with the tree laid out left/right-deep</div>
<h3 class="fragment">Best-Case Tree?</h3>
<div class="fragment" style="margin: 100px">$O(\log N)$ with the tree perfectly balanced</div>
</section>

<section>
<p>It's important that the trees be balanced</p>
<p class="fragment">... but what if we need to update the tree?</p>
</section>

<section>
<h3>Challenges</h3>

<ul>
<li class="fragment">Finding space for new records</li>
<li class="fragment">Keeping the tree balanced as new records are added</li>
</ul>
</section>

<section>
<p><b>Idea 1:</b> Reserve space for new records</p>
</section>

<section>
<svg data-src="graphics/2021-03-02-BTree-Reserved.svg" />
</section>

<section>
<p>Just maintaining open space won't work forever...</p>
</section>

<section>
<h3>Rules of B+Trees</h3>

<dl>
<dt>Keep space open for insertions in inner/data nodes.</dt>
<dd>‘Split’ nodes when they’re full</dd>

<dt>Avoid under-using space</dt>
<dd>‘Merge’ nodes when they’re under-filled</dd>
</dl>

<p class="fragment"><b>Maintain Invariant:</b> All Nodes ≥ 50% Full</p>
<p class="fragment">(Exception: The Root)</p>
</section>

<section><img src="graphics/2021-03-02-InsertExample-1.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-2.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-3.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-4.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-5.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-6.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-7.png" height="300px"/></section>
<section><img src="graphics/2021-03-02-InsertExample-8.png" height="300px"/></section>
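<section>
<p style="font-size: 70%">The split rule can be sketched in a few lines. This is an in-memory illustration only: <code>CAPACITY</code>, the <code>Node</code> class, and the helper names are assumptions, keys are assumed distinct, and deletion/merging is left out.</p>

```python
from bisect import bisect_right, insort

CAPACITY = 4  # hypothetical maximum number of keys per node


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children  # None marks a leaf (data) node


def insert(root, key):
    """Insert `key` and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    sep, right = split
    return Node([sep], [root, right])  # the tree only grows at the root


def _insert(node, key):
    """Return None, or (separator, new right sibling) if `node` split."""
    if node.children is None:
        insort(node.keys, key)
    else:
        slot = bisect_right(node.keys, key)
        split = _insert(node.children[slot], key)
        if split is not None:
            sep, right = split
            node.keys.insert(slot, sep)
            node.children.insert(slot + 1, right)
    if len(node.keys) <= CAPACITY:
        return None
    mid = len(node.keys) // 2
    if node.children is None:
        # Leaf split: the separator is copied up and stays in the right leaf.
        right = Node(node.keys[mid:])
        node.keys = node.keys[:mid]
        return right.keys[0], right
    # Inner split: the separator moves up and is dropped from both halves.
    sep = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return sep, right


def contains(node, key):
    while node.children is not None:
        node = node.children[bisect_right(node.keys, key)]
    return key in node.keys


def leaf_depths(node, depth=0):
    """Set of depths at which leaves occur (one value if balanced)."""
    if node.children is None:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))


keys = [(i * 37) % 101 for i in range(1, 61)]  # 60 distinct keys, scrambled
root = Node()
for k in keys:
    root = insert(root, k)
```

<p style="font-size: 70%">Because a split hands half of an over-full node to its new sibling, both halves start at least 50% full, and all leaves stay at the same depth.</p>
</section>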
<section><p>Deletions reverse this process: merge nodes that fall below 50% full.</p></section>
</section>

<section>
<section>
<h3>Incorporating Trees into Queries</h3>
</section>

<section>
<p style="margin: 100px;">
$\sigma_C(R)$ <span style="margin: 50px">and</span> $(\ldots \bowtie_C R)$
</p>
</section>

<section>
<p>Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$</p>

<p>Possible Implementations:</p>
<dl>
<div>
<dt>$\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$</dt>
<dd class="fragment">Always works... but slow</dd>
</div>
<div class="fragment">
<dt>$\pi_A\left(\sigma_{B = 1}( IndexScan(R,\;C < 3) ) \right)$</dt>
<dd class="fragment">Requires a non-hash index on $C$</dd>
</div>
<div class="fragment">
<dt>$\pi_A\left(\sigma_{C < 3}( IndexScan(R,\;B=1) ) \right)$</dt>
<dd class="fragment">Requires any index on $B$</dd>
</div>
<div class="fragment">
<dt>$\pi_A\left( IndexScan(R,\;B = 1, C < 3) \right)$</dt>
<dd class="fragment">Requires a non-hash index on $(B, C)$</dd>
</div>
</dl>
</section>

<section>
<h3>Lexical Sort (Non-Hash Only)</h3>

<p>Sort data on $(A, B, C, \ldots)$</p>
<p>First sort on $A$, $B$ is a tiebreaker for $A$,<br/> $C$ is a tiebreaker for $B$, etc...</p>

<dl>
<div class="fragment">
<dt>All of the $A$ values are adjacent.</dt>
<dd>Supports $\sigma_{A = a}$ or $\sigma_{A \geq b}$</dd>
</div>
<div class="fragment">
<dt>For a specific $A$, all of the $B$ values are adjacent</dt>
<dd>Supports $\sigma_{A = a \wedge B = b}$ or $\sigma_{A = a \wedge B \geq b}$</dd>
</div>
<div class="fragment">
<dt>For a specific $(A,B)$, all of the $C$ values are adjacent</dt>
<dd>Supports $\sigma_{A = a \wedge B = b \wedge C = c}$ or $\sigma_{A = a \wedge B = b \wedge C \geq c}$</dd>
</div>
<dt class="fragment">...</dt>
</dl>

</section>
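<section>
<p style="font-size: 70%">A small illustration of why the lexical order supports these predicates, using a sorted Python list of $(A, B)$ tuples as a stand-in for a tree index. The data and the function names are hypothetical; numeric attributes are assumed.</p>

```python
from bisect import bisect_left, bisect_right

INF = float("inf")

# Hypothetical rows of R, kept sorted on the composite key (A, B).
index = sorted([(1, 5), (2, 1), (2, 4), (2, 9), (3, 2), (3, 7)])


def scan_a_eq(index, a):
    """sigma_{A = a}: all entries with A = a form one contiguous run."""
    return index[bisect_left(index, (a,)):bisect_right(index, (a, INF))]


def scan_a_eq_b_ge(index, a, b):
    """sigma_{A = a AND B >= b}: a sub-run of the A = a run."""
    return index[bisect_left(index, (a, b)):bisect_right(index, (a, INF))]


def scan_b_eq(index, b):
    """sigma_{B = b}: matching entries are scattered, so check them all."""
    return [row for row in index if row[1] == b]
```

<p style="font-size: 70%">The first two scans touch only a contiguous slice; a predicate on $B$ alone gets no help from the $(A, B)$ order.</p>
</section>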

<section>
<h3>For a query $\sigma_{c_1 \wedge \ldots \wedge c_N}(R)$</h3>
<ol>
<li class="fragment">For every $c_i \equiv (A = a)$: Do you have any index on $A$?</li>
<li class="fragment">For every $c_i \in \{\; (A \geq a), (A > a), (A \leq a), (A < a)\;\}$: Do you have a tree index on $A$?</li>
<li class="fragment">For every pair $c_i, c_j$: Do you have an appropriate multi-attribute index?</li>
<li class="fragment">A simple table scan is also an option</li>
</ol>
<p class="fragment">Which one do we pick?</p>
<p class="fragment">(You need to know the cost of each plan)</p>
</section>

<section>
<p>These are called "Access Paths"</p>
</section>

<section>
<h3>Strategies for Implementing $(\ldots \bowtie_{c} S)$</h3>

<dl>
<dt>Sort/Merge Join</dt>
<dd>Sort all of the data upfront, then scan over both sides.</dd>

<dt>In-Memory Index Join (1-pass Hash; Hash Join)</dt>
<dd>Build an in-memory index on one table, scan the other.</dd>

<dt>Partition Join (2-pass Hash; External Hash Join)</dt>
<dd>Partition both sides so that tuples don't join across partitions.</dd>

<dt class="fragment" data-fragment-index="1">Index Nested Loop Join</dt>
<dd class="fragment" data-fragment-index="1">Use an <i>existing</i> index instead of building one.</dd>
</dl>
</section>

<section>
<h3>Index Nested Loop Join</h3>

To compute $R \bowtie_{S.B > R.A} S$ with an index on $S.B$

<ol>
<li>Read one row of $R$</li>
<li>Get the value of $a = R.A$</li>
<li>Start index scan on $S.B > a$</li>
<li>Return all rows from the index scan</li>
<li>Read the next row of $R$ and repeat</li>
</ol>
</section>

<section>
<h3>Index Nested Loop Join</h3>

To compute $R \bowtie_{S.B\;[\theta]\;R.A} S$ with an index on $S.B$

<ol>
<li>Read one row of $R$</li>
<li>Get the value of $a = R.A$</li>
<li>Start index scan on $S.B\;[\theta]\;a$</li>
<li>Return all rows from the index scan</li>
<li>Read the next row of $R$ and repeat</li>
</ol>
</section>
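<section>
<p style="font-size: 70%">The steps above can be sketched as follows. <code>SortedIndex</code> is a hypothetical stand-in for an existing tree index on $S.B$ (a sorted list probed with binary search); rows are plain dictionaries for illustration.</p>

```python
from bisect import bisect_left, bisect_right


class SortedIndex:
    """Stand-in for a tree index: rows kept sorted by the indexed key."""

    def __init__(self, rows, key):
        self.rows = sorted(rows, key=key)
        self.keys = [key(r) for r in self.rows]

    def scan(self, theta, a):
        """Return the rows whose key satisfies `key theta a`."""
        lo, hi = 0, len(self.rows)
        if theta == "=":
            lo, hi = bisect_left(self.keys, a), bisect_right(self.keys, a)
        elif theta == ">":
            lo = bisect_right(self.keys, a)
        elif theta == ">=":
            lo = bisect_left(self.keys, a)
        elif theta == "<":
            hi = bisect_left(self.keys, a)
        elif theta == "<=":
            hi = bisect_right(self.keys, a)
        else:
            raise ValueError(f"unsupported comparison: {theta}")
        return self.rows[lo:hi]


def index_nested_loop_join(R, s_index, theta, r_key):
    for r in R:                            # read one row of R
        a = r_key(r)                       # get the value a = R.A
        for s in s_index.scan(theta, a):   # index scan on S.B theta a
            yield r, s                     # return every matching pair


R = [{"A": 2}, {"A": 5}]
S = [{"B": 6}, {"B": 1}, {"B": 3}]
s_index = SortedIndex(S, key=lambda s: s["B"])
pairs = [(r["A"], s["B"])
         for r, s in index_nested_loop_join(R, s_index, ">", lambda r: r["A"])]
```
</section>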
</section>

<section>

<section>
<p>What if we need multiple sort orders?</p>
</section>

<section>
<h3>Data Organization</h3>
<img src="2021-03-02/PrimaryVsSecondary.png" />
</section>

<section>
<h3>Data Organization</h3>
<img src="2021-03-02/Index-Types.svg" />
</section>

<section>
<h3>Data Organization</h3>

<dl>
<div>
<dt>Unordered Heap</dt>
<dd>$O(N)$ reads.</dd>
</div>

<div>
<dt>Sorted List</dt>
<dd>$O(\log_2 N)$ <b>random</b> reads for <b>some</b> queries.</dd>
</div>

<div>
<dt>Clustered (Primary) Index</dt>
<dd>$O(\ll N)$ <b>sequential</b> reads for <b>some</b> queries.</dd>
</div>

<div>
<dt>(Secondary) Index</dt>
<dd>$O(\ll N)$ <b>random</b> reads for <b>some</b> queries.</dd>
</div>

</dl>
</section>
</section>


<section>
<section>
<h3>Hash Indexes</h3>

<p style="margin-top: 100px; text-align: left;">
A hash function $h(k)$ is ...
<dl>
<dt>... deterministic</dt>
<dd>The same $k$ always produces the same hash value.</dd>

<dt>... (pseudo-)random</dt>
<dd>Different $k$s are unlikely to have the same hash value.</dd>
</dl>
</p>
<p class="fragment">$h(k)\mod N$ gives you a random number in $[0, N)$</p>
</section>

<section>
<svg data-src="graphics/2021-03-02-HashTable.svg" class="stretch"/>
</section>

<section>
<h3>Problems</h3>
<dl>
<dt>$N$ is too small</dt>
<dd>Too many overflow pages (slower reads).</dd>
<dt>$N$ is too big</dt>
<dd>Too many normal pages (wasted space).</dd>
</dl>
</section>
</section>

<section>
<section>
<p><b>Idea:</b> Resize the structure as needed</p>
</section>

<section>
<p>To keep things simple, let's use $$h(k) = k$$</p>
<p class="fragment" style="font-size: 70%">(you wouldn't actually do this in practice)</p>
</section>

<section>
<svg data-src="graphics/2021-03-02-HashResize-Naive.svg" class="stretch"/>
</section>

<section>
<p>Changing hash functions reallocates everything <b>randomly</b></p>

<p class="fragment">Need to keep the entire source and hash table in memory!</p>
</section>

<section>
$$h(k) \mod N$$
vs
$$h(k) \mod 2N$$
</section>

<section>
<p>if $h(k) \mod N = x$</p>
<p>then</p>
<p>$h(k) \mod 2N$ is either $x$ or $x + N$</p>

<p class="fragment">Each key is moved (or not) to precisely one of two buckets in the resized hash table.</p>
<p class="fragment">Never need more than 3 pages in memory at once.</p>
</section>
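<section>
<p style="font-size: 70%">A quick check of this property, using the identity hash from the earlier slide. The function name <code>double_table</code> and the list-of-lists bucket layout are assumptions made for this sketch.</p>

```python
def h(k):
    return k  # identity hash, as on the earlier slide (not a realistic choice)


def double_table(buckets):
    """Resize from N to 2N buckets, processing one old bucket at a time."""
    n = len(buckets)
    resized = [[] for _ in range(2 * n)]
    for x, bucket in enumerate(buckets):
        for k in bucket:
            target = h(k) % (2 * n)
            # Keys from old bucket x can only land in bucket x or x + N.
            assert target in (x, x + n)
            resized[target].append(k)
    return resized


N = 4
buckets = [[k for k in range(20) if h(k) % N == x] for x in range(N)]
resized = double_table(buckets)
```

<p style="font-size: 70%">Only one source bucket and its two destinations are live at any time, which is where the 3-page bound comes from.</p>
</section>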

<section>
<p>Changing sizes still requires reading everything!</p>
<p class="fragment"><b>Idea:</b> Only redistribute buckets that are too big</p>
</section>

<section>
<p>Add a directory (a level of indirection)</p>
<ul>
<li>$N$ hash buckets = $N$ directory entries<br/>(but $\leq N$ actual pages)</li>
<li>Directory entries point to actual pages on disk.</li>
<li>Multiple directory entries can point to the same page.</li>
<li>When a page fills up, it (and its directory entries) split.</li>
</ul>
</section>

<section>
<svg data-src="graphics/2021-03-02-HashResize-Dynamic.svg" class="stretch" />
</section>

<section>
<h3>Dynamic Hashing</h3>
<ul>
<li>Add a level of indirection (Directory).</li>
<li>A data page $i$ can store data with $h(k) \mod 2^n = i$ for any $n$.</li>
<li>Double the size of the directory (almost free) by duplicating existing entries.</li>
<li>When bucket $i$ fills up, split on the next power of 2.</li>
<li>Can also merge buckets/halve the directory size.</li>
</ul>
</section>
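<section>
<p style="font-size: 70%">A minimal in-memory sketch of these rules, again with $h(k) = k$. The class names, <code>PAGE_CAPACITY</code>, and the per-page <code>depth</code> counter are assumptions for illustration; keys are assumed to be distinct integers, and merging is omitted.</p>

```python
PAGE_CAPACITY = 2  # hypothetical number of records per page


class Page:
    def __init__(self, depth):
        self.depth = depth  # keys on this page agree on their low `depth` bits
        self.keys = []


class DynamicHash:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Page(0)]  # always 2 ** global_depth entries

    def _slot(self, key):
        return key % len(self.directory)  # h(k) = k, as in the slides

    def contains(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        page = self.directory[self._slot(key)]
        while len(page.keys) >= PAGE_CAPACITY:
            self._split(page)
            page = self.directory[self._slot(key)]
        page.keys.append(key)

    def _split(self, page):
        if page.depth == self.global_depth:
            # Double the directory by duplicating the existing entries.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split on the next power of 2: this bit decides which side a key goes.
        bit = 1 << page.depth
        page.depth += 1
        sibling = Page(page.depth)
        sibling.keys = [k for k in page.keys if k & bit]
        page.keys = [k for k in page.keys if not k & bit]
        # Redirect the directory entries that now belong to the sibling.
        for i, entry in enumerate(self.directory):
            if entry is page and i & bit:
                self.directory[i] = sibling


table = DynamicHash()
inserted = [0, 4, 8, 12, 16, 1, 5]
for k in inserted:
    table.insert(k)
```

<p style="font-size: 70%">Only the over-full page is rewritten on a split; every other page is untouched, even when the directory doubles.</p>
</section>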
</section>

<section>
<p>Next time: LSM Trees and CDF-Indexing</p>
</section>