Website/src/teaching/cse-562/2021sp/slide/2021-03-02-Indexing1.erb

---
template: templates/cse4562_2021_slides.erb
title: "Indexing"
date: March 2, 2021
textbook: "Ch. 8.3-8.4, 14.1-14.2, 14.4"
---
<!-- 2019 by OK

  This section needs
    (1) a more comprehensive discussion of the ISAM and B+Trees.  The hastily imported keynote slides from yesteryears past are not doing a great job of conveying how the structures work.

    (2) More *stuff*.  The slides as-is are about 25-30 minutes out of 50.  We had a great discussion working through a few examples to round out the class... that might be a good use of the time.  A more thorough discussion of B+Trees with examples of Insert/Delete might also help.

    (3) Another thing that would be handy is to show an example of *building* an ISAM index.  The Sort => Scan => Left-Deep Build algorithm would help.

    (4) Finally... to re-emphasize, more B+Tree examples.
  -->

<section>
  <section>
    <h3>Today</h3>

    <p>Leveraging Organization</p>
  </section>
  <section>
    <table>
      <tr>
        <td>
          <img src="graphics/Books/DBSystemsHardcover.jpg" height="200px">
        </td>
        <td>
          <img src="graphics/Books/DBSystemsSoftcover.jpg" height="200px">
        </td>
      </tr>
      <tr class="fragment">
        <td>$150</td>
        <td>$50</td>
      </tr>
      <tr class="fragment">
        <td>Index<br/>ToC</td>
        <td>No Index<br/>ToC Summary</td>
      </tr>
    </table>
  </section>
  <section>
    <h3>Today's Focus</h3>

    <p style="margin: 100px;">
      $\sigma_C(R)$ <span style="margin: 50px">and</span> $(\ldots \bowtie_C R)$
    </p>
    <p class="fragment" style="font-size: 70%">(Finding records in a table <span class="fragment">really fast</span>)</p>
  </section>

  <section>
    <img src="2021-03-02/SortedList.svg">
    <p class="fragment">$\sigma_{R.A = 7}(R)$</p>
    <p class="fragment">Where is the data for key 7?</p>

  </section>

  <section>
    <p>Option 1: Linear search</p>

    <p class="fragment">$O(N)$ IOs</p>
  </section>

  <section>
    <h3>Initial Assumptions</h3>
    <p>Data is sorted on an attribute of interest (R.A)</p>
    <p>Updates are not relevant</p>

  </section>
</section>

<section>
  <section>
    <p>Option 2: Binary Search</p>
  </section>

  <section>
    <svg data-src="2021-03-02/BinarySearch.svg" width="700px"/>
  </section>

  <section>
    <p>$O(\log_2 N)$ IOs</p>

    <p class="fragment">Better, but still not ideal.</p>
  </section>
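  <section data-markdown>
    <textarea data-template>
A minimal sketch of Option 2, assuming the sorted data is a list of pages (each a sorted list of keys) and that fetching one page costs one IO. The name `binary_search_pages` and the page layout are illustrative assumptions, not part of the course code.

```python
def binary_search_pages(pages, key):
    """Find `key` in a list of sorted pages; return (found, pages_read)."""
    lo, hi = 0, len(pages) - 1
    pages_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]      # one simulated page IO
        pages_read += 1
        if key < page[0]:
            hi = mid - 1       # key can only be on an earlier page
        elif key > page[-1]:
            lo = mid + 1       # key can only be on a later page
        else:
            return key in page, pages_read
    return False, pages_read


# 1024 pages of 4 keys each, holding the keys 0..4095
pages = [list(range(i, i + 4)) for i in range(0, 4096, 4)]
found, reads = binary_search_pages(pages, 7)
```

With 1024 pages, a lookup reads at most 11 pages instead of up to 1024 for a linear search.
    </textarea>
  </section>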

  <section>
    <p>Idea: Precompute several layers of the decision tree and store them together.</p>
  </section>

  <section>
    <svg data-src="2021-03-02/ISAM-Motivation.svg" height="525px" width="675px"/>
  </section>

  <section>
    <h3>Fence Pointers</h3>
    <svg data-src="2021-03-02/ISAM-OnePage.svg" height="525px" width="675px"/>
  </section>
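  <section data-markdown>
    <textarea data-template>
A rough sketch of fence pointers, assuming one fence key per data page (the smallest key on that page) and that all fence keys fit on a single page. The helper names `build_fence_pointers` and `lookup` are hypothetical.

```python
from bisect import bisect_right


def build_fence_pointers(pages):
    """One fence key per data page: the smallest key stored on that page."""
    return [page[0] for page in pages]


def lookup(fences, pages, key):
    """Use the fence keys to pick the single data page worth reading."""
    i = bisect_right(fences, key) - 1   # last page whose first key is at most key
    if i < 0:
        return False                    # key is smaller than everything stored
    return key in pages[i]              # the only data-page read


pages = [[1, 3, 5], [7, 9, 11], [13, 15, 17]]
fences = build_fence_pointers(pages)
```

Under these assumptions a lookup costs one read of the fence page plus one read of a data page.
    </textarea>
  </section>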

  <section>
    <p>... but what if we need more than one page?</p>
    <p class="fragment">Add more indirection!</p>
  </section>

  <section>
    <h3>ISAM Trees</h3>
    <img src="graphics/2021-03-02-ISAM.png" height="500px">
  </section>
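  <section data-markdown>
    <textarea data-template>
One possible way to build an ISAM tree bottom-up, sketched under the assumptions that the input is non-empty and that every page (data or index) holds at most `fanout` keys. `build_isam` and `isam_lookup` are illustrative names; this is one plausible bulk build, not necessarily the one used in class.

```python
from bisect import bisect_right


def build_isam(keys, fanout):
    """Return a list of levels: levels[0] is the root, levels[-1] the data pages.

    A page at an inner level stores the smallest key of each of its children.
    """
    keys = sorted(keys)
    level = [keys[i:i + fanout] for i in range(0, len(keys), fanout)]
    levels = [level]
    while len(level) > 1:
        firsts = [page[0] for page in level]   # fence keys for the level below
        level = [firsts[i:i + fanout] for i in range(0, len(firsts), fanout)]
        levels.insert(0, level)
    return levels


def isam_lookup(levels, key, fanout):
    """Follow one page per level; return (found, pages_read)."""
    page_no = 0
    for depth, level in enumerate(levels):
        page = level[page_no]
        if depth == len(levels) - 1:
            return key in page, depth + 1
        slot = max(bisect_right(page, key) - 1, 0)
        page_no = page_no * fanout + slot      # child pages are stored in order
```

Each extra level multiplies the number of reachable data pages by `fanout`, so the number of pages read grows with the height of the tree rather than with the number of data pages.
    </textarea>
  </section>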
</section>

<section>

  <section>
    <p>Which of the following is better?</p>
  </section>

  <section>
    <img src="graphics/2021-03-02-Tree-Motivation.svg"/>
  </section>

  <section>
    <img data-src="graphics/2021-03-02-Tree-Unbalanced.svg" height="500px"/>
  </section>

  <section>
    <h3>Worst-Case Tree?</h3>
    <div class="fragment" style="margin: 100px">$O(N)$ with the tree laid out left/right-deep</div>
    <h3 class="fragment">Best-Case Tree?</h3>
    <div class="fragment" style="margin: 100px">$O(\log N)$ with the tree perfectly balanced</div>
  </section>

  <section>
    <p>It's important that the trees be balanced</p>
    <p class="fragment">... but what if we need to update the tree?</p>
  </section>

  <section>
    <h3>Challenges</h3>

    <ul>
      <li class="fragment">Finding space for new records</li>
      <li class="fragment">Keeping the tree balanced as new records are added</li>
    </ul>
  </section>

  <section>
    <p><b>Idea 1:</b> Reserve space for new records</p>
  </section>

  <section>
    <svg data-src="graphics/2021-03-02-BTree-Reserved.svg" />
  </section>

  <section>
    <p>Just maintaining open space won't work forever...</p>
  </section>

  <section>
    <h3>Rules of B+Trees</h3>

    <dl>
      <dt>Keep space open for insertions in inner/data nodes.</dt>
      <dd>‘Split’ nodes when they’re full</dd>

      <dt>Avoid under-using space</dt>
      <dd>‘Merge’ nodes when they’re under-filled</dd>
    </dl>

    <p class="fragment"><b>Maintain Invariant:</b> All Nodes ≥ 50% Full</p>
    <p class="fragment">(Exception: The Root)</p>
  </section>

  <section><img src="graphics/2021-03-02-InsertExample-1.png" height="300px"/></section>
  <section><img src="graphics/2021-03-02-InsertExample-2.png" height="300px"/></section>
  <section><img src="graphics/2021-03-02-InsertExample-3.png" height="300px"/></section>
  <section><img src="graphics/2021-03-02-InsertExample-4.png" height="300px"/></section>
  <section><img src="graphics/2021-03-02-InsertExample-5.png" height="300px"/></section>
  <section><img src="graphics/2021-03-02-InsertExample-6.png" height="300px"/></section>
  <section><img src="graphics/2021-03-02-InsertExample-7.png" height="300px"/></section>
  <section><img src="graphics/2021-03-02-InsertExample-8.png" height="300px"/></section>
  <section><p>Deletions reverse this process: merge nodes that drop below 50% full.</p></section>
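  <section data-markdown>
    <textarea data-template>
The split step from the insert example can be sketched for a single leaf, assuming a leaf is a sorted Python list with a fixed `capacity`. The function name `insert_into_leaf` is hypothetical, and updating the parent node is left out.

```python
from bisect import insort


def insert_into_leaf(leaf, key, capacity):
    """Insert `key` into the sorted list `leaf`.

    Returns (left, right, separator). Without overflow, right and separator
    are None. On overflow the leaf is split in half, and `separator` (the
    smallest key of the new right leaf) must be added to the parent.
    """
    insort(leaf, key)
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = len(leaf) // 2
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]
```

Splitting an overfull leaf in half leaves both halves at least 50% full, which preserves the invariant.
    </textarea>
  </section>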
</section>

<section>
  <section>
    <h3>Incorporating Trees into Queries</h3>
  </section>

  <section>
    <p style="margin: 100px;">
      $\sigma_C(R)$ <span style="margin: 50px">and</span> $(\ldots \bowtie_C R)$
    </p>
  </section>

  <section>
    <p>Original Query: $\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$</p>

    <p>Possible Implementations:<dl>
      <div>
        <dt>$\pi_A\left(\sigma_{B = 1 \wedge C < 3}(R)\right)$</dt>
        <dd class="fragment">Always works... but slow</dd>
      </div>
      <div class="fragment">
        <dt>$\pi_A\left(\sigma_{B = 1}( IndexScan(R,\;C < 3) ) \right)$</dt>
        <dd class="fragment">Requires a non-hash index on $C$</dd>
      </div>
      <div class="fragment">
        <dt>$\pi_A\left(\sigma_{C < 3}( IndexScan(R,\;B=1) ) \right)$</dt>
        <dd class="fragment">Requires any index on $B$</dd>
      </div>
      <div class="fragment">
        <dt>$\pi_A\left( IndexScan(R,\;B = 1, C < 3) \right)$</dt>
        <dd class="fragment">Requires a non-hash index on $(B, C)$</dd>
      </div>
    </dl></p>
  </section>

  <section>
    <h3>Lexical Sort (Non-Hash Only)</h3>

    <p>Sort data on $(A, B, C, \ldots)$</p>
    <p>First sort on $A$, $B$ is a tiebreaker for $A$,<br/> $C$ is a tiebreaker for $B$, etc...</p>

    <dl>
      <div class="fragment">
        <dt>All of the $A$ values are adjacent.</dt>
        <dd>Supports $\sigma_{A = a}$ or $\sigma_{A \geq b}$</dd>
      </div>
      <div class="fragment">
        <dt>For a specific $A$, all of the $B$ values are adjacent</dt>
        <dd>Supports $\sigma_{A = a \wedge B = b}$ or $\sigma_{A = a \wedge B \geq b}$</dd>
      </div>
      <div class="fragment">
        <dt>For a specific $(A,B)$, all of the $C$ values are adjacent</dt>
        <dd>Supports $\sigma_{A = a \wedge B = b \wedge C = c}$ or $\sigma_{A = a \wedge B = b \wedge C \geq c}$</dd>
      </div>
      <dt class="fragment">...</dt>
    </dl>

  </section>
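  <section data-markdown>
    <textarea data-template>
A small sketch of why a lexicographic sort supports these predicates, assuming rows are Python tuples `(A, B, C)` kept in sorted order. The helpers `first_index` and `prefix_scan` are hypothetical names.

```python
def first_index(rows, prefix, strictly_greater):
    """Smallest i with rows[i][:n] >= prefix (or > prefix), by binary search."""
    n = len(prefix)
    lo, hi = 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        head = rows[mid][:n]
        if head < prefix or (strictly_greater and head == prefix):
            lo = mid + 1
        else:
            hi = mid
    return lo


def prefix_scan(rows, prefix):
    """All rows whose leading attributes equal `prefix`: one contiguous run."""
    return rows[first_index(rows, prefix, False):first_index(rows, prefix, True)]


rows = sorted([(1, 5, 'x'), (2, 1, 'y'), (1, 2, 'z'), (2, 3, 'w'), (1, 2, 'a')])
a_is_1 = prefix_scan(rows, (1,))                 # A = 1
a1_b2 = prefix_scan(rows, (1, 2))                # A = 1 and B = 2
a1_b_ge_3 = rows[first_index(rows, (1, 3), False):
                 first_index(rows, (1,), True)]  # A = 1 and B >= 3
```

Each supported predicate maps to a single contiguous slice of the sorted rows. A predicate on B alone does not, because rows with the same B value are scattered across different A values.
    </textarea>
  </section>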

  <section>
    <h3>For a query $\sigma_{c_1 \wedge \ldots \wedge c_N}(R)$</h3>
    <ol>
      <li class="fragment">For every $c_i \equiv (A = a)$: Do you have any index on $A$?</li>
      <li class="fragment">For every $c_i \in \{\; (A \geq a), (A > a), (A \leq a), (A < a)\;\}$: Do you have a tree index on $A$?</li>
      <li class="fragment">For every pair $c_i, c_j$: Do you have an appropriate multi-attribute index?</li>
      <li class="fragment">A simple table scan is also an option</li>
    </ol>
    <p class="fragment">Which one do we pick?</p>
    <p class="fragment">(You need to know the cost of each plan)</p>
  </section>

  <section>
    <p>These are called "Access Paths"</p>
  </section>

  <section>
    <h3>Strategies for Implementing $(\ldots \bowtie_{c} S)$</h3>

    <dl>
      <dt>Sort/Merge Join</dt>
      <dd>Sort all of the data upfront, then scan over both sides.</dd>

      <dt>In-Memory Index Join (1-pass Hash; Hash Join)</dt>
      <dd>Build an in-memory index on one table, scan the other.</dd>

      <dt>Partition Join (2-pass Hash; External Hash Join)</dt>
      <dd>Partition both sides so that tuples don't join across partitions.</dd>

      <dt class="fragment" data-fragment-index="1">Index Nested Loop Join</dt>
      <dd class="fragment" data-fragment-index="1">Use an <i>existing</i> index instead of building one.</dd>
    </dl>
  </section>

  <section>
    <h3>Index Nested Loop Join</h3>

    To compute $R \bowtie_{S.B > R.A} S$ with a tree index on $S.B$

    <ol>
      <li>Read one row of $R$</li>
      <li>Get the value of $a = R.A$</li>
      <li>Start index scan on $S.B > a$</li>
      <li>Return all rows from the index scan</li>
      <li>Read the next row of $R$ and repeat</li>
    </ol>
  </section>

  <section>
    <h3>Index Nested Loop Join</h3>

    To compute $R \bowtie_{S.B\;[\theta]\;R.A} S$ with an index on $S.B$

    <ol>
      <li>Read one row of $R$</li>
      <li>Get the value of $a = R.A$</li>
      <li>Start index scan on $S.B\;[\theta]\;a$</li>
      <li>Return all rows from the index scan</li>
      <li>Read the next row of $R$ and repeat</li>
    </ol>
  </section>
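  <section data-markdown>
    <textarea data-template>
The five steps above can be sketched for the S.B > R.A case, assuming a list of S rows sorted on B stands in for an existing tree index. The function name and the tuple layouts are illustrative assumptions.

```python
from bisect import bisect_right


def index_nested_loop_join(r_rows, s_index):
    """Yield (r, s) pairs with s.B > r.A.

    r_rows:  iterable of (A, payload) tuples
    s_index: list of (B, payload) tuples sorted on B (stand-in for a tree index)
    """
    s_keys = [b for b, _ in s_index]
    for r in r_rows:                      # 1. read one row of R
        a = r[0]                          # 2. get the value a = R.A
        start = bisect_right(s_keys, a)   # 3. start the index scan at S.B > a
        for s in s_index[start:]:         # 4. return all rows from the scan
            yield (r, s)
        # 5. continue with the next row of R
```

No index is built here: the cost is one index probe per row of R, plus the matching rows of S.
    </textarea>
  </section>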
</section>

<section>

  <section>
    <p>What if we need multiple sort orders?</p>
  </section>

  <section>
    <h3>Data Organization</h3>
    <img src="2021-03-02/PrimaryVsSecondary.png" />
  </section>

  <section>
    <h3>Data Organization</h3>
    <img src="2021-03-02/Index-Types.svg" />
  </section>

  <section>
    <h3>Data Organization</h3>

    <dl>
      <div>
        <dt>Unordered Heap</dt>
        <dd>$O(N)$ reads.</dd>
      </div>

      <div>
        <dt>Sorted List</dt>
        <dd>$O(\log_2 N)$ <b>random</b> reads for <b>some</b> queries.</dd>
      </div>

      <div>
        <dt>Clustered (Primary) Index</dt>
        <dd>$O(\ll N)$ <b>sequential</b> reads for <b>some</b> queries.</dd>
      </div>

      <div>
        <dt>(Secondary) Index</dt>
        <dd>$O(\ll N)$ <b>random</b> reads for <b>some</b> queries.</dd>
      </div>

    </dl>
  </section>
</section>


<section>
  <section>
    <h3>Hash Indexes</h3>

    <p style="margin-top: 100px; text-align: left;">
      A hash function $h(k)$ is ...
      <dl>
        <dt>... deterministic</dt>
        <dd>The same $k$ always produces the same hash value.</dd>

        <dt>... (pseudo-)random</dt>
        <dd>Different $k$s are unlikely to have the same hash value.</dd>
      </dl>
    </p>
    <p class="fragment">$h(k)\mod N$ gives you a (pseudo-)random number in $[0, N)$</p>
  </section>

  <section>
    <svg data-src="graphics/2021-03-02-HashTable.svg" class="stretch"/>
  </section>

  <section>
    <h3>Problems</h3>
    <dl>
      <dt>$N$ is too small</dt>
      <dd>Too many overflow pages (slower reads).</dd>
      <dt>$N$ is too big</dt>
      <dd>Too many normal pages (wasted space).</dd>
    </dl>
  </section>
</section>

<section>
  <section>
    <p><b>Idea:</b> Resize the structure as needed</p>
  </section>

  <section>
    <p>To keep things simple, let's use $$h(k) = k$$</p>
    <p class="fragment" style="font-size: 70%">(you wouldn't actually do this in practice)</p>
  </section>

  <section>
    <svg data-src="graphics/2021-03-02-HashResize-Naive.svg" class="stretch"/>
  </section>

  <section>
    <p>Changing hash functions reallocates everything <b>randomly</b></p>

    <p class="fragment">Need to keep the entire source and hash table in memory!</p>
  </section>

  <section>
    $$h(k) \mod N$$
    vs
    $$h(k) \mod 2N$$
  </section>

  <section>
    <p>if $h(k) \equiv x \pmod{N}$</p>
    <p>then</p>
    <p>$h(k) \equiv$ either $x$ or $x + N \pmod{2N}$</p>

    <p class="fragment">Each key is moved (or not) to precisely one of two buckets in the resized hash table.</p>
    <p class="fragment">Never need more than 3 pages in memory at once.</p>
  </section>
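  <section data-markdown>
    <textarea data-template>
A quick sketch of what doubling does to one bucket, assuming the toy hash function h(k) = k used earlier and non-negative integer keys. `split_bucket` is a hypothetical helper name.

```python
def h(k):
    """Toy hash function from the example above; not for real use."""
    return k


def split_bucket(bucket, x, n):
    """Redistribute bucket x of an n-bucket table into a 2n-bucket table.

    Returns (stay, move): keys that remain in bucket x and keys that
    move to bucket x + n.
    """
    stay, move = [], []
    for k in bucket:
        if h(k) % (2 * n) == x:
            stay.append(k)
        else:
            move.append(k)   # here h(k) % (2 * n) == x + n
    return stay, move


n = 4
bucket1 = [k for k in range(32) if h(k) % n == 1]
stay, move = split_bucket(bucket1, 1, n)
```

Only the old bucket and its two possible targets are touched, which is where the three-page bound comes from.
    </textarea>
  </section>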

  <section>
    <p>Changing sizes still requires reading everything!</p>
    <p class="fragment"><b>Idea:</b> Only redistribute buckets that are too big</p>
  </section>

  <section>
    <p>Add a directory (a level of indirection)</p>
    <ul>
      <li>$N$ hash buckets = $N$ directory entries<br/>(but $\leq N$ actual pages)</li>
      <li>Directory entries point to actual pages on disk.</li>
      <li>Multiple directory entries can point to the same page.</li>
      <li>When a page fills up, it (and its directory entries) split.</li>
    </ul>
  </section>

  <section>
    <svg data-src="graphics/2021-03-02-HashResize-Dynamic.svg" class="stretch" />
  </section>

  <section>
    <h3>Dynamic Hashing</h3>
    <ul>
      <li>Add a level of indirection (Directory).</li>
      <li>A data page $i$ can store data with $h(k) \bmod 2^n = i$ for any $n$.</li>
      <li>Double the size of the directory (almost free) by duplicating existing entries.</li>
      <li>When bucket $i$ fills up, split on the next power of 2.</li>
      <li>Can also merge buckets/halve the directory size. </li>
    </ul>
  </section>
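  <section data-markdown>
    <textarea data-template>
The rules above can be sketched as a tiny in-memory table, assuming h(k) = k, non-negative integer keys, and pages modeled as `[local_depth, keys]` lists. The class name `ExtendibleHash` and its layout are illustrative assumptions, and merging is left out.

```python
class ExtendibleHash:
    """Directory-based dynamic hashing sketch with h(k) = k."""

    def __init__(self, page_capacity=2):
        self.capacity = page_capacity
        self.global_depth = 0
        # The directory has 2**global_depth entries; entries may share a page.
        self.directory = [[0, []]]

    def _page(self, key):
        return self.directory[key % len(self.directory)]

    def contains(self, key):
        return key in self._page(key)[1]

    def insert(self, key):
        page = self._page(key)
        if key in page[1]:
            return                        # ignore duplicates
        while len(page[1]) >= self.capacity:
            self._split(page)
            page = self._page(key)
        page[1].append(key)

    def _split(self, page):
        if page[0] == self.global_depth:
            # Double the directory by duplicating the existing entries.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 1 << page[0]                # the next power of 2 for this page
        low, high = [page[0] + 1, []], [page[0] + 1, []]
        for k in page[1]:
            (high if k & bit else low)[1].append(k)
        for i, entry in enumerate(self.directory):
            if entry is page:             # repoint only this page's entries
                self.directory[i] = high if i & bit else low
```

Doubling copies only directory entries, and a split rewrites just the one page that overflowed.
    </textarea>
  </section>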
</section>

<section>
  <p>Next time: LSM Trees and CDF-Indexing</p>
</section>