CSE 4/562 - Database Systems
<div class="slides">
<h3>CSE 4/562 Database Systems</h3>
<h5>February 23, 2018</h5>
<div style="display: inline-block; margin-right: 100px;">
<img src="graphics/2018-02-23-Index.png" height="300px" />
<div style="display: inline-block;">
<img src="graphics/2018-02-23-Data.png" height="300px" />
<p>Data, even if well organized, still requires you to page through a lot of it.</p>
<p>An index helps you quickly jump to specific data you might be interested in.</p>
<h3>Data Organization</h3>
<div class="fragment">
<dt>Unordered Heap</dt>
<dd>No organization at all. $O(N)$ reads.</dd>
<div class="fragment">
<dt>(Secondary) Index</dt>
<dd>Index structure over unorganized data. $O(\ll N)$ <b>random</b> reads for <b>some</b> queries.</dd>
<div class="fragment">
<dt>Clustered (Primary) Index</dt>
<dd>Index structure over clustered data. $O(\ll N)$ <b>sequential</b> reads for <b>some</b> queries.</dd>
<h3>Hash Indexes</h3>
<p style="margin-top: 100px; text-align: left;">
A hash function $h(k)$ is ...
<dt>... deterministic</dt>
<dd>The same $k$ always produces the same hash value.</dd>
<dt>... (pseudo-)random</dt>
<dd>Different $k$s are unlikely to have the same hash value.</dd>
<p class="fragment">Taking the modulus, $h(k)\%N$, gives you a (pseudo-)random bucket number in $[0, N)$</p>
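<p>A minimal sketch of the two properties above and of the modulus step. The helper names <code>h</code> and <code>bucket_of</code>, the choice of SHA-256 as the hash, and the sample keys are assumptions made for illustration only.</p>

```python
import hashlib

def h(key: str) -> int:
    """Deterministic hash: the same key always yields the same integer.

    SHA-256 is an arbitrary choice for this sketch; any well-mixed hash works.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def bucket_of(key: str, n_buckets: int) -> int:
    """Map a key to a bucket number in [0, n_buckets)."""
    return h(key) % n_buckets

# Distribute a few (made-up) keys over N = 4 buckets.
N = 4
buckets = [[] for _ in range(N)]
for key in ["alice", "bob", "carol", "dave", "eve"]:
    buckets[bucket_of(key, N)].append(key)
```

<p>Looking up a key repeats the same computation and then scans only the one bucket it lands in.</p>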
<svg data-src="graphics/2018-02-23-HashTable.svg" class="stretch"/>
<dt>$N$ is too small</dt>
<dd>Too many overflow pages (slower reads).</dd>
<dt>$N$ is too big</dt>
<dd>Too many normal pages (wasted space).</dd>
<p><b>Idea:</b> Resize the structure as needed</p>
<p>To keep things simple, let's use $$h(k) = k$$</p>
<p class="fragment" style="font-size: 70%">(you wouldn't actually do this in practice)</p>
<svg data-src="graphics/2018-02-23-HashResize-Naive.svg" class="stretch"/>
<dt class="fragment" data-fragment-index="1">Changing hash functions reallocates everything</dt>
<dd class="fragment" data-fragment-index="1">Only double/halve the size of the hash table</dd>
<dt class="fragment" data-fragment-index="2">Changing sizes still requires reading everything</dt>
<dd class="fragment" data-fragment-index="3"><b>Idea:</b> Only redistribute buckets that are too big</dd>
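<p>To see why doubling is gentler than an arbitrary resize, here is a small sketch using $h(k) = k$ as above. The key set and the helper name <code>bucket_of</code> are made up for illustration. When $N$ doubles, a key in bucket $i$ either stays in $i$ or moves to $i + N$; growing to $N + 1$ scatters keys with no such pattern.</p>

```python
def bucket_of(k: int, n: int) -> int:
    # h(k) = k, as on the slides (not a realistic hash function).
    return k % n

keys = [3, 8, 12, 17, 22, 25, 31, 40]  # illustrative keys
N = 4

# Keys whose bucket changes when the table grows from N to 2N buckets.
moved_on_doubling = [k for k in keys if bucket_of(k, N) != bucket_of(k, 2 * N)]

# Keys whose bucket changes when the table grows from N to N + 1 buckets.
moved_on_plus_one = [k for k in keys if bucket_of(k, N) != bucket_of(k, N + 1)]

# After doubling, every key is in its old bucket i or in bucket i + N.
stays_or_shifts = all(
    bucket_of(k, 2 * N) in (bucket_of(k, N), bucket_of(k, N) + N) for k in keys
)
```

<p>Because each old bucket feeds exactly two new buckets, a bucket can be split on its own without touching the others.</p>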
<svg data-src="graphics/2018-02-23-HashResize-Dynamic.svg" class="stretch" />
<h3>Dynamic Hashing</h3>
<li>Add a level of indirection (Directory).</li>
<li>A data page $i$ can store data with $h(k)\%2^n=i$ for any $n$.</li>
<li>Double the size of the directory (almost free) by duplicating existing entries.</li>
<li>When bucket $i$ fills up, split on the next power of 2.</li>
<li>Can also merge buckets/halve the directory size. </li>
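<p>The steps above can be sketched roughly as follows. This is an in-memory toy, not a paged implementation: the class names <code>Bucket</code> and <code>DynamicHashIndex</code>, the tiny bucket capacity, and the use of $h(k) = k$ are assumptions for illustration. It assumes distinct integer keys, has no overflow pages, and leaves out merging and directory halving.</p>

```python
class Bucket:
    def __init__(self, local_depth: int):
        # Every key k in this bucket agrees on its low `local_depth` bits of h(k).
        self.local_depth = local_depth
        self.keys = []

class DynamicHashIndex:
    def __init__(self, capacity: int = 2):
        self.capacity = capacity       # max keys per bucket (stand-in for a page)
        self.global_depth = 0          # directory has 2**global_depth entries
        self.directory = [Bucket(0)]   # the level of indirection

    @staticmethod
    def h(k: int) -> int:
        return k                       # h(k) = k, for illustration only

    def _slot(self, k: int) -> int:
        return self.h(k) % (1 << self.global_depth)

    def lookup(self, k: int) -> bool:
        return k in self.directory[self._slot(k)].keys

    def insert(self, k: int) -> None:
        # Assumes distinct keys; duplicates beyond capacity would split forever.
        while True:
            bucket = self.directory[self._slot(k)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(k)
                return
            self._split(bucket)

    def _split(self, bucket: Bucket) -> None:
        if bucket.local_depth == self.global_depth:
            # Double the directory by duplicating the existing entries.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split on the next power of 2: the bit just above the old local depth.
        bit = 1 << bucket.local_depth
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        old_keys, bucket.keys = bucket.keys, []
        for key in old_keys:
            (sibling if self.h(key) & bit else bucket).keys.append(key)
        # Repoint the directory entries that now belong to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = sibling

index = DynamicHashIndex(capacity=2)
for k in [0, 4, 8, 12, 1, 5, 3]:
    index.insert(k)
```

<p>Only the bucket that overflowed is ever rewritten; doubling the directory copies pointers, not data.</p>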
<h3>CDF-Based Indexing</h3>
<p class="fragment" style="margin-top: 100px;"><b>"The Case for Learned Index Structures"</b><br/>by Kraska, Beutel, Chi, Dean, Polyzotis</p>
<svg data-src="graphics/2018-02-23-CDF-Linear.svg"/>
<svg data-src="graphics/2018-02-23-CDF-LinearApprox.svg"/>
<h3>Cumulative Distribution Function (CDF)</h3>
<img src="graphics/2018-02-23-CDF-Plot.png" />
<p>$f(key) \mapsto position$</p>
<p class="fragment" style="font-size: 50%">(not exactly true, but close enough for today)</p>
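<p>A small sketch of why the CDF acts as a key-to-position map for sorted data. The key values are made up, and the sketch counts keys <i>strictly</i> below $k$ so that positions start at 0, which is one way to read the "not exactly true" caveat above.</p>

```python
from bisect import bisect_left

keys = sorted([5, 9, 14, 20, 33, 47, 51, 60])  # illustrative, distinct keys
N = len(keys)

def cdf(k: int) -> float:
    """Fraction of stored keys that are strictly less than k."""
    return bisect_left(keys, k) / N

def position(k: int) -> int:
    """Scale the CDF by N to recover the key's index in the sorted array."""
    return round(cdf(k) * N)
```

<p>Here the CDF is computed from the data itself, so nothing is saved yet; the point of a CDF index is to replace <code>cdf</code> with a compact function that approximates it.</p>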
<h3>Using CDFs to find records</h3>
<dt>Ideal: $f(k) = position$</dt>
<dd>$f$ encodes the <b>exact</b> location of a record</dd>
<dt class="fragment">OK: $f(k) \approx position$ <span class="fragment">($\left|f(k) - position\right| &lt; \epsilon$)</span></dt>
<dd class="fragment">$f$ gets you to within $\epsilon$ of the record's position</dd>
<dd class="fragment">Only need a local search over one (or so) leaf pages.</dd>
<p class="fragment"><b>Simplified Use Case:</b> Static data with "infinite" prep time.</p>
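<p>A minimal sketch of the "OK" case, assuming static, sorted, distinct keys. The model coefficients in <code>f</code> are hand-picked and hypothetical, and <code>EPS</code> and <code>lookup</code> are names invented for this example. The prep step measures the worst-case error once; each lookup then searches only the window around the prediction.</p>

```python
import math
from bisect import bisect_left

keys = [10, 12, 15, 21, 24, 30, 33, 39, 41, 48]  # static, sorted, illustrative

def f(k: float) -> float:
    # Hypothetical hand-picked linear model of key -> position.
    return 0.24 * k - 2.0

# Prep time: the largest gap between predicted and true position.
EPS = max(abs(f(k) - i) for i, k in enumerate(keys))

def lookup(k: int):
    """Return the index of k, or None, searching only within EPS of f(k)."""
    guess = f(k)
    lo = max(0, math.floor(guess - EPS))
    hi = min(len(keys), math.ceil(guess + EPS) + 1)
    i = bisect_left(keys, k, lo, hi)  # local search inside the window
    return i if i < hi and keys[i] == k else None
```

<p>The smaller the error bound, the narrower the window, which is why the quality of $f$ determines how many pages a lookup touches.</p>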
<h3>How to define $f$?</h3>
<li class="fragment">Linear ($f(k) = a\cdot k + b$)</li>
<li class="fragment">Polynomial ($f(k) = a\cdot k + b + c \cdot k^2 + \ldots$)</li>
<li class="fragment">Neural Network ($f(k) = $<img src="graphics/Clipart/magic-wand.png" height="100px" style="vertical-align: middle;">)</li>
<p>We have infinite prep time, so fit a (tiny) neural network to the CDF.</p>
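<p>A neural network does not fit in a slide-sized example, so this sketch swaps in the simplest option from the list: a linear model $f(k) = a\cdot k + b$ fit by ordinary least squares, followed by a measurement of its worst-case error. The function names <code>fit_linear</code> and <code>max_error</code> and both key sets are assumptions for illustration.</p>

```python
def fit_linear(keys):
    """Least-squares fit of position = a * key + b over sorted keys.

    Requires at least two distinct keys.
    """
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2  # mean of positions 0 .. n-1
    var_k = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
    a = cov / var_k
    b = mean_p - a * mean_k
    return a, b

def max_error(keys, a, b):
    """Worst-case |f(k) - position| over the stored keys."""
    return max(abs(a * k + b - i) for i, k in enumerate(keys))

uniform_keys = [0, 10, 20, 30]    # evenly spaced: a line fits exactly
skewed_keys = [1, 2, 3, 4, 100]   # one outlier: a line fits poorly

a_u, b_u = fit_linear(uniform_keys)
eps_uniform = max_error(uniform_keys, a_u, b_u)

a_s, b_s = fit_linear(skewed_keys)
eps_skewed = max_error(skewed_keys, a_s, b_s)
```

<p>The skewed key set ends up with a much larger error bound than the uniform one, which is the motivation for richer models when the key distribution is not close to linear.</p>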
<h3>Neural Networks</h3>
<dt class="fragment" data-fragment-index="1">Extremely Generalized Regression</dt>
<dd class="fragment" data-fragment-index="1">Essentially a really really really complex, fittable function with a lot of parameters.</dd>
<dt class="fragment" data-fragment-index="2">Captures Nonlinearities</dt>
<dd class="fragment" data-fragment-index="2">Most regressions can't handle discontinuous functions, which many key spaces have.</dd>
<dt class="fragment" data-fragment-index="3">No Branching</dt>
<dd class="fragment" data-fragment-index="3">Hard-to-predict <code>if</code> statements are <b>really</b> expensive on modern processors: a mispredicted branch stalls the pipeline.</dd>
<dd class="fragment" data-fragment-index="4">(Compare to B+Trees with $\log_2 N$ if statements)</dd>
<dt>Tree Indexes</dt>
<dd>$O(\log N)$ access, supports range queries, easy size changes.</dd>
<dt>Hash Indexes</dt>
<dd>$O(1)$ access, doesn't change size efficiently, only equality tests.</dd>
<dt>CDF Indexes</dt>
<dd>$O(1)$ access, supports range queries, static data only.</dd>
<p><b>Next Class:</b> Using Indexes</p>
