<h1>Query Evaluation</h1>
<h3>CSE 4/562 Database Systems</h3>
<h5>February 12, 2018</h5>
<h3>Query Evaluation Styles</h3>
<dt class="fragment highlight-grey" data-fragment-index="2">All-At-Once (Collections)</dt>
<dd class="fragment highlight-grey" data-fragment-index="2">Bottom-up, one operator at a time.</dd>
<dt>Volcano-Style (Iterators)</dt>
<dd>Operators "request" one tuple at a time from children.</dd>
<dt class="fragment highlight-grey" data-fragment-index="1">Push-Style (Buffers)</dt>
<dd class="fragment highlight-grey" data-fragment-index="1">Operators continuously produce/consume tuples.</dd>
<h3>Basic Mindset</h3>
<img src="graphics/2018-02-05-RA-Tree.svg" style="display: inline-block; vertical-align: middle;" />
<pre style="display: inline-block; vertical-align: middle; margin-left: 20px; width:550px;"><code class="python">
r = get_table("R")
s = get_table("S")
temp1 = apply_join(r, s, "R.B = S.B")
temp2 = apply_select(temp1, "S.C = 10")
result = apply_projection(temp2, "R.A")
<h3>Basic Mindset</h3>
<pre><code class="python">
def build_tree(operator):
    if """ operator is a base table """:
        return get_table(...)
    elif """ operator is a selection """:
        # Evaluate the child first, then filter its output
        return apply_select(build_tree(operator.child), operator.condition)
    elif """ handle remaining cases similarly """:
        ...
<p class="fragment" style="display: inline-block; vertical-align: middle; margin-right: 100px">
$$\sigma_{A \neq 3} R$$
<table style="display: inline-block; vertical-align: middle;">
<tr class="fragment highlight-grey"><td>3</td><td>4</td></tr>
<pre><code class="python">
def apply_select(input, condition):
    result = []
    for row in input:
        if condition(row):
            result += [row]
    return result
<p class="fragment">(All-At-Once)</p>
<p style="display: inline-block; vertical-align: middle; margin-right: 100px">
$$\sigma_{A \neq 3} R$$
<table style="display: inline-block; vertical-align: middle; font-size: 80%">
<tr class="fragment"><td colspan="2"><code>getNext()</code></td><td style="text-align: left"><code>for row in input:</code></td></tr>
<tr class="fragment"><td>1</td><td>2</td><td class="fragment" style="color: green; text-align: left;"><code style="margin-left: 30px;">return row;</code></td></tr>
<tr class="fragment"><td colspan="2"><code>getNext()</code></td><td style="text-align: left"><code>for row in input:</code></td></tr>
<tr class="fragment"><td>3</td><td>4</td><td class="fragment" style="color: red; text-align: left;"><span style="margin-left: 30px;">X</span></td></tr>
<tr class="fragment" ><td>5</td><td>6</td><td class="fragment" style="color: green; text-align: left;"><code style="margin-left: 30px;">return row;</code></td></tr>
<tr class="fragment"><td colspan="2"><code>getNext()</code></td><td style="text-align: left"><code>for row in input:</code></td></tr>
<tr class="fragment"><td colspan="2"><code>None</code></td><td class="fragment" style="color: red; text-align: left;"><code>return None;</code></td></tr>
<svg data-src="graphics/2018-02-12-Flow-Select.svg" />
<svg data-src="graphics/2018-02-12-Flow-Project.svg" />
<svg data-src="graphics/2018-02-12-Flow-Union.svg" />
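<p>One way the Volcano-style versions of selection, projection, and union could look. This is a sketch that assumes every operator exposes a <code>getNext()</code> method returning <code>None</code> once exhausted; the class names and the <code>drain</code> helper are hypothetical.</p>

```python
class Scan:
    """Leaf iterator over an in-memory list (stand-in for a base table)."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def getNext(self):
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row


class Select:
    def __init__(self, child, condition):
        self.child = child
        self.condition = condition

    def getNext(self):
        # Pull from the child until a row passes or the child runs out.
        while (row := self.child.getNext()) is not None:
            if self.condition(row):
                return row
        return None


class Project:
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def getNext(self):
        row = self.child.getNext()
        return None if row is None else tuple(row[i] for i in self.columns)


class Union:
    """Bag union: drain the first child, then the second."""

    def __init__(self, lhs, rhs):
        self.children = [lhs, rhs]

    def getNext(self):
        while self.children:
            row = self.children[0].getNext()
            if row is not None:
                return row
            self.children.pop(0)  # this child is exhausted
        return None


def drain(op):
    """Collect every row an operator produces (helper for the example)."""
    out = []
    while (row := op.getNext()) is not None:
        out.append(row)
    return out


selected = drain(Select(Scan([(1, 2), (3, 4), (5, 6)]), lambda row: row[0] != 3))
```

<p>No operator buffers more than one row at a time; a row is only produced when the parent asks for it.</p>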
<pre><code class="python">
def apply_cross(lhs, rhs):
    result = []
    for r in lhs:
        for s in rhs:
            result += [r + s]
    return result
<svg data-src="graphics/2018-02-12-Flow-Cross.svg" />
<p>What's the complexity of this cross-product algorithm?</p>
<p>... in terms of compute</p>
<p>... in terms of IOs</p>
<h3>Cross Product Problems</h3>
<dt>Need to scan the inner relation multiple times!</dt>
<dd class="fragment">Load data intelligently to mitigate expensive IOs</dd>
<dt>Every tuple needs to be paired with every other tuple!</dt>
<dd class="fragment">Exploit join conditions to minimize pairs of tuples</dd>
<h3>Preloading Data</h3>
<p class="fragment">Nested-Loop Join</p>
<pre><code class="python">
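def apply_cross(lhs, rhs):
    result = []
    # getNext() returns None once an iterator is exhausted
    while (r := lhs.getNext()) is not None:
        while (s := rhs.getNext()) is not None:
            result += [r + s]
        rhs.reset()  # hypothetical: restart the rhs iterator for the next r
    return result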
<h3>Nested-Loop Join</h3>
<svg data-src="graphics/2018-02-12-Join-NLJ.svg" />
<p><b>Problem</b>: We need to evaluate <code>rhs</code> iterator<br/> once per record in <code>lhs</code></p>
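<p>A small sketch that makes the repeated scan visible by counting row reads. The <code>CountingTable</code> wrapper and the sample rows are hypothetical, and counting rows is only a rough stand-in for counting IOs.</p>

```python
class CountingTable:
    """Hypothetical table wrapper that counts how many rows are read."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def scan(self):
        # Every call starts a fresh pass over the table.
        for row in self.rows:
            self.reads += 1
            yield row


def nested_loop_cross(lhs, rhs):
    result = []
    for r in lhs.scan():
        for s in rhs.scan():  # rhs is rescanned for every r
            result.append(r + s)
    return result


lhs = CountingTable([(1,), (2,), (3,)])
rhs = CountingTable([("a",), ("b",)])
pairs = nested_loop_cross(lhs, rhs)
```

<p>Here <code>lhs</code> is read once (3 rows), while <code>rhs</code> is read once per <code>lhs</code> row (3 &times; 2 = 6 rows).</p>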
<h3>Preloading Data</h3>
<p><b>Naive Solution</b>: Preload records from <code>rhs</code></p>
<pre><code class="python">
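def apply_cross(lhs, rhs):
    result = []
    rhs_preloaded = []
    # Read rhs exactly once and keep all of it in memory
    while (s := rhs.getNext()) is not None:
        rhs_preloaded += [s]
    while (r := lhs.getNext()) is not None:
        for s in rhs_preloaded:
            result += [r + s]
    return result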
<p class="fragment">Any problems with this?</p>
<h3>Preloading Data</h3>
<p><b>Better Solution</b>: Load both <code>lhs</code> and <code>rhs</code> records in blocks.</p>
<pre><code class="python">
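def apply_cross(lhs, rhs):
    result = []
    # take(100) is assumed to return an empty block once the input is exhausted
    while r_block := lhs.take(100):
        while s_block := rhs.take(100):
            for r in r_block:
                for s in s_block:
                    result += [r + s]
        rhs.reset()  # hypothetical: restart rhs for the next lhs block
    return result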
<h3>Block-Nested Loop Join</h3>
<svg data-src="graphics/2018-02-12-Join-BNLJ.svg" class="stretch" />
<p>How big should the blocks be?</p>
<p class="fragment">What is the IO complexity of the algorithm?</p>
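<p>A runnable sketch of the block-nested loop idea, counting how often <code>rhs</code> is scanned. It assumes in-memory lists as inputs and uses <code>itertools.islice</code> in place of the slides' <code>take</code>; the <code>blocks</code> helper and the sample sizes are illustrative.</p>

```python
from itertools import islice


def blocks(rows, size):
    """Yield consecutive lists of up to `size` rows."""
    it = iter(rows)
    while block := list(islice(it, size)):
        yield block


def block_nested_loop_cross(lhs, rhs, block_size):
    result = []
    rhs_scans = 0
    for r_block in blocks(lhs, block_size):
        rhs_scans += 1  # one full pass over rhs per lhs block
        for s_block in blocks(rhs, block_size):
            for r in r_block:
                for s in s_block:
                    result.append(r + s)
    return result, rhs_scans


lhs = [(i,) for i in range(250)]
rhs = [(j,) for j in range(10)]
result, rhs_scans = block_nested_loop_cross(lhs, rhs, block_size=100)
```

<p>In this simplified row-counting model, <code>rhs</code> is scanned once per <code>lhs</code> block rather than once per <code>lhs</code> row, so the number of passes over <code>rhs</code> drops from 250 to 3 here. Larger blocks mean fewer passes, at the cost of more memory.</p>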
<h3>Join Conditions</h3>
<svg data-src="graphics/2018-02-12-Join-Grid.svg" />
<p class="fragment"><b>Problem</b>: Naively, any tuple may match any other, so every pair has to be checked</p>
<h3>Join Conditions</h3>
<svg data-src="graphics/2018-02-12-Join-OrderGrid.svg" />
<p><b>Solution</b>: First organize the data</p>
<dt>Sort/Merge Join</dt>
<dd>Sort all of the data upfront, then scan over both sides.</dd>
<dt>In-Memory Index Join (1-pass Hash; Hash Join)</dt>
<dd>Build an in-memory index on one table, scan the other.</dd>
<dt>Partition Join (2-pass Hash; External Hash Join)</dt>
<dd>Partition both sides so that tuples don't join across partitions.</dd>
<h3>Sort/Merge Join</h3>
<svg data-src="graphics/2018-02-12-Join-SortMerge.svg" />
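<p>A minimal in-memory sketch of a sort/merge join for an equality condition such as <code>R.B = S.B</code>. The key-extractor parameters and the sample rows are assumptions for illustration; a real system would sort externally rather than with <code>sorted</code>.</p>

```python
def sort_merge_join(lhs, rhs, lhs_key, rhs_key):
    # Sort both inputs on the join key up front.
    lhs = sorted(lhs, key=lhs_key)
    rhs = sorted(rhs, key=rhs_key)
    result = []
    i = j = 0
    while i < len(lhs) and j < len(rhs):
        lk, rk = lhs_key(lhs[i]), rhs_key(rhs[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rhs rows that share this key.
            j_end = j
            while j_end < len(rhs) and rhs_key(rhs[j_end]) == lk:
                j_end += 1
            # Pair every lhs row with this key against that run.
            while i < len(lhs) and lhs_key(lhs[i]) == lk:
                for s in rhs[j:j_end]:
                    result.append(lhs[i] + s)
                i += 1
            j = j_end
    return result


R = [(1, 20), (2, 10), (3, 20)]                   # rows are (A, B)
S = [(10, "p"), (20, "q"), (20, "r"), (30, "s")]  # rows are (B, C)
joined = sort_merge_join(R, S, lambda r: r[1], lambda s: s[0])
```

<p>After sorting, each side is walked forward once; only runs of equal keys are paired, so non-matching tuples are never compared against each other.</p>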
<h3>Hash Functions</h3>
<li>A hash function is a function that maps a large data value to a small fixed-size value<ul>
<li>Typically deterministic &amp; pseudorandom</li>
<li>Used in Checksums, Hash Tables, Partitioning, Bloom Filters, Caching, Cryptography, Password Storage, …</li>
<li>Examples: MD5, SHA1, SHA2<ul>
<li><code>MD5()</code> is part of OpenSSL (available on most OSX / Linux / Unix systems)</li>
<li>Can map h(k) to range [0,N) with h(k) % N (modulus)</li>
<h3>Hash Functions</h3>
<p style="margin-top: 50px">
$$h(X) \bmod N$$
<li>Pseudorandom output in the range $[0, N)$</li>
<li>Always the same output for a given $X$</li>
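<p>A short sketch of $h(X) \bmod N$ using MD5 from Python's standard <code>hashlib</code> module. The <code>bucket</code> helper is a hypothetical name, and converting the value with <code>str()</code> before hashing is an assumption made to keep the example small.</p>

```python
import hashlib


def bucket(value, n):
    """Map a value to a bucket number in [0, n) using MD5."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


# Assign 1000 keys to 4 buckets.
buckets = [bucket(k, 4) for k in range(1000)]
```

<p>Calling <code>bucket</code> twice with the same value always gives the same bucket, and every result lies in $[0, N)$.</p>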
<h3>1-Pass Hash Join</h3>
<svg data-src="graphics/2018-02-12-Join-1PassHash.svg" />
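<p>A minimal sketch of the 1-pass (in-memory) hash join, assuming the build side fits in memory. The function name, the key-extractor parameters, and the sample rows are illustrative, and a Python <code>dict</code> stands in for the in-memory index.</p>

```python
from collections import defaultdict


def hash_join(build, probe, build_key, probe_key):
    # Pass over the build side: index its rows by join key.
    index = defaultdict(list)
    for b in build:
        index[build_key(b)].append(b)

    # Single scan of the probe side: look up matches in the index.
    result = []
    for p in probe:
        for b in index.get(probe_key(p), []):
            result.append(b + p)
    return result


R = [(1, 20), (2, 10), (3, 20)]                   # rows are (A, B)
S = [(10, "p"), (20, "q"), (20, "r"), (30, "s")]  # rows are (B, C)
joined = hash_join(R, S, lambda r: r[1], lambda s: s[0])
```

<p>Each input is read exactly once; the cost is the memory needed to hold the index over the build side, so the smaller table is usually the better choice for it.</p>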
<h3>2-Pass Hash Join</h3>
<svg data-src="graphics/2018-02-12-Join-2PassHash.svg" />
<p>Why is it important that the hash function is pseudorandom?</p>
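<p>A sketch of the 2-pass idea: partition both inputs by hashing the join key, then join each pair of partitions independently. Partitions are kept in memory here for simplicity (a real system would write them to disk), Python's built-in <code>hash</code> stands in for $h$, and all names and sample rows are hypothetical.</p>

```python
from collections import defaultdict


def partition(rows, key, n):
    """Pass 1: split rows into n partitions by hashing the join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def partition_join(lhs, rhs, lhs_key, rhs_key, n):
    result = []
    lhs_parts = partition(lhs, lhs_key, n)
    rhs_parts = partition(rhs, rhs_key, n)
    # Pass 2: rows with equal keys land in the same partition number,
    # so each pair of partitions can be joined on its own.
    for l_part, r_part in zip(lhs_parts, rhs_parts):
        index = defaultdict(list)
        for l in l_part:
            index[lhs_key(l)].append(l)
        for r in r_part:
            for l in index.get(rhs_key(r), []):
                result.append(l + r)
    return result


R = [(1, 20), (2, 10), (3, 20)]                   # rows are (A, B)
S = [(10, "p"), (20, "q"), (20, "r"), (30, "s")]  # rows are (B, C)
joined = partition_join(R, S, lambda r: r[1], lambda s: s[0], n=3)
```

<p>The second pass only works if each partition fits in memory, which is more likely when the hash spreads keys evenly across partitions.</p>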
<h3>Next Class</h3>
<p style="margin-top: 100px">More operators, More algorithms</p>
