---
template: templates/cse4562_2021_slides.erb
title: "Cost-Based Optimization"
date: March 11, 2021
textbook: Ch. 16
---

<section>
  <section>
    <h3>Remember the Real Goals</h3>
    <ol>
      <li>Accurately <b>rank</b> the plans.</li>
      <li>Don't spend more time optimizing than you get back.</li>
      <li>Don't pick a plan that uses more memory than you have.</li>
    </ol>
  </section>

  <section>
    <table style="font-size: 70%">
      <tr><th>Operation</th><th>RA</th><th>Total IOs (#pages)</th><th>Memory (#tuples)</th></tr>
      <tr  >
        <td>Table Scan</td>
        <td>$R$</td>
        <td  >$\frac{|R|}{\mathcal P}$</td>
        <td  >$O(1)$</td>
      </tr>
      <tr  >
        <td>Projection</td>
        <td>$\pi(R)$</td>
        <td  >$\textbf{io}(R)$</td>
        <td  >$O(1)$</td>
      </tr>
      <tr  >
        <td>Selection</td>
        <td>$\sigma(R)$</td>
        <td>$\textbf{io}(R)$</td>
        <td>$O(1)$</td>
      </tr>
      <tr  >
        <td>Union</td>
        <td>$R \uplus S$</td>
        <td>$\textbf{io}(R) + \textbf{io}(S)$</td>
        <td>$O(1)$</td>
      </tr>
      <tr  >
        <td style="vertical-align: middle;">Sort <span  >(In-Mem)</span></td>
        <td style="vertical-align: middle;">$\tau(R)$</td>
        <td  >$\textbf{io}(R)$</td>
        <td  >$O(|R|)$</td>
      </tr>
      <tr>
        <td  >Sort (On-Disk)</td>
        <td  >$\tau(R)$</td>
        <td  >$\frac{2 \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P} + \textbf{io}(R)$</td>
        <td  >$O(\mathcal B)$</td>
      </tr>
      <tr  >
        <td><span  >(B+Tree)</span> Index Scan</td>
        <td>$Index(R, c)$</td>
        <td  >$\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$</td>
        <td  >$O(1)$</td>
      </tr>
      <tr>
        <td span  >(Hash) Index Scan</td>
        <td span  >$Index(R, c)$</td>
        <td  >$1$</td>
        <td  >$O(1)$</td>
      </tr>
    </table>

    <ol style="font-size: 50%; margin-top: 50px;">
      <li  >Tuples per Page ($\mathcal P$) <span>– Normally defined per-schema</span></li>
      <li  >Size of $R$ ($|R|$)</li>
      <li  >Pages of Buffer ($\mathcal B$)</li>
      <li  >Keys per Index Page ($\mathcal I$)</li>
    </ol>
  </section>
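  <section>
    <p style="font-size: 70%">A minimal Python sketch of two rows from the table above, to show how the parameters plug in. The function names and the example sizes are hypothetical, rounding up to whole pages is an added assumption, and the index formula assumes that matching tuples are stored contiguously.</p>

```python
import math

def io_table_scan(num_tuples, tuples_per_page):
    """Pages read by a full scan: |R| / P, rounded up to whole pages."""
    return math.ceil(num_tuples / tuples_per_page)

def io_btree_index_scan(num_tuples, matching_tuples, tuples_per_page, keys_per_index_page):
    """log_I(|R|) index pages to descend the tree, plus |sigma_c(R)| / P data pages."""
    descent = math.ceil(math.log(num_tuples, keys_per_index_page))
    return descent + math.ceil(matching_tuples / tuples_per_page)

# Hypothetical table: 1,000,000 tuples, P = 100 tuples per page,
# I = 200 keys per index page, and a condition matching 5,000 tuples.
scan_cost = io_table_scan(1_000_000, 100)
index_cost = io_btree_index_scan(1_000_000, 5_000, 100, 200)
print(scan_cost, index_cost)
```

    <p style="font-size: 70%">With these made-up numbers the index scan reads 53 pages against 10,000 for the full scan; the gap closes as $|\sigma_c(R)|$ approaches $|R|$.</p>
  </section>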
  <section>
    <table style="font-size: 70%">
      <tr><th width="300px">Operation</th><th>RA</th><th>Total IOs (#pages)</th><th style="font-size: 80%;">Mem (#tuples)</th></tr>
      <tr  >
        <td style="font-size: 60%">Nested Loop Join <span  >(Buffer $S$ in mem)</span></td>
        <td>$R \times_{mem} S$</td>
        <td  >$\textbf{io}(R)+\textbf{io}(S)$</td>
        <td  >$O(|S|)$</td>
      </tr>
      <tr>
        <td   style="font-size: 60%">Block NLJ (Buffer $S$ on disk)</td>
        <td  >$R \times_{disk} S$</td>
        <td  >$\frac{|R|}{\mathcal B} \cdot \frac{|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$</td>
        <td  >$O(1)$</td>
      </tr>
      <tr>
        <td   style="font-size: 60%">Block NLJ (Recompute $S$)</td>
        <td  >$R \times_{redo} S$</td>
        <td  >$\textbf{io}(R) + \frac{|R|}{\mathcal B} \cdot \textbf{io}(S)$</td>
        <td  >$O(1)$</td>
      </tr>
      <tr  >
        <td>1-Pass Hash Join</td>
        <td>$R \bowtie_{1PH, c} S$</td>
        <td  >$\textbf{io}(R) + \textbf{io}(S)$</td>
        <td  >$O(|S|)$</td>
      </tr>
      <tr  >
        <td>2-Pass Hash Join</td>
        <td>$R \bowtie_{2PH, c} S$</td>
        <td  >$\frac{2|R| + 2|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$</td>
        <td  >$O(1)$</td>
      </tr>
      <tr  >
        <td>Sort-Merge Join </td>
        <td>$R \bowtie_{SM, c} S$</td>
        <td  >[Sort]</td>
        <td  >[Sort]</td>
      </tr>
      <tr  >
        <td><span  >(Tree)</span> Index NLJ</td>
        <td>$R \bowtie_{INL, c} S$</td>
        <td  >$|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$</td>
        <td  >$O(1)$</td>
      </tr>
      <tr>
        <td  >(Hash) Index NLJ</td>
        <td  >$R \bowtie_{INL, c} S$</td>
        <td  >$|R| \cdot 1$</td>
        <td  >$O(1)$</td>
      </tr>
      <tr  >
        <td><span  >(In-Mem)</span> Aggregate</td>
        <td>$\gamma_A(R)$</td>
        <td  >$\textbf{io}(R)$</td>
        <td  >$O(|adom(A)|)$</td>
      </tr>
      <tr>
        <td   style="font-size: 90%">(Sort/Merge) Aggregate</td>
        <td  >$\gamma_A(R)$</td>
        <td  >[Sort]</td>
        <td  >[Sort]</td>
      </tr>
    </table>
  </section>
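  <section>
    <p style="font-size: 70%">One way to read the join rows is as inputs to a plan ranking. The sketch below follows three of the formulas as written and then applies the memory goal from the first slide. Table sizes, the memory budget, and all names are hypothetical, and $O(1)$ memory is treated as negligible.</p>

```python
import math

def io(num_tuples, tuples_per_page):
    """io(R): pages read to scan R once."""
    return math.ceil(num_tuples / tuples_per_page)

def cost_nlj_in_memory(r, s, p):
    # io(R) + io(S); needs O(|S|) tuples of memory
    return io(r, p) + io(s, p)

def cost_block_nlj_recompute(r, s, p, b):
    # io(R) + (|R| / B) * io(S)
    return io(r, p) + math.ceil(r / b) * io(s, p)

def cost_two_pass_hash(r, s, p):
    # (2|R| + 2|S|) / P + io(R) + io(S)
    return math.ceil((2 * r + 2 * s) / p) + io(r, p) + io(s, p)

# Hypothetical inputs: |R|, |S|, tuples per page, buffer size, memory budget (tuples).
r, s, p, b, memory_budget = 100_000, 50_000, 100, 1_000, 10_000

# plan name -> (estimated page IOs, memory needed in tuples)
plans = {
    "nlj_in_memory": (cost_nlj_in_memory(r, s, p), s),
    "block_nlj_recompute": (cost_block_nlj_recompute(r, s, p, b), 0),
    "two_pass_hash": (cost_two_pass_hash(r, s, p), 0),
}

# Drop plans that do not fit in memory, then take the cheapest survivor.
feasible = {name: cost for name, (cost, mem) in plans.items() if mem <= memory_budget}
best = min(feasible, key=feasible.get)
print(plans, best)
```

    <p style="font-size: 70%">The in-memory nested loop join is cheapest on IO (1,500 pages) but needs room for all of $S$, so under this budget the 2-pass hash join (4,500 pages) wins over recomputing $S$ (51,000 pages).</p>
  </section>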
</section>

<section>
  <section>
    <h3>Cardinality Estimation</h3>
    <h4>(The Hard Parts)</h4>

    <dl>
      <dt style="margin-top: 50px;">$\sigma_c(Q)$ (Cardinality Estimation)</dt>
      <dd>How many tuples will a condition $c$ allow to pass?</dd>

      <dt style="margin-top: 50px;">$\delta_A(Q)$ (Distinct Values Estimation)</dt>
      <dd>How many distinct values of attribute(s) $A$ exist?</dd>
    </dl>
  </section>

  <section>
    <h3>Remember the Real Goals</h3>
    <ol>
      <li>Accurately <b>rank</b> the plans.</li>
      <li>Don't spend more time optimizing than you get back.</li>
    </ol>
  </section>

  <section>
    <h3>(Some) Estimation Techniques</h3>

    <dl style="font-size: 80%">
      <div class="fragment highlight-grey" data-fragment-index="1">
        <dt>Guess Randomly</dt>
        <dd>Rules of thumb if you have no other options...</dd>
      </div>

      <div class="fragment highlight-grey" data-fragment-index="1">
        <dt>Uniform Prior</dt>
        <dd>Use basic statistics to make a very rough guess.</dd>
      </div>

      <div>
        <dt>Sampling / History</dt>
        <dd>Small, Quick Sampling Runs (or prior executions of the query).</dd>
      </div>

      <div>
        <dt>Histograms</dt>
        <dd>Using more detailed statistics for improved guesses.</dd>
      </div>

      <div>
        <dt>Constraints</dt>
        <dd>Using rules about the data for improved guesses.</dd>
      </div>
    </dl>
  </section>
</section>


<section>
  <section>
    <h3>(Some) Estimation Techniques</h3>

    <dl style="font-size: 80%">
      <dt style="color: grey;">Guess Randomly</dt>
      <dd style="color: grey;">Rules of thumb if you have no other options...</dd>

      <dt style="color: grey;">Uniform Prior</dt>
      <dd style="color: grey;">Use basic statistics to make a very rough guess.</dd>

      <dt style="color: blue;">Sampling / History</dt>
      <dd style="color: blue;">Small, Quick Sampling Runs (or prior executions of the query).</dd>

      <dt style="color: grey;">Histograms</dt>
      <dd style="color: grey;">Using more detailed statistics for improved guesses.</dd>

      <dt style="color: grey;">Constraints</dt>
      <dd style="color: grey;">Using rules about the data for improved guesses.</dd>
    </dl>
  </section>

  <section>
    <p><b>Idea 1:</b> Pick 100 tuples at random from each input table.</p>
  </section>

  <section>
    <svg data-src="2021-03-11/JoinIssue.svg" />
  </section>

  <section>
    <h3>The Birthday Paradox</h3>

    <p style="margin-top: 50px;">
      Assume: $\texttt{UNIQ}(A, R) = \texttt{UNIQ}(A, S) = N$
    </p>

    <p style="margin-top: 50px;">
      It takes $O(\sqrt{N})$ samples from both $R$ and $S$ <br/> to get even <b>one match.</b>
    </p>
  </section>
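  <section>
    <p style="font-size: 70%">A back-of-the-envelope version of the argument, as a sketch. It assumes join values are uniform over $N$ distinct values and that samples are independent, so each of the $k \cdot k$ sample pairs matches with probability $\frac{1}{N}$. The function name and the value of $N$ are made up for illustration.</p>

```python
def expected_matches(samples_per_table, num_distinct):
    """Expected number of (r, s) sample pairs that agree on the join value.

    Assumption: every sampled value is uniform over num_distinct values and
    independent of the rest, so each pair matches with probability 1/N.
    """
    return samples_per_table ** 2 / num_distinct

n = 1_000_000  # hypothetical number of distinct join values
few = expected_matches(100, n)     # 100 samples per table
many = expected_matches(1_000, n)  # sqrt(N) samples per table
print(few, many)
```

    <p style="font-size: 70%">Under these assumptions, 100 samples per table yield about 0.01 expected matches; it takes about $\sqrt{N} = 1000$ samples per table to expect a single one.</p>
  </section>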

  <section>
    <p>To be resumed later in the term when we talk about AQP.</p>
  </section>

  <section>
    <p><b>How DBs Do It</b>: Instrument queries while running them.</p>
    <ul>
      <li class="fragment">The first time you run a query it <i>might</i> be slow.</li>
      <li class="fragment">The second, third, fourth, etc... times it'll be fast.</li>
    </ul>
  </section>
</section>

<section>

  <section>
    <h3>(Some) Estimation Techniques</h3>

    <dl style="font-size: 80%">
      <dt style="color: grey;">Guess Randomly</dt>
      <dd style="color: grey;">Rules of thumb if you have no other options...</dd>

      <dt style="color: grey;">Uniform Prior</dt>
      <dd style="color: grey;">Use basic statistics to make a very rough guess.</dd>

      <dt style="color: grey;">Sampling / History</dt>
      <dd style="color: grey;">Small, Quick Sampling Runs (or prior executions of the query).</dd>

      <dt style="color: blue;">Histograms</dt>
      <dd style="color: blue;">Using more detailed statistics for improved guesses.</dd>

      <dt style="color: grey;">Constraints</dt>
      <dd style="color: grey;">Using rules about the data for improved guesses.</dd>
    </dl>
  </section>

  <section>
    <h3>Limitations of Uniform Prior</h3>

    <dl>
      <div class="fragment highlight-grey" data-fragment-index="1">
        <dt>Don't always have statistics for $Q$</dt>
        <dd>For example, $\pi_{A \leftarrow (B \times C)}(R)$</dd>
      </div>

      <div class="fragment highlight-grey" data-fragment-index="1">
        <dt>Don't always have clear rules for $c$</dt>
        <dd>For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$</dd>
      </div>

      <div class="fragment highlight-blue" data-fragment-index="1">
        <dt>Attribute values are not always uniformly distributed.</dt>
        <dd>For example, <span style="font-size: 60%"> $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$</span></dd>
      </div>

      <div class="fragment highlight-grey" data-fragment-index="1">
        <dt>Attribute values are sometimes correlated.</dt>
        <dd>For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$</dd>
      </div>

    </dl>
  </section>

  <section>
    <p class="fragment highlight-grey" data-fragment-index="1">
      <b>Ideal Case:</b> You have some
      $$f(x) = \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$
      (and similarly for the other aggregates)
    </p>
    <p class="fragment" data-fragment-index="1">
      <b>Slightly Less Ideal Case:</b> You have some
      $$f(x) \approx \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$
    </p>
  </section>

  <section>
    <p>If this sounds like CDF-based indexing... you're right!</p>

    <p class="fragment">... but we're not going to talk about NNs today</p>
  </section>
</section>

<section>
  <section>
    <p>
      <b>Simpler/Faster Idea: </b> Break $f(x)$ into chunks
    </p>
  </section>

  <section>
    <h3>Example Data</h3>
    <table style="font-size: 80%">
      <tr><th>Name</th>      <th>YearsEmployed</th>  <th>Role</th></tr>
      <tr><td>'Alice'</td>   <td>3</td>              <td>1</td></tr>
      <tr><td>'Bob'</td>     <td>2</td>              <td>2</td></tr>
      <tr><td>'Carol'</td>   <td>3</td>              <td>1</td></tr>
      <tr><td>'Dave'</td>    <td>1</td>              <td>3</td></tr>
      <tr><td>'Eve'</td>     <td>2</td>              <td>2</td></tr>
      <tr><td>'Fred'</td>    <td>2</td>              <td>3</td></tr>
      <tr><td>'Gwen'</td>    <td>4</td>              <td>1</td></tr>
      <tr><td>'Harry'</td>   <td>2</td>              <td>3</td></tr>
    </table>
  </section>

  <section>
    <h3>Histograms</h3>
    <table style="font-size: 70%">
      <tr><th>YearsEmployed</th><th>COUNT</th></tr>
      <tr><td>1</td>            <td>1</td>    </tr>
      <tr><td>2</td>            <td>4</td>    </tr>
      <tr><td>3</td>            <td>2</td>    </tr>
      <tr><td>4</td>            <td>1</td>    </tr>
    </table>

    <table>
      <tr class="fragment"><td style="font-size: 70%"><code>COUNT(DISTINCT YearsEmployed)</code> </td><td class="fragment">$= 4$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>MIN(YearsEmployed)</code>            </td><td class="fragment">$= 1$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>MAX(YearsEmployed)</code>            </td><td class="fragment">$= 4$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>COUNT(*) WHERE YearsEmployed = 2</code></td><td class="fragment">$= 4$</td></tr>
    </table>
  </section>

  <section>
    <h3>Histograms</h3>
    <table style="font-size: 70%">
      <tr><th>YearsEmployed</th><th>COUNT</th></tr>
      <tr><td>1-2</td>          <td>5</td>    </tr>
      <tr><td>3-4</td>          <td>3</td>    </tr>
    </table>

    <table>
      <tr class="fragment"><td style="font-size: 70%"><code>COUNT(DISTINCT YearsEmployed)</code> </td><td class="fragment">$= 4$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>MIN(YearsEmployed)</code>            </td><td class="fragment">$= 1$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>MAX(YearsEmployed)</code>            </td><td class="fragment">$= 4$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>COUNT(*) WHERE YearsEmployed = 2</code></td><td class="fragment">$= \frac{5}{2}$</td></tr>
    </table>
  </section>

  <section>
    <h3>The Extreme Case</h3>
    <table style="font-size: 70%">
      <tr><th>YearsEmployed</th><th>COUNT</th></tr>
      <tr><td>1-4</td>          <td>8</td>    </tr>
    </table>

    <table>
      <tr class="fragment"><td style="font-size: 70%"><code>COUNT(DISTINCT YearsEmployed)</code> </td><td class="fragment">$= 4$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>MIN(YearsEmployed)</code>            </td><td class="fragment">$= 1$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>MAX(YearsEmployed)</code>            </td><td class="fragment">$= 4$</td></tr>
      <tr class="fragment"><td style="font-size: 70%"><code>COUNT(*) WHERE YearsEmployed = 2</code></td><td class="fragment">$= \frac{8}{4}$</td></tr>
    </table>
  </section>
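  <section>
    <p style="font-size: 70%">The three histograms above differ only in bucket width. A small Python sketch, assuming integer values and a uniform spread inside each bucket, reproduces the three estimates for <code>YearsEmployed = 2</code>. The helper names are illustrative.</p>

```python
from collections import Counter

# The YearsEmployed column from the example data.
years_employed = [3, 2, 3, 1, 2, 2, 4, 2]

def build_histogram(values, bucket_width, low):
    """Equi-width histogram: maps bucket index -> number of values in it."""
    return Counter((v - low) // bucket_width for v in values)

def estimate_equals(hist, bucket_width, low, x):
    """Estimate COUNT(*) WHERE col = x, assuming the bucket's rows are spread
    evenly over the bucket_width integer values the bucket covers."""
    return hist[(x - low) // bucket_width] / bucket_width

estimates = {}
for width in (1, 2, 4):
    hist = build_histogram(years_employed, width, low=1)
    estimates[width] = estimate_equals(hist, width, low=1, x=2)
print(estimates)
```

    <p style="font-size: 70%">Width 1 gives the exact answer (4), width 2 gives $\frac{5}{2}$, and a single bucket gives $\frac{8}{4}$, which is the uniform prior again.</p>
  </section>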

  <section>
    <h3>More Example Data</h3>
    <table style="font-size: 80%; float: left;">
      <tr><th>Value</th>  <th>COUNT</th>  </tr>
      <tr><td> 1-10</td>  <td>20</td>     </tr>
      <tr><td>11-20</td>  <td> 0</td>     </tr>
      <tr><td>21-30</td>  <td>15</td>     </tr>
      <tr><td>31-40</td>  <td>30</td>     </tr>
      <tr><td>41-50</td>  <td>22</td>     </tr>
      <tr><td>51-60</td>  <td>63</td>     </tr>
      <tr><td>61-70</td>  <td>10</td>     </tr>
      <tr><td>71-80</td>  <td>10</td>     </tr>
    </table>

    <table style="margin-top: 100px;">
      <tr class="fragment">
        <td style="font-size: 70%; width: 350px;"><code>SELECT … WHERE A = 33</code> </td>
        <td class="fragment" style="font-size: 80%; text-align: left; width: 200px;">$= \frac{1}{40-30}\cdot 30 = 3$</td>
      </tr>
      <tr><td style="height: 70px;"></td><td></td></tr>
      <tr class="fragment">
        <td style="font-size: 70%; width: 350px;"><code>SELECT … WHERE A > 33</code> </td>
        <td class="fragment" style="font-size: 80%; text-align: left; width: 200px;">$= \frac{40-33}{40-30}\cdot 30+22$ $\;\;\;+63+10+10$ $= 126$ </td>
      </tr>
    </table>
  </section>
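  <section>
    <p style="font-size: 70%">The two estimates above can be sketched in Python as follows. The code assumes each bucket covers $low &lt; A \leq high$ (so the 31-40 bucket has width $40 - 30$, as on the slide) and that values are uniform within a bucket. The function names are illustrative.</p>

```python
# (low, high, count) per bucket; a bucket covers low < A <= high.
buckets = [(0, 10, 20), (10, 20, 0), (20, 30, 15), (30, 40, 30),
           (40, 50, 22), (50, 60, 63), (60, 70, 10), (70, 80, 10)]

def estimate_equals(x):
    """Estimate COUNT(*) WHERE A = x: the bucket's count spread over its width."""
    for low, high, count in buckets:
        if low < x <= high:
            return count / (high - low)
    return 0.0

def estimate_greater_than(x):
    """Estimate COUNT(*) WHERE A > x."""
    total = 0.0
    for low, high, count in buckets:
        if x <= low:
            total += count  # bucket lies entirely above x
        elif x < high:
            total += (high - x) * count / (high - low)  # part of the bucket above x
    return total

eq_33 = estimate_equals(33)
gt_33 = estimate_greater_than(33)
print(eq_33, gt_33)
```

    <p style="font-size: 70%">Both values agree with the slide: 3 rows for <code>A = 33</code> and $21 + 22 + 63 + 10 + 10 = 126$ rows for <code>A > 33</code>.</p>
  </section>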
</section>

<section>
  <section>
    <h3>(Some) Estimation Techniques</h3>

    <dl style="font-size: 80%">
      <dt style="color: grey;">Guess Randomly</dt>
      <dd style="color: grey;">Rules of thumb if you have no other options...</dd>

      <dt style="color: grey;">Uniform Prior</dt>
      <dd style="color: grey;">Use basic statistics to make a very rough guess.</dd>

      <dt style="color: grey;">Sampling / History</dt>
      <dd style="color: grey;">Small, Quick Sampling Runs (or prior executions of the query).</dd>

      <dt style="color: grey;">Histograms</dt>
      <dd style="color: grey;">Using more detailed statistics for improved guesses.</dd>

      <dt style="color: blue;">Constraints</dt>
      <dd style="color: blue;">Using rules about the data for improved guesses.</dd>
    </dl>
  </section>

  <section>
    <h3>Key / Unique Constraints</h3>
    <pre style="margin-top: 50px;"><code class="sql">
      CREATE TABLE R (
        A int,
        B int UNIQUE,
        ...
        PRIMARY KEY (A)
      );
    </code></pre>
    <p style="margin-top: 50px;">
      No duplicate values in the column.
      $$\texttt{COUNT(DISTINCT A)} = \texttt{COUNT(*)}$$
    </p>
  </section>
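  <section>
    <p style="font-size: 70%">A quick way to see the identity hold is to let a real engine enforce the constraint. The sketch below uses SQLite through Python's <code>sqlite3</code> module; the table contents are made up, and the elided columns from the slide are left out.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER PRIMARY KEY, B INTEGER UNIQUE)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

# A duplicate key is rejected, so the two counts below cannot drift apart.
try:
    conn.execute("INSERT INTO R VALUES (1, 40)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

distinct_a, distinct_b, total = conn.execute(
    "SELECT COUNT(DISTINCT A), COUNT(DISTINCT B), COUNT(*) FROM R"
).fetchone()
print(duplicate_rejected, distinct_a, distinct_b, total)
```

    <p style="font-size: 70%">One caveat: a <code>UNIQUE</code> column may still hold several NULLs, so the identity for $B$ additionally relies on $B$ being non-null here.</p>
  </section>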

  <section>
    <h3>Foreign Key Constraints</h3>
    <pre style="margin-top: 50px;"><code class="sql">
      CREATE TABLE S (
        B int,
        ...
        FOREIGN KEY (B) REFERENCES R(B)
      );
    </code></pre>
    <p style="margin-top: 50px;">
      All values in the column appear in another table.
      $$\pi_{attrs(S)}\left(S \bowtie_B R\right) \subseteq S$$
    </p>
  </section>

  <section>
    <h3>Functional Dependencies</h3>

    <pre style="margin-top: 50px;"><code class="sql">
      Not expressible in SQL
    </code></pre>

    <p style="margin-top: 50px;">
      One set of columns uniquely determines another.<br/>
      $\pi_{A}(\delta(\pi_{A, B}(R)))$ has no duplicates and...
      $$\pi_{attrs(R)-B}(R) \bowtie_A \delta(\pi_{A, B}(R)) = R$$
    </p>
  </section>

  <section>
    <h3>Constraints</h3>

    <h4>The Good</h4>
    <ul>
      <li style="font-size: 70%" class="fragment">Sanity check on your data: Inconsistent data triggers failures.</li>
      <li style="font-size: 70%" class="fragment">More opportunities for query optimization.</li>
    </ul>

    <h4 style="margin-top: 50px;" class="fragment">The Not-So-Good</h4>
    <ul>
      <li style="font-size: 70%" class="fragment">Validating constraints whenever data changes is (usually) expensive.</li>
      <li style="font-size: 70%" class="fragment">Inconsistent data triggers failures.</li>
    </ul>

  </section>

  <section>
    <h3>Foreign Key Constraints</h3>

    <p style="margin-top: 50px;">Foreign keys are like pointers.  What happens with broken pointers?</p>
  </section>

  <section>
    <h3>Foreign Key Enforcement</h3>

    <p>Foreign keys are defined with referential actions <code>ON UPDATE [X]</code> and <code>ON DELETE [X]</code>.  Depending on what [X] is, the constraint is enforced differently:</p>

    <dl style="font-size: 80%">
      <dt><code>CASCADE</code></dt>
      <dd>Update/delete referencing rows as needed to avoid invalid foreign keys.</dd>

      <dt><code>NO ACTION</code></dt>
      <dd>Abort any transaction that ends with an invalid foreign key reference.</dd>

      <dt><code>SET NULL</code></dt>
      <dd>Automatically replace any invalid foreign key references with NULL.</dd>
    </dl>
  </section>

  <section>
    <p style="font-weight: bold;">
      <code>CASCADE</code> and <code>NO ACTION</code> ensure that the data never has broken pointers, so
    </p>
    $$\pi_{attrs(S)}\left(S \bowtie_B R\right) = S$$
  </section>
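  <section>
    <p style="font-size: 70%">The three behaviours can be tried out in SQLite via Python's <code>sqlite3</code> module. This is a sketch on made-up data: $S.B$ references $R.A$, one referenced row of $R$ is deleted, and the helper <code>join_after_delete</code> (a hypothetical name) reports whether the delete was aborted, $|S|$, and $|S \bowtie R|$.</p>

```python
import sqlite3

def join_after_delete(action):
    """Delete a referenced row of R and report (aborted, |S|, |S join R|)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.execute("CREATE TABLE R (A INTEGER PRIMARY KEY)")
    conn.execute(f"CREATE TABLE S (B INTEGER REFERENCES R(A) ON DELETE {action})")
    conn.executemany("INSERT INTO R VALUES (?)", [(1,), (2,)])
    conn.executemany("INSERT INTO S VALUES (?)", [(1,), (1,), (2,)])
    try:
        conn.execute("DELETE FROM R WHERE A = 1")
        aborted = False
    except sqlite3.IntegrityError:
        aborted = True  # the delete would have left dangling references
    s_rows = conn.execute("SELECT COUNT(*) FROM S").fetchone()[0]
    joined = conn.execute("SELECT COUNT(*) FROM S JOIN R ON S.B = R.A").fetchone()[0]
    return aborted, s_rows, joined

results = {action: join_after_delete(action)
           for action in ("CASCADE", "NO ACTION", "SET NULL")}
print(results)
```

    <p style="font-size: 70%">With <code>CASCADE</code> and <code>NO ACTION</code> every row of $S$ still finds its partner, so $|S \bowtie R| = |S|$. With <code>SET NULL</code> the two nulled rows drop out of the join, leaving only the $\subseteq$ guarantee.</p>
  </section>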

  <section>
    <h3>Functional Dependencies</h3>

    <p style="margin-top: 50px;"><b>A generalization of keys:</b> One set of attributes that uniquely identifies another.</p>

    <ul>
      <li>SS# uniquely identifies Name.</li>
      <li>Employee uniquely identifies Manager.</li>
      <li>Order number uniquely identifies Customer Address.</li>
    </ul>

    <p class="fragment">Two rows with the same As must have the same Bs.</p>
    <p class="fragment" style="font-size: 80%">(but can still have identical Bs for two different As)</p>
  </section>
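  <section>
    <p style="font-size: 70%">Since SQL offers no direct way to declare a functional dependency, one option is to check it against the data. A minimal sketch, with a hypothetical helper <code>holds</code> and made-up employee rows:</p>

```python
def holds(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds in rows:
    any two rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[b] for b in rhs)
        # Remember the first rhs seen for this lhs; any later mismatch breaks the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical data (not from the slides).
rows = [
    {"employee": "Alice", "manager": "Gwen", "office": 101},
    {"employee": "Bob",   "manager": "Gwen", "office": 102},
    {"employee": "Carol", "manager": "Dave", "office": 101},
]
employee_to_manager = holds(rows, ["employee"], ["manager"])
manager_to_employee = holds(rows, ["manager"], ["employee"])
print(employee_to_manager, manager_to_employee)
```

    <p style="font-size: 70%">Employee determines Manager even though Alice and Bob share one; the reverse dependency fails for the same reason. A check like this only shows that the data is consistent with the dependency today, not that it is guaranteed.</p>
  </section>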

  <section>
    <h3>Normal Forms</h3>
    <p style="margin-top: 50px;">"All functional dependencies should be keys."</p>
    <p class="fragment">(Otherwise you want two separate relations)</p>
    <p class="fragment">(for more details, see CSE 560)</p>
  </section>

  <section>

    <p style="font-size: 70%">
      $$P(A = B) = \min\left(\frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}, \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } B)}\right)$$
    </p>

  </section>
  <section>

    <p>
      $$R \bowtie_{R.A = S.B} S = \sigma_{R.A = S.B}(R \times S)$$
      (and $S.B$ is a foreign key referencing $R.A$)
    </p>

    <p class="fragment" style="margin-top: 30px; font-size: 80%">
      The (foreign) key constraint gives us two things...
      $$\texttt{COUNT}(\texttt{DISTINCT } A) \approx \texttt{COUNT}(\texttt{DISTINCT } B)$$
      <span style="font-size: 60%; font-weight: bold; margin: 0px;">and</span>
      $$\texttt{COUNT}(\texttt{DISTINCT } A) = |R|$$
    </p>

    <p class="fragment" style="margin-top: 30px; font-size: 80%">
      Based on the first property, the total number of rows is roughly...
      $$|R| \times |S| \times \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}$$
    </p>

    <p class="fragment" style="margin-top: 30px; font-size: 80%">
      Then based on the second property...
      $$ = |R| \times |S| \times \frac{1}{|R|} = |S|$$
    </p>

    <p class="fragment" style="margin-top: 30px; font-size: 50%">(Statistics/Histograms will give you the same outcome... but constraints can be easier to propagate)</p>
  </section>
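  <section>
    <p style="font-size: 70%">The derivation above, as a small Python sketch. The function name and the table sizes are hypothetical; the code uses $\min\left(\frac{1}{a}, \frac{1}{b}\right) = \frac{1}{\max(a, b)}$ to stay in exact arithmetic.</p>

```python
def estimate_join_size(r_size, s_size, distinct_a, distinct_b):
    """Estimate |R join S| on R.A = S.B as |R| * |S| * P(A = B), with
    P(A = B) = min(1/distinct_a, 1/distinct_b) = 1 / max(distinct_a, distinct_b)."""
    return r_size * s_size / max(distinct_a, distinct_b)

# Hypothetical sizes: S.B is a foreign key referencing the key R.A.
r_size, s_size = 1_000, 5_000
distinct_a = r_size  # A is a key, so COUNT(DISTINCT A) = |R|
distinct_b = 800     # at most COUNT(DISTINCT A), since every B value appears in R

estimate = estimate_join_size(r_size, s_size, distinct_a, distinct_b)
print(estimate)
```

    <p style="font-size: 70%">The estimate comes out to $|S|$, matching the constraint-based argument: each row of $S$ joins with exactly one row of $R$.</p>
  </section>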
</section>