diff --git a/slides/cse4562sp2018/2018-02-28-CostBasedOptimization1.html b/slides/cse4562sp2018/2018-02-28-CostBasedOptimization1.html
index d02130e6..5c2b95d5 100644
--- a/slides/cse4562sp2018/2018-02-28-CostBasedOptimization1.html
+++ b/slides/cse4562sp2018/2018-02-28-CostBasedOptimization1.html
@@ -159,7 +159,7 @@
 Union
-$R \cup S$
+$R \uplus S$
 $0$
 $O(1)$
diff --git a/slides/cse4562sp2018/2018-03-05-CostBasedOptimization2.html b/slides/cse4562sp2018/2018-03-05-CostBasedOptimization2.html
new file mode 100644
index 00000000..aac71888
--- /dev/null
+++ b/slides/cse4562sp2018/2018-03-05-CostBasedOptimization2.html
@@ -0,0 +1,573 @@
+
+CSE 4/562 - Spring 2018
+
+ + +
+ + CSE 4/562 - Database Systems +
+ +
+ +
+

Cost Based Optimization

+

CSE 4/562 – Database Systems

+
March 5, 2018
+
+ + +
+
+

Remember the Real Goals

+
+
1. Accurately rank the plans.
2. Don't spend more time optimizing than you get back.
3. Don't pick a plan that uses more memory than you have.
+
+ +
+

Accounting

+

Figure out the cost of each individual operator.

+

Only count the number of IOs added by each operator.

+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operation | RA | IOs Added (#pages) | Memory (#tuples)
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$
Projection | $\pi(R)$ | $0$ | $O(1)$
Selection | $\sigma(R)$ | $0$ | $O(1)$
Union | $R \uplus S$ | $0$ | $O(1)$
Sort (In-Mem) | $\tau(R)$ | $0$ | $O(|R|)$
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$ | $O(\mathcal B)$
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$
+ +
+
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
+
+
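For intuition, here is a small sketch (not from the slides) that plugs hypothetical values for $|R|$, $\mathcal P$, and $\mathcal I$ into the Table Scan and (B+Tree) Index Scan rows above; every name and number is an illustrative assumption.

```python
import math

# Hypothetical statistics (illustrative assumptions, not real catalog values).
R_SIZE = 1_000_000   # |R|: number of tuples in R
P = 100              # P: tuples per page
I = 200              # I: keys per index page
SEL_C = 0.01         # fraction of R that satisfies the condition c

# Table Scan: |R| / P page IOs.
table_scan_ios = R_SIZE / P

# (B+Tree) Index Scan: log_I(|R|) IOs to descend the tree,
# plus |sigma_c(R)| / P IOs to read the matching tuples.
index_scan_ios = math.log(R_SIZE, I) + (SEL_C * R_SIZE) / P

print(f"table scan: {table_scan_ios:,.0f} page IOs")   # 10,000
print(f"index scan: {index_scan_ios:,.1f} page IOs")   # ~102.6
```

Even with crude numbers, the gap between a full scan and an index lookup is the kind of ranking signal the optimizer needs when choosing between access paths.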
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operation | RA | IOs Added (#pages) | Memory (#tuples)
Nested Loop Join (Buffer $S$ in mem) | $R \times S$ | $0$ | $O(|S|)$
Nested Loop Join (Buffer $S$ on disk) | $R \times_{disk} S$ | $(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ | $O(1)$
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $0$ | $O(|S|)$
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P}$ | $O(1)$
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort]
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$
(In-Mem) Aggregate | $\gamma_A(R)$ | $0$ | $O(adom(A))$
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort]
+ +
+
1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
2. Size of $R$ ($|R|$)
3. Pages of Buffer ($\mathcal B$)
4. Keys per Index Page ($\mathcal I$)
5. Number of distinct values of $A$ ($adom(A)$)
+
+
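As a rough worked comparison (assumed sizes, not real measurements), the same formulas can be evaluated for two of the join strategies above:

```python
# Illustrative sizes only; a real optimizer reads these from its catalog.
R_SIZE = 100_000   # |R| in tuples
S_SIZE = 500_000   # |S| in tuples
P = 100            # tuples per page

# Nested Loop Join with S buffered on disk:
# write S once, then rescan it for every tuple of R -> (1 + |R|) * |S|/P.
nlj_disk_ios = (1 + R_SIZE) * (S_SIZE / P)

# 2-Pass Hash Join: partition both inputs to disk, then read them back
# -> (2|R| + 2|S|) / P.
hash_2pass_ios = (2 * R_SIZE + 2 * S_SIZE) / P

print(f"NLJ (S on disk):  {nlj_disk_ios:,.0f} page IOs")    # 500,005,000
print(f"2-pass hash join: {hash_2pass_ios:,.0f} page IOs")  # 12,000
```

With these made-up sizes the 2-pass hash join wins by several orders of magnitude, which is exactly the kind of difference the cost model is meant to expose.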
+ + + +
+
+

Estimating IOs requires Estimating $|Q(R)|$

+
+ +
+

Cardinality Estimation

+

Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.

+ +

Also unlike estimating IOs, we care about the cardinality $|Q(R)|$ of the query as a whole, rather than the contribution of each individual operator.

+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operator | RA | Estimated Size
Table | $R$ | $|R|$
Projection | $\pi(Q)$ | $|Q|$
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$
Sort | $\tau(Q)$ | $|Q|$
Limit | $\texttt{LIMIT}_N(Q)$ | $N$
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$
+ +
+
• $\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
• $\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.
+
+ + + +
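Because every estimate in the table depends only on the estimates for its inputs, cardinality estimation composes bottom-up over the plan. A toy sketch of that recursion follows; the plan encoding and the $\texttt{SEL}$ / $\texttt{UNIQ}$ numbers are hypothetical assumptions, not the slides' implementation.

```python
# Toy bottom-up cardinality estimator over the operators in the table above.
def estimate(plan):
    op = plan["op"]
    if op == "table":
        return plan["rows"]                                   # |R|
    if op in ("project", "sort"):
        return estimate(plan["input"])                        # |Q|
    if op == "union":
        return estimate(plan["left"]) + estimate(plan["right"])
    if op == "select":
        return estimate(plan["input"]) * plan["sel"]          # |Q| * SEL(c, Q)
    if op == "join":
        return (estimate(plan["left"]) * estimate(plan["right"])
                * plan["sel"])                                # |Q1| * |Q2| * SEL(c, Q1 x Q2)
    if op in ("distinct", "aggregate"):
        return plan["uniq"]                                   # UNIQ(A, Q)
    raise ValueError(f"unknown operator: {op}")

# sigma_c(R join S) with made-up statistics:
plan = {"op": "select", "sel": 0.05,
        "input": {"op": "join", "sel": 0.0001,
                  "left":  {"op": "table", "rows": 10_000},
                  "right": {"op": "table", "rows": 50_000}}}
print(f"estimated output: {estimate(plan):,.0f} tuples")      # 2,500
```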
+

Cardinality Estimation

+

(The Hard Parts)

+ +
+
$\sigma_c(Q)$ (Cardinality Estimation)
+
How many tuples will a condition $c$ allow to pass?
+ +
$\delta_A(Q)$ (Distinct Values Estimation)
+
How many distinct values of attribute(s) $A$ exist?
+
+
+ +
+

Remember the Real Goals

+
+
1. Accurately rank the plans.
2. Don't spend more time optimizing than you get back.
+
+ +
+

(Some) Estimation Techniques

+ +
+
+
Guess Randomly
+
Rules of thumb if you have no other options...
+
+ +
+
Uniform Prior
+
Use basic statistics to make a very rough guess.
+
+ +
+
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+
+ +
+
Histograms
+
Using more detailed statistics for improved guesses.
+
+ +
+
Constraints
+
Using rules about the data for improved guesses.
+
+
+
+
+ + + +
+
+

(Some) Estimation Techniques

+ +
+
Guess Randomly
+
Rules of thumb if you have no other options...
+ +
Uniform Prior
+
Use basic statistics to make a very rough guess.
+ +
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+ +
Histograms
+
Using more detailed statistics for improved guesses.
+ +
Constraints
+
Using rules about the data for improved guesses.
+
+
+ +
+

The 10% Selectivity Rule

+ +

Every select or distinct operator passes 10% of all rows.

+ +
+ $$\sigma_{A = 1 \wedge B = 2}(R)$$ +
+
+ $$|\sigma_{A = 1 \wedge B = 2}(R)| = 0.1 \cdot |R|$$ +
+ +
+ $$\sigma_{A = 1}(\sigma_{B = 2}(R))$$ +
+
+ $$|\sigma_{A = 1}(\sigma_{B = 2}(R))| = 0.1 \cdot |\sigma_{B = 2}(R)| = 0.1 \cdot 0.1 \cdot |R|$$ +
+ +

(To keep estimates consistent, queries are typically rewritten into a standardized form first)

+ +

(The specific % varies by DBMS. E.g., Teradata uses 10% for the first AND clause, and 75% for every subsequent clause)

+
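A quick worked example of why standardization matters under the 10% rule, using a hypothetical $|R|$ of one million rows:

```python
# 10% rule applied to the two equivalent plans above. Each sigma operator
# contributes its own 10%, so the cascaded form is estimated at one tenth
# of the combined form; standardizing the query first avoids that skew.
R_SIZE = 1_000_000   # hypothetical |R|

combined = 0.1 * R_SIZE          # |sigma_{A=1 AND B=2}(R)|        -> 100,000
cascaded = 0.1 * 0.1 * R_SIZE    # |sigma_{A=1}(sigma_{B=2}(R))|   -> 10,000

print(combined, cascaded)
```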
+ +
+

The 10% rule is a fallback when everything else fails.
Usually, databases collect statistics...

+
+
+ + + +
+
+

(Some) Estimation Techniques

+ +
+
Guess Randomly
+
Rules of thumb if you have no other options...
+ +
Uniform Prior
+
Use basic statistics to make a very rough guess.
+ +
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+ +
Histograms
+
Using more detailed statistics for improved guesses.
+ +
Constraints
+
Using rules about the data for improved guesses.
+
+
+ +
+

Uniform Prior

+ +

We assume that for $\sigma_c(Q)$...

+
+
1. Basic statistics are known about $Q$:
   • COUNT(*)
   • COUNT(DISTINCT A) (for each A)
   • MIN(A), MAX(A) (for each numeric A)
2. Attribute values are uniformly distributed.
3. No inter-attribute correlations.
+

+ If (1) fails, fall back to the 10% rule. +

+

+ If (2) or (3) fails, it'll often still be a good enough estimate. +

+
+ +
+

Some Conditions

+ +

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
$P(A = x_1) = \frac{1}{\texttt{COUNT(DISTINCT A)}}$
$P(A \in (x_1, x_2, \ldots, x_N)) = \frac{N}{\texttt{COUNT(DISTINCT A)}}$
$P(A \leq x_1) = \frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(x_1 \leq A \leq x_2) = \frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(A = B) = \textbf{min}\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$
$P(c_1 \wedge c_2) = P(c_1) \cdot P(c_2)$
$P(c_1 \vee c_2) = 1 - (1 - P(c_1)) \cdot (1 - P(c_2))$
+ +

(With constants $x_1$, $x_2$, ...)

+
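A minimal sketch of these rules as code, assuming the catalog statistics (COUNT(DISTINCT A), MIN(A), MAX(A)) are handed in as plain numbers; the function names are made up for illustration:

```python
# Uniform-prior selectivity, following the probability rules above.
def p_eq_const(count_distinct_a):
    return 1.0 / count_distinct_a                 # P(A = x1)

def p_range(x1, x2, min_a, max_a):
    return (x2 - x1) / (max_a - min_a)            # P(x1 <= A <= x2)

def p_and(p1, p2):
    return p1 * p2                                # independence assumption

def p_or(p1, p2):
    return 1.0 - (1.0 - p1) * (1.0 - p2)

# SEL(A = 5 AND 10 <= B <= 20) with made-up catalog statistics:
sel = p_and(p_eq_const(count_distinct_a=100),
            p_range(10, 20, min_a=0, max_a=1000))
print(sel)   # ~0.0001 (0.01 * 0.01)
```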
+
+ +
+ + + + + + +