diff --git a/slides/cse4562sp2018/2018-02-28-CostBasedOptimization1.html b/slides/cse4562sp2018/2018-02-28-CostBasedOptimization1.html index d02130e6..5c2b95d5 100644 --- a/slides/cse4562sp2018/2018-02-28-CostBasedOptimization1.html +++ b/slides/cse4562sp2018/2018-02-28-CostBasedOptimization1.html @@ -159,7 +159,7 @@
Figure out the cost of each individual operator.
Only count the number of IOs added by each operator.
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$ |
Projection | $\pi(R)$ | $0$ | $O(1)$ |
Selection | $\sigma(R)$ | $0$ | $O(1)$ |
Union | $R \cup S$ | $0$ | $O(1)$ |
Sort (In-Mem) | $\tau(R)$ | $0$ | $O(|R|)$ |
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$ | $O(\mathcal B)$ |
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$ |
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$ |
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Nested Loop Join (Buffer $S$ in mem) | $R \times S$ | $0$ | $O(|S|)$ |
Nested Loop Join (Buffer $S$ on disk) | $R \times_{disk} S$ | $(1 + |R|) \cdot \frac{|S|}{\mathcal P}$ | $O(1)$ |
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $0$ | $O(|S|)$ |
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P}$ | $O(1)$ |
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort] |
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot \left( \log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P} \right)$ | $O(1)$ |
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$ |
(In-Mem) Aggregate | $\gamma_A(R)$ | $0$ | $O(|adom(A)|)$ |
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort] |
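As a concrete (toy) illustration of the two cost tables above, the sketch below sums the "IOs Added" column for a small plan: a B+Tree index scan of $R$, a selection, and a 2-pass hash join with $S$. All sizes, the page capacity $\mathcal P$, and the index fanout $\mathcal I$ are invented for the example:

```python
import math

# Assumed parameters (not from the slides): tuples per page, index fanout,
# relation sizes, and how many tuples the index scan's condition passes.
P = 100          # P: tuples per page
I = 50           # I: B+Tree fanout
R_size = 10_000  # |R|
S_size = 40_000  # |S|
sel_R = 2_000    # |sigma_c(R)|

ios = 0.0
ios += math.log(R_size, I) + sel_R / P  # (B+Tree) Index Scan: log_I(|R|) + |sigma_c(R)|/P
ios += 0                                # Selection: adds no IOs
ios += (2 * sel_R + 2 * S_size) / P     # 2-Pass Hash Join: (2|R'| + 2|S|)/P

print(round(ios, 2))
```

Only the index scan and the 2-pass join contribute; pipelined operators like selection add nothing, which is exactly why the table tracks IOs *added* rather than total IOs.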
Estimating IOs requires estimating $|Q(R)|$.

Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.

Also unlike estimating IOs, we care about the cardinality of the result $|Q(R)|$ as a whole, rather than the contribution of each individual operator.
Operator | RA | Estimated Size |
---|---|---|
Table | $R$ | $|R|$ |
Projection | $\pi(Q)$ | $|Q|$ |
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$ |
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$ |
Sort | $\tau(Q)$ | $|Q|$ |
Limit | $\texttt{LIMIT}_N(Q)$ | $N$ |
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$ |
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1 \times Q_2)$ |
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$ |
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$ |
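Because these estimates ignore the algorithm entirely, they compose as a simple recursion over the RA tree. A minimal sketch of the table above (the plan encoding and names are mine, not the slides'):

```python
# Plans are nested tuples, e.g. ("select", child, SEL(c, child)).
def card(q):
    op = q[0]
    if op == "table":   return q[1]                     # |R| from statistics
    if op == "project": return card(q[1])               # |Q|
    if op == "union":   return card(q[1]) + card(q[2])  # |Q1| + |Q2|
    if op == "cross":   return card(q[1]) * card(q[2])  # |Q1| * |Q2|
    if op == "sort":    return card(q[1])               # |Q|
    if op == "limit":   return min(q[2], card(q[1]))    # N
    if op == "select":  return card(q[1]) * q[2]        # |Q| * SEL(c, Q)
    if op == "join":    return card(q[1]) * card(q[2]) * q[3]
    raise ValueError(f"unknown operator: {op}")

# sigma_c(R x S) with |R| = 1000, |S| = 500, SEL(c) = 0.01
plan = ("select", ("cross", ("table", 1000), ("table", 500)), 0.01)
print(round(card(plan)))  # 1000 * 500 * 0.01 = 5000
```

Note that the entire estimate for the selection depends on one number we haven't explained yet: $\texttt{SEL}(c, Q)$.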
Every select or distinct operator passes 10% of all rows.

(Queries are typically standardized first.)

(The specific % varies by DBMS. E.g., Teradata uses 10% for the first AND clause, and 75% for every subsequent clause.)

The 10% rule is a fallback when everything else fails.
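To see why the exact fallback percentage matters, compare how the two policies just described compound over a conjunction of three predicates (the row count is invented; the percentages are the ones quoted above):

```python
rows = 1_000_000
preds = 3  # c1 AND c2 AND c3

# Flat rule: every predicate passes 10% of rows.
flat = rows * 0.10 ** preds

# Teradata-style: 10% for the first AND clause, 75% for each one after.
teradata = rows * 0.10 * 0.75 ** (preds - 1)

print(round(flat), round(teradata))  # 1000 56250
```

The flat rule shrinks estimates geometrically, so plans with many ANDed predicates look far cheaper than they usually are; damping later clauses keeps the estimate more conservative.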
Usually, databases collect statistics...

We assume that for $\sigma_c(Q)$...

1. `COUNT(*)`
2. `COUNT(DISTINCT A)` (for each $A$)
3. `MIN(A)`, `MAX(A)` (for each numeric $A$)

If (1) fails, fall back to the 10% rule.

If (2) or (3) fails, it'll often still be a good enough estimate.
Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$).

Predicate | | Estimate |
---|---|---|
$P(A = x_1)$ | $=$ | $\frac{1}{\texttt{COUNT(DISTINCT A)}}$ |
$P(A \in (x_1, x_2, \ldots, x_N))$ | $=$ | $\frac{N}{\texttt{COUNT(DISTINCT A)}}$ |
$P(A \leq x_1)$ | $=$ | $\frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$ |
$P(x_1 \leq A \leq x_2)$ | $=$ | $\frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$ |
$P(A = B)$ | $=$ | $\textbf{min}\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$ |
$P(c_1 \wedge c_2)$ | $=$ | $P(c_1) \cdot P(c_2)$ |
$P(c_1 \vee c_2)$ | $=$ | $1 - (1 - P(c_1)) \cdot (1 - P(c_2))$ |

(With constants $x_1$, $x_2$, ...)
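The table above translates almost directly into code. A hedged sketch, with the stats dictionary and all stat values invented, assuming uniformly distributed and independent columns:

```python
# Per-column statistics (invented): COUNT(DISTINCT A), MIN(A), MAX(A).
stats = {"A": {"distinct": 200, "min": 0, "max": 1000}}

def sel_eq(col):
    """P(A = x1) = 1 / COUNT(DISTINCT A)"""
    return 1 / stats[col]["distinct"]

def sel_range(col, lo, hi):
    """P(x1 <= A <= x2) = (x2 - x1) / (MAX(A) - MIN(A)); assumes uniformity."""
    s = stats[col]
    return (hi - lo) / (s["max"] - s["min"])

def sel_and(p1, p2):
    """P(c1 AND c2) = P(c1) * P(c2); assumes independence."""
    return p1 * p2

def sel_or(p1, p2):
    """P(c1 OR c2) = 1 - (1 - P(c1)) * (1 - P(c2))"""
    return 1 - (1 - p1) * (1 - p2)

# P(A = x1  AND  100 <= A <= 300) = (1/200) * (200/1000), roughly 0.001
p = sel_and(sel_eq("A"), sel_range("A", 100, 300))
print(p)
```

Both the uniformity and the independence assumptions can be badly wrong on real data (e.g., correlated columns), which is one motivation for the richer statistics, like histograms, that real optimizers maintain.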