diff --git a/src/teaching/cse-562/2021sp/index.erb b/src/teaching/cse-562/2021sp/index.erb
index f06aac31..87ec2bbe 100644
--- a/src/teaching/cse-562/2021sp/index.erb
+++ b/src/teaching/cse-562/2021sp/index.erb
@@ -51,12 +51,16 @@ schedule:
materials:
slides: slide/2021-03-04-Indexing2.html
- date: "Mar. 9"
- topic: "Spark's Optimizer + Checkpoint 2"
- due: "Checkpoint 1"
- - date: "Mar. 11"
topic: "Cost-Based Optimization"
- - date: "Mar. 16"
+ due: "Checkpoint 1"
+ materials:
+ slides: slide/2021-03-09-CostOpt1.html
+ - date: "Mar. 11"
topic: "Cost-Based Optimization (contd.)"
+ materials:
+ slides: slide/2021-03-11-CostOpt2.html
+ - date: "Mar. 16"
+ topic: "Spark's Optimizer + Checkpoint 2"
- date: "Mar. 18"
topic: "Distributed Queries: Challenges + Partitioning"
- date: "Mar. 23"
diff --git a/src/teaching/cse-562/2021sp/slide/2021-02-18-QueryAlgorithms.erb b/src/teaching/cse-562/2021sp/slide/2021-02-18-QueryAlgorithms.erb
index b38ce150..d7be63f8 100644
--- a/src/teaching/cse-562/2021sp/slide/2021-02-18-QueryAlgorithms.erb
+++ b/src/teaching/cse-562/2021sp/slide/2021-02-18-QueryAlgorithms.erb
@@ -17,6 +17,23 @@ textbook: "Ch. 15.1-15.5, 16.7"
More similar examples with Union and Cross would also help.
Might help to tighten up the time spent a little too. I had to cut out before introducing Sort-Merge Joins
+
+
+-------
+ 2021 by OK:
+
+ Applied changes above. Things went better.
+
+ Looking at costs in terms of the "overhead" of each operator is proving to be *really*
+ hard for the students to grasp. I suspect it might be easier for the students to grasp
+ a recursive definition.
+
+ e.g., cost(\pi(R)) = cost(R)
+
+ This would, among other things, make the (B)NLJ cost a lot easier to specify.
+
+ I made these changes already to 03-09-CostOpt1, so they should probably be backported here next time I teach the class.
+
-->
If we can't get the exact cost of a plan, what can we do?
Idea 1: Run each plan.
Idea 2: Run each plan on a small sample of the data.
Idea 3: Analytically estimate the cost of a plan.

Figure out the IO cost of the entire* subtree. Only count the amount of memory added by each operator.
(* Different from earlier in the semester.)
(* unless $R$ is a query)

Estimating IOs requires estimating $|Q(R)|$ and $|\delta_A(Q(R))|$.
Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.
Also unlike estimating IOs, we care about the cardinality of $|Q(R)|$ as a whole, rather than the contribution of each individual operator.

Idea 1: Assume each selection filters down to 10% of the data. (no... really!)
$|\sigma_{c_1}(\sigma_{c_2}(R))| \neq |\sigma_{c_1 \wedge c_2}(R)|$
$|\sigma_{id = 1}(\texttt{STUDENTS})| = |\sigma_{residence = 'NY'}(\texttt{STUDENTS})|$
... but remember that all we need is to rank plans.
Many major databases (Oracle, Postgres, Teradata, etc.) use something like the 10% rule if they have nothing better. (The specific % varies by DBMS.)
(Teradata uses 10% for the first
We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...
+ If necessary statistics aren't available (point 1), fall back to the 10% rule.
+
+ If statistical assumptions (points 2, 3) aren't perfectly true, we'll still likely be getting a better estimate than the 10% rule.
+ $\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$
+ $\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$
+ $\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$
+ $$max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq\\ \texttt{UNIQ}(A, R \uplus S)\\ \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$$
+
+ $min_A(\pi_{A, \ldots}(R)) = min_A(R)$
+ $min_A(\sigma(R)) \approx min_A(R)$
+ $min_A(R \times S) = min_A(R)$ or $min_A(S)$
+ $min_A(R \uplus S) = min(min_A(R), min_A(S))$
+
+ Estimating $\delta_A(Q)$ requires only $\texttt{UNIQ}(A, Q)$.
+ Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)
+ (With constants $x_1$, $x_2$, ...)
+ ...but handles most usage patterns
+
+ Figure out the cost of each individual operator. Only count the number of IOs added by each operator.
+
+ Idea 1: Pick 100 tuples at random from each input table.
+ Assume: $\texttt{UNIQ}(A, R) = \texttt{UNIQ}(A, S) = N$
+
+ It takes $O(\sqrt{N})$ samples from both $R$ and $S$ to get even one match.
+ To be resumed later in the term when we talk about AQP.
+ How DBs Do It: Instrument queries while running them.
+
+ General Query Optimizers
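The $O(\sqrt{N})$ claim is easy to sanity-check empirically. A throwaway simulation (not from the slides; the key count and trial count are made-up values) that draws join keys for $R$ and $S$ until the two samples share a key:

```python
import random

def samples_until_match(n_keys, rng):
    """Draw join keys uniformly at random, alternating between R and S,
    until some key has been seen on both sides; return total draws."""
    seen_r, seen_s = set(), set()
    draws = 0
    while True:
        r_key, s_key = rng.randrange(n_keys), rng.randrange(n_keys)
        draws += 2
        seen_r.add(r_key)
        seen_s.add(s_key)
        if seen_r & seen_s:
            return draws

rng = random.Random(42)
N = 10_000
avg = sum(samples_until_match(N, rng) for _ in range(200)) / 200
# A match shows up after on the order of sqrt(N) draws, far fewer than N
print(round(avg), round(N ** 0.5))
```

The birthday-paradox argument: after $k$ draws per side there are roughly $k^2$ cross pairs, so a collision becomes likely once $k \approx \sqrt{N}$.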
+
+
+
+
+ Plan Cost
+
+
+ Remember the Real Goals
+
+
+ Accounting
+
+
+
+
+ Operation RA Total IOs (#pages) Memory (#tuples)
+
+ Table Scan
+ $R$
+ $\frac{|R|}{\mathcal P}$
+ $O(1)$
+
+
+ Projection
+ $\pi(R)$
+ $\textbf{io}(R)$
+ $O(1)$
+
+
+ Selection
+ $\sigma(R)$
+ $\textbf{io}(R)$
+ $O(1)$
+
+
+ Union
+ $R \uplus S$
+ $\textbf{io}(R) + \textbf{io}(S)$
+ $O(1)$
+
+
+ Sort (In-Mem)
+ $\tau(R)$
+ $\textbf{io}(R)$
+ $O(|R|)$
+
+
+ Sort (On-Disk)
+ $\tau(R)$
+ $\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P} + \textbf{io}(R)$
+ $O(\mathcal B)$
+
+
+ (B+Tree) Index Scan
+ $Index(R, c)$
+ $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$
+ $O(1)$
+
+
+ (Hash) Index Scan
+ $Index(R, c)$
+ $1$
+ $O(1)$
+
+
+
+
+
+
+ Operation RA Total IOs (#pages) Mem (#tuples)
+
+ Nested Loop Join (Buffer $S$ in mem)
+ $R \times_{mem} S$
+ $\textbf{io}(R)+\textbf{io}(S)$
+ $O(|S|)$
+
+
+ Block NLJ (Buffer $S$ on disk)
+ $R \times_{disk} S$
+ $\frac{|R|}{\mathcal B} \cdot \frac{|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$
+ $O(1)$
+
+
+ Block NLJ (Recompute $S$)
+ $R \times_{redo} S$
+ $\textbf{io}(R) + \frac{|R|}{\mathcal B} \cdot \textbf{io}(S)$
+ $O(1)$
+
+
+ 1-Pass Hash Join
+ $R \bowtie_{1PH, c} S$
+ $\textbf{io}(R) + \textbf{io}(S)$
+ $O(|S|)$
+
+
+ 2-Pass Hash Join
+ $R \bowtie_{2PH, c} S$
+ $\frac{2|R| + 2|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$
+ $O(1)$
+
+
+ Sort-Merge Join
+ $R \bowtie_{SM, c} S$
+ [Sort]
+ [Sort]
+
+
+ (Tree) Index NLJ
+ $R \bowtie_{INL, c} S$
+ $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$
+ $O(1)$
+
+
+ (Hash) Index NLJ
+ $R \bowtie_{INL, c} S$
+ $|R| \cdot 1$
+ $O(1)$
+
+
+ (In-Mem) Aggregate
+ $\gamma_A(R)$
+ $\textbf{io}(R)$
+ $adom(A)$
+
+
+ (Sort/Merge) Aggregate
+ $\gamma_A(R)$
+ [Sort]
+ [Sort]
+
+
+
+
+
+ Symbol Parameter Type
+
+ $\mathcal P$ Tuples Per Page
+ Fixed ($\frac{|\text{page}|}{|\text{tuple}|}$)
+
+
+ $|R|$ Size of $R$
+ Precomputed$^*$ ($|R|$)
+
+
+ $\mathcal B$ Pages of Buffer
+ Configurable Parameter
+
+
+ $\mathcal I$ Keys per Index Page
+ Fixed ($\frac{|\text{page}|}{|\text{key+pointer}|}$)
+
+
+ $adom(A)$ Number of distinct values of $A$
+ Precomputed$^*$ ($|\delta_A(R)|$)
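The instructor note at the top of this file suggests defining costs recursively (e.g., $cost(\pi(R)) = cost(R)$). A minimal sketch of that idea, covering only a few rows of the accounting table above; the tuple-based plan encoding and the value of $\mathcal P$ are illustrative assumptions, not part of the slides:

```python
from dataclasses import dataclass

P = 100  # tuples per page (illustrative value for the P symbol above)

@dataclass
class Table:
    name: str
    size: int  # |R| in tuples

def io(plan):
    """Total IOs in pages, defined recursively over the plan tree."""
    op, args = plan[0], plan[1:]
    if op == "scan":                      # R: |R| / P
        return args[0].size / P
    if op in ("project", "select"):       # pi(R), sigma(R): io(R)
        return io(args[0])
    if op in ("union", "nlj_mem"):        # R u S, in-mem NLJ: io(R) + io(S)
        return io(args[0]) + io(args[1])
    raise ValueError(f"unknown operator: {op}")

r = ("scan", Table("R", 10_000))
s = ("scan", Table("S", 2_000))
print(io(("project", ("union", r, s))))  # (10000 + 2000) / 100 = 120.0
```

The recursion makes the "total IOs of the subtree" reading unambiguous: each operator adds its own IOs on top of its children's.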
+ Cardinality Estimation
+
+
+
+
+
+
+ Operator
+ RA
+ Estimated Size
+
+
+
+ Table
+ $R$
+ $|R|$
+
+
+
+ Projection
+ $\pi(Q)$
+ $|Q|$
+
+
+
+ Union
+ $Q_1 \uplus Q_2$
+ $|Q_1| + |Q_2|$
+
+
+
+ Cross Product
+ $Q_1 \times Q_2$
+ $|Q_1| \times |Q_2|$
+
+
+
+ Sort
+ $\tau(Q)$
+ $|Q|$
+
+
+
+ Limit
+ $\texttt{LIMIT}_N(Q)$
+ $N$
+
+
+
+ Selection
+ $\sigma_c(Q)$
+ $|Q| \times \texttt{SEL}(c, Q)$
+
+
+
+ Join
+ $Q_1 \bowtie_c Q_2$
+ $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
+
+
+
+ Distinct
+ $\delta_A(Q)$
+ $\texttt{UNIQ}(A, Q)$
+
+
+ Aggregate
+ $\gamma_{A, B \leftarrow \Sigma}(Q)$
+ $\texttt{UNIQ}(A, Q)$
+
+
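The table above reads naturally as a recursive function over an RA plan. A sketch, assuming $\texttt{SEL}$ and $\texttt{UNIQ}$ are supplied externally (e.g., from precomputed statistics); the plan encoding and operator names are invented for illustration:

```python
def card(plan, sel, uniq):
    """Estimated output cardinality of an RA plan, one case per table row.
    sel(c) and uniq(a) are assumed externally supplied estimates."""
    op = plan[0]
    if op == "table":                     # R: precomputed |R|
        return plan[1]
    if op in ("project", "sort"):         # pi(Q), tau(Q): |Q|
        return card(plan[1], sel, uniq)
    if op == "union":                     # Q1 u Q2: |Q1| + |Q2|
        return card(plan[1], sel, uniq) + card(plan[2], sel, uniq)
    if op == "cross":                     # Q1 x Q2: |Q1| * |Q2|
        return card(plan[1], sel, uniq) * card(plan[2], sel, uniq)
    if op == "limit":                     # LIMIT_N(Q): N
        return plan[2]
    if op == "select":                    # sigma_c(Q): |Q| * SEL(c, Q)
        return card(plan[1], sel, uniq) * sel(plan[2])
    if op == "join":                      # Q1 join_c Q2: |Q1|*|Q2|*SEL(c)
        return (card(plan[1], sel, uniq) * card(plan[2], sel, uniq)
                * sel(plan[3]))
    if op in ("distinct", "aggregate"):   # delta_A(Q), gamma_A(Q): UNIQ(A, Q)
        return uniq(plan[2])
    raise ValueError(op)

# Example: sigma_c(R x S) with |R| = 1000, |S| = 500, SEL(c) = 0.01
est = card(("select", ("cross", ("table", 1000), ("table", 500)), "c"),
           sel=lambda c: 0.01, uniq=lambda a: None)
print(est)  # about 5000 (1000 * 500 * 0.01)
```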
+ Cardinality Estimation
+ (The Hard Parts)
+
+
+
+ ... there are problems
+ Inconsistent estimation
+ Too consistent estimation
+ AND clause, cut by another 75% for every subsequent clause)
+
+ (Some) Estimation Techniques
+
+
+
+ Uniform Prior
+
+
+
+
+
COUNT(*)
COUNT(DISTINCT A) (for each A)
MIN(A), MAX(A) (for each numeric A)
+ COUNT(DISTINCT A)
+ MIN(A), MAX(A)
+ COUNT(DISTINCT A)
Estimating Selectivity
+
+
+
+
+
+
+
+ $P(A = x_1)$
+ $=$
+ $\frac{1}{\texttt{COUNT(DISTINCT A)}}$
+
+
+
+ $P(A \in (x_1, x_2, \ldots, x_N))$
+ $=$
+ $\frac{N}{\texttt{COUNT(DISTINCT A)}}$
+
+
+
+ $P(A \leq x_1)$
+ $=$
+ $\frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
+
+
+
+ $P(x_1 \leq A \leq x_2)$
+ $=$
+ $\frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
+
+
+
+ $P(A = B)$
+ $=$
+ $\textbf{min}\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$
+
+
+
+ $P(c_1 \wedge c_2)$
+ $=$
+ $P(c_1) \cdot P(c_2)$
+
+
+ $P(c_1 \vee c_2)$
+ $=$
+ $1 - (1 - P(c_1)) \cdot (1 - P(c_2))$
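The uniform-prior formulas above translate directly into code. A sketch with invented function names; the statistical assumptions behind each formula (uniform spread, independence) are noted inline:

```python
def sel_eq_const(distinct_a):
    """P(A = x) under the uniform prior: 1 / COUNT(DISTINCT A)."""
    return 1.0 / distinct_a

def sel_in_list(n, distinct_a):
    """P(A in (x1, ..., xN)): N / COUNT(DISTINCT A)."""
    return n / distinct_a

def sel_range(x1, x2, min_a, max_a):
    """P(x1 <= A <= x2), assuming values spread evenly over [MIN, MAX]."""
    return (x2 - x1) / (max_a - min_a)

def sel_eq_attrs(distinct_a, distinct_b):
    """P(A = B): the smaller of the two point selectivities."""
    return min(1.0 / distinct_a, 1.0 / distinct_b)

def sel_and(p1, p2):
    """P(c1 AND c2), assuming c1 and c2 are independent."""
    return p1 * p2

def sel_or(p1, p2):
    """P(c1 OR c2), assuming c1 and c2 are independent."""
    return 1 - (1 - p1) * (1 - p2)

# e.g., P(A = x AND 10 <= B <= 20) with COUNT(DISTINCT A) = 50, B in [0, 100]
p = sel_and(sel_eq_const(50), sel_range(10, 20, 0, 100))
print(p)  # about 0.002
```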
+ Limitations
+
+
+
+ Remember the Real Goals
+
+
+ Accounting
+
+
+
+
+ Operation RA Total IOs (#pages) Memory (#tuples)
+
+ Table Scan
+ $R$
+ $\frac{|R|}{\mathcal P}$
+ $O(1)$
+
+
+ Projection
+ $\pi(R)$
+ $\textbf{io}(R)$
+ $O(1)$
+
+
+ Selection
+ $\sigma(R)$
+ $\textbf{io}(R)$
+ $O(1)$
+
+
+ Union
+ $R \uplus S$
+ $\textbf{io}(R) + \textbf{io}(S)$
+ $O(1)$
+
+
+ Sort (In-Mem)
+ $\tau(R)$
+ $\textbf{io}(R)$
+ $O(|R|)$
+
+
+ Sort (On-Disk)
+ $\tau(R)$
+ $\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P} + \textbf{io}(R)$
+ $O(\mathcal B)$
+
+
+ (B+Tree) Index Scan
+ $Index(R, c)$
+ $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$
+ $O(1)$
+
+
+ (Hash) Index Scan
+ $Index(R, c)$
+ $1$
+ $O(1)$
+
+
+
+
+
+
+ Operation RA Total IOs (#pages) Mem (#tuples)
+
+ Nested Loop Join (Buffer $S$ in mem)
+ $R \times_{mem} S$
+ $\textbf{io}(R)+\textbf{io}(S)$
+ $O(|S|)$
+
+
+ Block NLJ (Buffer $S$ on disk)
+ $R \times_{disk} S$
+ $\frac{|R|}{\mathcal B} \cdot \frac{|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$
+ $O(1)$
+
+
+ Block NLJ (Recompute $S$)
+ $R \times_{redo} S$
+ $\textbf{io}(R) + \frac{|R|}{\mathcal B} \cdot \textbf{io}(S)$
+ $O(1)$
+
+
+ 1-Pass Hash Join
+ $R \bowtie_{1PH, c} S$
+ $\textbf{io}(R) + \textbf{io}(S)$
+ $O(|S|)$
+
+
+ 2-Pass Hash Join
+ $R \bowtie_{2PH, c} S$
+ $\frac{2|R| + 2|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$
+ $O(1)$
+
+
+ Sort-Merge Join
+ $R \bowtie_{SM, c} S$
+ [Sort]
+ [Sort]
+
+
+ (Tree) Index NLJ
+ $R \bowtie_{INL, c} S$
+ $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$
+ $O(1)$
+
+
+ (Hash) Index NLJ
+ $R \bowtie_{INL, c} S$
+ $|R| \cdot 1$
+ $O(1)$
+
+
+ (In-Mem) Aggregate
+ $\gamma_A(R)$
+ $\textbf{io}(R)$
+ $adom(A)$
+
+
+ (Sort/Merge) Aggregate
+ $\gamma_A(R)$
+ [Sort]
+ [Sort]
+
+
+ Cardinality Estimation
+ (The Hard Parts)
+
+
+
+ Remember the Real Goals
+
+
+ (Some) Estimation Techniques
+
+
+
+ (Some) Estimation Techniques
+
+
+
+ The Birthday Paradox
+
+
It takes $O(\sqrt{N})$ samples from both $R$ and $S$ to get even one match.
+
+
+ Ideal Case: You have some
+ $$f(x) = \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$
+ (and similarly for the other aggregates)
+
+ Slightly Less Ideal Case: You have some
+ $$f(x) \approx \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$
+
+ If this sounds like CDF-based indexing... you're right!
+ ... but we're not going to talk about NNs today
+
+ Simpler/Faster Idea: Break $f(x)$ into chunks
+Name | YearsEmployed | Role |
---|---|---|
'Alice' | 3 | 1 |
'Bob' | 2 | 2 |
'Carol' | 3 | 1 |
'Dave' | 1 | 3 |
'Eve' | 2 | 2 |
'Fred' | 2 | 3 |
'Gwen' | 4 | 1 |
'Harry' | 2 | 3 |
YearsEmployed | COUNT |
---|---|
1 | 1 |
2 | 4 |
3 | 2 |
4 | 1 |
COUNT(DISTINCT YearsEmployed) | $= 4$ |
MIN(YearsEmployed) | $= 1$ |
MAX(YearsEmployed) | $= 4$ |
COUNT(*) WHERE YearsEmployed = 2 | $= 4$ |
YearsEmployed | COUNT |
---|---|
1-2 | 5 |
3-4 | 3 |
COUNT(DISTINCT YearsEmployed) | $= 4$ |
MIN(YearsEmployed) | $= 1$ |
MAX(YearsEmployed) | $= 4$ |
COUNT(*) WHERE YearsEmployed = 2 | $= \frac{5}{2}$ |
YearsEmployed | COUNT |
---|---|
1-4 | 8 |
COUNT(DISTINCT YearsEmployed) | $= 4$ |
MIN(YearsEmployed) | $= 1$ |
MAX(YearsEmployed) | $= 4$ |
COUNT(*) WHERE YearsEmployed = 2 | $= \frac{8}{4}$ |
Value | COUNT |
---|---|
1-10 | 20 |
11-20 | 0 |
21-30 | 15 |
31-40 | 30 |
41-50 | 22 |
51-60 | 63 |
61-70 | 10 |
71-80 | 10 |
SELECT … WHERE A = 33 | $= \frac{1}{40-30}\cdot 30 = 3$ |
SELECT … WHERE A > 33 | $= \frac{40-33}{40-30}\cdot 30 + 22 + 63 + 10 + 10 = 126$ |
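Both estimates above can be reproduced mechanically from the 8-bucket histogram. A sketch (not from the slides), treating each bucket as a half-open interval (lo, hi] so every bucket has width 10, matching the slide's arithmetic:

```python
# Buckets from the table above, as (lo, hi, count) with half-open (lo, hi]
hist = [(0, 10, 20), (10, 20, 0), (20, 30, 15), (30, 40, 30),
        (40, 50, 22), (50, 60, 63), (60, 70, 10), (70, 80, 10)]

def est_eq(hist, x):
    """Estimate COUNT(*) WHERE A = x: bucket count spread over its width."""
    for lo, hi, n in hist:
        if lo < x <= hi:
            return n / (hi - lo)
    return 0.0

def est_gt(hist, x):
    """Estimate COUNT(*) WHERE A > x."""
    total = 0.0
    for lo, hi, n in hist:
        if x <= lo:
            total += n                          # bucket entirely above x
        elif x <= hi:
            total += (hi - x) * n / (hi - lo)   # fraction of bucket above x
    return total

print(est_eq(hist, 33))  # 30 / (40 - 30) = 3.0
print(est_gt(hist, 33))  # (40-33)/(40-30) * 30 + 22 + 63 + 10 + 10 = 126.0
```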
+ CREATE TABLE R (
+ A int,
+ B int UNIQUE
+ ...
+ PRIMARY KEY (A)
+ );
+
+ No duplicate values in the column.
+ $$\texttt{COUNT(DISTINCT A)} = \texttt{COUNT(*)}$$
+
+ CREATE TABLE S (
+ B int,
+ ...
+ FOREIGN KEY (B) REFERENCES R(B)
+ );
+
+ All values in the column appear in another table.
+ $$\pi_{attrs(S)}\left(S \bowtie_B R\right) = S$$
+
+ Not expressible in SQL
+
+
+
+ One set of columns uniquely determines another.
+ $\pi_{A}(\delta(\pi_{A, B}(R)))$ has no duplicates and...
+ $$\pi_{attrs(R)-A}(R) \bowtie_A \delta(\pi_{A, B}(R)) = R$$
+
Foreign keys are like pointers. What happens with broken pointers?
+ Foreign keys are defined with update triggers ON INSERT [X], ON UPDATE [X], ON DELETE [X]. Depending on what [X] is, the constraint is enforced differently:
CASCADE
NO ACTION
SET NULL
+ CASCADE and NO ACTION ensure that the data never has broken pointers, so the constraint can be assumed to hold when estimating.
+
A generalization of keys: One set of attributes that uniquely identify another.
+ Two rows with the same As must have the same Bs
+(but can still have identical Bs for two different As)
+"All functional dependencies should be keys."
+(Otherwise you want two separate relations)
+(for more details, see CSE 560)
+ $$P(A = B) = min\left(\frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}, \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } B)}\right)$$
+
+ $$R \bowtie_{R.A = S.B} S = \sigma_{R.A = S.B}(R \times S)$$
+ (and $S.B$ is a foreign key referencing $R.A$)
+
+ The (foreign) key constraint gives us two things...
+ $$\texttt{COUNT}(\texttt{DISTINCT } A) \approx \texttt{COUNT}(\texttt{DISTINCT } B)$$
+ and
+ $$\texttt{COUNT}(\texttt{DISTINCT } A) = |R|$$
+
+ Based on the first property, the total number of rows is roughly...
+ $$|R| \times |S| \times \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}$$
+
+ Then based on the second property...
+ $$= |R| \times |S| \times \frac{1}{|R|} = |S|$$
+
+ (Statistics/Histograms will give you the same outcome... but constraints can be easier to propagate)
+
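A quick numeric check of the derivation above, with made-up table sizes: because $A$ is a key of $R$ and $S.B$ is a foreign key into $R.A$, the uniform-prior join estimate collapses to exactly $|S|$, which also matches the actual join size:

```python
import random

rng = random.Random(0)
R = list(range(1000))                      # R.A is a key: COUNT(DISTINCT A) = |R|
S = [rng.choice(R) for _ in range(5000)]   # S.B is a foreign key referencing R.A

# |R| * |S| * 1/COUNT(DISTINCT A), with COUNT(DISTINCT A) = |R|
estimate = len(R) * len(S) / len(R)

# Actual join size: A is a key, so every S row matches exactly one R row
r_keys = set(R)
actual = sum(1 for b in S if b in r_keys)

print(estimate, actual)  # 5000.0 5000
```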