diff --git a/src/teaching/cse-562/2021sp/index.erb b/src/teaching/cse-562/2021sp/index.erb index f06aac31..87ec2bbe 100644 --- a/src/teaching/cse-562/2021sp/index.erb +++ b/src/teaching/cse-562/2021sp/index.erb @@ -51,12 +51,16 @@ schedule: materials: slides: slide/2021-03-04-Indexing2.html - date: "Mar. 9" - topic: "Spark's Optimizer + Checkpoint 2" - due: "Checkpoint 1" - - date: "Mar. 11" topic: "Cost-Based Optimization" - - date: "Mar. 16" + due: "Checkpoint 1" + materials: + slides: slide/2021-03-09-CostOpt1.html + - date: "Mar. 11" topic: "Cost-Based Optimization (contd.)" + materials: + slides: slide/2021-03-11-CostOpt2.html + - date: "Mar. 16" + topic: "Spark's Optimizer + Checkpoint 2" - date: "Mar. 18" topic: "Distributed Queries: Challenges + Partitioning" - date: "Mar. 23" diff --git a/src/teaching/cse-562/2021sp/slide/2021-02-18-QueryAlgorithms.erb b/src/teaching/cse-562/2021sp/slide/2021-02-18-QueryAlgorithms.erb index b38ce150..d7be63f8 100644 --- a/src/teaching/cse-562/2021sp/slide/2021-02-18-QueryAlgorithms.erb +++ b/src/teaching/cse-562/2021sp/slide/2021-02-18-QueryAlgorithms.erb @@ -17,6 +17,23 @@ textbook: "Ch. 15.1-15.5, 16.7" More similar examples with Union and Cross would also help. Might help to tighten up the time spent a little too. I had to cut out before introducing Sort-Merge Joins + + +------- + 2021 by OK: + + Applied changes above. Things went better. + + Looking at costs in terms of the "overhead" of each operator is proving to be *really* + hard for the students to grasp. I suspect it might be easier for the students to grasp + a recursive definition. + + e.g., cost(\pi(R)) = cost(R) + + This would, among other things, make the (B)NLJ cost a lot easier to specify. + + I made these changes already to 03-09-CostOpt1, so they should probably be backported here next time I teach the class. + -->
diff --git a/src/teaching/cse-562/2021sp/slide/2021-03-04-Indexing2.html b/src/teaching/cse-562/2021sp/slide/2021-03-04-Indexing2.html index c8208fd1..8b63c9c8 100644 --- a/src/teaching/cse-562/2021sp/slide/2021-03-04-Indexing2.html +++ b/src/teaching/cse-562/2021sp/slide/2021-03-04-Indexing2.html @@ -1,5 +1,5 @@ --- -template: templates/cse4562_2019_slides.erb +template: templates/cse4562_2021_slides.erb title: "Indexing (Part 2) and Views" date: March 4, 2021 textbook: "Papers and Ch. 8.1-8.2" diff --git a/src/teaching/cse-562/2021sp/slide/2021-03-09-CostOpt1.erb b/src/teaching/cse-562/2021sp/slide/2021-03-09-CostOpt1.erb new file mode 100644 index 00000000..a47dbf22 --- /dev/null +++ b/src/teaching/cse-562/2021sp/slide/2021-03-09-CostOpt1.erb @@ -0,0 +1,555 @@ +--- +template: templates/cse4562_2021_slides.erb +title: "Cost-Based Optimization" +date: March 9, 2021 +textbook: Ch. 16 +--- + + + +
+
+

General Query Optimizers

+
    +
  1. Apply blind heuristics (e.g., push down selections)
  2. +
  3. Enumerate all possible execution plans by varying (or for a reasonable subset) +
      +
    • Join/Union Evaluation Order (commutativity, associativity, distributivity)
    • +
    • Algorithms for Joins, Aggregates, Sort, Distinct, and others
    • +
    • Data Access Paths
    • +
    +
  4. +
  5. Estimate the cost of each execution plan
  6. +
  7. Pick the execution plan with the lowest cost
  8. +
+
+
+ +
+
+

Idea 1: Run each plan

+
+ +
+ + © Paramount Pictures +
+ +
+

If we can't get the exact cost of a plan, what can we do?

+
+ +
+

Idea 2: Run each plan on a small sample of the data.

+

Idea 3: Analytically estimate the cost of a plan.

+
+ +
+

Plan Cost

+
+
+
CPU Time
+
How much time is spent processing.
+
+ +
+
# of IOs
+
How many random reads + writes go to disk.
+
+ +
+
Memory Required
+
How much memory do you need.
+
+
+
+ +
+ + Randall Munroe (cc-by-nc) +
+ +
+

Remember the Real Goals

+
    +
  1. Accurately rank the plans.
  2. +
  3. Don't spend more time optimizing than you get back.
  4. +
  5. Don't pick a plan that uses more memory than you have.
  6. +
+
+
+ + + +
+
+

Accounting

+

Figure out the IO cost of the entire* subtree.

+ +

Only count the amount of memory added by each operator.

+ + +

* Different from earlier in the semester.

+ +
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operation | RA | Total IOs (#pages) | Memory (#tuples)
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$
Projection | $\pi(R)$ | $\textbf{io}(R)$ | $O(1)$
Selection | $\sigma(R)$ | $\textbf{io}(R)$ | $O(1)$
Union | $R \uplus S$ | $\textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Sort (In-Mem) | $\tau(R)$ | $0$ | $O(|R|)$
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot \lfloor log_{\mathcal B}(|R|) \rfloor}{\mathcal P} + \textbf{io}(R)$ | $O(\mathcal B)$
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$
+ +
    +
  1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
  2. +
  3. Size of $R$ ($|R|$)
  4. +
  5. Pages of Buffer ($\mathcal B$)
  6. +
  7. Keys per Index Page ($\mathcal I$)
  8. +
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operation | RA | Total IOs (#pages) | Mem (#tuples)
Nested Loop Join (Buffer $S$ in mem) | $R \times_{mem} S$ | $\textbf{io}(R)+\textbf{io}(S)$ | $O(|S|)$
Block NLJ (Buffer $S$ on disk) | $R \times_{disk} S$ | $\frac{|R|}{\mathcal B} \cdot \frac{|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Block NLJ (Recompute $S$) | $R \times_{redo} S$ | $\textbf{io}(R) + \frac{|R|}{\mathcal B} \cdot \textbf{io}(S)$ | $O(1)$
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $\textbf{io}(R) + \textbf{io}(S)$ | $O(|S|)$
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort]
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$
(In-Mem) Aggregate | $\gamma_A(R)$ | $\textbf{io}(R)$ | $adom(A)$
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort]
+ +
    +
  1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
  2. +
  3. Size of $R$ ($|R|$)
  4. +
  5. Pages of Buffer ($\mathcal B$)
  6. +
  7. Keys per Index Page ($\mathcal I$)
  8. +
  9. Number of distinct values of $A$ ($adom(A)$)
  10. +
+
+ +
+ + + + + + + + + + + + + + + + + + + + + + +
Symbol | Parameter | Type
$\mathcal P$ | Tuples Per Page | Fixed ($\frac{|\text{page}|}{|\text{tuple}|}$)
$|R|$ | Size of $R$ | Precomputed$^*$ ($|R|$)
$\mathcal B$ | Pages of Buffer | Configurable Parameter
$\mathcal I$ | Keys per Index Page | Fixed ($\frac{|\text{page}|}{|\text{key+pointer}|}$)
$adom(A)$ | Number of distinct values of $A$ | Precomputed$^*$ ($|\delta_A(R)|$)
+

* unless $R$ is a query

+
+ +
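To make the subtree accounting concrete, here is a minimal Python sketch (the function names, the toy plan encoding, and the constant values for the parameters above are ours, not from the slides) of how the "Total IOs" column can be computed by structural recursion over a plan:

    # Assumed values for the parameters in the tables above (illustrative only).
    P = 100      # Tuples per page
    B = 1000     # Pages of buffer

    def io_cost(plan):
        """Total IOs (#pages) of the whole subtree rooted at `plan`."""
        op = plan["op"]
        if op == "scan":                          # Table Scan: |R| / P
            return plan["size"] / P
        if op in ("project", "select"):           # pi(R), sigma(R): io(R)
            return io_cost(plan["input"])
        if op == "union":                         # R union S: io(R) + io(S)
            return io_cost(plan["left"]) + io_cost(plan["right"])
        if op == "nlj_mem":                       # NLJ with S buffered in memory
            return io_cost(plan["left"]) + io_cost(plan["right"])
        if op == "block_nlj":                     # Block NLJ with S buffered on disk
            r, s = plan["left"], plan["right"]
            return (size(r) / B) * (size(s) / P) + io_cost(r) + io_cost(s)
        raise ValueError(f"no IO rule for {op}")

    def size(plan):
        """Input cardinality in tuples; a real optimizer would plug in the
        cardinality estimates discussed later in the lecture."""
        return plan["size"] if "size" in plan else size(plan["input"])

    # Example: sigma(scan(R)) over a 50,000-tuple table costs 500 page reads.
    print(io_cost({"op": "select", "input": {"op": "scan", "size": 50_000}}))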
+ + +
+
+

Estimating IOs requires Estimating $|Q(R)|$, $|\delta_A(Q(R))|$

+
+ +
+

Cardinality Estimation

+

Unlike estimating IOs, cardinality estimation doesn't care about the algorithm, so we'll just be working with raw RA.

+ +

Also unlike estimating IOs, we care about the cardinality $|Q(R)|$ of the query as a whole, rather than the contribution of each individual operator.

+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operator | RA | Estimated Size
Table | $R$ | $|R|$
Projection | $\pi(Q)$ | $|Q|$
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$
Sort | $\tau(Q)$ | $|Q|$
Limit | $\texttt{LIMIT}_N(Q)$ | $N$
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$
+ +
    +
  • $\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
  • +
  • $\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.
  • +
+
+ +
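A small Python sketch of these rules as a bottom-up recursion (the name est_size and the toy query encoding are ours; sel and uniq stand for whatever SEL/UNIQ estimators are plugged in, e.g. the 10% rule or the uniform prior discussed below):

    def est_size(q, sel, uniq):
        """Estimated |Q|, following the table above."""
        op = q["op"]
        if op == "table":
            return q["rows"]                                                    # |R|
        if op in ("project", "sort"):
            return est_size(q["input"], sel, uniq)                              # |Q|
        if op == "union":
            return est_size(q["left"], sel, uniq) + est_size(q["right"], sel, uniq)
        if op == "cross":
            return est_size(q["left"], sel, uniq) * est_size(q["right"], sel, uniq)
        if op == "limit":
            return q["n"]                                                       # N
        if op == "select":
            return est_size(q["input"], sel, uniq) * sel(q["cond"], q["input"])
        if op == "join":                       # |Q1| * |Q2| * SEL(c, Q1 x Q2)
            cross = {"op": "cross", "left": q["left"], "right": q["right"]}
            return est_size(cross, sel, uniq) * sel(q["cond"], cross)
        if op in ("distinct", "aggregate"):
            return uniq(q["attrs"], q["input"])                                 # UNIQ(A, Q)
        raise ValueError(f"no estimate rule for {op}")

    # With the 10% rule for SEL, a selection over a 1,000-row table is estimated at 100 rows.
    print(est_size({"op": "select", "cond": "c", "input": {"op": "table", "rows": 1000}},
                   sel=lambda c, q: 0.1, uniq=None))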
+

Cardinality Estimation

+

(The Hard Parts)

+ +
+
$\sigma_c(Q)$ (Cardinality Estimation)
+
How many tuples will a condition $c$ allow to pass?
+ +
$\delta_A(Q)$ (Distinct Values Estimation)
+
How many distinct values of attribute(s) $A$ exist?
+
+
+
+ +
+
+

Idea 1: Assume each selection filters down to 10% of the data.

+
+ +
+ +

no... really!

+ © Paramount Pictures +
+ +
+

... there are problems

+
+

Inconsistent estimation

+

$|\sigma_{c_1}(\sigma_{c_2}(R))| \neq |\sigma_{c_1 \wedge c_2}(R)|$

+
+
+

Too consistent estimation

+

$|\sigma_{id = 1}(\texttt{STUDENTS})| = |\sigma_{residence = 'NY'}(\texttt{STUDENTS})|$

+
+

... but remember that all we need is to rank plans.

+
+ +
+

Many major databases (Oracle, Postgres, Teradata, etc...) use something like the 10% rule if they have nothing better.

+ + +

(The specific % varies by DBMS.)

+ +

(Teradata uses 10% for the first AND clause,
cut by another 75% for every subsequent clause)

+
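As a concrete (hypothetical) rendering of that rule of thumb in Python, using the percentages quoted above; real systems differ in the exact numbers:

    def rule_of_thumb_selectivity(num_and_clauses):
        """10% for the first AND clause; every later clause keeps only 25%
        of what remains (i.e., is 'cut by another 75%')."""
        sel = 0.10
        for _ in range(num_and_clauses - 1):
            sel *= 0.25
        return sel

    print(rule_of_thumb_selectivity(1))   # 0.1
    print(rule_of_thumb_selectivity(3))   # ~0.00625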
+ +
+

(Some) Estimation Techniques

+ +
+
+
The 10% rule
+
Rules of thumb if you have no other options...
+
+ +
+
Uniform Prior
+
Use basic statistics to make a very rough guess.
+
+ +
+
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+
+ +
+
Histograms
+
Using more detailed statistics for improved guesses.
+
+ +
+
Constraints
+
Using rules about the data for improved guesses.
+
+
+
+
+ + + +
+ +
+

Uniform Prior

+ +

We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...

+
    +
  1. Basic statistics are known about $Q$:
      +
    • COUNT(*)
    • +
    • COUNT(DISTINCT A) (for each A)
    • +
    • MIN(A), MAX(A) (for each numeric A)
    • +
  2. +
  3. Attribute values are uniformly distributed.
  4. +
  5. No inter-attribute correlations.
  6. +
+

+ If necessary statistics aren't available (point 1), fall back to the 10% rule. +

+

+ If statistical assumptions (points 2, 3) aren't perfectly true, we'll likely still get a better estimate than the 10% rule. +

+
+ +
+

COUNT(DISTINCT A)

+

$\texttt{UNIQ}(A, \pi_{A, \ldots}(R)) = \texttt{UNIQ}(A, R)$

+

$\texttt{UNIQ}(A, \sigma(R)) \approx \texttt{UNIQ}(A, R)$

+

$\texttt{UNIQ}(A, R \times S) = \texttt{UNIQ}(A, R)$ or $\texttt{UNIQ}(A, S)$

+

$$max(\texttt{UNIQ}(A, R), \texttt{UNIQ}(A, S)) \leq\\ \texttt{UNIQ}(A, R \uplus S)\\ \leq \texttt{UNIQ}(A, R) + \texttt{UNIQ}(A, S)$$

+
+ +
+

MIN(A), MAX(A)

+

$min_A(\pi_{A, \ldots}(R)) = min_A(R)$

+

$min_A(\sigma(R)) \approx min_A(R)$

+

$min_A(R \times S) = min_A(R)$ or $min_A(S)$

+

$min_A(R \uplus S) = min(min_A(R), min_A(S))$

+
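These rules are easy to apply bottom-up over a query plan. A rough Python sketch (the stats-dictionary shape and the function name are ours, not from the slides), propagating the distinct count, min, and max for a single attribute A:

    def propagate(op, r_stats, s_stats=None):
        """Propagate {"distinct", "min", "max"} for an attribute A through one
        operator, using the (approximate) rules above."""
        if op in ("project", "select"):      # stats roughly unchanged
            return dict(r_stats)
        if op == "cross":                    # A comes from only one input
            return dict(r_stats)
        if op == "union":                    # distinct count is only bracketed; use the upper bound
            return {"distinct": r_stats["distinct"] + s_stats["distinct"],
                    "min": min(r_stats["min"], s_stats["min"]),
                    "max": max(r_stats["max"], s_stats["max"])}
        raise ValueError(f"no rule for {op}")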
+ +
+

Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)

+
+ +
+

Estimating Selectivity

+ +

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
$P(A = x_1) = \frac{1}{\texttt{COUNT(DISTINCT A)}}$
$P(A \in (x_1, x_2, \ldots, x_N)) = \frac{N}{\texttt{COUNT(DISTINCT A)}}$
$P(A \leq x_1) = \frac{x_1 - \texttt{MIN(A)}}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(x_1 \leq A \leq x_2) = \frac{x_2 - x_1}{\texttt{MAX(A)} - \texttt{MIN(A)}}$
$P(A = B) = \textbf{min}\left( \frac{1}{\texttt{COUNT(DISTINCT A)}}, \frac{1}{\texttt{COUNT(DISTINCT B)}} \right)$
$P(c_1 \wedge c_2) = P(c_1) \cdot P(c_2)$
$P(c_1 \vee c_2) = 1 - (1 - P(c_1)) \cdot (1 - P(c_2))$
+ +

(With constants $x_1$, $x_2$, ...)

+
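A direct Python transcription of these rules (the stats dict holding the precomputed COUNT(DISTINCT), MIN, and MAX per attribute, and the function names, are assumptions for illustration):

    def p_eq_const(stats, a):                   # P(A = x1)
        return 1.0 / stats[a]["distinct"]

    def p_in_list(stats, a, n):                 # P(A in (x1, ..., xN))
        return n / stats[a]["distinct"]

    def p_range(stats, a, x1, x2):              # P(x1 <= A <= x2)
        return (x2 - x1) / (stats[a]["max"] - stats[a]["min"])

    def p_eq_attr(stats, a, b):                 # P(A = B)
        return min(1.0 / stats[a]["distinct"], 1.0 / stats[b]["distinct"])

    def p_and(p1, p2):                          # P(c1 AND c2), assuming independence
        return p1 * p2

    def p_or(p1, p2):                           # P(c1 OR c2)
        return 1.0 - (1.0 - p1) * (1.0 - p2)

    # Example: A has 20 distinct values spread uniformly over [0, 100].
    stats = {"A": {"distinct": 20, "min": 0, "max": 100}}
    print(p_eq_const(stats, "A"))               # 0.05
    print(p_range(stats, "A", 10, 30))          # 0.2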
+ +
+

Limitations

+ +
+
+
Don't always have statistics for $Q$
+
For example, $\pi_{A \leftarrow (B \cdot C)}(R)$
+
+ +
+
Don't always have clear rules for $c$
+
For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
+
+ +
+
Attribute values are not always uniformly distributed.
+
For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
+
+ +
+
Attribute values are sometimes correlated.
+
For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$
+
+
+

...but handles most usage patterns

+
+ +
+ ... more next class! +
+ +
\ No newline at end of file diff --git a/src/teaching/cse-562/2021sp/slide/2021-03-09/EstimationXKCD.png b/src/teaching/cse-562/2021sp/slide/2021-03-09/EstimationXKCD.png new file mode 100644 index 00000000..f31d42c6 Binary files /dev/null and b/src/teaching/cse-562/2021sp/slide/2021-03-09/EstimationXKCD.png differ diff --git a/src/teaching/cse-562/2021sp/slide/2021-03-11-CostOpt2.erb b/src/teaching/cse-562/2021sp/slide/2021-03-11-CostOpt2.erb new file mode 100644 index 00000000..fd59dc4d --- /dev/null +++ b/src/teaching/cse-562/2021sp/slide/2021-03-11-CostOpt2.erb @@ -0,0 +1,606 @@ +--- +template: templates/cse4562_2021_slides.erb +title: "Cost-Based Optimization" +date: March 11, 2021 +textbook: Ch. 16 +--- + +
+
+

Remember the Real Goals

+
    +
  1. Accurately rank the plans.
  2. +
  3. Don't spend more time optimizing than you get back.
  4. +
  5. Don't pick a plan that uses more memory than you have.
  6. +
+
+ +
+

Accounting

+

Figure out the IO cost of the entire subtree.

+

Only count the amount of memory added by each operator.

+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operation | RA | Total IOs (#pages) | Memory (#tuples)
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$
Projection | $\pi(R)$ | $\textbf{io}(R)$ | $O(1)$
Selection | $\sigma(R)$ | $\textbf{io}(R)$ | $O(1)$
Union | $R \uplus S$ | $\textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Sort (In-Mem) | $\tau(R)$ | $\textbf{io}(R)$ | $O(|R|)$
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot \lfloor log_{\mathcal B}(|R|) \rfloor}{\mathcal P} + \textbf{io}(R)$ | $O(\mathcal B)$
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$
+ +
    +
  1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
  2. +
  3. Size of $R$ ($|R|$)
  4. +
  5. Pages of Buffer ($\mathcal B$)
  6. +
  7. Keys per Index Page ($\mathcal I$)
  8. +
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operation | RA | Total IOs (#pages) | Mem (#tuples)
Nested Loop Join (Buffer $S$ in mem) | $R \times_{mem} S$ | $\textbf{io}(R)+\textbf{io}(S)$ | $O(|S|)$
Block NLJ (Buffer $S$ on disk) | $R \times_{disk} S$ | $\frac{|R|}{\mathcal B} \cdot \frac{|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Block NLJ (Recompute $S$) | $R \times_{redo} S$ | $\textbf{io}(R) + \frac{|R|}{\mathcal B} \cdot \textbf{io}(S)$ | $O(1)$
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $\textbf{io}(R) + \textbf{io}(S)$ | $O(|S|)$
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P} + \textbf{io}(R) + \textbf{io}(S)$ | $O(1)$
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort]
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$
(In-Mem) Aggregate | $\gamma_A(R)$ | $0$ | $adom(A)$
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort]
+ +
    +
  1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
  2. +
  3. Size of $R$ ($|R|$)
  4. +
  5. Pages of Buffer ($\mathcal B$)
  6. +
  7. Keys per Index Page ($\mathcal I$)
  8. +
  9. Number of distinct values of $A$ ($adom(A)$)
  10. +
+
+
+ +
+
+

Cardinality Estimation

+

(The Hard Parts)

+ +
+
$\sigma_c(Q)$ (Cardinality Estimation)
+
How many tuples will a condition $c$ allow to pass?
+ +
$\delta_A(Q)$ (Distinct Values Estimation)
+
How many distinct values of attribute(s) $A$ exist?
+
+
+ +
+

Remember the Real Goals

+
    +
  1. Accurately rank the plans.
  2. +
  3. Don't spend more time optimizing than you get back.
  4. +
+
+ +
+

(Some) Estimation Techniques

+ +
+
+
Guess Randomly
+
Rules of thumb if you have no other options...
+
+ +
+
Uniform Prior
+
Use basic statistics to make a very rough guess.
+
+ +
+
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+
+ +
+
Histograms
+
Using more detailed statistics for improved guesses.
+
+ +
+
Constraints
+
Using rules about the data for improved guesses.
+
+
+
+
+ + +
+
+

(Some) Estimation Techniques

+ +
+
Guess Randomly
+
Rules of thumb if you have no other options...
+ +
Uniform Prior
+
Use basic statistics to make a very rough guess.
+ +
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+ +
Histograms
+
Using more detailed statistics for improved guesses.
+ +
Constraints
+
Using rules about the data for improved guesses.
+
+
+ +
+

Idea 1: Pick 100 tuples at random from each input table.

+
+ +
+ +
+ +
+

The Birthday Paradox

+ +

+ Assume: $\texttt{UNIQ}(A, R) = \texttt{UNIQ}(A, S) = N$ +

+ +

+ It takes $O(\sqrt{N})$ samples from both $R$ and $S$
to get even one match. +

+
+ +
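A quick (hypothetical) simulation of that effect: draw one tuple from each side per step, with join keys uniform over N distinct values, and count the steps until the two samples share a key. The average grows roughly like the square root of N:

    import random

    def steps_until_match(n_distinct, trials=200):
        total = 0
        for _ in range(trials):
            seen_r, seen_s, steps = set(), set(), 0
            while True:
                steps += 1
                r_key = random.randrange(n_distinct)   # key of the next sample from R
                s_key = random.randrange(n_distinct)   # key of the next sample from S
                if r_key == s_key or r_key in seen_s or s_key in seen_r:
                    break                              # the two samples now join
                seen_r.add(r_key)
                seen_s.add(s_key)
            total += steps
        return total / trials

    for n in (100, 10_000, 1_000_000):
        print(n, round(steps_until_match(n), 1))       # grows roughly like sqrt(n)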
+

To be resumed later in the term when we talk about AQP

+
+ +
+

How DBs Do It: Instrument queries while running them.

    +
  • The first time you run a query it might be slow.
  • +
  • The second, third, fourth, etc... times it'll be fast.
  • +

+
+
+ +
+ +
+

(Some) Estimation Techniques

+ +
+
Guess Randomly
+
Rules of thumb if you have no other options...
+ +
Uniform Prior
+
Use basic statistics to make a very rough guess.
+ +
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+ +
Histograms
+
Using more detailed statistics for improved guesses.
+ +
Constraints
+
Using rules about the data for improved guesses.
+
+
+ +
+

Limitations of Uniform Prior

+ +
+
+
Don't always have statistics for $Q$
+
For example, $\pi_{A \leftarrow (B \times C)}(R)$
+
+ +
+
Don't always have clear rules for $c$
+
For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
+
+ +
+
Attribute values are not always uniformly distributed.
+
For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
+
+ +
+
Attribute values are sometimes correlated.
+
For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$
+
+ +
+
+ +
+

+ Ideal Case: You have some + $$f(x) = \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$ + (and similarly for the other aggregates) +

+

+ Slightly Less Ideal Case: You have some + $$f(x) \approx \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$ +

+
+ +
+

If this sounds like CDF-based indexing... you're right!

+ +

... but we're not going to talk about NNs today

+
+
+ +
+
+

+ Simpler/Faster Idea: Break $f(x)$ into chunks +

+
+ +
+

Example Data

+ + + + + + + + + + +
Name YearsEmployed Role
'Alice' 3 1
'Bob' 2 2
'Carol' 3 1
'Dave' 1 3
'Eve' 2 2
'Fred' 2 3
'Gwen' 4 1
'Harry' 2 3
+
+ +
+

Histograms

+ + + + + + +
YearsEmployed COUNT
1 1
2 4
3 2
4 1
+ + + + + + +
COUNT(DISTINCT YearsEmployed) $= 4$
MIN(YearsEmployed) $= 1$
MAX(YearsEmployed) $= 4$
COUNT(*) WHERE YearsEmployed = 2 $= 4$
+
+ +
+

Histograms

+ + + + +
YearsEmployed COUNT
1-2 5
3-4 3
+ + + + + + +
COUNT(DISTINCT YearsEmployed) $= 4$
MIN(YearsEmployed) $= 1$
MAX(YearsEmployed) $= 4$
COUNT(*) WHERE YearsEmployed = 2 $= \frac{5}{2}$
+
+ +
+

The Extreme Case

+ + + +
YearsEmployed COUNT
1-4 8
+ + + + + + +
COUNT(DISTINCT YearsEmployed) $= 4$
MIN(YearsEmployed) $= 1$
MAX(YearsEmployed) $= 4$
COUNT(*) WHERE YearsEmployed = 2 $= \frac{8}{4}$
+
+ +
+

More Example Data

+ + + + + + + + + + +
Value COUNT
1-10 20
11-20 0
21-30 15
31-40 30
41-50 22
51-60 63
61-70 10
71-80 10
+ + + + + + + + + + + +
SELECT … WHERE A = 33 $= \frac{1}{40-30}\cdot 30 = 3$
SELECT … WHERE A > 33 $= \frac{40-33}{40-30}\cdot 30+22+63+10+10 = 126$
+
+
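A sketch of those two estimates in Python, over the bucket counts above (buckets are encoded as (lo, hi, count); values are assumed evenly spread within each 10-value-wide bucket, so the widths below match the 40 - 30 denominator on the slide):

    buckets = [(1, 10, 20), (11, 20, 0), (21, 30, 15), (31, 40, 30),
               (41, 50, 22), (51, 60, 63), (61, 70, 10), (71, 80, 10)]

    def est_eq(x):
        """Estimated COUNT(*) WHERE A = x."""
        for lo, hi, count in buckets:
            if lo <= x <= hi:
                return count / (hi - lo + 1)        # count spread over the bucket width (10)
        return 0

    def est_gt(x):
        """Estimated COUNT(*) WHERE A > x."""
        total = 0
        for lo, hi, count in buckets:
            if x < lo:                              # bucket entirely above x
                total += count
            elif lo <= x <= hi:                     # partial bucket containing x
                total += (hi - x) / (hi - lo + 1) * count
        return total

    print(est_eq(33))    # 3.0
    print(est_gt(33))    # 126.0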
+ +
+
+

(Some) Estimation Techniques

+ +
+
Guess Randomly
+
Rules of thumb if you have no other options...
+ +
Uniform Prior
+
Use basic statistics to make a very rough guess.
+ +
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+ +
Histograms
+
Using more detailed statistics for improved guesses.
+ +
Constraints
+
Using rules about the data for improved guesses.
+
+
+ +
+

Key / Unique Constraints

+

+      CREATE TABLE R ( 
+        A int,
+        B int UNIQUE
+        ... 
+        PRIMARY KEY (A)
+      );
+    
+

+ No duplicate values in the column. + $$\texttt{COUNT(DISTINCT A)} = \texttt{COUNT(*)}$$ +

+
+ +
+

Foreign Key Constraints

+

+      CREATE TABLE S ( 
+        B int,
+        ... 
+        FOREIGN KEY (B) REFERENCES R(B)
+      );
+    
+

+ All values in the column appear in another table. + $$\pi_{attrs(S)}\left(S \bowtie_B R\right) \subseteq S$$ +

+
+ +
+

Functional Dependencies

+ +

+      Not expressible in SQL
+    
+ +

+ One set of columns uniquely determines another.
+ $\pi_{A}(\delta(\pi_{A, B}(R)))$ has no duplicates and... + $$\pi_{attrs(R)-B}(R) \bowtie_A \delta(\pi_{A, B}(R)) = R$$ +

+
+ +
+

Constraints

+ +

The Good

+
    +
  • Sanity check on your data: Inconsistent data triggers failures.
  • +
  • More opportunities for query optimization.
  • +
+ +

The Not-So Good

+
    +
  • Validating constraints whenever data changes is (usually) expensive.
  • +
  • Inconsistent data triggers failures.
  • +
+ +
+ +
+

Foreign Key Constraints

+ +

Foreign keys are like pointers. What happens with broken pointers?

+
+ +
+

Foreign Key Enforcement

+ +

Foreign keys are declared with referential actions ON UPDATE [X] and ON DELETE [X] (an insert that would create an invalid reference is simply rejected). Depending on what [X] is, the constraint is enforced differently:

+ +
+
CASCADE
+
Create/delete rows as needed to avoid invalid foreign keys.
+ +
NO ACTION
+
Abort any transaction that ends with an invalid foreign key reference.
+ +
SET NULL
+
Automatically replace any invalid foreign key references with NULL
+
+
+ +
+

+ CASCADE and NO ACTION ensure that the data never has broken pointers, so +

+ $$\pi_{attrs(S)}\left(S \bowtie_B R\right) = S$$ +
+ +
+

Functional Dependencies

+ +

A generalization of keys: One set of attributes that uniquely identify another.

+ +
    +
  • SS# uniquely identifies Name.
  • +
  • Employee uniquely identifies Manager.
  • +
  • Order number uniquely identifies Customer Address.
  • +
+ +

Two rows with the same As must have the same Bs

+

(but can still have identical Bs for two different As)

+
+ +
+

Normal Forms

+

"All functional dependencies should be keys."

+

(Otherwise you want two separate relations)

+

(for more details, see CSE 560)

+
+ +
+ +

+ $$P(A = B) = min\left(\frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}, \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } B)}\right)$$ +

+ +
+
+ +

+ $$R \bowtie_{R.A = S.B} S = \sigma_{R.A = S.B}(R \times S)$$ + (and $S.B$ is a foreign key referencing $R.A$) +

+ +

+ The (foreign) key constraint gives us two things... + $$\texttt{COUNT}(\texttt{DISTINCT } A) \approx \texttt{COUNT}(\texttt{DISTINCT } B)$$ + and + $$\texttt{COUNT}(\texttt{DISTINCT } A) = |R|$$ +

+ +

+ Based on the first property the total number of rows is roughly... + $$|R| \times |S| \times \frac{1}{\texttt{COUNT}(\texttt{DISTINCT } A)}$$ +

+ +

+ Then based on the second property... + $$ = |R| \times |S| \times \frac{1}{|R|} = |S|$$ +

+ +

(Statistics/Histograms will give you the same outcome... but constraints can be easier to propagate)

+
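A tiny sanity check of that derivation on made-up data (Python; the table sizes are arbitrary): with S.B a foreign key into the key R.A, the exact join size and the uniform-prior estimate both come out to |S|:

    import random

    R = list(range(1000))                              # R.A: 1,000 distinct key values
    S = [random.choice(R) for _ in range(5000)]        # S.B: 5,000 foreign keys into R.A

    r_keys = set(R)
    exact = sum(1 for b in S if b in r_keys)           # |R join_{A=B} S|, since A is unique
    estimate = len(R) * len(S) / len(r_keys)           # |R| * |S| * 1/COUNT(DISTINCT A)

    print(exact, estimate)                             # both 5000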
+
+ diff --git a/src/teaching/cse-562/2021sp/slide/2021-03-11/JoinIssue.svg b/src/teaching/cse-562/2021sp/slide/2021-03-11/JoinIssue.svg new file mode 100644 index 00000000..b84b5f70 --- /dev/null +++ b/src/teaching/cse-562/2021sp/slide/2021-03-11/JoinIssue.svg @@ -0,0 +1,279 @@ + + + + + + + + + + image/svg+xml + + + + + + + + + σ + R + S + T + + + + + + + + 100 Tuples + + + + 10 Tuples + + + + 100 Tuples + + + + 0 Tuples + + + + 0 Tuples + + +