diff --git a/slides/cse4562sp2018/2018-03-05-CostBasedOptimization2.html b/slides/cse4562sp2018/2018-03-05-CostBasedOptimization2.html
index 7d001ee6..7b8c80be 100644
--- a/slides/cse4562sp2018/2018-03-05-CostBasedOptimization2.html
+++ b/slides/cse4562sp2018/2018-03-05-CostBasedOptimization2.html
@@ -432,7 +432,7 @@

Uniform Prior

-

We assume that for $\sigma_c(Q)$...

+

We assume that for $\sigma_c(Q)$ or $\delta_A(Q)$...

  1. Basic statistics are known about $Q$:
    • COUNT(*)
@@ -451,7 +451,11 @@
-

Some Conditions

+

Estimating $\delta_A(Q)$ requires only COUNT(DISTINCT A)

+
+ +
+

Estimating Selectivity

Selectivity is a probability ($\texttt{SEL}(c, Q) = P(c)$)

@@ -553,35 +557,36 @@

Idea 1: Pick 100 tuples at random from each input table.

+ +
+ +
+ +
+

The Birthday Paradox

+ +

+ Assume: $\texttt{UNIQ}(A, R) = \texttt{UNIQ}(A, S) = N$ +

+ +

+ It takes $O(\sqrt{N})$ samples from both $R$ and $S$
to get even one match. +

+
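The birthday-paradox claim above can be checked with a short calculation. This is a minimal sketch, not part of the slides. It assumes that $R$ and $S$ each hold exactly one tuple for each of the same $N$ distinct values of $A$, and that $k$ tuples are drawn from each table without replacement. The function names are hypothetical.

```python
from math import comb


def expected_matches(n, k):
    """Expected number of joining pairs between two k-tuple samples.

    Assumption: each of the n distinct join values appears exactly once
    in R and once in S, so any (r, s) pair matches with probability 1/n.
    """
    return k * k / n


def prob_at_least_one_match(n, k):
    """Probability that the two samples share at least one join value.

    The sample from S must avoid all k values already picked from R:
    C(n - k, k) of the C(n, k) possible samples do so.
    """
    return 1 - comb(n - k, k) / comb(n, k)


if __name__ == "__main__":
    n = 1_000_000
    for k in (100, 1000):
        print(k, expected_matches(n, k), prob_at_least_one_match(n, k))
```

Under these assumptions, 100 tuples per table out of a million distinct values gives an expected 0.01 matches, while about $\sqrt{N} = 1000$ tuples per table are needed before one match is expected.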
+ +
+

To be resumed later in the term when we talk about approximate query processing (AQP)

+
+ +
+

How DBs Do It: Instrument queries while running them.

+
-
-

Limitations

- -
-
-
Don't always have statistics for $Q$
-
For example, $\pi_{A \leftarrow (B \times C)}(R)$
-
- -
-
Don't always have clear rules for $c$
-
For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
-
- -
-
Attribute values are not always uniformly distributed.
-
For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
-
- -
-
Attribute values are sometimes correlated.
-
For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$
-
- -
-

(Some) Estimation Techniques

@@ -603,7 +608,173 @@
Using rules about the data for improved guesses.
- + +
+

Limitations of Uniform Prior

+ +
+
+
Don't always have statistics for $Q$
+
For example, $\pi_{A \leftarrow (B \times C)}(R)$
+
+ +
+
Don't always have clear rules for $c$
+
For example, $\sigma_{\texttt{FitsModel}(A, B, C)}(R)$
+
+ +
+
Attribute values are not always uniformly distributed.
+
For example, $|\sigma_{SPC\_COMMON = 'pin\ oak'}(T)|$ vs $|\sigma_{SPC\_COMMON = 'honeylocust'}(T)|$
+
+ +
+
Attribute values are sometimes correlated.
+
For example, $\sigma_{(stump < 5) \wedge (diam > 3)}(T)$
+
+ +
+
+ +
+

+ Ideal Case: You have some + $$f(x) = \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$ + (and similarly for the other aggregates) +

+

+ Slightly Less Ideal Case: You have some + $$f(x) \approx \left(\texttt{SELECT COUNT(*) WHERE A = x}\right)$$ +

+
+ +
+

If this sounds like CDF-based indexing... you're right!

+ +

... but we're not going to talk about NNs today

+
+
+ +
+
+

+ Simpler/Faster Idea: Break $f(x)$ into chunks +

+
+ +
+

Example Data

+
+ + + + + + + + + +
Name YearsEmployed Role
'Alice' 3 1
'Bob' 2 2
'Carol' 3 1
'Dave' 1 3
'Eve' 2 2
'Fred' 2 3
'Gwen' 4 1
'Harry' 2 3
+
+ +
+

Histograms

+ + + + + + +
YearsEmployedCOUNT
1 1
2 4
3 2
4 1
+ + + + + + +
COUNT(DISTINCT YearsEmployed) $= 4$
MIN(YearsEmployed) $= 1$
MAX(YearsEmployed) $= 4$
COUNT(*) WHERE YearsEmployed = 2 $= 4$
+
+ +
+

Histograms

+ + + + +
YearsEmployedCOUNT
1-2 5
3-4 3
+ + + + + + +
COUNT(DISTINCT YearsEmployed) $= 4$
MIN(YearsEmployed) $= 1$
MAX(YearsEmployed) $= 4$
COUNT(*) WHERE YearsEmployed = 2 $= \frac{5}{2}$
+
+ +
+

The Extreme Case

+ + + +
YearsEmployedCOUNT
1-4 8
+ + + + + + +
COUNT(DISTINCT YearsEmployed) $= 4$
MIN(YearsEmployed) $= 1$
MAX(YearsEmployed) $= 4$
COUNT(*) WHERE YearsEmployed = 2 $= \frac{8}{4}$
+
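The three histogram slides above can be reproduced with a small sketch. It is illustrative only: it assumes integer-valued attributes, equal-width buckets stored as hypothetical `(lo, hi, count)` triples with inclusive bounds, and a uniform spread of values inside each bucket. `build_histogram` and `estimate_eq` are names invented for this example.

```python
from collections import Counter


def build_histogram(values, width):
    """Group integer values into equal-width buckets of (lo, hi, count)."""
    lo_min, hi_max = min(values), max(values)
    counts = Counter((v - lo_min) // width for v in values)
    buckets = []
    lo = lo_min
    while lo <= hi_max:
        hi = min(lo + width - 1, hi_max)
        buckets.append((lo, hi, counts[(lo - lo_min) // width]))
        lo += width
    return buckets


def estimate_eq(histogram, x):
    """Estimate COUNT(*) WHERE A = x, assuming uniformity within a bucket."""
    for lo, hi, count in histogram:
        if lo <= x <= hi:
            return count / (hi - lo + 1)
    return 0.0


# YearsEmployed column from the example data (Alice through Harry).
years_employed = [3, 2, 3, 1, 2, 2, 4, 2]

if __name__ == "__main__":
    for width in (1, 2, 4):
        hist = build_histogram(years_employed, width)
        print(width, hist, estimate_eq(hist, 2))
```

With one bucket per value the estimate for `YearsEmployed = 2` is exact (4). Widening the buckets to 1-2 and then 1-4 degrades it to 5/2 and 8/4, and the single-bucket case is the uniform prior.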
+ +
+

More Example Data

+ + + + + + + + + + +
Value COUNT
1-10 20
11-20 0
21-30 15
31-40 30
41-50 22
51-60 63
61-70 10
71-80 10
+ + + + + + + + + + + +
SELECT … WHERE A = 33 $= \frac{1}{40-30}\cdot 30 = 3$
SELECT … WHERE A > 33 $= \frac{40-33}{40-30}\cdot 30+22$ $\;\;\;+63+10+10$ $= 126$
+
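The two estimates above follow mechanically from the bucket counts. The sketch below is one way to compute them, assuming integer values, inclusive bucket bounds stored as hypothetical `(lo, hi, count)` triples, and uniformity within each bucket. The function names are made up for illustration.

```python
# Bucketed counts from the "More Example Data" slide: (lo, hi, count).
HISTOGRAM = [
    (1, 10, 20), (11, 20, 0), (21, 30, 15), (31, 40, 30),
    (41, 50, 22), (51, 60, 63), (61, 70, 10), (71, 80, 10),
]


def estimate_eq(histogram, x):
    """Estimate COUNT(*) WHERE A = x: the bucket count spread over its width."""
    for lo, hi, count in histogram:
        if lo <= x <= hi:
            return count / (hi - lo + 1)
    return 0.0


def estimate_gt(histogram, x):
    """Estimate COUNT(*) WHERE A > x.

    Buckets entirely above x count in full; the bucket containing x
    contributes the fraction of its width that lies above x.
    """
    total = 0.0
    for lo, hi, count in histogram:
        if x < lo:
            total += count
        elif x < hi:
            total += count * (hi - x) / (hi - lo + 1)
    return total


if __name__ == "__main__":
    print(estimate_eq(HISTOGRAM, 33))  # 30 / 10
    print(estimate_gt(HISTOGRAM, 33))  # 7/10 * 30 + 22 + 63 + 10 + 10
```

For `A > 33`, the 31-40 bucket contributes $\frac{7}{10}\cdot 30 = 21$, and the four buckets above it contribute $22 + 63 + 10 + 10 = 105$, giving 126.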
+ + +
+
+

(Some) Estimation Techniques

+ +
+
Guess Randomly
+
Rules of thumb if you have no other options...
+ +
Uniform Prior
+
Use basic statistics to make a very rough guess.
+ +
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+ +
Histograms
+
Using more detailed statistics for improved guesses.
+ +
Constraints
+
Using rules about the data for improved guesses.
+
+
diff --git a/slides/cse4562sp2018/graphics/2018-03-05-JoinIssue.svg b/slides/cse4562sp2018/graphics/2018-03-05-JoinIssue.svg
new file mode 100644
index 00000000..b84b5f70
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-03-05-JoinIssue.svg
@@ -0,0 +1,279 @@
+[SVG figure, image/svg+xml. Node labels: σ, R, S, T. Annotations: 100 Tuples, 10 Tuples, 100 Tuples, 0 Tuples, 0 Tuples.]