What is the best, correct technique for task X, when Y is true?
diff --git a/src/teaching/cse-562/2019sp/slide/2019-03-06-Sketches.erb b/src/teaching/cse-562/2019sp/slide/2019-03-06-Sketches.erb index 755fc6c6..f58d4017 100644 --- a/src/teaching/cse-562/2019sp/slide/2019-03-06-Sketches.erb +++ b/src/teaching/cse-562/2019sp/slide/2019-03-06-Sketches.erb @@ -186,7 +186,7 @@ textbook: (readings only)What are Databases?
+
+ SELECT [DISTINCT] targetlist
+ FROM relationlist
+ WHERE condition
+
+ + This is the least efficient strategy to compute a query! + A good optimizer will find more efficient strategies to compute the same answer. +
+,
) between fields\r\n
) at record endRelational Algebra
+Operation | Sym | Meaning |
---|---|---|
Selection | $\sigma$ | Select a subset of the input rows |
Projection | $\pi$ | Delete unwanted columns |
Cross-product | $\times$ | Combine two relations |
Set-difference | $-$ | Tuples in Rel 1, but not Rel 2 |
Union | $\cup$ | Tuples either in Rel 1 or in Rel 2 |
Intersection | $\cap$ | Tuples in both Rel 1 and Rel 2 |
Join | $\bowtie$ | Pairs of tuples matching a specified condition |
Division | $/$ | "Inverse" of cross-product |
Sort | $\tau_A$ | Sort records by attribute(s) $A$ |
Limit | $\texttt{LIMIT}_N$ | Return only the first $N$ records (according to sort order if paired with sort). |
Rule | Notes |
---|---|
$\sigma_{C_1\wedge C_2}(R) \equiv \sigma_{C_1}(\sigma_{C_2}(R))$ | |
$\sigma_{C_1\vee C_2}(R) \equiv \sigma_{C_1}(R) \cup \sigma_{C_2}(R)$ | Only true for set, not bag union |
$\sigma_C(R \times S) \equiv R \bowtie_C S$ | |
$\sigma_C(R \times S) \equiv \sigma_C(R) \times S$ | If $C$ references only $R$'s attributes, also works for joins |
$\pi_{A}(\pi_{A \cup B}(R)) \equiv \pi_{A}(R)$ | |
$\sigma_C(\pi_{A}(R)) \equiv \pi_A(\sigma_C(R))$ | If $A$ contains all of the attributes referenced by $C$ |
$\pi_{A\cup B}(R\times S) \equiv \pi_A(R) \times \pi_B(S)$ | Where $A$ (resp., $B$) contains attributes in $R$ (resp., $S$) |
$R \times (S \times T) \equiv (R \times S) \times T$ | Also works for joins |
$R \times S \equiv S \times R$ | Also works for joins |
$R \cup (S \cup T) \equiv (R \cup S) \cup T$ | Also works for intersection and bag-union |
$R \cup S \equiv S \cup R$ | Also works for intersections and bag-union |
$\sigma_{C}(R \cup S) \equiv \sigma_{C}(R) \cup \sigma_{C}(S)$ | Also works for intersections and bag-union |
$\pi_{A}(R \cup S) \equiv \pi_{A}(R) \cup \pi_{A}(S)$ | Also works for intersections and bag-union |
$\sigma_{C}(\gamma_{A, AGG}(R)) \equiv \gamma_{A, AGG}(\sigma_{C}(R))$ | If $A$ contains all of the attributes referenced by $C$ |
Sort/Merge typically expressed as 3 operators
(2xSort + Merge)
Can partition on data-values to support other types of queries.
+Grey et. al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
+ +Indexing
+Maintain Invariant: All Nodes ≥ 50% Full
+(Exception: The Root)
+Cost-Based Optimization
+Figure out the cost of each individual operator.
+Only count the number of IOs added by each operator.
+Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Table Scan | +$R$ | +$\frac{|R|}{\mathcal P}$ | +$O(1)$ | +
Projection | +$\pi(R)$ | +$0$ | +$O(1)$ | +
Selection | +$\sigma(R)$ | +$0$ | +$O(1)$ | +
Union | +$R \uplus S$ | +$0$ | +$O(1)$ | +
Sort (In-Mem) | +$\tau(R)$ | +$0$ | +$O(|R|)$ | +
Sort (On-Disk) | +$\tau(R)$ | +$\frac{2 \cdot \lfloor log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$ | +$O(\mathcal B)$ | +
(B+Tree) Index Scan | +$Index(R, c)$ | +$\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | +$O(1)$ | +
(Hash) Index Scan | +$Index(R, c)$ | +$1$ | +$O(1)$ | +
Operation | RA | IOs Added (#pages) | Memory (#tuples) |
---|---|---|---|
Nested Loop Join (Buffer $S$ in mem) | +$R \times S$ | +$0$ | +$O(|S|)$ | +
Nested Loop Join (Buffer $S$ on disk) | +$R \times_{disk} S$ | +$(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ | +$O(1)$ | +
1-Pass Hash Join | +$R \bowtie_{1PH, c} S$ | +$0$ | +$O(|S|)$ | +
2-Pass Hash Join | +$R \bowtie_{2PH, c} S$ | +$\frac{2|R| + 2|S|}{\mathcal P}$ | +$O(1)$ | +
Sort-Merge Join | +$R \bowtie_{SM, c} S$ | +[Sort] | +[Sort] | +
(Tree) Index NLJ | +$R \bowtie_{INL, c}$ | +$|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | +$O(1)$ | +
(Hash) Index NLJ | +$R \bowtie_{INL, c}$ | +$|R| \cdot 1$ | +$O(1)$ | +
(In-Mem) Aggregate | +$\gamma_A(R)$ | +$0$ | +$adom(A)$ | +
(Sort/Merge) Aggregate | +$\gamma_A(R)$ | +[Sort] | +[Sort] | +
Estimating IOs requires Estimating $|Q(R)|$
+Operator | +RA | +Estimated Size | +
---|---|---|
Table | +$R$ | +$|R|$ | +
Projection | +$\pi(Q)$ | +$|Q|$ | +
Union | +$Q_1 \uplus Q_2$ | +$|Q_1| + |Q_2|$ | +
Cross Product | +$Q_1 \times Q_2$ | +$|Q_1| \times |Q_2|$ | +
Sort | +$\tau(Q)$ | +$|Q|$ | +
Limit | +$\texttt{LIMIT}_N(Q)$ | +$N$ | +
Selection | +$\sigma_c(Q)$ | +$|Q| \times \texttt{SEL}(c, Q)$ | +
Join | +$Q_1 \bowtie_c Q_2$ | +$|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$ | +
Distinct | +$\delta_A(Q)$ | +$\texttt{UNIQ}(A, Q)$ | +
Aggregate | +$\gamma_{A, B \leftarrow \Sigma}(Q)$ | +$\texttt{UNIQ}(A, Q)$ | +
Flips | Score | Probability | +E[# Games] | +
---|---|---|---|
(👽) | 0 | 0.5 | +2 | +
(🐕)(👽) | 1 | 0.25 | +4 | +
(🐕)(🐕)(👽) | 2 | 0.125 | +8 | +
(🐕)$\times N$ (👽) | $N$ | $\frac{1}{2^{N+1}}$ | +$2^{N+1}$ | +
If I told you that in a series of games, my best score was $N$, you might expect that I played $2^{N+1}$ games.
+To do that, I only need to track my top score!
+WINDOW
Operator
+ SELECT L.state, T.month,
+ AVG(S.sales) OVER W as movavg
+ FROM Sales S, Times T, Locations L
+ WHERE S.timeid = T.timeid
+ AND S.locid = L.locid
+ WINDOW W AS (
+ PARTITION BY L.state
+ ORDER BY T.month
+ RANGE BETWEEN INTERVAL ‘1’ MONTH PRECEDING
+ AND INTERVAL ‘1’ MONTH FOLLOWING
+ )
+
+ WINDOW
joins.