From 28d3d16b0c615b43a7fb44975d05503d8acb6e63 Mon Sep 17 00:00:00 2001 From: Oliver Kennedy Date: Thu, 8 Mar 2018 23:22:09 -0500 Subject: [PATCH] Review slides --- slides/cse4562sp2018/2018-03-09-Review.html | 754 ++++++++++++++++++++ 1 file changed, 754 insertions(+) create mode 100644 slides/cse4562sp2018/2018-03-09-Review.html diff --git a/slides/cse4562sp2018/2018-03-09-Review.html b/slides/cse4562sp2018/2018-03-09-Review.html new file mode 100644 index 00000000..3d3f32c3 --- /dev/null +++ b/slides/cse4562sp2018/2018-03-09-Review.html @@ -0,0 +1,754 @@ + + + + + + + CSE 4/562 - Spring 2018 + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + +
+ + CSE 4/562 - Database Systems +
+ +
+ +
+

Midterm Review

+

CSE 4/562 – Database Systems

+
March 9, 2018
+
+ + +
+ +
+

What are Databases?

+
+ +
+
+
Analysis: Answering user-provided questions about data
+
What kind of tools can we give end-users?
    +
  • Declarative Languages
  • Organizational Data Structures (e.g., Indexes)
+ +
+
Manipulation: Safely persisting and sharing data updates
+
What kind of tools can we give end-users?
    +
  • Consistency Primitives
  • Data Validation Primitives
+
+
+
+ +
+
+
Primitive
+
Basic building blocks like Int, Float, Char, String
+
Tuple
+
Several ‘fields’ of different types. (N-Tuple = N fields)
+
A Tuple has a ‘schema’ defining each field
+
Set
+
A collection of unique records, all of the same type
+
Bag
+
An unordered collection of records (duplicates allowed), all of the same type
+
List
+
An ordered collection of records, all of the same type
+
+
+ +
+

+            SELECT  [DISTINCT] targetlist
+            FROM    relationlist
+            WHERE   condition
+          
+
    +
  1. Compute all combinations of tuples from the relations appearing in relationlist (i.e., their cross product)
  2. Discard tuples that fail the condition
  3. Delete attributes not in targetlist
  4. If DISTINCT is specified, eliminate duplicate rows
+

+ This is the least efficient strategy to compute a query! + A good optimizer will find more efficient strategies to compute the same answer. +
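As a concrete illustration, here is a minimal Python sketch of my own (not part of the original slides) of this naive strategy; relations are lists of dicts and attribute names are assumed to be disjoint across relations:

    from itertools import product

    def naive_select(relations, condition, targetlist, distinct=False):
        """Evaluate SELECT [DISTINCT] targetlist FROM relations WHERE condition naively."""
        # 1. All combinations of tuples from the FROM-clause relations (their cross product)
        combined = ({k: v for row in rows for k, v in row.items()}
                    for rows in product(*relations))
        # 2. Discard tuples that fail the condition
        filtered = (row for row in combined if condition(row))
        # 3. Delete attributes not in the target list
        projected = [{attr: row[attr] for attr in targetlist} for row in filtered]
        if not distinct:
            return projected
        # 4. If DISTINCT is specified, eliminate duplicate rows
        seen, result = set(), []
        for row in projected:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                result.append(row)
        return result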

+
+ +
+ +
+
+

Physical Layout

+
+ +
+

Record Formats

+
+
Fixed
+
Constant-size fields. Field $i$ at byte $\sum_{j < i} |Field_j|$
+
Delimited
+
Special character or string (e.g., ,) between fields
+
Header
+
Fixed-size header points to start of each field
+
 
+
 
+
+
+ +
+

File Formats

+
+
Fixed
+
Constant-size records. Record $i$ at byte $|Record| \times i$
+
Delimited
+
Special character or string (e.g., \r\n) at record end
+
Header
+
Index in file points to start of each record
+
Paged
+
Align records to paging boundaries
+
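To make the offset arithmetic behind the Fixed layouts concrete, here is a small sketch of my own (the INT/FLOAT/CHAR(20) schema is made up for illustration):

    def field_offset(field_widths, i):
        """Fixed record layout: field i starts after the widths of fields 0..i-1."""
        return sum(field_widths[:i])

    def record_offset(field_widths, i):
        """Fixed file layout: record i starts at byte |Record| * i."""
        return sum(field_widths) * i

    widths = [4, 8, 20]                      # hypothetical INT, FLOAT, CHAR(20) schema
    assert field_offset(widths, 2) == 12     # third field starts at byte 4 + 8
    assert record_offset(widths, 5) == 160   # sixth record starts at byte 32 * 5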
+
+ +
+
+
File
+
A collection of pages (or records)
+
Page
+
A fixed-size collection of records
+
Page size is usually dictated by hardware.
Mem Page $\approx$ 4KB   Cache Line $\approx$ 64B
+
Record
+
One or more fields (for now)
+
Field
+
A primitive value (for now)
+
+
+
+ +
+
+

Relational Algebra

+
+ +
+

Relational Algebra

+ + + + + + + + + + + + + + +
Operation | Sym | Meaning
Selection | $\sigma$ | Select a subset of the input rows
Projection | $\pi$ | Delete unwanted columns
Cross-product | $\times$ | Combine two relations
Set-difference | $-$ | Tuples in Rel 1, but not Rel 2
Union | $\cup$ | Tuples either in Rel 1 or in Rel 2
Intersection | $\cap$ | Tuples in both Rel 1 and Rel 2
Join | $\bowtie$ | Pairs of tuples matching a specified condition
Division | $/$ | "Inverse" of cross-product
Sort | $\tau$ | Order the output rows
Limit | $\texttt{LIMIT}_N$ | Return only the first $N$ rows
+
+ +
+

Equivalence

+ $$Q_1 = \pi_{A}\left( \sigma_{c}( R ) \right)$$ + $$Q_2 = \sigma_{c}\left( \pi_{A}( R ) \right)$$ + +
+ $$Q_1 \stackrel{?}{\equiv} Q_2$$ +
+
+ +
+ + + + + + + + + + + + + + + + +
Rule | Notes
$\sigma_{C_1\wedge C_2}(R) \equiv \sigma_{C_1}(\sigma_{C_2}(R))$ |
$\sigma_{C_1\vee C_2}(R) \equiv \sigma_{C_1}(R) \cup \sigma_{C_2}(R)$ | Only true for set, not bag union
$\sigma_C(R \times S) \equiv R \bowtie_C S$ |
$\sigma_C(R \times S) \equiv \sigma_C(R) \times S$ | If $C$ references only $R$'s attributes, also works for joins
$\pi_{A}(\pi_{A \cup B}(R)) \equiv \pi_{A}(R)$ |
$\sigma_C(\pi_{A}(R)) \equiv \pi_A(\sigma_C(R))$ | If $A$ contains all of the attributes referenced by $C$
$\pi_{A\cup B}(R\times S) \equiv \pi_A(R) \times \pi_B(S)$ | Where $A$ (resp., $B$) contains attributes in $R$ (resp., $S$)
$R \times (S \times T) \equiv (R \times S) \times T$ | Also works for joins
$R \times S \equiv S \times R$ | Also works for joins
$R \cup (S \cup T) \equiv (R \cup S) \cup T$ | Also works for intersection and bag-union
$R \cup S \equiv S \cup R$ | Also works for intersection and bag-union
$\sigma_{C}(R \cup S) \equiv \sigma_{C}(R) \cup \sigma_{C}(S)$ | Also works for intersection and bag-union
$\pi_{A}(R \cup S) \equiv \pi_{A}(R) \cup \pi_{A}(S)$ | Also works for intersection and bag-union
$\sigma_{C}(\gamma_{A, AGG}(R)) \equiv \gamma_{A, AGG}(\sigma_{C}(R))$ | If $A$ contains all of the attributes referenced by $C$
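A worked example (added here, not on the original slide) of composing these rules: split the conjunction, push the $R$-only predicate below the cross product, then fold the remaining condition into a join.

$$\sigma_{R.A = S.A \wedge R.B > 3}(R \times S) \equiv \sigma_{R.A = S.A}(\sigma_{R.B > 3}(R \times S)) \equiv \sigma_{R.A = S.A}(\sigma_{R.B > 3}(R) \times S) \equiv \sigma_{R.B > 3}(R) \bowtie_{R.A = S.A} S$$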
+
+ +
+

Algorithms

+
+
"Volcano" Operators (Iterators)
+
Operators "pull" tuples, one-at-a-time, from their children.
+ +
2-Pass (External) Sort
+
Create sorted runs, then repeatedly merge runs
+ +
Join Algorithms
+
Quickly picking out specific pairs of tuples.
+ +
Aggregation Algorithms
+
In-Memory vs 2-Pass, Normal vs Group-By
+
+
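A minimal sketch of the Volcano/iterator interface (class and method names are my own; real engines also pass schemas, manage buffers, and so on):

    class Select:
        """Volcano-style selection: pulls tuples one at a time from its child."""
        def __init__(self, child, predicate):
            self.child, self.predicate = child, predicate

        def open(self):
            self.child.open()

        def next(self):
            # Keep pulling from the child until a tuple passes the predicate.
            row = self.child.next()
            while row is not None and not self.predicate(row):
                row = self.child.next()
            return row        # None signals end of stream

        def close(self):
            self.child.close()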
+ +
+

Nested-Loop Join

+ +
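In code, a sketch under the assumption that both inputs are in-memory lists of tuples and theta is the join condition:

    def nested_loop_join(R, S, theta):
        """Naive nested-loop join: test every pair of tuples against the condition."""
        for r in R:                  # outer relation
            for s in S:              # inner relation, rescanned for every outer tuple
                if theta(r, s):
                    yield (r, s)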
+ +
+

Block-Nested Loop Join

+ +
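The block-nested variant, sketched the same way (block_size is a stand-in for however many outer tuples fit in the buffer):

    def block_nested_loop_join(R, S, theta, block_size):
        """Buffer a block of R, then scan S once per block instead of once per tuple."""
        def join_block(block):
            for s in S:
                for r in block:
                    if theta(r, s):
                        yield (r, s)

        block = []
        for r in R:
            block.append(r)
            if len(block) == block_size:
                yield from join_block(block)
                block = []
        if block:                    # leftover partial block
            yield from join_block(block)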
+ +
+

Strategies for Implementing $R \bowtie_{R.A = S.A} S$

+ +
+
Sort/Merge Join
+
Sort all of the data upfront, then scan over both sides.
+ +
In-Memory Index Join (1-pass Hash; Hash Join)
+
Build an in-memory index on one table, scan the other.
+ +
Partition Join (2-pass Hash; External Hash Join)
+
Partition both sides so that tuples don't join across partitions.
+
+
+ +
+

Sort/Merge Join

+ +
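A sketch of the merge phase for an equi-join (my own code; key_r and key_s extract the join attribute from each side):

    def sort_merge_join(R, S, key_r, key_s):
        """Sort both inputs on the join key, then merge, pairing up equal-key runs."""
        R, S = sorted(R, key=key_r), sorted(S, key=key_s)
        i = j = 0
        while i < len(R) and j < len(S):
            kr, ks = key_r(R[i]), key_s(S[j])
            if kr < ks:
                i += 1
            elif kr > ks:
                j += 1
            else:
                # Find the run of S tuples sharing this key, and pair it with
                # every R tuple sharing the same key.
                j_end = j
                while j_end < len(S) and key_s(S[j_end]) == kr:
                    j_end += 1
                while i < len(R) and key_r(R[i]) == kr:
                    for j2 in range(j, j_end):
                        yield (R[i], S[j2])
                    i += 1
                j = j_end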
+ +
+

1-Pass Hash Join

+ +
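A sketch of the same idea in code, assuming $S$ (the build side) fits in memory:

    from collections import defaultdict

    def one_pass_hash_join(R, S, key_r, key_s):
        """Build an in-memory hash table on S, then probe it with one scan of R."""
        index = defaultdict(list)
        for s in S:                                  # build phase
            index[key_s(s)].append(s)
        for r in R:                                  # probe phase
            for s in index.get(key_r(r), []):
                yield (r, s)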
+ +
+

2-Pass Hash Join

+
+
Limited Queries
+
Only supports join conditions of the form $R.A = S.B$
+ +
Low Memory
+
Never need more than 1 pair of partitions in memory
+ +
High IO Cost
+
Every record gets written out to disk, and back in.
+
+

Can partition on data-values to support other types of queries.
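A sketch of the partitioning idea (in-memory lists stand in for the partition files a real system would spill to disk; num_partitions is arbitrary):

    from collections import defaultdict

    def partition_hash_join(R, S, key_r, key_s, num_partitions=16):
        """Partition both inputs by hash of the join key, then join each pair of
        partitions with a small in-memory hash join. Matching tuples always hash
        to the same partition, so no results are lost."""
        parts_r = [[] for _ in range(num_partitions)]
        parts_s = [[] for _ in range(num_partitions)]
        for r in R:
            parts_r[hash(key_r(r)) % num_partitions].append(r)
        for s in S:
            parts_s[hash(key_s(s)) % num_partitions].append(s)
        for pr, ps in zip(parts_r, parts_s):
            index = defaultdict(list)                # in-memory join within one partition
            for s in ps:
                index[key_s(s)].append(s)
            for r in pr:
                for s in index.get(key_r(r), []):
                    yield (r, s)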

+
+ +
+

Index Nested Loop Join

+ + To compute $R \bowtie_{R.A < S.B} S$ with an index on $S.B$ + +
    +
  1. Read One Row of $R$
  2. Get the value of $R.A$
  3. Start index scan on $S.B > [R.A]$
  4. Return rows as normal
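Sketched in code, with a sorted list standing in for the B+Tree index on $S.B$ (bisect provides the "start index scan" step):

    from bisect import bisect_right

    def index_nested_loop_join(R, S_sorted_by_B, a_of, b_of):
        """Index NLJ for R.A < S.B: per outer tuple, range-scan the index on S.B."""
        keys = [b_of(s) for s in S_sorted_by_B]      # the index's key column
        for r in R:                                  # 1. read one row of R
            a = a_of(r)                              # 2. get the value of R.A
            start = bisect_right(keys, a)            # 3. start an index scan on S.B > R.A
            for s in S_sorted_by_B[start:]:          # 4. return rows as normal
                yield (r, s)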
+
+ +
+

Basic Aggregate Pattern

+
+
Init
+
Define a starting value for the accumulator
+
Fold(Accum, New)
+
Merge a new value into the accumulator
+
Finalize(Accum)
+
Extract the aggregate from the accumulator.
+
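For example, AVG under this pattern (a sketch of my own; an "algebraic" aggregate in the terminology of the next slide):

    class Average:
        def init(self):
            return (0, 0)                            # accumulator: (running sum, count)

        def fold(self, accum, new):
            total, count = accum
            return (total + new, count + 1)          # merge one new value in

        def finalize(self, accum):
            total, count = accum
            return total / count if count else None  # extract the aggregate

    agg = Average()
    acc = agg.init()
    for value in [2, 4, 9]:
        acc = agg.fold(acc, value)
    print(agg.finalize(acc))                         # 5.0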
+
+ +
+

Basic Aggregate Types

+

Gray et al. "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals"

+ +
+
Distributive
+
Finite-sized accumulator and doesn't need a finalize (COUNT, SUM)
+
Algebraic
+
Finite-sized accumulator but needs a finalize (AVG)
+
Holistic
+
Unbounded accumulator (MEDIAN)
+
+
+ +
+

Grouping Algorithms

+ +
+
2-pass Hash Aggregate
+
Like 2-pass Hash Join: Distribute groups across buckets, then do an in-memory aggregate for each bucket.
+ +
Sort-Aggregate
+
Like Sort-Merge Join: Sort the data by group, so that elements of the same group end up adjacent.
+
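The in-memory building block both strategies rely on, sketched for SUM (group_of and value_of are hypothetical accessors):

    from collections import defaultdict

    def hash_group_sum(rows, group_of, value_of):
        """One accumulator per group. The 2-pass variant runs this per hash
        partition; the sort variant gets the same effect by scanning sorted runs."""
        accum = defaultdict(int)
        for row in rows:
            accum[group_of(row)] += value_of(row)    # fold into this group's accumulator
        return dict(accum)                           # finalize is a no-op for SUM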
+
+
+ +
+
+

Indexing

+
+ +
+

Data Organization

+ +
+
+
Unordered Heap
+
No organization at all. $O(N)$ reads.
+
+ +
+
(Secondary) Index
+
Index structure over unorganized data. $O(\ll N)$ random reads for some queries.
+
+ +
+
Clustered (Primary) Index
+
Index structure over clustered data. $O(\ll N)$ sequential reads for some queries.
+
+
+
+ +
+

Data Organization

+ +
+ +
+

Data Organization

+ +
+ +
+

Tree-Based Indexes

+ + +
+ +
+ +
+ +
+

Rules of B+Trees

+ +
+
Keep space open for insertions in inner/data nodes.
+
‘Split’ nodes when they’re full
+ +
Avoid under-using space
+
‘Merge’ nodes when they’re under-filled
+
+ +

Maintain Invariant: All Nodes ≥ 50% Full

+

(Exception: The Root)
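A sketch of the split step for a leaf of keys (simplified: no pointers or parent updates; splitting in half is what preserves the ≥ 50% full invariant):

    from bisect import insort

    def insert_into_leaf(leaf, key, capacity):
        """Insert into a sorted leaf; if it overflows, split it in half and
        return (new_right_sibling, separator_key) to push into the parent."""
        insort(leaf, key)
        if len(leaf) <= capacity:
            return None, None
        mid = len(leaf) // 2
        right = leaf[mid:]          # new sibling takes the upper half
        del leaf[mid:]              # old leaf keeps the lower half (still >= 50% full)
        return right, right[0]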

+
+ +
+ +
+ +
+

Problems

+
+
$N$ is too small
+
Too many overflow pages (slower reads).
+
$N$ is too big
+
Too many normal pages (wasted space).
+
+
+ +
+ +
+ +
+

Problems

+
+
Changing hash functions reallocates everything
+
Only double/halve the range of the hash function (the number of buckets)
+ +
Changing sizes still requires reading everything
+
Idea: Only redistribute buckets that are too big
+
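A small sketch of why doubling helps (my own illustration): if a key lives in bucket $b = h \bmod N$, then under $h \bmod 2N$ it can only land in bucket $b$ or bucket $b + N$, so doubling splits each bucket locally and no other bucket needs to be touched.

    def split_bucket(bucket, bucket_id, old_num_buckets):
        """Rehash one bucket when the bucket count doubles from N to 2N."""
        stay, move = [], []
        for key in bucket:
            if hash(key) % (2 * old_num_buckets) == bucket_id:
                stay.append(key)        # stays in bucket b
            else:
                move.append(key)        # moves to bucket b + N
        return stay, move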
+
+ +
+ +
+
+ +
+
+

Cost-Based Optimization

+
+ + +
+

Accounting

+

Figure out the cost of each individual operator.

+

Only count the number of IOs added by each operator.

+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operation | RA | IOs Added (#pages) | Memory (#tuples)
Table Scan | $R$ | $\frac{|R|}{\mathcal P}$ | $O(1)$
Projection | $\pi(R)$ | $0$ | $O(1)$
Selection | $\sigma(R)$ | $0$ | $O(1)$
Union | $R \uplus S$ | $0$ | $O(1)$
Sort (In-Mem) | $\tau(R)$ | $0$ | $O(|R|)$
Sort (On-Disk) | $\tau(R)$ | $\frac{2 \cdot |R| \cdot \lfloor \log_{\mathcal B}(|R|) \rfloor}{\mathcal P}$ | $O(\mathcal B)$
(B+Tree) Index Scan | $Index(R, c)$ | $\log_{\mathcal I}(|R|) + \frac{|\sigma_c(R)|}{\mathcal P}$ | $O(1)$
(Hash) Index Scan | $Index(R, c)$ | $1$ | $O(1)$
+ +
    +
  1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
  2. Size of $R$ ($|R|$)
  3. Pages of Buffer ($\mathcal B$)
  4. Keys per Index Page ($\mathcal I$)
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operation | RA | IOs Added (#pages) | Memory (#tuples)
Nested Loop Join (Buffer $S$ in mem) | $R \times S$ | $0$ | $O(|S|)$
Nested Loop Join (Buffer $S$ on disk) | $R \times_{disk} S$ | $(1+ |R|) \cdot \frac{|S|}{\mathcal P}$ | $O(1)$
1-Pass Hash Join | $R \bowtie_{1PH, c} S$ | $0$ | $O(|S|)$
2-Pass Hash Join | $R \bowtie_{2PH, c} S$ | $\frac{2|R| + 2|S|}{\mathcal P}$ | $O(1)$
Sort-Merge Join | $R \bowtie_{SM, c} S$ | [Sort] | [Sort]
(Tree) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot (\log_{\mathcal I}(|S|) + \frac{|\sigma_c(S)|}{\mathcal P})$ | $O(1)$
(Hash) Index NLJ | $R \bowtie_{INL, c} S$ | $|R| \cdot 1$ | $O(1)$
(In-Mem) Aggregate | $\gamma_A(R)$ | $0$ | $adom(A)$
(Sort/Merge) Aggregate | $\gamma_A(R)$ | [Sort] | [Sort]
+ +
    +
  1. Tuples per Page ($\mathcal P$) – Normally defined per-schema
  2. Size of $R$ ($|R|$)
  3. Pages of Buffer ($\mathcal B$)
  4. Keys per Index Page ($\mathcal I$)
  5. Number of distinct values of $A$ ($adom(A)$)
+
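For example (illustrative numbers, not from the slides): with $|R| = 1{,}000{,}000$ tuples, $|S| = 500{,}000$ tuples, and $\mathcal P = 100$ tuples per page, a 2-pass hash join adds

$$\frac{2|R| + 2|S|}{\mathcal P} = \frac{2{,}000{,}000 + 1{,}000{,}000}{100} = 30{,}000 \textrm{ page IOs}$$

on top of the $\frac{|R| + |S|}{\mathcal P} = 15{,}000$ IOs charged to the two table scans, while a 1-pass hash join adds $0$, provided $S$ fits in memory.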
+ +
+

Estimating IOs requires Estimating $|Q(R)|$

+
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Operator | RA | Estimated Size
Table | $R$ | $|R|$
Projection | $\pi(Q)$ | $|Q|$
Union | $Q_1 \uplus Q_2$ | $|Q_1| + |Q_2|$
Cross Product | $Q_1 \times Q_2$ | $|Q_1| \times |Q_2|$
Sort | $\tau(Q)$ | $|Q|$
Limit | $\texttt{LIMIT}_N(Q)$ | $N$
Selection | $\sigma_c(Q)$ | $|Q| \times \texttt{SEL}(c, Q)$
Join | $Q_1 \bowtie_c Q_2$ | $|Q_1| \times |Q_2| \times \texttt{SEL}(c, Q_1\times Q_2)$
Distinct | $\delta_A(Q)$ | $\texttt{UNIQ}(A, Q)$
Aggregate | $\gamma_{A, B \leftarrow \Sigma}(Q)$ | $\texttt{UNIQ}(A, Q)$
+ +
    +
  • $\texttt{SEL}(c, Q)$: Selectivity of $c$ on $Q$, or $\frac{|\sigma_c(Q)|}{|Q|}$
  • $\texttt{UNIQ}(A, Q)$: # of distinct values of $A$ in $Q$.
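For example (my numbers, using the uniform-prior style of guess from the next slide): for an equality predicate $c: A = 7$, assume each of the $\texttt{UNIQ}(A, Q)$ values is equally likely, so $\texttt{SEL}(c, Q) \approx \frac{1}{\texttt{UNIQ}(A, Q)}$. With $|Q| = 10{,}000$ and $\texttt{UNIQ}(A, Q) = 50$:

$$|\sigma_{A = 7}(Q)| \approx |Q| \times \texttt{SEL}(c, Q) = 10{,}000 \times \frac{1}{50} = 200$$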
+
+ +
+

(Some) Estimation Techniques

+ +
+
+
Guess Randomly
+
Rules of thumb if you have no other options...
+
+ +
+
Uniform Prior
+
Use basic statistics to make a very rough guess.
+
+ +
+
Sampling / History
+
Small, Quick Sampling Runs (or prior executions of the query).
+
+ +
+
Histograms
+
Using more detailed statistics for improved guesses.
+
+ +
+
Constraints
+
Using rules about the data for improved guesses.
+
+
+
+ + +
+
+ + + + + + +