diff --git a/slides/cse4562sp2018/2018-02-12-Algorithms.html b/slides/cse4562sp2018/2018-02-12-Algorithms.html
new file mode 100644
index 00000000..05d59ccc
--- /dev/null
+++ b/slides/cse4562sp2018/2018-02-12-Algorithms.html
@@ -0,0 +1,342 @@
+ CSE 4/562 - Spring 2018
+
+ + +
+ + CSE 4/562 - Database Systems +
+ +
+ +
+

Query Evaluation

+

CSE 4/562 – Database Systems

+
February 12, 2018
+
+ +
+
+

Query Evaluation Styles

+ +
+
All-At-Once (Collections)
+
Bottom-up, one operator at a time.
+ +
Volcano-Style (Iterators)
+
Operators "request" one tuple at a time from children.
+ +
Push-Style (Buffers)
+
Operators continuously produce/consume tuples.
+
+
+ +
+

Basic Mindset

+ +

+  r = get_table("R")
+
+  s = get_table("S")
+  
+  temp1 = apply_join(r, s, "R.B = S.B")
+  
+  temp2 = apply_select(temp1, "S.C = 10")
+  
+  result = apply_projection(temp2, "R.A")
+          
+
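A minimal runnable sketch of the all-at-once plan above, assuming rows are dictionaries keyed by qualified column name and that conditions are passed as Python callables rather than strings. The sample data and the bodies of `get_table`, `apply_join`, `apply_select`, and `apply_projection` are illustrative assumptions, not the course's implementation.

```python
# Hypothetical in-memory tables; column names are qualified as in the slide.
TABLES = {
    "R": [{"R.A": 1, "R.B": 10}, {"R.A": 2, "R.B": 20}],
    "S": [{"S.B": 10, "S.C": 10}, {"S.B": 20, "S.C": 99}],
}


def get_table(name):
    return TABLES[name]


def apply_join(lhs, rhs, condition):
    # Pair every lhs row with every rhs row, keep pairs passing the condition.
    return [{**r, **s} for r in lhs for s in rhs if condition({**r, **s})]


def apply_select(rows, condition):
    return [row for row in rows if condition(row)]


def apply_projection(rows, columns):
    return [{c: row[c] for c in columns} for row in rows]


# Each operator fully materializes its output before the next one starts.
r = get_table("R")
s = get_table("S")
temp1 = apply_join(r, s, lambda t: t["R.B"] == t["S.B"])
temp2 = apply_select(temp1, lambda t: t["S.C"] == 10)
result = apply_projection(temp2, ["R.A"])
print(result)  # [{'R.A': 1}]
```

Every intermediate (`temp1`, `temp2`) is a complete collection, which is what distinguishes this style from the iterator-based one.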
+ +
+

Basic Mindset

+

+      def build_tree(operator):
+
+        if """ operator is a base table """:
+          return get_table(...)
+
+        elif """ operator is a selection """:
+          return apply_select(build_tree(operator.child), operator.condition)
+
+        elif """ handle remaining cases similarly """:
+          pass  # e.g. projection, union, cross
+
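The recursive dispatch can be sketched as follows, assuming a small hypothetical operator tree built from dataclasses. The names `Table`, `Selection`, `Cross`, `TABLES`, and `evaluate` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Table:
    name: str


@dataclass
class Selection:
    child: object
    condition: Callable


@dataclass
class Cross:
    lhs: object
    rhs: object


# Hypothetical catalog of base tables.
TABLES = {"R": [(1, 2), (3, 4), (5, 6)], "S": [("x",)]}


def evaluate(operator):
    # Evaluate children first (bottom-up), then apply this operator.
    if isinstance(operator, Table):
        return TABLES[operator.name]
    elif isinstance(operator, Selection):
        rows = evaluate(operator.child)
        return [row for row in rows if operator.condition(row)]
    elif isinstance(operator, Cross):
        lhs = evaluate(operator.lhs)
        rhs = evaluate(operator.rhs)
        return [r + s for r in lhs for s in rhs]
    raise ValueError(f"unsupported operator: {operator!r}")


plan = Selection(Cross(Table("R"), Table("S")), lambda row: row[0] != 3)
print(evaluate(plan))  # [(1, 2, 'x'), (5, 6, 'x')]
```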
+ +
+

Select

+ +

+ $$\sigma_{A \neq 3} R$$ +

+ + + + + +
A  B
1  2
3  4
5  6
+
+ +
+

Select

+ +

+                  def apply_select(input, condition):
+                    result = []
+
+                    for row in input:
+                      if condition(row):
+                        result += [row]
+
+                    return result
+          
+

(All-At-Once)

+
+ +
+

Select

+ +

+ $$\sigma_{A \neq 3} R$$ +

+ + + + + + + + + +
A  B
getNext()    for row in input:
1  2         return row;
getNext()    for row in input:
3  4         X
5  6         return row;
getNext()    for row in input:
None         return None;
+
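The getNext() trace above can be sketched as a pair of iterator classes. This is a minimal illustration, assuming `None` marks end of input as in the trace; the `Scan` leaf operator and the method name `get_next` are hypothetical.

```python
class Scan:
    """Hypothetical leaf operator over an in-memory list of tuples."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None  # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row


class Select:
    def __init__(self, child, condition):
        self.child = child
        self.condition = condition

    def get_next(self):
        # Pull from the child until a row passes or the child is exhausted.
        # (Assignment expressions need Python 3.8+.)
        while (row := self.child.get_next()) is not None:
            if self.condition(row):
                return row
        return None


R = [(1, 2), (3, 4), (5, 6)]
op = Select(Scan(R), lambda row: row[0] != 3)  # sigma_{A != 3}(R)

out = []
while (row := op.get_next()) is not None:
    out.append(row)
print(out)  # [(1, 2), (5, 6)]
```

The second call to `get_next` reads two child rows, because (3, 4) is rejected before (5, 6) is returned.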
+ +
+

Select

+ +
+ +
+

Project

+ +
+ +
+

Union

+ +
+ +
+

Cross

+

+                    def apply_cross(lhs, rhs):
+                      result = []
+
+                      for r in lhs:
+                        for s in rhs:
+                          result += [r + s]
+
+                      return result
+          
+
+ +
+

Cross

+ +
+ +
+

What's the complexity of this cross-product algorithm?

+

... in terms of compute

+

... in terms of IOs

+
+
+ +
+
+

Cross Product Problems

+
+
Need to scan the inner relation multiple times!
+
Load data intelligently to mitigate expensive IOs
+ +
Every tuple needs to be paired with every other tuple!
+
Exploit join conditions to minimize pairs of tuples
+
+
+ +
+

Preloading Data

+

Nested-Loop Join

+

+                    def apply_cross(lhs, rhs):
+                      result = []
+
+                      while (r := lhs.next()) is not None:
+                        while (s := rhs.next()) is not None:
+                          result += [r + s]
+                        rhs.reset()
+
+                      return result
+          
+

Need to evaluate rhs iterator once per record in lhs

+
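A runnable sketch of the nested-loop cross product, assuming a hypothetical `TableIterator` class that supplies the `next()` and `reset()` calls used on the slide. The `resets` counter is added only to make the repeated rhs scans visible.

```python
class TableIterator:
    """Hypothetical resettable iterator over an in-memory list of tuples."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = 0
        self.resets = 0  # how many times the scan was restarted

    def next(self):
        if self.pos >= len(self.rows):
            return None  # end of input
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def reset(self):
        self.pos = 0
        self.resets += 1


def nested_loop_cross(lhs, rhs):
    result = []
    while (r := lhs.next()) is not None:
        while (s := rhs.next()) is not None:
            result.append(r + s)
        rhs.reset()  # rhs must be re-scanned for the next lhs record
    return result


lhs = TableIterator([(1,), (2,)])
rhs = TableIterator([("a",), ("b",), ("c",)])
pairs = nested_loop_cross(lhs, rhs)
print(len(pairs), rhs.resets)  # 6 2
```

With two lhs records the rhs iterator is reset twice, one full rhs pass per lhs record.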
+ +
+

Preloading Data

+ +

Naive Solution: Preload records from rhs

+

+                    def apply_cross(lhs, rhs):
+                      result = []
+                      rhs_preloaded = []
+                      
+                      while (s := rhs.next()) is not None:
+                        rhs_preloaded += [s]
+
+                      while (r := lhs.next()) is not None:
+                        for s in rhs_preloaded:
+                          result += [r + s]
+
+                      return result
+          
+ +

Any problems with this?

+
+ +
+

Preloading Data

+ +

Better Solution: Load both lhs and rhs records in blocks.

+ +

+                    def apply_cross(lhs, rhs):
+                      result = []
+
+                      while r_block := lhs.take(100):
+                        while s_block := rhs.take(100):
+                          for r in r_block:
+                            for s in s_block: 
+                              result += [r + s]
+                        rhs.reset()
+
+                      return result
+          
+
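A runnable sketch of the block-based variant, assuming a hypothetical `BlockIterator` whose `take(n)` returns up to `n` rows and an empty list once exhausted. The block size is a parameter here so the effect shows on tiny inputs.

```python
class BlockIterator:
    """Hypothetical iterator exposing take(n) and reset() over a list of tuples."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = 0
        self.resets = 0  # how many times the scan was restarted

    def take(self, n):
        block = self.rows[self.pos:self.pos + n]
        self.pos += len(block)
        return block  # empty list (falsy) once the input is exhausted

    def reset(self):
        self.pos = 0
        self.resets += 1


def block_nested_loop_cross(lhs, rhs, block_size=100):
    result = []
    while r_block := lhs.take(block_size):
        while s_block := rhs.take(block_size):
            for r in r_block:
                for s in s_block:
                    result.append(r + s)
        rhs.reset()  # one rhs pass per lhs block, not per lhs record
    return result


lhs = BlockIterator([(i,) for i in range(5)])
rhs = BlockIterator([(j,) for j in range(3)])
pairs = block_nested_loop_cross(lhs, rhs, block_size=2)
print(len(pairs), rhs.resets)  # 15 3
```

Five lhs records in blocks of two give three lhs blocks, so rhs is scanned three times instead of five.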
+ +
+

Block-Nested Loop Join

+ +
+ +
+

Join Conditions

+
+
+ +
+
diff --git a/slides/cse4562sp2018/2018-02-12-ExtRA.html b/slides/cse4562sp2018/2018-02-14-ExtRA.html
similarity index 89%
rename from slides/cse4562sp2018/2018-02-12-ExtRA.html
rename to slides/cse4562sp2018/2018-02-14-ExtRA.html
index 892fea13..5ac94534 100644
--- a/slides/cse4562sp2018/2018-02-12-ExtRA.html
+++ b/slides/cse4562sp2018/2018-02-14-ExtRA.html
@@ -49,7 +49,7 @@
-

Extended Relational Algebra

+

Extended RA

CSE 4/562 – Database Systems

February 12, 2018
@@ -59,7 +59,7 @@

Extended Relational Algebra

-
Set Operations
+
Set/Bag Operations
Select ($\sigma$), Project ($\pi$), Join ($\bowtie$), Union ($\cup$)
Bag Operations
@@ -71,6 +71,30 @@
Arithmetic Operations
Extended Projection ($\pi$), Aggregation ($\Sigma$), Grouping ($\gamma$)
+ +
+ +
+ + +
+

Sort / Limit

+ +

+ $$\tau_{A}(R)$$ + The tuples of $R$ in ascending order according to 'A' +

+ +

+ $$\textbf{L}_{n}(R)$$ + The first $n$ tuples of $R$

(Typically combined with sort. If not, pick arbitrarily.)
+
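A short all-at-once sketch of sort and limit, assuming rows are tuples and the sort attribute is given by position. The function names `apply_sort` and `apply_limit` are hypothetical, chosen to match the naming of the other operators.

```python
def apply_sort(rows, key_index):
    # tau_A(R): tuples of R in ascending order of the attribute at key_index
    return sorted(rows, key=lambda row: row[key_index])


def apply_limit(rows, n):
    # L_n(R): the first n tuples of R (all of R if it has fewer than n)
    return rows[:n]


R = [(3, "c"), (1, "a"), (2, "b")]
top2 = apply_limit(apply_sort(R, 0), 2)
print(top2)  # [(1, 'a'), (2, 'b')]
```

Applying the limit without the sort would return whichever two tuples happen to come first in `R`.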

+
+ +
+

Distinct

+
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Flow-Cross.svg b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Cross.svg
new file mode 100644
index 00000000..df8d8209
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Cross.svg
@@ -0,0 +1,423 @@
+ [SVG flowchart; text labels: start; Have Old 'r'?; no; Read LHS Row 'r' and Reset RHS; Read RHS Row 's'; not empty; not empty; Return <r s>; yes; Done; empty; empty]
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Flow-Project.svg b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Project.svg
new file mode 100644
index 00000000..dbe2612e
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Project.svg
@@ -0,0 +1,252 @@
+ [SVG flowchart; text labels: Read One Row; not empty; Compute New Row; Return Row; Done!; empty; start]
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Flow-Select.svg b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Select.svg
new file mode 100644
index 00000000..43e07a21
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Select.svg
@@ -0,0 +1,285 @@
+ [SVG flowchart; text labels: Read One Row; not empty; Check Condition; satisfied; Return Row; unsatisfied; Done!; empty; start]
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Flow-Union.svg b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Union.svg
new file mode 100644
index 00000000..4c1c31d3
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Union.svg
@@ -0,0 +1,319 @@
+ [SVG flowchart; text labels: start; Read LHS Row; not empty; Return Row; Read RHS Row; empty; not empty; Done!; empty]
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Join-BNLJ.svg b/slides/cse4562sp2018/graphics/2018-02-12-Join-BNLJ.svg
new file mode 100644
index 00000000..2d227ef3
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Join-BNLJ.svg
@@ -0,0 +1,269 @@
+ [SVG diagram; no text labels]
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Join-NLJ.svg b/slides/cse4562sp2018/graphics/2018-02-12-Join-NLJ.svg
new file mode 100644
index 00000000..5a44cba7
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Join-NLJ.svg
@@ -0,0 +1,274 @@
+ [SVG diagram; no text labels]
diff --git a/src/seminar/2018sp.erb b/src/seminar/2018sp.erb
index 96e825d5..90bb24cf 100644
--- a/src/seminar/2018sp.erb
+++ b/src/seminar/2018sp.erb
@@ -12,13 +12,16 @@ schedule:
       bio: |
         Julia Stoyanovich is Assistant Professor of Computer Science at Drexel University, and an affiliated faculty at the Center for Information Technology Policy at Princeton. She is a recipient of an NSF CAREER award and of an NSF/CRA Computing Innovations Fellowship. Julia's research focuses on responsible data management and analysis practices: on operationalizing fairness, diversity, transparency, and data protection in all stages of the data acquisition and processing lifecycle. She established the Data, Responsibly consortium, serves on the ACM task force to revise the Code of Ethics and Professional Conduct, and is active in the New York City algorithmic transparency effort. In addition to data ethics, Julia works on management and analysis of preference data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst.
   - when: March 7 (tentative); Time TBD
-    what: Title TBD
+    what: "Deep Curation: Putting Open Science Data to Work"
     who: Bill Howe (University of Washington)
     where: Location TBD
     details:
-      abstract: TBD
+      abstract: |
+        Data in public repositories and in the scientific literature remains remarkably underused despite significant investments in open data and open science. Making data available online turns out to be the easy part; making the data usable for data science requires new services to support longitudinal, multi-dataset analysis rather than just settling for keyword search.
+        In the Deep Curation project, we use distant supervision and co-learning to automatically label datasets with zero training data. We have applied this approach to curate gene expression data and identify figures in the scientific literature, outperforming state-of-the-art supervised methods that rely on supervision. We then use claims extracted from the text of papers to guide probabilistic data integration and schema matching, affording experiments to automatically verify claims against open data, providing a repository-wide "report card" for the utility of data and the reliability of the claims against them.
       bio: |
         Bill Howe is an Associate Professor in the Information School, Adjunct Associate Professor in Computer Science & Engineering, and Associate Director and Senior Data Science Fellow at the UW eScience Institute. He is a co-founder of Urban@UW, and with support from the MacArthur Foundation and Microsoft, leads UW's participation in the MetroLab Network. He created a first MOOC on Data Science through Coursera, and led the creation of the UW Data Science Masters Degree, where he serves as its first Program Director and Faculty Chair. He also serves on the Steering Committee of the Center for Statistics in the Social Sciences.
+      headshot: "https://faculty.washington.edu/~billhowe/images/billhowe.png"
   - when: May 15; Time TBD
     what: Rethinking Query Execution on Big Data
     who: Dan Suciu (University of Washington)
diff --git a/src/teaching/cse-562/2018sp/index.erb b/src/teaching/cse-562/2018sp/index.erb
index 6120997f..28441bbe 100644
--- a/src/teaching/cse-562/2018sp/index.erb
+++ b/src/teaching/cse-562/2018sp/index.erb
@@ -42,9 +42,10 @@ In this course, you will learn...
  • Project Groups: 1-3 people
  • Grading: