+ def build_tree(operator):
+
+     if """ operator is a base table """:
+         return get_table(...)
+
+     elif """ operator is a selection """:
+         return apply_select(operator.child, operator.condition)
+
+     elif """ handle remaining cases similarly """:
+         pass  # remaining operators are elided on the slide
+
+
+
+
+
Select
+
+
+ $$\sigma_{A \neq 3} R$$
+
+
+
A | B
1 | 2
3 | 4
5 | 6
+
+
+
+
+
Select
+
+
+ def apply_select(input, condition):
+     result = []
+
+     for row in input:
+         if condition(row):
+             result += [row]
+
+     return result
+
+
(All-At-Once)
+
+
+
+
Select
+
+
+ $$\sigma_{A \neq 3} R$$
+
+
+
A | B | getNext() trace
1 | 2 | 1st getNext(): return row
3 | 4 | X (fails the condition; skipped during the 2nd getNext())
5 | 6 | 2nd getNext(): return row
None | | 3rd getNext(): input exhausted, return None
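The tuple-at-a-time behaviour traced above can be sketched with a Python generator. This is a minimal illustration, not the course's reference implementation: the name `select_iter` and the use of `next(it, None)` to stand in for `getNext()` are assumptions made for this example.

```python
def select_iter(rows, condition):
    """Yield matching rows one at a time instead of building a full list."""
    for row in rows:
        if condition(row):
            yield row

# The table R from the slide, as (A, B) tuples.
R = [(1, 2), (3, 4), (5, 6)]

# sigma_{A != 3}(R): keep rows whose A value is not 3.
it = select_iter(R, lambda row: row[0] != 3)

print(next(it, None))  # (1, 2)
print(next(it, None))  # (5, 6); the row (3, 4) is skipped
print(next(it, None))  # None: input exhausted
```

Each call resumes the loop where the previous call stopped, so no intermediate result list is ever materialized.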
+
+
+
+
+
Select
+
+
+
+
+
Project
+
+
+
+
+
Union
+
+
+
+
+
Cross
+
+ def apply_cross(lhs, rhs):
+     result = []
+
+     for r in lhs:
+         for s in rhs:
+             result += [r + s]
+
+     return result
+
+
+
+
+
Cross
+
+
+
+
+
What's the complexity of this cross-product algorithm?
+
... in terms of compute
+
... in terms of IOs
+
+
+
+
+
+
Cross Product Problems
+
+
Need to scan the inner relation multiple times!
+
Load data intelligently to mitigate expensive IOs
+
+
Every tuple needs to be paired with every other tuple!
+
Exploit join conditions to minimize pairs of tuples
+
+
+
+
+
Preloading Data
+
Nested-Loop Join
+
+ def apply_cross(lhs, rhs):
+     result = []
+
+     # next() is assumed to return None once the iterator is exhausted
+     while (r := lhs.next()) is not None:
+         while (s := rhs.next()) is not None:
+             result += [r + s]
+         rhs.reset()
+
+     return result
+
+
Need to evaluate rhs iterator once per record in lhs
+
+
+
+
Preloading Data
+
+
Naive Solution: Preload records from rhs
+
+ def apply_cross(lhs, rhs):
+     result = []
+     rhs_preloaded = []
+
+     # next() is assumed to return None once the iterator is exhausted
+     while (s := rhs.next()) is not None:
+         rhs_preloaded += [s]
+
+     while (r := lhs.next()) is not None:
+         for s in rhs_preloaded:
+             result += [r + s]
+
+     return result
+
+
+
Any problems with this?
+
+
+
+
Preloading Data
+
+
Better Solution: Load both lhs and rhs records in blocks.
+
+
+ def apply_cross(lhs, rhs):
+     result = []
+
+     # take(n) is assumed to return an empty list once the input is exhausted
+     while r_block := lhs.take(100):
+         while s_block := rhs.take(100):
+             for r in r_block:
+                 for s in s_block:
+                     result += [r + s]
+         rhs.reset()
+
+     return result
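A self-contained sketch of the blocked approach may help show why it reduces rescans of rhs. The helper `take`, the `make_rhs` callable (standing in for `rhs.reset()`), and the scan counter are all assumptions introduced for this illustration; the slide's iterator interface is not specified beyond `take` and `reset`.

```python
from itertools import islice

def take(iterator, n):
    """Return up to n items from iterator as a list (empty when exhausted)."""
    return list(islice(iterator, n))

def block_nested_loop_cross(lhs, make_rhs, block_size=100):
    """Cross product of lhs with rhs, reading both inputs in blocks.

    make_rhs is a zero-argument callable that returns a fresh iterable over
    rhs; calling it again plays the role of rhs.reset().
    """
    result = []
    rhs_scans = 0
    lhs_iter = iter(lhs)
    while True:
        r_block = take(lhs_iter, block_size)
        if not r_block:
            break
        rhs_iter = iter(make_rhs())  # one full rhs scan per lhs block
        rhs_scans += 1
        while True:
            s_block = take(rhs_iter, block_size)
            if not s_block:
                break
            for r in r_block:
                for s in s_block:
                    result.append(r + s)
    return result, rhs_scans

R = [(i,) for i in range(5)]
S = [(j,) for j in "ab"]
pairs, scans = block_nested_loop_cross(R, lambda: S, block_size=2)
print(len(pairs), scans)  # 10 3
```

With five lhs tuples and a block size of 2, rhs is scanned three times (once per lhs block) instead of five times (once per lhs tuple), while the number of output pairs is unchanged.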
+
+
+
+
+
Block-Nested Loop Join
+
+
+
+
+
Join Conditions
+
+
+
+
+
+
+
+
+
+
+
diff --git a/slides/cse4562sp2018/2018-02-12-ExtRA.html b/slides/cse4562sp2018/2018-02-14-ExtRA.html
similarity index 89%
rename from slides/cse4562sp2018/2018-02-12-ExtRA.html
rename to slides/cse4562sp2018/2018-02-14-ExtRA.html
index 892fea13..5ac94534 100644
--- a/slides/cse4562sp2018/2018-02-12-ExtRA.html
+++ b/slides/cse4562sp2018/2018-02-14-ExtRA.html
@@ -49,7 +49,7 @@
-
Extended Relational Algebra
+
Extended RA
CSE 4/562 – Database Systems
February 12, 2018
@@ -59,7 +59,7 @@
Extended Relational Algebra
-
Set Operations
+
Set/Bag Operations
Select ($\sigma$), Project ($\pi$), Join ($\bowtie$), Union ($\cup$)
+ $$\tau_{A}(R)$$
+ The tuples of $R$ in ascending order according to 'A'
+
+
+
+ $$\textbf{L}_{n}(R)$$
+ The first $n$ tuples of R
+
(Typically combined with sort. If not, pick arbitrarily.)
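As a rough illustration of the two operators, the sketch below models a relation as a list of dictionaries. The function names `sort_by` and `limit` are hypothetical and chosen only for this example.

```python
from itertools import islice
from operator import itemgetter

def sort_by(rows, attr):
    """tau_attr(rows): rows in ascending order of the given attribute."""
    return sorted(rows, key=itemgetter(attr))

def limit(rows, n):
    """L_n(rows): the first n rows, in whatever order the input arrives."""
    return list(islice(rows, n))

R = [{"A": 5, "B": 6}, {"A": 1, "B": 2}, {"A": 3, "B": 4}]
top2 = limit(sort_by(R, "A"), 2)
print(top2)  # [{'A': 1, 'B': 2}, {'A': 3, 'B': 4}]
```

Applying `limit` without `sort_by` returns whichever rows happen to come first, which matches the "pick arbitrarily" remark.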
+
+
+
+
+
Distinct
+
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Flow-Cross.svg b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Cross.svg
new file mode 100644
index 00000000..df8d8209
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Cross.svg
@@ -0,0 +1,423 @@
+
+
+
+
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Flow-Project.svg b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Project.svg
new file mode 100644
index 00000000..dbe2612e
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Project.svg
@@ -0,0 +1,252 @@
+
+
+
+
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Flow-Select.svg b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Select.svg
new file mode 100644
index 00000000..43e07a21
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Select.svg
@@ -0,0 +1,285 @@
+
+
+
+
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Flow-Union.svg b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Union.svg
new file mode 100644
index 00000000..4c1c31d3
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Flow-Union.svg
@@ -0,0 +1,319 @@
+
+
+
+
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Join-BNLJ.svg b/slides/cse4562sp2018/graphics/2018-02-12-Join-BNLJ.svg
new file mode 100644
index 00000000..2d227ef3
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Join-BNLJ.svg
@@ -0,0 +1,269 @@
+
+
+
+
diff --git a/slides/cse4562sp2018/graphics/2018-02-12-Join-NLJ.svg b/slides/cse4562sp2018/graphics/2018-02-12-Join-NLJ.svg
new file mode 100644
index 00000000..5a44cba7
--- /dev/null
+++ b/slides/cse4562sp2018/graphics/2018-02-12-Join-NLJ.svg
@@ -0,0 +1,274 @@
+
+
+
+
diff --git a/src/seminar/2018sp.erb b/src/seminar/2018sp.erb
index 96e825d5..90bb24cf 100644
--- a/src/seminar/2018sp.erb
+++ b/src/seminar/2018sp.erb
@@ -12,13 +12,16 @@ schedule:
bio: |
Julia Stoyanovich is Assistant Professor of Computer Science at Drexel University, and an affiliated faculty at the Center for Information Technology Policy at Princeton. She is a recipient of an NSF CAREER award and of an NSF/CRA Computing Innovations Fellowship. Julia's research focuses on responsible data management and analysis practices: on operationalizing fairness, diversity, transparency, and data protection in all stages of the data acquisition and processing lifecycle. She established the Data, Responsibly consortium, serves on the ACM task force to revise the Code of Ethics and Professional Conduct, and is active in the New York City algorithmic transparency effort. In addition to data ethics, Julia works on management and analysis of preference data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst.
- when: March 7 (tentative); Time TBD
- what: Title TBD
+ what: "Deep Curation: Putting Open Science Data to Work"
who: Bill Howe (University of Washington)
where: Location TBD
details:
- abstract: TBD
+ abstract: |
+ Data in public repositories and in the scientific literature remains remarkably underused despite significant investments in open data and open science. Making data available online turns out to be the easy part; making the data usable for data science requires new services to support longitudinal, multi-dataset analysis rather than just settling for keyword search.
+ In the Deep Curation project, we use distant supervision and co-learning to automatically label datasets with zero training data. We have applied this approach to curate gene expression data and identify figures in the scientific literature, outperforming state-of-the-art supervised methods that rely on supervision. We then use claims extracted from the text of papers to guide probabilistic data integration and schema matching, affording experiments to automatically verify claims against open data, providing a repository-wide "report card" for the utility of data and the reliability of the claims against them.
bio: |
Bill Howe is an Associate Professor in the Information School, Adjunct Associate Professor in Computer Science & Engineering, and Associate Director and Senior Data Science Fellow at the UW eScience Institute. He is a co-founder of Urban@UW, and with support from the MacArthur Foundation and Microsoft, leads UW's participation in the MetroLab Network. He created a first MOOC on Data Science through Coursera, and led the creation of the UW Data Science Masters Degree, where he serves as its first Program Director and Faculty Chair. He also serves on the Steering Committee of the Center for Statistics in the Social Sciences.
+ headshot: "https://faculty.washington.edu/~billhowe/images/billhowe.png"
- when: May 15; Time TBD
what: Rethinking Query Execution on Big Data
who: Dan Suciu (University of Washington)
diff --git a/src/teaching/cse-562/2018sp/index.erb b/src/teaching/cse-562/2018sp/index.erb
index 6120997f..28441bbe 100644
--- a/src/teaching/cse-562/2018sp/index.erb
+++ b/src/teaching/cse-562/2018sp/index.erb
@@ -42,9 +42,10 @@ In this course, you will learn...