Seeds and labor day

Oliver Kennedy 2017-08-30 14:38:42 -04:00
parent dc3be48966
commit c2bcd76f34
2 changed files with 34 additions and 17 deletions

## Course Schedule
* **Aug. 28** : Introduction [ [slides](slides/2017-08-28-Intro.pdf) | [form groups](group_formation.html) ]
* **Aug. 30** : Project Seeds [ [slides](slides/2017-08-30-Seeds.pdf) ]
* **Sept. 01** : Functional Data Structures
* **Sept. 04** : **No Class, Labor Day**
* **Sept. 06** : Database Cracking [ [paper](http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf) | [feedback](feedback/01-cracking.html) ]
* **Sept. 08** : Just-in-Time Data Structures [ [paper](http://odin.cse.buffalo.edu/papers/2015/CIDR-jitd-final.pdf) ]
* **Sept. 11** : Incomplete Databases 1
* **Sept. 13** : Incomplete Databases 2
* **Sept. 15** : Mimir [ [paper](http://odin.cse.buffalo.edu/papers/2015/VLDB-lenses-final.pdf) ]
* **Sept. 18** : MayBMS [ [paper](http://maybms.sourceforge.net/download/INFOSYS-TR-2007-2.pdf) ]
* **Sept. 20** : Sampling From Probabilistic Queries [ [paper](http://dl.acm.org/citation.cfm?id=1376686) ]
* **Sept. 25** : R-Trees and Multidimensional Indexing [ [paper](http://dl.acm.org/citation.cfm?id=98741) ]
* **Checkpoint 1 report due by 11:59 PM Sept. 26**
* **Sept. 27 - Sept. 29** : Student Project Presentations
* **Oct. 02** : BloomL [ [paper-1](http://cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf), [paper-2](http://dl.acm.org/citation.cfm?id=2391230) ]
* **Oct. 04 - Oct. 06** : *Oliver Away* (Content TBD)
* **Oct. 09** : NoDB [ [paper](http://www.vldb.org/pvldb/vol7/p1119-karpathiotakis.pdf) ]
* **Oct. 11 - Oct. 13** : Student Project Presentations
* **Oct. 16** : Lazy Transactions [ [paper](http://dl.acm.org/citation.cfm?id=2610529) ]
* **Oct. 18** : Streaming [ [paper](http://www.cs.cornell.edu/johannes/papers/2007/2007-CIDR-Cayuga.pdf) ]
* **Checkpoint 2 report due by 11:59 PM Oct. 22**
* **Oct. 23 - Oct. 27** : Checkpoint 2 Reviews
* **Oct. 30** : Declarative Games [ [paper](https://infoscience.epfl.ch/record/166858/files/31-sigmod2007_games.pdf) ]
* **Nov. 01 - Nov. 03** : Student Project Presentations
* **Nov. 06 - Nov. 10** : *Oliver Away* (Content TBD)
* **Nov. 13** : *Buffer*
* **Nov. 15 - Nov. 17** : Student Project Presentations
* **Nov. 20** : *Buffer*
* **Nov. 27** : *Buffer*
* **Nov. 29 - Dec. 01** : Student Project Presentations
* **Checkpoint 3 report due by 11:59 PM Dec. 03**
* **Dec. 04 - Dec. 08** : Checkpoint 3 Reviews
* **Demo Day Time/Location To Be Announced**
-----
* [DBExplain](https://cudbg.github.io/lab/dbexplain)
* [Scorpion](http://sirrice.github.io/files/papers/scorpion-vldb13.pdf)
#### Physical Layouts for Multiversion (Uncertain) Data
Classical versioning is a monotone operation: it's rare that someone wants to maintain parallel versions of the data. Conversely, data cleaning requires us to keep track of many different versions of a dataset. For example, there exist some very powerful regression algorithms that can detect outliers very effectively. However, these techniques can't really point out why those outliers are there. Maybe there's missing context that would explain the outlier? Maybe there's an actual data error? Maybe there's a problem with how the data is being interpreted? In short, every outlier should be treated as an "optional" version. In other words, for every outlier we may wish to fork the data, creating one set of versions with the outlier and an otherwise equivalent set without it. Obviously, this creates an exponential number of versions, so we need some way to eliminate redundancy in the stored versions.
Fundamentally, the aim of this project is to outline a range of different workflow options for uncertain data, and derive one or more techniques for how to store, sort, index, and query this data.
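To make the storage challenge concrete, here is a minimal sketch, in the spirit of the C-Tables paper listed below, of how an exponential family of versions can share a single linear-size table: each detected outlier becomes a boolean decision variable, and each row is tagged with the variables that must be "kept" for the row to exist. All names here (`OutlierVar`, `CRow`, `CTable`) are illustrative, not an existing API.

```scala
// Illustrative sketch: a C-Table-style encoding in which 2^n possible
// versions (one per subset of kept outliers) share one physical table.

// One boolean decision variable per detected outlier.
case class OutlierVar(id: Int)

// A row exists only in versions where every variable in its condition
// is assigned "keep"; rows with an empty condition exist in all versions.
case class CRow(values: Seq[Any], condition: Set[OutlierVar])

case class CTable(rows: Seq[CRow]) {
  // Materialize one concrete version from a truth assignment.
  def materialize(keep: OutlierVar => Boolean): Seq[Seq[Any]] =
    rows.collect { case CRow(vs, cond) if cond.forall(keep) => vs }
}

object VersionDemo extends App {
  val t = CTable(Seq(
    CRow(Seq("sensor-1", 42), Set.empty),           // present in every version
    CRow(Seq("sensor-1", 9001), Set(OutlierVar(1))) // present only if outlier 1 is kept
  ))
  println(t.materialize(_ => true))  // the version that keeps the outlier
  println(t.materialize(_ => false)) // the otherwise equivalent version without it
}
```

Storing the conditions is the easy part; sorting, indexing, and querying this representation without materializing individual versions is where the interesting work in this seed lies.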
###### Background Material:
* [C-Tables](http://dl.acm.org/citation.cfm?id=1886)
* [Data Polygamy](http://dl.acm.org/citation.cfm?id=2915245)
* [MauveDB](http://dl.acm.org/citation.cfm?id=1142483)
* [Indexing Uncertain Data](http://dl.acm.org/citation.cfm?id=1559816)
#### Garbage Collection in Embedded Databases
###### Background Material:
* [The PocketData Benchmark](http://odin.cse.buffalo.edu/research/pocketdata/)
* [PocketBench on GitHub](https://github.com/UBOdin/PocketBench)
#### Adaptive Multidimensional Indexing
#### Mimir on SparkSQL
Spark's DataFrames are a powerful set of relational-algebra-like primitives for defining computation that can efficiently run locally or in a distributed setting. However, because Spark is aimed predominantly at analytical workloads, it cannot be used directly as a drop-in replacement for SQLite. The aim of this project is to transition a large database application (Mimir) from a classical relational database to Spark. Key challenges include:
1. Spark is generally designed to be read-only, while Mimir needs to keep track of a variety of metadata. That metadata needs to be stored somewhere on the side, so step one will be to create a metadata storage and lookup layer (sketched after this list).
2. Rewriting components of Mimir to use this metadata layer.
* The [View Manager](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/views/ViewManager.scala) stores and tracks view definitions and associated metadata.
* The [Adaptive Schema Manager](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/adaptive/AdaptiveSchemaManager.scala) stores and tracks adaptive schema definitions and associated metadata.
* The [Model Manager](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/models/ModelManager.scala) stores and tracks materialized instances of a variety of different models.
3. The [Compiler](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/exec/Compiler.scala) infrastructure and [Backend](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/sql/Backend.scala) will need to be modified to work with Spark DataFrames. Because DataFrames are relatively close to relational algebra, it may be best to go directly from one to the other without using SQL as an intermediate.
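As a possible starting point for challenge 1, here is a minimal sketch of a side metadata layer that could sit next to a read-only Spark deployment. The names (`MetadataStore`, `InMemoryMetadataStore`) are hypothetical, not Mimir's actual interfaces:

```scala
// Hypothetical sketch: metadata kept outside of Spark, since Spark itself
// is read-only. A real implementation might persist to SQLite or flat files.
trait MetadataStore {
  def put(category: String, key: String, value: Map[String, String]): Unit
  def get(category: String, key: String): Option[Map[String, String]]
  def list(category: String): Seq[String]
}

class InMemoryMetadataStore extends MetadataStore {
  private val data =
    scala.collection.mutable.Map.empty[(String, String), Map[String, String]]
  def put(category: String, key: String, value: Map[String, String]): Unit =
    data((category, key)) = value
  def get(category: String, key: String): Option[Map[String, String]] =
    data.get((category, key))
  def list(category: String): Seq[String] =
    data.keys.collect { case (c, k) if c == category => k }.toSeq
}
```

With such a layer, the View Manager could record a view definition as, e.g., `store.put("views", "CLEANED_SALES", Map("query" -> "SELECT ..."))`. For challenge 3, a direct operator-to-DataFrame translation might look like the sketch below, where `Rel`, `TableScan`, `Select`, and `Project` are illustrative stand-ins for Mimir's operator classes:

```scala
import org.apache.spark.sql.{Column, DataFrame}

// A toy relational-algebra AST compiled straight to DataFrame calls,
// with no SQL string as an intermediate representation.
sealed trait Rel
case class TableScan(df: DataFrame)                   extends Rel
case class Select(condition: Column, source: Rel)     extends Rel
case class Project(columns: Seq[String], source: Rel) extends Rel

object RAToDataFrame {
  def compile(op: Rel): DataFrame = op match {
    case TableScan(df)      => df
    case Select(cond, src)  => compile(src).filter(cond)
    // assumes at least one projected column
    case Project(cols, src) => compile(src).select(cols.head, cols.tail: _*)
  }
}
```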
-----