Merge branch 'master' of gitlab.odin.cse.buffalo.edu:odin-lab/Website

pull/1/head
Oliver Kennedy 2017-08-27 23:52:40 -04:00
commit d8dfe0fee6
7 changed files with 94 additions and 30 deletions


@@ -8,6 +8,7 @@ require "cv.rb"
require "nsfcp.rb"
require "nsfconflicts.rb"
require "bootstrap_markdown.rb"
require "disqus.rb"
include GemSmith
$db = JDB.new("db")

lib/disqus.rb (new file, 22 lines)

@@ -0,0 +1,22 @@
module Disqus
  def Disqus::embed(url, identifier)
    return "<div id=\"disqus_thread\"></div>
    <script>
      var disqus_config = function () {
        this.page.url = \"#{url}\"
        this.page.identifier = \"#{identifier}\"
      };
      (function() {
        var d = document, s = d.createElement('script');
        s.src = 'https://ubodin.disqus.com/embed.js';
        s.setAttribute('data-timestamp', +new Date());
        (d.head || d.body).appendChild(s);
      })();
    </script>
    <noscript>Please enable JavaScript to view the <a href=\"https://disqus.com/?ref_noscript\">comments powered by Disqus.</a></noscript>
    "
  end
end


@@ -0,0 +1,13 @@
<h2>Reading Assignment 1: Adaptive Indexing</h2>
<dl>
<dt>Paper</dt>
<dd><a href="http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf">Database Cracking</a></dd>
</dl>
<p>All students should provide one short paragraph identifying at least one strength and at least one weakness of the approach described in the week's reading.</p>
<%= Disqus::embed(
"http://odin.cse.buffalo.edu/teaching/cse-662/2017fa/group_formation.html",
"cse662.2017fa.group_formation"
) %>


@@ -0,0 +1,6 @@
<h2>Group Formation Thread</h2>
<%= Disqus::embed(
"http://odin.cse.buffalo.edu/teaching/cse-662/2017fa/group_formation.html",
"cse662.2017fa.group_formation"
) %>


@@ -34,7 +34,7 @@ the group's report, deliverables, and any related tools and technology.
3. At the final stage, students are expected to provide a roughly 5-page report
detailing their project, any algorithms or data structures developed, and
evaluating their project against any comparable state of the art systems and
techniques. Groups will also be expected to demon- strate their project and
techniques. Groups will also be expected to demonstrate their project and
present their findings in-class, or in a meeting with both instructors if
necessitated by time constraints. This report and presentation constitute
50% of the final grade.
@@ -73,30 +73,32 @@ After taking the course, students should be able to:
## Course Schedule
* **Aug. 28** : Introduction ([overview](2017-08-28-Introduction.html))
* **Aug. 30** : Project Seeds - Mimir
* **Sept. 01** : Project Seeds - JITDs &amp; PocketData
* **Sept. 04** : Database Cracking ( [Cracking](http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf) )
* **Aug. 28** : Introduction [ [slides](slides/2017-08-28-Intro.pdf) | [form groups](group_formation.html) ]
* **Aug. 30** : Project Seeds 1 [ [slides](slides/2017-08-30-Seeds.pdf) ]
* **Sept. 01** : Project Seeds 2 [ [slides](slides/2017-08-30-Seeds.pdf) ]
* **Sept. 04** : Database Cracking [ [paper](http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf) | [feedback](feedback/01-cracking.html) ]
* **Sept. 06** : Functional Data Structures
* **Sept. 12** : Just-in-Time Data Structures ( [JITDs])(http://odin.cse.buffalo.edu/papers/2015/CIDR-jitd-final.pdf) )
* **Sept. 12** : Just-in-Time Data Structures [ [paper](http://odin.cse.buffalo.edu/papers/2015/CIDR-jitd-final.pdf) ]
* **Sept. 8** : Incomplete Databases 1
* **Sept. 11** : Incomplete Databases 2
* **Sept. 13** : Incomplete Databases 3
* **Sept. 15** : Mimir ( [Mimir](http://odin.cse.buffalo.edu/papers/2015/VLDB-lenses-final.pdf) )
* **Sept. 18** : MayBMS ( [MayBMS](http://maybms.sourceforge.net/download/INFOSYS-TR-2007-2.pdf) )
* **Sept. 20** : Sampling From Probabilistic Queries ( [MCDB](http://dl.acm.org/citation.cfm?id=1376686) )
* **Sept. 22** : Probabilistic Constraint Repair ( [Sampling from Repairs](https://cs.uwaterloo.ca/~ilyas/papers/BeskalesVLDBJ2014.pdf) )
* **Sept. 25** : R-Trees and Multidimensional Indexing
* **Sept. 15** : Mimir [ [paper](http://odin.cse.buffalo.edu/papers/2015/VLDB-lenses-final.pdf) ]
* **Sept. 18** : MayBMS [ [paper](http://maybms.sourceforge.net/download/INFOSYS-TR-2007-2.pdf) ]
* **Sept. 20** : Sampling From Probabilistic Queries [ [paper](http://dl.acm.org/citation.cfm?id=1376686) ]
* **Sept. 22** : Probabilistic Constraint Repair [ [paper](https://cs.uwaterloo.ca/~ilyas/papers/BeskalesVLDBJ2014.pdf) ]
* **Sept. 25** : R-Trees and Multidimensional Indexing [ [paper](http://dl.acm.org/citation.cfm?id=98741) ]
* **Checkpoint 1 report due by 11:59 PM Sept. 26**
* **Sept. 27 - Sept. 29** : Student Project Presentations
* **Oct. 2** : BloomL ( [Bloom/Bud](http://cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf), [BloomL](http://dl.acm.org/citation.cfm?id=2391230) )
* **Oct. 2** : BloomL [ [paper-1](http://cidrdb.org/cidr2011/Papers/CIDR11_Paper35.pdf), [paper-2](http://dl.acm.org/citation.cfm?id=2391230) ]
* **Oct. 4 - Oct. 6** : *Oliver Away* (Content TBD)
* **Oct. 9** : NoDB ( [NoDB](http://www.vldb.org/pvldb/vol7/p1119-karpathiotakis.pdf) )
* **Oct. 9** : NoDB [ [paper](http://www.vldb.org/pvldb/vol7/p1119-karpathiotakis.pdf) ]
* **Oct. 11 - Oct. 13** : Student Project Presentations
* **Oct. 16** : Lazy Transactions ( [Stickies](http://dl.acm.org/citation.cfm?id=2610529) )
* **Oct. 18** : Streaming ( [Cayuga](http://www.cs.cornell.edu/johannes/papers/2007/2007-CIDR-Cayuga.pdf) )
* **Oct. 20** : Scan Sharing ( [Crescando](http://dl.acm.org/citation.cfm?id=1807326) )
* **Oct. 16** : Lazy Transactions [ [paper](http://dl.acm.org/citation.cfm?id=2610529) ]
* **Oct. 18** : Streaming [ [paper](http://www.cs.cornell.edu/johannes/papers/2007/2007-CIDR-Cayuga.pdf) ]
* **Oct. 20** : Scan Sharing [ [paper](http://dl.acm.org/citation.cfm?id=1807326) ]
* **Checkpoint 2 report due by 11:59 PM Oct. 22**
* **Oct. 23 - Oct. 27** : Checkpoint 2 Reviews
* **Oct. 30** : Declarative Games ( [SGL](https://infoscience.epfl.ch/record/166858/files/31-sigmod2007_games.pdf) )
* **Oct. 30** : Declarative Games [ [paper](https://infoscience.epfl.ch/record/166858/files/31-sigmod2007_games.pdf) ]
* **Nov. 1 - Nov. 3** : Student Project Presentations
* **Nov. 6 - Nov. 10** : *Oliver Away* (Content TBD)
* **Nov. 13** : *Buffer*
@@ -105,7 +107,9 @@ After taking the course, students should be able to:
* **Nov. 22 - Nov. 24** : Student Project Presentations
* **Nov. 27** : *Buffer*
* **Nov. 29 - Dec. 1** : Student Project Presentations
* **Checkpoint 3 report due by 11:59 PM Dec. 3**
* **Dec. 4 - Dec. 8** : Checkpoint 3 Reviews
* **Demo Day Time/Location To Be Announced**
-----
@@ -120,7 +124,7 @@ There are a number of reasons that data might go bad: sensor errors, data entry
3. Using Mimir to warn users when a query result depends on a tuple that participates in a violation (one way to find such violations is sketched below)
4. Suggesting and ranking modifications that repair violations
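Below is a minimal Scala sketch of the kind of detection query item 3 relies on: given a functional dependency, it emits a SQL self-join whose results are the violating tuple pairs. The `addresses` table, the `zip -> city` dependency, and the helper itself are hypothetical illustrations, not part of Mimir's actual API.
```
// Hypothetical sketch, not Mimir's actual API: given a functional
// dependency lhs -> rhs, emit a SQLite-style self-join whose results
// are the pairs of rows that violate it.  The table and column names
// used in main() are made up for illustration.
object FDViolations {
  def violationQuery(table: String, lhs: String, rhs: String): String =
    s"""SELECT t1.rowid AS left_row, t2.rowid AS right_row
       |FROM $table t1 JOIN $table t2
       |  ON t1.$lhs = t2.$lhs AND t1.$rhs <> t2.$rhs""".stripMargin

  def main(args: Array[String]): Unit =
    println(violationQuery("addresses", "zip", "city"))
}
```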
###### Background material:
###### Background Material:
* [Sampling from Repairs](https://cs.uwaterloo.ca/~ilyas/papers/BeskalesVLDBJ2014.pdf)
* [Qualitative Data Cleaning](http://dl.acm.org/citation.cfm?id=3007320)
@@ -138,7 +142,7 @@ Most probabilistic database systems aim to produce all possible results. A few,
Perhaps counterintuitively, our preliminary implementations of the [Interleave](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/exec/mode/SampleRows.scala) and [Tuple Bundle](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/exec/mode/TupleBundle.scala) algorithms suggest that none of these approaches will be the best in all cases. For example, in a simple select-aggregate query, tuple-bundles are the most efficient. Conversely, if you're joining on an attribute with different values in each possible world, interleave will be faster. We suspect that there are some cases where Naive will win out as well. The aim of this project is to implement a query optimizer for sampling-based probabilistic database queries. If I hand you a query, you tell me which strategy is fastest for that query. As an optional extension, you may be able to interleave different strategies, each evaluating a different part of the query.
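As a concrete, if oversimplified, illustration of the decision such an optimizer would make, the Scala sketch below encodes the two heuristics just mentioned over a toy query description. The `QueryProfile` type and the rules themselves are hypothetical starting points, not Mimir's actual planner.
```
// A minimal, hypothetical sketch of the kind of rule the proposed
// optimizer might start from: tuple bundles for select-aggregate
// queries, interleaving when a join key is itself uncertain, and
// naive sampling otherwise.
sealed trait Strategy
case object TupleBundle extends Strategy
case object Interleave  extends Strategy
case object NaiveSample extends Strategy

case class QueryProfile(
  hasAggregate: Boolean,       // does the query aggregate its results?
  joinsOnUncertainKey: Boolean // does a join key take different values per world?
)

object StrategyChooser {
  def choose(q: QueryProfile): Strategy =
    if (q.joinsOnUncertainKey) Interleave   // join keys differ across worlds
    else if (q.hasAggregate)   TupleBundle  // simple select-aggregate case
    else                       NaiveSample  // fall back to naive sampling

  def main(args: Array[String]): Unit =
    println(choose(QueryProfile(hasAggregate = true, joinsOnUncertainKey = false)))
}
```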
###### Background material:
###### Background Material:
* [MCDB](http://dl.acm.org/citation.cfm?id=1376686)
* [BlinkDB](http://blinkdb.org/)
@@ -162,28 +166,46 @@ If you ask "Why is this result so low", the system can look at the above constraints
```
SELECT COUNT(*) FROM Publications WHERE author = 'Alice' AND venue = 'ICDE' AND year = 2017;
```
The aim of this project would be to implement a simple frontend to an existing database system (Spark, SQLite, or Oracle) that accepts a set of constraints and answers questions like this. This project is part of ongoing joint work with Boris Glavic and Sudeepa Roy.
The aim of this project would be to implement a simple frontend to an existing database system (Spark, SQLite, or Oracle) that accepts a set of constraints and answers questions like this.
###### Background material:
(This project is part of ongoing joint work with Boris Glavic and Sudeepa Roy)
###### Background Material:
* [Causality and Explanations in Databases](https://users.cs.duke.edu/~sudeepa/vldb2014-Tutorial-causality-explanations.pdf)
* [DBExplain](https://cudbg.github.io/lab/dbexplain)
* [Scorpion](http://sirrice.github.io/files/papers/scorpion-vldb13.pdf)
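A very rough Scala sketch of the frontend idea described above follows. The constraint representation and the matching rule are invented for illustration; they are not the design from the ongoing joint work.
```
// Hedged sketch: match a "why is this count so low?" question against
// registered expectations.  Selection stands in for the WHERE clause of
// the COUNT(*) query above; the example constraint is made up.
case class Selection(author: Option[String], venue: Option[String], year: Option[Int])

// A registered expectation, e.g. "Alice usually publishes at ICDE".
case class Constraint(description: String, applies: Selection => Boolean)

object WhyFrontend {
  val constraints = List(
    Constraint("Alice publishes 1-2 ICDE papers per year",
      s => s.author.contains("Alice") && s.venue.contains("ICDE"))
  )

  // Return the constraints that could explain a surprising result
  // for the given selection.
  def explain(sel: Selection): List[String] =
    constraints.filter(_.applies(sel)).map(_.description)

  def main(args: Array[String]): Unit =
    println(explain(Selection(Some("Alice"), Some("ICDE"), Some(2017))))
}
```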
#### Adaptive Multidimensional Indexing
(Summary In Progress)
#### Mimir on SparkSQL
(Summary In Progress)
#### Physical Layouts for Multiversion (Uncertain) Data
Classical versioning is a monotone operation: it's rare that someone will want to maintain parallel versions of the data. Conversely, data cleaning requires us to keep track of many different versions of a dataset. For example, there exist some very powerful regression algorithms that can detect outliers very effectively. However, these techniques can't really point out why those outliers are there. Maybe there's missing context that would explain the outlier? Maybe there's an actual data error? Maybe there's a problem with how the data is being interpreted? In short, every outlier should be classified as an "optional" version. In other words, for every outlier, we may wish to fork the data, creating one set of versions with the outlier and an otherwise equivalent set without it. Obviously, this will create an exponential number of versions, so we need some way to eliminate redundancy in the stored versions. Fundamentally, the aim of this project is to outline a range of different workflow options for uncertain data, and derive one or more techniques for how to store, sort, index, and query this data (one simple redundancy-eliminating layout is sketched below).
###### Background Material:
* [C-Tables](http://dl.acm.org/citation.cfm?id=1886)
* [Data Polygamy](http://dl.acm.org/citation.cfm?id=2915245)
* [MauveDB](http://dl.acm.org/citation.cfm?id=1142483)
* [Indexing Uncertain Data](http://dl.acm.org/citation.cfm?id=1559816)
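As a toy illustration of the redundancy-elimination point, one could store the base data once and treat each detected outlier as an optional delta, so that any of the exponentially many versions can be materialized on demand. The representation below is hypothetical, a Scala sketch rather than a proposed solution.
```
// Hypothetical layout: base rows are stored once, each outlier is an
// optional delta, and a version is identified by which outliers it keeps,
// so 2^outliers.length versions exist without duplicating the base data.
case class Row(id: Int, value: Double)

case class VersionedTable(base: Vector[Row], outliers: Vector[Row]) {
  // keep(i) decides whether the i-th outlier is included in this version.
  def materialize(keep: Int => Boolean): Vector[Row] =
    base ++ outliers.zipWithIndex.collect { case (r, i) if keep(i) => r }
}

object VersionedTableDemo {
  def main(args: Array[String]): Unit = {
    val t = VersionedTable(
      base     = Vector(Row(1, 10.0), Row(2, 11.0)),
      outliers = Vector(Row(3, 900.0))
    )
    println(t.materialize(_ => false)) // version without the outlier
    println(t.materialize(_ => true))  // version with the outlier
  }
}
```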
#### Garbage Collection in Embedded Databases
(Summary In Progress)
The [PocketData](http://pocketdata.info) project explores the performance of database systems designed for embedded devices (e.g., smartphones, tablets, or sensor networks). As part of this project, we have developed several benchmark workloads aimed at Android and other Java-based settings. Although Java is a garbage-collected language, the short duration of most of our workloads means that they rarely (if ever) involve interactions with the garbage collector. The aim of this project is to develop an understanding of when (if ever) the garbage collector impacts the performance of embedded databases like SQLite or BerkeleyDB.
(This project will be co-advised by Lukasz Ziarek)
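One low-tech starting point is to watch the JVM's collector statistics around a benchmark run. The Scala sketch below uses only the standard `java.lang.management` beans (nothing PocketBench-specific, and it assumes Scala 2.13 for `CollectionConverters`) to report how many collections occurred, and how long they took, during a workload.
```
// Minimal sketch: snapshot GC counts and accumulated pause time before
// and after a workload to see whether the collector ran at all.
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

object GcProbe {
  private def gcStats(): (Long, Long) = {
    val beans = ManagementFactory.getGarbageCollectorMXBeans.asScala
    (beans.map(_.getCollectionCount).sum, beans.map(_.getCollectionTime).sum)
  }

  // Run the workload and return (result, #collections, GC milliseconds).
  def measure[A](workload: => A): (A, Long, Long) = {
    val (c0, t0) = gcStats()
    val result   = workload
    val (c1, t1) = gcStats()
    (result, c1 - c0, t1 - t0)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload that allocates heavily; a real run would drive
    // an embedded database benchmark instead.
    val (_, collections, millis) = measure { (1 to 1000000).map(_.toString).length }
    println(s"GC ran $collections times for $millis ms during the workload")
  }
}
```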
###### Background Material:
* [The PocketData Benchmark](http://odin.cse.buffalo.edu/research/pocketdata/)
* [PocketBench on GitHub](https://github.com/UBOdin/PocketBench)
#### Adaptive Multidimensional Indexing
Indexes work by reducing the effort required to locate specific data records. For example, in a tree index, if the range of records in a given subtree doesn't overlap with the query, the entire subtree can be ruled out (or ruled in). Not surprisingly, this means that data partitioning plays a large role in how effective the index is. The fewer partitions lie on query boundaries, the less work is required to respond to those queries.
Partitioning is especially a problem in 2-dimensional (and 3-, 4-, etc. dimensional) indexes, where there are always at least two entirely orthogonal dimensions to partition on. Accordingly, there's a wide range of techniques for organizing 2-dimensional data, including a family of indexes based on R-Trees. The aim of this project is to develop a "dynamic" R-Tree-like structure that adaptively partitions its contents, and (if time permits) adapts its partition boundaries to changing workloads.
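The Scala sketch below illustrates the cracking-style idea in two dimensions: a leaf splits itself the first time a query boundary cuts through it, so later queries on the same region inspect fewer points. The structure and names are hypothetical, and only splits on the x-axis are shown for brevity.
```
// Toy adaptive 2-D index: leaves crack themselves along query boundaries.
case class Pt(x: Double, y: Double)
case class Box(x0: Double, y0: Double, x1: Double, y1: Double) {
  def contains(p: Pt): Boolean = p.x >= x0 && p.x < x1 && p.y >= y0 && p.y < y1
}

sealed trait Node { def query(q: Box): Seq[Pt] }

// Inner node: partitioned along x at `splitX` (y-splits omitted for brevity).
case class Inner(splitX: Double, left: Node, right: Node) extends Node {
  def query(q: Box): Seq[Pt] =
    (if (q.x0 < splitX) left.query(q) else Nil) ++
    (if (q.x1 > splitX) right.query(q) else Nil)
}

class Leaf(val pts: Vector[Pt]) extends Node {
  def query(q: Box): Seq[Pt] = pts.filter(q.contains)

  // Crack this leaf along the query's left x-boundary if it cuts through us.
  def crack(q: Box): Node =
    if (pts.exists(_.x < q.x0) && pts.exists(_.x >= q.x0))
      Inner(q.x0, new Leaf(pts.filter(_.x < q.x0)), new Leaf(pts.filter(_.x >= q.x0)))
    else this
}
```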
###### Background Material:
* [Database Cracking](http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf)
* [The R*-tree: an efficient and robust access method for points and rectangles](http://dl.acm.org/citation.cfm?id=98741)
* [The Big Red Data Spatial Indexing Project](http://www.cs.cornell.edu/database/spatial-indexing/)
#### Mimir on SparkSQL
Spark's DataFrames are a powerful set of relational-algebra-like primitives for defining computation that can efficiently run locally or in a distributed setting. However, because Spark aimed at predominantly analytical workloads, it can not be used directly as a drop-in replacement for SQLite. The aim of this project is to transition a large database application (Mimir) from a classical relational database to Spark. Key challenges include: 1. Spark is generally designed to be read-only. Mimir needs to keep track of a variety of metadata. That means this metadata needs to be stored somewhere on the side. Step one will to create a metadata storage and lookup layer. 2. Rewriting components of Mimir to use this metadata layer. * The [View Manager](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/views/ViewManager.scala) stores and tracks view definitions and associated metadata. * The [Adaptive Schema Manager](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/adaptive/AdaptiveSchemaManager.scala) stores and tracks adaptive schema definitions and associated metadata. * The [Model Manager](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/models/ModelManager.scala) stores and tracks materialized instances of a variety of different models. 3. The [Compiler](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/exec/Compiler.scala) infrastructure and [Backend](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/sql/Backend.scala) will need to be modified to work with Spark Data Frames. Because Data Frames are relatively close to relational algebra, it may be best to go directly from one to the other without using SQL as an intermediate.
-----