# CSE 662 - Fall 2016
Addressing the challenges of big data requires a combination of human intuition
and automation. Rather than tackling these challenges head-on with
build-from-scratch solutions, or through general-purpose database systems,
developer and analyst communities are turning to building blocks: Specialized
languages, runtimes, data-structures, services, compilers, and frameworks that
simplify the task of creating systems that are powerful enough to handle
terabytes of data or more, or efficient enough to run on your smartphone. In
this class, we will explore these fundamental building blocks and how they
relate to the constraints imposed by workloads and the platforms they run on.
Coursework consists of lectures and a multi-stage final project. Students are
expected to attend all lectures. Projects may be performed individually or in
groups. Projects will be evaluated in three stages through code deliverables,
reports, and group meetings with either or both of the instructors. During
these meetings, instructors will question the entire group extensively about
the group's report, deliverables, and any related tools and technology.
1. At the initial stage, students are expected to demonstrate a high level of
proficiency with the tools, techniques, data structures, algorithms and
source code that will form the basis of their project. The group is expected
to submit and defend a roughly 5-page report surveying the space in which
their project will be performed. This report and presentation constitute 15%
of the final grade.
2. At the second stage, students are expected to provide a detailed design for
their final project. A roughly 5-page report should outline the group's
proposed design, any algorithms or data structures being introduced, as well
as a strategy for evaluating the resulting project against the current state
of the art. This report and presentation constitute 35% of the final grade.
3. At the final stage, students are expected to provide a roughly 5-page report
detailing their project, any algorithms or data structures developed, and
evaluating their project against any comparable state-of-the-art systems and
techniques. Groups will also be expected to demonstrate their project and
present their findings in class, or in a meeting with both instructors if
necessitated by time constraints. This report and presentation constitute
50% of the final grade.

-----
## Information
* **Instructors**
    * Lukasz Ziarek (Davis 338E; Office Hours Mon 1:00-2:00)
    * Oliver Kennedy (Davis 338H; Office Hours Weds 2:00-3:30)
* **Time**: MWF 11:00-11:50
* **Location**: Knox 04
## Course Objectives
After taking the course, students should be able to:
* **Design domain-specific query languages**, by first developing an understanding
of the common tropes of a target domain, exploring ways of allowing users to
efficiently express those tropes, and developing ways of mapping the resulting
programs to an efficient evaluation strategy.
* **Identify concurrency challenges in data-intensive computing tasks**, and
address them through locking, code associativity, and correctness analysis.
* **Understand a variety of index data structures**, as well as their application
and use in data management systems for high velocity, volume, veracity,
and/or variety data.
* **Understand query and program compilation techniques**, including the design of
intermediate representations, subexpression equivalence, cost estimation, and
the construction of target-representation code.

-----
## Course Schedule
* **Aug. 29** : Introduction ([slides](http://odin.cse.buffalo.edu/slides/cse662fa2016/2016-08-29-Intro.pdf))
* **Aug. 31** : Project Background 1 ([Mimir](http://odin.cse.buffalo.edu/slides/talks/2016-3-NYU-Mimir/), [JITDs](http://odin.cse.buffalo.edu/slides/talks/2015-3-JITDs.zip))
* **Sept. 02** : Project Background 2 ([PocketData](http://odin.cse.buffalo.edu/slides/talks/2015-7-OhioPocketData/))
* **Sept. 07** : Database Cracking ([paper](http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf), [slides](http://odin.cse.buffalo.edu/slides/cse662fa2016/2016-09-07-Cracking.pdf))
* **Sept. 09** : Functional Data Structures ([slides](http://odin.cse.buffalo.edu/slides/cse662fa2016/2016-09-09-FuncDSes.pdf))
* **Sept. 12** : Just-in-Time Data Structures ([paper](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper9.pdf))
* **Sept. 14** : Incomplete Databases ([paper](http://dl.acm.org/citation.cfm?id=1376686))
* **Sept. 16** : Probabilistic Databases

-----
## Topics
1. Datastructures and Indexes
    * Functional Datastructures
    * Adaptive Indexes
    * Cracker Indexes
    * Adaptive Merge Trees
    * Just-in-Time Datastructures
2. Emerging Workload Challenges
    * PocketData
    * Object-Relational Mappers
3. Probabilistic Languages & Data
    * Probabilistic DBs
    * Possible Worlds
    * C-Tables vs PC-Tables vs VC-Tables
    * Probabilistic Programming Languages
4. Transactions & Synchrony
    * CAP and CALM
    * BloomL
    * "Partial results in database systems"
    * Software Transactional Memory
5. Incremental Computation
    * Incremental View Maintenance
    * DBToaster
    * Self-Adapting Computation
6. Program Analysis & Optimization (Time Permitting)
    * PL/Compiler Optimization Principles
    * DSLs for Data-Driven Applications
    * Declarative Games
    * Truffle/Graal/LMS

-----
## Project Ideas
### JITDs
Adaptive indexes are a mechanism to improve the performance of a database by
modifying indexing structures "on-the-fly" instead of in one large and costly
bulk operation. Adaptive indexes have been shown to work very well, but only on
workloads that exhibit specific characteristics. JITDs are a generalization of
adaptive indexes, allowing for the expression of many different types of
adaptive indexes. To do this, they provide basic building blocks called nodes
that define index structure. How nodes "fit" together ends up defining index
behavior.
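
To make the "on-the-fly" idea concrete, here is a minimal sketch of a cracker-style index in Python. It is not the JITD implementation and not the algorithm from the cracking paper verbatim; the class name `CrackerColumn` and its methods are hypothetical, and the column is assumed to be a plain list of integers. Each range query partitions only the piece of the column that contains its bounds, so the data gets a little more organized with every query instead of being sorted up front.

```python
import bisect


class CrackerColumn:
    """Toy cracker index over a list of integers (illustrative only)."""

    def __init__(self, values):
        self.values = list(values)  # physically reorganized as queries arrive
        self.pivots = []            # sorted pivot values seen so far
        self.splits = []            # splits[i]: first position holding a value >= pivots[i]

    def _crack(self, pivot):
        """Partition the one piece that contains `pivot`; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.splits[i]  # this pivot was already used by an earlier query
        # The piece is bounded by the neighbouring splits (or the column ends).
        lo = self.splits[i - 1] if i > 0 else 0
        hi = self.splits[i] if i < len(self.splits) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.splits.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high, cracking on both bounds."""
        start = self._crack(low)
        end = self._crack(high)
        return self.values[start:end]
```

A call such as `CrackerColumn(data).range_query(5, 10)` only reorders the data around 5 and 10. Later queries touch smaller and smaller pieces, which is why the benefit depends so heavily on the workload.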
#### Workload-Specific Policies
This project is aimed at coming up with and studying a specific workload and
structuring a policy to manage the creation and usage of nodes based on this
workload. The policy should result in runtime behavior that is more beneficial
for the workload than classic, workload-agnostic policies.
##### Background material:
* [Database Cracking](http://stratos.seas.harvard.edu/files/IKM_CIDR07.pdf)
* [JITDs Presentation](http://odin.cse.buffalo.edu/slides/talks/2015-3-JITDs.zip)
* [JITDs Intro Paper](http://odin.cse.buffalo.edu/papers/2015/CIDR-jitd-final.pdf)
* [GitHub Repo](https://github.com/UBOdin/jitd)
### PocketData
While "big data" is all the rage, another important type of data is the data
gathered and curated on your mobile devices. We call this data "pocket data".
#### App Workload Characteristics
In this project you will be given access to anonymized traces of database usage
patterns on mobile devices. This data has been gathered "in the wild" through
the PhoneLab project. You will be tasked with analyzing this data set and
characterizing different applications based on how they utilize pocket data.
Based on your analysis, you will create a benchmark that exhibits the
characteristics you observed, use it to test the performance of modern mobile
databases, and compare how they perform against classic database
implementations.
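
As a rough starting point for this kind of characterization, the sketch below computes the mix of statement types issued by each app. It assumes a simplified, hypothetical trace format of `(app_name, sql_text)` pairs; the actual PhoneLab traces are structured differently and will need their own parsing. The function names are illustrative as well.

```python
from collections import Counter, defaultdict


def statement_kind(sql):
    """Classify a SQL statement by its leading keyword."""
    words = sql.strip().split(None, 1)
    keyword = words[0].upper() if words else ""
    return keyword if keyword in {"SELECT", "INSERT", "UPDATE", "DELETE"} else "OTHER"


def workload_profile(trace):
    """Map each app name to the fraction of its statements of each kind.

    `trace` is an iterable of (app_name, sql_text) pairs (an assumed format).
    """
    counts = defaultdict(Counter)
    for app, sql in trace:
        counts[app][statement_kind(sql)] += 1
    return {
        app: {kind: n / sum(kinds.values()) for kind, n in kinds.items()}
        for app, kinds in counts.items()
    }
```

A read/write mix like this is only one dimension of a workload. Query complexity, result sizes, and timing between statements are other characteristics a benchmark could reproduce.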
##### Background material:
* [PocketData Slides](http://odin.cse.buffalo.edu/slides/talks/2015-7-OhioPocketData/)
* [PocketData Preliminary Study](http://odin.cse.buffalo.edu/papers/2015/TPCTC-sqlite-final.pdf)
* [PhoneLab](https://phone-lab.org)
* [PhoneLab Example Dataset](https://phone-lab.org/static/experiment/sample_dataset.tgz)
### Mimir
Mimir is a probabilistic database for data cleaning. The idea is to
acknowledge that, while automation is amazing and makes data management easy,
automation usually relies on heuristics that could screw things up. Instead of
hiding potential screw-ups from the user, Mimir keeps track of where things
might go wrong, communicates that information to users, and helps them overcome
potential consequences of screw-ups.
#### More Natural Query Interfaces
While Mimir makes data quality more intuitive, it's still limited to classical
SQL queries. For this project, you will explore a variety of graphical user
interfaces for data, and integrate one or more of them into Mimir itself.
##### A few ideas:
* Workspace-style data management and exploration interfaces, like GestureDB, Vizdom, or Apple's Numbers.
* Natural language queries. For example, a tool called Kueri was recently released that purports to attach to SQL databases and add support for natural language queries 'for free'.
##### Background material:
* [Mimir system overview](http://arxiv.org/abs/1601.00073)
* [Mimir on GitHub](https://github.com/UBOdin/mimir)
* [Mimir Website](http://mimirdb.info)
* [GestureDB](http://interact.osu.edu/gesturedb/)
* [Vizdom](http://tupleware.cs.brown.edu/vizdom/)
* [Kueri](http://kueri.me/tour/)
#### Performance + Postgres
The goal of this project idea is to first apply a few performance improvements
to Mimir's existing SQLite backend, and then port those improvements to
Postgres. The main optimization target is a sort of placeholder that Mimir
uses to mark places in the data where automated heuristics needed to make a
choice (that could be screwed up). The placeholders (or VG Terms) get filled
in with Mimir's best guess when the user runs a query... a process that is
pretty expensive the way Mimir does it right now. After improving the
situation for SQLite, the next step of the project (time permitting) would be
to port Mimir (and the fix) to Postgres.
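
The sketch below illustrates the general shape of the problem in Python. It is not Mimir's code or API: `Placeholder`, `resolve`, and `run_query` are hypothetical names, and representing a placeholder as a list of candidate values with probabilities is an assumption made only for this example. The point is that every placeholder in every scanned row is resolved to its most likely value each time a query runs, which is where the cost comes from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placeholder:
    """Marks a cell whose value a heuristic had to guess.

    Hypothetical stand-in for a VG Term: a tuple of (value, probability) pairs.
    """
    candidates: tuple

    def best_guess(self):
        # Pick the candidate value with the highest probability.
        return max(self.candidates, key=lambda c: c[1])[0]


def resolve(row):
    """Replace every placeholder in a row with its best guess."""
    return {
        col: v.best_guess() if isinstance(v, Placeholder) else v
        for col, v in row.items()
    }


def run_query(table, predicate):
    """Filter a table, resolving placeholders row by row at query time."""
    return [r for r in map(resolve, table) if predicate(r)]
```

For example, a row whose `city` is `Placeholder((("Buffalo", 0.8), ("Boston", 0.2)))` is treated as `"Buffalo"` by a filter on `city`, and that guess is recomputed on every query in this naive version.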
##### Background material:
* [Mimir system overview](http://arxiv.org/abs/1601.00073)
* [Mimir on GitHub](https://github.com/UBOdin/mimir)
* [Mimir Website](http://mimirdb.info)
* [VGTerms + Lenses](http://odin.cse.buffalo.edu/papers/2015/VLDB-lenses-final.pdf)
* [SQLite UDF Examples in Scala](https://github.com/UBOdin/mimir/blob/dd03b267322cc720cff0a2dc282854f3ac999576/mimircore/src/main/scala/mimir/sql/sqlite/SQLiteCompat.scala)

-----
## Academic Content
The course will involve lectures and readings drawn from an assortment of
academic papers selected by the instructors. There is no textbook for the
course.

-----
## Academic Integrity
Students may discuss and advise one another on their lab projects, but groups
are expected to turn in their own work. Cheating on any course deliverable will
result in automatic failure of the course. The University policy on academic
integrity can be reviewed at:
http://academicintegrity.buffalo.edu/policies/index.php

-----
## Accessibility Resources
If you have a diagnosed disability (physical, learning, or psychological) that
will make it difficult for you to carry out the course work as outlined, or
that requires accommodations such as recruiting note-takers, readers, or
extended time on exams or assignments, please advise the instructor during the
first two weeks of the course so that we may review possible arrangements for
reasonable accommodations. In addition, if you have not yet done so, contact
the Office of Accessibility Resources (formerly the Office of Disability
Services).