There are a number of reasons that data might go bad: sensor errors, data-entry errors, lag spikes, filesystem corruption, and more. One thing is certain, though: you don't want to make decisions based on bad data. A common first line of defense is basic sanity checks. For example, if we have a record of someone checking _out_ of a hospital, we should also have a record of them checking _in_ at some point earlier (this rule is written out as a violation-finding query after the list below). Declaring these sanity checks is hard, but fixing violations is even harder. In this project, you will explore ways to safely defer repairing the data. Challenges include:
1. Deciding what types of sanity constraints you want to be able to support.
2. Interfacing with an existing database (Spark, SQLite, or Oracle) to identify sets of tuples that come together to violate a sanity constraint.
3. Using Mimir to warn users when a query result depends on a tuple that participates in a violation.
4. Suggesting and ranking modifications that repair violations.
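As a concrete starting point, here is one way the check-in/check-out rule could be expressed as a violation-finding query. The `Visits` table, its columns, and the event labels are all hypothetical; deciding on the real constraint language is part of challenge 1.

```
-- Hypothetical schema: Visits(patient_id, event, ts), with event one of
-- 'checkin' or 'checkout'. Each result row is a checkout with no
-- earlier matching checkin, i.e., a tuple participating in a violation.
SELECT v_out.patient_id, v_out.ts
FROM Visits v_out
WHERE v_out.event = 'checkout'
  AND NOT EXISTS (
    SELECT 1
    FROM Visits v_in
    WHERE v_in.patient_id = v_out.patient_id
      AND v_in.event = 'checkin'
      AND v_in.ts < v_out.ts
  );
```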
Most probabilistic database systems aim to produce all possible results. A few, most notably [MCDB](http://dl.acm.org/citation.cfm?id=1376686), instead generate samples of possible results. The basic idea is to split the database into a fixed number (N) of _possible worlds_, and run the query on all N possible worlds in parallel. There are actually a few different ways to do this. Three relatively common examples include:
* **Naive**: Literally run N copies of the query and union the results at the end.
* **Interleave**: Tag each tuple with the possible world it comes from, and then run just one query, written so that tuples from different possible worlds can't interact (i.e., joins only happen between tuples from the same world, and the world becomes another group-by column). See the sketch after this list.
* **Tuple Bundle**: Create mega-tuples that represent the alternative versions of the same tuple across possible worlds. If an attribute value is the same in all possible worlds, store only one copy of it. (See [MCDB](http://dl.acm.org/citation.cfm?id=1376686))
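To make the Interleave rewrite concrete, here is a sketch over a hypothetical pair of tables `R(k, x)` and `S(k, y)`, sampled into `R_worlds` and `S_worlds` with an extra `world_id` column:

```
-- One-world query:
--   SELECT x, COUNT(*) FROM R NATURAL JOIN S GROUP BY x;
-- Interleaved query over all N worlds at once: the extra join predicate
-- keeps worlds from mixing, and the extra group-by column keeps one
-- aggregate result per world.
SELECT r.world_id, r.x, COUNT(*) AS cnt
FROM R_worlds r
JOIN S_worlds s
  ON r.k = s.k
 AND r.world_id = s.world_id
GROUP BY r.world_id, r.x;
```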
Perhaps counterintuitively, our preliminary implementations of the [Interleave](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/exec/mode/SampleRows.scala) and [Tuple Bundle](https://github.com/UBOdin/mimir/blob/master/src/main/scala/mimir/exec/mode/TupleBundle.scala) algorithms suggest that no single approach is best in all cases. For example, for a simple select-aggregate query, tuple bundles are the most efficient. Conversely, if you're joining on an attribute with different values in each possible world, Interleave will be faster. We suspect that there are cases where Naive wins out as well. The aim of this project is to implement a query optimizer for sampling-based probabilistic database queries: if I hand you a query, you tell me which strategy is fastest for that query. As an optional extension, you might mix strategies, with each evaluating a different part of the query.
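For intuition, here are sketches of the two query shapes mentioned above, over hypothetical `Orders` and `Categories` tables:

```
-- Select-aggregate: tuple bundles tend to win. Each bundle carries its
-- N sampled prices, and the aggregate can fold them without ever
-- splitting a bundle apart.
SELECT SUM(price) FROM Orders;

-- Join on an uncertain attribute: interleave tends to win. If category
-- differs across worlds, a bundle must be split into per-world tuples
-- at the join anyway.
SELECT o.id, c.discount
FROM Orders o
JOIN Categories c ON o.category = c.name;
```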
When looking at queries, a common question is "why is this result the way it is?" While this is a broad question, database researchers have been hard at work isolating and addressing specific cases. For this particular project, we'd like to explore one specific category of explanation, where users have provided us with points of stability: group-by aggregates that are supposed to remain stable over time. For example, consider
```
SELECT author, STDDEV(cnt) FROM (
  SELECT author, year, COUNT(*) AS cnt
  FROM Publications
  GROUP BY author, year
) pubs_per_year
GROUP BY author;
```
This query gives, for each author, the variation in their number of publications per year. We might use a query like this to define a constraint that says "for any author, the number of publications per year stays roughly constant". Constraints like this can help us to explain aggregate values. For example, let's say you run the following query and the result is lower than expected.
```
SELECT COUNT(*) FROM Publications WHERE author = 'Alice' AND venue = 'ICDE' AND year = 2017;
```
If you ask "Why is this result so low?", the system can look at the above constraint and figure out that there's another aggregate query that is higher than usual (to preserve the stability of the publications/year constraint defined above). For example:
```
SELECT COUNT(*) FROM Publications WHERE author = 'Alice' AND venue <> 'ICDE' AND year = 2017;
```
The aim of this project is to implement a simple frontend to an existing database system (Spark, SQLite, or Oracle) that accepts a set of constraints and answers questions like this.
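To suggest what such a frontend might accept, here is one entirely hypothetical shape for a constraint declaration; none of these systems support a `STABILITY CONSTRAINT` statement, and designing the actual declaration language is part of the project:

```
-- Hypothetical syntax, for illustration only.
CREATE STABILITY CONSTRAINT pubs_per_year AS
  SELECT author, year, COUNT(*) AS cnt
  FROM Publications
  GROUP BY author, year
  EXPECT cnt STABLE PER author OVER year;
```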
Indexes work by reducing the effort required to locate specific data records. For example, in a tree index, if the range of records in a given subtree doesn't overlap with the query, the entire subtree can be ruled out (or ruled in). Not surprisingly, this means that data partitioning plays a large role in how effective the index is. The fewer partitions lie on query boundaries, the less work is required to respond to those queries.
Partitioning is especially a problem in 2-dimensional (and 3-, 4-, etc... dimensional) indexes, where there are always at least two entirely orthogonal dimensions to partition on. Accordingly, there's a wide range of techniques for organizing 2-dimensional data, including a family of indexes based on R-Trees. The aim of this project is to develop a "dynamic" R-Tree-like structure that adaptively partitions its contents, and (if time permits) adapts its partition boundaries to changing workloads.
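For context, SQLite's built-in R*Tree module is one concrete (static) baseline that an adaptive structure could be measured against; the table and values below are made up:

```
-- 2-dimensional bounding boxes, indexed by SQLite's R*Tree module.
CREATE VIRTUAL TABLE boxes USING rtree(id, minX, maxX, minY, maxY);
INSERT INTO boxes VALUES (1, 0.0, 2.0, 0.0, 2.0);
INSERT INTO boxes VALUES (2, 5.0, 9.0, 5.0, 9.0);
-- Find every box overlapping the window [1,6] x [1,6]. Subtrees whose
-- bounding boxes miss the window are pruned without being visited.
SELECT id FROM boxes
WHERE minX <= 6.0 AND maxX >= 1.0
  AND minY <= 6.0 AND maxY >= 1.0;
```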