Fleshing out the probabilistic database reading list a bunch
parent
fb1eaa76ef
commit
0fcf5654be
|
@ -10,6 +10,8 @@ Uncertain data typically appears in one of three forms: Row-level uncertainty, w
|
|||
* [Morgan & Claypool: Probabilistic Databases](http://www.morganclaypool.com/doi/abs/10.2200/S00362ED1V01Y201105DTM016)
|
||||
A solid, theory-focused survey of techniques for probabilistic databases. A good starting point for anyone working in the space.
|
||||
|
||||
* [Springer: Managing and Mining Uncertain Data](http://link.springer.com/book/10.1007/978-0-387-09690-2?no-access=true#page=131)
|
||||
|
||||
## Formal Systems for Incomplete Information
|
||||
|
||||
#### C-Tables
|
||||
|
@ -26,16 +28,14 @@ Somewhat related to uncertainty is a concept called three-valued logic, which ex
|
|||
|
||||
## Probabilistic Database Systems
|
||||
|
||||
#### MCDB
|
||||
(UFL/Rice/IBM)
|
||||
#### MCDB (UFL/Rice/IBM)
|
||||
|
||||
MCDB, or the Monte Carlo Database, introduced to probabilistic databases the idea of describing a probability distribution by a function that computes samples from it. VG-Functions are table-generating functions that output a random sample from the table's possible worlds. MCDB processes queries by (1) generating a set of sampled possible worlds, (2) factorizing the possible worlds into a more compact representation, and (3) running the query over each of those possible worlds (conceptually) in parallel. Conveniently, the factorized representation also admits a more efficient query evaluation strategy. The main advantage of this approach is that it's simple and expressive: if you can generate samples of an uncertain table, you can use MCDB over it. In contrast to many of the other systems here, MCDB can support open-world uncertainty. Sampling upfront, however, can limit the accuracy of the query results, particularly if a highly selective filtering predicate discards most of the sampled data.
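
A minimal sketch of the sample-then-query idea described above, in plain Python. This is not MCDB's interface: `vg_noisy_price`, `monte_carlo_query`, the toy table, and the Gaussian noise model are all assumptions made for illustration, and the sketch skips the factorization step entirely by running the query once per sampled world.

```python
import random

# Hypothetical base table: (item, expected price).
BASE = [("widget", 10.0), ("gadget", 20.0)]

def vg_noisy_price(rng, base_rows, sigma=1.0):
    """Hypothetical VG-function: return one sampled instance of the table."""
    return [(item, rng.gauss(mean, sigma)) for item, mean in base_rows]

def monte_carlo_query(vg, base_rows, query, n_worlds=1000, seed=0):
    """Run `query` over n_worlds sampled worlds and average the results."""
    rng = random.Random(seed)
    results = [query(vg(rng, base_rows)) for _ in range(n_worlds)]
    return sum(results) / len(results)

def total_price(world):
    """Deterministic aggregate evaluated inside a single possible world."""
    return sum(price for _, price in world)

def total_over_36(world):
    """Selective predicate: 1.0 only in the rare worlds where the total exceeds 36."""
    return 1.0 if total_price(world) > 36.0 else 0.0

expected_total = monte_carlo_query(vg_noisy_price, BASE, total_price)
rare_fraction = monte_carlo_query(vg_noisy_price, BASE, total_over_36)
```

The expectation converges quickly, while the estimate for the selective predicate rests on very few (possibly zero) qualifying samples, which is the accuracy limitation mentioned above.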
|
||||
|
||||
* [MCDB: a monte carlo approach to managing uncertain data](http://dl.acm.org/citation.cfm?id=1376686)
|
||||
* [MCDB-R: risk analysis in the database](http://dl.acm.org/citation.cfm?id=1920941)
|
||||
|
||||
#### MayBMS
|
||||
(Cornell)
|
||||
#### MayBMS (Cornell)
|
||||
|
||||
The central idea behind MayBMS is a practical implementation of Probabilistic C-Tables called U-Relations (in fact, U-Relations are often discussed as if they were C-Tables). The idea is to avoid labeled nulls (which most databases do not support) and instead focus entirely on row-level uncertainty. As it turns out, if you're considering only finite, discrete (i.e., categorical) distributions, row-level uncertainty can encode attribute-level uncertainty as well. By further limiting condition columns to conjunctions of boolean equalities (which is still sufficient to capture a significant class of queries), MayBMS can use a classical deterministic database engine to evaluate probabilistic queries.
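
To make the encoding concrete, here is a small sketch in plain Python (not MayBMS's SQL interface). The world table `WORLD`, the relation `U_ITEMS`, and the helper names are hypothetical. An item with an uncertain colour is stored as one row per alternative, each guarded by a condition of the form variable = value, and confidence is computed by brute-force enumeration of worlds, which only works for tiny examples.

```python
from itertools import product

# World table: variable -> {value: probability}; variables are assumed independent.
WORLD = {"x": {1: 0.7, 2: 0.3}, "y": {1: 0.5, 2: 0.5}}

# U-relation rows: (condition, tuple). A condition is a conjunction of
# variable = value equalities, stored as a dict.
U_ITEMS = [
    ({"x": 1}, ("widget", "red")),   # widget is red with probability 0.7 ...
    ({"x": 2}, ("widget", "blue")),  # ... and blue otherwise
    ({"y": 1}, ("gadget", "red")),   # gadget exists (and is red) with probability 0.5
]

def select(u_rel, pred):
    """Selection only looks at data columns; conditions pass through unchanged."""
    return [(cond, row) for cond, row in u_rel if pred(row)]

def project(u_rel, cols):
    """Projection keeps the condition and the chosen data columns."""
    return [(cond, tuple(row[c] for c in cols)) for cond, row in u_rel]

def confidence(u_rel, target):
    """Probability that `target` appears, by enumerating every possible world."""
    conds = [cond for cond, row in u_rel if row == target]
    names = sorted(WORLD)
    total = 0.0
    for values in product(*(WORLD[n] for n in names)):
        world = dict(zip(names, values))
        if any(all(world[v] == val for v, val in c.items()) for c in conds):
            p = 1.0
            for n in names:
                p *= WORLD[n][world[n]]
            total += p
    return total

reds = project(select(U_ITEMS, lambda r: r[1] == "red"), [1])
conf_red = confidence(reds, ("red",))
```

Note that `select` and `project` are ordinary deterministic operations over rows that happen to carry a condition column, which is why a classical engine can evaluate them.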
|
||||
|
||||
|
@ -50,38 +50,63 @@ The central idea behind MayBMS is a practical implementation of Probabilistic C-
|
|||
* BOOK CHAPTER: [MayBMS: A system for managing large uncertain and probabilistic databases](http://link.springer.com/content/pdf/10.1007/978-0-387-09690-2.pdf#page=166)
|
||||
* MANUAL: [MayBMS: A Probabilistic Database System.](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.147.1226&rep=rep1&type=pdf)
|
||||
|
||||
#### Pip
|
||||
(Cornell)
|
||||
#### Sprout (Oxford)
|
||||
|
||||
* [PIP: A database system for great and small expectations](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5447879)
|
||||
Sprout grew out of the MayBMS project, and has functionally replaced it. It uses a similar data model, but work on Sprout has focused largely on optimizing query evaluation. In particular, much of the work involves clever tricks for making it easier and faster to compute reliable approximations of the probabilities of boolean formulas, the lifeblood of probabilistic databases.
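
As a rough illustration of what computing the probability of a boolean formula involves, the sketch below handles a monotone DNF formula over independent boolean variables. It is a generic textbook approach, not Sprout's algorithm: `bounds` gives cheap lower and upper bounds (largest clause probability, and the union bound), while `exact` uses Shannon expansion and can take exponential time. All names are hypothetical.

```python
def clause_prob(clause, p):
    """Probability that every variable in a conjunctive clause is true."""
    out = 1.0
    for v in clause:
        out *= p[v]
    return out

def bounds(dnf, p):
    """Cheap bounds: max clause probability <= P(dnf) <= sum of clause probabilities."""
    probs = [clause_prob(c, p) for c in dnf]
    return max(probs, default=0.0), min(1.0, sum(probs))

def exact(dnf, p):
    """Exact probability via Shannon expansion on one variable at a time."""
    if not dnf:
        return 0.0                      # empty disjunction is false
    if any(len(c) == 0 for c in dnf):
        return 1.0                      # an empty clause is trivially true
    v = min(dnf[0])                     # pick a variable deterministically
    when_true = [c - {v} for c in dnf]              # v = true: drop v from clauses
    when_false = [c for c in dnf if v not in c]     # v = false: clauses with v die
    return p[v] * exact(when_true, p) + (1.0 - p[v]) * exact(when_false, p)

# (a AND b) OR (b AND c), each variable true with probability 0.5.
FORMULA = [frozenset({"a", "b"}), frozenset({"b", "c"})]
PROBS = {"a": 0.5, "b": 0.5, "c": 0.5}

lower, upper = bounds(FORMULA, PROBS)
prob = exact(FORMULA, PROBS)
```

The gap between `lower` and `upper` is the kind of interval an approximation scheme tries to narrow without paying for the full expansion.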
|
||||
|
||||
#### MystiQ
|
||||
(UWash)
|
||||
* [Efficient query evaluation on probabilistic databases](http://link.springer.com/article/10.1007/s00778-006-0004-3)
|
||||
; The work on sprout has focused largely on optimizing query evaluation, ain
|
||||
|
||||
#### Orion
|
||||
(UMD)
|
||||
* [Approximate Confidence Computation in Probabilistic Databases](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5447826)
|
||||
* [SPROUT: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4812442)
|
||||
* [A dichotomy for non-repeating queries with negation in probabilistic databases](http://dl.acm.org/citation.cfm?id=2594549)
|
||||
* [Anytime Approximation in Probabilistic Databases](http://dl.acm.org/citation.cfm?id=2581854)
|
||||
* [Aggregates in Probabilistic Databases via Knowledge Compilation](http://dl.acm.org/citation.cfm?id=2140445)
|
||||
* [Ranking in Probabilistic Databases: Complexity and Efficient Algorithms](http://dl.acm.org/citation.cfm?id=2140445)
|
||||
* [Using OBDDs for Efficient Query Evaluation on Probabilistic Databases](http://link.springer.com/chapter/10.1007/978-3-540-87993-0_26)
|
||||
|
||||
#### PrDB
|
||||
(UMD)
|
||||
#### MystiQ (UWash)
|
||||
|
||||
Some of the original dichotomy and complexity results in probabilistic databases came out of the MystiQ project.
|
||||
|
||||
* [Efficient query evaluation on probabilistic databases](http://homes.cs.washington.edu/~suciu/tech-report-vldb2004.pdf)
|
||||
|
||||
#### Orion 2.0 (UMD)
|
||||
|
||||
Orion 2.0 was the first attempt to support continuous probability distributions in a probabilistic database engine. Although it limits itself to a fixed set of probability distributions, by doing so it is able to compute exact, closed-form solutions to probability mass computations in some cases.
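
For intuition about closed-form probability mass computation, here is a sketch for one fixed distribution family, the Gaussian. It does not use Orion's interface; `normal_cdf`, `prob_in_range`, and the example attribute are assumptions for illustration. The probability that an uncertain attribute satisfies a range predicate is a difference of two CDF values, with no sampling involved.

```python
import math

def normal_cdf(x, mu, sigma):
    """Closed-form CDF of a Gaussian, expressed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_in_range(lo, hi, mu, sigma):
    """P(lo <= X <= hi) for X ~ Normal(mu, sigma^2)."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# Hypothetical uncertain attribute: a temperature reading ~ Normal(20, 1).
# Selection predicate: temperature BETWEEN 19 AND 21.
p_match = prob_in_range(19.0, 21.0, mu=20.0, sigma=1.0)
```

Restricting the engine to families with a known CDF is what makes this kind of exact answer available.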
|
||||
|
||||
* [Orion 2.0: native support for uncertain data](http://dl.acm.org/citation.cfm?id=1376744)
|
||||
|
||||
#### PrDB (UMD)
|
||||
* PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases.
|
||||
* Lineage Processing Over Correlated Probabilistic Databases
|
||||
|
||||
#### Trio
|
||||
(Stanford)
|
||||
#### Orchestra (UPenn)
|
||||
|
||||
#### Sprout
|
||||
(Oxford)
|
||||
* Approximate Confidence Computation in Probabilistic Databases
|
||||
* SPROUT: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases
|
||||
* A dichotomy for non-repeating queries with negation in probabilistic databases
|
||||
* Anytime Approximation in Probabilistic Databases
|
||||
* Aggregates in Probabilistic Databases via Knowledge Compilation
|
||||
* Ranking in Probabilistic Databases: Complexity and Efficient Algorithms
|
||||
* Orchestra (UPenn)
|
||||
* [ORCHESTRA: facilitating collaborative data sharing](http://dl.acm.org/citation.cfm?id=1247631)
|
||||
* [ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data](http://cis.upenn.edu/~zives/research/orchestra-cidr.pdf)
|
||||
* [The ORCHESTRA Collaborative Data Sharing System](http://dl.acm.org/citation.cfm?id=1462577)
|
||||
* TECH REPORT: [Provenance in ORCHESTRA](http://repository.upenn.edu/cis_papers/655/)
|
||||
|
||||
#### Trio (Stanford)
|
||||
|
||||
* [Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4497511)
|
||||
* [Working Models for Uncertain Data](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=1617375)
|
||||
* [ULDBs: databases with uncertainty and lineage](http://dl.acm.org/citation.cfm?id=1164209)
|
||||
* [Databases with uncertainty and lineage](http://link.springer.com/article/10.1007/s00778-007-0080-z)
|
||||
* [Trio-One:Layering Uncertainty and Lineage on a Conventional DBMS](http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p30.pdf)
|
||||
* TECH REPORT: [Continuous Uncertainty in Trio](http://ilpubs.stanford.edu:8090/928/)
|
||||
* TECH REPORT: [Trio: A System for Integrated Management of Data, Accuracy, and Lineage](http://ilpubs.stanford.edu:8090/658/)
|
||||
* TECH REPORT: [An Introduction to ULDBs and the Trio System](http://ilpubs.stanford.edu:8090/793/)
|
||||
|
||||
|
||||
#### Pip (Cornell)
|
||||
|
||||
Pip was an early attempt to fully implement a database engine supporting C-Tables using user-defined types and a front-end query rewriter. Pip builds on MCDB by allowing VG-Functions to be defined alongside secondary metadata functions; as a result, models are defined using what the paper refers to as a 'Grey-Box' function. Although not explicitly called out in the paper, one of the key contributions is the idea of using expressions as variable identifiers: this makes it possible to perform extended projections over probabilistic data without needing to allocate fresh variable identifiers during query evaluation.
|
||||
|
||||
* [PIP: A database system for great and small expectations](http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5447879)
|
||||
|
||||
#### Mimir (UB)
|
||||
|
||||
#### Mimir
|
||||
(UBuff)
|
||||
* [On-Demand Query Result Cleaning](http://www.vldb.org/2014/phd_workshop.proceedings_files/Camera-Ready%20Papers/Paper%201283/p1283-Yang.pdf)
|
||||
* [Detecting the Temporal Context of Queries](http://link.springer.com/chapter/10.1007/978-3-662-46839-5_7)
|
||||
* [Lenses: an on-demand approach to ETL](http://dl.acm.org/citation.cfm?id=2824055)
|
||||
|
@ -93,18 +118,45 @@ The central idea behind MayBMS is a practical implementation of Probabilistic C-
|
|||
|
||||
## Model Database Systems
|
||||
|
||||
* BayesStore
|
||||
* MauveDB
|
||||
* Velox
|
||||
* Plato (Signal Processing)
|
||||
#### BayesStore (Berkeley/UFL)
|
||||
|
||||
* [BayesStore: managing large, uncertain data repositories with probabilistic graphical models](http://dl.acm.org/citation.cfm?id=1453896)
|
||||
* THESIS: [Extracting and Querying Probabilistic Information in BayesStore](http://escholarship.org/uc/item/3557w390)
|
||||
|
||||
#### MauveDB
|
||||
|
||||
* [MauveDB: supporting model-based user views in database systems](http://dl.acm.org/citation.cfm?id=1142483)
|
||||
|
||||
|
||||
#### Velox (Berkeley)
|
||||
|
||||
Database with Active Learning.
|
||||
|
||||
* [The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox](http://arxiv.org/abs/1409.3809)
|
||||
|
||||
|
||||
#### Plato (UCSD)
|
||||
|
||||
Signal Processing
|
||||
|
||||
* [Combining Databases and Signal Processing in Plato](http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper26.pdf)
|
||||
|
||||
## Markov Processes
|
||||
|
||||
* Lahar
|
||||
* SimSQL
|
||||
* [Simulation of database-valued markov chains using SimSQL](http://dl.acm.org/ft_gateway.cfm?id=2465283)
|
||||
* MCMC using IVM
|
||||
* [Scalable probabilistic databases with factor graphs and MCMC](http://dl.acm.org/citation.cfm?id=1920942)
|
||||
#### Lahar
|
||||
|
||||
* [Access Methods for Markovian Streams](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4812407)
|
||||
* [Approximation trade-offs in a Markovian stream warehouse: An empirical study](http://www.sciencedirect.com/science/article/pii/S0306437912000555)
|
||||
* THESIS: [Lahar: Warehousing Markovian Streams](http://lahar.cs.washington.edu/content/papers/letchner_thesis.pdf)
|
||||
* DEMO: [Lahar demonstration: warehousing Markovian streams](http://dl.acm.org/citation.cfm?id=1687605)
|
||||
|
||||
#### SimSQL
|
||||
|
||||
* [Simulation of database-valued markov chains using SimSQL](http://dl.acm.org/ft_gateway.cfm?id=2465283)
|
||||
|
||||
#### MCMC using IVM
|
||||
|
||||
* [Scalable probabilistic databases with factor graphs and MCMC](http://dl.acm.org/citation.cfm?id=1920942)
|
||||
|
||||
## Lineage & Why/Why Not Queries
|
||||
|
||||
|
|