Table of Contents
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
Summary
Probabilistic databases revolve around socalled possible worlds semantics. An uncertain database is actually a set of possible databases, also termed possible worlds. To run a query on the uncertain database, you run the query (in principle) in each of the possible worlds, and you get a set of possible answers. Define a probability distribution over the possible worlds, and you get a distribution over the possible answers; This pair of an uncertain database and a probability distribution is what is commonly referred to as a probabilistic database.
Uncertain data typically appears in one of three forms: Rowlevel uncertainty, where a tuple may or may not be in a given table, Attributelevel uncertainty, where the exact value of an attribute is not known, and Openworld uncertainty, where the set of tuples in a given result table is not known ahead of time. Note the distinction between rowlevel and openworld uncertainty: With the former, you can describe precisely, and in a finite way which tuples could be in the table, while with the latter you can not.
Benchmarks
Surveys

Morgan & Claypool: Probabilistic Databases A solid, theoryfocused survey of techniques for probabilistic databases. A good starting point for anyone working in the space.
Theoretical Foundations of Uncertain Databases
This section outlines foundational work on probabilistic and uncertain databases in theory.
CTables
The CTables data representation is a way to compactly encode row and attributelevel uncertainty in a classical deterministic database. The idea is to allow the use of labeled nulls, and to tag each row in an uncertain relation with a socalled 'local condition' The local condition is a boolean formula who's atoms are comparisons over the labeled nulls. Every possible assignment of values to the local nulls defines a possible world; Rows are only part of possible worlds who's valuation causes the row's local condition to evaluate to true.
 On representing incomplete information in a relational data base (no longer available on the internet?)
 Incomplete Information in Relational Databases (JACM article)
ThreeValued Logic
Somewhat related to uncertainty is a concept called threevalued logic, which extends classical boolean logic's True and False with a third "Unknown" value. This is the type of logic used in SQL when NULL values appear in queries, and it has the potential to mess all sorts of things up.
Dichotomy Results
In general, query processing in a probabilistic database is NPHard. This means that for any meaningfully sized query, good luck getting precise results before the heat death of the universe. However, there are a number of cases where you can take shortcuts. For example, if you can prove that query results are independent (P(X  Y) = P(X), P(Y  X) = P(Y)) or disjoint (P(XY) = 0, P(YX) = 0), you can compute the probability of a disjunction significantly faster. There's been a bunch of theoretical work on trying to distinguish between queries for which such tricks exist, and queries where they don't. These are typically called 'dichotomy results'.
 The dichotomy of probabilistic inference for unions of conjunctive queries
 The complexity of causality and responsibility for query answers and nonanswers
 Dichotomies for Queries with Negation in Probabilistic Databases
Provenance Semirings
A few UPenn folks sat down to formally define a general model for provenance and came up with this beauty. There was already a well known link between database computations and semirings (Union & Aggregation are +, Joins are *). The authors here noted that many existing provenance models fit nicely slotted into the semiring model as well. The key idea is that you can express provenance as a polynomial expression (e.g., a + b*c
) and then depending on what you slot in for the +, *, and base type, you get a different provenance model. This lets you build one system that supports all of those models. The crucial insight (with respect to Probabilistic Databases at least), is that the semiring model is equivalent to the condition columns of CTables. (NB: the paper incorrectly claims that they subsume all of CTables, although they don't capture the behavior of a CTable's labeled nulls)
SUM Aggregation
Aggregation in a probabilistic database is hard. The short reason is that you could potentially end up with a different aggregate value in each and every possible world. Thus, exact algorithms for aggregation must at least scale with the number of possible aggregate values, which could be exponential in the input size in the worst case. In practice though, that's not always the case. COUNT(*) for example is quite friendly to exact computation, since the number of distinct aggregate outputs is bounded by the number of input rows. SUM() can also be friendly on integers or lowprecision floats (or more generally when your aggregate inputs are binned), as noted by the paper below out of INRIA.
Probabilistic Database Systems
This section outlines complete endend systems for probabilistic query processing. A number of these are purely academic systems, and not all have been released.
MCDB (UFL/Rice/IBM)
MCDB, or the MonteCarlo Data Base introduced to probabilistic databases the idea of describing a probability distribution by using a function that can compute a sample from the distribution. VGFunctions are tablegenerating functions that output a random sample from the table's possible worlds. MCDB processes queries by (1) generating a set of sampled possible worlds, (2) factorizing the possible worlds into a more compact representation, and (3) running the query over each of those possible worlds (conceptually) in parallel. Conveniently, the factorized representation also admits a more efficient query evaluation strategy. The main advantage of this approach is that it's simple and expressive: If you can generate samples of an uncertain table, you can use MCDB over it. In contrast to many of the other systems here, MCDB can support openworld uncertainty. Sampling upfront, however, can limit the accuracy of the query results, particularly if you have an extremely selective filtering predicate over the data.
MayBMS (Cornell)
The central idea behind MayBMS is a practical implementation of Probabilistic CTables called URelations (in fact, it's not uncommon to discuss URelations, calling them CTables). The idea is to avoid labeled nulls (which most databases do not support) and instead focus entirely on rowlevel uncertainty. As it turns out, if you're considering only finite, discrete (i.e., categorical) distributions, rowlevel uncertainty can encode attribute level uncertainty as well. By further limiting condition columns to conjunctions of boolean equalities (which is still sufficient to capture a significant class of queries), MayBMS can use a classical deterministic database engine to evaluate probabilistic queries.
 MayBMS: Managing incomplete information with probabilistic worldset decompositions
 10^(10^6) worlds and beyond: efficient representation and processing of incomplete information
 MayBMS: a probabilistic database management system
 A Compositional Query Algebra for SecondOrder Logic and Uncertain Databases
 On Query Algebras for Probabilistic Databases
 On APIs for Probabilistic Databases
 From Complete to Incomplete Information and Back
 DEMO: Query language support for incomplete information in the MayBMS system
 BOOK CHAPTER: MayBMS: A system for managing large uncertain and probabilistic databases
 MANUAL: MayBMS: A Probabilistic Database System.
Sprout (Oxford)
Sprout grew out of the MayBMS project, and has functionally replaced it. It uses a similar data model, but work on sprout has focused largely on optimizing query evaluation. In particular, much of the work involves clever tricks for making it easier/faster to compute reliable approximations of boolean formulas  the main bottleneck in most probabilistic database computations.
 Approximate Confidence Computation in Probabilistic Databases
 SPROUT: Lazy vs. Eager Query Plans for TupleIndependent Probabilistic Databases
 A dichotomy for nonrepeating queries with negation in probabilistic databases
 Anytime Approximation in Probabilistic Databases
 Aggregates in Probabilistic Databases via Knowledge Compilation
 Ranking in Probabilistic Databases: Complexity and Efficient Algorithms
 Using OBDDs for Efficient Query Evaluation on Probabilistic Databases
MystiQ (UWash)
Some of the original dichotomy and complexity results in probabilistic databases came out of the MystiQ project.
 MYSTIQ: a system for finding more answers by using probabilities
 TECH REPORT: Efficient query evaluation on probabilistic databases
Orion 2.0 (UMD)
Orion 2.0 was the first attempt to support continuous probability distributions in a probabilistic database engine. Although it limits itself to a fixed set of probability distributions, by doing so it is able to compute exact, closedform solutions to probability mass computations in some cases.
PrDB (UMD)
 PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases.
 Lineage Processing Over Correlated Probabilistic Databases
Orchestra (UPenn)
Though not technically a probabilistic database, Orchestra shares many of the same characteristics. Orchestra was designed as a datasharing system for biologists, and as a result most of the uncertainty that they deal with comes from data exchange. Under the hood, they share a lot of ideas with CTablesbased systems like Pip and MayBMS. They also leverage a lot of this infrastructure for provenance computations.
 ORCHESTRA: facilitating collaborative data sharing
 ORCHESTRA: Rapid, Collaborative Sharing of Dynamic Data
 The ORCHESTRA Collaborative Data Sharing System
 TECH REPORT: Provenance in ORCHESTRA
Trio (Stanford)
Trio was pretty much the first major system for provenance management, and was also one of the first who's authors recognized the connection with uncertainty (like provenance semirings above). Unfortunately, Trio restricted itself to lineage as a provenance model (Why provenance), and as a result was only able to handle comparatively limited classes of uncertainty and queries.
 Exploiting Lineage for Confidence Computation in Uncertain and Probabilistic Databases
 Working Models for Uncertain Data
 ULDBs: databases with uncertainty and lineage
 Databases with uncertainty and lineage
 TrioOne:Layering Uncertainty and Lineage on a Conventional DBMS
 TECH REPORT: Continuous Uncertainty in Trio
 TECH REPORT: Trio: A System for Integrated Management of Data, Accuracy, and Lineage
 TECH REPORT: An Introduction to ULDBs and the Trio System
Pip (Cornell)
Pip was an early attempt to fully implement a database engine supporting CTables using userdefined types and a frontend query rewriter. Pip builds on MCDB, allowing VGFunctions to be defined alongside secondary metadata functions, consequently models are defined using what the paper refers to as a 'GreyBox' function. Although not explicitly called out in the paper, one of the key contributions is the idea of using expressions as variable identifiers – this makes it possible to perform extended projections over probabilistic data without needing to allocate fresh variable identifiers during query evaluation.
Mimir (UB)
Mimir is the first effort to virtualize uncertaintybased computations. In comparison to other systems which define an ondisk representation of uncertainty, Mimir borrows the VGTerms of Pip and defines uncertain relations by the query used to construct them. Then, depending on the specific requirements of the user, Mimir selects one of several intermediate representations best suited for the workload. In addition to exploring virtualized query processing, the Mimir group has also explored techniques for to presenting uncertain query results to users.
 OnDemand Query Result Cleaning
 Detecting the Temporal Context of Queries
 Lenses: an ondemand approach to ETL
 Communicating Data Quality in OnDemand Curation
 Mimir: Bringing CTables into Practice
Jigsaw (Cornell/Microsoft)
Jigsaw is a variant of MCDB. The underlying system implementation basically follows the same pattern. The goal here is to address the fact that VGFunctions are blackboxes and identify scenarios where two different parameterizations are going to create similar results. The trick is to generate a bunch of samples from the function and use that as a fingerprint. When approximating query results in a firstpass Jigsaw avoids repeated invocation of a VG Function by calling only once per fingerprint.
 Jigsaw: Efficient optimization over uncertain enterprise data
 DEMO: Fuzzy prophet: parameter exploration in uncertain enterprise scenarios
Querying Machine Learning Models
Virtually all probabilistic database systems adopt a data model based on tuples. A number of efforts have come up looking at how to use similar techniques to directly query data defined by a graphical model and/or how to represent graphical models in a database.
BayesStore (Berkeley/UFL)
Use databases to evaluate graphical models. Now part of MADLib
 BayesStore: managing large, uncertain data repositories with probabilistic graphical models
 THESIS: Extracting and Querying Probabilistic Information in BayesStore
MauveDB
Create views that represent data in a continuous domain.
SimSQL
Database for fitting and obtaining metrics over Markov Chain Simulations.
DeepDive
Machine Learning toolkit for information extraction
HoloClean/Snorkel
Spinoff of DeepDive and BigDansig
Ihab Ilyas' Projects
 BigDansig
 Descriptive and Prescriptive Cleaning
 Sampling from Repairs
 NADEEF
 DataXFormer
 Data Civilizer
Lahar
System for fitting and querying markov chains.
 Challenges for Event Queries over Markovian Streams
 Event Queries on Correlated Probabilistic Streams
 Access Methods for Markovian Streams
 Lahar Demonstration: Warehousing Markovian Streams (demo)
 Lineage for Markovian Stream Event Queries
Velox (Berkeley)
Database with Active Learning.
Plato (UCSD)
Signal Processing Database
ModelDB
Integration for Spark/SciKit that allows users to store, query, retrieve, and reuse training parameters for models.
MCMC using IVM
Probabilistic Programming Languages
This section outlines a few languages specially aimed at allowing programmers to write turingcomplete programs (or at least imperativelooking programs) where data/variables are defined as probability distributions.
ENFrame
 ENFrame: A Framework for Processing Probabilistic Data
 Declarative Probabilistic Programming with Datalog
Fabular
Formal semantics for a language for defining hierarchical linear regressions, based on similar language extensions found in many R packages. The authors integrate the language functionality into Tabular, a DSL for programming Excel spreadsheets.
Result Completeness
A number of efforts have taken a step back from probabilistic query processing and come up with some clever things to do when all you have is a weaker description of uncertainty in your data. One particular class of efforts explored in this section is incomplete results  cases where you can precisely characterize what fragment of your input data is missing (e.g., when one partition goes down in a distributed system, the other partition generally know exactly shards of data were on that partition)

Partial Results in Database Systems develops a classification system for completeness. As in most work on PQP, they adopt notions of value and rowlevel uncertainty, which they model by indicating that a particular attribute/table is complete or incomplete (and/or providing statistical measures over the resulting attributes). They then show how completeness propagates through a query.

mtables: Representing Missing Data is the theoretical answer to Partial Results in Database Systems. The idea is to model tuples that could be missing along with a hypothetical distribution over their multiplicities.

Identifying the Extent of Completeness of Query Answers over Partially Complete Databases does something almost identical. The underlying idea is to provide a set of query patterns and show how these patterns propagate through a query. This allows them to relate annotated fragments of query inputs to query outputs, which in turn makes it possible to detect what results are incomplete.
Data Models
Spatial Data
 Indexing Metric Uncertain Data for Range Queries, in which the authors propose an indexing scheme for spatial data with an uncertain position. One of several papers that adopts a similar key trick: Index one point in the probability distribution. Then, when you query, ask the user to provide you with a confidence bound/threshold and use the threshold to define a sufficiently large radius around the area you're querying to ensure that you hit any point that could be possibly in the result with at least that threshold.
Array Data
 Supporting Data Uncertainty in Array Databases, in which the authors address positional uncertainty in an array database, a premise quite similar to spatial uncertainty but on a discrete domain. The key challenge the paper addresses is the tradeoff between replication (casting attributelevel uncertainty as tuplelevel uncertainty as in MayBMS) and the overhead of relaxing query constraints on indexes to detect nonreplicated records. Interestingly, more replication increases query cost due to redundant tuple copies getting returned.
Lineage & Why/Why Not Queries
 Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
 Meiliu/Gatterbaur/Suciu: "Why or why no"?
 Approximate Lineage for Probabilistic Databases
Sensitivity
 Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases
 Lenses: An OnDemand Approach to ETL
Explanations
 Scorpion: Explaining away outliers in aggregate queries.
 A formal approach to finding explanations for database queries.
 Explaining Query Answers with ExplanationReady Databases
Training Probabilistic Databases
Classically, PDBs assume that you come to them with data already annotated with probabilities. This is, however, not always the case. Papers in this section explore how probabilistic databases can 'learn' the underlying distribution over time.
BetaPDBs
A popular model for probabilistic databases is called the TupleIndependent model (creating TIPDBs for short). Tupleindependent probabilistic databases annotate each input tuple with a Bernoullidistributed random variable. That is, we assume that each row of the input data is effectively present according to a random coinflip. In a BetaPDB, this is instead a BetaBernoulli distribution. It's still a coin flip, but the bias of the coin comes from training data given by two parameters (typically called a, b). Naively, these parameters represent samples: You flip a coin a+b times, and it comes up with a heads, that corresponds to a betadistribution with parameters a, b. Propagating this training data through queries turns out to be surprisingly harder, which is the subject of this paper.
PayGo
A Graphish database with missing values that prioritizes triples for cleaning based on the anticipated number of added results the triple could produce.