Update ReadingList Probabilistic DBs

master
Oliver Kennedy 2017-11-08 14:39:26 -05:00
parent e752214280
commit ee6f97178e
1 changed files with 6 additions and 1 deletions

@@ -47,10 +47,15 @@ In general, query processing in a probabilistic database is NP-Hard. This means
* [Dichotomies for Queries with Negation in Probabilistic Databases](http://www.cs.ox.ac.uk/dan.olteanu/papers/fo-tods16.pdf)
#### Provenance Semirings
A few UPenn folks sat down to formally define a general model for provenance and came up with this beauty. There was already a well known link between database computations and semirings (Union & Aggregation are +, Joins are *). The authors here noted that many existing provenance models fit nicely slotted into the semiring model as well. The key idea is that you can express provenance as a polynomial expression (e.g., `a + b*c`) and then depending on what you slot in for the +, *, and base type, you get a different provenance model. This lets you build one system that supports all of those models. Another key insight (and the reason that this paper is on _this_ list), is that the semiring model is equivalent to the condition columns of C-Tables (although the paper incorrectly claims that they subsume all of C-Tables).
A few UPenn folks sat down to formally define a general model for provenance and came up with this beauty. There was already a well-known link between database computations and semirings (Union & Aggregation are +, Joins are *). The authors here noted that many existing provenance models slot nicely into the semiring model as well. The key idea is that you can express provenance as a polynomial expression (e.g., `a + b*c`) and then, depending on what you slot in for the +, *, and base type, you get a different provenance model. This lets you build one system that supports all of those models. The crucial insight (with respect to Probabilistic Databases at least) is that the semiring model is equivalent to the condition columns of C-Tables. (NB: the paper incorrectly claims that semirings subsume all of C-Tables; they don't capture the behavior of a C-Table's labeled nulls.)
* [Provenance Semirings](http://dl.acm.org/citation.cfm?id=1265535)
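
To make the "slot in a different +, *, and base type" idea concrete, here is a minimal sketch that evaluates the polynomial `a + b*c` under a few semirings. The `Semiring` class, the `provenance` helper, and the four instances are illustrative names made up for this example, not anything defined by the paper, and the identity elements (0 and 1) are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Semiring(Generic[T]):
    """Hypothetical container for the two semiring operations."""
    plus: Callable[[T, T], T]   # used for union / aggregation
    times: Callable[[T, T], T]  # used for joins


def provenance(s: Semiring, a, b, c):
    """Evaluate the provenance polynomial a + b*c in semiring s."""
    return s.plus(a, s.times(b, c))


# Set semantics: is the output tuple present at all?
boolean = Semiring(lambda x, y: x or y, lambda x, y: x and y)
# Bag semantics: how many copies of the output tuple are there?
counting = Semiring(lambda x, y: x + y, lambda x, y: x * y)
# Tropical (min, +): the cheapest way to derive the output tuple.
tropical = Semiring(min, lambda x, y: x + y)
# Lineage: the set of input tuples that contributed to the output.
lineage = Semiring(lambda x, y: x | y, lambda x, y: x | y)

print(provenance(boolean, True, False, True))  # True
print(provenance(counting, 2, 3, 4))           # 14
print(provenance(tropical, 5, 1, 2))           # 3
print(provenance(lineage, frozenset("a"), frozenset("b"), frozenset("c")))
```

The polynomial is the same in every call; only the interpretation of `+`, `*`, and the annotations on `a`, `b`, and `c` changes, which is what lets one system serve several provenance models.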
#### SUM Aggregation
Aggregation in a probabilistic database is hard. The short reason is that you could potentially end up with a different aggregate value in each and every possible world. Thus, exact algorithms for aggregation must scale at least with the number of possible aggregate values, which could be exponential in the input size in the worst case. In practice though, that's not always the case. COUNT(*), for example, is quite friendly to exact computation, since the number of distinct aggregate outputs is bounded by the number of input rows. SUM() can also be friendly on integers or low-precision floats (or more generally when your aggregate inputs are binned), as noted by the paper below out of INRIA.
* [Efficient Evaluation of SUM Queries over Probabilistic Data](http://ieeexplore.ieee.org/document/6171191/)
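
As a rough illustration of why a bounded number of distinct aggregate values helps, the sketch below computes the exact distribution of SUM one row at a time, keeping one entry per distinct running total instead of one per possible world. It assumes each row is present independently with a given probability (a tuple-independent table); that assumption and the `sum_distribution` name are ours, and this is not claimed to be the algorithm from the paper.

```python
from collections import defaultdict


def sum_distribution(rows):
    """Exact distribution of SUM over independently-present rows.

    rows: iterable of (value, probability) pairs, where probability is the
    chance that the row exists in a possible world (assumed independent).
    Returns a dict mapping each possible sum to its probability.
    """
    dist = {0: 1.0}  # before any row is considered, the sum is surely 0
    for value, prob in rows:
        nxt = defaultdict(float)
        for total, mass in dist.items():
            nxt[total + value] += mass * prob      # worlds containing the row
            nxt[total] += mass * (1.0 - prob)      # worlds missing the row
        dist = dict(nxt)
    return dist


# Three rows give 2**3 = 8 possible worlds, but only 7 distinct sums.
dist = sum_distribution([(1, 0.5), (2, 0.5), (3, 0.5)])
print(len(dist), dist[3])  # 7 0.25

# COUNT(*) is SUM over 1s: 10 rows give 1024 worlds but only 11 outcomes.
print(len(sum_distribution([(1, 0.5)] * 10)))  # 11
```

The work per row is proportional to the number of distinct totals seen so far, so small integer or binned inputs keep things cheap, while inputs whose partial sums never collide still blow up exponentially.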
Probabilistic Database Systems
-----------------------------------------------------