Boris Glavic 2020-11-17 09:26:01 -06:00
\section{Introduction}
In practice, modern production databases, e.g., Postgres and Oracle, use bag semantics. In contrast, most implementations of modern probabilistic databases (PDBs) are built in the setting of set semantics, and this contributes to slow computation time. In the set semantics setting, one cannot do better than \#P runtime\BG{I think when talking complexity classes, it is preferable to state the complexity, e.g., saying this problem is \#P-hard} when computing the expectation\BG{of what?} of the output polynomial\BG{I think this is dropped on the reader without any warning. What is the polynomial here. I think before getting at this point you need an introductory sentence to explain how polynomials are used in (bag) PDB. Also mention briefly what the output is: the expected multiplicity of the tuple?} of a result tuple for an arbitrary query. In contrast, in the bag setting, one cannot compute the result in better than linear time in the size of the polynomial.
\BG{Introductions also serve to embed the work into the context of related work, it would be good to add citations and state explicitly who has done what}
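To make the object of computation concrete, the following is a minimal sketch (an illustrative assumption, not a definition from this work): in a tuple-independent bag PDB where each input tuple $X_i$ has Bernoulli($p_i$) multiplicity, $X_i^k = X_i$, so the expected multiplicity of an output tuple is obtained from its polynomial's expanded sum-of-products (SOP) form by linearity of expectation.

```python
# Sketch: expectation of an output polynomial in SOP form, assuming a
# tuple-independent bag PDB with Bernoulli(p_i) tuple multiplicities.
# By linearity, E[c * X_1 * ... * X_k] = c * p_1 * ... * p_k for
# distinct variables, using X^k = X for 0/1-valued X.

def expectation_sop(monomials, prob):
    """monomials: list of (coefficient, set of variable names) pairs;
    prob: dict mapping variable name -> probability of appearing."""
    total = 0.0
    for coeff, variables in monomials:
        term = coeff
        for v in set(variables):  # X^k = X for Bernoulli variables
            term *= prob[v]
        total += term
    return total

# (x + y)^2 expands to x^2 + 2xy + y^2, i.e., x + 2xy + y under X^2 = X.
poly = [(1, {"x"}), (2, {"x", "y"}), (1, {"y"})]
print(expectation_sop(poly, {"x": 0.5, "y": 0.5}))  # 0.5 + 0.5 + 0.5 = 1.5
```

Note that this computation touches every monomial, which is exactly the linear-in-SOP cost discussed above.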
There has been limited work in the area of bag semantic PDBs. When considering PDBs in the bag setting, a subtlety arises that is easily overlooked due to the \textit{oversimplification} of PDBs in the set setting.
\BG{what is the oversimplification of PDBs in the set semantics setting? what subtlety arises because of this?}
For almost all modern PDB implementations, an output polynomial is only ever considered in its expanded SOP form.
\BG{I don't think this statement holds up to scrutiny. For instance, ProvSQL uses circuits.}
Any computation over the polynomial then cannot hope to run in less than linear time in the number of monomials, which is known to be exponential\BG{in what?} in the general case.
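The blowup mentioned above can be illustrated with a toy example (hypothetical, for intuition only): the factorized polynomial $\prod_{i=1}^{n} (x_i + y_i)$ has $2n$ terms, but its expanded SOP form has $2^n$ monomials, one per choice of addend in each factor.

```python
from itertools import product

# Expanding prod_{i=1}^{n} (x_i + y_i): each monomial corresponds to
# picking one addend per factor, so the SOP form has 2^n monomials
# while the factorized form has only 2n terms.
def expand_monomials(n):
    return [tuple(("x" if choice == 0 else "y") + str(i)
                  for i, choice in enumerate(choices))
            for choices in product((0, 1), repeat=n)]

print(len(expand_monomials(10)))  # 1024 monomials vs. 20 factorized terms
```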
\AH{New para maybe? Introduce the subtlety here between a giant pile of monomials and a compressed version of the polynomial}
It turns out that, should an implementation allow for a compressed form of the polynomial, computations over the polynomial can be done in better than linear runtime, where linear is again measured in the number of monomials of the expanded Sum of Products (SOP) form. While it is true that runtime in the number of monomials and runtime in the size of the polynomial coincide when the polynomial is given in SOP form, they do not coincide when we allow compressed versions of the polynomial as input to the desired computation. While the naive algorithm to compute the expectation of a polynomial\BG{what is the input to the problem? The polynomial + X. What is X?} is to generate all the monomials and compute each of their probabilities, factorized polynomials in the bag setting allow us to compute the probability\BG{It is probably not clear to the reader what a probability means in the bag setting. An output tuple would have a multiplicity in the bag setting. What probability are we computing? The probability that the multiplicity is larger than 0 (the tuple exists)?} in time linear in the number of terms that make up the compressed representation.
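The gain from the compressed representation can be sketched as follows (a simplifying assumption, not this work's algorithm: every variable occurs in exactly one factor, and variables are independent Bernoulli). Under that assumption, independence gives $E[\prod_j f_j] = \prod_j E[f_j]$, so the expectation is computed in time linear in the size of the compressed form, without ever expanding to SOP.

```python
# Sketch: expectation over a factorized polynomial, assuming factors
# range over pairwise-disjoint sets of independent Bernoulli variables,
# so E[prod_j f_j] = prod_j E[f_j]. Runtime is linear in the size of
# the compressed representation, not in the number of SOP monomials.

def expectation_factorized(factors, prob):
    """factors: list of factors, each a list of (coeff, vars) monomials
    over variables disjoint from those of every other factor."""
    result = 1.0
    for factor in factors:
        e = 0.0
        for coeff, variables in factor:
            term = coeff
            for v in set(variables):
                term *= prob[v]
            e += term
        result *= e
    return result

# (x1 + y1)(x2 + y2): 4 compressed terms here; 2n terms vs. 2^n SOP
# monomials in general.
factors = [[(1, {"x1"}), (1, {"y1"})], [(1, {"x2"}), (1, {"y2"})]]
print(expectation_factorized(factors,
                             {v: 0.5 for v in ("x1", "y1", "x2", "y2")}))
```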
\AH{Perhaps a new para here? Above para needs to gently but explicitly highlight the differences in traditional logical implementation and our approach.}
As implied above, we define hard to be anything greater than linear in the number of monomials when the polynomial is in SOP form. In this work, we show that computing the expectation over the output polynomial of even the query class $PJ$\BG{should we use query classes that are more familiar to PODS people like CQ? Or at least mention that this is equivalent to CQ?} over a bag $\ti$ where all tuples have probability $\prob$\BG{I think the setting needs to be explained better. Are we restricting ourselves to the case where the input tuple's probability means it's probability to appear once? This needs to be pointed out?} is hard in the general case. However, allowing for compressed versions of the polynomial paves the way for an approximation algorithm that runs in linear time with $\epsilon/\delta$ guarantees. Also, while implied in the preceding, in this work the input size to the approximation algorithm is taken to be the query polynomial rather than the input database.
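For intuition, a generic Monte Carlo estimator with $(\epsilon, \delta)$ guarantees can be sketched as below; this is a textbook sampling scheme under assumed independent Bernoulli tuples, not the approximation algorithm developed in this work. If each evaluation lies in $[0, M]$, Hoeffding's inequality yields additive error at most $\epsilon M$ with probability at least $1 - \delta$ after $N = \lceil \ln(2/\delta) / (2\epsilon^2) \rceil$ samples.

```python
import math
import random

# Generic (epsilon, delta) Monte Carlo sketch: sample possible worlds
# by flipping each Bernoulli variable, evaluate the (possibly
# compressed) polynomial in each world, and average the results.
def approx_expectation(evaluate, variables, prob, epsilon, delta,
                       rng=random):
    n_samples = math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
    total = 0.0
    for _ in range(n_samples):
        world = {v: 1 if rng.random() < prob[v] else 0 for v in variables}
        total += evaluate(world)
    return total / n_samples

# E[(x + y)^2] with p = 0.5 each is 1.5 (since X^2 = X for 0/1 values).
est = approx_expectation(lambda w: (w["x"] + w["y"]) ** 2,
                         ["x", "y"], {"x": 0.5, "y": 0.5},
                         epsilon=0.05, delta=0.01,
                         rng=random.Random(0))
print(est)
```

Each sample evaluates the compressed polynomial directly, so the cost per sample is linear in the compressed size rather than in the number of SOP monomials.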
\AH{The para below I think should be incorporated in the para above.}
The richness of the problem we explore gives us lower and upper bounds on the size of the polynomial: its compressed form and its SOP form, i.e., the interval [compressed, SOP]. \BG{This sentence is incomplete?:} In approximating the expectation, we use an expression tree to model the query output polynomial, which indeed accommodates polynomials in compressed form.
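One possible shape for such an expression tree is sketched below (hypothetical names, not the data structure fixed later in the paper): internal nodes are $+$ or $\times$, leaves are variables or constants, so compressed polynomials are represented without expansion to SOP.

```python
# Sketch of an expression tree over a polynomial: internal nodes carry
# "+" or "*", leaves carry a variable or a constant. Evaluation walks
# the tree, so cost is linear in the tree size, i.e., the compressed
# representation, not the expanded SOP form.
class Node:
    def __init__(self, op, children=(), var=None, const=None):
        self.op = op              # one of "+", "*", "var", "const"
        self.children = list(children)
        self.var, self.const = var, const

    def evaluate(self, world):
        if self.op == "var":
            return world[self.var]
        if self.op == "const":
            return self.const
        vals = [c.evaluate(world) for c in self.children]
        if self.op == "+":
            return sum(vals)
        prod = 1
        for v in vals:
            prod *= v
        return prod

# (x + y) * (x + y): stored with four leaves instead of the three
# expanded monomials x^2 + 2xy + y^2.
x, y = Node("var", var="x"), Node("var", var="y")
tree = Node("*", [Node("+", [x, y]), Node("+", [x, y])])
print(tree.evaluate({"x": 1, "y": 0}))  # 1
```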
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: