%root: main.tex
\section{Introduction}
In practice, modern production databases, e.g., Postgres, Oracle, etc., use bag semantics. In contrast, most implementations of modern probabilistic databases (PDBs) are built on set semantics, and this contributes to slow computation times. In the set semantics setting, computing the expectation of the output polynomial of a result tuple is \#P-hard for arbitrary queries. In the bag setting, by contrast, one cannot hope to do better than time linear in the size of the output polynomial.
There is limited work on PDBs under bag semantics. When considering PDBs in the bag setting, a subtlety arises that is easily overlooked due to the \textit{oversimplification} of PDBs in the set setting. In almost all modern PDB implementations, an output polynomial is only ever considered in its expanded sum-of-products (SOP) form. Any computation over the polynomial then cannot hope to run in less than time linear in the number of monomials, which is exponential in the general case.

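To make this blowup concrete, consider the following standard example (not specific to any one system): a factorized polynomial with $2n$ variable occurrences whose SOP expansion has $2^n$ monomials.

```latex
% The factorized form on the left has size linear in n,
% while its SOP expansion on the right has 2^n monomials.
\[
  \prod_{i=1}^{n} (x_i + y_i)
  \;=\;
  \sum_{S \subseteq [n]} \; \prod_{i \in S} x_i \prod_{i \notin S} y_i .
\]
```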
It turns out that if an implementation admits a compressed form of the polynomial, computations over the polynomial can be carried out in better than linear time in the number of monomials of the expanded SOP form. While runtime in the number of monomials and runtime in the size of the polynomial coincide when the polynomial is given in SOP form, the two diverge once we allow compressed representations of the polynomial as input. Whereas the naive algorithm for computing the expectation of a polynomial generates all of its monomials and computes each of their probabilities, factorized polynomials in the bag setting let us compute the expectation in time linear in the number of terms of the compressed representation.

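As a small worked example of the naive approach (assuming a tuple-independent instance in which each annotation variable is an independent Bernoulli with probability $\prob$, so that $x^2 = x$; we write $\mathbb{E}$ for expectation):

```latex
% Expanding (x + y)(x + z) into SOP form and applying linearity of
% expectation, with E[x] = E[y] = E[z] = p and x^2 = x for 0/1 annotations:
\[
  \mathbb{E}\left[(x + y)(x + z)\right]
  = \mathbb{E}[x^2] + \mathbb{E}[xz] + \mathbb{E}[xy] + \mathbb{E}[yz]
  = \prob + 3\prob^2 .
\]
```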
As implied above, we define \textit{hard} to be anything worse than linear in the number of monomials of the SOP form. In this work we show that computing the expectation of the output polynomial, even for the query class $PJ$ over a bag $\ti$ in which all tuples have probability $\prob$, is hard in the general case. However, allowing compressed versions of the polynomial paves the way for an approximation algorithm that runs in linear time with $\epsilon/\delta$ guarantees. Note that, as implied above, the input size for the approximation algorithm is taken to be the size of the query polynomial rather than that of the input database.
The richness of the problem we explore yields both a lower bound, in the size of the compressed form of the polynomial, and an upper bound, in its SOP size, i.e., the interval [compressed, SOP]. To approximate the expectation, we model the query output polynomial with an expression tree, which naturally accommodates polynomials in compressed form.
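To make the expression-tree representation and the naive SOP-based expectation computation concrete, here is a minimal sketch in Python. The class and function names are illustrative, not the paper's implementation; it assumes a tuple-independent instance where every annotation variable is an independent Bernoulli($p$), so that $x^k = x$.

```python
# Hypothetical expression tree for query output polynomials over a
# tuple-independent bag PDB where each tuple appears with probability p.

class Node:
    def expand(self):
        """Return the SOP form as a dict: frozenset(variables) -> coefficient."""
        raise NotImplementedError

class Var(Node):
    def __init__(self, name):
        self.name = name
    def expand(self):
        return {frozenset([self.name]): 1}

class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def expand(self):
        sop = dict(self.left.expand())
        for mono, c in self.right.expand().items():
            sop[mono] = sop.get(mono, 0) + c
        return sop

class Mul(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def expand(self):
        sop = {}
        for m1, c1 in self.left.expand().items():
            for m2, c2 in self.right.expand().items():
                mono = m1 | m2  # x^2 = x for 0/1-valued annotations
                sop[mono] = sop.get(mono, 0) + c1 * c2
        return sop

def naive_expectation(root, p):
    """Naive algorithm: expand to SOP, then apply linearity of expectation.
    E[c * x_1 ... x_k] = c * p^k for independent Bernoulli(p) variables,
    so the cost is linear in the (possibly exponential) number of monomials."""
    return sum(c * p ** len(mono) for mono, c in root.expand().items())

# (x + y)(x + z) expands to x + xz + xy + yz, so E = p + 3p^2.
phi = Mul(Add(Var("x"), Var("y")), Add(Var("x"), Var("z")))
```

The compressed tree `phi` has four leaves, while `expand` materializes every monomial; the contrast is exactly the gap between the compressed and SOP input sizes discussed above.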