More rewriting the Intro.

This commit is contained in:
Aaron Huber 2020-11-23 19:41:15 -05:00
parent 9f0b6dadb0
commit 94d0c4566c

@ -7,6 +7,7 @@ In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag s
\subsection{Theoretical Problem}
%Consider an arbitrary output polynomial $\poly$. Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.
%Figures, etc
%Relations for example 1
@ -75,6 +76,13 @@ Note that such a query in set semantics is indeed \#-P hard, since it is a query
However, in the bag setting, $\expct\pbox{\poly(\prob_1,\ldots, \prob_\numvar)}$ is indeed linear in the size of the output polynomial as the number of operations in the computation is \textit{exactly} the number of output polynomial operations.
Now, consider query $\poly^2() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)$. Abusing notation again, the output polynomial is $\left(ab + bc + ca\right) \cdot \left(ab + bc + ca\right)$. Now, assume the restriction that all variables $X \in \vct{X}$ are set to $\prob$. Here again, in the setting of bag semantics, we have a query that is linear in the size of the expanded output polynomial; however, it is not readily obvious that we achieve linearity for the factorized version of the polynomial as well. But if we think of this query in a graph-theoretic setting, we see that we end up with
\[\sum\limits_{(i, j) \in E}X_iX_j + \sum\limits_{\substack{(i, j), (i, \ell) \in E,\\ j \neq \ell}}X_iX_jX_\ell + \sum\limits_{\substack{(i, j), (k, \ell) \in E,\\ i\neq j\neq k \neq \ell}}X_iX_jX_kX_\ell.\]
Notice that the first term is the sum over edges, and for $\rpoly(\prob,\ldots, \prob)$, this summation is computable in $O(\numedge)$ time. Similarly, the second summation is the sum over all two-paths, which can also be evaluated in $O(\numedge)$ time. Finally, the third term is the sum over pairs of vertex-disjoint edges; since each edge $(i, j)$ is disjoint from exactly $\numedge - d_i - d_j + 1$ edges, where $d_i$ denotes the degree of vertex $i$, the number of ordered pairs of disjoint edges is given by the closed-form expression $\sum\limits_{(i, j) \in E}\left(\numedge - d_i - d_j + 1\right)$, which is also computable in $O(\numedge)$ time. For each summation, we then only need to multiply by the appropriate power of $\prob$.
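As a sanity check, consider the triangle on vertices $a, b, c$ from the example above, so that $\numedge = 3$ and $d_a = d_b = d_c = 2$. The following sketch assumes that pairs of edges are counted as ordered pairs, which matches the coefficients obtained by expanding the product. Expanding and then setting every exponent $e > 1$ to $1$ gives
\[\left(ab + bc + ca\right)^2 = a^2b^2 + b^2c^2 + c^2a^2 + 2ab^2c + 2abc^2 + 2a^2bc \;\mapsto\; ab + bc + ca + 6abc,\]
so that $\rpoly(\prob,\ldots, \prob) = 3\prob^2 + 6\prob^3$. The same value follows from counting alone: there are $\numedge = 3$ edges, $2\sum\limits_{i}\binom{d_i}{2} = 6$ ordered two-paths, and no pair of vertex-disjoint edges, so the coefficient of $\prob^4$ is $0$.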
It is not until we compute a query such as $\poly^3() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)$ that we find hardness results for a compressed polynomial, specifically, that the computation requires superlinear time.
\AH{{\bf \Large New Material Up To This Point}}
\AR{The para below has some text that is too colloquial and should not be in a paper, e.g. ``giant pile of monomials'' or ``folks''.}
Most implementations of modern probabilistic databases (PDBs) view the annotation polynomials as a giant pile of monomials in disjunctive normal form (DNF). Most folks have considered bag PDBs easy and, because almost all of the theoretical framework for PDBs is in set semantics, few have studied them. However, there is a subtle but easily missed advantage in the bag-semantics setting: expectation can be pushed through addition, which makes the computation easier than the oversimplified view of the polynomial in its expanded sum-of-products (SOP) form suggests. There is little existing work on bag PDBs per se; this work seeks to unite previous work on factorized databases with theoretical guarantees for computations over bag PDBs, which have not been extensively studied in the literature. We give theoretical results for computing the expectation of a bag polynomial and introduce a linear-time approximation algorithm for computing the expectation of a bag PDB tuple.