Finished rewriting Intro based on @atri Riot 112420 chat.

This commit is contained in:
Aaron Huber 2020-11-24 16:12:56 -05:00
parent 398a1a156d
commit d8366d1b4e
2 changed files with 61 additions and 30 deletions

@ -26,12 +26,12 @@ In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag s
\end{subfigure}
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c c}
$E$ & A & B\\
\begin{tabular}{ c | c c c}
$E$ & A & B & $\Phi$\\
\hline
& a & b\\
& b & c\\
& c & a\\
& a & b & 1\\
& b & c & 1\\
& c & a & 1\\
\end{tabular}
%\caption{Atom 3 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom3}
@ -68,41 +68,72 @@ In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag s
\label{fig:intro-ex-graph}
\end{figure}
\begin{Example}\label{intro:ex}
Suppose we are given the following query $\poly() := R(A), E(A, B), R(B)$ over a Tuple Independent Database ($\ti$). The $\ti$ relations are given in~\cref{fig:intro-ex}. Table E has a probability of $1$ for each of its tuples and can be viewed as deterministic; thus we omit annotations. The output for $\poly$ can be visualized as the graph in~\cref{fig:intro-ex-graph}.
\begin{Example}\label{ex:intro}
Suppose we are given the following query $\poly() := R(A), E(A, B), R(B)$ over a Tuple Independent Database ($\ti$). The $\ti$ relations are given in ~\cref{fig:intro-ex}. While for completeness we should include annotations for Table E, since each tuple has a probability of $1$, we drop them for simplicity. The output for $\poly$ can be visualized as the graph in ~\cref{fig:intro-ex-graph}.
\end{Example}
Note that such a query in set semantics is indeed \#P-hard, since it is non-hierarchical, i.e., for $Vars(\poly)$ denoting the set of variables occurring across all atoms of $\poly$, and a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$, as defined by Dalvi and Suciu in~\cite{10.1145/1265530.1265571}. Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$ ($\prob(q)$ in Dalvi and Suciu) is hard in set semantics. To see this intuitively, for query $\poly$ over set semantics, the output polynomial is $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent, so the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$ but exponential in the worst case.
While our work handles Block Independent Disjoint Databases ($\bi$), for now we consider the $\ti$ model. Define the probability distribution to be $P[W_i = 1] = \prob$ for $i$ in $\{a, b, c\}$.
Note that the query of~\cref{ex:intro} in set semantics is indeed \#P-hard, since it is non-hierarchical, i.e., for $Vars(\poly)$ denoting the set of variables occurring across all atoms of $\poly$, and a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$, as defined by Dalvi and Suciu in~\cite{10.1145/1265530.1265571}. Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e., the probability of the output tuple with annotation $\poly(W_a, W_b, W_c)$ ($\prob(q)$ in Dalvi and Suciu), is hard in set semantics. To see this intuitively, for query $\poly$ over set semantics, the output polynomial is $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent, so the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$ but exponential in the worst case.
%Using Shannon's Expansion,
%\begin{align*}
%&W_aW_b \vee W_bW_c \vee W_cW_a
%= &W_a
%\end{align*}
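As a worked check of this dependence (a sketch, assuming the $\ti$ distribution $P[W_i = 1] = \prob$ for $i$ in $\{a, b, c\}$ and reading each clause as the event that all of its variables equal $1$), inclusion--exclusion over the three clauses gives
\begin{align*}
&P\left[W_aW_b \vee W_bW_c \vee W_cW_a\right]\\
= &P\left[W_aW_b\right] + P\left[W_bW_c\right] + P\left[W_cW_a\right] - 3P\left[W_aW_bW_c\right] + P\left[W_aW_bW_c\right]\\
= &3\prob^2 - 2\prob^3,
\end{align*}
since every conjunction of two or more clauses is the event $W_aW_bW_c$, which has probability $\prob^3$ by independence. The correction term $-2\prob^3$ appears only because the clauses share variables. For $n$ clauses, inclusion--exclusion can involve up to $2^n - 1$ such terms, which illustrates the exponential worst case.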
However, in the bag setting, the output polynomial is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$. The expectation is simply
\[\expct\pbox{\poly(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a},\]
which is indeed linear in the size of the output polynomial as the number of operations in the computation is \textit{exactly} the number of output polynomial operations.
However, in the bag setting, the output polynomial is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$. Computing the expectation of the output polynomial amounts to computing the `average' multiplicity of the tuple across possible worlds. In~\cref{ex:intro}, the expectation is simply
\begin{align*}
&\expct\pbox{\poly(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
= &\expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}\\
= &\prob^2 + \prob^2 + \prob^2,
\end{align*}
which is indeed linear in the size of the output polynomial as the number of operations in the computation is \textit{exactly} the number of output polynomial operations. Note that the answer is the same as $\poly(\prob, \prob, \prob)$, although this is coincidental and not true for the general case.
Now, consider query
\[\poly^2() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right).\]
A compressed output polynomial is
\[\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).\]
Note that the SOP equivalent representation is
Now, consider the query
\begin{equation*}
\poly^2() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right),
\end{equation*}
whose factorized output polynomial is
\begin{equation*}
\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right),
\end{equation*}
whose SOP equivalent representation is
%\[W_a^2W_b^2 + W_aW_b^2W_c + W_a^2W_bW_c + W_aW_b^2W_c + W_b^2W_c^2 + W_aW_bW_c^2 + W_a^2W_bW_c + W_aW_bW_c^2 + W_c^2W_a^2 =\]
\[W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.\]
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
If we go further and compute query
\[\poly^3() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)\times \left(\rel(A), E(A, B), R(B)\right),\]
it is then the case that output polynomial in a compressed version is
\[\left(W_aW_b + W_bW_c + W_cW_a\right)^3, \]
while the expanded equivalent version has too many (27) terms to list.
The expectation then is
\begin{align*}
&\expct\pbox{\poly^2(W_a, W_b, W_c)} = \expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} + \\
&\qquad \expct\pbox{2W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}\\
= &\prob^2 + \prob^2 + \prob^2 + 2\prob^3 + 2\prob^3 + 2\prob^3\\
= & 3\prob^2(1 + 2\prob) \neq \poly^2(\prob, \prob, \prob).
\end{align*}
In $\poly^2()$ and $\poly^3()$, we would like to know when computing the expectation over the compressed output polynomial is linear in the size of the compressed polynomial, and when it is not. Additionally, given a class of queries for which the expectation is hard, is there a linear approximation algorithm with confidence guarantees?
In this case, $\poly^2(\prob, \prob, \prob)$ is not the answer we seek, since for a random variable $X$, $\expct\pbox{X^2} = \sum_{x \in Dom(X)}x^2 \cdot p(x)$. Note that for our example, $Dom(W_i) = \{0, 1\}$.
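To spell out this step (a sketch, assuming as above that $P[W_i = 1] = \prob$ and $Dom(W_i) = \{0, 1\}$), the second moment of each variable is
\begin{equation*}
\expct\pbox{W_i^2} = 0^2 \cdot (1 - \prob) + 1^2 \cdot \prob = \prob = \expct\pbox{W_i},
\end{equation*}
and the same calculation gives $\expct\pbox{W_i^k} = \prob$ for every integer $k \geq 1$. By contrast, substituting $\prob$ directly for $W_i$ would yield $\prob^2$, which is strictly smaller than $\prob$ whenever $0 < \prob < 1$.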
\AH{{\bf 1)} Be sure to speak of our results.\par{\bf 2)} Explicitly mention the substitution of $\prob_i$ in for $W_i$ vars.}
Now, assume the following restrictions. First, all variables $X \in \vct{X}$ are set to $\prob$. Second, all exponents $e > 1$ in the expanded polynomial are set to $1$. Call this modified polynomial $\rpoly(\prob,\ldots, \prob)$. We show that $\expct\pbox{\poly(\prob,\ldots, \prob)} = \rpoly(\prob,\ldots, \prob)$. Here, again, in the setting of bag semantics, we have a query that is linear in the size of the expanded output polynomial, however it is not readily obvious that we achieve linearity for the factorized version of the polynomial as well. But if we think of this query in a graph theoretic setting, one can see that we end up with
Define $\rpoly(\vct{X})$ to be the polynomial that results when all exponents $e > 1$ in $\poly$ are set to $1$. Note that $\rpoly(\prob, \prob, \prob)$ is exactly the expectation we computed, since $i^2 = i$ for all $i$ in $\{0, 1\}$. Moreover, the expectation of $\poly^2()$ is still computable in time linear \textit{in} the size of the output polynomial, whether compressed or in SOP form.
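For the running example, this reduction can be written out explicitly (a sketch; the notation $\rpoly^2$ for the reduced form of $\poly^2$ is introduced here only for illustration). Setting every exponent greater than $1$ to $1$ in the SOP form of $\poly^2$ gives
\begin{align*}
&\rpoly^2(W_a, W_b, W_c)\\
= &W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c + 2W_aW_bW_c\\
= &W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c,
\end{align*}
so that $\rpoly^2(\prob, \prob, \prob) = 3\prob^2 + 6\prob^3 = 3\prob^2(1 + 2\prob)$, which agrees with the expectation of $\poly^2$ computed term by term.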
\[\sum\limits_{(i, j) \in E}X_iX_j + \sum\limits_{\substack{(i, j), (i, \ell) \in E,\\ i \neq \ell}}X_iX_jX_\ell + \sum\limits_{\substack{(i, j), (k, \ell) \in E,\\ i\neq j\neq k \neq \ell}}X_iX_jX_kX_\ell.\]
As seen in the example, a compressed polynomial can be exponentially smaller in $k$ for $k$-products. Moreover, computing the expectation of an output polynomial in SOP form is always linear in the size of the polynomial, since expectation can be pushed through addition.
Notice that the first term is the sum of edges, and for $\rpoly(\prob,\ldots, \prob)$, this summation is computable in $O(\numedge)$ time. Similarly, the second summation is the sum over all two paths, which can also be evaluated in $O(\numedge)$ time. Finally, the third term is indeed computable in $O(\numedge)$ time by the closed form expression $\sum\limits_{(i, j) \in E}\binom{\numedge - d_i - d_j + 1}{2}$, and for all summations, we only need to multiply by the correct exponentiation of $\prob$.
This work seeks to explore the complexity landscape for compressed representations of polynomials. Up to this point, the message seems consistent that bags are always easy, but is it always the case that bags are easy in the size of the polynomial? We prove that bags are not always linear via a reduction to known hardness results in graph theory. We then introduce an approximation algorithm with confidence guarantees to compute $\rpoly(\vct{X})$ in linear time. Further, our approximation algorithm generalizes to the $\bi$ model.
It is not until we compute a query such as $\poly^3() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)$ that we find hardness results for a compressed polynomial, specifically that the computation is greater than linear time, i.e., superlinear.
\AH{{\bf \Large New Material Up To This Point}}
%\[\poly^3() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)\times \left(\rel(A), E(A, B), R(B)\right),\]
%it is then the case that output polynomial in a compressed version is
%\[\left(W_aW_b + W_bW_c + W_cW_a\right)^3, \]
%while the expanded equivalent version has too many (27) terms to list.
%
%In $\poly^2()$ and $\poly^3()$, we would like to when computing the expectation over the compressed output polynomial is linear in the size of the compressed polynomial, and when it is not. Additionally, given a class of queries such that the expectation is hard, is there a linear approximation algorithm with confidence guarantees?
%
%\AH{{\bf 1)} Be sure to speak of our results.\par{\bf 2)} Explicitly mention the substitution of $\prob_i$ in for $W_i$ vars.}
%Now, assume the following restrictions. First, all variables $X \in \vct{X}$ are set to $\prob$. Second, all exponents $e > 1$ in the expanded polynomial are set to $1$. Call this modified polynomial $\rpoly(\prob,\ldots, \prob)$. We show that $\expct\pbox{\poly(\prob,\ldots, \prob)} = \rpoly(\prob,\ldots, \prob)$. Here, again, in the setting of bag semantics, we have a query that is linear in the size of the expanded output polynomial, however it is not readily obvious that we achieve linearity for the factorized version of the polynomial as well. But if we think of this query in a graph theoretic setting, one can see that we end up with
%
%\[\sum\limits_{(i, j) \in E}X_iX_j + \sum\limits_{\substack{(i, j), (i \ell) \in E,\\ i \neq \ell}}X_iX_jX_\ell + \sum\limits_{\substack{(i, j), (k, \ell) \in E,\\ i\neq j\neq k \neq \ell}}X_iX_jX_kX_\ell.\]
%
%Notice that the first term is the sum of edges, and for $\rpoly(\prob,\ldots, \prob)$, this summation is computable in $O(\numedge)$ time. Similarly, the second summation is the sum over all two paths, which can also be evaluated in $O(\numedge)$ time. Finally, the third term is indeed computable in $O(\numedge)$ time by the closed form expression $\sum\limits_{(i, j) \in E}\binom{\numedge - d_i - d_j + 1}{2}$, and for all summations, we only need to multiply by the correct exponentiation of $\prob$.
%
%It is not until we compute a query such as $\poly^3() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)$ that we find hardness results for a compressed polynomial, specifically that the computation is greater than linear time, i.e., superlinear.
\AH{{\bf \Large New Material Stops Here.}}
\AR{The para below has some text that is too colloquial and should not be in a paper, e.g. ``giant pile of monomials'' or ``folks''.}
Most implementations of modern probabilistic databases (PDBs) view the annotation polynomials as a giant pile of monomials in disjunctive normal form (DNF). Most folks have considered bag PDBs as easy, and because almost all of the theoretical framework for PDBs is in set semantics, few have considered bag PDBs. However, there is a subtle but easily missed advantage in the bag semantic setting: expectation can push through addition, making the computation easier than the oversimplified view of the polynomial being in its expanded sum of products (SOP) form. There is not a lot of existing work on bag PDBs per se; however, this work seeks to unite previous work in factorized databases with theoretical guarantees when used in computations over bag PDBs, which have not been extensively studied in the literature. We give theoretical results for computing the expectation of a bag polynomial, while introducing a linear time approximation algorithm for computing the expectation of a bag PDB tuple.
\AR{The para above does not quite seem to follow the outline we discussed for the first para. Here is what I thought we had discussed (but it is possible I'm mis-remembering). Here is how you can phrase this, line by line:

@ -88,7 +88,7 @@ sensitive=true
%%%%%%%%%%%%%%%%%%%%
% \textbullet Modelling Uncertainty as Attribute-level Taints and its Relationship to Provenance}
\title{Sketching Worlds for Incomplete Databases}
\title{Exact and Approximate Expectation Over Bag PDBs}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%