More changes per @atri Riot 112320 conversation.

Aaron Huber 2020-11-24 11:27:34 -05:00
parent 93a7d3ab4a
commit 398a1a156d


@ -1,12 +1,12 @@
%root: main.tex
\section{Introduction}
\subsection{Problem Statement}
In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag semantics. In contrast, most implementations of PDBs are built in the setting of set semantics, where the annotation polynomial is in disjunctive normal form (DNF), and computations over the output polynomial such as expectation (probability of the tuple) are \#-P hard in general. However, for the equivalent sum of products (SOP) representation in the bags setting, computing the expectation (expected multiplicity) over the output polynomial is linear. In this work we show that, if we use alternative representations of the output polynomial, such as factorized forms, the complexity landscape becomes much more nuanced.
\subsection{Theoretical Problem}
%Consider an arbitrary output polynomial $\poly$. Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.
%Figures, etc
@ -14,12 +14,12 @@ In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag s
\begin{figure}[ht]
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c}
$\rel$ & A\\
\begin{tabular}{ c | c | c}
$\rel$ & A & $\Phi$\\
\hline
& a \\
& b \\
& c \\
& a & $W_a$\\
& b & $W_b$\\
& c & $W_c$\\
\end{tabular}
%\caption{Atom 1 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom1}
@ -38,12 +38,12 @@ In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag s
\end{subfigure}
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c}
$\rel$ & B\\
\begin{tabular}{ c | c | c}
$\rel$ & B & $\Phi$\\
\hline
& b\\
& c\\
& a\\
& b & $W_b$\\
& c & $W_c$\\
& a & $W_a$\\
\end{tabular}
%\caption{Atom 2 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom2}
@ -69,14 +69,33 @@ In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag s
\end{figure}
\begin{Example}\label{intro:ex}
Suppose we have a query $\poly() := R(A), E(A, B), R(B)$, whose relations are given in ~\cref{fig:intro-ex}. The output for $\poly$ is visualized as a graph in ~\cref{fig:intro-ex-graph}.
Suppose we are given the following query $\poly() := R(A), E(A, B), R(B)$ over a Tuple Independent Database ($\ti$). The $\ti$ relations are given in ~\cref{fig:intro-ex}. Table E has a probability of $1$ for each of its tuples and can be viewed as deterministic; thus we omit its annotations. The output for $\poly$ can be visualized as the graph in ~\cref{fig:intro-ex-graph}.
\end{Example}
Note that such a query in set semantics is indeed \#-P hard, since it is a query that is non-hierarchical, i.e., for $Vars(\poly)$ denoting the set of variables occurring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$, as defined by Dalvi and Suciu in ~\cite{10.1145/1265530.1265571}. Thus, abusing notation, denoting the output polynomial as $\poly(\prob_1,\ldots, \prob_\numvar)$, computing $\expct\pbox{\poly(\prob_1,\ldots, \prob_\numvar)}$ is hard in set semantics.
Note that such a query in set semantics is indeed \#-P hard, since it is a query that is non-hierarchical, i.e., for $Vars(\poly)$ denoting the set of variables occurring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$, as defined by Dalvi and Suciu in ~\cite{10.1145/1265530.1265571}. Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$ ($\prob(q)$ in Dalvi and Suciu) is hard in set semantics. To see this intuitively, for query $\poly$ over set semantics, we have that the output polynomial is $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent, and the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$ but exponential in the worst case.
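As a small worked illustration of why the clauses are not independent, assume (our assumption for illustration, following the $\ti$ setting of \cref{intro:ex}) that each $W_i$ is an independent Boolean variable that is true with probability $\prob_i$. Any two of the three clauses hold jointly exactly when $W_a$, $W_b$, and $W_c$ are all true, so inclusion-exclusion gives
\[\expct\pbox{W_aW_b \vee W_bW_c \vee W_cW_a} = \left(\prob_a\prob_b + \prob_b\prob_c + \prob_c\prob_a\right) - 3\prob_a\prob_b\prob_c + \prob_a\prob_b\prob_c,\]
that is,
\[\expct\pbox{W_aW_b \vee W_bW_c \vee W_cW_a} = \prob_a\prob_b + \prob_b\prob_c + \prob_c\prob_a - 2\prob_a\prob_b\prob_c.\]
The correction term $-2\prob_a\prob_b\prob_c$ arises only because the clauses share variables; for a formula with many overlapping clauses, the number of such intersection terms can grow exponentially.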
However, in the bag setting, $\expct\pbox{\poly(\prob_1,\ldots, \prob_\numvar)}$ is indeed linear in the size of the output polynomial as the number of operations in the computation is \textit{exactly} the number of output polynomial operations.
However, in the bag setting, the output polynomial is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$. The expectation is simply
\[\expct\pbox{\poly(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a},\]
which is indeed linear in the size of the output polynomial as the number of operations in the computation is \textit{exactly} the number of output polynomial operations.
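For instance, under the assumption (ours, for illustration) that each $W_i$ is an independent $\{0, 1\}$-valued variable with $\expct\pbox{W_i} = \prob_i$, linearity of expectation splits the sum and independence lets each product factor:
\[\expct\pbox{W_aW_b + W_bW_c + W_cW_a} = \prob_a\prob_b + \prob_b\prob_c + \prob_c\prob_a.\]
Each addition and multiplication in the output polynomial is mirrored by exactly one addition or multiplication over the $\prob_i$ values, with no correction terms of the kind needed for a disjunction of overlapping clauses.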
Now, consider query $\poly^2() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)$. Abusing notation again, the output polynomial will be $\left(ab + bc + ca\right) \cdot \left(ab + bc + ca\right)$. Now, assume the following restrictions. First, all variables $X \in \vct{X}$ are set to $\prob$. Second, all exponents $e > 1$ in the expanded polynomial are set to $1$. Call this modified polynomial $\rpoly(\prob,\ldots, \prob)$. We show that $\expct\pbox{\poly(\prob,\ldots, \prob)} = \rpoly(\prob,\ldots, \prob)$. Here, again, in the setting of bag semantics, we have a query that is linear in the size of the expanded output polynomial, however it is not readily obvious that we achieve linearity for the factorized version of the polynomial as well. But if we think of this query in a graph theoretic setting, one can see that we end up with
Now, consider query
\[\poly^2() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right).\]
A compressed output polynomial is
\[\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).\]
Note that the SOP equivalent representation is
%\[W_a^2W_b^2 + W_aW_b^2W_c + W_a^2W_bW_c + W_aW_b^2W_c + W_b^2W_c^2 + W_aW_bW_c^2 + W_a^2W_bW_c + W_aW_bW_c^2 + W_c^2W_a^2 =\]
\[W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.\]
If we go further and compute query
\[\poly^3() := \left(\rel(A), E(A, B), R(B)\right) \times \left(\rel(A), E(A, B), R(B)\right)\times \left(\rel(A), E(A, B), R(B)\right),\]
it is then the case that output polynomial in a compressed version is
\[\left(W_aW_b + W_bW_c + W_cW_a\right)^3, \]
while the expanded equivalent version has too many (27) terms to list.
In $\poly^2()$ and $\poly^3()$, we would like to know when computing the expectation over the compressed output polynomial is linear in the size of the compressed polynomial, and when it is not. Additionally, given a class of queries such that computing the expectation is hard, is there a linear-time approximation algorithm with confidence guarantees?
\AH{{\bf 1)} Be sure to speak of our results.\par{\bf 2)} Explicitly mention the substitution of $\prob_i$ in for $W_i$ vars.}
Now, assume the following restrictions. First, all variables $X \in \vct{X}$ are set to $\prob$. Second, all exponents $e > 1$ in the expanded polynomial are set to $1$. Call this modified polynomial $\rpoly(\prob,\ldots, \prob)$. We show that $\expct\pbox{\poly(\prob,\ldots, \prob)} = \rpoly(\prob,\ldots, \prob)$. Here, again, in the setting of bag semantics, we have a query that is linear in the size of the expanded output polynomial; however, it is not readily obvious that we achieve linearity for the factorized version of the polynomial as well. But if we think of this query in a graph theoretic setting, one can see that we end up with
\[\sum\limits_{(i, j) \in E}X_iX_j + \sum\limits_{\substack{(i, j), (i, \ell) \in E,\\ i \neq \ell}}X_iX_jX_\ell + \sum\limits_{\substack{(i, j), (k, \ell) \in E,\\ i\neq j\neq k \neq \ell}}X_iX_jX_kX_\ell.\]
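As a sanity check on the triangle of \cref{intro:ex}, the following sketch assumes (our assumption for illustration) that each $W_i$ is $\{0, 1\}$-valued, so that $W_i^e = W_i$ for every $e \geq 1$, and that the $W_i$ are independent with $\expct\pbox{W_i} = \prob_i$. Setting every exponent greater than $1$ to $1$ in the SOP form of $\poly^2(W_a, W_b, W_c)$ leaves
\[W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c,\]
and hence
\[\expct\pbox{\poly^2(W_a, W_b, W_c)} = \prob_a\prob_b + \prob_b\prob_c + \prob_c\prob_a + 6\prob_a\prob_b\prob_c.\]
Setting every $\prob_i$ to the same value $\prob$ yields $3\prob^2 + 6\prob^3$. No degree-four monomial appears in this instance because the triangle contains no two vertex-disjoint edges.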