%root: main.tex
\section{Introduction}
In practice, modern production databases such as Postgres and Oracle use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing expectations and other moments of an output tuple is analogous to weighted model counting, a known $\sharpP$-hard problem~\cite{DBLP:series/synthesis/2011Suciu}. In set-semantics PDBs, each output tuple produced by query processing is annotated with a deterministic Boolean formula, called its lineage formula~\cite{DBLP:series/synthesis/2011Suciu}, which encodes which input tuples contributed to that output tuple. In the bag case, Boolean lineage formulas no longer suffice: the lineage is instead a polynomial over random variables that capture the probability distribution of the multiplicities of the input tuples, and a tuple's probability is interpreted as the probability that it contributes to the multiplicity of the output tuple. Computing the expectation of this polynomial is linear in the number of terms of its expanded form, which has led many to regard bags as easy. In this work we consider compressed representations of the lineage formula and show that the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
%Consider an arbitrary output polynomial $\poly$. Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.
%Figures, etc
%Relations for example 1
\begin{figure}[ht]
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c | c}
$\rel$ & A & $\Phi$\\
\hline
& a & $W_a$\\
& b & $W_b$\\
& c & $W_c$\\
\end{tabular}
%\caption{Atom 1 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom1}
\end{subfigure}
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c c c}
$E$ & A & B & $\Phi$\\
\hline
& a & b & 1\\
& b & c & 1\\
& c & a & 1\\
\end{tabular}
%\caption{Atom 3 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom3}
\end{subfigure}
% \begin{subfigure}{0.15\textwidth}
% \centering
% \begin{tabular}{ c | c | c}
% $\rel$ & B & $\Phi$\\
% \hline
% & b & $W_b$\\
% & c & $W_c$\\
% & a & $W_a$\\
% \end{tabular}
% %\caption{Atom 2 of query $\poly$ in ~\cref{intro:ex}}
% \label{subfig:ex-atom2}
% \end{subfigure}
\caption{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\end{figure}
%Graph of query output for intro example
\begin{figure}
\begin{tikzpicture}
\node at (1.5, 3) [tree_node](top){a};
\node at (0, 0) [tree_node](left){b};
\node at (3, 0) [tree_node](right){c};
\draw (top)--(left);
\draw (left)--(right);
\draw (right)--(top);
\end{tikzpicture}
\caption{Graph of tuples in table E}
\label{fig:intro-ex-graph}
\end{figure}
\begin{Example}\label{ex:intro}
Assume a set semantics setting. Suppose we are given a Tuple Independent Database ($\ti$), a PDB in which all tuples are assumed to be independent of one another, and the Boolean query $\poly() :- \rel(A), E(A, B), \rel(B)$, where the lineage of each output tuple is built from the annotations of all contributing input tuples. The $\ti$ example instance is given in~\cref{fig:intro-ex}. While for completeness we should include annotations for table $E$, we drop them for simplicity since each of its tuples has probability $1$. The annotation column $\Phi$ contains either a random variable or a value in $[0, 1]$ denoting the tuple's marginal probability. Finally, observe that the tuples in table $E$ can be visualized as the graph in~\cref{fig:intro-ex-graph}.
This query is hard in set semantics because of correlations in the lineage formula, but under bag semantics, where a polynomial represents the multiplicities contributed by the input tuples, it is easy since we enjoy linearity of expectation.
\end{Example}
Our work also handles Block Independent Disjoint Databases ($\bi$), a PDB model in which tuples are partitioned into blocks, where all blocks are independent of one another but tuples within the same block are mutually exclusive. For now, let us consider the $\ti$ model and assume in the example that all tuples share the same fixed probability $\prob$.
Note that computing the probability of the query of~\cref{ex:intro} in set semantics is indeed $\sharpP$-hard, since the query is non-hierarchical
%, i.e., for $Vars(\poly)$ denoting the set of variables occuring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$,
as defined by Dalvi and Suciu~\cite{10.1145/1265530.1265571}. For the purposes of this work, we define hard to be anything greater than linear time. %Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e. the probability of the output with annotation $\poly(W_a, W_b, W_c)$, ($\prob(q)$ in Dalvi, Sucui) is hard in set semantics.
To see why this computation is hard for query $\poly$ over set semantics, consider the output lineage formula $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. The conjunctive clauses are not independent of one another, so before expectations can be pushed down, the disjunction must first be rewritten into mutually exclusive terms, giving
\begin{align*}
\expct\pbox{\poly(W_a, W_b, W_c)} &= \expct\pbox{W_aW_b} + \expct\pbox{W_a\overline{W_b}W_c} + \expct\pbox{\overline{W_a}W_bW_c}\\
&= \prob^2 + 2\prob^2(1 - \prob) = 3\prob^2 - 2\prob^3.
\end{align*}
This computation is not linear in the size of $\poly(W_a, W_b, W_c)$, and in general such a computation can take exponential time.
%Using Shannon's Expansion,
%\begin{align*}
%&W_aW_b \vee W_bW_c \vee W_cW_a
%= &W_a
%\end{align*}
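One standard way to arrive at such a mutually exclusive form is Shannon expansion; expanding on $W_a$, for example, gives
\begin{equation*}
W_aW_b \vee W_bW_c \vee W_cW_a = W_a\left(W_b \vee W_c\right) \vee \overline{W_a}W_bW_c,
\end{equation*}
so that
\begin{equation*}
\expct\pbox{\poly(W_a, W_b, W_c)} = \prob\left(2\prob - \prob^2\right) + (1 - \prob)\prob^2 = 3\prob^2 - 2\prob^3,
\end{equation*}
matching the computation above.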
However, in the bag setting, the lineage formula is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$. To be precise, the output lineage formula is produced by a query over a set-valued $\ti$ input, where duplicates are allowed in the output. Computing the expectation of the output lineage amounts to computing the ``average'' multiplicity of the output tuple across possible worlds. In~\cref{ex:intro}, the expectation is simply
\begin{align*}
&\expct\pbox{\poly(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
= &\expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}\\
= &\prob^2 + \prob^2 + \prob^2 = 3\prob^2,
\end{align*}
which is indeed linear in the size of the lineage, as the number of arithmetic operations in the computation is \textit{exactly} the number of operations in the lineage. The first equality holds by linearity of expectation over addition; the second holds because in the $\ti$ model all variables are independent, so the expectation of a product factors into the product of expectations. Note that the answer coincides with $\poly(\prob, \prob, \prob)$, although this is coincidental and not true in general.
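As a concrete illustration of why the SOP case is linear, the following is a minimal Python sketch (the monomial encoding and the function name are our own, hypothetical choices) that computes the expected multiplicity from an SOP lineage over a $\ti$, assuming independent variables and, for now, no repeated variable within a monomial:
\begin{verbatim}
from functools import reduce

def expected_multiplicity(sop, prob):
    """Expected multiplicity of an output tuple over a TI-DB.

    sop  : list of monomials, each a list of variable names,
           e.g. [['Wa','Wb'], ['Wb','Wc'], ['Wc','Wa']].
    prob : dict mapping each variable to its marginal probability.

    Linearity of expectation pushes E[.] through the sum, and
    independence factors E[.] over each product, so one pass over
    the lineage suffices.  Assumes no variable repeats inside a
    monomial (no self-joins); repeated variables need E[W^k],
    as discussed below.
    """
    return sum(
        reduce(lambda acc, w: acc * prob[w], monomial, 1.0)
        for monomial in sop
    )

# The lineage from the example above:
# E[WaWb + WbWc + WcWa] = 3p^2 = 0.75 for p = 0.5.
p = {'Wa': 0.5, 'Wb': 0.5, 'Wc': 0.5}
print(expected_multiplicity([['Wa','Wb'], ['Wb','Wc'], ['Wc','Wa']], p))
\end{verbatim}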
Now, consider the query
\begin{equation*}
\poly^2() :- \rel(A), E(A, B), \rel(B), \rel(C), E(C, D), \rel(D).
\end{equation*}
An arbitrary lineage formula can be viewed as a polynomial, and a polynomial may admit equivalent compressed representations. One such compression is the factorized polynomial, in which the polynomial is written as a product of factors; this form is generally smaller than the expanded polynomial. Another equivalent form is the sum of products (SOP), obtained from the factorized form by multiplying out all terms; in general it is exponentially larger (in the number of products) than the factorized version.
A factorized polynomial of $\poly^2$ is
\begin{equation*}
\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).
\end{equation*}
This factorized expression can be modeled as an expression tree, as depicted in~\cref{fig:intro-q2-etree}.
\begin{figure}[h!]
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=4.55cm}, level 2/.style={sibling distance=1.5cm}, level 3/.style={sibling distance=0.7cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
};
\end{tikzpicture}
\caption{Expression tree for query $\poly^2$.}
\label{fig:intro-q2-etree}
\end{figure}
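To make the expression-tree representation concrete, the following minimal Python sketch (the node encoding and function name are our own illustrative choices, not the paper's formal definition) builds the tree of~\cref{fig:intro-q2-etree} and multiplies it out into SOP monomials:
\begin{verbatim}
from itertools import product

# A node is ('var', name), ('+', [children]) or ('*', [children]).
def to_sop(node):
    """Expand an expression tree into SOP form: a list of monomials,
    each a sorted tuple of variable names (with repetition)."""
    kind = node[0]
    if kind == 'var':
        return [(node[1],)]
    child_sops = [to_sop(c) for c in node[1]]
    if kind == '+':
        return [m for sop in child_sops for m in sop]
    # '*': pick one monomial from every child and concatenate them.
    return [tuple(sorted(sum(ms, ()))) for ms in product(*child_sops)]

# Factorized Q^2: (WaWb + WbWc + WcWa) * (WaWb + WbWc + WcWa).
factor = ('+', [('*', [('var', 'Wa'), ('var', 'Wb')]),
                ('*', [('var', 'Wb'), ('var', 'Wc')]),
                ('*', [('var', 'Wc'), ('var', 'Wa')])])
q2 = ('*', [factor, factor])
# 9 monomials of 4 variables each, vs. 12 variable leaves in the tree.
print(len(to_sop(q2)))
\end{verbatim}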
In contrast, the equivalent SOP representation is
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
The expectation then is
\begin{align*}
&\expct\pbox{\poly^2(W_a, W_b, W_c)}\\
&= \expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} +\\
&\qquad 2\expct\pbox{W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + 2\expct\pbox{W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} +\\
&\qquad 2\expct\pbox{W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}\\
&= \prob^2 + \prob^2 + \prob^2 + 2\prob^3 + 2\prob^3 + 2\prob^3\\
&= 3\prob^2(1 + 2\prob) \neq \poly^2(\prob, \prob, \prob).
\end{align*}
In this case, even though we substitute the probability (expectation) value for each variable, $\poly^2(\prob, \prob, \prob)$ is not the answer we seek, since for a random variable $X$, $\expct\pbox{X^2} = \sum_{x \in Dom(X)}x^2 \cdot p(x)$, which in general differs from $\expct\pbox{X}^2$. Note that for our example $Dom(W_i) = \{0, 1\}$, so $\expct\pbox{W_i^2} = \prob$. Intuitively, bags are only hard with self-joins.\AH{Atri suggests a proof in the appendix regarding the last claim.}
Define $\rpoly^2(\vct{X})$ to be the polynomial resulting from setting all exponents $e > 1$ to $1$ in $\poly^2$. Then $\rpoly^2(\prob, \prob, \prob)$ is exactly the expectation we computed, since $i^2 = i$ for all $i \in \{0, 1\}$. Moreover, $\expct\pbox{\poly^2}$ is still computable in linear time in the size of the output polynomial, whether compressed or in SOP form.
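As a small self-contained illustration, the following Python sketch (names are ours and hypothetical) evaluates $\rpoly^2(\prob, \prob, \prob)$ directly from the SOP monomials of $\poly^2$ by dropping repeated variables within each monomial, and confirms that the result matches $3\prob^2(1 + 2\prob)$ rather than the plain substitution $\poly^2(\prob, \prob, \prob)$:
\begin{verbatim}
def rpoly_at_p(sop_monomials, p):
    """Evaluate the reduced polynomial: set every exponent e > 1 to 1,
    then substitute p for every variable.  Each monomial is a tuple of
    variable names; a repeated name encodes an exponent."""
    return sum(p ** len(set(monomial)) for monomial in sop_monomials)

# SOP monomials of Q^2 = (WaWb + WbWc + WcWa)^2, with repetitions kept.
base = [('Wa', 'Wb'), ('Wb', 'Wc'), ('Wc', 'Wa')]
q2_sop = [m1 + m2 for m1 in base for m2 in base]   # 9 monomials

p = 0.3
print(rpoly_at_p(q2_sop, p))      # ~0.432
print(3 * p**2 * (1 + 2 * p))     # ~0.432  (the expectation)
print((3 * p**2) ** 2)            # ~0.073  (plain substitution, wrong)
\end{verbatim}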
A compressed polynomial can be exponentially smaller in $k$ for $k$-products. In contrast, computing the expectation of an output polynomial given in SOP form is always linear in the size of that (potentially much larger) representation, since expectation can be pushed through addition.
This work seeks to explore the complexity landscape for compressed representations of polynomials. We use the term ``easy'' to mean linear time and the term ``hard'' to mean superlinear time or worse. Note that a runtime linear in the size of the lineage formula essentially matches the runtime of deterministic query processing.
Up to this point, the message has consistently been that bags are always easy, but:
\begin{Question}
Is it always the case that bags are easy in the size of the compressed polynomial?
\end{Question}
If bags \textit{are} always easy for any compressed representation of the polynomial, then there is no need for improvement. But if, provably, they are not, then the option to approximate the computation over a compressed polynomial in linear time becomes desirable.
Consider the query
\begin{equation*}
\poly^3() :- \rel(A), E(A, B), \rel(B), \rel(C), E(C, D), \rel(D), \rel(F), E(F, G), \rel(G).
\end{equation*}
Upon inspection one can see that the factorized output polynomial is a product of three factors of three monomials each, while the SOP version consists of $3^3$ products. We show in this paper that, for a $\ti$ in which every tuple has probability $\prob$, computing the expectation of this query is hard when the factorized polynomial is given as input. We show this by relating the computation to counting the number of $3$-matchings in an arbitrary graph, a problem known to be superlinear in the general case. The fact that bags are not easy in the general case when considering compressed polynomials necessitates an approximation algorithm that computes the expected multiplicity of the output in linear time when the output polynomial is in factorized form. We introduce such an approximation algorithm with confidence guarantees to compute $\rpoly(\vct{X})$ in linear time. Our approximation algorithm generalizes to the $\bi$ model as well. This shows that for all RA+ queries, approximate expectation computation takes essentially the same time as deterministic query processing.
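The approximation algorithm and its guarantees are developed later in the paper; purely to illustrate the idea of sampling from an expression tree, the following rough Python sketch (our own hypothetical estimator, not the paper's algorithm) estimates $\rpoly(\prob, \ldots, \prob)$ by drawing monomials uniformly from the expansion of the tree without ever materializing it:
\begin{verbatim}
import random

# A node is ('var', name), ('+', [children]) or ('*', [children]).
def count_monomials(node):
    """Number of monomials in the expanded (SOP) form of the subtree."""
    if node[0] == 'var':
        return 1
    counts = [count_monomials(c) for c in node[1]]
    if node[0] == '+':
        return sum(counts)
    total = 1
    for c in counts:
        total *= c
    return total

def sample_monomial(node):
    """Draw one expanded monomial uniformly at random, as a variable set.
    (A linear-time implementation would precompute the counts once.)"""
    if node[0] == 'var':
        return {node[1]}
    if node[0] == '+':
        weights = [count_monomials(c) for c in node[1]]
        child = random.choices(node[1], weights=weights, k=1)[0]
        return sample_monomial(child)
    # '*': one monomial from each child; their product unions the variables.
    vars_ = set()
    for c in node[1]:
        vars_ |= sample_monomial(c)
    return vars_

def approx_rpoly(tree, p, samples=20000):
    """Monte Carlo estimate of rpoly(p, ..., p): the number of expanded
    monomials times the mean of p^(#distinct variables) over samples."""
    n = count_monomials(tree)
    mean = sum(p ** len(sample_monomial(tree))
               for _ in range(samples)) / samples
    return n * mean

factor = ('+', [('*', [('var', 'Wa'), ('var', 'Wb')]),
                ('*', [('var', 'Wb'), ('var', 'Wc')]),
                ('*', [('var', 'Wc'), ('var', 'Wa')])])
q2 = ('*', [factor, factor])
print(approx_rpoly(q2, 0.3))   # close to 3p^2(1 + 2p) ~ 0.432
\end{verbatim}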
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Interesting contributions, problem definition, known results, our results, etc
%
%\paragraph{Problem Definition/Known Results/Our Results/Our Techniques}
%This work addresses the problem of performing computations over the output query polynomial efficiently. We specifically focus on computing the
%expectation over the polynomial that is the result of a query over a PDB. This is a problem where, to the best of our knowledge, there has not
%been a lot of study. Our results show that the problem is hard (superlinear) in the general case via a reduction to known hardness results
%in the field of graph theory. Further we introduce a linear approximation time algorithm with guaranteed confidence bounds. We then prove the
%claimed runtime and confidence bounds. The algorithm accepts an expression tree which models the output polynomial, samples uniformly from the
%expression tree, and then outputs an approximation within the claimed bounds in the claimed runtime.
%
%\paragraph{Interesting Mathematical Contributions}
%This work shows an equivalence between the polynomial $\poly$ and $\rpoly$, where $\rpoly$ is the polynomial $\poly$ such that all
%exponents $e > 1$ are set to $1$ across all variables over all monomials. The equivalence is realized when $\vct{X}$ is in $\{0, 1\}^\numvar$.
%This setting then allows for yet another equivalence, where we prove that $\rpoly(\prob,\ldots, \prob)$ is indeed $\expct\pbox{\poly(\vct{X})}$.
%This realization facilitates the building of an algorithm which approximates $\rpoly(\prob,\ldots, \prob)$ and in turn the expectation of
%$\poly(\vct{X})$.
%
%Another interesting result in this work is the reduction of the computation of $\rpoly(\prob,\ldots, \prob)$ to finding the number of
%3-paths, 3-matchings, and triangles of an arbitrary graph, a problem that is known to be superlinear in the general case, which is, by our definition
%hard. We show in Thm 2.1 that the exact computation of $\rpoly(\prob, \ldots, \prob)$ is indeed hard. We finally propose and prove
%an approximation algorithm of $\rpoly(\prob,\ldots, \prob)$, a linear time algorithm with guaranteed $\epsilon/\delta$ bounds. The algorithm
%leverages the efficiency of compressed polynomial input by taking in an expression tree of the output polynomial, which allows for factorized
%forms of the polynomial to be input and efficiently sampled from. One subtlety that comes up in the discussion of the algorithm is that the input
%of the algorithm is the output polynomial of the query as opposed to the input DB of the query. This then implies that our results are linear
%in the size of the output polynomial rather than the input DB of the query, a polynomial that might be greater or lesser than the input depending
%on the structure of the query.
%
%\section{Outline of the rest of the paper}
%\begin{enumerate}
% \item Background Knowledge and Notation
% \begin{enumerate}
% \item Review notation for PDBs
% \item Review the use of semirings as generating output polynomials
% \item Review the translation of semiring operators to RA operators
% \item Polynomial formulation and notation
% \end{enumerate}
% \item Reduction to hardness results in graph theory
% \begin{enumerate}
% \item $\rpoly$ and its equivalence to $\expct\pbox{\poly}$ when $\vct{X} \in \{0, 1\}^\numvar$
% \item Results for SOP polynomial
% \item Results for compressed version of polynomial
% \item ~\cref{lem:const-p} proof
% \end{enumerate}
% \item Approximation Algorithm
% \begin{enumerate}
% \item Description of the Algorithm
% \item Theoretical guarantees
% \item Will we have time to tackle BIDB?
% \begin{enumerate}
% \item If so, experiments on BIDBs?
% \end{enumerate}
% \end{enumerate}
% \item Future Work
% \item Conclusion
%\end{enumerate}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: