paper-BagRelationalPDBsAreHard/intro.tex
2020-12-13 17:45:44 -05:00

336 lines
21 KiB
TeX

%root: main.tex
%!TEX root=./main.tex
\section{Introduction}
Modern production databases like Postgres and Oracle use bag semantics.
Conversely, probabilistic database research (PDBs)~\cite{DBLP:series/synthesis/2011Suciu,DBLP:conf/sigmod/BoulosDMMRS05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} focuseses predominantly on queries evaluated under set semantics.
This is not surprising, as the conventional strategy for encoding the lineage of a query result --- a key component of query evaluation in PDBs --- makes computing typical statistics like marginal probabilities or moments easy (at worst linear in the size of the lineage) for bags, but hard (at worst exponential in the size of the lineage) for sets.
However, conventional encodings of a result's lineage are typically far larger than a comparable deterministic query result, and so even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs, and address it by proposing an approximation algorithm that, to the best of our knowledge, is the first approach to computing expected counts that stays within a constant factor of deterministic query processing.
Consider the dominant problem in Set-PDBs: Computing marginal probabilities, and the corresponding problem in Bag-PDBs: computing expected counts.
In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage is a boolean formula over random variables that captures the conditions under which each output tuple appears in the result.
Computing the probability of the tuple appearing in the result is thus analogous to weighted model counting (a known \sharpphard problem).
In the corresponding problem for Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty}, the lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
Thus, the expectation of the multiplicity is the expectation of this formula.
Lineage in Set-PDBs is typically encoded in disjunctive normal form.
This representation is significantly larger than the query result sans lineage.
However, even with alternative encodings~\cite{DBLP:journals/vldb/FinkHO13}, the limiting factor in computing marginal probabilities remains the probability computation itself and not the lineage formula.
The corresponding representation in Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of clauses, each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of tuple multiplicities is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Such compressed representations like Factorized Databases~\cite{10.1145/3003665.3003667,DBLP:conf/tapp/Zavodny11} or Polynomial Circuits (cite), are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
Thus, measuring the performance of a PDB algorithm in terms of the size of a \emph{compressed} lineage formula allows us to more closely relate the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{10.1145/3003665.3003667} --- polynomial by reduction to counting the 3-matchings of a graph.
Thus, even bag PDBs can not enjoy the same computational complexity as deterministic databases.
This motivates our second goal, a linear time approximation algorithm for computing expected multiplicities in a bag database that runs in time linear in the size of a factorized lineage formula.
\todo[noinline]{as we show...}The worst-case size of the factorized lineage formula for a query is on the same order as the worst-case complexity of deterministic query evaluation~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}, making it possible to estimate expected multiplicities for tuples in the result of an SPJU query with a complexity to within a constant factor of the equivalent deterministic query.
\subsection{Sets vs Bags}
%Consider an arbitrary output polynomial $\poly$. Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.
%Figures, etc
%Relations for example 1
\begin{figure}[ht]
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c c}
$\rel$ & A & $\Phi$\\
\hline
& a & $W_a$\\
& b & $W_b$\\
& c & $W_c$\\
\end{tabular}
%\caption{Atom 1 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom1}
\end{subfigure}
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c c c}
$E$ & A & B & $\Phi$\\
\hline
& a & b & $\top$\\
& b & c & $\top$\\
& c & a & $\top$\\
\end{tabular}
%\caption{Atom 3 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom3}
\end{subfigure}
% \begin{subfigure}{0.15\textwidth}
% \centering
% \begin{tabular}{ c | c | c}
% $\rel$ & B & $\Phi$\\
% \hline
% & b & $W_b$\\
% & c & $W_c$\\
% & a & $W_a$\\
% \end{tabular}
% %\caption{Atom 2 of query $\poly$ in ~\cref{intro:ex}}
% \label{subfig:ex-atom2}
% \end{subfigure}
\caption{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\end{figure}
%Graph of query output for intro example
%\begin{figure}
% \begin{tikzpicture}
% \node at (1.5, 3) [tree_node](top){a};
% \node at (0, 0) [tree_node](left){b};
% \node at (3, 0) [tree_node](right){c};
% \draw (top)--(left);
% \draw (left)--(right);
% \draw (right)--(top);
% \end{tikzpicture}
%\caption{Graph of tuples in table E}
%\label{fig:intro-ex-graph}
%\end{figure}
\begin{Example}\label{ex:intro}
Consider the Tuple Independent ($\ti$) Set-PDB given in \cref{fig:intro-ex} with two input relations $R$ and $E$\footnote{Our work also handles Block Independent Disjoint Databases ($\bi$)~\cite{DBLP:conf/sigmod/BoulosDMMRS05,DBLP:series/synthesis/2011Suciu}, but we focus here on the $\ti$ model to keep exposition simple.}.
Each input tuple is are assigned an annotation (attribute $\Phi$), an independent random boolean variable ($W_i$) or the constant $\top$.
Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) defines one \emph{possible world} containing exactly those tuples annotated with $\top$ or with a variable assigned to $\top$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p^2-p^3$
\end{Example}
Prior efforts to generalize incomplete databases to bags~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17} replace the boolean annotations with natural numbers.
Analogously, we generalize the above model of Set-PDBs to bags by using natural-number-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and positive natural number constants.
Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
We contrast bag and set query evaluation with the following example:
\begin{Example}\label{ex:bag-vs-set}
Continuing the prior example, we are given the following boolean (resp,. count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a boolean formula (resp., polynomial) over random variables in the input (i.e., $W_a$, $W_b$, $W_c$).
Because the boolean query has only a nullary relation, we write $Q(\cdot)$ to denote the function mapping variable assignments to a concrete value:
\begin{align*}
\poly_{set}(W_a, W_b, W_c) &= W_aW_b \vee W_bW_c \vee W_cW_a\\
\poly_{bag}(W_a, W_b, W_c) &= W_aW_b + W_bW_c + W_cW_a
\end{align*}
It is left as an exercise for the reader to show that, given assignments to $W_a$, $W_b$, $W_c$, these expressions correspond to the existence (resp., count) of the single nullary result tuple;
We show one possible world here, with the assignment $\{\;W_a\mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$ (and the corresponding bag assignment),
The polynomials evaluate as:
\begin{align*}
&\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \top\bot = \top\\
&\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
\end{align*}
The Set-PDB query is satisfied in this possible world, while the Bag-PDB query produces a nullary tuple with a multiplicity of 1.
The marginal probability (resp., expected count) of this query is computed over all possible worlds:
{\small
\begin{align*}
P[\poly_{set}] &= \sum_{w_i \in \{\top,\bot\}} \mu(\poly_{set}(w_a, w_b, w_c))P[W_a = w_a,W_b = w_b,W_c = w_c]\\
\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot P[W_a = w_a,W_b = w_b,W_c = w_c]
\end{align*}
}
\end{Example}
Note that the query of \cref{ex:bag-vs-set} in set semantics is indeed \sharpphard, since it non-hierarchical~\cite{10.1145/1265530.1265571}.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form boolean lineage are neither independent nor disjoint, forcing~\cite{DBLP:journals/vldb/FinkHO13} the use of Shannon decomposition, which is at worst exponential in the size of the input.
% \begin{equation*}
% \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
% \end{equation*}
% In general, such a computation can be exponential in the size of the database.
%Using Shannon's Expansion,
%\begin{align*}
%&W_aW_b \vee W_bW_c \vee W_cW_a
%= &W_a
%\end{align*}
Conversely, in Bag-PDBs, correlations between clauses of the SOP polynomial are not problematic thanks to linearity of expectation.
The expectation computation over the output lineage is simply the sum of expectations of each clause
For \cref{ex:intro}, the expectation is simply
{\small
\begin{align*}
\expct\pbox{\poly(W_a, W_b, W_c)} &= \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
\intertext{\normalsize
In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
}
&= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
\end{align*}
}
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
A further interesting coincidence arises in the example.
Note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
{\small
\begin{align*}
\expct\pbox{\poly_{bag}} =&\; P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
&+ P[W_c = 1]P[W_a = 1]\\
=&\; \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
\end{align*}
}
\begin{figure}[h!]
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=4.55cm}, level 2/.style={sibling distance=1.5cm}, level 3/.style={sibling distance=0.7cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_a$}}
child{node[tree_node]{$W_b$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_b$}}
child{node[tree_node]{$W_c$}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{$W_c$}}
child{node[tree_node]{$W_a$}}
}
};
\end{tikzpicture}
\caption{Expression tree for query $\poly^2$.}
\label{fig:intro-q2-etree}
\end{figure}
\subsection{Superlinearity of Bag PDBs}
Moving forward, we focus exclusively on bags and drop the subscript from $\poly_{bag}$.
Consider the cartesian product of $\poly$ with itself:
\begin{equation*}
\poly^2() := \rel(A), E(A, B), \rel(B),\; \rel(C), E(C, D), \rel(D)
\end{equation*}
For an arbitrary lineage formula, which we can view as a polynomial, it is known that there may exist equivalent compressed representations of the polynomial.
One such compression is the factorized polynomial~\cite{10.1145/3003665.3003667}, where the polynomial can be broken up into separate factors.
For example:
\begin{equation*}
\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).
\end{equation*}
This factorized expression can be easily modeled as an expression tree as in \cref{fig:intro-q2-etree}.
In contrast, the equivalent SOP representation is
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
The expectation then is
\begin{align*}
&\expct\pbox{\poly^2(W_a, W_b, W_c)}\\
&= \expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} +\\
&\qquad \expct\pbox{2W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} +\\
&\qquad \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}\\
\end{align*}
In $\poly$, $\expct\pbox{\poly} = \poly(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$.
This same property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$).
Nevertheless, the structure of the query does admit some simplification.
Observe that under assumption that $Dom(W_i) = \{0, 1\}$, it is generally true that for any $k$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
This property leads us to consider another structure related to $\poly$.
% \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
% \par Atri suggests a proof in the appendix regarding this claim.}
For any polynomial $\poly(\vct{X})$, we define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
With $\poly^2$ as an example, we have:
\begin{align*}
\rpoly^2(W_a, W_b, W_c) =&\; W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
&+ 2W_aW_bW_c\\
=&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
\end{align*}
The reduced polynomial is a closed form formula for the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(P\pbox{W_a=1}, P\pbox{W_b=1}, P\pbox{W_c=1})$).
Note that our initial example polynomial $\poly$ is already in reduced form.
Observe that the reduced form of a polynomial can be obtained in a linear scan over the clauses of a SOP encoding of the polynomial.
In prior work on PDBs, where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the cartesian product of $k$ copies of $\poly$ has a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
This leads us to the central question of this paper:
\begin{Question}
Is it always the case that bags are linear in the size of the \emph{compressed} polynomial?
\end{Question}
If bags \textit{are} always linear for any compressed version of the polynomial, then there is no need for improvement.
But, if proveably not, then an approximation algorithm for the expected count is mandatory for creating a PDB with practical performance.
% Consider the :
% \begin{equation*}
% \poly^3() := \left(\rel(A), E(A, B), R(B)\right), \left(\rel(C), E(C, D), R(D)\right), \left(\rel(F), E(F, G), R(G)\right).
% \end{equation*}
% The factorized output polynomial consists of a product of three identical three-way summations, while the SOP encoding is exponential --- $3^3$ clauses to be precise.
Concretely, in this paper:
(i) We show that conjunctive queries over a bag-$\ti$ are hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction to counting the number of $3$-matchings over an arbitrary graph;
(ii) We present an $\epsilon - \delta$ approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments, polynomial circuits, and prove RA+ queries, the processing time in approximation is within a constant factor of the same query processed deterministically.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Interesting contributions, problem definition, known results, our results, etc
%
%\paragraph{Problem Definition/Known Results/Our Results/Our Techniques}
%This work addresses the problem of performing computations over the output query polynomial efficiently. We specifically focus on computing the
%expectation over the polynomial that is the result of a query over a PDB. This is a problem where, to the best of our knowledge, there has not
%been a lot of study. Our results show that the problem is hard (superlinear) in the general case via a reduction to known hardness results
%in the field of graph theory. Further we introduce a linear approximation time algorithm with guaranteed confidence bounds. We then prove the
%claimed runtime and confidence bounds. The algorithm accepts an expression tree which models the output polynomial, samples uniformly from the
%expression tree, and then outputs an approximation within the claimed bounds in the claimed runtime.
%
%\paragraph{Interesting Mathematical Contributions}
%This work shows an equivalence between the polynomial $\poly$ and $\rpoly$, where $\rpoly$ is the polynomial $\poly$ such that all
%exponents $e > 1$ are set to $1$ across all variables over all monomials. The equivalence is realized when $\vct{X}$ is in $\{0, 1\}^\numvar$.
%This setting then allows for yet another equivalence, where we prove that $\rpoly(\prob,\ldots, \prob)$ is indeed $\expct\pbox{\poly(\vct{X})}$.
%This realization facilitates the building of an algorithm which approximates $\rpoly(\prob,\ldots, \prob)$ and in turn the expectation of
%$\poly(\vct{X})$.
%
%Another interesting result in this work is the reduction of the computation of $\rpoly(\prob,\ldots, \prob)$ to finding the number of
%3-paths, 3-matchings, and triangles of an arbitrary graph, a problem that is known to be superlinear in the general case, which is, by our definition
%hard. We show in Thm 2.1 that the exact computation of $\rpoly(\prob, \ldots, \prob)$ is indeed hard. We finally propose and prove
%an approximation algorithm of $\rpoly(\prob,\ldots, \prob)$, a linear time algorithm with guaranteed $\epsilon/\delta$ bounds. The algorithm
%leverages the efficiency of compressed polynomial input by taking in an expression tree of the output polynomial, which allows for factorized
%forms of the polynomial to be input and efficiently sampled from. One subtlety that comes up in the discussion of the algorithm is that the input
%of the algorithm is the output polynomial of the query as opposed to the input DB of the query. This then implies that our results are linear
%in the size of the output polynomial rather than the input DB of the query, a polynomial that might be greater or lesser than the input depending
%on the structure of the query.
%
%\section{Outline of the rest of the paper}
%\begin{enumerate}
% \item Background Knowledge and Notation
% \begin{enumerate}
% \item Review notation for PDBs
% \item Review the use of semirings as generating output polynomials
% \item Review the translation of semiring operators to RA operators
% \item Polynomial formulation and notation
% \end{enumerate}
% \item Reduction to hardness results in graph theory
% \begin{enumerate}
% \item $\rpoly$ and its equivalence to $\expct\pbox{\poly}$ when $\vct{X} \in \{0, 1\}^\numvar$
% \item Results for SOP polynomial
% \item Results for compressed version of polynomial
% \item ~\cref{lem:const-p} proof
% \end{enumerate}
% \item Approximation Algorithm
% \begin{enumerate}
% \item Description of the Algorithm
% \item Theoretical guarantees
% \item Will we have time to tackle BIDB?
% \begin{enumerate}
% \item If so, experiments on BIDBs?
% \end{enumerate}
% \end{enumerate}
% \item Future Work
% \item Conclusion
%\end{enumerate}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: