%root: main.tex
%!TEX root=./main.tex
\section{Introduction}
\label{sec:intro}
Most theoretical development in probabilistic databases (PDBs) has taken place in the setting of set semantics. This is largely due to the stark contrast between the hardness of computing the first moment of a tuple's lineage (a Boolean formula encoding the input tuples that produced the output tuple) under set semantics and the linear runtime of computing the expectation over a bag PDB. When viewed more closely, however, the assumed linear runtime in the bag setting relies on the lineage polynomial being in sum-of-products (SOP), or expanded, form. What can be said about computing the expectation of a more compressed form of the lineage polynomial (e.g., a factorized polynomial) under bag semantics?
%As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
%Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
%PDBs are orders of magnitude slower than classical (i.e., deterministic) database systems~\cite{feng:2019:sigmod:uncertainty}.
%Naively, one might suggest that this is because most work on probabilistic databases assumes set semantics, while, virtually all implementations of the relational data model use bag semantics.
%However, as we show in this paper, there is a more subtle problem behind this barrier to adoption.
\subsection{Sets vs. Bags}
Under set semantics, this problem is defined as follows: given a query, a probabilistic database, and a possible result tuple, compute the marginal probability of the tuple appearing in the result. It has been shown that this is equivalent to computing the probability of the tuple's lineage formula. %, which records how the result tuple was derived from input tuples.
Given this correspondence, the problem reduces to weighted model counting over the lineage (a \sharpphard problem, even if the lineage is in DNF).
%A large body of work has focused on identifying tractable cases by either identifying tractable classes of queries (e.g.,~\cite{DS12}) or studying compressed representations of lineage formulas that are tractable for certain classes of input databases (e.g.,~\cite{AB15}). In this work we define a compressed representation as any one of the possible circuit representations of the lineage formula (please see Definitions~\ref{def:circuit},~\ref{def:poly-func}, and~\ref{def:circuit-set}).
In bag semantics this problem corresponds to computing the expected multiplicity of a query result tuple, which can be reduced to computing the expectation of the lineage formula. Under bag semantics, the lineage formula is a standard polynomial over the lineage variables, with the usual multiplication and addition operations.
\begin{Example}\label{ex:intro}
The tables $\rel$ and $E$ in \Cref{fig:intro-ex} are examples of an incomplete database. Every tuple $\tup$ (disregard $\Phi_{bag}$ for the moment) of these tables is annotated with a variable or the symbol $\top$. Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) identifies one \emph{possible world}, a deterministic database instance containing exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$. When each variable represents an \emph{independent} event, this encoding is called a Tuple Independent Database $(\ti)$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and its probability is $\probOf[W_a]\cdot \probOf[W_b] \cdot \probOf[\neg W_c] = \prob\cdot \prob\cdot (1-\prob)=\prob^2-\prob^3$.
\end{Example}
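The possible-worlds encoding above is easy to check mechanically. The following Python sketch (purely illustrative; the variable names and the uniform probability $\prob = 0.5$ are our own choices, not from the paper) enumerates all eight worlds of the $\ti$ in \Cref{fig:intro-ex} and recovers the probability computed in \Cref{ex:intro}:

```python
from itertools import product

# Tuple-independent database (TI): each tuple's variable is an independent
# event; all probabilities are set to p = 0.5 purely for illustration.
p = {"W_a": 0.5, "W_b": 0.5, "W_c": 0.5}

worlds = []
for bits in product([False, True], repeat=len(p)):
    assignment = dict(zip(p, bits))
    # The probability of a world is the joint probability of its assignments.
    w_prob = 1.0
    for var, present in assignment.items():
        w_prob *= p[var] if present else 1 - p[var]
    worlds.append((assignment, w_prob))

# World R = {<a>, <b>}: W_a -> T, W_b -> T, W_c -> F, with probability
# p * p * (1 - p) = p^2 - p^3.
world_prob = p["W_a"] * p["W_b"] * (1 - p["W_c"])
```

With $\prob = 0.5$, the world probability is $\prob^2 - \prob^3 = 0.125$, and the eight world probabilities sum to one.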
\begin{figure}[t]
\begin{subfigure}{0.33\linewidth}
\centering
\resizebox{!}{10mm}{
\begin{tabular}{ c | c c c}
$\rel$ & A & $\Phi_{set}$ & $\Phi_{bag}$\\
\hline
& a & $W_a$ & $W_a$\\
& b & $W_b$ & $W_b$\\
& c & $W_c$ & $W_c$\\
\end{tabular}
} \caption{Relation $R$ of~\Cref{ex:intro}}
\label{subfig:ex-atom1}
\end{subfigure}%
\begin{subfigure}{0.33\linewidth}
\centering
\resizebox{!}{10mm}{
\begin{tabular}{ c | c c c c}
$E$ & A & B & $\Phi_{set}$ & $\Phi_{bag}$ \\
\hline
& a & b & $\top$ & $1$\\
& b & c & $\top$ & $1$\\
& c & a & $\top$ & $1$\\
\end{tabular}
}
\caption{Relation $E$ of~\Cref{ex:intro}}
\label{subfig:ex-atom3}
\end{subfigure}%
\begin{subfigure}{0.33\linewidth}
\centering
\resizebox{!}{29mm}{
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (0, 0){$W_a$};
\node[tree_node] (b1) at (1, 0){$W_b$};
\node[tree_node] (c1) at (2, 0){$W_c$};
\node[tree_node] (d1) at (3, 0){$W_d$};
\node[tree_node] (a2) at (0.75, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (c2) at (2.25, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (a3) at (1.9, 1.6){$\boldsymbol{\circplus}$};
\node[tree_node] (a4) at (0.75, 1.6){$\boldsymbol{\circplus}$};
\node[tree_node] (a5) at (0.75, 2.5){$\boldsymbol{\circmult}$};
\draw[->] (a1) -- (a2);
\draw[->] (b1) -- (a2);
\draw[->] (b1) -- (b2);
\draw[->] (c1) -- (b2);
\draw[->] (c1) -- (c2);
\draw[->] (d1) -- (c2);
\draw[->] (c2) -- (a3);
\draw[->] (a2) -- (a4);
\draw[->] (b2) -- (a3);
\draw[->] (a3) -- (a4);
%sink
\draw[thick, ->] (a4.110) -- (a5.250);
\draw[thick, ->] (a4.70) -- (a5.290);
\draw[thick, ->] (a5) -- (0.75, 3.0);
\end{tikzpicture}
}
\caption{Circuit encoding for query $\poly^2$.}
\label{fig:circuit-q2-intro}
\end{subfigure}
%\vspace*{3mm}
\vspace*{-3mm}
\caption{ }%{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\trimfigurespacing
\end{figure}
Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $\domain(\randomvar_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
Without loss of generality, we assume that input relations are sets (i.e., $\domain(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
\begin{Example}\label{ex:bag-vs-set}
Continuing the prior example, we are given the following Boolean (resp., count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a Boolean formula (resp., polynomial) over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the query result is a nullary relation, we write $\poly(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
\setlength\parindent{0pt}
\vspace*{-3mm}
\begin{tabular}{@{}l l}
\begin{minipage}[b]{0.45\linewidth}
\begin{equation}
\poly_{set}(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a\label{eq:poly-set}
\end{equation}
\end{minipage}\hspace*{5mm}
&
\begin{minipage}[b]{0.45\linewidth}
\begin{equation}
\poly_{bag}(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a\label{eq:poly-bag}
\end{equation}
\end{minipage}\\
\end{tabular}
\vspace*{1mm}
These functions compute the existence (resp., count) of the nullary tuple resulting from applying $\poly$ on the PDB of \Cref{fig:intro-ex}.
For the same possible world as in the prior example:
$$
\begin{tabular}{c c}
\begin{minipage}[b]{0.45\linewidth}
$\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \bot\top = \top$
\end{minipage}
&
\begin{minipage}[b]{0.45\linewidth}
$\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1$
\end{minipage}\\
\end{tabular}
$$
The Set-PDB query is satisfied in this possible world and the Bag-PDB result tuple has a multiplicity of 1.
The marginal probability (resp., expected count) of this query is computed over all possible worlds:
{\small
\begin{align*}
\probOf[\poly_{set}] &= \hspace*{-1mm}
\sum_{w_i \in \{\top,\bot\}} \indicator{\poly_{set}(w_a, w_b, w_c)}\probOf[W_a = w_a,W_b = w_b,W_c = w_c]\\
\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot \probOf[W_a = w_a,W_b = w_b,W_c = w_c]
\end{align*}
}
\end{Example}
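Both quantities can be computed by brute-force enumeration of the possible worlds. A short Python sketch (our own illustration, with all tuple probabilities set to $\prob = 0.5$):

```python
from itertools import product

p = 0.5  # illustrative uniform tuple probability

def q_set(wa, wb, wc):
    # Boolean lineage: W_a W_b  OR  W_b W_c  OR  W_c W_a
    return (wa and wb) or (wb and wc) or (wc and wa)

def q_bag(wa, wb, wc):
    # Lineage polynomial: W_a W_b + W_b W_c + W_c W_a
    return wa * wb + wb * wc + wc * wa

marginal = 0.0  # P[Q_set], summed over all possible worlds
expected = 0.0  # E[Q_bag]
for wa, wb, wc in product([0, 1], repeat=3):
    w_prob = 1.0
    for w in (wa, wb, wc):
        w_prob *= p if w else 1 - p
    marginal += (1 if q_set(wa, wb, wc) else 0) * w_prob
    expected += q_bag(wa, wb, wc) * w_prob
```

With all probabilities equal to $\prob$, the expected count works out to $3\prob^2$, matching the term-by-term computation that follows.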
Note that the query of \Cref{ex:bag-vs-set} under set semantics is non-hierarchical~\cite{DS12}, and computing its marginal probability is thus \sharpphard.
To see why computing this probability is hard, observe that the three clauses $(W_aW_b, W_bW_c,\text{ and }W_cW_a)$ of $(\ref{eq:poly-set})$ are neither independent (the same variables appear in multiple clauses) nor disjoint (the clauses are not mutually exclusive). Computing the probability of such formulas exactly requires exponential-time algorithms (e.g., Shannon decomposition).
Conversely, in Bag-PDBs, correlations between monomials of the SOP polynomial (\ref{eq:poly-bag}) are not problematic, thanks to linearity of expectation:
the expectation of the output lineage is simply the sum of the expectations of its monomials.
For \Cref{ex:intro}, the expectation is simply
\begin{equation*}
\expct\pbox{\poly_{bag}(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}
\end{equation*}
In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
\begin{equation*}
= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
\end{equation*}
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
As a further interesting feature of this example, note that $\expct\pbox{W_i} = \probOf[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{equation}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}}
= \poly_{bag}(\probOf[W_a=1], \probOf[W_b=1], \probOf[W_c=1])
\end{equation}
\Cref{eqn:can-inline-probabilities-into-polynomial} is not true in general, as we shall see in \Cref{sec:suplin-bags}.
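A quick numeric check of \Cref{eqn:can-inline-probabilities-into-polynomial}, and of how it fails once the polynomial is no longer multilinear (here we use $\poly_{bag}^2$ and our own illustrative probability $\prob = 0.5$): over $\{0,1\}$-valued variables $W_i^2 = W_i$, while the inlined evaluation squares $\prob$.

```python
from itertools import product

p = 0.5  # illustrative

def q_bag(wa, wb, wc):
    return wa * wb + wb * wc + wc * wa

def expectation(f):
    # Brute-force expectation over the eight possible worlds.
    total = 0.0
    for ws in product([0, 1], repeat=3):
        w_prob = 1.0
        for w in ws:
            w_prob *= p if w else 1 - p
        total += f(*ws) * w_prob
    return total

# Multilinear case: inlining probabilities gives the exact expectation.
ok_lhs = expectation(q_bag)   # E[Q_bag]
ok_rhs = q_bag(p, p, p)       # Q_bag(p, p, p)

# Non-multilinear case: E[Q_bag^2] != (Q_bag(p, p, p))^2, because squared
# 0/1 variables collapse (W_i^2 = W_i) while the inlined evaluation uses p^2.
bad_lhs = expectation(lambda *ws: q_bag(*ws) ** 2)
bad_rhs = q_bag(p, p, p) ** 2
```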
Though computing the expected count of an output Bag-PDB tuple $\tup$ is linear when the lineage is in SOP form,
there exist compressed representations of polynomials, e.g., factorizations~\cite{factorized-db}, that can be polynomially more concise than their SOP counterparts.
These compression schemes are analogous to typical database optimizations like projection push-down~\cite{DBLP:books/daglib/0020812} (where, e.g., in the case of a projection followed by a join, addition is performed prior to multiplication, yielding a product of sums), hinting that perhaps even Bag-PDBs have higher query processing complexity than deterministic databases.
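As a toy illustration of the gap (our own example, not one from the paper): the factorized polynomial $(W_a + W_b)(W_c + W_d)$ uses three arithmetic operations, while its SOP expansion uses seven, and for higher-degree factorizations the gap grows polynomially.

```python
def factorized(wa, wb, wc, wd):
    # Product-of-sums form: 2 additions + 1 multiplication = 3 operations.
    return (wa + wb) * (wc + wd)

def sop(wa, wb, wc, wd):
    # Expanded (SOP) form: 4 multiplications + 3 additions = 7 operations.
    return wa * wc + wa * wd + wb * wc + wb * wd
```

Both forms agree on every assignment; only the representation size differs.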
In this paper, we confirm this intuition, first proving that computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a compressed lineage representation, and then relating the size of the compressed lineage to the cost of answering a deterministic query.
In view of this hardness result, we develop an approximation algorithm for expected counts of SPJU query results over Bag-PDBs that is, to our knowledge, the first linear time (in the size of the factorized lineage) $(1-\epsilon)$-approximation.
By extension, this algorithm is only a constant factor slower than deterministic query processing.\footnote{
Monte Carlo sampling~\cite{jampani2008mcdb} is also trivially a constant factor slower, but can only guarantee additive bounds, rather than our stronger multiplicative bounds.
}
This is an important result, because it implies that computing approximate expectations for SPJU queries can indeed be competitive with deterministic query evaluation over bag databases.
\subsection{Overview of our results and techniques}
Concretely, in this paper:
(i) We show that computing the expected count of conjunctive queries whose output is a bag-$\ti$ is hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction from counting the number of $k$-matchings over an arbitrary graph;
(ii) We present a $(1-\epsilon)$-approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments and prove that, for RA+ queries, approximate processing time is within a constant factor of the time to process the same query deterministically.
Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}. First, it is easy to generalize the polynomial of Example~\ref{ex:bag-vs-set} to a polynomial $\poly_G(X_1,\dots,X_n)$ that represents the edge set of a graph $G$ on $n$ vertices. Then $\poly_G^k(X_1,\dots,X_n)$ (i.e., $\inparen{\poly_G(X_1,\dots,X_n)}^k$) encodes as its monomials all subgraphs of $G$ with at most $k$ edges. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ (see \Cref{def:reduced-poly}) can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$, and we observe that $c_{2k}$ is proportional to the number of $k$-matchings in $G$, which is \sharpwonehard to compute. Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values $\prob_i$, $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most likely require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity, e.g., on the complexity of counting the number of triangles in $G$, rather than on more standard parameterized hardness like \sharpwonehard.
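The interpolation step in the argument above is ordinary polynomial interpolation: given evaluations of $\sum_{i=0}^{d} c_i \prob^i$ at $d+1$ distinct points, the coefficients (in particular the top one, which plays the role of $c_{2k}$) fall out of a Vandermonde system. A self-contained sketch, with made-up coefficients standing in for a real graph's counts:

```python
def solve_vandermonde(points, values):
    """Recover c_0..c_d from evaluations of sum_i c_i * x^i at d+1 distinct
    points, via Gauss-Jordan elimination with partial pivoting."""
    n = len(points)
    a = [[x ** j for j in range(n)] + [v] for x, v in zip(points, values)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                f = a[r][col] / a[col][col]
                for j in range(col, n + 1):
                    a[r][j] -= f * a[col][j]
    return [a[i][n] / a[i][i] for i in range(n)]

# Illustrative coefficients; c_3 plays the role of the k-matching count c_{2k}.
coeffs = [2.0, 0.0, 5.0, 3.0]
points = [0.1, 0.2, 0.3, 0.4]
values = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in points]
recovered = solve_vandermonde(points, values)
```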
The starting point of our approximation algorithm is the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=\poly(1,\dots,1)$, and if all the coefficients of $\poly$ are constants, then $\poly(\prob,\dots, \prob)$ (which can easily be computed in linear time) is a $\prob^k$-approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ (i.e., the \emph{input} tuple probabilities) and $k=\degree(\poly)$ are constants, then this gives a constant-factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_\numvar)$ and take an appropriate weighted sum of their coefficients. Standard tail bounds then give us our desired approximation scheme. To achieve a linear runtime, it turns out that we need two properties from our compressed representation of $\poly$: (i) we must be able to compute $\poly(1,\ldots, 1)$ in linear time, and (ii) we must be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly.
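A minimal sketch of the sampling estimator, run on the expanded monomials of $\poly_{bag}^2$ from \Cref{ex:bag-vs-set} (the monomial listing, $\prob = 0.5$, and the sample size are our own illustrative choices): draw monomials with probability proportional to their coefficients, average $\prob^{\#\text{distinct variables}}$, and rescale by the coefficient mass $\poly(1,\dots,1)$.

```python
import random

# Monomials of Q_bag^2 = (W_a W_b + W_b W_c + W_c W_a)^2, listed as
# (set of distinct variables, coefficient); exponents are collapsed
# since W_i^2 = W_i over {0, 1}.
monomials = [
    (frozenset("ab"), 1), (frozenset("bc"), 1), (frozenset("ca"), 1),
    (frozenset("abc"), 2), (frozenset("abc"), 2), (frozenset("abc"), 2),
]
p = 0.5
q_at_one = sum(c for _, c in monomials)  # Q^2(1, 1, 1) = 9

# Exact value of the reduced polynomial at (p, ..., p), for comparison.
exact = sum(c * p ** len(vs) for vs, c in monomials)

# Sampling estimator: unbiased, since each monomial contributes
# (c / q_at_one) * q_at_one * p^{|vars|} = c * p^{|vars|} in expectation.
random.seed(0)
weights = [c for _, c in monomials]
n = 50_000
total = 0.0
for _ in range(n):
    vs, _c = random.choices(monomials, weights=weights)[0]
    total += p ** len(vs)
estimate = q_at_one * total / n
```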
%For the ease of exposition, we start off with expression trees (see~\Cref{fig:circuit-q2-intro} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
We formalize our claim that, since our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
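To see why circuit size is the right cost measure, note that a lineage circuit can be evaluated in one bottom-up pass, so quantities like $\poly(1,\dots,1)$ cost time linear in the number of gates even when the expanded SOP would be much larger. A sketch (the gate list is our own encoding of the shape of \Cref{fig:circuit-q2-intro}, where the top product gate squares the shared sum):

```python
def eval_circuit(gates, inputs):
    """Evaluate a lineage circuit given as a topologically ordered list of
    (gate name, operator, (left input, right input)) triples."""
    vals = dict(inputs)
    for name, op, (lhs, rhs) in gates:
        x, y = vals[lhs], vals[rhs]
        vals[name] = x * y if op == "*" else x + y
    return vals[gates[-1][0]]

# Q = Wa*Wb + Wb*Wc + Wc*Wd, squared via a product gate whose two inputs
# are wired to the same sum gate (a shared subcircuit).
gates = [
    ("m1", "*", ("Wa", "Wb")),
    ("m2", "*", ("Wb", "Wc")),
    ("m3", "*", ("Wc", "Wd")),
    ("s1", "+", ("m2", "m3")),
    ("s2", "+", ("m1", "s1")),
    ("sq", "*", ("s2", "s2")),
]

q_sq_at_one = eval_circuit(gates, {"Wa": 1, "Wb": 1, "Wc": 1, "Wd": 1})
q_sq_at_half = eval_circuit(gates, dict.fromkeys(["Wa", "Wb", "Wc", "Wd"], 0.5))
```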
\paragraph{Paper Organization.} We present relevant background and set up our notation in~\Cref{sec:background}. We present our hardness results in~\Cref{sec:hard} and our approximation algorithm in~\Cref{sec:algo}. We present some (easy) generalizations of our results in~\Cref{sec:gen}. We give a quick overview of related work in~\Cref{sec:related-work} and conclude with some open questions in~\Cref{sec:concl-future-work}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: