paper-BagRelationalPDBsAreHard/intro-new.tex

%root: main.tex
%!TEX root=./main.tex

\section{Introduction}
\label{sec:intro}
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
A well-studied problem in probabilistic databases is to, given a query $\query$ and a probabilistic database $\pdb$, compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs).
In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{equation}\label{eq:intro-bag-expectation}
\expct_{\overline{\mathbf{D}} \sim \probDist}[\query(\overline{\mathbf{D}})(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db) \hspace{2cm}\text{\textbf{(Expected Result Multiplicity)}}
\end{equation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
	\begin{subfigure}[b]{0.49\linewidth}
		\centering
{\small
		\begin{tabular}{ c | c c c}
			$OnTime$ & City$_\ell$ & $\Phi$  & \textbf{p}\\
			\hline
                     & Buffalo     & $L_a$ & 0.9 \\
                     & Chicago     & $L_b$ & 0.5\\
                     & Bremen      & $L_c$ & 0.5\\
                     & Zurich      & $L_d$ & 1.0\\
		\end{tabular}
	}
		\caption{Relation $OnTime$}
		\label{subfig:ex-shipping-simp-loc}
	\end{subfigure}%
	\begin{subfigure}[b]{0.49\linewidth}
		\centering
{\small
		\begin{tabular}{ c | c c c c}
			$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
			\hline
                    & Buffalo         & Chicago         & $R_a$          & 1.0        \\
                    & Chicago         & Zurich          & $R_b$          & 1.0        \\
                    %& $\cdots$        & $\cdots$        & $\cdots$     & $\cdots$   \\
                    & Chicago         & Bremen          & $R_c$          & 1.0        \\
		\end{tabular}
		}
		\caption{Relation $Route$}
		\label{subfig:ex-shipping-simp-route}
      \end{subfigure}%
  %     	\begin{subfigure}[b]{0.17\linewidth}
  %   	\centering

  %   \caption{Circuit for $(Chicago)$}
  %   \label{subfig:ex-proj-push-circ-q3}
  % \end{subfigure}

	\begin{subfigure}[b]{0.66\linewidth}
		\centering
{\small
          \begin{tabular}{ c | c c c}
            $\query_1$ & City    & $\Phi$                          & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline
                       & Buffalo & $L_a \cdot R_a$                 & $0.9$                                            \\
                       & Chicago & $L_b \cdot R_b + L_b \cdot R_c$ & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$                                               \\
            %& $\cdots$ & $\cdots$ & $\cdots$ \\
          \end{tabular}
		}
		\caption{$Q_1$'s Result}
		\label{subfig:ex-shipping-simp-queries}
      \end{subfigure}%
	\begin{subfigure}[b]{0.33\linewidth}
      \centering
		\resizebox{!}{16mm} {
			\begin{tikzpicture}[thick]
				\node[tree_node] (a2) at (0, 0){$R_b$};
				\node[tree_node] (b2) at (1, 0){$L_b$};
				\node[tree_node] (c2) at (2, 0){$R_c$};
				%level 1
				\node[tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
				\node[tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
				%level 0
				\node[tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
				%edges
				\draw[->] (a2) -- (a1);
				\draw[->] (b2) -- (a1);
				\draw[->] (b2) -- (b1);
				\draw[->] (c2) -- (b1);
				\draw[->] (a1) -- (a0);
				\draw[->] (b1) -- (a0);
			\end{tikzpicture}
		}
		\resizebox{!}{16mm} {
			\begin{tikzpicture}[thick]
				\node[tree_node] (a1) at (1, 0){$R_b$};
				\node[tree_node] (b1) at (2, 0){$R_c$};
				%level 1
				\node[tree_node] (a2) at (0.75, 0.8){$L_b$};
				\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};
				%level 0
				\node[tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};
				%edges
				\draw[->] (a1) -- (b2);
				\draw[->] (b1) -- (b2);
				\draw[->] (a2) -- (a3);
				\draw[->] (b2) -- (a3);
			\end{tikzpicture}
		}
	\caption{Two circuits for $Q_1(Chicago)$}
	\label{subfig:ex-proj-push-circ-q4}
	\end{subfigure}%
  \vspace*{-3mm}
	\caption{\ti instance and query results for \Cref{ex:intro-tbls}.}%{$\ti$ relations for $\poly$}
	\label{fig:ex-shipping-simp}
  \trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-tbls}
  Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where shipment processing is on time.

$$Q_1(\text{City}) \dlImp OnTime(\text{City}), Route(\text{City}, \dlDontcare)$$

\Cref{subfig:ex-shipping-simp-queries} shows the possible results of this query.
For example, there is a 90\% probability there is a single route starting in Buffalo that is on time, and the expected multiplicity of this result tuple is $0.9$.
There are two shipping routes starting in Chicago.
Since the Chicago location has a 50\% probability of being on schedule (we assume that delays are linked), the expected multiplicity of this result tuple is $0.5 + 0.5 = 1.0$.
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions)
over random variables
($\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07})
that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).

For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
Analogously to sets, evaluating the lineage for $t$  over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which for this example we denote\footnote{
In later sections, where we focus on a single lineage polynomial, we will simply refer to $\linsett{\query}{\pdb}{\tup}$ as $Q$.
} as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-lineage}
Associating a lineage variable with every input tuple as shown in \Cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \Cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to  solve the problem in linear time (in the size of  $\Phi$).
The expectation of the sum is the sum of the expectations of monomials.
The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
The expected multiplicity for Chicago is $1.0$.
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The expected multiplicity of a query result can be computed in linear time (in the size of the result's lineage) if the lineage is in SOP form.
However, this need not be true for compressed representations of polynomials, including factorized polynomials or arithmetic circuits.
For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.
The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.
Given that there is a large body of work (on, e.g., deterministic bag-relational query processing) that can  output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
an interesting question is whether computing expectations is still in linear time for such compressed representations.
If the answer is in the affirmative, and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.
Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.

Concretely, we make the following contributions:
(i) We show that the expected result multiplicity problem (\Cref{def:the-expected-multipl}) for conjunctive queries for bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;
(ii) We present an $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding; %;\BG{Fix not linear in all cases, restate after 4 is done}
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further prove that for \raPlus queries (an equivalently expressive, but factorizable form of UCQs),  we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).

%\mypar{Implications of our Results} As mentioned above

\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial by continuing \Cref{ex:intro-tbls}.

%Moving forward, we focus exclusively on bags.
Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$ over the bag relations of \Cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2()\dlImp Q(), Q()$.
%The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):
%\begin{equation*}
%\poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
%\end{equation*}
%This equivalent SOP representation is
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
\begin{equation*}
\left(L_aL_b + L_bL_d + L_bL_c\right)^2=L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
\end{equation*}
The expectation $\expct\pbox{\Phi^2}$ then is:
\begin{multline*}
\expct\pbox{L_a}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
\end{multline*}
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
\begin{footnotesize}
\begin{equation*}
\expct\pbox{L_a^2}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{L_d}\expct\pbox{L_c}
\end{equation*}
\end{footnotesize}
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the SOP form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\Phi^2$ as an example, we have:
\begin{align*}
\widetilde{\Phi^2}(L_a, L_b, L_c, L_d)
=&\; L_aL_b + L_bL_d + L_bW_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d
\end{align*}
It can be verified that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}), \probOf\pbox{L_d=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} UCQs over TIDB/BIDB.

%The reduced form of a lineage polynomial can be obtained but requires a linear scan over the clauses of an SOP encoding of the polynomial.  Note that for a compressed representation, this scheme would require an exponential number of computations in the size of the compressed representation.  In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results .

To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode variaous hard graph-counting problems.  We do so by analyzing  how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For the upper bound it is easy to check that if all the probabilties are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. To get an $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl}). Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
intro 2021-04-06 09:50:42 -04:00			`%root: main.tex`
			`%!TEX root=./main.tex`

			`\section{Introduction}`
			`\label{sec:intro}`
updates 2021-04-09 16:12:46 -04:00			`A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.`
Minor adjustments 2021-04-10 00:19:16 -04:00			`A well-studied problem in probabilistic databases is to, given a query $\query$ and a probabilistic database $\pdb$, compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).`
updates 2021-04-09 16:12:46 -04:00			`This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs).`
clarifications, flow, grammar 2021-04-06 18:46:06 -04:00			`In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):`
intro 2021-04-06 09:50:42 -04:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Finished pass on Intro (Aaron) 2021-04-06 16:11:19 -04:00			`\begin{equation}\label{eq:intro-bag-expectation}`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`\expct_{\overline{\mathbf{D}} \sim \probDist}[\query(\overline{\mathbf{D}})(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db) \hspace{2cm}\text{\textbf{(Expected Result Multiplicity)}}`
intro 2021-04-06 09:50:42 -04:00			`\end{equation}`
			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`

			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
			`\begin{figure}[t]`
			`\begin{subfigure}[b]{0.49\linewidth}`
			`\centering`
			`{\small`
			`\begin{tabular}{ c \| c c c}`
			`$OnTime$ & City$_\ell$ & $\Phi$ & \textbf{p}\\`
			`\hline`
			`& Buffalo & $L_a$ & 0.9 \\`
			`& Chicago & $L_b$ & 0.5\\`
			`& Bremen & $L_c$ & 0.5\\`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`& Zurich & $L_d$ & 1.0\\`
intro 2021-04-06 09:50:42 -04:00			`\end{tabular}`
			`}`
			`\caption{Relation $OnTime$}`
			`\label{subfig:ex-shipping-simp-loc}`
			`\end{subfigure}%`
			`\begin{subfigure}[b]{0.49\linewidth}`
			`\centering`
			`{\small`
			`\begin{tabular}{ c \| c c c c}`
			`$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\`
			`\hline`
			`& Buffalo & Chicago & $R_a$ & 1.0 \\`
			`& Chicago & Zurich & $R_b$ & 1.0 \\`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`%& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\`
intro 2021-04-06 09:50:42 -04:00			`& Chicago & Bremen & $R_c$ & 1.0 \\`
			`\end{tabular}`
			`}`
			`\caption{Relation $Route$}`
			`\label{subfig:ex-shipping-simp-route}`
			`\end{subfigure}%`
			`% \begin{subfigure}[b]{0.17\linewidth}`
			`% \centering`

			`% \caption{Circuit for $(Chicago)$}`
			`% \label{subfig:ex-proj-push-circ-q3}`
			`% \end{subfigure}`

			`\begin{subfigure}[b]{0.66\linewidth}`
			`\centering`
			`{\small`
			`\begin{tabular}{ c \| c c c}`
			`$\query_1$ & City & $\Phi$ & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline`
			`& Buffalo & $L_a \cdot R_a$ & $0.9$ \\`
			`& Chicago & $L_b \cdot R_b + L_b \cdot R_c$ & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$ \\`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`%& $\cdots$ & $\cdots$ & $\cdots$ \\`
intro 2021-04-06 09:50:42 -04:00			`\end{tabular}`
			`}`
			`\caption{$Q_1$'s Result}`
			`\label{subfig:ex-shipping-simp-queries}`
			`\end{subfigure}%`
			`\begin{subfigure}[b]{0.33\linewidth}`
			`\centering`
			`\resizebox{!}{16mm} {`
			`\begin{tikzpicture}[thick]`
			`\node[tree_node] (a2) at (0, 0){$R_b$};`
			`\node[tree_node] (b2) at (1, 0){$L_b$};`
			`\node[tree_node] (c2) at (2, 0){$R_c$};`
			`%level 1`
			`\node[tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};`
			`\node[tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};`
			`%level 0`
			`\node[tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};`
			`%edges`
			`\draw[->] (a2) -- (a1);`
			`\draw[->] (b2) -- (a1);`
			`\draw[->] (b2) -- (b1);`
			`\draw[->] (c2) -- (b1);`
			`\draw[->] (a1) -- (a0);`
			`\draw[->] (b1) -- (a0);`
			`\end{tikzpicture}`
			`}`
			`\resizebox{!}{16mm} {`
			`\begin{tikzpicture}[thick]`
			`\node[tree_node] (a1) at (1, 0){$R_b$};`
			`\node[tree_node] (b1) at (2, 0){$R_c$};`
			`%level 1`
			`\node[tree_node] (a2) at (0.75, 0.8){$L_b$};`
			`\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};`
			`%level 0`
			`\node[tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};`
			`%edges`
			`\draw[->] (a1) -- (b2);`
			`\draw[->] (b1) -- (b2);`
			`\draw[->] (a2) -- (a3);`
			`\draw[->] (b2) -- (a3);`
			`\end{tikzpicture}`
			`}`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`\caption{Two circuits for $Q_1(Chicago)$}`
intro 2021-04-06 09:50:42 -04:00			`\label{subfig:ex-proj-push-circ-q4}`
			`\end{subfigure}%`
			`\vspace*{-3mm}`
			`\caption{\ti instance and query results for \Cref{ex:intro-tbls}.}%{$\ti$ relations for $\poly$}`
			`\label{fig:ex-shipping-simp}`
			`\trimfigurespacing`
			`\end{figure}`
			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`


			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
			`\begin{Example}\label{ex:intro-tbls}`
Minor adjustments 2021-04-10 00:19:16 -04:00			Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where shipment processing is on time.
intro 2021-04-06 09:50:42 -04:00
small 2021-04-10 09:40:43 -04:00			`$$Q_1(\text{City}) \dlImp OnTime(\text{City}), Route(\text{City}, \dlDontcare)$$`
intro 2021-04-06 09:50:42 -04:00
clarifications, flow, grammar 2021-04-06 18:46:06 -04:00			`\Cref{subfig:ex-shipping-simp-queries} shows the possible results of this query.`
updates 2021-04-09 16:12:46 -04:00			`For example, there is a 90\% probability there is a single route starting in Buffalo that is on time, and the expected multiplicity of this result tuple is $0.9$.`
			`There are two shipping routes starting in Chicago.`
Minor adjustments 2021-04-08 11:51:26 -04:00			`Since the Chicago location has a 50\% probability of being on schedule (we assume that delays are linked), the expected multiplicity of this result tuple is $0.5 + 0.5 = 1.0$.`
intro 2021-04-06 09:50:42 -04:00			`\end{Example}`
			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`

Minor adjustments 2021-04-10 00:19:16 -04:00			`A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions)`
Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds 2021-04-10 09:42:02 -04:00			`over random variables`
			`($\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07})`
Minor adjustments 2021-04-10 00:19:16 -04:00			`that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.`
Minor adjustments 2021-04-08 11:51:26 -04:00			`Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).`
intro 2021-04-06 09:50:42 -04:00
Done with pass on S1 2021-04-07 22:49:21 -04:00			`For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with`
updates 2021-04-09 16:12:46 -04:00			`coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).`
Minor tweaks 2021-04-10 13:21:10 -04:00			`Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which for this example we denote\footnote{`
			`In later sections, where we focus on a single lineage polynomial, we will simply refer to $\linsett{\query}{\pdb}{\tup}$ as $Q$.`
			`} as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.`
intro 2021-04-06 09:50:42 -04:00
			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
			`\begin{Example}\label{ex:intro-lineage}`
small 2021-04-10 09:40:43 -04:00			Associating a lineage variable with every input tuple as shown in \Cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \Cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
			`In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to solve the problem in linear time (in the size of $\Phi$).`
updates 2021-04-09 16:12:46 -04:00			`The expectation of the sum is the sum of the expectations of monomials.`
clarifications, flow, grammar 2021-04-06 18:46:06 -04:00			`The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.`
Minor adjustments 2021-04-10 00:19:16 -04:00			`The expected multiplicity for Chicago is $1.0$.`
intro 2021-04-06 09:50:42 -04:00			`\end{Example}`
			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`

clarifications, flow, grammar 2021-04-06 18:46:06 -04:00			`The expected multiplicity of a query result can be computed in linear time (in the size of the result's lineage) if the lineage is in SOP form.`
updates 2021-04-09 16:12:46 -04:00			`However, this need not be true for compressed representations of polynomials, including factorized polynomials or arithmetic circuits.`
			`For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.`
			`The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.`
Minor adjustments 2021-04-10 00:19:16 -04:00			`Given that there is a large body of work (on, e.g., deterministic bag-relational query processing) that can output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},`
updates 2021-04-09 16:12:46 -04:00			`an interesting question is whether computing expectations is still in linear time for such compressed representations.`
clarifications, flow, grammar 2021-04-06 18:46:06 -04:00			`If the answer is in the affirmative, and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.`
intro 2021-04-06 09:50:42 -04:00
Finished pass on Intro (Aaron) 2021-04-06 16:11:19 -04:00			`Concretely, we make the following contributions:`
Minor changes. 2021-04-10 13:20:30 -04:00			`(i) We show that the expected result multiplicity problem (\Cref{def:the-expected-multipl}) for conjunctive queries for bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;`
Minor adjustments 2021-04-10 00:19:16 -04:00			`(ii) We present an $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding; %;\BG{Fix not linear in all cases, restate after 4 is done}`
intro 2021-04-06 09:50:42 -04:00			`(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;`
Small changes. 2021-04-10 12:21:30 -04:00			`(iv) We further prove that for \raPlus queries (an equivalently expressive, but factorizable form of UCQs), we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).`
Done with pass on S1 2021-04-07 22:49:21 -04:00
updates 2021-04-09 16:12:46 -04:00			`%\mypar{Implications of our Results} As mentioned above`
Done with pass on S1 2021-04-07 22:49:21 -04:00
cref 2021-04-10 09:48:26 -04:00			`\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial by continuing \Cref{ex:intro-tbls}.`
Done with pass on S1 2021-04-07 22:49:21 -04:00
updates 2021-04-09 16:12:46 -04:00			`%Moving forward, we focus exclusively on bags.`
cref 2021-04-10 09:48:26 -04:00			`Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$ over the bag relations of \Cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2()\dlImp Q(), Q()$.`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`%The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):`
			`%\begin{equation*}`
			`%\poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)`
			`%\end{equation*}`
			`%This equivalent SOP representation is`
Minor adjustments 2021-04-10 00:19:16 -04:00			`The lineage polynomial for $Q^2$ is given by $\Phi^2$:`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`\begin{equation*}`
			`\left(L_aL_b + L_bL_d + L_bL_c\right)^2=L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.`
			`\end{equation*}`
			`The expectation $\expct\pbox{\Phi^2}$ then is:`
Minor adjustments 2021-04-08 11:51:26 -04:00			`\begin{multline*}`
			`\expct\pbox{L_a}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} \\`
			`+ 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}`
			`\end{multline*}`
Minor adjustments 2021-04-10 00:19:16 -04:00			`\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`\begin{footnotesize}`
			`\begin{equation*}`
			`\expct\pbox{L_a^2}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{L_d}\expct\pbox{L_c}`
			`\end{equation*}`
			`\end{footnotesize}`
Minor changes. 2021-04-10 13:20:30 -04:00			`\noindent This property leads us to consider a structure related to the lineage polynomial.`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`\begin{Definition}\label{def:reduced-poly}`
Minor adjustments 2021-04-08 11:51:26 -04:00			`For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the SOP form of $\poly(\vct{X})$ to $1$.`
Done with pass on S1 2021-04-07 22:49:21 -04:00			`\end{Definition}`
			`With $\Phi^2$ as an example, we have:`
			`\begin{align*}`
			`\widetilde{\Phi^2}(L_a, L_b, L_c, L_d)`
			`=&\; L_aL_b + L_bL_d + L_bW_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d`
			`\end{align*}`
Minor changes. 2021-04-10 13:20:30 -04:00			`It can be verified that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}), \probOf\pbox{L_d=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} UCQs over TIDB/BIDB.`
Done with pass on S1 2021-04-07 22:49:21 -04:00
			`%The reduced form of a lineage polynomial can be obtained but requires a linear scan over the clauses of an SOP encoding of the polynomial. Note that for a compressed representation, this scheme would require an exponential number of computations in the size of the compressed representation. In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results .`

Minor changes. 2021-04-10 13:20:30 -04:00			To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode variaous hard graph-counting problems. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For the upper bound it is easy to check that if all the probabilties are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. To get an $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
intro 2021-04-06 09:50:42 -04:00
up 2021-04-06 10:16:09 -04:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Minor changes. 2021-04-10 13:20:30 -04:00			\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl}). Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.
intro 2021-04-06 09:50:42 -04:00








			`%%% Local Variables:`
			`%%% mode: latex`
			`%%% TeX-master: "main"`
			`%%% End:`