%!TEX root=./main.tex
%root: main.tex
\section{Introduction}\label{sec:intro}
This work explores the problem of computing the expectation of a tuple's multiplicity in bag \abbrTIDB\xplural, which we term \abbrCTIDB\xplural. A \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$, encodes a bag of uncertain tuples such that each possible tuple encoded in $\pdb$ has a multiplicity of at most $\bound$. Here $\tupset$ is the set of tuples appearing across all possible worlds, $\worlds$ is the set of all worlds, i.e., the set of all vectors of length $\numvar=\abs{\tupset}$ whose indexes correspond to the distinct tuples $\tup \in \tupset$ and store their multiplicities, and $\bpd$ is the probability distribution over $\worlds$. A given world $\worldvec \in\worlds$ is interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. The probability distribution $\bpd$ can then be encoded tuple-wise as $\prob_{\tup, j} = \probOf\pbox{\worldvec_{\tup} = j}$ (for $j \in\pbox{\bound}$), where the multiplicities of distinct tuples $\tup \in \tupset$ are independent random events.
Allowing multiplicities of up to $\bound$ across all tuples gives rise to at most $\inparen{\bound+1}^\numvar$ possible worlds, instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics) is the same as the traditional set \abbrTIDB.
Since we consider bag query inputs in this work, we use only bag query semantics. We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of $\tup$ in the result of query $\query$ over possible world $\worldvec\in\worlds$.
We can formally state our problem of computing the expected multiplicity of a result tuple as:
\begin{Problem}\label{prob:expect-mult}
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$
\footnote{
A query $\query$ is an $\raPlus$ query if it is composed entirely of one or more of the positive relational operators $\inset{\select, \project, \join, \union}$.
}
, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
\end{Problem}
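As a simple illustration (with hypothetical probability values), consider a \abbrCTIDB containing a single tuple $\tup$ with $\bound = 2$, $\probOf\pbox{\worldvec_{\tup} = 1} = 0.4$, and $\probOf\pbox{\worldvec_{\tup} = 2} = 0.3$ (hence $\probOf\pbox{\worldvec_{\tup} = 0} = 0.3$). For the identity query $\query$ that simply returns its input relation, the expected multiplicity of $\tup$ is
\begin{align*}
\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}} = 0\cdot 0.3 + 1\cdot 0.4 + 2\cdot 0.3 = 1.
\end{align*}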
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Example of product distribution of c-TIDB
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}[h!]
% \centering
% \textcolor{red}{
% \begin{tabular}{>{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c}
% \multicolumn{5}{c}{$\mathbf{\rel}$}\\
% \toprule
% A & Mult. $\inparen{M}$ &$\probOf\pbox{M=1}$ &$\probOf\pbox{M=2}$ &$\probOf\pbox{M=3}$\\
% \midrule
% $a$ & $2$ & $0.4$ &$0.3$ &$0.0$\\
% $b$ & $3$ & $0.2$ &$0.35$ &$0.15$\\
% \end{tabular}
%% \hspace*{0.5cm}
%% {\LARGE $\Rightarrow$}
%% \hspace*{0.5cm}
% \begin{tabular}{>{\footnotesize}c | >{\footnotesize}c}
% \multicolumn{2}{c}{$\textbf{World Probabilities}$}\\
% \toprule
% World & Probability \\
% \midrule
% $\emptyset$ & $0.3\cdot0.3 = 0.09$\\
% $\inset{\intup{a, 1}}$ & $0.4\cdot0.3 = 0.12$\\
% $\inset{\intup{a, 2}}$ & $0.3\cdot0.3 = 0.09$\\
% $\inset{\intup{b, 1}}$ & $0.3\cdot0.2 = 0.06$\\
% $\inset{\intup{b, 2}}$ & $0.3\cdot0.35 = 0.105$\\
% $\inset{\intup{b, 3}}$ & $0.3\cdot0.15 = 0.045$\\
% $\inset{\intup{a, 1}, \intup{b, 1}}$ & $0.4\cdot0.2 = 0.08$\\
% $\inset{\intup{a, 1}, \intup{b, 2}}$ & $0.4\cdot0.35 = 0.14$\\
% $\inset{\intup{a, 1}, \intup{b, 3}}$ & $0.4\cdot0.15 = 0.06$\\
% $\inset{\intup{a, 2}, \intup{b, 1}}$ & $0.3\cdot0.2 = 0.06$\\
% $\inset{\intup{a, 2}, \intup{b, 2}}$ & $0.3\cdot0.35 = 0.105$\\
% $\inset{\intup{a, 2}, \intup{b, 3}}$ & $0.3\cdot0.15 = 0.045$\\
% \end{tabular}
% }
% \caption{\textcolor{red}{\abbrCTIDB relation$\rel$ and its possible worlds with their probabilities.% Reduction to $1$-\abbrBIDB ($\rel'$). Note the probability distribution over tuple multiplicities of $\rel$, where e.g. tuple $\tup_1$ has a probability $\prob_{1, j} > 0$ for each multiplicity $j \in [2]$. This is better expressed as a block of mutually exclusive tuples, where each tuple has a specific multiplicity in $[c]$. Multiplicities that don't exist are automatically assigned a probability of $0$. Also note that it is implicit in \abbrBIDB\xplural for a block $i$ that $1 - \sum_{j \in [c]}\prob_{i, j}$ is the probability that no tuple in that block will be selected for a possible world.
%}}
% \label{fig:ctidb-red}
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Figure may work for an example in a later section of the Intro
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}[h!]
% \centering
% \textcolor{red}{
% \begin{tabular}{c | c | c}
% \multicolumn{3}{c}{$\mathbf{\rel}$}\\
% \toprule
% A & Mult. & id\\
% \midrule
% $a$ & $2$ & $\tup_1$\\
% $b$ & $3$ & $\tup_2$\\
% \end{tabular}
% \hspace*{1cm}
% {\LARGE $\equiv$}
% \hspace*{1cm}
% \begin{tabular}{c | c | c}
% \multicolumn{3}{c}{$\textbf{\rel'}$}\\
% \toprule
% A & $\poly$ & $\probOf\pbox{x_{i, j}}$\\
% \midrule
% a & $\pVar_{1, 1}$ & $0.4$\\
% a & $\pVar_{1, 2}$ & $0.3$\\
% a & $\pVar_{1, 3}$ & $0.0$\\
% \midrule
% b & $\pVar_{2, 1}$ & $0.2$\\
% b & $\pVar_{2, 2}$ & $0.35$\\
% b & $\pVar_{2, 3}$ & $0.15$\\
% \end{tabular}
% }
% \caption{\textcolor{red}{\abbrCTIDB ($\rel$) Reduction to $1$-\abbrBIDB ($\rel'$). Note the probability distribution over tuple multiplicities of $\rel$, where e.g. tuple $\tup_1$ has a probability $\prob_{1, j} > 0$ for each multiplicity $j \in [2]$. This is better expressed as a block of mutually exclusive tuples, where each tuple has a specific multiplicity in $[c]$. Multiplicities that don't exist are automatically assigned a probability of $0$. Also note that it is implicit in \abbrBIDB\xplural for a block $i$ that $1 - \sum_{j \in [c]}\prob_{i, j}$ is the probability that no tuple in that block will be selected for a possible world.
%}}
% \label{fig:ctidb-red}
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Computing the expected multiplicity of a result tuple is the natural analog of computing the marginal probability of a tuple in a set \abbrPDB.
In this work we assume that $\bound =\bigO{1}$, since this is what is typically seen in practice.
Allowing for unbounded $c$ is an interesting open problem.
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and the data complexity of the problem in general has been shown by Dalvi and Suciu to be \sharpphard~\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any $\raPlus$ query over a \abbrCTIDB: by linearity of expectation, one can simply compute the expectation over a `sum-of-products' representation of the query operations of $\query\inparen{\pdb}\inparen{\tup}$.
Since we can compute~\Cref{prob:expect-mult} in polynomial time, the interesting question we explore is the hardness of computing the expectation through the lens of fine-grained analysis and parameterized complexity, where we are interested in the exponent of the polynomial runtime.
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an equivalent deterministic query. If so, this would open the way for deploying \abbrCTIDB\xplural in practice. To analyze this question we denote by $\timeOf{}^*(\query,\pdb)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$.
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$. This paper considers $\raPlus$ queries, for which the order of operations is \emph{explicit}, as opposed to other query languages, e.g., Datalog or UCQs. Thus, since the order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} = \arg\min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{Note that our work applies to any $\query \in\raPlus$, which implies that specific heuristics for choosing an optimized query can be abstracted away, i.e., our work does not consider heuristic techniques.}
\begin{table}[t!]
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
\hline
Lower bound on $\timeOf{}^*(\query,\pdb)$ & Num. $\bpd$s & Hardness Assumption\\
\hline
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{1+\eps_0}}$ for {\em some} $\eps_0>0$ & Single & Triangle Detection hypothesis\\
%\hline
$\omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{C_0}}$ for {\em all} $C_0>0$ & Multiple &$\sharpwzero\ne\sharpwone$\\
%\hline
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$. For $\pdb = \inparen{\worlds, \bpd}$, the rows marked `Multiple' in the second column require the algorithm to handle multiple probability distributions $\bpd$ (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
\end{table}
\mypar{Our lower bound results}
Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq\qruntime{\optquery{\query}, \tupset, \bound}$. Unfortunately, this is not the case.
\Cref{tab:lbs} summarizes our lower bounds for computing~\Cref{prob:expect-mult} on \abbrCTIDB\xplural.
Specifically, depending on which hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To put the lower bounds of Table~\ref{tab:lbs} in context, we note that it is not too hard to show that $\timeOf{}^*(\query,\pdb) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width of the query $\query$ over all result tuples $\tup$ (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}); $k$ is also the parameter that defines our family of hard queries.
Our lower bound in the third row says that one cannot get more than a polynomial improvement over (essentially) the trivial algorithm for~\Cref{prob:expect-mult}.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion of the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we replace $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bounds for deterministic query processing). Our contribution is then to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that computes~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. In other words, when we are willing to accept approximation, we can solve~\Cref{prob:expect-mult} in time linear in the runtime of the corresponding deterministic query,
and thus bag \abbrPDB\xplural become deployable in practice.
% In particular, we show the following upper bound results.
%(i) We show that e.g. for a circuit representation of the lineage polynomial (more on this later), when the circuit is a tree and there is a single
% result tuple, we also have the same runtime (we can also handle the case of multiple result tuples\footnote{We can approximate the expected result tuple multiplicities (for all result tuples {\em simultanesouly}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of result tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}).
%Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB $(1$-$\abbrTIDB)$, we also obtain linear runtime for approximation.
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques for set-\abbrPDB\xplural~\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89} require $\Omega(\qruntime{\optquery{\query}, \tupset, \bound}^{2k})$ time
(see \Cref{sec:karp-luby}).
Further, our approximation algorithm works for a more general notion of bag \abbrPDB\xplural beyond \abbrCTIDB\xplural
%we generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases
(see \Cref{subsec:tidbs-and-bidbs}). %(\abbrBIDB\xplural).
\subsection{Polynomial Equivalence}\label{sec:intro-poly-equiv}
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages or propositional formulas that describe the set of possible worlds that the tuple appears in. The bag semantics analog is a provenance/lineage polynomial (see~\Cref{fig:nxDBSemantics}) $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with non-zero integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
%Intuitively, a \abbrCTIDB lends itself to a useful reduction to a specific type of block independent database (\abbrBIDB) which we refer to as a $1$-\abbrBIDB. A $1$-\abbrBIDB is a \abbrBIDB in the traditional sense of allowing no duplicate tuples, \emph{but} where we use bag query semantics instead of the usual set query semantics.
%(see~\Cref{fig:nxDBSemantics} for a definition)
\begin{figure}[b!]
\begin{align*}
\polyqdt{\project_A(\query)}{\gentupset}{\tup} =& \sum_{\tup': \project_A(\tup') = \tup} \polyqdt{\query}{\gentupset}{\tup'} &
\polyqdt{\query_1 \union \query_2}{\gentupset}{\tup} =& \polyqdt{\query_1}{\gentupset}{\tup} + \polyqdt{\query_2}{\gentupset}{\tup}\\
\polyqdt{\select_\theta(\query)}{\gentupset}{\tup} =& \begin{cases}
\polyqdt{\query}{\gentupset}{\tup} & \text{if }\theta(\tup) \\
0 & \text{otherwise}.
\end{cases} &
\begin{aligned}
\polyqdt{\query_1 \join \query_2}{\gentupset}{\tup} =\\ ~
\end{aligned}&
\begin{aligned}
&\polyqdt{\query_1}{\gentupset}{\project_{\attr{\query_1}}{\tup}} \\
&~~~\cdot\polyqdt{\query_2}{\gentupset}{\project_{\attr{\query_2}}{\tup}}
\end{aligned}\\
& & & \polyqdt{\rel}{\gentupset}{\tup} = X_\tup%\sum_{j \in [c]}j\cdot\pVar_{\tup, j}
\end{align*}\\[-10mm]
\caption{Construction of the lineage (polynomial) for an $\raPlus$ query $\query$ over an arbitrary deterministic database $\gentupset$, where $\vct{X}$ consists of all $X_\tup$ over all $\rel$ in $\gentupset$ and $\tup$ in $\rel$. Here $\gentupset.\rel$ denotes the instance of relation $\rel$ in $\gentupset$. Note that after we introduce the reduction to $1$-\abbrBIDB, the base case will be expressed differently.}
\label{fig:nxDBSemantics}
\end{figure}
We drop $\query$, $\tupset$, and $\tup$ from $\apolyqdt$ when they are clear from the context or irrelevant to the discussion. We now specify the problem of computing the expectation of tuple multiplicity in the language of lineage polynomials:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity of Lineage Polynomials]\label{prob:bag-pdb-poly-expected}
Given an $\raPlus$ query $\query$, \abbrCTIDB $\pdb$ and result tuple $\tup$, compute the expected
multiplicity of the polynomial $\apolyqdt$ (i.e., $\expct_{\vct{W}\sim \pdassign}\pbox{\apolyqdt(\vct{W})}$, where $\vct{W} \in \worlds$).%\inset{0,\ldots, \bound}^{\abs{\tupset}}$).
%,
%where $\pdassign$ is the distribution induced by $\pd$ on the relevant assignments $\vct{W}$ to variables of $\apolyqdt$.
\end{Problem}
We note that computing \Cref{prob:expect-mult}
is equivalent to (i.e., yields the same result as) computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
%In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials.
All of our results rely on working with a {\em reduced} form $\inparen{\rpoly}$ of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This also holds when the query inputs are block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10} (under bag query semantics with tuple multiplicity at most $1$), for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) carries over.
% For our results to be applicable to \abbrCTIDB\xplural, we introduce the following reduction.
%\begin{Definition}
%Any \abbrCTIDB $\pdb$, can be reduced to an equivalent $1$-\abbrBIDB $\pdb'$ in the following manner. For each $\tup_i \in \tupset$, create a block of $\bound + 1$ disjoint \abbrBIDB tuples in $\pdb'$ such that each tuple in the newly formed block is mapped to its own boolean variable $X_{i, j}$ for $i \in \abs{D}$ and $j \in \pbox{c+1}$. Then, given $\worldvec \in \worlds$, the equivalent world in $\pdb'$ will set each variable $X_{i, j} = 1$ for each $\worldvec\pbox{i} = j$, while $\inparen{\text{for }\ell \neq j}$ all other $X_{i, \ell} \in \vct{X}$ of $\pdb'$ are set to $0$.
%\end{Definition}
%\begin{Example}
%Consider the $\boldsymbol{Route}$ relation of~\Cref{fig:two-step} and query $\query = \project_{\text{City}_1}\inparen{\boldsymbol{Route}}$. The output relation $\query$ is $\inset{\intup{Chicago, X}, \intup{Chicago, Y}}$ and can be represented as a \abbrCTIDB $\query' = \inset{\intup{Chicago, X', 2}}$, where the following probabilities are true: $\probOf\pbox{X' = 0} = \probOf\pbox{\neg X \wedge \neg Y}$, $\probOf\pbox{X' = 1} = \probOf\pbox{\inparen{X \vee Y}\wedge\inparen{\neg X \vee \neg Y}}$, and $\probOf\pbox{X' = 2} = \probOf\pbox{X\wedge Y}$. $\query'$ can then be reduced to a $1$-\abbrBIDB by creating a block of the following disjoint tuples: $\query'' = \inset{\intup{\text{Chicago}, X'_0}, \intup{\text{Chicago}, X'_1}, \intup{\text{Chicago}, X'_2}}$ such that $\probOf\pbox{X'_i = 1} = \probOf\pbox{X' = i}$.
%\end{Example}
Next, we motivate this reduced polynomial.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
\begin{lstlisting}
SELECT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.city = r.city1 AND $t_2$.city = r.city2
\end{lstlisting}
It can be verified that the lineage polynomial $\poly_1\inparen{A, B, C, E, X, Y, Z}$ for the sole result tuple (i.e., the count) of $\query_1$ is $AXB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $\query_1^2$ is given by $\poly_1^2\inparen{A, B, C, E, X, Y, Z}$
$$
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
$$
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABX}^2} = A^2X^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_X$ be the random variable corresponding to a lineage variable $X$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further derive $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $X\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_X\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rpoly_1^{\inparen{ABX}^2}\inparen{\prob_A, \prob_X, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_X^2}$, since $\randWorld_X\in\inset{0, 1, 2}$ and for $\randWorld_X \gets 2$, $\randWorld_X^2 \neq \randWorld_X$.
%the expectation is $\expct\pbox{A^2X^2B^2} = A\cdot\prob_A\cdot\inparen{\sum\limits_{i \in [2]}X_i\cdot \prob_{X, i}}\cdot B\prob_B$ for $X \in \inset{0, 1, 2}$.
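To see the issue concretely, write $\prob_{X, j} = \probOf\pbox{\randWorld_X = j}$; then, with $\bound = 2$,
\begin{align*}
\expct\pbox{\randWorld_X^2} = 1^2\cdot\prob_{X, 1} + 2^2\cdot\prob_{X, 2} \neq 1\cdot\prob_{X, 1} + 2\cdot\prob_{X, 2} = \expct\pbox{\randWorld_X}
\end{align*}
whenever $\prob_{X, 2} > 0$, so the $1$-\abbrTIDB simplification $\expct\pbox{\randWorld^2} = \expct\pbox{\randWorld}$ no longer applies.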
Let $\vars{\poly}$ denote the variables of $\poly$. In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\refpoly{}\inparen{\vct{X_R}}$ that is of use to us, where $\abs{\vct{X_R}} = \bound\cdot\abs{\vct{X}}$. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, \bound}$. We replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}$, where each $X_{\tup, j}\in\inset{0, 1}$. Then, for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ if $\worldvec_\tup = j$ and $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction, $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ (where $\vct{X_R} = \vars{\refpoly{}}$), since for any $X_\tup\in\vars{\poly}$ with $\worldvec_\tup = j$ we have the equality $X_\tup = j = \sum_{j'\in\pbox{\bound}}j'\cdot X_{\tup, j'}$.
Considering again our example,
\begin{multline*}
\refpoly{1, }^{\inparen{ABX}^2}\inparen{A, X, B} = \poly_1^{\inparen{ABX}^2}\inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}, \sum_{j_2\in\pbox{\bound}}j_2X_{j_2}, \sum_{j_3\in\pbox{\bound}}j_3B_{j_3}} \\
= \inparen{\sum_{j_1\in\pbox{\bound}}j_1A_{j_1}}^2\inparen{\sum_{j_2\in\pbox{\bound}}j_2X_{j_2}}^2\inparen{\sum_{j_3\in\pbox{\bound}}j_3B_{j_3}}^2.
\end{multline*}
Since the multiplicities of tuple $\tup$ are by nature mutually exclusive (i.e., $X_{\tup, j}X_{\tup, j'} = 0$ for $j \neq j'$), we can drop all cross terms and obtain $\refpoly{1, }^{\inparen{ABX}^2} = \sum_{j_1, j_2, j_3 \in \pbox{\bound}}j_1^2A^2_{j_1}j_2^2X_{j_2}^2j_3^2B^2_{j_3}$. Computing the expectation we get $\expct\pbox{\refpoly{1, }^{\inparen{ABX}^2}}=\sum_{j_1,j_2,j_3\in\pbox{\bound}}j_1^2j_2^2j_3^2\expct\pbox{\randWorld_{A_{j_1}}}\expct\pbox{\randWorld_{X_{j_2}}}\expct\pbox{\randWorld_{B_{j_3}}}$, since now all $\randWorld_{X_j}\in\inset{0, 1}$.
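For instance, continuing the same worked monomial, the expectation factors across the (independent) tuples, and its $X$-factor for $\bound = 2$ is
\begin{align*}
\sum_{j_2\in\pbox{\bound}}j_2^2\cdot\expct\pbox{\randWorld_{X_{j_2}}} = 1^2\cdot\prob_{X, 1} + 2^2\cdot\prob_{X, 2} = \expct\pbox{\randWorld_X^2},
\end{align*}
i.e., exactly the quantity we could not simplify earlier, but now expressed as (the expectation of) a polynomial in the $\inset{0, 1}$-valued variables.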
% \begin{footnotesize}
% \begin{align*}
% &\expct\pbox{\randWorld_A^2\randWorld_X^2\randWorld_B^2} = \expct\pbox{\randWorld_A^2}\expct\pbox{\inparen{\randWorld_{X_1} + \randWorld_{X_2}}^2}\expct\pbox{\randWorld_B^2} = \expct\pbox{\randWorld_A}\expct\pbox{\randWorld_{X_1}^2 + 2\randWorld_{X_1}\randWorld_{X_2} + \randWorld_{X_2}^2}\expct\pbox{\randWorld_B} =\\
% &\expct\pbox{\randWorld_A}\inparen{\expct\pbox{\randWorld_{X_1}^2}+\expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}}+\expct\pbox{\randWorld_{X_2}^2}}\expct\pbox{\randWorld_B} = \expct\pbox{\randWorld_A}\inparen{\expct\pbox{\randWorld_{X_1}} + \expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}} + \expct\pbox{\randWorld_{X_2}}}\expct\pbox{\randWorld_B} = \\
% &\expct\pbox{\randWorld_A}\inparen{\sum\limits_{j \in \pbox{\bound}}\expct\pbox{j\cdot\randWorld_{X_j}}}\expct\pbox{\randWorld_B}.
% \end{align*}
% \end{footnotesize}
%We can drop the term $\expct\pbox{2\randWorld_{X_1}\randWorld_{X_2}}$ since by definition a tuple can only have one multiplicity value in a possible world, thus always making $\randWorld_{X_1}\cdot \randWorld_{X_2} = 0$.
%Another subtlety to note is that for any $i\in \pbox{\bound}$, $\expct\pbox{\randWorld_{X_i}} = i\cdot\prob_{X, i}$.
This leads us to consider a structure related to the lineage polynomial.
%By exploiting linearity of expectation, further pushing expectation through independent variables and observing that for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$, the expectation is
%$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ (where $\randWorld_A$ is the random variable corresponding to $A$, distributed by $\pdassign$).
%Atri: Combined the the first step below with the next one to save space.
%\begin{footnotesize}
%\begin{multline*}
%\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y^2}\expct\pbox{\randWorld_E^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z^2}\expct\pbox{\randWorld_C^2} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\\
%+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
%\end{multline*}
%\end{footnotesize}
%\noindent Since for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$,
%then for any $k > 0$, $\expct\pbox{\randWorld^k} = \expct\pbox{\randWorld}$, which means that
%$\expct\limits_{\vct{\randWorld}\sim\pdassign}\pbox{\poly^2\inparen{\vct{\randWorld}}}$ simplifies to:
%\begin{footnotesize}
%\begin{multline*}
%\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct{\randWorld_Y}\expct\pbox{\randWorld_E} \\
%+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_E}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
%\end{multline*}
%\end{footnotesize}
%\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$, i) define the \emph{reformulated polynomial} $\refpoly{}\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}$ to be the polynomial $\refpoly{} = \poly\inparen{\inparen{\sum_{j\in\pbox{\bound}}j\cdot X_{\tup, j}}_{\tup\in\tupset}}$, and ii) define the \emph{reduced polynomial} $\rpoly\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}$ to be the polynomial resulting from converting $\refpoly{}$ into the standard monomial basis (\abbrSMB),
\footnote{
This is the representation, typically used in set-\abbrPDB\xplural, where the polynomial is represented as a sum of `pure' products. See \Cref{def:smb} for a formal definition.
}
removing all monomials containing a product $X_{\tup, j}X_{\tup, j'}$ for some $\tup\in\tupset$ and $j\neq j'\in\pbox{\bound}$, and setting all \emph{variable} exponents $e > 1$ to $1$.
\end{Definition}
Continuing with the example
\footnote{
To save clutter we do not show the full expansion for variables with maximum multiplicity $1$: e.g., for variable $A$ the reformulated sum collapses to $1\cdot A_1$, and after reduction $1^2\cdot A_1^2 = A_1$, which we simply write as $A$.
}
$\poly_1^2\inparen{A, B, C, E, X_1, X_2, Y, Z}$ we have
\begin{multline*}
\rpoly_1^2(A, B, C, E, X_1, X_2, Y, Z) = \\
A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}B + BYE + BZC + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}BYE + 2A\inparen{\sum\limits_{j\in\pbox{\bound}}j^2X_j}BZC + 2BYEZC =\\
ABX_1 + AB\inparen{2}^2X_2 + BYE + BZC + 2AX_1BYE + 2A\inparen{2}^2X_2BYE + 2AX_1BZC + 2A\inparen{2}^2X_2BZC + 2BYEZC.
%&\; = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
\end{multline*}
Note that we have argued that, for our specific example, the expectation we want is $\rpoly_1^2(\probOf\inparen{A=1},$ $\probOf\inparen{B=1}, \probOf\inparen{C=1}, \probOf\inparen{E=1}, \probOf\inparen{X_1=1}, \probOf\inparen{X_2=1}, \probOf\inparen{Y=1}, \probOf\inparen{Z=1})$.
%It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}), \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$).
\Cref{lem:tidb-reduce-poly} generalizes the equivalence to {\em all} $\raPlus$ queries on \abbrCTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}\label{lem:tidb-reduce-poly}
For any \abbrCTIDB $\pdb$, $\raPlus$ query $\query$, and lineage polynomial
%\BG{Term has not been introduced yet.}
%Atri: fixed
$\poly\inparen{\vct{X}}=\poly\pbox{\query,\tupset,\tup}\inparen{\vct{X}}$, it holds that $
\expct_{\vct{W} \sim \pdassign}\pbox{\refpoly{}\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}
$, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}$ is defined by $\bpd$.
\end{Lemma}
\subsection{Our Techniques}
\mypar{Lower Bound Proof Techniques}
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB\xplural. To prove this result we show that, for the same $\query_1$ from the example above and an arbitrary `product width' $k$, the query $\query_1^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query_1$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient of the polynomial corresponding to $\query_1^k$ is proportional to the number of $k$-matchings in $G$, a problem known to be hard in the parameterized/fine-grained complexity literature.
\mypar{Upper Bound Techniques}
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) cannot achieve performance comparable to deterministic databases for exact results (under the stated complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural follow-up is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
\input{two-step-model}
We adopt a two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial ($\poly(\vct{X})=\apolyqdt\inparen{\vct{X}}$);
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
Let $\timeOf{\abbrStepOne}(Q,\tupset,\circuit)$ denote the runtime of \abbrStepOne when it outputs $\circuit$ (which is a representation of $\poly$ as an arithmetic circuit --- more on this representation in~\Cref{sec:expression-trees}).
Denote by $\timeOf{\abbrStepTwo}(\circuit, \epsilon)$ (recall that $\circuit$ is the output of \abbrStepOne) the runtime of \abbrStepTwo. Leveraging~\Cref{def:reduced-poly} and~\Cref{lem:tidb-reduce-poly}, we can now state our next formal objective:
\begin{Problem}[\abbrCTIDB linear time approximation]\label{prob:big-o-joint-steps}
Given \abbrCTIDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(\query,\tupset, \circuit) + \timeOf{\abbrStepTwo}(\circuit, \epsilon) \le O_\epsilon(\qruntime{\optquery{\query}, \tupset, \bound})$?
\end{Problem}
We show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or, more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$,
and hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ will be too large.
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
Accordingly, this work uses (arithmetic) circuits\footnote{
An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes, each representing either an addition or a multiplication operator.
}
as the representation system of $\poly(\vct{X})$.
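To make the circuit representation concrete, below is a minimal sketch in Python (purely illustrative; the node classes \texttt{Var}, \texttt{Const}, and \texttt{Gate} are our own naming and do not correspond to any particular system) of a circuit DAG for the factorized polynomial $B(Y+Z)$ from above. A single addition gate and one reference to $B$ suffice, whereas the \abbrSMB form $BY + BZ$ needs two multiplication gates and two references to $B$.
\begin{lstlisting}
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Var:            # variable source node
    name: str

@dataclass(frozen=True)
class Const:          # numeric source node
    value: int

@dataclass(frozen=True)
class Gate:           # internal node: '+' or '*'; children may be shared (DAG)
    op: str
    left: "Node"
    right: "Node"

Node = Union[Var, Const, Gate]

def evaluate(node: Node, assignment: dict) -> float:
    # Evaluate the circuit bottom-up under a variable assignment.
    if isinstance(node, Var):
        return assignment[node.name]
    if isinstance(node, Const):
        return node.value
    l, r = evaluate(node.left, assignment), evaluate(node.right, assignment)
    return l + r if node.op == '+' else l * r

# Factorized circuit for B*(Y+Z); the SMB form B*Y + B*Z would need
# two multiplication gates and two references to B.
circuit = Gate('*', Var("B"), Gate('+', Var("Y"), Var("Z")))
print(evaluate(circuit, {"B": 1, "Y": 2, "Z": 3}))  # prints 5
\end{lstlisting}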
Given that there exists a representation $\circuit^*$ such that $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can now focus on the complexity of the \abbrStepTwo step.
We represent the factorized lineage polynomial by its corresponding arithmetic circuit $\circuit$ (whose size we denote by $|\circuit|$).
As we show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$).
Thus, the question of approximation can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent circuits $\circuit$ representing $\query\inparen{\vct{W}}\inparen{\tup}$) but sufficient condition:
\begin{Problem}\label{prob:intro-stmt}
Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrBPDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
\end{Problem}
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then $\poly\left(\prob_1,\dots, \prob_n\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, consider $\query_1^2$ from above and let $\prob_A$ denote $\probOf\pbox{A = 1}$ (similarly for the other variables); then we can see that
\begin{footnotesize}
\begin{align*}
\hspace*{-3mm}
\poly_1^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_E^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_E + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_E\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_E + \prob_B\prob_Z\prob_C +
2\prob_A\prob_X\prob_B\prob_Y\prob_E + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_E\prob_Z\prob_C
= \rpoly_1^2\inparen{\vct{p}}
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
\end{footnotesize}
If we assume that all seven probability values are at least $p_0>0$,
%Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and
we get that $\poly_1^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly^2_1\inparen{\vct{\prob}}, \rpoly_1^2\inparen{\vct{\prob}}]$, which is \emph{not a tight approximation}.
%
%To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
In~\Cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
To get a $(1\pm \epsilon)$-multiplicative approximation and solve~\Cref{prob:intro-stmt}, we use $\circuit$ to uniformly sample monomials from the equivalent \abbrSMB representation of $\poly$ (without materializing the \abbrSMB representation) and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
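As a rough illustration of this sampling idea only (a sketch, not our actual algorithm, which samples directly from $\circuit$ without ever materializing the \abbrSMB form; the probability values below are hypothetical), the following estimates $\rpoly_1^2\inparen{\vct{p}}$ for the $1$-\abbrTIDB version of our running example by drawing monomials uniformly at random and rescaling their contributions:
\begin{lstlisting}
import random

def estimate_rpoly(monomials, prob, samples=100000):
    # Unbiased estimator of sum_m coeff_m * prod_{X in m} prob[X]:
    # draw monomials uniformly, average their contributions, rescale by count.
    total = 0.0
    for _ in range(samples):
        coeff, variables = random.choice(monomials)
        contribution = coeff
        for x in variables:              # reduced monomial: each variable once
            contribution *= prob[x]
        total += contribution
    return len(monomials) * total / samples

# Reduced lineage of Q_1^2 when every multiplicity is at most 1:
# AXB + BYE + BZC + 2AXBYE + 2AXBZC + 2BYEZC
monomials = [(1, "AXB"), (1, "BYE"), (1, "BZC"),
             (2, "AXBYE"), (2, "AXBZC"), (2, "BYEZC")]
prob = {v: 0.5 for v in "ABCEXYZ"}       # hypothetical marginal probabilities
print(estimate_rpoly(monomials, prob))
\end{lstlisting}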
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Applications}
Recent work in heuristic data cleaning~\cite{yang:2015:pvldb:lenses,DBLP:journals/vldb/SaRR0W0Z17,DBLP:journals/pvldb/RekatsinasCIR17,DBLP:journals/pvldb/BeskalesIG10} emits a \abbrPDB when insufficient data exists to select the `correct' data repair.
Probabilistic data cleaning is a crucial innovation, as the alternative is to arbitrarily select one repair and `hope' that queries receive meaningful results.
Although \abbrPDB queries instead convey the trustworthiness of results~\cite{kumari:2016:qdb:communicating}, they are impractically slow~\cite{feng:2019:sigmod:uncertainty,feng:2021:sigmod:efficient}, even in approximation (see \Cref{sec:karp-luby}).
Bag semantics, as we consider here, suffices for production use, where bag-relational algebra is already the default for performance reasons.
Our results show that bag-\abbrPDB\xplural can be competitive, laying the groundwork for probabilistic functionality in production database engines.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. %We present some (easy) generalizations of our results in \Cref{sec:gen}.
%and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem
%\AH{I don't think I understand what the sentence (about extensions) is saying.}
% (\Cref{def:the-expected-multipl}).
Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}. All proofs are in the appendix.
%No reviewer comments in arxiv submission.
%Our responses to ICDT first cycle reviewer comments are in \Cref{sec:rebuttal}. % the appendix.\AR{Would be good to have a specific app ref to rebuttal}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: