%root: main.tex
\setenumerate[1]{label = \Roman*.}
\setenumerate[2]{label=\Alph*.}
\setenumerate[3]{label=\roman*.}
\setenumerate[4]{label=\alph*.}
\section{Introduction (Outline and Rewrite)}
%for outline functionality
\begin{outline}[enumerate]
\1 Overall Problem
\2 Hardness of deterministic queries, e.g., counting cliques, multiple joins, etc.
\2 Assuming $\query$ is easy, how does the probabilistic computation compare in \abbrPDB\xplural?
\3 Introduce two-step process
\4 Deterministic Process: compute query, lineage, representation aka circuit
\4 Probability Computation
\3 Why the two-step process?
\4 Semiring provenance nicely follows this model
\4 Set \abbrPDB\xplural use this process
\4 Model allows us to separate the deterministic from the probability computation
\AH{The part below should maybe be moved further down. The order in the current draft is further down.}
\3 Assuming a bag-\abbrTIDB, when the probability of all tuples $\prob_i = 1$, the problem of computing the expected count is linear
\3 However, when $\prob_i < 1$, the problem is not linear (in circuit size)
\3 An approximation algorithm exists to bring the second step back down to linear time
\3 For set-\abbrPDB, the problem is \sharpphard with respect to exact computation
\3 For set-\abbrPDB, the problem is quadratic with respect to approximation
\2 Note that in all cases, Step 2 is at least as hard as Step 1
\3 Bag-\abbrPDB\xplural are useful for things like a count query
\3 This work focuses on Step 2 for bag-\abbrPDB\xplural
\4 Given a lineage polynomial generated by a query, compute the expected multiplicity
\3 Why focus on tuple expected multiplicity?
\4 Note that bag-\abbrPDB query output is a probability distribution over counts; this contrasts the marginal probability paradigm of set-\abbrPDB\xplural
\4 From a theoretical perspective, bag-\abbrPDB\xplural are not well studied
\4 There are several statistical measures that can be done
\4 We focus on the expected count since it is a natural and simple statistic to consider
\4 Appendix also considers higher moments
\2 The setting for our problem assumes set-\abbrPDB inputs
\3 A simple generalization exists
\end{outline}
\usetikzlibrary{shapes.geometric}%for cylinder
\usetikzlibrary{shapes.arrows}%for arrow shape
\usetikzlibrary{shapes.misc}
\begin{figure}[h!]
\centering
\resizebox{\textwidth}{4.5cm}{%
\begin{tikzpicture}
%pdb cylinder
\node[cylinder, text width=0.28\textwidth, align=center, draw=black, text=blue, cylinder uses custom fill, cylinder body fill=blue!10, aspect=0.12, minimum height=5cm, minimum width=2.5cm, cylinder end fill=blue!50, shape border rotate=90] (cylinder) at (0, 0) {
\tabcolsep=0.1cm
\begin{tabular}{>{\footnotesize}c | >{\footnotesize}c >{\footnotesize}c >{\footnotesize}c}
$OnTime$ & City$_\ell$ & $\Phi$ & \textbf{p}\\
\hline
& Buffalo & $L_a$ & 0.9 \\
& Chicago & $L_b$ & 0.5\\
& Bremen & $L_c$ & 0.5\\
& Zurich & $L_d$ & 1.0\\
\end{tabular}\\
\tabcolsep=0.05cm
\begin{tabular}{ >{\scriptsize}c | >{\scriptsize}c >{\scriptsize}c >{\scriptsize}c >{\scriptsize}c}
$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
\hline
& Buffalo & Chicago & $R_a$ & 1.0 \\
& Chicago & Zurich & $R_b$ & 1.0 \\
%& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\
& Chicago & Bremen & $R_c$ & 1.0 \\
\end{tabular}};
%label below cylinder
\node[below=0.2 cm of cylinder]{$\pdb$};
%First arrow
\node[single arrow, right=0.25 of cylinder, draw=black, fill=black!65, text=white, minimum height=0.75cm, minimum width=0.25cm](arrow1) {\textbf{Step 1}};
\node[above=of arrow1](arrow1Label) {$\query$};
\usetikzlibrary{arrows.meta}%for the following arrow configurations
\draw[line width=0.5mm, dashed, arrows = -{Latex[length=3mm, open]}] (arrow1Label)->(arrow1);
%Query output (output of step 1)
\node[rectangle, right=0.175 of arrow1, draw=black, text=purple, fill=purple!15, minimum height=4.5cm, minimum width=2cm](rect) {
\tabcolsep=0.075cm
\begin{tabular}{ >{\small}c | >{\small}c >{\small}c >{\centering\arraybackslash\small}m{1.95cm}}
$\query$ & City & $\Phi$ & Circuit\\% & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline
\hline \\\\[-3.5\medskipamount]
& Buffalo & $L_a R_a$ &\resizebox{!}{10mm}{
\begin{tikzpicture}[thick]
\node[gen_tree_node](sink) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[gen_tree_node](source1) at (0, 0){$L_a$};
\node[gen_tree_node](source2) at (1, 0){$R_a$};
\draw[->](source1)--(sink);
\draw[->] (source2)--(sink);
\end{tikzpicture}% & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$
}\\% & $0.9$ \\
& Chicago & $L_b R_b + L_b R_c$&\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[gen_tree_node] (a2) at (0, 0){$R_b$};
\node[gen_tree_node] (b2) at (1, 0){$L_b$};
\node[gen_tree_node] (c2) at (2, 0){$R_c$};
%level 1
\node[gen_tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[gen_tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
%level 0
\node[gen_tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
%edges
\draw[->] (a2) -- (a1);
\draw[->] (b2) -- (a1);
\draw[->] (b2) -- (b1);
\draw[->] (c2) -- (b1);
\draw[->] (a1) -- (a0);
\draw[->] (b1) -- (a0);
\end{tikzpicture}
}\\
\end{tabular}
};
%label below rectangle
\node[below=0.2cm of rect]{$\query(\pdb)$};
%Second arrow
\node[single arrow, right=0.25 of rect, draw=black, fill=black!65, text=white, minimum height=0.75cm, minimum width=0.25cm](arrow2) {\textbf{Step 2}};
%Expectation computation; (output of step 2)
\node[rectangle, right=0.25 of arrow2, rounded corners, draw=black, fill=red!15, text=red, minimum height=4.5cm, minimum width=2cm](rrect) {
\tabcolsep=0.09cm
\begin{tabular}{>{\footnotesize}c | >{\footnotesize}c >{\footnotesize}m{1.95cm}}
$\query$ & City & $\mathbb{E}[\poly(\vct{X})]$\\
\hline
& Buffalo & $1.0 \cdot 0.9 = 0.9$\\
& Chicago & $(0.5 \cdot 1.0) + $\newline $\hspace{0.2cm}(0.5 \cdot 1.0)$\newline $= 1.0$\\
\end{tabular}
};
%label of rounded rectangle
\node[below=0.2cm of rrect]{$\expct\pbox{\poly(\vct{X})}$};
\end{tikzpicture}
}
\caption{Two step model of computation}
\label{fig:two-step}
\end{figure}
A probabilistic database (\abbrPDB) $\pdb$ is a pair ($\idb, \pd$), where $\idb$ is the set of possible worlds $\db$ represented by $\pdb$ and $\pd$ is the probability distribution over the worlds in $\idb$. Given a query $\query$, the output $\query(\pdb)$ is the pair ($\idb', \pd'$) with $\idb' = \{\query(\db_i) \suchthat i \in [\numvar]\}$, where $\numvar = \abs{\idb}$ is the number of possible worlds in $\pdb$, and $\pd'$ is the resulting probability distribution over $\idb'$. Equivalently, one can view $\pdb$ (and $\query(\pdb)$) as the set of possible tuples appearing in $\idb$, each annotated with a lineage polynomial $\poly(\vct{X})$ and its expectation $\expct\pbox{\poly(\vct{X})}$. When considering query complexity, note that deterministic query processing has its \emph{own} non-linear hardness results: there exist
%The problem of computing an arbitrary $\query$ over an arbitrary $\pdb$ has been studied extensively in the literature, predominantly in the context of set-$\abbrPDB\xplural$. In the deterministic setting it is well known that
queries $\query$ whose runtime is superlinear in the size of the data, as in the case of counting cliques, or even exponential in the size of $\query$, as in the case of a multi-way join. This observation suggests a natural model of computation for query processing over \abbrPDB\xplural. %Assuming $\query$ is linear or better, how does query computation of a $\abbrPDB$ compare to deterministic query processing?
This model views query processing in \abbrPDB\xplural as a two-step process, depicted in \cref{fig:two-step}. The first step is the purely deterministic computation of the query output together with the lineage polynomial of each result tuple, encoded in a chosen representation.\footnote{Note that the runtime of the first step is the same in both the deterministic and \abbrPDB settings, since the cost of computing the lineage never exceeds the query processing time.} The second step computes the expectation of the lineage representation. Set-\abbrPDB semantics and semiring provenance both follow this model naturally, and in this work it allows us to separate the deterministic computation from the probability computation.
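To make the first step concrete, the following is a minimal sketch (an illustration only, not our implementation) of a join that annotates each output tuple with its lineage monomials, using the relations of \cref{fig:two-step}; the query shown is one plausible reading of the figure, and the relation encoding is an assumption made for the example.
\begin{verbatim}
# Minimal sketch of step one: a join that carries lineage annotations.
# Each input tuple is paired with its lineage variable; each output tuple
# collects one monomial per derivation (a sum-of-products lineage).
on_time = [("Buffalo", "L_a"), ("Chicago", "L_b"),
           ("Bremen", "L_c"), ("Zurich", "L_d")]
route = [(("Buffalo", "Chicago"), "R_a"),
         (("Chicago", "Zurich"), "R_b"),
         (("Chicago", "Bremen"), "R_c")]

def q(on_time, route):
    """Q(c1) :- OnTime(c1), Route(c1, c2); returns lineage per output city."""
    lineage = {}
    for city1, l_var in on_time:
        for (src, _dst), r_var in route:
            if src == city1:
                lineage.setdefault(city1, []).append((l_var, r_var))
    return lineage

print(q(on_time, route))
# {'Buffalo': [('L_a', 'R_a')], 'Chicago': [('L_b', 'R_b'), ('L_b', 'R_c')]}
\end{verbatim}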
A set-\abbrPDB $\pdb$ views each world $\db$ in $\idb$ as a set of tuples, and queries over set-\abbrPDB\xplural produce set-\abbrPDB output. Computing $\query$ \emph{exactly} over a set-\abbrPDB is known to be \sharpphard in the general case. The dichotomy of Dalvi and Suciu shows that for set-\abbrPDB\xplural, computing $\query(\pdb)$ is either polynomial time or \sharpphard. Further, this dichotomy is determined by the query structure and is in general independent of the representation of the lineage polynomial, meaning that the bottleneck always lies in the second step of the computation model.\footnote{We do note that there exist specific cases where a particular database instance, combined with an amenable representation, makes a hard $\query$ easy, but this is \emph{not} the general case.} In this setting, if one allows for approximation, the second step can be brought down to quadratic time.
Since set-\abbrPDB\xplural are essentially limited to computing the marginal probability of a tuple $\tup$, bag-\abbrPDB\xplural are a more natural fit for queries such as count queries. Traditionally, bag-\abbrPDB\xplural have been considered to be bottlenecked in step one only, with step two assumed to be linear. This may be partially due to the prevalence of sum-of-products (\abbrSOP) representations of the lineage polynomial among the most well-known set-\abbrPDB implementations. In the bag-\abbrPDB setting such a representation \emph{does} allow step two to run in time linear in the \emph{size} of the \abbrSOP representation, a consequence of linearity of expectation.
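To see why, the following is a minimal sketch (again illustrative, with an assumed monomial encoding) of step two over an \abbrSOP lineage polynomial of a bag-\abbrTIDB: by linearity of expectation, a single pass over the monomials suffices.
\begin{verbatim}
# Minimal sketch of step two over an SOP lineage polynomial of a bag-TIDB.
# Each monomial is (coefficient, list of variable names); the variables are
# independent Bernoulli random variables with the given marginal probabilities.
# Linearity of expectation gives E[sum of monomials] = sum of E[monomial], and
# independence plus E[X^k] = E[X] for 0/1-valued X makes each term a product.

def expected_count_sop(monomials, prob):
    total = 0.0
    for coeff, variables in monomials:
        term = coeff
        for v in set(variables):   # E[X^k] = E[X] for Bernoulli X
            term *= prob[v]
        total += term
    return total

# Lineage of the Chicago tuple from the figure: L_b*R_b + L_b*R_c.
prob = {"L_b": 0.5, "R_b": 1.0, "R_c": 1.0}
sop = [(1, ["L_b", "R_b"]), (1, ["L_b", "R_c"])]
print(expected_count_sop(sop, prob))   # 0.5*1.0 + 0.5*1.0 = 1.0
\end{verbatim}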
However, it is not necessarily satisfying to stop here. Since typical implementations of \abbrPDB\xplural construct the lineage representation in lockstep with the particular choice of query plan, a fair comparison between step one and step two in bag-\abbrPDB queries must permit query optimizations. Optimizations like projection push-down produce factorized, non-\abbrSOP representations of the lineage polynomial. Our work explores whether step two in the computation model is \emph{always} linear in the \emph{size} of the representation of the lineage polynomial when step one of $\query(\pdb)$ is easy. %This work focuses on step two of the computation model specifically in regards to bag-\abbrPDB queries.
Given a bag-\abbrTIDB\footnote{A \abbrTIDB is a \abbrPDB in which each tuple is an independent random event.} $\pdb$, when every tuple probability $\prob_i = 1$, computing the expected count is linear and coincides with deterministic query processing. However, for the class of \abbrTIDB\xplural with $\prob_i < 1$, computing the expected count (step two of the computation model) is in general no longer linear in the size of the lineage polynomial representation. This work focuses on analyzing step two of the query processing problem over bag-\abbrPDB queries, specifically:
\begin{Problem}
``Given a lineage polynomial circuit generated by a query $\query$ over a bag-\abbrPDB, what is the complexity of computing the expected multiplicity of an output tuple $\tup$?''
\end{Problem}
As alluded to earlier, the general case of computing expectations over bag-\abbrPDB lineage representations is \emph{not} linear. Our work further introduces a $(1\pm\epsilon)$-approximation algorithm for the expected count of $\tup$ in a bag-\abbrPDB query $\query$ that runs in linear time.
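To build intuition for why $\prob_i < 1$ makes step two nontrivial over compressed representations, note that a factorized circuit may multiply subexpressions that share variables, so expectations cannot simply be pushed through its gates; the following tiny, self-contained illustration (not drawn from the paper) shows the issue for a single shared Bernoulli variable.
\begin{verbatim}
# Tiny illustration: for a Bernoulli X with P[X=1]=0.5, pushing expectation
# through a multiplication gate whose inputs share the variable X is wrong:
# E[X*X] = E[X] = 0.5, but E[X]*E[X] = 0.25.
p = 0.5
worlds = [(p, 1), (1 - p, 0)]                    # (probability, value of X)
exact = sum(w * (x * x) for w, x in worlds)      # E[X*X]
naive = p * p                                    # E[X] * E[X]
print(exact, naive)                              # 0.5 vs 0.25
\end{verbatim}
When all $\prob_i = 1$ the two computations coincide, which is consistent with step two being easy in that case.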
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\tup$, in stark contrast to the marginal probability paradigm of set-\abbrPDB\xplural. Further, from a theoretical perspective, bag-\abbrPDB\xplural have not been studied much. The expected count of $\tup$ is therefore a natural (and simple) statistic to consider in further developing the theoretical foundations of bag-\abbrPDB\xplural. Other statistical measures can be computed as well; most are beyond the scope of this paper, though we do additionally consider higher moments in the appendix.
%A tuple independent database (\abbrTIDB) is a \abbrPDB whose tuples are treated as independent random events. Given a set-\abbrTIDB $\pdb$, $\query(\pdb)$ is essentially limited to computing the \emph{marginal} probability for a member tuple $\tup$. When it is desirable to compute either a probability distribution over the set of possible multiplicities of $\tup$ or to compute certain statistical measures over the multiplicity of $\tup$, bag-\abbrPDB\xplural are a natural fit, proving very useful for posing questions such as count queries to the database. While other statistical measures can be computed, we focus primarily on computing the expected multiplicity of $\tup$, a natural interpretation of step two in the bag setting.\footnote{We consider this natural since it is true that computing the marginal probability of $\tup$ in set-\abbrPDB\xplural is essentially computing $\tup$'s expectation.} It is also compelling to consider the expected multiplicity since bag-\abbrPDB\xplural are not well studied from a theoretical perspective, and the expected count is both natural and simplistic to consider as a first building block. We consider higher moments in the appendix.\AH{Pointer here.}
%\footnote{It is known that, in general, there exist queries that are \emph{not} linear in the size of the data. Such queries as multiple joins and counting cliques are specific examples of this. We are considering cases where the query is linear in the size of the data.} Indeed, if for all $i \in [\numvar]$, $\prob_i = 1$, computation is essentially a deterministic query and dominated by the first step. This changes, however, for $\prob_i < 1$, a case for which our work shows that in general the problem is not linear in the size of the representation.
Our work focuses on the following setting for query computation: inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting is not limiting, however, as a simple generalization to bag-\abbrPDB inputs exists, which assigns a unique identifier to each input tuple.
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
Concretely, we make the following contributions:
(i) We show that the expected result multiplicity problem (\Cref{def:the-expected-multipl}) for conjunctive queries over bag-$\ti$s is \sharpwonehard in the size of a lineage circuit, by reduction from counting the number of $k$-matchings in an arbitrary graph;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g., when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding (in contrast, known approximation techniques for set-\abbrPDB\xplural are quadratic); (iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data; (iv) We further prove that for \raPlus queries (an equivalently expressive, but factorizable form of UCQs), we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial in what follows.%continuing \Cref{ex:intro-tbls}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\begin{subfigure}[b]{0.49\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c}
$OnTime$ & City$_\ell$ & $\Phi$ & \textbf{p}\\
\hline
& Buffalo & $L_a$ & 0.9 \\
& Chicago & $L_b$ & 0.5\\
& Bremen & $L_c$ & 0.5\\
& Zurich & $L_d$ & 1.0\\
\end{tabular}
}
\caption{Relation $OnTime$}
\label{subfig:ex-shipping-simp-loc}
\end{subfigure}%
\begin{subfigure}[b]{0.49\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c c}
$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
\hline
& Buffalo & Chicago & $R_a$ & 1.0 \\
& Chicago & Zurich & $R_b$ & 1.0 \\
%& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\
& Chicago & Bremen & $R_c$ & 1.0 \\
\end{tabular}
}
\caption{Relation $Route$}
\label{subfig:ex-shipping-simp-route}
\end{subfigure}%
% \begin{subfigure}[b]{0.17\linewidth}
% \centering
% \caption{Circuit for $(Chicago)$}
% \label{subfig:ex-proj-push-circ-q3}
% \end{subfigure}
\begin{subfigure}[b]{0.66\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c}
$\query_1$ & City & $\Phi$ & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline
& Buffalo & $L_a \cdot R_a$ & $0.9$ \\
& Chicago & $L_b \cdot R_b + L_b \cdot R_c$ & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$ \\
%& $\cdots$ & $\cdots$ & $\cdots$ \\
\end{tabular}
}
\caption{$Q_1$'s Result}
\label{subfig:ex-shipping-simp-queries}
\end{subfigure}%
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[tree_node] (a2) at (0, 0){$R_b$};
\node[tree_node] (b2) at (1, 0){$L_b$};
\node[tree_node] (c2) at (2, 0){$R_c$};
%level 1
\node[tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
%level 0
\node[tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
%edges
\draw[->] (a2) -- (a1);
\draw[->] (b2) -- (a1);
\draw[->] (b2) -- (b1);
\draw[->] (c2) -- (b1);
\draw[->] (a1) -- (a0);
\draw[->] (b1) -- (a0);
\end{tikzpicture}
}
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (1, 0){$R_b$};
\node[tree_node] (b1) at (2, 0){$R_c$};
%level 1
\node[tree_node] (a2) at (0.75, 0.8){$L_b$};
\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};
%level 0
\node[tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};
%edges
\draw[->] (a1) -- (b2);
\draw[->] (b1) -- (b2);
\draw[->] (a2) -- (a3);
\draw[->] (b2) -- (a3);
\end{tikzpicture}
}
\caption{Two circuits for $Q_1(Chicago)$}
\label{subfig:ex-proj-push-circ-q4}
\end{subfigure}%
\vspace*{-3mm}
\caption{\ti instance and query results for \cref{ex:overview}}%\Cref{ex:intro-tbls}.}%{$\ti$ relations for $\poly$}
\label{fig:ex-shipping-simp}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$ over the bag relations of \Cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$ (since every $Route$ tuple has probability $1.0$, the $R$ variables are treated as certain and omitted). Now consider the product query $\query^2()\dlImp Q(), Q()$.
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
\begin{equation*}
\left(L_aL_b + L_bL_d + L_bL_c\right)^2=L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
\end{equation*}
The expectation $\expct\pbox{\Phi^2}$ then is:
\begin{multline*}
\expct\pbox{L_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
\end{multline*}
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
\begin{footnotesize}
\begin{equation*}
\expct\pbox{L_a}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{L_d}\expct\pbox{L_c}
\end{equation*}
\end{footnotesize}
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the \abbrSOP form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\Phi^2$ as an example, we have:
\begin{align*}
\widetilde{\Phi^2}(L_a, L_b, L_c, L_d)
=&\; L_aL_b + L_bL_d + L_bL_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d
\end{align*}
It can be verified that the reduced polynomial evaluated at each variable's marginal probability is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1},$ $\probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} $\raPlus$ queries over TIDB/BIDB.
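As a sanity check for this example, the following small sketch (illustrative only, not part of the formal development) builds $\Phi$, squares it, applies the reduction of \Cref{def:reduced-poly}, and compares the result against a brute-force expectation over all possible worlds, using the probabilities of \Cref{fig:ex-shipping-simp}.
\begin{verbatim}
# Check E[Phi^2] = reduced(Phi^2)(p) for the running example (requires sympy).
from itertools import product
import sympy as sp

La, Lb, Lc, Ld = sp.symbols("La Lb Lc Ld")
p = {La: sp.Rational(9, 10), Lb: sp.Rational(1, 2),
     Lc: sp.Rational(1, 2), Ld: 1}

phi = La*Lb + Lb*Ld + Lb*Lc
phi2 = sp.expand(phi**2)

# Reduced polynomial: set every exponent e > 1 in the SOP form to 1.
reduced = sum(coeff * sp.prod(list(monom.free_symbols))
              for monom, coeff in phi2.as_coefficients_dict().items())

# Brute-force expectation over all 2^4 possible worlds.
exact = 0
for bits in product([0, 1], repeat=4):
    world = dict(zip([La, Lb, Lc, Ld], bits))
    weight = sp.prod([p[v] if world[v] else 1 - p[v] for v in world])
    exact += weight * phi2.subs(world)

print(reduced.subs(p), exact)   # both print 61/20, i.e., 3.05
\end{verbatim}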
To prove our hardness result we show that, for the same $Q$ considered in the running example, the query $Q^k$ is able to encode various hard graph-counting problems. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). \AH{What is meant by the following sentence?}For the upper bound it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. \AH{Why do we say `approximation'? This is a linear \emph{exact} computation.} To get a $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
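The following is a greatly simplified sketch of that sampling idea (it is \emph{not} the algorithm of \Cref{sec:algo}, which samples directly from the lineage circuit without expanding it): assuming the monomials of the expanded polynomial are given explicitly with positive coefficients, sampling a monomial with probability proportional to its coefficient and evaluating its distinct variables over the marginals yields an unbiased estimator of $\widetilde{\Phi}\left(\cdot\right)$.
\begin{verbatim}
# Simplified sketch: sampling-based unbiased estimator of the reduced
# polynomial, here applied to the expanded monomials of Phi^2 (the real
# algorithm avoids the expansion and works over the circuit itself).
import random

p = {"La": 0.9, "Lb": 0.5, "Lc": 0.5, "Ld": 1.0}
monomials = [(1, ("La", "Lb")), (1, ("Lb", "Ld")), (1, ("Lb", "Lc")),
             (2, ("La", "Lb", "Ld")), (2, ("La", "Lb", "Lc")),
             (2, ("Lb", "Lc", "Ld"))]            # (coefficient, distinct vars)

def estimate(monomials, prob, samples=100000, rng=random.Random(0)):
    coeffs = [c for c, _ in monomials]
    total_coeff = sum(coeffs)
    acc = 0.0
    for _ in range(samples):
        # Sample a monomial with probability coefficient / total_coeff ...
        _, variables = rng.choices(monomials, weights=coeffs, k=1)[0]
        # ... and 'adjust' its contribution; the sample value below has
        # expectation sum_m c_m * prod_{i in m} p_i, i.e., reduced Phi^2 at p.
        contrib = 1.0
        for v in variables:
            contrib *= prob[v]
        acc += total_coeff * contrib
    return acc / samples

print(estimate(monomials, p))   # close to the exact value 3.05
\end{verbatim}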
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.