%root: main.tex
%\setenumerate[1]{label = \Roman*.}
% \setenumerate[2]{label=\Alph*.}
% \setenumerate[3]{label=\roman*.}
% \setenumerate[4]{label=\alph*.}
\section{Introduction (Rewrite)}
% %for outline functionality
% \begin{outline}[enumerate]
% \1 Overall Problem
% \2 Hardness of deterministic queries, e.g., counting cliques, multiple joins, etc.
% \2 Assuming $\query$ is easy, how does the probabilistic computation compare in \abbrPDB\xplural?
% \3 Introduce two-step process
% \4 Deterministic Process: compute query, lineage, representation aka circuit
% \4 Probability Computation
% \3 Why the two-step process?
% \4 Semiring provenance nicely follows this model
% \4 Set \abbrPDB\xplural use this process
% \4 Model allows us to separate the deterministic from the probability computation
% \AH{The part below should maybe be moved further down. The order in the current draft is further down.}
% \3 Assuming a bag-\abbrTIDB, when the probability of all tuples $\prob_i = 1$, the problem of computing the expected count is linear
% \3 However, when $\prob_i < 1$, the problem is not linear (in circuit size)
% \3 An approximation algorithm exists to bring the second step back down to linear time
% \3 For set-\abbrPDB, the problem is \sharpphard with respect to exact computation
% \3 For set-\abbrPDB, the problem is quadratic with respect to approximation
% \2 Note that in all cases, Step 2 is at least as hard as Step 1
% \3 Bag-\abbrPDB\xplural are useful for things like a count query
% \3 This work focuses on Step 2 for bag-\abbrPDB\xplural
% \4 Given a lineage polynomial generated by a query, compute the expected multiplicity
% \3 Why focus on tuple expected multiplicity?
% \4 Note that bag-\abbrPDB query output is a probability distribution over counts; this contrasts the marginal probability paradigm of set-\abbrPDB\xplural
% \4 From a theoretical perspective, bag-\abbrPDB\xplural are not well studied
% \4 There are several statistical measures that can be done
% \4 We focus on expected count since it is a natural and simplistic statistic to consider
% \4 Appendix also considers higher moments
% \2 The setting for our problem assumes set-\abbrPDB inputs
% \3 A simple generalization exists
%
% \end{outline}
%\begin{figure}[H]
% \centering
% \includegraphics[width=\textwidth]{twostep}
% \caption{Old inkscape graphic}
% \label{fig:old-inkscape}
%\end{figure}
\usetikzlibrary{shapes.geometric}%for cylinder
\usetikzlibrary{shapes.arrows}%for arrow shape
\usetikzlibrary{shapes.misc}
%rid of vertical spacing for booktabs rules
\renewcommand{\aboverulesep}{0pt}
\renewcommand{\belowrulesep}{0pt}
\begin{figure}[h!]
\centering
\resizebox{\textwidth}{5.5cm}{%
\begin{tikzpicture}
%pdb cylinder
\node[cylinder, text width=0.28\textwidth, align=center, draw=black, text=black, cylinder uses custom fill, cylinder body fill=blue!10, aspect=0.12, minimum height=5cm, minimum width=2.5cm, cylinder end fill=blue!50, shape border rotate=90] (cylinder) at (0, 0) {
\tabcolsep=0.1cm
\begin{tabular}{>{\small}c | >{\small}c | >{\small}c}
\multicolumn{3}{c}{$\boldsymbol{OnTime}$}\\
%\toprule
City$_\ell$ & $\Phi$ & \textbf{p}\\
\midrule
Buffalo & $L_a$ & 0.9 \\
Chicago & $L_b$ & 0.5\\
Bremen & $L_c$ & 0.5\\
Zurich & $L_d$ & 1.0\\
\end{tabular}\\
\tabcolsep=0.05cm
%\captionof{table}{Route}
\begin{tabular}{>{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c | >{\footnotesize}c}
\multicolumn{4}{c}{$\boldsymbol{Route}$}\\
%\toprule
$\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
\midrule
Buffalo & Chicago & $R_a$ & 1.0 \\
Chicago & Zurich & $R_b$ & 1.0 \\
%& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\
Chicago & Bremen & $R_c$ & 1.0 \\
\end{tabular}};
%label below cylinder
\node[below=0.2 cm of cylinder]{{\LARGE$ \pdb$}};
%First arrow
\node[single arrow, right=0.25 of cylinder, draw=black, fill=black!65, text=white, minimum height=0.75cm, minimum width=0.25cm](arrow1) {\textbf{Step 1}};
\node[above=of arrow1](arrow1Label) {$\query$};
\usetikzlibrary{arrows.meta}%for the following arrow configurations
\draw[line width=0.5mm, dashed, arrows = -{Latex[length=3mm, open]}] (arrow1Label)->(arrow1);
%Query output (output of step 1)
\node[rectangle, right=0.175 of arrow1, draw=black, text=black, fill=purple!10, minimum height=4.5cm, minimum width=2cm](rect) {
\tabcolsep=0.075cm
%\captionof{table}{Q}
\begin{tabular}{>{\normalsize}c | >{\normalsize}c | >{\centering\arraybackslash\small}m{1.95cm}}
%\multicolumn{3}{c}{$\boldsymbol{\query(\pdb)}$}\\[1mm]
%\toprule
City & $\Phi$ & Circuit\\% & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline
\midrule
%\hline
\\\\[-3.5\medskipamount]
Buffalo & $L_a R_a$ &\resizebox{!}{10mm}{
\begin{tikzpicture}[thick]
\node[gen_tree_node](sink) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[gen_tree_node](source1) at (0, 0){$L_a$};
\node[gen_tree_node](source2) at (1, 0){$R_a$};
\draw[->](source1)--(sink);
\draw[->] (source2)--(sink);
\end{tikzpicture}% & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$
}\\[5mm]% & $0.9$ \\
Chicago & $L_b(R_b + R_c)$&
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[gen_tree_node] (a1) at (1, 0){$R_b$};
\node[gen_tree_node] (b1) at (2, 0){$R_c$};
%level 1
\node[gen_tree_node] (a2) at (0.75, 0.8){$L_b$};
\node[gen_tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};
%level 0
\node[gen_tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};
%edges
\draw[->] (a1) -- (b2);
\draw[->] (b1) -- (b2);
\draw[->] (a2) -- (a3);
\draw[->] (b2) -- (a3);
\end{tikzpicture}
}\newline\text{Or}\newline
%%%%%%%%%%%
%Non factorized circuit%
%%%%%%%%%%%
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[gen_tree_node] (a2) at (0, 0){$R_b$};
\node[gen_tree_node] (b2) at (1, 0){$L_b$};
\node[gen_tree_node] (c2) at (2, 0){$R_c$};
%level 1
\node[gen_tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[gen_tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
%level 0
\node[gen_tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
%edges
\draw[->] (a2) -- (a1);
\draw[->] (b2) -- (a1);
\draw[->] (b2) -- (b1);
\draw[->] (c2) -- (b1);
\draw[->] (a1) -- (a0);
\draw[->] (b1) -- (a0);
\end{tikzpicture}
}\\
\end{tabular}
};
%label below rectangle
\node[below=0.2cm of rect]{{\LARGE $\query(\pdb)$}};
%Second arrow
\node[single arrow, right=0.25 of rect, draw=black, fill=black!65, text=white, minimum height=0.75cm, minimum width=0.25cm](arrow2) {\textbf{Step 2}};
%Expectation computation; (output of step 2)
\node[rectangle, right=0.25 of arrow2, rounded corners, draw=black, fill=red!10, text=black, minimum height=4.5cm, minimum width=2cm](rrect) {
\tabcolsep=0.09cm
%\captionof{table}{Q}
\begin{tabular}{>{\small}c | >{\centering\arraybackslash\small}m{1.95cm}}
%\multicolumn{2}{c}{$\expct\pbox{\poly(\vct{X})}$}\\[1mm]
%\toprule
City & $\mathbb{E}[\poly(\vct{X})]$\\
\midrule%[0.05pt]
Buffalo & $1.0 \cdot 0.9 = 0.9$\\[3mm]
Chicago & $(0.5 \cdot 1.0) + $\newline $\hspace{0.2cm}(0.5 \cdot 1.0)$\newline $= 1.0$\\
\end{tabular}
};
%label of rounded rectangle
\node[below=0.2cm of rrect]{{\LARGE $\expct\pbox{\poly(\vct{X})}$}};
\end{tikzpicture}
}
\caption{Two step model of computation}
\label{fig:two-step}
\end{figure}
A bag probabilistic database (\abbrPDB) is a probability distribution over databases, each a bag of the $\numvar$ (not necessarily \emph{unique}) tuples of a deterministic database $\db$. A tuple-independent bag probabilistic database (\abbrTIDB) $\pdb$ has the further restriction that each tuple $\tup$ in $\db$ is an independent random event, with every base-relation tuple annotated by a unique random variable. Given a query $\query$ from the class of positive relational algebra queries ($\raPlus$) over $\pdb$, the goal is to compute the expected count of each \emph{distinct} output tuple $\tup$ in $\query(\pdb)$, where each $\tup$ of $\query(\pdb)$ is annotated with its lineage polynomial $\poly({\vct{X}})$ and $\vct{X}$ is the vector of the $\numvar$ unique variables of $\pdb$. Bag \abbrPDB\xplural are a natural fit for queries that reason about multiplicities, such as count queries.
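For instance, in \cref{fig:two-step} the Buffalo output tuple has lineage polynomial $\poly(\vct{X}) = L_a R_a$; since the variables of a \abbrTIDB are independent, its expected count is
\begin{equation*}
\expct\pbox{L_a R_a} = \expct\pbox{L_a}\cdot\expct\pbox{R_a} = 0.9 \cdot 1.0 = 0.9.
\end{equation*}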
In general, evaluating $\query$ over a deterministic database is \sharpwonehard, i.e., its runtime is superlinear in the $\numvar$-sized input and parameterized by some $k$, as is the case for counting $k$-cliques or computing multiple joins (also known as $k$-joins). This hardness result is unsatisfying in that it does not account for the cost of computing the expected count $\expct\pbox{\poly(\vct{X})}$. A natural question is whether we can quantify the hardness of the probability computation beyond the complexity of the deterministic (pre)processing. The model illustrated in \cref{fig:two-step} is one way to do this. %Assuming $\query$ is linear or better, how does query computation of a $\abbrPDB$ compare to deterministic query processing?
This model views \abbrPDB query processing as two steps. As depicted, the first step is essentially deterministic: it computes both the query output and the lineage polynomial $\poly(\vct{X})$ of each result tuple.%result tuple lineage polynomial(s) encoded in the respective representation.
\footnote{Note that, assuming standard query algorithms over $\raPlus$ operators, computing a lineage polynomial is of the same complexity as computing the query output.}
% the runtime of the first step is the same in both the deterministic and \abbrPDB settings, since the computation of the linage is never greater than the query processing time.}
The second step computes the expectation of $\poly({\vct{X}})$, i.e., $\expct\pbox{\poly(\vct{X})}$. This model of computation is followed naturally by set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}: in intensional evaluation the marginal probability is computed as a separate step, and in extensional evaluation it is computed as a separate step of each operator, so in both cases the two concerns can be separated. Semiring provenance \cite{DBLP:conf/pods/GreenKT07} follows the same pattern: the $\semNX$-DB first computes the annotation via the query, and the resulting polynomial is then evaluated under a specific valuation. The model is also useful in this work precisely because it separates the deterministic computation from the probability computation.
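As a concrete illustration of the semiring view, joins multiply annotations while projections and unions add them; for the Chicago tuple of \cref{fig:two-step}, step one thus computes the annotation
\begin{equation*}
L_b \circmult \inparen{R_b \circplus R_c} \;=\; L_b R_b + L_b R_c,
\end{equation*}
and step two evaluates the expectation of this polynomial under the \abbrTIDB's tuple probabilities.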
This work explores whether step two is \emph{always} of equal or lesser complexity than step one in the setting of bag \abbrPDB\xplural, and establishes theoretical foundations supporting the answer to this question. For step one, as alluded to above, deterministic query processing is in general polynomial in $\numvar$ with degree governed by $k$. Our question is then: ``Is computing the expected count of a result tuple $\tup$ always linear in the complexity of step one, or are there classes of queries for which step two is \emph{superlinear} in the query complexity?''
Most prior work on \abbrPDB\xplural has been in the setting of set \abbrPDB\xplural, where lineage is represented as a propositional formula rather than a polynomial. Each output tuple of a query $\query$ appears once, with marginal probability $\expct\pbox{\poly(\vct{X})}$\footnote{We abuse notation and denote the propositional formula as $\poly(\vct{X})$}. Computing $\query$ \emph{exactly} over a set-\abbrPDB is known to be \sharpphard in the general case. The dichotomy of Dalvi and Suciu \cite{10.1145/1265530.1265571} shows that for set-\abbrPDB\xplural, $\query(\pdb)$ is either polynomial or \sharpphard in $\numvar$ for any polytime step one. Since this hardness is non-parameterized, the two-step model is not needed there. It is noteworthy that the dichotomy is \emph{based} on the query structure and is in general independent of the representation of the lineage polynomial.\footnote{We do note that there exist specific cases where a particular database instance combined with an amenable representation makes a hard $\query$ easy, but this is \emph{not} the general case.} If we are satisfied with approximation, the problem can be brought down to at most quadratic time.\AH{Citation necessary.}
%Since set-\abbrPDB\xplural are essentially limited to computing the marginal probability of $\tup$, bag-\abbrPDB\xplural are a more natural fit for computing queries such as count queries.
Traditionally, bag-\abbrPDB\xplural have long been considered to be bottlenecked by step one only, with step two assumed to be linear. This may be due in part to the prevalence of sum-of-products (\abbrSOP) representations of the lineage polynomial among the best-known set-\abbrPDB implementations. In the bag-\abbrPDB setting such a representation \emph{does} allow step two to run in time linear in the \emph{size} of the \abbrSOP representation, a consequence of linearity of expectation.
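Indeed, with an \abbrSOP lineage, linearity of expectation lets step two proceed one monomial at a time. For the Chicago tuple of \cref{fig:two-step}, whose monomials contain only distinct (independent) variables, this gives
\begin{equation*}
\expct\pbox{L_b R_b + L_b R_c} = \expct\pbox{L_b}\expct\pbox{R_b} + \expct\pbox{L_b}\expct\pbox{R_c} = 0.5\cdot 1.0 + 0.5\cdot 1.0 = 1.0,
\end{equation*}
matching the expected count in the figure.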
The main insight of this paper is that we should not stop there. Compact representations of $\poly(\vct{X})$ arise naturally; for example, optimizations like projection push-down produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system for $\poly(\vct{X})$, a natural fit for $\raPlus$ queries since each operator maps to either a $\circplus$ or a $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}). Our work explores whether step two of the computation model is \emph{always} linear in the \emph{size} of this representation of the lineage polynomial when step one of $\query(\pdb)$ is easy. %This works focuses on step two of the computation model specifically in regards to bag-\abbrPDB queries.
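The two circuits shown for the Chicago tuple in \cref{fig:two-step} encode the same polynomial, but a factorized circuit can in general be far smaller than its \abbrSOP expansion; for instance, the polynomial
\begin{equation*}
\prod_{i=1}^{m}\inparen{X_i + Y_i}
\end{equation*}
admits a circuit with $O(m)$ gates, whereas its \abbrSOP form contains $2^m$ monomials.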
Consider again the bag-\abbrTIDB $\pdb$. When the probability of every tuple is $\prob_i = 1$, computing the expected count is linear in the size of the arithmetic circuit, and we have polytime complexity for computing $\query(\pdb)$. This leads us to our problem statement:
\begin{Problem}\label{prob:intro-stmt}
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, what is the complexity (in the size of the circuit representation) of computing step two ($\expct\pbox{\poly(\vct{X})}$) for each tuple $\tup$ in the output of $\query(\pdb)$?
\end{Problem}
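As a point of reference for \cref{prob:intro-stmt}, when every $\prob_i = 1$ each $X_i$ equals $1$ with certainty, so
\begin{equation*}
\expct\pbox{\poly(\vct{X})} = \poly(1,\dots,1),
\end{equation*}
which a single bottom-up evaluation of the circuit computes in time linear in its size.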
We show that for the class of \abbrTIDB\xplural with $\prob_i < 1$, computing step two is in general no longer linear in the size of the lineage polynomial representation.
Our work further introduces an approximation algorithm for the expected count of $\tup$ in the output of a bag-\abbrPDB query $\query$, which runs in linear time.
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\tup$, in stark contrast to the marginal probability ($\expct\pbox{\poly\inparen{\vct{X}}}$) paradigm of set-\abbrPDB\xplural. From a theoretical perspective, bag-\abbrPDB\xplural have received comparatively little attention. The expected count ($\expct\pbox{\poly\inparen{\vct{X}}}$) of $\tup$ is therefore a natural (and simple) statistic to consider in further developing the theoretical foundations of bag-\abbrPDB\xplural. Other statistical measures are beyond the scope of this paper, though we additionally consider higher moments in the appendix.
Our work focuses on the following setting for query computation: the inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting is not limiting, as a simple generalization exists that reduces a bag \abbrPDB to a set \abbrPDB with typically only an $O(1)$ increase in size.
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
Concretely, we make the following contributions:
(i) We show that \cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is \sharpwonehard in the size of the lineage circuit, by a reduction from counting $k$-matchings in an arbitrary graph; we further show superlinear hardness for a specific cubic graph query in the special case where all $\prob_i = \prob$ for some $\prob \in (0, 1)$;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries; we further show that for typical database usage patterns (e.g., when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding (in contrast, known approximation techniques for set-\abbrPDB\xplural are at most quadratic); (iii) We generalize the approximation algorithm to a class of bag-\abbrBIDB\xplural, a more general model of probabilistic data; (iv) We further prove that for $\raPlus$ queries
%(an equivalently expressive, but factorizable form of UCQs),
\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. We motivate this reduced polynomial in what follows.%continuing \Cref{ex:intro-tbls}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Old Figure from 1st ICDT submission
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{figure}[t]
% \begin{subfigure}[b]{0.49\linewidth}
% \centering
%{\small
% \begin{tabular}{ c | c c c}
% $OnTime$ & City$_\ell$ & $\Phi$ & \textbf{p}\\
% \hline
% & Buffalo & $L_a$ & 0.9 \\
% & Chicago & $L_b$ & 0.5\\
% & Bremen & $L_c$ & 0.5\\
% & Zurich & $L_d$ & 1.0\\
% \end{tabular}
% }
% \caption{Relation $OnTime$}
% \label{subfig:ex-shipping-simp-loc}
% \end{subfigure}%
% \begin{subfigure}[b]{0.49\linewidth}
% \centering
%{\small
% \begin{tabular}{ c | c c c c}
% $Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
% \hline
% & Buffalo & Chicago & $R_a$ & 1.0 \\
% & Chicago & Zurich & $R_b$ & 1.0 \\
% %& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\
% & Chicago & Bremen & $R_c$ & 1.0 \\
% \end{tabular}
% }
% \caption{Relation $Route$}
% \label{subfig:ex-shipping-simp-route}
% \end{subfigure}%
% % \begin{subfigure}[b]{0.17\linewidth}
% % \centering
%
% % \caption{Circuit for $(Chicago)$}
% % \label{subfig:ex-proj-push-circ-q3}
% % \end{subfigure}
%
% \begin{subfigure}[b]{0.66\linewidth}
% \centering
%{\small
% \begin{tabular}{ c | c c c}
% $\query_1$ & City & $\Phi$ & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline
% & Buffalo & $L_a \cdot R_a$ & $0.9$ \\
% & Chicago & $L_b \cdot R_b + L_b \cdot R_c$ & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$ \\
% %& $\cdots$ & $\cdots$ & $\cdots$ \\
% \end{tabular}
% }
% \caption{$Q_1$'s Result}
% \label{subfig:ex-shipping-simp-queries}
% \end{subfigure}%
% \begin{subfigure}[b]{0.33\linewidth}
% \centering
% \resizebox{!}{16mm} {
% \begin{tikzpicture}[thick]
% \node[tree_node] (a2) at (0, 0){$R_b$};
% \node[tree_node] (b2) at (1, 0){$L_b$};
% \node[tree_node] (c2) at (2, 0){$R_c$};
% %level 1
% \node[tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
% \node[tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
% %level 0
% \node[tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
% %edges
% \draw[->] (a2) -- (a1);
% \draw[->] (b2) -- (a1);
% \draw[->] (b2) -- (b1);
% \draw[->] (c2) -- (b1);
% \draw[->] (a1) -- (a0);
% \draw[->] (b1) -- (a0);
% \end{tikzpicture}
% }
% \resizebox{!}{16mm} {
% \begin{tikzpicture}[thick]
% \node[tree_node] (a1) at (1, 0){$R_b$};
% \node[tree_node] (b1) at (2, 0){$R_c$};
% %level 1
% \node[tree_node] (a2) at (0.75, 0.8){$L_b$};
% \node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};
% %level 0
% \node[tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};
% %edges
% \draw[->] (a1) -- (b2);
% \draw[->] (b1) -- (b2);
% \draw[->] (a2) -- (a3);
% \draw[->] (b2) -- (a3);
% \end{tikzpicture}
% }
% \caption{Two circuits for $Q_1(Chicago)$}
% \label{subfig:ex-proj-push-circ-q4}
% \end{subfigure}%
% \vspace*{-3mm}
% \caption{\ti instance and query results for \cref{ex:overview}}%\Cref{ex:intro-tbls}.}%{$\ti$ relations for $\poly$}
% \label{fig:ex-shipping-simp}
% \trimfigurespacing
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime))$
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
over the bag relations of \cref{fig:two-step}. It can be verified that $\Phi$ for $Q$ is $L_aR_aL_b + L_bR_bL_d + L_bR_cL_c$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
\begin{multline*}
\left(L_aR_aL_b + L_bR_bL_d + L_bR_cL_c\right)^2\\
=L_a^2R_a^2L_b^2 + L_b^2R_b^2L_d^2 + L_b^2R_c^2L_c^2 + 2L_aR_aL_b^2R_bL_d + 2L_aR_aL_b^2R_cL_c + 2L_b^2R_bL_dR_cL_c.
\end{multline*}
The expectation $\expct\pbox{\Phi^2}$ then is:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a^2}\expct\pbox{R_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{R_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{R_c^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then $W^k = W$ for any $k > 0$, and hence $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the SOP form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\Phi^2$ as an example, we have:
\begin{align*}
&\widetilde{\Phi^2}(L_a, L_b, L_c, L_d, R_a, R_b, R_c)\\
&\; = L_aR_aL_b + L_bR_bL_d + L_bR_cL_c + 2L_aR_aL_bR_bL_d + 2L_aR_aL_bR_cL_c + 2L_bR_bL_dR_cL_c
\end{align*}
It can be verified that the reduced polynomial, evaluated at each variable's marginal probability, is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1}, \probOf\pbox{R_a=1}, \probOf\pbox{R_b=1}, \probOf\pbox{R_c=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} $\raPlus$ queries over TIDB/BIDB.
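For the instance in \cref{fig:two-step}, plugging the marginal probabilities into the reduced polynomial (in the variable order of the display above) gives
\begin{equation*}
\widetilde{\Phi^2}(0.9, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0) = 0.45 + 0.5 + 0.25 + 0.9 + 0.45 + 0.5 = 3.05,
\end{equation*}
which is exactly the expected count $\expct\pbox{\Phi^2}$ for this instance.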
To prove our hardness result we show that, for the same $Q$ considered in the running example, the query $Q^k$ can encode various hard graph-counting problems. We do so by analyzing how the coefficients of the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant-factor approximation. For example, if we know that $\prob_0 = \max_{i \in [\numvar]}\prob_i$, then $\poly(\prob_0,\ldots, \prob_0)$ is a constant-factor upper bound. The opposite holds for determining a constant-factor lower bound. To get a $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
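As a rough illustration of the constant-factor bound on the running example: evaluating the original polynomial at the marginal probabilities of \cref{fig:two-step} gives
\begin{equation*}
{\Phi^2}(0.9, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0) = (0.45 + 0.5 + 0.25)^2 = 1.44,
\end{equation*}
which is within a small constant factor (roughly $2.1$) of the true expected count $\widetilde{\Phi^2} = 3.05$ for this instance.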
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.