In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
The folowing approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark~\cite{pdbench}.
%An expression tree $\circuit$ is a binary %an ADT logically viewed as an n-ary
%tree, whose internal nodes are from the set $\{+, \times\}$, with leaf nodes being either from the set $\mathbb{R}$ $(\tnum)$ or from the set of monomials $(\var)$. The members of $\circuit$ are \type, \val, \vari{partial}, \vari{children}, and \vari{weight}, where \type is the type of value stored in the node $\circuit$ (i.e. one of $\{+, \times, \var, \tnum\}$, \val is the value stored, and \vari{children} is the list of $\circuit$'s children where $\circuit_\lchild$ is the left child and $\circuit_\rchild$ the right child. Remaining fields hold values whose semantics we will fix later. When $\circuit$ is used as input of ~\Cref{alg:mon-sam} and ~\Cref{alg:one-pass}, the values of \vari{partial} and \vari{weight} will not be set. %SEMANTICS FOR \circuit: \vari{partial} is the sum of $\circuit$'s coefficients , n, and \vari{weight} is the probability of $\circuit$ being sampled.
%Note that $\circuit$ need not encode an expression in the standard monomial basis. For instance, $\circuit$ could represent a compressed form of the polynomial in~\Cref{eq:poly-eg}, such as $(x + 2y)(2x - y)$.
The pure expansion of a polynomial $\poly$ is formed by computing all product of sums occurring in $\poly$, without combining like monomials. The pure expansion of $\poly$ generalizes ~\Cref{def:smb} by allowing monomials $m_i = m_j$ for $i \neq j$.
The logical view of \revision{$\expansion{\circuit}$} is a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef$ is in $\reals$.
\revision{$\expansion{\circuit}$} has the following recursive definition ($\circ$ is list concatenation).
Note that similar in spirit to ~\Cref{def:reduced-bi-poly}, $\expansion{\circuit}$ reduces all variable exponents $e > 1$ to $e =1$, though ~\Cref{def:reduced-bi-poly} is more general.
$\expansion{\circuit}$ encodes the \emph{reduced} form of $\polyf\inparen{\circuit}$, decoupling each monomial into a set of variables $\monom$ and a real coefficient $\coef$.
Note, however, that unlike $\rpoly$, $\expansion{\circuit}$ does not need to be in SOP form.
{\em positive circuit}, denoted $\abs{\circuit}$, is obtained from $\circuit$ as follows. For each leaf node $\ell$ of $\circuit$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$.
Using the same factorization from ~\Cref{example:expr-tree-T}, $\polyf(\abs{\circuit})=(X +2Y)(2X + Y)=2X^2+XY +4XY +2Y^2=2X^2+5XY +2Y^2$. Note that this \textit{is not} the same as the polynomial from~\Cref{eq:poly-eg}.
Given an expression tree $\circuit$ and a valuation $\vct{a}\in\mathbb{R}^\numvar$, we define the evaluation of $\circuit$ on $\vct{a}$ as $\circuit(\vct{a})=\polyf(\circuit)(\vct{a})$.
A subcircuit of a circuit $\circuit$ is a circuit \subcircuit such that \subcircuit is a DAG \textit{subgraph} of the DAG representing \circuit. The sink of \subcircuit has exactly one gate \gate.
To get linear runtime results from~\Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in $\expansion{\circuit}$ to be `canceled' when it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}):
Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge\prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying~\Cref{eq:approx-algo-bound} can be computed in time
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\frac1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot\log{\frac{1}{\conf}}\right)$.
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (\Cref{app:subsec:experiment} shows experimentally that $\gamma$ is negligible in practice for these queries).
We also observe that (i) tuple presence is independent across blocks, so the corresponding probabilities (and hence $\prob_0$) are independent of the number of blocks, and (ii) \bis model uncertain attributes, so block size (and hence $\gamma$) is a function of the ``messiness'' of a dataset, rather than its size.
% \AH{I am thinking that perhaps the terminology and presentation of~\Cref{app:subsec:experiment} may need word-smithing to clearly illustrate the $\bi$ benchmarks satisfied--although the substance is already written there.}
% \AR{Yes! E.g. $\gamma$ is not used at all in~\Cref{app:subsec:experiment}}
% \AR{{\bf Boris/Oliver:} Is there a way to claim that all probabilities in practice are actually constants: i.e. they do not increase with the number of tuples?}
% \OK{@Atri: This seems like a reasonable claim. It's too late for me to come up with a reasonable motivation (maybe something will come to me in the morning), but the intuition for me is that each tuple/block is independent... it would be hard for that to be the case if the probability were a function of the number of tuples.}
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for \revision{circuit \circuit} over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample $(\monom,\coef)\in\expansion{\circuit}$ with probability proportional\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.}
%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.}
%\AR{Yes, there is! If we used uniform distribution then in our bounds we will have a parameter that depends on the largest $\abs{coef}$, which e.g. could be dependent on $n$. But with the weighted probability distribution, we avoid paying this price. Though I guess perhaps we can say for the kinds of queries we consider thhese coefficients are all constants?}
to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv0}\cdot\prod_{X_i\in\var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate.
To summarize, \approxq modifies \circuit with a call to \onepass. It then samples from \circuit$\numsamp$ times and uses that information to approximate $\rpoly$.
%Algorithm ~\ref{alg:mon-sam} approximates $\rpoly$ using the following steps. First, a call to $\onepass$ on its input $\circuit$ produces a non-biased weight distribution over the monomials of $\expansion{\circuit}$ and a correct count of $|\circuit|(1,\ldots, 1)$, i.e., the number of monomials in $\expansion{\circuit}$. Next, ~\Cref{alg:mon-sam} calls $\sampmon$ to sample one monomial and its sign from $\expansion{\circuit}$. The sampling is repeated $\ceil{\frac{2\log{\frac{2}{\delta}}}{\epsilon^2}}$ times, where each of the samples are evaluated with input $\vct{p}$, multiplied by $1 \times sign$, and summed. The final result is scaled accordingly returning an estimate of $\rpoly$ with the claimed $(\error, \conf)$-bound of ~\Cref{lem:mon-samp}.
%\AR{Seems like the notation below belongs to the notation section (if we decide to state this explicitly at all)?}
%\AH{Yes, I only included this per your request a few months ago. Based on @lordpretzel removing my definition of monomial, perhaps we can assume that the reader understands the notation below. I \emph{think} this should be a reasonable assumption.}
%Recall that the notation $[x, y]$ denotes the range of values between $x$ and $y$ inclusive. The notation $\{x, y\}$ denotes the set of values consisting of $x$ and $y$.
% \For{$\vari{x}_{\vari{j}}$ \text{ in } $\vari{M}_{\vari{i}}$}
% \State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times \; \vari{\prob}_\vari{j}$\label{alg:mon-sam-product2} \Comment{$\vari{p}_\vari{j}$ is the assignment to $\vari{x}_\vari{j}$ from input $\vct{p}$}
%\For{$\vari{x}_{\vari{j}}$ \text{ in } $\vari{M}$}%_{\vari{i}}$}
% \State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times \; \vari{\prob}_\vari{j}$\label{alg:mon-sam-product2} \Comment{$\vari{p}_\vari{j}$ is the assignment to $\vari{x}_\vari{j}$ from input $\vct{p}$}
we first state the lemmas that summarize the relevant properties of $\onepass$ and $\sampmon$, the auxiliary algorithms on which ~\Cref{alg:mon-sam} relies. %Their proofs are given in~\Cref{sec:onepass} and~\Cref{sec:samplemonomial} respectively.
The $\onepass$ function completes in $O(size(\circuit))$ time. $\onepass$ guarantees two post-conditions: First, for each subcircuit $\vari{S}$ of $\circuit$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type=\circplus$, \subcircuit.\lwght$=\frac{\abs{\subcircuit_\linput}(1,\ldots, 1)}{\abs{\subcircuit}(1,\ldots, 1)}$ and likewise for \subcircuit.\rwght.% is correctly computed for each child of $\vari{S}.$
To prove correctness of~\Cref{alg:mon-sam}, we only use the following fact that follows from the above lemma: for the modified circuit ($\circuit_{\vari{mod}}$), $\circuit_{\vari{mod}}$, $\circuit_{\vari{mod}}.\vari{partial}=\abs{\circuit}(1,\dots,1)$.
%\AH{I'm wondering if there is a better notation to use here. I myself got confused by my own notation of $\circuit_{\vari{mod}}$. \emph{But}, we need to to be referencing the modified $\circuit$ returned by $\onepass$ in the algorithm, so maybe this is the best we can do?}
%At the conclusion of $\onepass$, $\circuit.\vari{partial}$ will hold the sum of all coefficients in $\expansion{\abs{\circuit}}$, i.e., $\sum\limits_{(\monom, \coef) \in \expansion{\abs{\circuit}}}\coef$. $\circuit.\vari{weight}$ will hold the weighted probability that $\circuit$ is sampled from from its parent $+$ node.
The function $\sampmon$ completes in $O(\log{k}\cdot k \cdot\depth(\circuit))$ time, where $k =\degree(poly(\abs{\circuit})$. Upon completion, every $\left(\monom, sign(\coef)\right)\in\expansion{\abs{\circuit}}$ is returned with probability $\frac{|\coef|}{\abs{\circuit}(1,\ldots, 1)}$. %, $\sampmon$ returns the sampled term $\left(\monom, sign(\coef)\right)$ from $\expansion{\abs{\circuit}}$.
For any $\circuit$ with $\degree(poly(|\circuit|))= k$, algorithm \ref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that %$\expct\pbox{\empmean} = \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)\cdot(1 - \gamma)}{\abs{\circuit}(1,\ldots, 1)}$. %within an additive $\error \cdot \abs{\circuit}(1,\ldots, 1)$ error with
The evaluation of $\abs{\circuit}(1,\ldots, 1)$ can be defined recursively, as follows (where $\circuit_\linput$ and $\circuit_\rinput$ are the `left' and `right' inputs of $\circuit$ if they exist):
\noindent\onepass\ (Algorithm ~\ref{alg:one-pass-iter} in \Cref{sec:proofs-approx-alg}) iteratively visits each gate one time according to the topological ordering of \circuit annotating the \lwght, \rwght, and \prt variables of each node according to the definitions above. Lemma~\ref{lem:one-pass} is also proved in~\Cref{sec:proofs-approx-alg}.
%Algorithm ~\ref{alg:sample} takes $\circuit$ as input, samples an arbitrary $(\monom, \coef)$ from $\expansion{\circuit}$ with probabilities $\stree_\lchild.\wght$ and $\stree_\rchild.\wght$ for each subtree $\stree$ with $\stree.\type = +$, outputting the tuple $(\monom, \sign(\coef))$. While one cannot compute $\expansion{\circuit}$ in time better than $O(N^k)$, the algorithm, similar to \textsc{OnePass}, uses a technique on $\circuit$ which produces a sample from $\expansion{\circuit}$ without ever materializing $\expansion{\circuit}$.
\State$\circuit_{\vari{samp}}\gets$ Sample from left input ($\circuit_{\linput}$) and right input ($\circuit_{\rinput}$) w.p. $\circuit.\vari{Lweight}$ and $\circuit.\vari{Rweight}$. \label{alg:sample-plus-bsamp}\Comment{Each call to \sampmon uses fresh randomness}
% We conducted an experiment running modified TPCH queries over uncertain data generated by pdbench~\cite{pdbench}, both of which (data and queries) represent what is typically encountered in practice. Queries were run two times, once filtering $\bi$ cancellations, and then second not filtering the cancellations. The purpose of this was to determine an indication for how many $\bi$ cancellations occur in practice. Details and results can be found in~.