\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
Unlike the results in~\Cref{sec:hard}, our approximation algorithm works for \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark~\cite{pdbench}.
\subsection{Preliminaries and some more notation}
We now introduce useful definitions and notation related to polynomials. We use the following polynomial as an example:
\begin{equation}\label{eq:poly-eg}
\poly(X, Y) = 2X^2 + 3XY - 2Y^2.
\end{equation}
\begin{Definition}[Variables in a monomial]\label{def:vars}
Given a monomial $v$, we use $\var(v)$ to denote the set of variables in $v$.
For example, the monomial $XY$ has $\var(XY)=\inset{X,Y}$.
\end{Definition}
Let $\polyf(\etree)$ denote the polynomial corresponding to expression tree $\etree$. The function $\polyf(\cdot)$ is defined recursively on $\etree$ as follows, where $\etree_\lchild$ and $\etree_\rchild$ denote the left and right child of $\etree$, respectively, and addition and multiplication follow their standard interpretation:
\begin{equation*}
\polyf(\etree) = \begin{cases}
\polyf(\etree_\lchild) + \polyf(\etree_\rchild) &\text{ if \etree.\type } = +\\
\polyf(\etree_\lchild) \cdot \polyf(\etree_\rchild) &\text{ if \etree.\type } = \times\\
\etree.\val &\text{ if \etree.\type } = \var \text{ OR } \tnum.
\end{cases}
\end{equation*}
\begin{Definition}[Expanded T]\label{def:expand-tree}
$\expandtree{\etree}$ is the (pure) sum of products expansion of $\etree$, which we formally define next. The logical view of $\expandtree{\etree}$ is a list of tuples $(\monom, \coef)$, where $\monom$ is a monomial and $\coef$ is in $\mathbb{R}$. $\expandtree{\etree}$ has the following recursive definition (where $\circ$ is list concatenation):
\begin{equation*}
\expandtree{\etree} =
\begin{cases}
\expandtree{\etree_\lchild} \circ \expandtree{\etree_\rchild} &\textbf{ if }\etree.\type = +\\
\left\{(\monom_\lchild \cup \monom_\rchild, \coef_\lchild \cdot \coef_\rchild) ~|~\right.&\\ \quad \left.(\monom_\lchild, \coef_\lchild) \in \expandtree{\etree_\lchild}, (\monom_\rchild, \coef_\rchild) \in \expandtree{\etree_\rchild}\right\} &\textbf{ if }\etree.\type = \times\\
\elist{(\emptyset, \etree.\val)} &\textbf{ if }\etree.\type = \tnum\\
\elist{(\{\etree.\val\}, 1)} &\textbf{ if }\etree.\type = \var.
\end{cases}
\end{equation*}
\end{Definition}
Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\Cref{eq:poly-eg}. Its expression tree $\etree$ is illustrated in Figure~\ref{fig:expr-tree-T}. The pure expansion of the product is $2X^2 - XY + 4XY - 2Y^2$, and, following the $(\monom, \coef)$ order of~\Cref{def:expand-tree}, $\expandtree{\etree}$ is $[(X^2, 2), (XY, -1), (XY, 4), (Y^2, -2)]$.
\begin{figure}[h!]
\centering
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=3.55cm}, level 2/.style={sibling distance=1.8cm}, level 3/.style={sibling distance=0.8cm}]
\node[tree_node] (root) {$\boldsymbol{\times}$}
child{node[tree_node] (TL) {$\boldsymbol{+}$}
child{node[tree_node] {$X$}}
child{node[tree_node] {$\boldsymbol{\times}$}
child{node[tree_node] {$2$}}
child{node[tree_node] {$Y$}}
}
}
child{node[highlight_treenode] (TR) {$\boldsymbol{+}$}
child{node[tree_node] {$\boldsymbol{\times}$}
child{node[tree_node] {$2$}}
child{node[tree_node] {$X$}}
}
child{node[tree_node] {$\boldsymbol{\times}$}
child{node[tree_node] (neg-leaf) {$-1$}}
child{node[tree_node] {$Y$}}
}
};
\node[above right=0.7cm of TR, highlight_color, inner sep=0pt, font=\bfseries] (tr-label) {$\etree_\rchild$};
\node[above right=0.7cm of root, highlight_color, inner sep=0pt, font=\bfseries] (t-label) {$\etree$};
\draw[<-|, highlight_color] (TR) -- (tr-label);
\draw[<-|, highlight_color] (root) -- (t-label);
\end{tikzpicture}
\caption{Expression tree $\etree$ for the product $\boldsymbol{(X + 2Y)(2X - Y)}$.}
\label{fig:expr-tree-T}
\end{figure}
\begin{Definition}[Positive T]\label{def:positive-tree}
For any expression tree $\etree$, the corresponding {\em positive tree}, denoted $\abs{\etree}$, is obtained from $\etree$ as follows. For each leaf node $\ell$ of $\etree$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$.
\end{Definition}
Using the same factorization as in~\Cref{example:expr-tree-T}, $\polyf(\abs{\etree}) = (X + 2Y)(2X + Y) = 2X^2 +XY +4XY + 2Y^2 = 2X^2 + 5XY + 2Y^2$. Note that this \textit{is not} the same as the polynomial from~\Cref{eq:poly-eg}.
Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, we define the evaluation of $\etree$ on $\vct{v}$ as $\etree(\vct{v}) = \polyf(\etree)(\vct{v})$.
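
As a quick sanity check of this notation (a worked instance on the running example, not part of the formal development), evaluating the positive tree of the factorization $(X+2Y)(2X-Y)$ at the all-ones vector gives
\begin{equation*}
\abs{\etree}(1, 1) = (1 + 2\cdot 1)(2\cdot 1 + 1) = 9 = \abs{2} + \abs{-1} + \abs{4} + \abs{-2},
\end{equation*}
that is, for this example $\abs{\etree}(1,\ldots,1)$ coincides with the sum of the absolute values of the coefficients listed in $\expandtree{\etree}$.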
\subsection{Our main result}
In the subsequent subsections we will prove the following theorem.
Let $\etree$ be an expression tree for a UCQ over \bi, define $\poly(\vct{X})=\polyf(\etree)$, and let $k=\degree(\poly)$.
Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\]
such that
\begin{equation}\label{eq:approx-algo-bound}
\probOf\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error' \cdot \rpoly(\prob_1,\dots,\prob_\numvar)\right) \leq \conf.
\end{equation}
The proof of~\Cref{lem:approx-alg} can be found in~\Cref{sec:proofs-approx-alg}.
To get linear runtime results from~\Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in $\expandtree{\etree}$ to be `canceled' when it is modded with $\mathcal{B}$:
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given an expression tree $\etree$, define
\[\gamma(\etree)=\frac{\sum_{(\monom, \coef)\in \expandtree{\etree}} \abs{\coef}\cdot \indicator{\monom\mod{\mathcal{B}}\equiv 0}}{\abs{\etree}(1,\ldots, 1)}.\]
\end{Definition}
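
For illustration, consider the running example $(X+2Y)(2X-Y)$, for which $\abs{\etree}(1,1)=9$, and assume (hypothetically) that $X$ and $Y$ belong to the same block. Then the two $XY$ terms of $\expandtree{\etree}$ are the canceled ones, while the $X^2$ and $Y^2$ terms survive, so
\begin{equation*}
\gamma(\etree) = \frac{\abs{-1} + \abs{4}}{9} = \frac{5}{9}.
\end{equation*}
If $X$ and $Y$ instead lie in different blocks, no monomial is canceled and $\gamma(\etree) = 0$.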
We next present a couple of corollaries of~\Cref{lem:approx-alg}.
Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\etree)$. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying~\Cref{eq:approx-algo-bound} can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}\right)\]
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\treesize(\etree)\cdot \log{\frac{1}{\conf}}\right)$.
The proof of~\Cref{cor:approx-algo-const-p} can be found in~\Cref{sec:proofs-approx-alg}.
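
The following is only a brief sketch of why the bound takes this form; it assumes that all coefficients in $\expandtree{\etree}$ are nonnegative (an assumption made here for illustration), and the full argument is the one in~\Cref{sec:proofs-approx-alg}. Each monomial that is not canceled has at most $k$ distinct variables, each with $\prob_i \geq \prob_0$, so it contributes at least $\coef\cdot\prob_0^k$. Hence
\begin{equation*}
\rpoly(\prob_1,\ldots,\prob_\numvar) \geq \prob_0^k \cdot \sum_{(\monom, \coef)\in \expandtree{\etree}} \coef\cdot \indicator{\monom\mod{\mathcal{B}}\not\equiv 0} = (1-\gamma)\cdot \abs{\etree}(1,\ldots, 1)\cdot \prob_0^k,
\end{equation*}
and therefore
\begin{equation*}
\frac{\abs{\etree}^2(1,\ldots, 1)}{\rpoly^2(\prob_1,\ldots, \prob_\numvar)} \leq \frac{1}{(1-\gamma)^2\cdot \prob_0^{2k}},
\end{equation*}
which turns the runtime of~\Cref{lem:approx-alg} into the one stated above.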
The restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$) as well as for all queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment}).
Note that (i) tuple presence is independent across blocks, so the corresponding probabilities (and hence $\prob_0$) are independent of the number of blocks, and (ii) \bis model uncertain attributes, so block size (and hence $\gamma$) is a function of the ``messiness'' of a dataset, rather than its size.
Thus, we expect the conditions of the corollary to hold in general.
\subsection{Approximating $\rpoly$}
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\etree)$ for expression tree $\etree$ over \bi, we can exactly represent $\rpoly(\vct{X})$ as follows:
\begin{equation*}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i.
\end{equation*}
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{v}} \prob_i$. Taking $\numsamp$ samples, averaging the signed values $sign(c)\cdot Y$, and scaling the average by $\abs{\etree}(1,\ldots, 1)$ gives us our final estimate.
The number of samples is determined by the following bound (see \Cref{app:subsec-th-mon-samp}):
\begin{equation*}
2\exp{\left(-\frac{\samplesize\error^2}{2}\right)}\leq \conf \implies\samplesize \geq \frac{2\log{\frac{2}{\conf}}}{\error^2}.
\end{equation*}
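
For a sense of scale, and assuming that $\log$ denotes the natural logarithm here, choosing $\conf = 0.05$ and $\error = 0.1$ (values picked purely for illustration) gives
\begin{equation*}
\samplesize \geq \frac{2\ln{40}}{(0.1)^2} \approx \frac{2\cdot 3.689}{0.01} \approx 737.8,
\end{equation*}
so $738$ samples suffice for these parameters, independently of the size of $\etree$.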
%\frac{\samplesize\error^2}{2}\geq \log{\frac{2}{\conf}}\\
%\subsubsection{Psuedo Code}
\begin{algorithm}[t]
\caption{$\approxq(\etree, \vct{p}, \conf, \error)$}
\label{alg:mon-sam}
\begin{algorithmic}[1]
\Require \etree: Binary Expression Tree
\Require $\vct{p} = (\prob_1,\ldots, \prob_\numvar)$ $\in [0, 1]^\numvar$
\Require $\conf$ $\in (0, 1]$
\Require $\error$ $\in (0, 1]$
\State $\accum \gets 0$\label{alg:mon-sam-global1}
\State $\numsamp \gets \ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$\label{alg:mon-sam-global2}
\State $(\vari{\etree}_\vari{mod}, \vari{size}) \gets $ \onepass($\etree$)\label{alg:mon-sam-onepass}\Comment{$\onepass$ is~\Cref{alg:one-pass}}
\For{$\vari{i} \in 1 \text{ to }\numsamp$}\label{alg:sampling-loop}\Comment{Perform the required number of samples}
\State $(\vari{M}, \vari{sgn}_\vari{i}) \gets $ \sampmon($\etree_\vari{mod}$)\label{alg:mon-sam-sample}\Comment{\sampmon \; is~\Cref{alg:sample}}
\If{$\vari{M}$ has at most one variable from each block}\label{alg:check-duplicate-block}
\State $\vari{Y}_\vari{i} \gets \prod_{X_j\in\var\inparen{\vari{M}}}p_j$\label{alg:mon-sam-assign1}
\State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times\; \vari{sgn}_\vari{i}$\label{alg:mon-sam-product}
\State $\accum \gets \accum + \vari{Y}_\vari{i}$\Comment{Store the sum over all samples}\label{alg:mon-sam-add}
\EndIf
\EndFor
\State $\vari{acc} \gets \vari{acc} \times \frac{\vari{size}}{\numsamp}$\label{alg:mon-sam-global3}
\State \Return \vari{acc}
\end{algorithmic}
\end{algorithm}
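
To see on a small instance why the scaled average is an unbiased estimate, consider the running example $(X+2Y)(2X-Y)$ and assume, for illustration only, that $X$ and $Y$ lie in different blocks and both have probability $\frac{1}{2}$. Then $\rpoly(X, Y) = 2X + 3XY - 2Y$ and $\rpoly(\frac{1}{2}, \frac{1}{2}) = 1 + \frac{3}{4} - 1 = \frac{3}{4}$. Each $(\monom, \coef)\in\expandtree{\etree}$ is drawn with probability $\abs{\coef}/9$, so a single signed sample has expectation
\begin{equation*}
\frac{2}{9}\cdot\frac{1}{2} - \frac{1}{9}\cdot\frac{1}{4} + \frac{4}{9}\cdot\frac{1}{4} - \frac{2}{9}\cdot\frac{1}{2} = \frac{1}{12},
\end{equation*}
and multiplying by $\vari{size}$, taken here to be $\abs{\etree}(1, 1) = 9$, recovers $\frac{3}{4}$.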
In order to prove~\Cref{lem:approx-alg}, we will need to argue the correctness of~\Cref{alg:mon-sam}. Before we formally do that,
we first state the lemmas that summarize the relevant properties of $\onepass$ and $\sampmon$, the auxiliary algorithms on which~\Cref{alg:mon-sam} relies.
The $\onepass$ function completes in $O(\treesize(\etree))$ time. $\onepass$ guarantees two postconditions: First, for each subtree $\vari{S}$ of $\etree$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type = +$, for each $\vari{child}$ of $\vari{S}$, $\vari{child}.\vari{weight}$ is set to $\frac{\abs{\vari{S}_{\vari{child}}}(1,\ldots, 1)}{\abs{\vari{S}}(1,\ldots, 1)}$.
To prove correctness of~\Cref{alg:mon-sam}, we only use the following fact that follows from the above lemma: $\etree_{\vari{mod}}.\vari{partial}=\abs{\etree}(1,\dots,1)$.
The function $\sampmon$ completes in $O(\log{k} \cdot k \cdot depth(\etree))$ time, where $k = \degree(\polyf(\abs{\etree}))$. Upon completion, every $\left(\monom, sign(\coef)\right)\in \expandtree{\abs{\etree}}$ is returned with probability $\frac{|\coef|}{\abs{\etree}(1,\ldots, 1)}$.
Armed with the above two lemmas, we are ready to argue the following result (proof in~\Cref{sec:proofs-approx-alg}):
%If the contracts for $\onepass$ and $\sampmon$ hold, then
For any $\etree$ with $\degree(\polyf(\abs{\etree})) = k$, \Cref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that
%$\empmean$ has bounds
\[\probOf\left(\left|\vari{acc} - \rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error \cdot \abs{\etree}(1,\ldots, 1)\right) \leq \conf,\]
in $O\left(\treesize(\etree)\right.$ $+$ $\left.\left(\frac{\log{\frac{1}{\conf}}}{\error^2} \cdot k \cdot\log{k} \cdot depth(\etree)\right)\right)$ time.
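
One way to read this additive guarantee as the relative one of~\Cref{lem:approx-alg} (a sketch, assuming $\rpoly(\prob_1,\ldots, \prob_\numvar) > 0$) is to set
\begin{equation*}
\error = \error' \cdot \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)}{\abs{\etree}(1,\ldots, 1)},
\end{equation*}
so that $\error \cdot \abs{\etree}(1,\ldots, 1) = \error'\cdot\rpoly(\prob_1,\ldots, \prob_\numvar)$ and
\begin{equation*}
\frac{\log{\frac{1}{\conf}}}{\error^2} = \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)}{\inparen{\error'}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)},
\end{equation*}
which matches the sampling term in the runtime of~\Cref{lem:approx-alg}.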
\subsection{\onepass\ Algorithm}
%Algorithm ~\ref{alg:one-pass} satisfies the requirements of lemma ~\ref{lem:one-pass}.
The evaluation of $\abs{\etree}(1,\ldots, 1)$ can be defined recursively, as follows (where $\etree_\lchild$ and $\etree_\rchild$ are the `left' and `right' children of $\etree$ if they exist):
\begin{equation*}
\abs{\etree}(1,\ldots, 1) = \begin{cases}
\abs{\etree_\lchild}(1,\ldots, 1) \cdot \abs{\etree_\rchild}(1,\ldots, 1) &\textbf{if }\etree.\type = \times\\
\abs{\etree_\lchild}(1,\ldots, 1) + \abs{\etree_\rchild}(1,\ldots, 1) &\textbf{if }\etree.\type = + \\
|\etree.\val| &\textbf{if }\etree.\type = \tnum\\
1 &\textbf{if }\etree.\type = \var.
\end{cases}
\end{equation*}
For the proof of~\Cref{lem:sample}, we need to argue that when $\etree.\type = +$, we indeed have
\begin{align*}
\etree_\lchild.\vari{weight} &\gets \frac{\abs{\etree_\lchild}(1,\ldots, 1)}{\abs{\etree_\lchild}(1,\ldots, 1) + \abs{\etree_\rchild}(1,\ldots, 1)};\\
\etree_\rchild.\vari{weight} &\gets \frac{\abs{\etree_\rchild}(1,\ldots, 1)}{\abs{\etree_\lchild}(1,\ldots, 1)+ \abs{\etree_\rchild}(1,\ldots, 1)}.
\end{align*}
\noindent \onepass\ (\Cref{alg:one-pass} in \Cref{sec:proofs-approx-alg}) essentially populates the \vari{weight} variable on each node with the above definitions. \Cref{lem:one-pass} is also proved in~\Cref{sec:proofs-approx-alg}.
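
As an illustration, suppose the expression tree of $(X+2Y)(2X-Y)$ has a $\times$ root whose children are the $+$ nodes for $X + 2\cdot Y$ and $2\cdot X + (-1)\cdot Y$ (this tree shape is an assumption made for the example). The quantities above then evaluate to
\begin{align*}
\abs{\etree_\lchild}(1, 1) &= 1 + 2 = 3, & \abs{\etree_\rchild}(1, 1) &= 2 + 1 = 3, & \abs{\etree}(1, 1) &= 3\cdot 3 = 9,
\end{align*}
and the weights assigned to the children of the two $+$ nodes are $\frac{1}{3}$ (for $X$) and $\frac{2}{3}$ (for $2\cdot Y$) on the left, and $\frac{2}{3}$ (for $2\cdot X$) and $\frac{1}{3}$ (for $(-1)\cdot Y$) on the right.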
\subsection{\sampmon\ Algorithm}
%Algorithm ~\ref{alg:sample} takes $\etree$ as input, samples an arbitrary $(\monom, \coef)$ from $\expandtree{\etree}$ with probabilities $\stree_\lchild.\wght$ and $\stree_\rchild.\wght$ for each subtree $\stree$ with $\stree.\type = +$, outputting the tuple $(\monom, \sign(\coef))$. While one cannot compute $\expandtree{\etree}$ in time better than $O(N^k)$, the algorithm, similar to \textsc{OnePass}, uses a technique on $\etree$ which produces a sample from $\expandtree{\etree}$ without ever materializing $\expandtree{\etree}$.
A naive (slow) implementation of \sampmon\ would first compute $\expandtree{\etree}$ and then sample from it.
% However, this would be too time consuming.
Instead, \Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by top-down traversal.
For a parent $+$ node, the child to be visited is sampled from the weighted distribution precomputed by \onepass.
When a parent $\times$ node is visited, both children are visited.
The algorithm computes two properties: the set of all variable leaf nodes visited, and the product of signs of visited coefficient leaf nodes.
We assume a TreeSet data structure that maintains sets with logarithmic-time insertion and linear-time traversal of its elements.
$\sampmon$ is given in \Cref{alg:sample}, and a proof of its correctness (via \Cref{lem:sample}) is provided in \Cref{sec:proofs-approx-alg}.
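
For intuition, the following traces the sampling probabilities on the running example $(X+2Y)(2X-Y)$, assuming a $\times$ root over two $+$ nodes with child weights $\frac{1}{3}$ (for $X$) and $\frac{2}{3}$ (for $2\cdot Y$) on the left, and $\frac{2}{3}$ (for $2\cdot X$) and $\frac{1}{3}$ (for $(-1)\cdot Y$) on the right. Since the root is a $\times$ node, one summand is drawn independently from each factor:
\begin{align*}
\Pr\left[(X^2, +1)\right] &= \tfrac{1}{3}\cdot\tfrac{2}{3} = \tfrac{2}{9}, & \Pr\left[(XY, -1)\right] &= \tfrac{1}{3}\cdot\tfrac{1}{3} = \tfrac{1}{9},\\
\Pr\left[(XY, +1)\right] &= \tfrac{2}{3}\cdot\tfrac{2}{3} = \tfrac{4}{9}, & \Pr\left[(Y^2, -1)\right] &= \tfrac{2}{3}\cdot\tfrac{1}{3} = \tfrac{2}{9}.
\end{align*}
Each probability equals $\abs{\coef}/9$ for the corresponding term $2X^2$, $-XY$, $4XY$, and $-2Y^2$ of the expansion, and the four probabilities sum to $1$.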
\begin{algorithm}[t]
\caption{\sampmon($\etree$)}
\label{alg:sample}
\begin{algorithmic}[1]
\Require \etree: Binary Expression Tree
\Ensure \vari{vars}: TreeSet
\Ensure \vari{sgn} $\in \{-1, 1\}$
\State $\vari{vars} \gets \emptyset$ \label{alg:sample-global1}
\If{$\etree.\type = +$}\Comment{Sample at every $+$ node}
\State $\etree_{\vari{samp}} \gets$ Sample from left subtree ($\etree_{\lchild}$) and right subtree ($\etree_{\rchild}$) w.p. $\etree_\lchild.\wght$ and $\etree_\rchild.\wght$. \label{alg:sample-plus-bsamp}
\State $(\vari{v}, \vari{s}) \gets \sampmon(\etree_{\vari{samp}})$\label{alg:sample-plus-traversal}
\State $\Return ~(\vari{v}, \vari{s})$
\ElsIf{$\etree.\type = \times$}\Comment{Multiply the sampled values of all subtree children}
\State $\vari{sgn} \gets 1$\label{alg:sample-global2}
\For {$child$ in $\etree.\vari{children}$}
\State $(\vari{v}, \vari{s}) \gets \sampmon(child)$
\State $\vari{vars} \gets \vari{vars} \cup \vari{v}$\label{alg:sample-times-union}\Comment{$\vari{v}$ is itself a set of variables}
\State $\vari{sgn} \gets \vari{sgn} \times \vari{s}$\label{alg:sample-times-product}
\EndFor
\State $\Return ~(\vari{vars}, \vari{sgn})$
\ElsIf{$\etree.\type = \tnum$}\Comment{The leaf is a coefficient}
\State $\Return ~\left(\{\}, sign(\etree.\val)\right)$\label{alg:sample-num-return}
\ElsIf{$\etree.\type = \var$}
\State $\Return~\left(\{\etree.\val\}, 1\right) $\label{alg:sample-var-return}
\EndIf
\end{algorithmic}
\end{algorithm}
