
%root: main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}
\begin{Lemma}\label{lem:approx-alg}
For any query polynomial $\poly(X_1,\ldots, X_n)$, an approximation of $\rpoly(\prob_1,\ldots, \prob_n)$ can be computed in time $O\left(|\poly|\cdot k \cdot \frac{\log\frac{1}{\conf}}{\error^2}\right)$, within $1 \pm \error$ multiplicative error with probability $\geq 1 - \conf$, where $k$ denotes the product width of $\poly$.
\end{Lemma}

\begin{proof}[Proof of Lemma \ref{lem:approx-alg}]
Let $c_i$ be the coefficient of the $i^{th}$ monomial and $\distinctvars_i$ be the number of distinct variables appearing in the $i^{th}$ monomial.  Then $\coeffitem{i}$ is the value of the $i^{th}$ monomial term in $\rpoly(\prob_1,\ldots, \prob_n)$.  Define $\coeffset$ to be the set $\{\coeffitem{1},\ldots, \coeffitem{\setsize}\}$, where $\setsize$ is the number of monomials.  Consider a set of $\samplesize$ independent random variables $\vct{\randvar}$, where $\randvar_i \sim \unidist{\coeffset}$.

For each random variable $\randvar_i$, we have $\expct\pbox{\randvar_i} = \sum_{j = 1}^{\setsize}\frac{\coeffitem{j}}{\setsize} = \ave{\coeffset}$.  Let $\hoeffest = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i$.  Then

\[\expct\pbox{\hoeffest} = \expct\pbox{ \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\expct\pbox{\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\frac{1}{\setsize}\sum_{j = 1}^{\setsize}\coeffitem{j} = \ave{\coeffset}.\]

Denote $\hoeffestsum = \hoeffest \cdot \setsize$ and $\setsum = \ave{\coeffset} \cdot \setsize$.
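
As a small illustration, using hypothetical values that are not drawn from any particular query, suppose $\setsize = 3$ and $\coeffset = \{0.2, 0.5, 0.8\}$.  Then
\[\ave{\coeffset} = \frac{0.2 + 0.5 + 0.8}{3} = 0.5, \qquad \setsum = 0.5 \cdot 3 = 1.5.\]
If $\samplesize = 4$ draws happen to return $0.2, 0.8, 0.8, 0.5$, then
\[\hoeffest = \frac{0.2 + 0.8 + 0.8 + 0.5}{4} = 0.575, \qquad \hoeffestsum = 0.575 \cdot 3 = 1.725,\]
which serves as the estimate of $\setsum = 1.5$ for this particular (assumed) sample.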

Suppose every $\randvar_i$ in $\vct{\randvar}$ takes values in the range $[a, b]$, so that $a_i = a$ and $b_i = b$ for all $i$.  By Hoeffding's inequality, $\Pr\pbox{| \hoeffestsum - \setsum | \geq \error\setsize} \leq 2\exp\left(-\frac{2\samplesize^2\setsize^2\error^2}{\sum_{i = 1}^{\samplesize}\left(b_i - a_i\right)^2}\right)$, and we require this bound to be at most $\conf$.

Solving for the number of samples $\samplesize$ we get
\begin{align}
&\conf \geq 2\exp\left(-\frac{2\samplesize^2\setsize^2\error^2}{\samplesize\left(b - a\right)^2}\right)\label{eq:hoeff-1}\\
&\frac{\conf}{2} \geq \exp\left(-\frac{2\samplesize^2\setsize^2\error^2}{\samplesize\left(b - a\right)^2}\right)\label{eq:hoeff-2}\\
&\frac{2}{\conf} \leq \exp\left(\frac{2\samplesize^2\setsize^2\error^2}{\samplesize\left(b - a\right)^2}\right)\label{eq:hoeff-3}\\
&\log{\frac{2}{\conf}} \leq \frac{2\samplesize^2\setsize^2\error^2}{\samplesize\left(b - a\right)^2}\label{eq:hoeff-4}\\
&\log{\frac{2}{\conf}} \leq \frac{2\samplesize\setsize^2\error^2}{\left(b - a\right)^2}\label{eq:hoeff-5}\\
&\frac{\log{\frac{2}{\conf}}\left(b - a\right)^2}{2\setsize^2\error^2} \leq \samplesize.\label{eq:hoeff-6}
\end{align}

Equation \cref{eq:hoeff-1} results from the fact that the range of values of the random variable $\randvar_i$ is the same for all $\samplesize$ samples, so that $\sum_{i = 1}^{\samplesize}\left(b_i - a_i\right)^2 = \samplesize\left(b - a\right)^2$.  Equation \cref{eq:hoeff-2} is the result of dividing both sides by $2$.  Equation \cref{eq:hoeff-3} follows from taking the reciprocal of both sides, noting that this operation flips the inequality sign.  We then derive \cref{eq:hoeff-4} by taking the base $e$ log of both sides, and \cref{eq:hoeff-5} results from cancelling the common $\samplesize$ factor.  We arrive at the final result of \cref{eq:hoeff-6} by multiplying both sides by the reciprocal of the RHS fraction without the $\samplesize$ factor.
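
For a sense of scale, consider plugging purely hypothetical values into \cref{eq:hoeff-6} as stated: $\error = 0.1$, $\conf = 0.05$, $\setsize = 2$, and $b - a = 1$.  With $\log$ taken base $e$, this gives
\[\samplesize \geq \frac{\log\left(\frac{2}{0.05}\right)\cdot 1^2}{2 \cdot 2^2 \cdot \left(0.1\right)^2} = \frac{\log 40}{0.08} \approx 46.1,\]
so under these assumed values $\samplesize = 47$ samples would satisfy the bound.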


Let us now show a sampling scheme that runs in $O\left(|\poly|\cdot k\right)$ time per sample.

First, consider the case when $\poly$ is already an SOP of pure products.  In this case, sampling is trivial: one samples from the $\setsize$ terms with probability proportional to the product of probabilities for each variable in the sampled monomial.

Second, consider when $\poly$ has a POS form with a product width of $k$.  In this case, we can view $\poly$ as an expression tree, where the leaves represent the individual values of each factor.  The leaves are joined together by either a $\times$ or $+$ internal node, and so on, until we reach the root, which is joining the $k$-$\times$ nodes.

Then for each $\times$ node, we multiply its subtree values, while for each $+$ node, we pick one of its children with probability proportional to the product of probabilities across its variables.
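
As a small illustration, take the hypothetical polynomial $\poly = (X_1 + X_2)(X_3 + X_4)$, which is not taken from the text and which we assume here to have product width $k = 2$ (two factors under the root).  Its expression tree has a $\times$ root with two $+$ children, whose leaves are $X_1, X_2$ and $X_3, X_4$ respectively.  One run of the scheme picks one leaf under each $+$ node, say $X_2$ and $X_3$, and the $\times$ root multiplies the two choices to yield the monomial $X_2X_3$.  The possible outcomes are exactly the monomials of the expanded form
\[(X_1 + X_2)(X_3 + X_4) = X_1X_3 + X_1X_4 + X_2X_3 + X_2X_4,\]
so a monomial is sampled without materializing the expansion.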

The above sampling scheme runs in $O\left(|\poly|\cdot k\right)$ time, since in either case the scheme performs at most a constant factor of $|\poly|$ operations, and those operations are repeated $k$ times, where $k$ is the product width.

Thus, we can approximate $\rpoly(\prob_1,\ldots, \prob_n)$ within the claimed confidence bounds and computation time, which proves the lemma.
\end{proof}
