
%root: main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}
%\AH{I am attempting to rewrite this section mostly from scratch.  This will involve taking 'baby' steps towards the goals we spoke of on Friday 080720 as well as throughout the following week on chat channel.}
%
%\AH{\textbf{BEGIN}: Old stuff.}
%
%
%\begin{proof}
%
%Let us now show a sampling scheme which can run in $O\left(|\poly|\cdot k\right)$ per sample.  
%
%First, consider when $\poly$ is already an SOP of pure products.  In this case, sampling is trivial, and one would sample from the $\setsize$ terms with probability proportional to the product of probabilitites for each variable in the sampled monomial.
%
%Second, consider when $\poly$ has a POS form with a product width of $k$.  In this case, we can view $\poly$ as an expression tree, where the leaves represent the individual values of each factor.  The leaves are joined together by either a $\times$ or $+$ internal node, and so on, until we reach the root, which is joining the $k$-$\times$ nodes.  
%
%Then for each $\times$ node, we multiply its subtree values, while for each $+$ node, we pick one of its children with probability proportional to the product of probabilities across its variables.
%
%\AH{I think I mean to say a probability proportional to the number of elements in it's given subtree.}
%
%The above sampling scheme is in $O\left(|\poly|\cdot k\right)$ time then, since we have for either case, that at most the scheme would perform within a factor of the $|\poly|$ operations, and those operations are repeated the product width of $k$ times.
%
%Thus, it is the case, that we can approximate $\rpoly(\prob_1,\ldots, \prob_n)$ within the claimed confidence bounds and computation time, thus proving the lemma.\AH{State why.}
%
%\AH{Discuss how we have that $\rpoly \geq O(\setsize)$.  Discuss that we need $b-a$ to be small.}
%\end{proof}
%
%\qed
%\AH{{\bf END:} Old Stuff}

Unless explicitly stated otherwise, a polynomial is assumed to be in the standard monomial basis, i.e., it is in SOP form and all occurrences of the same monomial have been collapsed into one distinct monomial whose coefficient is the sum of the coefficients of the combined occurrences.

Before proceeding, some useful notation.

\begin{Definition}[Expression Tree]\label{def:express-tree}
An expression tree $\polytree$ is an ADT logically viewed as an n-ary tree, whose internal nodes are from the set $\{+, \times\}$, with leaf nodes being either numerical coefficients or variables, but not both.
\end{Definition}

Note that $\polytree$ is the expression tree corresponding to a general polynomial $\poly$, and is therefore generally \textit{not} in the standard monomial basis.

\begin{Definition}[Poly]\label{def:poly-func}
Let $poly(\polytree)$ denote the function that takes an expression tree $\polytree$ as input and outputs the polynomial, in the same factored form, that $\polytree$ represents.
\end{Definition}

\begin{Definition}[Expression Tree Set]\label{def:express-tree-set}$\expresstree{\smb}$ is the set of all possible polynomial expression trees whose standard monomial basis is $\smb$.  
\end{Definition}

Note that \cref{def:express-tree-set} implies that $\polytree \in \expresstree{\smb}$ whenever $\smb$ is the standard monomial basis of $poly(\polytree)$.

\begin{Definition}[Expanded T]\label{def:expand-tree}
$\expandtree$ is the pure SOP expansion of $\polytree$, where non-distinct monomials are not combined.
\end{Definition}

To illustrate \cref{def:expand-tree} with an example, consider the case where $\polytree$ models the polynomial $(x + 2y)(2x - y)$.  The pure expansion is then $2x^2 - xy + 4xy - 2y^2 = \expandtree$.

\begin{Definition}[Positive T]\label{def:positive-tree}
Let $\abstree$ denote the resulting expression tree when each coefficient $c_i$ in $\polytree$ is exchanged with its absolute value $|c_i|$, and then the resulting $\polytree'$ converted to the expression tree that directly models $poly(\polytree')$, with a root node $+$, whose children consist of all the monomials in the standard monomial basis of $poly(\polytree')$.
\end{Definition}

Using the same polynomial from the above example, $poly(\abstree) = (x + 2y)(2x + y) = 2x^2 +xy +4xy + 2y^2 = 2x^2 + 5xy + 2y^2$.
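
As a quick check of \cref{def:positive-tree} on the same example polynomial, evaluating $\abstree$ at the all-ones vector gives the sum of the absolute values of the coefficients of $\expandtree$:
\[
\abstree(1, 1) = (1 + 2)(2 + 1) = 9 = |2| + |-1| + |4| + |-2|.
\]
This worked instance is only an illustration; the quantity $\abstree(1,\ldots, 1)$ is the normalizing constant that appears in the analysis of the sampling scheme.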

\subsection{Approximating $\rpoly$}

\subsubsection{Description}
Algorithm~\ref{alg:mon-sam} approximates $\rpoly$ by running auxiliary methods on its input $\polytree$, sampling $\polytree$ $\frac{\log{\frac{1}{\delta}}}{\epsilon^2}$ times, and then outputting an estimate of $\rpoly$ within a multiplicative error of $1 \pm \epsilon$ with probability $1 - \delta$.

\subsubsection{Pseudo Code}

\begin{algorithm}[H]
	\caption{\textsc{Approximate $\rpoly$($\polytree$)}}
	\label{alg:mon-sam}
	\begin{algorithmic}[1]
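		\State $acc \gets 0$
		\State $\samplesize \gets \frac{\log{\frac{1}{\delta}}}{\epsilon^2}$\Comment{The required number of samples}
		\State $w \gets $ \textsc{OnePass($\polytree$)}\Comment{\textsc{OnePass} and \textsc{Sample} defined next; $w = \abstree(1,\ldots, 1)$}
		\For{$sample \gets 1$ to $\samplesize$}\Comment{Perform the required number of samples}
			\State $acc \gets acc + $\textsc{Sample($\polytree$)}\Comment{Store the sum over all samples}
		\EndFor

		\State Return $\frac{acc}{\samplesize} \cdot w$\Comment{Average the samples and rescale by $\abstree(1,\ldots, 1)$}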
	\end{algorithmic}
\end{algorithm}

\subsubsection{Correctness}
\begin{Theorem}\label{lem:approx-alg}
For any query polynomial $\poly(\vct{X})$, an approximation of $\rpoly(\prob_1,\ldots, \prob_n)$ can be computed in $O\left(|\poly|\cdot k \frac{\log\frac{1}{\conf}}{\error^2}\right)$ time, within $1 \pm \error$ multiplicative error with probability $\geq 1 - \conf$, where $k$ denotes the product width of $\poly$.
\end{Theorem}

\begin{Lemma}\label{lem:mon-samp}
Algorithm \ref{alg:mon-sam} computes $\frac{\log\frac{1}{\conf}}{\error^2}$ samples, outputting an estimate of $\rpoly(\prob,\ldots, \prob)$ within a multiplicative $1 \pm \error$ error with probability $1 - \conf$.
\end{Lemma}

Before the proof, a brief summary of the sampling scheme is necessary.  Regardless of the structure of $\polytree$, sampling from a weighted distribution given by the absolute values of the coefficients in $poly(\expandtree)$ is the same as sampling uniformly over all individual terms of the equivalent polynomial whose terms have coefficients in the set $\{-1, 1\}$, i.e., the polynomial in which collapsed monomials are decoupled.  Following this reasoning, Algorithm~\ref{alg:one-pass} computes such a weighted distribution and Algorithm~\ref{alg:sample} produces samples accordingly.  As a result, from here on, we can consider our sampling scheme to be uniform.

%each of the $k$ product terms is sampled from individually, where the final output sample is sampled with a probability that is proportional to its coefficient in $\expandtree$.  Note, that   This is performed by \cref{alg:sample} and its correctness will be argued momentarily.  For now it suffices to note that the sampling scheme samples from each of the $k$ products in a POS using a weighted distribution equivalent to sampling uniformly over all monomials.

\begin{proof}[Proof of Lemma \ref{lem:mon-samp}]
The first part of the claim in Lemma~\ref{lem:mon-samp} is immediate from the number of iterations of the for loop.

Next, consider $\expandtree$ and let $c_i$ be the coefficient of the $i^{th}$ monomial and $\distinctvars_i$ be the number of distinct variables appearing in the $i^{th}$ monomial.  As discussed above, sampling the $i^{th}$ term of $\expandtree$ with probability $\frac{|c_i|}{\abstree(1,\ldots, 1)}$ is equivalent to sampling uniformly over the decoupled terms of $\expandtree$.  Now consider $\rpoly$ and note that $\coeffitem{i}$ is the value of the $i^{th}$ monomial term in $\rpoly(\prob_1,\ldots, \prob_n)$.  Let $m$ be the number of terms in $\expandtree$ and let $\coeffset$ be the set $\{c_1,\ldots, c_m\}.$

Consider now a set of $\samplesize$ independent random variables $\vct{\randvar}$, where each $\randvar_i$ is obtained by sampling a coefficient $c_j$ from $\coeffset$ with probability $\frac{|c_j|}{\abstree(1,\ldots, 1)}$ and setting $\randvar_i = sign(c_j)\cdot\prob^{\distinctvars_j}$.  Then for each random variable $\randvar_i$, it is the case that $\expct\pbox{\randvar_i} = \sum_{j = 1}^{m}\frac{|c_j|}{\sum_{k = 1}^{m}|c_k|}\cdot sign(c_j)\cdot \prob^{\distinctvars_j} = \sum_{j = 1}^{m}\frac{c_j \cdot \prob^{\distinctvars_j}}{\sum_{k = 1}^{m}|c_k|} = \frac{\rpoly(\prob,\ldots, \prob)}{\abstree(1,\ldots, 1)}$.  Let $\hoeffest = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i$.  By linearity of expectation, it is also true that 

\[\expct\pbox{\hoeffest} = \expct\pbox{ \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\expct\pbox{\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\sum_{j = 1}^{m}\frac{c_j \cdot \prob^{\distinctvars_j}}{\sum_{k = 1}^{m}|c_k|} = \frac{\rpoly(\prob,\ldots, \prob)}{\abstree(1,\ldots, 1)}.\]

Since every $\randvar_i$ in $\vct{\randvar}$ lies in the range $[-1, 1]$, Hoeffding's inequality gives $P\pbox{~\left| \hoeffest - \expct\pbox{\hoeffest} ~\right| \geq \error} \leq 2\exp\left(-\frac{2\samplesize^2\error^2}{2^2 \samplesize}\right)$, and we require this bound to be at most $\conf$.

Solving for the number of samples $\samplesize$ we get
\begin{align}
&\conf \geq  2\exp\left(-\frac{2\samplesize^2\error^2}{4\samplesize}\right)\label{eq:hoeff-1}\\
&\frac{\conf}{2} \geq \exp\left(-\frac{2\samplesize^2\error^2}{4\samplesize}\right)\label{eq:hoeff-2}\\
&\frac{2}{\conf} \leq \exp\left(\frac{2\samplesize^2\error^2}{4\samplesize}\right)\label{eq:hoeff-3}\\
&\log{\frac{2}{\conf}} \leq \frac{2\samplesize^2\error^2}{4\samplesize}\label{eq:hoeff-4}\\
&\log{\frac{2}{\conf}} \leq \frac{\samplesize\error^2}{2}\label{eq:hoeff-5}\\
&\frac{2\log{\frac{2}{\conf}}}{\error^2} \leq \samplesize.\label{eq:hoeff-6}
\end{align}

Equation \cref{eq:hoeff-1} results from computing the sum in the denominator of the exponential.  Equation \cref{eq:hoeff-2} is the result of dividing both sides by $2$.  Equation \cref{eq:hoeff-3} follows from taking the reciprocal of both sides, and noting that such an operation flips the inequality sign.  We then derive \cref{eq:hoeff-4} by taking the base $e$ log of both sides, and \cref{eq:hoeff-5} results from reducing common factors.  We arrive at the final result of \cref{eq:hoeff-6} by multiplying both sides by $\frac{2}{\error^2}$.

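By Hoeffding's inequality, $\samplesize = O\left(\frac{\log\frac{1}{\conf}}{\error^2}\right)$ samples therefore suffice, which matches the number of samples taken by Algorithm~\ref{alg:mon-sam} up to a constant factor.  This concludes the proof of Lemma~\ref{lem:mon-samp}.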

\end{proof}
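
To make the bound in \cref{eq:hoeff-6} concrete, the following instance uses $\error = 0.1$ and $\conf = 0.05$; these values are chosen here purely for illustration and are not taken from the analysis above.  With $\log$ denoting the natural logarithm,
\[
\samplesize \geq \frac{2\log{\frac{2}{0.05}}}{0.1^2} = 200\log{40} \approx 200 \cdot 3.689 \approx 737.8,
\]
so $\samplesize = 738$ samples would suffice in this instance.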


\subsubsection{Run-time Analysis}

\subsection{OnePass Algorithm}
\subsubsection{Description}
Auxiliary Algorithm~\ref{alg:one-pass} is responsible for computing the weighted (uniform) distribution over $\expandtree$.  This consists of two parts: computing the sum of the absolute values of the coefficients ($\abstree(1,\ldots, 1)$), and computing the distribution over the monomial terms of $\expandtree$, without ever materializing $\expandtree$.

Algorithm~\ref{alg:one-pass} takes $\polytree$ as input, modifies $\polytree$ in place with the appropriate weight distribution across all nodes, and finally returns $\abstree(1,\ldots, 1)$.  For concreteness, consider the example where $poly(\polytree) = (x_1 + x_2)(x_1 - x_2) + x_2^2$.  The expression tree $\polytree$ would then be $+\left(\times\left(+\left(x_1, x_2\right), +\left(x_1, -x_2\right)\right), \times\left(x_2, x_2\right)\right)$.

\AH{A tree diagram would work much better, but to do that it appears that I need to spend time learning the tikz package, which I haven't had time for yet.}

To compute $\abstree(1,\ldots, 1)$, Algorithm~\ref{alg:one-pass} makes a bottom-up traversal of $\polytree$ and performs the following computations.  For a leaf node whose value is a coefficient, the absolute value of the coefficient is saved; a variable leaf is treated as having value $1$, since every variable is evaluated at $1$.  When a $+$ node is visited, the values of its children are summed.  Finally, for the case of a $\times$ node, the values of the children are multiplied.  The algorithm returns the total value upon termination.

Algorithm~\ref{alg:one-pass} computes the weighted (uniform) probability distribution in the same bottom-up traversal.  When a leaf node is encountered, its absolute value is saved if it is a coefficient.  When a $\times$ node is visited, the values of its children are multiplied.  When a $+$ node is visited, the algorithm computes and saves the relative probabilities of each one of its children.  This is done by taking the sum of its children's values and, for each child, dividing the child's value by that sum.  Note the difference in treatment between $+$ and $\times$ nodes: the former stores probabilities for its children, while the latter stores only a partial value for itself, which its parent may later normalize into a probability.  Upon termination, all appropriate nodes have been annotated accordingly.

For the running example, after one pass, \cref{alg:one-pass} would have learned to sample the two children of the root $+$ node with $P\left(\times\left(+\left(x_1, x_2\right), +\left(x_1, -x_2\right)\right)\right) = \frac{4}{5}$ and $P\left(\times\left(x_2, x_2\right)\right) = \frac{1}{5}$.  Similarly, for the two inner $+$ nodes of the root's left child, call them $+_1$ and $+_2$, and using $l$ for left child and $r$ for right child, we have $P_{+_1}(l) = P_{+_1}(r) = P_{+_2}(l) = P_{+_2}(r) = \frac{1}{2}$.  Note that in this example, the sampling probabilities for the children of each inner $+$ node are equal to one another because both parents have the same number of children, and, in each case, the children of each parent $+$ node share the same $|c_i|$.
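
The following sketch checks, for the running example only, that these weights induce the uniform distribution over the five terms of $\expandtree = x_1^2 - x_1x_2 + x_1x_2 - x_2^2 + x_2^2$.  Each of the four terms arising from the left $\times$ node is reached by choosing the left child of the root and then one child at each of $+_1$ and $+_2$, while the final term $x_2^2$ is reached by choosing the right child of the root:
\begin{align*}
P(x_1^2) = P(-x_1x_2) = P(x_1x_2) = P(-x_2^2) &= \frac{4}{5}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{5},\\
P(x_2^2) &= \frac{1}{5}.
\end{align*}
The five probabilities sum to $1$, and $\abstree(1, 1) = 2 \cdot 2 + 1 = 5$ equals the number of terms, since every coefficient in this example has absolute value $1$.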

The following pseudo code assumes that $\polytree$ has the following members.  $\polytree.val$ holds the value stored by $\polytree$, $\polytree.children$ contains all children of $\polytree$, $\polytree.weight$ is the probability of choosing $\polytree$, and $\polytree.partial$ is the coefficient of $\polytree$.  A child of $\polytree$ is assumed to be an expression tree itself.  The function $isnum(\cdot)$ returns true if the value is numeric.

\AH{{\bf Next:} 
5) Prove correctness for all algos.  
6) Don't forget to do run-time analysis.}

\subsubsection{Pseudo Code}

\begin{algorithm}[h!]
	\caption{\textsc{OnePass$(\polytree)$}}
	\label{alg:one-pass}
\begin{algorithmic}[1]
	\If{$\polytree.val = "+"$}
		\State $acc \gets 0$
		\For{$child$ in $\polytree.children$}\Comment{Sum up all children coefficients}
			\State $acc \gets acc + \textsc{OnePass}(child)$
		\EndFor
		\State $\polytree.partial \gets acc$
		\For{$child$ in $\polytree.children$}\Comment{Record distributions for each child}
			\State $child.weight \gets \frac{child.partial}{\polytree.partial}$
		\EndFor
	\ElsIf{$\polytree.val = "\times"$}
		\State $acc \gets 1$
		\For{$child$ in $\polytree.children$}\Comment{Compute the product of all children coefficients}
			\State $acc \gets acc \times \textsc{OnePass}(child)$
		\EndFor
		\State $\polytree.partial \gets acc$
	\ElsIf{$isnum(\polytree.val)$}\Comment{Base case: coefficient leaf}
		\State $\polytree.partial \gets |\polytree.val|$
	\Else\Comment{Base case: variable leaf, evaluated at $1$}
		\State $\polytree.partial \gets 1$
	\EndIf
	\State Return $\polytree.partial$
\end{algorithmic}
\end{algorithm}

\subsection{Sample Algorithm}

Algorithm~\ref{alg:sample} takes $\polytree$ as input and performs the equivalent of outputting a sample $\randvar_i$ such that $\randvar_i \sim Uniform(S)$, where $S$ represents the multiset of monomials in $\expandtree$.  While one cannot compute $\expandtree$ in time better than $O(N^k)$, the algorithm uses a technique on $\polytree$ which produces a uniform sample from $\expandtree$ without ever materializing $\expandtree$.

Algorithm~\ref{alg:sample} uniformly selects a monomial from $\expandtree$ by the following top-down traversal.  For a parent $+$ node, a subtree is chosen according to the previously computed weighted sampling distribution.  When a parent $\times$ node is visited, the monomials sampled from its subtrees are combined into one monomial.  For the case of a parent node with children that are leaf nodes, if the parent is a $\times$ node, then each leaf node is returned, with the coefficient reduced to either $-1$ or $1$ depending on its sign.  If the parent node is a $+$ node, then one of the children is sampled as discussed previously.  The algorithm concludes by outputting $sign(c_i)\cdot\prob^{d_i}$.  The pseudo code uses $isdist(\cdot)$ to mean a function that takes a single variable from $\vct{X}$ as input and returns true if this variable has not yet been seen while computing the current sample.

\subsubsection{Pseudo Code}

Algorithm ~\ref{alg:sample} should be placed here.

\begin{algorithm}
	\caption{Sample($\polytree$)}
	\label{alg:sample}
	\begin{algorithmic}[1]
		\If{$T.val = "+"$}\Comment{Sample at every $+$ node}
			\State $T_{samp} \gets$ WeightedSample($T.children$)\Comment{Choose each $child$ with probability $child.weight$; sample generation is currently treated as a black box}
			\State Return $Sample(T_{samp})$
		\ElsIf{$T.val = "\times"$}\Comment{Multiply the sampled values of all subtree children}
			\State $acc \gets 1$
			\For {$child$ in $T.children$}
				\State $acc \gets acc \times Sample(child)$
			\EndFor
			\State Return $acc$
		\ElsIf{$isnum(T.val)$}\Comment{The leaf is a coefficient non-variable}
			\State Return $sign(T.val)$
		\ElsIf{$isdist(T.val)$}\Comment{The leaf is a variable; we need to know if it is distinct}
			\State Return $\prob$
		\Else\Comment{The variable has already been seen in this sample}
			\State Return $1$
		\EndIf
	\end{algorithmic}
\end{algorithm}
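
As an illustrative check, assuming every variable has the same probability $\prob$ as in Lemma~\ref{lem:mon-samp}, consider running Algorithm~\ref{alg:sample} on the example $poly(\polytree) = (x_1 + x_2)(x_1 - x_2) + x_2^2$.  Each of the five terms $x_1^2$, $-x_1x_2$, $x_1x_2$, $-x_2^2$, $x_2^2$ of $\expandtree$ is selected with probability $\frac{1}{5}$, and the output for the $i^{th}$ term is $sign(c_i)\cdot\prob^{d_i}$, where $d_i$ is the number of distinct variables in that term.  Writing $Y$ for the value returned by a single call (a name introduced here only for this example),
\[
\expct\pbox{Y} = \frac{1}{5}\left(\prob - \prob^2 + \prob^2 - \prob + \prob\right) = \frac{\prob}{5}.
\]
Since $poly(\polytree)$ simplifies to $x_1^2$, we have $\rpoly(\prob, \prob) = \prob$ and $\abstree(1, 1) = 5$, so in this instance the expectation equals $\frac{\rpoly(\prob, \prob)}{\abstree(1, 1)}$, in agreement with the expectation computed in the proof of Lemma~\ref{lem:mon-samp}.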

\subsubsection{Correctness of Algorithm~\ref{alg:one-pass}}
%The algorithm begins with recursively visiting the root node and all of its children, until it reaches all leaves.  Thus, every node is visited.  When going from the bottom up, it is the case for a parent node $+$ that the algorithm records the sum of its children's coefficient values, and produces a weighted distribution based on the partial values.  This weighted distribution across each child subtree is in exact proportion to the probability of choosing either child given that it's parent's subtree was chosen.  Consider the base case, when we have $n$ leaf nodes whose parent (root) node is $+$.  For each $|c_i|$, it is the case that $\frac{|c_i|}{\sum_{i \in [n]}|c_i|}$ is exactly the uniform distribution of $\expandtree$.
%
%When a $\times$ node is visited, \cref{alg:one-pass} takes the product of each of its children.  Note that this is correct, since it is the case that a product of polynomials has a sum of coefficients equal to the product of the sum of each polynomial's coefficients.
%
%Note that for the case of a $+$ subtree of a parent $\times$ node, when the parent node passes its partial sum up to it's parent node, it is the case that the subtrees of the $+$ node probabilities are exactly the proportion of the parent's parent node.

\subsubsection{Run-time Analysis}