%root: main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}
\AH{I am attempting to rewrite this section mostly from scratch. This will involve taking 'baby' steps towards the goals we spoke of on Friday 080720 as well as throughout the following week on chat channel.}
\AH{\textbf{BEGIN}: Old stuff.}
\begin{proof}
We now describe a sampling scheme that runs in $O\left(|\poly|\cdot k\right)$ time per sample.
First, consider the case when $\poly$ is already an SOP of pure products. In this case, sampling is trivial: one samples from the $\setsize$ terms with probability proportional to the product of the probabilities of the variables in the sampled monomial.
Second, consider the case when $\poly$ is in POS form with a product width of $k$. In this case, we can view $\poly$ as an expression tree whose leaves represent the individual values of each factor. The leaves are joined together by either a $\times$ or a $+$ internal node, and so on, until we reach the root, which joins the $k$ $\times$ nodes.
Then for each $\times$ node, we multiply its subtree values, while for each $+$ node, we pick one of its children with probability proportional to the product of probabilities across its variables.
\AH{I think I mean to say a probability proportional to the number of elements in its given subtree.}
This sampling scheme runs in $O\left(|\poly|\cdot k\right)$ time, since in either case the scheme performs a number of operations within a constant factor of $|\poly|$, and those operations are repeated $k$ times, where $k$ is the product width.
Thus we can approximate $\rpoly(\prob_1,\ldots, \prob_n)$ within the claimed confidence bounds and computation time, which proves the lemma.\AH{State why.}
\AH{Discuss how we have that $\rpoly \geq O(\setsize)$. Discuss that we need $b-a$ to be small.}
\end{proof}
\qed
\AH{{\bf END:} Old Stuff}
Before proceeding to describe the approximation algorithm, let us introduce notation that will be of use in the following discussion. First, we write $\smb$ for a polynomial $\poly$ in the standard monomial basis, i.e., a polynomial that is not only in SOP form, but whose non-distinct monomials have been collapsed into one distinct monomial, with the corresponding coefficient reflecting the monomials combined.
Let $\expresstree{\smb}$ be the set of all possible polynomial expressions equivalent to $\smb$. Call the input polynomial $\polytree$, and note that $\polytree \in \expresstree{\smb}$ and that $\polytree$ need not be in the standard monomial basis. Refer to the expanded SOP form of $\poly$ as $\expandtree$, which is the SOP form of $\poly$ such that all coefficients $c_i$ are in the set $\{-1, 1\}$, thus relaxing the distinct monomial requirement of the standard monomial basis. Denote by $\abstree$ the polynomial that results when all monomial coefficients of $\polytree$ are replaced by their absolute values and $\polytree$ is then converted to the standard monomial basis.
\subsection{Monomial Sample Algorithm}
\begin{Lemma}\label{lem:approx-alg}
For any query polynomial $\poly(\vct{X})$, an approximation of $\rpoly(\prob_1,\ldots, \prob_n)$ can be computed in $O\left(|\poly|\cdot k \frac{\log\frac{1}{\conf}}{\error^2}\right)$ time, within $1 \pm \error$ multiplicative error with probability $\geq 1 - \conf$, where $k$ denotes the product width of $\poly$.
\end{Lemma}
\begin{proof}[Proof of Lemma \ref{lem:approx-alg}]
Consider $\polytree$ in the standard monomial basis, and let $c_i$ be the coefficient of the $i^{th}$ monomial and $\distinctvars_i$ the number of distinct variables appearing in the $i^{th}$ monomial. Note that sampling the $i^{th}$ term of $\polytree$ with probability $\frac{|c_i|}{\abstree(1,\ldots, 1)}$ is equivalent to sampling uniformly over $\expandtree$. Now consider $\rpoly$ and note that $\coeffitem{i}$ is the value of the $i^{th}$ monomial term in $\rpoly(\prob_1,\ldots, \prob_n)$. Let $m$ be the number of terms in $\expandtree$ and let $\coeffset$ be the set $\{c'_1,\ldots, c'_m\}$, where each $c'_i$ is in $\{-1, 1\}$.
Consider now a set of $\samplesize$ random variables $\vct{\randvar}$, where $\randvar_i \sim \unidist{\coeffset}$. Recall that we are estimating $\rpoly(\prob,\ldots, \prob)$. Then for each random variable $\randvar_i$, it is the case that $\expct\pbox{\randvar_i} = \sum_{j = 1}^{\setsize}\frac{c'_j \cdot \prob^{\distinctvars_j}}{\setsize} = \frac{\rpoly(\prob,\ldots, \prob)}{\abstree(1,\ldots, 1)}$. Let $\hoeffest = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i$. By linearity of expectation, it is also true that
\[\expct\pbox{\hoeffest} = \expct\pbox{ \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\expct\pbox{\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\sum_{j = 1}^{\setsize}\frac{c'_j \cdot \prob^{\distinctvars_j}}{\setsize} = \frac{\rpoly(\prob,\ldots, \prob)}{\abstree(1,\ldots, 1)}.\]
Since every $\randvar_i$ in $\vct{\randvar}$ lies in the range $[-1, 1]$, Hoeffding's inequality gives $P\pbox{~\left| \hoeffest - \expct\pbox{\hoeffest} ~\right| \geq \error} \leq 2\exp\left(-\frac{2\samplesize^2\error^2}{2^2 \samplesize}\right)$, and we require this bound to be at most $\conf$.
Solving for the number of samples $\samplesize$, we get
\begin{align}
&\conf \geq 2\exp\left(-\frac{2\samplesize^2\error^2}{4\samplesize}\right)\label{eq:hoeff-1}\\
&\frac{\conf}{2} \geq \exp\left(-\frac{2\samplesize^2\error^2}{4\samplesize}\right)\label{eq:hoeff-2}\\
&\frac{2}{\conf} \leq \exp\left(\frac{2\samplesize^2\error^2}{4\samplesize}\right)\label{eq:hoeff-3}\\
&\log{\frac{2}{\conf}} \leq \frac{2\samplesize^2\error^2}{4\samplesize}\label{eq:hoeff-4}\\
&\log{\frac{2}{\conf}} \leq \frac{\samplesize\error^2}{2}\label{eq:hoeff-5}\\
&\frac{2\log{\frac{2}{\conf}}}{\error^2} \leq \samplesize.\label{eq:hoeff-6}
\end{align}
Equation \cref{eq:hoeff-1} results from computing the square in the denominator of the exponent. Equation \cref{eq:hoeff-2} is the result of dividing both sides by $2$. Equation \cref{eq:hoeff-3} follows from taking the reciprocal of both sides, noting that this operation flips the inequality sign. We then derive \cref{eq:hoeff-4} by taking the natural logarithm of both sides, and \cref{eq:hoeff-5} results from cancelling common factors. We arrive at the final result of \cref{eq:hoeff-6} by multiplying both sides by $\frac{2}{\error^2}$.
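As an illustrative instance of \cref{eq:hoeff-6} (the values of $\error$ and $\conf$ below are chosen only for this example, and $\log$ is taken to be the natural logarithm), let $\error = 0.1$ and $\conf = 0.05$. Then
\[\samplesize \geq \frac{2\log{\frac{2}{0.05}}}{0.1^2} = 200\log{40} \approx 737.8,\]
so $\samplesize = 738$ samples suffice. Note that this bound controls the additive deviation of $\hoeffest$ from $\frac{\rpoly(\prob,\ldots, \prob)}{\abstree(1,\ldots, 1)}$, and that the number of samples depends only on $\error$ and $\conf$.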
\end{proof}
\subsubsection{Description}
\subsubsection{Pseudo Code}
\subsubsection{Correctness}
\subsubsection{Run-time Analysis}
\subsection{Sampling Algorithm}
\subsubsection{Description}
Auxiliary algorithm \textit{Sample} performs the brunt of the work in \textit{Monomial Sample}. \textit{Sample} takes $\polytree$ as input and produces the equivalent of a sample $\randvar_i$ such that $\randvar_i \sim Uniform(S)$, where $S$ is the multiset of monomials in $\expandtree$. While one cannot compute $\expandtree$ in time better than $O(N^k)$, the algorithm uses a technique on $\polytree$ that produces a uniform sample from $\expandtree$ without ever materializing $\expandtree$.
The input $\polytree$ can be seen as an expression tree, with leaf nodes being the variables of $\polytree$ and inner nodes being the polynomial operations $+$ or $\times$. For clarity, consider $\polytree = (x_1 + x_2)(x_1 - x_2) + x_2^2$. The expression tree for $\polytree$ is $+\left(\times\left(+\left(x_1, x_2\right), +\left(x_1, -x_2\right)\right), \times\left(x_2, x_2\right)\right)$.
\AH{A tree diagram would work much better, but to do that it appears that I need to spend time learning the tikz package, which I haven't had time for yet.}
Algorithm~\ref{alg:one-pass} performs two important tasks in one pass over the expression tree $\polytree$. The first is to compute $\abstree(1,\ldots, 1)$ in $O(|\poly|)$ time. Note that $\abstree(1,\ldots, 1)$ is the sum of the coefficients of $\abstree$, or equivalently the number of terms in $\expandtree$.
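As a quick check of this equivalence (a worked sketch, assuming that $\expandtree$ keeps every term produced by the expansion, including terms that would otherwise cancel), take $\polytree = (x_1 + x_2)(x_1 - x_2) + x_2^2$. Expanding without collapsing terms gives
\[\expandtree = x_1^2 - x_1x_2 + x_2x_1 - x_2^2 + x_2^2,\]
which has $5$ terms, while replacing coefficients by their absolute values and collapsing gives
\[\abstree = (x_1 + x_2)(x_1 + x_2) + x_2^2 = x_1^2 + 2x_1x_2 + 2x_2^2, \qquad \abstree(1, 1) = 1 + 2 + 2 = 5.\]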
Second, Algorithm~\ref{alg:one-pass} recursively visits the nodes of the tree and performs the following operations. At a leaf node, the algorithm stores the absolute value $|c_i|$ of the coefficient $c_i$. At a $+$ node, it memoizes, for each child, a sampling probability proportional to the partial value of that child's subtree. At an inner $\times$ node, the partial values of its child subtrees are multiplied.
For the running example, after the first pass, \cref{alg:one-pass} would have learned to sample the two children of the root $+$ node with $P\left(\times\left(+\left(x_1, x_2\right), +\left(x_1, -x_2\right)\right)\right) = \frac{4}{5}$ and $P\left(\times\left(x_2, x_2\right)\right) = \frac{1}{5}$. Similarly, for the two inner $+$ nodes of the root's left child, call them $+_1$ and $+_2$, and writing $l$ for the left child and $r$ for the right child, we have $P_{+_1}(l) = P_{+_1}(r) = P_{+_2}(l) = P_{+_2}(r) = \frac{1}{2}$. Note that in this example the sampling probabilities for the children of each inner $+$ node are equal to one another because both $+$ nodes have the same number of children and, in each case, the children of each $+$ node share the same $|c_i|$.
Algorithm~\ref{alg:sample} then uniformly selects a monomial from $\expandtree$ by the following recursive method. For each leaf node, the monomial is returned with its coefficient reduced to either $-1$ or $1$, depending on its sign. For a parent $+$ node, a child subtree is chosen according to the previously computed weighted sampling distribution. When a parent $\times$ node is visited, the monomials returned by its children are combined into one monomial. The algorithm concludes by outputting $sign(c_i)\cdot\prob^{d_i}$.
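To see that this yields a uniform sample on the example $\polytree = (x_1 + x_2)(x_1 - x_2) + x_2^2$ (a worked sketch using the weights $\frac{4}{5}$, $\frac{1}{5}$, and $\frac{1}{2}$ from the first pass), note that each of the four terms produced by the left $\times$ subtree is returned with probability
\[\frac{4}{5}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{5},\]
and the single term $x_2^2$ of the right subtree is returned with probability $\frac{1}{5}$, so all $5$ terms of $\expandtree$ are equally likely. Assuming that a sampled term contributes its sign times $\prob^{d}$, with $d$ its number of distinct variables, and that $\rpoly$ replaces every exponent greater than $1$ by $1$, the expected output is
\[\frac{1}{5}\left(\prob - \prob^2 + \prob^2 - \prob + \prob\right) = \frac{\prob}{5},\]
which agrees with $\frac{\rpoly(\prob, \prob)}{\abstree(1, 1)}$, since here $\poly = x_1^2$ and hence $\rpoly(\prob, \prob) = \prob$.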
\subsubsection{Pseudo Code}
\begin{algorithm}
\caption{OnePass($\polytree$)}
\label{alg:one-pass}
\begin{algorithmic}[1]
\State $acc \gets 0$
\If{$\polytree.head.val = "+"$}
\For{$child$ in $\polytree.children$}
\State $acc \gets acc + OnePass(child)$
\EndFor
\State $\polytree.partial \gets acc$
\For{$child$ in $\polytree.children$}
\State $child.weight \gets \frac{child.partial}{\polytree.partial}$
\EndFor
\State Return $\polytree.partial$
\ElsIf{$\polytree.head.val = "\times"$}
\State $acc \gets 1$
\For{$child$ in $\polytree.children$}
\State $acc \gets acc \times OnePass(child)$
\EndFor
\State $\polytree.partial \gets acc$
\State Return $\polytree.partial$
\Else
\State $\polytree.partial \gets |\polytree.c_i|$
\State Return $\polytree.partial$
\EndIf
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{Sample($\polytree$)}
\label{alg:sample}
\begin{algorithmic}[1]
\If{$\polytree.head.val = "+"$}
\State $T_{samp} \gets$ WeightedSample($\polytree.children$, $\polytree.weights$)
\State Return $Sample(T_{samp})$
\ElsIf{$\polytree.head.val = "\times"$}
\State $c \gets 1$
\State $monom \gets ""$
\For {$child$ in $\polytree.children$}
\State $(m_{child}, c_{child}) \gets Sample(child)$
\State $monom \gets monom ++ m_{child}$
\State $c \gets c \times c_{child}$
\EndFor
\State Return $(monom, c)$
\Else
\State Return $(\polytree.head.val, sign(\polytree.c_i))$
\EndIf
\end{algorithmic}
\end{algorithm}
\subsubsection{Correctness of Algorithm~\ref{alg:one-pass}}
The algorithm begins at the root node and recursively visits all of its descendants until it reaches the leaves. Thus, every node is visited. Going from the bottom up, at a $+$ node the algorithm records the sum of its children's partial values and produces a weighted distribution based on those partial values. The weight of each child subtree is exactly the probability of choosing that child given that its parent's subtree was chosen. Consider the base case, in which $n$ leaf nodes share a parent (root) $+$ node. Choosing leaf $i$ with probability $\frac{|c_i|}{\sum_{j \in [n]}|c_j|}$ is then exactly uniform sampling over $\expandtree$.
When a $\times$ node is visited, \cref{alg:one-pass} takes the product of each of its children. Note that this is correct, since it is the case that a product of polynomials has a sum of coefficients equal to the product of the sum of each polynomial's coefficients.
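For instance, consider the left child of the root in the example $(x_1 + x_2)(x_1 - x_2) + x_2^2$, with all coefficients replaced by their absolute values. Each of its two $+$ children has coefficient sum $2$, and
\[(x_1 + x_2)(x_1 + x_2) = x_1^2 + 2x_1x_2 + x_2^2, \qquad 1 + 2 + 1 = 4 = 2 \cdot 2.\]
More generally, for polynomials $f$ and $g$ with nonnegative coefficients (names introduced here only for illustration), the sum of the coefficients of $f$ is $f(1,\ldots, 1)$, and $(f \cdot g)(1,\ldots, 1) = f(1,\ldots, 1)\cdot g(1,\ldots, 1)$.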
Note that when a $+$ node is the child of a $\times$ node, and the $\times$ node passes its partial value up to its own parent, the weights stored at the $+$ node remain the correct conditional probabilities of choosing each of its children, given that the enclosing subtree was chosen.
\subsubsection{Run-time Analysis}