%root: main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}
\AH{I am attempting to rewrite this section mostly from scratch. This will involve taking 'baby' steps towards the goals we spoke of on Friday 080720 as well as throughout the following week on chat channel.}
\AH{\textbf{BEGIN}: Old stuff.}
\begin{Lemma}\label{lem:approx-alg}
For any query polynomial $\poly(X_1,\ldots, X_n)$, an approximation of $\rpoly(\prob_1,\ldots, \prob_n)$ within $1 \pm \error$ multiplicative error can be computed in time $O\left(|\poly|\cdot k \frac{\log\frac{1}{\conf}}{\error^2}\right)$ with probability $\geq 1 - \conf$, where $k$ denotes the product width of $\poly$.
\end{Lemma}
\begin{proof}[Proof of Lemma \ref{lem:approx-alg}]
Let $c_i$ be the coefficient of the $i^{th}$ monomial and $\distinctvars_i$ be the number of distinct variables appearing in the $i^{th}$ monomial. Then $\coeffitem{i}$ is the value of the $i^{th}$ monomial term in $\rpoly(\prob_1,\ldots, \prob_n)$. Define $\coeffset$ to be the set $\{\coeffitem{1},\ldots, \coeffitem{\setsize}\}$. Assume a set of $\samplesize$ random variables $\vct{\randvar}$, where $\randvar_i \sim \unidist{\coeffset}$.
For each random variable $\randvar_i$, it is the case that $\expct\pbox{\randvar_i} = \sum_{j = 1}^{\setsize}\frac{\coeffitem{j}}{\setsize} = \ave{\coeffset}$. Let $\hoeffest = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i$. Then, by linearity of expectation,
\[\expct\pbox{\hoeffest} = \expct\pbox{ \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\expct\pbox{\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\frac{1}{\setsize}\sum_{j = 1}^{\setsize}\coeffitem{j} = \ave{\coeffset}.\]
Denote $\hoeffestsum = \hoeffest \cdot \setsize$ and $\setsum = \ave{\coeffset} \cdot \setsize$.
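As a small illustration with hypothetical values (not taken from the analysis above), let $\poly = 2X_1X_2 + 3X_3$ with $\prob_1 = \prob_2 = 0.5$ and $\prob_3 = 0.2$. The two monomial terms evaluate to $2 \cdot 0.5 \cdot 0.5 = 0.5$ and $3 \cdot 0.2 = 0.6$, so that
\[\coeffset = \{0.5, 0.6\}, \qquad \setsize = 2, \qquad \ave{\coeffset} = 0.55, \qquad \setsum = 0.55 \cdot 2 = 1.1 = \rpoly(0.5, 0.5, 0.2).\]
An estimate $\hoeffest$ close to $0.55$ therefore yields $\hoeffestsum$ close to $1.1$.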
Given that every $\randvar_i$ in $\vct{\randvar}$ has range $[a_i, b_i]$, Hoeffding's inequality bounds the failure probability, which we require to be at most $\conf$: $\Pr\pbox{| \hoeffestsum - \setsum | \geq \error\setsize} \leq 2\exp{-\frac{2\samplesize^2\setsize^2\error^2}{\sum_{i = 1}^{\samplesize}\left(b_i - a_i\right)^2}} \leq \conf$.
\AH{Needs to be rewritten using the 2nd Hoeffding.}
Solving for the number of samples $\samplesize$ we get
\begin{align}
&\conf \geq 2\exp{-\frac{2\samplesize^2\setsize^2\error^2}{\samplesize\left(b - a\right)^2}}\label{eq:hoeff-1}\\
&\frac{\conf}{2} \geq \exp{-\frac{2\samplesize^2\setsize^2\error^2}{\samplesize\left(b - a\right)^2}}\label{eq:hoeff-2}\\
&\frac{2}{\conf} \leq \exp{\frac{2\samplesize^2\setsize^2\error^2}{\samplesize\left(b - a\right)^2}}\label{eq:hoeff-3}\\
&\log{\frac{2}{\conf}} \leq \frac{2\samplesize^2\setsize^2\error^2}{\samplesize\left(b - a\right)^2}\label{eq:hoeff-4}\\
&\log{\frac{2}{\conf}} \leq \frac{2\samplesize\setsize^2\error^2}{\left(b - a\right)^2}\label{eq:hoeff-5}\\
&\frac{\log{\frac{2}{\conf}}\left(b - a\right)^2}{2\setsize^2\error^2} \leq \samplesize.\label{eq:hoeff-6}
\end{align}
Equation \cref{eq:hoeff-1} results from the fact that the range of $\randvar_i$ is the same interval $[a, b]$ for all $\samplesize$ samples, so the sum in the denominator equals $\samplesize\left(b - a\right)^2$. Equation \cref{eq:hoeff-2} is the result of dividing both sides by $2$. Equation \cref{eq:hoeff-3} follows from taking the reciprocal of both sides, noting that this operation flips the inequality sign. We then derive \cref{eq:hoeff-4} by taking the base $e$ log of both sides, and \cref{eq:hoeff-5} results from cancelling the common $\samplesize$ factor. We arrive at the final result of \cref{eq:hoeff-6} by multiplying both sides by the reciprocal of the RHS fraction without the $\samplesize$ factor.
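To get a feel for \cref{eq:hoeff-6}, one can plug in hypothetical values, chosen only for illustration: $\conf = 0.05$, $\error = 0.1$, $\setsize = 2$, and $b - a = 1$. Taking the bound as stated,
\[\samplesize \geq \frac{\log{\frac{2}{0.05}} \cdot 1^2}{2 \cdot 2^2 \cdot 0.1^2} = \frac{\log 40}{0.08} \approx \frac{3.689}{0.08} \approx 46.1,\]
so $\samplesize = 47$ samples would suffice under these assumptions.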
Let us now show a sampling scheme which can run in $O\left(|\poly|\cdot k\right)$ per sample.
First, consider the case where $\poly$ is already an SOP of pure products. Here sampling is trivial: one samples from the $\setsize$ terms with probability proportional to the product of probabilities for each variable in the sampled monomial.
Second, consider the case where $\poly$ has a POS form with a product width of $k$. In this case, we can view $\poly$ as an expression tree, where the leaves represent the individual values of each factor. The leaves are joined together by either a $\times$ or a $+$ internal node, and so on, until we reach the root, which joins the $k$ $\times$ nodes.
Then for each $\times$ node, we multiply its subtree values, while for each $+$ node, we pick one of its children with probability proportional to the product of probabilities across its variables.
\AH{I think I mean to say a probability proportional to the number of elements in its given subtree.}
The above sampling scheme therefore runs in $O\left(|\poly|\cdot k\right)$ time per sample: in either case, the scheme performs at most a constant factor of $|\poly|$ operations, and those operations are repeated $k$ times, where $k$ is the product width.
Thus we can approximate $\rpoly(\prob_1,\ldots, \prob_n)$ within the claimed confidence bounds and computation time, which proves the lemma.\AH{State why.}
\AH{Discuss how we have that $\rpoly \geq O(\setsize)$. Discuss that we need $b-a$ to be small.}
\end{proof}
\qed
\AH{{\bf END:} Old Stuff}
Before proceeding to describe the approximation algorithm, let us introduce notation that will be of use in the following discussion. First, when we speak of $\smb$, we are speaking of a polynomial $\poly$ in the standard monomial basis, i.e., a polynomial that is not only in SOP form, but whose duplicate monomials have been collapsed into one distinct monomial with the correct corresponding coefficient.
Let $\expresstree{\smb}$ be the set of all possible polynomial expressions equivalent to $\smb$. Call the input polynomial $\polytree$, and note that $\polytree \in \expresstree{\smb}$ and need not be in the standard monomial basis. Refer to the expanded SOP form of $\poly$ as $\expandtree$, which is the SOP form of $\poly$ such that all coefficients $c_i$ are in the set $\{-1, 1\}$, thus relaxing the distinct monomial requirement of the standard monomial basis. Denote by $\abstree$ the polynomial that results when all monomial coefficients of $\polytree$ are converted to positive coefficients, and $\polytree$ itself is then converted to the standard monomial basis.
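As an illustration of this notation on a hypothetical input (the polynomial below is chosen only as an example), take $\polytree = (X + Y)(X - Y)$. Then
\begin{align*}
\expandtree &= X^2 - XY + XY - Y^2,\\
\smb &= X^2 - Y^2,\\
\abstree &= X^2 + 2XY + Y^2.
\end{align*}
Every coefficient in $\expandtree$ lies in $\{-1, 1\}$ and the two $XY$ terms are kept separate, whereas $\smb$ collapses them (here they cancel). For $\abstree$, the negative coefficient in $\polytree$ is first made positive, giving $(X + Y)(X + Y)$, which is then converted to the standard monomial basis.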
\subsection{Sampling Algorithm}
The auxiliary algorithm \textit{Sample} does the bulk of the work in \textit{Monomial Sample}. \textit{Sample} takes $\polytree$ as input and performs the equivalent of outputting a sample $\randvar_i$ such that $\randvar_i \sim \unidist{S}$, where $S$ represents the set of monomials in $\expandtree$. While one cannot compute $\expandtree$ in time less than $O(N^k)$, the algorithm uses a technique that operates directly on $\polytree$ and still produces a uniform sample from $\expandtree$.
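One standard way to draw a uniform monomial from an expanded form without computing it is sketched here. It is an assumption made for illustration and not necessarily the exact procedure used by \textit{Sample}. Annotate each node of the expression tree with the number of monomials in its expansion: a leaf has count $1$, a $+$ node has the sum of its children's counts, and a $\times$ node has the product of its children's counts. At a $+$ node, descend into one child with probability proportional to its count. At a $\times$ node, sample from every child and multiply the results. For the hypothetical input $X_1 + (X_2 + X_3)(X_4 + X_5)$, the counts are $1$ for $X_1$ and $2 \cdot 2 = 4$ for the product, so the root has count $5$ and
\[\Pr\left[X_1\right] = \frac{1}{5}, \qquad \Pr\left[X_2X_4\right] = \frac{4}{5}\cdot\frac{1}{2}\cdot\frac{1}{2} = \frac{1}{5},\]
and likewise for $X_2X_5$, $X_3X_4$, and $X_3X_5$. All five monomials of the expansion are therefore equally likely, even though the expansion itself is never materialized.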