%root: main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}
Before proceeding we make the following \textit{assumption}.
\begin{Assumption}
All polynomials in this note are in standard monomial basis, i.e., $\poly(\vct{X}) = \sum\limits_{\vct{d} \in \mathbb{N}^N}q_{\vct{d}} \cdot \prod\limits_{i = 1, d_i \geq 1}^{N}X_i^{d_i}$.
\end{Assumption}
To clarify the assumption, a polynomial in the standard monomial basis is one whose monomials are in SOP form, and whose non-distinct monomials have been collapsed into one distinct monomial, with its corresponding coefficient reflecting the number of monomials so combined.
We now introduce some useful definitions and notation. For illustrative purposes in the definitions below, consider the polynomial $\poly(\vct{X}) = 2x^2 + 3xy - 2y^2$.
\begin{Definition}[Degree]\label{def:degree}
The degree of polynomial $\poly(\vct{X})$ is the maximum sum of the exponents of a monomial, over all monomials.
\end{Definition}
The degree of $\poly(\vct{X})$ in the above example is $2$. In this note we consider only finite degree polynomials.
\begin{Definition}[Monomial]\label{def:monomial}
A monomial is a product of a fixed set of variables, each raised to a positive integer power.
\end{Definition}
For example, the expression $xy$ is a monomial from the term $3xy$ of $\poly(\vct{X})$, produced from the set of variables $\vct{X} = \{x, y\}$.
\begin{Definition}[$|\vct{X}|$]\label{def:num-vars}
Denote the number of variables in $\poly(\vct{X})$ as $|\vct{X}|$.
\end{Definition}
In the running example, $|\vct{X}| = 2$.
\begin{Definition}[Expression Tree]\label{def:express-tree}
An expression tree $\etree$ is a binary tree whose internal nodes are from the set $\{+, \times\}$ and whose leaf nodes are either numerical coefficients or variables. Each node of $\etree$ has the members \vari{type}, \vari{val}, \vari{partial}, \vari{children}, and \vari{weight}, where \vari{type} is the type of value stored in the node, \vari{val} is the value stored in the node, \vari{partial} is the sum of the absolute values of the coefficients in the node's subtree, \vari{children} is the list of the node's children, and \vari{weight} is the probability of the node being sampled. When $\etree$ is used as input, values of certain fields, e.g., \vari{partial} and \vari{weight}, may not yet be set.
\end{Definition}
Note that $\etree$ encodes an expression generally \textit{not} in the standard monomial basis, for example, when $\etree$ represents the expression $(x + 2y)(2x - y)$.
\begin{Definition}[poly$(\cdot)$]\label{def:poly-func}
Denote by $poly(\etree)$ the function that takes as input an expression tree $\etree$ and outputs its corresponding polynomial, computed bottom up by multiplying the children of each $\times$ node and summing the children of each $+$ node.
\end{Definition}
\begin{Definition}[Expression Tree Set]\label{def:express-tree-set}$\etreeset{\smb}$ is the set of all possible expression trees $\etree$, such that $poly(\etree) = \poly(\vct{X})$.
\end{Definition}
For our running example, $\etreeset{\smb} = \{2x^2 + 3xy - 2y^2, (x + 2y)(2x - y)\}$. Note that \cref{def:express-tree-set} implies that $\etree \in \etreeset{poly(\etree)}$.
\begin{Definition}[Expanded T]\label{def:expand-tree}
$\expandtree$ is the pure SOP expansion of $\etree$, obtained bottom up as follows: when a $\times$ node is visited, the product of its children is returned; when a $+$ node is visited, its children are returned as individual summand elements, i.e., without applying the addition operation, so that non-distinct monomials are not combined after a product operation. The logical view of $\expandtree$ is a list of tuples $(v, c)$, where $v$ is the monomial and $c$ is the coefficient of its corresponding term.
\end{Definition}
To illustrate \cref{def:expand-tree} with an example, consider the product $(x + 2y)(2x - y)$ and its expression tree $\etree$. The pure expansion of the product is $2x^2 - xy + 4xy - 2y^2 = \expandtree$, logically viewed as $[(x^2, 2), (xy, -1), (xy, 4), (y^2, -2)]$. (For preciseness, note that $\etree$ would use a $+$ node to model the second factor ($\etree_\vari{R}$), storing a child coefficient of $-1$ for the variable $y$. The subtree $\etree_\vari{R}$ would be $+(\times(2, x), \times(-1, y))$, and one can see that $\etree_\vari{R}$ is indeed equivalent to $(2x - y)$.)
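To make the expansion concrete, the following sketch (illustrative Python, not part of the paper; the tuple-based tree encoding and the function name \texttt{expand} are assumptions) computes the list of $(v, c)$ pairs of $\expandtree$ for the example above, keeping non-distinct monomials separate:

```python
# Illustrative sketch of the pure SOP expansion E(T).  Nodes are tuples:
# ('+', l, r), ('*', l, r), ('var', name), or ('num', c).
from itertools import product

def expand(tree):
    """Return the (monomial, coefficient) pairs of E(T), without
    combining non-distinct monomials."""
    kind = tree[0]
    if kind == 'num':
        return [((), tree[1])]          # empty monomial, coefficient c
    if kind == 'var':
        return [((tree[1],), 1)]        # single-variable monomial
    if kind == '+':
        return expand(tree[1]) + expand(tree[2])
    # '*' node: pairwise products of the children's summands
    return [(tuple(sorted(m1 + m2)), c1 * c2)
            for (m1, c1), (m2, c2) in product(expand(tree[1]), expand(tree[2]))]

# (x + 2y)(2x - y), with -y modeled as (-1) * y as in the text
left  = ('+', ('var', 'x'), ('*', ('num', 2), ('var', 'y')))
right = ('+', ('*', ('num', 2), ('var', 'x')), ('*', ('num', -1), ('var', 'y')))
tree  = ('*', left, right)
print(expand(tree))
# four summands: x^2 with 2, xy with -1, xy with 4, y^2 with -2
```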
\begin{Definition}[Positive T]\label{def:positive-tree}
Let $\abstree$ denote the expression tree obtained from $\etree$ by replacing the value $c$ of each coefficient leaf node with its absolute value $|c|$.
\end{Definition}
Using the same polynomial from the above example, $poly(\abstree) = (x + 2y)(2x + y) = 2x^2 +xy +4xy + 2y^2 = 2x^2 + 5xy + 2y^2$. Note that this \textit{is not} the same as $\poly(\vct{X})$.
\begin{Definition}[Evaluation]\label{def:exp-poly-eval}
Given an expression tree $\etree$ and polynomial $poly(\etree)$, the evaluation of both expressions at $\vct{v} \in \mathbb{R}^N$ is performed by substituting values of $\vct{v}$ in for their corresponding variables and performing the indicated operations in either structure. Observe that $\etree(\vct{v}) = poly(\etree)(\vct{v})$.
\end{Definition}
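As a quick numerical sanity check of the observation $\etree(\vct{v}) = poly(\etree)(\vct{v})$, the following illustrative sketch (the tuple-based tree encoding and helper name are assumptions, not the paper's code) evaluates the running example's tree and its expanded polynomial at the same point:

```python
# Evaluate an expression tree bottom up and compare against the
# hand-expanded polynomial poly(T) = 2x^2 + 3xy - 2y^2.
def eval_tree(t, env):
    kind = t[0]
    if kind == 'num':
        return t[1]
    if kind == 'var':
        return env[t[1]]
    l, r = eval_tree(t[1], env), eval_tree(t[2], env)
    return l + r if kind == '+' else l * r

# T encodes (x + 2y)(2x - y)
left  = ('+', ('var', 'x'), ('*', ('num', 2), ('var', 'y')))
right = ('+', ('*', ('num', 2), ('var', 'x')), ('*', ('num', -1), ('var', 'y')))
t = ('*', left, right)
env = {'x': 3.0, 'y': 5.0}
lhs = eval_tree(t, env)
rhs = 2 * env['x']**2 + 3 * env['x'] * env['y'] - 2 * env['y']**2
print(lhs, rhs)  # the two evaluations agree
```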
In the subsequent subsections we lay the groundwork to prove the following theorem.
\begin{Theorem}\label{lem:approx-alg}
For any query polynomial $\poly(\vct{X})$, an approximation of $\rpoly(\prob_1,\ldots, \prob_n)$ can be computed in $O\left(|\etree| + \frac{\log{\frac{1}{\conf}}}{\error^2} \cdot \left(k \cdot\left(1 + \log{k} \cdot depth(\etree)\right)\right)\right)$ time, within a $1 \pm \error$ multiplicative error with probability $\geq 1 - \conf$, where $k$ denotes the degree of $\poly$.
\end{Theorem}
\subsection{Approximating $\rpoly$}
\subsubsection{Description}
Algorithm~\ref{alg:mon-sam} approximates $\rpoly$ using the following steps. First, a call to $\onepass$ on its input $\etree$ produces a weight distribution over the monomials of $\expandtree$ and the value $\abstree(1,\ldots, 1)$, i.e., the sum of the absolute values of the coefficients of $\expandtree$. Next, algorithm~\ref{alg:mon-sam} calls $\sampmon$ to sample one monomial and its sign from $\expandtree$. The sampling is repeated $\ceil{\frac{2\log{\frac{2}{\conf}}}{\error^2}}$ times; each sampled monomial is evaluated over $\vct{p}$, multiplied by its sign, and added to a running sum. The final result is scaled accordingly, returning an estimate of $\rpoly$ within a multiplicative error of $1 \pm \error$ with probability $1 - \conf$.
\subsubsection{Pseudo Code}
\begin{algorithm}[H]
\caption{$\approxq$($\etree$, $\vct{p}$, $\conf$, $\error$)}
\label{alg:mon-sam}
\begin{algorithmic}[1]
\Require \etree: Binary Expression Tree
\Require $\vct{p}$: Vector $\in [0, 1]^N$
\Require $\conf$: $\in [0, 1]$
\Require $\error$: $\in [0, 1]$
\Ensure $\accum$: $\in \mathbb{R}$
\State $\accum \gets 0$\label{alg:mon-sam-global1}
\State $\numsamp \gets \ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$\label{alg:mon-sam-global2}
\State $(\vari{\etree}_\vari{mod}, \vari{size}) \gets $ \onepass($\etree$)\label{alg:mon-sam-onepass}\Comment{$\onepass$ is ~\cref{alg:one-pass} \;and \sampmon \; is ~\cref{alg:sample}}
\For{\vari{i} \text{ in } $1\text{ to }\numsamp$}\Comment{Perform the required number of samples}
\State $(\vari{Y}_\vari{i}, \vari{sgn}_\vari{i}) \gets $ \sampmon($\etree_\vari{mod}$)
\State $\vari{temp} \gets 1$\label{alg:mon-sam-assign1}
\For{$\vari{x}_{\vari{j}}$ \text{ in } $\vari{Y}_{\vari{i}}$}
\State \vari{temp} $\gets$ \vari{temp} $\times \; \vari{\prob}_\vari{j}$\label{alg:mon-sam-product2} \Comment{$\vari{p}_\vari{j}$ is the probability of $\vari{x}_\vari{j}$ from input $\vct{p}$}
\EndFor
\State \vari{temp} $\gets$ \vari{temp} $\times\; \vari{sgn}_\vari{i}$\label{alg:mon-sam-product}
\State $\accum \gets \accum + \vari{temp}$\Comment{Store the sum over all samples}\label{alg:mon-sam-add}
\EndFor
\State $\accum \gets \accum \times \frac{\vari{size}}{\numsamp}$\label{alg:mon-sam-global3}
\State \Return $\accum$
\end{algorithmic}
\end{algorithm}
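The estimator logic of the pseudo code can be sketched as follows (illustrative Python, not the paper's implementation; the explicit monomial list stands in for $\expandtree$, which the actual algorithm never materializes, and all names are assumptions):

```python
# Sketch of Algorithm 1's estimator: sample monomials with probability
# proportional to |c_i|, average sign * prod(p_j) over the distinct
# variables, then scale by |T|(1,...,1) / N.
import math, random

def approximate_rpoly(monomials, p, delta, eps, rng):
    """monomials: list of (vars, c) pairs standing in for ExpandTree(T);
    p: dict mapping each variable to its probability."""
    abs_total = sum(abs(c) for _, c in monomials)      # |T|(1,...,1)
    n_samples = math.ceil(2 * math.log(2 / delta) / eps ** 2)
    weights = [abs(c) / abs_total for _, c in monomials]
    acc = 0.0
    for _ in range(n_samples):
        (vs, c), = rng.choices(monomials, weights=weights, k=1)
        term = 1.0
        for v in set(vs):          # distinct variables only, as in rpoly
            term *= p[v]
        acc += term * (1 if c > 0 else -1)             # times the sign
    return acc * abs_total / n_samples                 # rescale

# ExpandTree for (x + 2y)(2x - y), non-distinct monomials kept separate
ET = [(('x', 'x'), 2), (('x', 'y'), -1), (('x', 'y'), 4), (('y', 'y'), -2)]
p = {'x': 0.5, 'y': 0.5}
est = approximate_rpoly(ET, p, delta=0.05, eps=0.1, rng=random.Random(0))
# rpoly(0.5, 0.5) = 2(0.5) + 3(0.25) - 2(0.5) = 0.75; est should be close
print(est)
```

With these parameters the guarantee is an additive error of at most $\error \cdot \abstree(1,\ldots,1) = 0.9$ with probability $0.95$; in practice the seeded run lands much closer to $0.75$.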
\subsubsection{Correctness}
\begin{Theorem}\label{lem:mon-samp}
Algorithm \ref{alg:mon-sam} outputs an estimate of $\rpoly(\prob_1,\ldots, \prob_n)$ within a multiplicative $1 \pm \error$ factor of $\rpoly(\prob_1,\ldots, \prob_n)$ with probability $1 - \conf$, in $O\left(|\etree| + \frac{\log{\frac{1}{\conf}}}{\error^2} \cdot \left(k \cdot\left(1 + \log{k} \cdot depth(\etree)\right)\right)\right)$ time, for $k = deg(\etree)$.
\end{Theorem}
We first state the lemmas for ~\cref{alg:one-pass} and ~\cref{alg:sample}, the auxiliary algorithms on which ~\cref{alg:mon-sam} relies; their proofs appear in the subsequent subsections.
\begin{Lemma}\label{lem:one-pass}
There exists an algorithm, algorithm ~\ref{alg:one-pass}, which correctly computes $|T_S|(1,\ldots, 1)$ for each subtree $T_S$ of $\etree$. For each $+$ node $T$, it correctly computes the weighted distribution over the children of $T$, assigning each child $T_S$ the weight $\frac{|T_S|(1,\ldots, 1)}{|T|(1,\ldots, 1)}$. All computations are performed in one traversal in $O(|\etree|)$ time.
\end{Lemma}
\begin{Lemma}\label{lem:sample}
For every $(m,c)$ in \expandtree, $k = deg(\etree)$, there exists an algorithm $\sampmon(\etree)$ that returns $m$ with probability $\frac{|c|}{\abstree(1,\ldots, 1)}$ in $O(\log{k} \cdot k \cdot depth(\etree))$ time.
\end{Lemma}
\begin{proof}[Proof of Theorem \ref{lem:mon-samp}]
Consider $\expandtree$ and let $c_i$ be the coefficient of the $i^{th}$ monomial and $\vct{p}_i$ be the vector whose elements are the probabilities of the distinct variables appearing in the $i^{th}$ monomial. By ~\cref{lem:sample}, the sampling scheme samples each term $t$ in $\expandtree$ with probability $\frac{|c_i|}{\abstree(1,\ldots, 1)}$. Call this sampling scheme $\mathcal{S}$. Now consider $\rpoly$ and note that $c_i \cdot \prod\limits_{p_{i, j} \in \vct{p}_i}p_{i, j}$ is the value of the $i^{th}$ monomial term in $\rpoly(\prob_1,\ldots, \prob_n)$. Let $\setsize$ be the number of terms in $\expandtree$ and $\coeffset$ the set $\{c_1,\ldots, c_\setsize\}$.
Consider now a set of $\samplesize$ random variables $\vct{\randvar}$, where each $\randvar_i$ is distributed as described above. Then for each random variable $\randvar_i$, it is the case that
\[\expct\pbox{\randvar_i} = \sum_{j = 1}^{\setsize}\frac{c_j \cdot \prod\limits_{p_{j, \ell} \in \vct{p}_j}p_{j, \ell}}{\abstree(1,\ldots, 1)} = \frac{\rpoly(\prob_1,\ldots, \prob_n)}{\abstree(1,\ldots, 1)}.\]
Let $\hoeffest = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i$. By linearity of expectation, it is also true that
\[\expct\pbox{\hoeffest} = \expct\pbox{ \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\randvar_i} = \frac{1}{\samplesize}\sum_{i = 1}^{\samplesize}\expct\pbox{\randvar_i} = \frac{\rpoly(\prob_1,\ldots, \prob_n)}{\abstree(1,\ldots, 1)}.\]
\begin{Lemma}\label{lem:hoeff-est}
Given $\samplesize$ random variables $\vct{\randvar}$ with distribution $\mathcal{S}$ over expression tree $\etree$, an additive $\error \cdot \abstree(1,\ldots, 1)$ error bound is obtained with probability $\geq 1 - \conf$ when $\samplesize \geq \frac{2\log{\frac{2}{\conf}}}{\error^2}$.
\end{Lemma}
\begin{proof}[Proof of Lemma \ref{lem:hoeff-est}]
Given the range $[-1, 1]$ for every $\randvar_i$ in $\vct{\randvar}$, by Hoeffding's inequality it is the case that $P\pbox{~\left| \hoeffest - \expct\pbox{\hoeffest} ~\right| \geq \error} \leq 2\exp{-\frac{2\samplesize^2\error^2}{2^2 \samplesize}} \leq \conf$.
Solving for the number of samples $\samplesize$ we get
\begin{align}
&\conf \geq 2\exp{-\frac{2\samplesize^2\error^2}{4\samplesize}}\label{eq:hoeff-1}\\
&\frac{\conf}{2} \geq \exp{-\frac{2\samplesize^2\error^2}{4\samplesize}}\label{eq:hoeff-2}\\
&\frac{2}{\conf} \leq \exp{\frac{2\samplesize^2\error^2}{4\samplesize}}\label{eq:hoeff-3}\\
&\log{\frac{2}{\conf}} \leq \frac{2\samplesize^2\error^2}{4\samplesize}\label{eq:hoeff-4}\\
&\log{\frac{2}{\conf}} \leq \frac{\samplesize\error^2}{2}\label{eq:hoeff-5}\\
&\frac{2\log{\frac{2}{\conf}}}{\error^2} \leq \samplesize.\label{eq:hoeff-6}
\end{align}
\Cref{eq:hoeff-1} results from computing the sum in the denominator of the exponential. \Cref{eq:hoeff-2} is the result of dividing both sides by $2$. \Cref{eq:hoeff-3} follows from taking the reciprocal of both sides, noting that this flips the inequality sign. We derive \cref{eq:hoeff-4} by taking the natural log of both sides, and \cref{eq:hoeff-5} results from canceling common factors. We arrive at \cref{eq:hoeff-6} by multiplying both sides by $\frac{2}{\error^2}$.
Thus, by Hoeffding's inequality, $\samplesize \geq \frac{2\log{\frac{2}{\conf}}}{\error^2}$ samples suffice to achieve the claimed additive error bound.
\end{proof}
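The sample-count bound derived above can be sanity-checked numerically (an illustrative sketch, not part of the algorithm; the parameter values are arbitrary assumptions):

```python
# Check that N = ceil(2 ln(2/delta) / eps^2) drives the Hoeffding tail
# 2 exp(-N eps^2 / 2) below delta, since each X_i lies in [-1, 1]
# (range b - a = 2).
import math

def num_samples(delta, eps):
    return math.ceil(2 * math.log(2 / delta) / eps ** 2)

delta, eps = 0.05, 0.1
N = num_samples(delta, eps)
tail = 2 * math.exp(-N * eps ** 2 / 2)
print(N, tail)  # the tail probability is at most delta
```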
\begin{Corollary}\label{cor:adj-err}
Setting $\error = \error' \cdot \frac{\rpoly(\prob_1,\ldots, \prob_n)}{\abstree(1,\ldots, 1)}$ in \cref{lem:hoeff-est} achieves a $1 \pm \error'$ multiplicative error bound.
\end{Corollary}
\begin{proof}[Proof of Corollary \ref{cor:adj-err}]
Since we have an additive $\error \cdot \abstree(1,\ldots, 1)$ error bound, substituting $\error = \error' \cdot \frac{\rpoly(\prob_1,\ldots, \prob_n)}{\abstree(1,\ldots, 1)}$ yields an additive error of $\error' \cdot \rpoly(\prob_1,\ldots, \prob_n)$, i.e., a $1 \pm \error'$ multiplicative error bound.
\end{proof}
Note that Hoeffding's inequality bounds the average of the random variables, i.e., the sum divided by the number of samples. Note also that to properly estimate $\rpoly$, it is necessary to scale back up by $\abstree(1,\ldots, 1)$. Therefore, $\frac{\accum}{\numsamp}$ estimates the expected value of a single sample, and multiplying by $\abstree(1,\ldots, 1)$ yields the estimate of $\rpoly(\prob_1,\ldots, \prob_n)$. This concludes the proof of the first claim of theorem~\ref{lem:mon-samp}.
\subsubsection{Run-time Analysis}
Note that lines~\ref{alg:mon-sam-global1}, ~\ref{alg:mon-sam-global2}, and ~\ref{alg:mon-sam-global3} are $O(1)$ operations. The call to $\onepass$ in line~\ref{alg:mon-sam-onepass} takes $O(|\etree|)$ time by lemma~\ref{lem:one-pass}. Then, in each of the $\numsamp = \ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$ iterations, the assignment, product, and addition operations are $O(1)$; the call to $\sampmon$ takes $O(\log{k}\cdot k \cdot depth(\etree))$ time by lemma~\ref{lem:sample}; and the assignment and product operations of line~\ref{alg:mon-sam-product2} are executed at most $k$ times per iteration.
Thus we have an overall running time of $O\left(|\etree|\right) + O\left(\frac{\log{\frac{1}{\conf}}}{\error^2} \cdot \left(k + \log{k}\cdot k \cdot depth(\etree)\right)\right) = O\left(|\etree| + \frac{\log{\frac{1}{\conf}}}{\error^2} \cdot \left(k \cdot\left(1 + \log{k} \cdot depth(\etree)\right)\right)\right)$.
\end{proof}
\subsection{OnePass Algorithm}
\subsubsection{Description}
Auxiliary algorithm~\ref{alg:one-pass} computes the weighted distribution over $\expandtree$. This consists of two parts: computing the sum of the absolute values of the coefficients of $\expandtree$ (that is, $\abstree(1,\ldots, 1)$), and computing the distribution over the monomial terms of $\expandtree$, all without ever materializing $\expandtree$.
Algorithm~\ref{alg:one-pass} takes $\etree$ as input, modifies $\etree$ in place with the appropriate weight distribution across the children of each $+$ node, and finally returns $\abstree(1,\ldots, 1)$.
To compute $\abstree(1,\ldots, 1)$, algorithm ~\ref{alg:one-pass} makes a bottom-up traversal of $\etree$ and performs the following computations. For a leaf node whose value is a coefficient, the value is saved. When a $+$ node is visited, the coefficient values of its children are summed. Finally, for the case of a $\times$ node, the coefficient values of the children are multiplied. The algorithm returns the modified tree along with the sum of absolute values over all coefficients of $\expandtree$ upon termination.
In the same traversal, algorithm ~\ref{alg:one-pass} computes the weighted probability distribution. The relative probabilities for the children of each $+$ node are computed by taking the sum of its children's coefficient absolute values, and for each child, dividing the child's coefficient absolute value by that sum. Lastly, the partial value of its subtree coefficients is stored at the $+$ node. Upon termination, all appropriate nodes have been annotated accordingly.
For example, consider when $\etree$ is $+\left(\times\left(+\left(\times\left(1, x_1\right), \times\left(1, x_2\right)\right), +\left(\times\left(1, x_1\right), \times\left(-1, x_2\right)\right)\right), \times\left(\times\left(1, x_2\right), \times\left(1, x_2\right)\right)\right)$, which encodes the expression $(x_1 + x_2)(x_1 - x_2) + x_2^2$. After one pass, \cref{alg:one-pass} would have learned to sample the two children of the root $+$ node with $P\left(\times\left(+\left(x_1, x_2\right), +\left(x_1, -x_2\right)\right)\right) = \frac{4}{5}$ and $P\left(\times\left(x_2, x_2\right)\right) = \frac{1}{5}$. Similarly, for the two inner $+$ nodes of the root's left child, call them $+_1$ and $+_2$ and write $l$ and $r$ for their left and right children; then $P_{+_1}(l) = P_{+_1}(r) = P_{+_2}(l) = P_{+_2}(r) = \frac{1}{2}$. In this example the sampling probabilities within each inner $+$ node are equal because, in each case, the children share the same $|c_i|$.
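The one-pass traversal just described can be sketched as follows (illustrative Python; the dictionary-based node encoding and helper names are assumptions, not the paper's code), reproducing the $\frac{4}{5}$ and $\frac{1}{5}$ weights of the example:

```python
# One bottom-up traversal computing |T|(1,...,1) and, at each + node,
# the sampling weight of each child.
def one_pass(tree):
    """tree: dict with 'type' in {'+', '*', 'num', 'var'}; returns
    |T|(1,...,1) and annotates the tree in place."""
    t = tree['type']
    if t == 'num':
        tree['partial'] = abs(tree['val'])
    elif t == 'var':
        tree['partial'] = 1
    elif t == '+':
        tree['partial'] = sum(one_pass(c) for c in tree['children'])
        for c in tree['children']:              # record the distribution
            c['weight'] = c['partial'] / tree['partial']
    else:                                       # '*' node
        tree['partial'] = 1
        for c in tree['children']:
            tree['partial'] *= one_pass(c)
    return tree['partial']

def num(v):  return {'type': 'num', 'val': v}
def var(x):  return {'type': 'var', 'val': x}
def add(*c): return {'type': '+', 'children': list(c)}
def mul(*c): return {'type': '*', 'children': list(c)}

# (x1 + x2)(x1 - x2) + x2^2, as in the example above
t = add(mul(add(mul(num(1), var('x1')), mul(num(1), var('x2'))),
            add(mul(num(1), var('x1')), mul(num(-1), var('x2')))),
        mul(mul(num(1), var('x2')), mul(num(1), var('x2'))))
print(one_pass(t))                              # |T|(1,...,1) = 5
print([c['weight'] for c in t['children']])     # 4/5 and 1/5
```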
\subsubsection{Pseudo Code}
\begin{algorithm}[h!]
\caption{\onepass$(\etree)$}
\label{alg:one-pass}
\begin{algorithmic}[1]
\Require \etree: Binary Expression Tree
\Ensure \etree: Binary Expression Tree
\Ensure \vari{sum}: Real
\State $\vari{sum} \gets 1$\label{alg:one-pass-global-assign}
\If{$\etree.\vari{type} = +$}\label{alg:one-pass-equality1}
\State $\accum \gets 0$\label{alg:one-pass-plus-assign1}
\For{$child$ in $\etree.\vari{children}$}\Comment{Sum up all children coefficients}
\State $(\vari{T}, \vari{s}) \gets \onepass(child)$
\State $\accum \gets \accum + \vari{s}$\label{alg:one-pass-plus-add}
\EndFor
\State $\etree.\vari{partial} \gets \accum$\label{alg:one-pass-plus-assign2}
\For{$child$ in $\etree.\vari{children}$}\Comment{Record distributions for each child}
\State $child.\vari{weight} \gets \frac{\vari{child.partial}}{\etree.\vari{partial}}$\label{alg:one-pass-plus-prob}
\EndFor
\State $\vari{sum} \gets \etree.\vari{partial}$\label{alg:one-pass-plus-assign3}
\State \Return (\etree, \vari{sum})
\ElsIf{$\etree.\vari{type} = \times$}\label{alg:one-pass-equality2}
\State $\accum \gets 1$\label{alg:one-pass-times-assign1}
\For{$child \text{ in } \etree.\vari{children}$}\Comment{Compute the product of all children coefficients}
\State $(\vari{T}, \vari{s}) \gets \onepass(child)$
\State $\accum \gets \accum \times \vari{s}$\label{alg:one-pass-times-product}
\EndFor
\State $\etree.\vari{partial}\gets \accum$\label{alg:one-pass-times-assign2}
\State $\vari{sum} \gets \etree.\vari{partial}$\label{alg:one-pass-times-assign3}
\State \Return (\etree, \vari{sum})
\ElsIf{$\etree.\vari{type} = numeric$}\Comment{Base case}\label{alg:one-pass-equality3}
\State $\vari{sum} \gets |\etree.\vari{val}|$\label{alg:one-pass-leaf-assign1}
\State \Return (\etree, \vari{sum})
\Else
\State \Return (\etree, \vari{sum})
\EndIf
\end{algorithmic}
\end{algorithm}
\subsubsection{Correctness of Algorithm ~\ref{alg:one-pass}}
\begin{proof}[Proof of Lemma ~\ref{lem:one-pass}]
We use proof by structural induction over the depth $d$ of the binary tree $\etree$.
For the base case, $d = 0$, the root node is a leaf and therefore by definition ~\ref{def:express-tree} must be a variable or a coefficient. When it is a variable, \textsc{OnePass} returns $1$, and we indeed have $\abstree(1,\ldots, 1) = 1$. When the root is a coefficient, the absolute value of the coefficient is returned, which is again $\abstree(1,\ldots, 1)$. Since the root node cannot be a $+$ node, this proves the base case.
For the inductive hypothesis, assume that lemma ~\ref{lem:one-pass} holds for all trees of depth $d \leq k$, for some $k \geq 0$.
We now prove that lemma ~\ref{lem:one-pass} holds for depth $k + 1$. Notice that the root of $\etree$ has at most two children, $\etree_L$ and $\etree_R$, each of depth at most $k$. By the inductive hypothesis, lemma ~\ref{lem:one-pass} holds for each existing child, and we are left with two possibilities for the root node. The first case is when the root node is a $+$ node. When this happens, algorithm ~\ref{alg:one-pass} computes $\abstree(1,\ldots, 1) = |T_L|(1,\ldots, 1) + |T_R|(1,\ldots, 1)$, which is correct. For the distribution over the children of $+$, algorithm ~\ref{alg:one-pass} computes $P(\etree_i) = \frac{|T_i|(1,\ldots, 1)}{|T_L|(1,\ldots, 1) + |T_R|(1,\ldots, 1)}$, which is likewise correct. The second case is when the root is a $\times$ node. Algorithm ~\ref{alg:one-pass} then computes the product of the subtree partial values, $|T_L|(1,\ldots, 1) \times |T_R|(1,\ldots, 1)$, which indeed equals $\abstree(1,\ldots, 1)$.
Since algorithm ~\ref{alg:one-pass} completes exactly one traversal, computing these values bottom up, all subtree values are computed, and this completes the proof.
\end{proof}
\subsubsection{Run-time Analysis}
The runtime for \textsc{OnePass} is straightforward. First, line~\ref{alg:one-pass-global-assign} is an $O(1)$ assignment. Second, lines ~\ref{alg:one-pass-equality1}, ~\ref{alg:one-pass-equality2}, and ~\ref{alg:one-pass-equality3} perform a constant number of equality checks per node. For $+$ nodes, \algref{alg:one-pass}{alg:one-pass-plus-add} and \algref{alg:one-pass}{alg:one-pass-plus-prob} (note there is a \textit{constant} factor of $2$ here) perform a constant number of arithmetic operations, while \algref{alg:one-pass}{alg:one-pass-plus-assign1}, \algref{alg:one-pass}{alg:one-pass-plus-assign2}, and \algref{alg:one-pass}{alg:one-pass-plus-assign3} are all $O(1)$ assignments. Similarly, when a $\times$ node is visited, lines \ref{alg:one-pass-times-assign1}, \ref{alg:one-pass-times-assign2}, and \ref{alg:one-pass-times-assign3} are $O(1)$ assignments, while \algref{alg:one-pass}{alg:one-pass-times-product} performs $O(1)$ product operations per node. For leaf nodes, line~\ref{alg:one-pass-leaf-assign1} is an $O(1)$ assignment.
Thus, the algorithm visits each node of $\etree$ one time, with a constant number of operations for all of the $+$, $\times$, and leaf nodes, leading to a runtime of $O(|\etree|)$.
\subsection{Sample Algorithm}
Algorithm ~\ref{alg:sample} takes $\etree$ as input and produces a sample monomial (and its sign) according to the weighted distribution computed by \textsc{OnePass}. While one cannot compute $\expandtree$ in time better than $O(N^k)$, the algorithm, similar to \textsc{OnePass}, traverses $\etree$ to produce a sample from $\expandtree$ without ever materializing $\expandtree$.
Algorithm ~\ref{alg:sample} selects a monomial from $\expandtree$ by the following top-down traversal. At a $+$ node, one subtree is chosen according to the previously computed weighted sampling distribution. When a $\times$ node is visited, both children are visited. All variable leaf nodes encountered in the traversal are added to a set, and the product of the signs of all coefficient leaf nodes encountered is computed. The algorithm returns the set of distinct variables composing the monomial, together with the monomial's sign.
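The top-down sampling traversal can be sketched as follows (illustrative Python; the node encoding and helper names are assumptions, not the paper's code). Empirically, the sampled monomials approach the $\frac{|c_i|}{\abstree(1,\ldots, 1)}$ distribution:

```python
# Choose one child at each + node according to the weights from OnePass,
# descend into all children of a * node, and return the set of distinct
# variables together with the sign.
import random

def one_pass(t):                      # minimal OnePass, as sketched earlier
    if t['type'] == 'num':
        t['partial'] = abs(t['val'])
    elif t['type'] == 'var':
        t['partial'] = 1
    else:
        parts = [one_pass(c) for c in t['children']]
        t['partial'] = sum(parts) if t['type'] == '+' else parts[0] * parts[1]
        if t['type'] == '+':
            for c in t['children']:
                c['weight'] = c['partial'] / t['partial']
    return t['partial']

def sample_monomial(t, rng):
    if t['type'] == 'num':
        return set(), 1 if t['val'] >= 0 else -1
    if t['type'] == 'var':
        return {t['val']}, 1
    if t['type'] == '+':              # sample one child by its weight
        child, = rng.choices(t['children'],
                             weights=[c['weight'] for c in t['children']])
        return sample_monomial(child, rng)
    vars_, sgn = set(), 1             # '*' node: combine both children
    for c in t['children']:
        v, s = sample_monomial(c, rng)
        vars_ |= v
        sgn *= s
    return vars_, sgn

def num(v):  return {'type': 'num', 'val': v}
def var(x):  return {'type': 'var', 'val': x}
def add(l, r): return {'type': '+', 'children': [l, r]}
def mul(l, r): return {'type': '*', 'children': [l, r]}

# (x + 2y)(2x - y): the monomial xy carries |c| weight 1 + 4 = 5 of 9
t = mul(add(var('x'), mul(num(2), var('y'))),
        add(mul(num(2), var('x')), mul(num(-1), var('y'))))
one_pass(t)
rng = random.Random(0)
draws = [sample_monomial(t, rng) for _ in range(9000)]
frac_xy = sum(1 for v, _ in draws if v == {'x', 'y'}) / len(draws)
print(frac_xy)   # close to 5/9, i.e. about 0.556
```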
\subsubsection{Pseudo Code}
\begin{algorithm}
\caption{\sampmon(\etree)}
\label{alg:sample}
\begin{algorithmic}[1]
\Require \etree: Binary Expression Tree
\Ensure \vari{vars}: TreeSet
\Ensure \vari{sgn}: Integer in $\{-1, 1\}$
\State $\vari{vars} \gets new$ $TreeSet()$\label{alg:sample-global1}
\State $\vari{sgn} \gets 1$\label{alg:sample-global2}
\If{$\etree.\vari{type} = +$}\Comment{Sample at every $+$ node}
\State $\etree_{\vari{samp}} \gets$ Sample the left child $\etree_{\vari{L}}$ or the right child $\etree_{\vari{R}}$, with probability $\etree_{\vari{L}}.\vari{weight}$ and $\etree_{\vari{R}}.\vari{weight}$ respectively \label{alg:sample-plus-bsamp}
\State $(\vari{v}, \vari{s}) \gets \sampmon(\etree_{\vari{samp}})$
\State $\vari{vars} \gets \vari{vars} \;\cup \;\vari{v}$\label{alg:sample-plus-union}
\State $\vari{sgn} \gets \vari{sgn} \times \vari{s}$\label{alg:sample-plus-product}
\State $\Return ~(\vari{vars}, \vari{sgn})$
\ElsIf{$\etree.\vari{type} = \times$}\Comment{Multiply the sampled values of all subtree children}
\For {$child$ in $\etree.\vari{children}$}
\State $(\vari{v}, \vari{s}) \gets \sampmon(child)$
\State $\vari{vars} \gets \vari{vars} \cup \vari{v}$\label{alg:sample-times-union}
\State $\vari{sgn} \gets \vari{sgn} \times \vari{s}$\label{alg:sample-times-product}
\EndFor
\State $\Return ~(\vari{vars}, \vari{sgn})$
\ElsIf{$\etree.\vari{type} = numeric$}\Comment{The leaf is a coefficient}
\State $\vari{sgn} \gets \vari{sgn} \times sign(\etree.\vari{val})$
\State $\Return ~(\vari{vars}, \vari{sgn})$
\ElsIf{$\etree.\vari{type} = var$}
\State $\vari{vars} \gets \vari{vars} \; \cup \; \{\;\etree.\vari{val}\;\}$\Comment{Add the variable to the set}
\State $\Return~(\vari{vars}, \vari{sgn})$
\EndIf
\end{algorithmic}
\end{algorithm}
\subsubsection{Correctness of Algorithm ~\ref{alg:sample}}
\begin{proof}[Proof of Lemma ~\ref{lem:sample}]
First, we need to show that $\sampmon$ indeed returns a monomial.
For the base case, the depth $d$ of $\etree$ is $0$, and the root node is either a constant $c$ or a variable $x$. Both cases satisfy the definition of a monomial, and the base case is proven.
For the inductive hypothesis, assume that for all trees of depth $d \leq k$, $\sampmon$ returns a monomial.
For the inductive step, take a tree $\etree$ with $d = k + 1$. Each child of the root has depth at most $k$, and by the inductive hypothesis both children return a valid monomial. The root can be either a $+$ or a $\times$ node. For the case of a $+$ root node, line ~\ref{alg:sample-plus-bsamp} of $\sampmon$ chooses one of the children of the root. Since by hypothesis a monomial is returned from either child, and only one of these monomials is selected, a valid monomial is returned in the $+$ case. When the root is a $\times$ node, lines ~\ref{alg:sample-times-union} and ~\ref{alg:sample-times-product} multiply the monomials returned by the two children of the root, and by definition the product of two monomials is also a monomial, which means that $\sampmon$ returns a valid monomial for the $\times$ root node as well, thus concluding the argument.
Note that for any monomial sampled by Algorithm~\ref{alg:sample}, the nodes traversed form a subgraph of $\etree$ that is \textit{not}, in general, a subtree. We thus seek to prove that the traversed subgraph produces the correct probability for the sampled monomial.
We prove this by structural induction on the depth $d$ of $\etree$. For the base case $d = 0$, by Definition~\ref{def:express-tree} the root must be either a coefficient or a variable. When the root is a variable $x$, the probability that $\sampmon$ returns $x$ is $1$, and the algorithm correctly returns $(\{x\}, 1)$, since $x$ is the sole monomial of $\expandtree$ and has coefficient $1$. When the root is a coefficient $c$, $\sampmon$ likewise returns $(\emptyset, sign(c) \times 1)$ with probability $1$, the correct sign for the sole monomial of $\expandtree$.
For the inductive hypothesis, assume that Lemma~\ref{lem:sample} holds for all trees of depth $d \leq k$, for some $k \geq 0$.
We now prove that Lemma~\ref{lem:sample} holds when $d = k + 1$. The root of $\etree$ has up to two children $\etree_L$ and $\etree_R$. Since the maximal path from the root to either child has length $1$, both $\etree_L$ and $\etree_R$ have depth at most $k$, so Lemma~\ref{lem:sample} holds for each of them; in particular, the probabilities computed on the sampled subgraphs of nodes visited in $\etree_L$ and $\etree_R$ are correct.
Then the root has to be either a $+$ or $\times$ node.
Consider the case when the root is $\times$. Note that we are sampling a term from $\expandtree$. Let $(m, c)$ be the term in $\expandtree$ whose monomial $m$ is sampled, and note that $m = m_L \times m_R$, where $m_L$ comes from $\etree_L$ and $m_R$ from $\etree_R$. The probability that $\sampmon(\etree_{L})$ returns $m_L$ is $\frac{|c_{m_L}|}{|\etree_L|(1,\ldots, 1)}$, and the symmetric probability holds for $m_R$. The probability of sampling $m$ is then $\frac{|c_{m_L}| \cdot |c_{m_R}|}{|\etree_L|(1,\ldots, 1) \cdot |\etree_R|(1,\ldots, 1)}$. For $(m, c)$ in $\expandtree$, it is indeed the case that $|c| = |c_{m_L}| \cdot |c_{m_R}|$ and that $\abstree(1,\ldots, 1) = |\etree_L|(1,\ldots, 1) \cdot |\etree_R|(1,\ldots, 1)$, and therefore $m$ is sampled with the correct probability $\frac{|c|}{\abstree(1,\ldots, 1)}$.
For the case when the root is a $+$ node, $\sampmon$ samples monomial $m$ from one of its children. By the inductive hypothesis, $m_L$ and $m_R$ are sampled with the correct probabilities $\frac{|c_{m_L}|}{|\etree_{\vari{L}}|(1,\ldots, 1)}$ and $\frac{|c_{m_R}|}{|\etree_\vari{R}|(1,\ldots, 1)}$. Assume that $m$ is sampled from $\etree_\vari{L}$; a symmetric argument holds when $m$ is sampled from $\etree_\vari{R}$. The probability that $m$ is sampled from $\etree$ then equals the product of the probability that $m$ is sampled within $\etree_\vari{L}$ and the probability that $\etree_\vari{L}$ is chosen at the root, and
\begin{align*}
P(\sampmon(\etree) = m) = &P(\sampmon(\etree_\vari{L}) = m) \cdot P(SampledChild(\etree) = \etree_\vari{L})\\
= &\frac{|c_m|}{|\etree_\vari{L}|(1,\ldots, 1)} \cdot \frac{|\etree_\vari{L}|(1,\ldots, 1)}{|\etree_\vari{L}|(1,\ldots, 1) + |\etree_\vari{R}|(1,\ldots, 1)}\\
= &\frac{|c_m|}{\abstree(1,\ldots, 1)},
\end{align*}
and we obtain the desired result.
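The correctness claim can be sanity-checked on a small example by computing the exact distribution the scheme induces and comparing it against $\frac{|c_i|}{\abstree(1,\ldots,1)}$. The sketch below (illustrative helper names; trees encoded as nested tuples) does this for $(x - 2)(y + z) = xy + xz - 2y - 2z$, whose absolute coefficients sum to $6$.

```python
from fractions import Fraction

def total(t):
    """|t|(1,...,1): variables set to 1, coefficients in absolute value."""
    op = t[0]
    if op == 'num':
        return abs(t[1])
    if op == 'var':
        return 1
    l, r = total(t[1]), total(t[2])
    return l + r if op == '+' else l * r

def dist(t):
    """Exact distribution over (frozenset of vars, sign) pairs induced by
    the sampling scheme: '+' mixes its children by weight, '*' convolves."""
    op = t[0]
    if op == 'num':
        return {(frozenset(), 1 if t[1] >= 0 else -1): Fraction(1)}
    if op == 'var':
        return {(frozenset([t[1]]), 1): Fraction(1)}
    dl, dr = dist(t[1]), dist(t[2])
    if op == '+':
        wl, wr = Fraction(total(t[1])), Fraction(total(t[2]))
        out = {}
        for k, p in dl.items():
            out[k] = out.get(k, Fraction(0)) + p * wl / (wl + wr)
        for k, p in dr.items():
            out[k] = out.get(k, Fraction(0)) + p * wr / (wl + wr)
        return out
    out = {}
    for (v1, s1), p1 in dl.items():
        for (v2, s2), p2 in dr.items():
            key = (v1 | v2, s1 * s2)
            out[key] = out.get(key, Fraction(0)) + p1 * p2
    return out

# (x - 2)(y + z) = xy + xz - 2y - 2z; |T|(1,...,1) = 3 * 2 = 6
T = ('*',
     ('+', ('var', 'x'), ('num', -2)),
     ('+', ('var', 'y'), ('var', 'z')))
```

Each monomial's sampling probability comes out as its absolute coefficient divided by $6$, matching the lemma on this instance.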
\subsubsection{Run-time Analysis}
Take an arbitrary sample subgraph of an expression tree $\etree$ of degree $k$, and pick an arbitrary level $i$. Let $y_i$ denote the number of $\times$ nodes at this level and $x_i$ the total number of nodes. Since $\sampmon$ traverses both children of a $\times$ node but only one child of a $+$ node, the number of nodes at level $i + 1$ is in general at most $y_i + x_i$; that is, the growth from level $i$ to level $i + 1$ is upper bounded by $x_{i + 1} - x_i \leq y_i$. Recall that, by definition, a polynomial of degree $k$ has $O(k)$ product operations in any monomial, so for any level $i$ of the sample subgraph there are at most $O(k)$ additional multiplication nodes at level $i + 1$.
Note that accounting for coefficients joined by $\times$ nodes yields an upper bound of $O(2k)$ nodes at depth $depth(\etree) - 1$, which is still asymptotically $O(k)$ nodes visited at that level. Then, since $\etree$ has $depth(\etree)$ levels, the total number of nodes visited in a call to $\sampmon$ is $O(k \cdot depth(\etree))$.
Globally, lines~\ref{alg:sample-global1} and~\ref{alg:sample-global2} take $O(1)$ time. For a $+$ node, line~\ref{alg:sample-plus-bsamp} takes $O(1)$ time since $\etree$ is binary. Line~\ref{alg:sample-plus-union} takes $O(\log{k})$ time by the nature of the TreeSet data structure and the fact that, by definition, any monomial sampled from $\expandtree$ has degree $\leq k$ and hence at most $k$ distinct variables.
Finally, line~\ref{alg:sample-plus-product} takes $O(1)$ time for a product and an assignment operation. When a $\times$ node is visited, the same union, product, and assignment operations take place, again in $O(\log{k})$ time. When a variable leaf node is traversed, the same union operation occurs in $O(\log{k})$ time, and a constant leaf node performs only the product and assignment operations mentioned above. Thus each visited node costs $O(\log{k})$ time, and the total runtime of $\sampmon$ is $O(\log{k} \cdot k \cdot depth(\etree))$.
\end{proof}
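To illustrate the $O(k \cdot depth(\etree))$ bound on visited nodes, the following sketch (hypothetical helper names; trees encoded as nested tuples) counts the nodes of one sample subgraph for a product of $k = 4$ binary sums: every $\times$ node visits both children, while every $+$ node visits exactly one.

```python
import random

def weight(t):
    """|t|(1,...,1) under the tuple encoding of the tree."""
    op = t[0]
    if op == 'num':
        return abs(t[1])
    if op == 'var':
        return 1
    l, r = weight(t[1]), weight(t[2])
    return l + r if op == '+' else l * r

def depth(t):
    """Number of levels in the tree (a single leaf counts as 1)."""
    return 1 if t[0] in ('num', 'var') else 1 + max(depth(t[1]), depth(t[2]))

def visit_count(t, rng):
    """Count nodes touched by one sampling pass (the sample subgraph):
    '*' descends into both children, '+' only into the chosen one."""
    op = t[0]
    if op in ('num', 'var'):
        return 1
    if op == '*':
        return 1 + visit_count(t[1], rng) + visit_count(t[2], rng)
    wl, wr = weight(t[1]), weight(t[2])
    child = t[1] if rng.random() < wl / (wl + wr) else t[2]
    return 1 + visit_count(child, rng)
```

On a balanced product of four sums $(a_1+b_1)(a_2+b_2)(a_3+b_3)(a_4+b_4)$, every sampling pass touches the three $\times$ nodes, all four $+$ nodes, and one leaf per $+$ node, which stays within the $k \cdot depth(\etree)$ bound.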