paper-BagRelationalPDBsAreHard/app_samp-monom-analysis.tex

%root: main.tex


\subsection{\sampmon Remarks}\label{subsec:sampmon-remarks}
\input{app_samp-monom_pseudo-code}
We briefly describe the top-down traversal of \sampmon.  When \circuit.\type $= +$, the input to be visited is sampled from the weighted distribution precomputed by \onepass.
When a \circuit.\type$= \times$ node is visited, both inputs are visited.
The algorithm computes two properties: the set of all variable leaf nodes visited, and the product of the signs of visited coefficient leaf nodes.
%
We will assume the TreeSet data structure to maintain sets with logarithmic time insertion and linear time traversal of its elements.
While we would like to take advantage of the space efficiency gained in using a circuit \circuit instead an expression tree \etree, we do not know that such a method exists when computing a sample of the input polynomial representation.

The efficiency gains of circuits over trees is found in the capability of circuits to only require space for each \emph{distinct} term in the compressed representation.  This saves space in such polynomials containing non-distinct terms multiplied or added to each other, e.g., $x^4$.  However, to avoid biased sampling, it is imperative to sample from both inputs of a multiplication gate, independently, which is indeed the approach of \sampmon.

\subsection{Proof of \sampmon (\Cref{lem:sample})}\label{sec:proof-sample-monom}
\begin{proof}

We first need to show that $\sampmon$ samples a valid monomial $\encMon$ by sampling and returning a set of variables $\monom$, such that $(\monom, \coef)$ is in $\expansion{\circuit}$ and $\encMon$ is indeed a monomial of the $\rpoly\inparen{\vct{X}}$ encoded in \circuit.  We show this via induction over the depth of \circuit.
For the base case, let the depth $d$ of $\circuit$ be $1$.  We have that the single gate is either a constant $\coef$ for which by line~\ref{alg:sample-num-return} we return $\{~\}$, or we have that $\circuit.\type = \var$ and $\circuit.\val = x$, and  by line~\ref{alg:sample-var-return} we return $\{x\}$.  By \cref{def:expand-circuit}, both cases return a valid $\monom$ for some $(\monom, \coef)$ from $\expansion{\circuit}$, and the base case is proven.

\AH{I think it is slightly confusing to say that depth $= 0$ in view of the definition of depth in S.4.  To say $k = 0$ is also strange, since, for a single join, we have that $k = 2$.}

For the inductive hypothesis, assume that for $d \leq k$ for some $k \geq 1$, that it is indeed the case that $\sampmon$ returns a valid monomial.

For the inductive step, let us take a circuit $\circuit$ with $d = k + 1$.  Note that each input has depth $d - 1 \leq k$, and by inductive hypothesis both of them sample a valid monomial.  Then the sink can be either a $\circplus$ or $\circmult$ gate.  For the case when $\circuit.\type = \circplus$, line~\ref{alg:sample-plus-bsamp} of $\sampmon$ will choose one of the inputs of the source.  By inductive hypothesis it is the case that some valid monomial is being randomly sampled from each of the inputs.  Then it follows when $\circuit.\type = \circplus$ that a valid monomial is sampled by $\sampmon$.  When the $\circuit.\type = \circmult$, line~\ref{alg:sample-times-union} computes the set union of the monomials returned by the two inputs of the sink, and it is trivial to see by \cref{def:expand-circuit} that $\encMon$ is a valid monomial encoded by some $(\monom, \coef)$ of $\expansion{\circuit}$.

We will next prove by induction on the depth $d$ of $\circuit$ that for $(\monom,\coef) \in \expansion{\circuit}$, $\monom$ is sampled with a probability $\frac{|\coef|}{\abs{\circuit}\polyinput{1}{1}}$.

For the base case $d = 1$, by definition~\ref{def:circuit} we know that the $\size\inparen{\circuit} = 1$ and \circuit.\type$=$ \tnum or \var.  For either case, the probability of the value returned is $1$ since there is only one value to sample from.  When \circuit.\val $= x$, the algorithm always return the variable set $\{x\}$.  When $\circuit.\type = \tnum$, \sampmon will always return the variable set $\emptyset$.

\AH{I don't think this is technically right, since \sampmon returns a tuple of two values.}

For the inductive hypothesis, assume that for $d \leq k$ and $k \geq 1$ $\sampmon$ indeed returns $\monom$ in $(\monom, \coef)$ of $\expansion{\circuit}$ with probability $\frac{|\coef|}{\abs{\circuit}\polyinput{1}{1}}$.

We prove now for $d = k + 1$ the inductive step holds.  It is the case that the sink of $\circuit$ has two inputs $\circuit_\linput$ and $\circuit_\rinput$.  Since $\circuit_\linput$ and $\circuit_\rinput$ are both depth $d - 1 \leq k$, by inductive hypothesis, $\sampmon$ will return $\monom_\linput$ in $(\monom_\lchild, \coef_\lchild)$ of $\expansion{\circuit_\linput}$ and $\monom_\rinput$ in $(\monom_\rchild, \coef_\rchild)$ of $\expansion{\circuit_\rinput}$, from $\circuit_\linput$ and $\circuit_\rinput$ with probability $\frac{|\coef_\lchild|}{\abs{\circuit_\linput}\polyinput{1}{1}}$ and $\frac{|\coef_\rchild|}{\abs{\circuit_\rinput}\polyinput{1}{1}}$.

Consider the case when $\circuit.\type = \circmult$.  For the term $(\monom, \coef)$ from $\expansion{\circuit}$ that is being sampled it is the case that $\monom = \monom_\lchild \cup \monom_\rchild$, where $\monom_\lchild$ is coming from $\circuit_\linput$ and $\monom_\rchild$ from $\circuit_\rinput$.  The probability that \sampmon$(\circuit_{\lchild})$ returns $\monom_\lchild$ is $\frac{|\coef_{\monom_\lchild}|}{|\circuit_\linput|(1,\ldots, 1)}$ and $\frac{|\coef_{\monom_\rchild}|}{\abs{\circuit_\rinput}\polyinput{1}{1}}$ for $\monom_\rchild$.  Since both $\monom_\lchild$ and $\monom_\rchild$ are sampled with independent randomness, the final probability for sample $\monom$ is then $\frac{|\coef_{\monom_\lchild}| \cdot |\coef_{\monom_\rchild}|}{|\circuit_\linput|(1,\ldots, 1) \cdot |\circuit_\rinput|(1,\ldots, 1)}$.  For $(\monom, \coef)$ in $\expansion{\circuit}$, by \cref{def:expand-circuit} it is indeed the case that $|\coef| = |\coef_{\monom_\lchild}| \cdot |\coef_{\monom_\rchild}|$ and that (as shown in \cref{eq:T-all-ones}) $\abs{\circuit}(1,\ldots, 1) = |\circuit_\linput|(1,\ldots, 1) \cdot |\circuit_\rinput|(1,\ldots, 1)$, and therefore $\monom$ is sampled with correct probability $\frac{|\coef|}{\abs{\circuit}(1,\ldots, 1)}$.

For the case when $\circuit.\type = \circplus$, \sampmon ~will sample $\monom$ from one of its inputs.  By inductive hypothesis we know that any $\monom_\lchild$ in $\expansion{\circuit_\linput}$ and any $\monom_\rchild$ in $\expansion{\circuit_\rinput}$ will both be sampled with correct probability $\frac{|\coef_{\monom_\lchild}|}{\abs{\circuit_{\lchild}}(1,\ldots, 1)}$ and $\frac{|\coef_{\monom_\rchild}|}{|\circuit_\rinput|(1,\ldots, 1)}$, where either $\monom_\lchild$ or $\monom_\rchild$ will equal $\monom$, depending on whether $\circuit_\linput$ or $\circuit_\rinput$ is sampled.  Assume that $\monom$ is sampled from $\circuit_\linput$, and note that a symmetric argument holds for the case when $\monom$ is sampled from $\circuit_\rinput$.  Notice also that the probability of choosing $\circuit_\linput$ from $\circuit$ is $\frac{\abs{\circuit_\linput}\polyinput{1}{1}}{\abs{\circuit_\linput}\polyinput{1}{1} + \abs{\circuit_\rinput}\polyinput{1}{1}}$ as computed by $\onepass$.  Then, since $\sampmon$ goes top-down, and each sampling choice is independent (which follows from the randomness in the root of $\circuit$ being independent from the randomness used in its subtrees), the probability for $\monom$ to be sampled from $\circuit$ is equal to the product of the probability that $\circuit_\linput$ is sampled from $\circuit$ and $\monom$ is sampled in $\circuit_\linput$, and
\begin{align*}
&\probOf(\sampmon(\circuit) = \monom) = \\
&\probOf(\sampmon(\circuit_\linput) = \monom) \cdot \probOf(SampledChild(\circuit) = \circuit_\linput)\\
&= \frac{|\coef_\monom|}{|\circuit_\linput|(1,\ldots, 1)} \cdot \frac{\abs{\circuit_\linput}(1,\ldots, 1)}{|\circuit_\linput|(1,\ldots, 1) + |\circuit_\rinput|(1,\ldots, 1)}\\
&= \frac{|\coef_\monom|}{\abs{\circuit}(1,\ldots, 1)},
\end{align*}
and we obtain the desired result.

Lastly, we show by simple induction of the depth $d$ of \circuit that \sampmon indeed returns the correct sign value of $\coef$ in $(\monom, \coef)$.

In the base case, $\circuit.\type = \tnum$ or $\var$.  For the former by~\Cref{alg:sample-num-leaf}, \sampmon correctly returns the sign value of the gate.  For the latter by~\Cref{alg:sample-var-return}, \sampmon returns the correct sign of $1$, since a variable is a neutral element, and $1$ is the multiplicative identity, whose product with another sign element will not change that sign element.

For the inductive hypothesis, we assume for a circuit of depth $d \leq k$ and $k \geq 1$ that the algorithm correctly returns the sign value of $\coef$.

Similar to before, for a depth \AH{Why do we use $d = k + 1$ for the inductive cases above?}
$d \leq k + 1$, it is true that $\circuit_\linput$ and $\circuit_\rinput$ both return the correct sign of $\coef$.  For the case that $\circuit.\type = \circmult$, the sign value of both inputs are multiplied, which is the correct behavior by \cref{def:expand-circuit}.  When $\circuit.\type = \circplus$, only one input of $\circuit$ is sampled, and the algorithm  returns the correct sign value of $\coef$ by inductive hyptothesis.


\paragraph*{Run-time Analysis}
It is easy to check that except for lines~\ref{alg:sample-plus-bsamp} and~\ref{alg:sample-times-union}, all lines take $O(1)$ time.  Consider an execution of \cref{alg:sample-times-union}. We note that we will be adding a given set of variables to some set at most once: since the sum of the sizes of the sets at a given level is at most $\degree(\circuit)$, each gate visited takes $O(\log{\degree(\circuit)})$.  For \Cref{alg:sample-plus-bsamp}, note that we pick $\circuit_\linput$ with probability $\frac a{a+b}$ where $a=\circuit.\vari{Lweight}$ and $b=\circuit.\vari{Rweight}$. We can implement this step by picking a random number $r\in[a+b]$ and then checking if $r\le a$. It is easy to check that $a+b\le \abs{\circuit}(1,\dots,1)$. This means we need to add and compare $\log{\abs{\circuit}(1,\ldots, 1)}$-bit numbers, which can certainly be done in time $\multc{\log\left(\abs{\circuit(1\ldots, 1)}\right)}{\log{\size(\circuit)}}$ (note that this is an over-estimate).
Denote \cost(\circuit) (\Cref{eq:cost-sampmon}) to be an upper bound of the number of gates visited by \sampmon.  Then the runtime is $O\left(\cost(\circuit)\cdot \log{\degree(\circuit)}\cdot \multc{\log\left(\abs{\circuit(1\ldots, 1)}\right)}{\log{\size(\circuit)}}\right)$.

\AH{We don't really justify why we can bound the number of recursive calls as we claim in what follows.}
Since there can be at most $k = \degree\inparen{\circuit}$ nodes visited at every level of the circuit, and each of the first $d - 1$ levels (going from the sink to the source nodes) will contain at least one recursive call, we can upperbound the number of recursive calls in $\sampmon$ by $O\left((\degree(\circuit) + 1)\right.$$\left.\cdot\right.$ $\left.\depth(\circuit)\right)$, which by the above will prove the claimed runtime of~\Cref{lem:sample}.  %The reason for this is that the number of recursive calls is exactly the number of calls to lines~\ref{alg:sample-plus-bsamp} and~\ref{alg:sample-times-union}.

Let \cost$(\cdot)$ be a function that models an upper bound on the number of gates that can be visited in the run of \sampmon.  We define \cost$(\cdot)$ recursively as follows.

\begin{equation}
	\cost(\circuit) =
		\begin{cases}
			1 + \cost(\circuit_\linput) + \cost(\circuit_\rinput) & \textbf{if } \text{\circuit.\type = }\circmult\\
			1 + \max\left(\cost(\circuit_\linput), \cost(\circuit_\rinput)\right) & \textbf{if } \text{\circuit.\type = \circplus}\\
			1 & \textbf{otherwise}
		\end{cases}\label{eq:cost-sampmon}
\end{equation}

First note that the number of gates visited in \sampmon is $\leq\cost(\circuit)$.  To show that \cref{eq:cost-sampmon} upper bounds the number of nodes visited by \sampmon, note that when \sampmon visits a gate such that \circuit.\type $ =\circmult$, line~\ref{alg:sample-times-for-loop} visits each input of \circuit, as defined in (\ref{eq:cost-sampmon}).  For the case when \circuit.\type $= \circplus$, line~\ref{alg:sample-plus-bsamp} visits exactly one of the input gates, which may or may not be the subcircuit with the maximum number of gates traversed, which makes \cost$(\cdot)$ an upperbound.  Finally, it is trivial to see that when \circuit.\type $\in \{\var, \tnum\}$, i.e., a source gate, that only one gate is visited.

We prove the following inequality holds.
\begin{equation}
2\left(\degree(\circuit) + 1\right) \cdot \depth(\circuit) + 1 \geq \cost(\circuit)\label{eq:strict-upper-bound}
\end{equation}

Note that \cref{eq:strict-upper-bound} implies the claimed runtime.
\AH{If the claimed runtime is from the first paragraph, then I don't follow.}

We prove \cref{eq:strict-upper-bound} for the number of gates traversed in \sampmon using induction over $\depth(\circuit)$.  Recall how degree is defined in \cref{def:degree}.
\AH{In the following, by~\Cref{def:size-depth}, we would have that $\depth\inparen{\circuit} = 1$ \emph{technically}.}
For the base case $\degree(\circuit) \in \inset{0, 1}, \depth(\circuit) = 1$, $\cost(\circuit) = 1$, and it is trivial to see that the inequality $2\degree(\circuit) \cdot \depth(\circuit) + 1 \geq \cost(\circuit)$ holds.

For the inductive hypothesis, we assume the bound holds for any circuit where $\ell \geq \depth(\circuit) \geq 0$.
Now consider the case when \sampmon has an arbitrary circuit \circuit input with $\depth(\circuit) = \ell + 1$.  By definition \circuit.\type $\in \{\circplus, \circmult\}$. Note that since $\depth(\circuit) \geq 2$, \circuit must have input(s).  Further we know that by the inductive hypothesis the inputs $\circuit_i$ for $i \in \{\linput, \rinput\}$ of the sink gate \circuit uphold the bound
\begin{equation}
2\left(\degree(\circuit_i) + 1\right)\cdot \depth(\circuit_i) + 1 \geq \cost(\circuit_i).\label{eq:ih-bound-cost}
\end{equation}
In particular, since for any $i$, \cref{eq:ih-bound-cost} holds, then it immediately follows that an inequality whose operands consist of a sum of the aforementioned inequalities must also hold.  This is readily seen in the inequality of \cref{eq:times-middle} and \cref{eq:times-rhs}, where $2\inparen{\degree(\circuit_\linput) + 1}\cdot \depth(\circuit_\linput) \geq \cost(\circuit_\linput)$, likewise for $\circuit_\rinput$, and $1\geq 1$.
It is also true that $\depth(\circuit_\linput) \leq \depth(\circuit) - 1$ and $\depth(\circuit_\rinput) \leq \depth(\circuit) - 1$.

If \circuit.\type $= \circplus$, then $\degree(\circuit) = \max\left(\degree(\circuit_\linput), \degree(\circuit_\rinput)\right)$.  Otherwise \circuit.\type = $\circmult$ and $\degree(\circuit) = \degree(\circuit_\linput) + \degree(\circuit_\rinput) + 1$.  In either case it is true that $\depth(\circuit) = \max\inparen{\depth(\circuit_\linput), \depth(\circuit_\rinput)} + 1$.

If \circuit.\type $= \circmult$, then, by \cref{eq:cost-sampmon}, substituting values, the following should hold,
\begin{align}
&2\left(\degree(\circuit_\linput) + \degree(\circuit_\rinput) + 2\right) \cdot \left(\max(\depth(\circuit_\linput), \depth(\circuit_\rinput)) + 1\right) + 1 \label{eq:times-lhs}\\
&\qquad\geq 2\left(\degree(\circuit_\linput) + 1\right) \cdot \depth(\circuit_\linput) + 2\left(\degree(\circuit_\rinput) + 1\right)\cdot \depth(\circuit_\rinput) + 3\label{eq:times-middle} \\
&\qquad\geq 1 + \cost(\circuit_\linput) + \cost(\circuit_\rinput) = \cost(\circuit) \label{eq:times-rhs}.
\end{align}

To prove (\ref{eq:times-middle}), first, \cref{eq:times-lhs} expands to,
\begin{equation}
2\degree(\circuit_\linput)\cdot\depth_{\max} + 2\degree(\circuit_\rinput)\cdot\depth_{\max} + 4\depth_{\max} + 2\degree(\circuit_\linput) +  2\degree(\circuit_\rinput) + 4 + 1\label{eq:times-lhs-expanded}
\end{equation}
where $\depth_{\max}$ is used to denote the maximum depth of the two input subcircuits.  \Cref{eq:times-middle} expands to
\begin{equation}
2\degree(\circuit_\linput)\cdot\depth(\circuit_\linput) + 2\depth(\circuit_\linput) + 2\degree(\circuit_\rinput)\cdot\depth(\circuit_\rinput) + 2\depth(\circuit_\rinput) + 3\label{eq:times-middle-expanded}
\end{equation}

Putting \Cref{eq:times-lhs-expanded} and \Cref{eq:times-middle-expanded} together we get
\begin{align}
&2\degree(\circuit_\linput)\cdot\depth_{\max} + 2\degree(\circuit_\rinput)\cdot\depth_{\max} + 4\depth_{\max} + 2\degree(\circuit_\linput) +  2\degree(\circuit_\rinput) + 5\nonumber\\
&\qquad\geq 2\degree(\circuit_\linput)\cdot\depth(\circuit_\linput) + 2\degree(\circuit_\rinput)\cdot\depth(\circuit_\rinput) + 2\depth(\circuit_\linput) + 2\depth(\circuit_\rinput) + 3.\label{eq:times-lhs-middle}
\end{align}

%Since the following is always true,
%\begin{align*}
%&2\degree(\circuit_\linput)\cdot\depth_{\max} + 2\degree(\circuit_\rinput)\cdot\depth_{\max} + 4\depth_{\max}  + 5\\
%&\qquad \geq 2\degree(\circuit_\linput)\cdot\depth(\circuit_\linput) + 2\degree(\circuit_\rinput)\cdot\depth(\circuit_\rinput) + 2\depth(\circuit_\linput) + 2\depth(\circuit_\rinput) + 3,
%\end{align*}
%then it is the case that \Cref{eq:times-lhs-middle} is \emph{always} true.

Now to justify (\ref{eq:times-rhs}) which holds for the following reasons.  First, \cref{eq:times-rhs}
is the result of \Cref{eq:cost-sampmon} when $\circuit.\type = \circmult$.  \Cref{eq:times-middle}
is then produced by substituting the upperbound of (\ref{eq:ih-bound-cost}) for each $\cost(\circuit_i)$, trivially establishing the upper bound of (\ref{eq:times-rhs}).  This proves \cref{eq:strict-upper-bound} for the $\circmult$ case.

For the case when \circuit.\type $= \circplus$, substituting values yields
\begin{align}
&2\left(\max(\degree(\circuit_\linput), \degree(\circuit_\rinput)) + 1\right) \cdot \left(\max(\depth(\circuit_\linput), \depth(\circuit_\rinput)) + 1\right) +1\label{eq:plus-lhs-inequality}\\
&\qquad \geq \max\left(2\left(\degree(\circuit_\linput) + 1\right) \cdot \depth(\circuit_\linput) + 1, 2\left(\degree(\circuit_\rinput) + 1\right) \cdot \depth(\circuit_\rinput) +1\right) + 1\label{eq:plus-middle}\\
&\qquad \geq 1 + \max(\cost(\circuit_\linput), \cost(\circuit_\rinput)) = \cost(\circuit)\label{eq:plus-rhs}
\end{align}

To prove  (\ref{eq:plus-middle}), \cref{eq:plus-lhs-inequality} expands to
\begin{equation}
2\degree_{\max}\depth_{\max} + 2\degree_{\max} + 2\depth_{\max} + 2 + 1.\label{eq:plus-lhs-expanded}
\end{equation}

Since $\degree_{\max} \cdot \depth_{\max} \geq \degree(\circuit_i)\cdot \depth(\circuit_i),$ the following upperbounds the expansion of \cref{eq:plus-middle}:
\begin{equation}
2\degree_{\max}\depth_{\max} + 2\depth_{\max} + 2
\label{eq:plus-middle-expanded}
\end{equation}
Putting it together we obtain the following for (\ref{eq:plus-middle}):
\begin{align}
&2\degree_{\max}\depth_{\max} + 2\degree_{\max} + 2\depth_{\max} + 3\nonumber\\
&\qquad \geq 2\degree_{\max}\depth_{\max} + 2\depth_{\max} + 2, \label{eq:plus-upper-bound-final}
\end{align}
where it can be readily seen that the inequality stands and (\ref{eq:plus-upper-bound-final}) follows.  This proves (\ref{eq:plus-middle}).

Similar to the case of $\circuit.\type = \circmult$, (\ref{eq:plus-rhs}) follows by equations $(\ref{eq:cost-sampmon})$ and $(\ref{eq:ih-bound-cost})$.

This proves (\ref{eq:strict-upper-bound}) as desired.
\qed
\end{proof}