In order to prove \Cref{lem:approx-alg}, we will need to argue the correctness of \approxq, which relies on the correctness of auxiliary algorithms \onepass and \sampmon.
$\onepass$ guarantees two post-conditions: First, for each subcircuit $\vari{S}$ of $\circuit$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type=\circplus$, \subcircuit.\lwght$=\frac{\abs{\subcircuit_\linput}(1,\ldots, 1)}{\abs{\subcircuit}(1,\ldots, 1)}$ and likewise for \subcircuit.\rwght.
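As a small illustration of these post-conditions (hypothetical values, not taken from the running example), suppose $\vari{S}.\type=\circplus$ with $\abs{\subcircuit_\linput}(1,\ldots, 1)=2$ and $\abs{\subcircuit_\rinput}(1,\ldots, 1)=3$. Since a $\circplus$ gate adds the values of its two inputs, the first post-condition would give
\[
\vari{S}.\vari{partial}=\abs{\subcircuit_\linput}(1,\ldots, 1)+\abs{\subcircuit_\rinput}(1,\ldots, 1)=2+3=5,
\]
and the second would give \subcircuit.\lwght$=\frac{2}{5}$ and \subcircuit.\rwght$=\frac{3}{5}$, so the two weights sum to $1$.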
To prove correctness of \Cref{alg:mon-sam}, we only use the following fact that follows from the above lemma: for the modified circuit ($\circuit_{\vari{mod}}$), $\circuit_{\vari{mod}}.\vari{partial}=\abs{\circuit}(1,\dots,1)$.
$$O(\log{k}\cdot k \cdot\depth(\circuit)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log{\size(\circuit)}})$$
where $k =\degree(\circuit)$. The function returns every $\left(\monom, \mathrm{sign}(\coef)\right)$ for $(\monom, \coef)\in\expansion{\circuit}$ with probability $\frac{|\coef|}{\abs{\circuit}(1,\ldots, 1)}$.
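For intuition, consider a toy circuit (an assumption made only for this illustration) whose expansion $\expansion{\circuit}$ consists of the two pairs $(X_1,-2)$ and $(X_2,3)$, so that $\abs{\circuit}(1,\ldots, 1)=2+3=5$. Under the guarantee above, the output distribution would be
\[
\left(X_1,-1\right)\mbox{ with probability }\frac{2}{5},\qquad\left(X_2,+1\right)\mbox{ with probability }\frac{3}{5}.
\]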
For any $\circuit$ with $\degree(poly(|\circuit|))= k$, \Cref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that
$$\probOf\left(\left|\vari{acc}-\rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error\cdot\abs{\circuit}(1,\ldots, 1)\right)\leq\conf,$$
in $O\left(\left(\size(\circuit)+\frac{\log{\frac{1}{\conf}}}{\error^2}\cdot k \cdot\log{k}\cdot\depth(\circuit)\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log{\size(\circuit)}}\right)$ time.
which achieves the claimed error bound on $\mathcal{E}$ (\vari{acc}) trivially due to the assignment to $\error'$ and \cref{lem:mon-samp}, since $\error' \cdot\abs{\circuit}(1,\ldots, 1)=\error\cdot\frac{\rpoly(1,\ldots, 1)}{\abs{\circuit}(1,\ldots, 1)}\cdot\abs{\circuit}(1,\ldots, 1)=\error\cdot\rpoly(1,\ldots, 1)$.
Consider now the random variables $\randvar_1,\dots,\randvar_\numsamp$, where each $\randvar_\vari{i}$ is the value of $\vari{Y}_{\vari{i}}$ in \cref{alg:mon-sam} after \cref{alg:mon-sam-product} is executed. In particular, note that we have
where in the first equality we use the fact that $\vari{sgn}_{\vari{i}}\cdot\abs{\coef}=\coef$ and the second equality follows from \Cref{eq:tilde-Q-bi} with $X_i$ substituted by $\prob_i$.
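To make the expectation concrete (a toy instance with hypothetical values), take a circuit whose expansion consists of the pairs $(X_1,-2)$ and $(X_2,3)$, so that $\abs{\circuit}(1,\ldots, 1)=5$ and $\rpoly(\prob_1,\prob_2)=-2\prob_1+3\prob_2$. The monomial $X_1$ is sampled with probability $\frac{2}{5}$ and sign $-1$, and $X_2$ with probability $\frac{3}{5}$ and sign $+1$, so each sample has expected value
\[
\frac{2}{5}\cdot(-1)\cdot\prob_1+\frac{3}{5}\cdot(+1)\cdot\prob_2=\frac{-2\prob_1+3\prob_2}{5}=\frac{\rpoly(\prob_1,\prob_2)}{\abs{\circuit}(1,\ldots, 1)}.
\]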
Hoeffding's inequality states that if each $\randvar_i$ (all of which are independent) always lies in the interval $[a_i, b_i]$, then it is true that
Line~\ref{alg:mon-sam-sample} shows that $\vari{sgn}_\vari{i}$ has a value in $\{-1, 1\}$ that is multiplied by $O(k)$ values $\prob_i\in[0, 1]$, which implies that the range of each $\randvar_i$ is $[-1, 1]$.
For the claimed probability bound of $\probOf\left(\left|\vari{acc}-\rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error\cdot\abs{\circuit}(1,\ldots, 1)\right)\leq\conf$, note that in the algorithm, \vari{acc} is exactly $\empmean\cdot\abs{\circuit}(1,\ldots, 1)$. Multiplying the rest of the terms by the additional factor $\abs{\circuit}(1,\ldots, 1)$ yields the said bound.
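As a sanity check on the number of samples (a sketch assuming the standard two-sided form of Hoeffding's inequality for the empirical mean; the explicit threshold below is derived here and is not quoted from the algorithm), note that with $[a_i,b_i]=[-1,1]$ every term satisfies $(b_i-a_i)^2=4$, so
\[
2\exp\left(-\frac{2\samplesize^2\error^2}{\sum_{i=1}^{\samplesize}(b_i-a_i)^2}\right)=2\exp\left(-\frac{2\samplesize^2\error^2}{4\samplesize}\right)=2\exp\left(-\frac{\samplesize\error^2}{2}\right),
\]
which is at most $\conf$ whenever $\samplesize\geq\frac{2\ln(2/\conf)}{\error^2}$.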
The runtime of the algorithm is dominated first by \Cref{alg:mon-sam-onepass} (which by \Cref{lem:one-pass} takes time $O\left({\size(\circuit)}\cdot\multc{\log\left(\abs{\circuit}^2(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$) and then by $\samplesize$ iterations of the loop in \Cref{alg:sampling-loop}. Each iteration's runtime is dominated by the call to \sampmon in \Cref{alg:mon-sam-sample} (which by \Cref{lem:sample} takes $O\left(\log{k}\cdot k \cdot{\depth(\circuit)}\cdot\multc{\log\left(\abs{\circuit}^2(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$ time) and the check in \Cref{alg:check-duplicate-block}, which takes $O(k\log{k})$ time by the following argument: we sort the $O(k)$ variables by their block IDs and then scan for a duplicate block ID. Combining these bounds gives the desired overall runtime.
Substituting this bound into the runtime bound of \Cref{lem:approx-alg} gives the first claimed runtime. The final runtime of $O_k\left(\frac1{\eps^2}\cdot\size(\circuit)\cdot\log{\frac{1}{\conf}}\cdot\multc{\log\left(\abs{\circuit}^2(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$ follows by noting that $\depth({\circuit})\le\size({\circuit})$ and absorbing all factors that depend only on $k$.
%\paragraph{Sufficient condition for $\abs{\circuit}(1,\ldots, 1)$ to be size $O(N)$}
%For our runtime results to be relevant, it must be the case that the sum of the coefficients computed by \onepass is indeed size $O(N)$ since there are $O(\log{N})$ bits in the RAM model where $N$ is the size of the input. The size of the input here is \size(\circuit). We show that when \size$(\circuit_\linput) = N_\linput$, \size$(\circuit_\rinput) = N_\rinput$, where $N_\linput + N_\rinput \leq N$, this is indeed the case.
For the base case, we have that \depth(\circuit) $=0$, and there can only be one node which must contain a coefficient (or constant) of $1$. In this case, $\abs{\circuit}(1,\ldots, 1)=1$, and \size(\circuit) $=1$, and it is true that $\abs{\circuit}(1,\ldots, 1)=1\leq N^{k+1}=1^{1}=1$.
For the inductive hypothesis, assume that for some $\ell \geq 0$, every circuit \circuit with $\depth(\circuit)\leq\ell$ satisfies $\abs{\circuit}(1,\ldots, 1)\leq N^{\degree(\circuit)+1}$.% for $k \geq 1$ when \depth(C) $\geq 1$.
For the inductive step we consider a circuit \circuit such that $\depth(\circuit)=\ell+1$. The sink can only be either a $\circmult$ or a $\circplus$ gate. Let $k_\linput, k_\rinput$ denote \degree($\circuit_\linput$) and \degree($\circuit_\rinput$) respectively. Consider the case when the sink node is $\circmult$.
%We derive the upperbound of \Cref{eq:sumcoeff-times-upper} by noting that the maximum value of the LHS occurs when both the base and exponent are maximized.
In the above, the first inequality follows from the inductive hypothesis (and the fact that the size of either subtree is at most $N-1$), and \Cref{eq:sumcoeff-times-upper} follows by noting that, by \cref{def:degree}, for $k =\degree(\circuit)$ we have $k=k_\linput+k_\rinput+1$.
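Written out under the hypotheses just stated (a reconstruction sketch that uses only the inductive hypothesis, the bound of $N-1$ on the size of either subtree, and $k=k_\linput+k_\rinput+1$), the chain of inequalities for a $\circmult$ sink reads
\[
\abs{\circuit}(1,\ldots, 1)=\abs{\circuit_\linput}(1,\ldots, 1)\cdot\abs{\circuit_\rinput}(1,\ldots, 1)\leq(N-1)^{k_\linput+1}\cdot(N-1)^{k_\rinput+1}=(N-1)^{k+1}\leq N^{k+1}.
\]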
In the above, the first inequality follows from the inductive hypothesis and \cref{def:degree} (which implies that $k_\linput,k_\rinput\le k$). Note that the RHS of this inequality is maximized when the base and exponent of one of the terms are maximized. The second inequality follows from this fact as well as the fact that, since $\circuit$ is a tree, we have $N_\linput+N_\rinput=N-1$ and, lastly, the fact that $k\ge0$. This completes the proof.
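For a quick numeric check of the $\circplus$ case (hypothetical values chosen only for illustration), take $N_\linput=2$ and $N_\rinput=3$, so that $N=6$, and let $k=1$. Then
\[
N_\linput^{k+1}+N_\rinput^{k+1}=4+9=13\leq(N_\linput+N_\rinput)^{k+1}=25\leq N^{k+1}=36,
\]
which is consistent with the general fact that $a^{m}+b^{m}\leq(a+b)^{m}$ for $a,b\geq0$ and $m\geq1$.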
The upper bound in \Cref{lem:val-ub} for the general case is a simple variant of the above proof (but we present a proof sketch of the bound below for completeness):
We use the same notation as in the proof of \Cref{lem:C-ub-tree}. We will prove by induction on $\depth(\circuit)$ that $\abs{\circuit}(1,\ldots, 1)\leq2^{2^k\cdot N }$. The base case argument is similar to that in the proof of \Cref{lem:C-ub-tree}. In the inductive case we have that $N_\linput,N_\rinput\le N-1$.
In the above, the first inequality follows from the inductive hypothesis, while the second inequality follows from the fact that $k_\linput,k_\rinput\le k-1$ and $N_\linput, N_\rinput\le N-1$, where we substitute the upper bound into every respective term.
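Spelled out (a reconstruction sketch, assuming this is the case of a $\circmult$ sink, which is where the bounds $k_\linput,k_\rinput\le k-1$ apply), the exponents combine as
\[
\abs{\circuit}(1,\ldots, 1)\leq2^{2^{k_\linput}\cdot N_\linput}\cdot2^{2^{k_\rinput}\cdot N_\rinput}\leq2^{2\cdot2^{k-1}\cdot(N-1)}=2^{2^{k}\cdot(N-1)}\leq2^{2^{k}\cdot N}.
\]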
In the above the first inequality follows from the inductive hypothesis while the second inequality follows from the facts that $k_\linput,k_\rinput\le k$ and $N_\linput,N_\rinput\le N-1$. The final inequality follows from the fact that $k\ge0$.
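Spelled out (a reconstruction sketch, assuming this is the case of a $\circplus$ sink, where the two input values are added), the bound reads
\[
\abs{\circuit}(1,\ldots, 1)\leq2^{2^{k_\linput}\cdot N_\linput}+2^{2^{k_\rinput}\cdot N_\rinput}\leq2\cdot2^{2^{k}\cdot(N-1)}=2^{2^{k}\cdot(N-1)+1}\leq2^{2^{k}\cdot N},
\]
where the last step holds because $2^{k}\cdot(N-1)+1\leq2^{k}\cdot N$ is equivalent to $1\leq2^{k}$, that is, to $k\ge0$.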
Finally, we consider the case when $\circuit$ encodes the run of the algorithm from~\cite{DBLP:conf/pods/KhamisNR16} on an FAQ query. We cannot handle the full generality of an FAQ query, but we can handle an FAQ query that has a ``core'' join query on $k$ relations in which a subset of the $k$ attributes is then ``summed'' out (e.g. the sum could arise from projecting out a subset of attributes from the join query). While the algorithm of~\cite{DBLP:conf/pods/KhamisNR16} essentially figures out when to `push in' the sums, in our case we only care about $\abs{\circuit}(1,\dots,1)$, so we will consider the obvious circuit that computes the ``inner join'' using a worst-case optimal join (WCOJ) algorithm like~\cite{NPRR} and then adds in the addition gates. The basic idea is very simple: we will argue that there are at most $\size(\circuit)^k$ tuples in the join output (each having a value of $1$ in $\abs{\circuit}(1,\dots,1)$). Then the largest value we can see in $\abs{\circuit}(1,\dots,1)$ is obtained by summing up these at most $\size(\circuit)^k$ values of $1$. Note that this immediately implies the claimed bound in \Cref{lem:val-ub}.
We now sketch the argument for the claim about the join query above. First, we note that the computation of a WCOJ algorithm like~\cite{NPRR} can be expressed as a circuit with {\em multiple} sinks (one for each output tuple). Note that the annotation corresponding to an output tuple $\mathbf{t}$ in $\circuit$ is the polynomial $\prod_{e\in E} R(\pi_e(\mathbf{t}))$ (where $E$ indexes the set of relations). It is easy to see that in this case the value of $\mathbf{t}$ in $\abs{\circuit}(1,\dots,1)$ will be $1$ (by multiplying $1$ $k$ times). The claim on the number of output tuples follows from the trivial bound obtained by multiplying the input size bounds: each relation has at most $n\le\size(\circuit)$ tuples, and hence we get an overall bound of $n^k\le\size(\circuit)^k$. Note that we did not really use anything about the WCOJ algorithm except for the fact that $\circuit$ for the join part is built only of multiplication gates. In fact, we do not need the better WCOJ join size bounds either (since we used the trivial $n^k$ bound). As a final remark, we note that we can build the circuit for the join part by running, say, the algorithm from~\cite{DBLP:conf/pods/KhamisNR16} on an FAQ query that just has the join query but in which each tuple is annotated with the corresponding variable $X_i$ (i.e. the semi-ring for the FAQ query is $\mathbb{N}[\mathbf{X}]$).
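As a small illustration of this counting argument (hypothetical sizes chosen only for the example), take $k=2$ relations with $n=3$ tuples each. The join has at most $n^k=9$ output tuples, and each contributes $1\cdot1=1$ once every variable is set to $1$, so summing over all of them gives
\[
\abs{\circuit}(1,\dots,1)\leq\underbrace{1+\cdots+1}_{n^k=9}=9\leq\size(\circuit)^k.
\]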