In \Cref{sec:hard}, we showed that \Cref{prob:bag-pdb-poly-expected} cannot be solved in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime. In light of this, we desire to produce an approximation algorithm that runs in time $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$. We do this by showing the result via circuits,
such that our approximation algorithm for this problem runs in $\bigO{\abs{\circuit}}$ for a very broad class of circuits (thus affirming~\Cref{prob:intro-stmt}; see the discussion after \Cref{lem:val-ub} for more).
The following approximation algorithm applies to bag query semantics over both
\abbrCTIDB lineage polynomials and general \abbrBIDB lineage polynomials in practice, where for the latter we note that a $1$-\abbrTIDB is equivalently a \abbrBIDB (all blocks have size $1$). Our experimental results (see~\Cref{app:subsec:experiment}), which use queries from the PDBench benchmark~\cite{pdbench}, show a low $\gamma$ (see~\Cref{def:param-gamma}), supporting the notion that our bounds hold for general \abbrBIDB in practice.
Corresponding proofs and pseudocode for all formal statements and algorithms
We now introduce definitions and notation related to circuits and polynomials that we will need to state our upper bound results. First we introduce the expansion $\expansion{\circuit}$ of circuit $\circuit$ which % encodes the reduced polynomial for $\polyf\inparen{\circuit}$ and is the basis
For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef\in\domN$.
Later on, we will denote the monomial composed of the variables in $\monom$ as $\encMon$. As an example of $\expansion{\circuit}$, consider $\circuit$ illustrated in \Cref{fig:circuit}. $\expansion{\circuit}$ is then $[(X, 2), (XY, -1), (XY, 4), (Y, -2)]$. This helps us redefine $\rpoly$ (see \Cref{eq:tilde-Q-bi}) in a way that makes our algorithm more transparent.
{\em positive circuit}, denoted $\abs{\circuit}$, is obtained from $\circuit$ as follows. For each leaf node $\ell$ of $\circuit$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$.
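As a sanity check on these two notions, consider a hypothetical circuit computing $(X+2Y)(2X-Y)$. This choice is an assumption made for illustration: it is consistent with the expansion listed above, but the circuit in \Cref{fig:circuit} may differ. Multiplying out without collecting like terms, and recording each monomial as its set of variables, gives
\[
X\cdot 2X\mapsto(X,2),\qquad X\cdot(-Y)\mapsto(XY,-1),\qquad 2Y\cdot 2X\mapsto(XY,4),\qquad 2Y\cdot(-Y)\mapsto(Y,-2).
\]
In the positive circuit the constant $-1$ becomes $1$, so evaluating at the all-ones input yields
\[
\abs{\circuit}(1,\ldots,1)=(1+2\cdot 1)\cdot(2\cdot 1+1)=9=\abs{2}+\abs{-1}+\abs{4}+\abs{-2},
\]
that is, for this example it equals the sum of the absolute values of the coefficients in $\expansion{\circuit}$.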
\begin{Definition}[$\degree(\cdot)$]\label{def:degree}\footnote{Note that the degree of $\polyf(\abs{\circuit})$ is always upper bounded by $\degree(\circuit)$ and the latter can be strictly larger (e.g., consider the case when $\circuit$ multiplies two copies of the constant $1$; here we have $\degree(\circuit)=1$ but the degree of $\polyf(\abs{\circuit})$ is $0$).}
$\degree(\circuit)$ is defined recursively as follows:
\[\degree(\circuit)=
\begin{cases}
\max(\degree(\circuit_\linput),\degree(\circuit_\rinput)) &\text{ if }\circuit.\type=+\\
\degree(\circuit_\linput) + \degree(\circuit_\rinput)+1 &\text{ if }\circuit.\type=\times\\
\begin{Definition}[$\multc{\cdot}{\cdot}$]\footnote{We note that when doing arithmetic operations on the RAM model for input of size $N$, we have that $\multc{O(\log{N})}{O(\log{N})}=O(1)$. More generally we have $\multc{N}{O(\log{N})}=O(N\log{N}\log\log{N})$.}
In a RAM model with a word size of $W$ bits, $\multc{M}{W}$ denotes the complexity of multiplying two integers represented with $M$ bits. (We will assume that for input of size $N$, $W=O(\log{N})$.)
Finally, to get linear runtime results, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
that need to be `canceled' when monomials with dependent variables are removed (\Cref{subsec:one-bidb}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise; further, let $\indicator{\theta}$ also be a boolean function returning true if $\theta$ evaluates to true.
\abbrOneBIDB (recall that all \abbrCTIDB can be reduced to \abbrOneBIDB by~\Cref{def:ctidb-reduct}), we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in\expansion{\circuit}$ with probability proportional
to $\abs{\coef}$ and compute $\vari{Y}=\indicator{\isInd{\encMon}}
and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in \Cref{sec:proofs-approx-alg}).
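
The following worked calculation is a sketch of why the averaging step is reasonable. It rests on two assumptions of ours that are not restated here: that the sum being estimated has the form $\sum_{(\monom,\coef)\in\expansion{\circuit}}\indicator{\isInd{\encMon}}\cdot\coef\cdot\prod_{X_i\in\monom}\prob_i$, and that each sample also carries the sign of $\coef$. Under these assumptions, sampling $(\monom,\coef)$ with probability $\abs{\coef}/\abs{\circuit}(1,\ldots,1)$ gives an expected signed sample of
\[
\sum_{(\monom,\coef)\in\expansion{\circuit}}\frac{\abs{\coef}}{\abs{\circuit}(1,\ldots,1)}\cdot\operatorname{sgn}(\coef)\cdot\indicator{\isInd{\encMon}}\cdot\prod_{X_i\in\monom}\prob_i
=\frac{1}{\abs{\circuit}(1,\ldots,1)}\sum_{(\monom,\coef)\in\expansion{\circuit}}\indicator{\isInd{\encMon}}\cdot\coef\cdot\prod_{X_i\in\monom}\prob_i,
\]
so rescaling the average by $\abs{\circuit}(1,\ldots,1)$ would recover the target sum in expectation. For instance, for the expansion $[(X, 2), (XY, -1), (XY, 4), (Y, -2)]$ with $X$ and $Y$ independent and both probabilities equal to $\frac12$, the target sum is
\[
2\cdot\tfrac12-1\cdot\tfrac14+4\cdot\tfrac14-2\cdot\tfrac12=\tfrac34 .
\]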
%%%%%%%%%%%%%%%%%%%%%%%
%The following results assume input circuit \circuit computed from an arbitrary $\raPlus$ query $\query$ and arbitrary \abbrBIDB $\pdb$. We refer to \circuit as a \abbrBIDB circuit.
%\AH{Verify that the proof for \Cref{lem:approx-alg} doesn't rely on properties of $\raPlus$ or \abbrBIDB.}
Let \circuit be an arbitrary \emph{\abbrOneBIDB} circuit, define $\poly(\vct{X})=\polyf(\circuit)$, let $k=\degree(\circuit)$, and let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge\prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants, then the above runtime simplifies to $O_k\left(\left(\frac1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot\log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
Given a \emph{\abbrOneBIDB} circuit \circuit computed from the reduction of~\Cref{def:ctidb-reduct}, $\gamma\inparen{\circuit}=\inparen{c +1}^{-k}$.
\end{Lemma}
\begin{Corollary}
Let \circuit be any \abbrCTIDB circuit with $\poly\inparen{\vct{X}}=\polyf\inparen{\circuit}$, let $k =\degree\inparen{\circuit}$ and $\gamma=\gamma\inparen{\circuit}$, and suppose that $\prob_i\ge\prob_0$ for all $i\in\pbox{\numvar}$. Then the results of~\Cref{cor:approx-algo-const-p} hold for estimating $\rpoly\inparen{\prob_1,\ldots, \prob_\numvar}$.
We briefly connect the runtime in \Cref{eq:approx-algo-runtime} to the algorithm outlined earlier (where we ignore the dependence on $\multc{\cdot}{\cdot}$, which is needed to handle the cost of arithmetic operations over integers). The $\size(\circuit)$ term comes from the time taken to run \onepass once (\onepass essentially computes $\abs{\circuit}(1,\ldots, 1)$ using the natural circuit evaluation algorithm on $\circuit$). We make $\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot\prob_0^{2k}}$ many calls to \sampmon (each of which essentially traces $O(k)$ random sink-to-source paths in $\circuit$, all of which by definition have length at most $\depth(\circuit)$).
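
To get a feel for the number of calls to \sampmon, here is an illustrative calculation with hypothetical parameter values (chosen by us, not taken from the experiments), ignoring hidden constants and assuming the natural logarithm. Take $\error'=\frac{1}{10}$, $\conf=\frac{1}{100}$, $\gamma=\frac12$, $\prob_0=\frac12$, and $k=2$. Then
\[
\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot\prob_0^{2k}}
=\frac{\ln 100}{\frac{1}{100}\cdot\frac14\cdot\frac1{16}}
=6400\cdot\ln 100\approx 2.9\times 10^{4}.
\]
The $\prob_0^{2k}$ factor dominates as $k$ grows: with the other values fixed, moving to $k=3$ multiplies the count by $\prob_0^{-2}=4$.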
Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \Cref{susec:proof-val-up}, we show the following:
Note that the above implies that, under the assumption from \Cref{cor:approx-algo-const-p} that $\prob_0>0$ and $\gamma<1$ are absolute constants, the runtime there simplifies to $O_k\left(\frac1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot\log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$. If $\circuit$ is a tree, then the runtime simplifies to $O_k\left(\frac1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot\log{\frac{1}{\conf}}\right)$, which answers \Cref{prob:intro-stmt} in the affirmative for such circuits.
%\AH{Is it standard to assume that in the asymptotic notation above, $\error$ and $\delta$ are constant? Otherwise this does not uphold~\Cref{prob:intro-stmt}.}
Finally, note that by \Cref{prop:circuit-depth} and \Cref{lem:circ-model-runtime}, for any $\raPlus$ query $\query$ there exists a circuit $\circuit^*$ for $\apolyqdt$ such that $\depth(\circuit^*)\le O_{|Q|}(\log{n})$ and $\size(\circuit^*)\le O_k\inparen{\qruntime{\query, \dbbase}}$. Using this along with \Cref{lem:val-ub}, \Cref{cor:approx-algo-const-p} and the fact that $n\le\qruntime{\query, \dbbase}$, we answer \Cref{prob:big-o-joint-steps} in the affirmative as follows:
Let $\query$ be an $\raPlus$ query and $\pdb$ be a \emph{\abbrOneBIDB} such that $p_0>0$ and $\gamma<1$ (where $p_0,\gamma$ are as in \Cref{cor:approx-algo-const-p}) are absolute constants. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf}\inparen{\qruntime{\query, \tupset, \bound}}$ (given $\query,\tupset$ and $p_i$ for each $i\in[n]$ that define $\pd$).
%Let $\poly(\vct{X})$ be a \abbrBIDB-lineage polynomial correspoding to an \abbrBIDB circuit $\circuit$ that satisfies the specific conditions in \Cref{lem:val-ub}. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time
% $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$. % for the case when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}.
If we want to approximate the expected multiplicities of all $Z=O(n^k)$ result tuples $\tup$ simultaneously, we apply the above result with $\conf$ replaced by $\frac\conf Z$. Note that this increases the runtime by only a logarithmic factor.
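
The logarithmic overhead can be checked directly; the following is a short sketch using a union bound, which we assume is the intended argument. If each of the $Z$ estimates fails with probability at most $\frac{\conf}{Z}$, then the probability that some estimate fails is at most $Z\cdot\frac{\conf}{Z}=\conf$. Since the runtime bounds above depend on the confidence parameter only through a $\log{\frac{1}{\conf}}$ factor, the replacement changes this factor to
\[
\log{\frac{Z}{\conf}}=\log{Z}+\log{\frac{1}{\conf}}=O(k\log{n})+\log{\frac{1}{\conf}},
\]
using $Z=O(n^k)$.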
%\AR{The above Corollary needs to be improved/generalized. This is a place-holder for now.}
%In \Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios, including query evaluation under FAQ/AJAR setup.