In \Cref{sec:hard}, we showed that \Cref{prob:bag-pdb-poly-expected} cannot be solved in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime. In light of this, we desire to produce an approximation algorithm that runs in time $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$. We do this by showing the result via circuits,
such that our approximation algorithm for this problem runs in $\bigO{\abs{\circuit}}$ for a very broad class of circuits, (thus affirming~\Cref{prob:intro-stmt}); see the discussion after \Cref{lem:val-ub} for more.
The following approximation algorithm applies to bag query semantics over both
\abbrCTIDB lineage polynomials and general \abbrBIDB lineage polynomials in practice, where for the latter we note that a $1$-\abbrTIDB is equivalently a \abbrBIDB (blocks are size $1$). Our experimental results (see~\Cref{app:subsec:experiment}) which use queries from the PDBench benchmark~\cite{pdbench} show a low $\gamma$ (see~\Cref{def:param-gamma}) supporting the notion that our bounds hold for general \abbrBIDB in practice.
Corresponding proofs and pseudocode for all formal statements and algorithms
We now introduce definitions and notation related to circuits and polynomials that we will need to state our upper bound results. First we introduce the expansion $\expansion{\circuit}$ of circuit $\circuit$ which % encodes the reduced polynomial for $\polyf\inparen{\circuit}$ and is the basis
For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef\in\domN$.
Later on, we will denote the monomial composed of the variables in $\monom$ as $\encMon$. As an example of $\expansion{\circuit}$, consider $\circuit$ illustrated in \Cref{fig:circuit}. $\expansion{\circuit}$ is then $[(X, 2), (XY, -1), (XY, 4), (Y, -2)]$. This helps us redefine $\rpoly$ (see \Cref{eq:tilde-Q-bi}) in a way that makes our algorithm more transparent.
{\em positive circuit}, denoted $\abs{\circuit}$, is obtained from $\circuit$ as follows. For each leaf node $\ell$ of $\circuit$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$.
\begin{Definition}[$\degree(\cdot)$]\label{def:degree}\footnote{Note that the degree of $\polyf(\abs{\circuit})$ is always upper bounded by $\degree(\circuit)$ and the latter can be strictly larger (e.g. consider the case when $\circuit$ multiplies two copies of the constant $1$-- here we have $\deg(\circuit)=1$ but degree of $\polyf(\abs{\circuit})$ is $0$).}
$\degree(\circuit)$ is defined recursively as follows:
\[\degree(\circuit)=
\begin{cases}
\max(\degree(\circuit_\linput),\degree(\circuit_\rinput)) &\text{ if }\circuit.\type=+\\
\degree(\circuit_\linput) + \degree(\circuit_\rinput)+1 &\text{ if }\circuit.\type=\times\\
\begin{Definition}[$\multc{\cdot}{\cdot}$]\footnote{We note that when doing arithmetic operations on the RAM model for input of size $N$, we have that $\multc{O(\log{N})}{O(\log{N})}=O(1)$. More generally we have $\multc{N}{O(\log{N})}=O(N\log{N}\log\log{N})$.}
In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity of multiplying two integers represented with $M$-bits. (We will assume that for input of size $N$, $W=O(\log{N})$.)
Finally, to get linear runtime results, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
that need to be `canceled' when monomials with dependent variables are removed (\Cref{subsec:one-bidb}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise; further, let $\indicator{\theta}$ also be a boolean function returning true if $\theta$ evaluates to true.
\abbrOneBIDB (recall that all \abbrCTIDB can be reduced to \abbrOneBIDB by~\Cref{prop:ctidb-reduct}), we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in\expansion{\circuit}$ with probability proportional
to $\abs{\coef}$ and compute $\vari{Y}=\indicator{\isInd{\encMon}}
and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in \Cref{sec:proofs-approx-alg}).
%%%%%%%%%%%%%%%%%%%%%%%
%The following results assume input circuit \circuit computed from an arbitrary $\raPlus$ query $\query$ and arbitrary \abbrBIDB $\pdb$. We refer to \circuit as a \abbrBIDB circuit.
%\AH{Verify that the proof for \Cref{lem:approx-alg} doesn't rely on properties of $\raPlus$ or \abbrBIDB.}
Let \circuit be an arbitrary \emph{\abbrOneBIDB} circuit, define $\poly(\vct{X})=\polyf(\circuit)$, let $k=\degree(\circuit)$, and let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge\prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot\log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
%Given any \abbrCTIDB circuit \circuit, $\poly\inparen{\vct{X}} = \polyf\inparen{\circuit}$, for $k =\degree\inparen{\circuit}$, $\gamma\inparen{\circuit}$, and $\prob_i\ge\prob_0$ for all $i\in\pbox{\numvar}$. The results of~\Cref{cor:approx-algo-const-p} follow for estimating $\rpoly\inparen{\prob_1,\ldots, \prob_\numvar}$.
as well as for all three queries of the PDBench \abbrBIDB benchmark (see \Cref{app:subsec:experiment} for experimental results). Further, we can also argue the following result
\secrev{
, recalling from~\Cref{sec:intro} for \abbrCTIDB$\pdb=\inparen{\worlds, \bpd}$, where $\tupset$ is the set of possible tuples across all possible worlds of $\pdb$.
Given $\raPlus$ query $\query$ and \abbrCTIDB$\pdb$, let \circuit be the circuit computed by $\query\inparen{\tupset}$. Then, for the reduced \abbrOneBIDB$\pdb'$ there exists an equivalent circuit \circuit' obtained from $\query\inparen{\tupset'}$, such that $\gamma\inparen{\circuit'}\leq1-\inparen{\bound+1}^{-\inparen{k-1}}$ with $\size\inparen{\circuit'}\leq\size\inparen{\circuit}+\bigO{\numvar\bound}$%\cdot\inparen{2^{\inparen{\ceil{\log{2\bound}}}+ 1} - 1}$
and $\depth\inparen{\circuit'}=\depth\inparen{\circuit}+\bigO{\log{\bound}}$.%\ceil{\log{2\bound}}$.
The circuit \circuit' is built from \circuit in the following manner. For each input gate $\gate_i$ with $\gate_i.\val= X_\tup$, replace $\gate_i$ with the circuit \subcircuit encoding the sum $\sum_{j =1}^\bound j\cdot X_{\tup, j}$. We argue that \circuit' is a valid circuit by the following facts. Let $\pdb=\inparen{\worlds, \bpd}$ be the original \abbrCTIDB\circuit was generated from. Then, by~\Cref{prop:ctidb-reduct} there exists a \abbrOneBIDB$\pdb' =\inparen{\onebidbworlds{\tupset'}, \bpd'}$, with $\tupset' =\inset{\intup{\tup, j}~|~\tup\in\tupset, j\in\pbox{\bound}}$, from which the conversion from \circuit to \circuit' follows. Both $\polyf\inparen{\circuit}$ and $\polyf\inparen{\circuit'}$ have the same expected multiplicity since (by~\Cref{prop:ctidb-reduct}) the distributions $\bpd$ and $\bpd'$ are equivalent and each $j\cdot\worldvec'_{\tup, j}=\worldvec_\tup$ for $\worldvec'\in\inset{0, 1}^{\bound\numvar}$ and $\worldvec\in\worlds$. Finally, note that because there exists a (sub) circuit encoding $\sum_{j =1}^\bound j\cdot X_{\tup, j}$ that is a \emph{balanced} binary tree, the above conversion implies the claimed size and depth bounds of the lemma.
Next we argue the claim on $\gamma\inparen{\circuit'}$. Consider the list of expanded monomials $\expansion{\circuit}$ for \abbrCTIDB circuit \circuit. Let \monom be an arbitrary monomial such that the set of variables in \monom is $\encMon= X_{\tup_1}^{d_1},\ldots,X_{\tup_\ell}^{d_\ell}$ with $\ell$ variables. Then \monom yields the set of monomials $\vari{E}_\monom\inparen{\circuit'}=\inset{j_1^{d_1}\cdot X_{\tup, j_1}^{d_1}\times\cdots\times j_\ell^{d_\ell}\cdot X_{\tup, j_\ell}^{d_\ell}}_{j_1,\ldots, j_\ell\in\pbox{0, \bound}}$ in $\expansion{\circuit'}$. Recall that a cancellation occurs when we have a monomial \monom such that there exists $\tup\neq\tup'$ in the same block $\block$ where variables $X_\tup, X_{\tup'}$ are in the set of variables $\encMon$ of \monom. Observe that cancellations can only occur for each $X_{\tup}^{d_\tup}\in\encMon$, where the expansion $\inparen{\sum_{j =1}^\bound j\cdot X_{\tup, j}}^{d_\tup}$ represents the monomial $X_\tup^{d_\tup}$ in $\tupset'$. Consider the number of cancellations for $\inparen{\sum_{j =1}^\bound j\cdot X_{\tup, j}^{d_\tup}}^{d_\ell}$. Then $\gamma\leq1-\inparen{c +1}^{d_\tup-1}$, since for each element in the set of cross products$\inset{\bigtimes_{i\in\pbox{d_\tup}, j_i\in\pbox{0, \bound}}X_{\tup, j_i}}$ there are \emph{exactly}$\bound+1$ surviving elements with $j_1=\cdots=j_{d_\tup}$, i.e. $X_j^{d_\tup}$ for each $j\in\pbox{0, \bound}$. The rest of the $\inparen{\bound+1}^{d_\tup-1}$ cross terms cancel. Regarding the whole monomial \monom it is the case that the proportion of non-cancellations across each $X_\tup^{d_\tup}\in\encMon$ multiply as non-cancelling terms for $X_\tup$ can only be joined with non-cancelling terms of $X_{\tup'}^{d_{\tup'}}$. This then yields the fraction of cancelled monomials $1-\prod_{i =1}^{\ell}\inparen{c +1}^{d_i -1}\leq\gamma\leq1-\inparen{c +1}^{-\inparen{k -1}}$ where the inequalities take into account the fact that $\sum_{i =1}^\ell d_i \leq k$.
We briefly connect the runtime in \Cref{eq:approx-algo-runtime} to the algorithm outline earlier (where we ignore the dependence on $\multc{\cdot}{\cdot}$, which is needed to handle the cost of arithmetic operations over integers). The $\size(\circuit)$ comes from the time taken to run \onepass once (\onepass essentially computes $\abs{\circuit}(1,\ldots, 1)$ using the natural circuit evaluation algorithm on $\circuit$). We make $\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot\prob_0^{2k}}$ many calls to \sampmon (each of which essentially traces $O(k)$ random sink to source paths in $\circuit$ all of which by definition have length at most $\depth(\circuit)$).
Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \Cref{susec:proof-val-up}, we show the following:
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplifies to $O_k\left(\frac1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot\log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$. If $\circuit$ is a tree, then the runtime simplifies to $O_k\left(\frac1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot\log{\frac{1}{\conf}}\right)$, which then answers \Cref{prob:intro-stmt} with yes for such circuits.
%\AH{Is it standard to assume that in the asymptotic notation above, $\error$ and $\delta$ are constant? Otherwise this does not uphold~\Cref{prob:intro-stmt}.}
Finally, note that by \Cref{prop:circuit-depth} and \Cref{lem:circ-model-runtime} for any $\raPlus$ query $\query$, there exists a circuit $\circuit^*$ for $\apolyqdt$ such that $\depth(\circuit^*)\le O_{|Q|}(\log{n})$ and $\size(\circuit)\le O_k\inparen{\qruntime{\query, \tupset, \bound}}$. Using this along with \Cref{lem:val-ub}, \Cref{cor:approx-algo-const-p} and the fact that $n\le\qruntime{\query, \tupset, \bound}$, we have the following corollary:
Let $\query$ be an $\raPlus$ query and $\pdb$ be a \emph{\abbrOneBIDB} with $p_0>0$ and $\gamma<1$ (where $p_0,\gamma$ as in \Cref{cor:approx-algo-const-p}) are absolute constants. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf}\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$ (given $\query,\tupset$ and $p_i$ for each $i\in[n]$ that defines $\pd$).
%Let $\poly(\vct{X})$ be a \abbrBIDB-lineage polynomial correspoding to an \abbrBIDB circuit $\circuit$ that satisfies the specific conditions in \Cref{lem:val-ub}. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time
% $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$. % for the case when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}.
Let $\query$ be an $\raPlus$ query and $\pdb$ be a \abbrCTIDB with $p_0>0$ (where $p_0$ as in \Cref{cor:approx-algo-const-p}) is an absolute constant. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf,\bound}\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$ (given $\query,\tupset$ and $\prob_{\tup, j}$ for each $\tup\in\tupset,~j\in\pbox{\bound}$ that defines $\bpd$).
%Let $\poly(\vct{X})$ be a \abbrBIDB-lineage polynomial correspoding to an \abbrBIDB circuit $\circuit$ that satisfies the specific conditions in \Cref{lem:val-ub}. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time
% $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$. % for the case when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}.
If we want to approximate the expected multiplicities of all $Z=O(n^k)$ result tuples $\tup$ simultaneously, we just need to run the above result with $\conf$ replaced by $\frac\conf Z$. Note this increases the runtime by only a logarithmic factor.
%\AR{The above Corollary needs to be improved/generalized. This is a place-holder for now.}
%In \Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios, including query evaluation under FAQ/AJAR setup.