%root: main.tex
%!TEX root=./main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In \Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries), and by extension \bi (or more general \abbrPDB models) %any $\semNX$-PDB)
is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
The following approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark~\cite{pdbench}. As before, all proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
We now introduce useful definitions and notation related to circuits and polynomials.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Definition}[Variables in a monomial]\label{def:vars}
% Given a monomial $v$, we use $\var(v)$ to denote the set of variables in $v$.
%\end{Definition}
%\noindent For example the monomial $XY$ has $\var(XY)=\inset{X,Y}$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[$\expansion{\circuit}$]\label{def:expand-circuit}
For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef \in \domN$. We will denote the monomial composed of the variables in $\monom$ as $\encMon$.
$\expansion{\circuit}$ has the following recursive definition ($\circ$ is list concatenation).
$\expansion{\circuit} =
\begin{cases}
\expansion{\circuit_\linput} \circ \expansion{\circuit_\rinput} &\textbf{ if }\circuit.\type = \circplus\\
\left\{(\monom_\linput \cup \monom_\rinput, \coef_\linput \cdot \coef_\rinput) ~|~(\monom_\linput, \coef_\linput) \in \expansion{\circuit_\linput}, (\monom_\rinput, \coef_\rinput) \in \expansion{\circuit_\rinput}\right\} &\textbf{ if }\circuit.\type = \circmult\\
\elist{(\emptyset, \circuit.\val)} &\textbf{ if }\circuit.\type = \tnum\\
\elist{(\{\circuit.\val\}, 1)} &\textbf{ if }\circuit.\type = \var.\\
\end{cases}
$
\end{Definition}
Consider $\circuit$ illustrated in \Cref{fig:circuit}. $\expansion{\circuit}$ is then $[(X, 2), (XY, -1), (XY, 4), (Y, -2)]$.
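To illustrate the product case of \Cref{def:expand-circuit}, consider a hypothetical circuit $\circuit'$ (introduced here only as an example) whose root is a $\circmult$ gate with a variable leaf $X$ as one input and, as the other input, a $\circplus$ gate over the variable leaf $X$ and the constant leaf $3$; that is, $\circuit'$ encodes $X\cdot(X+3)$. The inputs of the root expand to $[(\{X\}, 1)]$ and $[(\{X\}, 1), (\emptyset, 3)]$ respectively, so applying the definition at the root gives
\[\expansion{\circuit'} = [(\{X\}\cup\{X\},\, 1\cdot 1),\ (\{X\}\cup\emptyset,\, 1\cdot 3)] = [(\{X\}, 1),\ (\{X\}, 3)].\]
Two features are worth noting in this sketch: the set union records $X\cdot X$ as the variable set $\{X\}$, and entries with the same variable set are kept as separate list elements rather than being merged.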
\begin{Definition}[$\abs{\circuit}(\vct{X})$]\label{def:positive-circuit}
For any circuit $\circuit$, the corresponding
{\em positive circuit}, denoted $\abs{\circuit}$, is obtained from $\circuit$ as follows. For each leaf node $\ell$ of $\circuit$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$.
\end{Definition}
Conveniently, $\abs{\circuit}\inparen{1,\ldots,1}$ gives us the number of terms represented in $\expansion{\circuit}$, i.e. $\sum\limits_{\inparen{\monom, \coef} \in \expansion{\circuit}}\abs{\coef}$.
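As a quick check, taking the expansion $[(X, 2), (XY, -1), (XY, 4), (Y, -2)]$ stated for the circuit of \Cref{fig:circuit} as given, summing the absolute values of its coefficients yields
\[\abs{\circuit}\inparen{1,\ldots,1} = \abs{2}+\abs{-1}+\abs{4}+\abs{-2} = 9,\]
whereas summing the signed coefficients would give only $2-1+4-2=3$.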
\begin{Definition}[\size($\cdot$), \depth$\inparen{\cdot}$]\label{def:size-depth}
The functions \size and \depth output the number of gates and the number of levels, respectively, of the input \circuit.
\end{Definition}
%\begin{Definition}[\depth($\cdot$)]
%The function \depth has circuit $\circuit$ as input and outputs the number of levels in \circuit.
%\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%
%NEEDS to be moved to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Definition}[$\degree(\cdot)$]\label{def:degree}\footnote{Note that the degree of $\polyf(\abs{\circuit})$ is always upper bounded by $\degree(\circuit)$ and the latter can be strictly larger (e.g. consider the case when $\circuit$ multiplies two copies of the constant $1$-- here we have $\deg(\circuit)=1$ but degree of $\polyf(\abs{\circuit})$ is $0$).}
%$\degree(\circuit)$ is defined recursively as follows:
%\[\degree(\circuit)=
%\begin{cases}
%\max(\degree(\circuit_\linput),\degree(\circuit_\rinput)) & \text{ if }\circuit.\type=+\\
%\degree(\circuit_\linput) + \degree(\circuit_\rinput)+1 &\text{ if }\circuit.\type=\times\\
%1 & \text{ if }\circuit.\type = \var\\
%0 & \text{otherwise}.
%\end{cases}
%\]
%\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%
%END move to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%
Finally, we will need the following notation for the complexity of multiplying large integers:
\begin{Definition}[$\multc{\cdot}{\cdot}$]\footnote{We note that when doing arithmetic operations on the RAM model for input of size $N$, we have that $\multc{O(\log{N})}{O(\log{N})}=O(1)$. More generally we have $\multc{N}{O(\log{N})}=O(N\log{N}\log\log{N})$.}
In a RAM model with a word size of $W$ bits, $\multc{M}{W}$ denotes the complexity of multiplying two integers represented with $M$ bits. (We will assume that for input of size $N$, $W=O(\log{N})$.)
\end{Definition}
\subsection{Our main result}
\AH{Verify that the proof for \cref{lem:approx-alg} doesn't rely on properties of $\raPlus$ or \abbrBIDB.}
\begin{Theorem}\label{lem:approx-alg}
Let \circuit be an arbitrary arithmetic circuit %for a UCQ over \bi
and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
{\small
\[O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\circuit}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot \depth(\circuit)}{\inparen{\error}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)\]
}
such that
\begin{equation}
\label{eq:approx-algo-bound}
\probOf\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error \cdot \rpoly(\prob_1,\dots,\prob_\numvar)\right) \leq \conf.
\end{equation}
\end{Theorem}
To get linear runtime results from \Cref{lem:approx-alg}, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
$\expansion{\circuit}$
to be `canceled' when monomials with dependent variables are removed (\cref{def:reduced-bi-poly}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise.
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given a circuit $\circuit$, define
\AH{Technically, $\monom$ is a set of variables rather than a monomial. Perhaps we don't need the $\var(\cdot)$ function and can replace it with a function that returns the monomial represented by a set of variables. FIXED: need to propagate this to the appendix ($\encMon$)}
\AH{To add, this is an issue on line 1073, 1117 of app C.}
\[\gamma(\circuit)=\frac{\sum_{(\monom, \coef)\in \expansion{\circuit}} \abs{\coef}\cdot \indicator{\neg\isInd{\encMon}} }%\encMon\mod{\mathcal{B}}\equiv 0}}
{\abs{\circuit}(1,\ldots, 1)}.\]
\end{Definition}
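For intuition, consider again the expansion $[(X, 2), (XY, -1), (XY, 4), (Y, -2)]$ stated for the circuit of \Cref{fig:circuit}, and assume, purely for illustration, that $X$ and $Y$ are dependent (e.g., they annotate tuples of the same block in a \bi), so that $\isInd{\cdot}$ is false exactly for the two $XY$ entries. Under this hypothetical assumption,
\[\gamma(\circuit)=\frac{\abs{-1}+\abs{4}}{\abs{2}+\abs{-1}+\abs{4}+\abs{-2}}=\frac{5}{9}.\]
If instead $X$ and $Y$ are independent, no entry is canceled and $\gamma(\circuit)=0$.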
\noindent We next present a few corollaries of \Cref{lem:approx-alg}.
\begin{Corollary}
\label{cor:approx-algo-const-p}
Let $\poly(\vct{X})$ be as in \Cref{lem:approx-alg} and let $\gamma=\gamma(\circuit)$. Further, suppose that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying \Cref{eq:approx-algo-bound} can be computed in time
\[O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot \depth(\circuit)}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)\]
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
\end{Corollary}
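One way to see how the $(1-\gamma)^2\cdot\prob_0^{2k}$ factor in \Cref{cor:approx-algo-const-p} could arise from \Cref{lem:approx-alg} is the following sketch. It makes additional assumptions that are not stated above: every coefficient in $\expansion{\circuit}$ is non-negative, $0<\prob_0\le\prob_i\le 1$, $\gamma<1$, and each $\monom$ contains at most $k$ variables. Starting from \Cref{eq:tilde-Q-bi} evaluated at $(\prob_1,\ldots,\prob_\numvar)$,
\[\rpoly(\prob_1,\ldots,\prob_\numvar)=\sum_{(\monom,\coef)\in\expansion{\circuit}}\indicator{\isInd{\encMon}}\cdot\coef\cdot\prod_{X_i\in\monom}\prob_i \;\ge\; \prob_0^{k}\cdot\sum_{(\monom,\coef)\in\expansion{\circuit}}\indicator{\isInd{\encMon}}\cdot\coef,\]
and, by \Cref{def:param-gamma}, the last sum equals $(1-\gamma)\cdot\abs{\circuit}(1,\ldots,1)$ when all coefficients are non-negative. Hence
\[\frac{\abs{\circuit}^2(1,\ldots,1)}{\rpoly^2(\prob_1,\ldots,\prob_\numvar)}\le\frac{1}{(1-\gamma)^2\cdot\prob_0^{2k}},\]
which matches the corresponding factor in the runtime of the corollary. Without the non-negativity assumption the first inequality can fail, so this is only a plausibility check and not a substitute for the proof in \Cref{sec:proofs-approx-alg}.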
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as by all three queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment} for experimental results).
Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \Cref{susec:proof-val-up}, we show the following:
\begin{Lemma}
\label{lem:val-ub}
For any circuit $\circuit$ with $\degree(\circuit)=k$, we have
$\abs{\circuit}(1,\ldots, 1)\le 2^{2^k\cdot \size(\circuit)}.$
Further, under either of the following conditions:
\begin{enumerate}
\item $\circuit$ is a tree,
\item $\circuit$ encodes the run of the algorithm in~\cite{DBLP:conf/pods/KhamisNR16} on an FAQ\AH{citation would help here, as a reviewer complaint on this was ``What is FAQ?'', though we do cite (I think) in the appendix.} query,
\end{enumerate}
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
\end{Lemma}
Note that the above implies that, under the assumption from \Cref{cor:approx-algo-const-p} that $\prob_0>0$ and $\gamma<1$ are absolute constants, the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}. In \Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios, including query evaluation under \raPlus or FAQ.
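To spell out the second simplification as a short sketch: under either condition of \Cref{lem:val-ub} we have $\log\left(\abs{\circuit}(1,\ldots, 1)\right)\le O(k)\cdot\log\left(\size(\circuit)\right)$, so every intermediate value fits in $O(k)$ machine words. Treating $k$ as a constant and using the remark accompanying the definition of $\multc{\cdot}{\cdot}$, this gives
\[\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}=\multc{O\left(k\cdot\log\left(\size(\circuit)\right)\right)}{\log\left(\size(\circuit)\right)}=O_k(1),\]
so the multiplication term contributes only a constant factor in this case.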
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Approximating $\rpoly$}
We prove \Cref{lem:approx-alg} by developing an approximation algorithm (\approxq, detailed in \Cref{alg:mon-sam}) with the desired runtime. This algorithm is based on the following observation.
% The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove \Cref{lem:approx-alg} follows from the following observation.
Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{equation}
\label{eq:tilde-Q-bi}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(\monom,\coef)\in \expansion{\circuit}} %\hspace*{-2mm}
\indicator{\isInd{\encMon}%\mod{\mathcal{B}}\not\equiv 0
}\cdot \coef\cdot\hspace*{-2mm}\prod_{X_i\in \monom}\hspace*{-2mm} X_i
\end{equation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%
%NEED to move to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%
%\input{app_approx-alg-pseudo-code}
%%%%%%%%%%%%%%%%%%%%%%%%%
%END move to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional %\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.}
to $\abs{\coef}$ and compute $\vari{Y}=\indicator{\isInd{\encMon}}%\monom\mod{\mathcal{B}}\not\equiv 0}
\cdot \prod_{X_i\in \monom} \prob_i$. Taking $\numsamp$ samples and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in \Cref{sec:proofs-approx-alg}).
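The choice of sampling probabilities can be sanity-checked as follows. This is only a sketch, under the assumption (not spelled out above) that the sign of $\coef$ is recorded with each sample and that the averaged value is rescaled by $\abs{\circuit}(1,\ldots,1)$. Each $(\monom,\coef)$ with $\coef\ne 0$ is drawn with probability $\frac{\abs{\coef}}{\abs{\circuit}(1,\ldots,1)}$, so the expected value of the signed sample $\frac{\coef}{\abs{\coef}}\cdot\vari{Y}$ is
\[\sum_{(\monom,\coef)\in\expansion{\circuit}}\frac{\abs{\coef}}{\abs{\circuit}(1,\ldots,1)}\cdot\frac{\coef}{\abs{\coef}}\cdot\indicator{\isInd{\encMon}}\cdot\prod_{X_i\in\monom}\prob_i=\frac{\rpoly(\prob_1,\ldots,\prob_\numvar)}{\abs{\circuit}(1,\ldots,1)},\]
where the equality is \Cref{eq:tilde-Q-bi} evaluated at $(\prob_1,\ldots,\prob_\numvar)$. For instance, for the expansion $[(X, 2), (XY, -1), (XY, 4), (Y, -2)]$, where $\abs{\circuit}(1,\ldots,1)=9$, the four entries would be drawn with probabilities $\frac{2}{9}$, $\frac{1}{9}$, $\frac{4}{9}$, and $\frac{2}{9}$ respectively.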
%\approxq (\Cref{alg:mon-sam}) modifies \circuit with a call to \onepass. It then samples from $\circuit_{\vari{mod}}\numsamp$ times and uses that information to approximate $\rpoly$.
%\subsubsection{Correctness}
%In order to prove \Cref{lem:approx-alg}, we will need to argue the correctness of \approxq, which relies on the correctness of auxiliary algorithms \onepass and \sampmon.
%\begin{Lemma}\label{lem:one-pass}
%The $\onepass$ function completes in time:
%$$O\left(\size(\circuit) \cdot \multc{\log\left(\abs{\circuit(1\ldots, 1)}\right)}{\log{\size(\circuit}}\right)$$
% $\onepass$ guarantees two post-conditions: First, for each subcircuit $\vari{S}$ of $\circuit$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type = \circplus$, \subcircuit.\lwght $= \frac{\abs{\subcircuit_\linput}(1,\ldots, 1)}{\abs{\subcircuit}(1,\ldots, 1)}$ and likewise for \subcircuit.\rwght.
%\end{Lemma}
%To prove correctness of \Cref{alg:mon-sam}, we only use the following fact that follows from the above lemma: for the modified circuit ($\circuit_{\vari{mod}}$), $\circuit_{\vari{mod}}.\vari{partial}=\abs{\circuit}(1,\dots,1)$.
%\begin{Lemma}\label{lem:sample}
%The function $\sampmon$ completes in time
%$$O(\log{k} \cdot k \cdot \depth(\circuit)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log{\size(\circuit)}})$$
% where $k = \degree(\circuit)$. The function returns every $\left(\monom, sign(\coef)\right)$ for $(\monom, \coef)\in \expansion{\circuit}$ with probability $\frac{|\coef|}{\abs{\circuit}(1,\ldots, 1)}$.
%\end{Lemma}
%With the above two lemmas, we are ready to argue the following result (proof in \Cref{sec:proofs-approx-alg}):
%\begin{Theorem}\label{lem:mon-samp}
%For any $\circuit$ with $\degree(poly(|\circuit|)) = k$, algorithm \ref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that
%\[\probOf\left(\left|\vari{acc} - \rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error \cdot \abs{\circuit}(1,\ldots, 1)\right) \leq \conf,\]
% in $O\left(\left(\size(\circuit)+\frac{\log{\frac{1}{\conf}}}{\error^2} \cdot k \cdot\log{k} \cdot \depth(\circuit)\right)\cdot \multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log{\size(\circuit)}}\right)$ time.
%\end{Theorem}
%\subsection{\onepass\ Algorithm}
%\label{sec:onepass}
%\noindent \onepass\ (Algorithm ~\ref{alg:one-pass-iter} in \Cref{sec:proofs-approx-alg}) iteratively visits each gate one time according to the topological ordering of \circuit annotating the \lwght, \rwght, and \prt variables of each node according to the definitions above. Lemma~\ref{lem:one-pass} is proved in \Cref{sec:proofs-approx-alg}.
%\subsection{\sampmon\ Algorithm}
%\label{sec:samplemonomial}
%A naive (slow) implementation of \sampmon\ would first compute $\expansion{\circuit}$ and then sample from it.
%Instead, \Cref{alg:sample} selects a monomial from $\expansion{\circuit}$ by top-down traversal of the input \circuit. More details on the traversal can be found in \Cref{subsec:sampmon-remarks}.
%
%$\sampmon$ is given in \Cref{alg:sample}, and a proof of its correctness (via \Cref{lem:sample}) is provided in \Cref{sec:proofs-approx-alg}.
%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: