Completed pass on S4.

master
Aaron Huber 2022-02-17 10:07:33 -05:00
parent 0942a47d69
commit cb73d33ec2
3 changed files with 32 additions and 21 deletions


@ -2,16 +2,22 @@
%!TEX root=./main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
\AH{If we get rid of the problem statements, then we need to remember to get rid of references to particular problem statements, as in below.}
In \Cref{sec:hard}, we showed that the answer to \Cref{prob:intro-stmt} is no.
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$ for a very broad class of circuits (see the discussion after \Cref{lem:val-ub} for more).
The following approximation algorithm applies to \abbrBIDB lineage polynomials (over $\raPlus$ queries), though our bounds are more meaningful for a non-trivial subclass of queries over \bis that contains all queries on \tis, as well as the queries of the PDBench benchmark~\cite{pdbench}. All proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
In \Cref{sec:hard}, we showed that \secrev{
\Cref{prob:bag-pdb-poly-expected} cannot be solved in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime.
}
With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$ for a very broad class of circuits\secrev{ (thus affirming~\Cref{prob:intro-stmt};} see the discussion after \Cref{lem:val-ub} for more).
The following approximation algorithm applies to \secrev{
\abbrCTIDB lineage polynomials and, in practice, to general \abbrBIDB lineage polynomials (over bag-$\raPlus$ query semantics). For the latter, we note that a $1$-\abbrTIDB is equivalently a $1$-\abbrBIDB (blocks are of size $1$), and our experimental results (see~\Cref{app:subsec:experiment}) using queries from the PDBench benchmark~\cite{pdbench} show a low $\gamma$ (see~\Cref{def:param-gamma}), supporting the notion that our bounds hold for general \abbrBIDB in practice.
}
\secrev{Corresponding proofs and pseudocode for all formal statements and algorithms }
can be found in \Cref{sec:proofs-approx-alg}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\AH{We are going to have to rework $\gamma$ in this section, as well as the proof for our result.}
\subsection{Preliminaries and some more notation}
We now introduce definitions and notation related to circuits and polynomials that we will need to state our upper bound results. First we introduce the expansion $\expansion{\circuit}$ of circuit $\circuit$ which % encodes the reduced polynomial for $\polyf\inparen{\circuit}$ and is the basis
is used in our algorithm for sampling monomials (part of our approximation algorithm).
is used in our
\secrev{auxiliary algorithm for sampling monomials when computing the approximation. }% (part of our approximation algorithm).
\begin{Definition}[$\expansion{\circuit}$]\label{def:expand-circuit}
For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef \in \domN$.
@ -57,12 +63,10 @@ In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity o
Finally, to get linear runtime results, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
$\expansion{\circuit}$
that need to be `canceled' when monomials with dependent variables are removed (\Cref{def:reduced-bi-poly}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
that need to be `canceled' when monomials with dependent variables are removed (\Cref{subsec:one-bidb}). %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise; further, let $\indicator{\theta}$ also be a boolean function returning true if $\theta$ evaluates to true.
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given a \abbrBIDB circuit $\circuit$ define
\AH{Technically, $\monom$ is a set of variables rather than a monomial. Perhaps we don't need the $\var(\cdot)$ function and can replace it with a function that returns the monomial represented by a set of variables. FIXED: need to propagate this to the appendix ($\encMon$)}
\AH{To add, this is an issue on line 1073, 1117 of app C.}
\[\gamma(\circuit)=\frac{\sum_{(\monom, \coef)\in \expansion{\circuit}} \abs{\coef}\cdot \indicator{\neg\isInd{\encMon}} }%\encMon\mod{\mathcal{B}}\equiv 0}}
{\abs{\circuit}(1,\ldots, 1)}.\]
\end{Definition}
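% Illustrative example, not part of the original text.
For illustration only, assume that $\expansion{\circuit}$ lists each product term of the fully expanded circuit separately, without merging like terms (an assumption, since the full definition is not reproduced here). Let $X_1$ and $X_2$ belong to the same block, and let $\circuit$ encode $(X_1+X_2)\cdot(X_1+X_2)$. Under this assumption, $\expansion{\circuit}$ consists of $(\{X_1\},1)$, $(\{X_1,X_2\},1)$, $(\{X_1,X_2\},1)$, and $(\{X_2\},1)$, so that $\abs{\circuit}(1,\ldots,1)=4$. Only the two entries containing both $X_1$ and $X_2$ are composed of dependent variables, which gives
\[\gamma(\circuit)=\frac{1+1}{4}=\frac{1}{2}.\]
If $X_1$ and $X_2$ instead belonged to different blocks, no entry would be canceled and we would have $\gamma(\circuit)=0$.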
@ -75,7 +79,10 @@ Our approximation algorithm (\approxq pseudo code in \Cref{sec:proof-lem-approx-
%with the desired runtime. This algorithm
is based on the following observation.
% The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove \Cref{lem:approx-alg} follows from the following observation.
Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over
\secrev{
$1$-\abbrBIDB (recall that all \abbrCTIDB can be reduced to $1$-\abbrBIDB by~\Cref{def:ctidb-reduct}), we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{equation}
@ -89,7 +96,7 @@ Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circui
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional
to $\abs{\coef}$ and compute $\vari{Y}=\indicator{\isInd{\encMon}}
\cdot \prod_{X_i\in \monom} p_i$. %Taking $\ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$ samples
Repeating the sampling appropriate number of times
Repeating the sampling an appropriate number of times
and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in \Cref{sec:proofs-approx-alg}).
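% Illustrative derivation, not part of the original text.
As a sanity check on this outline, the following sketch assumes that $\abs{\circuit}(1,\ldots,1)=\sum_{(\monom,\coef)\in\expansion{\circuit}}\abs{\coef}$, so that \sampmon returns $(\monom,\coef)$ with probability $\abs{\coef}/\abs{\circuit}(1,\ldots,1)$ (the symbol $\mathrm{E}$ for expectation is our notation here). The expectation of a single sample is then
\[\mathrm{E}\left[\vari{Y}\right]=\sum_{(\monom,\coef)\in\expansion{\circuit}}\frac{\abs{\coef}}{\abs{\circuit}(1,\ldots,1)}\cdot\indicator{\isInd{\encMon}}\cdot\prod_{X_i\in\monom}p_i,\]
so rescaling the empirical mean of $\vari{Y}$ by $\abs{\circuit}(1,\ldots,1)$ targets the $\abs{\coef}$-weighted sum over the independent monomials of $\expansion{\circuit}$. When all coefficients are non-negative, this coincides with the $\coef$-weighted sum.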
%%%%%%%%%%%%%%%%%%%%%%%
@ -114,10 +121,7 @@ and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is
% We next present a few corollaries of \Cref{lem:approx-alg}.
\begin{Theorem}
\label{cor:approx-algo-const-p}
Let \circuit be an arbitrary \abbrBIDB circuit %for a UCQ over \bi
and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
%Let $\poly(\vct{X})$ be as in \Cref{lem:approx-alg} and
Let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
Let \circuit be an arbitrary $1$-\abbrBIDB circuit, define $\poly(\vct{X})=\polyf(\circuit)$, let $k=\degree(\circuit)$, and let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
satisfying
\begin{equation}
\label{eq:approx-algo-bound-main}
@ -131,7 +135,9 @@ O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
\end{Theorem}
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment} for experimental results).
The restriction on $\gamma$ is satisfied by any \secrev{
$1$-\abbrTIDB (where $\gamma=0$ in the equivalent $1$-\abbrBIDB of~\Cref{def:ctidb-reduct})
} as well as for all three queries of the PDBench \abbrBIDB benchmark (see \Cref{app:subsec:experiment} for experimental results).
We briefly connect the runtime in \Cref{eq:approx-algo-runtime} to the algorithm outline earlier (where we ignore the dependence on $\multc{\cdot}{\cdot}$, which is needed to handle the cost of arithmetic operations over integers). The $\size(\circuit)$ comes from the time taken to run \onepass once (\onepass essentially computes $\abs{\circuit}(1,\ldots, 1)$ using the natural circuit evaluation algorithm on $\circuit$). We make $\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}$ many calls to \sampmon (each of which essentially traces $O(k)$ random sink-to-source paths in $\circuit$, all of which by definition have length at most $\depth(\circuit)$).
@ -148,15 +154,18 @@ if $\circuit$ is a tree, then
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
\end{Lemma}
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$. If $\circuit$ is a tree, then the runtime simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$, which then answers \Cref{prob:intro-stmt} is yes for such circuits.
Note that the above implies that, under the assumption from \Cref{cor:approx-algo-const-p} that $\prob_0>0$ and $\gamma<1$ are absolute constants, the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$. If $\circuit$ is a tree, then the runtime simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$, which answers \Cref{prob:intro-stmt} in the affirmative for such circuits.
\AH{Is it standard to assume that in the asymptotic notation above, $\error$ and $\delta$ are constant? Otherwise this does not uphold~\Cref{prob:intro-stmt}.}
Finally, note that by \Cref{prop:circuit-depth} and \Cref{lem:circ-model-runtime}, for any $\raPlus$ query $\query$, there exists a circuit $\circuit^*$ for $\apolyqdt$ such that $\depth(\circuit^*)\le O_{|Q|}(\log{n})$ and $\size(\circuit^*)\le O_k\inparen{\qruntime{\query, \dbbase}}$. Using this along with \Cref{lem:val-ub}, \Cref{cor:approx-algo-const-p} and the fact that $n\le \qruntime{\query, \dbbase}$, we answer \Cref{prob:big-o-joint-steps} in the affirmative as follows:
\begin{Corollary}
\label{cor:approx-algo-punchline}
Let $\query$ be an $\raPlus$ query and $\pdb$ be an \abbrBIDB with $p_0>0$ and $\gamma<1$ (where $p_0,\gamma$ as in \Cref{cor:approx-algo-const-p}) are absolute constants. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf}\inparen{\qruntime{\query, \dbbase}}$ (given $\query,\dbbase$ and $p_i$ for each $i\in [n]$ that defines $\pd$).
Let $\query$ be an $\raPlus$ query and $\pdb$ be a $1$-\abbrBIDB such that $p_0>0$ and $\gamma<1$ (where $p_0,\gamma$ are as in \Cref{cor:approx-algo-const-p}) are absolute constants. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf}\inparen{\qruntime{\query, \tupset, \bound}}$ (given $\query,\tupset$ and $p_i$ for each $i\in [n]$ that defines $\pd$).
%Let $\poly(\vct{X})$ be a \abbrBIDB-lineage polynomial correspoding to an \abbrBIDB circuit $\circuit$ that satisfies the specific conditions in \Cref{lem:val-ub}. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time
% $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$. % for the case when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}.
\end{Corollary}
\AH{What is $\abs{\query}$? Isn't that just $k$?}
If we want to approximate the expected multiplicities of all $Z=O(n^k)$ result tuples $\tup$ simultaneously, we just need to run the above result with $\conf$ replaced by $\frac \conf Z$. Note this increases the runtime by only a logarithmic factor.
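% Illustrative derivation, not part of the original text.
To spell out this step (a sketch, not a formal proof): if each of the $Z$ estimates violates \Cref{eq:approx-algo-bound-main} with probability at most $\frac{\conf}{Z}$, then a union bound over all result tuples shows that some estimate is violated with probability at most $Z\cdot\frac{\conf}{Z}=\conf$. Since the runtime in \Cref{cor:approx-algo-const-p} depends on the confidence parameter only through the factor $\log{\frac{1}{\conf}}$, this factor becomes
\[\log{\frac{Z}{\conf}}=\log{\frac{1}{\conf}}+\log{Z}=\log{\frac{1}{\conf}}+O\inparen{k\log{n}}.\]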


@ -17,7 +17,7 @@ A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source gates (in de
%
Each gate has the following members: \type, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the value type $\{\circplus, \circmult, \var, \tnum\}$ and \vari{input} the list of inputs. Source gates have an extra member \val storing the value. $\circuit_\linput$ ($\circuit_\rinput$) denotes the left (right) input of \circuit.
\end{Definition}
\AH{Does the following matter, i.e., does it point anything out special for our research?}
\AH{Does the following matter, i.e., does it point out anything special for our research? \textbf{EDIT}:~\Cref{lem:val-ub} does use this (when \circuit is a tree) to answer~\Cref{prob:intro-stmt} with a yes.}
When the underlying DAG is a tree (with edges pointing towards the root), the structure is an expression tree \etree. In such a case, the root of \etree is analogous to the sink of \circuit. The fields \vari{partial}, \degval, \vari{Lweight}, and \vari{Rweight} are used in the proofs of \Cref{sec:proofs-approx-alg}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


@ -29,7 +29,9 @@ When it is unclear, we use $\smbOf{\poly}$ to denote the \abbrSMB form of a poly
\begin{Definition}[Degree]\label{def:degree-of-poly}
The degree of polynomial $\poly(\vct{X})$ is the largest \secrev{$\vct{d} = \sum_{i\in\pbox{\numedge}}d_i %= \norm{\vct{d}}_1
$}% = \sum_{\tup\in\tupset} d_\tup$
such that $c_{(d_1,\dots,d_n)}\ne 0$. % maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
such that $c_{(d_1,\dots,d_n)}\ne 0$. \secrev{
We denote the degree of $\poly$ as $\deg\inparen{\poly}$.
}% maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
@ -43,7 +45,7 @@ or simply lineage polynomial), if there exists a $\raPlus$ query $\query$, \abbr
%Following the typical representation of bags in production databases, for query inputs, we will use \abbrBPDB\xplural with multiplicities $\{0, 1\}$ (see \Cref{sec:gener-results-beyond} for more on this choice).
\subsection{$\mathbf{1}$-BIDB}
\subsection{$\mathbf{1}$-BIDB}\label{subsec:one-bidb}
\label{subsec:tidbs-and-bidbs}
\noindent\secrev{