Completed pass on S4.

2022-02-17 10:07:33 -05:00 · 2022-02-17 10:07:33 -05:00 · cb73d33ec2
parent 0942a47d69
commit cb73d33ec2
3 changed files with 32 additions and 21 deletions
--- a/approx_alg.tex
+++ b/approx_alg.tex
@ -2,16 +2,22 @@
 %!TEX root=./main.tex

 \section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
-\AH{If we get rid of the problem statements, then we need to remember to get rid of references to particular problem statements, as in below.}
-In \Cref{sec:hard}, we showed that the answer to \Cref{prob:intro-stmt} is no.
-With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$ for a very broad class of circuits (see the discussion after \Cref{lem:val-ub} for more).
-The following approximation algorithm applies to \abbrBIDB lineage polynomials (over $\raPlus$ queries), though our bounds are more meaningful for a non-trivial subclass of queries over \bis that contains all queries on \tis, as well as the queries of the PDBench benchmark~\cite{pdbench}.  All proofs and pseudocode can be found in \Cref{sec:proofs-approx-alg}.
+In \Cref{sec:hard}, we showed that \secrev{
+\Cref{prob:bag-pdb-poly-expected} cannot be solved in $\bigO{\qruntime{\optquery{\query},\tupset,\bound}}$ runtime.
+}
+With this result, we now design an approximation algorithm for our problem that runs in $\bigO{\abs{\circuit}}$ time for a very broad class of circuits \secrev{(thus affirming~\Cref{prob:intro-stmt};} see the discussion after \Cref{lem:val-ub} for more).
+The following approximation algorithm applies to \secrev{
+\abbrCTIDB lineage polynomials and, in practice, to general \abbrBIDB lineage polynomials (over bag-$\raPlus$ query semantics). For the latter, we note that a $1$-\abbrTIDB is equivalently a $1$-\abbrBIDB (all blocks have size $1$), and that our experimental results (see~\Cref{app:subsec:experiment}) using queries from the PDBench benchmark~\cite{pdbench} show a low $\gamma$ (see~\Cref{def:param-gamma}), supporting the notion that our bounds hold for general \abbrBIDB in practice.
+}
+  \secrev{Corresponding proofs and pseudocode for all formal statements and algorithms }
+  can be found in \Cref{sec:proofs-approx-alg}.
 %it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
-\AH{We are going to have to rework $\gamma$ in this section, as well as the proof for our result.}
+
 \subsection{Preliminaries and some more notation}

 We now introduce definitions and notation related to circuits and polynomials that we will need to state our upper bound results. First we introduce the expansion $\expansion{\circuit}$ of circuit $\circuit$ which % encodes the reduced polynomial for $\polyf\inparen{\circuit}$ and is the basis
-is used in our algorithm for sampling monomials (part of our approximation algorithm).
+is used in our
+\secrev{auxiliary algorithm for sampling monomials when computing the approximation.  }% (part of our approximation algorithm).

 \begin{Definition}[$\expansion{\circuit}$]\label{def:expand-circuit}
 For a circuit $\circuit$, we define $\expansion{\circuit}$ as a list of tuples $(\monom, \coef)$, where $\monom$ is a set of variables and $\coef \in \domN$.
@ -57,12 +63,10 @@ In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity o

 Finally, to get linear runtime results, we will need to define another parameter modeling the (weighted) number of monomials in %$\poly\inparen{\vct{X}}$
 $\expansion{\circuit}$
-that need to be `canceled' when monomials with dependent variables are removed (\Cref{def:reduced-bi-poly}).  %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
+that need to be `canceled' when monomials with dependent variables are removed (\Cref{subsec:one-bidb}).  %def:hen it is modded with $\mathcal{B}$ (\Cref{def:mod-set-polys}).
 Let $\isInd{\cdot}$ be a boolean function returning true if monomial $\encMon$ is composed of independent variables and false otherwise; further, let $\indicator{\theta}$ also be a boolean function returning true if $\theta$ evaluates to true.
 \begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
 Given a \abbrBIDB circuit $\circuit$ define
-\AH{Technically, $\monom$ is a set of variables rather than a monomial.  Perhaps we don't need the $\var(\cdot)$ function and can replace is with a function that returns the monomial represented by a set of variables.  FIXED: need to propogate this to the appendix ($\encMon$)}
-\AH{To add, this is an issue on line 1073, 1117 of app C.}
 \[\gamma(\circuit)=\frac{\sum_{(\monom, \coef)\in \expansion{\circuit}} \abs{\coef}\cdot \indicator{\neg\isInd{\encMon}} }%\encMon\mod{\mathcal{B}}\equiv 0}}
 {\abs{\circuit}(1,\ldots, 1)}.\]
 \end{Definition}
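 
 As an illustrative sketch of \Cref{def:param-gamma} (an example we add here under stated assumptions, not one taken from the results above), consider the circuit $\circuit$ encoding $(X_1 + X_2)\cdot(X_1 + X_2)$, where we assume that $X_1$ and $X_2$ belong to the same block and are therefore dependent. Assuming further that $\expansion{\circuit}$ lists one tuple per pair of summands without merging like terms, and that $\abs{\circuit}(1,\ldots, 1)$ equals the sum of $\abs{\coef}$ over $\expansion{\circuit}$, we get
 \[\expansion{\circuit} = \left[(\{X_1\}, 1),\; (\{X_1, X_2\}, 1),\; (\{X_1, X_2\}, 1),\; (\{X_2\}, 1)\right], \qquad \gamma(\circuit) = \frac{1 + 1}{4} = \frac{1}{2}.\]
 If $X_1$ and $X_2$ instead belong to different blocks, no monomial is canceled and $\gamma(\circuit) = 0$.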
@ -75,7 +79,10 @@ Our approximation algorithm (\approxq pseudo code in \Cref{sec:proof-lem-approx-
 %with the desired runtime. This algorithm
 is based on the following observation.
 % The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove \Cref{lem:approx-alg} follows from the following observation.
-Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
+Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over 
+\secrev{
+$1$-\abbrBIDB (recall that all \abbrCTIDB can be reduced to $1$-\abbrBIDB by~\Cref{def:ctidb-reduct}), we have: % can exactly represent $\rpoly(\vct{X})$ as follows:
+}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{equation}
@ -89,7 +96,7 @@ Given a lineage polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circui
 Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional
 to $\abs{\coef}$ and compute $\vari{Y}=\indicator{\isInd{\encMon}}
 \cdot \prod_{X_i\in \monom} p_i$. %Taking $\ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$ samples
-Repeating the sampling appropriate number of times
+Repeating the sampling an appropriate number of times
 and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in \Cref{sec:proofs-approx-alg}).
 %%%%%%%%%%%%%%%%%%%%%%%
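 
 To make the sampling step concrete, the following is a small worked sketch under our own assumptions (it is not part of the formal development). Let $\circuit$ encode $(X_1 + X_2)\cdot(X_1 + X_2)$ with $X_1, X_2$ in the same block, so that $\expansion{\circuit}$ contains $(\{X_1\}, 1)$, $(\{X_1, X_2\}, 1)$ twice, and $(\{X_2\}, 1)$. All coefficients are $1$, so each tuple is sampled with probability $\frac{1}{4}$, and
 \[\mathbb{E}\left[\vari{Y}\right] = \frac{1}{4}\left(p_1 + 0 + 0 + p_2\right) = \frac{p_1 + p_2}{4}.\]
 Assuming the average is rescaled by $\abs{\circuit}(1,\ldots, 1) = 4$ (the normalization suggested by sampling proportionally to $\abs{\coef}$), the estimate concentrates around $p_1 + p_2$, which is the value obtained from $(X_1 + X_2)^2$ after dropping the dependent monomial $X_1 X_2$ and reducing all exponents to $1$.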

@ -114,10 +121,7 @@ and computing the average of $\vari{Y}$ gives us our final estimate. \onepass is
 % We next present a few corollaries of \Cref{lem:approx-alg}.
 \begin{Theorem}
 \label{cor:approx-algo-const-p}
-Let \circuit be an arbitrary \abbrBIDB circuit %for a UCQ over \bi
-and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
-%Let $\poly(\vct{X})$ be as in \Cref{lem:approx-alg} and
-Let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$  of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
+Let \circuit be an arbitrary $1$-\abbrBIDB circuit, define $\poly(\vct{X})=\polyf(\circuit)$, let $k=\degree(\circuit)$, and let $\gamma=\gamma(\circuit)$. Further let it be the case that $\prob_i\ge \prob_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$  of $\rpoly(\prob_1,\ldots, \prob_\numvar)$
 satisfying
 \begin{equation}
 \label{eq:approx-algo-bound-main}
@ -131,7 +135,9 @@ O\left(\left(\size(\circuit) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \
 In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
 \end{Theorem}

-The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment} for experimental results).
+The restriction on $\gamma$ is satisfied by any \secrev{
+$1$-\abbrTIDB (where $\gamma=0$ in the equivalent $1$-\abbrBIDB of~\Cref{def:ctidb-reduct})
+ } as well as for all three queries of the PDBench \abbrBIDB benchmark (see \Cref{app:subsec:experiment} for experimental results).

 We briefly connect the runtime in \Cref{eq:approx-algo-runtime} to the algorithm outlined earlier (where we ignore the dependence on $\multc{\cdot}{\cdot}$, which is needed to handle the cost of arithmetic operations over integers). The $\size(\circuit)$ term comes from the time taken to run \onepass once (\onepass essentially computes $\abs{\circuit}(1,\ldots, 1)$ using the natural circuit evaluation algorithm on $\circuit$). We make $\frac{\log{\frac{1}{\conf}}}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot \prob_0^{2k}}$ many calls to \sampmon (each of which essentially traces $O(k)$ random sink-to-source paths in $\circuit$, all of which by definition have length at most $\depth(\circuit)$).

@ -148,15 +154,18 @@ if $\circuit$ is a tree, then
 we have $\abs{\circuit}(1,\ldots, 1)\le  \size(\circuit)^{O(k)}.$
 \end{Lemma}

-Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$. If $\circuit$ is a tree, then the runtime simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$, which then answers \Cref{prob:intro-stmt} is yes for such circuits.
+Note that the above implies that, under the assumption from \Cref{cor:approx-algo-const-p} that $\prob_0>0$ and $\gamma<1$ are absolute constants, the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$. If $\circuit$ is a tree, then the runtime simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$, which answers \Cref{prob:intro-stmt} in the affirmative for such circuits.
+
+\AH{Is it standard to assume that in the asymptotic notation above, $\error$ and $\delta$ are constant?  Otherwise this does not uphold~\Cref{prob:intro-stmt}.}

 Finally, note that by \Cref{prop:circuit-depth} and \Cref{lem:circ-model-runtime} for any $\raPlus$ query $\query$, there exists a circuit $\circuit^*$ for $\apolyqdt$ such that $\depth(\circuit^*)\le O_{|Q|}(\log{n})$ and $\size(\circuit)\le O_k\inparen{\qruntime{\query, \dbbase}}$. Using this along with \Cref{lem:val-ub}, \Cref{cor:approx-algo-const-p} and the fact that $n\le \qruntime{\query, \dbbase}$, we answer \Cref{prob:big-o-joint-steps} in the affirmative as follows:
 \begin{Corollary}
 \label{cor:approx-algo-punchline}
-Let $\query$ be an $\raPlus$ query and $\pdb$ be an \abbrBIDB with $p_0>0$ and $\gamma<1$ (where $p_0,\gamma$ as in \Cref{cor:approx-algo-const-p}) are absolute constants. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf}\inparen{\qruntime{\query, \dbbase}}$ (given $\query,\dbbase$ and $p_i$ for each $i\in [n]$ that defines $\pd$).
+Let $\query$ be an $\raPlus$ query and $\pdb$ be a $1$-\abbrBIDB such that $p_0>0$ and $\gamma<1$ (where $p_0,\gamma$ are as in \Cref{cor:approx-algo-const-p}) are absolute constants. Let $\poly(\vct{X})=\apolyqdt$ for any result tuple $\tup$ with $\deg(\poly)=k$. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time $O_{k,|Q|,\error',\conf}\inparen{\qruntime{\query, \tupset, \bound}}$ (given $\query,\tupset$ and $p_i$ for each $i\in [n]$ that defines $\pd$).
 %Let $\poly(\vct{X})$ be a \abbrBIDB-lineage polynomial correspoding to an \abbrBIDB circuit $\circuit$ that satisfies the specific conditions in \Cref{lem:val-ub}. Then one can compute an approximation satisfying \Cref{eq:approx-algo-bound-main} in time
 % $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$. % for the case when $\circuit$ satisfies the specific conditions in \Cref{lem:val-ub}.
 \end{Corollary}
+\AH{What is $\abs{\query}$?  Isn't that just $k$?}
 If we want to approximate the expected multiplicities of all $Z=O(n^k)$ result tuples $\tup$ simultaneously, we just need to run the above result with $\conf$ replaced by $\frac \conf Z$. Note this increases the runtime by only a logarithmic factor.
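 To see why the overhead is only logarithmic (a short derivation we add for illustration), note that the runtime in \Cref{cor:approx-algo-const-p} depends on $\conf$ only through the factor $\log{\frac{1}{\conf}}$. Replacing $\conf$ by $\frac{\conf}{Z}$ with $Z = O(n^k)$ gives
 \[\log{\frac{Z}{\conf}} = \log{Z} + \log{\frac{1}{\conf}} = O(k\log{n}) + \log{\frac{1}{\conf}},\]
 and, assuming \Cref{eq:approx-algo-bound-main} bounds the failure probability of a single estimate by $\conf$, the union bound shows that the probability that any of the $Z$ estimates violates it is at most $Z\cdot\frac{\conf}{Z} = \conf$.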


--- a/prob-def.tex
+++ b/prob-def.tex
@ -17,7 +17,7 @@ A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source gates (in de
 %
 Each gate has the following members: \type, \vpartial, \vari{input}, \degval, \vari{Lweight}, and \vari{Rweight}, where \type is the value type $\{\circplus, \circmult, \var, \tnum\}$ and \vari{input} the list of inputs. Source gates have an extra member \val storing the value.  $\circuit_\linput$ ($\circuit_\rinput$) denotes the left (right) input of \circuit.
 \end{Definition}
-\AH{Does the following matter, i.e., does it point anything out special for our research?}
+\AH{Does the following matter, i.e., does it point out anything special for our research? \textbf{EDIT}:~\Cref{lem:val-ub} does use this (when \circuit is a tree) to answer~\Cref{prob:intro-stmt} with a yes.}
 When the underlying DAG is a tree (with edges pointing towards the root), the structure is an expression tree \etree.  In such a case, the root of \etree is analogous to the sink of \circuit.  The fields \vari{partial}, \degval, \vari{Lweight}, and \vari{Rweight} are used in the proofs of \Cref{sec:proofs-approx-alg}.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@ -29,7 +29,9 @@ When it is unclear, we use $\smbOf{\poly}$ to denote the \abbrSMB form of a poly
 \begin{Definition}[Degree]\label{def:degree-of-poly}
 The degree of polynomial $\poly(\vct{X})$ is the largest \secrev{$\vct{d} = \sum_{i\in\pbox{\numedge}}d_i %= \norm{\vct{d}}_1
 $}% = \sum_{\tup\in\tupset} d_\tup$ 
- such that $c_{(d_1,\dots,d_n)}\ne 0$. % maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
+ such that $c_{(d_1,\dots,d_n)}\ne 0$. \secrev{
+ We denote the degree of $\poly$ as $\deg\inparen{\poly}$.
+ }% maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
 \end{Definition}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
@ -43,7 +45,7 @@ or simply lineage polynomial), if there exists a $\raPlus$ query $\query$, \abbr


 %Following the typical representation of bags in production databases, for query inputs, we will use \abbrBPDB\xplural with multiplicities $\{0, 1\}$ (see \Cref{sec:gener-results-beyond} for more on this choice).
-\subsection{$\mathbf{1}$-BIDB}
+\subsection{$\mathbf{1}$-BIDB}\label{subsec:one-bidb}
 \label{subsec:tidbs-and-bidbs}

 \noindent\secrev{