Done with my final pass

2021-04-09 00:14:02 -04:00 · 2021-04-09 00:14:02 -04:00 · 03897a15ac
parent 6b36c843db
commit 03897a15ac
2 changed files with 19 additions and 17 deletions
--- a/approx_alg.tex
+++ b/approx_alg.tex
@@ -3,14 +3,14 @@

 \section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}

-In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability  (\Cref{th:single-p-hard}).
-Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
+In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability  (\Cref{th:single-p-hard}).
+Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits; see the discussion after~\Cref{lem:val-ub} for details.}
 The following approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark~\cite{pdbench}.
 %it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.

 \subsection{Preliminaries and some more notation}

-We now introduce useful definitions and notation related to circuits and polynomials.  Kindly note that all proofs and pseudocode can be found in \cref{sec:proofs-approx-alg}.
+We now introduce useful definitions and notation related to circuits and polynomials.  All proofs and missing pseudocode can be found in \cref{sec:proofs-approx-alg}.
 \begin{Definition}[Variables in a monomial]\label{def:vars}
 Given a monomial $v$, we use $\var(v)$ to denote the set of variables in $v$.
 \end{Definition}
@@ -62,7 +62,7 @@ In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity o
 \end{Definition}

 \subsection{Our main result}
-In the subsequent subsections we will prove the following theorem.
+%In the subsequent subsections we will prove the following theorem.

 \begin{Theorem}\label{lem:approx-alg}
 Let \circuit be a circuit for a UCQ over \bi and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
@@ -91,7 +91,7 @@ Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\circ
 In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
 \end{Corollary}

-The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (Please see \Cref{app:subsec:experiment} for experimental results).
+The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as by all three queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment} for experimental results).

 Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \cref{susec:proof-val-up}, we show the following:
 \begin{Lemma}
@@ -106,15 +106,16 @@ Further, under either of the following conditions:
 we have $\abs{\circuit}(1,\ldots, 1)\le  \size(\circuit)^{O(k)}.$
 \end{Lemma}

-Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the special conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios.
+Note that the above implies that, under the assumption from \Cref{cor:approx-algo-const-p} that $\prob_0>0$ and $\gamma<1$ are absolute constants, the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the specific conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios.

 \subsection{Approximating $\rpoly$}
-The algorithm (\approxq) to prove~\Cref{lem:approx-alg} follows from the following observation.  Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
+The algorithm (\approxq, detailed in \Cref{alg:mon-sam}) used to prove~\Cref{lem:approx-alg} follows from the following observation.  Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
 \begin{equation}
 \label{eq:tilde-Q-bi}
 \rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(\monom,\coef)\in \expansion{\circuit}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \coef\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{\monom}}\hspace*{-2mm} X_i
 \end{equation}
-Given the above, the algorithm is a sampling based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.} to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in~\Cref{sec:proofs-approx-alg}).
+Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional %\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.}
+ to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in~\Cref{sec:proofs-approx-alg}).
 %\approxq (\cref{alg:mon-sam}) modifies \circuit with a call to \onepass.  It then samples from $\circuit_{\vari{mod}}\numsamp$ times and uses that information to approximate $\rpoly$.

 \input{app_approx-alg-pseudo-code}
--- a/circuits-model-runtime.tex
+++ b/circuits-model-runtime.tex
@@ -15,10 +15,10 @@ Lastly, we generalize our result for expectation to other moments.
 %
 %\label{sec:circuits}

-\subsection{The cost model}
-\label{sec:cost-model}
-So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
-We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed lineage polynomial for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
+\mypar{The cost model}
+%\label{sec:cost-model}
+So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
+We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
 We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db}$).
 \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
 In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
@@ -45,7 +45,7 @@ We adopt a minimalistic compute-bound model of query evaluation drawn from the w
 Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
 We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.

-It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though these queries can be data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
+It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though these queries can be data dependent).\footnote{This claim can be verified, e.g., by simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and the {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model on the corresponding SPJU queries correctly captures their runtime.

 %We now make a simple observation on the above cost model:
 %\begin{proposition}
@@ -55,6 +55,7 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

+We are now ready to formally state our claim from \Cref{sec:intro}:
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Corollary}
  Let $Q$ be an SPJU query over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$. Then we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
@@ -70,13 +71,13 @@ This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime})
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Higher Moments}
-\label{sec:momemts}
-
+\mypar{Higher Moments}
+%\label{sec:momemts}
+%
 We make a simple observation to conclude the presentation of our results.
 So far we have only focused on the expectation of $\poly$.  In addition, we could, e.g., prove bounds on the probability of the multiplicity being at least $1$.  Progress can be made on this as follows:
 For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to use, e.g., Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
-We leave this questions like this for future work.
+We leave a further investigation of this question for future work.

 %%% Local Variables:
 %%% mode: latex