Done with my final pass

master
Atri Rudra 2021-04-09 00:14:02 -04:00
parent 6b36c843db
commit 03897a15ac
2 changed files with 19 additions and 17 deletions

View File

@ -3,14 +3,14 @@
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits: please see the discussion after~\Cref{lem:val-ub} for more.}
The folowing approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark~\cite{pdbench}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
We now introduce useful definitions and notation related to circuits and polynomials. Kindly note that all proofs and pseudocode can be found in \cref{sec:proofs-approx-alg}.
We now introduce useful definitions and notation related to circuits and polynomials. All proofs and missing pseudocode can be found in \cref{sec:proofs-approx-alg}.
\begin{Definition}[Variables in a monomial]\label{def:vars}
Given a monomial $v$, we use $\var(v)$ to denote the set of variables in $v$.
\end{Definition}
@ -62,7 +62,7 @@ In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity o
\end{Definition}
\subsection{Our main result}
In the subsequent subsections we will prove the following theorem.
%In the subsequent subsections we will prove the following theorem.
\begin{Theorem}\label{lem:approx-alg}
Let \circuit be a circuit for a UCQ over \bi and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
@ -91,7 +91,7 @@ Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\circ
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
\end{Corollary}
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (Please see \Cref{app:subsec:experiment} for experimental results).
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment} for experimental results).
Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \cref{susec:proof-val-up}, we show the following:
\begin{Lemma}
@ -106,15 +106,16 @@ Further, under either of the following conditions:
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
\end{Lemma}
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the special conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios.
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the specific conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios.
\subsection{Approximating $\rpoly$}
The algorithm (\approxq) to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
\begin{equation}
\label{eq:tilde-Q-bi}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(\monom,\coef)\in \expansion{\circuit}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \coef\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{\monom}}\hspace*{-2mm} X_i
\end{equation}
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.} to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in~\Cref{sec:proofs-approx-alg}).
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional%\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.}
to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in~\Cref{sec:proofs-approx-alg}).
%\approxq (\cref{alg:mon-sam}) modifies \circuit with a call to \onepass. It then samples from $\circuit_{\vari{mod}}\numsamp$ times and uses that information to approximate $\rpoly$.
\input{app_approx-alg-pseudo-code}

View File

@ -15,10 +15,10 @@ Lastly, we generalize our result for expectation to other moments.
%
%\label{sec:circuits}
\subsection{The cost model}
\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed lineage polynomial for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
\mypar{The cost model}
%\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
@ -45,7 +45,7 @@ We adopt a minimalistic compute-bound model of query evaluation drawn from the w
Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though these queries can be data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though these queries can be data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
%We now make a simple observation on the above cost model:
%\begin{proposition}
@ -55,6 +55,7 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We are now ready to formally state our claim from \Cref{sec:intro}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}
Given an SPJU query $Q$ over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
@ -70,13 +71,13 @@ This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime})
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Higher Moments}
\label{sec:momemts}
\mypar{Higher Moments}
%\label{sec:momemts}
%
We make a simple observation to conclude the presentation of our results.
So far we have only focused on the expectation of $\poly$. In addition, we could e.g. prove bounds of probability of the multiplicity being at least $1$. Progress can be made on this as follows:
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. to use Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
We leave this questions like this for future work.
We leave a further investigation of this question for future work.
%%% Local Variables:
%%% mode: latex