Done with my final pass
parent
6b36c843db
commit
03897a15ac
|
@ -3,14 +3,14 @@
|
|||
|
||||
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
|
||||
|
||||
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
|
||||
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
|
||||
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
|
||||
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits: please see the discussion after~\Cref{lem:val-ub} for more.}
|
||||
The following approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark~\cite{pdbench}.
|
||||
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
|
||||
|
||||
\subsection{Preliminaries and some more notation}
|
||||
|
||||
We now introduce useful definitions and notation related to circuits and polynomials. Kindly note that all proofs and pseudocode can be found in \cref{sec:proofs-approx-alg}.
|
||||
We now introduce useful definitions and notation related to circuits and polynomials. All proofs and missing pseudocode can be found in \cref{sec:proofs-approx-alg}.
|
||||
\begin{Definition}[Variables in a monomial]\label{def:vars}
|
||||
Given a monomial $v$, we use $\var(v)$ to denote the set of variables in $v$.
|
||||
\end{Definition}
|
||||
|
@ -62,7 +62,7 @@ In a RAM model of word size of $W$-bits, $\multc{M}{W}$ denotes the complexity o
|
|||
\end{Definition}
|
||||
|
||||
\subsection{Our main result}
|
||||
In the subsequent subsections we will prove the following theorem.
|
||||
%In the subsequent subsections we will prove the following theorem.
|
||||
|
||||
\begin{Theorem}\label{lem:approx-alg}
|
||||
Let \circuit be a circuit for a UCQ over \bi and define $\poly(\vct{X})=\polyf(\circuit)$ and let $k=\degree(\circuit)$.
|
||||
|
@ -91,7 +91,7 @@ Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\circ
|
|||
In particular, if $\prob_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)\cdot\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}\right)$.
|
||||
\end{Corollary}
|
||||
|
||||
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (Please see \Cref{app:subsec:experiment} for experimental results).
|
||||
The restriction on $\gamma$ is satisfied by any \ti (where $\gamma=0$) as well as for all three queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment} for experimental results).
|
||||
|
||||
Finally, we address the $\multc{\log\left(\abs{\circuit}(1,\ldots, 1)\right)}{\log\left(\size(\circuit)\right)}$ term in the runtime. %In \cref{susec:proof-val-up}, we show the following:
|
||||
\begin{Lemma}
|
||||
|
@ -106,15 +106,16 @@ Further, under either of the following conditions:
|
|||
we have $\abs{\circuit}(1,\ldots, 1)\le \size(\circuit)^{O(k)}.$
|
||||
\end{Lemma}
|
||||
|
||||
Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the special conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios.
|
||||
Note that the above implies that, under the assumption from \Cref{cor:approx-algo-const-p} that $\prob_0>0$ and $\gamma<1$ are absolute constants, the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the specific conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios.
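To see where the linear bound comes from, here is a sketch of the arithmetic for the special case only. It assumes that $\multc{M}{W}$ is read as the cost of multiplying two $M$-bit integers on a machine with $W$-bit words, and that schoolbook multiplication is used; both are assumptions made for illustration. Under the conditions of~\Cref{lem:val-ub},
\[
\log\left(\abs{\circuit}(1,\ldots, 1)\right)\le \log\left(\size(\circuit)^{O(k)}\right)=O\left(k\cdot\log\left(\size(\circuit)\right)\right),
\]
so each operand occupies $O(k)$ words of $\log\left(\size(\circuit)\right)$ bits and one multiplication costs $O(k^2)=O_k(1)$ word operations. Substituting this for the $\multc{\cdot}{\cdot}$ factor in~\Cref{cor:approx-algo-const-p} leaves $\size(\circuit)$ as the only dependence on the circuit.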
|
||||
|
||||
\subsection{Approximating $\rpoly$}
|
||||
The algorithm (\approxq) to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
|
||||
The algorithm (\approxq, detailed in \Cref{alg:mon-sam}) used to prove~\Cref{lem:approx-alg} follows from the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
|
||||
\begin{equation}
|
||||
\label{eq:tilde-Q-bi}
|
||||
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(\monom,\coef)\in \expansion{\circuit}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \coef\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{\monom}}\hspace*{-2mm} X_i
|
||||
\end{equation}
|
||||
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.} to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in~\Cref{sec:proofs-approx-alg}).
|
||||
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample (via \sampmon) $(\monom,\coef)\in \expansion{\circuit}$ with probability proportional %\footnote{We could have also uniformly sampled from $\expansion{\circuit}$ but this gives better parameters.}
|
||||
to $\abs{\coef}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{\monom}} p_i$. Taking $\numsamp$ samples and computing the average of $Y$ gives us our final estimate. \onepass is used to compute the sampling probabilities needed in \sampmon (details are in~\Cref{sec:proofs-approx-alg}).
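As a short check of what this estimator targets, the following is a sketch under two assumptions that are not stated in this section: every coefficient $\coef$ in $\expansion{\circuit}$ is nonnegative, and $\abs{\circuit}(1,\ldots,1)$ equals $\sum_{(\monom,\coef)\in\expansion{\circuit}}\abs{\coef}$. Writing $\mathbb{E}$ for the expectation over one sample (generic notation, not one of the paper's macros),
\[
\mathbb{E}\left[Y\right]=\sum_{(\monom,\coef)\in \expansion{\circuit}}\frac{\abs{\coef}}{\abs{\circuit}(1,\ldots,1)}\cdot\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot\prod_{X_i\in \var\inparen{\monom}} p_i=\frac{\rpoly\inparen{p_1,\dots,p_\numvar}}{\abs{\circuit}(1,\ldots,1)}.
\]
Under these assumptions the average of the $\numsamp$ samples estimates $\rpoly\inparen{p_1,\dots,p_\numvar}$ after rescaling by $\abs{\circuit}(1,\ldots,1)$; with signed coefficients, each sample would additionally have to carry the sign of $\coef$.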
|
||||
%\approxq (\cref{alg:mon-sam}) modifies \circuit with a call to \onepass. It then samples from $\circuit_{\vari{mod}}\numsamp$ times and uses that information to approximate $\rpoly$.
|
||||
|
||||
\input{app_approx-alg-pseudo-code}
|
||||
|
|
|
@ -15,10 +15,10 @@ Lastly, we generalize our result for expectation to other moments.
|
|||
%
|
||||
%\label{sec:circuits}
|
||||
|
||||
\subsection{The cost model}
|
||||
\label{sec:cost-model}
|
||||
So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed lineage polynomial for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
|
||||
\mypar{The cost model}
|
||||
%\label{sec:cost-model}
|
||||
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
|
||||
We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db}$).
|
||||
\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
|
||||
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
|
||||
|
@ -45,7 +45,7 @@ We adopt a minimalistic compute-bound model of query evaluation drawn from the w
|
|||
Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
|
||||
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
|
||||
|
||||
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though these queries can be data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
|
||||
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though these queries can be data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
|
||||
|
||||
%We now make a simple observation on the above cost model:
|
||||
%\begin{proposition}
|
||||
|
@ -55,6 +55,7 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
We are now ready to formally state our claim from \Cref{sec:intro}:
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}
|
||||
Let $Q$ be an SPJU query over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$. Then we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
|
||||
|
@ -70,13 +71,13 @@ This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime})
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Higher Moments}
|
||||
\label{sec:momemts}
|
||||
|
||||
\mypar{Higher Moments}
|
||||
%\label{sec:momemts}
|
||||
%
|
||||
We make a simple observation to conclude the presentation of our results.
|
||||
So far we have only focused on the expectation of $\poly$. In addition, we could, e.g., prove bounds on the probability of the multiplicity being at least $1$. Progress can be made on this as follows:
|
||||
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us, e.g., to use Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
|
||||
We leave this questions like this for future work.
|
||||
We leave a further investigation of this question for future work.
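As a concrete instance of such a moment-based bound, here is the standard Chebyshev inequality in generic notation. The symbols are assumptions introduced for illustration and are not the paper's macros: $Z$ stands for the multiplicity of an output tuple, and $\mathbb{E}$ and $\mathrm{Var}$ denote expectation and variance. For every $t>0$,
\[
\Pr\left[\,\left|Z-\mathbb{E}[Z]\right|\ge t\,\right]\le \frac{\mathrm{Var}[Z]}{t^2},
\qquad
\mathrm{Var}[Z]=\mathbb{E}\left[Z^2\right]-\left(\mathbb{E}[Z]\right)^2,
\]
so the first two moments ($m=1$ and $m=2$) already suffice to obtain a tail bound of this form.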
|
||||
|
||||
%%% Local Variables:
|
||||
%%% mode: latex
|
||||
|
|