%!TEX root=./main.tex
\section{More on Circuits and Moments}\label{sec:gen}
We formalize our claim from \Cref{sec:intro} that a linear-time approximation algorithm for our problem implies that, under reasonable assumptions, PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries.
Lastly, we generalize our result for expectation to other moments.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\revision{
%\subsection{Cost Model, Query Plans, and Runtime}
%As in the introduction, we could consider polynomials to be represented as an expression tree.
%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
%}
%
%\label{sec:circuits}
\mypar{The cost model}
%\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
We now show that this model corresponds to the behavior of a deterministic database: for any \raPlus query $\query$ and \bi $\pxdb$, we can construct a compressed circuit for $\poly$ whose size and construction time are linear in the runtime of a general class of query processing algorithms evaluating the same query $\query$ on $\pxdb$'s \dbbaseName $\dbbase$.
% Note that by definition, there exists a linear relationship between input sizes $|\pxdb|$ and $|\dbbase|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
% \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
% In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.

%
\noindent\resizebox{1\linewidth}{!}{
\begin{minipage}{1.0\linewidth}
\begin{align*}
\qruntime{R,\dbbase} & = |\dbbase.R| &
\qruntime{\sigma Q, \dbbase} & = \qruntime{Q,\dbbase} &
\qruntime{\pi Q, \dbbase} & = \qruntime{Q,\dbbase} + \abs{Q(\dbbase)}
\end{align*}\\[-15mm]
\begin{align*}
\qruntime{Q \cup Q', \dbbase} & = \qruntime{Q, \dbbase} + \qruntime{Q', \dbbase} +\abs{Q(\dbbase)}+\abs{Q'(\dbbase)} \\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n, \dbbase} & = \qruntime{Q_1, \dbbase} + \ldots + \qruntime{Q_n,\dbbase} + \abs{Q_1(\dbbase) \bowtie \ldots \bowtie Q_n(\dbbase)}
\end{align*}
\end{minipage}
}\\
Under this model a query $Q$ evaluated over database $\dbbase$ has runtime $O(\qruntime{Q,\dbbase})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.

It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}), can be modeled as SPJU queries (though the size of these queries is data dependent).\footnote{This claim can be verified by, e.g., inspecting the {\em Generic-Join} algorithm in~\cite{skew} and the {\em factorize} algorithm in~\cite{factorized-db}.} It can further be verified that the above cost model, applied to the corresponding SPJU queries, correctly captures their runtime.
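As a small illustration of the cost model (a sketch only: the relations $R$ and $S$, the attribute $A$, and the cardinalities below are hypothetical), consider $Q = \pi_A(R \bowtie S)$. Unfolding the projection rule and then the join rule gives
\begin{align*}
\qruntime{Q,\dbbase} & = \qruntime{R \bowtie S,\dbbase} + \abs{(R \bowtie S)(\dbbase)} \\
& = \abs{\dbbase.R} + \abs{\dbbase.S} + 2\cdot\abs{R(\dbbase) \bowtie S(\dbbase)}.
\end{align*}
For instance, assuming $\abs{\dbbase.R} = 100$, $\abs{\dbbase.S} = 50$, and a join result of $200$ tuples, we get $\qruntime{Q,\dbbase} = 100 + 50 + 2\cdot 200 = 550$. The join output is counted twice: once when it is produced by the join and once when it is consumed by the projection.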
%
%We now make a simple observation on the above cost model:
%\begin{proposition}
%\label{prop:queries-need-to-output-tuples}
%The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
%\end{proposition}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
We are now ready to formally state our claim from \Cref{sec:intro}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}
Given an SPJU query $Q$ over a \ti $\pxdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
%
\[
O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\dbbase}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right).
\]
\end{Corollary}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}
This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ replaced\footnote{Recall that \Cref{cor:approx-algo-const-p} is stated for a single output tuple. To obtain the required guarantee for all (at most $n^k$) output tuples of $Q$, we allow a failure probability of at most $\frac \delta{n^k}$ for each output tuple and then apply a union bound over all output tuples.} by $\frac \delta{n^k}$).
\end{proof}
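To spell out the substitution used in the proof (a sketch; $E_t$ is an ad hoc name, not used elsewhere, for the event that the estimate for output tuple $t$ falls outside the $(1\pm\eps)$ range), the union bound over the at most $n^k$ output tuples gives
\[
\Pr\Big[\bigcup_{t} E_t\Big] \leq \sum_{t} \Pr[E_t] \leq n^k\cdot\frac{\delta}{n^k} = \delta,
\]
while the per-tuple confidence term becomes
\[
\log\frac{1}{\delta/n^k} = \log\frac{1}{\delta} + k\cdot\log{n}.
\]
Assuming base-$2$ logarithms with $\delta \leq \frac 12$ and $n \geq 2$ (so that both logarithms are at least $1$), this sum is at most $(k+1)\cdot\log{\frac{1}{\delta}}\cdot\log{n}$, which is consistent with the product form of the bound above once constants depending on $k$ are absorbed into $O_k$.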
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Higher Moments}
%\label{sec:momemts}
%
We make a simple observation to conclude the presentation of our results.
So far we have focused only on the expectation of $\poly$.
In addition, we could, e.g., prove bounds on the probability of a tuple's multiplicity being at least $1$.
Progress can be made on this as follows:
for any positive integer $m$, we can compute the $m$-th moment of the multiplicities, allowing us to use, e.g., Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
We leave further investigations for future work.
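As a sketch of how such moments could be used (these are standard probability facts; $X$ is an ad hoc name for the random multiplicity of a fixed output tuple, assumed to be a non-negative integer), the first two moments already yield
\begin{align*}
\Pr[X \geq 1] & \leq \mathrm{E}[X], \\
\Pr[X \geq 1] & \geq \frac{\mathrm{E}[X]^2}{\mathrm{E}[X^2]}, \\
\Pr\big[\abs{X - \mathrm{E}[X]} \geq a\big] & \leq \frac{\mathrm{E}[X^2] - \mathrm{E}[X]^2}{a^2}
\end{align*}
for any $a > 0$ (and $\mathrm{E}[X^2] > 0$ for the second bound). These are Markov's inequality, the second moment method (via Cauchy--Schwarz), and Chebyshev's inequality, respectively. With approximate rather than exact moments, such bounds would hold only up to the corresponding approximation error.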
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: