%!TEX root=./main.tex
\section{More on Circuits and Moments}\label{sec:gen}
We formalize our claim from \Cref{sec:intro} that a linear-time approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered in the same runtime as deterministic queries, under reasonable assumptions.
We then generalize our result for expectation to higher moments.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\revision{
%\subsection{Cost Model, Query Plans, and Runtime}
%As in the introduction, we could consider polynomials to be represented as an expression tree.
%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
%}
%
%\label{sec:circuits}
\subsection{The Cost Model}
\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
We now show that this model corresponds to the behavior of a deterministic database: for any UCQ query $\poly$ and \bi $\pxdb$, we can construct a compressed lineage polynomial for $\poly$ whose size (and construction time) is linear in the runtime of a general class of query processing algorithms for the same query $\poly$ over a deterministic database $\db$.
We assume a linear relationship between the input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db}$).%
\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
%
\noindent\resizebox{1\linewidth}{!}{
\begin{minipage}{1.0\linewidth}
\begin{align*}
\qruntime{R,D} & = |R| &
\qruntime{\sigma Q, D} & = \qruntime{Q,D} &
\qruntime{\pi Q, D} & = \qruntime{Q,D} + \abs{Q(D)}
\end{align*}\\[-15mm]
\begin{align*}
\qruntime{Q \cup Q', D} & = \qruntime{Q, D} + \qruntime{Q', D} +\abs{Q(D)}+\abs{Q'(D)} \\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n, D} & = \qruntime{Q_1, D} + \ldots + \qruntime{Q_n,D} + \abs{Q_1(D) \bowtie \ldots \bowtie Q_n(D)}
\end{align*}
\end{minipage}
}\\
Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
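For example, for the illustrative query $\pi_A(R \bowtie S)$ (used here only to unroll the rules above), the model charges for the two base relation scans, the join result, and the input to the projection:
\begin{align*}
\qruntime{\pi_A(R \bowtie S), D} & = \qruntime{R \bowtie S, D} + \abs{(R \bowtie S)(D)}\\
& = \abs{R} + \abs{S} + 2\cdot\abs{R(D) \bowtie S(D)},
\end{align*}
i.e., the cost is linear in the sizes of the inputs and of the intermediate join result.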
2020-12-14 23:21:03 -05:00
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}), can be modeled as select-project-join-union (SPJU) queries (though these queries can be data dependent).\footnote{This claim can be verified by, e.g., inspecting the {\em Generic-Join} algorithm in~\cite{skew} and the {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model correctly captures the runtime of these algorithms on the corresponding SPJU queries.
%We now make a simple observation on the above cost model:
%\begin{proposition}
%\label{prop:queries-need-to-output-tuples}
%The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
%\end{proposition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}
Let $Q$ be an SPJU query over a \ti $\pxdb$, and let $\db_{max}$ denote the world containing all tuples of $\pxdb$. We can compute a $(1\pm\eps)$-approximation of the expected multiplicity of each output tuple in $\query(\pxdb)$, with probability at least $1-\delta$, in time
%
\[
O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\db_{max}}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
\]
\end{Corollary}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}
This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p}, where the latter is applied with $\delta$ replaced\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple. To get the required guarantee for all (at most $n^k$) output tuples of $Q$, we allow a failure probability of at most $\frac{\delta}{n^k}$ per output tuple and then take a union bound over all output tuples.} by $\frac{\delta}{n^k}$.
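Written out, the union bound over the at most $n^k$ output tuples gives
\[
\Pr\left[\text{some output tuple's estimate is not a $(1\pm\eps)$-approximation}\right] \;\leq\; n^k \cdot \frac{\delta}{n^k} \;=\; \delta.
\]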
\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2020-12-16 21:34:26 -05:00
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Higher Moments}
\label{sec:momemts}
We make a simple observation to conclude the presentation of our results.
So far we have only focused on the expectation of $\poly$. In addition, we might, for example, want to prove bounds on the probability of the multiplicity being at least $1$. Progress can be made on this as follows:
for any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us, e.g., to use Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
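For example (writing $X$ for the random multiplicity of a fixed output tuple and $\mathrm{E}[\cdot]$ for its expectation, notation introduced here only for illustration), Chebyshev's inequality uses only the first two moments:
\[
\Pr\bigl[\,\abs{X - \mathrm{E}[X]} \geq a\,\bigr] \;\leq\; \frac{\mathrm{E}[X^2] - \mathrm{E}[X]^2}{a^2} \qquad \text{for any } a > 0.
\]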
We leave questions like this for future work.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: