%!TEX root=./main.tex
\revision{
\section{More on Circuits and Moments}
}
\label{sec:gen}
%In this section, we consider generalizations/corollaries of our results.
%In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
%Then,
We formalize our claim from \Cref{sec:intro} that a linear-time algorithm for our problem implies that, under reasonable assumptions, PDB queries can be answered in the same runtime as deterministic queries.
Then, in~\Cref{sec:momemts}, we generalize our result for expectation to other moments.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\revision{
\subsection{Cost Model, Query Plans, and Runtime}
}
\label{sec:circuits}
%In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for circuits and then
We argue why circuits capture the runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsubsection{Extending our results to circuits}
%\label{sec:results-circuits}
%
%We first note that since expression trees are a special case of linear circuits, all of our hardness results for in~\Cref{sec:hard} are still valid for the latter.
%
%Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for circuits as well.
%It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children;
%Analogously in a circuit, each node has a maximum in-degree of two.
%Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
%%
%For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a circuit, see~\Cref{app:lineage-circuit-ext}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{The cost model}
\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any union of conjunctive queries, we can construct a compressed lineage polynomial for a query $Q$ and \bi $\pxdb$ of size (and in runtime) linear in the runtime of a general class of query processing algorithms for the same query $Q$ on a deterministic database $\db$.
We assume a linear relationship between the input sizes $\abs{\pxdb}$ and $\abs{\db}$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db}$).
This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
%
\noindent\resizebox{1\linewidth}{!}{
\begin{minipage}{1.0\linewidth}
\begin{align*}
\qruntime{R,D} & = |R| \\
\qruntime{\sigma Q, D} & = \qruntime{Q,D} \\
\qruntime{\pi Q, D} & = \qruntime{Q,D} + \abs{Q(D)} \\
\qruntime{Q \cup Q', D} & = \qruntime{Q, D} + \qruntime{Q', D} +\abs{Q(D)}+\abs{Q'(D)} \\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n, D} & = \qruntime{Q_1, D} + \ldots + \qruntime{Q_n,D} + \abs{Q_1(D) \bowtie \ldots \bowtie Q_n(D)}
\end{align*}
\end{minipage}
}\\
Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
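For illustration, consider a hypothetical instance (the relations $R$ and $S$ and their sizes are assumptions made only for this example) with $\abs{R} = 3$, $\abs{S} = 4$, and $\abs{R(D) \bowtie S(D)} = 5$. Unfolding the rules above for $\pi(R \bowtie S)$ gives
\begin{align*}
\qruntime{\pi(R \bowtie S), D} & = \qruntime{R \bowtie S, D} + \abs{(R \bowtie S)(D)} \\
 & = \qruntime{R, D} + \qruntime{S, D} + \abs{R(D) \bowtie S(D)} + \abs{(R \bowtie S)(D)} \\
 & = 3 + 4 + 5 + 5 = 17,
\end{align*}
so the join result is charged once by the join rule and once more as the input of the projection.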
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}), can be modeled as select-project-join-union (SPJU) queries (though these queries can be data dependent).\footnote{This claim can be verified by, e.g., inspecting the {\em Generic-Join} algorithm in~\cite{skew} and the {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model, applied to the corresponding SPJU queries, correctly captures their runtime.
%We now make a simple observation on the above cost model:
%\begin{proposition}
%\label{prop:queries-need-to-output-tuples}
%The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
%\end{proposition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Circuits for query plans}
\label{sec:circuits-formal}
We now formalize circuits and the construction of circuits for SPJU queries.
As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$, $\times$.
A circuit for query $Q$ and $\semNX$-PDB $\pxdb$ is a directed acyclic graph $\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}$ with vertices $V_{Q,\pxdb}$ and directed edges $E_{Q,\pxdb} \subset {V_{Q,\pxdb}}^2$.
The sink function $\phi_{Q,\pxdb} : \udom^n \rightarrow V_{Q,\pxdb}$ is a partial function that maps the tuples of the $n$-ary relation $Q(\pxdb)$ to vertices.
We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
We require that vertices have an in-degree of at most two.
%
For the specifics on how to construct a circuit to encode the polynomials of all result tuples for a query and $\semNX$-PDB see \Cref{app:subsec-rep-poly-lin-circ}. Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.
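As a small illustration (a hypothetical example; the tuple $t$ and the vertex names $v_1, \ldots, v_6$ are introduced here only for exposition), suppose $Q(\pxdb)$ contains a single tuple $t$ with lineage polynomial $X_1 X_2 + X_1 X_3$. One circuit satisfying the definition above is
\begin{align*}
V_{Q,\pxdb} & = \{v_1, \ldots, v_6\}, \qquad \phi_{Q,\pxdb}(t) = v_6, \\
E_{Q,\pxdb} & = \{(v_1, v_4), (v_2, v_4), (v_1, v_5), (v_3, v_5), (v_4, v_6), (v_5, v_6)\}, \\
\ell_{Q,\pxdb}(v_i) & = X_i \text{ for } i \in \{1,2,3\}, \qquad \ell_{Q,\pxdb}(v_4) = \ell_{Q,\pxdb}(v_5) = \times, \qquad \ell_{Q,\pxdb}(v_6) = +.
\end{align*}
Every vertex has in-degree at most two, $v_6$ is the only sink, and the source $v_1$ feeds both product nodes, so the variable $X_1$ is represented only once; an expression tree for the same polynomial would need two leaves for $X_1$.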
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Circuit size vs. runtime}
\label{sec:circuit-runtime}
\newcommand{\bagdbof}{\textsc{bag}(\pxdb)}
We now connect the size of a circuit (where the size of a circuit is the number of vertices in the corresponding DAG) %\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.})
for a given SPJU query $Q$ and $\semNX$-PDB $\pxdb$ to the cost $\qruntime{Q,\db}$, where $\db$ is one of the possible worlds of $\pxdb$. We do this formally by showing that the size of the circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{lemma}
\label{lem:circuits-model-runtime}
Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the size of the lineage circuit of $Q(\pxdb)$ is asymptotically no larger than the runtime of $Q$ over $\bagdbof$. That is, we have $\abs{V_{Q,\pxdb}} \leq (k-1)\qruntime{Q, \bagdbof}$, where $k$ is the maximal degree of any polynomial in $Q(\pxdb)$.
\end{lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent The proof is shown in~\Cref{app:subsec-lem-lin-vs-qplan}.
We now have all the pieces to argue that, using our approximation algorithm, the expected multiplicities of an SPJU query can be computed in essentially the same runtime as deterministic query processing for the same query:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}
Let $Q$ be an SPJU query over a \ti $\pxdb$, and let $\db_{max}$ denote the world containing all tuples of $\pxdb$. Then we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
%
\[
O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\db_{max}}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
\]
\end{Corollary}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}
This follows from~\Cref{lem:circuits-model-runtime} and the circuit counterpart of~\Cref{cor:approx-algo-const-p} (see~\Cref{sec:results-circuits}), where the latter is applied with $\delta$ replaced by $\frac \delta{n^k}$.\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple. To obtain the required guarantee for all (at most $n^k$) output tuples of $Q$, we allow a failure probability of at most $\frac \delta{n^k}$ for each output tuple and then apply a union bound over all output tuples.}
\end{proof}
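To spell out the substitution used in the proof (a sketch under the assumption, suggested by the stated bound, that the runtime in~\Cref{cor:approx-algo-const-p} depends on the failure probability only through a logarithmic factor): if each of the at most $n^k$ output tuples fails with probability at most $\frac \delta{n^k}$, then
\begin{align*}
\Pr[\text{some output tuple fails}] & \leq n^k \cdot \frac{\delta}{n^k} = \delta, \\
\log\frac{n^k}{\delta} & = k \log(n) + \log\frac{1}{\delta}.
\end{align*}
For constant $k$, the quantity $k \log(n) + \log\frac{1}{\delta}$ is at most $(k+1) \cdot \log(n) \cdot \log\frac{1}{\delta}$ whenever both logarithms are at least $1$, which is consistent with the $\log\frac{1}{\delta} \cdot \log(n)$ factor in the bound above.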
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Higher Moments}
\label{sec:momemts}
We make a simple observation to conclude the presentation of our results.
So far we have focused on the expectation of $\poly$.
In addition, we could, e.g., prove bounds on the probability of the multiplicity being at least $1$.
While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
For any positive integer $m$, we can compute the expectation of $\poly^m$ (which only increases the degree of the corresponding lineage polynomial by a factor of $m$).
In other words, we can compute the $m$-th moment of the multiplicities, allowing us, e.g., to use Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
However, we leave the question of coming up with more accurate approximation algorithms for future work.
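For concreteness, a sketch for $m = 2$ (the symbols $\mu$, $\sigma^2$, and $a$ are introduced here only for illustration, and $\mathbb{E}$ denotes the expectation over possible worlds): writing $\mu = \mathbb{E}[\poly]$ and $\sigma^2 = \mathbb{E}[\poly^2] - \mu^2$, Chebyshev's inequality states that for every $a > 0$,
\[
\Pr\left[\,\abs{\poly - \mu} \geq a\,\right] \leq \frac{\sigma^2}{a^2}.
\]
Both $\mu$ and $\mathbb{E}[\poly^2]$ are expectations of lineage polynomials, the latter of degree at most $2k$ when $\poly$ has degree $k$; for example, $\poly = X_1 X_2$ has degree $2$, while $\poly^2 = X_1^2 X_2^2$ has degree $4$.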
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: