%!TEX root=./main.tex
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\input{app_set_to_bag_pdb}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Missing details from Section~\ref{sec:background}}\label{sec:proofs-background}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{$\semK$-relations and \abbrNXPDB\xplural}\label{subsec:supp-mat-background}\label{subsec:supp-mat-krelations}
|
|
\input{app_notation-background}
|
|
|
|
|
|
|
|
\section{Missing details from Section~\ref{sec:hard}}
|
|
\label{app:single-mult-p}
|
|
\input{app_hardness-results}
|
|
|
|
\section{Missing details from Section~\ref{sec:algo}}\label{sec:proofs-approx-alg}
|
|
\input{app_approx-alg-defs-and-examples}
|
|
\input{app_approx-alg-analysis}
|
|
\input{app_one-pass-analysis}
|
|
\input{app_samp-monom-analysis}
|
|
|
|
\subsection{Experimental Results}\label{app:subsec:experiment}
|
|
\input{experiments}
|
|
|
|
\section{Circuits}\label{app:sec-cicuits}
|
|
\subsection{Representing Polynomials with Circuits}\label{app:subsec-rep-poly-lin-circ}
|
|
\subsubsection{Circuits for query plans}
|
|
\label{sec:circuits-formal}
|
|
\AR{Since this comment is not showing up below, I do not follow why the last sentence of this para is true.}
|
|
We now formalize circuits and the construction of circuits for $\raPlus$ queries.
|
|
As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$, $\times$.
|
|
A circuit for query $Q$ and \abbrNXPDB $\pxdb$ is a directed acyclic graph $\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}$ with vertices $V_{Q,\pxdb}$ and directed edges $E_{Q,\pxdb} \subset {V_{Q,\pxdb}}^2$.
|
|
The sink function $\phi_{Q,\pxdb} : \udom^n \rightarrow V_{Q,\pxdb}$ is a partial function that maps the tuples of the $n$-ary\AR{In the main paper we have used $n$ to denote the number of input tuples so we need to use some other notation $n$ but since I do not know where all this change will need to be propagated so am not changing it for now.} relation $Q(\pxdb)$ to vertices.
|
|
We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
|
|
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
|
|
A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
|
|
We require that vertices have an in-degree of at most two.
|
|
%For the specifics on how to construct a circuit to encode the polynomials of all result tuples for a query and \abbrNXPDB see \Cref{app:subsec-rep-poly-lin-circ}.
|
|
Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.\AR{I do not follow the last sentence.}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Modeling Circuit Construction}
|
|
|
|
\newcommand{\bagdbof}{\textsc{bag}(\pxdb)}
|
|
|
|
We now connect the size of a circuit (where the size of a circuit is the number of vertices in the corresponding DAG) %\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.})
|
|
for a given $\raPlus$ query $Q$ and \abbrNXPDB $\pxdb$ to
|
|
the runtime $\qruntime{Q,\dbbase}$ of the PDB's \dbbaseName $\dbbase$.
|
|
We do this formally by showing that the size of the circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\newcommand{\getpoly}[1]{\textbf{lin}\inparen{#1}}
|
|
Each vertex $v \in V_{Q,\pxdb}$ in the arithmetic circuit for
|
|
|
|
\[\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}\]
|
|
|
|
encodes a polynomial, realized as
|
|
|
|
\[\getpoly{v} = \begin{cases}
|
|
\sum_{v' : (v',v) \in E_{Q,\pxdb}} \getpoly{v'} & \textbf{if } \ell(v) = +\\
|
|
\prod_{v' : (v',v) \in E_{Q,\pxdb}} \getpoly{v'} & \textbf{if } \ell(v) = \times\\
|
|
\ell(v) & \textbf{otherwise}
|
|
\end{cases}\]
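
For instance, consider a small hypothetical circuit (the vertex names $v_1, \ldots, v_5$ are assumptions made only for this illustration): let $v_1$, $v_2$, and $v_3$ be source vertices labeled $X$, $Y$, and $2$, respectively, let $v_4$ be a $+$ vertex with in-edges from $v_1$ and $v_2$, and let $v_5$ be a $\times$ vertex with in-edges from $v_4$ and $v_3$. Unfolding the definition gives
\[\getpoly{v_4} = \getpoly{v_1} + \getpoly{v_2} = X + Y
\qquad\text{and}\qquad
\getpoly{v_5} = \getpoly{v_4} \cdot \getpoly{v_3} = 2(X + Y).\]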
|
|
|
|
|
|
We define the circuit for a $\raPlus$ query $\query$ recursively by cases as follows. In each case, let $\tuple{V_{Q_i,\pxdb}, E_{Q_i,\pxdb}, \phi_{Q_{i},\pxdb}, \ell_{Q_i,\pxdb}}$ denote the circuit for subquery $Q_i$. We implicitly include in all circuits a global zero node $v_0$ s.t., $\ell_{Q, \pxdb}(v_0) = 0$ for any $Q,\pxdb$.
|
|
|
|
|
|
\begin{algorithm}
|
|
\caption{\abbrStepOne$(\query, \dbbase, V, E, \ell)$}
|
|
\label{alg:lc}
|
|
\begin{algorithmic}[1]
|
|
\Require $\query$: query
|
|
\Require $\dbbase$: a \dbbaseName
|
|
\Require $V, E, \ell$: accumulators for the vertex list, edge list, and vertex label list.
|
|
\Ensure $\circuit = \tuple{V, E, \phi, \ell}$: a circuit encoding the lineage of each tuple in $\query(\dbbase)$
|
|
\If{$\query$ is $R$} \Comment{\textbf{Case 1}: $\query$ is a relation atom}
|
|
\For{$t \in \dbbase.R$}
|
|
\State $V \leftarrow V \cup \{v_t\}$; $\ell \leftarrow \ell \cup \{(v_t, R(t))\}$ \Comment{Allocate a fresh node $v_t$}
|
|
\State $\phi(t) \gets v_t$
|
|
\EndFor
|
|
\ElsIf{$\query$ is $\sigma_\theta(\query')$} \Comment{\textbf{Case 2}: $\query$ is a Selection}
|
|
\State $\tuple{V, E, \phi', \ell} \gets \abbrStepOne(\query', \dbbase, V, E, \ell)$
|
|
\For{$t \in \domain(\phi')$}
|
|
\State \textbf{if }$\theta(t)$
|
|
\textbf{ then } $\phi(t) \gets \phi'(t)$
|
|
\textbf{ else } $\phi(t) \gets v_0$
|
|
\EndFor
|
|
\ElsIf{$\query$ is $\pi_{\vec{A}}(\query')$} \Comment{\textbf{Case 3}: $\query$ is a Projection}
|
|
\State $\tuple{V, E, \phi', \ell} \gets \abbrStepOne(\query', \dbbase, V, E, \ell)$
|
|
\For{$t \in \pi_{\vec{A}}(\query'(\dbbase))$}
|
|
\State $V \leftarrow V \cup \{v_t\}$; $\ell \leftarrow \ell \cup \{(v_t, +)\}$\Comment{Allocate a fresh node $v_t$}
|
|
\State $\phi(t) \leftarrow v_t$
|
|
\EndFor
|
|
\For{$t \in \query'(\dbbase)$}
|
|
\State $E \leftarrow E \cup \{(\phi'(t), \phi(\pi_{\vec{A}}t))\}$
|
|
\EndFor
|
|
\State Correct nodes with in-degrees $>2$ by appending an equivalent fan-in two tree instead
|
|
\ElsIf{$\query$ is $\query_1 \cup \query_2$} \Comment{\textbf{Case 4}: $\query$ is a Bag Union}
|
|
\State $\tuple{V, E, \phi_1, \ell} \gets \abbrStepOne(\query_1, \dbbase, V, E, \ell)$
|
|
\State $\tuple{V, E, \phi_2, \ell} \gets \abbrStepOne(\query_2, \dbbase, V, E, \ell)$
|
|
\State $\phi \gets \phi_1 \cup \phi_2$
|
|
\For{$t \in \domain(\phi_1) \cap \domain(\phi_2)$}
|
|
\State $V \leftarrow V \cup \{v_t\}$; $\ell \leftarrow \ell \cup \{(v_t, +)\}$ \Comment{Allocate a fresh node $v_t$}
|
|
\State $\phi(t) \gets v_t$
|
|
\State $E \leftarrow E \cup \{(\phi_1(t), v_t), (\phi_2(t), v_t)\}$
|
|
\EndFor
|
|
\ElsIf{$\query$ is $\query_1 \bowtie \ldots \bowtie \query_m$} \Comment{\textbf{Case 5}: $\query$ is an $m$-ary Join}
|
|
\For{$i \in [m]$}
|
|
\State $\tuple{V, E, \phi_i, \ell} \gets \abbrStepOne(\query_i, \dbbase, V, E, \ell)$
|
|
\EndFor
|
|
\For{$t \in \domain(\phi_1) \bowtie \ldots \bowtie \domain(\phi_m)$}
|
|
\State $V \leftarrow V \cup \{v_t\}$; $\ell \leftarrow \ell \cup \{(v_t, \times)\}$ \Comment{Allocate a fresh node $v_t$}
|
|
\State $\phi(t) \gets v_t$
|
|
\State $E \leftarrow E \cup \comprehension{(\phi_i(\pi_{sch(\query_i(\dbbase))}(t)), v_t)}{i \in [m]}$
|
|
\EndFor
|
|
\State Correct nodes with in-degrees $>2$ by appending an equivalent fan-in two tree instead
|
|
|
|
\EndIf
|
|
|
|
\end{algorithmic}
|
|
\end{algorithm}
|
|
|
|
|
|
\Cref{alg:lc} defines how the circuit for a query result is constructed. We quickly review the number of vertices emitted in each case.
|
|
|
|
\caseheading{Base Relation}
|
|
% Let $Q$ be a base relation $R$. We define one node for each tuple. Formally, let $V_{Q,\pxdb} = \comprehension{v_t}{t\in R}$, let $\phi_{Q,\pxdb}(t) = v_t$, let $\ell_{Q,\pxdb}(v_t) = R(t)$, and let $E_{Q,\pxdb} = \emptyset$.
|
|
This circuit has $|\dbbase.R|$ vertices.
|
|
|
|
\caseheading{Selection}
|
|
% Let $Q = \sigma_\theta \inparen{Q_1}$.
|
|
% We re-use the circuit for $Q_1$. %, but define a new distinguished node $v_0$ with label $0$ and make it the sink node for all tuples that fail the selection predicate.
|
|
% Let $V_{Q,\pxdb} = V_{Q_1,\pxdb} \cup \{v_0\}$, and let $\ell_{Q,\pxdb}(v) = \ell_{Q_1,\pxdb}(v)$ for any $v \in V_{Q_1,\pxdb}$. Let $E_{Q,\pxdb} = E_{Q_1,\pxdb}$, and define
|
|
% $$\phi_{Q,\pxdb}(t) =
|
|
% \phi_{Q_{1}, \pxdb}(t) \text{ for } t \text{ s.t.}\; \theta(t) \text{ and } \phi_{Q,\pxdb}(t) = v_0 \text{ otherwise}.$$
|
|
If we assume that dead sinks (i.e., sink vertices not in the range of $\phi_{Q,\pxdb}$) are iteratively garbage collected,
|
|
this circuit has at most $|V_{Q_1,\pxdb}|$ vertices.
|
|
|
|
\caseheading{Projection}
|
|
% Let $Q = \pi_{\vct A} {Q_1}$.
|
|
% We extend the circuit for ${Q_1}$ with a new set of sum vertices (i.e., vertices with label $+$) for each tuple in $Q$, and connect them to the corresponding sink nodes of the circuit for ${Q_1}$.
|
|
% Naively, let $V_{Q,\pxdb} = V_{Q_1,\pxdb} \cup \comprehension{v_t}{t \in \pi_{\vct A} {Q_1}}$, let $\phi_{Q,\pxdb}(t) = v_t$, and let $\ell_{Q,\pxdb}(v_t) = +$. Finally let
|
|
% $$E_{Q,\pxdb} = E_{Q_1,\pxdb} \cup \comprehension{(\phi_{Q_{1}, \pxdb}(t'), v_t)}{t = \pi_{\vct A} t', t' \in {Q_1}, t \in \pi_{\vct A} {Q_1}}$$
|
|
This formulation will produce vertices with an in-degree greater than two, a problem that we correct by replacing every vertex with an in-degree over two by an equivalent fan-in two tree. The resulting structure has at most $|{Q_1}|-1$ new vertices.
|
|
% \AH{Is the rightmost operator \emph{supposed} to be a $-$? In the beginning we add $|\pi_{\vct A}{Q_1}|$ vertices.}
|
|
The corrected circuit thus has at most $|V_{Q_1,\pxdb}|+|{Q_1}|$ vertices.
|
|
|
|
\caseheading{Union}
|
|
% Let $Q = {Q_1} \cup {Q_2}$.
|
|
% We merge graphs and produce a sum vertex for all tuples in both sides of the union.
|
|
% Formally, let $V_{Q,\pxdb} = V_{Q_1,\pxdb} \cup V_{Q_2,\pxdb} \cup \comprehension{v_t}{t \in {Q_1} \cap {Q_2}}$, let $\ell_{Q,\pxdb}(v_t) = +$, and let
|
|
% \[E_{Q,\pxdb} = E_{Q_1,\pxdb} \cup E_{Q_2,\pxdb} \cup \comprehension{(\phi_{Q_{1}, \pxdb}(t), v_t), (\phi_{Q_{2}, \pxdb}(t), v_t)}{t \in {Q_1} \cap {Q_2}}\]
|
|
% \[
|
|
% \phi_{Q,\pxdb}(t) = \begin{cases}
|
|
% v_t & \textbf{if } t \in {Q_1} \cap {Q_1}\\
|
|
% \phi_{Q_{1}, \pxdb}(t) & \textbf{if } t \not \in {Q_2}\\
|
|
% \phi_{Q_{2}, \pxdb}(t) & \textbf{if } t \not \in {Q_1}\\
|
|
% \end{cases}\]
|
|
This circuit has $|V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1} \cap {Q_2}|$ vertices.
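
As an illustration with hypothetical inputs (the tuples and vertex names below are assumptions made only for this example), suppose that the recursive calls for $Q_1$ and $Q_2$ return the sink functions
\[\phi_1 = \{t_1 \mapsto u_1,\; t_2 \mapsto u_2\}
\qquad\text{and}\qquad
\phi_2 = \{t_2 \mapsto w_2,\; t_3 \mapsto w_3\}.\]
Only $t_2$ lies in both domains, so a single fresh vertex $v_{t_2}$ with label $+$ is allocated, together with the edges $(u_2, v_{t_2})$ and $(w_2, v_{t_2})$. The resulting sink function is
\[\phi = \{t_1 \mapsto u_1,\; t_2 \mapsto v_{t_2},\; t_3 \mapsto w_3\},\]
and the vertex count grows by exactly $|{Q_1} \cap {Q_2}| = 1$.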
|
|
|
|
\caseheading{$k$-ary Join}
|
|
% Let $Q = {Q_1} \bowtie \ldots \bowtie {Q_k}$.
|
|
% We merge graphs and produce a multiplication vertex for all tuples resulting from the join
|
|
% Naively, let $V_{Q,\pxdb} = V_{Q_1,\pxdb} \cup \ldots \cup V_{Q_k,\pxdb} \cup \comprehension{v_t}{t \in {Q_1} \bowtie \ldots \bowtie {Q_k}}$, let
|
|
% {\small
|
|
% \begin{multline*}
|
|
% E_{Q,\pxdb} = E_{Q_1,\pxdb} \cup \ldots \cup E_{Q_k,\pxdb} \cup
|
|
% \left\{\;
|
|
% (\phi_{Q_{1}, \pxdb}(\pi_{\sch({Q_1})}t), v_t), \right.\\
|
|
% \ldots, (\phi_{Q_k,\pxdb}(\pi_{\sch({Q_k})}t), v_t)
|
|
% \;\left|\;t \in {Q_1} \bowtie \ldots \bowtie {Q_k}\;\right\}
|
|
% \end{multline*}
|
|
% }
|
|
% Let $\ell_{Q,\pxdb}(v_t) = \times$, and let $\phi_{Q,\pxdb}(t) = v_t$
|
|
As in projection, newly created vertices will have an in-degree of $k$, and a fan-in two tree is required.
|
|
There are $|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ such vertices, so the corrected circuit has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ vertices.
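
As a sanity check on these counts, consider a hypothetical instance (all sizes are assumptions chosen for illustration): $Q = \pi_{A}(R \bowtie S)$ with $|R| = 3$, $|S| = 2$, and $|R \bowtie S| = 4$. Not counting the global zero node $v_0$, the cases above give
\begin{align*}
\text{base relations:} &\quad 3 + 2 = 5 \text{ vertices},\\
\text{binary join:} &\quad 5 + (2-1)\cdot 4 = 9 \text{ vertices},\\
\text{projection:} &\quad \text{at most } 9 + 4 = 13 \text{ vertices}.
\end{align*}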
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{Bounding circuit depth}
|
|
\label{sec:circuit-depth}
|
|
|
|
We first show that the depth of the circuit (\depth; \Cref{def:size-depth}) is bounded by the size of the query. Denote by $|\query|$ the number of relational operators in query $\query$, which, recall, we assume to be a constant.
|
|
|
|
\begin{Proposition}[Circuit depth is bounded]
|
|
\label{prop:circuit-depth}
|
|
Let $\query$ be a relational query and $\dbbase$ be a \dbbaseName with $n$ tuples. There exists a (lineage) circuit $\circuit^*$ encoding the lineage of all tuples $\tup \in \query(\dbbase)$ for which
|
|
$\depth(\circuit^*) \leq O(k|\query|\log(n))$.
|
|
\end{Proposition}
|
|
|
|
\begin{proof}
|
|
We show that the bound of \Cref{prop:circuit-depth} holds for the circuit constructed by \Cref{alg:lc}.
|
|
First, observe that \Cref{alg:lc} is (recursively) invoked exactly once for every relational operator or base relation in $\query$; it thus suffices to show that a call to \Cref{alg:lc} adds at most $O_k(\log(n))$ to the depth of a circuit produced by any recursive invocation.
|
|
Second, observe that modulo the logarithmic fan-in of the projection and join cases, the depth of the output is at most one greater than the depth of any input (or at most 1 in the base case of relation atoms).
|
|
For the join case, the number of in-edges can be no greater than the join width, which itself is bounded by $k$. The depth thus increases by at most an additive term of $\lceil \log(k) \rceil = O_k(1)$.
|
|
For the projection case, observe that the fan-in is bounded by $|\query'(\dbbase)|$, which is in turn bounded by $n^k$. The depth increase for any projection node is thus at most $\lceil \log(n^k)\rceil = O(k\log(n))$, as desired. % = O_k(\log(n))$.
|
|
\qed
|
|
\end{proof}
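
As a concrete (hypothetical) instance of the fan-in correction, assume that the replacement tree is balanced. A $+$ vertex with $8$ in-edges is then replaced by a binary tree with $8$ leaves, which has $7$ internal $+$ vertices (including the original one) and depth $\lceil \log_2 8 \rceil = 3$. More generally, $N$ in-edges yield depth $\lceil \log_2 N \rceil$, so a projection over $N \leq n^k$ input tuples contributes at most
\[\lceil \log_2 (n^k) \rceil = \lceil k \log_2 n \rceil\]
levels, matching the $O(k\log(n))$ term in the proof of \Cref{prop:circuit-depth}.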
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{Circuit size vs. runtime}
|
|
\label{sec:circuit-runtime}
|
|
|
|
\begin{Lemma}\label{lem:circ-model-runtime}
|
|
\label{lem:circuits-model-runtime}
|
|
Given a \abbrNXPDB $\pxdb$ with \dbbaseName $\dbbase$, and an $\raPlus$ query $Q$, the size of the lineage circuit of $Q(\pxdb)$ is asymptotically no larger than the runtime of $Q$ over $\dbbase$. That is, we have $\abs{V_{Q,\pxdb}} \leq k\qruntime{Q, \dbbase}+1$, where $k\ge 1$ is the maximal degree of any polynomial in $Q(\pxdb)$.
|
|
\end{Lemma}
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%\noindent The proof is shown in \Cref{app:subsec-lem-lin-vs-qplan}.
|
|
|
|
%\subsection{Proof for \Cref{lem:circuits-model-runtime}}\label{app:subsec-lem-lin-vs-qplan}
|
|
\begin{proof}
|
|
We prove by induction that $\abs{V_{Q,\pxdb} \setminus \{v_0\}} \leq k\qruntime{Q, \dbbase}$. For clarity, we implicitly exclude $v_0$ in the proof below.
|
|
|
|
For the base case, $Q = R$ is a base relation, and the claim holds trivially since $|V_{R,\pxdb}| = |\dbbase.R|=\qruntime{R, \dbbase}$ (note that here the degree $k=1$).
|
|
For the inductive step, we assume that we have circuits for subqueries $Q_1, \ldots, Q_m$ such that $|V_{Q_i,\pxdb}| \leq k_i\qruntime{Q_i,\dbbase}$ where $k_i$ is the degree of $Q_i$.
|
|
|
|
\caseheading{Selection}
|
|
Assume that $Q = \sigma_\theta(Q_1)$.
|
|
The circuit for $Q$ has $|V_{Q,\pxdb}| \leq |V_{Q_1,\pxdb}|$ vertices, so from the inductive assumption and the fact that $\qruntime{Q,\dbbase} = \qruntime{Q_1,\dbbase}$ by definition, we have $|V_{Q,\pxdb}| \leq k \qruntime{Q,\dbbase}$.
|
|
% \AH{Technically, $\kElem$ is the degree of $\poly_1$, but I guess this is a moot point since one can argue that $\kElem$ is also the degree of $\poly$.}
|
|
% OK: Correct
|
|
|
|
\caseheading{Projection}
|
|
Assume that $Q = \pi_{\vct A}(Q_1)$.
|
|
The circuit for $Q$ has at most $|V_{Q_1,\pxdb}|+|{Q_1}|$ vertices.
|
|
% \AH{The combination of terms above doesn't follow the details for projection above.}
|
|
\begin{align*}
|
|
|V_{Q,\pxdb}| & \leq |V_{Q_1,\pxdb}| + |Q_1|\\
|
|
%\intertext{By \Cref{prop:queries-need-to-output-tuples} $\qruntime{Q_1,\dbbase} \geq |Q_1|$}
|
|
%& \leq |V_{Q_1,\pxdb}| + 2 \qruntime{Q_1,\pxdb}\\
|
|
\intertext{(From the inductive assumption)}
|
|
& \leq k\qruntime{Q_1,\dbbase} + \abs{Q_1}\\
|
|
\intertext{(By definition of $\qruntime{Q,\dbbase}$)}
|
|
& \le k\qruntime{Q,\dbbase}.
|
|
\end{align*}
|
|
\caseheading{Union}
|
|
Assume that $Q = Q_1 \cup Q_2$.
|
|
The circuit for $Q$ has $|V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1} \cap {Q_2}|$ vertices.
|
|
\begin{align*}
|
|
|V_{Q,\pxdb}| & \leq |V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1}|+|{Q_2}|\\
|
|
%\intertext{By \Cref{prop:queries-need-to-output-tuples} $\qruntime{Q_1,\dbbase} \geq |Q_1|$}
|
|
%& \leq |V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+\qruntime{Q_1,\pxdb}+\qruntime{Q_2,\dbbase}|\\
|
|
\intertext{(From the inductive assumption)}
|
|
& \leq k(\qruntime{Q_1,\dbbase} + \qruntime{Q_2,\dbbase}) + (|Q_1| + |Q_2|)\\
|
|
\intertext{(By definition of $\qruntime{Q,\dbbase}$)}
|
|
& \leq k(\qruntime{Q,\dbbase}).
|
|
\end{align*}
|
|
|
|
\caseheading{$m$-ary Join}
|
|
Assume that $Q = Q_1 \bowtie \ldots \bowtie Q_m$. Note that $k=\sum_{i=1}^m k_i\ge m$.
|
|
The circuit for $Q$ has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_m,\pxdb}|+(m-1)|{Q_1} \bowtie \ldots \bowtie {Q_m}|$ vertices.
|
|
\begin{align*}
|
|
|V_{Q,\pxdb}| & = |V_{Q_1,\pxdb}|+\ldots+|V_{Q_m,\pxdb}|+(m-1)|{Q_1} \bowtie \ldots \bowtie {Q_m}|\\
\intertext{(From the inductive assumption, noting that $k_i \leq k$ for all $i$ and $m\le k$)}
& \leq k\qruntime{Q_1,\dbbase}+\ldots+k\qruntime{Q_m,\dbbase}+\\
&\;\;\; (m-1)|{Q_1} \bowtie \ldots \bowtie {Q_m}|\\
|
|
& \leq k(\qruntime{Q_1,\dbbase}+\ldots+\qruntime{Q_m,\dbbase}+\\
|
|
&\;\;\;|{Q_1} \bowtie \ldots \bowtie {Q_m}|)\\
|
|
\intertext{(By definition of $\qruntime{Q,\dbbase}$ and assumption on $\jointime{\cdot}$)}
|
|
& \le k\qruntime{Q,\dbbase}.
|
|
\end{align*}
|
|
|
|
The property thus holds in every inductive case, which completes the proof.
|
|
\qed
|
|
\end{proof}
|
|
|
|
\subsubsection{Runtime of \abbrStepOne}
|
|
\label{sec:lc-runtime}
|
|
|
|
We next need to show that we can construct the circuit in time linear in the deterministic runtime.
|
|
\begin{Lemma}\label{lem:tlc-is-the-same-as-det}
|
|
Given a query $\query$ over a \dbbaseName $\dbbase$ and the $\circuit^*$ output by \Cref{alg:lc}, the runtime $\timeOf{\abbrStepOne}(\query,\dbbase,\circuit^*) \le O(\qruntime{\query, \dbbase})$.
|
|
\end{Lemma}
|
|
\begin{proof}
|
|
The proof is by analysis of \Cref{alg:lc}, invoked as $\circuit^*\gets\abbrStepOne(\query, \dbbase, \{v_0\}, \emptyset, \{(v_0, 0)\})$.
|
|
|
|
We assume that the vertex list $V$, edge list $E$, and vertex label list $\ell$ are mutable accumulators with $O(1)$ amortized append.
|
|
We assume that the tuple to sink mapping $\phi$ is a linked hashmap, with $O(1)$ insertions and retrievals, and $O(n)$ iteration over the domain of keys.
|
|
We assume that the $m$-ary join $\domain(\phi_1) \bowtie \ldots \bowtie\domain(\phi_m)$ can be computed in time $\jointime{\domain(\phi_1), \ldots, \domain(\phi_m)}$ (\Cref{def:join-cost}) and that an intersection $\domain(\phi_1) \cap \domain(\phi_2)$ can be computed in time $O(|\domain(\phi_1)| + |\domain(\phi_2)|)$ (e.g., with a hash table).
|
|
|
|
|
|
Before proving our runtime bound, we first observe that $\qruntime{\query, \db} \geq \Omega(|\query(\db)|)$.
|
|
This is true by construction for the relation, projection, and union cases, by \Cref{def:join-cost} for joins, and by the observation that $|\sigma(R)| \leq |R|$.
|
|
|
|
We show by induction that $\qruntime{\query, \dbbase}$ is an upper bound for the runtime of \Cref{alg:lc}.
|
|
The base case of a relation atom requires only an $O(|\dbbase.R|)$ iteration over the source tuples.
|
|
For the remaining cases, we make the recursive assumption that for every subquery $\query'$, it holds that $O(\qruntime{\query', \dbbase})$ bounds the runtime of \Cref{alg:lc}.
|
|
|
|
\caseheading{Selection}
|
|
Selection requires a recursive call to \Cref{alg:lc}, which by the recursive assumption is bounded by $O(\qruntime{\query', \dbbase})$.
|
|
\Cref{alg:lc} requires a loop over every element of $\query'(\dbbase)$.
|
|
By the observation above that $\qruntime{\query, \db} \geq \Omega(|\query(\db)|)$, this iteration is also bounded by $O(\qruntime{\query', \dbbase})$.
|
|
|
|
\caseheading{Projection}
|
|
Projection requires a recursive call to \Cref{alg:lc}, which by the recursive assumption is bounded by $O(\qruntime{\query', \dbbase})$, which in turn is a term in $\qruntime{\pi_{\vec{A}}\query', \dbbase}$.
|
|
What remains is an iteration over $\pi_{\vec A}(\query'(\dbbase))$ (lines 13--16), an iteration over $\query'(\dbbase)$ (lines 17--19), and the construction of a fan-in tree (line 20).
|
|
The first iteration is $O(|\query(\dbbase)|) \leq O(\qruntime{\query, \dbbase})$.
|
|
The second iteration and the construction of the bounded fan-in tree are both $O(|\query'(\dbbase)|) \leq O(\qruntime{\query', \dbbase}) \leq O(\qruntime{\query, \dbbase})$, by the observation above that $\qruntime{\query, \db} \geq \Omega(|\query(\db)|)$.
|
|
|
|
\caseheading{Bag Union}
|
|
As above, the recursive calls explicitly correspond to terms in the expansion of $\qruntime{\query_1 \cup \query_2, \dbbase}$.
|
|
Initializing $\phi$ (line 24) can be accomplished in $O(|\domain(\phi_1)| + |\domain(\phi_2)|) = O(|\query_1(\dbbase)| + |\query_2(\dbbase)|) \leq O(\qruntime{\query_1, \dbbase} + \qruntime{\query_2, \dbbase})$.
|
|
The remainder requires computing $\domain(\phi_1) \cap \domain(\phi_2)$ (line 25) and iterating over it (lines 25--29), which is $O(|\query_1(\dbbase)| + |\query_2(\dbbase)|)$ by the assumption on intersections above; this directly corresponds to terms in $\qruntime{\query_1 \cup \query_2, \dbbase}$.
|
|
|
|
|
|
\caseheading{$m$-ary Join}
|
|
As in the prior cases, recursive calls explicitly correspond to terms in our target runtime.
|
|
The remaining logic involves (i) computing $\domain(\phi_1) \bowtie \ldots \bowtie \domain(\phi_m)$, (ii) iterating over the results, and (iii) creating a fan-in tree.
|
|
Respectively, these are: \\
|
|
~(i)~$\jointime{\domain(\phi_1), \ldots, \domain(\phi_m)}$\\
|
|
~(ii)~$O(|\query_1(\dbbase) \bowtie \ldots \bowtie \query_m(\dbbase)|) \leq O(\jointime{\domain(\phi_1), \ldots, \domain(\phi_m)})$ (\Cref{def:join-cost})\\
|
|
~(iii)~$O(m|\query_1(\dbbase) \bowtie \ldots \bowtie \query_m(\dbbase)|)$ (as (ii), noting that $m \leq k = O(1)$)
|
|
\qed
|
|
\end{proof}
|
|
|
|
|
|
|
|
%With \Cref{lem:circ-model-runtime,lem:tlc-is-the-same-as-det} and our upper bound results on \approxq, we now have all the pieces to argue that using our approximation algorithm, the expected multiplicities of an $\raPlus$ query can be computed in essentially the same runtime as deterministic query processing for the same query, proving claim (iv) of the Introduction.
|
|
|
|
%\section{Proof of \Cref{cor:cost-model}}
|
|
%\begin{proof}
|
|
%This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that \Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
|
|
%\qed
|
|
%\end{proof}
|
|
|
|
\section{Higher Moments}
|
|
%\label{sec:momemts}
|
|
%
|
|
We make a simple observation to conclude the presentation of our results.
|
|
So far we have only focused on the expectation of $\poly$.
|
|
In addition, we could, e.g., prove bounds on the probability of a tuple's multiplicity being at least $1$.
|
|
Progress can be made on this as follows:
|
|
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to use, e.g., the Chebyshev inequality or other higher-moment-based probability bounds on the events we might be interested in.
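For example, with $m = 2$ (a sketch; the symbol $Y$ for the random multiplicity of a fixed output tuple is introduced here only for illustration), let $\mu = \mathbb{E}[Y]$. Chebyshev's inequality states that for any $a > 0$,
\[\Pr\left[\,|Y - \mu| \geq a\,\right] \leq \frac{\mathbb{E}[Y^2] - \mu^2}{a^2},\]
so the first two moments already bound how far the multiplicity can deviate from its expectation.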
|
|
We leave further investigations for future work.
|
|
|
|
% \section{The Karp-Luby Estimator}
|
|
% \label{sec:karp-luby}
|
|
% %
|
|
% Computing the marginal probability of a tuple in the output of a set-probabilistic database query has been studied extensively.
|
|
% To the best of our knowledge, the current state of the art approximation algorithm for this problem is the Karp-Luby estimator~\cite{DBLP:journals/jal/KarpLM89}, which first appeared in MayBMS/Sprout~\cite{DBLP:conf/icde/OlteanuHK10}, and more recently as part of an online ``anytime'' approximation algorithm~\cite{FH13,heuvel-19-anappdsd}.
|
|
|
|
% The estimator works by observing that for any \ell random events $X_1, \ldots, X_\ell$, the probability of either occurring $\probOf\inparen(X_1 \vee \ldots X_\ell)$ is bounded from above by the sum of the independent event probabilities (i.e., $\probOf\inparen(X_1 \vee \ldots X_\ell) \leq \probOf\inparen{X_1} + \ldots + \probOf\inparen{X_\ell}$).
|
|
% Starting from this (easy to compute and large) value, the estimator proceeds to ``adjust'' the estimate by computing the expectation of the number of
|
|
|
|
|
|
|
|
|
|
%%% Local Variables:
|
|
%%% mode: latex
|
|
%%% TeX-master: "main"
|
|
%%% End:
|