345 lines
23 KiB
TeX
345 lines
23 KiB
TeX
%!TEX root=./main.tex
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
%%%%%%%%%%%%%%%%%%
|
|
%Appendix A is no longer needed %
|
|
%%%%%%%%%%%%%%%%%%
|
|
|
|
%\input{app_set-to-bag-pdb}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\onecolumn
|
|
\section{Missing details from Section~\ref{sec:background}}\label{sec:proofs-background}
|
|
\subsection{Polynomials}
|
|
\begin{Definition}[Degree]\label{def:degree-of-poly}
|
|
The degree of polynomial $\genpoly(\vct{X})$ is the largest $\sum_{\tup\in S}d_\tup
|
|
$ for all $\vct{d}\in\inset{0,\ldots,\hideg}^S$
|
|
such that $c_{(d_1,\dots,d_n)}\ne 0$.
|
|
We denote the degree of $\genpoly$ as $\deg\inparen{\genpoly}$.
|
|
\end{Definition}
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
|
|
Product terms in lineage arise only from join operations (\Cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins needed to produce a result tuple.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{Background details for proof of~\Cref{prop:expection-of-polynom}}\label{app:subsec:background-nxdbs}
|
|
\subsubsection{$\semK$-relations and \abbrNXPDB\xplural}\label{subsec:supp-mat-background}\label{subsec:supp-mat-krelations}
|
|
\input{app_k-relations}
|
|
\input{app_notation-background}
|
|
|
|
|
|
|
|
\section{Missing details from Section~\ref{sec:hard}}
|
|
\label{app:single-mult-p}
|
|
\input{app_hardness-results}
|
|
|
|
\section{Missing Details from Section~\ref{sec:algo}}\label{sec:proofs-approx-alg}
|
|
\input{app_approx-algo-defs-and-examples}
|
|
\input{app_approx-alg-analysis}
|
|
\input{app_onepass-analysis}
|
|
\input{app_samp-monom-analysis}
|
|
\input{app_approx-alg-corollaries}
|
|
|
|
\subsection{Experimental Results}\label{app:subsec:experiment}
|
|
\input{experiments}
|
|
|
|
\section{Circuits}\label{app:sec-cicuits}
|
|
\subsection{Representing Polynomials with Circuits}\label{app:subsec-rep-poly-lin-circ}
|
|
\subsubsection{Circuits for query plans}
|
|
\label{sec:circuits-formal}
|
|
We now formalize circuits and the construction of circuits for $\raPlus$ queries.
|
|
As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$, $\times$.
|
|
A circuit for query $Q$ and \abbrNXPDB $\pxdb$ \footnote{For background on \abbrNXPDB\xplural, see~\Cref{app:subsec:background-nxdbs}} is a directed acyclic graph $\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}$ with vertices $V_{Q,\pxdb}$ and directed edges $E_{Q,\pxdb} \subset {V_{Q,\pxdb}}^2$.
|
|
The sink function $\phi_{Q,\pxdb} : \udom^n \rightarrow V_{Q,\pxdb}$ is a partial function that maps the tuples of the $n$-ary relation $Q(\pxdb)$ to vertices.
|
|
We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
|
|
A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
|
|
We require that vertices have an in-degree of at most two.
|
|
Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Modeling Circuit Construction}
|
|
|
|
\newcommand{\bagdbof}{\textsc{bag}(\pxdb)}
|
|
|
|
We now connect the size of a circuit (where the size of a circuit is the number of vertices in the corresponding DAG)
|
|
for a given $\raPlus$ query $Q$ and \abbrNXPDB $\pxdb$ to
|
|
the runtime $\qruntime{\query,\tupset, \bound}$ of the PDB's \dbbaseName $\tupset$.
|
|
\AH{@atri: do we use $\tupset$ or $\gentupset$ here?}
|
|
We do this formally by showing that the size of the circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\newcommand{\getpoly}[1]{\textbf{lin}\inparen{#1}}
|
|
Each vertex $v \in V_{Q,\pxdb}$ in the arithmetic circuit for
|
|
|
|
\[\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}\]
|
|
|
|
encodes a polynomial, realized as
|
|
|
|
\[\getpoly{v} = \begin{cases}
|
|
\sum_{v' : (v',v) \in E_{Q,\pxdb}} \getpoly{v'} & \textbf{if } \ell(v) = +\\
|
|
\prod_{v' : (v',v) \in E_{Q,\pxdb}} \getpoly{v'} & \textbf{if } \ell(v) = \times\\
|
|
\ell(v) & \textbf{otherwise}
|
|
\end{cases}\]
|
|
|
|
|
|
We define the circuit for a $\raPlus$ query $\query$ recursively by cases as follows. In each case, let $\tuple{V_{Q_i,\pxdb}, E_{Q_i,\pxdb}, \phi_{Q_{i},\pxdb}, \ell_{Q_i,\pxdb}}$ denote the circuit for subquery $Q_i$. We implicitly include in all circuits a global zero node $v_0$ s.t., $\ell_{Q, \pxdb}(v_0) = 0$ for any $Q,\pxdb$.
|
|
|
|
|
|
\begin{algorithm}
|
|
\caption{\lincirc$(\query, \tupset, E, V, \ell)$}
|
|
\label{alg:lc}
|
|
\begin{algorithmic}[1]
|
|
\Require $\query$: query
|
|
\Require $\tupset$: a \dbbaseName
|
|
\Require $E, V, \ell$: accumulators for the edge list, vertex list, and vertex label list.
|
|
\Ensure $\circuit = \tuple{V, E, \phi, \ell}$: a circuit encoding the lineage of each tuple in $\query(\tupset)$
|
|
\If{$\query$ is $\rel$} \Comment{\textbf{Case 1}: $\query$ is a relation atom}
|
|
\For{$t \in \tupset.\rel$}
|
|
\State $V \leftarrow V \cup \{v_t\}$; $\ell \leftarrow \ell \cup \{\inparen{v_t, \rel\inparen{\tup}}\}$ \Comment{Allocate a fresh node $v_t$; note that when $\rel\inparen{\tup} = \bound\cdot X_\tup$ for $\bound > 1$, we assume the algorithm generates a $3$ node circuit encoding the multiplcation of $\bound\cdot X_\tup$, adding the new vertices, edges, and vertice/label pairs to their respective sets.}
|
|
\State $\phi(t) \gets v_t$
|
|
\EndFor
|
|
\State\Return $\tuple{V, E, \phi, \ell}$
|
|
\ElsIf{$\query$ is $\sigma_\theta(\query_1)$} \Comment{\textbf{Case 2}: $\query$ is a Selection}
|
|
\State $\tuple{V, E, \phi', \ell} \gets \lincirc(\query_1, \tupset, V, E, \ell)$
|
|
\For{$t \in \domain(\phi')$}
|
|
\State \textbf{if }$\theta(t)$
|
|
\textbf{ then } $\phi(t) \gets \phi'(t)$
|
|
\textbf{ else } $\phi(t) \gets v_0$
|
|
\EndFor
|
|
\State\Return $\tuple{V, E, \phi, \ell}$
|
|
\ElsIf{$\query$ is $\pi_{A}(\query_1)$} \Comment{\textbf{Case 3}: $\query$ is a Projection}
|
|
\State $\tuple{V, E, \phi', \ell} \gets \lincirc(\query_1, \tupset, V, E, \ell)$
|
|
\For{$t \in \pi_{A}(\query_1(\tupset))$}
|
|
\State $V \leftarrow V \cup \{v_t\}$; $\ell \leftarrow \ell \cup \{(v_t, +)\}$\Comment{Allocate a fresh node $v_t$}
|
|
\State $\phi(t) \leftarrow v_t$
|
|
\EndFor
|
|
\For{$t \in \query_1(\tupset)$}
|
|
\State $E \leftarrow E \cup \{(\phi'(t), \phi(\pi_{A}t))\}$
|
|
\EndFor
|
|
\State Correct nodes with in-degrees $>2$ by appending an equivalent fan-in two tree instead
|
|
\State\Return $\tuple{V, E, \phi, \ell}$
|
|
\ElsIf{$\query$ is $\query_1 \cup \query_2$} \Comment{\textbf{Case 4}: $\query$ is a Bag Union}
|
|
\State $\tuple{V', E', \phi_1, \ell'} \gets \lincirc(\query_1, \tupset, V, E, \ell)$
|
|
\State $\tuple{V, E, \phi_2, \ell} \gets \lincirc(\query_2, \tupset, V', E', \ell')$
|
|
\State $\phi \gets \phi_1 \cup \phi_2$\label{alg:lincirc-union-phi}
|
|
\For{$t \in \domain(\phi_1) \cap \domain(\phi_2)$}\label{alg:lincirc-union-intersection}
|
|
\State $V \leftarrow V \cup \{v_t\}$; $\ell \leftarrow \ell \cup \{(v_t, +)\}$\label{alg:lincirc-union-intersection-one} \Comment{Allocate a fresh node $v_t$}
|
|
\State $\phi(t) \gets v_t$\label{alg:lincirc-union-intersection-two}
|
|
\State $E \leftarrow E \cup \{(\phi_1(t), v_t), (\phi_2(t), v_t)\}$\label{alg:lincirc-union-intersection-three}
|
|
\EndFor
|
|
\State\Return $\tuple{V, E, \phi, \ell}$
|
|
\ElsIf{$\query$ is $\query_1 \bowtie \ldots \bowtie \query_m$} \Comment{\textbf{Case 5}: $\query$ is a $m$-ary Join}
|
|
\For{$i \in [m]$}
|
|
\State $\tuple{V, E, \phi_i, \ell} \gets \lincirc(\query_i, \tupset, V, E, \ell)$
|
|
\EndFor
|
|
\For{$t \in \domain(\phi_1) \bowtie \ldots \bowtie \domain(\phi_m)$}
|
|
\State $V \leftarrow V \cup \{v_t\}$; $\ell \leftarrow \ell \cup \{(v_t, \times)\}$ \Comment{Allocate a fresh node $v_t$}
|
|
\State $\phi(t) \gets v_t$
|
|
\State $E \leftarrow E \cup \comprehension{(\phi_i(\pi_{sch(\query_i(\tupset))}(t)), v_t)}{i \in [m]}$
|
|
\EndFor
|
|
\State Correct nodes with in-degrees $>2$ by appending an equivalent fan-in two tree instead
|
|
\State\Return $\tuple{V, E, \phi, \ell}$
|
|
|
|
\EndIf
|
|
|
|
\end{algorithmic}
|
|
\end{algorithm}
|
|
|
|
|
|
\Cref{alg:lc} defines how the circuit for a query result is constructed. Denote the set of active output tuples as $\domain\inparen{\phi}$. We quickly review the number of vertices emitted in each case.
|
|
|
|
\caseheading{Base Relation}
|
|
This circuit has $\abs{\tupset.\rel}$ vertices.
|
|
|
|
\caseheading{Selection}
|
|
If we assume dead sinks are iteratively garbage collected,
|
|
this circuit has at most $|V_{Q_1,\pxdb}|$ vertices.
|
|
|
|
\caseheading{Projection}
|
|
This formulation will produce vertices with an in-degree greater than two, a problem that we correct by replacing every vertex with an in-degree over two by an equivalent fan-in two tree. The resulting structure has at most $|{Q_1}|-1$ new vertices.
|
|
The corrected circuit thus has at most $|V_{Q_1,\pxdb}|+|{Q_1}|$ vertices.
|
|
|
|
\caseheading{Union}
|
|
This circuit has $|V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1} \cap {Q_2}|$ vertices.
|
|
|
|
\caseheading{$k$-ary Join}
|
|
As in projection, newly created vertices will have an in-degree of $k$, and a fan-in two tree is required.
|
|
There are $|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ such vertices, so the corrected circuit has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ vertices.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{Bounding circuit depth}
|
|
\label{sec:circuit-depth}
|
|
|
|
We first show that the depth of the circuit (\depth; \Cref{def:size-depth}) is bounded by the size of the query. Denote by $|\query|$ the number of relational operators in query $\query$, which recall we assume is a constant.
|
|
|
|
\begin{Proposition}[Circuit depth is bounded]
|
|
\label{prop:circuit-depth}
|
|
Let $\query$ be a relational query and $\tupset$ be a \dbbaseName with $n$ tuples. There exists a (lineage) circuit $\circuit^*$ encoding the lineage of all tuples $\tup \in \query(\tupset)$ for which
|
|
$\depth(\circuit^*) \leq O(k|\query|\log(n))$.
|
|
\end{Proposition}
|
|
|
|
\begin{proof}
|
|
We show that the bound of \Cref{prop:circuit-depth} holds for the circuit constructed by \Cref{alg:lc}.
|
|
First, observe that \Cref{alg:lc} is (recursively) invoked exactly once for every relational operator or base relation in $\query$; It thus suffices to show that a call to \Cref{alg:lc} adds at most $O_k(\log(n))$ to the depth of a circuit produced by any recursive invocation.
|
|
Second, observe that modulo the logarithmic fan-in of the projection and join cases, the depth of the output is at most one greater than the depth of any input (or at most 1 in the base case of relation atoms).
|
|
For the join case, the number of in-edges can be no greater than the join width, which itself is bounded by $k$. The depth thus increases by at most a constant factor of $\lceil \log(k) \rceil = O_k(1)$.
|
|
For the projection case, observe that the fan-in is bounded by $|\query'(\dbbase)|$, which is in turn bounded by $n^k$. The depth increase for any projection node is thus at most $\lceil \log(n^k)\rceil = O(k\log(n))$, as desired.
|
|
\qed
|
|
\end{proof}
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsubsection{Circuit size vs. runtime}
|
|
\label{sec:circuit-runtime}
|
|
|
|
\begin{Lemma}\label{lem:circ-model-runtime}
|
|
\label{lem:circuits-model-runtime}
|
|
Given a \abbrNXPDB $\pxdb$ with \dbbaseName $\tupset$, and an $\raPlus$ query $Q$, the runtime of $Q$ over $\tupset$ has the same or greater complexity as the size of the lineage of $Q(\pxdb)$. That is, we have $\abs{V_{Q,\pxdb}} \leq k\qruntime{\query, \tupset, \bound}+1$, where $k\ge 1$ is the maximal degree of any polynomial in $Q(\pxdb)$.
|
|
\end{Lemma}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\begin{proof}
|
|
We prove by induction that $\abs{V_{Q,\pxdb} \setminus \{v_0\}} \leq k\qruntime{\query, \tupset, \bound}$. For clarity, we implicitly exclude $v_0$ in the proof below.
|
|
|
|
The base case is a base relation: $Q = R$ and is trivially true since $|V_{R,\pxdb}| = |\tupset.R|=\qruntime{\rel, \tupset, \bound}$ (note that here the degree $k=1$).
|
|
For the inductive step, we assume that we have circuits for subqueries $Q_1, \ldots, Q_m$ such that $|V_{Q_i,\pxdb}| \leq k_i\qruntime{\query_i,\tupset, \bound}$ where $k_i$ is the degree of $Q_i$.
|
|
|
|
\caseheading{Selection}
|
|
Assume that $Q = \sigma_\theta(Q_1)$.
|
|
In the circuit for $Q$, $|V_{Q,\pxdb}| = |V_{Q_1,\tupset}|$ vertices, so from the inductive assumption and $\qruntime{\query,\tupset, \bound} = \qruntime{\query_1,\tupset, \bound}$ by definition, we have $|V_{Q,\pxdb}| \leq k \qruntime{\query,\tupset, \bound} $.
|
|
|
|
\caseheading{Projection}
|
|
Assume that $Q = \pi_{\vct A}(Q_1)$.
|
|
The circuit for $Q$ has at most $|V_{Q_1,\pxdb}|+|{Q_1}|$ vertices.
|
|
\begin{align*}
|
|
|V_{Q,\pxdb}| & \leq |V_{Q_1,\pxdb}| + |Q_1|\\
|
|
\intertext{(From the inductive assumption)}
|
|
& \leq k\qruntime{\query_1,\tupset, \bound} + \abs{Q_1}\\
|
|
\intertext{(By definition of $\qruntime{\query,\tupset, \bound}$)}
|
|
& \le k\qruntime{\query,\tupset, \bound}.
|
|
\end{align*}
|
|
\caseheading{Union}
|
|
Assume that $Q = Q_1 \cup Q_2$.
|
|
The circuit for $Q$ has $|V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1} \cap {Q_2}|$ vertices.
|
|
\begin{align*}
|
|
|V_{Q,\pxdb}| & \leq |V_{Q_1,\pxdb}|+|V_{Q_2,\pxdb}|+|{Q_1}|+|{Q_2}|\\
|
|
\intertext{(From the inductive assumption)}
|
|
& \leq k(\qruntime{\query_1,\tupset, \bound} + \qruntime{\query_2,\tupset, \bound}) + (|Q_1| + |Q_2|)
|
|
\intertext{(By definition of $\qruntime{\query, \tupset, \bound}$)}
|
|
& \leq k(\qruntime{\query,\tupset, \bound}).
|
|
\end{align*}
|
|
|
|
\caseheading{$m$-ary Join}
|
|
Assume that $Q = Q_1 \bowtie \ldots \bowtie Q_m$. Note that $k=\sum_{i=1}^m k_i\ge m$.
|
|
The circuit for $Q$ has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(m-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ vertices.
|
|
\begin{align*}
|
|
|V_{Q,\pxdb}| & = |V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(m-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|\\
|
|
\intertext{From the inductive assumption and noting $\forall i: k_i \leq k$ and $m\le k$}
|
|
& \leq k\qruntime{\query_1,\tupset, \bound}+\ldots+k\qruntime{\query_k,\tupset, \bound}+\\
|
|
&\;\;\; (m-1)|{Q_1} \bowtie \ldots \bowtie {Q_m}|\\
|
|
& \leq k\left(\qruntime{\query_1, \tupset, \bound}+\ldots+\qruntime{\query_1, \tupset, \bound}+\right.\\
|
|
&\;\;\;\left.|{Q_1} \bowtie \ldots \bowtie {Q_m}|\right)\\
|
|
\intertext{(By definition of $\qruntime{\query,\tupset, \bound}$ and assumption on $\jointime{\cdot}$)}
|
|
& \le k\qruntime{\query,\tupset, \bound}.
|
|
\end{align*}
|
|
|
|
The property holds for all recursive queries, and the proof holds.
|
|
\qed
|
|
\end{proof}
|
|
|
|
\subsubsection{Runtime of \lincirc}
|
|
\label{sec:lc-runtime}
|
|
|
|
We next need to show that we can construct the circuit in time linear in the deterministic runtime.
|
|
\begin{Lemma}\label{lem:tlc-is-the-same-as-det}
|
|
Given a query $\query$ over a \dbbaseName $\tupset$ and the $\circuit^*$ output by \Cref{alg:lc}, the runtime $\timeOf{\lincirc}(\query,\tupset,\circuit^*) \le O(\qruntime{\query, \tupset, \bound})$.
|
|
\end{Lemma}
|
|
\begin{proof}
|
|
By analysis of \Cref{alg:lc}, invoked as $\circuit^*\gets\lincirc(\query, \tupset, \emptyset, \{v_0\}, \{(v_0, 0)\})$.
|
|
|
|
We assume that the vertex list $V$, edge list $E$, and vertex label list $\ell$ are mutable accumulators with $O(1)$ ammortized append.
|
|
We assume that the tuple to sink mapping $\phi$ is a linked hashmap, with $O(1)$ insertions and retrievals, and $O(n)$ iteration over the domain of keys.
|
|
We assume that the n-ary join $\domain(\phi_1) \bowtie \ldots \bowtie\domain(\phi_n)$ can be computed in time $\jointime{\domain(\phi_1), \ldots, \domain(\phi_n)}$ (\Cref{def:join-cost}) and that an intersection $\domain(\phi_1) \cap \domain(\phi_2)$ can be computed in time $O(|\domain(\phi_1)| + |\domain(\phi_2)|)$ (e.g., with a hash table).
|
|
|
|
|
|
Before proving our runtime bound, we first observe that $\qruntime{\query, \tupset, \bound} \geq \Omega(|\query(\db)|)$.
|
|
This is true by construction for the relation, projection, and union cases, by \Cref{def:join-cost} for joins, and by the observation that $|\sigma(R)| \leq |R|$.
|
|
|
|
We show that $\qruntime{\query, \tupset, \bound}$ is an upper-bound for the runtime of \Cref{alg:lc} by recursion.
|
|
The base case of a relation atom requires only an $O(|\tupset.R|)$ iteration over the source tuples.
|
|
For the remaining cases, we make the recursive assumption that for every subquery $\query'$, it holds that $O(\qruntime{\query', \tupset, \bound})$ bounds the runtime of \Cref{alg:lc}.
|
|
|
|
|
|
\caseheading{Selection}
|
|
Selection requires a recursive call to \Cref{alg:lc}, which by the recursive assumption is bounded by $O(\qruntime{\query', \tupset, \bound})$.
|
|
\Cref{alg:lc} requires a loop over every element of $\query'(\tupset)$.
|
|
By the observation above that $\qruntime{\query, \db, \bound} \geq \Omega(|\query(\db)|)$, this iteration is also bounded by $O(\qruntime{\query', \tupset, \bound})$.
|
|
|
|
\caseheading{Projection}
|
|
Projection requires a recursive call to \Cref{alg:lc}, which by the recursive assumption is bounded by $O(\qruntime{\query', \tupset})$, which in turn is a term in $\qruntime{\pi_{A}\query', \tupset, \bound}$.
|
|
What remains is an iteration over $\pi_{A}(\query(\tupset))$ (lines 13--16), an iteration over $\query'(\tupset)$ (lines 17--19), and the construction of a fan-in tree (line 20).
|
|
The first iteration is $O(|\query(\tupset)|) \leq O(\qruntime{\query, \tupset, \bound})$.
|
|
The second iteration and the construction of the bounded fan-in tree are both $O(|\query'(\tupset)|) \leq O(\qruntime{\query', \tupset}) \leq O(\qruntime{\query, \tupset, \bound}) $, by the the observation above that $\qruntime{\query, \db, \bound} \geq \Omega(|\query(\db)|)$.
|
|
|
|
\caseheading{Bag Union}
|
|
As above, the recursive calls explicitly correspond to terms in the expansion of $\qruntime{\query_1 \cup \query_2, \tupset, \bound}$.
|
|
Initializing $\phi$ (\Cref{alg:lincirc-union-phi}) can be accomplished in $O(\domain(\phi_1) + \domain(\phi_2)) = O(|\query_1(\tupset)| + |\query_2(\tupset)|) \leq O(\qruntime{\query_1, \tupset} + \qruntime{\query_2, \tupset, \bound})$.
|
|
The remainder requires computing $\query_1 \cap \query_2$ (\Cref{alg:lincirc-union-intersection}) and iterating over it (\Crefrange{alg:lincirc-union-intersection-one}{alg:lincirc-union-intersection-three}), which is $O(|\query_1| + |\query_2|)$ as noted above --- this directly corresponds to terms in $\qruntime{\query_1 \cup \query_2, \tupset, \bound}$.
|
|
|
|
|
|
\caseheading{$m$-ary Join}
|
|
As in the prior cases, recursive calls explicitly correspond to terms in our target runtime.
|
|
The remaining logic involves (i) computing $\domain(\phi_1) \bowtie \ldots \bowtie \domain(\phi_m)$, (ii) iterating over the results, and (iii) creating a fan-in tree.
|
|
Respectively, these are: \\
|
|
~(i)~$\jointime{\domain(\phi_1), \ldots, \domain(\phi_m)}$\\
|
|
~(ii)~$O(|\query_1(\tupset) \bowtie \ldots \bowtie \query_m(\tupset)|) \leq O(\jointime{\domain(\phi_1), \ldots, \domain(\phi_m)})$ (\Cref{def:join-cost})\\
|
|
~(iii)~$O(m|\query_1(\tupset) \bowtie \ldots \bowtie \query_m(\tupset)|)$ (as (ii), noting that $m \leq k = O(1)$)
|
|
\qed
|
|
\end{proof}
|
|
|
|
\section{Higher Moments}
|
|
\label{sec:momemts}
|
|
%
|
|
We make a simple observation to conclude the presentation of our results.
|
|
So far we have only focused on the expectation of $\poly$.
|
|
In addition, we could e.g. prove bounds of the probability of a tuple's multiplicity being at least $1$.
|
|
Progress can be made on this as follows:
|
|
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use the Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
|
|
We leave further investigations for future work.
|
|
|
|
\section{The Karp-Luby Estimator}
|
|
\label{sec:karp-luby}
|
|
%
|
|
Computing the marginal probability of a tuple in the output of a set-probabilistic database query has been studied extensively.
|
|
To the best of our knowledge, the current state of the art approximation algorithm for this problem is the Karp-Luby estimator~\cite{DBLP:journals/jal/KarpLM89}, which first appeared in MayBMS/Sprout~\cite{DBLP:conf/icde/OlteanuHK10}, and more recently as part of an online ``anytime'' approximation algorithm~\cite{FH13,heuvel-19-anappdsd}.
|
|
|
|
The estimator works by observing that for any $\ell$ random binary (but not necessarily independent) events $\vct{W}_1, \ldots, \vct{W}_\ell$, the probability of at least one event occurring (i.e., $\probOf\inparen{\vct{W}_1 \vee \ldots \vee\vct{W}_\ell}$) is bounded from above by the sum of the independent event probabilities (i.e., $\probOf\inparen{\vct{W}_1 \vee \ldots \vee \vct{W}_\ell} \leq \probOf\inparen{\vct{W}_1} + \ldots + \probOf\inparen{\vct{W}_\ell}$).
|
|
Starting from this (`easily' computable and large) value, the estimator proceeds to correct the estimate by estimating how much of an over-estimate it is.
|
|
Specifically, if $\mathcal P$ is the joint distribution over $\vct{W}$, the estimator computes an approximation of:
|
|
$$\mathcal O = \underset{\vct{W} \sim \mathcal P}{\expct}\Big[
|
|
\left|\comprehension{i}{\vct{W}_i = 1, i \in [\ell]}\right|
|
|
\Big].$$
|
|
\AH{Why are we computing the cardinality of variables that equal 1?}
|
|
The accuracy of this estimate is improved by conditioning $\mathcal P$ on a $W_i$ chosen uniformly at random (which ensures that the sampled count will be at least 1) and correcting the resulting estimate by $\probOf\inparen{W_i}$. With an estimate of $\mathcal O$, it can easily be verified that the probability of the disjunction can be computed as:
|
|
\AH{A bit confused on the above sentence:
|
|
\par
|
|
i) what is meant by conditioning $\mathcal{P}$ on a $W_i$,
|
|
|
|
ii) why is each $W_i$ a monomial?
|
|
}
|
|
$$\probOf\inparen{\vct{W}_1 \vee \ldots \vee\vct{W}_\ell} = \probOf\inparen{\vct{W}_1} + \ldots + \probOf\inparen{\vct{W}_\ell} - \mathcal O$$
|
|
|
|
The Karp-Luby estimator is employed on the \abbrSMB representation\footnote{Note that since we are in the set semantics, in the lineage polynomial/formula, addition is logical OR and multiplication is logical AND.} of $\circuit$ (to solve the set-PDB version of \Cref{prob:intro-stmt}), where each $W_i$ represents the event that one monomial is true.
|
|
By simple inspection, if there are $\ell$ monomials, this estimator has runtime $\Omega(\ell)$. Further, a minimum of $\left\lceil\frac{3\cdot \ell\cdot \log(\frac{2}{\delta})}{\epsilon^2}\right\rceil$ invocations of the estimator are required to achieve $1\pm\epsilon$ approximation with probability at least $1-\delta$~\cite{DBLP:conf/icde/OlteanuHK10}, entailing a runtime at least quadratic in $\ell$.
|
|
As an arbitrary lineage circuit $\circuit$ may encode $\Omega\inparen{|\circuit|^k}$ monomials, the worst case runtime is at least $\Omega\inparen{|\circuit|^{2k}}$ (where $k$ is the `degree' of lineage polynomial encoded by $\circuit$). By contrast note that by the discussion after \Cref{lem:val-ub} we can solve \Cref{prob:intro-stmt} in time $O\inparen{|\circuit|^2}$ for all \abbrBIDB circuits {\em independent} of the degree $k$.
|
|
|
|
|
|
|
|
%%% Local Variables:
|
|
%%% mode: latex
|
|
%%% TeX-master: "main"
|
|
%%% End:
|