Trimmed Section 5 (Aaron) 1st pass.
parent
c9eeae3fe8
commit
c57cf0e973
35
appendix.tex
35
appendix.tex
|
@ -23,6 +23,29 @@
|
|||
|
||||
\section{Circuits}\label{app:sec-cicuits}
|
||||
\subsection{Representing Polynomials with Circuits}\label{app:subsec-rep-poly-lin-circ}
|
||||
\subsubsection{Circuits for query plans}
|
||||
\label{sec:circuits-formal}
|
||||
We now formalize circuits and the construction of circuits for SPJU queries.
|
||||
As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$, $\times$.
|
||||
A circuit for query $Q$ and $\semNX$-PDB $\pxdb$ is a directed acyclic graph $\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}$ with vertices $V_{Q,\pxdb}$ and directed edges $E_{Q,\pxdb} \subset {V_{Q,\pxdb}}^2$.
|
||||
The sink function $\phi_{Q,\pxdb} : \udom^n \rightarrow V_{Q,\pxdb}$ is a partial function that maps the tuples of the $n$-ary relation $Q(\pxdb)$ to vertices.
|
||||
We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
|
||||
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
|
||||
A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
|
||||
We require that vertices have an in-degree of at most two.
|
||||
%
|
||||
For the specifics on how to construct a circuit to encode the polynomials of all result tuples for a query and $\semNX$-PDB see \Cref{app:subsec-rep-poly-lin-circ}. Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsubsection{Circuit size vs. runtime}
|
||||
\label{sec:circuit-runtime}
|
||||
|
||||
\newcommand{\bagdbof}{\textsc{bag}(\pxdb)}
|
||||
|
||||
We now connect the size of a circuit (where the size of a circuit is the number of vertices in the corresponding DAG) %\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.})
|
||||
for a given SPJU query $Q$ and $\semNX$-PDB $\pxdb$ to its $\qruntime{Q,\db}$ where $\db$ is one of the possible worlds of $\pxdb$. We do this formally by showing that the size of the circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\newcommand{\getpoly}[1]{\textbf{lin}\inparen{#1}}
|
||||
Each vertex $v \in V_{Q,\pxdb}$ in the arithmetic circuit for
|
||||
|
||||
|
@ -95,8 +118,16 @@ As in projection, newly created vertices will have an in-degree of $k$, and a fa
|
|||
There are $|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ such vertices, so the corrected circuit has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ vertices.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Proof for~\Cref{lem:circuits-model-runtime}}\label{app:subsec-lem-lin-vs-qplan}
|
||||
\begin{Lemma}
|
||||
\label{lem:circuits-model-runtime}
|
||||
Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the runtime of $Q$ over $\pxdb$ has the same or better complexity as the size of the lineage of $Q(\pxdb)$. That is, we have $\abs{V_{Q,\pxdb}} \leq (k-1)\qruntime{Q}$, where $k$ is the maximal degree of any polynomial in $Q(\pxdb)$.
|
||||
\end{Lemma}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\noindent The proof is shown in in~\Cref{app:subsec-lem-lin-vs-qplan}.
|
||||
We now have all the pieces to argue that using our approximation algorithm, the expected multiplicities of a SPJU query can be computed in essentially the same runtime as deterministic query processing for the same query.
|
||||
|
||||
\subsection{Proof for~\Cref{lem:circuits-model-runtime}}\label{app:subsec-lem-lin-vs-qplan}
|
||||
\begin{proof}
|
||||
Proof by induction. The base case is a base relation: $Q = R$ and is trivially true since $|V_{R,\pxdb}| = |R|$.
|
||||
For the inductive step, we assume that we have circuits for subplans $Q_1, \ldots, Q_n$ such that $|V_{Q_i,\pxdb}| \leq (k_i-1)\qruntime{Q_i,\pxdb}$ where $k_i$ is the degree of $Q_i$.
|
||||
|
||||
|
@ -146,6 +177,8 @@ The circuit for $Q$ has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bow
|
|||
\end{align*}
|
||||
|
||||
The property holds for all recursive queries, and the proof holds.
|
||||
\qed
|
||||
\end{proof}
|
||||
|
||||
%%% Local Variables:
|
||||
%%% mode: latex
|
||||
|
|
|
@ -1,51 +1,31 @@
|
|||
%!TEX root=./main.tex
|
||||
|
||||
\section{More on Circuits and Moments}\label{sec:gen}
|
||||
%In this section, we consider generalizations/corollaries of our results.
|
||||
%In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
|
||||
%Then,
|
||||
We formalize our claim from \Cref{sec:intro} that a linear algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries under reasonable assumptions.
|
||||
Finally, in~\Cref{sec:momemts}, we generalize our result for expectation to other moments.
|
||||
We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered in the same runtime as deterministic queries under reasonable assumptions.
|
||||
Lastly, we generalize our result for expectation to other moments.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
|
||||
\revision{
|
||||
\subsection{Cost Model, Query Plans, and Runtime}
|
||||
As in the introduction, we could consider polynomials to be represented as an expression tree.
|
||||
However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
|
||||
}
|
||||
|
||||
\label{sec:circuits}
|
||||
|
||||
|
||||
%In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for circuits and then
|
||||
We argue why circuits capture the runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%\subsubsection{Extending our results to circuits}
|
||||
%\label{sec:results-circuits}
|
||||
%\revision{
|
||||
%\subsection{Cost Model, Query Plans, and Runtime}
|
||||
%As in the introduction, we could consider polynomials to be represented as an expression tree.
|
||||
%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
|
||||
%}
|
||||
%
|
||||
%We first note that since expression trees are a special case of linear circuits, all of our hardness results for in~\Cref{sec:hard} are still valid for the latter.
|
||||
%
|
||||
%Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for circuits as well.
|
||||
%It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children;
|
||||
%Analogously in a circuit, each node has a maximum in-degree of two.
|
||||
%Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
|
||||
%%
|
||||
%For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a circuit, see~\Cref{app:lineage-circuit-ext}.
|
||||
%\label{sec:circuits}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsubsection{The cost model}
|
||||
\subsection{The cost model}
|
||||
\label{sec:cost-model}
|
||||
So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any union of conjunctive queries, we can construct a compressed lineage polynomial for a query $Q$ and \bi $\pxdb$ of size (and in runtime) linear in the runtime of a general class of query processing algorithms for the same query $Q$ on a deterministic database $\db$.
|
||||
We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed lineage polynomial for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
|
||||
We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
|
||||
This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
|
||||
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).
|
||||
\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
|
||||
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
|
||||
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
|
||||
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
|
||||
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.
|
||||
|
||||
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
|
||||
%
|
||||
\noindent\resizebox{1\linewidth}{!}{
|
||||
|
@ -72,36 +52,7 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
|
|||
%\end{proposition}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsubsection{Circuits for query plans}
|
||||
\label{sec:circuits-formal}
|
||||
We now formalize circuits and the construction of circuits for SPJU queries.
|
||||
As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$, $\times$.
|
||||
A circuit for query $Q$ and $\semNX$-PDB $\pxdb$ is a directed acyclic graph $\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}$ with vertices $V_{Q,\pxdb}$ and directed edges $E_{Q,\pxdb} \subset {V_{Q,\pxdb}}^2$.
|
||||
The sink function $\phi_{Q,\pxdb} : \udom^n \rightarrow V_{Q,\pxdb}$ is a partial function that maps the tuples of the $n$-ary relation $Q(\pxdb)$ to vertices.
|
||||
We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
|
||||
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
|
||||
A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
|
||||
We require that vertices have an in-degree of at most two.
|
||||
%
|
||||
For the specifics on how to construct a circuit to encode the polynomials of all result tuples for a query and $\semNX$-PDB see \Cref{app:subsec-rep-poly-lin-circ}. Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsubsection{Circuit size vs. runtime}
|
||||
\label{sec:circuit-runtime}
|
||||
|
||||
\newcommand{\bagdbof}{\textsc{bag}(\pxdb)}
|
||||
|
||||
We now connect the size of a circuit (where the size of a circuit is the number of vertices in the corresponding DAG) %\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.})
|
||||
for a given SPJU query $Q$ and $\semNX$-PDB $\pxdb$ to its $\qruntime{Q,\db}$ where $\db$ is one of the possible worlds of $\pxdb$. We do this formally by showing that the size of the circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Lemma}
|
||||
\label{lem:circuits-model-runtime}
|
||||
Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the runtime of $Q$ over $\pxdb$ has the same or better complexity as the size of the lineage of $Q(\pxdb)$. That is, we have $\abs{V_{Q,\pxdb}} \leq (k-1)\qruntime{Q}$, where $k$ is the maximal degree of any polynomial in $Q(\pxdb)$.
|
||||
\end{Lemma}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\noindent The proof is shown in in~\Cref{app:subsec-lem-lin-vs-qplan}.
|
||||
We now have all the pieces to argue that using our approximation algorithm, the expected multiplicities of a SPJU query can be computed in essentially the same runtime as deterministic query processing for the same query:
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}
|
||||
Given an SPJU query $Q$ over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
|
||||
|
@ -112,7 +63,7 @@ We now have all the pieces to argue that using our approximation algorithm, the
|
|||
\end{Corollary}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{proof}
|
||||
This follows from~\Cref{lem:circuits-model-runtime} and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
|
||||
This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
|
||||
\end{proof}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
|
@ -121,12 +72,9 @@ This follows from~\Cref{lem:circuits-model-runtime} and \Cref{cor:approx-algo-co
|
|||
\label{sec:momemts}
|
||||
|
||||
We make a simple observation to conclude the presentation of our results.
|
||||
So far focused on the expectation of $\poly$.
|
||||
In addition, we could e.g. prove bounds of probability of the multiplicity being at least $1$.
|
||||
While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
|
||||
For any positive integer $m$ we can compute the expectation $\poly^m$ (which only changes the degree of the corresponding lineage polynomial by a factor of $m$).
|
||||
In other words, we can compute the $m$-th moment of the multiplicities, allowing us to e.g. to use Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
|
||||
However, we leave the question of coming up with a more accurate approximation algorithms for future work.
|
||||
So far we have only focused on the expectation of $\poly$. In addition, we could e.g. prove bounds of probability of the multiplicity being at least $1$. Progress can be made on this as follows:
|
||||
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. to use Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
|
||||
We leave this questions like this for future work.
|
||||
|
||||
%%% Local Variables:
|
||||
%%% mode: latex
|
||||
|
|
Loading…
Reference in New Issue