Trimmed Section 5 (Aaron) 1st pass.

2021-04-08 15:02:40 -04:00 · 2021-04-08 15:02:40 -04:00 · c57cf0e973
parent c9eeae3fe8
commit c57cf0e973
2 changed files with 51 additions and 70 deletions
--- a/appendix.tex
+++ b/appendix.tex
@ -23,6 +23,29 @@

 \section{Circuits}\label{app:sec-cicuits}
 \subsection{Representing Polynomials with Circuits}\label{app:subsec-rep-poly-lin-circ}
+\subsubsection{Circuits for query plans}
+\label{sec:circuits-formal}
+We now formalize circuits and the construction of circuits for SPJU queries.
+As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$, $\times$.
+A circuit for query $Q$ and $\semNX$-PDB $\pxdb$ is a directed acyclic graph $\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}$ with vertices $V_{Q,\pxdb}$ and directed edges $E_{Q,\pxdb} \subset {V_{Q,\pxdb}}^2$.
+The sink function $\phi_{Q,\pxdb} : \udom^n \rightarrow V_{Q,\pxdb}$ is a partial function that maps the tuples of the $n$-ary relation $Q(\pxdb)$ to vertices.
+We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
+%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
+A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
+We require that vertices have an in-degree of at most two.
+%
+For the specifics on how to construct a circuit that encodes the polynomials of all result tuples for a query and $\semNX$-PDB, see \Cref{app:subsec-rep-poly-lin-circ}. Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsubsection{Circuit size vs. runtime}
+\label{sec:circuit-runtime}
+
+\newcommand{\bagdbof}{\textsc{bag}(\pxdb)}
+
+We now connect the size of a circuit (where the size of a circuit is the number of vertices in the corresponding DAG) %\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.})
+ for a given SPJU query $Q$ and $\semNX$-PDB $\pxdb$ to its $\qruntime{Q,\db}$ where $\db$ is one of the possible worlds of $\pxdb$. We do this formally by showing that the size of the circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \newcommand{\getpoly}[1]{\textbf{lin}\inparen{#1}}
 Each vertex $v \in V_{Q,\pxdb}$ in the arithmetic circuit for

@ -95,8 +118,16 @@ As in projection, newly created vertices will have an in-degree of $k$, and a fa
 There are $|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ such vertices, so the corrected circuit has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ vertices.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Proof for~\Cref{lem:circuits-model-runtime}}\label{app:subsec-lem-lin-vs-qplan}
+\begin{Lemma}
+\label{lem:circuits-model-runtime}
+Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the size of the lineage circuit of $Q(\pxdb)$ has the same or better complexity as the runtime of $Q$ over $\pxdb$.  That is, we have $\abs{V_{Q,\pxdb}} \leq (k-1)\qruntime{Q}$, where $k$ is the maximal degree of any polynomial in $Q(\pxdb)$.
+\end{Lemma}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\noindent The proof is shown in~\Cref{app:subsec-lem-lin-vs-qplan}.
+We now have all the pieces to argue that, using our approximation algorithm, the expected multiplicities of an SPJU query can be computed in essentially the same runtime as deterministic query processing for the same query.

+\subsection{Proof for~\Cref{lem:circuits-model-runtime}}\label{app:subsec-lem-lin-vs-qplan}
+\begin{proof}
 Proof by induction.  The base case is a base relation: $Q = R$ and is trivially true since $|V_{R,\pxdb}| = |R|$.
 For the inductive step, we assume that we have circuits for subplans $Q_1, \ldots, Q_n$ such that $|V_{Q_i,\pxdb}| \leq (k_i-1)\qruntime{Q_i,\pxdb}$ where $k_i$ is the degree of $Q_i$.

@ -146,6 +177,8 @@ The circuit for $Q$ has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bow
 \end{align*}

 The property holds for all recursive queries, and the proof holds.
+\qed
+\end{proof}

 %%% Local Variables:
 %%% mode: latex
--- a/circuits-model-runtime.tex
+++ b/circuits-model-runtime.tex
@ -1,51 +1,31 @@
 %!TEX root=./main.tex

 \section{More on Circuits and Moments}\label{sec:gen}
-%In this section, we consider generalizations/corollaries of our results.
-%In particular, in~\Cref{sec:circuits} we first consider the case when the compressed  polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
-%Then, 
-We formalize our claim from \Cref{sec:intro} that a linear algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries under reasonable assumptions.
-Finally, in~\Cref{sec:momemts}, we generalize our result for expectation to other moments.
+We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered in the same runtime as deterministic queries under reasonable assumptions.
+Lastly, we generalize our result for expectation to other moments.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


-\revision{
-\subsection{Cost Model, Query Plans, and Runtime}
-As in the introduction, we could consider polynomials to be represented as an expression tree.
-However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
-}
-
-\label{sec:circuits}
-
-
-%In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for circuits and then 
-We argue why circuits capture the runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%\subsubsection{Extending our results to circuits}
-%\label{sec:results-circuits}
+%\revision{
+%\subsection{Cost Model, Query Plans, and Runtime}
+%As in the introduction, we could consider polynomials to be represented as an expression tree.
+%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
+%}
 %
-%We first note that since expression trees are a special case of linear circuits, all of our hardness results for  in~\Cref{sec:hard} are still valid for the latter.
-%
-%Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for circuits as well.
-%It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children;
-%Analogously in a circuit, each node has a maximum in-degree of two.
-%Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
-%%
-%For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a circuit, see~\Cref{app:lineage-circuit-ext}.
+%\label{sec:circuits}

-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsubsection{The cost model}
+\subsection{The cost model}
 \label{sec:cost-model}
 So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
-We now show that this model corresponds to the behavior of a deterministic database by proving that for any union of conjunctive queries, we can construct a compressed lineage polynomial for a query $Q$ and \bi $\pxdb$ of size (and in runtime) linear in the runtime of a general class of query processing algorithms for the same query $Q$ on a deterministic database $\db$.
+We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed lineage polynomial for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
 We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db}$).
-This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
-In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).
+\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
+In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
 %That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
 % with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
 We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.
+
 \newcommand{\qruntime}[1]{\textbf{cost}(#1)}
 %
 \noindent\resizebox{1\linewidth}{!}{
@ -72,36 +52,7 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
 %\end{proposition}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsubsection{Circuits for query plans}
-\label{sec:circuits-formal}
-We now formalize circuits and the construction of circuits for SPJU queries.
-As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$, $\times$.
-A circuit for query $Q$ and $\semNX$-PDB $\pxdb$ is a directed acyclic graph $\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}$ with vertices $V_{Q,\pxdb}$ and directed edges $E_{Q,\pxdb} \subset {V_{Q,\pxdb}}^2$.
-The sink function $\phi_{Q,\pxdb} : \udom^n \rightarrow V_{Q,\pxdb}$ is a partial function that maps the tuples of the $n$-ary relation $Q(\pxdb)$ to vertices.
-We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
-%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
-A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
-We require that vertices have an in-degree of at most two.
-%
-For the specifics on how to construct a  circuit to encode the polynomials of all result tuples for a query and $\semNX$-PDB see \Cref{app:subsec-rep-poly-lin-circ}. Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.

-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsubsection{Circuit size vs. runtime}
-\label{sec:circuit-runtime}
-
-\newcommand{\bagdbof}{\textsc{bag}(\pxdb)}
-
-We now connect the size of a circuit (where the size of a circuit is the number of vertices in the corresponding DAG) %\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.})
- for a given SPJU query $Q$ and $\semNX$-PDB $\pxdb$ to its $\qruntime{Q,\db}$ where $\db$ is one of the possible worlds of $\pxdb$. We do this formally by showing that the size of the circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\begin{Lemma}
-\label{lem:circuits-model-runtime}
-Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the runtime of $Q$ over $\pxdb$ has the same or better complexity as the size of the lineage of $Q(\pxdb)$.  That is, we have $\abs{V_{Q,\pxdb}} \leq (k-1)\qruntime{Q}$, where $k$ is the maximal degree of  any  polynomial in $Q(\pxdb)$.
-\end{Lemma}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\noindent The proof is shown in in~\Cref{app:subsec-lem-lin-vs-qplan}.
-We now have all the pieces to argue that using our approximation algorithm,  the expected multiplicities of a SPJU query can be computed in essentially the same runtime as deterministic query processing for the same query:
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Corollary}
  Given an SPJU query $Q$ over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
@ -112,7 +63,7 @@ We now have all the pieces to argue that using our approximation algorithm,  the
 \end{Corollary}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{proof}
-This follows from~\Cref{lem:circuits-model-runtime} and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
+This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
 \end{proof}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

@ -121,12 +72,9 @@ This follows from~\Cref{lem:circuits-model-runtime} and \Cref{cor:approx-algo-co
 \label{sec:momemts}

 We make a simple observation to conclude the presentation of our results.
-So far focused on the expectation of $\poly$.
-In addition, we could e.g. prove bounds of probability of the multiplicity being at least $1$.
-While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
-For any positive integer $m$ we can compute the expectation $\poly^m$ (which only changes the degree of the corresponding lineage polynomial by a factor of $m$).
-In other words, we can compute the $m$-th moment of the multiplicities, allowing us to e.g. to use Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
-However, we leave the question of coming up with a more accurate approximation algorithms for future work.
+So far we have focused only on the expectation of $\poly$.  In addition, we could, e.g., prove bounds on the probability of the multiplicity being at least $1$.  Progress can be made on this as follows:
+For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us, e.g., to use Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
+We leave questions like this for future work.

 %%% Local Variables:
 %%% mode: latex