master
Boris Glavic 2020-12-20 00:16:52 -06:00
parent 097cb8a0dd
commit a72dd81b28
1 changed files with 24 additions and 19 deletions

@ -1,9 +1,9 @@
%!TEX root=./main.tex
\section{Generalizations}
\label{sec:gen}
In this section, we consider several generalizations and corollaries of our results.
In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
Then we formalize our claim in~\Cref{sec:intro} that a linear-time algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries.
Finally, in~\Cref{sec:momemts}, we observe how our results can be used to estimate moments other than the expectation.
\subsection{Lineage circuits}
@ -11,10 +11,10 @@ Finally, in~\Cref{sec:momemts}, we observe how our results can be used to estima
In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials and, until now, have focused on thinking of our input as a polynomial. In particular, starting with~\Cref{sec:expression-trees} we considered these polynomials to be represented as an expression tree. However, expression trees do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `storing' any intermediate results, which is crucial for these algorithms (and other query processing results as well).
In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way.
We present a formal treatment of {\em lineage circuit}s in~\Cref{sec:circuits-formal}, with only a quick overview to start.
A lineage circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant and the sinks correspond to the output.
Every other node has at most two in-edges, is labeled as an addition or a multiplication node, and has no limit on its outdegree.
Note that if we limit the outdegree to one, then we get back expression trees.
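As a small illustration (our own example, not part of the formal development), consider the polynomial $(X+Y)\times\big((X+Y)+Z\big)$.
A lineage circuit can compute the shared subexpression once, in an addition node $g$ with outdegree two:
\[
  g = X + Y, \qquad h = g + Z, \qquad \mathrm{out} = g \times h .
\]
This circuit has three source nodes and three internal nodes, for six nodes in total.
An expression tree for the same polynomial has to duplicate the subtree for $X+Y$, which gives five leaves and four internal nodes, for nine nodes in total.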
In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for lineage circuits and then argue why lineage circuits capture the notion of runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).
@ -24,8 +24,8 @@ In~\Cref{sec:results-circuits} we argue why results from earlier sections also h
We first note that since expression trees are a special case of lineage circuits, all of our hardness results in~\Cref{sec:hard} are still valid for the latter.
Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for lineage circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for lineage circuits as well.
It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has at most two children.
Analogously, in a circuit each node has a maximum in-degree of two.
Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
%
@ -33,7 +33,7 @@ For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a lineage
\subsubsection{The cost model}
\label{sec:cost-model}
Thus far, our analysis of the runtime of $\onepass$ has been in terms of the size of the compressed lineage polynomial.
We now show that this model corresponds to the behavior of a deterministic database by proving that, for any union of conjunctive queries, we can construct a compressed lineage polynomial in the same time complexity as it would take to evaluate the query on a deterministic \emph{bag-relational} database.
We adopt a minimalistic compute-bound model of query evaluation drawn from worst-case optimal joins~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
@ -62,9 +62,9 @@ It can be verified that the worst-case join algorithms~\cite{skew,ngo-survey}, a
\label{sec:circuits-formal}
We now formalize lineage circuits and the construction of lineage circuits for SPJU queries.
As mentioned earlier, we represent lineage polynomials with arithmetic circuits over $\mathbb N$ with $+$, $\times$.
A circuit for query $Q$ is a directed acyclic graph $\tuple{V_Q, E_Q, \phi_Q, \ell_Q}$ with vertices $V_Q$ and directed edges $E_Q \subset V_Q^2$.
A sink function $\phi_Q : \udom^n \rightarrow V_Q$ is a partial function that maps the tuples of the $n$-ary relation defined by $Q$ to vertices.
We require that $\phi_Q$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
A function $\ell_Q : V_Q \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
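To make the notation concrete, here is a small instance of our own choosing (the vertex names $v_1,\ldots,v_4$ and the tuple $t$ are purely illustrative).
Suppose the result of $Q$ contains a single tuple $t$ with lineage $X \times Y + X$.
One circuit for it is
\[
  V_Q = \{v_1, v_2, v_3, v_4\}, \qquad
  E_Q = \{(v_1, v_3),\, (v_2, v_3),\, (v_3, v_4),\, (v_1, v_4)\},
\]
\[
  \ell_Q(v_1) = X, \quad \ell_Q(v_2) = Y, \quad \ell_Q(v_3) = \times, \quad \ell_Q(v_4) = +, \qquad \phi_Q(t) = v_4 .
\]
Here edges point from an operand to the operator that consumes it, which matches the convention that source nodes have in-degree 0 and sinks have out-degree 0.
The source $v_1$ has out-degree two, so this circuit is not an expression tree.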
@ -81,7 +81,7 @@ We now connect the size of a lineage circuit (where the size of a lineage circui
\begin{lemma}
\label{lem:circuits-model-runtime}
For any specific database instance, the size of the lineage circuit for a query result is bounded by the runtime of the corresponding query plan, up to a factor that depends only on the degree of the query polynomial. That is, for any query plan $Q$ we have $|V_Q| \leq (k-1)\qruntime{Q}$, where $k$ is the degree of the query polynomial corresponding to $Q$.
\end{lemma}
\noindent The proof appears in~\Cref{app:subsec-lem-lin-vs-qplan}.
@ -96,10 +96,15 @@ This follows from~\Cref{lem:circuits-model-runtime} and (the lineage circuit cou
\subsection{Higher moments}
\label{sec:momemts}
We make a simple observation to conclude the presentation of our results.
So far we have presented algorithms that approximate the expectation of $\poly$.
In addition, one might want to, e.g., bound the probability that the multiplicity is at least $1$.
While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
Note that for any positive integer $m$ we can compute the expectation of $\poly^m$ (since this only changes the degree of the corresponding lineage polynomial by a factor of $m$).
In other words, we can compute the $m$-th moment of the multiplicities as well, which allows us to use, e.g., Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
However, we leave the question of coming up with more accurate approximation algorithms for future work.
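
As a sketch of how the second moment might be used, assume that $\poly$ denotes the nonnegative, integer-valued multiplicity of a result tuple and write $\mathbb{E}[\cdot]$ for the expectation over possible worlds (this notation is an assumption made for this illustration).
Markov's inequality gives the upper bound below, and the Cauchy--Schwarz inequality applied to $\poly = \poly \cdot \mathbf{1}[\poly \geq 1]$ gives the lower bound:
\[
  \frac{\mathbb{E}[\poly]^2}{\mathbb{E}[\poly^2]} \;\leq\; \Pr[\poly \geq 1] \;\leq\; \mathbb{E}[\poly],
  \qquad \text{whenever } \mathbb{E}[\poly^2] > 0 .
\]
Both bounds depend only on the first two moments, and $\poly^2$ has twice the degree of $\poly$.
Since our algorithms only approximate these expectations, the bounds obtained this way hold only up to the corresponding approximation error.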
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: