Finished incorporating @atri changes 020421

master
Aaron Huber 2021-02-09 09:12:22 -05:00
parent ba6010daa8
commit d2f628fbc9
3 changed files with 49 additions and 35 deletions

@ -1,39 +1,39 @@
%!TEX root=./main.tex
\section{Generalizations}
\revision{
\section{Need a good title for this section}
}
\label{sec:gen}
In this section, we consider generalizations/corollaries of our results.
In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
Then, we formalize our claim from \Cref{sec:intro} that a linear algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries under reasonable assumptions.
%In this section, we consider generalizations/corollaries of our results.
%In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
%Then,
We formalize our claim from \Cref{sec:intro} that a linear algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries under reasonable assumptions.
Finally, in~\Cref{sec:momemts}, we generalize our result for expectation to other moments.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Lineage Circuits}
\revision{
\subsection{Cost Model, Query Plans, and Runtime}
}
\label{sec:circuits}
In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials and until now, have focused on thinking of inputs this way.
In particular, starting with~\Cref{sec:expression-trees} we considered these polynomials to be represented as an expression tree.
However, these do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way.
We present a formal treatment of {\em circuit}s in~\Cref{sec:circuits-formal} and give only a quick overview in this section.
A circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant, and the sinks to output tuples.
Every other node has at most two in-edges, is labeled as an addition or a multiplication node, and has no limit on its outdegree.
Note that if we limit the outdegree to one, then we get back expression trees.
In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for circuits and then argue why circuits capture the runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).
%In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for circuits and then
We argue why circuits capture the runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Extending our results to circuits}
\label{sec:results-circuits}
We first note that since expression trees are a special case of lineage circuits, all of our hardness results in~\Cref{sec:hard} are still valid for the latter.
Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for circuits as well.
It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children.
Analogously, in a circuit, each node has a maximum in-degree of two.
Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
%\subsubsection{Extending our results to circuits}
%\label{sec:results-circuits}
%
For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a circuit, see~\Cref{app:lineage-circuit-ext}.
%We first note that since expression trees are a special case of lineage circuits, all of our hardness results in~\Cref{sec:hard} are still valid for the latter.
%
%Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for circuits as well.
%It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children.
%Analogously, in a circuit, each node has a maximum in-degree of two.
%Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
%%
%For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a circuit, see~\Cref{app:lineage-circuit-ext}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{The cost model}
@ -72,7 +72,7 @@ It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey
%\end{proposition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Lineage circuits for query plans}
\subsubsection{Circuits for query plans}
\label{sec:circuits-formal}
We now formalize circuits and the construction of circuits for SPJU queries.
As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$, $\times$.

@ -689,15 +689,17 @@ It is easy to check that except for~\Cref{alg:sample-times-union}, all other lin
\input{experiments}
\section{Circuits}\label{app:sec-cicuits}
\subsection{Extending to Lineage Circuits}\label{app:lineage-circuit-ext}
More specifically, consider $\onepass$. The algorithm (as well as its analysis) basically uses the fact that one can compute the corresponding polynomial on the all-$1$s input with a simple recursive formula (\cref{eq:T-all-ones}), and that we can compute a probability distribution based on these weights (as in~\cref{eq:T-weights}). It can be verified that all the arguments go through if we replace $\etree_\lchild$ and $\etree_\rchild$ for expression tree $\etree$ with the two incoming nodes of the sink for the given lineage circuit. Another way to look at this is that we could `unroll' the recursion in $\onepass$ and think of the algorithm as doing the evaluation at each node bottom up, from the leaves to the root of the expression tree. For lineage circuits, we start from the source nodes and do the computation in topological order until we reach the sink(s).
The argument for $\sampmon$ is similar. As argued above, $\onepass$ works as intended for lineage circuits, since~\Cref{alg:one-pass} only recurses on children of the current node in the expression tree, and we can generalize it to lineage circuits by recursing on the two children of the current node in the lineage circuit. Alternatively, as we have already used in the proof of~\Cref{lem:sample}, we can think of the sampling algorithm as sampling a sub-graph of the expression tree. For lineage circuits, we can think of $\sampmon$ as sampling the same sub-graph. Alternatively, one can implicitly expand the lineage circuit into a (larger but) equivalent expression tree. Since $\sampmon$ only explores one sub-graph during its run, we can think of its run on a lineage circuit as being done on the implicit equivalent expression tree\footnote{
Recall that $\sampmon$ scales only in the depth of the expression and its polynomial degree ($k$). There exist polynomials that can be encoded in size $\Omega(\log k)$, but we follow convention in assuming that the circuit size is asymptotically larger than $k$ and thus treat the degree (i.e., join width) as a constant.
}. Hence, all of the results on $\sampmon$ on expression trees carry over to lineage circuits.
Thus, we have argued that~\Cref{lem:approx-alg} also holds if we use a lineage circuit instead of an expression tree as the input to our approximation algorithm.
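To make the bottom-up evaluation of $\onepass$ described above concrete, we give a small worked example. It is only a sketch, under the assumption that the recurrence of \cref{eq:T-all-ones} assigns $1$ to every variable source, adds the values of the two inputs at a sum gate, and multiplies them at a product gate; the gate names $g_{+}$ and $g_{\times}$ are ours and are used only for illustration. Take a lineage circuit for $(X+Y)\times(X+Y)$ in which both inputs of the product gate $g_{\times}$ are the same shared sum gate $g_{+}$. Processing the nodes in the topological order $X, Y, g_{+}, g_{\times}$ gives
\[
X \mapsto 1, \qquad Y \mapsto 1, \qquad g_{+} \mapsto 1 + 1 = 2, \qquad g_{\times} \mapsto 2 \times 2 = 4,
\]
which agrees with the sum of the coefficients of the expansion $X^2 + 2XY + Y^2$. The shared gate $g_{+}$ is evaluated once, whereas the equivalent expression tree would evaluate its two copies of $X+Y$ separately.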
%\subsection{Extending to Lineage Circuits}\label{app:lineage-circuit-ext}
%
%More specifically, consider $\onepass$. The algorithm (as well as its analysis) basically uses the fact that one can compute the corresponding polynomial on the all-$1$s input with a simple recursive formula (\cref{eq:T-all-ones}), and that we can compute a probability distribution based on these weights (as in~\cref{eq:T-weights}). It can be verified that all the arguments go through if we replace $\etree_\lchild$ and $\etree_\rchild$ for expression tree $\etree$ with the two incoming nodes of the sink for the given lineage circuit. Another way to look at this is that we could `unroll' the recursion in $\onepass$ and think of the algorithm as doing the evaluation at each node bottom up, from the leaves to the root of the expression tree. For lineage circuits, we start from the source nodes and do the computation in topological order until we reach the sink(s).
%
%The argument for $\sampmon$ is similar. As argued above, $\onepass$ works as intended for lineage circuits, since~\Cref{alg:one-pass} only recurses on children of the current node in the expression tree, and we can generalize it to lineage circuits by recursing on the two children of the current node in the lineage circuit. Alternatively, as we have already used in the proof of~\Cref{lem:sample}, we can think of the sampling algorithm as sampling a sub-graph of the expression tree. For lineage circuits, we can think of $\sampmon$ as sampling the same sub-graph. Alternatively, one can implicitly expand the lineage circuit into a (larger but) equivalent expression tree. Since $\sampmon$ only explores one sub-graph during its run, we can think of its run on a lineage circuit as being done on the implicit equivalent expression tree\footnote{
% Recall that $\sampmon$ scales only in the depth of the expression and its polynomial degree ($k$). There exist polynomials that can be encoded in size $\Omega(\log k)$, but we follow convention in assuming that the circuit size is asymptotically larger than $k$ and thus treat the degree (i.e., join width) as a constant.
%}. Hence, all of the results on $\sampmon$ on expression trees carry over to lineage circuits.
%
%Thus, we have argued that~\Cref{lem:approx-alg} also holds if we use a lineage circuit instead of an expression tree as the input to our approximation algorithm.
\subsection{Representing Polynomials with Lineage Circuits}\label{app:subsec-rep-poly-lin-circ}
\newcommand{\getpoly}[1]{\textbf{lin}\inparen{#1}}

@ -103,15 +103,27 @@ tree, whose internal nodes are from the set $\{+, \times\}$, with leaf nodes bei
\end{Definition}}
\revision{
We could consider polynomials to be represented as an expression tree.
However, expression trees do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
We represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, a standard way to represent polynomials over fields (particularly in the field of algebraic complexity) that we use for polynomials over $\mathbb N$ in the obvious way.
\begin{Definition}[Circuit]\label{def:circuit}
A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source nodes (in-degree of $0$) consist of elements in either $\reals$ or $\vct{X}$. The internal and sink nodes of $\circuit$ have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.
Circuit $\circuit$ additionally has the following members: \type, \val, \vari{partial}, \vari{input}, and \vari{weight}, where \type is the type of value stored in the node $\circuit$ (i.e., one of $\{\circplus, \circmult, \var, \tnum\}$), \val is the value stored, and \vari{input} is the list of \circuit 's inputs where $\circuit_\linput$ is the left input and $\circuit_\rinput$ the right input. When the underlying DAG is a tree, we will refer to the structure as an expression tree.
\end{Definition}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We ignore the remaining fields (\vari{partial} and \vari{weight}) until \Cref{sec:algo}. Also note that the out degree of any internal node can grow with the circuit size.
We present a formal treatment of {\em circuit}s in~\Cref{sec:circuits-formal}.
As stated in~\Cref{def:circuit}, a circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant, and the sinks to output tuples.
Every other node has at most two in-edges, is labeled as an addition or a multiplication node, and has no limit on its outdegree.
Note that if we limit the outdegree to one, then we get expression trees.
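As an illustrative example (ours, and not one of the queries considered elsewhere in the paper), consider the polynomial $(X+Y)\times(X+Y)$ with $X, Y \in \vct{X}$. Assuming that both inputs of a gate may point to the same node, a circuit can represent it with four nodes: the sources $X$ and $Y$, a sum gate $g_{+}$ with inputs $X$ and $Y$, and a product gate $g_{\times}$ whose left and right inputs are both $g_{+}$, so that $g_{+}$ has outdegree two. Limiting the outdegree to one forces the shared gate to be copied, and the resulting expression tree has seven nodes. Iterating the squaring step makes the gap explicit. As a sketch, with the hypothetical gates $g_0 = X + Y$ and $g_i = g_{i-1} \times g_{i-1}$, the two representations of $g_i$ have
\[
\underbrace{i+3}_{\mbox{circuit nodes}} \qquad \mbox{versus} \qquad \underbrace{2^{i+2}-1}_{\mbox{expression tree nodes}},
\]
which for $i = 1$ gives the counts $4$ and $7$ above.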
We ignore the remaining fields (\vari{partial} and \vari{weight}) until \Cref{sec:algo}.
}
%Also note that the out degree of any internal node can grow with the circuit size.
The semantics of \revision{circuits} follows the obvious interpretation. We \revision{next} define \revision{the relationship with polynomials} formally:
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}