
%!TEX root=./main.tex
\section{Generalizations}
\label{sec:gen}
In this section, we consider a couple of generalizations and corollaries of our results so far. In~\Cref{sec:circuits}, we first consider the case where the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of the earlier expression tree (\Cref{def:express-tree}), and we observe that all of our results carry over to the DAG representation. We then formalize our claim from~\Cref{sec:intro} that a linear runtime algorithm for our problem would imply that we can process PDBs in the same time as deterministic query processing. Finally, in~\Cref{sec:momemts}, we make some simple observations on how our results can be used to estimate moments beyond the expectation of a lineage polynomial.

\subsection{Lineage circuits}
\label{sec:circuits}

In~\Cref{sec:semnx-as-repr}, we switched to thinking of our query results as polynomials, and indeed much of the rest of the paper has focused on thinking of our input as a polynomial. In particular, starting with~\Cref{sec:expression-trees}, we considered these polynomials to be represented as expression trees. However, expression trees do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `storing' any intermediate results, which is crucial for these algorithms (and other query processing results as well).

In this section, we represent query polynomials via {\em arithmetic circuits}~\cite{arith-complexity}, which are a standard way to represent polynomials over fields in algebraic complexity, though in our case we use them for polynomials over $\mathbb N$ in the obvious way. We start with a quick overview and present a formal treatment of {\em lineage circuits} in~\Cref{sec:circuits-formal}. A lineage circuit is represented by a DAG, where each source node corresponds to either one of the input variables or a constant, and the sinks correspond to the output. Every non-source node has at most two incoming edges (and is labeled as either an addition or a multiplication node), but there is no limit on the outdegree of such nodes. We note that if we restrict the outdegree to be at most one, then we get back expression trees.
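
As a small illustration (the tuples $t_1$, $t_2$ and the variables $X_1,\ldots,X_4$ are hypothetical and chosen only for this example), consider two output tuples whose lineage polynomials share a common factor:
\begin{align*}
t_1 &\mapsto (X_1 + X_2)\cdot X_3, & t_2 &\mapsto (X_1 + X_2)\cdot X_4.
\end{align*}
A lineage circuit can compute $X_1 + X_2$ at a single addition node with outdegree two and feed it to both multiplication nodes, for a total of $4 + 1 + 2 = 7$ nodes. Two separate expression trees must each repeat the subtree for $X_1 + X_2$, and so use $5$ nodes each ($3$ leaves, one addition node, and one multiplication node), or $10$ nodes in total.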


In~\Cref{sec:results-circuits}, we argue why our results from earlier sections also hold for lineage circuits. In~\Cref{sec:circuit-runtime}, we then argue why lineage circuits do indeed capture the notion of runtime of some well-known query processing algorithms (we formally define the corresponding cost model in~\Cref{sec:cost-model}).

\subsubsection{Extending our results to lineage circuits}
\label{sec:results-circuits}

We first note that since expression trees are a special case of lineage circuits, all of our hardness results in~\Cref{sec:hard} remain valid for lineage circuits.

For the approximation algorithm in~\Cref{sec:algo}, we note that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam}) works for lineage circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for lineage circuits. It turns out that both $\onepass$ and $\sampmon$ do work for lineage circuits, simply because the only property of expression trees that they use is that each node has two children. This is still valid for lineage circuits, where for each non-source node the children correspond to the two nodes that have incoming edges to the given node. Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.

For further discussion on why~\Cref{lem:approx-alg} holds for a lineage circuit, see~\Cref{app:lineage-circuit-ext}.

\subsubsection{The cost model}
\label{sec:cost-model}
Thus far, our analysis of the runtime of $\onepass$ has been in terms of the size of the compressed lineage polynomial. 
We now show that this model corresponds to the behavior of a deterministic database by proving that for any union of conjunctive queries, we can construct a compressed lineage polynomial with the same complexity as it would take to evaluate the query on a deterministic \emph{bag-relational} database.
We adopt a minimalistic compute-bound model of query evaluation drawn from worst-case optimal joins~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
\begin{align*}
\qruntime{Q} & = \abs{Q}\\
\qruntime{\sigma Q} & = \qruntime{Q}\\
\qruntime{\pi Q} & = \qruntime{Q} + \abs{Q}\\
\qruntime{Q \cup Q'} & = \qruntime{Q} + \qruntime{Q'} +\abs{Q}+\abs{Q'}\\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n} & = \qruntime{Q_1} + \ldots + \qruntime{Q_n} + \abs{Q_1 \bowtie \ldots \bowtie Q_n}
\end{align*}
Under this model, the query plan $Q(D)$ has runtime $O(\qruntime{Q(D)})$.
The cost of a base relation assumes that a full table scan is required; we model index scans by treating an index scan query $\sigma_\theta(R)$ as a single base relation.
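
As a worked instance of these rules (a sketch with hypothetical relations $R$ and $S$ and made-up sizes, reading the first rule as the cost of scanning a base relation), consider the plan $\pi(R \bowtie S)$:
\begin{align*}
\qruntime{\pi(R \bowtie S)} &= \qruntime{R \bowtie S} + \abs{R \bowtie S}\\
&= \qruntime{R} + \qruntime{S} + 2\abs{R \bowtie S}\\
&= \abs{R} + \abs{S} + 2\abs{R \bowtie S}.
\end{align*}
For example, if $\abs{R} = 3$, $\abs{S} = 4$, and $\abs{R \bowtie S} = 5$, then the cost is $3 + 4 + 2\cdot 5 = 17$.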

It can be verified that the worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}), can be modeled as select-project-join-union (SPJU) queries (though these queries can be data dependent).\footnote{This claim can be verified by, e.g., looking at the {\em Generic-Join} algorithm in~\cite{skew} and the {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model, applied to the corresponding SPJU queries, correctly captures their runtime.

%We now make a simple observation on the above cost model:
%\begin{proposition}
%\label{prop:queries-need-to-output-tuples}
%The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
%\end{proposition}


\subsubsection{Lineage circuit for query plans}
\label{sec:circuits-formal}
We now define a lineage circuit more formally and also show how to construct a lineage circuit given an SPJU query $Q$.

As mentioned earlier, we represent lineage polynomials with arithmetic circuits over $\mathbb N$ with $+$, $\times$.  
A circuit for query $Q$ is a directed acyclic graph $\tuple{V_Q, E_Q, \phi_Q, \ell_Q}$ with vertices $V_Q$ and directed edges $E_Q \subset V_Q^2$.  
A sink function $\phi_Q : \udom^n \rightarrow V_Q$ is a partial function that maps the tuples of the $n$-ary relation defined by $Q$ to vertices in the graph.  
We require that $\phi_Q$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
A function $\ell_Q : V_Q \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
We require that vertices have an in-degree of at most two.
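
To make the definition concrete, the following is one possible circuit (a hypothetical example; the vertex names $v_1,\ldots,v_7$ and the tuples $t_1,t_2$ are introduced only for illustration) for a query result with two tuples whose lineages are $(X_1+X_2)\cdot X_3$ and $(X_1+X_2)\cdot X_4$, with edges directed from each child to its parent:
\begin{align*}
V_Q &= \{v_1,\ldots,v_7\},\\
E_Q &= \{(v_1,v_5),(v_2,v_5),(v_5,v_6),(v_3,v_6),(v_5,v_7),(v_4,v_7)\},\\
\ell_Q(v_i) &= X_i \text{ for } 1\le i\le 4,\quad \ell_Q(v_5) = +,\quad \ell_Q(v_6)=\ell_Q(v_7)=\times,\\
\phi_Q(t_1) &= v_6,\quad \phi_Q(t_2) = v_7.
\end{align*}
The sources $v_1,\ldots,v_4$ have in-degree $0$, every other vertex has in-degree exactly two, and both vertices in the range of $\phi_Q$ have out-degree $0$, while the shared addition node $v_5$ has out-degree two.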

For the specifics of how a lineage circuit is translated into the polynomial it represents, see~\Cref{app:subsec-rep-poly-lin-circ}.


\subsubsection{Circuit size vs. runtime}
\label{sec:circuit-runtime}

We now connect the size of a lineage circuit for a given SPJU query $Q$ to its $\qruntime{Q}$, where the size of a lineage circuit is the number of vertices in the corresponding DAG.\footnote{Since each node has in-degree at most two, this is the same, up to constants, as counting the number of edges in the DAG.} We do this formally by showing that the size of the lineage circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.

\begin{lemma}
\label{lem:circuits-model-runtime}
For any query plan $Q$ and any specific database instance, the lineage circuit of the corresponding query result has the same or better complexity as the runtime of $Q$. That is, for any query plan $Q$ we have $|V_Q| \leq (k-1)\qruntime{Q}$, where $k$ is the degree of the query polynomial corresponding to $Q$.
\end{lemma}
The proof is in~\Cref{app:subsec-lem-lin-vs-qplan}.

We now have all the pieces to argue the following, which formally states that our approximation algorithm implies that approximating the expected multiplicities of an SPJU query can be done in essentially the same runtime as deterministic query processing of the same query:
\begin{Corollary}
Given an SPJU query $Q$ for a TIDB, we can compute a $(1\pm\eps)$-approximation to the expected multiplicity of each output tuple with probability at least $1-\delta$ in time $O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)$.
\end{Corollary}
\begin{proof}
This follows from~\Cref{lem:circuits-model-runtime} and the lineage circuit counterpart (see~\Cref{sec:results-circuits}) of~\Cref{cor:approx-algo-const-p}, where the latter is used with $\delta$ substituted\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple. To get the required guarantee for all (at most $n^k$) output tuples of $Q$, we allow a failure probability of at most $\frac \delta{n^k}$ for each output tuple and then apply a union bound over all output tuples.} by $\frac \delta{n^k}$.
\end{proof}
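
To see where the $\log(n)$ factor comes from, here is a sketch of the calculation. It assumes that the runtime in~\Cref{cor:approx-algo-const-p} depends on the failure probability only through a factor of $\log\frac{1}{\delta}$, and it assumes base-$2$ logarithms, $\delta \le \frac 12$, and $n \ge 2$, so that both logarithms below are at least $1$:
\[
\log\frac{1}{\delta/n^k} = \log\frac{1}{\delta} + k\log(n) \le (k+1)\cdot\log\frac{1}{\delta}\cdot\log(n).
\]
The factor $(k+1)$ is absorbed by the $O_k(\cdot)$ notation.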

\subsection{Higher moments}
\label{sec:momemts}

We make a simple observation to conclude the presentation of our results. So far we have presented algorithms that, given $\poly$, approximate its expectation. In addition, we would like to, e.g., prove bounds on the probability of the multiplicity being at least $1$. While we do not have a good approximation algorithm for this problem, we can make some progress as follows. We first note that for any positive integer $m$ we can compute the expectation of $\poly^m$ (since this only changes the degree of the corresponding lineage polynomial by a factor of $m$). In other words, we can compute the $m$-th moment of the multiplicities as well. This allows us, e.g., to use Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in. However, we leave the question of coming up with better approximation algorithms for proving probability bounds for future work.
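
As one way to instantiate this (a sketch that treats the first two moments as exact and ignores the approximation error in their estimates; we write $\mu$ for the expectation of $\poly$ and assume $\mu > 0$), Chebyshev's inequality gives
\[
\Pr\left[\poly = 0\right] \le \Pr\left[\,\abs{\poly - \mu} \ge \mu\,\right] \le \frac{\mathbb{E}[\poly^2] - \mu^2}{\mu^2}.
\]
Since multiplicities are non-negative integers, this yields $\Pr[\poly \ge 1] \ge 2 - \frac{\mathbb{E}[\poly^2]}{\mu^2}$, a lower bound that is non-trivial whenever the second moment is less than $2\mu^2$.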