paper-BagRelationalPDBsAreHard/circuits-model-runtime.tex

%!TEX root=./main.tex

\section{More on Circuits and Moments}\label{sec:gen}
We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
Lastly, we generalize our result for expectation to other moments.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%\revision{
%\subsection{Cost Model, Query Plans, and Runtime}
%As in the introduction, we could consider polynomials to be represented as an expression tree.
%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
%}
%
%\label{sec:circuits}

\mypar{The cost model}
%\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on a deterministic database $\db$.
We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.

\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
%
\noindent\resizebox{1\linewidth}{!}{
\begin{minipage}{1.0\linewidth}
  \begin{align*}
\qruntime{R,D}                               & = |R|                                                        &
                                                                                                              \qruntime{\sigma Q, D}                       & = \qruntime{Q,D}                                             &
                                                                                                                                                                                                                            \qruntime{\pi Q, D}                          & = \qruntime{Q,D} + \abs{Q(D)}
  \end{align*}\\[-15mm]
\begin{align*}
\qruntime{Q \cup Q', D}                      & = \qruntime{Q, D} + \qruntime{Q', D} +\abs{Q(D)}+\abs{Q'(D)} \\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n, D} & = \qruntime{Q_1, D} + \ldots + \qruntime{Q_n,D} + \abs{Q_1(D) \bowtie \ldots \bowtie Q_n(D)}
\end{align*}
\end{minipage}
}\\

Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.

It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
%
%We now make a simple observation on the above cost model:
%\begin{proposition}
%\label{prop:queries-need-to-output-tuples}
%The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
%\end{proposition}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
We are now ready to formally state our claim from \Cref{sec:intro}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}
  Given an SPJU query $Q$ over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
%
  \[
    O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\db_{max}}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
    \]
\end{Corollary}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}
This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that \Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).
\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Higher Moments}
%\label{sec:momemts}
%
We make a simple observation to conclude the presentation of our results.
So far we have only focused on the expectation of $\poly$.
In addition, we could e.g. prove bounds of the probability of a tuple's multiplicity being at least $1$.
Progress can be made on this as follows:
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use the Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
We leave further investigations for future work.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
Circuits model runtime 2020-12-14 23:21:03 -05:00			`%!TEX root=./main.tex`
Changes from 021221 meeting implemented. 2021-02-12 14:20:50 -05:00
Finished pass on Intro (Aaron) 2021-04-06 16:11:19 -04:00			`\section{More on Circuits and Moments}\label{sec:gen}`
Minor adjustments 2021-04-10 00:19:16 -04:00			`We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.`
Trimmed Section 5 (Aaron) 1st pass. 2021-04-08 15:02:40 -04:00			`Lastly, we generalize our result for expectation to other moments.`
Finally, started my pass on Sec 5 2020-12-16 21:34:26 -05:00
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Circuits model runtime 2020-12-14 23:21:03 -05:00
Still working on Sec 5 2020-12-17 00:02:07 -05:00
Trimmed Section 5 (Aaron) 1st pass. 2021-04-08 15:02:40 -04:00			`%\revision{`
			`%\subsection{Cost Model, Query Plans, and Runtime}`
			`%As in the introduction, we could consider polynomials to be represented as an expression tree.`
			%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
			`%}`
Trimming for space 2020-12-19 12:59:27 -05:00			`%`
Trimmed Section 5 (Aaron) 1st pass. 2021-04-08 15:02:40 -04:00			`%\label{sec:circuits}`
Still working on Sec 5 2020-12-17 00:02:07 -05:00
Done with my final pass 2021-04-09 00:14:02 -04:00			`\mypar{The cost model}`
			`%\label{sec:cost-model}`
			`So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.`
More changes to notation, etc. 2021-06-11 11:22:58 -04:00			`We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on a deterministic database $\db$.`
Tweaking S5, and trimming back down to 12 2020-12-20 18:38:59 -05:00			`We assume a linear relationship between input sizes $\|\pxdb\|$ and $\|\db\|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).`
Trimmed Section 5 (Aaron) 1st pass. 2021-04-08 15:02:40 -04:00			`\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.`
			`In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}`
Tweaking S5, and trimming back down to 12 2020-12-20 18:38:59 -05:00			`%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).`
circuits 2020-12-20 17:19:07 -05:00			`% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.`
			`We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.`
Trimmed Section 5 (Aaron) 1st pass. 2021-04-08 15:02:40 -04:00
Done with my pass on Sec 5 2020-12-17 01:32:08 -05:00			`\newcommand{\qruntime}[1]{\textbf{cost}(#1)}`
circuits 2020-12-20 14:43:03 -05:00			`%`
			`\noindent\resizebox{1\linewidth}{!}{`
			`\begin{minipage}{1.0\linewidth}`
shorten 2021-04-08 22:30:03 -04:00			`\begin{align*}`
			`\qruntime{R,D} & = \|R\| &`
			`\qruntime{\sigma Q, D} & = \qruntime{Q,D} &`
			`\qruntime{\pi Q, D} & = \qruntime{Q,D} + \abs{Q(D)}`
			`\end{align*}\\[-15mm]`
Circuits model runtime 2020-12-14 23:21:03 -05:00			`\begin{align*}`
circuits 2020-12-20 14:43:03 -05:00			`\qruntime{Q \cup Q', D} & = \qruntime{Q, D} + \qruntime{Q', D} +\abs{Q(D)}+\abs{Q'(D)} \\`
			`\qruntime{Q_1 \bowtie \ldots \bowtie Q_n, D} & = \qruntime{Q_1, D} + \ldots + \qruntime{Q_n,D} + \abs{Q_1(D) \bowtie \ldots \bowtie Q_n(D)}`
Circuits model runtime 2020-12-14 23:21:03 -05:00			`\end{align*}`
circuits 2020-12-20 14:43:03 -05:00			`\end{minipage}`
			`}\\`

			`Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.`
			`We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.`
Circuits model runtime 2020-12-14 23:21:03 -05:00
updates 2021-04-10 14:35:38 -04:00			It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db}\AR{See my comment on element on whether we should include this ref or not.} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}) can be modeled as select-union-project-join queries (though the size of these queries is data dependent).\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding SPJU join queries correctly captures their runtime.
			`%`
Done with my pass on Sec 5 2020-12-17 01:32:08 -05:00			`%We now make a simple observation on the above cost model:`
			`%\begin{proposition}`
			`%\label{prop:queries-need-to-output-tuples}`
			`%The runtime $\qruntime{Q}$ of any query $Q$ is at least $\|Q\|$`
			`%\end{proposition}`
updates 2021-04-10 14:35:38 -04:00			`%`
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
updates 2021-04-10 14:35:38 -04:00			`%`
Done with my final pass 2021-04-09 00:14:02 -04:00			`We are now ready to formally state our claim from \Cref{sec:intro}:`
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Done with my pass on Sec 5 2020-12-17 01:32:08 -05:00			`\begin{Corollary}`
small 2020-12-20 17:27:12 -05:00			`Given an SPJU query $Q$ over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time`
circuits 2020-12-20 14:43:03 -05:00			`%`
			`\[`
circuits 2020-12-20 17:19:07 -05:00			`O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\db_{max}}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)`
One cite over still 2020-12-20 18:54:40 -05:00			`\]`
Done with my pass on Sec 5 2020-12-17 01:32:08 -05:00			`\end{Corollary}`
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Done with my pass on Sec 5 2020-12-17 01:32:08 -05:00			`\begin{proof}`
cref 2021-04-10 09:48:26 -04:00			`This follows from \Cref{lem:circuits-model-runtime} (\Cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p} (where the latter is used with $\delta$ being substituted\footnote{Recall that \Cref{cor:approx-algo-const-p} is stated for a single output tuple so to get the required guarantee for all (at most $n^k$) output tuples of $Q$ we get at most $\frac \delta{n^k}$ probability of failure for each output tuple and then just a union bound over all output tuples. } with $\frac \delta{n^k}$).`
Circuits model runtime 2020-12-14 23:21:03 -05:00			`\end{proof}`
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Finally, started my pass on Sec 5 2020-12-16 21:34:26 -05:00
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Done with my final pass 2021-04-09 00:14:02 -04:00			`\mypar{Higher Moments}`
			`%\label{sec:momemts}`
			`%`
circuits 2020-12-20 01:16:52 -05:00			`We make a simple observation to conclude the presentation of our results.`
updates 2021-04-10 14:35:38 -04:00			`So far we have only focused on the expectation of $\poly$.`
			`In addition, we could e.g. prove bounds of the probability of a tuple's multiplicity being at least $1$.`
Minor adjustments 2021-04-10 00:19:16 -04:00			`Progress can be made on this as follows:`
updates 2021-04-10 14:35:38 -04:00			`For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use the Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.`
Minor adjustments 2021-04-10 00:19:16 -04:00			`We leave further investigations for future work.`
circuits 2020-12-20 01:16:52 -05:00
			`%%% Local Variables:`
			`%%% mode: latex`
			`%%% TeX-master: "main"`
			`%%% End:`