%!TEX root=./main.tex
\section{More on Circuits and Moments}\label{sec:gen}
We formalize our claim from \Cref{sec:intro} that a linear-time approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered in the same runtime as deterministic queries, under reasonable assumptions.
We then generalize our result for expectation to higher moments.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\revision{
%\subsection{Cost Model, Query Plans, and Runtime}
%As in the introduction, we could consider polynomials to be represented as an expression tree.
%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
%}
%
%\label{sec:circuits}
\subsection{The Cost Model}
\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
We now show that this model corresponds to the behavior of a deterministic database: for any UCQ query $\poly$ and \bi $\pxdb$, we can construct a compressed lineage polynomial for $\poly$ whose size (and construction time) is linear in the runtime of a general class of query processing algorithms for the same query $\poly$ over a deterministic database $\db$.
We assume a linear relationship between the input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db}$).%
\footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
%
\noindent\resizebox{1\linewidth}{!}{
\begin{minipage}{1.0\linewidth}
\begin{align*}
\qruntime{R,D} & = |R| &
\qruntime{\sigma Q, D} & = \qruntime{Q,D} &
\qruntime{\pi Q, D} & = \qruntime{Q,D} + \abs{Q(D)}
\end{align*}\\[-15mm]
\begin{align*}
\qruntime{Q \cup Q', D} & = \qruntime{Q, D} + \qruntime{Q', D} +\abs{Q(D)}+\abs{Q'(D)} \\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n, D} & = \qruntime{Q_1, D} + \ldots + \qruntime{Q_n,D} + \abs{Q_1(D) \bowtie \ldots \bowtie Q_n(D)}
\end{align*}
\end{minipage}
}\\
Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
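For example, for the illustrative query $\pi_A(R \bowtie S)$ (used here only to unroll the rules above), the model charges for the two base relation scans, the join result, and the input to the projection:
\begin{align*}
\qruntime{\pi_A(R \bowtie S), D} & = \qruntime{R \bowtie S, D} + \abs{(R \bowtie S)(D)}\\
& = \abs{R} + \abs{S} + 2\cdot\abs{R(D) \bowtie S(D)},
\end{align*}
i.e., the cost is linear in the sizes of the inputs and of the intermediate join result.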
2020-12-14 23:21:03 -05:00
It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}), can be modeled as select-project-join-union (SPJU) queries (though these queries can be data dependent).\footnote{This claim can be verified by, e.g., inspecting the {\em Generic-Join} algorithm in~\cite{skew} and the {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model correctly captures the runtime of these algorithms on the corresponding SPJU queries.
%We now make a simple observation on the above cost model:
%\begin{proposition}
%\label{prop:queries-need-to-output-tuples}
%The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
%\end{proposition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}
Let $Q$ be an SPJU query over a \ti $\pxdb$, and let $\db_{max}$ denote the world containing all tuples of $\pxdb$. We can compute a $(1\pm\eps)$-approximation of the expected multiplicity of each output tuple in $\query(\pxdb)$, with probability at least $1-\delta$, in time
%
\[
O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\db_{max}}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
\]
\end{Corollary}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}
This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime}) and \Cref{cor:approx-algo-const-p}, where the latter is applied with $\delta$ replaced\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple. To get the required guarantee for all (at most $n^k$) output tuples of $Q$, we allow a failure probability of at most $\frac{\delta}{n^k}$ per output tuple and then take a union bound over all output tuples.} by $\frac{\delta}{n^k}$.
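Written out, the union bound over the at most $n^k$ output tuples gives
\[
\Pr\left[\text{some output tuple's estimate is not a $(1\pm\eps)$-approximation}\right] \;\leq\; n^k \cdot \frac{\delta}{n^k} \;=\; \delta.
\]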
\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2020-12-16 21:34:26 -05:00
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Higher Moments}
\label{sec:momemts}
We make a simple observation to conclude the presentation of our results.
So far we have only focused on the expectation of $\poly$. In addition, we might, for example, want to prove bounds on the probability of the multiplicity being at least $1$. Progress can be made on this as follows:
for any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us, e.g., to use Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
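For example (writing $X$ for the random multiplicity of a fixed output tuple and $\mathrm{E}[\cdot]$ for its expectation, notation introduced here only for illustration), Chebyshev's inequality uses only the first two moments:
\[
\Pr\bigl[\,\abs{X - \mathrm{E}[X]} \geq a\,\bigr] \;\leq\; \frac{\mathrm{E}[X^2] - \mathrm{E}[X]^2}{a^2} \qquad \text{for any } a > 0.
\]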
We leave questions like this for future work.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: