
%!TEX root=./main.tex
\revision{
\section{More on Circuits and Moments}
}
\label{sec:gen}
%In this section, we consider generalizations/corollaries of our results.
%In particular, in~\Cref{sec:circuits} we first consider the case when the compressed  polynomial is represented by a Directed Acyclic Graph (DAG) instead of an expression tree (\Cref{def:express-tree}) and observe that our results carry over.
%Then, 
We formalize our claim from \Cref{sec:intro} that a linear-time algorithm for our problem implies that PDB queries can be answered in the same runtime as deterministic queries under reasonable assumptions.
Finally, in~\Cref{sec:momemts}, we generalize our result for expectation to other moments.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\revision{
\subsection{Cost Model, Query Plans, and Runtime}
}

\label{sec:circuits}


%In~\Cref{sec:results-circuits} we argue why results from earlier sections also hold for circuits and then 
We argue why circuits capture the runtime of well-known query processing algorithms in~\Cref{sec:circuit-runtime} (\Cref{sec:cost-model} formalizes the query cost model).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsubsection{Extending our results to circuits}
%\label{sec:results-circuits}
%
%We first note that since expression trees are a special case of linear circuits, all of our hardness results for  in~\Cref{sec:hard} are still valid for the latter.
%
%Observe that \textsc{Approx}\textsc{imate}$\rpoly$ (\Cref{alg:mon-sam} in \Cref{sec:algo}) works for circuits as long as the same guarantees on $\onepass$ and $\sampmon$ (\Cref{lem:one-pass} and \Cref{lem:sample} respectively) hold for circuits as well.
%It turns out that this is the case, simply because both algorithms rely on only one property of expression trees: that each node has two children;
%Analogously in a circuit, each node has a maximum in-degree of two.
%Put another way, our argument never used the fact that in an expression tree, each node has at most one parent.
%%
%For a more detailed discussion of why~\Cref{lem:approx-alg} holds for a circuit, see~\Cref{app:lineage-circuit-ext}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{The cost model}
\label{sec:cost-model}
So far our analysis of $\approxq$ has been in terms of the size of the compressed lineage polynomial.
We now show that this model corresponds to the behavior of a deterministic database by proving that for any union of conjunctive queries, we can construct a compressed lineage polynomial for a query $Q$ and \bi $\pxdb$ of size (and in runtime) linear in the runtime of a general class of query processing algorithms for the same query $Q$ on a deterministic database $\db$.
We assume a linear relationship between input sizes $\abs{\pxdb}$ and $\abs{\db}$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db}$).
This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).
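For instance (a hypothetical \bi used only for illustration), if $\pxdb$ consists of $b$ blocks with at most five alternatives each, then $\abs{\pxdb} \leq 5b$; assuming there is a possible world $\db$ that contains exactly one alternative from every block, we have $\abs{\db} = b$, and hence the condition holds with $c = 5$.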
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
We adopt a minimalistic compute-bound model of query evaluation drawn from the worst-case optimal join literature~\cite{skew,ngo-survey}.
\newcommand{\qruntime}[1]{\textbf{cost}(#1)}
%
\noindent\resizebox{1\linewidth}{!}{
\begin{minipage}{1.0\linewidth}
\begin{align*}
\qruntime{R,D}                               & = |R|                                                        \\
\qruntime{\sigma Q, D}                       & = \qruntime{Q,D}                                             \\
\qruntime{\pi Q, D}                          & = \qruntime{Q,D} + \abs{Q(D)}                                \\
\qruntime{Q \cup Q', D}                      & = \qruntime{Q, D} + \qruntime{Q', D} +\abs{Q(D)}+\abs{Q'(D)} \\
\qruntime{Q_1 \bowtie \ldots \bowtie Q_n, D} & = \qruntime{Q_1, D} + \ldots + \qruntime{Q_n,D} + \abs{Q_1(D) \bowtie \ldots \bowtie Q_n(D)}
\end{align*}
\end{minipage}
}\\

Under this model a query $Q$ evaluated over database $D$ has runtime $O(\qruntime{Q,D})$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
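
\noindent As a small illustration of the cost model, consider the plan $\pi_A(R \bowtie S)$; the relations $R$ and $S$ and the sizes used below are hypothetical and serve only as a sketch of how the rules compose. Unfolding the definition gives
\begin{align*}
\qruntime{\pi_A(R \bowtie S), D} &= \qruntime{R \bowtie S, D} + \abs{R \bowtie S}\\
&= \qruntime{R,D} + \qruntime{S,D} + \abs{R \bowtie S} + \abs{R \bowtie S}\\
&= \abs{R} + \abs{S} + 2\abs{R \bowtie S}.
\end{align*}
For example, with $\abs{R} = 100$, $\abs{S} = 50$, and $\abs{R \bowtie S} = 200$, the cost is $100 + 50 + 2\cdot 200 = 550$. The join result is charged twice: once for producing it and once as the input of the projection.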

It can be verified that worst-case optimal join algorithms~\cite{skew,ngo-survey}, as well as query evaluation via factorized databases~\cite{factorized-db} (and work on FAQs~\cite{DBLP:conf/pods/KhamisNR16}), can be modeled as select-project-join-union (SPJU) queries (though these queries can be data dependent).\footnote{This claim can be verified, e.g., by inspecting the {\em Generic-Join} algorithm in~\cite{skew} and the {\em factorize} algorithm in~\cite{factorized-db}.} Further, it can be verified that the above cost model, applied to the corresponding SPJU queries, correctly captures their runtime.

%We now make a simple observation on the above cost model:
%\begin{proposition}
%\label{prop:queries-need-to-output-tuples}
%The runtime $\qruntime{Q}$ of any query $Q$ is at least $|Q|$
%\end{proposition}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Circuits for query plans}
\label{sec:circuits-formal}
We now formalize circuits and the construction of circuits for SPJU queries.
As mentioned earlier, we represent lineage polynomials as arithmetic circuits over $\mathbb N$-valued variables with $+$ and $\times$ gates.
A circuit for query $Q$ and $\semNX$-PDB $\pxdb$ is a directed acyclic graph $\tuple{V_{Q,\pxdb}, E_{Q,\pxdb}, \phi_{Q,\pxdb}, \ell_{Q,\pxdb}}$ with vertices $V_{Q,\pxdb}$ and directed edges $E_{Q,\pxdb} \subset {V_{Q,\pxdb}}^2$.
The sink function $\phi_{Q,\pxdb} : \udom^n \rightarrow V_{Q,\pxdb}$ is a partial function that maps the tuples of the $n$-ary relation $Q(\pxdb)$ to vertices.
We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vertices with out-degree 0).
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
We require that vertices have an in-degree of at most two.
%
For the specifics on how to construct a  circuit to encode the polynomials of all result tuples for a query and $\semNX$-PDB see \Cref{app:subsec-rep-poly-lin-circ}. Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.
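
\noindent To illustrate the definition, the following sketch describes one possible circuit for the polynomial $X_1 X_2 + X_1 X_3$; the vertex names $v_1,\ldots,v_6$, the variables $X_1, X_2, X_3 \in \vct X$, and the output tuple $t$ are hypothetical and are not taken from the construction in the appendix.
\begin{align*}
V &= \{v_1, \ldots, v_6\},\\
E &= \{(v_1,v_4), (v_2,v_4), (v_1,v_5), (v_3,v_5), (v_4,v_6), (v_5,v_6)\},\\
\ell(v_1) &= X_1,\quad \ell(v_2) = X_2,\quad \ell(v_3) = X_3,\\
\ell(v_4) &= \ell(v_5) = \times,\quad \ell(v_6) = +,\\
\phi(t) &= v_6.
\end{align*}
The sources $v_1, v_2, v_3$ have in-degree 0, every other vertex has in-degree exactly two, and $v_6$ is the only sink. Because $v_1$ feeds both multiplication vertices, the circuit has six vertices, whereas an expression tree for the same polynomial would need seven, since it has to duplicate the leaf for $X_1$.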

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Circuit size vs. runtime}
\label{sec:circuit-runtime}

\newcommand{\bagdbof}{\textsc{bag}(\pxdb)}

We now connect the size of a circuit (where the size of a circuit is the number of vertices in the corresponding DAG) %\footnote{since each node has indegree at most two, this also is the same up to constants to counting the number of edges in the DAG.})
 for a given SPJU query $Q$ and $\semNX$-PDB $\pxdb$ to its $\qruntime{Q,\db}$ where $\db$ is one of the possible worlds of $\pxdb$. We do this formally by showing that the size of the circuit is asymptotically no worse than the corresponding runtime of a large class of deterministic query processing algorithms.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{lemma}
\label{lem:circuits-model-runtime}
Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the size of the lineage circuit of $Q(\pxdb)$ is asymptotically no larger than the runtime of $Q$ over $\bagdbof$. That is, we have $\abs{V_{Q,\pxdb}} \leq (k-1)\qruntime{Q,\bagdbof}$, where $k$ is the maximal degree of any polynomial in $Q(\pxdb)$.
\end{lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent The proof is shown in~\Cref{app:subsec-lem-lin-vs-qplan}.
We now have all the pieces to argue that, using our approximation algorithm, the expected multiplicities of an SPJU query can be computed in essentially the same runtime as deterministic query processing for the same query:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}
  Let $Q$ be an SPJU query over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$. Then we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
%
  \[
    O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\db_{max}}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
    \]
\end{Corollary}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}
This follows from~\Cref{lem:circuits-model-runtime} and the circuit counterpart of~\Cref{cor:approx-algo-const-p} (see~\Cref{sec:results-circuits}), where the latter is used with $\delta$ substituted\footnote{Recall that~\Cref{cor:approx-algo-const-p} is stated for a single output tuple. To obtain the required guarantee for all (at most $n^k$) output tuples of $Q$, we allow a failure probability of at most $\frac \delta{n^k}$ for each output tuple and then apply a union bound over all output tuples.} by $\frac \delta{n^k}$.
\end{proof}
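
\noindent The substitution can be spelled out as follows; this is only a sketch, and it assumes that the runtime in \Cref{cor:approx-algo-const-p} depends on the confidence parameter only through a $\log\frac{1}{\delta}$ factor. If each of the at most $n^k$ output tuples $t$ fails with probability at most $\frac{\delta}{n^k}$, then by the union bound
\begin{align*}
\Pr\left[\text{some output tuple fails}\right] \;\leq\; \sum_{t} \Pr\left[t \text{ fails}\right] \;\leq\; n^k \cdot \frac{\delta}{n^k} \;=\; \delta,
\end{align*}
and the logarithmic factor becomes
\begin{align*}
\log\frac{n^k}{\delta} \;=\; k\log(n) + \log\frac{1}{\delta} \;\leq\; (k+1)\cdot\log\frac{1}{\delta}\cdot\log(n)
\end{align*}
whenever $\log(n) \geq 1$ and $\log\frac{1}{\delta} \geq 1$, which is absorbed by the $O_k(\cdot)$ notation for constant $k$.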
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Higher Moments}
\label{sec:momemts}

We make a simple observation to conclude the presentation of our results.
So far we have focused on the expectation of $\poly$.
In addition, we could, e.g., prove bounds on the probability of the multiplicity being at least $1$.
While we do not have a good approximation algorithm for this problem, we can make some progress as follows:
For any positive integer $m$ we can compute the expectation of $\poly^m$ (which only changes the degree of the corresponding lineage polynomial by a factor of $m$).
In other words, we can compute the $m$-th moment of the multiplicities, allowing us, e.g., to use Chebyshev's inequality or other higher-moment-based probability bounds on the events we might be interested in.
However, we leave the question of coming up with more accurate approximation algorithms for future work.
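
\noindent As one hedged illustration of how the first two moments could be used (the shorthand $\mu = \mathbb{E}[\poly]$ and $\sigma^2 = \mathbb{E}[\poly^2] - \mu^2$ is introduced here only for this sketch), recall that the multiplicity is a non-negative integer. Markov's inequality then gives an upper bound, and Chebyshev's inequality gives a lower bound when $\mu > 0$:
\begin{align*}
\Pr[\poly \geq 1] &\leq \mu, &
\Pr[\poly \geq 1] &= 1 - \Pr[\poly = 0] \geq 1 - \Pr\left[\abs{\poly - \mu} \geq \mu\right] \geq 1 - \frac{\sigma^2}{\mu^2}.
\end{align*}
These bounds are stated for exact moments; how approximation error in the moments propagates to the bounds is not analyzed here.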

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: