paper-BagRelationalPDBsAreHard/circuits-model-runtime.tex

%!TEX root= prob-def.tex

\subsection{Deterministic Query Runtimes}\label{sec:gen}
%We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
%Lastly, we generalize our result for expectation to other moments.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%\revision{
%\subsection{Cost Model, Query Plans, and Runtime}
%As in the introduction, we could consider polynomials to be represented as an expression tree.
%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
%}
%
%\label{sec:circuits}

%\mypar{The cost model}
%\label{sec:cost-model}
%So far our analysis of \Cref{prob:intro-stmt} has been in terms of the size of the lineage circuits.
%We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$.  
% Note that by definition, there exists a linear relationship between input sizes $|\pxdb|$ and $|\dbbase|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
% \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
% In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).
% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.
In~\Cref{sec:intro}, we introduced the function $T_{det}\inparen{\cdot}$ to analyze the runtime complexity of~\Cref{prob:expect-mult}.
To decouple our results from any specific join algorithm, we first lower bound the cost of a join.

\begin{Definition}[Join Cost]
\label{def:join-cost}
Denote by $\jointime{R_1, \ldots, R_m}$ the runtime of an algorithm for computing the $m$-ary join $R_1 \bowtie \ldots \bowtie R_m$.
We require only that the algorithm must enumerate its output, i.e., that $\jointime{R_1, \ldots, R_m} \geq |R_1 \bowtie \ldots \bowtie R_m|$.  %With this definition of $\jointime{\cdot}$, worst-case optimal join algorithms are handled.
\end{Definition}

Worst-case optimal join algorithms~\cite{skew,ngo-survey} and query evaluation via factorized databases~\cite{factorized-db} (as well as work on FAQs~\cite{DBLP:conf/pods/KhamisNR16} and AJAR~\cite{ajar}) can be modeled as $\raPlus$ queries (though the query size is data dependent).
For these algorithms, $\jointime{R_1, \ldots, R_n}$ is linear in the {\em AGM bound}~\cite{AGM}.
% = |R_1| + \ldots + |R_n| + |R_1(\db) \bowtie \ldots \bowtie R_n(\db)|$.  
 Our cost model for general query evaluation follows:
%\noindent\resizebox{1\linewidth}{!}{
%\begin{minipage}{1.0\linewidth}
{\small
\begin{flalign*}
&\begin{aligned}
    &\qruntimenoopt{R,\gentupset,\bound} = |\gentupset.R| 
    & 
    &\qquad\qquad\qruntimenoopt{\sigma \query, \gentupset,\bound} = \qruntimenoopt{\query,\gentupset,\bound}\\
\end{aligned}&\\
%\vspace{-.6cm}
 &\qruntimenoopt{\pi \query, \gentupset,\bound} = \qruntimenoopt{\query,\gentupset,\bound} +\abs{\query(\gentupset)}&\\
 &\qruntimenoopt{\query \cup \query', \gentupset,\bound} = \qruntimenoopt{\query, \gentupset,\bound} + \qruntimenoopt{\query', \gentupset,\bound} + \abs{\query\inparen{\gentupset}}+\abs{\query'\inparen{\gentupset}}& \\
&\qruntimenoopt{\query_1 \bowtie \ldots \bowtie \query_m, \gentupset,\bound} = &\\
&\qquad\qruntimenoopt{\query_1, \gentupset,\bound} + \ldots + \qruntimenoopt{\query_m,\gentupset,\bound} + 
       \jointime{\query_1(\gentupset), \ldots, \query_m(\gentupset)}&\\
\end{flalign*}
}
\vspace{-0.93cm}
%\begin{align*}
%	&\qruntimenoopt{\query_1 \bowtie \ldots \bowtie \query_m, \gentupset,\bound} = \\
%	&\qquad\qruntimenoopt{\query_1, \gentupset,\bound} + \ldots + \qruntimenoopt{\query_m,\gentupset,\bound} + 
%       \jointime{\query_1(\gentupset), \ldots, \query_m(\gentupset)}
%\end{align*}\\%[-10mm]
%\end{minipage}
%}\\

\noindent
Under this model, an $\raPlus$ query $\query$ evaluated over over any deterministic database 
%\dbbaseName 
that maps each tuple in $\gentupset$ to a multiplicity in $[0,\bound]$ %database $\gentupset$
has runtime $O\inparen{\qruntimenoopt{Q,\gentupset, \bound}}$.
We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.
%Observe that 
% () .\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures the runtime of current best known .

\Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and \dbbaseName $\tupset$, there exists a circuit $\circuit^*$ such that $\timeOf{\abbrStepOne}(Q,\tupset,\circuit^*)$ and $|\circuit^*|$ are both $O(\qruntimenoopt{\optquery{\query}, \tupset,\bound})$, as we assumed when moving from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}.  Lastly, we can handle FAQs/AJAR queries and factorized databases by allowing for optimization. %, i.e. $\qruntimenoopt{\optquery{\query}, \gentupset, \bound}$.
%
%We now make a simple observation on the above cost model:
%\begin{proposition}
%\label{prop:queries-need-to-output-tuples}
%The runtime $\qruntimenoopt{Q}$ of any query $Q$ is at least $|Q|$
%\end{proposition}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%We are now ready to formally state our claim with respect to \Cref{prob:intro-stmt}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Corollary}\label{cor:cost-model}
%  Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time
%
%  \[
%    O_k\left(\frac 1{\eps^2}\cdot\qruntimenoopt{Q,\dbbase}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
%    \]
%\end{Corollary}
%Atri: The above is no longer needed
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
Addressing the deterministic database issue. 2022-06-07 10:10:19 -04:00			`%!TEX root= prob-def.tex`
Changes from 021221 meeting implemented. 2021-02-12 14:20:50 -05:00
Update on Overleaf. 2022-06-03 13:51:40 -04:00			`\subsection{Deterministic Query Runtimes}\label{sec:gen}`
Done with pass on S2 2021-09-18 00:55:37 -04:00			`%We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.`
			`%Lastly, we generalize our result for expectation to other moments.`
Finally, started my pass on Sec 5 2020-12-16 21:34:26 -05:00
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Circuits model runtime 2020-12-14 23:21:03 -05:00
Still working on Sec 5 2020-12-17 00:02:07 -05:00
Trimmed Section 5 (Aaron) 1st pass. 2021-04-08 15:02:40 -04:00			`%\revision{`
			`%\subsection{Cost Model, Query Plans, and Runtime}`
			`%As in the introduction, we could consider polynomials to be represented as an expression tree.`
			%However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well).
			`%}`
Trimming for space 2020-12-19 12:59:27 -05:00			`%`
Trimmed Section 5 (Aaron) 1st pass. 2021-04-08 15:02:40 -04:00			`%\label{sec:circuits}`
Still working on Sec 5 2020-12-17 00:02:07 -05:00
Done with pass on S2 2021-09-18 00:55:37 -04:00			`%\mypar{The cost model}`
Done with my final pass 2021-04-09 00:14:02 -04:00			`%\label{sec:cost-model}`
Done with pass on S2 2021-09-18 00:55:37 -04:00			`%So far our analysis of \Cref{prob:intro-stmt} has been in terms of the size of the lineage circuits.`
			`%We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$.`
Appendix D cleanup, and starting to think about the intro 2021-09-08 22:32:30 -04:00			`% Note that by definition, there exists a linear relationship between input sizes $\|\pxdb\|$ and $\|\dbbase\|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).`
			`% \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.`
			`% In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}`
Tweaking S5, and trimming back down to 12 2020-12-20 18:38:59 -05:00			`%That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim).`
circuits 2020-12-20 17:19:07 -05:00			`% with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB.`
Update on Overleaf. 2022-06-07 23:26:24 -04:00			`In~\Cref{sec:intro}, we introduced the function $T_{det}\inparen{\cdot}$ to analyze the runtime complexity of~\Cref{prob:expect-mult}.`
Minor cleaning 2022-06-05 13:15:51 -04:00			`To decouple our results from any specific join algorithm, we first lower bound the cost of a join.`
Bounding TLC 2021-09-18 21:09:17 -04:00
			`\begin{Definition}[Join Cost]`
			`\label{def:join-cost}`
Done with S2 pass 2021-09-20 18:04:04 -04:00			`Denote by $\jointime{R_1, \ldots, R_m}$ the runtime of an algorithm for computing the $m$-ary join $R_1 \bowtie \ldots \bowtie R_m$.`
Update on Overleaf. 2022-06-02 00:11:18 -04:00			`We require only that the algorithm must enumerate its output, i.e., that $\jointime{R_1, \ldots, R_m} \geq \|R_1 \bowtie \ldots \bowtie R_m\|$. %With this definition of $\jointime{\cdot}$, worst-case optimal join algorithms are handled.`
Bounding TLC 2021-09-18 21:09:17 -04:00			`\end{Definition}`

Update on Overleaf. 2022-06-02 00:11:18 -04:00			`Worst-case optimal join algorithms~\cite{skew,ngo-survey} and query evaluation via factorized databases~\cite{factorized-db} (as well as work on FAQs~\cite{DBLP:conf/pods/KhamisNR16} and AJAR~\cite{ajar}) can be modeled as $\raPlus$ queries (though the query size is data dependent).`
Done with pass on S2.4 2021-09-18 23:17:00 -04:00			`For these algorithms, $\jointime{R_1, \ldots, R_n}$ is linear in the {\em AGM bound}~\cite{AGM}.`
			`% = \|R_1\| + \ldots + \|R_n\| + \|R_1(\db) \bowtie \ldots \bowtie R_n(\db)\|$.`
Trimming. Down by ~1 column so far. 2022-05-18 17:51:12 -04:00			`Our cost model for general query evaluation follows:`
Almost done fixing layout issues with PODs format. 2022-03-15 11:21:11 -04:00			`%\noindent\resizebox{1\linewidth}{!}{`
			`%\begin{minipage}{1.0\linewidth}`
Trimming. Down by ~1 column so far. 2022-05-18 17:51:12 -04:00			`{\small`
Almost done fixing layout issues with PODs format. 2022-03-15 11:21:11 -04:00			`\begin{flalign*}`
			`&\begin{aligned}`
			`&\qruntimenoopt{R,\gentupset,\bound} = \|\gentupset.R\|`
			`&`
Small changes, fixing bugs and typos. 2022-05-17 10:55:17 -04:00			`&\qquad\qquad\qruntimenoopt{\sigma \query, \gentupset,\bound} = \qruntimenoopt{\query,\gentupset,\bound}\\`
Almost done fixing layout issues with PODs format. 2022-03-15 11:21:11 -04:00			`\end{aligned}&\\`
			`%\vspace{-.6cm}`
			`&\qruntimenoopt{\pi \query, \gentupset,\bound} = \qruntimenoopt{\query,\gentupset,\bound} +\abs{\query(\gentupset)}&\\`
			`&\qruntimenoopt{\query \cup \query', \gentupset,\bound} = \qruntimenoopt{\query, \gentupset,\bound} + \qruntimenoopt{\query', \gentupset,\bound} + \abs{\query\inparen{\gentupset}}+\abs{\query'\inparen{\gentupset}}& \\`
			`&\qruntimenoopt{\query_1 \bowtie \ldots \bowtie \query_m, \gentupset,\bound} = &\\`
			`&\qquad\qruntimenoopt{\query_1, \gentupset,\bound} + \ldots + \qruntimenoopt{\query_m,\gentupset,\bound} +`
			`\jointime{\query_1(\gentupset), \ldots, \query_m(\gentupset)}&\\`
			`\end{flalign*}`
Trimming. Down by ~1 column so far. 2022-05-18 17:51:12 -04:00			`}`
Almost done fixing layout issues with PODs format. 2022-03-15 11:21:11 -04:00			`\vspace{-0.93cm}`
			`%\begin{align*}`
			`% &\qruntimenoopt{\query_1 \bowtie \ldots \bowtie \query_m, \gentupset,\bound} = \\`
			`% &\qquad\qruntimenoopt{\query_1, \gentupset,\bound} + \ldots + \qruntimenoopt{\query_m,\gentupset,\bound} +`
			`% \jointime{\query_1(\gentupset), \ldots, \query_m(\gentupset)}`
			`%\end{align*}\\%[-10mm]`
			`%\end{minipage}`
			`%}\\`
circuits 2020-12-20 14:43:03 -05:00
Trimming. Down by ~1 column so far. 2022-05-18 17:51:12 -04:00			`\noindent`
Update on Overleaf. 2022-06-09 20:27:28 -04:00			`Under this model, an $\raPlus$ query $\query$ evaluated over over any deterministic database`
			`%\dbbaseName`
			`that maps each tuple in $\gentupset$ to a multiplicity in $[0,\bound]$ %database $\gentupset$`
			`has runtime $O\inparen{\qruntimenoopt{Q,\gentupset, \bound}}$.`
Bounding TLC 2021-09-18 21:09:17 -04:00			`We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation.`
Done with pass on S2.4 2021-09-18 23:17:00 -04:00			`%Observe that`
Bounding TLC 2021-09-18 21:09:17 -04:00			`% () .\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures the runtime of current best known .`
Done with pass on S2 2021-09-18 00:55:37 -04:00
Update on Overleaf. 2022-06-09 20:27:28 -04:00			\Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and \dbbaseName $\tupset$, there exists a circuit $\circuit^$ such that $\timeOf{\abbrStepOne}(Q,\tupset,\circuit^)$ and $\|\circuit^*\|$ are both $O(\qruntimenoopt{\optquery{\query}, \tupset,\bound})$, as we assumed when moving from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}. Lastly, we can handle FAQs/AJAR queries and factorized databases by allowing for optimization. %, i.e. $\qruntimenoopt{\optquery{\query}, \gentupset, \bound}$.
updates 2021-04-10 14:35:38 -04:00			`%`
Done with my pass on Sec 5 2020-12-17 01:32:08 -05:00			`%We now make a simple observation on the above cost model:`
			`%\begin{proposition}`
			`%\label{prop:queries-need-to-output-tuples}`
Finished S.2 pass. 2022-02-08 16:39:14 -05:00			`%The runtime $\qruntimenoopt{Q}$ of any query $Q$ is at least $\|Q\|$`
Done with my pass on Sec 5 2020-12-17 01:32:08 -05:00			`%\end{proposition}`
updates 2021-04-10 14:35:38 -04:00			`%`
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
updates 2021-04-10 14:35:38 -04:00			`%`
Done with pass on S2 2021-09-18 00:55:37 -04:00			`%We are now ready to formally state our claim with respect to \Cref{prob:intro-stmt}:`
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Done with pass on S2 2021-09-18 00:55:37 -04:00			`%\begin{Corollary}\label{cor:cost-model}`
			`% Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time`
circuits 2020-12-20 14:43:03 -05:00			`%`
Done with pass on S2 2021-09-18 00:55:37 -04:00			`% \[`
Finished S.2 pass. 2022-02-08 16:39:14 -05:00			`% O_k\left(\frac 1{\eps^2}\cdot\qruntimenoopt{Q,\dbbase}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)`
Done with pass on S2 2021-09-18 00:55:37 -04:00			`% \]`
			`%\end{Corollary}`
			`%Atri: The above is no longer needed`
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Finished incorporating @atri write up changes to S2, S4, S5 2021-09-15 12:18:44 -04:00
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
Finally, started my pass on Sec 5 2020-12-16 21:34:26 -05:00
circuits 2020-12-20 14:43:03 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`
circuits 2020-12-20 01:16:52 -05:00
			`%%% Local Variables:`
			`%%% mode: latex`
			`%%% TeX-master: "main"`
			`%%% End:`