%!TEX root= prob-def.tex \subsection{Deterministic Query Runtimes}\label{sec:gen} %We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions. %Lastly, we generalize our result for expectation to other moments. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\revision{ %\subsection{Cost Model, Query Plans, and Runtime} %As in the introduction, we could consider polynomials to be represented as an expression tree. %However, they do not capture many of the compressed polynomial representations that we can get from query processing algorithms on bags, including the recent work on worst-case optimal join algorithms~\cite{ngo-survey,skew}, factorized databases~\cite{factorized-db}, and FAQ~\cite{DBLP:conf/pods/KhamisNR16}. Intuitively, the main reason is that an expression tree does not allow for `sharing' of intermediate results, which is crucial for these algorithms (and other query processing methods as well). %} % %\label{sec:circuits} %\mypar{The cost model} %\label{sec:cost-model} %So far our analysis of \Cref{prob:intro-stmt} has been in terms of the size of the lineage circuits. %We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\query$, we can construct a compressed circuit for $\poly$ and \bi $\pdb$ of size and runtime linear in that of a general class of query processing algorithms for the same query $\query$ on $\pdb$'s \dbbaseName $\dbbase$. % Note that by definition, there exists a linear relationship between input sizes $|\pxdb|$ and $|\dbbase|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$). % \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes. % In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).} %That is for \bis that fulfill this restriction approximating the expectation of results of SPJU queries is only has a constant factor overhead over deterministic query processing (using one of the algorithms for which we prove the claim). % with the same complexity as it would take to evaluate the query on a deterministic \emph{bag} database of the same size as the input PDB. In~\Cref{sec:intro}, we introduced the function $T_{det}\inparen{\cdot}$ to analyze the runtime complexity of~\Cref{prob:expect-mult}. To decouple our results from any specific join algorithm, we first lower bound the cost of a join. \begin{Definition}[Join Cost] \label{def:join-cost} Denote by $\jointime{R_1, \ldots, R_m}$ the runtime of an algorithm for computing the $m$-ary join $R_1 \bowtie \ldots \bowtie R_m$. We require only that the algorithm must enumerate its output, i.e., that $\jointime{R_1, \ldots, R_m} \geq |R_1 \bowtie \ldots \bowtie R_m|$. %With this definition of $\jointime{\cdot}$, worst-case optimal join algorithms are handled. \end{Definition} Worst-case optimal join algorithms~\cite{skew,ngo-survey} and query evaluation via factorized databases~\cite{factorized-db} (as well as work on FAQs~\cite{DBLP:conf/pods/KhamisNR16} and AJAR~\cite{ajar}) can be modeled as $\raPlus$ queries (though the query size is data dependent). For these algorithms, $\jointime{R_1, \ldots, R_n}$ is linear in the {\em AGM bound}~\cite{AGM}. % = |R_1| + \ldots + |R_n| + |R_1(\db) \bowtie \ldots \bowtie R_n(\db)|$. Our cost model for general query evaluation follows: %\noindent\resizebox{1\linewidth}{!}{ %\begin{minipage}{1.0\linewidth} {\small \begin{flalign*} &\begin{aligned} &\qruntimenoopt{R,\gentupset,\bound} = |\gentupset.R| & &\qquad\qquad\qruntimenoopt{\sigma \query, \gentupset,\bound} = \qruntimenoopt{\query,\gentupset,\bound}\\ \end{aligned}&\\ %\vspace{-.6cm} &\qruntimenoopt{\pi \query, \gentupset,\bound} = \qruntimenoopt{\query,\gentupset,\bound} +\abs{\query(\gentupset)}&\\ &\qruntimenoopt{\query \cup \query', \gentupset,\bound} = \qruntimenoopt{\query, \gentupset,\bound} + \qruntimenoopt{\query', \gentupset,\bound} + \abs{\query\inparen{\gentupset}}+\abs{\query'\inparen{\gentupset}}& \\ &\qruntimenoopt{\query_1 \bowtie \ldots \bowtie \query_m, \gentupset,\bound} = &\\ &\qquad\qruntimenoopt{\query_1, \gentupset,\bound} + \ldots + \qruntimenoopt{\query_m,\gentupset,\bound} + \jointime{\query_1(\gentupset), \ldots, \query_m(\gentupset)}&\\ \end{flalign*} } \vspace{-0.93cm} %\begin{align*} % &\qruntimenoopt{\query_1 \bowtie \ldots \bowtie \query_m, \gentupset,\bound} = \\ % &\qquad\qruntimenoopt{\query_1, \gentupset,\bound} + \ldots + \qruntimenoopt{\query_m,\gentupset,\bound} + % \jointime{\query_1(\gentupset), \ldots, \query_m(\gentupset)} %\end{align*}\\%[-10mm] %\end{minipage} %}\\ \noindent Under this model, an $\raPlus$ query $\query$ evaluated over over any deterministic database %\dbbaseName that maps each tuple in $\gentupset$ to a multiplicity in $[0,\bound]$ %database $\gentupset$ has runtime $O\inparen{\qruntimenoopt{Q,\gentupset, \bound}}$. We assume that full table scans are used for every base relation access. We can model index scans by treating an index scan query $\sigma_\theta(R)$ as a base relation. %Observe that % () .\footnote{This claim can be verified by e.g. simply looking at the {\em Generic-Join} algorithm in~\cite{skew} and {\em factorize} algorithm in~\cite{factorized-db}.} It can be verified that the above cost model on the corresponding $\raPlus$ join queries correctly captures the runtime of current best known . \Cref{lem:circ-model-runtime} and \Cref{lem:tlc-is-the-same-as-det} show that for any $\raPlus$ query $\query$ and \dbbaseName $\tupset$, there exists a circuit $\circuit^*$ such that $\timeOf{\abbrStepOne}(Q,\tupset,\circuit^*)$ and $|\circuit^*|$ are both $O(\qruntimenoopt{\optquery{\query}, \tupset,\bound})$, as we assumed when moving from \Cref{prob:big-o-joint-steps} to \Cref{prob:intro-stmt}. Lastly, we can handle FAQs/AJAR queries and factorized databases by allowing for optimization. %, i.e. $\qruntimenoopt{\optquery{\query}, \gentupset, \bound}$. % %We now make a simple observation on the above cost model: %\begin{proposition} %\label{prop:queries-need-to-output-tuples} %The runtime $\qruntimenoopt{Q}$ of any query $Q$ is at least $|Q|$ %\end{proposition} % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % %We are now ready to formally state our claim with respect to \Cref{prob:intro-stmt}: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\begin{Corollary}\label{cor:cost-model} % Given an $\raPlus$ query $\query$ over a \ti $\pdb$ with \dbbaseName $\dbbase$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pdb)$ with probability at least $1-\delta$ in time % % \[ % O_k\left(\frac 1{\eps^2}\cdot\qruntimenoopt{Q,\dbbase}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right) % \] %\end{Corollary} %Atri: The above is no longer needed %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% Local Variables: %%% mode: latex %%% TeX-master: "main" %%% End: