Started changes to S.3 to eliminate the 2-step process from our theoretical results.

master
Aaron Huber 2021-09-13 12:10:22 -04:00
parent 441eb67719
commit a5667ee7d2
2 changed files with 13 additions and 6 deletions


@@ -174,9 +174,9 @@ $\Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^{c_0\cdot k}}$ for {\em some} $c
\label{tab:lbs}
\end{table}
Note that the lower bound in the first row by itself is enough to refute \Cref{prob:informal}.
-To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k})$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuple $\tup$ (which is the parameter that defines our family of hard queries). What our lower bound in the third rows says that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.\footnote{
+To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuples $\tup$ (which is the parameter that defines our family of hard queries). What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.\footnote{
We note similar hardness results for deterministic query processing that apply lower bounds in terms of $\abs{\dbbase}$. Our lower bounds are in terms of $\qruntime{Q,\dbbase}$, which in general can be super-linear in $\abs{\dbbase}$.
-} However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing resluts already imply the claimed lower bounds if we were to replace the $\timeOf{}^*(Q,\pdb)$ by just $\abs{\dbbase}$-- indeed these results follow from known lower bound for deterministic query processing-- our contribution to then identify a family of hard query where deterministic query procedding is `easy' but computing the expected multuplicities is hard. To put these hardness results in context, we will next take a short detour to review the existing hardness results for \abbrPDB\xplural under set semantics.
+} However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace $\timeOf{}^*(Q,\pdb)$ by just $\abs{\dbbase}$ -- indeed, these results follow from known lower bounds for deterministic query processing -- our contribution is to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard. To put these hardness results in context, we will next take a short detour to review the existing hardness results for \abbrPDB\xplural under set semantics.
% Atri: Converting sub-section to para since it saves space
@@ -203,7 +203,7 @@ We note that the \sharpphard lower bound is much stronger than what one can hope
%Atri: Removed the para above since the above does not seem to add much to the current intro flow.
%Such a guarantee is not possible
-For queries on the hard side of the dichotomy, and the best known algorithmic approach is the \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, where one explicitly computes the lineage polynomial and then compute its expectation as in \Cref{prob:bag-pdb-poly-expected}. % , a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
+For queries on the hard side of the dichotomy, the best known algorithmic approach is the \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, where one explicitly computes the lineage polynomial and then its expectation as in \Cref{prob:bag-pdb-poly-expected}. % , a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
%The complexity of this approach is, in general, dominated by computing the expectation $\expct\pbox{\apolyqdt(\vct{\randWorld})}$, a problem known to be \sharpphard~\cite{DS07}.
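As an illustration (not from the paper), the second step of intensional evaluation is simple once the lineage polynomial is in sum-of-monomials form: for a tuple-independent bag \abbrPDB the variables are independent Bernoulli random variables, so linearity of expectation and the idempotence $X_i^k = X_i$ reduce the expectation to a sum over monomials of products of probabilities. The following sketch assumes this setting; all names are hypothetical.

```python
from functools import reduce

def expected_multiplicity(monomials, prob):
    """Expectation of a lineage polynomial over a tuple-independent bag PDB.

    `monomials` is a list of (coefficient, [variable ids]) pairs; each
    variable X_i is an independent Bernoulli(prob[i]) random variable.
    By linearity of expectation, E[poly] = sum_j c_j * prod_{i in m_j} p_i.
    """
    total = 0.0
    for coeff, vars_ in monomials:
        distinct = set(vars_)  # X_i is 0/1, so X_i^k collapses to X_i
        total += coeff * reduce(lambda acc, v: acc * prob[v], distinct, 1.0)
    return total

# Lineage B*Y + B*Z from the running example, P(B)=0.9, P(Y)=P(Z)=0.5:
# E[BY + BZ] = 0.9*0.5 + 0.9*0.5 = 0.9
print(expected_multiplicity([(1, ["B", "Y"]), (1, ["B", "Z"])],
                            {"B": 0.9, "Y": 0.5, "Z": 0.5}))
```

The hardness discussed in the surrounding text comes from the fact that a compressed (circuit) representation may expand to exponentially many monomials, so this per-monomial summation is not efficient in general.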
@@ -244,7 +244,7 @@ Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O(\qruntime{Q, \dbbase})$?
\end{Problem}
-Note that if the answer to the above problem is yes, then we have shown that the answer to \Cref{prob:informal} is yes (when we are interested in approximating the expected multiplities).
+Note that if the answer to the above problem is yes, then we have shown that the answer to \Cref{prob:informal} is yes (when we are interested in approximating the expected multiplicities).
We show in \Cref{sec:gen}
%\OK{confirm this ref}
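For intuition about what a $(1\pm\epsilon)$-approximation of an expected multiplicity even estimates, a naive Monte Carlo baseline (emphatically not the circuit-based algorithm the paper develops) samples possible worlds and averages the lineage polynomial; its sample complexity, not its per-sample cost, is what makes it uncompetitive. All names below are hypothetical.

```python
import random

def mc_expected_multiplicity(eval_poly, prob, n_samples=10000, seed=0):
    """Naive Monte Carlo estimate of E[Phi_t(W)] for a bag PDB with
    independent tuples: sample worlds by flipping each tuple's coin,
    then average the lineage polynomial over the sampled worlds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        world = {x: 1 if rng.random() < px else 0 for x, px in prob.items()}
        total += eval_poly(world)
    return total / n_samples

# Lineage B*Y + B*Z with P(B)=0.9, P(Y)=P(Z)=0.5; true expectation is 0.9.
est = mc_expected_multiplicity(lambda w: w["B"] * w["Y"] + w["B"] * w["Z"],
                               {"B": 0.9, "Y": 0.5, "Z": 0.5})
```

A multiplicative $(1\pm\epsilon)$ guarantee from sampling needs a number of samples that grows with the variance of the estimate, which is why the paper instead targets runtime linear in the size of the circuit.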
@@ -259,7 +259,7 @@ For example, if we insist that $\circuit$ represent the lineage polynomial in th
However, systems can generate compact representations of $\poly(\vct{X})$ (e.g., through optimizations like projection push-down~\cite{DBLP:books/daglib/0020812}), which directly result in factorized representations of $\poly(\vct{X})$.
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
Accordingly, this work uses (arithmetic) circuits\footnote{
-An arithmetic circuit is a DAG with variable and/or numeric source nodes, with internal nodes representing either an addition or multiplication operator.
+An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes representing either an addition or multiplication operator.
}
as the representation system of $\poly(\vct{X})$.
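To make the circuit representation concrete (an illustrative sketch, not the paper's implementation), the factorized form $B(Y+Z)$ and the SMB form $BY+BZ$ can be built as DAGs of addition/multiplication gates; sharing the leaf $B$ is what a DAG buys over a tree. All class and function names are hypothetical.

```python
class Gate:
    """Node of an arithmetic circuit: a variable leaf or a +/* gate."""
    def __init__(self, op, children=(), label=None):
        self.op, self.children, self.label = op, tuple(children), label

def var(name): return Gate("var", label=name)
def add(*cs):  return Gate("+", cs)
def mul(*cs):  return Gate("*", cs)

def size(c, seen=None):
    """Number of distinct nodes in the DAG (shared subcircuits count once)."""
    seen = seen if seen is not None else set()
    if id(c) in seen:
        return 0
    seen.add(id(c))
    return 1 + sum(size(ch, seen) for ch in c.children)

B, Y, Z = var("B"), var("Y"), var("Z")
factorized = mul(B, add(Y, Z))          # B(Y+Z): 5 nodes
smb        = add(mul(B, Y), mul(B, Z))  # BY+BZ:  6 nodes (leaf B is shared)
print(size(factorized), size(smb))
```

On this toy polynomial the gap is tiny, but for the hard queries discussed in the paper the SMB form can be exponentially larger than a factorized circuit, which is exactly why runtime is measured in the size of $\circuit$.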
@@ -310,7 +310,7 @@ Given a circuit $\circuit$ for $\apolyqdt$ (over all result tuples $\tup$) for \
%graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly.
-(i) We show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes}, where there is a single result tuple\footnote{We can approximate the expected output tuple multiplicities (for all output tuples {\em simultanesouly} with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}
+(i) We show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes} when there is a single result tuple.\footnote{We can approximate the expected output tuple multiplicities (for all output tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic in the size of the compressed lineage encoding~\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}.
%Atri: The footnote below does not add much


@@ -69,6 +69,13 @@ SELECT 1 FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
where adapting the PDB instance in \Cref{fig:two-step}, relation $OnTime$ has $4$ tuples, one corresponding to each vertex $i$ in $[4]$, each with probability $\prob_i$, and $Route$ has tuples corresponding to the edges $\edgeSet$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ are also present with probability $\prob$, but to simplify notation we assign probability $1$ to edges.}
Note that this implies that our hard lineage polynomial can be represented as an expression tree produced by a project-join query with the same probability value $\prob_i$ for each input tuple, and hence is indeed a lineage polynomial for a \abbrTIDB \abbrPDB.
\begin{Lemma}\label{lem:pdb-for-def-qk}
The relations encoding the edges for the hard query of \Cref{def:qk} can be computed in $\bigO{\numedge}$ time.
\end{Lemma}
\begin{proof}
Only two relations need to be constructed, one for the vertices and one for the edges. By a simple linear scan, each can be constructed in time $\bigO{\numedge + \numvar}$. If we assume that the number of vertices is at most a constant factor times the number of edges, then this is $\bigO{\numedge}$ time.
\end{proof}
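The linear scan in the proof can be sketched as follows (an illustrative reading of the construction, not code from the paper): one pass over the edge list populates both the vertex relation $OnTime$ and the edge relation $Route$. Names and the relation encodings are assumptions.

```python
def build_relations(edges, p):
    """Build the OnTime (vertex) and Route (edge) relations for the hard
    query in one linear scan over the m edges: O(m + n) time overall,
    which is O(m) when n = O(m)."""
    on_time = {}  # vertex -> probability p_i (here a single p for all)
    route = []    # edge tuples, each present with probability 1
    for u, v in edges:                 # single scan over the edge list
        on_time.setdefault(u, p)       # record each endpoint once
        on_time.setdefault(v, p)
        route.append((u, v, 1.0))
    return on_time, route

# The 4-cycle used in the running example, with p_i = 0.5 for every vertex:
on_time, route = build_relations([(1, 2), (2, 3), (3, 4), (4, 1)], 0.5)
```

Each edge is touched once and each vertex is inserted at most $\deg(v)$ times into a hash table, so the scan matches the $\bigO{\numedge + \numvar}$ bound in the proof.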
\subsection{Multiple Distinct $\prob$ Values}
\label{sec:multiple-p}
%Unless otherwise noted, all proofs for this section are in \Cref{app:single-mult-p}.