Started changes to S.3 to eliminate the 2-step process from our theoretical results.

master
Aaron Huber 2021-09-13 12:10:22 -04:00
parent 441eb67719
commit a5667ee7d2
2 changed files with 13 additions and 6 deletions


@@ -174,9 +174,9 @@ $\Omega\inparen{\inparen{\qruntime{Q, \dbbase}}^{c_0\cdot k}}$ for {\em some} $c
\label{tab:lbs}
\end{table}
Note that the lower bound in the first row by itself is enough to refute \Cref{prob:informal}.
-To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k})$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuple $\tup$ (which is the parameter that defines our family of hard queries). What our lower bound in the third rows says that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.\footnote{
+To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuples $\tup$ (which is the parameter that defines our family of hard queries). What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.\footnote{
We note similar hardness results for deterministic query processing that apply lower bounds in terms of $\abs{\dbbase}$. Our lower bounds are in terms of $\qruntime{Q,\dbbase}$, which in general can be super-linear in $\abs{\dbbase}$.
-} However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing resluts already imply the claimed lower bounds if we were to replace the $\timeOf{}^*(Q,\pdb)$ by just $\abs{\dbbase}$-- indeed these results follow from known lower bound for deterministic query processing-- our contribution to then identify a family of hard query where deterministic query procedding is `easy' but computing the expected multuplicities is hard. To put these hardness results in context, we will next take a short detour to review the existing hardness results for \abbrPDB\xplural under set semantics.
+} However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace $\timeOf{}^*(Q,\pdb)$ by just $\abs{\dbbase}$ -- indeed, these results follow from known lower bounds for deterministic query processing -- our contribution is to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard. To put these hardness results in context, we will next take a short detour to review the existing hardness results for \abbrPDB\xplural under set semantics.
% Atri: Converting sub-section to para since it saves space
@@ -203,7 +203,7 @@ We note that the \sharpphard lower bound is much stronger than what one can hope
%Atri: Removed the para above since the above does not seem to add much to the current intro flow.
%Such a guarantee is not possible
-For queries on the hard side of the dichotomy, and the best known algorithmic approach is the \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, where one explicitly computes the lineage polynomial and then compute its expectation as in \Cref{prob:bag-pdb-poly-expected}. % , a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
+For queries on the hard side of the dichotomy, the best known algorithmic approach is the \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, where one explicitly computes the lineage polynomial and then its expectation as in \Cref{prob:bag-pdb-poly-expected}. % , a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
%The complexity of this approach is, in general, dominated by computing the expectation $\expct\pbox{\apolyqdt(\vct{\randWorld})}$, a problem known to be \sharpphard~\cite{DS07}.
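As an illustration (not from the paper), the second step of intensional evaluation is simple once the lineage polynomial is in sum-of-monomials form: for a tuple-independent bag \abbrPDB the variables are independent Bernoulli random variables, so linearity of expectation and the idempotence $X_i^k = X_i$ reduce the expectation to a sum over monomials of products of probabilities. The following sketch assumes this setting; all names are hypothetical.

```python
from functools import reduce

def expected_multiplicity(monomials, prob):
    """Expectation of a lineage polynomial over a tuple-independent bag PDB.

    `monomials` is a list of (coefficient, [variable ids]) pairs; each
    variable X_i is an independent Bernoulli(prob[i]) random variable.
    By linearity of expectation, E[poly] = sum_j c_j * prod_{i in m_j} p_i.
    """
    total = 0.0
    for coeff, vars_ in monomials:
        distinct = set(vars_)  # X_i is 0/1, so X_i^k collapses to X_i
        total += coeff * reduce(lambda acc, v: acc * prob[v], distinct, 1.0)
    return total

# Lineage B*Y + B*Z from the running example, P(B)=0.9, P(Y)=P(Z)=0.5:
# E[BY + BZ] = 0.9*0.5 + 0.9*0.5 = 0.9
print(expected_multiplicity([(1, ["B", "Y"]), (1, ["B", "Z"])],
                            {"B": 0.9, "Y": 0.5, "Z": 0.5}))
```

The hardness discussed in the surrounding text comes from the fact that a compressed (circuit) representation may expand to exponentially many monomials, so this per-monomial summation is not efficient in general.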
@@ -244,7 +244,7 @@ Given \abbrBPDB $\pdb$, $\raPlus$ query $\query$,
is there a $(1\pm\epsilon)$-approximation of $\expct_{\db\sim\pd}\pbox{\query\inparen{\db}\inparen{\tup}}$ for all result tuples $\tup$ where
$\exists \circuit : \timeOf{\abbrStepOne}(Q,\dbbase,\circuit) + \timeOf{\abbrStepTwo}(\circuit) \le O(\qruntime{Q, \dbbase})$?
\end{Problem}
-Note that if the answer to the above problem is yes, then we have shown that the answer to \Cref{prob:informal} is yes (when we are interested in approximating the expected multiplities).
+Note that if the answer to the above problem is yes, then we have shown that the answer to \Cref{prob:informal} is yes (when we are interested in approximating the expected multiplicities).
We show in \Cref{sec:gen}
%\OK{confirm this ref}
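For intuition about what a $(1\pm\epsilon)$-approximation of an expected multiplicity even estimates, a naive Monte Carlo baseline (emphatically not the circuit-based algorithm the paper develops) samples possible worlds and averages the lineage polynomial; its sample complexity, not its per-sample cost, is what makes it uncompetitive. All names below are hypothetical.

```python
import random

def mc_expected_multiplicity(eval_poly, prob, n_samples=10000, seed=0):
    """Naive Monte Carlo estimate of E[Phi_t(W)] for a bag PDB with
    independent tuples: sample worlds by flipping each tuple's coin,
    then average the lineage polynomial over the sampled worlds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        world = {x: 1 if rng.random() < px else 0 for x, px in prob.items()}
        total += eval_poly(world)
    return total / n_samples

# Lineage B*Y + B*Z with P(B)=0.9, P(Y)=P(Z)=0.5; true expectation is 0.9.
est = mc_expected_multiplicity(lambda w: w["B"] * w["Y"] + w["B"] * w["Z"],
                               {"B": 0.9, "Y": 0.5, "Z": 0.5})
```

A multiplicative $(1\pm\epsilon)$ guarantee from sampling needs a number of samples that grows with the variance of the estimate, which is why the paper instead targets runtime linear in the size of the circuit.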
@@ -259,7 +259,7 @@ For example, if we insist that $\circuit$ represent the lineage polynomial in th
However, systems can generate compact representations of $\poly(\vct{X})$ (e.g., through optimizations like projection push-down~\cite{DBLP:books/daglib/0020812}), which directly result in factorized representations of $\poly(\vct{X})$.
For example, in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB-form $BY+BZ$.
Accordingly, this work uses (arithmetic) circuits\footnote{
-An arithmetic circuit is a DAG with variable and/or numeric source nodes, with internal nodes representing either an addition or multiplication operator.
+An arithmetic circuit is a DAG with variable and/or numeric source nodes and internal nodes representing either an addition or multiplication operator.
}
as the representation system of $\poly(\vct{X})$.
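To make the circuit representation concrete (an illustrative sketch, not the paper's implementation), the factorized form $B(Y+Z)$ and the SMB form $BY+BZ$ can be built as DAGs of addition/multiplication gates; sharing the leaf $B$ is what a DAG buys over a tree. All class and function names are hypothetical.

```python
class Gate:
    """Node of an arithmetic circuit: a variable leaf or a +/* gate."""
    def __init__(self, op, children=(), label=None):
        self.op, self.children, self.label = op, tuple(children), label

def var(name): return Gate("var", label=name)
def add(*cs):  return Gate("+", cs)
def mul(*cs):  return Gate("*", cs)

def size(c, seen=None):
    """Number of distinct nodes in the DAG (shared subcircuits count once)."""
    seen = seen if seen is not None else set()
    if id(c) in seen:
        return 0
    seen.add(id(c))
    return 1 + sum(size(ch, seen) for ch in c.children)

B, Y, Z = var("B"), var("Y"), var("Z")
factorized = mul(B, add(Y, Z))          # B(Y+Z): 5 nodes
smb        = add(mul(B, Y), mul(B, Z))  # BY+BZ:  6 nodes (leaf B is shared)
print(size(factorized), size(smb))
```

On this toy polynomial the gap is tiny, but for the hard queries discussed in the paper the SMB form can be exponentially larger than a factorized circuit, which is exactly why runtime is measured in the size of $\circuit$.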
@@ -310,7 +310,7 @@ Given a circuit $\circuit$ for $\apolyqdt$ (over all result tuples $\tup$) for \
%graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
%(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for an $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicitly.
-(i) We show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes}, where there is a single result tuple\footnote{We can approximate the expected output tuple multiplicities (for all output tuples {\em simultanesouly} with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}
+(i) We show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)/Aggregations and Joins over Annotated Relations (AJAR) followups~\cite{DBLP:conf/pods/KhamisNR16, ajar}), the answer to \Cref{prob:intro-stmt} for \abbrTIDB is {\em yes} when there is a single result tuple.\footnote{We can approximate the expected output tuple multiplicities (for all output tuples {\em simultaneously}) with only $O(\log{Z})=O_k(\log{n})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms (see \Cref{app:sec-cicuits}).}
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic in the size of the compressed lineage encoding~\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}.
%Atri: The footnote below does not add much


@@ -69,6 +69,13 @@ SELECT 1 FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
where adapting the PDB instance in \Cref{fig:two-step}, relation $OnTime$ has $4$ tuples, one corresponding to each vertex $i$ in $[4]$, each with probability $\prob_i$, and $Route$ has tuples corresponding to the edges $\edgeSet$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ are also present with probability $\prob$, but to simplify notation we assign probability $1$ to edges.}
Note that this implies that our hard lineage polynomial can be represented as an expression tree produced by a project-join query with the same probability value $\prob_i$ for each input tuple, and hence is indeed a lineage polynomial for a \abbrTIDB \abbrPDB.
\begin{Lemma}\label{lem:pdb-for-def-qk}
The relations encoding the edges for the hard query of \Cref{def:qk} can be computed in $\bigO{\numedge}$ time.
\end{Lemma}
\begin{proof}
Only two relations need to be constructed, one for the vertices and one for the edges. By a simple linear scan, each can be constructed in time $\bigO{\numedge + \numvar}$. If we assume that the number of vertices is at most a constant factor times the number of edges, then this is $\bigO{\numedge}$ time.
\end{proof}
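The linear scan in the proof can be sketched as follows (an illustrative reading of the construction, not code from the paper): one pass over the edge list populates both the vertex relation $OnTime$ and the edge relation $Route$. Names and the relation encodings are assumptions.

```python
def build_relations(edges, p):
    """Build the OnTime (vertex) and Route (edge) relations for the hard
    query in one linear scan over the m edges: O(m + n) time overall,
    which is O(m) when n = O(m)."""
    on_time = {}  # vertex -> probability p_i (here a single p for all)
    route = []    # edge tuples, each present with probability 1
    for u, v in edges:                 # single scan over the edge list
        on_time.setdefault(u, p)       # record each endpoint once
        on_time.setdefault(v, p)
        route.append((u, v, 1.0))
    return on_time, route

# The 4-cycle used in the running example, with p_i = 0.5 for every vertex:
on_time, route = build_relations([(1, 2), (2, 3), (3, 4), (4, 1)], 0.5)
```

Each edge is touched once and each vertex is inserted at most $\deg(v)$ times into a hash table, so the scan matches the $\bigO{\numedge + \numvar}$ bound in the proof.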
\subsection{Multiple Distinct $\prob$ Values}
\label{sec:multiple-p}
%Unless otherwise noted, all proofs for this section are in \Cref{app:single-mult-p}.