Boris Glavic 2021-09-03 19:15:07 -05:00
parent 54ff2ef002
commit 8921da1783

@ -54,11 +54,11 @@ When $\pdb$ is a \abbrTIDB, for every output tuple $\tup$, $\query\inparen{\pdb}
Green, Karvounarakis, and Tannen established (\cite{DBLP:conf/pods/GreenKT07}; see \cref{fig:nxDBSemantics}) that for any $\raPlus$ query $\query$ and \abbrTIDB $\pdb$, there exists a polynomial $\poly_\tup\inparen{\vct{X}}$ following the standard addition and multiplication operators over Natural numbers (i.e., $\semN$-semiring semantics), such that $\query\inparen{\pdb_{\vct{W}}}\inparen{\tup} = \poly_\tup\inparen{\vct{W}}$.
This in turn implies that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct_{\vct{W}\sim\pd}\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$.}
Thanks to linearity of expectation, simple polynomial-time algorithms exist for computing the expectation of a lineage polynomial $\apolyqdt$ when $\pdb$ is a \abbrTIDB and $\query$ is an $\raPlus$ query
Thanks to linearity of expectation, simple polynomial-time algorithms exist for computing the expectation of a lineage polynomial $\apolyqdt$ when $\pdb$ is a \abbrTIDB and $\query$ is an $\raPlus$ query.
% The algo is trivial so I think putting in a 2010 cite seems like bit too much
%\cite{kennedy:2010:icde:pip})
% for computing exact results for bag-probabilistic count queries $Q$ over \abbrTIDB{}s.
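% Illustrative sketch only: the polynomial and the probabilities $p_i$ below are assumed for the example.
As a small illustration of why linearity of expectation suffices, consider a hypothetical lineage polynomial in sum-of-products form over independent variables $W_i \in \{0,1\}$ with $\Pr\left[W_i = 1\right] = p_i$ (the polynomial and the symbols $p_i$ are assumptions made for this example):
\begin{align*}
\expct\pbox{W_1 W_2 + W_2 W_3} &= \expct\pbox{W_1 W_2} + \expct\pbox{W_2 W_3} && \text{(linearity of expectation)}\\
&= p_1 p_2 + p_2 p_3 && \text{(independence of the $W_i$)}
\end{align*}
Under the same assumptions, exponents collapse because $W_i^2 = W_i$ for $W_i \in \{0,1\}$, so that, for instance, $\expct\pbox{W_1^2 W_2} = p_1 p_2$ rather than $p_1^2 p_2$.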
However, it is also known that, since we are considering data complexity, {\em deterministic} query processing for the same query $Q$ can also be done in polynomial time. If our notion of efficiency were polynomial-time algorithms, then we would be done. However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e., we care about the exact exponent in our polynomial runtime). Given that there is a huge literature on the fine-grained complexity of deterministic query complexity, here is a natural (informal) specialization of~\cref{prob:bag-pdb-query-eval}:
However, it is also known that, since we are considering data complexity, {\em deterministic} query processing for the same query $Q$ can also be done in polynomial time. If our notion of efficiency were polynomial-time algorithms, then we would be done. However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e., we care about the exact exponent in our polynomial runtime). Given that there is a huge literature on the fine-grained complexity of deterministic query processing, a natural (informal) specialization of~\cref{prob:bag-pdb-query-eval} is:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Informal problem statement]\label{prob:informal}
@ -78,20 +78,22 @@ We note that an answer in the affirmative for~\cref{prob:informal} indicates tha
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Relationship to Set-Probabilistic Query Evaluation}
%
\Cref{prob:bag-pdb-query-eval} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where each output tuple appears at most once. Here, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula
\Cref{prob:bag-pdb-query-eval} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural. % , where each output tuple appears at most once.
As mentioned before, under set semantics, $\apolyqdt\inparen{\vct{X}}$ is a propositional formula
%Atri: If we get a reviewer who does not know what a propositional formula is then we are in trouble-- I did move some of the footnote text to the main part though
%\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of the said operators on boolean variables ($\semB$-semiring semantics).}
whose evaluation follows the standard Boolean semi-ring semantics (i.e., addition is logical OR and multiplication is logical AND), denoting the presence or absence of $\tup$. Computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ determines the marginal probability of $\tup$ appearing in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the query computation problem over set-\abbrPDB\xplural is \sharpphard
% whose evaluation follows the standard Boolean semi-ring semantics (i.e. addition is logical OR and multiplication is logical AND), denoting the presence or absence of $\tup$.
and $\expct\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is the marginal probability of $\tup$ appearing in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the query evaluation problem over set-\abbrPDB\xplural is \sharpphard
%Atri: Again if we have a reviewer who does not know what \sharpp is then we are in trouble
%\footnote{\sharpp is the counting version for problems residing in the NP complexity class.}
in general, and proved that a dichotomy exists for this problem, where evaluating $\query(\pdb)$ is either polynomial-time or \sharpphard in data complexity. %for any polynomial-time deterministic query.
in general, and proved that a dichotomy exists for this problem for the class of unions of conjunctive queries, which has the same expressive power as the $\raPlus$ queries we study here: evaluating $\query(\pdb)$ is either polynomial-time or \sharpphard in data complexity. %for any polynomial-time deterministic query.
Thus, for the hard queries, the answer to~\cref{prob:informal} is {\em no} for set-\abbrPDB\xplural (under the standard complexity assumption that $\sharpp\ne \polytime$).
Concretely, easy queries in this dichotomy can be answered through so-called \emph{extensional} query evaluation, where probability computation is inlined into normal deterministic query processing.
This is possible because queries on the easy side of the dichotomy can always be rewritten into a form that guarantees that, for every relational operator in the query, the presence of every tuple in the operator's output is governed by either a conjunction or a disjunction of \emph{independent} events.
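% Illustrative sketch only: the symbols $e_i$ and $p_i$ are assumed for the example and are not part of the paper's notation.
To make the role of independence concrete, the following standard identities sketch how such an operator could compute output probabilities from input probabilities alone (the events $e_i$ and probabilities $p_i$ are illustrative symbols, assumed here). If $e_1, \ldots, e_k$ are mutually independent events with $\Pr\left[e_i\right] = p_i$, then
\begin{align*}
\Pr\left[e_1 \wedge \cdots \wedge e_k\right] &= \prod_{i=1}^{k} p_i, &
\Pr\left[e_1 \vee \cdots \vee e_k\right] &= 1 - \prod_{i=1}^{k} \left(1 - p_i\right).
\end{align*}
Neither identity holds in general once the events are correlated, which is why the rewriting guarantee above matters.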
Such a guarantee is not possible for queries on the hard side of the dichotomy, and the best known approach is so-called \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
The complexity of this approach is typically dominated by the second step, computing the expectation $\expct\pbox{\poly_\tup(\vct{\randWorld})}$, a problem known to be \sharpphard~\cite{DS07}.
Such a guarantee is not possible for queries on the hard side of the dichotomy, and the best known approach is the aforementioned \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}. % , a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
The complexity of this approach is generally dominated by computing the expectation $\expct\pbox{\apolyqdt(\vct{\randWorld})}$, a problem known to be \sharpphard~\cite{DS07}.
@ -102,7 +104,7 @@ The complexity of this approach is typically dominated by the second step, compu
%Atri: Again changing subsection below to para
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Intensional Bag-Probabilistic Query Evaluation}
However, there exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit than set-\abbrPDB\xplural. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$) of the result. This work focuses on computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
However, there exist some queries for which \abbrBPDB\xplural are a more natural fit than set-\abbrPDB\xplural. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly\inparen{\vct{\randWorld}}}$) of the result. This work focuses on computing $\expct\pbox{\poly\inparen{\vct{\randWorld}}}$ as a natural statistic to develop the theoretical foundations of \abbrBPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
%BEGIN Needs to be noted.
%As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\poly_\tup\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)