master
Boris Glavic 2021-09-03 15:50:14 -05:00
parent 28a11390f2
commit 691d668031
4 changed files with 15 additions and 14 deletions

View File

@ -18,7 +18,8 @@ of tuple $\tup$.
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We are mostly interested in the data complexity of this problem (i.e. we think of $Q$ as being of constant size). Unless stated otherwise, we implicitly assume the probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$. It has been shown that the problem of computing the marginal probability of a query result tuple can be reduced to the problem of computing the probability that the lineage formula of the tuple evaluates to true. The lineage formula of a tuple $\tup$ is a propositional formula over boolean random variables (whose joint probability distribution encodes which tuple exists in which world) representing the tuples of $\pdb$ which encodes how the existence of $\tup$ depends on the existence of the input tuples. The bag semantics analog of a lineage formula is a provenance polynomial $\apolyqdt$, a polynomial with integer co-efficients and exponents over integer random variables encoding the multiplicity of input tuples. Note that we drop $Q$, $\pdb$, and $\tup$ from $\apolyqdt$ if they are clear from the context or irrelevant to the discussion.
We are mostly interested in the data complexity of this problem (i.e., we think of $Q$ as being of constant size). Unless stated otherwise, we implicitly assume the probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$. It has been shown that the problem of computing the marginal probability of a query result tuple can be reduced to the problem of computing the probability that the lineage formula of the tuple evaluates to true. The lineage formula of a tuple $\tup$ is a propositional formula over Boolean random variables that represent the tuples of $\pdb$ (the joint probability distribution of these variables encodes which tuples exist in which world); the formula encodes how the existence of $\tup$ depends on the existence of the input tuples. The bag semantics analog of a lineage formula is a provenance polynomial $\apolyqdt$, a polynomial with integer coefficients and exponents over integer random variables $\vct{X}$ encoding the multiplicity of input tuples. We will drop $Q$, $\pdb$, and $\tup$ from $\apolyqdt$ if they are clear from the context or irrelevant to the discussion.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity of Lineage Polynomials]\label{prob:bag-pdb-poly-expected}
@ -27,19 +28,19 @@ multiplicity of $\apolyqdt$ ($\expct_\pd\pbox{\apolyqdt}$).
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Note that, if $\apolyqdt$ is given, then \Cref{prob:bag-pdb-query-eval} reduces to problem \Cref{prob:bag-pdb-poly-expected}. In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials, considering the size of the encoding as the input size. % specifically, the bag semantics version of tuple-independent probabilistic bag-databases (\abbrTIDB) and block-independent probabilistic databases (\abbrBIDB).
Note that, if $\apolyqdt$ is given, then \Cref{prob:bag-pdb-query-eval} reduces to \Cref{prob:bag-pdb-poly-expected} (see \Cref{subsec:expectation-of-polynom-proof} for the proof). Evaluating queries over probabilistic databases in this fashion (computing a tuple's lineage and then calculating the expectation of the lineage) has been referred to as \textit{intensional query evaluation}~\cite{DBLP:series/synthesis/2011Suciu}. In this work, we study the complexity of \Cref{prob:bag-pdb-poly-expected} for several models of probabilistic databases and various encodings of such polynomials, considering the size of the encoding as the input size. % specifically, the bag semantics version of tuple-independent probabilistic bag-databases (\abbrTIDB) and block-independent probabilistic databases (\abbrBIDB).
% Our main technical focus is on studying the complexity of this problem for various encoding of such polynomials.
However, as we will show, these results also have implications for \cref{prob:bag-pdb-query-eval} when considering the cost of generating polynomials of query result tuples.
However, as we will show, these results have implications for solving \Cref{prob:bag-pdb-query-eval} using intensional query evaluation, i.e., when also considering the cost of generating lineage polynomials.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{\abbrTIDB\xplural}
%Solving~\cref{prob:bag-pdb-query-eval} for arbitrary $\pd$ is hopeless since we need exponential space to repreent an arbitrary $\pd$.
We initially focus on tuple-independent probabilistic bag-databases (\abbrTIDB),\BG{cite} a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event\footnote{
This model corresponds to the classical set-relational approach to \abbrTIDB{}s, where we can handle the case of each input tuple having its own multiplicity by replacing each input tuple with as many copies as its multiplicity. To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates. This increases the size of the input but this overhead is negligible when each input tuple has constant multiplicity. %$\tup$ in $\pdb$.
This model corresponds to the classical set semantics definition of \abbrTIDB{}s~\cite{VS17}. We can handle the case of an input tuple having a multiplicity larger than one by replacing each such tuple with as many copies as its multiplicity. To make each duplicate tuple unique in a set-\abbrTIDB, we can assign unique keys across all duplicates. This increases the size of the input, but the overhead is negligible when each input tuple has constant multiplicity. %$\tup$ in $\pdb$.
%This typically has an $\bigO{c}$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$, where $\db\inparen{\tup}$ denotes $\tup$'s multiplicity in the encoding.
We further generalize this model in \Cref{sec:background} and beyond.
}.\BG{The footnote is still a bit hard to follow I think, but I do not have a great suggestion on how to improve it.}
We will denote the $n$ tuples in the database by $t_1,\dots,t_\numvar$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world that has $t_i$ in it iff $w_i=1$. Further $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where each $i\in [n]$, $\probOf(W_i=1)=p_i$. Finally for each $\vct{W}\in\{0,1\}^\numvar$, we define $\pdb_{\vct{W}}$ as the world represented by $\vct{W}$.
We will denote the $n$ tuples in the database by $t_1,\dots,t_\numvar$. Each of the $2^n$ possible worlds in $\Omega$ can be encoded as a string in $\{0,1\}^\numvar$. In particular, any vector $\vct{W}=\inparen{W_1,\dots,W_n}\in \{0,1\}^\numvar$ represents a world that has $\tup_i$ in it iff $W_i=1$. Further, $\pd$ is compactly described by a tuple $\vct{p}=\inparen{p_1,\dots,p_n}$, which induces the Bernoulli distribution over vectors $\vct{W}\in\{0,1\}^\numvar$ where, for each $i\in [n]$, $\probOf(W_i=1)=p_i$. Finally, for each $\vct{W}\in\{0,1\}^\numvar$, we define $\pdb_{\vct{W}}$ as the world represented by $\vct{W}$.
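As a sketch of the distribution induced by $\vct{p}$ (assuming, as stated above, that the presence of each tuple is an independent event, and writing $\vct{w}=\inparen{w_1,\dots,w_n}$ for a fixed bit vector, a notation introduced here only for illustration), the probability of a single world factorizes as
\[
  \probOf(\vct{W}=\vct{w}) \;=\; \prod_{i=1}^{n} p_i^{w_i}\inparen{1-p_i}^{1-w_i}.
\]
For a hypothetical instance with $n=2$ and $\vct{p}=\inparen{0.5,0.2}$, the world $\vct{w}=\inparen{1,0}$ (containing $t_1$ but not $t_2$) has probability $0.5\cdot\inparen{1-0.2}=0.4$.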
%Atri: Stuff below was confusing, so am re-writing it.
%A \abbrTIDB encodes a compatible $\pdb$ as a deterministic database $\encodedDB$ with $\numvar$ tuples, each annotated with a probability $\prob_\tup$, and with $\pd$
@ -48,18 +49,20 @@ We will denote the $n$ tuples in the database by $t_1,\dots,t_\numvar$. Each of
%The possible worlds of a \abbrTIDB can be encoded by the vector $\vct{W}$, such that each of the $\numvar$ tuples in $\vct{W}$ has its own unique Bernoulli-distributed random variable, i.e. $\vct{W} = \inparen{W_{\tup_1},\ldots, W_{\tup_\numvar}}$, and for each tuple $\tup$, $\probOf(W_\tup) = \prob_\tup$.
%Given a vector $\vct{X}$ such that each $\tup \in \encodedDB$ has a unique formal variable annotation $X_\tup \in \vct{X}$, for a boolean domain $\{0,1\}^\numvar$, denote by $\pdb_{\vct{X}}$ the deterministic database consisting of exactly those tuples $\tup$ where $X_\tup = 1$.
\BG{REMOVED:
When $\pdb$ is a \abbrTIDB, for every output tuple $\tup$, $\query\inparen{\pdb}\inparen{\tup}$ can be encoded by a polynomial, with variables in $\vct{X}$.
Green, Karvounarakis, and Tannen established (\cite{DBLP:conf/pods/GreenKT07}; see \cref{fig:nxDBSemantics}) that for any $\raPlus$ query $\query$ and \abbrTIDB $\pdb$, there exists a polynomial $\poly_\tup\inparen{\vct{X}}$ following the standard addition and multiplication operators over Natural numbers (i.e., $\semN$-semiring semantics), such that $\query\inparen{\pdb_{\vct{W}}}\inparen{\tup} = \poly_\tup\inparen{\vct{W}}$.
This in turn implies that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct_{\vct{W}\sim\pd}\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$.
This in turn implies that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct_{\vct{W}\sim\pd}\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$.}
Thanks to linearity of expectation, simple polynomial-time algorithms exist
Thanks to linearity of expectation, simple polynomial-time algorithms exist for computing the expectation of a lineage polynomial $\apolyqdt$ when $\pdb$ is a \abbrTIDB and $\query$ is an $\raPlus$ query.
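
To illustrate the linearity argument, consider the hypothetical lineage polynomial $\poly\inparen{\vct{X}} = X_1X_2 + X_1^2X_3$ (chosen only as an example, and assumed to be given in expanded sum-of-monomials form). In a \abbrTIDB each $W_i\in\{0,1\}$, so $W_i^2=W_i$, and independence lets the expectation distribute over products of distinct variables:
\[
  \expct\pbox{\poly\inparen{\vct{W}}}
  \;=\; \expct\pbox{W_1W_2} + \expct\pbox{W_1^2W_3}
  \;=\; \expct\pbox{W_1}\expct\pbox{W_2} + \expct\pbox{W_1}\expct\pbox{W_3}
  \;=\; p_1p_2 + p_1p_3 .
\]
Under these assumptions, each monomial is handled in time proportional to its size, which is what makes the overall computation polynomial in the size of the expanded polynomial.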
% The algo is trivial so I think putting in a 2010 cite seems like bit too much
%\cite{kennedy:2010:icde:pip})
for computing exact results for bag-probabilistic count queries $Q$ over \abbrTIDB{}s. However, it is also known that since we are considering data complexity, that {\em deterministic} query processing for the same query $Q$ can also be done in polynomial time. If our notion of efficiency was polynomial time algorithms, then we would be done. However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e. we care about the exact exponent in our polynomial runtime). Given that there is a huge literature on fine grained complexity of deterministic query complexity, here is a natural (informal) specialization of~\cref{prob:bag-pdb-query-eval}:
% for computing exact results for bag-probabilistic count queries $Q$ over \abbrTIDB{}s.
However, since we are considering data complexity, it is also known that {\em deterministic} query processing for the same query $Q$ can be done in polynomial time. If our notion of efficiency were polynomial-time algorithms, then we would be done. However, in practice (and in theory), we care about the {\em fine-grained} complexity of deterministic query processing (i.e., we care about the exact exponent in our polynomial runtime). Given that there is a huge literature on the fine-grained complexity of deterministic query processing, here is a natural (informal) specialization of~\cref{prob:bag-pdb-query-eval}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Informal problem statement]
For any query $Q$, is it the case that the {\em fine-grained complexity} of computing expected multiplicities for the result tuples of $Q$ can be asymptotically as fast as the `best' deterministic query processing of $Q$?
\label{prob:informal}
\begin{Problem}[Informal problem statement]\label{prob:informal}
For any query $\query$, is it the case that the {\em fine-grained complexity} of computing expected multiplicities for the result tuples of $\query$ asymptotically matches that of the `best' deterministic query processing algorithm on $\query$?
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% However the question remains: \emph{can bag-probabilistic databases be as fast as deterministic queries}.

View File

@ -200,7 +200,7 @@
%using \wVec for world bit vector notation<-----Is this still the case?
%Polynomial
\newcommand{\poly}{\Phi}
\newcommand{\polyOf}[1]{\poly(#1)}
\newcommand{\polyOf}[1]{\poly[#1]}
\newcommand{\polyqdt}[3]{\polyOf{#1,#2,#3}}
\newcommand{\apolyqdt}{\polyqdt{\query}{\pdb}{\tup}}
\newcommand{\tupvar}[2]{X_{#1,#2}}

View File

@ -155,8 +155,6 @@
\section{Reviewer 1}
\label{sec:reviewer-1}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{
\textbf{Response to author feedback}