updates
parent
347cca2f7d
commit
1aa796641a
15
abstract.tex
15
abstract.tex
|
@ -1,11 +1,14 @@
|
|||
%root: main.tex
|
||||
%!TEX root=./main.tex
|
||||
In this work, we study the problem of computing a tuple's expected multiplicity over probabilistic databases with bag semantics (where each tuple is associated with a multiplicity) exactly and approximately.
|
||||
We consider bag-\abbrTIDB\xplural where we have a bound $\bound$ on the maximum multiplicity of each tuple and tuples are independent probabilistic events (we refer to such databases as \abbrCTIDB\xplural).
|
||||
We are specifically interested in the fine-grained complexity of computing expected multiplicities and how it compares to the complexity of deterministic query evaluation algorithms --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases.
|
||||
Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the results produced by such query evaluation algorithms introduces super-linear overhead (under parameterized complexity hardness assumptions/conjectures).
|
||||
We proceed to study approximation of expected result tuple multiplicities for positive relational algebra queries ($\raPlus$) over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).
|
||||
We develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any $\raPlus$ query.
|
||||
In this work, we study the problem of computing a query result tuple's expected multiplicity for probabilistic databases under bag semantics (where each tuple is associated with a multiplicity) exactly and approximately.
|
||||
Specifically, we are interested in the fine-grained complexity of this problem for \abbrCTIDB\xplural, i.e., probabilistic databases where tuples are independent probabilistic events and the multiplicity of each tuple is bounded by a constant $\bound$.
|
||||
% We consider bag-\abbrTIDB\xplural where we have a bound $\bound$ on the maximum multiplicity of each tuple and tuples are independent probabilistic events (we refer to such databases as \abbrCTIDB\xplural).
|
||||
Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the results produced by deterministic query evaluation algorithms introduces super-linear overhead (under certain parameterized complexity hardness conjectures).
|
||||
% We are specifically interested in the fine-grained complexity of computing expected multiplicities and how it compares to the complexity of deterministic query evaluation algorithms --- if these complexities are comparable, it opens the door to practical deployment of probabilistic databases.
|
||||
% Unfortunately, our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the results produced by such query evaluation algorithms introduces super-linear overhead (under parameterized complexity hardness assumptions/conjectures).
|
||||
Nonetheless, we develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any positive relational algebra ($\raPlus$) query over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).
|
||||
% We proceed to study approximation of expected result tuple multiplicities for positive relational algebra queries ($\raPlus$) over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).
|
||||
% We develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding deterministic query for any $\raPlus$ query.
|
||||
|
||||
%%% Local Variables:
|
||||
%%% mode: latex
|
||||
|
|
|
@ -2,16 +2,18 @@
|
|||
%root: main.tex
|
||||
\section{Introduction}\label{sec:intro}
|
||||
|
||||
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB, a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]$ for some fixed constant $\bound$ and multiplicities assigned to any two tuples are independent of each other.
|
||||
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]$ for some fixed constant $\bound$ and multiplicities assigned to any two tuples are independent of each other.
|
||||
Formally, a \abbrCTIDB,
|
||||
$\pdb = \inparen{\worlds, \bpd}$ is a set of tuples $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
|
||||
Any such world can be encoded as a vector (of length $\numvar=\abs{\tupset}$) from $\worlds$, such that the multiplicity of each $\tup \in \tupset$ is stored at a distinct index.
|
||||
A given world $\worldvec \in\worlds$ can be interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$.
|
||||
We note that encoding a possible world as a vector, while non-standard, is equivalent to encoding it as a set of tuples (\Cref{prop:expection-of-polynom} in \Cref{subsec:expectation-of-polynom-proof}).
|
||||
Given that tuple multiplicities are independent events, the probability distribution $\bpd$ can be expressed compactly by assigning each tuple a (disjoint) probability distribution over $[0,\bound]$. Let $\prob_{\tup,j}$ denote the probability that tuple $\tup$ is assigned multiplicity $j$. The probability of a world $\worldvec$ is then $\prod_{\tup \in \tupset} \prob_{\tup,\worldvec_{\tup}}$.
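% Illustrative example only: the tuples and probability values below are hypothetical and do not come from the paper.
For illustration only (the tuples and probabilities here are hypothetical, chosen for easy arithmetic), consider $\bound = 2$ and $\tupset = \{\tup_1, \tup_2\}$, so that $\numvar = 2$, with
\[
\prob_{\tup_1,0} = 0.5,\quad \prob_{\tup_1,1} = 0.3,\quad \prob_{\tup_1,2} = 0.2,
\qquad
\prob_{\tup_2,0} = 0.1,\quad \prob_{\tup_2,1} = 0.9,\quad \prob_{\tup_2,2} = 0.
\]
There are at most $\inparen{\bound+1}^{\numvar} = 9$ possible worlds. The world $\worldvec = (2,1)$, in which $\tup_1$ has multiplicity $2$ and $\tup_2$ has multiplicity $1$, has probability
\[
\prob_{\tup_1,2}\cdot\prob_{\tup_2,1} = 0.2 \cdot 0.9 = 0.18.
\]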
|
||||
|
||||
Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics), is the same as the traditional set \abbrTIDB.
|
||||
In this work, since we are generally considering bag query input, we will only be considering bag query semantics. We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of $\tup$ in query $\query$ over possible world $\worldvec\in\worlds$.
|
||||
%
|
||||
% Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics), is the same as the traditional set \abbrTIDB.
|
||||
% In this work, since we are generally considering bag query input, we will only be considering bag query semantics.
|
||||
Note that in this work, we consider queries with bag semantics over such bag probabilistic databases.
|
||||
We denote by $\query\inparen{\worldvec}\inparen{\tup}$ the multiplicity of $\tup$ in query $\query$ over possible world $\worldvec\in\worlds$.
|
||||
%
|
||||
We can formally state our problem of computing the expected multiplicity of a result tuple as:
|
||||
|
||||
|
@ -53,19 +55,31 @@ An $\raPlus$ query is a query expressed in positive relational algebra, i.e., us
|
|||
\vspace{-0.53cm}
|
||||
\end{figure}
|
||||
|
||||
Computing the expected multiplicity of a result tuple in a \abbrCTIDB is the analog of the marginal probability in a set \abbrPDB.
|
||||
In this work we will assume that $c =\bigO{1}$, since this is what is typically seen in practice.
|
||||
Allowing for unbounded $c$ is an interesting open problem.
|
||||
As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524}, computing the expected multiplicity of a result tuple in a bag probabilistic database is the analog of the marginal probability in a set \abbrPDB.
|
||||
% We will assume that $c =\bigO{1}$, since this is what is typically seen in practice.
|
||||
% Allowing for unbounded $c$ is an interesting open problem.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
|
||||
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and their data complexity has, in general, been shown by Dalvi and Suciu to be \sharpphard~\cite{10.1145/1265530.1265571}. For our setting, there exists a trivial polytime algorithm to compute~\Cref{prob:expect-mult} for any $\raPlus$ query over a \abbrCTIDB due to linearity of expectation (see~\Cref{sec:intro-poly-equiv}).
|
||||
Since we can compute~\Cref{prob:expect-mult} in polynomial time, the interesting question that we explore is the hardness of computing expectation using fine-grained analysis and parameterized complexity, where we are interested in the exponent of polynomial runtime.
|
||||
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and their data complexity has, in general, been shown % by Dalvi and Suciu
|
||||
to be \sharpphard\cite{10.1145/1265530.1265571}.
|
||||
Grohe et al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} studied bag-\abbrTIDB\xplural allowing for unbounded multiplicities, which requires them to explicitly address the issue of a succinct representation of probability distributions over infinitely many multiplicities.
|
||||
This work demonstrated the existence of a dichotomy for
|
||||
the problem of computing the probability that an output tuple has a multiplicity of at most $k$.
|
||||
% investigates the query evaluation problem over bag-\abbrTIDB\xplural when computing the probability of an output tuple having at most a multiplicity of $k$, showing that a dichotomy exists for this problem.
|
||||
% While the authors observe that computing the expectation of an output tuple multiplicity is in polynomial time, no further (fine-grained) analysis of the expected value is considered.
|
||||
% Our work in contrast assumes a finite bound on the multiplicities where we simply list the finitely many probability values (and hence do not need consider a more succinct representation). Further, our work primarily looks into the fine-grained analysis of computing the expected multiplicity of an output tuple.
|
||||
|
||||
In contrast to this work, we consider \abbrCTIDB\xplural, i.e., the multiplicity of input tuples is bounded by a constant $\bound$.
|
||||
For this setting, % (\abbrCTIDB\xplural, i.e., the multiplicity of input tuples is bound by a constant $\bound$), however,
|
||||
there exists a trivial \ptime algorithm for computing the expectation of a result tuple's multiplicity~(\Cref{prob:expect-mult}) for any $\raPlus$ query due to linearity of expectation (see~\Cref{sec:intro-poly-equiv}).
|
||||
Since we can solve~\Cref{prob:expect-mult} in \ptime, the interesting question that we explore is the hardness of computing expectation using fine-grained analysis and parameterized complexity, where we are interested in the exponent of polynomial runtime.\footnote{While the authors of \cite{https://doi.org/10.48550/arxiv.2201.11524} also observe that computing the expectation of an output tuple multiplicity is in \ptime, they do not investigate the fine-grained complexity of this problem.}
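% Illustrative sketch only: the values, the notation W_t, and the use of \mathbb{E} are assumptions made for this example, not the paper's notation.
As a hedged sketch of why linearity of expectation yields a simple algorithm (the values are hypothetical, and $W_{\tup}$ and $\mathbb{E}$ are notation assumed here for illustration only; the precise statement is in~\Cref{sec:intro-poly-equiv}), let $W_{\tup}$ denote the random multiplicity of input tuple $\tup$, take $\bound = 2$, and let $\tup_1$ and $\tup_2$ be independent with $\prob_{\tup_1,0} = 0.5$, $\prob_{\tup_1,1} = 0.3$, $\prob_{\tup_1,2} = 0.2$ and $\prob_{\tup_2,0} = 0.1$, $\prob_{\tup_2,1} = 0.9$, $\prob_{\tup_2,2} = 0$. Then
\[
\mathbb{E}\left[W_{\tup_1}\right] = 0 \cdot 0.5 + 1 \cdot 0.3 + 2 \cdot 0.2 = 0.7,
\qquad
\mathbb{E}\left[W_{\tup_2}\right] = 0 \cdot 0.1 + 1 \cdot 0.9 + 2 \cdot 0 = 0.9.
\]
If the multiplicity of a result tuple is $W_{\tup_1} \cdot W_{\tup_2} + W_{\tup_1}$ (for instance, the union of a join of the two tuples with a selection of $\tup_1$), then linearity of expectation and independence give
\[
\mathbb{E}\left[W_{\tup_1} \cdot W_{\tup_2} + W_{\tup_1}\right]
= \mathbb{E}\left[W_{\tup_1}\right] \cdot \mathbb{E}\left[W_{\tup_2}\right] + \mathbb{E}\left[W_{\tup_1}\right]
= 0.7 \cdot 0.9 + 0.7 = 1.33.
\]
Powers of the same variable, as produced by self-joins, cannot be handled by multiplying expectations: here $\mathbb{E}\left[W_{\tup_1}^2\right] = 1 \cdot 0.3 + 4 \cdot 0.2 = 1.1 \neq 0.49 = \mathbb{E}\left[W_{\tup_1}\right]^2$, so such terms are evaluated directly from the listed distribution of $\tup_1$.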
|
||||
|
||||
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an analogous deterministic query, which we make more precise shortly.
|
||||
If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. To analyze this question we denote by $\timeOf{}^*(Q,\pdb)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$.
|
||||
|
||||
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$, deterministic database $\gentupset$, and multiplicity bound $\bound$. This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{table*}[t!]
|
||||
\centering
|
||||
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
|
||||
|
@ -78,22 +92,25 @@ $\omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{C_0}}$
|
|||
$\Omega\inparen{\inparen{\qruntime{\optquery{\qhard}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\
|
||||
\hline
|
||||
\end{tabular}
|
||||
\caption{Our lower bounds for a specific hard query $\qhard$ parameterized by $k$. For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e. probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
|
||||
\caption{Our lower bounds for $\qhard$, any query from a class of hard queries parameterized by $k$. For $\pdb = \inparen{\worlds, \bpd}$, those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e., multiple probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_0,C_0,c_0$ are constants that are independent of $k$).}
|
||||
\label{tab:lbs}
|
||||
\vspace{-0.73cm}
|
||||
\end{table*}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\mypar{Our lower bound results}
|
||||
Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq\qruntime{\optquery{\query}, \tupset, \bound}$. Unfortunately this is not the case.
|
||||
%
|
||||
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a deterministic database $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
|
||||
Our question is whether or not it is always true that $\timeOf{}^*\inparen{\query, \pdb}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. Unfortunately this is not the case.
|
||||
\Cref{tab:lbs} shows our results.
|
||||
|
||||
Specifically, depending on what hardness result/conjecture we assume, we get various weaker or stronger versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in \Cref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}.) of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
|
||||
Specifically, depending on what hardness result/conjecture we assume, we get various weaker or stronger versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in \Cref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width of the query $\query$ over all result tuples $\tup$ (our notion of join width follows from~\Cref{def:degree-of-poly} and~\Cref{fig:nxDBSemantics}) and is also the parameter that defines our family of hard queries.
|
||||
|
||||
Our lower bound in the third row says that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\Cref{prob:expect-mult}.
|
||||
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results\footnote{This claim follows from known results for the problem of counting $k$-cliques, i.e., for a query $\query$ over database $\tupset$ that counts the number of $k$-cliques. First, a lower bound of the form $\Omega\inparen{n^{1+\eps_0}}$ for {\em some} $\eps_0>0$ follows from the triangle detection hypothesis (this, like our result, is for $k=3$). Second, a lower bound of $\omega\inparen{n^{C_0}}$ for {\em all} $C_0>0$ holds for counting $k$-cliques under the assumption $\sharpwzero\ne\sharpwone$~\cite{10.5555/645413.652181}. Finally, a lower bound of $\Omega\inparen{n^{c_0\cdot k}}$ for {\em some} $c_0>0$ was shown by~\cite{CHEN20061346} (under the strong exponential time hypothesis).
|
||||
} already imply the claimed lower bounds if we replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar = |\tupset|$ (indeed these results follow from known lower bounds for deterministic query processing). Our contribution is to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
|
||||
|
||||
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that computes ~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query and bag \abbrPDB\xplural are deployable in practice.
|
||||
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that computes~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means that, when we are okay with approximation, we can solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query\BG{What is the size of the deterministic query?}. % and bag \abbrPDB\xplural are deployable in practice.
|
||||
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\qruntime{\optquery{\query}, \tupset, \bound}^{2k})$
|
||||
(see \Cref{sec:karp-luby}).
|
||||
Further, our approximation algorithm works for a more general notion of bag \abbrPDB\xplural beyond \abbrCTIDB\xplural
|
||||
|
@ -257,8 +274,8 @@ Although \abbrPDB queries instead convey the trustworthiness of results~\cite{ku
|
|||
Bags, as we consider, are sufficient for production use, where bag-relational algebra is already the default for performance reasons.
|
||||
Our results show that bag-\abbrPDB\xplural can be competitive, laying the groundwork for probabilistic functionality in production database engines.
|
||||
|
||||
\mypar{Concurrent Work}
|
||||
In work independent of ours, Grohe et al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} investigate bag-\abbrTIDB\xplural allowing for unbounded multiplicities (which requires them to explicitly address the issue of a succinct representation of the distribution over infinitely many multiplicities). While the authors observe that computing the expected value of an output tuple multiplicity is in polynomial time, no further (fine-grained) analysis of the expected value is considered. The work primarily investigates the query evaluation problem over bag-\abbrTIDB\xplural when computing the probability of an output tuple having a multiplicity of at most $k$, showing that a dichotomy exists for this problem. Our work, in contrast, assumes a finite bound on the multiplicities, where we simply list the finitely many probability values (and hence do not need to consider a more succinct representation). Further, our work primarily looks into the fine-grained analysis of computing the expected multiplicity of an output tuple.
|
||||
% \mypar{Concurrent Work}
|
||||
% In work independent of ours, Grohe, et. al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} investigate bag-\abbrTIDB\xplural allowing for unbounded multiplicities (which requires them to explicitly address the issue of a succinct representation of the distribution over infinitely many multiplicities). While the authors observe that computing the expected value of an output tuple multiplicity is in polynomial time, no further (fine-grained) analysis of the expected value is considered. The work primarily investigates the query evaluation problem over bag-\abbrTIDB\xplural when computing the probability of an output tuple having at most a multiplicity of $k$, showing that a dichotomy exists for this problem. Our work in contrast assumes a finite bound on the multiplicities where we simply list the finitely many probability values (and hence do not need consider a more succinct representation). Further, our work primarily looks into the fine-grained analysis of computing the expected multiplicity of an output tuple.
|
||||
|
||||
|
||||
|
||||
|
|