Aaron Huber 2021-08-31 15:06:30 -04:00
commit ed16f49249
3 changed files with 66 additions and 48 deletions

1
.gitignore vendored

@@ -12,3 +12,4 @@
*.xoj
*.auxlock
*.vtc
auto


@@ -2,14 +2,24 @@
%root: main.tex
\section{Introduction (Rewrite - 070921)}\label{sec:intro-rewrite-070921}
\input{two-step-model}
A probabilistic database (or PDB) $\pdb$ is a pair $\inparen{\idb, \pd}$ such that $\idb$ is a set of deterministic database instances (possible worlds) and $\pd$ is a probability distribution over $\idb$.
In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ is the multiplicity of its corresponding output tuple $\tup$ (in a random database instance in $\idb$ chosen according to $\pd$).
In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
\begin{Problem}\label{prob:bag-pdb-query-eval}
Given a query $\query$ from the set of positive relational algebra queries\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} ($\raPlus$), compute the expected\footnote{Unless stated otherwise, we assume the implicit probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$.}
multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$)
of output tuple $\tup$. We are interested in the data complexity of this problem (i.e., we think of $\query$ as being of constant size).
A probabilistic database (PDB) $\pdb$ is a tuple $\inparen{\idb, \pd}$ such that $\idb$ is a set of deterministic database instances called possible worlds and $\pd$ is a probability distribution over $\idb$.
A commonly studied problem in probabilistic databases is, given a query $\query$, a PDB $\pdb$, and a possible query result tuple $\tup$, to compute the \textit{marginal probability} that $\tup$ is in the query's result, i.e., to compute the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise. In this work, we are interested in bag semantics, where each tuple $\tup$ is associated with a multiplicity $\db(\tup)$ from $\semN$ in each possible world.\footnote{We find it convenient to use the notation from~\cite{DBLP:conf/pods/GreenKT07}, which models bag relations as functions that map tuples to their multiplicities.}
We refer to such a probabilistic database as a bag-probabilistic database or \abbrBPDB for short.
The natural generalization of the problem of computing marginal probabilities of query result tuples to bag semantics is to compute the expectation of a random variable over $\pd$ that takes the value $\query(\db)(\tup)$ in world $\db$:
% In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ is the multiplicity of its corresponding output tuple $\tup$ (in a random database instance in $\idb$ chosen according to $\pd$).
%In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Problem}[Expected Multiplicity]\label{prob:bag-pdb-query-eval}
Given a positive relational algebra query ($\raPlus$)\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} $\query$, \abbrBPDB $\pdb$, and output tuple $\tup$, compute the expected
multiplicity ($\expct_\pd\pbox{\query\inparen{\pdb}\inparen{\tup}}$)
of tuple $\tup$.
\end{Problem}
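As a sketch of the quantity in \Cref{prob:bag-pdb-query-eval}, unrolling the expectation over the possible worlds gives the sum below. Here we write $\pd\inparen{\db}$ for the probability that $\pd$ assigns to world $\db$ and assume that $\idb$ is finite; both are conventions adopted only for this illustration.
\begin{equation*}
\expct_\pd\pbox{\query\inparen{\pdb}\inparen{\tup}} = \sum_{\db \in \idb} \pd\inparen{\db} \cdot \query\inparen{\db}\inparen{\tup}
\end{equation*}
For instance, if $\idb$ consisted of two hypothetical worlds $\db_1$ and $\db_2$ with $\pd\inparen{\db_1} = 0.3$, $\pd\inparen{\db_2} = 0.7$, $\query\inparen{\db_1}\inparen{\tup} = 2$, and $\query\inparen{\db_2}\inparen{\tup} = 0$, then the expected multiplicity of $\tup$ would be $0.3 \cdot 2 + 0.7 \cdot 0 = 0.6$.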
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We are mostly interested in the data complexity of this problem (i.e., we think of $\query$ as being of constant size). Unless stated otherwise, we implicitly assume the probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$. It has been shown that the problem of computing the marginal probability of a query result tuple can be reduced to the problem of computing the probability that the lineage formula of the tuple evaluates to true. The lineage formula of a tuple is a propositional formula over Boolean random variables representing the tuples of $\pdb$. The bag semantics analog of a lineage formula is a provenance polynomial, a polynomial with integer coefficients and exponents over integer random variables (representing the multiplicities of input tuples). We show that \Cref{prob:bag-pdb-query-eval} corresponds to the problem of computing the expectation of such a polynomial. Our main technical focus is on studying the complexity of this problem for various encodings of such polynomials. However, as we will show, these results also have implications for \cref{prob:bag-pdb-query-eval} when considering the cost of generating the polynomials of query result tuples.
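To illustrate the correspondence described above, consider a hypothetical output tuple that is produced either by joining two input tuples with multiplicities $X$ and $Y$, or directly from a third input tuple with multiplicity $Z$, with the two derivations combined by a union. The variables $X$, $Y$, and $Z$ are names chosen only for this sketch. Under the semiring provenance model of~\cite{DBLP:conf/pods/GreenKT07}, a join multiplies annotations and a union adds them, so the provenance polynomial of the output tuple is $X \cdot Y + Z$, and by linearity of expectation
\begin{equation*}
\expct\pbox{X \cdot Y + Z} = \expct\pbox{X \cdot Y} + \expct\pbox{Z}.
\end{equation*}
The product term factors further into $\expct\pbox{X} \cdot \expct\pbox{Y}$ when $X$ and $Y$ are independent; for an arbitrary $\pd$ no such factorization is available in general.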
Solving~\cref{prob:bag-pdb-query-eval} for arbitrary $\pd$ is hopeless since we need exponential space to represent an arbitrary $\pd$.
We initially focus on tuple-independent probabilistic bag-databases (\abbrTIDB), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world can be modeled as an independent probabilistic event\footnote{
@@ -245,3 +255,9 @@ To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample mon
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:


@@ -124,6 +124,7 @@
%PDB Abbreviations
\newcommand{\abbrPDB}{\textnormal{PDB}\xspace}
\newcommand{\abbrBPDB}{\textnormal{BPDB}\xspace}
\newcommand{\abbrTIDB}{\textnormal{TIDB}\xspace}%replace \ti with this
\newcommand{\abbrBIDB}{\textnormal{BIDB}\xspace}
\newcommand{\ti}{TIDB\xspace}