Merge branch 'master' of https://gitlab.odin.cse.buffalo.edu/ahuber/SketchingWorlds

2021-08-31 15:06:30 -04:00 · 2021-08-31 15:06:30 -04:00 · ed16f49249
parent 1535b692cf 30433473d8
commit ed16f49249
3 changed files with 66 additions and 48 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -12,3 +12,4 @@
 *.xoj
 *.auxlock
 *.vtc
+auto
--- a/intro-rewrite-070921.tex
+++ b/intro-rewrite-070921.tex
@@ -2,14 +2,24 @@
 %root: main.tex
 \section{Introduction (Rewrite - 070921)}\label{sec:intro-rewrite-070921}
 \input{two-step-model}
-A probabilistic database (or PDB) $\pdb$ is a pair $\inparen{\idb, \pd}$ such that $\idb$ is a set of deterministic database instances (possible worlds) and $\pd$ is a probability distribution over $\idb$.  
-In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ is the multiplicity of its corresponding output tuple $\tup$ (in a random database instance in $\idb$ chosen according to $\pd$).  
-In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
-\begin{Problem}\label{prob:bag-pdb-query-eval}
-Given a query $\query$ from the set of positive relational algebra queries\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} ($\raPlus$), compute the expected\footnote{Unless stated otherwise, we assume the implicity probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$.}
-multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$)
-of output tuple $\tup$. We are interested in the data complexity of this problem (i.e. we think of $Q$ as being of constant size).
+A probabilistic database (PDB) $\pdb$ is a tuple $\inparen{\idb, \pd}$ such that $\idb$ is a set of deterministic database instances called possible worlds and $\pd$ is a probability distribution over $\idb$.
+A commonly studied problem in probabilistic databases is the following: given a query $\query$, a PDB $\pdb$, and a possible query result tuple $\tup$, compute the \textit{marginal probability} that $\tup$ is in the query's result, i.e., the expectation of a Boolean random variable over $\pd$ that is $1$ for every $\db \in \idb$ for which $\tup \in \query(\db)$ and $0$ otherwise. In this work, we are interested in bag semantics, where each tuple $\tup$ is associated with a multiplicity $\db(\tup)$ from $\semN$ in each possible world.\footnote{We find it convenient to use the notation from~\cite{DBLP:conf/pods/GreenKT07}, which models bag relations as functions that map tuples to their multiplicity.}
+We refer to such a probabilistic database as a bag-probabilistic database or \abbrBPDB for short.
+The natural generalization of the problem of computing marginal probabilities of query result tuples to bag semantics is to compute the expectation of a random variable over $\pd$ that takes the value $\query(\db)(\tup)$ in world $\db$:
+
+% In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ is the multiplicity of its corresponding output tuple $\tup$ (in a random database instance in $\idb$ chosen according to $\pd$).
+%In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{Problem}[Expected Multiplicity]\label{prob:bag-pdb-query-eval}
+Given a positive relational algebra query  ($\raPlus$)\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).}   $\query$, \abbrBPDB $\pdb$, and output tuple $\tup$, compute the expected
+multiplicity ($\expct_\pd\pbox{\query\inparen{\pdb}\inparen{\tup}}$)
+of tuple $\tup$.
 \end{Problem}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+We are mostly interested in the data complexity of this problem (i.e., we treat $\query$ as being of constant size). Unless stated otherwise, we implicitly assume the probability distribution $\pd$ and, for notational convenience, write $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$. It has been shown that the problem of computing the marginal probability of a query result tuple can be reduced to the problem of computing the probability that the lineage formula of the tuple evaluates to true. The lineage formula of a tuple is a propositional formula over Boolean random variables representing the tuples of $\pdb$. The bag semantics analog of a lineage formula is a provenance polynomial, a polynomial with integer coefficients and exponents over integer random variables (representing the multiplicities of input tuples), and we show that \Cref{prob:bag-pdb-query-eval} corresponds to the problem of computing the expectation of such a polynomial. Our main technical focus is on studying the complexity of this problem for various encodings of such polynomials. However, as we will show, these results also have implications for \cref{prob:bag-pdb-query-eval} when considering the cost of generating the polynomials of query result tuples.
+

 Solving~\cref{prob:bag-pdb-query-eval} for arbitrary $\pd$ is hopeless since we need exponential space to represent an arbitrary $\pd$.
 We initially focus on tuple-independent probabilistic bag-databases (\abbrTIDB), a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world can be modeled as an independent probabilistic event\footnote{
@@ -245,3 +255,9 @@ To get an $(1\pm \epsilon)$-multiplicative approximation we uniformly sample mon

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.
+
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "main"
+%%% End:
--- a/macros.tex
+++ b/macros.tex
@@ -124,6 +124,7 @@

 %PDB Abbreviations
 \newcommand{\abbrPDB}{\textnormal{PDB}\xspace}
+\newcommand{\abbrBPDB}{\textnormal{BPDB}\xspace}
 \newcommand{\abbrTIDB}{\textnormal{TIDB}\xspace}%replace \ti with this
 \newcommand{\abbrBIDB}{\textnormal{BIDB}\xspace}
 \newcommand{\ti}{TIDB\xspace}