Incorporated @oliver 120920 pdf suggestions.

Aaron Huber 2020-12-09 16:51:37 -05:00
parent a9c3d362ee
commit dcf0ec7c9d

@ -2,9 +2,12 @@
\section{Introduction}
Modern production databases, e.g., Postgres, Oracle, etc. use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting, a known $\sharpP$ problem.
Modern production databases (e.g., Postgres and Oracle) use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting (a known $\sharpP$-hard problem)
%the annotation of the tuple is a lineage formula ~\cite{DBLP:series/synthesis/2011Suciu}, which can essentially be thought of as a boolean formula. It is known that computing the probability of a lineage formula is \#-P hard in general
~\cite{DBLP:series/synthesis/2011Suciu}. In PDBs, a boolean formula encodes the conditions under which each output tuple appears in the result. This formula is also called a lineage formula ~\cite{DBLP:series/synthesis/2011Suciu}. The marginal probability of a tuple is its probability to appear in a possible world. The set of variables are each mapped to a probability, and by substituting the probability mappings of each variable into the lineage formula, one can compute the marginal probability. However, instead of using boolean lineage formulas, the bag case requires the use of polynomials over random variables to represent the probability distribution of the multiplicity of input tuples. In this case, the polynomial is interpreted as the probability of the input tuple contributing to the multiplicity of the output tuple. Or in other words, the expectation of the polynomial is the expected multiplicity of the output tuple. Due to linearity of expectation, computing the expectation of the polynomial in the bag setting is linear in the number of terms of the expanded formula, with the result that many regard bags to be easy. In this work we consider compressed representations of the lineage formula showing that the complexity landscape becomes much more nuanced, and is \textit{not} linear in general. Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases.
~\cite{DBLP:series/synthesis/2011Suciu}. In PDBs, a boolean formula, also called a lineage formula~\cite{DBLP:series/synthesis/2011Suciu}, encodes the conditions under which each output tuple appears in the result. The marginal probability of this formula being true is the probability that the tuple appears in a possible world. Each variable is mapped to a probability value, and the marginal probability is computed from these values. The corresponding problem for bag PDBs is computing the expected multiplicity of the output tuple, which requires polynomials to represent the probability distribution of the multiplicity of input tuples.
%In this case, the polynomial is interpreted as the probability of the input tuple contributing to the multiplicity of the output tuple. Or in other words, the expectation of the polynomial is the expected multiplicity of the output tuple.
The standard representation for lineage formulas in PDBs is sum of products (SOP), which is much larger than the lineage-free representation that deterministic databases employ.
Due to linearity of expectation, computing the expectation of the polynomial in the bag setting is linear in the number of terms in the SOP formula, and as a result many regard bag PDBs as easy. In this work we consider compressed representations of the lineage formula. We show that the complexity landscape becomes much more nuanced and is \textit{not} linear in general. The compressed representation of the formula is analogous to deterministic query optimizations (e.g., pushing down projections). Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases.
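As an illustrative sketch (the variables $X_1, X_2$ and the polynomial below are hypothetical and chosen only for exposition), let $X_1, X_2$ be independent random variables over $\{0, 1\}$ with $P[X_i = 1] = p_i$. For the compressed polynomial $(X_1 + X_2)^2$, expanding to SOP and using $X_i^2 = X_i$ for $\{0, 1\}$-valued variables gives
\[
E\left[(X_1 + X_2)^2\right] = E[X_1^2] + 2E[X_1 X_2] + E[X_2^2] = p_1 + 2 p_1 p_2 + p_2,
\]
whereas naively substituting probabilities into the compressed form yields $(p_1 + p_2)^2 = p_1^2 + 2 p_1 p_2 + p_2^2$, which differs whenever some $p_i \in (0, 1)$. Linearity of expectation thus applies term by term to the SOP form, but it does not by itself license evaluating a compressed form directly.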
@ -70,8 +73,9 @@ Modern production databases, e.g., Postgres, Oracle, etc. use bag semantics. In
%\end{figure}
\begin{Example}\label{ex:intro}
Assume a set semantics setting. Suppose we are given a Tuple Independent Database ($\ti$), which is a PDB whose tuples are independent. We are given the following boolean query $\poly() :- R(A), E(A, B), R(B)$, where the lineage of the output will consist of the products of all input tuple lineages whose combination satisfies the join condition, summed together. The $\ti$ example instances are given in~\cref{fig:intro-ex}. While for completeness we should include annotations for Table E, since each tuple has a probability of $1$, we drop them for simplicity. The attribute column $\Phi$ contains its respective tuple's marginal probability. %Finally, see that the tuples in table E can be visualized as the graph in ~\cref{fig:intro-ex-graph}.
This query is hard in set semantics because of correlations in the lineage formula, but under bag semantics with a polynomial formula representing the multiple contributing tuples from the input set $\ti$, it is easy since we enjoy linearity of expectation.
Assume a set semantics setting. Suppose we are given a Tuple Independent Database ($\ti$), which is a PDB whose tuples are independent. We are given the following boolean query $\poly() :- R(A), E(A, B), R(B)$. The lineage of the output is computed by adding variables when a union operation is performed and by multiplying variables for a join operation. This yields the sum of the products of all input tuple lineages whose combination satisfies the join condition. A $\ti$ example instance is given in~\cref{fig:intro-ex}. For completeness we should include random variables for Table E, but since each of its tuples has a probability of $1$, we drop them for simplicity. The attribute column $\Phi$ contains each tuple's respective random variable $W_i$, where $P[W_i = 1]$ is the tuple's marginal probability. %Finally, see that the tuples in table E can be visualized as the graph in ~\cref{fig:intro-ex-graph}.
Next we explain why this query is hard in set semantics % due to correlations in the lineage formula. But
and easy under bag semantics.% with a polynomial formula representing the multiple contributing tuples from the input set $\ti$, it is easy since we enjoy linearity of expectation.
\end{Example}
Our work also handles Block Independent Disjoint Databases ($\bi$), a PDB model in which tuples are arranged in blocks; blocks are independent of one another, while tuples within the same block are mutually exclusive. For now, let us consider the $\ti$ model. In the example, $Dom(W_i) = \{0, 1\}$, and we consider a fixed probability $\prob$ for all tuple variables such that $P[W_i = 1] = \prob$. We also note explicitly that the input tables are \textit{sets}. When we speak of \textit{bag semantics}, the difference is that the output tuple may have duplicates; in other words, we consider the query output (over set instances) in the bag context.
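To make the contrast concrete, consider a small hypothetical instance (not the one in~\cref{fig:intro-ex}): $R = \{a, b, c\}$ with tuple variables $W_a, W_b, W_c$, and $E = \{(a, b), (b, c)\}$ with both tuples certain. Under the assumptions above, the lineage of the query is $W_a W_b + W_b W_c$. Under bag semantics, linearity of expectation and independence give the expected multiplicity
\[
E[W_a W_b + W_b W_c] = E[W_a]E[W_b] + E[W_b]E[W_c] = 2{\prob}^2.
\]
Under set semantics, the two terms share $W_b$ and are therefore correlated, so the probability of the corresponding boolean formula is
\[
P[(W_a \wedge W_b) \vee (W_b \wedge W_c)] = P[W_b = 1] \cdot \left(1 - (1 - \prob)^2\right) = 2{\prob}^2 - {\prob}^3,
\]
which is not the sum of the two term probabilities.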