Finished @oliver 121020 suggestions modulo last Riot p suggestion

This commit is contained in:
Aaron Huber 2020-12-11 11:48:55 -05:00
parent 8211a9bfa0
commit ba79c9ffd7


@@ -2,12 +2,14 @@
\section{Introduction}
Modern production databases like Postgres and Oracle use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting (a known $\sharpP$-hard problem).
%the annotation of the tuple is a lineage formula ~\cite{DBLP:series/synthesis/2011Suciu}, which can essentially be thought of as a boolean formula. It is known that computing the probability of a lineage formula is \#-P hard in general
In PDBs, a boolean formula~\cite{DBLP:series/synthesis/2011Suciu}, also called a lineage formula, encodes the conditions under which each output tuple appears in the result.
%The marginal probability of this formula being true is the tuple's probability to appear in a possible world.
The variables in a lineage formula are each drawn from a probability distribution, from which the marginal probability of the formula can be computed. The corresponding problem for bag PDBs is computing the expected multiplicity of an output tuple, where polynomials are used to represent the probability distribution of the multiplicity of input tuples.
%In this case, the polynomial is interpreted as the probability of the input tuple contributing to the multiplicity of the output tuple. Or in other words, the expectation of the polynomial is the expected multiplicity of the output tuple.
The standard representation for lineage formulas in PDBs is the sum of products (SOP). The SOP is essentially the expansion of all products of sums, so that the formula becomes a sum of products of variables. The SOP representation is much larger than the lineage-free representation that deterministic databases employ.
Due to linearity of expectation, computing the expectation of tuple multiplicities is linear in the number of terms in the SOP formula, so many regard bags to be easy. In this work we consider compressed representations of the lineage formula. We show that the complexity landscape becomes much more nuanced, and is \textit{not} linear in general. Such compressed representations of the formula are analogous to deterministic query optimizations (e.g., pushing down projections). In this work, we define hard to be anything greater than linear time. Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases and are hard in general. This makes it desirable to find a linear time approximation algorithm.
@@ -73,40 +75,58 @@ Due to linearity of expectation, computing the expectation of the polynomial in
%\end{figure}
\begin{Example}\label{ex:intro}
Assume a set semantics setting. Suppose we are given a Tuple Independent Database ($\ti$), which is a PDB whose tuples are independently present or not. We are given the following boolean query $\poly() :- R(A), E(A, B), R(B)$. The lineage of the output is computed by adding polynomials when a union operation is performed, and by multiplying polynomials for a join operation. This yields the products of all input tuple lineages whose combination satisfies the join condition, summed together. A $\ti$ example instance is given in~\cref{fig:intro-ex}. The attribute column $\Phi$ contains each tuple's respective random variable, where $P[W_i = 1]$ is its marginal probability. While for completeness we should include random variables for Table E, since each of its tuples has probability $1$, we drop them for simplicity. %Finally, see that the tuples in table E can be visualized as the graph in ~\cref{fig:intro-ex-graph}.
Next we explain why this query is hard in set semantics % due to correlations in the lineage formula. But
and easy under bag semantics.% with a polynomial formula representing the multiple contributing tuples from the input set $\ti$, it is easy since we enjoy linearity of expectation.
\end{Example}
Our work also handles Block Independent Disjoint Databases ($\bi$), a PDB model in which tuples are arranged in blocks, where all blocks are independent from one another, but tuples within the same block are mutually exclusive. For now, let us consider the $\ti$ model. In the example we consider a fixed probability $\prob$ for all tuple variables such that $P[W_i = 1] = \prob$. Let us also be explicit in mentioning that the input tables are \textit{sets}, i.e., $Dom(W_i) = \{0, 1\}$; the difference when we speak of bag semantics is that we allow the query output to contain duplicates. In other words, we are thinking about query output (over set instances) in the bag context.
To contrast the bag/polynomial and set/lineage interpretations, we provide another example.
\begin{Example}\label{ex:bag-vs-set}
The query in ~\cref{ex:intro} has the following output lineage formula (top) and output polynomial (bottom).
\begin{align*}
&\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a\\
&\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a
\end{align*}
Notice that in the set/lineage setting above, $\poly: \mathbb{B}^3 \mapsto \mathbb{B}$, while under bag/polynomial semantics we define $\poly: \mathbb{N}^3 \mapsto \mathbb{N}$.
Assume the following $\mathbb{B}/\mathbb{N}$ variable assignments: $W_a\mapsto T/1, W_b \mapsto T/1, W_c \mapsto F/0.$ Then the polynomials evaluate as
\begin{align*}
&\poly(T, T, F) = TT \vee TF \vee FT = T\\
&\poly(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
\end{align*}
In the set/lineage setting, we find that the boolean query is satisfied, while in the bag evaluation we see how many combinations of the input satisfy the query.
\end{Example}
Note that computing the probability of the query in ~\cref{ex:intro} under set semantics is indeed $\sharpP$ hard, since it is a query that is non-hierarchical
%, i.e., for $Vars(\poly)$ denoting the set of variables occuring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$,
~\cite{10.1145/1265530.1265571}. %Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e. the probability of the output with annotation $\poly(W_a, W_b, W_c)$, ($\prob(q)$ in Dalvi, Suciu) is hard in set semantics.
To see why this computation is hard for query $\poly$ over set semantics, consider the output lineage formula $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. The conjunctive clauses are not independent of one another, and the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$:
\begin{equation*}
\expct\pbox{\poly(W_a, W_b, W_c)} = P[W_aW_b] + P[W_a\overline{W_b}W_c] + P[\overline{W_a}W_bW_c] = \prob^2 + 2\prob^2(1 - \prob) = 3\prob^2 - 2\prob^3
\end{equation*}
In general, such a computation can be exponential in the size of the database.
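As a sanity check, the same value follows from inclusion-exclusion over the three clauses, since each pairwise intersection of clauses (and the triple intersection) is the event $W_aW_bW_c$:
\begin{align*}
P[W_aW_b \vee W_bW_c \vee W_cW_a] &= P[W_aW_b] + P[W_bW_c] + P[W_cW_a] - 3 \cdot P[W_aW_bW_c] + P[W_aW_bW_c]\\
&= 3\prob^2 - 3\prob^3 + \prob^3 = 3\prob^2 - 2\prob^3.
\end{align*}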
%Using Shannon's Expansion,
%\begin{align*}
%&W_aW_b \vee W_bW_c \vee W_cW_a
%= &W_a
%\end{align*}
However, in the bag setting, the polynomial is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$. To reiterate, the output lineage formula is produced from a query over a set $\ti$ input, where duplicates are allowed in the output. The expectation computation over the output lineage is a computation of the expected multiplicity of an output tuple across possible worlds. In ~\cref{ex:intro}, the expectation is simply
\begin{align*}
&\expct\pbox{\poly(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
= &\expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}\\
= &\prob^2 + \prob^2 + \prob^2 = 3\prob^2.
\end{align*}
Computing such expectations is indeed linear in the size of the SOP, as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial. The above equalities hold since expectation is linear over addition of the natural numbers. We were also able to push expectation into the products due to the $\ti$ independence property, where all variables are independent. Note that the answer is the same as substituting $\prob$ in for each variable: $\poly(\prob, \prob, \prob) = \prob \cdot \prob + \prob \cdot \prob + \prob \cdot \prob = 3\prob^2$. This, however, is coincidental and not true in the general case.
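For completeness, the individual expectations used above follow directly from the $\ti$ model, where each $W_i$ is an independent $\{0, 1\}$-valued variable with $P[W_i = 1] = \prob$:
\begin{equation*}
\expct\pbox{W_i} = 1 \cdot P[W_i = 1] + 0 \cdot P[W_i = 0] = \prob, \qquad \expct\pbox{W_iW_j} = \expct\pbox{W_i}\expct\pbox{W_j} = \prob^2 \text{ for } i \neq j.
\end{equation*}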
Now, consider the query
\begin{equation*}
\poly^2() := \rel(A), E(A, B), \rel(B), \rel(C), E(C, D), \rel(D).
\end{equation*}
For an arbitrary lineage formula, which we can view as a polynomial, there may exist equivalent compressed representations of the polynomial. One such compression is the factorized polynomial~\cite{10.1145/3003665.3003667}, where the polynomial can be broken up into separate factors. %Another form of the polynomial is the SOP, which is the expansion of the factorized polynomial by multiplying out all terms, and in general is exponentially larger (in the number of products) than the factorized version.
A factorized polynomial of $\poly^2$ is
@@ -153,7 +173,7 @@ This factorized expression can be easily modeled as an expression tree as depict
\label{fig:intro-q2-etree}
\end{figure}
In contrast, the equivalent SOP representation is
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
@@ -172,17 +192,29 @@ The expectation then is
In this case, even though we substitute probability values in for each variable, $\poly^2(\prob, \prob, \prob)$ is not the answer we seek, since for a random variable $X$, $\expct\pbox{X^2} = \sum_{x \in Dom(X)} x^2 \cdot p(x)$, which in general does not equal $\left(\expct\pbox{X}\right)^2$. Intuitively, bags are only hard with self-joins.\AH{Atri suggests a proof in the appendix regarding the last claim.}
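To spell this out for our setting, where each $W_i$ has $Dom(W_i) = \{0, 1\}$ and $P[W_i = 1] = \prob$, we have
\begin{equation*}
\expct\pbox{W_i^2} = 0^2 \cdot (1 - \prob) + 1^2 \cdot \prob = \prob \neq \prob^2 = \left(\expct\pbox{W_i}\right)^2,
\end{equation*}
so naively evaluating $\poly^2$ at $\prob$ replaces each $\expct\pbox{W_i^2}$ with $\prob^2$ and undercounts every term containing a squared variable.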
Define $\rpoly^2(\vct{X})$ to be the resulting polynomial when all exponents $e > 1$ are set to $1$ in $\poly^2$. For example, when we have
\begin{align*}
&\poly^2(W_a, W_b, W_c) = W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c\\
&+ 2W_aW_bW_c^2,
\end{align*}
then
\begin{align*}
&\rpoly^2(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
&+ 2W_aW_bW_c\\
&= W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c.
\end{align*}
Note that this structure $\rpoly^2(\prob, \prob, \prob)$ is exactly the expectation we computed, since it is always the case that $i^2 = i$ for all $i$ in $\{0, 1\}$. And the expected multiplicity for $\poly^2()$ is still computable in linear time in the size of the output polynomial, compressed or SOP.
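Concretely, substituting $\prob$ for each variable in $\rpoly^2$ gives
\begin{equation*}
\rpoly^2(\prob, \prob, \prob) = \prob^2 + \prob^2 + \prob^2 + 6\prob^3 = 3\prob^2 + 6\prob^3,
\end{equation*}
which matches computing $\expct\pbox{\poly^2(W_a, W_b, W_c)}$ term by term, since $\expct\pbox{W_i^e} = \prob$ for every exponent $e \geq 1$.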
A compressed polynomial can be exponentially smaller in $k$ for $k$-products. It is also always the case that computing the expectation of an output polynomial in SOP form is linear in the size of the polynomial, since expectation can be pushed through addition.
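As a purely illustrative example of the first claim (over fresh variables $X_1, Y_1, \ldots, X_k, Y_k$, not drawn from the queries above), the factorized form
\begin{equation*}
\prod_{i = 1}^{k} (X_i + Y_i) = \sum_{S \subseteq \{1, \ldots, k\}} \prod_{i \in S} X_i \prod_{i \notin S} Y_i
\end{equation*}
uses only $2k$ variable occurrences, while its SOP expansion on the right has $2^k$ monomials.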
This work seeks to explore the complexity landscape for compressed representations of polynomials. Note that when we are linear in the size of the lineage formula, we essentially have a runtime equivalent to deterministic query complexity.
Up to this point, the message seems consistent that bags are always easy in the size of the SOP representation, which raises the question:
\begin{Question}
Is it always the case that bags are easy in the size of the \emph{compressed} polynomial?
\end{Question}
If bags \textit{are} always easy for any compressed version of the polynomial, then there is no need for improvement. But, if provably not, then the option to approximate the computation over a compressed polynomial in linear time is critical for making PDBs practical.
Consider the query
\begin{equation*}