\BG{In this paragraph we are not stating the precise problem that we are trying to solve yet. What I mean is: what is the input, what is the output. The case for sets is clear now, but bags require a bit of extra explanation because there is no marginal probability of output tuples and the input encoding also needs to be made clear.}
In practice, modern production databases (e.g., Postgres and Oracle) use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing expectations and other moments\BG{for what?} is analogous to counting the number of solutions to a boolean formula\BG{call it by its name: weighted model counting}, a known \#P-hard\BG{\#P?} problem
%the annotation of the tuple is a lineage formula ~\cite{DBLP:series/synthesis/2011Suciu}, which can essentially be thought of as a boolean formula. It is known that computing the probability of a lineage formula is \#-P hard in general
~\cite{DBLP:series/synthesis/2011Suciu}. In PDBs, the boolean formula\BG{maybe say a bit more here: what does this boolean formula encode?} is called a lineage formula~\cite{DBLP:series/synthesis/2011Suciu}, a formula generated by query processing\BG{do we have one formula per query? state what the formula encodes (see above)}. However, computing expectation\BG{of what?} in the bag setting is linear\BG{in what?}, with the result that many regard bags as easy. In this work we consider compressed representations of the lineage formula\BG{maybe we should say that in the bag case instead of using boolean lineage formulas, we have to use polynomials over random variables representing the probability distribution of the multiplicity of input tuples.}, showing that the complexity landscape becomes much more nuanced, and is not linear\BG{the complexity landscape is linear?} in general.
%Consider an arbitrary output polynomial $\poly$. Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.
\BG{I think we need to clarify that this example is sets first}
Suppose we are given the following boolean query $\poly() :- R(A), E(A, B), R(B)$ over a Tuple Independent Database ($\ti$)\BG{briefly say what this is}, where the annotation\BG{we have not used the term yet, so it should be explained} of the output will consist of all contributing tuple annotations\BG{same here}. The $\ti$ relations\BG{example instances?} are given in~\cref{fig:intro-ex}. While for completeness we should include annotations for table E, we drop them for simplicity, since each tuple has a probability of $1$. Note that the attribute column $\Phi$ contains a variable/value, where in the former case the variable ranges over $[0, 1]$ denoting its\BG{what is its here?} marginal probability of appearing in the set of possible worlds\BG{that is strangely worded. The probability of appearing in the set of possible worlds is always 1 if the tuple exists in at least one world. I think (marginal) probability of the input tuple would be enough.}, and the latter is the fixed (marginal) probability of the tuple across the set of possible worlds.\BG{The previous sentence is a bit hard to follow, can we try to simplify it?} Finally, note that the tuples in table E can be visualized as the graph in~\cref{fig:intro-ex-graph}.
This query is hard in set semantics because of correlations in the lineage formula, but under bag semantics it is easy since we enjoy linearity of expectation.\BG{I think this could be confusing to people. We have to clarify what our interpretation of the database under bag semantics is.}
While our work handles Block Independent Disjoint Databases\BG{what are these?} ($\bi$), for now we consider the $\ti$ model. Define the probability distribution to be $P[W_i =1]=\prob$ for $i$ in $\{a, b, c\}$.\BG{say that is a fixed probability for all tuples. Also again what is the interpretation of a probability for bag semantics?}
Note that computing the probability of the query of~\cref{ex:intro} in set semantics is indeed \#P-hard, since the query is non-hierarchical
%, i.e., for $Vars(\poly)$ denoting the set of variables occuring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$,
as defined by Dalvi and Suciu in~\cite{10.1145/1265530.1265571}. For the purposes of this work, we define hard to mean anything that requires more than linear time. %Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e. the probability of the output with annotation $\poly(W_a, W_b, W_c)$, ($\prob(q)$ in Dalvi, Suciu) is hard in set semantics.
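To see why this computation is hard for query $\poly$ over set semantics, observe that the output lineage formula is $\poly(W_a, W_b, W_c)= W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent of one another, since every pair of clauses shares a variable, and so the computation of the probability of the disjunction does not decompose into independent computations over the individual clauses.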
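For illustration, assuming $P[W_i = 1] = \prob$ for all $i$ in $\{a, b, c\}$ as above, inclusion-exclusion over the three clauses gives
\[
P\left[W_aW_b \vee W_bW_c \vee W_cW_a\right] = 3\prob^2 - 3\prob^3 + \prob^3 = 3\prob^2 - 2\prob^3,
\]
since every intersection of two or more clauses is the event $W_aW_bW_c$. The sum has $2^3 - 1 = 7$ terms here; for $k$ clauses, inclusion-exclusion ranges over $2^k - 1$ terms in general, so this direct approach does not scale.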
However, in the bag setting, the lineage formula is $\poly(W_a, W_b, W_c)= W_aW_b + W_bW_c + W_cW_a$. To be precise, the output lineage formula is produced from a query over a set $\ti$ input, where duplicates are allowed in the output. The expectation computation over the output lineage is a computation of the ``average'' multiplicity of an output tuple across possible worlds. In~\cref{ex:intro}, the expectation is simply
\begin{align*}
\expct\pbox{\poly(W_a, W_b, W_c)} &= \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
&= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}\\
&= \prob^2 + \prob^2 + \prob^2 = 3\prob^2,
\end{align*}
which is indeed linear in the size of the lineage, as the number of operations in the computation is \textit{exactly} the number of lineage operations. The first equality holds since expectation is linear over addition of the natural numbers. The second holds since, in the $\ti$ model, all variables are independent, so the expectation of each product is the product of the expectations. Note that the answer is the same as $\poly(\prob, \prob, \prob)$, although this is coincidental and not true in the general case.
For an arbitrary lineage formula, which we can view as a polynomial, there may exist equivalent compressed representations of the polynomial. One such compression is the factorized polynomial, in which the polynomial is written as a product of separate factors; this form is generally smaller than the expanded polynomial. Another equivalent form is the sum of products (SOP), which is the expansion of the factorized polynomial obtained by multiplying out all terms, and which in general is exponentially larger (in the number of products) than the factorized version.
In this case, even though we are substituting probability or expectation values for each variable, $\poly^2(\prob, \prob, \prob)$ is not the answer we seek, since for a random variable $X$, $\expct\pbox{X^2}=\sum_{x \in Dom(X)}x^2\cdot p(x)$, which in general differs from $\expct\pbox{X}^2$. Note that for our example, $Dom(W_i)=\{0, 1\}$. Intuitively, bags are only hard with self-joins.\AH{Atri suggests a proof in the appendix regarding the last claim.}
Define $\rpoly^2(\vct{X})$ to be the polynomial that results when all exponents $e > 1$ in $\poly^2$ are set to $1$. Note that $\rpoly^2(\prob, \prob, \prob)$ is the expectation we computed, since $i^2 = i$ for all $i$ in $\{0, 1\}$. Moreover, $\poly^2()$ is still computable in linear time in the size of the output polynomial, compressed or SOP.
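For illustration, reading $\poly^2$ as the square of the bag lineage formula of the running example (an assumption on our part) and taking $P[W_i = 1] = \prob$ for all $i$ in $\{a, b, c\}$, one obtains
\begin{align*}
\poly^2(W_a, W_b, W_c) &= W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_aW_b^2W_c + 2W_aW_bW_c^2 + 2W_a^2W_bW_c,\\
\rpoly^2(W_a, W_b, W_c) &= W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c,
\end{align*}
so that $\rpoly^2(\prob, \prob, \prob) = 3\prob^2 + 6\prob^3$, whereas naively substituting into $\poly^2$ gives $\poly^2(\prob, \prob, \prob) = \left(3\prob^2\right)^2 = 9\prob^4$. For $\prob \in [0, 1]$, the two values agree only when $\prob \in \{0, 1\}$.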
A compressed polynomial can be exponentially smaller in $k$ for $k$-products. Computing the expectation of an output polynomial in SOP form is always linear in the size of the polynomial, since expectation can be pushed through addition.
This work seeks to explore the complexity landscape for compressed representations of polynomials. We use the term ``easy'' to mean linear time, and the term ``hard'' to mean superlinear time. Note that when we are linear in the size of the lineage formula, our runtime is essentially that of deterministic query processing.
If bags \textit{are} always easy for any compressed version of the polynomial, then there is no need for improvement. If they provably are not, however, then the option to approximate the computation over a compressed polynomial in linear time is desirable.
Upon inspection one can see that the factorized output polynomial consists of three product terms, while the SOP version consists of $3^3$ terms. We show in this paper that, given a $\ti$ and any conjunctive query with input $\prob$ for all variables of $\poly^3$, this particular query is hard given a factorized polynomial as input. We show this via a reduction from the problem of computing the number of $3$-matchings in an arbitrary graph. The fact that bags are not easy in the general case when considering compressed polynomials necessitates an approximation algorithm that computes the expected multiplicity of the output in linear time when the output polynomial is in factorized form. We introduce such an approximation algorithm, with confidence guarantees, to compute $\rpoly(\vct{X})$ in linear time. Our approximation algorithm generalizes to the $\bi$ model as well. This shows that for all RA+ queries, the processing time of the approximation is essentially the same as that of deterministic processing.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Interesting contributions, problem definition, known results, our results, etc