
%root: main.tex
%!TEX root=./main.tex
\begin{abstract}
  The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
  The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
  In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
  For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
  However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
  As we show, this result has a significant implication: a Bag-PDB doing exact computations will never be as fast as a classical (deterministic) database.
  The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ with $\prob \notin \{0,1\}$.
  We proceed to study polynomials of result tuples of unions of conjunctive queries (UCQs) over TIDBs and over a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $(1 \pm \epsilon)$-approximation of the expectation of such polynomials in time linear in the size of the polynomial, paving the way for PDBs to be competitive with deterministic databases.
% \AH{High-level intuition}
% \BG{Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy.  Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage.  This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.}
% \AH{Low-level why-would-an-expert-read-this}
% \BG{However, when we consider compressed representations of the tuple lineage, the complexity landscape changes.  If we use a lineage computed over a factorized database, we find in general that computation time is not linear in the size of the compressed lineage.}
% \AH{Key technical contributions}
% \BG{This work theoretically demonstrates that bags are not easy in general, and in the case of compressed lineage forms, the computation can be greater than linear.  As such, it is then desirable to have an approximation algorithm to approximate the expected multiplicity in linear time.  We introduce such an algorithm and give theoretical guarentees on its efficiency and accuracy.  It then follows that computing an approximate value of the tuple's expected multiplicity on a bag PDB is equivalent to deterministic query processing complexity.}
\end{abstract}
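
As a small illustration of the two regimes described in the abstract, consider independent $\{0,1\}$-valued tuple variables of a TIDB. This is a sketch only: the variables $X$, $Y$, $Z$, their marginal probabilities $p_X$, $p_Y$, $p_Z$, and the notation $E[\cdot]$ for expectation are assumptions introduced here for illustration. For a lineage polynomial in sum-of-products form, linearity of expectation and independence give the expectation monomial by monomial:
\[
  E[XY + XZ] = p_X\, p_Y + p_X\, p_Z .
\]
For a factorized polynomial, the expectation cannot in general be pushed through the product, because expansion can create repeated variables, and $X^2 = X$ for a $\{0,1\}$-valued variable:
\[
  E[(X + Y)(X + Z)] = E[X^2 + XZ + XY + YZ] = p_X + p_X\, p_Z + p_X\, p_Y + p_Y\, p_Z ,
\]
which differs from $(p_X + p_Y)(p_X + p_Z)$ by $p_X (1 - p_X)$, and hence whenever $p_X \notin \{0,1\}$.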

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: