introduction

Boris Glavic 2020-12-19 22:54:18 -06:00
parent dcff4ec4eb
commit 2a926630cb

@ -9,40 +9,74 @@
As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
PDBs are orders of magnitude slower than classical (i.e., deterministic) database systems~\cite{feng:2019:sigmod:uncertainty}.
A naive strategy might be to move from the theoretically simpler set-relational model~\cite{DBLP:series/synthesis/2011Suciu,BD05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} to the computationally simpler bag-relational model, mirroring a similar transition in deterministic databases decades ago.
However, after discarding a long-held approach to representing lineage, we prove that query processing in Bag-PDBs is \sharpwonehard.
This finding shows that even Bag-PDB query processing has a higher complexity than deterministic query processing, and opens a rich landscape of opportunities for research on approximate algorithms.
Most work on probabilistic databases has assumed set semantics; however, virtually all implementations of the relational data model use bag semantics.
The fundamental challenge is lineage formulas, a key component of query processing in PDBs.
Using the standard (i.e., DNF) encoding of lineage, computing typical statistics like marginal probabilities or moments is easy (i.e., $O(|\text{lineage}|)$) for bags and hence, perhaps not worthy of research attention, but hard (i.e., $O(2^{|\text{lineage}|})$) for sets and hence, interesting from a research perspective.
However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs and address it by proposing an algorithm that, to our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.\footnote{
MCDB~\cite{jampani2008mcdb} is also a constant factor slower, but only guarantees additive bounds.
}
A fundamental problem in probabilistic query processing is: given a query, a probabilistic database, and a possible result tuple, compute the marginal probability that the tuple appears in the result.
In the set semantics setting, it was shown that this is equivalent to computing the probability of a Boolean formula, called the lineage formula, which records how the result tuple has been produced by combining input tuples. Given this correspondence, the problem reduces to weighted model counting, which is \sharpphard even if the formula is in DNF. A large body of work has focused on identifying tractable cases, either by identifying tractable classes of queries (e.g.,~\cite{DS12}) or by studying compressed representations of lineage formulas that are tractable for certain classes of input databases (e.g.,~\cite{AB15}).
Consider the dominant problem in Set-PDBs (computing marginal probabilities) and the corresponding problem in Bag-PDBs (computing expectations of counts).
In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage of a query result tuple is a Boolean formula over random variables that captures the conditions under which the tuple appears in the result.
Computing the probability of the tuple appearing in the result is thus analogous to weighted model counting (a known \sharpphard problem).
In the corresponding Bag-PDB problem~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty,GL16}, lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
The expectation of the multiplicity is the expectation of this polynomial.
The problem of computing the marginal probability of a result tuple has a natural correspondence under bag semantics: computing the expected multiplicity of a query result tuple. Analogous to the set case, this problem can be reduced to computing the expectation of the lineage, which under bag semantics is a polynomial. This problem has received much less attention, perhaps because it is trivially tractable (in fact, linear time in the size of the formula) for the sum of products (SOP) representation of lineage polynomials. However, there exist compressed representations of polynomials, e.g., factorizations~\cite{factorized-db}, that can be exponentially more concise than the SOP representation of a polynomial. We would like to be able to exploit such representations for query processing. However, this is only beneficial if the problem remains tractable for such compressed representations.\footnote{Note that while for sets, compressed representations are investigated in the hope of finding tractable cases, for bags the compressed representation can increase computational complexity.}
We prove that, unfortunately, this is not the case. In fact,
we prove by reduction from counting $k$-matchings that
computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of the compressed representation. In spite of this negative result, not all is lost. We develop the first linear-time (in the size of the factorized lineage) $(1-\epsilon)$-approximation scheme for expected counts of SPJU query results over Bag-PDBs.
We also show that, for project-join queries, this algorithm has only a constant factor overhead over deterministic query processing over the probabilistic database. This is an important result, because it implies that approximate expectations can be computed in time competitive with deterministic query evaluation over bag databases.
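To make the conciseness gap between factorized and SOP representations concrete, consider the following illustrative polynomial (the variables $X_i$, $Y_i$ and the parameter $n$ are chosen here purely for exposition and are not tied to a particular query). Its factorized form on the left has $2n$ variable occurrences, while its SOP expansion on the right has $2^n$ monomials:
\begin{align*}
\prod_{i=1}^{n} (X_i + Y_i) \;=\; \sum_{S \subseteq \{1,\ldots,n\}} \; \prod_{i \in S} X_i \prod_{j \notin S} Y_j
\end{align*}
An algorithm that is linear in the size of the SOP representation can therefore be exponential in the size of the factorized representation.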
Lineage in Set-PDBs is typically encoded in disjunctive normal form.
This representation is significantly larger than the query result sans lineage.
However, even with alternative encodings~\cite{FH13}, the limiting factor in computing marginal probabilities is the probability computation itself, not the lineage formula.
The Bag-PDB analog is a polynomial in sum of products (SOP) form --- a sum of `clauses', each the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Compressed representations like Factorized Databases~\cite{factorized-db} %DBLP:conf/tapp/Zavodny11
or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g., pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database.
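As a small worked illustration of the contrast between SOP and compressed forms, assume (purely for exposition) independent $\{0,1\}$-valued variables $W_a, W_b, W_c$ with $\Pr[W_i = 1] = p_i$; the notation $\operatorname{E}[\cdot]$ and $p_i$ is introduced here only for this sketch. For a polynomial in SOP form, linearity of expectation and independence yield the expectation one clause at a time:
\begin{align*}
\operatorname{E}[W_aW_b + 2W_bW_c] =&\; p_ap_b + 2p_bp_c
\end{align*}
For a factorized polynomial whose factors share a variable, the expectation cannot simply be pushed through the product, because $W_a^2 = W_a$ for $\{0,1\}$-valued variables:
\begin{align*}
\operatorname{E}[(W_a + W_b)(W_a + W_c)] =&\; \operatorname{E}[W_a^2 + W_aW_c + W_aW_b + W_bW_c]\\
=&\; p_a + p_ap_c + p_ap_b + p_bp_c
\end{align*}
This differs from $(p_a + p_b)(p_a + p_c)$ whenever $0 < p_a < 1$, which hints at why evaluating expectations directly on a compressed representation is not straightforward.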
% is \emph{not} linear in the size of a compressed (factorized~\cite{factorized-db}) lineage polynomial by reduction from counting $k$-matchings.
% Thus, even Bag-PDBs do not enjoy the same time complexity as deterministic databases.
% This motivates our second goal, a linear time (in the size of the factorized lineage) approximation of expected counts for SPJU results in Bag-PDBs.
% As we also show, this complexity is proportional to the same query on a deterministic database.
The initial picture is not good.
We prove that computing expected counts is \emph{not} linear in the size of a compressed (factorized~\cite{factorized-db}) lineage polynomial by reduction from counting $k$-matchings.
Thus, even Bag-PDBs do not enjoy the same time complexity as deterministic databases.
This motivates our second goal, a linear time (in the size of the factorized lineage) approximation of expected counts for SPJU results in Bag-PDBs.
As we also show, this complexity is proportional to the same query on a deterministic database.
% In this paper, we prove this
% limitation of PDBs and address it by proposing an algorithm that, to our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.\footnote{
% MCDB~\cite{jampani2008mcdb} is also a constant factor slower, but only guarantees additive bounds.
% }
% The fundamental challenge is lineage formulas, a key component of query processing in PDBs.
% Using the standard (i.e., DNF) encoding of lineage, computing typical statistics like marginal probabilities or moments is easy (i.e., $O(|\text{lineage}|)$) for bags and hence, perhaps not worthy of research attention, but hard (i.e., $O(2^{|\text{lineage}|})$) for sets and hence, interesting from a research perspective.
% However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
% A naive strategy might be to move from the theoretically simpler set-relational model~\cite{DBLP:series/synthesis/2011Suciu,BD05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} to the computationally simpler bag-relational model, mirroring a similar transition in deterministic datbases decades ago.
% However, after discarding a long-held approach to representing lineage, we prove that query processing in Bag-PDBs is \sharpwonehard.
% This finding shows that even Bag-PDB query processing has a higher complexity than deterministic query processing, and opens a rich landscape of opportunities for research on approximate algorithms.
% The fundamental challenge is lineage formulas, a key component of query processing in PDBs.
% Using the standard (i.e., DNF) encoding of lineage, computing typical statistics like marginal probabilities or moments is easy (i.e., $O(|\text{lineage}|)$) for bags and hence, perhaps not worthy of research attention, but hard (i.e., $O(2^{|\text{lineage}|})$) for sets and hence, interesting from a research perspective.
% However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
% In this paper, we formally prove this limitation of PDBs and address it by proposing an algorithm that, to our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.\footnote{
% MCDB~\cite{jampani2008mcdb} is also a constant factor slower, but only guarantees additive bounds.
% }
% Consider the dominant problem in Set-PDBs (Computing marginal probabilities) and the corresponding problem in Bag-PDBs (computing expectations of counts).
% In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage of a query result tuple is a Boolean formula over random variables that captures the conditions under which the tuple appears in the result.
% Computing the probability of the tuple appearing in the result is thus analogous to weighted model counting (a known \sharpphard problem).
% In the corresponding Bag-PDB problem~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty,GL16}, lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
% The expectation of the multiplicity is the expectation of this polynomial.
% Lineage in Set-PDBs is typically encoded in disjunctive normal form.
% This representation is significantly larger than the query result sans lineage.
% However, even with alternative encodings~\cite{FH13}, the limiting factor in computing marginal probabilities is the probability computation itself, not the lineage formula.
% The Bag-PDB analog is a polynomial in sum of products (SOP) form --- a sum of `clauses', each the product of a set of integer or variable atoms.
% Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
% Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
% Compressed representations like Factorized Databases~\cite{factorized-db} %DBLP:conf/tapp/Zavodny11
% or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
% Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database.
% The initial picture is not good.
% We prove that computing expected counts is \emph{not} linear in the size of a compressed (factorized~\cite{factorized-db}) lineage polynomial by reduction from counting $k$-matchings.
% Thus, even Bag-PDBs do not enjoy the same time complexity as deterministic databases.
% This motivates our second goal, a linear time (in the size of the factorized lineage) approximation of expected counts for SPJU results in Bag-PDBs.
% As we also show, this complexity is proportional to the same query on a deterministic database.
% In other words, our approximation algorithm can estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Sets vs Bags}
%Consider an arbitrary output polynomial $\poly$. Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.
@ -251,7 +285,7 @@ This property leads us to consider a structure related to $\poly$.
For any polynomial $\poly(\vct{X})$, we define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
With $\poly^2$ as an example, we have:
\begin{align*}
\rpoly^2(W_a, W_b, W_c)
% =&\; W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
% &+ 2W_aW_bW_c\\
=&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c