As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
PDBs are orders of magnitude slower than classical (i.e., deterministic) database systems~\cite{feng:2019:sigmod:uncertainty}.
Naively, one might suggest that this is because most work on probabilistic databases assumes set semantics, while virtually all implementations of the relational data model use bag semantics.
However, as we show in this paper, there is a more subtle problem behind this barrier to adoption.
A fundamental problem in probabilistic query processing is: given a query, probabilistic database, and possible result tuple, compute the marginal probability of the tuple appearing in the result.
In the set semantics setting, it was shown that this is equivalent to computing the probability of a Boolean formula called the lineage formula, which records how the result tuple was derived from input tuples.
Given this correspondence, the problem reduces to weighted model counting over the lineage (a \sharpphard problem, even if the lineage is in DNF).
A large body of work has focused on finding tractable cases, either by identifying tractable classes of queries (e.g.,~\cite{DS12}) or by studying compressed representations of lineage formulas that are tractable for certain classes of input databases (e.g.,~\cite{AB15}).
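Concretely, for a lineage formula $\Phi$ over independent Boolean variables $X_1,\dots,X_n$, this is the standard weighted model count (restated here for completeness):
\[
\probOf[\Phi] = \sum_{\nu \models \Phi} \;\prod_{i\,:\,\nu(X_i) = \top} \probOf[X_i] \prod_{i\,:\,\nu(X_i) = \bot} \inparen{1 - \probOf[X_i]},
\]
where $\nu$ ranges over all satisfying assignments of $\Phi$.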
The problem of computing the marginal probability of a result tuple has a natural correspondence under bag semantics: computing the expected multiplicity of a query result tuple.
Analogously, this problem can be reduced to computing the expectation of the lineage, which under bag semantics is a polynomial.
This problem has received much less attention, perhaps because it is trivially tractable;
in fact, it is solvable in linear time when the lineage polynomial is encoded in the typical sum-of-products (SOP) representation.
However, there exist compressed representations of polynomials, e.g., factorizations~\cite{factorized-db}, that can be polynomially more concise than the SOP representation of a polynomial.
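For instance (a small illustration, not the paper's running query), the SOP polynomial $X_1Y_1 + X_1Y_2 + X_2Y_1 + X_2Y_2$ admits the factorized form $\inparen{X_1 + X_2}\cdot\inparen{Y_1 + Y_2}$, and nesting such factors compounds the savings.
Crucially, while linearity of expectation handles an SOP representation term by term, expectations do not in general distribute over products of sums that share variables: for independent $X$, $Y$, $Z$ we have $\mathbb{E}[(X+Y)(X+Z)] \neq \mathbb{E}[X+Y]\cdot\mathbb{E}[X+Z]$ whenever $\mathbb{E}[X^2] \neq \mathbb{E}[X]^2$.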
These compression schemes are close analogs of typical database optimizations like projection push-down~\cite{DBLP:conf/pods/KhamisNR16}: a deterministic engine, in effect, already computes a compressed form of the lineage.
This hints that even Bag-PDBs may inherently have higher query processing complexity than deterministic databases.
In this paper, we confirm this intuition, first proving (by reduction from counting $k$-matchings) that computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a compressed (factorized~\cite{factorized-db}) lineage representation, and then relating the size of the compressed lineage to the cost of answering a deterministic query.
In spite of this negative result, not everything is lost.
We develop an approximation algorithm for expected counts of SPJU query results over Bag-PDBs that is, to our knowledge, the first linear-time (in the size of the factorized lineage) $(1-\epsilon)$-approximation.
By extension, this algorithm has only a constant factor overhead relative to deterministic query processing.\footnote{
Monte Carlo sampling~\cite{jampani2008mcdb} is also trivially a constant factor slower, but guarantees only additive bounds, rather than the stronger multiplicative bounds we provide.
}
This is an important result, because it implies that computing approximate expectations for SPJU queries can indeed be competitive with deterministic query evaluation over bag databases.
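To make this distinction concrete (standard definitions, restated in our setting): a multiplicative $(1\pm\epsilon)$ guarantee returns an estimate $\tilde{v}$ of the true value $v$ satisfying $(1-\epsilon)\,v \le \tilde{v} \le (1+\epsilon)\,v$, whereas an additive guarantee ensures only $v - \epsilon M \le \tilde{v} \le v + \epsilon M$ for some scale parameter $M$ (e.g., an a priori upper bound on the count), which is far weaker when $v \ll M$.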
%Figures, etc
%Relations for example 1
%Graph of query output for intro example
%\begin{figure}
% \begin{tikzpicture}
% \node at (1.5, 3) [tree_node](top){a};
% \node at (0, 0) [tree_node](left){b};
% \node at (3, 0) [tree_node](right){c};
% \draw (top)--(left);
% \draw (left)--(right);
% \draw (right)--(top);
% \end{tikzpicture}
%\caption{Graph of tuples in table E}
%\label{fig:intro-ex-graph}
%\end{figure}
\begin{Example}\label{ex:intro}
Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work also handles Block Independent Databases ($\bi$)~\cite{BD05,DBLP:series/synthesis/2011Suciu}.} given in \Cref{fig:intro-ex} with two input relations $R$ and $E$.
Each input tuple is assigned an annotation (attribute $\Phi_{set}$): an independent random Boolean variable ($W_i$) or the constant $\top$.
Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$)
% \SF{Do we need to state the meaning of $\top$ and $\bot$? Also do we want to add bag annotation to Figure 1 too since we are discussing both sets and bags later?}
identifies one \emph{possible world}, a deterministic database instance containing exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and its probability is $\probOf[W_a]\cdot \probOf[W_b] \cdot \probOf[\neg W_c] = \prob\cdot \prob\cdot (1-\prob)=\prob^2-\prob^3$.
\end{Example}
\begin{figure}[t]
\begin{subfigure}{0.2\textwidth}
\centering
\trimfigurespacing
\end{figure}
Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
Without loss of generality, we assume that input relations are sets (i.e., $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
Because the query result is a nullary relation, we write $Q(\cdot)$ to denote the lineage of its sole result tuple:
\begin{align*}
\poly_{set}(W_a, W_b, W_c) &= W_aW_b \vee W_bW_c \vee W_cW_a\\
\poly_{bag}(W_a, W_b, W_c) &= W_aW_b + W_bW_c + W_cW_a
\end{align*}
These functions compute the existence (resp., count) of the nullary tuple resulting from applying $\poly$ on the PDB of \Cref{fig:intro-ex}.
For the same possible world as in the prior example:
\begin{align*}
&\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \bot\top = \top\\
&\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
\end{align*}
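For instance, with $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ as in Example~\ref{ex:intro} and the $W_i$ independent, a direct computation (included here for illustration) by linearity of expectation gives the expected count of the result tuple:
\begin{align*}
\mathbb{E}\left[\poly_{bag}(W_a, W_b, W_c)\right] &= \mathbb{E}[W_aW_b] + \mathbb{E}[W_bW_c] + \mathbb{E}[W_cW_a]\\
&= \probOf[W_a]\probOf[W_b] + \probOf[W_b]\probOf[W_c] + \probOf[W_c]\probOf[W_a] = 3\prob^2.
\end{align*}
Each monomial's expectation factors only because the $W_i$ are independent and $Dom(W_i) = \{0,1\}$.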
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments and to polynomial circuits, and prove that, for RA+ queries, the approximation algorithm runs within a constant factor of the time needed to process the same query deterministically.
Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}.
First, it is easy to generalize the polynomial of Example~\ref{ex:bag-vs-set} to one, $\poly_G(X_1,\dots,X_n)$, that represents the edge set of a graph $G$ on $n$ vertices.
Then $\inparen{\poly_G(X_1,\dots,X_n)}^k$ encodes, as its monomials, all subgraphs of $G$ with at most $k$ edges.
This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$, and we observe that $c_{2k}$ is proportional to the number of $k$-matchings in $G$, a quantity that is \sharpwonehard\ to compute.
Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for $2k+1$ distinct values $\prob_i$, $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$).
This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots,\prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed, it is easy for $\prob=0$ or $\prob=1$.
However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots,\prob)$ exactly most likely requires super-linear time.
This reduction needs more work (and we cannot yet extend our results to $k>3$).
Further, we have to rely on more recent conjectures from {\em fine-grained} complexity, e.g., on the complexity of counting triangles in $G$, rather than on more standard parameterized hardness such as \sharpwonehard.
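Concretely (a standard polynomial interpolation argument), writing $r_i = \rpoly_G^k(\prob_i,\dots,\prob_i)$, the $2k+1$ evaluations yield the linear system
\[
\begin{pmatrix}
1 & \prob_0 & \cdots & \prob_0^{2k}\\
\vdots & \vdots & & \vdots\\
1 & \prob_{2k} & \cdots & \prob_{2k}^{2k}
\end{pmatrix}
\begin{pmatrix} c_0 \\ \vdots \\ c_{2k} \end{pmatrix}
=
\begin{pmatrix} r_0 \\ \vdots \\ r_{2k} \end{pmatrix},
\]
whose Vandermonde matrix is invertible precisely because the $\prob_i$ are distinct; solving it recovers $c_{2k}$, and with it the number of $k$-matchings in $G$.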
The starting point of our approximation algorithm is the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$ we have $\rpoly(1,\dots,1)=Q(1,\dots,1)$, and that if all the coefficients of $\poly$ are constants, then $\poly(1,\dots,1)$ (which is easily computed in linear time) is a $\prob^k$-approximation of the value $\rpoly(\prob,\dots,\prob)$ that we are after.
If $\prob$ and $k=\deg(\poly)$ are constants, this already gives a constant factor approximation.
We then use sampling to obtain the better approximation factor of $(1\pm \eps)$: we sample monomials of $\poly$, each with probability proportional to its contribution to $\poly(1,\dots,1)$, and take an appropriately weighted sum of their coefficients.
Standard tail bounds then yield the desired approximation scheme.
To achieve a linear runtime, it turns out that we need two properties from our compressed representation of $\poly$: (i) the ability to compute $\poly(1,\dots,1)$ in linear time, and (ii) the ability to sample monomials of $\poly$ quickly.
For ease of exposition, we start with expression trees (see~\Cref{fig:intro-q2-etree} for an example) and show that they satisfy both properties.
Later, we show that these properties extend to polynomial circuits as well (essentially, within the required time bound we can simulate access to the `unrolled' expression tree underlying the circuit).
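To make the sampling step concrete, here is a minimal sketch in Python (ours, for illustration only; it assumes an expression tree with non-negative coefficients, and all names are ours rather than the paper's):
\begin{verbatim}
import random

class Node:
    """Expression-tree node; kind is 'var', 'const', 'add', or 'mul'."""
    def __init__(self, kind, payload=None, children=()):
        self.kind, self.payload, self.children = kind, payload, list(children)
        if kind == 'var':
            self.ones = 1.0                 # a variable evaluates to 1
        elif kind == 'const':
            self.ones = float(payload)      # assumed non-negative
        elif kind == 'add':
            self.ones = sum(c.ones for c in self.children)
        else:                               # 'mul'
            self.ones = 1.0
            for c in self.children:
                self.ones *= c.ones

def sample_monomial(node):
    """Draw one monomial with probability proportional to its
    contribution to poly(1,...,1); return its set of variables."""
    if node.kind == 'var':
        return {node.payload}
    if node.kind == 'const':
        return set()
    if node.kind == 'add':                  # weighted choice of one child
        r = random.uniform(0.0, node.ones)
        for c in node.children:
            if r < c.ones:
                return sample_monomial(c)
            r -= c.ones
        return sample_monomial(node.children[-1])
    monomial_vars = set()                   # 'mul': one monomial per factor
    for c in node.children:
        monomial_vars |= sample_monomial(c)
    return monomial_vars

def approx_rpoly(root, p, n_samples):
    """Estimate rpoly(p,...,p): average p^|vars(m)| over sampled
    monomials m, rescaled by poly(1,...,1) (= root.ones)."""
    total = sum(p ** len(sample_monomial(root)) for _ in range(n_samples))
    return root.ones * total / n_samples

# The bag lineage W_a*W_b + W_b*W_c + W_c*W_a from the example above:
wa, wb, wc = Node('var', 'a'), Node('var', 'b'), Node('var', 'c')
lineage = Node('add', children=[Node('mul', children=[wa, wb]),
                                Node('mul', children=[wb, wc]),
                                Node('mul', children=[wc, wa])])
print(approx_rpoly(lineage, 0.5, 100000))   # converges to 3 * 0.25 = 0.75
\end{verbatim}
At a $+$ node the sketch descends into a single child chosen proportionally to that child's all-ones value; at a $\times$ node it descends into every child and takes the union of the variables.
Both the all-ones evaluation and each sample therefore take time linear in the size of the tree, mirroring the two properties above.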
We also formalize our claim that, because our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms.
Finally, we observe that our results extend trivially to higher moments of the tuple multiplicity (not just the expectation).