In middle of my pass on intro

This commit is contained in:
Atri Rudra 2020-12-18 11:37:37 -05:00
parent bdb95465e1
commit 634d6a0048
2 changed files with 27 additions and 21 deletions

View file

@ -1,5 +1,6 @@
%!TEX root=./main.tex
\section{Generalizations}
\label{sec:gen}
In this section, we consider a couple of generalizations/corollaries of our results so far. In particular, in~\Cref{sec:circuits} we first consider the case when the compressed polynomial is represented by a Directed Acyclic Graph (DAG) instead of the earlier (expression) tree (\Cref{def:express-tree}) and we observe that all of our results carry over to the DAG representation. Then we formalize our claim in~\Cref{sec:intro} that a linear runtime algorithm for our problem would imply that we can process PDBs in the same time as deterministic query processing. Finally, in~\Cref{sec:momemts}, we make some simple observations on how our results can be used to estimate moments beyond the expectation of a lineage polynomial.
\subsection{Lineage circuits}

View file

@ -4,13 +4,15 @@
\section{Introduction}
\label{sec:intro}
\AR{\textbf{Oliver/Boris:} What is missing from the intro is why would someone care about bag-PDBs in {\em practice}? This is kinda obliquely referred to in the first para but it would be good to motivate this more. The intro (rightly) focusses on the theoretical reasons to study bag PDBs but what (fi any) are the practical significance of getting bag PDBs done in linear-time? Would this lead to much faster real-life PDB systems?}
Modern production databases like Postgres and Oracle use bag semantics, while research on probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu,DBLP:conf/sigmod/BoulosDMMRS05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} focuseses predominantly on query evaluation under set semantics.
This is not surprising, as the conventional strategy for encoding the lineage of a query result --- a key component of query evaluation in PDBs --- makes computing typical statistics like marginal probabilities or moments easy (at worst linear in the size of the lineage) for bags, but hard (at worst exponential in the size of the lineage) for sets.
This is not surprising, as the conventional strategy for encoding the lineage of a query result --- a key component of query evaluation in PDBs --- makes computing typical statistics like marginal probabilities or moments easy (at worst linear in the size of the lineage) for bags and hence, perhaps not worthy of research attention, but hard (at worst exponential in the size of the lineage) for sets and hence, interesting from a research perspective.
However, conventional encodings of a result's lineage are typically large, and even for Bag-PDBs, computing such statistics from lineage formulas still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs, and address it by proposing an approximation algorithm that, to the best of our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.
Consider the dominant problem in Set-PDBs: Computing marginal probabilities, and the corresponding problem in Bag-PDBs: computing expectations of counts.
In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage of a query result tuple is a boolean formula over random variables that captures the conditions under which the tuple appears in the result.
In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage of a query result tuple is a Boolean formula over random variables that captures the conditions under which the tuple appears in the result.
Computing the probability of the tuple appearing in the result is thus analogous to weighted model counting (a known \sharpphard problem).
In the corresponding problem for Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty}, lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
Thus, the expectation of the multiplicity is the expectation of this polynomial.
@ -18,14 +20,14 @@ Thus, the expectation of the multiplicity is the expectation of this polynomial.
Lineage in Set-PDBs is typically encoded in disjunctive normal form.
This representation is significantly larger than the query result sans lineage.
However, even with alternative encodings~\cite{DBLP:journals/vldb/FinkHO13}, the limiting factor in computing marginal probabilities remains the probability computation itself, and not the lineage formula.
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of clauses, each of which is the product of a set of integer or variable atoms.
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Such compressed representations like Factorized Databases~\cite{10.1145/3003665.3003667,DBLP:conf/tapp/Zavodny11} or Polynomial Circuits\todo[noinline]{cite}, are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
Such compressed representations like Factorized Databases~\cite{10.1145/3003665.3003667,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity}, are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula allows us to more closely relate the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{10.1145/3003665.3003667} --- lineage polynomial by reduction to counting 3-matchings.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{10.1145/3003665.3003667} --- lineage polynomial by reduction from counting $k$-matchings.
Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases.
This motivates our second goal, a linear time approximation algorithm for computing expected counts in a bag database, with complexity linear in the size of a factorized lineage formula.
As we will show, the size of the factorized
@ -97,23 +99,23 @@ In other words, our approximation algorithm can estimate expected multiplicities
\begin{Example}\label{ex:intro}
Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work also handles Block Independent Disjoint Databases ($\bi$)~\cite{DBLP:conf/sigmod/BoulosDMMRS05,DBLP:series/synthesis/2011Suciu}, we return to this model later.} given in \Cref{fig:intro-ex} with two input relations $R$ and $E$.
Each input tuple is assigned an annotation (attribute $\Phi$): an independent random boolean variable ($W_i$) or the constant $\top$.
Each input tuple is assigned an annotation (attribute $\Phi$): an independent random Boolean variable ($W_i$) or the constant $\top$.
Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) \SF{Do we need to state the meaning of $\top$ and $\bot$? Also do we want to add bag annotation to Figure 1 too since we are discussing both sets and bags later?} identifies one \emph{possible world}, a deterministic database instance that contains exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p^2-p^3$
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$.
\end{Example}
Prior efforts to generalize incomplete databases to bags~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17} replace the boolean annotations with natural numbers.
Prior efforts to generalize incomplete databases to bags~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17} replace the Boolean annotations with natural numbers.
Analogously, we generalize the above model of Set-PDBs to bags by using natural-number-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and positive natural number constants.
Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
We contrast bag and set query evaluation with the following example:
\begin{Example}\label{ex:bag-vs-set}
Continuing the prior example, we are given the following boolean (resp,. count) query
Continuing the prior example, we are given the following Boolean (resp,. count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a boolean (resp., polynomial) formula over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the boolean query has only a nullary relation, we write $Q(\cdot)$ to denote the function mapping variable assignments to a concrete value for the lineage in the corresponding possible world:
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a Boolean (resp., polynomial) formula over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the Boolean query has only a nullary relation, we write $Q(\cdot)$ to denote the function mapping variable assignments to a concrete value for the lineage in the corresponding possible world:
\begin{align*}
\poly_{set}(W_a, W_b, W_c) &= W_aW_b \vee W_bW_c \vee W_cW_a\\
\poly_{bag}(W_a, W_b, W_c) &= W_aW_b + W_bW_c + W_cW_a
@ -126,7 +128,7 @@ The polynomials evaluate as:
&\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
\end{align*}
The Set-PDB query is satisfied in this possible world, while the Bag-PDB query produces a nullary tuple with a multiplicity of 1.
The marginal probability (resp., expected count) of this query is computed over all possible worlds:
The marginal probability (resp., expected count) of this query is computed over all possible worlds:\AR{What is $\mu$ below?}
{\small
\begin{align*}
P[\poly_{set}] &= \sum_{w_i \in \{\top,\bot\}} \mu(\poly_{set}(w_a, w_b, w_c))P[W_a = w_a,W_b = w_b,W_c = w_c]\\
@ -136,7 +138,7 @@ P[\poly_{set}] &= \sum_{w_i \in \{\top,\bot\}} \mu(\poly_{set}(w_a, w_b, w_c))P[
\end{Example}
Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed \sharpphard, since it non-hierarchical~\cite{10.1145/1265530.1265571}.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form boolean lineage are neither independent nor disjoint, forcing~\cite{DBLP:journals/vldb/FinkHO13} the use of Shannon decomposition, which is at worst exponential in the size of the input.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form Boolean lineage are neither independent nor disjoint, leading to e.g.~\cite{DBLP:journals/vldb/FinkHO13} the use of Shannon decomposition, which is at worst exponential in the size of the input.
% \begin{equation*}
% \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
% \end{equation*}
@ -218,8 +220,8 @@ For example:
\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).
\end{equation*}
}
This factorized expression can be easily modeled as an expression tree, as in \Cref{fig:intro-q2-etree}.
In contrast, the equivalent SOP representation is
This factorized expression can be easily modeled as an expression tree, as in \Cref{fig:intro-q2-etree}
while the equivalent SOP representation is
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
@ -246,15 +248,16 @@ With $\poly^2$ as an example, we have:
\end{align*}
\SF{Should this be like $\tilde{\poly^2}$ to avoid ambiguous?}
Observe that the reduced polynomial is a closed form formula for the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(P\pbox{W_a=1}, P\pbox{W_b=1}, P\pbox{W_c=1})$).
Also note that our initial example polynomial $\poly$ is already in reduced form.
Also note that the $\poly$ in~\Cref{ex:bag-vs-set} is already in reduced form.
The reduced form of a polynomial can be obtained in a linear scan over the clauses of a SOP encoding of the polynomial.
In prior work on PDBs, where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the cartesian product of $k$ copies of $\poly$ has a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
This leads us to the central question of this paper:
\begin{Question}
Is it always the case that the expectation of a nullary count query in a Bag-PDB can be computed in time linear in the size of the \emph{compressed} lineage polynomial?
\end{Question}
This leads us to the \textbf{central question of this paper}:
\begin{quote}
{\em
Is it always the case that the expectation of an UCQ in a Bag-PDB can be computed in time linear in the size of the \emph{compressed} lineage polynomial?}
\end{quote}
If the answer is yes, then it is possible for Bag-PDBs to achieve performance competitive with deterministic databases.
The answer, unfortunately, is no, and an approximation algorithm is required.
@ -264,12 +267,14 @@ The answer, unfortunately, is no, and an approximation algorithm is required.
% \end{equation*}
% The factorized output polynomial consists of a product of three identical three-way summations, while the SOP encoding is exponential --- $3^3$ clauses to be precise.
\subsection{Overview of our results and techniques}
Concretely, in this paper:
(i) We show that conjunctive queries over a bag-$\ti$ are hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction to counting the number of $3$-matchings over an arbitrary graph;
(i) We show that conjunctive queries over a bag-$\ti$ are hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction from counting the number of $k$-matchings over an arbitrary graph;
(ii) We present an $(1-\epsilon)$-approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments, polynomial circuits, and prove RA+ queries, the processing time in approximation is within a constant factor of the same query processed deterministically.
\paragraph{Paper Organization.} We present some relevant background and setup our notation in~\Cref{sec:background}. We present our hardness results in~\Cref{sec:hard} and our approximation algorithm in~\Cref{sec:algo}. We present some (easy) generalizations of our results in~\Cref{sec:gen}. We do a quick overview of related work in~\Cref{sec:related-work} and conclude with some open questions in~\Cref{sec:concl-future-work}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%