Trimming S1

Oliver Kennedy 2020-12-19 15:00:03 -05:00
parent 5321761acf
commit f10a58b415
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60


@ -14,17 +14,17 @@ However, after discarding a long-held approach to representing lineage, we prove
This finding shows that even Bag-PDB query processing has a higher complexity than deterministic query processing, and opens a rich landscape of opportunities for research on approximate algorithms.
The fundamental challenge is lineage formulas, a key component of query processing in PDBs.
Under standard assumptions about how these are encoded, computing typical statistics like marginal probabilities or moments is easy (at worst linear in the size of the lineage) for bags and hence, perhaps not worthy of research attention, but hard (at worst exponential in the size of the lineage) for sets and hence, interesting from a research perspective.
However, conventional encodings of a result's lineage are typically large, and so even for Bag-PDBs, computing such statistics from lineage formulas still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs, and address it by proposing an approximation algorithm that, to the best of our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing\footnote{
MCDB~\cite{jampani2008mcdb} is notable in that it is also a constant factor slower, but only guarantees additive rather than multiplicative bounds.
}.
Using the standard (i.e., DNF) encoding of lineage, computing typical statistics like marginal probabilities or moments is easy (i.e., $O(|\text{lineage}|)$) for bags and hence, perhaps not worthy of research attention, but hard (i.e., $O(2^{|\text{lineage}|})$) for sets and hence, interesting from a research perspective.
However, the standard encoding is unnecessarily large, and so even for Bag-PDBs, computing such statistics from lineage formulas still has a higher complexity than answering queries in a deterministic (i.e., non-probabilistic) database.
In this paper, we formally prove this limitation of PDBs and address it by proposing an algorithm that, to the best of our knowledge, is the first $(1-\epsilon)$-approximation for expectations of counts to have a runtime within a constant factor of deterministic query processing.\footnote{
MCDB~\cite{jampani2008mcdb} is also a constant factor slower, but only guarantees additive bounds.
}
Consider the dominant problem in Set-PDBs: Computing marginal probabilities, and the corresponding problem in Bag-PDBs: computing expectations of counts.
Consider the dominant problem in Set-PDBs (computing marginal probabilities) and the corresponding problem in Bag-PDBs (computing expectations of counts).
In work that addresses the former problem~\cite{DBLP:series/synthesis/2011Suciu}, the lineage of a query result tuple is a Boolean formula over random variables that captures the conditions under which the tuple appears in the result.
Computing the probability of the tuple appearing in the result is thus analogous to weighted model counting (a known \sharpphard problem).
In the corresponding problem for Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty,GL16}, lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
Thus, the expectation of the multiplicity is the expectation of this polynomial.
In the corresponding Bag-PDB problem~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty,GL16}, lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
The expectation of the multiplicity is the expectation of this polynomial.
Lineage in Set-PDBs is typically encoded in disjunctive normal form.
This representation is significantly larger than the query result sans lineage.
@ -32,16 +32,15 @@ However, even with alternative encodings~\cite{FH13}, the limiting factor in com
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Such compressed representations like Factorized Databases~\cite{factorized-db,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity}, are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula allows us to more closely relate the algorithm's performance to the complexity of query evaluation in a deterministic database.
Compressed representations like Factorized Databases~\cite{factorized-db,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{factorized-db} --- lineage polynomial by reduction from counting $k$-matchings.
Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases.
This motivates our second goal, a linear time approximation algorithm for computing expected counts in a bag database, with complexity linear in the size of a factorized lineage formula.
As we will show, the size of the factorized
lineage formula for a query --- and by extension, our approximation algorithm --- is proportional to the complexity of evaluating the same query on a comparable deterministic database instance~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
In other words, our approximation algorithm can estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
We prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{factorized-db} --- lineage polynomial by reduction from counting $k$-matchings.
Thus, even Bag-PDBs cannot enjoy the same time complexity as deterministic databases.
This motivates our second goal, a linear-time (in the size of the factorized lineage) approximation of expected counts for SPJU query results over Bag-PDBs.
We also show that the size of the factorized lineage formula for a query (and, by extension, the runtime of our approximation algorithm) is proportional to the complexity of evaluating the same query on a deterministic database.
% In other words, our approximation algorithm can estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
\subsection{Sets vs Bags}
@ -49,7 +48,7 @@ In other words, our approximation algorithm can estimate expected multiplicities
%Figures, etc
%Relations for example 1
\begin{figure}[ht]
\begin{figure}[t]
\begin{subfigure}{0.2\textwidth}
\centering
\begin{tabular}{ c | c c c}
@ -90,6 +89,7 @@ In other words, our approximation algorithm can estimate expected multiplicities
\caption{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\trimfigurespacing
\end{figure}
%Graph of query output for intro example
@ -115,28 +115,25 @@ For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$.
\end{Example}
Prior efforts to generalize incomplete databases to bags~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17} replace the Boolean annotations with natural numbers.
Analogously, we generalize the above model of Set-PDBs to bags by using natural-number-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and positive natural number constants ($\Phi_{bag}$ in the example).
Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
Without loss of generality, we assume that input relations are sets (i.e., $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
We contrast bag and set query evaluation with the following example:
\begin{Example}\label{ex:bag-vs-set}
Continuing the prior example, we are given the following Boolean (resp., count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a Boolean (resp., polynomial) formula over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the Boolean query has only a nullary relation, we write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a Boolean (polynomial) formula over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the query result is a nullary relation, we write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
\begin{align*}
\poly_{set}(W_a, W_b, W_c) &= W_aW_b \vee W_bW_c \vee W_cW_a\\
\poly_{bag}(W_a, W_b, W_c) &= W_aW_b + W_bW_c + W_cW_a
\end{align*}
It is left as an exercise for the reader to show that, given assignments to $W_a$, $W_b$, $W_c$, these expressions correspond to the existence (resp., count) of the single nullary result tuple for $\poly$ applied to the database instance in \Cref{fig:intro-ex}.
We show one possible world here, with the set assignment $\{\;W_a\mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$ (and the corresponding bag assignment),
The polynomials evaluate as:
Given $W_a$, $W_b$, $W_c$, these functions compute the existence (resp., count) of the nullary result tuple for $\poly$ applied to the database instance in \Cref{fig:intro-ex}.
We show one possible world here, with the set assignment $\{\;W_a\mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$ and the analogous bag assignment:
\begin{align*}
&\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \bot\top = \top\\
&\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
\end{align*}
The Set-PDB query is satisfied in this possible world, while the Bag-PDB query produces a nullary tuple with a multiplicity of 1.
The Set-PDB query is satisfied in this possible world and the Bag-PDB result tuple has a multiplicity of 1.
The marginal probability (resp., expected count) of this query is computed over all possible worlds:
% \AR{What is $\mu$ below?}
{\small
@ -175,12 +172,13 @@ Computing such expectations is indeed linear in the size of the SOP as the numbe
As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{multline}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}} = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
+ P[W_c = 1]P[W_a = 1]\\
\expct\pbox{\poly_{bag}}
% = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
% + P[W_c = 1]P[W_a = 1]\\
= \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
\end{multline}
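As a worked instance of the identity above, consider the following sketch. It rests on two assumptions that are made here for illustration only: the variables $W_a$, $W_b$, $W_c$ are mutually independent, and, as in the earlier example, $P[W_a = 1] = P[W_b = 1] = P[W_c = 1] = p$. Linearity of expectation splits the sum, and independence splits each product of distinct variables:
\begin{align*}
\expct\pbox{\poly_{bag}} &= \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
&= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}\\
&= p \cdot p + p \cdot p + p \cdot p = 3p^2
\end{align*}
Enumerating the possible worlds gives the same value under these assumptions. The world with all three variables set to 1 contributes $3 \cdot p^3$, each of the three worlds with exactly two variables set to 1 contributes $1 \cdot p^2(1-p)$, and every other world contributes 0, for a total of $3p^3 + 3p^2(1-p) = 3p^2$.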
\begin{figure}[h!]
\begin{figure}[t]
\resizebox{0.8\columnwidth}{!}{
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=4.55cm}, level 2/.style={sibling distance=1.5cm}, level 3/.style={sibling distance=0.7cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}