fixed refs

master
Boris Glavic 2020-12-19 00:28:07 -06:00
parent 61b46da038
commit d020eba059
3 changed files with 54 additions and 22 deletions


@ -7,9 +7,9 @@
% \AR{\textbf{Oliver/Boris:} What is missing from the intro is why would someone care about bag-PDBs in {\em practice}? This is kinda obliquely referred to in the first para but it would be good to motivate this more. The intro (rightly) focuses on the theoretical reasons to study bag PDBs but what (if any) are the practical significance of getting bag PDBs done in linear-time? Would this lead to much faster real-life PDB systems?}
As explainability and fairness become more relevant to the data science community, it is now more critical than ever to understand how reliable a dataset is.
Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
Probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu} are a compelling solution, but a major roadblock to their adoption remains:
PDBs are orders of magnitude slower than classical (i.e., deterministic) database systems~\cite{feng:2019:sigmod:uncertainty}.
A naive strategy might be to move from the theoretically simpler set-relational model~\cite{DBLP:series/synthesis/2011Suciu,DBLP:conf/sigmod/BoulosDMMRS05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} to the computationally simpler bag-relational model, mirroring a similar transition in deterministic databases decades ago.
A naive strategy might be to move from the theoretically simpler set-relational model~\cite{DBLP:series/synthesis/2011Suciu,BD05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} to the computationally simpler bag-relational model, mirroring a similar transition in deterministic databases decades ago.
However, after discarding a long-held approach to representing lineage, we prove that query processing in Bag-PDBs is \sharpwonehard.
This finding shows that even Bag-PDB query processing has a higher complexity than deterministic query processing, and opens a rich landscape of opportunities for research on approximate algorithms.
@ -28,10 +28,10 @@ Thus, the expectation of the multiplicity is the expectation of this polynomial.
Lineage in Set-PDBs is typically encoded in disjunctive normal form.
This representation is significantly larger than the query result sans lineage.
However, even with alternative encodings~\cite{DBLP:journals/vldb/FinkHO13}, the limiting factor in computing marginal probabilities remains the probability computation itself, and not the lineage formula.
However, even with alternative encodings~\cite{FH13}, the limiting factor in computing marginal probabilities remains the probability computation itself, and not the lineage formula.
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Compressed representations such as Factorized Databases~\cite{10.1145/3003665.3003667,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g., pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula allows us to more closely relate the algorithm's performance to the complexity of query evaluation in a deterministic database.
@ -39,7 +39,7 @@ The initial picture is not good.
In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{10.1145/3003665.3003667} --- lineage polynomial by reduction from counting $k$-matchings.
Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases.
This motivates our second goal: an approximation algorithm for computing expected counts in a bag database whose running time is linear in the size of a factorized lineage formula.
As we will show, the size of the factorized
As we will show, the size of the factorized
lineage formula for a query --- and by extension, our approximation algorithm --- is proportional to the complexity of evaluating the same query on a comparable deterministic database instance~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
In other words, our approximation algorithm can estimate expected multiplicities for tuples in the result of an SPJU query with a complexity comparable to deterministic query-processing.
@ -107,11 +107,11 @@ In other words, our approximation algorithm can estimate expected multiplicities
%\end{figure}
\begin{Example}\label{ex:intro}
Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work also handles Block Independent Disjoint Databases ($\bi$)~\cite{DBLP:conf/sigmod/BoulosDMMRS05,DBLP:series/synthesis/2011Suciu} and we return to this model later.} given in \Cref{fig:intro-ex} with two input relations $R$ and $E$.
Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work also handles Block Independent Disjoint Databases ($\bi$)~\cite{BD05,DBLP:series/synthesis/2011Suciu} and we return to this model later.} given in \Cref{fig:intro-ex} with two input relations $R$ and $E$.
Each input tuple is assigned an annotation (attribute $\Phi_{set}$): an independent random Boolean variable ($W_i$) or the constant $\top$.
% Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) \SF{Do we need to state the meaning of $\top$ and $\bot$? Also do we want to add bag annotation to Figure 1 too since we are discussing both sets and bags later?} identifies one \emph{possible world}, a deterministic database instance that contains exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The probability of this world is the joint probability of the corresponding assignments.
For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$.
\end{Example}
@ -121,7 +121,7 @@ Without loss of generality, we assume that input relations are sets (i.e. $Dom(W
We contrast bag and set query evaluation with the following example:
\begin{Example}\label{ex:bag-vs-set}
Continuing the prior example, we are given the following Boolean (resp., count) query
Continuing the prior example, we are given the following Boolean (resp., count) query
$$\poly() :- R(A), E(A, B), R(B)$$
The lineage of the result in a Set-PDB (resp., Bag-PDB) is a Boolean (resp., polynomial) formula over random variables annotating the input relations (i.e., $W_a$, $W_b$, $W_c$).
Because the Boolean query has only a nullary relation, we write $Q(\cdot)$ to denote the function that evaluates the lineage over one specific assignment of values to the variables (i.e., the value of the lineage in the corresponding possible world):
@ -149,7 +149,7 @@ P[\poly_{set}] &= \hspace*{-1mm}
\end{Example}
Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{10.1145/1265530.1265571}, and thus \sharpphard.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form Boolean lineage are neither independent nor disjoint, which leads to the use of Shannon decomposition (e.g.,~\cite{DBLP:journals/vldb/FinkHO13}), which is at worst exponential in the size of the input.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form Boolean lineage are neither independent nor disjoint, which leads to the use of Shannon decomposition (e.g.,~\cite{FH13}), which is at worst exponential in the size of the input.
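As a hedged illustration (a sketch, assuming the Boolean lineage of \Cref{ex:bag-vs-set} is $W_aW_b \vee W_bW_c \vee W_cW_a$ and that all three variables are independent with probability $p$), a single Shannon expansion on $W_b$ already resolves this small instance:
\begin{align*}
P[\poly_{set}] &= P[W_b]\cdot P[W_a \vee W_c] + P[\neg W_b]\cdot P[W_aW_c]\\
&= p\,(2p - p^2) + (1-p)\,p^2 = 3p^2 - 2p^3
\end{align*}
Each expansion step splits the formula into two subproblems, which is where the worst-case exponential cost comes from.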
% \begin{equation*}
% \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
% \end{equation*}
@ -159,25 +159,25 @@ To see why computing this probability is hard, observe that the clauses of the d
%&W_aW_b \vee W_bW_c \vee W_cW_a
%= &W_a
%\end{align*}
Conversely, in Bag-PDBs, correlations between clauses of the SOP polynomial are not problematic thanks to linearity of expectation.
Conversely, in Bag-PDBs, correlations between clauses of the SOP polynomial are not problematic thanks to linearity of expectation.
The expectation computation over the output lineage is simply the sum of expectations of each clause.
For \Cref{ex:intro}, the expectation is simply
{\small
\begin{align*}
\expct\pbox{\poly(W_a, W_b, W_c)} &= \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
\intertext{\normalsize
\intertext{\normalsize
In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
}
&= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
\end{align*}
}
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{multline}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}} = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
+ P[W_c = 1]P[W_a = 1]\\
= \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
= \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
\end{multline}
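As a quick sanity check (an illustrative instantiation, assuming $P[W_a = 1] = P[W_b = 1] = P[W_c = 1] = \frac{1}{2}$, a value chosen here only for concreteness), evaluating the polynomial takes three multiplications and two additions:
\begin{equation*}
\expct\pbox{\poly_{bag}} = \tfrac{1}{2}\cdot\tfrac{1}{2} + \tfrac{1}{2}\cdot\tfrac{1}{2} + \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{3}{4}
\end{equation*}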
\begin{figure}[h!]
@ -219,12 +219,12 @@ As a further interesting feature of this example, note that $\expct\pbox{W_i} =
\end{figure}
\subsection{Superlinearity of Bag PDBs}
Moving forward, we focus exclusively on bags and drop the subscript from $\poly_{bag}$.
Moving forward, we focus exclusively on bags and drop the subscript from $\poly_{bag}$.
Consider the Cartesian product of $\poly$ with itself:
\begin{equation*}
\poly^2() := \rel(A), E(A, B), \rel(B),\; \rel(C), E(C, D), \rel(D)
\end{equation*}
For an arbitrary polynomial, it is known that there may exist equivalent compressed representations.
For an arbitrary polynomial, it is known that there may exist equivalent compressed representations.
One such compression is the factorized polynomial~\cite{10.1145/3003665.3003667}, where the polynomial is broken up into separate factors.
For example:
{\small
@ -251,7 +251,7 @@ Under the assumption that $Dom(W_i) = \{0, 1\}$, it is generally true that for a
This property leads us to consider another structure related to $\poly$.
% \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
% \par Atri suggests a proof in the appendix regarding this claim.}
For any polynomial $\poly(\vct{X})$, we define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
For any polynomial $\poly(\vct{X})$, we define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
With $\poly^2$ as an example, we have:
\begin{align*}
\rpoly^2(W_a, W_b, W_c) =&\; W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
@ -267,7 +267,7 @@ In prior work on lineage-based Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vl
In general, however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the Cartesian product of $k$ copies of $\poly$ has a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
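To make the gap concrete, plugging a few values of $k$ into the two size expressions above (taken as stated; this is plain arithmetic, not an additional result) gives:
\begin{align*}
k = 2:&\quad 6k = 12 &&\text{vs.}\quad 2\cdot 3^k = 18\\
k = 5:&\quad 6k = 30 &&\text{vs.}\quad 2\cdot 3^k = 486\\
k = 10:&\quad 6k = 60 &&\text{vs.}\quad 2\cdot 3^k = 118098
\end{align*}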
This leads us to the \textbf{central question of this paper}:
\begin{quote}
{\em
{\em
Is it always the case that the expectation of a UCQ in a Bag-PDB can be computed in time linear in the size of the \textbf{compressed} lineage polynomial?}
\end{quote}
If the answer is yes, then it is possible for Bag-PDBs to achieve performance competitive with deterministic databases.
@ -280,8 +280,8 @@ The answer, unfortunately, is no, and an approximation algorithm is required.
% The factorized output polynomial consists of a product of three identical three-way summations, while the SOP encoding is exponential --- $3^3$ clauses to be precise.
\subsection{Overview of our results and techniques}
Concretely, in this paper:
(i) We show that conjunctive queries over a bag-$\ti$ are hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction from counting the number of $k$-matchings over an arbitrary graph;
Concretely, in this paper:
(i) We show that conjunctive queries over a bag-$\ti$ are hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction from counting the number of $k$-matchings over an arbitrary graph;
(ii) We present a $(1-\epsilon)$-approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments and polynomial circuits, and prove that for RA+ queries, the processing time in approximation is within a constant factor of the same query processed deterministically.


@ -601,3 +601,36 @@ Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August
volume = {76},
year = {2010}
}
@inproceedings{GL16,
author = {Paolo Guagliardo and
Leonid Libkin},
booktitle = {PODS},
title = {Making SQL Queries Correct on Incomplete Databases: A Feasibility
Study},
year = {2016}
}
@inproceedings{jampani2008mcdb,
author = {Jampani, Ravi and Xu, Fei and Wu, Mingxi and Perez, Luis Leopoldo and Jermaine, Christopher and Haas, Peter J},
booktitle = {SIGMOD},
title = {MCDB: a monte carlo approach to managing uncertain data},
year = {2008}
}
@article{yang:2015:pvldb:lenses,
author = {Yang, Ying and Meneghetti, Niccolò and Fehling, Ronny and Liu, Zhen Hua and Gawlick, Dieter and Kennedy, Oliver},
title = {Lenses: An On-Demand Approach to ETL},
journal = {pVLDB},
volume = {8},
number = {12},
year = {2015},
pages = {1578--1589}
}
@misc{pdbench,
title = {pdbench},
howpublished = {\url{http://pdbench.sourceforge.net/}},
note = {Accessed: 2020-12-15}
}


@ -21,7 +21,6 @@ Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of
There is a large body of work on using compact representations of Boolean formulas (e.g., various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g., factorizations~\cite{OS16,DBLP:conf/tapp/Zavodny11}), some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}. Compact representations of Boolean formulas for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNF, and FBDD. In terms of circuits over semiring expressions,~\cite{DM14c} studies circuits for absorptive semirings while~\cite{S18a} studies circuits that include negation (expressed as the monus operation of a semiring). Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Parameterized Complexity Theory}\label{sec:param-compl-theory}