This commit is contained in:
Aaron Huber 2020-12-11 20:20:02 -05:00
commit 17b2e6df8f
5 changed files with 43 additions and 33 deletions

View file

@ -1,9 +1,15 @@
%root: main.tex
\begin{abstract}
The problem of computing the marginal probability of a tuple in the result of a query over a probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{linage formula} of the result which is a Boolean formula whose variables represent the existence of tuples in the database. Under bag semantics, lineage formulas have to be replaced with provenance polynomials. For any given possible world, the polynomial of a result tuple evaluates to the multiplicity of the tuple in this world. In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately. For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can be computed in linear time in the size of the tuple's provenance polynomial if the polynomial is encoded as a sum of products. However, using a reduction from the problem of counting k-matchings, we demonstrate that calculating the expectation for factorized polynomials is \sharpwonehard. We then proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial.
\AH{High-level intution}
Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage. This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.
\BG{Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage. This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.}
\AH{Low-level why-would-an-expert-read-this}
However, when we consider compressed representations of the tuple lineage, the complexity landscape changes. If we use a lineage computed over a factorized database, we find in general that computation time is not linear in the size of the compressed lineage.
\BG{However, when we consider compressed representations of the tuple lineage, the complexity landscape changes. If we use a lineage computed over a factorized database, we find in general that computation time is not linear in the size of the compressed lineage.}
\AH{Key technical contributions}
This work theoretically demonstrates that bags are not easy in general, and in the case of compressed lineage forms, the computation can be greater than linear. As such, it is then desirable to have an approximation algorithm to approximate the expected multiplicity in linear time. We introduce such an algorithm and give theoretical guarentees on its efficiency and accuracy. It then follows that computing an approximate value of the tuple's expected multiplicity on a bag PDB is equivalent to deterministic query processing complexity.
\BG{This work theoretically demonstrates that bags are not easy in general, and in the case of compressed lineage forms, the computation can be greater than linear. As such, it is then desirable to have an approximation algorithm to approximate the expected multiplicity in linear time. We introduce such an algorithm and give theoretical guarentees on its efficiency and accuracy. It then follows that computing an approximate value of the tuple's expected multiplicity on a bag PDB is equivalent to deterministic query processing complexity.}
\end{abstract}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:

View file

@ -2,11 +2,11 @@
\section{Introduction}
Modern production databases like Postgres and Oracle use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting (a known $\sharpP$ problem).
Modern production databases like Postgres and Oracle use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting (a known \sharpPhard problem).
%the annotation of the tuple is a lineage formula ~\cite{DBLP:series/synthesis/2011Suciu}, which can essentially be thought of as a boolean formula. It is known that computing the probability of a lineage formula is \#-P hard in general
In PDBs, a boolean formula, ~\cite{DBLP:series/synthesis/2011Suciu} also called a lineage formula, encodes the conditions under which each output tuple appears in the result.
%The marginal probability of this formula being true is the tuple's probability to appear in a possible world.
The set of variables in a lineage formula are each drawn from a probability distribution, from which the marginal probability can be computed. The corresponding problem for bag PDBs is computing the expected multiplicities of the output tuple, where polynomials are used to represent the probability distribution of the multiplicity of input tuples.
In PDBs, a boolean formula, ~\cite{DBLP:series/synthesis/2011Suciu} also called a lineage formula, encodes the conditions under which each output tuple appears in the result.
%The marginal probability of this formula being true is the tuple's probability to appear in a possible world.
The set of variables in a lineage formula are each drawn from a probability distribution, from which the marginal probability can be computed. The corresponding problem for bag PDBs is computing the expected multiplicities of the output tuple, where polynomials are used to represent the probability distribution of the multiplicity of input tuples.
%In this case, the polynomial is interpreted as the probability of the input tuple contributing to the multiplicity of the output tuple. Or in other words, the expectation of the polynomial is the expected multiplicity of the output tuple.
The standard representation for lineage formulas in PDBs is the sum of products (SOP). The SOP is essentially the expansion of all products of sums terms, so that the formula is a sum of variable products. The SOP representation is much bigger than the lineage-free representation that deterministic databases employ.
Due to linearity of expectation, computing the expectation of tuple multiplicities is linear in the number of terms in the SOP formula, so many regard bags to be easy. In this work we consider compressed representations of the lineage formula. We show that the complexity landscape becomes much more nuanced, and is \textit{not} linear in general. Such compressed representations of the formula are analagous to deterministic query optimizations (e.g. pushing down projections). In this work, we define hard to be anything greater than linear time. Thus, even bag PDBs do not enjoy the same computational complexity as deterministic databases and are hard in general. This makes it desirable to find linear time approximation algorithm.
@ -80,7 +80,7 @@ Next we explain why this query is hard in set semantics % due to correlations in
and easy under bag semantics.% with a polynomial formula representing the multiple contributing tuples from the input set $\ti$, it is easy since we enjoy linearity of expectation.
\end{Example}
Our work also handles Block Independent Disjoint Databases ($\bi$), a PDB model in which tuples are arranged in blocks, where all blocks are independent from one another, but tuples within the same block are mutually exclusive. For now, let us consider the $\ti$ model. In the example we consider a fixed probability $\prob$ for all tuple variables such that $P[W_i = 1] = \prob$. Let us also be explicit in mentioning that the input tables are \textit{sets}, i.e. $Dom(W_i) = \{0, 1\}$, and the difference when we speak of bag semantics, is that we consider the query to potentially have duplicates, or in other words we are thinking about query output (over set instances) in the bag context.
Our work also handles Block Independent Disjoint Databases ($\bi$), a PDB model in which tuples are arranged in blocks, where all blocks are independent from one another, but tuples within the same block are mutually exclusive. For now, let us consider the $\ti$ model. In the example we consider a fixed probability $\prob$ for all tuple variables such that $P[W_i = 1] = \prob$. Let us also be explicit in mentioning that the input tables are \textit{sets}, i.e. $Dom(W_i) = \{0, 1\}$, and the difference when we speak of bag semantics, is that we consider the query to potentially have duplicates, or in other words we are thinking about query output (over set instances) in the bag context.
To contrast the bag/polynomial and set/lineage interpretations, we provide another example.
\begin{Example}\label{ex:bag-vs-set}
@ -92,7 +92,7 @@ The output polynomial in ~\cref{ex:intro} has the following lineage formula (top
Notice that $\poly$ in the set/lineage setting above, $\poly: (\mathbb{B})^3 \mapsto \mathbb{B}$, while under bag/polynomial semantics we define $\poly: (\mathbb{N})^3 \mapsto \mathbb{N}$.
Assume the following $\mathbb{B}/\mathbb{N}$ variable assignments: $W_a\mapsto T/1, W_b \mapsto T/1, W_c \mapsto F/0.$ Then the polynomials evaluate as
Assume the following $\mathbb{B}/\mathbb{N}$ variable assignments: $W_a\mapsto T/1, W_b \mapsto T/1, W_c \mapsto F/0.$ Then the polynomials evaluate as
\begin{align*}
&\poly(T, T, F) = TT \vee TF \vee FT = T\\
&\poly(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
@ -102,7 +102,7 @@ In the set/lineage setting, we find that the boolean query is satisfied, while i
Note that computing the probability of the query of ~\cref{ex:intro} in set semantics is indeed $\sharpP$ hard, since it is a query that is non-hierarchical
%, i.e., for $Vars(\poly)$ denoting the set of variables occuring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$,
~\cite{10.1145/1265530.1265571}. %Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e. the probability of the output with annotation $\poly(W_a, W_b, W_c)$, ($\prob(q)$ in Dalvi, Sucui) is hard in set semantics.
To see why this computation is hard for query $\poly$ over set semantics, from the query input we compute an output lineage formula of $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent of one another and the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$:
To see why this computation is hard for query $\poly$ over set semantics, from the query input we compute an output lineage formula of $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$. Note that the conjunctive clauses are not independent of one another and the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$:
\begin{equation*}
\expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
\end{equation*}
@ -197,8 +197,8 @@ Define $\rpoly^2(\vct{X})$ to be the resulting polynomial when all exponents $e
\begin{align*}
&\poly^2(W_a, W_b, W_c) = W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c\\
&+ 2W_aW_bW_c^2,
\end{align*}
then
\end{align*}
then
\begin{align*}
&\rpoly^2(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
&+ 2W_aW_bW_c\\

View file

@ -8,7 +8,6 @@
\newcommand{\wElem}{w} %an element of \vct{w}
\newcommand{\st}{\;|\;} %such that
\newcommand{\kElem}{k}%the kth element
\newcommand{\sharpP}{\#P}
%RA-to-Poly Notation
\newcommand{\polyinput}[2]{\left(#1,\ldots, #2\right)}
\newcommand{\numvar}{n}
@ -141,7 +140,7 @@
\tikzset{
default_node/.style={align=center, inner sep=0pt},
pattern_node/.style={fill=gray!50, draw=black, semithick, inner sep=0pt, minimum size = 2pt, circle},
tree_node/.style={default_node, draw=black, black, circle, text width=0.3cm, font=\bfseries, minimum size=0.65cm},
tree_node/.style={default_node, draw=black, black, circle, text width=0.3cm, font=\bfseries, minimum size=0.65cm},
highlight_color/.style={black}, wght_color/.style={black},
highlight_treenode/.style={tree_node, draw=black, black},
edge from parent path={(\tikzparentnode) -- (\tikzchildnode)}
@ -160,7 +159,7 @@
\end{tikzpicture}
}
}
\newcommand{\kmatch}{\ed\cdots\ed^\kElem}
\newcommand{\twodis}{\patternshift{
\begin{tikzpicture}[every path/.style={thick, draw}]
@ -169,7 +168,7 @@
\node at (0.14, 0) [pattern_node] (bottom2) {};
\node [above=0.07cm of bottom2, pattern_node] (top2) {} edge (bottom2);
\end{tikzpicture}
}
}
}
\newcommand{\twopath}{\patternshift{
\begin{tikzpicture}[every path/.style={thick, draw}]
@ -247,7 +246,7 @@
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREM LIKE ENVIRONMENTS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\DeclareMathAlphabet{\mathbbold}{U}{bbold}{m}{n}
@ -310,12 +309,15 @@
\newcommand{\dbDomK}[1]{\mathcal{DB}_{#1}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% COMPLEXITY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\sharpphard}{\#P-hard\xspace}
\newcommand{\sharpwonehard}{\#W[1]-hard\xspace}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "attrprov"
%%% TeX-master: "main"
%%% End:
%%%Adding stuff below so that long chain of display equatoons can be split across pages

View file

@ -1,4 +1,3 @@
\documentclass[sigconf]{acmart}
\usepackage{algpseudocode}
@ -145,7 +144,7 @@ sensitive=true
\maketitle
%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%
%\input{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -191,8 +190,3 @@ sensitive=true
% \input{glossary.tex}
% \input{addproofappendix.tex}
\end{document}

View file

@ -2,18 +2,22 @@
\section{Multiple Distinct $\prob$ Values}
We would like to argue for a compressed version of $\poly(\vct{w})$, in general $\expct_{\vct{w}}\pbox{\poly(\vct{w})}$ cannot be computed in linear time.
We would like to argue for a compressed version of $\poly(\vct{w})$, in general $\expct_{\vct{w}}\pbox{\poly(\vct{w})}$ cannot be computed in linear time.
\AR{Added the hardness result below.}
Our hardness result is based on the following hardness result:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}[\cite{k-match}]
\label{thm:k-match-hard}
Given a positive integer $k$ and an undirected graph $G$ with no self-loops or parallel edges, counting the number of $k$-matchings in $G$ is $\#W[1]$-hard.
Given a positive integer $k$ and an undirected graph $G$ with no self-loops or parallel edges, counting the number of $k$-matchings in $G$ is \sharpwonehard.
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The above result means that we cannot hope to count the number of $k$-matchings in $G=(V,E)$ in time $f(k)\cdot |V|^{O(1)}$ for any function $f$. In fact, all known algorithms to solve this problem take time $|V|^{\Omega(k)}$.
To prove our hardness result, consider a graph $G(V, E)$, where $|E| = \numedge$, $|V| = \numvar$, and $i, j \in [\numvar]$.
To prove our hardness result, consider a graph $G(V, E)$, where $|E| = \numedge$, $|V| = \numvar$, and $i, j \in [\numvar]$.
Consider the query $\poly_{G}(\vct{X}) = q_E(X_1,\ldots, X_\numvar) = \sum\limits_{(i, j) \in E} X_i \cdot X_j$.
Consider the query $\poly_{G}(\vct{X}) = q_E(X_1,\ldots, X_\numvar) = \sum\limits_{(i, j) \in E} X_i \cdot X_j$.
For the following discussion, set $\poly_{G}^\kElem(\vct{X}) = \left(q_E(X_1,\ldots, X_\numvar)\right)^\kElem$.
@ -30,13 +34,13 @@ We will show that $\rpoly_{G}^\kElem(\prob,\ldots, \prob) = \sum\limits_{i = 0}^
Since each of $(i_1, j_1),\ldots, (i_\kElem, j_\kElem)$ are from $E$, it follows that the set of $\kElem!$ permutations of the $\kElem$ $X_iX_j$ pairs which form the monomial products are of degree $2\kElem$ with the number of distinct variables in an arbitrary monomial $\leq 2\kElem$. By definition, $\rpoly_{G}^{\kElem}(\vct{X})$ sets every exponent $e > 1$ to $e = 1$, thereby shrinking the degree a monomial product term in the SOP form of $\poly_{G}^{\kElem}(\vct{X})$ to the exact number of distinct variables the monomial contains. This implies that $\rpoly_{G}^\kElem$ is a polynomial of degree $2\kElem$ and hence $\rpoly_{G}^\kElem(\prob,\ldots, \prob)$ is a polynomial in $\prob$ of degree $2\kElem$. Then it is the case that
\begin{equation*}
\rpoly_{G}^{\kElem}(\prob,\ldots, \prob) = \sum_{i = 0}^{2\kElem} c_i \prob^i
\end{equation*}
\end{equation*}
where $c_i$ denotes all monomials in the expansion of $\poly_{G}^{\kElem}(\vct{X})$ composed of $i$ distinct variables, with $\prob$ substituted for each distinct variable\footnote{Since $\rpoly_G^\kElem(\vct{X})$ does not have any monomial with degree $< 2$, it is the case that $c_0 = c_1 = 1$.}.
Given that we then have $2\kElem + 1$ distinct values of $\rpoly_{G}^\kElem(\prob,\ldots, \prob)$ for $0\leq i\leq2\kElem$, it follows that we then have $2\kElem + 1$ distinct rows of the form $\prob_i^0\ldots\prob_i^{2\kElem}$ which form a matrix $M$. We have then a linear system of the form $M \cdot \vct{c} = \vct{b}$ where $\vct{c}$ is the coefficient vector ($c_0,\ldots, c_{2\kElem}$), and $\vct{b}$ is the vector such that $\vct{b}[i] = \rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$. By construction of the summation, matrix $M$ is the Vandermonde matrix, from which it follows that we have a matrix with full rank, and we can solve the linear system in $O(k^3)$ time to determine $\vct{c}$ exactly.
Denote the number of $\kElem$-matchings in $G$ as $\numocc{G}{\kmatch}$. Note that $c_{2\kElem}$ is $\kElem! \cdot \numocc{G}{\kmatch}$. This can be seen intuitively by looking at the original factorized representation $\poly_{G}^\kElem(\vct{X})$, where, across each of the $\kElem$ products, an arbitrary $\kElem$-matching can be selected $\prod_{i = 1}^\kElem \kElem = \kElem!$ times. Note that each $\kElem$-matching $(i_1, j_1)\ldots$ $(i_k, j_k)$ in $G$ corresponds to the unique monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ in $\poly_{G}^\kElem(\vct{X})$, where each index is distinct. Since each index is distinct, then each variable has an exponent $e = 1$ and this monomial survives in $\rpoly_{G}^{\kElem}(\vct{X})$ Since $\rpoly$ contains only exponents $e \leq 1$, the only degree $2\kElem$ terms that can exist in $\rpoly_{G}^\kElem$ are $\kElem$-matchings since every other monomial in $\poly_{G}^\kElem(\vct{X})$ has strictly less than $2\kElem$ distinct variables, which, as stated earlier implies that every other non-$\kElem$-matching monomial in $\rpoly_{G}^\kElem(\vct{X})$ has degree $< 2\kElem$.
%It has already been established above that a $\kElem$-matching ($\kmatch$) has coefficient $c_{2\kElem}$. As noted, a $\kElem$-matching occurs when there are $\kElem$ edges, $e_1, e_2,\ldots, e_\kElem$, such that all of them are disjoint, i.e., $e_1 \neq e_2 \neq \cdots \neq e_\kElem$. In all $\kElem$ factors of $\poly_{G}^\kElem(\vct{X})$ there are $k$ choices from the first factor to select an edge for a given $\kElem$ matching, $\kElem - 1$ choices in the second factor, and so on throughout all the factors, yielding $\kElem!$ duplicate terms for each $\kElem$ matching in the expansion of $\poly_{G}^\kElem(\vct{X})$.
%It has already been established above that a $\kElem$-matching ($\kmatch$) has coefficient $c_{2\kElem}$. As noted, a $\kElem$-matching occurs when there are $\kElem$ edges, $e_1, e_2,\ldots, e_\kElem$, such that all of them are disjoint, i.e., $e_1 \neq e_2 \neq \cdots \neq e_\kElem$. In all $\kElem$ factors of $\poly_{G}^\kElem(\vct{X})$ there are $k$ choices from the first factor to select an edge for a given $\kElem$ matching, $\kElem - 1$ choices in the second factor, and so on throughout all the factors, yielding $\kElem!$ duplicate terms for each $\kElem$ matching in the expansion of $\poly_{G}^\kElem(\vct{X})$.
Then, since we have $\kElem!$ duplicates of each distinct $\kElem$-matching, and the fact that $c_{2\kElem}$ contains all monomials with degree $2\kElem$, it follows that $c_{2\kElem} = \kElem!\cdot\numocc{G}{\kmatch}$. This allows us to solve for $\numocc{G}{\kmatch}$ by simply dividing $c_{2\kElem}$ by $\kElem!$.
\end{proof}
@ -72,3 +76,7 @@ The proof follows by ~\cref{thm:k-match-hard} and ~\cref{lem:qEk-multi-p}.
%The proof follows by ~\cref{thm:k-match-hard}, ~\cref{lem:qEk-multi-p} and ~\cref{cor:lem-qEk}.
%\end{proof}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: