intro

2021-04-06 08:50:42 -05:00 · 2021-04-06 08:50:42 -05:00 · 487b22f6b2
parent 2a12a719af
commit 487b22f6b2
4 changed files with 179 additions and 10 deletions
--- a/intro-new.tex
+++ b/intro-new.tex
@@ -0,0 +1,177 @@
+%root: main.tex
+%!TEX root=./main.tex
+
+\section{Introduction}
+\label{sec:intro}
+A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is a set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$, called possible worlds, paired with a probability distribution $\pd$ over these worlds. A well-studied problem in probabilistic databases is, given a query $\query$ and a probabilistic database $\pdb$, to compute the \emph{marginal probability} of a tuple $\tup$, i.e., the probability that $\tup$ exists in the result of query $\query$ over $\pdb$. This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases from cases that are in \ptime for the class of unions of conjunctive queries (UCQs). In this work, we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$, and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$:
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{equation}\label{eq:bag-expectation}
+\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db) \hspace{2cm}\text{\textbf{(Expected Result Multiplicity)}}
+\end{equation}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{figure}[t]
+	\begin{subfigure}[b]{0.49\linewidth}
+		\centering
+{\small
+		\begin{tabular}{ c | c c c}
+			$OnTime$ & City$_\ell$ & $\Phi$  & \textbf{p}\\
+			\hline
+                     & Buffalo     & $L_a$ & 0.9 \\
+                     & Chicago     & $L_b$ & 0.5\\
+                     & Bremen      & $L_c$ & 0.5\\
+                     & Zurich      & $L_e$ & 1.0\\
+		\end{tabular}
+	}
+		\caption{Relation $OnTime$}
+		\label{subfig:ex-shipping-simp-loc}
+	\end{subfigure}%
+	\begin{subfigure}[b]{0.49\linewidth}
+		\centering
+{\small
+		\begin{tabular}{ c | c c c c}
+			$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
+			\hline
+                    & Buffalo         & Chicago         & $R_a$          & 1.0        \\
+                    & Chicago         & Zurich          & $R_b$          & 1.0        \\
+                    & $\cdots$        & $\cdots$        & $\cdots$     & $\cdots$   \\
+                    & Chicago         & Bremen          & $R_c$          & 1.0        \\
+		\end{tabular}
+		}
+		\caption{Relation $Route$}
+		\label{subfig:ex-shipping-simp-route}
+      \end{subfigure}%
+  %     	\begin{subfigure}[b]{0.17\linewidth}
+  %   	\centering
+
+  %   \caption{Circuit for $(Chicago)$}
+  %   \label{subfig:ex-proj-push-circ-q3}
+  % \end{subfigure}
+
+	\begin{subfigure}[b]{0.66\linewidth}
+		\centering
+{\small
+          \begin{tabular}{ c | c c c}
+            $\query_1$ & City    & $\Phi$                          & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline
+                       & Buffalo & $L_a \cdot R_a$                 & $0.9$                                            \\
+                       & Chicago & $L_b \cdot R_b + L_b \cdot R_c$ & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$                                               \\
+            & $\cdots$ & $\cdots$ & $\cdots$ \\
+          \end{tabular}
+		}
+		\caption{$Q_1$'s Result}
+		\label{subfig:ex-shipping-simp-queries}
+      \end{subfigure}%
+	\begin{subfigure}[b]{0.33\linewidth}
+      \centering
+		\resizebox{!}{16mm} {
+			\begin{tikzpicture}[thick]
+				\node[tree_node] (a2) at (0, 0){$R_b$};
+				\node[tree_node] (b2) at (1, 0){$L_b$};
+				\node[tree_node] (c2) at (2, 0){$R_c$};
+				%level 1
+				\node[tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
+				\node[tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
+				%level 0
+				\node[tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
+				%edges
+				\draw[->] (a2) -- (a1);
+				\draw[->] (b2) -- (a1);
+				\draw[->] (b2) -- (b1);
+				\draw[->] (c2) -- (b1);
+				\draw[->] (a1) -- (a0);
+				\draw[->] (b1) -- (a0);
+			\end{tikzpicture}
+		}
+		\resizebox{!}{16mm} {
+			\begin{tikzpicture}[thick]
+				\node[tree_node] (a1) at (1, 0){$R_b$};
+				\node[tree_node] (b1) at (2, 0){$R_c$};
+				%level 1
+				\node[tree_node] (a2) at (0.75, 0.8){$L_b$};
+				\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};
+				%level 0
+				\node[tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};
+				%edges
+				\draw[->] (a1) -- (b2);
+				\draw[->] (b1) -- (b2);
+				\draw[->] (a2) -- (a3);
+				\draw[->] (b2) -- (a3);
+			\end{tikzpicture}
+		}
+	\caption{Two circuits for $(Chicago)$}
+	\label{subfig:ex-proj-push-circ-q4}
+	\end{subfigure}%
+  \vspace*{-3mm}
+	\caption{\ti instance and query results for \Cref{ex:intro-tbls}.}%{$\ti$ relations for $\poly$}
+	\label{fig:ex-shipping-simp}
+  \trimfigurespacing
+\end{figure}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{Example}\label{ex:intro-tbls}
+  Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise has multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at a location is on time (with a certain probability). Query $\query_2$, shown below, returns the starting points of shipping routes where processing of shipping is on time.
+
+$$Q_2 := \pi_{\text{City}_1}(OnTime \bowtie_{\text{City}_\ell = \text{City}_1} Route)$$
+
+\Cref{subfig:ex-shipping-simp-queries} shows the result of this query. For example, there is a single route starting in Buffalo, and the Buffalo location is on time with 90\% probability. Thus, the expected multiplicity of this result tuple is $0.9$. There are two shipping routes starting in Chicago, and the Chicago location has a 50\% probability of being on time (we assume that either all shipping starting in a location is on time or all shipping from this location is delayed). Thus, the expected multiplicity of this result tuple is $0.5 + 0.5 = 1.0$.
+\end{Example}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}$) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment from $\mathbb{B}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: the lineage formula evaluates to true over the assignment for a world $\db$ if and only if $\tup \in \query(\db)$. Thus, the marginal probability of tuple $\tup$ is equal to the probability that the lineage evaluates to true with respect to the probability distribution that associates each possible assignment from $\mathbb{B}^\numvar$ with the probability of the world it corresponds to.
+
+For bag semantics, the lineage of a tuple is a polynomial over random variables from the set $\vct{X} \in \mathbb{N}^\numvar$ with
+coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$). Analogously to the set case, evaluating the lineage over an assignment corresponding to a possible world (mapping variables to natural numbers representing input tuple multiplicities in this world) yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$, which we will denote as $\linsett{\query}{\pdb}{\tup}$, or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{Example}\label{ex:intro-lineage}
+Associating a lineage variable with every input tuple as shown in \Cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \Cref{subfig:ex-shipping-simp-queries}. For example, the tuple Chicago is in the result because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing, over all possible worlds, the multiplicity of the result tuple in this world multiplied by the probability of the world (\Cref{eq:bag-expectation}).
+Note that since $\Phi$ is a sum of products, we can use linearity of expectation to solve the problem in time linear in the size of $\linsett{\query}{\pdb}{\tup}$: the expectation of the sum is the sum of the expectations of the monomials. The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
+The expected multiplicity of Chicago is $1.0$.
+\end{Example}
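+
+As a minimal worked instance of the computation in \Cref{ex:intro-lineage}, using only the probabilities listed in \Cref{fig:ex-shipping-simp} and the independence of input tuples, and writing $\expct\left[\cdot\right]$ for the expectation over possible worlds, linearity of expectation gives:
+\[
+\expct\left[\Phi\right] = \expct\left[L_b \cdot R_b\right] + \expct\left[L_b \cdot R_c\right] = \expct\left[L_b\right] \cdot \expct\left[R_b\right] + \expct\left[L_b\right] \cdot \expct\left[R_c\right] = 0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0 .
+\]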
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+While the expected multiplicity of a query result can be computed in time linear in the size of the result's lineage if it is in sum-of-products form, this may not be true for compressed representations of polynomials such as factorized polynomials and arithmetic circuits. For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}. The left one encodes the lineage as a sum of products, while the right one uses distributivity to push the addition below the multiplication, resulting in a smaller circuit. Given that there is a large body of work that can output such compressed representations \BG{cite FDBs and FAQ}, an interesting question is whether computing expectations is still in linear time for such compressed representations. We prove that this is not the case: computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a lineage circuit.
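+
+To illustrate why compressed encodings need more care, consider the following small sketch; the variables $X$, $Y$, $Z$ and the common probability $\prob$ are hypothetical and not part of the running example. The factorized lineage $L_b \cdot (R_b + R_c)$ is unproblematic because its two factors share no variable:
+\[
+\expct\left[L_b \cdot (R_b + R_c)\right] = \expct\left[L_b\right] \cdot \left(\expct\left[R_b\right] + \expct\left[R_c\right]\right) = 0.5 \cdot (1.0 + 1.0) = 1.0 .
+\]
+In contrast, for independent $\{0,1\}$-valued variables $X$, $Y$, $Z$ that each equal one with probability $\prob$, the factors of $(X + Y) \cdot (X + Z)$ share $X$. Since $X^2 = X$ for a $\{0,1\}$-valued variable,
+\[
+\expct\left[(X + Y) \cdot (X + Z)\right] = \expct\left[X + X Z + X Y + Y Z\right] = \prob + 3\prob^2 ,
+\]
+whereas multiplying the expectations of the two factors would give $(2\prob) \cdot (2\prob) = 4\prob^2$. The two values agree only for $\prob \in \{0, 1\}$.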
+
+Of course, any complexity results for computing the expectation of polynomials only translate to the expected result multiplicity problem when we also take into account the complexity of constructing the lineage for a given tuple. As long as the complexity of constructing the lineage polynomial $\linsett{\query}{\pdb}{\tup}$ for a given database $\pdb$, query $\query$, and result tuple $\tup$ is less than or equal to the complexity of computing the expected multiplicity of $\linsett{\query}{\pdb}{\tup}$, our results for polynomials translate to the expected result multiplicity problem.
+
+Concretely, we make the following contributions:
+(i) We show that the expected result multiplicity problem for conjunctive queries over bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;
+(ii) We present a $(1-\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;
+(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
+(iv) We further generalize our results to higher moments and prove that for \raPlus queries, the processing time in approximation is within a constant factor of the same query processed deterministically.
+
+Our hardness results follow by considering a suitable generalization of the lineage polynomial in \cref{eq:edge-query}. First, it is easy to generalize the polynomial to $\poly_G(X_1,\dots,X_n)$, which represents the edge set of a graph $G$ on $n$ vertices. Then $\poly_G^k(X_1,\dots,X_n)$ (i.e., $\inparen{\poly_G(X_1,\dots,X_n)}^k$) encodes as its monomials all subgraphs of $G$ with at most $k$ edges. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ (see \Cref{def:reduced-poly}) can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$, and we observe that $c_{2k}$ is proportional to the number of $k$-matchings in $G$ (which is \sharpwonehard to compute). Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for $2k+1$ distinct values $\prob_i$, $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed, it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity on, e.g., the complexity of counting the number of triangles in $G$, and not on more standard parameterized hardness like \sharpwonehard.
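+
+As a sketch of the interpolation step, stated only in terms of the quantities named above, evaluating at $2k+1$ distinct values $\prob_0, \dots, \prob_{2k}$ yields one linear equation per value in the unknowns $c_0, \dots, c_{2k}$:
+\[
+\sum_{j=0}^{2k} c_j \cdot \prob_i^{\,j} = \rpoly_G^k(\prob_i,\dots,\prob_i) \qquad \text{for } 0 \le i \le 2k .
+\]
+The coefficient matrix $\left(\prob_i^{\,j}\right)_{0 \le i,j \le 2k}$ is a Vandermonde matrix with determinant $\prod_{0 \le i < i' \le 2k} (\prob_{i'} - \prob_i)$, which is nonzero exactly when the $\prob_i$ are pairwise distinct. The system therefore has a unique solution and, in particular, determines $c_{2k}$.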
+
+The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=\poly(1,\dots,1)$, and if all the coefficients of $\poly$ are constants, then $\poly(\prob,\dots, \prob)$ (which can be easily computed in linear time) is a $\prob^k$ approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ (i.e., the \emph{input} tuple probabilities) and $k=\degree(\poly)$ are constants, then this gives a constant factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_\numvar)$ and compute an appropriately weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) we can compute $\poly(1,\ldots, 1)$ in linear time and (ii) we can sample monomials from $\poly(X_1,\dots,X_n)$ quickly.
+%For the ease of exposition, we start off with expression trees (see~\Cref{fig:circuit-q2-intro} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
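+
+The $\prob^k$ factor can be sketched as follows, under assumptions that are not spelled out in this section: the reduced polynomial $\rpoly$ is obtained from $\poly$ by replacing every exponent greater than one with one, all coefficients are non-negative, $0 < \prob \le 1$, and every monomial has degree at most $k$. A monomial with coefficient $c$, total degree $d$, and $\ell$ distinct variables contributes $c \cdot \prob^{d}$ to $\poly(\prob,\dots,\prob)$ and $c \cdot \prob^{\ell}$ to $\rpoly(\prob,\dots,\prob)$. Since $\ell \le d \le k \le k + \ell$,
+\[
+\prob^{k} \cdot c \cdot \prob^{\ell} \;\le\; c \cdot \prob^{d} \;\le\; c \cdot \prob^{\ell},
+\]
+and summing over all monomials gives
+\[
+\prob^{k} \cdot \rpoly(\prob,\dots,\prob) \;\le\; \poly(\prob,\dots,\prob) \;\le\; \rpoly(\prob,\dots,\prob) .
+\]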
+
+We formalize our claim that, since our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
+
+\paragraph{Paper Organization.} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.
+
+
+
+% and then relating the size of the compressed lineage to the cost of answering a deterministic query.
+
+% This suggests that perhaps even Bag-PDBs have higher query processing complexity than deterministic databases.
+% In this paper, we confirm this intuition, first proving that computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a compressed lineage representation, and then relating the size of the compressed lineage to the cost of answering a deterministic query.
+
+% In view of this hardness result (i.e., step 2 of the workflow is the bottleneck in the bag setting as well), we develop an approximation algorithm for expected counts of SPJU query Bag-PDB output, that is, to our knowledge, the first linear time (in the size of the factorized lineage) $(1-\epsilon)$-\emph{multiplicative} approximation, eliminating step 2 from being the bottleneck of the workflow.
+% By extension, this algorithm only has a constant factor slower runtime relative to deterministic query processing.\footnote{
+% 	Monte-carlo sampling~\cite{jampani2008mcdb} is also trivially a constant factor slower, but can only guarantee additive rather than our stronger multiplicative bounds.
+% }
+% This is an important result, because it implies that computing approximate expectations for bag output PDBs of SPJU queries can indeed be competitive with deterministic query evaluation over bag databases.
+
+
+
+
+
+
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "main"
+%%% End:
--- a/intro.tex
+++ b/intro.tex
@@ -3,15 +3,6 @@

 \section{Introduction}
 \label{sec:intro}
-A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds paired with a probability distribution $\pd$ over these worlds. A well-studied problem in probabilistic databases is to compute the \emph{marginal probability} of a tuple $\tup$, i.e., its probability to  to exist in the result of a query $\query$ over $\pdb$. This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs) which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases from cases that are in \ptime for the class of union of conjunctive queries (UCQs). In this work, we consider bag semantics where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analog problem of computing the expectation of the multiplicity of a query result tuple $\tup$:
-\begin{equation}\label{eq:bag-expectation}
-\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db)
-\end{equation}
-
-Under set semantics, the marginal probability of a query result $\tup$ can be computed based on the lineage $\linsett{\query}{\pdb}{\tup}$ of tuple $\tup$. The lineage $\linsett{\query}{\pdb}{\tup}$ is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vec{X}$) over random variables that represent input tuples. Each possible world $\db$ corresponds to an assignment $\mathbb{B}^\numvar$ of each of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). The marginal probability of the tuple is equal to the probability of the lineage $\linsett{\query}{\pdb}{\tup}$ evaluating to true over these assignments. This following from the following fact: iff $\linsett{\query}{\pdb}{\tup}$ evaluates to true over the assignment for a world $\db$, then $\tup \in \query(\db)$. For bag semantics, the lineage of a tuple is a polynomial over random variables from the set $\vct{X} \in \mathbb{N}^\numvar$ with
-coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$). Analog to the set case, evaluating the lineage over an assignment corresponding to a possible world (mapping variables to natural numbers representing input tuple multiplicities) yields the multiplicity of result tuple $\tup$ in this world.
-
-

 In their most general form, tuple-independent set-probabilistic databases~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs) answer existential queries (queries for the probability of a specific condition holding over the input database) in two steps: (i) lineage and (ii) probability.
 The lineage is a boolean formula, an element of the $\text{PosBool}[\vct{X}]$ semiring, where lineage variables $\vct{X}\in \mathbb{B}^\numvar$ are random variables corresponding to the presence of each of the $\numvar$ input tuples in one possible world of the input database.
--- a/main.tex
+++ b/main.tex
@@ -94,6 +94,7 @@ sensitive=true
 \maketitle

 \input{abstract}
+\input{intro-new}
 \input{intro}
 \input{ra-to-poly}
 \input{poly-form}
--- a/main.vtc
+++ b/main.vtc
@@ -1 +1 @@
-\contitem\title{Standard Operating Procedure in Bag PDBs Queries Considered Harmful}\author{Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy, and Atri Rudra}\page{23:1--23:46}
+\contitem\title{Standard Operating Procedure in Bag PDBs Queries Considered Harmful}\author{Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy, and Atri Rudra}\page{23:1--23:51}