paper-BagRelationalPDBsAreHard/intro-new.tex

223 lines
17 KiB
TeX
Raw Normal View History

2021-04-06 09:50:42 -04:00
%root: main.tex
%!TEX root=./main.tex
\AH{1st Submission ICDT Intro}
2021-04-06 09:50:42 -04:00
\section{Introduction}
\label{sec:intro}
2021-04-09 16:12:46 -04:00
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
2021-04-10 13:33:44 -04:00
A well-studied problem in probabilistic databases is to take a query $\query$ and a probabilistic database $\pdb$, and compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
2021-04-09 16:12:46 -04:00
This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs).
2021-06-09 12:42:26 -04:00
In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation (where $\overline{\db}$ denotes a random variable) of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
2021-04-06 09:50:42 -04:00
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2021-04-06 16:11:19 -04:00
\begin{equation}\label{eq:intro-bag-expectation}
2021-06-09 12:42:26 -04:00
\expct_{\randDB \sim \pd}[\query(\overline{D})(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \probOf\pbox{\db} \hspace{2cm}\text{\textbf{(Expected Result Multiplicity)}}
2021-04-06 09:50:42 -04:00
\end{equation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\begin{subfigure}[b]{0.49\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c}
$OnTime$ & City$_\ell$ & $\Phi$ & \textbf{p}\\
\hline
& Buffalo & $L_a$ & 0.9 \\
& Chicago & $L_b$ & 0.5\\
& Bremen & $L_c$ & 0.5\\
2021-04-07 22:49:21 -04:00
& Zurich & $L_d$ & 1.0\\
2021-04-06 09:50:42 -04:00
\end{tabular}
}
\caption{Relation $OnTime$}
\label{subfig:ex-shipping-simp-loc}
\end{subfigure}%
\begin{subfigure}[b]{0.49\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c c}
$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
\hline
& Buffalo & Chicago & $R_a$ & 1.0 \\
& Chicago & Zurich & $R_b$ & 1.0 \\
2021-04-07 22:49:21 -04:00
%& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\
2021-04-06 09:50:42 -04:00
& Chicago & Bremen & $R_c$ & 1.0 \\
\end{tabular}
}
\caption{Relation $Route$}
\label{subfig:ex-shipping-simp-route}
\end{subfigure}%
% \begin{subfigure}[b]{0.17\linewidth}
% \centering
% \caption{Circuit for $(Chicago)$}
% \label{subfig:ex-proj-push-circ-q3}
% \end{subfigure}
\begin{subfigure}[b]{0.66\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c}
$\query_1$ & City & $\Phi$ & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline
& Buffalo & $L_a \cdot R_a$ & $0.9$ \\
& Chicago & $L_b \cdot R_b + L_b \cdot R_c$ & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$ \\
2021-04-07 22:49:21 -04:00
%& $\cdots$ & $\cdots$ & $\cdots$ \\
2021-04-06 09:50:42 -04:00
\end{tabular}
}
\caption{$Q_1$'s Result}
\label{subfig:ex-shipping-simp-queries}
\end{subfigure}%
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[tree_node] (a2) at (0, 0){$R_b$};
\node[tree_node] (b2) at (1, 0){$L_b$};
\node[tree_node] (c2) at (2, 0){$R_c$};
%level 1
\node[tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
%level 0
\node[tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
%edges
\draw[->] (a2) -- (a1);
\draw[->] (b2) -- (a1);
\draw[->] (b2) -- (b1);
\draw[->] (c2) -- (b1);
\draw[->] (a1) -- (a0);
\draw[->] (b1) -- (a0);
\end{tikzpicture}
}
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (1, 0){$R_b$};
\node[tree_node] (b1) at (2, 0){$R_c$};
%level 1
\node[tree_node] (a2) at (0.75, 0.8){$L_b$};
\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};
%level 0
\node[tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};
%edges
\draw[->] (a1) -- (b2);
\draw[->] (b1) -- (b2);
\draw[->] (a2) -- (a3);
\draw[->] (b2) -- (a3);
\end{tikzpicture}
}
2021-04-07 22:49:21 -04:00
\caption{Two circuits for $Q_1(Chicago)$}
2021-04-06 09:50:42 -04:00
\label{subfig:ex-proj-push-circ-q4}
\end{subfigure}%
\vspace*{-3mm}
\caption{\ti instance and query results for \Cref{ex:intro-tbls}.}%{$\ti$ relations for $\poly$}
\label{fig:ex-shipping-simp}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-tbls}
2021-04-10 13:33:44 -04:00
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each input tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$, shown below returns starting points of shipping routes where shipment processing is on time.
2021-04-06 09:50:42 -04:00
2021-04-10 09:40:43 -04:00
$$Q_1(\text{City}) \dlImp OnTime(\text{City}), Route(\text{City}, \dlDontcare)$$
2021-04-06 09:50:42 -04:00
2021-04-06 18:46:06 -04:00
\Cref{subfig:ex-shipping-simp-queries} shows the possible results of this query.
2021-04-09 16:12:46 -04:00
For example, there is a 90\% probability there is a single route starting in Buffalo that is on time, and the expected multiplicity of this result tuple is $0.9$.
There are two shipping routes starting in Chicago.
2021-04-08 11:51:26 -04:00
Since the Chicago location has a 50\% probability of being on schedule (we assume that delays are linked), the expected multiplicity of this result tuple is $0.5 + 0.5 = 1.0$.
2021-04-06 09:50:42 -04:00
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2021-04-10 13:33:44 -04:00
A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$~\cite{DBLP:conf/pods/GreenKT07} of positive Boolean expressions)
over random variables
2021-04-10 13:33:44 -04:00
($\vct{X}=(X_1,\dots,X_n)$)
2021-04-10 00:19:16 -04:00
that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
2021-06-09 12:42:26 -04:00
Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\pd$ defined over $\Omega$ and induced over $\vct{X}$).
2021-04-06 09:50:42 -04:00
2021-04-07 22:49:21 -04:00
For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
2021-04-09 16:12:46 -04:00
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
2021-04-10 13:33:44 -04:00
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$, which for this example we denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context\footnote{
2021-04-10 13:21:10 -04:00
In later sections, where we focus on a single lineage polynomial, we will simply refer to $\linsett{\query}{\pdb}{\tup}$ as $Q$.
2021-04-10 13:33:44 -04:00
}. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
2021-04-06 09:50:42 -04:00
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-lineage}
2021-06-09 12:42:26 -04:00
Associating a lineage variable with every input tuple as shown in \Cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \Cref{subfig:ex-shipping-simp-queries}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
2021-04-10 09:40:43 -04:00
In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to solve the problem in linear time (in the size of $\Phi$).
2021-04-09 16:12:46 -04:00
The expectation of the sum is the sum of the expectations of monomials.
2021-06-09 12:42:26 -04:00
The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial, since we have independence of tuples.
2021-04-10 00:19:16 -04:00
The expected multiplicity for Chicago is $1.0$.
2021-04-06 09:50:42 -04:00
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2021-04-06 18:46:06 -04:00
The expected multiplicity of a query result can be computed in linear time (in the size of the result's lineage) if the lineage is in SOP form.
2021-04-09 16:12:46 -04:00
However, this need not be true for compressed representations of polynomials, including factorized polynomials or arithmetic circuits.
For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.
The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.
2021-04-10 00:19:16 -04:00
Given that there is a large body of work (on, e.g., deterministic bag-relational query processing) that can output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
2021-04-09 16:12:46 -04:00
an interesting question is whether computing expectations is still in linear time for such compressed representations.
2021-04-10 13:59:17 -04:00
If the answer is in the affirmative, then probabilities over bag-PDBs can be computed with linear overhead (in the size of the compressed representation) using any algorithm that computes compressed lineage polynomials.
% and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.
2021-04-07 22:49:21 -04:00
Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.
2021-04-06 09:50:42 -04:00
2021-04-06 16:11:19 -04:00
Concretely, we make the following contributions:
2021-04-10 13:20:30 -04:00
(i) We show that the expected result multiplicity problem (\Cref{def:the-expected-multipl}) for conjunctive queries for bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;
2021-04-10 00:19:16 -04:00
(ii) We present an $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding; %;\BG{Fix not linear in all cases, restate after 4 is done}
2021-04-06 09:50:42 -04:00
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
2021-04-10 12:21:30 -04:00
(iv) We further prove that for \raPlus queries (an equivalently expressive, but factorizable form of UCQs), we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
2021-04-07 22:49:21 -04:00
2021-04-09 16:12:46 -04:00
%\mypar{Implications of our Results} As mentioned above
2021-04-07 22:49:21 -04:00
2021-04-10 09:48:26 -04:00
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial by continuing \Cref{ex:intro-tbls}.
2021-04-07 22:49:21 -04:00
2021-04-09 16:12:46 -04:00
%Moving forward, we focus exclusively on bags.
2021-04-10 09:48:26 -04:00
Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$ over the bag relations of \Cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2()\dlImp Q(), Q()$.
2021-04-07 22:49:21 -04:00
%The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):
%\begin{equation*}
%\poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
%\end{equation*}
%This equivalent SOP representation is
2021-04-10 00:19:16 -04:00
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
2021-04-07 22:49:21 -04:00
\begin{equation*}
\left(L_aL_b + L_bL_d + L_bL_c\right)^2=L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
\end{equation*}
The expectation $\expct\pbox{\Phi^2}$ then is:
2021-04-08 11:51:26 -04:00
\begin{multline*}
\expct\pbox{L_a}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
\end{multline*}
2021-04-10 00:19:16 -04:00
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
2021-04-07 22:49:21 -04:00
\begin{footnotesize}
\begin{equation*}
2021-06-09 12:42:26 -04:00
\expct\pbox{L_a}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{L_d}\expct\pbox{L_c}
2021-04-07 22:49:21 -04:00
\end{equation*}
\end{footnotesize}
2021-04-10 13:20:30 -04:00
\noindent This property leads us to consider a structure related to the lineage polynomial.
2021-04-07 22:49:21 -04:00
\begin{Definition}\label{def:reduced-poly}
2021-04-08 11:51:26 -04:00
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the SOP form of $\poly(\vct{X})$ to $1$.
2021-04-07 22:49:21 -04:00
\end{Definition}
With $\Phi^2$ as an example, we have:
\begin{align*}
\widetilde{\Phi^2}(L_a, L_b, L_c, L_d)
2021-06-09 12:42:26 -04:00
=&\; L_aL_b + L_bL_d + L_bL_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d
2021-04-07 22:49:21 -04:00
\end{align*}
2021-04-10 13:20:30 -04:00
It can be verified that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}), \probOf\pbox{L_d=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} UCQs over TIDB/BIDB.
2021-04-07 22:49:21 -04:00
%The reduced form of a lineage polynomial can be obtained but requires a linear scan over the clauses of an SOP encoding of the polynomial. Note that for a compressed representation, this scheme would require an exponential number of computations in the size of the compressed representation. In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results .
To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode various hard graph-counting problems. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For the upper bound it is easy to check that if all the probabilties are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. \AH{Why do we say `approximation'? This is a linear {\emph exact} computation.} To get an $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
2021-04-06 09:50:42 -04:00
2021-04-06 10:16:09 -04:00
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2021-04-10 13:20:30 -04:00
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl}). Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.
2021-04-06 09:50:42 -04:00
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: