%root: main.tex
\section{Introduction (Rewrite - 070921)}
\input{two-step-model}
A tuple-independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a pair $\inparen{\db, \pd}$, where $\db$ is a set of $\numvar$ tuples. The probability distribution $\pd$ over $\db$ is the one induced by the requirement that each tuple be treated as an independent Bernoulli-distributed random variable. In bag query semantics, the random variable $\query\inparen{\pdb}\inparen{\tup}$ denotes the multiplicity of its corresponding tuple $\tup$. The query evaluation problem in bag-\abbrPDB semantics can be stated as follows.
\begin{Problem}\label{prob:bag-pdb-query-eval}
Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$),\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} compute the expected multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$.
\end{Problem}
There exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{X}}}$, where $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$ is the vector of variables annotating the tuples in $\pdb$. The expectation is taken over the Bernoulli distribution on $\{0, 1\}^\numvar$ induced by $\pd$, and $\poly_\tup$ is evaluated under the standard interpretation of the addition and multiplication operators over the natural numbers, i.e., $\semN$-semiring semantics.
The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$. For example, the problem of counting $k$-cliques (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot be solved in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
This result is unsatisfying when considering the complexity of evaluating $\query$ over \abbrPDB\xplural, since it does not account for computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$, entirely ignoring the `P' in \abbrPDB.
A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$ separately from the complexity of deterministic query evaluation. Viewing \abbrPDB query evaluation as these two separate steps is also known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}, illustrated in \cref{fig:two-step}.
The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{X})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separately from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard, since evaluating the probability inline with the query operators (extensional evaluation) only approximates the actual probability in such a case. The paradigm of \cref{fig:two-step} is also neatly followed by semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with standard polynomials, i.e. elements from $\semNX$ connected by addition operators.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial is subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely to separating the concerns of the deterministic computation and the probability computation.
Let $\timeOf{\abbrStepOne}$ denote the runtime of \abbrStepOne, and similarly $\timeOf{\abbrStepTwo}$ for \abbrStepTwo. When solving \cref{prob:bag-pdb-query-eval}, $\timeOf{\abbrStepTwo}$ lies somewhere between $\bigO{\timeOf{\abbrStepOne}}$ and $\bigO{\timeOf{\abbrStepOne}^k}$. When $\poly_\tup$ is in \abbrSMB\footnote{\abbrSMB is akin to the sum of products expansion but with the added requirement that all monomials are unique.}, computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$ is linear in the size of $\poly_\tup$ (due to linearity of expectation and the independence assumption of \abbrTIDB). For a factorized $\poly_\tup$, in contrast, product terms must in general be expanded, so in the worst case $\timeOf{\abbrStepTwo}$ is $\timeOf{\abbrStepOne}^k$ for a $k$-wise factorization. This observation introduces our next problem statement:
\begin{Problem}\label{prob:big-o-step-one}
Given a \abbrPDB $\pdb$ and $\raPlus$ query $\query$, is it \emph{always} the case that $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$?
\end{Problem}
If the answer to \cref{prob:big-o-step-one} is yes, then the query evaluation problem over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation.
\Cref{prob:big-o-step-one} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where each output tuple appears at most once. Here, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of the said operators on boolean variables.} whose evaluation follows the standard semantics ($\semB$-semiring semantics), denoting the presence or absence of $\tup$. Computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$ determines the marginal probability of $\tup$ appearing in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard\footnote{\sharpp is the counting version for problems residing in the NP complexity class.} in general, and proved that a dichotomy exists for this problem, where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard for any polynomial-time \abbrStepOne. Since the hardness is in data complexity (the size of the input, $\Theta(\numvar)$), techniques such as parameterized complexity (bounding complexity by a parameter other than $\numvar$) and fine-grained analysis (complexity analysis that asks what precisely is the value of this other parameter, for example, what is the value of $f(k)$ given a \sharpwone algorithm) of \abbrStepTwo will not refine the hardness results from \sharpphard.
There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit than set-\abbrPDB\xplural. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly_\tup\inparen{\vct{X}}}$) of the result.
In bag-\abbrPDB\xplural (as alluded to above), $\timeOf{\abbrStepTwo}$ is $\bigO{\abs{\poly_\tup}}$\footnote{$\abs{\poly_\tup}$ denotes the size of $\poly_\tup$, i.e., the number of arithmetic operations.} when $\poly_\tup$ is in \abbrSMB. For the special case when $\query$ is a sequence of query algorithms (e.g. $\project\inparen{\join}$) whose evaluation is precisely mirrored in the \abbrSMB representation of $\poly_\tup$, it then follows that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$.
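As a minimal illustration of both regimes (the three-variable polynomials below are hypothetical and are not taken from \cref{fig:two-step}), suppose $X_1, X_2, X_3$ are independent $\{0, 1\}$-valued variables with $\probOf\pbox{X_i = 1} = \prob_i$. For a polynomial in \abbrSMB, linearity of expectation and independence allow a single pass over the monomials:
\begin{equation*}
\expct\pbox{X_1X_2 + X_3} = \expct\pbox{X_1}\expct\pbox{X_2} + \expct\pbox{X_3} = \prob_1\prob_2 + \prob_3.
\end{equation*}
For the factorized polynomial $(X_1 + X_2)(X_1 + X_3)$, in contrast, the two factors share $X_1$ and are therefore not independent, so the expectation cannot simply be pushed through the product. Using $X_1^2 = X_1$ for $X_1 \in \{0, 1\}$,
\begin{equation*}
\expct\pbox{(X_1 + X_2)(X_1 + X_3)} = \expct\pbox{X_1^2 + X_1X_3 + X_1X_2 + X_2X_3} = \prob_1 + \prob_1\prob_3 + \prob_1\prob_2 + \prob_2\prob_3,
\end{equation*}
which differs from $(\prob_1 + \prob_2)(\prob_1 + \prob_3)$ by $\prob_1 - \prob_1^2$, a nonzero amount whenever $0 < \prob_1 < 1$.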
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly_\tup(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations of $\poly_\tup(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits
\footnote{An arithmetic circuit has variable and/or numeric inputs, with internal nodes each of which can take on a value of either an addition or multiplication operator.}
as the representation system of $\poly_\tup(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07}. The standard query evaluation semantics depicted in \cref{fig:nxDBSemantics} nicely illustrate this.
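To see how large the gap between a circuit and its \abbrSMB expansion can be, consider the following sketch (the doubly indexed variables $X_{j,i}$ and the parameter $m$ are introduced here purely for illustration, and gates are assumed to have fan-in two). The product
\begin{equation*}
\prod_{j=1}^{k}\inparen{X_{j,1} + X_{j,2} + \cdots + X_{j,m}}
\end{equation*}
is represented by a circuit with $k(m-1)$ addition gates and $k-1$ multiplication gates, whereas its \abbrSMB form consists of $m^k$ distinct monomials, one for each way of choosing a variable from every factor.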
\begin{figure}
\begin{align*}
\evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
\evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
\end{cases} &
\begin{aligned}
\evald{(\rel_1 \join \rel_2)}{\db}(\tup) =\\ ~
\end{aligned}&
\begin{aligned}
&\evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \\
&~~~\cdot\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup))
\end{aligned}\\
& & \evald{R}{\db}(\tup) =& \rel(\tup)
\end{align*}\\[-10mm]
\caption{Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
\label{fig:nxDBSemantics}
\end{figure}
Given bag-\abbrPDB query $\query$ and \abbrTIDB $\pdb$ with $\numvar$ tuples, let $\prob_i$ denote the probability of tuple $\tup_i$ ($\probOf\pbox{X_i = 1}$) for $i \in [\numvar]$. Consider the special case when for all $i$ in $[\numvar]$, $\prob_i = 1$. For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{X}}}$ is linear in the size of the arithmetic circuit, since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.} This is another special case where $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor). Is this the general case? This leads us to the main problem statement of this paper:
\begin{Problem}\label{prob:intro-stmt}
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, is it always the case that computing \abbrStepTwo is $\bigO{\timeOf{\abbrStepOne}}$?
\end{Problem}
We show that, for the class of \abbrTIDB\xplural with $0 < \prob_i < 1$, the problem of computing \abbrStepTwo is superlinear in the size of the lineage polynomial representation under fine-grained hardness assumptions.
Our work further introduces an approximation algorithm for \abbrStepTwo of a bag-\abbrPDB query $\query$ that runs in linear time.
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\poly_\tup\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)
paradigm of set-\abbrPDB\xplural. To address the question of whether or not bag-\abbrPDB\xplural are easy, we focus on computing the expected count ($\expct\pbox{\poly_\tup\inparen{\vct{X}}}$) as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
Our work focuses on the following setting for query computation. Inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting, however, is not limiting, as a simple generalization exists that reduces a bag \abbrPDB to a set \abbrPDB with typically only an $O(c)$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$.
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
Concretely, we make the following contributions:
(i) Under fine-grained hardness assumptions, we show that the answer to \cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is negative in general
% \sharpwonehard in the size of the lineage circuit
by reduction from counting the number of $k$-matchings over an arbitrary graph; we further show superlinear hardness for a specific %cubic
graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries for which the answer to \cref{prob:intro-stmt} is again positive; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ) followups~\cite{DBLP:conf/pods/KhamisNR16}) the algorithm has runtime linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic\footnote{Note that this doesn't rule out queries for which approximation is linear.}); (iii) We generalize the approximation algorithm to a class of bag-Block Independent Disjoint Databases (\abbrBIDB\xplural, see \cref{subsec:tidbs-and-bidbs}), a more general model of probabilistic data; (iv) We further prove that for \raPlus queries
\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly_\tup$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. We motivate this reduced polynomial next.
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime))$
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
over the bag relations of \cref{fig:two-step}. It can be verified that $\poly_\tup$ for $Q$ is $L_aR_aL_b + L_bR_bL_d + L_bR_cL_c$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
The lineage polynomial for $Q^2$ is given by $\poly^2$:
\begin{multline*}
\left(L_aR_aL_b + L_bR_bL_d + L_bR_cL_c\right)^2\\
=L_a^2R_a^2L_b^2 + L_b^2R_b^2L_d^2 + L_b^2R_c^2L_c^2 + 2L_aR_aL_b^2R_bL_d + 2L_aR_aL_b^2R_cL_c + 2L_b^2R_bL_dR_cL_c.
\end{multline*}
By exploiting linearity of expectation of summand terms, and further pushing expectation through independent \abbrTIDB variables, the expectation $\expct\pbox{\Phi^2}$ then is:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a^2}\expct\pbox{R_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{R_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{R_c^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the \abbrSMB form of $\poly(\vct{X})$ to $1$.
\end{Definition}
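As a small illustration of this definition (the two-variable polynomial is chosen here only for exposition), let $\poly(X, Y) = (X + Y)^2$, whose \abbrSMB form is $X^2 + 2XY + Y^2$. Setting every exponent greater than one to one gives
\begin{equation*}
\rpoly(X, Y) = X + 2XY + Y.
\end{equation*}
If $X$ and $Y$ are independent $\{0, 1\}$-valued random variables with $\probOf\pbox{X = 1} = \prob_X$ and $\probOf\pbox{Y = 1} = \prob_Y$, then $\expct\pbox{X^2} = 0^2\cdot(1 - \prob_X) + 1^2\cdot\prob_X = \prob_X$, and likewise for $Y$, so that $\expct\pbox{\poly(X, Y)} = \prob_X + 2\prob_X\prob_Y + \prob_Y = \rpoly(\prob_X, \prob_Y)$.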
With $\Phi^2$ as an example, we have:
\begin{align*}
&\widetilde{\Phi^2}(L_a, L_b, L_c, L_d, R_a, R_b, R_c)\\
&\; = L_aR_aL_b + L_bR_bL_d + L_bR_cL_c + 2L_aR_aL_bR_bL_d + 2L_aR_aL_bR_cL_c + 2L_bR_bL_dR_cL_c
\end{align*}
It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1},$ $\probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1}, \probOf\pbox{R_a=1}, \probOf\pbox{R_b=1}, \probOf\pbox{R_c=1})$). In fact, the following lemma shows that this equivalence holds for {\em all} $\raPlus$ queries over TIDB (proof in \cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}
Let $\pdb$ be a \abbrTIDB over variables $\vct{X} = \{X_1,\ldots,X_\numvar\}$ whose probability distribution $\pd$ over the $2^\numvar$ possible worlds is induced by the probability vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ of the individual tuple probabilities, and let $\vct{W}$ denote a world drawn from $\pd$. For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$ the following holds:
\begin{equation*}
\expct_{\vct{W} \sim \pd}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}.
\end{equation*}
\end{Lemma}
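The intuition behind the lemma can be sketched for a single monomial; this is only an outline under the \abbrTIDB independence assumption, and the full argument is the one referenced in \cref{subsec:proof-exp-poly-rpoly}. Write $W_i$ for the $i$-th coordinate of the random world, and consider a monomial with coefficient $c$, variable index set $S \subseteq [\numvar]$, and exponents $d_i \geq 1$ (notation introduced here for illustration). Then
\begin{equation*}
\expct\pbox{c\prod_{i \in S} W_i^{d_i}} = c\prod_{i \in S}\expct\pbox{W_i^{d_i}} = c\prod_{i \in S}\expct\pbox{W_i} = c\prod_{i \in S}\prob_i,
\end{equation*}
where the first equality uses the independence of the $W_i$ and the second uses $W_i \in \{0, 1\}$. Summing over the monomials of the \abbrSMB form of $\poly$ and applying linearity of expectation yields $\rpoly\inparen{\probAllTup}$.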
To prove our hardness result we show that for the same $Q$ considered in the query above, the query $Q^k$ is able to encode various hard graph-counting problems\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \cref{fig:two-step}.}. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant-factor approximation. For example, using $\query^2$ from above, we can see that
\begin{equation*}
\poly^2\inparen{\probAllTup} = \inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{equation*}
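More generally, the relationship between the two quantities can be sketched as follows, under the assumptions (made here for illustration) that every $\prob_i \geq \prob_{\min}$ for some constant $\prob_{\min} \in (0, 1]$ and that every monomial of $\poly$ has total degree at most $D$. Consider a monomial of the \abbrSMB form of $\poly$ with coefficient $c \geq 0$ and exponents $d_i \geq 1$. Since $\prob_{\min} \leq \prob_i \leq 1$ and $\sum_i (d_i - 1) \leq D$,
\begin{equation*}
c\prod_{i}\prob_i^{d_i} \;\leq\; c\prod_{i}\prob_i \;\leq\; \prob_{\min}^{-D}\cdot c\prod_{i}\prob_i^{d_i},
\end{equation*}
and summing over all monomials gives $\poly\inparen{\probAllTup} \leq \rpoly\inparen{\probAllTup} \leq \prob_{\min}^{-D}\cdot\poly\inparen{\probAllTup}$, which is a constant-factor approximation when $\prob_{\min}$ and $D$ are constants.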
%For example, if we know that $\prob_0 = \max_{i \in [\numvar]}\prob_i$, then $\poly(\prob_0,\ldots, \prob_0)$ is an upper bound constant factor approximation. Consider the first output tuple of \cref{fig:two-step}. Here, we set $\prob_0 = 1$, and the approximation $\poly\inparen{\vct{1}} = 1 \cdot 1 = 1$. The opposite holds true for determining a constant factor lower bound.
To get a $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.