%root: main.tex
\section{Introduction (Rewrite - 070921)}
\input{two-step-model}
A probabilistic database (\abbrPDB) $\pdb$ is a probability distribution $\pd$ over a set of $\numvar$ tuples in a deterministic database $\db$. A tuple independent probabilistic database (\abbrTIDB) $\pdb$ further restricts $\pd$ to treat each tuple in $\db$ as an independent Bernoulli distributed random variable encoding the tuple's presence, where, under bag query semantics, $\query\inparen{\pdb}\inparen{\tup}$ is the random variable corresponding to the multiplicity of the output tuple $\tup$. Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$)\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the SPJU and renaming operators.}, the goal is to compute the expected multiplicity\footnote{In set semantic \abbrPDB\xplural, computing $\expct\pbox{\query\inparen{\pdb}\inparen{t}}$ corresponds to computing the marginal probability.} ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$. There exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{X}}}$\footnote{In this work we focus on one output tuple $\tup$, and hence refer to $\poly_\tup$ as $\poly$.}, where $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$ and the expectation is taken over the joint distribution of the Bernoulli random variables $\vct{X}$ over $\{0, 1\}^\numvar$. The variables $X_i$ in $\vct{X}$ with nonzero assignments\footnote{While there are semirings such as the security semiring that use $0$ (denoted $\zerosymbol$) as a valid annotation over tuples of a finite database, for our purposes this distinction is not necessary.} and nonzero coefficients in $\poly\inparen{\vct{X}}$ represent exactly the input tuples that contribute to the presence of the output tuple. In the context of bags, the multiplicity $\poly\inparen{\vct{X}}$, and hence $\expct\pbox{\poly\inparen{\vct{X}}}$, is computed under $\semN$-semiring semantics.
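As a simple illustration (a generic example, not drawn from \cref{fig:two-step}), suppose an output tuple $\tup$ can be derived either by joining the input tuples annotated $X_1$ and $X_2$, or directly from the input tuple annotated $X_3$; its lineage polynomial is then $\poly\inparen{\vct{X}} = X_1X_2 + X_3$. Over a \abbrTIDB, independence of the variables and linearity of expectation give (writing $\prob_i = \probOf\pbox{X_i = 1}$)
\begin{equation*}
\expct\pbox{\poly\inparen{\vct{X}}} = \expct\pbox{X_1}\expct\pbox{X_2} + \expct\pbox{X_3} = \prob_1\prob_2 + \prob_3.
\end{equation*}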
To precisely compare and contrast runtime complexities across varying database models and \abbrPDB computations, a brief informal review of certain complexity classes is useful. A counting problem is an element of the \sharpp complexity class if it asks for the number of solutions (equivalently, accepting computation paths) of a problem in $\np$.
The class \sharpwone is the parameterized analogue of \sharpp: runtimes are measured in terms of both the input size $\numvar$ and a parameter $k$ (e.g., the query size). A problem that is \sharpwonehard is believed to admit no algorithm running in time $f(k)\cdot\numvar^{O(1)}$ for any function $f$ of the parameter alone; under standard assumptions, its runtime is lower bounded by $\numvar^{f(k)}$ for some strictly increasing $f(k)$.
The special case of deterministic query evaluation
%simply computing $\query$ over a deterministic database
is itself known to be \sharpwonehard for general $\query$: the lower bound is expressed in the size of the data $\numvar$ but is parameterized by the size $k$ of the query. For example, counting $k$-cliques is \sharpwonehard since (under standard complexity assumptions) it cannot be solved in time faster than $\numvar^{f(k)}$ for some strictly increasing $f(k)$.
%hardness is seen in such queries as counting $k$-cliques and $k$-way joins, where the superlinear runtime is parameterized in $k$.
This result is unsatisfying when considering the complexity of evaluating $\query$ over \abbrPDB\xplural, since it does not account for computing $\expct\pbox{\poly\inparen{\vct{X}}}$, entirely ignoring the `P' in \abbrPDB.
%of intensional evaluation (computing $\expct\pbox{\poly\inparen{\vct{X}}}$).
A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly\inparen{\vct{X}}}$ separately from the complexity of deterministic query evaluation. Viewing \abbrPDB query evaluation as these two separate steps is essentially what is known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}. \Cref{fig:two-step} illustrates the intensional evaluation computation model.
%one way to do this.
%The model of computation in \cref{fig:two-step} views \abbrPDB query processing as two steps.
The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing $\query$ over the $\abbrPDB$, which is essentially the deterministic computation of both the query output and $\poly(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$, as we show in [placeholder].} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly(\vct{X})}$. This model of computation is naturally followed by intensional evaluation in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly\inparen{\vct{X}}$ must be computed and processed separately to obtain an exact output whenever $\query(\pdb)$ is hard, since extensional evaluation is not guaranteed to be exact in such cases.
%(where e.g. intensional evaluation is itself a separate computational step; further, computing $\expct\pbox{\poly\inparen{\vct{X}}}$ in extensional evaluation occurs as a separate step of each operator in the query tree, and therefore implies that both concerns can be separated)
The paradigm of \cref{fig:two-step} is also neatly followed by semiring provenance, where $\semNX$-DB query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query result and its polynomial, and the $\semNX$-polynomial is subsequently evaluated over a semantically appropriate semiring, e.g., $\semN$ to model bag semantics. Further, in this work, the model lends itself nicely to separating the concerns of deterministic computation and probability computation. Observing this model prompts the informal problem statement: given $\query\inparen{\pdb}$, is it the case that \abbrStepTwo is always $\bigO{\abbrStepOne}$?
%always $\bigO{\abbrStepOne}$.
If so, then query evaluation over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation (up to a constant factor).
The problem of computing $\query(\pdb)$ has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where the lineage polynomial follows $\semB$-semiring semantics, each output tuple appears at most once, and $\expct\pbox{\poly\inparen{\vct{X}}}$ is essentially the marginal probability of a tuple's presence in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard in general, and proved that a dichotomy exists for this problem: for any polynomial-time \abbrStepOne, computing $\query(\pdb)$ is either polynomial time or \sharpphard. Since the hardness is in data complexity (the size of the input, $\Theta(\numvar)$), techniques such as parameterized complexity (bounding complexity by a parameter other than $\numvar$) and fine-grained analysis (which asks for the precise exponent, e.g., the value of $f(k)$ for a \sharpwonehard problem) of \abbrStepTwo will not refine the hardness results beyond \sharpphard.
There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly\inparen{\vct{X}}}$) of the result.
\begin{figure}
\begin{align*}
\evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
\evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
\end{cases} &
\begin{aligned}
\evald{(\rel_1 \join \rel_2)}{\db}(\tup) =\\ ~
\end{aligned}&
\begin{aligned}
&\evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \\
&~~~\cdot\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup))
\end{aligned}\\
& & \evald{\rel}{\db}(\tup) =& \rel(\tup)
\end{align*}\\[-10mm]
\caption{Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
\label{fig:nxDBSemantics}
\end{figure}
The semantics of $\query(\pdb)$ in bag-\abbrPDB\xplural allow output tuples to appear \emph{more} than once, which is naturally captured by the $\semN$-semiring. Given a standard monomial basis (\abbrSMB)\footnote{\abbrSMB is akin to the sum of products expansion but with the added requirement that all monomials are unique.} representation of the lineage polynomial, the complexity of computing \abbrStepTwo is linear in the size of the lineage polynomial. This is true since
%the addition and multiplication operators in \cref{fig:nxDBSemantics} are those of the $\semN$-semiring, and computing the expected count over such operators allows for
linearity of expectation handles the sum over monomials, and since \abbrSMB has no factorization, the variables multiplied together within each monomial are known up front, with no additional expansion needed. Thus, the expected count can indeed be computed by the same order of operations as contained in $\poly$. In other words, given an \abbrSMB representation, $\expct\pbox{\poly\inparen{\vct{X}}}$ can always be computed in $\bigO{\numvar^k}$ time for $\numvar$ input tuples and a query of size $k$. Then, for the special case of queries whose operators naturally produce an \abbrSMB representation, we have that \abbrStepTwo is $\bigO{\abbrStepOne}$.
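To make the \abbrSMB case concrete (again a generic example), consider the \abbrSMB polynomial $\poly\inparen{\vct{X}} = X_1X_2 + 2X_1X_3$ over a \abbrTIDB. Independence of distinct variables and linearity of expectation give
\begin{equation*}
\expct\pbox{X_1X_2 + 2X_1X_3} = \expct\pbox{X_1}\expct\pbox{X_2} + 2\expct\pbox{X_1}\expct\pbox{X_3} = \prob_1\prob_2 + 2\prob_1\prob_3,
\end{equation*}
which is computed in a single pass over the representation.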
%This result coupled with the prevalence that exists amongst most well-known \abbrPDB implementations to use an sum of products\footnote{Sum of products differs from \abbrSMB in allowing any arbitrary monomial $m_i$ to appear in the polynomial more than once, whereas, \abbrSMB requires all monomials $m_i,\ldots, m_j$ such that $m_i = \cdots = m_j$ to be combined into one monomial, such that each monomial appearing in \abbrSMB is unique. The complexity difference between the two representations is up to a constant factor.} representation,
%\currentWork{
% show us that to develop comparable bounds between \abbrStepOne and \abbrStepTwo for bag $\query\inparen{\pdb}$, one must examine complexity at finer levels....to not be bottlenecked in \abbrStepTwo, it must be the case that \abbrStepTwo is not \sharpwonehard regardless of the polynomial representation. may partially explain why the bag-\abbrPDB query problem has long been thought to be easy.
%}
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system of $\poly(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}).
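To see why such factorized representations matter (a generic illustration), a lineage polynomial of the form
\begin{equation*}
\poly\inparen{\vct{X}} = \prod_{j=1}^{k}\inparen{X_{j,1} + \cdots + X_{j,\numvar}}
\end{equation*}
admits a circuit with $\bigO{k\numvar}$ gates, while its \abbrSMB expansion contains $\numvar^k$ monomials. Computing \abbrStepTwo by first expanding the circuit into \abbrSMB can thus be asymptotically more expensive than \abbrStepOne itself.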
%Our work explores whether or not \abbrStepTwo in the computation model is \emph{always} in the same complexity class as deterministic query evaluation, when \abbrStepOne of $\query(\pdb)$ is easy. We examine the class of queries whose lineage computation in step one is lower bounded by the query runtime of step one.
Let us consider the output $\poly(\vct{X})$ for $\query(\pdb)$ such that $\pdb$ is a bag-\abbrTIDB. For $\tup_i$ in $\pdb$, denote its corresponding probability as $\prob_i$, that is, $\probOf\pbox{X_i = 1} = \prob_i$ for all $i$ in $[\numvar]$. Consider the special case when $\pdb$ is a deterministic database with one possible world $\db$. In this case, $\prob_i = 1$ for all $i$ in $[\numvar]$, and it can be seen that the problem of computing the expected count is linear in the size of the arithmetic circuit, since we can essentially push expectation through products even of mutually dependent terms\footnote{For example, in this special case computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since $\prob_i = 1$ implies $\prob_i^h = \prob_i \cdot 1^{h-1} = \prob_i$ for any $h \geq 1$.}. This means that \abbrStepTwo is $\bigO{\abbrStepOne}$, and for this special case we always have deterministic query runtime for $\query\inparen{\pdb}$ up to a constant factor. Is this the general case? This leads us to our problem statement:
\begin{Problem}\label{prob:intro-stmt}
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, is it always the case that computing \abbrStepTwo is $\bigO{\abbrStepOne}$?%what is the complexity (in the size of the circuit representation) of computing step two ($\expct\pbox{\poly(\vct{X})}$) for each tuple $\tup$ in the output of $\query(\pdb)$?
\end{Problem}
We show that, for the class of \abbrTIDB\xplural with $0 < \prob_i < 1$, the problem of computing \abbrStepTwo is superlinear in the size of the lineage polynomial representation under standard fine-grained complexity hardness assumptions.
Our work further introduces an approximation algorithm for \abbrStepTwo over bag-\abbrPDB queries $\query$ which runs in linear time.
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\poly\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)
paradigm of set-\abbrPDB\xplural. To address the question of whether or not bag-\abbrPDB\xplural are easy, we focus on computing the expected count ($\expct\pbox{\poly\inparen{\vct{X}}}$) as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
Our work focuses on the following setting for query computation. Inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting, however, is not limiting: a simple generalization exists that reduces a bag \abbrPDB to a set \abbrPDB with typically only an $O(c)$ increase in size, where $c = \max_{\tup \in \db}\db\inparen{\tup}$.
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
Concretely, we make the following contributions:
(i) Under a fine-grained hardness assumption, we show that \cref{prob:intro-stmt} has a negative answer for bag-\abbrTIDB\xplural: computing \abbrStepTwo is \sharpwonehard in the size of the lineage circuit, by reduction from counting the number of $k$-matchings over an arbitrary graph; we further show superlinear hardness for a specific cubic graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ) followups~\cite{DBLP:conf/pods/KhamisNR16}) the algorithm has runtime linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic\footnote{Note that this doesn't rule out queries for which approximation is linear}); (iii) We generalize the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural), a more general model of probabilistic data; (iv) We further prove that for \raPlus queries
\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the \abbrTIDB (and \abbrBIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the \abbrTIDB/\abbrBIDB. We motivate this reduced polynomial next.
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime))$
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
over the bag relations of \cref{fig:two-step}. It can be verified that $\Phi$ for $Q$ is $L_aR_aL_b + L_bR_bL_d + L_bR_cL_c$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
\begin{multline*}
\left(L_aR_aL_b + L_bR_bL_d + L_bR_cL_c\right)^2\\
=L_a^2R_a^2L_b^2 + L_b^2R_b^2L_d^2 + L_b^2R_c^2L_c^2 + 2L_aR_aL_b^2R_bL_d + 2L_aR_aL_b^2R_cL_c + 2L_b^2R_bL_dR_cL_c.
\end{multline*}
By exploiting linearity of expectation of summand terms, and further pushing expectation through independent \abbrTIDB variables, the expectation $\expct\pbox{\Phi^2}$ then is:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a^2}\expct\pbox{R_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{R_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{R_c^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $W^k = W$ and hence $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the \abbrSMB form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\Phi^2$ as an example, we have:
\begin{align*}
&\widetilde{\Phi^2}(L_a, L_b, L_c, L_d, R_a, R_b, R_c)\\
&\; = L_aR_aL_b + L_bR_bL_d + L_bR_cL_c + 2L_aR_aL_bR_bL_d + 2L_aR_aL_bR_cL_c + 2L_bR_bL_dR_cL_c
\end{align*}
It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1},$ $\probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1}, \probOf\pbox{R_a=1}, \probOf\pbox{R_b=1}, \probOf\pbox{R_c=1})$). In fact, the following lemma shows that this equivalence holds for {\em all} $\raPlus$ queries over \abbrTIDB\xplural (proof in \cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}
Let $\pdb$ be a \abbrTIDB over variables $\vct{X} = \{X_1,\ldots,X_\numvar\}$, and let $\pd$ be the probability distribution over the set of all $2^\numvar$ possible worlds $\vct{W}$ induced by the vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ of individual tuple probabilities. For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$, the following holds:
\begin{equation*}
\expct_{\vct{W} \sim \pd}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}.
\end{equation*}
\end{Lemma}
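As a minimal sanity check of the lemma (a generic example), take $\poly\inparen{X_1, X_2} = X_1^2X_2$, e.g., the lineage of an output tuple produced by a self-join. Then $\rpoly\inparen{X_1, X_2} = X_1X_2$, and since $X_1, X_2 \in \{0,1\}$ are independent,
\begin{equation*}
\expct\pbox{\poly\inparen{X_1, X_2}} = \expct\pbox{X_1^2}\expct\pbox{X_2} = \prob_1\prob_2 = \rpoly\inparen{\prob_1, \prob_2}.
\end{equation*}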
To prove our hardness result we show that, for the same query $\query$ considered above, the query $\query^k$ is able to encode various hard graph-counting problems\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \cref{fig:two-step}.}. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $\query$). For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_\numvar=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query^2$ from above, we can see that
\begin{equation*}
\poly^2\inparen{\probAllTup} = \inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{equation*}
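To see why a constant factor approximation holds in general, here is a sketch under the assumptions that every $\prob_i \geq \prob_0$ for some constant $\prob_0 > 0$ and that the degree of $\Phi$ is bounded by a constant $d$ (as is the case for a fixed $\raPlus$ query). Every monomial of $\Phi$ has a non-negative coefficient, and each monomial $\prod_i X_i^{e_i}$ with all $e_i \geq 1$ satisfies
\begin{equation*}
\prob_0^{d}\prod_i \prob_i \;\leq\; \prod_i \prob_i^{e_i} \;\leq\; \prod_i \prob_i,
\end{equation*}
since each $\prob_i \leq 1$ and $\sum_i (e_i - 1) \leq d$. Summing over all monomials yields $\prob_0^{d}\,\rpoly\inparen{\probAllTup} \leq \poly\inparen{\probAllTup} \leq \rpoly\inparen{\probAllTup}$.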
%For example, if we know that $\prob_0 = \max_{i \in [\numvar]}\prob_i$, then $\poly(\prob_0,\ldots, \prob_0)$ is an upper bound constant factor approximation. Consider the first output tuple of \cref{fig:two-step}. Here, we set $\prob_0 = 1$, and the approximation $\poly\inparen{\vct{1}} = 1 \cdot 1 = 1$. The opposite holds true for determining a constant factor lower bound.
To get a $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
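One natural way to realize this idea (a sketch only; the precise algorithm, which operates directly over the circuit representation, and its guarantees appear in \cref{sec:algo}) is to sample a monomial $m$ of $\Phi$ with probability $c_m / C$, where $c_m \geq 0$ is its coefficient and $C = \sum_{m'} c_{m'}$, and return the estimate $C\cdot\prod_{X_i \in m}\prob_i$, the product ranging over the \emph{distinct} variables of $m$. The estimator is unbiased since
\begin{equation*}
\sum_{m} \frac{c_m}{C}\cdot C\prod_{X_i \in m}\prob_i = \widetilde{\Phi}\inparen{\probAllTup},
\end{equation*}
and averaging over sufficiently many samples concentrates around $\widetilde{\Phi}\inparen{\probAllTup}$ within a $(1\pm\epsilon)$ factor.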
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.