
%root: main.tex
\section{Introduction}
\input{two-step-model}
A probabilistic database (\abbrPDB) $\pdb$ is a probability distribution $\pd$ over a (multi-)set of $\numvar$ tuples of a deterministic database $\db$. A tuple-independent probabilistic database (\abbrTIDB) $\pdb$ further restricts $\pd$ so that each tuple in $\db$ is an independent Bernoulli-distributed random variable. Given a query $\query$ from the class of positive relational algebra queries ($\raPlus$), the goal is to compute the expectation $\expct\pbox{\poly\inparen{\vct{X}}}$ of each output tuple $\tup$, where $\poly\inparen{\vct{X}}$ is the lineage polynomial of $\tup$, parameterized by $\vct{X}$, the set of $\numvar$ variables annotating the base tuples of $\pdb$.
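Spelled out (a standard restatement of the definitions above, not notation introduced later), a \abbrTIDB induces a distribution over the $2^\numvar$ possible worlds: writing $\prob_i = \probOf\pbox{X_i = 1}$, an assignment $\vct{w} \in \{0,1\}^\numvar$ of the annotation variables has probability $\probOf\pbox{\vct{X} = \vct{w}} = \prod_{i: w_i = 1}\prob_i\prod_{i: w_i = 0}\inparen{1 - \prob_i}$, and the quantity of interest is
\begin{align*}
\expct\pbox{\poly\inparen{\vct{X}}} = \sum_{\vct{w} \in \{0,1\}^\numvar}\probOf\pbox{\vct{X} = \vct{w}}\,\poly\inparen{\vct{w}}.
\end{align*}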
The simpler problem of computing $\query$ over a deterministic database is itself known to be \sharpwonehard in data complexity in the general case, as witnessed by queries such as counting $k$-cliques and $k$-way joins, whose superlinear runtime is parameterized by $k$. This observation is (obviously) unsatisfying when considering query runtime over \abbrPDB\xplural, since it \emph{entirely} ignores the complexity of intensional evaluation (computing $\expct\pbox{\poly\inparen{\vct{X}}}$). A natural question is whether we can quantify the runtime of the intensional evaluation of $\poly\inparen{\vct{X}}$ separately from the complexity of deterministic query evaluation. \Cref{fig:two-step} illustrates one way to do this.
The model of computation in \cref{fig:two-step} views \abbrPDB query processing as two steps. As depicted, the first step computes $\query$ over a \abbrPDB, which is essentially the deterministic computation of both the query output and $\poly(\vct{X})$\footnote{Note that, assuming standard $\raPlus$ query algorithms, computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$.}. The second step computes $\expct\pbox{\poly(\vct{X})}$. This model is consistent with set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where intensional evaluation is itself a separate computational step, and where even in extensional evaluation $\expct\pbox{\poly\inparen{\vct{X}}}$ is computed as a separate step within each operator of the query tree, so the two concerns can be separated. It is also consistent with semiring provenance \cite{DBLP:conf/pods/GreenKT07}, where the $\semNX$-DB first computes the annotation via the query, and the polynomial is then evaluated on a specific valuation. Finally, this model lends itself nicely to separating the deterministic computation from the probability computation in this work.
The problem of computing $\query(\pdb)$ has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where the lineage polynomial is a propositional formula.\footnote{For the case when $\query$ is in the class $\raPlus$ and $\pdb$ is a \abbrTIDB, a propositional formula is a special case of the general polynomial with conjunction ($\wedge$) as the polynomial multiplication operator and disjunction ($\vee$) as the polynomial addition operator.} The semantics of evaluating $\query(\pdb)$ in this setting require that each output tuple of $\query(\pdb)$ appear at most once in the result, together with its marginal probability $\expct\pbox{\poly\inparen{\vct{X}}}$. Dalvi and Suciu showed that the query computation problem over set-\abbrPDB\xplural is \sharpphard in general, and proved a dichotomy for this problem: for any query whose step one is polynomial, the runtime of $\query(\pdb)$ is either polynomial or \sharpphard. Since the hardness is in data complexity (the size of the input, $\numvar$), a fine-grained complexity analysis of step two cannot move the hard cases out of the \sharpphard complexity class into any parameterized complexity class. To sidestep this result, one can allow approximation, which reduces the problem to a quadratic upper bound.
There are queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit. One such query is the count query, where one wishes to compute the expected multiplicity ($\expct\pbox{\poly\inparen{\vct{X}}}$) of a result tuple $\tup$. If we allow approximation in this setting, this paper shows that we can \emph{guarantee} a runtime for $\query(\pdb)$ that is linear in the runtime of step one. Is approximation necessary? The semantics of $\query(\pdb)$ in bag-\abbrPDB\xplural allow output tuples to appear \emph{more} than once, which is naturally captured by a lineage polynomial with the standard polynomial addition and multiplication operators. In this setting, linearity of expectation holds over the standard addition operator of the lineage polynomial, and given a sum-of-products (\abbrSOP) representation, the complexity of computing step two is linear in the size of the lineage polynomial. This result, coupled with the fact that most well-known \abbrPDB implementations use an \abbrSOP representation, may partially explain why the bag-\abbrPDB query problem has long been thought to be easy.
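To see the role of linearity concretely, consider an illustrative example of ours (not drawn from \cref{fig:two-step}): if $X$, $Y$, and $Z$ are independent tuple variables of a bag-\abbrTIDB and an output tuple has the \abbrSOP lineage polynomial $XY + XZ$, then
\begin{align*}
\expct\pbox{XY + XZ} = \expct\pbox{XY} + \expct\pbox{XZ} = \expct\pbox{X}\expct\pbox{Y} + \expct\pbox{X}\expct\pbox{Z},
\end{align*}
so the expected multiplicity is obtained in one pass over the \abbrSOP representation by replacing each variable with its marginal probability.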
The main insight of this paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$, resulting, for example, from optimizations like projection push-down that produce factorized representations of $\poly(\vct{X})$ (e.g., $X(Y+Z)$ rather than its \abbrSOP expansion $XY + XZ$). To capture such factorizations, this work uses (arithmetic) circuits as the representation system for $\poly(\vct{X})$; they are a natural fit for $\raPlus$ queries, as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}). Our work explores whether step two of the computation model is \emph{always} linear in the query runtime of step one when step one of $\query(\pdb)$ is easy. We examine the class of queries whose lineage computation in step one is lower bounded by the query runtime of step one. Consider again the bag-\abbrTIDB $\pdb$. When the probability of every tuple is $\prob_i = 1$, computing the expected count is linear in the size of the arithmetic circuit, and we have polytime complexity for computing $\query(\pdb)$. Is this the general case? This leads us to our problem statement:
\begin{Problem}\label{prob:intro-stmt}
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, what is the complexity (in the size of the circuit representation) of computing step two ($\expct\pbox{\poly(\vct{X})}$) for each tuple $\tup$ in the output of $\query(\pdb)$?
\end{Problem}
We show that for the class of \abbrTIDB\xplural with $\prob_i < 1$, computing step two is, in general, no longer linear in the size of the lineage polynomial representation.
Our work further introduces an approximation algorithm for the expected count of $\tup$ in the output of the bag-\abbrPDB query $\query$, which runs in linear time.
As noted, the output of a bag-\abbrPDB query is a probability distribution over the possible multiplicities of $\tup$, in stark contrast to the marginal probability ($\expct\pbox{\poly\inparen{\vct{X}}}$) paradigm of set-\abbrPDB\xplural. From a theoretical perspective, little work has considered bag-\abbrPDB\xplural. Focusing on the expected count ($\expct\pbox{\poly\inparen{\vct{X}}}$) of $\tup$ is therefore a natural (and simple) statistic to consider in further developing the theoretical foundations of bag-\abbrPDB\xplural. Other statistical measures can be computed but are beyond the scope of this paper, though we do additionally consider higher moments, which can be found in the appendix.
Our work focuses on the following setting for query computation: inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting is not limiting, however, as a simple generalization exists that reduces a bag-\abbrPDB to a set-\abbrPDB with typically only an $O(1)$ increase in size.
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
Concretely, we make the following contributions:
(i) We show that \cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is \sharpwonehard in the size of the lineage circuit, by reduction from counting the number of $k$-matchings in an arbitrary graph; we further show superlinear hardness for a specific cubic graph query in the special case where all $\prob_i = \prob$ for some $\prob \in (0, 1)$;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries; we further show that for typical database usage patterns (e.g., when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}), its complexity is linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic); (iii) We generalize the approximation algorithm to a class of bag-\abbrBIDB\xplural, a more general model of probabilistic data; (iv) We further prove that for $\raPlus$ queries
we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that in the \abbrTIDB (and \abbrBIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the \abbrTIDB/\abbrBIDB. We motivate this reduced polynomial next.
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime))$
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
over the bag relations of \cref{fig:two-step}. It can be verified that $\Phi$ for $Q$ is $L_aR_aL_b + L_bR_bL_d + L_bR_cL_c$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
\begin{multline*}
\left(L_aR_aL_b + L_bR_bL_d + L_bR_cL_c\right)^2\\
=L_a^2R_a^2L_b^2 + L_b^2R_b^2L_d^2 + L_b^2R_c^2L_c^2 + 2L_aR_aL_b^2R_bL_d + 2L_aR_aL_b^2R_cL_c + 2L_b^2R_bL_dR_cL_c.
\end{multline*}
The expectation $\expct\pbox{\Phi^2}$ then is:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a^2}\expct\pbox{R_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{R_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{R_c^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b^2}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $W^k = W$ and hence $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d} \\
+ 2\expct\pbox{L_a}\expct\pbox{R_a}\expct\pbox{L_b}\expct\pbox{R_c}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{R_b}\expct\pbox{L_d}\expct\pbox{R_c}\expct\pbox{L_c}
\end{multline*}
\end{footnotesize}
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the SOP form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\Phi^2$ as an example, we have:
\begin{align*}
&\widetilde{\Phi^2}(L_a, L_b, L_c, L_d, R_a, R_b, R_c)\\
&\; = L_aR_aL_b + L_bR_bL_d + L_bR_cL_c + 2L_aR_aL_bR_bL_d + 2L_aR_aL_bR_cL_c + 2L_bR_bL_dR_cL_c
\end{align*}
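The reduction can also be carried out mechanically. The following sketch (ours, purely illustrative and not part of the paper's formal development) computes the reduced polynomial of the running example with the \texttt{sympy} library, by expanding to \abbrSOP form and clamping every exponent to one:
\begin{verbatim}
# Illustrative sketch: compute the reduced polynomial by expanding to
# SOP form and clamping every exponent e > 1 down to 1.
from functools import reduce
from operator import mul
import sympy as sp

def reduce_poly(poly, variables):
    p = sp.Poly(sp.expand(poly), *variables)
    reduced = 0
    for exponents, coeff in p.terms():
        # Keep each variable at most once per monomial (exponent clamped to 1).
        monomial = reduce(mul, [v for v, e in zip(variables, exponents) if e > 0], 1)
        reduced += coeff * monomial
    return sp.expand(reduced)

La, Lb, Lc, Ld, Ra, Rb, Rc = sp.symbols('La Lb Lc Ld Ra Rb Rc')
Phi = La*Ra*Lb + Lb*Rb*Ld + Lb*Rc*Lc
print(reduce_poly(Phi**2, [La, Lb, Lc, Ld, Ra, Rb, Rc]))
# prints the reduced polynomial shown above (up to term ordering)
\end{verbatim}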
It can be verified that the reduced polynomial, evaluated at each variable's marginal probability, is a closed form for the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1},$ $\probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1}, \probOf\pbox{R_a=1}, \probOf\pbox{R_b=1}, \probOf\pbox{R_c=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} $\raPlus$ queries over \abbrTIDB/\abbrBIDB.
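As a quick sanity check (our own illustration, with arbitrarily chosen probabilities), one can confirm this equivalence numerically for the running example by enumerating all $2^7$ possible worlds:
\begin{verbatim}
# Illustrative check that E[Phi^2] equals the reduced polynomial evaluated
# at the marginal probabilities, for arbitrarily chosen probabilities.
import itertools

probs = {'La': 0.3, 'Lb': 0.7, 'Lc': 0.5, 'Ld': 0.6,
         'Ra': 0.4, 'Rb': 0.8, 'Rc': 0.2}

def phi(w):
    return (w['La']*w['Ra']*w['Lb'] + w['Lb']*w['Rb']*w['Ld']
            + w['Lb']*w['Rc']*w['Lc'])

# Exact E[Phi^2]: sum over all possible worlds, weighted by world probability.
expectation = 0.0
for bits in itertools.product([0, 1], repeat=len(probs)):
    world = dict(zip(probs, bits))
    weight = 1.0
    for var, bit in world.items():
        weight *= probs[var] if bit else 1.0 - probs[var]
    expectation += weight * phi(world)**2

# Reduced polynomial of Phi^2, evaluated at the marginal probabilities.
p = probs
reduced = (p['La']*p['Ra']*p['Lb'] + p['Lb']*p['Rb']*p['Ld']
           + p['Lb']*p['Rc']*p['Lc']
           + 2*p['La']*p['Ra']*p['Lb']*p['Rb']*p['Ld']
           + 2*p['La']*p['Ra']*p['Lb']*p['Rc']*p['Lc']
           + 2*p['Lb']*p['Rb']*p['Ld']*p['Rc']*p['Lc'])

print(expectation, reduced)  # the two values coincide
\end{verbatim}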
To prove our hardness result, we show that for the same $Q$ from the running example, the query $Q^k$ can encode various hard graph-counting problems. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various subgraphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant, then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, if we know that $\prob_0 = \max_{i \in [\numvar]}\prob_i$, then $\poly(\prob_0,\ldots, \prob_0)$ is an upper bound and a constant factor approximation; the opposite direction yields a constant factor lower bound. To get a $(1\pm \epsilon)$-multiplicative approximation, we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
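To make the sampling idea concrete, the following is a heavily simplified sketch of our own (with hypothetical probability values, and not the actual algorithm of \Cref{sec:algo}): monomials of the \abbrSOP expansion are drawn with probability proportional to their weight under the all-ones evaluation, and each sample contributes the product of its distinct variables' probabilities, yielding an unbiased estimate of $\widetilde{\Phi}$ evaluated at the marginals.
\begin{verbatim}
# Simplified illustration of the sampling idea (not the actual algorithm):
# estimate the reduced polynomial of Phi^2 at the marginal probabilities.
import random

# SOP form of Phi^2: (coefficient, distinct variables of the monomial).
monomials = [
    (1, ('La', 'Ra', 'Lb')),
    (1, ('Lb', 'Rb', 'Ld')),
    (1, ('Lb', 'Rc', 'Lc')),
    (2, ('La', 'Ra', 'Lb', 'Rb', 'Ld')),
    (2, ('La', 'Ra', 'Lb', 'Rc', 'Lc')),
    (2, ('Lb', 'Rb', 'Ld', 'Rc', 'Lc')),
]
probs = {'La': 0.3, 'Lb': 0.7, 'Lc': 0.5, 'Ld': 0.6,
         'Ra': 0.4, 'Rb': 0.8, 'Rc': 0.2}

total = sum(c for c, _ in monomials)   # Phi^2 with every variable set to 1

def estimate(samples=200_000):
    acc = 0.0
    for _ in range(samples):
        # Sample a monomial with probability proportional to its coefficient.
        _, variables = random.choices(monomials,
                                      weights=[c for c, _ in monomials])[0]
        value = 1.0
        for v in variables:
            value *= probs[v]
        acc += value
    return total * acc / samples       # unbiased estimate of the reduced polynomial

print(estimate())
\end{verbatim}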
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl}). Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.