Done with pass on intro

Atri Rudra 2021-08-26 23:02:22 -04:00
parent cd7f86847d
commit eea16e3a67
2 changed files with 85 additions and 42 deletions

@ -47,7 +47,8 @@ For any query $Q$, is it the case the {\em fine-grained complexity} of bag-PDB p
%In this paper, we begin to explore whether the problem of bag-probabilistic query evaluation (which we relate to deterministic query processing more precisely below) falls into this same complexity class.
We note that an answer in the affirmative for~\cref{prob:informal} indicates that bag-probabilistic databases can be competitive with classical deterministic databases, opening the door for deployment in practice.
\subsection{Relationship to Set-Probabilistic Query Evaluation}
%Atri: Converting sub-section to para since it saves space
\mypar{Relationship to Set-Probabilistic Query Evaluation}
\Cref{prob:bag-pdb-query-eval} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where each output tuple appears at most once. Here, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula
%Atri: If we get a reviewer who does not know what a propositional formula is then we are in trouble-- I did move some of the footnote text to the main part though
%\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of the said operators on boolean variables ($\semB$-semiring semantics).}
@ -60,8 +61,8 @@ Thus, for the hard queries the answer to~\cref{prob:informal} is {\em no} for se
Concretely, easy queries in this dichotomy can be answered through so-called \emph{extensional} query evaluation, where probability computation is inlined into normal deterministic query processing.
This is possible, because queries on the easy side of the dichotomy can always be rewritten into a form that guarantees that, for every relational operator in the query, the presence of every tuple in the operator's output is governed by either a conjunction or disjunction of \emph{independent} events.
Such a guarantee is not possible for queries on the hard side of the dichotomy, and the best known approach is so-called \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, a two step process that first computes the lineage of the query result --- a boolean formula analogous to $\Phi_\tup$ --- which it then uses to compute the probability.
The complexity of this approach typically depends on the second step, computing the expectation $\expct\pbox{\poly_\tup(\vct{\randWorld})}$, a problem known to be in $\sharpphard$~\cite{DS07}.
Such a guarantee is not possible for queries on the hard side of the dichotomy, and the best known approach is so-called \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, a two step process that first computes the lineage of the query result --- a representation of $\Phi_\tup$ --- which it then uses to compute the desired probability.
The complexity of this approach is typically dominated by the second step, computing the expectation $\expct\pbox{\poly_\tup(\vct{\randWorld})}$, a problem known to be \sharpphard~\cite{DS07}.
@ -69,40 +70,52 @@ The complexity of this approach typically depends on the second step, computing
%Since the hardness is in data complexity (the size of the input, $\Theta(\numvar$)), techniques such as parameterized complexity (bounding complexity by another parameter other than $\numvar$) and fine grained analysis (complexity analysis that asks what precisely is the value of this other parameter, for example, what is the value of $f(k)$ given a \sharpwone algorithm) of \abbrStepTwo will not refine the hardness results from \sharpphard.
%END NEeds to be said
There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit than set-\abbrPDB\xplural. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$) of the result. This works focuses on computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
%Atri: Again changing subsection below to para
\mypar{Intensional Bag-Probabilistic Query Evaluation}
However, there exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit than set-\abbrPDB\xplural. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$) of the result. This work focuses on computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
%BEGIN Needs to be noted.
%As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\poly_\tup\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)
% paradigm of set-\abbrPDB\xplural. To address the question of whether or not bag-\abbrPDB\xplural are easy,
%END Needs to be noted.
A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ separately from the complexity of deterministic query evaluation, effectively dividing \abbrPDB query evaluation into two steps: deterministic query evaluation\footnote{Given input $\pdb$, this step includes outputting every tuple $\tup$ that satisfies $\query$, annotated with its lineage polynomial ($\poly_\tup$) which is computed inline across the query operators of $\query$.\cite{Imielinski1989IncompleteII}\cite{DBLP:conf/pods/GreenKT07}} and computing expectation. Viewing \abbrPDB query evaluation as these two seperate steps is also known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}, illustrated in \cref{fig:two-step}.
The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upperbounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separate from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard since evaluating the probability inline with query operators (extensional evaluation) will only approximate the actual probability in such a case. The paradigm of \cref{fig:two-step} is also analogous to semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with elements from the set of polynomials with variables in $\vct{X}$ and natural number coeficients and exponents.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial can then subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely in separating the concerns of deterministic computation and the probability computation.
% Atri: Removing stuff below as per conversation with Oliver on matrix on Aug 26
%A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ separately from the complexity of deterministic query evaluation, effectively dividing \abbrPDB query evaluation into two steps: deterministic query evaluation\footnote{Given input $\pdb$, this step includes outputting every tuple $\tup$ that satisfies $\query$, annotated with its lineage polynomial ($\poly_\tup$) which is computed inline across the query operators of $\query$.\cite{Imielinski1989IncompleteII}\cite{DBLP:conf/pods/GreenKT07}} and computing expectation. Viewing \abbrPDB query evaluation as these two seperate steps is also known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}, illustrated in \cref{fig:two-step}.
%The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upperbounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separate from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard since evaluating the probability inline with query operators (extensional evaluation) will only approximate the actual probability in such a case. The paradigm of \cref{fig:two-step} is also analogous to semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with elements from the set of polynomials with variables in $\vct{X}$ and natural number coeficients and exponents.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial can then subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely in separating the concerns of deterministic computation and the probability computation.
Analog to set-probabilistic databases, we focus on the intensional model of query evaluation, as illustrated in \cref{fig:two-step}.
Analogous to set-probabilistic databases, we focus on the intensional model of query evaluation, as illustrated in \cref{fig:two-step}.
Given input $\pdb$ and $\query$, the first step, which we will refer to as \termStepOne (\abbrStepOne), outputs every tuple $\tup$ that satisfies $\query$, annotated with its lineage polynomial ($\poly_\tup$) which is computed inline across the query operators of $\query$~\cite{Imielinski1989IncompleteII}\cite{DBLP:conf/pods/GreenKT07}.
We show in \cref{sec:circuit-runtime} that, assuming a standard $\raPlus$ query evaluation algorithm, the cost of constructing the lineage polynomial for all tuples in a query result is upper-bounded by the runtime of generating those tuples through deterministic query evaluation.
In other words, the first step is in \sharpwonehard, allowing us to focus on the complexity of the second step.
The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$.
We observe that the paradigm of \cref{fig:two-step} is also analogous to semiring provenance~\cite{DBLP:conf/pods/GreenKT07}, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with standard polynomials, i.e. elements from $\semNX$ connected by multiplication and addition operators.} query processing first computes the query and polynomial, and the $\semNX$-polynomial can then subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely in separating the concerns of deterministic computation and the probability computation.
We observe that the paradigm of \cref{fig:two-step} is also analogous to semiring provenance~\cite{DBLP:conf/pods/GreenKT07}, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with standard polynomials, i.e. elements from $\semNX$ connected by multiplication and addition operators.} query processing first computes the query and polynomial, and the $\semNX$-polynomial can then subsequently be evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely to separating the concerns of deterministic computation and the probability computation. \AR{Need to state/justify that intensional model is the "norm" in existing PDB systems.}
\subsection{Intensional Bag-Probabilistic Query Evaluation}
Let $\timeOf{\abbrStepOne}$ denote the runtime of \abbrStepOne (Lineage Computation) and similarly for $\timeOf{\abbrStepTwo}$ (Expectation Computation).
Given bag-\abbrPDB query $\query$ and \abbrTIDB $\pdb$ with $\numvar$ tuples, let us go a step further and assume that computing $\poly_\tup$ is lower bounded by the runtime of determistic query computation of $\query$ (e.g. when $\abs{\textnormal{input}} \leq \abs{\textnormal{output}}$). When $\poly_\tup$ is in standard monomial basis (\abbrSMB)\footnote{A polynomial is in \abbrSMB when it consists of a sum of unique products.}, by linearity of expectation and independence of \abbrTIDB, it follows that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$. Let $\prob_i$ denote the probability of tuple $\tup_i$ ($\probOf\pbox{X_i = 1}$) for $i \in [\numvar]$. Consider another special case when for all $i$ in $[\numvar]$, $\prob_i = 1$. For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in
$\abs{\poly_\tup}$
For PDB $\pdb$ and query $Q$, let $\timeOf{\abbrStepOne}(Q,\pdb)$ denote the runtime of \abbrStepOne (Lineage Computation) and similarly for $\timeOf{\abbrStepTwo}(Q,\pdb)$ (Expectation Computation).
%Atri: Don't see what the sentence below is adding, so removing
%Given bag-\abbrPDB query $\query$ and \abbrTIDB $\pdb$ with $\numvar$ tuples, let us go a step further and assume that computing $\poly_\tup$ is lower bounded by the runtime of determistic query computation of $\query$ (e.g. when $\abs{\textnormal{input}} \leq \abs{\textnormal{output}}$).
When $\poly_\tup(\vct{X})$ is in standard monomial basis (\abbrSMB)\footnote{A polynomial is in \abbrSMB when it consists of a sum of products of variables (a variable can occur more than once).}, by linearity of expectation and independence of \abbrTIDB, it follows that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is indeed $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$. Recall that $\prob_i$ denotes the probability of tuple $\tup_i$ (i.e. $\probOf\pbox{W_i = 1}$) for $i \in [\numvar]$. Consider another special case when for all $i$ in $[\numvar]$, $\prob_i = 1$.
% Replaced the stuff below with something more succinct
%For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in
%$\abs{\poly_\tup}$
%the size of the arithemetic circuit
, since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.} Here is another special case where $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor). These observations introduce our next problem statement:
%, since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.}
In this case, for any output tuple $\tup$, we have $\expct\pbox{\Phi_\tup(\vct{W})}=\Phi_\tup(1,\dots,1)$.
Thus, we have another case where $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor). These observations introduce our first formalization of~\Cref{prob:informal}:
\begin{Problem}\label{prob:big-o-step-one}
Given a \abbrPDB $\pdb$ and $\raPlus$ query $\query$, is it \emph{always} the case that $\timeOf{\abbrStepTwo}$ is always $\bigO{\timeOf{\abbrStepOne}}$?
Given a \abbrPDB $\pdb$, $\raPlus$ query $\query$ and output tuple $\tup$, is it \emph{always} the case that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$?
\end{Problem}
If the answer to \cref{prob:big-o-step-one} is yes, then the query evaluation problem over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation, and probabilistic databases can offer performance competitive with deterministic databases.
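To make the \abbrSMB special case concrete, consider the following small worked example. The polynomial and its variable indices are hypothetical and chosen only for illustration; the derivation assumes a \abbrTIDB with independent $\randWorld_i \in \{0,1\}$ and $\expct\pbox{\randWorld_i} = \prob_i$.

```latex
\begin{align*}
\expct\pbox{\randWorld_1\randWorld_2 + \randWorld_1^2\randWorld_3}
  &= \expct\pbox{\randWorld_1\randWorld_2} + \expct\pbox{\randWorld_1^2\randWorld_3}
     && \text{(linearity of expectation)}\\
  &= \expct\pbox{\randWorld_1}\expct\pbox{\randWorld_2} + \expct\pbox{\randWorld_1^2}\expct\pbox{\randWorld_3}
     && \text{(independence of distinct tuples)}\\
  &= \prob_1\prob_2 + \prob_1\prob_3
     && \text{(since } \randWorld_1^2 = \randWorld_1 \text{ on } \{0,1\}\text{)}
\end{align*}
```

Each monomial is processed in time proportional to its length, so in this sketch a single pass over the \abbrSMB representation suffices.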
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly_\tup(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations\footnote{A factorized representation is a representation of a polynomial that is not in \abbrSMB form.} of $\poly_\tup(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits
\footnote{An arithmetic circuit has variable and/or numeric inputs, with internal nodes each of which can take on a value of either an addition or multiplication operator.}
as the representation system of $\poly_\tup(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07}. The standard query evaluation semantics depicted in \cref{fig:nxDBSemantics} nicely illustrate this.
The main insight of the paper is that to answer~\Cref{prob:big-o-step-one}, the representation of $\Phi_\tup(\vct{X})$ matters. One can have compact representations of $\poly_\tup(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations
%Atri: footnote below was not informative: used an example instead
%\footnote{A factorized representation is a representation of a polynomial that is not in \abbrSMB form.}
of $\poly_\tup(\vct{X})$ (e.g. in~\Cref{fig:two-step}, $B(Y+Z)$ is a factorized representation of the SMB form $BY+BZ$). To capture such factorizations, this work uses (arithmetic) circuits
\footnote{An arithmetic circuit has variable and/or numeric inputs, with internal nodes representing either an addition or multiplication operator.}
as the representation system of $\poly_\tup(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07}. The standard query evaluation semantics depicted in \cref{fig:nxDBSemantics} illustrate this.
\begin{figure}
\begin{align*}
@ -110,14 +123,14 @@ as the representation system of $\poly_\tup(\vct{X})$, which are a natural fit t
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
\evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
0 & \text{otherwise}.
\end{cases} &
\begin{aligned}
\evald{(\rel_1 \join \rel_2)}{\db}(\tup) =\\ ~
\end{aligned}&
\begin{aligned}
&\evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \\
&~~~\cdot\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup))
&\evald{\rel_1}{\db}(\project_{\attr{\rel_1}}(\tup)) \\
&~~~\cdot\evald{\rel_2}{\db}(\project_{\attr{\rel_2}}(\tup))
\end{aligned}\\
& & \evald{R}{\db}(\tup) =& \rel(\tup)
\end{align*}\\[-10mm]
@ -125,28 +138,47 @@ as the representation system of $\poly_\tup(\vct{X})$, which are a natural fit t
\label{fig:nxDBSemantics}
\end{figure}
Above we have seen, given a circuit \circuit, if \circuit is in \abbrSMB, then we have that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$. Such representations are produced by queries with the form $\project, \project\inparen{\join},$ etc. Suppose, on the contrary, that \circuit is not in \abbrSMB and rather in some factorized form. Then to naively compute \abbrStepTwo, one needs to convert \circuit into \circuit' such that \circuit' is in \abbrSMB, and then compute $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, which takes $\bigO{\abbrStepOne^k}$ time for the general $k$-wise factorization. Since \abbrStepTwo lies between $\bigO{\abbrStepOne}$ and $\bigO{\abbrStepOne^k}$, it behooves us to determine which of these extremes is true for the general \circuit. This leads us to the main problem statement of our paper:
In other words, we can capture the size of a factorized lineage polynomial by the size of its corresponding arithmetic circuit $\circuit$ (which we denote by $|\circuit|$).
More importantly, our result in \cref{sec:circuit-runtime} shows that, assuming a standard $\raPlus$ query evaluation algorithm for \termStepOne, given the arithmetic circuit $\circuit$ corresponding to the lineage polynomial output at the end of \termStepOne, we always have $|\circuit|\le \bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$. Given this, we study the following stronger version of~\Cref{prob:big-o-step-one}:
%Atri: Replaced the text below by the above. I know I had talked about $|\circuit|^k$ but I think the stuff below breaks the flow a bit
%Re-stating our earlier observation, given a circuit \circuit, if \circuit is in \abbrSMB (i.e. every sink to source path has a prefix of addition nodes and the rest of the internal nodes are multiplication nodes), then we have that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is indeed $\bigO{\timeOf{\abbrStepOne}(Q,\pdb)}$. We note that \abbrSMB representations are produced by queries with a projection operation on top of a join operation.
% the form $\project, \project\inparen{\join},$ etc.
%Suppose, on the contrary, that \circuit is not in \abbrSMB and rather in some factorized form. Then to naively compute \abbrStepTwo, one needs to convert \circuit into \circuit' such that \circuit' is in \abbrSMB, and then compute $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, which takes $\bigO{|\circuit|^k}$ time for the case that $k$ is the degree of the polynimial $\Phi_\tup(\vct{X})$. Since $|\circuit'|$ lies between $\bigO{|\circuit|}$ and $\bigO{|\circuit|^k}$, it behooves us to determine which of these extremes is true for the general \circuit. This leads us to the main problem statement of our paper:
\begin{Problem}\label{prob:intro-stmt}
Given \abbrPDB $\pdb$ and $\raPlus$ query $\query$, is it always the case that $\timeOf{\abbrStepTwo}$ is $\bigO{\abbrStepOne}$?
\OK{This doesn't parse. What is $\bigO{\abbrStepOne}$? Should this be $\bigO{\poly}$?}
Given a circuit $\circuit$ for \termStepOne for \abbrPDB $\pdb$ and $\raPlus$ query $\query$, is it always the case that $\timeOf{\abbrStepTwo}(Q,\pdb)$ is $\bigO{|\circuit|}$?
%\OK{This doesn't parse. What is $\bigO{\abbrStepOne}$? Should this be $\bigO{\poly}$?}
\end{Problem}
Note that an answer in the affirmative to the above question implies an affirmative answer to~\Cref{prob:big-o-step-one}. Further, we note that if we insist on $\circuit$ being in \abbrSMB form then the result in~\Cref{sec:circuit-runtime} no longer holds and hence, we need to be able to answer the above question for general arithmetic circuits.\AR{I think we need to add more justification/motivation for general circuit representation? Am not sure if the current switch from~\Cref{prob:big-o-step-one} to~\Cref{prob:intro-stmt} flows well enough}
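As a hedged illustration of the gap between a factorized circuit and its \abbrSMB expansion, consider a hypothetical family of polynomials (not taken from the paper) in which each of $k$ factors is a sum of $n$ distinct variables $X_{i,j}$.

```latex
\begin{align*}
\Phi(\vct{X}) &= \prod_{i=1}^{k}\inparen{X_{i,1} + X_{i,2} + \cdots + X_{i,n}}
  && \text{(circuit with $kn$ inputs and $O(kn)$ gates)}\\
&= \sum_{j_1=1}^{n}\cdots\sum_{j_k=1}^{n} X_{1,j_1}X_{2,j_2}\cdots X_{k,j_k}
  && \text{(expanded form with $n^k$ monomials)}
\end{align*}
```

Under these assumptions, the naive approach of first expanding the circuit into \abbrSMB takes time $\Theta(n^k)$, which is not linear in the circuit size for any $k > 1$.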
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Our Results} In this paper we tackle~\Cref{prob:big-o-step-one} to~\Cref{prob:intro-stmt}.
Concretely, we make the following contributions:
(i) Under fine grained hardness assumption, we show that \cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is not true in general
(i) %Under fine grained hardness assumption,
We show that the answer to~\Cref{prob:big-o-step-one} is no in general for exact computation. %\cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is not true in general
% \sharpwonehard in the size of the lineage circuit
by reduction from counting the number of $k$-matchings over an arbitrary graph; we further show superlinear hardness in the size of \circuit for a specific %cubic
graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
(ii) We present an $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries that makes \cref{prob:intro-stmt} true again; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ) followups~\cite{DBLP:conf/pods/KhamisNR16}), the approximation algorithm has runtime linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic\footnote{Note that this doesn't rule out queries for which approximation is linear}); (iii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural); (iv) We further prove that for \raPlus queries
\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
\OK{Atri's (and most theoretician's) statements about complexity always need to be suffixed with ``to within a log factor''}
we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
In fact, via a
reduction from counting the number of $k$-matchings over an arbitrary graph, we show that the problem of \termStepTwo is \sharpwonehard. I.e., not only is the answer to~\Cref{prob:intro-stmt} no, but \termStepTwo cannot be solved in fully polynomial time, i.e. there is no algorithm for \termStepTwo with runtime that grows as $f(k)\cdot |\circuit|^d$, where $k$ is the degree of the corresponding lineage polynomial and $d$ is any fixed constant.\footnote{We would like to note that it is a well-known result in deterministic query computation that \termStepOne is also \sharpwonehard. What our result says is that \termStepTwo is \sharpwonehard {\em even if} we exclude the complexity of \termStepOne.}
This hardness result requires the algorithm to be able to solve the hard query $Q$ for {\em multiple} PDBs. We further show that the answer to~\Cref{prob:intro-stmt} is no even if we fix the $\pd$ (in particular, we insist on $\prob_i = \prob$ for some $\prob$ in $(0, 1)$).
%Atri: The footnote above is where I talk about \sharpwonehard of det query complexity.
We further note that in our hardness proofs, we have $|\circuit|=\Theta\inparen{\timeOf{\abbrStepOne}(Q,\pdb)}$, which shows that the answer to~\Cref{prob:big-o-step-one} is also no.\AR{Need to make sure we have the correct statement for this claim in the main paper.}
%we further show superlinear hardness in the size of \circuit for a specific %cubic
%graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
(ii) To complement our hardness results, we consider an approximate version of~\Cref{prob:intro-stmt}, where instead of computing the expected multiplicity exactly, we allow for a $(1\pm\epsilon)$-\emph{multiplicative} approximation of the expected multiplicity. We show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ)\AR{need to cite the AJAR paper} followups~\cite{DBLP:conf/pods/KhamisNR16}), the answer to the approximation version of~\Cref{prob:intro-stmt} is {\em yes}.
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic in the size of the compressed lineage encoding.\AR{cite?}
%Atri: The footnote below does not add much
%\footnote{Note that this doesn't rule out queries for which approximation is linear});
(iii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural); (iv) We further prove that for \raPlus queries
%\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
%\OK{Atri's (and most theoretician's) statements about complexity always need to be suffixed with ``to within a log factor''}
we can approximate the expected output tuple multiplicities (for all output tuples {\em simultaneously}) with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly_\tup$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. We motivate this reduced polynomial in what follows.
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime)$
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime))$\AR{$\rename$ is not defined. Any reason why we do not just associate the attribute names with the relation. The datalog notation was much cleaner to me.}
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
over the bag relations of \cref{fig:two-step}. It can be verified that $\poly_\tup\inparen{A, B, C, D, X, Y, Z}$ for $Q$ is $AXB + BYD + BZC$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
@ -162,7 +194,9 @@ By exploiting linearity of expectation of summand terms, and further pushing exp
+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
\end{multline*}
\end{footnotesize}
\noindent If the domain of a random variable $\randWorld$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{\randWorld^k} = \expct\pbox{\randWorld}$, which means that $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}}$ simplifies to:
\noindent Since for any $\randWorld\in\{0, 1\}$, we have $\randWorld^2=\randWorld$,
%then for any $k > 0$, $\expct\pbox{\randWorld^k} = \expct\pbox{\randWorld}$, which means that
$\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}}$ simplifies to:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D} \\
@ -173,23 +207,29 @@ By exploiting linearity of expectation of summand terms, and further pushing exp
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the \abbrSMB form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\Phi^2\inparen{\vct{X}}$ as an example, we have:
With $\Phi^2\inparen{A, B, C, D, X, Y, Z}$ as an example, we have:
\begin{align*}
&\widetilde{\Phi^2}(A, B, C, D, X, Y, Z)\\
&\; = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
&\widetilde{\Phi^2}(A, B, C, D, X, Y, Z) = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
%&\; = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
\end{align*}
It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}, \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$). In fact, the following lemma shows that this equivalence holds for {\em all} $\raPlus$ queries over TIDB (proof in \cref{subsec:proof-exp-poly-rpoly}).
Note that we have argued that, for our specific example, the expectation that we want to compute is $\widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}, \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$.
%It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}), \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$).
In fact, the following lemma shows that this equivalence holds for {\em all} $\raPlus$ queries over TIDB (proof in \cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}
Let $\pdb$ be a \abbrTIDB over variables\OK{Should this be $\vct{W}$?} $\vct{X} = \{X_1,\ldots,X_\numvar\}$ such that the probability distribution $\pd$ over $\idb$ (the set of possible worlds) is induced by the probability vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ where $\probAllTup$ consists of each individual tuple's marginal probability across $\idb$. For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$ the following holds:
Let $\pdb$ be a \abbrTIDB over $\numvar$ input tuples
%\OK{Should this be $\vct{W}$?} $\vct{X} = \{X_1,\ldots,X_\numvar\}$
such that the probability distribution $\pd$ over $\vct{W}\in\{0,1\}^\numvar$ (the set of possible worlds) is induced by the probability vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ where $\prob_i=\probOf\pbox{W_i=1}$.
% $\probAllTup$ consists of each individual tuple's marginal probability across $\idb$.
For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$ the following holds:
\begin{equation*}
\expct_{\vct{W} \sim \pd}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}.
\end{equation*}
\end{Lemma}
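The lemma can be sanity-checked on the running example by brute force. The following is a minimal Python sketch, not part of the proof: it enumerates all $2^7$ possible worlds for $\Phi = AXB + BYD + BZC$, computes the expectation of $\Phi^2$ directly, and compares it with the reduced polynomial evaluated at the marginals. The probability values in \texttt{P} are hypothetical and chosen only for illustration, and all function names are ours.

```python
from itertools import product

VARS = ["A", "B", "C", "D", "X", "Y", "Z"]
# Monomials of Phi = AXB + BYD + BZC, each given as a tuple of variable names.
PHI = [("A", "X", "B"), ("B", "Y", "D"), ("B", "Z", "C")]
# Hypothetical marginal probabilities (assumption: made up for illustration).
P = {"A": 0.9, "B": 0.8, "C": 0.5, "D": 0.7, "X": 0.6, "Y": 0.5, "Z": 0.4}


def phi_squared(world):
    """Evaluate Phi^2 on a 0/1 assignment (dict: variable -> 0 or 1)."""
    phi = sum(1 for mono in PHI if all(world[v] for v in mono))
    return phi * phi


def expected_phi_squared(p):
    """Brute-force expectation of Phi^2 over all possible worlds of a TIDB."""
    total = 0.0
    for bits in product([0, 1], repeat=len(VARS)):
        world = dict(zip(VARS, bits))
        weight = 1.0
        for v in VARS:
            # Tuples are independent, so the world's probability is a product.
            weight *= p[v] if world[v] else 1.0 - p[v]
        total += weight * phi_squared(world)
    return total


def reduced_phi_squared(p):
    """Expand Phi * Phi, set every exponent > 1 to 1, and evaluate at p."""
    total = 0.0
    for m1 in PHI:
        for m2 in PHI:
            term = 1.0
            # Taking the union of variables drops repeated exponents.
            for v in set(m1) | set(m2):
                term *= p[v]
            total += term
    return total


if __name__ == "__main__":
    print(expected_phi_squared(P), reduced_phi_squared(P))
```

Iterating over ordered pairs of monomials yields the three squared terms once and each cross term twice, which matches the coefficients of $\widetilde{\Phi^2}$ above.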
To prove our hardness result we show that for the same $Q$ considered in the query above, for an arbitrary product width $k$, the query $Q^k$ is able to encode various hard graph-counting problems\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \cref{fig:two-step}}. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$).
To prove our hardness result we show that for the same $Q$ considered in the example above, for an arbitrary product width $k$, the query $Q^k$ is able to encode various hard graph-counting problems\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \cref{fig:two-step}}. We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the $Route$ relation in $Q$).
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant-factor approximation. For example, using $\query^2$ from above, using $\prob_A$ to denote $\probOf\pbox{A = 1}$, we can see that
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\prob_1,\dots, \prob_n\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant-factor approximation. For example, using $\query^2$ from above, using $\prob_A$ to denote $\probOf\pbox{A = 1}$ (and similarly for the other six variables), we can see that
\begin{align*}
\poly^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_D^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_D + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_D\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_D + \prob_B\prob_Z\prob_C +
@ -197,9 +237,11 @@ For an upper bound on approximating the expected count, it is easy to check that
&= \rpoly\inparen{\vct{p}}
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and we see that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\prob_A\prob_X\prob_B\cdot\rpoly\inparen{\vct{\prob}}, \rpoly\inparen{\vct{\prob}}]$.
If we assume that all seven probability values are at least $p_0>0$,
%Choose the least factor that is reduced in $\rpoly^2\inparen{\vct{X}}$, in this case $\prob_A\prob_X\prob_B$, and
then we note that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdot\rpoly\inparen{\vct{\prob}}, \rpoly\inparen{\vct{\prob}}]$.
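The range stated above can be checked numerically on the running example. The sketch below is illustrative only: the probabilities in \texttt{P} are hypothetical, the function names are ours, and a small tolerance is added to absorb floating-point rounding.

```python
# Monomials of Phi = AXB + BYD + BZC, each given as a tuple of variable names.
PHI = [("A", "X", "B"), ("B", "Y", "D"), ("B", "Z", "C")]
# Hypothetical marginal probabilities (assumption: made up for illustration).
P = {"A": 0.9, "B": 0.8, "C": 0.5, "D": 0.7, "X": 0.6, "Y": 0.5, "Z": 0.4}


def prod_over(variables, p):
    """Product of the probabilities of the given variables."""
    result = 1.0
    for v in variables:
        result *= p[v]
    return result


def phi_squared_at(p):
    """Evaluate the unreduced polynomial Phi^2 at the probability vector p."""
    return sum(prod_over(mono, p) for mono in PHI) ** 2


def reduced_phi_squared_at(p):
    """Evaluate the reduced form of Phi^2 (all exponents capped at 1) at p."""
    return sum(
        prod_over(set(m1) | set(m2), p) for m1 in PHI for m2 in PHI
    )


def sandwich_holds(p, tol=1e-12):
    """Check p0^3 * reduced <= Phi^2(p) <= reduced, with p0 the least probability."""
    p0 = min(p.values())
    unreduced = phi_squared_at(p)
    reduced = reduced_phi_squared_at(p)
    return p0 ** 3 * reduced - tol <= unreduced <= reduced + tol


if __name__ == "__main__":
    print(phi_squared_at(P), reduced_phi_squared_at(P), sandwich_holds(P))
```

The exponent $3$ comes from the squared monomials such as $\prob_A^2\prob_X^2\prob_B^2$, which lose three factors under reduction; the cross terms lose only the single factor $\prob_B$.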
To get a $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from $\Phi$ and `adjust' their contribution based on the construction of $\widetilde{\Phi}\left(\cdot\right)$.
To get a $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
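One way to read the sampling idea is sketched below in Python. This is an illustration under our own assumptions, not the algorithm of \Cref{sec:algo}: we assume the \abbrSMB form is available explicitly as a list of (coefficient, exponent map) pairs, we interpret uniform sampling as treating a monomial with coefficient $c$ as $c$ identical copies, and we `adjust' a sampled monomial by ignoring its exponents. The probabilities in \texttt{P} and all names are hypothetical.

```python
import random

# SMB form of Phi^2 for Phi = AXB + BYD + BZC: (coefficient, {variable: exponent}).
PHI_SQUARED_SMB = [
    (1, {"A": 2, "X": 2, "B": 2}),
    (1, {"B": 2, "Y": 2, "D": 2}),
    (1, {"B": 2, "Z": 2, "C": 2}),
    (2, {"A": 1, "X": 1, "B": 2, "Y": 1, "D": 1}),
    (2, {"A": 1, "X": 1, "B": 2, "Z": 1, "C": 1}),
    (2, {"B": 2, "Y": 1, "D": 1, "Z": 1, "C": 1}),
]
# Hypothetical marginal probabilities (assumption: made up for illustration).
P = {"A": 0.9, "B": 0.8, "C": 0.5, "D": 0.7, "X": 0.6, "Y": 0.5, "Z": 0.4}


def reduced_term(exponents, p):
    """Value of one monomial after reduction: every exponent is treated as 1."""
    term = 1.0
    for v in exponents:
        term *= p[v]
    return term


def exact_reduced(smb, p):
    """Exact value of the reduced polynomial, for comparison."""
    return sum(c * reduced_term(exps, p) for c, exps in smb)


def estimate_reduced(smb, p, num_samples, rng):
    """Monte Carlo estimate of the reduced polynomial evaluated at p.

    Monomials are drawn with probability proportional to their coefficient,
    so rescaling the sample mean by the coefficient sum is unbiased.
    """
    coeffs = [c for c, _ in smb]
    total_coeff = sum(coeffs)
    acc = 0.0
    for _ in range(num_samples):
        _, exps = rng.choices(smb, weights=coeffs, k=1)[0]
        acc += reduced_term(exps, p)
    return total_coeff * acc / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate_reduced(PHI_SQUARED_SMB, P, 50_000, rng))
    print(exact_reduced(PHI_SQUARED_SMB, P))
```

Here the \abbrSMB form is small enough to enumerate, so the exact value is available for comparison; the sketch is only meant to show why the sample mean, rescaled by the sum of coefficients, targets $\widetilde{\Phi^2}$ rather than $\Phi^2$.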
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.

View file

@ -128,4 +128,5 @@
}
\caption{Intensional Evaluation Model}
\label{fig:two-step}
\end{figure}
\end{figure}
\AR{Comment for first fig: What is $\ell$ doing in City$_\ell$ for OnTime? Also need to state what $Q$ is in the example-- can be in the caption}