More changes per @atri 081021 suggestions.

master
Aaron Huber 2021-08-16 13:37:15 -04:00
parent aa825f93a6
commit 7b6d6dc37e
6 changed files with 49 additions and 51 deletions


@ -33,8 +33,8 @@ We require that $\phi_{Q,\pxdb}$'s range be limited to sink vertices (i.e., vert
%We call a sink vertex not in the range of $\phi_R$ a \emph{dead sink}.
A function $\ell_{Q,\pxdb} : V_{Q,\pxdb} \rightarrow \{\;+,\times\;\}\cup \mathbb N \cup \vct X$ assigns a label to each node: Source nodes (i.e., vertices with in-degree 0) are labeled with constants or variables (i.e., $\mathbb N \cup \vct X$), while the remaining nodes are labeled with the symbol $+$ or $\times$.
We require that vertices have an in-degree of at most two.
%
For the specifics on how to construct a circuit to encode the polynomials of all result tuples for a query and $\semNX$-PDB see \Cref{app:subsec-rep-poly-lin-circ}. Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.
%For the specifics on how to construct a circuit to encode the polynomials of all result tuples for a query and $\semNX$-PDB see \Cref{app:subsec-rep-poly-lin-circ}.
Note that we can construct circuits for \bis in time linear in the time required for deterministic query processing over a possible world of the \bi under the aforementioned assumption that $\abs{\pxdb} \leq c \cdot \abs{\db}$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Circuit size vs. runtime}
@ -118,15 +118,14 @@ As in projection, newly created vertices will have an in-degree of $k$, and a fa
There are $|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ such vertices, so the corrected circuit has $|V_{Q_1,\pxdb}|+\ldots+|V_{Q_k,\pxdb}|+(k-1)|{Q_1} \bowtie \ldots \bowtie {Q_k}|$ vertices.
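As a quick sanity check of this count (the sizes here are hypothetical and chosen only for illustration), take $k = 2$ with $|V_{Q_1,\pxdb}| = 5$, $|V_{Q_2,\pxdb}| = 7$, and $|{Q_1} \bowtie {Q_2}| = 3$. The bound then gives
\[
5 + 7 + (2-1)\cdot 3 = 15
\]
vertices.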
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}
\begin{Lemma}\label{lem:circ-model-runtime}
\label{lem:circuits-model-runtime}
Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the runtime of $Q$ over $\pxdb$ has the same or better complexity as the size of the lineage of $Q(\pxdb)$. That is, we have $\abs{V_{Q,\pxdb}} \leq (k-1)\qruntime{Q}$, where $k$ is the maximal degree of any polynomial in $Q(\pxdb)$.
Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the runtime of $Q$ over $\pxdb$ has the same or greater complexity as the size of the lineage of $Q(\pxdb)$. That is, we have $\abs{V_{Q,\pxdb}} \leq (k-1)\qruntime{Q}$, where $k$ is the maximal degree of any polynomial in $Q(\pxdb)$.
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent The proof is shown in \Cref{app:subsec-lem-lin-vs-qplan}.
We now have all the pieces to argue that using our approximation algorithm, the expected multiplicities of a SPJU query can be computed in essentially the same runtime as deterministic query processing for the same query.
%\noindent The proof is shown in \Cref{app:subsec-lem-lin-vs-qplan}.
\subsection{Proof for \Cref{lem:circuits-model-runtime}}\label{app:subsec-lem-lin-vs-qplan}
%\subsection{Proof for \Cref{lem:circuits-model-runtime}}\label{app:subsec-lem-lin-vs-qplan}
\begin{proof}
Proof by induction. The base case is a base relation: $Q = R$ and is trivially true since $|V_{R,\pxdb}| = |R|$.
For the inductive step, we assume that we have circuits for subplans $Q_1, \ldots, Q_n$ such that $|V_{Q_i,\pxdb}| \leq (k_i-1)\qruntime{Q_i,\pxdb}$ where $k_i$ is the degree of $Q_i$.
@ -136,6 +135,7 @@ Assume that $Q = \sigma_\theta(Q_1)$.
The circuit for $Q$ has $|V_{Q,\pxdb}| = |V_{Q_1,\pxdb}|$ vertices, so by the inductive assumption and the fact that $\qruntime{Q,\pxdb} = \qruntime{Q_1,\pxdb}$ by definition, we have $|V_{Q,\pxdb}| \leq (k-1) \qruntime{Q,\pxdb}$.
% \AH{Technically, $\kElem$ is the degree of $\poly_1$, but I guess this is a moot point since one can argue that $\kElem$ is also the degree of $\poly$.}
% OK: Correct
\caseheading{Projection}
Assume that $Q = \pi_{\vct A}(Q_1)$.
The circuit for $Q$ has at most $|V_{Q_1,\pxdb}|+|{Q_1}|$ vertices.
@ -180,6 +180,9 @@ The property holds for all recursive queries, and the proof holds.
\qed
\end{proof}
With \cref{lem:circ-model-runtime} and our upper bound results on \approxq, we now have all the pieces to argue that using our approximation algorithm, the expected multiplicities of an $\raPlus$ query can be computed in essentially the same runtime as deterministic query processing for the same query, proving claim (iv) of the Introduction.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"


@ -1,30 +1,33 @@
%root: main.tex
\section{Introduction (Rewrite - 070921)}
\input{two-step-model}
A probabilistic database (\abbrPDB) $\pdb$ is a probability distribution $\pd$ over a set of $\numvar$ tuples in a deterministic database $\db$. A tuple independent probabilistic database (\abbrTIDB) $\pdb$ further restricts $\pd$ to treating each tuple in $\db$ as an independent Bernoulli distributed random variable corresponding to the tuple's presence, where, in bag query semantics, $\query\inparen{\pdb}\inparen{\tup}$ is the random (polynomial) variable corresponding to the multiplicity of the output tuple $\tup$. Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$)\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the SPJU and renaming operators.}, the goal is to compute the expected multiplicity\footnote{In set-semantics \abbrPDB\xplural, computing $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$ corresponds to computing the marginal probability.} ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$. There exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{X}}}$\footnote{In this work we focus on one output tuple $\tup$, and hence refer to $\poly_\tup$ as $\poly$.}, where $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$, and the expectation is taken over any Bernoulli distribution over $\{0, 1\}^\numvar$, using $\semN$-semiring semantics. The set of variables $X_i$ in $\vct{X}$ with nonzero assignments\footnote{While there are semirings such as the security semiring that use $0$ (denoted $\zerosymbol$) as a valid annotation over tuples of a finite database, for our purposes this distinction is not necessary.} and nonzero coefficients in $\poly\inparen{\vct{X}}$ represent all contributing input tuples to the presence of the output.
In the context of bags, as is the case for $\expct\pbox{\poly\inparen{\vct{X}}}$, computing the multiplicity $\poly\inparen{\vct{X}}$ follows $\semN$-semiring semantics.
A tuple independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a tuple $\inparen{\db, \pd}$ where $\db$ is a set of $\numvar$ tuples. The probability distribution $\pd$ over $\db$ is the one induced from the requirement that each tuple be treated as an independent Bernoulli distributed random variable. In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$. The query evaluation problem in bag-\abbrPDB semantics can be stated as
\begin{Problem}\label{prob:bag-pdb-query-eval}
Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$),\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} compute the expected multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$.
\end{Problem}
There exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{X}}}$, where $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$ is the set of variables annotating the tuples in $\pdb$. The expectation is taken over any Bernoulli distribution over $\{0, 1\}^\numvar$, and $\poly_\tup$ is evaluated under the standard interpretation of the addition and multiplication operators over the natural numbers, i.e., $\semN$-semiring semantics.
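As a small illustration (the polynomial below is a hypothetical example, not one derived from a query in this paper), suppose $\poly_\tup\inparen{\vct{X}} = X_1X_2 + X_1X_3$ over a \abbrTIDB, and write $\prob_i$ for the probability that $X_i = 1$. Linearity of expectation and the independence of distinct variables give
\begin{align*}
\expct\pbox{X_1X_2 + X_1X_3} &= \expct\pbox{X_1X_2} + \expct\pbox{X_1X_3}\\
&= \prob_1\prob_2 + \prob_1\prob_3.
\end{align*}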
To precisely compare and contrast runtime complexities across varying database models and \abbrPDB computations, a brief informal review of certain complexity classes is useful. Given an algorithm $\mathcal{A}$ and input size $\numvar$, $\mathcal{A}$ is an element of the \sharpp complexity class if computing a solution is an element of $\np$ and there may exist multiple solutions to a given problem.
To be an element of the \sharpwone class, the runtime of algorithm $\mathcal{A}$ must be lower bounded by some function $f$ of a parameter $k$ such that the growth in runtime is polynomially dependent on $f(k)$. Specifically, $\mathcal{A}$ is an element of \sharpwone if its lower bound on runtime is of the form $n^{f(k)}$.
The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$. For example, the counting $k$-cliques query problem (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
This result is unsatisfying when considering complexity of evaluating $\query$ over \abbrPDB\xplural, since it does not account for computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$, entirely ignoring the `P' in \abbrPDB.
The special case of deterministic query evaluation
%simply computing $\query$ over a deterministic database
is itself known to be \sharpwonehard in data complexity for general $\query$. An algorithm, such as a counting cliques query, is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
%hardness is seen in such queries as counting $k$-cliques and $k$-way joins, where the superlinear runtime is parameterized in $k$.
This result is unsatisfying when considering complexity of evaluating $\query$ over \abbrPDB\xplural, since it does not account for computing $\expct\pbox{\poly\inparen{\vct{X}}}$, entirely ignoring the `P' in \abbrPDB.
%of intensional evaluation (computing $\expct\pbox{\poly\inparen{\vct{X}}}$).
A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$ separately from the complexity of deterministic query evaluation. Viewing \abbrPDB query evaluation as these two separate steps is also known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}, illustrated in \cref{fig:two-step}.
The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{X})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separately from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard since evaluating the probability inline with query operators (extensional evaluation) will only approximate the actual probability in such a case. The paradigm of \cref{fig:two-step} is also neatly followed by semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with standard polynomials, i.e. elements from $\semNX$ connected by addition operators.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial is subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the intensional model lends itself nicely to separating the concerns of deterministic computation and the probability computation.
A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly\inparen{\vct{X}}}$ separately from the complexity of deterministic query evaluation. Viewing \abbrPDB query evaluation as these two separate steps is essentially what is known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}. \Cref{fig:two-step} illustrates the intensional evaluation computation model.
%one way to do this.
%The model of computation in \cref{fig:two-step} views \abbrPDB query processing as two steps.
The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing $\query$ over a \abbrPDB, which is essentially the deterministic computation of both the query output and $\poly(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$, as we show in [placeholder].} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly(\vct{X})}$. Such a model of computation is nicely followed by intensional evaluation in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly\inparen{\vct{X}}$ must be computed separately for exact output when $\query(\pdb)$ is hard since extensional evaluation will only approximate in such a case.
%(where e.g. intensional evaluation is itself a separate computational step; further, computing $\expct\pbox{\poly\inparen{\vct{X}}}$ in extensional evaluation occurs as a separate step of each operator in the query tree, and therefore implies that both concerns can be separated)
The paradigm of \cref{fig:two-step} is also neatly followed by semiring provenance, where $\semNX$-DB query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial is subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the model lends itself nicely to separating the concerns of deterministic computation and the probability computation. Observing this model prompts the informal problem statement: given $\query\inparen{\pdb}$, is it the case that \abbrStepTwo is always $\bigO{\abbrStepOne}$?
%always $\bigO{\abbrStepOne}$.
If so, then query evaluation over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation (up to a constant factor).
Let $\timeOf{\abbrStepOne}$ denote the runtime of \abbrStepOne and similarly for $\timeOf{\abbrStepTwo}$. When solving \cref{prob:bag-pdb-query-eval}, $\timeOf{\abbrStepTwo}$ lies somewhere between $\bigO{\timeOf{\abbrStepOne}}$ and $\bigO{\timeOf{\abbrStepOne}^k}$, since when $\poly_\tup$ is in \abbrSMB\footnote{\abbrSMB is akin to the sum of products expansion but with the added requirement that all monomials are unique.} computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$ is linear (due to linearity of expectation and the independence assumption of \abbrTIDB), while the case of a factorized $\poly_\tup$ has a worst case (since in general product terms must be expanded) of $\timeOf{\abbrStepTwo}$ being $\timeOf{\abbrStepOne}^k$ for a $k$-wise factorization. This observation introduces our next problem statement:
\begin{Problem}\label{prob:big-o-step-one}
Given a \abbrPDB $\pdb$ and $\raPlus$ query $\query$, is it \emph{always} the case that $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$?
\end{Problem}
If the answer to \cref{prob:big-o-step-one} is yes, then the query evaluation problem over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation.
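To see where the gap between the two bounds can come from, consider a minimal sketch (a hypothetical polynomial chosen for illustration) with the factorized form $\inparen{X_1 + X_2}^2$ over a \abbrTIDB with $X_i \in \{0, 1\}$ and $\prob_i$ the probability that $X_i = 1$. Since $X_i^2 = X_i$ for $0/1$-valued variables, expanding into \abbrSMB gives
\begin{align*}
\expct\pbox{\inparen{X_1 + X_2}^2} &= \expct\pbox{X_1^2} + 2\,\expct\pbox{X_1X_2} + \expct\pbox{X_2^2}\\
&= \prob_1 + 2\prob_1\prob_2 + \prob_2,
\end{align*}
which in general differs from $\inparen{\prob_1 + \prob_2}^2$, the value obtained by naively substituting probabilities into the factorized form. The expectation therefore cannot simply be pushed through a product whose factors share variables.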
The problem of computing $\query(\pdb)$ has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where the lineage polynomial follows $\semB$-semiring semantics, each output tuple appears at most once, and $\expct\pbox{\poly\inparen{\vct{X}}}$ is essentially the marginal probability of a tuple's presence in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard in general, and proved that a dichotomy exists for this problem, where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard for any polynomial-time \abbrStepOne. Since the hardness is in data complexity (the size of the input, $\Theta(\numvar)$), techniques such as parameterized complexity (bounding complexity by another parameter other than $\numvar$) and fine grained analysis (complexity analysis that asks what precisely is the value of this other parameter, for example, what is the value of $f(k)$ given a \sharpwone algorithm) of \abbrStepTwo will not refine the hardness results from \sharpphard.
\Cref{prob:big-o-step-one} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where each output tuple appears at most once. Here, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of the said operators on boolean variables.} whose evaluation follows the standard semantics ($\semB$-semiring semantics), denoting the presence or absence of $\tup$. Computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$ determines the marginal probability of $\tup$ appearing in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard\footnote{\sharpp is the counting version for problems residing in the NP complexity class.} in general, and proved that a dichotomy exists for this problem, where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard for any polynomial-time \abbrStepOne. Since the hardness is in data complexity (the size of the input, $\Theta(\numvar)$), techniques such as parameterized complexity (bounding complexity by another parameter other than $\numvar$) and fine grained analysis (complexity analysis that asks what precisely is the value of this other parameter, for example, what is the value of $f(k)$ given a \sharpwone algorithm) of \abbrStepTwo will not refine the hardness results from \sharpphard.
There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly\inparen{\vct{X}}}$) of the result.
There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit than set-\abbrPDB\xplural. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly_\tup\inparen{\vct{X}}}$) of the result.
In bag-\abbrPDB\xplural (as alluded to above), $\timeOf{\abbrStepTwo}$ is $\bigO{\abs{\poly_\tup}}$\footnote{$\abs{\poly_\tup}$ denotes the size of $\poly_\tup$, i.e., the number of arithmetic operations.} when $\poly_\tup$ is in \abbrSMB. For the special case when $\query$ is a sequence of query algorithms (e.g. $\project\inparen{\join}$) whose evaluation is precisely mirrored in the \abbrSMB representation of $\poly_\tup$, it then follows that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$.
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly_\tup(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations of $\poly_\tup(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits
\footnote{An arithmetic circuit has variable and/or numeric inputs, with internal nodes each of which can take on a value of either an addition or multiplication operator.}
as the representation system of $\poly_\tup(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07}. The standard query evaluation semantics depicted in \cref{fig:nxDBSemantics} nicely illustrate this.
\begin{figure}
\begin{align*}
@ -45,30 +48,18 @@ There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natura
\end{align*}\\[-10mm]
\caption{Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
\label{fig:nxDBSemantics}
\end{figure}
\end{figure}
The semantics of $\query(\pdb)$ in bag-\abbrPDB\xplural allow for output tuples to appear \emph{more} than once, which is naturally captured by the $\semN$-semiring. Given a standard monomial basis (\abbrSMB)\footnote{\abbrSMB is akin to the sum of products expansion but with the added requirement that all monomials are unique.} representation of the lineage polynomial, the complexity of computing \abbrStepTwo is linear in the size of the lineage polynomial. This is true since standard addition operators allow for
%the addition and multiplication operators in \cref{fig:nxDBSemantics} are those of the $\semN$-semiring, and computing the expected count over such operators allows for
linearity of expectation, and since \abbrSMB has no factorization, the monomials with dependent multiplicative variables are known up front without any additional operations needed. Thus, the expected count can indeed be computed by the same order of operations as contained in $\poly$. In other words, given an \abbrSMB representation, $\expct\pbox{\poly\inparen{\vct{X}}}$ can always be computed in $\bigO{n^k}$ for $\numvar$ input tuples and $k$-size query complexity. Then for the special case of queries whose operators naturally produce an \abbrSMB representation, we have that \abbrStepTwo is $\bigO{\abbrStepOne}$.
%This result coupled with the prevalence that exists amongst most well-known \abbrPDB implementations to use an sum of products\footnote{Sum of products differs from \abbrSMB in allowing any arbitrary monomial $m_i$ to appear in the polynomial more than once, whereas, \abbrSMB requires all monomials $m_i,\ldots, m_j$ such that $m_i = \cdots = m_j$ to be combined into one monomial, such that each monomial appearing in \abbrSMB is unique. The complexity difference between the two representations is up to a constant factor.} representation,
%\currentWork{
% show us that to develop comparable bounds between \abbrStepOne and \abbrStepTwo for bag $\query\inparen{\pdb}$, one must examine complexity at finer levels....to not be bottlenecked in \abbrStepTwo, it must be the case that \abbrStepTwo is not \sharpwonehard regardless of the polynomial representation. may partially explain why the bag-\abbrPDB query problem has long been thought to be easy.
%}
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system of $\poly(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}).
%Our work explores whether or not \abbrStepTwo in the computation model is \emph{always} in the same complexity class as deterministic query evaluation, when \abbrStepOne of $\query(\pdb)$ is easy. We examine the class of queries whose lineage computation in step one is lower bounded by the query runtime of step one.
Let us consider output $\poly(\vct{X})$ for $\query(\pdb)$ such that $\pdb$ is a bag-\abbrTIDB. For $\tup_i$ in $\pdb$, denote its corresponding probability as $\prob_i$, that is, $\probOf\pbox{X_i} = \prob_i$ for all $i$ in $[\numvar]$. Consider the special case when $\pdb$ is a deterministic database with one possible world $\db$. In this case, $\prob_i = 1$ for all $i$ in $[\numvar]$, and it can be seen that the problem of computing the expected count is linear in the size of the arithmetic circuit, since we can essentially push expectation through multiplication of variables dependent on one another\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.}. This means that \abbrStepTwo is $\bigO{\abbrStepOne}$ and we always have deterministic query runtime for $\query\inparen{\pdb}$ up to a constant factor for this special case. Is this the general case? This leads us to our problem statement:
Given bag-\abbrPDB query $\query$ and \abbrTIDB $\pdb$ with $\numvar$ tuples, let $\prob_i$ denote the probability of tuple $\tup_i$ ($\probOf\pbox{X_i = 1}$) for $i \in [\numvar]$. Consider the special case when for all $i$ in $[\numvar]$, $\prob_i = 1$. For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{X}}}$ is linear in the size of the arithmetic circuit, since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.} Here is another special case where $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor). Is this the general case? This leads us to the main problem statement of this paper:
\begin{Problem}\label{prob:intro-stmt}
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, is it always the case that computing \abbrStepTwo is $\bigO{\abbrStepOne}$?%what is the complexity (in the size of the circuit representation) of computing step two ($\expct\pbox{\poly(\vct{X})}$) for each tuple $\tup$ in the output of $\query(\pdb)$?
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, is it always the case that computing \abbrStepTwo is $\bigO{\timeOf{\abbrStepOne}}$?
\end{Problem}
We show that, for the class of \abbrTIDB\xplural with $0 < \prob_i < 1$, the problem of computing \abbrStepTwo is superlinear in the size of the lineage polynomial representation under fine-grained complexity hardness assumptions.
Our work further introduces an approximation algorithm of \abbrStepTwo from the bag-\abbrPDB query $\query$ which runs in linear time.
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\poly\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)
paradigm of set-\abbrPDB\xplural. To address the question of whether or not bag-\abbrPDB\xplural are easy, we focus on computing the expected count ($\expct\pbox{\poly\inparen{\vct{X}}}$) as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\poly_\tup\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)
paradigm of set-\abbrPDB\xplural. To address the question of whether or not bag-\abbrPDB\xplural are easy, we focus on computing the expected count ($\expct\pbox{\poly_\tup\inparen{\vct{X}}}$) as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
Our work focuses on the following setting for query computation. Inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting, however, is not limiting as a simple generalization exists, reducing a bag \abbrPDB to a set \abbrPDB with typically only an $O(c)$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$.
@ -76,18 +67,21 @@ Our work focuses on the following setting for query computation. Inputs of $\qu
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
Concretely, we make the following contributions:
(i) Under fine-grained hardness assumptions, we show that \cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is \sharpwonehard in the size of the lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph; we further show superlinear hardness for a specific cubic graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ) followups~\cite{DBLP:conf/pods/KhamisNR16}) the algorithm has runtime linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic\footnote{Note that this doesn't rule out queries for which approximation is linear.}); (iii) We generalize the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural), a more general model of probabilistic data; (iv) We further prove that for \raPlus queries
(i) Under fine-grained hardness assumptions, we show that \cref{prob:intro-stmt} for bag-\abbrTIDB\xplural is not true in general
% \sharpwonehard in the size of the lineage circuit
by reduction from counting the number of $k$-matchings over an arbitrary graph; we further show superlinear hardness for a specific %cubic
graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries that makes \cref{prob:intro-stmt} true again; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ) followups~\cite{DBLP:conf/pods/KhamisNR16}) the algorithm has runtime linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic\footnote{Note that this doesn't rule out queries for which approximation is linear.}); (iii) We generalize the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural), a more general model of probabilistic data; (iv) We further prove that for \raPlus queries
\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial in what follows.
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly_\tup$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. We motivate this reduced polynomial next.
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime))$
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
over the bag relations of \cref{fig:two-step}. It can be verified that $\Phi$ for $Q$ is $L_aR_aL_b + L_bR_bL_d + L_bR_cL_c$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
over the bag relations of \cref{fig:two-step}. It can be verified that $\poly_\tup$ for $Q$ is $L_aR_aL_b + L_bR_bL_d + L_bR_cL_c$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
The lineage polynomial for $Q^2$ is given by $\Phi^2$:
The lineage polynomial for $Q^2$ is given by $\poly^2$:
\begin{multline*}
\left(L_aR_aL_b + L_bR_bL_d + L_bR_cL_c\right)^2\\
=L_a^2R_a^2L_b^2 + L_b^2R_b^2L_d^2 + L_b^2R_c^2L_c^2 + 2L_aR_aL_b^2R_bL_d + 2L_aR_aL_b^2R_cL_c + 2L_b^2R_bL_dR_cL_c.

View File

@ -282,6 +282,7 @@
\newcommand{\sharpwone}{\#W[1]\xspace}
\newcommand{\sharpwonehard}{\#W[1]-hard\xspace}
\newcommand{\ptime}{PTIME\xspace}
\newcommand{\timeOf}[1]{T_{#1}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View File

@ -156,7 +156,7 @@ to the variables $\vct{X}$. Intuitively, \Cref{lem:exp-poly-rpoly} states that w
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}\label{cor:expct-sop}
If $\poly$ is a \bi-lineage polynomial, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\size\inparen{\smbOf{\poly}}}$, where $\size\inparen{\poly}$ (\Cref{def:size}) is proportional to the total number of multiplication/addition operators in $\poly$.
If $\poly$ is a \bi-lineage polynomial already in \abbrSMB, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\size\inparen{\poly}}$, where $\size\inparen{\poly}$ (\Cref{def:size}) is proportional to the total number of multiplication/addition operators in $\poly$.
\end{Corollary}
%\AH{What if $\poly$ is not in \abbrSMB form?}

View File

@ -4,4 +4,4 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Parameterized Complexity}\label{sec:param-compl}
In \Cref{sec:hard}, we utilized common conjectures from fine-grained complexity theory. The notion of $\sharpwonehard$ is a standard notion in {\em parameterized complexity}, which by now is a standard complexity tool in providing data complexity bounds on query processing results~\cite{param-comp}. E.g. the fact that $k$-matching is $\sharpwonehard$ implies that we cannot have an $n^{\Omega(1)}$ runtime. However, these results do not carefully track the exponent in the hardness result. E.g. $\sharpwonehard$ for the general $k$-matching problem does not imply anything specific for the $3$-matching problem. Similar questions has led to intense research into the new sub-field of {\em fine-grained complexity} (see~\cite{virgi-survey}), where we care about the exponent in our hardness assumptions as well-- e.g. \Cref{conj:graph} is based on the popular {\em Triangle detection hypothesis} in this area (cf.~\cite{triang-hard}).
In \Cref{sec:hard}, we utilized common conjectures from fine-grained complexity theory. The notion of $\sharpwonehard$ is a standard notion in {\em parameterized complexity}, which by now is a standard complexity tool in providing data complexity bounds on query processing results~\cite{param-comp}. E.g. the fact that $k$-matching is $\sharpwonehard$ implies that we cannot have an $n^{\Omega(1)}$ runtime. However, these results do not carefully track the exponent in the hardness result. E.g. $\sharpwonehard$ for the general $k$-matching problem does not imply anything specific for the $3$-matching problem. Similar questions have led to intense research into the new sub-field of {\em fine-grained complexity} (see~\cite{virgi-survey}), where we care about the exponent in our hardness assumptions as well-- e.g. \Cref{conj:graph} is based on the popular {\em Triangle detection hypothesis} in this area (cf.~\cite{triang-hard}).

View File

@ -126,6 +126,6 @@
\node[below=0.2cm of rrect]{{\LARGE $\expct\pbox{\poly(\vct{X})}$}};
\end{tikzpicture}
}
\caption{Two step model of computation}
\caption{Intensional Evaluation Model}
\label{fig:two-step}
\end{figure}