More changes on Intro @atri comments 072021.

master
Aaron Huber 2021-07-26 12:14:13 -04:00
parent e890b14ae5
commit 6ebf335e90
2 changed files with 31 additions and 25 deletions


@ -16,13 +16,40 @@ The model of computation in \cref{fig:two-step} views \abbrPDB query processing
%(where e.g. intensional evaluation is itself a separate computational step; further, computing $\expct\pbox{\poly\inparen{\vct{X}}}$ in extensional evaluation occurs as a separate step of each operator in the query tree, and therefore implies that both concerns can be separated)
and also by that of semiring provenance \cite{DBLP:conf/pods/GreenKT07}, where the $\semNX$-DB first computes the annotation via the query, and then the polynomial is evaluated on a specific valuation. Further, in this work, the model lends itself nicely to separating the deterministic computation from the probability computation. Observing this model prompts the question of whether or not bag \abbrStepTwo is always $\bigO{\abbrStepOne}$. If not, then query evaluation over bag \abbrPDB\xplural is not always equivalent to deterministic query evaluation.
The problem of computing $\query(\pdb)$ has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where the lineage polynomial is a propositional formula.\footnote{For the case when $\query$ is in the class of $\raPlus$ and $\pdb$ is a \abbrTIDB, a bag \abbrPDB lineage polynomial is over a natural number semiring and a set \abbrPDB lineage polynomial is over the Boolean semiring.} The semantics of evaluating $\query(\pdb)$ in this setting require that each output tuple in $\query(\pdb)$ appear at most once in the result, and the computation of $\expct\pbox{\poly\inparen{\vct{X}}}$ yields the marginal probability of $\tup$'s existence. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard in general, and proved that a dichotomy exists for this problem, where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard for any polynomial-time step one. Since the hardness is in data complexity (the size of the input, $\Theta(\numvar)$), fine-grained analysis (complexity analysis that produces varying complexity classes based on a more fine-grained parameter other than $\numvar$) of step two will not reduce the hardness results from the \sharpphard complexity class for any parameterized complexity class. To overcome this result, one can allow for approximation, which reduces the problem to a quadratic upper bound.
The problem of computing $\query(\pdb)$ has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where the lineage polynomial is a propositional formula.\footnote{For the case when $\query$ is in the class of $\raPlus$ and $\pdb$ is a \abbrTIDB, a bag \abbrPDB lineage polynomial is over a natural number semiring and a set \abbrPDB lineage polynomial is over the Boolean semiring.} The semantics of evaluating $\query(\pdb)$ in this setting require that each output tuple in $\query(\pdb)$ appear at most once in the result, and the computation of $\expct\pbox{\poly\inparen{\vct{X}}}$ yields the marginal probability of $\tup$'s existence. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard in general, and proved that a dichotomy exists for this problem, where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard for any polynomial-time step one. Since the hardness is in data complexity (the size of the input, $\Theta(\numvar)$), fine-grained analysis (complexity analysis that produces varying complexity classes based on a more fine-grained parameter other than $\numvar$) of step two will not reduce the hardness results from the \sharpphard complexity class for any parameterized complexity class. To overcome this result, one can allow for approximation, which reduces the problem to a quadratic upper bound in data complexity.
There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit. One such query is the count query, where one might desire to compute the expected multiplicity ($\expct\pbox{\poly\inparen{\vct{X}}}$) of a result tuple $\tup$. Should we allow for approximation in this setting, this paper shows that we can \emph{guarantee} runtime of $\query(\pdb)$ to be linear in runtime of \abbrStepOne.
There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit. One such query is the count query, where one might desire to compute the expected multiplicity ($\expct\pbox{\poly\inparen{\vct{X}}}$) of a result tuple $\tup$.
%Should we allow for approximation in this setting, this paper shows that we can \emph{guarantee} runtime of $\query(\pdb)$ to be linear in runtime of \abbrStepOne.
%step one.
Is approximation necessary? The semantics of $\query(\pdb)$ in bag-\abbrPDB\xplural allow for output tuples to appear \emph{more} than once, which is naturally captured by a lineage polynomial with standard addition and multiplication polynomial operators. In this setting, linearity of expectation holds over the standard addition operator of the lineage polynomial, and given a sum of products (\abbrSOP) representation, the complexity of computing step two is linear in the size of the lineage polynomial. This result, coupled with the prevalence of the \abbrSOP representation among well-known \abbrPDB implementations, may partially explain why the bag-\abbrPDB query problem has long been thought to be easy.
%Is approximation necessary?
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system of $\poly(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}). Our work explores whether or not \abbrStepTwo in the computation model is \emph{always} in the same complexity class as deterministic query evaluation, when \abbrStepOne of $\query(\pdb)$ is easy. We examine the class of queries whose lineage computation in step one is lower bounded by the query runtime of step one. Consider again the bag-\abbrTIDB $\pdb$. When every tuple has probability $\prob_i = 1$, the problem of computing the expected count is linear in the size of the arithmetic circuit, and we have polytime complexity for computing $\query(\pdb)$. Is this the general case? This leads us to our problem statement:
\begin{figure}
\begin{align*}
\evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
\evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
\end{cases} &
\begin{aligned}
\evald{(\rel_1 \join \rel_2)}{\db}(\tup) =\\ ~
\end{aligned}&
\begin{aligned}
&\evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \\
&~~~\cdot\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup))
\end{aligned}\\
& & \evald{R}{\db}(\tup) =& \rel(\tup)
\end{align*}\\[-10mm]
\caption{Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
\label{fig:nxDBSemantics}
\end{figure}
The semantics of $\query(\pdb)$ in bag-\abbrPDB\xplural allow for output tuples to appear \emph{more} than once, which is naturally captured by a lineage polynomial with standard addition and multiplication polynomial operators. In this setting, linearity of expectation holds over the standard addition operator of the lineage polynomial, and given a standard monomial basis (\abbrSMB) representation of the lineage polynomial, the complexity of computing step two is linear in the size of the lineage polynomial. This holds because the addition and multiplication operators in \cref{fig:nxDBSemantics} are those of the $\semN$-semiring, so linearity of expectation applies over addition when computing the expected count. Moreover, since \abbrSMB has no factorization, the monomials with dependent multiplicative variables are known up front without any additional operations. Thus, the expected count can indeed be computed by the same order of operations as contained in $\poly$. This result, coupled with the prevalence among well-known \abbrPDB implementations of a sum of products\footnote{Sum of products differs from \abbrSMB in allowing any arbitrary monomial $m_i$ to appear in the polynomial more than once, whereas \abbrSMB requires all monomials $m_i,\ldots, m_j$ such that $m_i = \cdots = m_j$ to be combined into one monomial, so that each monomial appearing in \abbrSMB is unique. The complexity difference between the two representations is up to a constant factor.} representation, may partially explain why the bag-\abbrPDB query problem has long been thought to be easy.
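As a small illustration (the polynomial here is a hypothetical example and is not derived from a particular query), consider a bag-\abbrTIDB whose variables $X_i$ take binary values, so that $X_i^2 = X_i$, and are independent with $\expct\pbox{X_i} = \prob_i$, where $\prob_i$ denotes the probability of tuple $\tup_i$. For the \abbrSMB polynomial $\poly\inparen{\vct{X}} = X_1X_2 + 2X_1^2X_3$, linearity of expectation over addition and independence across distinct variables give
\begin{align*}
\expct\pbox{\poly\inparen{\vct{X}}} &= \expct\pbox{X_1X_2} + 2\,\expct\pbox{X_1^2X_3}\\
&= \expct\pbox{X_1}\expct\pbox{X_2} + 2\,\expct\pbox{X_1^2}\expct\pbox{X_3}\\
&= \prob_1\prob_2 + 2\,\prob_1\prob_3,
\end{align*}
where the last step uses $\expct\pbox{X_1^2} = \expct\pbox{X_1} = \prob_1$. Each monomial is handled once, which is consistent with a cost linear in the size of the polynomial.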
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system of $\poly(\vct{X})$, which are a natural fit to $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}).
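To see why factorized representations call for more care, consider the following hypothetical example under the same assumptions as a bag-\abbrTIDB with binary, independent variables $X_i$ and $\expct\pbox{X_i} = \prob_i$. Take the factorized polynomial $\poly\inparen{\vct{X}} = \inparen{X_1 + X_2}\inparen{X_1 + X_3}$. Expanding into monomials and using $X_1^2 = X_1$ yields
\begin{align*}
\expct\pbox{\poly\inparen{\vct{X}}} &= \expct\pbox{X_1^2 + X_1X_3 + X_1X_2 + X_2X_3}\\
&= \prob_1 + \prob_1\prob_3 + \prob_1\prob_2 + \prob_2\prob_3,
\end{align*}
whereas naively pushing the expectation through the product gives
\begin{align*}
\inparen{\prob_1 + \prob_2}\inparen{\prob_1 + \prob_3} = \prob_1^2 + \prob_1\prob_3 + \prob_1\prob_2 + \prob_2\prob_3.
\end{align*}
The two expressions differ by $\prob_1\inparen{1 - \prob_1}$, which vanishes only when $\prob_1 \in \{0, 1\}$. The two factors share the variable $X_1$ and are therefore not independent, so the expectation cannot in general be evaluated gate by gate on the factorized form.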
%Our work explores whether or not \abbrStepTwo in the computation model is \emph{always} in the same complexity class as deterministic query evaluation, when \abbrStepOne of $\query(\pdb)$ is easy.
We examine the class of queries whose lineage computation in step one is lower bounded by the query runtime of step one. Consider the bag-\abbrTIDB $\pdb$. Denote the probability of a tuple $\tup_i$ as $\prob_i$. When $\prob_i = 1$ for all $i$ in $[\numvar]$, the problem of computing the expected count is linear in the size of the arithmetic circuit, since we can rea, and we have polytime complexity for computing $\query(\pdb)$. Is this the general case? This leads us to our problem statement:
\begin{Problem}\label{prob:intro-stmt}
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, what is the complexity (in the size of the circuit representation) of computing step two ($\expct\pbox{\poly(\vct{X})}$) for each tuple $\tup$ in the output of $\query(\pdb)$?
\end{Problem}


@ -61,27 +61,6 @@ Each possible world is defined by an assignment of $\numvar$ binary values $\vct
The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{\wElem})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{\wElem}$.
$\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).
\begin{figure}
\begin{align*}
\evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
\evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
\end{cases} &
\begin{aligned}
\evald{(\rel_1 \join \rel_2)}{\db}(\tup) =\\ ~
\end{aligned}&
\begin{aligned}
&\evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \\
&~~~\cdot\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup))
\end{aligned}\\
& & \evald{R}{\db}(\tup) =& \rel(\tup)
\end{align*}\\[-10mm]
\caption{Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
\label{fig:nxDBSemantics}
\end{figure}
% For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
% We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. Below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.