More changes to Introduction.

master
Aaron Huber 2021-08-04 12:24:36 -04:00
parent c1cefb703a
commit 02ab8a49ef
2 changed files with 16 additions and 19 deletions

@ -1,20 +1,15 @@
%root: main.tex
\section{Introduction (Rewrite - 070921)}
\input{two-step-model}
A probabilistic database (\abbrPDB) $\pdb$ is a probability distribution $\pd$ over a set of $\numvar$ tuples in a deterministic database $\db$. A tuple independent probabilistic database (\abbrTIDB) $\pdb$ further restricts $\pd$ to treating each tuple in $\db$ as an independent Bernoulli distributed random variable corresponding to the tuple's presence, where, in bag query semantics, $\query\inparen{\pdb}\inparen{\tup}$ is the random (polynomial) variable corresponding to the multiplicity of the output tuple $\tup$. Given a query $\query$ the set of positive relational algebra queries ($\raPlus$)\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the SPJU and renaming operators}, the goal is to compute the expected multiplicity\footnote{In set semantic \abbrPDB\xplural, computing $\expct\pbox{\query\inparen{\pdb}\inparen{t}}$ corresponds to computing the marginal probability.} ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$. There exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{X}}}$\footnote{In this work we focus on one output tuple $\tup$, and hence refer to $\poly_\tup$ as $\poly$}, where $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$, and the expectation is any Bernoulli distribution over $\{0, 1\}^\numvar$, using $\semN$ semiring semantics.. The set of variables $X_i$ in $\vct{X}$ with nonzero assignments and nonzero coefficients in $\poly\inparen{\vct{X}}$ represent all contributing input tuples to $\tup$'s presence in the output. In bag semantics, the evaluation of $\poly\inparen{\vct{X}}$ is over standard addition and multiplication operators and computes $\tup$'s multiplicity.
\currentWork{
Bounding the runtime of $\query\inparen{\pdb}$ is central to developing theoretical foundations for any database model. For a general algorithm $\mathcal{A}$ with input size $\numvar$, we informally define some useful bounds.
\begin{Definition}[\sharpphard]\AH{This IS very informal, I know, but at this point I don't want to just copy the formal definition without developing a keen intuitive grasp that understands the concept.}
$\mathcal{A}$ is in \sharpphard if computing a solution is an element of $NP$ (i.e., there exist inputs to $\mathcal{A}$ such that the runtime is superpolynomial in $\numvar$), and there may exist multiple solutions to a given problem.
\end{Definition}
Here, superpolynomial defines runtimes that are $\omega\inparen{polytime}$. Examples of such include $\bigO{c^\numvar}$ for a constant $c$ or $\bigO{\numvar!}$.
\begin{Definition}[\sharpwonehard]
$\mathcal{A}$ is in the \sharpwonehard class if its runtime can be lower bounded by some function $f$ of a parameter $k$ such that the growth in runtime is dependent on $k$. Formally, $\mathcal{A}$ is an element of \sharpwonehard if it is in the form $f(k)\cdot \numvar^c$ for constant $c$.
\end{Definition}
}
A probabilistic database (\abbrPDB) $\pdb$ is a probability distribution $\pd$ over a set of $\numvar$ tuples in a deterministic database $\db$. A tuple-independent probabilistic database (\abbrTIDB) $\pdb$ further restricts $\pd$ to treating each tuple in $\db$ as an independent Bernoulli distributed random variable corresponding to the tuple's presence, where, in bag query semantics, $\query\inparen{\pdb}\inparen{\tup}$ is the random (polynomial) variable corresponding to the multiplicity of the output tuple $\tup$. Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$)\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the SPJU and renaming operators.}, the goal is to compute the expected multiplicity\footnote{In set semantic \abbrPDB\xplural, computing $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$ corresponds to computing the marginal probability.} ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$. There exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{X}}}$\footnote{In this work we focus on one output tuple $\tup$, and hence refer to $\poly_\tup$ as $\poly$.}, where $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$, and the expectation is taken over any Bernoulli distribution over $\{0, 1\}^\numvar$, using $\semN$ semiring semantics. The set of variables $X_i$ in $\vct{X}$ with nonzero assignments and nonzero coefficients in $\poly\inparen{\vct{X}}$ represents all contributing input tuples to the presence of the output. In bag semantics, the evaluation of $\poly\inparen{\vct{X}}$ follows $\semN$-semiring semantics in computing the multiplicity.
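% Illustrative sketch only: the relations $R$, $S$, the variables $X_1, X_2, X_3$, and the marginal probabilities $p_i$ are hypothetical notation, not taken from the rest of this work.
As a small illustration, suppose two tuples of a relation $R$, annotated with $X_1$ and $X_2$, each join with a single tuple of a relation $S$, annotated with $X_3$, and both join results project onto the same output tuple $\tup$. Under $\semN$-semiring semantics, a join multiplies annotations and a projection adds them, so
\[
\poly\inparen{\vct{X}} = X_1X_3 + X_2X_3 .
\]
In a \abbrTIDB where each $X_i$ is $1$ with probability $p_i$ and $0$ otherwise, linearity of expectation and independence give
\[
\expct\pbox{\poly\inparen{\vct{X}}} = p_1p_3 + p_2p_3 .
\]
If instead a self-join produced $\poly\inparen{\vct{X}} = X_1^2$, then $X_1 \in \{0, 1\}$ implies $X_1^2 = X_1$, so the expectation would be $p_1$ rather than $p_1^2$.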
Before describing complexity results, we informally express the notion of complexity classes most commonly used in this work, given a general algorithm $\mathcal{A}$ with input size $\numvar$.
The problem solved by $\mathcal{A}$ is an element of the \sharpp complexity class if deciding whether a solution exists is an element of $NP$ and, since there may exist multiple solutions to a given input, $\mathcal{A}$ must count them.
Algorithm $\mathcal{A}$ is in the \sharpwone class if its runtime can be lower bounded in terms of some function $f$ of a parameter $k$ such that the exponent of the runtime grows with $f(k)$. Specifically, $\mathcal{A}$ is an element of \sharpwone if its lower bound on runtime is of the form $n^{f(k)}$.
The special case of deterministic query evaluation
%simply computing $\query$ over a deterministic database
is itself known to be \sharpwonehard in data complexity for general \query. An algorithm, such as a counting cliques query processing algorithm, is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
is itself known to be \sharpwonehard in data complexity for general $\query$. An algorithm, such as a counting cliques query processing algorithm, is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
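% Illustrative sketch only: the graph $G$ and the brute-force procedure are assumptions for intuition, not the algorithm referred to above. Assumes amsmath for \binom.
For intuition about runtimes of this form, consider counting $k$-cliques in a graph $G$ with $n$ vertices by checking every $k$-subset of the vertices. This brute-force procedure examines
\[
\binom{n}{k} \leq n^{k}
\]
candidate subsets, each in time polynomial in $k$, which is a runtime of the form $n^{f(k)}$ with $f(k) = k$, up to a factor polynomial in $k$.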
%hardness is seen in such queries as counting $k$-cliques and $k$-way joins, where the superlinear runtime is parameterized in $k$.
This result is unsatisfying when considering complexity of evaluating $\query$ over \abbrPDB\xplural, since it does not account for computing $\expct\pbox{\poly\inparen{\vct{X}}}$, entirely ignoring the `P' in \abbrPDB.
%of intensional evaluation (computing $\expct\pbox{\poly\inparen{\vct{X}}}$).
@ -22,11 +17,11 @@ This result is unsatisfying when considering complexity of evaluating $\query$ o
A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly\inparen{\vct{X}}}$ separately from the complexity of deterministic query evaluation. Viewing \abbrPDB query evaluation as these two separate steps is essentially what is known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}. \Cref{fig:two-step} illustrates the intensional evaluation computation model.
%one way to do this.
%The model of computation in \cref{fig:two-step} views \abbrPDB query processing as two steps.
As depicted, the first step we will refer to as \termStepOne (\abbrStepOne). \abbrStepOne consists of computing $\query$ over a $\abbrPDB$, which is essentially the deterministic computation of both the query output and $\poly(\vct{X})$\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upperbounded by the runtime of deterministic query evaluation of $\tup$, as we show in [placeholder].}. The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly(\vct{X})}$. Such a model of computation is nicely followed by intensional evaluation in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly\inparen{\vct{X}}$ must be computed separately for exact output when $\query(\pdb)$ is hard since extensional evaluation will only approximate in such a case.
As depicted, we refer to the first step as \termStepOne (\abbrStepOne). \abbrStepOne consists of computing $\query$ over a \abbrPDB, which is essentially the deterministic computation of both the query output and $\poly(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, the runtime of computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$, as we show in [placeholder].} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly(\vct{X})}$. Such a model of computation is nicely followed by intensional evaluation in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly\inparen{\vct{X}}$ must be computed separately for exact output when $\query(\pdb)$ is hard, since extensional evaluation will only approximate in such a case.
%(where e.g. intensional evaluation is itself a separate computational step; further, computing $\expct\pbox{\poly\inparen{\vct{X}}}$ in extensional evaluation occurs as a separate step of each operator in the query tree, and therefore implies that both concerns can be separated)
\Cref{fig:two-step} is also nicely patterned by semiring provenance \cite{DBLP:conf/pods/GreenKT07}, where the general $\semNX$-DB first computes the annotation via the query, and the $\semNX$-polynomial is subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, in this work, the model lends itself nicely in separating the concerns of deterministic computation and the probability computation. Observing this model prompts the question of whether or not bag \abbrStepTwo is \sharpwonehard in query complexity.
\Cref{fig:two-step} is also nicely patterned by semiring provenance \cite{DBLP:conf/pods/GreenKT07}, where the general $\semNX$-DB first computes the annotation via the query, and the $\semNX$-polynomial is subsequently evaluated over a semantically appropriate semiring, e.g., $\semN$ to model bag semantics. Further, in this work, the model lends itself nicely to separating the concerns of deterministic computation and the probability computation. Observing this model prompts the informal problem statement: given $\query\inparen{\pdb}$, is it the case that \abbrStepTwo is always $\bigO{\abbrStepOne}$?
%always $\bigO{\abbrStepOne}$.
If not, then query evaluation over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation (up to a constant factor).
If so, then query evaluation over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation (up to a constant factor).
The problem of computing $\query(\pdb)$ has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where the lineage polynomial follows $\semB$-semiring semantics.
%is a propositional formula.\footnote{For the case when $\query$ is in the class of $\raPlus$ and $\pdb$ is a \abbrTIDB, a bag \abbrPDB lineage polynomial is over a natural number semiring and a set \abbrPDB lineage polynomial is over the boolean semiring.}

@ -57,10 +57,10 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Perhaps PDB abbreviations should go here?
%Two-step (intensional evaluation model)
\newcommand{\termStepOne}{Deterministic Query Evaluation\xspace}
\newcommand{\abbrStepOne}{DQE\xspace}
\newcommand{\termStepTwo}{Expected Probability Computation\xspace}
\newcommand{\abbrStepTwo}{EPC\xspace}
\newcommand{\termStepOne}{Lineage Computation\xspace}
\newcommand{\abbrStepOne}{LC\xspace}
\newcommand{\termStepTwo}{Expectation Computation\xspace}
\newcommand{\abbrStepTwo}{EC\xspace}
%
\newcommand{\expectProblem}{\textsc{Expected Result Multiplicity Problem}\xspace}
\newcommand{\termSMB}{standard monomial basis\xspace}
@ -276,7 +276,9 @@
% COMPLEXITY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\bigO}[1]{O\inparen{#1}}
\newcommand{\sharpp}{\#P\xspace}
\newcommand{\sharpphard}{\#P-hard\xspace}
\newcommand{\sharpwone}{\#W[1]\xspace}
\newcommand{\sharpwonehard}{\#W[1]-hard\xspace}
\newcommand{\ptime}{PTIME\xspace}