More work on rewriting the Intro.

Aaron Huber 2021-06-23 14:40:56 -04:00
parent af6fa4b1e9
commit 43766706b1


@ -39,16 +39,17 @@
\end{outline}
\AH{Setting}
A probabilistic database (\abbrPDB) $\pdb$ is a two-tuple ($\idb, \pd$) such that $\idb$ is the set of possible worlds $\db$ represented by $\pdb$, and $\pd$ is the associated probability distribution across each $\db$ in $\idb$. Given a query $\query$ the output of $\query(\pdb)$ is ($\idb', \pd'$) such that $\idb' = \{\query(\db_i) \suchthat i \in [\numvar]\}$ where $\numvar$ is the number of possible worlds, and $\pd'$ is the resulting probability distribution over $\idb'$. This computation process can be modeled in two steps, where the first step consists of the deterministic computation of the query and result tuple lineage(s) encoded in the respective representation, and the second step consists of the probability computation. This computational model is nicely followed by set-\abbrPDB computation and semiring provenance, and is useful in separating the deterministic computation from the probability computation.
A probabilistic database (\abbrPDB) $\pdb$ is a two-tuple ($\idb, \pd$) such that $\idb$ is the set of possible worlds $\db$ represented by $\pdb$, and $\pd$ is the associated probability distribution across each $\db$ in $\idb$. Given a query $\query$, the output of $\query(\pdb)$ is ($\idb', \pd'$) such that $\idb' = \{\query(\db_i) \suchthat i \in [\numvar]\}$ where $\numvar$ is the number of possible worlds, and $\pd'$ is the resulting probability distribution over $\idb'$. Computing $\query$ as outlined above can be modeled in two steps, where the first step consists of the deterministic computation of both the query output and result tuple lineage(s) encoded in the respective representation, and the second step consists of computing the probability distribution. This computational model is nicely followed by set-\abbrPDB computation and semiring provenance, and is useful in this work for the purpose of separating the deterministic computation from the probability computation.
Much work already exists regarding \abbrPDB\xplural, most of which considers $\pdb$ to be a set, meaning all possible worlds $\db$ are a \emph{set} of tuples. The problem of computing $\query$ \emph{exactly} over a set-\abbrPDB is known to be \sharpphard in the general case. The dichotomy shown by Dalvi and Suciu shows that for set-\abbrPDB\xplural it is the case that $\query(\pdb)$ is either really easy or really hard. This dichotomy is \emph{based} on the query structure and in general is independent of the representation of the lineage polynomial.\footnote{We do note that there exist specific cases when given a specific database instance combined with an amenable representation, that a hard $\query$ can become easy, but this is \emph{not} the general case.} The hardness results depend on step two of the computation model.
Much work already exists regarding \abbrPDB\xplural, most of which considers $\pdb$ to be a set, meaning all possible worlds $\db$ are a \emph{set} of tuples. The problem of computing $\query$ \emph{exactly} over a set-\abbrPDB is known to be \sharpphard in the general case. The dichotomy of Dalvi and Suciu shows that for set-\abbrPDB\xplural, computing $\query(\pdb)$ is either polynomial-time or \sharpphard. Further, this dichotomy is \emph{based} on the query structure and in general is independent of the representation of the lineage polynomial.\footnote{We do note that there exist specific cases when given a specific database instance combined with an amenable representation, that a hard $\query$ can become easy, but this is \emph{not} the general case.} The hardness results for set-\abbrPDB\xplural depend on step two of the computation model.
For a set-\abbrPDB, $\query(\pdb)$ is limited to computing the \emph{marginal} probability for a member tuple. When it is desirable to compute either a probability distribution over the set of possible multiplicities a tuple may have in the output of $\query$ or to compute certain statistical measures over the multiplicity of $\tup$, bag-\abbrPDB\xplural are a natural fit, proving very useful for posing questions such as count queries to the database.
A tuple independent database (\abbrTIDB) is a \abbrPDB whose tuples are treated as independent random events. Given a set-\abbrTIDB $\pdb$, $\query(\pdb)$ is essentially limited to computing the \emph{marginal} probability for a member tuple $\tup$. When it is desirable to compute either a probability distribution over the set of possible multiplicities of $\tup$ or to compute certain statistical measures over the multiplicity of $\tup$, bag-\abbrPDB\xplural are a natural fit, proving very useful for posing questions such as count queries to the database. While other statistical measures can be computed, we focus primarily on computing the expected multiplicity of $\tup$, a natural interpretation of step two in the bag setting.\footnote{We consider this natural since it is true that computing the marginal probability of $\tup$ in set-\abbrPDB\xplural is essentially computing $\tup$'s expectation.} It is also compelling to consider the expected multiplicity since bag-\abbrPDB\xplural are not well studied from a theoretical perspective, and the expected count is both natural and simple to consider as a first building block. We consider higher moments in the appendix.\AH{Pointer here.}
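The footnote's claim can be spelled out in one line. The indicator notation $\mathbf{1}[\cdot]$ and the expectation operator $\mathrm{E}$ below are illustrative notation introduced here, not notation defined earlier in this section. Under set semantics the multiplicity of $\tup$ in $\query(\db)$ is either $0$ or $1$, so for a world $\db$ drawn according to $\pd$,
\[
  \mathrm{E}\big[\mathbf{1}[\tup \in \query(\db)]\big]
  = 0 \cdot \Pr[\tup \notin \query(\db)] + 1 \cdot \Pr[\tup \in \query(\db)]
  = \Pr[\tup \in \query(\db)].
\]
In this sense the expected multiplicity in the bag setting plays the role that the marginal probability plays in the set setting.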
Traditionally, bag-\abbrPDB\xplural have been considered to be bottlenecked in step one only, or linear in the size of the query. This may partially be due to the prevalence of the sum of products (\abbrSOP) representation of the lineage polynomial among many of the most well-known implementations of set-\abbrPDB\xplural. Such a representation used in the bag-\abbrPDB setting \emph{indeed} allows for step two to be linear in the \emph{size} of the \abbrSOP representation, a result due to linearity of expectation.
When we consider the bag-\abbrPDB setting, things change. Bag-\abbrPDB\xplural have long been considered to be bottlenecked in step one only. This may partially be due to the prevalence of the sum of products (\abbrSOP) representation of the lineage polynomial among many of the most well-known implementations of set-\abbrPDB\xplural. With such a representation, step two in the bag-\abbrPDB setting is \emph{indeed} linear in the \emph{size} of the \abbrSOP representation due to linearity of expectation. Considering bag-\abbrPDB\xplural to be linear in the size of the lineage polynomial representation may also be a result of viewing bag-\abbrPDB\xplural as irrelevant or unnecessary. However, bag-\abbrPDB\xplural are an advantageous tool for certain niche computations such as count queries.
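The linearity argument can be sketched as follows, under assumed notation: the variables $X_i$, probabilities $p_i$, index sets $S_j$, coefficients $c_j$, and polynomial $\Phi$ are illustrative and are not defined earlier in this section. Suppose each input tuple of a \abbrTIDB is annotated with an independent $\{0,1\}$-valued random variable $X_i$ with $\Pr[X_i = 1] = p_i$, and the lineage polynomial of $\tup$ is given in \abbrSOP form with $m$ monomials, where each $S_j$ is a set of variable indices (exponents can be dropped, since $X_i^k = X_i$ for $0/1$ values and $k \geq 1$). Then
\[
  \mathrm{E}[\Phi]
  = \mathrm{E}\Big[\sum_{j=1}^{m} c_j \prod_{i \in S_j} X_i\Big]
  = \sum_{j=1}^{m} c_j \, \mathrm{E}\Big[\prod_{i \in S_j} X_i\Big]
  = \sum_{j=1}^{m} c_j \prod_{i \in S_j} p_i ,
\]
where the second equality is linearity of expectation and the third uses independence of the $X_i$. Evaluating the right-hand side takes a single pass over the monomials, which is the sense in which step two is linear in the size of the \abbrSOP representation.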
However, it is not necessarily satisfying to stop here. Since typical implementations of \abbrPDB\xplural compute the representation of the lineage polynomial in sync with the particular choice of query plan, it is important that optimizations are allowed if we want to have a true comparison between step one and step two in bag-\abbrPDB queries. Optimizations like projection push-down produce factorized or non-\abbrSOP representations of the lineage polynomial. Our work explores whether or not step two in the computation model is \emph{always} linear in the \emph{size} of the representation of the lineage polynomial when step one of $\query(\pdb)$ is easy.\footnote{It is known that, in general, there exist queries that are \emph{not} linear in the size of the data. Queries such as multiple joins and counting cliques are specific examples of this. We are considering cases where the query is linear in the size of the data.}
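A small example, constructed here for illustration and not taken from the source, shows why the question is not immediate for factorized representations. Assume independent $\{0,1\}$-valued variables $X_1, X_2, X_3$ with $\Pr[X_i = 1] = p_i$ (hypothetical notation), and consider the factorized polynomial $\Phi = (X_1 + X_2)(X_1 + X_3)$. Expanding and using $X_1^2 = X_1$ gives
\[
  \mathrm{E}[\Phi]
  = \mathrm{E}[X_1] + \mathrm{E}[X_1 X_3] + \mathrm{E}[X_1 X_2] + \mathrm{E}[X_2 X_3]
  = p_1 + p_1 p_3 + p_1 p_2 + p_2 p_3 ,
\]
whereas naively pushing the expectation through the product yields
\[
  (p_1 + p_2)(p_1 + p_3) = p_1^2 + p_1 p_3 + p_1 p_2 + p_2 p_3 .
\]
The two differ whenever $0 < p_1 < 1$, because $X_1$ occurs in both factors and the factors are therefore not independent. Expanding to \abbrSOP form restores correctness, but the expansion can be much larger than the factorized representation.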
It is known that, in general, there exist queries that are \emph{not} linear in the size of the data. Queries such as multiple joins and counting cliques are specific examples of this.
Our work focuses on the following setting for query computation. Inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This, however, is not limiting as a simple generalization exists, which involves assigning a unique id to each tuple of bag-\abbrPDB inputs.