Start on a new Intro.

Aaron Huber 2021-06-23 11:32:57 -04:00
parent 091a5d8e40
commit af6fa4b1e9


@@ -37,6 +37,18 @@
\3 A simple generalization exists
\end{outline}
\AH{Setting}
A probabilistic database (\abbrPDB) $\pdb$ is a pair $(\idb, \pd)$, where $\idb$ is the set of possible worlds $\db$ represented by $\pdb$ and $\pd$ is a probability distribution over $\idb$. Given a query $\query$, the output $\query(\pdb)$ is the pair $(\idb', \pd')$, where $\idb' = \{\query(\db_i) \suchthat i \in [\numvar]\}$, $\numvar$ is the number of possible worlds, and $\pd'$ is the induced probability distribution over $\idb'$. This computation can be modeled in two steps: the first step deterministically computes the query result together with the lineage of each result tuple, encoded in the chosen representation, and the second step computes probabilities from that lineage. Both set-\abbrPDB computation and semiring provenance follow this model, which is useful because it separates the deterministic computation from the probability computation.
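Concretely, under the standard possible-worlds semantics, $\pd'$ assigns to each output world the total probability of the input worlds that $\query$ maps to it:
\[
  \pd'(\db') \;=\; \sum_{\db \in \idb \,:\, \query(\db) = \db'} \pd(\db)
  \qquad \mbox{for each } \db' \in \idb'.
\]
As a small illustration, assume a hypothetical $\pdb$ with two possible worlds $\db_1$ and $\db_2$ such that $\pd(\db_1) = 0.3$ and $\pd(\db_2) = 0.7$. If $\query(\db_1) = \query(\db_2)$, then $\idb'$ contains a single world, whose probability under $\pd'$ is $0.3 + 0.7 = 1$.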
Much work already exists on \abbrPDB\xplural, most of which considers set-\abbrPDB\xplural, in which every possible world $\db$ is a \emph{set} of tuples. Computing $\query$ \emph{exactly} over a set-\abbrPDB is known to be \sharpphard in the general case. The dichotomy of Dalvi and Suciu shows that, for set-\abbrPDB\xplural, computing $\query(\pdb)$ is either possible in polynomial time or \sharpphard. This dichotomy is \emph{based} on the query structure and is in general independent of the representation of the lineage polynomial.\footnote{We note that there are specific cases in which a particular database instance, combined with an amenable representation, makes a hard $\query$ easy, but this is \emph{not} the general case.} The hardness results stem from step two of the computation model.
In the set-\abbrPDB setting, $\query(\pdb)$ is limited to computing the \emph{marginal} probability of each output tuple. When one instead wants a probability distribution over the possible multiplicities of a tuple $\tup$ in the output of $\query$, or statistical measures over the multiplicity of $\tup$, bag-\abbrPDB\xplural are a natural fit; they are well suited to posing questions such as count queries to the database.
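To make the contrast concrete, write $\query(\db)(\tup)$ for the multiplicity of $\tup$ in $\query(\db)$; this notation is an assumption introduced here for illustration, and in the set setting the value is always $0$ or $1$. The marginal probability in the set setting is then
\[
  \Pr[\tup \in \query(\db)] \;=\; \sum_{\db \in \idb \,:\, \query(\db)(\tup) \geq 1} \pd(\db),
\]
whereas the bag setting asks for the distribution over multiplicities, or for a statistic of it such as the expectation:
\[
  \Pr[\query(\db)(\tup) = c] \;=\; \sum_{\db \in \idb \,:\, \query(\db)(\tup) = c} \pd(\db),
  \qquad
  \mathrm{E}[\query(\db)(\tup)] \;=\; \sum_{\db \in \idb} \pd(\db) \cdot \query(\db)(\tup).
\]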
Things change in the bag-\abbrPDB setting. Bag-\abbrPDB\xplural have long been considered to be bottlenecked in step one only. This may be due in part to the prevalence of the sum-of-products (\abbrSOP) representation of the lineage polynomial in many of the best-known set-\abbrPDB implementations. With such a representation, step two in the bag-\abbrPDB setting is \emph{indeed} linear in the \emph{size} of the \abbrSOP representation, by linearity of expectation. The view that bag-\abbrPDB\xplural are linear in the size of the lineage polynomial representation may also stem from regarding bag-\abbrPDB\xplural as irrelevant or unnecessary. However, bag-\abbrPDB\xplural are an advantageous tool for certain niche computations such as count queries.
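The linearity claim can be sketched as follows, under assumptions that are not stated in the text above. Suppose the lineage polynomial of an output tuple is given in \abbrSOP form as $\Phi = \sum_{j=1}^{m} c_j \prod_{i \in S_j} X_i$, where the $X_i$ are independent $\{0,1\}$-valued random variables with $\Pr[X_i = 1] = p_i$ and each $S_j$ is a set of distinct variable indices. The symbols $\Phi$, $c_j$, $S_j$, $X_i$, and $p_i$ are notation introduced here for illustration only. Then
\[
  \mathrm{E}[\Phi]
  \;=\; \sum_{j=1}^{m} c_j \, \mathrm{E}\Bigl[\,\prod_{i \in S_j} X_i\Bigr]
  \;=\; \sum_{j=1}^{m} c_j \prod_{i \in S_j} p_i ,
\]
where the first equality is linearity of expectation and the second uses independence. The right-hand side is evaluated in a single pass over the monomials, which is linear in the size of the \abbrSOP representation.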
It is known that, in general, there exist queries whose evaluation is \emph{not} linear in the size of the data. Queries with multiple joins and queries that count cliques are specific examples.
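As a minimal illustration, assume a hypothetical relation $R$ with $n$ tuples (both $R$ and $n$ are introduced here only for this example). Joining $k$ copies of $R$ with no join condition, that is, taking the $k$-fold cross product, gives
\[
  \bigl| \underbrace{R \times \cdots \times R}_{k \mbox{ copies}} \bigr| \;=\; n^{k},
\]
so the result is polynomially, not linearly, larger than the input for every fixed $k \geq 2$.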