clarifications, flow, grammar

Oliver Kennedy 2021-04-06 18:46:06 -04:00
parent f226af1dc3
commit a14f138633
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60

@ -3,7 +3,10 @@
\section{Introduction}
\label{sec:intro}
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds paired with a probability distribution $\pd$ over these worlds. A well-studied problem in probabilistic databases is given a query $\query$ and probabilistic database $\pdb$ to compute the \emph{marginal probability} of a tuple $\tup$, i.e., its probability to exist in the result of query $\query$ over $\pdb$. This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs) which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases from cases that are in \ptime for the class of union of conjunctive queries (UCQs). In this work, we consider bag semantics where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analog problem of computing the expectation of the multiplicity of a query result tuple $\tup$:
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is a set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
A well-studied problem in probabilistic databases is, given a query $\query$ and probabilistic database $\pdb$, to compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases from cases that are in \ptime for unions of conjunctive queries (UCQs).
In this work, we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$, and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{equation}\label{eq:intro-bag-expectation}
\expct_{\idb \sim \probDist}[\query(\idb)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db) \hspace{2cm}\text{\textbf{(Expected Result Multiplicity)}}
@ -112,29 +115,41 @@ A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic dat
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-tbls}
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analog to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise has multiplicity zero) and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where processing of shipping is on time.
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise has multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at each location is on time (with some probability). Query $\query_1$ shown below returns the starting points of shipping routes where processing of shipping is on time.
$$Q_1 := \pi_{\text{City}_1}(Loc \bowtie_{\text{City}_\ell = \text{City}_1} Route)$$
\Cref{subfig:ex-shipping-simp-queries} shows the possible results of this query. For example, there is a 90\% probability there is a single route starting in Buffalo that with 90\% probability is on time. Thus, the expected multiplicity of this result tuple is $0.9$. There are two shipping routes starting in Chicago and the Chicago location has a 50\% probability to be on time (we assume that either all shipping starting in a location is on time or all shipping from this location is delayed). Thus, the expected multiplicity of this result tuple is $0.5 + 0.5 = 1.0$.
\Cref{subfig:ex-shipping-simp-queries} shows the possible results of this query.
For example, with 90\% probability there is a single route starting in Buffalo that is on time, so the expected multiplicity of this result tuple is $0.9$.
There are two shipping routes starting in Chicago.
Since the Chicago location has a 50\% probability of being on schedule (we assume that delays are linked: either all shipping starting in a location is on time or all of it is delayed), the expected multiplicity of this result tuple is $0.5 + 0.5 = 1.0$.
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}$) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\mathbb{B}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula evaluates to true over the assignment for a world $\db$, then $\tup \in \query(\db)$. Thus, the marginal probability of tuple $\tup$ is equal to the probability that the lineage evaluates to true wrt. the probability distribution that associates each possible assignment from $\mathbb{B}^\numvar$ with the probability of the world it corresponds to.
A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}$) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\mathbb{B}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula evaluates to true over the assignment for a world $\db$, then $\tup \in \query(\db)$.
Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the trivial analog of $\probDist$ defined over $\vct{X}$).
For bag semantics, the lineage of a tuple is a polynomial over random variables from the set $\vct{X} \in \mathbb{N}^\numvar$ with
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$). Analog to the set case, evaluating the lineage over an assignment corresponding to a possible world (mapping variables to natural numbers representing input tuple multiplicities in this world) yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which we will denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
Analogously to the set case, evaluating the lineage over an assignment corresponding to a possible world (mapping variables to natural numbers representing input tuple multiplicities in this world) yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$, which we denote as $\linsett{\query}{\pdb}{\tup}$, or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-lineage}
Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing over all possible worlds the multiplicity of the result tuple in this world multiplied by its probability.
Note that since $\Phi$ is a sum of products (SOP), we can use linearity of expectation to solve the problem in linear time in the size of $\linsett{\query}{\pdb}{\tup}$: the expectation of the sum is the sum of the expectations of each monomial. The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing, over all possible worlds, the multiplicity of the result tuple in each world weighted by that world's probability.
In this example, $\Phi$ is a sum of products (SOP), so we can use linearity of expectation to solve the problem in time linear in the size of $\linsett{\query}{\pdb}{\tup}$.
The expectation of the sum is the sum of the expectations of each monomial.
The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
The expected multiplicity of Chicago is $1.0$.
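As a worked sketch of this computation, using only the probabilities stated in \Cref{ex:intro-tbls} (routes are certain, and the Chicago location is on time with probability $0.5$) together with tuple independence, linearity of expectation gives
\[
\expct_{\idb \sim \probDist}[\Phi] = \expct_{\idb \sim \probDist}[L_b \cdot R_b] + \expct_{\idb \sim \probDist}[L_b \cdot R_c] = 0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0 .
\]
Summing over possible worlds as in \cref{eq:intro-bag-expectation} yields the same value: $L_b$ is the only uncertain variable, so only two worlds have non-zero probability, in which Chicago has multiplicity $2$ and $0$ respectively, giving $2 \cdot 0.5 + 0 \cdot 0.5 = 1.0$.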
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
While the expected multiplicity of a query result can be computed in linear time in the size of the result's lineage if it is in SOP form, this may not be true for compressed representations of polynomials such as factorized polynomials and arithmetic circuits. For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}. The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication resulting in a smaller circuit. Given that there is a large body of work that can output such compressed representations \BG{cite FDBs and FAQ}, an interesting question is whether computing expectations is still in linear time for such compressed representations. We prove that this is not the case: computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a lineage circuit.
Of course, any complexity result for computing the expectation of polynomials only translates to the expected result multiplicity problem when we also take into account the complexity of constructing the lineage for a given tuple.
The expected multiplicity of a query result can be computed in linear time (in the size of the result's lineage) if the lineage is in SOP form.
However, this need not be true for compressed representations of polynomials such as factorized polynomials and arithmetic circuits.
For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.
The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.
Given that there is a large body of work that can output such compressed representations\BG{cite FDBs and FAQ}, an interesting question is whether computing expectations is still in linear time for such compressed representations.
If the answer is in the affirmative, and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.
Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a lineage circuit.
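The following sketch is meant only as intuition for where the difficulty arises, and is not the reduction used in the hardness proof. For a factorized form of the lineage from \Cref{ex:intro-lineage} such as $L_b \cdot (R_b + R_c)$, the two factors share no variables, so tuple independence lets the expectation be pushed through the product:
\[
\expct_{\idb \sim \probDist}[L_b \cdot (R_b + R_c)] = \expct_{\idb \sim \probDist}[L_b] \cdot \left(\expct_{\idb \sim \probDist}[R_b] + \expct_{\idb \sim \probDist}[R_c]\right) = 0.5 \cdot (1.0 + 1.0) = 1.0 .
\]
This shortcut is no longer valid once two factors share a variable. For a hypothetical tuple variable $X$ (not part of the running example) that has multiplicity $1$ with probability $p$ and $0$ otherwise, we have $X \cdot X = X$, so
\[
\expct_{\idb \sim \probDist}[X \cdot X] = p \neq p^2 = \expct_{\idb \sim \probDist}[X] \cdot \expct_{\idb \sim \probDist}[X] \quad \text{whenever } 0 < p < 1 .
\]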
Concretely, we make the following contributions:
(i) We show that the expected result multiplicity problem for conjunctive queries for bag-$\ti$ is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;