Minor typo/flow tweaks

master
Oliver Kennedy 2021-04-10 13:33:44 -04:00
parent c3eb8efb79
commit c2c5f3d936
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
4 changed files with 12 additions and 12 deletions

View File

@ -4,13 +4,13 @@
The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products (the standard operating procedure for Set-PDBs).
However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
Such factorized representations are necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of polynomial circuits in linear time in the size of the polynomial.
By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for PDBs that can be competitive with deterministic databases.
By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
\end{abstract}
%%% Local Variables:

View File

@ -4,7 +4,7 @@
\section{Introduction}
\label{sec:intro}
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
A well-studied problem in probabilistic databases is to, given a query $\query$ and a probabilistic database $\pdb$, compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
A well-studied problem in probabilistic databases is to take a query $\query$ and a probabilistic database $\pdb$, and compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs).
In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -115,7 +115,7 @@ In this work we consider bag semantics, where each tuple is associated with a mu
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-tbls}
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where shipment processing is on time.
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each input tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$, shown below returns starting points of shipping routes where shipment processing is on time.
$$Q_1(\text{City}) \dlImp OnTime(\text{City}), Route(\text{City}, \dlDontcare)$$
@ -126,17 +126,17 @@ Since the Chicago location has a 50\% probability of being on schedule (we assum
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions)
A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$~\cite{DBLP:conf/pods/GreenKT07} of positive Boolean expressions)
over random variables
($\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07})
($\vct{X}=(X_1,\dots,X_n)$)
that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).
For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which for this example we denote\footnote{
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$, which for this example we denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context\footnote{
In later sections, where we focus on a single lineage polynomial, we will simply refer to $\linsett{\query}{\pdb}{\tup}$ as $Q$.
} as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
}. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-lineage}
@ -153,7 +153,7 @@ However, this need not be true for compressed representations of polynomials, in
For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.
The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.
Given that there is a large body of work (on, e.g., deterministic bag-relational query processing) that can output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
an interesting question is whether computing expectations is still in linear time for such compressed representations.
an interesting question is whether computing expectations is still in linear time for their output.
If the answer is in the affirmative, and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.
Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.

View File

@ -26,7 +26,7 @@ We ignore the fields \vari{partial}, \vari{Lweight}, and \vari{Rweight} until \C
\begin{Example}
The circuit \circuit in \Cref{fig:circuit-express-tree} encodes the polynomial $XY + WZ$. Note that such an encoding lends itself naturally to having all gates with an outdegree of $1$. Note further that \circuit is indeed a tree with edges pointing towards the root.
The circuit \circuit in \Cref{fig:circuit-express-tree} encodes the polynomial $XY + WZ$. Note that circuit \circuit encodes a tree, with edges pointing towards the root.
\end{Example}
\begin{figure}[t]

View File

@ -57,7 +57,7 @@ We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBL
In an $\semNX$-database, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$.
Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{\abs{\vct{X}}}$.
The multiplicity of $t \in R$ in this possible world is obtained by evaluating the polynomial annotating it on $\vct{W}$ (i.e., $R(t)(\vct{W})$).
The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{W})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{W}$.
$\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).
@ -103,7 +103,7 @@ We focus on this problem from now on, assume an implicit result tuple, and so dr
\label{subsec:tidbs-and-bidbs}
In this paper, we focus on two popular forms of PDBs: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
%
A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same blocks are disjoint events.
A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.
In other words, each random variable corresponds to the event of a single tuple's presence.
%
A \emph{\ti} is a \bi where each block contains exactly one tuple.