Minor typo/flow tweaks
parent
c3eb8efb79
commit
c2c5f3d936
|
@ -4,13 +4,13 @@
|
|||
The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
|
||||
The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
|
||||
In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
|
||||
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
|
||||
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products (the standard operating procedure for Set-PDBs).
|
||||
However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
|
||||
Such factorized representations are necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
|
||||
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
|
||||
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
|
||||
We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of polynomial circuits in linear time in the size of the polynomial.
|
||||
By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for PDBs that can be competitive with deterministic databases.
|
||||
By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
|
||||
\end{abstract}
|
||||
|
||||
%%% Local Variables:
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
\section{Introduction}
|
||||
\label{sec:intro}
|
||||
A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
|
||||
A well-studied problem in probabilistic databases is to, given a query $\query$ and a probabilistic database $\pdb$, compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
|
||||
A well-studied problem in probabilistic databases is to take a query $\query$ and a probabilistic database $\pdb$, and compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
|
||||
This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs).
|
||||
In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -115,7 +115,7 @@ In this work we consider bag semantics, where each tuple is associated with a mu
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Example}\label{ex:intro-tbls}
|
||||
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where shipment processing is on time.
|
||||
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each input tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$, shown below returns starting points of shipping routes where shipment processing is on time.
|
||||
|
||||
$$Q_1(\text{City}) \dlImp OnTime(\text{City}), Route(\text{City}, \dlDontcare)$$
|
||||
|
||||
|
@ -126,17 +126,17 @@ Since the Chicago location has a 50\% probability of being on schedule (we assum
|
|||
\end{Example}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions)
|
||||
A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$~\cite{DBLP:conf/pods/GreenKT07} of positive Boolean expressions)
|
||||
over random variables
|
||||
($\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07})
|
||||
($\vct{X}=(X_1,\dots,X_n)$)
|
||||
that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
|
||||
Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).
|
||||
|
||||
For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
|
||||
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
|
||||
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which for this example we denote\footnote{
|
||||
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$, which for this example we denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context\footnote{
|
||||
In later sections, where we focus on a single lineage polynomial, we will simply refer to $\linsett{\query}{\pdb}{\tup}$ as $Q$.
|
||||
} as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
|
||||
}. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Example}\label{ex:intro-lineage}
|
||||
|
@ -153,7 +153,7 @@ However, this need not be true for compressed representations of polynomials, in
|
|||
For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.
|
||||
The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.
|
||||
Given that there is a large body of work (on, e.g., deterministic bag-relational query processing) that can output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
|
||||
an interesting question is whether computing expectations is still in linear time for such compressed representations.
|
||||
an interesting question is whether computing expectations is still in linear time for their output.
|
||||
If the answer is in the affirmative, and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.
|
||||
Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.
|
||||
|
||||
|
|
|
@ -26,7 +26,7 @@ We ignore the fields \vari{partial}, \vari{Lweight}, and \vari{Rweight} until \C
|
|||
|
||||
|
||||
\begin{Example}
|
||||
The circuit \circuit in \Cref{fig:circuit-express-tree} encodes the polynomial $XY + WZ$. Note that such an encoding lends itself naturally to having all gates with an outdegree of $1$. Note further that \circuit is indeed a tree with edges pointing towards the root.
|
||||
The circuit \circuit in \Cref{fig:circuit-express-tree} encodes the polynomial $XY + WZ$. Note that circuit \circuit encodes a tree, with edges pointing towards the root.
|
||||
\end{Example}
|
||||
|
||||
\begin{figure}[t]
|
||||
|
|
|
@ -57,7 +57,7 @@ We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBL
|
|||
In an $\semNX$-database, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
|
||||
We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$.
|
||||
Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{\abs{\vct{X}}}$.
|
||||
The multiplicity of $t \in R$ in this possible world is obtained by evaluating the polynomial annotating it on $\vct{W}$ (i.e., $R(t)(\vct{W})$).
|
||||
The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{W})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{W}$.
|
||||
$\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).
|
||||
|
||||
|
||||
|
@ -103,7 +103,7 @@ We focus on this problem from now on, assume an implicit result tuple, and so dr
|
|||
\label{subsec:tidbs-and-bidbs}
|
||||
In this paper, we focus on two popular forms of PDBs: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
|
||||
%
|
||||
A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same blocks are disjoint events.
|
||||
A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.
|
||||
In other words, each random variable corresponds to the event of a single tuple's presence.
|
||||
%
|
||||
A \emph{\ti} is a \bi where each block contains exactly one tuple.
|
||||
|
|
Loading…
Reference in New Issue