small
parent
58b70f0fcf
commit
31c57d5b8f
|
@ -6,9 +6,9 @@
|
|||
In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
|
||||
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
|
||||
However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
|
||||
Such factorized representations are necessary to necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
|
||||
Such factorized representations are necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
|
||||
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
|
||||
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
|
||||
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
|
||||
We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of polynomial circuits in linear time in the size of the polynomial.
|
||||
By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for PDBs that can be competitive with deterministic databases.
|
||||
\end{abstract}
|
||||
|
|
|
@ -117,7 +117,7 @@ In this work we consider bag semantics, where each tuple is associated with a mu
|
|||
\begin{Example}\label{ex:intro-tbls}
|
||||
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analog to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where processing of shipping is on time.
|
||||
|
||||
$$Q_1(\text{City}) \dlImp Loc(\text{City}), Route(\text{City}, \dlDontcare)$$
|
||||
$$Q_1(\text{City}) \dlImp OnTime(\text{City}), Route(\text{City}, \dlDontcare)$$
|
||||
|
||||
\Cref{subfig:ex-shipping-simp-queries} shows the possible results of this query.
|
||||
For example, there is a 90\% probability there is a single route starting in Buffalo that is on time, and the expected multiplicity of this result tuple is $0.9$.
|
||||
|
@ -126,17 +126,17 @@ Since the Chicago location has a 50\% probability of being on schedule (we assum
|
|||
\end{Example}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
|
||||
A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula over random variables that encode the existence of input tuples. That is, it is an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
|
||||
Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).
|
||||
|
||||
For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
|
||||
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
|
||||
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which we will denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
|
||||
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which we will denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Example}\label{ex:intro-lineage}
|
||||
Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
|
||||
In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to solve the problem in linear time (in the size of $\linsett{\query}{\pdb}{\tup}$).
|
||||
Associating a lineage variable with every input tuple as shown in \Cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \Cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
|
||||
In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to solve the problem in linear time (in the size of $\Phi$).
|
||||
The expectation of the sum is the sum of the expectations of monomials.
|
||||
The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
|
||||
The expected multiplicity of Chicago is $1.0$.
|
||||
|
|
12
main.tex
12
main.tex
|
@ -55,6 +55,18 @@ sensitive=true
|
|||
|
||||
\input{macros}
|
||||
|
||||
% reference names
|
||||
\crefname{example}{ex.}{ex.}
|
||||
\Crefname{example}{Ex.}{Ex.}
|
||||
\Crefname{figure}{Fig.}{Fig.}
|
||||
\Crefname{section}{Sec.}{Sec.}
|
||||
\Crefname{definition}{Def.}{Def.}
|
||||
\Crefname{theorem}{Thm.}{Thm.}
|
||||
\Crefname{lemma}{Lem.}{Lem.}
|
||||
\crefname{equation}{eq.}{eq.}
|
||||
\Crefname{equation}{Eq.}{Eq.}
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
|
||||
|
|
Loading…
Reference in New Issue