master
Boris Glavic 2021-04-10 08:40:43 -05:00
parent 58b70f0fcf
commit 31c57d5b8f
3 changed files with 19 additions and 7 deletions

View File

@ -6,9 +6,9 @@
In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
Such factorized representations are necessary to necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
Such factorized representations are necessary to realize the performance of modern join algorithms (e.g., worst-case optimal joins), and so our results imply that a Bag-PDB doing exact computations (via these factorized representations) can never be as fast as a classical (deterministic) database.
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of polynomial circuits in linear time in the size of the polynomial.
By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for PDBs that can be competitive with deterministic databases.
\end{abstract}

View File

@ -117,7 +117,7 @@ In this work we consider bag semantics, where each tuple is associated with a mu
\begin{Example}\label{ex:intro-tbls}
Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analog to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where processing of shipping is on time.
$$Q_1(\text{City}) \dlImp Loc(\text{City}), Route(\text{City}, \dlDontcare)$$
$$Q_1(\text{City}) \dlImp OnTime(\text{City}), Route(\text{City}, \dlDontcare)$$
\Cref{subfig:ex-shipping-simp-queries} shows the possible results of this query.
For example, there is a 90\% probability there is a single route starting in Buffalo that is on time, and the expected multiplicity of this result tuple is $0.9$.
@ -126,17 +126,17 @@ Since the Chicago location has a 50\% probability of being on schedule (we assum
\end{Example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula over random variables that encode the existence of input tuples. That is, it is an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).
For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which we will denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
Analogously to sets, evaluating the lineage for $t$ over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which we will denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Example}\label{ex:intro-lineage}
Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to solve the problem in linear time (in the size of $\linsett{\query}{\pdb}{\tup}$).
Associating a lineage variable with every input tuple as shown in \Cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \Cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to solve the problem in linear time (in the size of $\Phi$).
The expectation of the sum is the sum of the expectations of monomials.
The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
The expected multiplicity of Chicago is $1.0$.

View File

@ -55,6 +55,18 @@ sensitive=true
\input{macros}
% reference names
\crefname{example}{ex.}{ex.}
\Crefname{example}{Ex.}{Ex.}
\Crefname{figure}{Fig.}{Fig.}
\Crefname{section}{Sec.}{Sec.}
\Crefname{definition}{Def.}{Def.}
\Crefname{theorem}{Thm.}{Thm.}
\Crefname{lemma}{Lem.}{Lem.}
\crefname{equation}{eq.}{eq.}
\Crefname{equation}{Eq.}{Eq.}
%%%%%%%%%%%%%%%%%%%%