Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds
commit
6c76601f6d
|
@ -104,7 +104,7 @@ Given a $\semNX$-PDB $\pxdb$ and query plan $Q$, the runtime of $Q$ over $\bagdb
|
|||
We now have all the pieces to argue that, using our approximation algorithm, the expected multiplicities of an SPJU query can be computed in essentially the same runtime as deterministic query processing for the same query:
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Corollary}
|
||||
Given an SPJU query $Q$ over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple with probability at least $1-\delta$ in time
|
||||
Given an SPJU query $Q$ over a \ti $\pxdb$ and let $\db_{max}$ denote the world containing all tuples of $\pxdb$, we can compute a $(1\pm\eps)$-approximation of the expectation for each output tuple in $\query(\pxdb)$ with probability at least $1-\delta$ in time
|
||||
%
|
||||
\[
|
||||
O_k\left(\frac 1{\eps^2}\cdot\qruntime{Q,\db_{max}}\cdot \log{\frac{1}{\conf}}\cdot \log(n)\right)
|
||||
|
|
17
intro.tex
|
@ -206,15 +206,14 @@ To see why computing this probability is hard, observe that the clauses of the d
|
|||
Conversely, in Bag-PDBs, correlations between clauses of the SOP polynomial are not problematic thanks to linearity of expectation.
|
||||
The expectation computation over the output lineage is simply the sum of expectations of each clause.
|
||||
For \Cref{ex:intro}, the expectation is simply
|
||||
{\small
|
||||
\begin{align*}
|
||||
\expct\pbox{\poly(W_a, W_b, W_c)} &= \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
|
||||
\intertext{\normalsize
|
||||
\begin{equation*}
|
||||
\expct\pbox{\poly_{bag}(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}
|
||||
\end{equation*}
|
||||
In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
|
||||
}
|
||||
&= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
|
||||
\end{align*}
|
||||
}
|
||||
\begin{equation*}
|
||||
= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
|
||||
\end{equation*}
|
||||
|
||||
Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
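As a hedged illustration (the values below are assumed for exposition only and do not come from \Cref{ex:intro}), suppose $\expct\pbox{W_a} = \expct\pbox{W_b} = \expct\pbox{W_c} = \frac{1}{2}$. Then
\begin{equation*}
\expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a} = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = \frac{3}{4},
\end{equation*}
which takes three multiplications and two additions, matching the three product clauses and the two sums of the SOP.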
|
||||
As a further interesting feature of this example, note that $\expct\pbox{W_i} = \probOf[W_i = 1]$, and so taking the same polynomial over the reals:
|
||||
\begin{multline}
|
||||
|
@ -307,7 +306,7 @@ With $\poly^2$ as an example, we have:
|
|||
Note that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(\probOf\pbox{W_a=1}, \probOf\pbox{W_b=1}, \probOf\pbox{W_c=1})$).
|
||||
Also note that the $\poly$ in~\Cref{ex:bag-vs-set} is already in reduced form.
|
||||
|
||||
The reduced form of a polynomial can be obtained in a linear scan over the clauses of a SOP encoding of the polynomial.
|
||||
The reduced form of a polynomial can be obtained in a linear scan over the clauses of an SOP encoding of the polynomial.
|
||||
In prior work on lineage-based Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,yang:2015:pvldb:lenses} where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
|
||||
In general, however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products: the query $\poly^k$ obtained by taking the Cartesian product of $k$ copies of $\poly$ has a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
|
||||
This leads us to the \textbf{central question of this paper}:
|
||||
|
|
|
@ -22,7 +22,7 @@ A monomial is a product of variable terms, each raised to a non-negative integer
|
|||
\[
|
||||
\sum_{i=1}^n c_i \cdot m_i
|
||||
\]
|
||||
where each $c_i$ is a positive integer and each $m_i$ is a monomial and $m_i \neq m_j$ for $i \neq j$. The \abbrSMB of a polynomial $\poly$ is $\smbOf{\poly}$.
|
||||
where each $c_i$ is an integer and each $m_i$ is a monomial and $m_i \neq m_j$ for $i \neq j$. The \abbrSMB of a polynomial $\poly$ is $\smbOf{\poly}$.
|
||||
% fully expanded out such that no product of sums exist and where each unique monomial appears exactly once.
|
||||
\end{Definition}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -94,7 +94,9 @@ Given the set of BIDB variables $\inset{X_{b,i}}$, define
|
|||
\end{Definition}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%
|
||||
Intuitively, in the reduced form, all exponents $e > 1$ are reduced to $e = 1$ by $\text{mod } \mathcal T$, and all monomials with multile variables from the same block $\block$ are dropped by $\text{mod } \mathcal B$ (i.e., any world containing more than one tuple from a block has $0$ probability and can be ignored).
|
||||
|
||||
Intuitively, in the reduced form, all exponents $e > 1$ are reduced to $e = 1$ by $\text{mod } \mathcal T$, and all monomials with multiple variables from the same block $\block$ are dropped by $\text{mod } \mathcal B$ (i.e., any world containing more than one tuple from a block has $0$ probability and can be ignored).
|
||||
|
||||
For the special case of \tis, the second step is not necessary since every block contains a single tuple.
|
||||
%Alternatively, one can think of $\rpoly$ as the \abbrSMB of $\poly(\vct{X})$ when the product operator is idempotent.
|
||||
%
|
||||
|
@ -126,7 +128,7 @@ Consider $\poly(X, Y) = (X + Y)(X + Y)$ where $X$ and $Y$ are from different blo
|
|||
%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{Definition}[Valid Worlds]
|
||||
For probability distribution $\probDist$ and its corresponding probability mass function $\probOf$, the set of valid worlds $\eta$ is the worlds with probability value greater than $0$; i.e., for variable vector $\vct{W}$
|
||||
For probability distribution $\probDist$ and its corresponding probability mass function $\probOf$, the set of valid worlds $\eta$ consists of all the worlds with probability value greater than $0$; i.e., for variable vector $\vct{W}$
|
||||
\[
|
||||
\eta = \{\vct{w}\st \probOf[\vct{W} = \vct{w}] > 0\}
|
||||
\]
|
||||
|
|
|
@ -13,7 +13,7 @@ Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\p
|
|||
For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
|
||||
\[\forall \db \in \query(\idb): \probOf'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \probOf(\db') \]
|
||||
|
||||
Note that in this work we consider multisets, i.e., each possible world is a set of multiset relations and queries are evaluated using bag semantics. We will use K-relations to model multisets. A \emph{K-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
|
||||
Note that in this work we consider multisets, i.e., each possible world is a set of multiset relations and queries are evaluated using bag semantics. We will use $\domK$-relations to model multisets. A \emph{$\domK$-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
|
||||
Let $\udom$ be a countable domain of values.
|
||||
Formally, an $n$-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
|
||||
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ (resp., $\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ (resp., $\semK$-database $\db$) to $t$.
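As a minimal illustration (a standard instance, given here under the assumption that it matches the intended instantiation for multisets), take the semiring of natural numbers $(\mathbb{N}, +, \times, 0, 1)$. An annotation $\rel(t) = 2$ says that $t$ occurs twice in $\rel$, and $\rel(t) = 0$ says that $t$ is absent. The values in the following sketch are illustrative only: if two tuples with annotations $2$ and $3$ join, the resulting tuple carries annotation
\[
2 \times 3 = 6,
\]
whereas if a union or projection maps two inputs with annotations $2$ and $3$ to the same output tuple, that tuple carries annotation
\[
2 + 3 = 5.
\]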
|
||||
|
|