More changes @atri021122.

master
Aaron Huber 2022-02-15 13:11:33 -05:00
parent 360920a8ec
commit ff7d0c3630
2 changed files with 26 additions and 20 deletions


@ -40,15 +40,19 @@ Let $\abs{\poly}$ be the number of operators in $\poly$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Corollary}\label{cor:expct-sop}
If $\poly$ is a \bi-lineage polynomial already in \abbrSMB, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\abs{\poly}}$ time.
If $\poly$ is a $1$-\abbrBIDB lineage polynomial already in \abbrSMB, then the expectation of $\poly$, i.e., $\expct\pbox{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $\bigO{\abs{\poly}}$ time.
\end{Corollary}
\secrev{
Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\Omega$ is the set of query answers produced by evaluating $\query$ over each possible world $\omega\in\Omega$: $\inset{\query\inparen{\omega}: \omega\in\Omega}$.
The result of a query is the pair $\inparen{\query\inparen{\omega}, \bpd'}$ where $\bpd'$ is a probability distribution that assigns to each possible query result the sum of the probabilites of the worlds that produce this answer: $\probOf\pbox{\omega\in\Omega} = \sum{\omega'\in\Omega,\\\query\inparen{\omega'}=\query\inparen{\omega}}\probOf\pbox{\omega'}$.
Queries over probabilistic databases are traditionally viewed as being evaluated using the so-called possible world semantics. A general bag-\abbrPDB can be defined as the pair $\pdb = \inparen{\Omega, \bpd}$ where $\Omega$ is the set of possible worlds represented by $\pdb$. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\Omega$ is the set of query answers produced by evaluating $\query$ over each possible world $\omega\in\Omega$: $\inset{\query\inparen{\omega}: \omega\in\Omega}$.
The result of a query is the pair $\inparen{\query\inparen{\omega}, \bpd'}$ where $\bpd'$ is a probability distribution that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer: $\probOf\pbox{\omega\in\Omega} = \sum_{\omega'\in\Omega,\\\query\inparen{\omega'}=\query\inparen{\omega}}\probOf\pbox{\omega'}$.
}
Recalling \Cref{fig:nxDBSemantics} again, which defines the lineage polynomial $\apolyqdt$ for any $\raPlus$ query. We now make a meaningful connection between possible world semantics and world assignments on the lineage polynomial.
Suppose that $\bpd$ is a $1$-\abbrBIDB. Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to, e.g., disjointness. The all worlds set can be modeled by $\worldvec\in \{0, 1\}^{\bound\numvar}$, such that $\worldvec_{\tup, j} \in \worldvec$ represents whether or not the multiplicity of $\tup$ is $j$ (\emph{here and later, especially in \Cref{sec:algo}, we will rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$}).%(where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$).
We can denote a probability distribution over all $\worldvec \in \{0, 1\}^{\bound\numvar}$ as $\bpd'$. When $\bpd'$ is the one induced from each $\prob_{\tup, j}$ while assigning $\probOf\pbox{\worldvec} = 0$ for any $\worldvec$ with $\worldvec_{\tup, j}, \worldvec_{\tup, j'} \geq 1$ for $j\neq j'$, we end up with a bijective mapping from $\bpd$ to $\bpd'$, such that each mapping is equivalent, implying the distributions are equivalent.
\Cref{subsec:supp-mat-ti-bi-def} has more details.
Recall \Cref{fig:nxDBSemantics} again, which defines the lineage polynomial $\apolyqdt$ for any $\raPlus$ query. We now make a meaningful connection between possible world semantics and world assignments on the lineage polynomial.
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given a \abbrBPDB $\pdb = (\Omega,\bpd)$, $\raPlus$ query $\query$, and lineage polynomial $\apolyqdt$ for arbitrary result tuple $\tup$, %$\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$,


@ -43,29 +43,34 @@ or simply lineage polynomial), if there exists a $\raPlus$ query $\query$, \abbr
%Following the typical representation of bags in production databases, for query inputs, we will use \abbrBPDB\xplural with multiplicities $\{0, 1\}$ (see \Cref{sec:gener-results-beyond} for more on this choice).
\subsection{\abbrCTIDB\xplural and \abbrOneBIDB\xplural}
\subsection{$\mathbf{1}$-BIDB}
\label{subsec:tidbs-and-bidbs}
%An \textit{incomplete database} $\Omega$ is a set of deterministic databases $\worldvec$ called possible worlds.
\noindent\secrev{
A block independent database \abbrBIDB $\pdb'$ can viewed as a $1$-\abbrTIDB $\pdb$ with the added flexibility that each $\tup\in\tupset$ has multiple disjoint alternatives, i.e., all $\tup \in \tupset'$ are partitioned into $m$ independent blocks with the condition that tuples $\tup \in \block_i$ for $i \in \pbox{m}$ are disjoint events. We define next a specific construction of \abbrBIDB that is useful for out work.
A block independent database (\abbrBIDB) $\pdb'$ can be viewed as a $1$-\abbrTIDB $\pdb$ with the added flexibility that each $\tup\in\tupset$ has multiple disjoint alternatives, i.e., all $\tup \in \tupset'$ are partitioned into $m$ independent blocks with the condition that tuples $\tup \in \block_i$ for $i \in \pbox{m}$ are disjoint events. We next define a specific construction of \abbrBIDB that is useful for our work.
\begin{Definition}[$1$-\abbrBIDB]\label{def:one-bidb}
Define a $1$-\abbrBIDB to be the pair $\pdb' = \inparen{\prod_{\tup\in\tupset'}\inset{0, \bound_\tup}, \bpd'},$ where $\tupset'$ is the set of possible tuples such that each $\tup \in \tupset'$ has a multiplicity domain of $\inset{0, \bound_\tup}$, with $\bound_\tup \in \mathbb{N}$. The term $\prod_{\tup\in\tupset'}$ is the direct product of all such multiplicity domain pairs. The tuples $\tup\in\tupset'$ are further partitioned into $m$ independent blocks $\block_i,~i\in\pbox{m}$ of disjoint tuples. $\bpd$ is the probability distribution across all worlds such that, given $\worldvec\in\prod_{\tup\in\tupset'}\inset{0,\bound_\tup},\tup,~\tup'\in\block_i~:~\probOf\pbox{\worldvec_\tup, \worldvec_\tup'>0} = 0$.
Define a $1$-\abbrBIDB to be the pair $\pdb' = \inparen{\prod_{\tup\in\tupset'}\inset{0, \bound_\tup}, \bpd'},$ where $\tupset'$ is the set of possible tuples such that each $\tup \in \tupset'$ has a multiplicity domain of $\inset{0, \bound_\tup}$, with $\bound_\tup \in \mathbb{N}$. The operation $\prod_{\tup\in\tupset'}$ is the direct product of all such multiplicity domain pairs. The tuples $\tup\in\tupset'$ are partitioned into $m$ independent blocks $\block_i,~i\in\pbox{m}$, of disjoint tuples. $\bpd'$ is the probability distribution across all worlds such that, for any $\worldvec\in\prod_{\tup\in\tupset'}\inset{0,\bound_\tup}$ and any two distinct tuples $\tup,~\tup'\in\block_i$, $\probOf\pbox{\worldvec_\tup, \worldvec_{\tup'}>0} = 0$.
\end{Definition}
%A \abbrCTIDB $\pdb$ is a pair $\inparen{\worlds, \bpd}$ such that $\worlds$ is an incomplete database whose set of possible worlds is the $c+1^\numvar$ tuple/multiplicity combinations across all $\tup\in\tupset$, where $\abs{\tupset} = \numvar$, $\tupset = \bigcup_{\worldvec\in\worlds,~\worldvec_{\tup}\geq 1}\tup$ is the set of possible tuples across possible worlds, and $\bpd$ is a probability distribution over $\worlds$.
\begin{Definition}[$\bound$-Block Independent Disjoint Database ($\bound$-\abbrBIDB)]\label{def:bidb}
A $\bound$-block independent database ($\bound$-\abbrBIDB) $\pdb' = \inparen{\inset{0,\ldots,\bound}^{\tupset'}, \bpd'}$ is a probabilistic database such that the all worlds set is encoded as the set of vectors $\worldvec\in\inset{0,\ldots,\bound}^{\abs{\tupset'}}$ where $\worldvec_\tup\leq\bound$ is the multiplicity for tuple $\tup$. $\pdb'$ requires the set of all possible tuples $\tupset = \bigcup_{\worldvec\in\inset{0,\ldots, \bound}^{\tupset'},~\worldvec_\tup \geq 1}\tup$ to be partitioned into $m$ independent blocks $\block_i$ ($i\in\pbox{m}$) where all tuples $\tup_{i, j}\in \block_i$ are disjoint. $\bpd'$ is the probability distribution where, for all $\worldvec\in\inset{0,\ldots,\bound}^{\tupset'}$ such that $\worldvec_{\tup_{i, j}},\worldvec_{\tup_{i, j'}}\neq 0, j\neq j'$ for any $\block_i$, $\probOf\pbox{\worldvec} = 0$, where all other $\worldvec$ has $0<\probOf\pbox{\worldvec}\leq 1$.%bpd'$ set with the all worlds set $\worlds$ and probability distribution $\bpd'$ such that $\tupset' = \bigcup_{\worldvec\in\worlds, \worldvec_\tup \geq 1}\tup$ is the set of all possible tuples for which all $\tup\in\tupset'$ can be partitioned into $\numedge$ blocks $\block_i$ where the set of tuples $\tup_j \in \block_i$ are all disjoint, while blocks $\block_i$ are independent of one another. Each $\tup\in\tupset'$ has a multiplicity of at most $\bound$. $\bpd'$ is the distribution such that for any $\worldvec\in\worlds$ with $\worldvec_{\tup_{i, j}}\geq 1$ and $\worldvec_{\tup_{i, j'}}\geq 1$, $j\neq j'$ in any $\block_i$ more than one tuple present from the same block $\block_i$ has probability $\probOf\pbox{\worldvec} = 0$.
\end{Definition}
A block independent database (\abbrBIDB) is a related probabilistic data model $\pdb=\inparen{\Omega, \bpd}$ such that the base set of tuples $\tupset = \bigcup_{\omega\in\Omega,~\tup\in\omega}\tup$ is partitioned into a set of $\numvar$ independent blocks $\inset{\inparen{\block_\tup}_{\tup\in\pbox{\numvar}}}$ such that the set of tuples $\inset{\inparen{\tup_j}_{j\in\pbox{\abs{\block}}}}$ in block $\block_\tup$ are disjoint from one another. This construction produces the set of possible worlds $\Omega$ that consists of all unique combinations of tuples in $\tupset$ with the constraint that for any $\omega\in\Omega$, no two tuples $\tup_j, \tup_{j'}, j\neq j'$ from the same block $\block_\tup$ exist together. A $\bound$-\abbrBIDB has the further requirement that each block has a multiplicity of at most $c$. We present a reduction that is useful in producing our results:
%\begin{Definition}[$\bound$-Block Independent Disjoint Database ($\bound$-\abbrBIDB)]\label{def:bidb}
%A $\bound$-block independent database ($\bound$-\abbrBIDB) $\pdb' = \inparen{\inset{0,\ldots,\bound}^{\tupset'}, \bpd'}$ is a probabilistic database such that the all worlds set is encoded as the set of vectors $\worldvec\in\inset{0,\ldots,\bound}^{\abs{\tupset'}}$ where $\worldvec_\tup\leq\bound$ is the multiplicity for tuple $\tup$. $\pdb'$ requires the set of all possible tuples $\tupset = \bigcup_{\worldvec\in\inset{0,\ldots, \bound}^{\tupset'},~\worldvec_\tup \geq 1}\tup$ to be partitioned into $m$ independent blocks $\block_i$ ($i\in\pbox{m}$) where all tuples $\tup_{i, j}\in \block_i$ are disjoint. $\bpd'$ is the probability distribution where, for all $\worldvec\in\inset{0,\ldots,\bound}^{\tupset'}$ such that $\worldvec_{\tup_{i, j}},\worldvec_{\tup_{i, j'}}\neq 0, j\neq j'$ for any $\block_i$, $\probOf\pbox{\worldvec} = 0$, where all other $\worldvec$ has $0<\probOf\pbox{\worldvec}\leq 1$.%bpd'$ set with the all worlds set $\worlds$ and probability distribution $\bpd'$ such that $\tupset' = \bigcup_{\worldvec\in\worlds, \worldvec_\tup \geq 1}\tup$ is the set of all possible tuples for which all $\tup\in\tupset'$ can be partitioned into $\numedge$ blocks $\block_i$ where the set of tuples $\tup_j \in \block_i$ are all disjoint, while blocks $\block_i$ are independent of one another. Each $\tup\in\tupset'$ has a multiplicity of at most $\bound$. $\bpd'$ is the distribution such that for any $\worldvec\in\worlds$ with $\worldvec_{\tup_{i, j}}\geq 1$ and $\worldvec_{\tup_{i, j'}}\geq 1$, $j\neq j'$ in any $\block_i$ more than one tuple present from the same block $\block_i$ has probability $\probOf\pbox{\worldvec} = 0$.
%\end{Definition}
%A block independent database (\abbrBIDB) is a related probabilistic data model $\pdb=\inparen{\Omega, \bpd}$ such that the base set of tuples $\tupset = \bigcup_{\omega\in\Omega,~\tup\in\omega}\tup$ is partitioned into a set of $\numvar$ independent blocks $\inset{\inparen{\block_\tup}_{\tup\in\pbox{\numvar}}}$ such that the set of tuples $\inset{\inparen{\tup_j}_{j\in\pbox{\abs{\block}}}}$ in block $\block_\tup$ are disjoint from one another. This construction produces the set of possible worlds $\Omega$ that consists of all unique combinations of tuples in $\tupset$ with the constraint that for any $\omega\in\Omega$, no two tuples $\tup_j, \tup_{j'}, j\neq j'$ from the same block $\block_\tup$ exist together. A $\bound$-\abbrBIDB has the further requirement that each block has a multiplicity of at most $c$.
We now present a reduction that is useful in deriving our results:
\begin{Definition}[\abbrCTIDB reduction]\label{def:ctidb-reduct}
Given \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, let $\pdb' = \inparen{\Omega, \bpd'}$ be the \abbrOneBIDB obtained in the following manner: for each $\tup\in\tupset$, create block $\block_\tup = \inset{\intup{\tup, jX_{\tup, j}}_{j\in\pbox{\bound}}}$, such that $X_{\tup, j}\in\inset{0,1}$. %with $\bound$ disjoint copies, such that $\tup_j$ is annotated with variable $X_{\tup, j}$ for $j\in\pbox{\bound}$.
The probability distribution $\bpd'$ is the one induced by $\vct{p} = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}$ and the \abbrBIDB disjoint requirement.
Given \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, let $\pdb' = \inparen{\Omega, \bpd'}$ be the \abbrOneBIDB obtained in the following manner: for each $\tup\in\tupset$, create block $\block_\tup = \inset{\intup{\tup, jX_{\tup, j}}_{j\in\pbox{\bound}}}$, where $X_{\tup, j}\in\inset{0,1}$ for all $j\in\pbox{\bound}$.
The probability distribution $\bpd'$ is the one induced by $\vct{p} = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}$ and the \abbrBIDB disjointness requirement: whenever $X_{\tup, j} = 1$, we have $X_{\tup, j'} = 0$ for every $j'\in\pbox{\bound} - \inset{j}$.
% $\block_\tup,~j\in\pbox{\bound}~|~X_{\tup, j} = 1,\not\exists j'\neq j~|~X_{\tup, j'} = 1$.
%$\tup_j\geq1\implies \tup_{j'} = 0$.$\forall j, j' \in \pbox{\bound},\forall \tup\in\tupset, \tup_j\geq 1\implies \tup_{j'} = 0$ for any block $\block_\tup$.
\end{Definition}
For the \abbrCTIDB $\pdb$, each $X_\tup\in\pbox{\bound}$, while in the reduced \abbrOneBIDB $\pdb'$, each $X_{\tup, j}\in\inset{0, 1}$. %As previously noted, unlike $X_{\tup}\in\inset{0,\ldots,\bound}$ for $X_{\tup}\in\vars{\pdb}$, $X_{\tup, j}\in\inset{0,1}$ for $X_{\tup, j}\in\vars{\pdb'}$.
Hence, in the setting of \abbrOneBIDB, the base case of~\Cref{fig:nxDBSemantics} now becomes $\poly\pbox{\rel,\tupset, \tup} = \sum_{j\in\pbox{\bound}}jX_{\tup, j}$. Then given the disjoint requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly\pbox{\rel,\tupset',\tup}$ is of the same structure as the reformulated polynomial $\refpoly{}$ of step i) from~\Cref{def:reduced-poly}, which then implies that $\rpoly$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and further that~\Cref{lem:tidb-reduce-poly} immediately follows for \abbrOneBIDB polynomials: $\expct_{\rvworld\sim\bpd'}\pbox{\poly\inparen{\rvworld}} = \rpoly\inparen{\vct{\prob}}$.
\AH{@atri, not sure if $\bpd'$ should be $\bpd''$ (in the above expectation) as discussed below. Since $\bpd'\equiv\bpd''$, then the proof still holds for~\Cref{lem:tidb-reduce-poly}, but maybe it is important to $\bpd''$ to drive the point home that we iterate over the all worlds set (as opposed to the set of possible worlds) when computing the expectation of a polynomial. Or maybe it suffices to note that $\bpd'\equiv\bpd''$.}
Hence, in the setting of \abbrOneBIDB, we have the following semantics for generating lineage polynomials in $\raPlus$ queries: $\poly'\pbox{\project_A\inparen{\query}, \tupset', \tup_j} = \sum_{\tup_{j'} \in \project_{A}\inparen{\query\inparen{\tupset'}}: \tup_{j'} = \tup_j}\poly'\pbox{\query, \tupset', \tup_{j'}}$,
$\poly'\pbox{\select_\theta\inparen{\query}, \tupset', \tup_j} = \begin{cases}\poly'\pbox{\query, \tupset', \tup_j}&\text{if }\theta = 1\\0&\text{if }\theta = 0\end{cases}$,
$\poly'\pbox{\query_1\join\query_2, \tupset', \tup_j} = \poly'\pbox{\query_1, \tupset', \project_{attr\inparen{\query_1}}\inparen{\tup_j}}\cdot\poly'\pbox{\query_2, \tupset', \project_{attr\inparen{\query_2}}\inparen{\tup_j}}$,
$\poly'\pbox{\query_1\union\query_2, \tupset', \tup_j} = \poly'\pbox{\query_1, \tupset', \tup_j}+\poly'\pbox{\query_2, \tupset', \tup_j}$,
and the base case now becomes $\poly'\pbox{\rel,\tupset', \tup_j} = j\cdot X_{\tup, j}$ (cf.~\Cref{fig:nxDBSemantics}). Then given the disjoint requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly'\pbox{\rel,\tupset',\tup}$ is of the same structure as the reformulated polynomial $\refpoly{}$ of step i) from~\Cref{def:reduced-poly}, which then implies that $\rpoly'$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and further that~\Cref{lem:tidb-reduce-poly} immediately follows for \abbrOneBIDB polynomials: $\expct_{\rvworld\sim\bpd'}\pbox{\poly'\inparen{\rvworld}} = \rpoly'\inparen{\vct{\prob}}$.
}
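% Illustrative worked instance of the reduction above (a sketch, not part of the formal development).
% Assumption: step ii) of~\Cref{def:reduced-poly} drops every monomial that contains two distinct
% variables of the same block and lowers all remaining exponents to $1$; that definition is not restated here.
As a small illustration of the reduction, and assuming step ii) of~\Cref{def:reduced-poly} behaves as just described, take $\bound = 2$ and a single tuple $\tup$, so that block $\block_\tup$ carries the variables $X_{\tup, 1}$ and $X_{\tup, 2}$ with probabilities $\prob_{\tup, 1}$ and $\prob_{\tup, 2}$. Summing the base cases $j\cdot X_{\tup, j}$ gives the multiplicity of $\tup$ as $X_{\tup, 1} + 2X_{\tup, 2}$, and a self-join squares it:
\[
\inparen{X_{\tup, 1} + 2X_{\tup, 2}}^2 = X_{\tup, 1}^2 + 4X_{\tup, 1}X_{\tup, 2} + 4X_{\tup, 2}^2.
\]
The cross term $4X_{\tup, 1}X_{\tup, 2}$ is $0$ in every world with nonzero probability because the two variables belong to the same block, and $X_{\tup, j}^2 = X_{\tup, j}$ because $X_{\tup, j}\in\inset{0,1}$. The reduced polynomial is therefore $X_{\tup, 1} + 4X_{\tup, 2}$, and its value at $\vct{\prob}$ is $\prob_{\tup, 1} + 4\prob_{\tup, 2}$. This agrees with computing the expectation directly over the three worlds in which $\tup$ has multiplicity $0$, $1$, or $2$: $0^2\cdot\inparen{1 - \prob_{\tup, 1} - \prob_{\tup, 2}} + 1^2\cdot\prob_{\tup, 1} + 2^2\cdot\prob_{\tup, 2}$.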
%In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independent (\bi) and Tuple-Independent (\ti) \abbrPDB\xplural.
%%
@ -78,9 +83,6 @@ Hence, in the setting of \abbrOneBIDB, the base case of~\Cref{fig:nxDBSemantics}
Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness. The all worlds set can be modeled by $\worldvec\in \{0, 1\}^{\bound\numvar}$,\footnote{Here and later, especially in \Cref{sec:algo}, we will rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\worldvec_{\tup, j} \in \worldvec$ represents whether or not the multiplicity of $\tup$ is $j$.%(where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$).
We denote a probability distribution over all $\worldvec \in \{0, 1\}^\numvar$ as $\bpd''$. When $\bpd''$ is the one induced from each $\prob_{\tup, j}$ while assigning $\probOf\pbox{\worldvec} = 0$ for any $\worldvec$ with $\worldvec_{\tup, j} = \worldvec_{\tup, j'} = 1$ for $j\neq j'$, we end up with a bijective mapping from $\bpd'$ to $\bpd''$, such that each mapping is equivalent, implying the distributions are equivalent.
\Cref{subsec:supp-mat-ti-bi-def} has more details.
%%% Local Variables: