paper-BagRelationalPDBsAreHard/ra-to-poly.tex

%root: main.tex
%!TEX root=./main.tex
%\onecolumn
\section{Background and Notation}\label{sec:background}

\subsection{Probabilistic Databases}

The setting used in this section is primarily that of a bag-\abbrPDB query with set-\abbrPDB inputs.  Recall, as noted in \cref{sec:intro-rewrite-070921}, this is not limiting.  

An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$.

For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$  that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
%
\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]

Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_\numvar)$ with natural number coefficients and exponents.
We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations}. 
 $\semNX$-relations are functions from tuples to elements of $\semNX$, typically called annotations.
We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$. Note that $R(t)$ is the lineage polynomial for $t$.
Each possible world is defined by an assignment of $\numvar$ binary values $\vct{\wElem} \in \{0, 1\}^{\numvar}$ to $\vct{X}$.
The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{\wElem})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{\wElem}$.
$\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

We will use $\semNX$-\abbrPDB $\pxdb$, defined as the tuple $(\idb_{\semNX}, \pd)$, where $\semNX$-database $\idb_{\semNX}$ is paired with probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-\abbrPDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{\numvar} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-\abbrPDB\xplural and a function $\rmod$ (which transforms an $\semNX$-\abbrPDB to  a classical bag-\abbrPDB, or $\semN$-\abbrPDB~\cite{DBLP:conf/pods/GreenKT07,feng:2019:sigmod:uncertainty}) are both formalized in \Cref{subsec:supp-mat-background}.
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
  Given an $\semN$-\abbrPDB $\pdb = (\idb,\pd)$ and $\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$, we have:
  $ \expct_{\randDB \sim \pd}[\query(\randDB)(t)] = \expct_{\randWorld\sim \pd'}\pbox{\polyForTuple(\randWorld)}. $
  \footnote{Although assumed by most prior work on set-probabilistic databases, e.g., as an obvious consequence of~\cite{IL84a}'s Theorem 7.1, we are unaware of any formal proof for bag-probabilistic databases.}
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.
This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.
We focus on this problem from now on, assume an implicit result tuple, and so drop the subscript from $\polyForTuple$ (i.e., $\poly$ will denote a polynomial).

\subsubsection{\tis and \bis}
\label{subsec:tidbs-and-bidbs}
In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independent (\bi) and Tuple-Independent (\ti) \abbrPDB\xplural.
%
A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-\abbrPDB  such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.
In other words, each random variable corresponds to the event of a single tuple's presence.
%
A \emph{\ti} is a \bi where each block contains exactly one tuple.
\Cref{subsec:supp-mat-ti-bi-def} explains \tis and \bis in greater detail.
%
In a \bi (and by extension a \ti) $\pxdb$, tuples are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_{i,j} \in \block_i$ is associated with a probability $\prob_{\tup_{i,j}} = \probOf[X_{i,j} = 1]$, and is annotated with a unique variable $X_{i,j}$.\footnote{
  Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function).  For $t_{i, j} \in b_i$, the event $(X_{i,j} = 1)$ corresponds to the event $(X_i = j)$ in the customary annotation scheme.
}
Because blocks are independent and tuples from the same block are disjoint, the probabilities $\prob_{\tup_{i,j}}$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as
$\poly(\vct{X})$ = $\poly(X_{1, 1},\ldots, X_{1, \abs{\block_1}},$ $\ldots, X_{\ell, \abs{\block_\ell}})$, where $\abs{\block_i}$ denotes the size of $\block_i$.\footnote{Later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
Started texing poly reformation write up. 2020-06-12 11:45:15 -04:00			`%root: main.tex`
Oliver's notes 2020-06-26 17:27:52 -04:00			`%!TEX root=./main.tex`
More work on lemmas 3, 4, and lin sys. 2020-12-04 13:14:12 -05:00			`%\onecolumn`
Fixed end of lemma 3.5 proof. 2020-12-16 17:25:37 -05:00			`\section{Background and Notation}\label{sec:background}`
Added some pictures for single edge and two path patterns. 2020-09-09 12:11:05 -04:00
Rewrote 1st paragraph of Intro to be consistent with traditional nomenclature and notation used S.2. 2021-08-20 10:31:24 -04:00			`\subsection{Probabilistic Databases}`
Started with my pass on Sec 1 2020-07-02 16:15:35 -04:00
Changes to S.2. 2021-08-24 12:48:25 -04:00			`The setting used in this section is primarily that of a bag-\abbrPDB query with set-\abbrPDB inputs. Recall, as noted in \cref{sec:intro-rewrite-070921}, this is not limiting.`
Changes I apparently hadn't commited. 2021-08-23 09:01:25 -04:00
pdb def 2020-12-13 01:50:08 -05:00			`An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.`
multi 2021-04-10 14:11:35 -04:00			Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$.
Started fixing Oliver's suggestion 071820. 2020-07-20 15:50:44 -04:00
pdb def 2020-12-13 01:50:08 -05:00			`For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:`
updates 2021-04-10 14:35:38 -04:00			`%`
Started pass on Sec 2 (Aaron) 2021-04-06 17:44:14 -04:00			`\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]`
pdb def 2020-12-13 01:50:08 -05:00
Notation changes started. 2021-06-09 12:42:26 -04:00			`Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_\numvar)$ with natural number coefficients and exponents.`
Rewrote 1st paragraph of Intro to be consistent with traditional nomenclature and notation used S.2. 2021-08-20 10:31:24 -04:00			`We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations}.`
updates 2021-04-10 14:35:38 -04:00			`$\semNX$-relations are functions from tuples to elements of $\semNX$, typically called annotations.`
multi 2021-04-10 14:11:35 -04:00			`We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$. Note that $R(t)$ is the lineage polynomial for $t$.`
Notation changes started. 2021-06-09 12:42:26 -04:00			`Each possible world is defined by an assignment of $\numvar$ binary values $\vct{\wElem} \in \{0, 1\}^{\numvar}$ to $\vct{X}$.`
			`The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{\wElem})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{\wElem}$.`
cref 2021-04-10 09:48:26 -04:00			`$\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).`
RA 2020-12-13 15:51:55 -05:00
			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`

Notation changes started. 2021-06-09 12:42:26 -04:00			`We will use $\semNX$-\abbrPDB $\pxdb$, defined as the tuple $(\idb_{\semNX}, \pd)$, where $\semNX$-database $\idb_{\semNX}$ is paired with probability distribution $\pd$.`
			`We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-\abbrPDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{\numvar} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.`
			`$\semNX$-\abbrPDB\xplural and a function $\rmod$ (which transforms an $\semNX$-\abbrPDB to a classical bag-\abbrPDB, or $\semN$-\abbrPDB~\cite{DBLP:conf/pods/GreenKT07,feng:2019:sigmod:uncertainty}) are both formalized in \Cref{subsec:supp-mat-background}.`
RA 2020-12-13 15:51:55 -05:00			`\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}`
Notation changes started. 2021-06-09 12:42:26 -04:00			`Given an $\semN$-\abbrPDB $\pdb = (\idb,\pd)$ and $\semNX$-\abbrPDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$, we have:`
			$ \expct_{\randDB \sim \pd}[\query(\randDB)(t)] = \expct_{\randWorld\sim \pd'}\pbox{\polyForTuple(\randWorld)}. $
Simplifying discussion of N[X]-DBs 2021-04-09 00:07:33 -04:00			`\footnote{Although assumed by most prior work on set-probabilistic databases, e.g., as an obvious consequence of~\cite{IL84a}'s Theorem 7.1, we are unaware of any formal proof for bag-probabilistic databases.}`
RA 2020-12-13 15:51:55 -05:00			`\end{Proposition}`
poly 2021-04-08 22:51:36 -04:00			`\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.`
Try this one neat trick to save 2 pages :) 2020-12-19 16:46:26 -05:00			`This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.`
Read through: Space, grammar, notation fixes 2021-04-07 01:02:46 -04:00			`We focus on this problem from now on, assume an implicit result tuple, and so drop the subscript from $\polyForTuple$ (i.e., $\poly$ will denote a polynomial).`
Pass over S2, S3; Ended up saving a column or so 2020-12-19 00:45:30 -05:00
			`\subsubsection{\tis and \bis}`
Trimming for space 2020-12-19 12:59:27 -05:00			`\label{subsec:tidbs-and-bidbs}`
Notation changes started. 2021-06-09 12:42:26 -04:00			`In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independent (\bi) and Tuple-Independent (\ti) \abbrPDB\xplural.`
Pass over S2, S3; Ended up saving a column or so 2020-12-19 00:45:30 -05:00			`%`
Notation changes started. 2021-06-09 12:42:26 -04:00			`A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-\abbrPDB such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.`
Minor adjustments 2021-04-10 00:19:16 -04:00			`In other words, each random variable corresponds to the event of a single tuple's presence.`
Pass over S2, S3; Ended up saving a column or so 2020-12-19 00:45:30 -05:00			`%`
			`A \emph{\ti} is a \bi where each block contains exactly one tuple.`
			`\Cref{subsec:supp-mat-ti-bi-def} explains \tis and \bis in greater detail.`
Minor adjustments 2021-04-10 00:19:16 -04:00			`%`
Notation changes started. 2021-06-09 12:42:26 -04:00			`In a \bi (and by extension a \ti) $\pxdb$, tuples are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_{i,j} \in \block_i$ is associated with a probability $\prob_{\tup_{i,j}} = \probOf[X_{i,j} = 1]$, and is annotated with a unique variable $X_{i,j}$.\footnote{`
Changes to S.2. 2021-08-24 12:48:25 -04:00			`Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function). For $t_{i, j} \in b_i$, the event $(X_{i,j} = 1)$ corresponds to the event $(X_i = j)$ in the customary annotation scheme.`
Minor adjustments 2021-04-10 00:19:16 -04:00			`}`
			`Because blocks are independent and tuples from the same block are disjoint, the probabilities $\prob_{\tup_{i,j}}$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.`
			`We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as`
cref 2021-04-10 09:48:26 -04:00			`$\poly(\vct{X})$ = $\poly(X_{1, 1},\ldots, X_{1, \abs{\block_1}},$ $\ldots, X_{\ell, \abs{\block_\ell}})$, where $\abs{\block_i}$ denotes the size of $\block_i$.\footnote{Later on in the paper, especially in \Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.}`
Minor adjustments 2021-04-10 00:19:16 -04:00
RA 2020-12-13 15:51:55 -05:00			`%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%`


poly 2021-04-08 22:51:36 -04:00			`%%% Local Variables:`
			`%%% mode: latex`
			`%%% TeX-master: "main"`
			`%%% End:`