
%root: main.tex
%!TEX root=./main.tex
%\onecolumn
\section{Background and Notation}\label{sec:background}

\subsection{Polynomial Definition and Terminology}
%We now introduce some terminology 
%and develop a reduced form of lineage polynomials for a \abbrBIDB or \abbrTIDB.
%Note that 
\secrev{A }
 polynomial over $\vct{X}=(X_1,\dots,X_n)$ with individual degree $B <\infty$ 
is formally defined as (where $c_{\vct{d}}\in \semN$): 
\begin{equation}
  \label{eq:sop-form}
\poly\inparen{X_1,\dots,X_n}=\secrev{\sum_{\vct{d}\in\{0,\ldots,B\}^\tupset} c_{\vct{d}}\cdot \prod_{\tup\in\tupset} X_\tup^{d_\tup}.}
\end{equation}
%where $c_{\vct{d}}\in \semN$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Standard Monomial Basis]\label{def:smb}
The term $\prod_{\tup\in\tupset} X_\tup^{d_\tup}$ in \Cref{eq:sop-form} is a {\em monomial}. A polynomial $\poly\inparen{\vct{X}}$ is in standard monomial basis (\abbrSMB) when we keep only the terms with $c_{\vct{d}}\ne 0$ from \Cref{eq:sop-form}.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Unless otherwise noted, we consider all polynomials to be in \abbrSMB representation. 
When it is unclear, we use $\smbOf{\poly}$ to denote the \abbrSMB form of a polynomial $\poly$.
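As a small illustration (the polynomial below is our own example, not one drawn from a particular query), consider $\poly\inparen{X, Y} = \inparen{X + Y}\inparen{X + Y}$ with $B = 2$. Expanding and collecting like terms gives
\[
\smbOf{\poly} = X^2 + 2XY + Y^2,
\]
so the only nonzero coefficients in \Cref{eq:sop-form} are $c_{(2,0)} = 1$, $c_{(1,1)} = 2$, and $c_{(0,2)} = 1$; the remaining $c_{\vct{d}}$ with $\vct{d}\in\{0,1,2\}^2$ are zero and do not appear in the \abbrSMB form.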

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Degree]\label{def:degree-of-poly}
The degree of polynomial $\poly(\vct{X})$ is the largest \secrev{$\norm{\vct{d}}_1$}% = \sum_{\tup\in\tupset} d_\tup$ 
such that $c_{(d_1,\dots,d_n)}\ne 0$. % maximum sum of exponents, over all monomials in $\smbOf{\poly(\vct{X})}$.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As an example, the degree of the polynomial $X^2+2XY^2+Y^2$ is $3$.
Product terms in lineage arise only from join operations (\Cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins needed to produce a result tuple.
%in any clause of the $\raPlus$ query that created it.
\secrev{
We call a polynomial $\poly\inparen{\vct{X}}$ a \emph{\abbrCTIDB-lineage polynomial} (%resp., \emph{\ti-lineage polynomial}, 
or simply lineage polynomial), if there exists a $\raPlus$ query $\query$, \abbrCTIDB $\pdb$, and result tuple $\tup$ such that $\poly\inparen{\vct{X}} = \apolyqdt\inparen{\vct{X}}.$
}


%Following the typical representation of bags in production databases, for query inputs, we will use \abbrBPDB\xplural with multiplicities $\{0, 1\}$ (see \Cref{sec:gener-results-beyond} for more on this choice).
\subsubsection{\abbrCTIDB\xplural and \abbrOneBIDB\xplural}
\label{subsec:tidbs-and-bidbs}
An \textit{incomplete database} $\Omega$ is a set of deterministic databases $\omega$ called possible worlds.

\noindent\secrev{
A \abbrCTIDB $\pdb$ is a pair $\inparen{\worlds, \bpd}$ such that $\worlds$ is an incomplete database whose set of possible worlds is the $\inparen{\bound+1}^\numvar$ tuple/multiplicity combinations across all $\tup\in\tupset$, where $\abs{\tupset} = \numvar$, $\tupset = \bigcup_{\worldvec\in\worlds,~\worldvec_{\tup}\geq 1}\tup$ is the set of possible tuples across possible worlds, and $\bpd$ is a probability distribution over $\worlds$.  
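For example, assuming $\bound = 2$ and $\tupset = \inset{\tup_1, \tup_2}$ (a hypothetical instance with $\numvar = 2$), each tuple takes a multiplicity in $\inset{0, 1, 2}$, which yields $\inparen{2+1}^2 = 9$ possible worlds, ranging from the world in which both tuples are absent to the world in which both have multiplicity $2$.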

A block independent database (\abbrBIDB) is a related probabilistic data model $\pdb=\inparen{\Omega, \bpd}$ such that the base set of tuples $\tupset = \bigcup_{\omega\in\Omega,~\tup\in\omega}\tup$ is partitioned into a set of $\numvar$ independent blocks $\inset{\inparen{\block_\tup}_{\tup\in\pbox{\numvar}}}$, where the tuples $\inset{\inparen{\tup_j}_{j\in\pbox{\abs{\block_\tup}}}}$ in block $\block_\tup$ are disjoint from one another. This construction produces the set of possible worlds $\Omega$, which consists of all unique combinations of tuples in $\tupset$ subject to the constraint that, for any $\omega\in\Omega$, no two tuples $\tup_j, \tup_{j'}$ with $j\neq j'$ from the same block $\block_\tup$ exist together.  A $\bound$-\abbrBIDB has the further requirement that each block has a multiplicity of at most $\bound$.  We present a reduction that is useful in producing our results:

\begin{Definition}[\abbrCTIDB reduction]\label{def:ctidb-reduct}
Given \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, let $\pdb' = \inparen{\Omega, \bpd'}$ be the \abbrOneBIDB obtained in the following manner: for each $\tup\in\tupset$, create block $\block_\tup = \inset{\intup{\tup, jX_{\tup, j}}_{j\in\pbox{\bound}}}$, such that $X_{\tup, j}\in\inset{0,1}$. %with $\bound$ disjoint copies, such that $\tup_j$ is annotated with variable $X_{\tup, j}$ for $j\in\pbox{\bound}$.  
The probability distribution $\bpd'$ is the one induced by $\vct{p} = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}$ and the \abbrBIDB disjointness requirement. 
\end{Definition}
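
To illustrate \Cref{def:ctidb-reduct}, suppose $\bound = 2$ and consider a single tuple $\tup\in\tupset$ (a hypothetical instance). The reduction creates the block
\[
\block_\tup = \inset{\intup{\tup, 1\cdot X_{\tup, 1}}, \intup{\tup, 2\cdot X_{\tup, 2}}},
\]
and $\vct{p}$ contributes the entries $\prob_{\tup, 1}$ and $\prob_{\tup, 2}$. Reading $X_{\tup, j} = 1$ as ``$\tup$ has multiplicity $j$'' (our interpretation of the construction), a world of $\pdb$ in which $\tup$ has multiplicity $2$ corresponds to $X_{\tup, 1} = 0$, $X_{\tup, 2} = 1$, and the world in which $\tup$ is absent corresponds to $X_{\tup, 1} = X_{\tup, 2} = 0$. The assignment $X_{\tup, 1} = X_{\tup, 2} = 1$ is ruled out by disjointness.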

For the \abbrCTIDB $\pdb$, each $X_\tup\in\inset{0,\ldots,\bound}$, while in the reduced \abbrOneBIDB $\pdb'$, each $X_{\tup, j}\in\inset{0, 1}$.  %As previously noted, unlike $X_{\tup}\in\inset{0,\ldots,\bound}$ for $X_{\tup}\in\vars{\pdb}$, $X_{\tup, j}\in\inset{0,1}$ for $X_{\tup, j}\in\vars{\pdb'}$.  
Hence, in the setting of \abbrOneBIDB, the base case of~\Cref{fig:nxDBSemantics} becomes $\poly\pbox{\rel,\tupset', \tup} = \sum_{j\in\pbox{\bound}}jX_{\tup, j}$.  Given the disjointness requirement and the semantics for constructing the lineage polynomial over a \abbrOneBIDB, $\poly\pbox{\rel,\tupset',\tup}$ has the same structure as the reformulated polynomial $\refpoly{}$ of step i) of~\Cref{def:reduced-poly}.  It follows that $\rpoly$ is the reduced polynomial that results from step ii) of~\Cref{def:reduced-poly}, and hence that~\Cref{lem:tidb-reduce-poly} carries over immediately to \abbrOneBIDB polynomials: $\expct_{\rvworld\sim\bpd'}\pbox{\poly\inparen{\rvworld}} = \rpoly\inparen{\vct{\prob}}$.
\AH{@atri, not sure if $\bpd'$ should be $\bpd''$ (in the above expectation) as discussed below.  Since $\bpd'\equiv\bpd''$, the proof still holds for~\Cref{lem:tidb-reduce-poly}, but maybe it is important to use $\bpd''$ to drive the point home that we iterate over the all worlds set (as opposed to the set of possible worlds) when computing the expectation of a polynomial.  Or maybe it suffices to note that $\bpd'\equiv\bpd''$.}
}
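
As a sanity check of the expectation identity for \abbrOneBIDB polynomials, the following worked example uses a hypothetical instance with $\bound = 2$ and a single tuple $\tup$, and assumes that the world in which $\tup$ is absent receives the remaining probability mass $1 - \prob_{\tup, 1} - \prob_{\tup, 2}$. The base-case polynomial is $\poly = X_{\tup, 1} + 2X_{\tup, 2}$, and summing over the three possible worlds gives
\[
\expct_{\rvworld\sim\bpd'}\pbox{\poly\inparen{\rvworld}} = 0\cdot\inparen{1 - \prob_{\tup, 1} - \prob_{\tup, 2}} + 1\cdot\prob_{\tup, 1} + 2\cdot\prob_{\tup, 2} = \prob_{\tup, 1} + 2\prob_{\tup, 2}.
\]
For the product $\poly\cdot\poly = X_{\tup, 1}^2 + 4X_{\tup, 1}X_{\tup, 2} + 4X_{\tup, 2}^2$, the cross term vanishes in every possible world by disjointness, and $X_{\tup, j}^2 = X_{\tup, j}$ because $X_{\tup, j}\in\inset{0, 1}$, so
\[
\expct_{\rvworld\sim\bpd'}\pbox{\poly\inparen{\rvworld}^2} = \prob_{\tup, 1} + 4\prob_{\tup, 2},
\]
which is the value obtained by evaluating $X_{\tup, 1} + 4X_{\tup, 2}$ at $\vct{\prob}$.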
%In this paper, we focus on two popular forms of \abbrPDB\xplural: Block-Independent (\bi) and Tuple-Independent (\ti) \abbrPDB\xplural.
%%
%A \bi $\pdb$ is a \abbrPDB with the constraint that
%%(i) every tuple $\tup_i$ is annotated with a unique random variable $\randWorld_i \in \{0, 1\}$ and (ii) that
%the tuples in $\dbbase$ can be partitioned into a set of $\ell$ blocks such that tuples $\tup_{i, j}, \tup_{k, j'}$ from separate blocks $(i\neq k)$ are independent of each other while tuples $\tup_{i, j}, \tup_{i, k}$ from the same block are disjoint events.\footnote{
%  Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block~\cite{DBLP:series/synthesis/2011Suciu}, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function).  For $t_{i, j} \in b_i$, the event $(\randWorld_{i,j} = 1)$ corresponds to the event $(\randWorld_i = j)$ in the customary annotation scheme.
%}
%Each tuple $\tup_{i, j}$ is annotated with a random variable $\randWorld_{i, j} \in \{0, 1\}$ denoting its presence in a possible world $\db$.  The probability distribution $\pd$ over $\dbbase$ is the one induced from individual tuple probabilities $\prob_{i, j}\in \vct{\prob}=\inparen{\prob_{1, 1},\ldots,\prob_{\abs{\block},\ldots,\abs{\block_{\abs{\block}}}}}$ (where $\forall i$, $\sum_j p_{i,j}\le 1$) and the conditions on the blocks.  A \abbrTIDB is a \abbrBIDB where each block has size exactly $1$.


Instead of looking only at the possible worlds of $\pdb$, one can consider all worlds, including those that cannot exist due to disjointness.  The all worlds set can be modeled by $\worldvec\in \{0, 1\}^{\bound\numvar}$,\footnote{Here and later, especially in \Cref{sec:algo}, we will rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.} such that $\worldvec_{\tup, j} \in \worldvec$ represents whether or not the multiplicity of $\tup$ is $j$.%(where $k = \sum_{\ell = 1}^{i - 1} \abs{b_\ell} + j$).  
We denote a probability distribution over all $\worldvec \in \{0, 1\}^{\bound\numvar}$ as $\bpd''$.  When $\bpd''$ is the one induced from each $\prob_{\tup, j}$ while assigning $\probOf\pbox{\worldvec} = 0$ for any $\worldvec$ with $\worldvec_{\tup, j} = \worldvec_{\tup, j'} = 1$ for $j\neq j'$, each $\worldvec$ that respects disjointness corresponds to exactly one possible world of $\pdb'$ and has the same probability under $\bpd''$ as that world has under $\bpd'$, so the two distributions are equivalent.
\Cref{subsec:supp-mat-ti-bi-def} has more details. 
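
For a concrete picture, take a hypothetical single-tuple instance with $\bound = 2$, so that $\worldvec = \inparen{\worldvec_{\tup, 1}, \worldvec_{\tup, 2}}\in\{0, 1\}^2$. Assuming the world in which $\tup$ is absent receives the remaining probability mass, $\bpd''$ assigns
\[
\probOf\pbox{(0,0)} = 1 - \prob_{\tup, 1} - \prob_{\tup, 2},\quad
\probOf\pbox{(1,0)} = \prob_{\tup, 1},\quad
\probOf\pbox{(0,1)} = \prob_{\tup, 2},\quad
\probOf\pbox{(1,1)} = 0.
\]
The first three vectors are exactly the possible worlds of the reduced block, with the same probabilities as under $\bpd'$; the fourth violates disjointness and contributes nothing to any expectation.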


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: