%root: main.tex
%!TEX root=./main.tex
\onecolumn
\section{Query translation into polynomials}
%\AH{This section will involve the set of queries (RA+) that we are interested in, the probabilistic/incomplete models we address, and the outer aggregate functions we perform over the output \textit{annotation}
%1) RA notation
%2) DB (TIDB) notation
%3) How queries translate into polynomials
%}
\subsection{Introduction}


An incomplete database $\idb$ is a set of deterministic databases $\db$, each of which is called a possible world.  %Since $\idb$ is modeling all the possible worlds of an uncertain database, it follows that each $\db \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$ (albeit not equivalent across all instances), whose schemas $(\sch(\rel_i))$are unchanging across each $\db_j$.  
Denote the schema of $\db$ as $\sch(\db)$.  For the set of possible worlds $\wSet$, i.e., the set of all $\db_i \in \idb$, define an injective mapping from $\wSet$ to the set $\{0, 1\}^M$, so that each vector $\vct{w} \in \{0, 1\}^M$ is the image of at most one element $\db_i \in \idb$.  When $\idb$ is a probabilistic database, it can be viewed as a pair $(\wSet, \pd)$, where $\wSet$ is the set of possible worlds as above and $\pd$ is a probability distribution over $\wSet$.  

The possible worlds semantics gives a framework for running queries over $\idb$.  A query $\query$ is run deterministically over each $\db \in \idb$, and the output $\query(\idb)$ is defined as the set of results (worlds) obtained in this way.  Formally,
\[\query(\idb) = \{\query(\db) \mid \db \in \idb\}.\]


\subsection{Modeling and Semantics}
Define $\vct{X}$ to be the variables $X_1,\dots,X_M$.  Let the set of all tuples in domain $\mathbb{D}$ be $\tset$.


\subsubsection{K-relations}\label{subsubsec:k-rel}
A K-relation~\cite{DBLP:conf/pods/GreenKT07} is a relation in which each tuple is annotated with an element of a commutative semiring, denoted $\{K, \oplus, \otimes, \mathbbold{0}, \mathbbold{1}\}$.  In a commutative semiring, the operators $\oplus$ and $\otimes$ are associative and commutative, $\otimes$ distributes over $\oplus$, $\mathbbold{0}$ is the identity of $\oplus$, $\mathbbold{1}$ is the identity of $\otimes$, and $\mathbbold{0}$ annihilates every element of $K$ under $\otimes$.  The information encoded in the annotation depends on the underlying semiring of the relation.  
As noted in \cite{DBLP:conf/pods/GreenKT07}, the $\mathbb{N}[\vct{X}]$-semiring is the semiring over the set $\mathbb{N}[\vct{X}]$ of all polynomials over $\vct{X}$ with coefficients in $\mathbb{N}$.  Its variables can be substituted with values from another semiring $K$, with the operators of the polynomial interpreted as the operators of $K$, to produce varying semantics such as set, bag, and security annotations.  

When used with $\mathbb{B}$-typed variables, an $\mathbb{N}[\vct{X}]$ relation is effectively a C-Table \cite{DBLP:conf/pods/GreenKT07}, since all first order formulas can be equivalently modeled by polynomials, where disjunction corresponds to the addition operator and conjunction corresponds to the multiplication operator.

Using $\mathbb{B}$-typed variables in an $\mathbb{N}[\vct{X}]$ relation corresponds to substituting values and operators from the $\{\mathbb{B}, \vee, \wedge, \bot, \top\}$ semiring.
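
For instance (an illustrative polynomial chosen for this example, not taken from a particular query), let a tuple be annotated with $X_1 X_2 + X_3$.  Substituting the $\mathbb{B}$-values $X_1 = \top$, $X_2 = \bot$, $X_3 = \top$, and interpreting $+$ as $\vee$ and $\times$ as $\wedge$, gives
\[(\top \wedge \bot) \vee \top = \top,\]
so the tuple is present under set semantics.  Substituting the $\mathbb{N}$-values $X_1 = 2$, $X_2 = 3$, $X_3 = 1$ with the usual arithmetic gives
\[2 \cdot 3 + 1 = 7,\]
which is the multiplicity of the tuple under bag semantics.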

Further define $\nxdb$ as an $\mathbb{N}[\vct{X}]$ database where each tuple $\tup \in \db$ is annotated with a polynomial over variables $X_1,\ldots, X_M$ for some value of $M$ that will be specified later.  
Since $\nxdb$ is a database that maps tuples to polynomials, it is customary to view an arbitrary table $\rel$ as a function $\rel: \tset \to \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial annotating tuple $\tup$.

It has been shown in previous work~\cite{DBLP:conf/pods/GreenKT07} that commutative semirings precisely model translations of RA+ query operations to annotations.  
The evaluation semantics notation $\llbracket \cdot \rrbracket = x$ simply means that the result of evaluating the expression $\cdot$ is given by following the semantics $x$.  Given a query $\query$, the operations in $\query$ are translated into the following polynomial expressions.

\begin{align*}
&\eval{\project_A(\rel)}(\tup) = &&\sum_{\tup': \project_A(\tup') = \tup} \eval{\rel}(\tup')\\
&\eval{(\rel_1 \union \rel_2)}(\tup) = &&\eval{\rel_1}(\tup) + \eval{\rel_2}(\tup)\\
&\eval{(\rel_1 \join \rel_2)}(\tup) = &&\eval{\rel_1}(\project_{\sch(\rel_1)}(\tup)) \times \eval{\rel_2}(\project_{\sch(\rel_2)}(\tup))	\\
&\eval{\select_\theta(\rel)}(\tup) = &&\begin{cases}
					\eval{\rel}(\tup)	&\text{if }\theta(\tup) = 1\\
					0		&\text{otherwise}.
				\end{cases}\\
&\eval{\rel}(\tup) = &&\rel(\tup)
\end{align*}

The above semantics show us how to obtain the annotation on a tuple in the result of query $\query$ from the annotations on the tuples in the input of $\query$.
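
As a worked example (the relations and their contents are hypothetical and serve only to illustrate the rules), let $\rel_1$ have schema $(A, B)$ with $\rel_1(a, b) = X_1$ and $\rel_1(a, b') = X_2$, and let $\rel_2$ have schema $(B, C)$ with $\rel_2(b, c) = X_3$ and $\rel_2(b', c) = X_4$; every other tuple is annotated with $0$.  Applying the join rule and then the projection rule gives
\begin{align*}
\eval{(\rel_1 \join \rel_2)}(a, b, c) &= X_1 \times X_3,\\
\eval{(\rel_1 \join \rel_2)}(a, b', c) &= X_2 \times X_4,\\
\eval{\project_A(\rel_1 \join \rel_2)}(a) &= X_1 X_3 + X_2 X_4.
\end{align*}
The sum in the last line arises because both join results project onto the same output tuple $(a)$.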

\subsection{Defining the Data}

In the general case, the binary value of $\vct{w}$ uniquely identifies a potential possible world.  For example, consider the Tuple Independent Database $(\ti)$ data model, in which each table is a set of mutually independent tuples, and each tuple $\tup$ occurs with its own probability $\prob_\tup$.  Because of independence, a $\ti$ with $\numTup$ tuples has $2^\numTup$ possible worlds; thus $\numTup = M$, and the injective mapping to $\{0, 1\}^M$ is trivial.  In the Block Independent Disjoint data model (BIDB), however, tuples within the same block are disjoint, so a BIDB may have fewer than $2^M$ possible worlds.  Any $\vct{w}$ that does not correspond to a possible world is assigned a probability of $0$.  

Denote by $\rw$ a random variable that selects a world according to the distribution $\pd$.  Provided that $\pd[\rw = \vct{w}] = 0$ for every $\vct{w} \in \{0, 1\}^M$ that does not correspond to a possible world, a probability distribution over $\{0, 1\}^M$ is also a distribution over $\wSet$, which we have already defined as $\pd$.  
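
As a small illustration (a hypothetical instance, assuming the usual BIDB convention that at most one tuple of a block is present in any world), consider a single block containing two tuples $\tup_1$ and $\tup_2$ with probabilities $\prob_1$ and $\prob_2$, where $\prob_1 + \prob_2 \leq 1$.  Here $M = 2$, but only three of the four vectors in $\{0, 1\}^2$ correspond to possible worlds:
\begin{align*}
\pd[\rw = (0, 0)] &= 1 - \prob_1 - \prob_2, & \pd[\rw = (1, 0)] &= \prob_1,\\
\pd[\rw = (0, 1)] &= \prob_2, & \pd[\rw = (1, 1)] &= 0.
\end{align*}
The vector $(1, 1)$ would place both disjoint tuples in the same world, so it receives probability $0$.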

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%This could be a way to think of world binary vectors in the general case
%Let $\vct{w}$ be a $\left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil = \numTup$ binary bit vector, uniquely identifying possible world $\db_i \in \idb$. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Assume a domain of $\{0, 1\}$ for each $X_i \in \vct{X}$.  Since, from this point on, our discussion involves a single polynomial for an arbitrary $\tup$, we abuse notation and use $\poly(\vct{X})$ to denote the annotated polynomial $\llbracket\poly(\db)\rrbracket(\tup)$, where the injective mapping maps $\db$ to $\vct{X}$.  

One of the aggregates we desire to compute over the annotated polynomial is the expectation over possible worlds, denoted,

\AH{With our notation, I no longer think that $\vct{w} \sim \pd$ is necessary footer for $\expct$.  We can probably just have $\expct\limits_{\vct{w}}$ instead.  Do you agree?}
\AR{No. How would you state Lemma 4 without explicitly using $P$ in the definition of expectation?}

\[\expct_{\rw \sim \pd}\pbox{\poly(\rw)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \pd[\rw = \wVec].\]

For a $\ti$, the bit string $\vct{w}$ can be used as an index that determines which tuples are present in world $\vct{w}$: the $i^{th}$ bit indicates whether tuple $\tup_i$ appears in the unique world identified by the binary value of $\vct{w}$.  Given a vector $\vct{p}$ of size $\numTup$ whose $i^{th}$ element $\prob_i$ is the probability of the $i^{th}$ tuple, denote by $\pd^{(\vct{p})}$ the probability distribution $\pd$ induced by $\vct{p}$.  We can then write an equivalent expectation for the $\ti$ model,

\[\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 0}}\left(1 - \prob_i\right).\]
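To make the formula concrete (a small illustrative instance; the polynomial is chosen for this example only), take $\numTup = 2$ and $\poly(X_1, X_2) = X_1 + X_1 X_2$.  Since $\poly$ vanishes whenever $\wElem_1 = 0$, only the worlds $(1, 0)$ and $(1, 1)$ contribute:
\begin{align*}
\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} &= \poly(1, 0)\cdot \prob_1\left(1 - \prob_2\right) + \poly(1, 1)\cdot \prob_1\prob_2\\
&= 1 \cdot \prob_1\left(1 - \prob_2\right) + 2 \cdot \prob_1\prob_2\\
&= \prob_1 + \prob_1\prob_2.
\end{align*}
This agrees with what linearity of expectation and tuple independence give directly, namely the expectation of $X_1$ plus the expectation of $X_1 X_2$, which is $\prob_1 + \prob_1\prob_2$.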