
%root: main.tex
%!TEX root=./main.tex

\section{Query translation into polynomials}
%\AH{This section will involve the set of queries (RA+) that we are interested in, the probabilistic/incomplete models we address, and the outer aggregate functions we perform over the output \textit{annotation}
%1) RA notation
%2) DB (TIDB) notation
%3) How queries translate into polynomials
%}
\subsection{Introduction}


An incomplete database $\idb$ is a set of deterministic databases $\db_i$, each of which is called a possible world.  Because $\idb$ models all possible worlds of an uncertain database, every $\db_i \in \idb$ contains the same named set of relations $\{\rel_1,\ldots, \rel_n\}$ (whose contents may differ from world to world), and the schema $\sch(\rel_i)$ of each relation is the same in every $\db_j$.  Let $\wSet$ denote the set of possible worlds, i.e., the set of all $\db_i \in \idb$.  We fix an injective mapping from $\wSet$ to $\{0, 1\}^M$, so that each vector $\vct{w} \in \{0, 1\}^M$ is the image of at most one $\db_i \in \idb$.  When $\idb$ is a probabilistic database, it can be viewed as a pair $(\wSet, \pd)$, where $\wSet$ is the set of possible worlds and $\pd$ is a probability distribution over $\wSet$.
%Below may possibly need to be used again...we'll see.
%probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. \AR{I'm not sure why you are using the notation $\mathcal{A}$ and $P$, which you do not seem to use beyond this section. I would recommend that you only introduce a notation if you plan to use them later on.} Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$. \AR{If you want to use $(\wSet,P)$ make sure you use the same notation in Sec 1.3 as well. If not, then use the notation from Sec 1.3 here}

\subsection{Modeling and Semantics}
Let $\vct{X}$ denote the variables $X_1,\dots,X_M$.
Further define $\idb$ as an $\mathbb{N}[\vct{X}]$ database,\AR{There is a type error here: $\idb$ has already been defined as a PDB-- while here we are talking about an annotated DB: they are technically not the same thing so you cannot use the same notation. $\idb$ is used heavily in this sub-section so this change needs to be propagated. Am not sure if there is a standard notation-- if not $D(\vct{X})$ should work fine.} i.e., an incomplete/probabilistic database model in which each tuple $\tup \in \idb$ is annotated with a polynomial over the variables $X_1,\ldots, X_M$, for some value of $M$ that will be specified later.  Intuitively, one can think of $\idb$ as a parameterized database whose abstract form maps to each deterministic $\db_i \in \idb$.\AR{There is no need to connect back to possible world etc. in this sub-section.}

Since $\idb$ is a database that maps tuples to polynomials, it is customary to view an arbitrary table $\rel$ as a function $\rel: \tup \in \idb \mapsto \mathbb{N}[\vct{X}]$,\AR{function notation is always a map from domain to range. Also you need a notation for set of all tuples.} where $\rel(\tup)$ denotes the polynomial that annotates tuple $\tup$.

It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to set annotations.  Since $\idb$ is an $\mathbb{N}[\vct{X}]$ database, we are then working with the commutative semiring $\{\mathbb{N}[\vct{X}], +, \times, 0, 1\}$. %, where $\mathbb{N}[\vct{X}]$ is the set from which all annotations originate.  


Given a query $\query$, operations in $\query$ are translated into the following polynomial operations.
\AR{Explicitly mention what $\llbracket \cdot \rrbracket$ notation means.}


\begin{align*}
&\llbracket\project_A(\rel)\rrbracket(\tup) = &&\sum_{\tup' \text{ s.t. } \tup'[A] = \tup} \rel(\tup')\\
&\llbracket (\rel_1 \union \rel_2)\rrbracket(\tup) = &&\rel_1(\tup) + \rel_2(\tup)\\
&\llbracket(\rel_1 \join \rel_2)\rrbracket(\tup) = &&\rel_1(\tup[\sch(\rel_1)]) \times \rel_2(\tup[\sch(\rel_2)])\\
&\llbracket\select_\theta(\rel)\rrbracket(\tup) = &&\begin{cases}
					\rel(\tup)	&\text{if }\theta(\tup) = 1\\
					0		&\text{otherwise}.
				\end{cases}
\end{align*}
\AR{You should have the base case of the reduction explicitly stated as well-- i.e. what the poly of a tuple is. Also, in the RHS of the equality should also have the evaluation notation. Finally why is the join not just the product of $R_1(t)$ and $R_2(t)$, or more precisely $\llbracket R_1\rrbracket(t)\times \llbracket R_2\rrbracket(t)$?}
Each query operation is translated into one of the two semiring operators: $\project$ and $\union$ of agreeing tuples correspond to the $+$ operator in the polynomial $\poly$, and $\join$ corresponds to the $\times$ operator.  Finally, $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$, depending on whether the predicate $\theta$ holds for $\tup$.
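
As a small worked illustration (a sketch only: the relation contents, the constants $a, b, c, d$, and the assignment of the variables $X_1,\ldots,X_4$ are hypothetical, and we assume that the rules above compose recursively over subqueries), let $\rel_1$ have schema $(A, B)$ and $\rel_2$ have schema $(B, C)$, with every tuple not listed below annotated with $0$:
\begin{align*}
&\rel_1((a, b)) = X_1, &&\rel_1((a, c)) = X_2,\\
&\rel_2((b, d)) = X_3, &&\rel_2((c, d)) = X_4.
\end{align*}
Applying the join rule and then the projection rule gives
\begin{align*}
&\llbracket(\rel_1 \join \rel_2)\rrbracket((a, b, d)) = &&X_1 \times X_3\\
&\llbracket(\rel_1 \join \rel_2)\rrbracket((a, c, d)) = &&X_2 \times X_4\\
&\llbracket\project_A(\rel_1 \join \rel_2)\rrbracket((a)) = &&X_1 X_3 + X_2 X_4.
\end{align*}
The two join results agree on attribute $A$, so the projection sums their annotations into a single polynomial.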


\subsection{Defining the Data}

Assume a bijective mapping between the polynomial variables $X_1,\ldots, X_M$ and the bit positions of the vectors in $\{0, 1\}^M$.  In the general case, the binary value of $\vct{w}$ uniquely identifies a potential possible world.  For example, in the Tuple Independent Database $(\ti)$ data model, there are $\numTup$ tuples, which yield $2^\numTup$ possible worlds; thus $\numTup = M$, and each $\vct{w} \in \{0, 1\}^M$ is indeed a possible world.  In the Block Independent Disjoint data model, however, the disjointness condition on tuples within the same block means that not every $\vct{w} \in \{0, 1\}^M$ is a possible world.  Let $\rw$ denote a random world.  Provided that $\pd[\rw = \vct{w}] = 0$ for every $\vct{w} \in \{0, 1\}^M$ that is not a possible world, a probability distribution over $\{0, 1\}^M$ induces a distribution over $\wSet$, which we have already defined as $\pd$.
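
To make the last condition concrete, consider a hypothetical Block Independent Disjoint instance (our own illustration, with assumed probabilities $\prob_1, \prob_2$ satisfying $\prob_1 + \prob_2 \leq 1$) that consists of a single block with two tuples, so that $M = 2$.  Since the two tuples are mutually exclusive, $\vct{w} = (1, 1)$ is not a possible world, and one consistent choice of distribution is
\begin{align*}
&\pd[\rw = (1, 0)] = \prob_1, &&\pd[\rw = (0, 1)] = \prob_2,\\
&\pd[\rw = (0, 0)] = 1 - \prob_1 - \prob_2, &&\pd[\rw = (1, 1)] = 0.
\end{align*}
The four values sum to $1$, and all of the probability mass sits on the three vectors that correspond to possible worlds.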

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%This could be a way to think of world binary vectors in the general case
%Let $\vct{w}$ be a $\left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil = \numTup$ binary bit vector, uniquely identifying possible world $\db_i \in \idb$. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Since $\llbracket\poly(\rel)\rrbracket(\tup)$ can be viewed as a function $\{0, 1\}^M \to \mathbb{N}$, and since from this point on our discussion involves a single polynomial for an arbitrary $\tup$, we abuse notation and write $\poly(\vct{w})$ to mean $\llbracket\poly(\rel)\rrbracket(\tup)(\vct{w})$.
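
For instance (a hypothetical annotation chosen only for illustration), if $M = 4$ and the polynomial of $\tup$ is $X_1 X_3 + X_2 X_4$, then for the world $\vct{w} = (1, 0, 1, 1)$ we substitute the $i^{\text{th}}$ bit of $\vct{w}$ for $X_i$ and obtain
\[\poly(\vct{w}) = 1 \cdot 1 + 0 \cdot 1 = 1,\]
which, under bag semantics, can be read as the multiplicity of $\tup$ in that world.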

One of the aggregates we wish to compute over the annotation polynomial is its expectation, defined as

\AH{With our notation, I no longer think that $\vct{w} \sim \pd$ is necessary footer for $\expct$.  We can probably just have $\expct\limits_{\vct{w}}$ instead.  Do you agree?}
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^M} \poly(\wVec)\cdot \pd[\rw = \wVec].\]


A $\ti$ is a database model in which each table is a set of tuples, each of which is independent of the others and occurs with its own probability $\prob_\tup$.

There are features of $\ti$ that we can exploit.  Because of independence, a $\ti$ with $\numTup$ tuples has $2^\numTup$ possible worlds, each of which can be conveniently represented by an $\numTup$-bit string.  Since $\wSet$ corresponds exactly to the powerset of $[\numTup]$, the bit string $\vct{w}$ can be used as an index that determines which tuples are present in world $\vct{w}$.  Given a vector $\vct{p}$ of length $\numTup$, whose $i^{\text{th}}$ element $\prob_i$ is the probability of the $i^{\text{th}}$ tuple, we can write an equivalent expectation for $\ti$ models,

\[\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 0}}\left(1 - \prob_i\right).\]
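
As a sanity check of the formula above, consider a hypothetical instance (not taken from the text) with $\numTup = 2$ and $\poly(\wVec) = \wElem_1 + \wElem_1\wElem_2$.  Only the worlds $(1, 0)$ and $(1, 1)$ contribute, with $\poly$ evaluating to $1$ and $2$ respectively, so
\begin{align*}
\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} &= 1 \cdot \prob_1\left(1 - \prob_2\right) + 2 \cdot \prob_1\prob_2\\
&= \prob_1 + \prob_1\prob_2,
\end{align*}
which agrees with what linearity of expectation and independence of the two tuples would give term by term.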