
%root: main.tex
%!TEX root=./main.tex

\section{Query translation into polynomials}
%\AH{This section will involve the set of queries (RA+) that we are interested in, the probabilistic/incomplete models we address, and the outer aggregate functions we perform over the output \textit{annotation}
%1) RA notation
%2) DB (TIDB) notation
%3) How queries translate into polynomials
%}
\subsection{Introduction}


An incomplete database $\idb$ is a set of deterministic databases $\db_i$, each of which is called a possible world.  Since $\idb$ models all possible worlds of an uncertain database, every $\db_i \in \idb$ has the same set of named relations $\{\rel_1,\ldots, \rel_n\}$ (although the instances of these relations may differ between worlds), and the schema $\sch(\rel_i)$ of each relation is the same in every $\db_j$.  When $\idb$ is a probabilistic database, $\idb$ can be viewed as a pair $(\wSet, \pd)$, where $\wSet$ is the set of possible worlds and $\pd$ is a probability distribution over $\wSet$.
%Below may possibly need to be used again...we'll see.
%probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. \AR{I'm not sure why you are using the notation $\mathcal{A}$ and $P$, which you do not seem to use beyond this section. I would recommend that you only introduce a notation if you plan to use them later on.} Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$. \AR{If you want to use $(\wSet,P)$ make sure you use the same notation in Sec 1.3 as well. If not, then use the notation from Sec 1.3 here}

\subsection{Modeling and Semantics}

We further model $\idb$ as an $\mathbb{N}[\vct{X}]$ database, i.e., an incomplete/probabilistic database model in which each tuple $\tup \in \idb$ is annotated with a polynomial with natural-number coefficients over variables $X_1,\ldots, X_M$, for some value of $M$ that will be specified later.  Intuitively, one can think of $\idb$ as a parameterized database whose abstract form maps to each deterministic $\db_i \in \idb$.

Since $\idb$ maps tuples to polynomials, it is customary to view an arbitrary table $\rel$ as a function that maps each tuple $\tup$ to a polynomial in $\mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial annotating tuple $\tup$.

Previous work has shown that commutative semirings precisely model how RA+ query operations act on tuple annotations.  Since $\idb$ is an $\mathbb{N}[\vct{X}]$ database, we work with the commutative semiring $(\mathbb{N}[\vct{X}], +, \times, 0, 1)$, where $\mathbb{N}[\vct{X}]$ is the set from which all annotations are drawn.


Given a query $\query$, operations in $\query$ are translated into the following polynomial operations.


\begin{align*}
&\llbracket\project_A(\rel)\rrbracket(\tup) = &&\sum_{\tup' \text{ s.t. } \tup'[A] = \tup} \rel(\tup')\\
&\llbracket (\rel_1 \union \rel_2)\rrbracket(\tup) = &&\rel_1(\tup) + \rel_2(\tup)\\
&\llbracket(\rel_1 \join \rel_2)\rrbracket(\tup) = &&\rel_1(\tup[\sch(\rel_1)]) \times \rel_2(\tup[\sch(\rel_2)])	\\
&\llbracket\select_\theta(\rel)\rrbracket(\tup) = &&\begin{cases}
					\rel(\tup)	&\text{if }\theta(\tup) = 1\\
					0		&\text{otherwise}.
				\end{cases}
\end{align*}
Each query operation is translated into one of the two semiring operators or into a case distinction: $\project$ and $\union$ add the annotations of agreeing tuples using the $+$ operator in the polynomial $\poly$, $\join$ multiplies annotations using the $\times$ operator, and $\select$ is modeled as a function that returns either $\rel(\tup)$ or $0$, depending on whether the predicate $\theta$ holds for $\tup$.
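
As a small illustration of these rules, consider two hypothetical relations, chosen only for this example: $\rel_1$ with schema $(A, B)$ and annotations $\rel_1(a, b_1) = X_1$ and $\rel_1(a, b_2) = X_2$, and $\rel_2$ with schema $(B)$ and annotations $\rel_2(b_1) = X_3$ and $\rel_2(b_2) = X_4$, with every other tuple annotated by $0$.  For the query $\query = \project_A(\rel_1 \join \rel_2)$, the join rule gives
\begin{align*}
\llbracket\rel_1 \join \rel_2\rrbracket(a, b_1) &= X_1 \times X_3, &
\llbracket\rel_1 \join \rel_2\rrbracket(a, b_2) &= X_2 \times X_4,
\end{align*}
and the projection rule sums the annotations of the two join results that agree with $(a)$ on attribute $A$:
\[\llbracket\query\rrbracket(a) = X_1X_3 + X_2X_4.\]
Setting every $X_i$ to $1$ yields $2$, which is the multiplicity of $(a)$ in the query result when all four input tuples are present.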


\subsection{Defining the Data}
\AR{This is how this subsection should be structured. First you should connect the variables $X_1,\dots,X_M$ to $W$. Basically say that a vector in $\{0,1\}^M$ (so we only assign binary values to the $M$ variables) corresponds to a {\em potential} world $\vct{w}$ (for TIDB $N=M$ and there is a one-to-one correspondence between $W$ and $\{0,1\}^M$ but for say BI not every vector in $\{0,1\}^M$ would correspond to a world-- some of them would not correspond to any world). Then a probability distribution over $\{0,1\}^M$ implies a distribution over $W$, which is how you connect back to the $P$ from Section 1.1. More specific comments follow.}

Define $\pd$ to be the probability distribution for $\idb$. \AR{You should connect $\pd$ back to the $P$ from Section 1.1}  Let $\vct{w}$ be a binary vector of length $\left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil = \numTup$ that uniquely identifies a possible world $\db_i \in \idb$. \AR{The correspondence between $W$ and $\{0,1\}^N$ belongs to Sec 1.1}  Let $\prob(X_i)$ $\left(\prob(\vct{X})\right)$ denote the probability that a given variable (set of variables) occur(s). \AR{This sentence has many issues: (1) the variables $X_1,\dots,X_M$ are just there-- it does not make sense to say if they ``occur''; (2) The probability should have $\pd$ explicitly in it and (3) $p(\cdot)$ conflicts with the $p$ that we will use in TIDB.
Here is my suggestion to fix this. First we need a notation for a {\em random} world. We are already using $\vct{w}$ to denote a {\em specific} world. So for now let's say we use $\overline{\vct{w}}$ to denote the random variable. Then to denote the probability that the randomly chosen $\overline{\vct{w}}$ is $\vct{w}$ use the notation $\text{Pr}_{\overline{\vct{w}}\sim\pd}[\overline{\vct{w}}=\vct{w}] $. I would like to stress that $\overline{\vct{w}}$ is just a suggestion-- there is probably a better notation for the random variable. {\bf Propagate} this notation change.}  We can substitute $\wVec$ for $\vct{X}$, binding the $i^{th}$ bit of $\wVec$ to its corresponding variable $X_i$; it follows that $\prob(\wVec)$ denotes the probability that world $\wVec$ occurs.
We write $\vct{w} \sim \pd$ to indicate that $\vct{w}$ is distributed according to $\pd$, i.e., that world $\vct{w}$ occurs with probability $\prob(\vct{w})$.
One of the aggregates we wish to compute over the polynomial $\poly(X_1,\ldots, X_\numTup)$ is its expectation, denoted
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \prob(\wVec).\]
\AR{With the notation change above, the above should be re-written as
\[\expct_{\overline{\wVec} \sim \pd}\pbox{\poly(\overline{\wVec})} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \mathrm{Pr}_{\overline{\wVec}\sim\pd}[\overline{\wVec}=\wVec].\]
}
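
As a small worked illustration of this definition, take $\numTup = 2$ and the hypothetical polynomial $\poly(X_1, X_2) = X_1X_2 + X_1$, and assume, purely for this example, a distribution $\pd$ with $\prob((1,0)) = \prob((0,1)) = \tfrac{1}{2}$ and $\prob((0,0)) = \prob((1,1)) = 0$, so that the two bits are correlated.  Evaluating $\poly$ on each $\wVec \in \{0,1\}^2$ gives $\poly(0,0) = 0$, $\poly(0,1) = 0$, $\poly(1,0) = 1$, and $\poly(1,1) = 2$, hence
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = 0\cdot 0 + 0\cdot\tfrac{1}{2} + 1\cdot\tfrac{1}{2} + 2\cdot 0 = \tfrac{1}{2},\]
where the four terms correspond to the worlds $(0,0)$, $(0,1)$, $(1,0)$, and $(1,1)$, in that order.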

A specific probabilistic data model is the Tuple Independent Database (\ti).  In this model each table is a set of tuples that are independent of one another, and each tuple $\tup$ occurs with its own probability $\prob_\tup$.

There are features of $\ti$ that we can exploit.  Note that a $\ti$ with $\numTup$ tuples naturally has $2^\numTup$ possible worlds, each of which can be conveniently modeled by an $\numTup$-bit string.  The bit string $\wVec$ can be used as an index to determine which tuples are present in world $\wVec$: the $i^{th}$ tuple is present exactly when the $i^{th}$ bit is $1$.  Given a vector $\vct{p}$ of length $\numTup$, whose $i^{th}$ element $\prob_i$ is the probability of the $i^{th}$ tuple, we can then write an equivalent expectation for $\ti$ models,

\[\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 0}}\left(1 - \prob_i\right).\]
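
To see how this sum unfolds, consider a sketch with $\numTup = 2$ and the hypothetical polynomial $\poly(X_1, X_2) = X_1X_2 + X_1$, which is chosen only for illustration.  Enumerating the four worlds $(0,0)$, $(0,1)$, $(1,0)$, and $(1,1)$ in that order gives
\begin{align*}
\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)}
&= 0\cdot(1-\prob_1)(1-\prob_2) + 0\cdot(1-\prob_1)\prob_2 + 1\cdot\prob_1(1-\prob_2) + 2\cdot\prob_1\prob_2\\
&= \prob_1 + \prob_1\prob_2.
\end{align*}
For instance, under the assumed values $\prob_1 = \prob_2 = \tfrac{1}{2}$ the expectation evaluates to $\tfrac{3}{4}$.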