
%root: main.tex
%!TEX root=./main.tex

\section{Query translation into polynomials}
%\AH{This section will involve the set of queries (RA+) that we are interested in, the probabilistic/incomplete models we address, and the outer aggregate functions we perform over the output \textit{annotation}
%1) RA notation
%2) DB (TIDB) notation
%3) How queries translate into polynomials
%}
\subsection{Intro}
An incomplete database $\idb$ is a set of deterministic databases $\db_i$, each of which is known as a possible world.  Since $\idb$ models all possible worlds of an uncertain database, every $\db_i \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$, with the same schemas; only the contents of a relation may differ from one world to another.  When $\idb$ is a probabilistic database, it can be viewed as having two components: the set of possible worlds and a probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set.  Since the set of possible outcomes is the set of possible worlds, $\wSet$, and every set of outcomes is an event, we simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$.
\subsection{Modeling and Semantics}
More generally, $\idb$ can be viewed as a set of relations $\{\prel_1,\ldots, \prel_n\}$, where each $\prel_j$ consists of all tuples appearing in $\rel_j$ in any of the possible worlds $\db_i \in \idb$.  Each tuple is annotated with a provenance polynomial from $\mathbb{N}[X]$, where $X$ is the set of variables of $\idb$.  One can think of $\idb$ as a parameterized database whose abstract form maps to a deterministic $\db_i \in \idb$ once the variables of $\idb$ are bound by a valuation.

Note that the polynomial annotation of an arbitrary tuple $\tup$ can be viewed as a function $\poly(X_1,\ldots, X_\numTup)$: binding the variables to a specific valuation determines the value of $\tup$'s annotation.  Alternatively, the annotation of $\tup$ can be viewed as an element of the image of $\query(\prel)$, where the relation $\query(\prel)$ is treated as a function whose domain is the set of all tuples in $\query(\prel)$, such that $\query(\prel)(\tup) = \poly(X_1,\ldots, X_\numTup)$.  Further, it is known that commutative semirings aptly model the translation of query operations into operations on tuple annotations, i.e., polynomials.
To make things more concrete, consider the bag semiring $(\mathbb{N}, +, \times, 0, 1)$.  Here tuple annotations (the computed polynomials) take values in the natural numbers.  Query operations are translated into the two semiring operators: $\project$ and $\union$ of agreeing tuples correspond to the $+$ operator in the polynomial $\poly$, $\join$ corresponds to the $\times$ operator, and $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$, depending on whether $\tup$ satisfies the selection predicate.

For a general commutative semiring, we denote the addition and multiplication operators by $\oplus$ and $\otimes$ respectively, and $\sum$ denotes iterated application of $\oplus$.  Operations in $\query$ are translated into the following operations on annotations.

%\OK{
%  Eventually, you probably want a little more background here, depending on the query notation you choose to use.  The simplest approach would be basing it on the Green et. al. Provenance Semirings paper.  As we discussed, that would make $\query(\mathcal D)(t)$ the query polynomial.
%}
%
%\OK{
%  I don't think we're on the same page here.  From the Prov. Semirings perspective, the entire $\poly(X_i)$ is the annotation of a tuple in an arbitrary query over a $\mathbb R[x]$-relation (i.e., a relation who's tuples are annotated by polynomials over the reals).  The $X_i$s are not annotations, they're the variables of that polynomial.  (footnote: Presumably, there are tuples in the database who's annotations are just a single variable, but that's not the general case).
%}
%
%\OK{
%  A good summary to start.  We'll need to make this more precise for the final paper though.
%}


\begin{align*}
&\project_A(\rel)(\tup) = &&\sum_{\tup' \text{ s.t. } \tup'[A] = \tup} \rel(\tup')\\
& (\rel_1 \union \rel_2)(\tup) = &&\rel_1(\tup) \oplus \rel_2(\tup)\\
&(\rel_1 \join_\theta \rel_2)(\tup) = &&\begin{cases}
						\rel_1(\tup_1) \otimes \rel_2(\tup_2)	&\text{if }\theta(\tup_1, \tup_2)\\
						0						&\text{otherwise}
					 \end{cases} \\
&\select_\theta(\rel)(\tup) = &&\begin{cases}
					\rel(\tup)	&\text{if }\theta(\tup) = 1\\
					0		&\text{otherwise}.
				\end{cases}
\end{align*}
Here, in the join rule, $\tup_1$ and $\tup_2$ denote the restrictions of $\tup$ to the attributes of $\rel_1$ and $\rel_2$ respectively, and $0$ denotes the additive identity of the semiring.
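
As an illustration of these rules, consider the following small example; the relations, tuples, and variable names in it are hypothetical and chosen only for exposition.  Let $\rel_1$ have schema $(A, B)$ and contain the tuples $(a_1, b)$ and $(a_2, b)$, annotated with $X_1$ and $X_2$ respectively, and let $\rel_2$ have schema $(B, C)$ and contain the single tuple $(b, c)$, annotated with $X_3$.  Taking $\theta$ to equate the two $B$ attributes, and writing the joined output tuple as $(b, c)$, the rules above give
\begin{align*}
\project_B(\rel_1)(b) &= X_1 \oplus X_2,\\
\left(\project_B(\rel_1) \join_\theta \rel_2\right)(b, c) &= \left(X_1 \oplus X_2\right) \otimes X_3.
\end{align*}
In the bag semiring this is the polynomial $X_1X_3 + X_2X_3$: the valuation $(X_1, X_2, X_3) = (1, 0, 1)$ yields multiplicity $1$, while $(1, 1, 1)$ yields multiplicity $2$.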

\subsection{Defining the Data}
Define $\pd$ to be the probability distribution of $\idb$.  Let $\wVec$ be a binary vector of $\numTup = \left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil$ bits that uniquely identifies a possible world $\db_i \in \idb$.  Let $\prob(X_i)$ $\left(\prob(\vct{X})\right)$ denote the probability that a given variable (set of variables) occurs.  We can substitute $\wVec$ for $\vct{X}$, binding the $i^{th}$ bit of $\wVec$ to its corresponding variable $X_i$, so that $\prob(\wVec)$ denotes the probability that a given world occurs.

We write $\wVec \sim \pd$ to indicate that $\wVec$ is distributed according to $\pd$, i.e., that the world $\wVec$ occurs with probability $\prob(\wVec)$.
One of the aggregates we wish to compute over the polynomial $\poly(X_1,\ldots, X_\numTup)$ is its expectation,
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \prob(\wVec).\]
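
As a small worked instance, assume (purely for illustration) that $\numTup = 2$, that $\poly(X_1, X_2) = X_1 + X_2$, and that the four worlds $\wVec = (\wElem_1, \wElem_2)$ have the hypothetical probabilities $\prob(0,0) = 0.1$, $\prob(0,1) = 0.2$, $\prob(1,0) = 0.3$, and $\prob(1,1) = 0.4$.  The sum then expands to
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = 0 \cdot 0.1 + 1 \cdot 0.2 + 1 \cdot 0.3 + 2 \cdot 0.4 = 1.3.\]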

A specific probabilistic data model is the Tuple Independent Database (\ti).  In this model each table is a set of tuples that are independent of one another, and each tuple $\tup$ occurs with its own probability $\prob_\tup$.

There are features of $\ti$ that we can exploit.  Note that a $\ti$ with $\numTup$ tuples has $2^\numTup$ possible worlds, each of which can be conveniently represented by an $\numTup$-bit string.  The bit string $\wVec$ serves as an index that determines which tuples are present in the corresponding world: the $i^{th}$ tuple is present exactly when the $i^{th}$ bit is $1$.  Given a vector $\vct{p}$ of length $\numTup$, where the $i^{th}$ element $\prob_i$ is the probability of the $i^{th}$ tuple, we can then write an equivalent expectation for $\ti$ models,

\[\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 0}}\left(1 - \prob_i\right).\]
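
To see how the product form behaves, consider a small instance: assume $\numTup = 2$ and $\poly(X_1, X_2) = X_1 + X_2$ (both chosen only for illustration).  The world $(0,0)$ contributes nothing, and the worlds $(0,1)$, $(1,0)$, and $(1,1)$ give, in that order,
\begin{align*}
\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} &= 1\cdot\left(1 - \prob_1\right)\prob_2 + 1\cdot\prob_1\left(1 - \prob_2\right) + 2\cdot\prob_1\prob_2\\
&= \prob_1 + \prob_2,
\end{align*}
which agrees with what linearity of expectation gives directly.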