%root: main.tex
%!TEX root=./main.tex
\section{Query translation into polynomials}
%\AH{This section will involve the set of queries (RA+) that we are interested in, the probabilistic/incomplete models we address, and the outer aggregate functions we perform over the output \textit{annotation}
%1) RA notation
%2) DB (TIDB) notation
%3) How queries translate into polynomials
%}
\subsection{Introduction}
An incomplete database $\idb$ is a set of deterministic databases $\db_i$, each of which is known as a possible world. Since $\idb$ models all the possible worlds of an uncertain database, each $\db_i \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$ (although their contents may differ across instances), whose schemas $\sch(\rel_i)$ are the same in every $\db_j$. Denote the set of possible worlds, i.e., the set of all $\db_i \in \idb$, by $\wSet$, and define an injective mapping from $\wSet$ to the set $\{0, 1\}^M$, so that for each vector $\vct{w} \in \{0, 1\}^M$ there is at most one element $\db_i \in \idb$ mapped to $\vct{w}$. When $\idb$ is a probabilistic database, $\idb$ can be viewed as a two-tuple $(\wSet, \pd)$, where $\wSet$, as noted, is the set of possible worlds and $\pd$ is the probability distribution over $\wSet$.
%Below may possibly need to be used again...we'll see.
%probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. \AR{I'm not sure why you are using the notation $\mathcal{A}$ and $P$, which you do not seem to use beyond this section. I would recommend that you only introduce a notation if you plan to use them later on.} Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$. \AR{If you want to use $(\wSet,P)$ make sure you use the same notation in Sec 1.3 as well. If not, then use the notation from Sec 1.3 here}
\subsection{Modeling and Semantics}
Further define $\idb$ as an $\mathbb{N}[\vct{X}]$ database, i.e., an incomplete/probabilistic database model where each tuple $\tup \in \idb$ is annotated with a polynomial over variables $X_1,\ldots, X_M$ for some value of $M$ that will be specified later. Intuitively, one can think of $\idb$ as a parameterized database, whose abstract form maps to each deterministic $\db_i \in \idb$.
Since $\idb$ is a database that maps tuples to polynomials, it is customary for arbitrary table $\rel$ to be viewed as a function $\rel: \tup \in \idb \mapsto \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial mapped to tuple $\tup$.
Previous work has shown that commutative semirings precisely model the translation of RA+ query operations into operations on annotations. Since $\idb$ is an $\mathbb{N}[\vct{X}]$ database, we work with the commutative semiring $(\mathbb{N}[\vct{X}], +, \times, 0, 1)$, where $\mathbb{N}[\vct{X}]$ is the set from which all annotations originate.
Given a query $\query$, operations in $\query$ are translated into the following polynomial operations.
\begin{align*}
&\llbracket\project_A(\rel)\rrbracket(\tup) = &&\sum_{\tup' \text{ s.t. } \tup'[A] = \tup} \rel(\tup')\\
&\llbracket (\rel_1 \union \rel_2)\rrbracket(\tup) = &&\rel_1(\tup) + \rel_2(\tup)\\
&\llbracket(\rel_1 \join \rel_2)\rrbracket(\tup) = &&\rel_1(\tup[\sch(\rel_1)]) \times \rel_2(\tup[\sch(\rel_2)]) \\
&\llbracket\select_\theta(\rel)\rrbracket(\tup) = &&\begin{cases}
\rel(\tup) &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}
\end{align*}
Each query operation is translated using the two semiring operators: $\project$ and $\union$ of agreeing tuples correspond to the $+$ operator in polynomial $\poly$, $\join$ translates into the $\times$ operator, and $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$ depending on whether the predicate $\theta$ holds.
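As a small illustration of these rules, consider the following sketch; the relations, attribute values, and variable assignments are hypothetical and chosen only for this example. Let $\rel_1$ have schema $(A, B)$ with $\rel_1(a, b_1) = X_1$ and $\rel_1(a, b_2) = X_2$, and let $\rel_2$ have schema $(B, C)$ with $\rel_2(b_1, c) = X_3$ and $\rel_2(b_2, c) = X_4$, where every other tuple is annotated with $0$. Applying the join rule and then the projection rule gives
\begin{align*}
\llbracket(\rel_1 \join \rel_2)\rrbracket(a, b_1, c) &= X_1 \times X_3,\\
\llbracket(\rel_1 \join \rel_2)\rrbracket(a, b_2, c) &= X_2 \times X_4,\\
\llbracket\project_A(\rel_1 \join \rel_2)\rrbracket(a) &= X_1 X_3 + X_2 X_4.
\end{align*}
The two join results agree on $A$, so the projection sums their annotations into a single polynomial for the output tuple $(a)$.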
\subsection{Defining the Data}
Assume a bijective mapping between the polynomial variables $X_1,\ldots, X_M$ and the bit positions of the elements of $\{0, 1\}^M$. In the general case, the binary value of $\vct{w}$ uniquely identifies a potential possible world. For example, in the Tuple Independent Database $(\ti)$ data model, there are $\numTup$ tuples, which yield $2^\numTup$ possible worlds, so $M = \numTup$ and each $\vct{w} \in \{0, 1\}^M$ is indeed a possible world. In the Block Independent Disjoint data model, however, the disjointness condition on tuples within the same block means that not every element $\vct{w} \in \{0, 1\}^M$ is a possible world. Denote a random world by $\rw$. Provided that $\pd[\rw = \vct{w}] = 0$ for every $\vct{w} \in \{0, 1\}^M$ that is not a possible world, a probability distribution over $\{0, 1\}^M$ induces a distribution over $\wSet$, which we have already denoted by $\pd$.
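To illustrate the disjointness condition with a hypothetical example (not drawn from the rest of this paper), consider a single block containing two tuples associated with the variables $X_1$ and $X_2$, so that $M = 2$. Since at most one tuple of a block may be present in any world, the vector $\vct{w} = (1, 1)$ does not correspond to a possible world, and we require $\pd[\rw = (1, 1)] = 0$; only $(0, 0)$, $(1, 0)$, and $(0, 1)$ can receive nonzero probability.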
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%This could be a way to think of world binary vectors in the general case
%Let $\vct{w}$ be a $\left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil = \numTup$ binary bit vector, uniquely identifying possible world $\db_i \in \idb$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Since we can view $\llbracket\poly(\rel)\rrbracket(\tup)$ as a function $\{0, 1\}^M \mapsto \mathbb{N}$, and since from this point on our discussion involves a single polynomial for an arbitrary $\tup$, we abuse notation by using $\poly(\vct{w})$ to mean $\llbracket\poly(\rel)\rrbracket(\tup)(\vct{w})$.
One of the aggregates we desire to compute over the annotated polynomial is the expectation, denoted,
\AH{With our notation, I no longer think that $\vct{w} \sim \pd$ is necessary footer for $\expct$. We can probably just have $\expct\limits_{\vct{w}}$ instead. Do you agree?}
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^M} \poly(\wVec)\cdot \pd[\rw = \wVec].\]
$\ti$ is a database model in which each table is a set of tuples, each of which is independent of the others and occurs with its own probability, $\prob_\tup$.
There are features of $\ti$ that we can exploit. Because of independence, a $\ti$ with $\numTup$ tuples naturally has $2^\numTup$ possible worlds, each of which can be conveniently modeled by an $\numTup$-bit string. Since $\wSet$ corresponds exactly to the powerset of $[\numTup]$, the bit string $\vct{w}$ can be used as an index that determines which tuples are present in world $\vct{w}$. Given an $\numTup$-sized vector $\vct{p}$, where the $i^{\text{th}}$ element, $\prob_i$, is the probability of the $i^{\text{th}}$ tuple, we can then write an equivalent expectation for $\ti$ models,
\[\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 0}}\left(1 - \prob_i\right).\]
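As a sanity check of this formula, consider the following small sketch; the polynomial and the choice $\numTup = 2$ are hypothetical and serve only as an illustration. Let $\poly = X_1X_2 + X_1$, so that $\poly(\wVec) = \wElem_1\wElem_2 + \wElem_1$. Only the worlds with $\wElem_1 = 1$ contribute, since $\poly(0, 0) = \poly(0, 1) = 0$, while $\poly(1, 0) = 1$ and $\poly(1, 1) = 2$. The sum therefore reduces to
\begin{align*}
\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} &= 1\cdot \prob_1\left(1 - \prob_2\right) + 2\cdot \prob_1\prob_2\\
&= \prob_1 + \prob_1\prob_2,
\end{align*}
which agrees with what linearity of expectation and tuple independence give directly, namely $\prob_1\prob_2$ for the term $X_1X_2$ plus $\prob_1$ for the term $X_1$.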