%root: main.tex
%!TEX root=./main.tex
\section{Query translation into polynomials}
%\AH{This section will involve the set of queries (RA+) that we are interested in, the probabilistic/incomplete models we address, and the outer aggregate functions we perform over the output \textit{annotation}
%1) RA notation
%2) DB (TIDB) notation
%3) How queries translate into polynomials
%}
\subsection{Introduction}
An incomplete database $\idb$ is a set of deterministic databases $\db_i$, each of which is called a possible world. Since $\idb$ models all the possible worlds of an uncertain database, every $\db_i \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$, with schemas that are fixed across all $\db_i$; the contents of a relation may nevertheless differ from one world to another. When $\idb$ is a probabilistic database, it can be viewed as having two components: the set of possible worlds and a probability distribution $\pd$ over that set.
%Below may possibly need to be used again...we'll see.
%probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. \AR{I'm not sure why you are using the notation $\mathcal{A}$ and $P$, which you do not seem to use beyond this section. I would recommend that you only introduce a notation if you plan to use them later on.} Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$. \AR{If you want to use $(\wSet,P)$ make sure you use the same notation in Sec 1.3 as well. If not, then use the notation from Sec 1.3 here}
\subsection{Modeling and Semantics}
\AR{The first para below is very confusing-- I know what you are trying to say here but you are mixing various things here. In this section you should {\bf only} talk about the annotation polynomials. Connecting this to specific worlds I think makes things more confusing. You can connect all of this back to the worlds in Section 1.3. I have some more specific comments below but you should re-write this section based on this comment, which mostly would be removing all mention of worlds in this section.}
\AH{Okay, rewriting...}
Let each tuple in $\idb$ have a polynomial annotation. The polynomials we consider are over variables $X_1,\ldots, X_M$ for some $M$ that will be specified later on. Denote the polynomial annotation of an arbitrary tuple $\tup$ as $\poly_\tup(X_1,\ldots, X_M)$.
The RA+ operations of a query $\query$ translate into the following polynomial operations.
\begin{align*}
&\poly_\tup(\project_A) = &&\sum_{\tup' \text{ s.t. } \tup'[A] = \tup} \poly_{\tup'}\\
& \poly_\tup(\tup_1 \union \tup_2) = &&\poly_{\tup_1} \oplus \poly_{\tup_2}\\
&\poly_\tup(\tup_1 \join_\theta \tup_2) = &&\begin{cases}
\poly_{\tup_1} \otimes \poly_{\tup_2} &\text{if }\theta(\tup_1, \tup_2)\\
0 &\text{otherwise}
\end{cases} \\
&\poly_\tup(\select_\theta) = &&\begin{cases}
\poly_{\tup} &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}
\end{align*}
\AH{END: rewriting}
$\idb$ can be generally viewed as the set of relations $\{\prel_1,\ldots, \prel_n\}$, where for each $\prel_i \in \idb$, $\prel_i$ consists of the set of all tuples appearing in $\rel_i$ across each of the possible worlds $\db_j \in \idb$, where each tuple is annotated with a provenance polynomial from the set $\mathbb{N}[X]$, and the set $X$ is the alphabet of variables \AR{You have not defined what are ``variables" in $\idb$. In fact, based on the general comments at the start of the section, just say that we consider polynomials over variables $X_1,\dots,X_M$ for some value of $M$ that will be specified later.} in $\idb$. One can think of $\idb$ as a parameterized database, whose abstract form maps to a deterministic $\db_i \in \idb$ based on the valuation to which the variables of $\idb$ are bound.
Note that the polynomial annotation of an arbitrary tuple can be viewed as a function $\poly_\tup(X_1,\ldots, X_N)$, where the variables can be bound \AR{Again as per the general comment at the start of the section, no need to talk about valuations yet.} to a specific valuation to determine the value of tuple $\tup$'s annotation under that valuation. Alternatively, the annotation for an arbitrary tuple $\tup$ can be viewed as an element of the image of $\query(\prel)$, where the relation $\query(\prel)$ \AR{I am not sure if $Q(\prel)$ is good notation. I had used $Q$ for ``query" but if we are going to use it to indicate the function that maps any relation to an annotation polynomial, then using some other notation is more appropriate? I'm not sure if there is a standard notation for this? But whatever we decide on, we should stick with it-- see the comment at the end of the section.} can be thought of as a function whose domain is the set of all tuples in $\query(\prel)$, such that $\query(\prel)(\tup) = \poly(X_1,\ldots, X_\numTup)$. Further, it is known that the algebraic structure of a semiring aptly models the translation of query operations into tuple annotations, i.e., polynomials.
To make things more concrete, consider the $\{\mathbb{N}, \times, +, 1, 0\}$ bag semiring. Here the set in which the tuple annotations (computed polynomials) exist is the natural numbers\AR{I don't think this is the case-- the semi-ring in this case is that of polynomials over natural numbers}. Query operations are translated into one of the two semiring operators: $\project$ and $\union$ of agreeing tuples correspond to the $+$ operator in the polynomial $\poly$, and $\join$ corresponds to the $\times$ operator. Finally, $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$ based on some predicate.
For the general commutative semiring,\AR{Why are you introducing the general semi-ring case if we are only using the polynomial semi-ring?} denote the addition and multiplication operators by $\oplus$ and $\otimes$ respectively, where $\sum$ denotes repeated application of $\oplus$. Operations in $\query$ are translated into the following polynomial operations.
\OK{
Eventually, you probably want a little more background here, depending on the query notation you choose to use. The simplest approach would be basing it on the Green et al. Provenance Semirings paper. As we discussed, that would make $\query(\mathcal D)(t)$ the query polynomial.
}
%
%\OK{
% I don't think we're on the same page here. From the Prov. Semirings perspective, the entire $\poly(X_i)$ is the annotation of a tuple in an arbitrary query over a $\mathbb R[x]$-relation (i.e., a relation who's tuples are annotated by polynomials over the reals). The $X_i$s are not annotations, they're the variables of that polynomial. (footnote: Presumably, there are tuples in the database who's annotations are just a single variable, but that's not the general case).
%}
%
%\OK{
% A good summary to start. We'll need to make this more precise for the final paper though.
%}
\begin{align*}
&\project_A(\rel)(\tup) = &&\sum_{\tup' \text{ s.t. } \tup'[A] = \tup} \rel(\tup')\\
& (\rel_1 \union \rel_2)(\tup) = &&\rel_1(\tup) \oplus \rel_2(\tup)\\
&(\rel_1 \join_\theta \rel_2)(\tup) = &&\begin{cases}
\rel_1(\tup_1) \otimes \rel_2(\tup_2) &\text{if }\theta(\tup_1, \tup_2)\\
0 &\text{otherwise}
\end{cases} \\
&\select_\theta(\rel)(\tup) = &&\begin{cases}
\rel(\tup) &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}
\end{align*}
\AR{The above needs to be re-written in terms of $Q(q)$ for any RA query $q$ (where you should replace $Q$ with whatever notation we decide based on the earlier comment). To be more precise you want to define $Q(q)$ and it should be defined recursively. You are sorta doing this above but you are using $q$ to both denote the query as well as $Q(q)$-- the above recursive definitions should use $Q$ explicitly.}
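As a small illustration of the translation rules above (the relation contents and attribute names here are hypothetical, chosen only for this example), suppose $\rel_1$ has schema $(A, B)$ and holds two tuples $\tup_1 = (a, b_1)$ and $\tup_2 = (a, b_2)$ with $\rel_1(\tup_1) = X_1$ and $\rel_1(\tup_2) = X_2$. Under the bag semiring interpretation, where $\oplus$ is $+$ and $\otimes$ is $\times$, projecting onto $A$ and then joining the result with itself on equality of $A$ would give
\begin{align*}
&\project_A(\rel_1)(a) = &&X_1 \oplus X_2\\
&(\project_A(\rel_1) \join_\theta \project_A(\rel_1))(a, a) = &&(X_1 \oplus X_2) \otimes (X_1 \oplus X_2) = X_1^2 + 2X_1X_2 + X_2^2.
\end{align*}
The output annotation is thus a polynomial with non-trivial coefficients and exponents, even though every input tuple is annotated with a single variable.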
\subsection{Defining the Data}
\AR{This is how this subsection should be structured. First you should connect the variables $X_1,\dots,X_M$ to $W$. Basically say that a vector in $\{0,1\}^M$ (so we only assign binary values to the $M$ variables) corresponds to a {\em potential} world $\vct{w}$ (for TIDB $N=M$ and there is a one to one correspondence between $W$ and $\{0,1\}^M$ but for say BI not every vector in $\{0,1\}^M$ would correspond to a world-- some of them would not correspond to any world). Then a probability distribution over $\{0,1\}^M$ implies a distribution over $W$, which is how you connect back to the $P$ from Section 1.1. More specific comments follow.}
Define $\pd$ to be the probability distribution for $\idb$. \AR{You should connect $\pd$ back to the $P$ from Section 1.1} Let $\vct{w}$ be a binary vector of $\left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil = \numTup$ bits that uniquely identifies a possible world $\db_i \in \idb$. \AR{The correspondence between $W$ and $\{0,1\}^N$ belongs to Sec 1.1} Let $\prob(X_i)$ $\left(\prob(\vct{X})\right)$ denote the probability that a given variable (set of variables) occur(s). \AR{This sentence has many issues: (1) the variables $X_1,\dots,X_M$ are just there-- it does not make sense to say if they ``occur"; (2) The probability should have $\pd$ explicitly in it and (3) $p(\cdot)$ conflicts with the $p$ that we will use in TIDB.
Here is my suggestion to fix this. First we need a notation for a {\em random} world. We are already using $\vct{w}$ to denote a {\em specific} world. So for now let's say we use $\overline{\vct{w}}$ to denote the random variable. Then to denote the probability that the randomly chosen $\overline{\vct{w}}$ is $\vct{w}$ use the notation $\text{Pr}_{\overline{\vct{w}}\sim\pd}[\overline{\vct{w}}=\vct{w}] $. I would like to stress that $\overline{\vct{w}}$ is just a suggestion-- there is probably a better notation for the random variable. {\bf Propagate} this notation change.} We can substitute $\wVec$ for $\vct{X}$, where the $i^{\text{th}}$ bit of $\wVec$ is bound to its corresponding variable $X_i$; it follows that $\prob(\wVec)$ denotes the probability that a given world occurs.
We write $\vct{w} \sim \pd$ to indicate that $\vct{w}$ is drawn according to the distribution $\pd$, i.e., that world $\vct{w}$ occurs with probability $\prob(\vct{w})$.
One of the aggregates we wish to compute over the polynomial $\poly(X_1,\ldots, X_\numTup)$ is its expectation,
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \prob(\wVec).\]
\AR{With the notation change above, the above should be re-written as
\[\expct_{\overline{\wVec} \sim \pd}\pbox{\poly(\overline{\wVec})} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \mathrm{Pr}_{\overline{\wVec}\sim\pd}[\overline{\wVec}=\wVec].\]
}
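As a sanity check on this definition, consider a hypothetical instance with $\numTup = 2$ in which $\pd$ places probability $\frac{1}{2}$ on each of the worlds $(1,0)$ and $(0,1)$ and probability $0$ on the other two (these values are assumed purely for illustration). For $\poly(X_1, X_2) = X_1 + X_2$ the sum over worlds gives
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = 1\cdot\tfrac{1}{2} + 1\cdot\tfrac{1}{2} = 1,\]
whereas for $\poly(X_1, X_2) = X_1 X_2$ it gives
\[\expct_{\wVec \sim \pd}\pbox{\poly(\wVec)} = 0\cdot\tfrac{1}{2} + 0\cdot\tfrac{1}{2} = 0.\]
Each variable is set to $1$ with probability $\frac{1}{2}$ in this instance, yet the expectation of the product is $0$ rather than $\frac{1}{4}$, so in general the expectation depends on the joint distribution $\pd$ and not only on the per-variable probabilities.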
A specific probabilistic data model is the Tuple Independent Database (\ti). In this model each table is a set of tuples, each of which occurs independently of all others with its own probability $\prob_\tup$.
There are features of $\ti$ that we can exploit. Note that a $\ti$ with $\numTup$ tuples naturally has $2^\numTup$ possible worlds, each of which can be conveniently modeled by an $\numTup$-bit string. The bits of the string $\wVec$ indicate which tuples are present in the world $\wVec$. Given a vector $\vct{p}$ of length $\numTup$ whose $i^{\text{th}}$ element $\prob_i$ is the probability of the $i^{\text{th}}$ tuple, we can then write an equivalent expectation for $\ti$ models,
\[\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\ \text{s.t. } \wElem_i = 0}}\left(1 - \prob_i\right).\]
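To see how this formula behaves, consider a hypothetical $\ti$ with $\numTup = 2$ and the polynomial $\poly(X_1, X_2) = (X_1 + X_2)^2$ (both assumed only for illustration). Enumerating the four worlds $(0,0)$, $(1,0)$, $(0,1)$ and $(1,1)$ gives
\begin{align*}
\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} &= 0\cdot(1-\prob_1)(1-\prob_2) + 1\cdot\prob_1(1-\prob_2) + 1\cdot(1-\prob_1)\prob_2 + 4\cdot\prob_1\prob_2\\
&= \prob_1 + \prob_2 + 2\prob_1\prob_2.
\end{align*}
The same value results from expanding $\poly$ to $X_1^2 + 2X_1X_2 + X_2^2$ and observing that $\wElem_i^2 = \wElem_i$ for $\wElem_i \in \{0,1\}$, so that each term $X_i^2$ contributes $\prob_i$ rather than $\prob_i^2$.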