Done with pass on Sec 1.2

Atri Rudra 2020-07-02 16:58:19 -04:00
parent e8a5817336
commit 33ba6d3aac

@@ -13,12 +13,14 @@
An incomplete database $\idb$ is a set of deterministic databases $\db_i$ where each element is known as a possible world. Since $\idb$ is modeling all the possible worlds of an uncertain database, it follows that each $\db_i \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$ (albeit not equivalent across all instances), whose schemas are unchanging across each $\db_i$. When $\idb$ is a probabilistic database, $\idb$ can be viewed as having two components, the set of possible worlds, and a probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. \AR{I'm not sure why you are using the notation $\mathcal{A}$ and $P$, which you do not seem to use beyond this section. I would recommend that you only introduce a notation if you plan to use them later on.} Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$. \AR{If you want to use $(\wSet,P)$ make sure you use the same notation in Sec 1.3 as well. If not, then use the notation from Sec 1.3 here}
\subsection{Modeling and Semantics}
$\idb$ can be generally viewed as the set of relations $\{\prel_1,\ldots, \prel_n\}$, where for each $\prel_i \in \idb$, $\prel_i$ consists of the set of all tuples appearing in $\rel_i$ across each of the possible worlds $\db_i \in \idb$, \AR{You should not be using the index $i$ for $R_i$ as well as $D_i$ since they are not connected} where each tuple is annotated with a provenance polynomial from the set $\mathbb{N}[X]$, and the set $X$ is the alphabet of variables \AR{You have not defined what are ``variables'' in $\idb$.} in $\idb$. One can think of $\idb$ as a parameterized database, whose abstract form maps to a deterministic $\db_i \in \idb$ based on the valuation to which the variables of $\idb$ are bound.
\AR{The first para below is very confusing-- I know what you are trying to say here but you are mixing various things here. In this section you should {\bf only} talk about the annotation polynomials. Connecting this to specific worlds I think makes things more confusing. You can connect all of this back to the worlds in Section 1.3. I have some more specific comments below but you should re-write this section based on this comment, which mostly would be removing all mention of worlds in this section.}
Note that the polynomial annotation of an arbitrary tuple can be viewed as a function $\poly(X_1,\ldots, X_N)$, where the variables can be bound to a specific valuation to determine the output of a tuple $\tup$'s annotation given the input valuation. Alternatively, the annotation for arbitrary tuple $\tup$ can be viewed as an element of the image of $\query(\prel)$, where relation $\query(\prel)$ can be thought of as a function with preimage of all tuples in $\query(\prel)$, such that $\query(\prel)(\tup) = \poly(X_1,\ldots, X_\numTup)$. Further, it is known that the algebraic semiring structure aptly models the translation and computation of query operations into tuple annotation, aka polynomials.
To make things more concrete, consider the $\{\mathbb{N}, \times, +, 1, 0\}$ bag semiring. Here the set in which the tuple annotations (computed polynomials) exist is the natural numbers. Query operations are translated into one of the two semiring operators, with $\project$ and $\union$ of agreeing tuples being the equivalent of the $+$ operator in polynomial $\poly$, $\join$ translating into the $\times$ operator, and finally, $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$ based on some predicate.
$\idb$ can be generally viewed as the set of relations $\{\prel_1,\ldots, \prel_n\}$, where for each $\prel_i \in \idb$, $\prel_i$ consists of the set of all tuples appearing in $\rel_i$ across each of the possible worlds $\db_i \in \idb$, \AR{You should not be using the index $i$ for $R_i$ as well as $D_i$ since they are not connected} where each tuple is annotated with a provenance polynomial from the set $\mathbb{N}[X]$, and the set $X$ is the alphabet of variables \AR{You have not defined what are ``variables'' in $\idb$. In fact, based on the general comments at the start of the section, just say that we consider polynomials over variables $X_1,\dots,X_M$ for some value of $M$ that will be specified later.} in $\idb$. One can think of $\idb$ as a parameterized database, whose abstract form maps to a deterministic $\db_i \in \idb$ based on the valuation to which the variables of $\idb$ are bound.
For the general commutative semiring, denote the plus and multiplication operators as $\oplus$ and $\otimes$ respectively, where summation represents summing over $\oplus$. Operations in $\query$ are translated into the following polynomial operations.
Note that the polynomial annotation of an arbitrary tuple can be viewed as a function $\poly(X_1,\ldots, X_N)$, \AR{It would be better to use $Q_t$ instead of $Q$ to stress that the polynomial is specific to $t$.} where the variables can be bound \AR{Again as per the general comment at the start of the section, no need to talk about valuations yet.} to a specific valuation to determine the output of a tuple $\tup$'s annotation given the input valuation. Alternatively, the annotation for arbitrary tuple $\tup$ can be viewed as an element of the image of $\query(\prel)$, where relation $\query(\prel)$ \AR{I am not sure if $Q(\prel)$ is good notation. I had used $Q$ for ``query'' but if we are going to use it to indicate the function that maps any relation to an annotation polynomial, then using some other notation is more appropriate? I'm not sure if there is a standard notation for this? But whatever we decide on, we should stick with it-- see the comment at the end of the section.} can be thought of as a function with preimage of all tuples in $\query(\prel)$, such that $\query(\prel)(\tup) = \poly(X_1,\ldots, X_\numTup)$. Further, it is known that the algebraic semiring structure aptly models the translation and computation of query operations into tuple annotation, aka polynomials.
To make things more concrete, consider the $\{\mathbb{N}, \times, +, 1, 0\}$ bag semiring. Here the set in which the tuple annotations (computed polynomials) exist is the natural numbers\AR{I don't think this is the case-- the semi-ring in this case is that of polynomials over natural numbers}. Query operations are translated into one of the two semiring operators, with $\project$ and $\union$ of agreeing tuples being the equivalent of the $+$ operator in polynomial $\poly$, $\join$ translating into the $\times$ operator, and finally, $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$ based on some predicate.
For the general commutative semiring,\AR{Why are you introducing the general semi-ring case if we are only using the polynomial semi-ring?} denote the plus and multiplication operators as $\oplus$ and $\otimes$ respectively, where summation represents summing over $\oplus$. Operations in $\query$ are translated into the following polynomial operations.
%\OK{
% Eventually, you probably want a little more background here, depending on the query notation you choose to use. The simplest approach would be basing it on the Green et al. Provenance Semirings paper. As we discussed, that would make $\query(\mathcal D)(t)$ the query polynomial.
@@ -45,6 +47,7 @@ For the general commutative semiring, denote the plus and multiplication operato
0 &\text{otherwise}.
\end{cases}
\end{align*}
\AR{The above needs to be re-written in terms of $Q(q)$ for any RA query $q$ (where you should replace $Q$ with whatever notation we decide based on the earlier comment). To be more precise you want to define $Q(q)$ and it should be defined recursively. You are sort of doing this above but you are using $q$ to both denote the query as well as $Q(q)$-- the above recursive definitions should use $Q$ explicitly.}
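As a small illustration of how query operators translate into polynomial operations (a sketch only: the relations $R$ and $S$, their tuples, and the variables $X_1, X_2, X_3$ are hypothetical and chosen purely for this example), suppose $R(A,B)$ contains the tuples $(a,b_1)$ and $(a,b_2)$, annotated with $X_1$ and $X_2$ respectively, and $S(B,C)$ contains the single tuple $(b_1,c)$, annotated with $X_3$. With $\join$ read as $\times$, $\project$ read as $+$ over agreeing tuples, and $\select$ returning either the input annotation or $0$, one would expect
\begin{align*}
\left(R \join S\right)(a, b_1, c) &= X_1 \times X_3,\\
\left(\project_{A}(R)\right)(a) &= X_1 + X_2,\\
\left(\select_{B = b_1}(R)\right)(a, b_1) &= X_1,\\
\left(\select_{B = b_1}(R)\right)(a, b_2) &= 0.
\end{align*}
Each right-hand side is an element of $\mathbb{N}[X]$; no valuation of the variables is needed to write it down.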
\subsection{Defining the Data}
Define $\pd$ to be the probability distribution for $\idb$. Let $\vct{w}$ be a binary vector of length $\numTup = \left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil$ that uniquely identifies a possible world $\db_i \in \idb$. Let $\prob(X_i)$ $\left(\prob(\vct{X})\right)$ denote the probability that a given variable (set of variables) occurs. We can substitute $\wVec$ for $\vct{X}$, where the $i^{\text{th}}$ bit of $\wVec$ is bound to its corresponding variable $X_i$; it follows that $\prob(\wVec)$ denotes the probability that a given world occurs.
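
For instance (an illustrative sketch; the count of exactly four possible worlds is an assumption made only for this example), if $\left|\wSet\right| = 4$ then $\numTup = \left\lceil\log_2 4\right\rceil = 2$, so each world is identified by a two-bit vector and there are two variables, $X_1$ and $X_2$:
\begin{align*}
\wVec &\in \{00,\, 01,\, 10,\, 11\},\\
\wVec = 10 &\implies X_1 = 1,\ X_2 = 0,\\
\sum_{\wVec \in \{0,1\}^2} \prob(\wVec) &= 1.
\end{align*}
The last line holds because, in this example, every two-bit vector identifies exactly one world and $\pd$ is a probability distribution over those worlds.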