Modeling and Semantics Section redone using evaluation expression notation

Aaron Huber 2020-07-07 15:37:18 -04:00
parent f8c9ceaef8
commit 885d002fbf
3 changed files with 15 additions and 44 deletions


@ -11,12 +11,13 @@
\newcommand{\relii}{T}
\newcommand{\db}{D}
\newcommand{\idb}{\mathcal{\db}}
\newcommand{\pd}{\idb_{pd}}%pd for probability distribution
\newcommand{\pd}{P}%pd for probability distribution
\newcommand{\query}{Q}
\newcommand{\join}{\Join}
\newcommand{\select}{\sigma}
\newcommand{\project}{\pi}
\newcommand{\union}{\cup}
\newcommand{\sch}{sch}
%PDBs
\newcommand{\ti}{TIDB}
@ -66,7 +67,7 @@
\newcommand{\vect}[1][v]{\textbf{#1}}
\newcommand{\wElem}{w}
\newcommand{\dw}{\widetilde{\wElem}}
\newcommand{\wSet}{W}
\newcommand{\wSet}{\Omega}
\newcommand{\sine}{s}
\newcommand{\hfunc}{h}
\newcommand{\est}{est}

BIN
main.synctex(busy) Normal file

Binary file not shown.


@ -10,64 +10,34 @@
\subsection{Introduction}
An incomplete database $\idb$ is a set of deterministic databases $\db_i$ where each element is known as a possible world. Since $\idb$ is modeling all the possible worlds of an uncertain database, it follows that each $\db_i \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$ (albeit not equivalent across all instances), whose schemas are unchanging across each $\db_i$. When $\idb$ is a probabilistice database, $\idb$ can be viewed as having two components, the set of possible worlds, and a probability distribution $\pd.$
An incomplete database $\idb$ is a set of deterministic databases $\db_i$, each of which is known as a possible world. Since $\idb$ models all the possible worlds of an uncertain database, each $\db_i \in \idb$ has the same named set of relations, $\{\rel_1,\ldots, \rel_n\}$ (although the instances of these relations may differ from world to world), whose schemas $(\sch(\rel_i))$ are the same in every $\db_j$. When $\idb$ is a probabilistic database, $\idb$ can be viewed as a pair $(\wSet, \pd)$, where $\wSet$ is the set of possible worlds and $\pd$ is a probability distribution over $\wSet$.
%Below may possibly need to be used again...we'll see.
%probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. \AR{I'm not sure why you are using the notation $\mathcal{A}$ and $P$, which you do not seem to use beyond this section. I would recommend that you only introduce a notation if you plan to use them later on.} Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$. \AR{If you want to use $(\wSet,P)$ make sure you use the same notation in Sec 1.3 as well. If not, then use the notation from Sec 1.3 here}
\subsection{Modeling and Semantics}
\AR{The first para below is very confusing-- I know what you are trying to say here but you are mixing various things here. In this section you should {\bf only} talk about the annotation polynomials. Connecting this to specific worlds I think makes things more confusing. You can connect all of this back to the worlds in Section 1.3. I have some more specific comments below but you should re-write this section based on this comment, which mostly would be removing all mention of worlds in this section.}
\AH{Okay, rewriting...}
Let each tuple in $\idb$ have a polynomial annotation. The polynomials we consider are over variables $X_1,\ldots, X_M$ for some $M$ that will be specified later on. Denote the polynomial annotation of an arbitrary tuple $\tup$ as $\poly_\tup(X_1,\ldots, X_M)$.
RA+ operations of $\query$ can be translated into the following polynomial operations.
Further define $\idb$ as an $\mathbb{N}[\vct{X}]$ database, i.e., an incomplete/probabilistic database model where each tuple $\tup \in \idb$ is annotated with a polynomial over variables $X_1,\ldots, X_M$ for some value of $M$ that will be specified later. Intuitively, one can think of $\idb$ as a parameterized database, whose abstract form maps to each deterministic $\db_i \in \idb$.
\begin{align*}
&\poly_\tup(\project_A) = &&\sum_{\tup' s.t. \tup'[A] = \tup} \poly_{\tup'}\\
& \poly_\tup(\tup_1 \union \tup_2) = &&\poly_{\tup_1} \oplus \poly_{\tup_2}\\
&\poly_\tup(\tup_1 \join_\theta \tup_2) = &&\begin{cases}
\poly_{\tup_1} \otimes \poly_{\tup_2} &\text{if }\theta(\tup_1, \tup_2)\\
0 &\text{otherwise}
\end{cases} \\
&\poly_\tup(\select_\theta) = &&\begin{cases}
\poly_{\tup} &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}
\end{align*}
\AH{END: rewriting}
$\idb$ can be generally viewed as the set of relations $\{\prel_1,\ldots, \prel_n\}$, where for each $\prel_i \in \idb$, $\prel_i$ consists of the set of all tuples appearing in $\rel_i$ across each of the possible worlds $\db_j \in \idb$, where each tuple is annotated with a provenance polynomial from the set $\mathbb{N}[X]$, and the set $X$ is the alphabet of variables \AR{You have not defined what are ``variables" in $\idb$. In fact, based on the general comments at the start of the section, just say that we consider polynomials over variables $X_1,\dots,X_M$ for some value of $M$ that will be specified later.} in $\idb$. One can think of $\idb$ as a parameterized database, whose abstract form maps to a deterministic $\db_i \in \idb$ based on the valuation to which the variables of $\idb$ are bound.
Since $\idb$ is a database that maps tuples to polynomials, it is customary to view an arbitrary table $\rel$ as a function $\rel: \tup \in \idb \mapsto \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial annotating tuple $\tup$.
Note that the polynomial annotation of an arbitrary tuple can be viewed as a function $\poly_\tup(X_1,\ldots, X_N)$, where the variables can be bound \AR{Again as per the general comment at the start of the section, no need to talk about valuations yet.} to a specific valuation to determine the output of a tuple $\tup$'s annotation given the input valuation. Alternatively, the annotation for arbitrary tuple $\tup$ can be viewed as an element of the image of $\query(\prel)$, where relation $\query(\prel)$ \AR{I am not sure if $Q(\prel)$ is good notation. I had used $Q$ for ``query" but if we are going to use it to indicate the function that maps any relation to an annotation polynomial, then using some other notation is more appropriate? I'm not sure if there is a standard notation for this? But whatever we decide on, we should stick with it-- see the comment at the end of the section.} can be thought of as a function with preimage of all tuples in $\query(\prel)$, such that $\query(\prel)(\tup) = \poly(X_1,\ldots, X_\numTup)$. Further, it is known that the algebraic semiring structure aptly models the translation and computation of query operations into tuple annotations, i.e., polynomials.
To make things more concrete, consider the $\{\mathbb{N}, \times, +, 1, 0\}$ bag semiring. Here the set in which the tuple annotations (computed polynomials) exist is the natural numbers\AR{I don't think this is the case-- the semi-ring in this case is that of polynomials over natural numbers}. Query operations are translated into one of the two semiring operators, with $\project$ and $\union$ of agreeing tuples being the equivalent of the $+$ operator in polynomial $\poly$, $\join$ translating into the $\times$ operator, and finally, $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$ based on some predicate.
It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to set annotations. Since $\idb$ is an $\mathbb{N}[\vct{X}]$ database, we are then working with the commutative semiring $\{\mathbb{N}[\vct{X}], +, \times, 0, 1\}$, where $\mathbb{N}[\vct{X}]$ is the set from which all annotations originate.
For the general commutative semiring,\AR{Why are you introducing the general semi-ring case if we are only using the polynomial semi-ring?} denote the plus and multiplication operators as $\oplus$ and $\otimes$ respectively, where summation represents summing over $\oplus$. Operations in $\query$ are translated into the following polynomial operations.
%\OK{
% Eventually, you probably want a little more background here, depending on the query notation you choose to use. The simplest approach would be basing it on the Green et al. Provenance Semirings paper. As we discussed, that would make $\query(\mathcal D)(t)$ the query polynomial.
%}
%
%\OK{
% I don't think we're on the same page here. From the Prov. Semirings perspective, the entire $\poly(X_i)$ is the annotation of a tuple in an arbitrary query over a $\mathbb R[x]$-relation (i.e., a relation whose tuples are annotated by polynomials over the reals). The $X_i$s are not annotations, they're the variables of that polynomial. (footnote: Presumably, there are tuples in the database whose annotations are just a single variable, but that's not the general case).
%}
%
%\OK{
% A good summary to start. We'll need to make this more precise for the final paper though.
%}
Given a query $\query$, operations in $\query$ are translated into the following polynomial operations.
\begin{align*}
&\project_A(\rel)(\tup) = &&\sum_{\tup' s.t. \tup'[A] = \tup} \rel(\tup')\\
& (\rel_1 \union \rel_2)(\tup) = &&\rel_1(\tup) \oplus \rel_2(\tup)\\
&(\rel_1 \join_\theta \rel_2)(\tup) = &&\begin{cases}
\rel_1(\tup_1) \otimes \rel_2(\tup_2) &\text{if }\theta(\tup_1, \tup_2)\\
0 &\text{otherwise}
\end{cases} \\
&\select_\theta(\rel) = &&\begin{cases}
&\llbracket\project_A(\rel)\rrbracket(\tup) = &&\sum_{\tup' s.t. \tup'[A] = \tup} \rel(\tup')\\
&\llbracket (\rel_1 \union \rel_2)\rrbracket(\tup) = &&\rel_1(\tup) + \rel_2(\tup)\\
&\llbracket(\rel_1 \join \rel_2)\rrbracket(\tup) = &&\rel_1(\tup[\sch(\rel_1)]) \times \rel_2(\tup[\sch(\rel_2)]) \\
&\llbracket\select_\theta(\rel)\rrbracket(\tup) = &&\begin{cases}
\rel(\tup) &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}
\end{align*}
\AR{The above needs to be re-written in terms of $Q(q)$ for any RA query $q$ (where you should replace $Q$ with whatever notation we decide based on the earlier comment). To be more precise you want to define $Q(q)$ and it should be defined recursively. You are sorta doing this above but you are using $q$ to both denote the query as well as $Q(q)$-- the above recursive definitions should use $Q$ explicitly.}
Query operations are translated into one of the two semiring operators: $\project$ and $\union$ of agreeing tuples correspond to the $+$ operator in polynomial $\poly$, $\join$ translates into the $\times$ operator, and $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$ based on some predicate.
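
As a small illustration of the rules above (the relations and annotations in this example are hypothetical and chosen only for exposition), let $\rel_1$ have schema $(A, B)$ with $\rel_1((a, b_1)) = X_1$ and $\rel_1((a, b_2)) = X_2$, and let $\rel_2$ have schema $(B, C)$ with $\rel_2((b_1, c)) = X_3$ and $\rel_2((b_2, c)) = X_4$, where every other tuple is annotated with $0$. Under these assumptions, evaluating $\project_{A, C}(\rel_1 \join \rel_2)$ would give
\begin{align*}
&\llbracket\rel_1 \join \rel_2\rrbracket((a, b_1, c)) = &&\rel_1((a, b_1)) \times \rel_2((b_1, c)) = X_1 X_3\\
&\llbracket\rel_1 \join \rel_2\rrbracket((a, b_2, c)) = &&\rel_1((a, b_2)) \times \rel_2((b_2, c)) = X_2 X_4\\
&\llbracket\project_{A, C}(\rel_1 \join \rel_2)\rrbracket((a, c)) = &&\llbracket\rel_1 \join \rel_2\rrbracket((a, b_1, c)) + \llbracket\rel_1 \join \rel_2\rrbracket((a, b_2, c)) = X_1 X_3 + X_2 X_4.
\end{align*}
The annotation of the output tuple $(a, c)$ is thus a polynomial in $\mathbb{N}[\vct{X}]$ with one monomial per pair of joining input tuples that project onto $(a, c)$.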
\subsection{Defining the Data}
\AR{This is how this subsection should be structured. First you should connect the variables $X_1,\dots,X_M$ to $W$. Basically say that a vector in $\{0,1\}^M$ (so we only assign binary values to the $M$ variables) corresponds to a {\em potential} world $\vct{w}$ (for TIDB $N=M$ and there is a one to one correspondence between $W$ and $\{0,1\}^M$ but for say BI not every vector in $\{0,1\}^M$ would correspond to a world-- some of them would not correspond to any world). Then a probability distribution over $\{0,1\}^M$ implies a distribution over $W$, which is how you connect back to the $P$ from Section 1.1. More specific comments follow.}