Changes to modeling, data define sections per Atri's comments 070920

Aaron Huber 2020-07-09 15:26:27 -04:00
parent a838da405b
commit 3a282bd4bf
2 changed files with 34 additions and 23 deletions


@ -11,7 +11,10 @@
\newcommand{\relii}{T}
\newcommand{\db}{D}
\newcommand{\idb}{\mathcal{\db}}
\newcommand{\nxdb}{D(\vct{X})}%\mathbb{N}[\vct{X}] db
\newcommand{\tset}{\mathcal{T}}%the set of tuples in a database
\newcommand{\pd}{P}%pd for probability distribution
\newcommand{\eval}[1]{\llbracket #1 \rrbracket}%evaluation double brackets
\newcommand{\query}{Q}
\newcommand{\join}{\Join}
\newcommand{\select}{\sigma}


@ -15,34 +15,46 @@ An incomplete database $\idb$ is a set of deterministic databases $\db_i$ where
%probability space $\left(\Omega, \mathcal{A}, P\right)$ over that set. \AR{I'm not sure why you are using the notation $\mathcal{A}$ and $P$, which you do not seem to use beyond this section. I would recommend that you only introduce a notation if you plan to use them later on.} Since the set of possible outcomes is the set of possible worlds, $\wSet$, and the set of outcomes is equivalent to the set of events, we will simplify notation and use $\left(\wSet, P\right)$ to denote the probability space of $\idb$. \AR{If you want to use $(\wSet,P)$ make sure you use the same notation in Sec 1.3 as well. If not, then use the notation from Sec 1.3 here}
\subsection{Modeling and Semantics}
Define $\vct{X}$ denote variables $X_1,\dots,X_M$.
Further define $\idb$ as an $\mathbb{N}[\vct{X}]$ database,\AR{There is a type error here: $\idb$ has already been defined as a PDB-- while here we are talking about an annotated DB: they are technically not the same thing so you cannot use the same notation. $\idb$ is used heavily in this sub-section so this change needs to be propagated. Am not sure if there is a standard notation-- if not $D(\vct{X})$ should work fine.} i.e., an incomplete/probabilistic database model where each tuple $\tup \in \idb$ is annotated with a polynomial over variables $X_1,\ldots, X_M$ for some value of $M$ that will be specified later. Intuitively, one can think of $\idb$ as a parameterized database, whose abstract form maps to each deterministic $\db_i \in \idb$.\AR{There is no need to connect back to possible world etc. in this sub-section.}
Since $\idb$ is a database that maps tuples to polynomials, it is customary for arbitrary table $\rel$ to be viewed as a function $\rel: \tup \in \idb \mapsto \mathbb{N}[\vct{X}]$,\AR{function notation is always a map from domain to range. Also you need a notation for set of all tuples.} where $\rel(\tup)$ denotes the polynomial mapped to tuple $\tup$.
It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to set annotations. Since $\idb$ is an $\mathbb{N}[\vct{X}]$ database, we are then working with the commutative semiring $\{\mathbb{N}[\vct{X}], +, \times, 0, 1\}$. %, where $\mathbb{N}[\vct{X}]$ is the set from which all annotations originate.
Define $\vct{X}$ to be the variables $X_1,\dots,X_M$. Let the set of tuples in an arbitrary $\db$ be $\tset$.
Further define $\nxdb$ as an $\mathbb{N}[\vct{X}]$ database, i.e., an incomplete/probabilistic database model where each tuple $\tup \in \tset$ is annotated with a polynomial over variables $X_1,\ldots, X_M$ for some value of $M$ that will be specified later.
Given a query $\query$, operations in $\query$ are translated into the following polynomial operations.
\AR{Explicitly mention what $\llbracket \cdot \rrbracket$ notation means.}
\AH{The following is a rough draft to convey a high level, superficial view of the K-relational database framework, specifically in the setting of $\mathbb{N}[\vct{X}]$-relation. Definitely needs some tweaking...any advice is much appreciated.}
\subsubsection{K-relations}
A K-relation is a relation whose tuples are each annotated with an element of a commutative K-semiring, denoted $\{K, \oplus, \otimes, \mathbbold{0}, \mathbbold{1}\}$. The commutative $K$-semiring has associative and commutative operators $\oplus$ and $\otimes$, with $\otimes$ distributing over $\oplus$, $\mathbbold{0}$ the identity of $\oplus$, $\mathbbold{1}$ the identity of $\otimes$, and $\mathbbold{0}$ annihilating every element of $K$ when combined with $\otimes$. The information encoded in the annotation depends on the underlying semiring of the relation. As noted in the Provenance Semirings work, the $\mathbb{N}[\vct{X}]$-semiring produces polynomial values, whose variables can then be substituted with $K$-values from other semirings, with $+$ and $\times$ evaluated as the operators of the substituted semiring, to produce varying semantics such as set, bag, and security annotations.
Note that $\mathbb{N}[\vct{X}]$ databases are effectively C-tables, since all first order formulas can be lifted to polynomials, where disjunction is equivalent to the addition operator and conjunction is equivalent to the multiplication operator, and in boolean semantics, negation of variable $x$ can be easily translated into $(1 - x)$. This would correspond to substituting values and operators from the $\{\mathbb{B}, \vee, \wedge, \bot, \top\}$ semiring.
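For instance, assume (purely for illustration) a tuple annotated with the polynomial $X_1X_2 + X_1$. Substituting into the bag semiring $\{\mathbb{N}, +, \times, 0, 1\}$ with the hypothetical values $X_1 = 2$ and $X_2 = 3$ yields the multiplicity
\[2 \times 3 + 2 = 8,\]
while substituting into $\{\mathbb{B}, \vee, \wedge, \bot, \top\}$ with the hypothetical values $X_1 = \top$ and $X_2 = \bot$ yields
\[(\top \wedge \bot) \vee \top = \top,\]
i.e., the tuple is present under set semantics.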
%A nice alternative perspective
%Intuitively, one can think of $\idb$ as a parameterized database, whose abstract form maps to each deterministic $\db_i \in \idb$.
Since $\nxdb$ is a database that maps tuples to polynomials, it is customary for arbitrary table $\rel$ to be viewed as a function $\rel: \tset \mapsto \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial mapped to tuple $\tup$.
It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to set annotations. Since $\nxdb$ is an $\mathbb{N}[\vct{X}]$ database, recall that we are working with the commutative semiring $\{\mathbb{N}[\vct{X}], +, \times, 0, 1\}$.
The evaluation semantics notation $\llbracket \cdot \rrbracket = x$ simply means that the result of evaluating expression $\cdot$ is given by following the semantics $x$. Given a query $\query$, operations in $\query$ are translated into the following polynomial operations.
\begin{align*}
&\llbracket\project_A(\rel)\rrbracket(\tup) = &&\sum_{\tup' s.t. \tup'[A] = \tup} \rel(\tup')\\
&\llbracket (\rel_1 \union \rel_2)\rrbracket(\tup) = &&\rel_1(\tup) + \rel_2(\tup)\\
&\llbracket(\rel_1 \join \rel_2)\rrbracket(\tup) = &&\rel_1(\tup[\sch(\rel_1)]) \times \rel_2(\tup(\sch(\rel_2)]) \\
&\llbracket\select_\theta(\rel)\rrbracket(\tup) = &&\begin{cases}
&\eval{\project_A(\rel)}(\tup) = &&\sum_{\tup' s.t. \tup'[A] = \tup} \eval{\rel}(\tup')\\
&\eval{(\rel_1 \union \rel_2)}(\tup) = &&\eval{\rel_1}(\tup) + \eval{\rel_2}(\tup)\\
&\eval{(\rel_1 \join \rel_2)}(\tup) = &&\eval{\rel_1}(\tup[\sch(\rel_1)]) \times \eval{\rel_2}(\tup[\sch(\rel_2)]) \\
&\eval{\select_\theta(\rel)}(\tup) = &&\begin{cases}
\rel(\tup) &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}
\end{cases}\\
&\eval{R}(\tup) = &&\rel(\tup)
\end{align*}
\AR{You should have the base case of the reduction explicitly stated as well-- i.e. what the poly of a tuple is. Also, in the RHS of the equality should also have the evaluation notation. Finally why is the join not just the product of $R_1(t)$ and $R_2(t)$, or more precisely $\llbracket R_1\rrbracket(t)\times \llbracket R_2\rrbracket(t)$?}
\AR{Finally why is the join not just the product of $R_1(t)$ and $R_2(t)$, or more precisely $\llbracket R_1\rrbracket(t)\times \llbracket R_2\rrbracket(t)$?}
\AH{I too had this question, and I \textit{think} the answer is that the join expression evaluation semantics are $\eval{R_1}(\tup[\sch(\rel_1)]) \times \eval{R_2}(\tup[\sch(\rel_2)])$ since it is possible that the output tuple $\tup$ may have a different schema than the original input tuples, and so we project away the dissimilar attributes because technically, if $\sch(\tup) \neq \sch(\rel_i)$, then $\rel_i(\tup)$ would be a different polynomial than $\rel_i(\tup[\sch(\rel_i)])$.
{\bf Oliver}, please correct or add to this if necessary.}
Query operations are translated into one of the two semiring operators, with $\project$ and $\union$ of agreeing tuples being the equivalent of the '+' operator in polynomial $\poly$, and $\join$ translating into the $\times$ operator. Finally, $\select$ is better modeled as a function that returns either $\rel(\tup)$ or $0$ based on some predicate.
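As a sketch of these rules, assume (only for this example) that $\rel$ has schema $(A, B)$ and contains exactly the tuples $(a, b_1)$ and $(a, b_2)$, annotated with $X_1$ and $X_2$ respectively. For the self-join, the output schema equals $\sch(\rel)$, so $\tup[\sch(\rel)] = \tup$. Then
\begin{align*}
&\eval{\project_A(\rel)}((a)) = &&\eval{\rel}((a, b_1)) + \eval{\rel}((a, b_2)) = X_1 + X_2\\
&\eval{(\rel \join \rel)}((a, b_1)) = &&\eval{\rel}((a, b_1)) \times \eval{\rel}((a, b_1)) = X_1^2\\
&\eval{\select_{B = b_1}(\rel)}((a, b_2)) = &&0.
\end{align*}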
\subsection{Defining the Data}
Assume a bijective mapping\AR{This is NOT a bijective map: there are $M$ variables and $2^M$ binary vectors so there cannot be a bijective map. Just say that domain of each $X_i$ is $\{0,1\}$.} between the polynomial variables $X_1,\ldots, X_M$ and each bit position of elements $\in \{0, 1\}^M$. In the general case, the binary value of $\vct{w}$ uniquely identifies a potential possible world. For example, in the case of the Tuple Independent Database $(\ti)$ data model, there are $\numTup$ tuples which yield $2^\numTup$ possible worlds, thus $\numTup = M$, and each $\vct{w} \in \{0, 1\}^M$ is indeed a possible world. However in the Block Independent Disjoint data model, because of the disjoint condition on tuples within the same block, it is not the general case that every element $\vct{w} \in \{0, 1\}^M$ is in fact a possible world. Denote a random world (according to distribution $P$) to be $\rw$. Provided that for any non-possible world $\vct{w} \in \{0, 1\}^M, \pd[\rw = \vct{w}] = 0$, then, a probability distribution over $\{0, 1\}^M$ implies a distribution over $\Omega$, which we have already defined as $\pd$.
In the general case, the binary value of $\vct{w}$ uniquely identifies a potential possible world. For example, consider the case of the Tuple Independent Database $(\ti)$ data model in which each table is a set of tuples, each of which is independent of the others and individually occurs with a specific probability $\prob_\tup$. Because of independence, a $\ti$ with $\numTup$ tuples naturally has $2^\numTup$ possible worlds, thus $\numTup = M$, and each $\vct{w} \in \{0, 1\}^M$ is indeed a possible world. However, in the Block Independent Disjoint data model, because of the disjoint condition on tuples within the same block, it is not the general case that every element $\vct{w} \in \{0, 1\}^M$ is in fact a possible world. Denote a random world (according to distribution $P$) to be $\rw$. Provided that for any non-possible world $\vct{w} \in \{0, 1\}^M, \pd[\rw = \vct{w}] = 0$, a probability distribution over $\{0, 1\}^M$ implies a distribution over $\Omega$, which we have already defined as $\pd$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%This could be a way to think of world binary vectors in the general case
@ -50,21 +62,17 @@ Assume a bijective mapping\AR{This is NOT a bijective map: there are $M$ variabl
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Since we can view $\llbracket\poly(\rel)\rrbracket(\tup)$ as a function $\{0, 1\}^M \mapsto \mathbb{N}$, coupled with that from this point on, our discussion will involve one polynomial for an arbitrary $\tup$, we abuse notation by using $\poly(\vct{w})$ to mean $\llbracket\poly(\rel)\rrbracket(\tup)(\vct{w})$. \AR{The notation is NOT correct. It should be $\poly(\vct{X})$.}
Assume a domain of $\{0, 1\}$ for each $X_i \in \vct{X}$. The polynomial $\llbracket\poly(\rel)\rrbracket(\tup)$ can be viewed as a function $\{0, 1\}^M \mapsto \mathbb{N}$. Since, from this point on, our discussion will involve one polynomial for an arbitrary $\tup$, we abuse notation by using $\poly(\vct{X})$ to denote the annotated polynomial $\llbracket\poly(\rel)\rrbracket(\tup)$.
One of the aggregates we desire to compute over the annotated polynomial is the expectation, denoted,
\AH{With our notation, I no longer think that $\vct{w} \sim \pd$ is necessary footer for $\expct$. We can probably just have $\expct\limits_{\vct{w}}$ instead. Do you agree?}
\AR{No. How would you state Lemma 4 without explicitly using $P$ in the definition of expectation?}
\[\expct_{\wVec \sim \pd}\pbox{\poly(\vct{w})} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \pd[\rw = \vct{w}].\]
\AR{It should be $\rw$ and not $\vct{w}$ in LHS above.}
\[\expct_{\vct{w} \sim \pd}\pbox{\poly(\rw)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\cdot \pd[\rw = \vct{w}].\]
$\ti$ is a database model in which each table is a set of tuples, each of which are independent of one another, and individually occur with a specific probability, $\prob_\tup$.
The $\ti$ model has features that we can exploit. Since the powerset of $[\numTup]$ is exactly $\wSet$, the bit-string world value $\vct{w}$ can be used as indexing to determine which tuples are present in the $\vct{w}$ world. Given an $\numTup$-sized vector $\vct{p}$, where the $i^{th}$ element, $\prob_i$ is the probability of the $i^{th}$ tuple, denote the vector $\vct{p}$ according to the probability distribution $\pd$ as $\pd^{(\vct{p})}$. We can then write an equivalent expectation for $\ti$ model,
There are features of $\ti$ that we can exploit. Note that, because of independence, a $\ti$ with $\numTup$ tuples naturally has $2^\numTup$ possible worlds, each of which can be conveniently modeled by an $\numTup$ bit string. Since the powerset of $[\numTup]$ is exactly $\wSet$, the bit-string world value $\vct{w}$ can be used as an index to determine which tuples are present in the $\vct{w}$ world. \AR{The discussion on TIDB in the earlier part of this para and the previous para is better placed earlier in the sub-section where you introduce TIDB.} Given an $\numTup$-sized vector $\vct{p}$, where the $i^{th}$ element, $\prob_i$, is the probability of the $i^{th}$ tuple, we can then write an equivalent expectation for $\ti$ models,
\[\expct_{\wVec\sim \pd^{(\vct{p})}}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ s.t. \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\s.t. w_i = 0}}\left(1 - \prob_i\right).\]
\[\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ s.t. \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\s.t. w_i = 0}}\left(1 - \prob_i\right).\]
\AR{The notation $\pd^{\vct{p}}$ is not defined. Also consider using $\pd^{(\vct{p})}$ or $\pd(\vct{p})$.
}
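As a small sanity check of the $\ti$ expectation, assume (only for illustration) that $\numTup = 2$ and $\poly(\vct{X}) = X_1X_2 + X_1$. Only the worlds with $w_1 = 1$ contribute, since $\poly(0, 0) = \poly(0, 1) = 0$, $\poly(1, 0) = 1$, and $\poly(1, 1) = 2$, giving
\[\expct_{\wVec\sim \pd^{\vct{p}}}\pbox{\poly(\wVec)} = 1 \cdot \prob_1\left(1 - \prob_2\right) + 2 \cdot \prob_1\prob_2 = \prob_1 + \prob_1\prob_2,\]
which agrees with what linearity of expectation and tuple independence would give term by term.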