%root: main.tex
%!TEX root=./main.tex
%\onecolumn
\section{Background and Notation}
\subsection{Probabilistic Databases (PDBs)}
An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world:
\[\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}\]
For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]
Note that in this work we consider multisets, i.e., each possible world is a set of multiset relations and queries are evaluated using bag semantics. We will use K-relations to model multisets. A \emph{K-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when being combined with $\multK$.
Let $\udom$ be a countable domain of values.
Formally, an $n$-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ ($\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to tuple $t$.
We review the semantics of positive relational algebra queries over $\semK$-relations below.
Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is a $\semN$-database. We will study the problem of evaluating statistical moments of query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, a query $\query$, and a possible result tuple $t$, we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation $\expct_{\idb \sim \pd}[\query(\db)(t)]$:
\begin{align}\label{eq:bag-expectation}
\expct_{\idb \sim \pd}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db)
\end{align}
Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in the query result.
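As a small illustration (the worlds, probabilities, and multiplicities are made up for this sketch), suppose $\idb = \{\db_1, \db_2, \db_3\}$ with $\pd(\db_1) = 0.5$, $\pd(\db_2) = 0.3$, and $\pd(\db_3) = 0.2$, and suppose that $t$ occurs in the query result with multiplicities $\query(\db_1)(t) = 0$, $\query(\db_2)(t) = 2$, and $\query(\db_3)(t) = 3$. Then \Cref{eq:bag-expectation} gives
\[\expct_{\idb \sim \pd}[\query(\db)(t)] = 0 \cdot 0.5 + 2 \cdot 0.3 + 3 \cdot 0.2 = 1.2,\]
that is, we expect to find $1.2$ copies of $t$ in the result, although no single world contains exactly that many.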
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{$\semK$-relational Query Semantics}\label{sec:k-rel-query-semantics}
For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
We use $\eval{\cdot}{\db}$ to denote the evaluation of a query $\query$ over a $\semK$-database $\db$. In the definition shown below, we assume that tuples are of appropriate arity and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$.
\begin{align*}
&\eval{\project_A(\rel)}(\tup)&& = &&\sum_{\tup': \project_A(\tup') = \tup} \eval{\rel}(\tup')\\
&\eval{(\rel_1 \union \rel_2)}(\tup)&& = &&\eval{\rel_1}(\tup) + \eval{\rel_2}(\tup)\\
&\eval{(\rel_1 \join \rel_2)}(\tup) && = &&\eval{\rel_1}(\project_{\sch(\rel_1)}(\tup)) \times \eval{\rel_2}(\project_{\sch(\rel_2)}(\tup)) \\
&\eval{\select_\theta(\rel)}(\tup) && = &&\begin{cases}
\eval{\rel}(\tup) &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}\\
&\eval{R}(\tup) && = &&\rel(\tup)
\end{align*}
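As a small worked instance of these rules over $\semN$ (the relations and multiplicities are hypothetical, chosen only for illustration), let $\rel_1$ have schema $(A,B)$ with $\rel_1(a, b_1) = 2$ and $\rel_1(a, b_2) = 1$, and let $\rel_2$ have schema $(B)$ with $\rel_2(b_1) = 3$ and $\rel_2(b_2) = 4$. Then
\begin{align*}
\eval{(\rel_1 \join \rel_2)}(a, b_1) &= 2 \times 3 = 6,\\
\eval{(\rel_1 \join \rel_2)}(a, b_2) &= 1 \times 4 = 4,\\
\eval{\project_A(\rel_1 \join \rel_2)}(a) &= 6 + 4 = 10.
\end{align*}
The join multiplies the multiplicities of the matching input tuples, and the projection sums the multiplicities of all tuples that project onto $(a)$.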
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{$\semNX$ as a Representation System}\label{sec:semnx-as-repr}
Let $\semNX$ denote the set of polynomials over variables $\vct{X}$ with natural number coefficients and exponents.
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$ and whose addition and multiplication are the standard addition and multiplication of polynomials. We will use $\semNX$-databases $\db$ paired with a probability distribution to represent $\semN$-PDBs.\BG{Need more motivation?} To justify the use of $\semNX$-databases, we need to show that any $\semN$-PDB can be encoded in this way and that query semantics over this representation coincides with query semantics over $\semN$-PDBs. To this end, it is convenient to define the notion of a representation system.\BG{cite}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Representation System]\label{def:representation-syste}
A representation system for $\semN$-PDBs is a tuple $(\reprs, \rmod)$ where $\reprs$ is a set of representations and $\rmod$ associates with each $\repr \in \reprs$ a $\semN$-PDB $\pdb$. We say that a representation system is \emph{closed} under a class of queries $\qClass$ if for any query $\query \in \qClass$ we have:
%
\[ \rmod(\query(\repr)) = \query(\rmod(\repr)) \]
A representation system is \emph{complete} if for every $\semN$-PDB $\pdb$ there exists $\repr \in \reprs$ such that:
%
\[ \rmod(\repr) = \pdb \]
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As mentioned above we will use $\semNX$-databases paired with a probability distribution as a representation system.
We refer to such databases as $\semNX$-PDBs and use bold symbols to distinguish them from possible worlds (which are $\semN$-databases).
Formally, a $\semNX$-PDB is a $\semNX$-database paired with a probability distribution over assignments $\assign$ of the variables $\vct{X}$ occurring in annotations of $\db$ to $\{0,1\}$. Note that an assignment $\assign: \vct{X} \to \{0,1\}$ can be represented as a vector $\vct{w} \in \{0,1\}^n$ where $\vct{w}[i]$ records the value assigned to $X_i$. Thus, from now on we will solely use such vectors and implicitly understand them to represent assignments. Given an assignment $\assign$, we use $\assign(\pxdb)$ to denote the $\semN$-database obtained by applying to every annotation of $\pxdb$ the semiring homomorphism $\semNX \to \semN$ that substitutes the values given by $\assign$ for the variables of a polynomial and evaluates the resulting expression in $\semN$.\BG{explain connection to homomorphism lifting in K-relations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[$\semNX$-PDBs]\label{def:semnx-pdbs}
A $\semNX$-PDB $\pxdb$ over variables $\vct{X} = \{X_1, \ldots, X_n\}$ is a tuple $(\db,\pd)$ where $\db$ is an $\semNX$-database and $\pd$ is a probability distribution over $\vct{w} \in \{0,1\}^n$. We use $\assign_{\vct{w}}$ to denote the assignment corresponding to $\vct{w} \in \{0,1\}^n$. The $\semN$-PDB $\rmod(\pxdb) = (\idb, \pd')$ encoded by $\pxdb$ is defined as:
\begin{align*}
\idb & = \{ \assign_{\vct{w}}(\pxdb) \mid \vct{w} \in \{0,1\}^n \} \\
\pd'(\db) & = \sum_{\vct{w} \in \{0,1\}^n: \assign_{\vct{w}}(\pxdb) = \db} \pd(\vct{w})
\end{align*}
\end{Definition}
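As a small illustration of \Cref{def:semnx-pdbs} (the tuple, annotation, and probabilities are made up for this sketch), let $\vct{X} = \{X_1, X_2\}$ and let $\pxdb = (\db, \pd)$, where $\db$ consists of a single relation containing one tuple $\tup$ annotated with $X_1 + X_2$, and where $\pd(0,0) = 0.1$, $\pd(1,0) = 0.2$, $\pd(0,1) = 0.3$, and $\pd(1,1) = 0.4$. The four assignments yield only three distinct possible worlds, in which $\tup$ has multiplicity $0$, $1$, and $2$, respectively. Writing $\db_m$ for the world in which $\tup$ has multiplicity $m$, the encoded $\semN$-PDB $\rmod(\pxdb) = (\idb, \pd')$ has $\idb = \{\db_0, \db_1, \db_2\}$ and
\begin{align*}
\pd'(\db_0) &= \pd(0,0) = 0.1,\\
\pd'(\db_1) &= \pd(1,0) + \pd(0,1) = 0.5,\\
\pd'(\db_2) &= \pd(1,1) = 0.4.
\end{align*}
The expected multiplicity of $\tup$ is therefore $0 \cdot 0.1 + 1 \cdot 0.5 + 2 \cdot 0.4 = 1.3$.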
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\BG{Need an example here}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Proposition}\label{prop:semnx-pdbs-are-a-}
$\semNX$-PDBs are a complete representation system for $\semN$-PDBs that is closed under $\raPlus$ queries.
\end{Proposition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}
To prove that $\semNX$-PDBs are complete, consider the following construction that, for any $\semN$-PDB $\pdb$, produces a $\semNX$-PDB $\pxdb$ such that $\rmod(\pxdb) = \pdb$.
\BG{Add: create a number of variables $X_{ij}$ for each possible world $i$ that correspond to the maximum multiplicity of a tuple in the world. Then each tuple is annotated with a sum of variables $\sum_{i} \sum_{j \leq D_i(t)} X_{ij}$ and the probability distribution assigns only vectors where $X_{ij} = 1$ for a fixed $i$ and all $j$ a probability of $\pd(D_i)$}
The closure under $\raPlus$ queries follows from the fact that an assignment $\vct{X} \to \{0,1\}$ is a semiring homomorphism and that semiring homomorphisms commute with queries over $\semK$-relations.
\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Since $\semNX$-PDBs $\pxdb$ are a complete representation system closed under $\raPlus$, computing the expectation of the multiplicity of a tuple $t$ in the result of an $\raPlus$ query over the $\semN$-PDB $\rmod(\pxdb)$ is the same as computing the expectation of the polynomial $\query(\pxdb)(t)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Let $\pdb$ be a $\semN$-PDB and $\pxdb$ a $\semNX$-PDB such that $\rmod(\pxdb) = \pdb$. For any $\raPlus$ query $\query$ and tuple $t$, let $\poly(\vct{X}) = \query(\pxdb)(t)$. Then we have:
\[ \expct_{\idb \sim \pd}[\query(\db)(t)] = \expct_{\vct{\rw} \sim \pd}\pbox{\poly(\rw)} \]
\end{Proposition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\BG{Define TIDB, BIDB as subclasses of $\semNX$ with restrictions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[TIDBs and BIDBs]\label{def:tidbs-and-bidbs}
A \emph{TIDB} $\pxdb = (\db, \pd)$ is a $\semNX$-PDB such that (i) every tuple is annotated with either $0$ or a unique variable $X_i$ and (ii) the probability distribution $\pd$ is such that all variables are independent.
A \emph{BIDB} $\pxdb = (\db, \pd)$ is a $\semNX$-PDB such that (i) every tuple is annotated with either $0$ or a unique variable $X_i$ and (ii) the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Note that the main difference from the standard definitions of TIDBs and BIDBs is that we define them as subclasses of $\semNX$-PDBs and that we use bag semantics. Even though tuples cannot occur more than once in the input TIDB or BIDB, they can occur with a multiplicity larger than one in the result of a query.
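For instance (a sketch with hypothetical relations and tuple probabilities $\prob_1, \prob_2, \prob_3$), consider a TIDB with a relation $\rel_1$ over schema $(A)$ and a relation $\rel_2$ over schema $(A,B)$ where $\rel_1(a) = X_1$, $\rel_2(a, b_1) = X_2$, and $\rel_2(a, b_2) = X_3$, and where each $X_i$ is $1$ with probability $\prob_i$ independently of the other variables. For $\query = \project_A(\rel_1 \join \rel_2)$ we obtain
\[\query(\pxdb)(a) = X_1 X_2 + X_1 X_3.\]
In the world where all three variables are assigned $1$, the tuple $(a)$ has multiplicity $2$, even though every input tuple occurs at most once. By linearity of expectation and independence of the variables, its expected multiplicity is $\prob_1 \prob_2 + \prob_1 \prob_3$.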
\BG{Oliver's conjecture: Bag-TIDBs + Q can express any finite bag-PDB:
A well-known result for set semantics PDBs is that while not all finite PDBs can be encoded as TIDBs, any finite PDB can be encoded using a TIDB and a query. An analog result holds in our case: any finite $\semN$-PDB can be encoded as a bag TIDB and a query (WHAT CLASS? ADD PROOF)
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Expression Trees}\label{sec:expression-trees}
In the following, we use expression trees to encode polynomials. We define them formally in this subsection.
For illustrative purposes consider the polynomial $\poly(\vct{X}) = 2x^2 + 3xy - 2y^2$ over $\vct{X} = (x,y)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Expression Tree]\label{def:express-tree}
Consider a vector of variables $\vct{X}$.
An expression tree $\etree$ over $\vct{X}$ is a binary %an ADT logically viewed as an n-ary
tree whose internal nodes are from the set $\{+, \times\}$ and whose leaf nodes are either from the set $\mathbb{R}$ $(\tnum)$ or from the set of monomials $(\var)$. The members of $\etree$ are \type, \val, \vari{partial}, \vari{children}, and \vari{weight}, where \type is the type of value stored in the node $\etree$ (i.e., one of $\{+, \times, \var, \tnum\}$), \val is the value stored, and \vari{children} is the list of $\etree$'s children, where $\etree_\lchild$ is the left child and $\etree_\rchild$ the right child.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We ignore the remaining fields (\vari{partial} and \vari{weight}) for now. Their purpose will become clear in~\Cref{sec:approximation-algo}. Note that $\etree$ need not encode an expression in the standard monomial basis. For instance, $\etree$ could represent a compressed form of the running example, such as $(x + 2y)(2x - y)$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[poly$(\cdot)$]\label{def:poly-func}
Let $\polyf(\cdot)$ denote the function that takes an expression tree $\etree$ as input and outputs the polynomial it encodes. $\polyf(\cdot)$ is defined recursively on $\etree$ as follows, where $\etree_\lchild$ and $\etree_\rchild$ denote the left and right child of $\etree$, respectively.
% \begin{align*}
% &\etree.\type = +\mapsto&& \polyf(\etree_\lchild) + \polyf(\etree_\rchild)\\
% &\etree.\type = \times\mapsto&& \polyf(\etree_\lchild) \cdot \polyf(\etree_\rchild)\\
% &\etree.\type = \var \text{ OR } \tnum\mapsto&& \etree.\val
% \end{align*}
\begin{equation*}
\polyf(\etree) = \begin{cases}
\polyf(\etree_\lchild) + \polyf(\etree_\rchild) &\text{ if \etree.\type } = +\\
\polyf(\etree_\lchild) \cdot \polyf(\etree_\rchild) &\text{ if \etree.\type } = \times\\
\etree.\val &\text{ if \etree.\type } = \var \text{ OR } \tnum.
\end{cases}
\end{equation*}
\end{Definition}
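As a sketch of how the recursion unfolds, consider an expression tree $\etree$ for the compressed form $(x + 2y)(2x - y)$ of the running example. The particular tree shape is one choice made for illustration: the root of $\etree$ has type $\times$, both of its children have type $+$, and the products $2y$, $2x$, and $-y$ are each encoded as a $\times$ node over a $\tnum$ leaf and a $\var$ leaf. Applying the definition from the root down gives
\begin{align*}
\polyf(\etree) &= \polyf(\etree_\lchild) \cdot \polyf(\etree_\rchild)\\
&= (x + 2 \cdot y) \cdot (2 \cdot x + (-1) \cdot y)\\
&= 2x^2 + 3xy - 2y^2.
\end{align*}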
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Note that addition and multiplication above follow the standard interpretation over polynomials.
%Specifically, when adding two monomials whose variables and respective exponents agree, the coefficients corresponding to the monomials are added and their sum is multiplied to the monomial. Multiplication here is denoted by concatenation of the monomial and coefficient. When two monomials are multiplied, the product of each corresponding coefficient is computed, and the variables in each monomial are multiplied, i.e., the exponents of like variables are added. Again we notate this by the direct product of coefficient product and all disitinct variables in the two monomials, with newly computed exponents.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[Expression Tree Set]\label{def:express-tree-set}$\etreeset{\smb}$ is the set of all possible expression trees $\etree$ such that $\polyf(\etree) = \poly(\vct{X})$.
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For our running example, $\etreeset{\smb}$ contains, among others, expression trees for $2x^2 + 3xy - 2y^2$, $(x + 2y)(2x - y)$, $x(2x - y) + 2y(2x - y)$, and $2x(x + 2y) - y(x + 2y)$. Note that \cref{def:express-tree-set} implies that $\etree \in \etreeset{\polyf(\etree)}$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Problem Definition}\label{sec:problem-definition}
We are now ready to formally state the main problem addressed in this work.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
Let $\vct{X} = (X_1, \ldots, X_n)$, let $\pxdb$ be an $\semNX$-PDB over $\vct{X}$ with probability distribution $\pd$ over assignments $\vct{X} \to \{0,1\}$, let $\query$ be a $k$-ary query, and let $t$ be a $k$-ary tuple.
The \expectProblem is defined as follows:
\begin{itemize}
\item \textbf{Input}: An expression tree $\etree \in \etreeset{\smb}$ for $\poly(\vct{X}) = \query(\pxdb)(t)$
\item \textbf{Output}: $\expct_{\vct{X} \sim \pd}[\poly(\vct{X})]$
\end{itemize}
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% When $\idb$ is a probabilistic database, $\idb$ can be viewed as a two-tuple $(\wSet, \pd)$, where $\wSet$ as noted, is the set of possible worlds, and $\pd$ is a probability distribution over $\wSet$.
% The possible worlds semantics gives a framework for how to think about running queries over $\idb$. Given a query $\query$, $\query$ is deterministically run over each $\db \in \idb$, and the output of $\query(\idb)$ is defined as the set of results (worlds) from running $\query$ over each $\db_i \in \idb$. We write this formally as,
% \[\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}.\]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Previous}
\begin{Definition}[$\bi$~\cite{DBLP:series/synthesis/2011Suciu}]
A Block Independent Disjoint database ($\bi$) is a PDB whose tuples are partitioned into blocks, where we denote block $i$ as $\block_i$. Each $\block_i$ is independent of all other blocks, while all tuples in the same block $\block_i$ are mutually exclusive.
\end{Definition}
\begin{Definition}[$\ti$]
A Tuple Independent Database ($\ti$) is a special case of a $\bi$ such that each tuple is its own block.
\end{Definition}
\subsection{Modeling and Semantics}
Define $\vct{X}$ to be the vector of variables $X_1,\dots,X_M$. Let $\tset$ be the set of all tuples in the domain of $\sch(\db)$.
\subsubsection{K-relations}\label{subsubsec:k-rel}
The information encoded in the annotation depends on the underlying semiring of the relation.
As noted in \cite{DBLP:conf/pods/GreenKT07}, the $\mathbb{N}[\vct{X}]$-semiring is a semiring over the set $\mathbb{N}[\vct{X}]$ of all polynomials. The variables of these polynomials can be substituted with $K$-values from other semirings, and the operators can be evaluated with the operators of the substituted semiring, to obtain different semantics such as set, bag, and security semantics.
Further define $\nxdb$ as an $\mathbb{N}[\vct{X}]$-database where each tuple $\tup \in \nxdb$ is annotated with a polynomial over variables $X_1,\ldots, X_M$.
Since $\nxdb$ is a database that maps tuples to polynomials, it is customary to view an arbitrary table $\rel$ as a function $\rel: \tset \to \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial annotating tuple $\tup$.
Previous work~\cite{DBLP:conf/pods/GreenKT07} has shown that the operations of an $\raPlus$ query translate precisely into operations of a commutative semiring over $K$-annotations.
%The evalution semantics notation $\llbracket \cdot \rrbracket = x$ simply mean that the result of evaluating expression $\cdot$ is given by following the semantics $x$.
Given a query $\query$, operations in $\query$ are translated into the following polynomial expressions.
\begin{align*}
&\eval{\project_A(\rel)}(\tup)&& = &&\sum_{\tup': \project_A(\tup') = \tup} \eval{\rel}(\tup')\\
&\eval{(\rel_1 \union \rel_2)}(\tup)&& = &&\eval{\rel_1}(\tup) + \eval{\rel_2}(\tup)\\
&\eval{(\rel_1 \join \rel_2)}(\tup) && = &&\eval{\rel_1}(\project_{\sch(\rel_1)}(\tup)) \times \eval{\rel_2}(\project_{\sch(\rel_2)}(\tup)) \\
&\eval{\select_\theta(\rel)}(\tup) && = &&\begin{cases}
\eval{\rel}(\tup) &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}\\
&\eval{R}(\tup) && = &&\rel(\tup)
\end{align*}
The above semantics show us how to obtain the $K$-annotation on a tuple in the result of query $\query$ from the annotations of the input tuples. When used with $\mathbb B$-typed variables, an $\mathbb{N}[\vct{X}]$ relation is effectively a C-Table \cite{DBLP:conf/pods/GreenKT07}, since all first order formulas can be equivalently modeled by polynomials, where $\oplus$ is disjunction and $\otimes$ is conjunction.
This is equivalent to substituting values and operators from the semiring $(\mathbb{B}, \vee, \wedge, \bot, \top)$. In like manner, when assigning values from the domain $\mathbb{N}$, the polynomials model bag semantics, where the values and the $\oplus$ and $\otimes$ operations come from the natural numbers semiring $(\mathbb{N}, +, \times, 0, 1)$.
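As a small illustration (the annotation and the substituted values are chosen only for this sketch), take a result tuple annotated with $X_1 X_2 + X_1 X_3$. Substituting $X_1 = \top$, $X_2 = \bot$, $X_3 = \top$ and interpreting the operators in the Boolean semiring yields
\[(\top \wedge \bot) \vee (\top \wedge \top) = \top,\]
i.e., the tuple is in the result under set semantics. Substituting $X_1 = 2$, $X_2 = 1$, $X_3 = 3$ and interpreting the operators in the natural numbers semiring yields
\[2 \times 1 + 2 \times 3 = 8,\]
i.e., the tuple has multiplicity $8$ under bag semantics.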
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Defining the Data}\label{subsec:def-data}
For the set of possible worlds, $\wSet$, i.e. the set of all $\db_i \in \idb$, define an injective mapping to the set $\{0, 1\}^M$, where for each vector $\vct{w} \in \{0, 1\}^M$ there is at most one element $\db_i \in \idb$ mapped to $\vct{w}$.
In the general case, the binary value of $\vct{w}$ uniquely identifies a potential possible world. For example, consider the Tuple Independent Database ($\ti$) data model, in which each table is a set of tuples, each of which occurs independently of all others with its own probability $\prob_\tup$. Because of independence, a $\ti$ with $\numvar$ tuples has $2^\numvar$ possible worlds; thus $\numvar = M$, and the injective mapping for each $\vct{w} \in \{0, 1\}^M$ is trivial. In the Block Independent Disjoint data model ($\bi$), because of the disjointness condition on tuples within the same block, a $\bi$ may have fewer than $2^M$ possible worlds, since some combinations of tuples cannot co-occur.
Let $\rw$ denote a random variable that selects a world according to distribution $\pd$. Provided that $\pd[\rw = \vct{w}] = 0$ for every $\vct{w} \in \{0, 1\}^M$ that does not correspond to a possible world, a probability distribution over $\{0, 1\}^M$ is a distribution over $\wSet$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%This could be a way to think of world binary vectors in the general case
%Let $\vct{w}$ be a $\left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil = \numvar$ binary bit vector, uniquely identifying possible world $\db_i \in \idb$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
From this point on, our discussion focuses on one specific tuple $\tup$. Thus, we abuse notation by using $\poly(\vct{X})$ to denote the annotated polynomial $\llbracket\query(\db)\rrbracket(\tup)$, where each $X_i \in \vct{X}$ ranges over $\{0, 1\}$, so that the injective mapping above associates each possible world with an assignment to $\vct{X}$.
One of the aggregates we want to compute over the annotated polynomial is its expectation over possible worlds:
\[\expct_{\vct{\rw} \sim \pd}\pbox{\poly(\rw)} = \sum\limits_{\vct{w} \in \{0, 1\}^\numvar} \poly(\vct{w})\cdot \pd[\rw = \vct{w}].\]
Above, $\poly(\vct{w})$ denotes the value of $\poly$ under the assignment of $\vct{w}$ to $\vct{X}$.
For a $\ti$, the bit-string world value $\vct{w}$ can be used as an index to determine which tuples are present in the world $\vct{w}$, where the $i^{\text{th}}$ bit position $(\wbit_i)$ records whether tuple $\tup_i$ appears in the unique world $\vct{w}$. Let $\vct{p}$ be the vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$, such that these probabilities produce the possible worlds in $\idb$ with distribution $\pd$ over all worlds. Let $\pd^{(\vct{p})}$ denote the distribution induced by $\vct{p}$. Then
\[\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} = \sum\limits_{\vct{w} \in \{0, 1\}^\numvar} \poly(\vct{w})\prod_{\substack{i \in [\numvar]\\ \text{s.t. } \wbit_i = 1}}\prob_i \prod_{\substack{i \in [\numvar]\\ \text{s.t. } \wbit_i = 0}}\left(1 - \prob_i\right).\]
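As a quick check of the formula above (a sketch with a made-up polynomial), let $\numvar = 2$ and $\poly(\vct{X}) = X_1 X_2 + X_1$. Only the worlds $\vct{w} = (1,0)$ and $\vct{w} = (1,1)$ contribute a nonzero term, with $\poly(1,0) = 1$ and $\poly(1,1) = 2$, so
\[\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} = 1 \cdot \prob_1 (1 - \prob_2) + 2 \cdot \prob_1 \prob_2 = \prob_1 + \prob_1 \prob_2,\]
which agrees with what linearity of expectation and independence of the two tuples give directly.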
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: