paper-BagRelationalPDBsAreHard/ra-to-poly.tex
Boris Glavic 1074db700e ra
2020-12-13 10:09:49 -06:00

116 lines
11 KiB
TeX

%root: main.tex
%!TEX root=./main.tex
%\onecolumn
\section{Background and Notation}
\subsection{Probabilistic Databases (PDBs)}
An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world:
\[\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}\]
For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]
Note that in this work we consider multisets, i.e., each possible world is a set of multiset relations and queries are evaluated using bag semantics. We will use K-relations to model multisets. A \emph{K-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are each annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\ondK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when being combined with $\multK$.
Let $\udom$ be a countable domain of values.
Formally, an n-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ ($\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to tuple $t$.
We review the semantics of positive relational algebra queries over $\semK$-relations below. \BG{should we use DL instead}
Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural number. $\semN$-databases are used to model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-databases is a PDB where each possible world is a $\semN$-database. We will study the problem of evaluating statical moments of query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result tuple $t$, we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation $\expct_{\db \in \idb}[\query(\db)(t)]$:
\[\expct_{\db \in \idb}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db) \]
Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in the query result.
We are now ready to formally state the main problem addressed in this work.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
The \expectProblem is defined as follows:
\begin{itemize}
\item \textbf{Input}: A $\semN$-PDB $\pdb$, n-ary query $\query$, an n-ary tuple $t$
\item \textbf{Output}:
\end{itemize}
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% When $\idb$ is a probabilistic database, $\idb$ can be viewed as a two-tuple $(\wSet, \pd)$, where $\wSet$ as noted, is the set of possible worlds, and $\pd$ is a probability distribution over $\wSet$.
% The possible worlds semantics gives a framework for how to think about running queries over $\idb$. Given a query $\query$, $\query$ is deterministically run over each $\db \in \idb$, and the output of $\query(\idb)$ is defined as the set of results (worlds) from running $\query$ over each $\db_i \in \idb$. We write this formally as,
% \[\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}.\]
\begin{Definition}[$\bi$~\cite{DBLP:series/synthesis/2011Suciu}]
A Block Independent Database ($\bi$) is a PDB whose tuples are partitioned in blocks, where we denote block $i$ as $\block_i$. Each $\block_i$ is independent of all other blocks, while all tuples sharing the same $\block_i$ are mutually exclusive.
\end{Definition}
\begin{Definition}[$\ti$]
A Tuple Independent Database ($\ti$) is a special case of a $\bi$ such that each tuple is its own block.
\end{Definition}
\subsection{Modeling and Semantics}
Define $\vct{X}$ to be the vector of variables $X_1,\dots,X_M$. Let the set of all tuples in domain of $\sch(\db)$ be $\tset$.
\subsubsection{K-relations}\label{subsubsec:k-rel}
The information encoded in the annotation depends on the underlying semiring of the relation.
As noted in \cite{DBLP:conf/pods/GreenKT07}, the $\mathbb{N}[\vct{X}]$-semiring is a semiring over the set $\mathbb{N}[\vct{X}]$ of all polynomials, whose variables can then be substituted with $K$-values from other semirings, evaluating the operators with the operators of the substituted semiring, to produce varying semantics such as set, bag, and security.
Further define $\nxdb$ as an $\mathbb{N}[\vct{X}]$ database where each tuple $\tup \in \db$ is annotated with a polynomial over variables $X_1,\ldots, X_M$.
Since $\nxdb$ is a database that maps tuples to polynomials, it is customary for arbitrary table $\rel$ to be viewed as a function $\rel: \tset \mapsto \mathbb{N}[\vct{X}]$, where $\rel(\tup)$ denotes the polynomial annotating tuple $\tup$.
It has been shown in previous work that commutative semirings precisely model translations of RA+ query operations to $K$-annotations.
%The evalution semantics notation $\llbracket \cdot \rrbracket = x$ simply mean that the result of evaluating expression $\cdot$ is given by following the semantics $x$.
Given a query $\query$, operations in $\query$ are translated into the following polynomial expressions.
\begin{align*}
&\eval{\project_A(\rel)}(\tup)&& = &&\sum_{\tup': \project_A(\tup) = \tup} \eval{\rel}(\tup')\\
&\eval{(\rel_1 \union \rel_2)}(\tup)&& = &&\eval{\rel_1}(\tup) + \eval{\rel_2}(\tup)\\
&\eval{(\rel_1 \join \rel_2)}(\tup) && = &&\eval{\rel_1}(\project_{\sch(\rel_1)}(\tup)) \times \eval{\rel_2}(\project_{\sch(\rel_2)}(\tup)) \\
&\eval{\select_\theta(\rel)}(\tup) && = &&\begin{cases}
\eval{\rel}(\tup) &\text{if }\theta(\tup) = 1\\
0 &\text{otherwise}.
\end{cases}\\
&\eval{R}(\tup) && = &&\rel(\tup)
\end{align*}
The above semantics show us how to obtain the $K$-annotation on a tuple in the result of query $\query$ from the annotations of the input tuples. When used with $\mathbb B$-typed variables, an $\mathbb{N}[\vct{X}]$ relation is effectively a C-Table \cite{DBLP:conf/pods/GreenKT07}, since all first order formulas can be equivalently modeled by polynomials, where $\oplus$ is disjunction and $\otimes$ is conjunction.
This is the equivalent to substituting values and operators from the $\{\mathbb{B}, \vee, \wedge, \bot, \top\}$ semiring. In like manner, when assigning values from the $\mathbb{N}$ domain, the polynomials then model bag semantics, where the variables and $\oplus$ and $\otimes$ operations come from the natural numbers semiring $\{\mathbb{N}, +, \times, 0, 1\}$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Defining the Data}\label{subsec:def-data}
For the set of possible worlds, $\wSet$, i.e. the set of all $\db_i \in \idb$, define an injective mapping to the set $\{0, 1\}^M$, where for each vector $\vct{w} \in \{0, 1\}^M$ there is at most one element $\db_i \in \idb$ mapped to $\vct{w}$.
In the general case, the binary value of $\vct{w}$ uniquely identifies a potential possible world. For example, consider the case of the Tuple Independent Database $(\ti)$ data model in which each table is a set of tuples, each of which is independent of one another, and individually occur with a specific probability $\prob_\tup$. Because of independence, a $\ti$ with $\numvar$ tuples naturally has $2^\numvar$ possible worlds, thus $\numvar = M$, and the injective mapping for each $\vct{w} \in \{0, 1\}^M$ is trivial. In the Block Independent Disjoint data model ($\bi$), because of the disjoint condition on tuples within the same block, a $\bi$ may not have exactly $2^M$ possible worlds since there are combinations of tuples that cannot exist in the encoding.
Denote a random variable selecting a world according to distribution $P$ to be $\rw$. Provided that for any non-possible world $\vct{w} \in \{0, 1\}^M, \pd[\rw = \vct{w}] = 0$, a probability distribution over $\{0, 1\}^M$ is a distribution over $\Omega$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%This could be a way to think of world binary vectors in the general case
%Let $\vct{w}$ be a $\left\lceil\log_2\left(\left|\wSet\right|\right)\right\rceil = \numvar$ binary bit vector, uniquely identifying possible world $\db_i \in \idb$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
From this point on our discussion focuses on exactly one specific tuple $\tup$. Thus, we abuse notation by using $\poly(\vct{X})$ to be the annotated polynomial $\llbracket\poly(\db)\rrbracket(\tup)$, and for a domain of $\{0, 1\}$ for each $X_i \in \vct{X}$, the injective mapping maps $\db$ to $\vct{X}$.
One of the aggregates we desire to compute over the annotated polynomial is the expectation over possible worlds, denoted,
\[\expct_{\vct{\rw} \sim \pd}\pbox{\poly(\rw)} = \sum\limits_{\vct{w} \in \{0, 1\}^\numvar} \poly(\vct{w})\cdot \pd[\rw = \vct{w}].\]
Above, $\poly(\vct{w})$ is used to mean the assignment of $\vct{w}$ to $\vct{X}$.
For a $\ti$, the bit-string world value $\vct{w}$ can be used as indexing to determine which tuples are present in the $\vct{w}$ world, where the $i^{th}$ bit position $(\wbit_i)$ represents whether a tuple $\tup_i$ appears in the unique world $\vct{w}$. Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$ such that those probabilities produce the possible worlds in D with a distribution $\pd$ over all worlds. Let $\pd^{(\vct{p})}$ represent the distribution induced by $\vct{p}$.
\[\expct_{\rw\sim \pd^{(\vct{p})}}\pbox{\poly(\rw)} = \sum\limits_{\vct{w} \in \{0, 1\}^\numvar} \poly(\vct{w})\prod_{\substack{i \in [\numvar]\\ s.t. \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numvar]\\s.t. w_i = 0}}\left(1 - \prob_i\right).\]
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: