Simplifying discussion of N[X]-DBs

master
Oliver Kennedy 2021-04-09 00:07:33 -04:00
parent a089fe3a23
commit b1792a70ba
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
4 changed files with 64 additions and 48 deletions

View File

@ -2,6 +2,10 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Missing details from Section~\ref{sec:background}}\label{sec:proofs-background}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{K-Relations}\label{subsec:supp-mat-krelations}
\input{k-relations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Supplementary Material for~\Cref{prop:expection-of-polynom}}\label{subsec:supp-mat-background}
\input{app_notation-background}

23
k-relations.tex Normal file
View File

@ -0,0 +1,23 @@
%!TEX root=./main.tex
We use $\domK$-relations to model bags. A \emph{$\domK$-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
Let $\udom$ be a countable domain of values.
Formally, an n-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ (resp., $\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to $t$.
For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07} illustrated in \cref{fig:nxDBSemantics}.
We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. Below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result tuple $t$, we use $\query(\db)(t)$ for $\db \in \idb$ as input in RHS of \cref{eq:intro-bag-expectation} to compute the expected multiplicity of $t$. Note that the tables of \cref{fig:ex-shipping-simp} have an implicit $1$ $\semN$-valued annotation for each tuple in tables $OnTime$ and $Route$.
%\cref{ex:intro-tbls} and \cref{ex:intro-lineage} $\semN$-valued variable and are interested in computing its expectation $\expct_{\idb \sim \probDist}[\query(\db)(t)]$:
%%
%\begin{equation}\label{eq:bag-expectation}
%\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \probOf(\db)
%\end{equation}
%%
Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in result of query $\query$.
Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_n)$ with natural number coefficients and exponents.
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$, with the standard addition and multiplication of polynomials.
We will use $\semNX$-PDB $\pxdb$, defined as the tuple $(\idb_{\semNX}, \pd)$, where $\semNX$-database $\idb_{\semNX}$ is paired with probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-PDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs and a function $\rmod$ (which transforms an $\semNX$-PDB to an equivalent $\semN$-PDB) are both formalized in \Cref{subsec:supp-mat-background}).

View File

@ -59,7 +59,7 @@ sensitive=true
\title{Standard Operating Procedure in Bag PDBs Queries Considered Harmful}
\title{Standard Operating Procedure in Bag PDB Queries Considered Harmful}
\titlerunning{Bag PDB Queries}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View File

@ -52,59 +52,48 @@ Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\p
For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$ that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]
Note that in this work, for the query output, we consider bags, i.e., each possible world in the query output is a set of bag relations and queries are evaluated using bag semantics. We will use $\domK$-relations to model bags. A \emph{$\domK$-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
Let $\udom$ be a countable domain of values.
Formally, an n-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ (resp., $\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to $t$.
We review positive relational algebra semantics for $\semK$-relations below.
Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result tuple $t$, we use $\query(\db)(t)$ for $\db \in \idb$ as input in RHS of \cref{eq:intro-bag-expectation} to compute the expected multiplicity of $t$. Note that the tables of \cref{fig:ex-shipping-simp} have an implicit $1$ $\semN$-valued annotation for each tuple in tables $OnTime$ and $Route$.
%\cref{ex:intro-tbls} and \cref{ex:intro-lineage} $\semN$-valued variable and are interested in computing its expectation $\expct_{\idb \sim \probDist}[\query(\db)(t)]$:
%%
%\begin{equation}\label{eq:bag-expectation}
%\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \probOf(\db)
%\end{equation}
%%
Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in result of query $\query$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Representation System and Semantics}\label{sec:semnx-as-repr}
\subsubsection{$\semK$-relational Query Semantics}
For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. Below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
\begin{align*}
\evald{\project_A(\rel)}{\db}(\tup) &= \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) &= \evald{\rel_1}{\db}(\tup) \addK \evald{\rel_2}{\db}(\tup)\\
\evald{\select_\theta(\rel)}{\db}(\tup) &= \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
\end{cases} &
\evald{(\rel_1 \join \rel_2)}{\db}(\tup) &=
\begin{aligned}
\evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \multK \\
\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup))
\end{aligned}\\
& & \evald{R}{\db}(\tup) &= \rel(\tup)
\end{align*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{$\semNX$ as a Representation System}\label{sec:semnx-as-repr}
Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_n)$ with natural number coefficients and exponents.
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$, with the standard addition and multiplication of polynomials.
We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations} and summarized here.
In an $\semNX$-databases, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$.
Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{|X|}$.
The multiplicity of $t \in R$ in this possible world is obtained by evaluating the polynomial annotating it on $\vct{W}$ (i.e., $R(t)(\vct{W})$).
$\semNX$-relations are closed under $\raPlus$ (\cref{fig:nxDBSemantics}).
\begin{figure}
\begin{align*}
\evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
\evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
\evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
\evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
\zeroK & \text{otherwise}.
\end{cases} &
\begin{aligned}
\evald{(\rel_1 \join \rel_2)}{\db}(\tup) =\\ ~
\end{aligned}&
\begin{aligned}
&\evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \\
&~~~\cdot\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup))
\end{aligned}\\
& & \evald{R}{\db}(\tup) =& \rel(\tup)
\end{align*}
\caption{Evaluation semantics $\evald{\cdot}{\db}$ for $\semNX$-DBs~\cite{DBLP:conf/pods/GreenKT07}.}
\label{fig:nxDBSemantics}
\end{figure}
% For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
% We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. Below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We will use $\semNX$-PDB $\pxdb$, defined as the tuple $(\idb_{\semNX}, \pd)$, where $\semNX$-database $\idb_{\semNX}$ is paired with probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-PDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs and a function $\rmod$ (which transforms an $\semNX$-PDB to an equivalent $\semN$-PDB) are both formalized in \Cref{subsec:supp-mat-background}). \AR{Boris/Oliver: Should we mention that the proposition is obvious but has not been stated in literature for bags?}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
$\semNX$-PDBs and a function $\rmod$ (which transforms an $\semNX$-PDB to an equivalent $\semN$-PDB) are both formalized in \Cref{subsec:supp-mat-background}.
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given an $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$, we have:
$ \expct_{\idb \sim \pd}[\query(\idb)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})}. $
\footnote{Although assumed by most prior work on set-probabilistic databases, e.g., as an obvious consequence of~\cite{IL84a}'s Theorem 7.1, we are unaware of any formal proof for bag-probabilistic databases.}
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.
This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.