Cleaning up appendix

master
Oliver Kennedy 2021-04-09 22:00:34 -04:00
parent a9ee4c1fe0
commit 58b70f0fcf
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
4 changed files with 36 additions and 28 deletions

View File

@ -1,4 +1,7 @@
%root: main.tex
%!TEX root=./main.tex
\input{k-relations}
To justify the use of $\semNX$-databases, we need to show that we can encode any $\semN$-PDB in this way and that the query semantics over this representation coincides with query semantics over $\semN$-PDB. For that it will be opportune to define representation systems for $\semN$-PDBs.\BG{cite}
@ -43,7 +46,7 @@ Importantly, as the following proposition shows, any finite $\semN$-PDB can be e
$\semNX$-PDBs are a complete representation system for $\semN$-PDBs that is closed under $\raPlus$ queries.
\end{Proposition}
\subsection{Proof of~\Cref{prop:semnx-pdbs-are-a-}}
%\subsection{Proof of~\Cref{prop:semnx-pdbs-are-a-}}
\begin{proof}
To prove that $\semNX$-PDBs are complete consider the following construction that for any $\semN$-PDB $\pdb = (\idb, \pd)$ produces an $\semNX$-PDB $\pxdb = (\idb_{\semNX}, \pd')$ such that $\rmod(\pxdb) = \pdb$. Let $\idb = \{D_1, \ldots, D_{\abs{\idb}}\}$ and let $max(D_i)$ denote $max_{\tup} D_i(\tup)$. For each world $D_i$ we create a corresponding variable $X_i$.
%variables $X_{i1}$, \ldots, $X_{im}$ where $m = max(D_i)$.
@ -71,6 +74,29 @@ Since $\semNX$-PDBs $\pxdb$ are a complete representation system for $\semN$-PDB
\qed
\end{proof}
\subsection{\tis and \bis in the $\semNX$-PDB model}\label{subsec:supp-mat-ti-bi-def}
Two important subclasses of $\semNX$-PDBs that are of interest to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\prob_\tup$. The set of possible worlds represented by a \ti $\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of all tuples that exist with one minus the probability of all tuples of $\db$ that are not part of this world, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi $\db$ are all subsets of $\db$ that contain at most one tuple from each block. Note then that the tuples sharing the same block are disjoint, and the sum of the probabilitites of all the tuples in the same block $\block$ is $1$. The probability of such a world is the product of the probabilities of all tuples present in the world. %and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world.
For bag \tis and \bis, we define the probability of a tuple to be the probability that the tuple exists with multiplicity at least $1$.
As already noted above, in this work, we define \tis and \bis as subclasses of $\semNX$-PDBs.
In this work, we consider one further deviation from the standard: We use bag semantics for queries.
Even though tuples cannot occur more than once in the input \ti or \bi, they can occur with a multiplicity larger than one in the result of a query.
Since in \tis and \bis, there is a one-to-one correspondence between tuples in the database and variables, we can interpret a vector $\vct{w} \in \{0,1\}^n$ as denoting which tuples exist in the possible world $\assign_{\vct{w}}(\pxdb)$ (the ones where $\vct{w}[j] = 1$).
For BIDBs specifically, note that that at most one of the bits corresponding to tuples in each block will be set (i.e., for any pair of bits $w_j$, $w_{j'}$ that are part of the same block $b_i \supseteq \{t_{i,j}, t_{i,j'}\}$, at most one of them will be set).
Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$. Let $\pd^{(\vct{p})}$ denote the distribution induced by $\vct{p}$.
%
\begin{align}\label{eq:tidb-expectation}
\expct_{\vct{W} \sim \pd^{(\vct{p})}}\pbox{\poly(\vct{W})}
= \sum\limits_{\substack{\vct{w} \in \{0, 1\}^\numvar\\ s.t. w_j,w_{j'} = 1 \rightarrow \not \exists b_i \supseteq \{t_{i,j}, t_{i',j}\}}} \poly(\vct{w})\prod_{\substack{j \in [\numvar]\\ s.t. \wElem_j = 1}}\prob_j \prod_{\substack{j \in [\numvar]\\s.t. w_j = 0}}\left(1 - \prob_i\right)
\end{align}
%
Recall that tuple blocks in a TIDB always have size 1, so the outer summation of \cref{eq:tidb-expectation} is over the full set of vectors.
\BG{Oliver's conjecture: Bag-\tis + Q can express any finite bag-PDB:
A well-known result for set semantics PDBs is that while not all finite PDBs can be encoded as \tis, any finite PDB can be encoded using a \ti and a query. An analog result holds in our case: any finite $\semN$-PDB can be encoded as a bag \ti and a query (WHAT CLASS? ADD PROOF)
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Proof of~\Cref{prop:expection-of-polynom}}
\label{subsec:expectation-of-polynom-proof}
\begin{proof}
@ -89,24 +115,7 @@ By expanding $\polyForTuple$ and the expectation we have:
\qed
\end{proof}
\subsection{Supplementary Material for~\Cref{subsec:tidbs-and-bidbs}}\label{subsec:supp-mat-ti-bi-def}
Two important subclasses of $\semNX$-PDBs that are of interest to us are the bag versions of tuple-independent databases (\tis) and block-independent databases (\bis). Under set semantics, a \ti is a deterministic database $\db$ where each tuple $\tup$ is assigned a probability $\prob_\tup$. The set of possible worlds represented by a \ti $\db$ is all subsets of $\db$. The probability of each world is the product of the probabilities of all tuples that exist with one minus the probability of all tuples of $\db$ that are not part of this world, i.e., tuples are treated as independent random events. In a \bi, we also assign each tuple a probability, but additionally partition $\db$ into blocks. The possible worlds of a \bi $\db$ are all subsets of $\db$ that contain at most one tuple from each block. Note then that the tuples sharing the same block are disjoint, and the sum of the probabilitites of all the tuples in the same block $\block$ is $1$. The probability of such a world is the product of the probabilities of all tuples present in the world. %and one minus the sum of the probabilities of all tuples from blocks for which no tuple is present in the world.
For bag \tis and \bis, we define the probability of a tuple to be the probability that the tuple exists with multiplicity at least $1$.
\AH{This part \emph{below} needs more work if we include it.}
Note that the main difference to the standard definitions of \tis and \bis is that we define them as subclasses of $\semNX$-PDBs and that we use bag semantics. Even though tuples cannot occur more than once in the input \ti or \bi, they can occur with a multiplicity larger than one in the result of a query. Since in \tis and \bis, there is a one-to-one correspondence between tuples in the database and variables, we can interpret a vector $\vct{w} \in \{0,1\}^n$ as denoting which tuples exist in the possible world $\assign_{\vct{w}}(\pxdb)$ (the ones where $\vct{w}[i] = 1$). Denote the vector $\vct{p}$ to be a vector whose elements are the individual probabilities $\prob_i$ of each tuple $\tup_i$. Let $\pd^{(\vct{p})}$ denote the distribution induced by $\vct{p}$.
%
\begin{align}\label{eq:tidb-expectation}
\expct_{\vct{W} \sim \pd^{(\vct{p})}}\pbox{\poly(\vct{W})} = \sum\limits_{\vct{w} \in \{0, 1\}^\numvar} \poly(\vct{w})\prod_{\substack{i \in [\numvar]\\ s.t. \wElem_i = 1}}\prob_i \prod_{\substack{i \in [\numvar]\\s.t. w_i = 0}}\left(1 - \prob_i\right).
\end{align}
%
\BG{Do we need the BIDB formula?}
\BG{Oliver's conjecture: Bag-\tis + Q can express any finite bag-PDB:
A well-known result for set semantics PDBs is that while not all finite PDBs can be encoded as \tis, any finite PDB can be encoded using a \ti and a query. An analog result holds in our case: any finite $\semN$-PDB can be encoded as a bag \ti and a query (WHAT CLASS? ADD PROOF)
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{~\Cref{lem:pre-poly-rpoly}}\label{app:subsec-pre-poly-rpoly}
\begin{Lemma}\label{lem:pre-poly-rpoly}
If

View File

@ -3,11 +3,7 @@
\section{Missing details from Section~\ref{sec:background}}\label{sec:proofs-background}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{K-Relations}\label{subsec:supp-mat-krelations}
\input{k-relations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Supplementary Material for~\Cref{prop:expection-of-polynom}}\label{subsec:supp-mat-background}
\subsection{Supplementary Material for~\Cref{prop:expection-of-polynom}}\label{subsec:supp-mat-background}\label{subsec:supp-mat-krelations}
\input{app_notation-background}

View File

@ -1,11 +1,11 @@
%!TEX root=./main.tex
We use $\domK$-relations to model bags. A \emph{$\domK$-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$. A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
Let $\udom$ be a countable domain of values.
Formally, an n-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
Formally, an n-ary $\semK$-relation over a domain of attribute values $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ (resp., $\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to $t$.
For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07} illustrated in \cref{fig:nxDBSemantics}.
We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. Below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
In \cref{fig:nxDBSemantics}, we use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$, assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases. Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result tuple $t$, we use $\query(\db)(t)$ for $\db \in \idb$ as input in RHS of \cref{eq:intro-bag-expectation} to compute the expected multiplicity of $t$. Note that the tables of \cref{fig:ex-shipping-simp} have an implicit $1$ $\semN$-valued annotation for each tuple in tables $OnTime$ and $Route$.
%\cref{ex:intro-tbls} and \cref{ex:intro-lineage} $\semN$-valued variable and are interested in computing its expectation $\expct_{\idb \sim \probDist}[\query(\db)(t)]$:
@ -20,4 +20,7 @@ Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$, with the standard addition and multiplication of polynomials.
We will use $\semNX$-PDB $\pxdb$, defined as the tuple $(\idb_{\semNX}, \pd)$, where $\semNX$-database $\idb_{\semNX}$ is paired with probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-PDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs and a function $\rmod$ (which transforms an $\semNX$-PDB to an equivalent $\semN$-PDB) are both formalized in \Cref{subsec:supp-mat-background}).
$\semNX$-PDBs and a function $\rmod$ (which transforms an $\semNX$-PDB to an equivalent $\semN$-PDB) are both formalized next.

View File

@ -89,7 +89,7 @@ $\semNX$-relations are closed under $\raPlus$ (\cref{fig:nxDBSemantics}).
We will use $\semNX$-PDB $\pxdb$, defined as the tuple $(\idb_{\semNX}, \pd)$, where $\semNX$-database $\idb_{\semNX}$ is paired with probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-PDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs and a function $\rmod$ (which transforms an $\semNX$-PDB to an equivalent $\semN$-PDB) are both formalized in \Cref{subsec:supp-mat-background}.
$\semNX$-PDBs and a function $\rmod$ (which transforms an $\semNX$-PDB to classical, or $\semN$-PDB~\cite{DBLP:conf/pods/GreenKT07,feng:2019:sigmod:uncertainty}) are both formalized in \Cref{subsec:supp-mat-background}.
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
Given an $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$, we have:
$ \expct_{\idb \sim \pd}[\query(\idb)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})}. $