Rewrote 1st paragraph of Intro to be consistent with traditional nomenclature and notation used S.2.

This commit is contained in:
Aaron Huber 2021-08-20 10:31:24 -04:00
parent 3ae41e6d8b
commit a077d52e3e
3 changed files with 6 additions and 48 deletions

View file

@ -1,7 +1,9 @@
%root: main.tex
\section{Introduction (Rewrite - 070921)}
\input{two-step-model}
A tuple independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a tuple $\inparen{\db, \pd}$ where $\db$ is a set of $\numvar$ tuples. The probability distribution $\pd$ over the set of database instances (possible worlds) encoded in $\db$ is the one induced from the requirement that each tuple be treated as an independent Bernoulli distributed random variable. In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$. In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics further requires the following condition:
A tuple independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a tuple $\inparen{\idb, \pd}$ such that $\idb$ is the set of deterministic instances of $\pdb$ (possible worlds) and $\pd$ is the probability distribution over $\idb$. $\pdb$ can equivalently encoded as deterministic database with $\numvar$ tuples, with $\pd$
%with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$. The probability distribution $\pd$ over the set of database instances (possible worlds) is the one
being the distribution induced from the requirement that each tuple in $\encodedDB$ be treated as an independent Bernoulli distributed random variable. In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$. In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics further requires the following condition:
\begin{Problem}\label{prob:bag-pdb-query-eval}
Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$),\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} compute the expected multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$.
\end{Problem}

View file

@ -111,6 +111,7 @@
\newcommand{\idb}{\Omega}
\newcommand{\pd}{\mathcal{P}}%pd for probability distribution
\newcommand{\pdb}{\mathcal{D}}
\newcommand{\encodedDB}{\textnormal{\db}}
\newcommand{\pxdb}{\pdb_{\semNX}}
\newcommand{\nxdb}{D(\vct{X})}%\mathbb{N}[\vct{X}] db--Are we currently using this?

View file

@ -3,48 +3,7 @@
%\onecolumn
\section{Background and Notation}\label{sec:background}
\iffalse
\subsection{Superlinearity of Bag \abbrPDB\xplural}\label{sec:suplin-bags}
Moving forward, we focus exclusively on bags. For $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}_1, \text{City}_2),$ $OnTime(\text{City}')$ over the bag relations of \Cref{fig:ex-shipping-simp}, consider the product query $\poly^2()\dlImp Q \times Q$.
The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):
\begin{equation*}
\poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
\end{equation*}
This equivalent SOP representation is
\begin{equation*}
L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
\end{equation*}
The expectation $\expct\pbox{\poly^2}$ then is:
\begin{footnotesize}
\begin{equation*}
\expct\pbox{L_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
\end{equation*}
\end{footnotesize}
Note that if $Dom(W_i) = \{0, 1\}$, then for any $k > 0$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
This property leads us to consider a structure related to $\poly$.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\poly^2$ as an example, we have:
\begin{align*}
\rpoly^2(L_a, L_b, L_c, L_d)
=&\; L_aL_b + L_bL_d + L_bW_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d
\end{align*}
It can be verified that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}), \probOf\pbox{L_d=1})$).
The reduced form of a lineage polynomial can be obtained but requires a linear scan over the clauses of an SOP encoding of the polynomial. Note that for a compressed representation, this scheme would require an exponential number of computations in the size of the compressed representation. In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results .
%In prior work on lineage-based Bag-\abbrPDB\xplural~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,yang:2015:pvldb:lenses} where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
%In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the product of $k$ copies of $\poly$ as a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
%This leads us to the \textbf{central question of this paper}:
%\begin{quote}
%{\em
%Is it always the case that the expectation of a UCQ in a Bag-\abbrPDB can be computed in time linear in the size of the \textbf{compressed} lineage polynomial?}
%\end{quote}
%If so, then Bag-\abbrPDB\xplural can indeed compete with deterministic databases.
%This is unfortunately not the case, and an approximation is required.
\fi
\subsection{Probabilistic Databases (\abbrPDB\xplural)}
\subsection{Probabilistic Databases}
An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$.
@ -54,17 +13,13 @@ For a probabilistic database $\pdb = (\idb, \pd)$, the result of a query is th
\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]
Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_\numvar)$ with natural number coefficients and exponents.
We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations}. % and summarized here.
We model incomplete relations using Green et. al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations}.
$\semNX$-relations are functions from tuples to elements of $\semNX$, typically called annotations.
We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$. Note that $R(t)$ is the lineage polynomial for $t$.
Each possible world is defined by an assignment of $\numvar$ binary values $\vct{\wElem} \in \{0, 1\}^{\numvar}$ to $\vct{X}$.
The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{\wElem})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{\wElem}$.
$\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).
% For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
% We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. Below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$. Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We will use $\semNX$-\abbrPDB $\pxdb$, defined as the tuple $(\idb_{\semNX}, \pd)$, where $\semNX$-database $\idb_{\semNX}$ is paired with probability distribution $\pd$.