Rewrote 1st paragraph of Intro to be consistent with traditional nomenclature and notation used in S.2.

2021-08-20 10:31:24 -04:00 · 2021-08-20 10:31:24 -04:00 · a077d52e3e
parent 3ae41e6d8b
commit a077d52e3e
3 changed files with 6 additions and 48 deletions
--- a/intro-rewrite-070921.tex
+++ b/intro-rewrite-070921.tex
@@ -1,7 +1,9 @@
 %root: main.tex
 \section{Introduction (Rewrite - 070921)}
 \input{two-step-model}
-A tuple independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a tuple $\inparen{\db, \pd}$ where $\db$ is a set of $\numvar$ tuples.  The probability distribution $\pd$ over the set of database instances (possible worlds) encoded in $\db$ is the one induced from the requirement that each tuple be treated as an independent Bernoulli distributed random variable.  In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$.  In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics further requires the following condition:
+A tuple independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a tuple $\inparen{\idb, \pd}$ such that $\idb$ is the set of deterministic instances of $\pdb$ (possible worlds) and $\pd$ is the probability distribution over $\idb$.  $\pdb$ can equivalently be encoded as a deterministic database with $\numvar$ tuples, with $\pd$ 
+%with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$.  The probability distribution $\pd$ over the set of database instances (possible worlds) is the one
+being the distribution induced from the requirement that each tuple in $\encodedDB$ be treated as an independent Bernoulli distributed random variable.  In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$.  In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics further requires the following condition:
 \begin{Problem}\label{prob:bag-pdb-query-eval}
 Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$),\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} compute the expected multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$.
 \end{Problem}
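
Reviewer note: the expected multiplicity in the Problem statement above can be illustrated by brute force. The following is a minimal sketch, not the authors' method: it enumerates all possible worlds of a small tuple-independent database and weights each world's multiplicity by its probability. The names `expected_multiplicity`, `probs`, and `join_count` are hypothetical and chosen only for illustration.

```python
from itertools import product


def expected_multiplicity(probs, multiplicity):
    """Brute-force E[Q(D)(t)] for a TIDB by enumerating all 2^n possible worlds.

    probs[i] is the (assumed) marginal probability that input tuple i is present.
    multiplicity(world) returns the multiplicity of the output tuple in that world.
    """
    total = 0.0
    for world in product((0, 1), repeat=len(probs)):
        # Tuples are independent Bernoulli variables, so the probability of a
        # world is the product of the per-tuple probabilities.
        weight = 1.0
        for present, p in zip(world, probs):
            weight *= p if present else 1.0 - p
        total += weight * multiplicity(world)
    return total


# Hypothetical example: the output tuple is produced by joining input tuples 0 and 1,
# so its multiplicity in a world is the product of the two presence indicators.
probs = [0.5, 0.4]
join_count = lambda world: world[0] * world[1]
result = expected_multiplicity(probs, join_count)
```

This enumeration takes time exponential in the number of tuples, so it is only useful as a correctness check on tiny inputs.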
--- a/macros.tex
+++ b/macros.tex
@@ -111,6 +111,7 @@
 \newcommand{\idb}{\Omega}
 \newcommand{\pd}{\mathcal{P}}%pd for probability distribution
 \newcommand{\pdb}{\mathcal{D}}
+\newcommand{\encodedDB}{\textnormal{\db}}
 \newcommand{\pxdb}{\pdb_{\semNX}}
 \newcommand{\nxdb}{D(\vct{X})}%\mathbb{N}[\vct{X}] db--Are we currently using this?

--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@@ -3,48 +3,7 @@
 %\onecolumn
 \section{Background and Notation}\label{sec:background}

-\iffalse
-\subsection{Superlinearity of Bag \abbrPDB\xplural}\label{sec:suplin-bags}
-Moving forward, we focus exclusively on bags.  For $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}_1, \text{City}_2),$ $OnTime(\text{City}')$ over the bag relations of \Cref{fig:ex-shipping-simp}, consider the product query $\poly^2()\dlImp Q \times Q$.
-The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):
-\begin{equation*}
-\poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
-\end{equation*}
-This equivalent SOP representation is
-\begin{equation*}
-L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
-\end{equation*}
-The expectation $\expct\pbox{\poly^2}$ then is:
-\begin{footnotesize}
-\begin{equation*}
-\expct\pbox{L_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
-\end{equation*}
-\end{footnotesize}
-Note that if $Dom(W_i) = \{0, 1\}$, then for any $k > 0$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
-This property leads us to consider a structure related to $\poly$.
-\begin{Definition}\label{def:reduced-poly}
-For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
-\end{Definition}
-With $\poly^2$ as an example, we have:
-\begin{align*}
-\rpoly^2(L_a, L_b, L_c, L_d)
-=&\; L_aL_b + L_bL_d + L_bL_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d
-\end{align*}
-It can be verified that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1})$).
-
-The reduced form of a lineage polynomial can be obtained but requires a linear scan over the clauses of an SOP encoding of the polynomial.  Note that for a compressed representation, this scheme would require an exponential number of computations in the size of the compressed representation.  In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results.
-%In prior work on lineage-based Bag-\abbrPDB\xplural~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,yang:2015:pvldb:lenses} where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
-%In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the product of $k$ copies of $\poly$ as a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
-%This leads us to the \textbf{central question of this paper}:
-%\begin{quote}
-%{\em
-%Is it always the case that the expectation of a UCQ in a Bag-\abbrPDB can be computed in time linear in the size of the \textbf{compressed} lineage polynomial?}
-%\end{quote}
-%If so, then Bag-\abbrPDB\xplural can indeed compete with deterministic databases.
-%This is unfortunately not the case, and an approximation is required.
-\fi
-
-\subsection{Probabilistic Databases (\abbrPDB\xplural)}
+\subsection{Probabilistic Databases}

 An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
 Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under the possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world: $\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}$.
@@ -54,17 +13,13 @@ For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is th
 \[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]

 Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_\numvar)$ with natural number coefficients and exponents.
-We model incomplete relations using Green et al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations}. % and summarized here.
+We model incomplete relations using Green et al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations}. 
 $\semNX$-relations are functions from tuples to elements of $\semNX$, typically called annotations.
 We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$. Note that $R(t)$ is the lineage polynomial for $t$.
 Each possible world is defined by an assignment of $\numvar$ binary values $\vct{\wElem} \in \{0, 1\}^{\numvar}$ to $\vct{X}$.
 The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{\wElem})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{\wElem}$.
 $\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).
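
Reviewer note: the evaluation of an annotation on a possible world, as described in the context lines above, can be sketched as follows. This is an illustrative sketch under an assumed encoding, not code from the paper: a polynomial is stored as a dictionary from exponent tuples to natural-number coefficients, and `evaluate` and `annotation` are hypothetical names.

```python
def evaluate(poly, world):
    """Evaluate a polynomial with natural-number coefficients on a 0/1 assignment.

    poly maps an exponent tuple (e_1, ..., e_n) to the coefficient of the
    monomial X_1^{e_1} * ... * X_n^{e_n}; world is a tuple in {0, 1}^n.
    """
    value = 0
    for exponents, coeff in poly.items():
        term = coeff
        for x, e in zip(world, exponents):
            # Python defines 0 ** 0 == 1, so variables with exponent 0
            # do not affect the monomial.
            term *= x ** e
        value += term
    return value


# Hypothetical annotation R(t) = 2 * X_1^2 * X_2 + X_3 over three variables.
annotation = {(2, 1, 0): 2, (0, 0, 1): 1}

# Multiplicity of t in the world where X_1 = X_2 = 1 and X_3 = 0.
multiplicity_in_world = evaluate(annotation, (1, 1, 0))
```

Because every variable takes a value in {0, 1}, any exponent greater than zero has the same effect as exponent one in a single world.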

-% For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
-% We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. Below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$.  Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
-
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 We will use $\semNX$-\abbrPDB $\pxdb$, defined as the tuple $(\idb_{\semNX}, \pd)$, where $\semNX$-database $\idb_{\semNX}$ is paired with probability distribution $\pd$.