
%root: main.tex
%!TEX root=./main.tex
%\onecolumn
\section{Background and Notation}\label{sec:background}

\subsection{Prelim: Superlinearity of Bag PDBs}\label{sec:suplin-bags}
Moving forward, we focus exclusively on bags.  The bag relations of \cref{fig:ex-shipping} are modeled by the attribute $\Phi_{bag}$ (i.e., we can ignore the $\Phi_{set}$ attribute).
Consider the following product query, which can be thought of as the set of all route pairs.
\begin{equation}
\poly^2_E():- Loc(\text{City}), Route(\text{City}_1, \text{City}_2), Loc(\text{City}'),  Loc(\text{City}''), Route(\text{City}_1', \text{City}_2'), Loc(\text{City}''')\label{eq:edge-query}
\end{equation}
%For an arbitrary polynomial, it is known that there may exist equivalent compressed representations.
%One such compression is the factorized polynomial~\cite{factorized-db}, where the polynomial is broken up into separate factors.
%For example:
Consider the factorized representation of $\poly^2_E$:
\begin{equation*}
\poly^2_E = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
\end{equation*}
The equivalent sum-of-products (SOP) representation is
\begin{equation*}
L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
\end{equation*}
The expectation $\expct\pbox{\poly^2_E()}$ is then (using linearity of expectation and the independence of the variables):
\begin{footnotesize}
\begin{equation*}
\expct\pbox{L_a^2}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
\end{equation*}
\end{footnotesize}
%Recall the nice property of $\query$ that its expected count could be computed by evaluating its lineage on the probability vector (i.e., \Cref{eqn:can-inline-probabilities-into-polynomial}).
%This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(\probOf\pbox{W_a}, \probOf\pbox{W_b}, \probOf\pbox{W_c})$), but does suggest a related closed form formula.
Note that if $Dom(W_i) = \{0, 1\}$, then $W_i^k = W_i$ for any $k > 0$ (since $0^k = 0$ and $1^k = 1$), and hence $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
This property leads us to consider a structure related to $\poly$.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\poly^2_E$ as an example, we have:
\begin{align*}
\rpoly^2_E(L_a, L_b, L_c, L_d)
=&\; L_aL_b + L_bL_d + L_bL_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d
\end{align*}
It can be verified that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2_E} = \rpoly^2_E(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1})$).
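As a sanity check (a sketch under assumptions not made above: the four variables are independent and each takes the value $1$ with probability $\frac{1}{2}$; the numbers are chosen purely for illustration), every degree-two monomial of the reduced polynomial evaluates to $\frac{1}{4}$ and every degree-three monomial to $\frac{1}{8}$, so
\begin{equation*}
\rpoly^2_E\left(\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}\right) = 3 \cdot \tfrac{1}{4} + 3 \cdot 2 \cdot \tfrac{1}{8} = \tfrac{3}{2}.
\end{equation*}
The same value follows directly from the factorized form $\poly^2_E = L_b^2\left(L_a + L_c + L_d\right)^2$: under these assumptions $\expct\pbox{L_b^2} = \frac{1}{2}$, and $S = L_a + L_c + L_d$ has mean $\frac{3}{2}$ and variance $\frac{3}{4}$, so $\expct\pbox{S^2} = \frac{3}{4} + \frac{9}{4} = 3$ and $\expct\pbox{\poly^2_E} = \frac{1}{2} \cdot 3 = \frac{3}{2}$.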

The reduced form of a lineage polynomial can be obtained by a linear scan over the clauses of an SOP encoding of the polynomial.  Note, however, that for a compressed representation this scheme can require a number of computations that is exponential in the size of the compressed representation.  In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results.
%In prior work on lineage-based Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,yang:2015:pvldb:lenses} where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
%In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the product of $k$ copies of $\poly$ as a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
%This leads us to the \textbf{central question of this paper}:
%\begin{quote}
%{\em
%Is it always the case that the expectation of a UCQ in a Bag-PDB can be computed in time linear in the size of the \textbf{compressed} lineage polynomial?}
%\end{quote}
%If so, then Bag-PDBs can indeed compete with deterministic databases.
%This is unfortunately not the case, and an approximation is required.


\subsection{Probabilistic Databases (PDBs)}

An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world:
\[\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}\]

For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$  that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
\[\forall \db \in \query(\idb): \probOf'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \probOf(\db') \]

Note that in this work, for the query output, we consider bags, i.e., each possible world in the query output is a set of bag relations and queries are evaluated using bag semantics. We will use $\domK$-relations to model bags. A \emph{$\domK$-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$.  A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
Let $\udom$ be a countable domain of values.
Formally, an $n$-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ (resp., $\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to $t$.
We review positive relational algebra semantics for $\semK$-relations below.


Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A  probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases.  Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result tuple $t$,  we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation  $\expct_{\idb \sim \probDist}[\query(\db)(t)]$:
%
\begin{equation}\label{eq:bag-expectation}
\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \probOf(\db)
\end{equation}
%
Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in the result of query $\query$.
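For example, consider a hypothetical $\semN$-PDB with three possible worlds $\db_1$, $\db_2$, and $\db_3$ that have probabilities $0.5$, $0.3$, and $0.2$, and in which $t$ has multiplicity $2$, $1$, and $0$, respectively, in the query result (these numbers are assumptions made only for illustration). Then \Cref{eq:bag-expectation} evaluates to
\begin{equation*}
2 \cdot 0.5 + 1 \cdot 0.3 + 0 \cdot 0.2 = 1.3,
\end{equation*}
even though no single possible world contains exactly $1.3$ copies of $t$.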

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Representation System and Semantics}

\subsubsection{$\semK$-relational Query Semantics}
For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. In the definition shown below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$.  Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
\begin{align*}
 	\evald{\project_A(\rel)}{\db}(\tup) &= \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &	
 	\evald{(\rel_1 \union \rel_2)}{\db}(\tup) &= \evald{\rel_1}{\db}(\tup) \addK \evald{\rel_2}{\db}(\tup)\\
 	\evald{\select_\theta(\rel)}{\db}(\tup) &= \begin{cases}
		\evald{\rel}{\db}(\tup)	& \text{if }\theta(\tup) \\
		\zeroK                       & \text{otherwise}.
		\end{cases} &
       \evald{(\rel_1 \join \rel_2)}{\db}(\tup) &= \evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \multK 				\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup)) \\
       & & \evald{\rel}{\db}(\tup) &= \rel(\tup)
\end{align*}
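To illustrate these rules for bags (a small hypothetical instance over $\semN$, not taken from the running example), let $\rel_1$ and $\rel_2$ be unary $\semN$-relations over the same attribute with $\rel_1(a) = 2$, $\rel_1(b) = 1$, $\rel_2(a) = 3$, and $\rel_2(b) = 0$. Union adds multiplicities and join multiplies them:
\begin{align*}
\evald{(\rel_1 \union \rel_2)}{\db}(a) &= 2 + 3 = 5, & \evald{(\rel_1 \join \rel_2)}{\db}(a) &= 2 \times 3 = 6, \\
\evald{(\rel_1 \union \rel_2)}{\db}(b) &= 1 + 0 = 1, & \evald{(\rel_1 \join \rel_2)}{\db}(b) &= 1 \times 0 = 0.
\end{align*}
In particular, $b$ is annotated with $0$ in the join result, i.e., it does not appear in it. Similarly, projecting a binary relation that contains $(a, c)$ with multiplicity $2$ and $(a, d)$ with multiplicity $1$ onto its first attribute yields $a$ with multiplicity $2 + 1 = 3$.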


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{$\semNX$ as a Representation System}\label{sec:semnx-as-repr}

Let $\semNX$ denote the set of polynomials over variables $\vct{X}$ with natural number coefficients and exponents.
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$, with the standard addition and multiplication of polynomials. 
We use an $\semNX$-PDB $\pxdb$, defined as the pair $(\idb_{\semNX}, \pd)$ of an $\semNX$-database $\idb_{\semNX}$ and a probability distribution $\pd$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-PDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs and a function $\rmod$ from an $\semNX$-PDB to an equivalent $\semN$-PDB are both formalized in \Cref{subsec:supp-mat-background}.
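As an illustration (with a hypothetical annotation chosen only for this example), let $\polyForTuple = X_1X_2 + X_1^2X_3$ over $\vct{X} = (X_1, X_2, X_3)$. Evaluating it on two assignments gives
\begin{align*}
\polyForTuple(1, 0, 1) &= 1 \cdot 0 + 1^2 \cdot 1 = 1, & \polyForTuple(1, 1, 1) &= 1 \cdot 1 + 1^2 \cdot 1 = 2,
\end{align*}
while every assignment with $X_1 = 0$ yields $0$.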

 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
  Given an $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\idb_{\semNX}',\pd')$ where $\rmod(\pxdb) = \pdb$:
  \[ \expct_{\idb \sim \pd}[\query(\db)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})} \]
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.  
This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.
We focus on this problem from now on, assume an implicit result tuple, and so drop the subscript from $\polyForTuple$ (i.e., $\poly$ is used as a polynomial from now on).

\subsubsection{\tis and \bis}
\label{subsec:tidbs-and-bidbs}
In this paper, we focus on two popular forms of PDB: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
%
A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB  such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.
%
A \emph{\ti} is a \bi where each block contains exactly one tuple.
\Cref{subsec:supp-mat-ti-bi-def} explains \tis and \bis in greater detail.
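As a small illustration (the blocks and probabilities are hypothetical), consider a \bi with two blocks: the first contains two tuples annotated with $X_1$ and $X_2$, where $\probOf\pbox{X_1 = 1} = 0.3$ and $\probOf\pbox{X_2 = 1} = 0.5$, and the second contains a single tuple annotated with $X_3$, where $\probOf\pbox{X_3 = 1} = 0.4$. Since $X_1$ and $X_2$ are disjoint events while $X_1$ and $X_3$ are independent,
\begin{equation*}
\expct\pbox{X_1X_2 + X_1X_3} = \expct\pbox{X_1X_2} + \expct\pbox{X_1}\expct\pbox{X_3} = 0 + 0.3 \cdot 0.4 = 0.12.
\end{equation*}
In a \ti, every block would contain a single tuple, so all three variables would be independent.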
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%