
%root: main.tex
%!TEX root=./main.tex
%\onecolumn
\section{Background and Notation}\label{sec:background}

\subsection{Prelim: Superlinearity of Bag PDBs}\label{sec:suplin-bags}
Moving forward, we focus exclusively on bags and drop the subscript from $\poly_{bag}$ used in \Cref{ex:bag-vs-set}.
Consider the product of $\poly$ with itself:
\begin{equation*}
\poly^2() := \rel(A), E(A, B), \rel(B),\; \rel(C), E(C, D), \rel(D)
\end{equation*}
%For an arbitrary polynomial, it is known that there may exist equivalent compressed representations.
%One such compression is the factorized polynomial~\cite{factorized-db}, where the polynomial is broken up into separate factors.
%For example:
Still using the tables of \Cref{fig:intro-ex}, consider the factorized representation of $\poly^2(W_a, W_b, W_c)$:
{\small
\begin{equation*}
\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right)
\end{equation*}
}
This factorized expression can be easily modeled as a circuit, as in \Cref{fig:circuit-q2-intro},
 while the equivalent SOP representation is
\begin{equation*}
W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
\end{equation*}
The expectation $\expct\pbox{\poly^2(W_a, W_b, W_c)}$ then is:
\begin{footnotesize}
\begin{equation*}
\expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} + 2\expct\pbox{W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + 2\expct\pbox{W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} + 2\expct\pbox{W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}
\end{equation*}
\end{footnotesize}
Recall the nice property of $\query$ that its expected count could be computed by evaluating its lineage on the probability vector (i.e., \Cref{eqn:can-inline-probabilities-into-polynomial}).
This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(\probOf\pbox{W_a}, \probOf\pbox{W_b}, \probOf\pbox{W_c})$), but does suggest a related closed form formula.
Note that if $Dom(W_i) = \{0, 1\}$, then for any $k > 0$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
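To see why, expand the expectation over the two values that $W_i$ can take (a one-line check that uses only the assumption $Dom(W_i) = \{0, 1\}$ and $k > 0$):
\begin{equation*}
\expct\pbox{W_i^k} = 0^k \cdot \probOf\pbox{W_i = 0} + 1^k \cdot \probOf\pbox{W_i = 1} = \probOf\pbox{W_i = 1} = \expct\pbox{W_i}.
\end{equation*}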
This property leads us to consider a structure related to $\poly$.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\poly^2$ as an example, we have:
\begin{align*}
\rpoly^2(W_a, W_b, W_c)
=&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
\end{align*}
Note that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly^2(\probOf\pbox{W_a=1}, \probOf\pbox{W_b=1}, \probOf\pbox{W_c=1})$).
Also note that the $\poly$ in~\Cref{ex:bag-vs-set} is already in reduced form.
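
As a sanity check, consider a hypothetical instance in which $W_a$, $W_b$, and $W_c$ are independent and $\probOf\pbox{W_a=1} = \probOf\pbox{W_b=1} = \probOf\pbox{W_c=1} = \frac{1}{2}$ (these probabilities are chosen only for illustration and are not taken from \Cref{fig:intro-ex}).
Evaluating the reduced polynomial at this probability vector gives
\begin{equation*}
\rpoly^2\left(\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}\right) = 3 \cdot \tfrac{1}{4} + 6 \cdot \tfrac{1}{8} = \tfrac{3}{2},
\end{equation*}
whereas substituting the same probabilities directly into $\poly^2$ gives $\left(\tfrac{3}{4}\right)^2 = \tfrac{9}{16}$.
Enumerating the eight possible worlds agrees with the first value: $\poly$ evaluates to $1$ when exactly two of the variables are $1$ (probability $\tfrac{3}{8}$), to $3$ when all three are $1$ (probability $\tfrac{1}{8}$), and to $0$ otherwise, so
\begin{equation*}
\expct\pbox{\poly^2} = \tfrac{3}{8} \cdot 1^2 + \tfrac{1}{8} \cdot 3^2 = \tfrac{3}{2}.
\end{equation*}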

The reduced form of a polynomial can be obtained in a linear scan over the clauses of an SOP encoding of the polynomial.
In prior work on lineage-based Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,yang:2015:pvldb:lenses} where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the product of $k$ copies of $\poly$ has a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
This leads us to the \textbf{central question of this paper}:
\begin{quote}
{\em
Is it always the case that the expectation of a UCQ in a Bag-PDB can be computed in time linear in the size of the \textbf{compressed} lineage polynomial?}
\end{quote}
If so, then Bag-PDBs can compete with deterministic databases.
Unfortunately, this is not the case, and an approximation is required.


\subsection{Probabilistic Databases (PDBs)}

An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\pdb$ is a pair $(\idb, \pd)$ where $\idb$ is an incomplete database and $\pd$ is a probability distribution over $\idb$. Queries over probabilistic databases are evaluated using the so-called possible world semantics. Under possible world semantics, the result of a query $\query$ over an incomplete database $\idb$ is the set of query answers produced by evaluating $\query$ over each possible world:
\[\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}\]

For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$  that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
\[\forall \db \in \query(\idb): \probOf'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \probOf(\db') \]

In this work, we consider bags: each possible world in the query output is a set of bag relations, and queries are evaluated using bag semantics. We will use $\semK$-relations to model bags. A \emph{$\semK$-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$.  A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
Let $\udom$ be a countable domain of values.
Formally, an $n$-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ (resp., $\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to $t$.
We review positive relational algebra semantics for $\semK$-relations below.


Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A  probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases.  Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result $t$,  we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation  $\expct_{\idb \sim \probDist}[\query(\db)(t)]$:
%
\begin{equation}\label{eq:bag-expectation}
\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \probOf(\db)
\end{equation}
%
Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in the result of query $\query$.
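For instance, consider a hypothetical $\semN$-PDB with two possible worlds $\db_1$ and $\db_2$ (the numbers are illustrative only), where $\probOf(\db_1) = 0.3$, $\probOf(\db_2) = 0.7$, $\query(\db_1)(t) = 2$, and $\query(\db_2)(t) = 1$.
Then \Cref{eq:bag-expectation} evaluates to
\begin{equation*}
\expct_{\idb \sim \probDist}[\query(\db)(t)] = 2 \cdot 0.3 + 1 \cdot 0.7 = 1.3.
\end{equation*}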

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Representation System and Semantics}

\subsubsection{$\semK$-relational Query Semantics}
For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07}.
We use $\evald{\cdot}{\db}$ to denote the result of evaluating query $\query$ over $\semK$-database $\db$. In the definition shown below, we assume that tuples are of appropriate arity, use $\sch(\rel)$ to denote the attributes of $\rel$, and use $\project_A(\tup)$ to denote the projection of tuple $\tup$ on a list of attributes $A$.  Furthermore, $\theta(\tup)$ denotes the (Boolean) result of evaluating condition $\theta$ over $\tup$.
\begin{align*}
 	\evald{\project_A(\rel)}{\db}(\tup) &= \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &	
 	\evald{(\rel_1 \union \rel_2)}{\db}(\tup) &= \evald{\rel_1}{\db}(\tup) \addK \evald{\rel_2}{\db}(\tup)\\
 	\evald{\select_\theta(\rel)}{\db}(\tup) &= \begin{cases}
		\evald{\rel}{\db}(\tup)	& \text{if }\theta(\tup) \\
		\zeroK                       & \text{otherwise}.
		\end{cases} &
       \evald{(\rel_1 \join \rel_2)}{\db}(\tup) &= \evald{\rel_1}{\db}(\project_{\sch(\rel_1)}(\tup)) \multK 				\evald{\rel_2}{\db}(\project_{\sch(\rel_2)}(\tup)) \\
       & & \evald{\rel}{\db}(\tup) &= \rel(\tup)
\end{align*}
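
As a small illustration of these rules over $\semN$, consider hypothetical relations (not drawn from any example in this paper) $\rel_1$ with schema $(A)$ and $\rel_2$ with schema $(A, B)$, annotated as $\rel_1(a) = 2$, $\rel_2(a, b) = 3$, and $\rel_2(a, b') = 1$, with all other tuples annotated $0$.
The join multiplies annotations and the projection sums them:
\begin{align*}
\evald{(\rel_1 \join \rel_2)}{\db}(a, b) &= 2 \times 3 = 6, &
\evald{(\rel_1 \join \rel_2)}{\db}(a, b') &= 2 \times 1 = 2, \\
\evald{\project_A(\rel_1 \join \rel_2)}{\db}(a) &= 6 + 2 = 8.
\end{align*}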


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{$\semNX$ as a Representation System}\label{sec:semnx-as-repr}

Let $\semNX$ denote the set of polynomials over variables $\vct{X}$ with natural number coefficients and exponents.
Consider now the semiring $(\semNX, +, \cdot, 0, 1)$ whose domain is $\semNX$, with the standard addition and multiplication of polynomials. 
We use an $\semNX$-PDB $\pxdb$, defined as the pair $(\db, \pd)$ of an $\semNX$-database $\db$ and a probability distribution $\pd$ over assignments of the variables $\vct{X}$ to $\{0,1\}$.
We denote by $\polyForTuple$ the annotation of tuple $t$ in the result of $\query$ on an implicit $\semNX$-PDB (i.e., $\polyForTuple = \query(\pxdb)(t)$ for some $\pxdb$) and as before, interpret it as a function $\polyForTuple: \{0,1\}^{|\vct X|} \rightarrow \semN$ from vectors of variable assignments to the corresponding value of the annotating polynomial.
$\semNX$-PDBs and a function $\rmod$ from an $\semNX$-PDB to an equivalent $\semN$-PDB are both formalized in \Cref{subsec:supp-mat-background}.

 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
  Given an $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db',\pd')$ where $\rmod(\pxdb) = \pdb$:
  \[ \expct_{\idb \sim \pd}[\query(\db)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})} \]
\end{Proposition}
\noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.  
This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.
We focus on this problem from now on, assume an implicit result tuple, and so drop the subscript from $\polyForTuple$ (i.e., $\poly$ is used as a polynomial from now on).

\subsubsection{\tis and \bis}
\label{subsec:tidbs-and-bidbs}
In this paper, we focus on two popular forms of PDB: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
%
A \bi $\pxdb = (\db, \pd)$ is an $\semNX$-PDB such that (i) every tuple is annotated with either $0$ or a unique variable $X_i$ and (ii) the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.
%
A \emph{\ti} is a \bi where each block contains exactly one tuple.
\Cref{subsec:supp-mat-ti-bi-def} explains \tis and \bis in greater detail.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%