Merge remote-tracking branch 'origin/master'

2020-12-19 23:22:49 -05:00 · 2020-12-19 23:22:49 -05:00 · fff29ce2f6
parent b7454db8c7 9aaf254977
commit fff29ce2f6
5 changed files with 28 additions and 24 deletions
--- a/abstract.tex
+++ b/abstract.tex
@ -7,7 +7,7 @@
  For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products. 
  However, using a reduction from the problem of counting k-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization. 
  As we show, this result has a significant implication: a Bag-PDB doing exact computations will never be as fast as a classical (deterministic) database.
-  The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $p$ (s.t. $p \not \in \{0,1\}$). 
+  The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \not \in \{0,1\}$). 
  We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in linear time in the size of the polynomial, paving the way for PDBs competitive with deterministic databases.
 % \AH{High-level intuition}
 % \BG{Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy.  Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage.  This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.}
--- a/intro.tex
+++ b/intro.tex
@ -111,8 +111,8 @@ Consider the Tuple Independent ($\ti$) Set-PDB\footnote{Our work does also handl
 Each input tuple is assigned an annotation (attribute $\Phi_{set}$): an independent random Boolean variable ($W_i$) or the constant $\top$.
 % Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) \SF{Do we need to state the meaning of $\top$ and $\bot$? Also do we want to add bag annotation to Figure 1 too since we are discussing both sets and bags later?} identifies one \emph{possible world}, a deterministic database instance that contains exactly the tuples annotated by the constant $\top$ or by a variable assigned to $\top$.
 The probability of this world is the joint probability of the corresponding assignments.
-For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
-The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$.
+For example, let $\probOf[W_a] = \probOf[W_b] = \probOf[W_c] = \prob$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.
+The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $\probOf[W_a]\cdot \probOf[W_b] \cdot \probOf[\neg W_c] = \prob\cdot \prob\cdot (1-\prob)=\prob^2-\prob^3$.
 \end{Example}

 Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
@ -138,9 +138,9 @@ The marginal probability (resp., expected count) of this query is computed over
 % \AR{What is $\mu$ below?}
 {\small
 \begin{align*}
-P[\poly_{set}] &= \hspace*{-1mm}
- \sum_{w_i \in \{\top,\bot\}} \indicator{\poly_{set}(w_a, w_b, w_c)}P[W_a = w_a,W_b = w_b,W_c = w_c]\\
-\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot P[W_a = w_a,W_b = w_b,W_c = w_c]
+\probOf[\poly_{set}] &= \hspace*{-1mm}
+ \sum_{w_i \in \{\top,\bot\}} \indicator{\poly_{set}(w_a, w_b, w_c)}\probOf[W_a = w_a,W_b = w_b,W_c = w_c]\\
+\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot \probOf[W_a = w_a,W_b = w_b,W_c = w_c]
 \end{align*}
 }
 \end{Example}
@ -169,13 +169,13 @@ In this particular lineage polynomial, all variables in each product clause are
 \end{align*}
 }
 Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.
-As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
+As a further interesting feature of this example, note that $\expct\pbox{W_i} = \probOf[W_i = 1]$, and so taking the same polynomial over the reals:
 \begin{multline}
 \label{eqn:can-inline-probabilities-into-polynomial}
 \expct\pbox{\poly_{bag}}
 % = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
 % + P[W_c = 1]P[W_a = 1]\\
-= \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
+= \poly_{bag}(\probOf[W_a=1], \probOf[W_b=1], \probOf[W_c=1])
 \end{multline}

 \begin{figure}[t]
@ -243,7 +243,7 @@ The expectation $\expct\pbox{\poly^2(W_a, W_b, W_c)}$ then is:
 + \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}
 \end{multline*}
 Recall the nice property of $\query$ that its expected count could be computed by evaluating its lineage on the probability vector (i.e., \Cref{eqn:can-inline-probabilities-into-polynomial}).
-This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$), but does suggest a related closed form formula.
+This property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(\probOf\pbox{W_a}, \probOf\pbox{W_b}, \probOf\pbox{W_c})$), but does suggest a related closed form formula.
 Note that if $Dom(W_i) = \{0, 1\}$, then for any $k > 0$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
 This property leads us to consider a structure related to $\poly$.
 % \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
@ -257,7 +257,7 @@ With $\poly^2$ as an example, we have:
 =&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
 \end{align*}
 %\SF{Should this be like $\tilde{\poly^2}$ to avoid ambiguous?}
-Note that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(P\pbox{W_a=1}, P\pbox{W_b=1}, P\pbox{W_c=1})$).
+Note that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly(\probOf\pbox{W_a=1}, \probOf\pbox{W_b=1}, \probOf\pbox{W_c=1})$).
 Also note that the $\poly$ in~\Cref{ex:bag-vs-set} is already in reduced form.

 The reduced form of a polynomial can be obtained in a linear scan over the clauses of a SOP encoding of the polynomial.
@ -284,10 +284,10 @@ Concretely, in this paper:
 (iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
 (iv) We further generalize our results to higher moments, polynomial circuits, and prove that for RA+ queries, the processing time in approximation is within a constant factor of the same query processed deterministically.

-Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}. First it is easy to generalize the polynomial in Example~\ref{ex:bag-vs-set} to $\poly_G^k(X_1,\dots,X_n)$ that represents the edge set of a graph $G$ in $n$ vertices. Then $\inparen{\poly_G^k(X_1,\dots,X_n)}^k$ encodes as its monomials all subgraphs of $G$ with at most $k$ edges in it. This implies that the corresponding reduced polynomial $\rpoly_G^k(p,\dots,p)$ can be written as $\sum_{i=0}^{2k} c_i\cdot p^i$ and we observe that $c_{2k}$ is proportional to the number of $k$-matchings (computing which is \sharpwonehard\ ) in $G$. Thus, if we have access to $\rpoly_G^k(p_i,\dots,p_i)$ for distinct values of $p_i$ for $0\le i\le  2k$, then we can setup a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(p,\dots,p)$ for a {\em single specific} value of $p$ might be easy: indeed  it is easy for $p=0$ or $p=1$. However, we are able to show that for any other value of $p$, computing $\rpoly_G^k(p,\dots,p)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity on e.g. the complexity of counting the number of triangles in $G$ and not more standard parameterized hardness like \sharpwonehard.
+Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}. First it is easy to generalize the polynomial in Example~\ref{ex:bag-vs-set} to $\poly_G^k(X_1,\dots,X_n)$ that represents the edge set of a graph $G$ in $n$ vertices. Then $\inparen{\poly_G^k(X_1,\dots,X_n)}^k$ encodes as its monomials all subgraphs of $G$ with at most $k$ edges in it. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$ and we observe that $c_{2k}$ is proportional to the number of $k$-matchings (computing which is \sharpwonehard\ ) in $G$. Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values of $\prob_i$ for $0\le i\le  2k$, then we can setup a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed  it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity on e.g. the complexity of counting the number of triangles in $G$ and not more standard parameterized hardness like \sharpwonehard.


-The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=Q(1,\dots,1)$ and if all the coefficients of $\poly$ are constants, then $\poly(X_1,\dots,X_n)$ (which can be easily computed in linear time) is a $p^k$ approximation to the value $\rpoly(p,\dots,p)$ that we are after. If $p$ and $k=\deg(\poly)$ are constants, then this gives a constant factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_n)$ and do an appropriate weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) be able to compute $\poly(X_1,\dots,X_n)$ in linear time and (ii) be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly as well. For the ease of exposition, we start off with expression trees (see~\Cref{fig:intro-q2-etree} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
+The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=Q(1,\dots,1)$ and if all the coefficients of $\poly$ are constants, then $\poly(X_1,\dots,X_n)$ (which can be easily computed in linear time) is a $\prob^k$ approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ and $k=\deg(\poly)$ are constants, then this gives a constant factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_n)$ and do an appropriate weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) be able to compute $\poly(X_1,\dots,X_n)$ in linear time and (ii) be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly as well. For the ease of exposition, we start off with expression trees (see~\Cref{fig:intro-q2-etree} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).

 We also formalize our claim that, since our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only a $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).

--- a/macros.tex
+++ b/macros.tex
@ -21,7 +21,7 @@
 \newcommand{\pxdb}{\mathbf{D}}
 \newcommand{\nxdb}{D(\vct{X})}%\mathbb{N}[\vct{X}] db
 \newcommand{\tset}{\mathcal{T}}%the set of tuples in a database
-\newcommand{\pd}{P}%pd for probability distribution
+\newcommand{\pd}{\vct{P}}%pd for probability distribution
 \newcommand{\eval}[1]{\llbracket #1 \rrbracket}%evaluation double brackets
 \newcommand{\evald}[2]{\eval{{#1}}_{#2}}
 \newcommand{\query}{Q}
@ -116,6 +116,9 @@
 %PDBs
 \newcommand{\pdbx}{X_{DB}}
 \newcommand{\prob}{p}
+\newcommand{\probOf}{P}
+\newcommand{\probDist}{\vct{\probOf}}
+\newcommand{\probAllTup}{\vct{\prob}}
 \newcommand{\wSet}{\Omega}
 \newcommand{\ti}{TIDB\xspace}
 \newcommand{\tis}{TIDBs\xspace}
--- a/poly-form.tex
+++ b/poly-form.tex
@ -52,7 +52,8 @@ We call a polynomial $\query(\vct{X})$ a \emph{\bi-lineage polynomial} (resp., \
 there exists a $\raPlus$ query $\query$, \bi $\pxdb$ (\ti $\pxdb$, or $\semNX$-PDB $\pxdb$), and tuple $\tup$ such that $\query(\vct{X}) = \query(\pxdb)(\tup)$. % Before proceeding, note that the following is assume that polynomials are  \bis (which subsume \tis as a special case).
 As they are a special case of \bis, the following applies to \tis as well.
 Recall that in a \bi $\pxdb$ with tuples $t_1, \ldots, t_n$, each input tuple $t_i$ is annotated with a unique variable $X_i$. 
-Tuples of $\pxdb$ are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_i$ is associated with a probability $\prob(\tup_i) = \pd[X_i = 1]$.\footnote{
+Tuples of $\pxdb$ are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_i$ is associated with a probability $\prob_{\tup_i} = \pd[X_i = 1]$.
+\footnote{
  Although it is customary to define a single independent, $[\abs{\block_j}+1]$-valued variable per block, we decompose it into $\abs{\block_j}$ correlated $\{0,1\}$-valued variables per block that can be directly used in polynomials (without an indicator function).  For $t_i \in b_j$, the event $(X_i = 1)$ is identical to the event $(X_j = i)$ in the customary annotation scheme.
 } 
 Because blocks are independent and tuples from the same block are disjoint, $\prob$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
@ -77,8 +78,8 @@ For example when $S_0=\inset{X^2-X, Y^2-Y}$, taking the polynomial $2X^2 + 3XY -
 %
 \begin{Definition}\label{def:mod-set-polys}
 Given the set of BIDB variables $\inset{X_{b,i}}$, define
-\[\mathcal{B}=\comprehension{X_{b,i}\cdot X_{b,j}}{\text{ for every block } b \text{and } i\ne j}\]
-\[\mathcal{T}=\comprehension{X_{b,i}^2-X_{b,i}}{\text{ for every block } b \text{and } i}\]
+\[\mathcal{B}=\comprehension{X_{b,i}\cdot X_{b,j}}{\text{ for every block } b \text{ and } i\ne j \in [~\abs{\block}~]}\]
+\[\mathcal{T}=\comprehension{X_{b,i}^2-X_{b,i}}{\text{ for every block } b \text{ and } i \in [~\abs{\block}~]}\]
 \end{Definition}
 %
 \begin{Definition}[Reduced \bi Polynomials]\label{def:reduced-bi-poly}
@ -123,9 +124,9 @@ Consider $\poly(X, Y) = (X + Y)(X + Y)$ where $X$ and $Y$ are from different blo
 %
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Definition}[Valid Worlds]
-For probability distribution $\vct{P}$ and its corresponding PMF $P$, the set of valid worlds $\eta$ is the worlds with probability value greater than $0$; i.e., for variable vector $\vct{W}$
+For probability distribution $\probDist$ and its corresponding PMF $\probOf$, the set of valid worlds $\eta$ is the worlds with probability value greater than $0$; i.e., for variable vector $\vct{W}$
 \[
-\eta = \{\vct{w}\st P[\vct{W} = \vct{w}] > 0\}
+\eta = \{\vct{w}\st \probOf[\vct{W} = \vct{w}] > 0\}
 \]
 \end{Definition}

@ -143,10 +144,10 @@ We state additional equivalences between $\poly(\vct{X})$ and $\rpoly(\vct{X})$

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Lemma}\label{lem:exp-poly-rpoly}
-Let $\pxdb$ be a \bi over variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ and with probability distribution $\vct{p} = (\prob_1, \ldots, \prob_\numvar)$ over all $\vct{w}$ in $\eta$. For any \bi-lineage polynomial $\poly(\vct{X})$ based on $\pxdb$ and query $\query$ we have:
+Let $\pxdb$ be a \bi over variables $\vct{X} = \{X_1, \ldots, X_\numvar\}$ and with probability distribution $\probDist$ produced by the tuple probability vector $\probAllTup = (\prob_1, \ldots, \prob_\numvar)$ over all $\vct{w}$ in $\eta$. For any \bi-lineage polynomial $\poly(\vct{X})$ based on $\pxdb$ and query $\query$ we have:
  % The expectation over possible worlds in $\poly(\vct{X})$ is equal to $\rpoly(\prob_1,\ldots, \prob_\numvar)$.
 \begin{equation*}
-\expct_{\vct{w}\sim \vct{p}}\pbox{\poly(\vct{W})}  = \rpoly(\vct{p}).
+\expct_{\vct{W}\sim \probDist}\pbox{\poly(\vct{W})}  = \rpoly(\probAllTup).
 \end{equation*}
 \end{Lemma}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@ -11,7 +11,7 @@ Denote the schema of $\db$ as $\sch(\db)$. A \textit{probabilistic database} $\p
 \[\query(\idb) = \comprehension{\query(\db)}{\db \in \idb}\]

 For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is the pair $(\query(\idb), \pd')$ where $\pd'$ is a probability distribution over $\query(\idb)$  that assigns to each possible query result the sum of the probabilities of the worlds that produce this answer:
-\[\forall \db \in \query(\idb): \pd'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \pd(\db') \]
+\[\forall \db \in \query(\idb): \probOf'(\db) = \sum_{\db' \in \idb: \query(\db') = \db} \probOf(\db') \]

 Note that in this work we consider multisets, i.e., each possible world is a set of multiset relations and queries are evaluated using bag semantics. We will use K-relations to model multisets. A \emph{K-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$.  A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
 Let $\udom$ be a countable domain of values.
@ -20,10 +20,10 @@ A $\semK$-database is a set of $\semK$-relations. It will be convenient to also
 We review positive relational algebra semantics for $\semK$-relations below.


-Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A  probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases.  Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result $t$,  we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation  $\expct_{\idb \sim \pd}[\query(\db)(t)]$:
+Consider the semiring $\semN = (\domN,+,\times,0,1)$ of natural numbers. $\semN$-databases model bag semantics by annotating each tuple with its multiplicity. A  probabilistic $\semN$-database ($\semN$-PDB) is a PDB where each possible world is an $\semN$-database. We study the problem of computing statistical moments for query results over such databases.  Specifically, given a probabilistic $\semN$-database $\pdb = (\idb, \pd)$, query $\query$, and possible result $t$,  we treat $\query(\db)(t)$ as a random $\semN$-valued variable and are interested in computing its expectation  $\expct_{\idb \sim \probDist}[\query(\db)(t)]$:
 %
 \begin{align}\label{eq:bag-expectation}
-\expct_{\idb \sim \pd}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \pd(\db)
+\expct_{\idb \sim \probDist}[\query(\db)(t)] = \sum_{\db \in \idb} \query(\db)(t) \cdot \probOf(\db)
 \end{align}
 %
 Intuitively, the expectation of $\query(\db)(t)$ is the number of duplicates of $t$ we expect to find in result of query $\query$.
@ -59,7 +59,7 @@ $\semNX$-PDBs and a function $\rmod$ that takes an $\semNX$-PDB input and output
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Proposition}[Expectation of polynomials]\label{prop:expection-of-polynom}
  Given an $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db',\pd')$ where $\rmod(\pxdb) = \pdb$:
-  \[ \expct_{\idb \sim \pd}[\query(\db)(t)] = \expct_{\vct{w} \sim \pd'}\pbox{\polyForTuple(\vct{w})} \]
+  \[ \expct_{\idb \sim \pd}[\query(\db)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})} \]
 \end{Proposition}
 \noindent A formal proof of \Cref{prop:expection-of-polynom} is given in \Cref{subsec:expectation-of-polynom-proof}.  
 This proposition shows that computing expected tuple multiplicities is equivalent to computing the expectation of a polynomial (for that tuple) from a probability distribution over all possible assignments of variables in the polynomial to $\{0,1\}$.