Resolving merge issues.

2021-08-23 09:04:56 -04:00 · 2021-08-23 09:04:56 -04:00 · 6ff39d4aea
parent f7cc75ef6c 33f78a545c
commit 6ff39d4aea
1 changed files with 51 additions and 11 deletions
--- a/intro-rewrite-070921.tex
+++ b/intro-rewrite-070921.tex
@@ -1,18 +1,47 @@
+%!TEX root=./main.tex
 %root: main.tex
 \section{Introduction (Rewrite - 070921)}\label{sec:intro-rewrite-070921}
 \input{two-step-model}
-A tuple independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a tuple $\inparen{\idb, \pd}$ such that $\idb$ is the set of deterministic instances of $\pdb$ (possible worlds) and $\pd$ is the probability distribution over $\idb$.  $\pdb$ can equivalently encoded as deterministic database with $\numvar$ tuples, with $\pd$ 
-%with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$.  The probability distribution $\pd$ over the set of database instances (possible worlds) is the one
-being the distribution induced from the requirement that each tuple in $\encodedDB$ be treated as an independent Bernoulli distributed random variable.  In bag query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$.  In addition to traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics further requires the following condition:
+A probabilistic database $\pdb$ is a tuple $\inparen{\idb, \pd}$ such that $\idb$ is a set of deterministic database instances (possible worlds) and $\pd$ is a probability distribution over $\idb$.  
+In bag count-query semantics the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$.  
+In addition to traditional deterministic query evaluation requirements (for a given query class), the count-query evaluation problem in bag-\abbrPDB semantics can be formally stated as:
 \begin{Problem}\label{prob:bag-pdb-query-eval}
 Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$),\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} compute the expected multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$.
 \end{Problem}
-$\query\inparen{\pdb}\inparen{\tup}$ can be encoded by a polynomial, with variables in the vector $\vct{X}$, such that each of the $\numvar$ tuples in $\vct{X}$ has its own unique variable, i.e. $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$.  Since $\raPlus$ operators have one to one correspondence to polynomial operators (\cref{fig:nxDBSemantics}), then there exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, where $\vct{W}$ the set of random variables corresponding to $\vct{X}$ drawn from $\pd$.  The expectation is any Bernoulli distribution $\pd$ over $\{0, 1\}^\numvar$ (the set of possible worlds), whose evaluation semantics follow the standard interpretation of addition and multiplication operators over the natural numbers, i.e. $\semN$-semiring semantics. While the aforementioned assumes set \abbrTIDB inputs, this is not limiting, as a simple generalization from bag-\abbrPDB\xplural to set-\abbrPDB\xplural exists.\footnote{A bag-\abbrTIDB can be reduced to a set-\abbrTIDB by assigning unique keys across all $\tup$ in $\pdb$.  This typically has an $\bigO{c}$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$, where $\db\inparen{\tup}$ denotes $\tup$'s multiplicity in the encoding.} 
+
+We initially focus on tuple-independent probabilistic bag-databases (\abbrTIDB), a compressed encoding of probabilistic databases where the presence of each individual copy of a tuple in a possible world can be modeled as an independent probabilistic event\footnote{
+This model corresponds to the classical set-relational approach to \abbrTIDB{}s, reducing duplicate tuples to a set-\abbrTIDB by assigning unique keys across all $\tup$ in $\pdb$.  This typically has an $\bigO{c}$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$, where $\db\inparen{\tup}$ denotes $\tup$'s multiplicity in the encoding.
+We further generalize this model in \cref{sec:background} and beyond.
+}.
+A \abbrTIDB encodes a compatible $\pdb$ as a deterministic database with $\numvar$ tuples, each annotated with a probability $\prob_\tup$, and with $\pd$ 
+%with a deterministic table $\encodedDB$ which is a set of $\numvar$ tuples, encoding the set of possible worlds $\idb$.  The probability distribution $\pd$ over the set of database instances (possible worlds) is the one
+being the distribution induced from the requirement that each tuple $\tup \in \encodedDB$ be treated as an independent Bernoulli distributed random variable with probability $\prob_\tup$.  
+The possible worlds of a \abbrTIDB can be encoded by the vector $\vct{W}$, such that each of the $\numvar$ tuples in $\encodedDB$ has its own unique Bernoulli-distributed random variable, i.e. $\vct{W} = \inparen{W_{\tup_1},\ldots, W_{\tup_\numvar}}$, and for each tuple $\tup$, $\probOf\pbox{W_\tup = 1} = \prob_\tup$.
+For any vector $\vct{X} \in \{0,1\}^\numvar$, denote by $\pdb_{\vct{X}}$ the deterministic database consisting of exactly those tuples $\tup$ where $X_\tup = 1$.
+
+When $\pdb$ is a \abbrTIDB, $\query\inparen{\pdb}\inparen{\tup}$ can be encoded by a polynomial, with variables in $\vct{X}$.
+Green, Karvounarakis, and Tannen established (\cite{DBLP:conf/pods/GreenKT07}; see \cref{fig:nxDBSemantics}) that for any $\raPlus$ query $\query$ and \abbrTIDB $\pdb$, there exists a polynomial $\poly_\tup\inparen{\vct{X}}$ following the standard addition and multiplication operators (i.e., $\semN$-semiring semantics), such that $\query\inparen{\pdb_{\vct{X}}}\inparen{\tup} = \poly_\tup\inparen{\vct{X}}$.
+This in turn implies that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$.  
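+As a minimal worked illustration (the polynomial below is a hypothetical example of ours, not one taken from the cited work), suppose that for distinct input tuples $\tup_1, \tup_2, \tup_3$ we have $\poly_\tup\inparen{\vct{X}} = X_{\tup_1}X_{\tup_2} + X_{\tup_1}X_{\tup_3}$, i.e., $\tup$ is derived once by joining $\tup_1$ with $\tup_2$ and once by joining $\tup_1$ with $\tup_3$.
+Linearity of expectation together with tuple independence then gives
+\begin{equation*}
+\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}} = \expct\pbox{W_{\tup_1}W_{\tup_2}} + \expct\pbox{W_{\tup_1}W_{\tup_3}} = \prob_{\tup_1}\prob_{\tup_2} + \prob_{\tup_1}\prob_{\tup_3}.
+\end{equation*}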
+
+Thanks to linearity of expectation, polynomial-time algorithms exist (e.g., \cite{kennedy:2010:icde:pip}) for computing exact results for bag-probabilistic count queries over \abbrTIDB{}s.
+However, the question remains: \emph{can queries over bag-probabilistic databases be as fast as deterministic queries?}
+In this paper, we explore the \emph{fine-grained complexity} of bag-probabilistic database query evaluation.

 The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$.  For example, the counting $k$-cliques query problem (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.    
-Simply considering deterministic query evaluation, however, is unsatisfying when considering the complexity of evaluating $\query$ over \abbrPDB\xplural, since it does not account for computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, entirely ignoring the `P' in \abbrPDB.

-\Cref{prob:bag-pdb-query-eval} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where each output tuple appears at most once.  Here, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives.  Evaluating such a formula follows the standard semantics of the said operators on boolean variables.} whose evaluation follows the standard semantics ($\semB$-semiring semantics), denoting the presence or absence of $\tup$.  Computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ determines the marginal probability of $\tup$ appearing in the output.  Dalvi and Suicu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard\footnote{\sharpp is the counting version for problems residing in the NP complexity class.} in general, and proved that a dichotomy exists for this problem, where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard for any polynomial-time deterministic query.  
+In this paper, we begin to explore whether the problem of bag-probabilistic query evaluation (which we relate to deterministic query processing more precisely below) falls into this same complexity class.
+An answer in the affirmative indicates that bag-probabilistic databases can be competitive with classical deterministic databases, opening the door for deployment in practice.
+
+\subsection{Relationship to Set-Probabilistic Query Evaluation}
+\Cref{prob:bag-pdb-query-eval} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where each output tuple appears at most once.  Here, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives.  Evaluating such a formula follows the standard semantics of the said operators on boolean variables ($\semB$-semiring semantics).} whose evaluation follows the standard semantics, denoting the presence or absence of $\tup$.  Computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ determines the marginal probability of $\tup$ appearing in the output.  Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard\footnote{\sharpp is the counting version for problems residing in the NP complexity class.} in general, and proved that a dichotomy exists for this problem, where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard for any polynomial-time deterministic query.  
+Concretely, easy queries in this dichotomy can be answered through so-called \emph{extensional} query evaluation, where probability computation is inlined into normal deterministic query processing.
+This is possible because queries on the easy side of the dichotomy can always be rewritten into a form that guarantees that, for every relational operator in the query, the presence of every tuple in the operator's output is governed by either a conjunction or a disjunction of \emph{independent} events.
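+As a small illustration under assumed notation (the events $E_1, E_2$ and their probabilities $p_1, p_2$ are hypothetical names of ours), two independent events combine as
+\begin{equation*}
+\probOf\pbox{E_1 \wedge E_2} = p_1 p_2, \qquad \probOf\pbox{E_1 \vee E_2} = 1 - \inparen{1 - p_1}\inparen{1 - p_2},
+\end{equation*}
+which is what lets each operator update output-tuple probabilities locally, without materializing lineage.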
+
+Such a guarantee is not possible for queries on the hard side of the dichotomy, and the best known approach is so-called \emph{intensional} query evaluation~\cite{DBLP:series/synthesis/2011Suciu}, a two-step process that first computes the lineage of the query result (a boolean formula analogous to $\Phi_\tup$), which it then uses to compute the probability.
+The complexity of this approach typically depends on the second step, computing the expectation $\expct\pbox{\poly_\tup(\vct{\randWorld})}$, a problem known to be $\sharpphard$~\cite{DS07}.
+
+
+
 %BEGIN Needs to be said
 %Since the hardness is in data complexity (the size of the input, $\Theta(\numvar$)), techniques such as parameterized complexity (bounding complexity by another parameter other than $\numvar$) and fine grained analysis (complexity analysis that asks what precisely is the value of this other parameter, for example, what is the value of $f(k)$ given a \sharpwone algorithm) of \abbrStepTwo will not refine the hardness results from \sharpphard.
 %END NEeds to be said
@@ -27,8 +56,17 @@ There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natura
 A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ separately from the complexity of deterministic query evaluation, effectively dividing \abbrPDB query evaluation into two steps: deterministic query evaluation\footnote{Given input $\pdb$, this step includes outputting every tuple $\tup$ that satisfies $\query$, annotated with its lineage polynomial ($\poly_\tup$) which is computed inline across the query operators of $\query$.\cite{Imielinski1989IncompleteII}\cite{DBLP:conf/pods/GreenKT07}} and computing expectation.  Viewing  \abbrPDB query evaluation as these two seperate steps is also known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}, illustrated in \cref{fig:two-step}.  
 The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upperbounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.}  The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. Such a model of computation is nicely followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separate from deterministic query evaluation to obtain exact output when $\query(\pdb)$ is hard since evaluating the probability inline with query operators (extensional evaluation) will only approximate the actual probability in such a case.  The paradigm of \cref{fig:two-step} is also analogous to semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with elements from the set of polynomials with variables in $\vct{X}$ and natural number coeficients and exponents.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and polynomial, and the $\semNX$-polynomial can then subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics.  Further, in this work, the intensional model lends itself nicely in separating the concerns of deterministic computation and the probability computation.  

-Let $\timeOf{\abbrStepOne}$ denote the runtime of \abbrStepOne and similarly for $\timeOf{\abbrStepTwo}$. 
-Given bag-\abbrPDB query $\query$ and \abbrTIDB $\pdb$  with $\numvar$ tuples, let us go a step further and assume that computing $\poly_\tup$ is lower bounded by the runtime of determistic query computation of $\query$ for the following situations, i.e. when $\abs{\textnormal{input}} \leq \abs{\textnormal{output}}$.  When $\poly_\tup$ is in standard monomial basis (\abbrSMB)\footnote{A polynomial is in \abbrSMB when it consists of a sum of unique products.}, by linearity of expectation and independence of \abbrTIDB, it follows that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$.  Let $\prob_i$ denote the probability of tuple $\tup_i$ ($\probOf\pbox{X_i = 1}$) for $i \in [\numvar]$.  Consider another special case when for all $i$ in $[\numvar]$, $\prob_i = 1$.  For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in 
+Analogous to set-probabilistic databases, we focus on the intensional model of query evaluation, as illustrated in \cref{fig:nxDBSemantics}.
+Given input $\pdb$ and $\query$, the first step, which we will refer to as \termStepOne (\abbrStepOne), outputs every tuple $\tup$ that satisfies $\query$, annotated with its lineage polynomial ($\poly_\tup$), which is computed inline across the query operators of $\query$~\cite{Imielinski1989IncompleteII,DBLP:conf/pods/GreenKT07}.
+We show in \cref{sec:circuit-runtime} that, assuming a standard $\raPlus$ query evaluation algorithm, the cost of constructing the lineage polynomial for all tuples in a query result is upper-bounded by the runtime of generating those tuples through deterministic query evaluation.
+In other words, the first step is no harder than deterministic query evaluation (which is itself \sharpwonehard), allowing us to focus on the complexity of the second step.
+The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. 
+
+We observe that the paradigm of \cref{fig:two-step} is also analogous to semiring provenance~\cite{DBLP:conf/pods/GreenKT07}, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with standard polynomials, i.e. elements from $\semNX$ connected by multiplication and addition operators.} query processing first computes the query and polynomial, and the $\semNX$-polynomial can then be evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics.  Further, in this work, the intensional model lends itself nicely to separating the concerns of deterministic computation and probability computation.  
+
+\subsection{Intensional Bag-Probabilistic Query Evaluation}
+Let $\timeOf{\abbrStepOne}$ denote the runtime of \abbrStepOne (Lineage Computation) and similarly for $\timeOf{\abbrStepTwo}$ (Expectation Computation). 
+Given bag-\abbrPDB query $\query$ and \abbrTIDB $\pdb$  with $\numvar$ tuples, let us go a step further and assume that computing $\poly_\tup$ is lower bounded by the runtime of deterministic query computation of $\query$ (e.g. when $\abs{\textnormal{input}} \leq \abs{\textnormal{output}}$).  When $\poly_\tup$ is in standard monomial basis (\abbrSMB)\footnote{A polynomial is in \abbrSMB when it consists of a sum of unique products.}, by linearity of expectation and independence of \abbrTIDB, it follows that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$.  Let $\prob_i$ denote the probability of tuple $\tup_i$ ($\probOf\pbox{X_i = 1}$) for $i \in [\numvar]$.  Consider another special case when for all $i$ in $[\numvar]$, $\prob_i = 1$.  For output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in 
 $\abs{\poly_\tup}$
 %the size of the arithemetic circuit
 , since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.}  Here is another special case where $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor).  These observations introduce our next problem statement:
@@ -37,7 +75,7 @@ $\abs{\poly_\tup}$
 Given a \abbrPDB $\pdb$ and $\raPlus$ query $\query$, is it \emph{always} the case that $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$?
 \end{Problem}

-If the answer to \cref{prob:big-o-step-one} is yes, then the query evaluation problem over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation.
+If the answer to \cref{prob:big-o-step-one} is yes, then the query evaluation problem over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation, and probabilistic databases can offer performance competitive with deterministic databases.

 The main insight of the paper is that we should not stop here.  One can have compact representations of $\poly_\tup(\vct{X})$ resulting from, for example, optimizations like projection push-down which produce factorized representations\footnote{A factorized representation is a representation of a polynomial that is not in \abbrSMB form.} of $\poly_\tup(\vct{X})$.  To capture such factorizations, this work uses (arithmetic) circuits
 \footnote{An arithmetic circuit has variable and/or numeric inputs, with internal nodes each of which can take on a value of either an addition or multiplication operator.}
@@ -67,6 +105,7 @@ as the representation system of $\poly_\tup(\vct{X})$, which are a natural fit t
 Above we have seen, given a circuit \circuit, if \circuit is in \abbrSMB, then we have that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$.  Such representations are produced by queries with the form $\project, \project\inparen{\join},$ etc.  Suppose, on the contrary, that \circuit is not in \abbrSMB and rather in some factorized form.  Then to naively compute \abbrStepTwo, one needs to convert \circuit into \circuit' such that \circuit' is in \abbrSMB, and then compute $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, which takes $\bigO{\abbrStepOne^k}$ time for the general $k$-wise factorization.  Since \abbrStepTwo lies between $\bigO{\abbrStepOne}$ and $\bigO{\abbrStepOne^k}$, it behooves us to determine which of these extremes is true for the general \circuit.  This leads us to the main problem statement of our paper: 
 \begin{Problem}\label{prob:intro-stmt}
 Given \abbrPDB $\pdb$ and $\raPlus$ query $\query$, is it always the case that $\timeOf{\abbrStepTwo}$ is $\bigO{\abbrStepOne}$?
+\OK{This doesn't parse.  What is $\bigO{\abbrStepOne}$?  Should this be $\bigO{\poly}$?}
 \end{Problem}

 %%%%%%%%%%%%%%%%%%%%%%%%%
@@ -78,7 +117,8 @@ Concretely, we make the following contributions:
 by reduction from counting the number of $k$-matchings over an arbitrary graph; we further show superlinear hardness in the size of \circuit for a specific %cubic 
 graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
 (ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries that makes \cref{prob:intro-stmt} true again; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ) followups~\cite{DBLP:conf/pods/KhamisNR16}), the approximation algorithm has runtime linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic\footnote{Note that this doesn't rule out queries for which approximation is linear.}); (iii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural); (iv) We further prove that for \raPlus queries 
-\AH{This point \emph{\Large seems} weird to me.  I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one.  Where am I missing it?}  
+\AH{This point \emph{\Large seems} weird to me.  I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one.  Where am I missing it?}
+\OK{Atri's (and most theoreticians') statements about complexity always need to be suffixed with ``to within a log factor''}  
 we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).

 \mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly_\tup$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. We motivate this reduced polynomial in what follows.
@@ -117,7 +157,7 @@ With $\Phi^2\inparen{\vct{X}}$ as an example, we have:
 \end{align*}
 It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}, \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$). In fact, the following lemma shows that this equivalence holds for {\em all} $\raPlus$ queries over TIDB (proof in \cref{subsec:proof-exp-poly-rpoly}).
 \begin{Lemma}
-Let $\pdb$ be a \abbrTIDB over variables $\vct{X} = \{X_1,\ldots,X_\numvar\}$ with the probability distribution $\pd$ induced by the probability vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ of each individual tuple's probability for each world $\vct{w}$ in the set of all $2^\numvar$ possible worlds $\vct{W}$.  For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$ the following holds:
+Let $\pdb$ be a \abbrTIDB over variables\OK{Should this be $\vct{W}$?} $\vct{X} = \{X_1,\ldots,X_\numvar\}$ with the probability distribution $\pd$ induced by the probability vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ of each individual tuple's probability for each world $\vct{w}$ in the set of all $2^\numvar$ possible worlds $\vct{W}$.  For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$ the following holds:

 \begin{equation*}
 	\expct_{\vct{W} \sim \pd}\pbox{\poly\inparen{\vct{W}}} = \rpoly\inparen{\probAllTup}.