Minor adjustments

2021-04-10 00:19:16 -04:00 · 2021-04-10 00:19:16 -04:00 · 6d2a684189
parent 58b70f0fcf
commit 6d2a684189
11 changed files with 49 additions and 39 deletions
--- a/app_notation-background.tex
+++ b/app_notation-background.tex
@ -160,7 +160,7 @@ Then, in expectation we have
 &= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \prob_i\label{p1-s4}\\
 &= \rpoly(\prob_1,\ldots, \prob_\numvar)\label{p1-s5}
 \end{align}
-In steps \cref{p1-s1} and \cref{p1-s2}, by linearity of expectation (recall the variables are independent), the expecation can be pushed all the way inside of the product.  In \cref{p1-s3}, note that $w_i \in \{0, 1\}$ which further implies that for any exponent $e \geq 1$, $w_i^e = w_i$.  Next, in \cref{p1-s4} the expectation of a tuple is indeed its probability.
+In steps \cref{p1-s1} and \cref{p1-s2}, by linearity of expectation (recall the variables are independent, or the monomial expectation is 0), the expectation can be pushed all the way inside of the product.  In \cref{p1-s3}, note that $w_i \in \{0, 1\}$ which further implies that for any exponent $e \geq 1$, $w_i^e = w_i$.  Next, in \cref{p1-s4} the expectation of a tuple is indeed its probability.

 Finally, observe \Cref{p1-s5} by construction in \Cref{lem:pre-poly-rpoly}, that $\rpoly(\prob_1,\ldots, \prob_\numvar)$ is exactly the product of probabilities of each variable in each monomial across the entire sum.
 \qed
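The identity argued in this hunk can be checked numerically on a small instance. The following is only a sketch under stated assumptions: variables are independent and $\{0,1\}$-valued (the \ti case), a polynomial is encoded as a dict from exponent tuples to coefficients, and the function names and the example polynomial are hypothetical, chosen for illustration rather than taken from the paper.

```python
from itertools import product
from math import prod, isclose

def expectation_by_worlds(poly, probs):
    """Brute force: sum poly(w) * Pr[w] over all worlds w in {0,1}^n."""
    total = 0.0
    for world in product((0, 1), repeat=len(probs)):
        # Probability of this world under independent variables.
        weight = prod(p if w else 1.0 - p for w, p in zip(world, probs))
        # Value of the polynomial in this world (note 0 ** 0 == 1 in Python).
        value = sum(c * prod(w ** e for w, e in zip(world, exps))
                    for exps, c in poly.items())
        total += weight * value
    return total

def reduced_at_probs(poly, probs):
    """Evaluate the reduced polynomial: every exponent >= 1 is treated as 1."""
    return sum(c * prod(p for p, e in zip(probs, exps) if e >= 1)
               for exps, c in poly.items())

# Hypothetical example: X1^2 * X2 + 2 * X1 * X2^3 * X3
poly = {(2, 1, 0): 1, (1, 3, 1): 2}
probs = (0.5, 0.25, 0.8)
lhs = expectation_by_worlds(poly, probs)
rhs = reduced_at_probs(poly, probs)
```

Both quantities agree here, which mirrors the step $w_i^e = w_i$ in the proof; the brute-force version is exponential in the number of variables and is meant only as a sanity check.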
--- a/appendix.tex
+++ b/appendix.tex
@ -3,7 +3,7 @@
 \section{Missing details from Section~\ref{sec:background}}\label{sec:proofs-background}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Supplementary Material for~\Cref{prop:expection-of-polynom}}\label{subsec:supp-mat-background}\label{subsec:supp-mat-krelations}
+\subsection{$\semK$-relations and $\semNX$-PDBs}\label{subsec:supp-mat-background}\label{subsec:supp-mat-krelations}
 \input{app_notation-background}


--- a/approx_alg.tex
+++ b/approx_alg.tex
@ -3,7 +3,7 @@

 \section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}

-In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability  (\Cref{th:single-p-hard}).
+In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries), and by extension \bi (or any $\semNX$-PDB), is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability  (\Cref{th:single-p-hard}).
 Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits: please see the discussion after~\Cref{lem:val-ub} for more.}
 The folowing approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark~\cite{pdbench}.
 %it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
@ -106,7 +106,7 @@ Further, under either of the following conditions:
 we have $\abs{\circuit}(1,\ldots, 1)\le  \size(\circuit)^{O(k)}.$
 \end{Lemma}

-Note that the above implies that with the assumption $\prob_0>0$ and $\gamma<1$ are absolute constants from \Cref{cor:approx-algo-const-p}, then the runtime there simplies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the specific conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios.
+Note that the above implies that, under the assumption from \Cref{cor:approx-algo-const-p} that $\prob_0>0$ and $\gamma<1$ are absolute constants, the runtime there simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)^2\cdot \log{\frac{1}{\conf}}\right)$ for general circuits $\circuit$ and to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\size(\circuit)\cdot \log{\frac{1}{\conf}}\right)$ for the case when $\circuit$ satisfies the specific conditions in~\Cref{lem:val-ub}. In~\Cref{app:proof-lem-val-ub} we argue that these conditions are very general and encompass many interesting scenarios, including query evaluation under \raPlus or FAQ.

 \subsection{Approximating $\rpoly$}
 The algorithm (\approxq detailed in \Cref{alg:mon-sam}) to prove~\Cref{lem:approx-alg} follows from the following observation.  Given a query polynomial $\poly(\vct{X})=\polyf(\circuit)$ for circuit \circuit over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
--- a/circuits-model-runtime.tex
+++ b/circuits-model-runtime.tex
@ -1,7 +1,7 @@
 %!TEX root=./main.tex

 \section{More on Circuits and Moments}\label{sec:gen}
-We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered in the same runtime as deterministic queries under reasonable assumptions.
+We formalize our claim from \Cref{sec:intro} that a linear approximation algorithm for our problem implies that PDB queries (under bag semantics) can be answered (approximately) in the same runtime as deterministic queries under reasonable assumptions.
 Lastly, we generalize our result for expectation to other moments.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -18,7 +18,7 @@ Lastly, we generalize our result for expectation to other moments.
 \mypar{The cost model}
 %\label{sec:cost-model}
 So far our analysis of $\approxq$ has been in terms of the size of the lineage circuits.
-We now show that this model corresponds to the behavior of a deterministic database by proving that for any UCQ query $\poly$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
+We now show that this model corresponds to the behavior of a deterministic database by proving that for any \raPlus query $\poly$, we can construct a compressed circuit for $\poly$ and \bi $\pxdb$ of size (and in runtime) linear in that of a general class of query processing algorithms for the same query $\poly$ on a deterministic database $\db$.
 We assume a linear relationship between input sizes $|\pxdb|$ and $|\db|$ (i.e., $\exists c, \db \in \pxdb$ s.t. $\abs{\pxdb} \leq c \cdot \abs{\db})$).
 \footnote{This is a reasonable assumption because each block of a \bi represents entities with uncertain attributes.
 In practice there is often a limited number of alternatives for each block (e.g., which of five conflicting data sources to trust). Note that all \tis trivially fulfill this condition (i.e., $c = 1$).}
@ -75,9 +75,11 @@ This follows from~\Cref{lem:circuits-model-runtime} (\cref{sec:circuit-runtime})
 %\label{sec:momemts}
 %
 We make a simple observation to conclude the presentation of our results.
-So far we have only focused on the expectation of $\poly$.  In addition, we could e.g. prove bounds of probability of the multiplicity being at least $1$.  Progress can be made on this as follows:
+So far we have only focused on the expectation of $\poly$.  
+In addition, we could, e.g., prove bounds on the probability of the multiplicity being at least $1$.
+Progress can be made on this as follows:
 For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to e.g. use Chebyschev inequality or other high moment based probability bounds on the events we might be interested in.
-We leave a further investigation of this question for future work.
+We leave further investigations for future work.

 %%% Local Variables:
 %%% mode: latex
--- a/intro-new.tex
+++ b/intro-new.tex
@ -4,7 +4,7 @@
 \section{Introduction}
 \label{sec:intro}
 A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
-A well-studied problem in probabilistic databases is, given a query $\query$ and probabilistic database $\pdb$, computing the \emph{marginal probability} of a tuple $\tup$, (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
+A well-studied problem in probabilistic databases is to, given a query $\query$ and a probabilistic database $\pdb$, compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
 This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases, from cases that are in \ptime for unions of conjunctive queries (UCQs).
 In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -115,7 +115,7 @@ In this work we consider bag semantics, where each tuple is associated with a mu

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Example}\label{ex:intro-tbls}
-  Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analog to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where processing of shipping is on time.
+  Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where shipment processing is on time.

 $$Q_1(\text{City}) \dlImp Loc(\text{City}), Route(\text{City}, \dlDontcare)$$

@ -126,7 +126,10 @@ Since the Chicago location has a 50\% probability of being on schedule (we assum
 \end{Example}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-A well-known result in probabilistic databases is that under set semantics the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions over variables $\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}) over random variables that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
+A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions)
+over random variables 
+($\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07}) 
+that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
 Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).

 For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
@ -135,11 +138,12 @@ Analogously to sets, evaluating the lineage for $t$  over an assignment correspo

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Example}\label{ex:intro-lineage}
-Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected multiplicity of this result tuple is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
-In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation to  solve the problem in linear time (in the size of  $\linsett{\query}{\pdb}{\tup}$).
-The expectation of the sum is the sum of the expectations of monomials.
+Associating a lineage variable with every input tuple as shown in \cref{fig:ex-shipping-simp}, we can compute the lineage of every result tuple as shown in \cref{subfig:ex-shipping-simp-route}. For example, the tuple Chicago is in the result, because $L_b$ joins with both $R_b$ and $R_c$. Its lineage is $\Phi = L_b \cdot R_b + L_b \cdot R_c$. The expected tuple multiplicity is calculated by summing the multiplicity of the result tuple, weighted by its probability, over all possible worlds.
+In this example, $\Phi$ is a sum of products (SOP), and so we can use linearity of expectation 
+(i.e., the expectation of the sum is the sum of the expectations of monomials)
+to solve the problem in linear time (in the size of  $\linsett{\query}{\pdb}{\tup}$).
 The expectation of each monomial is then computed by multiplying the probabilities of the variables (tuples) occurring in the monomial.
-The expected multiplicity of Chicago is $1.0$.
+The expected multiplicity for Chicago is $1.0$.
 \end{Example}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

@ -147,29 +151,29 @@ The expected multiplicity of a query result can be computed in linear time (in t
 However, this need not be true for compressed representations of polynomials, including factorized polynomials or arithmetic circuits.
 For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}.
 The left circuit encodes the lineage as a SOP while the right circuit uses distributivity to push the addition gate below the multiplication, resulting in a smaller circuit.
-Given that there is a large body of work that can  output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
+Given that there is a large body of work (on, e.g., deterministic bag-relational query processing) that can  output such compressed representations~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}, %\BG{cite FDBs and FAQ},
 an interesting question is whether computing expectations is still in linear time for such compressed representations.
 If the answer is in the affirmative, and if lineage formulas can also be computed in linear time (in the lineage size), then bag-relational probabilistic databases can theoretically match the performance of deterministic databases.
 Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.

 Concretely, we make the following contributions:
 (i) We show that the expected result multiplicity problem for conjunctive queries for bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;
-(ii) We present an $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or its FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding; %;\BG{Fix not linear in all cases, restate after 4 is done}
+(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g., when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding; %;\BG{Fix not linear in all cases, restate after 4 is done}
 (iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
-(iv) We further prove that for \raPlus queries\AR{Some places we use \raPlus and UCQ in others: we should use one consistently (assuming they are both the same)},  we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
+(iv) We further prove that for \raPlus queries (an equivalently expressive, but factorizable, form of UCQs),  we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).

 %\mypar{Implications of our Results} As mentioned above

 \mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial by continuing~\Cref{ex:intro-tbls}.

 %Moving forward, we focus exclusively on bags.
-Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}_1, \text{City}_2),$ $OnTime(\text{City}')$ over the bag relations of \cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2()\dlImp Q(), Q()$.
+Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$ over the bag relations of \cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\poly^2()\dlImp Q(), Q()$.
 %The factorized representation of $\poly^2$ is (for simplicity we ignore the random variables of $Route$ since each variable has probability of $1$):
 %\begin{equation*}
 %\poly^2 = \left(L_aL_b + L_bL_d + L_bL_c\right) \cdot \left(L_aL_b + L_bL_d + L_bL_c\right)
 %\end{equation*}
 %This equivalent SOP representation is
-Note that the lineage polynomial for $Q^2$ is given by $\Phi^2$:
+The lineage polynomial for $Q^2$ is given by $\Phi^2$:
 \begin{equation*}
 \left(L_aL_b + L_bL_d + L_bL_c\right)^2=L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_dL_c.
 \end{equation*}
@ -178,7 +182,7 @@ The expectation $\expct\pbox{\Phi^2}$ then is:
 \expct\pbox{L_a}\expct\pbox{L_b^2} + \expct\pbox{L_b^2}\expct\pbox{L_d^2} + \expct\pbox{L_b^2}\expct\pbox{L_c^2} + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_d} \\
 + 2\expct\pbox{L_a}\expct\pbox{L_b^2}\expct\pbox{L_c} + 2\expct\pbox{L_b^2}\expct\pbox{L_d}\expct\pbox{L_c}
 \end{multline*}
-\noindent Note that if the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
+\noindent If the domain of a random variable $W$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{W^k} = \expct\pbox{W}$, which means that $\expct\pbox{\Phi^2}$ simplifies to:
 \begin{footnotesize}
 \begin{equation*}
 \expct\pbox{L_a^2}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{L_d}\expct\pbox{L_c}
@ -198,7 +202,7 @@ It can be verified that the reduced polynomial is a closed form of the expected

 %The reduced form of a lineage polynomial can be obtained but requires a linear scan over the clauses of an SOP encoding of the polynomial.  Note that for a compressed representation, this scheme would require an exponential number of computations in the size of the compressed representation.  In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results .

-To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode variaous hard graph counting problems.  We do so by analyzing  how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For the upper bound is easy to check that if all the probabilties are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. To get an $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.
+To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode various hard graph-counting problems.  We do so by analyzing  how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For the upper bound, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. To get a $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.
--- a/k-relations.tex
+++ b/k-relations.tex
@ -1,7 +1,7 @@
 %!TEX root=./main.tex
 We use $\domK$-relations to model bags. A \emph{$\domK$-relation}~\cite{DBLP:conf/pods/GreenKT07} is a relation whose tuples are annotated with elements from a commutative semiring $\semK = (\domK, \addK, \multK, \zeroK, \oneK)$.  A commutative semiring is a structure with a domain $\domK$ and associative and commutative binary operations $\addK$ and $\multK$ such that $\multK$ distributes over $\addK$, $\zeroK$ is the identity of $\addK$, $\oneK$ is the identity of $\multK$, and $\zeroK$ annihilates all elements of $\domK$ when combined by $\multK$.
 Let $\udom$ be a countable domain of values.
-Formally, an n-ary $\semK$-relation over a domain of attribute values $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
+Formally, an n-ary $\semK$-relation over $\udom$ is a function $\rel: \udom^n \to \domK$ with finite support $\support{\rel} = \{ \tup \mid \rel(\tup) \neq \zeroK \}$.
 A $\semK$-database is a set of $\semK$-relations. It will be convenient to also interpret a $\semK$-database as a function from tuples to annotations. Thus, $\rel(t)$ (resp., $\db(t)$) denotes the annotation associated by $\semK$-relation $\rel$ ($\semK$-database $\db$) to $t$.

 For completeness, we briefly review the semantics for $\raPlus$ queries over $\semK$-relations~\cite{DBLP:conf/pods/GreenKT07} illustrated in \cref{fig:nxDBSemantics}.
--- a/macros.tex
+++ b/macros.tex
@ -95,8 +95,8 @@
 \newcommand{\rchild}{\vari{R}}
 %members of T
 \newcommand{\val}{\vari{val}}
-\newcommand{\wght}{\vari{weight}\xspace}
-\newcommand{\vpartial}{\vari{partial}\xspace}
+\newcommand{\wght}{\vari{weight}}
+\newcommand{\vpartial}{\vari{partial}}
 %types of T
 \newcommand{\var}{\textsc{var}\xspace}
 \newcommand{\tnum}{\textsc{num}\xspace}
--- a/mult_distinct_p.tex
+++ b/mult_distinct_p.tex
@ -3,7 +3,7 @@
 \section{Hardness of exact computation}
 \label{sec:hard}

-In this section, we will prove that computing $\expct\limits_{\vct{W} \sim \pd}\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial  $\poly(\vct{X})$ generated from a project-join query (even in an expression tree representation) is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard, even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed valued $\prob \in (0, 1)$ as long as certain popular hardness conjectures in fine-grained complexity hold.
+In this section, we will prove that computing $\expct\limits_{\vct{W} \sim \pd}\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial  $\poly(\vct{X})$ generated from a project-join query (even an expression tree representation) is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs under bag semantics. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard, even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed value $\prob \in (0, 1)$ as long as certain popular hardness conjectures in fine-grained complexity hold.


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
--- a/poly-form.tex
+++ b/poly-form.tex
@ -33,7 +33,7 @@ The degree of polynomial $\poly(\vct{X})$ is the largest $\sum_{j=1}^n i_j$ such
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 The degree of the polynomial $X^2+2XY+Y^2$ is $2$.
-Product terms in lineage arise only as a consequence of join operations, so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins in any clause of the UCQ query that created it.
+Product terms in lineage arise only from join operations (\cref{fig:nxDBSemantics}), so intuitively, the degree of a lineage polynomial is analogous to the largest number of joins in any clause of the UCQ query that created it.
 In this paper we consider only finite degree polynomials.
 %
 % Throughout this paper, we also make the following \textit{assumption}.
@ -48,13 +48,7 @@ We call a polynomial $\query(\vct{X})$ a \emph{\bi-lineage polynomial} (resp., \
 %\AH{Why is it required for the tuple to be n-ary?  I think this slightly confuses me since we have n tuples.}
 % OK: agreed w/ AH, this can be treated as implicit
 there exists a $\raPlus$ query $\query$, \bi $\pxdb$ (\ti $\pxdb$, or $\semNX$-PDB $\pxdb$), and tuple $\tup$ such that $\query(\vct{X}) = \query(\pxdb)(\tup)$. % Before proceeding, note that the following is assume that polynomials are  \bis (which subsume \tis as a special case).
-As a special case of \bis, the following applies to \tis as well.
-In a \bi $\pxdb$, tuples are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_{i,j} \in \block_i$ is associated with a probability $\prob_{\tup_{i,j}} = \pd[X_{i,j} = 1]$, and is annotated with a unique variable $X_{i,j}$.\footnote{
-  Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function).  For $t_j \in b_i$, the event $(X_{i,j} = 1)$ corresponds to the event $(X_i = j)$ in the customary annotation scheme.
-}
-Because blocks are independent and tuples from the same block are disjoint, the probabilities $\prob_{\tup_{i,j}}$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
-We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as
-$\poly(\vct{X})$ = $\poly(X_{1, 1},\ldots, X_{1, \abs{\block_1}},$ $\ldots, X_{\ell, \abs{\block_\ell}})$, where $\abs{\block_i}$ denotes the size of $\block_i$.\footnote{Later on in the paper, especially in~\Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{b_i}$.}
+
 %\SF{Where is $\block_{i, j}$ used? Is it $X_{\block_{1, 1}}$ or $X_{\block_1, 1}$ ?}
 % and the probability distribution of $\pxdb$ is  uniquely determined based on a probability vector $\vct{p}$ that associates each tuple a probability
 % variables are independent of each other (or disjoint if they are from the same block) and each variable $X$ is associated with a probability $\vct{p}(X) = \pd[X = 1]$. Thus, we are dealing with polynomials $\poly(\vct{X})$ that are annotations of a tuple in the result of a query $\query$ over a BIDB $\pxdb$ where $\vct{X}$ is the set of variables that occur in annotations of tuples of $\pxdb$.
@ -72,7 +66,7 @@ Let $S$ be a {\em set} of polynomials over $\vct{X}$. Then $\poly(\vct{X})\mod{S
 \end{Definition}
 For example for a set of polynomials $S=\inset{X^2-X, Y^2-Y}$, taking the polynomial $2X^2 + 3XY - 2Y^2\mod S$ yields $2X+3XY-2Y$.
 %
-\begin{Definition}\label{def:mod-set-polys}
+\begin{Definition}[$\mathcal B$, $\mathcal T$]\label{def:mod-set-polys}
 Given the set of BIDB variables $\inset{X_{i,j}}$, define

 \setlength\parindent{0pt}
@ -109,7 +103,8 @@ Given the set of BIDB variables $\inset{X_{i,j}}$, define
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %

-All exponents $e > 1$ in $\smbOf{\poly(\vct{X})}$ are reduced to $e = 1$ via mod $\mathcal{T}$.  Performing the modulus of $\rpoly(\vct{X})$ with $\mathcal{B}$ ensures the disjoint condition of \bi, removing monomials with lineage variables from the same block.%, (recall the constraint on tuples from the same block being disjoint in a \bi).% any monomial containing more than one tuple from a block has $0$ probability and can be ignored).
+All exponents $e > 1$ in $\smbOf{\poly(\vct{X})}$ are reduced to $e = 1$ via mod $\mathcal{T}$.  Performing the modulus of $\rpoly(\vct{X})$ with $\mathcal{B}$ enforces the disjointness condition of \bi, removing every monomial that contains more than one lineage variable from the same block.
+%, (recall the constraint on tuples from the same block being disjoint in a \bi).% any monomial containing more than one tuple from a block has $0$ probability and can be ignored).
 %
 For the special case of \tis, the second step is not necessary since every block contains a single tuple.
 %Alternatively, one can think of $\rpoly$ as the \abbrSMB of $\poly(\vct{X})$ when the product operator is idempotent.
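The two reduction steps described in this hunk can be sketched in Python. This is only an illustration under stated assumptions: the dictionary encoding of a polynomial and the name `reduce_bidb` are hypothetical and do not come from the paper, and the exact definitions of $\mathcal{B}$ and $\mathcal{T}$ are elided in this diff, so the code follows only the prose (mod $\mathcal{T}$ collapses exponents to $1$; mod $\mathcal{B}$ drops monomials with several variables from one block).

```python
from collections import defaultdict


def reduce_bidb(poly):
    """Reduce a polynomial as described in the prose above (assumed encoding).

    poly maps a monomial to its coefficient; a monomial is a tuple of
    ((block, j), exponent) pairs, one pair per variable X_{block,j}.
    """
    reduced = defaultdict(int)
    for monomial, coeff in poly.items():
        # mod T: any exponent e >= 1 collapses to 1, so only the set of
        # variables occurring in the monomial matters.
        variables = frozenset(var for var, exp in monomial if exp >= 1)
        blocks = [block for block, _ in variables]
        # mod B: a monomial with two distinct variables from the same
        # block is dropped.
        if len(blocks) != len(set(blocks)):
            continue
        reduced[variables] += coeff
    return {m: c for m, c in reduced.items() if c != 0}


# 2*X_{1,1}^2 + 3*X_{1,1}*X_{2,1} + 5*X_{1,1}*X_{1,2}
example = {
    (((1, 1), 2),): 2,
    (((1, 1), 1), ((2, 1), 1)): 3,
    (((1, 1), 1), ((1, 2), 1)): 5,
}
result = reduce_bidb(example)
```

In this example the first monomial becomes $2X_{1,1}$, the second is kept unchanged, and the third is removed because both of its variables belong to block $1$. For a \ti every block holds a single tuple, so the block check never fires and only the exponent reduction has any effect.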
--- a/prob-def.tex
+++ b/prob-def.tex
@ -12,7 +12,7 @@ We represent query polynomials via {\em arithmetic circuits}~\cite{arith-complex
 \begin{Definition}[Circuit]\label{def:circuit}
 A circuit $\circuit$ is a Directed Acyclic Graph (DAG) whose source nodes (in degree of $0$) consist of elements in either $\reals$ or $\vct{X}$.  The internal nodes and (the single) sink node of $\circuit$ (corresponding to the result tuple $t$) have binary input and are either sum ($\circplus$) or product ($\circmult$) gates.

-$\circuit$ additionally has the following members: \type, \vari{val}, \vari{partial}, \vari{input}, \degval and \vari{Lweight}, \vari{Rweight}, where \type is the type of value stored in the node $\circuit$ (i.e. one of $\{\circplus, \circmult, \var, \tnum\}$, \val is the value stored (a constant or variable), and \vari{input} is the list of \circuit 's inputs where $\circuit_\linput$ is the left input and $\circuit_\rinput$ the right input.
+$\circuit$ additionally has the following members: \type, \val, \vpartial, \vari{input}, \degval and \vari{Lweight}, \vari{Rweight}, where \type is the type of value stored in the node $\circuit$ (i.e., one of $\{\circplus, \circmult, \var, \tnum\}$), \val is the value stored (a constant or variable), and \vari{input} is the list of \circuit 's inputs where $\circuit_\linput$ is the left input and $\circuit_\rinput$ the right input.
 %The member \degval holds the degree of \circuit.
 When the underlying DAG is a tree (with edges pointing towards the root), we will refer to the structure as an expression tree \etree.  Note that in such a case, the root of \etree is analogous to the sink of \circuit.
 \end{Definition}
--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@ -54,7 +54,7 @@ For a probabilistic  database $\pdb = (\idb, \pd)$,  the result of a query is th

 Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_n)$ with natural number coefficients and exponents.
 We model incomplete relations using Green et al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations} and summarized here.
-In an $\semNX$-databases, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
+In an $\semNX$-database, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
 We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$.
 Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{|X|}$.
 The multiplicity of $t \in R$ in this possible world is obtained by evaluating the polynomial annotating it on $\vct{W}$ (i.e., $R(t)(\vct{W})$).
@ -63,7 +63,7 @@ $\semNX$-relations are closed under $\raPlus$ (\cref{fig:nxDBSemantics}).

 \begin{figure}
 \begin{align*}
-  \evald{\project_A(\rel)}{\db}(\tup) =& \bigoplus_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
+  \evald{\project_A(\rel)}{\db}(\tup) =& \sum_{\tup': \project_A(\tup') = \tup} \evald{\rel}{\db}(\tup') &
  \evald{(\rel_1 \union \rel_2)}{\db}(\tup) =& \evald{\rel_1}{\db}(\tup) + \evald{\rel_2}{\db}(\tup)\\
  \evald{\select_\theta(\rel)}{\db}(\tup) =& \begin{cases}
    \evald{\rel}{\db}(\tup) & \text{if }\theta(\tup) \\
@ -104,9 +104,18 @@ We focus on this problem from now on, assume an implicit result tuple, and so dr
 In this paper, we focus on two popular forms of PDB: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
 %
 A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB  such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same blocks are disjoint events.
+In other words, each random variable corresponds to the event of a single tuple's presence.
 %
 A \emph{\ti} is a \bi where each block contains exactly one tuple.
 \Cref{subsec:supp-mat-ti-bi-def} explains \tis and \bis in greater detail.
+%
+In a \bi (and by extension a \ti) $\pxdb$, tuples are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ where tuple $t_{i,j} \in \block_i$ is associated with a probability $\prob_{\tup_{i,j}} = \pd[X_{i,j} = 1]$, and is annotated with a unique variable $X_{i,j}$.\footnote{
+  Although only a single independent, $[\abs{\block_i}+1]$-valued variable is customarily used per block, we decompose it into $\abs{\block_i}$ correlated $\{0,1\}$-valued variables per block that can be used directly in polynomials (without an indicator function).  For $t_{i,j} \in \block_i$, the event $(X_{i,j} = 1)$ corresponds to the event $(X_i = j)$ in the customary annotation scheme.
+}
+Because blocks are independent and tuples from the same block are disjoint, the probabilities $\prob_{\tup_{i,j}}$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
+We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as
+$\poly(\vct{X})$ = $\poly(X_{1, 1},\ldots, X_{1, \abs{\block_1}},$ $\ldots, X_{\ell, \abs{\block_\ell}})$, where $\abs{\block_i}$ denotes the size of $\block_i$.\footnote{Later on in the paper, especially in~\Cref{sec:algo}, we will overload notation and rename the variables as $X_1,\dots,X_n$, where $n=\sum_{i=1}^\ell \abs{\block_i}$.}
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
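The added paragraph says that the per-tuple probabilities together with the block structure induce the distribution $\pd$. A minimal Python sketch of that construction follows. The function name `bidb_worlds`, the list-of-lists input, and the 0-based block and tuple indices are assumptions made for illustration and are not part of the paper.

```python
from itertools import product


def bidb_worlds(block_probs):
    """Yield (world, probability) pairs for a block-independent database.

    block_probs[i][j] is the assumed marginal probability of tuple j in
    block i; a world is the frozenset of (i, j) pairs that are present.
    """
    options = []
    for i, probs in enumerate(block_probs):
        # Tuples of one block are disjoint events: the block contributes
        # at most one tuple, or none with the leftover probability mass.
        choices = [(None, 1.0 - sum(probs))]
        choices += [((i, j), p) for j, p in enumerate(probs)]
        options.append(choices)
    # Blocks are independent: a world picks one choice per block and its
    # probability is the product of the per-block choice probabilities.
    for combo in product(*options):
        world = frozenset(var for var, _ in combo if var is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield world, prob


def prob_all_present(block_probs, variables):
    """Probability that every listed (i, j) variable equals 1."""
    wanted = set(variables)
    return sum(p for w, p in bidb_worlds(block_probs) if wanted <= w)


blocks = [[0.5, 0.25], [0.5]]  # two blocks, with 2 and 1 tuples
total = sum(p for _, p in bidb_worlds(blocks))
same_block = prob_all_present(blocks, [(0, 0), (0, 1)])
cross_block = prob_all_present(blocks, [(0, 0), (1, 0)])
```

Enumerating worlds is exponential in the number of blocks, so this is useful only for checking small cases: here two tuples of the same block are never present together, while tuples from different blocks multiply their marginals.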