Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds

2021-04-10 13:00:06 -05:00 · 2021-04-10 13:00:06 -05:00 · e836805534
parent 7c56dd55b2 c2c5f3d936
commit e836805534
6 changed files with 30 additions and 47 deletions
--- a/abstract.tex
+++ b/abstract.tex
@ -4,7 +4,7 @@
  The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
  The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
  In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected  multiplicity) exactly and approximately.
-  For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products.
+  For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage, if this polynomial is encoded as a sum of products (the standard operating procedure for Set-PDBs).
  However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
 Such factorized representations are
 exploited by modern join algorithms (e.g., worst-case optimal joins), and
@ -13,7 +13,7 @@ so our results imply that computing probabilities for Bag-PDB based on the resul
  The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $\prob$ (s.t. $\prob \in (0,1)$).
  We proceed to study polynomials of result tuples of union of conjunctive queries (UCQs) over TIDBs and for a non-trivial subclass of block-independent databases (BIDBs).
  We develop a sampling algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of polynomial circuits in linear time in the size of the polynomial.
-  By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for PDBs that can be competitive with deterministic databases.
+  By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
 \end{abstract}

 %%% Local Variables:
--- a/app_notation-background.tex
+++ b/app_notation-background.tex
@ -46,7 +46,7 @@ Importantly, as the following proposition shows, any finite $\semN$-PDB can be e
 $\semNX$-PDBs are a complete representation system for $\semN$-PDBs that is closed under $\raPlus$ queries.
 \end{Proposition}

-%\subsection{Proof of \Cref{prop:semnx-pdbs-are-a-}}
+%\subsection{Proof of~\Cref{prop:semnx-pdbs-are-a-}}
 \begin{proof}
 To prove that $\semNX$-PDBs are complete, consider the following construction that for any $\semN$-PDB $\pdb = (\idb, \pd)$ produces an $\semNX$-PDB $\pxdb = (\idb_{\semNX}, \pd')$  such that $\rmod(\pxdb) = \pdb$. Let $\idb = \{D_1, \ldots, D_{\abs{\idb}}\}$ and let $\max(D_i)$ denote $\max_{\tup} D_i(\tup)$. For each world $D_i$ we create a corresponding variable $X_i$.
 %variables $X_{i1}$, \ldots, $X_{im}$ where $m = max(D_i)$.
@ -91,13 +91,13 @@ Denote the vector $\vct{p}$ to be a vector whose elements are the individual pro
  = \sum\limits_{\substack{\vct{w} \in \{0, 1\}^\numvar\\ s.t. w_j,w_{j'} = 1 \rightarrow \not \exists b_i \supseteq \{t_{i,j}, t_{i',j}\}}} \poly(\vct{w})\prod_{\substack{j \in [\numvar]\\ s.t. \wElem_j = 1}}\prob_j \prod_{\substack{j \in [\numvar]\\s.t. w_j = 0}}\left(1 - \prob_j\right)
 \end{align}
 %
-Recall that tuple blocks in a TIDB always have size 1, so the outer summation of \Cref{eq:tidb-expectation} is over the full set of vectors.
+Recall that tuple blocks in a TIDB always have size 1, so the outer summation of \cref{eq:tidb-expectation} is over the full set of vectors.
 \BG{Oliver's conjecture: Bag-\tis + Q can express any finite bag-PDB:
 A well-known result for set semantics PDBs is that while not all finite PDBs can be encoded as \tis, any finite PDB can be encoded using a \ti and a query. An analog result holds in our case: any finite $\semN$-PDB can be encoded as a bag \ti and a query (WHAT CLASS? ADD PROOF)
 }

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Proof of \Cref{prop:expection-of-polynom}}
+\subsection{Proof of~\Cref{prop:expection-of-polynom}}
 \label{subsec:expectation-of-polynom-proof}
 \begin{proof}
 We need to prove for $\semN$-PDB $\pdb = (\idb,\pd)$ and $\semNX$-PDB $\pxdb = (\db',\pd')$ where $\rmod(\pxdb) = \pdb$ that $\expct_{\db \sim \pd}[\query(\db)(t)] = \expct_{\vct{W} \sim \pd'}\pbox{\polyForTuple(\vct{W})}$
@ -116,7 +116,7 @@ By expanding $\polyForTuple$ and the expectation we have:
 \end{proof}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{ \Cref{lem:pre-poly-rpoly}}\label{app:subsec-pre-poly-rpoly}
+\subsection{\Cref{lem:pre-poly-rpoly}}\label{app:subsec-pre-poly-rpoly}
 \begin{Lemma}\label{lem:pre-poly-rpoly}
 If
 $\poly(X_1,\ldots, X_\numvar) = \sum\limits_{\vct{d} \in \{0,\ldots, B\}^\numvar}q_{\vct{d}} \cdot \prod\limits_{\substack{i = 1\\s.t. d_i\geq 1}}^{\numvar}X_i^{d_i}$
@ -125,8 +125,8 @@ $\rpoly(X_1,\ldots, X_\numvar) = \sum\limits_{\vct{d} \in \eta} q_{\vct{d}}\cdot
 \end{Lemma}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\begin{proof}%[Proof for \Cref{lem:pre-poly-rpoly}]
-Follows by the construction of $\rpoly$ in \Cref{def:reduced-bi-poly}. 
+\begin{proof}%[Proof for~\Cref{lem:pre-poly-rpoly}]
+Follows by the construction of $\rpoly$ in \cref{def:reduced-bi-poly}. 
 \qed
 \end{proof}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -141,7 +141,7 @@ $%  \[
 $%    \]
 \end{Proposition}

-\begin{proof}%[Proof for \Cref{proposition:q-qtilde}]
+\begin{proof}%[Proof for~\Cref{proposition:q-qtilde}]
 Note that any $\poly$ in factorized form is equivalent to its \abbrSMB expansion.  For each term in the expanded form, further note that for all $b \in \{0, 1\}$ and all $e \geq 1$, $b^e = b$. 
 \qed
 \end{proof}
@ -160,7 +160,7 @@ Then, in expectation we have
 &= \sum_{\vct{d} \in \eta}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numvar \prob_i\label{p1-s4}\\
 &= \rpoly(\prob_1,\ldots, \prob_\numvar)\label{p1-s5}
 \end{align}
-In steps \Cref{p1-s1} and \Cref{p1-s2}, by linearity of expectation (recall the variables are independent, or the monomial expectation is 0), the expectation can be pushed all the way inside of the product.  In \Cref{p1-s3}, note that $w_i \in \{0, 1\}$ which further implies that for any exponent $e \geq 1$, $w_i^e = w_i$.  Next, in \Cref{p1-s4} the expectation of a tuple is indeed its probability.
+In steps \cref{p1-s1} and \cref{p1-s2}, by linearity of expectation (recall the variables are independent, or the monomial expectation is 0), the expectation can be pushed all the way inside of the product.  In \cref{p1-s3}, note that $w_i \in \{0, 1\}$ which further implies that for any exponent $e \geq 1$, $w_i^e = w_i$.  Next, in \cref{p1-s4} the expectation of a tuple is indeed its probability.

 Finally, observe \Cref{p1-s5} by construction in \Cref{lem:pre-poly-rpoly}, that $\rpoly(\prob_1,\ldots, \prob_\numvar)$ is exactly the product of probabilities of each variable in each monomial across the entire sum.
 \qed
@ -169,6 +169,6 @@ Finally, observe \Cref{p1-s5} by construction in \Cref{lem:pre-poly-rpoly}, that

 \subsection{Proof for Corollary~\ref{cor:expct-sop}}
 \begin{proof}
-Note that \Cref{lem:exp-poly-rpoly} shows that $\expct\pbox{\poly} =$ $\rpoly(\prob_1,\ldots, \prob_\numvar)$.  Therefore, if $\poly$ is already in \abbrSMB form, one only needs to compute $\poly(\prob_1,\ldots, \prob_\numvar)$ ignoring exponent terms (note that such a polynomial is $\rpoly(\prob_1,\ldots, \prob_\numvar)$), which indeed has $O(\smbOf{|\poly|})$ computations.
+Note that \cref{lem:exp-poly-rpoly} shows that $\expct\pbox{\poly} =$ $\rpoly(\prob_1,\ldots, \prob_\numvar)$.  Therefore, if $\poly$ is already in \abbrSMB form, one only needs to compute $\poly(\prob_1,\ldots, \prob_\numvar)$ ignoring exponent terms (note that such a polynomial is $\rpoly(\prob_1,\ldots, \prob_\numvar)$), which indeed has $O(\smbOf{|\poly|})$ computations.
 \qed
 \end{proof}
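
The corollary above can be illustrated with a minimal Python sketch. It is not part of the paper's artifacts: the monomial encoding and the name `expected_multiplicity_sop` are assumptions made for illustration, and the sketch assumes a TIDB (all variables independent). For a BIDB, a monomial containing two variables from the same block would instead contribute 0.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: a monomial is (coefficient, {variable: exponent}),
# and a polynomial in sum-of-products form is a list of monomials.
Monomial = Tuple[int, Dict[str, int]]


def expected_multiplicity_sop(poly: List[Monomial], prob: Dict[str, float]) -> float:
    """Evaluate the reduced polynomial at the tuple probabilities.

    Assumes independent {0,1}-valued variables (a TIDB). Exponents are
    ignored because w^e = w for w in {0, 1} and e >= 1. One pass over the
    monomials, so the work is linear in the size of the SOP encoding.
    """
    total = 0.0
    for coeff, exponents in poly:
        term = float(coeff)
        for var, e in exponents.items():
            if e >= 1:
                term *= prob[var]  # exponent dropped on purpose
        total += term
    return total


# Illustrative polynomial X^2 Y + 3Z with made-up probabilities.
poly = [(1, {"X": 2, "Y": 1}), (3, {"Z": 1})]
prob = {"X": 0.5, "Y": 0.5, "Z": 0.1}
result = expected_multiplicity_sop(poly, prob)
```

Here the reduced polynomial is $XY + 3Z$, so the value is $0.5 \cdot 0.5 + 3 \cdot 0.1 = 0.55$ up to floating-point rounding.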
--- a/intro-new.tex
+++ b/intro-new.tex
@ -4,7 +4,7 @@
 \section{Introduction}
 \label{sec:intro}
 A \emph{probabilistic database} $\pdb = (\idb, \pd)$ is a set of deterministic databases $\idb = \{ \db_1, \ldots, \db_n\}$ called possible worlds, paired with a probability distribution $\pd$ over these worlds.
-A well-studied problem in probabilistic databases is to, given a query $\query$ and a probabilistic database $\pdb$, compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
+A well-studied problem in probabilistic databases is to take a query $\query$ and a probabilistic database $\pdb$, and compute the \emph{marginal probability} of a tuple $\tup$ (i.e., its probability of appearing in the result of query $\query$ over $\pdb$).
 This problem is \sharpphard for set semantics, even for \emph{tuple-independent probabilistic databases}~\cite{DBLP:series/synthesis/2011Suciu} (TIDBs), which are a subclass of probabilistic databases where tuples are independent events. The dichotomy of Dalvi and Suciu~\cite{10.1145/1265530.1265571} separates the hard cases from cases that are in \ptime for unions of conjunctive queries (UCQs).
 In this work we consider bag semantics, where each tuple is associated with a multiplicity $\db_i(\tup)$ in each possible world $\db_i$ and study the analogous problem of computing the expectation of the multiplicity of a query result tuple $\tup$ (denoted $\query(\db)(t)$):
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -115,7 +115,7 @@ In this work we consider bag semantics, where each tuple is associated with a mu

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Example}\label{ex:intro-tbls}
-  Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$ shown below returns starting points of shipping routes where shipment processing is on time.
+  Consider the bag-\ti relations shown in \Cref{fig:ex-shipping-simp}. We define a \ti under bag semantics analogously to the set case: each input tuple is associated with a probability of having a multiplicity of one (and otherwise multiplicity zero), and tuples are independent random events. Ignore column $\Phi$ for now. In this example, we have shipping routes that are certain (probability 1.0) and information about whether shipping at locations is on time (with a certain probability). Query $\query_1$, shown below, returns starting points of shipping routes where shipment processing is on time.

 $$Q_1(\text{City}) \dlImp OnTime(\text{City}), Route(\text{City}, \dlDontcare)$$

@ -126,15 +126,17 @@ Since the Chicago location has a 50\% probability of being on schedule (we assum
 \end{Example}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$ of positive Boolean expressions)
+A well-known result in probabilistic databases is that under set semantics, the marginal probability of a query result $\tup$ can be computed based on the tuple's lineage. The lineage of a tuple is a Boolean formula (an element of the semiring $\text{PosBool}[\vct{X}]$~\cite{DBLP:conf/pods/GreenKT07} of positive Boolean expressions)
 over random variables
-($\vct{X}=(X_1,\dots,X_n)$~\cite{DBLP:conf/pods/GreenKT07})
+($\vct{X}=(X_1,\dots,X_n)$)
 that encode the existence of input tuples. Each possible world $\db$ corresponds to an assignment $\{0,1\}^\numvar$ of the variables in $\vct{X}$ to either true (the tuple exists in this world) or false (the tuple does not exist in this world). Importantly, the following holds: if the lineage formula for $t$ evaluates to true under the assignment for a world $\db$, then $\tup \in \query(\db)$.
 Thus, the marginal probability of tuple $\tup$ is equal to the probability that its lineage evaluates to true (with respect to the obvious analog of probability distribution $\probDist$ defined over $\vct{X}$).

 For bag semantics, the lineage of a tuple is a polynomial over variables $\vct{X}=(X_1,\dots,X_n)$ with % \in \mathbb{N}^\numvar$ with
 coefficients in the set of natural numbers $\mathbb{N}$ (an element of semiring $\mathbb{N}[\vct{X}]$).
-Analogously to sets, evaluating the lineage for $t$  over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$ which we will denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.
+Analogously to sets, evaluating the lineage for $t$  over an assignment corresponding to a possible world yields the multiplicity of the result tuple $\tup$ in this world. Thus, instead of using \Cref{eq:intro-bag-expectation} to compute the expected result multiplicity of a tuple $\tup$, we can equivalently compute the expectation of the lineage polynomial of $\tup$, which for this example we denote as $\linsett{\query}{\pdb}{\tup}$ or $\Phi$ if the parameters are clear from the context\footnote{
+In later sections, where we focus on a single lineage polynomial, we will simply refer to $\linsett{\query}{\pdb}{\tup}$ as $Q$.
+}. In this work, we study the complexity of computing the expectation of such polynomials encoded as arithmetic circuits.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Example}\label{ex:intro-lineage}
@ -157,10 +159,10 @@ If the answer is in the affirmative, then probabilities over bag-PDBs can be com
 Unfortunately, we prove that this is not the case: computing the expected count of a query result tuple is super-linear under standard complexity assumptions (\sharpwonehard) in the size of a lineage circuit.

 Concretely, we make the following contributions:
-(i) We show that the expected result multiplicity problem for conjunctive queries for bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;
+(i) We show that the expected result multiplicity problem (\Cref{def:the-expected-multipl}) for conjunctive queries for bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;
 (ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g., when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding; %;\BG{Fix not linear in all cases, restate after 4 is done}
 (iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
-(iv) We further prove that for \raPlus queries (a equivalently expressive, but factorizable form of UCQs),  we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
+(iv) We further prove that for \raPlus queries (an equivalently expressive, but factorizable form of UCQs),  we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).

 %\mypar{Implications of our Results} As mentioned above

@ -188,8 +190,7 @@ The expectation $\expct\pbox{\Phi^2}$ then is:
 \expct\pbox{L_a^2}\expct\pbox{L_b} + \expct\pbox{L_b}\expct\pbox{L_d} + \expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_d} + 2\expct\pbox{L_a}\expct\pbox{L_b}\expct\pbox{L_c} + 2\expct\pbox{L_b}\expct\pbox{L_d}\expct\pbox{L_c}
 \end{equation*}
 \end{footnotesize}
-
-\noindent This property leads us to consider a structure related to $\poly$.
+\noindent This property leads us to consider a structure related to the lineage polynomial.
 \begin{Definition}\label{def:reduced-poly}
 For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the SOP form of $\poly(\vct{X})$ to $1$.
 \end{Definition}
@ -198,34 +199,16 @@ With $\Phi^2$ as an example, we have:
 \widetilde{\Phi^2}(L_a, L_b, L_c, L_d)
 =&\; L_aL_b + L_bL_d + L_bL_c + 2L_aL_bL_d + 2L_aL_bL_c + 2L_bL_cL_d
 \end{align*}
-It can be verified that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \rpoly(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} UCQs over TIDB/BIDB.
+It can be verified that the reduced polynomial is a closed form of the expected count (i.e., $\expct\pbox{\Phi^2} = \widetilde{\Phi^2}(\probOf\pbox{L_a=1}, \probOf\pbox{L_b=1}, \probOf\pbox{L_c=1}, \probOf\pbox{L_d=1})$). In fact, we show in \Cref{lem:exp-poly-rpoly} that this equivalence holds for {\em all} UCQs over TIDB/BIDB.

 %The reduced form of a lineage polynomial can be obtained but requires a linear scan over the clauses of an SOP encoding of the polynomial.  Note that for a compressed representation, this scheme would require an exponential number of computations in the size of the compressed representation.  In \Cref{sec:hard}, we use $\rpoly$ to prove our hardness results .

 To prove our hardness result we show that for the same $Q$ considered in the running example, the query $Q^k$ is able to encode various hard graph-counting problems.  We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For the upper bound it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant-factor approximation. To get a $(1\pm \epsilon)$-multiplicative approximation we sample monomials from $\Phi$ and `adjust' their contribution to $\widetilde{\Phi}\left(\cdot\right)$.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.
+\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl}). Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.


-% Our hardness results follow by considering a suitable generalization of the lineage polynomial in \Cref{eq:edge-query}. First it is easy to generalize the polynomial to $\poly_G(X_1,\dots,X_n)$ that represents the edge set of a graph $G$ in $n$ vertices. Then $\poly_G^k(X_1,\dots,X_n)$ (i.e., $\inparen{\poly_G(X_1,\dots,X_n)}^k$) encodes as its monomials all subgraphs of $G$ with at most $k$ edges in it. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ (see \Cref{def:reduced-poly}) can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$ and we observe that $c_{2k}$ is proportional to the number of $k$-matchings (which computing is \sharpwonehard) in $G$. Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values of $\prob_i$ for $0\le i\le  2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed  it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity on e.g. the complexity of counting the number of triangles in $G$ and not more standard parameterized hardness like \sharpwonehard.
-
-% The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=Q(1,\dots,1)$ and if all the coefficients of $\poly$ are constants, then $\poly(\prob,\dots, \prob)$ (which can be easily computed in linear time) is a $\prob^k$ approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ (i.e., the \emph{input} tuple probabilities) and $k=\degree(\poly)$ are constants, then this gives a constant factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_\numvar)$ and do an appropriate weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) be able to compute $\poly(1,\ldots, 1)$ in linear time and (ii) be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly as well.
-%For the ease of exposition, we start off with expression trees (see \Cref{fig:circuit-q2-intro} for an example) and show that they satisfy both of these properties. Later we show that it is easy to show that these properties also extend to polynomial circuits as well (we essentially show that in the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
-
-
-
-
-% and then relating the size of the compressed lineage to the cost of answering a deterministic query.
-
-% This suggests that perhaps even Bag-PDBs have higher query processing complexity than deterministic databases.
-% In this paper, we confirm this intuition, first proving that computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a compressed lineage representation, and then relating the size of the compressed lineage to the cost of answering a deterministic query.
-
-% In view of this hardness result (i.e., step 2 of the workflow is the bottleneck in the bag setting as well), we develop an approximation algorithm for expected counts of SPJU query Bag-PDB output, that is, to our knowledge, the first linear time (in the size of the factorized lineage) $(1-\epsilon)$-\emph{multiplicative} approximation, eliminating step 2 from being the bottleneck of the workflow.
-% By extension, this algorithm only has a constant factor slower runtime relative to deterministic query processing.\footnote{
-% 	Monte-carlo sampling~\cite{jampani2008mcdb} is also trivially a constant factor slower, but can only guarantee additive rather than our stronger multiplicative bounds.
-% }
-% This is an important result, because it implies that computing approximate expectations for bag output PDBs of SPJU queries can indeed be competitive with deterministic query evaluation over bag databases.



--- a/poly-form.tex
+++ b/poly-form.tex
@ -4,9 +4,9 @@
 \subsection{Reduced Polynomials and Equivalences}

 We now introduce some terminology % for polynomials
-and develop a reduced form for polynomials --- a closed form of the polynomial's expectation over probability distributions derived from a \bi or \ti.
+and develop a reduced form (a closed form of the polynomial's expectation) for polynomials over probability distributions derived from a \bi or \ti.
 %We will use $(X + Y)^2$ as a running example.
-Recall that a polynomial over $\vct{X}=(X_1,\dots,X_n)$ is formally defined as:
+Note that a polynomial over $\vct{X}=(X_1,\dots,X_n)$ is formally defined as:
 \begin{equation}
  \label{eq:sop-form}
 Q(X_1,\dots,X_n)=\sum_{\vct{i}=(i_1,\dots,i_n)\in \semN^n} c_{\vct{i}}\cdot \prod_{j=1}^n X_j^{i_j}.
--- a/prob-def.tex
+++ b/prob-def.tex
@ -26,7 +26,7 @@ We ignore the fields \vari{partial}, \vari{Lweight}, and \vari{Rweight} until \C


 \begin{Example}
-The circuit \circuit in \Cref{fig:circuit-express-tree} encodes the polynomial $XY + WZ$.  Note that such an encoding lends itself naturally to having all gates with an outdegree of $1$.  Note further that \circuit is indeed a tree with edges pointing towards the root.
+The circuit \circuit in \Cref{fig:circuit-express-tree} encodes the polynomial $XY + WZ$.  Note that circuit \circuit encodes a tree, with edges pointing towards the root.
 \end{Example}

 \begin{figure}[t]
--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@ -56,8 +56,8 @@ Let $\semNX$ denote the set of polynomials over variables $\vct{X}=(X_1,\dots,X_
 We model incomplete relations using Green et al.'s $\semNX$-databases~\cite{DBLP:conf/pods/GreenKT07}, discussed in detail in \Cref{subsec:supp-mat-krelations} and summarized here.
 In an $\semNX$-database, relations are defined as functions from tuples to elements of $\semNX$, typically called annotations.
 We write $R(t)$ to denote the polynomial annotating tuple $t$ in relation $R$.
-Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{|X|}$.
-The multiplicity of $t \in R$ in this possible world is obtained by evaluating the polynomial annotating it on $\vct{W}$ (i.e., $R(t)(\vct{W})$).
+Each possible world is defined by an assignment of $N$ binary values $\vct{W} \in \{0, 1\}^{\abs{\vct{X}}}$.
+The multiplicity of $t \in R$ in this possible world, denoted $R(t)(\vct{W})$, is obtained by evaluating the polynomial annotating $t$ on $\vct{W}$.
 $\semNX$-relations are closed under $\raPlus$ (\Cref{fig:nxDBSemantics}).


@ -101,9 +101,9 @@ We focus on this problem from now on, assume an implicit result tuple, and so dr

 \subsubsection{\tis and \bis}
 \label{subsec:tidbs-and-bidbs}
-In this paper, we focus on two popular forms of PDB: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
+In this paper, we focus on two popular forms of PDBs: Block-Independent (\bi) and Tuple-Independent (\ti) PDBs.
 %
-A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB  such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same blocks are disjoint events.
+A \bi $\pxdb = (\idb_{\semNX}, \pd)$ is an $\semNX$-PDB  such that (i) every tuple is annotated with either $0$ (i.e., the tuple does not exist) or a unique variable $X_i$ and (ii) that the tuples $\tup$ of $\pxdb$ for which $\pxdb(\tup) \neq 0$ can be partitioned into a set of blocks such that variables from separate blocks are independent of each other and variables from the same block are disjoint events.
 In other words, each random variable corresponds to the event of a single tuple's presence.
 %
 A \emph{\ti} is a \bi where each block contains exactly one tuple.
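
The possible-world semantics described above can be checked by brute force on a small instance. The following Python sketch is only an illustration under stated assumptions: the function names, the polynomial $(X + Y)^2$, and the probabilities are all hypothetical, and the input is assumed to be a \ti (every block has size one, so all $2^n$ assignments are valid worlds). It enumerates every world $\vct{W} \in \{0,1\}^n$, weights the polynomial's value by the world's probability, and compares the result with the reduced polynomial evaluated at the tuple probabilities.

```python
from itertools import product


def brute_force_expectation(poly_fn, probs):
    """Sum poly_fn(W) * Pr[W] over all worlds W in {0,1}^n (TIDB assumption)."""
    total = 0.0
    for world in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for w, p in zip(world, probs):
            # Independent tuples: present with probability p, absent with 1 - p.
            weight *= p if w == 1 else 1.0 - p
        total += poly_fn(*world) * weight
    return total


def phi(x, y):
    # Hypothetical lineage polynomial (X + Y)^2 = X^2 + 2XY + Y^2.
    return (x + y) ** 2


def reduced_phi(x, y):
    # Same polynomial with every exponent greater than 1 set to 1.
    return x + 2 * x * y + y


probs = (0.5, 0.25)
exact = brute_force_expectation(phi, probs)
closed_form = reduced_phi(*probs)
```

Both quantities evaluate to 1.0 for these probabilities. The enumeration takes time exponential in the number of variables, which is why it serves only as a sanity check on small inputs.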