More intro polish

2020-12-13 17:45:44 -05:00 · d6825c38c6
parent 2a00de2a36
commit d6825c38c6
3 changed files with 164 additions and 120 deletions
--- a/aaron.bib
+++ b/aaron.bib
@ -266,3 +266,29 @@ numpages = {12}
  year      = {2016}
 }

+
+
+@inproceedings{DBLP:conf/pods/GreenKT07,
+  author    = {Todd J. Green and
+               Gregory Karvounarakis and
+               Val Tannen},
+  title     = {Provenance semirings},
+  booktitle = {{PODS}},
+  pages     = {31--40},
+  publisher = {{ACM}},
+  year      = {2007}
+}
+
+
+
+@article{DBLP:journals/sigmod/GuagliardoL17,
+  author    = {Paolo Guagliardo and
+               Leonid Libkin},
+  title     = {Correctness of {SQL} Queries on Databases with Nulls},
+  journal   = {{SIGMOD} Rec.},
+  volume    = {46},
+  number    = {3},
+  pages     = {5--16},
+  year      = {2017}
+}
+
--- a/intro.tex
+++ b/intro.tex
@ -15,20 +15,22 @@ Computing the probability of the tuple appearing in the result is thus analogous
 In the corresponding problem for Bag-PDBs~\cite{kennedy:2010:icde:pip,DBLP:conf/vldb/AgrawalBSHNSW06,feng:2019:sigmod:uncertainty}, the lineage is a polynomial over random variables that captures the multiplicity of the output tuple.
 Thus, the expectation of the multiplicity is the expectation of this formula.

-Lineage in Set-PDBs is typically encoded in conjunctive normal form.
+Lineage in Set-PDBs is typically encoded in disjunctive normal form.
 This representation is significantly larger than the query result sans lineage.
 However, even with alternative encodings~\cite{DBLP:journals/vldb/FinkHO13}, the limiting factor in computing marginal probabilities remains the probability computation itself and not the lineage formula.
-The corresponding representation in Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of terms, each of which is the product of a set of integer or variable atoms.
-Thanks to linearity of expectation, computing the expectation of tuple multiplicities is linear in the number of terms in the SOP polynomial.
+The corresponding representation in Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of clauses, each of which is the product of a set of integer or variable atoms.
+Thanks to linearity of expectation, computing the expectation of tuple multiplicities is linear in the number of clauses in the SOP polynomial.
 Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.  
 Such compressed representations like Factorized Databases~\cite{10.1145/3003665.3003667,DBLP:conf/tapp/Zavodny11} or Polynomial Circuits (cite), are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}.
 Thus, measuring the performance of a PDB algorithm in terms of the size of a \emph{compressed} lineage formula allows us to more closely relate the algorithm's performance to the complexity of query evaluation in a deterministic database.

 The initial picture is not good.
-In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed polynomial, meaning that even bag PDBs can not enjoy the same computational complexity as deterministic databases.
-This motivates our second goal, a linear time approximation algorithm that, as we prove, estimates the expected multiplicities for tuples in the result of an SPJU query with a complexity to within a constant factor of the equivalent deterministic query.
-
+In this paper, we prove that computing expected counts is \emph{not} linear in the size of a compressed --- specifically a factorized~\cite{10.1145/3003665.3003667} --- polynomial by reduction to counting the 3-matchings of a graph.
+Thus, even bag PDBs cannot enjoy the same computational complexity as deterministic databases.
+This motivates our second goal, an approximation algorithm for computing expected multiplicities in a bag database that runs in time linear in the size of a factorized lineage formula.
+\todo[noinline]{as we show...}The worst-case size of the factorized lineage formula for a query is on the same order as the worst-case complexity of deterministic query evaluation~\cite{DBLP:conf/pods/KhamisNR16,10.1145/3003665.3003667}, making it possible to estimate expected multiplicities for tuples in the result of an SPJU query with a complexity to within a constant factor of the equivalent deterministic query.

+\subsection{Sets vs Bags}

 %Consider an arbitrary output polynomial $\poly$.  Further, consider the same polynomial, with all exponents $e > 1$ set to $1$ and call the resulting polynomial $\rpoly$.

@ -37,7 +39,7 @@ This motivates our second goal, a linear time approximation algorithm that, as w
 \begin{figure}[ht]
 	\begin{subfigure}{0.15\textwidth}
 		\centering
-		\begin{tabular}{ c | c | c}
+		\begin{tabular}{ c | c c}
 			$\rel$ & A & $\Phi$\\
 			\hline
 			& a & $W_a$\\
@ -52,9 +54,9 @@ This motivates our second goal, a linear time approximation algorithm that, as w
 		\begin{tabular}{ c | c c c}
 			$E$ & A & B & $\Phi$\\
 			\hline
-			& a & b & 1\\
-			& b & c & 1\\
-			& c & a & 1\\
+			& a & b & $\top$\\
+			& b & c & $\top$\\
+			& c & a & $\top$\\
 		\end{tabular}
 		%\caption{Atom 3 of query $\poly$ in ~\cref{intro:ex}}
 		\label{subfig:ex-atom3}
@ -92,163 +94,178 @@ This motivates our second goal, a linear time approximation algorithm that, as w
 %\end{figure}

 \begin{Example}\label{ex:intro}
-Assume a set semantics setting.  Suppose we are given a Tuple Independent Database ($\ti$), which is a PDB whose tuples are independently present or not.  We are given the following boolean query $\poly() :- R(A), E(A, B), R(B)$.  The lineage of the output is computed by adding polynomials when a union operation is performed, and by multiplying polynomials for a join operation.  This yields the products of all input tuple lineages whose combination satsifies the join condition, summed together.  A $\ti$ example instance is given in~\cref{fig:intro-ex}.  The attribute column $\Phi$ contains its repsective random variable, where $P[W_i = 1]$ is its marginal probability.  While for completeness we should include random variables for Table E, since each tuple has a probability of $1$, we drop them for simplicity.  %Finally, see that the tuples in table E can be visualized as the graph in ~\cref{fig:intro-ex-graph}.
-Next we explain why this query is hard in set semantics % due to correlations in the lineage formula.  But
-and easy under bag semantics.% with a polynomial formula representing the multiple contributing tuples from the input set $\ti$, it is easy since we enjoy linearity of expectation.
+Consider the Tuple Independent ($\ti$) Set-PDB given in \cref{fig:intro-ex} with two input relations $R$ and $E$\footnote{Our work also handles Block Independent Disjoint Databases ($\bi$)~\cite{DBLP:conf/sigmod/BoulosDMMRS05,DBLP:series/synthesis/2011Suciu}, but we focus here on the $\ti$ model to keep exposition simple.}.
+Each input tuple is assigned an annotation (attribute $\Phi$), either an independent random boolean variable ($W_i$) or the constant $\top$.
+Each assignment of values to variables ($\{\;W_a,W_b,W_c\;\}\mapsto \{\;\top,\bot\;\}$) defines one \emph{possible world} containing exactly those tuples annotated with $\top$ or with a variable assigned to $\top$.
+The probability of this world is the joint probability of the corresponding assignments.  
+For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world where $R = \{\;\tuple{a}, \tuple{b}\;\}$.  
+The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p^2-p^3$.
 \end{Example}

-Our work also handles Block Independent Disjoint Databases ($\bi$), a PDB model in which tuples are arranged in blocks, where all blocks are independent from one another, but tuples within the same block are mutually exclusive.  For now, let us consider the $\ti$ model.  In the example we consider a fixed probability $\prob$ for all tuple variables such that $P[W_i = 1] = \prob$.  Let us also be explicit in mentioning that the input tables are \textit{sets}, i.e. $Dom(W_i) = \{0, 1\}$, and the difference when we speak of bag semantics, is that we consider the query to potentially have duplicates, or in other words we are thinking about query output (over set instances) in the bag context.
+Prior efforts to generalize incomplete databases to bags~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17} replace the boolean annotations with natural numbers.
+Analogously, we generalize the above model of Set-PDBs to bags by using natural-number-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and positive natural number constants.
+Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
+We contrast bag and set query evaluation with the following example:

-To contrast the bag/polynomial and set/lineage interpretations, we provide another example.
 \begin{Example}\label{ex:bag-vs-set}
-The output polynomial in ~\cref{ex:intro} has the following lineage formula (top) and polynomial (bottom).
+Continuing the prior example, we are given the following boolean (resp., count) query
+$$\poly() :- R(A), E(A, B), R(B)$$
+The lineage of the result in a Set-PDB (resp., Bag-PDB) is a boolean formula (resp., polynomial) over random variables in the input (i.e., $W_a$, $W_b$, $W_c$).
+Because the boolean query has only a nullary relation, we write $Q(\cdot)$ to denote the function mapping variable assignments to a concrete value:
 \begin{align*}
-&\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a\\
-&\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a
+\poly_{set}(W_a, W_b, W_c) &= W_aW_b \vee W_bW_c \vee W_cW_a\\
+\poly_{bag}(W_a, W_b, W_c) &= W_aW_b + W_bW_c + W_cW_a
 \end{align*}
-
-Notice that $\poly$ in the set/lineage setting above, $\poly: (\mathbb{B})^3 \mapsto \mathbb{B}$, while under bag/polynomial semantics we define $\poly: (\mathbb{N})^3 \mapsto \mathbb{N}$.
-
-Assume the following $\mathbb{B}/\mathbb{N}$ variable assignments: $W_a\mapsto T/1, W_b \mapsto T/1, W_c \mapsto F/0.$  Then the polynomials evaluate as
+It is left as an exercise for the reader to show that, given assignments to $W_a$, $W_b$, $W_c$, these expressions correspond to the existence (resp., count) of the single nullary result tuple.
+We show one possible world here, with the assignment $\{\;W_a\mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$ (and the corresponding bag assignment).
+The polynomials evaluate as:
 \begin{align*}
-&\poly(T, T, F) = TT \vee TF \vee FT = T\\
-&\poly(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
+&\poly_{set}(\top, \top, \bot) = \top\top \vee \top\bot \vee \bot\top = \top\\
+&\poly_{bag}(1, 1, 0) = 1 \cdot 1 + 1\cdot 0 + 0 \cdot 1 = 1
 \end{align*}
-In the set/lineage setting, we find that the boolean query is satisfied, while in the bags evaluation we  see how many combinations of the input satsify the query.
+The Set-PDB query is satisfied in this possible world, while the Bag-PDB query produces a nullary tuple with a multiplicity of 1.
+The marginal probability (resp., expected count) of this query is computed over all possible worlds:
+{\small
+\begin{align*}
+P[\poly_{set}] &= \sum_{w_i \in \{\top,\bot\}} \mu(\poly_{set}(w_a, w_b, w_c))P[W_a = w_a,W_b = w_b,W_c = w_c]\\
+\expct[\poly_{bag}] &= \sum_{w_i \in \{0,1\}} \poly_{bag}(w_a, w_b, w_c)\cdot P[W_a = w_a,W_b = w_b,W_c = w_c]
+\end{align*}
+}
 \end{Example}
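+As a sanity check, both sums can be evaluated by hand under the illustrative assumption $P[W_a] = P[W_b] = P[W_c] = p$ from \cref{ex:intro}.
+$\poly_{set}$ is satisfied exactly in the worlds where at least two of the three variables are $\top$, while $\poly_{bag}$ evaluates to $3$ in the world where all three variables are $1$ and to $1$ in each of the three worlds where exactly two are $1$:
+{\small
+\begin{align*}
+P[\poly_{set}] &= 3\cdot p^2(1-p) + p^3 = 3p^2 - 2p^3\\
+\expct[\poly_{bag}] &= 3\cdot p^3 + 3\cdot 1\cdot p^2(1-p) = 3p^2
+\end{align*}
+}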

-Note that computing the probability of the query of ~\cref{ex:intro} in set semantics is indeed \sharpphard, since it is a query that is non-hierarchical
-
-%, i.e., for $Vars(\poly)$ denoting the set of variables occuring across all atoms of $\poly$, a function $sg(x)$ whose output is the set of all atoms that contain variable $x$, we have that $sg(A) \cap sg(B) \neq \emptyset$ and $sg(A)\not\subseteq sg(B)$ and $sg(B)\not\subseteq sg(A)$,
-~\cite{10.1145/1265530.1265571}.  %Thus, computing $\expct\pbox{\poly(W_a, W_b, W_c)}$, i.e. the probability of the output with annotation $\poly(W_a, W_b, W_c)$, ($\prob(q)$ in Dalvi, Sucui) is hard in set semantics.
-To see why this computation is hard for query $\poly$ over set semantics, from the query input we compute an output lineage formula of $\poly(W_a, W_b, W_c) = W_aW_b \vee W_bW_c \vee W_cW_a$.  Note that the conjunctive clauses are not independent of one another and the computation of the probability is not linear in the size of $\poly(W_a, W_b, W_c)$:
-\begin{equation*}
-\expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
-\end{equation*}
-In general, such a computation can be exponential in the size of the database.
+Note that the query of \cref{ex:bag-vs-set} in set semantics is indeed \sharpphard, since it is non-hierarchical~\cite{10.1145/1265530.1265571}.
+To see why computing this probability is hard, observe that the clauses of the disjunctive normal form boolean lineage are neither independent nor disjoint, forcing~\cite{DBLP:journals/vldb/FinkHO13} the use of Shannon decomposition, which is at worst exponential in the size of the input.  
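+As a small illustration, assume again that $P[W_a] = P[W_b] = P[W_c] = p$ (the illustrative setting of \cref{ex:intro}).
+A single Shannon expansion on $W_a$ splits the lineage into two disjoint cases, each of which is simple enough to evaluate directly:
+{\small
+\begin{align*}
+P[W_aW_b \vee W_bW_c \vee W_cW_a] &= P[W_a]\cdot P[W_b \vee W_c] + P[\neg W_a]\cdot P[W_bW_c]\\
+&= p\,(2p - p^2) + (1-p)\,p^2 = 3p^2 - 2p^3
+\end{align*}
+}
+Here one expansion suffices; in general the residual formulas must be expanded recursively, which is the source of the exponential worst case.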
+% \begin{equation*}
+% \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
+% \end{equation*}
+% In general, such a computation can be exponential in the size of the database.
 %Using Shannon's Expansion,
 %\begin{align*}
 %&W_aW_b \vee W_bW_c \vee W_cW_a
 %= &W_a
 %\end{align*}
-
-However, in the bag setting, the polynomial is $\poly(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a$.  To be reiterate, the output lineage formula is produced from a query over a set $\ti$ input, where duplicates are allowed in the output.  The expectation computation over the output lineage is a computation of the expected multiplicity of an output tuple across possible worlds.  In ~\cref{ex:intro}, the expectation is simply
+Conversely, in Bag-PDBs, correlations between clauses of the SOP polynomial are not problematic thanks to linearity of expectation. 
+The expectation computation over the output lineage is simply the sum of the expectations of each clause.
+For \cref{ex:intro}, the expectation is
+{\small
 \begin{align*}
-&\expct\pbox{\poly(W_a, W_b, W_c)} = \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
-= &\expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
+\expct\pbox{\poly(W_a, W_b, W_c)} &= \expct\pbox{W_aW_b} + \expct\pbox{W_bW_c} + \expct\pbox{W_cW_a}\\
+\intertext{\normalsize 
+In this particular lineage polynomial, all variables in each product clause are independent, so we can push expectations through.
+}
+&= \expct\pbox{W_a}\expct\pbox{W_b} + \expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{W_c}\expct\pbox{W_a}
 \end{align*}
-
-Note that $\expct\pbox{W_i} = P[W_i = 1]$, and so
+}
+Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.  
+A further interesting coincidence arises in the example.
+Note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals: 
+{\small
 \begin{align*}
-&\expct\pbox{\poly(W_a, W_b, W_c)} = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
+\expct\pbox{\poly_{bag}} =&\; P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
 &+ P[W_c = 1]P[W_a = 1]\\
-= &\prob^2 + \prob^2 + \prob^2 = 3\prob^2.
+=&\; \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1]) 
 \end{align*}
-Computing such expectations is indeed linear in the size of the SOP as the number of operations in the computation is \textit{exactly} the number of multiplication and addition operations of the polynomial.  The above equalities hold due to linearity of expectation.  In this particular case all variables are independent, so we can push expectation into the products as well.  Note that the answer is the same as substituting $\prob$ in for each variable, i.e., for this example $\expct\pbox{\poly(W_a, W_b, W_c)} = \poly(\prob, \prob, \prob)$.  Is this equality always the case?%This however is coincidental and not true for the general case.
-
-Now, consider the query
-\begin{equation*}
-\poly^2() := \rel(A), E(A, B), \rel(B), \rel(C), E(C, D), \rel(D),
-\end{equation*}
-
-For an arbitrary lineage formula, which we can view as a polynomial, it is known that there may exist equivalent compressed representations of the polynomial.  One such compression is the factorized polynomial ~\cite{10.1145/3003665.3003667}, where the polynomial can be broken up into separate factors.  %Another form of the polynomial is the SOP, which is the expansion of the factorized polynomial by multiplying out all terms, and in general is exponentially larger (in the number of products) than the factorized version.
-
-
-A factorized polynomial of $\poly^2$ is
-\begin{equation*}
-\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).
-\end{equation*}
-This factorized expression can be easily modeled as an expression tree as depicted by ~\cref{fig:intro-q2-etree}.
+}

 \begin{figure}[h!]
-
 \begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=4.55cm}, level 2/.style={sibling distance=1.5cm}, level 3/.style={sibling distance=0.7cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
-	\node[tree_node](root){$\boldsymbol{\times}$}
-		child{node[tree_node]{$\boldsymbol{+}$}
-			child{node[tree_node]{$\boldsymbol{\times}$}
-				child{node[tree_node]{$W_a$}}
-				child{node[tree_node]{$W_b$}}
-				}
-			child{node[tree_node]{$\boldsymbol{\times}$}
-				child{node[tree_node]{$W_b$}}
-				child{node[tree_node]{$W_c$}}
-				}
-			child{node[tree_node]{$\boldsymbol{\times}$}
-				child{node[tree_node]{$W_c$}}
-				child{node[tree_node]{$W_a$}}
-				}
-			}
-		child{node[tree_node]{$\boldsymbol{+}$}
-			child{node[tree_node]{$\boldsymbol{\times}$}
-				child{node[tree_node]{$W_a$}}
-				child{node[tree_node]{$W_b$}}
-				}
-			child{node[tree_node]{$\boldsymbol{\times}$}
-				child{node[tree_node]{$W_b$}}
-				child{node[tree_node]{$W_c$}}
-				}
-			child{node[tree_node]{$\boldsymbol{\times}$}
-				child{node[tree_node]{$W_c$}}
-				child{node[tree_node]{$W_a$}}
-				}
-			};
+  \node[tree_node](root){$\boldsymbol{\times}$}
+    child{node[tree_node]{$\boldsymbol{+}$}
+      child{node[tree_node]{$\boldsymbol{\times}$}
+        child{node[tree_node]{$W_a$}}
+        child{node[tree_node]{$W_b$}}
+        }
+      child{node[tree_node]{$\boldsymbol{\times}$}
+        child{node[tree_node]{$W_b$}}
+        child{node[tree_node]{$W_c$}}
+        }
+      child{node[tree_node]{$\boldsymbol{\times}$}
+        child{node[tree_node]{$W_c$}}
+        child{node[tree_node]{$W_a$}}
+        }
+      }
+    child{node[tree_node]{$\boldsymbol{+}$}
+      child{node[tree_node]{$\boldsymbol{\times}$}
+        child{node[tree_node]{$W_a$}}
+        child{node[tree_node]{$W_b$}}
+        }
+      child{node[tree_node]{$\boldsymbol{\times}$}
+        child{node[tree_node]{$W_b$}}
+        child{node[tree_node]{$W_c$}}
+        }
+      child{node[tree_node]{$\boldsymbol{\times}$}
+        child{node[tree_node]{$W_c$}}
+        child{node[tree_node]{$W_a$}}
+        }
+      };
 \end{tikzpicture}

 \caption{Expression tree for query $\poly^2$.}
 \label{fig:intro-q2-etree}
 \end{figure}

- In contrast, the equivalent SOP representation is
+\subsection{Superlinearity of Bag PDBs}
+Moving forward, we focus exclusively on bags and drop the subscript from $\poly_{bag}$.  
+Consider the cartesian product of $\poly$ with itself:
+\begin{equation*}
+\poly^2() := \rel(A), E(A, B), \rel(B),\; \rel(C), E(C, D), \rel(D)
+\end{equation*}
+For an arbitrary lineage formula, which we can view as a polynomial, it is known that there may exist equivalent compressed representations of the polynomial.  
+One such compression is the factorized polynomial~\cite{10.1145/3003665.3003667}, where the polynomial can be broken up into separate factors.
+For example:
+\begin{equation*}
+\poly^2(W_a, W_b, W_c) = \left(W_aW_b + W_bW_c + W_cW_a\right) \cdot \left(W_aW_b + W_bW_c + W_cW_a\right).
+\end{equation*}
+This factorized expression can be easily modeled as an expression tree as in \cref{fig:intro-q2-etree}.
+In contrast, the equivalent SOP representation is
 \begin{equation*}
 W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c + 2W_aW_bW_c^2.
 \end{equation*}
-
-One can see that the factorized form more closely models the optimizations of deterministic query evaluation.
-
 The expectation then is
 \begin{align*}
 &\expct\pbox{\poly^2(W_a, W_b, W_c)}\\
 &= \expct\pbox{W_a^2}\expct\pbox{W_b^2} + \expct\pbox{W_b^2}\expct\pbox{W_c^2} + \expct\pbox{W_c^2}\expct\pbox{W_a^2} +\\
 &\qquad \expct\pbox{2W_a^2}\expct\pbox{W_b}\expct\pbox{W_c} + \expct\pbox{2W_a}\expct\pbox{W_b^2}\expct\pbox{W_c} +\\
 &\qquad  \expct\pbox{2W_a}\expct\pbox{W_b}\expct\pbox{W_c^2}\\
-= &\prob^2 + \prob^2 + \prob^2 + 2\prob^3 + 2\prob^3 + 2\prob^3\\
-= & 3\prob^2(1 + 2\prob) \neq \poly^2(\prob, \prob, \prob).
 \end{align*}
-
-In this case, even though we substitute probability values in for each variable, $\poly^2(\prob, \prob, \prob)$ is not the answer we seek since in our example $Dom(W_i) = \{0, 1\}$ and $\expct\pbox{W_i^2} = \sum\limits_{w \in Dom(W_i)}w^2 \cdot P[W_i = w] = \prob$.  This property leads us to consider another structure related to $\poly$.  \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
-\par Atri suggests a proof in the appendix regarding this claim.}
-
-Define $\rpoly^2(\vct{X})$ to be the resulting polynomial when all exponents $e > 1$ are set to $1$ in $\poly^2$. For example, when we have
-
+In $\poly$, $\expct\pbox{\poly} = \poly(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$.
+This same property does not hold for $\poly^2$ (i.e., $\expct\pbox{\poly^2} \neq \poly^2(P\pbox{W_a}, P\pbox{W_b}, P\pbox{W_c})$).
+Nevertheless, the structure of the query does admit some simplification.
+Observe that under the assumption that $Dom(W_i) = \{0, 1\}$, it is generally true that for any $k \geq 1$, $\expct\pbox{W_i^k} = \expct\pbox{W_i}$.
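+To see this, expand the expectation over the two-element domain; for any $k \geq 1$,
+\begin{equation*}
+\expct\pbox{W_i^k} = 0^k \cdot P[W_i = 0] + 1^k \cdot P[W_i = 1] = P[W_i = 1] = \expct\pbox{W_i}.
+\end{equation*}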
+This property leads us to consider another structure related to $\poly$.
+% \AH{I don't know if we want to include the following statement: \par \emph{ bags are only hard with self-joins }
+% \par Atri suggests a proof in the appendix regarding this claim.}
+For any polynomial $\poly(\vct{X})$, we define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in $\poly(\vct{X})$ to $1$.  
+With $\poly^2$ as an example, we have:
 \begin{align*}
-&\poly^2(W_a, W_b, W_c) = W_a^2W_b^2 + W_b^2W_c^2 + W_c^2W_a^2 + 2W_a^2W_bW_c + 2W_aW_b^2W_c\\
-&+ 2W_aW_bW_c^2,
-\end{align*}
-then
-\begin{align*}
-&\rpoly^2(W_a, W_b, W_c) = W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
+\rpoly^2(W_a, W_b, W_c) =&\; W_aW_b + W_bW_c + W_cW_a + 2W_aW_bW_c + 2W_aW_bW_c\\
 &+ 2W_aW_bW_c\\
-&= W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
+=&\; W_aW_b + W_bW_c + W_cW_a + 6W_aW_bW_c
 \end{align*}
-Note that this structure $\rpoly^2(\prob, \prob, \prob)$ is the expectation we computed, since it is always the case that $i^2 = i$ for all $i$ in $\{0, 1\}$.  And, $\poly^2()$ is still computable in linear time in the size of the output polynomial, compressed or SOP.
+The reduced polynomial is a closed form formula for the expected count (i.e., $\expct\pbox{\poly^2} = \rpoly^2(P\pbox{W_a=1}, P\pbox{W_b=1}, P\pbox{W_c=1})$).
+Note that our initial example polynomial $\poly$ is already in reduced form.
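+For instance, assuming a uniform probability $P\pbox{W_a=1} = P\pbox{W_b=1} = P\pbox{W_c=1} = p$ (an illustrative choice, not a requirement of the model), the reduced polynomial and the naive substitution into $\poly^2$ give different values:
+\begin{equation*}
+\expct\pbox{\poly^2} = \rpoly^2(p, p, p) = 3p^2 + 6p^3 \neq 9p^4 = \poly^2(p, p, p) \qquad \text{for } 0 < p < 1.
+\end{equation*}
+The two sides coincide only in the degenerate cases $p = 0$ and $p = 1$, since $9p^4 - 6p^3 - 3p^2 = 3p^2(3p+1)(p-1)$.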

-A compressed polynomial can be exponentially smaller in $k$ for $k$-products.  It is also always the case that computing the expectation of an output polynomial in SOP is always linear in the size of the polynomial, since expecation can be pushed through addition.
-
-This works seeks to explore the complexity landscape for compressed representations of polynomials.  Note that when we are linear in the size of the lineage formula, we essentially have runtime that is of deterministic query complexity.
-
-Up to this point the message seems consistent that bags are always easy in the size of the SOP representation, but
+Observe that the reduced form of a polynomial can be obtained in a linear scan over the clauses of a SOP encoding of the polynomial.
+In prior work on PDBs, where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
+In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the cartesian product of $k$ copies of $\poly$ has a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
+This leads us to the central question of this paper:
 \begin{Question}
-Is it always the case that bags are easy in the size of the \emph{compressed} polynomial?
+Is it always the case that bags are linear in the size of the \emph{compressed} polynomial?
 \end{Question}
-If bags \textit{are} always easy for any compressed version of the polynomial, then there is no need for improvement.  But, if proveably not, then the option to approximate the computation over a compressed polynomial in linear time is critical for making PDBs practical.
-
-Consider the query
-\begin{equation*}
-\poly^3() := \left(\rel(A), E(A, B), R(B)\right), \left(\rel(C), E(C, D), R(D)\right), \left(\rel(F), E(F, G), R(G)\right).
-\end{equation*}
-Upon inspection one can see that the factorized output polynomial consists of three product terms, while the SOP version consists of $3^3$ terms.  We show in this paper that, given a $\ti$ and any conjunctive query with input $\prob$ for all variables of $\poly^3$, this particular query is hard given a factorized polynomial as input.  We show this via a reduction to computing the number of $3$-matchings over an arbitrary graph.  The fact that bags are not easy in the general case when considering compressed polynomials necessitates an approximation algorithm that computes the expected multiplicity of the output in linear time when the output polynomial is in factorized form.  We introduce such an approximation algorithm with confidence guarantees to compute $\rpoly(\vct{X})$ in linear time.  Our apporximation algorithm generalizes to the $\bi$ model as well.  This shows that for all RA+ queries, the processing time in approximation is essentially the same deterministic processing.
+If bags \textit{are} always linear for any compressed version of the polynomial, then there is no need for improvement.  
+But, if provably not, then an approximation algorithm for the expected count is mandatory for creating a PDB with practical performance.

+% Consider the :
+% \begin{equation*}
+% \poly^3() := \left(\rel(A), E(A, B), R(B)\right), \left(\rel(C), E(C, D), R(D)\right), \left(\rel(F), E(F, G), R(G)\right).
+% \end{equation*}
+% The factorized output polynomial consists of a product of three identical three-way summations, while the SOP encoding is exponential --- $3^3$ clauses to be precise.

+Concretely, in this paper: 
+(i) We show that conjunctive queries over a bag-$\ti$ are hard (i.e., superlinear in the size of a compressed lineage encoding) by reduction to counting the number of $3$-matchings over an arbitrary graph; 
+(ii) We present an $\epsilon - \delta$ approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;
+(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
+(iv) We further generalize our results to higher moments and polynomial circuits, and prove that, for RA+ queries, the processing time of the approximation is within a constant factor of that of the same query processed deterministically.


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
--- a/macros.tex
+++ b/macros.tex
@ -34,6 +34,7 @@
 \newcommand{\aug}[1]{AUG^{\graph{#1}}}
 \newcommand{\mtrix}[1]{M_{#1}}
 \newcommand{\dtrm}[1]{Det\left(#1\right)}
+\newcommand{\tuple}[1]{\left<#1\right>}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %Approx Alg