Finished another Intro iteration based on 072721 discussion.

master
Aaron Huber 2021-08-05 11:34:01 -04:00
parent 02ab8a49ef
commit aa825f93a6
3 changed files with 24 additions and 25 deletions


@ -1,15 +1,14 @@
%root: main.tex
\section{Introduction (Rewrite - 070921)}
\input{two-step-model}
A probabilistic database (\abbrPDB) $\pdb$ is a probability distribution $\pd$ over a set of $\numvar$ tuples in a deterministic database $\db$. A tuple independent probabilistic database (\abbrTIDB) $\pdb$ further restricts $\pd$ by treating each tuple in $\db$ as an independent Bernoulli distributed random variable corresponding to the tuple's presence, where, in bag query semantics, $\query\inparen{\pdb}\inparen{\tup}$ is the random (polynomial) variable corresponding to the multiplicity of the output tuple $\tup$. Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$)\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the SPJU and renaming operators.}, the goal is to compute the expected multiplicity\footnote{In set semantic \abbrPDB\xplural, computing $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$ corresponds to computing the marginal probability.} ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$. There exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{X}}}$,\footnote{In this work we focus on one output tuple $\tup$, and hence refer to $\poly_\tup$ as $\poly$.} where $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$ and the expectation is taken over any Bernoulli distribution on $\{0, 1\}^\numvar$, using $\semN$-semiring semantics. The set of variables $X_i$ in $\vct{X}$ with nonzero assignments\footnote{While there are semirings, such as the security semiring, that use $0$ (denoted $\zerosymbol$) as a valid annotation over tuples of a finite database, for our purposes this distinction is not necessary.} and nonzero coefficients in $\poly\inparen{\vct{X}}$ represents the input tuples that contribute to the presence of the output. In the bag setting, as is the case for $\expct\pbox{\poly\inparen{\vct{X}}}$, evaluating the multiplicity $\poly\inparen{\vct{X}}$ follows $\semN$-semiring semantics.
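For a toy example of these definitions (the polynomial and probabilities here are purely illustrative), suppose the lineage polynomial of an output tuple $\tup$ over a \abbrTIDB with $\numvar = 3$ input tuples is $\poly\inparen{\vct{X}} = X_1X_2 + X_1X_3$, where $\probOf\pbox{X_i = 1} = \prob_i$. Since the $X_i$ are independent Bernoulli random variables, linearity of expectation gives
\begin{equation*}
\expct\pbox{\poly\inparen{\vct{X}}} = \expct\pbox{X_1X_2} + \expct\pbox{X_1X_3} = \prob_1\prob_2 + \prob_1\prob_3,
\end{equation*}
the expected multiplicity of $\tup$.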
To precisely compare and contrast runtime complexities across varying database models and \abbrPDB computations, a brief informal review of certain complexity classes is useful. Informally, a problem with input size $\numvar$ belongs to the \sharpp complexity class if it asks for the number of solutions to a decision problem in $\np$; that is, deciding whether a single solution exists is in $\np$, while a given instance may admit many solutions, all of which must be counted.
The class \sharpwone concerns parameterized problems, whose runtime is measured in terms of a parameter $k$ in addition to the input size $\numvar$. Informally, a problem is \sharpwonehard if (under standard complexity assumptions) its runtime is lower bounded by $\numvar^{f(k)}$ for some strictly increasing function $f$, i.e., the exponent of the polynomial necessarily grows with $k$.
The special case of deterministic query evaluation
%simply computing $\query$ over a deterministic database
is itself known to be \sharpwonehard in data complexity for general $\query$. A query such as counting $k$-cliques is \sharpwonehard since (under standard complexity assumptions) it cannot be answered in time faster than $\numvar^{f(k)}$ for some strictly increasing $f(k)$.
%hardness is seen in such queries as counting $k$-cliques and $k$-way joins, where the superlinear runtime is parameterized in $k$.
This result is unsatisfying when considering the complexity of evaluating $\query$ over \abbrPDB\xplural, since it does not account for computing $\expct\pbox{\poly\inparen{\vct{X}}}$, entirely ignoring the `P' in \abbrPDB.
%of intensional evaluation (computing $\expct\pbox{\poly\inparen{\vct{X}}}$).
@ -17,20 +16,15 @@ This result is unsatisfying when considering complexity of evaluating $\query$ o
A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly\inparen{\vct{X}}}$ separately from the complexity of deterministic query evaluation. Viewing \abbrPDB query evaluation as these two separate steps is essentially what is known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}. \Cref{fig:two-step} illustrates the intensional evaluation computation model.
%one way to do this.
%The model of computation in \cref{fig:two-step} views \abbrPDB query processing as two steps.
The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing $\query$ over a $\abbrPDB$, which is essentially the deterministic computation of both the query output and $\poly(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$, as we show in [placeholder].} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly(\vct{X})}$. Such a model of computation is naturally followed by intensional evaluation in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly\inparen{\vct{X}}$ must be computed separately to obtain an exact answer when $\query(\pdb)$ is hard, since extensional evaluation only approximates the output in such cases.
%(where e.g. intensional evaluation is itself a separate computational step; further, computing $\expct\pbox{\poly\inparen{\vct{X}}}$ in extensional evaluation occurs as a separate step of each operator in the query tree, and therefore implies that both concerns can be separated)
The paradigm of \cref{fig:two-step} is also neatly followed by semiring provenance, where $\semNX$-DB query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query output together with its $\semNX$-polynomial annotation, and the $\semNX$-polynomial is subsequently evaluated over a semantically appropriate semiring, e.g., $\semN$ to model bag semantics (a small illustration of this computation appears below). Further, in this work, the model lends itself nicely to separating the concerns of deterministic computation and probability computation. Observing this model prompts the informal problem statement: given $\query\inparen{\pdb}$, is it the case that \abbrStepTwo is always $\bigO{\abbrStepOne}$?
%always $\bigO{\abbrStepOne}$.
If so, then query evaluation over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation (up to a constant factor).
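As a concrete (hypothetical) illustration of this two-step computation, with relations and annotations chosen purely for exposition: let $R$ and $S$ be $\semNX$-annotated relations over schemas $\inparen{A}$ and $\inparen{A,B}$ with $R = \{\inparen{a} \mapsto X_1, \inparen{b} \mapsto X_2\}$ and $S = \{\inparen{a,c} \mapsto X_3, \inparen{b,c} \mapsto X_4\}$, and let $\query = \pi_{B}\inparen{R \bowtie S}$. In \abbrStepOne, the join multiplies annotations and the projection adds them, so the single output tuple $\inparen{c}$ carries the lineage polynomial
\begin{equation*}
\poly_{\inparen{c}}\inparen{\vct{X}} = X_1\cdot X_3 + X_2\cdot X_4,
\end{equation*}
and \abbrStepTwo then computes $\expct\pbox{X_1X_3 + X_2X_4} = \prob_1\prob_3 + \prob_2\prob_4$ over the tuple probabilities.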
The problem of computing $\query(\pdb)$ has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where the lineage polynomial follows $\semB$-semiring semantics.
%is a propositional formula.\footnote{For the case when $\query$ is in the class of $\raPlus$ and $\pdb$ is a \abbrTIDB, a bag \abbrPDB lineage polynomial is over a natural number semiring and a set \abbrPDB lineage polynomial is over the boolean semiring.}
The problem of computing $\query(\pdb)$ has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where the lineage polynomial follows $\semB$-semiring semantics, each output tuple appears at most once, and $\expct\pbox{\poly\inparen{\vct{X}}}$ is essentially the marginal probability of a tuple's presence in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the complexity of the query computation problem over set-\abbrPDB\xplural is \sharpphard in general, and proved that a dichotomy exists for this problem, where the runtime of $\query(\pdb)$ is either polynomial or \sharpphard for any polynomial-time \abbrStepOne. Since the hardness is in data complexity (the size of the input, $\Theta(\numvar)$), techniques such as parameterized complexity (bounding complexity by a parameter other than $\numvar$) and fine-grained analysis (which asks precisely what that parameter's contribution is, e.g., what $f(k)$ is for a \sharpwonehard problem) of \abbrStepTwo will not refine the hardness results beyond \sharpphard.
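To see concretely where the set-\abbrPDB setting differs (a toy example, not one of the dichotomy's hard cases): if an output tuple's $\semB$-lineage is $X_1 \vee X_2$ over two independent input tuples, its marginal probability is
\begin{equation*}
\probOf\pbox{X_1 \vee X_2} = \prob_1 + \prob_2 - \prob_1\prob_2,
\end{equation*}
so, unlike the bag setting, expectation does not distribute term-by-term over the lineage; intuitively, it is this interaction among terms in the lineage of hard queries that underlies the \sharpphard cases of the dichotomy.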
%Should we allow for approximation in this setting, this paper shows that we can \emph{guarantee} runtime of $\query(\pdb)$ to be linear in runtime of \abbrStepOne.
%step one.
%Is approximation necessary?
There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit. One such query is the count query, where one might desire, for example, to compute the expected multiplicity ($\expct\pbox{\poly\inparen{\vct{X}}}$) of the result.
\begin{figure}
\begin{align*}
@ -54,23 +48,27 @@ There exist some queries for which \emph{bag}-\abbrPDB\xplural are a more natura
\end{figure}
The semantics of $\query(\pdb)$ in bag-\abbrPDB\xplural allow for output tuples to appear \emph{more} than once, which is naturally captured by the $\semN$-semiring. Given a standard monomial basis (\abbrSMB)\footnote{\abbrSMB is akin to the sum of products expansion but with the added requirement that all monomials are unique.} representation of the lineage polynomial, the complexity of computing \abbrStepTwo is linear in the size of the lineage polynomial. This is true since standard addition operators allow for
%the addition and multiplication operators in \cref{fig:nxDBSemantics} are those of the $\semN$-semiring, and computing the expected count over such operators allows for
linearity of expectation, and since \abbrSMB has no factorization, the monomials with dependent multiplicative variables are known up front without any additional operations. Thus, the expected count can indeed be computed by the same order of operations as contained in $\poly$. In other words, given an \abbrSMB representation, $\expct\pbox{\poly\inparen{\vct{X}}}$ can always be computed in $\bigO{\numvar^k}$ time for $\numvar$ input tuples and a query of size $k$. Thus, for the special case of queries whose operators naturally produce an \abbrSMB representation, we have that \abbrStepTwo is $\bigO{\abbrStepOne}$.
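As a brief illustration (the polynomial is hypothetical): for the \abbrSMB polynomial $\poly\inparen{\vct{X}} = 2X_1^2X_2 + X_2X_3$ over a \abbrTIDB, each $X_i$ is Bernoulli and thus $\expct\pbox{X_i^e} = \prob_i$ for any $e \geq 1$, so linearity of expectation yields
\begin{equation*}
\expct\pbox{2X_1^2X_2 + X_2X_3} = 2\prob_1\prob_2 + \prob_2\prob_3
\end{equation*}
in a single pass over the monomials.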
%This result coupled with the prevalence that exists amongst most well-known \abbrPDB implementations to use an sum of products\footnote{Sum of products differs from \abbrSMB in allowing any arbitrary monomial $m_i$ to appear in the polynomial more than once, whereas, \abbrSMB requires all monomials $m_i,\ldots, m_j$ such that $m_i = \cdots = m_j$ to be combined into one monomial, such that each monomial appearing in \abbrSMB is unique. The complexity difference between the two representations is up to a constant factor.} representation,
%\currentWork{
% show us that to develop comparable bounds between \abbrStepOne and \abbrStepTwo for bag $\query\inparen{\pdb}$, one must examine complexity at finer levels....to not be bottlenecked in \abbrStepTwo, it must be the case that \abbrStepTwo is not \sharpwonehard regardless of the polynomial representation. may partially explain why the bag-\abbrPDB query problem has long been thought to be easy.
%}
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly(\vct{X})$ resulting from, for example, optimizations like projection push-down, which produce factorized representations of $\poly(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits as the representation system for $\poly(\vct{X})$, which are a natural fit for $\raPlus$ queries as each operator maps to either a $\circplus$ or $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07} (as shown in \cref{fig:nxDBSemantics}).
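To get a (hypothetical) sense of the gap between the two representations, consider the factorized polynomial
\begin{equation*}
\poly\inparen{\vct{X}} = \prod_{i=1}^{k}\inparen{X_{2i-1} + X_{2i}},
\end{equation*}
which a circuit computes with $\bigO{k}$ gates, while its \abbrSMB expansion contains $2^k$ monomials; circuits can therefore be exponentially more succinct than the expanded polynomials they represent.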
%Our work explores whether or not \abbrStepTwo in the computation model is \emph{always} in the same complexity class as deterministic query evaluation, when \abbrStepOne of $\query(\pdb)$ is easy. We examine the class of queries whose lineage computation in step one is lower bounded by the query runtime of step one.
Let us consider the output lineage polynomial $\poly(\vct{X})$ of $\query(\pdb)$, where $\pdb$ is a bag-\abbrTIDB. For $\tup_i$ in $\pdb$, denote its corresponding probability as $\prob_i$, that is, $\probOf\pbox{X_i} = \prob_i$ for all $i$ in $[\numvar]$. Consider the special case when $\pdb$ is a deterministic database with one possible world $\db$. In this case, $\prob_i = 1$ for all $i$ in $[\numvar]$, and it can be seen that the problem of computing the expected count is linear in the size of the arithmetic circuit, since we can essentially push expectation through multiplication of variables dependent on one another.\footnote{For example, in this special case, computing $\expct\pbox{(X_iX_j + X_\ell X_k)^2}$ does not require product expansion, since we have that $p_i^h x_i^h = p_i \cdot 1^{h-1}x_i^h$.} This means that \abbrStepTwo is $\bigO{\abbrStepOne}$ and we always have deterministic query runtime for $\query\inparen{\pdb}$ up to a constant factor in this special case. Is this the general case? This leads us to our problem statement:
\begin{Problem}\label{prob:intro-stmt}
Given a query $\query$ in $\raPlus$ and bag \abbrPDB $\pdb$, is it always the case that computing \abbrStepTwo is $\bigO{\abbrStepOne}$?%what is the complexity (in the size of the circuit representation) of computing step two ($\expct\pbox{\poly(\vct{X})}$) for each tuple $\tup$ in the output of $\query(\pdb)$?
\end{Problem}
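A first hint that the answer may differ from the deterministic special case above (an intuition only, not the formal argument): when $0 < \prob_i < 1$, expectation no longer pushes through powers of a repeated variable, since for a Bernoulli variable
\begin{equation*}
\expct\pbox{X_i^2} = \prob_i \neq \prob_i^2 = \inparen{\expct\pbox{X_i}}^2,
\end{equation*}
and hence a factorized circuit cannot, in general, be evaluated gate-by-gate over the probabilities to obtain the exact expected count.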
We show, for the class of \abbrTIDB\xplural with $0 < \prob_i < 1$, that the problem of computing \abbrStepTwo is superlinear in the size of the lineage polynomial representation under fine-grained complexity hardness assumptions.
Our work further introduces an approximation algorithm for \abbrStepTwo over bag-\abbrPDB queries $\query$, which runs in linear time.
As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\tup$, given by $\poly\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)
paradigm of set-\abbrPDB\xplural. To address the question of whether or not bag-\abbrPDB\xplural are easy, we focus on computing the expected count ($\expct\pbox{\poly\inparen{\vct{X}}}$) as a natural statistic to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
Our work focuses on the following setting for query computation. Inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting, however, is not limiting as a simple generalization exists, reducing a bag \abbrPDB to a set \abbrPDB with typically only an $\bigO{c}$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$.
@ -126,7 +124,7 @@ Let $\pdb$ be a \abbrTIDB over variables $\vct{X} = \{X_1,\ldots,X_\numvar\}$ wi
\end{equation*}
\end{Lemma}
To prove our hardness result we show that for the same $Q$ considered in the query above, the query $Q^k$ is able to encode various hard graph-counting problems.\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \cref{fig:two-step}.} We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various sub-graphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$). For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e., evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query^2$ from above, we can see that
\begin{equation*}
\poly^2\inparen{\probAllTup} = \inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{equation*}


@ -276,6 +276,7 @@
% COMPLEXITY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\bigO}[1]{O\inparen{#1}}
\newcommand{\np}{NP\xspace}
\newcommand{\sharpp}{\#P\xspace}
\newcommand{\sharpphard}{\#P-hard\xspace}
\newcommand{\sharpwone}{\#W[1]\xspace}


@ -39,7 +39,7 @@
%label below cylinder
\node[below=0.2 cm of cylinder]{{\LARGE$ \pdb$}};
%First arrow
\node[single arrow, right=0.25 of cylinder, draw=black, fill=black!65, text=white, minimum height=0.75cm, minimum width=0.25cm](arrow1) {\textbf{\abbrStepOne}};
\node[above=of arrow1](arrow1Label) {$\query$};
\usetikzlibrary{arrows.meta}%for the following arrow configurations
\draw[line width=0.5mm, dashed, arrows = -{Latex[length=3mm, open]}] (arrow1Label)->(arrow1);
@ -108,7 +108,7 @@
%label below rectangle
\node[below=0.2cm of rect]{{\LARGE $\query(\pdb)\inparen{\tup}\equiv \poly_\tup\inparen{\vct{X}}$}};
%Second arrow
\node[single arrow, right=0.25 of rect, draw=black, fill=black!65, text=white, minimum height=0.75cm, minimum width=0.25cm](arrow2) {\textbf{\abbrStepTwo}};
%Expectation computation; (output of step 2)
\node[rectangle, right=0.25 of arrow2, rounded corners, draw=black, fill=red!10, text=black, minimum height=4.5cm, minimum width=2cm](rrect) {
\tabcolsep=0.09cm