Finished iteration of @atri 081021 notes.

This commit is contained in:
Aaron Huber 2021-08-18 12:47:58 -04:00
parent 7b6d6dc37e
commit aac539c9d2
2 changed files with 67 additions and 51 deletions

View file

@ -1,30 +1,47 @@
%root: main.tex
\section{Introduction (Rewrite - 070921)}
\input{two-step-model}
A tuple independent probabilistic database\footnote{In \cref{sec:background} and beyond, we generalize the data model.} (\abbrTIDB) $\pdb$ is a pair $\inparen{\db, \pd}$ where $\db$ is a set of $\numvar$ tuples and $\pd$ is the probability distribution over $\db$ induced by the requirement that each tuple be treated as an independent Bernoulli distributed random variable. Under bag query semantics, the random variable $\query\inparen{\pdb}\inparen{\tup}$ computes the multiplicity of its corresponding tuple $\tup$. In addition to the traditional deterministic query evaluation requirements (for a given query class), the query evaluation problem in bag-\abbrPDB semantics further requires solving the following problem:
\begin{Problem}\label{prob:bag-pdb-query-eval}
Given a query $\query$ from the set of positive relational algebra queries ($\raPlus$),\footnote{The class of $\raPlus$ queries consists of all queries that can be composed of the positive (monotonic) relational algebra operators: selection, projection, join, and union (SPJU).} compute the expected multiplicity ($\expct\pbox{\query\inparen{\pdb}\inparen{\tup}}$) of output tuple $\tup$.
\end{Problem}
There exists a polynomial $\poly_\tup\inparen{\vct{X}}$ such that $\expct\pbox{\query\inparen{\pdb}\inparen{\tup}} = \expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, where $\vct{X} = \inparen{X_1,\ldots, X_\numvar}$ is the set of variables annotating the tuples in $\pdb$ and $\vct{\randWorld}$ is the corresponding set of random variables drawn from $\pd$. The expectation is taken over $\pd$, a distribution over $\{0, 1\}^\numvar$ with independent Bernoulli marginals; the evaluation semantics of $\poly_\tup$ follow the standard interpretation of the addition and multiplication operators over the natural numbers, i.e., $\semN$-semiring semantics. While the above assumes set-\abbrTIDB inputs, this is not limiting: one can reduce a bag-\abbrTIDB input to a set-\abbrTIDB by assigning unique keys across all $\tup$ in $\pdb$. Such a generalization incurs only an $\bigO{c}$ factor increase in size, for $c = \max_{\tup \in \db}\db\inparen{\tup}$, where $\db\inparen{\tup}$ denotes $\tup$'s multiplicity.
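As a brief sketch of this reduction (our illustration, meant only to convey the idea; the naming $X_{\tup,j}$ is ours): a tuple $\tup$ with multiplicity $\db\inparen{\tup} = c$ is replaced by $c$ copies distinguished by fresh keys, each annotated with its own independent Bernoulli variable, so that the annotation of $\tup$ becomes the sum of its copies' annotations:
\begin{equation*}
X_\tup \;\longmapsto\; X_{\tup,1} + \cdots + X_{\tup,c}.
\end{equation*}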
%BEGIN MOVE TO S. 2
%Our work focuses on the following setting for query computation. Inputs of $\query$ are set-\abbrPDB\xplural, while the output of $\query$ is a bag-\abbrPDB. This setting, however, is not limiting as a simple generalization exists, reducing a bag \abbrPDB to a set \abbrPDB with typically only an $O(c)$ increase in size, for $c = \max_{\tup \in \db}\db\inparen{t}$.
%END MOVE
The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{Intuitively, a problem is \sharpwonehard if, under standard parameterized complexity assumptions, any algorithm solving it requires time $\Omega\inparen{\numvar^{f(k)}}$ for some function $f$ of a parameter $k$.} in data complexity for general $\query$. For example, counting $k$-cliques (where the parameter $k$ is the size of the clique) is \sharpwonehard: under standard complexity assumptions it cannot be solved in time faster than $\numvar^{f(k)}$ for some strictly increasing $f(k)$.
This result is unsatisfying when considering the complexity of evaluating $\query$ over \abbrPDB\xplural, since it does not account for computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, entirely ignoring the `P' in \abbrPDB.
%BEGIN Needs to be said
%Since the hardness is in data complexity (the size of the input, $\Theta(\numvar$)), techniques such as parameterized complexity (bounding complexity by another parameter other than $\numvar$) and fine grained analysis (complexity analysis that asks what precisely is the value of this other parameter, for example, what is the value of $f(k)$ given a \sharpwone algorithm) of \abbrStepTwo will not refine the hardness results from \sharpphard.
%END NEeds to be said
There exist queries for which \emph{bag}-\abbrPDB\xplural are a more natural fit than set-\abbrPDB\xplural. One such query is the count query, where one might desire, for example, to compute the expected multiplicity $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ of the result. This work focuses on computing this expectation as a natural statistic with which to develop the theoretical foundations of bag-\abbrPDB complexity. Other statistical measures are beyond the scope of this paper, though we consider higher moments in the appendix.
%BEGIN Needs to be noted.
%As noted, bag-\abbrPDB query output is a probability distribution over the possible multiplicities of $\poly_\tup\inparen{\vct{X}}$, a stark contrast to the marginal probability %($\expct\pbox{\poly\inparen{\vct{X}}}$)
% paradigm of set-\abbrPDB\xplural. To address the question of whether or not bag-\abbrPDB\xplural are easy,
%END Needs to be noted.
A natural question is whether or not we can quantify the complexity of computing $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$ separately from the complexity of deterministic query evaluation. Viewing \abbrPDB query evaluation as these two separate steps is also known as intensional evaluation \cite{DBLP:series/synthesis/2011Suciu}, illustrated in \cref{fig:two-step}.
The first step, which we will refer to as \termStepOne (\abbrStepOne), consists of computing both $\query\inparen{\db}$ and $\poly_\tup(\vct{X})$.\footnote{Assuming standard $\raPlus$ query processing algorithms, computing the lineage polynomial of $\tup$ is upper bounded by the runtime of deterministic query evaluation of $\tup$, as we show in \cref{sec:circuit-runtime}.} The second step is \termStepTwo (\abbrStepTwo), which consists of computing $\expct\pbox{\poly_\tup(\vct{\randWorld})}$. This model of computation is naturally followed in set-\abbrPDB semantics \cite{DBLP:series/synthesis/2011Suciu}, where $\poly_\tup\inparen{\vct{X}}$ must be computed separately from deterministic query evaluation to obtain an exact result when $\query(\pdb)$ is hard, since evaluating probabilities inline with the query operators (extensional evaluation) only approximates the true probability in that case. The paradigm of \cref{fig:two-step} is also neatly followed by semiring provenance, where $\semNX$-DB\footnote{An $\semNX$-DB is a database whose tuples are annotated with standard polynomials, i.e. elements from $\semNX$.} query processing \cite{DBLP:conf/pods/GreenKT07} first computes the query and its polynomial, and the $\semNX$-polynomial is subsequently evaluated over a semantically appropriate semiring, e.g. $\semN$ to model bag semantics. Further, the intensional model lends itself nicely to separating the concerns of deterministic computation and probability computation in this work.
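As a toy illustration of the two steps (the lineage polynomial below is hypothetical, chosen purely for exposition), \abbrStepOne might output the deterministic result for $\tup$ together with, say, $\poly_\tup\inparen{\vct{X}} = X_1X_2 + X_1X_3$, after which \abbrStepTwo computes
\begin{equation*}
\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}} = \expct\pbox{\randWorld_1\randWorld_2} + \expct\pbox{\randWorld_1\randWorld_3} = \prob_1\prob_2 + \prob_1\prob_3
\end{equation*}
by linearity of expectation and \abbrTIDB independence.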
Let $\timeOf{\abbrStepOne}$ denote the runtime of \abbrStepOne and similarly for $\timeOf{\abbrStepTwo}$.
In bag-\abbrPDB\xplural, by linearity of expectation and the independence of \abbrTIDB variables, $\timeOf{\abbrStepTwo}$ is $\bigO{\abs{\poly_\tup}}$\footnote{$\abs{\poly_\tup}$ denotes the size of $\poly_\tup$, i.e., the number of arithmetic operations.} when $\poly_\tup$ is in \abbrSMB.\footnote{\abbrSMB is akin to the sum of products expansion but with the added requirement that all monomials are unique.} Given a bag-\abbrPDB query $\query$ and a \abbrTIDB $\pdb$ with $\numvar$ tuples, let us go a step further and assume\footnote{This assumption is not necessary for the following results, but it more clearly illustrates the point.} that computing $\poly_\tup$ is lower bounded by the runtime of deterministic query evaluation of $\query$. Such an assumption holds, for example, when $\query$ is a composition of operators (e.g. $\project\inparen{\join}$) whose evaluation is precisely mirrored in the \abbrSMB representation of $\poly_\tup$. When $\poly_\tup$ is in \abbrSMB, it then follows that $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$. Let $\prob_i$ denote the probability of tuple $\tup_i$ ($\probOf\pbox{X_i = 1}$) for $i \in [\numvar]$. Consider another special case, where $\prob_i = 1$ for all $i \in [\numvar]$. For an output tuple $\tup'$ of $\query\inparen{\pdb}$, computing $\expct\pbox{\poly_{\tup'}\inparen{\vct{\randWorld}}}$ is linear in $\abs{\poly_{\tup'}}$, since we can push the expectation through products even of mutually dependent variables.\footnote{For example, in this special case computing $\expct\pbox{(\randWorld_i\randWorld_j + \randWorld_\ell \randWorld_k)^2}$ does not require product expansion, since $\expct\pbox{\randWorld_i^h} = \prob_i = 1$ for any $h \geq 1$.} This is another special case in which $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$ and we again achieve deterministic query runtime for $\query\inparen{\pdb}$ (up to a constant factor).
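The \abbrSMB claim above amounts to the following identity (a routine consequence of linearity of expectation and \abbrTIDB independence; the monomial notation is ours):
\begin{equation*}
\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}} \;=\; \sum_{\text{monomial } m \text{ of } \poly_\tup} c_m \prod_{X_i \in m} \prob_i,
\end{equation*}
where the product ranges over the distinct variables of $m$ (exponents may be ignored since $\expct\pbox{\randWorld_i^{e}} = \prob_i$ for any $e \geq 1$); this sum can be evaluated in a single pass over the \abbrSMB representation.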
% When solving \cref{prob:bag-pdb-query-eval}, $\timeOf{\abbrStepTwo}$ lies somewhere between $\bigO{\timeOf{\abbrStepOne}}$ and $\bigO{\timeOf{\abbrStepOne}^k}$, since when $\poly_\tup$ is in \abbrSMB\footnote{\abbrSMB is akin to the sum of products expansion but with the added requirement that all monomials are unique.} computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$ is linear (due to linearity of expectation and the independence assumption of \abbrTIDB), while the case of a factorized $\poly_\tup$ has a worst case (since in general product terms must be expanded) of $\timeOf{\abbrStepTwo}$ being $\timeOf{\abbrStepOne}^k$ for a $k$-wise factorization.
These observations introduce our next problem statement:
\begin{Problem}\label{prob:big-o-step-one}
Given a \abbrPDB $\pdb$ and an $\raPlus$ query $\query$, is it \emph{always} the case that $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$?
\end{Problem}
If the answer to \cref{prob:big-o-step-one} is yes, then the query evaluation problem over bag \abbrPDB\xplural is of the same complexity as deterministic query evaluation.
\Cref{prob:big-o-step-one} has been extensively studied in the context of \emph{set}-\abbrPDB\xplural, where each output tuple appears at most once. Here, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula\footnote{To be precise, $\poly_\tup\inparen{\vct{X}}$ is a propositional formula composed of boolean variables and the logical disjunction and conjunction connectives. Evaluating such a formula follows the standard semantics of these operators on boolean variables.} whose evaluation follows the standard semantics ($\semB$-semiring semantics), denoting the presence or absence of $\tup$. Computing $\expct\pbox{\poly_\tup\inparen{\vct{X}}}$ determines the marginal probability of $\tup$ appearing in the output. Dalvi and Suciu \cite{10.1145/1265530.1265571} showed that the query computation problem over set-\abbrPDB\xplural is \sharpphard\footnote{\sharpp is the class of counting problems associated with decision problems in NP.} in general, and proved that a dichotomy exists: for any polynomial-time \abbrStepOne, computing $\query(\pdb)$ is either polynomial time or \sharpphard. Since the hardness is in data complexity (the size of the input, $\Theta(\numvar)$), techniques such as parameterized complexity (bounding complexity by a parameter other than $\numvar$) and fine-grained analysis (which asks precisely what the value of this other parameter is, e.g. what $f(k)$ is achievable for a \sharpwonehard problem) applied to \abbrStepTwo cannot refine the \sharpphard result.
The main insight of the paper is that we should not stop here. One can have compact representations of $\poly_\tup(\vct{X})$ resulting from, for example, optimizations like projection push-down, which produce factorized representations of $\poly_\tup(\vct{X})$. To capture such factorizations, this work uses (arithmetic) circuits\footnote{An arithmetic circuit has variable and/or numeric inputs, and internal gates each labeled with either an addition or a multiplication operator.} as the representation system for $\poly_\tup(\vct{X})$; they are a natural fit for $\raPlus$ queries since each operator maps to either a $\circplus$ or a $\circmult$ operation \cite{DBLP:conf/pods/GreenKT07}. The standard query evaluation semantics depicted in \cref{fig:nxDBSemantics} nicely illustrate this.
@ -50,19 +67,11 @@ as the representation system of $\poly_\tup(\vct{X})$, which are a natural fit t
\label{fig:nxDBSemantics}
\end{figure}
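For instance (reading off the Chicago output tuple in \cref{fig:two-step}), projection push-down can yield the factorized form on the left rather than its \abbrSMB expansion on the right:
\begin{equation*}
B \circmult \inparen{Y \circplus Z} \qquad\text{versus}\qquad BY + BZ;
\end{equation*}
for larger queries the factorized circuit can be substantially smaller than its \abbrSMB expansion.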
Above we have seen that, given a circuit \circuit, if \circuit is in \abbrSMB, then $\timeOf{\abbrStepTwo}$ is indeed $\bigO{\timeOf{\abbrStepOne}}$. Suppose, on the contrary, that \circuit is not in \abbrSMB but rather in some factorized form. Then to naively compute \abbrStepTwo, one needs to convert \circuit into a circuit \circuit' in \abbrSMB and then compute $\expct\pbox{\poly_\tup\inparen{\vct{\randWorld}}}$, which takes $\bigO{\timeOf{\abbrStepOne}^k}$ time for a general $k$-wise factorization. Since $\timeOf{\abbrStepTwo}$ lies between $\bigO{\timeOf{\abbrStepOne}}$ and $\bigO{\timeOf{\abbrStepOne}^k}$, it behooves us to determine which of these extremes holds for general \circuit. This leads us to the main problem statement of our paper:
\begin{Problem}\label{prob:intro-stmt}
Given a \abbrPDB $\pdb$ and an $\raPlus$ query $\query$, is it always the case that $\timeOf{\abbrStepTwo}$ is $\bigO{\timeOf{\abbrStepOne}}$?
\end{Problem}
We show that, for the class of \abbrTIDB\xplural with $0 < \prob_i < 1$, computing \abbrStepTwo is superlinear in the size of the lineage polynomial representation under fine-grained complexity hardness assumptions.
Our work further introduces an approximation algorithm for \abbrStepTwo over bag-\abbrPDB queries $\query$ that runs in time linear in the size of the lineage representation.
%%%%%%%%%%%%%%%%%%%%%%%%%
%Contributions, Overview, Paper Organization
%%%%%%%%%%%%%%%%%%%%%%%%%
@ -71,7 +80,7 @@ Concretely, we make the following contributions:
% \sharpwonehard in the size of the lineage circuit
by reduction from counting the number of $k$-matchings over an arbitrary graph; we further show superlinear hardness for a specific %cubic
graph query for the special case of all $\prob_i = \prob$ for some $\prob$ in $(0, 1)$;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-\abbrTIDB\xplural and $\raPlus$ queries for which the answer to \cref{prob:intro-stmt} is again yes; we further show that for typical database usage patterns (e.g. when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their Functional Aggregate Query (FAQ) followups~\cite{DBLP:conf/pods/KhamisNR16}), the approximation algorithm has runtime linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are at most quadratic\footnote{Note that this does not rule out queries for which approximation is linear.}); (iii) We generalize the approximation algorithm to a class of bag-Block Independent Disjoint Databases (\abbrBIDB\xplural, see \cref{subsec:tidbs-and-bidbs}), a more general model of probabilistic data; (iv) We further prove that for \raPlus queries
\AH{This point \emph{\Large seems} weird to me. I thought we just said that the approximation complexity is linear in step one, but now it's as if we're saying that it's $\log{\text{step one}} + $ the runtime of step one. Where am I missing it?}
we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
@ -79,37 +88,37 @@ we can approximate the expected output tuple multiplicities with only $O(\log{Z}
Consider the query $\query(\pdb) \coloneqq \project_\emptyset(OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime))$
%$Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$
over the bag relations of \cref{fig:two-step}. It can be verified that $\poly_\tup\inparen{A, B, C, D, X, Y, Z}$ for $Q$ is $AXB + BYD + BZC$. Now consider the product query $\query^2(\pdb) = \query(\pdb) \times \query(\pdb)$.
The lineage polynomial for $Q^2$ is given by $\poly^2$:
\begin{multline*}
\inparen{AXB + BYD + BZC}^2\\
=A^2X^2B^2 + B^2Y^2D^2 + B^2Z^2C^2 + 2AXB^2YD + 2AXB^2ZC + 2B^2YDZC.
\end{multline*}
By linearity of expectation over the summand terms, and by pushing the expectation through products of independent \abbrTIDB variables, the expectation $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}}$ becomes:\footnote{The random variable corresponding to a formal variable $A$ is denoted $\randWorld_A$, with its probability drawn from $\pd$.}
\begin{footnotesize}
\begin{multline*}
\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_X^2}\expct\pbox{\randWorld_B^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y^2}\expct\pbox{\randWorld_D^2} + \expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z^2}\expct\pbox{\randWorld_C^2} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D}\\
+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B^2}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}.
\end{multline*}
\end{footnotesize}
\noindent If the domain of a random variable $\randWorld$ is $\{0, 1\}$, then for any $k > 0$, $\expct\pbox{\randWorld^k} = \expct\pbox{\randWorld}$, which means that $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}}$ simplifies to:
\begin{footnotesize}
\begin{multline*}
\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D} + \expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D} \\
+ 2\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C} + 2\expct\pbox{\randWorld_B}\expct\pbox{\randWorld_Y}\expct\pbox{\randWorld_D}\expct\pbox{\randWorld_Z}\expct\pbox{\randWorld_C}
\end{multline*}
\end{footnotesize}
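The simplification of the exponents above is immediate from the Bernoulli domain: for $\randWorld \in \{0, 1\}$ and any $k > 0$,
\begin{equation*}
\expct\pbox{\randWorld^k} = 1^k\cdot\probOf\pbox{\randWorld = 1} + 0^k\cdot\probOf\pbox{\randWorld = 0} = \probOf\pbox{\randWorld = 1} = \expct\pbox{\randWorld}.
\end{equation*}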
\noindent This property leads us to consider a structure related to the lineage polynomial.
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly(\vct{X})$, define the \emph{reduced polynomial} $\rpoly(\vct{X})$ to be the polynomial obtained by setting all exponents $e > 1$ in the \abbrSMB form of $\poly(\vct{X})$ to $1$.
\end{Definition}
With $\Phi^2\inparen{\vct{X}}$ as an example, we have:
\begin{align*}
&\widetilde{\Phi^2}(A, B, C, D, X, Y, Z)\\
&\; = AXB + BYD + BZC + 2AXBYD + 2AXBZC + 2BYDZC
\end{align*}
It can be verified that the reduced polynomial parameterized with each variable's respective marginal probability is a closed form of the expected count (i.e., $\expct\limits_{\vct{\randWorld}\sim\pd}\pbox{\Phi^2\inparen{\vct{X}}} = \widetilde{\Phi^2}(\probOf\pbox{A=1},$ $\probOf\pbox{B=1}, \probOf\pbox{C=1}, \probOf\pbox{D=1}, \probOf\pbox{X=1}, \probOf\pbox{Y=1}, \probOf\pbox{Z=1})$). In fact, the following lemma shows that this equivalence holds for {\em all} $\raPlus$ queries over \abbrTIDB\xplural (proof in \cref{subsec:proof-exp-poly-rpoly}).
\begin{Lemma}
Let $\pdb$ be a \abbrTIDB over variables $\vct{X} = \{X_1,\ldots,X_\numvar\}$, where the probability distribution $\pd$ over the set $\vct{W}$ of all $2^\numvar$ possible worlds $\vct{w}$ is induced by the vector $\probAllTup = \inparen{\prob_1,\ldots,\prob_\numvar}$ of individual tuple probabilities. For any \abbrTIDB-lineage polynomial $\poly\inparen{\vct{X}}$ based on $\query\inparen{\pdb}$ the following holds:
@ -118,12 +127,19 @@ Let $\pdb$ be a \abbrTIDB over variables $\vct{X} = \{X_1,\ldots,X_\numvar\}$ wi
\end{equation*}
\end{Lemma}
%For example, if we know that $\prob_0 = \max_{i \in [\numvar]}\prob_i$, then $\poly(\prob_0,\ldots, \prob_0)$ is an upper bound constant factor approximation. Consider the first output tuple of \cref{fig:two-step}. Here, we set $\prob_0 = 1$, and the approximation $\poly\inparen{\vct{1}} = 1 \cdot 1 = 1$. The opposite holds true for determining a constant factor lower bound.
To prove our hardness result we show that, for the same query $Q$ considered above and an arbitrary product width $k$, the query $Q^k$ can encode various hard graph-counting problems.\footnote{While $\query$ is the same, our results assume $\bigO{\numvar}$ tuples rather than the constant number of tuples appearing in \cref{fig:two-step}.} We do so by analyzing how the coefficients in the (univariate) polynomial $\widetilde{\Phi}\left(p,\dots,p\right)$ relate to counts of various subgraphs on $k$ edges in an arbitrary graph $G$ (which is used to define the relations in $Q$).
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then ${\Phi}\left(\probOf\pbox{X_1=1},\dots, \probOf\pbox{X_n=1}\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, for $\query^2$ from above, writing $\prob_A$ for $\probOf\pbox{A = 1}$ (and analogously for the other variables), we can see that
\begin{align*}
\poly^2\inparen{\probAllTup} &= \prob_A^2\prob_X^2\prob_B^2 + \prob_B^2\prob_Y^2\prob_D^2 + \prob_B^2\prob_Z^2\prob_C^2 + 2\prob_A\prob_X\prob_B^2\prob_Y\prob_D + 2\prob_A\prob_X\prob_B^2\prob_Z\prob_C + 2\prob_B^2\prob_Y\prob_D\prob_Z\prob_C\\
&\leq\prob_A\prob_X\prob_B + \prob_B\prob_Y\prob_D + \prob_B\prob_Z\prob_C +
2\prob_A\prob_X\prob_B\prob_Y\prob_D + 2\prob_A\prob_X\prob_B\prob_Z\prob_C + 2\prob_B\prob_Y\prob_D\prob_Z\prob_C \\
&= \rpoly^2\inparen{\vct{\prob}}
%\inparen{0.9\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 1.0 + 0.5\cdot 1.0\cdot 0.5}^2 = 2.7225 < 3.45 = \rpoly^2\inparen{\probAllTup}
\end{align*}
Taking the smallest factor dropped by this reduction, in this case $\prob_A\prob_X\prob_B$, we see that $\poly^2\inparen{\vct{\prob}}$ lies in the range $[\prob_A\prob_X\prob_B\cdot\rpoly^2\inparen{\vct{\prob}}, \rpoly^2\inparen{\vct{\prob}}]$.
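As a quick numeric check of these bounds (our arithmetic, using the probabilities listed in \cref{fig:two-step}: $\prob_A = 0.9$, $\prob_B = \prob_C = 0.5$, and all remaining probabilities $1.0$):
\begin{equation*}
\poly^2\inparen{\vct{\prob}} = \inparen{0.45 + 0.5 + 0.25}^2 = 1.44, \qquad \rpoly^2\inparen{\vct{\prob}} = 0.45 + 0.5 + 0.25 + 0.9 + 0.45 + 0.5 = 3.05,
\end{equation*}
and indeed $1.44$ lies in the range $[\prob_A\prob_X\prob_B\cdot 3.05,\, 3.05] \approx [1.37, 3.05]$.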
To get a $(1\pm \epsilon)$-multiplicative approximation, we uniformly sample monomials from $\Phi$ and `adjust' their contributions based on the construction of $\widetilde{\Phi}\left(\cdot\right)$.
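To make the sampling idea concrete, here is a generic Monte Carlo sketch (our formulation only; the actual algorithm and its guarantees appear in \cref{sec:algo}): sample monomials $m_1,\ldots,m_s$ uniformly from the $N$ monomials of $\Phi$ in \abbrSMB and return
\begin{equation*}
\frac{N}{s}\sum_{j=1}^{s} c_{m_j}\prod_{X_i \in m_j}\prob_i,
\end{equation*}
which is an unbiased estimator of $\widetilde{\Phi}\inparen{\vct{\prob}}$ (the product ranges over the distinct variables of $m_j$); standard concentration bounds then yield the $(1\pm\epsilon)$ guarantee with sufficiently many samples.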
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen} and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem (\Cref{def:the-expected-multipl})\AH{Aren't they the same?}. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.

View file

@ -19,10 +19,10 @@
%\toprule
City$_\ell$ & $\Phi$ & \textbf{p}\\
\midrule
Buffalo & $A$ & 0.9 \\
Chicago & $B$ & 0.5\\
Bremen & $C$ & 0.5\\
Zurich & $D$ & 1.0\\
\end{tabular}\\
\tabcolsep=0.05cm
%\captionof{table}{Route}
@ -31,10 +31,10 @@
%\toprule
$\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
\midrule
Buffalo & Chicago & $X$ & 1.0 \\
Chicago & Zurich & $Y$ & 1.0 \\
%& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\
Chicago & Bremen & $Z$ & 1.0 \\
\end{tabular}};
%label below cylinder
\node[below=0.2 cm of cylinder]{{\LARGE$ \pdb$}};
@ -55,7 +55,7 @@
\midrule
%\hline
%\\\\[-3.5\medskipamount]
Buffalo & $AX$ &\resizebox{!}{10mm}{
\begin{tikzpicture}[thick]
\node[gen_tree_node](sink) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[gen_tree_node](source1) at (0, 0){$A$};
@ -64,7 +64,7 @@
\draw[->] (source2)--(sink);
\end{tikzpicture}% & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$
}\\% & $0.9$ \\
Chicago & $B(Y + Z)$\newline \text{Or}\newline $BY+ BZ$&
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[gen_tree_node] (a1) at (1, 0){$Y$};