\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
Unlike the results in~\Cref{sec:hard}, our approximation algorithm works for \bis, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
First, let us introduce some useful definitions and notation related to polynomials and their representations. For illustrative purposes in the definitions below, we use the following %{\em bivariate}
polynomial:
\begin{equation}
\label{eq:poly-eg}
\poly(X, Y) = 2X^2 + 3XY - 2Y^2.
\end{equation}
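For intuition, the following is an informal Python sketch (not part of the formal development; the names \texttt{poly} and \texttt{rpoly} are our own): since the variables are $\{0,1\}$-valued, $X^2 = X$, and so the expectation of the example polynomial under independent Bernoulli assignments equals the polynomial with all exponents collapsed to one, evaluated at the tuple probabilities.

```python
import itertools

# The example polynomial poly(X, Y) = 2X^2 + 3XY - 2Y^2.
def poly(x, y):
    return 2 * x**2 + 3 * x * y - 2 * y**2

# All exponents collapsed to 1 (valid because x in {0,1} implies x^2 = x);
# this mirrors the paper's reduced polynomial notation.
def rpoly(x, y):
    return 2 * x + 3 * x * y - 2 * y

def expectation(px, py):
    """Exact expectation of poly over independent Bernoulli(px), Bernoulli(py)."""
    total = 0.0
    for x, y in itertools.product([0, 1], repeat=2):
        pr = (px if x else 1 - px) * (py if y else 1 - py)
        total += pr * poly(x, y)
    return total

# E[poly(X, Y)] coincides with rpoly evaluated at the probabilities.
assert abs(expectation(0.3, 0.7) - rpoly(0.3, 0.7)) < 1e-9
```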
In the subsequent subsections we will prove the following theorem.
\begin{Theorem}\label{lem:approx-alg}
Let $\etree$ be an expression tree for a UCQ over a \bi, define $\poly(\vct{X})=\polyf(\etree)$, and let $k=\degree(\poly)$.
%Let $\poly(\vct{X})$ be a query polynomial corresponding to the output of a UCQ in a \bi.
An estimate $\mathcal{E}$ %=\approxq(\etree, (p_1,\dots,p_\numvar), \conf, \error')$
of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\]
such that
\begin{equation}
\label{eq:approx-algo-bound}
P\left(\abs{\mathcal{E} - \rpoly(\prob_1,\ldots, \prob_\numvar)} > \error'\cdot \rpoly(\prob_1,\ldots, \prob_\numvar)\right) \leq \conf.
\end{equation}
\end{Theorem}

We next present a couple of corollaries of~\Cref{lem:approx-alg}.
\begin{Corollary}\label{cor:approx-algo-const-p}
Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\etree)$. Further let it be the case that $p_i\ge p_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying~\Cref{eq:approx-algo-bound} can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot p_0^{2k}}\right)\]
In particular, if $p_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\frac 1{\eps^2}\cdot\treesize(\etree)\cdot \log{\frac{1}{\conf}}\right)$.
\end{Corollary}

The proof for~\Cref{cor:approx-algo-const-p} can be seen in~\Cref{sec:proofs-approx-alg}.
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation:
\begin{equation}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-1mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i.
\end{equation}
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.}
%\AR{Yes, there is! If we used uniform distribution then in our bounds we will have a parameter that depends on the largest $\abs{coef}$, which e.g. could be dependent on $n$. But with the weighted probability distribution, we avoid paying this price. Though I guess perhaps we can say for the kinds of queries we consider these coefficients are all constants?}
to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{v}} p_i$. Taking enough samples and computing the average of $Y$ gives us our final estimate. Algorithm~\ref{alg:mon-sam} has the details.
\OK{Even if the proof is offloaded to the appendix, it would be useful to state the formula for $N$ (line 4 of \Cref{alg:mon-sam}), along with a pointer to the appendix.}
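To make the estimator concrete, here is a hedged Python sketch. It materializes the expansion as a plain list of (coefficient, variables) pairs, which the actual algorithm avoids (it samples directly from $\etree$ using the weights computed by \onepass); the function names and the \texttt{violates\_B} callback are illustrative assumptions, not the paper's interface.

```python
import random

def estimate(expansion, p, violates_B, n_samples, seed=0):
    # `expansion`: hypothetical list of (coefficient, variable-id tuple) pairs,
    # standing in for the expanded tree; `p[i]`: probability of variable i;
    # `violates_B`: True iff the monomial vanishes modulo the BIDB constraints.
    rng = random.Random(seed)
    weights = [abs(c) for c, _ in expansion]  # sample proportional to |c|
    total_w = sum(weights)
    acc = 0.0
    for _ in range(n_samples):
        c, monom = rng.choices(expansion, weights=weights, k=1)[0]
        y = 0.0
        if not violates_B(monom):
            y = 1.0
            for i in monom:
                y *= p[i]
        acc += (1 if c >= 0 else -1) * y      # reattach the sign of c
    return (acc / n_samples) * total_w        # rescale by the sum of |c|
```

On the reduced example polynomial $2X + 3XY - 2Y$ with $p_X = 0.3$ and $p_Y = 0.7$ (and no \bi constraints), the estimate converges to $2(0.3) + 3(0.21) - 2(0.7) = -0.17$.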
The function $\sampmon$ completes in $O(\log{k} \cdot k \cdot depth(\etree))$ time.
Armed with the above two lemmas, we are ready to argue the following result (proof in~\Cref{sec:proofs-approx-alg}):
\begin{Theorem}\label{lem:mon-samp}
%If the contracts for $\onepass$ and $\sampmon$ hold, then
For any $\etree$ with $\degree(\polyf(\etree)) = k$, Algorithm~\ref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that %$\expct\pbox{\empmean} = \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)\cdot(1 - \gamma)}{\abs{\etree}(1,\ldots, 1)}$. %within an additive $\error \cdot \abs{\etree}(1,\ldots, 1)$ error with
\[P\left(\left|\vari{acc} - \rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error \cdot \abs{\etree}(1,\ldots, 1)\right) \leq \conf,\]
in $O\left(\treesize(\etree)\right.$ $+$ $\left.\left(\frac{\log{\frac{1}{\conf}}}{\error^2} \cdot k \cdot\log{k} \cdot depth(\etree)\right)\right)$ time.
\end{Theorem}
A naive (slow) implementation of \sampmon\ would first compute $E(T)$ and then sample from it.
% However, this would be too time consuming.
%
Instead, \Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by top-down traversal.
For a parent $+$ node, the child to be visited is sampled from the weighted distribution precomputed by \onepass.
When a parent $\times$ node is visited, both children are visited.
The algorithm computes two properties: the set of all variable leaf nodes visited, and the product of signs of visited coefficient leaf nodes.
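A hedged Python sketch of this traversal follows (the tuple-based tree encoding is our own illustration; a faithful implementation would reuse the subtree weights cached by \onepass\ instead of recomputing them as \texttt{weight} does here, which is what keeps the stated time bound):

```python
import random

# Nodes: ('+', l, r), ('*', l, r), ('var', name), or ('const', coefficient).

def weight(node):
    # |subtree| evaluated at (1,...,1) with each coefficient replaced by |c|.
    kind = node[0]
    if kind == 'var':
        return 1.0
    if kind == 'const':
        return abs(node[1])
    wl, wr = weight(node[1]), weight(node[2])
    return wl + wr if kind == '+' else wl * wr

def sample_monomial(node, rng):
    """Return (set of variable leaves visited, product of coefficient signs)."""
    kind = node[0]
    if kind == 'var':
        return {node[1]}, 1
    if kind == 'const':
        return set(), 1 if node[1] >= 0 else -1
    if kind == '+':
        # Visit exactly one child, chosen with probability prop. to its weight.
        wl = weight(node[1])
        child = node[1] if rng.random() * (wl + weight(node[2])) < wl else node[2]
        return sample_monomial(child, rng)
    # '*' node: visit both children; union the variables, multiply the signs.
    vl, sl = sample_monomial(node[1], rng)
    vr, sr = sample_monomial(node[2], rng)
    return vl | vr, sl * sr
```

For the tree encoding of $2X + 3XY$, each call returns $\{X\}$ with probability $2/5$ and $\{X, Y\}$ with probability $3/5$, matching sampling proportional to $\abs{c}$.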
%\begin{Definition}[TreeSet]
$\sampmon$ is given in \Cref{alg:sample}.
%\AR{Experimental stuff about \bi should go in here}
%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
%!TEX root=./main.tex
\section{Conclusions and Future Work}\label{sec:concl-future-work}
We have studied the problem of calculating the expectation of polynomials over random integer variables.
This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple.
This problem has been studied extensively for sets (lineage formulas), but the bag setting has not received much attention so far.
While the expectation of a polynomial in SOP form can be calculated in time linear in its size, the problem is \sharpwonehard for factorized polynomials.
We have proven this claim through a reduction from the problem of counting $k$-matchings.
When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time.
An interesting direction for future work would be the development of a dichotomy for queries over bag PDBs.
Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
\BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:
\textbullet{More queries: what happens with negation? Can circuits with monus be used?}}
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Compressed representations like Factorized Databases~\cite{factorized-db} %DBLP:conf/tapp/Zavodny11
or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the following possible world.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$.
\end{Example}
Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
\begin{Example}\label{ex:bag-vs-set}
\end{Example}
Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{DS12}, and thus \sharpphard.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form Boolean lineage are neither independent nor disjoint, leading, e.g.~\cite{FH13}, to the use of Shannon decomposition, which is in the worst case exponential in the size of the input.
% \begin{equation*}
% \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{multline}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}}
% = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
% + P[W_c = 1]P[W_a = 1]\\
= \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
\end{multline}
\bibliographystyle{plain}
\bibliography{main}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Compressed Representations of Polynomials and Boolean Formulas}\label{sec:compr-repr-polyn}
There is a large body of work on compact representations of Boolean formulas (e.g., various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g., factorizations~\cite{factorized-db}), some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}. Compact representations of Boolean formulas for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNFs, and FBDDs. In terms of circuits over semiring expressions,~\cite{DM14c} studies circuits for absorptive semirings while~\cite{S18a} studies circuits that include negation (expressed as the monus operation of a semiring). Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Parameterized Complexity}\label{sec:param-compl}
%!TEX root=./main.tex
\section{Related Work}\label{sec:related-work}
In addition to probabilistic databases, our work has connections to work on compact representations of polynomials and on fine-grained complexity, which we review in \Cref{sec:compr-repr-polyn,sec:param-compl}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsection{Probabilistic Databases}\label{sec:prob-datab}
Probabilistic Databases (PDBs) have been studied predominantly for set semantics.
A multitude of data models have been proposed for encoding a PDB more compactly than as its set of possible worlds. These include tuple-independent databases~\cite{VS17} (\tis), block-independent databases (\bis)~\cite{RS07}, and \emph{PC-tables}~\cite{GT06}, which pair a C-table % ~\cite{IL84a}
with a probability distribution over its variables.
This is similar to our $\semNX$-PDBs, but we use polynomials instead of Boolean expressions and only allow constants as attribute values.
% Tuple-independent databases (\tis) consist of a classical database where each tuple associated with a probability and tuples are treated as independent probabilistic events.
% While unable to encode correlations directly, \tis are popular because any finite probabilistic database can be encoded as a \ti and a set of constraints that ``condition'' the \ti~\cite{VS17}.
% Block-independent databases (\bis) generalize \tis by partitioning the input into blocks of disjoint tuples, where blocks are independent~\cite{RS07}. %,BS06
% \emph{PC-tables}~\cite{GT06} pair a C-table % ~\cite{IL84a}
% with probability distribution over its variables. This is similar to our $\semNX$-PDBs, except that we do not allow for variables as attribute values and instead of local conditions (propositional formulas that may contain comparisons), we associate tuples with polynomials $\semNX$.
Approaches for probabilistic query processing (i.e., computing the marginal probability for query result tuples) fall into two broad categories.
\emph{Intensional} (or \emph{grounded}) query evaluation computes the \emph{lineage} of a tuple % (a Boolean formula encoding the provenance of the tuple)
and then the probability of the lineage formula.
In this paper we focus on intensional query evaluation using polynomials instead of Boolean formulas.
It is a well-known fact that computing the marginal probability of a tuple is \sharpphard (proven through a reduction from weighted model counting~\cite{valiant-79-cenrp} %provan-83-ccccptg
using the fact that the tuple's marginal probability is the probability of its lineage formula).
The second category, \emph{extensional} query evaluation, is in \ptime but is limited to certain classes of queries.
Dalvi et al.~\cite{DS12} proved a dichotomy for unions of conjunctive queries (UCQs):
for any UCQ the probabilistic query evaluation problem is either \sharpphard or in \ptime (the latter case permitting extensional evaluation).
Olteanu et al.~\cite{FO16} presented dichotomies for two classes of queries with negation. % R\'e et al~\cite{RS09b} present a trichotomy for HAVING queries.
Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15}. %,AB15c
Another line of work studies which structural properties of lineage formulas lead to tractable cases~\cite{kenig-13-nclexpdc,roy-11-f,sen-10-ronfqevpd}.
Several techniques for approximating tuple probabilities have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07}, relying on Monte Carlo sampling, e.g.,~\cite{DS07}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10}.
The approximation algorithm for bag expectation we present in this work is based on sampling.
Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of the extension of K-relations for aggregate queries proposed in~\cite{AD11d} (this data model is referred to as \emph{pvc-tables}). As an extension of K-relations, this approach supports bags. Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10}. % over the symbolic expressions that are used as tuple annotations and values in pvc-tables.
% \cite{FH12} identifies a tractable class of queries involving aggregation.
In contrast, we study a less general data model and query class, but provide a linear-time approximation algorithm and new insights into the complexity of computing the expectation (while~\cite{FH12} computes probabilities for individual output annotations).