From 0dbf75ebbaeb79084c5ec93450c02f1ce43f6f9e Mon Sep 17 00:00:00 2001 From: Boris Glavic Date: Sat, 19 Dec 2020 15:44:18 -0600 Subject: [PATCH] shorten --- approx_alg.tex | 35 +++++++++++++++++++-------------- conclusions.tex | 14 +++++++------- intro.tex | 9 +++++---- main.tex | 4 ---- related-work-extra.tex | 2 +- related-work.tex | 44 ++++++++++++++++++++++++++---------------- 6 files changed, 60 insertions(+), 48 deletions(-) diff --git a/approx_alg.tex b/approx_alg.tex index e60fbe9..3183db9 100644 --- a/approx_alg.tex +++ b/approx_alg.tex @@ -3,15 +3,15 @@ \section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo} -In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}). -Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}. +In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}). +Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}. Unlike the results in~\Cref{sec:hard}, our approximation algorithm works for \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark. %it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next. \subsection{Preliminaries and some more notation} -First, let us introduce some useful definitions and notation related to polynomials and their representations.
For illustrative purposes in the definitions below, we use the following %{\em bivariate} -polynomial: +First, let us introduce some useful definitions and notation related to polynomials and their representations. For illustrative purposes in the definitions below, we use the following %{\em bivariate} +polynomial: \begin{equation} \label{eq:poly-eg} \poly(X, Y) = 2X^2 + 3XY - 2Y^2. @@ -145,10 +145,10 @@ In the subsequent subsections we will prove the following theorem. \begin{Theorem}\label{lem:approx-alg} Let $\etree$ be an expression tree for a UCQ over \bi and define $\poly(\vct{X})=\polyf(\etree)$ and let $k=\degree(\poly)$. -%Let $\poly(\vct{X})$ be a query polynomial corresponding to the output of a UCQ in a \bi. -An estimate $\mathcal{E}$ %=\approxq(\etree, (p_1,\dots,p_\numvar), \conf, \error')$ - of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time -\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot depth(\etree))}{\inparen{\error'}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\] +%Let $\poly(\vct{X})$ be a query polynomial corresponding to the output of a UCQ in a \bi. +An estimate $\mathcal{E}$ %=\approxq(\etree, (p_1,\dots,p_\numvar), \conf, \error')$ + of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time +\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\] such that \begin{equation} \label{eq:approx-algo-bound} @@ -172,7 +172,7 @@ We next present a couple of corollaries of~\Cref{lem:approx-alg}. \begin{Corollary} \label{cor:approx-algo-const-p} Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\etree)$. Further let it be the case that $p_i\ge p_0$ for all $i\in[\numvar]$.
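As a sanity check on the role of $\rpoly$ in the theorem above: over $\{0,1\}$-valued variables $X^2 = X$, so the expectation of the running example $\poly(X, Y) = 2X^2 + 3XY - 2Y^2$ equals the reduced polynomial $\rpoly(p_X, p_Y) = 2p_X + 3p_Xp_Y - 2p_Y$ evaluated at the tuple probabilities. A brute-force check over the four possible worlds (an illustrative Python sketch, not the paper's implementation):

```python
from itertools import product

def poly(x, y):
    # Running example: poly(X, Y) = 2X^2 + 3XY - 2Y^2
    return 2 * x ** 2 + 3 * x * y - 2 * y ** 2

def rpoly(px, py):
    # Reduced polynomial: exponents collapse (X^2 -> X), then substitute probabilities
    return 2 * px + 3 * px * py - 2 * py

def brute_force_expectation(px, py):
    # Expectation of poly over two independent {0,1}-valued variables
    total = 0.0
    for x, y in product([0, 1], repeat=2):
        weight = (px if x else 1 - px) * (py if y else 1 - py)
        total += weight * poly(x, y)
    return total

assert abs(brute_force_expectation(0.3, 0.7) - rpoly(0.3, 0.7)) < 1e-9
```
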
Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying~\Cref{eq:approx-algo-bound} can be computed in time \[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot p_0^{2k}}\right)\] -In particular, if $p_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\frac 1{\eps^2}\cdot\treesize(\etree)\cdot \log{\frac{1}{\conf}}\right)$. +In particular, if $p_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\frac 1{\eps^2}\cdot\treesize(\etree)\cdot \log{\frac{1}{\conf}}\right)$. \end{Corollary} The proof for~\Cref{cor:approx-algo-const-p} can be seen in~\Cref{sec:proofs-approx-alg}. @@ -190,7 +190,7 @@ The algorithm to prove~\Cref{lem:approx-alg} follows from the following observat \begin{equation} \rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-1mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i. \end{equation} Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but sampling weighted by $\abs{c}$ gives better parameters.} -%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.} +%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.} %\AR{Yes, there is! If we used uniform distribution then in our bounds we will have a parameter that depends on the largest $\abs{coef}$, which e.g. could be dependent on $n$. But with the weighted probability distribution, we avoid paying this price.
Though I guess perhaps we can say for the kinds of queries we consider these coefficients are all constants?} to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{v}} p_i$. Taking enough samples and computing the average of $Y$ gives us our final estimate. Algorithm~\ref{alg:mon-sam} has the details. \OK{Even if the proof is offloaded to the appendix, it would be useful to state the formula for $N$ (line 4 of \Cref{alg:mon-sam}), along with a pointer to the appendix.} @@ -343,9 +343,9 @@ The function $\sampmon$ completes in $O(\log{k} \cdot k \cdot depth(\etree))$ ti Armed with the above two lemmas, we are ready to argue the following result (proof in~\Cref{sec:proofs-approx-alg}): \begin{Theorem}\label{lem:mon-samp} -%If the contracts for $\onepass$ and $\sampmon$ hold, then +%If the contracts for $\onepass$ and $\sampmon$ hold, then For any $\etree$ with $\degree(poly(|\etree|)) = k$, Algorithm~\ref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that %$\expct\pbox{\empmean} = \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)\cdot(1 - \gamma)}{\abs{\etree}(1,\ldots, 1)}$. %within an additive $\error \cdot \abs{\etree}(1,\ldots, 1)$ error with -$\empmean$ has bounds +$\empmean$ has bounds \[P\left(\left|\vari{acc} - \rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error \cdot \abs{\etree}(1,\ldots, 1)\right) \leq \conf,\] in $O\left(\treesize(\etree)\right.$ $+$ $\left.\left(\frac{\log{\frac{1}{\conf}}}{\error^2} \cdot k \cdot\log{k} \cdot depth(\etree)\right)\right)$ time. \end{Theorem} @@ -406,9 +406,9 @@ It turns out that for proof of~\Cref{lem:sample}, we need to argue that when $\e A naive (slow) implementation of \sampmon\ would first compute $E(T)$ and then sample from it. % However, this would be too time consuming. % -Instead, \Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by top-down traversal.
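The \OK note above asks for the sample count $N$; a standard Hoeffding-style instantiation (assuming each sample, after normalization by $\abs{\etree}(1,\ldots,1)$, lies in $[-1, 1]$; the exact constant in Algorithm~\ref{alg:mon-sam} may differ) is $N = \lceil 2\ln(2/\conf)/\error^2 \rceil$. A sketch of the whole estimator over an explicit monomial list (a stand-in for $\expandtree{\etree}$; all names here are illustrative, not the paper's implementation):

```python
import math
import random

def hoeffding_samples(eps, delta):
    # Hoeffding for samples in [-1, 1]: P(|mean - E| > eps) <= 2 exp(-N eps^2 / 2),
    # so N = ceil(2 ln(2/delta) / eps^2) samples suffice (constant is an assumption).
    return math.ceil(2 * math.log(2 / delta) / eps ** 2)

def estimate_rpoly(monomials, prob, block_of, eps, delta, seed=0):
    # monomials: list of (coefficient, variable list) pairs, a stand-in for expand(T);
    # prob: variable -> probability; block_of: variable -> BIDB block id.
    rng = random.Random(seed)
    weights = [abs(c) for c, _ in monomials]
    l1_norm = sum(weights)                    # |etree|(1, ..., 1)
    n = hoeffding_samples(eps, delta)
    acc = 0.0
    for _ in range(n):
        # Sample a monomial with probability proportional to |coefficient|.
        c, vs = rng.choices(monomials, weights=weights)[0]
        distinct = set(vs)                    # rpoly: exponents collapse to 1
        blocks = [block_of[v] for v in distinct]
        if len(blocks) != len(set(blocks)):
            continue                          # two tuples of one block: monomial = 0 mod B
        y = 1.0 if c > 0 else -1.0
        for v in distinct:
            y *= prob[v]
        acc += y
    return l1_norm * acc / n

# Running example 2X^2 + 3XY - 2Y^2 at p = (0.5, 0.5): rpoly = 1 + 0.75 - 1 = 0.75
monos = [(2, ["X", "X"]), (3, ["X", "Y"]), (-2, ["Y", "Y"])]
est = estimate_rpoly(monos, {"X": 0.5, "Y": 0.5}, {"X": 0, "Y": 1}, eps=0.05, delta=0.01)
```

Sampling proportional to $\abs{c}$ (rather than uniformly) is what keeps the variance independent of the largest coefficient, as noted in the comment thread above.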
-For a parent $+$ node, the child to be visited is sampled from the weighted distribution precomputed by \onepass. -When a parent $\times$ node is visited, both children are visited. +Instead, \Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by top-down traversal. +For a parent $+$ node, the child to be visited is sampled from the weighted distribution precomputed by \onepass. +When a parent $\times$ node is visited, both children are visited. The algorithm computes two properties: the set of all variable leaf nodes visited, and the product of signs of visited coefficient leaf nodes. %\begin{Definition}[TreeSet] @@ -458,3 +458,8 @@ $\sampmon$ is given in \Cref{alg:sample}, and a proof of its correctness (via \C %\AR{Experimental stuff about \bi should go in here} %%%%%%%%%%%%%%%%%%%%%%% + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: "main" +%%% End: diff --git a/conclusions.tex b/conclusions.tex index 691a000..8e05c87 100644 --- a/conclusions.tex +++ b/conclusions.tex @@ -1,14 +1,14 @@ %!TEX root=./main.tex \section{Conclusions and Future Work}\label{sec:concl-future-work} -We have studied the problem of calculating the expectation of polynomials over random integer variables. +We have studied the problem of calculating the expectation of polynomials over random integer variables. This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple. -This problem has been studied extensively for sets (lineage formulas), but the bag settings has not received much attention so far. -While the expectation of a polynomial can be calculated in linear time in the size of polynomials that are in SOP form, the problem is \sharpwonehard for factorized polynomials. -We have proven this claim through a reduction from the problem of counting k-matchings. 
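The top-down traversal of \Cref{alg:sample} described above (at a $+$ node, sample one child using the \onepass weights; at a $\times$ node, visit both children, collecting variable leaves and coefficient signs) can be sketched over a toy tree encoding (the tuple-based node layout is an assumption for illustration, not the paper's data structures):

```python
import random

# A node is ("+", l, r), ("*", l, r), ("var", name), or ("const", value).

def onepass(node, weights):
    # Bottom-up: annotate each inner node with the |subtree|(1, ..., 1) of its children.
    kind = node[0]
    if kind == "const":
        return abs(node[1])
    if kind == "var":
        return 1
    l = onepass(node[1], weights)
    r = onepass(node[2], weights)
    weights[id(node)] = (l, r)
    return l + r if kind == "+" else l * r

def sampmon(node, weights, rng):
    # Returns (set of variable leaves visited, product of coefficient signs).
    kind = node[0]
    if kind == "const":
        return set(), 1 if node[1] >= 0 else -1
    if kind == "var":
        return {node[1]}, 1
    if kind == "*":                           # multiplication: visit both children
        lv, ls = sampmon(node[1], weights, rng)
        rv, rs = sampmon(node[2], weights, rng)
        return lv | rv, ls * rs
    lw, rw = weights[id(node)]                # addition: sample one child by weight
    child = node[1] if rng.random() * (lw + rw) < lw else node[2]
    return sampmon(child, weights, rng)

# 2*X*X + 3*X*Y as a binary expression tree
tree = ("+",
        ("*", ("const", 2), ("*", ("var", "X"), ("var", "X"))),
        ("*", ("const", 3), ("*", ("var", "X"), ("var", "Y"))))
weights = {}
total = onepass(tree, weights)                # |etree|(1, 1) = 2 + 3 = 5
variables, sign = sampmon(tree, weights, random.Random(1))
```

Each call touches one root-to-leaves path per $+$ choice and both subtrees at $\times$, matching the $O(\log{k} \cdot k \cdot depth(\etree))$ cost stated for $\sampmon$.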
-When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time. -An interesting direction for future work would be development of a dichotomy for queries over bag PDBs. -Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system. +This problem has been studied extensively for sets (lineage formulas), but the bag setting has not received much attention so far. +While the expectation of a polynomial in SOP form can be calculated in time linear in its size, the problem is \sharpwonehard for factorized polynomials. +We have proven this claim through a reduction from the problem of counting $k$-matchings. +When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time. +An interesting direction for future work would be the development of a dichotomy for queries over bag PDBs. +% Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system. \BG{I am not sure what interesting future work is here.
Some wild guesses; if anybody agrees, I'll try to flesh them out: \textbullet{More queries: what happens with negation? Can circuits with monus be used?} diff --git a/intro.tex b/intro.tex index 3104fb4..43c0b1c 100644 --- a/intro.tex +++ b/intro.tex @@ -32,7 +32,8 @@ However, even with alternative encodings~\cite{FH13}, the limiting factor in com The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms. Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial. Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general. -Compressed representations like Factorized Databases~\cite{factorized-db,DBLP:conf/tapp/Zavodny11} or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}. +Compressed representations like Factorized Databases~\cite{factorized-db} %DBLP:conf/tapp/Zavodny11 +or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g., pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}. Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database. The initial picture is not good.
@@ -115,7 +116,7 @@ For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the possible world The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$. \end{Example} -Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,DBLP:journals/sigmod/GuagliardoL17}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example). +Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example). Without loss of generality, we assume that input relations are sets (i.e., $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics. \begin{Example}\label{ex:bag-vs-set} @@ -145,7 +146,7 @@ P[\poly_{set}] &= \hspace*{-1mm} } \end{Example} -Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{10.1145/1265530.1265571}, and thus \sharpphard. +Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{DS12}, and thus \sharpphard. To see why computing this probability is hard, observe that the clauses of the disjunctive normal form Boolean lineage are neither independent nor disjoint, leading, e.g., in~\cite{FH13}, to the use of Shannon decomposition, which is at worst exponential in the size of the input.
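The contrast drawn above, linear expectations for bags versus hard probabilities for sets, can be verified by brute force for $\poly = W_aW_b + W_bW_c + W_cW_a$ with $P[W_i = 1] = p$ (an illustrative Python sketch):

```python
from itertools import product

p = 0.4
exp_bag = 0.0   # E[W_a W_b + W_b W_c + W_c W_a]
prob_set = 0.0  # P[(W_a AND W_b) OR (W_b AND W_c) OR (W_c AND W_a)]
for wa, wb, wc in product([0, 1], repeat=3):
    w = 1.0
    for bit in (wa, wb, wc):
        w *= p if bit else 1 - p
    exp_bag += w * (wa * wb + wb * wc + wc * wa)
    prob_set += w * ((wa and wb) or (wb and wc) or (wc and wa))

# Linearity of expectation: E equals the polynomial at the probabilities, 3p^2 ...
assert abs(exp_bag - 3 * p ** 2) < 1e-9
# ... while the set probability is 3p^2 - 2p^3: the clauses are neither
# independent nor disjoint, so their probabilities do not simply add up.
assert abs(prob_set - (3 * p ** 2 - 2 * p ** 3)) < 1e-9
```
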
% \begin{equation*} % \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3 @@ -172,7 +173,7 @@ Computing such expectations is indeed linear in the size of the SOP as the numbe As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals: \begin{multline} \label{eqn:can-inline-probabilities-into-polynomial} -\expct\pbox{\poly_{bag}} +\expct\pbox{\poly_{bag}} % = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\ % + P[W_c = 1]P[W_a = 1]\\ = \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1]) diff --git a/main.tex b/main.tex index a76a778..0b4aebe 100644 --- a/main.tex +++ b/main.tex @@ -194,10 +194,6 @@ sensitive=true \bibliographystyle{plain} \bibliography{main} - - - - %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % APPENDIX %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% diff --git a/related-work-extra.tex b/related-work-extra.tex index 71f7fa9..b9ea8bd 100644 --- a/related-work-extra.tex +++ b/related-work-extra.tex @@ -2,7 +2,7 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Compressed Representations of Polynomials and Boolean Formulas}\label{sec:compr-repr-polyn} -There is a large body of work on compact using representations of Boolean formulas (e.g, various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g.,factorizations~\cite{DBLP:conf/tapp/Zavodny11,factorized-db}) some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}. Compact representations of Boolean formulas for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNF, and FBDD. In terms of circuits over semiring expression,~\cite{DM14c} studies circuits for absorptive semirings while~\cite{S18a} studies circuits that include negation (expressed as the monus operation of a semiring). 
Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form. +There is a large body of work on compact representations of Boolean formulas (e.g., various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g., factorizations~\cite{factorized-db}), some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}. Compact representations of Boolean formulas for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNF, and FBDD. In terms of circuits over semiring expressions,~\cite{DM14c} studies circuits for absorptive semirings while~\cite{S18a} studies circuits that include negation (expressed as the monus operation of a semiring). Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Parameterized Complexity}\label{sec:param-compl} diff --git a/related-work.tex b/related-work.tex index 5a345e4..e30e67e 100644 --- a/related-work.tex +++ b/related-work.tex @@ -1,32 +1,42 @@ %!TEX root=./main.tex \section{Related Work}\label{sec:related-work} -In addition to work on probabilistic databases, our work has connections to work on compact representations of polynomials and relies on past work in fine-grained complexity which we review in \Cref{sec:compr-repr-polyn} and \Cref{sec:param-compl}. +In addition to probabilistic databases, our work has connections to work on compact representations of polynomials and on fine-grained complexity, which we review in \Cref{sec:compr-repr-polyn,sec:param-compl}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\subsection{Probabilistic Databases}\label{sec:prob-datab} -Probabilistic Databases (PDBs) have been studied predominantly under set semantics. -A multitude of data models have been proposed for encoding a PDB more compactly than as its set of possible worlds. -Tuple-independent databases (\tis) consist of a classical database where each tuple associated with a probability and tuples are treated as independent probabilistic events. -While unable to encode correlations directly, \tis are popular because any finite probabilistic database can be encoded as a \ti and a set of constraints that ``condition'' the \ti~\cite{VS17}. -Block-independent databases (\bis) generalize \tis by partitioning the input into blocks of disjoint tuples, where blocks are independent~\cite{RS07,BS06}. \emph{PC-tables}~\cite{GT06} pair a C-table~\cite{IL84a} with probability distribution over its variables. This is similar to our $\semNX$-PDBs, except that we do not allow for variables as attribute values and instead of local conditions (propositional formulas that may contain comparisons), we associate tuples with polynomials $\semNX$. +Probabilistic Databases (PDBs) have been studied predominantly under set semantics. +A multitude of data models have been proposed for encoding a PDB more compactly than as its set of possible worlds. These include tuple-independent databases~\cite{VS17} (\tis), block-independent databases (\bis)~\cite{RS07}, and \emph{PC-tables}~\cite{GT06}, which pair a C-table % ~\cite{IL84a} +with a probability distribution over its variables. +This is similar to our $\semNX$-PDBs, but we use polynomials instead of Boolean expressions and only allow constants as attribute values. +% Tuple-independent databases (\tis) consist of a classical database where each tuple is associated with a probability and tuples are treated as independent probabilistic events.
+% While unable to encode correlations directly, \tis are popular because any finite probabilistic database can be encoded as a \ti and a set of constraints that ``condition'' the \ti~\cite{VS17}. +% Block-independent databases (\bis) generalize \tis by partitioning the input into blocks of disjoint tuples, where blocks are independent~\cite{RS07}. %,BS06 +% \emph{PC-tables}~\cite{GT06} pair a C-table % ~\cite{IL84a} +% with probability distribution over its variables. This is similar to our $\semNX$-PDBs, except that we do not allow for variables as attribute values and instead of local conditions (propositional formulas that may contain comparisons), we associate tuples with polynomials $\semNX$. -Approaches for probabilistic query processing (i.e., computing the marginal probability for query result tuples), fall into two broad categories. -\emph{Intensional} (or \emph{grounded}) query evaluation computes the \emph{lineage} of a tuple (a Boolean formula encoding the provenance of the tuple) and then the probability of the lineage formula. -In this paper we focus on intensional query evaluation using polynomials instead of boolean formulas. -It is a well-known fact that computing the marginal probability of a tuple is \sharpphard (proven through a reduction from weighted model counting~\cite{provan-83-ccccptg,valiant-79-cenrp} using the fact the tuple's marginal probability is the probability of a its lineage formula). -The second category, \emph{extensional} query evaluation, avoids calculating the lineage. -This approach is in \ptime, but is limited to certain classes of queries. -Dalvi et al.~\cite{DS12} proved a dichotomy for unions of conjunctive queries (UCQs): for any UCQ the probabilistic query evaluation problem is either \sharpphard (requires extensional evaluation) or \ptime (allows intensional). -Olteanu et al.~\cite{FO16} presented dichotomies for two classes of queries with negation, R\'e et al~\cite{RS09b} present a trichotomy for HAVING queries. 
-Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15,AB15c}. +Approaches for probabilistic query processing (i.e., computing the marginal probability for query result tuples) fall into two broad categories. +\emph{Intensional} (or \emph{grounded}) query evaluation computes the \emph{lineage} of a tuple % (a Boolean formula encoding the provenance of the tuple) +and then the probability of the lineage formula. +In this paper we focus on intensional query evaluation using polynomials instead of Boolean formulas. +It is a well-known fact that computing the marginal probability of a tuple is \sharpphard (proven through a reduction from weighted model counting~\cite{valiant-79-cenrp} %provan-83-ccccptg +using the fact that a tuple's marginal probability is the probability of its lineage formula). +The second category, \emph{extensional} query evaluation, % avoids calculating the lineage. +% This approach +is in \ptime, but is limited to certain classes of queries. +Dalvi et al.~\cite{DS12} proved a dichotomy for unions of conjunctive queries (UCQs): +for any UCQ the probabilistic query evaluation problem is either \sharpphard (requires intensional evaluation) or \ptime (permits extensional). +Olteanu et al.~\cite{FO16} presented dichotomies for two classes of queries with negation. % R\'e et al~\cite{RS09b} present a trichotomy for HAVING queries. +Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15}. %,AB15c Another line of work studies which structural properties of lineage formulas lead to tractable cases~\cite{kenig-13-nclexpdc,roy-11-f,sen-10-ronfqevpd}. -Several techniques for approximating tuple probabilities have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07,re-07-eftqevpd}, relying on Monte Carlo sampling, e.g., \cite{DS07,re-07-eftqevpd}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10,fink-11}.
+Several techniques for approximating tuple probabilities have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07}, relying on Monte Carlo sampling, e.g.,~\cite{DS07}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10}. The approximation algorithm for bag expectation we present in this work is based on sampling. -Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of the extension of K-relations for aggregate queries proposed in~\cite{AD11d} (this data model is referred to as \emph{pvc-tables}). As an extension of K-relations, this approach supports bags. Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10} over the symbolic expressions that are used as tuple annotations and values in pvc-tables. \cite{FH12} identifies a tractable class of queries involving aggregation. In contrast, we study a less general data model and query class, but provide a linear time approximation algorithm and provide new insights into the complexity of computing expectation (while \cite{FH12} computes probabilities for individual output annotations). +Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of the extension of K-relations for aggregate queries proposed in~\cite{AD11d} (this data model is referred to as \emph{pvc-tables}). As an extension of K-relations, this approach supports bags. Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10}. % over the symbolic expressions that are used as tuple annotations and values in pvc-tables. % \cite{FH12} identifies a tractable class of queries involving aggregation. +In contrast, we study a less general data model and query class, but provide a linear-time approximation algorithm and new insights into the complexity of computing expectation (while~\cite{FH12} computes probabilities for individual output annotations). %%% Local Variables: %%% mode: latex