\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
Unlike the results in~\Cref{sec:hard}, our approximation algorithm works for \bis, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
First, let us introduce some useful definitions and notation related to polynomials and their representations. For illustrative purposes in the definitions below, we use the following %{\em bivariate}
polynomial:
\begin{equation}
\label{eq:poly-eg}
\poly(X, Y) = 2X^2 + 3XY - 2Y^2.
\end{equation}
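For intuition, the following is an informal Python sketch (not part of the formal development; the names \texttt{poly} and \texttt{rpoly} are our own): since the variables are $\{0,1\}$-valued, $X^2 = X$, and so the expectation of the example polynomial under independent Bernoulli assignments equals the polynomial with all exponents collapsed to one, evaluated at the tuple probabilities.

```python
import itertools

# The example polynomial poly(X, Y) = 2X^2 + 3XY - 2Y^2.
def poly(x, y):
    return 2 * x**2 + 3 * x * y - 2 * y**2

# All exponents collapsed to 1 (valid because x in {0,1} implies x^2 = x);
# this mirrors the paper's reduced polynomial notation.
def rpoly(x, y):
    return 2 * x + 3 * x * y - 2 * y

def expectation(px, py):
    """Exact expectation of poly over independent Bernoulli(px), Bernoulli(py)."""
    total = 0.0
    for x, y in itertools.product([0, 1], repeat=2):
        pr = (px if x else 1 - px) * (py if y else 1 - py)
        total += pr * poly(x, y)
    return total

# E[poly(X, Y)] coincides with rpoly evaluated at the probabilities.
assert abs(expectation(0.3, 0.7) - rpoly(0.3, 0.7)) < 1e-9
```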
In the subsequent subsections we will prove the following theorem.
\begin{Theorem}\label{lem:approx-alg}
Let $\etree$ be an expression tree for a UCQ over a \bi, define $\poly(\vct{X})=\polyf(\etree)$, and let $k=\degree(\poly)$.
%Let $\poly(\vct{X})$ be a query polynomial corresponding to the output of a UCQ in a \bi.
An estimate $\mathcal{E}$ %=\approxq(\etree, (p_1,\dots,p_\numvar), \conf, \error')$
of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\]
such that
\begin{equation}
\label{eq:approx-algo-bound}
P\left(\abs{\mathcal{E} - \rpoly(\prob_1,\ldots, \prob_\numvar)} > \error'\cdot \rpoly(\prob_1,\ldots, \prob_\numvar)\right) \leq \conf.
\end{equation}
\end{Theorem}

We next present a couple of corollaries of~\Cref{lem:approx-alg}.
\begin{Corollary}\label{cor:approx-algo-const-p}
Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\etree)$. Further let it be the case that $p_i\ge p_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying~\Cref{eq:approx-algo-bound} can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot p_0^{2k}}\right)\]
In particular, if $p_0>0$ and $\gamma<1$ are absolute constants then the above runtime simplifies to $O_k\left(\frac 1{\eps^2}\cdot\treesize(\etree)\cdot \log{\frac{1}{\conf}}\right)$.
\end{Corollary}

The proof for~\Cref{cor:approx-algo-const-p} can be seen in~\Cref{sec:proofs-approx-alg}.
The algorithm to prove~\Cref{lem:approx-alg} follows from the following observation:
\begin{equation}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-1mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i.
\end{equation}
Given the above, the algorithm is a sampling-based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.}
%\AR{Yes, there is! If we used uniform distribution then in our bounds we will have a parameter that depends on the largest $\abs{coef}$, which e.g. could be dependent on $n$. But with the weighted probability distribution, we avoid paying this price. Though I guess perhaps we can say for the kinds of queries we consider these coefficients are all constants?}
to $\abs{c}$ and compute $Y=\indicator{\monom\mod{\mathcal{B}}\not\equiv 0}\cdot \prod_{X_i\in \var\inparen{v}} p_i$. Taking enough samples and computing the average of $Y$ gives us our final estimate. Algorithm~\ref{alg:mon-sam} has the details.
\OK{Even if the proof is offloaded to the appendix, it would be useful to state the formula for $N$ (line 4 of \Cref{alg:mon-sam}), along with a pointer to the appendix.}
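To make the estimator concrete, here is a hedged Python sketch. It materializes the expansion as a plain list of (coefficient, variables) pairs, which the actual algorithm avoids (it samples directly from $\etree$ using the weights computed by \onepass); the function names and the \texttt{violates\_B} callback are illustrative assumptions, not the paper's interface.

```python
import random

def estimate(expansion, p, violates_B, n_samples, seed=0):
    # `expansion`: hypothetical list of (coefficient, variable-id tuple) pairs,
    # standing in for the expanded tree; `p[i]`: probability of variable i;
    # `violates_B`: True iff the monomial vanishes modulo the BIDB constraints.
    rng = random.Random(seed)
    weights = [abs(c) for c, _ in expansion]  # sample proportional to |c|
    total_w = sum(weights)
    acc = 0.0
    for _ in range(n_samples):
        c, monom = rng.choices(expansion, weights=weights, k=1)[0]
        y = 0.0
        if not violates_B(monom):
            y = 1.0
            for i in monom:
                y *= p[i]
        acc += (1 if c >= 0 else -1) * y      # reattach the sign of c
    return (acc / n_samples) * total_w        # rescale by the sum of |c|
```

On the reduced example polynomial $2X + 3XY - 2Y$ with $p_X = 0.3$ and $p_Y = 0.7$ (and no \bi constraints), the estimate converges to $2(0.3) + 3(0.21) - 2(0.7) = -0.17$.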
The function $\sampmon$ completes in $O(\log{k} \cdot k \cdot depth(\etree))$ time.
Armed with the above two lemmas, we are ready to argue the following result (proof in~\Cref{sec:proofs-approx-alg}):
\begin{Theorem}\label{lem:mon-samp}
%If the contracts for $\onepass$ and $\sampmon$ hold, then
For any $\etree$ with $\degree(\polyf(\etree)) = k$, Algorithm~\ref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that %$\expct\pbox{\empmean} = \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)\cdot(1 - \gamma)}{\abs{\etree}(1,\ldots, 1)}$. %within an additive $\error \cdot \abs{\etree}(1,\ldots, 1)$ error with
\[P\left(\left|\vari{acc} - \rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error \cdot \abs{\etree}(1,\ldots, 1)\right) \leq \conf,\]
in $O\left(\treesize(\etree)\right.$ $+$ $\left.\left(\frac{\log{\frac{1}{\conf}}}{\error^2} \cdot k \cdot\log{k} \cdot depth(\etree)\right)\right)$ time.
\end{Theorem}
A naive (slow) implementation of \sampmon\ would first compute $E(T)$ and then sample from it.
% However, this would be too time consuming.
%
Instead, \Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by top-down traversal.
For a parent $+$ node, the child to be visited is sampled from the weighted distribution precomputed by \onepass.
When a parent $\times$ node is visited, both children are visited.
The algorithm computes two properties: the set of all variable leaf nodes visited, and the product of signs of visited coefficient leaf nodes.
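A hedged Python sketch of this traversal follows (the tuple-based tree encoding is our own illustration; a faithful implementation would reuse the subtree weights cached by \onepass\ instead of recomputing them as \texttt{weight} does here, which is what keeps the stated time bound):

```python
import random

# Nodes: ('+', l, r), ('*', l, r), ('var', name), or ('const', coefficient).

def weight(node):
    # |subtree| evaluated at (1,...,1) with each coefficient replaced by |c|.
    kind = node[0]
    if kind == 'var':
        return 1.0
    if kind == 'const':
        return abs(node[1])
    wl, wr = weight(node[1]), weight(node[2])
    return wl + wr if kind == '+' else wl * wr

def sample_monomial(node, rng):
    """Return (set of variable leaves visited, product of coefficient signs)."""
    kind = node[0]
    if kind == 'var':
        return {node[1]}, 1
    if kind == 'const':
        return set(), 1 if node[1] >= 0 else -1
    if kind == '+':
        # Visit exactly one child, chosen with probability prop. to its weight.
        wl = weight(node[1])
        child = node[1] if rng.random() * (wl + weight(node[2])) < wl else node[2]
        return sample_monomial(child, rng)
    # '*' node: visit both children; union the variables, multiply the signs.
    vl, sl = sample_monomial(node[1], rng)
    vr, sr = sample_monomial(node[2], rng)
    return vl | vr, sl * sr
```

For the tree encoding of $2X + 3XY$, each call returns $\{X\}$ with probability $2/5$ and $\{X, Y\}$ with probability $3/5$, matching sampling proportional to $\abs{c}$.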
%\begin{Definition}[TreeSet]
$\sampmon$ is given in \Cref{alg:sample}.
%\AR{Experimental stuff about \bi should go in here}
%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:
%!TEX root=./main.tex
\section{Conclusions and Future Work}\label{sec:concl-future-work}
We have studied the problem of calculating the expectation of polynomials over random integer variables.
This problem has a practical application in probabilistic databases over multisets, where it corresponds to calculating the expected multiplicity of a query result tuple.
This problem has been studied extensively for sets (lineage formulas), but the bag setting has not received much attention so far.
While the expectation of a polynomial in SOP form can be calculated in time linear in its size, the problem is \sharpwonehard for factorized polynomials.
We have proven this claim through a reduction from the problem of counting $k$-matchings.
When only considering polynomials for result tuples of UCQs over TIDBs and BIDBs (under the assumption that there are $O(1)$ cancellations), we prove that it is still possible to approximate the expectation of a polynomial in linear time.
An interesting direction for future work would be the development of a dichotomy for queries over bag PDBs.
Furthermore, it would be interesting to see whether our approximation algorithm can be extended to support queries with negations, perhaps using circuits with monus as a representation system.
\BG{I am not sure what interesting future work is here. Some wild guesses, if anybody agrees I'll try to flesh them out:
\textbullet{More queries: what happens with negation? Can circuits with monus be used?}}
The corresponding lineage encoding for Bag-PDBs is a polynomial in sum of products (SOP) form --- a sum of `clauses', each of which is the product of a set of integer or variable atoms.
Thanks to linearity of expectation, computing the expectation of a count query is linear in the number of clauses in the SOP polynomial.
Unlike Set-PDBs, however, when we consider compressed representations of this polynomial, the complexity landscape becomes much more nuanced and is \textit{not} linear in general.
Compressed representations like Factorized Databases~\cite{factorized-db} %DBLP:conf/tapp/Zavodny11
or Arithmetic/Polynomial Circuits~\cite{arith-complexity} are analogous to deterministic query optimizations (e.g. pushing down projections)~\cite{DBLP:conf/pods/KhamisNR16,factorized-db}.
Thus, measuring the performance of a PDB algorithm in terms of the size of the \emph{compressed} lineage formula more closely relates the algorithm's performance to the complexity of query evaluation in a deterministic database.
The initial picture is not good.
For example, let $P[W_a] = P[W_b] = P[W_c] = p$ and consider the following possible world.
The corresponding variable assignment is $\{\;W_a \mapsto \top, W_b \mapsto \top, W_c \mapsto \bot\;\}$, and the probability of this world is $P[W_a]\cdot P[W_b] \cdot P[\neg W_c] = p\cdot p\cdot (1-p)=p^2-p^3$.
\end{Example}
Following prior efforts~\cite{feng:2019:sigmod:uncertainty,DBLP:conf/pods/GreenKT07,GL16}, we generalize this model of Set-PDBs to bags using $\semN$-valued random variables (i.e., $Dom(W_i) \subseteq \mathbb N$) and constants (annotation $\Phi_{bag}$ in the example).
Without loss of generality, we assume that input relations are sets (i.e. $Dom(W_i) = \{0, 1\}$), while query evaluation follows bag semantics.
\begin{Example}\label{ex:bag-vs-set}
\end{Example}
Note that the query of \Cref{ex:bag-vs-set} in set semantics is indeed non-hierarchical~\cite{DS12}, and thus \sharpphard.
To see why computing this probability is hard, observe that the clauses of the disjunctive normal form Boolean lineage are neither independent nor disjoint, leading, e.g.~\cite{FH13}, to the use of Shannon decomposition, which is in the worst case exponential in the size of the input.
% \begin{equation*}
% \expct\pbox{\poly(W_a, W_b, W_c)} = W_aW_b + W_a\overline{W_b}W_c + \overline{W_a}W_bW_c = 3\prob^2 - 2\prob^3
As a further interesting feature of this example, note that $\expct\pbox{W_i} = P[W_i = 1]$, and so taking the same polynomial over the reals:
\begin{multline}
\label{eqn:can-inline-probabilities-into-polynomial}
\expct\pbox{\poly_{bag}}
% = P[W_a = 1]P[W_b = 1] + P[W_b = 1]P[W_c = 1]\\
% + P[W_c = 1]P[W_a = 1]\\
= \poly_{bag}(P[W_a=1], P[W_b=1], P[W_c=1])
\end{multline}
\bibliographystyle{plain}
\bibliography{main}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDIX
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Compressed Representations of Polynomials and Boolean Formulas}\label{sec:compr-repr-polyn}
There is a large body of work on compact representations of Boolean formulas (e.g., various types of circuits including OBDDs~\cite{jha-12-pdwm}) and polynomials (e.g., factorizations~\cite{factorized-db}), some of which have been utilized for probabilistic query processing, e.g.,~\cite{jha-12-pdwm}. Compact representations of Boolean formulas for which probabilities can be computed in linear time include OBDDs, SDDs, d-DNNFs, and FBDDs. In terms of circuits over semiring expressions,~\cite{DM14c} studies circuits for absorptive semirings while~\cite{S18a} studies circuits that include negation (expressed as the monus operation of a semiring). Algebraic Decision Diagrams~\cite{bahar-93-al} (ADDs) generalize BDDs to variables with more than two values. Chen et al.~\cite{chen-10-cswssr} introduced the generalized disjunctive normal form.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Parameterized Complexity}\label{sec:param-compl}
%!TEX root=./main.tex
\section{Related Work}\label{sec:related-work}
In addition to probabilistic databases, our work has connections to work on compact representations of polynomials and on fine-grained complexity, which we review in \Cref{sec:compr-repr-polyn,sec:param-compl}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsection{Probabilistic Databases}\label{sec:prob-datab}
Probabilistic Databases (PDBs) have been studied predominantly for set semantics.
A multitude of data models have been proposed for encoding a PDB more compactly than as its set of possible worlds. These include tuple-independent databases~\cite{VS17} (\tis), block-independent databases (\bis)~\cite{RS07}, and \emph{PC-tables}~\cite{GT06}, which pair a C-table % ~\cite{IL84a}
with a probability distribution over its variables.
This is similar to our $\semNX$-PDBs, but we use polynomials instead of Boolean expressions and only allow constants as attribute values.
% Tuple-independent databases (\tis) consist of a classical database where each tuple associated with a probability and tuples are treated as independent probabilistic events.
% While unable to encode correlations directly, \tis are popular because any finite probabilistic database can be encoded as a \ti and a set of constraints that ``condition'' the \ti~\cite{VS17}.
% Block-independent databases (\bis) generalize \tis by partitioning the input into blocks of disjoint tuples, where blocks are independent~\cite{RS07}. %,BS06
% \emph{PC-tables}~\cite{GT06} pair a C-table % ~\cite{IL84a}
% with probability distribution over its variables. This is similar to our $\semNX$-PDBs, except that we do not allow for variables as attribute values and instead of local conditions (propositional formulas that may contain comparisons), we associate tuples with polynomials $\semNX$.
Approaches for probabilistic query processing (i.e., computing the marginal probability for query result tuples) fall into two broad categories.
\emph{Intensional} (or \emph{grounded}) query evaluation computes the \emph{lineage} of a tuple % (a Boolean formula encoding the provenance of the tuple)
and then the probability of the lineage formula.
In this paper we focus on intensional query evaluation using polynomials instead of Boolean formulas.
It is a well-known fact that computing the marginal probability of a tuple is \sharpphard (proven through a reduction from weighted model counting~\cite{valiant-79-cenrp} %provan-83-ccccptg
using the fact that the tuple's marginal probability is the probability of its lineage formula).
The second category, \emph{extensional} query evaluation, is in \ptime but is limited to certain classes of queries.
Dalvi et al.~\cite{DS12} proved a dichotomy for unions of conjunctive queries (UCQs):
for any UCQ the probabilistic query evaluation problem is either \sharpphard or in \ptime (the latter case permitting extensional evaluation).
Olteanu et al.~\cite{FO16} presented dichotomies for two classes of queries with negation. % R\'e et al~\cite{RS09b} present a trichotomy for HAVING queries.
Amarilli et al. investigated tractable classes of databases for more complex queries~\cite{AB15}. %,AB15c
Another line of work studies which structural properties of lineage formulas lead to tractable cases~\cite{kenig-13-nclexpdc,roy-11-f,sen-10-ronfqevpd}.
Several techniques for approximating tuple probabilities have been proposed in related work~\cite{FH13,heuvel-19-anappdsd,DBLP:conf/icde/OlteanuHK10,DS07}, relying on Monte Carlo sampling, e.g.,~\cite{DS07}, or a branch-and-bound paradigm~\cite{DBLP:conf/icde/OlteanuHK10}.
The approximation algorithm for bag expectation we present in this work is based on sampling.
Fink et al.~\cite{FH12} study aggregate queries over a probabilistic version of the extension of K-relations for aggregate queries proposed in~\cite{AD11d} (this data model is referred to as \emph{pvc-tables}). As an extension of K-relations, this approach supports bags. Probabilities are computed using a decomposition approach~\cite{DBLP:conf/icde/OlteanuHK10}. % over the symbolic expressions that are used as tuple annotations and values in pvc-tables.
% \cite{FH12} identifies a tractable class of queries involving aggregation.
In contrast, we study a less general data model and query class, but provide a linear-time approximation algorithm and new insights into the complexity of computing the expectation (while~\cite{FH12} computes probabilities for individual output annotations).