This commit is contained in:
Aaron Huber 2020-12-11 20:20:02 -05:00
commit 17b2e6df8f
5 changed files with 43 additions and 33 deletions

View file

@ -1,9 +1,15 @@
%root: main.tex
\begin{abstract}
The problem of computing the marginal probability of a tuple in the result of a query over a probabilistic database (PDB) can be reduced to calculating the probability of the \emph{lineage formula} of the result, which is a Boolean formula whose variables represent the existence of tuples in the database. Under bag semantics, lineage formulas are replaced by provenance polynomials. For any given possible world, the polynomial of a result tuple evaluates to the multiplicity of the tuple in this world. In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately. For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can be computed in time linear in the size of the tuple's provenance polynomial if the polynomial is encoded as a sum of products. However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation for factorized polynomials is \sharpwonehard. We then proceed to study polynomials of result tuples of unions of conjunctive queries (UCQs) over TIDBs and block-independent databases (BIDBs). We develop an algorithm that computes a $1 \pm \epsilon$-approximation of the expectation of such polynomials in time linear in the size of the polynomial.
\AH{High-level intuition}
Most people think that computing the expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Since most modern PDB implementations represent tuple lineage in its expanded form, such a computation is indeed linear in the size of the lineage: with an uncompressed lineage, linearity of expectation allows the expectation to be pushed through the sum.
\BG{Most people think that computing the expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Since most modern PDB implementations represent tuple lineage in its expanded form, such a computation is indeed linear in the size of the lineage: with an uncompressed lineage, linearity of expectation allows the expectation to be pushed through the sum.}
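As a minimal sketch of this intuition (the notation here is illustrative, assuming a TIDB whose lineage is a sum of products of distinct independent Boolean variables $X_j$, each true with probability $p_j$): linearity of expectation, followed by independence within each monomial, gives
\[
  \expct\pbox{\sum_{i=1}^{m}\prod_{j \in S_i} X_j}
  \;=\; \sum_{i=1}^{m}\expct\pbox{\prod_{j \in S_i} X_j}
  \;=\; \sum_{i=1}^{m}\prod_{j \in S_i} p_j,
\]
which can be evaluated in a single pass over the $m$ monomials.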
\AH{Low-level why-would-an-expert-read-this}
However, when we consider compressed representations of the tuple lineage, the complexity landscape changes. If the lineage is computed over a factorized database, then in general the computation time is not linear in the size of the compressed lineage.
\BG{However, when we consider compressed representations of the tuple lineage, the complexity landscape changes. If the lineage is computed over a factorized database, then in general the computation time is not linear in the size of the compressed lineage.}
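A small example of why compression changes the picture (illustrative only, with $X_i$ independent Boolean variables true with probability $p_i$): the factorized polynomial $\left(\sum_{i=1}^{n} X_i\right)^2$ has size $O(n)$, but its expansion has $\Theta(n^2)$ monomials, and
\[
  \expct\pbox{\Big(\sum_{i=1}^{n} X_i\Big)^{2}}
  \;=\; \sum_{i=1}^{n} p_i \;+\; \sum_{i \neq j} p_i p_j
  \;\neq\; \Big(\sum_{i=1}^{n} p_i\Big)^{2}
\]
in general, since $\expct\pbox{X_i^2} = p_i \neq p_i^2$; that is, the expectation cannot simply be pushed through the compressed product.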
\AH{Key technical contributions}
This work demonstrates theoretically that bags are not easy in general: for compressed lineage representations, the computation can require more than linear time. It is therefore desirable to have an algorithm that approximates the expected multiplicity in linear time. We introduce such an algorithm and give theoretical guarantees on its efficiency and accuracy. It follows that computing an approximation of a tuple's expected multiplicity over a bag PDB has the same complexity as deterministic query processing.
\BG{This work demonstrates theoretically that bags are not easy in general: for compressed lineage representations, the computation can require more than linear time. It is therefore desirable to have an algorithm that approximates the expected multiplicity in linear time. We introduce such an algorithm and give theoretical guarantees on its efficiency and accuracy. It follows that computing an approximation of a tuple's expected multiplicity over a bag PDB has the same complexity as deterministic query processing.}
\end{abstract}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End:

View file

@ -2,7 +2,7 @@
\section{Introduction}
Modern production databases like Postgres and Oracle use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting (a known $\sharpP$ problem).
Modern production databases like Postgres and Oracle use bag semantics. In contrast, most implementations of probabilistic databases (PDBs) are built in the setting of set semantics, where computing the probability of an output tuple is analogous to weighted model counting (a known \sharpphard problem).
%the annotation of the tuple is a lineage formula ~\cite{DBLP:series/synthesis/2011Suciu}, which can essentially be thought of as a boolean formula. It is known that computing the probability of a lineage formula is \#-P hard in general
In PDBs, a Boolean formula~\cite{DBLP:series/synthesis/2011Suciu}, also called a lineage formula, encodes the conditions under which each output tuple appears in the result.
%The marginal probability of this formula being true is the tuple's probability to appear in a possible world.
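For instance (a standard illustration, not specific to this paper), if an output tuple is derived by joining input tuples annotated with Boolean variables $X_1$ and $X_2$, its lineage is $X_1 \wedge X_2$; if a second derivation uses tuples annotated $X_3$ and $X_4$, the lineage becomes $(X_1 \wedge X_2) \vee (X_3 \wedge X_4)$, and the tuple's marginal probability is the probability that this formula is true.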

View file

@ -8,7 +8,6 @@
\newcommand{\wElem}{w} %an element of \vct{w}
\newcommand{\st}{\;|\;} %such that
\newcommand{\kElem}{k}%the kth element
\newcommand{\sharpP}{\#P}
%RA-to-Poly Notation
\newcommand{\polyinput}[2]{\left(#1,\ldots, #2\right)}
\newcommand{\numvar}{n}
@ -310,12 +309,15 @@
\newcommand{\dbDomK}[1]{\mathcal{DB}_{#1}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% COMPLEXITY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\sharpphard}{\#P-hard\xspace}
\newcommand{\sharpwonehard}{\#W[1]-hard\xspace}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "attrprov"
%%% TeX-master: "main"
%%% End:
%%%Adding stuff below so that a long chain of display equations can be split across pages

View file

@ -1,4 +1,3 @@
\documentclass[sigconf]{acmart}
\usepackage{algpseudocode}
@ -191,8 +190,3 @@ sensitive=true
% \input{glossary.tex}
% \input{addproofappendix.tex}
\end{document}

View file

@ -5,10 +5,14 @@
We would like to argue that for a compressed version of $\poly(\vct{w})$, in general $\expct_{\vct{w}}\pbox{\poly(\vct{w})}$ cannot be computed in linear time.
\AR{Added the hardness result below.}
Our hardness result is based on the following known result:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}[\cite{k-match}]
\label{thm:k-match-hard}
Given a positive integer $k$ and an undirected graph $G$ with no self-loops or parallel edges, counting the number of $k$-matchings in $G$ is $\#W[1]$-hard.
Given a positive integer $k$ and an undirected graph $G$ with no self-loops or parallel edges, counting the number of $k$-matchings in $G$ is \sharpwonehard.
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The above result means that, under standard assumptions in parameterized complexity, we cannot hope to count the number of $k$-matchings in $G=(V,E)$ in time $f(k)\cdot |V|^{O(1)}$ for any function $f$. In fact, all known algorithms for this problem take time $|V|^{\Omega(k)}$.
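For concreteness, the following is a minimal brute-force sketch in Python (purely illustrative; the function name and the encoding of edges as vertex pairs are ours, and this is not the reduction used here). It enumerates all $k$-subsets of edges and keeps those that are pairwise vertex-disjoint, so its running time grows roughly like $|E|^{k}$, in line with the $|V|^{\Omega(k)}$ behavior noted above.

from itertools import combinations

def count_k_matchings(edges, k):
    # Count k-matchings: k-subsets of edges that are pairwise vertex-disjoint.
    count = 0
    for subset in combinations(edges, k):
        vertices = [v for edge in subset for v in edge]
        if len(vertices) == len(set(vertices)):  # no vertex is reused
            count += 1
    return count

# The 4-cycle 0-1-2-3-0 has exactly two 2-matchings.
print(count_k_matchings([(0, 1), (1, 2), (2, 3), (3, 0)], 2))  # prints 2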
To prove our hardness result, consider a graph $G = (V, E)$, where $|E| = \numedge$, $|V| = \numvar$, and $i, j \in [\numvar]$.
@ -72,3 +76,7 @@ The proof follows by ~\cref{thm:k-match-hard} and ~\cref{lem:qEk-multi-p}.
%The proof follows by ~\cref{thm:k-match-hard}, ~\cref{lem:qEk-multi-p} and ~\cref{cor:lem-qEk}.
%\end{proof}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: