Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds

Boris Glavic 2020-12-15 11:02:33 -06:00
commit df38fb6796
5 changed files with 22 additions and 6 deletions

@ -1,4 +1,13 @@
@misc{pdbench,
title = {pdbench},
howpublished = {\url{http://pdbench.sourceforge.net/}},
note = {Accessed: 2020-12-15}
}
@MISC{Antova_fastand,
author = {Lyublena Antova and Thomas Jansen and Christoph Koch and Dan Olteanu},
title = {Fast and Simple Relational Processing of Uncertain Data},
year = {}
}
@book{DBLP:series/synthesis/2011Suciu,
author = {Dan Suciu and
Dan Olteanu and

@ -1,6 +1,12 @@
%root: main.tex
\begin{abstract}
The problem of computing the marginal probability of a tuple in the result of a query over probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, which is a Boolean formula whose variables represent the existence of tuples in the database. Under bag semantics, lineage formulas have to be replaced with provenance polynomials. For any given possible world, the polynomial of a result tuple evaluates to the multiplicity of the tuple in this world. In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately. For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's provenance polynomial if the polynomial is encoded as a sum of products. However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation for factorized polynomials is \sharpwonehard. More importantly, the problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $p$ (where $p \neq 0$ and $p \neq 1$). We then proceed to study polynomials of result tuples of unions of conjunctive queries (UCQs) over TIDBs and over a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $(1 \pm \epsilon)$-approximation of the expectation of such polynomials in linear time in the size of the polynomial.
The problem of computing the marginal probability of a tuple in the result of a query over set-probabilistic databases (PDBs) can be reduced to calculating the probability of the \emph{lineage formula} of the result, a Boolean formula over random variables representing the existence of tuples in the database's possible worlds.
The analog for bag semantics is a natural number-valued polynomial over random variables that evaluates to the multiplicity of the tuple in each world.
In this work, we study the problem of calculating the expectation of such polynomials (a tuple's expected multiplicity) exactly and approximately.
For tuple-independent databases (TIDBs), the expected multiplicity of a query result tuple can trivially be computed in linear time in the size of the tuple's lineage if this polynomial is encoded as a sum of products.
However, using a reduction from the problem of counting $k$-matchings, we demonstrate that calculating the expectation is \sharpwonehard when the polynomial is compressed, for example through factorization.
The problem stays hard even for polynomials generated by conjunctive queries (CQs) if all input tuples have a fixed probability $p$ (where $p \neq 0$ and $p \neq 1$).
We then proceed to study polynomials of result tuples of unions of conjunctive queries (UCQs) over TIDBs and over a non-trivial subclass of block-independent databases (BIDBs). We develop an algorithm that computes a $(1 \pm \epsilon)$-approximation of the expectation of such polynomials in linear time in the size of the polynomial.
% \AH{High-level intuition}
% \BG{Most people think that computing expected multiplicity of an output tuple in a probabilistic database (PDB) is easy. Due to the fact that most modern implementations of PDBs represent tuple lineage in their expanded form, it has to be the case that such a computation is linear in the size of the lineage. This follows since, when we have an uncompressed lineage, linearity allows for expectation to be pushed through the sum.}
% \AH{Low-level why-would-an-expert-read-this}

@ -1,11 +1,11 @@
% root: main.tex
We ran our experiments under the Windows Subsystem for Linux (WSL) on Windows 10, on a machine with an Intel Core i7 2.40GHz processor and 16GB of RAM. All experiments used the PostgreSQL 13.0 database system.
The intention of the experiments was to determine whether, in practice, queries over $\bi$ instances generate many cancellations. Recall that by definition of $\bi$, a query result cannot be derived by a self-join between tuples belonging to the same block.
The intention of the experiments was to determine whether, in practice, queries over $\bi$ instances generate many cancellations. Recall that by definition of $\bi$, a query result cannot be derived by a self-join between non-identical tuples belonging to the same block.
For this purpose, we used the MayBMS data generator tool~\cite{pdbench} to generate uncertain versions of the TPC-H tables. We then ran $\poly_1$, $\poly_2$, and $\poly_3$ from~\cite{U-relations}, which are modified versions of TPC-H queries $\poly_3$, $\poly_6$, and $\poly_7$ in which all aggregations have been dropped.
For this purpose, we used the MayBMS data generator tool~\cite{pdbench} to randomly generate uncertain versions of the TPC-H tables. We then ran $\poly_1$, $\poly_2$, and $\poly_3$ from~\cite{Antova_fastand}, which are modified versions of TPC-H queries $\poly_3$, $\poly_6$, and $\poly_7$ in which all aggregations have been dropped.
As written, the queries disallow $\bi$ cross terms. We first ran all queries as written and then rewrote them so that they no longer filter out the cross terms. The results show that in practice there are few to no cancelling terms, as shown in \Cref{fig:experiment-bidb-cancel}. \Cref{tbl:cancel} reports the number of result tuples returned when the query filters out tuples that are cancelled by $\bi$ constraints, the number of output tuples when the cancelled tuples are included in the result, and the difference between the two.
As written, the queries disallow $\bi$ cross terms. We first ran all queries as written and then rewrote them so that they no longer filter out the cross terms. The results show that in practice there are few to no cancelling terms, as shown in \Cref{fig:experiment-bidb-cancel}. The columns of the table in~\Cref{fig:experiment-bidb-cancel} show the number of result tuples returned when the query filters out tuples that are cancelled by $\bi$ constraints, the number of output tuples when the cancelled tuples are included in the result, and the difference between the two. Across the queries, between $0\%$ and $0.001\%$ of the result tuples are cancelled, suggesting that only a negligible number of tuples are cancelled in practice when running queries over a typical $\bi$ instance. Interestingly, only one of the three queries had tuples that violated the $\bi$ constraint.
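% Illustrative aside, not part of the experiments: $X_1$, $X_2$, and $p_1$ are hypothetical names; \mathbb assumes amssymb is loaded.
For intuition, consider a minimal example under assumed notation: let $X_1$ and $X_2$ be the $\{0, 1\}$-valued variables of two distinct tuples in the same block, and let $p_1$ be the probability of the first tuple. Since no possible world contains both tuples, a cross term vanishes in expectation, whereas a self-join of a tuple with itself does not:
\[
\mathbb{E}\left[X_1 \cdot X_2\right] = 0, \qquad \mathbb{E}\left[X_1 \cdot X_1\right] = \mathbb{E}\left[X_1\right] = p_1 .
\]
Under this reading, a cancelled tuple is a result tuple whose derivation would require a cross term of the first kind.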
\begin{figure}[ht]
\begin{tabular}{ c | c c c}\label{tbl:cancel}

@ -110,6 +110,7 @@
%PDBs
\newcommand{\pdbx}{X_{DB}}
\newcommand{\prob}{p}
\newcommand{\wSet}{\Omega}
\newcommand{\ti}{TIDB\xspace}

@ -353,7 +353,7 @@ All of the observations above focused only on the shape of $\eset{1}$, and since
\subsubsection{Proof of~\cref{lem:3p-G2}}
For $\mathcal{P} \in f_2^{-1}\inparen{ \eset{2}}$ such that $\mathcal{P}$ is a $3$-path, it \textit{must} be the case by definition of $f_2$ that (i) all edges in $f_2(\mathcal{P})$ have at least one mapping from an edge in $\mathcal{P}$; recall also that (ii) $\mathcal{P}$ is connected. These constraints rule out every pattern $\eset{1}$ consisting of $3$ edges (it can be verified that in each three-edge pattern at least one of (i) or (ii) is violated), as well as the case $\eset{1} = \twodis$. For $\eset{1} = \ed$, note that $\eset{1}$ does not have enough edges to have any output in $f_2^{-1}(\eset{1})$, i.e., there exists no $\mathcal{P} \in \binom{E_2}{3}$ such that $f_2(\mathcal{P}) = \eset{1}$. The only surviving pattern is $\eset{1} \equiv \twopath$, where the edges of $\eset{2}$ have successive connectivity from $(e_1, 0)$ to $(e_2, 1)$. There are then two $3$-paths sharing edges $e_1$ and $e_2$ in $f_2^{-1}(\eset{1})$, namely $\pbrace{(e_1, 0), (e_1, 1), (e_2, 0)}$ and $\pbrace{(e_1, 1), (e_2, 0), (e_2, 1)}$.
For $\mathcal{P} \in f_2^{-1}\inparen{ \eset{2}}$ such that $\mathcal{P}$ is a $3$-path, it \textit{must} be the case by definition of $f_2$ that (i) all edges in $f_2(\mathcal{P})$ have at least one mapping from an edge in $\mathcal{P}$; recall also that (ii) $\mathcal{P}$ is connected. These constraints rule out every pattern $\eset{1}$ consisting of $3$ edges (it can be verified that in each three-edge pattern at least one of (i) or (ii) is violated), as well as the case $\eset{1} = \twodis$. For $\eset{1} = \ed$, note that $\eset{1}$ does not have enough edges to have any output in $f_2^{-1}(\eset{1})$, i.e., there exists no $\mathcal{P} \in \binom{E_2}{3}$ such that $f_2(\mathcal{P}) = \eset{1}$. The only surviving pattern is $\eset{1} \equiv \twopath$, where the edges of $\eset{2}$ have successive connectivity from $(e_1, 0)$ to $(e_2, 1)$. There are then two $3$-paths sharing edges $e_1$ and $e_2$ in $f_2^{-1}(\eset{1})$, namely $\pbrace{(e_1, 0), (e_1, 1), (e_2, 0)}$ and $\{(e_1, 1)$,$ (e_2, 0), (e_2, 1)\}$.
All of the observations above focused only on the shape of $\eset{1}$, and since we see that for fixed $\eset{1}$, we have a fixed number of $3$-paths, this implies the identity.
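% Illustrative sanity check, not part of the original proof: the star graph and its edge names are hypothetical.
As a sanity check, assume the identity states that the number of $3$-paths in the graph with edge set $E_2$ is twice the number of $2$-paths in the original graph, and consider a star with three edges $e_1, e_2, e_3$ sharing a common vertex. The star contains $\binom{3}{2} = 3$ two-paths, and by the argument above each of them contributes exactly two $3$-paths, so one expects
\[
2 \cdot \binom{3}{2} = 6
\]
such $3$-paths. A direct enumeration agrees: every $3$-path uses both copies of one edge $e_i$ and the adjacent copy of a different edge $e_j$, which gives $3 \cdot 2 = 6$ ordered choices of $(e_i, e_j)$.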
%\end{proof}