New start on the Intro

Aaron Huber 2020-11-23 16:41:29 -05:00
parent df7f3bec6c
commit 9f0b6dadb0
3 changed files with 98 additions and 5 deletions


@ -1,4 +1,20 @@
@inproceedings{10.1145/1265530.1265571,
author = {Dalvi, Nilesh and Suciu, Dan},
title = {The Dichotomy of Conjunctive Queries on Probabilistic Structures},
year = {2007},
isbn = {9781595936851},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi-org.gate.lib.buffalo.edu/10.1145/1265530.1265571},
doi = {10.1145/1265530.1265571},
abstract = {We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either PTIME or \#P-complete, and we give an algorithm for deciding whether a given conjunctive query is PTIME or \#P-complete. The dichotomy property is a fundamental result on query evaluation on probabilistic databases and it gives a complete classification of the complexity of conjunctive queries.},
booktitle = {Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems},
pages = {293--302},
numpages = {10},
keywords = {probabilistic databases, dichotomy, conjunctive queries},
location = {Beijing, China},
series = {PODS '07}
}
@inproceedings{DBLP:conf/icde/OlteanuHK10,
author = {Dan Olteanu and


@ -1,7 +1,81 @@
%root: main.tex
\AH{I need help not being redundant...}
\section{Introduction}
\subsection{Problem Statement}
In practice, modern production databases, e.g., Postgres, Oracle, etc., use bag semantics. In contrast, most implementations of PDBs are built in the setting of set semantics, where the annotation polynomial is in disjunctive normal form (DNF), and computations over the output polynomial such as expectation (the probability of the tuple) are \#P-hard in general. However, for the equivalent sum of products (SOP) representation in the bag setting, computing the expectation (the expected multiplicity) over the output polynomial takes time linear in the size of the polynomial. In this work we show that, if we use alternative representations of the output polynomial, such as factorized forms, the complexity landscape becomes much more nuanced.
\subsection{Theoretical Problem}
%Figures, etc
%Relations for example 1
\begin{figure}[ht]
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c}
$\rel$ & A\\
\hline
& a \\
& b \\
& c \\
\end{tabular}
%\caption{Atom 1 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom1}
\end{subfigure}
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c c}
$E$ & A & B\\
\hline
& a & b\\
& b & c\\
& c & a\\
\end{tabular}
%\caption{Atom 3 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom3}
\end{subfigure}
\begin{subfigure}{0.15\textwidth}
\centering
\begin{tabular}{ c | c}
$\rel$ & B\\
\hline
& b\\
& c\\
& a\\
\end{tabular}
%\caption{Atom 2 of query $\poly$ in ~\cref{intro:ex}}
\label{subfig:ex-atom2}
\end{subfigure}
\caption{$\ti$ relations for $\poly$}
\label{fig:intro-ex}
\end{figure}
%Graph of query output for intro example
\begin{figure}
\begin{tikzpicture}
\node at (1.5, 3) [tree_node](top){a};
\node at (0, 0) [tree_node](left){b};
\node at (3, 0) [tree_node](right){c};
\draw (top)--(left);
\draw (left)--(right);
\draw (right)--(top);
\end{tikzpicture}
\caption{Output edges of $\poly$}
\label{fig:intro-ex-graph}
\end{figure}
\begin{Example}\label{intro:ex}
Suppose we have a query $\poly() := R(A), E(A, B), R(B)$, whose relations are given in~\cref{fig:intro-ex}. The output for $\poly$ is visualized as a graph in~\cref{fig:intro-ex-graph}.
\end{Example}
Note that such a query is indeed \#P-hard in set semantics, since it is non-hierarchical. That is, let $Vars(\poly)$ denote the set of variables occurring across all atoms of $\poly$, and let $sg(x)$ denote the set of all atoms that contain variable $x$; then for $A, B \in Vars(\poly)$ we have $sg(A) \cap sg(B) \neq \emptyset$, $sg(A)\not\subseteq sg(B)$, and $sg(B)\not\subseteq sg(A)$, as defined by Dalvi and Suciu~\cite{10.1145/1265530.1265571}. Thus, abusing notation and denoting the output polynomial as $\poly(\prob_1,\ldots, \prob_\numvar)$, computing $\expct\pbox{\poly(\prob_1,\ldots, \prob_\numvar)}$ is hard in set semantics.
However, in the bag setting, $\expct\pbox{\poly(\prob_1,\ldots, \prob_\numvar)}$ can be computed in time linear in the size of the output polynomial, as the number of operations in the computation is \textit{exactly} the number of operations in the output polynomial.
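As a sketch of this check on the query of \cref{intro:ex}, treating the two occurrences of $R$ as distinct atoms (an assumption made here for illustration), the two sets are
\[
sg(A) = \{R(A),\, E(A, B)\}, \qquad sg(B) = \{E(A, B),\, R(B)\}.
\]
They share the atom $E(A, B)$, while $R(A) \in sg(A) \setminus sg(B)$ and $R(B) \in sg(B) \setminus sg(A)$, so neither set contains the other and the query is non-hierarchical.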
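To illustrate the contrast, consider the following sketch. It assumes, for illustration only (\cref{fig:intro-ex} does not fix the annotations), that each tuple $R(v)$ is annotated with an independent Boolean variable $X_v$ with $\Pr[X_v = 1] = \prob_v$, and that every tuple of $E$ is present with certainty. Under bag semantics the output polynomial is $X_a X_b + X_b X_c + X_c X_a$, and linearity of expectation together with independence gives
\[
\expct\pbox{X_a X_b + X_b X_c + X_c X_a} = \prob_a \prob_b + \prob_b \prob_c + \prob_c \prob_a,
\]
which uses one arithmetic operation per operator of the polynomial. Under set semantics the annotation is instead the DNF formula $X_a X_b \vee X_b X_c \vee X_c X_a$, whose probability under the same assumptions is
\[
\prob_a \prob_b + \prob_b \prob_c + \prob_c \prob_a - 2\,\prob_a \prob_b \prob_c.
\]
The correction term appears because the three conjuncts are not mutually exclusive, so the expectation cannot be pushed through the disjunction term by term.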
\AH{{\bf \Large New Material Up To This Point}}
\AR{The para below has some text that is too colloquial and should not be in a paper, e.g. ``giant pile of monomials'' or ``folks''.}
Most implementations of modern probabilistic databases (PDBs) view the annotation polynomials as a giant pile of monomials in disjunctive normal form (DNF). Most folks have considered bag PDBs as easy, and because almost all of the theoretical framework of PDBs is in set semantics, few have considered bag PDBs. However, there is a subtle but easily missed advantage in the bag semantic setting: expectation can push through addition, making the computation easier than the oversimplified view of the polynomial being in its expanded sum of products (SOP) form. There is not a lot of existing work in bag PDBs per se; however, this work seeks to unite previous work in factorized databases with theoretical guarantees when used in computations over bag PDBs, which have not been extensively studied in the literature. We give theoretical results for computing the expectation of a bag polynomial, while introducing a linear time approximation algorithm for computing the expectation of a bag PDB tuple.
\AR{The para above does not quite seem to follow the outline we discussed for the first para. Here is what I thought we had discussed (but it is possible I'm mis-remembering). Here is how you can phrase this, line by line:
@ -11,7 +85,10 @@ Most implementations of modern probabilistic databases (PDBs) view the annotatio
(Line 4): In this work we show that if we use better representation like factorized DBs for annotation polynomial then the complexity landscape becomes much more nuanced.
}
In practice, modern production databases, e.g., Postgres, Oracle, etc. use bag semantics. In contrast, as noted above, most implementations of PDBs are built in the setting of set semantics,\AR{The stuff so far with minor modifications can function as the first two lines of the first para.}
and this contributes to slow computation time \AR{Why did you put this comment on slow computation time? What is it buying you? It seems like you are jumping ahead over here.}. In both settings it is the case that each tuple is annotated with a polynomial, which describes the tuples contributing to the given tuple's presence in the output. While in set semantics, the output polynomial is predominantly viewed as the probability\AR{The annotation polynomial does {\bf NOT} give the probability of a tuple-- it just says whether a tuple is present in a world or not. Only when you take the expectation of the polynomial do you get a probability value.} that its associated tuple exists or not, in the bags setting the polynomial is an encoding of the multiplicity of the tuple in the output. Note that in general, as we allude to later on, the polynomial can also represent set semantics, access levels, and other encodings.\AR{How does the last statement help? In this paper we are interested in set and bag semantics: why are you bringing up other possibilities that have nothing to do with the main message of the paper?}
\AR{I'll pause here now (except for two comments on next page) since it would be quicker to give comments during the meeting but here is something to ponder about, which might make things simpler to present. I would recommend that you introduce the theoretical problem {\bf right after} the first para. Once you define the problem, what you are trying to describe in words in this para can be done much more succinctly with relevant notation. If you want to be a bit more gentle then perhaps start off with an example annotation polynomial and talk about that example-- this latter one might be better for PODS. But {\em ideally}, you would like an example query that is hard in set semantics but easy in bag in SoP representation but also hard with more succinct representation. Side Q-- is our hard query that we use for triangle counting etc. \#P-hard in the set semantics? If so that would be a great example to use throughout the intro.}
In bag semantics, the polynomial is composed of $+$ and $\times$ operators, with constants from the set $\mathbb{N}$ and variables from the set of variables $\vct{X}$. Should we attempt to make computations, e.g. expectation, over the output polynomial, the naive algorithm cannot hope to do better than linear time in the size of the polynomial. However, in the set semantics setting, when e.g., computing the expectation (probability) of the output polynomial given values for each variable in the polynomial's set of variables $\vct{X}$, this problem is \#P-hard. %of the output polynomial of a result tuple for an arbitrary query. In contrast, in the bag setting, one cannot generate a result better than linear time in the size of the polynomial.
There is limited prior work, and there are few results, in the area of bag semantic PDBs. This work seeks to leverage prior work in factorized databases (e.g. Olteanu et al.)~\cite{DBLP:conf/tapp/Zavodny11} with PDB implementations to improve the efficiency of computation over output polynomials, with theoretical guarantees. \AR{I know what you are trying to say in the rest of the para but it can be easily interpreted to be saying something that is {\bf false}. It is always the case that the compressed form of the polynomial always evaluates to the same value as the extended SoP form for any value. So the expected value of compressed poly is {\em always the same} as expected value of the SoP forms. What you are trying to get here is when you ``push'' in the expectations. Again the latter is very hard to describe in words. But this would be much easier to state once you have the notation in place. Or if you have a running example.} When considering PDBs in the bag setting a subtlety arises that is easily overlooked due to the \textit{oversimplification} of PDBs in the set setting, i.e., in set semantics expectation doesn't have linearity over disjunction, and a consequence of this is that it is not true in the general case that a compressed polynomial has an equivalent expectation to its DNF form. In the bag PDB setting, however, expectation does enjoy linearity over addition, and the expectation of a compressed polynomial and its equivalent SOP are indeed the same.


@ -95,7 +95,7 @@ Finally, observe \cref{p1-s5} by construction in \cref{lem:pre-poly-rpoly}, that
\end{proof}
\begin{Corollary}\label{cor:expct-sop}
If $\poly$ is given as a sum of monomials, the expectation of $\poly$, i.e., $\ex{\poly} = \rpoly{Q}\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $O(|\poly|)$, where $|\poly|$ denotes the total number of multiplication/addition operators.
If $\poly$ is given as a sum of monomials, the expectation of $\poly$, i.e., $\ex{\poly} = \rpoly\left(\prob_1,\ldots, \prob_\numvar\right)$ can be computed in $O(|\poly|)$, where $|\poly|$ denotes the total number of multiplication/addition operators.
\end{Corollary}
\begin{proof}[Proof For Corollary ~\ref{cor:expct-sop}]