Done with intro pass:

Atri Rudra 2020-12-18 12:03:17 -05:00
parent a8c399325e
commit 9dc5ed2345


@@ -4,7 +4,7 @@
\section{Introduction}
\label{sec:intro}
\AR{\textbf{Oliver/Boris:} What is missing from the intro is why would someone care about bag-PDBs in {\em practice}? This is kinda obliquely referred to in the first para but it would be good to motivate this more. The intro (rightly) focuses on the theoretical reasons to study bag PDBs but what (if any) are the practical significance of getting bag PDBs done in linear-time? Would this lead to much faster real-life PDB systems?}
Modern production databases like Postgres and Oracle use bag semantics, while research on probabilistic databases (PDBs)~\cite{DBLP:series/synthesis/2011Suciu,DBLP:conf/sigmod/BoulosDMMRS05,DBLP:conf/icde/AntovaKO07a,DBLP:conf/sigmod/SinghMMPHS08} focuses predominantly on query evaluation under set semantics.
This is not surprising: under the conventional strategy for encoding the lineage of a query result --- a key component of query evaluation in PDBs --- computing typical statistics like marginal probabilities or moments is easy for bags (at worst linear in the size of the lineage), and hence perhaps not worthy of research attention, but hard for sets (at worst exponential in the size of the lineage), and hence interesting from a research perspective.
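To see why the bag case is easy under this encoding, consider a minimal hypothetical example (the variables $X$, $Y$, $Z$ below are illustrative and not part of our running example): for a sum-of-products lineage polynomial over independent variables, the expected multiplicity follows directly from linearity of expectation,
\begin{equation*}
\expct\pbox{X\cdot Y + X\cdot Z} = \expct\pbox{X}\cdot\expct\pbox{Y} + \expct\pbox{X}\cdot\expct\pbox{Z},
\end{equation*}
so a single linear scan over the clauses suffices.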
@@ -208,7 +208,7 @@ As a further interesting feature of this example, note that $\expct\pbox{W_i} =
\subsection{Superlinearity of Bag PDBs}
Moving forward, we focus exclusively on bags and drop the subscript from $\poly_{bag}$.
Consider the Cartesian product of $\poly$ with itself:
\begin{equation*}
\poly^2() := \rel(A), E(A, B), \rel(B),\; \rel(C), E(C, D), \rel(D)
\end{equation*}
@@ -252,7 +252,7 @@ Also note that the $\poly$ in~\Cref{ex:bag-vs-set} is already in reduced form.
The reduced form of a polynomial can be obtained in a linear scan over the clauses of a SOP encoding of the polynomial.
In prior work on PDBs, where this encoding is implicitly assumed, computing the expected count is linear in the size of the encoding.
In general however, compressed encodings of the polynomial can be exponentially smaller in $k$ for $k$-products --- the query $\poly^k$ obtained by taking the Cartesian product of $k$ copies of $\poly$ has a factorized encoding of size $6\cdot k$, while the SOP encoding is of size $2\cdot 3^k$.
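To make the gap concrete (instantiating only the sizes just quoted): for $k=2$ the factorized encoding has size $6\cdot 2=12$ versus $2\cdot 3^2=18$ for the SOP encoding, while for $k=10$ the sizes are already $60$ versus $2\cdot 3^{10}=118{,}098$.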
This leads us to the \textbf{central question of this paper}:
\begin{quote}
{\em
@@ -274,7 +274,14 @@ Concretely, in this paper:
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We further generalize our results to higher moments and to polynomial circuits, and we prove that for RA+ queries, the runtime of our approximation is within a constant factor of the runtime of processing the same query deterministically.
Our hardness results follow by considering a suitable generalization of the lineage polynomial in Example~\ref{ex:bag-vs-set}. First, it is easy to generalize the polynomial in Example~\ref{ex:bag-vs-set} to $\poly_G(X_1,\dots,X_n)$, which represents the edge set of a graph $G$ on $n$ vertices. Then $\inparen{\poly_G(X_1,\dots,X_n)}^k$ encodes, as its monomials, all subgraphs of $G$ with at most $k$ edges. This implies that the corresponding reduced polynomial $\rpoly_G^k(p,\dots,p)$ can be written as $\sum_{i=0}^{2k} c_i\cdot p^i$, and we observe that $c_{2k}$ is proportional to the number of $k$-matchings in $G$ (counting which is \sharpwonehard). Thus, if we have access to $\rpoly_G^k(p_i,\dots,p_i)$ for distinct values of $p_i$ for $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(p,\dots,p)$ for a {\em single specific} value of $p$ might be easy: indeed it is easy for $p=0$ or $p=1$. However, we are able to show that for any other value of $p$, computing $\rpoly_G^k(p,\dots,p)$ exactly will most probably require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity, e.g., on the complexity of counting triangles in $G$, rather than on more standard parameterized hardness assumptions like \sharpwonehard.
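To spell out the interpolation step (a standard argument, included here only as a sketch): writing $\rpoly_G^k(p,\dots,p)=\sum_{i=0}^{2k} c_i\cdot p^i$, the evaluations at $2k+1$ distinct points $p_0,\dots,p_{2k}$ yield the linear system
\begin{equation*}
\sum_{j=0}^{2k} c_j\cdot p_i^j = \rpoly_G^k(p_i,\dots,p_i) \qquad \text{for } 0\le i\le 2k,
\end{equation*}
whose Vandermonde coefficient matrix is invertible precisely because the $p_i$ are distinct; solving the system recovers $c_{2k}$, and hence the number of $k$-matchings in $G$.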
The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=\poly(1,\dots,1)$, and if all the coefficients of $\poly$ are constants, then $\poly(1,\dots,1)$ (which can easily be computed in linear time) is a $p^k$-approximation to the value $\rpoly(p,\dots,p)$ that we are after. If $p$ and $k=\deg(\poly)$ are constants, then this gives a constant-factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_n)$ and take an appropriate weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) we must be able to evaluate $\poly(X_1,\dots,X_n)$ in linear time, and (ii) we must be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly. For ease of exposition, we start off with expression trees (see~\Cref{fig:intro-q2-etree} for an example) and show that they satisfy both of these properties. Later, we show that these properties also extend to polynomial circuits (essentially, within the required time bound, we can simulate access to the `unrolled' expression tree given only the polynomial circuit).
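To give a flavor of the sampling step, here is an illustrative sketch in terms of the (expanded) SOP form of $\poly$; the actual algorithm works directly on the compressed representation. Write the SOP expansion of $\poly$ as a sum of monomials $m$ with coefficients $c_m$, and let $|m|$ denote the number of distinct variables appearing in $m$, so that $\rpoly(p,\dots,p)=\sum_m c_m\cdot p^{|m|}$. Sampling a monomial $m$ with probability $|c_m|/\sum_{m'}|c_{m'}|$, returning $\mathrm{sgn}(c_m)\cdot p^{|m|}$, and scaling by $\sum_{m'}|c_{m'}|$ gives an estimate whose expectation is exactly the quantity we are after:
\begin{equation*}
\inparen{\sum_{m'}|c_{m'}|}\cdot\sum_{m}\frac{|c_m|}{\sum_{m'}|c_{m'}|}\cdot\mathrm{sgn}(c_m)\cdot p^{|m|} \;=\; \sum_{m} c_m\cdot p^{|m|} \;=\; \rpoly(p,\dots,p).
\end{equation*}
Averaging over independent samples and applying the tail bounds mentioned above yields the $(1\pm\eps)$ guarantee.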
We also formalize this claim: since our approximation algorithm runs in time linear in the size of the polynomial circuit, we show that we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to the problem of computing higher moments of the tuple multiplicity (instead of just the expectation).
\paragraph{Paper Organization.} We present some relevant background and set up our notation in~\Cref{sec:background}. We present our hardness results in~\Cref{sec:hard} and our approximation algorithm in~\Cref{sec:algo}. We present some (easy) generalizations of our results in~\Cref{sec:gen}. We give a quick overview of related work in~\Cref{sec:related-work} and conclude with some open questions in~\Cref{sec:concl-future-work}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%