master
Boris Glavic 2021-04-06 09:16:09 -05:00
parent 7d60ec8a4d
commit 1481e3b1f9
4 changed files with 14 additions and 11 deletions

@@ -134,23 +134,25 @@ The expected multiplicity of Chicago is $1.0$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
While the expected multiplicity of a query result can be computed in time linear in the size of the result's lineage if the lineage is in sum-of-products form, this need not hold for compressed representations of polynomials such as factorized polynomials and arithmetic circuits. For instance, \Cref{subfig:ex-proj-push-circ-q4} shows two circuits encoding the lineage of the result tuple $(Chicago)$ from \Cref{ex:intro-lineage}. The left one encodes the lineage as a sum of products, while the right one uses distributivity to push the addition below the multiplication, resulting in a smaller circuit. Given the large body of work that outputs such compressed representations \BG{cite FDBs and FAQ}, an interesting question is whether computing expectations is still possible in linear time for such compressed representations. We prove that this is not the case: computing the expected count of a query result tuple is super-linear (\sharpwonehard) in the size of a lineage circuit.
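To build intuition for why compression interferes with computing expectations, consider the following minimal sketch (a toy example of our own, not drawn from the running example; the variables $X, Y, Z$ and the probability $p$ are hypothetical). Over independent Boolean variables, linearity of expectation makes the sum-of-products form easy to handle, but naively pushing expectation through a circuit fails as soon as a variable is shared across a product, since $E[X^2] = \prob \neq \prob^2$ for Boolean $X$:
\begin{lstlisting}[language=Python]
import itertools

# Toy TI-DB lineage over independent Boolean variables X, Y, Z,
# each true with probability p (hypothetical example).
p = 0.3

def expectation(f):
    # Exact expectation by enumerating all 2^3 possible worlds.
    return sum(f(x, y, z)
               * (p if x else 1 - p) * (p if y else 1 - p) * (p if z else 1 - p)
               for x, y, z in itertools.product([0, 1], repeat=3))

# Sum of products X*Y + X*Z: linearity of expectation + independence
# gives E = p*p + p*p, computable in one pass over the monomials.
assert abs(expectation(lambda x, y, z: x*y + x*z) - (p*p + p*p)) < 1e-9

# Factorized circuit (X+Y)*(X+Z): naively multiplying the children's
# expectations treats the two occurrences of X as independent ...
naive = (p + p) * (p + p)                                # = 0.36
# ... but E[X^2] = p for Boolean X, so the true value differs:
exact = expectation(lambda x, y, z: (x + y) * (x + z))   # = p + 3*p^2 = 0.57
print(naive, exact)
\end{lstlisting}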
Of course, any complexity results for computing the expectation of polynomials only translate to the expected result multiplicity problem when we also take into account the complexity of constructing the lineage for a given tuple. As long as the complexity of constructing the lineage polynomial $\linsett{\query}{\pdb}{\tup}$ for a given database $\pdb$, query $\query$, and result tuple $\tup$ is at most the complexity of computing the expected multiplicity of $\linsett{\query}{\pdb}{\tup}$, our results for polynomials translate to the expected result multiplicity problem.
Concretely, we make the following contributions:
(i) We show that the expected result multiplicity problem for conjunctive queries over bag-$\ti$s is \sharpwonehard in the size of a lineage circuit, by reduction from counting the number of $k$-matchings in an arbitrary graph;
(ii) We present a $(1-\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that its complexity is linear in the size of the compressed lineage encoding;\BG{Fix not linear in all cases, restate after 4 is done}
(iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data;
(iv) We prove that for \raPlus queries, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
Our hardness results follow by considering a suitable generalization of the lineage polynomial in \cref{eq:edge-query}. First, it is easy to generalize the polynomial to $\poly_G(X_1,\dots,X_n)$, which represents the edge set of a graph $G$ on $n$ vertices. Then $\poly_G^k(X_1,\dots,X_n)$ (i.e., $\inparen{\poly_G(X_1,\dots,X_n)}^k$) encodes as its monomials all subgraphs of $G$ with at most $k$ edges. This implies that the corresponding reduced polynomial $\rpoly_G^k(\prob,\dots,\prob)$ (see \Cref{def:reduced-poly}) can be written as $\sum_{i=0}^{2k} c_i\cdot \prob^i$, and we observe that $c_{2k}$ is proportional to the number of $k$-matchings in $G$ (counting these is \sharpwonehard). Thus, if we have access to $\rpoly_G^k(\prob_i,\dots,\prob_i)$ for distinct values $\prob_i$ for $0\le i\le 2k$, then we can set up a system of linear equations and compute $c_{2k}$ (and hence the number of $k$-matchings in $G$). This result, however, does not rule out the possibility that computing $\rpoly_G^k(\prob,\dots, \prob)$ for a {\em single specific} value of $\prob$ might be easy: indeed, it is easy for $\prob=0$ or $\prob=1$. However, we are able to show that for any other value of $\prob$, computing $\rpoly_G^k(\prob,\dots, \prob)$ exactly will most likely require super-linear time. This reduction needs more work (and we cannot yet extend our results to $k>3$). Further, we have to rely on more recent conjectures in {\em fine-grained} complexity, e.g., on the complexity of counting the number of triangles in $G$, rather than on more standard parameterized hardness such as \sharpwonehard.
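To make the interpolation step concrete, the following sketch recovers $c_{2k}$ by solving the resulting (Vandermonde) system of linear equations; the coefficients and the oracle for $\rpoly_G^k(\prob,\dots,\prob)$ are hypothetical stand-ins:
\begin{lstlisting}[language=Python]
import numpy as np

k = 2
c_true = np.array([1.0, 4.0, 6.0, 3.0, 5.0])  # stand-ins for c_0, ..., c_{2k}

def rpoly_oracle(p):
    # Placeholder for evaluating rpoly_G^k(p, ..., p) = sum_i c_i * p^i.
    return sum(c * p**i for i, c in enumerate(c_true))

# Evaluate at 2k+1 distinct points and solve V c = b, where V is the
# (invertible) Vandermonde matrix V[j, i] = ps[j]**i.
ps = np.linspace(0.1, 0.9, 2*k + 1)
V = np.vander(ps, N=2*k + 1, increasing=True)
c = np.linalg.solve(V, np.array([rpoly_oracle(p) for p in ps]))
print(c[-1])   # c_{2k}, proportional to the number of k-matchings
\end{lstlisting}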
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The starting point of our approximation algorithm was the simple observation that for any lineage polynomial $\poly(X_1,\dots,X_n)$, we have $\rpoly(1,\dots,1)=\poly(1,\dots,1)$, and if all coefficients of $\poly$ are constants, then $\poly(\prob,\dots, \prob)$ (which can easily be computed in linear time) is a $\prob^k$-approximation to the value $\rpoly(\prob,\dots, \prob)$ that we are after. If $\prob$ (i.e., the \emph{input} tuple probabilities) and $k=\degree(\poly)$ are constants, then this gives a constant-factor approximation. We then use sampling to get a better approximation factor of $(1\pm \eps)$: we sample monomials from $\poly(X_1,\dots,X_n)$ and take an appropriately weighted sum of their coefficients. Standard tail bounds then allow us to get our desired approximation scheme. To get a linear runtime, it turns out that we need the following properties from our compressed representation of $\poly$: (i) we must be able to compute $\poly(1,\ldots, 1)$ in linear time, and (ii) we must be able to sample monomials from $\poly(X_1,\dots,X_n)$ quickly.
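As a minimal sketch of this sampling step, consider the following illustration of our own over an explicit sum-of-products representation (the monomials and coefficients are hypothetical; the actual algorithm samples directly from the compressed circuit, and we assume non-negative coefficients here, as produced by positive relational algebra):
\begin{lstlisting}[language=Python]
import random

# A polynomial as a map from a monomial's set of variables to its
# coefficient; after reduction, all exponents are 1, so the variable
# set suffices to evaluate rpoly at (p, ..., p).
poly = {frozenset({'X', 'Y'}): 2, frozenset({'X'}): 1, frozenset({'Y', 'Z'}): 3}
p, n_samples = 0.5, 100_000

total = sum(poly.values())        # = poly(1, ..., 1)
monomials = list(poly.items())
weights = [coeff for _, coeff in monomials]

# Sample monomials with probability proportional to their coefficients;
# each sample contributes p^(#distinct variables in the monomial).
acc = 0.0
for _ in range(n_samples):
    variables, _ = random.choices(monomials, weights=weights, k=1)[0]
    acc += p ** len(variables)
estimate = total * acc / n_samples            # unbiased for rpoly(p, ..., p)

exact = sum(coeff * p ** len(vs) for vs, coeff in poly.items())
print(estimate, exact)                        # both approximately 1.75
\end{lstlisting}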
%For ease of exposition, we start with expression trees (see~\Cref{fig:circuit-q2-intro} for an example) and show that they satisfy both of these properties. Later, we show that these properties extend to polynomial circuits as well (we essentially show that, within the required time bound, we can simulate access to the `unrolled' expression tree by considering the polynomial circuit).
We formalize this claim: since our approximation algorithm runs in time linear in the size of the polynomial circuit, we can approximate the expected output tuple multiplicities with only an $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. We present some (easy) generalizations of our results in \Cref{sec:gen}, where we also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem. Finally, we discuss related work in \Cref{sec:related-work} and conclude in \Cref{sec:concl-future-work}.

@@ -414,6 +414,7 @@
% Forcing Layouts
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\trimfigurespacing}{\vspace*{-5mm}}
\newcommand{\mypar}[1]{\smallskip\noindent\textbf{{#1}.}}
%%%Adding stuff below so that long chains of display equations can be split across pages
\allowdisplaybreaks

@@ -95,7 +95,7 @@ sensitive=true
\input{abstract}
\input{intro-new}
% \input{intro}
\input{ra-to-poly}
\input{poly-form}
\input{prob-def}

@@ -1 +1 @@
\contitem\title{Standard Operating Procedure in Bag PDBs Queries Considered Harmful}\author{Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy, and Atri Rudra}\page{23:1--23:45}