Merge branch 'master' of gitlab.odin.cse.buffalo.edu:ahuber/SketchingWorlds

master
Boris Glavic 2021-09-03 10:05:24 -05:00
commit 28a11390f2
6 changed files with 79 additions and 67 deletions

View File

@ -3,7 +3,8 @@
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In \Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries), and by extension \bi (or any $\semNX$-PDB) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
In \Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed lineage polynomial for \ti (even just based on project-join queries), and by extension \bi (or more general \abbrPDB models) %any $\semNX$-PDB)
is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.\footnote{For a very broad class of circuits: please see the discussion after \Cref{lem:val-ub} for more.}
The folowing approximation algorithm applies to \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis, as well as the PDBench benchmark~\cite{pdbench}.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.

View File

@ -32,7 +32,7 @@ Note that, if $\apolyqdt$ is given, then \Cref{prob:bag-pdb-query-eval} reduces
However, as we will show, these results also have implications for \cref{prob:bag-pdb-query-eval} when considering the cost of generating polynomials of query result tuples.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{\abbrTIDBs}
\mypar{\abbrTIDB\xplural}
%Solving~\cref{prob:bag-pdb-query-eval} for arbitrary $\pd$ is hopeless since we need exponential space to repreent an arbitrary $\pd$.
We initially focus on tuple-independent probabilistic bag-databases (\abbrTIDB),\BG{cite} a compressed encoding of probabilistic databases where the presence of each individual tuple (out of a total of $\numvar$ input tuples) in a possible world is modeled as an independent probabilistic event\footnote{
This model corresponds to the classical set-relational approach to \abbrTIDB{}s, where we can handle the case of each input tuple having its own multiplicity by replacing each input tuple with as many copies as its multiplicity. To make each duplicate tuple unique in a set-\abbrTIDB we can assign unique keys across all duplicates. This increases the size of the input but this overhead is negligible when each input tuple has constant multiplicity. %$\tup$ in $\pdb$.

0
main.synctex(busy) Normal file
View File

View File

@ -84,7 +84,7 @@ sensitive=true
\title{Fine-Grained Analysis of Query Evaluation Over Bag PDBs}
\title{Parameterized and Fine-Grained Analysis of Query Evaluation Over Bag PDBs}
\titlerunning{Bag PDB Queries}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View File

@ -3,13 +3,12 @@
\section{Hardness of exact computation}
\label{sec:hard}
In this section, we will prove that computing $\expct\limits_{\vct{W} \sim \pd}\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial $\poly(\vct{X})$ generated from a project-join query (even an expression tree representation) is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs under bag semantics. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard, even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed valued $\prob \in (0, 1)$ as long as certain popular hardness conjectures in fine-grained complexity hold.
In this section, we will prove that computing $\expct\limits_{\vct{W} \sim \pd}\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial $\poly(\vct{X})$ generated from a project-join query (even an expression tree representation) is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs under bag semantics. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard, even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed valued $\prob \in (0, 1)$ as long as certain popular hardness conjectures in fine-grained complexity hold. As mentioned previously, all proofs are in the appendix.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Preliminaries}
\AH{One of the complaints with reviewers is that the term \emph{subgraph} is not precise enough.}
Our hardness results are based on (exactly) counting the number of occurrences of a subgraph $H$ in $G$. Let $\numocc{G}{H}$ denote the number of occurrences of $H$ in graph $G$. We can think of $H$ as being of constant size and $G$ as growing. %In query processing, $H$ can be viewed as the query while $G$ as the database instance.
Our hardness results are based on (exactly) counting the number of subgraphs in $G$ isomorphic to $H$. Let $\numocc{G}{H}$ denote this quantity. We can think of $H$ as being of constant size and $G$ as growing. %In query processing, $H$ can be viewed as the query while $G$ as the database instance.
In particular, we will consider the problems of computing the following counts (given $G$ as an input and its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threedis}$ (the number of $3$-matchings), and the latter's generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings). Our hardness result in \Cref{sec:multiple-p} is based on the following result:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -41,25 +40,25 @@ $\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j.$
The hard polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}\label{def:qk}
For any graph $G=([n],\edgeSet)$ and $\kElem\ge 1$, define
For any graph $G=(V,\edgeSet)$ and $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j\right)^\kElem\]
\end{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Our hardness results only need a \ti instance; We also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that our hardness results % do not require the general circuit representation and
even hold for the expression trees. %this polynomial can be encoded in an expression tree of size $\Theta(km)$.
%Our hardness results only need a \ti instance; We also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that our hardness results % do not require the general circuit representation and
%even hold for the expression trees. %this polynomial can be encoded in an expression tree of size $\Theta(km)$.
\AH{Dangling pointers needs to be fixed.}
\noindent Returning to \Cref{fig:ex-shipping-simp}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ generalizes our running example query:
\resizebox{1\linewidth}{!}{
\begin{minipage}{1.05\linewidth}
\[\poly^k_G\dlImp OnTime(C_1),Route(C_1, C_1'),OnTime(C_1'),\dots,OnTime(C_\kElem),Route(C_\kElem,C_\kElem'),OnTime(C_\kElem')\]
\end{minipage}
}
where adapting the PDB instance in \Cref{fig:ex-shipping-simp}, relation $OnTime$ has $n$ tuples corresponding to each vertex in $\vset=[n]$ each with probability $\prob$ and $Route(\text{City}_1, \text{City}_2)$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\AH{This footnote is probably unnecessary now since we changed the example.}\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
Note that this implies that our hard query polynomial can be represented as an expression tree \AH{If we speak of an expression tree, we need to have defined this somewhere.} produced by a project-join query with same probability value for each input tuple $\prob_i$.
\AH{The above discussion from \cref{def:qk} seems to be a bit ambiguous. I'm not sure it's entirely accurate to end with $\prob_i$.}
\noindent Returning to \cref{fig:two-step}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ generalizes our example query from the introduction:
\begin{align*}
\query^k_G \coloneqq &\inparen{\project_\emptyset\inparen{OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime)}}\times_2\cdots\\
&\cdots \times_k \inparen{\project_\emptyset\inparen{OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime)}}
\end{align*}
%\resizebox{1\linewidth}{!}{
%\begin{minipage}{1.05\linewidth}
%\[\poly^k_G\dlImp OnTime(C_1),Route(C_1, C_1'),OnTime(C_1'),\dots,OnTime(C_\kElem),Route(C_\kElem,C_\kElem'),OnTime(C_\kElem')\]
%\end{minipage}
%}
where adapting the PDB instance in \cref{fig:two-step}, relation $OnTime$ has $4$ tuples corresponding to each vertex $v_i$ in $\vset$ for $i$ in $[4]$, each with probability $\prob_i$ and $Route$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
Note that this implies that our hard query polynomial can be represented as an expression tree produced by a project-join query with same probability value for each input tuple $\prob_i$.
\subsection{Multiple Distinct $\prob$ Values}
\label{sec:multiple-p}
@ -74,10 +73,16 @@ Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$ and any $(2
%
We will prove the above result by reducing from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$ where $m=\abs{\edgeSet}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. The approximation algorithm we present in \Cref{sec:algo} has runtime $O_k\inparen{m}$ for this query. % (since it runs in linear-time on all lineage polynomials).
\noindent The following lemma reduces the problem of counting $\kElem$-matchings in a graph to our problem (and proves \Cref{thm:mult-p-hard-result}):
\begin{Lemma}\label{lem:qEk-multi-p}
Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$. Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $\bigO{\kElem^3}$ time.
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%NEEDS to be moved to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\noindent The following lemma reduces the problem of counting $\kElem$-matchings in a graph to our problem (and proves \Cref{thm:mult-p-hard-result}):
%\begin{Lemma}\label{lem:qEk-multi-p}
%Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$. Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $\bigO{\kElem^3}$ time.
%\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%END move to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex

View File

@ -19,48 +19,54 @@ The above shows the hardness for a very specific query polynomial but it is easy
Unlike \Cref{thm:mult-p-hard-result} the above result does not show that computing $\rpoly_{G}^3(\prob,\dots,\prob)$ for a fixed $\prob\in (0,1)$ is \sharpwonehard.
However, in \Cref{sec:algo} we show that if we are willing to compute an approximation, then this problem (and indeed solving our problem for a much more general setting) is in linear time.
%\AH{@atri needs to put in the result for triangles of $\numvar^{\frac{4}{3}}$ runtime.}
We will prove the above result by the following reduction:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}\label{th:single-p}
Fix $\prob\in (0,1)$. Let $G$ be a graph on $\numedge$ edges.
If we can compute $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly in $T(\numedge)$ time, then we can exactly compute $\numocc{G}{\tri}$ %count the number of triangles, 3-paths, and 3-matchings in $G$
in $O\inparen{T(\numedge) + \numedge}$ time.
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Before we move on to the proof itself, we state the results, lemmas, and defintions that will be useful in the proof.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
To prove \cref{th:single-p}, we use the following notion.
\begin{Definition}\label{def:Gk}
For $\ell \geq 1$, let graph $\graph{\ell}$ be a graph generated from an arbitrary graph $G$, by replacing every edge $e$ of $G$ with a $\ell$-path, such that all inner vertexes of an $\ell$-path replacement edge are disjoint from all other vertexes.\footnote{Note that $G\equiv \graph{1}$.}% of any other $\ell$-path replacement edge. % in the sense that they only intersect at the original intersection endpoints as seen in $\graph{1}$.
\end{Definition}
The following result immediately implies \Cref{th:single-p}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Lemma}\label{lem:lin-sys}
Fix $\prob\in (0,1)$. Given $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$ for $\ell\in [2]$, we can compute in $O(m)$ time a vector $\vct{b}\in\mathbb{R}^3$ such that
\[ \begin{pmatrix}
1 - 3p & -(3\prob^2 - \prob^3)\\
10(3\prob^2 - \prob^3) & 10(3\prob^2 - \prob^3)
\end{pmatrix}
\cdot
\begin{pmatrix}
\numocc{G}{\tri}]\\
\numocc{G}{\threedis}
\end{pmatrix}
=\vct{b},
\]
allowing us to compute $\numocc{G}{\tri}$ and $\numocc{G}{\threedis}$ in $O(1)$ time.
\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%
%NEED to move to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%
%%\AH{@atri needs to put in the result for triangles of $\numvar^{\frac{4}{3}}$ runtime.}
%We will prove the above result by the following reduction:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{Theorem}\label{th:single-p}
%Fix $\prob\in (0,1)$. Let $G$ be a graph on $\numedge$ edges.
%If we can compute $\rpoly_{G}^3(\prob,\dots,\prob)$ exactly in $T(\numedge)$ time, then we can exactly compute $\numocc{G}{\tri}$ %count the number of triangles, 3-paths, and 3-matchings in $G$
%in $O\inparen{T(\numedge) + \numedge}$ time.
%\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%
%%Before we move on to the proof itself, we state the results, lemmas, and defintions that will be useful in the proof.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%
%To prove \cref{th:single-p}, we use the following notion.
%\begin{Definition}\label{def:Gk}
%For $\ell \geq 1$, let graph $\graph{\ell}$ be a graph generated from an arbitrary graph $G$, by replacing every edge $e$ of $G$ with a $\ell$-path, such that all inner vertexes of an $\ell$-path replacement edge are disjoint from all other vertexes.\footnote{Note that $G\equiv \graph{1}$.}% of any other $\ell$-path replacement edge. % in the sense that they only intersect at the original intersection endpoints as seen in $\graph{1}$.
%\end{Definition}
%
%
%
%The following result immediately implies \Cref{th:single-p}:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%\begin{Lemma}\label{lem:lin-sys}
%Fix $\prob\in (0,1)$. Given $\rpoly_{\graph{\ell}}^3(\prob,\dots,\prob)$ for $\ell\in [2]$, we can compute in $O(m)$ time a vector $\vct{b}\in\mathbb{R}^3$ such that
%\[ \begin{pmatrix}
%1 - 3p & -(3\prob^2 - \prob^3)\\
%10(3\prob^2 - \prob^3) & 10(3\prob^2 - \prob^3)
%\end{pmatrix}
%\cdot
%\begin{pmatrix}
%\numocc{G}{\tri}]\\
%\numocc{G}{\threedis}
%\end{pmatrix}
%=\vct{b},
%\]
%allowing us to compute $\numocc{G}{\tri}$ and $\numocc{G}{\threedis}$ in $O(1)$ time.
%\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%END move to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\AH{Corollary needs refinement.}
\begin{Corollary}