%root:main.tex
%!TEX root=./main.tex
\section{Hardness of exact computation}
\label{sec:hard}
In this section, we will prove that computing $\expct\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial $\poly(\vct{X})$ generated from a project-join query (even when it is given in an expression tree representation) is \sharpwonehard. Note that this implies hardness for \bis and general \abbrBPDB, answering \Cref{prob:intro-stmt} in the negative. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed value $\prob \in (0, 1)$, as long as certain popular hardness conjectures in fine-grained complexity hold.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Preliminaries}
Our hardness results are based on (exactly) counting the number of (not necessarily induced) subgraphs in $G$ isomorphic to $H$. Let $\numocc{G}{H}$ denote this quantity. We can think of $H$ as being of constant size and $G$ as growing. %In query processing, $H$ can be viewed as the query while $G$ as the database instance.
In particular, we will consider the problems of computing the following counts (given $G$ in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threedis}$ (the number of $3$-matchings), and the latter's generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings). Our hardness result in \Cref{sec:multiple-p} is based on the following result:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}[\cite{k-match}]
\label{thm:k-match-hard}
Given a positive integer $k$ and an undirected graph $G$ with no self-loops or parallel edges, computing $\numocc{G}{\kmatch}$ exactly is %counting the number of $k$-matchings in $G$ is
\sharpwonehard (parameterization is in $k$).
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The above result means that, under standard assumptions in parameterized complexity, we cannot hope to count the number of $k$-matchings in $G=(\vset,\edgeSet)$ in time $f(k)\cdot |\vset|^{c}$ for any function $f$ and constant $c$ independent of $k$. In fact, all known algorithms to solve this problem take time $|\vset|^{\Omega(k)}$.
%
Our hardness result in Section~\ref{sec:single-p} is based on the following conjectured hardness result:
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{hypo}
\label{conj:graph}
There exists a constant $\eps_0>0$ such that given an undirected graph $G=(\vset,\edgeSet)$, computing $\numocc{G}{\tri}$ exactly cannot be done in time $o\inparen{|\edgeSet|^{1+\eps_0}}$.
\end{hypo}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
The so-called {\em Triangle detection hypothesis} (cf.~\cite{triang-hard}), which states that detecting whether $G$ has a triangle takes time $\Omega\inparen{|\edgeSet|^{4/3}}$, implies that in Conjecture~\ref{conj:graph} we can take $\eps_0\ge \frac 13$.
%The current best known algorithm to count the number of $3$-matchings, to
%\AR{Need to add something about 3-paths and 3-matchings as well.}
Both of our hardness results rely on a simple lineage polynomial encoding of the edges of a graph.
To prove our hardness results, consider a graph $G=(\vset, \edgeSet)$, where $|\edgeSet| = m$ and $\vset = [\numvar]$. Our lineage polynomial has a variable $X_i$ for every $i$ in $[\numvar]$.
Consider the polynomial
$\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j.$
The hard polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Definition}\label{def:qk}
For any graph $G=(\vset,\edgeSet)$ and $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j\right)^\kElem\]
\end{Definition}
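For illustration, consider a toy instance of our own choosing (it is not the running example): the path on three vertices with $\edgeSet = \{(1,2),(2,3)\}$. Then $\poly_{G}(\vct{X}) = X_1X_2 + X_2X_3$ and, for $\kElem = 2$,
\[\poly_{G}^2(\vct{X}) = X_1^2X_2^2 + 2X_1X_2^2X_3 + X_2^2X_3^2.\]
We assume here (this is not restated in this section) that each variable is replaced by an independent $\{0,1\}$-valued random variable $W_i$ with $\probOf[W_i = 1] = \prob_i$. Under this assumption $W_i^2 = W_i$, so every exponent can be dropped before taking the expectation:
\[\expct\pbox{\poly_{G}^2(\vct{W})} = \prob_1\prob_2 + 2\prob_1\prob_2\prob_3 + \prob_2\prob_3.\]
The middle term comes from the two orderings of the pair of edges that share vertex $2$. No term of degree $4$ appears because this particular graph has no $2$-matching.
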
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Our hardness results only need a \ti instance; We also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that our hardness results % do not require the general circuit representation and
%even hold for the expression trees. %this polynomial can be encoded in an expression tree of size $\Theta(km)$.
\noindent Returning to \Cref{fig:two-step}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ generalizes our example query from the introduction. Let us alias
\begin{lstlisting}
SELECT 1 FROM OnTime a, Route r, OnTime b
WHERE a.city = r.city1 AND b.city = r.city2
\end{lstlisting}
as $R_i$ for each $i \in [k]$. The query then becomes
\begin{lstlisting}
SELECT 1 FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
\end{lstlisting}
%RA format for the same query
%\begin{align*}
%\query^k_G \coloneqq &\inparen{\project_\emptyset\inparen{OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime)}}\times_2\cdots\\
%&\cdots \times_k \inparen{\project_\emptyset\inparen{OnTime \join_{City = City_1} Route \join_{{City}_2 = City'}\rename_{City' \leftarrow City}(OnTime)}}
%\end{align*}
%\resizebox{1\linewidth}{!}{
%\begin{minipage}{1.05\linewidth}
%\[\poly^k_G\dlImp OnTime(C_1),Route(C_1, C_1'),OnTime(C_1'),\dots,OnTime(C_\kElem),Route(C_\kElem,C_\kElem'),OnTime(C_\kElem')\]
%\end{minipage}
%}
where, adapting the PDB instance in \Cref{fig:two-step}, relation $OnTime$ has $4$ tuples, one for each vertex $i$ in $[4]$, each with probability $\prob_i$, and $Route$ has tuples corresponding to the edges $\edgeSet$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ are present with probability $\prob$, but to simplify notation we assign probability $1$ to edges.}
Note that this implies that our hard lineage polynomial can be represented as an expression tree produced by a project-join query with the same probability value $\prob_i$ for each input tuple, and hence is indeed a lineage polynomial for a \abbrTIDB \abbrPDB.
\begin{Lemma}\label{lem:pdb-for-def-qk}
The relations encoding the edges for the hard query of \Cref{def:qk} can be computed in $\bigO{\numedge}$ time.
\end{Lemma}
\begin{proof}
Only two relations need to be constructed, one for the vertices and one for the edges. By a simple linear scan, each can be constructed in time $\bigO{\numedge + \numvar}$. If we assume that the number of vertices is at most a constant factor times the number of edges, i.e., $\numvar = \bigO{\numedge}$, then this is $\bigO{\numedge}$ time.
\end{proof}
\subsection{Multiple Distinct $\prob$ Values}
\label{sec:multiple-p}
%Unless otherwise noted, all proofs for this section are in \Cref{app:single-mult-p}.
We are now ready to present our main hardness result.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}\label{thm:mult-p-hard-result}
Let $\prob_0,\ldots,\prob_{2k}$ be $2k + 1$ distinct values in $(0, 1]$. Then computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$
%and any $(2k+1)$ distinct values $\prob_i$ ($0\le i \le 2k$)
is \sharpwonehard (parameterization is in $k$).
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
We will prove the above result by reducing from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$ where $m=\abs{\edgeSet}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. The approximation algorithm we present in \Cref{sec:algo} has runtime $O_k\inparen{m}$ for this query. % (since it runs in linear-time on all lineage polynomials).
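
To give some intuition for why $2\kElem+1$ distinct values appear in \Cref{thm:mult-p-hard-result}, the following is only a sketch of an interpolation argument, not the full proof. It assumes that $\rpoly_{G}^\kElem$ is obtained from $\poly_{G}^\kElem$ by replacing every exponent greater than $1$ with $1$ and that each edge is listed once in $\edgeSet$; the coefficients $c_j$ are notation introduced only for this sketch. Every monomial in the expansion of $\poly_{G}^\kElem(\vct{X})$ corresponds to an ordered choice of $\kElem$ edges, and after reduction its degree equals the number of distinct vertices these edges touch, which is at most $2\kElem$. Hence, as a function of a single value $\prob$,
\[\rpoly_{G}^\kElem(\prob,\ldots,\prob) = \sum_{j=0}^{2\kElem} c_j\cdot \prob^j, \qquad\text{with}\qquad c_{2\kElem} = \kElem!\cdot\numocc{G}{\kmatch},\]
since degree $2\kElem$ is attained exactly when the $\kElem$ chosen edges are pairwise disjoint, and each $\kElem$-matching is counted once per ordering of its edges. Evaluating at the given values yields the linear system
\[\sum_{j=0}^{2\kElem} c_j\cdot \prob_i^j = \rpoly_{G}^\kElem(\prob_i,\ldots,\prob_i), \qquad 0\le i\le 2\kElem,\]
whose coefficient matrix is a Vandermonde matrix and is therefore invertible because the $\prob_i$ are distinct. Solving it recovers $c_{2\kElem}$, and dividing by $\kElem!$ gives $\numocc{G}{\kmatch}$.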
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%NEEDS to be moved to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\noindent The following lemma reduces the problem of counting $\kElem$-matchings in a graph to our problem (and proves \Cref{thm:mult-p-hard-result}):
%\begin{Lemma}\label{lem:qEk-multi-p}
%Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$. Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $\bigO{\kElem^3}$ time.
%\end{Lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%END move to appendix
%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: