paper-BagRelationalPDBsAreHard/mult_distinct_p.tex

%root:main.tex

\section{Hardness of exact computation}
\label{sec:hard}
\AH{The notation used here is different than in~\Cref{sec:background}, in particular~\Cref{eq:expect-q-nx}.  Maybe we should decide on a notation and try to stick to it as much as possible?} 
We would like to argue for a compressed version of $\poly(\vct{X})$, in general $\expct\limits_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})}$ even for TIDB, cannot be computed in linear time. We will argue two flavors of such a hardness result. In Section~\ref{sec:multiple-p}, we argue that computing the expected value exactly for all query polynommials $\poly(\vct{X})$ for multiple values of $p$ is \sharpwonehard. However, this does not rule out the possibility of being able to solve the problem for a any {\em fixed} value of $p$ in linear time. In Section~\ref{sec:single-p}, we rule out even this possibility (based on some popular hardness conjectures in fine-grained complexity).


\subsection{Preliminaries}

Our hardness results are based on (exactly) counting the number of occurrences of a fixed graph $H$ as a subgraph in $G$. Let $\numocc{G}{H}$ denote the number of occurrences of pattern $H$ in graph $G$. %, where, for example, $\numocc{G}{\ed}$ means the number of single edges in $G$.
In particular, we will consider the problems of computing the following counts (given $G$ as an input in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threepath}$ (the number of $3$-paths),  $\numocc{G}{\threedis}$ (the number of $3$-matchings or collection of three node disjoint edges) and its generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings or collections fo $k$ node disjoint edges).


Our hardness result in Section~\ref{sec:multiple-p} is based on the following hardness result:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{Theorem}[\cite{k-match}]
\label{thm:k-match-hard}
Given a positive integer $k$ and  an undirected graph $G$ with no self-loops or parallel edges, computing $\numocc{G}{\kmatch}$ exactly is %counting the number of $k$-matchings in $G$ is 
\sharpwonehard.
\end{Theorem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The above result means that we cannot hope to count the number of $k$-matchings in $G=(V,E)$ in time $f(k)\cdot |V|^{O(1)}$ for any function $f$. In fact, all known algorithms to solve this problem take time $|V|^{\Omega(k)}$.

Our hardness result in Section~\ref{sec:single-p} is based on the following conjectured hardness result:
\begin{hypo}
\label{conj:graph}
There exists a constant $\eps_0>0$ such that given an undirected graph $G=(V,E)$, computing exactly the values $\numocc{G}{\tri}$, $\numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ cannot be done in time $o\inparen{|E|^{1+\eps_0}}$.
\end{hypo}
Based on the so called {\em Triangle detection hypothesis} (cf.~\cite{triang-hard}), which states that detection whether $G$ has a triangle or not takes time $\Omega\inparen{|E|^{4/3}}$, implies that in Conjecture~\ref{conj:graph} we can take $\eps_0\ge \frac 13$.
\AR{Need to add something about 3-paths and 3-matchings as well.}

Both of our hardness results use a query polynomial that is based on a simple encoding of the edges of a graph.
To prove our hardness result, consider a graph $G(V, E)$, where $|E| = m$, $|V| = \numvar$. Our query polynomial will have a variable $X_i$ for every $i$ in $[\numvar]$.
Now consider the query 
\[\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in E} X_i \cdot X_j.\]
The hard query polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above, i.e.
\begin{Definition}
Let $G=([n],E)$ be a graph. Then for any $\kElem\ge 1$, define
\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem.\]
\end{Definition}

Our hardness results only need a TIDB instance and further, we consider the special case when all the tuple probabilities are the same value. It is not too hard to see that we can encode the above polynomial in an expression tree of size $\Theta(km)$.

Following up on the discussion around Example~\ref{ex:intro}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ is the query polynomial corresponding to the following query:
\[\poly^k_G:- R(A_1),E(A_1,B_1),R(B_1),\dots,R(A_\kElem),E(A_\kElem,B_\kElem),R(B_\kElem)\]
where generalizaing the PDB instance in Example~\ref{ex:intro}, relation $R$ has $n$ tuples corresponding to each vertex in $V=[n]$ each with probability $p$ and $E(A,B)$ has tuples corresponding to the edges in $E$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $E$ as well but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $E$ also are present with probability $p$ but to make notation a bit simpler, we make this simplification.}

Note that this imples that our hard query polynimial can be created from a join-project query-- by contrast our approximation algorithm in Section~\ref{sec:algo} can handle lineage polynonmials generated by union of select-project-join queries. % (i.e. we do not need union or select operator to derive our hardness result).

%\AR{need discussion on the `tightness' of various params. First, this is for degree 6 poly-- while things are easy for say deg 2. Second this is for any fixed p.  Finally, we only need porject-join queries to get the hardness results. Also need to compare this with the generality of the approx upper bound results.}

\subsection{Multiple Distinct $\prob$ Values}
\label{sec:multiple-p}

We are now ready to present our main hardness result.
\begin{Theorem}\label{thm:mult-p-hard-result}
Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$ and any $(2k+1)$ values $\prob_i$ ($0\le i \le 2k$) is \sharpwonehard.
\end{Theorem}
We will prove the above result by reducing the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state of the art $k$-matching algorithms are improved, we cannot hope to have a better runtime to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its SOP form and then using~\Cref{cor:expct-sop}. By constrast our approximation algorithm would run in time $O_k\inparen{m}$ on this query (since it runs in linear-time on all query polynomials).


As mentioned earlier, we prove our hardness result by presenting a reduction from the problem of couting $\kElem$-matchings in a graph:
\begin{Lemma}\label{lem:qEk-multi-p}
Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$.  Then given the values $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$, the number of $\kElem$-matchings in $G$ can be computed in $O\inparen{\kElem^3}$ time.
\end{Lemma}

Before we prove the above Lemma, let us use it to prove~\Cref{thm:mult-p-hard-result}:
\begin{proof}[Proof of Theorem~\ref{thm:mult-p-hard-result}]
For the sake of contradiction, let us assume we can solve our problem in $f(\kElem)\cdot m^c$ time for some absolute constant $c$. Then given a graph $G$ we can compute the query polynomial $\rpoly_G^\kElem$ (in the obvious way) in $O(km)$ time. Then after we run our algorithm on $\rpoly_G^\kElem$, we get $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$ in additional $f(\kElem)\cdot m^c$ time. \Cref{lem:qEk-multi-p} then computes the number of $k$-matchings in $G$ in $O(\kElem^3)$ time. Thus, overall we have an algorithm for computing the number of $k$-matchings in time
\begin{align*}
 O(km) + f(\kElem)\cdot m^c + O(\kElem^3)
&\le \inparen{O(\kElem^3) + f(\kElem)}\cdot m^{c+1} \\
&\le \inparen{O(\kElem^3) + f(\kElem)}\cdot n^{2c+2},
\end{align*}
which contradicts~\cref{thm:k-match-hard}.
\end{proof}

Finally, we are ready to prove~\Cref{lem:qEk-multi-p}:
\begin{proof}[Proof of ~\cref{lem:qEk-multi-p}]
%It is trivial to see that one can readily expand the exponential expression by performing the $n^\kElem$ product operations, yielding the polynomial in the sum of products form of the lemma statement.  By definition $\rpoly_{G}^\kElem$ reduces all variable exponents greater than $1$ to $1$.  Thus, a monomial such as $X_i^\kElem X_j^\kElem$ is $X_iX_j$ in $\rpoly_{G}^\kElem$, and the value after substitution is $p_i\cdot p_j = p^2$.  Further, that the number of terms in the sum is no greater than $2\kElem + 1$, can be easily justified by the fact that each edge has two endpoints, and the most endpoints occur when we have $\kElem$ distinct edges (such a subgraph is also known as a $\kElem$-matching), with non-intersecting points, a case equivalent to $p^{2\kElem}$.
We first argue that $\rpoly_{G}^\kElem(\prob,\ldots, \prob) = \sum\limits_{i = 0}^{2\kElem} c_i \cdot \prob^i$.  First, since $\poly_G(\vct{X})$ has %$\kElem$ products of monomials of 
 degree $2$, it follows that $\poly_G^\kElem(\vct{X})$ has degree $2\kElem$.  
%We can further write $\poly_{G}^{\kElem}(\vct{X})$ in its expanded SOP form,
%\begin{equation*}
%\sum_{\substack{(i_1, j_1),\\\cdots,\\(i_\kElem, j_\kElem) \in E}}X_{i_1}X_{j_1}\cdots X_{i_\kElem}X_{j_\kElem}
%\end{equation*}
%Since each of $(i_1, j_1),\ldots, (i_\kElem, j_\kElem)$ are from $E$, it follows that the set of $\kElem!$ permutations of the $\kElem$ $X_iX_j$ pairs which form the monomial products are of degree $2\kElem$ with the number of distinct variables in an arbitrary monomial $\leq 2\kElem$.  
By definition, $\rpoly_{G}^{\kElem}(\vct{X})$ sets every exponent $e > 1$ to $e = 1$, which means that $\degree(\rpoly_{G}^\kElem)\le \degree(\poly_G^\kElem)=2k$. Thus, if we think of $\prob$ as a variable, then $\rpoly_{G}^{\kElem}(\prob,\dots,\prob)$ is a univariate polynomial of degree at most $\degree(\rpoly_{G}^\kElem)\le 2k$. Thus, we can write
%thereby shrinking the degree a monomial product term in the SOP form of $\poly_{G}^{\kElem}(\vct{X})$ to the exact number of distinct variables the monomial contains.  This implies that $\rpoly_{G}^\kElem$ is a polynomial of degree $2\kElem$ and hence $\rpoly_{G}^\kElem(\prob,\ldots, \prob)$ is a polynomial in $\prob$ of degree $2\kElem$.  Then it is the case that
\begin{equation*}
\rpoly_{G}^{\kElem}(\prob,\ldots, \prob) = \sum_{i = 0}^{2\kElem} c_i \prob^i
\end{equation*}

We note that $c_i$ is {\em exactly} the number of monomials in the SOP expansion of $\poly_{G}^{\kElem}(\vct{X})$ composed of $i$ distinct variables.%, with $\prob$ substituted for each distinct variable
\footnote{Since $\rpoly_G^\kElem(\vct{X})$ does not have any monomial with degree $< 2$, it is the case that $c_0 = c_1 = 0$ but for the sake of simplcity we will ignore this observation.}

Given that we then have $2\kElem + 1$ distinct values of $\rpoly_{G}^\kElem(\prob,\ldots, \prob)$ for $0\leq i\leq2\kElem$, it follows that 
%we then have $2\kElem + 1$ distinct rows of the form $\prob_i^0\ldots\prob_i^{2\kElem}$ which form a matrix $M$.  
 we have a linear system of the form $\vec{M} \cdot \vct{c} = \vct{b}$ where the $i$th row of $\vec{M}$ is $\inparen{\prob_i^0\ldots\prob_i^{2\kElem}}$, $\vct{c}$ is the coefficient vector $\inparen{c_0,\ldots, c_{2\kElem}}$, and $\vct{b}$ is the vector such that $\vct{b}[i] = \rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$.  In other words, matrix $\vec{M}$ is the Vandermonde matrix, from which it follows that we have a matrix with full rank (since the $p_i$'s are distinct), and we can solve the linear system in $O(k^3)$ time (say using Gaussian Elimination) to determine $\vct{c}$ exactly. Thus, after $O(k^3)$ work, we know $\vct{c}$ and in particular, $c_{2k}$ exactly. Next we show why we can compute $\numocc{G}{\kmatch}$ from $c_{2k}$ in $O(1)$ additional time.

%Denote the number of $\kElem$-matchings in $G$ as $\numocc{G}{\kmatch}$.  
We claim that $c_{2\kElem}$ is $\kElem! \cdot \numocc{G}{\kmatch}$.  This can be seen intuitively by looking at the original factorized representation 
\[\poly_{G}^\kElem(\vct{X}) = \sum_{\substack{(i_1, j_1),\\\cdots,\\(i_\kElem, j_\kElem) \in E}}X_{i_1}X_{j_1}\cdots X_{i_\kElem}X_{j_\kElem},\] 
where across each of the $\kElem$ products, an arbitrary $\kElem$-matching can be selected $\prod_{i = 1}^\kElem \kElem = \kElem!$ times.  Next, note that each $\kElem$-matching $(i_1, j_1)\ldots$ $(i_k, j_k)$ in $G$ corresponds to the monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ in $\poly_{G}^\kElem(\vct{X})$, with all indexes distinct.  %Since each index is distinct, then each variable has an exponent $e = 1$ and this monomial survives in $\rpoly_{G}^{\kElem}(\vct{X})$  Since $\rpoly$ contains only exponents $e \leq 1$, the only degree $2\kElem$ terms that can exist in $\rpoly_{G}^\kElem$ are $\kElem$-matchings since every other monomial in $\poly_{G}^\kElem(\vct{X})$ has strictly less than $2\kElem$ distinct variables, which, as stated earlier implies that every other non-$\kElem$-matching monomial in $\rpoly_{G}^\kElem(\vct{X})$ has degree $< 2\kElem$.
Second, the only surviving monomiasl $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ of degree exactly $2k$ in $\rpoly_{G}^{\kElem}(\vct{X})$ must have that all of $i_1,j_1,\dots,i_\kElem,j_\kElem$ are distinct in $\poly_{G}^{\kElem}(\vct{X})$. Then, by the last two statements, only monomials composed of $2k$ distinct variables in $\poly_{G}^{\kElem}(\vct{X})$ (and hence of degree $2\kElem$ in $\rpoly_{G}^{\kElem}(\vct{X})$) correspond to a $k$-matching in $G$.
%It has already been established above that a $\kElem$-matching ($\kmatch$) has coefficient $c_{2\kElem}$.  As noted, a $\kElem$-matching occurs when there are $\kElem$ edges, $e_1, e_2,\ldots, e_\kElem$, such that all of them are disjoint, i.e., $e_1 \neq e_2 \neq \cdots \neq e_\kElem$.  In all $\kElem$ factors of $\poly_{G}^\kElem(\vct{X})$ there are $k$ choices from the first factor to select an edge for a given $\kElem$ matching, $\kElem - 1$ choices in the second factor, and so on throughout all the factors, yielding $\kElem!$ duplicate terms for each $\kElem$ matching in the expansion of $\poly_{G}^\kElem(\vct{X})$.

Notice that %we have $\kElem!$ duplicates of 
each of the $k!$ permutations of an arbitrary monomial maps to the same distinct $\kElem$-matching in $G$, and this implies a $\kElem!$ to $1$ mapping between degree $2\kElem$ monomials in $\rpoly_{G}^{\kElem}(\vct{X})$ and $\kElem$-matchings in $G$. 
%we then have $\kElem!$ monomials mapped to each distinct $\kElem$-matching of $G$.  
%Since $c_{2k}$ is the cardinality of the multi-set of degree $2\kElem$ monomials in $\rpoly_{G}^{\kElem}(\vct{X})$, where each $\kElem$-matching in $G$ has $\kElem!$ monomial representations,
It then follows that $c_{2\kElem}= \kElem! \cdot \numocc{G}{\kmatch}$.
% and the fact that $c_{2\kElem}$ contains all monomials with degree $2\kElem$, it follows that $c_{2\kElem} = \kElem!\cdot\numocc{G}{\kmatch}$.  
Thus, simply dividing $c_{2\kElem}$ by $\kElem!$ gives us $\numocc{G}{\kmatch}$, as needed. % by simply dividing $c_{2\kElem}$ by $\kElem!$.
\end{proof}

\qed


%\begin{Corollary}\label{cor:lem-qEk}
%One can compute $\numocc{G}{\kmatch}$ in $\query_{G}^\kElem(\vct{X})$ exactly.
%\end{Corollary}
%
%\begin{proof}[Proof for Corollary ~\ref{cor:lem-qEk}]
%By ~\cref{lem:qEk-multi-p}, the term $c_{2\kElem}$ can be exactly computed.  Additionally we know that $c_{2\kElem}$ can be broken into two factors, and by dividing $c_{2\kElem}$ by the factor $\kElem!$, it follows that the resulting value is indeed $\numocc{G}{\kmatch}$.
%\end{proof}
%
%\qed
%\begin{Corollary}\label{cor:tilde-q-hard}
%Computing $\rpoly(\vct{X})$ is $\#W[1]$-hard.
%\end{Corollary}
%
%\begin{proof}[Proof of Corollary ~\ref{cor:tilde-q-hard}]
%The proof follows by ~\cref{thm:k-match-hard}, ~\cref{lem:qEk-multi-p} and ~\cref{cor:lem-qEk}.
%\end{proof}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: