Done with pass on S3

2021-09-15 17:15:53 -04:00 · 2021-09-15 17:15:53 -04:00 · 247a30c714
parent 17b1afb954
commit 247a30c714
2 changed files with 30 additions and 19 deletions
--- a/mult_distinct_p.tex
+++ b/mult_distinct_p.tex
@@ -3,23 +3,27 @@
 \section{Hardness of exact computation}
 \label{sec:hard}

-In this section, we will prove that computing $\expct\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial  $\poly(\vct{X})$ generated from a project-join query (even an expression tree representation) is \sharpwonehard. Note that this implies hardness for \bis and general \abbrBPDB, answering \Cref{prob:intro-stmt} in the negative. Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard, even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed valued $\prob \in (0, 1)$ as long as certain popular hardness conjectures in fine-grained complexity hold. 
+In this section, we will prove the hardness results claimed in Table~\ref{tab:lbs} for a specific family of hard instances $(\query,\pdb)$ of \Cref{prob:bag-pdb-poly-expected}, where $\pdb$ is a \abbrTIDB.
+% that computing $\expct\pbox{\poly(\vct{W})}$ exactly for a \ti-lineage polynomial  $\poly(\vct{X})$ generated from a project-join query (even an expression tree representation) is \sharpwonehard. 
+ Note that this implies hardness for \bis and general \abbrBPDB, answering \Cref{prob:bag-pdb-poly-expected} (and hence the equivalent \Cref{prob:bag-pdb-query-eval}) in the negative. 
+%Furthermore, we demonstrate in \Cref{sec:single-p} that the problem remains hard, even if $\probOf[X_i=1] = \prob$ for all $X_i$ and any fixed valued $\prob \in (0, 1)$ as long as certain popular hardness conjectures in fine-grained complexity hold. 

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Preliminaries}
 Our hardness results are based on (exactly) counting the number of (not necessarily induced) subgraphs in $G$ isomorphic to $H$. Let $\numocc{G}{H}$ denote this quantity.  We can think of $H$ as being of constant size and $G$ as growing.  %In query processing, $H$ can be viewed as the query while $G$ as the database instance.
-In particular, we will consider the problems of computing the following counts (given $G$ in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threedis}$ (the number of $3$-matchings), and the latter's generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings).  We use $\kmatchtime$ to denote the runtime of computing $\numocc{G}{\kmatch}$.  Our hardness result in \Cref{sec:multiple-p} is based on the following result:
+In particular, we will consider the problems of computing the following counts (given $G$ in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threedis}$ (the number of $3$-matchings), and the latter's generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings).  We use $\kmatchtime$ to denote the optimal runtime of computing $\numocc{G}{\kmatch}$.  Our hardness results in \Cref{sec:multiple-p} are based on the following hardness results/conjectures:

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Theorem}[\cite{k-match}]
 \label{thm:k-match-hard}
-Given positive integer $k$ and undirected graph $G$ with no self-loops or parallel edges, the time $\kmatchtime$ to compute $\numocc{G}{\kmatch}$ exactly is $\littleomega{f(k)\cdot \numedge^c}$ for any function $f$ and fixed constant $c$ independent of $\numedge$ and $k$. %counting the number of $k$-matchings in $G$ is\sharpwonehard (parameterization is in $k$).
+Given positive integer $k$ and undirected graph $G=(\vset,\edgeSet)$ with no self-loops or parallel edges, the time $\kmatchtime$ to compute $\numocc{G}{\kmatch}$ exactly is $\littleomega{f(k)\cdot |\edgeSet|^c}$ for any function $f$ and fixed constant $c$ independent of $\numedge$ and $k$ (assuming $\sharpwzero\ne\sharpwone$). %counting the number of $k$-matchings in $G$ is\sharpwonehard (parameterization is in $k$).
 \end{Theorem}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %The above result means that we cannot hope to count the number of $k$-matchings in $G=(\vset,\edgeSet)$ in time $f(k)\cdot |\vset|^{c}$ for any function $f$ and constant $c$ independent of $k$. 
-\begin{hypo}[\cite{k-match}]\label{conj:known-algo-kmatch}
-All best known algorithms to solve $\numocc{G}{\kmatch}$ take time $\kmatchtime = \bigO{|\vset|^{\Omega(k)}}$.
+\begin{hypo}\label{conj:known-algo-kmatch}
+There exists an absolute constant $c_0>0$ such that for every $G=(\vset,\edgeSet)$, we have $\kmatchtime \ge \Omega\inparen{|\edgeSet|^{c_0\cdot k}}$.
 \end{hypo}
+We note that the above conjecture is somewhat non-standard. In particular, the state-of-the-art algorithm to compute $\numocc{G}{\kmatch}$ takes time $\Omega\inparen{|V|^{k/2}}$ (i.e., if this is the best algorithm then $c_0=\frac 14$)~\cite{k-match}. In other words, the conjecture asserts that one can hope for at most a polynomial improvement over the state-of-the-art algorithm to compute $\numocc{G}{\kmatch}$.
 %
 Our hardness result in Section~\ref{sec:single-p} is based on the following conjectured hardness result:
 %
@@ -34,21 +38,21 @@ Based on the so called {\em Triangle detection hypothesis} (cf.~\cite{triang-har
 %The current best known algorithm to count the number of $3$-matchings, to
 %\AR{Need to add something about 3-paths and 3-matchings as well.}

-Both of our hardness results rely on a simple lineage polynomial encoding of the edges of a graph.
-To prove our hardness result, consider a graph $G(\vset, \edgeSet)$, where $|\edgeSet| = m$, $\vset = [\numvar]$. Our lineage polynomial has a variable $X_i$ for every $i$ in $[\numvar]$.
+All of our hardness results rely on a simple lineage polynomial encoding of the edges of a graph.
+To prove our hardness result, consider a graph $G=(\vset, \edgeSet)$, where $|\edgeSet| = m$, $\vset = [\numvar]$. Our lineage polynomial has a variable $X_i$ for every $i$ in $[\numvar]$.
 Consider the polynomial
 $\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j.$
 The hard polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above:
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Definition}\label{def:qk}
 For any graph $G=(V,\edgeSet)$ and $\kElem\ge 1$, define
-\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j\right)^\kElem\]
+\[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in \edgeSet} X_i \cdot X_j\right)^\kElem.\]
 \end{Definition}
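+For intuition, consider a small worked instance (an illustrative example only; it is not part of the hardness argument). Let $G$ be the path on four vertices with $\edgeSet = \{(1,2),(2,3),(3,4)\}$ and take $\kElem = 2$. Then
+\[\poly_{G}^2(X_1,\dots,X_4) = \left(X_1X_2 + X_2X_3 + X_3X_4\right)^2 = X_1^2X_2^2 + X_2^2X_3^2 + X_3^2X_4^2 + 2X_1X_2^2X_3 + 2X_1X_2X_3X_4 + 2X_2X_3^2X_4.\]
+The only monomial in which every variable appears with exponent one is $2X_1X_2X_3X_4$; it arises from the two orderings of the single $2$-matching $\{(1,2),(3,4)\}$ of $G$.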
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %Our hardness results only need a \ti instance; We also consider the special case when all the tuple probabilities (probabilities assigned to $X_i$ by $\probAllTup$) are the same value. Note that our hardness results % do not require the general circuit representation and
 %even hold for the expression trees. %this polynomial can be encoded in an expression tree of size $\Theta(km)$.

-\noindent Returning to \Cref{fig:two-step}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ generalizes our example query from the introduction. Let us alias 
+\noindent Returning to \Cref{fig:two-step}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ is the lineage polynomial corresponding to the query that generalizes our example query from \Cref{sec:intro}. Let us alias 
 \begin{lstlisting}
 SELECT 1 FROM OnTime a, Route r, OnTime b
 WHERE a.city = r.city1 AND b.city = r.city2
@@ -68,11 +72,17 @@ SELECT 1 FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
 %\[\poly^k_G\dlImp OnTime(C_1),Route(C_1, C_1'),OnTime(C_1'),\dots,OnTime(C_\kElem),Route(C_\kElem,C_\kElem'),OnTime(C_\kElem')\]
 %\end{minipage}
 %}
-\noindent where adapting the PDB instance in \Cref{fig:two-step}, relation $OnTime$ has $4$ tuples corresponding to each vertex for $i$ in $[4]$, each with probability $\prob_i$ and $Route$ has tuples corresponding to the edges $\edgeSet$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ also are present with probability $\prob$ but to simplify notation we assign probability $1$ to edges.}
-Note that this implies that our hard lineage polynomial can be represented as an expression tree produced by a  project-join query with same probability value for each input tuple $\prob_i$, and hence is indeed a lineage polynomial for a \abbrTIDB \abbrPDB.
+\noindent Further, the PDB instance generalizes the one in \Cref{fig:two-step} as follows. Relation $OnTime$ has $n$ tuples, one for each vertex $i$ in $[n]$, each with probability $\prob_i$, and $Route$ has tuples corresponding to the edges $\edgeSet$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $Route$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $Route$ are present with probability $\prob$, but to simplify notation we assign probability $1$ to edges.}
+In other words, for this instance, $\dbbase$ contains the set of $n$ unary tuples in $OnTime$ (which correspond to $\vset$) and $m$ binary tuples in $Route$ (which correspond to $\edgeSet$).
+Note that this implies that $\poly_{G}^\kElem$ 
+%our hard lineage polynomial can be represented as an expression tree produced by a  project-join query with same probability value for each input tuple $\prob_i$, and hence
+ is indeed a lineage polynomial for a \abbrTIDB \abbrPDB.

+Next, we note that the runtime for \abbrStepOne with $\query^k$ and $\dbbase$ as defined above is $O(\kElem\numedge)$ (i.e., \abbrStepOne is `easy' for this query):
 \begin{Lemma}\label{lem:tdet-om}
-For the $\query^k$ of \Cref{def:qk}, the runtime $\qruntime{\query^k, \dbbase}$ is $O_k(\numedge)$.
+Let $\query^k$ and $\dbbase$ be as defined above. Then
+% of \Cref{def:qk}, the runtime 
+$\qruntime{\query^k, \dbbase}$ is $O(\kElem\numedge)$.
 \end{Lemma}

 %\begin{Corollary}\label{cor:at-least-kmatch}
@@ -90,16 +100,16 @@ For the $\query^k$ of \Cref{def:qk}, the runtime $\qruntime{\query^k, \dbbase}$
 %Unless otherwise noted, all proofs for this section are in \Cref{app:single-mult-p}.
 We are now ready to present our main hardness result.
 %
-
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Theorem}\label{thm:mult-p-hard-result}
-Let $\prob_0,\ldots,\prob_{2k}$ be $2k + 1$ distinct values in $(0, 1]$.  Then computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$
+Let $\prob_0,\ldots,\prob_{2k}$ be $2k + 1$ distinct values in $(0, 1]$.  Then computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ (over all $0\le i\le 2k$) for arbitrary $G=(\vset,\edgeSet)$
 %and any $(2k+1)$ distinct values $\prob_i$ ($0\le i \le 2k$)
-is in time $\bigOmega{\kmatchtime}$.
+needs time $\bigOmega{\kmatchtime}$, assuming $\kmatchtime\ge \omega\inparen{\abs{\edgeSet}}$.
 \end{Theorem}
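+One way to see why $2k+1$ evaluation points are natural is the following sketch. It assumes that $\rpoly$ denotes the reduced form of $\poly$ in which every exponent $e>1$ is replaced by $1$; that definition lies outside this section, so treat it as an assumption here. Under this assumption, $\rpoly_G^\kElem(\prob,\dots,\prob)$ is a univariate polynomial in $\prob$ of degree at most $2\kElem$,
+\[\rpoly_G^\kElem(\prob,\dots,\prob) = \sum_{j=0}^{2\kElem} c_j\cdot \prob^j,\]
+so its values at $2\kElem+1$ distinct points determine all coefficients $c_j$ by interpolation. A monomial of $\poly_G^\kElem$ contributes to $c_{2\kElem}$ exactly when it has $2\kElem$ distinct variables, i.e., when the $\kElem$ chosen edges are pairwise vertex-disjoint, and each such edge set is chosen in $\kElem!$ orders; hence $c_{2\kElem} = \kElem!\cdot\numocc{G}{\kmatch}$.
+For example, for the path with edges $(1,2),(2,3),(3,4)$ and $\kElem=2$, one gets $\rpoly_G^2(\prob,\dots,\prob) = 3\prob^2 + 4\prob^3 + 2\prob^4$, and $c_4 = 2 = 2!\cdot 1$ matches the single $2$-matching $\{(1,2),(3,4)\}$.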
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %
-The second row of \Cref{tab:lbs} is upheld by \Cref{thm:mult-p-hard-result}, \Cref{lem:tdet-om}, and \Cref{thm:k-match-hard}.  The third row is proved by \Cref{thm:mult-p-hard-result}, \Cref{lem:tdet-om}, and \Cref{conj:known-algo-kmatch}.
+Note that the second row of \Cref{tab:lbs} follows from \Cref{prop:expection-of-polynom}, \Cref{thm:mult-p-hard-result}, \Cref{lem:tdet-om}, and \Cref{thm:k-match-hard}, while the third row follows from \Cref{prop:expection-of-polynom}, \Cref{thm:mult-p-hard-result}, \Cref{lem:tdet-om}, and \Cref{conj:known-algo-kmatch}. Since \Cref{conj:known-algo-kmatch} is non-standard, the latter hardness result should be interpreted as follows: any substantial polynomial improvement for \Cref{prob:bag-pdb-poly-expected} (over the trivial algorithm that converts $\poly$ into SMB and then runs the obvious algorithm for \abbrStepTwo) would lead to an improvement over the state-of-the-art {\em upper} bounds on $\kmatchtime$. Finally, note that \Cref{thm:mult-p-hard-result} requires computing the expected multiplicities for $(2k+1)$ distinct values of $\prob_i$, each of which corresponds to a distinct $\pd$ (for the same $\dbbase$); this explains the `Multiple' entry in the second column of the second and third rows of \Cref{tab:lbs}. Next, we argue how to get rid of this latter requirement.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%
 %NEEDS to be moved to appendix
 %%%%%%%%%%%%%%%%%%%%%%%%%%%
--- a/single_p.tex
+++ b/single_p.tex
@@ -15,9 +15,10 @@ Fix $\prob\in (0,1)$. Then assuming \Cref{conj:graph} is true, any algorithm tha
 %\begin{proof}[Proof of Corollary ~\ref{th:single-p-gen-k}]
 %Consider $\poly^3_{G}$ and $\poly' = 1$ such that $\poly'' = \poly^3_{G} \cdot \poly'$.  By  \Cref{th:single-p}, query $\poly''$ with $\kElem = 4$ has $\Omega(\numvar^{\frac{4}{3}})$ complexity.
 %\end{proof}
-The above shows the hardness for a very specific lineage polynomial but it is easy to convert this into a parameterized complexity result as follows.  One can come up with an infinite family of hard query polynomials by `embedding' $\rpoly_{G}^3$ into an infinite family of trivial query polynomials.
-Unlike \Cref{thm:mult-p-hard-result} the above result does not show that computing $\rpoly_{G}^3(\prob,\dots,\prob)$ for a fixed $\prob\in (0,1)$ is \sharpwonehard.
-However, in \Cref{sec:algo} we show that if we are willing to compute an approximation, then this problem (and indeed solving our problem for a much more general setting) is in linear time, yielding an affirmative answer to \Cref{prob:intro-stmt}.
+Note that \Cref{prop:expection-of-polynom} and \Cref{th:single-p-hard} above imply the hardness result in the first row of \Cref{tab:lbs}.
+The above shows the hardness for a very specific lineage polynomial but it is easy to convert this into a parameterized complexity result as follows.  One can come up with an infinite family of hard query polynomials by `embedding' $\rpoly_{G}^3$ into an infinite family of trivial lineage polynomials.
+%Unlike \Cref{thm:mult-p-hard-result} the above result does not show that computing $\rpoly_{G}^3(\prob,\dots,\prob)$ for a fixed $\prob\in (0,1)$ is \sharpwonehard.
+%However, in \Cref{sec:algo} we show that if we are willing to compute an approximation, then this problem (and indeed solving our problem for a much more general setting) is in linear time, yielding an affirmative answer to \Cref{prob:intro-stmt}.

 %%%%%%%%%%%%%%%%%%%%%%%%%
 %NEED to move to appendix