sec 3

2020-12-18 11:23:13 -06:00 · 2020-12-18 11:23:13 -06:00 · aaebbc7912
parent e9d13722af
commit aaebbc7912
3 changed files with 38 additions and 24 deletions
--- a/mult_distinct_p.tex
+++ b/mult_distinct_p.tex
@ -2,17 +2,21 @@

 \section{Hardness of exact computation}
 \label{sec:hard}
-\AH{The notation used here is different than in~\Cref{sec:background}, in particular~\Cref{eq:expect-q-nx}.  Maybe we should decide on a notation and try to stick to it as much as possible?} 
-We would like to argue for a compressed version of $\poly(\vct{X})$, in general $\expct\limits_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})}$ even for TIDB, cannot be computed in linear time. We will argue two flavors of such a hardness result. In Section~\ref{sec:multiple-p}, we argue that computing the expected value exactly for all query polynommials $\poly(\vct{X})$ for multiple values of $p$ is \sharpwonehard. However, this does not rule out the possibility of being able to solve the problem for a any {\em fixed} value of $p$ in linear time. In Section~\ref{sec:single-p}, we rule out even this possibility (based on some popular hardness conjectures in fine-grained complexity).
+\AH{The notation used here is different than in~\Cref{sec:background}, in particular~\Cref{eq:expect-q-nx}.  Maybe we should  decide on a notation and try to stick to it as much as possible?}
+\BG{We sometimes use $\expct_{\vct{X} \sim P}$ sometimes $\expct_{\vct{X}}$}
+In this section, we will prove that computing $\expct\limits_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})}$ for a \ti-lineage polynomial $\poly(\vct{X})$ generated from a project-join query is \sharpwonehard. Note that this implies hardness for \bis and general $\semNX$-PDBs. Furthermore,
+using popular hardness conjectures in fine-grained complexity, we demonstrate in \Cref{sec:single-p} that the problem remains hard even if $\pd(X_i) = p$ for all $X_i$ and some fixed value $p$.

+% We would like to argue for a compressed version of $\poly(\vct{X})$, in general $\expct\limits_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})}$ even for tis, cannot be computed in linear time. We will argue two flavors of such a hardness result. In Section~\ref{sec:multiple-p}, we argue that computing the expected value exactly for all query polynomials $\poly(\vct{X})$ for multiple values of $p$ is \sharpwonehard. However, this does not rule out the possibility of being able to solve the problem for any {\em fixed} value of $p$ in linear time. In Section~\ref{sec:single-p}, we rule out even this possibility (based on some popular hardness conjectures in fine-grained complexity).

+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Preliminaries}

 Our hardness results are based on (exactly) counting the number of occurrences of a fixed graph $H$ as a subgraph in $G$. Let $\numocc{G}{H}$ denote the number of occurrences of pattern $H$ in graph $G$. %, where, for example, $\numocc{G}{\ed}$ means the number of single edges in $G$.
-In particular, we will consider the problems of computing the following counts (given $G$ as an input in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threepath}$ (the number of $3$-paths),  $\numocc{G}{\threedis}$ (the number of $3$-matchings or collection of three node disjoint edges) and its generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings or collections fo $k$ node disjoint edges).
+In particular, we will consider the problems of computing the following counts (given $G$ as an input in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threepath}$ (the number of $3$-paths), $\numocc{G}{\threedis}$ (the number of $3$-matchings or collections of three node-disjoint edges) and its generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings or collections of $k$ node-disjoint edges).


-Our hardness result in Section~\ref{sec:multiple-p} is based on the following hardness result:
+Our hardness result in \Cref{sec:multiple-p} is based on the following hardness result:

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Theorem}[\cite{k-match}]
@ -25,42 +29,50 @@ Given a positive integer $k$ and  an undirected graph $G$ with no self-loops or
 The above result means that we cannot hope to count the number of $k$-matchings in $G=(V,E)$ in time $f(k)\cdot |V|^{O(1)}$ for any function $f$. In fact, all known algorithms to solve this problem take time $|V|^{\Omega(k)}$.

 Our hardness result in Section~\ref{sec:single-p} is based on the following conjectured hardness result:
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{hypo}
 \label{conj:graph}
 There exists a constant $\eps_0>0$ such that given an undirected graph $G=(V,E)$, computing exactly the values $\numocc{G}{\tri}$, $\numocc{G}{\threepath}$ and $\numocc{G}{\threedis}$ cannot be done in time $o\inparen{|E|^{1+\eps_0}}$.
 \end{hypo}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
 The so-called {\em Triangle detection hypothesis} (cf.~\cite{triang-hard}), which states that detecting whether $G$ has a triangle takes time $\Omega\inparen{|E|^{4/3}}$, implies that in Conjecture~\ref{conj:graph} we can take $\eps_0\ge \frac 13$.
 \AR{Need to add something about 3-paths and 3-matchings as well.}

 Both of our hardness results use a query polynomial that is based on a simple encoding of the edges of a graph.
 To prove our hardness result, consider a graph $G(V, E)$, where $|E| = m$, $|V| = \numvar$. Our query polynomial will have a variable $X_i$ for every $i$ in $[\numvar]$.
-Now consider the query 
+Now consider the polynomial 
 \[\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in E} X_i \cdot X_j.\]
-The hard query polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above, i.e.
+The hard polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above, i.e.
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{Definition}
 Let $G=([n],E)$ be a graph. Then for any $\kElem\ge 1$, define
 \[\poly_{G}^\kElem(X_1,\dots,X_n) = \left(\sum\limits_{(i, j) \in E} X_i \cdot X_j\right)^\kElem.\]
 \end{Definition}
-
-Our hardness results only need a TIDB instance and further, we consider the special case when all the tuple probabilities are the same value. It is not too hard to see that we can encode the above polynomial in an expression tree of size $\Theta(km)$.
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Our hardness results only need a \ti instance; further, we consider the special case where all the tuple probabilities (the probabilities assigned to $X_i$ by $\vct{p}$) are the same value. It is not too hard to see that we can encode the above polynomial in an expression tree of size $\Theta(km)$.

 Following up on the discussion around Example~\ref{ex:intro}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ is the query polynomial corresponding to the following query:
 \[\poly^k_G:- R(A_1),E(A_1,B_1),R(B_1),\dots,R(A_\kElem),E(A_\kElem,B_\kElem),R(B_\kElem)\]
-where generalizaing the PDB instance in Example~\ref{ex:intro}, relation $R$ has $n$ tuples corresponding to each vertex in $V=[n]$ each with probability $p$ and $E(A,B)$ has tuples corresponding to the edges in $E$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $E$ as well but since they always are present with probability $1$, we drop those. Our argument also works when all the tuples in $E$ also are present with probability $p$ but to make notation a bit simpler, we make this simplification.}
+where, generalizing the PDB instance in Example~\ref{ex:intro}, relation $R$ has $n$ tuples, one for each vertex in $V=[n]$, each with probability $p$, and $E(A,B)$ has tuples corresponding to the edges in $E$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $E$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $E$ are also present with probability $p$, but to simplify notation we assign probability $1$ to edges.}

-Note that this imples that our hard query polynimial can be created from a join-project query-- by contrast our approximation algorithm in Section~\ref{sec:algo} can handle lineage polynonmials generated by union of select-project-join queries. % (i.e. we do not need union or select operator to derive our hardness result).
+Note that this implies that our hard query polynomial can be created from a project-join query. By contrast, our approximation algorithm in \Cref{sec:algo} can handle lineage polynomials generated by union of select-project-join queries. % (i.e. we do not need union or select operator to derive our hardness result).

-%\AR{need discussion on the `tightness' of various params. First, this is for degree 6 poly-- while things are easy for say deg 2. Second this is for any fixed p.  Finally, we only need porject-join queries to get the hardness results. Also need to compare this with the generality of the approx upper bound results.}
+%\AR{need discussion on the `tightness' of various params. First, this is for degree 6 poly-- while things are easy for say deg 2. Second this is for any fixed p.  Finally, we only need project-join queries to get the hardness results. Also need to compare this with the generality of the approx upper bound results.}

 \subsection{Multiple Distinct $\prob$ Values}
 \label{sec:multiple-p}

 We are now ready to present our main hardness result.
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
 \begin{Theorem}\label{thm:mult-p-hard-result}
 Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$ and any $(2k+1)$ values $\prob_i$ ($0\le i \le 2k$) is \sharpwonehard.
 \end{Theorem}
-We will prove the above result by reducing the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state of the art $k$-matching algorithms are improved, we cannot hope to have a better runtime to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its SOP form and then using~\Cref{cor:expct-sop}. By constrast our approximation algorithm would run in time $O_k\inparen{m}$ on this query (since it runs in linear-time on all query polynomials).
-
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
+We will prove the above result by reduction from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its \abbrSMB form and then using \Cref{cor:expct-sop}. By contrast, the approximation algorithm we present in \Cref{sec:algo} runs in time $O_k\inparen{m}$ on this query (since it runs in linear time on all lineage polynomials).

 As mentioned earlier, we prove our hardness result by presenting a reduction from the problem of counting $\kElem$-matchings in a graph:
 \begin{Lemma}\label{lem:qEk-multi-p}
@ -68,6 +80,7 @@ Let $\prob_0,\ldots, \prob_{2\kElem}$ be distinct values in $(0, 1]$.  Then give
 \end{Lemma}

 Before we prove the above Lemma, let us use it to prove~\Cref{thm:mult-p-hard-result}:
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{proof}[Proof of Theorem~\ref{thm:mult-p-hard-result}]
 For the sake of contradiction, let us assume we can solve our problem in $f(\kElem)\cdot m^c$ time for some absolute constant $c$. Then given a graph $G$ we can compute the query polynomial $\rpoly_G^\kElem$ (in the obvious way) in $O(km)$ time. Then after we run our algorithm on $\rpoly_G^\kElem$, we get $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq 2\kElem$ in additional $f(\kElem)\cdot m^c$ time. \Cref{lem:qEk-multi-p} then computes the number of $k$-matchings in $G$ in $O(\kElem^3)$ time. Thus, overall we have an algorithm for computing the number of $k$-matchings in time
 \begin{align*}
@ -75,11 +88,13 @@ For the sake of contradiction, let us assume we can solve our problem in $f(\kEl
 &\le \inparen{O(\kElem^3) + f(\kElem)}\cdot m^{c+1} \\
 &\le \inparen{O(\kElem^3) + f(\kElem)}\cdot n^{2c+2},
 \end{align*}
-which contradicts~\cref{thm:k-match-hard}.
+which contradicts \Cref{thm:k-match-hard}.
 \end{proof}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-Finally, we are ready to prove~\Cref{lem:qEk-multi-p}:
-\begin{proof}[Proof of ~\cref{lem:qEk-multi-p}]
+Finally, we are ready to prove \Cref{lem:qEk-multi-p}:
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{proof}[Proof of \Cref{lem:qEk-multi-p}]
 %It is trivial to see that one can readily expand the exponential expression by performing the $n^\kElem$ product operations, yielding the polynomial in the sum of products form of the lemma statement.  By definition $\rpoly_{G}^\kElem$ reduces all variable exponents greater than $1$ to $1$.  Thus, a monomial such as $X_i^\kElem X_j^\kElem$ is $X_iX_j$ in $\rpoly_{G}^\kElem$, and the value after substitution is $p_i\cdot p_j = p^2$.  Further, that the number of terms in the sum is no greater than $2\kElem + 1$, can be easily justified by the fact that each edge has two endpoints, and the most endpoints occur when we have $\kElem$ distinct edges (such a subgraph is also known as a $\kElem$-matching), with non-intersecting points, a case equivalent to $p^{2\kElem}$.
 We first argue that $\rpoly_{G}^\kElem(\prob,\ldots, \prob) = \sum\limits_{i = 0}^{2\kElem} c_i \cdot \prob^i$.  First, since $\poly_G(\vct{X})$ has %$\kElem$ products of monomials of 
 degree $2$, it follows that $\poly_G^\kElem(\vct{X})$ has degree $2\kElem$.  
@ -94,7 +109,7 @@ By definition, $\rpoly_{G}^{\kElem}(\vct{X})$ sets every exponent $e > 1$ to $e
 \rpoly_{G}^{\kElem}(\prob,\ldots, \prob) = \sum_{i = 0}^{2\kElem} c_i \prob^i
 \end{equation*}

-We note that $c_i$ is {\em exactly} the number of monomials in the SOP expansion of $\poly_{G}^{\kElem}(\vct{X})$ composed of $i$ distinct variables.%, with $\prob$ substituted for each distinct variable
+We note that $c_i$ is {\em exactly} the number of monomials in the SOP\BG{\abbrSMB?} expansion of $\poly_{G}^{\kElem}(\vct{X})$ composed of $i$ distinct variables.%, with $\prob$ substituted for each distinct variable
 \footnote{Since $\rpoly_G^\kElem(\vct{X})$ does not have any monomial with degree $< 2$, it is the case that $c_0 = c_1 = 0$ but for the sake of simplicity we will ignore this observation.}

 Given that we then have $2\kElem + 1$ distinct values of $\rpoly_{G}^\kElem(\prob_i,\ldots, \prob_i)$ for $0\leq i\leq2\kElem$, it follows that 
@ -114,11 +129,9 @@ each of the $k!$ permutations of an arbitrary monomial maps to the same distinct
 %Since $c_{2k}$ is the cardinality of the multi-set of degree $2\kElem$ monomials in $\rpoly_{G}^{\kElem}(\vct{X})$, where each $\kElem$-matching in $G$ has $\kElem!$ monomial representations,
 It then follows that $c_{2\kElem}= \kElem! \cdot \numocc{G}{\kmatch}$.
 % and the fact that $c_{2\kElem}$ contains all monomials with degree $2\kElem$, it follows that $c_{2\kElem} = \kElem!\cdot\numocc{G}{\kmatch}$.  
-Thus, simply dividing $c_{2\kElem}$ by $\kElem!$ gives us $\numocc{G}{\kmatch}$, as needed. % by simply dividing $c_{2\kElem}$ by $\kElem!$.
+Thus, simply dividing $c_{2\kElem}$ by $\kElem!$ gives us $\numocc{G}{\kmatch}$, as needed. \qed % by simply dividing $c_{2\kElem}$ by $\kElem!$.
 \end{proof}
-
-\qed
-
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



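The interpolation argument in the proof of \Cref{lem:qEk-multi-p} can be sanity-checked on small graphs. The following Python sketch is an illustration only and is not part of this commit. All function names are hypothetical, and the brute-force evaluator `reduced_poly_at` (which enumerates all $m^k$ ordered edge tuples) merely stands in for the fast algorithm that the proof of \Cref{thm:mult-p-hard-result} assumes for contradiction. The sketch evaluates $\rpoly_G^\kElem(\prob_i,\ldots,\prob_i)$ at $2k+1$ distinct values in $(0,1]$, solves the resulting Vandermonde system for $c_0,\ldots,c_{2k}$ in exact rational arithmetic, and divides $c_{2k}$ by $k!$.

```python
from fractions import Fraction
from itertools import combinations, product
from math import factorial


def reduced_poly_at(edges, k, p):
    """Brute-force value of the reduced polynomial of (sum_{(i,j) in E} X_i X_j)^k
    with every variable set to p. Each ordered k-tuple of edges contributes
    p^(number of distinct vertices), because exponents > 1 are reduced to 1.
    This is exponential in k and only stands in for a hypothetical fast oracle."""
    p = Fraction(p)
    total = Fraction(0)
    for combo in product(edges, repeat=k):
        distinct = {v for e in combo for v in e}
        total += p ** len(distinct)
    return total


def solve_linear(A, b):
    """Gauss-Jordan elimination over exact Fractions (cubic in the matrix size)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # A Vandermonde matrix on distinct points is invertible, so a pivot exists.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = 1 / M[col][col]
        M[col] = [x * inv for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def count_k_matchings_via_interpolation(edges, k):
    """Recover c_{2k} from 2k+1 evaluations and return c_{2k} / k!."""
    ps = [Fraction(i + 1, 2 * k + 1) for i in range(2 * k + 1)]  # distinct, in (0, 1]
    values = [reduced_poly_at(edges, k, p) for p in ps]
    vandermonde = [[p ** i for i in range(2 * k + 1)] for p in ps]
    coeffs = solve_linear(vandermonde, values)
    return coeffs[2 * k] / factorial(k)


def count_k_matchings_brute(edges, k):
    """Directly count sets of k pairwise node-disjoint edges (for comparison)."""
    return sum(
        1
        for c in combinations(edges, k)
        if len({v for e in c for v in e}) == 2 * k
    )
```

For the 4-cycle with edges `[(0, 1), (1, 2), (2, 3), (3, 0)]` and `k = 2`, both counters return 2, corresponding to the two perfect matchings of the cycle.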
--- a/poly-form.tex
+++ b/poly-form.tex
@ -47,8 +47,8 @@ The degree of the running example polynomial is $2$. In this paper we consider o
 % \end{Assumption}
 % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-We call a polynomial $\query(\vct{X})$ a \emph{\bi-lineage polynomial} (\emph{\ti-lineage polynomial}), if
-\AH{Why is it required for the tuple to be n-ary?  I think this slightly confuses me since we have n tuples.} there exists an n-ary $\raPlus$ query $\query$, \bi $\pxdb$ (\ti $\pxdb$), and n-ary tuple $\tup$ such that $\query(\vct{X}) = \query(\pxdb)(\tup)$. % Before proceeding, note that the following is assume that polynomials are  \bis (which subsume \tis as a special case).
+We call a polynomial $\query(\vct{X})$ a \emph{\bi-lineage polynomial} (\emph{\ti-lineage polynomial}, or simply lineage polynomial), if
+\AH{Why is it required for the tuple to be n-ary?  I think this slightly confuses me since we have n tuples.} there exists an n-ary $\raPlus$ query $\query$, \bi $\pxdb$ (\ti $\pxdb$, or $\semNX$-PDB $\pxdb$), and n-ary tuple $\tup$ such that $\query(\vct{X}) = \query(\pxdb)(\tup)$. % Before proceeding, note that the following is assume that polynomials are  \bis (which subsume \tis as a special case).
 Note that \tis are a special case of \bis and, thus, the following applies to \tis as well.
 Recall that in a \bi $\pxdb$ with tuples $t_1, \ldots, t_n$, each input tuple $t_i$ is annotated with a unique variable $X_i$. The tuples of $\pxdb$ are partitioned into $\ell$ blocks $\block_1, \ldots, \block_\ell$ and each tuple $t_i$ is associated with a probability $\prob(\tup_i) = \pd[X_i = 1]$. Together with the assumptions that blocks are independent and tuples from the same block are disjoint events, $\prob$ and the blocks induce the probability distribution $\pd$ of $\pxdb$.
 We will write a \bi-lineage polynomial $\poly(\vct{X})$ for a \bi with $\ell$ blocks as
--- a/ra-to-poly.tex
+++ b/ra-to-poly.tex
@ -197,7 +197,8 @@ We are now ready to formally state the main problem addressed in this work.
 \begin{Definition}[The Expected Result Multiplicity Problem]\label{def:the-expected-multipl}
 Let $\vct{X} = (X_1, \ldots, X_n)$, and $\pdb$ be an $\semNX$-PDB over $\vct{X}$ with probability distribution $\pd$ over assignments $\vct{X}  \to [0,1]$, $\query$ an n-ary query, and $t$ an n-ary tuple.
  The \expectProblem is defined as follows:
-\AH{I think we mean $\poly(\vct{X}) = \query(\pxdb)(t)$ instead of $\poly(\vct{X}) = \query(\pdb)(t)$.  I changed the following to reflect this.}
+  \AH{I think we mean $\poly(\vct{X}) = \query(\pxdb)(t)$ instead of $\poly(\vct{X}) = \query(\pdb)(t)$.  I changed the following to reflect this.}
+  \BG{Correct}
 \begin{itemize}
 \item \textbf{Input}: Given an expression tree $\etree \in \etreeset{\smb}$ for $\poly(\vct{X}) = \query(\pxdb)(t)$
 \item \textbf{Output}: $\expct_{\vct{X} \sim \pd}[\poly(\vct{X})]$