Fixed end of lemma 3.5 proof.

master
Aaron Huber 2020-12-16 17:25:37 -05:00
parent ac684a8d47
commit fe14d240b6
2 changed files with 19 additions and 16 deletions

@ -2,14 +2,14 @@
\section{Hardness of exact computation}
\label{sec:hard}
We would like to argue for a compressed version of $\poly(\vct{w})$, in general $\expct_{\vct{w}}\pbox{\poly(\vct{w})}$ even for TIDB, cannot be computed in linear time. We will argue two flavors of such a hardness result. In Section~\ref{sec:multiple-p}, we argue that computing the expected value exactly for all query polynommials $\poly(\vct{X})$ for multiple values of $p$ is \sharpwonehard. However, this does not rule out the possibility of being able to solve the problem for a any {\em fixed} value of $p$ being say even in linear time. In Section~\ref{sec:single-p}, we rule out even this possibility (based on some popular hardness conjectures in fine-grained complexity).
\AH{The notation used here is different than in~\Cref{sec:background}, in particular~\Cref{eq:expect-q-nx}. Maybe we should decide on a notation and try to stick to it as much as possible?}
We would like to argue that, for a compressed version of $\poly(\vct{X})$, $\expct\limits_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})}$ cannot in general be computed in linear time, even for TIDBs. We will argue two flavors of such a hardness result. In Section~\ref{sec:multiple-p}, we argue that computing the expected value exactly for all query polynomials $\poly(\vct{X})$ for multiple values of $p$ is \sharpwonehard. However, this does not rule out the possibility of being able to solve the problem for any {\em fixed} value of $p$ in linear time. In Section~\ref{sec:single-p}, we rule out even this possibility (based on some popular hardness conjectures in fine-grained complexity).
\subsection{Preliminaries}
Our hardness results are based on (exactly) counting the number of occurrences of a fixed graph $H$ as a subgraph in $G$. Let $\numocc{G}{H}$ denote the number of occurrences of pattern $H$ in graph $G$. %, where, for example, $\numocc{G}{\ed}$ means the number of single edges in $G$.
In particular, we will consider the problems of computing the following counts (given $G$ as an input in its adjaceny list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threepath}$ (the number of $3$-paths), $\numocc{G}{\threedis}$ (the number of $3$-matchings or collection of three node disjoint edges) and its generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings or collections fo $k$ node disjoint edges).
In particular, we will consider the problems of computing the following counts (given $G$ as an input in its adjacency list representation): $\numocc{G}{\tri}$ (the number of triangles), $\numocc{G}{\threepath}$ (the number of $3$-paths), $\numocc{G}{\threedis}$ (the number of $3$-matchings, i.e., collections of three node-disjoint edges) and its generalization $\numocc{G}{\kmatch}$ (the number of $k$-matchings, i.e., collections of $k$ node-disjoint edges).
Our hardness result in Section~\ref{sec:multiple-p} is based on the following hardness result:
@ -33,7 +33,7 @@ Based on the so called {\em Triangle detection hypothesis} (cf.~\cite{triang-har
\AR{Need to add something about 3-paths and 3-matchings as well.}
Both of our hardness results use a query polynomial that is based on a simple encoding of the edges of a graph.
To prove our hardness result, consider a graph $G(V, E)$, where $|E| = m$, $|V| = \numvar$. Our query polynomial will have a variable $X_i$ for every $i, [\numvar]$.
To prove our hardness result, consider a graph $G(V, E)$, where $|E| = m$, $|V| = \numvar$. Our query polynomial will have a variable $X_i$ for every $i$ in $[\numvar]$.
Now consider the query
\[\poly_{G}(\vct{X}) = \sum\limits_{(i, j) \in E} X_i \cdot X_j.\]
The hard query polynomial for our problem will be a suitable power $k\ge 3$ of the polynomial above, i.e.
@ -46,7 +46,7 @@ Our hardness results only need a TIDB instance and further, we consider the spec
Following up on the discussion around Example~\ref{ex:intro}, it is easy to see that $\poly_{G}^\kElem(\vct{X})$ is the query polynomial corresponding to the following query:
\[\poly^k_G:- R(A_1),E(A_1,B_1),R(B_1),\dots,R(A_\kElem),E(A_\kElem,B_\kElem),R(B_\kElem)\]
where generalizaing the PDB instance in Example~\ref{ex:intro}, relation $R$ has $n$ tuples corresponding to each vertex in $V=[n]$ each with probability $p$ and $E(A,B)$ has tuples corresponding to the edges in $E$ (each with probability of $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $E$ as well but since they always are present with probability $1$, we drop those. Our argument also work when all the tuples in $E$ also are present with probability $p$ but to make notation a bit simpler, we make this simplification.}
where, generalizing the PDB instance in Example~\ref{ex:intro}, relation $R$ has $n$ tuples corresponding to the vertices in $V=[n]$, each with probability $p$, and $E(A,B)$ has tuples corresponding to the edges in $E$ (each with probability $1$).\footnote{Technically, $\poly_{G}^\kElem(\vct{X})$ should have variables corresponding to tuples in $E$ as well, but since they are always present with probability $1$, we drop those. Our argument also works when all the tuples in $E$ are present with probability $p$, but to keep the notation a bit simpler, we make this simplification.}
Note that this implies that our hard query polynomial can be created from a join-project query; by contrast, our approximation algorithm in Section~\ref{sec:algo} can handle lineage polynomials generated by unions of select-project-join queries. % (i.e. we do not need union or select operator to derive our hardness result).
@ -57,9 +57,9 @@ Note that this imples that our hard query polynimial can be created from a join-
We are now ready to present our main hardness result.
\begin{Theorem}\label{thm:mult-p-hard-result}
Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitraryy $G$ and any $(2k+1)$ values $\prob_i$ ($0\le i \le 2k$) is \sharpwonehard.
Computing $\rpoly_G^\kElem(\prob_i,\dots,\prob_i)$ for arbitrary $G$ and any $(2k+1)$ values $\prob_i$ ($0\le i \le 2k$) is \sharpwonehard.
\end{Theorem}
We will prove the above result by reducing the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state of the art $k$-matching algorithms are improved, we cannot hope to have a better runtime to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its SOP form and use~\Cref{cor:expct-sop}. By constrast our approximation algorithm would run in time $O_k\inparen{m}$ on this query (since it runs in linear-time on all query polynomials).
We will prove the above result by a reduction from the problem of computing the number of $k$-matchings in $G$. Given the current best-known algorithm for this counting problem, our results imply that unless the state-of-the-art $k$-matching algorithms are improved, we cannot hope to solve our problem in time better than $\Omega_k\inparen{m^{k/2}}$, which is only quadratically faster than expanding $\poly_{G}^\kElem(\vct{X})$ into its SOP form and then using~\Cref{cor:expct-sop}. By contrast, our approximation algorithm would run in time $O_k\inparen{m}$ on this query (since it runs in linear time on all query polynomials).
As mentioned earlier, we prove our hardness result by presenting a reduction from the problem of counting $\kElem$-matchings in a graph:
@ -78,7 +78,7 @@ For the sake of contradiction, let us assume we can solve our problem in $f(\kEl
which contradicts~\cref{thm:k-match-hard}.
\end{proof}
Finally, we are rerady to prove~\Cref{lem:qEk-multi-p}:
Finally, we are ready to prove~\Cref{lem:qEk-multi-p}:
\begin{proof}[Proof of~\cref{lem:qEk-multi-p}]
%It is trivial to see that one can readily expand the exponential expression by performing the $n^\kElem$ product operations, yielding the polynomial in the sum of products form of the lemma statement. By definition $\rpoly_{G}^\kElem$ reduces all variable exponents greater than $1$ to $1$. Thus, a monomial such as $X_i^\kElem X_j^\kElem$ is $X_iX_j$ in $\rpoly_{G}^\kElem$, and the value after substitution is $p_i\cdot p_j = p^2$. Further, that the number of terms in the sum is no greater than $2\kElem + 1$, can be easily justified by the fact that each edge has two endpoints, and the most endpoints occur when we have $\kElem$ distinct edges (such a subgraph is also known as a $\kElem$-matching), with non-intersecting points, a case equivalent to $p^{2\kElem}$.
We first argue that $\rpoly_{G}^\kElem(\prob,\ldots, \prob) = \sum\limits_{i = 0}^{2\kElem} c_i \cdot \prob^i$. First, since $\poly_G(\vct{X})$ has %$\kElem$ products of monomials of
@ -104,12 +104,15 @@ Given that we then have $2\kElem + 1$ distinct values of $\rpoly_{G}^\kElem(\pro
%Denote the number of $\kElem$-matchings in $G$ as $\numocc{G}{\kmatch}$.
We claim that $c_{2\kElem}$ is $\kElem! \cdot \numocc{G}{\kmatch}$. This can be seen intuitively by looking at the original factorized representation
\[\poly_{G}^\kElem(\vct{X}) = \sum_{\substack{(i_1, j_1),\\\cdots,\\(i_\kElem, j_\kElem) \in E}}X_{i_1}X_{j_1}\cdots X_{i_\kElem}X_{j_\kElem},\]
where across each of the $\kElem$ products, an arbitrary $\kElem$-matching can be selected $\prod_{i = 1}^\kElem \kElem = \kElem!$ times. Indeed, note that each $\kElem$-matching $(i_1, j_1)\ldots$ $(i_k, j_k)$ in $G$ corresponds to the monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ in $\poly_{G}^\kElem(\vct{X})$, where each index is distinct. %Since each index is distinct, then each variable has an exponent $e = 1$ and this monomial survives in $\rpoly_{G}^{\kElem}(\vct{X})$ Since $\rpoly$ contains only exponents $e \leq 1$, the only degree $2\kElem$ terms that can exist in $\rpoly_{G}^\kElem$ are $\kElem$-matchings since every other monomial in $\poly_{G}^\kElem(\vct{X})$ has strictly less than $2\kElem$ distinct variables, which, as stated earlier implies that every other non-$\kElem$-matching monomial in $\rpoly_{G}^\kElem(\vct{X})$ has degree $< 2\kElem$.
Further, the only monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ of degree exactly $2k$ in $\rpoly_{G}^{\kElem}(\vct{X})$ needs to have all of $i_1,j_1,\dots,i_\kElem,j_\kElem$ to be distinct. This every monomial of degree $2k$ in $\poly_{G}^{\kElem}(\vct{X})$ (and hence in $\rpoly_{G}^{\kElem}(\vct{X})$) corresponds to a $k$-matching in $G$.
where across the $\kElem$ factors of the product, an arbitrary $\kElem$-matching can be selected in $\prod_{i = 1}^\kElem i = \kElem!$ ways. Next, note that each $\kElem$-matching $(i_1, j_1), \ldots, (i_\kElem, j_\kElem)$ in $G$ corresponds to the monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ in $\poly_{G}^\kElem(\vct{X})$, with all indices distinct. %Since each index is distinct, then each variable has an exponent $e = 1$ and this monomial survives in $\rpoly_{G}^{\kElem}(\vct{X})$ Since $\rpoly$ contains only exponents $e \leq 1$, the only degree $2\kElem$ terms that can exist in $\rpoly_{G}^\kElem$ are $\kElem$-matchings since every other monomial in $\poly_{G}^\kElem(\vct{X})$ has strictly less than $2\kElem$ distinct variables, which, as stated earlier implies that every other non-$\kElem$-matching monomial in $\rpoly_{G}^\kElem(\vct{X})$ has degree $< 2\kElem$.
Second, any monomial $\prod_{\ell = 1}^\kElem X_{i_\ell}X_{j_\ell}$ that survives with degree exactly $2k$ in $\rpoly_{G}^{\kElem}(\vct{X})$ must have all of $i_1,j_1,\dots,i_\kElem,j_\kElem$ distinct in $\poly_{G}^{\kElem}(\vct{X})$. Then, by the last two statements, exactly the monomials composed of $2k$ distinct variables in $\poly_{G}^{\kElem}(\vct{X})$ (and hence of degree $2\kElem$ in $\rpoly_{G}^{\kElem}(\vct{X})$) correspond to $k$-matchings in $G$.
%It has already been established above that a $\kElem$-matching ($\kmatch$) has coefficient $c_{2\kElem}$. As noted, a $\kElem$-matching occurs when there are $\kElem$ edges, $e_1, e_2,\ldots, e_\kElem$, such that all of them are disjoint, i.e., $e_1 \neq e_2 \neq \cdots \neq e_\kElem$. In all $\kElem$ factors of $\poly_{G}^\kElem(\vct{X})$ there are $k$ choices from the first factor to select an edge for a given $\kElem$ matching, $\kElem - 1$ choices in the second factor, and so on throughout all the factors, yielding $\kElem!$ duplicate terms for each $\kElem$ matching in the expansion of $\poly_{G}^\kElem(\vct{X})$.
Then, since %we have $\kElem!$ duplicates of
each $k!$ permutations of a $\kElem$-matching does not change the $\kElem$-matching (but map to distinct degree exactly $2k$ monomials in the SOP expansion of $\poly_{G}^\kElem$), gives us that we have a $k!$-to-$1$ mapping between the $\kElem$-matchings in $G$ and degree exactly $2k$ monomials in the SOP expansion of $\poly_{G}^\kElem$. Recalling that $c_{2k}$ counts the number of monomials of degree $2k$ in $\poly_{G}^\kElem$ proves $c_{2\kElem}= \kElem! \cdot \numocc{G}{\kmatch}$.
Notice that %we have $\kElem!$ duplicates of
each of the $k!$ permutations of an arbitrary degree $2\kElem$ monomial maps to the same $\kElem$-matching in $G$, and this implies a $\kElem!$-to-$1$ mapping between degree $2\kElem$ monomials in $\rpoly_{G}^{\kElem}(\vct{X})$ (counted with multiplicity) and $\kElem$-matchings in $G$.
%we then have $\kElem!$ monomials mapped to each distinct $\kElem$-matching of $G$.
%Since $c_{2k}$ is the cardinality of the multi-set of degree $2\kElem$ monomials in $\rpoly_{G}^{\kElem}(\vct{X})$, where each $\kElem$-matching in $G$ has $\kElem!$ monomial representations,
It then follows that $c_{2\kElem}= \kElem! \cdot \numocc{G}{\kmatch}$.
% and the fact that $c_{2\kElem}$ contains all monomials with degree $2\kElem$, it follows that $c_{2\kElem} = \kElem!\cdot\numocc{G}{\kmatch}$.
Thus, simply dividing $c_{2\kElem}$ by $\kElem!$ gives us $\numocc{G}{\kmatch}$, as needed. % by simply dividing $c_{2\kElem}$ by $\kElem!$.
\end{proof}
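As a small sanity check of the claim $c_{2\kElem}= \kElem! \cdot \numocc{G}{\kmatch}$ (an illustrative instance chosen here, not part of the formal argument), take $\kElem = 2$ and let $G$ be the path with edges $(1,2)$, $(2,3)$, $(3,4)$, so that $\poly_G(\vct{X}) = X_1X_2 + X_2X_3 + X_3X_4$. Expanding the square gives
\[\poly_G^2(\vct{X}) = X_1^2X_2^2 + X_2^2X_3^2 + X_3^2X_4^2 + 2X_1X_2^2X_3 + 2X_2X_3^2X_4 + 2X_1X_2X_3X_4,\]
and, assuming $\rpoly$ reduces every exponent greater than $1$ to $1$ as defined earlier, we get
\[\rpoly_G^2(\vct{X}) = X_1X_2 + X_2X_3 + X_3X_4 + 2X_1X_2X_3 + 2X_2X_3X_4 + 2X_1X_2X_3X_4.\]
Hence $\rpoly_G^2(\prob,\ldots,\prob) = 3\prob^2 + 4\prob^3 + 2\prob^4$, so $c_4 = 2 = 2!\cdot 1$, which agrees with the fact that the only $2$-matching in this path is $\{(1,2),(3,4)\}$.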

@ -1,7 +1,7 @@
%root: main.tex
%!TEX root=./main.tex
%\onecolumn
\section{Background and Notation}
\section{Background and Notation}\label{sec:background}
\subsection{Probabilistic Databases (PDBs)}
@ -107,9 +107,9 @@ The closure under $\raPlus$ queries follows from the fact that an assignment $\v
Now let us consider computing the expected multiplicity of a tuple $\tup$ in the result of a query $\query$ over an $\semN$-PDB $\pdb$ using the annotation of $\tup$ in the result of evaluating $\query$ over an $\semNX$-PDB $\pxdb$ for which $\rmod(\pxdb) = \pdb$. The expectation of the polynomial $\poly = \query(\pxdb)(\tup)$ based on the probability distribution of $\pxdb$ over the variables in $\pxdb$ is:
\[
\expct_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})} = \sum_{\vct{w} \in \{0,1\}^n} \query(\assign_{\vct{w}}(\pxdb))(\tup) \cdot \pd(\vct{w})
\]
\begin{equation}
\expct_{\vct{X} \sim \pd}\pbox{\poly(\vct{X})} = \sum_{\vct{w} \in \{0,1\}^n} \query(\assign_{\vct{w}}(\pxdb))(\tup) \cdot \pd(\vct{w})\label{eq:expect-q-nx}
\end{equation}
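To illustrate the sum above on a toy instance (an assumed example, not taken from the running text), let $n = 2$, suppose $\poly(\vct{X}) = X_1X_2$, and assume that $\pd$ sets each variable to $1$ independently with probability $p$, so that $\pd(\vct{w}) = p^{w_1 + w_2}(1-p)^{2 - w_1 - w_2}$. Only $\vct{w} = (1,1)$ contributes a nonzero value, and
\[\expct_{\vct{X} \sim \pd}\pbox{X_1X_2} = 0\cdot(1-p)^2 + 0\cdot p(1-p) + 0\cdot (1-p)p + 1\cdot p^2 = p^2.\]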
Since $\semNX$-PDBs $\pxdb$ are a complete representation system for $\semN$-PDBs which are closed under $\raPlus$, computing the expectation of the multiplicity of a tuple $t$ in the result of an $\raPlus$ query over the $\semN$-PDB $\rmod(\pxdb)$ is the same as computing the expectation of the polynomial $\query(\pxdb)(t)$.