%root: main.tex
\section{Polynomial Formulation}
Let $\vect_1,\ldots, \vect_\numTup$ be vectors annotating $\numTup$ tuples in a TIDB such that \begin{equation}\vect_i[(\wbit_1,\ldots,\wbit_\numTup)] =
\begin{cases} 1 &\wbit_i = 1,\\
0 &\text{otherwise}.\label{eq:vec-def}
\end{cases}
\end{equation}
Here, vectors are indexed by the $\numTup$-bit binary tuple $\wVec = (\wbit_1,\ldots,\wbit_\numTup)$, so that each possible world is identified with its bit vector $\wVec$.
\AR{I do not see why we need to define $\vect$-- everything can be defined without bringing the $\vect$ in. I would recommend that in Section 1 you define things without going into $\vect$-- i.e. state the DB queries and TIDBs and directly define the query polynomial.}
%---We have chosen to ignore the vector formulation
%Futher we define the polynomial $\poly(\vect_1,\ldots,\vect_\numTup)$ as an arbitrary polynomial defined over the input vectors, whose addition and multiplication operations are defined as the traditional point-wise vector addition and multiplication. We overload notation and denote the $i^{th}$ world of $\poly$ as $\poly[\wVec]$, where $\poly$ can be viewed as the output annotation vector, and the L-1 norm can be represented as
%
%\[\norm{\poly}_1 = \sum\limits_{\wVec \in \wSet} \poly[\wVec].\]<---technically incorrect when we consider negative values in \poly
%------
Define $\poly(X_1,\ldots, X_\numTup)$ as a polynomial whose variables represent the tuple annotations of an arbitrary query. While recent research has benefited from viewing the possible worlds problem through the annotation vectors $\vect_i$, notice that \cref{eq:vec-def} is equivalent to $\vect_i[\wVec] = \wbit_i$. With this observation we can reformulate the problem by viewing $\poly$ as a polynomial over bit values rather than over vectors: each input is a bit of the vector $\wVec$, so we can replace each variable of $\poly$ with its corresponding bit and evaluate $\poly$ on the particular world $\wVec$. The output we desire is
\[\expct_{\wVec}\pbox{\poly(\wVec)} = \sum\limits_{\wVec \in \{0, 1\}^\numTup} \poly(\wVec)\prod_{\substack{i \in [\numTup]\\ s.t. \wbit_i = 1}}\prob_i \prod_{\substack{i \in [\numTup]\\s.t. \wbit_i = 0}}\left(1 - \prob_i\right).\]
\AR{The above should go into Section 1 (without using $\vect$ of course). And as I mentioned in my comment in Sec 1, you need to figure out a notation for the queries. Check with Oliver on what is standard notation in the PDB literature (unless you know the standard notation in which case no need to ask Oliver :-).
Also it might be worthwhile to define a notation for the probability that the world is the specific $\wVec$-- then you can define the expectation for PDB models other than TIDBs.}
Further, define $\rpoly(X_1,\ldots, X_\numTup)$ as the reduced version of $\poly(X_1,\ldots, X_\numTup)$, of the form
\[\rpoly(X_1,\ldots, X_\numTup) = \poly(X_1,\ldots, X_\numTup) \mod X_1^2 - X_1 \cdots \mod X_\numTup^2 - X_\numTup.\]
\AR{the $w_i$'s should be $X_i$'s. A general comment: to make things clearer, always use $X_i$'s to denote the variabls and $w_i$'s to denote the values that we substitute the variables with.}
Intuitively, $\rpoly(X_1,\ldots, X_\numTup)$ is the expanded sum of products form of $\poly(X_1,\ldots, X_\numTup)$ in which every exponent $e > 1$ is reduced to $1$, i.e. $X_j^e \mapsto X_j$ for all $e > 1$. The usefulness of this reduction will be seen shortly.
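For example, if $\poly(X_1, X_2) = (X_1 + X_2)^2 = X_1^2 + 2X_1X_2 + X_2^2$, then $\rpoly(X_1, X_2) = X_1 + 2X_1X_2 + X_2$: the coefficients are untouched and only the exponents collapse.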
\AR{The intuition above should be given for the variable setting: i.e. using $X_i$ instead of $w_i$.}
\AR{You should first state a lemma that show what $\rpoly$ looks like given $\poly(X_1,\ldots, X_\numTup) = \sum_{\vct{d} \in \{0,\ldots, D\}^\numTup}q_{\vct{d}}\cdot \prod_{i = 1\text{ s.t. }d_i \geq 1}^\numTup X_i^{d_i}$.}
\AR{The statement below should be typeset as a proposition.}
First, note the following fact:
\[\text{For all } (\wbit_1,\ldots, \wbit_\numTup) \in \{0, 1\}^\numTup, \poly(\wbit_1,\ldots, \wbit_\numTup) = \rpoly(\wbit_1,\ldots, \wbit_\numTup).\]
\begin{proof}
For all $b \in \{0, 1\}$ and all $e \geq 1$, $b^e = b$.\qed
\end{proof}
\AR{The statement below should be a lemma.}
\begin{Property}\label{prop:l1-rpoly-numTup}
The expectation of $\poly$ over the possible worlds is equal to $\rpoly$ evaluated at the tuple probabilities $(\prob_1,\ldots, \prob_\numTup)$:
\begin{equation*}
\expct_{\wVec}\pbox{\poly(\wVec)} = \rpoly(\prob_1,\ldots, \prob_\numTup).
\end{equation*}
\end{Property}
\begin{proof}
%Using the fact above, we need to compute \[\sum_{(\wbit_1,\ldots, \wbit_\numTup) \in \{0, 1\}}\rpoly(\wbit_1,\ldots, \wbit_\numTup)\]. We therefore argue that
%\[\sum_{(\wbit_1,\ldots, \wbit_\numTup) \in \{0, 1\}}\rpoly(\wbit_1,\ldots, \wbit_\numTup) = 2^\numTup \cdot \rpoly(\frac{1}{2},\ldots, \frac{1}{2}).\]
Let $\poly$ be a general polynomial over $\numTup$ variables, written in sum of monomials form with every exponent at most $D$:
\[\poly(X_1,\ldots, X_\numTup) = \sum_{\vct{d} \in \{0,\ldots, D\}^\numTup}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numTup X_i^{d_i}.\]
\AR{It should be $q_{\vct{d}}$ and not $q_{d_i}$ and this change needs to be propagated.}
Then for expectation we have
\begin{align*}
\expct_{\wVec}\pbox{\poly(\wVec)} &= \sum_{\vct{d} \in \{0,\ldots, D\}^\numTup}q_{\vct{d}}\cdot \expct_{\wVec}\pbox{\prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numTup w_i^{d_i}}\\
&= \sum_{\vct{d} \in \{0,\ldots, D\}^\numTup}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numTup \expct_{\wVec}\pbox{w_i^{d_i}}\\
&= \sum_{\vct{d} \in \{0,\ldots, D\}^\numTup}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numTup \expct_{\wVec}\pbox{w_i}\\
&= \sum_{\vct{d} \in \{0,\ldots, D\}^\numTup}q_{\vct{d}}\cdot \prod_{\substack{i = 1\\s.t. d_i \geq 1}}^\numTup \prob_i\\
&= \rpoly(\prob_1,\ldots, \prob_\numTup).
\end{align*}
\AR{General comment on when you have a sequence of equalities/inequalities. Always number them and when you are trying to justify them refer to the specific number. Also I think the 2nd and 3rd justification in the para below probably need to be switched?}
We justify the equalities above in order. The first equality follows by linearity of expectation. The second follows since the tuples of a TIDB, and hence the $w_i$'s, are independent, so the expectation of a product of distinct $w_i$'s equals the product of their expectations. The third follows since $w_i \in \{0, 1\}$, which implies $w_i^{e} = w_i$ for any exponent $e \geq 1$. The fourth follows from the definition of a TIDB: the expectation of a tuple's annotation is exactly its probability $\prob_i$. Finally, by construction $\rpoly$ is obtained from $\poly$ by replacing each factor $X_i^{d_i}$ with $d_i \geq 1$ by $X_i$, so $\rpoly(\prob_1,\ldots, \prob_\numTup) = \sum_{\vct{d} \in \{0,\ldots, D\}^\numTup}q_{\vct{d}}\cdot \prod_{i = 1 \text{ s.t. } d_i \geq 1}^{\numTup} \prob_i$, which is exactly the last expression above.\AR{The last claim needs more argument-- but this should be easy once you put in the lemma for $\rpoly$ that I asked you to put in one of the comments above.}
%Note that for any single monomial, this is indeed the case since the variables in a single monomial are independent and their joint probability equals the product of the probabilities of each variable in the monomial, i.e., for monomial $M$, $\prob[M] = \prod_{x_i \in M}\prob[x_i].$ This is equivalent to the sum of all probabilities of worlds where each variable in $M$ is a $1$. Since $1$ is the identity element, it is also the case that $\prod_{x_i \in M}\prob[x_i] = \ex{M}$. (Note all other terms in the expectation will not contribute since $M$ will equal $0$, and a product containing a factor of $0$ always equals $0$.) It follows then that $\ex{M} = \rpoly(\prob_1,\ldots, \prob_\numTup)$.
%
%The final result follows by the fact that $\rpoly$ is a sum of monomials, and we can, by linearity of expectation, equivalently push the expectation through the sum and into the monomials.\qed
\qed
\end{proof}
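As an informal sanity check, not part of the formal development, the following Python sketch brute-forces the expectation over all $2^3$ possible worlds of a three-tuple TIDB for a small hand-picked polynomial and compares it against $\rpoly$ evaluated at the tuple probabilities (the example polynomial and all names are illustrative only):
\begin{verbatim}
from itertools import product

# Example polynomial: poly(X1, X2, X3) = X1^2*X2 + X2*X3 (illustrative only).
def poly(w):
    x1, x2, x3 = w
    return x1**2 * x2 + x2 * x3

# Its reduction: every exponent e >= 1 collapses to 1.
def rpoly(w):
    x1, x2, x3 = w
    return x1 * x2 + x2 * x3

probs = [0.3, 0.5, 0.9]   # arbitrary tuple probabilities p_1, p_2, p_3

# Brute-force E[poly(W)] over all 2^3 possible worlds of the TIDB.
expectation = 0.0
for world in product([0, 1], repeat=3):
    weight = 1.0
    for w_i, p_i in zip(world, probs):
        weight *= p_i if w_i == 1 else 1 - p_i
    expectation += weight * poly(world)

assert abs(expectation - rpoly(probs)) < 1e-9
print(expectation, rpoly(probs))   # both approximately 0.6
\end{verbatim}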
%\begin{Property}\label{prop:exp-rpoly}
%For the case of general $\prob$, where each tuple in the TIDB is present with probability $\prob$, the expectation of polynomial $\poly$ is equal to $\rpoly(\prob,\ldots, \prob).$
%\end{Property}
%
%\begin{proof}
%Note that $\poly$ has an equivalent sum of products form such that $\poly$ is the sum of monomials. By linearity of expectation, the expectation of $\poly$ is equivalent to expectation of each monomial in $\poly$.
%
%Note further, that for the general monomial, the only operation is product. In the binary (TIDB) setting, if any of the variables in the monomial are zero, then the whole monomial becomes zero. Note that this case contributes nothing to the expectation. Notice also, that if the variables are all one, then their product is one, and the product of the identity element with the product of probabilities is the product of probabilities. This is the only condition which contributes to the expectation. It is therefore necessary to know which worlds the variables in the monomial are all equal to one, and the expectation then is the sum of the probabilities of each world for which the monomial's variables are equal to one. This sum of probabilities is also known as the marginal probability, and this sum is always equal the overall probability of the variables in the monomial. It then stands that the general monomial's expectation is equal to the product of its variable's overall probabilities. The sum of all such expecations is exactly the definition of $\rpoly(\prob,\ldots, \prob)$. Let $M$ be a monomial, $t$ be the number of monomials in $\poly$, and $x_1(\cdots x_v)$ represent the product variable(s) in $M_i$. Then
%\begin{equation*}
%\ex{M_1 + \cdots + M_t} = \ex{M_1} +\cdots+\ex{M_t}
%\end{equation*}
%For any $M_i$,
%\begin{align*}
%&\ex{M_i} = \ex{x_1(\cdots x_v)} = \sum_{(\wElem_1,\ldots, \wElem_N) \in \{0, 1\}^\numTup} x_1(\cdots x_v) \cdot p_1\cdots p_v\\
%&=\sum_{\substack{(\wElem_1,\ldots, \wElem_\numTup) \in \{0, 1\}\\\
% s.t. \forall i' \in \{j | x_J \in M_i\}\\
% \wElem_{i'} = 1}} x_1(\cdots x_v) \prod_{i' \in \{j | x_j \in M_i\}} p_i \prod_{\substack{i'' \not\in \{j | x_j \in M_i\}\\ \wElem_{i''} = 1}}p_{i''}\prod_{\substack{i'' \not\in \{j | x_j \in M_i\}\\ \wElem_{i'''} = 0}} 1 - p_{i'''}\\
%&=\prod_{i' \in \{j | x_j \in M_i\}}\prob_{i'}\\
%&\implies \ex{M_1} +\cdots+\ex{M_t} = \prod_{i_1 \in \{j | x_j \in M_1\}}\prob_{i_1} +\cdots+\prod_{i_v \in \{j | x_j \in M_v\}} \prob_{i_v}\\
%&=\rpoly(\prob_1,\ldots, \prob_\numTup).\qed
%\end{align*}
%\AH{I just realized, that I could have saved a lot of time by noting that for the case of TIDB, all monomial variables in $M_i$ are independent, and then using linearity of expectation to conclude the proof.}
%\end{proof}
\begin{Corollary}
If $\poly$ is given to us in sum of monomials form, then the expectation of $\poly$, i.e. $\ex{\poly}$, can be computed in $O(|\poly|)$ time, where $|\poly|$ denotes the total number of multiplication/addition operators.
\end{Corollary}
The corollary follows from the proof of \cref{prop:l1-rpoly-numTup}: given $\poly$ in sum of monomials form, $\rpoly$ is obtained by dropping every exponent larger than one, and evaluating the resulting polynomial at $(\prob_1,\ldots, \prob_\numTup)$ requires a single pass over the monomials, i.e. $O(|\poly|)$ time.\AR{It does not follow from the statement of \cref{prop:l1-rpoly-numTup} but rather its proof. So this statement needs its proof as well.}
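As a minimal sketch of this one-pass evaluation (under the assumption, purely for illustration, that $\poly$ is handed to us as a list of monomials, each a coefficient together with a map from variable index to exponent):
\begin{verbatim}
# poly is a list of monomials (coefficient, {variable index: exponent}).
def expectation_from_monomials(monomials, probs):
    total = 0.0
    for coeff, exponents in monomials:
        term = coeff
        for i in exponents:      # the exponent value is irrelevant once it is >= 1
            term *= probs[i]
        total += term
    return total

# poly(X_0, X_1, X_2) = 2*X_0^3*X_1 + X_2^2
monomials = [(2, {0: 3, 1: 1}), (1, {2: 2})]
print(expectation_from_monomials(monomials, [0.5, 0.5, 0.5]))   # 2*0.25 + 0.5 = 1.0
\end{verbatim}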
\subsection{When $\poly$ is not in sum of monomials form}
\AR{I made my pass on a printout when this section has a partial proof of Claim 1. So this section will not have much comments beyond that. However my comments have implications for rest of the section, so will make my pass once the comments below have been propagated to the rest of the section.}
We would like to argue that, in the general case, the expectation cannot be computed in linear time.
To this end, consider a graph $G(V, E)$ with $|E| = m$ and $|V| = \numTup$, and consider the query $q_E(\wElem_1,\ldots, \wElem_\numTup) = \sum\limits_{(i, j) \in E} \wElem_i \cdot \wElem_j$.\AR{Again the query polynomial should have $X_i$ as variables.}
\AR{The two lemmas need to be re-written once notation for representing a query is finalized in Section 1.}
\begin{Lemma}\label{lem:gen-p}
If we can compute $\rpoly(\prob,\ldots, \prob)$, where $\poly(\wElem_1,\ldots, \wElem_\numTup) = q_E(\wElem_1,\ldots, \wElem_\numTup)^3$, in $T(m)$ time for a fixed value of $\prob$,\AR{The statement so far technically does not make sense since the definition of $\poly(\wElem_1,\ldots, \wElem_\numTup)$ does not have $p$ anywhere in it. See my comment above.} then we can count the number of triangles in $G$ in $T(m) + O(m)$ time.
\AR{ANY math notation e.g. T(m) should always be in math mode, like so $T(m)$.}
\AR{Also your job is {\bf not} to just TeX up what is in the hand-written notes-- you need to verify that the statement is correct and modify the statement as necessary. E.g. I think the final claim above should be for 3-matchings and not triangles.}
\end{Lemma}
\begin{Lemma}\label{lem:const-p}
If we can compute $\rpoly(\prob,\ldots, \prob)$, where $\poly(\wElem_1,\ldots, \wElem_\numTup) = q_E(\wElem_1,\ldots, \wElem_\numTup)^3$, in $T(m)$ time for $O(1)$ distinct values of $\prob$, then we can count the number of triangles, the number of 3-paths, and the number of 3-matchings in $G$ in $O(T(m) + m)$ time.
\end{Lemma}
\AR{This warmup should not be in the actual paper since it is not quite relevant but it is fine to keep for now.}
First, let us do a warm-up by computing $\rpoly_2(\prob,\ldots, \prob)$, where $\poly_2(\wElem_1,\ldots, \wElem_\numTup) = q_E(\wElem_1,\ldots, \wElem_\numTup)^2$. Before doing so, we introduce some notation. Let $\numocc{H}$ denote the number of occurrences of the pattern $H$ in $G$. So, e.g., $\numocc{\ed}$ is the number of edges, $m$, in $G$.
\AR{Sorry I should have made this more explicit in the hand-written notes. The notation of $\twopath$ and $\twodis$ is {\bf not} standard notation and we should not use it like in the handwritten notes. There are two options: we could have explicit notation (like $H_{\text{triang}}$) or if you want the figure notation then the edge actually needs to look like an edge-- i.e. the nodes should show up as well-- i.e. the figures for the sub-graphs should look {\bf exactly} like in the hand-written notes. I have seen this done in other papers but I personally do not know how to do this in latex-- you'll need to figure this out on your own if you use this option. I personally am fine with either option (check if Oliver has a preference though).}
\AR{Also we should discuss if $\numocc{H}$ is the best notation. E.g. one could use $\#\textsc{triang}(G)$ to denote the number of triangles in $G$ and so on. This might help with the above comment as well.}
\begin{Claim}
\begin{enumerate}
\item $\rpoly_2(\prob,\ldots, \prob) = \numocc{\ed} \cdot \prob^2 + 2\cdot \numocc{\twopath}\cdot \prob^3 + 2\cdot \numocc{\twodis}\cdot \prob^4$
\item We can compute $\rpoly_2(\prob,\ldots, \prob)$ in $O(m)$ time.
\end{enumerate}
\end{Claim}
\AR{Note on latex use-- the begin\{claim\} and end\{claim\} should only be around the statement of the claim and not include the proof inside it as well.}
\AR{Also the claim statement should only include the 2nd part. The first part is only useful in proving the 2nd part so no need to explicitly state it in the claim statement itself.}
\begin{proof}
The proof basically follows by definition.
\begin{enumerate}
\item First note that
\begin{align*}
\poly_2(\wVec) &= \sum_{(i, j) \in E} (\wElem_i\wElem_j)^2 + \sum_{\substack{(i, j), (k, \ell) \in E\\s.t. (i, j) \neq (k, \ell)}} \wElem_i\wElem_j\wElem_k\wElem_\ell\\
&= \sum_{(i, j) \in E} (\wElem_i\wElem_j)^2 + \sum_{\substack{(i, j), (j, \ell) \in E\\s.t. i \neq \ell}}\wElem_i\wElem_j^2\wElem_\ell + \sum_{\substack{(i, j), (k, \ell) \in E\\s.t. i, j, k, \ell \text{ distinct}}} \wElem_i\wElem_j\wElem_k\wElem_\ell
\end{align*}
By definition,
\begin{equation}
\rpoly_2(\wVec) = \sum_{(i, j) \in E} \wElem_i\wElem_j + \sum_{\substack{(i, j), (j, \ell) \in E\\s.t. i \neq \ell}}\wElem_i\wElem_j\wElem_\ell + \sum_{\substack{(i, j), (k, \ell) \in E\\s.t. i, j, k, \ell \text{ distinct}}} \wElem_i\wElem_j\wElem_k\wElem_\ell\label{eq:part-1}
\end{equation}
Notice that, when evaluated at $\wVec = (\prob,\ldots, \prob)$, the first term is $\numocc{\ed}\cdot \prob^2$, the second $2\cdot\numocc{\twopath}\cdot \prob^3$, and the third $2\cdot\numocc{\twodis}\cdot \prob^4$, where the factors of $2$ account for each unordered pair of distinct edges appearing twice in the corresponding sum.
\item Note that
\begin{align*}
&\numocc{\ed} = m,\\
&\numocc{\twopath} = \sum_{u \in V} \binom{d_u}{2}, \text{ where $d_u$ is the degree of vertex $u$, and}\\
&\numocc{\twodis} = \binom{m}{2} - \numocc{\twopath},
\end{align*}
where the last identity holds since every unordered pair of distinct edges either shares a vertex (a $\twopath$) or does not (a $\twodis$).
\end{enumerate}
Thus, since each of these quantities can be computed in $O(m)$ time, \cref{eq:part-1} implies that $\rpoly_2(\prob,\ldots, \prob)$ can be computed in $O(m)$ time.\qed
\end{proof}
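For illustration, the $O(m)$ computation above can be sketched in a few lines of Python, assuming the graph is given as a list of undirected edges (the representation and names are illustrative only):
\begin{verbatim}
from math import comb
from collections import Counter

def rpoly2_at_p(edges, p):
    """Compute rpoly_2(p, ..., p) for poly_2 = q_E^2 in O(m) time."""
    m = len(edges)
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    two_paths = sum(comb(d, 2) for d in deg.values())  # edge pairs sharing a vertex
    two_matchings = comb(m, 2) - two_paths             # all remaining edge pairs
    return m * p**2 + 2 * two_paths * p**3 + 2 * two_matchings * p**4

print(rpoly2_at_p([(0, 1), (1, 2), (2, 3)], 0.5))      # 1.375 for the 3-edge path
\end{verbatim}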
We are now ready to state the claim we need to prove \cref{lem:gen-p} and \cref{lem:const-p}.
Let $\poly_3(\wVec) = q_E(\wVec)^3$.
\begin{Claim}
\begin{enumerate}
\item\label{claim:four-one} $\rpoly_3(\prob,\ldots, \prob) = \numocc{\ed}\prob^2 + 6\numocc{\twopath}\prob^3 + 6\numocc{\twodis}\prob^4 + 6\numocc{\tri}\prob^3 + 6\numocc{\oneint}\prob^4 + 6\numocc{\threepath}\prob^4 + 6\numocc{\twopathdis}\prob^5 + 6\numocc{\threedis}\prob^6.$
\item\label{claim:four-two} The quantities $\numocc{\ed}$, $\numocc{\twopath}$, $\numocc{\twodis}$, $\numocc{\oneint}$, and $\numocc{\twopathdis} + 3\cdot\numocc{\threedis}$ can be computed in $O(m)$ time.
\end{enumerate}
$\implies$ If one can compute $\rpoly_3(\prob,\ldots, \prob)$ in time $T(m)$, then we can compute the following in $O(T(m) + m)$ time:
\[\numocc{\tri} + \numocc{\threepath} \cdot \prob - \numocc{\threedis}\cdot(3\prob^2 - \prob^3).\]
\end{Claim}
\AR{The claim statement should only include the implication in the 2nd part. The first part is only useful in proving the 2nd part so no need to explicitly state it in the claim statement itself. As a general note: the handwritten notes were written in haste-- you should not assume that notation/statements of claims are the final word. Think if they make sense and/or how you can improve them.}
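As a quick sanity check of \cref{claim:four-one}, consider the triangle graph $K_3$: there $\numocc{\ed} = 3$, $\numocc{\twopath} = 3$, $\numocc{\tri} = 1$, and every other pattern has count $0$, so the formula predicts $\rpoly_3(\prob,\ldots, \prob) = 3\prob^2 + 18\prob^3 + 6\prob^3 = 3\prob^2 + 24\prob^3$. Expanding $q_E^3$ directly over the $27$ ordered triples of the three edges gives the same: the $3$ triples with all edges equal contribute $3\prob^2$, the $18$ triples with exactly two distinct edges contribute $18\prob^3$ (every pair of edges of $K_3$ shares a vertex), and the $6$ orderings of all three distinct edges contribute $6\prob^3$.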
\begin{proof}
By definition we have that
\[\poly_3(\wElem_1,\ldots, \wElem_\numTup) = \sum_{\substack{(i_1, j_1),\\ (i_2, j_2),\\ (i_3, j_3) \in E}} \prod_{\ell = 1}^{3}\wElem_{i_\ell}\wElem_{j_\ell}.\]
Rather than expand the sum in full detail, let us make some observations about its structure. Let $e_1 = (i_1, j_1), e_2 = (i_2, j_2), e_3 = (i_3, j_3)$, so that each term of the sum corresponds to a triple $(e_1, e_2, e_3)$ of (not necessarily distinct) edges. There are three forms such a triple can take.
\underline{case 1:} $e_1 = e_2 = e_3$, i.e., all three edges are the same. There are exactly $m$ such triples, each contributing a $\prob^2$ factor after reduction.
\underline{case 2:} Exactly two of the three edges are distinct. For a fixed pair of distinct edges there are $2^3 - 2 = 6$ such ordered triples, and all of them reduce to the same monomial in $\rpoly_3$, e.g., $(e_1, e_1, e_2)$ and $(e_2, e_1, e_2)$ yield the same reduced monomial. This case produces the edge patterns $\twopath$ and $\twodis$.
\underline{case 3:} All three edges are pairwise distinct, and each unordered set of three distinct edges corresponds to $3! = 6$ ordered triples. This case produces the edge patterns $\threedis, \twopathdis, \threepath, \oneint, \tri$.
It has already been shown that $\numocc{\ed}, \numocc{\twopath}, \numocc{\twodis}$ can be computed in $O(m)$ time. Here are the arguments for the remaining counts.
\[\numocc{\oneint} = \sum_{u \in V} \binom{d_u}{3}\]
Next, note that $\numocc{\twopathdis} + \numocc{\threedis}$ counts the triples of three distinct edges spanning five or six vertices. Such triples can be enumerated (with multiplicity) as follows: for every edge $(u, v) \in E$, discard all edges incident to $u$ or $v$ and pick two distinct edges among those that remain. A $\threedis$ is picked up once for each of its three edges, whereas a $\twopathdis$ is picked up only via the edge that is disjoint from the other two. Since the number of edges not incident to $u$ or $v$ is $m - d_u - d_v + 1$, we obtain
\[\numocc{\twopathdis} + 3\cdot\numocc{\threedis} = \sum_{(u, v) \in E} \binom{m - d_u - d_v + 1}{2},\]
which can be computed in $O(m)$ time. The implication in the claim now follows from \cref{claim:four-one} and \cref{claim:four-two}: subtracting $\numocc{\ed}\prob^2 + 6\numocc{\twopath}\prob^3 + 6\numocc{\twodis}\prob^4 + 6\numocc{\oneint}\prob^4 + 6\left(\numocc{\twopathdis} + 3\numocc{\threedis}\right)\prob^5$ from $\rpoly_3(\prob,\ldots, \prob)$ and dividing by $6\prob^3$ yields exactly $\numocc{\tri} + \numocc{\threepath} \cdot \prob - \numocc{\threedis}\cdot(3\prob^2 - \prob^3)$.\qed
\end{proof}
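The $O(m)$-computable quantities from \cref{claim:four-two} can be sketched in Python as follows (an illustrative sketch assuming an edge-list representation and using the counting identity derived above; the names are ours):
\begin{verbatim}
from math import comb
from collections import Counter

def easy_counts(edges):
    m = len(edges)
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    two_paths = sum(comb(d, 2) for d in deg.values())
    two_matchings = comb(m, 2) - two_paths
    three_stars = sum(comb(d, 3) for d in deg.values())
    # Each 3-matching is picked up once per each of its three edges,
    # a 2-path plus a disjoint edge only via its isolated edge.
    twopathdis_plus_3_threedis = sum(
        comb(m - deg[u] - deg[v] + 1, 2) for u, v in edges)
    return m, two_paths, two_matchings, three_stars, twopathdis_plus_3_threedis
\end{verbatim}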
\begin{proof}[Proof of \cref{lem:const-p}]
\AR{You should {\bf NEVER EVER} use hard-coded lemma numbers etc. Latex keeps track of numbering for you-- so {\bf ALWAYS} use the automatic numbering. {\bf Using hard coded numbering is very bad practice.}}
\AR{Also you can modify the text of \textsc{Proof} by using the following latex command \texttt{\\begin\{proof\}[Proof of Lemma 2]} and Latex will typeset this as \textsc{Proof of Lemma 2}, which is what you really want.}
\cref{claim:four-two} and the implication above give that, once $\rpoly_3(\prob,\ldots, \prob)$ is known, we can compute in $O(m)$ additional time
\[\numocc{\tri} + \numocc{\threepath} \cdot \prob - \numocc{\threedis}\cdot(3\prob^2 - \prob^3).\] Treat $\numocc{\tri}$, $\numocc{\threepath}$, and $\numocc{\threedis}$ as unknowns: each distinct value of $\prob$ yields one linear equation in these three unknowns, so three distinct values suffice provided the resulting equations are independent. In the worst case, without independence, four distinct values of $\prob$ suffice: a linear dependence among the columns of the coefficient matrix would give a nonzero polynomial in $\prob$ of degree at most $3$ (a combination of $1$, $\prob$, and $3\prob^2 - \prob^3$) vanishing at four distinct points, which is impossible.\AR{Follows from the fact that the corresponding coefficient matrix is the so called Vandermonde matrix, which has full rank.} Solving the resulting system yields the number of triangles (as well as the number of 3-paths and 3-matchings) in $G$.\qed
\end{proof}
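Purely as an illustration of this linear-algebra step (and not part of the argument), the following Python/NumPy sketch recovers three made-up counts from the value of the above expression at four distinct probabilities:
\begin{verbatim}
import numpy as np

tri, threepath, threedis = 5, 11, 2      # made-up ground truth for the demo
ps = [0.2, 0.4, 0.6, 0.8]                # four distinct probabilities

A = np.array([[1.0, p, -(3 * p**2 - p**3)] for p in ps])
b = np.array([tri + threepath * p - threedis * (3 * p**2 - p**3) for p in ps])

solution, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(solution))                # recovers [5. 11. 2.]
\end{verbatim}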
\begin{proof}[Proof of \cref{lem:gen-p}]
The argument used for \cref{lem:const-p} cannot be applied here, since $\prob$ is now fixed. Instead, we assume that the problem can be solved for every graph, apply this assumption to a small family of related graphs, say $G_1, G_2, G_3$, where $G_1$ is arbitrary, and relate the values of $\numocc{H}$ across these graphs, where $H$ is a placeholder for the relevant edge patterns. The hope is that these relations yield three independent linear equations, at which point we would be done.
The following is an option.
\begin{enumerate}
\item Let $G_1$ be an arbitrary graph
\item Build $G_2$ from $G_1$ by replacing each edge of $G_1$ with a 2-path, i.e., by subdividing it with a fresh vertex (a small sketch of this construction appears after the proof).
\end{enumerate}
Then $\numocc{\tri}_2 = 0$, and if we can prove that\AR{Again, your job is not to just transcribe the handwritten notes. If the notes have a claim without proof, then you need to finish off the proof. Of course I am happy to help if you get stuck, but one of the primary goals of your LaTeXing up the handwritten notes is for you to verify that what is in the notes is correct, and that cannot happen unless you write down complete proofs for all claims and convince yourself that the claims are correct-- e.g. they {\em could} be wrong and the hope is that your pass will catch bugs.}
\begin{itemize}
\item $\numocc{\threepath}_2 = 2 \cdot \numocc{\twopath}_1$
\item $\numocc{\threedis}_2 = 8 \cdot \numocc{\threedis}_1$
\end{itemize}
then, by solving our problem for $q_E^3$ on $G_2$, we can compute $\numocc{\threedis}_1$, i.e., count the 3-matchings of $G_1$, which is a hard problem.
\end{proof}
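For completeness, the construction of $G_2$ from $G_1$ used above can be sketched in a few lines of Python (a purely illustrative sketch, assuming an edge-list representation with vertices $0,\ldots,n-1$; the function name is ours):
\begin{verbatim}
def subdivide(edges, n):
    """Replace every edge (u, v) of G_1 (vertices 0..n-1) by the 2-path
    u - x_{uv} - v through a fresh vertex; returns the edge list of G_2."""
    new_edges, fresh = [], n
    for u, v in edges:
        new_edges.append((u, fresh))
        new_edges.append((fresh, v))
        fresh += 1
    return new_edges

print(subdivide([(0, 1), (1, 2), (2, 0)], 3))
\end{verbatim}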