
% -*- root: main.tex -*-
\section{Analysis}
\label{sec:analysis}
We begin the analysis by showing that, with high probability, an estimate is approximately $\numWorldsP$, where $p$ is the probability measure for a given TIPD.  Note that
\begin{equation}
\numWorldsP = \numWorldsSum\label{eq:mu}.
\end{equation}

The first step is to show that the expected estimate of a tuple $t$'s membership across all worlds is $\numWorldsSum$.

\AR{While the analysis below is correct, the way it is stated it seems to `come out of the blue.' I would recommend that you re-structure the argument below as follows. First argue that $\expect{\sketch[i][\sketchHash[\wVec]]\cdot s_i[\wVec]}=v_t[\wVec]$. From this the claim below just follows by linearity of expectation but this result is a good thing for the reader to realize. Also instead of summing over $j\in [B],\wVec|h_i[\wVec]=j,\wVec'|h_i[\wVec']=j$ it would be better to just write it as sum over all $\wVec,\wVec'\in W\text{ s.t. }h_i[\wVec]=h_i[\wVec']$-- the latter is bit more compact and it is easier to comprehend as well.}
\begin{align}
&\expect{\estimate}\\
=&\expect{\estExpOne}\\
=&\expect{\sum_{\substack{j \in [B],\\
			 \wVec \in \pw~|~ \sketchHash{i}[\wVec] = j,\\
			 \wVec[w']\in \pw~|~ \sketchHash{i}[\wVec[w']] = j} } v_t[\wVec] \cdot s_i[\wVec] \cdot s_i[\wVec[w']]}\\
=&\multLineExpect\big[\sum_{\substack{j \in [B],\\
				\wVec~|~\sketchHashParam{\wVec}= j,\\
				\wVecPrime~|~\sketchHashParam{\wVecPrime} = j,\\
				\wVec = \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime} +  \nonumber \\
&\phantom{{}\kMapParam{\wVec}}\sum_{\substack{j \in [B], \\
				\wVec~|~\sketchHashParam{\wVec} = j,\\
				\wVecPrime ~|~ \sketchHashParam{\wVecPrime} = j,\\ \wVec \neq \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot\sketchPolarParam{\wVecPrime}\big]\textit{(by linearity of expectation)}\\
=&\expect{\sum_{\substack{j \in [B],\\
				\wVec~|~\sketchHashParam{\wVec}= j,\\
				\wVecPrime~|~\sketchHashParam{\wVecPrime} = j,\\
				\wVec = \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime}} \nonumber \\
&\phantom{{}\big[}\textit{(by uniform distribution in the second summation)}\\
=& \estExp. \label{eq:estExpect}
\end{align}
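
As a sketch of why the cross terms drop out, assume (as the derivation above appears to) that each $\sketchPolarParam{\wVec}$ is uniform over $\{-1,+1\}$ and that the values for distinct worlds are pairwise independent. Then, for $\wVec \neq \wVecPrime$,
\begin{equation*}
\expect{\sketchPolarParam{\wVec}\cdot\sketchPolarParam{\wVecPrime}} = \expect{\sketchPolarParam{\wVec}}\cdot\expect{\sketchPolarParam{\wVecPrime}} = 0,
\qquad\text{while}\qquad
\expect{\big(\sketchPolarParam{\wVec}\big)^2} = 1,
\end{equation*}
so only the terms with $\wVec = \wVecPrime$ survive in expectation.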

\AR{A general comment: The last display equation should have a period at the end. The idea is that display equations are considered part of a sentence and every sentence should end with a period.}

For the next step, we show that the variance of an estimate is small.
\begin{align}
&\varParam{\estimate}\nonumber\\
&=\varParam{\estExpOne}\\
&= \expect{\big(\estTwo\big)^2}\\
&=\expect{\sum_{\substack{
		\wVec_1, \wVec_2,\\
		 \wVecPrime_1, \wVecPrime_2 \in \pw,\\
		 \sketchHashParam{\wVec_1} = \sketchHashParam{\wVecPrime_1},\\
		 \sketchHashParam{\wVec_2} = \sketchHashParam{\wVecPrime_2}
		 }}\kMapParam{\wVec_1} \cdot \kMapParam{\wVec_2}\cdot\sketchPolarParam{\wVec_1}\cdot\sketchPolarParam{\wVec_2}\cdot\sketchPolarParam{\wVecPrime_1}\cdot\sketchPolarParam{\wVecPrime_2} }\label{eq:var-sum-w}
\end{align}
\AR{The $-\mu^2$ term is missing in the above.}

Note that four-wise independence is assumed across all four random variables of \eqref{eq:var-sum-w}.  Zooming in on the inner products of the $\sketchPolar$ functions,
\begin{equation}
\polarProdEq \label{eq:polar-product}
\end{equation}
it can be seen that for $\wOne, \wOneP \in \pw$ and $\wTwo, \wTwoP \in \pw'$, all four random variables in \eqref{eq:polar-product} take their values from $\pw$, although we have iteration over two separate sets $\pw$.\AR{I do not know what you mean by ``iteration''}  Thus, there are five possible sets of $\wVec$ variable combinations, namely:
\begin{align*}
&\distPattern{1}:&\forElems{\cOne}&\\
&\distPattern{2}:&\forElems{\cTwo}& \textit{*} \\
&\distPattern{3}:&\forElems{\cThree}& \textit{*} \\
&\distPattern{4}:&\forElems{\cFour}& \textit{*}\\
&\distPattern{5}:&\forElems{\cFive}&
\end{align*}
$$\text{ }^*\textit{(and all variants of the respective pattern)}$$
\AR{Two comments on the notation above. You should define the sets exactly-- i.e. you should not put a $*$ on some of the definitions. Second, it is not immediately clear why the above cover all the cases, so you should argue that is the case. I think it is easier to argue this is you argue in terms of number of inequalities in the possible $\binom{4}{2}=6$ comparisons-- note you probably should not write down all the 6 comparisons since that would be cumbersome: just use it in your argument.}

We are interested only in the cases whose expectation is nonzero, since a term with zero expectation contributes nothing to the summation in \eqref{eq:var-sum-w}.  In expectation we have that
\begin{align}
\forAllW{\distPattern{1}}&\rightarrow\expect{%\sum_{\substack{\elems \\
			%\st \cOne}}
		 \polarProdEq} = 1 \label{eq:polar-prod-all}\\
\forAllW{\distPattern{2}}&\rightarrow\expect{%\sum_{\substack{\elems \\
			%\st \cTwo}}
		\polarProdEq} = 1 \label{eq:polar-prod-two-and-two}\\
\forAllW{\distPattern{3}}&\rightarrow\expect{%\sum_{\substack{\elems \\
			%\st \cThree}}
		\polarProdEq} = 0 \nonumber \\
\forAllW{\distPattern{4}}&\rightarrow\expect{%\sum_{\substack{\elems \\
			%\st \cFour}}
		\polarProdEq} = 0 \nonumber \\
\forAllW{\distPattern{5}}&\rightarrow\expect{%\sum_{\substack{\elems \\
			%\st \cFive}}
		\polarProdEq} = 0 \nonumber
\end{align}
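
The values above can be checked with a short sketch, under the assumptions that each $\sketchPolar$ value lies in $\{-1,+1\}$ with mean zero, that the values of distinct worlds are four-wise independent, and that $\distPattern{3}$ through $\distPattern{5}$ are exactly the patterns in which some world occurs an odd number of times among the four arguments. Writing $a$ and $b$ for the polarity values of two distinct worlds (names introduced here only for illustration),
\begin{align*}
\expect{a^4} &= 1, &
\expect{a^2 b^2} &= \expect{a^2}\cdot\expect{b^2} = 1, &
\expect{a^3 b} &= \expect{a^3}\cdot\expect{b} = 0,
\end{align*}
since $a^2 = 1$ always and $a^3 = a$. The same factorization gives zero whenever some world appears exactly once in the product.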
\AR{You should argue why each of the equalities above. While we might decide to drop the arguments in the submitted paper when we are working things out, it is better to write down all the arguments. This is the best way to spot bugs in a proof. Otherwise, it is easy to introduce bugs by not checking for things that are ``obvious."}

Only equations \eqref{eq:polar-prod-all} and \eqref{eq:polar-prod-two-and-two} influence the $\var$ computation.
Considering $\distPattern{1}$, the variance results in
\begin{equation}
\distPatOne\label{eq:distPatOne}
\end{equation}

For the distribution pattern $\cTwo$, we have three variants to consider.
\begin{align*}
&\vCase{1}:&\cTwo \\
&\vCase{2}:&\cTwoV{\wOne}{\wTwo}{\wOneP}{\wTwoP}\\
&\vCase{3}:&\cTwoV{\wOne}{\wTwoP}{\wOneP}{\wTwo}
\end{align*}
\AR{Again you should be defining sets and not variants. E.g. you could have defined the subsets $S_{21},S_{22},S_{23}$.}
When considered separately, the variants have the following $\var$.
\begin{align}
\cTwo&= \variantOne \label{eq:variantOne}\\
\cTwoV{\wOne}{\wTwo}{\wOneP}{\wTwoP}&=\variantTwo \label{eq:variantTwo}\\
\cTwoV{\wOne}{\wTwoP}{\wOneP}{\wTwo}&=\variantThree\label{eq:variantThree}
\end{align}
\AR{You should again argue each of the claimed equalities above. Actually in the second equality, the term $|h_i[\wVec]=h_i[\wVec']|$ should really be $|\{\wVec'|h_i[\wVec]=h_i[\wVec']\}|$. Also this change needs to be propagated.}

\AR{Also while I do like the use of macros, I think you have gone over-board in the other direction. It is good to create macros for symbols/variables names that you will use frequently but using macros for entire expressions is not a good idea. Among others, it makes it really hard for others to read it since they have to refer back to your macro definition each time they see it.}

Note that at the start of the analysis of $\var$, the second term (expectation \eqref{eq:estExpect} squared) of the $\var$ calculation was not considered. \AR{You should {\bf not} start with a {\em wrong} expression and then later on correct it. Start off with the correct expression in the first place: otherwise it just creates more confusion.}  This is because it is canceled out by \eqref{eq:distPatOne} and \eqref{eq:variantOne}.
\begin{equation*}
\big(\estExp\big)^2 = \distPatOne + \variantOne
\end{equation*}
With only \eqref{eq:variantTwo} and \eqref{eq:variantThree} remaining, we have

\begin{multline*}
\varParam{\estimate} = \\
\variantTwo ~+ \\
\variantThree
\end{multline*}
\AR{The expectations are missing on the RHS. And this needs to be propagated.}

Converting terms into their space requirements yields
\begin{align}
&\variantTwo \Rightarrow\numWorldsP \cdot \big(\frac{\numWorlds}{\sketchCols} - 1\big)\label{eq:spaceOne}\\
&\variantThree \Rightarrow \numWorldsP \cdot  \frac{\numWorldsP - 1}{\sketchCols}\label{eq:spaceTwo}
\end{align}
\AR{Again, argue why the above claims are true.}
Summing \eqref{eq:spaceOne} and \eqref{eq:spaceTwo} and simplifying gives
\begin{equation}
\frac{2^{2N}(\prob + \prob^2)}{\sketchCols} - \numWorlds\big(\frac{\prob}{\sketchCols} + \prob\big).\label{eq:variance}
\end{equation}
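The reduction can be traced term by term. The following is a sketch that assumes $\numWorlds$ denotes $2^N$ and $\numWorldsP$ denotes $2^N\prob$, and that reads the right-hand side of \eqref{eq:spaceOne} as $\numWorldsP\cdot\big(\frac{\numWorlds}{\sketchCols}-1\big)$:
\begin{align*}
2^N\prob\cdot\Big(\frac{2^N}{\sketchCols} - 1\Big) + 2^N\prob\cdot\frac{2^N\prob - 1}{\sketchCols}
&= \frac{2^{2N}\prob}{\sketchCols} - 2^N\prob + \frac{2^{2N}\prob^2}{\sketchCols} - \frac{2^N\prob}{\sketchCols}\\
&= \frac{2^{2N}(\prob + \prob^2)}{\sketchCols} - 2^N\Big(\frac{\prob}{\sketchCols} + \prob\Big).
\end{align*}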
From \eqref{eq:variance}, dropping the negative term and using $\prob^2 \leq \prob$, we then have
\begin{align*}
\varSym &< 2^{2N}\big(\frac{2\prob}{\sketchCols}\big) \\
\sd &<\sdEq\\
\sdRel& < \sqrt{\frac{2}{\sketchCols\prob}}.
\end{align*}
Recall that $\sdRel = \frac{\sd}{\mu}$ where $\mu$ is defined as $\numWorldsP$ in \eqref{eq:mu}.
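
As a rough numeric illustration (the values are chosen here only as an example and do not come from the analysis), taking $\sketchCols = 200$ and $\prob = 0.25$ in the bound on $\sdRel$ gives
\begin{equation*}
\sqrt{\frac{2}{\sketchCols\prob}} = \sqrt{\frac{2}{200\cdot 0.25}} = \sqrt{0.04} = 0.2,
\end{equation*}
so under these assumed values the standard deviation of a single estimate is below $20\%$ of $\mu$.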

Since the sketch runs multiple trials, it suffices for each trial to exceed the error bound $\errB$ with probability smaller than one half: the median of all trials then stays within the error bound with high probability.  Expressing the error relative to $\mu$ in Chebyshev's inequality yields
\begin{equation*}
\cheby.
\end{equation*}
\AR{It would be better to state the deviation as say $\Delta$ instead of $\epsilon\mu$. Then derive the expression for $B$ in terms of $N,p,\Delta$. Then you can state as consequences what values of $B$ you get for the special cases of $\Delta=\epsilon\cdot 2^N$ and $\Delta=\epsilon\mu$.}
Substituting $\mu\epsilon$ for $k\sd$ and solving for $\sketchCols$ results in
\begin{align*}
&k\cdot\sdEq = \mu\epsilon\\
&k = \frac{\mu\epsilon}{\sdEq}\\
&k = \frac{\mu\epsilon\sqrt{\sketchCols}}{\numWorlds \sqrt{2\prob}}\\
&k^2 = \frac{\mu^2\epsilon^2\sketchCols}{2^{2N}\cdot2\prob} = \frac{\prob\epsilon^2\sketchCols}{2}\\
&\chebyK\Rightarrow \sketchCols = \frac{6}{\epsilon^2\prob}.
\end{align*}
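
The same calculation can be sketched for a generic additive deviation $\Delta$ (a symbol introduced here for illustration), assuming the variance bound $\varSym < 2^{2N}\cdot\frac{2\prob}{\sketchCols}$ derived above and a target per-trial failure probability of $\frac{1}{3}$, which is what the final value of $\sketchCols$ suggests:
\begin{equation*}
\Pr\big[\,|\estimate - \mu| \geq \Delta\,\big] \leq \frac{\varSym}{\Delta^2} < \frac{2^{2N}\cdot 2\prob}{\sketchCols\,\Delta^2} \leq \frac{1}{3}
\quad\text{whenever}\quad
\sketchCols \geq \frac{6\cdot 2^{2N}\prob}{\Delta^2}.
\end{equation*}
With $\Delta = \epsilon\mu = \epsilon\cdot 2^N\prob$ this recovers $\sketchCols \geq \frac{6}{\epsilon^2\prob}$, and with $\Delta = \epsilon\cdot 2^N$ it gives $\sketchCols \geq \frac{6\prob}{\epsilon^2}$. For instance, $\epsilon = 0.1$ and $\prob = 0.5$ (illustrative values only) give $\sketchCols \geq 1200$ in the first case and $\sketchCols \geq 300$ in the second.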