\section{Sum of Products Analysis}
\AR{You should do the analysis for $\lambda(j,j')$ instead of just $\sigma_j^2=\lambda(j,j)$. The former is not that different from what you have below and you'll need to do it when computing $\sigma^2$ in any case.}
We now seek to bound the variance of a $\prodsize$-way join.
\begin{align}
&\sigsq_j = \ex{\est_j \cdot \overline{\est_j}} - \ex{\est_j} \cdot \ex{\overline{\est_j}} \nonumber\\
&= \ex{\prod_{i = 1}^{\prodsize}\sum_{w \in W_j}v_i(w)s(w) \cdot \prod_{i = 1}^\prodsize\sum_{w' \in W_j}v_i(w')\overline{s(w')}} -
\ex{\prod_{i = 1}^{\prodsize}\sum_{w \in W_j}v_i(w)s(w)}\cdot \ex{\prod_{i = 1}^\prodsize\sum_{w' \in W_j}v_i(w')\overline{s(w')}}\nonumber\\
&= \ex{\sum_{\substack{w_1...w_\prodsize\\w'_1...w'_\prodsize\\ \in W}}\prod_{i = 1}^\prodsize v_i(w_i)v_i(w'_i)s(w_i)\overline{s(w'_i)}\ind{h(w_i) = j}\ind{h(w'_i) = j}} -
\ex{\sum_{w_1...w_\prodsize \in W} \prod_{i = 1}^\prodsize v_i(w_i)s(w_i)\ind{h(w_i) = j}} \cdot
\ex{\sum_{w'_1...w'_\prodsize \in W} \prod_{i = 1}^\prodsize v_i(w'_i)\overline{s(w'_i)}\ind{h(w'_i) = j}}\nonumber\\
&= \sum_{\substack{w_1...w_\prodsize\\w'_1...w'_\prodsize\\ \in W}}\left(\ex{\prod_{i = 1}^\prodsize v_i(w_i)v_i(w'_i)s(w_i)\overline{s(w'_i)}\ind{h(w_i) = j}\ind{h(w'_i) = j}} -
\ex{\prod_{i = 1}^\prodsize v_i(w_i)s(w_i)\ind{h(w_i) = j}} \cdot \ex{\prod_{i = 1}^\prodsize v_i(w'_i)\overline{s(w'_i)}\ind{h(w'_i) = j}}\right)\nonumber\\
&= \sum_{\substack{w_1...w_\prodsize\\w'_1...w'_\prodsize\\ \in W}}\prod_{i = 1}^\prodsize v_i(w_i)v_i(w'_i)\cdot\left( \ex{\prod_{i = 1}^\prodsize s(w_i)\overline{s(w'_i)}\ind{h(w_i) = j}\ind{h(w'_i) = j}} -
\ex{\prod_{i = 1}^\prodsize s(w_i)\ind{h(w_i) = j}}\cdot \ex{\prod_{i = 1}^\prodsize\overline{s(w'_i)}\ind{h(w'_i) = j}} \right)\label{eq:sig-j-last}.
\end{align}
Before proceeding, we introduce some notation and terminology that will aid in communicating the bounds we are about to establish. We refer to the two expectation expressions inside the parentheses of \cref{eq:sig-j-last} in the following way:
\[\term_1\left(\wElem_1,\ldots,\wElem_\prodsize, \wElem_1',\ldots, \wElem_\prodsize'\right) = \ex{\prod_{i = 1}^\prodsize s(w_i)\overline{s(w'_i)}\ind{h(w_i) = j}\ind{h(w'_i) = j}}\text{, and}\]
\[\term_2\left(\wElem_1,\ldots,\wElem_\prodsize, \wElem_1',\ldots, \wElem_\prodsize'\right) = \ex{\prod_{i = 1}^\prodsize s(w_i)\ind{h(w_i) = j}}\cdot \ex{\prod_{i = 1}^\prodsize\overline{s(w'_i)}\ind{h(w'_i) = j}}. \]
\AH{Not sure about this. This is correct for $\sigsq_j$, but I think we need to refer to $\term_2$ when dealing with $\lambda(j, j')$, as this produces the only surviving terms.}
We use the word ``term'' to denote $\prod_{i = 1}^{\prodsize}\vect_i(\wElem_i)\vect_i(\wElem_i') \cdot\left(\term_1 - \term_2\right)$ for a specific assignment of world values. A term survives the expectation if $\term_1 - \term_2 \neq 0$. Note that the only terms that survive the expectation above are those in which $w_i = w'_j = w$ for some $i, j \in [\prodsize]$, such that each $w_i$ has a match, i.e., no $w_i$ or $w'_j$ stands alone without a matching world in its complementary set. In other words, the set of values in $\wElem_1,\ldots,\wElem_\prodsize$ has a bijective mapping to the set of values in $\wElem'_1,\ldots,\wElem'_\prodsize$.
\subsection{$f$, $f'$}
Define and then fix the lexicographical ordering of distinct world elements to be the total ordering of the elements in $[\dist]$ such that $\forall i < j \in [\dist], \widetilde{\wElem_i} < \widetilde{\wElem_j}$.
To help describe all possible matchings we introduce functions $f$ and $f'$.
The functions $f, f'$ range over the surjective mappings from $\prodsize$ elements onto $\dist$ and $\dist'$ elements respectively: $f: [\prodsize] \rightarrow [\dist], f': [\prodsize] \rightarrow [\dist'].$
The functions $f, f'$ are used to produce the mappings $w_i \mapsto \dMap{w_{f(i)}}$. In particular, $f$ and $f'$ are machinery for mapping $\prodsize$ $\wElem$-world variables to $\dist$ distinct values.
We rewrite equation \eqref{eq:sig-j-last} in terms of $\dist$ distinct worlds, with $f, f'$ mappings.
\begin{equation}
\sum_{\dist \in [\prodsize]}\sum_{\dist' \in [\prodsize]}\sum_{f, f'}\sum_{\substack{\dMap{\wElem_1}, \ldots,\dMap{\wElem_\dist},\\\dMap{\wElem'_1},\ldots,\dMap{\wElem'_{\dist'}}\\ \in W}}\prod_{i = 1}^{\prodsize}\vect_i(\dMap{\wElem_{f(i)}})\vect_i(\dMap{\wElem'_{f'(i)}})\cdot\left( \ex{\prod_{i = 1}^\prodsize \sine(\dMap{\wElem_{f(i)}})\conj{\sine(\dMap{\wElem'_{f'(i)}})}\ind{h(\dMap{\wElem_{f(i)}}) = j}\ind{h(\dMap{\wElem'_{f'(i)}}) = j}} -
\ex{\prod_{i = 1}^\prodsize \sine(\dMap{\wElem_{f(i)}})\ind{h(\dMap{\wElem_{f(i)}}) = j}}\cdot \ex{\prod_{i = 1}^\prodsize\conj{\sine(\dMap{\wElem'_{f'(i)}})}\ind{h(\dMap{\wElem'_{f'(i)}}) = j}} \right)\label{eq:sig-j-distinct}
\end{equation}
The fact that \cref{eq:sig-j-last} $\equiv$ \cref{eq:sig-j-distinct} follows since \cref{eq:sig-j-distinct} is simply a rearrangement of the addends in the sum.
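
As a small sanity check of this rearrangement, consider $\prodsize = 2$ and only the unprimed variables, writing $g$ for an arbitrary summand (the symbol $g$ is introduced here purely for illustration). Assuming that the values $\dMap{\wElem_1}, \ldots, \dMap{\wElem_\dist}$ are required to be pairwise distinct and that one representative $f$ is taken per relabeling of the distinct values, the sum over $(\wElem_1, \wElem_2) \in W^2$ splits according to the number of distinct values:
\[
\sum_{\wElem_1, \wElem_2 \in W} g(\wElem_1, \wElem_2) = \underbrace{\sum_{\dMap{\wElem_1} \in W} g(\dMap{\wElem_1}, \dMap{\wElem_1})}_{\dist = 1} + \underbrace{\sum_{\substack{\dMap{\wElem_1}, \dMap{\wElem_2} \in W\\ \dMap{\wElem_1} \neq \dMap{\wElem_2}}} g(\dMap{\wElem_1}, \dMap{\wElem_2})}_{\dist = 2}.
\]
Here $\dist = 1$ corresponds to the constant map $f(1) = f(2) = 1$, and $\dist = 2$ to the identity map $f(i) = i$.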
Functions $f:[\prodsize]\rightarrow [\dist], f':[\prodsize]\rightarrow [\dist']$ are said to be matching, denoted $\match{f}{f'}$, if and only if
\begin{enumerate}
\item $\dist = \dist'$
\item $\{|f^{-1}(i)| ~|~ \forall i \in [\dist]\} = \{|f'^{-1}(i')| ~|~ \forall i' \in [\dist] \}$, i.e., the set of preimage cardinalities for $f$ equals the set of preimage cardinalities for $f'$.
\end{enumerate}
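
To make the two conditions concrete, here is a small illustrative instance with $\prodsize = 4$ and $\dist = \dist' = 2$; the functions $f$, $f'$, $f''$ below are hypothetical examples and are described by their preimages:
\begin{align*}
f &: & f^{-1}(1) &= \{1\}, & f^{-1}(2) &= \{2, 3, 4\}, & \{|f^{-1}(i)|\} &= \{1, 3\},\\
f' &: & f'^{-1}(1) &= \{2\}, & f'^{-1}(2) &= \{1, 3, 4\}, & \{|f'^{-1}(i)|\} &= \{1, 3\},\\
f'' &: & f''^{-1}(1) &= \{1, 2\}, & f''^{-1}(2) &= \{3, 4\}, & \{|f''^{-1}(i)|\} &= \{2\}.
\end{align*}
Under the definition as stated, $\match{f}{f'}$ holds since both conditions are met, while $f$ and $f''$ are not matching because their preimage cardinalities differ.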
To avoid double counting, we impose an ordering on the set of functions $f, f'$ to omit symmetrical mappings.
For every $i, j \in [\dist]$ with $i < j$, the number formed by concatenating the elements of $f^{-1}(i)$ in increasing order must be smaller than the number formed by concatenating the elements of $f^{-1}(j)$ in increasing order, where $<$ is the order of the natural numbers.
We illustrate with an example. Consider a join of $\prodsize = 3$ tuples, where $\dist = 2$, and we have that $|f^{-1}(1)| = 1$ and $|f^{-1}(2)| = 2$. Imposing the above ordering yields the following set of unique functions:
\[
f_1 = \begin{cases}
1 \mapsto 1 &\implies\wElem_1 \mapsto \dMap{\wElem_1}\\
2, 3 \mapsto 2 &\implies\wElem_2, \wElem_3 \mapsto \dMap{\wElem_2}
\end{cases}
\]
\[
f_2 = \begin{cases}
2 \mapsto 1 &\implies\wElem_2 \mapsto \dMap{\wElem_1}\\
1, 3 \mapsto 2 &\implies\wElem_1, \wElem_3 \mapsto \dMap{\wElem_2}
\end{cases}
\]
\[
f_3 = \begin{cases}
3 \mapsto 1 &\implies\wElem_3 \mapsto \dMap{\wElem_1}\\
1, 2 \mapsto 2 &\implies\wElem_1, \wElem_2 \mapsto \dMap{\wElem_2}
\end{cases}
\]
The above mappings satisfy the ordering constraint so that for $f_1$, $1 < 23$, for $f_2$, $2 < 13$, and for $f_3$, $3 < 12$.
Note that the above orderings share no symmetry, while their symmetrical versions, which are the orderings for the case when $|f^{-1}(1)| = 2$ and $|f^{-1}(2)| = 1$, break our ordering requirement and are therefore disallowed, thus avoiding double counting. Another way of saying this is that the preimage sizes follow the natural order of their respective counterparts in the image. When two preimage sizes are equal, we need a finer order, and can break the tie using the concatenation rule described above.
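
As a check on the example, and assuming the ordering constraint keeps exactly one function per relabeling of the $\dist$ distinct values, the number of admitted functions can be counted directly. There are $2^3 - 2 = 6$ surjections from $[3]$ onto $[2]$, and each one has exactly one relabeled twin (obtained by swapping the two image values), so
\[
\frac{2^3 - 2}{2!} = 3
\]
functions remain, in agreement with $f_1, f_2, f_3$ above. This is the number of ways to partition a three-element set into two nonempty blocks.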
The only terms surviving $\term_1 - \term_2$ are those with $f, f'$ matching, where $\forall j \in[\dist], \dMap{\wElem_j} = \dMap{\wElem'_j}$.
In proving \cref{lem:sig-j-survive}, we introduce another fact.
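Given distinct worlds $\wElem \neq \wElem'$ whose values $\sine(\wElem), \sine(\wElem')$ are powers of a $\prodsize^{th}$ root of unity $\rou$, the expectation of the product $\sine(\wElem)^i \cdot \conj{\sine(\wElem')}^j$ for $i, j \in [\prodsize]$ is zero.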
\begin{align*}
&\ex{\sine(\wElem)^i \conj{\sine(\wElem')}^j}\\
= &\ex{\sine(\wElem)^i}\ex{\conj{\sine(\wElem')}^j}\\
= &0
\end{align*}
In the above, since we have more than pairwise independence for $\wElem \neq \wElem'$, we can push the expectation into the product. Then by \cref{lem:exp-sine} we get 0 for both expectations.\newline
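
For intuition, consider $\prodsize = 3$ under the assumption (not restated in this section, and possibly different from the exact hypothesis of \cref{lem:exp-sine}) that $\sine(\wElem)$ is uniformly distributed over the three cube roots of unity $1, \rou, \rou^2$, where $\rou \neq 1$ and $\rou^3 = 1$. Then
\[
\ex{\sine(\wElem)} = \frac{1}{3}\left(1 + \rou + \rou^2\right) = \frac{1}{3}\cdot\frac{1 - \rou^3}{1 - \rou} = 0,
\]
and the same computation gives $\ex{\conj{\sine(\wElem')}} = 0$, since the conjugates of the cube roots of unity are again the cube roots of unity. With $\sine(\wElem)$ and $\sine(\wElem')$ independent for $\wElem \neq \wElem'$, the expectation $\ex{\sine(\wElem)\conj{\sine(\wElem')}}$ factors into these two zero expectations and therefore vanishes.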
To prove \cref{lem:sig-j-survive}, consider the expectation when $f, f'$ are not matching. We begin with a violation of the first matching condition, i.e., $\dist \neq \dist'$, in which case one set of variables has at least one more distinct world than the other. To be explicit, $\wElem_1\ldots\wElem_\dist, \wElem_1'\ldots\wElem_{\dist'}'$ are distinct world values such that $\forall i \neq j \in [\dist], \wElem_i = \wElem_i' \neq \wElem_j = \wElem_j'$. Without loss of generality, assume that $\dist < \dist'$; the case $\dist > \dist'$ has a symmetrical proof. Fixing variables $\wElem_1\ldots\wElem_\dist, \wElem_1'\ldots\wElem_\dist'$, in both $\term_1$ and $\term_2$ we have one extra distinct value, $\wElem_{\dist'}'$. The factor contributed by this extra value has expectation zero, which cancels out all the other values in the expectations.
\begin{align*}
\term_1 - \term_2 = &\ex{\sine(\wElem_1)\conj{\sine(\wElem_1)}\cdot\ldots\cdot\sine(\wElem_\dist)\conj{\sine(\wElem_\dist)}\cdot\conj{\sine(\wElem'_{\dist'})}} - \ex{\prod_{i = 1}^{\dist}\sine(\wElem_i)^{j_i}}\ex{\left(\prod_{i = 1}^{\dist}\sine(\wElem_i)^{j_i}\right) \cdot \conj{\sine(\wElem'_{\dist'})}}\\
= &\ex{\sine(\wElem_1)\conj{\sine(\wElem_1)}\cdot\ldots\cdot\sine(\wElem_\dist)\conj{\sine(\wElem_\dist)}}\cdot \ex{\conj{\sine(\wElem'_{\dist'})}} - \ex{\prod_{i = 1}^{\dist}\sine(\wElem_i)^{j_i}}\ex{\prod_{i = 1}^{\dist}\sine(\wElem_i)^{j_i}} \cdot \ex{\conj{\sine(\wElem'_{\dist'})}}\\
= &0.
\end{align*}
By the same reasoning as in the proof of \cref{lem:exp-prod-rand-roots}, we can push the expectation into the product of two independent random values. \textit{Here we assume at most $2\prodsize$-wise independence, but we would like to require less}. Then by \cref{lem:exp-sine}, we get a factor of zero in both products, giving a result of zero.\qed
To finish the proof of \cref{lem:sig-j-survive}, we now approach the case where $\dist = \dist'$, but the set of preimage cardinalities for $f, f'$ are unequal. Effectively this condition means that we end up with the same result of unequal pairs as when $\dist \neq \dist'$.
\begin{align*}
&\{|f^{-1}(i)| ~|~ \forall i \in [\dist]\} \neq \{|f'^{-1}(i')| ~|~ \forall i' \in [\dist] \}\\
\implies &\exists i, i' \in [\dist] \text{ s.t. } |f^{-1}(i)| \neq |f'^{-1}(i)|, |f^{-1}(i')| \neq |f'^{-1}(i')|
\end{align*}
The above means that we will have at least two world values that do not match, i.e., a pair $\wElem_i \neq \wElem_{i'}'$ with $i \neq i' \in [\dist]$. Fixing all world values except $\wElem_i$ and $\wElem_{i'}'$, we have
\begin{align*}
\term_1 - \term_2 = &\ex{\left(\prod_{i''= 1}^{\dist}\sine(\wElem_{i''})\conj{\sine(\wElem_{i''}')}\right)\sine(\wElem_i)\conj{\sine(\wElem_{i'}')}} - \ex{\prod_{i'' = 1}^{\dist}\sine(\wElem_{i''})}\cdot\ex{\prod_{i'' = 1}^{\dist}\conj{\sine(\wElem_{i''}')}}\\
= &\ex{\prod_{i''= 1}^{\dist}\sine(\wElem_{i''})\conj{\sine(\wElem_{i''}')}}\ex{\sine(\wElem_i)\conj{\sine(\wElem_{i'}')}} -
\ex{\prod_{\substack{i'' = 1\\ i'' \neq i}}^{\dist}\sine(\wElem_{i''})}\cdot \ex{\sine(\wElem_{i})}\cdot\ex{\prod_{\substack{i'' = 1\\ i'' \neq i'}}^{\dist}\conj{\sine(\wElem_{i''}')}}\cdot \ex{\conj{\sine(\wElem_{i'}')}}\\
= &0.
\end{align*}
By the same arguments as before, we have at least one distinct world value in each expectation, which by independence of random variables allows us to push the expectations into the product, and then by \cref{lem:exp-prod-rand-roots} and \cref{lem:exp-sine} produce a zero in each product, yielding a value of zero.\qed
We now seek to show that when $f, f'$ are matching, $\term_1 - \term_2$ always equals 1.
Using the above definitions, we can now present the variance bounds for $\sigsq_j$ based on \eqref{eq:sig-j-distinct}.
Since the expectations cancel when $\forall i, i', j, j'\in [\prodsize], \wElem_i = \wElem_j = \wElem_{i'}' = \wElem_{j'}' = \wElem$, we can discard the case in which there is only one distinct world value. We then need to sum over all the possibilities of $\dist$ distinct world values for $\dist \in [2, \prodsize]$. Note that the number of distinct values $\dist$ affects the randomness of the hash function $\hfunc$. For example, $\dist = 2$ distinct values yield a factor of $\frac{1}{\sketchCols} \cdot \frac{1}{\sketchCols} = \frac{1}{\sketchCols^2} = \frac{1}{\sketchCols^\dist}$. By Lemma \ref{lem:sig-j-survive} and equation \eqref{eq:sig-j-distinct} we get
\[
\sigsq_j = \sum_{\dist \in [2, \prodsize]} \frac{1}{\sketchCols^\dist} \sum_{\dMap{w_1}\ldots\dMap{w_\dist}\in W} \sum_{\substack{f, f',\\\match{f}{f'}}} \prod_{i = 1}^{\prodsize} v_i(\dMap{w_{f(i)}}) v_i(\dMap{w_{f'(i)}}).
\]
