%root--main.tex
\section{Sum of Products Analysis}
We now seek to bound the variance of a $k$-way join.
\begin{align}
&\sigsq_j = \ex{est_j \cdot \overline{est_j}} - \ex{est_j} \cdot \ex{\overline{est_j}} \nonumber\\
&= \ex{\prod_{i = 1}^{k}\sum_{w \in W_j}v_i(w)s(w) \cdot \prod_{i = 1}^k\sum_{w' \in W_j}v_i(w')\overline{s(w')}} -
\ex{\prod_{i = 1}^{k}\sum_{w \in W_j}v_i(w)s(w)}\cdot \ex{\prod_{i = 1}^k\sum_{w' \in W_j}v_i(w')\overline{s(w')}}\nonumber\\
&= \ex{\sum_{\substack{w_1...w_k\\w'_1...w'_k\\ \in W}}\prod_{i = 1}^k v_i(w_i)v_i(w'_i)s(w_i)\overline{s(w'_i)}\ind{h(w_i) = j}\ind{h(w'_i) = j}} -
\ex{\sum_{w_1...w_k \in W} \prod_{i = 1}^k v_i(w_i)s(w_i)\ind{h(w_i) = j}} \cdot
\ex{\sum_{w'_1...w'_k \in W} \prod_{i = 1}^k v_i(w'_i)\overline{s(w'_i)}\ind{h(w'_i) = j}}\nonumber\\
&= \sum_{\substack{w_1...w_k\\w'_1...w'_k\\ \in W}}\left(\ex{\prod_{i = 1}^k v_i(w_i)v_i(w'_i)s(w_i)\overline{s(w'_i)}\ind{h(w_i) = j}\ind{h(w'_i) = j}} -
\ex{\prod_{i = 1}^kv_i(w_i)s(w_i)\ind{h(w_i) = j}} \cdot \ex{\prod_{i = 1}^k v_i(w'_i)\overline{s(w'_i)}\ind{h(w'_i) = j}}\right)\nonumber\\
&= \sum_{\substack{w_1...w_k\\w'_1...w'_k\\ \in W}}\prod_{i = 1}^k v_i(w_i)v_i(w'_i)\cdot\left( \ex{\prod_{i = 1}^k s(w_i)\overline{s(w'_i)}\ind{h(w_i) = j}\ind{h(w'_i) = j}} -
\ex{\prod_{i = 1}^ks(w_i)\ind{h(w_i) = j}}\cdot \ex{\prod_{i = 1}^k\overline{s(w'_i)}\ind{h(w'_i) = j}} \right)\label{eq:sig-j-last}.
\end{align}
Before proceeding, we introduce some notation that will aid in communicating the bounds we are about to establish. First note that the only terms surviving the expectation above are those in which the world variables pair up, i.e., $w_i = w'_j = w$ for some $i, j \in [k]$, such that every $w_i$ has a match and no $w_i$ or $w'_j$ stands alone without a matching world in the complementary set. To help describe all possible matchings we use m-tuples and functions $f$ and $f'$.
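For intuition, consider $k = 2$ and suppose $w_1$ differs from each of $w_2, w'_1, w'_2$. Assuming, as is standard for such sketch constructions (though not restated here), that the signs are drawn from a family that is at least $2k$-wise independent with $\ex{s(w)} = 0$ and independent of $h$, the lone factor $s(w_1)$ forces the corresponding term to vanish:
\begin{equation*}
\ex{s(w_1)\,s(w_2)\,\overline{s(w'_1)}\,\overline{s(w'_2)}} = \ex{s(w_1)}\cdot\ex{s(w_2)\,\overline{s(w'_1)}\,\overline{s(w'_2)}} = 0,
\end{equation*}
and the same factorization applies with the bucket indicators present, since $h$ is independent of the signs. Only terms in which every world variable is matched by one from the complementary set can therefore contribute.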
\subsection{M-tuples}
\begin{Definition}
Given a $k$-way join, let $m \in [k]$. An m-tuple is a set of tuples, each tuple containing $m$ elements, such that the elements of each tuple sum to $k$, i.e., $\forall i, \sum_j m_{t_{i, j}} = k$, where $i$ indexes the $i^{th}$ tuple in $m_t$ and $j$ indexes the $j^{th}$ element of that tuple. The set contains each such tuple only once up to symmetry, meaning that a tuple containing the same elements in a different order is disallowed.
\end{Definition}
For example, when $k = 4$ and $m = 2$, the m-tuple, denoted $m_2$, is $\left\{\left(1, 3\right), \left(2, 2\right)\right\}$. Here, $m_{2_{1, 1}} = 1$, and while the tuple $\left(3, 1\right)$ also sums to $k = 4$, we do not include it since its symmetric counterpart $\left(1, 3\right)$ is already present.
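To make the definition concrete, the complete family of m-tuples for $k = 4$ (each tuple's entries summing to $k$, listed once up to reordering) is
\begin{equation*}
m_1 = \left\{(4)\right\}, \qquad m_2 = \left\{(1, 3), (2, 2)\right\}, \qquad m_3 = \left\{(1, 1, 2)\right\}, \qquad m_4 = \left\{(1, 1, 1, 1)\right\},
\end{equation*}
though, as we will see below, only the cases $m \in [2, k]$ contribute to the variance bound.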
\AR{Why is the definition of M-tuples needed? From what I understand you need this to define what kinds of $f$ and $f'$ are allowed but in that case why not state those properties directly in terms of $f$ and $f'$? Actually after reading the next section, I do not see why these properties are needed at all..}
\AH{I use the m-tuples to explain 1) what kind of matchings survive and 2) that $f, f'$ must only cross product from within the matchings of the same tuple. Maybe there is an easier way to do this.}
\subsection{f, f'}
\begin{Definition}
The functions $f, f'$ are surjective mappings from $[k]$ onto $[m]$: $f: [k] \rightarrow [m]$.
\end{Definition}
%\begin{equation*}
%f(i) = \begin{cases}
% \widetilde{w_1} &f(i) = 1\\
% \widetilde{w_2} &f(i) = 2\\
% \vdots &\vdots\\
% \widetilde{w_m} &f(i) = m.
% \end{cases}
%\end{equation*}
The functions $f, f'$ are used to produce the mappings $w_i \mapsto \widetilde{w_{f(i)}}$.
In particular, $f$ and $f'$ are the machinery for mapping the $k$ world variables $\wElem_1, \ldots, \wElem_k$ to $m$ distinct world values. Note that for a given $m$, there may be several ways to map the $k$ worlds to $m$ distinct values.
\AH{Here is where I have attempted to use prose to discuss the restrictions on $f$ and $f'$, rather than the use of m-tuples. Maybe there is a better, cleaner formal way?}
E.g., for $k = 4, m = 2$, one class of mappings sends a single $\wElem_i$ to one distinct value while the other three $\wElem_i$ map to the second distinct value. Another class maps two of the $\wElem_i$ to one distinct value and the remaining two $\wElem_i$ to a separate distinct value. The expectations of equation \eqref{eq:sig-j-last} restrict $f$ and $f'$ to belong to the same class of $m$-mapping: if the mapping $f$ for $k = 4, m = 2$ falls in the one-distinct/three-equal class, then $f'$ must come from that same class of mappings, and not from another class, such as the one in which two $\wElem_i$ map to one distinct world while the other two $\wElem_i$ map to a separate distinct world.
\AH{Here is the use of m-tuples to explain the same thing.}
In the example above, $f$ mappings for $m_{2_1}$ may only be paired with $f'$ mappings for $m_{2_1}$ and not with those for $m_{2_2}$, and likewise for $f, f'$ mappings of $m_{2_2}$; a concrete pair of admissible mappings is written out below.
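For concreteness (the particular index choices here are ours, purely for illustration), one admissible pairing for the tuple $m_{2_1} = (1, 3)$ is
\begin{equation*}
f(1) = 1,\quad f(2) = f(3) = f(4) = 2, \qquad\qquad f'(3) = 1,\quad f'(1) = f'(2) = f'(4) = 2,
\end{equation*}
since both send exactly one index to $\widetilde{\wElem_1}$ and three indices to $\widetilde{\wElem_2}$. Pairing this $f$ with, say, $f'(1) = f'(2) = 1, f'(3) = f'(4) = 2$, which follows the $(2, 2)$ pattern of $m_{2_2}$, is disallowed.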
Using the above definitions, we can now present the variance bounds for $\sigsq_j$ based on \eqref{eq:sig-j-last}.
Since the expectations in \eqref{eq:sig-j-last} cancel whenever $\wElem_1 = \cdots = \wElem_k$ and $\wElem'_1 = \cdots = \wElem'_k$ (whether or not the two common values coincide), the case of a single distinct world value contributes nothing and can be discarded. We then need to sum over all the possibilities with $m$ distinct world values for $m \in [2, k]$. From equation \eqref{eq:sig-j-last}, starting with $m = 2$, for one $f$ and one $f'$ from the same class of mappings, we get
\begin{equation*}
\frac{1}{\sketchCols^2}\sum_{\widetilde{\wElem_1}, \widetilde{\wElem_2} \in W}\prod_{i = 1}^{k}\vect_i(\widetilde{\wElem_{f(i)}})\vect_i(\widetilde{\wElem_{f'(i)}}).
\end{equation*}
This is because the expectation in \eqref{eq:sig-j-last} survives only for mappings that produce pairs of the form $\sine(\wElem)\conj{\sine(\wElem)}$. With two distinct world values, the indicator variables in the expectation contribute $\frac{1}{\sketchCols}\cdot \frac{1}{\sketchCols}$.
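To spell out the indicator computation (under the assumptions, used throughout but not restated here, that $\sine(\wElem)\conj{\sine(\wElem)} = 1$ and that $h$ places distinct world values into buckets uniformly and pairwise independently), once every sign is matched with its conjugate the sign factors contribute $1$, and the remaining indicator factors give
\begin{equation*}
\ex{\ind{h(\widetilde{\wElem_1}) = j}\,\ind{h(\widetilde{\wElem_2}) = j}} = \ex{\ind{h(\widetilde{\wElem_1}) = j}}\cdot\ex{\ind{h(\widetilde{\wElem_2}) = j}} = \frac{1}{\sketchCols}\cdot\frac{1}{\sketchCols},
\end{equation*}
which is exactly the $\frac{1}{\sketchCols^2}$ factor above.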
We then need to sum over all mappings $f, f'$ for each case $c$ when the number of distinct values is $m = 2$, resulting in
\begin{equation*}
\frac{1}{\sketchCols^2}\sum_{\widetilde{\wElem_1}, \widetilde{\wElem_2} \in W}\sum_{c \in m_2}\sum_{f, f'}\prod_{i = 1}^{k}\vect_i(\widetilde{\wElem_{f(i)}})\vect_i(\widetilde{\wElem_{f'(i)}}).
\end{equation*}
Finally, we need to do this for all $m \in [2, k]$, which gives
\begin{equation*}
\sigsq_j = \sum_{m \in [2, k]} \frac{1}{\sketchCols^m} \sum_{\widetilde{\wElem_1}\cdots\widetilde{\wElem_m}\in W} \sum_{c \in m_l}\sum_{f, f'} \prod_{i = 1}^{k} \vect_i(\widetilde{\wElem_{f(i)}}) \vect_i(\widetilde{\wElem_{f'(i)}})
\end{equation*}
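As a quick illustration of how the nested sums unroll (this is simply the expression above instantiated, not an additional claim), take $k = 2$: the only admissible value is $m = 2$, the only tuple in $m_2$ is $(1, 1)$, and $f, f'$ each range over the two surjections $[2] \rightarrow [2]$, giving
\begin{equation*}
\sigsq_j = \frac{1}{\sketchCols^2}\sum_{\widetilde{\wElem_1}, \widetilde{\wElem_2} \in W}\left(\vect_1(\widetilde{\wElem_1})^2\vect_2(\widetilde{\wElem_2})^2 + 2\,\vect_1(\widetilde{\wElem_1})\vect_2(\widetilde{\wElem_1})\vect_1(\widetilde{\wElem_2})\vect_2(\widetilde{\wElem_2}) + \vect_1(\widetilde{\wElem_2})^2\vect_2(\widetilde{\wElem_1})^2\right).
\end{equation*}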
\AH{Need better notation for $cases$.}
\AR{You need to argue why the above follows from~\eqref{eq:sig-j-last}: either here or somewhere else. Basically how you go from a sum of $2k$ variables to this nested sum-- this needs to be fully argued.}