% -*- root: main.tex -*-
\section{Analysis}
\label{sec:analysis}
\AR{This is a notational nitpick, but I would prefer it if this section were written for a function $v: W\to K$ and not necessarily the special case of $v=v_t$. In particular, there is no notion of probability $p$. At some point, we'll have to revisit this, but I think it would be good to have the analysis in this section be for an arbitrary function $v$ and not the specific one from the TIDB. Note that this means that you should not have the first two equations in this section.}
\AH{I think I may need some clarification on this. First, I had originally thought the $t$ subscript denotes the tuple identifier, so removing that generalizes $v$ to a function for an arbitrary tuple. But, based on the comment above, I don't think this is what you are going for. I think that you are saying we need to treat $v$ generally to account for more than just TIDB. If this is the case, and I am understanding you correctly to this point, it might be helpful to discuss this face to face.}
We begin the analysis by showing that, with high probability, an estimate is approximately $\numWorldsP$, where $p$ is a tuple's probability measure for a given TIPDB. Note that
\begin{equation}
%\gVt{k\cdot}
\numWorldsP = \numWorldsSum\label{eq:mu}.
\end{equation}
Furthermore, when $\kMap{t}$ is generalized to have elements in the range $\left[0, \infty\right]$, we obtain the result
\begin{equation}
\norm{\kMap{t}}\prob = \numWorldsSum\label{eq:gen-mu}.
\end{equation}
We start off by making the claim that the expectation of the estimate of a tuple's annotations across all worlds is $\sum\limits_{\wVec \in \pw}\kMapParam{\wVec}$, formally
\begin{equation}
\expect{\sum_{\wVec \in \pw} \sketchJParam{\sketchHashParam{\wVec}} \cdot \sketchPolarParam{\wVec}} = \sum_{\wVec \in \pw}\kMapParam{\wVec}\label{eq:allWorlds-est}.
\end{equation}
To verify this claim, we argue that, for every $\wVec \in \pw$, the expectation of the estimate of a tuple's annotation in a single world is its annotation, i.e., the output of $\genKMap{\wVec}$. \AR{Again this claim should be for every $\mathbf{w}\in W$ and not related to whether $t$ appears in a world or not.} Formally,
\begin{equation}
\expect{\sketchJParam{\sketchHashParam{\wVec}}\cdot \sketchPolarParam{\wVec}} = \kMapParam{\wVec} \label{eq:single-est}.
\end{equation}
For a given $\wVec \in \pw$, substituting definitions we have
\setcounter{equation}{3}
\begin{subequations}
\begin{align}
&\expect{\sketchJParam{\sketchHashParam{\wVec}} \cdot \sketchPolarParam{\wVec}} = \nonumber\\
&\phantom{{}\sketchJParam{\sketchHashParam{\wVec}}}\expect{\big(\sum_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec}}}\kMapParam{\wVecPrime} \cdot \sketchPolarParam{\wVecPrime}\big) \cdot \sketchPolarParam{\wVec} }\label{eq:step-one}\\
%\end{align}
%Since $\wVec \in \pw$, we know that for $\wVecPrime\in \pw, \exists \wVecPrime \st \wVecPrime = \wVec$. This yields
%\[
=&~\expect{\sum\limits_{\substack{\wVecPrime\in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime = \wVec}}\kMapParam{\wVecPrime}\sketchPolarParam{\wVecPrime}^2 +
\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime \neq \wVec}}\kMapParam{\wVecPrime}\sketchPolarParam{\wVecPrime}\sketchPolarParam{\wVec}}\label{eq:step-two}\\
%\] which can be written as
%\[
=&~\expect{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime = \wVec}}\kMapParam{\wVecPrime}\sketchPolarParam{\wVecPrime}^2} +
\expect{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime \neq \wVec}}\kMapParam{\wVecPrime}\sketchPolarParam{\wVecPrime}\sketchPolarParam{\wVec}}\label{eq:step-three}\\
=&~\expect{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime = \wVec}}\kMapParam{\wVecPrime}} \cdot \expect{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime = \wVec}}\sketchPolarParam{\wVecPrime}^2} + \nonumber\\
&\qquad\expect{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime \neq \wVec}}\kMapParam{\wVecPrime}}\expect{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime \neq \wVec}}\sketchPolarParam{\wVecPrime}\sketchPolarParam{\wVec}}\label{eq:step-three-a}\\
=&~{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime = \wVec}}\kMapParam{\wVecPrime}} \cdot \expect{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime = \wVec}}\sketchPolarParam{\wVecPrime}^2} + \nonumber\\
&\qquad\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime \neq \wVec}}\kMapParam{\wVecPrime}\expect{\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime \neq \wVec}}\sketchPolarParam{\wVecPrime}\sketchPolarParam{\wVec}}\label{eq:step-three-b}\\
%\] from which the last term evaluates to $0$ and we have
%\[
=&~\sum\limits_{\substack{\wVecPrime \in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime = \wVec}}\kMapParam{\wVecPrime} \label{eq:step-three-c}\\
=&~\expect{\sum\limits_{\substack{\wVecPrime\in \pw \st \\
\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec},\\
\wVecPrime = \wVec}}\kMapParam{\wVecPrime}\sketchPolarParam{\wVecPrime}^2}\label{eq:step-four}\\
%\]
=&~\kMapParam{\wVec}\label{eq:step-five}
\end{align}
\end{subequations}
\AR{The numbering of the equations above is a bit off: you go from (4) to (3a) and so on. Also for the case when $\mathbf{w}=\mathbf{w'}$ there is no need to sum over $\mathbf{w},\mathbf{w'}\in W$; it just makes things confusing. Just sum over $\mathbf{w}'\in W$.}
\AH{The numbering got jumbled because I had added another equation before the 'setcounter' command, and there seems to be no easy way to retrieve a particular equation and supply its number as input to 'setcounter'.}
\AH{I am slightly confused by the second point of this comment. I thought that, although arbitrary, $\wVec$ was fixed, meaning that the sum is not over all $\wVec \in \pw$, but rather for an arbitrarily fixed $\wVec \in \pw$. However, I have made slight changes that reflect the suggestion, hoping that I have interpreted it as you intended.}
\begin{Justification}
\hfill
\begin{itemize}
\item \eq{\eqref{eq:step-one}} is a substitution of the definition of $\sketch$.
\item \eq{\eqref{eq:step-two}} splits the sum into the term with $\wVecPrime = \wVec$ and the terms with $\wVecPrime \neq \wVec$, distributing the factor $\sketchPolarParam{\wVec}$ over both.
\item \eq{\eqref{eq:step-three}} uses linearity of expectation to split the expectation of the sum into a sum of expectations. \AR{I would push the expectation further in so that they only deal with the $s_i$ terms.}
\item \eq{\eqref{eq:step-four}} follows from the second term of \eq{\eqref{eq:step-three}} evaluating to zero. This assumes pairwise independence of $\sketchPolar$.
\item \eq{\eqref{eq:step-five}} follows because the square of $\sketchPolarParam{\wVec}$ always evaluates to 1. Keep in mind that exactly one $\wVecPrime$ in the summation equals $\wVec$.
\end{itemize}
\end{Justification}
%which in turn
%\begin{multline*}
%\mathbb{E}\big[\kMapParam{\wVecPrime_0}\cdot \sketchPolarParam{\wVecPrime_0} + \cdots \\
%+\kMapParam{\wVecPrime_j}\cdot \sketchPolarParam{\wVecPrime_j}\cdot \sketchPolarParam{\wVecPrime_j}+ \cdots \\
%+ \kMapParam{\wVecPrime_n}\sketchPolarParam{\wVecPrime_n}\big]
%\end{multline*}
%\AH{break it up into w' and w}
%Due to the uniformity of $\sketchPolar$, we have
%\begin{equation*}
%= \kMapParam{\wVec},
%\end{equation*}
This verifies \eqref{eq:single-est}.
\begin{Assumption}
\hfill
\begin{itemize}
\item \eq{\eqref{eq:step-three}} assumes that $\sketchPolar$ is pairwise independent.
%\item $\sketchHash$ is uniformly distributed.
\end{itemize}
\end{Assumption}
Since \eqref{eq:single-est} holds, by linearity of expectation, \eqref{eq:allWorlds-est} also must hold.
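As an illustrative sanity check (a toy case assumed here, not taken from the sketch construction), fix the hash function and suppose the bucket of $\wVec$ contains exactly one other world $\wVecPrime \neq \wVec$. Assuming $\sketchPolar$ is pairwise independent and uniform on $\{-1, 1\}$, the single-world estimate expands as
\begin{align*}
\expect{\sketchJParam{\sketchHashParam{\wVec}} \cdot \sketchPolarParam{\wVec}}
&= \expect{\big(\kMapParam{\wVec}\sketchPolarParam{\wVec} + \kMapParam{\wVecPrime}\sketchPolarParam{\wVecPrime}\big) \cdot \sketchPolarParam{\wVec}}\\
&= \kMapParam{\wVec}\expect{\sketchPolarParam{\wVec}^2} + \kMapParam{\wVecPrime}\expect{\sketchPolarParam{\wVecPrime}}\expect{\sketchPolarParam{\wVec}}\\
&= \kMapParam{\wVec} \cdot 1 + \kMapParam{\wVecPrime} \cdot 0 \cdot 0 = \kMapParam{\wVec},
\end{align*}
so the colliding world perturbs an individual estimate but not its expectation.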
%We can now take \eqref{eq:single-est}, substitute it in for \eqref{eq:allWorlds-est} and show by linearity of expectation that \eqref{eq:allWorlds-est} holds.
%\begin{align}
%&\expect{\sum_{\wVec \in \pw} \sketchJParam{\sketchHashParam{\wVec}} \cdot \sketchPolarParam{\wVec}} \nonumber\\
%&= \expect{\sum_{\wVecPrime \in \pw}\kMapParam{\wVecPrime} \cdot \sketchPolarParam{\wVecPrime} \cdot \sum_{\substack{\wVec \in \pw \st \\
%\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec}}}\sketchPolarParam{\wVec}}\nonumber\\
%&= \sum_{\wVec \in \pw} \expect{\left( \sum_{\substack{\wVecPrime \in \pw \st \\
%\sketchHashParam{\wVecPrime} = \sketchHashParam{\wVec}}}\kMapParam{\wVecPrime}\cdot\sketchPolarParam{\wVecPrime}\right) \cdot \sketchPolarParam{\wVec}}\nonumber\\
%&= \sum_{\wVec \in \pw}\kMapParam{\wVec}\label{eq:estExpect}.
%\end{align}
%\begin{align}
%&\expect{\estimate}\\
%=&\expect{\estExpOne}\\
%=&\expect{\sum_{\substack{j \in [B],\\
% \wVec \in \pw~|~ \sketchHash{i}[\wVec] = j,\\
% \wVec[w']\in \pw~|~ \sketchHash{i}[\wVec[w']] = j} } v_t[\wVec] \cdot s_i[\wVec] \cdot s_i[\wVec[w']]}\\
%=&\multLineExpect\big[\sum_{\substack{j \in [B],\\
% \wVec~|~\sketchHashParam{\wVec}= j,\\
% \wVecPrime~|~\sketchHashParam{\wVecPrime} = j,\\
% \wVec = \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime} + \nonumber \\
%&\phantom{{}\kMapParam{\wVec}}\sum_{\substack{j \in [B], \\
% \wVec~|~\sketchHashParam{\wVec} = j,\\
% \wVecPrime ~|~ \sketchHashParam{\wVecPrime} = j,\\ \wVec \neq \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot\sketchPolarParam{\wVecPrime}\big]\textit{(by linearity of expectation)}\\
%=&\expect{\sum_{\substack{j \in [B],\\
% \wVec~|~\sketchHashParam{\wVec}= j,\\
% \wVecPrime~|~\sketchHashParam{\wVecPrime} = j,\\
% \wVec = \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime}} \nonumber \\
%&\phantom{{}\big[}\textit{(by uniform distribution in the second summation)}\\
%=& \estExp \label{eq:estExpect}
%\end{align}
%\AR{A general comment: The last display equation should have a period at the end. The idea is that display equations are considered part of a sentence and every sentence should end with a period.}
%\AH{Thank you for clarifying this, as I have always wondered what the convention was for display equations. Hopefully, I haven't missed any end display equations in this paper, and have them all fixed properly.}
For the next step, we show that the variance of an estimate is small.%$$\varParam{\estimate}$$
\begin{subequations}
\begin{align}
&\varParam{\sum_{\wVec \in \pw}\sketchJParam{\sketchHashParam{\wVec}} \cdot \sketchPolarParam{\wVec}}\\%\nonumber\\
=~&\varParam{\sum_{\wVec \in \pw}\kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \sum_{\substack{\wVecPrime \in \pw \st\\ \sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}}}\sketchPolarParam{\wVecPrime}}\label{eq:var_step-one}\\%\nonumber\\%\estExpOne}\\
=~& \mathbb{E}\big[\big(\sum_{\substack{ \wVec, \wVecPrime \in \pw \st \\
\sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime}\nonumber\\
&\qquad - \expect{\sum_{\wVec \in \pw} \sketchJParam{\sketchHashParam{\wVec}} \cdot \sketchPolarParam{\wVec}}\big)^2\big]\label{eq:var_step-two}\\%\nonumber\\
=~&\mathbb{E}\big[\sum_{\substack{
\wVec_1, \wVec_2,\\
\wVecPrime_1, \wVecPrime_2 \in \pw,\\
\sketchHashParam{\wVec_1} = \sketchHashParam{\wVecPrime_1},\\
\sketchHashParam{\wVec_2} = \sketchHashParam{\wVecPrime_2}
}}\kMapParam{\wVec_1} \kMapParam{\wVec_2}\sketchPolarParam{\wVec_1}\sketchPolarParam{\wVec_2}\sketchPolarParam{\wVecPrime_1}\sketchPolarParam{\wVecPrime_2}\big]\nonumber\\
&\qquad - \left(\sum_{\wVec \in \pw}\kMapParam{\wVec}\right)^2 \label{eq:var-sum-w}.
\end{align}
\end{subequations}
\begin{Justification}
\hfill
\begin{itemize}
\item \eq{\eqref{eq:var_step-one}} follows from substituting the definition of $\sketch$ and the commutativity of addition. The constraint that $\wVec$ and $\wVecPrime$ hash to the same bucket comes from the definition of $\sketch$: each bucket holds the sum of the terms mapped to it, so the overall sum can be regrouped so that every $\kMapParam{\wVec} \cdot \sketchPolarParam{\wVec}$ is multiplied by each $\sketchPolarParam{\wVecPrime}$ mapped to its bucket.
\item \eq{\eqref{eq:var_step-two}} follows by substituting the definition of variance.
\item \eq{\eqref{eq:var-sum-w}} results from the further evaluation of \eqref{eq:var_step-two}.
\end{itemize}
\end{Justification}
\begin{Assumption}
\hfill
\begin{itemize}
\item The subsequent evaluations of expectation assume 4-wise independence of $\sketchPolar$.
\end{itemize}
\end{Assumption}
Note that four-wise independence is assumed across all four random variables of \eqref{eq:var-sum-w}. Zooming in on the products of the $\sketchPolar$ functions,
\begin{equation}
\sketchPolarParam{\wa}\cdot\sketchPolarParam{\wb}\cdot\sketchPolarParam{\wc}\cdot\sketchPolarParam{\wVecD} \label{eq:polar-product}
\end{equation}
we see that %it can be seen that for $\wOne, \wOneP \in \pw$ and $\wTwo, \wTwoP \in \pw'$, all four random variables in \eqref{eq:polar-product} take their values from $\pw$, although we have iteration over two separate sets $\pw$.
there are five possible patterns of equality among the $\wVec$ variables, where $a, b, c, d \in \{1, 1', 2, 2'\}$ are pairwise distinct:
\AR{This confused me a lot to start off with. I think it is better to use $a,b,c,d$ only in the definitions of $S_1$ to $S_5$ where it is needed. In particular, it is not the case in $S_1$ to $S_3$ that you look at all possible assignment of $a, b, c, d \in \{1, 1', 2, 2'\}$.}
\begin{align*}
&\distPattern{1}:&\forElems{\cOne}\\
&\distPattern{2}:&\forElems{\cTwo}\\
&\distPattern{3}:&\forElems{\cThree}\\
&\distPattern{4}:&\forElems{\cFour}\\
&\distPattern{5}:&\forElems{\cFive}
\end{align*}
\AR{I think the definitions above need more work and/or there needs to be a justification for why $S_1$ to $S_5$ partition all the possibilities.}
\AH{Maybe we could further discuss this today 8/7/19.}
Note that every world is passed to the same function $\sketchPolar$, so equal worlds are mapped to the same element of the image of $\sketchPolar$. \AR{I am not sure what the sentence above is saying.}
We are interested only in the cases whose expectation is nonzero, since a term with expectation zero contributes nothing to the summation of \eqref{eq:var-sum-w}. In expectation we have that
\begin{align}
\forAllW{\distPattern{1}}&\rightarrow\expect{%\sum_{\substack{\elems \\
%\st \cOne}}
\polarProdEq} = 1 \label{eq:polar-prod-all}
\end{align}
since the same element of the image of $\sketchPolar$ is multiplied by itself an even number of times. Similarly,
\begin{align}
\forAllW{\distPattern{2}}&\rightarrow\expect{%\sum_{\substack{\elems \\
%\st \cTwo}}
\polarProdEq} = 1 \label{eq:polar-prod-two-and-two}
\end{align}
because, for each equality, the same element of the image of $\sketchPolar$ is multiplied by itself, producing a polarity of 1 for each equality and hence a final product of 1. For $\distPattern{3}, \distPattern{4}, \distPattern{5}$, the final product involves two, three or four independent variables in $\{-1, 1\}$, which produces the following results:
\begin{align}
\forAllW{\distPattern{3}}&\rightarrow\expect{%\sum_{\substack{\elems \\
%\st \cThree}}
\polarProdEq} = 0 \nonumber
\end{align}
\begin{align}
\forAllW{\distPattern{4}}&\rightarrow\expect{%\sum_{\substack{\elems \\
%\st \cFour}}
\polarProdEq} = 0 \nonumber
\end{align}
\begin{align}
\forAllW{\distPattern{5}}&\rightarrow\expect{%\sum_{\substack{\elems \\
%\st \cFive}}
\polarProdEq} = 0. \nonumber
\end{align}
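To see where these values come from, the following worked cases assume that each value of $\sketchPolar$ is uniform on $\{-1, 1\}$ (so its expectation is $0$ and its square is $1$) and that distinct worlds receive independent values. The worlds $\wVec_a, \wVec_b, \wVec_c$ are hypothetical, pairwise distinct, and used only for illustration.
\begin{align*}
\expect{\sketchPolarParam{\wVec_a}^4} &= \expect{1} = 1,\\
\expect{\sketchPolarParam{\wVec_a}^2\sketchPolarParam{\wVec_b}^2} &= \expect{1 \cdot 1} = 1,\\
\expect{\sketchPolarParam{\wVec_a}^3\sketchPolarParam{\wVec_b}} &= \expect{\sketchPolarParam{\wVec_a}}\expect{\sketchPolarParam{\wVec_b}} = 0,\\
\expect{\sketchPolarParam{\wVec_a}^2\sketchPolarParam{\wVec_b}\sketchPolarParam{\wVec_c}} &= \expect{\sketchPolarParam{\wVec_b}}\expect{\sketchPolarParam{\wVec_c}} = 0.
\end{align*}
In general, any world that appears an odd number of times in the product leaves an unpaired factor whose expectation is zero, so the whole product vanishes in expectation.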
Only equations \eqref{eq:polar-prod-all} and \eqref{eq:polar-prod-two-and-two} influence the $\var$ computation.
Considering $\distPattern{1}$, the contribution to the first term of \eqref{eq:var-sum-w} is
\begin{equation}
\distPatOne\label{eq:distPatOne}.
\end{equation}
This is the case because
\begin{align*}
&\sum_{\substack{\wOne, \wOneP, \wTwo, \wTwoP \in \pw \st \\
\wOne = \wTwo = \wOneP = \wTwoP = \wVec}}
\kMapParam{\wVec} \cdot \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVec}\\
= &\sum_{\wVec \in \pw} \kMapParam{\wVec}\cdot \kMapParam{\wVec}\\
= &\sum_{\wVec \in \pw} \kMapParam{\wVec}^2.
\end{align*}
For the distribution pattern $\cTwo$, we have three subsets $\distPattern{21}, \distPattern{22}, \distPattern{23} \subseteq \distPattern{2}$ to consider.
\begin{align*}
&\distPattern{21}:&\cTwoV{\wOne}{\wOneP}{\wTwo}{\wTwoP} \\
&\distPattern{22}:&\cTwoV{\wOne}{\wTwo}{\wOneP}{\wTwoP}\\
&\distPattern{23}:&\cTwoV{\wOne}{\wTwoP}{\wOneP}{\wTwo}
\end{align*}
Considered separately, the subsets contribute the following terms to the $\var$ computation.
\begin{align}
&\wOne = \wOneP \neq \wTwo =\wTwoP \rightarrow\nonumber\\
&\qquad = \sum_{\substack{\wOne, \wOneP, \wTwo, \wTwoP \in \pw \st \\
\wOne = \wOneP = \wVec \neq\\
\wTwo = \wTwoP = \wVecPrime}}\kMapParam{\wVec}\kMapParam{\wVecPrime}\sketchPolarParam{\wVec}\sketchPolarParam{\wVec}\sketchPolarParam{\wVecPrime}\sketchPolarParam{\wVecPrime} \nonumber\\
&\qquad = \sum_{\wVec, \wVecPrime \in \pw \st \wVec \neq \wVecPrime}\kMapParam{\wVec}\kMapParam{\wVecPrime}\label{eq:variantOne}\\
&\wOne = \wTwo \neq \wOneP = \wTwoP \rightarrow\nonumber\\
&\qquad = \sum_{\substack{\wOne, \wOneP, \wTwo, \wTwoP \in \pw \st \\
\wOne = \wTwo = \wVec \neq\\
\wOneP = \wTwoP = \wVecPrime,\\
\sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}}} \kMapParam{\wVec}\kMapParam{\wVec}\sketchPolarParam{\wVec}\sketchPolarParam{\wVecPrime}\sketchPolarParam{\wVec}\sketchPolarParam{\wVecPrime}\nonumber \\
&\qquad = \sum_{\wVec \in \pw}| \{\wVecPrime \st \wVecPrime \neq \wVec, \sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}\} | \cdot \kMapParam{\wVec}^2\label{eq:variantTwo} \\
&\wOne = \wTwoP \neq \wOneP =\wTwo \rightarrow \nonumber \\
&\qquad = \sum_{\substack{\wOne, \wOneP, \wTwo, \wTwoP \in \pw \st \\
\wOne = \wTwoP = \wVec \neq \\
\wOneP = \wTwo = \wVecPrime,\\
\sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}}} \kMapParam{\wVec} \kMapParam{\wVecPrime}\sketchPolarParam{\wVec}\sketchPolarParam{\wVecPrime}\sketchPolarParam{\wVecPrime}\sketchPolarParam{\wVec} \nonumber \\
&\qquad = \sum_{\substack{\wVec, \wVecPrime \in \pw \st \\
\wVec \neq \wVecPrime,\\
\sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}}}\kMapParam{\wVec}\cdot\kMapParam{\wVecPrime}\label{eq:variantThree}
\end{align}
Note that for $\distPattern{22}$, each squared annotation is multiplied by the number of other worlds in its bucket. This follows from the constraint $\wOne \neq \wOneP$ together with the constraint $\sketchHashParam{\wOne} = \sketchHashParam{\wOneP}$: since $\wOneP$ must belong to the same bucket as $\wOne$ without being equal to it, the sum contains one copy of the squared annotation of $\wOne$ for every such $\wOneP$.
Looking at $\distPattern{23}$, we have a case similar to $\distPattern{22}$, but this time there is no multiplicative factor, since $\wOneP$ and $\wTwoP$ are constrained to equal their opposite $\wVec$ counterparts, which are the arguments of the two $\kMap{t}$ terms.
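As a small illustration of the multiplicative factor (a hypothetical bucket, assumed only for this example), suppose a bucket contains exactly three worlds $\wVec_a, \wVec_b, \wVec_c$. For $\wOne = \wTwo = \wVec_a$, the world $\wOneP = \wTwoP$ can be either $\wVec_b$ or $\wVec_c$, so $\kMapParam{\wVec_a}^2$ is counted twice, and likewise for the other two worlds. The contribution of this bucket to the sum for $\distPattern{22}$ is therefore
\begin{equation*}
2\kMapParam{\wVec_a}^2 + 2\kMapParam{\wVec_b}^2 + 2\kMapParam{\wVec_c}^2,
\end{equation*}
whereas its contribution to the sum for $\distPattern{23}$ runs over ordered pairs of distinct worlds and equals
\begin{equation*}
2\big(\kMapParam{\wVec_a}\kMapParam{\wVec_b} + \kMapParam{\wVec_a}\kMapParam{\wVec_c} + \kMapParam{\wVec_b}\kMapParam{\wVec_c}\big).
\end{equation*}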
Notice that the second term (the expectation squared) of the $\var$ calculation is cancelled out by \eqref{eq:distPatOne} and \eqref{eq:variantOne}, since
\begin{equation*}
\big(\sum_{\wVec \in \pw}\kMapParam{\wVec}\big)^2 = \sum_{\wVec \in \pw}\kMapParam{\wVec}^2 +
\sum_{\substack{\wVec, \wVecPrime \in \pw \st\\
\wVec \neq \wVecPrime}}\kMapParam{\wVec}\kMapParam{\wVecPrime}.%\distPatOne + \variantOne.
\end{equation*}
\begin{Justification}
\hfill
\begin{itemize}
\item The LHS is the expectation squared. We obtain the RHS by first squaring the sum, and then, using the commutative property of addition, rearranging the operands of the summation.
\end{itemize}
\end{Justification}
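For instance, with only two worlds $\wVec_a$ and $\wVec_b$ (a toy case assumed here purely for illustration), squaring the sum gives
\begin{equation*}
\big(\kMapParam{\wVec_a} + \kMapParam{\wVec_b}\big)^2 = \kMapParam{\wVec_a}^2 + \kMapParam{\wVec_b}^2 + \kMapParam{\wVec_a}\kMapParam{\wVec_b} + \kMapParam{\wVec_b}\kMapParam{\wVec_a},
\end{equation*}
where the first two terms are the pairs with $\wVec = \wVecPrime$ and the last two are the two ordered pairs with $\wVec \neq \wVecPrime$.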
With only \eqref{eq:variantTwo} and \eqref{eq:variantThree} remaining, we have
\begin{multline*}
\varParam{\estimate} = \\
\expect{\sum_{\wVec \in \pw}| \{\wVecPrime \st \wVecPrime \neq \wVec, \sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}\} | \cdot \kMapParam{\wVec}^2} ~+ \\
\expect{\sum_{\substack{\wVec, \wVecPrime \in \pw \st \\
\wVec \neq \wVecPrime,\\
\sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}}}\kMapParam{\wVec}\cdot\kMapParam{\wVecPrime}}.
\end{multline*}
%Our current analysis is limited to TIPDBs, where the annotations are in the boolean $\mathbb{B}$ set. Because this is the case, the square of any element is itself.
Computing each term separately gives
\begin{align}
&\expect{\sum_{\wVec \in \pw}| \{\wVecPrime \st \wVecPrime \neq \wVec, \sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}\} | \cdot \kMapParam{\wVec}^2} =%\numWorldsP
\norm{\kMap{t}}^2_2\prob\cdot \frac{\numWorlds}{\sketchCols} - 1\label{eq:spaceOne}\\
&\expect{ \sum_{\substack{\wVec, \wVecPrime \in \pw \st \\
\wVec \neq \wVecPrime,\\
\sketchHashParam{\wVec} = \sketchHashParam{\wVecPrime}}}\kMapParam{\wVec}\cdot\kMapParam{\wVecPrime}} = %\numWorldsP \cdot \frac{\numWorldsP - 1}{\sketchCols}\label{eq:spaceTwo}.
\norm{\kMap{t}}\prob \cdot \frac{\norm{\kMap{t}}\prob - \frac{\norm{\kMap{t}}}{\numWorlds}}{\sketchCols}\label{eq:spaceTwo}.
\end{align}
%In both equations, the sum of $\kMapParam{\wVec}$ over all $\wVec \in \pw$ is $\numWorldsP$ since as noted in equation \eqref{eq:mu} we are summing the number of worlds a tuple $t$ appears in, and for a TIPDB, that is exactly 2 to the power of the number of tuples in the TIPDB (due to the independence of tuples) times tuple $t$'s probability.
\AR{the above two need more work. Let's discuss more in the Aug 7 meeting.}
In equation \eqref{eq:spaceOne}, the multiplicative factor is, in expectation, the number of worlds $\numWorlds$ divided evenly across the $\sketchCols$ buckets, minus the one world that $\wVecPrime$ cannot be. This factor multiplies the sum of squares of the annotations of the $\numWorldsP$ worlds that $t$ appears in.
Equation \eqref{eq:spaceTwo} pairs each of the $\numWorldsP$ worlds with all the other worlds in the same bucket in which tuple $t$ appears. In the TIDB case this factor is $\frac{\numWorldsP - 1}{\sketchCols}$, i.e., a world in a given bucket $j$ in which tuple $t$ appears is summed over its products with the other worlds in bucket $j$ in which $t$ is present.
%\AR{Again, argue why the above claims are true.}
%\AH{All my arguing is plain English. Is there a better way to go about this?}
Equations \eqref{eq:spaceOne} and \eqref{eq:spaceTwo} further reduce to
\begin{equation}
%\frac{2^{2N}(\prob + \prob^2)}{\sketchCols} - \numWorlds(\frac{\prob}{\sketchCols} + \prob)\label{eq:variance}
\norm{\kMap{t}}^2_2\prob\left(\numWorlds - 1\right) + \norm{\kMap{t}}\left(\norm{\kMap{t}} - \frac{\sketchCols}{\numWorlds}\right)\label{eq:variance}.
\end{equation}
By \eqref{eq:variance} we then have
\begin{align*}
%\varSym &< 2^{2N}\big(\frac{2\prob}{\sketchCols}\big) \\
\varSym &< \norm{\kMap{t}}^2_2\prob(\numWorlds - 1) + (\norm{\kMap{t}}\prob)^2 \\
%\sd &<\sdEq\\
\sd &< \sqrt{\norm{\kMap{t}}^2_2\prob(\numWorlds - 1) + (\norm{\kMap{t}}\prob)^2} \\
%\sdRel& < \sqrt{\frac{2}{\sketchCols\prob}}.
\sdRel &< \frac{\sqrt{\norm{\kMap{t}}^2_2\prob(\numWorlds - 1) + (\norm{\kMap{t}}\prob)^2} }{\norm{\kMap{t}}\prob}.
\end{align*}
Recall that $\sdRel = \frac{\sd}{\mu}$ where $\mu$ is defined as $\numWorldsP$ in \eqref{eq:mu} for TIDB and $\norm{\kMap{t}}\prob$ for general $\kMap{t}$ in \eqref{eq:gen-mu}.
Since the sketch has multiple trials, it suffices that a single trial exceeds the error bound $\errB$ with probability smaller than one half; the median of all trials then stays within the error bound. We therefore require
\begin{equation*}
Pr\left[~|X - \mu|~> \Delta\right] < \frac{1}{3}.
%\cheby.
\end{equation*}
Chebyshev's inequality states that $Pr\left[~|X - \mu|~\geq k\sigma~\right] \leq \frac{1}{k^2}$. Substituting $\Delta = k\sigma$, i.e., $k = \frac{\Delta}{\sigma}$ and $k^2 = \frac{\Delta^2}{\sigma^2}$, we have
\begin{equation*}
Pr\left[~|X - \mu|~> \Delta~\right] \leq \frac{\sigma^2}{\Delta^2}.
\end{equation*}
%\AR{It would be better to state the deviation as say $\Delta$ instead of $\epsilon\mu$. Then derive the expression for $B$ in terms of $N,p,\Delta$. Then you can state as consequences what values of $B$ you get for the special cases of $\Delta=\epsilon\cdot 2^N$ and $\Delta=\epsilon\mu$.}
%\AH{Done.}
For the case when $\Delta = \mu\epsilon$, setting the two bounds equal to each other, using $2^{2N}\big(\frac{2\prob}{\sketchCols}\big)$ as the bound on $\sigma^2$ in the TIDB case, and solving for $\sketchCols$ results in
\begin{align*}
\frac{\sigma^2}{\Delta^2} &= \frac{1}{3}\\
\frac{ 2^{2N}\big(\frac{2\prob}{\sketchCols}\big)}{\mu^2\epsilon^2} &= \frac{1}{3}\\
\frac{2^{2N + 1}\prob}{\mu^2\epsilon^2\sketchCols} &= \frac{1}{3}\\
\frac{6 \cdot 2^{2N}\prob}{\mu^2\epsilon^2} &= \sketchCols \\
\frac{6}{\prob\epsilon^2} &= \sketchCols.
\end{align*}
In the above, recall that $\mu$, the expectation of an estimate, is $\numWorldsP$, as seen in equations \eqref{eq:mu} and \eqref{eq:allWorlds-est}.
Setting $\Delta = \epsilon\numWorlds$ gives
\begin{align*}
\frac{ 2^{2N}\big(\frac{2\prob}{\sketchCols}\big)}{\epsilon^22^{2N}} &= \frac{1}{3}\\
\frac{2^{2N+ 1}\prob}{\epsilon^22^{2N}\sketchCols} &= \frac{1}{3}\\
\frac{6 \cdot 2^{2N}\prob}{\epsilon^22^{2N}} &= \sketchCols \\
\frac{6\prob}{\epsilon^2} &= \sketchCols.
\end{align*}
Other cases for $\Delta$ can be solved similarly.
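To give a feel for the magnitudes, the following arithmetic plugs the hypothetical values $\prob = 0.5$ and $\epsilon = 0.1$ (chosen only for illustration) into the two bounds above:
\begin{align*}
\Delta = \mu\epsilon:&\quad \sketchCols = \frac{6}{\prob\epsilon^2} = \frac{6}{0.5 \cdot 0.01} = 1200,\\
\Delta = \epsilon\numWorlds:&\quad \sketchCols = \frac{6\prob}{\epsilon^2} = \frac{6 \cdot 0.5}{0.01} = 300.
\end{align*}
Under these assumptions the relative-error guarantee needs more buckets than the absolute one by a factor of $\frac{1}{\prob^2}$, which is $4$ here.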