% -*- root: main.tex -*-
\section { Analysis}
\label { sec:analysis}
We begin the analysis by showing that with high probability an estimate is approximately $ \numWorldsP $ , where $ p $ is the probability measure for a given TIPD. Note that
\begin { equation}
\numWorldsP = \numWorldsSum \label { eq:mu} .
\end { equation}
The first step is to show that the expectation of the estimate of a tuple $t$'s membership across all worlds is $ \numWorldsSum $ .
\begin { align}
& \expect { \estimate } \\
=& \expect { \estExpOne } \\
=& \expect { \sum _ { \substack { j \in [B],\\
\wVec \in \pw ~|~ \sketchHash { i} [\wVec ] = j,\\
\wVec [w'] \in \pw ~|~ \sketchHash { i} [\wVec [w'] ] = j} } v_ t[\wVec ] \cdot s_ i[\wVec ] \cdot s_ i[\wVec [w'] ]} \\
=& \multLineExpect \big [\sum_{\substack{j \in [B] ,\\
\wVec ~|~\sketchHashParam { \wVec } = j,\\
\wVecPrime ~|~\sketchHashParam { \wVecPrime } = j,\\
\wVec = \wVecPrime } } \kMapParam { \wVec } \cdot \sketchPolarParam { \wVec } \cdot \sketchPolarParam { \wVecPrime } + \nonumber \\
& \phantom { { } \kMapParam { \wVec } } \sum _ { \substack { j \in [B], \\
\wVec ~|~\sketchHashParam { \wVec } = j,\\
\wVecPrime ~|~ \sketchHashParam { \wVecPrime } = j,\\ \wVec \neq \wVecPrime } } \kMapParam { \wVec } \cdot \sketchPolarParam { \wVec } \cdot \sketchPolarParam { \wVecPrime } \big ]\textit { (by linearity of expectation)} \\
=& \expect { \sum _ { \substack { j \in [B],\\
\wVec ~|~\sketchHashParam { \wVec } = j,\\
\wVecPrime ~|~\sketchHashParam { \wVecPrime } = j,\\
\wVec = \wVecPrime } } \kMapParam { \wVec } \cdot \sketchPolarParam { \wVec } \cdot \sketchPolarParam { \wVecPrime } } \nonumber \\
& \phantom { { } \big [} \textit { (by uniform distribution in the second summation)} \\
=& \estExp \label { eq:estExpect}
\end { align}
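The reason the second summation drops out can be sketched as follows, under the assumption (not restated in this section) that each polarity value $\sketchPolarParam{\wVec}$ is uniform on $\{-1,+1\}$ and that the values for distinct worlds are at least pairwise independent. For $\wVec \neq \wVecPrime$,
\begin{equation*}
\expect{\sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime}} = \expect{\sketchPolarParam{\wVec}} \cdot \expect{\sketchPolarParam{\wVecPrime}} = 0 \cdot 0 = 0,
\end{equation*}
whereas for $\wVec = \wVecPrime$ the product is deterministic:
\begin{equation*}
\sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime} = \big(\sketchPolarParam{\wVec}\big)^2 = 1.
\end{equation*}
Under these assumptions, only the terms with $\wVec = \wVecPrime$ survive in expectation.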
For the next step, we show that the variance of an estimate is small.
\begin { align}
\varParam { \estimate } & =\varParam { \estExpOne } \\
& = \expect { \big (\estTwo \big )^ 2} \\
& =\expect { \sum _ { \substack {
\wVec _ 1, \wVec _ 2,\\
\wVecPrime _ 1, \wVecPrime _ 2 \in \pw ,\\
\sketchHashParam { \wVec _ 1} = \sketchHashParam { \wVecPrime _ 1} ,\\
\sketchHashParam { \wVec _ 2} = \sketchHashParam { \wVecPrime _ 2}
} } \kMapParam { \wVec _ 1} \cdot \kMapParam { \wVec _ 2} \cdot \sketchPolarParam { \wVec _ 1} \cdot \sketchPolarParam { \wVec _ 2} \cdot \sketchPolarParam { \wVecPrime _ 1} \cdot \sketchPolarParam { \wVecPrime _ 2} } \label { eq:var-sum-w}
\end { align}
Note that the four polarity values appearing in \eqref { eq:var-sum-w} are assumed to be four-wise independent. Focusing on the product of the $ \sketchPolar $ terms,
\begin { equation}
\polarProdEq \label { eq:polar-product}
\end { equation}
it can be seen that all four variables in \eqref { eq:polar-product} take their values from the same set $ \pw $ , even though the summation iterates over two copies of it ($ \wOne , \wOneP \in \pw $ and $ \wTwo , \wTwoP \in \pw ' $). Thus, there are five possible patterns of equality among the $ \wVec $ variables, namely:
\begin { align*}
& \distPattern { 1} :& \cOne \\
& \distPattern { 2} :& \cTwo \textit { *} \\
& \distPattern { 3} :& \cThree \textit { *} \\
& \distPattern { 4} :& \cFour \textit { *} \\
& \distPattern { 5} :& \cFive
\end { align*}
$$ \text { } ^ * \textit { ( and all variants of the respective pattern ) } $$
We are interested in those cases whose expectation is nonzero, since a case with zero expectation contributes nothing to the summation in \eqref { eq:var-sum-w} . In expectation, we have
\begin { align}
& \expect { %\sum_{\substack{\elems \\
%\st \cOne}}
\polarProdEq \st \cOne } = 1 \label { eq:polar-prod-all} \\
& \expect { %\sum_{\substack{\elems \\
%\st \cTwo}}
\polarProdEq \st \cTwo } = 1 \label { eq:polar-prod-two-and-two} \\
& \expect { %\sum_{\substack{\elems \\
%\st \cThree}}
\polarProdEq \st \cThree } = 0 \nonumber \\
& \expect { %\sum_{\substack{\elems \\
%\st \cFour}}
\polarProdEq \st \cFour } = 0 \nonumber \\
& \expect { %\sum_{\substack{\elems \\
%\st \cFive}}
\polarProdEq \st \cFive } = 0 \nonumber
\end { align}
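These values are consistent with a simple counting argument, sketched here under the assumptions that each polarity value is uniform on $\{-1,+1\}$ and that the four-wise independence noted above holds. Writing $s$ as a stand-in for the polarity function (a symbol introduced only for this illustration), group the four factors by distinct world and let $m_w$ be the number of times world $w$ appears. Then
\begin{equation*}
\expect{\prod_{w} s(w)^{m_w}} = \prod_{w} \expect{s(w)^{m_w}},
\qquad
\expect{s(w)^{m_w}} =
\begin{cases}
1 & \text{if } m_w \text{ is even},\\
0 & \text{if } m_w \text{ is odd}.
\end{cases}
\end{equation*}
The product therefore has expectation $1$ exactly when every distinct world appears an even number of times (all four equal, or two matched pairs), and expectation $0$ as soon as some world appears an odd number of times.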
Only equations \eqref { eq:polar-prod-all} and \eqref { eq:polar-prod-two-and-two} influence the $ \var $ computation.
Considering $ \distPattern { 1 } $ , the contribution to the variance is
\begin { equation}
\distPatOne \label { eq:distPatOne}
\end { equation}
For the distribution pattern $ \cTwo $ , we have three variants to consider.
\begin { align*}
& \vCase { 1} :& \cTwo \\
& \vCase { 2} :& \cTwoV { \wOne } { \wTwo } { \wOneP } { \wTwoP } \\
& \vCase { 3} :& \cTwoV { \wOne } { \wTwoP } { \wOneP } { \wTwo }
\end { align*}
When considered separately, the variants have the following $ \var $ .
\begin { align}
\cTwo & = \variantOne \label { eq:variantOne} \\
\cTwoV { \wOne } { \wTwo } { \wOneP } { \wTwoP } & =\variantTwo \label { eq:variantTwo} \\
\cTwoV { \wOne } { \wTwoP } { \wOneP } { \wTwo } & =\variantThree \label { eq:variantThree}
\end { align}
Note that at the start of the analysis of $ \var $ , the second term (expectation \eqref { eq:estExpect} squared) of the $ \var $ calculation was not considered. This is because it is cancelled out by \eqref { eq:distPatOne} and \eqref { eq:variantOne} .
\begin { equation*}
\big (\estExp \big )^ 2 = \distPatOne + \variantOne
\end { equation*}
With only \eqref { eq:variantTwo} and \eqref { eq:variantThree} remaining, we have
\begin { multline*}
\varParam { \estimate } = \\
\variantTwo ~+ \\
\variantThree
\end { multline*}
Converting terms into their space requirements yields
\begin { align}
& \variantTwo \Rightarrow \numWorldsP \cdot \big (\frac { \numWorlds } { \sketchCols } - 1\big )\label { eq:spaceOne} \\
& \variantThree \Rightarrow \numWorldsP \cdot \frac { \numWorldsP - 1} { \sketchCols } \label { eq:spaceTwo}
\end { align}
Summing \eqref { eq:spaceOne} and \eqref { eq:spaceTwo} and expanding gives
\begin { equation}
\frac { 2^ { 2N} (\prob + \prob ^ 2)} { \sketchCols } - \numWorlds (\frac { \prob } { \sketchCols } + \prob )\label { eq:variance}
\end { equation}
By \eqref { eq:variance} , we then have
\begin { align*}
\varSym & < 2^ { 2N} \big (\frac { 2\prob } { \sketchCols } \big ) \\
\sd & <\sdEq \\
\sdRel & < \sqrt { \frac { 2} { \sketchCols \prob } } .
\end { align*}
Recall that $ \sdRel = \frac { \sd } { \mu } $ where $ \mu $ is defined as $ \numWorldsP $ in \eqref { eq:mu} .
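The first of these bounds can be checked directly from \eqref { eq:variance}. The following is a short sketch, assuming $0 < \prob \le 1$ (so that $\prob^2 \le \prob$) and writing the number of worlds explicitly as $2^N$:
\begin{align*}
\frac{2^{2N}(\prob + \prob^2)}{\sketchCols} - 2^{N}\Big(\frac{\prob}{\sketchCols} + \prob\Big)
&< \frac{2^{2N}(\prob + \prob^2)}{\sketchCols} \\
&\le \frac{2^{2N} \cdot 2\prob}{\sketchCols}.
\end{align*}
The first step drops the subtracted term, which is strictly positive for $\prob > 0$; the second uses $\prob^2 \le \prob$. Taking square roots gives $\sd < 2^{N}\sqrt{2\prob/\sketchCols}$, and dividing by $\mu = 2^{N}\prob$ gives the relative bound $\sqrt{2/(\sketchCols\,\prob)}$.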
Since the sketch consists of multiple trials, it suffices for each trial to exceed the error bound $ \errB $ with probability less than one half; taking the median of all trials then yields an estimate within the error bound with high probability. Expressing the error relative to $ \mu $ in Chebyshev's inequality yields
\begin { equation*}
\cheby .
\end { equation*}
Substituting $ \mu \epsilon $ for $ k \sd $ and solving for $ \sketchCols $ results in
\begin { align*}
& k\cdot \sdEq = \mu \epsilon \\
& k = \frac { \mu \epsilon } { \sdEq } \\
& k = \frac { \mu \epsilon \sqrt { \sketchCols } } { \numWorlds \sqrt { 2\prob } } \\
& k^ 2 = \frac { \mu ^ 2\epsilon ^ 2\sketchCols } { 2^ { 2N} \cdot 2\prob } = \frac { \prob \errB ^ 2\sketchCols } { 2} \\
& \chebyK \Rightarrow \sketchCols = \frac { 6} { \epsilon ^ 2\prob }
\end { align*}
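Read against $k^2 = \prob\,\epsilon^2\,\sketchCols / 2$, the final line corresponds to $k^2 = 3$, that is, a per-trial failure probability of at most $1/k^2 = 1/3$ under Chebyshev's inequality, which is below the one-half threshold needed for the median argument. As a purely illustrative instance (the values $\epsilon = 0.1$ and $\prob = 0.5$ are chosen for the example and are not taken from any experiment),
\begin{equation*}
\sketchCols = \frac{6}{\epsilon^2\,\prob} = \frac{6}{(0.1)^2 \cdot 0.5} = \frac{6}{0.005} = 1200.
\end{equation*}
Halving the error bound to $\epsilon = 0.05$ quadruples the requirement to $\sketchCols = 4800$, while halving the probability to $\prob = 0.25$ (at $\epsilon = 0.1$) doubles it to $\sketchCols = 2400$.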