% -*- root: main.tex -*-
\section{Analysis}
\label{sec:analysis}
We begin the analysis by showing that, with high probability, an estimate is approximately $\numWorldsP$, where $p$ is the probability measure for a given TIPD. Note that
\begin{equation}
\numWorldsP = \numWorldsSum\label{eq:mu}.
\end{equation}
The first step is to show that the expectation of the estimate of a tuple $t$'s membership across all worlds is $\numWorldsSum$.
\begin{align}
&\expect{\estimate}\\
=&\expect{\estExpOne}\\
=&\expect{\sum_{\substack{j \in [B],\\
\wVec \in \pw~|~ \sketchHash{i}[\wVec] = j,\\
\wVec[w']\in \pw~|~ \sketchHash{i}[\wVec[w']] = j} } v_t[\wVec] \cdot s_i[\wVec] \cdot s_i[\wVec[w']]}\\
=&\multLineExpect\big[\sum_{\substack{j \in [B],\\
\wVec~|~\sketchHashParam{\wVec}= j,\\
\wVecPrime~|~\sketchHashParam{\wVecPrime} = j,\\
\wVec = \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime} + \nonumber \\
&\phantom{{}\kMapParam{\wVec}}\sum_{\substack{j \in [B], \\
\wVec~|~\sketchHashParam{\wVec} = j,\\
\wVecPrime ~|~ \sketchHashParam{\wVecPrime} = j,\\ \wVec \neq \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot\sketchPolarParam{\wVecPrime}\big]\textit{(by linearity of expectation)}\\
=&\expect{\sum_{\substack{j \in [B],\\
\wVec~|~\sketchHashParam{\wVec}= j,\\
\wVecPrime~|~\sketchHashParam{\wVecPrime} = j,\\
\wVec = \wVecPrime}} \kMapParam{\wVec} \cdot \sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime}} \nonumber \\
&\phantom{{}\big[}\textit{(by uniform distribution in the second summation)}\\
=& \estExp \label{eq:estExpect}
\end{align}
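As a brief sketch of why the summation over $\wVec \neq \wVecPrime$ vanishes, assume (this is our reading of the setup, not something stated explicitly in this section) that each value $\sketchPolarParam{\wVec}$ is uniform over $\{-1, +1\}$ and that the values for distinct worlds are at least pairwise independent. Under these assumptions,
\begin{align*}
\expect{\sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime}} &= \expect{\sketchPolarParam{\wVec}} \cdot \expect{\sketchPolarParam{\wVecPrime}} = 0 \cdot 0 = 0 && \text{for } \wVec \neq \wVecPrime,\\
\expect{\sketchPolarParam{\wVec} \cdot \sketchPolarParam{\wVecPrime}} &= \expect{1} = 1 && \text{for } \wVec = \wVecPrime,
\end{align*}
so only the terms with $\wVec = \wVecPrime$ survive in expectation.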
For the next step, we show that the variance of an estimate is small.
\[\varParam{\estimate}\]
\begin{align}
&=\varParam{\estExpOne}\\
&= \expect{\big(\estTwo\big)^2}\\
&=\expect{\sum_{\substack{
\wVec_1, \wVec_2,\\
\wVecPrime_1, \wVecPrime_2 \in \pw,\\
\sketchHashParam{\wVec_1} = \sketchHashParam{\wVecPrime_1},\\
\sketchHashParam{\wVec_2} = \sketchHashParam{\wVecPrime_2}
}}\kMapParam{\wVec_1} \cdot \kMapParam{\wVec_2}\cdot\sketchPolarParam{\wVec_1}\cdot\sketchPolarParam{\wVec_2}\cdot\sketchPolarParam{\wVecPrime_1}\cdot\sketchPolarParam{\wVecPrime_2} }\label{eq:var-sum-w}
\end{align}
Note that four-wise independence is assumed across all four random variables of \eqref{eq:var-sum-w}. Zooming in on the inner products of the $\sketchPolar$ functions,
\begin{equation}
\polarProdEq \label{eq:polar-product}
\end{equation}
it can be seen that for $\wOne, \wOneP \in \pw$ and $\wTwo, \wTwoP \in \pw'$, all four random variables in \eqref{eq:polar-product} take their values from $\pw$, even though we iterate over two separate copies of the set $\pw$. Thus, there are five possible patterns of $\wVec$ variable combinations, namely:
\begin{align*}
&\distPattern{1}:&\cOne\\
&\distPattern{2}:&\cTwo \textit{*} \\
&\distPattern{3}:&\cThree \textit{*} \\
&\distPattern{4}:&\cFour \textit{*}\\
&\distPattern{5}:&\cFive
\end{align*}
$$\text{ }^*\textit{(and all variants of the respective pattern)}$$
We are interested in those cases whose expectation is nonzero, since a term with zero expectation contributes nothing to the summation in \eqref{eq:var-sum-w}. In expectation, we have that
\begin{align}
&\expect{%\sum_{\substack{\elems \\
%\st \cOne}}
\polarProdEq \st \cOne} = 1 \label{eq:polar-prod-all}\\
&\expect{%\sum_{\substack{\elems \\
%\st \cTwo}}
\polarProdEq \st \cTwo} = 1 \label{eq:polar-prod-two-and-two}\\
&\expect{%\sum_{\substack{\elems \\
%\st \cThree}}
\polarProdEq \st \cThree} = 0 \nonumber \\
&\expect{%\sum_{\substack{\elems \\
%\st \cFour}}
\polarProdEq \st \cFour} = 0 \nonumber \\
&\expect{%\sum_{\substack{\elems \\
%\st \cFive}}
\polarProdEq \st \cFive} = 0 \nonumber
\end{align}
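One way to read these values is as a parity argument. The sketch below uses three hypothetical distinct worlds $a$, $b$, $c$ (names introduced here only for illustration) and assumes that each polarity value is uniform over $\{-1, +1\}$ and that the values are four-wise independent. A world that appears an even number of times contributes a factor of $1$, while a world that appears an odd number of times contributes a factor whose expectation is $0$:
\begin{align*}
\expect{{\sketchPolarParam{a}}^2 \cdot {\sketchPolarParam{b}}^2} &= \expect{1 \cdot 1} = 1,\\
\expect{{\sketchPolarParam{a}}^2 \cdot \sketchPolarParam{b} \cdot \sketchPolarParam{c}} &= \expect{\sketchPolarParam{b}} \cdot \expect{\sketchPolarParam{c}} = 0,\\
\expect{{\sketchPolarParam{a}}^3 \cdot \sketchPolarParam{b}} &= \expect{\sketchPolarParam{a}} \cdot \expect{\sketchPolarParam{b}} = 0.
\end{align*}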
Only equations \eqref{eq:polar-prod-all} and \eqref{eq:polar-prod-two-and-two} influence the $\var$ computation.
Considering $\distPattern{1}$, the variance results in
\begin{equation}
\distPatOne\label{eq:distPatOne}
\end{equation}
For the distribution pattern $\cTwo$, we have three variants to consider.
\begin{align*}
&\vCase{1}:&\cTwo \\
&\vCase{2}:&\cTwoV{\wOne}{\wTwo}{\wOneP}{\wTwoP}\\
&\vCase{3}:&\cTwoV{\wOne}{\wTwoP}{\wOneP}{\wTwo}
\end{align*}
When considered separately, the variants have the following $\var$.
\begin{align}
\cTwo&= \variantOne \label{eq:variantOne}\\
\cTwoV{\wOne}{\wTwo}{\wOneP}{\wTwoP}&=\variantTwo \label{eq:variantTwo}\\
\cTwoV{\wOne}{\wTwoP}{\wOneP}{\wTwo}&=\variantThree\label{eq:variantThree}
\end{align}
Note that the analysis of $\var$ so far has omitted the second term of the variance, namely the square of the expectation \eqref{eq:estExpect}. This term is cancelled by the sum of \eqref{eq:distPatOne} and \eqref{eq:variantOne}:
\begin{equation*}
\big(\estExp\big)^2 = \distPatOne + \variantOne
\end{equation*}
With only \eqref{eq:variantTwo} and \eqref{eq:variantThree} remaining, we have
\begin{multline*}
\varParam{\estimate} = \\
\variantTwo ~+ \\
\variantThree
\end{multline*}
Converting terms into their space requirements yields
\begin{align}
&\variantTwo \Rightarrow\numWorldsP \cdot \big(\frac{\numWorlds}{\sketchCols} - 1\big)\label{eq:spaceOne}\\
&\variantThree \Rightarrow \numWorldsP \cdot \frac{\numWorldsP - 1}{\sketchCols}\label{eq:spaceTwo}
\end{align}
\eqref{eq:spaceOne} and \eqref{eq:spaceTwo} further reduce to
\begin{equation}
\frac{2^{2N}(\prob + \prob^2)}{\sketchCols} - \numWorlds\big(\frac{\prob}{\sketchCols} + \prob\big)\label{eq:variance}
\end{equation}
By \eqref{eq:variance}, we then have
\begin{align*}
\varSym &< 2^{2N}\big(\frac{2\prob}{\sketchCols}\big) \\
\sd &<\sdEq\\
\sdRel& < \sqrt{\frac{2}{\sketchCols\prob}}.
\end{align*}
Recall that $\sdRel = \frac{\sd}{\mu}$ where $\mu$ is defined as $\numWorldsP$ in \eqref{eq:mu}.
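The step from the variance bound to the relative standard deviation can be spelled out as follows. This is a sketch under the assumption that $\mu = 2^{N}\prob$, which is how we read $\numWorldsP$ given the $2^{2N}$ factors in \eqref{eq:variance}:
\begin{align*}
\sd &< \sqrt{2^{2N} \cdot \frac{2\prob}{\sketchCols}} = 2^{N}\sqrt{\frac{2\prob}{\sketchCols}},\\
\sdRel = \frac{\sd}{\mu} &< \frac{2^{N}}{2^{N}\prob}\sqrt{\frac{2\prob}{\sketchCols}} = \sqrt{\frac{2\prob}{\sketchCols\prob^2}} = \sqrt{\frac{2}{\sketchCols\prob}}.
\end{align*}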
Since the sketch runs multiple trials, it suffices that a single trial exceeds the error bound $\errB$ with probability smaller than one half: the median of all trials then falls within the error bound with high probability. Expressing the error relative to $\mu$ in Chebyshev's Inequality yields
\begin{equation*}
\cheby.
\end{equation*}
Substituting $\mu\epsilon$ for $k\sd$ and solving for $\sketchCols$ results in
\begin{align*}
&k\cdot\sdEq = \mu\epsilon\\
&k = \frac{\mu\epsilon}{\sdEq}\\
&k = \frac{\mu\epsilon\sqrt{\sketchCols}}{\numWorlds \sqrt{2\prob}}\\
&k^2 = \frac{\mu^2\epsilon^2\sketchCols}{2^{2N}\cdot2\prob} = \frac{\prob\errB^2\sketchCols}{2}\\
&\chebyK\Rightarrow \sketchCols = \frac{6}{\epsilon^2\prob}
\end{align*}
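For a sense of scale, the bound can be instantiated with concrete numbers. The values of $\epsilon$ and $\prob$ below are hypothetical and chosen only for illustration. The first line assumes that the Chebyshev condition amounts to $\frac{1}{k^2} \leq \frac{1}{3}$, which is what the constant $6$ suggests:
\begin{align*}
\frac{1}{k^2} = \frac{2}{\prob\epsilon^2\sketchCols} \leq \frac{1}{3} &\iff \sketchCols \geq \frac{6}{\epsilon^2\prob},\\
\epsilon = 0.1,~\prob = 0.5 &\implies \sketchCols = \frac{6}{0.01 \cdot 0.5} = 1200,\\
\epsilon = 0.1,~\prob = 0.01 &\implies \sketchCols = \frac{6}{0.01 \cdot 0.01} = 60000.
\end{align*}
Under this bound, the number of columns does not depend on the number of possible worlds. It grows as $\prob$ shrinks and as the error bound tightens.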