%root: main.tex
%!TEX root=./main.tex
\section{$1 \pm \epsilon$ Approximation Algorithm}\label{sec:algo}
In~\Cref{sec:hard}, we showed that computing the expected multiplicity of a compressed representation of a bag polynomial for \ti (even just based on project-join queries) is unlikely to be possible in linear time (\Cref{thm:mult-p-hard-result}), even if all tuples have the same probability (\Cref{th:single-p-hard}).
Given this, we now design an approximation algorithm for our problem that runs in {\em linear time}.
Unlike the results in~\Cref{sec:hard}, our approximation algorithm works for \bi, though our bounds are more meaningful for a non-trivial subclass of \bis that contains both \tis and the PDBench benchmark.
%it is then desirable to have an algorithm to approximate the multiplicity in linear time, which is what we describe next.
\subsection{Preliminaries and some more notation}
We now introduce useful definitions and notation related to polynomials. We use the following polynomial as an example:
\begin{equation}
\label{eq:poly-eg}
\poly(X, Y) = 2X^2 + 3XY - 2Y^2.
\end{equation}
\begin{Definition}[Variables in a monomial]\label{def:vars}
Given a monomial $v$, we use $\var(v)$ to denote the set of variables in $v$.
\end{Definition}
For example, the monomial $XY$ has $\var(XY)=\inset{X,Y}$.
%\begin{Definition}[Expression Tree]\label{def:express-tree}
%An expression tree $\etree$ is a binary %an ADT logically viewed as an n-ary
%tree, whose internal nodes are from the set $\{+, \times\}$, with leaf nodes being either from the set $\mathbb{R}$ $(\tnum)$ or from the set of monomials $(\var)$. The members of $\etree$ are \type, \val, \vari{partial}, \vari{children}, and \vari{weight}, where \type is the type of value stored in the node $\etree$ (i.e. one of $\{+, \times, \var, \tnum\}$, \val is the value stored, and \vari{children} is the list of $\etree$'s children where $\etree_\lchild$ is the left child and $\etree_\rchild$ the right child. Remaining fields hold values whose semantics we will fix later. When $\etree$ is used as input of ~\Cref{alg:mon-sam} and ~\Cref{alg:one-pass}, the values of \vari{partial} and \vari{weight} will not be set. %SEMANTICS FOR \etree: \vari{partial} is the sum of $\etree$'s coefficients , n, and \vari{weight} is the probability of $\etree$ being sampled.
%\end{Definition}
%Note that $\etree$ need not encode an expression in the standard monomial basis. For instance, $\etree$ could represent a compressed form of the polynomial in~\Cref{eq:poly-eg}, such as $(x + 2y)(2x - y)$.
\begin{Definition}[$\polyf(\cdot)$]\label{def:poly-func}
Let $\polyf(\etree)$ denote the function that maps an expression tree $\etree$ to its corresponding polynomial. $\polyf(\cdot)$ is defined recursively on $\etree$ as follows, where $\etree_\lchild$ and $\etree_\rchild$ denote the left and right child of $\etree$ respectively, and addition and multiplication follow the standard interpretation:
%
% \begin{align*}
% &\etree.\type = +\mapsto&& \polyf(\etree_\lchild) + \polyf(\etree_\rchild)\\
% &\etree.\type = \times\mapsto&& \polyf(\etree_\lchild) \cdot \polyf(\etree_\rchild)\\
% &\etree.\type = \var \text{ OR } \tnum\mapsto&& \etree.\val
% \end{align*}
%
\begin{equation*}
\polyf(\etree) = \begin{cases}
\polyf(\etree_\lchild) + \polyf(\etree_\rchild) &\text{ if \etree.\type } = +\\
\polyf(\etree_\lchild) \cdot \polyf(\etree_\rchild) &\text{ if \etree.\type } = \times\\
\etree.\val &\text{ if \etree.\type } = \var \text{ OR } \tnum.
\end{cases}
\end{equation*}
\end{Definition}
%Specifically, when adding two monomials whose variables and respective exponents agree, the coefficients corresponding to the monomials are added and their sum is multiplied to the monomial. Multiplication here is denoted by concatenation of the monomial and coefficient. When two monomials are multiplied, the product of each corresponding coefficient is computed, and the variables in each monomial are multiplied, i.e., the exponents of like variables are added. Again we notate this by the direct product of coefficient product and all disitinct variables in the two monomials, with newly computed exponents.
%\begin{Definition}[Expression Tree Set]\label{def:express-tree-set}$\etreeset{\smb}$ is the set of all possible expression trees $\etree$, such that $poly(\etree) = \poly(\vct{X})$.
%\end{Definition}
%
%For the polynomial in~\Cref{eq:poly-eg}, $\etreeset{\smb}$ would include the following (represented as their corresponding expression trees): $2x^2 + 3xy - 2y^2, (x + 2y)(2x - y), x(2x - y) + 2y(2x - y), 2x(x + 2y) - y(x + 2y)$. Note that \Cref{def:express-tree-set} implies that for any expression tree $\etree$, we have $\etree \in \etreeset{poly(\etree)}$.
\begin{Definition}[Expanded T]\label{def:expand-tree}
$\expandtree{\etree}$ is the (pure) sum of products expansion of $\etree$, which we formally define next. The logical view of \expandtree{\etree} ~is a list of tuples $(\monom, \coef)$, where $\monom$ is a monomial and $\coef$ is in $\mathbb{R}$. \expandtree{\etree} has the following recursive definition (where $\circ$ is list concatenation).
%
% recursively defined as
% \begin{align*}
% &\etree.\type = + \mapsto&& \elist{\expandtree{\etree_\lchild}, \expandtree{\etree_\rchild}}\\
% &\etree.\type = \times \mapsto&& \elist{\expandtree{\etree_\lchild} \otimes \expandtree{\etree_\rchild}}\\
% &\etree.\type = \tnum \mapsto&& \elist{(\emptyset, \etree.\val)}\\
% &\etree.\type = \var \mapsto&& \elist{(\etree.\val, 1)}
% \end{align*}
{\small
\begin{multline*}
\expandtree{\etree} =
\begin{cases}
\expandtree{\etree_\lchild} \circ \expandtree{\etree_\rchild} &\textbf{ if }\etree.\type = +\\
\left\{(\monom_\lchild \cup \monom_\rchild, \coef_\lchild \cdot \coef_\rchild) ~|~\right.&\\ \quad \left.(\monom_\lchild, \coef_\lchild) \in \expandtree{\etree_\lchild}, (\monom_\rchild, \coef_\rchild) \in \expandtree{\etree_\rchild}\right\} &\textbf{ if }\etree.\type = \times\\
\elist{(\emptyset, \etree.\val)} &\textbf{ if }\etree.\type = \tnum\\
\elist{(\{\etree.\val\}, 1)} &\textbf{ if }\etree.\type = \var.\\
\end{cases}
\end{multline*}
}
\end{Definition}
%where that the multiplication of two tuples %is the standard multiplication over monomials and the standard multiplication over coefficients to produce the product tuple, as in
%is their direct product $(\monom_1, \coef_1) \cdot (\monom_2, \coef_2) = (\monom_1 \cdot \monom_2, \coef_1 \times \coef_2)$ such that monomials $\monom_1$ and $\monom_2$ are concatenated in a product operation, while the standard product operation over reals applies to $\coef_1 \times \coef_2$. The product of $\expandtree{\etree_\lchild} \cdot \expandtree{\etree'_\rchild}$ is then the cross product of the multiplication of all such tuples returned to both $\expandtree{\etree_\lchild}$ and $\expandtree{\etree_\rchild}$. %The operator $\otimes$ is defined as the cross-product tuple multiplication of all such tuples returned by both $\expandtree{\etree_\lchild}$ and $\expandtree{\etree_\rchild}$.
\begin{Example}\label{example:expr-tree-T}
Consider the factorized representation $(X+ 2Y)(2X - Y)$ of the polynomial in~\Cref{eq:poly-eg}. Its expression tree $\etree$ is illustrated in~\Cref{fig:expr-tree-T}. The pure expansion of the product is $2X^2 - XY + 4XY - 2Y^2$, and $\expandtree{\etree}$ is $[(X^2, 2), (XY, -1), (XY, 4), (Y^2, -2)]$.
\end{Example}
\begin{figure}[t]
\resizebox{0.65\columnwidth}{!}{
\begin{tikzpicture}[thick, level distance=0.9cm,level 1/.style={sibling distance=3.55cm}, level 2/.style={sibling distance=1.8cm}, level 3/.style={sibling distance=0.8cm}]% level/.style={sibling distance=6cm/(#1 * 1.5)}]
\node[tree_node](root){$\boldsymbol{\times}$}
child{node[tree_node]{$\boldsymbol{+}$}
child{node[tree_node]{x}
%child[missing]{node[tree_node]{}}
%child{node[tree_node]{x}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{2}}
child{node[tree_node]{y}}
}
}
child{node[highlight_treenode] (TR) {$\boldsymbol{+}$}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node]{2}}
child{node[tree_node]{x}}
}
child{node[tree_node]{$\boldsymbol{\times}$}
child{node[tree_node] (neg-leaf) {-1}}
child{node[tree_node]{y}}
}
%child[sibling distance= 0cm, grow=north east, red]{node[tree_node]{$\etree_\rchild$}}
};
% \node[below=2pt of neg-leaf, inner sep=1pt, blue] (neg-comment) {\textbf{Negation pushed to leaf nodes}};
% \draw[<-|, blue] (neg-leaf) -- (neg-comment);
\node[above right=0.7cm of TR, highlight_color, inner sep=0pt, font=\bfseries] (tr-label) {$\etree_\rchild$};
\node[above right=0.7cm of root, highlight_color, inner sep=0pt, font=\bfseries] (t-label) {$\etree$};
\draw[<-|, highlight_color] (TR) -- (tr-label);
\draw[<-|, highlight_color] (root) -- (t-label);
\end{tikzpicture}
}
\vspace*{-2mm}
\caption{Expression tree $\etree$ for the product $\boldsymbol{(x + 2y)(2x - y)}$.}
\label{fig:expr-tree-T}
\trimfigurespacing
\end{figure}
\begin{Definition}[Positive T]\label{def:positive-tree}
For any expression tree $\etree$, the corresponding {\em positive tree}, denoted $\abs{\etree}$, is obtained from $\etree$ as follows. For each leaf node $\ell$ of $\etree$ where $\ell.\type$ is $\tnum$, update $\ell.\vari{value}$ to $|\ell.\vari{value}|$. %value $\coef$ of each coefficient leaf node in $\etree$ is set to %$\coef_i$ in $\etree$ is exchanged with its absolute value$|\coef|$.
\end{Definition}
Using the same factorization from~\Cref{example:expr-tree-T}, $\polyf(\abs{\etree}) = (X + 2Y)(2X + Y) = 2X^2 +XY +4XY + 2Y^2 = 2X^2 + 5XY + 2Y^2$. Note that this \textit{is not} the same as the polynomial from~\Cref{eq:poly-eg}.
\begin{Definition}[Evaluation]\label{def:exp-poly-eval}
Given an expression tree $\etree$ and $\vct{v} \in \mathbb{R}^\numvar$, we define the evaluation of $\etree$ on $\vct{v}$ as $\etree(\vct{v}) = \polyf(\etree)(\vct{v})$.
\end{Definition}
\subsection{Our main result}
In the subsequent subsections we will prove the following theorem.
\begin{Theorem}\label{lem:approx-alg}
Let $\etree$ be an expression tree for a UCQ over \bi, define $\poly(\vct{X})=\polyf(\etree)$, and let $k=\degree(\poly)$.
%Let $\poly(\vct{X})$ be a query polynomial corresponding to the output of a UCQ in a \bi.
An estimate $\mathcal{E}$ %=\approxq(\etree, (p_1,\dots,p_\numvar), \conf, \error')$
of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot \abs{\etree}^2(1,\ldots, 1)\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot\rpoly^2(\prob_1,\ldots, \prob_\numvar)}\right)\]
such that
\begin{equation}
\label{eq:approx-algo-bound}
P\left(\left|\mathcal{E} - \rpoly(\prob_1,\dots,\prob_\numvar)\right|> \error' \cdot \rpoly(\prob_1,\dots,\prob_\numvar)\right) \leq \conf.
\end{equation}
%with multiplicative $(\error,\delta)$-bounds, where $k$ denotes the degree of $\poly$.
\end{Theorem}
The proof of~\Cref{lem:approx-alg} can be found in~\Cref{sec:proofs-approx-alg}.
To get linear runtime results from~\Cref{lem:approx-alg}, we need one more parameter, which models the (weighted) fraction of monomials in $\expandtree{\etree}$ that are `canceled' when $\etree$ is taken modulo $\mathcal{B}$:
\begin{Definition}[Parameter $\gamma$]\label{def:param-gamma}
Given an expression tree $\etree$, define
\[\gamma(\etree)=\frac{\sum_{(\monom, \coef)\in \expandtree{\etree}} \abs{\coef}\cdot \indicator{\monom\mod{\mathcal{B}}\equiv 0}}{\abs{\etree}(1,\ldots, 1)}\]
\end{Definition}
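\begin{Example}
As an illustration (ours, not part of the formal development), consider the tree $\etree$ of~\Cref{example:expr-tree-T}, for which $\abs{\etree}(1,1) = (1+2)(2+1) = 9$. We assume here, matching the check performed by~\Cref{alg:mon-sam}, that a monomial is canceled exactly when it contains two distinct variables from the same block. If $X$ and $Y$ lie in different blocks (e.g., in a \ti), no monomial is canceled and $\gamma(\etree)=0$. If instead we hypothetically place $X$ and $Y$ in the same block, both $XY$ terms are canceled and
\[\gamma(\etree) = \frac{\abs{-1} + \abs{4}}{9} = \frac{5}{9}.\]
\end{Example}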
%\AH{This....combined with \Cref{def:mod-set-polys} is \emph{really} nice notation!}
\AR{Need to make sure use of indicator variable $\onesymbol$ above is consistent with the rest of the paper.}
We next present a corollary of~\Cref{lem:approx-alg}.
\begin{Corollary}
\label{cor:approx-algo-const-p}
Let $\poly(\vct{X})$ be as in~\Cref{lem:approx-alg} and let $\gamma=\gamma(\etree)$. Further let it be the case that $p_i\ge p_0$ for all $i\in[\numvar]$. Then an estimate $\mathcal{E}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ satisfying~\Cref{eq:approx-algo-bound} can be computed in time
\[O\left(\treesize(\etree) + \frac{\log{\frac{1}{\conf}}\cdot k\cdot \log{k} \cdot depth(\etree)}{\inparen{\error'}^2\cdot(1-\gamma)^2\cdot p_0^{2k}}\right).\]
In particular, if $p_0>0$ and $\gamma<1$ are absolute constants, then the above runtime simplifies to $O_k\left(\frac 1{\inparen{\error'}^2}\cdot\treesize(\etree)\cdot \log{\frac{1}{\conf}}\right)$.
\end{Corollary}
The proof of~\Cref{cor:approx-algo-const-p} can be found in~\Cref{sec:proofs-approx-alg}.
The restriction on $\gamma$ is satisfied by \ti (where $\gamma=0$) as well as for all queries of the PDBench \bi benchmark (see \Cref{app:subsec:experiment}).
\AH{I am thinking that perhaps the terminology and presentation of~\Cref{sec:experiments} may need word-smithing to clearly illustrate the $\bi$ benchmarks satisfied--although the substance is already written there.}
\AR{Yes! E.g. $\gamma$ is not used at all in~\Cref{sec:experiments}}
\AR{{\bf Boris/Oliver:} Is there a way to claim that all probabilities in practice are actually constants: i.e. they do not increase with the number of tuples?}
\OK{@Atri: This seems like a reasonable claim. It's too late for me to come up with a reasonable motivation (maybe something will come to me in the morning), but the intuition for me is that each tuple/block is independent... it would be hard for that to be the case if the probability were a function of the number of tuples.}
\subsection{Approximating $\rpoly$}
Our algorithm for proving~\Cref{lem:approx-alg} is based on the following observation. Given a query polynomial $\poly(\vct{X})=\polyf(\etree)$ for an expression tree $\etree$ over $\bi$, we can exactly represent $\rpoly(\vct{X})$ as follows:
\begin{equation}
\label{eq:tilde-Q-bi}
\rpoly\inparen{X_1,\dots,X_\numvar}=\hspace*{-1mm}\sum_{(v,c)\in \expandtree{\etree}} \hspace*{-2mm} \indicator{v\mod{\mathcal{B}}\not\equiv 0}\cdot c\cdot\hspace*{-2mm}\prod_{X_i\in \var\inparen{v}}\hspace*{-2mm} X_i
\end{equation}
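\begin{Example}
To illustrate~\Cref{eq:tilde-Q-bi}, take $\etree$ from~\Cref{example:expr-tree-T} and assume (for this illustration only) that $X$ and $Y$ belong to different blocks, so that no monomial is canceled. Each monomial contributes its coefficient times the product of its distinct variables, so the exponents in $X^2$ and $Y^2$ are dropped:
\[\rpoly(X, Y) = 2X - XY + 4XY - 2Y = 2X + 3XY - 2Y.\]
If $X$ and $Y$ were instead in the same block, and assuming that a monomial is canceled exactly when it contains two distinct variables of one block, both $XY$ terms would vanish, leaving $\rpoly(X, Y) = 2X - 2Y$.
\end{Example}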
Given the above, the algorithm is a sampling based algorithm for the above sum: we sample $(v,c)\in \expandtree{\etree}$ with probability proportional\footnote{We could have also uniformly sampled from $\expandtree{\etree}$ but this gives better parameters.}
%\AH{Regarding the footnote, is there really a difference? I \emph{suppose} technically, but in this case they are \emph{effectively} the same. Just wondering.}
%\AR{Yes, there is! If we used uniform distribution then in our bounds we will have a parameter that depends on the largest $\abs{coef}$, which e.g. could be dependent on $n$. But with the weighted probability distribution, we avoid paying this price. Though I guess perhaps we can say for the kinds of queries we consider thhese coefficients are all constants?}
to $\abs{c}$ and compute $Y=\indicator{v\mod{\mathcal{B}}\not\equiv 0}\cdot sign(c)\cdot \prod_{X_i\in \var\inparen{v}} p_i$. Taking $\numsamp$ samples, averaging $Y$ over them, and scaling the average by $\abs{\etree}(1,\ldots, 1)$ gives us our final estimate.
The required number of samples follows from the bound below (see \Cref{app:subsec-th-mon-samp}):
\begin{equation*}
2\exp{\left(-\frac{\samplesize\error^2}{2}\right)}\leq \conf \implies\samplesize \geq \frac{2\log{\frac{2}{\conf}}}{\error^2}.
%\exp{\left(-\frac{\samplesize\error^2}{2}\right)}\leq \frac{\conf}{2}\\
%\frac{\samplesize\error^2}{2}\geq \log{\frac{2}{\conf}}\\
\end{equation*}
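For a concrete sense of scale, here is a worked instance with illustrative values of our choosing, $\error = 0.1$ and $\conf = 0.05$, assuming $\log$ denotes the natural logarithm:
\begin{equation*}
\samplesize \geq \frac{2\log{\frac{2}{0.05}}}{(0.1)^2} = 200\cdot\log{40} \approx 737.8,
\end{equation*}
so $738$ samples suffice. Note that this count depends only on $\error$ and $\conf$, not on the size of $\etree$.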
%We state the approximation algorithm in terms of a $\bi$.
%\subsubsection{Description}
%Algorithm ~\ref{alg:mon-sam} approximates $\rpoly$ using the following steps. First, a call to $\onepass$ on its input $\etree$ produces a non-biased weight distribution over the monomials of $\expandtree{\etree}$ and a correct count of $|\etree|(1,\ldots, 1)$, i.e., the number of monomials in $\expandtree{\etree}$. Next, ~\Cref{alg:mon-sam} calls $\sampmon$ to sample one monomial and its sign from $\expandtree{\etree}$. The sampling is repeated $\ceil{\frac{2\log{\frac{2}{\delta}}}{\epsilon^2}}$ times, where each of the samples are evaluated with input $\vct{p}$, multiplied by $1 \times sign$, and summed. The final result is scaled accordingly returning an estimate of $\rpoly$ with the claimed $(\error, \conf)$-bound of ~\Cref{lem:mon-samp}.
%\AR{Seems like the notation below belongs to the notation section (if we decide to state this explicitly at all)?}
%\AH{Yes, I only included this per your request a few months ago. Based on @lordpretzel removing my definition of monomial, perhaps we can assume that the reader understands the notation below. I \emph{think} this should be a reasonable assumption.}
%Recall that the notation $[x, y]$ denotes the range of values between $x$ and $y$ inclusive. The notation $\{x, y\}$ denotes the set of values consisting of $x$ and $y$.
%\subsubsection{Psuedo Code}
%Original \ti Algorithm
%\begin{algorithm}[H]
% \caption{$\approxq$($\etree$, $\vct{p}$, $\conf$, $\error$)}
% \label{alg:mon-sam}
% \begin{algorithmic}[1]
% \Require \etree: Binary Expression Tree
% \Require $\vct{p} = (\prob_1,\ldots, \prob_\numvar)$ $\in [0, 1]^N$
% \Require $\conf$ $\in [0, 1]$
% \Require $\error$ $\in [0, 1]$
% \Ensure \vari{acc} $\in \mathbb{R}$
% \State $\accum \gets 0$\label{alg:mon-sam-global1}
% \State $\numsamp \gets \ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$\label{alg:mon-sam-global2}
% \State $(\vari{\etree}_\vari{mod}, \vari{size}) \gets $ \onepass($\etree$)\label{alg:mon-sam-onepass}\Comment{$\onepass$ is ~\Cref{alg:one-pass} \;and \sampmon \; is ~\Cref{alg:sample}}\newline
% \For{\vari{i} \text{ in } $1\text{ to }\numsamp$}\Comment{Perform the required number of samples}
% \State $(\vari{M}_\vari{i}, \vari{sgn}_\vari{i}) \gets $ \sampmon($\etree_\vari{mod}$)\label{alg:mon-sam-sample}
% \State $\vari{Y}_\vari{i} \gets 1$\label{alg:mon-sam-assign1}
% \For{$\vari{x}_{\vari{j}}$ \text{ in } $\vari{M}_{\vari{i}}$}
% \State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times \; \vari{\prob}_\vari{j}$\label{alg:mon-sam-product2} \Comment{$\vari{p}_\vari{j}$ is the assignment to $\vari{x}_\vari{j}$ from input $\vct{p}$}
% \EndFor
% \State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times\; \vari{sgn}_\vari{i}$\label{alg:mon-sam-product}
% \State $\accum \gets \accum + \vari{Y}_\vari{i}$\Comment{Store the sum over all samples}\label{alg:mon-sam-add}
% \EndFor
%
% \State $\vari{acc} \gets \vari{acc} \times \frac{\vari{size}}{\numsamp}$\label{alg:mon-sam-global3}
% \State \Return \vari{acc}
% \end{algorithmic}
%\end{algorithm}
%\bi Version of Approximation Algorithm
\begin{algorithm}[t]
\caption{$\approxq(\etree, \vct{p}, \conf, \error)$}
\label{alg:mon-sam}
\begin{algorithmic}[1]
\Require \etree: Binary Expression Tree
\Require $\vct{p} = (\prob_1,\ldots, \prob_\numvar)$ $\in [0, 1]^N$
\Require $\conf$ $\in [0, 1]$
\Require $\error$ $\in [0, 1]$
%\Require $\abs{\block} \in \mathbb{N}$%\bivec$ $\in [0, 1]^{\abs{\block}}$
\Ensure \vari{acc} $\in \mathbb{R}$
%\State $\vari{sample}_\vari{next} \gets 0$
\State $\accum \gets 0$\label{alg:mon-sam-global1}
\State $\numsamp \gets \ceil{\frac{2 \log{\frac{2}{\conf}}}{\error^2}}$\label{alg:mon-sam-global2}
\State $(\vari{\etree}_\vari{mod}, \vari{size}) \gets $ \onepass($\etree$)\label{alg:mon-sam-onepass}\Comment{$\onepass$ is ~\Cref{alg:one-pass}}
%\newline
%\State $\vari{i} \gets 1$
\For{$\vari{i} \in 1 \text{ to }\numsamp$}\label{alg:sampling-loop}\Comment{Perform the required number of samples}
%\State $\bivec \gets [0]^{\abs{\block}}$\Comment{$\bivec$ is an array whose size is the number of blocks, used to check for cross-terms}\newline
\State $(\vari{M}, \vari{sgn}_\vari{i}) \gets $ \sampmon($\etree_\vari{mod}$)\label{alg:mon-sam-sample}\Comment{\sampmon \; is ~\Cref{alg:sample}}
%\For{$\vari{x}_\vari{\block,i}$ \text{ in } $\vari{M}$}
% \If{$\bivec[\block] = 1$}\label{alg:mon-sam-check}\Comment{If we have already had a variable from this block, $\rpoly$ drops the sample.}
% \newline
% \State $\vari{sample}_{\vari{next}} \gets 1$
% \State break
% \Else
% \State $\bivec[\block] = 1$
% \State $\vari{sum} = 0$
% \For{$\ell \in [\abs{\block}]$}
% \State $\vari{sum} = \vari{sum} + \bivec[\block][\ell]$
% \EndFor
% \If{$\vari{sum} \geq 2$}
% \State $\vari{sample}_{\vari{next}} \gets 1$
% \State continue\Comment{Not sure for psuedo code the best way to state this, but this is analogous to C language continue statement.}
% \EndIf
% \EndFor
% \If{$\vari{sample}_{\vari{next}} = 1$}\label{alg:mon-sam-drop}
% \State $\vari{sample}_{\vari{next}} \gets 0$\label{alg:mon-sam-resamp}
% \Else
\If{$\vari{M}$ has at most one variable from each block}\label{alg:check-duplicate-block}
\State $\vari{Y}_\vari{i} \gets \prod_{X_j\in\var\inparen{\vari{M}}}p_j$\label{alg:mon-sam-assign1}%\newline
%\For{$\vari{x}_{\vari{j}}$ \text{ in } $\vari{M}$}%_{\vari{i}}$}
% \State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times \; \vari{\prob}_\vari{j}$\label{alg:mon-sam-product2} \Comment{$\vari{p}_\vari{j}$ is the assignment to $\vari{x}_\vari{j}$ from input $\vct{p}$}
%\EndFor
\State $\vari{Y}_\vari{i} \gets \vari{Y}_\vari{i} \times\; \vari{sgn}_\vari{i}$\label{alg:mon-sam-product}
\State $\accum \gets \accum + \vari{Y}_\vari{i}$\Comment{Store the sum over all samples}\label{alg:mon-sam-add}
%\State $\vari{i} \gets \vari{i} + 1$
\EndIf
\EndFor
%\State $\gamma \gets $ $\algname{Estimate}$ $\gamma(\etree, \numsamp, \abs{\block})$
\State $\vari{acc} \gets \vari{acc} \times \frac{\vari{size}}{\numsamp}$\label{alg:mon-sam-global3}
\State \Return \vari{acc}
\end{algorithmic}
\end{algorithm}
%\begin{algorithm}[H]
% \caption{$\algname{Estimate}$ $\gamma(\etree, \numsamp, \abs{\block})$}
% \label{alg:est-gamma}
% \begin{algorithmic}[1]
% \Require \etree: Binary Expression Tree
% \Require $\numsamp \in \mathbb{N}$
% \Require $\abs{\block} \in \mathbb{N}$
% \Ensure \vari{cTerms} $]in \mathbb{R}$
%
% \State $\vari{cTerms} \gets 0$
% \State $\vari{isCross} \gets 0$
% \For{$\vari{i} \text{ in } 1 \text{ to } \numsamp$}
% \State $\bivec \gets [0]^{\abs{\block}}$
% \State $(\vari{M}, \vari{sgn}) \gets $ \sampmon($\etree_\vari{mod}$)
% \For{$\vari{x}_{\vari{b}, \vari{j}} \text{ in } \vari{M}$}
% \If{$\bivec[b] = 1$}
% \State $\vari{isCross} \gets 1$
% \State Break
% \Else
% \State $\bivec[b] \gets 1$
% \EndIf
% \EndFor
% \If{$\vari{isCross} = 1$}
% \State $\vari{cTerms} \gets \vari{cTerms} + 1$
% \State $\vari{isCross} \gets 0$
% \EndIf
% \EndFor
% \State \Return $\frac{\vari{cTerms}}{\numsamp}$
% \end{algorithmic}
%\end{algorithm}
\subsubsection{Correctness}
In order to prove~\Cref{lem:approx-alg}, we will need to argue the correctness of~\Cref{alg:mon-sam}. Before we formally do that,
we first state the lemmas that summarize the relevant properties of $\onepass$ and $\sampmon$, the auxiliary algorithms on which ~\Cref{alg:mon-sam} relies. Their proofs are given in~\Cref{sec:onepass} and~\Cref{sec:samplemonomial} respectively.
\begin{Lemma}\label{lem:one-pass}
The $\onepass$ function completes in $O(\treesize(\etree))$ time. $\onepass$ guarantees two postconditions: First, for each subtree $\vari{S}$ of $\etree$, we have that $\vari{S}.\vari{partial}$ is set to $\abs{\vari{S}}(1,\ldots, 1)$. Second, when $\vari{S}.\type = +$, for each $\vari{child}$ of $\vari{S}$, $\vari{child}.\vari{weight}$ is set to $\frac{\abs{\vari{S}_{\vari{child}}}(1,\ldots, 1)}{\abs{\vari{S}}(1,\ldots, 1)}$. % is correctly computed for each child of $\vari{S}.$
\end{Lemma}
To prove correctness of~\Cref{alg:mon-sam}, we only use the following fact that follows from the above lemma: $\etree_{\vari{mod}}.\vari{partial}=\abs{\etree}(1,\dots,1)$.
%\AH{I'm wondering if there is a better notation to use here. I myself got confused by my own notation of $\etree_{\vari{mod}}$. \emph{But}, we need to to be referencing the modified $\etree$ returned by $\onepass$ in the algorithm, so maybe this is the best we can do?}
%\AR{yeah, I think this is fine.}
%At the conclusion of $\onepass$, $\etree.\vari{partial}$ will hold the sum of all coefficients in $\expandtree{\abs{\etree}}$, i.e., $\sum\limits_{(\monom, \coef) \in \expandtree{\abs{\etree}}}\coef$. $\etree.\vari{weight}$ will hold the weighted probability that $\etree$ is sampled from from its parent $+$ node.
\begin{Lemma}\label{lem:sample}
The function $\sampmon$ completes in $O(\log{k} \cdot k \cdot depth(\etree))$ time, where $k = \degree(\polyf(\abs{\etree}))$. Upon completion, every $\left(\monom, sign(\coef)\right)\in \expandtree{\abs{\etree}}$ is returned with probability $\frac{|\coef|}{\abs{\etree}(1,\ldots, 1)}$. %, $\sampmon$ returns the sampled term $\left(\monom, sign(\coef)\right)$ from $\expandtree{\abs{\etree}}$.
\end{Lemma}
Armed with the above two lemmas, we are ready to argue the following result (proof in~\Cref{sec:proofs-approx-alg}):
\begin{Theorem}\label{lem:mon-samp}
%If the contracts for $\onepass$ and $\sampmon$ hold, then
For any $\etree$ with $\degree(\polyf(\abs{\etree})) = k$, \Cref{alg:mon-sam} outputs an estimate $\vari{acc}$ of $\rpoly(\prob_1,\ldots, \prob_\numvar)$ such that %$\expct\pbox{\empmean} = \frac{\rpoly(\prob_1,\ldots, \prob_\numvar)\cdot(1 - \gamma)}{\abs{\etree}(1,\ldots, 1)}$. %within an additive $\error \cdot \abs{\etree}(1,\ldots, 1)$ error with
\[P\left(\left|\vari{acc} - \rpoly(\prob_1,\ldots, \prob_\numvar)\right|> \error \cdot \abs{\etree}(1,\ldots, 1)\right) \leq \conf,\]
in $O\left(\treesize(\etree)\right.$ $+$ $\left.\left(\frac{\log{\frac{1}{\conf}}}{\error^2} \cdot k \cdot\log{k} \cdot depth(\etree)\right)\right)$ time.
\end{Theorem}
\subsection{\onepass\ Algorithm}
\label{sec:onepass}
%\subsubsection{Description}
%Algorithm ~\ref{alg:one-pass} satisfies the requirements of lemma ~\ref{lem:one-pass}.
The evaluation of $\abs{\etree}(1,\ldots, 1)$ can be defined recursively, as follows (where $\etree_\lchild$ and $\etree_\rchild$ are the `left' and `right' children of $\etree$ if they exist):
{\small
\begin{align}
\label{eq:T-all-ones}
\abs{\etree}(1,\ldots, 1) = \begin{cases}
\abs{\etree_\lchild}(1,\ldots, 1) \cdot \abs{\etree_\rchild}(1,\ldots, 1) &\textbf{if }\etree.\type = \times\\
\abs{\etree_\lchild}(1,\ldots, 1) + \abs{\etree_\rchild}(1,\ldots, 1) &\textbf{if }\etree.\type = + \\
|\etree.\val| &\textbf{if }\etree.\type = \tnum\\
1 &\textbf{if }\etree.\type = \var.
\end{cases}
\end{align}
}
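\begin{Example}
As a sanity check of~\Cref{eq:T-all-ones}, consider the tree $\etree$ of~\Cref{example:expr-tree-T}. Every variable leaf evaluates to $1$, and the numeric leaves $2$, $2$, and $-1$ evaluate to $2$, $2$, and $1$. The two $+$ nodes below the root therefore evaluate to $1 + 2\cdot 1 = 3$ and $2\cdot 1 + 1\cdot 1 = 3$, and the root $\times$ node gives
\[\abs{\etree}(1, 1) = 3\cdot 3 = 9,\]
which agrees with the sum of absolute coefficients $2+1+4+2$ in $\expandtree{\etree}$.
\end{Example}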
%\begin{align*}
%&\eval{\etree ~|~ \etree.\type = +}_{\abs{\etree}} =&& \eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}\\
%&\eval{\etree ~|~ \etree.\type = \times}_{\abs{\etree}} = && \eval{\etree_\lchild}_{\abs{\etree}} \cdot \eval{\etree_\rchild}_{\abs{\etree}}\\
%&\eval{\etree ~|~ \etree.\type = \tnum}_{\abs{\etree}} = && \etree.\val\\
%&\eval{\etree ~|~ \etree.\val = \var}_{\abs{\etree}} = && 1
%\end{align*}
%In the same fashion the weighted distribution can be described as above with the following modification for the case when $\etree.\type = +$:
It turns out that for the proof of~\Cref{lem:sample}, we need to argue that when $\etree.\type = +$, we indeed have
\begin{align}
\label{eq:T-weights}
%&\abs{\etree_\lchild}(1,\ldots, 1) + \abs{\etree_\rchild}(1,\ldots, 1); &\textbf{if }\etree.\type = + \\
\etree_\lchild.\vari{weight} &\gets \frac{\abs{\etree_\lchild}(1,\ldots, 1)}{\abs{\etree_\lchild}(1,\ldots, 1) + \abs{\etree_\rchild}(1,\ldots, 1)};\\
\etree_\rchild.\vari{weight} &\gets \frac{\abs{\etree_\rchild}(1,\ldots, 1)}{\abs{\etree_\lchild}(1,\ldots, 1)+ \abs{\etree_\rchild}(1,\ldots, 1)}
\end{align}
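\begin{Example}
For illustration, take the right child $\etree_\rchild$ of the root in~\Cref{fig:expr-tree-T}, a $+$ node whose children represent $2X$ and $-Y$. These children evaluate to $2$ and $1$ in the positive tree, so~\Cref{eq:T-weights} assigns them weights $\frac{2}{3}$ and $\frac{1}{3}$, respectively. Likewise, the children $X$ and $2Y$ of the left $+$ node receive weights $\frac{1}{3}$ and $\frac{2}{3}$.
\end{Example}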
%\begin{align*}
%&\eval{\etree~|~\etree.\type = +}_{\wght} =&&\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}; \etree_\lchild.\wght = \frac{\eval{\etree_\lchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}; \etree_\rchild.\wght = \frac{\eval{\etree_\rchild}_{\abs{\etree}}}{\eval{\etree_\lchild}_{\abs{\etree}} + \eval{\etree_\rchild}_{\abs{\etree}}}
%\end{align*}
\noindent \onepass\ (\Cref{alg:one-pass} in \Cref{sec:proofs-approx-alg}) essentially populates the \vari{partial} and \vari{weight} variables on each node according to~\Cref{eq:T-all-ones} and~\Cref{eq:T-weights}.
%\subsubsection{Psuedo Code}
%See algorithm ~\ref{alg:one-pass} for details.
\subsection{\sampmon\ Algorithm}
\label{sec:samplemonomial}
%Algorithm ~\ref{alg:sample} takes $\etree$ as input, samples an arbitrary $(\monom, \coef)$ from $\expandtree{\etree}$ with probabilities $\stree_\lchild.\wght$ and $\stree_\rchild.\wght$ for each subtree $\stree$ with $\stree.\type = +$, outputting the tuple $(\monom, \sign(\coef))$. While one cannot compute $\expandtree{\etree}$ in time better than $O(N^k)$, the algorithm, similar to \textsc{OnePass}, uses a technique on $\etree$ which produces a sample from $\expandtree{\etree}$ without ever materializing $\expandtree{\etree}$.
A naive (slow) implementation of \sampmon\ would first compute $\expandtree{\etree}$ and then sample from it.
% However, this would be too time consuming.
%
Instead, \Cref{alg:sample} selects a monomial from $\expandtree{\etree}$ by top-down traversal.
For a parent $+$ node, the child to be visited is sampled from the weighted distribution precomputed by \onepass.
When a parent $\times$ node is visited, both children are visited.
The algorithm computes two properties: the set of all variable leaf nodes visited, and the product of signs of visited coefficient leaf nodes.
%\begin{Definition}[TreeSet]
%A TreeSet is a data structure whose elements form a set, each of which are stored in a binary tree.
%\end{Definition}
We assume a TreeSet data structure that maintains sets with logarithmic-time insertion and linear-time traversal of its elements.
%
$\sampmon$ is given in \Cref{alg:sample}, and a proof of its correctness (via \Cref{lem:sample}) is provided in \Cref{sec:proofs-approx-alg}.
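\begin{Example}
To see the sampling distribution at work, consider again $\etree$ from~\Cref{example:expr-tree-T}, where $\abs{\etree}(1,1)=9$. The left $+$ node selects $X$ or $2Y$ with probability $\frac{1}{3}$ or $\frac{2}{3}$, the right $+$ node selects $2X$ or $-Y$ with probability $\frac{2}{3}$ or $\frac{1}{3}$, and the root $\times$ node combines both choices. The four possible outcomes are
\begin{align*}
\Pr[(\inset{X}, +1)] &= \tfrac{1}{3}\cdot\tfrac{2}{3} = \tfrac{2}{9}, & \Pr[(\inset{X,Y}, -1)] &= \tfrac{1}{3}\cdot\tfrac{1}{3} = \tfrac{1}{9},\\
\Pr[(\inset{X,Y}, +1)] &= \tfrac{2}{3}\cdot\tfrac{2}{3} = \tfrac{4}{9}, & \Pr[(\inset{Y}, -1)] &= \tfrac{2}{3}\cdot\tfrac{1}{3} = \tfrac{2}{9},
\end{align*}
matching $\frac{\abs{\coef}}{\abs{\etree}(1,1)}$ for the coefficients $2, -1, 4, -2$ of $\expandtree{\etree}$, as~\Cref{lem:sample} requires.
\end{Example}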
\begin{algorithm}[t]
\caption{\sampmon(\etree)}
\label{alg:sample}
\begin{algorithmic}[1]
\Require \etree: Binary Expression Tree
\Ensure \vari{vars}: TreeSet
\Ensure \vari{sgn} $\in \{-1, 1\}$
\Comment{\Cref{alg:one-pass} should have been run before this one} % algorithm ~\ref{alg:sample}}
\State $\vari{vars} \gets \emptyset$ \label{alg:sample-global1}
\If{$\etree.\type = +$}\Comment{Sample at every $+$ node}
\State $\etree_{\vari{samp}} \gets$ Sample from left subtree ($\etree_{\lchild}$) and right subtree ($\etree_{\rchild}$) w.p. $\etree_\lchild.\wght$ and $\etree_\rchild.\wght$. \label{alg:sample-plus-bsamp}
\State $(\vari{v}, \vari{s}) \gets \sampmon(\etree_{\vari{samp}})$\label{alg:sample-plus-traversal}
% \State $\vari{vars} \gets \vari{vars} \;\cup \;\{\vari{v}\}$\label{alg:sample-plus-union}
% \State $\vari{sgn} \gets \vari{sgn} \times \vari{s}$\label{alg:sample-plus-product}
\State $\Return ~(\vari{v}, \vari{s})$
\ElsIf{$\etree.\type = \times$}\Comment{Multiply the sampled values of all subtree children}
\State $\vari{sgn} \gets 1$\label{alg:sample-global2}
\For {$child$ in $\etree.\vari{children}$}
\State $(\vari{v}, \vari{s}) \gets \sampmon(child)$
\State $\vari{vars} \gets \vari{vars} \cup \vari{v}$\label{alg:sample-times-union}
\State $\vari{sgn} \gets \vari{sgn} \times \vari{s}$\label{alg:sample-times-product}
\EndFor
\State $\Return ~(\vari{vars}, \vari{sgn})$
\ElsIf{$\etree.\type = \tnum$}\Comment{The leaf is a coefficient}
%\State $\vari{sgn} \gets \vari{sgn} \times sign(\etree.\val)$
\State $\Return ~\left(\{\}, sign(\etree.\val)\right)$\label{alg:sample-num-return}
\ElsIf{$\etree.\type = \var$}
%\State $\vari{vars} \gets \vari{vars} \; \cup \; \{\;\etree.\val\;\}\label{alg:sample-var-union}$\Comment{Add the variable to the set}
\State $\Return~\left(\{\etree.\val\}, 1\right) $\label{alg:sample-var-return}
\EndIf
\end{algorithmic}
\end{algorithm}
% \subsection{Experimental results}
% \label{sec:experiments}
% We conducted an experiment running modified TPCH queries over uncertain data generated by pdbench~\cite{pdbench}, both of which (data and queries) represent what is typically encountered in practice. Queries were run two times, once filtering $\bi$ cancellations, and then second not filtering the cancellations. The purpose of this was to determine an indication for how many $\bi$ cancellations occur in practice. Details and results can be found in~.
%\AR{Experimental stuff about \bi should go in here}
%%%%%%%%%%%%%%%%%%%%%%%
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: