Added figures to the revised intro

master
Aaron Huber 2021-06-25 10:38:59 -04:00
parent 0213a07ec2
commit 4cf2dd7c31
3 changed files with 115 additions and 6 deletions

@ -38,8 +38,13 @@
\end{outline}
\AH{Setting}
A probabilistic database (\abbrPDB) $\pdb$ is a two-tuple ($\idb, \pd$) such that $\idb$ is the set of possible worlds $\db$ represented by $\pdb$, and $\pd$ is the associated probability distribution over the worlds $\db$ in $\idb$. Given a query $\query$, the output of $\query(\pdb)$ is ($\idb', \pd'$) such that $\idb' = \{\query(\db_i) \suchthat i \in [\numvar]\}$, where $\numvar = \abs{\idb}$ is the number of possible worlds in $\pdb$, and $\pd'$ is the resulting probability distribution over $\idb'$. Computing $\query$ as outlined above can be modeled in two steps: the first step is the deterministic computation of both the query output and the result tuple lineage(s) encoded in the respective representation, and the second step computes the probability distribution. Both set-\abbrPDB computation and semiring provenance follow this computational model, which is useful in this work for separating the deterministic computation from the probability computation.
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{twostep}
\caption{Two-Step Computation Model (\abbrPDB\xplural)}
\label{fig:two-step}
\end{figure}
A probabilistic database (\abbrPDB) $\pdb$ is a two-tuple ($\idb, \pd$) such that $\idb$ is the set of possible worlds $\db$ represented by $\pdb$, and $\pd$ is the associated probability distribution over the worlds $\db$ in $\idb$. Given a query $\query$, the output of $\query(\pdb)$ is ($\idb', \pd'$) such that $\idb' = \{\query(\db_i) \suchthat i \in [\numvar]\}$, where $\numvar = \abs{\idb}$ is the number of possible worlds in $\pdb$, and $\pd'$ is the resulting probability distribution over $\idb'$. As depicted in \cref{fig:two-step}, computing $\query$ as outlined above can be modeled in two steps: the first step is the deterministic computation of both the query output and the result tuple lineage(s) encoded in the respective representation, and the second step computes the probability distribution. Both set-\abbrPDB computation and semiring provenance follow this computational model, which is useful in this work for separating the deterministic computation from the probability computation.
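To make $\pd'$ concrete, one can read it under the usual possible-worlds semantics (an assumption here, since the definition above does not spell it out): each output world inherits the probability mass of all input worlds that map to it,
\[
\pd'(\db') = \sum_{\db \in \idb \,:\, \query(\db) = \db'} \pd(\db) \qquad \text{for each } \db' \in \idb'.
\]
For instance, if two worlds $\db_1, \db_2 \in \idb$ with $\pd(\db_1) = 0.3$ and $\pd(\db_2) = 0.2$ (hypothetical values chosen only for illustration) both yield the same query result $\db'$, and no other world does, then $\pd'(\db') = 0.3 + 0.2 = 0.5$.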
Much work already exists regarding \abbrPDB\xplural, most of which considers $\pdb$ to be a set, meaning that every possible world $\db$ is a \emph{set} of tuples. The problem of computing $\query$ \emph{exactly} over a set-\abbrPDB is known to be \sharpphard in the general case. The dichotomy of Dalvi and Suciu shows that for set-\abbrPDB\xplural, computing $\query(\pdb)$ is either polynomial-time or \sharpphard. Further, this dichotomy is \emph{based} on the query structure and in general is independent of the representation of the lineage polynomial.\footnote{We note that in specific cases, a particular database instance combined with an amenable representation can make a hard $\query$ easy, but this is \emph{not} the general case.} The hardness results for set-\abbrPDB\xplural depend on step two of the computation model.
@ -58,7 +63,111 @@ Concretely, we make the following contributions:
(i) We show that the expected result multiplicity problem (\Cref{def:the-expected-multipl}) for conjunctive queries for bag-$\ti$s is \sharpwonehard in the size of a lineage circuit by reduction from counting the number of $k$-matchings over an arbitrary graph;
(ii) We present a $(1\pm\epsilon)$-\emph{multiplicative} approximation algorithm for bag-$\ti$s and show that for typical database usage patterns (e.g., when the circuit is a tree or is generated by recent worst-case optimal join algorithms or their FAQ followups~\cite{DBLP:conf/pods/KhamisNR16}) its complexity is linear in the size of the compressed lineage encoding (in contrast, known approximation techniques in set-\abbrPDB\xplural are quadratic); (iii) We generalize the approximation algorithm to bag-$\bi$s, a more general model of probabilistic data; (iv) We further prove that for \raPlus queries (an equivalently expressive, but factorizable form of UCQs), we can approximate the expected output tuple multiplicities with only $O(\log{Z})$ overhead (where $Z$ is the number of output tuples) over the runtime of a broad class of query processing algorithms. We also observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. Next, we motivate this reduced polynomial by continuing \Cref{ex:intro-tbls}.
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\Phi$. In fact, it turns out that for the TIDB (and BIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the TIDB/BIDB. We motivate this reduced polynomial in what follows.%continuing \Cref{ex:intro-tbls}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\begin{subfigure}[b]{0.49\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c}
$OnTime$ & City$_\ell$ & $\Phi$ & \textbf{p}\\
\hline
& Buffalo & $L_a$ & 0.9 \\
& Chicago & $L_b$ & 0.5\\
& Bremen & $L_c$ & 0.5\\
& Zurich & $L_d$ & 1.0\\
\end{tabular}
}
\caption{Relation $OnTime$}
\label{subfig:ex-shipping-simp-loc}
\end{subfigure}%
\begin{subfigure}[b]{0.49\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c c}
$Route$ & $\text{City}_1$ & $\text{City}_2$ & $\Phi$ & \textbf{p} \\
\hline
& Buffalo & Chicago & $R_a$ & 1.0 \\
& Chicago & Zurich & $R_b$ & 1.0 \\
%& $\cdots$ & $\cdots$ & $\cdots$ & $\cdots$ \\
& Chicago & Bremen & $R_c$ & 1.0 \\
\end{tabular}
}
\caption{Relation $Route$}
\label{subfig:ex-shipping-simp-route}
\end{subfigure}%
% \begin{subfigure}[b]{0.17\linewidth}
% \centering
% \caption{Circuit for $(Chicago)$}
% \label{subfig:ex-proj-push-circ-q3}
% \end{subfigure}
\begin{subfigure}[b]{0.66\linewidth}
\centering
{\small
\begin{tabular}{ c | c c c}
$\query_1$ & City & $\Phi$ & $\expct_{\idb \sim \probDist}[\query(\db)(t)]$ \\ \hline
& Buffalo & $L_a \cdot R_a$ & $0.9$ \\
& Chicago & $L_b \cdot R_b + L_b \cdot R_c$ & $0.5 \cdot 1.0 + 0.5 \cdot 1.0 = 1.0$ \\
%& $\cdots$ & $\cdots$ & $\cdots$ \\
\end{tabular}
}
\caption{$Q_1$'s Result}
\label{subfig:ex-shipping-simp-queries}
\end{subfigure}%
\begin{subfigure}[b]{0.33\linewidth}
\centering
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[tree_node] (a2) at (0, 0){$R_b$};
\node[tree_node] (b2) at (1, 0){$L_b$};
\node[tree_node] (c2) at (2, 0){$R_c$};
%level 1
\node[tree_node] (a1) at (0.5, 0.8){$\boldsymbol{\circmult}$};
\node[tree_node] (b1) at (1.5, 0.8){$\boldsymbol{\circmult}$};
%level 0
\node[tree_node] (a0) at (1.0, 1.6){$\boldsymbol{\circplus}$};
%edges
\draw[->] (a2) -- (a1);
\draw[->] (b2) -- (a1);
\draw[->] (b2) -- (b1);
\draw[->] (c2) -- (b1);
\draw[->] (a1) -- (a0);
\draw[->] (b1) -- (a0);
\end{tikzpicture}
}
\resizebox{!}{16mm} {
\begin{tikzpicture}[thick]
\node[tree_node] (a1) at (1, 0){$R_b$};
\node[tree_node] (b1) at (2, 0){$R_c$};
%level 1
\node[tree_node] (a2) at (0.75, 0.8){$L_b$};
\node[tree_node] (b2) at (1.5, 0.8){$\boldsymbol{\circplus}$};
%level 0
\node[tree_node] (a3) at (1.1, 1.6){$\boldsymbol{\circmult}$};
%edges
\draw[->] (a1) -- (b2);
\draw[->] (b1) -- (b2);
\draw[->] (a2) -- (a3);
\draw[->] (b2) -- (a3);
\end{tikzpicture}
}
\caption{Two circuits for $Q_1(Chicago)$}
\label{subfig:ex-proj-push-circ-q4}
\end{subfigure}%
\vspace*{-3mm}
\caption{\ti instance and query results for \cref{ex:overview}}%\Cref{ex:intro-tbls}.}%{$\ti$ relations for $\poly$}
\label{fig:ex-shipping-simp}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Consider the query $Q()\dlImp$$OnTime(\text{City}), Route(\text{City}, \text{City}'),$ $OnTime(\text{City}')$ over the bag relations of \Cref{fig:ex-shipping-simp}. It can be verified that $\Phi$ for $Q$ is $L_aL_b + L_bL_d + L_bL_c$. Now consider the product query $\query^2()\dlImp Q(), Q()$.
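As a sanity check on the claim that the reduced polynomial yields the expectation, the following worked calculation may help. It is a sketch under two assumptions that are not stated in the text above: each variable $L_x$ is an independent $\{0,1\}$-valued random variable with $\Pr[L_x = 1] = p_x$ taken from \Cref{subfig:ex-shipping-simp-loc}, and the $Route$ variables are omitted from $\Phi$ because each of them equals $1$ with probability $1.0$. By linearity of expectation and independence,
\[
\expct[\Phi] = p_a p_b + p_b p_d + p_b p_c = 0.9 \cdot 0.5 + 0.5 \cdot 1.0 + 0.5 \cdot 0.5 = 1.2 .
\]
For $\query^2$ the lineage is $\Phi^2$, which expands to
\[
\Phi^2 = L_a^2L_b^2 + L_b^2L_d^2 + L_b^2L_c^2 + 2L_aL_b^2L_d + 2L_aL_b^2L_c + 2L_b^2L_cL_d .
\]
Since $L_x^2 = L_x$ for $\{0,1\}$-valued variables, replacing every exponent greater than $1$ by $1$ (one natural reading of the reduction, assumed here) gives a polynomial with the same expectation:
\[
\expct[\Phi^2] = p_ap_b + p_bp_d + p_bp_c + 2p_ap_bp_d + 2p_ap_bp_c + 2p_bp_cp_d = 0.45 + 0.5 + 0.25 + 0.9 + 0.45 + 0.5 = 3.05 .
\]
By contrast, substituting the probabilities directly into the unreduced $\Phi^2$ would give $1.2^2 = 1.44 \neq 3.05$, which is why the reduction step matters.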

@ -110,10 +110,10 @@ sensitive=true
\maketitle
\input{abstract}
\input{intro-rewrite2}
\input{intro-rewrite2}%ICDT 2nd Round submission
%\input{outline-intro-new}
%\input{intro-new}
% \input{intro}
%\input{intro-new}%ICDT 1st Round submission
% \input{intro}--PODS submission
\input{ra-to-poly}
\input{poly-form}
\input{prob-def}

BIN
twostep.png Normal file

Binary file not shown.
