Oliver's pass

master
Oliver Kennedy 2021-09-21 00:07:45 -04:00
parent 0ea83c2fb3
commit d24000557e
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
4 changed files with 13 additions and 13 deletions

@ -346,17 +346,17 @@ We leave further investigations for future work.
Computing the marginal probability of a tuple in the output of a set-probabilistic database query has been studied extensively.
To the best of our knowledge, the current state-of-the-art approximation algorithm for this problem is the Karp-Luby estimator~\cite{DBLP:journals/jal/KarpLM89}, which was first used for probabilistic databases in MayBMS/Sprout~\cite{DBLP:conf/icde/OlteanuHK10}, and more recently as part of an online ``anytime'' approximation algorithm~\cite{FH13,heuvel-19-anappdsd}.
The estimator works by observing that for any $\ell$ random binary (but not necessarily independent) events $\vct{W}_1, \ldots, \vct{W}_\ell$, the probability of at least one event occurring (i.e., $\probOf\inparen{\vct{W}_1 \vee \ldots \vee\vct{W}_\ell}$) is bounded from above by the sum of the individual event probabilities (i.e., $\probOf\inparen{\vct{W}_1 \vee \ldots \vee \vct{W}_\ell} \leq \probOf\inparen{\vct{W}_1} + \ldots + \probOf\inparen{\vct{W}_\ell}$).
Starting from this (`easily' computable and large) value, the estimator proceeds to correct the estimate by estimating how much of an over-estimate it is.
Specifically, if $\mathcal P$ is the joint distribution over $\vct{W}$, the estimator computes an approximation of:
$$\mathcal O = \underset{\vct{W} \sim \mathcal P}{\expct}\Big[
\max\inparen{\left|\comprehension{i}{\vct{W}_i = 1, i \in [\ell]}\right| - 1,\; 0}
\Big],$$
i.e., the expected number of events satisfied {\em beyond} the first one.
The accuracy of this estimate is improved by conditioning $\mathcal P$ on a $\vct{W}_i$ chosen uniformly at random (which ensures that the sampled count will be at least 1) and correcting the resulting estimate by $\probOf\inparen{\vct{W}_i}$. With an estimate of $\mathcal O$, it can easily be verified that the probability of the disjunction can be computed as:
$$\probOf\inparen{\vct{W}_1 \vee \ldots \vee\vct{W}_\ell} = \probOf\inparen{\vct{W}_1} + \ldots + \probOf\inparen{\vct{W}_\ell} - \mathcal O$$
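For intuition, consider two independent events with $\probOf\inparen{\vct{W}_1} = \probOf\inparen{\vct{W}_2} = \frac{1}{2}$ (a purely illustrative instance): the union bound gives $1$, the over-count is $\mathcal O = \probOf\inparen{\vct{W}_1 \wedge \vct{W}_2} = \frac{1}{4}$, and indeed $\probOf\inparen{\vct{W}_1 \vee \vct{W}_2} = 1 - \frac{1}{4} = \frac{3}{4}$.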
The Karp-Luby estimator is employed on the \abbrSMB representation\footnote{Note that under set semantics, addition in the lineage polynomial/formula corresponds to logical OR and multiplication to logical AND.} of $\circuit$ (to solve the set-PDB version of \Cref{prob:intro-stmt}), where each $\vct{W}_i$ represents the event that one monomial is true.
By simple inspection, if there are $\ell$ monomials, this estimator has runtime $\Omega(\ell)$. Further, a minimum of $\left\lceil\frac{3\cdot \ell\cdot \log(\frac{2}{\delta})}{\epsilon^2}\right\rceil$ invocations of the estimator are required to achieve a $1\pm\epsilon$ approximation with probability at least $1-\delta$~\cite{DBLP:conf/icde/OlteanuHK10}, entailing a runtime at least quadratic in $\ell$.
As an arbitrary lineage circuit $\circuit$ may encode $\Omega\inparen{|\circuit|^k}$ monomials, the worst-case runtime is at least $\Omega\inparen{|\circuit|^{2k}}$ (where $k$ is the `degree' of the lineage polynomial encoded by $\circuit$). By contrast, note that by the discussion after \Cref{lem:val-ub} we can solve \Cref{prob:intro-stmt} in time $O\inparen{|\circuit|^2}$ for all \abbrBIDB circuits, {\em independent} of the degree $k$.
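For concreteness, the following is a minimal, illustrative Python sketch of the classic Karp--Luby ``coverage'' formulation of this estimator, specialized to monomials over independent (\abbrTIDB-style) variables; it estimates the disjunction probability directly rather than the correction term $\mathcal O$, and the data layout (\texttt{monomials} as collections of variable identifiers, \texttt{probs} as their marginal probabilities) is an assumption made only for this example, not the MayBMS/Sprout implementation.
\begin{lstlisting}[language=Python]
import math, random

def karp_luby(monomials, probs, eps, delta):
    """Estimate P(C_1 OR ... OR C_l) for monomials (collections of
    variable ids) over independent Boolean variables with marginals probs."""
    p_mono = [math.prod(probs[v] for v in m) for m in monomials]
    total = sum(p_mono)          # the union-bound over-estimate
    if total == 0.0:
        return 0.0
    ell = len(monomials)
    # Number of invocations suggested by the analysis discussed above.
    n = math.ceil(3 * ell * math.log(2 / delta) / eps ** 2)
    hits = 0
    for _ in range(n):
        # Pick a monomial proportionally to its probability ...
        i = random.choices(range(ell), weights=p_mono)[0]
        # ... and sample a world conditioned on that monomial holding.
        world = {v: (v in monomials[i]) or (random.random() < probs[v])
                 for v in probs}
        # Count the sample only if i is the *first* satisfied monomial;
        # this makes (total * hits / n) an unbiased estimate of P(union).
        first = next(j for j, m in enumerate(monomials)
                     if all(world[v] for v in m))
        hits += (first == i)
    return total * hits / n
\end{lstlisting}
Each invocation scans all $\ell$ monomials to find the first satisfied one, which is one way to see the $\Omega(\ell)$ per-invocation cost noted above.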

@ -22,7 +22,7 @@ of tuple $\tup$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
For \lstinline{COUNT(*)} queries, expected multiplicities can model the expected count; the equivalent set-\abbrPDB operation simply computes the probability that this count is non-zero.
Further, we are interested in the parameterized complexity of \Cref{prob:bag-pdb-query-eval} (i.e., we think of $Q$ as being parameterized by some parameter $k$, with the size of the database going to infinity relative to $k$). Unless stated otherwise, we implicitly assume the probability distribution $\pd$, and for notational convenience use $\expct\pbox{\cdot}$ instead of $\expct_\pd\pbox{\cdot}$. Further, define $\dbbase=\bigcup_{\db\in\idb} \db$.
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages, propositional formulas that describe the set of possible worlds that the tuple appears in.
%\AR{Removed couple of sentence on lineage formula since we explicitly define $\poly$ now.}
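As a simple illustration (an example of ours, not drawn from the cited encodings), consider a \abbrTIDB relation with three tuples annotated by independent Boolean variables $X_1, X_2, X_3$ that are $1$ with probabilities $p_1, p_2, p_3$ respectively. For \lstinline{COUNT(*)} over this relation, the bag-semantics result is the random variable $X_1 + X_2 + X_3$, whose expected value is $p_1 + p_2 + p_3$ by linearity of expectation, while the corresponding set-\abbrPDB question asks for the probability that this count is non-zero, $\probOf\inparen{X_1 \vee X_2 \vee X_3} = 1 - (1-p_1)(1-p_2)(1-p_3)$. The lineage of the single result tuple is the polynomial $X_1 + X_2 + X_3$ under bag semantics and the propositional formula $X_1 \vee X_2 \vee X_3$ under set semantics.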
@ -143,10 +143,10 @@ Given an $\raPlus$ query $\query$ and \abbrTIDB
%The problem of deterministic query evaluation is known to be \sharpwonehard\footnote{A problem is in \sharpwone if the runtime of the most efficient known algorithm to solve it is lower bounded by some function $f$ of a parameter $k$, where the growth in runtime is polynomially dependent on $f(k)$, i.e. $\Omega\inparen{\numvar^{f(k)}}$.} in data complexity for general $\query$. For example, the counting $k$-cliques query problem (where the parameter $k$ is the size of the clique) is \sharpwonehard since (under standard complexity assumptions) it cannot run in time faster than $n^{f(k)}$ for some strictly increasing $f(k)$.
%In this paper, we begin to explore whether the problem of bag-probabilistic query evaluation (which we relate to deterministic query processing more precisely below) falls into this same complexity class.
We note that the above is a special case of \Cref{prob:bag-pdb-query-eval} since we are asking whether query evaluation over a \abbrBPDB is {\em linear} in the runtime of deterministic query processing.
We stress that this question is very well motivated, even for one of the simplest models of probabilistic databases (i.e., \abbrTIDBs): An answer in the affirmative for~\Cref{prob:informal} indicates that bag-probabilistic databases can be competitive with deterministic databases, opening the door for deployment in practice.
\mypar{Our lower bound results} Unfortunately, we prove that this is not the case. In fact, in Table~\ref{tab:lbs} %\AR{Cref was not formatting Table correct so added Table in explicitly.}
we show that depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to \Cref{prob:informal}.
\begin{table}
\begin{tabular}{|p{0.43\textwidth}|p{0.12\textwidth}|p{0.35\textwidth}|}
@ -164,7 +164,7 @@ $\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{c_0\cdot k}}$ for {\em som
\label{tab:lbs}
\end{table}
Note that the lower bound in the first row by itself is enough to refute \Cref{prob:informal}.
To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries). What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.
%\footnote{
% We note similar hardness results for determinsitic query processing that apply lower bounds in terms of $\abs{\dbbase}$. Our lower bounds are in terms of $\qruntime{Q,\dbbase}$, which in general can be super-linear in $\abs{\dbbase}$.
%}
@ -242,7 +242,7 @@ Note that if the answer to the above problem is yes, then we have shown that the
We show in \Cref{sec:circuit-depth} %{sec:gen}\AR{Refs needs to be updated}
%\OK{confirm this ref}
%Atri: fixed the ref
an $O(\qruntime{Q, \dbbase})$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
% , and by extension the first step is in \sharpwonehard\AH{\sharpwonehard is not defined.}.
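As a rough, purely illustrative sketch of the kind of object being constructed (the node encoding, relation layout, and single-join query shape below are assumptions of ours; the actual construction of \Cref{sec:circuit-depth} handles arbitrary $\raPlus$ queries and runs in $O(\qruntime{Q, \dbbase})$ by building the circuit alongside query evaluation):
\begin{lstlisting}[language=Python]
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Gate:
    op: str                        # 'var' (base-tuple variable), '+', or 'x'
    children: Tuple["Gate", ...] = ()
    label: str = ""                # identifies the base tuple for 'var' gates

def join_lineage(R, S, key):
    """One sink gate per output tuple of R JOIN S on `key`: base tuples
    become 'var' gates, each matching pair an 'x' gate, and alternative
    derivations of the same output tuple are combined with a '+' gate."""
    sinks = {}
    for i, r in enumerate(R):
        for j, s in enumerate(S):
            if r[key] != s[key]:
                continue
            out = r[key]           # simplified output schema: just the join key
            term = Gate("x", (Gate("var", label=f"r{i}"),
                              Gate("var", label=f"s{j}")))
            sinks[out] = term if out not in sinks else Gate("+", (sinks[out], term))
    return sinks

# Toy usage: two alternative derivations of the output tuple with key 1.
R = [{"k": 1, "a": 10}, {"k": 1, "a": 20}]
S = [{"k": 1, "b": 30}]
print(join_lineage(R, S, "k"))
\end{lstlisting}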
A key insight of this paper is that the representation of $\circuit$ matters.
For example, if we insist that $\circuit$ represent the lineage polynomial in the standard monomial basis (henceforth, \abbrSMB)\footnote{

@ -60,7 +60,7 @@ SELECT COUNT(*) FROM $R_1$ JOIN $R_2$ JOIN$\cdots$JOIN $R_k$
In other words, for this instance $\dbbase$ contains the set of $n$ unary tuples in $OnTime$ (which corresponds to $\vset$) and $m$ binary tuples in $Route$ (which corresponds to $\edgeSet$).
Note that this implies that $\poly_{G}^\kElem$ is indeed a \abbrTIDB-lineage polynomial. % for a \abbrTIDB \abbrPDB.
Next, we note that the runtime for answering $\query^k$ on the deterministic database $\dbbase$, as defined above, is $O(m)$ (i.e., deterministic query processing is `easy' for this query):
\begin{Lemma}\label{lem:tdet-om}
Let $\query^k$ and $\dbbase$ be as defined above. Then
% of \Cref{def:qk}, the runtime

@ -5,7 +5,7 @@
\subsection{Probabilistic Databases}
Following the typical representation of bags in production databases, for query inputs, we will use \abbrBPDB\xplural with multiplicities $\{0, 1\}$ (see \Cref{sec:gener-results-beyond} for more on this choice).
% and a unique tuple-id field to allow duplicate tuples.
An \textit{incomplete database} $\idb$ is a set of deterministic databases $\db$ called possible worlds.
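As a toy illustration (an encoding of ours, not the paper's notation), the following Python sketch enumerates the possible worlds of an incomplete database whose tuples are, hypothetically, present independently, pairing each world with the probability a tuple-independent distribution would assign to it:
\begin{lstlisting}[language=Python]
from itertools import product

def possible_worlds(tuples, probs):
    """Enumerate the 2^n possible worlds of a tuple-independent incomplete
    database, pairing each world (a frozenset of tuples) with its probability."""
    worlds = []
    for present in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, present) if keep)
        p = 1.0
        for t, keep in zip(tuples, present):
            p *= probs[t] if keep else (1 - probs[t])
        worlds.append((world, p))
    return worlds

# Toy instance: two tuples, each present independently.
print(possible_worlds(["t1", "t2"], {"t1": 0.9, "t2": 0.5}))
\end{lstlisting}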