|
|
|
@ -2,7 +2,7 @@
|
|
|
|
|
%root: main.tex
|
|
|
|
|
\section{Introduction}\label{sec:intro}
|
|
|
|
|
|
|
|
|
|
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]\stackrel{\text{def}}{=}\{0,1,\dots,\bound\}$ for some fixed constant $\bound$ and multiplicities assigned to any two tuples are independent of each other.
|
|
|
|
|
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB (tuple independent database), a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]\stackrel{\text{def}}{=}\{0,1,\dots,\bound\}$ for some fixed constant $\bound$, and multiplicities assigned to any two tuples are independent of each other.
|
|
|
|
|
Formally, a \abbrCTIDB,
|
|
|
|
|
$\pdb = \inparen{\worlds, \bpd}$ is defined over a set of tuples $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
|
|
|
|
|
Any such world can be encoded as a vector (of length $\numvar=\abs{\tupset}$) from $\worlds$, such that the multiplicity of each $\tup \in \tupset$ is stored at a distinct index.
|
|
|
|
@ -57,13 +57,13 @@ An $\raPlus$ query is a query expressed in positive relational algebra, i.e., us
|
|
|
|
|
%\vspace{-0.53cm}
|
|
|
|
|
\end{figure}
|
|
|
|
|
|
|
|
|
|
As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524}, computing the expected multiplicity of a result tuple in a bag probabilistic database is the analog of computing the marginal probability in a set \abbrPDB.
|
|
|
|
|
As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524,DBLP:journals/sigmod/GuagliardoL17}, computing the expected multiplicity of a result tuple in a bag probabilistic database is the analog of computing the marginal probability in a set \abbrPDB.
|
|
|
|
|
% We will assume that $c =\bigO{1}$, since this is what is typically seen in practice.
|
|
|
|
|
% Allowing for unbounded $c$ is an interesting open problem.
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
|
|
|
|
|
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and its data complexity has, in general, been shown % by Dalvi and Suicu
|
|
|
|
|
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and their data complexity has, in general, been shown % by Dalvi and Suicu
|
|
|
|
|
to be \sharpphard~\cite{10.1145/1265530.1265571}.
|
|
|
|
|
In an independent work, Grohe et al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} studied bag-\abbrTIDB\xplural with unbounded multiplicities, which requires %them to explicitly address the issue of
|
|
|
|
|
a succinct representation of probability distributions over infinitely many multiplicities.
|
|
|
|
@ -83,7 +83,8 @@ question that we explore is %the hardness of
|
|
|
|
|
\cite{https://doi.org/10.48550/arxiv.2201.11524} also observe that computing the expectation of an output tuple multiplicity is in \ptime, they do not investigate the fine-grained complexity of this problem.}
|
|
|
|
|
|
|
|
|
|
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an analogous deterministic query, which we make more precise shortly.
|
|
|
|
|
If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. We expand on the potential practical implications of this problem later in the section but for now we stress that in practice, $\bound$ is indeed constant and most often $\bound=1$; Although higher multiplicities may arise in intermediate results or outputs, input tuples are frequently unique, even in bag-relational data.
|
|
|
|
|
If true, this opens up the way for deployment of \abbrCTIDB\xplural in practice. We expand on the practical implications of this problem later in the section but for now we stress that in practice, $\bound$ is indeed constant and most often $\bound=1$.
|
|
|
|
|
That is, although production database systems use bag semantics for query evaluation, allowing duplicate intermediate or output tuples, input tuples in real-world datasets are still frequently unique.
|
|
|
|
|
To analyze this question we denote by $\timeOf{}^*(\query,\pdb, \bound)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$ and query $\query$.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
@ -112,7 +113,7 @@ Those with `Multiple' in the second column need the algorithm to be able to hand
|
|
|
|
|
\mypar{Our lower bound results}
|
|
|
|
|
%
|
|
|
|
|
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a deterministic database $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
|
|
|
|
|
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query also encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both the deterministic and probabilistic query processing access to $\query'$}.
|
|
|
|
|
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query explicitly encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both deterministic and probabilistic query processing access to $\query'$}.
|
|
|
|
|
|
|
|
|
|
Unfortunately, the answer to the above question is no--
|
|
|
|
|
\Cref{tab:lbs} shows our results.
|
|
|
|
@ -152,7 +153,7 @@ compute $\expct_{\vct{W}\sim \pdassign}\pbox{\poly\inparen{\worldvec}}$).
|
|
|
|
|
%We note that computing \Cref{prob:expect-mult} is equivalent (yields the same result as) to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
|
|
|
|
|
|
|
|
|
|
All of our results rely on working with a {\em reduced} form $\inparen{\rpoly}$ of the lineage polynomial $\poly$. As we show, for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating $\rpoly$ over the probabilities that define the $1$-\abbrTIDB.
|
|
|
|
|
Further, only light extensions are required to support block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10} (bag query semantics with input tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
|
|
|
|
|
Further, only light extensions (see \Cref{def:reduced-poly-one-bidb}) are required to support block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10} (bag query semantics with input tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
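To see the $1$-\abbrTIDB claim in miniature, consider a toy polynomial (a hypothetical example chosen only for illustration, not the lineage of any query in this section): $\poly\inparen{X, Y} = \inparen{X + Y}^2 = X^2 + 2XY + Y^2$, where the two input tuples are present independently with probabilities $\prob_X$ and $\prob_Y$. Since $\randWorld_X, \randWorld_Y \in \inset{0,1}$, we have $\randWorld_X^2 = \randWorld_X$ and $\randWorld_Y^2 = \randWorld_Y$, so by linearity of expectation and independence
\[
\expct\pbox{\poly\inparen{\randWorld_X, \randWorld_Y}} = \expct\pbox{\randWorld_X} + 2\expct\pbox{\randWorld_X}\expct\pbox{\randWorld_Y} + \expct\pbox{\randWorld_Y} = \prob_X + 2\prob_X\prob_Y + \prob_Y.
\]
This is the polynomial $X + 2XY + Y$, obtained from $\poly$ by dropping every exponent greater than $1$, evaluated at $\inparen{\prob_X, \prob_Y}$. In contrast, substituting the probabilities directly into $\poly$ gives $\inparen{\prob_X + \prob_Y}^2$, which differs from the expectation by $\prob_X\inparen{1 - \prob_X} + \prob_Y\inparen{1 - \prob_Y}$ and hence agrees with it only when both probabilities are $0$ or $1$.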
|
|
|
|
|
|
|
|
|
|
Next, we motivate this reduced polynomial $\rpoly$.
|
|
|
|
|
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
|
|
|
|
@ -185,7 +186,7 @@ The simple insight to get around this issue to note that the random variables $\
|
|
|
|
|
|
|
|
|
|
Given that $U$ can only have a multiplicity of $1$ or $2$ but not both, we drop the monomials with the term $U_1U_2$ to get
|
|
|
|
|
$\refpoly{1, }^{\inparen{ABU}^2}\inparen{A, U_1, U_2, B} = A^2U_1^2B^2+2^2\cdot A^2 U_2^2B^2.$
|
|
|
|
|
Now that all the world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_B)\in\inset{0,1}^4$, we have $\expct\pbox{\refpoly{1, }^2}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ \\ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rpoly_1^2\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only did the argument for a single monomial but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to general $\poly$ leads us to consider the following `reduced' version:
|
|
|
|
|
Now that world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_B)\in\inset{0,1}^4$, we have $\expct\pbox{\refpoly{1, }^2}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ \\ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rpoly_1^2\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only did the argument for a single monomial but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to general $\poly$ leads us to consider the following `reduced' version:
|
|
|
|
|
|
|
|
|
|
\begin{Definition}\label{def:reduced-poly}
|
|
|
|
|
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$ define the reformulated polynomial $\refpoly{}\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}
|
|
|
|
@ -231,7 +232,7 @@ We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymb
|
|
|
|
|
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural follow-up question is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
|
|
|
|
|
|
|
|
|
|
\input{two-step-model}
|
|
|
|
|
We adopt a two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
|
|
|
|
|
We adopt the two-step intensional model of query evaluation used in set-\abbrPDB\xplural, as illustrated in \Cref{fig:two-step}:
|
|
|
|
|
(i) \termStepOne (\abbrStepOne): Given input $\tupset$ and $\query$, output every tuple $\tup$ that possibly satisfies $\query$, annotated with its lineage polynomial $\poly(\vct{X})%=\textcolor{red}{CHANGE}\apolyqdt\inparen{\vct{X}}$
|
|
|
|
|
$;
|
|
|
|
|
(ii) \termStepTwo (\abbrStepTwo): Given $\poly(\vct{X})$ for each tuple, compute $\expct_{\randWorld\sim\bpd}\pbox{\poly(\vct{\randWorld})}$.
|
|
|
|
@ -245,8 +246,8 @@ $\circuit : \timeOf{\abbrStepOne}(\query,\tupset, \circuit) + \timeOf{\abbrStepT
|
|
|
|
|
\end{Problem}
|
|
|
|
|
|
|
|
|
|
A key insight of this paper is that the representation of $\circuit$ matters.
|
|
|
|
|
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$,
|
|
|
|
|
and hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ is too large.
|
|
|
|
|
For example, if we insist that $\circuit$ represent the lineage polynomial in \abbrSMB, the answer to the above question in general is no, since then we will need $\abs{\circuit}\ge \Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $|\circuit|$ is the size of circuit $\circuit$.
|
|
|
|
|
Hence, just $\timeOf{\abbrStepOne}(\query,\tupset,\circuit)$ is too large.
|
|
|
|
|
However, systems can directly emit compact, factorized representations of $\poly(\vct{X})$ (e.g., as a consequence of the standard projection push-down optimization~\cite{DBLP:books/daglib/0020812}).
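To illustrate the size gap between the two representations, consider a hypothetical family of polynomials (chosen only for illustration, with $Z_{i,j}$ denoting arbitrary variables): $\prod_{i=1}^{k}\inparen{Z_{i,1} + \cdots + Z_{i,n}}$. In this factorized form the polynomial has $kn$ variable occurrences, whereas expanding it into \abbrSMB yields $n^k$ distinct monomials. For $k = n = 2$:
\[
\inparen{Z_{1,1} + Z_{1,2}}\inparen{Z_{2,1} + Z_{2,2}} = Z_{1,1}Z_{2,1} + Z_{1,1}Z_{2,2} + Z_{1,2}Z_{2,1} + Z_{1,2}Z_{2,2}.
\]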
|
|
|
|
|
Accordingly, this work uses (arithmetic) circuits\footnote{
|
|
|
|
|
An arithmetic circuit is a DAG with variable/numeric source gates and multiplication/addition internal/sink gates.
|
|
|
|
@ -254,14 +255,14 @@ Accordingly, this work uses (arithmetic) circuits\footnote{
|
|
|
|
|
as the representation system of $\poly(\vct{X})$, and we show in \Cref{sec:circuit-depth} an $\bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$ algorithm for constructing the lineage polynomial for all result tuples of an $\raPlus$ query $\query$ (or more precisely, a single circuit $\circuit$ with one sink per tuple representing the tuple's lineage).
|
|
|
|
|
|
|
|
|
|
Given that a representation $\circuit^*$ exists where $\timeOf{\abbrStepOne}(\query,\tupset,\circuit^*)\le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$, we can focus on the complexity of \abbrStepTwo.
|
|
|
|
|
As we also show in \Cref{sec:circuit-runtime}, the size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$), where $|\circuit|$ is the size of circuit $\circuit$.
|
|
|
|
|
As we also show in \Cref{sec:circuit-runtime}, the size is also bounded by $\qruntime{\optquery{\query}, \tupset, \bound}$ (i.e., $|\circuit^*| \le \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$).
|
|
|
|
|
%Thus, the question of approximation can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:
|
|
|
|
|
Given such a $\circuit^*$, to solve \Cref{prob:big-o-joint-steps}, it is \emph{sufficient} to solve: % the following problem:
|
|
|
|
|
\begin{Problem}\label{prob:intro-stmt}
|
|
|
|
|
Given one circuit $\circuit$ that encodes $\Phi\inparen{\vct{X}}$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrCTIDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
|
|
|
|
|
\end{Problem}
|
|
|
|
|
|
|
|
|
|
We will formalize the notions of circuits and hence, \Cref{prob:intro-stmt} in \Cref{sec:expression-trees}. For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then (with an additive adjustment) $\poly\left(\prob_1,\dots, \prob_n\right)$ (recall \Cref{def:reduced-poly}) is a constant factor approximation of $\rpoly$.
|
|
|
|
|
We will formalize the notions of circuits and hence, \Cref{prob:intro-stmt} in \Cref{sec:expression-trees}. For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then (with an additive adjustment) $\poly\left(\prob_1,\dots, \prob_n\right)$ is a constant factor approximation of $\rpoly$ (recall \Cref{def:reduced-poly}).
|
|
|
|
|
This is illustrated in the following example using $\query_1^2$ from earlier. To aid in presentation, we again limit our focus to $\refpoly{1, }^{\inparen{ABU}^2}$, and assume $\bound = 2$ for variable $U$ and $\bound = 1$ for all other variables. Let $\prob_A$ denote $\probOf\pbox{A = 1}$.
|
|
|
|
|
%In computing $\rpoly$, we have some cancellations to deal with:
|
|
|
|
|
Then we have:
|
|
|
|
@ -296,7 +297,7 @@ If we assume that all probability values are in $[p_0,1]$ for some $p_0>0$,
|
|
|
|
|
we get that $\refpoly{1, }^{\inparen{ABU}^2}\inparen{\vct{\prob}} - 4\prob_A^2\prob_{U_1}\prob_{U_2}\prob_B^2$ is in the range $\pbox{p_0^3\cdot\rpoly^{\inparen{ABU}^2}_1\inparen{\vct{\prob}}, \rpoly_1^{\inparen{ABU}^2}\inparen{\vct{\prob}}}$.
|
|
|
|
|
%We can simulate sampling from $\refpoly{1, }^2\inparen{\vct{X}}$ by sampling monomials from $\refpoly{1, }^2$ while ignoring any samples $A^2X_1X_2B^2$.
|
|
|
|
|
Note, however, that this is \emph{not a tight approximation}.
|
|
|
|
|
In~\cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
|
|
|
|
|
In~\Cref{sec:algo} we demonstrate that a $(1\pm\epsilon)$ (multiplicative) approximation with competitive performance is achievable.
|
|
|
|
|
To get a $(1\pm \epsilon)$-multiplicative approximation and solve~\Cref{prob:intro-stmt}, using \circuit we uniformly sample monomials from the equivalent \abbrSMB representation of $\poly$ (without materializing the \abbrSMB representation) and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|