Smoothing some rough edges in the intro

master
Oliver Kennedy 2022-06-03 13:20:56 -04:00
parent 923d98fbd1
commit 09c4cf7e39
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
1 changed file with 23 additions and 14 deletions

@ -7,7 +7,7 @@ Formally, a \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ is defined over a set of tuples $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
Any such world can be encoded as a vector (of length $\numvar=\abs{\tupset}$) from $\worlds$, such that the multiplicity of each $\tup \in \tupset$ is stored at a distinct index.
A given world $\worldvec \in\worlds$ can be interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$.
We note that encoding a possible world as a vector, while non-standard, is equivalent to encoding it as a set of tuples (\Cref{prop:expection-of-polynom} in \Cref{subsec:expectation-of-polynom-proof}).
We note that encoding a possible world as a vector, while non-standard, is equivalent to encoding it as a bag of tuples (\Cref{prop:expection-of-polynom} in \Cref{subsec:expectation-of-polynom-proof}).
Given that tuple multiplicities are independent events, the probability distribution $\bpd$ can be expressed compactly by assigning each tuple a (disjoint) probability distribution over $[0,\bound]$. Let $\prob_{\tup,j}$ denote the probability that tuple $\tup$ is assigned multiplicity $j$. The probability of a world $\worldvec$ is then $\prod_{\tup \in \tupset} \prob_{\tup,\worldvec_{\tup}}$.
%
% Allowing for $\leq \bound$ multiplicities across all tuples gives rise to having $\leq \inparen{\bound+1}^\numvar$ possible worlds instead of the usual $2^\numvar$ possible worlds of a $1$-\abbrTIDB, which (assuming set query semantics), is the same as the traditional set \abbrTIDB.
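To make the encoding concrete, the following sketch (with an invented three-tuple $\bound=2$ \abbrCTIDB; the tuple names and probabilities are purely illustrative) computes the probability of one world as $\prod_{\tup \in \tupset} \prob_{\tup,\worldvec_{\tup}}$ and checks that the $\inparen{\bound+1}^\numvar$ world probabilities sum to one.
\begin{lstlisting}[language=Python]
from itertools import product

# A hypothetical 2-TIDB over three tuples: for each tuple we list the
# probabilities of multiplicity 0, 1, and 2 (each row sums to 1).
mult_dist = {
    "a": [0.5, 0.3, 0.2],
    "b": [0.1, 0.9, 0.0],
    "c": [0.4, 0.4, 0.2],
}

def world_probability(world, dist):
    # P[W] = prod over tuples t of p_{t, W_t}, by tuple independence.
    p = 1.0
    for t, mult in world.items():
        p *= dist[t][mult]
    return p

# The world where a has multiplicity 2, b has multiplicity 1, and c is absent.
print(world_probability({"a": 2, "b": 1, "c": 0}, mult_dist))  # 0.2*0.9*0.4 = 0.072

# Sanity check: the (c+1)^n = 27 world probabilities sum to 1.
tuples = list(mult_dist)
total = sum(world_probability(dict(zip(tuples, m)), mult_dist)
            for m in product(range(3), repeat=len(tuples)))
assert abs(total - 1.0) < 1e-9
\end{lstlisting}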
@ -57,7 +57,7 @@ An $\raPlus$ query is a query expressed in positive relational algebra, i.e., us
%\vspace{-0.53cm}
\end{figure}
As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524}, computing the expected multiplicity of a result tuple in a bag probabilistic database is the analog of the marginal probability in a set \abbrPDB.
As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524}, computing the expected multiplicity of a result tuple in a bag probabilistic database is the analog of computing the marginal probability in a set \abbrPDB.
% We will assume that $c =\bigO{1}$, since this is what is typically seen in practice.
% Allowing for unbounded $c$ is an interesting open problem.
@ -65,9 +65,11 @@ As also observed in \cite{https://doi.org/10.48550/arxiv.2201.11524}, computing
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
Set query evaluation semantics over $1$-\abbrTIDB\xplural have been studied extensively, and their data complexity has, in general, been shown % by Dalvi and Suciu
to be \sharpphard~\cite{10.1145/1265530.1265571}.
In an independent work Grohe et. al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} studied bag-\abbrTIDB\xplural allowing for unbounded multiplicities, which requires them to explicitly address the issue of a succinct representation of probability distributions over infinitely many multiplicities.
Their work demonstrated the existence of a dichotomy for
the problem of computing the probability that an output tuple has a multiplicity of at most $s$.
In an independent work, Grohe et al.~\cite{https://doi.org/10.48550/arxiv.2201.11524} studied bag-\abbrTIDB\xplural with unbounded multiplicities, which requires %them to explicitly address the issue of
a succinct representation of probability distributions over infinitely many multiplicities.
They demonstrate a dichotomy for
%the problem of
computing the probability that an output tuple's multiplicity is bounded by $s$.
% investigates the query evaluation problem over bag-\abbrTIDB\xplural when computing the probability of an output tuple having at most a multiplicity of $k$, showing that a dichotomy exists for this problem.
% While the authors observe that computing the expectation of an output tuple multiplicity is in polynomial time, no further (fine-grained) analysis of the expected value is considered.
% Our work in contrast assumes a finite bound on the multiplicities where we simply list the finitely many probability values (and hence do not need consider a more succinct representation). Further, our work primarily looks into the fine-grained analysis of computing the expected multiplicity of an output tuple.
@ -81,7 +83,8 @@ question that we explore is %the hardness of
\cite{https://doi.org/10.48550/arxiv.2201.11524} also observe that computing the expectation of an output tuple multiplicity is in \ptime, they do not investigate the fine-grained complexity of this problem.}
Specifically, in this work we ask if~\Cref{prob:expect-mult} can be solved in time linear in the runtime of an analogous deterministic query, which we make more precise shortly.
If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. We expand on the potential practical implications of this problem later in the section but for now we stress that in practice, $\bound$ is indeed constant and most often $\bound=1$. To analyze this question we denote by $\timeOf{}^*(\query,\pdb, \bound)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$ and query $\query$.
If this is true, then this would open up the way for deployment of \abbrCTIDB\xplural in practice. We expand on the potential practical implications of this problem later in the section, but for now we stress that in practice, $\bound$ is indeed constant and most often $\bound=1$; although higher multiplicities may arise in intermediate results or outputs, input tuples are frequently unique, even in bag-relational data.
To analyze this question we denote by $\timeOf{}^*(\query,\pdb, \bound)$ the optimal runtime complexity of computing~\Cref{prob:expect-mult} over \abbrCTIDB $\pdb$ and query $\query$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -109,13 +112,16 @@ Those with `Multiple' in the second column need the algorithm to be able to hand
\mypar{Our lower bound results}
%
Let $\qruntime{\query,\gentupset,\bound}$ (see~\Cref{sec:gen} for further details) denote the runtime for query $\query$ over a deterministic database $\gentupset$ where the maximum multiplicity of any tuple is less than or equal to $\bound$. % This paper considers $\raPlus$ queries, for which order of operations is \emph{explicit}, as opposed to other query languages, e.g. Datalog, UCQ. Thus, since order of operations affects runtime, we denote the optimized $\raPlus$ query picked by an arbitrary production system as $\optquery{\query} \approx \min_{\query'\in\raPlus, \query'\equiv\query}\qruntime{\query', \gentupset, \bound}$. Then $\qruntime{\optquery{\query}, \gentupset,\bound}$ is the runtime for the optimized query.\footnote{The upper bounds on runtime that we derive apply pointwise to any $\query \in\raPlus$, allowing us to abstract away the specific heuristics for choosing an optimized query (i.e., Any deterministic query optimization heuristic is equally useful for \abbrCTIDB queries).}\BG{Rewrite: since an optimized Q is also a Q this also applies in the case where there is a query optimizer the rewrites Q}
Our question is whether or not it is always true that for every $\query$ and $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query also encodes the `query plan') since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both the deterministic and probabilistic query processing access to $\query'$}.
Our question is whether or not it is always true that for every $\query$, $\timeOf{}^*\inparen{\query, \pdb, \bound}\leq \bigO{\qruntime{\optquery{\query}, \tupset, \bound}}$. We remark that the issue of query optimization is orthogonal to this question (recall that an $\raPlus$ query also encodes order of operations) since we want to answer the above question for all $\query$. \emph{Specifically, if there is an equivalent query $\query'$ that is more efficient to evaluate, we allow both the deterministic and probabilistic query processing access to $\query'$}.
Unfortunately, the answer to the above question is no;
\Cref{tab:lbs} shows our results.
Specifically, depending on what hardness result/conjecture we assume, we get various weaker or stronger versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in \Cref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(\query,\pdb, \bound) \le \bigO{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^k}$, where $k$ is the join width of $\query$ (our notion of join width
%follows from~\Cref{def:degree-of-poly}
is essentially the degree of the corresponding polynomial defined in~\Cref{fig:nxDBSemantics}.) of the query $\query$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
is essentially the degree of the corresponding polynomial)
%% OK: Fig 1 hasn't been introduced yet
% defined in~\Cref{fig:nxDBSemantics}.)
over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
%
What our lower bound in the third row says is that one cannot get more than a polynomial improvement (for fixed $k$) over essentially the trivial algorithm for~\Cref{prob:expect-mult}, assuming the Exponential Time Hypothesis (ETH)~\cite{eth}.
%However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions).
@ -145,7 +151,8 @@ compute $\expct_{\vct{W}\sim \pdassign}\pbox{\poly\inparen{\worldvec}}$).
\end{Problem}
%We note that computing \Cref{prob:expect-mult} is equivalent (yields the same result as) to computing \Cref{prob:bag-pdb-poly-expected} (see \Cref{prop:expection-of-polynom}).
All of our results rely on working with a {\em reduced} form $\inparen{\rpoly}$ of the lineage polynomial $\poly$. In fact, it turns out that for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the $1$-\abbrTIDB. This is also true when the query input(s) is a block independent disjoint probabilistic database~\cite{DBLP:conf/icde/OlteanuHK10} (bag query semantics with tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
All of our results rely on working with a {\em reduced} form $\inparen{\rpoly}$ of the lineage polynomial $\poly$. As we show, for the $1$-\abbrTIDB case, computing the expected multiplicity (over bag query semantics) is {\em exactly} the same as evaluating $\rpoly$ over the probabilities that define the $1$-\abbrTIDB.
Further, only light extensions are required to support block independent disjoint probabilistic databases~\cite{DBLP:conf/icde/OlteanuHK10} (bag query semantics with input tuple multiplicity at most $1$). %, for which the proof of~\Cref{lem:tidb-reduce-poly} (introduced shortly) holds .
Next, we motivate this reduced polynomial $\rpoly$.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
@ -156,11 +163,11 @@ WHERE $t_1$.Point = r.Point$_1$ AND $t_2$.Point = r.Point$_2$
\end{lstlisting}
It can be verified that $\poly\inparen{A, B, C, E, U, Y, Z}$ for the sole result tuple of $\query_1$ is $AUB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $\query_1^2$ is given by $\poly_1^2\inparen{A, B, C, E, U, Y, Z}$
The lineage polynomial for $\query_1^2$ is $\poly_1^2\inparen{A, B, C, E, U, Y, Z}$
$$
=A^2U^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AUB^2YE + 2AUB^2ZC + 2B^2YEZC.
$$
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABU}^2} = A^2U^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_U$ be the random variable corresponding to a lineage variable $U$. Because the distinct variables in the product are independent, we can push expectation through them yielding $\expct\pbox{\randWorld_A^2\randWorld_U^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can further simplify to $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $W_U\in\inset{0, 1}$, then we further would have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_X\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rpoly_1^{\inparen{ABX}^2}\inparen{\prob_A, \prob_U, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_U^2}$, since $\randWorld_U\in\inset{0, 1, 2}$ and for $\randWorld_U \gets 2$, $\randWorld_U^2 \neq \randWorld_U$.
To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push the expectation through each summand. To keep things simple, let us focus on the monomial $\poly_1^{\inparen{ABU}^2} = A^2U^2B^2$ as the procedure is the same for all other monomials of $\poly_1^2$. Let $\randWorld_U$ be the random variable corresponding to the lineage variable $U$. Because the distinct variables in the product are independent, we can push expectation through them, yielding $\expct\pbox{\randWorld_A^2\randWorld_U^2\randWorld_B^2}=\expct\pbox{\randWorld_A^2}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B^2}$. Since $\randWorld_A, \randWorld_B\in \inset{0, 1}$ we can simplify to $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U^2}\expct\pbox{\randWorld_B}$ by the fact that for any $W\in \inset{0, 1}$, $W^2 = W$. Observe that if $\randWorld_U\in\inset{0, 1}$, then we would further have $\expct\pbox{\randWorld_A}\expct\pbox{\randWorld_U}\expct\pbox{\randWorld_B} = \prob_A\cdot\prob_U\cdot\prob_B$ (denoting $\probOf\pbox{\randWorld_A = 1} = \prob_A$) $= \rpoly_1^{\inparen{ABU}^2}\inparen{\prob_A, \prob_U, \prob_B}$ (see $ii)$ of~\Cref{def:reduced-poly}). However, in this example, we get stuck with $\expct\pbox{\randWorld_U^2}$, since $\randWorld_U\in\inset{0, 1, 2}$ and for $\randWorld_U \gets 2$, $\randWorld_U^2 \neq \randWorld_U$.
The simple insight to get around this issue is to note that the random variables $\randWorld_U$ and $\randWorld_{U_1}+2\randWorld_{U_2}$ have exactly the same distribution, where $\randWorld_{U_1},\randWorld_{U_2}\in\inset{0,1}$ are never simultaneously $1$ and $\probOf\pbox{\randWorld_{U_j} = 1} = \probOf\pbox{\randWorld_{U} = j}$. Thus, the idea is to replace the variable $U$ by $U_1+2U_2$ (where $U_j$ corresponds to the event that $U$ has multiplicity $j$) to obtain the following polynomial:
%
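As a quick sanity check of this substitution (using an invented distribution for $U$), the following sketch verifies both points: once multiplicity $2$ is possible, $\expct\pbox{\randWorld_U^2}\neq\expct\pbox{\randWorld_U}$, while the disjoint indicators $U_1, U_2$ reproduce the distribution of $U$ via $U_1 + 2U_2$.
\begin{lstlisting}[language=Python]
# Hypothetical multiplicity distribution for U: P[U=0], P[U=1], P[U=2].
p_u = [0.2, 0.5, 0.3]

# Once multiplicity 2 is possible, E[W_U^2] no longer equals E[W_U] ...
e_u  = sum(m * p     for m, p in enumerate(p_u))   # 1*0.5 + 2*0.3 = 1.1
e_u2 = sum(m * m * p for m, p in enumerate(p_u))   # 1*0.5 + 4*0.3 = 1.7
assert e_u != e_u2   # so the W^2 = W trick for {0,1}-valued variables fails

# ... but the {0,1}-valued, mutually exclusive indicators U_1, U_2 with
# P[U_j = 1] = P[U = j] reproduce U's distribution via U_1 + 2*U_2.
outcomes = {(0, 0): p_u[0], (1, 0): p_u[1], (0, 1): p_u[2]}  # (U_1, U_2) -> prob
reencoded = {}
for (u1, u2), p in outcomes.items():
    reencoded[u1 + 2 * u2] = reencoded.get(u1 + 2 * u2, 0.0) + p
assert reencoded == {0: 0.2, 1: 0.5, 2: 0.3}
\end{lstlisting}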
@ -178,7 +185,7 @@ The simple insight to get around this issue to note that the random variables $\
Given that $U$ cannot have multiplicity $1$ and $2$ simultaneously, we drop the monomials with the term $U_1U_2$ to get
$\refpoly{1, }^{\inparen{ABU}^2}\inparen{A, U_1, U_2, B} = A^2U_1^2B^2+2^2\cdot A^2 U_2^2B^2.$
Now that all the world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_A)\in\inset{0,1}^4$, we have $\expct\pbox{\refpoly{1, }^2}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ \\ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rpoly_1^2\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only did the argument for a single monomial but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to general $\poly$ leads to consider its follownig `reduced' version:
Now that all the world vectors $(\randWorld_A,\randWorld_{U_1},\randWorld_{U_2},\randWorld_B)$ lie in $\inset{0,1}^4$, we have $\expct\pbox{\refpoly{1, }^{\inparen{ABU}^2}}=\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_1}}\expct\pbox{\randWorld_{B}}+$ \\ $4\expct\pbox{\randWorld_{A}}\expct\pbox{\randWorld_{U_2}}\expct\pbox{\randWorld_{B}}\stackrel{\text{def}}{=}\rpoly_1^{\inparen{ABU}^2}\inparen{p_A,\probOf\inparen{U=1},\probOf\inparen{U=2},p_B}$. We only did the argument for a single monomial, but by linearity of expectation we can apply the same argument to all monomials in $\poly_1^2$. Generalizing this argument to a general $\poly$ leads us to consider the following `reduced' version:
\begin{Definition}\label{def:reduced-poly}
For any polynomial $\poly\inparen{\inparen{X_\tup}_{\tup\in\tupset}}$ define the reformulated polynomial $\refpoly{}\inparen{\inparen{X_{\tup, j}}_{\tup\in\tupset, j\in\pbox{\bound}}}
@ -197,7 +204,9 @@ removing all monomials containing the term $X_{\tup, j}X_{\tup, j'}$ for $\tup\i
%&= ABX_1 + AB\inparen{2}^2X_2+ BYE + BZC + 2AX_1BYE+ 2A\inparen{2}^2X_2BYE\\
%&\qquad + 2AX_1BZC + 2A\inparen{2}^2X_2BZC + 2BYEZC.
%\end{align*}
As we have essentially argued earlier, for our specific example the expectation that we want is $\rpoly_1^2(\probOf\inparen{A=1},$\allowbreak$\probOf\inparen{B=1}, \probOf\inparen{C=1}$,\allowbreak $\probOf\inparen{E=1},$\allowbreak $\probOf\inparen{U_1=1}, \probOf\inparen{U_2=1}, \probOf\inparen{Y=1}, \probOf\inparen{Z=1})$.
As we have essentially argued earlier, for our specific example the expectation that we want is $\rpoly_1^2(\probOf\inparen{A=1}, \ldots,
%$\allowbreak$\probOf\inparen{B=1}, \probOf\inparen{C=1}$,\allowbreak $\probOf\inparen{E=1},$\allowbreak $\probOf\inparen{U_1=1}, \probOf\inparen{U_2=1}, \probOf\inparen{Y=1},
\probOf\inparen{Z=1})$.
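For intuition, this identity is easy to verify by brute force on our running example; the sketch below (restricted to the $\inparen{ABU}^2$ monomial, with probabilities invented purely for illustration) enumerates all worlds over $\inset{A, U, B}$ and checks that the expected value of the squared monomial equals the reduced form $A^2U_1^2B^2 + 4A^2U_2^2B^2$ evaluated at the marginal probabilities.
\begin{lstlisting}[language=Python]
from itertools import product

# Invented marginals (illustration only): P[A=1], P[B=1], and P[U=1], P[U=2].
p_a, p_b = 0.7, 0.6
p_u1, p_u2 = 0.5, 0.3
p_u = [1 - p_u1 - p_u2, p_u1, p_u2]          # P[U=0], P[U=1], P[U=2]

# Brute force: E[(A*U*B)^2] by enumerating all worlds over {A, U, B}.
expectation = 0.0
for a, u, b in product([0, 1], range(3), [0, 1]):
    pr = ((p_a if a else 1 - p_a)
          * p_u[u]
          * (p_b if b else 1 - p_b))
    expectation += pr * (a * u * b) ** 2

# Reduced polynomial: A^2 U_1^2 B^2 + 4 A^2 U_2^2 B^2, with each variable
# replaced by its marginal probability.
reduced = p_a * p_u1 * p_b + 4 * p_a * p_u2 * p_b

assert abs(expectation - reduced) < 1e-9     # both equal 0.714 here
\end{lstlisting}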
\Cref{lem:tidb-reduce-poly} generalizes the equivalence to {\em all} $\raPlus$ queries on \abbrCTIDB\xplural (proof in \Cref{subsec:proof-exp-poly-rpoly}):
\begin{Lemma}\label{lem:tidb-reduce-poly}
For any \abbrCTIDB $\pdb$, $\raPlus$ query $\query$, and lineage polynomial
@ -252,7 +261,7 @@ Given such a $\circuit^*$, to solve \Cref{prob:big-o-joint-steps}, it is \emph{s
Given one circuit $\circuit$ that encodes $\Phi\inparen{\vct{X}}$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrCTIDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
\end{Problem}
We will formalize the notions of circuits and hence, \Cref{prob:intro-stmt} in \cref{sec:expression-trees}. For an upper bound on approximating the expected count, it is easy to check that if all the probabilties are constant then (with an additive adjustment) $\rpoly\left(\prob_1,\dots, \prob_n\right)$ (recall \cref{def:reduced-poly}) is a constant factor approximation.
We will formalize the notions of circuits, and hence \Cref{prob:intro-stmt}, in \Cref{sec:expression-trees}. For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then (with an additive adjustment) $\poly\left(\prob_1,\dots, \prob_n\right)$ (recall \Cref{def:reduced-poly}) is a constant factor approximation of $\rpoly$.
This is illustrated in the following example using $\query_1^2$ from earlier. To aid in presentation we again limit our focus to $\refpoly{1, }^{\inparen{ABU}^2}$, and assume $\bound = 2$ for variable $U$ and $\bound = 1$ for all other variables. Let $\prob_A$ denote $\probOf\pbox{A = 1}$.
%In computing $\rpoly$, we have some cancellations to deal with:
Then we have: