Read through on S1 with changes.

master
Aaron Huber 2022-03-02 11:48:42 -05:00
parent 7199888c25
commit 68e05bcbf3
4 changed files with 28 additions and 21 deletions

@ -9,7 +9,7 @@ We are specifically
Unfortunately, % we show the reverse;
our results imply that computing expected multiplicities for \abbrCTIDB\xplural based on the results\BG{That is confusing, it may not be clear to most readers why the result of deterministic QP are useful here} produced by such query evaluation algorithms introduces super-linear overhead\BG{Over what?} (under parameterized complexity hardness assumptions/conjectures).
We proceed to study approximation of expected result tuple multiplicities for positive relational algebra queries ($\raPlus$) over \abbrCTIDB\xplural and for a non-trivial subclass of block-independent databases (\abbrBIDB\xplural).
We develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of a comparable\BG{that just sounds hand-wavy, can we say something more concrete than comparable?} deterministic query for any $\raPlus$ query.
We develop a sampling algorithm that computes a $(1 \pm \epsilon)$-approximation of the expected multiplicity of an output tuple in time linear in the runtime of the corresponding\BG{that just sounds hand-wavy, can we say something more concrete than comparable?} deterministic query for any $\raPlus$ query.
% By removing Bag-PDB's reliance on the sum-of-products representation of polynomials, this result paves the way for future work on PDBs that are competitive with deterministic databases.
\end{abstract}

@ -5,7 +5,7 @@
This work explores the problem of computing the expectation of the multiplicity of a tuple in the result of a query over a \abbrCTIDB, a type of probabilistic database with bag semantics where the multiplicity of a tuple is a random variable with range $[0,\bound]$ for some fixed constant $\bound$ and multiplicities assigned to any two tuples are independent of each other.
% in bag \abbrTIDB\xplural, which we term as \abbrCTIDB\xplural.
Formally, a \abbrCTIDB,
$\pdb = \inparen{\worlds, \bpd}$ consists of a set of tuples $\tupset$ and a probability distribution over $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
$\pdb = \inparen{\worlds, \bpd}$ consists of a set of tuples $\tupset$ and a probability distribution $\bpd$ over all possible worlds generated by assigning each tuple $\tup \in \tupset$ a multiplicity in the range $[0,\bound]$.
Any such world can be encoded as a vector from $\worlds$, the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity.
\BGdel{encodes a bag of uncertain tuples such that each possible tuple encoded in $\pdb$ has a multiplicity of at most $\bound$. $\tupset$ is the set of tuples appearing across all possible worlds, and the set of all worlds is encoded in $\worlds$, which is the set of all vectors of length $\numvar=\abs{\tupset}$ such that each index corresponds to a distinct $\tup \in \tupset$ storing its multiplicity and $\bpd$ is the probability distribution over $\worlds$.}{tried to rephrase}
A given world $\worldvec \in\worlds$ can be interpreted as follows: for each $\tup \in \tupset$, $\worldvec_{\tup}$ is the multiplicity of $\tup$ in $\worldvec$. Given that the multiplicities of tuples are independent events, the probability distribution $\bpd$ can be expressed compactly by assigning each tuple a probability distribution over $[0,\bound]$. Let $\prob_{\tup,j}$ denote the probability that tuple $\tup$ is assigned multiplicity $j$. The probability of a particular world $\worldvec$ is then $\prod_{\tup \in \tupset} \prob_{\tup,\worldvec_{\tup}}$.
@ -18,11 +18,10 @@ In this work, since we are generally considering bag query input, we will only b
We can formally state our problem of computing the expected multiplicity of a result tuple as:
\begin{Problem}\label{prob:expect-mult}
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query $\query$
\footnote{
Given a \abbrCTIDB $\pdb = \inparen{\worlds, \bpd}$, $\raPlus$ query\footnote{
A query $\query$ is an $\raPlus$ query if it is composed entirely of one or more of the positive relational operators $\inset{\select, \project, \join, \union}$.
}
, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
$\query$, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$.
\end{Problem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -104,7 +103,7 @@ A query $\query$ is an $\raPlus$ query if it is composed entirely of one or more
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
It is natural to explore computing the expected multiplicity of a result tuple since this is the analog of computing the marginal probability of a tuple in a set \abbrPDB.
In this work we will assume that $c =\bigO{1}$ since this is what typically seen in practice.
In this work we will assume that $c =\bigO{1}$ since this is what is typically seen in practice.
Allowing for unbounded $c$ is an interesting open problem.
\mypar{Hardness of Set Query Semantics and Bag Query Semantics}
@ -127,7 +126,7 @@ $\omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{C_0}}$
$\Omega\inparen{\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}^{c_0\cdot k}}$ for {\em some} $c_0>0$ & Multiple & \Cref{conj:known-algo-kmatch}\\ %Multiple & Current $k$-matching algorithms\\
\hline
\end{tabular}
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$.For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e. probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
\caption{Our lower bounds for a specific hard query $\query$ parameterized by $k$. For $\pdb = \inset{\worlds, \bpd}$ those with `Multiple' in the second column need the algorithm to be able to handle multiple $\bpd$, i.e. probability distributions (for a given $\tupset$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
\end{table}
\mypar{Our lower bound results}
@ -138,11 +137,12 @@ Specifically, depending on what hardness result/conjecture we assume, we get var
What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for~\Cref{prob:expect-mult}.
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results\footnote{
Consider the known results for the problem of counting $k$-cliques~\cite{10.5555/645413.652181},~\cite{CHEN20061346}, where for a query $\query$ over database $\tupset$ that counts the number of $k$-cliques, the results show a runtime of $\Omega_k\inparen{\numvar}$, implying our lower bounds would hold.
Consider the known results for the problem of counting $k$-cliques~\cite{10.5555/645413.652181},~\cite{CHEN20061346}, where for a query $\query$ over database $\tupset$ that counts the number of $k$-cliques, the results show a deterministic runtime of $\Omega_k\inparen{\numvar}$, implying our lower bounds would hold.
}
\AH{I am \emph{unsure} of this footnote. @atri may need to word smith this one. I don't feel like I entirely understand the purpose of this footnote. E.g., we could have a query that runs deterministically in $\Omega_k\inparen{n}$ worst case time; but this doesn't mean that $T^*\inparen{\query, \pdb}$ doesn't have a worst case lower bound of $\Omega\inparen{\numvar}^{c_0}$, correct? We would replace $T_{det}\inparen{\query, \tupset, \bound}$ with $\Omega_k\inparen{\numvar}$, no? If we replace $T_{det}\inparen{\query, \tupset, \bound}$ with $\numvar$, then this doesn't accurately reflect the worst case lower bound for counting $k$-cliques in the first place.}
already imply the claimed lower bounds if we were to replace the $\qruntime{\optquery{\query}, \tupset, \bound}$ by just $\numvar$ (indeed these results follow from known lower bounds for deterministic query processing). Our contribution is to then identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce an $(1\pm \epsilon)$-approximation algorithm that computes ~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query %$\timeOf{Approx}^*\inparen{\query, \pdb}\leq\qruntim{\optquery{\query},\tupset,\bound}$ (where $\timeOf{Approx}^*\inparen{\cdot}$ denotes runtime of approximation algorithm),
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that computes ~\Cref{prob:expect-mult} in time $O_\epsilon\inparen{\qruntime{\optquery{\query}, \tupset, \bound}}$. This means, when we are okay with approximation, that we solve~\Cref{prob:expect-mult} in time linear in the size of the deterministic query %$\timeOf{Approx}^*\inparen{\query, \pdb}\leq\qruntim{\optquery{\query},\tupset,\bound}$ (where $\timeOf{Approx}^*\inparen{\cdot}$ denotes runtime of approximation algorithm),
and bag \abbrPDB\xplural are deployable in practice.
% In particular, we show the following upper bound results.
%(i) We show that e.g. for a circuit representation of the lineage polynomial (more on this later), when the circuit is a tree and there is a single
@ -157,7 +157,7 @@ Further, our approximation algorithm works for a more general notion of bag \abb
\subsection{Polynomial Equivalence}\label{sec:intro-poly-equiv}
A common encoding of probabilistic databases (e.g., in \cite{IL84a,Imielinski1989IncompleteII,Antova_fastand,DBLP:conf/vldb/AgrawalBSHNSW06} and many others) relies on annotating tuples with lineages or propositional formulas that describe the set of possible worlds that the tuple appears in. The bag semantics analog is a provenance/lineage polynomial (see~\Cref{fig:nxDBSemantics}) $\apolyqdt$~\cite{DBLP:conf/pods/GreenKT07}, a polynomial with non-zero integer coefficients and exponents, over integer variables $\vct{X}$ encoding input tuple multiplicities.
\AH{This seems confusing since I thought the goal was to have $\vct{X}$ be abstract/typeless.}
%Intuitively, a \abbrCTIDB lends itself to a useful reduction to a specific type of block independent database (\abbrBIDB) which we refer to as a $1$-\abbrBIDB. A $1$-\abbrBIDB is a \abbrBIDB in the traditional sense of allowing no duplicate tuples, \emph{but} where we use bag query semantics instead of the usual set query semantics.
%(see~\Cref{fig:nxDBSemantics} for a definition)
\begin{figure}[b!]
@ -205,11 +205,11 @@ Next, we motivate this reduced polynomial.
Consider the query $\query_1$ defined as follows over the bag relations of \Cref{fig:two-step}:
\begin{lstlisting}
SELECT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.city = r.city1 AND $t_2$.city = r.city2
SELECT DISTINCT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.Point = r.Point$_1$ AND $t_2$.Point = r.Point$_2$
\end{lstlisting}
It can be verified that $\poly\inparen{A, B, C, E, X, Y, Z}$ for the sole result tuple (i.e. the count) of $\query$ is $AXB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
It can be verified that $\poly\inparen{A, B, C, E, X, Y, Z}$ for the sole result tuple of $\query_1$ is $AXB + BYE + BZC$. Now consider the product query $\query_1^2 = \query_1 \times \query_1$.
The lineage polynomial for $\query_1^2$ is given by $\poly_1^2\inparen{A, B, C, E, X, Y, Z}$
$$
=A^2X^2B^2 + B^2Y^2E^2 + B^2Z^2C^2 + 2AXB^2YE + 2AXB^2ZC + 2B^2YEZC.
@ -218,7 +218,8 @@ To compute $\expct\pbox{\poly_1^2}$ we can use linearity of expectation and push
%the expectation is $\expct\pbox{A^2X^2B^2} = A\cdot\prob_A\cdot\inparen{\sum\limits_{i \in [2]}X_i\cdot \prob_{X, i}}\cdot B\prob_B$ for $X \in \inset{0, 1, 2}$.
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}\inparen{\vct{X_R}}}$ that is of use to us, where $\abs{\vct{X_R}} = \bound\cdot\abs{\vct{X}}$ . Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}jX_{\tup, j}$ where each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$, we set $X_{\tup, j} = 1$ for $\worldvec_\tup = j$, while $X_{\tup, j'} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any $X_\tup\in\vars{\poly}$ we have the equality $X_\tup = j = \sum_{j\in\pbox{\bound}}jX_j$.
Denote the variables of $\poly$ to be $\vars{\poly}.$ In the \abbrCTIDB setting, $\poly\inparen{\vct{X}}$ has an equivalent reformulation $\inparen{\refpoly{}\inparen{\vct{X_R}}}$ that is of use to us, where $\abs{\vct{X_R}} = \bound\cdot\abs{\vct{X}}$. Given $X_\tup \in\vars{\poly}$, by definition $X_\tup \in\inset{0,\ldots, c}$. We can replace $X_\tup$ by $\sum_{j\in\pbox{\bound}}jX_{\tup, j}$ where the variables $\inparen{X_{\tup, j}}_{j\in\pbox{\bound}}$ are disjoint and each $X_{\tup, j}\in\inset{0, 1}$. Then for any $\worldvec\in\worlds$ and corresponding reformulated world $\worldvec_{\vct{R}}\in\inset{0, 1}^{\abs{\tupset}\cdot\bound}$, we set $\worldvec_{\vct{R}_{\tup, j}} = 1$ for $\worldvec_\tup = j$, while $\worldvec_{\vct{R}_{\tup, j'}} = 0$ for all $j'\neq j\in\pbox{\bound}$. By construction then $\poly\inparen{\vct{X}}\equiv\refpoly{}\inparen{\vct{X_R}}$ $\inparen{\vct{X_R} = \vars{\refpoly{}}}$ since for any valuation $X_\tup\in\pbox{\bound}$ we have the equality $X_\tup = j = \sum_{j'\in\pbox{\bound}}j'X_{\tup, j'}$.
\AH{I don't know the rules here, but since we have already (informally) defined $\vct{X}$ to be variables of type integer encoding multiplicities (see todo note above) and thus worlds, it seems that it is fine and natural to refer to valuations of the variables themselves, without having to use $\worldvec$ necessarily. The point I am trying to get across in the last sentence is, given these semantics and domains, we have an equivalent polynomial. Or is it wrong to use $\vct{X}$ and we should rather say, ``for any $\worldvec\in\worlds$, $\worldvec_{\vct{R}}\in\inset{0, 1}^{D\bound}$ we have that $\worldvec_\tup = j = \sum_{j\in\pbox{\bound}}j\cdot\worldvec_{\vct{R}_{\tup, j}}$?}
Considering again our example,
\begin{multline*}
@ -293,8 +294,11 @@ $, where $\probAllTup = \inparen{\inparen{\prob_{\tup, j}}_{\tup\in\tupset, j\in
\subsection{Our Techniques}
\mypar{Lower Bound Proof Techniques}
\AH{Regarding what follows (in the next paragraph): I think this \emph{may} be misleading (also, technically incorrect since $\poly$ is used instead of $\rpoly$) since it is the \emph{lead} coefficient $c_{2k}$ of the term in $\rpoly\inparen{\vct{X}}$ with $2k$ distinct variables. However, technically, since we have that $\rpoly\inparen{\vct{\prob}}$ is a univariate polynomial, then, indeed this IS an accurate statement, since the term with $2k$ distinct variables in $\rpoly\inparen{\vct{\prob}}$ is the term with the highest degree (this assumes for $d$ distinct edges that $d \geq k$ for our special graph query; otherwise, there is no $k$-matching, and the leading coefficient is not $c_{2k}$). Perhaps we should note this. However, the context is in light of considering the \emph{univariate} polynomial $\rpoly\inparen{\vct{\prob}}$. Perhaps change $\poly$ to $\rpoly\inparen{\prob,\ldots,\prob}$.}
Our main hardness result shows that computing~\Cref{prob:expect-mult} is $\sharpwonehard$ for $1$-\abbrTIDB. To prove this result we show that for the same $\query_1$ from the example above, for an arbitrary `product width' $k$, the query $Q^k$ is able to encode various hard graph-counting problems (assuming $\bigO{\numvar}$ tuples rather than the $\bigO{1}$ tuples in \Cref{fig:two-step}).
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$, a known hard problem in parameterized/fine-grained complexity literature.
We do so by considering an arbitrary graph $G$ (analogous to relation $\boldsymbol{R}$ of $\query$) and analyzing how the coefficients in the (univariate) polynomial $\widetilde{\poly}\left(p,\dots,p\right)$ relate to counts of subgraphs in $G$ that are isomorphic to various graphs with $k$ edges. E.g., we exploit the fact that the leading coefficient in $\poly$ corresponding to $\query^k$ is proportional to the number of $k$-matchings in $G$,
a known hard problem in parameterized/fine-grained complexity literature.
\mypar{Upper Bound Techniques}
Our negative results (\Cref{tab:lbs}) indicate that \abbrCTIDB{}s (even for $\bound=1$) cannot achieve comparable performance to deterministic databases for exact results (under complexity assumptions). In fact, under plausible hardness conjectures, one cannot (drastically) improve upon the trivial algorithm to exactly compute the expected multiplicities for $1$-\abbrTIDB\xplural. A natural follow-up question is whether we can do better if we are willing to settle for an approximation to the expected multiplicities.
@ -330,10 +334,13 @@ As we also show in \Cref{sec:circuit-runtime}, this size is also bounded by $\qr
Thus, the question of approximation %\Cref{prob:big-o-joint-steps}
can be stated as the following stronger (since~\Cref{prob:big-o-joint-steps} has access to \emph{all} equivalent \circuit representing $\query\inparen{\vct{W}}\inparen{\tup}$), but sufficient condition:
\begin{Problem}\label{prob:intro-stmt}
Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrBPDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
Given one circuit $\circuit$ that encodes $\apolyqdt$ for all result tuples $\tup$ (one sink per $\tup$) for \abbrCTIDB $\pdb$ and $\raPlus$ query $\query$, does there exist an algorithm that computes a $(1\pm\epsilon)$-approximation of $\expct_{\rvworld\sim\bpd}\pbox{\query\inparen{\rvworld}\inparen{\tup}}$ (for all result tuples $\tup$) in $\bigO{|\circuit|}$ time?
\end{Problem}
For an upper bound on approximating the expected count, it is easy to check that if all the probabilties are constant then $\poly\left(\prob_1,\dots, \prob_n\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query^2$ from above, using $\prob_A$ to denote $\probOf\pbox{A = 1}$ (and similarly for the other variables), we can see that
For an upper bound on approximating the expected count, it is easy to check that if all the probabilities are constant then $\poly\left(\prob_1,\dots, \prob_n\right)$ (i.e. evaluating the original lineage polynomial over the probability values) is a constant factor approximation. For example, using $\query_1^2$ from above (with $\bound = 1$) and $\prob_A$ to denote $\probOf\pbox{A = 1}$, we can see that
\AH{I changed~\Cref{prob:intro-stmt} to use \abbrCTIDB. Correct me if this is wrong. Our results do apply to a more general class of \abbrBPDB, but the main data model considered in this paper is \abbrCTIDB.
Also, for the example above and worked out in what follows, it might be better flow to keep $\bound = 2$ and change what is below.}
\begin{footnotesize}
\begin{align*}

@ -15,12 +15,12 @@
\newcommand{\draft}{0} %%% Change this to non-zero to remove comments
\ifnum\draft=0
\newcommand{\currentWork}[1]{\textcolor{red}{#1}}
\newcommand{\BG}[1]{\todo{\textbf{Boris says:$\,$} #1}}
\newcommand{\BG}[1]{\todo[inline]{\textbf{Boris says:$\,$} #1}}
\newcommand{\SF}[1]{\todo{\textbf{Su says:$\,$} #1}}
\newcommand{\OK}[1]{\todo[color=gray]{\textbf{Oliver says:$\,$} #1}}
\newcommand{\AH}[1]{\todo[inline, backgroundcolor=cyan, caption={}]{\textbf{Aaron says:$\,$} #1}}
\newcommand{\AR}[1]{\todo[inline,color=green]{\textbf{Atri says:$\,$} #1}}
\newcommand{\BGdel}[2]{\todo{\textbf{Boris deleted [#2]: {#1}}}}
\newcommand{\BGdel}[2]{\todo[inline]{\textbf{Boris deleted [#2]: {#1}}}}
\else
\newcommand{\BG}[1]{}
\newcommand{\SF}[1]{}

@ -48,8 +48,8 @@ For any graph $G=(V,\edgeSet)$ and $\kElem\ge 1$, define
\noindent Returning to \Cref{fig:two-step}, it can be seen that $\poly_{G}^\kElem(\vct{X})$ is the lineage polynomial from query $\query_k$, which we define next ($\query_2$ from~\Cref{sec:intro} is the same query with $k=2$). Let us alias
\begin{lstlisting}
SELECT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.city = r.city1 AND $t_2$.city = r.city2
SELECT DISTINCT 1 FROM T $t_1$, R r, T $t_2$
WHERE $t_1$.Point = r.Point$_1$ AND $t_2$.Point = r.Point$_2$
\end{lstlisting}
as $R$. The query $\query^k$ then becomes
\mdfdefinestyle{underbrace}{topline=false, rightline=false, bottomline=false, leftline=false, backgroundcolor=black!15!white, innerbottommargin=0pt}