Update on Overleaf.

Atri Rudra 2022-01-18 16:47:30 +00:00 committed by node
parent a2fcf0b468
commit cd72155035
2 changed files with 25 additions and 9 deletions

@ -3,16 +3,20 @@
\section{Introduction}\label{sec:intro}
\secrev{
This work explores the problem of developing a theoretical framework (\textit{at fine-grained/parameterized levels}) for the problem of computing the expectation of a tuple's multiplicity in a bag \abbrPDB. We begin our analysis using a restricted form of \abbrPDB which we call a \abbrCTIDB. A \abbrCTIDB, $\pdb = \inparen{\idb, \pd}$, is a bag \abbrPDB where each tuple is an independent random event, with a multiplicity of at most $c$, where $\idb$ is the set of possible worlds and $\pd$ is the probability distribution over $\idb$.
This work explores the problem of developing a theoretical framework (\textit{at fine-grained/parameterized levels}) for the problem of computing the expectation of a tuple's multiplicity in a bag \abbrPDB.
\AR{I will just start with TIDB and not even mention PDBs for now (i.e., do not mention general PDBs in the first part of the intro, unless we are mentioning related work). And I would not use the phrase ``developing a framework'' because we really do not do that. Just say we consider the problem. And postpone the point about parameterized complexity till we talk about our results.}We begin our analysis using a restricted form of \abbrPDB which we call a \abbrCTIDB. A \abbrCTIDB, $\pdb = \inparen{\idb, \pd}$, is a bag \abbrPDB where each tuple is an independent random event, with a multiplicity of at most $c$, where $\idb$ is the set of possible worlds and $\pd$ is the probability distribution over $\idb$.
}
\mypar{For a later section}
%\mypar{For a later section}
%\sout{
Since each tuple in $\pdb$ has a mutually exclusive probability distribution over its possible multiplicities, it is natural to reduce a \abbrCTIDB to traditional (set) block independent database (\abbrBIDB). We refer to the reduced \abbrBIDB as a $1$-\abbrBIDB, as it is the case that each tuple can appear in a possible world at most $c = 1$ time. \Cref{fig:ctidb-red} shows an example of this reduction.
%Since each tuple in $\pdb$ has a mutually exclusive probability distribution over its possible multiplicities, it is natural to reduce a \abbrCTIDB to traditional (set) block independent database (\abbrBIDB). We refer to the reduced \abbrBIDB as a $1$-\abbrBIDB, as it is the case that each tuple can appear in a possible world at most $c = 1$ time. \Cref{fig:ctidb-red} shows an example of this reduction.
%}
\secrev{
Allowing each tuple a multiplicity of at most $c$ gives rise to up to $\inparen{c+1}^\numvar$ possible worlds (each of the $\numvar$ tuples independently takes a multiplicity in $\inset{0,\ldots,c}$) instead of the usual $2^\numvar$ possible worlds of a \abbrTIDB. \Cref{fig:ctidb-red} shows an example \abbrCTIDB and the possible worlds (with probabilities) which the model encodes.
\AR{I don't think the figure/example helps a lot and it also takes up a lot of space: I think we should just define \abbrCTIDB and then jump to the problem statement; your example is just showing how a product distribution works, which seems like overkill for the intended audience. Also I think (but \textbf{Boris/Oliver} should pitch in here) it might be easier to just define \abbrCTIDB without using the general PDB definition. E.g. here is one way to define things (this is just a {\em proposal}). A possible world over $n$ `base' tuples is defined by a vector $\vct{W}\in\inset{0,\dots,c}^n$, where the $i$th entry $\vct{W}[i]$ is the multiplicity of the $i$th tuple (a multiplicity of $0$ denotes that the corresponding tuple is not present). The distribution $\mathcal{P}$ for \abbrCTIDB is a product distribution, where the multiplicity $\vct{W}[i]$ is defined by $c+1$ probability values $p_{i,j}=\Pr[\vct{W}[i]=j]$ for $j\in\inset{0,\dots,c}$, and these distributions are independent for each $i\in[n]$.}
We can formally state this problem as:
\AR{You have not stated explicitly that you are looking at bag semantics for query evaluation.}
\begin{Problem}\label{prob:expect-mult}
Given a \abbrCTIDB $\pdb = \inparen{\idb, \pd}$, query $\query$, and result tuple $\tup$, compute the expected multiplicity of $\tup$: $\expct_{\randDB\sim\pd}\pbox{\query\inparen{\randDB}\inparen{\tup}}$.
\end{Problem}
@ -92,20 +96,28 @@ Given a \abbrCTIDB $\pdb = \inparen{\idb, \pd}$, query $\query$, and result tupl
%\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We upperbound the multiplicity of tuples in a \abbrCTIDB because of the cancellation effect of queries over a $1$-\abbrBIDB (introduced later), where, for the worst case, a self join query, we would have a factor of $\frac{1}{c^{n-1}}$ cancellations. Allowing for unbounded $c$ is an interesting open problem.
We upperbound the multiplicity of tuples in a \abbrCTIDB because of the cancellation effect of queries over a $1$-\abbrBIDB (introduced later), where, for the worst case, a self join query, we would have a factor of $\frac{1}{c^{n-1}}$ cancellations.
\AR{No, this is not the (main) reason we consider \abbrCTIDB. The reason is that bounded $c$ and in particular, $c=1$ is what is used in practice. So the motivation is that we are considering a setting that is what is needed in practice. What you are mentioning above is again related to HOW we solve the problem and should not be in this part. In particular, in this part of the intro there should be NO mention of $1$-\abbrBIDB.}Allowing for unbounded $c$ is an interesting open problem.
}
%\sout{
\mypar{Example that can perhaps be used later on (using commented out figure above)}
Given a \abbrCTIDB $\pdb$ with $\numvar$ tuples, we can encode a possible world by the vector $\vct{W} \in \inset{0,\ldots, c}^\numvar$, with the intuitive interpretation that when entry $W_i = j$, tuple $\tup_i$ is selected with multiplicity $j$, with $\tup_i$ not existing in the special case of $j = 0$. For the example in~\cref{fig:ctidb-red}, we have that for \abbrCTIDB $\textbf{R}$, $\numvar = 2$. Then, e.g., the arbitrary world vector $\vct{W} = [2, 3]$ encodes the possible world $\db = \inset{\intup{a, 2}, \intup{b, 3}}$. Computing~\cref{prob:expect-mult} for tuple $\tup_2$ in~\cref{fig:ctidb-red} when $\query = \mathbf{\rel}$ then becomes $\expct_{\randDB\sim\pd}\pbox{\mathbf{\rel}\inparen{\tup_2}} = 1\cdot\prob_{2,1} + 2\cdot\prob_{2,2} + 3\cdot\prob_{2,3} = 1\cdot 0.2 + 2\cdot 0.35 + 3\cdot 0.15 = 1.35$.
%\mypar{Example that can perhaps be used later on (using commented out figure above)}
%Given a \abbrCTIDB $\pdb$ with $\numvar$ tuples, we can encode a possible world by the vector $\vct{W} \in \inset{0,\ldots, c}^\numvar$, with the intuitive interpretation that when entry $W_i = j$, tuple $\tup_i$ is selected with multiplicity $j$, with $\tup_i$ not existing in the special case of $j = 0$. For the example in~\cref{fig:ctidb-red}, we have that for \abbrCTIDB $\textbf{R}$, $\numvar = 2$. Then, e.g., the arbitrary world vector $\vct{W} = [2, 3]$ encodes the possible world $\db = \inset{\intup{a, 2}, \intup{b, 3}}$. Computing~\cref{prob:expect-mult} for tuple $\tup_2$ in~\cref{fig:ctidb-red} when $\query = \mathbf{\rel}$ then becomes $\expct_{\randDB\sim\pd}\pbox{\mathbf{\rel}\inparen{\tup_2}} = 1\cdot\prob_{2,1} + 2\cdot\prob_{2,2} + 3\cdot\prob_{2,3} = 1\cdot 0.2 + 2\cdot 0.35 + 3\cdot 0.15 = 1.35$.
%}
\AR{I know we did not decide on this earlier but I think this would be a good place to have a paragraph on results on set-PDBs and focus just on $1$-TIDBs. Mention that with set query evaluation semantics the problem is really hard and summarize at least the lower bound result here. Also this might be a good place to mention the `trivial' algorithm for \abbrCTIDB in a sentence or so: the idea is to establish that we can solve our problem in polynomial time. So the interesting question is fine-grained/parameterized complexity.}
\secrev{
One of the main theoretical points in this work is to discern whether or not bag \abbrPDB query semantics are indeed linear in the runtime of the deterministic query. Unfortunately, we prove that this is not the case. To analyze this question we denote by $\timeOf{}^*(Q,\pdb)$ the optimal runtime complexity of computing ~\cref{prob:expect-mult} over \abbrCTIDB $\pdb$. Let $\qruntime{\query, \db}$ be the runtime of query $\query$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms. We make this runtime concrete later on.
One of the main theoretical points in this work is to discern whether or not bag \abbrPDB query semantics are indeed linear in the runtime of the deterministic query.
\AR{Again just mention \abbrCTIDB instead of general \abbrPDB. Also you should motivate in a sentence or so WHY this is a problem to consider. Again, if we have already motivated \abbrCTIDB because of practical considerations, then you can say here that a positive answer would mean these nice things in practice.}
Unfortunately, we prove that this is not the case. To analyze this question we denote by $\timeOf{}^*(Q,\pdb)$ the optimal runtime complexity of computing ~\cref{prob:expect-mult} over \abbrCTIDB $\pdb$. Let $\qruntime{\query, \db}$ be the runtime of query $\query$ on deterministic database $\db$ under a cost model that is satisfied by a wide range of query processing algorithms, including those based on the recent work on worst-case optimal join algorithms.
\AR{First, I thought we were going with $T_{det}^*$ instead of just $\qruntime{\cdot}$; this change needs to be made for the rest of the intro. Also, for now you can just informally define $T_{det}^*$ as the {\em optimal} deterministic runtime (but add a parenthetical remark on there being some caveats). Second, $\db$ is not connected to $\pdb$.}
We make this runtime concrete later on.
We denote by $\dbbase$ the base \abbrCTIDB table containing all possible tuples, formally as,
\begin{Definition}[$\dbbase$]
Let $\dbbase$ be the relation composed of all possible tuples in $\pdb$, i.e. $\dbbase = \bigcup_{\db \in \idb}\db$.
\end{Definition}
\AR{Again if we are defining \abbrCTIDB `from scratch' instead of in terms of general PDBs, then the above might not be needed. Also it should be \abbrCTIDB instead of \abbrPDB in the sentence below.}
~\Cref{tab:lbs} shows our lower bounds for computing ~\cref{prob:expect-mult} on general \abbrPDB\xplural.
\newline
\begin{table}[h!]
@ -123,7 +135,10 @@ $\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{c_0\cdot k}}$ for {\em som
\caption{Our lower bounds for a specific hard query $Q$ parameterized by $k$. The $\pdb$ is over the same (family of) $\dbbase$ and those with `Multiple' in the second column need the algorithm to be able to handle multiple $\pd$ (for a given $\dbbase$). The last column states the hardness assumptions that imply the lower bounds in the first column ($\eps_o,C_0,c_0$ are constants that are independent of $k$).}
\label{tab:lbs}
\end{table}
\mypar{Our lower bound results} In Table~\ref{tab:lbs} we show that, depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries). What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.\AH{Not sure what is meant by `the trivial algorithm for (what was originally called) Problem 1.4'}
\mypar{Our lower bound results} In Table~\ref{tab:lbs} we show that, depending on what hardness result/conjecture we assume, we get various emphatic versions of {\em no} as an answer to our question. To make some sense of the other lower bounds in Table~\ref{tab:lbs}, we note that it is not too hard to show that $\timeOf{}^*(Q,\pdb) \le O\inparen{\inparen{\qruntime{Q, \dbbase}}^k}$, where $k$ is the largest degree of the polynomial $\apolyqdt$ over all result tuples $\tup$ (and the parameter that defines our family of hard queries).
\AR{$\Phi$ is no longer defined so the above needs to be re-written just in terms of the query.}
What our lower bound in the third row says is that one cannot get more than a polynomial improvement over essentially the trivial algorithm for \Cref{prob:informal}.\AH{Not sure what is meant by `the trivial algorithm for (what was originally called) Problem 1.4'}
\AR{The trivial algorithm is to convert to SMB form and then use linearity of expectation.}
However, this result assumes a hardness conjecture that is not as well studied as those in the first two rows of the table (see \Cref{sec:hard} for more discussion on the hardness assumptions). Further, we note that existing results already imply the claimed lower bounds if we were to replace the $\qruntime{\query, \dbbase}$ by just $\abs{\dbbase}$ (indeed these results follow from known lower bounds for deterministic query processing). Our contribution is then to identify a family of hard queries where deterministic query processing is `easy' but computing the expected multiplicities is hard.
\mypar{Our upper bound results} We introduce a $(1\pm \epsilon)$-approximation algorithm that computes~\cref{prob:expect-mult} in $\leq \qruntime{\query, \dbbase}$. In particular, we show the following upper bound results.
@ -132,6 +147,7 @@ $\Omega\inparen{\inparen{\qruntime{\query, \dbbase}}^{c_0\cdot k}}$ for {\em som
Further, we show that for {\em any} $\raPlus$ query on a \abbrTIDB $(1$-$\abbrTIDB)$, we also obtain linear runtime for approximation.
% the approximation algorithm has runtime linear in the size of the compressed lineage encoding (
In contrast, known approximation techniques (\cite{DBLP:conf/icde/OlteanuHK10,DBLP:journals/jal/KarpLM89}) in set-\abbrPDB\xplural need time $\Omega(\abs{\circuit}^{2k})$ (see \Cref{sec:karp-luby}).
\AR{The above needs to be re-written. It is stated in terms of circuits but circuits are only part of our solution. You should state the final result: i.e. we have a runtime of $O_{\eps}(T_{det}^*)$.}
(ii) We generalize the \abbrPDB data model considered by the approximation algorithm to a class of bag-Block Independent Disjoint Databases (see \Cref{subsec:tidbs-and-bidbs}) (\abbrBIDB\xplural).
}
\secrev{

@ -19,7 +19,7 @@
\newcommand{\SF}[1]{\todo{\textbf{Su says:$\,$} #1}}
\newcommand{\OK}[1]{\todo[color=gray]{\textbf{Oliver says:$\,$} #1}}
\newcommand{\AH}[1]{\todo[backgroundcolor=cyan, caption={}]{\textbf{Aaron says:$\,$} #1}}
\newcommand{\AR}[1]{\todo[color=green]{\textbf{Atri says:$\,$} #1}}
\newcommand{\AR}[1]{\todo[inline,color=green]{\textbf{Atri says:$\,$} #1}}
\else
\newcommand{\BG}[1]{}
\newcommand{\SF}[1]{}