
%root: main.tex
\definecolor{GrayRew}{gray}{0.85}
\newcommand{\RCOMMENT}[1]{\medskip\noindent \begin{tabular}{|p{\linewidth-3ex}}\rowcolor{GrayRew} #1 \end{tabular}\smallskip\\}
\section{Rebuttal}
This paper is a resubmission. We use this section to document the changes made since our prior submission and, in particular, how we have addressed the reviewer critiques.

\subsection{Meta Review}
\RCOMMENT{Problem definition not stated rigorously nor motivated. Discussion needed on the standard PDB approach vs your approach.}
We rewrote \Cref{sec:intro} to address this concern.  The opening paragraph now formally states the query evaluation problem in \abbrBPDB\xplural.  A series of problem statements then defines the problem we address as it relates to the query evaluation problem.  We have also included significant discussion of the standard approach; see, e.g., the paragraph \textbf{Relationship to Set-Probabilistic Query Evaluation} on page 4.

\RCOMMENT{Definition 2.6 on reduced BIDB polynomials seem not the right tool for the studied problem.}
We have chosen to stick with a less formal, ad-hoc definition (please see \Cref{def:reduced-poly} and \Cref{def:reduced-bi-poly}) as suggested by both Reviewer 1 and Reviewer 2.

\RCOMMENT{The paper is very difficult to read. Improvements are needed in particular for the presentation of the approximation results and their proofs. Also for the notation. Missing definitions for used notions need to be added. Ideally use one instead of three query languages (UCQ, RA+, SPJU).}
\AH{How have we handled the presentation of the approximation results and their proofs?}
We now use one query language throughout the paper ($\raPlus$).  We have also made a concerted effort to use clean, well-defined, unambiguous notation.  To the best of our knowledge, all notation conflicts have been resolved and definitions have been added for all notions used (e.g., \Cref{def:Gk} now appears before \Cref{lem:3m-G2} and \Cref{lem:lin-sys}).

\subsection{Reviewer 1}
\RCOMMENT{l.24 "is \#W[1]-hard": parameterized by what?}
\RCOMMENT{l.103 and l.105: again, what is the parameter exactly?}
While the text referenced above no longer appears in the revised \Cref{sec:intro}, all theorem statements and claims on \sharpwone runtime are now stated so as to avoid ambiguity in the parameter.  Please see, e.g., \Cref{thm:k-match-hard} and \Cref{thm:mult-p-hard-result}.

\RCOMMENT{You might want to explain your title somewhere (probably in the  introduction): in the end, what exactly should be considered harmful and why?}
We have modified the title to more aptly describe our body of work.

\RCOMMENT{l.45 when discussing Dalvi and Suciu's dichotomy, you might want to mention that they consider *data complexity*. Currently the second sentence of your introduction ("take a query Q and a pdb D") suggests that you are considering combined complexity.}
We have made an explicit mention of data complexity when alluding to Dalvi and Suciu's dichotomy.  We have further rewritten \Cref{sec:intro} in such a way as to explicitly note the type(s) of complexity we are considering.

\RCOMMENT{l.51 "Consider ... and tuples are independent random event": so this is actually a set PDB... You might want to use an example where the input PDB is actually a bag PDB. The last sentence before the example makes the reader *expect* that the example will be of a bag PDB that is not a set PDB}
Our revision removes the example referred to above.  While the paper predominantly considers set-\abbrPDB inputs to queries, this is not a limitation.  Please see \Cref{footnote:set-not-limit} on \Cpageref{footnote:set-not-limit}.

\RCOMMENT{- In the case of set semantics, the lineage of a tuple can be defined for *any* query: it is the unique Boolean function that satisfies the if and only if property that you mention on line 70. For bag semantics however, to the best of my knowledge there is no general definition of what is a lineage for an arbitrary query. On line 73, it is not clear at all how the polynomial should be defined, since this will depend on the type of query that you consider}
The lineage polynomial (under bag-\abbrPDB semantics) of an arbitrary $\raPlus$ query $\query$ is defined in \Cref{fig:nxDBSemantics}.

\RCOMMENT{l.75 "evaluating the lineage of t over an assignment corresponding to a possible world": here, does the assignment assigns each tuple to true or false? In other words, do the variables X still represent individual tuples?  From what I see later in the article it seems that no, so this is confusing if we compare to what is explained in the previous paragraph about set TIDB}
The discussion after \Cref{prob:bag-pdb-poly-expected} (in particular, the paragraph \textbf{\abbrTIDB\xplural}) specifically addresses these questions.  The values assigned to variables in a possible world are drawn from $\{0, 1\}$, which is analogous to the Boolean case; note, however, \Cref{footnote:set-not-limit}, which describes the encoding of a bag in a set.

\RCOMMENT{- l.135 "polynomial Q(X)": Q should be reserved for queries... You could use $\varphi$ or $\phi$ or... anything else but Q really}
We have decided to use $\poly\inparen{\vct{X}}$.

\RCOMMENT{- If we consider data complexity (as did Dalvi and Suciu) and fix an UCQ Q, given as input a bag TID PDB we can always compute the lineage in $O(|D|^|Q|)$ in SOP form and from there compute the expected multiplicity with the same complexity, so in polynomial time. How does this relate to your hardness result? Is it that you are only interested in combined complexity? Why one shouldn't be happy with this running time? Usually queries are much smaller than databases and this motivates studying data complexity.}
We have rewritten \Cref{sec:intro} to stress that we are primarily interested in data complexity, but that we cannot stop there.  As the reviewer has noted, the problem we explore requires further analysis: we need parameterized and fine-grained complexity analysis to provide a theoretical foundation for the question we ask in \Cref{prob:informal}.  We discuss this in the prose following \Cref{prob:bag-pdb-poly-expected}.
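As a minimal illustration of the SOP-based computation the reviewer describes (the polynomial and the probabilities $p_1, p_2$ below are hypothetical and chosen only for this sketch), suppose the lineage polynomial of an output tuple is given in SOP form over independent variables $X_1, X_2 \in \{0, 1\}$ with $\Pr[X_i = 1] = p_i$.  Linearity of expectation, independence, and the fact that $X_i^2 = X_i$ on $\{0, 1\}$ then give
\[
\mathrm{E}\left[X_1^2 + 2 X_1 X_2 + X_2^2\right]
  = \mathrm{E}\left[X_1^2\right] + 2\,\mathrm{E}\left[X_1\right]\mathrm{E}\left[X_2\right] + \mathrm{E}\left[X_2^2\right]
  = p_1 + 2 p_1 p_2 + p_2 .
\]
This term-by-term evaluation takes time linear in the size of the SOP expansion, which is the quantity bounded by the reviewer's $O(|D|^{|Q|})$ estimate.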

\RCOMMENT{A discussion is missing about the difference between the approach usually taken in PDB literature and your approach. In which case would one be more interested in the expected multiplicity or in the marginal probability of a tuple? This should be discussed clearly in the introduction, as currently there is no clear "motivation" to what you do. There is a section about related work at the end but it is mostly a set of facts and there is no insightful comparison to what you do.}
We include this discussion in paragraph \textbf{Relationship to Set-Probabilistic Query Evaluation} after \Cref{prob:informal}.\AH{We need to maybe talk about the motivation for computing expected multiplicity.}

\RCOMMENT{l.176 "N[X] relations are closed under RA+": is this a *definition* of what it means to take an RA+ query and evaluate it over an N[X] database, or does this sentence say something more? Also, I think it would be clearer to use UCQs in the whole paper instead of constantly changing between UCQs, RA+ and SPJU formalisms}
To make the paper more accessible and general, we found it better not to use $\semNX$-DBs.  While we wanted to use UCQ, we found $\raPlus$ to be more amenable to the presentation of the paper and have, as suggested, stuck with one query formalism.

\RCOMMENT{There are too many things undefined in from l.182 to the end of page. l.182 and in Proposition 2.1 N-PDBs are not defined, the function mod is undefined, etc.  The article should be self-contained: important definitions should be in the article and the appendix should only be used to hide proof details. I think it would not take a lot of space to properly define the main concepts that you are using, without hiding things in the appendix}
We have done as the reviewer has suggested.  All material in \Cref{sec:background} that is proof related is in the appendix, while \Cref{sec:background} is itself now self-contained.

\RCOMMENT{l.622 and l.632-634: so a N-PDB is a PDB where each possible world is an N-database, but an N[X]-PDB is not a PDB where each possible world is an N[X]-database... Confusing notation}
\AH{We need to make sure this is taken care of in the appendix.}

\RCOMMENT{If you want to be in the setting of bag PDBs, why not consider that the value of the variables are integers rather that Boolean? I.e., consider valuations $\nu: X \rightarrow$ N (or even to R, why not?) instead of $X \rightarrow \{0,1\}$; this would seem more natural to me than having this ad-hoc "mix" of Boolean and non-Boolean setting. If you consider this then your "reduced polynomial" trick does not seem to work anymore.}
As discussed earlier in this rebuttal, we primarily deal with set-\abbrPDB inputs and \abbrBPDB outputs, where an easy generalization exists to encode a \abbrBPDB in a set-\abbrPDB (which then allows for bag inputs), to which we allude in \Cref{footnote:set-not-limit} on \Cpageref{footnote:set-not-limit}.

\RCOMMENT{- l.656 "Thus, from now on we will solely use such vectors...": this seems to
  be false. Moreover you keep switching notation which makes it very hard to read... Sometimes it is $\varphi$, sometimes it is small w, sometimes it is big W (l.174 or l.722), sometimes the database is $\varphi(D)$, sometimes it is $\varphi_w(D)$, other times it is $D_{[w]}$ (l.671), and so on.}
We have made a deliberate effort to use notation consistently, following standard usage whenever possible.
\AH{We need to be sure this is taken care of in the appendix.}

\RCOMMENT{l.658 "we use $\varphi(D)$ to denote the semiring homomorphism $\semNX \rightarrow \semN$
  that...": I don't understand why you need a database to extend an assignment to its semiring homomorphism from $\semNX \rightarrow \semN$}
\AH{Need to make sure that the reason for this is clear.}

\RCOMMENT{}