%root: main.tex
\definecolor{GrayRew}{gray}{0.85}
\newcommand{\RCOMMENT}[1]{\medskip\noindent \begin{tabular}{|p{\linewidth-3ex}}\rowcolor{GrayRew} #1 \end{tabular}\smallskip\\}
\section{Rebuttal}
This paper is a resubmission. We use this section to document the changes made since our prior submission and, in particular, how we have addressed the reviewers' critiques.
\subsection{Meta Review}
\RCOMMENT{Problem definition not stated rigorously nor motivated. Discussion needed on the standard PDB approach vs your approach.}
We have rewritten \Cref{sec:intro} specifically to address this concern. The opening paragraph now formally states the query evaluation problem in \abbrBPDB\xplural. We then use a series of problem statements to clearly define the problem we are addressing as it relates to the query evaluation problem. We have also included significant discussion of the standard approach; see, e.g., the paragraph \textbf{Relationship to Set-Probabilistic Query Evaluation} on page 4.
\RCOMMENT{Definition 2.6 on reduced BIDB polynomials seem not the right tool for the studied problem.}
We have adopted a less formal, ad-hoc definition (please see \Cref{def:reduced-poly} and \Cref{def:reduced-bi-poly}), as suggested by both Reviewer 1 and Reviewer 2.
\RCOMMENT{The paper is very difficult to read. Improvements are needed in particular for the presentation of the approximation results and their proofs. Also for the notation. Missing definitions for used notions need to be added. Ideally use one instead of three query languages (UCQ, RA+, SPJU).}
\AH{How have we handled the presentation of the approximation results and their proofs?}
We have chosen one specific query language throughout the paper ($\raPlus$). We have also made a concerted effort to use clean, well-defined, unambiguous notation. To the best of our knowledge, all notation conflicts have been addressed and definitions have been added for all notions used (e.g., \Cref{def:Gk} now appears before \Cref{lem:3m-G2} and \Cref{lem:lin-sys}).
\subsection{Reviewer 1}
\RCOMMENT{l.24 "is \#W[1]-hard": parameterized by what?}
\RCOMMENT{l.103 and l.105: again, what is the parameter exactly?}
While the text referenced above no longer appears in the revised \Cref{sec:intro}, all theorem statements and claims on \sharpwone runtime are now stated so as to avoid ambiguity in the parameter. Please see, e.g., \Cref{thm:k-match-hard} and \Cref{thm:mult-p-hard-result}.
\RCOMMENT{You might want to explain your title somewhere (probably in the introduction): in the end, what exactly should be considered harmful and why?}
We have modified the title to more aptly describe our body of work.
\RCOMMENT{l.45 when discussing Dalvi and Suciu's dichotomy, you might want to mention that they consider *data complexity*. Currently the second sentence of your introduction ("take a query Q and a pdb D") suggests that you are considering combined complexity.}
We now explicitly mention data complexity when discussing Dalvi and Suciu's dichotomy. We have further rewritten \Cref{sec:intro} to explicitly note the type(s) of complexity we are considering.
\RCOMMENT{l.51 "Consider ... and tuples are independent random event": so this is actually a set PDB... You might want to use an example where the input PDB is actually a bag PDB. The last sentence before the example makes the reader *expect* that the example will be of a bag PDB that is not a set PDB}
Our revision has removed the example referred to above. While the paper considers predominantly set-\abbrPDB inputs to queries, we have concluded this not to be limiting. Please see \Cref{footnote:set-not-limit} on \Cpageref{footnote:set-not-limit}.
\RCOMMENT{- In the case of set semantics, the lineage of a tuple can be defined for *any* query: it is the unique Boolean function that satisfies the if and only if property that you mention on line 70. For bag semantics however, to the best of my knowledge there is no general definition of what is a lineage for an arbitrary query. On line 73, it is not clear at all how the polynomial should be defined, since this will depend on the type of query that you consider}
The (bag \abbrPDB) semantics of the lineage polynomial for an arbitrary $\raPlus$ query $\query$ is defined in \Cref{fig:nxDBSemantics}.
\RCOMMENT{l.75 "evaluating the lineage of t over an assignment corresponding to a possible world": here, does the assignment assigns each tuple to true or false? In other words, do the variables X still represent individual tuples? From what I see later in the article it seems that no, so this is confusing if we compare to what is explained in the previous paragraph about set TIDB}
The discussion after \Cref{prob:bag-pdb-poly-expected} (in particular, the paragraph \textbf{\abbrTIDB\xplural}) specifically addresses these questions. The values assigned by a possible world are from $\{0, 1\}$, which is analogous to the Boolean setting; note, however, \Cref{footnote:set-not-limit}, which describes the encoding of a bag in a set.
\RCOMMENT{- l.135 "polynomial Q(X)": Q should be reserved for queries... You could use $\varphi$ or $\phi$ or... anything else but Q really}
We have decided to use $\poly\inparen{\vct{X}}$.
\RCOMMENT{- If we consider data complexity (as did Dalvi and Suciu) and fix an UCQ Q, given as input a bag TID PDB we can always compute the lineage in $O(|D|^{|Q|})$ in SOP form and from there compute the expected multiplicity with the same complexity, so in polynomial time. How does this relate to your hardness result? Is it that you are only interested in combined complexity? Why one shouldn't be happy with this running time? Usually queries are much smaller than databases and this motivates studying data complexity.}
We have rewritten \Cref{sec:intro} to stress that we are primarily interested in data complexity, but that we cannot stop there. As the reviewer has noted, the problem we explore requires further analysis: we need parameterized and fine-grained complexity analysis to provide a theoretical foundation for the question we ask in \Cref{prob:informal}. We discuss this in the prose following \Cref{prob:bag-pdb-poly-expected}.
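As a small illustration of the polynomial-time upper bound the reviewer describes, consider a hypothetical lineage polynomial in SOP form over a \abbrTIDB whose tuple variables $W_X$, $W_Y$, $W_Z$ are independent, take values in $\{0, 1\}$, and have marginal probabilities $p_X$, $p_Y$, $p_Z$ (these names are ours, chosen only for this sketch). Linearity of expectation lets us treat each monomial separately, independence factorizes each monomial, and $W_X^2 = W_X$ for $\{0, 1\}$-valued variables:
\[
  \expct\pbox{2 W_X W_Y + W_X^2 W_Z}
  = 2\,\expct\pbox{W_X W_Y} + \expct\pbox{W_X^2 W_Z}
  = 2\, p_X\, p_Y + p_X\, p_Z .
\]
The work is linear in the number of monomials of the SOP form, which is the quantity that can grow as $O(|D|^{|Q|})$; this dependence on the query size is what our parameterized and fine-grained analysis examines.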
\RCOMMENT{A discussion is missing about the difference between the approach usually taken in PDB literature and your approach. In which case would one be more interested in the expected multiplicity or in the marginal probability of a tuple? This should be discussed clearly in the introduction, as currently there is no clear "motivation" to what you do. There is a section about related work at the end but it is mostly a set of facts and there is no insightful comparison to what you do.}
We include this discussion in paragraph \textbf{Relationship to Set-Probabilistic Query Evaluation} after \Cref{prob:informal}.\AH{We need to maybe talk about the motivation for computing expected multiplicity.}
\RCOMMENT{l.176 "N[X] relations are closed under RA+": is this a *definition* of what it means to take an RA+ query and evaluate it over an N[X] database, or does this sentence say something more? Also, I think it would be clearer to use UCQs in the whole paper instead of constantly changing between UCQs, RA+ and SPJU formalisms}
To make the paper more accessible and general, we found it better not to use $\semNX$-DBs. While we wanted to use UCQs, we found $\raPlus$ to be more amenable to the presentation of the paper and have, as suggested, stuck with one query formalism.
\RCOMMENT{There are too many things undefined in from l.182 to the end of page. l.182 and in Proposition 2.1 N-PDBs are not defined, the function mod is undefined, etc. The article should be self-contained: important definitions should be in the article and the appendix should only be used to hide proof details. I think it would not take a lot of space to properly define the main concepts that you are using, without hiding things in the appendix}
We have done as the reviewer has suggested. All material in \Cref{sec:background} that is proof related is in the appendix, while \Cref{sec:background} is itself now self-contained.
\RCOMMENT{l.622 and l.632-634: so a N-PDB is a PDB where each possible world is an N-database, but an N[X]-PDB is not a PDB where each possible world is an N[X]-database... Confusing notation}
\AH{We need to make sure this is taken care of in the appendix.}
\RCOMMENT{If you want to be in the setting of bag PDBs, why not consider that the value of the variables are integers rather that Boolean? I.e., consider valuations $\nu: X \rightarrow$ N (or even to R, why not?) instead of $X \rightarrow \{0,1\}$; this would seem more natural to me than having this ad-hoc "mix" of Boolean and non-Boolean setting. If you consider this then your "reduced polynomial" trick does not seem to work anymore.}
As discussed earlier in this rebuttal, we primarily deal with set-\abbrPDB inputs and \abbrBPDB outputs. An easy generalization encodes a \abbrBPDB in a set-\abbrPDB, which then allows for bag inputs; we allude to this in \Cref{footnote:set-not-limit} on \Cpageref{footnote:set-not-limit}.
\RCOMMENT{- l.656 "Thus, from now on we will solely use such vectors...": this seems to
be false. Moreover you keep switching notation which makes it very hard to read... Sometimes it is $\varphi$, sometimes it is small w, sometimes it is big W (l.174 or l.722), sometimes the database is $\varphi(D)$, sometimes it is $\varphi_w(D)$, other times it is $D_{[w]}$ (l.671), and so on.}
We have made an effort to use notation consistently, following standard usage whenever possible.
\AH{We need to be sure this is taken care of in the appendix.}
\RCOMMENT{l.658 "we use $\varphi(D)$ to denote the semiring homomorphism $\semNX \rightarrow \semN$
that...": I don't understand why you need a database to extend an assignment to its semiring homomorphism from $\semNX \rightarrow \semN$}
\AH{Need to make sure that the reason for this is clear.}
\RCOMMENT{Figure 2, K is undefined}
We have updated \Cref{fig:nxDBSemantics} (originally Figure 2) so that it no longer requires $K$.
\RCOMMENT{l.178 "$Q_t$", l.189 "Q will denote a polynomial": this is a very poor choice of notation}
\RCOMMENT{l.242 "and query Q": is Q a query or a lineage?}
We have reserved $\query$ to mean an $\raPlus$ query and nothing else.
\RCOMMENT{Section 2.1.1: here you are considering set semantics no? Otherwise, one would think that for bag semantics the annotation of a tuple could be 0 or something of the form c $\times$ X, where X is a variable and c is a natural number}
The semantics of the polynomial in \Cref{eq:sop-form} is indeed specified as the reviewer has pointed out.
\RCOMMENT{Proof of Proposition A.3. I seems the proof should end after l.687, since you already proved everything from the statement of the proposition. I don't understand what it is that you do after this line.}
\AH{This needs to be verified.}
\RCOMMENT{l.686 "The closure of ... over K-relations": you should give more details on this part. It is not obvious to me that the relations from l.646 hold.}
\AH{This too needs to be looked at.}
\RCOMMENT{l.711 "As already noted...": ah? I don't see where you define which subclass of N[X]-PDBs define bag version of TIDBs. If this is supposed to be in Section 2.1.1 this is not clear, since the world "bag" does not even appear there (and as already mentioned everything seems to be set semantics in this section). I fact, nowhere in the article can I see a definition of what are bag TIDBs/BIDBs}
\AH{This needs to be taken care of in the appendix.}
\RCOMMENT{- l.707 "the sum of the probabilities of all the tuples in the same block b is 1": no, traditionally it can be less than 1, which means that there could be no tuple in the block.}
The reviewer is correct and we have updated our appendix text accordingly.
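In symbols, writing $p_t$ for the marginal probability of a tuple $t$ (hypothetical notation used only for this sketch; the appendix may use different symbols), the corrected condition for each block $b$ reads
\[
  \sum_{t \in b} p_t \leq 1 ,
\]
where the remaining mass $1 - \sum_{t \in b} p_t$ is the probability that no tuple of $b$ appears in a possible world.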
\RCOMMENT{it is not clear to me how you can go from l.733 to l.736, which is sad because this is actually the whole point of this proof. If I understand correctly, in l.733, Q(D)(t) is the polynomial annotation of t when you use the semantics of Figure 2 with the semiring K being N[X], so I don't see how you go from this to l.736}
\AH{Needs to be verified. I have looked at this previously, and the proof iirc.}
\RCOMMENT{l.209-227: so you define what is a polynomial and what is the degree of a polynomial (things that everyone knows), but you don't bother explaining what "taking the mod of Q(X) over all polynomials in S" means? This is a bit weird.}
Based on this and other reviewer comments, we removed the formal definition of $\rpoly\inparen{\vct{X}}$ and have defined it in a more ad-hoc manner, as suggested by the reviewers, including the comment immediately following.
\RCOMMENT{Definition 2.6: to me, using polynomial long division to define $\tilde{Q}$(X) seems like a pedantic way of reformulating something similar to Definition 1.3, which was perfectly fine and understandable already! You could just define $\tilde{Q}$(X) to set all exponents in the SOP that are >1 to 1 and to remove all monomials with variables from the same block, or using Lemma A.4 as a definition?}
As alluded to above, we have incorporated the reviewer's suggestion; see \Cref{def:reduced-poly} and \Cref{def:reduced-bi-poly}.
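For concreteness, here is a minimal sketch of the ad-hoc reduction the reviewer describes, on a hypothetical polynomial (the variables $X$, $Y$ and the block assignments below are our assumptions for illustration, not taken from the paper). In a \abbrTIDB, where every variable forms its own block, reducing $\inparen{X + Y}^2$ only lowers every exponent greater than $1$ to $1$ in the SOP form:
\[
  \inparen{X + Y}^2 = X^2 + 2XY + Y^2
  \;\longmapsto\;
  X + 2XY + Y .
\]
If instead $X$ and $Y$ were distinct variables from the same block of a \abbrBIDB, the monomial $2XY$ would additionally be removed, leaving $X + Y$.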
\RCOMMENT{Definition 2.14. It is not clear what is the input exactly. Are the query Q and database D fixed? Moreover, I have the impression that your hardness results have nothing to do with lineages and that you don't need them to express your results. I think the problem you should consider is simply the following: Expected Multiplicity Problem: Input: query Q, N[X]-database D, tuple t. Output: expected multiplicity of t in Q(D). Your main hardness result would then look like this: the Expected
Multiplicity problem restricted to conjunctive queries is \#W[1]-hard, parameterized by query size. Indeed if I look at the proof, all you need is the queries $Q^k_G$. The problem is \#W[1]-hard and it should not matter how one tries to solve it: using an approach with lineages or using anything else.
Currently it is confusing because you make it look like the problem is hard only when you consider general arithmetic circuits, but your hardness proof has nothing to do with circuits. Moreover, it is not surprising that computing the expected output of an arithmetic circuit is hard: it is trivial, given a CNF $\phi$, to build an arithmetic circuit C such that for any valuation $\nu$ of the variables the formula $\phi$ evaluates to True under $\nu$ if C evaluates to 1 and the formula $\phi$ evaluates to False under $\nu$ if C evaluates to 0, so this problem is \sharpphard anyways.}
We have rewritten \Cref{sec:intro} with a series of refined problem statements to show that the problem we explore and the results we obtain directly involve lineage polynomials. The reviewer is correct that the output is the expected multiplicity, and we hope that our updated presentation of the paper makes it clear that $\expct_{\vct{\randWorld}\sim\pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is indeed the expected multiplicity spoken of. We have also addressed the ambiguity in the complexity we are focusing on, both explicitly in the intro and in the revised definition, \Cref{def:the-expected-multipl}.
Regarding the use of circuits, it is true that our hardness results do not require circuits, while our approximation algorithm and cost model both rely on them. We have adjusted our presentation (e.g., the segue between \Cref{prob:informal} and \Cref{prob:big-o-joint-steps}) to make this distinction clear and eliminate any confusion.
\RCOMMENT{Section 3.3. It seems to me the important part of this section is not so much the fact that we have fixed values of p but that the query is now fixed and that you are looking at the fine-grained complexity. If what you really cared about was having fixed value of p, then the result of this section should be exactly like the one in Theorem 3.4, but starting with "fix p". So something like "Fix p. Computing $\tilde{Q}^k_G$ for arbitrary G is \#W1-hard".}
\AH{Need help in responding to this one.}
\RCOMMENT{General remark: The story of the paper I think should be this: we can always compute the expected multiplicity for a UCQ Q and N[X]-database D and tuple t by first computing the lineage in SOP form and then using linearity of expectation, which gives an upper bound of (roughly) $O(|D|^{|Q|})$. We show that this exponential dependence in |Q| is unavoidable by proving that this problem is \#W1 hard parameterized by |Q| (which implies that we cannot solve it in $f(|Q|) |D|^c$ ). Furthermore we obtain fine-grained superlinear lower bounds for a fix conjunctive query Q. (Observe how up to here, there is no need to talk about lineages at all). We then obtain an approximation algorithm for this problem for [this class of queries] and [that class of bag PDBs] with [that running time (Q,D)]. The method is to first compute the lineage as an arithmetic circuit C in [this running time (Q,D)], and then from the arithmetic circuit C compute in [running time(C)] an approximation of its expected output. Currently I don't understand to which queries your approximation algorithm can be applied (see later comments).}
We have followed the reviewer's suggestion to delineate between the `coarse' polynomial-time analysis and the fine-grained complexity analysis. We found it necessary to introduce polynomials early, since our hard query, hardness results, and their proofs are easier to present with lineage polynomials than without them (and, we feel, this makes the paper more accessible).
We have taken pains to be very clear that this work only considers $\raPlus$ queries, adding a reminder to this end in the first paragraph of \Cref{sec:algo}.
\AH{We need to address the last line of the reviewer's comment. Also, not sure if I answered the comment perfectly.}
\RCOMMENT{l.381: Here again, I think it would be simpler to consider that the input of the problem is the query, the database and a tuple and claim that you can compute an approximation of the expected multiplicity in linear time. The algo is to first compute the lineage as an arithmetic circuit, and then to use what you currently use (which could be put in a lemma or in a proposition).}
Our approximation algorithm assumes an input circuit \circuit that has been computed from an arbitrary $\raPlus$ query $\query$ and an arbitrary \abbrBIDB $\pdb$. We have included prose describing this at the beginning of \Cref{sec:algo:sub:main-result}.
\RCOMMENT{Definition 4.2: would you mind giving an intuition of what this is? It is not OK to define something and just tell the reader to refer the appendix to understand what this is and why this is needed; the article should be understandable without having to look at the appendix. It is simply something that gives the coefficient of each monomial in the reduced polynomial?}
We have provided an intuitive example directly after \Cref{def:expand-circuit}.
\RCOMMENT{- l.409: how does it matter that the circuit C is the lineage of a UCQ? Doesn't this work for any arithmetic circuit?}
The reviewer is correct that our approximation results apply to $\raPlus$ queries over \abbrBIDB\xplural. We specify this in the formal statements of \Cref{sec:algo}; see, e.g., \Cref{def:param-gamma} and \Cref{cor:approx-algo-const-p}.
\RCOMMENT{l.411: what are $|C|^2(1,...,1)$ and $|C|(1,...,1)$? }
We clarify this overloaded notation immediately after \Cref{def:positive-circuit}.
\RCOMMENT{Sometimes you consider UCQs, sometimes RA+ queries. I think it would be simpler if you stick to one formalism (probably UCQs is cleaner?)}
As alluded to previously, we have followed the reviewer's suggestion and have found $\raPlus$ queries to be most amenable for this work.
\RCOMMENT{l.432 what is an FAQ query?}
We have added a reference. Please see \Cref{lem:val-ub}.
\RCOMMENT{Generally speaking, I think I don't understand much about Section 4, and the convolutedness of the appendix does not help to understand. I don't even see in which result you get a linear runtime and to which queries the linear runtime applies. Somewhere there should be a corollary that clearly states a linear time approximation algorithm for some queries.}
\AH{Needs to be addressed.}
\RCOMMENT{In section 5, it seems you are arguing that we can compute lineages as arithmetic circuits at the same time as we would be running an ordinary query evaluation plan. How is that different from using the relations in Figure 2 for computing the lineage?}
There is no major difference between the two. This observation has persuaded us to eliminate $\semNX$-DB query evaluation and to present only an algorithm for computing the lineage.
\RCOMMENT{l.679 where do you use $max(D_i)$ later in the proof?}
\AH{Needs to be fixed.}
\RCOMMENT{l.688 That sentence is hard to parse, consider reformulating it}
\AH{Needs to be reformulated.}
\RCOMMENT{it seems you are defining N[X]-PDB at two places in the appendix: once near l.632, and another time near l.652}
\AH{Needs to be addressed.}
\subsection{Reviewer 2}
\RCOMMENT{First, the paper should state rigorously the problem definition. There are three well-known definitions in database theory: data complexity, combined complexity, and parameterized complexity. If I understand correctly, Theorem 3.4 refers to the parameterized complexity, Theorem 3.6 refers to the data complexity (of a fixed query), while the positive results in Sec. 4 (e.g. Th. 4.8) introduce yet another notion of complexity, which requires discussion.}
We have addressed these concerns by rewriting the entirety of \Cref{sec:intro}: we explicitly mention the complexity metrics considered and form a series of problem statements that intuitively describe the exact problem we are considering. We have also adjusted the phrasing of the said theorems and definitions to eliminate the ambiguity described.
\RCOMMENT{The problem definition is supposed to be in Definition 2.14, but this definition is sloppy. It states that the input to the problem is a circuit C: but then, what is the role of the PDB and the query Q? Currently Definition 2.14 reads as follows: "Given a circuit C defining some polynomial Q(X), compute E[Q(W)]", and, thus, the PDB and the query play no role at all. All results in Section 4 seem to assume this simplified version of Definition 2.14. On the other hand, if one interprets the definition in the traditional framework of data complexity (Q is fixed, the inputs are D and C) then the problem is solvable in PTIME (and there is no need for C), since E[Q(W)] is the sum of expectations of the monomials in Q (this is mentioned in Example 1.2).}
We have rephrased \Cref{def:the-expected-multipl} to qualify data complexity. The paper (especially \Cref{sec:intro}) builds up the fact that we do not stop at polynomial time, but explore parameterized complexity and fine-grained analysis (as the reviewer aptly noted in the first comment).
\RCOMMENT{Second, Definition 2.6 of Reduced BIDB polynomials is simply wrong. It uses "mod" of two multivariate polynomials, but "mod" doesn't exists for multivariate polynomials...Either state Definition 2.6 directly, in an ad-hoc manner (which seems doable), or do a more thorough job grounding it in the ring of multivariate polynomials and its ideals.
}
We have implemented the reviewer's ad-hoc suggestion in light of Reviewer 1's similar suggestions.
\RCOMMENT{the paper uses three notations (UCQ, RA+, SPJU) for the same thing, and never defines formally any of them.}
We have chosen $\raPlus$ for consistent use throughout the paper. We have included \Cref{footnote:ra-def} on \Cpageref{footnote:ra-def} for an explicit definition of $\raPlus$ queries.
\RCOMMENT{}
\RCOMMENT{}
\RCOMMENT{}
\RCOMMENT{}
\RCOMMENT{}
\RCOMMENT{}
\RCOMMENT{}