%root: main.tex
%!TEX root=./main.tex
\definecolor{GrayRew}{gray}{0.85}
\newcommand{\RCOMMENT}[1]{\medskip\noindent \begin{tabular}{|p{\linewidth-3ex}}\rowcolor{GrayRew} #1 \end{tabular}\smallskip\\}
\section{Response to first cycle reviewer comments}
\label{sec:rebuttal}
This paper is a resubmission of our submission to the ICDT first cycle. We thank the reviewers for their insightful comments, which we believe have helped improve the presentation of the paper tremendously.
We use this section to document the changes that have been made since our prior submission, and in particular, how we have addressed reviewer comments (reviewer comments are shaded followed by our responses).
\subsection{Meta Review}
\RCOMMENT{Problem definition not stated rigorously nor motivated. Discussion needed on the standard PDB approach vs your approach.}
We rewrote \Cref{sec:intro} to specifically address this concern. The opening paragraph precisely and formally states the query evaluation problem in \abbrBPDB\xplural. We use a series of problem statements to clearly define the problem we are addressing as it relates to the query evaluation problem.
We made the concrete problem statements more precise by more clearly formalizing $\qruntime{Q, \dbbase}$ and stating our runtime objectives relative to it (\Cref{prob:informal},~\ref{prob:big-o-joint-steps},~\ref{prob:intro-stmt}).
%Notably, explicit discussion of provenance polynomials is limited to the proofs in the appendices.
We have included a discussion of the standard approach; see, e.g., the paragraph \textbf{Relationship to Set-Probabilistic Query Evaluation} on page 4.
\RCOMMENT{Definition 2.6 on reduced BIDB polynomials seem not the right tool for the studied problem.}
We have chosen to stick with a less formal, ad-hoc definition (please see \Cref{def:reduced-poly} and \Cref{def:reduced-bi-poly}) of the general problem as suggested by both Reviewer 1 and Reviewer 2. Our earlier proof of the current \Cref{lem:exp-poly-rpoly} (in the appendix) had a small bug, which has also been fixed.
\RCOMMENT{The paper is very difficult to read. Improvements are needed in particular for the presentation of the approximation results and their proofs. Also for the notation. Missing definitions for used notions need to be added. Ideally use one instead of three query languages (UCQ, RA+, SPJU).}
%\AH{How have we handled the presentation of the approximation results and their proofs?}
%\AR{Added text at the end of the answer
We have chosen one specific query language throughout the paper ($\raPlus$) and made a concerted effort to use clean, defined, non-ambiguous notation.
We have also simplified the notation by limiting the paper's use of provenance semirings (which are needed solely for proofs) to the appendix.
To the best of our knowledge, all notation conflicts have been addressed and definitions for all used notions have been added (see, e.g., how \Cref{def:Gk} appears before \Cref{lem:3m-G2} and \Cref{lem:lin-sys}).
After the rewrite of \Cref{sec:intro}, we had even less space for \Cref{sec:algo}. However, we have modified \Cref{sec:algo} so that it flows better. In particular, we start off with the algorithm idea first (paragraph \textbf{Overview of our Techniques} in \Cref{sec:intro} also has more details on the intuition behind the approximation algorithm) and then state the results (with more details on how we argue the claimed runtime). Finally, we clearly state \Cref{cor:approx-algo-punchline} for which queries our linear-time approximation result holds.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reviewer 1}
\RCOMMENT{l.24 "is \#W[1]-hard": parameterized by what?}
\RCOMMENT{l.103 and l.105: again, what is the parameter exactly?}
While the above references do not exist in the revised \Cref{sec:intro} anymore, all theorem statements and claims on \sharpwone runtime have been stated in a way so as to avoid ambiguity in the parameter. Please see e.g. \Cref{thm:k-match-hard} and \Cref{thm:mult-p-hard-result}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{You might want to explain your title somewhere (probably in the introduction): in the end, what exactly should be considered harmful and why?}
We have modified the title to be more descriptive.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{l.45 when discussing Dalvi and Suciu's dichotomy, you might want to mention that they consider *data complexity*. Currently the second sentence of your introduction ("take a query Q and a pdb D") suggests that you are considering combined complexity.}
We have made an explicit mention of data complexity when alluding to Dalvi and Suciu's dichotomy. We have further rewritten \Cref{sec:intro} in such a way as to explicitly note the type(s) of complexity we are considering (mostly it's parameterized complexity).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{l.51 "Consider ... and tuples are independent random event": so this is actually a set PDB... You might want to use an example where the input PDB is actually a bag PDB. The last sentence before the example makes the reader *expect* that the example will be of a bag PDB that is not a set PDB}
Our revision has removed the example referred to above. While the paper considers inputs to queries that are equivalent to set-\abbrPDB, this is not limiting. Please see \Cref{footnote:set-not-limit} on \Cpageref{footnote:set-not-limit}. Furthermore, we have added a discussion to the appendix that expands on why our results do extend beyond set inputs (\Cref{sec:gener-results-beyond}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{- In the case of set semantics, the lineage of a tuple can be defined for *any* query: it is the unique Boolean function that satisfies the if and only if property that you mention on line 70. For bag semantics however, to the best of my knowledge there is no general definition of what is a lineage for an arbitrary query. On line 73, it is not clear at all how the polynomial should be defined, since this will depend on the type of query that you consider}
Note that for set semantics, lineage as a positive Boolean formula is only defined for positive relational algebra; for instance, aggregate queries require a more powerful model~\cite{AD11d}.
The definition of the lineage polynomial (bag \abbrPDB) semantics over an arbitrary $\raPlus$ query $\query$ is now given in \Cref{fig:nxDBSemantics}.
We also note that these semantics are not novel (e.g., similar semantics appear for both provenance \cite{DBLP:conf/pods/GreenKT07} and probabilistic database \cite{kennedy:2010:icde:pip,FH12} contexts). %feng:2019:sigmod:uncertainty,
However, as we were unable to find a formal proof of the equivalence between the expectation of the query multiplicity and of the lineage polynomial in related work, we have included a proof of \Cref{prop:expection-of-polynom}.
\RCOMMENT{l.75 "evaluating the lineage of t over an assignment corresponding to a possible world": here, does the assignment assigns each tuple to true or false? In other words, do the variables X still represent individual tuples? From what I see later in the article it seems that no, so this is confusing if we compare to what is explained in the previous paragraph about set TIDB}
The discussion after \Cref{prob:bag-pdb-poly-expected} (in particular, the paragraph \textbf{\abbrTIDB\xplural}) specifically addresses these questions. While the values assigned in possible worlds are from $\{0, 1\}$, which is analogous to the Boolean setting, this is not limiting. Please see \Cref{footnote:set-not-limit} $\inparen{\Cpageref{footnote:set-not-limit}}$ and the new appendix section \Cref{sec:gener-results-beyond}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{- l.135 "polynomial Q(X)": Q should be reserved for queries... You could use $\varphi$ or $\phi$ or... anything else but Q really}
We now use $\poly\inparen{\vct{X}}$ for (lineage) polynomials.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{- If we consider data complexity (as did Dalvi and Suciu) and fix an UCQ Q, given as input a bag TIDB PDB we can always compute the lineage in $O(|D|^{|Q|})$ in SOP form and from there compute the expected multiplicity with the same complexity, so in polynomial time. How does this relate to your hardness result? Is it that you are only interested in combined complexity? Why one shouldn't be happy with this running time? Usually queries are much smaller than databases and this motivates studying data complexity.}
We have rewritten \Cref{sec:intro} in a way that stresses that we are primarily interested in data complexity, but we cannot stop there. As the reviewer has noted, the problem we explore requires further analysis, where we require parameterized and fine-grained complexity analysis to provide a theoretical foundation for the question we ask in \Cref{prob:informal}. We have discussed this in the prose following \Cref{prob:bag-pdb-poly-expected}.
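For concreteness, the reviewer's polynomial-time procedure for a fixed query can be sketched as follows (a minimal Python sketch; the lineage polynomial and marginal probabilities are hypothetical, and the SOP representation itself can have size exponential in the query, which is exactly the dependence our parameterized hardness result shows is unavoidable):

```python
import math
from itertools import product

# Hypothetical SOP lineage: Phi = X1*X2 + X1*X3, as a list of monomials.
sop = [("X1", "X2"), ("X1", "X3")]
# Hypothetical TIDB marginal probabilities (tuples present independently).
p = {"X1": 0.5, "X2": 0.4, "X3": 0.2}

def expected_multiplicity(sop, p):
    # Linearity of expectation: E[Phi] = sum over monomials of E[monomial].
    # For independent 0/1 variables X, X**k == X, so E[monomial] is the
    # product of marginals over the monomial's *distinct* variables.
    return sum(math.prod(p[v] for v in set(mono)) for mono in sop)

def brute_force(sop, p):
    # Sanity check: enumerate all 2^n possible worlds explicitly.
    names = sorted(p)
    total = 0.0
    for bits in product([0, 1], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = math.prod(p[v] if world[v] else 1 - p[v] for v in names)
        multiplicity = sum(math.prod(world[v] for v in mono) for mono in sop)
        total += weight * multiplicity
    return total
```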
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{A discussion is missing about the difference between the approach usually taken in PDB literature and your approach. In which case would one be more interested in the expected multiplicity or in the marginal probability of a tuple? This should be discussed clearly in the introduction, as currently there is no clear "motivation" to what you do. There is a section about related work at the end but it is mostly a set of facts and there is no insightful comparison to what you do.}
We provide more motivating examples in the first paragraph, and include a more detailed discussion of the relationship to sets in paragraph \textbf{Relationship to Set-Probabilistic Query Evaluation} after \Cref{prob:informal}.
For example, expected multiplicities can model expectation of a \lstinline{COUNT(*)} query, while in many contexts computing the probability that this count is non-zero is not that useful.
%As a trivial (albeit relevant) example, consider a model of a contact network.
%The probability that there exists at least one new COVID infection in the graph is far less informative than the expected number of new infections.
As we now explain in the introduction, another motivation for generalizing marginal probability to expected multiplicity is that it is a natural generalization. The marginal probability of a tuple $t$ is the expectation of a Boolean random variable that is assigned 1 in every world where tuple $t$ exists and $0$ otherwise. For bag-PDBs the multiplicity of a query result tuple can be modeled as a natural-number random variable that for a world $\db$ is assigned the multiplicity of the tuple in $\db$. Thus, a natural generalization of the marginal probability (expectation of a Boolean random variable) to bags is the expectation of this variable: the tuple's expected multiplicity.
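In symbols (a schematic restatement; we write $\mathcal{D}$ for the distribution over possible worlds and $\query(\db)(t)$ for the multiplicity of $t$ in $\query(\db)$):
\[
  \Pr\bigl[t \in \query(\db)\bigr] \;=\; \expct_{\db \sim \mathcal{D}}\Bigl[\mathbb{1}\bigl[t \in \query(\db)\bigr]\Bigr]
  \qquad\text{generalizes to}\qquad
  \expct_{\db \sim \mathcal{D}}\bigl[\query(\db)(t)\bigr].
\]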
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{l.176 "N[X] relations are closed under RA+": is this a *definition* of what it means to take an RA+ query and evaluate it over an N[X] database, or does this sentence say something more? Also, I think it would be clearer to use UCQs in the whole paper instead of constantly changing between UCQs, RA+ and SPJU formalisms}
To make the paper more accessible and general, we found it better to not use $\semNX$-DBs. While we wanted to use UCQ, we found the choice of $\raPlus$ to be more amenable to the presentation of the paper, and have, as suggested, stuck with one query formalism.
\RCOMMENT{There are too many things undefined in from l.182 to the end of page. l.182 and in Proposition 2.1 N-PDBs are not defined, the function mod is undefined, etc. The article should be self-contained: important definitions should be in the article and the appendix should only be used to hide proof details. I think it would not take a lot of space to properly define the main concepts that you are using, without hiding things in the appendix}
%We have done as the reviewer has suggested.
All material in \Cref{sec:background} that is proof-related is in the appendix, while \Cref{sec:background} (modulo the proofs) is itself now self-contained.
\RCOMMENT{l.622 and l.632-634: so a N-PDB is a PDB where each possible world is an N-database, but an N[X]-PDB is not a PDB where each possible world is an N[X]-database... Confusing notation}
The text now refers to the latter as \abbrNXPDB\xplural.
\RCOMMENT{If you want to be in the setting of bag PDBs, why not consider that the value of the variables are integers rather that Boolean? I.e., consider valuations $\nu: X \rightarrow$ N (or even to R, why not?) instead of $X \rightarrow \{0,1\}$; this would seem more natural to me than having this ad-hoc "mix" of Boolean and non-Boolean setting. If you consider this then your "reduced polynomial" trick does not seem to work anymore.}
Our objective is to establish the feasibility of bag-probabilistic databases as compared to existing deterministic query processing systems.
Accordingly, we take our input model from production database systems like PostgreSQL, Oracle, DB2, SQL Server, etc. (e.g., see \Cref{footnote:set-not-limit} on \Cpageref{footnote:set-not-limit}), where duplicate tuples are represented as independent entities.
As a convenient benefit, this leads to a direct translation of TIDBs (which are defined over $\{0,1\}$ inputs).
Finally, as we mentioned earlier, an easy generalization exists to encode a \abbrBPDB in a set-\abbrPDB (which then allows for bag inputs); see \Cref{sec:gener-results-beyond}.
\RCOMMENT{- l.656 "Thus, from now on we will solely use such vectors...": this seems to
be false. Moreover you keep switching notation which makes it very hard to read... Sometimes it is $\varphi$, sometimes it is small w, sometimes it is big W (l.174 or l.722), sometimes the database is $\varphi(D)$, sometimes it is $\varphi_w(D)$, other times it is $D_{[w]}$ (l.671), and so on.}
We have made an effort to use notation consistently, following standard usage whenever possible.
\RCOMMENT{l.658 "we use $\varphi(D)$ to denote the semiring homomorphism $\semNX \rightarrow \semN$
that...": I don't understand why you need a database to extend an assignment to its semiring homomorphism from $\semNX \rightarrow \semN$}
$\varphi$~\cite{DBLP:conf/pods/GreenKT07} lifts the valuation function (with kind $\semNX \rightarrow \semN$) to databases (i.e., a mapping from an $\semNX$-DB to a deterministic $\semN$-DB).
We note that the main body of the paper no longer references $\semNX$-DBs, and thus $\varphi$ is discussed exclusively in the appendix.
\RCOMMENT{Figure 2, K is undefined}
We have updated \Cref{fig:nxDBSemantics} (originally figure 2) to not need $K$.
\RCOMMENT{l.178 "$Q_t$", l.189 "Q will denote a polynomial": this is a very poor choice of notation}
\RCOMMENT{l.242 "and query Q": is Q a query or a lineage?}
We have reserved $\query$ to mean an $\raPlus$ query and nothing else.
\RCOMMENT{Section 2.1.1: here you are considering set semantics no? Otherwise, one would think that for bag semantics the annotation of a tuple could be 0 or something of the form c $\times$ X, where X is a variable and c is a natural number}
Please see \Cref{sec:gener-results-beyond} for a discussion on going beyond set inputs.
\RCOMMENT{Proof of Proposition A.3. I seems the proof should end after l.687, since you already proved everything from the statement of the proposition. I don't understand what it is that you do after this line.}
This text is an informal proof of \Cref{prop:expection-of-polynom} originally intended to motivate \Cref{prop:semnx-pdbs-are-a-}.
We agree that this should not be part of the proof of the latter, and have removed the text.
\RCOMMENT{l.686 "The closure of ... over K-relations": you should give more details on this part. It is not obvious to me that the relations from l.646 hold.}
The core of this (otherwise trivial) argument, that semiring homomorphisms commute through queries, was already proven in \cite{DBLP:conf/pods/GreenKT07}. We now make this reference explicit.
We apologize for not explaining this in more detail. In universal algebra~\cite{graetzer-08-un}, it has been proven (the HSP theorem) that for any variety, i.e., the set of all structures (called objects) with a certain signature that obey a set of equational laws, there exists a ``most general'' object called the \emph{free object}. The elements of the free object are equivalence classes (with respect to the laws of the variety) of symbolic expressions over a set of variables $\vct{X}$ built from the operations of the structure. The operations of the free object combine symbolic expressions using the corresponding operation. It has been shown that for any other object $K$ of a variety, any assignment $\phi: \vct{X} \to K$ uniquely extends to a homomorphism from the free object to $K$ by substituting variables according to $\phi$ in the symbolic expression and then evaluating the resulting expression in $K$.
Commutative semirings form a variety where $\semNX$ is the free object. Thus, for any polynomial (element of $\semNX$) and any assignment $\phi: \vct{X} \to \semN$ (also a semiring), there exists a unique semiring homomorphism $\textsc{Eval}_{\phi}: \semNX \to \semN$. Homomorphisms by definition commute with the operations of a semiring. Green et al.~\cite{DBLP:conf/pods/GreenKT07} proved that semiring homomorphisms extend to homomorphisms over $K$-relations (by applying the homomorphism to each tuple's annotation) and that these homomorphisms over $K$-relations commute with queries.
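As a concrete (illustrative) instance: take the polynomial $2XY + X^2 \in \semNX$ and the assignment $\phi(X) = 2$, $\phi(Y) = 3$. Then
\[
  \textsc{Eval}_{\phi}\inparen{2XY + X^2} \;=\; 2 \cdot 2 \cdot 3 + 2^2 \;=\; 16,
\]
and $\textsc{Eval}_{\phi}$ indeed preserves $+$ and $\times$ (e.g., $\textsc{Eval}_{\phi}\inparen{XY} \cdot \textsc{Eval}_{\phi}\inparen{X} = 6 \cdot 2 = 12 = \textsc{Eval}_{\phi}\inparen{X^2Y}$), as any semiring homomorphism must.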
\RCOMMENT{l.711 "As already noted...": ah? I don't see where you define which subclass of N[X]-PDBs define bag version of TIDBs. If this is supposed to be in Section 2.1.1 this is not clear, since the world "bag" does not even appear there (and as already mentioned everything seems to be set semantics in this section). I fact, nowhere in the article can I see a definition of what are bag TIDBs/BIDBs}
The new text precisely defines TIDBs (\Cref{sec:intro}), and the BIDB generalization (\Cref{subsec:tidbs-and-bidbs}). The specific text referenced in this comment has now been moved to the appendix and restructured to reference \Cref{def:semnx-pdbs} (which defines an \abbrNXPDB defined over variables $\vct{X}$) and relate it to the formal structure of BIDBs in \Cref{subsec:tidbs-and-bidbs}.
\RCOMMENT{- l.707 "the sum of the probabilities of all the tuples in the same block b is 1": no, traditionally it can be less than 1, which means that there could be no tuple in the block.}
The reviewer is correct and we have updated our appendix text accordingly. %\AR{Um, no we haven't. {\bf TODO for Atri:} do this.}
\RCOMMENT{it is not clear to me how you can go from l.733 to l.736, which is sad because this is actually the whole point of this proof. If I understand correctly, in l.733, Q(D)(t) is the polynomial annotation of t when you use the semantics of Figure 2 with the semiring K being N[X], so I don't see how you go from this to l.736}
This result follows from the inner sum looping only over $\vct{w}$ s.t. $\assign_{\vct{w}}(\pxdb) = \db$. As a consequence of this constraint, we have $Q(\db_{\semNX})(t)(\vct{w}) = \query(\db)(t)$. The latter term is independent of the summation, and so can be pulled out by distributivity of multiplication over addition.
We agree with the reviewer that this could be presented more clearly, and have now split the distributivity argument into a separate step.
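To make the factoring step explicit, schematically (writing $\Pr[\vct{w}]$ for the probability of the world vector $\vct{w}$):
\[
  \sum_{\db} \; \sum_{\vct{w} \,:\, \assign_{\vct{w}}(\pxdb) = \db} \query(\db_{\semNX})(t)(\vct{w}) \cdot \Pr[\vct{w}]
  \;=\;
  \sum_{\db} \query(\db)(t) \sum_{\vct{w} \,:\, \assign_{\vct{w}}(\pxdb) = \db} \Pr[\vct{w}],
\]
where the equality uses $\query(\db_{\semNX})(t)(\vct{w}) = \query(\db)(t)$ for every $\vct{w}$ in the inner sum, after which this term factors out.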
\RCOMMENT{l.209-227: so you define what is a polynomial and what is the degree of a polynomial (things that everyone knows), but you don't bother explaining what "taking the mod of Q(X) over all polynomials in S" means? This is a bit weird.}
Based on this and other reviewer comments, we removed the earlier definition of $\rpoly\inparen{\vct{X}}$ and have defined it in a more ad-hoc manner, as suggested by the reviewers, including the comment immediately following.
\RCOMMENT{Definition 2.6: to me, using polynomial long division to define $\tilde{Q}$(X) seems like a pedantic way of reformulating something similar to Definition 1.3, which was perfectly fine and understandable already! You could just define $\tilde{Q}$(X) to set all exponents in the SOP that are >1 to 1 and to remove all monomials with variables from the same block, or using Lemma A.4 as a definition?}
As alluded to above, we have incorporated the reviewer's suggestion, c.f. \Cref{def:reduced-poly} and \Cref{def:reduced-bi-poly}.
\RCOMMENT{Definition 2.14. It is not clear what is the input exactly. Are the query Q and database D fixed? Moreover, I have the impression that your hardness results have nothing to do with lineages and that you don't need them to express your results. I think the problem you should consider is simply the following: Expected Multiplicity Problem: Input: query Q, N[X]-database D, tuple t. Output: expected multiplicity of t in Q(D). Your main hardness result would then look like this: the Expected
Multiplicity problem restricted to conjunctive queries is \#W[1]-hard, parameterized by query size. Indeed if I look at the proof, all you need is the queries $Q^k_G$. The problem is \#W[1]-hard and it should not matter how one tries to solve it: using an approach with lineages or using anything else.
Currently it is confusing because you make it look like the problem is hard only when you consider general arithmetic circuits, but your hardness proof has nothing to do with circuits. Moreover, it is not surprising that computing the expected output of an arithmetic circuit is hard: it is trivial, given a CNF $\phi$, to build an arithmetic circuit C such that for any valuation $\nu$ of the variables the formula $\phi$ evaluates to True under $\nu$ if C evaluates to 1 and the formula $\phi$ evaluates to False under $\nu$ if C evaluates to 0, so this problem is \sharpphard anyways.}
The reviewer is correct. Our hardness results are now stated independently of circuits. We note that the hardness result alluded to at the end of the comment above is not applicable in our case since for fixed queries $\query$, \Cref{prob:bag-pdb-query-eval} and \Cref{prob:bag-pdb-poly-expected} can be solved in polynomial time.
Further, as we point out in \Cref{sec:intro}, what is new in our hardness results is that we show a query $Q^k$ such that $\qruntime{\query^k,\dbbase}$ is small (linear in $\abs{\dbbase}$) but solving \Cref{prob:bag-pdb-query-eval} and \Cref{prob:bag-pdb-poly-expected} is hard. We note that it is well-known that one can reduce the problem of counting $k$-cliques or $k$-matchings to a query $\query$ for which computing $\query(\dbbase)$ is $\sharpwone$-hard. So our contribution is to come up with a different reduction from counting $k$-matchings so that the hardness manifests itself in the probabilistic computation part of our problem.
%We have rewritten \Cref{sec:intro} with a series of refined problem statements to show that the problem we explore and the results we obtain directly involve lineage polynomials. The reviewer is correct that the output is the expected multiplicity, and we hope that our updated presentation of the paper makes it clear that $\expct_{\vct{\randWorld}\sim\pdassign}\pbox{\apolyqdt\inparen{\vct{\randWorld}}}$ is indeed the expected multiplicity spoken of. We have also addressed the ambiguity in the complexity we are focusing on, both explicitly in the intro and in the revised definition, \Cref{def:the-expected-multipl}.
%
%Regarding the use of circuits, it is true that our hardness results do not require circuits while our approximation algorithm and cost model both rely on circuits. We have adjusted our presentation (e.g. the segway between \Cref{prob:informal} and \Cref{prob:big-o-joint-steps}) to make this distinction clear and eliminate any confusion.
\RCOMMENT{Section 3.3. It seems to me the important part of this section is not so much the fact that we have fixed values of p but that the query is now fixed and that you are looking at the fine-grained complexity. If what you really cared about was having fixed value of p, then the result of this section should be exactly like the one in Theorem 3.4, but starting with "fix p". So something like "Fix p. Computing $\tilde{Q}^k_G$ for arbitrary G is \#W1-hard".}
We agree with the reviewer that the result on a fixed value of $p$ is mostly of (narrow) theoretical interest. We have added a discussion below \Cref{th:single-p-hard} summarizing the reviewer's point.
\RCOMMENT{General remark: The story of the paper I think should be this: we can always compute the expected multiplicity for a UCQ Q and N[X]-database D and tuple t by first computing the lineage in SOP form and then using linearity of expectation, which gives an upper bound of (roughly) $O(|D|^{|Q|})$. We show that this exponential dependence in |Q| is unavoidable by proving that this problem is \#W1 hard parameterized by |Q| (which implies that we cannot solve it in $f(|Q|) |D|^c$ ). Furthermore we obtain fine-grained superlinear lower bounds for a fix conjunctive query Q. (Observe how up to here, there is no need to talk about lineages at all). We then obtain an approximation algorithm for this problem for [this class of queries] and [that class of bag PDBs] with [that running time (Q,D)]. The method is to first compute the lineage as an arithmetic circuit C in [this running time (Q,D)], and then from the arithmetic circuit C compute in [running time(C)] an approximation of its expected output. Currently I don't understand to which queries your approximation algorithm can be applied (see later comments).}
%We have followed the suggestions of the reviewer to delineate between the `coarse' polynomial time and the fine grained complexity analysis. We found it necessary to introduce polynomials earlier since our hard query, hardness results, and their proofs are easier to present (and we feel make the paper more accessible) than doing so without the lineage polynomials.
%We have taken pains to be very clear that this work only considers $\raPlus$ queries, adding a reminder to this end in the first paragraph of \Cref{sec:algo}.
%\AH{We need to address the last line of the reviewer's comment. Also, not sure if I answered the comment perfectly.}
We have restructured \Cref{sec:intro} to more or less follow the reviewer's outline above. The only deviation is that we still introduce lineage polynomials. We do this because the polynomial view is very helpful in the proofs of our hardness result (in addition to the obvious relevance for the approximation algorithm). We have also clarified that our approximation result applies to all $\raPlus$ queries (see \Cref{cor:approx-algo-punchline}).
\RCOMMENT{l.381: Here again, I think it would be simpler to consider that the input of the problem is the query, the database and a tuple and claim that you can compute an approximation of the expected multiplicity in linear time. The algo is to first compute the lineage as an arithmetic circuit, and then to use what you currently use (which could be put in a lemma or in a proposition).}
We have implemented the above overview in \Cref{sec:intro} when we move from \Cref{prob:informal} to \Cref{prob:intro-stmt}. For the approximation algorithm we focus on \Cref{prob:intro-stmt}, which still takes a circuit as an input.
%Our appoximation algorithm assumes an input circuit \circuit that has been computed via an arbitrary $\raPlus$ query $\query$ and arbitrary \abbrBIDB $\pdb$. We have included prose to describe this at the beginning of {sec:algo:sub:main-result}.
\RCOMMENT{Definition 4.2: would you mind giving an intuition of what this is? It is not OK to define something and just tell the reader to refer the appendix to understand what this is and why this is needed; the article should be understandable without having to look at the appendix. It is simply something that gives the coefficient of each monomial in the reduced polynomial?}
We have provided an example directly after \Cref{def:expand-circuit} as well as a sentence pointing out why this definition is useful. %\AR{This is not enough: we need a line for {\em why} this notation is useful. {\bf TODO for Atri}}
\RCOMMENT{- l.409: how does it matter that the circuit C is the lineage of a UCQ? Doesn't this work for any arithmetic circuit?}
The reviewer is correct that the earlier Theorem 4.9 works for any circuit (this result is now in the appendix).
%The reviewer is correct that our approximation results apply to $\raPlus$ queries over \abbrBIDB\xplural. This we specify this in the formal statements of \Cref{sec:algo}, e.g. see \Cref{def:param-gamma} and \Cref{cor:approx-algo-const-p}.
%More specifically, our proofs rely on (i) circuits with a bounded polynomial degree (we use a slightly non standard definition of degree --- \Cref{def:degree}), which is the case for any circuit resulting from an $\raPlus$ query; and (ii) specific assumptions about variable independence, which hold when the input to the query is a BIDB.
\RCOMMENT{l.411: what are $|C|^2(1,...,1)$ and $|C|(1,...,1)$? }
We clarify this overloaded notation immediately after \Cref{def:positive-circuit}.
\RCOMMENT{Sometimes you consider UCQs, sometimes RA+ queries. I think it would be simpler if you stick to one formalism (probably UCQs is cleaner?)}
As alluded to previously, we have followed the reviewer's suggestion and have found $\raPlus$ queries to be most amenable for this work.
\RCOMMENT{l.432 what is an FAQ query?}
We actually no longer need that result since \Cref{lem:val-ub} now has a bound on $|\circuit|(1,\dots,1)$ in terms of $\depth(\circuit)$ and the latter is used in \Cref{cor:approx-algo-punchline} for all $\raPlus$ queries. Please see \Cref{lem:val-ub} and the followup discussion for more on this.
\RCOMMENT{Generally speaking, I think I don't understand much about Section 4, and the convolutedness of the appendix does not help to understand. I don't even see in which result you get a linear runtime and to which queries the linear runtime applies. Somewhere there should be a corollary that clearly states a linear time approximation algorithm for some queries.}
We have re-organized \Cref{sec:algo} to address the above comments as follows:
\begin{itemize}
\item We now start off \Cref{sec:algo:sub:main-result} with the algorithm idea.
\item We give a quick overview of how the claimed runtime follows from the algorithm idea mentioned above.
\item We added \Cref{cor:approx-algo-punchline}, which clearly states that we get an $O(\qruntime{\query,\dbbase})$ runtime for {\em all} $\raPlus$ queries $\query$.
\end{itemize}
\RCOMMENT{In section 5, it seems you are arguing that we can compute lineages as arithmetic circuits at the same time as we would be running an ordinary query evaluation plan. How is that different from using the relations in Figure 2 for computing the lineage?}
There is not a major difference between the two. This observation has persuaded us to eliminate $\semNX$-DB query evaluation and to present only an algorithm for computing the lineage.
We have also re-organized the earlier Section 5 and moved the definition of $\qruntime{\cdot}$ (earlier denoted as $\mathbf{cost}(\cdot)$) to \Cref{sec:gen} and moved the rest of the material to the appendix.
\RCOMMENT{l.679 where do you use $max(D_i)$ later in the proof?}
Thank you. This reference was unnecessary and has been removed.
\RCOMMENT{l.688 That sentence is hard to parse, consider reformulating it}
As the reviewer notes above, this paragraph is unnecessary and we have removed it.
\RCOMMENT{it seems you are defining N[X]-PDB at two places in the appendix: once near l.632, and another time near l.652}
Thank you. The latter definition has been removed.
\subsection{Reviewer 2}
\RCOMMENT{First, the paper should state rigorously the problem definition. There are three well-known definitions in database theory: data complexity, combined complexity, and parameterized complexity. If I understand correctly, Theorem 3.4 refers to the parameterized complexity, Theorem 3.6 refers to the data complexity (of a fixed query), while the positive results in Sec. 4 (e.g. Th. 4.8) introduce yet another notion of complexity, which requires discussion.}
We have addressed these concerns by rewriting the entirety of \Cref{sec:intro}: we explicitly state the complexity metrics we consider, and we present a series of problem statements that describe the exact problem we address. We have also adjusted the phrasing of the theorems and definitions in question to eliminate the ambiguity described.
\RCOMMENT{The problem definition is supposed to be in Definition 2.14, but this definition is sloppy. It states that the input to the problem is a circuit C: but then, what is the role of the PDB and the query Q? Currently Definition 2.14 reads as follows: "Given a circuit C defining some polynomial Q(X), compute E[Q(W)]", and, thus, the PDB and the query play no role at all. All results in Section 4 seem to assume this simplified version of Definition 2.14. On the other hand, if one interprets the definition in the traditional framework of data complexity (Q is fixed, the inputs are D and C) then the problem is solvable in PTIME (and there is no need for C), since E[Q(W)] is the sum of expectations of the monomials in Q (this is mentioned in Example 1.2).}
We have rephrased \Cref{def:the-expected-multipl} to clarify its relationship to data complexity. The paper (especially \Cref{sec:intro}) now makes explicit that we do not stop at polynomial time, but explore parameterized complexity and fine-grained analysis (as the reviewer aptly noted in their first comment).
\RCOMMENT{Second, Definition 2.6 of Reduced BIDB polynomials is simply wrong. It uses "mod" of two multivariate polynomials, but "mod" doesn't exists for multivariate polynomials...Either state Definition 2.6 directly, in an ad-hoc manner (which seems doable), or do a more thorough job grounding it in the ring of multivariate polynomials and its ideals.}
The reviewer is correct in their comment on the ``mod'' part---we apologize for the error.
We have implemented the reviewer's ad-hoc suggestion in light of Reviewer 1's similar suggestions.
\RCOMMENT{the paper uses three notations (UCQ, RA+, SPJU) for the same thing, and never defines formally any of them.}
We have chosen $\raPlus$ for consistent use throughout the paper. We have included \Cref{footnote:ra-def} on \Cpageref{footnote:ra-def} for an explicit definition of $\raPlus$ queries.
\RCOMMENT{$G^{\ell}$ is used in Lemma 3.8 but defined only in the Appendix (Def. B.2), without even a forward pointer. This is a major omission: Lemma 3.8 is a key step for a key result, but it is impossible to read.}
We have fixed this mistake. Unfortunately, because of the changes in the paper (especially expanding on \Cref{sec:intro}), the earlier Lemma 3.8 had to be moved to the appendix.
\RCOMMENT{Definition 2.7. "valid worlds $\eta$". This is confusing. A "possible world" is an element of $\idb$: this is not stated explicitly in the paper, but it is implicit on line 163, so I assumed that possible worlds refer to elements of $\idb$. If I assumed correctly, then calling $\eta$ a "world" in Def. 2.7 is misleading, because $\eta$ is not an element of $\idb$. More, it is unclear to me why this definition is needed: it is used right below, in Lemma 2.8, but that lemma seems to continue to hold even if w is not restricted.}
We agree with the reviewer that this notation is confusing;
$\eta$ is meant to cope with the fact that tuples from the same group in a BIDB cannot co-exist, even though our $\{0,1\}$-input vectors can encode such worlds.
We now address this constraint by embedding it directly into the reduced polynomial with \Cref{def:reduced-bi-poly}.
\RCOMMENT{line 305: please define what is an "occurrence of H in G". It could mean: a homomorphic image, a subgraph of G isomorphic to H, an induced subgraph of G isomorphic to H, or maybe something else.}
We agree with the reviewer's suggestion and have rephrased the wording to be clear. Please see the beginning of \Cref{sec:hard:sub:pre}.
\RCOMMENT{If the proofs are given in the appendix, please say so. Lemmas 3.5 and 3.8 are stated without any mention, and one has to guess whether they are obvious, or proven somewhere else. On this note: I found Lemma 3.5 quite easy, since the number of k-matching is the coefficient of the leading monomial (of degree 2k) in $Q^k(p,p,...,p)$, while Lemma 3.8 appears much harder. It would help to briefly mention this in the main body of the paper.}
We have implemented the reviewer's suggestion. Please see the last sentence of \Cref{sec:intro} (as well as the expanded discussion on the hardness result in the {\bf Overview of our Techniques} part of \Cref{sec:intro}).
\RCOMMENT{line 177: what is $\Omega_{\semNX}$?}
We have eliminated the use of $\semNX$-DBs in the paper proper, using them only when necessary in the proofs of the appendix.
\RCOMMENT{line 217. The polynomial $X^2 + 2XY + Y^2$ is a poor choice to illustrate the degree. There are two standard definitions of the degree of a multivariate polynomial, and one has to always clarify which one is meant. One definition is the total degree (which is Def. 2.3 in the paper), the other is the maximum degree of any single variable. It is nice that you are trying to clarify for the reader which definition you are using, but the polynomial $X^2 + 2XY + Y^2$ is worst choice, since here the two coincide.}
We have adjusted the example to account for the reviewer's correct observation.
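To illustrate the distinction the reviewer raises (this is an illustrative polynomial, not necessarily the exact one now used in the paper): for
\[
  X^2Y + Y,
\]
the total degree is $3$ (attained by the monomial $X^2Y$), while the maximum degree of any single variable is $2$ (attained by $X$). The two notions thus differ here, unlike for $X^2 + 2XY + Y^2$, where both equal $2$.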
\RCOMMENT{line 220. "we consider only finite degree polynomials". This is a surprise. Polynomials, by definition, are of finite degree; there are extensions (I'm aware of powerseries, maybe you have other extensions in mind), but they are usually not called polynomials, plus, nothing in the paper so far suggests that it might refer to those extensions.}
We have removed the redundant terminology the reviewer has pointed out, and refined the discussion surrounding (and including) \Cref{eq:sop-form} to be explicit to the novice reader that polynomials are by definition of finite degree.
\RCOMMENT{"Note that our hardness results even hold for the expression trees". At this point we haven't seen the hardness results, nor their proofs, and we don't know what expression trees are. It's unclear what we can note.}
Our hardness results are now stated independently of circuits so the above statement no longer appears in the paper.
\RCOMMENT{paragraph at the top of pp.10 is confusing. My guess is that it is trying to this say: "there exists a query Q, such that, for each graph G, there exists a database D s.t. the lineage of Q on D is the polynomial $Q_G$."}
Our revision has eliminated this statement.
\subsection{Reviewer 3}
\RCOMMENT{The overall study is then extended to a multiplicative approximation algorithm for the expectation of polynomial circuits in linear time in the size of the polynomial. It was much harder to read this part, and I found the examples and flow in the appendix quite helpful. I suggest to include these examples into the body of the paper. }
In our revision we expanded \Cref{sec:intro} to give a better overview of the problems we consider in this paper. This required cutting material from later sections, and unfortunately we no longer had space in \Cref{sec:algo} for the examples the reviewer suggests above. However, we have tried to make \Cref{sec:algo} more readable as a whole.
\RCOMMENT{While ApproximateQ is linear in the size of the circuit, it is quadratic in epsilon and so we need quadratically many samples for the desired accuracy -- overall runtime is not linear therefore and it may be better to elaborate this. It may also be helpful to comment on how this relates to Karp, Luby, Madras algorithm [1] for \#DNF which is also quadratic in epsilon.}
In \Cref{prob:big-o-joint-steps} we note explicitly that we care about linear dependence on $\qruntime{\query,\dbbase}$ and do not care about the exact dependence on $\epsilon$. While it would be nice to design an approximation algorithm that is also linear in $1/\epsilon$, we believe this is out of scope for this initial work.
\RCOMMENT{The coverage of related work is adequate. Fink et. al seems as the closest related work to me and I would appreciate a more elaborate comparison with this paper. My understanding is that Fink et. al considers exact evaluation only and focuses on knowledge compilation techniques based on decompositions. They also note that "Expected values can lead to unintuitive query answers, for instance when data values and their probabilities follow skewed and non-aligned distributions" attributed to [2]. Does this apply to the current work? Can you please comment on this?}
The work is indeed quite close to our own.
It targets a broader class of queries (supporting SUM, MIN, and MAX aggregates in addition to our narrower focus on COUNT), but its theoretical results are significantly less general.
Most notably, their proof of linear runtime in the size of the input polynomial is based on a tree-based encoding of the polynomial.
A tree-based representation (and hence the runtime of the Fink et al.\ algorithm) is, as we note several times, superlinear in $\qruntime{\query, \dbbase}$.
This result is also limited to a specific class of (hierarchical) queries, devolving to exponential time (as in \cite{FH13}) in general.
By contrast, our results apply to all of $\raPlus$.
Our revised related work section now addresses both points.
\RCOMMENT{I assume the authors focus on parameterized complexity throughout the paper, and even this is not stated unambiguously. The authors should make an extra effort to make the paper more accessible by using the explanations and examples from the appendix in the body of the paper. It is also important to highlight the differences with the complexity of standard query evaluation over PDBs.}
Our revision focuses on explicitly stating the complexity metrics we are interested in. This can be seen in, e.g., \Cref{sec:intro} and in the formal statements (theorems, lemmas, etc.), which have been rewritten to eliminate ambiguities.
We have also taken pains to promote accessibility, keeping the paper self-contained and using examples for difficult or potentially unclear concepts. This can be seen in, e.g., eliminating unnecessary machinery ($\semNX$-DB machinery from the paper proper), providing or modifying examples (c.f. \Cref{def:expand-circuit}, \Cref{def:degree}), and ensuring notational consistency, e.g. using a single query evaluation formalism ($\raPlus$).
We decided to focus on beefing up \Cref{sec:intro} and cleaning up definitions of problems, which unfortunately meant we ran out of space to bring back examples from the appendix (especially into \Cref{sec:algo}).
\subsection{Reviewer 4}
\RCOMMENT{I wonder whether the writing could be revisited to give the reader a better overview of the technical challenges, motivation, and the high level ideas of the algorithm and hardness results. The current exposition seems slightly too tailored for the expert in probabilistic databases rather than the average ICDT attendee. Also the current exposition is structured such that the reader needs to get through quite a few definitions and technical lemmas until they get to the new ideas in the paper.}
We have (as noted throughout this section) revised the writing to bring precision and clarity both to the problem we explore and to the results we obtain. Part of this revision was a complete rewrite of \Cref{sec:intro}, where we sought to be precise in our language and, through a series of problem statements, to help the reader understand the problem we explore, how we go about exploring it, and why that strategy is valid. We have simultaneously sought to make the paper more accessible to the average ICDT attendee by defining or explaining concepts that might not be known to them. Finally, we have expanded the {\bf Overview of our Techniques} part of \Cref{sec:intro} to provide more intuition on how we prove our lower and upper bounds.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"
%%% End: