Boris Glavic 2021-09-19 21:00:40 -05:00
parent 5ee5943863
commit 3c3041b48d


@ -19,47 +19,58 @@ We have chosen one specific query language throughout the paper ($\raPlus$) and
We have also simplified the notation by limiting the paper's use of provenance semirings (which are needed solely for proofs) to the appendix.
To the best of our knowledge, all notation conflicts have been resolved and definitions have been added for all notions used (e.g., \Cref{def:Gk} now appears before \Cref{lem:3m-G2} and \Cref{lem:lin-sys}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reviewer 1}
\RCOMMENT{l.24 "is \#W[1]-hard": parameterized by what?}
\RCOMMENT{l.103 and l.105: again, what is the parameter exactly?}
While the above reference does not exist in the revised \Cref{sec:intro} anymore, all theorem statements and claims on \sharpwone runtime have been stated in a way so as to avoid ambiguity in the parameter. Please see e.g. \Cref{thm:k-match-hard} and \Cref{thm:mult-p-hard-result}
While the above reference does not exist in the revised \Cref{sec:intro} anymore, all theorem statements and claims on \sharpwone runtime have been stated in a way so as to avoid ambiguity in the parameter. Please see e.g. \Cref{thm:k-match-hard} and \Cref{thm:mult-p-hard-result}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{You might want to explain your title somewhere (probably in the introduction): in the end, what exactly should be considered harmful and why?}
We have modified the title to better aptly describe our body of work.
We have modified the title to be more descriptive.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{l.45 when discussing Dalvi and Suciu's dichotomy, you might want to mention that they consider *data complexity*. Currently the second sentence of your introduction ("take a query Q and a pdb D") suggests that you are considering combined complexity.}
We have made an explicit mention of data complexity when alluding to Dalvi and Suciu's dichotomy. We have further rewritten \Cref{sec:intro} in such a way as to explicitly note the type(s) of complexity we are considering.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{l.51 "Consider ... and tuples are independent random event": so this is actually a set PDB... You might want to use an example where the input PDB is actually a bag PDB. The last sentence before the example makes the reader *expect* that the example will be of a bag PDB that is not a set PDB}
Our revision has removed the example referred to above. While the paper considers inputs to queries that are equivalent to set-\abbrPDB, we have concluded this not to be limiting. Please see \Cref{footnote:set-not-limit} on \Cpageref{footnote:set-not-limit}.
Our revision has removed the example referred to above. While the paper considers inputs to queries that are equivalent to set-\abbrPDB, we have concluded that this is not limiting. Please see \Cref{footnote:set-not-limit} on \Cpageref{footnote:set-not-limit}. Furthermore, we have added a discussion to the appendix that expands on why our results do extend beyond set inputs.\BG{Add reference}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{- In the case of set semantics, the lineage of a tuple can be defined for *any* query: it is the unique Boolean function that satisfies the if and only if property that you mention on line 70. For bag semantics however, to the best of my knowledge there is no general definition of what is a lineage for an arbitrary query. On line 73, it is not clear at all how the polynomial should be defined, since this will depend on the type of query that you consider}
Note that the lineage of a set-semantics query as a positive Boolean formula is defined only for positive relational algebra. For instance, aggregate queries require a more powerful model~\cite{AD11d}.
The definition of the lineage polynomial (bag \abbrPDB) semantics over an arbitrary $\raPlus$ query $\query$ is modeled in \Cref{fig:nxDBSemantics}.
We also note that these semantics are not novel (e.g., similar semantics appear for both provenance \cite{DBLP:conf/pods/GreenKT07} and probabilistic database \cite{feng:2019:sigmod:uncertainty,kennedy:2010:icde:pip} contexts).
However, as we were likewise unable to find a formal proof of equivalence between the expectation of the query multiplicity and of the lineage polynomial, we prove it with \Cref{prop:expection-of-polynom}.
We also note that these semantics are not novel (e.g., similar semantics appear for both provenance \cite{DBLP:conf/pods/GreenKT07} and probabilistic database \cite{kennedy:2010:icde:pip,FH12} contexts). %feng:2019:sigmod:uncertainty,
However, as we were unable to find a formal proof of the equivalence between the expectation of the query multiplicity and of the lineage polynomial in related work, we have included a proof in \Cref{prop:expection-of-polynom}.
\RCOMMENT{l.75 "evaluating the lineage of t over an assignment corresponding to a possible world": here, does the assignment assigns each tuple to true or false? In other words, do the variables X still represent individual tuples? From what I see later in the article it seems that no, so this is confusing if we compare to what is explained in the previous paragraph about set TIDB}
The discussion after \Cref{prob:bag-pdb-poly-expected} (in particular, the paragraph \textbf{\abbrTIDB\xplural}) specifically address these questions. While values for possible worlds assigned are from $\{0, 1\}$, which is analog to boolean, but note \Cref{footnote:set-not-limit}, which descibes the encoding of a bag in a set.
The discussion after \Cref{prob:bag-pdb-poly-expected} (in particular, the paragraph \textbf{\abbrTIDB\xplural}) specifically addresses these questions. The values assigned in possible worlds are from $\{0, 1\}$, which is analogous to the Boolean case, but note \Cref{footnote:set-not-limit} and the new appendix \BG{add reference} \BG{REMOVED: which describes the encoding of a bag as a set.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{- l.135 "polynomial Q(X)": Q should be reserved for queries... You could use $\varphi$ or $\phi$ or... anything else but Q really}
We have decided to use $\poly\inparen{\vct{X}}$.
We now use $\poly\inparen{\vct{X}}$ for polynomials.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{- If we consider data complexity (as did Dalvi and Suciu) and fix an UCQ Q, given as input a bag TID PDB we can always compute the lineage in $O(|D|^|Q|)$ in SOP form and from there compute the expected multiplicity with the same complexity, so in polynomial time. How does this relate to your hardness result? Is it that you are only interested in combined complexity? Why one shouldn't be happy with this running time? Usually queries are much smaller than databases and this motivates studying data complexity.}
We have rewritten \Cref{sec:intro} in a way to stress that we are are primarily interested in data complexity, but we cannot stop there. As the reviewer has noted, the problem we explore requires further analysis, where we require parameterized and fine grained complexity analysis to provide theoretical foundation for the question we ask in \Cref{prob:informal}. We have discussed this in the prose following \Cref{prob:bag-pdb-poly-expected}.
We have rewritten \Cref{sec:intro} to stress that we are primarily interested in data complexity, but we cannot stop there. As the reviewer has noted, the problem we explore requires further analysis: parameterized and fine-grained complexity analysis is needed to provide a theoretical foundation for the question we ask in \Cref{prob:informal}. We have discussed this in the prose following \Cref{prob:bag-pdb-poly-expected}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{A discussion is missing about the difference between the approach usually taken in PDB literature and your approach. In which case would one be more interested in the expected multiplicity or in the marginal probability of a tuple? This should be discussed clearly in the introduction, as currently there is no clear "motivation" to what you do. There is a section about related work at the end but it is mostly a set of facts and there is no insightful comparison to what you do.}
We provide more motivating examples in the first paragraph, and include a more detailed discussion of the relationship to sets in paragraph \textbf{Relationship to Set-Probabilistic Query Evaluation} after \Cref{prob:informal}.
\AH{We need to maybe talk about the motivation for computing expected multiplicity.}
Broadly, expected multiplicities correspond to expected \lstinline{COUNT(*)} queries.
As a trivial (albeit relevant) example, consider a model of a contact network.
The probability that there exists at least one new COVID infection in the graph is far less informative than the expected number of new infections.
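As a hedged numeric sketch of this point (the values of $n$ and $p$ are hypothetical and chosen only for illustration), assume $n$ potential infections that occur independently, each with probability $p$, and write $N$ for the number of new infections:
\[
  \Pr[N \geq 1] = 1 - (1-p)^{n}, \qquad \mathrm{E}[N] = n \cdot p.
\]
For $p = 0.05$, the probability of at least one infection is already about $0.994$ at $n = 100$ and is essentially $1$ at $n = 1000$, while the expected number of infections grows from $5$ to $50$.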
As we now explain in the introduction, another motivation for generalizing marginal probability to expected multiplicity is that the marginal probability of a tuple $t$ is the expectation of a Boolean random variable that is assigned 1 in every world where tuple $t$ exists and $0$ otherwise. For bag-PDBs the multiplicity of a query result tuple can be modeled as a natural-number random variable that for a world $\db$ is assigned the multiplicity of the tuple in $\db$. Thus, a natural generalization of the marginal probability (expectation of a Boolean random variable) to bags is the expectation of this variable: the tuple's expected multiplicity.
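The generalization can be sketched as follows, where $m_t(\db)$ (the multiplicity of $t$ in world $\db$) and the indicator $\mathbf{1}[\cdot]$ are notation assumed here for illustration and not necessarily the notation used in the paper:
\[
  \Pr[m_t \geq 1] = \sum_{\db} \Pr[\db] \cdot \mathbf{1}[m_t(\db) \geq 1],
  \qquad
  \mathrm{E}[m_t] = \sum_{\db} \Pr[\db] \cdot m_t(\db).
\]
When every possible world is a set, $m_t(\db) \in \{0, 1\}$, so the indicator equals the multiplicity and the two quantities coincide.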
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\RCOMMENT{l.176 "N[X] relations are closed under RA+": is this a *definition* of what it means to take an RA+ query and evaluate it over an N[X] database, or does this sentence say something more? Also, I think it would be clearer to use UCQs in the whole paper instead of constantly changing between UCQs, RA+ and SPJU formalisms}
To make the paper more accessible and general, we found it better not to use $\semNX$-DBs. While we wanted to use UCQs, we found the choice of $\raPlus$ to be more amenable to the presentation of the paper, and have, as suggested, stuck with one query formalism.
\RCOMMENT{There are too many things undefined in from l.182 to the end of page. l.182 and in Proposition 2.1 N-PDBs are not defined, the function mod is undefined, etc. The article should be self-contained: important definitions should be in the article and the appendix should only be used to hide proof details. I think it would not take a lot of space to properly define the main concepts that you are using, without hiding things in the appendix}
We have done as the reviewer has suggested. All material in \Cref{sec:background} that is proof related is in the appendix, while \Cref{sec:background} is itself now self-contained.
We have done as the reviewer has suggested. All material in \Cref{sec:background} that is proof-related is in the appendix, while \Cref{sec:background} is itself now self-contained.
\RCOMMENT{l.622 and l.632-634: so a N-PDB is a PDB where each possible world is an N-database, but an N[X]-PDB is not a PDB where each possible world is an N[X]-database... Confusing notation}
@ -101,7 +112,7 @@ We agree that this should not be part of the proof of the later, and have remove
\RCOMMENT{l.686 "The closure of ... over K-relations": you should give more details on this part. It is not obvious to me that the relations from l.646 hold.}
The core of this (otherwise trivial) argument, that semiring homomorphisms commute through queries, was already proven in \cite{DBLP:conf/pods/GreenKT07}. We now make this reference explicit.
We apologize for not explaining this in more detail. In universal algebra~\cite{graetzer-08-un}, it has been proven (the HSP theorem) that for any variety, i.e., the set of all structures (called objects) with a certain signature that obey a set of equational laws, there exists a ``most general'' object called the \emph{free object}. The elements of the free object are equivalence classes (with respect to the laws of the variety) of symbolic expressions over a set of variables $\vct{X}$ that are built from the operations of the structure. The operations of the free object combine symbolic expressions using the corresponding operation. It has been shown that for any other object $K$ of a variety, any assignment $\phi: \vct{X} \to K$ uniquely extends to a homomorphism from the free object to $K$ by substituting variables based on $\phi$ in a symbolic expression and then evaluating the resulting expression in $K$.
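As a small worked illustration, assume the variety of commutative semirings, whose free object over $\vct{X}$ consists of polynomials with natural-number coefficients, and take $K$ to be the natural numbers. The variable names $X_1, X_2, X_3$, the chosen values, and the notation $\hat{\phi}$ for the extended homomorphism are assumptions made only for this sketch. For the assignment $\phi(X_1) = 1$, $\phi(X_2) = 3$, $\phi(X_3) = 0$ we obtain
\[
  \hat{\phi}(2 X_1 X_2 + X_3) = 2 \cdot \phi(X_1) \cdot \phi(X_2) + \phi(X_3) = 2 \cdot 1 \cdot 3 + 0 = 6.
\]
The same value results from evaluating the summands separately and adding, $\hat{\phi}(2 X_1 X_2) + \hat{\phi}(X_3) = 6 + 0$, which is the homomorphism property used in the argument.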