paper-BagRelationalPDBsAreHard/reviews.txt

----------------------- REVIEW 1 ---------------------
SUBMISSION: 61
TITLE: Standard Operating Procedure in PDBs Considered Harmful (for bags)
AUTHORS: Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy and Atri Rudra
----------- Overall evaluation -----------
SCORE: 3 (strong accept)
----- TEXT:
Assume a probabilistic database (PDB) where every tuple is annotated with a random variable whose domain is the set of natural numbers. That is, the possible worlds of this PDB represent the different combinations of multiplicities of the tuples. As in the classic tuple-independent setting, it is assumed that the random variables are pairwise independent.
The problem investigated is to compute the expected multiplicity of every result tuple.
The paper examines the intensional approach to query evaluation that translates to computing the expectation of the lineage polynomial.
If the polynomial is given in sum-of-products form, then applying linearity of expectation, together with the assumption that the RVs are pairwise independent, allows one to compute the expectation in time linear in the size of the lineage.
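(The linear-time computation described above can be sketched as follows; this is an illustrative example assuming fully independent {0,1}-valued tuple variables, not code from the paper under review.)

```python
# Sketch: expected multiplicity of an SOP lineage polynomial, assuming
# independent {0,1}-valued tuple variables (the TIDB-style setting).
# A monomial is (coefficient, iterable of variable ids); since X^k = X
# for a {0,1}-valued X, exponents can be dropped after deduplication.

def expected_multiplicity(monomials, prob):
    """prob maps each variable id to Pr[X = 1], which equals E[X] here."""
    total = 0.0
    for coeff, variables in monomials:
        term = coeff
        for v in set(variables):   # X^k == X, so each variable counts once
            term *= prob[v]        # E[X*Y] = E[X]*E[Y] by independence
        total += term              # linearity of expectation over monomials
    return total

# E[X*Y + 2*X] with Pr[X=1] = 0.5, Pr[Y=1] = 0.4:
print(expected_multiplicity([(1, ["X", "Y"]), (2, ["X"])],
                            {"X": 0.5, "Y": 0.4}))
# -> 0.5*0.4 + 2*0.5 = 1.2
```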
The main question considered by the authors is whether there is an efficient algorithm (for computing the expectation) when the lineage is given in compressed form.
The paper proves several results: hardness of computing the expectation when the lineage is in compressed form, a (restricted) generalization of the former when all RVs have the same probability, and a linear-time approximation algorithm with a multiplicative approximation factor. The proofs are elegant and use recent results and conjectures in fine-grained complexity. The use of polynomial algebra to process the lineage and prove equivalences in this setting is novel. Finally, the exposition is good, with many examples providing intuition. Therefore, I recommend acceptance.
RESULTS:
The first result is a proof showing that if the expectation of the lineage can be computed in linear time in the size of its compressed form, then it is possible to count the number of k-matchings in a graph in time O(k^3). Since counting k-matchings is #W[1]-hard (when parameterized by k), this proves the hardness of the compressed evaluation.
The technique used is common in the PDB literature (e.g., Dalvi and Suciu 2008): the PDB is used to encode a graph, and query evaluation is used to construct a system of linear equations represented by a Vandermonde matrix, which can be solved in polynomial time.
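(The Vandermonde step can be illustrated generically: evaluating a polynomial with unknown coefficients at distinct points yields a Vandermonde system whose solution recovers the coefficients. The sketch below solves such a system exactly via Lagrange interpolation; it is a hypothetical illustration, not the paper's actual construction.)

```python
from fractions import Fraction

def interpolate_coeffs(points, values):
    """Recover coefficients [c_0, ..., c_k] of a degree-k polynomial
    p(x) = sum_i c_i x^i from its values p(points[j]) = values[j]
    at k+1 distinct points -- i.e., solve the Vandermonde system exactly."""
    k = len(points)
    coeffs = [Fraction(0)] * k
    for j, (xj, yj) in enumerate(zip(points, values)):
        # Build the Lagrange basis polynomial prod_{m != j} (x - x_m)
        # as a low-to-high coefficient list.
        basis = [Fraction(1)]
        denom = Fraction(1)
        for m, xm in enumerate(points):
            if m == j:
                continue
            basis = [Fraction(0)] + basis          # multiply by x
            for i in range(len(basis) - 1):
                basis[i] -= xm * basis[i + 1]      # subtract xm * (old basis)
            denom *= (xj - xm)
        for i in range(k):
            coeffs[i] += Fraction(yj) * basis[i] / denom
    return coeffs

# p(x) = 2 + 3x + x^2 evaluated at 0, 1, 2 gives 2, 6, 12:
print(interpolate_coeffs([0, 1, 2], [2, 6, 12]))   # -> [2, 3, 1]
```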
The authors then generalize this result to the case where all RVs have the same probability by using the triangle detection hypothesis, which states that counting triangles takes time \Omega(|E|^{4/3}).
Comments:
1. The paper would benefit if the transition from N-PDB to N[X]-PDB (while proved in the appendix) were explained in the paper because the results that follow rest on the correctness of this encoding.
2. Conjecture 3.2: the fine-grained complexity bound refers only to triangle counting, and not the P4-path and the 3-matching. It is confusing that these are mentioned in the conjecture.
Minor Comments:
Section 2.2.1 - mention that \pi_{sch} refers to the schema of R_1
Definition 2.11 - poly(T) is not yet defined at this point (defined in def. 4.2)
Lemma 3.8: p^4 is missing in the third term of the polynomial.
----------- Reviewer's confidence -----------
SCORE: 3 ((medium))
----------------------- REVIEW 2 ---------------------
SUBMISSION: 61
TITLE: Standard Operating Procedure in PDBs Considered Harmful (for bags)
AUTHORS: Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy and Atri Rudra
----------- Overall evaluation -----------
SCORE: -2 (reject)
----- TEXT:
From what I can tell: This paper studies the problem of computing, in the context of probabilistic DBs, the marginal probability of a result tuple under bag semantics. While this problem is tractable when the lineage polynomial is presented in a typical representation (sum-of-products), there exist relevant compressed representations, and the paper focuses on studying the problem with respect to these compressed representations. While a negative result is shown, a positive result, an approximation algorithm, is also given.
I had an extreme amount of difficulty reading the paper, to the point where I am not sure of the meaning of notions that seem basic to the paper's aims. As a result of this, I don't think that the paper can be accepted in the present form.
Below are some examples of places where I had difficulty.
A number of these refer to the introduction. To be clear, I don't expect the introduction to be completely rigorous mathematically, but where not rigorous, it should give useful hints that drive the reader towards understanding.
- Pg 1, "computing the expectation of the lineage, which under bag semantics is a polynomial". Make clear, what is a polynomial: the expectation or the lineage? Also, a polynomial in what?
- Pg 1, "These compression schemes are analogous to typical...optimizations like projection push-down". What is the analogy? What is projection push-down? And why is 27 an appropriate reference for it, did they invent it?
- Pg 1, "this algorithm only has a constant factor overhead". What does that mean?
- Pg 2, Example 1.2. I read that the lineage of the result in a Bag-PDB is a polynomial formula. What is a polynomial formula?
- Pg 2, "observe that the clauses...are neither independent nor disjoint". What does it mean for clauses to be independent or disjoint?
Same sentence: what is the Shannon decomposition, and what does it mean for it to be "at worst exponential" in the size of the input?
Also, what are clauses of a polynomial? Here, you say "of the SOP polynomial": of what SOP polynomial?
- Pg 2, Col 2, "further interesting feature". Can you say more about why this is interesting?
- Pg 2, Col 2, on dropping subscript from Q_{bag}: This is very confusing, not least because the first usage of Q after you say this seems to refer to a query.
I think this overloading leads to a ton of confusion, indeed, throughout the paper, one sees (for example) Q(W_a,W_b,W_c), Q(\Omega), Q(X), Q(D), Q(W), Q(), and each time the reader potentially has to invoke some sort of search over the possible meanings to determine the actual meaning.
- Pg 2, Col 2, "Cartesian product of Q". What do you mean by Cartesian products of queries?
- Pg 3, Col 1, central question of this paper: I'm not sure how to interpret this given what comes before it. The central question refers to "the compressed lineage polynomial", but I do not know what this is, and miss a definition. The beginning of Section 1.2 says some vague-sounding things about compressed representations: 'For an arbitrary polynomial, it is known that there may exist equivalent compressed representations. One such compression is the factorized polynomial...', but I don't see any real definition.
- Sect 1.3, first sentence. In (i), did you define TIDB? (This is mentioned in Ex 1.1, but is it defined anywhere?) Also, you say "over a bag-TIDB"; do you mean over one particular bag-TIDB, or is the bag-TIDB part of the input?
- Defs. 2.2 and 2.3. Do you need to state that each c_i in Def. 2.2 is nonzero so that the degree in Def. 2.3 is defined correctly?
- Def. 2.11. Is "poly" defined anywhere?
Right after Def. 2.11, an example of a set ET(Q(X)) is given, but the elements of the set do not appear to be binary trees.
- Sect. 3, first sentence. You say 'project-join query', earlier you used the term 'UCQ': why the inconsistency?
- Sect. 3.1, beginning. You say "fixed graph H", but how is this graph fixed? (It varies in, say, Theorem 3.1.) Also, why do you suddenly switch to saying "pattern" after that, also, what is an occurrence of a pattern?
- Thms. 3.1 and 3.4. What is the parameterization? (Without stating this, there is no way to understand what the #W[1]-hardness refers to.)
----------- Reviewer's confidence -----------
SCORE: 4 ((high))
----------------------- REVIEW 3 ---------------------
SUBMISSION: 61
TITLE: Standard Operating Procedure in PDBs Considered Harmful (for bags)
AUTHORS: Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy and Atri Rudra
----------- Overall evaluation -----------
SCORE: 2 (accept)
----- TEXT:
This paper investigates the problem of computing the expected number of times a tuple is produced as a result of a bag-semantics query evaluation over probabilistic databases (BIDs). Though the computation of this expected value can be done in linear time on the monomial expansions of the provenance, it is observed that this is not the case on factorized representations of the provenance. This is formalized by a (conditional) super-linear hardness result, through some quite technical reductions from counting patterns in graphs. A scheme for approximate computation is proposed, which is linear-time for some specific settings, including when the input database is a TID.
This is a sound and novel paper, and the attention given to probabilistic counting is interesting. It is generally quite well-written.
I am not a huge fan of the title of the paper, which does not give a good overview of what the paper is about. At the very least, I would urge the authors to add the "(for bags)" to the actual title and not just the subtitle of their paper, to avoid any metadata issue with the paper -- the notion of bags is quite crucial in this paper.
The introduction states that the assumption that input tuples have cardinality 0 or 1 is done without loss of generality, but it seems to actually have quite a strong impact, since the assumption that E[X]=Pr(X=1) is used in a number of places. This should be discussed in more depth.
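(A minimal illustration of the reviewer's point, with a hypothetical distribution: the identity E[X] = Pr(X=1) is specific to {0,1}-valued variables and fails once larger multiplicities are allowed.)

```python
# E[X] = Pr[X = 1] holds for {0,1}-valued X but fails in general.
# Example: X uniform on {0, 2}.
dist = {0: 0.5, 2: 0.5}                            # Pr[X = value]
expectation = sum(v * p for v, p in dist.items())  # E[X] = 1.0
pr_one = dist.get(1, 0.0)                          # Pr[X = 1] = 0.0
print(expectation, pr_one)                         # -> 1.0 0.0
```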
Section 2.2.1, maybe replace the \sum symbol with \bigoplus?
Definition 2.11: at this point, poly has not been defined. ET does not really seem well defined. Contrary to the example that follows, there are infinitely many trees that evaluate to the same polynomial (you can always have things like X-X+X-X+X-X...).
Before Section 3.1, when "these conjectures hold" is mentioned for the first time, no conjecture has been stated yet.
Lemma 3.8, factor p^4 missing in the monomial with the 2-matchings.
Typos:
- (page 1) "the the first linear"
- (page 3) extraneous space after "is #W[1]-hard"
- (page 4) something wrong with the formatting of the subscript in footnote 4
- (page 5) "demonstrate Section 3.3"
- (page 6) Main verb missing in the sentence after Conjecture 3.2
- (page 7) "that this problem" -> does not parse
- (page 7) "Analogusly"
- (page 9) The statement of Corollary 4.9 has an extraneous closing parenthesis
- (page 11) "So far focused" -> "So far we focused"
- (page 11) "to e.g. to"
----------- Reviewer's confidence -----------
SCORE: 5 ((expert))
----------------------- REVIEW 4 ---------------------
SUBMISSION: 61
TITLE: Standard Operating Procedure in PDBs Considered Harmful (for bags)
AUTHORS: Su Feng, Boris Glavic, Aaron Huber, Oliver Kennedy and Atri Rudra
----------- Overall evaluation -----------
SCORE: 0 (borderline paper)
----- TEXT:
The paper addresses the problem of computing the expected multiplicity of a tuple in the result of evaluating a query on a probabilistic database under bag semantics. The paper shows that while this problem is hard, one can efficiently compute an additive approximation.
Strong points:
The problem seems interesting and the results seem to require non-trivial effort.
Weak points:
The presentation is hard to follow, in particular the introduction. There were basic definitions that were unclear to me.
Detailed comments:
Introduction - I find the structure of the introduction almost impossible to follow as it is built on things that are only defined in the body of the paper. I read it several times and returned to it after reading the whole paper and there are still things that are vague to me. I would prefer a shorter introduction that gives an overview of the results and the techniques used to obtain them with much fewer technical details.
Main question - I did not understand why the main question is "Is it always the case that the expectation of a UCQ in a Bag-PDB can be computed in time linear in the size of the compressed lineage polynomial?"
Why is the complexity measured with respect to the size of the compressed lineage polynomial, and what is the complexity required to compute this polynomial from the input? You discuss related issues in Section 5.1.2, but it is still not clear to me.
Also, do "reduced polynomial" and "compressed polynomial" refer to the same notion? If so, why do we need both?
Definition 2.2.2 - It would have been useful to elaborate more here on the correspondence between the original PDB and the N[X]-PDB. This is a key element, and without it, it is impossible to understand why the main problem is indeed the one presented in Definition 2.12.
Definition 2.5 - I did not understand the intuitive explanation. Can you explain why this is the case?
Section 2.1: you use both multi-sets and bags for the same notion
Theorem 3.4: the indices of p_i in the reduced polynomial are wrong (should be p_0, \cdots, p_{2k})
----------- Reviewer's confidence -----------
SCORE: 2 ((low))