#+title: reviews
* REVIEW 1
** Overall evaluation
SCORE: 1 (weak accept)
** TEXT
The paper proposes a mixed static/dynamic approach to capture, at compile time and at runtime, provenance information on notebook cells so that their execution can be parallelized. The static provenance information captured is approximate, so some mitigation logic must be triggered in order to arrive at a parallel execution that matches the serial one.

In general I loved the "workbook" idea and using provenance as a means to get there. Parallelization is a low-hanging fruit; there are probably many more things that can be explored in this model. I am not 100% convinced about the mixed static/dynamic provenance approach. Therefore I am giving a weak accept.

Strong points:
- I loved the "workbook" idea. I have personally used dataflow systems, and I found their interfaces difficult to use and error prone, especially the need to explicitly define the variables and the dependency graph. A notebook-oriented approach is so much better!
- The parallelization use case totally makes sense, and the performance gains shown in Section 6 back it up.

Weak points:
- Is static provenance actually useful? Because of its approximate nature, it would be interesting to get an idea of its practical contribution to the approach. For example, one could start by executing all cells serially, build the dynamic graph, and then use parallelization only in subsequent re-executions. This would probably also simplify scheduling, since it is "safer".
- The scaling-experiment paragraph may need some additional explanation. What is a hot vs. cold cache? Does cold mean that the data is in storage, or is it just a matter of warming up the system caches?

Minor:
- "Section 5 adn a simple"
- " Not that Vizier"

Suggestions:
- My understanding is that you use Spark as the backend for Vizier. Why is that? I believe that something like Ray would be a better choice, since it provides proper distributed data management instead of having to resort to primitives such as shuffling and broadcasting (which in this context are a bit odd).
- Can you add a couple more sentences on future work beyond system improvements? I would love to hear more about possible use cases.

* REVIEW 2
** Overall evaluation
SCORE: 2 (accept)
** TEXT
The paper proposes a provenance refinement approach for notebooks that use the ICE (isolated cell execution) execution model. The proposed approach is able to improve simple static provenance at runtime, and it reuses and repairs provenance captured in previous executions of a given notebook. By doing that, the proposed method can provide cell execution parallelism, and also partial re-evaluation of notebooks.

While the implementation is still preliminary, it shows performance improvements of up to 4x on a synthetic notebook, which is consistent with a survey the authors conducted on real notebooks from [10] that shows an average parallelism factor of 4.

I wonder what the implications of the proposed approach would be in a monolithic execution model like the one used by Jupyter.

The paper is well written in general. I did find two typos, however:

- adn (Section 6)
- Not that (should probably be Note that)

* REVIEW 3
** Overall evaluation
SCORE: 1 (weak accept)
** TEXT
The authors propose an improvement of the traditional notebook with provenance support and parallel execution capabilities. In particular, the authors aim to solve the problem of the hidden state of the Python interpreter, and the lack of the ability to process cells in parallel. To do this, the authors had to move away from execution in a single monolithic kernel, as used in Jupyter for example, to isolated interpreters per cell. The authors have performed an evaluation using 6000 Jupyter notebooks, illustrating the gain that can be achieved.

The article is well written, and the technical solution seems to be sound. That said, I am skeptical about the motivation behind the solution presented in this article. Jupyter is not intended for very intensive data processing; it is often used in the early stages of exploration, after which a team of developers usually moves on to more robust environments. So I am not sure how useful the solution is in terms of supporting parallelism. In terms of managing dependencies between cells it may be useful; however, it will require a change in the mindset of developers who are used to working with a system like Jupyter.

To sum up, I think the idea and the proposal are interesting, but I am not convinced of their usefulness in practice.