paper-ParallelPython-Short/reviews.org


reviews

REVIEW 1

Overall evaluation

SCORE: 1 (weak accept)

TEXT

The paper proposes a mixed static/dynamic approach to capture, at compile time and at runtime, provenance information on notebook cells so that their execution can be parallelized. The static provenance information captured is approximate, so mitigation logic must be triggered in order to arrive at a parallel execution that matches the serial one.

In general I loved the "workbook" idea and using provenance as a means to get there. Parallelization is low-hanging fruit; there are probably many more things that can be explored in this model. I am not 100% convinced about the mixed static/dynamic provenance approach. Therefore I am giving a weak accept.

Strong points:

  • I loved the "workbook" idea. I have personally used dataflow systems, and I found their interface difficult to use and error prone, especially the fact that you need to explicitly define the variables and the dependency graph. A notebook-oriented approach is so much better!

  • The parallelization use case totally makes sense, and the performance gains shown in Section 6 back it up.

Weak points:

  • Is static provenance actually useful? Because of its approximate nature, it would be interesting to have an idea of its practical contribution to the approach. For example, one could start by executing all cells serially, build the dynamic graph, and then use parallelization only for subsequent re-executions. This would probably also simplify scheduling, since it is "safer".

  • The scaling experiment paragraph may need some additional explanation. What is hot vs. cold cache? Does cold mean that the data is in storage, or is it just a matter of warming up the system caches?
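The alternative in the first weak point can be made concrete. The following is a minimal sketch, not the paper's actual implementation: the cell names, the read/write log format, and the wave-based scheduler are all illustrative assumptions. A serial first run records which variables each cell reads and writes (the dynamic provenance); re-executions then use that exact graph to run independent cells in parallel.

```python
# Sketch: serial run yields a log of (cell, reads, writes); re-runs
# schedule cells in parallel "waves" that respect the observed deps.
# All names and formats here are hypothetical, for illustration only.
from concurrent.futures import ThreadPoolExecutor

def build_dependency_graph(run_log):
    """run_log: list of (cell, reads, writes) in serial execution order.
    Returns {cell: set of earlier cells it depends on}."""
    last_writer = {}  # variable -> most recent cell that wrote it
    deps = {}
    for cell, reads, writes in run_log:
        deps[cell] = {last_writer[v] for v in reads if v in last_writer}
        for v in writes:
            last_writer[v] = cell
    return deps

def parallel_rerun(deps, run_cell, max_workers=4):
    """Re-execute cells respecting deps; independent cells run together."""
    done, pending = set(), dict(deps)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # A cell is ready once all of its dependencies have finished.
            ready = [c for c, d in pending.items() if d <= done]
            list(pool.map(run_cell, ready))  # one wave of independent cells
            done.update(ready)
            for c in ready:
                del pending[c]

# Example: cell2 and cell3 both depend only on cell1, so they form a wave.
log = [("cell1", set(), {"df"}),
       ("cell2", {"df"}, {"a"}),
       ("cell3", {"df"}, {"b"}),
       ("cell4", {"a", "b"}, {"out"})]
graph = build_dependency_graph(log)
order = []
parallel_rerun(graph, order.append, max_workers=2)
```

Because the graph comes from an actual serial execution rather than static analysis, no approximation mitigation is needed on re-runs, at the cost of the first execution being fully serial.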

Minor:

  • "Section 5 adn a simple"
  • " Not that Vizier"

Suggestions:

  • My understanding is that you use Spark as the backend for Vizier. Why is that? I believe that something like Ray would be a better choice, since it provides proper distributed data management instead of having to resort to primitives such as shuffling and broadcasting (which in this context are a bit weird).

  • Can you add a couple more sentences on future work beyond system improvements? I would love to hear more about possible use cases.

REVIEW 2

Overall evaluation

SCORE: 2 (accept)

TEXT

The paper proposes a provenance refinement approach for notebooks that use the ICE (isolated cell execution) model. The proposed approach improves simple static provenance at runtime, and it reuses and corrects provenance captured in previous executions of a given notebook. By doing so, the proposed method can provide cell execution parallelism, as well as partial re-evaluation of notebooks.

While the implementation is still preliminary, it shows performance improvements of up to 4x on a synthetic notebook, which is consistent with a survey they conducted on real notebooks from [10] that shows an average parallelism factor of

I wonder what would be the implications of the proposed approach in a monolithic execution model like the one used by Jupyter.

The paper is well written in general. I did find two typos, however:

  • adn (Section 6)
  • Not that (should probably be Note that)

REVIEW 3

Overall evaluation

SCORE: 1 (weak accept)

TEXT

The authors propose an improvement of the traditional notebook with provenance support and parallel execution capabilities. In particular, the authors aim to solve the problem of the hidden state of the Python interpreter and the lack of ability to process cells in parallel. To do this, the authors had to move away from execution in a single monolithic kernel, as used in Jupyter for example, to isolated interpreters per cell. The authors performed an evaluation using 6,000 Jupyter notebooks, illustrating the gains that can be achieved.

The article is well written, and the technical solution seems to be sound. That said, I am skeptical about the motivation behind the solution presented in this article. Jupyter is not intended for very intensive data processing; it is often used in the early stages of exploration, after which a team of developers usually moves on to more robust environments. So I am not sure how useful the solution is in terms of supporting parallelism. In terms of managing dependencies between cells, it may be useful; however, it will require a change in the mindset of developers who are used to working with a system like Jupyter.

To sum up, I think the idea and the proposal are interesting, but I am not convinced of their usefulness in practice.