#+title: reviews
* REVIEW 1
** Overall evaluation
SCORE: 1 (weak accept)
** TEXT
The paper proposes a mixed static/dynamic approach to capture, at compile time and at runtime, provenance information on notebook cells so that their execution can be parallelized. The static provenance information captured is approximate, so some mitigation logic must be triggered in order to arrive at a parallel execution that matches the serial one.

In general I loved the "workbook" idea and using provenance as a means to get there. Parallelization is a low-hanging fruit; there are probably many more things that can be explored in this model. I am not 100% convinced about the mixed static/dynamic provenance approach. Therefore I am giving a weak accept.

Strong points:
- I loved the "workbook" idea. I have personally used dataflow systems, and I found their interfaces difficult to use and error prone, especially the need to explicitly define the variables and the dependency graph. A notebook-oriented approach is so much better!
- The parallelization use case totally makes sense, and the performance gains shown in Section 6 back it up.

Weak points:
- Is static provenance actually useful? Because of its approximate nature, it would be interesting to get an idea of its practical contribution to the approach. For example, one could start by executing all cells serially, build the dynamic graph, and then use parallelization only in subsequent re-executions. This would probably also simplify scheduling, since it is "safer".
- The scaling-experiment paragraph may need some additional explanation. What is a hot vs. cold cache? Does cold mean that the data is in storage, or is it just a matter of warming up the system caches?

Minor:
- "Section 5 adn a simple"
- " Not that Vizier"

Suggestions:
- My understanding is that you use Spark as the backend for Vizier. Why is that? I believe that something like Ray would be a better choice, since it provides proper distributed data management instead of having to resort to primitives such as shuffling and broadcasting (which in this context are a bit odd).
- Can you add a couple more sentences on future work beyond system improvements? I would love to hear more about possible use cases.

* REVIEW 2
** Overall evaluation
SCORE: 2 (accept)
** TEXT
The paper proposes a provenance refinement approach for notebooks that use the ICE (isolated cell execution) execution model. The proposed approach is able to improve simple static provenance at runtime, and it reuses and repairs provenance captured in previous executions of a given notebook. By doing that, the proposed method can provide cell execution parallelism, and also partial re-evaluation of notebooks.

While the implementation is still preliminary, it shows performance improvements of up to 4x on a synthetic notebook, which is consistent with a survey the authors conducted on real notebooks from [10] that shows an average parallelism factor of 4.

I wonder what the implications of the proposed approach would be in a monolithic execution model like the one used by Jupyter.

The paper is well written in general. I did find two typos, however:

- adn (Section 6)
- Not that (should probably be Note that)

* REVIEW 3
** Overall evaluation
SCORE: 1 (weak accept)
** TEXT
The authors propose an improvement of the traditional notebook with provenance support and parallel execution capabilities. In particular, the authors aim to solve the problem of the hidden state of the Python interpreter, and the lack of the ability to process cells in parallel. To do this, the authors had to move away from execution in a single monolithic kernel, as used in Jupyter for example, to isolated interpreters per cell. The authors have performed an evaluation using 6000 Jupyter notebooks, illustrating the gain that can be achieved.

The article is well written, and the technical solution seems to be sound. That said, I am skeptical about the motivation behind the solution presented in this article. Jupyter is not intended for very intensive data processing; it is often used in the early stages of exploration, after which a team of developers usually moves on to more robust environments. So I am not sure how useful the solution is in terms of supporting parallelism. In terms of managing dependencies between cells it may be useful; however, it will require a change in the mindset of developers who are used to working with a system like Jupyter.

To sum up, I think the idea and the proposal are interesting, but I am not convinced of their usefulness in practice.