%!TEX root=../main.tex
A computational notebook is an ordered sequence of \emph{cells}, each containing a block of code (e.g., Python) or documentation (e.g., Markdown).
For simplicity, we ignore documentation cells, which do not affect notebook evaluation semantics.
A notebook is evaluated by instantiating a Python interpreter --- usually called a kernel --- and using the kernel to evaluate each cell's code in sequence.
The Python interpreter retains execution state after each cell is executed, allowing information to flow between cells.
Our model of the notebook is thus essentially a single large script obtained by concatenating all code cells together.
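This kernel-style evaluation model can be sketched in a few lines; the cell contents below are hypothetical, and a real kernel is more elaborate, but the semantics --- one shared namespace mutated by each cell in order --- are the same as concatenating the cells into one script.

```python
# Hypothetical cells; each is a block of Python source.
cells = [
    "x = 2",            # cell 1
    "y = x * 3",        # cell 2: reads state left behind by cell 1
    "result = x + y",   # cell 3
]

namespace = {}
for cell in cells:
    # exec() mutates `namespace` in place, mirroring how a kernel
    # retains interpreter state between cell executions.
    exec(cell, namespace)

print(namespace["result"])  # 8
```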
This execution model has two critical shortcomings.
First, while users may explicitly declare opportunities for parallelism (e.g., by using data-parallel libraries like Spark or TensorFlow), opportunities for inter-cell parallelism are lost.
Because cells communicate through Python interpreter state, each cell must finish executing before the next cell can begin.
Second, partial re-execution of cells is possible, but requires users to manually re-execute affected cells.
We note that even a naive approach like re-executing all cells subsequent to a modified cell is not guaranteed to be correct.
The use of Python interpreter state for inter-cell communication makes it difficult to reason about whether a cell's inputs are unchanged from the last time it was run.
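The difficulty can be made concrete with a small (hypothetical) notebook in which one cell mutates state created by an earlier cell: re-running all cells after an edit against the stale namespace diverges from a fresh top-to-bottom evaluation.

```python
cells = [
    "data = [1, 2]",      # cell 1
    "data.append(3)",     # cell 2: mutates cell 1's output in place
    "total = sum(data)",  # cell 3
]

# Clean top-to-bottom evaluation.
namespace = {}
for cell in cells:
    exec(cell, namespace)
fresh_total = namespace["total"]  # 6

# Naively "re-run everything after cell 1" in the same stale namespace,
# e.g., after the user edits cell 3: cell 2 appends 3 a second time.
for cell in cells[1:]:
    exec(cell, namespace)
stale_total = namespace["total"]  # 9, not 6
```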
In this paper, we propose a new workflow-style runtime for Python notebooks that addresses both shortcomings.
Notably, our runtime relies on a hybrid of dataflow- and workflow-style provenance models.
As in workflow-style provenance models, dependencies are tracked coarsely at the level of cells.
However, as in dataflow provenance, dependencies are discovered automatically through a combination of static analysis and dynamic instrumentation, rather than being declared explicitly.
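As a toy illustration of the static half of this analysis, the sketch below uses Python's standard \texttt{ast} module to approximate the names each (hypothetical) cell reads and writes, and derives cell-level dependency edges from them; the real analysis must also handle aliasing, attribute access, and mutation, which is why it is combined with dynamic instrumentation.

```python
import ast

def cell_reads_writes(src):
    """Conservatively approximate the global names a cell reads and
    writes via a single static pass over its AST (a deliberate
    oversimplification of the hybrid analysis described in the text)."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:
                writes.add(node.id)
    return reads, writes

cells = ["x = 1", "y = x + 1", "z = x * 2"]
analyses = [cell_reads_writes(c) for c in cells]

# Cell j depends on cell i (i < j) if j reads a name that i writes.
deps = {
    j: {i for i in range(j) if analyses[j][0] & analyses[i][1]}
    for j in range(len(cells))
}
print(deps)  # {0: set(), 1: {0}, 2: {0}} -- cells 1 and 2 can run in parallel
```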
Parallel execution and incremental re-execution require overcoming three key challenges: isolation, scheduling, and translation.
First, as discussed above, the state of the Python interpreter is not a suitable medium for inter-cell communication.
Ideally, each cell would execute in an isolated environment, with explicit information flow between cells.
The cell isolation mechanism should also permit efficient checkpointing of inter-cell state and cleanly separate the transient state of concurrently executing cells.
We discuss our approach to isolation in \Cref{sec:isolation}.
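A minimal sketch of this isolation discipline runs each cell against an explicit, copied input environment and returns only the names the cell defined; \texttt{run\_isolated} is a hypothetical helper, not our system's actual mechanism.

```python
def run_isolated(cell_src, inputs):
    # Copy the explicit inputs so the caller's state is never rebound.
    # (This is a shallow copy: mutations of shared objects could still
    # leak, which is one reason efficient checkpointing is needed.)
    env = dict(inputs)
    exec(cell_src, env)
    env.pop("__builtins__", None)  # exec() injects this into its globals
    # Return only the names the cell newly defined: its explicit outputs.
    return {k: v for k, v in env.items() if k not in inputs}

out1 = run_isolated("x = 2", {})       # {'x': 2}
out2 = run_isolated("y = x + 1", out1) # {'y': 3}
```

Because each cell sees only its declared inputs and emits only its declared outputs, two cells with disjoint inputs can execute concurrently without observing each other's transient state.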
Second, scheduling requires deriving a partial order over the notebook's cells.
In a typical workflow system, such dependencies are provided explicitly.
However, \emph{correctly} inferring dependencies statically is intractable.
The scheduler must therefore be able to execute a workflow whose dependency graph changes dynamically.
We discuss how a workflow scheduler can merge conservative, statically derived provenance bounds with dynamic provenance collected during cell execution in \Cref{sec:scheduler}.
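Given a (conservatively inferred) dependency graph, a parallel schedule follows from a standard topological traversal; the sketch below uses Python's stdlib \texttt{graphlib} with a hypothetical dependency map, and omits the dynamic refinement of edges during execution that our scheduler performs.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical conservative bound: cell -> set of cells it depends on.
static_deps = {1: {0}, 2: {0}, 3: {1, 2}}

ts = TopologicalSorter(static_deps)
ts.prepare()
schedule = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # cells in `ready` may run in parallel
    schedule.append(ready)
    for cell in ready:
        ts.done(cell)               # in reality: done after the cell executes
print(schedule)  # [[0], [1, 2], [3]]
```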
Finally, notebooks written for a kernel-based runtime assume that the effects of one cell will be visible in the next.
We discuss how kernel-based notebooks can be translated into our execution model in \Cref{sec:import}.