Workflow systems~\cite{DC07} help users break complex tasks, such as ETL or model fitting, into a series of smaller steps.
Users explicitly declare inter-step dependencies, permitting parallel execution of mutually independent steps.
% Recently however, systems like Papermill~\cite{papermill} have emerged,
Systems like Jupyter % or Zeppelin
allow users to instead describe such tasks through computational notebooks.
Notebook users express tasks as a sequence of code `cells' that describe each step of the computation without explicitly declaring dependencies between cells.
Users manually trigger cell execution, dispatching the code in the cell to a long-lived Python interpreter called a kernel.
Cells share data through the state of the interpreter, e.g., a global variable created in one cell can be read by another cell.
Because notebooks relieve users of manual dependency management and provide immediate feedback, they are better suited to iteratively developing a data processing pipeline.
However, the hidden state of the Python interpreter creates bugs that are hard to understand and reproduce, such as users forgetting to rerun dependent cells~\cite{DBLP:journals/ese/PimentelMBF21}.
% joelgrus,
In addition to requiring error-prone manual management of state versions, the notebook execution model also precludes parallelism: cells must be evaluated serially because their inter-dependencies are not known upfront.
Dynamically collected provenance~\cite{pimentel-17-n} can, in principle, address the former problem, but is less useful for the latter because provenance dependencies only become known at runtime.
Conversely, static dataflow analysis~\cite{NN99} addresses both concerns, but the dynamic nature of Python implies that such dependencies can only be extracted approximately --- conservatively, limiting opportunities for parallelism; or optimistically, risking execution errors from missed inter-cell dependencies.
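To make this trade-off concrete, the following is a minimal sketch of a static approximation of inter-cell dependencies, built on Python's standard \texttt{ast} module. It is not the analysis used in this paper; the helper names \texttt{cell\_reads\_writes} and \texttt{approximate\_dependencies} are hypothetical. The sketch is conservative in that every loaded name counts as a read, and optimistic in that dynamic accesses such as \texttt{globals()} or \texttt{exec} are missed entirely.

```python
import ast


def cell_reads_writes(source):
    """Approximate the names a cell reads and writes (hypothetical helper)."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                # Conservative: a name defined earlier in the same cell
                # is still counted as a read from interpreter state.
                reads.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            writes.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                writes.add(alias.asname or alias.name.split(".")[0])
    return reads, writes


def approximate_dependencies(cells):
    """For each cell, return the indices of earlier cells it may depend on."""
    last_writer = {}  # name -> index of the most recent cell writing it
    deps = []
    for index, source in enumerate(cells):
        reads, writes = cell_reads_writes(source)
        deps.append({last_writer[n] for n in reads if n in last_writer})
        for name in writes:
            last_writer[name] = index
    return deps


cells = ["x = 1", "y = 2", "z = x + y", "print(z)"]
deps = approximate_dependencies(cells)
```

In this example the first two cells have no dependencies and could run concurrently, while the third must wait for both and the fourth for the third.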
We bridge this gap with a hybrid approach, \emph{Incremental Runtime Provenance Refinement}, which approximates provenance statically and then refines it incrementally at runtime.
We demonstrate how to utilize provenance refinement to enable (i) parallel execution and (ii) partial re-evaluation of notebooks.
This requires a fundamental shift away from execution in a single monolithic kernel towards isolated per-cell interpreters.
We validate our ideas by extending \emph{Vizier}~\cite{brachmann:2020:cidr:your}, an isolated cell execution (\textit{ICE}) notebook.
Cells in Vizier run in isolated interpreters and communicate only through dataflow (by reading / writing data artifacts).
%This model allows concurrently running cells to read the same artifact, and allows snapshotting artifacts for re-use during incremental re-evaluation.
We outline the challenges of extending Vizier to support parallelism and incremental re-evaluation based on approximate provenance determined using static program analysis that is incrementally refined during (re-)execution.
Additionally, this enables Jupyter notebooks to be imported into an ICE notebook, which requires data dependencies between cells to be made explicit as inter-cell dataflow. % to be % determined statically and to
% transformed into explicit dataflow between cells.
\includegraphics[width=1\columnwidth,trim=0 15 0 0]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}.}
As a preliminary assessment of the potential for parallelizing Jupyter notebooks, we conducted a survey of notebooks scraped from GitHub by Pimentel et al.~\cite{DBLP:journals/ese/PimentelMBF21}.
We include only Python notebooks that are known to execute successfully ($\sim$6000 notebooks). % A total of $\sim$6000 notebooks met these criteria.
We constructed a dataflow graph for each notebook as described in \Cref{sec:import}. As a proxy measure for potential speedup, we considered the depth of this graph.
\Cref{fig:parallelismSurvey} presents the depth --- the maximum number of cells that must be executed serially --- in relation to the total number of Python cells in the notebook.
The average notebook has over 16 cells, but an average dependency depth of just under 4 and an average parallelism factor of 4.
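As an illustration of the depth measure, the sketch below computes the longest chain of dependent cells in a dataflow graph. It assumes the graph is given as a list in which entry $i$ holds the indices of the earlier cells that cell $i$ depends on; this representation and the function name \texttt{workflow\_depth} are assumptions made for the example, not the survey's actual tooling. The final ratio of cell count to depth is one possible summary of available parallelism and is likewise an assumption, since the text does not define its parallelism factor.

```python
def workflow_depth(deps):
    """Return the maximum number of cells that must run serially.

    deps[i] is the set of indices (all smaller than i) that cell i
    depends on, so cells are already in a valid execution order.
    """
    depth = []
    for parents in deps:
        # A cell sits one level below its deepest dependency.
        depth.append(1 + max((depth[p] for p in parents), default=0))
    return max(depth, default=0)


# Four cells: two independent loads, a join, and a final report.
example = [set(), set(), {0, 1}, {2}]
example_depth = workflow_depth(example)

# Hypothetical summary statistic: cells per serial step.
example_ratio = len(example) / example_depth
```

Here four cells collapse into three serial steps, because the first two cells are mutually independent.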
We outline our central contribution, \textbf{Incremental Runtime Provenance Refinement}, in \Cref{sec:approx-prov}, and review Vizier's ICE architecture in \Cref{sec:isolation}.
Afterwards, we discuss our remaining contributions:
(i) \textbf{An Incremental Provenance-based Scheduler}: In \Cref{sec:scheduler}, we present a scheduler for incremental and parallel notebook execution. We discuss the challenges that arise due to provenance mispredictions and how to compensate for them.
(ii) \textbf{Jupyter Import}: \Cref{sec:import} discusses how we extract approximate provenance from Python code statically, and how existing notebooks written for Jupyter % (or comparable monolithic kernel architectures)
can be translated to ICE notebook architectures like Vizier.
(iii) \textbf{Implementation in Vizier and Experiments}: We have implemented a preliminary prototype of the proposed scheduler in Vizier. \Cref{sec:experiments} presents our experiments with parallel evaluation of Jupyter notebooks.
