%!TEX root=../main.tex
Workflow systems~\cite{DC07} help users break complex tasks, such as ETL processes or model fitting, into a series of smaller steps.
Users explicitly declare dependencies between steps, permitting parallel execution of mutually independent steps.
Recently, systems like Papermill~\cite{papermill} have emerged that instead let users describe such tasks through computational notebooks like Jupyter or Zeppelin.
Notebook users express tasks as a series of code `cells', each describing one step of the computation, but do not need to explicitly declare dependencies between cells.
Users manually trigger cell execution, dispatching the code in the cell to a long-lived Python interpreter called a kernel.
Cells share data through the state of the interpreter --- a global variable created in one cell can be read by another cell.

Relieving users of manual dependency management, along with a notebook's ability to provide immediate feedback, makes notebooks better suited for iteratively developing a data processing pipeline.
However, the hidden state of the Python interpreter creates hard-to-understand and hard-to-reproduce bugs, such as those caused by users forgetting to rerun dependent cells~\cite{DBLP:journals/ese/PimentelMBF21,joelgrus}.
In addition to requiring error-prone manual state version management, the notebook execution model also precludes parallelism --- cells must be evaluated serially.
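To illustrate, consider the following minimal sketch (ours, not drawn from any particular kernel implementation) in which the kernel's shared global namespace is mimicked by a dictionary passed to `exec`; the final line exhibits exactly the stale-state bug described above:

```python
# The kernel's long-lived global namespace, shared by all cells.
kernel_state = {}

def run_cell(code):
    # Mimics dispatching a cell's code to the kernel.
    exec(code, kernel_state)

run_cell("x = [1, 2, 3]")       # Cell 1: creates a global
run_cell("total = sum(x)")      # Cell 2: implicitly depends on Cell 1's x

# The user edits and reruns Cell 1, but forgets to rerun Cell 2:
run_cell("x = [10, 20, 30]")
# kernel_state["total"] is still 6 -- a stale, undeclared dependency.
```

Nothing in the notebook records that Cell 2 depends on `x`, so no system-level check flags the stale `total`.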

Dynamically collected provenance\OK{Cite} can, in principle, address the former problem, but is notably less useful for the latter.
Conversely, static dataflow analysis~\cite{NN99} could address both concerns, but the dynamic nature of Python means that dependencies can only be extracted approximately: conservatively, limiting opportunities for parallelism; or optimistically, risking missed inter-cell dependencies.
In this paper, we bridge this gap by proposing a hybrid of the two provenance models that we call \emph{Incremental Provenance}.
We then begin to explore the challenges of working with provenance metadata that is incrementally refined, from approximate provenance obtained through static analysis to exact, dynamically collected provenance.
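As a concrete sketch of the kind of static analysis involved (our illustration, not Vizier's actual importer), the variables a cell reads and writes can be over-approximated by walking its abstract syntax tree:

```python
import ast

def approx_reads_writes(cell_code):
    """Over-approximate the global variables a cell reads and writes.

    Deliberately coarse: dynamic features such as exec(), getattr(),
    or star-imports defeat this analysis, which is why statically
    derived provenance is only ever approximate.
    """
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    return reads, writes

# A cell that depends on `df` and defines `clean`:
reads, writes = approx_reads_writes("clean = df.dropna()")
```

Treating every read as a potential inter-cell dependency yields the conservative bound; assuming unmatched reads are local yields the optimistic one.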

As a case study for exploring incremental provenance, we adapt a workflow-style, parallelism-capable scheduling engine for use with Jupyter notebooks.
In addition to adapting the scheduler itself, parallel execution requires a fundamental shift away from execution in a single monolithic kernel and towards lighter-weight per-cell interpreters.
As a basis for our implementation, we leverage Vizier~\cite{brachmann:2019:sigmod:data, brachmann:2020:cidr:your}, an isolated cell execution (ICE) notebook.
In lieu of an opaque, monolithic kernel, cells run in isolated interpreters and communicate only through dataflow (by reading and writing data artifacts).
This paper outlines the challenges of extending Vizier to support parallel execution through incremental provenance.
We then demonstrate generality by discussing the process for importing Jupyter notebooks into an ICE, including the use of static analysis to extract approximate provenance information.
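The ICE communication model can be sketched as follows (a toy illustration with a hypothetical `ArtifactStore` API, not Vizier's actual interface): each cell executes in a fresh namespace, so data flows between cells only through explicitly named artifacts.

```python
class ArtifactStore:
    """Explicit dataflow channel between isolated cells (hypothetical API)."""
    def __init__(self):
        self._artifacts = {}
    def write(self, name, value):
        self._artifacts[name] = value
    def read(self, name):
        return self._artifacts[name]

def run_isolated_cell(code, store):
    # Fresh namespace per cell: no hidden interpreter state survives,
    # and every inter-cell dependency is visible as a read or write.
    exec(code, {"store": store})

store = ArtifactStore()
run_isolated_cell("store.write('x', [1, 2, 3])", store)
run_isolated_cell("store.write('total', sum(store.read('x')))", store)
```

Because reads and writes are explicit, the runtime observes exact provenance as a side effect of execution, and mutually independent cells can safely run in parallel.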
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=0.8\columnwidth]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}: on average, only one out of every four notebook cells must be run serially.}
\label{fig:parallelismSurvey}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

As a preliminary assessment of the potential for parallelizing Jupyter notebooks, we surveyed an archive of Jupyter notebooks scraped from GitHub by Pimentel et al.~\cite{DBLP:journals/ese/PimentelMBF21}.
Our survey included only notebooks that use a Python kernel and are known to execute successfully; a total of nearly 6000 notebooks met these criteria.
We constructed a dataflow graph as described in \Cref{sec:import}; as a proxy measure for potential speedup, we considered the depth of this graph.
\Cref{fig:parallelismSurvey} presents the depth --- the maximum number of cells that must be executed serially --- in relation to the total number of Python cells in the notebook.
The average notebook has over 16 cells but an average dependency depth of just under 4, suggesting that, on average, around four cells can run concurrently.
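The depth measure is simply the longest dependency chain in the cell dataflow graph. A minimal sketch (ours, assuming a hypothetical `deps` map from each cell to the cells it reads from, with the graph acyclic as in a notebook dataflow graph):

```python
def workflow_depth(deps):
    """Length of the longest chain of cells that must execute serially."""
    memo = {}
    def depth(cell):
        if cell not in memo:
            # A cell's depth is one more than its deepest dependency.
            memo[cell] = 1 + max((depth(d) for d in deps[cell]), default=0)
        return memo[cell]
    return max((depth(c) for c in deps), default=0)

# Four cells, of which b and c can run in parallel after a:
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
```

Here `workflow_depth(deps)` is 3 for 4 cells, i.e., one of the four cells can be overlapped with another.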

We outline the key ideas of our central contribution, incremental provenance, in \Cref{sec:approx-prov}, and review Vizier's ICE architecture in \Cref{sec:isolation}.
The remainder of the paper focuses on our remaining contributions:
% \begin{itemize}
% \item
(i) \textbf{An Incremental Provenance Scheduler}: In \Cref{sec:scheduler}, we present a scheduler for incremental notebook execution; we identify the challenges that arise due to provenance mispredictions and discuss how to compensate for them.

% \item
(ii) \textbf{Jupyter Import}: \Cref{sec:import} discusses how we extract approximate provenance from Python code, and how existing notebooks written for Jupyter (or comparable monolithic-kernel architectures) can be translated to ICE notebook architectures like Vizier.

% \item
(iii) \textbf{Implementation in Vizier and Experiments}: We have implemented a preliminary prototype of the proposed scheduler in Vizier. \Cref{sec:experiments} presents our initial experiments with parallel evaluation of Jupyter notebooks using this implementation.

% \end{itemize}

% A computational notebook is an ordered sequence of \emph{cells}, each containing a block of (e.g., python) code or (e.g., markdown) documentation.
% For the sake of simplicity, we ignore documentation cells, which do not impact notebook evaluation semantics.
% A notebook is evaluated by instantiating a python interpreter --- usually called a kernel --- and sequentially using the kernel to evaluate each cell's code.
% The python interpreter retains execution state after each cell is executed, allowing information to flow between cells.
% Our model of the notebook is thus essentially a single large script obtained by concatenating all code cells together.

% This execution model has two critical shortcomings.
% First, while users may explicitly declare opportunities for parallelism (e.g., by using data-parallel libraries like Spark or TensorFlow), inter-cell parallelism opportunities are lost.
% The use of python interpreter state for inter-cell communication requires that each cell finish before the next cell can run.
% Second, partial re-execution of cells is possible, but requires users to manually re-execute affected cells.
% We note that even a naive approach like re-executing all cells subsequent to a modified cell is not possible.
% The use of python interpreter state for inter-cell communication makes it difficult to reason about whether a cell's inputs are unchanged from the last time it was run.

% In this paper, we propose a new workflow-style runtime for python notebooks that addresses both shortcomings.
% Notably, we propose a runtime that relies on a hybrid of dataflow- and workflow-style provenance models;
% Analogous to workflow-style provenance models, dependencies are tracked coarsely at the level of cells.
% However, like dataflow provenance, dependencies are discovered automatically through a combination of static analysis and dynamic instrumentation, rather than being explicitly declared.

% Parallel execution and incremental re-execution require overcoming three key challenges: isolation, scheduling, and translation.
% First, as discussed above, the state of the python interpreter is not a suitable medium for inter-cell communication.
% Ideally, each cell could be executed in an isolated environment with explicit information flow between cells.
% The cell isolation mechanism should also permit efficient checkpointing of inter-cell state, and cleanly separate the transient state of concurrently executing cells.
% We discuss our approach to isolation in \Cref{sec:isolation}.

% Second, scheduling requires deriving a partial order over the notebook's cells.
% In a typical workflow system, such dependencies are explicitly provided.
% However, \emph{correctly} inferring dependencies statically is intractable.
% Thus the scheduler needs to be able to execute a workflow with a dynamically changing dependency graph.
% We discuss how a workflow scheduler can merge conservative, statically derived provenance bounds with dynamic provenance collected during cell execution in \Cref{sec:scheduler}.

% Finally, notebooks written for a kernel-based runtime assume that the effects of one cell will be visible in the next.
% We discuss how kernel-based notebooks can be translated into our execution model in \Cref{sec:import}.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: