%!TEX root=../main.tex
Workflow systems~\cite{DC07} help users break complex tasks like ETL and model-fitting into a series of smaller steps.
Users explicitly declare inter-step dependencies, permitting parallel execution of mutually independent steps.
% Recently however, systems like Papermill~\cite{papermill} have emerged,
Systems like Jupyter % or Zeppelin
allow users to instead describe such tasks through computational notebooks.
Notebook users express tasks as a sequence of code `cells' that describe each step of the computation without explicitly declaring dependencies between cells.
Users manually trigger cell execution, dispatching the code in the cell to a long-lived Python interpreter called a kernel.
Cells share data through the state of the interpreter, e.g., a global variable created in one cell can be read by another cell.
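As a minimal sketch of this sharing (not any particular kernel's implementation), a kernel behaves like repeated \texttt{exec} calls against one shared namespace; the cell contents below are invented for illustration:

```python
# Two notebook "cells" sharing state through one interpreter namespace,
# much as a Jupyter kernel does. Note that cell_2's dependency on the
# global `df_clean` created by cell_1 is never declared anywhere.
cell_1 = "df_clean = [x for x in raw if x is not None]"
cell_2 = "total = sum(df_clean)"

kernel_state = {"raw": [1, None, 2, 3]}  # shared interpreter state
exec(cell_1, kernel_state)   # must run first...
exec(cell_2, kernel_state)   # ...or cell_2 raises NameError
print(kernel_state["total"])  # -> 6
```

Running the cells in the opposite order fails, yet nothing in the notebook records that constraint.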
Relieving users of manual dependency management, combined with the ability to provide immediate feedback, makes notebooks better suited to iteratively developing a data processing pipeline.
However, the hidden state of the Python interpreter creates bugs that are hard to understand and reproduce, such as users forgetting to rerun dependent cells~\cite{DBLP:journals/ese/PimentelMBF21,
% joelgrus,
brachmann:2020:cidr:your}\footnote{https://www.youtube.com/watch?v=7jiPeIFXb6U}.
In addition to requiring error-prone manual management of state versions, the notebook execution model also precludes parallelism: cells must be evaluated serially because their inter-dependencies are not known upfront.
Dynamically collected provenance~\cite{pimentel-17-n} can address the former problem, but is of limited use for scheduling, since dependencies are only learned after running a cell.
By the time we learn that two cells can be safely executed concurrently, they have already completed.
Static dataflow analysis~\cite{NN99} addresses both needs, but the dynamic nature of Python forces approximation of dependencies: conservative approximation limits opportunities for parallelism, while optimistic approximation risks execution errors from missed inter-cell dependencies.
We bridge the static/dynamic gap by proposing a hybrid approach: \emph{Incremental Runtime Provenance Refinement}, where statically approximated provenance is incrementally refined at runtime.
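To make the static side concrete, a simple (and deliberately approximate) dependency extractor can be built with Python's \texttt{ast} module, recording which global names each cell reads and writes; the helper and cells below are illustrative, not our actual analysis:

```python
import ast

def cell_deps(code: str):
    """Approximate the global names a cell reads and writes.
    Dynamic features (exec, getattr, star-imports) defeat this,
    which is why the result is only an approximation."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                reads.add(node.id)
    return reads, writes

cells = ["a = 1", "b = a + 1", "c = 2"]
writes_by_cell = [cell_deps(c)[1] for c in cells]
# Cell j depends on an earlier cell i if j reads something i writes.
deps = {j: {i for i in range(j)
            if cell_deps(cells[j])[0] & writes_by_cell[i]}
        for j in range(len(cells))}
print(deps)  # cell 1 depends on cell 0; cells 0 and 2 are independent
```

Here cells 0 and 2 could safely run in parallel, something a kernel-based notebook never exploits.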
We demonstrate how to utilize provenance refinement to enable (i) parallel execution and (ii) partial re-evaluation of notebooks.
This requires a fundamental shift away from execution in a single monolithic kernel towards isolated per-cell interpreters.
We validate our ideas by extending \emph{Vizier}~\cite{brachmann:2020:cidr:your}, an isolated cell execution (\textit{ICE}) notebook.
Cells in Vizier run in isolated interpreters and communicate only through dataflow (by reading / writing data artifacts).
%This model allows concurrently running cells to read the same artifact, and allows snapshotting artifacts for re-use during incremental re-evaluation.
We outline the challenges of extending Vizier to support parallelism and incremental re-evaluation based on approximate provenance that is determined using static program analysis and incrementally refined during (re-)execution.
Additionally, this enables Jupyter notebooks to be imported into an ICE architecture, which requires that data dependencies between cells be made explicit as inter-cell dataflow.
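The contrast with the kernel model can be sketched as follows; this is a toy rendering of the ICE idea under invented names, not Vizier's actual API:

```python
# Hypothetical sketch of isolated cell execution (ICE): each cell runs
# in a fresh namespace seeded only with its declared inputs, and
# publishes only its declared outputs to a shared artifact store.
artifacts = {}  # named data artifacts flowing between isolated cells

def run_cell(code, reads, writes):
    scope = {name: artifacts[name] for name in reads}  # explicit inputs
    exec(code, scope)
    for name in writes:                                # explicit outputs
        artifacts[name] = scope[name]

run_cell("clean = [1, 2, 3]", reads=[], writes=["clean"])
run_cell("total = sum(clean)", reads=["clean"], writes=["total"])
print(artifacts["total"])  # -> 6
```

Because every input and output is declared, the second cell's dependency on the first is visible to a scheduler before either cell runs.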
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=0.9\columnwidth,trim=0 8 0 0]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\vspace*{-3mm}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}.}
\label{fig:parallelismSurvey}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As a preliminary assessment of the potential for parallelizing Jupyter notebooks, we conducted a survey of notebooks scraped from GitHub by Pimentel et al.~\cite{DBLP:journals/ese/PimentelMBF21}.
We only include notebooks that use Python and are known to execute successfully ($\sim$6000 notebooks).
We constructed a dataflow graph for each notebook as described in \Cref{sec:import}. As a proxy measure for potential speedup, we considered the depth of this graph.
\Cref{fig:parallelismSurvey} presents the depth --- the maximum number of cells that must be executed serially --- in relation to the total number of Python cells in the notebook.
The average notebook has over 16 cells, but an average dependency depth of just under 4 and an average parallelism factor of 4.
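The depth proxy is simply the longest path through the cell dependency DAG; a minimal sketch of its computation, on an invented five-cell example:

```python
# Depth of a cell dependency DAG: the longest chain of cells that must
# run serially. deps[j] lists the earlier cells that cell j reads from.
def workflow_depth(deps):
    depth = {}
    for j in sorted(deps):  # cells only depend on earlier cells
        depth[j] = 1 + max((depth[i] for i in deps[j]), default=0)
    return max(depth.values(), default=0)

# A 5-cell notebook with two independent 2-cell chains joined at the end:
deps = {0: [], 1: [0], 2: [], 3: [2], 4: [1, 3]}
print(workflow_depth(deps))  # -> 3
```

A notebook of 5 cells with depth 3 can, in the best case, finish in the time of 3 serial cell executions.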
We outline our central contribution, \textbf{Incremental Runtime Provenance Refinement}, in \Cref{sec:approx-prov}, and review Vizier's ICE architecture in \Cref{sec:isolation}.
Afterwards, we discuss our remaining contributions:
% \begin{itemize}
% \item
(i) \textbf{An Incremental Provenance-based Scheduler}: In \Cref{sec:scheduler}, we present a scheduler for incremental and parallel notebook execution. We discuss the challenges that arise due to provenance mispredictions and how to compensate for them.
% \item
(ii) \textbf{Jupyter Import}: \Cref{sec:import} discusses how we extract approximate provenance from Python code statically, and how existing notebooks written for Jupyter % (or comparable monolithic kernel architectures)
can be translated to ICE notebook architectures like Vizier.
% \item
(iii) \textbf{Implementation in Vizier and Experiments}: We have implemented a preliminary prototype of the proposed scheduler in Vizier. \Cref{sec:experiments} presents our experiments with parallel evaluation of Jupyter notebooks.
% \end{itemize}
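The refinement step at the heart of contribution (i) can be sketched abstractly: when a cell finishes, its statically over-approximated dependency set is replaced by the dependencies actually observed at runtime. The interface below is purely illustrative, not our scheduler's actual API:

```python
# Hypothetical sketch: refine a cell's statically approximated
# dependency set once its true dependencies are observed at runtime.
static_deps = {0: set(), 1: {0}, 2: {0, 1}, 3: {2}}  # conservative guess

def refine(deps, cell, observed):
    """Swap in runtime-observed deps; report what changed."""
    freed = deps[cell] - observed   # predicted deps that never happened
    missed = observed - deps[cell]  # deps static analysis overlooked
    deps[cell] = set(observed)
    return freed, missed

# Running cell 2 reveals it only read cell 0's output:
freed, missed = refine(static_deps, 2, {0})
print(freed, missed)  # -> {1} set()
```

A freed dependency lets the scheduler unblock waiting cells early, while a missed one (the optimistic failure mode) forces invalidation and re-execution of the affected cells.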
% A computational notebook is an ordered sequence of \emph{cells}, each containing a block of (e.g., Python) code or (e.g., markdown) documentation.
% For the sake of simplicity, we ignore documentation cells, which do not impact notebook evaluation semantics.
% A notebook is evaluated by instantiating a Python interpreter --- usually called a kernel --- and sequentially using the kernel to evaluate each cell's code.
% The Python interpreter retains execution state after each cell is executed, allowing information to flow between cells.
% Our model of the notebook is thus essentially a single large script obtained by concatenating all code cells together.

% This execution model has two critical shortcomings.
% First, while users may explicitly declare opportunities for parallelism (e.g., by using data-parallel libraries like Spark or TensorFlow), inter-cell parallelism opportunities are lost.
% The use of Python interpreter state for inter-cell communication requires that each cell finish before the next cell can run.
% Second, partial re-execution of cells is possible, but requires users to manually re-execute affected cells.
% We note that even a naive approach like re-executing all cells subsequent to a modified cell is not possible.
% The use of Python interpreter state for inter-cell communication makes it difficult to reason about whether a cell's inputs are unchanged from the last time it was run.

% In this paper, we propose a new workflow-style runtime for Python notebooks that addresses both shortcomings.
% Notably, we propose a runtime that relies on a hybrid of dataflow- and workflow-style provenance models;
% Analogous to workflow-style provenance models, dependencies are tracked coarsely at the level of cells.
% However, like dataflow provenance, dependencies are discovered automatically through a combination of static analysis and dynamic instrumentation, rather than being explicitly declared.

% Parallel execution and incremental re-execution require overcoming three key challenges: isolation, scheduling, and translation.
% First, as discussed above, the state of the Python interpreter is not a suitable medium for inter-cell communication.
% Ideally, each cell could be executed in an isolated environment with explicit information flow between cells.
% The cell isolation mechanism should also permit efficient checkpointing of inter-cell state, and cleanly separate the transient state of concurrently executing cells.
% We discuss our approach to isolation in \Cref{sec:isolation}.

% Second, scheduling requires deriving a partial order over the notebook's cells.
% In a typical workflow system, such dependencies are explicitly provided.
% However, \emph{correctly} inferring dependencies statically is intractable.
% Thus the scheduler needs to be able to execute a workflow with a dynamically changing dependency graph.
% We discuss how a workflow scheduler can merge conservative, statically derived provenance bounds with dynamic provenance collected during cell execution in \Cref{sec:scheduler}.

% Finally, notebooks written for a kernel-based runtime assume that the effects of one cell will be visible in the next.
% We discuss how kernel-based notebooks can be translated into our execution model in \Cref{sec:import}.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: