%!TEX root=../main.tex
Workflow systems~\cite{DC07} help users break complex tasks like ETL processes and model fitting into a series of smaller steps.
Users explicitly declare inter-step dependencies, permitting parallel execution of mutually independent steps.
Recently however, systems like Papermill~\cite{papermill} have emerged, allowing users to instead describe such tasks through computational notebooks like Jupyter or Zeppelin.
Notebook users express tasks as a sequence of code `cells' that describe each step of the computation without explicitly declaring dependencies between cells.
Users instead manually trigger cell execution, dispatching the code in the cell to a long-lived Python interpreter called a kernel.
Cells share data through the state of the interpreter --- a global variable created in one cell can be read by another cell.
Relieving users of manual dependency management and providing immediate feedback make notebooks better suited to iteratively developing a data processing pipeline.
However, the hidden state of the Python interpreter creates bugs that are hard to understand and reproduce, such as those that arise when users forget to rerun dependent cells~\cite{DBLP:journals/ese/PimentelMBF21,joelgrus}.
In addition to requiring error-prone manual management of state versions, the notebook execution model also precludes parallelism --- cells must be evaluated serially.
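To make the hidden-state hazard concrete, the following minimal sketch simulates a monolithic kernel as a single shared namespace (the `run_cell` helper is illustrative, not a real Jupyter API): editing and rerunning only an upstream cell leaves the downstream cell's result silently stale.

```python
# Minimal simulation of a monolithic kernel: all cells execute against
# one shared namespace, so inter-cell dependencies are invisible.
kernel_state = {}

def run_cell(src):
    exec(src, kernel_state)  # every cell reads and writes the same globals

run_cell("rate = 5")             # cell 1
run_cell("total = 100 + rate")   # cell 2 silently depends on cell 1

run_cell("rate = 7")             # user edits and reruns only cell 1
print(kernel_state["total"])     # prints 105 -- stale, and no error is raised
```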
Dynamically collected provenance\OK{Cite} can, in principle, address the former problem, but is notably less useful for the latter.
Conversely, static dataflow analysis~\cite{NN99} could address both concerns, but the dynamic nature of Python means that such dependencies can only be extracted approximately --- conservatively, limiting opportunities for parallelism; or optimistically, risking execution errors from missed inter-cell dependencies.
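As a sketch of what such approximate extraction looks like, the function below over-approximates the global names a cell reads and writes by walking its AST. This is a simplification for illustration (it ignores scoping, attribute access, `exec`/`eval`, and other dynamic features); it is not the analysis implemented in any particular system.

```python
import ast

def approx_reads_writes(cell_src):
    """Over-approximate the global names a cell reads and writes."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_src)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # Store or Del contexts count as writes
                writes.add(node.id)
    return reads, writes

# A two-statement "cell": it reads clean, raw, and df, and writes df and total.
r, w = approx_reads_writes("df = clean(raw)\ntotal = df['x'].sum()")
```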
In this paper, we bridge this gap by proposing a hybrid provenance model that we call \emph{Incremental Provenance}, and begin to explore the challenges of working with provenance metadata that is incrementally refined at runtime.
Concretely, we focus on the challenge of leveraging incremental provenance for (i) parallel execution and (ii) incremental partial re-evaluation.
Both tasks require a fundamental shift away from execution in a single monolithic kernel, and towards lighter-weight per-cell interpreters.
Thus, we validate our idea by extending Vizier~\cite{brachmann:2019:sigmod:data, brachmann:2020:cidr:your}, an isolated cell execution (ICE) notebook.
In lieu of an opaque, monolithic kernel, cells in Vizier run in isolated interpreters and communicate only through dataflow (by reading and writing data artifacts).
This model allows concurrently running cells to read from the same artifacts, and allows snapshotting artifacts for re-use during incremental re-evaluation.
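The ICE model can be sketched as follows: each cell executes in a fresh namespace and exchanges data only through an explicit artifact store that snapshots values on write. The names here (`ArtifactStore`, `run_isolated_cell`) are hypothetical illustrations, not Vizier's actual API.

```python
import copy

class ArtifactStore:
    """Snapshotting artifact store for explicit inter-cell dataflow."""
    def __init__(self):
        self._artifacts = {}
    def write(self, name, value):
        # Deep-copy on write so re-execution can reuse this exact version.
        self._artifacts[name] = copy.deepcopy(value)
    def read(self, name):
        return copy.deepcopy(self._artifacts[name])

def run_isolated_cell(src, store, reads):
    env = {name: store.read(name) for name in reads}  # explicit inputs only
    exec(src, env)
    return env

store = ArtifactStore()
env = run_isolated_cell("nums = [1, 2, 3]", store, reads=[])
store.write("nums", env["nums"])
# Two downstream cells read the same artifact; neither sees the other's
# transient state, so they could safely run concurrently.
env_a = run_isolated_cell("total = sum(nums)", store, reads=["nums"])
env_b = run_isolated_cell("count = len(nums)", store, reads=["nums"])
```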
This paper outlines the challenges of extending Vizier to support parallel execution and incremental re-evaluation on a provenance model that is incrementally revised during notebook (re-)execution.
We then show generality by discussing the process for importing Jupyter notebooks into an ICE, including the use of static analysis to extract approximate provenance information.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=0.8\columnwidth]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}: On average, only one in four notebook cells has serial dependencies.}
\label{fig:parallelismSurvey}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As a preliminary assessment of the potential of parallelizing Jupyter notebooks, we conducted a survey on an archive of Jupyter notebooks scraped from GitHub by Pimentel et al.~\cite{DBLP:journals/ese/PimentelMBF21}.
Our survey included only notebooks using a Python kernel and known to execute successfully; a total of nearly 6,000 notebooks met these criteria.
We constructed a dataflow graph as described in \Cref{sec:import}; as a proxy measure for potential speedup, we considered the depth of this graph.
\Cref{fig:parallelismSurvey} presents the depth --- the maximum number of cells that must be executed serially --- in relation to the total number of python cells in the notebook.
The average notebook has over 16 cells but an average dependency depth of just under 4, for an average parallelism factor of roughly 4.
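The depth measure used above can be computed as the longest path through the cell dependency graph; the sketch below (a simplified stand-in for the analysis of \Cref{sec:import}) illustrates how a 16-cell notebook whose longest chain is 4 cells yields a parallelism factor of 4.

```python
from functools import lru_cache

def dependency_depth(deps):
    """Longest chain of serially dependent cells.
    deps maps each cell to the list of cells it reads from (a DAG)."""
    @lru_cache(maxsize=None)
    def depth(cell):
        return 1 + max((depth(d) for d in deps[cell]), default=0)
    return max(depth(c) for c in deps)

# 16 cells, of which only cells 0 -> 1 -> 2 -> 3 form a serial chain.
deps = {i: [] for i in range(16)}
deps[1], deps[2], deps[3] = [0], [1], [2]
print(dependency_depth(deps))        # prints 4
print(16 / dependency_depth(deps))   # parallelism factor: prints 4.0
```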
We outline our central contribution, \textbf{Incremental Provenance}, in \Cref{sec:approx-prov} and review Vizier's ICE architecture in \Cref{sec:isolation}.
The remainder of the paper develops our other contributions:
% \begin{itemize}
% \item
% (i) \textbf{}: , we introduce an approach to provenance management that begins with statically derived approximate provenance, and is refined as dynamic provenance becomes available over the course of evaluation.
% \item
(i) \textbf{An Incremental Provenance Scheduler}: In \Cref{sec:scheduler}, we present a scheduler for incremental notebook execution; we identify the challenges that arise due to provenance mispredictions and discuss how to compensate for them.
% \item
(ii) \textbf{Jupyter Import}: \Cref{sec:import} discusses how we extract approximate provenance from python code, and how existing notebooks written for Jupyter (or comparable monolithic kernel architectures) can be translated to ICE notebook architectures like Vizier.
% \item
(iii) \textbf{Implementation in Vizier and Experiments}: We have implemented a preliminary prototype of the proposed scheduler in Vizier. \Cref{sec:experiments} presents our experiments with parallel evaluation of Jupyter notebooks.
% \end{itemize}
% A computational notebook is an ordered sequence of \emph{cells}, each containing a block of (e.g., python) code or (e.g., markdown) documentation.
% For the sake of simplicity, we ignore documentation cells, which do not impact notebook evaluation semantics.
% A notebook is evaluated by instantiating a python interpreter --- usually called a kernel --- and sequentially using the kernel to evaluate each cell's code.
% The python interpreter retains execution state after each cell is executed, allowing information to flow between cells.
% Our model of the notebook is thus essentially a single large script obtained by concatenating all code cells together.
% This execution model has two critical shortcomings.
% First, while users may explicitly declare opportunities for parallelism (e.g., by using data-parallel libraries like Spark or TensorFlow), inter-cell parallelism opportunities are lost.
% The use of python interpreter state for inter-cell communication requires that each cell finish before the next cell can run.
% Second, partial re-execution of cells is possible, but requires users to manually re-execute affected cells.
% We note that even a naive approach like re-executing all cells subsequent to a modified cell is not possible.
% The use of python interpreter state for inter-cell communication makes it difficult to reason about whether a cell's inputs are unchanged from the last time it was run.
% In this paper, we propose a new workflow-style runtime for python notebooks that addresses both shortcomings.
% Notably, we propose a runtime that relies on a hybrid of dataflow- and workflow-style provenance models;
% Analogous to workflow-style provenance models, dependencies are tracked coarsely at the level of cells.
% However, like dataflow provenance, dependencies are discovered automatically through a combination of static analysis and dynamic instrumentation, rather than being explicitly declared.
% Parallel execution and incremental re-execution require overcoming three key challenges: isolation, scheduling, and translation.
% First, as discussed above, the state of the python interpreter is not a suitable medium for inter-cell communication.
% Ideally, each cell could be executed in an isolated environment with explicit information flow between cells.
% The cell isolation mechanism should also permit efficient checkpointing of inter-cell state, and cleanly separate the transient state of concurrently executing cells.
% We discuss our approach to isolation in \Cref{sec:isolation}.
% Second, scheduling requires deriving a partial order over the notebook's cells.
% In a typical workflow system, such dependencies are explicitly provided.
% However, \emph{correctly} inferring dependencies statically is intractable.
% Thus the scheduler needs to be able to execute a workflow with a dynamically changing dependency graph.
% We discuss how a workflow scheduler can merge conservative, statically derived provenance bounds with dynamic provenance collected during cell execution in \Cref{sec:scheduler}.
% Finally, notebooks written for a kernel-based runtime assume that the effects of one cell will be visible in the next.
% We discuss how kernel-based notebooks can be translated into our execution model in \Cref{sec:import}.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: