%!TEX root=../main.tex
Workflow systems\cite{DC07} \OK{cite several systems: Vistrails, etc...}\BG{did use a survey instead for now} help users break complex tasks like ETL processes or model fitting into a series of smaller steps. Users explicitly declare dependencies between steps, permitting parallel execution of mutually independent steps. A recent trend in industry has been to instead encode such tasks as computational notebooks like Jupyter or Zeppelin. Notebooks likewise allow users to express a task as a series of steps, but do not require the user to explicitly declare dependencies. Instead, the execution of each cell's code is triggered manually by the user. In the background, a Python process is responsible for executing the code. The state of this process is how cells share data, e.g., a global variable created in one cell can be read by another cell. A notebook interface is better suited for iteratively developing a data processing pipeline, because the user can make incremental changes to the steps of their pipeline and get immediate feedback by re-executing (part of) the pipeline. Furthermore, users do not have to declare dependencies explicitly, which relieves them from having to design the whole dataflow of their pipeline upfront.

However, these advantages of notebooks come at a price. The hidden state of the Python interpreter can lead to bugs that are hard to understand and reproduce, e.g., when the user forgets to rerun dependent cells, as has been reported in prior work~\cite{DBLP:journals/ese/PimentelMBF21}. Furthermore, it prevents parallel execution of cells: to determine whether two cells can safely run in parallel, a scheduler for notebooks would have to analyze the cells' Python code to determine whether there are any data dependencies between them. While static analysis~\cite{NN99} could be used for this purpose, the dynamic nature of Python implies that static dataflow analysis either results in very coarse over-approximations of the real dependencies or cannot guarantee that all dependencies are found, which would lead to incorrect results if two dependent cells are executed in parallel.
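
For illustration, consider the following (hypothetical) cells. The second cell depends on the first only through the interpreter's global state, and dynamic constructs such as \texttt{globals()} in the third cell can hide such dependencies from a static analysis:
\begin{verbatim}
# --- Cell 1 ---
import pandas as pd
df = pd.read_csv("sales.csv")      # "sales.csv" is a made-up example input

# --- Cell 2 ---
total = df["amount"].sum()         # reads df from the hidden interpreter state

# --- Cell 3 ---
name = "df" if total > 0 else "summary"
globals()[name] = total            # the written variable is only known at runtime
\end{verbatim}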
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}: On average, only one out of every four notebook cells must be run serially.}
\label{fig:parallelismSurvey}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
With Vizier~\cite{brachmann:2019:sigmod:data, brachmann:2020:cidr:your}, we introduced an alternative execution model for notebooks: the cells of a notebook are steps in a workflow that run isolated from each other, except for the ability to communicate through dataflow (by reading and writing data artifacts). Vizier uses workflow evolution provenance~\cite{FS06} to keep track of changes to a notebook, a version model for data artifacts that exploits immutability, and a workflow execution engine that is capable of incrementally executing the relevant parts of a workflow after it has been modified. This addresses the reproducibility issues of notebooks: a cell is automatically re-executed if a cell it depends on, directly or indirectly, is modified, and all versions of the workflow and the corresponding data artifacts are retained.\footnote{Data artifacts for previous versions of a notebook / workflow may be garbage collected to save space, but can be reproduced by re-executing the version of the workflow that generated them if need be.} In contrast to traditional workflow systems, the user neither has to design a workflow upfront nor specify dataflow explicitly. Data dependencies are detected at runtime when a cell's code accesses a data artifact through an API provided by the system. This means that support for new programming languages can be added to the system relatively easily by implementing a new artifact API. However, the ease of use of this approach comes at the cost of less flexible scheduling, because new dependencies may arise during cell execution, essentially preventing parallel execution of cells. Furthermore, the fact that cells are isolated from each other prevents easy translation from commonly used notebook formats such as the one used by Jupyter, because notebooks typically rely on the hidden Python interpreter state for inter-cell communication. This, in turn, negatively impacts the adoption of Vizier, because users cannot easily port their notebooks to the system.
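
The sketch below illustrates this style of explicit inter-cell dataflow; the \texttt{artifacts.load} and \texttt{artifacts.save} calls are hypothetical placeholders for an artifact API like Vizier's, not its actual interface:
\begin{verbatim}
# --- Cell 1 ---
import pandas as pd
df = pd.read_csv("sales.csv")
artifacts.save("sales", df)        # publish an immutable data artifact

# --- Cell 2 ---
sales = artifacts.load("sales")    # the read is observed at runtime,
print(sales["amount"].sum())       # creating a dependency on Cell 1
\end{verbatim}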
In this paper, we present an approach that uses approximate provenance for Python code, computed using static program analysis, to address the two issues mentioned above. The result is a novel \emph{coarse-grained} dataflow provenance model --- a hybrid of classical workflow and dataflow provenance --- that permits not only parallel execution, but also incremental re-execution of computational notebooks.
We outline the implementation of this provenance model in our workflow system Vizier~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data} and address several of the challenges that arise when parallelizing notebooks.
\subsection{Parallelism}
To assess the potential of parallelizing Jupyter notebooks, we conducted a preliminary survey on an archive of Jupyter notebooks scraped from GitHub by Pimentel et al.~\cite{DBLP:journals/ese/PimentelMBF21}.
Our survey included only notebooks that use a Python kernel and are known to execute successfully; nearly 6,000 notebooks met these criteria.
We constructed a dataflow graph as described in \Cref{sec:import}; as a proxy measure for potential speedup, we considered the depth of this graph.
\Cref{fig:parallelismSurvey} presents the depth --- the maximum number of cells that must be executed serially --- in relation to the total number of Python cells in the notebook.
The average notebook has over 16 cells, but an average dependency depth of just under 4; that is, on average around four cells could run concurrently.
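
As a sketch of how this proxy measure can be computed (assuming the inter-cell dependency graph has already been extracted as an acyclic \texttt{deps} mapping), the depth is the longest path in the dataflow graph:
\begin{verbatim}
from functools import lru_cache

# deps maps each cell to the set of cells it depends on (a DAG)
deps = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}, 4: set()}

def depth(deps):
    @lru_cache(maxsize=None)
    def longest(cell):
        # cells that must run serially up to and including `cell`
        return 1 + max((longest(d) for d in deps[cell]), default=0)
    return max((longest(c) for c in deps), default=0)

print(depth(deps))   # 3: e.g., cells 0 -> 1 -> 3 must run one after another
\end{verbatim}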
The main contributions of this work are:
\begin{itemize}
\item \textbf{Approximate Provenance for Python using Static Analysis}: We introduce an approach that captures data provenance (data dependencies) using static analysis of the Python code of a notebook's cells (a minimal sketch of this analysis follows the list below). The approach is approximate in that it allows for both false positives and false negatives. Through this approximation, however, we circumvent the problem that a conservative static dataflow analysis for Python would face: to account for Python's dynamic features, it would have to resort to a very coarse-grained over-approximation of the data dependencies.
\item \textbf{A Scheduler for Parallel Execution of Workbook Cells}: We present a scheduler for incremental execution of workbooks, i.e., workflows whose modules are the cells of a Python computational notebook. The scheduler uses the approximate data dependencies determined by our static analysis of the workbook's Python code to enable parallel execution of cells that are independent in terms of dataflow. To compensate for false positives (predicted data dependencies that do not materialize at runtime), the scheduler dynamically adapts the execution plan when a predicted dependency does not materialize during the execution of a cell. To deal with false negatives (data dependencies that materialize at runtime but were not predicted by the static analysis), the scheduler can roll back the execution of cells whose execution was invalid (i.e., that should have observed the changes made by another cell).
\item \textbf{Jupyter Import for Workbook Systems based on Provenance}: Since cells in a workbook are isolated from each other and communicate only explicitly through dataflow, importing Jupyter notebooks into Vizier (or any other workbook system) cannot naively preserve the cell structure of the notebook. A naive approach would be to merge all cells into a single cell to circumvent the problem caused by global state that is shared among cells. However, the structure of the notebook and the intermediate results that are shown inline in the Jupyter notebook would then be lost in the imported workbook. Instead, we utilize our approximate provenance approach for Python to determine data dependencies between cells. Any object that is shared across cells is then wrapped as a data artifact in Vizier, i.e., we make the dataflow across cells explicit. This allows us to preserve the structure of the notebook: there is a one-to-one correspondence between the cells of the notebook and the cells of the generated workbook. While technically this is not safe if there are false negatives in terms of data dependencies, the types of Python features that lead to false negatives are rarely used in notebooks. Furthermore, our approach is safe in the sense that dependencies missed by the static analysis are detected at runtime, in which case the affected cells can be rolled back and re-executed.
\item \textbf{Implementation in Vizier and Experiments}: We have implemented a prototype of the proposed scheduler (except for the compensation for false negatives) in our workbook system Vizier and have evaluated the potential for parallel execution by importing real-world Jupyter notebooks into Vizier using the proposed techniques and comparing their serial and parallel execution.
\end{itemize}
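
As referenced in the first contribution above, the following is a minimal sketch of how approximate reads and writes can be extracted from a cell with Python's \texttt{ast} module. It is deliberately simplified: it only considers plain names and misses, e.g., names bound by imports or accessed through \texttt{globals()}:
\begin{verbatim}
import ast

def cell_reads_writes(code):
    """Roughly approximate the global names a cell reads and writes."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            (reads if isinstance(node.ctx, ast.Load) else writes).add(node.id)
    return reads, writes

# cell 2 likely depends on cell 1 if cell 1 writes a name that cell 2 reads
r1, w1 = cell_reads_writes("df = pd.read_csv('sales.csv')")
r2, w2 = cell_reads_writes("total = df['amount'].sum()")
print(w1 & r2)   # {'df'}
\end{verbatim}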
A computational notebook is an ordered sequence of \emph{cells}, each containing a block of (e.g., Python) code or (e.g., Markdown) documentation.
For the sake of simplicity, we ignore documentation cells, which do not impact notebook evaluation semantics.
A notebook is evaluated by instantiating a Python interpreter --- usually called a kernel --- and sequentially using the kernel to evaluate each cell's code.
The Python interpreter retains execution state after each cell is executed, allowing information to flow between cells.
Our model of the notebook is thus essentially a single large script obtained by concatenating all code cells together.
This execution model has two critical shortcomings.
First, while users may explicitly declare opportunities for parallelism (e.g., by using data-parallel libraries like Spark or TensorFlow), inter-cell parallelism opportunities are lost.
The use of Python interpreter state for inter-cell communication requires that each cell finish before the next cell can run.
Second, partial re-execution of cells is possible, but requires users to manually re-execute affected cells.
We note that even a naive approach like automatically re-executing all cells subsequent to a modified cell is not safe in general:
the use of Python interpreter state for inter-cell communication makes it difficult to reason about whether a cell's inputs are unchanged from the last time it was run.
In this paper, we propose a new workflow-style runtime for python notebooks that addresses both shortcomings.
Notably, we propose a runtime that relies on a hybrid of dataflow- and workflow-style provenance models.
Analogous to workflow-style provenance models, dependencies are tracked coarsely at the level of cells.
However, like dataflow provenance, dependencies are discovered automatically through a combination of static analysis and dynamic instrumentation, rather than being explicitly declared.
Parallel execution and incremental re-execution require overcoming three key challenges: isolation, scheduling, and translation.
First, as discussed above, the state of the Python interpreter is not a suitable medium for inter-cell communication.
Ideally, each cell could be executed in an isolated environment with explicit information flow between cells.
The cell isolation mechanism should also permit efficient checkpointing of inter-cell state, and cleanly separate the transient state of concurrently executing cells.
We discuss our approach to isolation in \Cref{sec:isolation}.
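
A minimal sketch of one possible isolation mechanism (not Vizier's actual implementation): each cell runs in its own process, and the only state it sees is an explicitly passed artifact dictionary:
\begin{verbatim}
from multiprocessing import Pool

def run_cell(args):
    code, artifacts = args
    scope = dict(artifacts)       # the cell sees only the passed-in artifacts
    exec(code, scope)
    # newly created bindings become the cell's output artifacts
    return {k: v for k, v in scope.items()
            if k not in artifacts and not k.startswith("__")}

if __name__ == "__main__":
    cells = ["x = 2 + 2", "y = 'hello'"]      # two independent cells
    with Pool(2) as pool:
        print(pool.map(run_cell, [(c, {}) for c in cells]))
    # prints [{'x': 4}, {'y': 'hello'}]
\end{verbatim}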
Second, scheduling requires deriving a partial order over the notebook's cells.
In a typical workflow system, such dependencies are explicitly provided.
However, \emph{correctly} inferring dependencies statically is intractable.
Thus, the scheduler needs to be able to execute a workflow with a dynamically changing dependency graph.
We discuss how a workflow scheduler can merge conservative, statically derived provenance bounds with dynamic provenance collected during cell execution in \Cref{sec:scheduler}.
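
The following is a minimal sketch of such a scheduling loop under simplifying assumptions: only the statically predicted reads and writes are used, the next cell in program order is run when no cell's predicted inputs are available, and the dynamic refinement and rollback described above are omitted. All names are illustrative:
\begin{verbatim}
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_notebook(num_cells, reads, writes, run_cell):
    """reads[i]/writes[i]: predicted artifact names; run_cell(i) runs cell i."""
    available = set()                  # artifacts produced so far
    pending = list(range(num_cells))
    running = {}                       # future -> cell index
    with ThreadPoolExecutor() as pool:
        while pending or running:
            ready = [i for i in pending if reads[i] <= available]
            if not ready and not running:
                ready = pending[:1]    # fall back to program order
            for i in ready:
                running[pool.submit(run_cell, i)] = i
                pending.remove(i)
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                # dynamic provenance observed during execution would refine this
                available |= writes[running.pop(fut)]

# toy usage: cell 1 reads what cell 0 writes; cell 2 is independent of both
run_notebook(3, [set(), {"df"}, set()], [{"df"}, {"total"}, {"plot"}],
             lambda i: print("running cell", i))
\end{verbatim}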
Finally, notebooks written for a kernel-based runtime assume that the effects of one cell will be visible in the next.
We discuss how kernel-based notebooks can be translated into our execution model in \Cref{sec:import}.
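
As an illustrative sketch of this translation (again using the hypothetical \texttt{artifacts} API from above), each inter-cell name identified by the static analysis is turned into an explicit artifact read or write around the original cell body:
\begin{verbatim}
def translate_cell(code, inter_cell_reads, inter_cell_writes):
    """Wrap a Jupyter cell so that shared globals flow through artifacts."""
    prologue = "\n".join(f"{n} = artifacts.load({n!r})"
                         for n in sorted(inter_cell_reads))
    epilogue = "\n".join(f"artifacts.save({n!r}, {n})"
                         for n in sorted(inter_cell_writes))
    return "\n".join(p for p in (prologue, code, epilogue) if p)

print(translate_cell("total = df['amount'].sum()", {"df"}, {"total"}))
# df = artifacts.load('df')
# total = df['amount'].sum()
# artifacts.save('total', total)
\end{verbatim}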
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: