This commit is contained in:
Boris Glavic 2022-03-30 21:04:23 -05:00
parent 3bb622c2ac
commit 57375fb110
3 changed files with 44 additions and 3 deletions


@@ -100,9 +100,13 @@
\label{sec:introduction}
\input{sections/introduction.tex}
\subsection{Problem Statement}
\label{sec:problem}
\input{sections/problem}
% \subsection{Problem Statement}
% \label{sec:problem}
% \input{sections/problem}
\section{Approximate Provenance for Python}
\label{sec:approx-prov}
\input{sections/approx-prov}
\section{Task Isolation}
\label{sec:isolation}

sections/approx-prov.tex (new file, 5 lines added)

@@ -0,0 +1,5 @@
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End:


@@ -39,6 +39,38 @@ The main contributions of this work are:
\item \textbf{Implementation in Vizier and Experiments}: We have implemented a prototype of the proposed scheduler (except for compensation for false negatives) in our workbook system Vizier and have evaluated the potential for parallel execution by importing real-world Jupyter notebooks into Vizier using the proposed techniques and comparing their serial and parallel execution.
\end{itemize}
A computational notebook is an ordered sequence of \emph{cells}, each containing a block of code (e.g., python) or documentation (e.g., markdown).
For the sake of simplicity, we ignore documentation cells, which do not impact notebook evaluation semantics.
A notebook is evaluated by instantiating a python interpreter --- usually called a kernel --- and sequentially using the kernel to evaluate each cell's code.
The python interpreter retains execution state after each cell is executed, allowing information to flow between cells.
Our model of the notebook is thus essentially a single large script obtained by concatenating all code cells together.
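To illustrate (a minimal, hypothetical example), consider a two-cell notebook in which the second cell reads a variable that the first cell left behind in the interpreter's state:
\begin{verbatim}
# Cell 1: binds `records` in the shared interpreter state.
records = [1, 2, 3]

# Cell 2: reads `records` left behind by Cell 1 and derives `total`.
total = sum(records)
print(total)  # 6
\end{verbatim}
Evaluating this notebook is equivalent to running the two cells as one concatenated script.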
This execution model has two critical shortcomings.
First, while users may explicitly declare opportunities for parallelism (e.g., by using data-parallel libraries like Spark or TensorFlow), inter-cell parallelism opportunities are lost.
The use of python interpreter state for inter-cell communication requires that each cell finish before the next cell can run.
Second, partial re-execution of a notebook is possible, but requires users to manually identify and re-execute the affected cells.
We note that even a naive approach, such as re-executing all cells subsequent to the modified cell, is not possible.
The use of python interpreter state for inter-cell communication makes it difficult to reason about whether a cell's inputs are unchanged from the last time it was run.
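As a hypothetical example of why this reasoning is hard, consider three cells in which the dependency of the last cell on the second is visible only through a side effect on a shared object:
\begin{verbatim}
# Cell 1: creates a shared object.
config = {"threshold": 10}

# Cell 2: mutates the object created by Cell 1 without rebinding the name.
def tighten(cfg):
    cfg["threshold"] = 5

tighten(config)

# Cell 3: its result depends on whether Cell 2 has run, even though it
# never mentions Cell 2 or `tighten`.
selected = [x for x in range(20) if x < config["threshold"]]
\end{verbatim}
After editing Cell 2, a purely textual inspection of Cell 3 gives no indication that it must be re-executed.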
In this paper, we propose a new workflow-style runtime for python notebooks that addresses both shortcomings.
Notably, we propose a runtime that relies on a hybrid of dataflow- and workflow-style provenance models:
analogous to workflow-style provenance models, dependencies are tracked coarsely at the level of cells.
However, like dataflow provenance, dependencies are discovered automatically through a combination of static analysis and dynamic instrumentation, rather than being explicitly declared.
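As a rough sketch of the static side of this analysis (the helper \texttt{static\_rw} below is illustrative, not our actual implementation, and assumes only python's standard \texttt{ast} module), the names a cell may read or write can be conservatively over-approximated as follows:
\begin{verbatim}
import ast

def static_rw(cell_src):
    """Conservatively over-approximate the names a cell reads and writes."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_src)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:                          # Store or Del context
                writes.add(node.id)
    # A real analysis would also handle attributes, imports, scoping, and
    # names that are written before they are read within the same cell.
    return reads, writes

print(static_rw("total = sum(records)"))
# e.g. ({'sum', 'records'}, {'total'})
\end{verbatim}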
Parallel execution and incremental re-execution require overcoming three key challenges: isolation, scheduling, and translation.
First, as discussed above, the state of the python interpreter is not a suitable medium for inter-cell communication.
Ideally, each cell could be executed in an isolated environment with explicit information flow between cells.
The cell isolation mechanism should also permit efficient checkpointing of inter-cell state, and cleanly separate the transient state of concurrently executing cells.
We discuss our approach to isolation in \Cref{sec:isolation}.
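As a toy sketch of this idea (\texttt{run\_cell} and the dictionary-based store below are purely illustrative, not the mechanism of \Cref{sec:isolation}), each cell can be executed against its own namespace, with all inter-cell state flowing through an explicit store:
\begin{verbatim}
import copy

def run_cell(cell_src, inputs, store):
    # Copy only the declared inputs into a fresh namespace, run the cell
    # there, and return everything it defined as its exported state.
    scope = {name: copy.deepcopy(store[name]) for name in inputs}
    exec(cell_src, scope)
    return {k: v for k, v in scope.items() if not k.startswith("__")}

store = {}
store.update(run_cell("records = [1, 2, 3]", [], store))
store.update(run_cell("total = sum(records)", ["records"], store))
print(store["total"])  # 6
\end{verbatim}
Because every value a cell consumes or produces passes through the store, the store can be checkpointed, and concurrently executing cells never observe each other's transient state.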
Second, scheduling requires deriving a partial order over the notebook's cells.
In a typical workflow system, such dependencies are explicitly provided.
However, \emph{correctly} inferring dependencies statically is intractable.
Thus, the scheduler must be able to execute a workflow with a dynamically changing dependency graph.
We discuss how a workflow scheduler can merge conservative, statically derived provenance bounds with dynamic provenance collected during cell execution in \Cref{sec:scheduler}.
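A minimal sketch of the underlying ready-set logic (the names \texttt{schedule} and \texttt{static\_deps} are hypothetical, and dynamic refinement and re-execution are omitted):
\begin{verbatim}
def schedule(cells, static_deps):
    # Repeatedly pick every cell whose (conservative) dependency set has
    # already finished; cells picked in the same round are independent of
    # one another and could execute in parallel.
    done, rounds = set(), []
    while len(done) < len(cells):
        ready = [i for i in range(len(cells))
                 if i not in done and static_deps[i] <= done]
        done.update(ready)
        rounds.append(ready)
    return rounds

# Cells 1 and 2 both read only cell 0's outputs, so they share a round.
print(schedule(["c0", "c1", "c2"], {0: set(), 1: {0}, 2: {0}}))
# [[0], [1, 2]]
\end{verbatim}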
Finally, notebooks written for a kernel-based runtime assume that the effects of one cell will be visible in the next.
We discuss how kernel-based notebooks can be translated into our execution model in \Cref{sec:import}.
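For intuition only (the helper \texttt{translate\_cell} below is hypothetical and omits the many subtleties \Cref{sec:import} addresses), a kernel-style cell can be rewritten into a function whose parameters and return values make its inter-cell reads and writes explicit:
\begin{verbatim}
def translate_cell(cell_src, reads, writes):
    # Wrap the cell body in a function whose parameters are the artifacts
    # it reads and whose return values are the artifacts it writes.
    body = "\n".join("    " + line for line in cell_src.splitlines())
    return (f"def cell({', '.join(reads)}):\n"
            f"{body}\n"
            f"    return {', '.join(writes)}\n")

print(translate_cell("total = sum(records)", ["records"], ["total"]))
\end{verbatim}
The read and write sets needed for such a rewrite are exactly what the provenance analysis sketched above approximates.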
%%% Local Variables:
%%% mode: latex