master
Boris Glavic 2022-05-06 22:25:26 -05:00
parent af09529cd2
commit f814a6b272
7 changed files with 42 additions and 40 deletions


@ -40,7 +40,7 @@ To overcome this dilemma, we propose an approach that computes approximate prove
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Static Approximate Provenance}
An initial pass over the notebook's code obtains a set of read and write dependencies for each cell using Python's AST library, using standard dataflow equations~\cite{NN99} to derive an approximate dataflow graph.
An initial pass over the notebook's code obtains a set of read and write dependencies for each cell using Python's AST library and standard dataflow equations~\cite{NN99} to derive an approximate dataflow graph.
To minimize performance overhead, this step only analyzes the user's code and does not consider other modules (libraries) --- intra-module dependencies (e.g., stateful libraries) will be missed at this stage, but can still be discovered at runtime.
Conversely, like any static dataflow analysis, this stage may also produce false positives due to control-flow decisions that depend on the input. For example, in \Cref{fig:static-dataflow-analysis-}, whether the cell has a read dependency on \texttt{d} or \texttt{e} depends on the value of \texttt{a}, which will only be known at runtime.
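A minimal sketch of such a per-cell read/write analysis using Python's \texttt{ast} module is shown below; the function name and the deliberately simplistic treatment of control flow are illustrative assumptions rather than our actual implementation.
\begin{verbatim}
import ast

def approximate_rw_sets(cell_source: str):
    # Over-approximate the variables a cell may read or write: every
    # name that is loaded anywhere (including in branches that may
    # never execute) is a potential read; every name that is stored
    # is a potential write.
    tree = ast.parse(cell_source)
    reads, writes = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    return reads, writes

# Both branches contribute, so 'd' and 'e' are reported as reads even
# though only one of them is read at runtime.
print(approximate_rw_sets("b = d if a > 0 else e"))
\end{verbatim}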


@ -2,7 +2,7 @@
We introduced an approach for incrementally refining a static approximation of provenance for computational notebooks and implemented a scheduler for ICE-architecture notebooks based on this approach.
Our method enables (i) parallel cell execution; (ii) automatic refresh of dependent cells after modifications; and (iii) import of Jupyter notebooks.
While our proof-of-concept shows promise, significant additional work is needed to reduce state transfer between kernels.
For example, kernels can be re-used forked to minimize state transfer.
For example, kernels can be re-used or forked to minimize state transfer.
Alternatively, responsibility for hosting state can be moved from the coordinator directly to the kernel that created the state.


@ -1,40 +1,5 @@
%!TEX root=../main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[t]
\newcommand{\plotminusvspace}{-3mm}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_serial.pdf}
\vspace*{\plotminusvspace}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_parallel.pdf}
\vspace*{\plotminusvspace}
\label{fig:gantt:serial}
\caption{Parallel Execution}
\end{subfigure}
% \begin{subfigure}[b]{.24\textwidth}
% \includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
% \vspace*{\plotminusvspace}
% \label{fig:gantt:serial}
% \caption{Scalability - Read}
% \end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{0.3\linewidth}
\vspace*{-26mm}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/scalability-read.pdf}
\vspace*{\plotminusvspace}
\caption{Scalability - Read}\label{fig:scalability}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace*{-5mm}
\caption{Workload traces for a synthetic reader/writer workload}
\label{fig:gantt}
\trimfigurespacing
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As a proof of concept, we implemented the static analysis approach from \Cref{sec:import} as a provenance-aware parallel scheduler (\Cref{sec:scheduler}) within the Vizier notebook system~\cite{brachmann:2020:cidr:your}.
Parallelizing cell execution requires an ICE architecture, which comes at the cost of increased communication overhead relative to monolithic kernel notebooks.


@ -52,8 +52,44 @@ When traversing a function or class declaration, we record such dependencies and
Transitive dependency tracking is complicated by Python's use of mutable closures (e.g., see \Cref{fig:scoping}):
in the latter code block, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}, and overrides an assignment in the global scope, even though the enclosing scope is not otherwise accessible.
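The following minimal sketch illustrates the kind of capture behavior referred to here (the actual example in \Cref{fig:scoping} may differ in its details): a read of \texttt{a} inside \texttt{bar} must be resolved against \texttt{foo}'s binding rather than the global one.
\begin{verbatim}
a = 1                  # assignment in the global scope

def foo():
    a = 2              # local to foo
    def bar():
        print(a)       # reads foo's a, not the global a
    return bar

bar = foo()
a = 3                  # later global assignment is invisible to bar
bar()                  # prints 2: bar captured foo's enclosing scope
\end{verbatim}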
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[t]
\newcommand{\plotminusvspace}{-3mm}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_serial.pdf}
\vspace*{\plotminusvspace}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_parallel.pdf}
\vspace*{\plotminusvspace}
\caption{Parallel Execution}
\label{fig:gantt:parallel}
\end{subfigure}
% \begin{subfigure}[b]{.24\textwidth}
% \includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
% \vspace*{\plotminusvspace}
% \label{fig:gantt:serial}
% \caption{Scalability - Read}
% \end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{0.3\linewidth}
\vspace*{-26mm}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/scalability-read.pdf}
\vspace*{\plotminusvspace}
\caption{Scalability - Read}\label{fig:scalability}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace*{-5mm}
\caption{Workload traces for a synthetic reader/writer workload}
\label{fig:gantt}
\trimfigurespacing
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Second, the fine-grained dataflow graph, produced as explained above, is reduced into a \emph{coarse-grained} dataflow graph by (i) merging nodes for the statements in a cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
Afterwards, the fine-grained dataflow graph, produced as explained above, is reduced into a \emph{coarse-grained} dataflow graph by (i) merging nodes for the statements in a cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
The coarse-grained dataflow graph provides an approximation of the cell's dependencies: the set of in-edges (resp., out-edges) is typically an upper bound on the cell's real dependencies. % guaranteed upper bound on the cell's write set (read set).
While missed dependencies are theoretically possible, they are rare in the kind of code found in typical Jupyter notebooks; if they do arise, they are handled by our scheduler. As a final step, we inject explicit variable imports and exports (using Vizier's artifact API) for the read and write sets of each cell into the cell's code.
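For concreteness, the reduction to the coarse-grained graph can be sketched as follows; the edge representation and function name are assumptions made purely for illustration.
\begin{verbatim}
def coarsen(fine_edges, stmt_to_cell):
    # fine_edges: (src_stmt, dst_stmt, label) triples of the
    #   fine-grained dataflow graph
    # stmt_to_cell: maps each statement id to its enclosing cell
    coarse = set()
    for src, dst, label in fine_edges:
        s_cell, d_cell = stmt_to_cell[src], stmt_to_cell[dst]
        if s_cell != d_cell:          # (ii) drop self-edges
            # (i) merge a cell's statement nodes; (iii) the set
            # removes parallel edges with identical labels
            coarse.add((s_cell, d_cell, label))
    return coarse
\end{verbatim}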


@ -59,7 +59,7 @@ Afterwards, we discuss our remaining contributions:
(ii) \textbf{Jupyter Import}: \Cref{sec:import} discusses how we extract approximate provenance from Python code statically, and how existing notebooks written for Jupyter % (or comparable monolithic kernel architectures)
can be translated to ICE notebook architectures like Vizier.
% \item
(iii) \textbf{Implementation in Vizier and Experiments}: We have implemented a preliminary prototype of the proposed scheduler in Vizier. \Cref{sec:experiments} presents our experiments with parallel evaluation of Jupyter notebooks.
(iii) \textbf{Implementation in Vizier and Experiments}: We have implemented a preliminary prototype of the proposed scheduler in Vizier. \Cref{sec:experiments} presents our experiments on the parallel evaluation of Jupyter notebooks.
% \end{itemize}
% A computational notebook is an ordered sequence of \emph{cells}, each containing a block of (e.g., Python) code or (e.g., markdown) documentation.


@ -9,6 +9,7 @@ Before discussing how notebooks written for monolithic kernels can be mapped to
As in a monolithic kernel notebook, an ICE notebook maintains a shared global state that is manipulated by each individual cell.
However, these manipulations are explicit: for a variable defined in one cell (the writer) to be used in a subsequent cell (the reader), (i) the writer must explicitly export the variable into the global state, and (ii) the reader must explicitly import the variable from the global state.
For example, Vizier provides explicit setter and getter functions (respectively) on a global state variable, while Nodebook inspects the Python interpreter's global scope dictionary in between cell executions.
As mentioned in the introduction, our scheduler eliminates the need for the user to explicitly call these setters and getters.
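To illustrate the mechanism, the sketch below uses a plain dictionary as a stand-in for the coordinator-managed global state; the actual calls in Vizier and Nodebook differ.
\begin{verbatim}
shared_state = {}        # stand-in for the coordinator-managed state

# Cell 1 (writer): must explicitly export the variable.
x = 42
shared_state["x"] = x    # export into the global state

# Cell 2 (reader): must explicitly import the variable.
x = shared_state["x"]    # import from the global state
print(x + 1)
\end{verbatim}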
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{State Serialization}


@ -59,7 +59,7 @@ For any notebook $\nb$ and approximated dependency graph $\dga$ for $\nb$, the e
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Incremental Re-execution}
Vizier automatically refreshes dependent cells when a cell $\nc$ is modified by the user using incremental re-execution which avoids re-execution of cells whose output will be the same in the modified notebook. For that, the modified cell $\nc$ is put into \spending state. Furthermore, all cells that depend on $\nc$ directly or indirectly are also put into \spending state. That is, we memorize a cells actual dependencies from the previous execution and initially assume that the dependency graph will be the same as in the previous execution. The exception is the modified cell for which we statically approximate provenance from scratch. During the execution of the modified cell or one of its dependencies we may observe changes to the read and write set of a cell. We compensate for that using the repair actions described above.
When the user modifies a cell $\nc$, Vizier automatically refreshes dependent cells using incremental re-execution, which avoids re-executing cells whose output will be the same in the modified notebook. For that, the modified cell $\nc$ is put into the \spending state, as are all cells that directly or indirectly depend on $\nc$. That is, we memorize a cell's actual dependencies from the previous execution and initially assume that the dependency graph will be the same as in the previous execution. The exception is the modified cell, for which we statically approximate provenance from scratch. During the execution of the modified cell or one of its dependencies, we may observe changes to the read and write set of a cell; we compensate for these using the repair actions described above.
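A minimal sketch of how the set of cells to be put into the \spending state could be derived from the memorized dependency graph is given below; the graph representation and function name are assumptions for illustration.
\begin{verbatim}
def cells_to_mark_pending(modified, dependents):
    # dependents: maps a cell id to the ids of cells that read its
    #   outputs in the previous execution (memorized dependencies)
    pending, stack = set(), [modified]
    while stack:
        cell = stack.pop()
        if cell not in pending:
            pending.add(cell)
            stack.extend(dependents.get(cell, ()))
    return pending

# Modifying cell 1 also marks its direct and indirect dependents.
print(cells_to_mark_pending(1, {1: [2], 2: [3]}))   # {1, 2, 3}
\end{verbatim}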
% When the user schedules a cell for partial re-execution (e.g., to retrieve new input data), we would like to avoid re-executing dependent cells that will produce identical outputs.
% The cell(s) scheduled for re-execution are moved to the \spending state.