Trimming... still slightly over 4pp.

This commit is contained in:
Oliver Kennedy 2022-03-31 23:44:58 -04:00
parent 8b30a85a7e
commit 067df20e7f
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60
8 changed files with 69 additions and 94 deletions

View file

@ -14,13 +14,11 @@ with r.urlopen('http://someweb/code.py') as response:
\end{minted}
\begin{minted}{python}
if a > 10:
b = d * 2
else:
c = e * 2
b = d * 2 if a > 10 else e * 2
\end{minted}
\caption{Example Python code}\label{fig:example-python-code}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -33,7 +31,7 @@ To overcome this dilemma, we propose an approach that computes approximate prove
\tinysection{Static Approximate Provenance}
An initial pass over the notebook's code obtains a set of read and write dependencies for each cell with Python's AST library, applying standard dataflow equations~\cite{DBLP:journals/tse/Weiser84} to derive an approximate dataflow graph.
To minimize performance overheads, this step only analyzes the user's code and does not consider other modules (libraries) --- intra-module dependencies (e.g., stateful libraries) will be missed at this stage, but can still be discovered at runtime.
Conversely, like any static dataflow analysis technique, this stage may also produce false positives due to non-deterministic control-flow. For example, in the code snippet shown at the bottom of \Cref{fig:example-python-code}, the cell's read dependency on \texttt{d} (resp., \texttt{e}) and write dependency on \texttt{b} (resp., \texttt{c}) depend on the value of \texttt{a}.
Conversely, like any static dataflow analysis, this stage may also produce false positives due to non-deterministic control-flow. For example, in the last line of \Cref{fig:example-python-code} the cell's read dependency on \texttt{d} or \texttt{e} depends on the value of \texttt{a}.
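To make this static pass concrete, the following sketch (a simplification for illustration, not \systemname's actual implementation) over-approximates a single cell's read and write sets with Python's \texttt{ast} module; applied to the last line of \Cref{fig:example-python-code}, it reports reads \texttt{a}, \texttt{d}, \texttt{e} and a write of \texttt{b}.
\begin{minted}{python}
import ast

def approximate_rw_sets(cell_source):
    # Over-approximate the variables a cell may read or write by
    # recording the context (Load / Store / Del) of each Name node.
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    return reads, writes

# Both branches contribute, so the result is an upper bound:
# reads == {'a', 'd', 'e'}, writes == {'b'}
approximate_rw_sets("b = d * 2 if a > 10 else e * 2")
\end{minted}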
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Exact Runtime Provenance}

View file

@ -1,10 +1,8 @@
%!TEX root=../main.tex
We introduce a provenance-based approach for predicting and tracking dependencies across python cells in a computational notebook and an implementation of this approach in Vizier, a data-centric notebook system where cells are isolated from each other and communicate through data artifacts. By combining best effort static analysis with an adaptable runtime schedule for notebook cell execution, we achieve (i) parallel execution of python cells, (ii) automatic refresh of dependent cells when the notebook is modified, and (iii) translation of Jupyter notebooks into our model.
This paper represents an initial proof-of-concept, on which we note several opportunities for improvement.
Crucially, there are still considerable opportunities to reduce blocking due to state transfer between kernels.
For example, it may be possible to re-use a kernel for applications involving large state, trading the increased overhead of sequential execution for a reduction in overhead from state transfer.
We also plan to explore ways to make state export and import asynchronous.
We introduce a hybrid ``incremental'' approach to tracking dependencies through computational notebooks, and implement a scheduler based on this approach in an ICE-architecture notebook.
This incremental provenance model allows for (i) parallel cell execution, and (ii) automatic refresh of dependent cells after modifications.
We note several opportunities for improvement on our initial proof-of-concept, most notably to reduce blocking due to state transfer between kernels.
For example, an advanced scheduler could dynamically trade off between the overheads of inter-kernel communication, and the benefits of parallel execution.

View file

@ -19,7 +19,7 @@ If we know which artifacts a cell will use, we can prioritize kernels that have
All experiments were run on a XX GHz, XX core Intel Xeon with XX GB RAM running XX Linux\OK{Boris, Nachiket, can you fill this in?}.
As Vizier relies on Apache Spark, we prefix all notebooks under test with a single reader and writer cell to force initialization of e.g., Spark's HDFS module. These are not included in timing results.
\begin{figure*}
\begin{figure*}[t]
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
\label{fig:gantt:serial}
@ -37,6 +37,7 @@ As Vizier relies on Apache Spark, we prefix all notebooks under test with a sing
\end{subfigure}
\label{fig:gantt}
\caption{Workload traces for a synthetic reader/writer workload}
\trimfigurespacing
\end{figure*}
\tinysection{Overview}

View file

@ -1,24 +1,19 @@
%!TEX root=../main.tex
Jupyter notebooks are evaluated in a single long-running kernel that facilitates inter-cell communication through a shared global state.
In this section, we outline the process for converting a Jupyter notebook into a form compatible with \systemname's isolated cell execution model.
For this preliminary work, we make the simplifying assumption that cells do not perform out-of-band communication (i.e., through files or external services), and that all state required by a cell is passed via the kernel's global scope.
In this section, we outline the conversion of monolithic kernel notebooks into ICE-compatible forms.
For this preliminary work, we make a simplifying assumption that all inter-cell communication occurs through the kernel's global scope (e.g., as opposed to files).
Our first challenge is deriving each cell's read and write sets.
To accomplish this, we build a dataflow graph over the cells of the notebook using Python's \texttt{ast} module to obtain a structured representation of the code: an \emph{abstract syntax tree} (AST).
Attribute references are marked by instances of the \texttt{Attribute} object, conveniently annotated with the directionality of the reference: \texttt{Load}, \texttt{Store}, or \texttt{Del}.
Python's \texttt{ast} module provides a structured representation of the code: an \emph{abstract syntax tree} (AST).
Variable accesses are marked by instances of the \texttt{Name} (or \texttt{Attribute}) object, conveniently annotated with the directionality of the reference: \texttt{Load}, \texttt{Store}, or \texttt{Del}.
Analogous to Program Slicing~\cite{DBLP:journals/tse/Weiser84}, we traverse the AST's statements in-order to build a \emph{fine-grained} data flow graph, where each node is a cell/statement pair, and each directed edge goes from an attribute \texttt{Load} to the corresponding \texttt{Store}.
Control-flow constructs (e.g., if-then-else blocks and for loops) may necessitate multiple edges per \texttt{Load}, as it may read from multiple \texttt{Store} operations.
We simulate execution of the python code in a manner analogous to Program Slicing~\cite{DBLP:journals/tse/Weiser84}, using an in-order traversal of the AST's statements (lines or blocks of code) to build a \emph{fine-grained} data flow graph.
A data flow graph is a directed graph where each node corresponds to a single statement in the AST and is identified by a 2-tuple of the cell index and the line number.
Each edge corresponds to a read dependency, originating at a statement with an attribute \texttt{Load} and ending at the statement whose \texttt{Store} operation most recently wrote the attribute.
Due to control flow constructs (e.g., if-then-else blocks and for loops), a single attribute \texttt{Load} can potentially read from multiple \texttt{Store} operations; we represent this by encoding one edge for each potential dependency.
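As a minimal illustration of this traversal (straight-line, top-level statements only; the control-flow and scoping rules discussed here and below are elided), the fine-grained graph can be built by remembering, for each variable, the statement that most recently stored it:
\begin{minted}{python}
import ast

def fine_grained_graph(cells):
    # cells: a list of source strings, one per notebook cell.
    # Nodes are (cell index, line number) pairs; each edge links a Load
    # to the most recent preceding Store of the same variable.
    last_store, edges = {}, []
    for ci, src in enumerate(cells):
        for stmt in ast.parse(src).body:       # statements, in order
            node = (ci, stmt.lineno)
            names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
            for n in names:                     # reads see earlier writes
                if isinstance(n.ctx, ast.Load) and n.id in last_store:
                    edges.append((node, last_store[n.id], n.id))
            for n in names:                     # then record this statement's writes
                if isinstance(n.ctx, ast.Store):
                    last_store[n.id] = node
    return edges

# fine_grained_graph(["x = 1", "y = x + 1"]) == [((1, 1), (0, 1), 'x')]
\end{minted}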
Python's scoping logic presents additional complications.
First, function and class declarations may reference attributes (e.g., \texttt{import}s) from an enclosing scope, creating transitive dependencies.
When traversing a function or class declaration, we record such dependencies and include them when the symbol is \texttt{Load}ed.
Transitive dependency tracking is complicated due to python's use of mutable closures (e.g., see \Cref{fig:scoping});
In the latter code block, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}, and overrides assignment in the global scope, even though the enclosing scope is not otherwise accessible.
Python's scoping logic presents additional complications.
First, function and class definitions may reference symbols from a containing scope.
For example, python's \texttt{import} statement simply declares imported modules as symbols in the current scope.
Thus, references to the module's functions within a function or class definition create transitive dependencies.
When the traversal visits a function or class declaration statement, we record these transitive dependencies and include them whenever the declared symbol is \texttt{Load}ed.
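For instance (an illustrative example, not drawn from a real notebook), the two cells below exhibit such a transitive dependency:
\begin{minted}{python}
# Cell 1: 'statistics' becomes a symbol in the global scope, and
# 'summarize' captures a reference to it when it is defined.
import statistics

def summarize(xs):
    return statistics.mean(xs)

# Cell 2: loading 'summarize' transitively requires 'statistics' too,
# so both symbols must be available to the kernel running this cell.
print(summarize([1, 2, 3]))
\end{minted}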
\begin{figure}
\begin{center}
@ -47,23 +42,13 @@ a = 1
bar() # Prints '2'
\end{minted}
\end{subfigure}
\end{center}
\label{fig:scoping}
\caption{Scope capture in python happens at function definition, but captured scopes remain mutable.}
\trimfigurespacing
\end{center}
\end{figure}
An additional complication arises from python's scope capture semantics.
When a function (or class) is declared, it records a reference to all enclosing scopes. Consider the following example code in \Cref{fig:scoping}.
In the latter code block, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}, and overrides assignment in the global scope.
In the former instance, conversely, \texttt{bar}'s assignment to \texttt{a} happens in its own scope, and so the invocation of \texttt{foo} reads the instance of \texttt{a} in the global scope.
\tinysection{Coarse-Grained Data Flow}
We next reduce the fine-grained data flow graph, as defined above, into a simplified \emph{coarse-grained} data flow graph by (i) merging the nodes for each cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
The coarse-grained data flow graph provides us with the cell's dependencies: The set of in-edges (resp., out-edges) is a guaranteed upper bound on the cell's write set (read set).
We note that the coarse-grained data flow graph is free of several complications that affect most data flow analyses.
Notably, Jupyter does not allow control-flow primitives to span cells.
For each cell, we record the approximate read and write sets. We then instrument the cell by injecting \systemname import operations for every element of the read set into the start of the cell and export operations for every element of the write set at the end of the cell.
We leave to future work a more fine-grained instrumentation that avoids importing or exporting symbols until they are explicitly required.
Second, the fine-grained dataflow graph, as defined above, is reduced into a simplified \emph{coarse-grained} data flow graph by (i) merging nodes for the statements in a cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
The coarse-grained data flow graph provides the cell's dependencies: The set of in-edges (resp., out-edges) is a guaranteed upper bound on the cell's write set (read set).
As a final step, we inject explicit variable imports and exports for the read and write sets of each cell into the cell's code.
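A minimal sketch of this instrumentation step follows; \texttt{artifact\_get} and \texttt{artifact\_put} are placeholder names standing in for \systemname's actual getter and setter functions.
\begin{minted}{python}
def instrument(cell_source, reads, writes):
    # Prepend imports of the approximate read set and append exports of
    # the approximate write set; artifact_get / artifact_put are
    # placeholders, not the system's real API.
    prologue = [f"{v} = artifact_get('{v}')" for v in sorted(reads)]
    epilogue = [f"artifact_put('{v}', {v})" for v in sorted(writes)]
    return "\n".join(prologue + [cell_source] + epilogue)

# instrument("model = train(df)", {"df"}, {"model"}) produces:
#     df = artifact_get('df')
#     model = train(df)
#     artifact_put('model', model)
\end{minted}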

View file

@ -28,8 +28,9 @@ We then show generality by discussing the process for importing Jupyter notebook
\begin{figure}
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=0.8\columnwidth]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from github~\cite{DBLP:journals/ese/PimentelMBF21}: On average, only one out of every 4 notebook cells has serial dependencies.}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from github~\cite{DBLP:journals/ese/PimentelMBF21}: On average, notebooks can run with a parallelism factor of 4.}
\label{fig:parallelismSurvey}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View file

@ -1,7 +1,7 @@
%!TEX root=../main.tex
An isolated cell execution (ICE) notebook isolates cells by executing each in a fresh kernel.
In this section, we review key differences between the ICE model used in notebooks like Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your} or Nodebook~\cite{nodebook}, and the classical monolithic approach of Jupyter itself.
In this section, we review key differences between the ICE model used in notebooks like Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your} or Nodebook~\cite{nodebook}, and the classical monolithic approach of Jupyter.
\tinysection{Communication Model}
As in a monolithic kernel notebook, an ICE notebook maintains a shared global state that is manipulated by each individual cell.
@ -12,11 +12,10 @@ For example, Vizier provides explicit setter and getter functions (respectively)
When a state variable is exported, it is serialized by the python interpreter and exported into a versioned state management system.
We refer to the serialized state as an \emph{artifact}.
Each cell executes in the context of a scope, a mapping from variable names to artifacts that can be imported by the cell.
Artifacts exported by the cell are integrated into the scope for the next cell.
By default, Vizier serializes state through python's native \texttt{pickle} library, although \systemname can be easily extended with codecs for specialized types that are either unsupported by \texttt{pickle}, or for which it is not efficient:
(i) Python code (e.g., import statements, and function or class definitions) is exported as raw python code and imported with \texttt{eval}.
(ii) Pandas dataframes are exported in parquet format, and are exposed to subsequent cells by the monitor process through Apache Arrow direct access.
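A sketch of this codec dispatch is shown below; the file layout and dispatch logic are illustrative assumptions rather than \systemname's actual codec registry.
\begin{minted}{python}
import os, pickle
import pandas as pd

def export_artifact(name, value, directory="artifacts"):
    os.makedirs(directory, exist_ok=True)
    if isinstance(value, pd.DataFrame):
        # Specialized codec: parquet is compact and can be read via
        # Apache Arrow without unpickling in the consuming kernel.
        value.to_parquet(os.path.join(directory, name + ".parquet"))
    else:
        # Default codec: python's native pickle serialization.
        with open(os.path.join(directory, name + ".pkl"), "wb") as f:
            pickle.dump(value, f)
\end{minted}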
One notable challenge is the need to support transitive dependencies. One function, for example, may rely on a second function.
When the former function's symbol is exported (resp., imported) the latter symbol must also be exported (imported).
% We note the need to support transitive dependencies like functions that invoke other functions.
% Exporting (resp., importing) the former function requires exporting (importing) the latter.

View file

@ -1,14 +1,19 @@
Provenance for workflow systems has been studied extensively for several decades (e.g., see \cite{DC07} for a survey). However, workflow systems expect data dependencies to be specified explicitly as part of the workflow specification and, thus, such provenance techniques are not applicable to our problem setting where data dependencies are not declared upfront. More closely related to our work are provenance techniques for programming languages and static analysis techniques from the programming languages community~\cite{NN99}.
%!TEX root=../main.tex
Provenance for workflow systems has been studied extensively for several decades (e.g., see \cite{DC07} for a survey). However, workflow systems expect data dependencies to be specified explicitly as part of the workflow specification and, thus, such provenance techniques are not applicable to our problem setting. More closely related to our work are provenance techniques for programming languages and static analysis techniques from the programming languages community~\cite{NN99}.
Pimentel et al.~\cite{pimentel-19-scmanpfs} provide an overview of research on provenance for scripting (programming) languages and identify both the need for, and the challenges of, fine-grained provenance in this context.\BG{What other takeaways?}
noWorkflow~\cite{pimentel-17-n, DBLP:conf/tapp/PimentelBMF15} is a tool for collecting several types of provenance for python scripts, including environmental information (library dependencies and OS environments), static data-flow information, and dynamic (runtime) control- and dataflow information collected using profiling and instrumentation tools. In~\cite{DBLP:conf/tapp/PimentelBMF15}, noWorkflow was extended to support collecting provenance for IPython notebooks. Like noWorkflow, we also use static dataflow analysis, but unlike noWorkflow, we do not instrument Python code to determine actual dependencies at runtime.
\cite{macke-21-fglsnin} presents an approach that combines static and dynamic dataflow analysis using Python's tracing capabilities to track dataflow dependencies during cell execution. This information is then used to detect what the authors refer to as ``unsafe'' interactions, where a cell reads an outdated version of a variable written by another cell, and to suggest which cells to execute to resolve the staleness. In contrast to this approach, where the user has to manually follow the suggestions of the system in a step-wise manner to resolve staleness, our model prevents staleness from happening in the first place by automatically refreshing dependent cells.
Pimentel et al.~\cite{pimentel-19-scmanpfs} provide an overview of research on provenance for scripting (programming) languages and identify both the need for, and the challenges of, fine-grained provenance in this context.
noWorkflow~\cite{pimentel-17-n, DBLP:conf/tapp/PimentelBMF15} collects several types of provenance for python scripts including environmental information, as well as static and dynamic data- and control-flow.
\cite{DBLP:conf/tapp/PimentelBMF15} extends noWorkflow to Jupyter notebooks and is closely related to our work, but only produces provenance for analysis and debugging and not scheduling.
\cite{macke-21-fglsnin} combines static and dynamic dataflow analysis to track dataflow dependencies during cell execution and warn users of ``unsafe'' interactions where a cell is reading an outdated version of a variable. By contrast, our approach automatically refreshes dependent cells.
Vamsa~\cite{namaki-20-v} also employs static dataflow analysis to analyze provenance of Python ML pipelines, but additionally annotates variables with semantic tags (e.g., features and labels).
\cite{KP17a} introduces Dataflow notebooks which extend Jupyter with immutable identifiers for cells and the capability to reference the results of a cell by its identifier. The purpose of this extension is to attack the problem of implicit cell dependencies caused by shared python interpreter state and out-of-order execution of cells in a notebook. If users are diligent in using these features, then Dataflow notebooks can be used for automatic refresh of dependent cells like our model. However, our model has the advantage that users do not need to change their code to use cell identifiers and cannot accidentally create hidden dependencies since cell executions are isolated from each other. Another advantage of our approach is that it allows parallel execution of independent cells which was only alluded to as a possibility in \cite{KP17a}.
\cite{KP17a} introduces Dataflow notebooks which extend Jupyter with immutable identifiers for cells and the capability to reference the results of a cell by its identifier.
This approach can avoid implicit dependencies, but requires users to be diligent in using these features.
Additionally, our approach allows parallel execution of independent cells, something that was only alluded to as a possibility in \cite{KP17a}.
A similar project, Nodebook~\cite{nodebook}, is a plugin for Jupyter that checkpoints notebook state in between cells to force in-order cell evaluation; although closely related to our approach, it attempts neither parallelism nor automatic re-execution of cells.
% \begin{itemize}

View file

@ -1,9 +1,9 @@
%!TEX root=../main.tex
In this section, we overview \systemname's provenance model and how it is used to support parallel code execution and incremental updates.
We define the baseline notebook semantics as a serial execution of the cells in notebook order, and refer to the set of symbols imported or exported by each cell as the cell's read and write sets, respectively.
We define a correct execution in terms of view serializability: any reordering of the cells (e.g., to parallelize execution) that preserves data flow dependencies is correct.
As a simplification for this preliminary work, we assume that cell execution is atomic and idempotent: we are allowed to freely interrupt the execution of a cell, or to restart it.
The baseline semantics of a notebook are a serial execution of the cells in notebook order.
We refer to the set of variables imported or exported by each cell as the cell's read and write sets, respectively.
A correct execution is thus defined in terms of view serializability: An execution order is correct iff each cell imports artifact versions consistent with those it would import in a serial execution in notebook order.
We assume that cell execution is atomic and idempotent: we are allowed to freely interrupt the execution of a cell, or to restart it.
\tinysection{Naive Scheduling}
Let $N$ denote a notebook, a sequence of cells $[c_1, \ldots, c_n]$.
@ -15,43 +15,31 @@ D = \{\; (c_r, c_w, \ell) \;|\; c_r,c_w \in N, \ell \in \mathcal R(c_r), \ell \i
\end{multline*}
An edge labelled $\ell$ exists from any cell $c_r$ that reads symbol $\ell$ to the most recent preceding cell that writes symbol $\ell$.
Given such a graph, scheduling is trivial.
Denote by $\mathcal S(c) \in \{ \text{PENDING}, \text{DONE} \}$ the state of a cell (i.e., \text{DONE} after it has completed execution); a cell $c$ can be scheduled for execution once every cell it depends on is \text{DONE}:
$$\forall (c, c_w, \ell) \in D : \mathcal S(c_w) = \text{DONE}$$
It can be trivially shown that this scheduling approach guarantees conflict equivalence between the execution and notebook orders.
$\forall (c, c_w, \ell) \in D : \mathcal S(c_w) = \text{DONE}$
When a cell $c_r$ imports variable $\ell$ from the global scope, where $(c_r, c_w, \ell) \in D$, it receives the version exported by cell $c_w$.
It can be trivially shown that this execution model must produce schedules that are view-equivalent to the notebook order.
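The following sketch (illustrative only; \systemname's scheduler additionally manages kernels and artifacts) shows the naive scheduling rule over per-cell read and write sets:
\begin{minted}{python}
def dependency_graph(notebook):
    # notebook: a list of (read set, write set) pairs in notebook order.
    # Returns edges (reader, writer, label) pointing at the most recent
    # preceding writer of each symbol read, as in the definition of D.
    last_writer, edges = {}, []
    for i, (reads, writes) in enumerate(notebook):
        for sym in reads:
            if sym in last_writer:
                edges.append((i, last_writer[sym], sym))
        for sym in writes:
            last_writer[sym] = i
    return edges

def runnable(cell, edges, state):
    # A PENDING cell may be scheduled once every cell it reads from is DONE.
    return state[cell] == "PENDING" and all(
        state[w] == "DONE" for (r, w, _) in edges if r == cell)
\end{minted}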
\tinysection{Approximating Dependencies}
A cell's read and write sets cannot be known exactly until runtime, but can often be approximated.
In principle, a dependency graph could be created from these approximations and the scheduling could proceed as in the naive approach.
However, when a cell executes and its exact read and write sets become known, the notebook's dependency graph changes.
\systemname's \emph{dependency graph must be dynamic}.
There are four possible errors that we now consider in turn.
Here, we are primarily concerned with ensuring correctness (i.e., serializability) of the notebook execution in the presence of approximation errors, even at the cost of performance.
\tinysection{Runtime Refinement}
The dependency graph is refined during execution in four possible ways; we now consider each in turn and its consequences for scheduler \emph{correctness}.
A false positive read (a symbol in the approximate read set that is never accessed by the cell) corresponds to an unnecessary edge in the dependency graph.
The presence of this edge in the dependency graph is a performance problem, as it may have been possible to execute the cell earlier, but does not result in a correctness error.
\textbf{(i)} An approximated read that is never imported removes the corresponding edge from the dependency graph.
This is not a correctness error.
\textbf{(ii)} An approximated write that is never exported redirects inbound edges with the corresponding label to the most recent preceding cell that exports the variable.
The dependent cells could not have been started yet, so there is no possibility of a correctness error.
\textbf{(iii)} An imported variable not in the approximated read set adds a new edge to the dependency graph.
If the edge leads to a cell in the \texttt{PENDING} state, the import operation may block until the exporting cell has completed.
This state is less desirable, as any already allocated resources are tied to the blocked cell and may create resource starvation.
\textbf{(iv)} An exported variable not in the approximated write set redirects a subset of edges with the corresponding label to the cell.
This is only a correctness error if one of the dependent cells has already been started --- if so, the cell must be aborted and rescheduled after the current cell completes.
A false positive write (a symbol in the approximate write set that is never modified by the cell) corresponds to zero or more edges that need to be redirected to an earlier cell.
Modulo errors in the notebook (i.e., undefined variables), reads on the symbol instead reference the output of a preceding cell.
All dependent cells are blocked until the prediction error is discovered, so as with a false positive read, there is no potential for correctness errors.
However, a cell that could have been scheduled if the symbol had been written may now need to block until a preceding cell completes.
% In this preliminary work, we focus on workloads where we can guarantee that we have an upper bound on the read and write sets (i.e., no false negatives).
% In this setting, approximation errors can lead to poor performance, but not correctness errors.
A false negative read (a symbol not in the approximate read set that is read by the cell) corresponds to an edge missing from the dependency graph.
Since the artifact is not available to the cell until the read is triggered, correctness errors can be avoided by blocking cell execution until the cell it depends on completes.
We note that any resources allocated to the kernel cannot be released until the cell is unblocked.
When a pool of kernels is used, this could potentially lead to starvation if the entire pool blocks.
A false negative write (a symbol not in the approximate write set that is written to by the cell) corresponds to zero or more edges that need to be redirected to a later cell.
This is only a correctness error if one of the dependent cells has already been scheduled or completed --- if so, the cell must be aborted and rescheduled after the current cell completes.
In this preliminary work, we focus on workloads where we can guarantee that we have an upper bound on the read and write sets (i.e., no false negatives).
In this setting, approximation errors can lead to poor performance, but not correctness errors.
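Continuing the scheduling sketch above (edges as (reader, writer, label) triples over cell indices), the false-negative write case might be handled as follows; this is a partial illustration rather than the system's actual refinement logic:
\begin{minted}{python}
def on_unexpected_export(cell, sym, edges, state):
    # 'cell' exported a symbol missing from its approximate write set.
    # Later readers of 'sym' whose edge points at an earlier writer are
    # redirected to 'cell'; a reader that already started is moved back
    # to PENDING (cells are atomic and idempotent, so it can simply be
    # interrupted and re-run after 'cell' completes).
    for i, (reader, writer, label) in enumerate(edges):
        if label == sym and reader > cell and writer < cell:
            edges[i] = (reader, cell, sym)
            if state[reader] != "PENDING":
                state[reader] = "PENDING"
\end{minted}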
\paragraph{Incremental Re-execution}
Users may trigger the re-execution of one or more cells (e.g., to retrieve new input data).
When this happens, the re-executed cells and all of their descendants in the dependency DAG enter the \text{PENDING} state and execution proceeds mostly as above.
All cells that are not descendants of an updated cell remain in the \text{DONE} state, and their output artifacts are assumed to be re-usable from the cell's prior execution.
During a partial re-execution of the workflow, the cell's actual dependencies from the prior execution are also available.
When computing the dependency DAG, the exact read dependencies from the cell's prior execution are used.
This is safe under the assumption that the cell's behavior is deterministic --- that the cell's execution trace is unchanged if it is re-executed with the same inputs.
\tinysection{Incremental Re-execution}
When the user schedules a cell for partial re-execution (e.g., to retrieve new input data), we would like to avoid re-executing cells that will produce identical outputs.
The cell(s) scheduled for re-execution are moved to the \text{PENDING} state.
A dependency that was a false positive in the prior execution may be exercised in a new execution, so the prior run's exact dependencies alone are not a safe bound.
On the other hand, false negatives (cases (iii) and (iv) above) revealed during the past execution are also valuable.
Accordingly, the dependency graph is updated according to the union of the approximate and exact dependencies.
Any \text{DONE} cells that now depend on a cell in the \text{PENDING} state are recursively moved to the \text{PENDING} state and the graph is updated as above.
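A small sketch of this propagation (using the same edge representation as before; illustrative only):
\begin{minted}{python}
def mark_stale(roots, edges, state):
    # Move the re-executed cells and all of their transitive dependents
    # back to PENDING; every other cell stays DONE and its artifacts are
    # reused from the prior execution.
    pending, changed = set(roots), True
    while changed:
        changed = False
        for reader, writer, _ in edges:
            if writer in pending and reader not in pending:
                pending.add(reader)
                changed = True
    for cell in pending:
        state[cell] = "PENDING"
    return pending
\end{minted}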