%!TEX root=../main.tex
Jupyter notebooks are evaluated in a single long-running kernel that facilitates inter-cell communication through a shared global state.
In this section, we outline the process for converting a Jupyter notebook into a form compatible with \systemname's isolated cell execution model.
For this preliminary work, we make the simplifying assumption that cells do not perform out-of-band communication (e.g., through files or external services), and that all state required by a cell is passed via the kernel's global scope.
Our first challenge is deriving each cell's read and write sets.
To accomplish this, we build a dataflow graph over the cells of the notebook using Python's \texttt{ast} module to obtain a structured representation of the code: an \emph{abstract syntax tree} (AST).
Variable and attribute references are marked by instances of the \texttt{Name} and \texttt{Attribute} nodes, respectively, each conveniently annotated with the directionality of the reference: \texttt{Load}, \texttt{Store}, or \texttt{Del}.
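For instance, the assignment \texttt{x.y = z} parses into an \texttt{Attribute} node with a \texttt{Store} context and \texttt{Name} nodes with \texttt{Load} contexts, as the following (abridged) \texttt{ast.dump} output illustrates:
\begin{minted}{python}
import ast

# Abridged illustration of the reference contexts in `x.y = z`.
print(ast.dump(ast.parse("x.y = z")))
# Module(body=[Assign(
#   targets=[Attribute(value=Name(id='x', ctx=Load()), attr='y', ctx=Store())],
#   value=Name(id='z', ctx=Load()))], ...)
\end{minted}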
We simulate execution of the Python code in a manner analogous to Program Slicing~\cite{DBLP:journals/tse/Weiser84}, using an in-order traversal of the AST's statements (lines or blocks of code) to build a \emph{fine-grained} data flow graph.
A data flow graph is a directed graph where each node corresponds to a single statement in the AST; we identify each node by a 2-tuple of the cell index and the line number.
Each edge corresponds to a read dependency, originating at a statement containing an attribute \texttt{Load} and terminating at the statement whose \texttt{Store} operation most recently wrote the attribute.
Due to control flow constructs (e.g., if-then-else blocks and for loops), a single attribute \texttt{Load} can potentially read from multiple \texttt{Store} operations; we represent this by encoding one edge for each potential dependency.
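To make the traversal concrete, the following sketch (simplified relative to \systemname's implementation: it tracks only top-level \texttt{Name} bindings and keeps a single candidate writer per symbol, rather than one per control-flow branch) builds the fine-grained graph for a list of cell sources:
\begin{minted}{python}
import ast

def refs(stmt, ctx_type):
    """Names referenced in a statement with the given context class."""
    return {n.id for n in ast.walk(stmt)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ctx_type)}

def fine_grained_graph(cells):
    """Edges (reader_node, writer_node, symbol) over a list of cell sources,
    where nodes are (cell_index, line_number) pairs."""
    last_store = {}   # symbol -> node that most recently wrote it
    edges = set()
    for cell_idx, source in enumerate(cells):
        for stmt in ast.parse(source).body:
            node = (cell_idx, stmt.lineno)
            for sym in refs(stmt, ast.Load):
                if sym in last_store:
                    edges.add((node, last_store[sym], sym))
            for sym in refs(stmt, ast.Store):
                last_store[sym] = node
    return edges

# fine_grained_graph(["a = 1", "b = a + 1"]) == {((1, 1), (0, 1), 'a')}
\end{minted}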
Python's scoping logic presents additional complications.
First, function and class definitions may reference symbols from a containing scope.
For example, Python's \texttt{import} statement simply declares imported modules as symbols in the current scope.
Thus, references to the module's functions within a function or class definition create transitive dependencies.
When the traversal visits a function or class declaration statement, we record the symbols that the declaration's body references but does not itself bind, treating them as transitive read dependencies of the declaration.
\begin{figure}
\begin{center}
\begin{subfigure}{0.45\columnwidth}
\begin{minted}{python}
def foo():
    print(a)
a = 1
foo()   # Prints '1'
def bar():
    a = 2
    foo()
bar()   # Prints '1'
\end{minted}
\end{subfigure}
\begin{subfigure}{0.5\columnwidth}
\begin{minted}{python}
def foo():
    def bar():
        print(a)
    a = 2
    return bar
bar = foo()
bar()   # Prints '2'
a = 1
bar()   # Prints '2'
\end{minted}
\end{subfigure}
\end{center}
\caption{Scope capture in Python happens at function definition, but captured scopes remain mutable.}
\label{fig:scoping}
\end{figure}
An additional complication arises from Python's scope capture semantics.
When a function (or class) is declared, it records a reference to each enclosing scope; consider the example code in \Cref{fig:scoping}.
In the latter code block, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}; this captured binding shadows the subsequent assignment in the global scope.
In the former code block, conversely, \texttt{bar}'s assignment to \texttt{a} happens in its own local scope, and so the invocation of \texttt{foo} reads the instance of \texttt{a} in the global scope.
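To illustrate how such transitive dependencies can be approximated (a conservative sketch that deliberately ignores the finer points of Python's closure semantics, and not \systemname's exact implementation), the free symbols of a declaration are the names its body loads minus the names the declaration binds itself:
\begin{minted}{python}
import ast

def free_symbols(decl):
    """Conservatively approximate the symbols a FunctionDef/ClassDef body
    reads from enclosing scopes: names loaded anywhere inside it, minus
    names it binds itself (parameters, assignments, imports, nested
    definitions)."""
    loaded, bound = set(), {decl.name}
    for node in ast.walk(decl):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            loaded.add(node.id)
        elif isinstance(node, ast.Name):        # Store or Del context
            bound.add(node.id)
        elif isinstance(node, ast.arg):         # function parameters
            bound.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            bound.update(a.asname or a.name.split('.')[0] for a in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)) and node is not decl:
            bound.add(node.name)
    return loaded - bound

# For `def foo(): print(a)` from the figure, this yields {'a', 'print'};
# builtins such as `print` can be filtered against `builtins.__dict__`.
\end{minted}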
\tinysection{Coarse-Grained Data Flow}
We next reduce the fine-grained data flow graph, as defined above, into a simplified \emph{coarse-grained} data flow graph by (i) merging the nodes for each cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
The coarse-grained data flow graph provides each cell's dependencies: the labels on its out-edges (resp., in-edges) give a guaranteed over-approximation of the symbols the cell must import (resp., export), which we treat as the cell's read set (resp., write set).
We note that the coarse-grained data flow graph is free of several complications that affect most data flow analyses.
Notably, Jupyter does not allow control-flow primitives to span cells.
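Continuing the earlier sketch (under the same simplifying assumptions), coarse-graining amounts to projecting nodes onto their cell index:
\begin{minted}{python}
def coarse_grained_graph(fine_edges):
    """Collapse fine-grained nodes to their cell index; the set comprehension
    drops self-edges and deduplicates parallel edges with identical labels."""
    return {(reader[0], writer[0], sym)
            for (reader, writer, sym) in fine_edges
            if reader[0] != writer[0]}

def read_write_sets(coarse_edges, cell_idx):
    """Out-edge labels approximate a cell's read set; in-edge labels its
    write set."""
    reads  = {sym for (r, w, sym) in coarse_edges if r == cell_idx}
    writes = {sym for (r, w, sym) in coarse_edges if w == cell_idx}
    return reads, writes
\end{minted}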
For each cell, we record these approximate read and write sets. We then instrument the cell, injecting \systemname import operations for every element of the read set at the start of the cell and export operations for every element of the write set at the end of the cell.
We leave to future work a more fine-grained instrumentation that avoids importing or exporting symbols until they are explicitly required.
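For concreteness, a minimal sketch of such instrumentation is shown below; \texttt{pp\_import} and \texttt{pp\_export} are hypothetical placeholders for \systemname's actual import and export operations.
\begin{minted}{python}
def instrument_cell(source, reads, writes):
    """Prepend an import for each element of the read set and append an
    export for each element of the write set.  `pp_import` / `pp_export`
    are hypothetical placeholders, not the system's actual API."""
    prologue = [f"{sym} = pp_import('{sym}')" for sym in sorted(reads)]
    epilogue = [f"pp_export('{sym}', {sym})" for sym in sorted(writes)]
    return "\n".join(prologue + [source] + epilogue)
\end{minted}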