%!TEX root=../main.tex
Conservative static analysis for Python typically either produces a very coarse-grained over-approximation of a program's real data dependencies or has to allow for false negatives (missed dependencies).
To see why this is the case, consider the code snippet in \Cref{fig:dynamic-code-evaluation-i}, which retrieves a piece of Python code from the web and then evaluates it.
The dynamically evaluated code can create data dependencies between everything in global scope. Furthermore, conservative static analysis must recursively descend into libraries to obtain a full set of dependencies.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{1\linewidth}
\begin{pythoncode*}{highlightlines={3}}
import urllib.request as r
with r.urlopen('http://someweb/code.py') as response:
    eval(response.read())
\end{pythoncode*}
\vspace*{-2mm}
\caption{Dynamic code evaluation in Python may lead to arbitrary dependencies that will only be known at runtime.}\label{fig:dynamic-code-evaluation-i}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{1\linewidth}
\begin{pythoncode}
b = d * 2 if a > 10 else e * 2
\end{pythoncode}
\vspace*{-2mm}
\caption{Static dataflow analysis has to conservatively over-approximate dataflow because control flow depends on the program's input.}\label{fig:static-dataflow-analysis-}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace*{-4mm}
\caption{Example Python code}\label{fig:example-python-code}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Crucially, for our provenance use cases (i.e., parallelism and incremental updates), we need provenance to be available \emph{before} a notebook is executed.
Overly conservative static analysis is a bad fit: in all but the most trivial notebooks, such analysis must either incur excessive runtimes to fully analyze all dependent libraries or treat all cells as interdependent.
Conversely, a less conservative approach could lead to unsafe notebook execution if it misses a dependency.
To overcome this dilemma, we propose an approach that computes approximate provenance using static analysis (allowing for both false negatives and false positives in terms of data dependencies) and then compensates for missing and spurious data dependencies by discovering them at runtime. This approach is sensible in the context of computational notebooks, and prior systems like Nodebook, a Jupyter plugin developed at Stitchfix~\cite{nodebook}, make similar assumptions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Static Approximate Provenance}
An initial pass over the notebook's code uses Python's AST library to obtain a set of read and write dependencies for each cell, and applies standard dataflow equations~\cite{NN99} to derive an approximate dataflow graph.
To minimize performance overhead, this step only analyzes the user's code and does not consider other modules (libraries) --- intra-module dependencies (e.g., stateful libraries) will be missed at this stage, but can still be discovered at runtime.
Conversely, like any static dataflow analysis, this stage may also produce false positives due to control-flow decisions that depend on the input. For example, in \Cref{fig:static-dataflow-analysis-}, whether the cell has a read dependency on \texttt{d} or \texttt{e} depends on the value of \texttt{a}, which will only be known at runtime.
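As a simplified illustration of this static pass (the helper \texttt{approx\_rw\_sets} below is hypothetical and not our actual implementation; it ignores attribute accesses, scoping, and the dataflow equations applied on top of the per-cell sets), the per-cell read and write sets can be sketched as follows:
\begin{pythoncode}
import ast

def approx_rw_sets(cell_source):
    # Illustrative sketch, not the system's implementation:
    # collect names the cell reads (ast.Load) and names it writes
    # (ast.Store / ast.Del) as a coarse approximation of its
    # read and write dependencies.
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:
                writes.add(node.id)
    return reads, writes
\end{pythoncode}
Applied to the cell in \Cref{fig:static-dataflow-analysis-}, such a pass reports reads of \texttt{a}, \texttt{d}, and \texttt{e} and a write of \texttt{b}, over-approximating the reads that occur in any single execution.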
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Exact Runtime Provenance}
As the notebook executes, provenance refinement relies on the ICE architecture to collect data artifacts written to or read by each cell.
The resulting dynamically collected read/write sets are used to refine the dataflow graph created by static analysis.
Our scheduler (\Cref{sec:scheduler}) leverages this refined information as it becomes available to assess opportunities for parallelism and work re-use across notebook re-executions.
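As an illustrative sketch (the function and data layout below are hypothetical simplifications, not our implementation), such a refinement can replace a cell's statically approximated read/write sets with the observed ones and re-derive inter-cell dependencies:
\begin{pythoncode}
def refine_and_rebuild(rw_by_cell, cell, observed_reads, observed_writes):
    # Illustrative sketch: rw_by_cell is a list of (reads, writes)
    # set pairs, one entry per cell in notebook order.
    # Replace the static approximation for the executed cell with the
    # read/write sets observed at runtime ...
    rw_by_cell[cell] = (set(observed_reads), set(observed_writes))
    # ... and re-derive edges: a cell depends on the most recent
    # earlier cell that wrote a variable it reads.
    deps, last_writer = {}, {}
    for i, (reads, writes) in enumerate(rw_by_cell):
        deps[i] = {last_writer[v] for v in reads if v in last_writer}
        for v in writes:
            last_writer[v] = i
    return deps
\end{pythoncode}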
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: