%!TEX root=../main.tex
Conservative static analysis for Python typically either produces a very coarse-grained over-approximation of a program's true data dependencies, or must permit false negatives (missed dependencies).
To see why this is the case, consider the code snippet in \Cref{fig:example-python-code}, which retrieves a piece of Python code from the web and then evaluates it.
Such code presents challenges for a conservative approach, because the dynamically evaluated code can create data dependencies among everything in global scope.
Furthermore, conservative static analysis must recursively descend into libraries to obtain a full set of dependencies.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
\begin{minted}{python}
import urllib.request as r
with r.urlopen('http://someweb/code.py') as response:
    eval(response.read())
\end{minted}
\begin{minted}{python}
b = d * 2 if a > 10 else e * 2
\end{minted}
\caption{Example Python code: dynamic code evaluation (top) and data-dependent control flow (bottom)}\label{fig:example-python-code}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Crucially, for our provenance use cases (i.e., parallelism and incremental updates), we need provenance to be available \emph{before} a notebook is executed.
Overly conservative static analysis is a bad fit: in all but the most trivial notebooks, such analyses must trade off between excessive runtimes to fully analyze all dependent libraries and treating all cells as interdependent.
Conversely, a less conservative approach could lead to unsafe notebook execution if it misses a dependency.
To overcome this dilemma, we propose an approach that computes approximate provenance using static analysis (allowing for both false negatives and false positives in terms of data dependencies) and then compensates for missing and spurious data dependencies by discovering and correcting them at runtime. This approach is sensible in the context of computational notebooks, and prior systems like Nodebook, a Jupyter plugin developed at Stitchfix~\cite{nodebook}, make similar assumptions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Static Approximate Provenance}
An initial pass over the notebook's code obtains read and write sets for each cell using Python's \texttt{ast} library, and applies standard dataflow equations~\cite{DBLP:journals/tse/Weiser84} to derive an approximate dataflow graph.
To minimize performance overheads, this step only analyzes the user's code and does not consider other modules (libraries) --- dependencies established inside library code (e.g., by stateful libraries) will be missed at this stage, but can still be discovered at runtime.
Conversely, like any static dataflow analysis, this stage may also produce false positives due to control flow that cannot be resolved statically. For example, in the second snippet of \Cref{fig:example-python-code}, whether the cell reads \texttt{d} or \texttt{e} depends on the value of \texttt{a}.
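To illustrate, the sketch below over-approximates a cell's read and write sets by walking its AST; the helper \texttt{cell\_read\_write\_sets} is hypothetical and ignores attribute accesses, nested scopes, and dynamic features such as \texttt{eval}.
\begin{minted}{python}
import ast

def cell_read_write_sets(code):
    # Collect every name loaded (read) or stored/deleted
    # (written) anywhere in the cell's AST.
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    return reads, writes

# Both branches of the conditional are visited, so d and e
# are both reported as reads:
cell_read_write_sets("b = d * 2 if a > 10 else e * 2")
# ({'a', 'd', 'e'}, {'b'})
\end{minted}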
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Exact Runtime Provenance}
As the notebook executes, the second stage relies on the ICE architecture to collect the set of data artifacts read or written by each cell.
These dynamically collected read and write sets are used to refine the dataflow graph.
A scheduler can then leverage the refined information, as it becomes available, to assess opportunities for parallelism and for work re-use across re-executions of the notebook.
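As a minimal illustration of this idea (a sketch, not the actual ICE instrumentation), a namespace wrapper can record exactly which variables a cell touches while it runs:
\begin{minted}{python}
class TracingScope(dict):
    # Record every global read or written while a cell executes.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads, self.writes = set(), set()
    def __getitem__(self, key):
        self.reads.add(key)
        return super().__getitem__(key)
    def __setitem__(self, key, value):
        self.writes.add(key)
        super().__setitem__(key, value)

scope = TracingScope(a=5, d=1, e=2)
exec("b = d * 2 if a > 10 else e * 2", scope)
scope.reads                      # {'a', 'e'}: d was never read
scope.writes - {'__builtins__'}  # {'b'}
\end{minted}
Unlike the static pass, only the branch actually taken is observed, so the spurious dependency on \texttt{d} disappears, while reads the static analysis missed (e.g., through \texttt{eval}) are added to the dataflow graph.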
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: