
As mentioned earlier, a conservative static analysis for Python would typically yield a very coarse-grained over-approximation of a program's real data dependencies because of Python's dynamic nature, while a less conservative analysis has to allow for false negatives (missed dependencies). To see why, consider the code snippet shown at the top of \Cref{fig:example-python-code}, which retrieves a piece of Python code from the web and then evaluates it. Such code presents challenges for a conservative approach, because the dynamically evaluated code could create data dependencies between everything in global scope. If the user's code uses any libraries, then the static analysis would have to be extended into the library code to detect any dynamic code execution that could cause unpredictable data dependencies.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
  \centering
  \begin{minted}{python}
import urllib.request as r
with r.urlopen('http://someweb/code.py') as response:
  somecode = response.read()
  eval(somecode)
\end{minted}

\begin{minted}{python}
if a > 10:
   b = c * 2
else:
   b = d * 2
\end{minted}

\caption{Example Python code}\label{fig:example-python-code}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Our use cases for provenance are the parallel execution of cells and limiting automatic refresh to dependent cells. Both require provenance to be available before a notebook is executed: to schedule cells for parallel execution and, for automatic refresh, to restrict re-execution to cells that are data-dependent on a cell modified by the user. An overly conservative static analysis would not be a good fit for this purpose, because in most cases all cells would have to be considered interdependent. However, a less conservative approach could lead to unsafe notebook execution, e.g., when we parallelize the execution of two cells with data dependencies. To overcome this dilemma, we propose an approach that computes approximate provenance using static analysis (allowing for both false negatives and false positives in terms of data dependencies) and then detects and compensates for missing and spurious data dependencies at runtime. This approach is sensible in the context of computational notebooks, because the dynamic language features that cause issues are rarely used in notebooks.\BG{back this up?}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Approximate Provenance through Static Dataflow Analysis}

In terms of static analysis, we build a dataflow graph for the code of Python cells using Python's \texttt{ast} module and dataflow equations that are standard in static program analysis. We only analyze the user's code and do not extend the analysis to other modules (libraries). While we may miss data dependencies between user-created objects that are caused by such library code, this is acceptable, because such dependencies will be discovered and compensated for at runtime.
Like any static dataflow analysis technique, our approach may produce false positives.\footnote{This is because static analysis has to reason about all possible control-flow paths through a program that could arise for some input, while for a concrete input only some of these paths may be taken.} For example, in the code snippet shown at the bottom of \Cref{fig:example-python-code}, the value of the variable \texttt{b} may depend on either \texttt{c} or \texttt{d}, subject to whether \texttt{a > 10} evaluates to true. Since the value of \texttt{a} is only known at runtime, static analysis has to assume that \texttt{b} depends on both \texttt{c} and \texttt{d}.
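
To make the idea concrete, the following is a minimal sketch of such an analysis, not our actual implementation. It is flow-insensitive and only tracks plain variable names; the helper names \texttt{reads\_and\_writes} and \texttt{approximate\_dependencies} are hypothetical and chosen for illustration only. Applied to the branch example, it reports that \texttt{b} may depend on both \texttt{c} and \texttt{d}, which is exactly the kind of false positive discussed above.

\begin{minted}{python}
import ast


def reads_and_writes(source):
    """Return (reads, writes): names a cell may load or may assign.

    Simplifying assumption: only ast.Name nodes are inspected, so
    imports, function definitions, and attribute updates are ignored.
    """
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    return reads, writes


def approximate_dependencies(cells):
    """Map each cell index to the earlier cells it may depend on."""
    last_writer = {}  # variable name -> index of last cell writing it
    deps = {}
    for idx, source in enumerate(cells):
        reads, writes = reads_and_writes(source)
        deps[idx] = {last_writer[n] for n in reads if n in last_writer}
        for name in writes:
            last_writer[name] = idx
    return deps


cells = [
    "a = 5\nc = 1\nd = 2",
    "if a > 10:\n    b = c * 2\nelse:\n    b = d * 2",
    "print(b)",
]
deps = approximate_dependencies(cells)
branch_reads, branch_writes = reads_and_writes(cells[1])
\end{minted}

Both branches contribute to \texttt{branch\_reads}, because the sketch walks the whole syntax tree regardless of which path is taken at runtime. Names without a preceding writer, such as the builtin \texttt{print}, do not induce a dependency.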

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Refining Approximate Provenance at Runtime}

To deal with the approximate nature of provenance generated by our static analysis approach, we allow the educated guesses made by static analysis to be refined at runtime. This is where the benefits of isolated cell execution become clear: cells communicate explicitly through data artifacts that are created, read, and written through an API provided by the system. Because the system keeps track of which cells access which data through this API, new data dependencies not predicted by static analysis are automatically detected at runtime. Similarly, when a cell finishes execution, we know which predicted data dependencies have not materialized. Next, we introduce a scheduling algorithm that dynamically adapts its execution plan when new dependencies are detected or predicted cell dependencies do not materialize.
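
The bookkeeping described above can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the system's actual API: the class \texttt{ArtifactStore}, the function \texttt{refine}, and the cell and artifact names are all hypothetical. The store records which cell last wrote each artifact, so every read reveals an observed dependency that can be compared against the statically predicted ones.

\begin{minted}{python}
class ArtifactStore:
    """Hypothetical stand-in for the artifact API used by cells."""

    def __init__(self):
        self._data = {}     # artifact name -> value
        self._writer = {}   # artifact name -> cell that last wrote it
        self.observed = {}  # cell -> set of cells it actually read from

    def write(self, cell, name, value):
        self._data[name] = value
        self._writer[name] = cell
        self.observed.setdefault(cell, set())

    def read(self, cell, name):
        self.observed.setdefault(cell, set())
        writer = self._writer.get(name)
        if writer is not None and writer != cell:
            self.observed[cell].add(writer)
        return self._data[name]


def refine(predicted, observed):
    """Compare predicted and observed dependencies of one cell."""
    missed = observed - predicted    # false negatives of the analysis
    spurious = predicted - observed  # false positives of the analysis
    return missed, spurious


store = ArtifactStore()
store.write("load", "df", [1, 2, 3])
store.write("clean", "df_clean", [1, 2])
store.write("config", "threshold", 2)

# Cell "plot" was predicted to depend on "load" and "clean",
# but at runtime it reads artifacts written by "load" and "config".
rows = store.read("plot", "df")
limit = store.read("plot", "threshold")

missed, spurious = refine({"load", "clean"}, store.observed["plot"])
\end{minted}

In this sketch, \texttt{missed} holds dependencies that have to be compensated for, while \texttt{spurious} holds predicted dependencies that did not materialize once the cell has finished.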


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: