Provenance for workflow systems has been studied extensively for several decades (e.g., see \cite{DC07} for a survey). However, workflow systems expect data dependencies to be specified explicitly as part of the workflow specification and, thus, such provenance techniques are not applicable to our problem setting where data dependencies are not declared upfront. More closely related to our work are provenance techniques for programming languages and static analysis techniques from the programming languages community~\cite{NN99}.
Pimentel et al.~\cite{pimentel-19-scmanpfs} provide an overview of research on provenance for scripting (programming) languages and identified both a need for fine-grained provenance in this context and the challenges it entails.\BG{What other takeaways?}
noWorkflow~\cite{pimentel-17-n, DBLP:conf/tapp/PimentelBMF15} is a tool for collecting several types of provenance for Python scripts, including environmental information (library dependencies and OS environments), static dataflow information, and dynamic (runtime) control- and dataflow information collected using profiling and instrumentation tools. In~\cite{DBLP:conf/tapp/PimentelBMF15}, noWorkflow was extended to support collecting provenance for IPython notebooks. Like noWorkflow, we use static dataflow analysis; unlike noWorkflow, we do not instrument Python code to determine actual dependencies at runtime.
\cite{macke-21-fglsnin} presents an approach that combines static and dynamic dataflow analysis using Python's tracing capabilities to track dataflow dependencies during cell execution. This information is then used to detect what the authors refer to as ``unsafe'' interactions, where a cell reads an outdated version of a variable written by another cell, and to suggest which cells to execute to resolve the staleness. In contrast to this approach, where the user has to manually follow the system's suggestions in a step-wise manner to resolve staleness, our model prevents staleness from arising in the first place by automatically refreshing dependent cells.
Vamsa~\cite{namaki-20-v} also employs static dataflow analysis to analyze the provenance of Python ML pipelines, but additionally annotates variables with semantic tags (e.g., features and labels).
\cite{KP17a} introduces Dataflow notebooks, which extend Jupyter with immutable identifiers for cells and the capability to reference the result of a cell by its identifier. The purpose of this extension is to address the problem of implicit cell dependencies caused by shared Python interpreter state and out-of-order execution of cells in a notebook. If users are diligent in using these features, then Dataflow notebooks can support automatic refresh of dependent cells like our model does. However, our model has the advantage that users do not need to change their code to use cell identifiers and cannot accidentally create hidden dependencies, since cell executions are isolated from each other. Another advantage of our approach is that it allows parallel execution of independent cells, which was only alluded to as a possibility in \cite{KP17a}.
% \begin{itemize}
% \item Provenance for jupyter or python
% \item Dataflow analysis / program slicing
% \item Textbook Transactions / Optimistic concurrency control?\BG{maybe not enough space and not closely related enough}
% \end{itemize}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: