
%!TEX root=../main.tex
Provenance for workflow systems has been studied extensively for several decades (see, e.g., \cite{DC07} for a survey). However, workflow systems expect data dependencies to be specified explicitly as part of the workflow specification; thus, such provenance techniques are not applicable to our problem setting. More closely related to our work are provenance techniques for programming languages and static analysis techniques from the programming languages community~\cite{NN99}.

Pimentel et al.~\cite{pimentel-19-scmanpfs} provide an overview of research on provenance for scripting (programming) languages and identify a need for, and challenges in, fine-grained provenance in this context.

noWorkflow~\cite{pimentel-17-n, DBLP:conf/tapp/PimentelBMF15} collects several types of provenance for Python scripts, including environmental information as well as static and dynamic data- and control-flow.
\cite{DBLP:conf/tapp/PimentelBMF15} extends noWorkflow to Jupyter notebooks and is closely related to our work, but produces provenance only for analysis and debugging, not for scheduling.
\cite{macke-21-fglsnin} combines static and dynamic dataflow analysis to track dataflow dependencies during cell execution and warn users of ``unsafe'' interactions in which a cell reads an outdated version of a variable. By contrast, our approach automatically refreshes dependent cells.
Vamsa~\cite{namaki-20-v} also employs static dataflow analysis to analyze provenance of Python ML pipelines, but additionally annotates variables with semantic tags (e.g., features and labels).

\cite{KP17a} introduces Dataflow notebooks, which extend Jupyter with immutable identifiers for cells and the capability to reference the results of a cell by its identifier.
This approach can avoid implicit dependencies, but requires users to be diligent in using these features.
Additionally, our approach allows parallel execution of independent cells, something that was only alluded to as a possibility in \cite{KP17a}.
A similar project, Nodebook~\cite{nodebook}, is a plugin for Jupyter that checkpoints notebook state between cells to force in-order cell evaluation. Although closely related to our approach, it attempts neither parallelism nor automatic re-execution of cells.


% \begin{itemize}
% \item Provenance for jupyter or python
% \item Dataflow analysis / program slicing
% \item Textbook Transactions / Optimistic concurrency control?\BG{maybe not enough space and not closely related enough}
% \end{itemize}


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: