%!TEX root=../main.tex
Workflow provenance has been studied extensively (e.g., see \cite{DC07} for a survey), but its reliance on explicit dependencies limits its utility in our setting. More closely related are provenance and static analysis techniques from the programming languages community~\cite{NN99}.
Pimentel et al.~\cite{pimentel-19-scmanpfs} provide an overview of research on provenance for scripting (programming) languages and identify the need for, and challenges of, fine-grained provenance in this context.
noWorkflow~\cite{pimentel-17-n} collects several types of provenance for Python scripts including environmental information, as well as static and dynamic data- and control-flow,
% \cite{DBLP:conf/tapp/PimentelBMF15} extends noWorkflow to Jupyter notebooks and is closely related to our work,
but, in contrast to our work, produces provenance only for analysis and debugging, not for scheduling.
\cite{macke-21-fglsnin} combines static and dynamic dataflow analysis to track dataflow dependencies during cell execution and to warn users of ``unsafe'' interactions in which a cell reads an outdated version of a variable. By contrast, our approach automatically refreshes dependent cells.
Vamsa~\cite{namaki-20-v} also employs static dataflow analysis to analyze the provenance of Python ML pipelines. % , but additionally annotates variables with semantic tags (e.g., features and labels).
Dataflow notebooks~\cite{KP17a} extend Jupyter with immutable identifiers for cells and the capability to reference the results of a cell by its identifier.
This approach can avoid implicit dependencies, but requires users to be diligent in using these features.
Additionally, our approach allows parallel execution of independent cells, something that was only alluded to as a possibility in \cite{KP17a}.
Nodebook~\cite{nodebook} is a plugin for Jupyter that checkpoints notebook state in between cells to force in-order cell evaluation. Although closely related to our approach, it attempts neither parallelism nor automatic re-execution of cells.
\cite{chapman-20-cqfgppp} capture fine-grained provenance at runtime for common classes of relational data transformations in Python preprocessing pipelines. In contrast, our approach utilizes static analysis. % and is not limited to operations on relational datasets.
% \begin{itemize}
% \item Provenance for jupyter or Python
% \item Dataflow analysis / program slicing
% \item Textbook Transactions / Optimistic concurrency control?\BG{maybe not enough space and not closely related enough}
% \end{itemize}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: