%!TEX root=../main.tex
Workflow provenance has been studied extensively (see, e.g., \cite{DC07} for a survey), but its reliance on explicit dependencies limits its utility in our setting. More closely related are provenance and static analysis techniques from the programming languages community~\cite{NN99}.
Pimentel et al.~\cite{pimentel-19-scmanpfs} provide an overview of research on provenance for scripting (programming) languages and identify the need for, and challenges of, fine-grained provenance in this context.
noWorkflow~\cite{pimentel-17-n} collects several types of provenance for Python scripts, including environmental information as well as static and dynamic data- and control-flow,
% \cite{DBLP:conf/tapp/PimentelBMF15} extends noWorkflow to Jupyter notebooks and is closely related to our work,
but, in contrast to our work, produces provenance only for analysis and debugging, not for scheduling.
\cite{macke-21-fglsnin} combines static and dynamic dataflow analysis to track dataflow dependencies during cell execution and to warn users of ``unsafe'' interactions, in which a cell reads an outdated version of a variable. By contrast, our approach automatically refreshes dependent cells.
Vamsa~\cite{namaki-20-v} also employs static dataflow analysis to analyze provenance of Python ML pipelines. % , but additionally annotates variables with semantic tags (e.g., features and labels).
Dataflow notebooks~\cite{KP17a} extend Jupyter with immutable identifiers for cells and the capability to reference the results of a cell by its identifier.
This approach can avoid implicit dependencies, but requires users to be diligent in using these features.
Additionally, our approach allows parallel execution of independent cells, something that was only alluded to as a possibility in \cite{KP17a}.
Nodebook~\cite{nodebook} is a plugin for Jupyter that checkpoints notebook state in between cells to force in-order cell evaluation. Although closely related to our approach, it attempts neither parallelism nor automatic re-execution of cells.
\cite{chapman-20-cqfgppp} capture fine-grained provenance at runtime for common classes of relational data transformations in Python preprocessing pipelines. In contrast, our approach utilizes static analysis. % and is not limited to operations on relational datasets.
% \begin{itemize}
% \item Provenance for jupyter or Python
% \item Dataflow analysis / program slicing
% \item Textbook Transactions / Optimistic concurrency control?\BG{maybe not enough space and not closely related enough}
% \end{itemize}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: