Oliver Kennedy 2022-03-30 21:21:25 -04:00
commit f539632ef0
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
9 changed files with 709 additions and 5791 deletions


@ -0,0 +1,12 @@
0,4288,1,135,1172
1,4289,1,1262,4550
2,4290,1,4644,22094
3,4291,1,4659,23049
4,4292,1,4676,20274
5,4293,1,4691,20146
6,4294,1,4708,23067
7,4295,1,4727,23019
8,4296,1,4753,23098
9,4297,1,4771,22820
10,4298,1,4790,23084
11,4299,1,4811,23036


@ -0,0 +1,12 @@
0,4312,1,150,1162
1,4313,1,1279,4338
2,4314,1,4405,13781
3,4315,1,13830,21114
4,4316,1,21171,28362
5,4317,1,28406,35557
6,4318,1,35593,42747
7,4319,1,42780,49838
8,4320,1,49872,56996
9,4321,1,57027,64225
10,4322,1,64255,71411
11,4323,1,71443,78471



@ -280,3 +280,11 @@ Applications Using Provenance},
volume = {32},
year = {2007}
}
@incollection{FS06,
author = {Freire, Juliana and Silva, Cláudio T and Callahan, Steven P and Santos, Emanuele and Scheidegger, Carlos E and Vo, Huy T},
booktitle = {Provenance and Annotation of Data},
pages = {10--18},
title = {Managing rapidly-evolving scientific workflows},
year = {2006}
}


@ -11,13 +11,15 @@ Workflow systems\cite{DC07} \OK{cite several systems: Vistrails, etc...}\BG{did
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
With Vizier~\cite{brachmann:2019:sigmod:data, brachmann:2020:cidr:your}, we introduced an alternative execution model for notebooks: the cells of a notebook are steps in a workflow that run in isolation from each other, except for the ability to communicate through dataflow (by reading and writing data artifacts). By using workflow evolution provenance, this addresses the reproducibility issues of notebooks, because a cell is automatically re-executed if another cell it depends on, directly or indirectly, is modified. Furthermore, in contrast to traditional workflow systems, the user neither has to design a workflow upfront nor has to specify dataflow explicitly. Data dependencies are detected at runtime when a cell's code accesses a data artifact using an API provided by the system. This means that support for new programming languages can be added to the system relatively easily by implementing a new artifact API. However, the ease of use of this approach comes at the cost of less flexible scheduling, because new dependencies may arise during cell execution, essentially preventing parallel execution of cells. Furthermore, the fact that cells are isolated from each other prevents easy translation from commonly used notebook formats such as the one used by Jupyter, because notebooks typically rely on the hidden Python interpreter state for inter-cell communication. This in turn negatively impacts the adoption of Vizier, because users cannot easily port their notebooks to the system.
With Vizier~\cite{brachmann:2019:sigmod:data, brachmann:2020:cidr:your}, we introduced an alternative execution model for notebooks: the cells of a notebook are steps in a workflow that run in isolation from each other, except for the ability to communicate through dataflow (by reading and writing data artifacts). Vizier combines workflow evolution provenance~\cite{FS06} to keep track of changes to a notebook, a version model for data artifacts that exploits immutability, and a workflow execution engine that incrementally re-executes the relevant parts of a workflow after it has been modified. This combination addresses the reproducibility issues of notebooks: a cell is automatically re-executed if another cell it depends on, directly or indirectly, is modified, and all versions of the workflow and the corresponding data artifacts are retained.\footnote{Data artifacts for previous versions of a notebook or workflow may be garbage collected to save space, but can be reproduced if need be by re-executing the version of the workflow that generated them.} In contrast to traditional workflow systems, the user neither has to design a workflow upfront nor has to specify dataflow explicitly. Data dependencies are detected at runtime when a cell's code accesses a data artifact using an API provided by the system. This means that support for new programming languages can be added to the system relatively easily by implementing a new artifact API. However, the ease of use of this approach comes at the cost of less flexible scheduling, because new dependencies may arise during cell execution, essentially preventing parallel execution of cells. Furthermore, the fact that cells are isolated from each other prevents easy translation from commonly used notebook formats such as the one used by Jupyter, because notebooks typically rely on the hidden Python interpreter state for inter-cell communication.
This in turn negatively impacts the adoption of Vizier, because users cannot easily port their notebooks to the system.
In this paper, we present an approach that uses approximate provenance for Python code, computed using static program analysis techniques, to address the two issues mentioned above.
In this paper, we present a novel \emph{coarse-grained} dataflow provenance model --- a hybrid of classical workflow and dataflow provenance --- that permits not only parallel execution, but also incremental re-execution of computational notebooks.
novel \emph{coarse-grained} dataflow provenance model --- a hybrid of classical workflow and dataflow provenance --- that permits not only parallel execution, but also incremental re-execution of computational notebooks.
We outline the implementation of this provenance model in an existing workflow system named Vizier~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data}, and address several of the challenges that arise when parallelizing notebooks.
\subsection{Parallelism}
To assess the potential of parallelizing Jupyter notebooks, we conducted a preliminary survey on an archive of Jupyter notebooks scraped from GitHub by Pimentel et al.~\cite{DBLP:journals/ese/PimentelMBF21}.
Our survey included only notebooks using a Python kernel and known to execute successfully; a total of nearly 6000 notebooks met these criteria.
@ -25,6 +27,17 @@ We constructed a dataflow graph as described in \Cref{sec:import}; as a proxy me
\Cref{fig:parallelismSurvey} presents the depth --- the maximum number of cells that must be executed serially --- in relation to the total number of Python cells in the notebook.
The average notebook has over 16 cells, but an average dependency depth of just under 4; on average, around 4 cells are therefore able to run concurrently.
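As a rough illustration of the depth measure used above, the following sketch computes the length of the longest dependency chain in a notebook. It assumes the dataflow graph is already available as a mapping from each cell to the set of cells it reads from; the function name \texttt{dependency\_depth} and this input format are hypothetical and not taken from the Vizier code base.

```python
from functools import lru_cache


def dependency_depth(deps):
    """Return the maximum number of cells that must run serially.

    `deps` (a hypothetical input format) maps each cell id to the set of
    cell ids it depends on. The graph is assumed to be acyclic, which holds
    when cells only depend on cells that appear earlier in the notebook.
    """

    @lru_cache(maxsize=None)
    def depth(cell):
        # A cell with no dependencies has depth 1; otherwise it sits one
        # level below its deepest dependency.
        return 1 + max((depth(d) for d in deps.get(cell, ())), default=0)

    return max((depth(c) for c in deps), default=0)


# Five cells: 1 and 2 read from 0, 3 reads from 1 and 2, 4 is independent.
example = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}, 4: set()}
example_depth = dependency_depth(example)
# Ratio of cells to depth, a crude proxy for available parallelism.
example_parallelism = len(example) / example_depth
```

Dividing the number of cells by the depth gives the kind of back-of-the-envelope parallelism estimate quoted in the text (over 16 cells at a depth of just under 4 yields roughly 4 concurrent cells).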
The main contributions of this work are:
\begin{itemize}
\item \textbf{Approximate Provenance for Python using Static Analysis}: We introduce an approach that captures data provenance (data dependencies) using static analysis of the Python code of a notebook's cells. This approach is approximate in that it allows for both false positives and false negatives. Through this approximation, however, we circumvent the issue that a conservative static dataflow analysis for Python would face: to account for the dynamic features of Python, it would produce a very coarse-grained over-approximation of the data dependencies.
\item \textbf{A Scheduler for Parallel Execution of Workbook Cells}: We present a scheduler for incremental execution of workbooks, i.e., workflows whose modules are the cells of a Python computational notebook. The scheduler uses approximate data dependencies, determined by our static analysis of the Python code of the workbook, to enable parallel execution of cells that are independent in terms of dataflow. To compensate for false positives (predicted data dependencies that do not materialize at runtime), the scheduler dynamically adapts the execution plan when a predicted dependency does not materialize during the execution of a cell. To deal with false negatives (data dependencies that materialize at runtime, but were not predicted by static analysis), the scheduler can roll back the execution of cells whose execution was invalid (that is, cells that should have observed the changes made by another cell).
\item \textbf{Jupyter Import for Workbook Systems based on Provenance}: Since cells in a workbook are isolated from each other and communicate only explicitly through dataflow, a direct import of Jupyter notebooks into Vizier (or any other workbook system) cannot preserve the cell structure of the notebook. A naive approach would be to merge all cells into a single cell to circumvent the problem caused by global state that is shared among cells. However, the structure of the notebook and the intermediate results that are shown inline in the Jupyter notebook would be lost in the imported workbook. Instead, we utilize our approximate provenance approach for Python to determine data dependencies between cells. Any object that is shared across cells is then wrapped as a data artifact in Vizier, i.e., we make the dataflow across cells explicit. This allows us to preserve the structure of the notebook: there is a one-to-one correspondence between the cells of the notebook and the cells of the generated workbook. While technically this is not safe if there are false negatives in terms of data dependencies, the types of Python features that lead to false negatives are rarely used in notebooks. Furthermore, our approach is safe in the sense that
\item \textbf{Implementation in Vizier and Experiments}: We have implemented a prototype of the proposed scheduler (except for the compensation for false negatives) in our workbook system Vizier. We have evaluated the potential for parallel execution by importing real-world Jupyter notebooks into Vizier using the proposed techniques and comparing their serial and parallel execution.
\end{itemize}
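To make the first two contributions more concrete, the following is a minimal sketch of approximate, static dependency detection between cells, followed by a grouping of cells into waves that could run in parallel. It is an illustration under simplifying assumptions, not the analysis or scheduler implemented in Vizier: the function names are hypothetical, only the standard-library \texttt{ast} module is used, and neither the runtime adaptation for false positives nor the rollback for false negatives is modeled.

```python
import ast


def cell_reads_writes(source):
    """Approximate the sets of names a cell reads and writes.

    Over-approximates reads (a name assigned and then read within the same
    cell still counts as a read) and misses dynamic accesses such as
    exec() or globals(), so both false positives and false negatives occur.
    """
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                reads.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            writes.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import a.b" binds the top-level name "a".
                writes.add((alias.asname or alias.name).split(".")[0])
    return reads, writes


def cell_dependencies(cells):
    """Map each cell index to the earlier cells it (approximately) reads from."""
    last_writer, deps = {}, {}
    for idx, source in enumerate(cells):
        reads, writes = cell_reads_writes(source)
        deps[idx] = {last_writer[n] for n in reads if n in last_writer}
        for name in writes:
            last_writer[name] = idx
    return deps


def parallel_waves(deps):
    """Group cells into waves; cells in one wave have no mutual dependencies."""
    level = {}
    for idx in sorted(deps):  # dependencies always point to earlier cells
        level[idx] = 1 + max((level[d] for d in deps[idx]), default=0)
    waves = {}
    for idx, lvl in level.items():
        waves.setdefault(lvl, []).append(idx)
    return [waves[lvl] for lvl in sorted(waves)]


notebook = ["import math\nx = 2", "y = math.sqrt(x)", "z = 10", "print(y + z)"]
notebook_deps = cell_dependencies(notebook)
notebook_waves = parallel_waves(notebook_deps)
```

In this toy notebook, the third cell does not read anything written by earlier cells, so it lands in the same wave as the first cell; the names shared across cells (here \texttt{math}, \texttt{x}, \texttt{y}, and \texttt{z}) are the ones that would need to be exposed as explicit dataflow in an import.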
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"