Merge branch 'master' of https://git.overleaf.com/622e24cad7e111236dd1ae13

2022-03-30 21:21:25 -04:00 · 2022-03-30 21:21:25 -04:00 · f539632ef0
parent e26d01514d 3bb622c2ac
commit f539632ef0
9 changed files with 709 additions and 5791 deletions
--- a/data/compute_bound_parallel.csv
+++ b/data/compute_bound_parallel.csv
@@ -0,0 +1,12 @@
+0,4288,1,135,1172
+1,4289,1,1262,4550
+2,4290,1,4644,22094
+3,4291,1,4659,23049
+4,4292,1,4676,20274
+5,4293,1,4691,20146
+6,4294,1,4708,23067
+7,4295,1,4727,23019
+8,4296,1,4753,23098
+9,4297,1,4771,22820
+10,4298,1,4790,23084
+11,4299,1,4811,23036
--- a/data/compute_bound_serial.csv
+++ b/data/compute_bound_serial.csv
@@ -0,0 +1,12 @@
+0,4312,1,150,1162
+1,4313,1,1279,4338
+2,4314,1,4405,13781
+3,4315,1,13830,21114
+4,4316,1,21171,28362
+5,4317,1,28406,35557
+6,4318,1,35593,42747
+7,4319,1,42780,49838
+8,4320,1,49872,56996
+9,4321,1,57027,64225
+10,4322,1,64255,71411
+11,4323,1,71443,78471
--- a/data/depth_vs_cellcount.vega-lite
+++ b/data/depth_vs_cellcount.vega-lite
--- a/graphics/depth_vs_cellcount-averaged.vega-lite.pdf
+++ b/graphics/depth_vs_cellcount-averaged.vega-lite.pdf
--- a/graphics/depth_vs_cellcount-averaged.vega-lite.svg
+++ b/graphics/depth_vs_cellcount-averaged.vega-lite.svg
--- a/graphics/depth_vs_cellcount.vega-lite.pdf
+++ b/graphics/depth_vs_cellcount.vega-lite.pdf
--- a/graphics/depth_vs_cellcount.vega-lite.svg
+++ b/graphics/depth_vs_cellcount.vega-lite.svg
--- a/main.bib
+++ b/main.bib
@@ -280,3 +280,11 @@ Applications Using Provenance},
 volume = {32},
 year = {2007}
 }
+
+@incollection{FS06,
+ author = {Freire, Juliana and Silva, Cláudio T and Callahan, Steven P and Santos, Emanuele and Scheidegger, Carlos E and Vo, Huy T},
+ booktitle = {Provenance and Annotation of Data},
+ pages = {10--18},
+ title = {Managing rapidly-evolving scientific workflows},
+ year = {2006}
+}
--- a/sections/introduction.tex
+++ b/sections/introduction.tex
@@ -11,13 +11,15 @@ Workflow systems\cite{DC07} \OK{cite several systems: Vistrails, etc...}\BG{did
 \end{figure}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-With Vizier~\cite{brachmann:2019:sigmod:data, brachmann:2020:cidr:your}, we introduced an alternative execution model for notebooks: the cells of a notebook are steps in a workflow which run isolated from each other except for the ability to communicate through dataflow (by reading and writing data artifacts). By using workflow evolution provenance, this addresses the issues of reproducibility of notebooks because a cell will be automatically reexecuted if another cell it depends on directly or indirectly is modified. Furthermore, in contrast to traditional workflow systems, the user does not have to design a workflow upfront or has to specify dataflow explicitly. Data dependencies are detected at runtime when a cell's code accesses a data artifact using an API provided by the system. This means that support for new programming languages can be added to the system relatively easily by implementing a new artifact API. However, the ease-of-use of this approach comes at the cost of less flexible scheduling, because new dependencies may arise during cell execution, essentially preventing parallel execution of cells. Furthermore, the fact that cells are isolated from each other prevents easy translation from the commonly used notebook formats such as the one used by Jupyter, because notebooks typically rely on the hidden python interpreter state for inter-cell communication. This in turns negatively impacts the adoption of Vizier, because users cannot easily port their notebooks to the system.
+With Vizier~\cite{brachmann:2019:sigmod:data, brachmann:2020:cidr:your}, we introduced an alternative execution model for notebooks: the cells of a notebook are steps in a workflow that run in isolation from each other and communicate only through dataflow (by reading and writing data artifacts). Vizier combines workflow evolution provenance~\cite{FS06} to keep track of changes to a notebook, a version model for data artifacts that exploits immutability, and a workflow execution engine that incrementally re-executes the relevant parts of a workflow when it has been modified. This addresses the reproducibility issues of notebooks, because a cell is automatically re-executed if another cell it depends on, directly or indirectly, is modified, and all versions of the workflow and the corresponding data artifacts are retained.\footnote{Data artifacts for previous versions of a notebook or workflow may be garbage collected to save space, but can be reproduced if need be by re-executing the version of the workflow that generated them.} In contrast to traditional workflow systems, the user has to neither design a workflow upfront nor specify dataflow explicitly. Data dependencies are detected at runtime when a cell's code accesses a data artifact using an API provided by the system. This means that support for new programming languages can be added to the system relatively easily by implementing a new artifact API. However, the ease of use of this approach comes at the cost of less flexible scheduling: because new dependencies may arise during cell execution, cells essentially cannot be executed in parallel. Furthermore, the fact that cells are isolated from each other prevents easy translation from commonly used notebook formats such as the one used by Jupyter, because notebooks typically rely on hidden Python interpreter state for inter-cell communication.
This in turn negatively impacts the adoption of Vizier, because users cannot easily port their notebooks to the system.

+In this paper, we present an approach that uses approximate provenance for Python code, computed using static program analysis techniques, to address the two issues mentioned above.

-
-In this paper, we present a novel \emph{coarse-grained} dataflow provenance model --- a hybrid of classical workflow and dataflow provenance --- that permits not only parallel execution, but also incremental re-execution of computational notebooks.
+We present a novel \emph{coarse-grained} dataflow provenance model, a hybrid of classical workflow and dataflow provenance, that permits not only parallel execution, but also incremental re-execution of computational notebooks.
 We outline the implementation of this provenance model into an existing workflow system named Vizier~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data}, and address several of the challenges that arise when parallelizing notebooks.

+
+
 \subsection{Parallelism}
 To assess the potential of parallelizing Jupyter notebooks, we conducted a preliminary survey on an archive of Jupyter notebooks scraped from GitHub by Pimentel et al.~\cite{DBLP:journals/ese/PimentelMBF21}.
 Our survey included only notebooks using a Python kernel and known to execute successfully; a total of nearly 6000 notebooks met these criteria.
@@ -25,6 +27,17 @@ We constructed a dataflow graph as described in \Cref{sec:import}; as a proxy me
 \Cref{fig:parallelismSurvey} presents the depth (the maximum number of cells that must be executed serially) in relation to the total number of Python cells in the notebook.
 The average notebook has over 16 cells but an average dependency depth of just under 4, so on average around 4 cells can run concurrently.

+
+The main contributions of this work are:
+
+\begin{itemize}
+\item \textbf{Approximate Provenance for Python using Static Analysis}: We introduce an approach that captures data provenance (data dependencies) using static analysis of the Python code of a notebook's cells. This approach is approximate in that it allows for both false positives and false negatives. Through this approximation, however, we circumvent the issue that a conservative static dataflow analysis for Python would face: to account for the dynamic features of Python, it would have to produce a very coarse-grained over-approximation of the data dependencies.
+\item \textbf{A Scheduler for Parallel Execution of Workbook Cells}: We present a scheduler for incremental execution of workbooks, i.e., workflows whose modules are the cells of a Python computational notebook. The scheduler uses approximate data dependencies, determined by our static analysis of the workbook's Python code, to enable parallel execution of cells that are independent in terms of dataflow. To compensate for false positives (predicted data dependencies that do not materialize at runtime), the scheduler dynamically adapts the execution plan during the execution of a cell. To deal with false negatives (data dependencies that materialize at runtime but were not predicted by static analysis), the scheduler can roll back cells whose execution was invalid, i.e., cells that should have observed the changes made by another cell.
+\item \textbf{Jupyter Import for Workbook Systems based on Provenance}: Since cells in a workbook are isolated from each other and communicate only explicitly through dataflow, a direct import of Jupyter notebooks into Vizier (or any other workbook system) cannot preserve the cell structure of the notebook. A naive approach would be to merge all cells into a single cell to circumvent the problem caused by global state that is shared among cells. However, the structure of the notebook and the intermediate results that are shown inline in the Jupyter notebook would then be lost in the imported workbook. Instead, we use our approximate provenance approach for Python to determine data dependencies between cells. Any object that is shared across cells is then wrapped as a data artifact in Vizier, i.e., we make the dataflow across cells explicit. This allows us to preserve the structure of the notebook: there is a one-to-one correspondence between the cells of the notebook and the cells of the generated workbook. While technically this is not safe if there are false negatives in terms of data dependencies, the types of Python features that lead to false negatives are rarely used in notebooks. Furthermore, our approach is safe in the sense that
+\item \textbf{Implementation in Vizier and Experiments}: We have implemented a prototype of the proposed scheduler (except for the compensation for false negatives) in our workbook system Vizier. We evaluated the potential for parallel execution by importing real-world Jupyter notebooks into Vizier using the proposed techniques and comparing their serial and parallel execution.
+\end{itemize}
+
+
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: "../main"