%!TEX root=../main.tex
Workflow systems \OK{cite several systems: Vistrails, etc...} help users break complex tasks, such as ETL processes and model fitting, into a series of smaller steps.
Users explicitly declare dependencies between steps, permitting parallel execution of mutually independent steps.
A recent trend in industry has been to instead encode such tasks through computational notebooks like Jupyter or Zeppelin.
Notebooks likewise allow users to declare tasks as a series of steps, but do not require the user to explicitly declare dependencies.
Consequently, notebook execution frameworks like \OK{reference a few, e.g., Netflix's} simply execute the steps of the workflow sequentially (i.e., without parallelism).

\begin{figure}
FIGURE
\caption{The number of Python cells in notebooks scraped from GitHub~\cite{pimentel}, plotted against the number of sequential steps required.}
\label{fig:parallelismSurvey}
\end{figure}

To assess the potential for improvement, we conducted a preliminary survey of an archive of Jupyter notebooks scraped from GitHub by Pimentel et al.~\cite{pimentel}.
Our survey included only notebooks that use a Python kernel and are known to execute successfully; a total of 800\OK{fill in the exact number} notebooks met these criteria.
We used Python's \texttt{ast} module to construct an inter-cell dataflow graph (e.g., using the methodology of \OK{citations}).
As a proxy measure for potential speedup, we considered the depth of this graph relative to the total number of Python cells in the notebook.
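% Illustrative formalization added for exposition: the symbols G, V, E, d(G) and the unit-cost assumption are assumptions of this sketch, not results of the survey.
To see why depth is a reasonable proxy, consider the following sketch, in which the notation and the unit-cost assumption are introduced purely for illustration.
Let $G = (V, E)$ denote the inter-cell dataflow graph, with one vertex per Python cell and an edge $(u, v) \in E$ whenever cell $v$ reads a value written by cell $u$, and let $d(G)$ be the number of cells on the longest path in $G$.
If every cell takes one time step and the number of workers is unlimited, sequential execution takes $|V|$ steps, whereas any parallel schedule needs at least $d(G)$ steps, so
\[
  \mathrm{speedup}(G) \;\le\; \frac{|V|}{d(G)}.
\]
For example, a four-cell notebook in which cells 2 and 3 each read only from cell 1, and cell 4 reads from both, has $d(G) = 3$, for a bound of $4/3$; a notebook whose cells form a single chain has $d(G) = |V|$ and admits no speedup under this measure.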
\Cref{fig:parallelismSurvey} relates these measures in a XXX.
Although XXX percent of the notebooks do require sequential execution, as many as XXX percent can XXX.

In this paper, we present \systemname, a workflow system designed to facilitate parallel execution of Jupyter notebooks.