abstract

2022-03-20 13:59:22 -05:00 · 2022-03-20 13:59:22 -05:00 · 2ba0c50c9c
parent f67ac21da9
commit 2ba0c50c9c
5 changed files with 22 additions and 2 deletions
--- a/main.tex
+++ b/main.tex
@@ -24,6 +24,7 @@

 \newcommand{\OK}[1]{\todo[backgroundcolor=blue!25]{\tiny \textbf{Oliver says:} #1}}
 \newcommand{\BG}[1]{\todo[backgroundcolor=red!25]{\tiny \textbf{Boris says:} #1}}
+\newcommand{\BGI}[1]{\todo[backgroundcolor=red!25,inline]{\textbf{Boris says:} #1}}
 \newcommand{\ND}[1]{\todo[backgroundcolor=green!25]{\tiny \textbf{Nachiket says:} #1}}

 \definecolor{PineGreen}{HTML}{007B62}
--- a/sections/abstract.tex
+++ b/sections/abstract.tex
@@ -1 +1,7 @@
-ABSTRACT
+Computational notebooks, as implemented in systems like Jupyter or Apache Zeppelin, have become a popular choice for data science, ETL, and data preparation and cleaning tasks. The main advantage of notebooks is that they interleave code with results and documentation and give users immediate feedback on changes by structuring workflows into cells that the user can execute on demand. In spite of these advantages, existing notebook solutions suffer from poor reproducibility, do not aid users in incrementally refining their workflows during development because dependent results are not refreshed automatically when a cell in the notebook is updated, and cannot automatically parallelize the execution of cells in a notebook. In this work, we argue that a different computational model, in which the cells of a notebook are steps in a workflow that do not share a common (hidden) state but instead communicate explicitly through data artifacts, can overcome these shortcomings. However, in contrast to traditional workflow systems, where the dataflow is known upfront, dataflow dependencies among cells in our model may depend on runtime behavior. This allows for flexibility (e.g., the conditional execution of some code can result in new data dependencies) and relieves the user from having to specify the dataflow explicitly for the whole workflow (notebook), but it creates new challenges for scheduling the execution of a whole notebook or of parts of a notebook (when an automatic refresh of cell results is triggered because a user has modified the notebook). We present a provenance-based approach for tracking dataflow dependencies across cells containing Python code and a scheduler implementation for Vizier, our notebook system that implements the dataflow-workflow model for notebooks. Our approach relies on static analysis to estimate the data dependencies of Python cells. 
This approach is best-effort and may both over-approximate data dependencies (static analysis has to make worst-case assumptions about control flow in a program) and under-approximate them (because of the dynamic nature of Python, guaranteeing an over-approximation would be too coarse-grained). We compensate for this by extending the scheduler to adapt the schedule dynamically when new data dependencies are discovered at runtime or predicted data dependencies do not materialize during execution. Another benefit of this technique is that it enables importing existing Jupyter notebooks into our model. Using real-world Jupyter notebooks and our implementation of these techniques in Vizier, we demonstrate that (i) Jupyter notebooks exhibit potential for parallelization; (ii) automatic refresh can be limited to parts of a notebook; and (iii) Jupyter notebooks can be successfully translated into our model.
+
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "../main"
+%%% End:
--- a/sections/acknowledgements.tex
+++ b/sections/acknowledgements.tex
@@ -1 +1,7 @@
 ACKS
+
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "../main"
+%%% End:
--- a/sections/conclusions.tex
+++ b/sections/conclusions.tex
@@ -0,0 +1,6 @@
+We introduce a provenance-based approach for predicting and tracking dependencies across Python cells in a computational notebook, together with an implementation of this approach in Vizier, a data-centric notebook system where cells are isolated from each other and communicate through data artifacts. By combining best-effort static analysis with an adaptable runtime schedule for notebook cell execution, we achieve (i) parallel execution of Python cells, (ii) automatic refresh of dependent cells when the notebook is modified, and (iii) translation of Jupyter notebooks into our model.
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "../main"
+%%% End:
--- a/sections/related.tex
+++ b/sections/related.tex
@@ -1,5 +1,6 @@
 \begin{itemize}
-  \item Provenance for jupyter or python
+\item Provenance for Jupyter or Python
+
  \item Dataflow analysis / program slicing
  \item Textbook Transactions / Optimistic concurrency control
 \end{itemize}
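
Reviewer note: the abstract describes estimating data dependencies of Python cells by static analysis and using them for scheduling. The sketch below is one rough way to illustrate that idea and is not Vizier's implementation. It assumes that top-level variable names stand in for data artifacts, and the helper names `cell_reads_writes` and `estimate_dependencies` are hypothetical. It uses only Python's standard `ast` module.

```python
import ast


def cell_reads_writes(source):
    """Collect the names a cell loads and the names it stores.

    Coarse by design: it ignores execution order inside the cell, and it
    only sees ast.Name nodes (so names bound by def/import are missed).
    """
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                reads.add(node.id)
    return reads, writes


def estimate_dependencies(cells):
    """For each cell, return the indices of earlier cells it may depend on.

    A cell depends on the most recent earlier cell that stored a name it
    loads. Names never stored by an earlier cell (e.g. builtins) are ignored.
    """
    last_writer = {}  # name -> index of the latest cell that stored it
    deps = []
    for idx, source in enumerate(cells):
        reads, writes = cell_reads_writes(source)
        deps.append({last_writer[n] for n in reads if n in last_writer})
        for name in writes:
            last_writer[name] = idx
    return deps


if __name__ == "__main__":
    notebook = [
        "raw = [3, 1, 2]",
        "clean = sorted(raw)",
        "labels = ['a', 'b']",
        "report = (clean, labels)",
    ]
    for idx, dep in enumerate(estimate_dependencies(notebook)):
        print(idx, sorted(dep))
```

In this toy notebook, cells 1 and 2 have no dependency on each other, so a scheduler could run them in parallel, and editing cell 2 would only require refreshing cell 3. The estimate can miss dependencies: a cell that defines a name through `exec('z = 1')` stores nothing the analysis can see, which is the kind of case the abstract's runtime schedule adaptation is meant to handle.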