Trimmed abstract

Oliver Kennedy 2022-03-29 11:53:08 -04:00
parent 1f9fdfecad
commit 557d497a25
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60
3 changed files with 34 additions and 2 deletions


@@ -263,3 +263,10 @@ Applications Using Provenance},
volume = {14},
year = {2020}
}
% Optional fields: subtitle, titleaddon, language, howpublished, type, version, note, organization, location, date, month, year, addendum, pubstate, doi, eprint, eprintclass, eprinttype, url, urldate
@misc{papermill,
author = {Netflix},
title = {Papermill},
howpublished = {\url{https://github.com/nteract/papermill}}
}


@@ -46,7 +46,7 @@
\newtheorem{example}{Example}
\newtheorem{definition}{Definition}
\newcommand{\systemname}{Vizier\xspace}
\newcommand{\systemname}{Workbook\xspace}
\newcommand{\TheTitle}{Coarse-Grained Dataflow Provenance}


@@ -1,4 +1,29 @@
Computational notebooks, as implemented in systems like Jupyter or Apache Zeppelin, have become a popular choice for Data Science, ETL, and data preparation / cleaning tasks. The main advantage of notebooks is that they interleave code with results and documentation, and provide users with immediate feedback on changes by structuring workflows into cells that the user can execute on demand. In spite of these advantages, existing notebook solutions suffer from poor reproducibility, do not aid users in incrementally refining their workflows during development (there is no automatic refresh of dependent results when a cell in the notebook is updated), and lack support for automatically parallelizing the execution of a notebook's cells. In this work we argue that a different computational model, in which the cells of a notebook are steps in a workflow that do not share a common (hidden) state but instead communicate explicitly through data artifacts, can overcome these shortcomings. However, compared to traditional workflow systems where the dataflow is known upfront, in our model the dataflow dependencies among cells may only be known at runtime. This allows for flexibility (e.g., conditional execution of some code can result in new data dependencies) and relieves the user from having to specify dataflow explicitly for the whole workflow (notebook), but creates new challenges for scheduling the execution of all or part of a notebook (when an automatic refresh of cell results is triggered because a user has modified the notebook). In this work, we present a provenance-based approach for tracking dataflow dependencies across cells containing Python code, and a scheduler implementation for Vizier, our notebook system that implements the dataflow-workflow model for notebooks. Our approach relies on static analysis to estimate the data dependencies of Python cells. This approach is best-effort: it may both over-approximate (static analysis must make worst-case assumptions about control flow in a program) and under-approximate (Python's dynamic nature means that any guaranteed over-approximation would be too coarse-grained) data dependencies. We compensate by extending the scheduler to adapt the schedule dynamically when new data dependencies are discovered at runtime, or when predicted data dependencies do not materialize during execution. Another benefit of this technique is that it enables importing existing Jupyter notebooks into our model. Using real-world Jupyter notebooks, we use our implementation of these techniques in Vizier to demonstrate that (i) Jupyter notebooks exhibit potential for parallelization; (ii) automatic refresh can be limited to parts of a notebook; and (iii) Jupyter notebooks can be successfully translated into our model.
%!TEX root=../main.tex
Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for Data Exploration, Preparation, and ETL.
Indeed, a single notebook may evolve through all three phases, with users exploring a dataset to identify problems, repairing those problems, and then deploying the resulting notebook to a system like Papermill for batch processing on related datasets.
Notebooks are more user-friendly for ETL than the classical state of the art, workflow systems, which require users to explicitly target batch processing and to manually specify inputs and outputs.
However, the notebook model suffers from poor reproducibility, does not support automatic incremental re-evaluation when inputs change, and forces cells to execute in serial order --- all symptoms of its kernel-based evaluation strategy.
In this paper, we propose a new ``workbook'' execution model that combines the usability of notebooks with the provenance capabilities of workflow systems.
We address key challenges in the workbook model, including tracking information flow, static analysis of cell dependencies, scheduling in the presence of ambiguous dependencies, and importing existing Jupyter notebooks into the workbook model.
We also discuss the implementation of the workbook model within our existing notebook engine \textsc{Vizier}, and evaluate the resulting system.
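
To make the dependency-estimation idea concrete, here is a minimal sketch (hypothetical, not Vizier's actual implementation) of the kind of static analysis the abstract refers to, using Python's ast module to over-approximate which global names a cell reads and writes:

import ast

def estimate_dependencies(cell_source):
    # Over-approximate the global names a cell reads and writes.
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    # Names read but never written here must come from elsewhere: another
    # cell, an import, or a builtin (the latter two would need filtering
    # in practice). A name both read and written in one cell (x = x + 1)
    # is misclassified as purely local -- exactly the kind of approximation
    # error the dynamic scheduler must detect and correct at runtime.
    return reads - writes, writes

# For the cell "clean = df.dropna()" this yields ({'df'}, {'clean'}).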
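And a sketch of how explicit per-cell read/write sets enable the selective refresh described above (names and structure are illustrative assumptions, not Vizier's API): only cells that transitively read an edited cell's outputs need to rerun.

def cells_to_refresh(edited, reads, writes):
    # reads/writes map each cell id to the set of artifact names it
    # consumes/produces; returns the set of cells that must rerun.
    stale, frontier = {edited}, [edited]
    while frontier:
        cell = frontier.pop()
        for other, ins in reads.items():
            # `other` is stale if it reads anything a stale cell writes.
            if other not in stale and ins & writes.get(cell, set()):
                stale.add(other)
                frontier.append(other)
    return stale

reads  = {"load": set(), "clean": {"df"}, "plot": {"clean_df"}}
writes = {"load": {"df"}, "clean": {"clean_df"}, "plot": set()}
print(cells_to_refresh("clean", reads, writes))  # {'clean', 'plot'}; 'load' is untouched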
%%% Local Variables: