Boris Glavic 2022-03-31 21:42:28 -05:00
parent e6de4169c1
commit 8b30a85a7e

@@ -1,12 +1,13 @@
%!TEX root=../main.tex
Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for Data Exploration, Preparation, and ETL.
Indeed, a single notebook may evolve through all three phases, with users exploring a dataset to identify problems, repairing the problems, and then deploying them to a system like Papermill for batch processing on related datasets.
Notebooks more user-friendly for ETL than the classical state of the art, workflow systems, which require users to explicitly target batch processing and manually specify inputs and outputs.
However, the notebook model suffers from poor reproducibility, do not automatically support incremental re-evaluation when inputs change, and must be executed in serial order --- all symptoms of its kernel-based evaluation strategy.
In this paper, we propose a new a new ``workbook'' execution model that retains the usability of notebooks, and the provenance capabilities of workflow systems.
We address key challenges in the workbook model, including information flow, static analysis, scheduling in the presence of ambiguous dependencies, and importing Jupyter notebooks into the workbook model.
We also discuss the implementation of the workbook model within our existing notebook engine \textsc{Vizier}, and evaluate the resulting implementation.
Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for data exploration, preparation, and ETL.
% Indeed, a single notebook may evolve through all three phases, with
% Users typically first explore a dataset to identify problems, then repair the problems, finally deploy their pipeline to a system like Papermill for batch processing on related datasets.
Notebooks are more user-friendly for ETL than classical workflow systems, because they provide immediate feedback for intermediate results and do not require the full computation to be specified upfront. % the user to specify including the inputs and outputs of each step.
However, the notebook model suffers from poor reproducibility, does not support automatic incremental re-evaluation of code when inputs change, and does not allow for parallel execution of cells --- all symptoms of its kernel-based evaluation strategy.
We propose a new \emph{``workbook''} execution model that combines the usability of notebooks with the provenance and parallel execution capabilities of workflow systems. This is made possible through a novel approach that refines a static approximation of provenance at runtime and a scheduler that dynamically adapts the execution order of cells based on data dependencies detected during refinement. Additionally, this enables translation of Jupyter notebooks into workbooks.
% We address key challenges in the workbook model, including information flow, static analysis, scheduling in the presence of ambiguous dependencies, and importing Jupyter notebooks into the workbook model.
We implement this model in our notebook engine \textsc{Vizier}, and evaluate the resulting implementation.
% We show h
@@ -16,7 +17,7 @@ We also discuss the implementation of the workbook model within our existing not
% that is largely compat
% that retains the user-friendliness of python
% that retains the user-friendliness of python