diff --git a/sections/abstract.tex b/sections/abstract.tex
index 50b3e7e..50d5bd5 100644
--- a/sections/abstract.tex
+++ b/sections/abstract.tex
@@ -1,12 +1,13 @@
 %!TEX root=../main.tex
-Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for Data Exploration, Preparation, and ETL.
-Indeed, a single notebook may evolve through all three phases, with users exploring a dataset to identify problems, repairing the problems, and then deploying them to a system like Papermill for batch processing on related datasets.
-Notebooks more user-friendly for ETL than the classical state of the art, workflow systems, which require users to explicitly target batch processing and manually specify inputs and outputs.
-However, the notebook model suffers from poor reproducibility, do not automatically support incremental re-evaluation when inputs change, and must be executed in serial order --- all symptoms of its kernel-based evaluation strategy.
-In this paper, we propose a new a new ``workbook'' execution model that retains the usability of notebooks, and the provenance capabilities of workflow systems.
-We address key challenges in the workbook model, including information flow, static analysis, scheduling in the presence of ambiguous dependencies, and importing Jupyter notebooks into the workbook model.
-We also discuss the implementation of the workbook model within our existing notebook engine \textsc{Vizier}, and evaluate the resulting implementation.
+Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for data exploration, preparation, and ETL.
+% Indeed, a single notebook may evolve through all three phases, with
+% Users typically first explore a dataset to identify problems, then repair the problems, finally deploy their pipeline to a system like Papermill for batch processing on related datasets.
+Notebooks are more user-friendly for ETL than classical workflow systems, because they provide immediate feedback for intermediate results and do not require the full computation to be specified upfront. % the user to specify including the inputs and outputs of each step.
+However, the notebook model suffers from poor reproducibility, does not support automatic incremental re-evaluation of code when inputs change, and does not allow for parallel execution of cells --- all symptoms of its kernel-based evaluation strategy.
+We propose a new \emph{``workbook''} execution model that combines the usability of notebooks with the provenance and parallel execution capabilities of workflow systems. This is made possible through a novel approach that refines a static approximation of provenance at runtime and a scheduler that dynamically adapts the execution order of cells based on data dependencies detected during refinement. Additionally, this enables translation of Jupyter notebooks into workbooks.
+% We address key challenges in the workbook model, including information flow, static analysis, scheduling in the presence of ambiguous dependencies, and importing Jupyter notebooks into the workbook model.
+We implement this model in our notebook engine \textsc{Vizier}, and evaluate the resulting implementation.
 % We show h
@@ -16,7 +17,7 @@ We also discuss the implementation of the workbook model within our existing not
 % that is largely compat
-% that retains the user-friendliness of python
+% that retains the user-friendliness of python