abstract
parent e6de4169c1
commit 8b30a85a7e
@@ -1,12 +1,13 @@
 %!TEX root=../main.tex
 
-Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for Data Exploration, Preparation, and ETL.
-Indeed, a single notebook may evolve through all three phases, with users exploring a dataset to identify problems, repairing the problems, and then deploying them to a system like Papermill for batch processing on related datasets.
-Notebooks more user-friendly for ETL than the classical state of the art, workflow systems, which require users to explicitly target batch processing and manually specify inputs and outputs.
-However, the notebook model suffers from poor reproducibility, do not automatically support incremental re-evaluation when inputs change, and must be executed in serial order --- all symptoms of its kernel-based evaluation strategy.
-In this paper, we propose a new a new ``workbook'' execution model that retains the usability of notebooks, and the provenance capabilities of workflow systems.
-We address key challenges in the workbook model, including information flow, static analysis, scheduling in the presence of ambiguous dependencies, and importing Jupyter notebooks into the workbook model.
-We also discuss the implementation of the workbook model within our existing notebook engine \textsc{Vizier}, and evaluate the resulting implementation.
+Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for data exploration, preparation, and ETL.
+% Indeed, a single notebook may evolve through all three phases, with
+% Users typically first explore a dataset to identify problems, then repair the problems, finally deploy their pipeline to a system like Papermill for batch processing on related datasets.
+Notebooks are more user-friendly for ETL than classical workflow systems, because they provide immediate feedback for intermediate results and do not require the full computation upfront to be specified upfront. % the user to specify including the inputs and outputs of each step.
+However, the notebook model suffers from poor reproducibility, does not support automatic incremental re-evaluation of code when inputs change, and does not allow for parallel execution of cells --- all symptoms of its kernel-based evaluation strategy.
+We propose a new \emph{``workbook''} execution model that combines the usability of notebooks with the provenance and parallel execution capabilities of workflow systems. This is made possible through a novel approach that refines a static approximation of provenance at runtime and a scheduler that dynamically adapts the execution order of cells based on data dependencies detected during refinement. Additionally, this enables translation of Jupyter notebooks into workbooks.
+% We address key challenges in the workbook model, including information flow, static analysis, scheduling in the presence of ambiguous dependencies, and importing Jupyter notebooks into the workbook model.
+We implement this model in our notebook engine \textsc{Vizier}, and evaluate the resulting implementation.
 
 
 % We show h