%!TEX root=../main.tex
Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for data exploration, preparation, and ETL.
% Indeed, a single notebook may evolve through all three phases, with
% Users typically first explore a dataset to identify problems, then repair the problems, finally deploy their pipeline to a system like Papermill for batch processing on related datasets.
Notebooks are better suited for the interactive development of data pipelines than classical workflow systems, because they provide immediate feedback on the results of a computation and do not require the full computation to be specified upfront.
However, the notebook model suffers from poor reproducibility, does not support automatic incremental re-evaluation of code when inputs change, and does not allow for parallel execution of cells --- all symptoms of its kernel-based evaluation strategy.
We propose a new \emph{``workbook''} model that combines the usability of notebooks with the provenance and parallel execution capabilities of workflow systems. This is made possible by a novel approach that refines a static approximation of the provenance of Python code at runtime, combined with a scheduler that dynamically adapts the execution order of cells as data dependencies are confirmed or refuted during execution.
%Additionally, this enables translation of Jupyter notebooks into workbooks.
% We address key challenges in the workbook model, including information flow, static analysis, scheduling in the presence of ambiguous dependencies, and importing Jupyter notebooks into the workbook model.
We demonstrate the feasibility of this approach using a prototype implementation in our notebook engine \textsc{Vizier}.
% do not aide users in the incrementally refining their workflows during development since there is no automatic refresh of dependent results when a cell in the notebook is updated, and lack the capability for automatic parallelizing of the execution of cells in a notebook. In this work we argue that a different computational model where the cells of the notebooks are steps in a workflow and do not share a common (hidden) state, but are communicating explicitly through data artifacts, can overcome these shortcomings. However, compared to traditional workflow systems where the dataflow is known upfront, in our model dataflow dependencies among cells may be runtime dependent. This allows for flexibility (e.g., the conditional execution of some code can results in new data dependencies) and relieves the user from having to specify data flow explicitly for the whole workflow (notebook), but results in new challenges for scheduling the execution of the whole notebooks or parts of the notebook (when automatic refresh of cell results is triggered, because a user has modified the notebook). In this work, we present a provenance-based approach for tracking dataflow dependencies across cells containing Python code and a scheduler implementation for Vizier, our notebook system that implements the dataflow-workflow model for notebooks. Our approach relies on static analysis to determine an estimation of the data dependencies of Python cells. This approach is best-effort and may both over- (static analysis has to make worst-case assumptions about control flow in a program) and under-approximate (because of the dynamic nature of Python guaranteeing an over-approximation that is too coarse-grained) data dependencies. We compensate for that by extending scheduling to adapt the schedule dynamically when new data dependencies are discovered at runtime or predicted data dependencies do not materialize during execution. Another benefit for this technique is that it enables importing of existing Jupyter notebooks into our model. Using some real world Jupyter notebooks, we demonstrate using our implementation of these techniques in Vizier that (i) Jupyter notebooks exhibit potential for parallelization; (ii) using our techniques, automatic refresh can be limited to parts of a notebook; (iii) Jupyter notebooks can be successfully translated into our model.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: