Boris Glavic 2022-05-06 09:56:31 -05:00
commit 336afe5fb7
6 changed files with 31 additions and 77 deletions

View File

@ -1,28 +1,9 @@
--- For Camera-Ready ---
- Spend more time emphasizing:
- The "First run" problem
- Dynamic provenance changes with each update
- Static buys you better scheduling decisions
- Better explain "hot" vs "cold" cache (e.g., spark loading data into memory)
- Explain that the choice of spark is due to Vizier having already chosen it. The main thing we need it for is Arrow and scheduling.
- Space permitting, maybe spend a bit more time contrasting microkernel with jupyter "hacks" like Nodebook.
- Add some text emphasizing the point that even though Jupyter is not intended for batch ETL processing, that is how a lot of people use it (e.g., cite Netflix, Stitch Fix?). (And yes, we're aware that this is bad practice.)
- Around the point where we describe that Vizier involves explicit dependencies, also point out that we describe how to provide a Jupyter-like experience on top of this model later in the paper. "Keep the mental model"
- Typos:
- " Not that Vizier"
- Add more future work
- Direct output to Arrow instead of via parquet.
- Add copyright text
- Check for and remove Type 3 fonts if any exist.
- Make sure fonts are embedded (should be default for LaTeX)

View File

@ -174,41 +174,3 @@
\bibliography{main}
\end{document}
--- For Camera-Ready ---
- Spend more time emphasizing:
- The "First run" problem
- Dynamic provenance changes with each update
- Static buys you better scheduling decisions
- Better explain "hot" vs "cold" cache (e.g., spark loading data into memory)
- Explain that the choice of spark is due to Vizier having already chosen it. The main thing we need it for is Arrow and scheduling.
- Space permitting, maybe spend a bit more time contrasting microkernel with jupyter "hacks" like Nodebook.
- Add some text emphasizing the point that even though Jupyter is not intended for batch ETL processing, that is how a lot of people use it (e.g., cite Netflix, Stitch Fix?). (And yes, we're aware that this is bad practice.)
- Around the point where we describe that Vizier involves explicit dependencies, also point out that we describe how to provide a Jupyter-like experience on top of this model later in the paper. "Keep the mental model"
- Typos:
- " Not that Vizier"
- Add more future work
- Direct output to Arrow instead of via parquet.
- Add copyright text
- Check for and remove Type 3 fonts if any exist.
- Make sure fonts are embedded (should be default for LaTeX)
--- For Next Paper ---
- Use Git history to recover the dependency graph
- e.g., figure out how much dynamic provenance changes for a single cell over a series of edits.
- Static vs Dynamic provenance: How different are they?
- e.g., how often do you need to "repair"
- How much further away from serial does dynamic get you?

View File

@ -1,7 +1,13 @@
%!TEX root=../main.tex
We introduce an approach for incrementally refining a static approximation of provenance for computational notebooks at runtime and implement a scheduler for ICE-architecture notebooks based on this approach.
We introduced an approach for incrementally refining a static approximation of provenance for computational notebooks and implemented a scheduler for ICE-architecture notebooks based on this approach.
Our method enables (i) parallel cell execution; (ii) automatic refresh of dependent cells after modifications; and (iii) import of Jupyter notebooks.
We note several opportunities for improvement on our initial proof-of-concept, most notably to reduce blocking due to state transfer between kernels.
While our proof-of-concept shows promise, significant additional work is needed to reduce state transfer between kernels.
For example, kernels can be re-used or forked to minimize state transfer.
Alternatively, responsibility for hosting state can be moved from the coordinator directly to the kernel that created the state.
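To make the forking idea concrete, consider the following sketch (a simplification we include for illustration, not Vizier's actual kernel management): a warm parent process loads expensive state once and runs each cell in a fork that inherits that state copy-on-write.
\begin{verbatim}
# A minimal sketch of fork-based kernel reuse (illustrative only,
# not Vizier's implementation). A warm parent loads state once; each
# cell runs in a fork that inherits it copy-on-write (POSIX-only).
import os

state = {"df": list(range(1_000_000))}  # loaded once, never serialized

def run_in_fork(cell):
    pid = os.fork()
    if pid == 0:           # child "kernel": sees state with no transfer
        cell(state)
        os._exit(0)
    os.waitpid(pid, 0)     # parent waits for the cell to finish

run_in_fork(lambda s: print(sum(s["df"])))
\end{verbatim}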
Significant further opportunities exist for research on ICE Notebooks.
\tinysection{Acknowledgements} The authors would like to thank Breadcrumb Analytics for their work on Vizier, and acknowledge support from the NSF under awards IIS-1956149 and ACI-1640864.
% For example, an advanced scheduler could dynamically trade off between the overheads of inter-kernel communication, and the benefits of parallel execution.

View File

@ -2,15 +2,15 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[t]
\newcommand{\plotminusvspace}{-5mm}
\newcommand{\plotminusvspace}{-3mm}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=\columnwidth,trim=0 10 0 0]{graphics/gantt_serial.pdf}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_serial.pdf}
\vspace*{\plotminusvspace}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=\columnwidth,trim=0 10 0 0]{graphics/gantt_parallel.pdf}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_parallel.pdf}
\vspace*{\plotminusvspace}
\caption{Parallel Execution}
\label{fig:gantt:parallel}
@ -23,20 +23,20 @@
% \end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{0.3\linewidth}
\vspace*{-29mm}
\includegraphics[width=\columnwidth,trim=0 10 0 0]{graphics/scalability-read.pdf}
\vspace*{-26mm}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/scalability-read.pdf}
\vspace*{\plotminusvspace}
\caption{Scalability - Read}\label{fig:scalability}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace*{-4mm}
\vspace*{-5mm}
\caption{Workload traces for a synthetic reader/writer workload}
\label{fig:gantt}
\trimfigurespacing
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As a proof of concept, we implemented the static analysis approach described in \Cref{sec:import} as a simple, provenance-aware parallel scheduler (\Cref{sec:scheduler}) within the Vizier notebook system~\cite{brachmann:2020:cidr:your}.
As a proof of concept, we implemented the static analysis approach from \Cref{sec:import} as a simple, provenance-aware parallel scheduler (\Cref{sec:scheduler}) within the Vizier notebook system~\cite{brachmann:2020:cidr:your}.
Parallelizing cell execution requires an ICE architecture, which comes at the cost of increased communication overhead relative to monolithic kernel notebooks.
% In this section, we assess that cost.
@ -50,7 +50,8 @@ In future work, we plan to allow kernels to cache artifacts, and prioritize the
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Experiments}
All experiments were run on Ubuntu 20.04 on a server with 2 x AMD Opteron 4238 CPUs (3.3GHz), 128GB of RAM, and 4 x 1TB 7.2k RPM HDDs in hardware RAID 5. As Vizier relies on Apache Spark, we prefix all notebooks under test with a single reader and writer cell to force initialization of, e.g., Spark's HDFS module. These are not included in timing results.
All experiments were run on Ubuntu 20.04 on a server with 2 x AMD Opteron 4238 CPUs (3.3GHz), 128GB of RAM, and 4 x 1TB 7.2k RPM HDDs in hardware RAID 5. Vizier's internal dataframe representation relies on Apache Arrow and Spark.
To mitigate Spark's high start-up costs, we prefix all notebooks under test with a single reader and writer cell to force initialization of, e.g., Spark's HDFS module. These cells are not included in timing results.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -62,11 +63,13 @@ The experiments show that parallel execution is $\sim 4$ times faster than
% First, the serial first access to the dataset is 2s more expensive than the remaining lookups as Vizier loads and prepares to host the dataset through the Arrow protocol. We expect that such startup costs can be mitigated, for example by having the Python kernel continue hosting the dataset itself while the monitor process is loading the data.
% We also note that this overhead grows to almost 10s in the parallel case. In addition to startup-costs,
This is possibly the result of contention on the dataset. % Even when executing cells in parallel execution, it may be beneficial to stagger cell starts.
Nonetheless, this preliminary result together with the results of the analysis shown in \Cref{fig:parallelismSurvey} demonstrates the potential for parallel execution of notebooks. % parallel execution even in its preliminary implementation already reduces runtime from 80 to 20 seconds for this test case.
Nonetheless, this preliminary result and the analysis shown in \Cref{fig:parallelismSurvey} demonstrate the potential for parallel execution of notebooks. % parallel execution even in its preliminary implementation already reduces runtime from 80 to 20 seconds for this test case.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Scaling}
In this benchmark, we specifically measure the ICE-induced cost of reading data relative to data size. We ran the experiment with cold and hot caches. \Cref{fig:scalability} shows the results of this experiment. Note that Vizier scales linearly for larger dataset sizes.
\Cref{fig:scalability} shows the overhead of loading data into a new kernel, measured by creating a Python cell that iterates over all records in the pandas dataframe (sketched below).
The 'cold cache' cost also includes the one-time cost of loading a Parquet dataset and exporting it via Arrow.
% We specifically measure the cost of:
% (i) exporting a dataset,
% (ii) importing a 'cold' dataset, and
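For reference, the reader cell has roughly the following shape (our reconstruction for illustration; the dataset path and exact benchmark code are assumptions):
\begin{verbatim}
# A rough reconstruction of the benchmark's reader cell (illustrative;
# file name and exact code are assumptions). On a cold cache, reading
# also pays the one-time Parquet load and Arrow export noted above.
import pandas as pd

df = pd.read_parquet("data.parquet")    # hypothetical dataset path
count = 0
for row in df.itertuples(index=False):  # iterate over all records
    count += 1
print(count, "records read")
\end{verbatim}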

View File

@ -9,15 +9,16 @@ Notebook users express tasks as a sequence of code `cells' that describe each st
Users manually trigger cell execution, dispatching the code in the cell to a long-lived Python interpreter called a kernel.
Cells share data through the state of the interpreter, e.g., a global variable created in one cell can be read by another cell.
Relieving users of manual dependency management, along with the ability to provide immediate feedback, makes notebooks more suitable for iteratively developing a data processing pipeline.
However, the hidden state of the Python interpreter creates hard-to-understand and hard-to-reproduce bugs, like users forgetting to rerun dependent cells~\cite{DBLP:journals/ese/PimentelMBF21,
Companies like Netflix\footnote{https://netflixtechblog.com/scheduling-notebooks-348e6c14cfd6} are increasingly turning to notebooks for batch data processing (e.g., ETL or ML) thanks to the rapid, interactive notebook development cycle.
However, the hidden state of the Python interpreter creates hard-to-understand and hard-to-reproduce bugs, like notebooks that only work if executed out-of-order~\cite{DBLP:journals/ese/PimentelMBF21,
% joelgrus,
brachmann:2020:cidr:your}\footnote{https://www.youtube.com/watch?v=7jiPeIFXb6U}.
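To illustrate the hazard (a self-contained simulation we add here, not an example from the cited study), consider how state left behind by a deleted cell keeps later cells working:
\begin{verbatim}
# A self-contained simulation of the hidden-state hazard (our
# illustration, not from the cited study). Cells share one namespace,
# so variables from a deleted cell linger until the kernel restarts.
kernel_state = {}
exec("x = 42", kernel_state)         # user runs cell 1
exec("print(x + 1)", kernel_state)   # cell 2 works: prints 43
# The user now deletes cell 1; the kernel state is unchanged, so
# re-running cell 2 still succeeds -- but a fresh top-to-bottom
# execution of the notebook would fail with a NameError.
exec("print(x + 1)", kernel_state)   # still prints 43
\end{verbatim}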
In addition to requiring error-prone manual management of state versions, the notebook execution model also precludes parallelism --- cells must be evaluated serially because their inter-dependencies are not known upfront.
Furthermore, the notebook execution model also precludes parallelism --- cells must be evaluated serially because their inter-dependencies are not known upfront.
Dynamically collected provenance~\cite{pimentel-17-n} can, in principle, address the former problem, but is less useful for the latter because provenance dependencies only become known at runtime.
Conversely, static dataflow analysis~\cite{NN99} addresses both concerns, but the dynamic nature of Python implies that such dependencies can only be extracted approximately --- conservatively, limiting opportunities for parallelism; or optimistically, risking execution errors from missed inter-cell dependencies.
We bridge this gap by proposing a hybrid approach that we call \emph{Incremental Runtime Provenance Refinement} that statically approximates provenance which is then incrementally refined at runtime.
Dynamically collected provenance~\cite{pimentel-17-n} can address the former problem, but is of limited use for scheduling since dependencies are only learned after running a cell.
By the time we learn that two cells can be safely executed concurrently, they have already finished.
Static dataflow analysis~\cite{NN99} addresses both needs, but the dynamic nature of Python forces approximation of dependencies --- conservatively, limiting opportunities for parallelism; or optimistically, risking execution errors from missed inter-cell dependencies.
We bridge the static/dynamic gap by proposing a hybrid approach: \emph{Incremental Runtime Provenance Refinement}, where statically approximated provenance is incrementally refined at runtime.
We demonstrate how to utilize provenance refinement to enable (i) parallel execution and (ii) partial re-evaluation of notebooks.
This requires a fundamental shift away from execution in a single monolithic kernel towards isolated per-cell interpreters.
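To make the static half concrete (a simplified sketch we provide for illustration; Vizier's analysis handles far more of Python), inter-cell dependencies can be approximated from each cell's AST by treating assigned names as writes and loaded names as reads:
\begin{verbatim}
# A simplified sketch of static dependency approximation (illustrative,
# not Vizier's analysis). Assigned names are writes, loaded names are
# reads; dynamic constructs (eval, getattr, ...) escape this analysis,
# which is exactly what runtime refinement must repair.
import ast

def approximate_rw(cell_source):
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
    return reads, writes

cells = ["df = load_data()", "clean = prep(df)", "plot(clean)"]
last_writer = {}                 # name -> index of last writing cell
for i, src in enumerate(cells):
    reads, writes = approximate_rw(src)
    deps = {last_writer[r] for r in reads if r in last_writer}
    print("cell", i, "depends on cells", sorted(deps))
    last_writer.update({w: i for w in writes})
\end{verbatim}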
@ -31,7 +32,7 @@ Additionally, this enables Jupyter notebooks to be imported into an ICE which re
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=1\columnwidth,trim=0 15 0 0]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\includegraphics[width=0.9\columnwidth,trim=0 8 0 0]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\vspace*{-3mm}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}.}
\label{fig:parallelismSurvey}

View File

@ -1,7 +1,8 @@
%!TEX root=../main.tex
An isolated cell execution (ICE) notebook isolates cells by executing each in a fresh kernel.
In this section, we review key differences between the ICE model used in systems like Vizier~\cite{brachmann:2020:cidr:your} or Nodebook~\cite{nodebook}, and the monolithic approach of Jupyter.
Before discussing how notebooks written under the assumption of a monolithic kernel can be mapped to the ICE model in \Cref{sec:import}, we first review the key differences between the monolithic approach and systems like Vizier~\cite{brachmann:2020:cidr:your} or Nodebook~\cite{nodebook}.
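The contrast can be made concrete with a small sketch (our simplification; in Vizier, the store is managed by a coordinator and cells run in separate processes): each cell executes against a fresh namespace and exchanges state only through explicitly declared artifacts.
\begin{verbatim}
# A minimal sketch of the ICE idea (illustrative, not Vizier's
# implementation). Each cell gets a fresh namespace; state moves only
# through explicitly declared read/write artifacts.
artifact_store = {}   # held by the coordinator in a real ICE

def run_cell(source, reads, writes):
    ns = {name: artifact_store[name] for name in reads}  # fresh kernel
    exec(source, ns)
    for name in writes:              # publish declared outputs only
        artifact_store[name] = ns[name]

run_cell("df = list(range(10))", reads=[], writes=["df"])
run_cell("total = sum(df)", reads=["df"], writes=["total"])
print(artifact_store["total"])       # prints 45
\end{verbatim}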
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Communication Model}