Update on Overleaf.

master
Oliver Kennedy 2022-05-04 19:02:03 +00:00 committed by node
parent 3171ffdba0
commit 9c7308f895
3 changed files with 20 additions and 51 deletions

View File

@@ -174,41 +174,3 @@
\bibliography{main}
\end{document}
--- For Camera-Ready ---
- Spend more time emphasizing:
- The "First run" problem
- Dynamic provenance changes with each update
- Static buys you better scheduling decisions
- Better explain "hot" vs "cold" cache (e.g., spark loading data into memory)
- Explain that the choice of spark is due to Vizier having already chosen it. The main thing we need it for is Arrow and scheduling.
- Space permitting, maybe spend a bit more time contrasting microkernel with jupyter "hacks" like Nodebook.
- Add some text emphasizing the point that even though Jupyter is not intended for batch ETL processing, that is how a lot of people use it (e.g., cite Netflix, Stitch Fix?). (And yes, we're aware that this is bad practice.)
- Around the point where we describe that Vizier involves explicit dependencies, also point out that we describe how to provide a Jupyter-like experience on top of this model later in the paper. "Keep the mental model"
- Typos:
- " Not that Vizier"
- Add more future work
- Direct output to Arrow instead of via Parquet (see the sketch after this list).
- Add copyright text
- Check for and remove Type 3 fonts if any exist.
- Make sure fonts are embedded (should be default for LaTeX)
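A possible shape for the Arrow-direct item above, as a minimal sketch: it assumes pyarrow, and the table contents and file names are placeholders.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder dataset standing in for a cell's output.
table = pa.table({"key": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Current path: round-trip through Parquet before serving via Arrow.
pq.write_table(table, "dataset.parquet")
roundtrip = pq.read_table("dataset.parquet")

# Proposed path: write Arrow IPC directly and skip the Parquet round-trip.
with pa.OSFile("dataset.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
```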
--- For Next Paper ---
- Use Git history to recover the dependency graph (see the sketch after this list)
- e.g., figure out how much dynamic provenance changes for a single cell over a series of edits.
- Static vs Dynamic provenance: How different are they?
- e.g., how often do you need to "repair"
- How much further away from serial does dynamic get you?
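A sketch of how the Git-history study could begin, assuming plain `git` on the PATH; `repo` and `path` are hypothetical arguments:

```python
import subprocess

def notebook_versions(repo, path):
    """Yield (commit, contents) for each version of a file, oldest first."""
    revs = subprocess.run(
        ["git", "-C", repo, "rev-list", "--reverse", "HEAD", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for rev in revs:
        # Contents of `path` as of commit `rev`.
        contents = subprocess.run(
            ["git", "-C", repo, "show", f"{rev}:{path}"],
            capture_output=True, text=True, check=True,
        ).stdout
        yield rev, contents
```

Running a static provenance extractor over each yielded version would quantify how much a single cell's dependencies drift across a series of edits.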

View File

@@ -2,15 +2,15 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[t]
\newcommand{\plotminusvspace}{-5mm}
\newcommand{\plotminusvspace}{-3mm}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=\columnwidth,trim=0 10 0 0]{graphics/gantt_serial.pdf}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_serial.pdf}
\vspace*{\plotminusvspace}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=\columnwidth,trim=0 10 0 0]{graphics/gantt_parallel.pdf}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_parallel.pdf}
\vspace*{\plotminusvspace}
\caption{Parallel Execution}
\label{fig:gantt:parallel}
@@ -23,20 +23,20 @@
% \end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{0.3\linewidth}
\vspace*{-29mm}
\includegraphics[width=\columnwidth,trim=0 10 0 0]{graphics/scalability-read.pdf}
\vspace*{-26mm}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/scalability-read.pdf}
\vspace*{\plotminusvspace}
\caption{Scalability - Read}\label{fig:scalability}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace*{-4mm}
\vspace*{-5mm}
\caption{Workload traces for a synthetic reader/writer workload}
\label{fig:gantt}
\trimfigurespacing
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As a proof of concept, we implemented the static analysis approach described in \Cref{sec:import} as a simple, provenance-aware parallel scheduler (\Cref{sec:scheduler}) within the Vizier notebook system~\cite{brachmann:2020:cidr:your}.
As a proof of concept, we implemented the static analysis approach from \Cref{sec:import} as a simple, provenance-aware parallel scheduler (\Cref{sec:scheduler}) within the Vizier notebook system~\cite{brachmann:2020:cidr:your}.
Parallelizing cell execution requires an ICE architecture, which comes at the cost of increased communication overhead relative to monolithic kernel notebooks.
% In this section, we assess that cost.
@@ -62,11 +62,17 @@ The experiments show that the parallel execution is $\sim 4$ times faster than
% First, the serial first access to the dataset is 2s more expensive than the remaining lookups as Vizier loads and prepares to host the dataset through the Arrow protocol. We expect that such startup costs can be mitigated, for example by having the Python kernel continue hosting the dataset itself while the monitor process is loading the data.
% We also note that this overhead grows to almost 10s in the parallel case. In addition to startup costs,
This is possibly the result of contention on the dataset. % Even when executing cells in parallel execution, it may be beneficial to stagger cell starts.
Nonetheless, this preliminary result together with the results of the analysis shown in \Cref{fig:parallelismSurvey} demonstrates the potential for parallel execution of notebooks. % parallel execution even in its preliminary implementation already reduces runtime from 80 to 20 seconds for this test case.
Nonetheless, this preliminary result and the analysis shown in \Cref{fig:parallelismSurvey} demonstrate the potential for parallel execution of notebooks. % parallel execution even in its preliminary implementation already reduces runtime from 80 to 20 seconds for this test case.
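At its core, the scheduler amounts to topological scheduling over the statically derived dependency graph. A minimal sketch of the idea, not Vizier's actual implementation; `execute` stands in for dispatching a cell to its isolated interpreter:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_notebook(cells, deps, execute):
    """Run cells in parallel while honoring inter-cell dependencies.

    cells:   cell ids in notebook order
    deps:    dict mapping each cell id to the set of cell ids it reads from
    execute: runs one cell (e.g., in its own isolated interpreter)
    """
    pending, running, finished = set(cells), {}, set()
    with ThreadPoolExecutor() as pool:
        while pending or running:
            # Submit every cell whose upstream cells have all finished;
            # independent cells run concurrently.
            ready = [c for c in pending if deps[c] <= finished]
            if not ready and not running:
                raise ValueError("cyclic or unsatisfiable dependencies")
            for c in ready:
                pending.discard(c)
                running[pool.submit(execute, c)] = c
            # Block until at least one running cell completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for f in done:
                finished.add(running.pop(f))
                f.result()  # surface execution errors
```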
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Scaling}
In this benchmark, we specifically measure ICE cost of reading data relative to data size. We ran the experiment with cold and hot cache. \Cref{fig:scalability} shows the results of this experiment. Not that Vizier scales linearly for larger dataset sizes.
We specifically measure the ICE cost of reading data relative to data size, running the experiment with both a cold and a hot cache. \Cref{fig:scalability} shows the results of this experiment. Note that Vizier scales linearly for larger dataset sizes.
% We specifically measure the cost of:
% (i) exporting a dataset,
% (ii) importing a 'cold' dataset, and
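The cold/hot distinction is simply whether the first access pays the load cost (e.g., Spark pulling the data into memory). A sketch of the measurement methodology under that assumption; `read_dataset` is a placeholder for a Vizier dataset read:

```python
import time

def cold_vs_hot(read_dataset, hot_trials=5):
    """Time the first (cold) read against subsequent (hot) reads."""
    start = time.perf_counter()
    read_dataset()                       # cold: data loaded from storage
    cold = time.perf_counter() - start
    hot = []
    for _ in range(hot_trials):          # hot: data already resident in memory
        start = time.perf_counter()
        read_dataset()
        hot.append(time.perf_counter() - start)
    return cold, min(hot)
```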

View File

@@ -15,9 +15,10 @@ However, the hidden state of the Python interpreter creates hard-to-understand a
brachmann:2020:cidr:your}\footnote{https://www.youtube.com/watch?v=7jiPeIFXb6U}.
In addition to requiring error-prone manual management of state versions, the notebook execution model also precludes parallelism --- cells must be evaluated serially because their inter-dependencies are not known upfront.
Dynamically collected provenance~\cite{pimentel-17-n} can, in principle, address the former problem, but is less useful for the latter because provenance dependencies only become known at runtime.
Conversely, static dataflow analysis~\cite{NN99} addresses both concerns, but the dynamic nature of Python implies that such dependencies can only be extracted approximately --- conservatively, limiting opportunities for parallelism; or optimistically, risking execution errors from missed inter-cell dependencies.
We bridge this gap by proposing a hybrid approach that we call \emph{Incremental Runtime Provenance Refinement} that statically approximates provenance which is then incrementally refined at runtime.
Dynamically collected provenance~\cite{pimentel-17-n} can address the former problem, but is of limited use for scheduling since dependencies are only learned after running a cell.
By the time we learn that two cells can be safely executed concurrently, they have already completed.
Static dataflow analysis~\cite{NN99} addresses both needs, but the dynamic nature of Python forces approximation of dependencies --- conservatively, limiting opportunities for parallelism; or optimistically, risking execution errors from missed inter-cell dependencies.
We bridge the static/dynamic gap by proposing a hybrid approach: \emph{Incremental Runtime Provenance Refinement}, where statically approximated provenance is incrementally refined at runtime.
We demonstrate how to utilize provenance refinement to enable (i) parallel execution and (ii) partial re-evaluation of notebooks.
This requires a fundamental shift away from execution in a single monolithic kernel towards isolated per-cell interpreters.
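To illustrate the static half, here is a deliberately simplified sketch built on Python's `ast` module (Vizier's actual analysis is more sophisticated): over-approximate each cell's reads and writes of global names, then link each read to its latest upstream writer.

```python
import ast

def cell_provenance(code):
    """Over-approximate the global names a cell reads and writes."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # Store or Del
                writes.add(node.id)
    return reads, writes

def dependency_graph(cells):
    """deps[i] = set of upstream cells whose writes cell i may read."""
    last_writer, deps = {}, {}
    for i, code in enumerate(cells):
        reads, writes = cell_provenance(code)
        deps[i] = {n for n in (last_writer.get(r) for r in reads) if n is not None}
        for w in writes:
            last_writer[w] = i
    return deps
```

Constructs like `exec`, `getattr`, and aliasing are exactly what keeps such analysis approximate, and why the approximation must be refined at runtime.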
@@ -31,7 +32,7 @@ Additionally, this enables Jupyter notebooks to be imported into an ICE which re
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=1\columnwidth,trim=0 15 0 0]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\includegraphics[width=0.9\columnwidth,trim=0 8 0 0]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\vspace*{-3mm}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}.}
\label{fig:parallelismSurvey}