%!TEX root=../main.tex
As a proof of concept, we implemented a simple, provenance-aware parallel scheduler (\Cref{sec:scheduler}) within the Vizier notebook~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data}.
Parallelizing cell execution requires an ICE architecture, which comes at the cost of increased communication overhead relative to monolithic kernel notebooks.
In this section, we assess that cost.
\tinysection{Implementation}
The parallel scheduler was integrated into Vizier 1.2\footnote{\url{https://github.com/VizierDB/vizier-scala}}; our experiments use a lightly modified version of this release that adds support for Jupyter notebooks and the experimental \texttt{-X PARALLEL-PYTHON} option.
We additionally added a pooling feature to mitigate Python's high startup cost (600ms to multiple seconds): the modified Vizier pre-launches a small pool of Python instances and keeps them running in the background.
Our current implementation selects kernels from the pool arbitrarily.
In future work, we plan to allow kernels to cache artifacts, and prioritize the use of kernels that have already loaded artifacts we expect the cell to read.
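The pooling logic itself is simple; the following Python sketch illustrates the idea. The pool size and the \texttt{worker} module launched for each kernel are illustrative assumptions, not Vizier's actual implementation.
\begin{verbatim}
import queue, subprocess, sys

class KernelPool:
    """Keep pre-launched Python workers on standby so no cell
    pays the interpreter startup cost on the critical path."""
    def __init__(self, size=4):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(self._launch())

    def _launch(self):
        # Hypothetical worker: reads cell code on stdin, runs it.
        return subprocess.Popen(
            [sys.executable, "-m", "worker"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def acquire(self):
        # Current policy: any idle kernel will do; a future version
        # could prefer kernels that already cache needed artifacts.
        kernel = self._idle.get()
        self._idle.put(self._launch())  # keep the pool topped up
        return kernel
\end{verbatim}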
\tinysection{Experiments}
All experiments were run on a XX GHz, XX core Intel Xeon with XX GB RAM running XX Linux\OK{Boris, Nachiket, can you fill this in?}.
As Vizier relies on Apache Spark, we prefix all notebooks under test with a single reader and writer cell to force initialization of components such as Spark's HDFS module; these cells are not included in the timing results.
\begin{figure*}[t]
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
\vspace*{-3mm}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_parallel.png}
\vspace*{-3mm}
\caption{Parallel Execution}
\label{fig:gantt:parallel}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
\vspace*{-3mm}
\caption{Monolithic Kernel Execution}
\label{fig:gantt:monolithic}
\end{subfigure}
\vspace*{-3mm}
\caption{Execution traces for a synthetic reader/writer workload}
\label{fig:gantt}
\trimfigurespacing
\end{figure*}
\tinysection{Overview}
As a preliminary overview, we run a synthetic workload consisting of one writer cell that randomly generates a 100k-row, 2-integer-column\OK{Nachiket: Confirm this please} Pandas dataframe and exports it, and 10 reader cells that load the dataset and perform a compute-intensive task: computing pairwise distances over a 10k-row subset of the source dataset.
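For concreteness, a stand-alone approximation of the two cell bodies is shown below; the Parquet file stands in for Vizier's inter-cell artifact exchange, and the use of \texttt{scipy} for the distance computation is our assumption about the task kernel.
\begin{verbatim}
import numpy as np, pandas as pd
from scipy.spatial.distance import pdist

# Writer cell: 100k rows, two integer columns, exported once.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.integers(0, 1000, 100_000),
                   "b": rng.integers(0, 1000, 100_000)})
df.to_parquet("source.parquet")   # stand-in for the export step

# Each of the 10 reader cells: load the dataset, then compute
# pairwise distances over a 10k-row subset (compute-intensive).
subset = pd.read_parquet("source.parquet").head(10_000)
distances = pdist(subset[["a", "b"]].to_numpy())
\end{verbatim}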
\Cref{fig:gantt} shows example execution traces for the workload in Vizier with its classical (serial) scheduler, Vizier with its new (parallel) scheduler, and Jupyter.
The experiments show an overhead of XX\OK{Fill in}s as Python exports the data, and an overhead of XXs from loading the data back in.
We observe several opportunities for improvement:
First, in the serial case, the first access to the dataset is 2s more expensive than the remaining lookups, as Vizier loads the dataset and prepares to host it through the Arrow protocol. We expect that such startup costs can be mitigated, for example by having the Python kernel continue to host the dataset itself while the monitor process is loading the data.
We also note that this overhead grows to almost 10s in the parallel case. In addition to startup costs, this may be the result of contention; even when executing cells in parallel, it may be beneficial to stagger cell starts.
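To make the export and import costs discussed above concrete, the following sketch round-trips a dataframe through Arrow's IPC file format; this is our approximation of the exchange between a Python kernel and the monitor process, not Vizier's exact transport.
\begin{verbatim}
import time
import numpy as np, pandas as pd, pyarrow as pa

df = pd.DataFrame({"a": np.arange(100_000),
                   "b": np.arange(100_000)})

# Export: the producing kernel serializes the dataframe.
start = time.perf_counter()
table = pa.Table.from_pandas(df)
with pa.OSFile("artifact.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
print("export:", time.perf_counter() - start)

# Import: a consuming kernel memory-maps the file and
# rebuilds the dataframe.
start = time.perf_counter()
with pa.memory_map("artifact.arrow", "r") as source:
    loaded = pa.ipc.open_file(source).read_all().to_pandas()
print("import:", time.perf_counter() - start)
\end{verbatim}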
\tinysection{Scaling}
In this benchmark, we measure ICE overheads relative to data size.
Specifically, we measure the cost of:
(i) exporting a dataset,
(ii) importing a `cold' dataset, and
(iii) importing a `hot' dataset,
as sketched below.
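A schematic version of the measurement loop follows; the artifact exchange is again approximated with Arrow IPC files, the data sizes are illustrative, and treating a `hot' import as one whose Arrow table is already resident in the monitor's cache is our reading of the setup.
\begin{verbatim}
import time
import numpy as np, pandas as pd, pyarrow as pa

def timed(f):
    start = time.perf_counter()
    f()
    return time.perf_counter() - start

cache = {}                  # stand-in for the monitor's cache
for rows in [10_000, 100_000, 1_000_000]:
    df = pd.DataFrame({"a": np.arange(rows), "b": np.arange(rows)})
    path = f"artifact_{rows}.arrow"

    def export():           # (i) serialize to an Arrow IPC file
        table = pa.Table.from_pandas(df)
        with pa.OSFile(path, "wb") as sink:
            with pa.ipc.new_file(sink, table.schema) as writer:
                writer.write_table(table)

    def cold_import():      # (ii) read and materialize from scratch
        with pa.memory_map(path, "r") as source:
            cache[path] = pa.ipc.open_file(source).read_all()
        cache[path].to_pandas()

    def hot_import():       # (iii) table already resident; convert only
        cache[path].to_pandas()

    print(rows, timed(export), timed(cold_import), timed(hot_import))
\end{verbatim}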
Results are shown in XXXXX.