%!TEX root=../main.tex
As a proof of concept, we integrated the simple, incremental provenance-aware parallel scheduler described in \Cref{sec:scheduler} into the Vizier notebook~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data}\footnote{
We note that although Vizier is the only ICE notebook we are aware of, it is presently optimized for use with SQL rather than Python.
}.
Parallelizing cell execution requires an ICE architecture, which comes at the cost of increased communication overhead relative to monolithic kernel notebooks.
In this section, we assess that cost.
All experiments were run on a XX GHz, XX core Intel Xeon with XX GB RAM running XX Linux\OK{Boris, Nachiket, can you fill this in?}.
The provenance-aware scheduler was integrated into Vizier 1.2\footnote{\url{https://github.com/VizierDB/vizier-scala}}; our experiments use a lightly modified version with support for importing Jupyter notebooks and the related \texttt{-X PARALLEL-PYTHON} experimental option.
As Vizier relies on Apache Spark, we prefix all notebooks under test with a single reader cell and a single writer cell to force initialization of, e.g., Spark's HDFS module. These cells are not included in timing results.
\begin{figure*}
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_parallel.png}
\caption{Parallel Execution}
\label{fig:gantt:parallel}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
\caption{Monolithic Kernel Execution}
\label{fig:gantt:monolith}
\end{subfigure}
\caption{Workload traces for a synthetic reader/writer workload}
\label{fig:gantt}
\end{figure*}
\tinysection{Overview}
As a preliminary overview, we run a synthetic workload consisting of one writer cell that randomly generates a 100k-row pandas dataframe with two integer columns\OK{Nachiket: Confirm this please} and exports it, and 10 reader cells that load the dataset and perform a compute-intensive task: computing pairwise distances over a 10k-row subset of the source dataset.
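For illustration, the two cell types can be sketched as follows. This is a minimal stand-in, not the actual benchmark code: the column names, random seed, and value range are our own choices, plain Python variables stand in for Vizier's dataset export/import, and we shrink the subset to 1k rows to keep the sketch cheap.

```python
import numpy as np
import pandas as pd

# Writer cell: generate a 100k-row dataframe with two integer columns.
# (Column names and value range are illustrative.)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 1_000, size=100_000),
    "b": rng.integers(0, 1_000, size=100_000),
})
# In Vizier, this cell would export `df` through the dataset API;
# a plain variable stands in for that here.

# Reader cell (one of 10): load the dataset and perform a
# compute-intensive task -- pairwise Euclidean distances over a subset.
# (The workload uses a 10k-row subset; 1k here keeps the sketch cheap.)
subset = df.head(1_000).to_numpy(dtype=float)
diff = subset[:, None, :] - subset[None, :, :]   # shape (n, n, 2)
dists = np.sqrt((diff ** 2).sum(axis=-1))        # shape (n, n)
```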
\Cref{fig:gantt} shows example execution traces for the workload in Vizier with its classical (serial) scheduler, Vizier with its new (parallel) scheduler, and Jupyter.
The experiments show an overhead of XX\OK{Fill in}s as Python exports the data, and a XXs overhead from loading the data back in.
We observe several opportunities for potential improvement:
First, in the serial trace, the first access to the dataset is 2s more expensive than the remaining lookups, as Vizier loads the dataset and prepares to host it through the Arrow protocol. We expect that such startup costs can be mitigated, for example by having the Python kernel continue hosting the dataset itself while the monitor process loads the data.
We also note that this overhead grows to almost 10s in the parallel case. In addition to startup costs, this may be the result of contention; even when executing cells in parallel, it may be beneficial to stagger cell starts.
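A staggered start could be sketched as follows; this helper and its \texttt{delay} parameter are purely our own illustration of the idea, not part of Vizier's scheduler.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_staggered(cells, max_workers=4, delay=0.05):
    """Run independent cells concurrently, offsetting each submission
    by `delay` seconds to reduce contention on the shared dataset host."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for cell in cells:
            futures.append(pool.submit(cell))
            time.sleep(delay)  # stagger the next cell's start
        return [f.result() for f in futures]

# Example: four trivial "cells", each just squaring its index.
results = run_staggered([lambda i=i: i * i for i in range(4)], delay=0.01)
```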
\tinysection{Scaling}
In this benchmark, we measure ICE overheads relative to data size.
Specifically, we measure the cost of:
(i) exporting a dataset,
(ii) importing a `cold' dataset, and
(iii) importing a `hot' dataset.
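A measurement harness for these three steps might look like the following sketch. File-based pickling of a stand-in dataset is our own simplification; it is not Vizier's actual export/import path, and the cold/hot distinction here only marks which read comes first.

```python
import os
import pickle
import tempfile
import time

def timed(step):
    """Return (result, elapsed seconds) for one benchmark step."""
    start = time.perf_counter()
    result = step()
    return result, time.perf_counter() - start

def dump(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

data = list(range(100_000))  # stand-in dataset
path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")

# (i) export: the writer serializes the dataset.
_, t_export = timed(lambda: dump(data, path))
# (ii) cold import: the first reader pays any startup/hosting cost.
loaded, t_cold = timed(lambda: load(path))
# (iii) hot import: subsequent readers re-read the now-hosted dataset.
_, t_hot = timed(lambda: load(path))
```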
Results are shown in XXXXX.