More trimming. Down to a bit under 4pp (with enough space for another figure)

This commit is contained in:
Oliver Kennedy 2022-04-01 00:03:42 -04:00
parent 067df20e7f
commit b145c64bf4
Signed by: okennedy
GPG key ID: 3E5F9B3ABD3FDB60
5 changed files with 24 additions and 30 deletions


@ -11,12 +11,9 @@ Furthermore, conservative static analysis must recursively descend into librarie
import urllib.request as r
with r.urlopen('http://someweb/code.py') as response:
    eval( response.read() )
\end{minted}
\begin{minted}{python}
b = d * 2 if a > 10 else e * 2
\end{minted}
\vspace*{-3mm}
\caption{Example Python code}\label{fig:example-python-code}
\trimfigurespacing
\end{figure}
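Dependencies of cells like those in the example figure can be approximated statically. A minimal sketch using Python's standard `ast` module (the function name and the exact treatment of `Store`/`Del` contexts are illustrative, not Vizier's actual analysis):

```python
import ast

def reads_and_writes(code: str):
    """Approximate the variables a cell reads and writes by walking its AST."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)    # variable is read
            else:
                writes.add(node.id)   # Store or Del context: variable is written
    return reads, writes

# For the conditional-expression cell above:
# reads_and_writes("b = d * 2 if a > 10 else e * 2")
# -> ({'a', 'd', 'e'}, {'b'})
```

Note that this over-approximates control flow (both branches of the conditional contribute reads), and, as the surrounding text argues, no such static analysis can see through dynamic constructs like `eval` over downloaded code.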


@ -1,19 +1,14 @@
%!TEX root=../main.tex
As a proof of concept, we implemented a simple, incremental provenance-aware parallel scheduler as described in \Cref{sec:scheduler}, into the Vizier notebook~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data}\footnote{
We note that although Vizier is the only ICE notebook we are aware of, it is presently optimized for use with SQL rather than Python.
}.
As a proof of concept, we implemented a simple, provenance-aware parallel scheduler (\Cref{sec:scheduler}) within the Vizier notebook~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data}.
Parallelizing cell execution requires an ICE architecture, which comes at the cost of increased communication overhead relative to monolithic kernel notebooks.
In this section, we assess that cost.
\tinysection{Implementation}
The parallel scheduler was integrated into Vizier 1.2\footnote{\url{https://github.com/VizierDB/vizier-scala}} --- our experiments use a lightly modified version with support for importing Jupyter notebooks, and the related \texttt{-X PARALLEL-PYTHON} experimental option.
The parallel scheduler was integrated into Vizier 1.2\footnote{\url{https://github.com/VizierDB/vizier-scala}} --- our experiments use this version, lightly modified to support importing Jupyter notebooks, along with the related \texttt{-X PARALLEL-PYTHON} experimental option.
We additionally added a pooling feature to mitigate Python's high startup cost (600ms to multiple seconds): the modified Vizier pre-launches a small pool of Python instances and keeps them running in the background.
When a cell begins executing, its code is shipped to an already running kernel, which executes the code and returns the response to the monitor process.
In our current implementation, the kernel chosen to run a piece of code is selected arbitrarily.
We note an opportunity for future work: artifacts created during cell execution or imported from the global state can be cached in the executing kernel.
If we know which artifacts a cell will use, we can prioritize kernels that have already loaded these artifacts.
Our current implementation selects kernels from the pool arbitrarily.
In future work, we plan to allow kernels to cache artifacts, and prioritize the use of kernels that have already loaded artifacts we expect the cell to read.
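The pooling and dispatch scheme described above can be sketched as follows. This is a minimal illustration, not Vizier's actual (Scala-based) implementation; the names `KernelPool` and `_kernel_worker` are hypothetical, and kernel selection is arbitrary, mirroring the current implementation:

```python
import multiprocessing as mp

def _kernel_worker(conn):
    """A minimal Python 'kernel': receives code strings over a pipe, executes
    them in a persistent namespace, and reports status back to the monitor."""
    scope = {}
    while True:
        code = conn.recv()
        if code is None:          # shutdown sentinel
            break
        try:
            exec(code, scope)     # state persists across cells in this kernel
            conn.send(("ok", None))
        except Exception as e:
            conn.send(("error", repr(e)))

class KernelPool:
    """Pre-launch a small pool of kernels to hide interpreter startup cost."""
    def __init__(self, size=4):
        self.kernels = []
        for _ in range(size):
            parent_end, child_end = mp.Pipe()
            proc = mp.Process(target=_kernel_worker, args=(child_end,), daemon=True)
            proc.start()
            self.kernels.append((proc, parent_end))

    def run_cell(self, code):
        # Arbitrary selection, as in the current implementation; a smarter
        # policy would prefer kernels that already cache the cell's inputs.
        _, conn = self.kernels[0]
        conn.send(code)
        return conn.recv()

    def shutdown(self):
        for proc, conn in self.kernels:
            conn.send(None)
            proc.join()
```

Because a kernel's namespace persists between cells, routing a cell to a kernel that has already loaded its input artifacts would avoid re-importing them --- the caching opportunity noted above.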
\tinysection{Experiments}
All experiments were run on a XX GHz, XX core Intel Xeon with XX GB RAM running XX Linux\OK{Boris, Nachiket, can you fill this in?}.
@ -22,26 +17,30 @@ As Vizier relies on Apache Spark, we prefix all notebooks under test with a sing
\begin{figure*}[t]
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
\vspace*{-3mm}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_parallel.png}
\vspace*{-3mm}
\label{fig:gantt:parallel}
\caption{Parallel Execution}
\end{subfigure}
\begin{subfigure}[b]{.32\textwidth}
\includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
\vspace*{-3mm}
\label{fig:gantt:monolithic}
\caption{Monolithic Kernel Execution}
\end{subfigure}
\vspace*{-3mm}
\caption{Workload traces for a synthetic reader/writer workload}
\label{fig:gantt}
\trimfigurespacing
\end{figure*}
\tinysection{Overview}
As a preliminary overview, we run a synthetic workload consisting of one cell that randomly generates a 100k-row, 2-integer column\OK{Nachiket: Confirm this please} pandas dataframe and exports it, and 10 reader cells that lead the dataset and perform a compute intensive task: Computing pairwise distance for a 10k-row subset of the source dataset.
As a preliminary overview, we run a synthetic workload consisting of one cell that randomly generates a 100k-row, 2-integer-column\OK{Nachiket: Confirm this please} Pandas dataframe and exports it, and 10 reader cells that load the dataset and perform a compute-intensive task: computing pairwise distances for a 10k-row subset of the source dataset.
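The writer and reader cells of this synthetic workload can be sketched roughly as below. Function names and parameters are illustrative; the experiment itself uses 100k source rows and a 10k-row subset (a 10k-by-10k distance matrix), while the demo call uses a small subset:

```python
import numpy as np
import pandas as pd

def make_source(n_rows=100_000, seed=0):
    # Writer cell: a dataframe with two random integer columns, exported
    # as an inter-cell artifact.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({"x": rng.integers(0, 1000, n_rows),
                         "y": rng.integers(0, 1000, n_rows)})

def reader_cell(df, subset=10_000):
    # Reader cell: load the artifact and compute all pairwise distances over
    # a subset -- the compute-intensive part, O(subset^2) in time and space.
    pts = df.head(subset)[["x", "y"]].to_numpy(dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]   # broadcast to (n, n, 2)
    return np.sqrt((diff ** 2).sum(axis=-1))

df = make_source()
dists = reader_cell(df, subset=100)   # small subset for illustration
```

Because the 10 reader cells depend only on the single writer cell, they are mutually independent and can, in principle, all run in parallel once the dataframe artifact is exported.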
\Cref{fig:gantt} shows example execution traces for the workload in Vizier with its classical (serial) scheduler, Vizier with its new (parallel) scheduler, and Jupyter.
The experiments show an overhead of XX\OK{Fill in}s as Python exports the data, and a XXs overhead from loading the data back in.
We observe several opportunities for improvement:


@ -16,11 +16,10 @@ In the latter code block, when \texttt{bar} is declared, it `captures' the scope
\begin{figure}
\begin{center}
\centering
\begin{subfigure}{0.45\columnwidth}
\begin{minted}{python}
def foo():
    print(a)
def foo(): print(a)
a = 1
foo() # Prints '1'
def bar():
@ -32,8 +31,7 @@ bar() # Prints '1'
\begin{subfigure}{0.5\columnwidth}
\begin{minted}{python}
def foo():
    def bar():
        print(a)
    def bar(): print(a)
    a = 2
    return bar
bar = foo()
@ -42,10 +40,10 @@ a = 1
bar() # Prints '2'
\end{minted}
\end{subfigure}
\vspace*{-3mm}
\caption{Scope capture in Python happens at function definition, but captured scopes remain mutable.}
\label{fig:scoping}
\trimfigurespacing
\end{center}
\end{figure}
Second, the fine-grained dataflow graph defined above is reduced to a simplified \emph{coarse-grained} dataflow graph by (i) merging the nodes for the statements of each cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
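The three-step reduction can be sketched as follows (a minimal illustration; edge labels stand in for the artifact names carried on dataflow edges):

```python
def coarsen(fine_edges, stmt_to_cell):
    """Reduce a fine-grained (statement-level) dataflow graph to a
    coarse-grained (cell-level) one: (i) merge each cell's statement nodes,
    (ii) drop self-edges, (iii) drop parallel edges with identical labels."""
    coarse = set()
    for src, dst, label in fine_edges:
        u, v = stmt_to_cell[src], stmt_to_cell[dst]   # (i) map statements to cells
        if u != v:                                    # (ii) skip intra-cell edges
            coarse.add((u, v, label))                 # (iii) the set dedups parallels
    return coarse

# Two statements in cell c1 both feed cell c2 through artifact 'a':
# after coarsening, only the single edge (c1, c2, 'a') remains.
fine = [("s1", "s3", "a"), ("s2", "s3", "a"), ("s1", "s2", "b"), ("s3", "s4", "c")]
cells = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c2"}
```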


@ -26,11 +26,12 @@ We then show generality by discussing the process for importing Jupyter notebook
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=0.8\columnwidth]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}: On average, notebooks can run with a parallelism factor of 4.}
\label{fig:parallelismSurvey}
\trimfigurespacing
% \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
\includegraphics[width=0.8\columnwidth]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
\vspace*{-3mm}
\caption{Notebook size versus workflow depth in a collection of notebooks scraped from GitHub~\cite{DBLP:journals/ese/PimentelMBF21}.}
\label{fig:parallelismSurvey}
\trimfigurespacing
\end{figure}
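The parallelism factor plotted here is the notebook's cell count divided by the depth of its dependency DAG (the critical path under ideal parallelism). A sketch of that computation, assuming a hypothetical `deps` map from each cell to its upstream cells:

```python
from functools import lru_cache

def parallelism_factor(cells, deps):
    """Cell count divided by dependency-DAG depth: the average parallelism
    available if every cell runs as soon as its upstream cells finish."""
    @lru_cache(maxsize=None)
    def depth(cell):
        # Longest chain of dependencies ending at `cell`, measured in cells.
        return 1 + max((depth(up) for up in deps.get(cell, ())), default=0)
    return len(cells) / max(depth(c) for c in cells)

# One writer cell feeding 7 independent readers: 8 cells, depth 2, factor 4.0.
cells = ["w"] + [f"r{i}" for i in range(7)]
deps = {f"r{i}": ["w"] for i in range(7)}
```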
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


@ -1,8 +1,7 @@
%!TEX root=../main.tex
Provenance for workflow systems has been studied extensively for several decades (e.g., see \cite{DC07} for a survey). However, workflow systems expect data dependencies to be specified explicitly as part of the workflow specification and, thus, such provenance techniques are not applicable to our problem setting. More closely related to our work are provenance techniques for programming languages and static analysis techniques from the programming languages community~\cite{NN99}.
Workflow provenance has been studied extensively (e.g., see \cite{DC07} for a survey), but reliance on explicit dependencies limits its utility in our setting. More closely related are provenance and static analysis techniques from the programming languages community~\cite{NN99}.
Pimentel et al.~\cite{pimentel-19-scmanpfs} provide an overview of research on provenance for scripting (programming) languages and identify the need for, and challenges of, fine-grained provenance in this context.
noWorkflow~\cite{pimentel-17-n, DBLP:conf/tapp/PimentelBMF15} collects several types of provenance for Python scripts, including environment information as well as static and dynamic data- and control-flow.
\cite{DBLP:conf/tapp/PimentelBMF15} extends noWorkflow to Jupyter notebooks and is closely related to our work, but produces provenance only for analysis and debugging, not for scheduling.
\cite{macke-21-fglsnin} combines static and dynamic dataflow analysis to track dataflow dependencies during cell execution and warn users of ``unsafe'' interactions where a cell is reading an outdated version of a variable. By contrast, our approach automatically refreshes dependent cells.