experiments

This commit is contained in:
Boris Glavic 2022-04-01 10:31:05 -05:00
parent a9cfd91335
commit e0160dfd6b
3 changed files with 49 additions and 19 deletions

@@ -1,40 +1,55 @@
%!TEX root=../main.tex
Conservative static analysis for Python typically leads to a very coarse-grained over-approximation of a program's real data dependencies, or has to allow for false negatives (missed dependencies).
To see why this is the case, consider the code snippet in \Cref{fig:dynamic-code-evaluation-i}, which retrieves a piece of Python code from the web and then evaluates it.
Such code presents challenges for a conservative approach, because the dynamically evaluated code can create data dependencies between everything in global scope.
Furthermore, conservative static analysis must recursively descend into libraries to obtain a full set of dependencies.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{1\linewidth}
\begin{minted}{python}
import urllib.request as r
with r.urlopen('http://someweb/code.py') as response:
    eval( response.read() )
\end{minted}
\caption{Dynamic code evaluation in Python may lead to arbitrary dependencies that will only be known at runtime.}\label{fig:dynamic-code-evaluation-i}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{1\linewidth}
\begin{minted}{python}
b = d * 2 if a > 10 else e * 2
\end{minted}
\vspace*{-3mm}
\caption{Static dataflow analysis has to conservatively over-approximate dataflow because control flow depends on the program's input.}\label{fig:static-dataflow-analysis-}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\vspace*{-3mm}
\caption{Example Python code}\label{fig:example-python-code}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Crucially, for our provenance use cases (i.e., parallelism and incremental updates), we need provenance to be available \emph{before} a notebook is executed.
Overly conservative static analysis is a bad fit: in all but the most trivial notebooks, such analysis must trade off between excessive runtimes to fully analyze all dependent libraries, and treating all cells as interdependent.
Conversely, a less conservative approach could lead to unsafe notebook execution if it misses a dependency.
To overcome this dilemma, we propose an approach that computes approximate provenance using static analysis (allowing for both false negatives and false positives in terms of data dependencies), and then discovers and compensates for missing and spurious data dependencies at runtime. This approach is sensible in the context of computational notebooks, and prior systems like Nodebook, a Jupyter plugin developed at Stitchfix~\cite{nodebook}, make similar assumptions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Static Approximate Provenance}
An initial pass over the notebook's code uses Python's AST library to obtain a set of read and write dependencies for each cell, applying standard dataflow equations~\cite{DBLP:journals/tse/Weiser84} to derive an approximate dataflow graph.
To minimize performance overhead, this step only analyzes the user's code and does not consider other modules (libraries) --- intra-module dependencies (e.g., stateful libraries) will be missed at this stage, but can still be discovered at runtime.
Conversely, like any static dataflow analysis, this stage may also produce false positives due to control-flow decisions that depend on the input. For example, in \Cref{fig:static-dataflow-analysis-}, whether the cell has a read dependency on \texttt{d} or \texttt{e} depends on the value of \texttt{a}, which will only be known at runtime.
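As a rough illustration (a simplified sketch, not Vizier's actual implementation; \texttt{cell\_dependencies} is a hypothetical helper), per-cell read and write sets can be extracted with Python's built-in \texttt{ast} module:
\begin{minted}{python}
import ast

def cell_dependencies(code):
    # Collect every name a cell loads (a read) or stores/deletes (a write).
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    return reads, writes

# Both branches of the conditional contribute reads, over-approximating
# the true dependencies:
cell_dependencies("b = d * 2 if a > 10 else e * 2")
# -> ({'a', 'd', 'e'}, {'b'})
\end{minted}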
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Exact Runtime Provenance}
As the notebook executes, provenance refinement relies on the ICE architecture to collect a list of data artifacts written to or read by each cell.
The resulting dynamically collected read and write sets are used to refine the dataflow graph created by static analysis.
Our scheduler (\Cref{sec:scheduler}), assessing opportunities for parallelism or work re-use across re-executions of the notebook, can then leverage this refined information as it becomes available.
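A minimal sketch of this refinement step, under the assumption that dependencies are tracked as per-cell read/write sets (the helper names are hypothetical): cells that have executed replace their static approximation with the exact sets observed at runtime, and the dataflow graph is rebuilt from the refined sets.
\begin{minted}{python}
def refine(static_deps, observed):
    # static_deps: per-cell (reads, writes) in notebook order;
    # observed: cell index -> exact sets collected at runtime.
    # Executed cells replace their static over-approximation.
    return [observed.get(i, dep) for i, dep in enumerate(static_deps)]

def dataflow_edges(deps):
    # Add an edge (i, j) when cell j may read an artifact
    # last written by cell i.
    edges, last_writer = set(), {}
    for j, (reads, writes) in enumerate(deps):
        edges |= {(last_writer[n], j) for n in reads if n in last_writer}
        for n in writes:
            last_writer[n] = j
    return edges
\end{minted}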
%%% Local Variables:

@@ -4,15 +4,16 @@ As a proof of concept, we implemented a simple, provenance-aware parallel schedu
Parallelizing cell execution requires an ICE architecture, which comes at the cost of increased communication overhead relative to monolithic kernel notebooks.
In this section, we assess that cost.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Implementation}
The parallel scheduler was integrated into Vizier 1.2\footnote{\url{https://github.com/VizierDB/vizier-scala}} --- our experiments use this version, lightly modified to support Jupyter notebooks and the related experimental \texttt{-X PARALLEL-PYTHON} option.
We additionally added a pooling feature to mitigate Python's high startup cost (600ms up to multiple seconds): the modified Vizier pre-launches a small pool of Python instances and keeps them running in the background.
Our current implementation selects kernels from the pool arbitrarily.
In future work, we plan to allow kernels to cache artifacts, and prioritize the use of kernels that have already loaded artifacts we expect the cell to read.
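The pool can be sketched as follows (a simplified, hypothetical illustration; Vizier's actual kernel management differs in detail):
\begin{minted}{python}
import queue
import subprocess

class KernelPool:
    # Pre-launch idle Python interpreters so cell execution does not
    # pay the interpreter startup cost on the critical path.
    def __init__(self, size=4):
        self.idle = queue.Queue()
        for _ in range(size):
            self.idle.put(subprocess.Popen(
                ["python3", "-i"],
                stdin=subprocess.PIPE, stdout=subprocess.PIPE))

    def acquire(self):
        # Arbitrary selection, as described above; a smarter policy
        # would prefer kernels that already hold needed artifacts.
        return self.idle.get()

    def release(self, kernel):
        self.idle.put(kernel)
\end{minted}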
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Experiments}
All experiments were run on Ubuntu 20.04 on a server with 2$\times$ AMD Opteron 4238 CPUs (3.3GHz), 128GB of RAM, and 4$\times$ 1TB 7.2k RPM HDDs in hardware RAID 5. As Vizier relies on Apache Spark, we prefix all notebooks under test with a single reader and writer cell to force initialization of, e.g., Spark's HDFS module. These cells are not included in timing results.
\begin{figure*}[t]
\begin{subfigure}[b]{.32\textwidth}
@@ -39,19 +40,26 @@ As Vizier relies on Apache Spark, we prefix all notebooks under test with a sing
\trimfigurespacing
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Overview}
As a preliminary overview, we run a synthetic workload consisting of one cell that randomly generates a 100k-row, 2-integer-column\OK{Nachiket: Confirm this please} Pandas dataframe and exports it, and 10 reader cells that load the dataset and perform a compute-intensive task: computing pairwise distances for a 10k-row subset of the source dataset.
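For concreteness, the writer and reader cells correspond roughly to the following sketch (a hypothetical reconstruction from the description above; the exact sizes and distance computation are assumptions):
\begin{minted}{python}
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Writer cell: a 100k-row dataframe with two integer columns, exported
# to the global state.
df = pd.DataFrame(np.random.randint(0, 1000, size=(100_000, 2)),
                  columns=["x", "y"])

# Each of the 10 reader cells: load the dataset and compute pairwise
# distances over a 10k-row subset.
dists = pdist(df.head(10_000).to_numpy())
\end{minted}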
\Cref{fig:gantt} shows example execution traces for the workload in Vizier with its classical (serial) scheduler, Vizier with its new (parallel) scheduler, and Jupyter.
The experiments show an overhead of XX\OK{Fill in}s as Python exports data, and a XXs overhead from loading the data back in.
We observe several opportunities for potential improvement:
First, the first access to the dataset in the serial schedule is 2s more expensive than the remaining lookups, as Vizier loads and prepares to host the dataset through the Arrow protocol. We expect that such startup costs can be mitigated, for example by having the Python kernel continue hosting the dataset itself while the monitor process is loading the data.
We also note that this overhead grows to almost 10s in the parallel case. In addition to startup costs, this may be the result of contention; even when executing cells in parallel, it may be beneficial to stagger cell starts.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Scaling}
In this benchmark, we measure ICE overheads relative to data size.
Specifically, we measure the cost of:
(i) exporting a dataset,
(ii) importing a `cold' dataset, and
(iii) importing a `hot' dataset.
Results are shown in XXXXX.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End:

@@ -1,21 +1,28 @@
%!TEX root=../main.tex
An isolated cell execution (ICE) notebook isolates cells by executing each in a fresh kernel.
In this section, we review key differences between the ICE model used in systems like Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your} or Nodebook~\cite{nodebook}, and the monolithic approach of Jupyter.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Communication Model}
As in a monolithic kernel notebook, an ICE notebook maintains a shared global state that is manipulated by each individual cell.
However, these manipulations are explicit: for a variable defined in one cell (the writer) to be used in a subsequent cell (the reader), (i) the writer must explicitly export the variable into the global state, and (ii) the reader must explicitly import the variable from the global state.
For example, Vizier provides explicit setter and getter functions (respectively) on a global state variable, while Nodebook inspects the Python interpreter's global scope dictionary in between cell executions.
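The following toy stand-in illustrates the model (the \texttt{set}/\texttt{get} names are hypothetical, not Vizier's actual API):
\begin{minted}{python}
class GlobalState:
    # Toy stand-in for the versioned shared state manager.
    def __init__(self):
        self.artifacts = {}
    def set(self, name, value):
        self.artifacts[name] = value   # explicit export by the writer
    def get(self, name):
        return self.artifacts[name]    # explicit import by the reader

state = GlobalState()
state.set("x", 42)         # writer cell exports x
print(state.get("x") + 1)  # reader cell (in a fresh kernel) imports x
\end{minted}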
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{State Serialization}
When a state variable is exported, it is serialized by the Python interpreter and stored in a versioned state management system.
We refer to the serialized state as an \emph{artifact}.
Each cell executes in the context of a scope, a mapping from variable names to artifacts that can be imported by the cell.
By default, Vizier serializes state through Python's native \texttt{pickle} library, although it can easily be extended with codecs for specialized types that are either unsupported by \texttt{pickle}, or for which it is inefficient:
(i) Python code (e.g., import statements and function or class definitions) is exported as raw Python code and imported with \texttt{eval}.
(ii) Pandas dataframes are exported in parquet format, and are exposed to subsequent cells by the monitor process through Apache Arrow direct access.
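Such a codec pair for dataframes might look as follows (a sketch; the function names are hypothetical, and a real deployment would serve the file via Arrow direct access rather than re-reading it):
\begin{minted}{python}
import pandas as pd

def encode_dataframe(df, path):
    df.to_parquet(path)            # export the artifact in parquet format
    return path

def decode_dataframe(path):
    return pd.read_parquet(path)   # subsequent cells re-load the artifact
\end{minted}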
% We note the need to support transitive dependencies like functions that invoke other functions.
% Exporting (resp., importing) the former function requires exporting (importing) the latter.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: