diff --git a/.gitignore b/.gitignore
index d8f20a7..9357850 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,3 +9,4 @@ pdfa.xmpi
 main.bbl
 main.blg
 vizier.db
+/_minted-main
diff --git a/graphics/depth_vs_cellcount-averaged.vega-lite.pdf b/graphics/depth_vs_cellcount-averaged.vega-lite.pdf
index 6489b81..062ae5d 100644
Binary files a/graphics/depth_vs_cellcount-averaged.vega-lite.pdf and b/graphics/depth_vs_cellcount-averaged.vega-lite.pdf differ
diff --git a/graphics/depth_vs_cellcount-averaged.vega-lite.svg b/graphics/depth_vs_cellcount-averaged.vega-lite.svg
index b718246..a51b94e 100644
--- a/graphics/depth_vs_cellcount-averaged.vega-lite.svg
+++ b/graphics/depth_vs_cellcount-averaged.vega-lite.svg
@@ -1 +1 @@
-050100150200250300350400450500Total Cells0510152025Average Dependency Depth
\ No newline at end of file
+050100150200250300350400450500Total Cells0510152025Average Dependency Depth
\ No newline at end of file
diff --git a/graphics/gantt_serial.png b/graphics/gantt_serial.png
index 0ce001a..3faa122 100644
Binary files a/graphics/gantt_serial.png and b/graphics/gantt_serial.png differ
diff --git a/main.tex b/main.tex
index fd10064..89b74f2 100644
--- a/main.tex
+++ b/main.tex
@@ -2,6 +2,7 @@
 %\documentclass{vldb}
 \settopmatter{printacmref=false}
+\setcopyright{none}
 \usepackage{hyperref}
 \usepackage[a-1b]{pdfx}
@@ -49,7 +50,7 @@
 \newcommand{\systemname}{Workbook\xspace}
-\newcommand{\TheTitle}{Coarse-Grained Dataflow Provenance}
+\newcommand{\TheTitle}{Incremental Provenance}
 \pagestyle{plain}
diff --git a/sections/experiments.tex b/sections/experiments.tex
index f7f64f6..b47d7e3 100644
--- a/sections/experiments.tex
+++ b/sections/experiments.tex
@@ -7,7 +7,7 @@ Parallelizing cell execution requires an ICE architecture, which comes at the co
 In this section, we assess that cost.
 All experiments were run on a XX GHz, XX core Intel Xeon with XX GB RAM running XX Linux\OK{Boris, Nachiket, can you fill this in?}.
-The provenance aware scheduler was integrated into Vizier 1.2\footnote{https://github.com/VizierDB/vizier-scala} --- our experiments use a lightly modified version with support for importing Jupyter notebooks, and the related \texttt{-X PARALLEL-PYTHON} experimental option.
+The provenance aware scheduler was integrated into Vizier 1.2\footnote{\url{https://github.com/VizierDB/vizier-scala}} --- our experiments use a lightly modified version with support for importing Jupyter notebooks, and the related \texttt{-X PARALLEL-PYTHON} experimental option.
 As Vizier relies on Apache Spark, we prefix all notebooks under test with a single reader and writer cell to force initialization of e.g., Spark's HDFS module.
 These are not included in timing results.
 \begin{figure*}
diff --git a/sections/import.tex b/sections/import.tex
index a6c3850..6620a39 100644
--- a/sections/import.tex
+++ b/sections/import.tex
@@ -20,9 +20,10 @@ For example, python's \texttt{import} statement simply declares imported modules
 Thus, references to the module's functions within a function or class definition create transitive dependencies.
 When the traversal visits a function or class declaration statement, we record
-An additional complication arises from python's scope capture semantics.
-When a function (or class) is declared, it records a reference to all enclosing scopes.
 Consider the following example code:
-\begin{lstlisting}
+\begin{figure}
+  \begin{center}
+    \begin{subfigure}{0.45\columnwidth}
+\begin{minted}{python}
 def foo():
     print(a)
 a = 1
@@ -31,21 +32,29 @@ def bar():
     a = 2
     foo()
 bar() # Prints '1'
-\end{lstlisting}
-
-\begin{lstlisting}
+\end{minted}
+    \end{subfigure}
+    \begin{subfigure}{0.5\columnwidth}
+\begin{minted}{python}
 def foo():
-    a = 2
     def bar():
         print(a)
+    a = 2
     return bar
 bar = foo()
 bar() # Prints '2'
 a = 1
 bar() # Prints '2'
-\end{lstlisting}
+\end{minted}
+    \end{subfigure}
+  \end{center}
+  \label{fig:scoping}
+  \caption{Scope capture in python happens at function definition, but captured scopes remain mutable.}
+\end{figure}
-In the latter instance, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}, and overrides assignment in the global scope.
+An additional complication arises from python's scope capture semantics.
+When a function (or class) is declared, it records a reference to all enclosing scopes. Consider the following example code in \Cref{fig:scoping}.
+In the latter code block, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}, and overrides assignment in the global scope.
 In the former instance, conversely, \texttt{bar}'s assignment to \texttt{a} happens in its own scope, and so the invocation of \texttt{foo} reads the instance of \texttt{a} in the global scope.
 \tinysection{Coarse-Grained Data Flow}
diff --git a/sections/introduction.tex b/sections/introduction.tex
index a46e08a..4d40868 100644
--- a/sections/introduction.tex
+++ b/sections/introduction.tex
@@ -27,8 +27,8 @@ We then show generality by discussing the process for importing Jupyter notebook
 \begin{figure}
 % \includegraphics[width=\columnwidth]{graphics/depth_vs_cellcount.vega-lite.pdf}
 \includegraphics[width=0.8\columnwidth]{graphics/depth_vs_cellcount-averaged.vega-lite.pdf}
-\caption{Notebook size versus workflow depth in a collection of notebooks scraped from github~\cite{DBLP:journals/ese/PimentelMBF21}: On average, only one out of every 4 notebook cells must be run serially.}
-\label{fig:parallelismSurvey}
+\caption{Notebook size versus workflow depth in a collection of notebooks scraped from github~\cite{DBLP:journals/ese/PimentelMBF21}: On average, only one out of every 4 notebook cells has serial dependencies.}
+\label{fig:parallelismSurvey}
 \end{figure}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%