Compare commits

...

6 Commits

Author SHA1 Message Date
Boris Glavic 945de5e806 pdf CR 2022-05-09 06:07:35 -05:00
Boris Glavic af2d4354c1 updates 2022-05-09 06:00:58 -05:00
Oliver Kennedy 22d2546aea Trimming 2022-05-08 13:20:20 -04:00
Boris Glavic f814a6b272 updates 2022-05-06 22:25:26 -05:00
Boris Glavic af09529cd2 Merge branch 'master' of https://git.overleaf.com/622e24cad7e111236dd1ae13 2022-05-06 21:40:09 -05:00
nachideo.em b8047dc0e9 Update on Overleaf. 2022-05-07 02:40:05 +00:00
10 changed files with 87 additions and 89 deletions

View File

@ -299,7 +299,6 @@ Applications Using Provenance},
howpublished={https://github.com/stitchfix/nodebook}
}
@book{WV02,
Author = {Weikum, G. and Vossen, G.},
Date-Added = {2008-04-08 16:42:46 +0200},

Binary file not shown.

View File

@ -1,7 +1,8 @@
%!TEX root=../main.tex
Conservative static analysis for Python will typically lead to very coarse-grained over-approximation of the real data dependencies of a program, or has to allow for false negatives (missed dependencies).
To see why this is the case consider the code snippet in \Cref{fig:dynamic-code-evaluation-i}, which retrieves a piece of Python code from the web and then evaluates that code.
The dynamically evaluated code can create data dependencies between everything in global scope. Furthermore, conservative static analysis must recursively descend into libraries to obtain a full set of dependencies.
Conservative static analysis for Python either produces a coarse-grained over-approximation of the real data dependencies of a program, or has to allow for missed dependencies.
To see why this is the case, consider the code snippet in \Cref{fig:dynamic-code-evaluation-i}, which executes a piece of Python code retrieved from the web.
The dynamically evaluated code can create data dependencies between everything in the global scope.
A further overhead of conservative static analysis is the need to recursively descend into libraries for completeness.
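To make the failure mode concrete, the following is a minimal sketch of the kind of dynamically evaluated cell the referenced figure describes (the URL and names here are hypothetical placeholders, not the figure's actual content); no static analysis of this cell can determine which globals the downloaded code will read or write.
\begin{verbatim}
import urllib.request

# Hypothetical example: fetch a code snippet from the web and evaluate it.
url = "https://example.com/snippet.py"   # placeholder URL
code = urllib.request.urlopen(url).read().decode("utf-8")

# exec() runs the downloaded code in the notebook's global scope, so it may
# read or overwrite *any* global variable -- invisible to static analysis.
exec(code, globals())
\end{verbatim}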
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
@ -33,16 +34,16 @@ b = d * 2 if a > 10 else e * 2
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Crucially, for our provenance use cases (i.e., parallelism and incremental updates), we need provenance to be available \emph{before} a notebook is executed.
Overly conservative static analysis is a bad fit: In all but the most trivial notebooks, such analysis must trade off between excessive runtimes to fully analyze all dependent libraries, and treating all cells as interdependent.
Conversely, a less conservative approach could lead to unsafe notebook execution if it misses a dependency.
To overcome this dilemma, we propose an approach that computes approximate provenance using static analysis (allowing for both false negative and false positives in terms of data dependencies) and then compensates for missing and spurious data dependencies by discovering and compensating for them at runtime. This approach is sensible in the context of computational notebooks, and prior systems like Nodebook, a Jupyter plugin developed at Stitchfix~\cite{nodebook}, make similar assumptions.
Overly conservative static analysis must either accept excessive runtimes to fully analyze all dependent libraries (plus approximations to handle dynamically executed code), or treat all cells as interdependent.
However, a less conservative approach could lead to unsafe notebook execution if it misses a dependency.
To overcome this dilemma, we propose an approach that computes approximate provenance using static analysis (allowing for both false negatives and false positives in terms of data dependencies) and then discovers and compensates for missing and spurious data dependencies at runtime.
This approach is sensible in the context of computational notebooks, and prior systems like Nodebook, a Jupyter plugin developed at Stitchfix~\cite{nodebook}, make similar assumptions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Static Approximate Provenance}
An initial pass over the notebook's code obtains a set of read and write dependencies for each cell using Python's AST library, using standard dataflow equations~\cite{NN99} to derive an approximate dataflow graph.
An initial pass over the notebook's code obtains a set of read and write dependencies for each cell using Python's AST library and standard dataflow equations~\cite{NN99} to derive an approximate dataflow graph.
To minimize performance overhead, this step only analyzes the user's code and does not consider other modules (libraries) --- intra-module dependencies (e.g., stateful libraries) will be missed at this stage, but can still be discovered at runtime.
Conversely, like any static dataflow analysis, this stage may also produce false positives due to control-flow decisions that depend on the input. For example, in \Cref{fig:static-dataflow-analysis-} whether the cell has a read dependency on \texttt{d} or \texttt{e} depends on the value of \texttt{a} which only will be known at runtime.
Like any static dataflow analysis, this stage may also produce false positives due to control-flow decisions that depend on the input. For example, in \Cref{fig:static-dataflow-analysis-} whether the cell has a read dependency on \texttt{d} or \texttt{e} depends on the \emph{runtime} value of \texttt{a}.
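As a rough illustration of this static pass (a minimal sketch under simplifying assumptions, not the system's actual analysis: only plain variable names are tracked), a cell's read and write sets can be approximated directly from Python's \texttt{ast} module.
\begin{verbatim}
import ast

def cell_read_write_sets(code):
    """Approximate a cell's read and write sets from its AST.
    Only plain variable names are tracked; attributes, closures,
    imports, and dynamically evaluated code are ignored."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            elif isinstance(node.ctx, ast.Store):
                writes.add(node.id)
    return reads, writes

# The conditional read of d or e is over-approximated as a read of both:
# reads {a, d, e}, writes {b}.
print(cell_read_write_sets("b = d * 2 if a > 10 else e * 2"))
\end{verbatim}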
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Exact Runtime Provenance}

View File

@ -1,9 +1,10 @@
%!TEX root=../main.tex
We introduced an approach for incrementally refining a static approximation of provenance for computational notebooks and implemented a scheduler for ICE-architecture notebooks based on this approach.
We introduced an approach for incrementally refining provenance for computational notebooks and implemented a scheduler for ICE-architecture notebooks based on this approach.
Our method enables (i) parallel cell execution; (ii) automatic refresh of dependent cells after modifications; and (iii) import of Jupyter notebooks.
While our proof-of-concept shows promise, significant additional work is needed to reduce state transfer between kernels.
For example, kernels can be re-used forked to minimize state transfer.
While our proof-of-concept shows promise, further work is needed to reduce state transfer between kernels.
For example, kernels can be re-used or forked to minimize state transfer.
Alternatively, responsibility for hosting state can be moved from the coordinator directly to the kernel that created the state.
We also plan to explore how to reduce the initial dataframe access cost.
% Significant further opportunities exist for research on ICE Notebooks.

View File

@ -1,40 +1,5 @@
%!TEX root=../main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[t]
\newcommand{\plotminusvspace}{-3mm}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_serial.pdf}
\vspace*{\plotminusvspace}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_parallel.pdf}
\vspace*{\plotminusvspace}
\label{fig:gantt:serial}
\caption{Parallel Execution}
\end{subfigure}
% \begin{subfigure}[b]{.24\textwidth}
% \includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
% \vspace*{\plotminusvspace}
% \label{fig:gantt:serial}
% \caption{Scalability - Read}
% \end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{0.3\linewidth}
\vspace*{-26mm}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/scalability-read.pdf}
\vspace*{\plotminusvspace}
\caption{Scalability - Read}\label{fig:scalability}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace*{-5mm}
\caption{Workload traces for a synthetic reader/writer workload}
\label{fig:gantt}
\trimfigurespacing
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
As a proof of concept, we implemented the static analysis approach from \Cref{sec:import} as a provenance-aware parallel scheduler (\Cref{sec:scheduler}) within the Vizier notebook system~\cite{brachmann:2020:cidr:your}.
Parallelizing cell execution requires an ICE architecture, which comes at the cost of increased communication overhead relative to monolithic kernel notebooks.
@ -44,7 +9,7 @@ Parallelizing cell execution requires an ICE architecture, which comes at the co
\tinysection{Implementation}
The parallel scheduler was integrated into Vizier 1.2 (\url{https://github.com/VizierDB/vizier-scala}).
%--- our experiments lightly modify this version for Jupyter notebooks and the related \texttt{-X PARALLEL-PYTHON} experimental option.
We additionally added a pooling feature to mitigate Python's high startup cost (600ms up to multiple seconds); The modified Vizier launches a small pool of continuously running Python instances.
We also added a pooling feature to mitigate Python's high startup cost (approximately 600ms): the modified Vizier launches a small pool of continuously running Python instances.
%Our current implementation selects kernels from the pool arbitrarily.
In future work, we plan to allow kernels to cache artifacts, and prioritize the use of kernels that have already loaded artifacts we expect the cell to read. This prototype does not yet implement repairs for missed dependencies.
@ -58,7 +23,7 @@ To mitigate Spark's high start-up costs, we prefix all notebooks under test with
\tinysection{Overview}
As a preliminary experiment, we ran a synthetic workload consisting of one cell that randomly generates a 100k-row, two-integer-column Pandas dataframe and exports it, and 10 reader cells that read the dataset and perform a compute-intensive task: computing pairwise distances for a 10k-row subset of the source dataset.
\Cref{fig:gantt} shows execution traces for the workload in Vizier with its default (serial) scheduler and Vizier with its new (parallel) scheduler.
The experiments shows that the parallel execution is $\sim 4$ times faster than the serial execution. However, each individual reader takes longer to finish in the parallel execution. % an overhead of XX\OK{Fill in}s overhead as Python exports data, and a XXs overhead from loading the data back in.
The experiment shows that the parallel execution is $\sim 4$ times faster than the serial execution. However, each individual reader takes longer to finish in the parallel execution. % an overhead of XX\OK{Fill in}s overhead as Python exports data, and a XXs overhead from loading the data back in.
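For reference, the following sketch approximates the described workload; the exact benchmark code, column types, and distance computation are not shown in the paper, so the details here are illustrative.
\begin{verbatim}
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Writer cell: a 100k-row dataframe with two integer columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1_000, size=(100_000, 2)),
                  columns=["x", "y"])

# Each of the 10 reader cells: pairwise distances over a 10k-row subset
# (compute-intensive by design; the 10k x 10k float64 result is ~800 MB).
subset = df.iloc[:10_000].to_numpy(dtype=float)
distances = cdist(subset, subset)
\end{verbatim}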
% We observe several oppoortunities for potential improvement:
% First, the serial first access to the dataset is 2s more expensive than the remaining lookups as Vizier loads and prepares to host the dataset through the Arrow protocol. We expect that such startup costs can be mitigated, for example by having the Python kernel continue hosting the dataset itself while the monitor process is loading the data.
% We also note that this overhead grows to almost 10s in the parallel case. In addition to startup-costs,

View File

@ -37,7 +37,7 @@ bar() # Prints '2'
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We now outline the conversion of monolithic Jupyter kernel notebooks into ICE-compatible form.
We now outline the conversion of (monolithic-kernel) Jupyter notebooks into ICE-compatible form.
For this preliminary work, we make a simplifying assumption that all inter-cell communication occurs through the kernel's global scope (e.g., as opposed to files).
Python's \texttt{ast} module provides a structured representation of the code: an \emph{abstract syntax tree} (AST).
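For instance (a trivial illustration; the \texttt{indent} argument requires Python 3.9+), a conditional assignment parses into a tree whose \texttt{Name} nodes carry load/store contexts that distinguish reads from writes.
\begin{verbatim}
import ast

# A structured view of a single cell's code.
tree = ast.parse("b = d * 2 if a > 10 else e * 2")
print(ast.dump(tree, indent=2))
\end{verbatim}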
@ -52,10 +52,48 @@ When traversing a function or class declaration, we record such dependencies and
Transitive dependency tracking is complicated due to Python's use of mutable closures (e.g., see \Cref{fig:scoping});
In the latter code block, when \texttt{bar} is declared, it `captures' the scope of \texttt{foo}, in which \texttt{a = 2}, and overrides an assignment in the global scope, even though the enclosing scope is not otherwise accessible.
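The following minimal reconstruction (not necessarily identical to \Cref{fig:scoping}) reproduces the behavior described: the inner function reads the captured \texttt{a = 2} rather than the global assignment.
\begin{verbatim}
a = 1              # assignment in the global scope

def foo():
    a = 2          # enclosing scope captured by bar
    def bar():
        print(a)   # the closure reads foo's a, not the global a
    return bar

bar = foo()
bar()              # prints '2'
\end{verbatim}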
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}[t]
\newcommand{\plotminusvspace}{-3mm}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_serial.pdf}
\vspace*{\plotminusvspace}
\caption{Serial Execution}
\label{fig:gantt:serial}
\end{subfigure}
\begin{subfigure}[b]{.3\textwidth}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/gantt_parallel.pdf}
\vspace*{\plotminusvspace}
\label{fig:gantt:serial}
\caption{Parallel Execution}
\end{subfigure}
% \begin{subfigure}[b]{.24\textwidth}
% \includegraphics[width=\columnwidth]{graphics/gantt_serial.png}
% \vspace*{\plotminusvspace}
% \label{fig:gantt:serial}
% \caption{Scalability - Read}
% \end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{subfigure}{0.3\linewidth}
\vspace*{-26mm}
\includegraphics[width=0.9\columnwidth,trim=0 0 0 0]{graphics/scalability-read.pdf}
\vspace*{\plotminusvspace}
\caption{Scalability - Read}\label{fig:scalability}
\end{subfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vspace*{-5mm}
\caption{Workload traces for a synthetic reader/writer workload}
\label{fig:gantt}
\trimfigurespacing
\end{figure*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Second, the fine-grained dataflow graph, produced as explained above, is reduced into a \emph{coarse-grained} dataflow graph by (i) merging nodes for the statements in a cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
Afterwards, the fine-grained dataflow graph, produced as explained above, is reduced into a \emph{coarse-grained} dataflow graph by (i) merging nodes for the statements in a cell, (ii) removing self-edges, and (iii) removing parallel edges with identical labels.
The coarse-grained dataflow graph provides an approximation of the cell's dependencies: the set of in-edges (resp., out-edges) is typically an upper bound on the cell's real dependencies. % guaranteed upper bound on the cell's write set (read set).
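A minimal sketch of this reduction, under an assumed edge representation (triples of reader statement, writer statement, and variable label) that is not necessarily the implementation's:
\begin{verbatim}
def coarsen(fine_edges, stmt_to_cell):
    """Collapse a statement-level dataflow graph into a cell-level one:
    (i) merge the statements of a cell, (ii) drop self-edges,
    (iii) de-duplicate parallel edges with identical labels."""
    coarse = set()
    for reader_stmt, writer_stmt, label in fine_edges:
        cr, cw = stmt_to_cell[reader_stmt], stmt_to_cell[writer_stmt]
        if cr != cw:                      # (ii) remove self-edges
            coarse.add((cr, cw, label))   # (i) + (iii) via the set
    return coarse

# e.g., two statements of cell 2 both read 'x' written in cell 1:
print(coarsen({(3, 1, "x"), (4, 1, "x"), (4, 3, "y")},
              {1: "cell1", 3: "cell2", 4: "cell2"}))
# -> {('cell2', 'cell1', 'x')}
\end{verbatim}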
While missed dependencies are theoretically possible, they are rare in the type of code used in typical Jupyter notebooks. Nonetheless, if they arise they will be taken care of by our scheduler. As a final step, we inject explicit variable imports and exports (using Vizier's artifact API) for the read and write sets of each cell into the cell's code.
While missed dependencies are theoretically possible, they are rare in the type of code used in typical Jupyter notebooks.
Nonetheless, if they do arise, they are handled by our scheduler at runtime.
As a final step, we inject explicit variable imports and exports (using Vizier's artifact API) for the read and write sets of each cell into the cell's code.
%%% Local Variables:

View File

@ -1,35 +1,30 @@
%!TEX root=../main.tex
Workflow systems~\cite{DC07} help users to break complex tasks like ETL, model-fitting, and more, into a series of smaller steps.
Users explicitly declare inter-step dependencies, permitting parallel execution of mutually independent steps.
% Recently however, systems like Papermill~\cite{papermill} have emerged,
Systems like Jupyter % or Zeppelin
allow users to instead describe such tasks through computational notebooks.
Notebook users express tasks as a sequence of code `cells' that describe each step of the computation without explicitly declaring dependencies between cells.
Users manually trigger cell execution, dispatching the code in the cell to a long-lived Python interpreter called a kernel.
Workflow systems~\cite{DC07} break complex tasks like ETL, model-fitting, and more into a series of smaller, parallelizable steps, but require users to explicitly declare inter-step data dependencies.
Computational notebooks like Jupyter instead model tasks as sequences of code `cells' that each describe a step of the computation without explicit inter-cell dependencies.
Users manually trigger cell execution, dispatching the cell's code to a long-lived Python interpreter called a kernel.
Cells share data through the state of the interpreter, e.g., a global variable created in one cell can be read by another cell.
Companies like Netflix\footnote{https://netflixtechblog.com/scheduling-notebooks-348e6c14cfd6} are increasingly turning to notebooks for batch data processing (e.g., ETL or ML) thanks to the rapid, interactive notebook development cycle.
However, the hidden state of the Python interpreter creates hard to understand and reproduce bugs, like notebooks that only work if executed out-of-order~\cite{DBLP:journals/ese/PimentelMBF21,
Companies like Netflix\footnote{https://netflixtechblog.com/scheduling-notebooks-348e6c14cfd6} are increasingly turning to notebooks for bulk ETL and ML workloads as they allow a faster, more interactive development cycle.
However, the kernel's hidden state often creates bugs that are hard to understand or reproduce~\cite{DBLP:journals/ese/PimentelMBF21,
% joelgrus,
brachmann:2020:cidr:your}\footnote{https://www.youtube.com/watch?v=7jiPeIFXb6U}.
Furthermore, the notebook execution model also precludes parallelism --- cells must be evaluated serially because their inter-dependencies are not know upfront.
Furthermore, without explicit dependencies, the notebook execution model also precludes inter-cell parallelism and incremental evaluation.
Dynamically collected provenance~\cite{pimentel-17-n} can address the former problem, but is of limited use for scheduling since dependencies are only learned after running a cell.
By the time we learn that two cells can be safely executed concurrently, they have already finished.
Static dataflow analysis~\cite{NN99} addresses both needs, but the dynamic nature of Python forces approximation of dependencies --- conservatively, limiting opportunities for parallelism; or optimistically, risking execution errors from missed inter-cell dependencies.
Dynamically collected provenance~\cite{pimentel-17-n} can address the hidden state problem, but is of limited use for scheduling since dependencies are only learned after running a cell:
By the time we learn that two cells can be safely executed concurrently, they are finished.
Static dataflow analysis~\cite{NN99} addresses both needs, but requires approximating dependencies --- conservatively, limiting opportunities for parallelism; or optimistically, risking errors from missed dependencies.
We bridge the static/dynamic gap by proposing a hybrid approach: \emph{Incremental Runtime Provenance Refinement}, where statically approximated provenance is incrementally refined at runtime.
We demonstrate how to utilize provenance refinement to enable (i) parallel execution and (ii) partial re-evaluation of notebooks.
This requires a fundamental shift away from execution in a single monolithic kernel towards isolated per-cell interpreters.
We validate our ideas by extending \emph{Vizier}~\cite{brachmann:2020:cidr:your}, an isolated cell execution (\textit{ICE}) notebook.
Cells in Vizier run in isolated interpreters and communicate only through dataflow (by reading / writing data artifacts).
We demonstrate how provenance refinement enables
(i) parallel execution and
(ii) partial re-evaluation of notebooks.
As we show, this requires a fundamental shift away from a single monolithic kernel towards isolated per-cell interpreters, a model we call Isolated Cell Execution (ICE).
We validate our ideas by extending an ICE notebook named \emph{Vizier}~\cite{brachmann:2020:cidr:your}.
Cells in Vizier run in isolated interpreters and communicate only through dataflow.
%This model allows concurrently running cells to read the same artifact, and allows snapshotting artifacts for re-use during incremental re-evaluation.
We outline the challenges of extending Vizier to support parallelism and incremental re-evaluation based on approximate provenance determined using static program analysis that is incrementally refined during (re-)execution.
Additionally, this enables Jupiter notebooks to be imported into an ICE which requires data dependencies between cells to be made explicit as inter-cell dataflow. % to be % determined statically and to
% transformed into explicit dataflow between cells.
%In the future, we plan to extend Vizier to automatically determine dataflow using these techniques which
While ICE notebooks have many advantages, they require users to explicitly declare which data items to share across cells. Using our provenance refinement techniques, we can automate this process, enabling users to operate as in monolithic kernel notebooks, i.e., without having to explicitly declare what state to share among cells.
We outline the challenges of extending Vizier to support implicit dependencies through incrementally refined provenance.
This, in turn, allows for parallel and incremental notebook execution, while retaining the typical programming model of Jupyter, where developers do not have to explicitly declare what state to share among cells.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
@ -59,7 +54,7 @@ Afterwards, we discuss our remaining contributions:
(ii) \textbf{Jupyter Import}: \Cref{sec:import} discusses how we extract approximate provenance from Python code statically, and how existing notebooks written for Jupyter % (or comparable monolithic kernel architectures)
can be translated to ICE notebook architectures like Vizier.
% \item
(iii) \textbf{Implementation in Vizier and Experiments}: We have implemented a preliminary prototype of the proposed scheduler in Vizier. \Cref{sec:experiments} presents our experiments with parallel evaluation of Jupyter notebooks.
(iii) \textbf{Implementation in Vizier and Experiments}: We have implemented a preliminary prototype of the proposed scheduler in Vizier. \Cref{sec:experiments} presents our experiments on the parallel evaluation of Jupyter notebooks.
% \end{itemize}
% A computational notebook is an ordered sequence of \emph{cells}, each containing a block of (e.g., Python) code or (e.g., markdown) documentation.

View File

@ -1,7 +1,7 @@
%!TEX root=../main.tex
An isolated cell execution (ICE) notebook isolates cells by executing each in a fresh kernel.
Before discussing how notebooks written for monolithic kernels can be mapped to the ICE model in \Cref{sec:import}, we first review the key differences between the monolithic approach and systems like Vizier~\cite{brachmann:2020:cidr:your} or Nodebook~\cite{nodebook}.
Before discussing how notebooks written for monolithic kernels, like Jupyter, can be mapped to the ICE model in \Cref{sec:import}, we first review the key differences between the monolithic approach and systems like Vizier~\cite{brachmann:2020:cidr:your} or Nodebook~\cite{nodebook}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -9,6 +9,7 @@ Before discussing how notebooks written for monolithic kernels can be mapped to
As in a monolithic kernel notebook, an ICE notebook maintains a shared global state that is manipulated by each individual cell.
However, these manipulations are explicit: for a variable defined in one cell (the writer) to be used in a subsequent cell (the reader), (i) the writer must explicitly export the variable into the global state, and (ii) the reader must explicitly import the variable from the global state.
For example, Vizier provides explicit setter and getter functions (respectively) on a global state variable, while Nodebook inspects the Python interpreter's global scope dictionary in between cell executions.
As mentioned in the introduction, with our scheduler it will no longer be necessary for the user to explicitly call setters and getters.
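For concreteness, a toy illustration of the explicit export/import that ICE systems otherwise require (the helper names are hypothetical stand-ins, not Vizier's or Nodebook's actual APIs):
\begin{verbatim}
# A dict standing in for the notebook's shared global state.
_artifacts = {}

def export_var(name, value):      # what a writer cell must call explicitly
    _artifacts[name] = value

def import_var(name):             # what a reader cell must call explicitly
    return _artifacts[name]

# Writer cell:
export_var("x", 42)

# Reader cell (would fail if the writer had not exported "x"):
print(import_var("x") + 1)
\end{verbatim}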
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{State Serialization}
@ -16,9 +17,9 @@ When a state variable is exported, it is serialized by the Python interpreter an
We refer to the serialized state as an \emph{artifact}.
Each cell executes in the context of a scope, a mapping from variable names to artifacts that can be imported by the cell.
By default, Vizier serializes state through Python's native \texttt{pickle} library, although it can be easily extended with codecs for specialized types that are either unsupported by \texttt{pickle}, or for which it is not efficient:
(i) Python code (e.g., import statements, and function or class definitions) is exported as raw Python code and imported with \texttt{eval}.
(ii) Pandas dataframes are exported in parquet format and are exposed to subsequent cells through Apache Arrow. % direct access.
By default, Vizier serializes state through Python's native \texttt{pickle} library, but can be easily extended with more specialized codecs:
(i) Python code (e.g., function or class definitions) is exported as raw Python code and imported with \texttt{eval}.
(ii) Pandas dataframes are exported in Parquet format and hosted through Apache Arrow. % direct access.
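A rough sketch of such codec dispatch (hypothetical helper functions, not Vizier's artifact API; the Arrow hosting step is omitted):
\begin{verbatim}
import io
import pickle
import pandas as pd

def encode_artifact(value):
    """Pick a codec per value type: Parquet for dataframes, pickle otherwise."""
    if isinstance(value, pd.DataFrame):
        buf = io.BytesIO()
        value.to_parquet(buf)             # requires pyarrow (or fastparquet)
        return "parquet", buf.getvalue()
    return "pickle", pickle.dumps(value)

def decode_artifact(codec, payload):
    if codec == "parquet":
        return pd.read_parquet(io.BytesIO(payload))
    return pickle.loads(payload)

codec, payload = encode_artifact(pd.DataFrame({"x": [1, 2, 3]}))
print(decode_artifact(codec, payload))
\end{verbatim}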
% We note the need to support transitive dependencies like functions that invoke other functions.
% Exporting (resp., importing) the former function requires exporting (importing) the latter.

View File

@ -13,8 +13,6 @@ Additionally, our approach allows parallel execution of independent cells, somet
Nodebook~\cite{nodebook} is a plugin for Jupyter that checkpoints notebook state in between cells to force in-order cell evaluation; although closely related to our approach, it attempts neither parallelism nor automatic re-execution of cells.
\cite{chapman-20-cqfgppp} captures fine-grained provenance at runtime for common classes of relational data transformations in Python preprocessing pipelines. In contrast, our approach utilizes static analysis. % and is not limited to operations on relational datasets.
% \begin{itemize}
% \item Provenance for jupyter or Python
% \item Dataflow analysis / program slicing

View File

@ -17,8 +17,8 @@
The semantics of a notebook is the serial execution of its cells in notebook order.
We refer to the set of variables imported or exported by each cell as the cell's read and write sets, respectively.
A \textit{correct} execution is thus defined in terms of view serializability~\cite{WV02}: A (parallel) schedule is correct iff the artifact versions that are read by each cell are consistent with the versions the cell would read in a serial execution. Note that blind writes are not an issue, because writes to an artifact create a new (immutable) version. Thus, cells that blindly write an artifact do not conflict with each other. % the last version of an artifact written by the serial and non-serial schedule will also be the same.
We assume that cell execution is atomic and idempotent: we are allowed to freely interrupt the execution of a cell, or to restart it.
A \textit{correct} execution is thus defined in terms of view serializability~\cite{WV02}: A (parallel) schedule is correct iff the artifact versions that are read by each cell are consistent with the versions the cell would read in a serial execution. Note that blind writes are not an issue in Vizier, because writes to an artifact create a new (immutable) version. Thus, cells that blindly write an artifact do not conflict with each other. % the last version of an artifact written by the serial and non-serial schedule will also be the same.
We assume that cell execution is atomic and idempotent: we are allowed to freely interrupt or restart a cell's execution.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Naive Scheduling}
@ -31,17 +31,17 @@ A notebook's data dependency graph $\dg = (\nb, \dep)$ connects cells through e
\end{multline*}
An edge labelled $\ell$ exists from any cell $c_r$ that reads symbol $\ell$ to the most recent preceding cell that writes symbol $\ell$.
Denote by $\cstate(c) \in \{ \spending, \sdone \}$ the state of a cell (i.e., \sdone after it has completed execution); a cell $c$ can be scheduled for execution when all input edges are \sdone:
Denote by $\cstate(c) \in \{ \spending, \sdone \}$ the state of a cell (i.e., \sdone after it has completed execution); a cell $c$ can be scheduled for execution once every cell it reads from (i.e., the writer at the other end of each of its dependency edges) is \sdone:
$\forall (c, c_w, \ell) \in \dep : \cstate(c_w) = \sdone$.
When a cell $c_r$ imports variable $\ell$ from the global scope, where $(c_r, c_w, \ell) \in \dep$, it receives the version exported by cell $c_w$.
Trivially, any execution order that complies with this rule produces schedules that are view-equivalent to the notebook order and, thus, will produce the same result as a serial execution.
Any execution order that complies with this rule produces schedules that are view-equivalent to the notebook order and, thus, will produce the same result as a serial execution.
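In code, this scheduling rule amounts to the following check (a sketch assuming dependency edges are represented as (reader, writer, label) triples, as above):
\begin{verbatim}
def ready_cells(cells, deps, state):
    """Cells that may be scheduled now: still pending, and every cell they
    read from (every writer they point to in the dependency graph) is done."""
    return [c for c in cells
            if state[c] == "pending"
            and all(state[w] == "done"
                    for (r, w, _label) in deps if r == c)]

# cell2 and cell3 both read x from cell1:
deps = {("cell2", "cell1", "x"), ("cell3", "cell1", "x")}
state = {"cell1": "done", "cell2": "pending", "cell3": "pending"}
print(ready_cells(["cell1", "cell2", "cell3"], deps, state))
# -> ['cell2', 'cell3']
\end{verbatim}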
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Runtime Refinement}
Recall that our static analysis approach produces a dependency graph $\dga= (\nb, \depa)$ which may have spurious edges and may miss edges. We refine $\dga$ at runtime. There are four possible types of changes to the dependency graph when a cell $\nc$ is executed. In the following we discuss these cases and how to compensate for them to ensure scheduler \emph{correctness}.
\textbf{(i)} When a read does not materialize during $\nc$'s execution, we remove the corresponding edge from the dependency graph. Such spurious reads of a variable $l$ may cause a delay in $\nc$'s execution, because $\nc$ has to wait for the cell writing $l$ to finish execution. However, the correctness of the schedule is not affects by that.
\textbf{(ii)} A write of $l$ that does not materialize, causes inbound edges with the corresponding label to be redirected to the preceding cell to write $l$. Cells dependent on $\nc$'s version of $l$ could not have been started yet, so the schedule is still valid.
\textbf{(i)} When a read does not materialize during $\nc$'s execution, we remove the corresponding edge from the dependency graph. Such spurious reads of a variable $l$ may cause a delay in $\nc$'s execution, because $\nc$ has to wait for the cell writing $l$ to finish execution. However, the correctness of the schedule is not affected.
\textbf{(ii)} A write of $l$ that does not materialize causes inbound edges with the corresponding label to be redirected to the most recent preceding cell that writes $l$. Cells dependent on $\nc$'s version of $l$ could not have started yet, so the schedule is still valid.
\textbf{(iii)} A missed read that materializes during \nc's execution adds a new edge to the dependency graph.
If the edge leads to a cell $\nc'$ in the \spending state, the read operation may block until the writing cell has completed.
This state is less desirable, as we may have already allocated resources for the blocked cell $\nc$, which may lead to resource starvation.
@ -59,7 +59,7 @@ For any notebook $\nb$ and approximated dependency graph $\dga$ for $\nb$, the e
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Incremental Re-execution}
Vizier automatically refreshes dependent cells when a cell $\nc$ is modified by the user using incremental re-execution which avoids re-execution of cells whose output will be the same in the modified notebook. For that, the modified cell $\nc$ is put into \spending state. Furthermore, all cells that depend on $\nc$ directly or indirectly are also put into \spending state. That is, we memorize a cells actual dependencies from the previous execution and initially assume that the dependency graph will be the same as in the previous execution. The exception is the modified cell for which we statically approximate provenance from scratch. During the execution of the modified cell or one of its dependencies we may observe changes to the read and write set of a cell. We compensate for that using the repair actions described above.
When the user modifies a cell $\nc$, Vizier automatically refreshes dependent cells using incremental re-execution, which avoids re-executing cells whose output would be unchanged in the modified notebook.
For that, the modified cell $\nc$ is put into the \spending state, as are all cells that depend on $\nc$ directly or indirectly.
That is, we memorize each cell's actual dependencies from the previous execution and initially assume that the dependency graph will be the same as in that execution; the exception is the modified cell, for which we statically approximate provenance from scratch.
During the execution of the modified cell or one of its dependents, we may observe changes to a cell's read and write sets, which we compensate for using the repair actions described above.
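A minimal sketch of the invalidation step, reusing the dependency-edge representation assumed above (remembered edges from the previous run; the runtime repair actions are not shown):
\begin{verbatim}
def invalidate(modified, deps, state):
    """Put the modified cell and everything that (transitively) reads from
    it back into the pending state; untouched cells keep their results."""
    state[modified] = "pending"
    frontier = [modified]
    while frontier:
        writer = frontier.pop()
        for (reader, w, _label) in deps:
            if w == writer and state[reader] != "pending":
                state[reader] = "pending"
                frontier.append(reader)
    return state

deps = {("cell2", "cell1", "x"), ("cell3", "cell2", "y")}
state = {"cell1": "done", "cell2": "done", "cell3": "done"}
print(invalidate("cell1", deps, state))
# -> {'cell1': 'pending', 'cell2': 'pending', 'cell3': 'pending'}
\end{verbatim}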
% When the user schedules a cell for partial re-execution (e.g., to retrieve new input data), we would like to avoid re-executing dependent cells that will produce identical outputs.
% The cell(s) scheduled for re-execution are moved to the \spending state.