%!TEX root=../main.tex
In this section, we overview \systemname's provenance model and how it is used to support parallel code execution and incremental updates.
We define the baseline notebook semantics as a serial execution of the cells in notebook order, and refer to the set of symbols imported or exported by each cell as the cell's read and write sets, respectively.
We define a correct execution in terms of view serializability: any reordering of the cells (e.g., to parallelize execution) that preserves data flow dependencies is correct.
As a simplification for this preliminary work, we assume that cell execution is atomic and idempotent: we are allowed to freely interrupt the execution of a cell, or to restart it.
\tinysection{Naive Scheduling}
Let $N$ denote a notebook, a sequence of cells $[c_1, \ldots, c_n]$.
Assume, initially, that for each cell $c_i \in N$ we are given exact read and write sets ($\mathcal R(c_i)$ and $\mathcal W(c_i)$ respectively).
We define the notebook's data dependency graph $(N, D)$ by a set of labelled edges $(c_r, c_w, \ell) \in D$ as follows:
\begin{multline*}
D = \{\; (c_r, c_w, \ell) \;|\; c_r,c_w \in N, \ell \in \mathcal R(c_r), \ell \in \mathcal W(c_w), w < r, \\
\not \exists c_{w'} \in N \text{ s.t. }w < w' < r, \ell \in \mathcal W(c_{w'}) \;\}
\end{multline*}
An edge labelled $\ell$ exists from any cell $c_r$ that reads symbol $\ell$ to the most recent preceding cell that writes symbol $\ell$.
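
As an illustrative sketch only (not \systemname's implementation), the definition of $D$ can be computed in a single pass over the notebook by remembering the latest writer of each symbol.
The list-of-sets representation, the 1-based cell indices, and the name \texttt{build\_dependency\_graph} are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: cell c_i (1-based) has reads[i-1] and writes[i-1].
Edge = Tuple[int, int, str]  # (reader index r, writer index w, symbol l)


def build_dependency_graph(reads: List[Set[str]], writes: List[Set[str]]) -> Set[Edge]:
    """Link every read to the most recent preceding cell that writes the symbol."""
    edges: Set[Edge] = set()
    last_writer: Dict[str, int] = {}  # symbol -> index of the latest writer seen so far
    for i, (r_set, w_set) in enumerate(zip(reads, writes), start=1):
        # Resolve reads against writers strictly before cell i (w < r).
        for sym in r_set:
            if sym in last_writer:
                edges.add((i, last_writer[sym], sym))
        # Record this cell's writes afterwards, so a cell never depends on itself.
        for sym in w_set:
            last_writer[sym] = i
    return edges


# Example: c1 writes x; c2 reads x and writes y; c3 rewrites x; c4 reads x and y.
reads = [set(), {"x"}, set(), {"x", "y"}]
writes = [{"x"}, {"y"}, {"x"}, set()]
D = build_dependency_graph(reads, writes)
```

In this example, $c_4$'s read of \texttt{x} is linked to $c_3$ rather than $c_1$, because $c_3$ is the most recent preceding writer.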
Given such a graph, scheduling is trivial.
Denote by $\mathcal S(c) \in \{ \text{PENDING}, \text{DONE} \}$ the state of a cell (i.e., \text{DONE} after it has completed execution); a cell $c$ can be scheduled for execution once every cell that it reads from is \text{DONE}:
\[ \forall (c, c_w, \ell) \in D : \mathcal S(c_w) = \text{DONE} \]
It can be trivially shown that this scheduling approach guarantees conflict equivalence between the execution and notebook orders.
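
The scheduling condition above can be sketched as follows, under the simplifying assumption that all cells released together finish before the next check.
The function names and the wave-by-wave loop are hypothetical; a real scheduler would dispatch ready cells to kernels as they become free.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int, str]  # (reader index, writer index, symbol), as in D


def ready_cells(edges: Set[Edge], state: Dict[int, str]) -> List[int]:
    """PENDING cells for which every upstream writer is DONE."""
    blocked = {r for (r, w, _) in edges if state[w] != "DONE"}
    return sorted(c for c, s in state.items() if s == "PENDING" and c not in blocked)


def run_in_waves(edges: Set[Edge], n_cells: int) -> List[List[int]]:
    """Release cells in waves; the cells within one wave could run in parallel."""
    state = {c: "PENDING" for c in range(1, n_cells + 1)}
    waves: List[List[int]] = []
    while any(s == "PENDING" for s in state.values()):
        wave = ready_cells(edges, state)
        waves.append(wave)
        for c in wave:
            state[c] = "DONE"  # stand-in for actually executing the cell
    return waves


# Dependency graph of a four-cell example: c2 reads x from c1, c4 reads x from c3 and y from c2.
D = {(2, 1, "x"), (4, 3, "x"), (4, 2, "y")}
waves = run_in_waves(D, 4)
```

Because every edge points from a reader to an earlier cell, the earliest \text{PENDING} cell is always ready and the loop terminates.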
\tinysection{Approximating Dependencies}
A cell's read and write sets cannot be known exactly until runtime, but can often be approximated.
In principle, a dependency graph could be created from these approximations and the scheduling could proceed as in the naive approach.
However, when a cell executes and its exact read and write sets become known, the notebook's dependency graph changes.
\systemname's \emph{dependency graph must be dynamic}.
There are four possible errors that we now consider in turn.
Here, we are primarily concerned with ensuring correctness (i.e., serializability) of the notebook execution in the presence of approximation errors, even at the cost of performance.
A false positive read (a symbol in the approximate read set that is never accessed by the cell) corresponds to an unnecessary edge in the dependency graph.
The presence of this edge in the dependency graph is a performance problem, as it may have been possible to execute the cell earlier, but does not result in a correctness error.
A false positive write (a symbol in the approximate write set that is never modified by the cell) corresponds to zero or more edges that need to be redirected to an earlier cell.
Modulo errors in the notebook (i.e., undefined variables), reads on the symbol instead reference the output of a preceding cell.
All dependent cells are blocked until the prediction error is discovered, so as with a false positive read, there is no potential for correctness errors.
However, a cell that could have been scheduled if the symbol had been written may now need to block until a preceding cell completes.
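
The edge redirection for a false positive write can be sketched as below.
This is a hypothetical helper rather than \systemname's code: it assumes edges are stored as (reader, writer, symbol) triples over 1-based cell indices and that \texttt{approx\_writes} holds the approximate write sets in notebook order.

```python
from typing import List, Optional, Set, Tuple

Edge = Tuple[int, int, str]  # (reader index, writer index, symbol)


def redirect_false_positive_write(
    edges: Set[Edge], approx_writes: List[Set[str]], w: int, sym: str
) -> Set[Edge]:
    """Cell w was predicted to write sym but did not: repoint its readers."""
    # Most recent cell before w that is (still) expected to write sym, if any.
    earlier: Optional[int] = next(
        (i for i in range(w - 1, 0, -1) if sym in approx_writes[i - 1]), None
    )
    redirected: Set[Edge] = set()
    for (r, src, s) in edges:
        if (src, s) != (w, sym):
            redirected.add((r, src, s))      # unaffected edge
        elif earlier is not None:
            redirected.add((r, earlier, s))  # reader now depends on the earlier writer
        # Otherwise no earlier writer exists: the read is an undefined-symbol
        # error in the notebook itself, so the edge is simply dropped here.
    return redirected


# c3 was predicted to write x but did not; c4's read of x falls back to c1.
approx_writes = [{"x"}, {"y"}, {"x"}, set()]
D = {(2, 1, "x"), (4, 3, "x"), (4, 2, "y")}
D_fixed = redirect_false_positive_write(D, approx_writes, w=3, sym="x")
```
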
A false negative read (a symbol not in the approximate read set that is read by the cell) corresponds to an edge missing from the dependency graph.
Since the artifact is not available to the cell until the read is triggered, correctness errors can be avoided by blocking the cell's execution until the cell that it depends on completes.
We note that any resources allocated to the kernel cannot be released until the cell is unblocked.
When a pool of kernels is used, this could potentially lead to starvation if the entire pool blocks.
A false negative write (a symbol not in the approximate write set that is written to by the cell) corresponds to zero or more edges that need to be redirected to a later cell.
This is only a correctness error if one of the dependent cells has already been scheduled or completed; if so, that dependent cell must be aborted and rescheduled after the current cell completes.
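
A minimal sketch of detecting the cells affected by a false negative write follows.
The helper name, the \texttt{started} set (cells already scheduled or completed), and the list-of-sets encoding are assumptions for illustration; the sketch does not handle cells that consumed the output of an aborted cell.

```python
from typing import List, Set, Tuple


def handle_false_negative_write(
    reads: List[Set[str]],
    writes: List[Set[str]],
    started: Set[int],
    w: int,
    sym: str,
) -> Tuple[List[int], List[int]]:
    """Cell w (1-based) unexpectedly wrote sym.

    Returns (readers whose edge for sym must now point to w,
             the subset of those that must be aborted and rescheduled).
    """
    affected: List[int] = []
    for r in range(w + 1, len(reads) + 1):
        if sym in reads[r - 1]:
            affected.append(r)  # r reads the value that w just produced
        if sym in writes[r - 1]:
            break               # a later writer shadows w for all subsequent cells
    to_abort = [r for r in affected if r in started]
    return affected, to_abort


# c3 unexpectedly writes x. c4 reads x (and rewrites it); c5 reads c4's x.
reads = [set(), {"x"}, set(), {"x"}, {"x"}]
writes = [{"x"}, set(), set(), {"x"}, set()]
affected, to_abort = handle_false_negative_write(
    reads, writes, started={1, 3, 4}, w=3, sym="x"
)
```

Here only $c_4$ is affected, and because it had already been scheduled it must be rerun after $c_3$ completes; $c_5$ still depends on $c_4$ and needs no redirection.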
In this preliminary work, we focus on workloads where we can guarantee that we have an upper bound on the read and write sets (i.e., no false negatives).
In this setting, approximation errors can lead to poor performance, but not correctness errors.
\tinysection{Incremental Re-execution}
Users may trigger the re-execution of one or more cells (e.g., to retrieve new input data).
When this happens, the re-executed cells and all of their descendants in the dependency DAG enter the \text{PENDING} state and execution proceeds mostly as above.
All cells that are not descendants of an updated cell remain in the \text{DONE} state, and their output artifacts are assumed to be re-usable from the cell's prior execution.
During a partial re-execution of the workflow, each cell's actual dependencies from its prior execution are also available.
When computing the dependency DAG, the exact read dependencies recorded during each cell's prior execution are used.
This is safe under the assumption that the cell's behavior is deterministic --- that the cell's execution trace is unchanged if it is re-executed with the same inputs.
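
The invalidation step of incremental re-execution described above can be sketched as a breadth-first traversal from the updated cells along reader edges.
The function name and the state dictionary are hypothetical, and edges are again assumed to be (reader, writer, symbol) triples.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[int, int, str]  # (reader index, writer index, symbol)


def mark_pending(edges: Set[Edge], state: Dict[int, str], updated: Iterable[int]) -> Dict[int, str]:
    """Return a new state map with updated cells and all their descendants PENDING."""
    # Invert the graph: for each writer, the cells that read from it.
    readers_of: Dict[int, Set[int]] = {}
    for r, w, _ in edges:
        readers_of.setdefault(w, set()).add(r)

    seen = set(updated)
    queue = deque(seen)
    while queue:
        c = queue.popleft()
        for r in readers_of.get(c, ()):
            if r not in seen:
                seen.add(r)
                queue.append(r)

    new_state = dict(state)  # cells outside `seen` keep their DONE state and artifacts
    for c in seen:
        new_state[c] = "PENDING"
    return new_state


D = {(2, 1, "x"), (4, 3, "x"), (4, 2, "y")}
all_done = {c: "DONE" for c in range(1, 5)}
after_update = mark_pending(D, all_done, updated=[2])
```

Re-executing $c_2$ in this example invalidates $c_2$ and its reader $c_4$, while $c_1$ and $c_3$ remain \text{DONE}.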