We present the internal representation first, and later discuss how Jupyter notebooks are translated into this model in \Cref{sec:import}.
A \systemname notebook is an ordered list of isolated, atomic units of work called \emph{cells}.
A cell is \emph{isolated} --- \systemname assumes that each can be safely executed in a fresh Python interpreter.
To manage inter-cell dependencies, information flow between cells must be explicitly declared.
All communication between cells is mediated by a \emph{scope}, a partial mapping from variable \emph{symbols} to \emph{artifacts} (e.g., Python literals).
As a cell runs, it generates artifacts that are serialized and persisted by \systemname; the cell may then assign an artifact to one or more symbols in the scope --- we denote this as a \emph{write} to the symbol.
When a subsequent cell \emph{reads} a symbol, \systemname provides it with the corresponding artifact.
We refer to the set of symbols read (resp., written) by a cell as the cell's read- (resp., write-)set.
We describe both sets together as the cell's dependencies.
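A minimal Python sketch of this scope abstraction (the \texttt{Scope} class and its methods are our own illustration, not \systemname's actual interface):

```python
import pickle

# Illustrative sketch (not systemname's API): a scope maps symbols to
# serialized artifacts; cells communicate only through reads and writes.
class Scope:
    def __init__(self):
        self._artifacts = {}                            # symbol -> bytes

    def write(self, symbol, value):
        self._artifacts[symbol] = pickle.dumps(value)   # persist artifact

    def read(self, symbol):
        return pickle.loads(self._artifacts[symbol])

scope = Scope()
scope.write("rows", [1, 2, 3])           # an upstream cell's write-set: {rows}
assert scope.read("rows") == [1, 2, 3]   # a later cell's read-set: {rows}
```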
A cell is also \emph{atomic} --- \systemname assumes that cells can be safely executed in any order, or even multiple times, so long as each cell's reads return the correct artifacts.
Python is a Turing-complete language, and it is impossible to predict the \emph{actual} set of reads and writes that the cell will perform without running it.
Instead, \systemname maintains two collections of dependencies.
The first set, the dependency \emph{bounds}, is an upper bound on the read- and write-sets.
A cell must declare every symbol that it can read or write in its dependency bounds.
We note that we do not expect users to explicitly provide the dependency bounds --- they are computed through best-effort static analysis as described in \Cref{sec:import}.
The second set, the actual dependencies, results from instrumenting the cell's execution.
This is exactly the set of symbols that the cell read or wrote during its most recent execution.
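To make the distinction concrete, consider a toy cell with a data-dependent branch (the tracking code below is our own illustration of the kind of instrumentation involved):

```python
# Dependency bounds (from static analysis) over-approximate both branches
# of the toy cell body `out = a if flag else b`:
read_bounds  = {"flag", "a", "b"}
write_bounds = {"out"}

# Actual dependencies come from instrumenting one concrete execution:
scope = {"flag": True, "a": 1, "b": 2}
actual_reads = set()

def tracked_get(sym):                 # toy stand-in for instrumentation
    actual_reads.add(sym)
    return scope[sym]

out = tracked_get("a") if tracked_get("flag") else tracked_get("b")
actual_writes = {"out"}

assert actual_reads == {"flag", "a"}  # "b" was never read on this run
assert actual_reads <= read_bounds    # the bound soundly over-approximates
```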
Identifying opportunities for parallelism requires converting the provided sequence of cells into a partial order.
\Cref{alg:buildDag} accomplishes this by simulating sequential execution of the workflow to create a directed acyclic graph (DAG) from the workflow's dependency bounds.
As an optimization, \systemname only checks whether a cell $c$ is \textbf{runnable} when another cell with an in-edge to $c$ is marked as \textbf{done}.
As a further optimization, the dependency DAG is pruned when a cell finishes executing.
A cell need not write every symbol in its dependency bounds.
As noted above, cell execution is instrumented and the actual write-set is collected.
When cell $c$ completes, the dependency DAG is recomputed based on its \emph{actual} write set rather than its write bounds.
Each out-edge $(c, c', x)$ for a symbol $x$ not in the actual write set will be redirected to an earlier cell in the workflow.
This redirection is safe, because we are only interested in whether the cell's read dependencies are available, and not the specific artifact identified by $x$.
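The construction and pruning can be sketched as follows, assuming each read resolves to the most recent preceding writer (\texttt{build\_dag} and its data layout are our own simplification of \Cref{alg:buildDag}):

```python
# Simplified sketch of DAG construction: walk cells in notebook order,
# tracking the last writer of each symbol; every read in a cell's bounds
# becomes an in-edge from that writer.
def build_dag(cells):
    last_writer = {}                      # symbol -> cell name
    edges = set()                         # (producer, consumer, symbol)
    for name, reads, writes in cells:
        for r in reads:
            if r in last_writer:
                edges.add((last_writer[r], name, r))
        for w in writes:
            last_writer[w] = name
    return edges

cells = [("c1", set(), {"x", "y"}),       # (name, read bounds, write bounds)
         ("c2", {"x"}, {"x"}),
         ("c3", {"x", "y"}, set())]
edges = build_dag(cells)
assert ("c2", "c3", "x") in edges         # c3 reads the x rewritten by c2
assert ("c1", "c3", "y") in edges
```

If \texttt{c2} then finished without actually writing \texttt{x}, the edge from \texttt{c2} to \texttt{c3} on \texttt{x} would be redirected to \texttt{c1}, the previous writer of \texttt{x} --- exactly the pruning described above.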
Users may trigger the re-execution of one or more cells (e.g., to retrieve new input data).
When this happens, the re-executed cells and all of their descendants in the dependency DAG enter the \textbf{waiting} state and execution proceeds mostly as above.
All cells that are not descendants of an updated cell remain in the \textbf{done} state, and their output artifacts are assumed to be re-usable from the cell's prior execution.
During a partial re-execution of the workflow, the cell's actual dependencies from the prior execution are also available.
When computing the dependency DAG, the \emph{actual} read dependencies from the cell's prior execution are used.
This is safe under the assumption that the cell's behavior is deterministic --- that the cell's execution trace is unchanged if it is re-executed with the same inputs.
However, a violation of this assumption does not pose a correctness problem, and by the time it is discovered the cell will have already finished running.
The primary challenge is thus coping with reads and writes that are not in the predicted read- or write-set, respectively.
In either case, the change may add or remove edges to/from the dependency DAG, requiring re-execution of running or even already completed cells.
To streamline updates to the dependency DAG, \systemname caches the virtual input scope ($\mathcal S$) computed for each cell by \Cref{alg:buildDag} (denote by $\mathcal S_c$ the scope emitted by the cell prior to $c$).
\State{$\mathcal G \leftarrow\mathcal G -\{\;c \rightarrow*\;\}\cup\{\;(c, \mathcal S[r])\;|\;r \in c.\texttt{reads}\;\}$}
\EndIf
\If{$w \in c.\texttt{writes}$}
\Return
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{\texttt{abort}$(\mathcal N, c_0)$}
\label{alg:abort}
\begin{algorithmic}[1]
\Require{$\mathcal N$: A sequence of cells}
\Require{$c_0\in\mathcal N$: A cell to abort}
\If{$c_0$.\texttt{running}}
\State{Abort $c_0$'s execution}
\ElsIf{$c_0$.\texttt{done}}
\State{Clear $c_0$'s results}
\For{$c$ s.t. $(c_0, c)\in\mathcal G$}
\State{\texttt{abort}$(\mathcal N, c)$}
\EndFor
\EndIf
\end{algorithmic}
\end{algorithm}
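For concreteness, \Cref{alg:abort} translates almost line-for-line into Python (the state encoding and dictionary layout below are our own):

```python
# Our own transcription of alg:abort: a running cell is stopped; a done
# cell has its results cleared and its DAG successors recursively aborted.
def abort(state, out_edges, c0):
    if state[c0] == "running":
        state[c0] = "waiting"             # abort c0's execution
    elif state[c0] == "done":
        state[c0] = "waiting"             # clear c0's results
        for c in out_edges.get(c0, ()):   # (c0, c) in the dependency DAG
            abort(state, out_edges, c)

state = {"c1": "done", "c2": "done", "c3": "running"}
out_edges = {"c1": ["c2"], "c2": ["c3"]}
abort(state, out_edges, "c1")
assert all(s == "waiting" for s in state.values())
```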
An unpredicted write from cell $c_0$ on a symbol $w$ may redirect edges in the dependency DAG from another cell to $c_0$.
The process is summarized in \Cref{alg:updateDag}, which iterates through every cell following $c_0$ in order.
The cached scope for each cell is updated (line 2), removing the prior write to $w$ and adding $c_0$'s write.
If a cell reads the unpredicted $w$, the dependency DAG must be updated: all out-edges from the cell are removed and recomputed (line 6).
If the cell was already running or had already completed (lines 4-5), its execution is stopped, its results are cleared, and any running dependents are recursively aborted (\Cref{alg:abort}).
The iteration stops at the first cell (if one exists) to overwrite $w$ (line 7), since cells after it are unaffected by $c_0$'s write.
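A minimal sketch of this propagation under our own simplified data layout (cached scopes as symbol-to-writer maps, with a caller-supplied handler standing in for the edge recomputation and aborts):

```python
# Propagate an unpredicted write of symbol w by cell c0 through the
# cells that follow it, stopping at the first cell that overwrites w.
def propagate_write(cells_after, cached_scope, c0, w, on_conflict):
    for cell in cells_after:                  # in notebook order
        cached_scope[cell["name"]][w] = c0    # c0 is now w's last writer
        if w in cell["reads"]:
            on_conflict(cell["name"])         # recompute edges / abort
        if w in cell["writes"]:
            return                            # w overwritten: stop here

conflicts = []
scopes = {"c2": {}, "c3": {}, "c4": {}}
cells = [{"name": "c2", "reads": {"w"}, "writes": set()},
         {"name": "c3", "reads": set(), "writes": {"w"}},
         {"name": "c4", "reads": {"w"}, "writes": set()}]
propagate_write(cells, scopes, "c1", "w", conflicts.append)
assert conflicts == ["c2"]        # only c2 needs re-execution
assert scopes["c4"] == {}         # c4 is shielded by c3's overwrite
```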