%!TEX root=../main.tex

Recall that in a classical notebook, a cell is run by evaluating its code in the kernel, a single running Python interpreter.
To facilitate parallel execution, as well as incremental updates, \systemname isolates cells by executing each in a fresh kernel.
Isolation, however, is not directly compatible with classical computational notebooks, for three reasons:
(i) Cells normally communicate through kernel state, precluding a cell executing in one kernel from accessing variables created in another;
(ii) Variables generated by one kernel may need to be accessed after the kernel has exited;
(iii) Naively implemented, isolation comes at an impractically high performance cost, since every cell pays for starting a new kernel.

\tinysection{Communication Model}
First, the runtime must be able to reconstruct the global interpreter state, or at least the subset of it needed to run a given cell.
We start with a simplified model in which inter-cell communication is explicit; we discuss converting Jupyter notebooks into this model in \Cref{sec:import}.
Concretely, for a variable defined in one cell (the writer) to be used in a subsequent cell (the reader): (i) the writer must explicitly export the variable into the global state, and (ii) the reader must explicitly import the variable from the global state.
\systemname provides setter and getter functions (respectively) on a global state variable for this purpose.
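As a minimal sketch of this explicit model, the following Python fragment uses a hypothetical \texttt{GlobalState} class with \texttt{set} and \texttt{get} methods. These names are illustrative assumptions, not the actual \systemname API, and both cells run in one process purely to keep the example short.

```python
class GlobalState:
    """Hypothetical stand-in for the global state shared between cells."""

    def __init__(self):
        self._symbols = {}

    def set(self, name, value):
        """Export: publish a cell-local value under a global name."""
        self._symbols[name] = value

    def get(self, name):
        """Import: fetch a value that an earlier cell exported."""
        if name not in self._symbols:
            raise KeyError(f"symbol {name!r} was never exported")
        return self._symbols[name]


def writer_cell(state):
    total = sum(range(10))
    state.set("total", total)      # (i) the writer explicitly exports


def reader_cell(state):
    total = state.get("total")     # (ii) the reader explicitly imports
    return total * 2


state = GlobalState()
writer_cell(state)
result = reader_cell(state)
```

Because every read and write goes through \texttt{state}, a runtime built this way can in principle observe which symbols each cell produces and consumes.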

\tinysection{State Serialization}
When a state variable is exported, it is serialized by the Python interpreter.
We refer to the serialized state as an \emph{artifact}.
The artifact is delivered to a central monitor process and bound to a name in the global state.
When a cell imports a symbol from the global state, it contacts the monitor to retrieve the artifact associated with the symbol.
The runtime deserializes the artifact and places it into the kernel-local state.
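The export and import path described above can be sketched as follows. The \texttt{Monitor} class and the two helper functions are hypothetical stand-ins (the monitor is modeled as an in-process dictionary rather than a separate process), and only the default \texttt{pickle} serialization is shown.

```python
import pickle


class Monitor:
    """Hypothetical in-process stand-in for the central monitor process."""

    def __init__(self):
        self._artifacts = {}   # global name -> serialized bytes (the artifact)

    def put(self, name, artifact):
        self._artifacts[name] = artifact

    def fetch(self, name):
        return self._artifacts[name]


def export_symbol(monitor, name, value):
    """Serialize a value into an artifact and deliver it to the monitor."""
    monitor.put(name, pickle.dumps(value))


def import_symbol(monitor, name, local_state):
    """Retrieve an artifact, deserialize it, and bind it in kernel-local state."""
    local_state[name] = pickle.loads(monitor.fetch(name))


monitor = Monitor()

# Writer kernel: export a variable.
original = {"rows": 3, "cols": [1, 2, 3]}
export_symbol(monitor, "config", original)

# Reader kernel: a separate local namespace that imports the variable.
reader_locals = {}
import_symbol(monitor, "config", reader_locals)
```

Note that the reader receives a deserialized copy of the value, not a reference to the writer's object, so later mutations in the writer are not visible unless the variable is exported again.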

By default, state is serialized with Python's standard \texttt{pickle} module. \systemname can also be extended with codecs for specialized types that \texttt{pickle} either does not support or handles inefficiently:
(i) Python code (e.g., import statements and function or class definitions) is exported as raw Python source and imported with \texttt{eval}.
(ii) Pandas dataframes are exported in Parquet format and exposed to subsequent cells by the monitor process through Apache Arrow direct access.
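One way to organize such codecs is a registry keyed by a codec tag, sketched below under stated assumptions. The names \texttt{CODECS}, \texttt{export\_artifact}, and \texttt{import\_artifact} are hypothetical, and the dataframe codec is omitted to keep the example free of third-party dependencies. The source codec in this sketch calls \texttt{exec}, because Python's \texttt{eval} applied to a source string accepts only expressions, not statements such as \texttt{def} or \texttt{import}.

```python
import pickle


def encode_pickle(value):
    return pickle.dumps(value)


def decode_pickle(artifact, namespace, name):
    namespace[name] = pickle.loads(artifact)


def encode_source(source_text):
    # Code travels as raw source text rather than as a pickled object.
    return source_text.encode("utf-8")


def decode_source(artifact, namespace, name):
    # Running the source defines its symbols directly in the namespace,
    # so the `name` argument is not needed here.
    exec(artifact.decode("utf-8"), namespace)


# Hypothetical registry: codec tag -> (encoder, decoder).
CODECS = {
    "pickle": (encode_pickle, decode_pickle),
    "source": (encode_source, decode_source),
}


def export_artifact(kind, value):
    encode, _ = CODECS[kind]
    return kind, encode(value)      # the tag travels with the artifact


def import_artifact(tagged_artifact, namespace, name):
    kind, artifact = tagged_artifact
    _, decode = CODECS[kind]
    decode(artifact, namespace, name)


SQUARE_SOURCE = "def square(x):\n    return x * x\n"

reader_namespace = {}
import_artifact(export_artifact("source", SQUARE_SOURCE), reader_namespace, "square")
import_artifact(export_artifact("pickle", [1, 2, 3]), reader_namespace, "values")
squares = [reader_namespace["square"](v) for v in reader_namespace["values"]]
```

Storing the codec tag alongside the artifact lets the importing side pick the matching decoder without inspecting the payload.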

One notable challenge is supporting transitive dependencies: one function, for example, may rely on a second function.
When the former function's symbol is exported (resp., imported), the latter function's symbol must also be exported (resp., imported).
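A minimal sketch of one way to compute such a transitive closure, assuming that the source text of every exportable symbol is available in a dictionary (here the hypothetical \texttt{definitions}). Dependencies are discovered by scanning each definition's syntax tree for the names it reads; this is a deliberate simplification and not necessarily how \systemname resolves dependencies.

```python
import ast


def referenced_names(source):
    """Return the names that a piece of source code reads."""
    tree = ast.parse(source)
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }


def export_closure(symbol, definitions):
    """List every symbol that must travel with `symbol`, dependencies first."""
    ordered, seen = [], set()

    def visit(name):
        # Skip names already handled and names that are not exportable
        # symbols (builtins, parameters, local variables).
        if name in seen or name not in definitions:
            return
        seen.add(name)
        for dependency in sorted(referenced_names(definitions[name])):
            visit(dependency)
        ordered.append(name)

    visit(symbol)
    return ordered


definitions = {
    "normalize": "def normalize(xs):\n    m = mean(xs)\n    return [x - m for x in xs]\n",
    "mean": "def mean(xs):\n    return total(xs) / len(xs)\n",
    "total": "def total(xs):\n    return sum(xs)\n",
    "unused": "def unused():\n    return 0\n",
}

closure = export_closure("normalize", definitions)

# Reader side: importing the closure makes `normalize` callable.
reader_namespace = {}
for name in closure:
    exec(definitions[name], reader_namespace)
centered = reader_namespace["normalize"]([1, 2, 3])
```

Marking a symbol as seen before visiting its dependencies keeps the traversal finite even when functions are recursive or mutually recursive.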

\tinysection{Optimizing Python Startup}
Starting a Python process is costly, taking anywhere from 600\,ms on a recent processor with an SSD to multiple seconds on a less powerful computer.
This start-up cost dominates the runtime of many short-lived cells, making it impractical to start Python kernels on demand.
Instead, \systemname maintains a pool of pre-allocated Python kernels.
When a cell begins executing, its code is shipped to an already running kernel, which executes the code and returns the response to the monitor process.
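The pooling idea can be sketched with the standard \texttt{subprocess} module. The class \texttt{KernelPool}, the JSON-lines protocol, and the convention that a cell reports its output in a variable named \texttt{result} are all assumptions made for illustration. Kernels in this sketch are reused with a fresh namespace per request, a detail that the description above does not specify.

```python
import json
import subprocess
import sys
from collections import deque

# Code run by each pre-started kernel: read one JSON request per line,
# execute the cell in a fresh namespace, and reply with one JSON line.
WORKER_SOURCE = """
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    namespace = {}
    exec(request["code"], namespace)
    print(json.dumps({"result": namespace.get("result")}), flush=True)
"""


class KernelPool:
    """Hypothetical pool of interpreters started before any cell runs."""

    def __init__(self, size):
        self._idle = deque(self._start_kernel() for _ in range(size))

    @staticmethod
    def _start_kernel():
        return subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def run(self, code):
        """Ship a cell's code to an already running kernel and await its reply."""
        kernel = self._idle.popleft()
        try:
            kernel.stdin.write(json.dumps({"code": code}) + "\n")
            kernel.stdin.flush()
            return json.loads(kernel.stdout.readline())["result"]
        finally:
            self._idle.append(kernel)

    def close(self):
        for kernel in self._idle:
            kernel.stdin.close()    # end of input terminates the worker loop
            kernel.wait()
            kernel.stdout.close()


pool = KernelPool(size=2)           # start-up cost is paid once, up front
answers = [pool.run("result = 6 * 7"), pool.run("result = sum(range(5))")]
pool.close()
```

Each call to \texttt{run} pays only for a pipe round trip, since the interpreter that serves it was started when the pool was created.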

In our current implementation, the kernel that runs a given cell is selected arbitrarily.
We note an opportunity for future work: artifacts created during cell execution or imported from the global state can be cached in the executing kernel.
If we know which artifacts a cell will use, we can prioritize kernels that have already loaded these artifacts.
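One possible selection heuristic is sketched below, under the assumption that the monitor tracks which artifact names each idle kernel has cached. The \texttt{idle\_kernels} mapping and the \texttt{choose\_kernel} function are hypothetical; the heuristic simply picks the idle kernel whose cache overlaps most with the artifacts the cell will read.

```python
def choose_kernel(idle_kernels, needed_artifacts):
    """Pick the idle kernel that already caches the most needed artifacts.

    idle_kernels maps a kernel id to the set of artifact names it has cached.
    Ties are broken in favor of the kernel listed first.
    """
    needed = set(needed_artifacts)
    return max(idle_kernels, key=lambda kernel_id: len(idle_kernels[kernel_id] & needed))


idle_kernels = {
    "kernel-0": {"df"},
    "kernel-1": {"df", "model"},
    "kernel-2": set(),
}

best = choose_kernel(idle_kernels, ["df", "model"])
fallback = choose_kernel(idle_kernels, ["unrelated"])
```

When no kernel holds a needed artifact, every candidate scores zero and the choice degenerates to an arbitrary one, which matches the current behavior.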