%!TEX root=../main.tex
Recall that in a classical notebook, a cell is run by evaluating its code in the kernel, a single running Python interpreter.
To facilitate parallel execution, as well as incremental updates, \systemname isolates cells by executing each in a fresh kernel.
We note that isolation is not directly compatible with classical computational notebooks:
(i) Cells normally communicate through shared kernel state, so a cell executing in one kernel cannot access variables created in another;
(ii) Variables generated by one kernel may need to be accessed after that kernel has exited;
(iii) Starting a fresh kernel for every cell comes at an impractically high performance cost.
\tinysection{Communication Model}
First, the runtime must be able to reconstruct the global interpreter state, or at least the subset of it needed to run the cell.
We start with a simplified model where inter-cell communication is explicit --- we discuss converting Jupyter notebooks into this model in \Cref{sec:import}.
Concretely, for a variable defined in one cell (the writer) to be used in a subsequent cell (the reader): (i) the writer must explicitly export the variable into the global state, and (ii) the reader must explicitly import the variable from the global state.
\systemname provides setter and getter functions (respectively) on a global state variable for this purpose.
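To make this model concrete, the following sketch shows a writer and a reader cell; \texttt{state.put} and \texttt{state.get} are illustrative names for the setter and getter rather than \systemname's exact API, and \texttt{load\_sales} is a hypothetical helper.
\begin{verbatim}
# Writer cell: compute a value and explicitly export it.
df = load_sales("q3.csv")
state.put("sales_df", df)     # export into the global state

# Reader cell (runs in a different kernel): explicitly import it.
df = state.get("sales_df")    # import from the global state
summary = df.describe()
\end{verbatim}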
\tinysection{State Serialization}
When a state variable is exported, it is serialized by the Python interpreter.
We refer to the serialized state as an \emph{artifact}.
The artifact is delivered to a central monitor process, and assigned to a name in the global state.
When a cell imports a symbol from the global state, it contacts the monitor to retrieve the artifact associated with the symbol.
The runtime deserializes the artifact and places it into the kernel-local state.
By default, state is serialized with Python's native \texttt{pickle} library, although \systemname can easily be extended with codecs for specialized types that are either unsupported by \texttt{pickle}, or for which it is inefficient:
(i) Python code (e.g., import statements and function or class definitions) is exported as raw Python code and imported with \texttt{eval}.
(ii) Pandas dataframes are exported in parquet format, and are exposed to subsequent cells by the monitor process through Apache Arrow direct access.
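A minimal sketch of this export/import path is shown below, assuming a hypothetical monitor client that exposes \texttt{put\_artifact} and \texttt{get\_artifact}; \systemname's actual codec machinery is more general.
\begin{verbatim}
import io, pickle
import pandas as pd

def export_symbol(monitor, name, value):
    # Pick a codec based on the value's type.
    if isinstance(value, pd.DataFrame):
        codec, payload = "parquet", value.to_parquet()
    else:
        codec, payload = "pickle", pickle.dumps(value)
    monitor.put_artifact(name, codec, payload)

def import_symbol(monitor, name):
    codec, payload = monitor.get_artifact(name)
    if codec == "parquet":
        return pd.read_parquet(io.BytesIO(payload))
    return pickle.loads(payload)
\end{verbatim}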
One notable challenge is the need to support transitive dependencies: one function, for example, may rely on a second function.
When the former function's symbol is exported (resp., imported), the latter symbol must also be exported (resp., imported).
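For instance, in the hypothetical cell below, exporting \texttt{summarize} alone is not enough: a reader that imports it would fail with a \texttt{NameError} unless \texttt{clean} is exported as well (the export calls reuse the illustrative \texttt{state.put} from above).
\begin{verbatim}
def clean(df):
    return df.dropna()

def summarize(df):               # transitively depends on clean()
    return clean(df).describe()

state.put("clean", clean)        # dependency must be exported too
state.put("summarize", summarize)
\end{verbatim}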
\tinysection{Optimizing Python Startup}
The Python process has a high start-up cost, ranging from roughly 600ms on a recent processor with an SSD to multiple seconds on less powerful hardware.
This start-up cost dominates the runtime of many short-lived cells, making it impractical to start Python kernels on demand.
Instead, \systemname maintains a pool of pre-allocated Python kernels.
When a cell begins executing, its code is shipped to an already running kernel, which executes the code and returns the response to the monitor process.
In our current implementation, the kernel that runs a given cell is chosen arbitrarily.
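A simplified sketch of this pooling strategy appears below; the \texttt{spawn\_kernel} factory and the kernel's \texttt{run} method are assumptions for illustration, and the selection policy is simply the next idle kernel.
\begin{verbatim}
import queue

class KernelPool:
    def __init__(self, size, spawn_kernel):
        # Pre-allocate `size` Python kernels before any cell runs.
        self.idle = queue.Queue()
        for _ in range(size):
            self.idle.put(spawn_kernel())

    def run_cell(self, code):
        kernel = self.idle.get()      # arbitrary choice: next idle kernel
        try:
            return kernel.run(code)   # ship code; wait for the response
        finally:
            self.idle.put(kernel)     # return kernel to the pool
\end{verbatim}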
We note an opportunity for future work: artifacts created during cell execution or imported from the global state can be cached in the executing kernel.
If we know which artifacts a cell will use, we can prioritize kernels that have already loaded these artifacts.