%!TEX root=../main.tex
In this section, we discuss the execution of an individual cell.
Recall that in a classical notebook, a cell runs by evaluating its code in the kernel, a single long-running Python interpreter.
By contrast, we isolate cells by executing each in a freshly allocated Python interpreter.
When running code imported from a classical computational notebook, the runtime must be able to reconstruct the global interpreter state, or at least the subset of it that the cell needs.
In this section, we assume inter-cell communication is explicit: each cell's code includes explicit instructions for interacting with this state.
We discuss how cells are instrumented to include these instructions in \Cref{sec:import}.
The global system state is managed by a monitor process, which maintains for each running cell a \emph{scope}: a mapping from variable names to serialized Python objects.
Under serial cell execution, each cell receives the scope emitted by the preceding cell and updates it in turn.
We discuss refinements to this model in \Cref{sec:scheduler}.
To give later cells access to a cell-local variable, a cell explicitly exports the variable by serializing it and sending it to the monitor, which writes it into the scope.
Likewise, a cell imports cell-local variables from the monitor by requesting a specific scope element by name.
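To make the protocol concrete, the following minimal sketch models the monitor's scope as a local dictionary; in the real system, this mapping lives in the monitor process and is reached over inter-process communication. The helper names \texttt{export\_var} and \texttt{import\_var} are illustrative, not part of our implementation.
\begin{verbatim}
import pickle

# Stand-in for the monitor's scope; in the real system this
# mapping lives in the monitor process and is reached over IPC.
scope = {}

def export_var(name, value):
    # Serialize the cell-local object and write it into the scope.
    scope[name] = pickle.dumps(value)

def import_var(name):
    # Request a scope element by name and deserialize it locally.
    return pickle.loads(scope[name])

# Cell 1, as instrumented:
x = 42
export_var("x", x)

# Cell 2, possibly running in a different interpreter:
x = import_var("x")
print(x + 1)  # -> 43
\end{verbatim}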
\tinysection{Optimizing Python Startup}
Allocating a fresh Python interpreter for each individual cell poses a significant performance problem.
Even under ideal circumstances, the Python interpreter can take nearly a full second to start.
This startup cost is especially problematic for cells that complete quickly.
We address this by maintaining a pool of pre-started Python workers.
When a cell begins executing, its code is shipped to an already running Python process.
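A minimal sketch of such a pool, built on Python's standard \texttt{multiprocessing} module, appears below; the pool size and dispatch policy are illustrative only.
\begin{verbatim}
import multiprocessing as mp

def worker(inbox):
    # An idle interpreter: block until cell code arrives, then run it.
    while True:
        code = inbox.get()
        if code is None:
            break  # shutdown signal
        exec(code, {})  # each cell gets a fresh global namespace

if __name__ == "__main__":
    # Pre-start workers so cells never pay interpreter startup cost.
    pool_size = 4  # illustrative sizing
    queues = [mp.Queue() for _ in range(pool_size)]
    workers = [mp.Process(target=worker, args=(q,)) for q in queues]
    for w in workers:
        w.start()

    # Ship a cell's code to an already-running interpreter.
    queues[0].put("print('cell finished')")

    for q in queues:
        q.put(None)
    for w in workers:
        w.join()
\end{verbatim}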
\tinysection{Serialization Special Cases}
A naive approach to state migration relies exclusively on object serialization, for example via Python's native \texttt{pickle} library.
In practice, several classes of objects require special handling.
\paragraph{Specialized Migration}
\begin{itemize}
\item Default to pickling (see the sketch following this list).
\item Provide a special-case bypass for specific types (e.g., dataframes via Arrow/Parquet).
\item Mutable state (i.e., determining which variables a cell must export).
\item Side-effecting variables:
\begin{itemize}
\item Where possible, record the characteristics of the endpoint (e.g., a file) and re-open it later?
\item We do not expect such values to be pipelined through multiple cells; perhaps mark the value as an unpredicted read/write dependency?
\item This might also offer a workaround for checkpointing state.
\end{itemize}
\item Functions and classes:
\begin{itemize}
\item Here, the code needs to be imported explicitly.
\item (\texttt{cloudpickle} supports serializing functions; we may be able to do the same here.)
\item Note that an exported function may introduce chained dependencies that must also be registered in cells that use the function. These dependencies may point to cells that appear later, which means we need to extend the misprediction cases.
\end{itemize}
\end{itemize}
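The following sketch illustrates the type-dispatched migration path: \texttt{pickle} is the default, and dataframes are diverted through Parquet (here via \texttt{pandas} with a \texttt{pyarrow} engine). The function names are illustrative.
\begin{verbatim}
import io
import pickle

def serialize(value):
    # Special-case bypass: dataframes travel as Parquet rather
    # than pickle, which is typically faster and more compact.
    try:
        import pandas as pd
        if isinstance(value, pd.DataFrame):
            buf = io.BytesIO()
            value.to_parquet(buf)  # needs a parquet engine, e.g. pyarrow
            return ("parquet", buf.getvalue())
    except ImportError:
        pass
    # Default: Python's native pickle.  cloudpickle could be
    # substituted here to also cover functions and closures.
    return ("pickle", pickle.dumps(value))

def deserialize(tag, payload):
    if tag == "parquet":
        import pandas as pd
        return pd.read_parquet(io.BytesIO(payload))
    return pickle.loads(payload)
\end{verbatim}
Tagging each payload with its encoding lets the importing cell select the matching decoder without any global coordination.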
\paragraph{Chained Dependencies}
TODO
\paragraph{Interpreter Re-Use}
\begin{itemize}
\item Cells that re-use the same state already have a mutual dependency; we may be able to re-use the same interpreter for them.
\item We might also be able to re-use Python state through some form of fork/join trickery, as in the sketch below.
\item Q: Is this something we can achieve here, or is it strictly future work?
\end{itemize}
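The following is a rough, POSIX-only sketch of the fork-based variant; it is exploratory rather than part of the current design, and the helper \texttt{run\_dependent\_cell} is hypothetical.
\begin{verbatim}
import os

def run_dependent_cell(cell_code, parent_globals):
    # POSIX-only: fork the current interpreter so the child inherits
    # the parent cell's heap (copy-on-write), skipping serialization.
    pid = os.fork()
    if pid == 0:
        exec(cell_code, parent_globals)
        os._exit(0)  # skip parent cleanup handlers in the child
    os.waitpid(pid, 0)

# Cell A leaves behind some state...
state = {"x": 41}
# ...and a dependent cell B reuses it without an export/import
# round trip through the monitor.
run_dependent_cell("print(x + 1)", state)
\end{verbatim}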