%!TEX root=../main.tex
A typical notebook like Jupyter maintains a single Python interpreter instance (called a kernel).
Code from each cell is executed sequentially in this interpreter.
Python, its libraries, and other dependencies are designed under the assumption of a global interpreter lock (GIL) that permits only a single thread to run at a time.
The single-kernel approach of Jupyter is thus unsuitable for use with \systemname.
Instead, \systemname defaults to running each cell's code in a freshly allocated interpreter instance.
However, running each cell in its own interpreter presents a problem for passing state between cells.
Under normal circumstances, data flow between cells in a notebook occurs through the global namespace, a dictionary of key-value pairs.
The namespace holds everything from variables to function, class, and module definitions, as well as symbols imported from other files.
When cells are run in a single interpreter, the global namespace is preserved between cells.
In \systemname, the interpreter running a given cell must be able to reconstruct the global namespace, or at least the subset of it needed to run the cell.
A naive solution would rely exclusively on object serialization, for example via Python's native \texttt{pickle} library.
In this scenario, when a cell finishes executing, all elements of its global namespace are serialized and exported as artifacts into \systemname.
Conversely, when a cell accesses a symbol in the global namespace that is not already present, the corresponding artifact can be retrieved and the object deserialized.\footnote{\systemname implements on-demand deserialization by using placeholder ``proxy'' objects.}
This naive solution presents three primary challenges.
First, the global namespace may contain elements that are only used within a cell; serializing and exporting these costs time and wastes storage and memory.
Second, generic serialization mechanisms (like \texttt{pickle}) can be slow, and may not support every value (e.g., stateful values like \texttt{IO} handles, or function definitions).
Finally, exported values may depend on other values in the global namespace (e.g., an exported function that uses the \texttt{global} keyword).
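The naive scheme, and the second challenge in particular, can be illustrated with a short Python sketch.
This is not \systemname's implementation: the helpers \texttt{export\_namespace} and \texttt{import\_symbol}, and the in-memory dictionary standing in for the artifact store, are hypothetical names introduced only for this example.

```python
import os
import pickle


def export_namespace(namespace, store):
    """Pickle every user-defined global into `store` (a stand-in artifact store).

    Returns the names whose values could not be serialized.
    """
    failed = []
    for name, value in namespace.items():
        if name.startswith("__"):  # skip interpreter-internal entries
            continue
        try:
            store[name] = pickle.dumps(value)
        except (pickle.PicklingError, TypeError, AttributeError):
            failed.append(name)
    return failed


def import_symbol(name, store):
    """Rebuild one global from its serialized artifact."""
    return pickle.loads(store[name])


# A namespace as it might look after a cell has run (illustrative values).
log = open(os.devnull, "w")
namespace = {
    "rows": [1, 2, 3],           # plain data: pickles without trouble
    "log": log,                  # open file handle: pickle raises TypeError
    "square": lambda v: v * v,   # lambda: pickle cannot locate it by name
}

store = {}
failed = export_namespace(namespace, store)
log.close()

rows = import_symbol("rows", store)
```

In this sketch only \texttt{rows} reaches the store; the file handle and the lambda end up in \texttt{failed}, which is the situation the specialized handling discussed later has to address.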
\paragraph{Limiting Serialization}
\systemname uses the translation step to avoid the overhead of serializing the entire global namespace.
As we discuss in more detail in \Cref{sec:import}, the import process predicts the list of artifacts that the cell will produce or consume.
By default, only artifacts included in this list are serialized.
As before, however, mispredictions must not lead to incorrect outputs.
Also as before, over-prediction is never a problem; our focus is on entries in the global namespace that a subsequent cell accesses even though they were not exported.
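The following Python sketch shows one way to restrict serialization to the predicted artifacts while still recording what was left behind.
It rests on several assumptions: \texttt{run\_cell} is a hypothetical helper, the cell runs via \texttt{exec} in a fresh dictionary, a plain dictionary stands in for the artifact store, and writes are detected by comparing the namespace before and after execution (in-place mutation of an input is therefore not detected).

```python
import pickle


def run_cell(source, inputs, predicted_writes, store):
    """Run `source` in a fresh namespace and export only predicted artifacts.

    Returns the names the cell wrote that were NOT predicted, so that a
    later read of one of them can be recognized as a misprediction.
    """
    namespace = dict(inputs)
    exec(source, namespace)

    # A name counts as written if it is new or now bound to a different object.
    written = {
        name
        for name, value in namespace.items()
        if not name.startswith("__")
        and (name not in inputs or inputs[name] is not value)
    }

    # Only predicted artifacts are serialized and exported.
    for name in written & predicted_writes:
        store[name] = pickle.dumps(namespace[name])

    return written - predicted_writes


store = {}
cell_source = "tmp = list(range(4))\ntotal = sum(tmp)\n"
unexported = run_cell(cell_source, inputs={}, predicted_writes={"total"}, store=store)
```

Here \texttt{tmp} is used only inside the cell and is never serialized; it appears in \texttt{unexported}, which is the set a runtime would consult if a later cell unexpectedly read it.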
\paragraph{Specialized Migration}
\begin{itemize}
\item Default to pickling
\item Provide a special-case bypass (e.g., DataFrames via Arrow/Parquet)
\item Mutable state (i.e., figuring out which variables to export)
\item Side-effecting variables:
\begin{itemize}
\item Record the characteristics of the endpoint if possible (e.g., a file) and re-open it later?
\item This is not something that I'd expect to see pipelined through multiple cells... maybe mark the value as an unpredicted read/write dependency?
\item Might be a workaround to allow checkpointed state
\end{itemize}
\item Functions and Classes
\begin{itemize}
\item Here, the code needs to be imported explicitly
\item (\texttt{cloudpickle} allows serializing functions; can we do that here as well?)
\item Note that the exported function may introduce chained dependencies that also need to be registered in cells that use the function. These dependencies may be on cells that appear later in the notebook (which means we need to extend the misprediction cases).
\end{itemize}
\end{itemize}
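The first, second, and fourth items above (default to pickling, special-case bypasses, and re-opening a recorded endpoint) can be sketched together in Python.
This is only an illustration under stated assumptions: \texttt{encode} and \texttt{decode} are hypothetical helpers, only text-mode handles opened from a path are special-cased, and the file is assumed to still exist, unchanged, when it is re-opened.
A DataFrame bypass via Arrow or Parquet would slot in as another branch; it is omitted to keep the sketch free of third-party dependencies.

```python
import io
import os
import pickle
import tempfile


def encode(value):
    """Return a (codec, payload) pair; fall back to pickle by default."""
    if isinstance(value, io.TextIOWrapper) and isinstance(value.name, str):
        # Special case: record the endpoint instead of the handle itself.
        value.flush()
        state = {
            "path": os.path.abspath(value.name),
            "mode": value.mode,
            "offset": value.tell(),
        }
        return "file", pickle.dumps(state)
    return "pickle", pickle.dumps(value)


def decode(codec, payload):
    """Rebuild a value from the pair produced by `encode`."""
    if codec == "file":
        state = pickle.loads(payload)
        mode = state["mode"]
        if mode[0] in "wx":
            mode = "r+"  # never truncate or re-create the file on re-open
        handle = open(state["path"], mode)
        handle.seek(state["offset"])
        return handle
    return pickle.loads(payload)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as f:
        f.write("alpha\nbeta\n")

    handle = open(path)                # "cell 1" opens the file ...
    first = handle.readline()          # ... and consumes the first line
    codec, payload = encode(handle)
    handle.close()

    restored = decode(codec, payload)  # "cell 2" resumes where cell 1 stopped
    second = restored.readline()
    restored.close()

plain = decode(*encode({"a": 1}))      # ordinary values take the pickle path
```

Recording path, mode, and offset keeps the payload small, but it silently assumes that nothing else modifies the file between cells; marking the value as an unpredicted read/write dependency, as suggested above, would be the more conservative choice.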
\paragraph{Chained Dependencies}
TODO
\paragraph{Interpreter Re-Use}
\begin{itemize}
\item Cells that re-use the same state already have a mutual dependency. We may be able to re-use the same interpreter.
\item We might also be able to re-use Python state with some sort of fork/join trickery.
\item Q: is this something we can pull off here, or is this strictly future work?
\end{itemize}