Some more notes

Oliver Kennedy 2021-08-14 16:49:55 -05:00

\begin{itemize}
\item Serial vs Parallel Runtime
\item Prediction accuracy
\item Tradeoffs between pruned and non-pruned export lists (including the cost of any pruning mispredictions)
\end{itemize}

- Isolating python executions (per-cell interpreters)
- Migrating state between cells
- imports (keeping track of variable types)
- Large blobs (is it worth it to re-use a python interpreter?)
- Serializing function declarations
%!TEX root=../main.tex
A typical notebook system like Jupyter maintains a single python interpreter instance (called a kernel).
Code from each cell is executed sequentially in this interpreter.
Python, its libraries, and other dependencies are designed under the assumption of a global interpreter lock (GIL) that permits only a single thread to execute at a time.
The single-kernel approach of Jupyter is thus unsuitable for use with \systemname.
Instead, \systemname defaults to running each cell's code in a freshly allocated interpreter instance.
However, running each cell in its own interpreter presents a problem for passing state between cells, which must now be migrated between interpreters.
Thus far, we have presented \systemname's scheduler abstractly, viewing each cell as producing and consuming arbitrary artifacts.
We now make this model more concrete: Identifier/artifact pairs in \systemname correspond to symbol/value pairs in the global python namespace\OK{No... this is going to make things too painful when it comes to pruning out unnecessary variables. We still want/need explicitly marked import/exports}.
This includes variables, as well as functions, classes, and symbols \texttt{import}ed from other files\OK{TODO: Confirm that this is a comprehensive list.}.
We now consider each of these categories in turn.
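The symbol/value correspondence above can be made concrete with a small sketch: run a cell's code in a fresh namespace seeded with its imported artifacts, then diff the namespace afterwards to recover the cell's exports. The name \texttt{run\_cell} and the identity-based rebinding check are illustrative assumptions, not \systemname's actual mechanism.

```python
# Illustrative sketch (not the real \systemname API): each cell runs in a
# fresh global namespace; exports are the symbols the cell created or rebound.

def run_cell(source, imported_artifacts):
    """Run cell code in a fresh namespace seeded with imported artifacts."""
    namespace = dict(imported_artifacts)
    before = set(namespace)
    exec(source, namespace)
    # Symbols that are new, or rebound to a different object, become exports.
    return {
        name: value
        for name, value in namespace.items()
        if name != "__builtins__"
        and (name not in before or value is not imported_artifacts.get(name))
    }

# Two toy cells: the second consumes the first cell's exports.
cell_exports = run_cell("x = 40\ndef add2(n):\n    return n + 2\n", {})
downstream = run_cell("y = add2(x)\n", cell_exports)
```

A real implementation would also need the explicitly marked import/export lists noted above, rather than exporting the entire namespace.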
\paragraph{Serializable Variables}
\begin{itemize}
\item Default to pickling
\item Provide a special-case bypass (e.g. dataframes via arrow / parquet)
\item Mutable state (i.e., figuring out which variables to export)
\end{itemize}
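The first two bullets might be sketched as a serializer registry: pickle is the default, and specific types can register a faster bypass. All names here (\texttt{register}, \texttt{serialize}, \texttt{deserialize}) are hypothetical; a real dataframe bypass via Arrow/Parquet would plug into the same hook.

```python
import pickle

# Hypothetical registry: pickle by default, per-type bypass when registered.
ENCODERS = {}  # type -> (tag, encode)
DECODERS = {}  # tag  -> decode

def register(cls, tag, encode, decode):
    ENCODERS[cls] = (tag, encode)
    DECODERS[tag] = decode

def serialize(value):
    for cls, (tag, encode) in ENCODERS.items():
        if isinstance(value, cls):
            return tag, encode(value)
    return "pickle", pickle.dumps(value)

def deserialize(tag, payload):
    if tag == "pickle":
        return pickle.loads(payload)
    return DECODERS[tag](payload)

# A dataframe bypass would register the same way, e.g. (assuming pandas):
#   register(pd.DataFrame, "parquet",
#            lambda df: df.to_parquet(), pd.read_parquet)
```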
\paragraph{Side-effecting Variables}
\begin{itemize}
\item This is not something that I'd expect to see pipelined through multiple cells... maybe mark the value as an unpredicted read/write dependency?
\item Record the characteristics of the endpoint if possible (e.g., a file) and re-open it later?
\item Might be a workaround to allow checkpointed state
\end{itemize}
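The second bullet, for the common case of a file handle, could look like the following sketch: record the handle's observable characteristics (path, mode, offset) and re-open an equivalent handle in the next interpreter. The names \texttt{checkpoint\_file} and \texttt{restore\_file} are illustrative, and this obviously fails for endpoints whose state cannot be re-derived (sockets, pipes).

```python
import os
import tempfile

def checkpoint_file(handle):
    """Capture enough state to reconstruct an equivalent open file later."""
    return {"path": handle.name, "mode": handle.mode, "offset": handle.tell()}

def restore_file(state):
    """Re-open the file in a (possibly different) interpreter and seek back."""
    handle = open(state["path"], state["mode"])
    handle.seek(state["offset"])
    return handle

# Demo: checkpoint a handle mid-read, then restore and keep reading.
_path = os.path.join(tempfile.mkdtemp(), "log.txt")
with open(_path, "w") as _f:
    _f.write("hello\n")
reader = open(_path)
reader.read(2)                      # advance the offset to 2
state = checkpoint_file(reader)
reader.close()
restored = restore_file(state)
restored_text = restored.read()     # resumes from offset 2
restored.close()
```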
\paragraph{Functions and Classes}
\begin{itemize}
\item Here, the code needs to be imported explicitly
\item (cloudpickle allows serializing functions... can we do that here as well?)
\item Note that an exported function may introduce chained dependencies that must also be registered in cells that use the function. These dependencies may be on cells that appear later (which means we need to extend the misprediction cases).
\end{itemize}
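As a sketch of the cloudpickle idea using only the standard library: marshal the function's code object and rebuild it in the target interpreter, using \texttt{co\_names} to over-approximate the global symbols the function touches, i.e. the chained dependencies that must also be exported. The function names are illustrative; note that \texttt{marshal} payloads are only valid across interpreters running the same python version.

```python
import builtins
import marshal
import types

def export_function(fn):
    """Serialize a function's code; report the globals it may depend on."""
    deps = [n for n in fn.__code__.co_names if not hasattr(builtins, n)]
    return marshal.dumps(fn.__code__), fn.__name__, deps

def import_function(payload, name, namespace):
    """Rebuild the function in another interpreter's global namespace."""
    return types.FunctionType(marshal.loads(payload), namespace, name)

# Demo: `scale` depends on the global FACTOR, which a consuming cell
# must also have imported before calling the rebuilt function.
def scale(v):
    return v * FACTOR

payload, fname, deps = export_function(scale)
rebuilt = import_function(payload, fname, {"FACTOR": 3})
```

Closures and default arguments would need additional handling (\texttt{\_\_closure\_\_}, \texttt{\_\_defaults\_\_}), which is part of what cloudpickle provides.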
\paragraph{Interpreter Re-Use}
\begin{itemize}
\item Cells that re-use the same state already have a mutual dependency. We may be able to re-use the same interpreter.
\item We might also be able to re-use python state with some sort of fork/join trickery.
\item Q: is this something we can pull off here, or is this strictly future work?
\end{itemize}
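The fork/join idea above might look like the following POSIX-only sketch: after a cell runs, fork the interpreter so a dependent cell starts from the same in-memory state without re-serializing it, and pipe only the newly created bindings back to the parent. The name \texttt{run\_dependent\_cell} is illustrative.

```python
import os
import pickle

def run_dependent_cell(state, source):
    """Fork; the child runs `source` against `state` (inherited copy-on-write)
    and pipes its newly created bindings back to the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                       # child: reuses parent's state for free
        os.close(read_fd)
        namespace = dict(state)
        exec(source, namespace)
        new = {k: v for k, v in namespace.items()
               if k not in state and k != "__builtins__"}
        with os.fdopen(write_fd, "wb") as out:
            pickle.dump(new, out)
        os._exit(0)
    os.close(write_fd)                 # parent: collect the child's exports
    with os.fdopen(read_fd, "rb") as inp:
        result = pickle.load(inp)
    os.waitpid(pid, 0)
    return result

result = run_dependent_cell({"x": 40}, "y = x + 2")
```

Note the child still pickles its *new* exports back; the win is only for the (potentially large) inherited state, which never crosses an interpreter boundary.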