---
template: templates/talk_slides_v1.erb
title: Principled management of notebook state in Vizier
---

<style type="text/css">
	.notebook {
		width: 100%;
	}
	.notebook .nbcell .nblabel {
		font-family: Courier;
		vertical-align: middle;
		margin-right: 20px;
		color: blue;
		font-weight: bold;
	}
	.notebook .nbcell pre {
		width: calc(100% - 100px);
		display: inline-block;
		vertical-align: middle;
	}
	.notebook .nbcell pre code {
		padding: 20px;
	}
	.notebook .nbcell pre.nbresult {
		width: calc(100% - 110px);
		padding-left: 20px;
		margin-left: 90px;
		padding-top: 10px;
		padding-bottom: 10px;
	}
	.notebook .nbcell.fragment.highlight-blue.current-fragment
	{
		border: solid 4px blue;
	}

</style>

<%
$cells = []
def notebook()
	$cells = []
	ret = ""
	ret += "<div class='notebook'>"
	yield
	ret += $cells.join("")
	ret += "</div>"
	return ret
end

def nbdiv(body, varargs={})
	hide = varargs.fetch(:hide, nil)
	show = varargs.fetch(:show, nil)
	highlight = varargs.fetch(:highlight, nil)
	css_class = varargs.fetch(:css, "nbcell")
	extra_attrs = ""
	unless show.nil?
		css_class += " fragment"
		extra_attrs += " data-fragment-index='#{show}'"
	end
	unless hide.nil?
		css_class += " fragment fade-out"
		extra_attrs += " data-fragment-index='#{hide}'"
	end
	unless highlight.nil?
		css_class += " fragment highlight-blue"
		extra_attrs += " data-fragment-index='#{highlight}'"
	end
	return "<div class='#{css_class}'#{extra_attrs}>#{body}</div>"
end

def nbcell(text, varargs={})
	lang = varargs.fetch(:lang, "python")
	idx = varargs.fetch(:idx, nil)
	output = varargs.fetch(:output, nil)
	idx = $cells.size + 1 if idx.nil?
	cmd = "<span class='nblabel'>[#{idx}]</span><pre><code class='#{lang}'>#{text}</code></pre>"
	unless output.nil?
		cmd += "<br/><pre class='nbresult'>#{output}</pre>"
	end
	$cells += [nbdiv(cmd, varargs)]
end

def nbnote(note, varargs={})
	# nbdiv reads its CSS class from the :css key, so pass it under that name
	$cells += [nbdiv(note, varargs.merge(css: "nbnote"))]
end
%>

<section>
	<h2><%= title %></h2>

	<h4>Oliver Kennedy</h4>
	<h5>University at Buffalo</h5>
</section>

<section>
	<svg data-src="graphics/2022-06-20/NotebookOverview.svg" height="400px" style="margin-left: -100px"/>
</section>

<section>

	<div style="display: inline-block; width: 45%;">
		<img src="graphics/2022-06-20/Pimentel.png" height="400px">
		<p style="font-size: 70%;"><a href="https://ieeexplore.ieee.org/document/8816763">Pimentel et al.</a>: "4.03% of notebooks on GitHub are reproducible"</p>
	</div>

	<div style="display: inline-block; width: 45%;" class="fragment">
		<img src="graphics/2022-06-20/Grus.png">
		<p style="font-size: 70%;"><a href="https://www.youtube.com/watch?v=7jiPeIFXb6U">Joel Grus</a>: "For beginners, with dozens of cells and more complex code [the ability to run code snippets out of order] is utterly confusing."</p>
	</div>
</section>

<section>
	<h3>High-Level Challenges</h3>
	<ul>
		<li>Not clear from context where a variable was written.</li>
		<li>A cell that runs may still be wrong (for the program).</li>
		<li>A state that was computed may be inconsistent.</li>
	</ul>

	<p class="fragment takeaway">
		So why does everyone use this confusing state model?
	</p>
</section>

<section>
	<h3>Notebooks: The Good</h3>
	<ul>
		<li>They're interactive</li>
		<li>Less intimidating than the command line</li>
		<li class="fragment highlight-blue">In principle, a record of what you did.</li>
		<li>Everyone's using them</li>
	</ul>
</section>

<section>
	<h3>Notebook</h3>
	<svg data-src="graphics/2024-04-12/NotebookExtensions.svg" height="400px"/>
</section>

<section>
	<h3>High-Level Challenges</h3>
	<svg data-src="graphics/2024-04-12/Dependencies.svg" height="400px"/>
</section>

<section>
	<%=
		notebook() do
			nbcell("x = 3", idx: 1)
			nbcell("y = x + 2", idx: 2)
			nbcell("x = 4", idx: 3)
			nbcell("print(y)", idx: 4, output: "5")
		end
	%>
</section>
<section>
	<%=
		notebook() do
			nbcell("x = 3", idx: 1, highlight: 1)
			nbcell("y = x + 102", idx: 5)
			nbcell("x = 4", idx: 3, highlight: 1)
			nbcell("print(y)", idx: 4, output: "5", highlight: 2)
		end
	%>
</section>

<section>
	<h3>Dependencies</h3>

	<p class="fragment"><b>Reads: </b> <tt>x</tt></p>

	<%=
		notebook() do
			nbcell("y = x + 2", idx: 2)
		end
	%>

	<p class="fragment"><b>Writes: </b> <tt>y</tt></p>

	<p class="fragment takeaway">
		<b>Question:</b> Which variables does the cell read/write?
	</p>

</section>

<section>
	<%=
		notebook() do
			nbnote("$$\\{\\;\\;\\}$$", show: 1)
			nbcell("x = 3", idx: 1)
			nbnote("$$\\{\\;x \\rightarrow \\textbf{@1}\\;\\}$$", show: 2)
			nbcell("y = x + 2", idx: 2)
			nbnote("$$\\{\\;x \\rightarrow \\textbf{@1},\\;y \\rightarrow \\textbf{@2}\\;\\}$$", show: 3)
			nbcell("x = 4", idx: 3)
			nbnote("$$\\{\\;x \\rightarrow \\textbf{@3},\\;y \\rightarrow \\textbf{@2}\\;\\}$$", show: 4)
			nbcell("print(y)", idx: 4)
		end
	%>
</section>

<section>
	<p>Interpreter State: $\{\;x \rightarrow \textbf{@3},\;y \rightarrow \textbf{@2}\;\}$</p>

	<p>... but Cell 2 read $x \rightarrow \textbf{@1}$</p>

	<p class="fragment takeaway">
		<b>Question:</b> How do we get the interpreter back to a known state?
	</p>
</section>

<section>
	<%=
		notebook() do
			nbnote("$$\\{\\;x \\rightarrow \\textbf{@1}\\;\\}$$")
			nbcell("y = x + 102", idx: 5)
			nbnote("$$\\{\\;x \\rightarrow \\textbf{@1},\\;y \\rightarrow \\color{blue}{\\textbf{@5}}\\;\\}$$", show:1)
			nbcell("x = 4", idx: 3, highlight: 2)
			nbnote("$$\\{\\;x \\rightarrow \\textbf{@3},\\;y \\rightarrow \\color{blue}{\\textbf{@5}}\\;\\}$$", show:3)
			nbcell("print(y)", idx: 4, highlight: 4)
		end
	%>

	<p class="fragment takeaway">
		<b><strike>Question</strike></b> A cell is stale if a value it read last time changed.
	</p>
</section>
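
<section data-markdown>
<textarea data-template>
A minimal sketch of the staleness rule, assuming each cell execution records the version (the `@N` label) of every variable it read. The names `is_stale`, `reads_at_last_run`, and `current_versions` are hypothetical and are not taken from Vizier.

```python
def is_stale(reads_at_last_run, current_versions):
    """Return True if any variable the cell read now has a different version.

    Both arguments map variable names to the index of the cell execution
    that wrote them (the @N labels on the slides).
    """
    return any(
        current_versions.get(name) != version
        for name, version in reads_at_last_run.items()
    )


# Cell 2 read x@1; the interpreter now holds x@3, so cell 2 is stale.
cell2_reads = {"x": 1}
interpreter_state = {"x": 3, "y": 2}
print(is_stale(cell2_reads, interpreter_state))  # True
```
</textarea>
</section>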

<section>
	<h3>Overview: Workflow-Style Notebooks</h3>

	<ol>
		<li>Static Analysis</li>
		<li>Microkernel Notebooks</li>
		<li>Approximate Dependencies</li>
		<li>Inter-Kernel Interop <span style="color: grey;">[Work In Progress]</span></li>
	</ol>
</section>

<!------------------------- Static Analysis -------------------------->
<section>
	<h3>Obtaining Cell Dependencies</h3>

	<ul>
		<li class="fragment" data-fragment-index="1">What could the cell read/write? <span class="fragment" data-fragment-index="5">[Static]</span></li>
		<li class="fragment" data-fragment-index="2"><span class="fragment strike" data-fragment-index="4">What will the cell read/write?</span></li>
		<li class="fragment" data-fragment-index="3">What did the cell read/write? <span class="fragment" data-fragment-index="5">[Dynamic]</span></li>
	</ul>
</section>


<section>
	<h3>Dynamic Dependencies</h3>

	<%=
		notebook() do
			nbcell("if z:\n  y = x + 2", idx: 2)
		end
	%>

	<pre class="fragment">

       0 LOAD_GLOBAL              0 (z)
       2 POP_JUMP_IF_FALSE       12
       4 LOAD_GLOBAL              1 (x)
       6 LOAD_CONST               1 (2)
       8 BINARY_ADD
      10 STORE_GLOBAL             2 (y)
      12 LOAD_CONST               0 (None)
      14 RETURN_VALUE
	</pre>

</section>


<section>
	<h3>Dynamic Dependencies</h3>

	<%=
		notebook() do
			nbcell("if z:\n  y = x + 2", idx: 2)
		end
	%>

	<pre>

       0 LOAD_GLOBAL              0 (z)
       2 POP_JUMP_IF_FALSE       12


      12 LOAD_CONST               0 (None)
      14 RETURN_VALUE
	</pre>

</section>


<section>
	<h3>Dynamic Dependencies</h3>

	<%=
		notebook() do
			nbcell("if z:\n  y = x + 2", idx: 2)
		end
	%>

	<p><b>Reads: </b> $\{\;\textbf{z}\;\}$</p>
	<p><b>Writes: </b> $\{\;\;\}$</p>

</section>
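
<section data-markdown>
<textarea data-template>
One way to observe dynamic dependencies, sketched under assumptions: run the cell with a `dict` subclass as its local namespace and log every lookup and assignment. `TracingNamespace` and `run_cell` are hypothetical names, and the sketch only sees top-level names (code inside functions defined by the cell is not traced).

```python
class TracingNamespace(dict):
    """Cell namespace that records which names are read and written."""

    def __init__(self, initial=None):
        self.reads = set()
        self.writes = set()
        super().__init__(initial or {})

    def __getitem__(self, name):
        value = super().__getitem__(name)  # KeyError falls through to builtins
        self.reads.add(name)
        return value

    def __setitem__(self, name, value):
        self.writes.add(name)
        super().__setitem__(name, value)


def run_cell(source, state):
    """Execute a cell against a copy of state; return (reads, writes)."""
    namespace = TracingNamespace(state)
    # Top-level names compile to LOAD_NAME / STORE_NAME, which consult the
    # locals mapping first, so the overrides above observe each access.
    exec(compile(source, "<cell>", "exec"), {}, namespace)
    return namespace.reads, namespace.writes


cell = "if z:\n  y = x + 2"
reads, writes = run_cell(cell, {"x": 3, "z": False})
print(sorted(reads), sorted(writes))  # ['z'] []
```

With `z` set to `True`, the same cell reports reads of `x` and `z` and a write to `y`.
</textarea>
</section>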

<section>
	<h3>Static Dependencies</h3>

	<%=
		notebook() do
			nbcell("if z:\n  y = x + 2", idx: 2)
		end
	%>

	<pre>

       0 LOAD_GLOBAL              0 (z)
       2 POP_JUMP_IF_FALSE       12
       4 LOAD_GLOBAL              1 (x)
       6 LOAD_CONST               1 (2)
       8 BINARY_ADD
      10 STORE_GLOBAL             2 (y)
      12 LOAD_CONST               0 (None)
      14 RETURN_VALUE
	</pre>


</section>

<section>
	<h3>Static Dependencies</h3>

	<%=
		notebook() do
			nbcell("if z:\n  y = x + 2", idx: 2)
		end
	%>

	<pre>

       0 LOAD_GLOBAL              0 (z) # <---- reads
       2 POP_JUMP_IF_FALSE       12
       4 LOAD_GLOBAL              1 (x) # <---- reads
       6 LOAD_CONST               1 (2)
       8 BINARY_ADD
      10 STORE_GLOBAL             2 (y) # ----> writes
      12 LOAD_CONST               0 (None)
      14 RETURN_VALUE
	</pre>

</section>


<section>
	<h3>Static Dependencies</h3>

	<%=
		notebook() do
			nbcell("if z:\n  y = x + 2", idx: 2)
		end
	%>

	<p><b>Could Read: </b> $\{\;\textbf{x},\;\textbf{z}\;\}$</p>
	<p><b>Could Write: </b> $\{\;\textbf{y}\;\}$</p>

</section>
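
<section data-markdown>
<textarea data-template>
A rough sketch of the static approach using Python's standard `dis` module: scan every instruction, whether or not it would run, and collect the names it loads or stores. `static_dependencies` is a hypothetical helper. Cells compiled at the top level use `LOAD_NAME` / `STORE_NAME` rather than the `LOAD_GLOBAL` / `STORE_GLOBAL` shown on the slides, so the sketch accepts both.

```python
import dis

READ_OPS = {"LOAD_NAME", "LOAD_GLOBAL"}
WRITE_OPS = {"STORE_NAME", "STORE_GLOBAL"}


def static_dependencies(source):
    """Return (could_read, could_write) for the top-level code of a cell."""
    code = compile(source, "<cell>", "exec")
    could_read, could_write = set(), set()
    for instr in dis.get_instructions(code):
        if instr.opname in READ_OPS:
            could_read.add(instr.argval)
        elif instr.opname in WRITE_OPS:
            could_write.add(instr.argval)
    return could_read, could_write


could_read, could_write = static_dependencies("if z:\n  y = x + 2")
print(sorted(could_read), sorted(could_write))  # ['x', 'z'] ['y']
```

The sketch does not descend into nested functions, counts builtins such as `print` as reads, and cannot tell whether a method call mutates its receiver.
</textarea>
</section>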


<section>
	<h3>Static Dependencies</h3>

	<%=
		notebook() do
			nbcell("my_data.filter( items )", idx: 2)
		end
	%>

	<p><b>Could Read: </b> $\{\;\textbf{my_data},\;\textbf{items}\;\}$</p>
	<p><b>Could Write: </b> $\{\;\;\}$</p>

</section>


<section>
	<h3>Static Dependencies</h3>

	<%=
		notebook() do
			nbcell("my_data.push( items )", idx: 2)
		end
	%>

	<p><b>Could Read: </b> $\{\;\textbf{my_data},\;\textbf{items}\;\}$</p>
	<p><b>Could Write: </b> $\{\;\textbf{my_data}\;\}$</p>

</section>

<section>

	<%=
		notebook() do
			nbcell("my_data.filter( items )", idx: 2)
		end
	%>
	vs
	<%=
		notebook() do
			nbcell("my_data.push( items )", idx: 2)
		end
	%>

</section>

<section>

	<img src="graphics/2024-04-12/LibraryStaticAnalysisDSL.png">

	<attribution>"Bolt-on, Compact, and Rapid Program Slicing for Notebooks" (Shankar et al.; VLDB 2023)</attribution>
</section>


<section>
	<h3>Dependency Needs</h3>

	<ul>
		<li class="fragment" data-fragment-index="1">Does a cell need to be re-run based on these changes?
			<div class="fragment" data-fragment-index="2" style="font-weight: bold;">Dynamic sufficient (assuming deterministic cells).</div>
		</li>
		<li class="fragment" data-fragment-index="3">
			<div class="fragment highlight-blue">
				What is the minimal set of inputs a cell needs to run?
				<div class="fragment" data-fragment-index="4" style="font-weight: bold;">Static required.</div>
			</div>
		</li>
		<li class="fragment" data-fragment-index="5">Which cell last wrote to a variable?
			<div class="fragment" data-fragment-index="6" style="font-weight: bold;">Dynamic sufficient.</div>
		</li>
	</ul>
</section>

<section>
	<h3>Vizier Demo</h3>
</section>

<!------------------------- Microkernel Notebooks -------------------------->
<section>
	$$\{\;???\;\}$$
	<%=
		notebook() do
			nbcell("if z:\n  y = x + 2", idx: 2)
		end
	%>

	<p class="fragment takeaway">We need to be able to recover the notebook from <i>any</i> state.</p>
</section>

<section>
	<p>Not relying on the same interpreter means:</p>
	<ul>
		<li>No worrying about crashes</li>
		<li>Portability / Resume at any point</li>
		<li>Parallel execution</li>
	</ul>
</section>

<section>
	<p>Outline of the data model:</p>
	<ul>
		<li>Interpreter</li>
		<li>Backend "state database"</li>
		<li>Lazy-loading interpreter state</li>
	</ul>
</section>
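
<section data-markdown>
<textarea data-template>
A toy sketch of lazy-loading interpreter state, assuming the "state database" is a plain mapping from variable names to pickled bytes. `LazyState` and `store` are hypothetical stand-ins and do not describe Vizier's actual backend.

```python
import pickle


class LazyState(dict):
    """Cell namespace that loads a variable from the store on first use."""

    def __init__(self, store):
        super().__init__()
        self.store = store
        self.loaded = []

    def __missing__(self, name):
        if name not in self.store:
            raise KeyError(name)  # not ours: let builtins resolve it
        value = pickle.loads(self.store[name])
        self[name] = value  # cache so later reads skip the store
        self.loaded.append(name)
        return value


store = {"x": pickle.dumps(3), "big": pickle.dumps(list(range(1000)))}
state = LazyState(store)
exec("y = x + 2", {}, state)
print(state["y"], state.loaded)  # 5 ['x']
```

Only `x` is deserialized; `big` stays in the store because the cell never touches it.
</textarea>
</section>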

<section>
	<p>If we must be able to recover a state, does it have to be in the same interpreter <i>version</i>?</p>
</section>

<section>
	<p>If we must be able to recover a state, does it have to be in the same language?</p>
</section>

<section>
	<p>Cool things we can do if we lift the "state lives in the kernel" model:</p>
	<ul>
		<li>Deserialize program state into another interpreter</li>
		<li>Graphical widgets for common tasks (data loading)</li>
		<li>1-3 slides on spreadsheets</li>
	</ul>
</section>

<!------------------------- Approximate Dependencies -------------------------->
<section>
	<h3>How to Figure Out Dependencies</h3>
	<ol>
		<li>Run the code (exact, after the fact)</li>
		<li>Static analysis (imprecise, incomplete)</li>
		<li>Both!</li>
	</ol>
</section>

<section>
	<p>Idea: use static analysis to create a mask.</p>

	<p>Cell state model:</p>
	<ul>
		<li>stable</li>
		<li>unknown</li>
		<li>stale</li>
		<li>runnable (revisit parallelism)</li>
	</ul>
</section>

<section>
	<p>Preliminary results: TAPP</p>
</section>

<!------------------------- Inter-Kernel Interop -------------------------->

<section>
	<p>State model. Review:</p>
	<ul>
		<li>State needs to come <i>out</i> of the cell that created it</li>
		<li>State needs to go <i>into</i> the cell that is about to consume it</li>
	</ul>
</section>

<section>
	<h3>Naive Approach: Pickle</h3>
	<ul>
		<li>... but pickle doesn't allow interop</li>
		<li>... but pickle doesn't always work (e.g., for 'File' objects)</li>
	</ul>
</section>
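
<section data-markdown>
<textarea data-template>
A small illustration of the second limitation, using only the standard library: plain values round-trip through `pickle`, but an open file handle does not. `try_snapshot` is a hypothetical helper name.

```python
import os
import pickle


def try_snapshot(value):
    """Return the pickled bytes for value, or None if it cannot be pickled."""
    try:
        return pickle.dumps(value)
    except (pickle.PicklingError, TypeError, AttributeError):
        return None


# Plain data can be exported from one interpreter and reloaded in another.
print(try_snapshot({"x": 3, "y": 5}) is not None)  # True

# An open file wraps an OS-level handle, which pickle refuses to serialize.
with open(os.devnull) as handle:
    print(try_snapshot(handle) is None)  # True
```
</textarea>
</section>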

<section>
	<h3>Interop: Define Standards</h3>
	<ul>
		<li>Primitive Values (int, float, date, etc...)</li>
		<li>Collection Types (map, list, etc...)</li>
		<li>Libraries</li>
		<li>Function [Challenge: Chained Dependencies]</li>
		<li>Dataframe/Series [Challenge: These are BIG]</li>
	</ul>
</section>

<section>

</section>

<!------------------------- Closing -------------------------->
<%#
<section>
	<a href="https://vizierdb.info">
		<img src="graphics/2022-06-20/vizier.svg" height="200px">
		<p style="margin-top: -20px;">https://vizierdb.info</p>
	</a>

	<p style="font-size: 65%"><b>Mike Brachmann, Boris Glavic, Nachiket Deo</b>, Juliana Freire, Heiko Mueller, Sonia Castello, Munaf Arshad Qazi, William Spoth, Poonam Kumari, Soham Patel, and more...</p>
</section>
 %>