---
template: templates/talk_slides_v1.erb
title: "CSE 501: Microkernel Notebooks"
---

<section>
  <h2>Microkernel Notebooks</h2>
  <h4 style="margin-top: 20px;">Oliver Kennedy</h4>
  <p style="font-size: 70%; width: 730px; margin-right: auto; margin-left: auto; margin-top: 100px;" >
    <a href="https://vizierdb.info">
      <img src="graphics/logos/vizier-blue.svg" height="70px" style="float: left; margin-right: 20px; vertical-align: middle;" />
    </a>
    Boris Glavic, Juliana Freire, Michael Brachmann, William Spoth, Poonam Kumari, Ying Yang, Su Feng, Heiko Mueller, Aaron Huber, Nachiket Deo, and many more...</p>
</section>

<section>
  <section>
    <h2>But first...</h2>
  </section>

  <section>
    <h3>Databases?</h3>
    <img src="graphics/2022-04-02/er-diagrams.png" height="400px">
  </section>

  <section>
    <img src="graphics/clipart/Female-or-Male-Unisex-Geek-or-Nerd-Light-Skin.svg" height="400px">
    <attribution><a href="https://openclipart.org/">openclipart.org</a></attribution>
  </section>

  <section>
    <h2>Data Structures?</h2>

    <img src="graphics/2022-04-02/250-textbook.png" height="300px">
  </section>

  <section>
    <img src="graphics/2022-04-02/Macintosh_classic_250.jpg" height="400px">
    <attribution>Adapted from <a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=10101">Wikimedia Commons</a></attribution>
  </section>

  <section>
    <img src="graphics/2022-04-02/nuclear_dino_db.png">
    <attribution>Adapted from <a href="https://www.destroyallsoftware.com/talks/wat">Wat; Gary Bernhardt @ CodeMash 2012</a></attribution>
  </section>

  <section>
    <h3>CSE 562: Database Systems</h3>

    <ul>
      <li>A bit of operating systems</li>
      <li>A bit of hardware</li>
      <li>A bit of compilers</li>
      <li>A bit of distributed systems</li>
    </ul>

    <p class="fragment" style="font-weight: bold; margin-top: 50px">Applied Computer Science</p>
  </section>

  <section>
    <img src="graphics/2022-04-02/db-convergence.svg">
  </section>

</section>

<section>

  <section>
    <h3>For example...</h3>
  </section>

  <section>
    <pre><code class="sql">
CREATE VIEW salesSinceLastMonth AS
  SELECT l.*
  FROM lineitem l, orders o
  WHERE l.orderkey = o.orderkey
    AND o.orderdate > DATE(NOW() - INTERVAL '1 month');
</code></pre>
    <pre><code class="sql">
SELECT partkey FROM salesSinceLastMonth
ORDER BY shipdate DESC LIMIT 10;
</code></pre>
    <pre><code class="sql">
SELECT suppkey, COUNT(*)
FROM salesSinceLastMonth
GROUP BY suppkey;
</code></pre>
    <pre><code class="sql">
SELECT DISTINCT partkey
FROM salesSinceLastMonth;
</code></pre>
  </section>

  <section>
    <pre><code class="python">
def really_expensive_computation():
    return [
        expensive_computation(i)
        for i in range(1, 1000000)
        if expensive_test(i)
    ]
</code></pre>
    <pre><code class="python">
print(sorted(really_expensive_computation())[:10])
</code></pre>
    <pre><code class="python">
print(len(really_expensive_computation()))
</code></pre>
    <pre><code class="python">
print(set(really_expensive_computation()))
</code></pre>
  </section>

  <section>
    <pre><code class="python">
def really_expensive_computation():
    return [
        expensive_computation(i)
        for i in range(1, 1000000)
        if expensive_test(i)
    ]

view = really_expensive_computation()
</code></pre>
    <pre><code class="python">
print(sorted(view)[:10])
</code></pre>
    <pre><code class="python">
print(len(view))
</code></pre>
    <pre><code class="python">
print(set(view))
</code></pre>
  </section>

  <section>
    <p><b>Opportunity:</b> Views are queried frequently</p>
    <p><b>Idea: </b> Pre-compute and save the view’s contents!</p>
  </section>

  <section>
    <p>Btw... this idea is the essence of CSE 250.</p>
  </section>

  <section>
    <svg data-src="graphics/2022-04-02/DBToQ.svg" />
    <attribution>openclipart.org</attribution>
  </section>

  <section>
    <p>When the base data changes, <br/>the view needs to be updated too!</p>
  </section>

  <section>
    <pre><code class="python">
def init():
    global view
    view = query(database)
</code></pre>
    <p style="margin-top: 100px;">Our view starts off initialized</p>
  </section>

  <section>
    <p style="margin-top: 100px;"><b>Idea:</b> Recompute the view from scratch when data changes.</p>
  </section>

  <section>
    <pre><code class="python">
def update(changes):
    global database, view
    database = database + changes
    view = query(database)  # includes changes
</code></pre>
  </section>

  <section>
    <img src="graphics/clipart/Snail.jpg" height="400px">
    <attribution><a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=95926">Wikimedia Commons</a></attribution>
  </section>

  <section>
    <pre><code class="python">
def update(changes):
    global database, view
    view = view + delta(query, database, changes)
    database = database + changes
</code></pre>
    <table style="margin-top: 50px;">
      <tr class="fragment">
        <td style="font-size: 150%;"><tt>delta</tt></td>
        <td>(ideally) Small &amp; fast query</td>
      </tr>
      <tr class="fragment">
        <td style="font-size: 150%;"><tt>+</tt></td>
        <td>(ideally) Fast "merge" operation</td>
      </tr>
    </table>
  </section>

  <section>
    <h3>Intuition</h3>
    <div>
      $$\mathcal{D} = \{\ 1,\ 2,\ 3,\ 4\ \} \hspace{1in} \Delta\mathcal{D} = \{\ 5\ \}$$
      $$Q(\mathcal D) = \texttt{SUM}(\mathcal D)$$
    </div>
    <div style="margin-top: 50px;">
      <div class="fragment">$$ 1 + 2 + 3 + 4 + 5 $$</div>
      <div class="fragment">$Q(\mathcal D+\Delta\mathcal D)$ <span class="fragment">$\sim O(|\mathcal D| + |\Delta\mathcal D|)$</span></div>
    </div>
    <div style="margin-top: 50px;">
      <div class="fragment">$10$<span class="fragment">$+ 5$</span></div>
      <div class="fragment">$\texttt{VIEW} + \texttt{SUM}(\Delta\mathcal D)$ <span class="fragment">$\sim O(|\Delta\mathcal D|)$</span></div>
    </div>
  </section>
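
  <section>
    <p style="font-size: 70%;">A minimal sketch of the intuition above, assuming the view is a running <tt>SUM</tt> and that updates only insert values. The names <tt>SumView</tt>, <tt>recompute</tt>, and <tt>apply_delta</tt> are made up for illustration.</p>
    <pre><code class="python">
class SumView:
    """Materialized SUM over a list of numbers (hypothetical example)."""

    def __init__(self, database):
        self.database = list(database)
        self.view = sum(self.database)       # initial pass over all of D

    def recompute(self, changes):
        # Recompute from scratch: touches every row of D plus the changes
        self.database.extend(changes)
        self.view = sum(self.database)

    def apply_delta(self, changes):
        # Incremental: VIEW + SUM(delta D), touches only the changes
        changes = list(changes)
        self.view = self.view + sum(changes)
        self.database.extend(changes)


v = SumView([1, 2, 3, 4])
v.apply_delta([5])
print(v.view)  # 15
</code></pre>
    <p style="font-size: 70%;">Both methods leave the same view behind; only the amount of data they read differs.</p>
  </section>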

  <section>
    <img src="graphics/2022-04-02/morpheus.jpeg">

    <attribution>©1999 Warner Bros. Pictures</attribution>
  </section>

</section>

<section>
  <h6 style="font-size: 60%" class="fragment" data-fragment-index="2">Get off my database's lawn, punk kids</h6>
  <h4 class="fragment" data-fragment-index="1"><span class="fragment strike" data-fragment-index="2">Why Jupyter Sucks</span></h4>
  <h2 class="fragment strike" data-fragment-index="1">Microkernel Notebooks</h2>
  <h4 style="margin-top: 20px;">Oliver Kennedy</h4>
  <p style="font-size: 70%; width: 730px; margin-right: auto; margin-left: auto; margin-top: 100px;" >
    <a href="https://vizierdb.info">
      <img src="graphics/logos/vizier-blue.svg" height="70px" style="float: left; margin-right: 20px; vertical-align: middle;" />
    </a>
    Boris Glavic, Juliana Freire, Michael Brachmann, William Spoth, Poonam Kumari, Ying Yang, Su Feng, Heiko Mueller, Aaron Huber, Nachiket Deo, and many more...</p>
</section>

<section>

  <section>
    <img src="graphics/logos/jupyter.svg" height="500px">
    <img src="graphics/2022-04-02/jupyter.png" height="500px">
  </section>

  <section>
    <pre><code class="python">
import pandas as pd
</code></pre>
    <pre class="fragment"><code class="python">
df = pd.read_csv("AMS-USDA-Directories-FarmersMarkets.csv")
df
</code></pre>
    <img class="fragment" src="graphics/2022-04-02/jupyter-table.png" width="800px">
    <pre class="fragment"><code class="python">
df.groupby("County").count()
</code></pre>
    <p class="fragment">...</p>
  </section>

  <section>
    <img src="graphics/2022-04-02/oz.jpeg" height="500px">
    <attribution>©1939 Metro-Goldwyn-Mayer</attribution>
  </section>

  <section>
    <img src="graphics/2022-04-02/oz_curtain.jpeg" height="500px">
    <attribution>©1939 Metro-Goldwyn-Mayer</attribution>
  </section>

  <section>
<console>
Python 3.9.7 (default, Sep 10 2021, 14:59:43)
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> <span class="fragment">import pandas as pd
>>> </span><span class="fragment">df = pd.read_csv("AMS-USDA-Directories-FarmersMarkets.csv")
>>> df</span>
<div class="fragment">       FMID                                         MarketName  ...  WildHarvested            updateTime
 0  1000519                      Alexandria Bay Farmers Market  ...              N  2/1/2021 11:02:22 AM
 1  1021329                              Aurora Farmers Market  ...              Y  1/30/2021 6:24:08 PM
 2  1002064                             Belmont Farmers Market  ...              N  1/27/2021 9:03:15 PM
 3  1021262              Broome County Regional Farmers Market  ...              Y  1/5/2021 10:02:05 AM
 4  1021202                      Canal Village Farmers' Market  ...              N   9/9/2020 7:55:23 PM
..      ...                                                ...  ...            ...                   ...
82  1020021                        Waterloo Rotary Farm Market  ...              N   8/3/2020 2:28:33 PM
83  1000384          Webster's Joe Obbie Farmers' Market, Inc.  ...              N  1/5/2021 10:18:30 AM
84  1002177       West Point-Town of Highlands Farmers Market  ...              N  8/2/2018 12:58:13 AM
85  1019038                            Woodstock Farm Festival  ...              N  4/4/2018 11:27:02 AM
86  1007259  Yates County Cooperative Farm and Craft Market...  ...              N  2/3/2019 12:29:07 PM

[87 rows x 59 columns]
>>> </div>
</console>
  </section>

  <section>
    <p>Cells are code snippets that get pasted into a long-running <b>kernel</b></p>
  </section>

  <section>
    <img src="graphics/2022-04-02/joelgrus.png" height="500px">
    <attribution><a href="https://www.youtube.com/watch?v=7jiPeIFXb6U">I don't like notebooks. - Joel Grus (Allen Institute for Artificial Intelligence)</a></attribution>
  </section>

  <section>
    <img src="graphics/2022-04-02/joelgrus_hiddenstate.png" height="500px">
    <attribution><a href="https://www.youtube.com/watch?v=7jiPeIFXb6U">I don't like notebooks. - Joel Grus (Allen Institute for Artificial Intelligence)</a></attribution>
  </section>

  <section>
    <img src="graphics/2022-04-02/joelgrus_y_is_5.png" height="300px">
    <attribution><a href="https://www.youtube.com/watch?v=7jiPeIFXb6U">I don't like notebooks. - Joel Grus (Allen Institute for Artificial Intelligence)</a></attribution>
  </section>

  <section>
    <p>Evaluation Order ≠ Notebook Order</p>

    <p style="margin-top: 100px; font-weight: bold;" class="fragment">... but why?</p>
  </section>
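
  <section>
    <p style="font-size: 70%;">A small sketch of how evaluation order and notebook order can diverge, assuming a kernel is nothing more than one shared namespace that every cell mutates. The three cells and the <tt>run</tt> helper are hypothetical.</p>
    <pre><code class="python">
cells = {
    1: "x = 1",
    2: "y = x + 1",
    3: "x = 10",
}

def run(order):
    kernel = {}                      # one shared, long-running namespace
    for cell_id in order:
        exec(cells[cell_id], kernel)
    return kernel["y"]

top_to_bottom = run([1, 2, 3])       # y == 2
out_of_order = run([1, 3, 2])        # y == 11, with identical notebook text
print(top_to_bottom, out_of_order)
</code></pre>
    <p style="font-size: 70%;">The notebook on screen is the same in both runs; only the hidden kernel state differs.</p>
  </section>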

  <section>
    <h3>In a monokernel...</h3>
  </section>

  <section>
    <pre><code class="python">
import pandas as pd
</code></pre>
    <pre class="fragment"><code class="python">
df = pd.read_csv("really_big_dataset.csv")
</code></pre>
    <pre class="fragment"><code class="python">
test = df.iloc[:800]
train = df.iloc[800:]
</code></pre>
    <pre class="fragment"><code class="python">
model = train_linear_regression(train, "target")
</code></pre>
    <pre class="fragment"><code class="python">
evaluate_linear_regression(model, test, "target")
</code></pre>
  </section>

  <section>
    <pre><code class="python">
import pandas as pd
</code></pre>
    <pre><code class="python">
df = pd.read_csv("really_big_dataset.csv")
</code></pre>
    <pre style="box-shadow: 0px 0px 12px red; "><code class="python">
test = df.iloc[:500]
train = df.iloc[500:]
</code></pre>
    <pre class="fragment red-shadow-current" data-fragment-index="1"><code class="python">
model = train_linear_regression(train, "target")
</code></pre>
    <pre class="fragment red-shadow-current" data-fragment-index="1"><code class="python">
evaluate_linear_regression(model, test, "target")
</code></pre>
  </section>

  <section>
    <h3>Q1: Which cells need to be re-evaluated?</h3>

    <p style="margin-top: 100px; font-weight: bold;" class="fragment">Idea 1: All of them!</p>
  </section>

  <section>
    <pre><code class="python">
import pandas as pd

df = pd.read_csv("really_big_dataset.csv")
test = df.iloc[:500]
train = df.iloc[500:]

model = train_linear_regression(train, "target")

evaluate_linear_regression(model, test, "target")
</code></pre>
  </section>

  <section>
    <img src="graphics/clipart/Snail.jpg" height="400px">
    <attribution><a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=95926">Wikimedia Commons</a></attribution>
  </section>

  <section>
    <p>... but <tt>df</tt> is still around, and you can "re-use" it.</p>
    <p>Idea 2: Skip cells that haven't changed.</p>

    <p style="margin-top: 100px; font-weight: bold;" class="fragment">... but <u class="fragment highlight-red">you</u> need to keep track of this.</p>
  </section>

</section>

<section>

  <section>
    <p>Idea 3: Pull out your CSE 443 Textbooks</p>
  </section>

  <section>
    <svg data-src="graphics/2022-04-02/data-flow-simple.svg" height="500px" />
  </section>
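
  <section>
    <p style="font-size: 70%;">One way a data flow graph over cells could be derived, sketched with Python's standard <tt>ast</tt> module. The helpers <tt>reads_and_writes</tt> and <tt>downstream</tt> are hypothetical, and the analysis is deliberately naive: it only looks at variable names and ignores in-place mutation.</p>
    <pre><code class="python">
import ast

cells = {
    1: "import pandas as pd",
    2: 'df = pd.read_csv("really_big_dataset.csv")',
    3: "test = df.iloc[:500]\ntrain = df.iloc[500:]",
    4: 'model = train_linear_regression(train, "target")',
    5: 'evaluate_linear_regression(model, test, "target")',
}

def reads_and_writes(source):
    """Names a cell reads and names it assigns (cells are parsed, never run)."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            target = writes if isinstance(node.ctx, ast.Store) else reads
            target.add(node.id)
        elif isinstance(node, ast.alias):        # import pandas as pd
            writes.add(node.asname or node.name)
    return reads, writes

def downstream(changed):
    """Later cells that read something the changed cell (transitively) wrote."""
    dirty = set(reads_and_writes(cells[changed])[1])
    stale = []
    for cell_id in sorted(cells):
        reads, writes = reads_and_writes(cells[cell_id])
        if cell_id > changed and reads.intersection(dirty):
            stale.append(cell_id)
            dirty.update(writes)
    return stale

print(downstream(3))  # [4, 5]
</code></pre>
  </section>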

  <section>
    <h2>Data Flow Graph</h2>

    <p>Cell 3 changed, so re-evaluate only cells 4 and 5</p>

    <p style="margin-top: 100px; font-weight: bold;" class="fragment">... but</p>
  </section>

  <section>
    <h2>...</h2>
    <pre><code class="python">
model = train_linear_regression(train, "target")
</code></pre>
    <pre><code class="python">
evaluate_linear_regression(model, test, "target")
</code></pre>
    <pre class="fragment"><code class="python">
df = pd.read_csv("another_really_big_dataset.csv")
test = df.iloc[:500]
train = df.iloc[500:]
</code></pre>

    <p style="margin-top: 100px; font-weight: bold;" class="fragment"><tt>df</tt> has changed!</p>

  </section>

  <section>
    <p>We want to "snapshot" <tt>df</tt> in between cells.</p>
  </section>

  <section>
    <svg data-src="graphics/2022-04-02/monokernel.svg" height="350px" />
  </section>

  <section>
    <svg data-src="graphics/2022-04-02/microkernel.svg" height="350px" />
  </section>

  <section>
    <p>The kernel runs, snapshots its variables, and quits.</p>
  </section>
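
  <section>
    <p style="font-size: 70%;">A toy sketch of "run, snapshot, quit", assuming snapshots are plain <tt>pickle</tt> blobs and a kernel is a throwaway namespace. <tt>run_cell</tt> is a hypothetical helper, not Vizier's API, and it only works for picklable values (so no imported modules in the cell).</p>
    <pre><code class="python">
import pickle

def run_cell(source, inputs):
    """Run one cell in a fresh namespace; return snapshots of what it changed."""
    scope = {name: pickle.loads(blob) for name, blob in inputs.items()}
    exec(source, scope)              # the "kernel" exists only for this call
    outputs = {}
    for name, value in scope.items():
        if name.startswith("__"):    # skip __builtins__ added by exec
            continue
        blob = pickle.dumps(value)
        if inputs.get(name) != blob: # new or modified variable
            outputs[name] = blob
    return outputs                   # scope is dropped here: the kernel quits

snap1 = run_cell("data = [3, 1, 2]", {})
snap2 = run_cell("top = sorted(data)[:2]", snap1)
print(pickle.loads(snap2["top"]))  # [1, 2]
</code></pre>
    <p style="font-size: 70%;">The second cell never sees the first cell's kernel, only its snapshot.</p>
  </section>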

  <section>
    <svg data-src="graphics/2022-04-02/microkernel-invalidate.svg" height="500px" />
  </section>

  <section>
    <h2>Microkernel Notebooks</h2>

    <ul>
      <li>Lots of small "micro-kernels"</li>
      <li>Explicit inter-cell messaging</li>
      <li>Messages are snapshotted for re-use</li>
    </ul>
  </section>
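
  <section>
    <p style="font-size: 70%;">A rough sketch of how snapshotted messages might be re-used, assuming each cell declares its outputs and a snapshot is keyed by a hash of the cell's source and inputs. <tt>run_cached</tt>, <tt>snapshots</tt>, and <tt>executed</tt> are illustrative names only.</p>
    <pre><code class="python">
import hashlib
import pickle

snapshots = {}   # cache key -> pickled output messages
executed = []    # log of cells that actually ran

def run_cached(cell_id, source, inputs, output_names):
    blob = pickle.dumps(sorted(inputs.items()))
    key = hashlib.sha256(source.encode() + blob).hexdigest()
    if key not in snapshots:         # source or inputs changed: run the cell
        scope = dict(inputs)
        exec(source, scope)
        executed.append(cell_id)
        snapshots[key] = pickle.dumps({n: scope[n] for n in output_names})
    return pickle.loads(snapshots[key])

def notebook(limit):
    a = run_cached(1, "data = list(range(10))", {}, ["data"])
    b = run_cached(2, f"head = data[:{limit}]", a, ["head"])
    return run_cached(3, "total = sum(head)", b, ["total"])["total"]

notebook(5)      # runs cells 1, 2, 3
notebook(3)      # cell 2 was edited: cell 1 is re-used, cells 2 and 3 re-run
print(executed)  # [1, 2, 3, 2, 3]
</code></pre>
  </section>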

  <section>
    <svg data-src="graphics/2022-04-02/microkernel-multiarch.svg" height="500px" />
  </section>

  <section>
    <svg data-src="graphics/2022-04-02/microkernel-parallelism.svg" height="500px" />
  </section>

  <section>
    <svg data-src="graphics/2022-04-02/microkernel-parallelism-2.svg" height="500px" />
  </section>

  <section>
    <h2>Demo</h2>
    <a href="https://vizierdb.info">
      <img src="graphics/logos/vizier-blue.svg" height="200px" />
    </a>
  </section>

  <section>
    <a href="https://vizierdb.info">
      <img src="graphics/logos/vizier-blue.svg" height="200px" />
    </a>
    <p><a href="https://vizierdb.info">https://vizierdb.info</a></p>
    <p><a href="https://github.com/VizierDB/vizier-scala">https://github.com/VizierDB/vizier-scala</a></p>

  </section>

</section>