---
template: templates/talk_slides_v1.erb
title: "CSE 501: Microkernel Notebooks"
---
<section>
<h2>Microkernel Notebooks</h2>
<h4 style="margin-top: 20px;">Oliver Kennedy</h4>
<p style="font-size: 70%; width: 730px; margin-right: auto; margin-left: auto; margin-top: 100px;" >
<a href="https://vizierdb.info">
<img src="graphics/logos/vizier-blue.svg" height="70px" style="float: left; margin-right: 20px; vertical-align: middle;" />
</a>
Boris Glavic, Juliana Freire, Michael Brachmann, William Spoth, Poonam Kumari, Ying Yang, Su Feng, Heiko Mueller, Aaron Huber, Nachiket Deo, and many more...</p>
</section>
<section>
<section>
<h2>But first...</h2>
</section>
<section>
<h3>Databases?</h3>
<img src="graphics/2022-04-02/er-diagrams.png" height="400px">
</section>
<section>
<img src="graphics/clipart/Female-or-Male-Unisex-Geek-or-Nerd-Light-Skin.svg" height="400px">
<attribution><a href="https://openclipart.org/">openclipart.org</a></attribution>
</section>
<section>
<h2>Data Structures?</h2>
<img src="graphics/2022-04-02/250-textbook.png" height="300px">
</section>
<section>
<img src="graphics/2022-04-02/Macintosh_classic_250.jpg" height="400px">
<attribution>Adapted from <a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=10101">Wikimedia Commons</a></attribution>
</section>
<section>
<img src="graphics/2022-04-02/nuclear_dino_db.png">
<attribution>Adapted from <a href="https://www.destroyallsoftware.com/talks/wat">Wat; Gary Bernhardt @ CodeMash 2012</a></attribution>
</section>
<section>
<h3>CSE 562; Database Systems</h3>
<ul>
<li>A bit of operating systems</li>
<li>A bit of hardware</li>
<li>A bit of compilers</li>
<li>A bit of distributed systems</li>
</ul>
<p class="fragment" style="font-weight: bold; margin-top: 50px">Applied Computer Science</p>
</section>
<section>
<img src="graphics/2022-04-02/db-convergence.svg">
</section>
</section>
<section>
<section>
<h3>For example...</h3>
</section>
<section>
<pre><code class="sql">
CREATE VIEW salesSinceLastMonth AS
  SELECT l.*
  FROM lineitem l, orders o
  WHERE l.orderkey = o.orderkey
    AND o.orderdate > NOW() - INTERVAL '1 month';
</code></pre>
<pre><code class="sql">
SELECT partkey FROM salesSinceLastMonth
ORDER BY shipdate DESC LIMIT 10;
</code></pre>
<pre><code class="sql">
SELECT suppkey, COUNT(*)
FROM salesSinceLastMonth
GROUP BY suppkey;
</code></pre>
<pre><code class="sql">
SELECT DISTINCT partkey
FROM salesSinceLastMonth;
</code></pre>
</section>
<section>
<pre><code class="python">
def really_expensive_computation():
    return [
        expensive_computation(i)
        for i in range(1, 1000000)
        if expensive_test(i)
    ]
</code></pre>
<pre><code class="python">
print(sorted(really_expensive_computation())[:10])
</code></pre>
<pre><code class="python">
print(len(really_expensive_computation()))
</code></pre>
<pre><code class="python">
print(set(really_expensive_computation()))
</code></pre>
</section>
<section>
<pre><code class="python">
def really_expensive_computation():
    return [
        expensive_computation(i)
        for i in range(1, 1000000)
        if expensive_test(i)
    ]
view = really_expensive_computation()
</code></pre>
<pre><code class="python">
print(sorted(view)[:10])
</code></pre>
<pre><code class="python">
print(len(view))
</code></pre>
<pre><code class="python">
print(set(view))
</code></pre>
</section>
<section>
<p><b>Opportunity:</b> Views are queried frequently</p>
<p><b>Idea:</b> Pre-compute and save the view's contents!</p>
</section>
<section>
<p>Btw... this idea is the essence of CSE 250.</p>
</section>
<section>
<svg data-src="graphics/2022-04-02/DBToQ.svg" />
<attribution>openclipart.org</attribution>
</section>
<section>
<p>When the base data changes, <br/>the view needs to be updated too!</p>
</section>
<section>
<pre><code class="python">
def init():
    global view
    view = query(database)  # evaluate the query once, up front
</code></pre>
<p style="margin-top: 100px;">Our view starts off initialized</p>
</section>
<section>
<p style="margin-top: 100px;"><b>Idea:</b> Recompute the view from scratch when data changes.</p>
</section>
<section>
<pre><code class="python">
def update(changes):
    global database, view
    database = database + changes
    view = query(database)  # includes changes
</code></pre>
</section>
<section>
<img src="graphics/clipart/Snail.jpg" height="400px">
<attribution><a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=95926">Wikimedia Commons</a></attribution>
</section>
<section>
<pre><code class="python">
def update(changes):
    global database, view
    view = view + delta(query, database, changes)
    database = database + changes
</code></pre>
<table style="margin-top: 50px;">
<tr class="fragment">
<td style="font-size: 150%;"><tt>delta</tt></td>
<td>(ideally) Small &amp; fast query</td>
</tr>
<tr class="fragment">
<td style="font-size: 150%;"><tt>+</tt></td>
<td>(ideally) Fast "merge" operation</td>
</tr>
</table>
</section>
<section>
<h3>Intuition</h3>
<div>
$$\mathcal{D} = \{\ 1,\ 2,\ 3,\ 4\ \} \hspace{1in} \Delta\mathcal{D} = \{\ 5\ \}$$
$$Q(\mathcal D) = \texttt{SUM}(\mathcal D)$$
</div>
<div style="margin-top: 50px;">
<div class="fragment">$$ 1 + 2 + 3 + 4 + 5 $$</div>
<div class="fragment">$Q(\mathcal D+\Delta\mathcal D)$ <span class="fragment">$\sim O(|\mathcal D| + |\Delta\mathcal D|)$</span></div>
</div>
<div style="margin-top: 50px;">
<div class="fragment">$10$<span class="fragment">$+ 5$</span></div>
<div class="fragment">$\texttt{VIEW} + \texttt{SUM}(\Delta\mathcal D)$ <span class="fragment">$\sim O(|\Delta\mathcal D|)$</span></div>
</div>
</section>
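<section>
<p style="font-size: 70%;">A minimal sketch of the same arithmetic in Python. The names <tt>init</tt>, <tt>update</tt>, and <tt>delta_sum</tt> are made up for illustration, and the delta rule is written by hand here; it only covers inserted rows.</p>
<pre><code class="python">
database = [1, 2, 3, 4]
view = None

def init():
    """Evaluate the query once over the full database."""
    global view
    view = sum(database)              # cost grows with the database

def delta_sum(changes):
    """Delta of SUM for inserted rows: the sum of the new rows."""
    return sum(changes)               # cost grows with the changes only

def update(changes):
    """Merge the delta into the saved view, then apply the changes."""
    global view
    view = view + delta_sum(changes)
    database.extend(changes)

init()
update([5])
print(view)                           # 15
print(view == sum(database))          # True
</code></pre>
</section>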
<section>
<img src="graphics/2022-04-02/morpheus.jpeg">
<attribution>©1999 Warner Bros. Pictures</attribution>
</section>
</section>
<section>
<h6 style="font-size: 60%" class="fragment" data-fragment-index="2">Get off my database's lawn, punk kids</h6>
<h4 class="fragment" data-fragment-index="1"><span class="fragment strike" data-fragment-index="2">Why Jupyter Sucks</span></h4>
<h2 class="fragment strike" data-fragment-index="1">Microkernel Notebooks</h2>
<h4 style="margin-top: 20px;">Oliver Kennedy</h4>
<p style="font-size: 70%; width: 730px; margin-right: auto; margin-left: auto; margin-top: 100px;" >
<a href="https://vizierdb.info">
<img src="graphics/logos/vizier-blue.svg" height="70px" style="float: left; margin-right: 20px; vertical-align: middle;" />
</a>
Boris Glavic, Juliana Freire, Michael Brachmann, William Spoth, Poonam Kumari, Ying Yang, Su Feng, Heiko Mueller, Aaron Huber, Nachiket Deo, and many more...</p>
</section>
<section>
<section>
<img src="graphics/logos/jupyter.svg" height="500px">
<img src="graphics/2022-04-02/jupyter.png" height="500px">
</section>
<section>
<pre><code class="python">
import pandas as pd
</code></pre>
<pre class="fragment"><code class="python">
df = pd.read_csv("AMS-USDA-Directories-FarmersMarkets.csv")
df
</code></pre>
<img class="fragment" src="graphics/2022-04-02/jupyter-table.png" width="800px">
<pre class="fragment"><code class="python">
df.groupby("County").count()
</code></pre>
<p class="fragment">...</p>
</section>
<section>
<img src="graphics/2022-04-02/oz.jpeg" height="500px">
<attribution>©1939 Metro-Goldwyn-Mayer</attribution>
</section>
<section>
<img src="graphics/2022-04-02/oz_curtain.jpeg" height="500px">
<attribution>©1939 Metro-Goldwyn-Mayer</attribution>
</section>
<section>
<console>
Python 3.9.7 (default, Sep 10 2021, 14:59:43)
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> <span class="fragment">import pandas as pd
>>> </span><span class="fragment">df = pd.read_csv("AMS-USDA-Directories-FarmersMarkets.csv")
>>> df</span>
<div class="fragment">       FMID                                         MarketName  ...  WildHarvested            updateTime
0   1000519                      Alexandria Bay Farmers Market  ...              N  2/1/2021 11:02:22 AM
1   1021329                              Aurora Farmers Market  ...              Y  1/30/2021 6:24:08 PM
2   1002064                             Belmont Farmers Market  ...              N  1/27/2021 9:03:15 PM
3   1021262              Broome County Regional Farmers Market  ...              Y  1/5/2021 10:02:05 AM
4   1021202                      Canal Village Farmers' Market  ...              N   9/9/2020 7:55:23 PM
..      ...                                                ...  ...            ...                   ...
82  1020021                        Waterloo Rotary Farm Market  ...              N   8/3/2020 2:28:33 PM
83  1000384          Webster's Joe Obbie Farmers' Market, Inc.  ...              N  1/5/2021 10:18:30 AM
84  1002177        West Point-Town of Highlands Farmers Market  ...              N  8/2/2018 12:58:13 AM
85  1019038                            Woodstock Farm Festival  ...              N  4/4/2018 11:27:02 AM
86  1007259  Yates County Cooperative Farm and Craft Market...  ...              N  2/3/2019 12:29:07 PM

[87 rows x 59 columns]
>>> </div>
</console>
</section>
<section>
<p>Cells are code snippets that get pasted into a long-running <b>kernel</b></p>
</section>
<section>
<img src="graphics/2022-04-02/joelgrus.png" height="500px">
<attribution><a href="https://www.youtube.com/watch?v=7jiPeIFXb6U">I don't like notebooks; Joel Grus (Allen Institute for Artificial Intelligence)</a></attribution>
</section>
<section>
<img src="graphics/2022-04-02/joelgrus_hiddenstate.png" height="500px">
<attribution><a href="https://www.youtube.com/watch?v=7jiPeIFXb6U">I don't like notebooks; Joel Grus (Allen Institute for Artificial Intelligence)</a></attribution>
</section>
<section>
<img src="graphics/2022-04-02/joelgrus_y_is_5.png" height="300px">
<attribution><a href="https://www.youtube.com/watch?v=7jiPeIFXb6U">I don't like notebooks; Joel Grus (Allen Institute for Artificial Intelligence)</a></attribution>
</section>
<section>
<p>Evaluation Order ≠ Notebook Order</p>
<p style="margin-top: 100px; font-weight: bold;" class="fragment">... but why?</p>
</section>
<section>
<h3>In a monokernel...</h3>
</section>
<section>
<pre><code class="python">
import pandas as pd
</code></pre>
<pre class="fragment"><code class="python">
df = pd.read_csv("really_big_dataset.csv")
</code></pre>
<pre class="fragment"><code class="python">
test = df.iloc[:800]
train = df.iloc[800:]
</code></pre>
<pre class="fragment"><code class="python">
model = train_linear_regression(train, "target")
</code></pre>
<pre class="fragment"><code class="python">
evaluate_linear_regression(model, test, "target")
</code></pre>
</section>
<section>
<pre><code class="python">
import pandas as pd
</code></pre>
<pre><code class="python">
df = pd.read_csv("really_big_dataset.csv")
</code></pre>
<pre style="box-shadow: 0px 0px 12px red; "><code class="python">
test = df.iloc[:500]
train = df.iloc[500:]
</code></pre>
<pre class="fragment red-shadow-current" data-fragment-index="1"><code class="python">
model = train_linear_regression(train, "target")
</code></pre>
<pre class="fragment red-shadow-current" data-fragment-index="1"><code class="python">
evaluate_linear_regression(model, test, "target")
</code></pre>
</section>
<section>
<h3>Q1: Which cells need to be re-evaluated?</h3>
<p style="margin-top: 100px; font-weight: bold;" class="fragment">Idea 1: All of them!</p>
</section>
<section>
<pre><code class="python">
import pandas as pd
df = pd.read_csv("really_big_dataset.csv")
test = df.iloc[:500]
train = df.iloc[500:]
model = train_linear_regression(train, "target")
evaluate_linear_regression(model, test, "target")
</code></pre>
</section>
<section>
<img src="graphics/clipart/Snail.jpg" height="400px">
<attribution><a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=95926">Wikimedia Commons</a></attribution>
</section>
<section>
<p>... but <tt>df</tt> is still around, and you can "re-use" it.</p>
<p>Idea 2: Skip cells that haven't changed.</p>
<p style="margin-top: 100px; font-weight: bold;" class="fragment">... but <u class="fragment highlight-red">you</u> need to keep track of this.</p>
</section>
</section>
<section>
<section>
<p>Idea 3: Pull out your CSE 443 textbooks</p>
</section>
<section>
<svg data-src="graphics/2022-04-02/data-flow-simple.svg" height="500px" />
</section>
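<section>
<p style="font-size: 70%;">One way to sketch a data flow graph: record which variables each cell reads and writes, then walk the graph to find everything downstream of an edit. The cell numbers and read/write sets are assumptions based on the five-cell example from earlier in the deck, and the sketch assumes each variable is written by exactly one cell.</p>
<pre><code class="python">
from collections import deque

# Hypothetical bookkeeping: what each cell reads and writes.
cells = {
    1: {"reads": set(),             "writes": {"pd"}},
    2: {"reads": {"pd"},            "writes": {"df"}},
    3: {"reads": {"df"},            "writes": {"test", "train"}},
    4: {"reads": {"train"},         "writes": {"model"}},
    5: {"reads": {"model", "test"}, "writes": set()},
}

def downstream(changed):
    """Cells that transitively read something the changed cell writes."""
    stale, queue = set(), deque([changed])
    while queue:
        writes = cells[queue.popleft()]["writes"]
        for cell_id, cell in cells.items():
            if cell_id not in stale and not writes.isdisjoint(cell["reads"]):
                stale.add(cell_id)
                queue.append(cell_id)
    return sorted(stale)

print(downstream(3))   # [4, 5]
</code></pre>
</section>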
<section>
<h2>Data Flow Graph</h2>
<p>Cell 3 changed, so re-evaluate only cells 4 and 5</p>
<p style="margin-top: 100px; font-weight: bold;" class="fragment">... but</p>
</section>
<section>
<h2>...</h2>
<pre><code class="python">
model = train_linear_regression(train, "target")
</code></pre>
<pre><code class="python">
evaluate_linear_regression(model, test, "target")
</code></pre>
<pre class="fragment"><code class="python">
df = pd.read_csv("another_really_big_dataset.csv")
test = df.iloc[:500]
train = df.iloc[500:]
</code></pre>
<p style="margin-top: 100px; font-weight: bold;" class="fragment"><tt>df</tt> has changed!</p>
</section>
<section>
<p>We want to "snapshot" <tt>df</tt> in between cells.</p>
</section>
<section>
<svg data-src="graphics/2022-04-02/monokernel.svg" height="350px" />
</section>
<section>
<svg data-src="graphics/2022-04-02/microkernel.svg" height="350px" />
</section>
<section>
<p>The kernel runs, snapshots its variables, and quits.</p>
</section>
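<section>
<p style="font-size: 70%;">A rough sketch of that life cycle, assuming a throwaway Python namespace stands in for a fresh kernel process and <tt>pickle</tt> stands in for the snapshot format. <tt>run_cell</tt> and its arguments are hypothetical names, not Vizier's API.</p>
<pre><code class="python">
import pickle

def run_cell(source, inputs, outputs):
    """Run one cell in a throwaway namespace and snapshot its outputs.

    inputs:  dict of variable name to pickled snapshot
    outputs: names of the variables the cell is expected to produce
    """
    namespace = {name: pickle.loads(blob) for name, blob in inputs.items()}
    exec(source, namespace)                    # the "kernel" runs
    snapshot = {name: pickle.dumps(namespace[name]) for name in outputs}
    return snapshot                            # namespace is dropped: the "kernel" quits

# Cell A produces df; cell B works on its own copy of the snapshot.
snap_a = run_cell("df = list(range(10))", {}, ["df"])
snap_b = run_cell("test = df[:5]\ntrain = df[5:]\ndf.clear()",
                  snap_a, ["test", "train"])

print(pickle.loads(snap_b["test"]))    # [0, 1, 2, 3, 4]
print(pickle.loads(snap_a["df"]))      # unchanged, even though cell B cleared its copy
</code></pre>
</section>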
<section>
<svg data-src="graphics/2022-04-02/microkernel-invalidate.svg" height="500px" />
</section>
<section>
<h2>Microkernel Notebooks</h2>
<ul>
<li>Lots of small "micro-kernels"</li>
<li>Explicit inter-cell messaging</li>
<li>Messages are snapshotted for re-use</li>
</ul>
</section>
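<section>
<p style="font-size: 70%;">A sketch of how snapshotted messages could be re-used, assuming a cell is identified by a hash of its code plus the messages it receives. The cache layout and the names <tt>fingerprint</tt> and <tt>run_cell</tt> are illustrative assumptions, not Vizier's implementation.</p>
<pre><code class="python">
import hashlib
import pickle

cache = {}   # fingerprint to snapshotted output messages
runs = []    # log of the cells that actually executed

def fingerprint(source, inputs):
    """Hash the cell's code together with the messages it receives."""
    digest = hashlib.sha256(source.encode())
    for name in sorted(inputs):
        digest.update(name.encode())
        digest.update(inputs[name])
    return digest.hexdigest()

def run_cell(source, inputs, outputs):
    """Re-use saved messages when neither the code nor the inputs changed."""
    key = fingerprint(source, inputs)
    if key not in cache:
        namespace = {name: pickle.loads(blob) for name, blob in inputs.items()}
        exec(source, namespace)
        cache[key] = {name: pickle.dumps(namespace[name]) for name in outputs}
        runs.append(source)
    return cache[key]

load = "df = list(range(1000))"
split_v1 = "test, train = df[:800], df[800:]"
split_v2 = "test, train = df[:500], df[500:]"

run_cell(split_v1, run_cell(load, {}, ["df"]), ["test", "train"])
run_cell(split_v2, run_cell(load, {}, ["df"]), ["test", "train"])
print(len(runs))   # 3: the load ran once, each version of the split ran once
</code></pre>
</section>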
<section>
<svg data-src="graphics/2022-04-02/microkernel-multiarch.svg" height="500px" />
</section>
<section>
<svg data-src="graphics/2022-04-02/microkernel-parallelism.svg" height="500px" />
</section>
<section>
<svg data-src="graphics/2022-04-02/microkernel-parallelism-2.svg" height="500px" />
</section>
<section>
<h2>Demo</h2>
<a href="https://vizierdb.info">
<img src="graphics/logos/vizier-blue.svg" height="200px" />
</a>
</section>
<section>
<a href="https://vizierdb.info">
<img src="graphics/logos/vizier-blue.svg" height="200px" />
</a>
<p><a href="https://vizierdb.info">https://vizierdb.info</a></p>
<p><a href="https://github.com/VizierDB/vizier-scala">https://github.com/VizierDB/vizier-scala</a></p>
</section>
</section>