---
template: templates/talk_slides_v1.erb
title: "Overlay Spreadsheets"
---
<section>
<h2>Overlay Spreadsheets</h2>
<div style="font-size: 80%;">
<div style="display: inline-block; border-right: 1px solid black; padding-right: 10px; margin-right: 5px;">
<h4 style="font-weight: bold;">Oliver Kennedy</h4>
<h5 style="font-weight: bold;">University at Buffalo</h5>
</div>
<div style="display: inline-block; border-right: 1px solid black; padding-right: 10px; margin-right: 5px; padding-left: 0px;">
<h4>Boris Glavic</h4>
<h5>Illinois Institute of Technology</h5>
</div>
<div style="display: inline-block; padding-left: 0px;">
<h4>Michael Brachmann</h4>
<h5>Breadcrumb Analytics</h5>
</div>
</div>
</section>
<section>
<a href="https://vizierdb.info">
<img src="graphics/2022-06-20/vizier.svg" height="200px">
</a>
<h3>The Vizier Notebook</h3>
<aside class="notes">
<ul>
<li>Mention Vizier</li>
<li>Re-inventing notebooks</li>
<li>You don't need to know all the details, but...</li>
</ul>
</aside>
</section>
<section>
<h3>1. Vizier is a Workflow System </h3>
<svg data-src="graphics/2023-06-18/workflow.svg" style="margin-top: 50px"/>
<aside class="notes">
<ul>
<li>Vizier is a workflow system</li>
<li>Data "flows" between cells</li>
</ul>
</aside>
</section>
<section>
<h3>2. Vizier isn't Just Code</h3>
<img src="graphics/2023-06-18/vizier-gui.png" height="400px">
<aside class="notes">
<ul>
<li>Graphical cells exist to streamline standard tasks</li>
<li>Including...</li>
</ul>
</aside>
</section>
<section>
<h3>Including...</h3>
<img src="graphics/2023-06-18/vizier-spreadsheet.png" height="300px">
<aside class="notes">
<ul>
<li>Vizier allows users to interactively modify datasets</li>
<li>... many users find spreadsheets more intuitive</li>
<li>... editing a spreadsheet can be faster than writing code</li>
<li>... sometimes you only need to edit one or two elements</li>
<li>See my HILDA 2016 talk</li>
</ul>
</aside>
</section>
<section>
<h3>Spreadsheets in Workflows</h3>
<img src="graphics/2023-06-18/wrangler.png" height="200px">
<attribution>(Wrangler: Interactive Visual Specification of Data Transformation Scripts; Kandel et al.; CHI 2011)</attribution>
<aside class="notes">
<ul>
<li>The same idea has appeared in other work, most prominently Wrangler, which gives users the ability to specify workflow steps visually.</li>
<li>... but the key idea is that this is not (click) a spreadsheet.</li>
</ul>
</aside>
</section>
<section>
<h3>Spreadsheets in Workflows</h3>
<img src="graphics/2023-06-18/mito.jpg" height="400px">
<attribution>(trymito.io)</attribution>
<aside class="notes">
The idea shows up in a few other places, too. For example, Mito is a tool that generates Python code from spreadsheet interactions.
</aside>
</section>
<section>
<h3><span class="fragment" data-fragment-index="1">“</span>Spreadsheets<span class="fragment" data-fragment-index="1">”</span> in Workflows</h3>
<img src="graphics/2023-06-18/vizier-vizual.png" height="400px">
<attribution>(The Exception that Improves the Rule; Freire et al.; HILDA 2016)</attribution>
<aside class="notes">
Both Vizier and Wrangler provide something that looks vaguely like a spreadsheet, but "under the hood", both are just emitting workflows. The approach is <b>reproducible</b>/<b>reusable</b>, and supports both structural reorganization and minor edits, but lacks good support for data entry and, crucially, "free-form" editing.
</aside>
</section>
<section>
<h3>Spreadsheets as Workflow Precursors</h3>
<img src="graphics/2023-06-18/excel-futzing.png" height="400px">
<aside class="notes">
Spreadsheets are "free-form" data exploration tools. Any computation can be put anywhere. However, as the process transitions from purely exploratory to deployable, users generally have to rebuild the workflow from scratch.
</aside>
</section>
<!--
<section>
<h3>... so why not use a spreadsheet?</h3>
<h3 class="fragment">... people do!</h3>
</section>
<section>
<h3>Spreadsheets in "Workflows"</h3>
<img src="graphics/2023-06-18/ipynb-excel.png" height="200px">
<aside class="notes">
Spreadsheets still show up in data processing workflows. For example, they can facilitate workflows involving a data-entry step. A colleague of mine uses workflows where one notebook generates templates for data entry; the resulting spreadsheet is manually edited, and the next notebook is invoked.
Clearly, this is viable, but requires care and pedantry on the part of the notebook developer(s):
<ol>
<li>Clearly communicating what needs to happen and when</li>
<li>Ensuring that prior efforts are not lost (don't overwrite old templates, design templates for re-use)</li>
</ol>
</aside>
</section>
-->
<section>
<h3>So why not just use a spreadsheet?</h3>
<svg data-src="graphics/2023-06-18/workflow-revision.svg" style="margin-top: 50px"/>
<aside class="notes">
So you've got your workflow... somewhere in there you put a spreadsheet. But now you make a change to one of the early steps. How do you propagate the changes into the spreadsheet?
</aside>
</section>
<section>
<h3>Goals</h3>
<ul>
<li>Build a spreadsheet...</li>
<li class="fragment">... that can replay edits on a new dataset.</li>
<li class="fragment">... while supporting "free-form" editing.</li>
</ul>
<aside class="notes">
So at a high level, we want to be able to support the free-form editing experience of a spreadsheet, while simultaneously decoupling the edits from the underlying dataset.
</aside>
</section>
<section>
<h3>So we built a spreadsheet...</h3>
<ol>
<li>Replaying <b>Shape</b> Updates</li>
<li>Replaying <b>Content</b> Updates</li>
</ol>
<aside class="notes">
We implemented a spreadsheet. A lot of the pedantic details are not especially interesting. See the paper or come talk to us if you have questions. But to give you a high-level view of the problem, we basically need to be able to support changes to the shape of the data (row/column insertions/deletions/moves), as well as changes to the contents.
</aside>
</section>
<section>
<h3>Shape Changes</h3>
<svg data-src="graphics/2023-06-18/resize.svg" style="scale: 80%; "/>
<div style="font-size: 60%;" class="fragment" data-fragment-index=1>
<h4 style="text-decoration: underline;">Mapping Functions</h4>
<span class="fragment" data-fragment-index=2>$row(n) = \begin{cases}
n & \textbf{if } \color{#d40000}{\blacksquare}, \color{#6c5353}{\blacksquare}\\
n+1 & \textbf{if } \color{#008080}{\blacksquare}, \color{#a05a2c}{\blacksquare}\\
n-3 & \textbf{if } \color{#808000}{\blacksquare}
\end{cases}$
</span> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <span class="fragment" data-fragment-index=3>
$col(n) = \begin{cases}
n & \textbf{if } \color{#d40000}{\blacksquare}, \color{#008080}{\blacksquare}\\
n-1 & \textbf{if } \color{#6c5353}{\blacksquare}, \color{#a05a2c}{\blacksquare}\\
\texttt{null} & \textbf{if } \color{#536c53}{\blacksquare}\\
\end{cases}$</span>
</div>
<aside class="notes">
Shape changes are (relatively) easy. We need a way to map the naming scheme of the spreadsheet into the naming scheme of the source dataset. Technical details are in the paper, but this is essentially just a big 2D index.
</aside>
</section>
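<section>
<h3>Sketch: Mapping Functions</h3>
<p style="font-size: 60%;">A minimal sketch of the mapping-function idea, not Vizier's actual index structure. It assumes the mapping is stored as a short list of hypothetical <code>(first, last, offset)</code> pieces, with an offset of <code>None</code> marking an inserted row or column that has no counterpart in the source dataset.</p>
<pre><code class="python">
# Hypothetical piecewise mapping from spreadsheet positions to source positions.
# Each piece is (first, last, offset); offset None marks an inserted row or
# column that does not exist in the source dataset.
def make_mapping(pieces):
    def lookup(n):
        for first, last, offset in pieces:
            if n in range(first, last + 1):
                return None if offset is None else n + offset
        raise IndexError(n)
    return lookup

# Example: one column inserted at spreadsheet position 2, so every later
# spreadsheet column maps to the source column one position to its left.
col = make_mapping([(0, 1, 0), (2, 2, None), (3, 9, -1)])
print(col(1), col(2), col(3))
</code></pre>
<p style="font-size: 60%;">A linear scan over the pieces keeps the sketch short; a real index would likely use a search tree over the piece boundaries.</p>
</section>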
<section>
<h3>Content Updates</h3>
<svg data-src="graphics/2023-06-18/overlay.svg" style="scale: 80%; "/>
<aside class="notes">
<p>To track content, we have, essentially, a large, sparse grid that behaves a lot like a
normal spreadsheet, but that allows requests for data to "flow through" to the underlying dataset for cells that are not defined. We call this an "Overlay".</p>
<p>Although the implementation details are non-trivial, they're also not especially interesting unless you're implementing a spreadsheet. What is interesting, however, is how we cope with the fact that the source dataset is often going to be super big.</p>
</aside>
</section>
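<section>
<h3>Sketch: An Overlay</h3>
<p style="font-size: 60%;">A rough illustration of the "flow through" behavior, assuming the source dataset is exposed as a plain function from (row, column) to a value. The class name <code>Overlay</code> and the two toy sources are hypothetical, and the sketch stores only literal values, not formulas.</p>
<pre><code class="python">
class Overlay:
    """Sparse grid of edits; cells without an edit fall through to the source."""

    def __init__(self, source):
        self.source = source   # callable taking (row, col), returning a value
        self.cells = {}        # (row, col) mapped to the overriding value

    def put(self, row, col, value):
        self.cells[(row, col)] = value

    def get(self, row, col):
        if (row, col) in self.cells:
            return self.cells[(row, col)]
        return self.source(row, col)

# Two toy versions of the upstream dataset.
def source_v1(row, col):
    return row * 10 + col

def source_v2(row, col):
    return row * 100 + col

edits = Overlay(source_v1)
edits.put(2, 1, "fixed")
before = (edits.get(2, 1), edits.get(1, 3))
edits.source = source_v2   # an earlier workflow step changed: swap the source
after = (edits.get(2, 1), edits.get(1, 3))
print(before, after)
</code></pre>
<p style="font-size: 60%;">The edit at (2, 1) survives the swap, while the untouched cell at (1, 3) picks up the new source value.</p>
</section>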
<section>
<h3>Coping with Big Data</h3>
<ol>
<li class="fragment">Classical Spreadsheets (Excel, Google Sheets, etc...)</li>
<li class="fragment">DataSpread (Bendre et al.; ICDE 2018)</li>
<li class="fragment">Ignore the Problem (This Talk)</li>
</ol>
<aside class="notes">
<p>So how do we handle "big" data? Well, in a normal spreadsheet... we don't. Notably, these start barfing around a few 100k rows. DataSpread scales that up by quite a bit, but reaction times can still get slow if you have a lot of data, and that's not entirely surprising. This is an in-memory reactive database, with data flows involving potentially millions of steps. </p>
<p>So, what other options do we have?</p>
</aside>
</section>
<section>
<h3>Patterns</h3>
<svg data-src="graphics/2023-06-18/patterns.svg" style="scale: 80%; "/>
<aside class="notes">
<p>Well, one thing is to observe that while each cell <i>technically</i> has its own formulas, users don't usually type formulas out one cell at a time. Rather, they write a single cell (click), and then copy the formula across multiple cells (click). The resulting formulas are modified for each cell they're pasted into, following a simple pattern. This is how most (click) bulk updates to the spreadsheet happen. </p>
<p>Instead of storing formulas per-cell, we can just store the pattern, along with the range of cells that it was applied to. (notably, the DataSpread team noticed this as well).</p>
<p>Now, this saves us a ton on space. It can also be used to do things like cell invalidation pretty fast, but we still end up doing a ton of compute on every change.</p>
</aside>
</section>
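<section>
<h3>Sketch: Storing Patterns</h3>
<p style="font-size: 60%;">One way the "store the pattern plus its range" idea might look, assuming a formula is summarized by its relative cell references. The <code>Pattern</code> record and its method names are hypothetical, not the representation used in Vizier or DataSpread.</p>
<pre><code class="python">
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    column: str
    first_row: int
    last_row: int
    refs: tuple   # relative references, each a (column, row_offset) pair

    def covers(self, column, row):
        in_rows = row in range(self.first_row, self.last_row + 1)
        return column == self.column and in_rows

    def dependencies(self, row):
        """Concrete cells read by the formula instance at this row."""
        return [(col, row + offset) for col, offset in self.refs]

    def readers_of(self, column, row):
        """Rows of this pattern whose formula reads the given cell."""
        rows = [row - offset for col, offset in self.refs if col == column]
        return [r for r in rows if r in range(self.first_row, self.last_row + 1)]

# "C = A + B" pasted down one million rows, stored as a single record.
total = Pattern("C", 0, 999_999, (("A", 0), ("B", 0)))
print(total.covers("C", 500_000), total.dependencies(500_000))
print(total.readers_of("A", 7))
</code></pre>
<p style="font-size: 60%;">Invalidation becomes arithmetic on offsets (<code>readers_of</code>) rather than a lookup in a per-cell dependency table.</p>
</section>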
<section>
<h3>Coping with Big Data</h3>
<svg data-src="graphics/2023-06-18/focus.svg" style="scale: 80%; "/>
<aside class="notes">
The key insight here is that the user isn't going to be paying attention to the entire million rows of data all at once. It'll be a flat 100-200, maybe even 1k rows. Regardless, pretty manageable... if we can pull it off.
</aside>
</section>
<section>
<p>Only materialize the subset of visible rows <br/><span class="fragment">(and/or columns)</span></p>
<aside class="notes">
<p>So, we're only going to materialize a subset of the rows. The frontend will keep us informed about what's on (or near to) the screen, and we're going to do our best to ignore everything that's offscreen.</p>
<p>Note: While we're going to focus on limiting rows here for simplicity, the same ideas apply to columns as well.</p>
</aside>
</section>
<section>
<h3>Restricting the Source Data</h3>
<svg data-src="graphics/2023-06-18/focus-source.svg" style="scale: 80%; "/>
<aside class="notes">
We have our mapping functions, so we can figure out what rows of the source dataset need to be retrieved. For Vizier, we built a simple LRU cache over Spark to make this sort of access efficient, but most storage engines with efficient support for LIMIT/OFFSET queries can handle this gracefully.
</aside>
</section>
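<section>
<h3>Sketch: Paging the Source</h3>
<p style="font-size: 60%;">A small sketch of an LRU page cache in front of a LIMIT/OFFSET-style reader. The <code>PageCache</code> class, its parameters, and the <code>fake_fetch</code> stand-in are assumptions for illustration; Vizier's cache over Spark is not shown here.</p>
<pre><code class="python">
from collections import OrderedDict

class PageCache:
    def __init__(self, fetch_page, page_size=100, capacity=4):
        self.fetch_page = fetch_page   # callable taking (offset, limit)
        self.page_size = page_size
        self.capacity = capacity       # maximum number of cached pages
        self.pages = OrderedDict()     # page number mapped to list of rows

    def row(self, n):
        page = n // self.page_size
        if page in self.pages:
            self.pages.move_to_end(page)        # mark as most recently used
        else:
            offset = page * self.page_size
            self.pages[page] = self.fetch_page(offset, self.page_size)
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
        return self.pages[page][n % self.page_size]

# Stand-in for a "SELECT ... LIMIT limit OFFSET offset" query.
calls = []
def fake_fetch(offset, limit):
    calls.append(offset)
    return list(range(offset, offset + limit))

cache = PageCache(fake_fetch)
values = [cache.row(n) for n in (250, 251, 299, 1000)]
print(values, calls)
</code></pre>
<p style="font-size: 60%;">Three of the four lookups land on the same page, so only two fetches reach the backing store.</p>
</section>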
<section>
<h3>Restricting the Overlay</h3>
<svg data-src="graphics/2023-06-18/focus-overlay-simple.svg"/>
<aside class="notes">
<p>So that leaves the overlay. We need to be able to compute the contents of each cell, given its formula. Remember, we're storing all of the patterns, but we're only materializing the specific cells we're interested in.</p>
<p>(click) So, we've restricted ourselves to <b>materializing</b> a specific set of rows. (click) If we're lucky, all of the cells our formula references are defined in this restricted region and we're basically done. On the other hand, maybe we're not so lucky...</p>
</aside>
</section>
<section>
<h3>Restricting the Overlay</h3>
<svg data-src="graphics/2023-06-18/focus-overlay-offset.svg"/>
<aside class="notes">
<p>... in which case a dependency points outside of our materialized range (click)</p>
</aside>
</section>
<section>
<p>We need to materialize the <br/><b>transitive closure</b><br/> of the cell dependencies.</p>
<aside class="notes">
<p>... so we need the transitive closure. Now again, if we're lucky, we're done here... but what if a cell defined by some pattern depends on a different cell defined by the same pattern?</p>
</aside>
</section>
<section>
<h3>Restricting the Overlay</h3>
<svg data-src="graphics/2023-06-18/focus-overlay-recursive.svg"/>
<aside class="notes">
<p>In the worst case, the pattern depends on itself. A common example is the "running sum", where each cell's value is defined by summing the value for the previous row with the value for the current row.</p>
<p>In this case, the cell's dependencies stretch all the way to the beginning of the dataset. Not a problem if we're viewing the first 100 rows, but it becomes a problem once we scroll down to the millionth row or so.</p>
<p>We call this case, where one cell defined by a pattern references a different cell defined by the same pattern, <b>a recursive pattern</b>.</p>
</aside>
</section>
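<section>
<h3>Sketch: Transitive Closure</h3>
<p style="font-size: 60%;">To make the cost of a recursive pattern concrete, here is a small sketch that computes the transitive closure of cell dependencies for a running sum. The function names and the (column, row) cell encoding are assumptions made for this example.</p>
<pre><code class="python">
def transitive_closure(visible, dependencies):
    """Every cell needed to evaluate the visible cells."""
    needed = set(visible)
    frontier = list(visible)
    while frontier:
        cell = frontier.pop()
        for dep in dependencies(cell):
            if dep not in needed:
                needed.add(dep)
                frontier.append(dep)
    return needed

# Running sum: H[n] reads G[n] and H[n-1]; G holds plain source data.
def running_sum_deps(cell):
    column, row = cell
    if column != "H":
        return []
    deps = [("G", row)]
    if row > 0:
        deps.append(("H", row - 1))
    return deps

closure = transitive_closure([("H", 1000)], running_sum_deps)
print(len(closure))
</code></pre>
<p style="font-size: 60%;">A single visible cell at row 1000 drags in every H and G cell from row 0 onward, and the closure grows linearly with the scroll position.</p>
</section>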
<section>
<p>Spreadsheets are optimized for reactive, interactive computation over cells.</p>
<p class="fragment">... but here we have a big bulk computation.</p>
</section>
<section>
<h3>Recursive Patterns</h3>
<svg data-src="graphics/2023-06-18/focus-overlay-recursive-batch.svg"/>
<aside class="notes">
<p>Compute values that fall outside of the materialized region using a batch-processing query engine (e.g., Spark). Cache them. It's still necessary to figure out if some update invalidates the cached value, but you only need to cache the single value, and computing it through a system optimized for batch computation is going to be a lot faster.</p>
<p>So, how do we convert the formula to a query?</p>
</aside>
</section>
<section>
<h3>Recursive Patterns</h3>
<div style="margin-top: 50px;">
$$H[0] = G[0]$$
$$H[n] = G[n] + H[n-1]$$
</div>
<div class="fragment" data-fragment-index="1" style="margin-top: 10px;">
$$H[n] = sum(G[0:n])$$
</div>
<p class="fragment" data-fragment-index="1" style="margin-top: 50px">Many common patterns have an <br/>equivalent closed-form representation.</p>
<aside class="notes">
So at least for common patterns, we can rewrite them into simple closed-form aggregates that can be sent directly to a SQL engine (like Spark in Vizier).
</aside>
</section>
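<section>
<h3>Sketch: Checking the Closed Form</h3>
<p style="font-size: 60%;">A quick sanity check that the recursive running sum and its closed-form aggregate agree, on made-up sample values for G. The slide's range is read as inclusive of row n, which corresponds to the Python slice <code>G[: n + 1]</code>.</p>
<pre><code class="python">
from itertools import accumulate

G = [3, 1, 4, 1, 5, 9, 2, 6]   # made-up sample data

# Recursive definition: H[0] = G[0], H[n] = G[n] + H[n-1].
H = []
for n, g in enumerate(G):
    H.append(g if n == 0 else g + H[n - 1])

# Closed form: H[n] is the sum of G[0] through G[n], inclusive.
closed = [sum(G[: n + 1]) for n in range(len(G))]

assert H == closed == list(accumulate(G))
print(H)
</code></pre>
<p style="font-size: 60%;">The closed form needs no previously computed H cell, so a batch engine can evaluate any single row directly as an aggregate.</p>
</section>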
<section>
<h3>Recursive Patterns</h3>
<div style="margin-top: 50px;">
$$H[0:1] = G[0:1]$$
$$H[n] = G[n] + H[n-2]$$
</div>
<div class="fragment" data-fragment-index="1" style="margin-top: 10px;">
<pre><code class="sql">
SELECT G + lag(H, 2) OVER (ORDER BY row) AS H FROM ...
</code></pre>
</div>
<p class="fragment" data-fragment-index="1" style="margin-top: 50px">Window queries work for the rest.</p>
<aside class="notes">
And while we don't have a formal proof, we're pretty certain that any patterns that we can't get a closed form for, we can express as a window query.
</aside>
</section>
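<section>
<h3>Sketch: A Window Query</h3>
<p style="font-size: 60%;">One possible window-query formulation of the stride-2 pattern, checked against the recursive definition on made-up data. This uses SQLite through Python's standard <code>sqlite3</code> module rather than Spark, and it assumes a SQLite build with window-function support (3.25 or later). The table name <code>data</code> and column <code>pos</code> are hypothetical, and the query is an illustration, not the one Vizier generates.</p>
<pre><code class="python">
import sqlite3

G = [3, 1, 4, 1, 5, 9, 2, 6]   # made-up sample data

# Recursive definition: H[0:1] = G[0:1], H[n] = G[n] + H[n-2].
H = []
for n, g in enumerate(G):
    H.append(g if n in (0, 1) else g + H[n - 2])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data (pos INTEGER PRIMARY KEY, G INTEGER)")
db.executemany("INSERT INTO data VALUES (?, ?)", list(enumerate(G)))

# Each row depends only on earlier rows of the same parity, so a running
# sum partitioned by parity reproduces the recurrence.
query = """
    SELECT SUM(G) OVER (PARTITION BY pos % 2 ORDER BY pos) AS H
    FROM data
    ORDER BY pos
"""
windowed = [h for (h,) in db.execute(query)]

assert windowed == H
print(windowed)
</code></pre>
<p style="font-size: 60%;">Partitioning by parity splits the data into two independent running sums, one per chain of the recurrence.</p>
</section>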
<section>
<h3>Recursive Patterns</h3>
<canvas data-chart="line">
<!--
{
"data": {
"datasets":[
{"fill": false, "borderColor":"#0000FF", "borderDash": [], "borderWidth": 5},
{"fill": false, "borderColor":"#008080", "borderDash": [], "borderWidth": 5},
{"fill": false, "borderColor":"#008000", "borderDash": [], "borderWidth": 5}
]
},
"options": {
"scales": {
"xAxes": [{
"scaleLabel": {
"display": true,
"labelString": "First Visible Row Index"
}
}],
"yAxes": [{
"type": "logarithmic",
"position": "left",
"scaleLabel": {
"display": true,
"labelString": "Time to Quiescence (s)"
}
}]
}
}
}
-->
Axis,60,600,6000,60000
Vizier,0.24214023499999998,0.416489697,4.6895117310000005,48.321868185
Vizier-Batch,0.292034878,0.291488883,0.31172266200000004,0.457365417
DataSpread,26.404142864,28.609918545,42.390664963,221.393066346
</canvas>
<aside class="notes">
<p>So let's see how it works. We have a large dataset (600k rows) of TPC-H Lineitem. We generate a spreadsheet that roughly models Q1, albeit with a running total column. We're measuring the time until the system is done working.</p>
<p>There's a bit of scrolling overhead in DataSpread, but basically, until we get quite far into the dataset, the runtime is fairly stable at about 30-40 seconds to quiescence. Vizier is a bit faster until we get considerably further in, because we're computing a lot less stuff. However, when we simulate batching in Vizier, we get sub-second quiescence time, even 60k records into the dataset.</p>
</aside>
</section>
<section>
<a href="https://vizierdb.info">
<table>
<tr>
<td>
<img src="graphics/2022-06-20/vizier.svg" height="100px">
</td>
<td style="vertical-align: middle;">
https://vizierdb.info
</td>
</tr>
</table>
</a>
<div style="font-size: 70%">
<ul>
<li style="margin: 10px;">Decoupling edits from source data enables spreadsheets in workflow systems</li>
<li style="margin: 10px;">Overlays support a (fast) hybrid batch/reactive execution model.</li>
</ul>
<h3 style="margin-bottom: 0px; margin-top: 20px;">Open Challenges</h3>
<ul>
<li style="margin: 10px;">Freeform edits don't usually map nicely to structured dataframes.</li>
<li style="margin: 10px;">Content updates don't generalize through external changes in shape.</li>
<li style="margin: 10px; font-weight: bold;">[Your challenge/comment here]</li>
</ul>
</div>
</section>