
main
Oliver Kennedy 2023-03-30 15:33:05 -04:00
parent 32caa6acdc
commit 60161699b9
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
34 changed files with 92 additions and 329 deletions

View File

@ -8,6 +8,7 @@
\usepackage{listings}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{subcaption}
\newcommand{\trimfigurespacing}{\vspace*{-5mm}}
@ -182,8 +183,8 @@ the materialized approach has better performance.
% \input{sections/formalism}
\input{sections/system}
% \input{sections/data}
\input{sections/relwork}
\input{sections/experiments}
\input{sections/relwork}
\input{sections/conclusions}

View File

@ -1,31 +0,0 @@
[info] DataspreadBenchmarkVizierSpec
[info] DataspreadBenchmarkVizierSpec should
[info] Perform Benchamrks consistent with those done with VizierDB
[test] @0: Init Spreadsheet: 21.147718712 s
[test] @0: Monitoring Overhead: 0.204571017 s
[test] @0: Init formulas: 5.258463694 s
[test] @0: Update one: 0.031132039 s
[test] @0: Update all: [not run]
[info] + Time Results @ 0
[test] @60: Init Spreadsheet: 21.075780951 s
[test] @60: Monitoring Overhead: 0.121315652 s
[test] @60: Init formulas: 7.208747642 s
[test] @60: Update one: 0.020606538 s
[test] @60: Update all: [not run]
[info] + Time Results @ 60
[test] @600: Init Spreadsheet: 21.059860767 s
[test] @600: Monitoring Overhead: 0.120807536 s
[test] @600: Init formulas: 24.264395221 s
[test] @600: Update one: 0.007306683 s
[test] @600: Update all: [not run]
[info] + Time Results @ 600
[test] @6000: Init Spreadsheet: 21.056667272 s
[test] @6000: Monitoring Overhead: 0.11134842 s
[test] @6000: Init formulas: 245.847328681 s
[test] @6000: Update one: 0.033850361 s
[test] @6000: Update all: [not run]
[info] + Time Results @ 6000
[info] o Time Results @ 60000 [not run - taking longer than 30 minutes]
[info] Total for specification DataspreadBenchmarkVizierSpec
[info] Finished in 6 minutes 16 seconds, 997 ms
[info] 4 examples, 6 expectations, 0 failure, 0 error

View File

@ -1,32 +0,0 @@
[info] DataspreadBenchmarkVizierSpec
[info] DataspreadBenchmarkVizierSpec should
[info] Perform Benchamrks consistent with those done with VizierDB
[info] + Warm up the cache
[test] @0: Init Spreadsheet: 21.144650182 s
[test] @0: Monitoring Overhead: 0.204167816 s
[test] @0: Init formulas: 5.258831865 s
[test] @0: Update one: 0.033579433 s
[test] @0: Update all: [not run]
[info] + Time Results @ 0
[test] @60: Init Spreadsheet: 21.046036608 s
[test] @60: Monitoring Overhead: 0.116584287 s
[test] @60: Init formulas: 7.469271179 s
[test] @60: Update one: 0.013094508 s
[test] @60: Update all: [not run]
[info] + Time Results @ 60
[test] @600: Init Spreadsheet: 21.068284745 s
[test] @600: Monitoring Overhead: 0.11571002 s
[test] @600: Init formulas: 21.723715773 s
[test] @600: Update one: 0.007254895 s
[test] @600: Update all: [not run]
[info] + Time Results @ 600
[test] @6000: Init Spreadsheet: 21.072645619 s
[test] @6000: Monitoring Overhead: 0.11134726 s
[test] @6000: Init formulas: 244.810593821 s
[test] @6000: Update one: 0.021708022 s
[test] @6000: Update all: [not run]
[info] + Time Results @ 6000
[info] o Time Results @ 60000 [not run - taking longer than 30 minutes]
[info] Total for specification DataspreadBenchmarkVizierSpec
[info] Finished in 6 minutes 15 seconds, 731 ms
[info] 5 examples, 7 expectations, 0 failure, 0 error

View File

@ -1,32 +0,0 @@
[info] DataspreadBenchmarkVizierSpec
[info] DataspreadBenchmarkVizierSpec should
[info] Perform Benchamrks consistent with those done with VizierDB
[test] @0: Init Spreadsheet: 21.159097527 s
[test] @0: Monitoring Overhead: 0.231722052 s
[test] @0: Init formulas: 5.259485356 s
[test] @0: Update one: 0.019440175 s
[test] @0: Update all: [not run]
[info] + Time Results @ 0
[test] @60: Init Spreadsheet: 21.077188513 s
[test] @60: Monitoring Overhead: 0.119935001 s
[test] @60: Init formulas: 8.167940197 s
[test] @60: Update one: 0.024402629 s
[test] @60: Update all: [not run]
[info] + Time Results @ 60
[test] @600: Init Spreadsheet: 0.030223373 s
[test] @600: Monitoring Overhead: 0.113077619 s
[test] @600: Init formulas: 32.570919298 s
[test] @600: Update one: 0.008316406 s
[test] @600: Update all: [not run]
[info] + Time Results @ 600
[test] @6000: Init Spreadsheet: 0.024342633 s
[test] @6000: Monitoring Overhead: 0.111124168 s
[test] @6000: Init formulas: 193.063436155 s
[test] @6000: Update one: 0.018608992 s
[test] @6000: Update all: [not run]
[info] + Time Results @ 6000
[info] o Time Results @ 60000 [not run - taking longer than 30 minutes]
[info] Total for specification DataspreadBenchmarkVizierSpec
[info] Finished in 4 minutes 51 seconds, 142 ms
[info] 4 examples, 6 expectations, 0 failure, 0 error

Binary files not shown (14 files).

View File

@ -1,56 +0,0 @@
SpreadsheetBenchmark
@60/false: Init Spreadsheet: 0.502968873 s
@60/false: Monitoring Overhead: 0.006877202 s
@60/false: Init Formulas: 0.288059989 s
@60/false: Update one: 0.008191735 s
@60/false: Update all: 0.066918199 s
+ Time Results @ 60
@60/true: Init Spreadsheet: 0.006720101 s
@60/true: Monitoring Overhead: 0.005177446 s
@60/true: Init Formulas: 0.224219809 s
@60/true: Update one: 0.008588053 s
@60/true: Update all: 0.059844046 s
+ Time Results @ 60
@600/false: Init Spreadsheet: 0.010373324 s
@600/false: Monitoring Overhead: 0.008801294 s
@600/false: Init Formulas: 0.207266834 s
@600/false: Update one: 0.007608617 s
@600/false: Update all: 0.076818458 s
+ Time Results @ 600
@600/true: Init Spreadsheet: 0.014502952 s
@600/true: Monitoring Overhead: 0.005923629 s
@600/true: Init Formulas: 0.191094545 s
@600/true: Update one: 0.007352669 s
@600/true: Update all: 0.057713751 s
+ Time Results @ 600
@6000/false: Init Spreadsheet: 0.29348608 s
@6000/false: Monitoring Overhead: 0.010335879 s
@6000/false: Init Formulas: 0.225178793 s
@6000/false: Update one: 0.008125263 s
@6000/false: Update all: 0.061337873 s
+ Time Results @ 6000
@6000/true: Init Spreadsheet: 0.005700485 s
@6000/true: Monitoring Overhead: 0.00465223 s
@6000/true: Init Formulas: 0.205236902 s
@6000/true: Update one: 0.007241557 s
@6000/true: Update all: 0.053460177 s
+ Time Results @ 6000
@60000/false: Init Spreadsheet: 4.830022963 s
@60000/false: Monitoring Overhead: 0.004278716 s
@60000/false: Init Formulas: 0.160613413 s
@60000/false: Update one: 0.006726616 s
@60000/false: Update all: 0.057563017 s
+ Time Results @ 60000
@60000/true: Init Spreadsheet: 0.007197506 s
@60000/true: Monitoring Overhead: 0.004691892 s
@60000/true: Init Formulas: 0.230308324 s
@60000/true: Update one: 0.007654771 s
@60000/true: Update all: 0.053793202 s
+ Time Results @ 60000
Total for specification SpreadsheetBenchmark
Finished in 15 seconds, 208 ms
8 examples, 9 expectations, 0 failure, 0 error

View File

@ -1,56 +0,0 @@
SpreadsheetBenchmark
+ Warm up the cache
@60/false: Init Spreadsheet: 26.140277045 s
@60/false: Monitoring Overhead: 0.014927718 s
@60/false: Init Formulas: 0.378240991 s
@60/false: Update one: 0.007560233 s
@60/false: Update all: 0.087895794 s
+ Time Results @ 60
@60/true: Init Spreadsheet: 0.007541636 s
@60/true: Monitoring Overhead: 0.00600814 s
@60/true: Init Formulas: 0.309725093 s
@60/true: Update one: 0.00739941 s
@60/true: Update all: 0.080354155 s
+ Time Results @ 60
@600/false: Init Spreadsheet: 0.00719922 s
@600/false: Monitoring Overhead: 0.005914293 s
@600/false: Init Formulas: 0.409290477 s
@600/false: Update one: 0.008420892 s
@600/false: Update all: 0.144015039 s
+ Time Results @ 600
@600/true: Init Spreadsheet: 0.007104852 s
@600/true: Monitoring Overhead: 0.005998643 s
@600/true: Init Formulas: 0.284384031 s
@600/true: Update one: 0.008116714 s
@600/true: Update all: 0.077044715 s
+ Time Results @ 600
@6000/false: Init Spreadsheet: 0.39814007 s
@6000/false: Monitoring Overhead: 0.005437453 s
@6000/false: Init Formulas: 4.291371661 s
@6000/false: Update one: 0.010826873 s
@6000/false: Update all: 0.697241208 s
+ Time Results @ 6000
@6000/true: Init Spreadsheet: 0.006777436 s
@6000/true: Monitoring Overhead: 0.005843616 s
@6000/true: Init Formulas: 0.304945226 s
@6000/true: Update one: 0.007598998 s
@6000/true: Update all: 0.076540439 s
+ Time Results @ 6000
@60000/false: Init Spreadsheet: 0.42241651 s
@60000/false: Monitoring Overhead: 0.005534486 s
@60000/false: Init Formulas: 47.899451675 s
@60000/false: Update one: 0.03709275 s
@60000/false: Update all: 29.091698828 s
+ Time Results @ 60000
@60000/true: Init Spreadsheet: 0.006794719 s
@60000/true: Monitoring Overhead: 0.005888871 s
@60000/true: Init Formulas: 0.450570698 s
@60000/true: Update one: 0.007473072 s
@60000/true: Update all: 0.078396281 s
+ Time Results @ 60000
Total for specification SpreadsheetBenchmark
Finished in 1 minute 58 seconds, 970 ms
9 examples, 10 expectations, 0 failure, 0 error

View File

@ -1,56 +0,0 @@
SpreadsheetBenchmark
@60/false: Init Spreadsheet: 0.622319637 s
@60/false: Monitoring Overhead: 0.008267427 s
@60/false: Init Formulas: 0.379130661 s
@60/false: Update one: 0.008588219 s
@60/false: Update all: 0.096520684 s
+ Time Results @ 60
@60/true: Init Spreadsheet: 0.007797491 s
@60/true: Monitoring Overhead: 0.006393537 s
@60/true: Init Formulas: 0.461880471 s
@60/true: Update one: 0.013939607 s
@60/true: Update all: 0.11355628 s
+ Time Results @ 60
@600/false: Init Spreadsheet: 0.008746324 s
@600/false: Monitoring Overhead: 0.006425039 s
@600/false: Init Formulas: 0.551072941 s
@600/false: Update one: 0.01089435 s
@600/false: Update all: 0.158270664 s
+ Time Results @ 600
@600/true: Init Spreadsheet: 0.007240454 s
@600/true: Monitoring Overhead: 0.006301443 s
@600/true: Init Formulas: 0.339588744 s
@600/true: Update one: 0.009767286 s
@600/true: Update all: 0.086571365 s
+ Time Results @ 600
@6000/false: Init Spreadsheet: 0.443006796 s
@6000/false: Monitoring Overhead: 0.007127946 s
@6000/false: Init Formulas: 2.475917445 s
@6000/false: Update one: 0.01008147 s
@6000/false: Update all: 0.858393776 s
+ Time Results @ 6000
@6000/true: Init Spreadsheet: 0.007298637 s
@6000/true: Monitoring Overhead: 0.005954614 s
@6000/true: Init Formulas: 0.342004032 s
@6000/true: Update one: 0.008022405 s
@6000/true: Update all: 0.08392664 s
+ Time Results @ 6000
@60000/false: Init Spreadsheet: 5.584728068 s
@60000/false: Monitoring Overhead: 0.006156743 s
@60000/false: Init Formulas: 42.167866315 s
@60000/false: Update one: 0.0327693 s
@60000/false: Update all: 17.918641827 s
+ Time Results @ 60000
@60000/true: Init Spreadsheet: 0.00672735 s
@60000/true: Monitoring Overhead: 0.005856171 s
@60000/true: Init Formulas: 0.339442519 s
@60000/true: Update one: 0.007649768 s
@60000/true: Update all: 0.076880671 s
+ Time Results @ 60000
Total for specification SpreadsheetBenchmark
Finished in 1 minute 21 seconds, 988 ms
8 examples, 9 expectations, 0 failure, 0 error
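
The timing lines in the logs above come in two shapes: `[test] @N: Stage: T s` and `@N/flag: Stage: T s`, with `[not run]` in place of a time for skipped stages. A minimal sketch of a parser covering both shapes follows. The function name `parse_timing_line`, the regular expression, and the returned tuple layout are assumptions made for illustration and are not taken from the repository's `read_dataspread` or `read_vizier`; the meaning of the `true`/`false` flag is not stated in the logs, so it is passed through uninterpreted.

```python
import re

# Hypothetical pattern covering both log shapes shown above:
#   [test] @600: Init formulas: 24.264395221 s
#   @600/true: Init Formulas: 0.191094545 s
TIMING = re.compile(
    r"^(?:\[test\]\s+)?@(?P<size>\d+)(?:/(?P<flag>true|false))?:\s+"
    r"(?P<stage>[^:]+):\s+(?P<value>\[not run\]|[0-9.]+ s)\s*$"
)


def parse_timing_line(line):
    """Return (size, flag, stage, seconds), or None for non-timing lines.

    flag is None when the line carries no /true or /false suffix;
    seconds is None when the stage was reported as [not run].
    """
    match = TIMING.match(line.strip())
    if match is None:
        return None
    value = match.group("value")
    # Strip the trailing " s" unit before converting to a float.
    seconds = None if value == "[not run]" else float(value[:-2])
    flag = match.group("flag")
    return (
        int(match.group("size")),
        None if flag is None else flag == "true",
        match.group("stage").lower(),  # logs mix "Init formulas" / "Init Formulas"
        seconds,
    )
```

Lower-casing the stage name is a design choice that lets records from both log shapes be grouped under the same key, since the two benchmarks capitalize "Init formulas" differently.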

View File

@ -9,6 +9,7 @@ def read_dataspread(testbed, experiment):
if match is None:
return []
else:
print(line)
return [(
testbed,
"dataspread",
@ -68,12 +69,12 @@ def read_vizier(testbed, experiment):
data = [
record
for ds in [
read_vizier("desktop", "varystart"),
read_dataspread("desktop", "varystart"),
read_vizier("desktop", "varysize"),
read_dataspread("desktop", "varysize"),
read_vizier("desktop", "varystartandsize"),
read_dataspread("desktop", "varystartandsize"),
read_vizier("laptop", "varystart"),
read_dataspread("laptop", "varystart"),
read_vizier("laptop", "varysize"),
read_dataspread("laptop", "varysize"),
read_vizier("laptop", "varystartandsize"),
read_dataspread("laptop", "varystartandsize"),
]
for record in ds
]
@ -88,9 +89,9 @@ experiment_xlabels = {
}
system_labels = {
"vizier" : "Vizier",
"vizier-batch" : "Vizier (Simulated Batching)",
"dataspread" : "DataSpread"
"vizier" : ("Vizier", "v-"),
"vizier-batch" : ("Vizier (Simulated Batching)", "^-"),
"dataspread" : ("DataSpread", "o-")
}
@ -149,10 +150,12 @@ def plot_one(testbed, stage, experiment):
and record[5] == experiment
], key=lambda x: x[0])
label, marker = system_labels[system]
ax.plot(
[pt[0] for pt in points],
[pt[1] for pt in points],
label=system_labels[system]
marker,
label=label,
)
ax.legend()
stage = stage.replace(" ", "_")
@ -162,17 +165,17 @@ def plot_one(testbed, stage, experiment):
plot_one("desktop", "init spreadsheet", "varystart")
plot_one("desktop", "init formulas", "varystart")
plot_one("desktop", "init", "varystart")
plot_one("desktop", "update one", "varystart")
# plot_one("laptop", "init spreadsheet", "varystart")
# plot_one("laptop", "init formulas", "varystart")
plot_one("laptop", "init", "varystart")
plot_one("laptop", "update one", "varystart")
plot_one("desktop", "init spreadsheet", "varysize")
plot_one("desktop", "init formulas", "varysize")
plot_one("desktop", "init", "varysize")
plot_one("desktop", "update one", "varysize")
# plot_one("laptop", "init spreadsheet", "varysize")
# plot_one("laptop", "init formulas", "varysize")
plot_one("laptop", "init", "varysize")
plot_one("laptop", "update one", "varysize")
plot_one("desktop", "init spreadsheet", "varystartandsize")
plot_one("desktop", "init formulas", "varystartandsize")
plot_one("desktop", "init", "varystartandsize")
plot_one("desktop", "update one", "varystartandsize")
# plot_one("laptop", "init spreadsheet", "varystartandsize")
# plot_one("laptop", "init formulas", "varystartandsize")
plot_one("laptop", "init", "varystartandsize")
plot_one("laptop", "update one", "varystartandsize")

View File

@ -1,4 +1,30 @@
%!TEX root=../main.tex
\begin{figure*}
\centering
\subcaptionbox{Scale Data, View First}{
\includegraphics[width=0.28\textwidth]{results/laptop-init-varysize.pdf}
}
\subcaptionbox{Fix Data, Move View}{
\includegraphics[width=0.28\textwidth]{results/laptop-init-varystart.pdf}
}
\subcaptionbox{Scale Data, View Last}{
\includegraphics[width=0.28\textwidth]{results/laptop-init-varystartandsize.pdf}
}
\subcaptionbox{Scale Data, View First}{
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varysize.pdf}
}
\subcaptionbox{Fix Data, Move View}{
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystart.pdf}
}
\subcaptionbox{Scale Data, View Last}{
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystartandsize.pdf}
}
\caption{System Initialization costs (a-c) and cost to update one cell (d-f)}
\label{fig:experiments}
\trimfigurespacing
\end{figure*}
\section{Experiments}
\label{sec:experiments}
@ -6,11 +32,14 @@ In this section we explore the performance of the overlay approach.
Concretely, we are interested in two questions:
(i) How does data size affect the performance of each system?
(ii) How does dependency chain length affect the performance of each system?
Experiments were run on a 10-core 1.7 GHz Intel i7-12700H running Linux (Kernel 6.0), with 64G of DDR-3200 RAM, and a 2TB 970 EVO NVME solid state drive.
% Desktop
% Experiments were run on a 10-core 1.7 GHz Intel i7-12700H running Linux (Kernel 6.0), with 64G of DDR-3200 RAM, and a 2TB 970 EVO NVME solid state drive.
% Laptop
Experiments were run on an 8-core 2.3 GHz Intel i7-11800H running Linux (Kernel 5.19), with 32G of DDR4-3200 RAM, and a 2TB 970 EVO NVME solid state drive.
We compare three systems:
(i) \textbf{dataspread}: Dataspread version 0.5~\cite{bendre-15-d}, the most recent version of time of submission;
(ii) \textbf{vizier}: Our prototype implementation of overlay spreadsheets; and
(iii) \textbf{vizier-batch}: Our prototype implementation with simulated hybrid batch processing.
(i) \textbf{DataSpread}: DataSpread version 0.5~\cite{bendre-15-d}, the most recent version at the time of submission;
(ii) \textbf{Vizier}: Our prototype implementation of overlay spreadsheets; and
(iii) \textbf{Vizier (Simulated Batching)}: Our prototype with simulated hybrid batch processing (see Setup, below).
All experiments were performed with a warm cache.
\partitle{Setup}
@ -28,32 +57,27 @@ We measure (i) the cost of initialization and (ii) the cost of a single update.
Time is measured until quiescence.
To emulate batch processing, we replace the formula for the $\texttt{sum\_change}[i-1]$ (where $i$ is the first visible row) with a formula that computes the analogous aggregate query.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Moving Viewport}
% \begin{figure}
% \includegraphics[width=0.7\columnwidth]{results/desktop-update_one.png}
% \vspace*{-4mm}
% \caption{Performance based on viewable range.}
% \label{fig:perf-scale-visible}
% \trimfigurespacing
% \end{figure}
\Cref{fig:experiments}(b,e) shows initialization and update costs, with a fixed dataset size of approximately 600,000 rows, and a variable viewport position.
Due to the running sum, the longest visible dependency chain grows as the visible region moves further into the dataset.
Costs for Vizier and DataSpread grow significantly with the length of the dependency chain, while batch processing can compute the updated sum significantly faster.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Scaling Data}
\begin{figure}
\includegraphics[width=0.7\columnwidth]{results/desktop-init_formulas.png}
\vspace*{-4mm}
\caption{Performance as data size scales.}
\label{fig:perf-scale-size}
\trimfigurespacing
\end{figure}
\Cref{fig:perf-scale-size} shows performance as the size of the dataset grows.
\Cref{fig:experiments}(a,d) shows the initialization and update costs when the viewport is on the first cell. Vizier only needs to compute the visible cell formulas, and so is significantly faster.
\Cref{fig:experiments}(c,f) show these costs when the viewport is on the last cell; as before, the costs for Vizier grow with the length of the longest visible dependency chain, supporting the value of batching.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Viewport}
\begin{figure}
\includegraphics[width=0.7\columnwidth]{results/desktop-update_one.png}
\vspace*{-4mm}
\caption{Performance based on viewable range.}
\label{fig:perf-scale-visible}
\trimfigurespacing
\end{figure}
\Cref{fig:perf-scale-size} shows performance as the viewable area moves lower.
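
The experiments text above attributes the cost growth to the longest visible dependency chain and describes emulating batch processing by replacing the formula at row $i-1$ with an aggregate. A minimal sketch of that contrast, under stated assumptions: the running sum is modeled as `sum_change[r] = sum_change[r-1] + change[r]`, the function names and the `changes` list are hypothetical, and "cost" here counts formula evaluations only (the aggregate still reads the prefix, but as one bulk operation). This is not the prototype's implementation.

```python
def running_sum_chain(changes, i):
    """Evaluate sum_change[i] cell by cell.

    Each cell depends on the one above it, so reaching row i
    evaluates i + 1 formulas. Returns (value, formulas_evaluated).
    """
    total = 0
    evaluated = 0
    for row in range(i + 1):
        total += changes[row]  # sum_change[row] = sum_change[row - 1] + changes[row]
        evaluated += 1
    return total, evaluated


def visible_sum_batched(changes, i):
    """Emulated batching: row i - 1 holds one aggregate over rows 0..i-1.

    Only two formulas are evaluated: the aggregate and the visible cell.
    Returns (value, formulas_evaluated).
    """
    prefix = sum(changes[:i])  # stands in for the aggregate query at row i - 1
    return prefix + changes[i], 2
```

Both functions return the same value for a given row; only the number of per-cell formula evaluations differs, which is the quantity that grows as the viewport moves deeper into the dataset.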

View File

@ -4,32 +4,30 @@
\label{sec:related-work}
Although spreadsheets present a convenient, direct-manipulation interface to data, they lack the scalability to manage large data.
A common approach to scaling spreadsheets --- what we term the ``virtual'' approach --- is to reformulate the interface to an existing database or workflow system using spreadsheet-style direct manipulation metaphors~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/icde/LiuJ09,freire:2016:hilda:exception,DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/chi/KandelPHH11}.
A common approach to scaling spreadsheets (the ``virtual'' approach) reformulates the interface to an existing database or workflow system using spreadsheet-style direct manipulation metaphors~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/icde/LiuJ09,freire:2016:hilda:exception,DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/chi/KandelPHH11}.
The resulting systems bear varying levels of resemblance to existing spreadsheets, usually introducing concepts from relational databases like explicit tables, attributes, and records.
Vizier~\cite{brachmann:2019:sigmod:data, kennedy:2022:ieee-deb:right, kumari:2021:cidr:datasense, brachmann:2020:cidr:your} is a computational notebook system that automatically versions notebooks as they are edited by users.
In Vizier, any dataset used in a computational notebook can be accessed and edited through a spreadsheet interface; the resulting edits are integrated into the workflow.
%
Wrangler~\cite{DBLP:conf/chi/KandelPHH11} is an ETL workflow development tool with an interface inspired by spreadsheets.
Users open a small sample of a dataset in Wrangler and use spreadsheet-style direct manipulations to indicate a desired change to the dataset.
Wrangler, in turn, proposes ETL workflow steps that can achieve the user's desired effect on the target cell, as well as the remainder of the dataset.
Other approaches more directly mimic relational databases through spreadsheet-style interfaces.
Users open a small sample of a dataset in Wrangler and use spreadsheet-style direct manipulations to indicate desired changes to the dataset.
%
Vizier~\cite{brachmann:2019:sigmod:data, kennedy:2022:ieee-deb:right, kumari:2021:cidr:datasense, brachmann:2020:cidr:your} is a computational notebook system that allows users to define workflow stages through a spreadsheet-style interface.
%
Other approaches more directly mimic relational databases:
The Spreadsheet Algebra~\cite{DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/icde/LiuJ09} allows users to specify any SPJGA-query purely through spreadsheet-style user interactions.
Related Worksheets~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/chi/BakkeKM11} re-imagines the classical spreadsheet-style interface by introducing relational structure, as well as nested display of foreign-key dependencies.
Related Worksheets~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/chi/BakkeKM11} re-imagines the classical spreadsheet-style interface with record structure and inlined display of foreign-key references.
A second class of approach --- what we term the ``materialized'' approach --- instead redesigns the spreadsheet engine itself through database concepts;
The primary example in this space is DataSpread~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}.
A key challenge that the materialized approach faces is that classical database techniques, which exploit common structures in a dataset, are not directly applicable.
A second approach (the ``materialized'' approach) instead redesigns the spreadsheet engine itself through database concepts.
An example is DataSpread~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}.
A key challenge is that classical database techniques, which exploit common structures in a dataset, are not directly applicable.
\cite{DBLP:conf/icde/BendreVZCP18} explores data structures that can leverage partial structure; for example, when a range of cells are structured as a relational table.
\cite{DBLP:conf/sigmod/BendreWMCP19} explores strategies for quickly invalidating cells and computing dependencies, by leveraging a (lossy) compressed dependency graph that can efficiently bound a cell's downstream.
\cite{tang-23-efcsfg} introduces a different type of compressed dependency graph which is lossless, instead exploiting repeating patterns in formulas.
This is analogous to our own approach, but focuses on the dependency graph.
As we demonstrate here, applying a similar approach to expressions as well creates multiple optimization opportunities.
As we demonstrate, expression patterns create more optimization opportunities.
In summary, several efficient algorithms for storing, accessing, and updating spreadsheets have been developed and adapted in the context of DataSpread.
The approach developed for Vizier is often less efficient, but has the advantage of supporting light-weight versioning and tracking the provenance of the evolution of a dataset (and the computational notebook containing it) under spreadsheet operations.
Importantly, this approach enables replaying a user's updates that were originally applied to a dataset $D_{old}$ when $D_{old}$ is replaced with an updated dataset $D_{new}$ (e.g., the user may have downloaded a new version of an open dataset and wants to keep the manual fixes they have applied to the original version of the dataset).
The virtual approach is often less efficient, but has the advantage of supporting light-weight versioning and provenance tracking.
Crucially, this approach also enables replaying a user's updates, originally applied to one dataset, on a new dataset (e.g., to re-apply curation work on an updated version of the data).
The overlay approach we present in this work has the potential to retain these benefits while enabling performance competitive with, or exceeding that of DataSpread.
Furthermore, overlays with reference frames enable more efficient support for insertion and deletion for rows and columns as this only affects reference frames, but not the formulas of cells.