Merge branch 'main' of git.odin.cse.buffalo.edu:VizierDB/paper-Vizier-SpreadsheetOverlay

main
Boris Glavic 2023-05-12 18:13:29 -05:00
commit 041e54a872
35 changed files with 579 additions and 223 deletions

4
.gitignore vendored
View File

@ -2,7 +2,7 @@ comment.cut
*.aux
*.fdb_latexmk
*.fls
*.log
/*.log
*.out
*.synctex.gz
*.bbl
@ -15,7 +15,7 @@ comment.cut
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
/*.log
*.lot
*.fls
*.out

View File

@ -1,4 +1,4 @@
\documentclass[sigconf,review]{acmart}
\documentclass[sigconf,review,10pt]{acmart}
\usepackage{cleveref}
\usepackage{todonotes}
@ -8,6 +8,7 @@
\usepackage{listings}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usepackage{subcaption}
\newcommand{\trimfigurespacing}{\vspace*{-5mm}}
@ -121,23 +122,14 @@
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
Spreadsheets provide a convenient % , friendly
direct manipulation interface to datasets.
Efforts to scale spreadsheets % have taken two approaches: A
either follow a `virtual` strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine using % around
standard database optimizations. % like indexes.
Because database engines are not optimized for spreadsheet access patterns,
% typically optimized for bulk query processing over interactive latencies,
the materialized approach has better performance.
However, the virtual approach offers several key advantages that can not be easily replicated in the materialized approach, including % notably
the ability to re-apply user interactions to an updated dataset. % version of the same dataset.
We propose a hybrid of % the materialized and virtual
these approaches, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
We introduce the overlay update model, and outline strategies for efficiently accessing a spreadsheet defined in this way.
% Spreadsheets provide a convenient, friendly direct manipulation interface to datasets.
Efforts to scale spreadsheets either follow a `virtual' strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine.
Because database engines are not optimized for spreadsheet access patterns, the materialized approach has better performance.
However, the virtual approach offers several advantages that cannot be easily replicated in the materialized approach, including the ability to re-apply user interactions to an updated dataset.
We propose a hybrid approach, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
We introduce the overlay update model, and outline strategies for efficiently accessing an overlay spreadsheet.
A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as ``patterns'' that can be leveraged to reduce execution costs.
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread, it can significantly reduce execution costs. % popular
% materialized spreadsheet.
% Our preliminary results show that overlay spreadsheets can significantly reduce execution costs.
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread (a standard materialized-style spreadsheet), it can significantly reduce execution costs.
\end{abstract}
%%
@ -155,7 +147,7 @@ the materialized approach has better performance.
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\keywords{Spreadsheets, Dataframes, Scalable Data Management}
% \keywords{Spreadsheets, Dataframes, Scalable Data Management}
%% A "teaser" image appears between the author and affiliation
%% information and the body of the document, and typically spans the
%% page.
@ -182,8 +174,8 @@ the materialized approach has better performance.
% \input{sections/formalism}
\input{sections/system}
% \input{sections/data}
\input{sections/relwork}
\input{sections/experiments}
\input{sections/relwork}
\input{sections/conclusions}

Binary files not shown (8 files; "Before" previews of 6 images, 16 to 20 KiB each).

View File

@ -9,6 +9,7 @@ def read_dataspread(testbed, experiment):
if match is None:
return []
else:
print(line)
return [(
testbed,
"dataspread",
@ -63,24 +64,68 @@ def read_vizier(testbed, experiment):
# 'update one' - time to update one cell
# 'update all' - time to update an entire column (not used)
# 4. time: float (number of seconds)
# 5. experiment: 'varystart', 'varysize', 'varystartandsize'
data = [
record
for ds in [
read_vizier("desktop", "varystart"),
# read_vizier("laptop"),
read_dataspread("desktop", "varystart"),
# read_dataspread("laptop")
read_vizier("laptop", "varystart"),
read_dataspread("laptop", "varystart"),
read_vizier("laptop", "varysize"),
read_dataspread("laptop", "varysize"),
read_vizier("laptop", "varystartandsize"),
read_dataspread("laptop", "varystartandsize"),
]
for record in ds
]
stages = set(i[3] for i in data)
sizes = set(i[2] for i in data)
experiment_xlabels = {
"varystart" : "First visible row",
"varysize" : "Number of rows",
"varystartandsize" : "Number of rows",
}
system_labels = {
"vizier" : ("Vizier", "v-"),
"vizier-batch" : ("Vizier (Simulated Batching)", "^-"),
"dataspread" : ("DataSpread", 'o-')
}
init_costs = {}
init_fields = [
"init spreadsheet",
"init formulas"
]
for record in data:
if record[3] in init_fields:
key = (
record[0],
record[1],
record[2],
record[5]
)
print(key)
init_costs[key] = init_costs.get(key, 0) + record[4]
data += [
(
key[0],
key[1],
key[2],
"init",
init_costs[key],
key[3]
)
for key in init_costs
]
print(data)
print(stages)
def plot_one(testbed, stage, experiment):
global data
fig, ax = plt.subplots(
@ -89,25 +134,28 @@ def plot_one(testbed, stage, experiment):
)
# ax.set_title(f"{stage} ({testbed})")
ax.set_ylabel(f"{stage} (s)")
ax.set_xlabel("Data Size (number of rows)")
ax.set_ylabel(f"Time (s)")
ax.set_xlabel(experiment_xlabels[experiment])
ax.set_xscale("log")
ax.set_yscale("log")
for system in ["vizier", "vizier-batch", "dataspread"]:
for system in system_labels:
points = sorted([
(record[2], record[4])
for record in data
if record[0] == testbed
and record[1] == system
and record[3] == stage
and record[5] == experiment
], key=lambda x: x[0])
label, marker = system_labels[system]
ax.plot(
[pt[0] for pt in points],
[pt[1] for pt in points],
label=system
marker,
label=label,
)
ax.legend()
stage = stage.replace(" ", "_")
@ -115,6 +163,19 @@ def plot_one(testbed, stage, experiment):
fig.savefig(f"{testbed}-{stage}-{experiment}.png")
plot_one("desktop", "init spreadsheet", "varystart")
plot_one("desktop", "init formulas", "varystart")
plot_one("desktop", "update one", "varystart")
# plot_one("laptop", "init spreadsheet", "varystart")
# plot_one("laptop", "init formulas", "varystart")
plot_one("laptop", "init", "varystart")
plot_one("laptop", "update one", "varystart")
# plot_one("laptop", "init spreadsheet", "varysize")
# plot_one("laptop", "init formulas", "varysize")
plot_one("laptop", "init", "varysize")
plot_one("laptop", "update one", "varysize")
# plot_one("laptop", "init spreadsheet", "varystartandsize")
# plot_one("laptop", "init formulas", "varystartandsize")
plot_one("laptop", "init", "varystartandsize")
plot_one("laptop", "update one", "varystartandsize")
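The aggregation step in the script above folds the two initialization stages into a single synthetic "init" stage per configuration. A minimal standalone sketch of that step follows, assuming the six-field record layout the script indexes (testbed, system, size, stage, time, experiment). The function name `combine_init_costs` and the sample records are hypothetical and are not part of the commit.

```python
from collections import defaultdict

# Assumed record layout, matching the indices used by the script:
# (testbed, system, size, stage, time_seconds, experiment)
INIT_STAGES = {"init spreadsheet", "init formulas"}


def combine_init_costs(records):
    """Sum the init stages into one 'init' record per configuration."""
    totals = defaultdict(float)
    for testbed, system, size, stage, seconds, experiment in records:
        if stage in INIT_STAGES:
            totals[(testbed, system, size, experiment)] += seconds
    return [
        (testbed, system, size, "init", seconds, experiment)
        for (testbed, system, size, experiment), seconds in totals.items()
    ]


# Hypothetical sample records, not taken from the benchmark logs.
sample = [
    ("laptop", "vizier", 60, "init spreadsheet", 0.25, "varysize"),
    ("laptop", "vizier", 60, "init formulas", 0.5, "varysize"),
    ("laptop", "vizier", 60, "update one", 0.125, "varysize"),
]
combined = combine_init_costs(sample)
```

Keying on the full (testbed, system, size, experiment) tuple keeps measurements from different experiments from being summed together, which is why the experiment field is part of the key.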

View File

@ -0,0 +1,36 @@
[info] DataspreadBenchmarkVizierSpec
[info] DataspreadBenchmarkVizierSpec should
[info] Perform Benchamrks consistent with those done with VizierDB
[test] @0: Init Spreadsheet: 21.159173174 s
[test] @0: Monitoring Overhead: 0.20974519 s
[test] @0: Init formulas: 5.261571355 s
[test] @0: Update one: 0.01938019 s
[test] @0: Update all: [not run]
[info] + Time Results @ 0
[test] @60: Init Spreadsheet: 21.083128875 s
[test] @60: Monitoring Overhead: 0.120861192 s
[test] @60: Init formulas: 7.946650643 s
[test] @60: Update one: 0.02460699 s
[test] @60: Update all: [not run]
[info] + Time Results @ 60
[test] @600: Init Spreadsheet: 21.064740668 s
[test] @600: Monitoring Overhead: 0.111825686 s
[test] @600: Init formulas: 21.72623498 s
[test] @600: Update one: 0.009170771 s
[test] @600: Update all: [not run]
[info] + Time Results @ 600
[test] @6000: Init Spreadsheet: 21.063488863 s
[test] @6000: Monitoring Overhead: 0.11231663 s
[test] @6000: Init formulas: 222.331098879 s
[test] @6000: Update one: 0.035261053 s
[test] @6000: Update all: [not run]
[info] + Time Results @ 6000
[test] @60000: Init Spreadsheet: 21.160535801 s
[test] @60000: Monitoring Overhead: 0.201710736 s
[test] @60000: Init formulas: 11685.549964039 s
[test] @60000: Update one: 0.182460831 s
[test] @60000: Update all: [not run]
[info] + Time Results @ 60000
[info] Total for specification DataspreadBenchmarkVizierSpec
[info] Finished in ? hours ? minutes 1? seconds, ? ms
[info] 5 examples, 6 expectations, 0 failure, 0 error

View File

@ -0,0 +1,33 @@
[info] DataspreadBenchmarkVizierSpec
[info] DataspreadBenchmarkVizierSpec should
[info] Perform Benchamrks consistent with those done with VizierDB
[info] + Warm up the cache
[test] @0: Init Spreadsheet: 21.14533664 s
[test] @0: Monitoring Overhead: 0.206532011 s
[test] @0: Init formulas: 5.258806224 s
[test] @0: Update one: 0.019434445 s
[test] @0: Update all: [not run]
[info] + Time Results @ 0
[test] @60: Init Spreadsheet: 21.049808795 s
[test] @60: Monitoring Overhead: 0.121249886 s
[test] @60: Init formulas: 7.56010975 s
[test] @60: Update one: 0.018575696 s
[test] @60: Update all: [not run]
[info] + Time Results @ 60
[test] @600: Init Spreadsheet: 21.071083577 s
[test] @600: Monitoring Overhead: 0.11424194 s
[test] @600: Init formulas: 21.319581386 s
[test] @600: Update one: 0.007370146 s
[test] @600: Update all: [not run]
[info] + Time Results @ 600
[test] @6000: Init Spreadsheet: 21.071681763 s
[test] @6000: Monitoring Overhead: 0.109863317 s
[test] @6000: Init formulas: 200.321384583 s
[test] @6000: Update one: 0.016227122 s
[test] @6000: Update all: [not run]
[info] + Time Results @ 6000
[info] o Time Results @ 60000 [not run - taking longer than 30 minutes]
[info] Total for specification DataspreadBenchmarkVizierSpec
[info] Finished in ? minutes ? seconds, ? ms
[info] 6 examples, 8 expectations, 0 failure, 0 error

View File

@ -0,0 +1,37 @@
[info] DataspreadBenchmarkVizierSpec
[info] DataspreadBenchmarkVizierSpec should
[info] Perform Benchamrks consistent with those done with VizierDB
[test] @0: Init Spreadsheet: 21.152766205 s
[test] @0: Monitoring Overhead: 0.212853925 s
[test] @0: Init formulas: 5.258824289 s
[test] @0: Update one: 0.011733248 s
[test] @0: Update all: [not run]
[info] + Time Results @ 0
[test] @60: Init Spreadsheet: 21.08499361 s
[test] @60: Monitoring Overhead: 0.118184044 s
[test] @60: Init formulas: 7.224992098 s
[test] @60: Update one: 0.022238658 s
[test] @60: Update all: [not run]
[info] + Time Results @ 60
[test] @600: Init Spreadsheet: 0.023304786 s
[test] @600: Monitoring Overhead: 0.111591264 s
[test] @600: Init formulas: 18.308526165 s
[test] @600: Update one: 0.011778973 s
[test] @600: Update all: [not run]
[info] + Time Results @ 600
[test] @6000: Init Spreadsheet: 0.023009218 s
[test] @6000: Monitoring Overhead: 0.110957631 s
[test] @6000: Init formulas: 233.968469121 s
[test] @6000: Update one: 0.03754689 s
[test] @6000: Update all: [not run]
[info] + Time Results @ 6000
[test] @60000: Init Spreadsheet: 15.887701562 s
[test] @60000: Monitoring Overhead: 0.201221654 s
[test] @60000: Init formulas: 9289.854796884 s
[test] @60000: Update one: 0.213215589 s
[test] @60000: Update all: [not run]
[info] + Time Results @ 60000
[info] Total for specification DataspreadBenchmarkVizierSpec
[info] Finished in ? hours ? minutes ? seconds, ? ms
[info] 5 examples, 6 expectations, 0 failure, 0 error

Binary files not shown (8 files; "After" previews of 6 images, 16 to 22 KiB each).

View File

@ -0,0 +1,61 @@
SpreadsheetBenchmark
@0/false: Init Spreadsheet: 0.007564038 s
@0/false: Monitoring Overhead: 0.007400494 s
@0/false: Init Formulas: 0.230605886 s
@0/false: Update one: 0.00911814 s
@0/false: Update all: 0.079019807 s
+ Time Results @ 0
@60/false: Init Spreadsheet: 0.007365202 s
@60/false: Monitoring Overhead: 0.005953976 s
@60/false: Init Formulas: 0.214453005 s
@60/false: Update one: 0.008366991 s
@60/false: Update all: 0.077043804 s
+ Time Results @ 60
@60/true: Init Spreadsheet: 0.006909941 s
@60/true: Monitoring Overhead: 0.00610603 s
@60/true: Init Formulas: 0.210893718 s
@60/true: Update one: 0.008825594 s
@60/true: Update all: 0.094068053 s
+ Time Results @ 60
@600/false: Init Spreadsheet: 0.009024537 s
@600/false: Monitoring Overhead: 0.005922278 s
@600/false: Init Formulas: 0.212432161 s
@600/false: Update one: 0.009236636 s
@600/false: Update all: 0.092001339 s
+ Time Results @ 600
@600/true: Init Spreadsheet: 0.007489662 s
@600/true: Monitoring Overhead: 0.006039918 s
@600/true: Init Formulas: 0.209676009 s
@600/true: Update one: 0.009630307 s
@600/true: Update all: 0.070513729 s
+ Time Results @ 600
@6000/false: Init Spreadsheet: 0.006762883 s
@6000/false: Monitoring Overhead: 0.005878985 s
@6000/false: Init Formulas: 0.198955475 s
@6000/false: Update one: 0.009233433 s
@6000/false: Update all: 0.07331271 s
+ Time Results @ 6000
@6000/true: Init Spreadsheet: 0.006658383 s
@6000/true: Monitoring Overhead: 0.005803843 s
@6000/true: Init Formulas: 0.214047748 s
@6000/true: Update one: 0.009735771 s
@6000/true: Update all: 0.065257715 s
+ Time Results @ 6000
@60000/false: Init Spreadsheet: 0.006462848 s
@60000/false: Monitoring Overhead: 0.005737669 s
@60000/false: Init Formulas: 0.205596903 s
@60000/false: Update one: 0.008717803 s
@60000/false: Update all: 0.073094656 s
+ Time Results @ 60000
@60000/true: Init Spreadsheet: 0.006758208 s
@60000/true: Monitoring Overhead: 0.005889192 s
@60000/true: Init Formulas: 0.255475956 s
@60000/true: Update one: 0.008484922 s
@60000/true: Update all: 0.064445427 s
+ Time Results @ 60000
Total for specification SpreadsheetBenchmark
Finished in ? seconds, ? ms
9 examples, 9 expectations, 0 failure, 0 error

View File

@ -0,0 +1,61 @@
SpreadsheetBenchmark
+ Warm up the cache
@0/false: Init Spreadsheet: 0.007200929 s
@0/false: Monitoring Overhead: 0.005763902 s
@0/false: Init Formulas: 0.233877935 s
@0/false: Update one: 0.009171376 s
@0/false: Update all: 0.080520967 s
+ Time Results @ 0
@60/false: Init Spreadsheet: 0.007492587 s
@60/false: Monitoring Overhead: 0.007265447 s
@60/false: Init Formulas: 0.234647648 s
@60/false: Update one: 0.007701139 s
@60/false: Update all: 0.084986609 s
+ Time Results @ 60
@60/true: Init Spreadsheet: 0.00693865 s
@60/true: Monitoring Overhead: 0.005998937 s
@60/true: Init Formulas: 0.285096228 s
@60/true: Update one: 0.00791677 s
@60/true: Update all: 0.078190827 s
+ Time Results @ 60
@600/false: Init Spreadsheet: 0.00719922 s
@600/false: Monitoring Overhead: 0.005914293 s
@600/false: Init Formulas: 0.409290477 s
@600/false: Update one: 0.008420892 s
@600/false: Update all: 0.144015039 s
+ Time Results @ 600
@600/true: Init Spreadsheet: 0.007104852 s
@600/true: Monitoring Overhead: 0.005998643 s
@600/true: Init Formulas: 0.284384031 s
@600/true: Update one: 0.008116714 s
@600/true: Update all: 0.077044715 s
+ Time Results @ 600
@6000/false: Init Spreadsheet: 0.39814007 s
@6000/false: Monitoring Overhead: 0.005437453 s
@6000/false: Init Formulas: 4.291371661 s
@6000/false: Update one: 0.010826873 s
@6000/false: Update all: 0.697241208 s
+ Time Results @ 6000
@6000/true: Init Spreadsheet: 0.006777436 s
@6000/true: Monitoring Overhead: 0.005843616 s
@6000/true: Init Formulas: 0.304945226 s
@6000/true: Update one: 0.007598998 s
@6000/true: Update all: 0.076540439 s
+ Time Results @ 6000
@60000/false: Init Spreadsheet: 0.42241651 s
@60000/false: Monitoring Overhead: 0.005534486 s
@60000/false: Init Formulas: 47.899451675 s
@60000/false: Update one: 0.03709275 s
@60000/false: Update all: 29.091698828 s
+ Time Results @ 60000
@60000/true: Init Spreadsheet: 0.006794719 s
@60000/true: Monitoring Overhead: 0.005888871 s
@60000/true: Init Formulas: 0.450570698 s
@60000/true: Update one: 0.007473072 s
@60000/true: Update all: 0.078396281 s
+ Time Results @ 60000
Total for specification SpreadsheetBenchmark
Finished in ? minute ? seconds, ? ms
10 examples, 10 expectations, 0 failure, 0 error

View File

@ -0,0 +1,61 @@
SpreadsheetBenchmark
@0/false: Init Spreadsheet: 0.006769554 s
@0/false: Monitoring Overhead: 0.005108301 s
@0/false: Init Formulas: 0.184505859 s
@0/false: Update one: 0.007652165 s
@0/false: Update all: 0.063410596 s
+ Time Results @ 0
@60/false: Init Spreadsheet: 0.00610572 s
@60/false: Monitoring Overhead: 0.004903835 s
@60/false: Init Formulas: 0.192604791 s
@60/false: Update one: 0.006000217 s
@60/false: Update all: 0.073722164 s
+ Time Results @ 60
@60/true: Init Spreadsheet: 0.005724917 s
@60/true: Monitoring Overhead: 0.004879481 s
@60/true: Init Formulas: 0.255259897 s
@60/true: Update one: 0.006215926 s
@60/true: Update all: 0.064652433 s
+ Time Results @ 60
@600/false: Init Spreadsheet: 0.006123947 s
@600/false: Monitoring Overhead: 0.004805293 s
@600/false: Init Formulas: 0.354857596 s
@600/false: Update one: 0.006654787 s
@600/false: Update all: 0.112083011 s
+ Time Results @ 600
@600/true: Init Spreadsheet: 0.005842375 s
@600/true: Monitoring Overhead: 0.004949255 s
@600/true: Init Formulas: 0.240377094 s
@600/true: Update one: 0.006095369 s
@600/true: Update all: 0.063035398 s
+ Time Results @ 600
@6000/false: Init Spreadsheet: 0.005478137 s
@6000/false: Monitoring Overhead: 0.004693916 s
@6000/false: Init Formulas: 2.266872167 s
@6000/false: Update one: 0.00771648 s
@6000/false: Update all: 0.564202431 s
+ Time Results @ 6000
@6000/true: Init Spreadsheet: 0.005612294 s
@6000/true: Monitoring Overhead: 0.004520144 s
@6000/true: Init Formulas: 0.227395145 s
@6000/true: Update one: 0.00600239 s
@6000/true: Update all: 0.060715141 s
+ Time Results @ 6000
@60000/false: Init Spreadsheet: 0.005795735 s
@60000/false: Monitoring Overhead: 0.006177181 s
@60000/false: Init Formulas: 35.722531793 s
@60000/false: Update one: 0.027631704 s
@60000/false: Update all: 16.491494823 s
+ Time Results @ 60000
@60000/true: Init Spreadsheet: 0.005326716 s
@60000/true: Monitoring Overhead: 0.004702773 s
@60000/true: Init Formulas: 0.26901122 s
@60000/true: Update one: 0.005975437 s
@60000/true: Update all: 0.060946162 s
+ Time Results @ 60000
Total for specification SpreadsheetBenchmark
Finished in ? minute ? seconds, ? ms
9 examples, 9 expectations, 0 failure, 0 error
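The timing lines in the logs above share one shape: an `@N` tag, an optional `/true` or `/false` flag, a stage name, and a duration in seconds. The sketch below shows one way such lines could be parsed. It is not the `read_vizier` or `read_dataspread` code from the repository, and it assumes (without confirmation from the commit) that the flag marks the simulated-batching runs. The names `TIMING_LINE` and `parse_timing` are hypothetical.

```python
import re

# Matches e.g. "@60/true: Init Formulas: 0.255259897 s" and
# "[test] @60: Init formulas: 7.946650643 s".
TIMING_LINE = re.compile(
    r"@(?P<n>\d+)(?:/(?P<flag>true|false))?:\s*"
    r"(?P<stage>[^:]+):\s*(?P<seconds>\d+(?:\.\d+)?) s\s*$"
)


def parse_timing(line):
    """Return (n, flag, stage, seconds), or None for non-timing lines."""
    match = TIMING_LINE.search(line)
    if match is None:
        return None  # e.g. "Update all: [not run]" or summary lines
    return (
        int(match.group("n")),
        match.group("flag") == "true",  # assumption: flag = simulated batching
        match.group("stage").strip().lower(),
        float(match.group("seconds")),
    )


vizier_row = parse_timing("@60/true: Init Formulas: 0.255259897 s")
dataspread_row = parse_timing("[test] @6000: Init formulas: 222.331098879 s")
skipped = parse_timing("[test] @60: Update all: [not run]")
```

Lower-casing the stage name lets "Init Formulas" and "Init formulas" from the two systems map to the same stage label.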

View File

@ -3,9 +3,9 @@
\section{Conclusions and Future Work}
\label{sec:conclusions}
In this work, we introduced overlay spreadsheets as a potential direction for reproducible spreadsheets where a user's edits can be re-applied to updated input data, and thus used directly in classical workflow and provenance analysis systems like Vizier.
In this work, we introduced overlay spreadsheets as a potential direction for reproducible spreadsheets in workflow and provenance analysis systems like Vizier.
This novel capability is powered by overlays that decouple the user's edits from the source data they are applied to.
We also demonstrated how updates to ranges of cells can be represented declaratively, improving performance and introducing several avenues for optimized evaluation of recursive patterns.
We also demonstrated how updates to ranges of cells can be represented declaratively, improving performance and enabling optimized evaluation of recursive patterns.
Recursive patterns remain the source of several open challenges for us.
Most notably, in the absence of recursive patterns, the depth of a dependency chain is bounded by the number of user interactions.

View File

@ -1,4 +1,30 @@
%!TEX root=../main.tex
\begin{figure}
\centering
\subcaptionbox{Scale Data, View First}{
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varysize.pdf}
}
\subcaptionbox{Fix Data, Move View}{
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varystart.pdf}
}\\[1mm]
\subcaptionbox{Scale Data, View Last}{
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varystartandsize.pdf}
}
\subcaptionbox{Scale Data, View First}{
\includegraphics[width=0.47\columnwidth]{results/laptop-update_one-varysize.pdf}
}\\[-2mm]
% \subcaptionbox{Fix Data, Move View}{
% \includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystart.pdf}
% }
% \subcaptionbox{Scale Data, View Last}{
% \includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystartandsize.pdf}
% }
\caption{Time to initialize the spreadsheet (a-c) and cost to update one cell (d)}
\label{fig:experiments}
\trimfigurespacing
\end{figure}
\section{Experiments}
\label{sec:experiments}
@ -6,54 +32,56 @@ In this section we explore the performance of the overlay approach.
Concretely, we are interested in two questions:
(i) How does data size affect the performance of each system?
(ii) How does dependency chain length affect the performance of each system?
Experiments were run on a 10-core 1.7 GHz Intel i7-12700H running Linux (Kernel 6.0), with 64G of DDR-3200 RAM, and a 2TB 970 EVO NVME solid state drive.
% Desktop
% Experiments were run on a 10-core 1.7 GHz Intel i7-12700H running Linux (Kernel 6.0), with 64G of DDR-3200 RAM, and a 2TB 970 EVO NVME solid state drive.
% Laptop
Experiments were run on an 8-core 2.3 GHz Intel i7-11800H running Linux (Kernel 5.19), with 32G of DDR4-3200 RAM, and a 2TB 970 EVO NVME solid state drive.
We compare three systems:
(i) \textbf{dataspread}: Dataspread version 0.5~\cite{bendre-15-d}, the most recent version at the time of submission;
(ii) \textbf{vizier}: Our prototype implementation of overlay spreadsheets; and
(iii) \textbf{vizier-batch}: Our prototype implementation with simulated hybrid batch processing.
(i) \textbf{DataSpread}: DataSpread version 0.5~\cite{bendre-15-d};
(ii) \textbf{Vizier}: Our prototype implementation of overlay spreadsheets; and
(iii) \textbf{Vizier (Simulated Batching)}: Simulated hybrid batch processing (see Setup, below).
All experiments were performed with a warm cache.
\partitle{Setup}
We address our questions through a simple microbenchmark modeled after query 1 from the TPC-H benchmark~\cite{tpc-h}: The spreadsheet is defined by the TPC-H \texttt{lineitem} dataset with $\texttt{N}$ rows and four additional columns defined by the patterns:\\[-5mm]
We address our questions through a microbenchmark modeled after TPC-H query 1~\cite{tpc-h}: The spreadsheet is defined by the TPC-H \texttt{lineitem} dataset with $\texttt{N}$ rows and four additional columns defined by the patterns:\\[-2mm]
{\footnotesize
\begin{verbatim}
base_price[1-N] = ext_price[+0]
disc_price[1-N] = base_price[+0] * (1 - discount[+0])
charge[1-N] = disc_price[+0] * (1 + tax[+0])
sum_charge[1] = charge[1]
sum_charge[2-N] = charge[+0] + sum_charge[-1]
base_price[1-N] = ext_price[+0]
disc_price[1-N] = base_price[+0] * (1 - discount[+0])
charge[1-N] = disc_price[+0] * (1 + tax[+0])
sum_charge[1] = charge[1]
sum_charge[2-N] = charge[+0] + sum_charge[-1]
\end{verbatim}
Note that the \texttt{sum\_charge} column is a running total, and the length of the dependency chain on row $i$ is proportional to $i$. Thus, as the user scrolls down the page (under normal usage), the runtime to compute individual cells grows linearly.
Each system under test is allowed to load the spreadsheet with a viewable area of 50 rows.
}
\noindent The \texttt{sum\_charge} column is a running total, creating a dependency chain that grows linearly with row index.
As the user scrolls down the page (under normal usage), the runtime to compute visible cells grows linearly.
Each system loads the spreadsheet with a viewable area of 50 rows and updates a single cell.
We measure (i) the cost of initialization and (ii) the cost of a single update.
Time is measured until quiescence.
To emulate batch processing, we replace the formula for $\texttt{sum\_charge}[i-1]$ (where $i$ is the first visible row) with a formula that computes the analogous aggregate query.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Moving View}
% \begin{figure}
% \includegraphics[width=0.7\columnwidth]{results/desktop-update_one.png}
% \vspace*{-4mm}
% \caption{Performance based on viewable range.}
% \label{fig:perf-scale-visible}
% \trimfigurespacing
% \end{figure}
\Cref{fig:experiments}(b) shows costs for a fixed dataset size of approximately 600,000 rows, varying the viewable rows.
Due to the running sum, later rows require more computation.
Costs for Vizier and DataSpread grow significantly with the length of the dependency chain, while batch processing can compute the updated sum significantly faster.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Scaling Data}
\begin{figure}
\includegraphics[width=0.7\columnwidth]{results/desktop-init_formulas.png}
\vspace*{-4mm}
\caption{Performance as data size scales.}
\label{fig:perf-scale-size}
\trimfigurespacing
\end{figure}
\Cref{fig:perf-scale-size} shows performance as the size of the dataset grows.
\Cref{fig:experiments}(a,d) shows costs when varying data size, with the view fixed on the first cell.
Because dependencies in the visible area are of constant size, Vizier is faster.
% \Cref{fig:experiments}(c,f) show these costs when the viewport is on the last cell; as before, the costs for Vizier grow with the length of the longest visible dependency chain, supporting the value of batching.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Viewport}
\begin{figure}
\includegraphics[width=0.7\columnwidth]{results/desktop-update_one.png}
\vspace*{-4mm}
\caption{Performance based on viewable range.}
\label{fig:perf-scale-visible}
\trimfigurespacing
\end{figure}
\Cref{fig:perf-scale-size} shows performance as the viewable area moves lower.
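The Setup text above describes two ways to obtain the visible `sum_charge` cells: following the recursive pattern `sum_charge[i] = charge[i] + sum_charge[i-1]` from the first row, or seeding the cell just above the view with an aggregate. The sketch below illustrates why the first cost grows with the position of the view while the second does not. It is a toy model over a plain Python list, not Vizier's or DataSpread's evaluator. The function names are hypothetical, rows are 0-based, and `sum` stands in for the aggregate query.

```python
def visible_chain(charge, first, count):
    """Evaluate visible sum_charge cells by walking the chain from row 0."""
    running = 0
    evaluations = 0
    visible = []
    for row in range(first + count):
        running += charge[row]  # sum_charge[row] = charge[row] + sum_charge[row-1]
        evaluations += 1
        if row >= first:
            visible.append(running)
    return visible, evaluations


def visible_batched(charge, first, count):
    """Seed the cell above the view with one aggregate, then evaluate the view."""
    running = sum(charge[:first])  # stand-in for the aggregate query
    evaluations = 0
    visible = []
    for row in range(first, first + count):
        running += charge[row]
        evaluations += 1
        visible.append(running)
    return visible, evaluations


charge = [1, 2, 3, 4, 5, 6]
chain_values, chain_steps = visible_chain(charge, first=4, count=2)
batch_values, batch_steps = visible_batched(charge, first=4, count=2)
```

Both functions return the same visible values. The chain version performs one formula evaluation per row from the top of the sheet, whereas the batched version performs one per visible row plus a single aggregate.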

View File

@ -6,11 +6,11 @@ Spreadsheets are a popular tools for data exploration, transformation, and visua
rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system. % like Apache Spark.
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
The approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
A different approach, employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
spreadsheet runtime and specializes % around
database primitives like indexes and incremental maintenance % specialized
for spreadsheet access patterns.
We refer to these as the virtual and materialized approach, respectively, and illustrate them in \Cref{fig:overlay}.
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
@ -27,32 +27,29 @@ We refer to these as the virtual and materialized approach, respectively, and il
The materialized approach is optimized for multiple data access patterns common to spreadsheets~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}, including
(i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
(ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
(iii) Indexes that leverage patterns in the dependencies of adjacent cells to compress dependency graphs~\cite{tang-23-efcsfg}.
(iii) Indexes storing compressed dependency graphs~\cite{DBLP:conf/sigmod/BendreWMCP19,tang-23-efcsfg}.
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
(i) Because it stores only the updates applied by the user (e.g., insert a row at position $x$, replace the value of cell $c$ with $v$, \ldots), the spreadsheet's full version history can be stored at negligible cost;
(ii) As in Wrangler, the resulting data transformation process can be easily applied to other data (e.g., by scaling up from an interaction-friendly sample of the data to the entire dataset, or an updated version of the data); and
(iii) As in Vizier, the user's interactions can be translated into a standardized query model (i.e., a Spark dataframe), allowing it to ``plug into'' existing scalable computation platforms (i.e., Spark) and standardized provenance analysis frameworks (e.g., \cite{kumari:2021:cidr:datasense}).
(i) It is a naturally efficient encoding of the spreadsheet's full version history;
(ii) As in Wrangler, the user's actions can be re-applied to new data (e.g., an updated version of the source data); and
(iii) As in Vizier, the spreadsheet can be re-encoded as a relational query allowing it to ``plug into'' existing scalable computation platforms (e.g., Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15}) and provenance analysis tools (e.g., \cite{kumari:2021:cidr:datasense}).
In this paper, we present an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
In an Overlay Spreadsheet (\Cref{fig:overlay}), the user's edits are stored in a spreadsheet that is ``overlaid'' on top of source data.
Users interact with an Overlay Spreadsheet just like an ordinary spreadsheet, inserting or removing rows or columns, overwriting data with formulas or literals, and reorganizing the data.
However, references to the source dataset are virtualized, allowing users to replay their actions on updated datasets, translate spreadsheets to run on scalable computation platforms, and facilitate provenance analysis.
We also demonstrate that this different virtual representation of edits enables more efficient exploitation of spreadsheet access patterns, including optimizing computation of cells visible to the user.
We propose an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
An Overlay Spreadsheet (\Cref{fig:overlay}) presents an interface analogous to a normal spreadsheet.
User edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
As an added benefit, decoupling edits and source data makes it easier to leverage spreadsheet access patterns, reducing the time needed to respond to user actions.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal, reproducibility-oriented, notebook-style workflow system built on Apache Spark.
Users of Vizier define sequences of data transformation steps that may include scripts, templated widgets, or other operations.
Existing versions of Vizier provide a spreadsheet-style interface, where each user interaction builds out the data transformation workflow.
In spite of the performance limitations of the virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features are implemented primarily over Spark dataframes.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark.
Existing versions of Vizier allow users to define workflow steps through a spreadsheet-style interface; each action adds a new workflow step.
In spite of the performance limitations of this virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features rely on encoding data transformations as Spark dataframes.
%
Our objective in this paper is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting for \textbf{replay and provenance} (i.e., like the virtual approach).
Our objective is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
As a secondary goal, we further explore the additional benefits of the overlay approach.
Specifically, we observe that because spreadsheet updates are typically made manually, the number of updates is limited by the speed of a human interacting with the system.
Although a single update may be applied to multiple cells (e.g., by copy/pasting a formula over a range of cells), the number of such updates is likely to be small.
In this paper, we take the first steps towards hybridizing the cell-at-a-time execution strategies of classical spreadsheets, with bulk computation strategies found in relational databases.
This hybrid strategy is akin to optimizations applied in data spread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operating over patterns of updates rather than patterns in the dependency graph.
As a secondary goal, we explore potential performance improvements that the overlay approach enables.
Specifically, we observe that bulk updates in a spreadsheet (e.g., pasting a formula across a range of cells) rely on expression ``patterns,''
which admit more efficient dependency analysis and bulk computation, when intermediate values are not required.
This hybrid strategy is akin to optimizations applied in DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operates over patterns of updates rather than patterns in the dependency graph, enabling additional optimizations.
% March 26 by OK: Trimming the ToC summary for space
%


@ -62,19 +62,19 @@
\subsection{Spreadsheets}
\label{sec:spreadsheets}
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels; unless otherwise noted, we assume $\rowDomain \subset \mathbb Z$.
Let $\valueDomain \subset \exprDomain$ denote domains of values and expressions, respectively; We define $\exprDomain$ in greater detail below.
We define a \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ as a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels. Except where noted, $\rowDomain \subset \mathbb Z$.
Let $\valueDomain$ and $\exprDomain \supset \valueDomain$ denote domains of values and expressions, respectively.
A \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ is a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
We use $\valat{\spreadsheet}{\column}{\row}$ to denote $\spreadsheet(\cellRef{\column}{\row})$.
Let $\errorval \in \valueDomain$ indicate ``undefined'' and define the \emph{domain} $\dom(\spreadsheet)$ to be the set of cells $\cellRef{\column}{\row}$ where $\valat{\spreadsheet}{\column}{\row} \neq \errorval$.
An expression $\expr \in \exprDomain$ is a formula defined over literals from $\valueDomain$, the standard arithmetic operators, and references to other cells in the spreadsheet ($\cellRef{\column}{\row}$).
The expression $\expr$ may be evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
(i) Literals evaluate to themselves, (ii) Arithmetic formulas are evaluated in the usual way, and (iii) References to the spreadsheet are evaluated recursively
The expression $\expr$ is evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
(i) Literals and arithmetic are evaluated in the usual way, and (ii) References to the spreadsheet are evaluated recursively
($\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$).
By convention, cyclic references evaluate to the distinguished error value $\errorval$ in $\valueDomain$.
We define the dependencies of an expression ($\depsOf{\expr}$) as the cells referenced by $\expr$.
By convention, cyclic references evaluate to $\errorval$.
%
An expression's dependencies ($\depsOf{\expr}$) are the cells referenced by $\expr$.
Dependencies induce a graph $\DG{\spreadsheet}\tuple{N, E}$ over the spreadsheet, with cells as nodes (i.e., $N = \columnDomain \times \rowDomain$), and dependencies as directed edges:
$$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
\{\;\cell \rightarrow \cellPrime\;|\;\cellPrime \in \depsOf{\valat{\spreadsheet}{\column}{\row}}\;\} $$
@ -84,8 +84,8 @@ Note that if all cell expressions are constants (i.e., a spreadsheet without for
\begin{example}
Consider the spreadsheet at the top of \Cref{fig:example-spreadsheet-and-a}.
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds arithmetic expressions referencing cells from columns \emph{A} and \emph{B}.
Evaluating this spreadsheet assigns each cell a concrete value, as in the top right.
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds expressions that reference cells from columns \emph{A} and \emph{B}.
Evaluating this spreadsheet assigns each cell a value, as in the top right.
For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{A}{1} + \cellRef{B}{1}} = \evalOf{\spreadsheet}{\cellRef{A}{1}} + \evalOf{\spreadsheet}{\cellRef{B}{1}} = 15 + 50 = 65$.
\end{example}
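The evaluation rules and the dependency function above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the tuple encoding of expressions, the names `evaluate`, `deps`, and `ERROR`, and the use of `None` for the error value are all assumptions made for this example. Only the values of A1, B1, and C1 come from the example above.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[str, int]   # (column label, row position)
ERROR = None             # stands in for the undefined/error value

# Hypothetical expression encoding: a number is a literal,
# ("ref", col, row) is a cell reference, and (op, e1, e2) is arithmetic.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def deps(expr) -> Set[Cell]:
    """Cells referenced directly by an expression."""
    if isinstance(expr, tuple):
        if expr[0] == "ref":
            return {(expr[1], expr[2])}
        return deps(expr[1]) | deps(expr[2])
    return set()

def evaluate(sheet: Dict[Cell, object], expr, visiting=frozenset()):
    """Evaluate expr against sheet; cycles and undefined cells yield ERROR."""
    if not isinstance(expr, tuple):
        return expr                      # literals evaluate to themselves
    if expr[0] == "ref":
        cell = (expr[1], expr[2])
        if cell in visiting or cell not in sheet:
            return ERROR                 # cyclic or undefined reference
        return evaluate(sheet, sheet[cell], visiting | {cell})
    left = evaluate(sheet, expr[1], visiting)
    right = evaluate(sheet, expr[2], visiting)
    if left is ERROR or right is ERROR:
        return ERROR
    return OPS[expr[0]](left, right)

sheet = {
    ("A", 1): 15,
    ("B", 1): 50,
    ("C", 1): ("+", ("ref", "A", 1), ("ref", "B", 1)),
}
print(evaluate(sheet, ("ref", "C", 1)))  # 65
```

Tracking the set of cells on the current evaluation path is one simple way to realize the convention that cyclic references evaluate to the error value; a production engine would more likely detect cycles once on the dependency graph.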
@ -96,7 +96,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
\begin{minipage}{0.48\linewidth}
\centering
\textbf{Spreadsheet $\spreadsheet$}\\
\textbf{\small Spreadsheet $\spreadsheet$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -111,7 +111,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
%
\begin{minipage}{0.49\linewidth}
\centering
\textbf{Evaluated Spreadsheet $\evalOf{\spreadsheet}{\cdot}$}\\
\textbf{\small Evaluated Spreadsheet $\evalOf{\spreadsheet}{\cdot}$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -133,7 +133,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
%$,$\\ %\vspace{5mm}
\begin{minipage}{0.46\linewidth}
\centering
\textbf{Updated Spreadsheet $\upd(\spreadsheet)$}\\
\textbf{\small Updated Spreadsheet $\upd(\spreadsheet)$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -148,7 +148,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
%
\begin{minipage}{0.49\linewidth}
\centering
\textbf{Evaluated Update $\evalOf{\upd(\spreadsheet)}{\cdot}$}\\
\textbf{\small Evaluated Update $\evalOf{\upd(\spreadsheet)}{\cdot}$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -163,6 +163,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
\vspace{-3mm}
\caption{Example spreadsheet with expressions shown in \textcolor{tabexprcolor}{dark green}, and an update applied to the spreadsheet with updated expressions and values shown in \uv{red}.}\label{fig:example-spreadsheet-and-a}
\trimfigurespacing
% \vspace{1mm}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -170,8 +171,8 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
\subsection{Cell Updates}
\label{sec:updates}
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ to a spreadsheet is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : (\acu \in \upd)$).
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : ([\acu] \in \upd)$).
Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spreadsheet:
\[
\valat{\upd(\spreadsheet)}{\column}{\row} =
@ -182,19 +183,18 @@ Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spr
\]
An update may affect cells beyond its domain.
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes the constant expression in cell \emph{\cellRef{A}{1}} and the arithmetic expression in cell \emph{\cellRef{C}{3}}.
Evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes two cells \emph{\cellRef{A}{1}} and \emph{\cellRef{C}{3}}, but evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
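The gap between an update's domain and the cells whose values change can be illustrated with a small sketch. The dictionary encoding, the helper names, the row 3 values, and the specific update below are hypothetical stand-ins for the figure's contents; only A1 = 15, B1 = 50, and C1 = A1 + B1 are taken from the running example.

```python
def apply_update(sheet, update):
    """Return U(S): cells in dom(U) take the update's expression, all others are kept."""
    new_sheet = dict(sheet)
    new_sheet.update(update)
    return new_sheet

def evaluate(sheet, expr):
    """Tiny evaluator for numbers, ("ref", col, row), and ("+", e1, e2); assumes no cycles."""
    if isinstance(expr, tuple) and expr[0] == "ref":
        return evaluate(sheet, sheet[(expr[1], expr[2])])
    if isinstance(expr, tuple) and expr[0] == "+":
        return evaluate(sheet, expr[1]) + evaluate(sheet, expr[2])
    return expr

def changed_cells(sheet, update):
    """Cells whose evaluated value differs between S and U(S)."""
    after = apply_update(sheet, update)
    return {
        cell for cell in after
        if cell not in sheet
        or evaluate(sheet, sheet[cell]) != evaluate(after, after[cell])
    }

sheet = {
    ("A", 1): 15, ("B", 1): 50, ("C", 1): ("+", ("ref", "A", 1), ("ref", "B", 1)),
    ("A", 3): 5,  ("B", 3): 7,  ("C", 3): ("+", ("ref", "A", 3), ("ref", "B", 3)),
}
update = {("A", 1): 20, ("C", 3): ("+", ("ref", "B", 3), 1)}

print(sorted(update))                        # the update's domain: A1 and C3
print(sorted(changed_cells(sheet, update)))  # A1, C1, and C3 change value
```

Here the update touches two cells, but C1 also changes because it depends on A1.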
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Spreadsheet Access to Datasets}
\label{sec:spre-access-datas}
To uniformly model spreadsheet access to relational data as well as to data already represented as spreadsheets, we assume an input dataset $\ds$ with a designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
For example, in a relational table, these can be the table's columns and values of a key or rowid attribute, respectively.
For a spreadsheet or csv data, $\rowDomain_{\ds} \subset \mathbb Z$ can be the position of the row.
We use $\valat{\ds}{\row}{\column}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
To uniformly model source datasets, whether from relational databases or other spreadsheets, we assume an input dataset $\ds$ with designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
In a relational table, these are the table's columns and the values of a key or rowid attribute, respectively.
For csv data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row in the file.
We write $\valat{\ds}{\column}{\row}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map that maps rows in $\ds$ into the spreadsheet.
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map from rows in $\ds$ to rows of the spreadsheet.
A \emph{spreadsheet overlay} for a dataset $\ds$ is then a pair $(\ds, \rframe)$ that defines a spreadsheet $\spreadsheet_{\ds, \rframe}$ with domains $\columnDomain = \columnDomain_{\ds}$, $\rowDomain = \dom(\rframe)$ as
$
\valat{\spreadsheet_{\ds, \rframe}}{\column}{\row} = \valat{\ds}{\column}{\rframe^{-1}(\row)}
@ -204,18 +204,17 @@ $
\subsection{Overlay Updates}
\label{sec:overlay-updates}
An Overlay Update describes a set of changes to a target spreadsheet (or dataset).
Changes may include cell updates as already discussed, or the insertion, deletion, or reordering of rows or columns.
As we discuss in \Cref{sec:system-presentation}, column operations are purely cosmetic in our model, and so we focus on cell and row updates.
An Overlay Update describes a set of changes to a spreadsheet (or dataset).
As we discuss in \Cref{sec:system-presentation}, column operations are purely cosmetic in our model, and we focus on cell and row updates exclusively.
Concretely, a spreadsheet overlay $\overlay = \aol$ is a reference frame transformation $\rtrans$ and a set of pattern updates $\oup$, terms we now define.
% We now define these terms, and discuss their semantics.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Reference Frame Transformations}
Recall that the spreadsheet's positional row references are translated into the native record format of the source dataset through a mapping function called a reference frame.
To insert, delete, or move rows in the spreadsheet, it is sufficient to simply modify the reference frame.
Formally, a reference frame transformation $\rtrans$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \errorval$ from an initial set of row positions to a new set of row positions, or the value $\errorval$ to indicate a deleted row.
The new reference frame for the spreadsheet overlay after applying $\overlay$ is $\rframe' = \rtrans \circ \mathcal F$, where $\circ$ denotes function composition.
Recall that a reference frame maps the spreadsheet's positional row references to native record identifiers.
Thus, to insert, delete, or move rows in the spreadsheet, it is sufficient to modify the reference frame.
Formally, a reference frame transformation $\rtrans$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \{\errorval\}$ from initial row positions to new row positions, or the value $\errorval$ for a deleted row.
The new reference frame, after applying $\overlay$, is $\rframe' = \rtrans \circ \mathcal F$, where $\circ$ denotes function composition.
As an example, consider deleting the 2nd row of the spreadsheet from \Cref{fig:example-spreadsheet-and-a}. The positions of rows $3$ and $4$ are decreased by one, while row $1$ retains its position:
$$\rtrans(x) = \begin{cases}
x & \textbf{if } x < 2\\
@ -224,23 +223,22 @@ $$\rtrans(x) = \begin{cases}
\end{cases}$$
Row insertions and movement are handled analogously.
Note that row insertions, deletions, and movement are each expressible in constant size, independent of the size of the data.
Note that row insertions, deletions, and movement are expressible in constant size, independent of the size of the data.
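A reference frame transformation can be sketched as a plain function on row positions. The names `delete_row`, `insert_row`, and `compose` are illustrative, `None` stands in for the deleted-row marker, and the insertion rule is an assumption, since the text only says insertions are handled analogously.

```python
from typing import Callable, Optional

RowMap = Callable[[int], Optional[int]]   # None marks a deleted row

def delete_row(k: int) -> RowMap:
    """Rows before k keep their position, row k is deleted, later rows shift up by one."""
    def rtrans(x: int) -> Optional[int]:
        if x < k:
            return x
        if x == k:
            return None
        return x - 1
    return rtrans

def insert_row(k: int) -> RowMap:
    """Assumed analogue for insertion: rows at or after k shift down, freeing position k."""
    return lambda x: x if x < k else x + 1

def compose(rtrans: RowMap, frame: RowMap) -> RowMap:
    """New reference frame: apply the old frame, then the transformation."""
    def new_frame(row_id: int) -> Optional[int]:
        position = frame(row_id)
        return None if position is None else rtrans(position)
    return new_frame

identity = lambda row_id: row_id            # source rows 1..4 sit at positions 1..4
frame = compose(delete_row(2), identity)
print([frame(r) for r in range(1, 5)])      # [1, None, 2, 3]
```

Each transformation is a closure over a single integer, which matches the observation that these operations take constant space regardless of the size of the data.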
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Pattern Updates}
Spreadsheets allow users to prototype a formula in one cell and then generalize the formula by copying and pasting it into a range of cells.
Spreadsheets allow a formula from one cell to be pasted across a range of cells.
%\footnote{``Relative" column and row references are updated to be relative to each cell the formula is pasted into.}
Such bulk interactions pose a challenge for state models that maintain an expression for each cell.
For example, a user might paste a formula into an entire column, creating one expression for each row of the dataset.
In lieu of this, an overlay update groups together the set of pasted cells into a single \emph{pattern}.
In a classical spreadsheet, bulk interactions like this modify each cell's expression individually.
Overlay spreadsheets avoid the high cost that individual modifications can entail by grouping together the set of pasted cells into a single \emph{pattern}.
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product $\columnRange \times [l,h]$ of a set of columns ($\columnRange \subseteq \columnDomain$) and row positions ($R \subset \mathbb{Z}$).
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product $\columnRange \times [l,h]$ of a set of columns ($\columnRange \subseteq \columnDomain$) and row positions ($R = [l, h] \subset \mathbb{Z}$).
%
A pattern update $\oup$ is a set of pairs $\{ (\rangeOf{C_i}{R_i}, \pattern_i) \}$ where $\rangeOf{C_i}{R_i}$ is a range and $\pattern_i$ is a \emph{pattern expression}, i.e., an expression that may also contain cell references where rows are relative offsets (written as $+i$ or $-i$).
Ranges $\rangeOf{C_i}{R_i}$ must be pairwise disjoint.
A pattern update $(\rangeOf{C_i}{R_i}, \pattern_i)$ assigns an expression to every cell $(\column, \row)$ in $\rangeOf{C_i}{R_i}$ by replacing any relative references of the form $(\column, \delta)$ in $\pattern_i$ with $(\column, \row + \delta)$. We use $\pattern_i(\cell)$ to denote the instantiation of pattern $\pattern_i$ for cell $\cell$.
Ranges in an update $\rangeOf{C_i}{R_i}$ must be pairwise disjoint.
A pattern update $(\rangeOf{C_i}{R_i}, \pattern_i)$ assigns an expression to every cell $\cellRef{\column}{\row}$ in $\rangeOf{C_i}{R_i}$ by replacing any relative references of the form $\cellRef{\column}{+\delta}$ in $\pattern_i$ with $\cellRef{\column}{\row + \delta}$. We use $\pattern_i(\cell)$ to denote instantiation of pattern $\pattern_i$ for cell $\cell$.
For instance, to store a running sum of the values in column \emph{C} as the cell values in column \emph{D} for the spreadsheet from \Cref{fig:example-spreadsheet-and-a}:\\[-2mm]
For instance, to store a running sum of the values in column \emph{C} into column \emph{D} (for the spreadsheet from \Cref{fig:example-spreadsheet-and-a}):\\[-2mm]
%
\[
\oup_{running} = \{\, (\rangeOf{D}{1}, (C,+0)),\; (\rangeOf{D}{2-4}, (C,+0) + (D,-1)) \,\}
@ -289,32 +287,31 @@ $\,$\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Semantics for Overlay Updates}
%
An overlay update $\overlay$ appleid to a spreadsheet $\spreadsheet$ defines the spreadsheet $\overlay(\spreadsheet$ computed by applying the reference frame update and then applying all pattern updates:
An overlay update $\overlay$ applied to a spreadsheet $\spreadsheet$ defines the spreadsheet $\overlay(\spreadsheet)$ computed by applying the reference frame update and then applying all pattern updates (with $\overlay = \tuple{\rtrans, \{ (\columnRange_i, \rowRange_i, \pattern_i)\}}$):
\begin{align*}
\valat{\overlay(\spreadsheet)}{\column}{\row} &=
\begin{cases}
\pattern_i(\cellRef{\column}{\row}) & \text{\textbf{if}} \exists i: \cellRef{\column}{\row} \in \rangeOf{C_i}{R_i} \\
\valat{\spreadsheet}{\column}{\rtrans^{-1}(\row)} & \text{\textbf{if}} \exists \row': \rtrans(\row') = \row\\
\pattern_i((\column,\row)) & \text{\textbf{if}} \exists i: (\column,\row) \in \rangeOf{C_i}{R_i} \\
\errorval &\text{\textbf{otherwise}}\\
\end{cases}
\end{align*}
\begin{example}
\label{ex:recursive-running-sum}
Consider applying our example update ($\overlay_{running} = (\rtrans_{id},\oup_{running})$ where $\rframe_{id}(x) = x$) to our running example spreadsheet.
The result is shown in \Cref{fig:example-overlay-update}. The column $D$ now computes the running sum of column $C$.
Consider our example update ($\overlay_{running} = (\rtrans_{id},\oup_{running})$ where $\rtrans_{id}(x) = x$).
\Cref{fig:example-overlay-update} shows the result of applying $\overlay_{running}$ to our running example spreadsheet.
\end{example}
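The running-sum overlay can be sketched as follows. This is a simplified illustration under several assumptions: the reference frame transformation is the identity, pattern updates take precedence over the base spreadsheet, only addition is supported, and the values of C2 through C4 are made up (only C1 = 65 comes from the running example). The tuple encoding and function names are hypothetical.

```python
def instantiate(pattern, row):
    """Replace relative references ("rel", col, delta) with ("ref", col, row + delta)."""
    if isinstance(pattern, tuple):
        if pattern[0] == "rel":
            return ("ref", pattern[1], row + pattern[2])
        op, left, right = pattern
        return (op, instantiate(left, row), instantiate(right, row))
    return pattern

def expr_at(base, pattern_updates, cell):
    """Expression of a cell: the matching pattern if any, else the base expression."""
    col, row = cell
    for cols, (low, high), pattern in pattern_updates:
        if col in cols and low <= row <= high:
            return instantiate(pattern, row)
    return base.get(cell)                   # None if the cell is undefined

def value_at(base, pattern_updates, cell):
    return _eval(base, pattern_updates, expr_at(base, pattern_updates, cell))

def _eval(base, pattern_updates, expr):
    if isinstance(expr, tuple):
        if expr[0] == "ref":
            return value_at(base, pattern_updates, (expr[1], expr[2]))
        _, left, right = expr               # only "+" is supported in this sketch
        return _eval(base, pattern_updates, left) + _eval(base, pattern_updates, right)
    return expr

base = {("C", 1): 65, ("C", 2): 10, ("C", 3): 20, ("C", 4): 5}
running = [
    ({"D"}, (1, 1), ("rel", "C", 0)),
    ({"D"}, (2, 4), ("+", ("rel", "C", 0), ("rel", "D", -1))),
]
print([value_at(base, running, ("D", r)) for r in range(1, 5)])  # [65, 75, 95, 100]
```

The two pattern entries describe all four cells of column D; per-cell expressions are produced only on demand by `instantiate`.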
Several remarks are in order. First, note that overlays can be used to encode common spreadsheet update operations in constant space (per update), including bulk updates via copy/paste.
Second, \cite{tang-23-efcsfg} uses similar ideas to compress the dependencies in a spreadsheet using ranges and patterns, but focuses exclusively on the dependency graph and not on compacting the spreadsheet itself.
Several remarks are in order. First, overlays can be used to encode common spreadsheet update operations in constant space (per update), including bulk updates via copy/paste.
Second, \cite{tang-23-efcsfg} uses similar ideas to compress the dependencies in a spreadsheet using ranges and patterns, but focuses exclusively on the dependency graph rather than expressions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Replacing Source Data}
\label{sec:updating-datasets}
A major advantage of modeling spreadsheets as overlays is that source data may be updated.
An overlay designed for source data $(\ds, \rframe)$ may be applied to a dataset $(\ds', \rframe')$ as long as, for each $\row\in \rowDomain_{\ds}$ that corresponds to some $\row' \in \rowDomain_{\ds'}$, we have $\rframe'(\rframe^{-1}(\row)) = \row'$.
This is possible if, for example, $\rowDomain_{\ds}= \rowDomain_{\ds'}$ is a semantic key for the dataset.
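As a rough illustration of replaying edits on replaced source data, the sketch below keeps the user's edits separate from the dataset and rebuilds the spreadsheet from a new dataset version that shares the same keys. The record layout, the `overlay_sheet` helper, and all values are hypothetical, and edits are restricted to literal cell overwrites for brevity.

```python
def overlay_sheet(dataset, frame, edits):
    """Build the visible spreadsheet from source records, a reference frame, and edits.

    dataset: {key: {column: value}}        source records, identified by a semantic key
    frame:   {key: row position}           reference frame
    edits:   {(column, position): value}   user overwrites, stored apart from the data
    """
    sheet = {
        (col, frame[key]): value
        for key, record in dataset.items()
        for col, value in record.items()
    }
    sheet.update(edits)
    return sheet

frame = {"alice": 1, "bob": 2}
edits = {("city", 2): "Buffalo"}           # a manual fix to bob's record

old_data = {"alice": {"age": 30, "city": "Chicago"}, "bob": {"age": 41, "city": "Bufalo"}}
new_data = {"alice": {"age": 31, "city": "Chicago"}, "bob": {"age": 42, "city": "Bufalo"}}

before = overlay_sheet(old_data, frame, edits)
after = overlay_sheet(new_data, frame, edits)
print(after[("age", 1)], after[("city", 2)])   # 31 Buffalo
```

Because both dataset versions use the same keys, the same reference frame applies to both, and the manual fix survives the refresh while the refreshed values show through everywhere else.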


@ -3,35 +3,32 @@
\section{Related Work}
\label{sec:related-work}
Although spreadsheets present a convenient, direct-manipulation interface to data, they lack the scalability to manage large data.
A common approach to scaling spreadsheets --- what we term the ``virtual'' approach --- is to reformulate the interface to an existing database or workflow system using spreadsheet-style direct manipulation metaphors~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/icde/LiuJ09,freire:2016:hilda:exception,DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/chi/KandelPHH11}.
Although spreadsheets present a convenient interface to data, they lack the scalability to manage large data.
A common approach to scaling spreadsheets (the ``virtual'' approach) reformulates the interface to an existing database or workflow system using spreadsheet-style direct manipulation metaphors~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/icde/LiuJ09,freire:2016:hilda:exception,DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/chi/KandelPHH11}.
The resulting systems bear varying levels of resemblance to existing spreadsheets, usually introducing concepts from relational databases like explicit tables, attributes, and records.
Vizier~\cite{brachmann:2019:sigmod:data, kennedy:2022:ieee-deb:right, kumari:2021:cidr:datasense, brachmann:2020:cidr:your} is a computational notebook system that automatically versions notebooks as they are edited by users.
In Vizier, any dataset used in a computational notebook can be accessed and edited through a spreadsheet interface; the resulting edits are integrated into the workflow.
%
Wrangler~\cite{DBLP:conf/chi/KandelPHH11} is an ETL workflow development tool with an interface inspired by spreadsheets.
Users open a small sample of a dataset in Wrangler and use spreadsheet-style direct manipulations to indicate a desired change to the dataset.
Wrangler, in turn, proposes ETL workflow steps that can achieve the user's desired effect on the target cell, as well as the remainder of the dataset.
Other approaches more directly mimic relational databases through spreadsheet-style interfaces.
Users open a small sample of a dataset in Wrangler and use spreadsheet-style direct manipulations to indicate desired changes to the dataset.
%
Vizier~\cite{brachmann:2019:sigmod:data, kennedy:2022:ieee-deb:right, kumari:2021:cidr:datasense, brachmann:2020:cidr:your} is a computational notebook system that allows users to define workflow stages through a spreadsheet-style interface.
%
Other approaches more directly mimic relational databases:
The Spreadsheet Algebra~\cite{DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/icde/LiuJ09} allows users to specify any SPJGA-query purely through spreadsheet-style user interactions.
Related Worksheets~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/chi/BakkeKM11} re-imagines the classical spreadsheet-style interface by introducing relational structure, as well as nested display of foreign-key dependencies.
Related Worksheets~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/chi/BakkeKM11} re-imagines the classical spreadsheet-style interface with record structure and inlined display of foreign-key references.
A second class of approach --- what we term the ``materialized'' approach --- instead redesigns the spreadsheet engine itself through database concepts;
The primary example in this space is DataSpread~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}.
A key challenge that the materialized approach faces is that classical database techniques, which exploit common structures in a dataset, are not directly applicable.
A second approach (the ``materialized'' approach) instead redesigns the spreadsheet engine itself through database concepts.
An example is DataSpread~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}.
A key challenge is that classical database techniques, which exploit common structures in a dataset, are not directly applicable.
\cite{DBLP:conf/icde/BendreVZCP18} explores data structures that can leverage partial structure; for example, when a range of cells are structured as a relational table.
\cite{DBLP:conf/sigmod/BendreWMCP19} explores strategies for quickly invalidating cells and computing dependencies, by leveraging a (lossy) compressed dependency graph that can efficiently bound a cell's downstream.
\cite{tang-23-efcsfg} introduces a different type of compressed dependency graph which is lossless, instead exploiting repeating patterns in formulas.
This is analogous to our own approach, but focuses on the dependency graph;
As we demonstrate here, applying a similar approach to expressions as well creates multiple optimization opportunities.
This is analogous to our own approach, but focuses on the dependency graph rather than expressions, limiting opportunities for optimization.
In summary, several efficient algorithms for storing, accessing, and updating spreadsheets have been developed and adapted in the context of the DataSpread.
The approach developed for Vizier is often less efficient, but has the advantage of supporting light-weight versioning and tracking the provenance of the evolution of a dataset (and the computational notebook containing it) under spreadsheet operations.
Importantly, this approach enables replaying a user's updates that were originally applied to a dataset $D_{old}$ when $D_{old}$ is replaced with an updated dataset $D_{new}$ (e.g., the user may have downloaded a new version of an open dataset and wants to keep the manual fixes they have applied to the original version of the dataset).
The overlay approach we present in this work has the potential to retain these benefits while enabling performance competitive with, or exceeding that of DataSpread.
Furthermore, overlays with reference frames enable more efficient support for insertion and deletion for rows and columns as this only affects reference frames, but not the formulas of cells.
In summary, DataSpread introduced multiple efficient algorithms for storing, accessing, and updating spreadsheets.
The virtual approach is often less efficient, but has the advantage of supporting light-weight versioning and provenance tracking.
Crucially, this approach also enables replaying a user's updates, originally applied to one dataset, on a new dataset (e.g., to re-apply curation work on an updated version of the data).
The overlay approach we present in this work has the potential to retain these benefits while enabling performance competitive with DataSpread.
% Furthermore, overlays with reference frames allow more efficient insertion and deletion for rows and columns as this only affects reference frames, but not the formulas of cells.
%%% Local Variables:


@ -2,34 +2,33 @@
\section{System Design}
\label{sec:system}
We now overview our prototype overlay spreadsheet, implemented for use with the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and import/export format compatibility.
Our prototype likewise builds on Spark, using any dataframe as a data source.
Our prototype overlay spreadsheet is implemented within the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and data import/export.
Our prototype is designed to accept any Spark dataframe as a data source.
The prototype's design is illustrated in \Cref{fig:systemdesign}
Client applications (e.g., Javascript-based frontends) connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and provides light syntactic sugar over the underlying data and update model.
The data model itself is maintained by an \textbf{Execution} layer that is responsible for evaluating spreadsheet cells and materializing a subset of the cell values that are viewable.
The execution layer applies an update overlay stored by an \textbf{Indexing} layer to an arbitary Spark dataframe.
A simple LRU \textbf{Cache} provides efficient random access to a subset of the dataframe's rows.
% The prototype's design is illustrated in \Cref{fig:systemdesign}
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates our simplified model of a spreadsheet into a more natural interface.
An \textbf{Execution} layer is responsible for evaluating spreadsheet cells and materializing values for the viewable set of cells.
An \textbf{Indexing} layer provides efficient access to the updates themselves, and a simple LRU cache provides efficient random access to the source dataframe.
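The LRU cache in front of the source dataframe could look roughly like the sketch below. It is not Vizier's implementation: the `RowCache` class, the `fetch_row` callback (standing in for whatever retrieves one row from the dataframe), and the capacity are all assumptions for illustration.

```python
from collections import OrderedDict

class RowCache:
    """Least-recently-used cache of source rows, keyed by row identifier."""

    def __init__(self, fetch_row, capacity=1024):
        self._fetch_row = fetch_row      # hypothetical callback: row id -> row
        self._capacity = capacity
        self._rows = OrderedDict()       # least recently used entries come first

    def get(self, row_id):
        if row_id in self._rows:
            self._rows.move_to_end(row_id)       # mark as most recently used
            return self._rows[row_id]
        row = self._fetch_row(row_id)            # cache miss: go to the source
        self._rows[row_id] = row
        if len(self._rows) > self._capacity:
            self._rows.popitem(last=False)       # evict the least recently used row
        return row

fetched = []

def fetch_row(row_id):
    fetched.append(row_id)
    return {"id": row_id}

cache = RowCache(fetch_row, capacity=2)
for row_id in [1, 2, 1, 3, 1, 2]:
    cache.get(row_id)
print(fetched)   # [1, 2, 3, 2]
```

Only misses reach the source, so repeated access to the rows currently visible to the user is served from memory.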
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation Layer}
\label{sec:system-presentation}
Multiple user-facing client applications connect to the overlay spreadsheet through a presentation layer.
This layer mediates concurrent updates of the spreadsheet, allows clients to subscribe to push-based updates of cell state, and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns, as well as maintaining a bound over the number of rows in the spreadsheet.
Operations over columns (insertion, deletion, reordering) are handled at this layer, allowing lower levels to reference the (comparatively small) set of columns by column identity.
With the exception of updates to columns, most updates are coalesced into a serial order and relayed to lower levels.
User-facing client applications connect to the overlay spreadsheet through a presentation layer.
This layer mediates concurrent updates of the spreadsheet and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns.
Column operations (insertion, deletion, reordering) are handled at this layer, so lower levels can reference the (comparatively small) set of columns solely by column identity.
Other updates are put into a serial order and relayed to lower levels.
\begin{figure}
\includegraphics[width=0.4\columnwidth]{graphics/system-arch}
\caption{Overlay system design.}
\label{fig:systemdesign}
\trimfigurespacing
\end{figure}
% \begin{figure}
% \includegraphics[width=0.4\columnwidth]{graphics/system-arch}
% \caption{Overlay system design.}
% \label{fig:systemdesign}
% \trimfigurespacing
% \end{figure}
The presentation layer expects the level below it to provide (i) efficient random access to cell values, (ii) subscription access to state (e.g., value) updates for ranges of cells.
The presentation layer expects the Executor to provide efficient random access to cell values and support updating ranges of cells with pattern expressions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -37,11 +36,11 @@ The presentation layer expects the level below it to provide (i) efficient rando
\subsection{Executor}
\label{sec:system-executor}
The executor provides efficient access to cell values and is responsible for pushing notifications about cell state changes to clients.
Cell values is derived from two sources:
The executor provides efficient access to cell values and generates notifications about cell state changes.
Cell values are derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) A series of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an update index, respectively.
(ii) A sequence of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the update index, as discussed below.
% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
@ -50,37 +49,35 @@ These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an upda
% \spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(r)] & \textbf{otherwise}
% \end{cases}$$
The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
However, the computational cost of this approach can be proportional to the size of the data, as it requires expanding patterns out over all individual cells.
The Executor and Index leverage the fact that updates are already provided as patterns to materialize the spreadsheet faster.
We return to the index and efficient strategies for computing dependencies in \Cref{sec:system-index}, and first consider expression evaluation.
Specifically, we rely on the observation that only a small fraction of cells will be visible at any one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} uses this observation to prioritize evaluation of visible cells).
The naive approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) computes a topological sort over cell dependencies and evaluates cells in this order.
The Executor side-steps the linear (in the data size) cost of the naive approach through two insights:
(i) Updates applied over multiple cells are already provided by higher layers as patterns, and
(ii) Only a small fraction of cells will be visible at any one time.
Assuming the dependencies of a range of cells can be computed efficiently (we return to this assumption in \Cref{sec:system-index}), only the visible cells and any hidden dependencies need to be evaluated.
The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.
Note that some dependency chains (e.g., the running sum example) still require computation for each row of data (e.g., if the last row is visible).
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such cell's expressions can often be rewritten into a closed form.
For example, any given cell in a running sum column may be expressed in terms of the sum of all preceding cells.
Our preliminary experiments show that when a chain of dependencies becomes sufficiently long, bulk computation can be used to provide a more responsive interface.
Some dependency chains (e.g., running sums) still require computation for each row of data.
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such pattern expressions can often be rewritten into a closed form.
For example, any cell in a running sum column is equivalent to a sum over the preceding cells.
Our preliminary experiments (\Cref{sec:experiments}) suggest promise in a hybrid evaluation strategy that evaluates visible cells individually and computes cells defined by patterns through closed form aggregate queries.
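As a minimal illustration of such a closed form (assuming, hypothetically, a running-sum pattern that defines each cell of a column $D$ as the cell above it plus the adjacent cell of a column $C$, with rows numbered from $1$; $d_k$ and $c_k$ are shorthand introduced here for the values in row $k$), unrolling the recursion gives:
\[
d_k \;=\; d_{k-1} + c_k \;=\; d_{k-2} + c_{k-1} + c_k \;=\; \cdots \;=\; \sum_{i=1}^{k} c_i, \qquad d_1 = c_1 .
\]
Under these assumptions, the value of a visible cell in row $k$ can be obtained from one aggregate query over rows $1$ through $k$ of $C$, rather than by evaluating the $k-1$ hidden cells above it one at a time.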
\partitle{Incremental Updates}
\partitle{Updates}
When the executor receives an update to a cell, it uses the index to compute the set of invalidated cells, marks them as ``pending,'' and begins re-evaluating them in topological order.
An update to the reference frame is applied to both the index and the data source.
Following typical spreadsheet semantics, an insertion or row move updates references in dependent formulas, so modulo changes in the set of visible rows, no re-evaluation is required.
Following typical spreadsheet semantics, an insertion or row move updates references in dependent formulas, so no re-evaluation is typically required.
If a row with dependent cells is deleted, the dependent cells need to be updated to indicate the error.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Update Index}
\label{sec:system-index}
The update index provides efficient positional access to the spreadsheet (denoted $\spreadsheet_\overlay$) defined by a sequence of updates ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$), with $\errorval$ for all undefined cells.
Specifically, the index is required to support:
(i) Access to the expressions for individual cells $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) Computing the upstream of a range of cells (for topological sort and computing the active set), and
(iii) Computing the downstream of a range of cells (for cell invalidation after an update).
The key insight behind the index is that it stores updates in the form of pattern-range tuples to avoid materializing the full spreadsheet.
As noted above, we assume that the number of columns is comparatively small, and the number of rows is comparatively large.
The update index stores a sequence of updates ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the cells of an overlay spreadsheet (denoted $\spreadsheet_\overlay$) where undefined cells have the value $\errorval$.
Specifically, the index supports efficient lookup of:
(i) cell expressions $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) upstream dependencies of a range (for topological sort and computing the active set), and
(iii) downstream dependents of a range (for cell invalidation after an update).
The key insight behind the index is that updates are stored as pattern-range tuples instead of as individual cells.
%As noted above, we assume that the number of columns is small and the number of rows is large.
\begin{figure}
\includegraphics[width=0.7\columnwidth]{graphics/rangemap.pdf}
@ -90,24 +87,24 @@ As noted above, we assume that the number of columns is comparatively small, and
\end{figure}
\partitle{Range Maps}
The core building block for the update index is a one-dimensional range map, an ordered map with integer keys.
The update index is built over a one-dimensional range map, an ordered map with integer keys.
In addition to the usual operations of an ordered map (e.g., \texttt{put}, \texttt{get}, \texttt{successorOf}), we define the operation \texttt{bulkPut(low, high, value)} which is equivalent to a \texttt{put} on every element in the range from \texttt{low} to \texttt{high}.
Implemented naively through a binary tree over $N$ elements, this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$ time.
Implemented naively (e.g., as a binary tree over $N$ elements), this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$ time.
A range map avoids the $(\texttt{high}-\texttt{low})$ factor (and correspondingly reduces $N$) by storing an ordered sequence of disjoint ranges, each mapping one specific value as illustrated in \Cref{fig:rangemap}.
A binary tree provides efficient membership lookups over the ranges.
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values), and similar deletion and insertion costs.
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values); insertion and deletion have similar costs.
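To make the difference concrete, consider a sketch with hypothetical numbers: a single $\texttt{bulkPut}(1, 10^6, v)$ applied to an initially empty map.
\[
\underbrace{10^6 \text{ entries},\;\; O(10^6 \cdot \log(10^6)) \text{ time}}_{\text{one entry per key}}
\qquad \text{versus} \qquad
\underbrace{1 \text{ entry } ([1, 10^6] \mapsto v),\;\; O(1) \text{ time}}_{\text{range map}}
\]
A subsequent \texttt{get} for any key in the range then searches a tree with a single node rather than one with $10^6$ nodes.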
\partitle{Cell Access}
The index layer maintains a ``forward" index: An unordered map that stores a range map for each column.
To compute the expression for a cell $\cellRef{\column}{\row}$, the index layer (i) looks up the range map for $\column$ in the unordered map, (ii) looks up $\row$ in the range map to obtain a pattern (and returns $\emptyset$ if the row is undefined), and (iii) computes the expression by applying the pattern to $\cellRef{\column}{\row}$.
The index layer maintains a ``forward'' index: An unordered map $\mathcal I$ that stores a range map $\mathcal I[\column]$ for each column.
The expression for a cell $\cellRef{\column}{\row}$ is stored at $\mathcal I[\column][\row]$.
\begin{algorithm}
\caption{\textbf{upstream}($\columnRange$, $\rowRange$)}
\label{alg:upstream}
\begin{algorithmic}[1]
\Require $\rangeOf{\columnRange, \rowRange}$: A range of cells to compute the upstream of.
\Ensure $\texttt{upstream}$: A set of cells on which $\rangeOf{\column}{\rowRange}$ is a dependency.
\Require $\rangeOf{\columnRange}{\rowRange}$: A range to compute the upstream of.
\Ensure $\texttt{upstream}$: Cells on which $\rangeOf{\columnRange}{\rowRange}$ depends.
\State $\texttt{upstream} \leftarrow \{\}$
\State $\texttt{work} \leftarrow \comprehension{(\column, \rowRange, \{\})}{\column \in \columnRange}$
\While{$(\column', \rowRange', \texttt{lineage}) \leftarrow \texttt{work}.\textbf{dequeue}$}
@ -117,7 +114,8 @@ To compute the expression for a cell $\cellRef{\column}{\row}$, the index layer
\If{$(\column_{d}, \rowRange_{d})$ is non-empty}
\State $\texttt{upstream} \leftarrow \texttt{upstream} + (\column_{d}, \rowRange_{d})$
\State $\texttt{work}.\textbf{enqueue}( \column_{d}, \rowRange_{d},$\\
\hfill$\comprehension{ \texttt{p}' \rightarrow (\texttt{o}'+\texttt{offset})}{ ((\texttt{p}' \rightarrow \texttt{o}' )\in \texttt{lineage}}$\\
\hspace*{37mm}$\{\;\texttt{p}' \rightarrow (\texttt{o}'+\texttt{offset})$~~~~~~~~~\\
\hspace*{40mm}$|\; (\texttt{p}' \rightarrow \texttt{o}' )\in \texttt{lineage}\}$\\
\hfill $\cup \{\texttt{pattern} \rightarrow \texttt{offset}\} )$
\EndIf
\EndFor
@ -133,9 +131,9 @@ We refer to this set as the target's \emph{upstream}.
Each item in the BFS's work queue consists of a column, a row set, and a lineage; we will return to the lineage shortly.
For each work item enqueued, we query the forward index to obtain the set of patterns in the range (line 4), and iterate over the set of their dependencies (line 5).
If we discover a new dependency (lines 6-7), the newly discovered range is added to the return set and the work queue.
We will explain line 10 shortly.
We will explain lines 10-12 shortly.
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) computes the immediate dependencies of a range of cells $\rangeOf{\column, \rowRange}$ that share a pattern.
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) computes the immediate dependencies of a range of cells $\rangeOf{\column}{\rowRange}$ that share a pattern.
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column}{\rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf{\cell'}$.
The algorithm uses a recursive traversal (lines 6-7) to visit every cell reference (offset or explicit):
For offset references (lines 2-3), the provided range of rows is offset by the appropriate amount.
@ -162,26 +160,23 @@ For explicit cell references (lines 4-5), the explicit reference is used.
\partitle{Optimizing Recursive Reachability}
Consider a running sum, such as the one in \Cref{ex:recursive-running-sum}.
Observe that the $k$th element will have $O(k)$ upstream dependencies, and so naively following \Cref{alg:upstream} requires $O(k)$ compute.
The $k$th element will have $O(k)$ upstream dependencies, and so naively following \Cref{alg:upstream} requires $O(k)$ compute.
However, observe that a single pattern is responsible for all of these dependencies, suggesting that a more efficient option may be available.
Specifically, this dependency chain is defined by recursion over single pattern; all but the first cell depend on another cell defined by the same pattern.
We refer to a pattern that references cells defined by the same pattern as \emph{recursive}.
Note that a recursive pattern need not indicate a dependency cycle between individual cells.
This dependency chain arises from recursion over a single pattern; most cells depend on other cells defined by the same pattern.
We refer to such a pattern as \emph{recursive}, even if it does not create a dependency cycle over individual cells.
Our key insight is that for some (mutually) recursive patterns, the transitive closure of the dependencies will have a closed-form representation.
As with cell execution, the transitive closure of the dependencies of a recursive pattern often has a closed-form representation.
In our running example, the upstream of any $\cellRef{D}{k}$ is exactly $\cellRef{D}{1-(k-1)}$ and $\cellRef{C}{1-k}$.
%
The \texttt{lineage} field of \Cref{alg:upstream} is used to track the set of patterns visited, and the offset(s) at which they were visited.
If the pattern being visited already appears in the lineage, then we know it is recursive and that we can extend out the sequence of upstream cells across the remaining cells of the pattern.
If the offset is $\pm 1$, then the elements of this sequence are efficiently representable as a range of cells and can return it in $O(1)$ time.
When the offset is $\pm 1$, the elements of this sequence are efficiently representable as a range of cells, computable in $O(1)$ time.
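As a worked sketch of this case (assuming, as in the running example, that the pattern defining column $D$ references the cell one row above it, i.e., an offset of $-1$, and that the column starts at row $1$), the naive traversal of \Cref{alg:upstream} starting at $\cellRef{D}{k}$ discovers one cell per iteration, whereas recognizing the repeated pattern in the lineage allows the whole sequence to be emitted at once:
\[
\{\cellRef{D}{k-1}\} \cup \{\cellRef{D}{k-2}\} \cup \cdots \cup \{\cellRef{D}{1}\} \;=\; \cellRef{D}{1-(k-1)}
\]
The left-hand side corresponds to $k-1$ iterations of the traversal; the right-hand side is a single range whose endpoints follow directly from $k$, the offset, and the first row of the column.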
\partitle{Downstream Reachability}
When a cell's expression is updated, cells that depend on it (even transitively) must be recomputed.
The index must thus also support downstream reachability queries.
To support these efficiently, we maintain a backward index that relates cell ranges to the ranges of patterns that depend on it.
Analogous to $\textbf{getDeps}$ inferring cells immediately upstream of a range of cells, we can infer the cells downstream of any cell or set of cells, with one caveat.
When the cell identified by an absolute reference in a pattern is modified, all cells using the pattern are invalidated, so we track the set of ranges over which any given pattern is defined.
When a cell's expression is updated, cells that depend on it (even transitively) must be recomputed, so the index must support downstream reachability queries.
For efficient downstream lookups, the index maintains a ``backward'' index relating ranges to the set of patterns that depend on all cells in the range.
The resulting algorithm over the backward index is analogous to $\textbf{getDeps}$.
% \partitle{Column Insertions, Deletions, and Moves}