Boris Glavic 2023-07-07 17:22:59 +02:00
parent 27a543a49a
commit 79a7d2c77d
2 changed files with 5 additions and 5 deletions

@@ -26,7 +26,7 @@ By contrast, a spreadsheet is a grid of interdependent cells where the original
In this paper, we propose a model of spreadsheets that acts like a classical spreadsheet, but where the user's edits and the source data are decoupled.
The result is a spreadsheet that can be \emph{`overlaid'} on top of any dataset (`Overlay' in \Cref{fig:overlay}), no matter whether the source data is a raw datafile or the result of a workflow (e.g., in Vizier).
-Overlay Spreadsheets provide the flexibility of spreadsheets, while also supporting the replay capabilities needed for workflows.
+Overlay Spreadsheets provide the flexibility of spreadsheets, while also supporting the replay capabilities available in workflow systems.
%The resulting interface can support data scientists throughout the entire data lifecycle, from exploration through data curation pipeline development and analysis.
As we discuss in this paper, this new \emph{overlay} approach to spreadsheets also enables a new approach to scaling spreadsheets to larger data.
@@ -52,7 +52,7 @@ The latter observation arises because users typically edit large numbers of cell
The pasted formula acts as a template, and the pasted cells all follow a common pattern.
Like \cite{DBLP:conf/sigmod/BendreWMCP19}, we avoid storing formulas for each individual cell, instead storing patterns and ranges of pasted cells.
-Leveraging the user's interest inn only a small subset of the spreadsheet at any one time, overlay spreadsheets avoid computations outside of this subset.
+Leveraging the user's interest in only a small subset of the spreadsheet at any one time, overlay spreadsheets avoid computations outside of this subset.
Requests for cells originating in the source data can be handled efficiently by standard relational storage engines, while only formula cells visible to the user and their transitive dependencies need to be computed.
Unfortunately, common spreadsheet usage creates cells with transitive dependencies that scale with data size.
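As a hypothetical illustration of such a dependency chain (the column names $A$ and $B$ and the row count $n$ are assumptions made here, not notation from the source), consider a running total pasted down column $B$ over source column $A$:
\[
  B_1 = A_1, \qquad B_i = B_{i-1} + A_i \quad (2 \le i \le n).
\]
Unrolling the recurrence gives $B_n = \sum_{i=1}^{n} A_i$.
Under this sketch, displaying the single cell $B_n$ transitively depends on the $n-1$ formula cells $B_1, \ldots, B_{n-1}$ and the $n$ source cells $A_1, \ldots, A_n$, i.e., $2n-1$ cells in total, so the work grows linearly with the data size even though only one cell is visible.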
@@ -62,7 +62,7 @@ Although slower for small datasets, batch engines scale to larger workloads more
\paragraph{Overlay Spreadsheets}
We propose \emph{Overlay Spreadsheets}, which present an interface analogous to a normal spreadsheet, but where user edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
-We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark; Our implementation replaces its existing workflow-style spreadsheet.
+We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark. Our implementation replaces its existing workflow-style spreadsheet.
Our objective is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
% March 26 by OK: Trimming the ToC summary for space

@@ -189,7 +189,7 @@ For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes tw
\subsection{Spreadsheet Access to Datasets}
\label{sec:spre-access-datas}
-To uniformly model source datasets, whether from relational databases or other spreadsheets, we assume an input dataset $\ds$ with designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
+To uniformly model source datasets, whether originating from relational databases or other spreadsheets, we assume an input dataset $\ds$ with designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
In a relational table, these are the table's columns and the values of a key or rowid attribute, respectively.
For csv data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row in the file.
We write $\valat{\ds}{\row}{\column}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
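As a small hypothetical example of this notation (the file contents and the choice of zero-based row positions are assumptions for illustration only), consider a CSV file with header \texttt{name,qty} and two data rows \texttt{apple,3} and \texttt{pear,5}.
One possible reading is
\[
  \columnDomain_{\ds} = \{\texttt{name}, \texttt{qty}\}, \qquad \rowDomain_{\ds} = \{0, 1\},
\]
so that, for instance, $\valat{\ds}{0}{\texttt{name}} = \texttt{apple}$ and $\valat{\ds}{1}{\texttt{qty}} = 5$.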
@@ -205,7 +205,7 @@ $
\label{sec:overlay-updates}
An Overlay Update describes a set of changes to a spreadsheet (or dataset).
-As we will discuss in \Cref{sec:system-presentation}, column operations are purely cosmetic in our model, and we focus on cell and row updates exclusively.
+As we will discuss in \Cref{sec:system-presentation}, column operations are trivially supported in our model, and we focus on cell and row updates exclusively.
Concretely, a spreadsheet overlay $\overlay = \aol$ is a reference frame transformation $\rtrans$ and a set of pattern updates $\oup$, terms we now define.
% We now define these terms, and discuss their semantics.