10 point font... bla. Trimming down more.
16
main.tex
|
@ -1,4 +1,4 @@
|
|||
\documentclass[sigconf,review]{acmart}
|
||||
\documentclass[sigconf,review,10pt]{acmart}
|
||||
|
||||
\usepackage{cleveref}
|
||||
\usepackage{todonotes}
|
||||
|
@ -125,16 +125,16 @@
|
|||
Spreadsheets provide a convenient % , friendly
|
||||
direct manipulation interface to datasets.
|
||||
Efforts to scale spreadsheets % have taken two approaches: A
|
||||
either follow a `virtual' strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine using % around
|
||||
standard database optimizations. % like indexes.
|
||||
either follow a `virtual' strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine % using % around
|
||||
%standard database optimizations. % like indexes.
|
||||
Because database engines are not optimized for spreadsheet access patterns,
|
||||
% typically optimized for bulk query processing over interactive latencies,
|
||||
the materialized approach has better performance.
|
||||
However, the virtual approach offers several key advantages that cannot be easily replicated in the materialized approach, including % notably
|
||||
However, the virtual approach offers several advantages that cannot be easily replicated in the materialized approach, including % notably
|
||||
the ability to re-apply user interactions to an updated dataset. % version of the same dataset.
|
||||
We propose a hybrid of % the materialized and virtual
|
||||
these approaches, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
|
||||
We introduce the overlay update model, and outline strategies for efficiently accessing a spreadsheet defined in this way.
|
||||
We propose a hybrid % the materialized and virtual
|
||||
approach, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
|
||||
We introduce the overlay update model, and outline strategies for efficiently accessing an overlay spreadsheet.
|
||||
A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as ``patterns'' that can be leveraged to reduce execution costs.
|
||||
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread, it can significantly reduce execution costs. % popular
|
||||
% materialized spreadsheet.
|
||||
|
@ -156,7 +156,7 @@ the materialized approach has better performance.
|
|||
%%
|
||||
%% Keywords. The author(s) should pick words that accurately describe
|
||||
%% the work being presented. Separate the keywords with commas.
|
||||
\keywords{Spreadsheets, Dataframes, Scalable Data Management}
|
||||
% \keywords{Spreadsheets, Dataframes, Scalable Data Management}
|
||||
%% A "teaser" image appears between the author and affiliation
|
||||
%% information and the body of the document, and typically spans the
|
||||
%% page.
|
||||
|
|
|
@ -1,29 +1,29 @@
|
|||
%!TEX root=../main.tex
|
||||
|
||||
\begin{figure*}
|
||||
\begin{figure}
|
||||
\centering
|
||||
\subcaptionbox{Scale Data, View First}{
|
||||
\includegraphics[width=0.28\textwidth]{results/laptop-init-varysize.pdf}
|
||||
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varysize.pdf}
|
||||
}
|
||||
\subcaptionbox{Fix Data, Move View}{
|
||||
\includegraphics[width=0.28\textwidth]{results/laptop-init-varystart.pdf}
|
||||
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varystart.pdf}
|
||||
}
|
||||
\subcaptionbox{Scale Data, View Last}{
|
||||
\includegraphics[width=0.28\textwidth]{results/laptop-init-varystartandsize.pdf}
|
||||
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varystartandsize.pdf}
|
||||
}
|
||||
\subcaptionbox{Scale Data, View First}{
|
||||
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varysize.pdf}
|
||||
}
|
||||
\subcaptionbox{Fix Data, Move View}{
|
||||
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystart.pdf}
|
||||
}
|
||||
\subcaptionbox{Scale Data, View Last}{
|
||||
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystartandsize.pdf}
|
||||
\includegraphics[width=0.47\columnwidth]{results/laptop-update_one-varysize.pdf}
|
||||
}
|
||||
% \subcaptionbox{Fix Data, Move View}{
|
||||
% \includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystart.pdf}
|
||||
% }
|
||||
% \subcaptionbox{Scale Data, View Last}{
|
||||
% \includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystartandsize.pdf}
|
||||
% }
|
||||
\caption{System initialization costs (a-c) and cost to update one cell (d)}
|
||||
\label{fig:experiments}
|
||||
\trimfigurespacing
|
||||
\end{figure*}
|
||||
\end{figure}
|
||||
|
||||
\section{Experiments}
|
||||
\label{sec:experiments}
|
||||
|
@ -69,15 +69,15 @@ To emulate batch processing, we replace the formula for the $\texttt{sum\_change
|
|||
% \label{fig:perf-scale-visible}
|
||||
% \trimfigurespacing
|
||||
% \end{figure}
|
||||
\Cref{fig:experiments}(a,d) shows initialization and update costs, with a fixed dataset size of approximately 600,000 rows, and a variable viewport position.
|
||||
\Cref{fig:experiments}(a,c) shows initialization and update costs, with a fixed dataset size of approximately 600,000 rows, and a variable viewport position.
|
||||
Due to the running sum, the longest visible dependency chain grows as the visible region moves further into the dataset.
|
||||
Costs for Vizier and DataSpread grow significantly with the length of the dependency chain, while batch processing can compute the updated sum significantly faster.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\partitle{Scaling Data}
|
||||
\Cref{fig:experiments}(a,d) shows the initialization and update costs when the viewport is on the first cell. Vizier only needs to compute the visible cell formulas, and so is significantly faster.
|
||||
\Cref{fig:experiments}(c,f) show these costs when the viewport is on the last cell; as before, the costs for Vizier grow with the length of the longest visible dependency chain, supporting the value of batching.
|
||||
\Cref{fig:experiments}(b,d) shows the initialization and update costs when the viewport is on the first cell. Vizier only needs to compute the visible cell formulas, and so is significantly faster.
|
||||
% \Cref{fig:experiments}(c,f) show these costs when the viewport is on the last cell; as before, the costs for Vizier grow with the length of the longest visible dependency chain, supporting the value of batching.
|
||||
|
||||
|
||||
|
||||
|
|
|
@ -27,31 +27,28 @@ We refer to these as the virtual and materialized approach, respectively, and il
|
|||
The materialized approach is optimized for multiple data access patterns common to spreadsheets~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}, including
|
||||
(i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
|
||||
(ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
|
||||
(iii) Indexes that leverage patterns in the dependencies of adjacent cells to compress dependency graphs~\cite{tang-23-efcsfg}.
|
||||
(iii) Indexes storing compressed dependency graphs~\cite{DBLP:conf/sigmod/BendreWMCP19,tang-23-efcsfg}.
|
||||
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
|
||||
|
||||
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
|
||||
(i) Because it stores only the updates applied by the user (e.g., insert a row at position $x$, replace the value of cell $c$ with $v$, \ldots), the spreadsheet's full version history can be stored at negligible cost;
|
||||
(ii) As in Wrangler, the resulting data transformation process can be easily applied to other data (e.g., by scaling up from an interaction-friendly sample of the data to the entire dataset, or an updated version of the data); and
|
||||
(iii) As in Vizier, the user's interactions can be translated into a standardized query model (i.e., a Spark dataframe), allowing it to ``plug into'' existing scalable computation platforms (i.e., Spark) and standardized provenance analysis frameworks (e.g., \cite{kumari:2021:cidr:datasense}).
|
||||
(i) It is a naturally efficient encoding of the spreadsheet's full version history;
|
||||
(ii) As in Wrangler, the user's actions can be re-applied to new data (e.g., an updated version of the source data); and
|
||||
(iii) As in Vizier, the spreadsheet can be re-encoded as a relational query allowing it to ``plug into'' existing scalable computation platforms (i.e., Spark) and provenance analysis tools (e.g., \cite{kumari:2021:cidr:datasense}).
|
||||
|
||||
In this paper, we present an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
|
||||
In an Overlay Spreadsheet (\Cref{fig:overlay}), the user's edits are stored in a spreadsheet that is ``overlaid'' on top of source data.
|
||||
Users interact with an Overlay Spreadsheet just like an ordinary spreadsheet, inserting or removing rows or columns, overwriting data with formulas or literals, and reorganizing the data.
|
||||
However, references to the source dataset are virtualized, allowing users to replay their actions on updated datasets, translate spreadsheets to run on scalable computation platforms, and facilitate provenance analysis.
|
||||
We also demonstrate that this different virtual representation of edits enables more efficient exploitation of spreadsheet access patterns, including optimizing computation of cells visible to the user.
|
||||
We propose an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
|
||||
An Overlay Spreadsheet (\Cref{fig:overlay}) presents an interface analogous to a normal spreadsheet.
|
||||
User edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
|
||||
As an added benefit, decoupling edits and source data makes it easier to leverage spreadsheet access patterns, reducing the time needed to respond to user actions.
|
||||
|
||||
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal, reproducibility-oriented, notebook-style workflow system built on Apache Spark.
|
||||
Users of Vizier define sequences of data transformation steps that may include scripts, templated widgets, or other operations.
|
||||
Existing versions of Vizier provide a spreadsheet-style interface, where each user interaction builds out the data transformation workflow.
|
||||
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark.
|
||||
Existing versions of Vizier allow users to define workflow steps through a spreadsheet-style interface; each action adds a new workflow step.
|
||||
In spite of the performance limitations of the virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features are implemented primarily over Spark dataframes.
|
||||
%
|
||||
Our objective in this paper is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
|
||||
|
||||
As a secondary goal, we further explore the additional benefits of the overlay approach.
|
||||
Specifically, we observe that because spreadsheet updates are typically made manually, the number of updates is limited by the speed of a human interacting with the system.
|
||||
Although a single update may be applied to multiple cells (e.g., by copy/pasting a formula over a range of cells), the number of such updates is likely to be small.
|
||||
In this paper, we take the first steps towards hybridizing the cell-at-a-time execution strategies of classical spreadsheets, with bulk computation strategies found in relational databases.
|
||||
As a secondary goal, we explore the potential for performance improvement resulting from the overlay approach.
|
||||
Specifically, we observe that bulk updates in a spreadsheet (e.g., pasting a formula across a range of cells) rely on expression ``patterns,''
|
||||
which admit more efficient dependency analysis and bulk computation when intermediate values are not required.
|
||||
This hybrid strategy is akin to optimizations applied in DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operating over patterns of updates rather than patterns in the dependency graph.
|
||||
|
||||
% March 26 by OK: Trimming the ToC summary for space
|
||||
|
|
|
@ -62,19 +62,19 @@
|
|||
\subsection{Spreadsheets}
|
||||
\label{sec:spreadsheets}
|
||||
|
||||
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels; unless otherwise noted, we assume $\rowDomain \subset \mathbb Z$.
|
||||
Let $\valueDomain \subset \exprDomain$ denote domains of values and expressions, respectively; we define $\exprDomain$ in greater detail below.
|
||||
We define a \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ as a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
|
||||
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels. Except where noted, $\rowDomain \subset \mathbb Z$.
|
||||
Let $\valueDomain \subset \exprDomain$ denote domains of values and expressions, respectively.
|
||||
A \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ is a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
|
||||
We use $\valat{\spreadsheet}{\column}{\row}$ to denote $\spreadsheet(\cellRef{\column}{\row})$.
|
||||
Let $\errorval \in \valueDomain$ indicate ``undefined'' and define the \emph{domain} $\dom(\spreadsheet)$ to be the set of cells $\cellRef{\column}{\row}$ where $\valat{\spreadsheet}{\column}{\row} \neq \errorval$.
|
||||
|
||||
An expression $\expr \in \exprDomain$ is a formula defined over literals from $\valueDomain$, the standard arithmetic operators, and references to other cells in the spreadsheet ($\cellRef{\column}{\row}$).
|
||||
The expression $\expr$ may be evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
|
||||
(i) Literals evaluate to themselves, (ii) Arithmetic formulas are evaluated in the usual way, and (iii) References to the spreadsheet are evaluated recursively
|
||||
The expression $\expr$ is evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
|
||||
(i) Literals and arithmetic are evaluated in the usual way, and (ii) References to the spreadsheet are evaluated recursively
|
||||
($\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$).
|
||||
By convention, cyclic references evaluate to the distinguished error value $\errorval$ in $\valueDomain$.
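
The evaluation rules above can be illustrated with a minimal Python sketch. The encoding is an assumption made only for illustration and is not taken from the paper: a spreadsheet is a dictionary from `(column, row)` pairs to expressions, an expression is a number, a reference `("ref", column, row)`, or a binary node `(op, left, right)`, and `None` stands in for the error value. Only `+`, `-`, and `*` are covered.

```python
import operator

ERROR = None  # stand-in for the distinguished error value (assumption)
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(sheet, expr, visiting=frozenset()):
    """Evaluate expr in the context of sheet.

    visiting holds the cells on the current reference chain, so a cell
    that (transitively) references itself evaluates to ERROR.
    """
    if isinstance(expr, (int, float)):
        return expr  # literals evaluate to themselves
    if expr[0] == "ref":
        cell = (expr[1], expr[2])
        if cell in visiting or cell not in sheet:
            return ERROR  # cyclic or undefined reference
        return evaluate(sheet, sheet[cell], visiting | {cell})
    op, left, right = expr
    lhs = evaluate(sheet, left, visiting)
    rhs = evaluate(sheet, right, visiting)
    if lhs is ERROR or rhs is ERROR:
        return ERROR  # errors propagate through arithmetic
    return OPS[op](lhs, rhs)


# Cells A1 = 15, B1 = 50, C1 = A1 + B1
sheet = {
    ("A", 1): 15,
    ("B", 1): 50,
    ("C", 1): ("+", ("ref", "A", 1), ("ref", "B", 1)),
}
print(evaluate(sheet, ("ref", "C", 1)))  # 65

# A two-cell cycle evaluates to the error value
cyclic = {("A", 1): ("ref", "B", 1), ("B", 1): ("ref", "A", 1)}
print(evaluate(cyclic, ("ref", "A", 1)))  # None
```

Tracking the reference chain per evaluation path, rather than globally, lets two cells share a dependency without being mistaken for a cycle.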
|
||||
|
||||
We define the dependencies of an expression ($\depsOf{\expr}$) as the cells referenced by $\expr$.
|
||||
An expression's dependencies ($\depsOf{\expr}$) are the cells referenced by $\expr$.
|
||||
Dependencies induce a graph $\DG{\spreadsheet}\tuple{N, E}$ over the spreadsheet, with cells as nodes (i.e., $N = \columnDomain \times \rowDomain$), and dependencies as directed edges:
|
||||
$$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
|
||||
\{\;\cell \rightarrow \cellPrime\;|\;\cellPrime \in \depsOf{\valat{\spreadsheet}{\column}{\row}}\;\} $$
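
As a rough illustration of the edge set $E$, the following Python sketch collects the dependencies of each cell's expression. The expression encoding (numbers, `("ref", column, row)` references, and `(op, left, right)` nodes) and the names `deps_of` and `dependency_edges` are assumptions made for this example. Only defined cells are visited, since an undefined cell has no references and contributes no edges.

```python
def deps_of(expr):
    """Return the set of cells referenced by expr."""
    if isinstance(expr, (int, float)):
        return set()  # literals reference nothing
    if expr[0] == "ref":
        return {(expr[1], expr[2])}
    _, left, right = expr
    return deps_of(left) | deps_of(right)


def dependency_edges(sheet):
    """Directed edges (cell, dependency) induced by the sheet's expressions."""
    return {
        (cell, dep)
        for cell, expr in sheet.items()
        for dep in deps_of(expr)
    }


sheet = {
    ("A", 1): 15,
    ("B", 1): 50,
    ("C", 1): ("+", ("ref", "A", 1), ("ref", "B", 1)),
}
edges = dependency_edges(sheet)
# edges contains C1 -> A1 and C1 -> B1, and nothing else
```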
|
||||
|
@ -84,7 +84,7 @@ Note that if all cell expressions are constants (i.e., a spreadsheet without for
|
|||
|
||||
\begin{example}
|
||||
Consider the spreadsheet at the top of \Cref{fig:example-spreadsheet-and-a}.
|
||||
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds arithmetic expressions referencing cells from columns \emph{A} and \emph{B}.
|
||||
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds expressions referencing cells from columns \emph{A} and \emph{B}.
|
||||
Evaluating this spreadsheet assigns each cell a concrete value, as in the top right.
|
||||
For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{A}{1} + \cellRef{B}{1}} = \evalOf{\spreadsheet}{\cellRef{A}{1}} + \evalOf{\spreadsheet}{\cellRef{B}{1}} = 15 + 50 = 65$.
|
||||
\end{example}
|
||||
|
@ -170,7 +170,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
|
|||
\subsection{Cell Updates}
|
||||
\label{sec:updates}
|
||||
|
||||
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ to a spreadsheet is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
|
||||
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
|
||||
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : (\acu \in \upd)$).
|
||||
Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spreadsheet:
|
||||
\[
|
||||
|
@ -182,17 +182,17 @@ Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spr
|
|||
\]
|
||||
|
||||
An update may affect cells beyond its domain.
|
||||
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes the constant expression in cell \emph{\cellRef{A}{1}} and the arithmetic expression in cell \emph{\cellRef{C}{3}}.
|
||||
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes the expressions in cells \emph{\cellRef{A}{1}} and \emph{\cellRef{C}{3}}.
|
||||
Evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
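
The effect of an update reaching beyond its domain can be sketched in Python as follows. This is a hedged illustration: it assumes that applying an update overrides exactly the cells in the update's domain and leaves every other cell unchanged, it represents an update as a dictionary from cells to expressions, and the cell values are hypothetical rather than those of the figure.

```python
import operator

ERROR = None  # stand-in for the error value (assumption)
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(sheet, expr, visiting=frozenset()):
    """Evaluate expr against sheet; cycles and undefined cells yield ERROR."""
    if isinstance(expr, (int, float)):
        return expr
    if expr[0] == "ref":
        cell = (expr[1], expr[2])
        if cell in visiting or cell not in sheet:
            return ERROR
        return evaluate(sheet, sheet[cell], visiting | {cell})
    op, left, right = expr
    lhs = evaluate(sheet, left, visiting)
    rhs = evaluate(sheet, right, visiting)
    if lhs is ERROR or rhs is ERROR:
        return ERROR
    return OPS[op](lhs, rhs)


def apply_update(sheet, update):
    """Return a new sheet in which the update's cells override the old ones."""
    updated = dict(sheet)
    updated.update(update)
    return updated


def changed_cells(before, after):
    """Cells whose evaluated value differs between the two sheets."""
    cells = set(before) | set(after)
    return {
        c for c in cells
        if evaluate(before, ("ref", *c)) != evaluate(after, ("ref", *c))
    }


# Hypothetical data: column C sums columns A and B in each row.
sheet = {("A", r): a for r, a in [(1, 15), (2, 1), (3, 3)]}
sheet.update({("B", r): b for r, b in [(1, 50), (2, 2), (3, 4)]})
sheet.update({("C", r): ("+", ("ref", "A", r), ("ref", "B", r)) for r in (1, 2, 3)})

# The update touches only A1 and C3 ...
update = {("A", 1): 20, ("C", 3): ("*", ("ref", "A", 3), ("ref", "B", 3))}
changed = changed_cells(sheet, apply_update(sheet, update))
# ... but three cells change value: A1, C3, and C1 (which depends on A1).
```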
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Spreadsheet Access to Datasets}
|
||||
\label{sec:spre-access-datas}
|
||||
|
||||
To uniformly model spreadsheet access to relational data as well as to data already represented as spreadsheets, we assume an input dataset $\ds$ with designated column and row labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
|
||||
For example, in a relational table, these can be the table's columns and values of a key or rowid attribute, respectively.
|
||||
For spreadsheet or CSV data, $\rowDomain_{\ds} \subset \mathbb Z$ can be the position of the row.
|
||||
We use $\valat{\ds}{\row}{\column}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
|
||||
To uniformly model source datasets, whether from relational databases or other spreadsheets, we assume an input dataset $\ds$ with designated column and row labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
|
||||
In a relational table, these are the table's columns and values of a key attribute, respectively.
|
||||
For CSV data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row.
|
||||
$\valat{\ds}{\row}{\column}$ denotes the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
|
||||
|
||||
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map that maps rows in $\ds$ into the spreadsheet.
|
||||
A \emph{spreadsheet overlay} for a dataset $\ds$ is then a pair $(\ds, \rframe)$ that defines a spreadsheet $\spreadsheet_{\ds, \rframe}$ with domains $\columnDomain = \columnDomain_{\ds}$, $\rowDomain = \dom(\rframe)$ as
|
||||
|
|