main
Boris Glavic 2023-03-30 11:36:21 +03:00
parent 9f701be095
commit a28de18b33
2 changed files with 25 additions and 13 deletions

@ -121,15 +121,23 @@
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
Spreadsheets provide a convenient, friendly direct manipulation interface to datasets.
Efforts to scale spreadsheets have taken two approaches: a `virtual' strategy that imposes a spreadsheet-like interface over an existing database engine, and a `materialized' strategy based on re-engineering the spreadsheet engine around standard database optimizations like indexes.
Because database engines are typically optimized for bulk query processing over interactive latencies, the materialized approach has better performance.
However, the virtual approach offers several key advantages that cannot be easily replicated in the materialized approach, including notably the ability to re-apply user interactions to an updated version of the same dataset.
We propose a hybrid of the materialized and virtual approaches, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
Spreadsheets provide a convenient % , friendly
direct manipulation interface to datasets.
Efforts to scale spreadsheets % have taken two approaches: A
either follow a `virtual' strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine using % around
standard database optimizations. % like indexes.
Because database engines are not optimized for spreadsheet access patterns,
% typically optimized for bulk query processing over interactive latencies,
the materialized approach has better performance.
However, the virtual approach offers several key advantages that cannot be easily replicated in the materialized approach, including % notably
the ability to re-apply user interactions to an updated dataset. % version of the same dataset.
We propose a hybrid of % the materialized and virtual
these approaches, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
We introduce the overlay update model, and outline strategies for efficiently accessing a spreadsheet defined in this way.
A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as ``patterns'' that can be leveraged to reduce execution costs.
We implement an overlay spreadsheet over Apache Spark and compare it against DataSpread, a popular materialized spreadsheet.
Our preliminary results show that overlay spreadsheets can significantly reduce execution costs.
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread, it can significantly reduce execution costs. % popular
% materialized spreadsheet.
% Our preliminary results show that overlay spreadsheets can significantly reduce execution costs.
\end{abstract}
%%

@ -2,10 +2,14 @@
\section{Introduction}
\label{sec:introduction}
Spreadsheets are popular tools for data exploration, transformation, and visualization, but have historically had challenges managing ``big data'' --- with as few as fifty thousand rows of data creating problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system like Apache Spark.
Spreadsheets are popular tools for data exploration, transformation, and visualization, but have historically had challenges managing ``big data'' --- as few as 50k % fifty thousand
rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system. % like Apache Spark.
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
A more recent approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} instead re-architects the entire spreadsheet runtime around database primitives like indexes and incremental maintenance specialized for spreadsheet access patterns.
The approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} instead re-architects the % entire
spreadsheet runtime and specializes % around
database primitives like indexes and incremental maintenance % specialized
for spreadsheet access patterns.
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
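The virtual approach described above (a chain of versions linked by lightweight transformation functions) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the implementation used by Wrangler or Vizier; the names `update_cell`, `insert_row`, and `replay` are hypothetical, and a real system would compile such steps into a dataflow rather than run them over in-memory lists.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]
# Each version of the sheet is the previous version passed through one transform.
Transform = Callable[[Table], Table]


def update_cell(row_idx: int, column: str, value: Any) -> Transform:
    """Hypothetical edit: replace the value of one cell."""
    def apply(table: Table) -> Table:
        return [
            {**row, column: value} if i == row_idx else row
            for i, row in enumerate(table)
        ]
    return apply


def insert_row(position: int, row: Row) -> Transform:
    """Hypothetical edit: insert a row at the given position."""
    def apply(table: Table) -> Table:
        return table[:position] + [row] + table[position:]
    return apply


def replay(source: Table, history: List[Transform]) -> Table:
    """Materialize the latest version by applying every edit in order."""
    table = source
    for step in history:
        table = step(table)
    return table


# Only the edits are stored, never the edited data.
history = [
    update_cell(0, "price", 12),
    insert_row(1, {"item": "pear", "price": 3}),
]

original = [{"item": "apple", "price": 10}, {"item": "fig", "price": 7}]
refreshed = original_plus = [
    {"item": "apple", "price": 11},
    {"item": "fig", "price": 8},
    {"item": "kiwi", "price": 2},
]

v1 = replay(original, history)
v2 = replay(refreshed, history)  # same edits, re-applied to an updated dataset
```

Because the history holds only the edits, the same list can be replayed against a refreshed copy of the source. The cost is that nothing is materialized until `replay` runs, which is the latency problem the materialized approach avoids.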
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -24,7 +28,7 @@ The materialized approach is optimized for multiple data access patterns common
(i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
(ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
(iii) Indexes that leverage patterns in the dependencies of adjacent cells to compress dependency graphs~\cite{tang-23-efcsfg}.
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
(i) Because it stores only the updates applied by the user (e.g., insert a row at position $x$, replace the value of cell $c$ with $v$, \ldots), the spreadsheet's full version history can be stored at negligible cost;
@ -34,12 +38,12 @@ Although the virtual approach is often less efficient, it does provide capabilit
In this paper, we present an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
In an Overlay Spreadsheet (\Cref{fig:overlay}), the user's edits are stored in a spreadsheet that is ``overlaid'' on top of source data.
Users interact with an Overlay Spreadsheet just like an ordinary spreadsheet, inserting or removing rows or columns, overwriting data with formulas or literals, and reorganizing the data.
However, references to the source dataset are virtualized, allowing users to replay their actions on updated datasets, translate spreadsheets to run on scalable computation platforms, and facilitate provenance analysis.
However, references to the source dataset are virtualized, allowing users to replay their actions on updated datasets, translate spreadsheets to run on scalable computation platforms, and facilitate provenance analysis.
We also demonstrate that this different virtual representation of edits enables more efficient exploitation of spreadsheet access patterns, including optimizing computation of cells visible to the user.
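The overlay idea can be illustrated with a small sketch, under several simplifying assumptions. The class and field names (`OverlaySheet`, `PatternUpdate`, `covers`) are hypothetical and do not come from the paper's implementation. Each bulk edit is stored once as a pattern over a row range instead of once per cell, and lookups scan the patterns linearly, whereas the paper describes indexing them.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class PatternUpdate:
    """One bulk edit: every cell of `column` in rows first_row..last_row
    (inclusive) holds the same formula, evaluated relative to its own row."""
    column: int
    first_row: int
    last_row: int
    formula: Callable[["OverlaySheet", int], float]

    def covers(self, row: int, column: int) -> bool:
        return column == self.column and self.first_row <= row <= self.last_row


class OverlaySheet:
    """User edits overlaid on read-only source data (illustrative only)."""

    def __init__(self, source: Sequence[Sequence[float]]):
        self.source = source
        self.updates: List[PatternUpdate] = []

    def apply(self, update: PatternUpdate) -> None:
        self.updates.append(update)

    def get(self, row: int, column: int) -> float:
        # The most recent covering edit wins; otherwise fall through to the source.
        for update in reversed(self.updates):
            if update.covers(row, column):
                return update.formula(self, row)
        return self.source[row][column]


edits = [
    # Formula pasted down a new column C: C[r] = A[r] * B[r].
    PatternUpdate(2, 0, 2, lambda sheet, r: sheet.get(r, 0) * sheet.get(r, 1)),
    # Single-cell overwrite of A[1] with a literal.
    PatternUpdate(0, 1, 1, lambda sheet, r: 10.0),
]

sheet = OverlaySheet([[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
for edit in edits:
    sheet.apply(edit)

# Only the rows currently visible to the user are evaluated.
visible = [sheet.get(r, 2) for r in range(0, 2)]

# The same edits replayed over an updated source dataset.
replayed = OverlaySheet([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
for edit in edits:
    replayed.apply(edit)
```

In this sketch, evaluating `visible` touches only the two rows in the viewport, and `replayed` reuses the stored edits without copying any source data. This is meant to mirror, in miniature, the replay and visible-cell properties described above.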
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal, reproducibility-oriented, notebook-style workflow system built on Apache Spark.
Users of Vizier define sequences of data transformation steps that may include scripts, templated widgets, or other operations.
Existing versions of Vizier provide a spreadsheet-style interface, where each user interaction builds out the data transformation workflow.
Existing versions of Vizier provide a spreadsheet-style interface, where each user interaction builds out the data transformation workflow.
In spite of the performance limitations of the virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features are implemented primarily over Spark dataframes.
%
Our objective in this paper is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).