intro
parent
9f701be095
commit
a28de18b33
22
main.tex
22
main.tex
|
@ -121,15 +121,23 @@
|
|||
%% The abstract is a short summary of the work to be presented in the
|
||||
%% article.
|
||||
\begin{abstract}
|
||||
Spreadsheets provide a convenient, friendly direct manipulation interface to datasets.
|
||||
Efforts to scale spreadsheets have taken two approaches: a `virtual' strategy that imposes a spreadsheet-like interface over an existing database engine, and a `materialized' strategy based on re-engineering the spreadsheet engine around standard database optimizations like indexes.
|
||||
Because database engines are typically optimized for bulk query processing over interactive latencies, the materialized approach has better performance.
|
||||
However, the virtual approach offers several key advantages that cannot be easily replicated in the materialized approach, including notably the ability to re-apply user interactions to an updated version of the same dataset.
|
||||
We propose a hybrid of the materialized and virtual approaches, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
|
||||
Spreadsheets provide a convenient % , friendly
|
||||
direct manipulation interface to datasets.
|
||||
Efforts to scale spreadsheets % have taken two approaches: A
|
||||
either follow a `virtual' strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine using % around
|
||||
standard database optimizations. % like indexes.
|
||||
Because database engines are not optimized for spreadsheet access patterns,
|
||||
% typically optimized for bulk query processing over interactive latencies,
|
||||
the materialized approach has better performance.
|
||||
However, the virtual approach offers several key advantages that cannot be easily replicated in the materialized approach, including % notably
|
||||
the ability to re-apply user interactions to an updated dataset. % version of the same dataset.
|
||||
We propose a hybrid of % the materialized and virtual
|
||||
these approaches, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
|
||||
We introduce the overlay update model, and outline strategies for efficiently accessing a spreadsheet defined in this way.
|
||||
A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as ``patterns'' that can be leveraged to reduce execution costs.
|
||||
We implement an overlay spreadsheet over Apache Spark and compare it against DataSpread, a popular materialized spreadsheet.
|
||||
Our preliminary results show that overlay spreadsheets can significantly reduce execution costs.
|
||||
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread, it can significantly reduce execution costs. % popular
|
||||
% materialized spreadsheet.
|
||||
% Our preliminary results show that overlay spreadsheets can significantly reduce execution costs.
|
||||
\end{abstract}
|
||||
|
||||
%%
|
||||
|
|
|
@ -2,10 +2,14 @@
|
|||
\section{Introduction}
|
||||
\label{sec:introduction}
|
||||
|
||||
Spreadsheets are popular tools for data exploration, transformation, and visualization, but have historically had challenges managing ``big data'': as few as fifty thousand rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
|
||||
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system like Apache Spark.
|
||||
Spreadsheets are popular tools for data exploration, transformation, and visualization, but have historically had challenges managing ``big data'': as few as 50k % fifty thousand
|
||||
rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
|
||||
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system. % like Apache Spark.
|
||||
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
|
||||
A more recent approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} instead re-architects the entire spreadsheet runtime around database primitives like indexes and incremental maintenance specialized for spreadsheet access patterns.
|
||||
The approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} instead re-architects the % entire
|
||||
spreadsheet runtime and specializes % around
|
||||
database primitives like indexes and incremental maintenance % specialized
|
||||
for spreadsheet access patterns.
|
||||
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -24,7 +28,7 @@ The materialized approach is optimized for multiple data access patterns common
|
|||
(i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
|
||||
(ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
|
||||
(iii) Indexes that leverage patterns in the dependencies of adjacent cells to compress dependency graphs~\cite{tang-23-efcsfg}.
|
||||
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
|
||||
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
|
||||
|
||||
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
|
||||
(i) Because it stores only the updates applied by the user (e.g., insert a row at position $x$, replace the value of cell $c$ with $v$, \ldots), the spreadsheet's full version history can be stored at negligible cost;
|
||||
|
@ -34,12 +38,12 @@ Although the virtual approach is often less efficient, it does provide capabilit
|
|||
In this paper, we present an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
|
||||
In an Overlay Spreadsheet (\Cref{fig:overlay}), the user's edits are stored in a spreadsheet that is ``overlaid'' on top of source data.
|
||||
Users interact with an Overlay Spreadsheet just like an ordinary spreadsheet, inserting or removing rows or columns, overwriting data with formulas or literals, and reorganizing the data.
|
||||
However, references to the source dataset are virtualized, allowing users to replay their actions on an updated dataset, translate spreadsheets to run on scalable computation platforms, and facilitate provenance analysis.
|
||||
However, references to the source dataset are virtualized, allowing users to replay their actions on an updated dataset, translate spreadsheets to run on scalable computation platforms, and facilitate provenance analysis.
|
||||
We also demonstrate that this different virtual representation of edits enables more efficient exploitation of spreadsheet access patterns, including optimizing computation of cells visible to the user.
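
The overlay idea described above can be illustrated with a minimal sketch. This is not the Vizier implementation: the names \texttt{OverlaySheet}, \texttt{RangePattern}, \texttt{get}, and \texttt{replay\_on} are hypothetical, the source data is a plain in-memory list rather than a Spark dataframe, and the rule that single-cell edits take precedence over patterns is a simplifying assumption. It only shows how edits kept apart from the source data can be re-applied when the source changes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column), both zero-based


@dataclass
class RangePattern:
    """One bulk edit (e.g., a pasted formula) stored once for a run of rows."""
    top: int
    bottom: int  # inclusive
    col: int
    compute: Callable[["OverlaySheet", int], Any]  # (sheet, row) -> value

    def covers(self, cell: Cell) -> bool:
        row, col = cell
        return col == self.col and self.top <= row <= self.bottom


@dataclass
class OverlaySheet:
    """User edits overlaid on untouched source data (hypothetical sketch)."""
    source: List[List[Any]]
    cell_edits: Dict[Cell, Any] = field(default_factory=dict)
    patterns: List[RangePattern] = field(default_factory=list)

    def get(self, row: int, col: int) -> Any:
        cell = (row, col)
        # Assumption of this sketch: single-cell edits win over patterns.
        if cell in self.cell_edits:
            return self.cell_edits[cell]
        # Among patterns, the most recently added one wins.
        for pattern in reversed(self.patterns):
            if pattern.covers(cell):
                return pattern.compute(self, row)
        # No edit applies: fall through to the source dataset.
        return self.source[row][col]

    def replay_on(self, new_source: List[List[Any]]) -> "OverlaySheet":
        # The edits never touched the source, so replay is a re-binding.
        return OverlaySheet(new_source, dict(self.cell_edits), list(self.patterns))


sheet = OverlaySheet(source=[[1, 10], [2, 20], [3, 30]])
# One pattern stands in for pasting "=A+B" into column 2 of rows 0..2.
sheet.patterns.append(RangePattern(0, 2, 2, lambda s, r: s.get(r, 0) + s.get(r, 1)))
sheet.cell_edits[(1, 1)] = 99  # overwrite a single source cell

updated = sheet.replay_on([[5, 50], [6, 60], [7, 70]])
```

In this sketch the three pasted formula cells cost one stored pattern rather than three cell entries, and \texttt{replay\_on} re-binds the same edits to a new source without copying any data.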
|
||||
|
||||
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal, reproducibility-oriented, notebook-style workflow system built on Apache Spark.
|
||||
Users of Vizier define sequences of data transformation steps that may include scripts, templated widgets, or other operations.
|
||||
Existing versions of Vizier provide a spreadsheet-style interface, where each user interaction builds out the data transformation workflow.
|
||||
Existing versions of Vizier provide a spreadsheet-style interface, where each user interaction builds out the data transformation workflow.
|
||||
In spite of the performance limitations of the virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features are implemented primarily over Spark dataframes.
|
||||
%
|
||||
Our objective in this paper is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
|
||||
|
|