%!TEX root=../main.tex
\section{Introduction}
\label{sec:introduction}
Spreadsheets are a popular tool for data exploration, transformation, and visualization, but have historically struggled to manage ``big data'': as few as fifty thousand rows of data can create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others, relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system like Apache Spark.
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
A more recent approach, employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the entire spreadsheet runtime around database primitives like indexes and incremental maintenance specialized for spreadsheet access patterns.
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\includegraphics[width=0.9\columnwidth]{graphics/forms-of-spreadsheet.pdf}
\Description{
Three architectures: First, a materialized approach depicted by a 4 by 3 grid of cells with arrows indicating dependencies. The cells of the 3rd column point to cells in the same row in the 1st and 2nd columns. The cells of the 4th column are marked red and point to the cell in the 3rd column of the same row, and the cell of the 4th column in the preceding row. Second, a virtual approach depicted by three boxes: The first box contains a 2 by 3 grid of cells. An arrow labeled with a pi (projection) connects it to the second box, which contains a 3 by 3 grid of cells. An arrow labeled with a sigma (aggregation) connects it to the third box, which contains a 4 by 3 grid of cells. The sigma, its arrow, and the 4th column of cells are marked in red. Finally, an overlay approach depicted by 2 boxes. The 2nd box is nearly identical to the materialized approach, but the 1st and 2nd columns of the grid are greyed out with arrows pointing to the first box; The first box is the same 2 by 3 grid from the virtual approach.
}
\caption{Approaches to scalable spreadsheet design}
\label{fig:overlay}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The materialized approach is optimized for multiple data access patterns common to spreadsheets~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}, including
(i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
(ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
(iii) Indexes that leverage patterns in the dependencies of adjacent cells to compress dependency graphs~\cite{tang-23-efcsfg}.
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
(i) Because it stores only the updates applied by the user (e.g., insert a row at position $x$, replace the value of cell $c$ with $v$, \ldots), the spreadsheet's full version history can be stored at negligible cost;
(ii) As in Wrangler, the resulting data transformation process can be easily applied to other data (e.g., by scaling up from an interaction-friendly sample of the data to the entire dataset, or an updated version of the data); and
(iii) As in Vizier, the user's interactions can be translated into a standardized query model (i.e., a Spark dataframe), allowing it to ``plug into'' existing scalable computation platforms (i.e., Spark) and standardized provenance analysis frameworks (e.g., \cite{kumari:2021:cidr:datasense}).
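Points (i) and (ii) can be read informally as follows; the notation ($S_i$, $u_i$, $n$) is introduced here purely for illustration and is not taken from the cited systems.
If $S_0$ is the source data and $u_1, \ldots, u_n$ are the recorded updates, then version $i$ of the spreadsheet is
\[
  S_i = u_i(S_{i-1}), \qquad \text{so that} \qquad S_n = (u_n \circ \cdots \circ u_1)(S_0).
\]
Storing the history then amounts to storing the $n$ update descriptions rather than $n$ copies of the data, and replaying the same updates on a different input $S_0'$ (e.g., the full dataset instead of a sample) yields $(u_n \circ \cdots \circ u_1)(S_0')$.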
In this paper, we present an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
In an Overlay Spreadsheet (\Cref{fig:overlay}), the user's edits are stored in a spreadsheet that is ``overlaid'' on top of source data.
Users interact with an Overlay Spreadsheet just like an ordinary spreadsheet, inserting or removing rows or columns, overwriting data with formulas or literals, and reorganizing the data.
However, references to the source dataset are virtualized, allowing users to replay their actions on updated datasets, translate spreadsheets to run on scalable computation platforms, and perform provenance analysis.
We also demonstrate that this different virtual representation of edits enables more efficient exploitation of spreadsheet access patterns, including optimizing computation of cells visible to the user.
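As an informal sketch of the overlay idea (the symbols $D$, $O$, $\mathit{val}$, and $\mathit{eval}$ are illustrative assumptions, not necessarily the formal model used in the implementation), let $D$ map each cell position to its value in the source dataset, and let $O$ be a partial map from cell positions to the formulas or literals written by the user.
Ignoring row and column insertions, deletions, and moves, which additionally require translating overlay positions into source positions, the value of a cell $c$ is
\[
  \mathit{val}(c) =
  \begin{cases}
    \mathit{eval}(O(c)) & \text{if } c \in \mathrm{dom}(O), \\
    D(c) & \text{otherwise,}
  \end{cases}
\]
where $\mathit{eval}$ returns a literal unchanged and evaluates a formula by resolving its cell references through $\mathit{val}$.
Only $O$ is stored with the spreadsheet; $D$ remains a reference to the source, which is what allows the same edits to be replayed when the source data changes.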
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal, reproducibility-oriented, notebook-style workflow system built on Apache Spark.
Users of Vizier define sequences of data transformation steps that may include scripts, templated widgets, or other operations.
Existing versions of Vizier provide a spreadsheet-style interface, where each user interaction builds out the data transformation workflow.
In spite of the performance limitations of the virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features are implemented primarily over Spark dataframes.
%
Our objective in this paper is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
As a secondary goal, we further explore the additional benefits of the overlay approach.
Specifically, we observe that because spreadsheet updates are typically made manually, the number of updates is limited by the speed of a human interacting with the system.
Although a single update may be applied to multiple cells (e.g., by copy/pasting a formula over a range of cells), the number of such updates is likely to be small.
In this paper, we take the first steps towards hybridizing the cell-at-a-time execution strategies of classical spreadsheets with the bulk computation strategies found in relational databases.
This hybrid strategy is akin to optimizations applied in \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operates over patterns of updates rather than patterns in the dependency graph.
% March 26 by OK: Trimming the ToC summary for space
%
% In this paper, we introduce Overlay Spreadsheets, and present the details of our prototype implementation.
% We implement the concept in the Vizier notebook~\cite{kennedy:2022:ieee-deb:right,brachmann:2020:cidr:your,brachmann:2019:sigmod:data}, a workflow-style notebook built over Apache Spark.
% We explore the challenges of integrating overlay spreadsheets with Apache Spark dataframes, and discuss preliminary work in translating an overlay spreadsheet to derive a dataframe.
% \BG{Experimal result take-aways}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: