%!TEX root=../main.tex
\section{Introduction}
\label{sec:introduction}
Tools like \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} adopt direct manipulation interfaces, similar to spreadsheets, as a way to streamline the definition of data preparation workflows.
While convenient for curation, these interfaces lack the free-form data manipulation capabilities that make spreadsheets ideal for data exploration and visualization.
Fundamental to spreadsheet interfaces in workflow systems is the need to support \emph{replay}.
When the source data or workflow changes, it should be possible to re-run the (updated) workflow on the updated data.
This is enabled in systems like Wrangler and Vizier, where the fundamental data model is a workflow of repeatable transformations (`Workflow' in \Cref{fig:overlay}).
By contrast, a spreadsheet is a grid of interdependent cells where the original data and user-applied edits are indistinguishable (`Spreadsheet' in \Cref{fig:overlay}).
% \BG{I think the significance or dependency versus workflow order is not clear here. We can remove this argument to save space or have to extend it}
% \OK{Rephrased the above to focus on data representation and the distinguishability of source data and edits.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\includegraphics[width=0.9\columnwidth]{graphics/forms-of-spreadsheet.pdf}
\Description{
Three architectures: First, a materialized approach depicted by a 4 by 3 grid of cells with arrows indicating dependencies. The cells of the 3rd column point to cells in the same row in the 1st and 2nd columns. The cells of the 4th column are marked red and point to the cell in the 3rd column of the same row, and the cell of the 4th column in the preceding row. Second, a virtual approach depicted by three boxes: The first box contains a 2 by 3 grid of cells. An arrow labeled with a pi (projection) connects it to the second box, which contains a 3 by 3 grid of cells. An arrow labeled with a sigma (aggregation) connects it to the third box, which contains a 4 by 3 grid of cells. The sigma, its arrow, and the 4th column of cells are marked in red. Finally, an overlay approach depicted by 2 boxes. The 2nd box is nearly identical to the materialized approach, but the 1st and 2nd columns of the grid are greyed out with arrows pointing to the first box; The first box is the same 2 by 3 grid from the virtual approach.
}
\caption{Approaches to scalable spreadsheet design}
\label{fig:overlay}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this paper, we propose a model of spreadsheets that acts like a classical spreadsheet, but where the user's edits and the source data are decoupled.
The result is a spreadsheet that can be \emph{`overlaid'} on top of any dataset (`Overlay' in \Cref{fig:overlay}), no matter whether the source data is a raw datafile or the result of a workflow (e.g., in Vizier).
Overlay Spreadsheets provide the flexibility of spreadsheets, while also supporting the replay capabilities needed for workflows.
%The resulting interface can support data scientists throughout the entire data lifecycle, from exploration through data curation pipeline development and analysis.
As we discuss in this paper, this new \emph{overlay} approach to spreadsheets also enables a new approach to scaling spreadsheets to larger data.
Classical spreadsheets have historically struggled with ``big data'': as few as 100k rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
\emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} re-architects the spreadsheet runtime and specializes database primitives like indexes and incremental maintenance for spreadsheet access patterns.
In spite of these changes, DataSpread still faces a key challenge: like classical spreadsheets, its unit of computation is the cell.
Although the overhead of starting a computation (e.g., locking and state initialization) is typically low, it is repeated for every cell that needs to be computed.
% \BG{Say why cell unit computation is a challenge? Expand or remove}
% \OK{Added: per-cell overheads.}
\paragraph{Scaling Spreadsheets to Big Data}
There has been considerable effort by the database community to `scale up' spreadsheets to big data~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}.
% These efforts include:
% (i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
% (ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
% (iii) Indexes storing compressed dependency graphs~\cite{DBLP:conf/sigmod/BendreWMCP19,tang-23-efcsfg}.
%
Overlays create an opportunity for further scalability based on the following two observations:
(i) Most of the `big' data appears in the source dataset.
(ii) The user applies a small number of edits (that may affect a large number of cells).
The latter observation arises because users typically edit large numbers of cells by `pasting' a formula into a range of cells.
The pasted formula acts as a template, and the pasted cells all follow a common pattern.
Like DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19}, we avoid storing formulas for each individual cell, instead storing patterns and ranges of pasted cells.
Leveraging the user's interest in only a small subset of the spreadsheet at any one time, overlay spreadsheets avoid computations outside of this subset.
Requests for cells originating in the source data can be handled efficiently by standard relational storage engines, while only formula cells visible to the user and their transitive dependencies need to be computed.
Unfortunately, common spreadsheet usage creates cells with transitive dependencies that scale with data size.
We mitigate the prohibitively high cost of such cells by outsourcing their computation to a batch-processing engine like Apache Spark.
Although slower for small datasets, batch engines scale to larger workloads more gracefully, making them ideal for expensive computations that span many cells, where individual cell values are not needed.
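For intuition, this trade-off can be sketched with a simplified cost model; the symbols and the model itself are illustrative assumptions, not measurements.
Let $n$ be the number of formula cells to evaluate, $c_o$ the per-cell startup overhead (e.g., locking and state initialization), $c_e$ the per-cell evaluation cost, and $C_b$ the fixed startup cost of a batch job, and assume that the batch engine pays its startup cost once and otherwise performs the same per-cell work.
\[
T_{\mathrm{cell}}(n) = n\,(c_o + c_e), \qquad T_{\mathrm{batch}}(n) = C_b + n\,c_e .
\]
Under these assumptions, batch evaluation is cheaper exactly when $n\,c_o > C_b$, that is, when $n > C_b / c_o$: cell-at-a-time evaluation wins for small $n$, and the batch engine wins once the repeated per-cell overhead exceeds its one-time startup cost.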
\paragraph{Overlay Spreadsheets}
We propose \emph{Overlay Spreadsheets}, which present an interface analogous to a normal spreadsheet, but where user edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark.
Our implementation replaces Vizier's existing workflow-style spreadsheet.
Our objective is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
% March 26 by OK: Trimming the ToC summary for space
%
% In this paper, we introduce Overlay Spreadsheets, and present the details of our prototype implementation.
% We implement the concept in the Vizier notebook~\cite{kennedy:2022:ieee-deb:right,brachmann:2020:cidr:your,brachmann:2019:sigmod:data}, a workflow-style notebook built over Apache Spark.
% We explore the challenges of integrating overlay spreadsheets with Apache Spark dataframes, and discuss preliminary work in translating an overlay spreadsheet to derive a dataframe.
% \BG{Experimal result take-aways}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: