Reframing intro around expressivity rather than scalability
parent
de9a5d26b3
commit
9c738f2b3e
@@ -2,15 +2,28 @@
\section{Introduction}
\label{sec:introduction}
|
||||
|
||||
Spreadsheets are popular tools for data exploration, transformation, and visualization, but have historically had challenges managing ``big data'': as few as 50k % fifty thousand
rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09}, relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system. % like Apache Spark.
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
A different approach, employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
spreadsheet runtime and specializes % around
database primitives like indexes and incremental maintenance % specialized
for spreadsheet access patterns.
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
Tools like \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} adopt direct manipulation interfaces, similar to spreadsheets, to streamline the definition of data preparation workflows.
While convenient for data curation, these interfaces lack many of the free-form data manipulation capabilities that make normal spreadsheets ideal for data exploration and visualization.
|
||||
|
||||
Fundamental to spreadsheet-style interfaces for workflows is the need to support \emph{replay}.
When the source data changes, it should be possible to re-run the workflow on the updated data.
Thus, in systems like Wrangler and Vizier, each user interaction adds a repeatable step to the workflow.
These steps are repeated in order. By contrast, typical spreadsheets allow new formulas to be defined in any cell, and dynamically execute cells in dependency order.
|
||||
|
||||
In this paper, we propose a model of spreadsheets that acts like a classical spreadsheet, but where the user's edits and the source data are decoupled.
The result is a spreadsheet that can be `overlaid' on top of any dataset, combining the flexibility of a classical spreadsheet with the replay capabilities of a workflow spreadsheet.
The resulting interface can support data scientists throughout the entire data lifecycle, from exploration through data curation pipeline development.
|
||||
|
||||
As we discuss in this paper, the `overlay' model also enables a new approach to scaling spreadsheets to larger data.
Classical spreadsheets have historically had challenges managing ``big data'' --- as few as 100k rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
\emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} re-architects the spreadsheet runtime and specializes database primitives like indexes and incremental maintenance for spreadsheet access patterns.
In spite of these changes, DataSpread still faces a key challenge: like classical spreadsheets, its unit of computation is the cell.
|
||||
|
||||
We show how the overlay spreadsheet model can offload computation for portions of the spreadsheet that are not visible to the user to a batch-processing engine like Apache Spark.
Although slower for small computations, such systems scale to larger computations more gracefully, making them ideal for expensive computations spanning large numbers of cells.
|
||||
|
||||
|
||||
\textbf{OLIVER'S EDITS APPLIED UP TO THIS POINT}
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}