Reframing intro around expressivity rather than scalability

main
Oliver Kennedy 2023-06-28 17:26:50 -04:00
parent de9a5d26b3
commit 9c738f2b3e
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
1 changed files with 22 additions and 9 deletions


@ -2,15 +2,28 @@
\section{Introduction}
\label{sec:introduction}
Spreadsheets are popular tools for data exploration, transformation, and visualization, but have historically had challenges managing ``big data'' --- as few as 50k % fifty thousand
rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system. % like Apache Spark.
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
A different approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
spreadsheet runtime and specializes % around
database primitives like indexes and incremental maintenance % specialized
for spreadsheet access patterns.
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
Tools like \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} adopt direct manipulation interfaces, similar to spreadsheets, as a way to streamline the definition of data preparation workflows.
While convenient for data curation, these interfaces lack many of the free-form data manipulation capabilities that make normal spreadsheets ideal for data exploration and visualization.
Fundamental to spreadsheet-style interfaces for workflows is the need to support \emph{replay}.
When the source data changes, it should be possible to re-run the workflow on the updated data.
Thus, in systems like Wrangler and Vizier, each user interaction adds a repeatable step to the workflow.
These steps are repeated in order. By contrast, typical spreadsheets allow new formulas to be defined in any cell, and dynamically execute cells in dependency order.
In this paper, we propose a model of spreadsheets that acts like a classical spreadsheet, but where the user's edits and the source data are decoupled.
The result is a spreadsheet that can be `overlaid' on top of any dataset, providing both the flexibility of a classical spreadsheet and the replay capabilities of a workflow spreadsheet.
The resulting interface can support data scientists throughout the entire data lifecycle, from exploration through data curation pipeline development.
As we discuss in this paper, this new `overlay'-based approach to spreadsheets also enables a new approach to scaling spreadsheets to larger data.
Classical spreadsheets have historically had challenges managing ``big data'' --- as few as 100k rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
\emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} re-architects the spreadsheet runtime and specializes database primitives like indexes and incremental maintenance for spreadsheet access patterns.
In spite of these changes, DataSpread still faces a key challenge: like classical spreadsheets, its unit of computation is the cell.
We show how the overlay spreadsheet model can be leveraged to offload significant computational work, for portions of the spreadsheet not visible to the user, to a batch-processing engine like Apache Spark.
Although slower for small computations, such systems scale to larger computations more gracefully, making them ideal for expensive computations spanning large numbers of cells.
\textbf{OLIVER'S EDITS APPLIED UP TO THIS POINT}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}