Reframing intro around expressivity rather than scalability

2023-06-28 17:26:50 -04:00 · 2023-06-28 17:26:50 -04:00 · 9c738f2b3e
parent de9a5d26b3
commit 9c738f2b3e
1 changed files with 22 additions and 9 deletions
--- a/sections/introduction.tex
+++ b/sections/introduction.tex
@@ -2,15 +2,28 @@
 \section{Introduction}
 \label{sec:introduction}

-Spreadsheets are a popular tools for data exploration, transformation, and visualization, but have historically had challenges managing ``big data'' --- as few as 50k % fifty thousand
-rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
-One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system. % like Apache Spark.
-In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
-A different approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
-spreadsheet runtime and specializes % around
-database primitives like indexes and incremental maintenance % specialized
-for spreadsheet access patterns.
-We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
+Tools like \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} adopt direct-manipulation interfaces, similar to spreadsheets, as a way to streamline the definition of data preparation workflows.
+While convenient for data curation, these interfaces lack many of the free-form data manipulation capabilities that make normal spreadsheets ideal for data exploration and visualization.
+
+Fundamental to spreadsheet-style interfaces for workflows is the need to support \emph{replay}.  
+When the source data changes, it should be possible to re-run the workflow on the updated data.
+Thus, in systems like Wrangler and Vizier, each user interaction adds a repeatable step to the workflow.
+These steps are repeated in order. By contrast, typical spreadsheets allow new formulas to be defined in any cell and dynamically execute cells in dependency order.
+
+In this paper, we propose a spreadsheet model that behaves like a classical spreadsheet, but in which the user's edits and the source data are decoupled.
+The result is a spreadsheet that can be `overlaid' on top of any dataset, allowing for both the flexibility of a classical spreadsheet and the replay capabilities of a workflow spreadsheet.
+The resulting interface can support data scientists throughout the entire data lifecycle, from exploration through the development of data curation pipelines.
+
+As we discuss in this paper, the `overlay'-based model also enables a new approach to scaling spreadsheets to larger data.
+Classical spreadsheets have historically had challenges managing ``big data'' --- as few as 100k rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
+\emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} re-architects the spreadsheet runtime and specializes database primitives like indexes and incremental maintenance for spreadsheet access patterns.
+In spite of these changes, DataSpread still faces a key challenge: like classical spreadsheets, its unit of computation is the cell.  
+
+We show how the overlay spreadsheet model may be leveraged to offload significant computational work to a batch-processing engine like Apache Spark for portions of the spreadsheet that are not visible to the user.
+Although slower for small computations, such systems scale to larger computations more gracefully, making them ideal for expensive computations spanning large numbers of cells.
+
+
+\textbf{OLIVER'S EDITS APPLIED UP TO THIS POINT}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{figure}