Finishing intro pass and trimming space.

main
Oliver Kennedy 2023-07-06 13:45:42 -04:00
parent dec192e897
commit 27a543a49a
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
8 changed files with 775 additions and 765 deletions

@ -8,7 +8,7 @@
% \usepackage[disable]{todonotes}
\DeclareRobustCommand{\BG}[1]{{\todo[inline,color=red!40,size=\footnotesize,fancyline]{\textbf{Boris says: }{#1}}}}
\DeclareRobustCommand{\BGDel}[2]{{\todo[inline,color=red!40,size=\footnotesize,fancyline]{\textbf{Boris deleted:}{#1}\textbf{because {#2}}}}}
\DeclareRobustCommand{\OK}[1]{{\todo[inline,color=red!40,size=\footnotesize,fancyline]{\textbf{Oliver says: }{#1}}}}
\DeclareRobustCommand{\OK}[1]{{\todo[inline,color=blue!40,size=\footnotesize,fancyline]{\textbf{Oliver says: }{#1}}}}
\DeclareRobustCommand{\OKDel}[2]{{\todo[inline,color=red!40,size=\footnotesize,fancyline]{\textbf{Oliver deleted:}}{#1}\textbf{because {#2}}}}
\DeclareRobustCommand{\V}[1]{{\todo[inline,color=red!40,size=\footnotesize,fancyline]{\textbf{Viktoria says: }{#1}}}}
\DeclareRobustCommand{\VDel}[2]{{\todo[inline,color=red!40,size=\footnotesize,fancyline]{\textbf{Viktoria deleted:}}{#1}\textbf{because {#2}}}}

@ -1,4 +1,4 @@
\documentclass[sigconf,review,10pt]{acmart}
\documentclass[sigconf,10pt]{acmart}
\usepackage{cleveref}
\usepackage{todonotes}

@ -4,15 +4,15 @@
\label{sec:conclusions}
In this work, we introduced overlay spreadsheets as a potential direction for reproducible spreadsheets in workflow and provenance analysis systems like Vizier.
This novel capability is powered by overlays that decouple the user's edits from the source data they are applied to.
We also demonstrated how updates to ranges of cells can be represented declaratively, improving performance and enabling optimized evaluation of recursive patterns.
Overlay spreadsheets decouple the user's edits from the source data they are applied to, enabling replayability.
We demonstrated how a compact, declarative encoding of formulas in turn enables optimized evaluation of recursive patterns.
Recursive patterns remain the source of several open challenges for us.
Most notably, in the absence of recursive patterns, the depth of a dependency chain is bounded by the number of user interactions.
We suggested two strategies for improving performance in the presence of recursive patterns: (i) Closed-form computation of dependencies, and (ii) using bulk processing to avoid individual evaluation of cells that are not being shown to the user.
We suggested two strategies for improving performance in the presence of recursive patterns: (i) Closed-form computation of dependencies, and (ii) using bulk processing to avoid individual cell evaluation.
We also observe two additional challenges of adapting a dataset to new source data.
As we noted, row identity is a critical challenge for updating source data, as each row in the updated dataset needs to be mapped to its corresponding row in the original.
Row identity is a critical challenge for updating source data, as each row in the updated dataset needs to be mapped to its corresponding row in the original.
Additionally, the spreadsheet itself may need to change, for example extending patterns to incorporate newly introduced rows in the dataset.
%%% Local Variables:

@ -5,25 +5,12 @@
Tools like \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} adopt direct manipulation interfaces, similar to spreadsheets, as a way to streamline the definition of data preparation workflows.
While convenient for curation, these interfaces lack the free form data manipulation capabilities that make spreadsheets ideal for data exploration and visualization.
Fundamental to spreadsheet interfaces for datasets created by workflows is the need to support \emph{replay}.
Fundamental to spreadsheet interfaces in workflow systems is the need to support \emph{replay}.
When the source data or workflow changes, it should be possible to re-run the (updated) workflow on the updated data.
This is enabled in systems like Wrangler and Vizier by recording each user interaction as a repeatable step in the workflow.
These steps are repeated in-order; By contrast, typical spreadsheets allow new formulas to be defined in any cell, and dynamically execute cells in dependency order.\BG{I think the significance or dependency versus workflow order is not clear here. We can remove this argument to save space or have to extend it}
In this paper, we propose a model of spreadsheets that acts like a classical spreadsheet, but where the user's edits and the source data are decoupled.
The result is a spreadsheet that can be \emph{`overlaid'} on top of any dataset (no matter whether the dataset is the result of a workflow defined in tools like Vizier or not), allowing for both the flexibility of a classical spreadsheet, and the replay capabilities of a workflow spreadsheet.
The resulting interface can support data scientists throughout the entire data lifecycle, from exploration through data curation pipeline development and analysis.
As we discuss in this paper, this new \emph{overlay} approach to spreadsheets also enables a new approach to scaling spreadsheets to larger data.
Classical spreadsheets have historically had challenges managing ``big data'' --- as few as 100k rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
\emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} re-architects the spreadsheet runtime and specializes database primitives like indexes and incremental maintenance for spreadsheet access patterns.
In spite of these changes, DataSpread still faces a key challenge: like classical spreadsheets, its unit of computation is the cell.\BG{Say why cell unit computation is a challenge? Expand or remove}
We show how the overlay spreadsheet model can be leveraged to offload significant computational work for portions of the spreadsheet not visible to the user to a batch-processing engine like Apache Spark.
Although slower for small datasets / computations, such systems scale to larger datasets / computations more gracefully, making them ideal for expensive computations spanning large numbers of cells.
\OK{\textbf{OLIVER'S EDITS APPLIED UP TO THIS POINT}}
This is enabled in systems like Wrangler and Vizier, where the fundamental data model is a workflow of repeatable transformations (`Workflow' in \Cref{fig:overlay}).
By contrast, a spreadsheet is a grid of interdependent cells where the original data and user-applied edits are indistinguishable (`Spreadsheet' in \Cref{fig:overlay}).
% \BG{I think the significance or dependency versus workflow order is not clear here. We can remove this argument to save space or have to extend it}
% \OK{Rephrased the above to focus on data representation and the distinguishability of source data and edits.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
@ -37,32 +24,46 @@ Although slower for small datasets / computations, such systems scale to larger
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The materialized approach is optimized for multiple data access patterns common to spreadsheets~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}, including
(i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
(ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
(iii) Indexes storing compressed dependency graphs~\cite{DBLP:conf/sigmod/BendreWMCP19,tang-23-efcsfg}.
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
In this paper, we propose a model of spreadsheets that acts like a classical spreadsheet, but where the user's edits and the source data are decoupled.
The result is a spreadsheet that can be \emph{`overlaid'} on top of any dataset (`Overlay' in \Cref{fig:overlay}), no matter whether the source data is a raw datafile or the result of a workflow (e.g., in Vizier).
Overlay Spreadsheets provide the flexibility of spreadsheets, while also supporting the replay capabilities needed for workflows.
%The resulting interface can support data scientists throughout the entire data lifecycle, from exploration through data curation pipeline development and analysis.
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
(i) It is a naturally efficient encoding of the spreadsheet's full version history (only update operations rather than changes to the data need to be recorded);
(ii) As in Wrangler, the user's actions can be re-applied to new data (e.g., an updated version of the source data); and
(iii) As in Vizier, the generation of the current spreadsheet version from the input dataset can be re-encoded as a relational query allowing it to ``plug into'' existing scalable computation platforms (e.g., Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15}) and provenance analysis tools (e.g., \cite{kumari:2021:cidr:datasense}).
As we discuss in this paper, this new \emph{overlay} approach to spreadsheets also enables a new approach to scaling spreadsheets to larger data.
Classical spreadsheets have historically had challenges managing ``big data'' --- as few as 100k rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
\emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18} re-architects the spreadsheet runtime and specializes database primitives like indexes and incremental maintenance for spreadsheet access patterns.
In spite of these changes, DataSpread still faces a key challenge: like classical spreadsheets, its unit of computation is the cell.
Although the overheads of starting a computation (e.g., locking, state initialization) are typically low, they are repeated for every cell that needs to be computed.
% \BG{Say why cell unit computation is a challenge? Expand or remove}
% \OK{Added: per-cell overheads.}
We propose \emph{Overlay Spreadsheets}, an optimized hybrid of the virtual and materialized approaches.
An Overlay Spreadsheet (\Cref{fig:overlay}) presents an interface analogous to a normal spreadsheet.
User edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
As an added benefit, decoupling edits and source data makes it easier to leverage spreadsheet access patterns, reducing the time needed to respond to user actions.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark.
Existing versions of Vizier allow users to define workflow steps through a spreadsheet-style interface; each action adds a new workflow step.
In spite of the performance limitations of this virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features rely on encoding data transformations as Spark dataframes.
\paragraph{Scaling Spreadsheets to Big Data}
There has been considerable effort by the database community to `scale up' spreadsheets to big data~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}.
% These efforts include:
% (i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
% (ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
% (iii) Indexes storing compressed dependency graphs~\cite{DBLP:conf/sigmod/BendreWMCP19,tang-23-efcsfg}.
%
Our objective is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
Overlays create an opportunity for further scalability based on the following two observations:
(i) Most of the `big' data appears in the source dataset.
(ii) The user applies a small number of edits (that may affect a large number of cells).
The latter observation arises because users typically edit large numbers of cells by `pasting' a formula into a range of cells.
The pasted formula acts as a template, and the pasted cells all follow a common pattern.
Like \cite{DBLP:conf/sigmod/BendreWMCP19}, we avoid storing formulas for each individual cell, instead storing patterns and ranges of pasted cells.
As a secondary goal, we explore potential performance improvements that the overlay approach enables.
Specifically, we observe that bulk updates in a spreadsheet (e.g., pasting a formula across a range of cells) rely on expression ``patterns'',
which admit more efficient dependency analysis and bulk computation, when intermediate values are not required.
This hybrid strategy is akin to optimizations applied in DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operates over patterns of updates rather than patterns in the dependency graph, enabling additional optimizations.
Leveraging the user's interest in only a small subset of the spreadsheet at any one time, overlay spreadsheets avoid computations outside of this subset.
Requests for cells originating in the source data can be handled efficiently by standard relational storage engines, while only formula cells visible to the user and their transitive dependencies need to be computed.
Unfortunately, common spreadsheet usage creates cells with transitive dependencies that scale with data size.
We mitigate the prohibitively high cost of such cells by outsourcing their computation to a batch-processing engine like Apache Spark.
Although slower for small datasets, batch engines scale to larger workloads more gracefully, making them ideal for expensive computations that span many cells, where individual cell values are not needed.
\paragraph{Overlay Spreadsheets}
We propose \emph{Overlay Spreadsheets}, which present an interface analogous to a normal spreadsheet, but where user edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark. Our implementation replaces its existing workflow-style spreadsheet.
Our objective is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
% March 26 by OK: Trimming the ToC summary for space
%

@ -193,9 +193,9 @@ To uniformly model source datasets, whether from relational databases or other s
In a relational table, these are the table's columns and the values of a key or rowid attribute, respectively.
For csv data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row in the file.
We write $\valat{\ds}{\row}{\column}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
%
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map from rows in $\ds$ to rows of the spreadsheet.
A \emph{spreadsheet overlay} for a dataset $\ds$ is then a pair $(\ds, \rframe)$ that defines a spreadsheet $\spreadsheet_{\ds, \rframe}$ with domains $\columnDomain = \columnDomain_{\ds}$, $\rowDomain = \dom(\rframe)$ as
A \emph{spreadsheet overlay} for a dataset $\ds$ is a pair $(\ds, \rframe)$ that defines a spreadsheet $\spreadsheet_{\ds, \rframe}$ with domains $\columnDomain = \columnDomain_{\ds}$, $\rowDomain = \dom(\rframe)$ as
$
\valat{\spreadsheet_{\ds, \rframe}}{\row}{\column} = \valat{\ds}{\rframe^{-1}(\row)}{\column}
$
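
The overlay definition above can be sketched in a few lines of Python. This is an illustration only, not the prototype's code: the class name `Overlay`, the dictionary-based dataset, and the method `value_at` are assumptions made for the example. The reference frame is stored as a map from source row identifiers to spreadsheet positions and inverted once so that positional lookups resolve to source rows.

```python
class Overlay:
    """Hypothetical sketch of a spreadsheet overlay (dataset, reference frame)."""

    def __init__(self, dataset, frame):
        # dataset: {row_id: {column: value}}  (assumed in-memory stand-in for the source)
        # frame:   {row_id: spreadsheet_position}, required to be injective
        self.dataset = dataset
        self.inverse = {pos: row_id for row_id, pos in frame.items()}
        if len(self.inverse) != len(frame):
            raise ValueError("reference frame must be injective")

    def value_at(self, column, position):
        # Resolve the spreadsheet position back to a source row, then read the column.
        return self.dataset[self.inverse[position]][column]


# Source rows are identified by keys; the frame assigns each a spreadsheet position.
source = {"k1": {"A": 10}, "k2": {"A": 20}, "k3": {"A": 30}}
overlay = Overlay(source, {"k1": 1, "k2": 2, "k3": 3})
```

Because the source data is only reached through the inverse frame, replacing `source` with an updated dataset leaves the overlay itself untouched.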
@ -215,7 +215,7 @@ Recall that a reference frame maps the spreadsheet's positional row references t
Thus, to insert, delete, or move rows in the spreadsheet, it is sufficient to modify the reference frame.
Formally, a reference frame transformation $\rtrans$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \errorval$ from initial row positions to new row positions, or the value $\errorval$ for a deleted row ($\rtrans$ is allowed to map multiple inputs to $\errorval$).
The new reference frame after applying $\rtrans$ is $\rframe' = \rtrans \circ \rframe$, where $\circ$ denotes function composition.
As an example, consider deleting the 2nd row of the spreadsheet from \Cref{fig:example-spreadsheet-and-a}. The positions of rows $3$ and $4$ are decreased by one, while row $1$ retains its position
As an example, consider deleting the 2nd row from \Cref{fig:example-spreadsheet-and-a}. The positions of rows $3$ and $4$ are decreased by one, while row $1$ retains its position:
$$\rtrans(x) = \begin{cases}
x & \textbf{if } x < 2\\
\errorval & \textbf{if } x = 2\\

@ -8,18 +8,17 @@ Our prototype is designed to accept any Spark dataframe as a data source.
% The prototype's design is illustrated in \Cref{fig:systemdesign}
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates our internal model of a spreadsheet to a more natural interface.
The \textbf{Execution} layer is responsible for evaluating spreadsheet cells and materializing values for the cells currently visible to the user.
The \textbf{Indexing} layer provides efficient access to the updates themselves, and an LRU cache provides efficient random access to the source dataframe.
The \textbf{Execution} layer evaluates spreadsheet cells and materializes cells currently visible to the user.
The \textbf{Indexing} layer provides efficient access to formulas, and an LRU cache provides efficient access to source dataframes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation Layer}
\label{sec:system-presentation}
User-facing client applications connect to the overlay spreadsheet through a presentation layer.
This layer mediates concurrent updates of the spreadsheet and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns.
Column operations (insertion, deletion, reordering) are handled at this layer, so lower levels can reference the (comparatively small) set of columns solely by column identity.
Other updates are put into a serial order and relayed to lower levels.
User-facing client applications connect to the overlay spreadsheet through a presentation layer that serializes concurrent updates, and provides clients with the illusion of a fixed grid of cells.
Column operations (insertion, deletion, reordering) are handled at this layer, so lower levels can reference the small set of columns solely by column identity.
Other updates are serialized and forwarded to lower levels.
% \begin{figure}
% \includegraphics[width=0.4\columnwidth]{graphics/system-arch}
@ -62,7 +61,7 @@ For example, any cell in a running sum column is equivalent to a sum over the pr
Our preliminary experiments (\Cref{sec:experiments}) suggest promise in a hybrid evaluation strategy that evaluates visible cells individually and computes cells defined by patterns through closed form windowed aggregation queries.
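
A minimal sketch of the hybrid idea for a running-sum column, assuming 0-indexed rows and a hypothetical pattern `C[r] = C[r-1] + A[r]` with `C[0] = A[0]`. The function name `visible_running_sums` is invented for this example, and Python's built-in `sum` stands in for the bulk windowed aggregation that a batch engine would perform. Rows above the visible window are collapsed into one aggregate, and only the visible cells are evaluated one at a time.

```python
def visible_running_sums(a, first, last):
    """Return C[first..last] for the running-sum pattern C[r] = C[r-1] + A[r]."""
    # Closed form: the cell just above the window equals the sum of a[0:first],
    # so the hidden cells never need to be evaluated individually.
    carry = sum(a[:first])
    visible = []
    # Per-cell evaluation, restricted to the rows the user is looking at.
    for r in range(first, last + 1):
        carry += a[r]
        visible.append(carry)
    return visible
```

Without the closed form, evaluating the first visible cell would require walking a dependency chain whose length grows with the row's position.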
\partitle{Updates}
When the executor receives an update to a cell, it uses the index to compute the set of invalidated cells, marks them as ``pending,'' and begins re-evaluating them in topological order.
When the executor receives a cell update, it uses the index to identify invalidated cells and begins re-evaluating them in topological order.
An update to the reference frame is applied to both the index and the data source.
Following typical spreadsheet semantics, an insertion or row move updates references in dependent formulas, so no re-evaluation is typically required.
If a row with dependent cells is deleted, the dependent cells need to be updated to indicate the error.
@ -80,7 +79,7 @@ The key insight behind the index is that updates are stored as pattern-range tup
%As noted above, we assume that the number of columns is small and the number of rows is large.
\begin{figure}
\includegraphics[width=0.7\columnwidth]{graphics/rangemap.pdf}
\includegraphics[width=0.5\columnwidth]{graphics/rangemap.pdf}
\caption{A range map maps disjoint ranges to values.}
\label{fig:rangemap}
\trimfigurespacing
@ -91,7 +90,7 @@ The update index is built over a one\--di\-men\-sion\-al range map, an ordered m
In addition to the usual operations of an ordered map (e.g., \texttt{put}, \texttt{get}, \texttt{successorOf}), we define the operation \texttt{bulkPut(low, high, value)} which is equivalent to a \texttt{put} on every element in the range from \texttt{low} to \texttt{high}.
Implemented naively (e.g., a size-$N$ binary tree), this operation is $O((\texttt{high}-\texttt{low})\cdot\log(N))$.
A range map avoids the $(\texttt{high}-\texttt{low})$ factor (and correspondingly reduces $N$) by storing an ordered sequence of disjoint ranges, each mapping one specific value as illustrated in \Cref{fig:rangemap}.
A range map avoids the $(\texttt{high}-\texttt{low})$ factor by storing an ordered sequence of disjoint ranges, each mapping one specific value as illustrated in \Cref{fig:rangemap}.
A binary tree provides efficient membership lookups over the ranges.
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values); insertions and deletions have similar costs.
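
The range map can be sketched as follows, assuming inclusive integer ranges. The class and method names (`RangeMap`, `bulk_put`, `ranges`) are chosen for this example, and sorted Python lists with `bisect` stand in for the binary tree, so lookups are logarithmic but the list splice is linear; a balanced tree would be needed to match the bounds stated above.

```python
import bisect


class RangeMap:
    """Maps disjoint, inclusive integer ranges [low, high] to values."""

    def __init__(self):
        self._lows = []    # sorted range starts
        self._highs = []   # matching inclusive range ends
        self._values = []  # value stored for each range

    def get(self, key):
        # Find the last range starting at or before key, then check its end.
        i = bisect.bisect_right(self._lows, key) - 1
        if i >= 0 and key <= self._highs[i]:
            return self._values[i]
        return None

    def bulk_put(self, low, high, value):
        # i..j-1 are the existing ranges that overlap [low, high].
        i = bisect.bisect_left(self._lows, low)
        if i > 0 and self._highs[i - 1] >= low:
            i -= 1
        j = bisect.bisect_right(self._lows, high)
        pieces = []
        if i < j and self._lows[i] < low:
            # Keep the part of the first overlapped range that lies left of low.
            pieces.append((self._lows[i], low - 1, self._values[i]))
        pieces.append((low, high, value))
        if i < j and self._highs[j - 1] > high:
            # Keep the part of the last overlapped range that lies right of high.
            pieces.append((high + 1, self._highs[j - 1], self._values[j - 1]))
        self._lows[i:j] = [p[0] for p in pieces]
        self._highs[i:j] = [p[1] for p in pieces]
        self._values[i:j] = [p[2] for p in pieces]

    def ranges(self):
        return list(zip(self._lows, self._highs, self._values))


rm = RangeMap()
rm.bulk_put(1, 100, "A")   # one entry covers 100 cells
rm.bulk_put(40, 60, "B")   # splits the existing range into three
```

The cost of `bulk_put` depends on the number of stored ranges it overlaps, not on `high - low`.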
@ -114,9 +113,8 @@ The expression for a cell $\cellRef{\column}{\row}$ is stored at $\mathcal I[\co
\If{$(\column_{d}, \rowRange_{d})$ is non-empty}
\State $\texttt{upstream} \leftarrow \texttt{upstream} + (\column_{d}, \rowRange_{d})$
\State $\texttt{queue}.\textbf{enqueue}( \column_{d}, \rowRange_{d},$\\
\hspace*{37mm}$\{\;\texttt{p}' \rightarrow (\texttt{o}'+\texttt{offset})$~~~~~~~~~\\
\hspace*{40mm}$|\; (\texttt{p}' \rightarrow \texttt{o}' )\in \texttt{lineage}\}$\\
\hfill $\cup \{\texttt{pattern} \rightarrow \texttt{offset}\} )$
\hspace*{23mm}{\footnotesize $\{\;\texttt{p}' \rightarrow (\texttt{o}'+\texttt{offset})|\; (\texttt{p}' \rightarrow \texttt{o}' )\in \texttt{lineage}\}$}\\
\hfill {\footnotesize $\cup \{\texttt{pattern} \rightarrow \texttt{offset}\} )$}
\EndIf
\EndFor
\EndFor
@ -129,7 +127,7 @@ The execution layer needs to be able to derive the set of cells on which a speci
We refer to this set as the target's \emph{upstream}.
\Cref{alg:upstream} illustrates how to use breadth-first search to obtain the full upstream set for a given target range.
Each item in the BFS's work queue consists of a column, a row set, and a lineage; we will return to the lineage shortly.
For each work item enqueued, we query the forward index to obtain the set of patterns in the range (line 4), and iterate over the set of their dependencies (line 5).
For each work item enqueued, we query the forward index to obtain patterns in the range (line 4), and iterate over the set of their dependencies (line 5).
If we discover a new dependency (lines 6-7), the newly discovered range is added to the return set and the work queue.
We will explain lines 10-12 shortly.
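
The breadth-first traversal can be illustrated with a simplified Python sketch. It is not a transcription of \Cref{alg:upstream}: it omits the lineage, enumerates rows individually instead of manipulating ranges, and assumes a hypothetical index layout in which each column maps to a list of `(row_range, dependencies)` pairs, with every dependency given as a `(column, offset)` offset reference.

```python
from collections import deque


def upstream(index, column, rows):
    """Return every (column, row) cell that the target cells depend on."""
    seen = set()
    queue = deque([(column, set(rows))])
    while queue:
        col, target_rows = queue.popleft()
        # Look up the patterns that overlap the requested rows.
        for pattern_rows, deps in index.get(col, []):
            hit = target_rows & set(pattern_rows)
            if not hit:
                continue
            for dep_col, offset in deps:
                # Only newly discovered cells are added and explored further.
                new = {(dep_col, r + offset) for r in hit} - seen
                if new:
                    seen |= new
                    queue.append((dep_col, {r for _, r in new}))
    return seen


# Running sum: C[r] = C[r-1] + A[r] for rows 1..5 (assumed example pattern).
index = {"C": [(range(1, 6), [("C", -1), ("A", 0)])]}
```

Tracking the set of cells already seen is what guarantees termination when a pattern refers back to its own column.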
@ -145,7 +143,7 @@ For explicit cell references (lines 4-5), the explicit reference is used.
\begin{algorithmic}[1]
\Require \texttt{pattern}: An expression pattern
\Require $\rangeOf{\column}{\rowRange}$: A range of cells
\Ensure \texttt{deps}: The dependencies of \texttt{pattern} applied to $\rangeOf{\column}{\rowRange}$
\Ensure \texttt{deps}: Dependencies of $\rangeOf{\column}{\rowRange}$'s \texttt{pattern}
\State $\texttt{deps} \leftarrow \{\}$
\If{\texttt{pattern} \textbf{is an offset reference} $\cellRef{\column'}{\delta'}$}
\State $\texttt{deps} \leftarrow \texttt{deps} \cup \{ (\column', \rowRange+\delta', \delta') \}$