10 point font... bla. Trimming down more.

main
Oliver Kennedy 2023-03-30 16:26:00 -04:00
parent ac7f5fb7cf
commit 021852f294
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
16 changed files with 49 additions and 52 deletions

@ -1,4 +1,4 @@
\documentclass[sigconf,review]{acmart}
\documentclass[sigconf,review,10pt]{acmart}
\usepackage{cleveref}
\usepackage{todonotes}
@ -125,16 +125,16 @@
Spreadsheets provide a convenient % , friendly
direct manipulation interface to datasets.
Efforts to scale spreadsheets % have taken two approaches: A
either follow a `virtual` strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine using % around
standard database optimizations. % like indexes.
either follow a `virtual' strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine % using % around
%standard database optimizations. % like indexes.
Because database engines are not optimized for spreadsheet access patterns,
% typically optimized for bulk query processing over interactive latencies,
the materialized approach has better performance.
However, the virtual approach offers several key advantages that can not be easily replicated in the materialized approach, including % notably
However, the virtual approach offers several advantages that cannot be easily replicated in the materialized approach, including % notably
the ability to re-apply user interactions to an updated dataset. % version of the same dataset.
We propose a hybrid of % the materialized and virtual
these approaches, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
We introduce the overlay update model, and outline strategies for efficiently accessing a spreadsheet defined in this way.
We propose a hybrid % the materialized and virtual
approach, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
We introduce the overlay update model, and outline strategies for efficiently accessing an overlay spreadsheet.
A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as ``patterns'' that can be leveraged to reduce execution costs.
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread, it can significantly reduce execution costs. % popular
% materialized spreadsheet.
@ -156,7 +156,7 @@ the materialized approach has better performance.
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\keywords{Spreadsheets, Dataframes, Scalable Data Management}
% \keywords{Spreadsheets, Dataframes, Scalable Data Management}
%% A "teaser" image appears between the author and affiliation
%% information and the body of the document, and typically spans the
%% page.

Binary file not shown.

@ -1,29 +1,29 @@
%!TEX root=../main.tex
\begin{figure*}
\begin{figure}
\centering
\subcaptionbox{Scale Data, View First}{
\includegraphics[width=0.28\textwidth]{results/laptop-init-varysize.pdf}
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varysize.pdf}
}
\subcaptionbox{Fix Data, Move View}{
\includegraphics[width=0.28\textwidth]{results/laptop-init-varystart.pdf}
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varystart.pdf}
}
\subcaptionbox{Scale Data, View Last}{
\includegraphics[width=0.28\textwidth]{results/laptop-init-varystartandsize.pdf}
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varystartandsize.pdf}
}
\subcaptionbox{Scale Data, View First}{
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varysize.pdf}
}
\subcaptionbox{Fix Data, Move View}{
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystart.pdf}
}
\subcaptionbox{Scale Data, View Last}{
\includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystartandsize.pdf}
\includegraphics[width=0.47\columnwidth]{results/laptop-update_one-varysize.pdf}
}
% \subcaptionbox{Fix Data, Move View}{
% \includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystart.pdf}
% }
% \subcaptionbox{Scale Data, View Last}{
% \includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystartandsize.pdf}
% }
\caption{System Initialization costs (a-c) and cost to update one cell (d-f)}
\label{fig:experiments}
\trimfigurespacing
\end{figure*}
\end{figure}
\section{Experiments}
\label{sec:experiments}
@ -69,15 +69,15 @@ To emulate batch processing, we replace the formula for the $\texttt{sum\_change
% \label{fig:perf-scale-visible}
% \trimfigurespacing
% \end{figure}
\Cref{fig:experiments}(a,d) shows initialization and update costs, with a fixed dataset size of approximately 600,000 rows, and a variable viewport position.
\Cref{fig:experiments}(a,c) shows initialization and update costs, with a fixed dataset size of approximately 600,000 rows, and a variable viewport position.
Due to the running sum, the longest visible dependency chain grows as the visible region moves further into the dataset.
Costs for Vizier and Dataspread grow significantly with the length of the dependency chain, while batch processing can compute the updated sum significantly faster.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Scaling Data}
\Cref{fig:experiments}(a,d) shows the initialization and update costs when the viewport is on the first cell. Vizier only needs to compute the visible cell formulas, and so is significantly faster.
\Cref{fig:experiments}(c,f) show these costs when the viewport is on the last cell; as before, the costs for Vizier grow with the length of the longest visible dependency chain, supporting the value of batching.
\Cref{fig:experiments}(b,d) shows the initialization and update costs when the viewport is on the first cell. Vizier only needs to compute the visible cell formulas, and so is significantly faster.
% \Cref{fig:experiments}(c,f) show these costs when the viewport is on the last cell; as before, the costs for Vizier grow with the length of the longest visible dependency chain, supporting the value of batching.

@ -27,31 +27,28 @@ We refer to these as the virtual and materialized approach, respectively, and il
The materialized approach is optimized for multiple data access patterns common to spreadsheets~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}, including
(i) Data structures specialized for the positional referencing scheme commonly used in spreadsheet formulas~\cite{DBLP:conf/icde/BendreVZCP18},
(ii) Execution strategies that prioritize completion of portions of the spreadsheet that the user is viewing~\cite{DBLP:conf/sigmod/BendreWMCP19}, and
(iii) Indexes that leverage patterns in the dependencies of adjacent cells to compress dependency graphs~\cite{tang-23-efcsfg}.
(iii) Indexes storing compressed dependency graphs~\cite{DBLP:conf/sigmod/BendreWMCP19,tang-23-efcsfg}.
Similar optimizations are considerably harder in the virtual approach, as the result of updates and their effects on cell position are only materialized when data is received.
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
(i) Because it stores only the updates applied by the user (e.g., insert a row at position $x$, replace the value of cell $c$ with $v$, \ldots), the spreadsheet's full version history can be stored at negligible;
(ii) As in Wrangler, the resulting data transformation process can be easily applied to other data (e.g., by scaling up from an interaction-friendly sample of the data to the entire dataset, or an updated version of the data); and
(iii) As in Vizier, the user's interactions can be translated into a standardized query model (i.e., a Spark dataframe), allowing it to ``plug into'' existing scalable computation platforms (i.e., Spark) and standardized provenance analysis frameworks (e.g., \cite{kumari:2021:cidr:datasense}).
(i) It is a naturally efficient encoding of the spreadsheet's full version history;
(ii) As in Wrangler, the user's actions can be re-applied to new data (e.g., an updated version of the source data); and
(iii) As in Vizier, the spreadsheet can be re-encoded as a relational query allowing it to ``plug into'' existing scalable computation platforms (i.e., Spark) and provenance analysis tools (e.g., \cite{kumari:2021:cidr:datasense}).
In this paper, we present an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
In an Overlay Spreadsheet (\Cref{fig:overlay}), the user's edits are stored in a spreadsheet that is ``overlaid'' on top of source data.
Users interact with an Overlay Spreadsheet just like an ordinary spreadsheet, inserting or removing rows or columns, overwriting data with formulas or literals, and reorganizing the data.
However, references to the source dataset are virtualized, allowing users to replay their actions on a updated datasets, translate spreadsheets to run on scalable computation platforms, and to facilitate provenance analysis.
We also demonstrate that this different virtual representation of edits enables more efficient exploitation of spreadsheet access patterns, including optimizing computation of cells visible to the user.
We propose an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
An Overlay Spreadsheet (\Cref{fig:overlay}) presents an interface analogous to a normal spreadsheet.
User edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
As an added benefit, decoupling edits and source data makes it easier to leverage spreadsheet access patterns, reducing the time needed to respond to user actions.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal, reproducibility-oriented, notebook-style workflow system built on Apache Spark.
Users of Vizier define sequences of data transformation steps that may include scripts, templated widgets, or other operations.
Existing versions of Vizier provide a spreadsheet-style interface, where each user interaction builds out the data transformation workflow.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark.
Existing versions of Vizier allow users to define workflow steps through a spreadsheet-style interface; each action adds a new workflow step.
In spite of the performance limitations of the virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features are implemented primarily over Spark dataframes.
%
Our objective in this paper is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
As a secondary goal, we further explore the additional benefits of the overlay approach.
Specifically, we observe that because spreadsheet updates are typically made manually, the number of updates is limited by the speed of a human interacting with the system.
Although a single update may be applied to multiple cells (e.g., by copy/pasting a formula over a range of cells), the number of such updates is likely to be small.
In this paper, we take the first steps towards hybridizing the cell-at-a-time execution strategies of classical spreadsheets, with bulk computation strategies found in relational databases.
As a secondary goal, we explore the potential for performance improvement resulting from the overlay approach.
Specifically, we observe that bulk updates in a spreadsheet (e.g., pasting a formula across a range of cells) rely on expression ``patterns,''
which admit more efficient dependency analysis and bulk computation when intermediate values are not required.
This hybrid strategy is akin to optimizations applied in DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operating over patterns of updates rather than patterns in the dependency graph.
% March 26 by OK: Trimming the ToC summary for space

@ -62,19 +62,19 @@
\subsection{Spreadsheets}
\label{sec:spreadsheets}
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels; unless otherwise noted, we assume $\rowDomain \subset \mathbb Z$.
Let $\valueDomain \subset \exprDomain$ denote domains of values and expressions, respectively; We define $\exprDomain$ in greater detail below.
We define a \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ as a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels. Except where noted, $\rowDomain \subset \mathbb Z$.
Let $\valueDomain \subset \exprDomain$ denote domains of values and expressions, respectively.
A \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ is a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
We use $\valat{\spreadsheet}{\column}{\row}$ to denote $\spreadsheet(\cellRef{\column}{\row})$.
Let $\errorval \in \valueDomain$ indicate ``undefined'' and define the \emph{domain} $\dom(\spreadsheet)$ to be the set of cells $\cellRef{\column}{\row}$ where $\valat{\spreadsheet}{\column}{\row} \neq \errorval$.
An expression $\expr \in \exprDomain$ is a formula defined over literals from $\valueDomain$, the standard arithmetic operators, and references to other cells in the spreadsheet ($\cellRef{\column}{\row}$).
The expression $\expr$ may be evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
(i) Literals evaluate to themselves, (ii) Arithmetic formulas are evaluated in the usual way, and (iii) References to the spreadsheet are evaluated recursively
The expression $\expr$ is evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
(i) Literals and arithmetic are evaluated in the usual way, and (ii) References to the spreadsheet are evaluated recursively
($\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$).
By convention, cyclic references evaluate to the distinguished error value $\errorval$ in $\valueDomain$.
We define the dependencies of an expression ($\depsOf{\expr}$) as the cells referenced by $\expr$.
An expression's dependencies ($\depsOf{\expr}$) are the cells referenced by $\expr$.
Dependencies induce a graph $\DG{\spreadsheet}\tuple{N, E}$ over the spreadsheet, with cells as nodes (i.e., $N = \columnDomain \times \rowDomain$), and dependencies as directed edges:
$$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
\{\;\cell \rightarrow \cellPrime\;|\;\cellPrime \in \depsOf{\valat{\spreadsheet}{\column}{\row}}\;\} $$
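The dependency graph above can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration only: a spreadsheet is a dictionary from (column, row) pairs to either a numeric literal or a formula string, and the hypothetical helper \texttt{deps\_of} extracts references with a simple regular expression rather than a full formula parser.

```python
import re

# Assumed reference syntax: capital column letters followed by a row number.
CELL_REF = re.compile(r"\b([A-Z]+)(\d+)\b")


def deps_of(expr):
    """Return the set of cells referenced by an expression.

    Literals (anything that is not a formula string) have no dependencies.
    """
    if not isinstance(expr, str):
        return set()
    return {(col, int(row)) for col, row in CELL_REF.findall(expr)}


def dependency_edges(sheet):
    """Build the edge set E: one edge (cell, dep) per reference in cell's expression."""
    return {(cell, dep) for cell, expr in sheet.items() for dep in deps_of(expr)}


# Hypothetical three-cell sheet: C1 references A1 and B1.
sheet = {("A", 1): 15, ("B", 1): 50, ("C", 1): "A1 + B1"}
edges = dependency_edges(sheet)
```

A sheet holding only literals yields an empty edge set under this sketch.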
@ -84,7 +84,7 @@ Note that if all cell expressions are constants (i.e., a spreadsheet without for
\begin{example}
Consider the spreadsheet at the top of \Cref{fig:example-spreadsheet-and-a}.
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds arithmetic expressions referencing cells from columns \emph{A} and \emph{B}.
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds expressions that reference cells from columns \emph{A} and \emph{B}.
Evaluating this spreadsheet assigns each cell a concrete value, as in the top right.
For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{A}{1} + \cellRef{B}{1}} = \evalOf{\spreadsheet}{\cellRef{A}{1}} + \evalOf{\spreadsheet}{\cellRef{B}{1}} = 15 + 50 = 65$.
\end{example}
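The evaluation rules and the example above can be sketched in Python as follows. This is an illustrative sketch, not the paper's implementation. The tuple encoding of expressions, the \texttt{ERROR} sentinel, and the \texttt{evaluate} function are assumptions, as is the choice to propagate the error value through arithmetic.

```python
import operator

ERROR = object()  # stand-in for the distinguished error value
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(sheet, expr, visiting=frozenset()):
    """Evaluate expr in the context of sheet.

    Expressions are numbers, ("ref", column, row), or (op, left, right).
    `visiting` holds the cells on the current reference chain, so that
    cyclic references evaluate to ERROR instead of recursing forever.
    """
    if isinstance(expr, (int, float)):
        return expr  # literals evaluate to themselves
    if expr[0] == "ref":
        cell = (expr[1], expr[2])
        if cell in visiting or cell not in sheet:
            return ERROR  # cyclic or undefined reference
        return evaluate(sheet, sheet[cell], visiting | {cell})
    op, left, right = expr
    lhs = evaluate(sheet, left, visiting)
    rhs = evaluate(sheet, right, visiting)
    if lhs is ERROR or rhs is ERROR:
        return ERROR  # assumption: errors propagate through arithmetic
    return OPS[op](lhs, rhs)


sheet = {
    ("A", 1): 15,
    ("B", 1): 50,
    ("C", 1): ("+", ("ref", "A", 1), ("ref", "B", 1)),
    ("D", 1): ("ref", "D", 1),  # hypothetical self-referencing cell
}
c1 = evaluate(sheet, ("ref", "C", 1))  # 15 + 50
d1 = evaluate(sheet, ("ref", "D", 1))  # cyclic reference
```

Here \texttt{c1} is 65, matching the example, and \texttt{d1} is the error sentinel.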
@ -170,7 +170,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
\subsection{Cell Updates}
\label{sec:updates}
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ to a spreadsheet is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : (\acu \in \upd)$).
Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spreadsheet:
\[
@ -182,17 +182,17 @@ Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spr
\]
An update may affect cells beyond its domain.
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes the constant expression in cell \emph{\cellRef{A}{1}} and the arithmetic expression in cell \emph{\cellRef{C}{3}}.
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes the expressions in cells \emph{\cellRef{A}{1}} and \emph{\cellRef{C}{3}}.
Evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
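The observation that an update can change cells outside its domain can be sketched as follows. All specifics here are assumptions for illustration: the sheet contents other than the values 15 and 50 are hypothetical and are not taken from the figure, formulas are modeled as Python callables, the update is assumed to override the expressions of exactly the cells in its domain, and cycle handling is omitted.

```python
def value_of(sheet, cell):
    """Evaluate one cell; formulas are callables taking a lookup function."""
    expr = sheet[cell]
    if callable(expr):
        return expr(lambda col, row: value_of(sheet, (col, row)))
    return expr


def apply_update(sheet, update):
    """Assumed semantics: cells in the update's domain take the new expression."""
    return {**sheet, **update}


def values(sheet):
    return {cell: value_of(sheet, cell) for cell in sheet}


# Hypothetical sheet: column C sums columns A and B in each row.
sheet = {}
for row, (a, b) in enumerate([(15, 50), (10, 20), (5, 5)], start=1):
    sheet[("A", row)] = a
    sheet[("B", row)] = b
    sheet[("C", row)] = lambda v, r=row: v("A", r) + v("B", r)

# Hypothetical update with domain {A1, C3}.
update = {("A", 1): 20, ("C", 3): lambda v: v("A", 3) * v("B", 3)}

before = values(sheet)
after = values(apply_update(sheet, update))
changed = {c for c in before.keys() | after.keys() if before.get(c) != after.get(c)}
```

In this sketch the update's domain contains two cells, but three values change: A1 and C3 directly, and C1 because it depends on A1.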
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Spreadsheet Access to Datasets}
\label{sec:spre-access-datas}
To uniformly model spreadsheet access to relational data as well as to data already represented as spreadsheets, we assume an input dataset $\ds$ with a designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
For example, in a relational table, these can be the table's columns and values of a key or rowid attribute, respectively.
For a spreadsheet or csv data, $\rowDomain_{\ds} \subset \mathbb Z$ can be the position of the row.
We use $\valat{\ds}{\row}{\column}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
To uniformly model source datasets, whether from relational databases or other spreadsheets, we assume an input dataset $\ds$ with designated column and row labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
In a relational table, these are the table's columns and values of a key attribute, respectively.
For CSV data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row.
$\valat{\ds}{\row}{\column}$ denotes the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map that maps rows in $\ds$ into the spreadsheet.
A \emph{spreadsheet overlay} for a dataset $\ds$ is then a pair $(\ds, \rframe)$ that defines a spreadsheet $\spreadsheet_{\ds, \rframe}$ with domains $\columnDomain = \columnDomain_{\ds}$, $\rowDomain = \dom(\rframe)$ as