Final edits

main
Oliver Kennedy 2023-03-30 21:59:10 -04:00
parent 55f6df55d6
commit 9a819ae5a1
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
6 changed files with 56 additions and 56 deletions


@ -122,14 +122,14 @@
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
Spreadsheets provide a convenient, friendly direct manipulation interface to datasets.
% Spreadsheets provide a convenient, friendly direct manipulation interface to datasets.
Efforts to scale spreadsheets either follow a `virtual' strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine.
Because database engines are not optimized for spreadsheet access patterns, the materialized approach has better performance.
However, the virtual approach offers several advantages that cannot be easily replicated in the materialized approach, including the ability to re-apply user interactions to an updated dataset.
We propose a hybrid approach, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
We introduce the overlay update model, and outline strategies for efficiently accessing an overlay spreadsheet.
A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as ``patterns'' that can be leveraged to reduce execution costs.
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread, it can significantly reduce execution costs.
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread (a standard materialized-style spreadsheet), it can significantly reduce execution costs.
\end{abstract}
%%


@ -54,8 +54,8 @@ We address our questions through a microbenchmark modeled after TPC-H query 1~\c
\end{verbatim}
}
\noindent The \texttt{sum\_charge} column is a running total, creating a dependency chain that grows linearly with row index.
As the user scrolls down the page (under normal usage), the runtime to compute individual cells grows linearly.
Each system load the spreadsheet with a viewable area of 50 rows and updates a single cell.
As the user scrolls down the page (under normal usage), the runtime to compute visible cells grows linearly.
Each system loads the spreadsheet with a viewable area of 50 rows and updates a single cell.
We measure (i) the cost of initialization and (ii) the cost of a single update.
Time is measured until quiescence.
To emulate batch processing, we replace the formula for $\texttt{sum\_charge}[i-1]$ (where $i$ is the first visible row) with a formula that computes the analogous aggregate query.
@ -64,7 +64,7 @@ To emulate batch processing, we replace the formula for the $\texttt{sum\_change
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Moving Viewport}
\partitle{Moving View}
% \begin{figure}
% \includegraphics[width=0.7\columnwidth]{results/desktop-update_one.png}
% \vspace*{-4mm}


@ -6,11 +6,11 @@ Spreadsheets are a popular tools for data exploration, transformation, and visua
rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system. % like Apache Spark.
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
The approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
A different approach, employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
spreadsheet runtime and specializes % around
database primitives like indexes and incremental maintenance % specialized
for spreadsheet access patterns.
We refer to these as the virtual and materialized approach, respectively, and illustrate them in \Cref{fig:overlay}.
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
@ -33,23 +33,23 @@ Similar optimizations are considerably harder in the virtual approach, as the re
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
(i) It is a naturally efficient encoding of the spreadsheet's full version history.
(ii) As in Wrangler, the user's actions can be re-applied to new data (e.g., an updated version of the source data); and
(iii) As in Vizier, the spreadsheet can be re-encoded as a relational query allowing it to ``plug into'' existing scalable computation platforms (i.e., Spark) and provenance analysis tools (e.g., \cite{kumari:2021:cidr:datasense}).
(iii) As in Vizier, the spreadsheet can be re-encoded as a relational query allowing it to ``plug into'' existing scalable computation platforms (e.g., Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15}) and provenance analysis tools (e.g., \cite{kumari:2021:cidr:datasense}).
We propose an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
An Overlay Spreadsheet (\Cref{fig:overlay}) stores presents an interface analogous to a normal spreadsheet.
An Overlay Spreadsheet (\Cref{fig:overlay}) presents an interface analogous to a normal spreadsheet.
User edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
As an added benefit, decoupling edits and source data makes it easier to leverage spreadsheet access patterns, reducing the time needed to respond to user actions.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark.
Existing versions of Vizier allow users to define workflow steps through a spreadsheet-style interface; each action adds a new workflow step.
In spite of the performance limitations of the virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features are implemented primarily over Spark dataframes.
In spite of the performance limitations of this virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features rely on encoding data transformations as Spark dataframes.
%
Our objective in this paper is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting for \textbf{replay and provenance} (i.e., like the virtual approach).
Our objective is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
As a secondary goal, explore the potential for performance improvement resulting from the overlay approach.
As a secondary goal, we explore potential performance improvements that the overlay approach enables.
Specifically, we observe that bulk updates in a spreadsheet (e.g., pasting a formula across a range of cells) rely on expression ``patterns,''
which admit more efficient dependency analysis and bulk computation when intermediate values are not required.
This hybrid strategy is akin to optimizations applied in data spread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operating over patterns of updates rather than patterns in the dependency graph.
which admit more efficient dependency analysis and bulk computation, when intermediate values are not required.
This hybrid strategy is akin to optimizations applied in DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operates over patterns of updates rather than patterns in the dependency graph, enabling additional optimizations.
% March 26 by OK: Trimming the ToC summary for space
%


@ -63,7 +63,7 @@
\label{sec:spreadsheets}
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels. Except where noted, $\rowDomain \subset \mathbb Z$.
Let $\valueDomain \subset \exprDomain$ denote domains of values and expressions, respectively.
Let $\valueDomain$ and $\exprDomain \supset \valueDomain$ denote domains of values and expressions, respectively.
A \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ is a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
We use $\valat{\spreadsheet}{\column}{\row}$ to denote $\spreadsheet(\cellRef{\column}{\row})$.
Let $\errorval \in \valueDomain$ indicate ``undefined'' and define the \emph{domain} $\dom(\spreadsheet)$ to be the set of cells $\cellRef{\column}{\row}$ where $\valat{\spreadsheet}{\column}{\row} \neq \errorval$.
@ -72,8 +72,8 @@ An expression $\expr \in \exprDomain$ is a formula defined over literals from $\
The expression $\expr$ is evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
(i) Literals and arithmetic are evaluated in the usual way, and (ii) References to the spreadsheet are evaluated recursively
($\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$).
By convention, cyclic references evaluate to the distinguished error value $\errorval$ in $\valueDomain$.
By convention, cyclic references evaluate to $\errorval$.
%
An expression's dependencies ($\depsOf{\expr}$) are the cells referenced by $\expr$.
Dependencies induce a graph $\DG{\spreadsheet}\tuple{N, E}$ over the spreadsheet, with cells as nodes (i.e., $N = \columnDomain \times \rowDomain$), and dependencies as directed edges:
$$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
@ -85,7 +85,7 @@ Note that if all cell expressions are constants (i.e., a spreadsheet without for
\begin{example}
Consider the spreadsheet at the top of \Cref{fig:example-spreadsheet-and-a}.
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds reference cells from columns \emph{A} and \emph{B}.
Evaluating this spreadsheet assigns each cell a concrete value, as in the top right.
Evaluating this spreadsheet assigns each cell a value, as in the top right.
For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{A}{1} + \cellRef{B}{1}} = \evalOf{\spreadsheet}{\cellRef{A}{1}} + \evalOf{\spreadsheet}{\cellRef{B}{1}} = 15 + 50 = 65$.
\end{example}
@ -96,7 +96,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
\begin{minipage}{0.48\linewidth}
\centering
\textbf{Spreadsheet $\spreadsheet$}\\
\textbf{\small Spreadsheet $\spreadsheet$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -111,7 +111,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
%
\begin{minipage}{0.49\linewidth}
\centering
\textbf{Evaluated Spreadsheet $\evalOf{\spreadsheet}{\cdot}$}\\
\textbf{\small Evaluated Spreadsheet $\evalOf{\spreadsheet}{\cdot}$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -133,7 +133,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
%$,$\\ %\vspace{5mm}
\begin{minipage}{0.46\linewidth}
\centering
\textbf{Updated Spreadsheet $\upd(\spreadsheet)$}\\
\textbf{\small Updated Spreadsheet $\upd(\spreadsheet)$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -148,7 +148,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
%
\begin{minipage}{0.49\linewidth}
\centering
\textbf{Evaluated Update $\evalOf{\upd(\spreadsheet)}{\cdot}$}\\
\textbf{\small Evaluated Update $\evalOf{\upd(\spreadsheet)}{\cdot}$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -163,6 +163,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
\vspace{-3mm}
\caption{Example spreadsheet with expressions shown in \textcolor{tabexprcolor}{dark green}, and an update applied to the spreadsheet with updated expressions and values shown in \uv{red}.}\label{fig:example-spreadsheet-and-a}
\trimfigurespacing
% \vspace{1mm}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -171,7 +172,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
\label{sec:updates}
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : (\acu \in \upd)$).
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : ([\acu] \in \upd)$).
Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spreadsheet:
\[
\valat{\upd(\spreadsheet)}{\column}{\row} =
@ -182,19 +183,18 @@ Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spr
\]
An update may affect cells beyond its domain.
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes the expressions in cells \emph{\cellRef{A}{1}} and \emph{\cellRef{C}{3}}.
Evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes two cells (\emph{\cellRef{A}{1}} and \emph{\cellRef{C}{3}}), but evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Spreadsheet Access to Datasets}
\label{sec:spre-access-datas}
To uniformly model source datasets, whether from relational databases or other spreadsheets, we assume an input dataset $\ds$ with a designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
In a relational table, these are the table's columns and values of a key attribute, respectively.
For csv data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row.
$\valat{\ds}{\row}{\column}$ denotes the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
To uniformly model source datasets, whether from relational databases or other spreadsheets, we assume an input dataset $\ds$ with designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
In a relational table, these are the table's columns and the values of a key or rowid attribute, respectively.
For CSV data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row in the file.
We write $\valat{\ds}{\column}{\row}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map that maps rows in $\ds$ into the spreadsheet.
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map from rows in $\ds$ to rows of the spreadsheet.
A \emph{spreadsheet overlay} for a dataset $\ds$ is then a pair $(\ds, \rframe)$ that defines a spreadsheet $\spreadsheet_{\ds, \rframe}$ with domains $\columnDomain = \columnDomain_{\ds}$, $\rowDomain = \dom(\rframe)$ as
$
\valat{\spreadsheet_{\ds, \rframe}}{\column}{\row} = \valat{\ds}{\column}{\rframe^{-1}(\row)}
@ -205,7 +205,7 @@ $
\label{sec:overlay-updates}
An Overlay Update describes a set of changes to a spreadsheet (or dataset).
As we discuss in \Cref{sec:system-presentation}, column operations are purely cosmetic in our model, and we focus on cell and row updates.
As we discuss in \Cref{sec:system-presentation}, column operations are purely cosmetic in our model, and we focus on cell and row updates exclusively.
Concretely, a spreadsheet overlay $\overlay = \aol$ is a reference frame transformation $\rtrans$ and a set of pattern updates $\oup$, terms we now define.
% We now define these terms, and discuss their semantics.
@ -213,7 +213,7 @@ Concretely, a spreadsheet overlay $\overlay = \aol$ is a reference frame transfo
\partitle{Reference Frame Transformations}
Recall that a reference frame maps the spreadsheet's positional row references to native record identifiers.
Thus, to insert, delete, or move rows in the spreadsheet, it is sufficient to modify the reference frame.
Formally, a reference frame transformation $\rtrans$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \errorval$ from an initial set of row positions to a new set of row positions, or the value $\errorval$ for a deleted row.
Formally, a reference frame transformation $\rtrans$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \{\errorval\}$ from initial row positions to new row positions, or the value $\errorval$ for a deleted row.
The new reference frame after applying $\overlay$ is $\rframe' = \rtrans \circ \rframe$, where $\circ$ denotes function composition.
As an example, consider deleting the 2nd row of the spreadsheet from \Cref{fig:example-spreadsheet-and-a}. The positions of rows $3$ and $4$ are decreased by one, while row $1$ retains its position:
$$\rtrans(x) = \begin{cases}
@ -232,11 +232,11 @@ Spreadsheets allow a formula from one cell to be pasted across a range of cells.
In a classical spreadsheet, bulk interactions like this modify each cell's expression individually.
Overlay spreadsheets avoid the high cost that individual modifications can entail by grouping together the set of pasted cells into a single \emph{pattern}.
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product $\columnRange \times [l,h]$ of a set of columns ($\columnRange \subseteq \columnDomain$) and row positions ($R \subset \mathbb{Z}$).
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product $\columnRange \times [l,h]$ of a set of columns ($\columnRange \subseteq \columnDomain$) and row positions ($R = [l, h] \subset \mathbb{Z}$).
%
A pattern update $\oup$ is a set of pairs $\{ (\rangeOf{C_i}{R_i}, \pattern_i) \}$ where $\rangeOf{C_i}{R_i}$ is a range and $\pattern_i$ is a \emph{pattern expression}, i.e., an expression that may also contain cell references where rows are relative offsets (written as $+i$ or $-i$).
Ranges $\rangeOf{C_i}{R_i}$ must be pairwise disjoint.
A pattern update $(\rangeOf{C_i}{R_i}, \pattern_i)$ assigns an expression to every cell $(\column, \row)$ in $\rangeOf{C_i}{R_i}$ by replacing any relative references of the form $(\column, \delta)$ in $\pattern_i$ with $(\column, \row + \delta)$. We use $\pattern_i(\cell)$ to denote the instantiation of pattern $\pattern_i$ for cell $\cell$.
Ranges in an update $\rangeOf{C_i}{R_i}$ must be pairwise disjoint.
A pattern update $(\rangeOf{C_i}{R_i}, \pattern_i)$ assigns an expression to every cell $\cellRef{\column}{\row}$ in $\rangeOf{C_i}{R_i}$ by replacing any relative references of the form $\cellRef{\column}{+\delta}$ in $\pattern_i$ with $\cellRef{\column}{\row + \delta}$. We use $\pattern_i(\cell)$ to denote instantiation of pattern $\pattern_i$ for cell $\cell$.
For instance, to store a running sum of the values in column \emph{C} into column \emph{D} (for the spreadsheet from \Cref{fig:example-spreadsheet-and-a}):\\[-2mm]
%
@ -287,21 +287,21 @@ $\,$\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Semantics for Overlay Updates}
%
An overlay update $\overlay$ appleid to a spreadsheet $\spreadsheet$ defines the spreadsheet $\overlay(\spreadsheet$ computed by applying the reference frame update and then applying all pattern updates:
An overlay update $\overlay$ applied to a spreadsheet $\spreadsheet$ defines the spreadsheet $\overlay(\spreadsheet)$ computed by applying the reference frame update and then applying all pattern updates (with $\overlay = \tuple{\rtrans, \{ (\columnRange_i, \rowRange_i, \pattern_i)\}}$):
\begin{align*}
\valat{\overlay(\spreadsheet)}{\column}{\row} &=
\begin{cases}
\pattern_i(\cellRef{\column}{\row}) & \text{\textbf{if}} \exists i: \cellRef{\column}{\row} \in \rangeOf{C_i}{R_i} \\
\valat{\spreadsheet}{\column}{\rtrans^{-1}(\row)} & \text{\textbf{if}} \exists \row': \rtrans(\row') = \row\\
\pattern_i((\column,\row)) & \text{\textbf{if}} \exists i: (\column,\row) \in \rangeOf{C_i}{R_i} \\
\errorval &\text{\textbf{otherwise}}\\
\end{cases}
\end{align*}
\begin{example}
\label{ex:recursive-running-sum}
Consider our example update ($\overlay_{running} = (\rtrans_{id},\oup_{running})$ where $\rtrans_{id}(x) = x$) to our running example spreadsheet.
\Cref{fig:example-overlay-update} shows the result of applying $\overlay_{running}$
Consider our example update ($\overlay_{running} = (\rtrans_{id},\oup_{running})$ where $\rtrans_{id}(x) = x$).
\Cref{fig:example-overlay-update} shows the result of applying $\overlay_{running}$ to our running example spreadsheet.
\end{example}
Several remarks are in order. First, overlays can be used to encode common spreadsheet update operations in constant space (per update), including bulk updates via copy/paste.


@ -3,7 +3,7 @@
\section{Related Work}
\label{sec:related-work}
Although spreadsheets present a convenient, direct-manipulation interface to data, they lack the scalability to manage large data.
Although spreadsheets present a convenient interface to data, they lack the scalability to manage large data.
A common approach to scaling spreadsheets (the ``virtual'' approach) reformulates the interface to an existing database or workflow system using spreadsheet-style direct manipulation metaphors~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/icde/LiuJ09,freire:2016:hilda:exception,DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/chi/KandelPHH11}.
The resulting systems bear varying levels of resemblance to existing spreadsheets, usually introducing concepts from relational databases like explicit tables, attributes, and records.
%


@ -7,7 +7,7 @@ Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data
Our prototype is designed to accept any Spark dataframe as a data source.
% The prototype's design is illustrated in \Cref{fig:systemdesign}
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates to our simplified spreadsheet model.
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates our simplified model of a spreadsheet to a more natural interface.
An \textbf{Execution} layer is responsible for evaluating spreadsheet cells and materializing values for the viewable set of cells.
An \textbf{Indexing} layer provides efficient access to the updates themselves, and a simple LRU cache provides efficient random access to the source dataframe.
@ -17,7 +17,7 @@ An \textbf{Indexing} layer provides efficient access to the updates themselves,
\subsection{Presentation Layer}
\label{sec:system-presentation}
User-facing client applications connect to the overlay spreadsheet through a presentation layer.
This layer mediates concurrent updates of the spreadsheet, allows clients to subscribe to push-based updates of cell state, and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns, as well as maintaining a bound over the number of rows in the spreadsheet.
This layer mediates concurrent updates of the spreadsheet and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns.
Column operations (insertion, deletion, reordering) are handled at this layer, so lower levels can reference the (comparatively small) set of columns solely by column identity.
Other updates are put into a serial order and relayed to lower levels.
@ -28,7 +28,7 @@ Other updates are put into a serial order and relayed to lower levels.
% \trimfigurespacing
% \end{figure}
The presentation layer expects the level below it to provide (i) efficient random access to cell values, (ii) subscription access to state (e.g., value) updates for ranges of cells.
The presentation layer expects the Executor to provide efficient random access to cell values and support updating ranges of cells with pattern expressions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -36,11 +36,11 @@ The presentation layer expects the level below it to provide (i) efficient rando
\subsection{Executor}
\label{sec:system-executor}
The executor provides efficient access to cell values and notifications about cell state changes.
The executor provides efficient access to cell values and generates notifications about cell state changes.
Cell values are derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) Overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the update index.
(ii) A sequence of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the update index, as discussed below.
% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
@ -51,15 +51,15 @@ These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the upd
The naive approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) computes a topological sort over cell dependencies and evaluates cells in this order.
The Executor side-steps the linear (in the data size) cost of the naive approach through two insights:
(i) Updates are already provided over multiple cells in bulk as patterns, and
(i) Updates applied over multiple cells are already provided by higher layers as patterns, and
(ii) Only a small fraction of cells will be visible at any one time.
Assuming the dependencies of a range of cells can be computed efficiently (we return to this in \Cref{sec:system-index}), only the visible cells and their not visible dependencies need to be evaluated.
Assuming the dependencies of a range of cells can be computed efficiently (we return to this assumption in \Cref{sec:system-index}), only the visible cells and any hidden dependencies need to be evaluated.
The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.
Some dependency chains (e.g., running sums) still require computation for each row of data.
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such pattern expressions can often be rewritten into a closed form.
For example, any cell in a running sum column is equivalent to a sum over the preceding cells.
Our preliminary experiments (\Cref{sec:experiments}) suggest promise in a hybrid evaluation strategy that evaluates visible cells individually and computes hidden pattern cells through closed form aggregate queries.
Our preliminary experiments (\Cref{sec:experiments}) suggest promise in a hybrid evaluation strategy that evaluates visible cells individually and computes cells defined by patterns through closed form aggregate queries.
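
As a worked illustration of this rewrite (a sketch only; the specific pattern is assumed here for illustration and is not taken from the prototype), suppose column \emph{D} holds a running sum over column \emph{C}, with $\cellRef{D}{1} = \cellRef{C}{1}$ and the pattern $\cellRef{D}{\row} = \cellRef{D}{\row-1} + \cellRef{C}{\row}$ for $\row > 1$.
Unrolling the recurrence gives
\begin{align*}
\evalOf{\spreadsheet}{\cellRef{D}{\row}}
  &= \evalOf{\spreadsheet}{\cellRef{D}{\row-1}} + \evalOf{\spreadsheet}{\cellRef{C}{\row}} \\
  &= \evalOf{\spreadsheet}{\cellRef{D}{\row-2}} + \evalOf{\spreadsheet}{\cellRef{C}{\row-1}} + \evalOf{\spreadsheet}{\cellRef{C}{\row}} \\
  &= \sum_{k=1}^{\row} \evalOf{\spreadsheet}{\cellRef{C}{k}}
\end{align*}
Under this assumption, the cell immediately above the first visible row can be seeded with one aggregate over the preceding rows of \emph{C}, rather than by evaluating every hidden cell in the chain.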
\partitle{Updates}
When the executor receives an update to a cell, it uses the index to compute the set of invalidated cells, marks them as ``pending,'' and begins re-evaluating them in topological order.
@ -74,9 +74,9 @@ If a row with dependent cells is deleted, the dependent cells need to be updated
The update index stores a sequence of updates ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the cells of an overlay spreadsheet (denoted $\spreadsheet_\overlay$) where undefined cells have the value $\errorval$.
This entails:
(i) cell expressions $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) the upstream of a range of cells (for topological sort and computing the active set), and
(iii) the downstream of a range of cells (for cell invalidation after an update).
The key insight behind the index is that updates are stored in the form of pattern-range tuples.
(ii) upstream dependencies of a range (for topological sort and computing the active set), and
(iii) downstream dependents of a range (for cell invalidation after an update).
The key insight behind the index is that updates are stored as pattern-range tuples instead of as individual cells.
%As noted above, we assume that the number of columns is small and the number of rows is large.
\begin{figure}
@ -89,11 +89,11 @@ The key insight behind the index is that updates are stored in the form of patte
\partitle{Range Maps}
The update index is built over a one-dimensional range map, an ordered map with integer keys.
In addition to the usual operations of an ordered map (e.g., \texttt{put}, \texttt{get}, \texttt{successorOf}), we define the operation \texttt{bulkPut(low, high, value)} which is equivalent to a \texttt{put} on every element in the range from \texttt{low} to \texttt{high}.
Implemented naively (e.g. a size $N$ binary tree), this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$.
Implemented naively (e.g., a binary tree of size $N$), this operation is $O((\texttt{high}-\texttt{low})\cdot\log(N))$.
A range map avoids the $(\texttt{high}-\texttt{low})$ factor (and correspondingly reduces $N$) by storing an ordered sequence of disjoint ranges, each mapping one specific value as illustrated in \Cref{fig:rangemap}.
A binary tree provides efficient membership lookups over the ranges.
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values), and similar deletion and insertion costs.
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values); deletion and insertion have similar costs.
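
To make the difference concrete (an illustrative calculation with assumed numbers, not a measurement), consider a \texttt{bulkPut} of one pattern over rows $1$ through $10^6$ into a map that already holds $N = 2^{20}$ keys.
The naive ordered map performs on the order of
\[
(\texttt{high}-\texttt{low})\cdot\log_2(N) \approx 10^6 \cdot 20 = 2 \times 10^7
\]
node visits.
The range map instead records the single entry $[1, 10^6] \mapsto \texttt{value}$, at a cost logarithmic in the number of stored ranges, plus the cost of trimming or removing any existing ranges that the new range overlaps.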
\partitle{Cell Access}
The index layer maintains a ``forward'' index: An unordered map $\mathcal I$ that stores a range map $\mathcal I[\column]$ for each column.
@ -133,7 +133,7 @@ For each work item enqueued, we query the forward index to obtain the set of pat
If we discover a new dependency (lines 6-7), the newly discovered range is added to the return set and the work queue.
We will explain lines 10-12 shortly.
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) computes the immediate dependencies of a range of cells $\rangeOf{\column, \rowRange}$ that share a pattern.
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) computes the immediate dependencies of a range of cells $\rangeOf{\column}{\rowRange}$ that share a pattern.
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column}{\rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf{\cell'}$.
The algorithm uses a recursive traversal (lines 6-7) to visit every cell reference (offset or explicit):
For offset references (lines 2-3), the provided range of rows is offset by the appropriate amount.
@ -175,7 +175,7 @@ When the offset is $\pm 1$, the elements of this sequence are efficiently repres
\partitle{Downstream Reachability}
When a cell's expression is updated, cells that depend on it (even transitively) must be recomputed, so the index must support downstream reachability queries.
To efficient downstream lookups, the index maintain as ``backward'' index that relates ranges to the set of patterns that depend on all cells in the range.
For efficient downstream lookups, the index maintains a ``backward'' index relating ranges to the set of patterns that depend on all cells in the range.
The resulting algorithm over the backward index is analogous to $\textbf{getDeps}$.
% \partitle{Column Insertions, Deletions, and Moves}