Final edits
parent
55f6df55d6
commit
9a819ae5a1
4
main.tex
|
@ -122,14 +122,14 @@
|
|||
%% The abstract is a short summary of the work to be presented in the
|
||||
%% article.
|
||||
\begin{abstract}
|
||||
Spreadsheets provide a convenient, friendly direct manipulation interface to datasets.
|
||||
% Spreadsheets provide a convenient, friendly direct manipulation interface to datasets.
|
||||
Efforts to scale spreadsheets either follow a `virtual' strategy that imposes a spreadsheet interface over an existing database engine or a `materialized' strategy based on re-engineering the spreadsheet engine.
|
||||
Because database engines are not optimized for spreadsheet access patterns, the materialized approach has better performance.
|
||||
However, the virtual approach offers several advantages that cannot be easily replicated in the materialized approach, including the ability to re-apply user interactions to an updated dataset.
|
||||
We propose a hybrid approach, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
|
||||
We introduce the overlay update model, and outline strategies for efficiently accessing an overlay spreadsheet.
|
||||
A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as ``patterns'' that can be leveraged to reduce execution costs.
|
||||
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread, it can significantly reduce execution costs.
|
||||
We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread (a standard materialized-style spreadsheet), it can significantly reduce execution costs.
|
||||
\end{abstract}
|
||||
|
||||
%%
|
||||
|
|
|
@ -54,8 +54,8 @@ We address our questions through a microbenchmark modeled after TPC-H query 1~\c
|
|||
\end{verbatim}
|
||||
}
|
||||
\noindent The \texttt{sum\_charge} column is a running total, creating a dependency chain that grows linearly with row index.
|
||||
As the user scrolls down the page (under normal usage), the runtime to compute individual cells grows linearly.
|
||||
Each system load the spreadsheet with a viewable area of 50 rows and updates a single cell.
|
||||
As the user scrolls down the page (under normal usage), the runtime to compute visible cells grows linearly.
|
||||
Each system loads the spreadsheet with a viewable area of 50 rows and updates a single cell.
|
||||
We measure (i) the cost of initialization and (ii) the cost of a single update.
|
||||
Time is measured until quiescence.
|
||||
To emulate batch processing, we replace the formula for $\texttt{sum\_charge}[i-1]$ (where $i$ is the first visible row) with a formula that computes the analogous aggregate query.
|
||||
|
@ -64,7 +64,7 @@ To emulate batch processing, we replace the formula for the $\texttt{sum\_change
|
|||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\partitle{Moving Viewport}
|
||||
\partitle{Moving View}
|
||||
% \begin{figure}
|
||||
% \includegraphics[width=0.7\columnwidth]{results/desktop-update_one.png}
|
||||
% \vspace*{-4mm}
|
||||
|
|
|
@ -6,11 +6,11 @@ Spreadsheets are a popular tools for data exploration, transformation, and visua
|
|||
rows of data create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
|
||||
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09} relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system. % like Apache Spark.
|
||||
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
|
||||
The approach employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
|
||||
A different approach, employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the % entire
|
||||
spreadsheet runtime and specializes % around
|
||||
database primitives like indexes and incremental maintenance % specialized
|
||||
for spreadsheet access patterns.
|
||||
We refer to these as the virtual and materialized approach, respectively, and illustrate them in \Cref{fig:overlay}.
|
||||
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{figure}
|
||||
|
@ -33,23 +33,23 @@ Similar optimizations are considerably harder in the virtual approach, as the re
|
|||
Although the virtual approach is often less efficient, it does provide capabilities that the materialized approach does not:
|
||||
(i) It is a naturally efficient encoding of the spreadsheet's full version history;
|
||||
(ii) As in Wrangler, the user's actions can be re-applied to new data (e.g., an updated version of the source data); and
|
||||
(iii) As in Vizier, the spreadsheet can be re-encoded as a relational query allowing it to ``plug into'' existing scalable computation platforms (i.e., Spark) and provenance analysis tools (e.g., \cite{kumari:2021:cidr:datasense}).
|
||||
(iii) As in Vizier, the spreadsheet can be re-encoded as a relational query allowing it to ``plug into'' existing scalable computation platforms (e.g., Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15}) and provenance analysis tools (e.g., \cite{kumari:2021:cidr:datasense}).
|
||||
|
||||
We propose an optimized hybrid of the virtual and materialized approaches: \emph{Overlay Spreadsheets}.
|
||||
An Overlay Spreadsheet (\Cref{fig:overlay}) stores presents an interface analogous to a normal spreadsheet.
|
||||
An Overlay Spreadsheet (\Cref{fig:overlay}) presents an interface analogous to a normal spreadsheet.
|
||||
User edits are ``overlaid'' on top of a source dataset that can easily be updated to a new version.
|
||||
As an added benefit, decoupling edits and source data makes it easier to leverage spreadsheet access patterns, reducing the time needed to respond to user actions.
|
||||
|
||||
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal notebook-style workflow system built on Apache Spark.
|
||||
Existing versions of Vizier allow users to define workflow steps through a spreadsheet-style interface; each action adds a new workflow step.
|
||||
In spite of the performance limitations of the virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features are implemented primarily over Spark dataframes.
|
||||
In spite of the performance limitations of this virtual approach, it remains preferable for Vizier, where (i) changes to an early step in the workflow may require automatically re-applying the user's edits, and (ii) fine-grained provenance features rely on encoding data transformations as Spark dataframes.
|
||||
%
|
||||
Our objective in this paper is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting for \textbf{replay and provenance} (i.e., like the virtual approach).
|
||||
Our objective is to demonstrate that a spreadsheet-style interface can provide \textbf{interactive latencies} (i.e., like the materialized approach), while still supporting \textbf{replay and provenance} (i.e., like the virtual approach).
|
||||
|
||||
As a secondary goal, explore the potential for performance improvement resulting from the overlay approach.
|
||||
As a secondary goal, we explore potential performance improvements that the overlay approach enables.
|
||||
Specifically, we observe that bulk updates in a spreadsheet (e.g., pasting a formula across a range of cells) rely on expression ``patterns,''
|
||||
which admit more efficient dependency analysis and bulk computation when intermediate values are not required.
|
||||
This hybrid strategy is akin to optimizations applied in data spread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operating over patterns of updates rather than patterns in the dependency graph.
|
||||
which admit more efficient dependency analysis and bulk computation, when intermediate values are not required.
|
||||
This hybrid strategy is akin to optimizations applied in DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but operates over patterns of updates rather than patterns in the dependency graph, enabling additional optimizations.
|
||||
|
||||
% March 26 by OK: Trimming the ToC summary for space
|
||||
%
|
||||
|
|
|
@ -63,7 +63,7 @@
|
|||
\label{sec:spreadsheets}
|
||||
|
||||
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels. Except where noted, $\rowDomain \subset \mathbb Z$.
|
||||
Let $\valueDomain \subset \exprDomain$ denote domains of values and expressions, respectively.
|
||||
Let $\valueDomain$ and $\exprDomain \supset \valueDomain$ denote domains of values and expressions, respectively.
|
||||
A \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ is a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
|
||||
We use $\valat{\spreadsheet}{\column}{\row}$ to denote $\spreadsheet(\cellRef{\column}{\row})$.
|
||||
Let $\errorval \in \valueDomain$ indicate ``undefined'' and define the \emph{domain} $\dom(\spreadsheet)$ to be the set of cells $\cellRef{\column}{\row}$ where $\valat{\spreadsheet}{\column}{\row} \neq \errorval$.
|
||||
|
@ -72,8 +72,8 @@ An expression $\expr \in \exprDomain$ is a formula defined over literals from $\
|
|||
The expression $\expr$ is evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
|
||||
(i) Literals and arithmetic are evaluated in the usual way, and (ii) References to the spreadsheet are evaluated recursively
|
||||
($\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$).
|
||||
By convention, cyclic references evaluate to the distinguished error value $\errorval$ in $\valueDomain$.
|
||||
|
||||
By convention, cyclic references evaluate to $\errorval$.
|
||||
%
|
||||
An expression's dependencies ($\depsOf{\expr}$) are the cells referenced by $\expr$.
|
||||
Dependencies induce a graph $\DG{\spreadsheet}\tuple{N, E}$ over the spreadsheet, with cells as nodes (i.e., $N = \columnDomain \times \rowDomain$), and dependencies as directed edges:
|
||||
$$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
|
||||
|
@ -85,7 +85,7 @@ Note that if all cell expressions are constants (i.e., a spreadsheet without for
|
|||
\begin{example}
|
||||
Consider the spreadsheet at the top of \Cref{fig:example-spreadsheet-and-a}.
|
||||
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds expressions that reference cells from columns \emph{A} and \emph{B}.
|
||||
Evaluating this spreadsheet assigns each cell a concrete value, as in the top right.
|
||||
Evaluating this spreadsheet assigns each cell a value, as in the top right.
|
||||
For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{A}{1} + \cellRef{B}{1}} = \evalOf{\spreadsheet}{\cellRef{A}{1}} + \evalOf{\spreadsheet}{\cellRef{B}{1}} = 15 + 50 = 65$.
|
||||
\end{example}
|
||||
|
||||
|
@ -96,7 +96,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
|
|||
|
||||
\begin{minipage}{0.48\linewidth}
|
||||
\centering
|
||||
\textbf{Spreadsheet $\spreadsheet$}\\
|
||||
\textbf{\small Spreadsheet $\spreadsheet$}\\
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{tabular}{c|c|c|c|}
|
||||
\cline{2-4}
|
||||
|
@ -111,7 +111,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
|
|||
%
|
||||
\begin{minipage}{0.49\linewidth}
|
||||
\centering
|
||||
\textbf{Evaluated Spreadsheet $\evalOf{\spreadsheet}{\cdot}$}\\
|
||||
\textbf{\small Evaluated Spreadsheet $\evalOf{\spreadsheet}{\cdot}$}\\
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{tabular}{c|c|c|c|}
|
||||
\cline{2-4}
|
||||
|
@ -133,7 +133,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
|
|||
%$,$\\ %\vspace{5mm}
|
||||
\begin{minipage}{0.46\linewidth}
|
||||
\centering
|
||||
\textbf{Updated Spreadsheet $\upd(\spreadsheet)$}\\
|
||||
\textbf{\small Updated Spreadsheet $\upd(\spreadsheet)$}\\
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{tabular}{c|c|c|c|}
|
||||
\cline{2-4}
|
||||
|
@ -148,7 +148,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
|
|||
%
|
||||
\begin{minipage}{0.49\linewidth}
|
||||
\centering
|
||||
\textbf{Evaluated Update $\evalOf{\upd(\spreadsheet)}{\cdot}$}\\
|
||||
\textbf{\small Evaluated Update $\evalOf{\upd(\spreadsheet)}{\cdot}$}\\
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\begin{tabular}{c|c|c|c|}
|
||||
\cline{2-4}
|
||||
|
@ -163,6 +163,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
|
|||
\vspace{-3mm}
|
||||
\caption{Example spreadsheet with expressions shown in \textcolor{tabexprcolor}{dark green}, and an update applied to the spreadsheet with updated expressions and values shown in \uv{red}.}\label{fig:example-spreadsheet-and-a}
|
||||
\trimfigurespacing
|
||||
% \vspace{1mm}
|
||||
\end{figure}
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
||||
|
@ -171,7 +172,7 @@ For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{
|
|||
\label{sec:updates}
|
||||
|
||||
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
|
||||
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : (\acu \in \upd)$).
|
||||
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : ([\acu] \in \upd)$).
|
||||
Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spreadsheet:
|
||||
\[
|
||||
\valat{\upd(\spreadsheet)}{\column}{\row} =
|
||||
|
@ -182,19 +183,18 @@ Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spr
|
|||
\]
|
||||
|
||||
An update may affect cells beyond its domain.
|
||||
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes the expressions in cells \emph{\cellRef{A}{1}} and \emph{\cellRef{C}{3}}.
|
||||
Evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
|
||||
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes two cells, \emph{\cellRef{A}{1}} and \emph{\cellRef{C}{3}}, but evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Spreadsheet Access to Datasets}
|
||||
\label{sec:spre-access-datas}
|
||||
|
||||
To uniformly model source datasets, whether from relational databases or other spreadsheets, we assume an input dataset $\ds$ with a designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
|
||||
In a relational table, these are the table's columns and values of a key attribute, respectively.
|
||||
For csv data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row.
|
||||
$\valat{\ds}{\row}{\column}$ denotes the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
|
||||
To uniformly model source datasets, whether from relational databases or other spreadsheets, we assume an input dataset $\ds$ with designated row and column labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
|
||||
In a relational table, these are the table's columns and the values of a key or rowid attribute, respectively.
|
||||
For CSV data, $\rowDomain_{\ds} \subset \mathbb Z$ is the position of the row in the file.
|
||||
We write $\valat{\ds}{\column}{\row}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
|
||||
|
||||
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map that maps rows in $\ds$ into the spreadsheet.
|
||||
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map from rows in $\ds$ to rows of the spreadsheet.
|
||||
A \emph{spreadsheet overlay} for a dataset $\ds$ is then a pair $(\ds, \rframe)$ that defines a spreadsheet $\spreadsheet_{\ds, \rframe}$ with domains $\columnDomain = \columnDomain_{\ds}$, $\rowDomain = \dom(\rframe)$ as
|
||||
$
|
||||
\valat{\spreadsheet_{\ds, \rframe}}{\column}{\row} = \valat{\ds}{\column}{\rframe^{-1}(\row)}
|
||||
|
@ -205,7 +205,7 @@ $
|
|||
\label{sec:overlay-updates}
|
||||
|
||||
An Overlay Update describes a set of changes to a spreadsheet (or dataset).
|
||||
As we discuss in \Cref{sec:system-presentation}, column operations are purely cosmetic in our model, and we focus on cell and row updates.
|
||||
As we discuss in \Cref{sec:system-presentation}, column operations are purely cosmetic in our model, and we focus on cell and row updates exclusively.
|
||||
Concretely, an overlay update $\overlay = \aol$ consists of a reference frame transformation $\rtrans$ and a set of pattern updates $\oup$, terms we now define.
|
||||
% We now define these terms, and discuss their semantics.
|
||||
|
||||
|
@ -213,7 +213,7 @@ Concretely, a spreadsheet overlay $\overlay = \aol$ is a reference frame transfo
|
|||
\partitle{Reference Frame Transformations}
|
||||
Recall that a reference frame maps the spreadsheet's positional row references to native record identifiers.
|
||||
Thus, to insert, delete, or move rows in the spreadsheet, it is sufficient to modify the reference frame.
|
||||
Formally, a reference frame transformation $\rtrans$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \errorval$ from an initial set of row positions to a new set of row positions, or the value $\errorval$ for a deleted row.
|
||||
Formally, a reference frame transformation $\rtrans$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \errorval$ from initial row positions to new row positions, or the value $\errorval$ for a deleted row.
|
||||
The new reference frame, after applying $\overlay$, is $\rframe' = \rtrans \circ \rframe$, where $\circ$ denotes function composition.
|
||||
As an example, consider deleting the second row of the spreadsheet from \Cref{fig:example-spreadsheet-and-a}. The positions of rows $3$ and $4$ decrease by one, while row $1$ retains its position:
|
||||
$$\rtrans(x) = \begin{cases}
|
||||
|
@ -232,11 +232,11 @@ Spreadsheets allow a formula from one cell to be pasted across a range of cells.
|
|||
In a classical spreadsheet, bulk interactions like this modify each cell's expression individually.
|
||||
Overlay spreadsheets avoid the high cost that individual modifications can entail by grouping together the set of pasted cells into a single \emph{pattern}.
|
||||
|
||||
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product $\columnRange \times [l,h]$ of a set of columns ($\columnRange \subseteq \columnDomain$) and row positions ($R \subset \mathbb{Z}$).
|
||||
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product $\columnRange \times [l,h]$ of a set of columns ($\columnRange \subseteq \columnDomain$) and row positions ($R = [l, h] \subset \mathbb{Z}$).
|
||||
%
|
||||
A pattern update $\oup$ is a set of pairs $\{ (\rangeOf{C_i}{R_i}, \pattern_i) \}$ where $\rangeOf{C_i}{R_i}$ is a range and $\pattern_i$ is a \emph{pattern expression}, i.e., an expression that may also contain cell references where rows are relative offsets (written as $+i$ or $-i$).
|
||||
Ranges $\rangeOf{C_i}{R_i}$ must be pairwise disjoint.
|
||||
A pattern update $(\rangeOf{C_i}{R_i}, \pattern_i)$ assigns an expression to every cell $(\column, \row)$ in $\rangeOf{C_i}{R_i}$ by replacing any relative references of the form $(\column, \delta)$ in $\pattern_i$ with $(\column, \row + \delta)$. We use $\pattern_i(\cell)$ to denote the instantiation of pattern $\pattern_i$ for cell $\cell$.
|
||||
Ranges in an update $\rangeOf{C_i}{R_i}$ must be pairwise disjoint.
|
||||
A pattern update $(\rangeOf{C_i}{R_i}, \pattern_i)$ assigns an expression to every cell $\cellRef{\column}{\row}$ in $\rangeOf{C_i}{R_i}$ by replacing any relative references of the form $\cellRef{\column}{+\delta}$ in $\pattern_i$ with $\cellRef{\column}{\row + \delta}$. We use $\pattern_i(\cell)$ to denote instantiation of pattern $\pattern_i$ for cell $\cell$.
|
||||
|
||||
For instance, to store a running sum of the values in column \emph{C} into column \emph{D} (for the spreadsheet from \Cref{fig:example-spreadsheet-and-a}):\\[-2mm]
|
||||
%
|
||||
|
@ -287,21 +287,21 @@ $\,$\\
|
|||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\partitle{Semantics for Overlay Updates}
|
||||
%
|
||||
An overlay update $\overlay$ appleid to a spreadsheet $\spreadsheet$ defines the spreadsheet $\overlay(\spreadsheet$ computed by applying the reference frame update and then applying all pattern updates:
|
||||
An overlay update $\overlay$ applied to a spreadsheet $\spreadsheet$ defines the spreadsheet $\overlay(\spreadsheet)$ computed by applying the reference frame update and then applying all pattern updates (with $\overlay = \tuple{\rtrans, \{ (\columnRange_i, \rowRange_i, \pattern_i)\}})$:
|
||||
|
||||
\begin{align*}
|
||||
\valat{\overlay(\spreadsheet)}{\column}{\row} &=
|
||||
\begin{cases}
|
||||
\pattern_i(\cellRef{\column}{\row}) & \text{\textbf{if}} \exists i: \cellRef{\column}{\row} \in \rangeOf{C_i}{R_i} \\
|
||||
\valat{\spreadsheet}{\column}{\rtrans^{-1}(\row)} & \text{\textbf{if}} \exists \row': \rtrans(\row') = \row\\
|
||||
\pattern_i((\column,\row)) & \text{\textbf{if}} \exists i: (\column,\row) \in \rangeOf{C_i}{R_i} \\
|
||||
\errorval &\text{\textbf{otherwise}}\\
|
||||
\end{cases}
|
||||
\end{align*}
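% Illustrative worked instance (not part of the original text); it assumes an overlay with no pattern updates and a four-row spreadsheet.
As a sketch of how these semantics play out, consider the row deletion described earlier as an overlay with no pattern updates, i.e., $\overlay_{del} = \tuple{\rtrans, \emptyset}$ with $\rtrans(1) = 1$, $\rtrans(2) = \errorval$, $\rtrans(3) = 2$, and $\rtrans(4) = 3$ (the name $\overlay_{del}$ is introduced here purely for illustration).
Since no range matches any cell, every cell falls through to the reference frame case:
\begin{align*}
\valat{\overlay_{del}(\spreadsheet)}{\column}{1} &= \valat{\spreadsheet}{\column}{\rtrans^{-1}(1)} = \valat{\spreadsheet}{\column}{1}\\
\valat{\overlay_{del}(\spreadsheet)}{\column}{2} &= \valat{\spreadsheet}{\column}{\rtrans^{-1}(2)} = \valat{\spreadsheet}{\column}{3}\\
\valat{\overlay_{del}(\spreadsheet)}{\column}{3} &= \valat{\spreadsheet}{\column}{\rtrans^{-1}(3)} = \valat{\spreadsheet}{\column}{4}
\end{align*}
Position $4$ is left undefined ($\errorval$), assuming the original spreadsheet has only four rows.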
|
||||
|
||||
\begin{example}
|
||||
\label{ex:recursive-running-sum}
|
||||
Consider our example update ($\overlay_{running} = (\rtrans_{id},\oup_{running})$ where $\rtrans_{id}(x) = x$) to our running example spreadsheet.
|
||||
\Cref{fig:example-overlay-update} shows the result of applying $\overlay_{running}$
|
||||
Consider our example update ($\overlay_{running} = (\rtrans_{id},\oup_{running})$ where $\rtrans_{id}(x) = x$).
|
||||
\Cref{fig:example-overlay-update} shows the result of applying $\overlay_{running}$ to our running example spreadsheet.
|
||||
\end{example}
|
||||
|
||||
Several remarks are in order. First, overlays can be used to encode common spreadsheet update operations in constant space (per update), including bulk updates via copy/paste.
|
||||
|
|
|
@ -3,7 +3,7 @@
|
|||
\section{Related Work}
|
||||
\label{sec:related-work}
|
||||
|
||||
Although spreadsheets present a convenient, direct-manipulation interface to data, they lack the scalability to manage large data.
|
||||
Although spreadsheets present a convenient interface to data, they lack the scalability to manage large data.
|
||||
A common approach to scaling spreadsheets (the ``virtual'' approach) reformulates the interface to an existing database or workflow system using spreadsheet-style direct manipulation metaphors~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/icde/LiuJ09,freire:2016:hilda:exception,DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/chi/KandelPHH11}.
|
||||
The resulting systems bear varying levels of resemblance to existing spreadsheets, usually introducing concepts from relational databases like explicit tables, attributes, and records.
|
||||
%
|
||||
|
|
|
@ -7,7 +7,7 @@ Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data
|
|||
Our prototype is designed to accept any Spark dataframe as a data source.
|
||||
|
||||
% The prototype's design is illustrated in \Cref{fig:systemdesign}
|
||||
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates to our simplified spreadsheet model.
|
||||
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates our simplified model of a spreadsheet to a more natural interface.
|
||||
An \textbf{Execution} layer is responsible for evaluating spreadsheet cells and materializing values for the viewable set of cells.
|
||||
An \textbf{Indexing} layer provides efficient access to the updates themselves, and a simple LRU cache provides efficient random access to the source dataframe.
|
||||
|
||||
|
@ -17,7 +17,7 @@ An \textbf{Indexing} layer provides efficient access to the updates themselves,
|
|||
\subsection{Presentation Layer}
|
||||
\label{sec:system-presentation}
|
||||
User-facing client applications connect to the overlay spreadsheet through a presentation layer.
|
||||
This layer mediates concurrent updates of the spreadsheet, allows clients to subscribe to push-based updates of cell state, and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns, as well as maintaining a bound over the number of rows in the spreadsheet.
|
||||
This layer mediates concurrent updates of the spreadsheet and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns.
|
||||
Column operations (insertion, deletion, reordering) are handled at this layer, so lower levels can reference the (comparatively small) set of columns solely by column identity.
|
||||
Other updates are put into a serial order and relayed to lower levels.
|
||||
|
||||
|
@ -28,7 +28,7 @@ Other updates are put into a serial order and relayed to lower levels.
|
|||
% \trimfigurespacing
|
||||
% \end{figure}
|
||||
|
||||
The presentation layer expects the level below it to provide (i) efficient random access to cell values, (ii) subscription access to state (e.g., value) updates for ranges of cells.
|
||||
The presentation layer expects the Executor to provide efficient random access to cell values and support updating ranges of cells with pattern expressions.
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
|
@ -36,11 +36,11 @@ The presentation layer expects the level below it to provide (i) efficient rando
|
|||
\subsection{Executor}
|
||||
\label{sec:system-executor}
|
||||
|
||||
The executor provides efficient access to cell values and notifications about cell state changes.
|
||||
The executor provides efficient access to cell values and generates notifications about cell state changes.
|
||||
Cell values are derived from two sources:
|
||||
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
|
||||
(ii) Overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
|
||||
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the update index.
|
||||
(ii) A sequence of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
|
||||
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the update index, as discussed below.
|
||||
% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
|
||||
% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
|
||||
% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
|
||||
|
@ -51,15 +51,15 @@ These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the upd
|
|||
|
||||
The naive approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) computes a topological sort over cell dependencies and evaluates cells in this order.
|
||||
The Executor side-steps the linear (in the data size) cost of the naive approach through two insights:
|
||||
(i) Updates are already provided over multiple cells in bulk as patterns, and
|
||||
(i) Updates applied over multiple cells are already provided by higher layers as patterns, and
|
||||
(ii) Only a small fraction of cells will be visible at any one time.
|
||||
Assuming the dependencies of a range of cells can be computed efficiently (we return to this in \Cref{sec:system-index}), only the visible cells and their not visible dependencies need to be evaluated.
|
||||
Assuming the dependencies of a range of cells can be computed efficiently (we return to this assumption in \Cref{sec:system-index}), only the visible cells and any hidden dependencies need to be evaluated.
|
||||
The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.
|
||||
|
||||
Some dependency chains (e.g., running sums) still require computation for each row of data.
|
||||
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such pattern expressions can often be rewritten into a closed form.
|
||||
For example, any cell in a running sum column is equivalent to a sum over the preceding cells.
|
||||
Our preliminary experiments (\Cref{sec:experiments}) suggest promise in a hybrid evaluation strategy that evaluates visible cells individually and computes hidden pattern cells through closed form aggregate queries.
|
||||
Our preliminary experiments (\Cref{sec:experiments}) suggest promise in a hybrid evaluation strategy that evaluates visible cells individually and computes cells defined by patterns through closed form aggregate queries.
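% Illustrative derivation (not part of the original text); the concrete pattern and base case below are assumptions.
To sketch what such a closed form might look like, assume a running sum over column \emph{C} stored in column \emph{D}, where \emph{\cellRef{D}{1}} holds the expression $\cellRef{C}{1}$ and every later row $\row > 1$ holds the instantiated pattern $\cellRef{C}{\row} + \cellRef{D}{\row - 1}$.
Unrolling the recursive reference gives
\begin{align*}
\evalOf{\spreadsheet}{\cellRef{D}{\row}}
  &= \evalOf{\spreadsheet}{\cellRef{C}{\row}} + \evalOf{\spreadsheet}{\cellRef{D}{\row - 1}} \\
  &= \evalOf{\spreadsheet}{\cellRef{C}{\row}} + \evalOf{\spreadsheet}{\cellRef{C}{\row - 1}} + \evalOf{\spreadsheet}{\cellRef{D}{\row - 2}} \\
  &= \ldots = \sum_{k=1}^{\row} \evalOf{\spreadsheet}{\cellRef{C}{k}}.
\end{align*}
Under these assumptions, the value of $\cellRef{D}{\row}$ can be obtained from a single aggregate over rows $1$ through $\row$ of column \emph{C}, rather than by evaluating a dependency chain of $\row$ cells one at a time.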
|
||||
|
||||
\partitle{Updates}
|
||||
When the executor receives an update to a cell, it uses the index to compute the set of invalidated cells, marks them as ``pending,'' and begins re-evaluating them in topological order.
|
||||
|
@ -74,9 +74,9 @@ If a row with dependent cells is deleted, the dependent cells need to be updated
|
|||
The update index stores a sequence of updates ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the cells of an overlay spreadsheet (denoted $\spreadsheet_\overlay$) where undefined cells have the value $\errorval$.
|
||||
This entails:
|
||||
(i) cell expressions $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
|
||||
(ii) the upstream of a range of cells (for topological sort and computing the active set), and
|
||||
(iii) the downstream of a range of cells (for cell invalidation after an update).
|
||||
The key insight behind the index is that updates are stored in the form of pattern-range tuples.
|
||||
(ii) upstream dependencies of a range (for topological sort and computing the active set), and
|
||||
(iii) downstream dependents of a range (for cell invalidation after an update).
|
||||
The key insight behind the index is that updates are stored as pattern-range tuples instead of as individual cells.
|
||||
%As noted above, we assume that the number of columns is small and the number of rows is large.
|
||||
|
||||
\begin{figure}
|
||||
|
@ -89,11 +89,11 @@ The key insight behind the index is that updates are stored in the form of patte
|
|||
\partitle{Range Maps}
|
||||
The update index is built over a one-dimensional range map, an ordered map with integer keys.
|
||||
In addition to the usual operations of an ordered map (e.g., \texttt{put}, \texttt{get}, \texttt{successorOf}), we define the operation \texttt{bulkPut(low, high, value)} which is equivalent to a \texttt{put} on every element in the range from \texttt{low} to \texttt{high}.
|
||||
Implemented naively (e.g. a size $N$ binary tree), this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$.
|
||||
Implemented naively (e.g., with a binary tree of size $N$), this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$ time.
|
||||
|
||||
A range map avoids the $(\texttt{high}-\texttt{low})$ factor (and correspondingly reduces $N$) by storing an ordered sequence of disjoint ranges, each mapping one specific value as illustrated in \Cref{fig:rangemap}.
|
||||
A binary tree provides efficient membership lookups over the ranges.
|
||||
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values), and similar deletion and insertion costs.
|
||||
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values); insertions and deletions have similar costs.
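A minimal sketch of a range map follows, to make the \texttt{bulkPut} behavior concrete. It is an illustration under stated assumptions, not the paper's data structure: ranges are inclusive, the class and method names are hypothetical, adjacent ranges with equal values are not merged, and sorted Python lists stand in for the binary tree (so lookups are logarithmic via \texttt{bisect}, but insertion here is linear in the number of stored ranges).

```python
import bisect


class RangeMap:
    """Ordered map over integer keys, stored as disjoint inclusive ranges."""

    def __init__(self):
        self._lows = []    # sorted lower bounds of the stored ranges
        self._highs = []   # inclusive upper bounds, parallel to _lows
        self._values = []  # value of each stored range

    def _overlap(self, low, high):
        # Index span [i, j) of the stored ranges that intersect [low, high].
        i = bisect.bisect_left(self._highs, low)
        j = bisect.bisect_right(self._lows, high)
        return i, j

    def bulk_put(self, low, high, value):
        """Map every key in [low, high] to value with one range entry."""
        i, j = self._overlap(low, high)
        new = []
        if i < j and self._lows[i] < low:
            # Keep the part of the first overlapped range left of `low`.
            new.append((self._lows[i], low - 1, self._values[i]))
        new.append((low, high, value))
        if i < j and self._highs[j - 1] > high:
            # Keep the part of the last overlapped range right of `high`.
            new.append((high + 1, self._highs[j - 1], self._values[j - 1]))
        self._lows[i:j] = [n[0] for n in new]
        self._highs[i:j] = [n[1] for n in new]
        self._values[i:j] = [n[2] for n in new]

    def get(self, key):
        """Return the value mapped to key, or None if key is unmapped."""
        i = bisect.bisect_right(self._lows, key) - 1
        if i >= 0 and self._highs[i] >= key:
            return self._values[i]
        return None

    def values_in(self, low, high):
        """Distinct values appearing in [low, high], in key order."""
        i, j = self._overlap(low, high)
        return list(dict.fromkeys(self._values[i:j]))


m = RangeMap()
m.bulk_put(1, 100, "A")   # one entry covers 100 keys
m.bulk_put(40, 60, "B")   # splits the entry into three ranges
```

After these two calls the map holds three ranges (rows 1 to 39 and 61 to 100 mapped to \texttt{"A"}, rows 40 to 60 mapped to \texttt{"B"}), regardless of how many keys each range covers.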
|
||||
|
||||
\partitle{Cell Access}
|
||||
The index layer maintains a ``forward'' index: An unordered map $\mathcal I$ that stores a range map $\mathcal I[\column]$ for each column.
|
||||
|
@ -133,7 +133,7 @@ For each work item enqueued, we query the forward index to obtain the set of pat
|
|||
If we discover a new dependency (lines 6-7), the newly discovered range is added to the return set and the work queue.
|
||||
We will explain lines 10-12 shortly.
|
||||
|
||||
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) computes the immediate dependencies of a range of cells $\rangeOf{\column, \rowRange}$ that share a pattern.
|
||||
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) computes the immediate dependencies of a range of cells $\rangeOf{\column}{\rowRange}$ that share a pattern.
|
||||
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column}{\rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf{\cell'}$.
|
||||
The algorithm uses a recursive traversal (lines 6-7) to visit every cell reference (offset or explicit):
|
||||
For offset references (lines 2-3), the provided range of rows is offset by the appropriate amount.
|
||||
|
@ -175,7 +175,7 @@ When the offset is $\pm 1$, the elements of this sequence are efficiently repres
|
|||
|
||||
\partitle{Downstream Reachability}
|
||||
When a cell's expression is updated, cells that depend on it (even transitively) must be recomputed, so the index must support downstream reachability queries.
|
||||
To efficient downstream lookups, the index maintain as ``backward'' index that relates ranges to the set of patterns that depend on all cells in the range.
|
||||
For efficient downstream lookups, the index maintains a ``backward'' index relating ranges to the set of patterns that depend on all cells in the range.
|
||||
The resulting algorithm over the backward index is analogous to $\textbf{getDeps}$.
|
||||
|
||||
% \partitle{Column Insertions, Deletions, and Moves}
|
||||
|
|