Smoothing out and trimming the model section. A few other changes for space-saving.

main
Oliver Kennedy 2023-03-27 20:05:07 -04:00
parent 4c1fc46be3
commit 42519ab53b
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
2 changed files with 119 additions and 92 deletions

@ -21,16 +21,17 @@
\newcommand{\rowRange}{R}
\newcommand{\row}{r}
\newcommand{\rowOffset}{\delta}
\newcommand{\cell}{a}
\newcommand{\cell}{\cellRef{\column}{\row}}
\newcommand{\cellPrime}{\cellRef{\column'}{\row'}}
\newcommand{\exprDomain}{\mathcal E}
\newcommand{\expr}{e}
\newcommand{\patternDomain}{\mathcal P}
\newcommand{\pattern}{P}
\newcommand{\valueDomain}{\mathcal V}
\newcommand{\valat}[3]{#1[#2,#3]}
\newcommand{\rangeOf}[2]{(#1\texttt{:}#2)}
\newcommand{\rangeOf}[2]{#1[#2]}
\newcommand{\range}{\rangeOf{\columnRange}{\rowRange}}
\newcommand{\cellRef}[2]{(#1,#2)}
\newcommand{\cellRef}[2]{#1[#2]}
\newcommand{\evalOf}[2]{\left\llbracket\; #2 \;\right\rrbracket_{#1}}
\newcommand{\patternOf}[2]{\left\llbracket\; #1 \;\right\rrbracket_{#2}}
\newcommand{\depsOf}[1]{\textbf{deps}\left(#1\right)}
@ -41,7 +42,7 @@
\newcommand{\se}[1]{{\textcolor{tabexprcolor}{#1}}}
\newcommand{\uv}[1]{{\textcolor{red}{#1}}}
\newcommand{\upd}{U}
\newcommand{\cu}[3]{#1#2 = #3}
\newcommand{\cu}[3]{\cellRef{#1}{#2} = #3}
\newcommand{\cellupdate}{u}
\newcommand{\dom}{\textsc{Dom}}
\newcommand{\acu}{\cu{\column}{\row}{\expr}}
@ -61,25 +62,32 @@
\subsection{Spreadsheets}
\label{sec:spreadsheets}
Let $\columnDomain$ denote a domain of column labels and $\rowDomain \subset \mathbb{Z}$ be a set of row positions, and let $\exprDomain$ and $\valueDomain$ denote a domain of expressions and values; We will define $\exprDomain$ in greater detail below.
We define a \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ as a partial mapping from \emph{cells} ($(\column, \row) \in (\columnDomain \times \mathbb{Z})$) to expressions. We use $\valat{\spreadsheet}{\column}{\row}$ to denote $\spreadsheet((\column, \row))$ and set $\valat{\spreadsheet}{\column}{\row} = \errorval$ if $(\column,\row)$ is not in the domain of $\spreadsheet$ where $\errorval$ is a distinguished error value from $\valueDomain$. For convenience, we will use $\column\row$ as a shorthand for $(\column, \row)$.
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels; unless otherwise noted, we assume $\rowDomain \subset \mathbb Z$.
Let $\valueDomain \subset \exprDomain$ denote the domains of values and expressions, respectively; we define $\exprDomain$ in greater detail below.
We define a \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ as a partial mapping from \emph{cells} ($\cellRef{\column}{\row} \in (\columnDomain \times \rowDomain)$) to expressions.
We use $\valat{\spreadsheet}{\column}{\row}$ to denote $\spreadsheet(\cellRef{\column}{\row})$.
Let $\errorval \in \valueDomain$ indicate ``undefined'' and define the \emph{domain} $\dom(\spreadsheet)$ to be the set of cells $\cellRef{\column}{\row}$ where $\valat{\spreadsheet}{\column}{\row} \neq \errorval$.
An expression $\expr \in \exprDomain$ is a formula defined over literals from $\valueDomain$, the standard arithmetic operators, and references to other cells in the spreadsheet ($\cellRef{\column}{\row}$).
The expression $\expr$ may be evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
(i) Literals evaluate to themselves, (ii) Arithmetic formulas are evaluated in the usual way, and (iii) References to the spreadsheet are evaluated recursively:
$$\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$$
(i) Literals evaluate to themselves, (ii) Arithmetic formulas are evaluated in the usual way, and (iii) References to the spreadsheet are evaluated recursively
($\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$).
By convention, cyclic references evaluate to the distinguished error value $\errorval$ in $\valueDomain$.
We define the dependencies of an expression ($\depsOf{\expr}$) to be the set of cells referenced by $\expr$.
Expression dependencies induce a graph $\DG{\spreadsheet}\tuple{V, E}$ over the spreadsheet, where each cell is a node (i.e., $V = \columnDomain \times \rowDomain$), and each dependency is a (directed) edge:
We define the dependencies of an expression ($\depsOf{\expr}$) as the cells referenced by $\expr$.
Dependencies induce a graph $\DG{\spreadsheet}\tuple{N, E}$ over the spreadsheet, with cells as nodes (i.e., $N = \columnDomain \times \rowDomain$), and dependencies as directed edges:
$$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
\{\;\cell \rightarrow \cell'\;|\;\cell' \in \depsOf{\spreadsheet(\cell)}\;\} $$
We use $\TDG{\spreadsheet}$ to denote the graph $\tuple{V,E^*}$ where $E^*$ is the transitive closure of $E$, i.e., $\TDG{\spreadsheet}$ stores both direct and indirect dependencies among the cells in the spreadsheet.
\{\;\cell \rightarrow \cellPrime\;|\;\cellPrime \in \depsOf{\valat{\spreadsheet}{\column}{\row}}\;\} $$
Denote by $\TDG{\spreadsheet}$ the graph $\tuple{N,E^*}$ where $E^*$ is the transitive closure of $E$ (i.e., $\TDG{\spreadsheet}$ captures both direct and indirect dependencies).
%
Note that if all cell expressions are constants (i.e., a spreadsheet without formulas), then $\evalOf{\spreadsheet}{\cellRef{\column}{\row}} = \valat{\spreadsheet}{\column}{\row}$.
Note that if all cell expressions of a spreadsheet are constants (a spreadsheet without formulas), then $\evalOf{\spreadsheet}{\cellRef{\column}{\row}} = \valat{\spreadsheet}{\column}{\row}$.
As an example, consider the spreadsheet shown on the top of \Cref{fig:example-spreadsheet-and-a}. The expressions in columns \emph{A} and \emph{B} are constant expressions while the expressions in column \emph{C} are arithmetic expressions that references cells from columns \emph{A} and \emph{B}. The result of evaluating the spreadsheet to assign each cell a concrete value is shown on the top right of this figure, e.g., \emph{C1} evaluates to $\evalOf{\spreadsheet}{A1 + B1} = \evalOf{\spreadsheet}{A1} + \evalOf{\spreadsheet}{B1} = 15 + 50 = 65$.
\begin{example}
Consider the spreadsheet at the top of \Cref{fig:example-spreadsheet-and-a}.
Columns \emph{A} and \emph{B} hold constant expressions, while column \emph{C} holds arithmetic expressions referencing cells from columns \emph{A} and \emph{B}.
Evaluating this spreadsheet assigns each cell a concrete value, as in the top right.
For example, \emph{\cellRef{C}{1}} evaluates to $\evalOf{\spreadsheet}{\cellRef{A}{1} + \cellRef{B}{1}} = \evalOf{\spreadsheet}{\cellRef{A}{1}} + \evalOf{\spreadsheet}{\cellRef{B}{1}} = 15 + 50 = 65$.
\end{example}
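\begin{example}
As a sketch of the dependency graph for the same spreadsheet, and assuming (as in the evaluation above) that \emph{\cellRef{C}{1}} holds the expression $\cellRef{A}{1} + \cellRef{B}{1}$, we have
$\depsOf{\cellRef{A}{1} + \cellRef{B}{1}} = \{\cellRef{A}{1}, \cellRef{B}{1}\}$.
The graph $\DG{\spreadsheet}$ therefore contains the edges $\cellRef{C}{1} \rightarrow \cellRef{A}{1}$ and $\cellRef{C}{1} \rightarrow \cellRef{B}{1}$; since \emph{\cellRef{A}{1}} and \emph{\cellRef{B}{1}} hold constants, they have no outgoing edges.
\end{example}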
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
@ -114,16 +122,16 @@ As an example, consider the spreadsheet shown on the top of \Cref{fig:example-sp
\tabhead{4} & 50 & 0 & 50\\ \hline
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{minipage}
$,$\\ \vspace{5mm}
\end{minipage}\\[2mm]
%$,$\\ \vspace{5mm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{minipage}{1 \linewidth}
\centering
\textbf{Update} $\upd = \{ \cu{A}{1}{20}, \cu{C}{3}{2 \cdot A3 + B3} \}$
\end{minipage}
\end{minipage}\\[1mm]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
$,$\\ %\vspace{5mm}
\begin{minipage}{0.48\linewidth}
%$,$\\ %\vspace{5mm}
\begin{minipage}{0.46\linewidth}
\centering
\textbf{Updated Spreadsheet $\upd(\spreadsheet)$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -140,7 +148,7 @@ $,$\\ %\vspace{5mm}
%
\begin{minipage}{0.49\linewidth}
\centering
\textbf{Evaluated Spreadsheet $\evalOf{\upd(\spreadsheet)}{\cdot}$}\\
\textbf{Evaluated Update $\evalOf{\upd(\spreadsheet)}{\cdot}$}\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|}
\cline{2-4}
@ -152,80 +160,90 @@ $,$\\ %\vspace{5mm}
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{minipage}
\caption{Example spreadsheet (expressions are shown in \textcolor{tabexprcolor}{dark green} to distinguish them from values), result of evaluating the spreadsheet, and an update applied to the spreadsheet (updated expressions and values are shown in \uv{red}).}\label{fig:example-spreadsheet-and-a}
\vspace{-3mm}
\caption{Example spreadsheet with expressions shown in \textcolor{tabexprcolor}{dark green}, and an update applied to the spreadsheet with updated expressions and values shown in \uv{red}.}\label{fig:example-spreadsheet-and-a}
\trimfigurespacing
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Updates}
\subsection{Cell Updates}
\label{sec:updates}
An update $\upd$ to a spreadsheet is a set of cell updates $\cellupdate$ of the form $\acu$ which assigns to cell $\column\row$ the expression $\expr$. For an update $\upd$, $\dom(\upd)$ called the domain of the updates contains all cells $\column\row$ such that there exists $\cellupdate \in \upd$ such that $\cellupdate = (\acu)$, i.e., the update modified the expression of cell $\column\row$. Applying an update $\upd$ to a spreadsheet $\spreadsheet$ return an updated spreadsheet $\upd(\spreadsheet)$ defined as:
A cell update set $\upd \subseteq \columnDomain \times \rowDomain \times \exprDomain$ to a spreadsheet is a set of cell updates of the form $\acu$ that assign to cell $\cellRef{\column}{\row}$ the expression $\expr$.
Denote by $\dom(\upd)$ the domain of update $\upd$, containing all cells $\cellRef{\column}{\row}$ defined in $\upd$ (i.e., $\exists \expr : (\acu \in \upd)$).
Applying an update $\upd$ to a spreadsheet $\spreadsheet$ returns an updated spreadsheet:
\[
\valat{\upd(\spreadsheet)}{\column}{\row} =
\begin{cases}
\upd({\column}{\row}) &\text{\textbf{if}}\,\, \column\row \in \dom(\upd)\\
\upd(\cellRef{\column}{\row}) &\text{\textbf{if}}\,\, \cellRef{\column}{\row} \in \dom(\upd)\\
\valat{\spreadsheet}{\column}{\row} &\text{\textbf{otherwise}}\\
\end{cases}
\]
Note that this way to represent updates can be quite verbose, e.g., if inserting a row at some position in the spreadsheet, the expressions of all following rows have to be updated as they are moved down by one row.
We will discuss the more compact representation of updates as overlays used by our approach in \BG{section}
For example, consider the update shown in \Cref{fig:example-spreadsheet-and-a}. This update changes the constant expression in cell \emph{A1} and the arithmetic expression in cell \emph{C3}. When evaluating the resulting spreadsheet $\upd(\spreadsheet)$, the values of three cells (shown in red), the values of three cells are changed.
An update may affect cells beyond its domain.
For example, the update shown in \Cref{fig:example-spreadsheet-and-a} changes the constant expression in cell \emph{\cellRef{A}{1}} and the arithmetic expression in cell \emph{\cellRef{C}{3}}.
Evaluating the updated spreadsheet $\upd(\spreadsheet)$ results in \emph{three} cell changes (in red).
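\begin{example}
To see where the third change presumably comes from, recall from the earlier example that \emph{\cellRef{C}{1}} holds $\cellRef{A}{1} + \cellRef{B}{1}$ and that \emph{\cellRef{B}{1}} evaluates to $50$.
Although $\cellRef{C}{1} \not\in \dom(\upd)$, its value changes because it depends on the updated cell \emph{\cellRef{A}{1}}:
\[
\evalOf{\upd(\spreadsheet)}{\cellRef{C}{1}} = \evalOf{\upd(\spreadsheet)}{\cellRef{A}{1}} + \evalOf{\upd(\spreadsheet)}{\cellRef{B}{1}} = 20 + 50 = 70
\]
\end{example}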
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Spreadsheet Access to Datasets}
\label{sec:spre-access-datas}
To uniformly model spreadsheet access to relational data as well as to data already represented as spreadsheets, we assume an input dataset $\ds$ and let $\columnDomain$ and $\rowDomain$ denote a domain of column and row (respectively) labels for $\ds$.
For a relational table, $\columnDomain$ are the columns of the table and $\rowDomain$ could be the values of a list of key attributes or be $\mathbb{N}$ representing the order of a row on disk. If $\ds$ is a spreadsheet input dataset, then $\rowDomain$ is $\mathbb{N}$ and encodes the position of the row. For a dataset $\ds$ we use $\valat{\ds}{\row}{\column}$ to denote that value of column $\column$ of row $\row$ in $\ds$.
To uniformly model spreadsheet access to relational data as well as to data already represented as spreadsheets, we assume an input dataset $\ds$ with designated column and row labels $\columnDomain_{\ds}$ and $\rowDomain_{\ds}$ as appropriate to the source data.
For example, in a relational table, these can be the table's columns and values of a key or rowid attribute, respectively.
For spreadsheet or CSV data, $\rowDomain_{\ds} \subset \mathbb Z$ can be the position of the row.
We use $\valat{\ds}{\column}{\row}$ to denote the value at column $\column \in \columnDomain_\ds$ of row $\row \in \rowDomain_\ds$ in $\ds$.
A \emph{spreadsheet overlay} for a dataset $\ds$ is a pair $(\ds, \rframe)$ where $\rframe: \rowDomain \to \mathbb{Z}$ called a reference frame is an injective map that determines the positions of rows from $\ds$ in the spreadsheet. Such an overlay defines a spreadsheet $\spreadsheet_{\ds, \rframe}$:
\[
Denote by $\rframe: \rowDomain_{\ds} \to \mathbb{Z}$ a reference frame, an injective map that maps rows in $\ds$ into the spreadsheet.
A \emph{spreadsheet overlay} for a dataset $\ds$ is then a pair $(\ds, \rframe)$ that defines a spreadsheet $\spreadsheet_{\ds, \rframe}$ with domains $\columnDomain = \columnDomain_{\ds}$, $\rowDomain = \dom(\rframe)$ as
$
\valat{\spreadsheet_{\ds, \rframe}}{\column}{\row} = \valat{\ds}{\column}{\rframe^{-1}(\row)}
\]
$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Overlay Updates}
\label{sec:overlay-updates}
We now introduce how to represent updates compactly as \emph{overlays}. An overlay update $\overlay = \aol$ encodes a update $\rtrans$ to the reference frame $\rframe$ of an overlay spreadsheet and a pattern update $\oup$ that compactly encodes updates to the formulas in a range of cells as a templates formula where cell references are expressed as offsets (and is also used to express insertions). We next discuss these components in more detail and then define the result of applying an overlay update $\overlay$ to a spreadsheet overlay $(\rframe, \ds)$. In overlays we make use of ranges to compactly represent insertions and updates.
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product of a set of columns ($\columnRange \subseteq \columnDomain$) and range of row positions ($R = [l,h] \in \mathbb{Z} \times \mathbb{Z}$):
$$\rangeOf{\columnRange}{\rowRange} \equiv \{\;(\column, \row)\;|\;\column\in \columnRange, \row \in \rowRange\;\}$$
%Some times we use only row ranges $R = [l,h]$.
An Overlay Update describes a set of changes to a target spreadsheet (or dataset).
Changes may include cell updates as already discussed, or the insertion, deletion, or reordering of rows or columns.
As we discuss in \Cref{sec:system-presentation}, column operations are purely cosmetic in our model, and so we focus on cell and row updates.
Concretely, an overlay update $\overlay = \aol$ consists of a reference frame transformation $\rtrans$ and a set of pattern updates $\oup$, terms we now define.
% We now define these terms, and discuss their semantics.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Reference Frame Transformations}
The reference frame update $\rtrans$ of an overlay $\overlay$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \errorval$ that updates positions of rows in the result of the update. The value $\errorval$ indicates deletions of rows (rows that do not have a position after the update).
The new reference frame for the spreadsheet overlay after applying $\overlay$ is $\rframe' = \rtrans \circ \mathcal F$, where $\circ$ denotes function composition. As an example, consider inserting a new row between rows 2 and 3 in the spreadsheet from \Cref{fig:example-spreadsheet-and-a}. The positions of rows $3$ and $4$ is increased by one, while rows $1$ and $2$ retrain their position:
Recall that the spreadsheet's positional row references are translated into the native record format of the source dataset through a mapping function called a reference frame.
To insert, delete, or move rows in the spreadsheet, it is sufficient to simply modify the reference frame.
Formally, a reference frame transformation $\rtrans$ is an injective mapping $\mathbb{Z} \to \mathbb{Z} \cup \{\errorval\}$ from an initial set of row positions to a new set of row positions, or to the value $\errorval$ to indicate a deleted row.
The new reference frame for the spreadsheet overlay after applying $\overlay$ is $\rframe' = \rtrans \circ \rframe$, where $\circ$ denotes function composition.
As an example, consider deleting the 2nd row of the spreadsheet from \Cref{fig:example-spreadsheet-and-a}. The positions of rows $3$ and $4$ are decreased by one, while row $1$ retains its position:
$$\rtrans(x) = \begin{cases}
x & \textbf{if } x \leq 2\\
x+1 & \textbf{otherwise}
\end{cases}$$
When a row is deleted then its position is mapped to $\errorval$. For instance, deleting the first row in \Cref{fig:example-spreadsheet-and-a} results in the following reference frame update:
$$\rtrans(x) = \begin{cases}
\errorval & \textbf{if } x =1\\
x & \textbf{if } x < 2\\
\errorval & \textbf{if } x = 2\\
x-1 & \textbf{otherwise}
\end{cases}$$
Note that the common operations of inserting and deleting rows can be expressed as reference frame updates of constant size independent of the size of the spreadsheet.
Row insertions and movement are handled analogously.
Note that row insertions, deletions, and movement are each expressible in constant size, independent of the size of the data.
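\begin{example}
As a quick check of the transformation that deletes the 2nd row, applying $\rtrans$ to the four row positions of the example spreadsheet gives
\[
\rtrans(1) = 1, \quad \rtrans(2) = \errorval, \quad \rtrans(3) = 2, \quad \rtrans(4) = 3
\]
so the rows formerly at positions $3$ and $4$ close the gap left by the deleted row.
\end{example}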
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Pattern Updates}
Spreadsheets allow users to prototype a formula in one cell and then generalize the formula by copying and pasting it into a range of cells.
%\footnote{``Relative" column and row references are updated to be relative to each cell the formula is pasted into.}
Such bulk interactions pose a challenge for state models that maintain an expression for each cell.
For example, a user might paste a formula into an entire column, creating one expression for each row of the dataset.
In lieu of this, an overlay update groups together the set of pasted cells into a single \emph{pattern}.
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product $\columnRange \times \rowRange$ of a set of columns ($\columnRange \subseteq \columnDomain$) and an interval of row positions ($\rowRange = [l,h] \subset \mathbb{Z}$).
%
The pattern update $\oup$ of an overlay update $\overlay$ is a set of pairs $\{ (\rangeOf{C_i}{R_i}, \pattern_i) \}$ where $\rangeOf{C_i}{R_i}$ is a range and $\pattern_i$ is a \emph{pattern formula}, i.e., a formula which may contain both absolute cell references (as in regular formulas) and references where rows are relative offsets (written as $+i$ or $-i$). We required that the ranges $\rangeOf{C_i}{R_i}$ are pairwise disjoint. The semantics of $(\rangeOf{C_i}{R_i}, \pattern_i)$ is that every cell $(\column, \row)$ in $\rangeOf{C_i}{R_i}$ is assigned a formula that is generated by replacing any relative references of the form $(\column, \delta)$ in $\pattern_i$ with $(\column, \row + \delta)$. We use $\pattern_i(\cell)$ to denote the instantiation of pattern $\pattern_i$ for cell $\cell$.
For instance, to store a running sum of the values in column \emph{C} as the cell values in column \emph{D} for the spreadsheet from \Cref{fig:example-spreadsheet-and-a}, we can use the following pattern update:
A pattern update $\oup$ is a set of pairs $\{ (\rangeOf{C_i}{R_i}, \pattern_i) \}$ where $\rangeOf{C_i}{R_i}$ is a range and $\pattern_i$ is a \emph{pattern expression}, i.e., an expression that may also contain cell references where rows are relative offsets (written as $+i$ or $-i$).
Ranges $\rangeOf{C_i}{R_i}$ must be pairwise disjoint.
A pattern update $(\rangeOf{C_i}{R_i}, \pattern_i)$ assigns an expression to every cell $(\column, \row)$ in $\rangeOf{C_i}{R_i}$ by replacing any relative references of the form $(\column, \delta)$ in $\pattern_i$ with $(\column, \row + \delta)$. We use $\pattern_i(\cell)$ to denote the instantiation of pattern $\pattern_i$ for cell $\cell$.
For instance, to store a running sum of the values in column \emph{C} as the cell values in column \emph{D} for the spreadsheet from \Cref{fig:example-spreadsheet-and-a}, we can use the pattern update:\\[-2mm]
%
\[
\oup_{running} = (\rangeOf{D1}{D1}, (C,1)), (\rangeOf{D2}{D4}, (C,+0) + (D,-1))
\oup_{running} = \{\; (\rangeOf{D}{1}, (C,+0)),\; (\rangeOf{D}{2-4}, (C,+0) + (D,-1)) \;\}
\]
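\begin{example}
Instantiating $\oup_{running}$ cell by cell illustrates the replacement rule.
Here $\pattern_1$ and $\pattern_2$ are our own labels for its two pattern expressions (they are not named in the definition):
\begin{align*}
\pattern_1(\cellRef{D}{1}) &= \cellRef{C}{1} &
\pattern_2(\cellRef{D}{2}) &= \cellRef{C}{2} + \cellRef{D}{1}\\
\pattern_2(\cellRef{D}{3}) &= \cellRef{C}{3} + \cellRef{D}{2} &
\pattern_2(\cellRef{D}{4}) &= \cellRef{C}{4} + \cellRef{D}{3}
\end{align*}
Each cell in column \emph{D} thus references the cell to its left and, except in the first row, the cell directly above it.
\end{example}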
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -271,7 +289,7 @@ $\,$\\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Semantics for Overlay Updates}
%
Applying an overlay update $\overlay$ to a spreadsheet $\spreadsheet$ results in an updated spreadsheet $\overlay(\spreadsheet$ computed by applying the reference frame update and then applying all pattern updates:
An overlay update $\overlay$ applied to a spreadsheet $\spreadsheet$ defines the spreadsheet $\overlay(\spreadsheet)$ computed by applying the reference frame update and then applying all pattern updates:
\begin{align*}
\valat{\overlay(\spreadsheet)}{\column}{\row} &=
@ -284,13 +302,17 @@ Applying an overlay update $\overlay$ to a spreadsheet $\spreadsheet$ results i
For example, the result of applying the overlay update $\overlay_{running} = (\rtrans_{id},\oup_{running})$ where $\rtrans_{id}(x) = x$ to our running example spreadsheet is shown in \Cref{fig:example-overlay-update}. The column \emph{D} is filled with new formulas that compute the running sum.
Several remarks are in order. First, note that overlays can be used to encode common spreadsheet update operations in constant space, including . Second, \cite{tang-23-efcsfg} uses similar ideas to compress the dependencies in a spreadsheet using ranges and patterns. However, that work does not compress the updates themselves, but instead generates a compact representation of dependencies in a given spreadsheet.
Several remarks are in order. First, note that overlays can be used to encode common spreadsheet update operations in constant space (per update), including bulk updates via copy/paste.
Second, \cite{tang-23-efcsfg} uses similar ideas to compress the dependencies in a spreadsheet using ranges and patterns, but focuses exclusively on the dependency graph and not on compacting the spreadsheet itself.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Updating Datasets}
\subsection{Replacing Source Data}
\label{sec:updating-datasets}
In addition to the advantage that versions of a spreadsheet can be tracked as sequences of overlay updates, another major advantage of modeling spreadsheets as overlays over datasets and updates as overlays on top of these overlays, is that it enables the updates on top of a spreadsheet overlay for a dataset $\ds$ to be reapplied on top of an updated version $\ds'$ as long as we can consistently identify row labels in $\ds'$ (e.g., if $\rowDomain$ is a semantic key for the dataset this will be possible).
A major advantage of modeling spreadsheets as overlays is that source data may be updated:
an overlay designed for source data $(\ds, \rframe)$ may be applied to a dataset $(\ds', \rframe')$ as long as, for each $\row\in \rowDomain_{\ds}$ that corresponds to some $\row' \in \rowDomain_{\ds'}$, we have $\rframe'(\rframe^{-1}(\row)) = \row'$.
This is possible if, for example, $\rowDomain_{\ds}= \rowDomain_{\ds'}$ is a semantic key for the dataset.
%%% Local Variables:
%%% mode: latex

@ -2,13 +2,6 @@
\section{System Design}
\label{sec:system}
\begin{figure}
\includegraphics[width=0.4\columnwidth]{graphics/system-arch}
\caption{Overlay system design.}
\label{fig:systemdesign}
\trimfigurespacing
\end{figure}
We now outline the design of our prototype overlay spreadsheet, implemented as part of the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and import/export format compatibility.
Our prototype likewise builds on Spark, using any dataframe as a data source.
@ -23,9 +16,18 @@ A simple LRU \textbf{Cache} provides efficient random access to a subset of the
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation Layer}
\label{sec:system-presentation}
Multiple user-facing client applications connect to the overlay spreadsheet through a presentation layer.
This layer mediates concurrent updates of the spreadsheet, allows clients to subscribe to push-based updates of cell state, and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns, as well as maintaining a bound over the number of rows in the spreadsheet.
With the exception of updates to column order, most updates are placed in a serial order and relayed to lower levels.
Operations over columns (insertion, deletion, reordering) are handled at this layer, allowing lower levels to reference the (comparatively small) set of columns by column identity.
With the exception of updates to columns, most updates are coalesced into a serial order and relayed to lower levels.
\begin{figure}
\includegraphics[width=0.4\columnwidth]{graphics/system-arch}
\caption{Overlay system design.}
\label{fig:systemdesign}
\trimfigurespacing
\end{figure}
The presentation layer expects the level below it to provide (i) efficient random access to cell values, (ii) subscription access to state (e.g., value) updates for ranges of cells.
@ -33,46 +35,49 @@ The presentation layer expects the level below it to provide (i) efficient rando
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Executor}
\label{sec:system-executor}
The executor's role is to provide efficient access to cell values and to push notifications about cell state changes.
Cell state is derived from two sources:
The executor provides efficient access to cell values and is responsible for pushing notifications about cell state changes to clients.
Cell values are derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) a series of updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
The executor decouples these sources into a cache around $\spreadsheet_{\ds}$ and an update index that stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$, where $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
(ii) A series of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
The executor decouples these sources into a cache around $\spreadsheet_{\ds}$ and an update index that stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
$$\spreadsheet[\column,\row] = \begin{cases}
\spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
\spreadsheet_{\ds}[\column, (\rtrans_1^{-1} \circ \ldots \circ \rtrans_k^{-1})(\row)] & \textbf{otherwise}
\end{cases}$$
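As an illustrative sketch, assume a single overlay update ($k = 1$) whose $\rtrans_1$ deletes row $2$ (i.e., $\rtrans_1(3) = 2$) and whose pattern updates leave column $A$ untouched, so that $\spreadsheet_\overlay[A, 2] = \errorval$.
The cell at position $2$ then falls through to the source row that $\rtrans_1$ moved into that position:
$$\spreadsheet[A, 2] = \spreadsheet_{\ds}[A, \rtrans_1^{-1}(2)] = \spreadsheet_{\ds}[A, 3]$$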
To actually materialize $\spreadsheet$, we could follow \cite{DBLP:conf/sigmod/BendreWMCP19} to materialize the full set of values by (i) materializing expressions for each cell, (ii) computing a topological sort over the cells in order of dependencies, and (iii) evaluating cells in order of dependencies.
However, this approach has several shortcomings.
First, it may require impractically much memory to materialize every cell's expression individually, particularly when our goal is to allow users to apply expressions in bulk to entire datasets.
Second, given that patterns share a common structure, it can be more efficient to compute a topological sort over blocks of patterns rather than over individual cells.
Finally, materializing every cell's individual value can be just as impractical as materializing their expressions.
The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) computes a topological sort over the full set of cells (in order of dependencies) and evaluates cells in this order.
However, this approach requires first expanding patterns (one per interaction) into individual expressions (one per cell).
Not only does this require a significant amount of memory, it also introduces unnecessary computational complexity:
in many cases, a topological sort can be derived from patterns rather than individual cells.
Moreover, it may be unnecessary to materialize the entire dataset.
The first two points are addressed by the update index, which is expected to provide bulk access to cells.
In addition to allowing bulk updates to ranges of cells (via pattern), the executor assumes that the index can efficiently compute dependencies (upstream), and sets of dependent cells (downstream) for entire ranges at once.
The executor relies on the Index to provide efficient topological sorts, as well as upstream and downstream dependency analysis.
We discuss these challenges below in \Cref{sec:system-index}.
To address the third point, we rely on the observation that in most spreadsheet applications, only a small fraction of cells will be visible at one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} relies on this observation to prioritize evaluation of visible cells).
Instead of materializing the full spreadsheet, the executor relies on user-facing clients to register interest in regions of cells.
We refer to the union of these cells and their upstream (dependencies) as the set of \emph{active} cells.
The executor only materializes active cells.
To address the materialization concern, we rely on the observation that in most spreadsheet applications, only a small fraction of cells will be visible at one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} relies on this observation to prioritize evaluation of visible cells).
Each user-facing client maintains a range of visible rows with the executor.
The executor materializes only visible cells and the transitive closure of their dependencies.
Some dependencies (e.g., as in the running sum column example) may require computations over more rows than fit in cache.
Although we leave a detailed exploration of this challenge to future work, we observe that in many cases the concrete value of any one cell can be rewritten into a closed form.
For example, any given cell in a running sum column may be expressed in terms of the sum of all preceding cells.
Our preliminary experiments show that when a chain of dependencies becomes sufficiently long, bulk computation can be used to provide a more responsive interface.
If the active cells remain confined to a specific set of rows that fit in cache, virtually all accesses can be serviced out of cache.
However, as we discuss below, the active region may scale to the full size of the dataset; we will return to this problem when we discuss future work.
\paragraph{Incremental Updates}
Updates to the reference frame ($\rtrans$) are passed through to both the data source and the update index.
Insertions and moves of rows and columns will not affect any dependencies; Although the effects must be reflected in the materialized view, they do not generally trigger re-evaluation.
Conversely, column or row deletions, as well as cell updates ($\oup$) may affect cells downstream of the deleted or updated cells.
When such an update occurs, the executor uses the index to compute the full downstream of the set of affected cells, places them in a ``pending'' state, and triggers re-evaluation.
When a set of cell updates $\oup$ is applied to the spreadsheet, the executor identifies the set of cells invalidated by the update and triggers their re-evaluation.
When the executor receives an update to a cell, it uses the index to compute the set of invalidated cells, marks them as ``pending,'' and begins re-evaluating them in topological order.
An update to the reference frame is applied to both the index and the data source.
Following typical spreadsheet semantics, an insertion or row move updates references in dependent formulas, so modulo changes in the set of visible rows, no re-evaluation is required.
If a row with dependent cells is deleted, the dependent cells need to be updated to indicate the error.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Update Index}
\label{sec:system-index}
The update index stores a series of update operations ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the resulting overlay spreadsheet ($\spreadsheet_\overlay$):
(i) Access to the expressions for individual cells $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) Computing the upstream of a range of cells (for topological sort and computing the active set), and