main
Boris Glavic 2023-03-19 19:41:23 -05:00
parent 4e74337553
commit d9882a068f
2 changed files with 116 additions and 4 deletions


@ -2,6 +2,7 @@
\newcommand{\tuple}[1]{\left<#1\right>}
\newcommand{\partitle}[1]{\medskip\noindent\textbf{#1.}}
\usepackage{todonotes}
% \usepackage[disable]{todonotes}


@ -37,7 +37,6 @@
\newcommand{\DG}[1]{\ensuremath{G_{#1}}\xspace}
\newcommand{\TDG}[1]{\ensuremath{G_{#1}^*}\xspace}
\newcommand{\errorval}{\ensuremath{\bot}\xspace}
\newcommand{\rframe}{\ensuremath{\mathcal{F}}\xspace}
\newcommand{\se}[1]{{\textcolor{tabexprcolor}{#1}}}
\newcommand{\uv}[1]{{\textcolor{red}{#1}}}
\newcommand{\upd}{U}
@ -46,6 +45,17 @@
\newcommand{\dom}{\textsc{Dom}}
\newcommand{\acu}{\cu{\column}{\row}{\expr}}
\newcommand{\rframe}{\ensuremath{\mathcal{F}}\xspace}
\newcommand{\rtrans}{\ensuremath{\mathcal{T}}\xspace}
\newcommand{\overlay}{\mathcal{O}}
\newcommand{\ol}[2]{\tuple{#1,#2}}
\newcommand{\oins}{\mathcal{I}}
\newcommand{\oup}{\mathcal{U}}
\newcommand{\aol}{\ol{\rtrans}{\oup}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Spreadsheets}
\label{sec:spreadsheets}
@ -65,8 +75,6 @@ $$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
\{\;\cell \rightarrow \cell'\;|\;\cell' \in \depsOf{\spreadsheet(\cell)}\;\} $$
We use $\TDG{\spreadsheet}$ to denote the graph $\tuple{V,E^*}$ where $E^*$ is the transitive closure of $E$, i.e., $\TDG{\spreadsheet}$ stores both direct and indirect dependencies among the cells in the spreadsheet.
We define a \emph{range} $\rangeOf{\columnRange}{\rowRange}$ as the Cartesian product of a set of columns ($\columnRange \subseteq \columnDomain$) and a range of row positions ($\rowRange = [l,h]$ with $l, h \in \mathbb{Z}$).
$$\rangeOf{\columnRange}{\rowRange} \equiv \{\;(\column, \row)\;|\;\column\in \columnRange, \row \in \rowRange\;\}$$
Note that if all cell expressions of a spreadsheet are constants (a spreadsheet without formulas), then $\evalOf{\spreadsheet}{\cellRef{\column}{\row}} = \valat{\spreadsheet}{\column}{\row}$.
@ -173,13 +181,116 @@ For example, consider the update shown in \Cref{fig:example-spreadsheet-and-a}.
To uniformly model spreadsheet access to relational data as well as to data already represented as spreadsheets, we assume an input dataset $\ds$ and let $\columnDomain$ and $\rowDomain$ denote the domains of column and row labels, respectively, for $\ds$.
For a relational table, $\columnDomain$ consists of the columns of the table, and $\rowDomain$ could be the values of a list of key attributes or $\mathbb{N}$, representing the order of a row on disk. If $\ds$ is a spreadsheet input dataset, then $\rowDomain$ is $\mathbb{N}$ and encodes the position of the row. For a dataset $\ds$ we use $\valat{\ds}{\column}{\row}$ to denote the value of column $\column$ of row $\row$ in $\ds$.
A \emph{spreadsheet overlay} for a dataset $\ds$ is a pair $(\ds, \rframe)$ where $\rframe: \rowDomain \to \mathbb{Z}$, called a \emph{reference frame}, is an injective map that determines the positions of rows from $\ds$ in the spreadsheet. Such an overlay defines a spreadsheet $\spreadsheet_{\ds, \rframe}$:
\[
\valat{\spreadsheet_{\ds, \rframe}}{\column}{\row} = \valat{\ds}{\column}{\rframe^{-1}(\row)}
\]
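As a small illustration, assume a hypothetical dataset $\ds$ with two rows labeled $r_1$ and $r_2$ (these labels are not part of the running example) and the reference frame $\rframe(r_1) = 2$, $\rframe(r_2) = 1$. Then $\rframe^{-1}(1) = r_2$ and $\rframe^{-1}(2) = r_1$, so for a column $A$ the overlay would display the rows of $\ds$ in reverse order:
\[
\valat{\spreadsheet_{\ds, \rframe}}{A}{1} = \valat{\ds}{A}{r_2}, \qquad \valat{\spreadsheet_{\ds, \rframe}}{A}{2} = \valat{\ds}{A}{r_1}
\]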
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Overlay Updates}
\label{sec:overlay-updates}
We now describe how to represent updates compactly as \emph{overlays}. An overlay update $\overlay = \aol$ consists of an update $\rtrans$ to the reference frame $\rframe$ of an overlay spreadsheet and a pattern update $\oup$. The pattern update compactly encodes updates to the formulas in a range of cells as a template formula in which cell references are expressed as offsets; it is also used to express insertions. We next discuss these components in more detail and then define the result of applying an overlay update $\overlay$ to a spreadsheet overlay $(\ds, \rframe)$. Overlays use ranges to compactly represent insertions and updates.
A \emph{range} $\rangeOf{\columnRange}{\rowRange}$ is the Cartesian product of a set of columns ($\columnRange \subseteq \columnDomain$) and a range of row positions ($\rowRange = [l,h]$ with $l, h \in \mathbb{Z}$):
$$\rangeOf{\columnRange}{\rowRange} \equiv \{\;(\column, \row)\;|\;\column\in \columnRange, \row \in \rowRange\;\}$$
%Some times we use only row ranges $R = [l,h]$.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Reference Frame Transformations}
The reference frame update $\rtrans$ of an overlay $\overlay$ is a mapping $\mathbb{Z} \to \mathbb{Z} \cup \{\errorval\}$ that assigns each row position its position in the result of the update and is injective on the positions it does not map to $\errorval$. The value $\errorval$ indicates deleted rows (rows that do not have a position after the update).
The new reference frame for the spreadsheet overlay after applying $\overlay$ is $\rframe' = \rtrans \circ \rframe$, where $\circ$ denotes function composition. As an example, consider inserting a new row between rows 2 and 3 in the spreadsheet from \Cref{fig:example-spreadsheet-and-a}. The positions of rows $3$ and $4$ are increased by one, while rows $1$ and $2$ retain their positions:
$$\rtrans(x) = \begin{cases}
x & \textbf{if } x \leq 2\\
x+1 & \textbf{otherwise}
\end{cases}$$
When a row is deleted, its position is mapped to $\errorval$. For instance, deleting the first row in \Cref{fig:example-spreadsheet-and-a} results in the following reference frame update:
$$\rtrans(x) = \begin{cases}
\errorval & \textbf{if } x =1\\
x-1 & \textbf{otherwise}
\end{cases}$$
Note that the common operations of inserting and deleting rows can be expressed as reference frame updates of constant size independent of the size of the spreadsheet.
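To illustrate the composition, assume (hypothetically) that the four rows of the example carry labels $r_1, \ldots, r_4$ and that the current reference frame is $\rframe(r_i) = i$. Under the insertion update above, the new frame $\rframe' = \rtrans \circ \rframe$ would be
$$\rframe'(r_1) = 1, \quad \rframe'(r_2) = 2, \quad \rframe'(r_3) = 4, \quad \rframe'(r_4) = 5,$$
which leaves position $3$ unoccupied for the inserted row. Under the deletion update, we would instead obtain $\rframe'(r_1) = \errorval$, $\rframe'(r_2) = 1$, $\rframe'(r_3) = 2$, and $\rframe'(r_4) = 3$.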
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Pattern Updates}
%
The pattern update $\oup$ of an overlay update $\overlay$ is a set of pairs $\{ (\rangeOf{C_i}{R_i}, \pattern_i) \}$ where $\rangeOf{C_i}{R_i}$ is a range and $\pattern_i$ is a \emph{pattern formula}, i.e., a formula that may contain both absolute cell references (as in regular formulas) and references whose rows are relative offsets (written as $+\delta$ or $-\delta$). We require that the ranges $\rangeOf{C_i}{R_i}$ are pairwise disjoint. The semantics of $(\rangeOf{C_i}{R_i}, \pattern_i)$ is that every cell $(\column, \row)$ in $\rangeOf{C_i}{R_i}$ is assigned the formula generated by replacing every relative reference $(\column', \delta)$ in $\pattern_i$ with the absolute reference $(\column', \row + \delta)$. We use $\pattern_i(\cell)$ to denote the instantiation of pattern $\pattern_i$ for cell $\cell$.
For instance, to store a rolling sum of the values in column \emph{C} as the cell values in column \emph{D} for the spreadsheet from \Cref{fig:example-spreadsheet-and-a}, we can use the following pattern update:
\[
\oup_{rolling} = \{\; (\rangeOf{D1}{D1}, (C,1)),\; (\rangeOf{D2}{D4}, (C,+0) + (D,-1)) \;\}
\]
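For illustration, write $\pattern_2 = (C,+0) + (D,-1)$ for the second pattern formula of $\oup_{rolling}$ (the index is our notation for this example only). Instantiating it for the cells $(D,3)$ and $(D,4)$ replaces each offset $\delta$ by the row $\row + \delta$:
\[
\pattern_2((D,3)) = (C,3) + (D,2), \qquad \pattern_2((D,4)) = (C,4) + (D,3)
\]
These are the formulas \emph{C3 + D2} and \emph{C4 + D3} shown in column \emph{D} of \Cref{fig:example-overlay-update}.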
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
\newcommand{\tabhead}[1]{\cellcolor{black}{\textcolor{white}{#1}}}
\begin{minipage}{1\linewidth}
\centering
\textbf{Spreadsheet $\spreadsheet$} \\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|c|}
\cline{2-5}
& \tabhead{A} & \tabhead{B} & \tabhead{C} & \tabhead{D} \\ \hline
\tabhead{1} & \se{15} & \se{50} & \se{A1 + B1} & \uv{C1} \\ \hline
\tabhead{2} & \se{20} & \se{60} & \se{A2 + B2} & \uv{C2 + D1} \\ \hline
\tabhead{3} & \se{25} & \se{100} & \se{A3 + B3} & \uv{C3 + D2} \\ \hline
\tabhead{4} & \se{50} & \se{0} & \se{A4 + B4} & \uv{C4 + D3} \\ \hline
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{minipage} %
$\,$\\
\begin{minipage}{1\linewidth}
\centering
\textbf{Evaluated Spreadsheet $\evalOf{\spreadsheet}{\cdot}$} \\
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{tabular}{c|c|c|c|c|}
\cline{2-5}
& \tabhead{A} & \tabhead{B} & \tabhead{C} & \tabhead{D} \\ \hline
\tabhead{1} & 15 & 50 & 65 & \uv{65} \\ \hline
\tabhead{2} & 20 & 60 & 80 & \uv{145} \\ \hline
\tabhead{3} & 25 & 100 & 125 & \uv{270} \\ \hline
\tabhead{4} & 50 & 0 & 50 & \uv{320} \\ \hline
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{minipage}
\caption{Example overlay update and result (updated expressions and values are shown in \uv{red}).}\label{fig:example-overlay-update}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Semantics for Overlay Updates}
%
Applying an overlay update $\overlay$ to a spreadsheet $\spreadsheet$ results in an updated spreadsheet $\overlay(\spreadsheet)$ computed by applying the reference frame update and then applying all pattern updates. Pattern updates therefore take precedence for the cells they cover:
\begin{align*}
\valat{\overlay(\spreadsheet)}{\column}{\row} &=
\begin{cases}
\pattern_i((\column,\row)) & \textbf{if } \exists i: (\column,\row) \in \rangeOf{C_i}{R_i} \\
\valat{\spreadsheet}{\column}{\rtrans^{-1}(\row)} & \textbf{else if } \exists \row': \rtrans(\row') = \row\\
\errorval &\textbf{otherwise}
\end{cases}
\end{align*}
For example, the result of applying the overlay update $\overlay_{rolling} = \ol{\rtrans_{id}}{\oup_{rolling}}$, where $\rtrans_{id}(x) = x$ is the identity, to our running example spreadsheet is shown in \Cref{fig:example-overlay-update}. Column \emph{D} is filled with new formulas that compute the rolling sum.
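As a check of the values in \Cref{fig:example-overlay-update}, the formulas in column \emph{D} can be evaluated from top to bottom, using the values $65$, $80$, $125$, and $50$ of column \emph{C}:
\begin{align*}
D1 &= C1 = 65, & D2 &= C2 + D1 = 80 + 65 = 145,\\
D3 &= C3 + D2 = 125 + 145 = 270, & D4 &= C4 + D3 = 50 + 270 = 320.
\end{align*}
Cells in columns \emph{A} to \emph{C} lie in no pattern range and keep their formulas, since the identity leaves every row position unchanged.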
Several remarks are in order. First, note that overlays can be used to encode common spreadsheet update operations in constant space. Second, \cite{tang-23-efcsfg} uses similar ideas to compress the dependencies in a spreadsheet using ranges and patterns. However, that work does not compress the updates themselves, but instead generates a compact representation of dependencies in a given spreadsheet.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Updating Datasets}
\label{sec:updating-datasets}
Modeling spreadsheets as overlays over datasets, and updates as overlays on top of these overlays, has two major advantages. First, versions of a spreadsheet can be tracked as sequences of overlay updates. Second, the updates on top of a spreadsheet overlay for a dataset $\ds$ can be reapplied on top of an updated version $\ds'$, as long as we can consistently identify row labels in $\ds'$ (e.g., this is possible if $\rowDomain$ is a semantic key for the dataset).
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% reftex-default-bibliography: ("../main.bib")
%%% End: