Merge branch 'main' of git.odin.cse.buffalo.edu:VizierDB/paper-Vizier-SpreadsheetOverlay

main
Boris Glavic 2023-03-18 19:30:05 -05:00
commit 9d2e139d2a
3 changed files with 82 additions and 0 deletions

BIN
graphics/systemdesign.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 270 KiB

View File

@ -156,6 +156,7 @@
\input{sections/introduction}
\input{sections/overview}
\input{sections/formalism}
\input{sections/data}
\input{sections/relwork}

81
sections/formalism.tex Normal file
View File

@ -0,0 +1,81 @@
%!TEX root=../main.tex
\section{Formalism}
\label{sec:formalism}
\newcommand{\comprehension}[2]{\left\{\;#1\;|\;#2\;\right\}}
\newcommand{\spreadsheet}{\mathbb S}
\newcommand{\encodedSpreadsheet}{\hat \spreadsheet}
\newcommand{\columnDomain}{\mathcal C}
\newcommand{\columnRange}{C}
\newcommand{\column}{c}
\newcommand{\rowDomain}{\mathcal R}
\newcommand{\rowRange}{R}
\newcommand{\row}{r}
\newcommand{\rowOffset}{\delta}
\newcommand{\cell}{a}
\newcommand{\exprDomain}{\mathcal E}
\newcommand{\expr}{e}
\newcommand{\patternDomain}{\mathcal P}
\newcommand{\pattern}{P}
\newcommand{\valueDomain}{\mathcal V}
\newcommand{\rangeOf}[2]{(#1\texttt{:}#2)}
\newcommand{\range}{\rangeOf{\columnRange}{\rowRange}}
\newcommand{\cellRef}[2]{(#1\texttt{:}#2)}
\newcommand{\evalOf}[2]{\left\llbracket\; #2 \;\right\rrbracket_{#1}}
\newcommand{\patternOf}[2]{\left\llbracket\; #1 \;\right\rrbracket_{#2}}
\newcommand{\depsOf}[1]{\textbf{deps}\left(#1\right)}
Let $\columnDomain$ and $\rowDomain$ denote a domain of column and row (respectively) labels, and let $\exprDomain$ and $\valueDomain$ denote a domain of expressions and values; We will define $\exprDomain$ in greater detail below.
We define a \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ as a mapping from \emph{cells} ($(\column, \row) \in (\columnDomain \times \rowDomain)$) to expressions.
We define a \emph{range} $\rangeOf{\columnRange}{\rowRange}$ as the Cartesian product of a set of columns ($\columnRange \subseteq \columnDomain$) and rows ($\rowRange \subseteq \rowDomain$).
$$\rangeOf{\columnRange}{\rowRange} \equiv \{\;(\column, \row)\;|\;\column\in \columnRange, \row \in \rowRange\;\}$$
An expression $\expr \in \exprDomain$ is a formula defined over literals from $\valueDomain$, the standard arithmetic operators, and references to other cells in the spreadsheet ($\cellRef{\column}{\row}$).
The expression $\expr$ may be evaluated in the context of a spreadsheet ($\evalOf{\spreadsheet}{\cdot} : \exprDomain \rightarrow \valueDomain$) as follows:
(i) Literals evaluate to themselves, (ii) Arithmetic formulas are evaluated in the usual way, and (iii) References to the spreadsheet are evaluated recursively:
$$\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$$
Cyclic references evaluate to a distinguished error value in $\valueDomain$.
We define the dependencies of an expression ($\depsOf{\expr}$) to be the set of cells referenced by $\expr$.
Expression dependencies induce a graph $\tuple{V, E}$ over the spreadsheet, where each cell is a node (i.e., $V = \columnDomain \times \rowDomain$), and each dependency is a (directed) edge:
$$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
\{\;\cell \rightarrow \cell'\;|\;\cell' \in \depsOf{\spreadsheet(\cell)}\;\} $$
\subsection{Compressed Updates}
We adopt a common assumption in relational data: that $\columnDomain$ is small and $\rowDomain$ is large for a typical spreadsheet.
Accordingly, we define $\rowDomain$ for a spreadsheet with $N$ rows to be the range $[1,N]$, with rows identified by their position in the spreadsheet.
This notation allows us to compactly encode a contiguous range of rows $[r, r']$ by the lower and upper bounds of the range.
To compactly encode more general sets of rows, we define a \emph{row set} data structure as a sequence of ranges $\rowRange = ([r_1, r'_1], \ldots, [r_k, r'_k]$) respecting the following properties:
(i) All ranges are disjoint,
(ii) No two ranges are contiguous, and
(iii) The sequence is ordered on the lower bound.
Observe that given two range sets $\rowRange$, $\rowRange'$, satisfying the above properties, we can compute their intersection $\rowRange \cap \rowRange'$, union $\rowRange \cup \rowRange'$ and difference $\rowRange - \rowRange'$ in $O(|\rowRange|+|\rowRange'|)$ time; returning a row set that respects the same properties.
As previously noted, a spreadsheet user may apply a single formula to a range of cells in a single interaction.
Typically, such formulas are defined by an expression \emph{pattern} $\pattern \in \patternDomain$, a more general form of an expression that may also include \emph{offset references}.
An offset reference $\cellRef{\column}{\rowOffset}$ is defined by a column $\column \in \columnDomain$ and an integer row offset $\rowOffset \in \mathbb Z$.
A pattern may be expanded in the context of a row ($\patternOf{\pattern}{\row}$); A pattern expands to an expression by replacing every offset reference with an explicit cell reference at the corresponding offset:
$$\patternOf{\cellRef{\column}{\rowOffset}}{\row} = \cellRef{\column}{\row + \rowOffset}$$
We leverage row sets and patterns to define an encoded spreadsheet $\encodedSpreadsheet$ spreadsheet as a set of tuples $\tuple{\range,\pattern} \in \encodedSpreadsheet$ where $\tuple{\columnRange, \rowRange} = \range$ is a range, where $\rowRange$ is defined as a row set; and $\pattern$ is a pattern.
The encoded spreadsheet $\encodedSpreadsheet$ defines the spreadsheet $\spreadsheet$ as follows:
$$\spreadsheet \equiv \comprehension{
(\column, \row) \rightarrow \patternOf{\pattern}{\row}
}{
(\column, \row) \in \range,
\tuple{\range, \pattern}\in \encodedSpreadsheet
}$$
Informally, the expression at a cell $\cell$ is defined by identifying the range in $\encodedSpreadsheet$ that contains $\cell$, and expanding the pattern in the context of $\cell$'s row. We require that the set of ranges in $\encodedSpreadsheet$ be disjoint;
If the set of ranges is not complete, the expression for cells not covered by a range is defined to be the literal null.
\subsection{Update Index}
Evaluating a cell in a spreadsheet requires evaluating transitive dependencies;
The spreadsheet may thus be viewed as a graph, with one node for each cell and one edge for each dependency.
The update index maintains