Merge branch 'main' of git.odin.cse.buffalo.edu:VizierDB/paper-Vizier-SpreadsheetOverlay

main
Oliver Kennedy 2023-03-19 18:20:23 -04:00
commit dd809554b3
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 40 additions and 14 deletions


@ -2,6 +2,8 @@
\usepackage{cleveref}
\usepackage{todonotes}
\usepackage{amsmath}
\usepackage{xspace}
\usepackage{listings}
\usepackage{algorithm}
\usepackage{algpseudocode}


@ -22,9 +22,12 @@
\newcommand{\rangeOf}[2]{(#1\texttt{:}#2)}
\newcommand{\range}{\rangeOf{\columnRange}{\rowRange}}
\newcommand{\cellRef}[2]{(#1\texttt{:}#2)}
\newcommand{\evalOf}[2]{\left\llbracket\; #2 \;\right\rrbracket_{#1}}
\newcommand{\patternOf}[2]{\left\llbracket\; #1 \;\right\rrbracket_{#2}}
\newcommand{\depsOf}[1]{\textbf{deps}\left(#1\right)}
\newcommand{\tdepsOf}[1]{\ensuremath{\text{\textbf{deps}\ast\left(#1\right)}\xspace}}
\newcommand{\DG}[1]{\ensuremath{G_{#1}}\xspace}
\newcommand{\TDG}[1]{\ensuremath{G_{#1}^*}\xspace}
Let $\columnDomain$ and $\rowDomain$ denote domains of column and row labels, respectively, and let $\exprDomain$ and $\valueDomain$ denote domains of expressions and values; we define $\exprDomain$ in greater detail below.
We define a \emph{spreadsheet} $\spreadsheet : (\columnDomain \times \rowDomain) \rightarrow \exprDomain$ as a mapping from \emph{cells} ($(\column, \row) \in (\columnDomain \times \rowDomain)$) to expressions.
@ -37,10 +40,11 @@ The expression $\expr$ may be evaluated in the context of a spreadsheet ($\evalO
$$\evalOf{\spreadsheet}{\cellRef{\column}{\row}} \equiv \evalOf{\spreadsheet}{\spreadsheet(\column, \row)}$$
Cyclic references evaluate to a distinguished error value in $\valueDomain$.
We define the dependencies of an expression ($\depsOf{\expr}$) to be the set of cells referenced by $\expr$.
Expression dependencies induce a graph $\DG{\spreadsheet} = \tuple{V, E}$ over the spreadsheet, where each cell is a node (i.e., $V = \columnDomain \times \rowDomain$), and each dependency is a (directed) edge:
$$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
\{\;\cell \rightarrow \cell'\;|\;\cell' \in \depsOf{\spreadsheet(\cell)}\;\} $$
We use $\TDG{\spreadsheet}$ to denote the graph $\tuple{V,E^*}$ where $E^*$ is the transitive closure of $E$, i.e., $\TDG{\spreadsheet}$ stores both direct and indirect dependencies among the cells in the spreadsheet.
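\begin{example}
As a small illustration, consider a hypothetical spreadsheet with a single column $A$ and three rows; the concrete expressions below (and the arithmetic syntax they use) are assumptions made only for this example:
$$\spreadsheet(A, 1) = 5 \qquad \spreadsheet(A, 2) = \cellRef{A}{1} + 1 \qquad \spreadsheet(A, 3) = \cellRef{A}{2} \times 2$$
The dependencies are $\depsOf{\spreadsheet(A, 1)} = \emptyset$, $\depsOf{\spreadsheet(A, 2)} = \{(A, 1)\}$, and $\depsOf{\spreadsheet(A, 3)} = \{(A, 2)\}$, so
$$E = \{\;(A, 2) \rightarrow (A, 1),\;\; (A, 3) \rightarrow (A, 2)\;\}$$
$$E^* = E \cup \{\;(A, 3) \rightarrow (A, 1)\;\}$$
That is, $(A, 3)$ depends on $(A, 1)$ only indirectly, and this dependency appears in $E^*$ but not in $E$.
\end{example}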
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -57,8 +61,8 @@ To compactly encode more general sets of rows, we define a \emph{row set} data s
Observe that given two row sets $\rowRange$ and $\rowRange'$ satisfying the above properties, we can compute their intersection $\rowRange \cap \rowRange'$, union $\rowRange \cup \rowRange'$, and difference $\rowRange - \rowRange'$ in $O(|\rowRange|+|\rowRange'|)$ time, returning a row set that respects the same properties.
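\begin{example}
To illustrate, assume (this reading of the properties is an assumption of the example, not a restatement of the definition) that a row set is a sorted sequence of disjoint, inclusive ranges of integer rows, written here as $[a, b]$. For $\rowRange = \{[1,5], [10,20]\}$ and $\rowRange' = \{[3,12]\}$:
$$\rowRange \cap \rowRange' = \{[3,5], [10,12]\} \qquad \rowRange \cup \rowRange' = \{[1,20]\} \qquad \rowRange - \rowRange' = \{[1,2], [13,20]\}$$
Under this assumption, each result can be produced by a single simultaneous pass over both sequences, which is consistent with the $O(|\rowRange|+|\rowRange'|)$ bound.
\end{example}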
As previously noted, a spreadsheet user may apply a single formula to a range of cells in a single interaction.
Typically, such formulas are defined by an expression \emph{pattern} $\pattern \in \patternDomain$, a more general form of an expression that may also include \emph{offset references}.
An offset reference $\cellRef{\column}{\rowOffset}$ is defined by a column $\column \in \columnDomain$ and an integer row offset $\rowOffset \in \mathbb Z$.
A pattern may be expanded in the context of a row ($\patternOf{\pattern}{\row}$), yielding an expression in which every offset reference is replaced by an explicit cell reference at the corresponding offset:
$$\patternOf{\cellRef{\column}{\rowOffset}}{\row} = \cellRef{\column}{\row + \rowOffset}$$
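\begin{example}
For instance, assuming that patterns may combine offset references with arithmetic operators and that expansion applies to each reference independently (columns $A$ and $B$ are hypothetical), a running-sum pattern expanded in the context of row $5$ gives:
$$\patternOf{\cellRef{B}{-1} + \cellRef{A}{0}}{5} = \cellRef{B}{4} + \cellRef{A}{5}$$
Here the offset $-1$ refers to the preceding row, and the offset $0$ refers to the row in which the pattern is expanded.
\end{example}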
@ -70,5 +74,17 @@ $$\spreadsheet \equiv \comprehension{
(\column, \row) \in \range,
\tuple{\range, \pattern}\in \encodedSpreadsheet
}$$
Informally, the expression at a cell $\cell$ is defined by identifying the range in $\encodedSpreadsheet$ that contains $\cell$, and expanding the pattern in the context of $\cell$'s row. We require that the ranges in $\encodedSpreadsheet$ be pairwise disjoint.
If the ranges do not cover every cell, the expression for each uncovered cell is defined to be the literal null.
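\begin{example}
As a sketch of how such an encoding is read (the range, the pattern, and the interval notation $[2,4]$ for rows $2$ through $4$ are assumptions made for illustration), suppose
$$\encodedSpreadsheet = \{\;\tuple{\rangeOf{B}{[2,4]},\; \cellRef{B}{-1} + \cellRef{A}{0}}\;\}$$
The cell $(B, 3)$ lies in the only range, so its expression is the pattern expanded in the context of row $3$:
$$\spreadsheet(B, 3) = \patternOf{\cellRef{B}{-1} + \cellRef{A}{0}}{3} = \cellRef{B}{2} + \cellRef{A}{3}$$
The cell $(A, 3)$ is not covered by any range, so $\spreadsheet(A, 3)$ is the literal null.
\end{example}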
\subsection{Update Index}
Evaluating a cell in a spreadsheet requires evaluating its transitive dependencies.
The spreadsheet may thus be viewed as a graph, with one node for each cell and one edge for each dependency.
The update index maintains
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End:


@ -2,13 +2,17 @@
\section{System Overview}
\label{sec:overview}
\BG{define the spreadsheet model first?}
\newcommand{\errorval}{\ensuremath{\bot}\xspace}
A spreadsheet is a regular grid of cells, which are defined by formulas.
A cell's formula may be a literal value, or an expression defining a computation that may be based on the value of other cells.
The value of a cell is the result of evaluating the cell's formula.
This may require obtaining the value of cells on which the formula depends; we refer to such cells as \emph{upstream} cells.
When a cell is modified, the values of downstream (i.e., dependent) cells are updated accordingly;
This may require obtaining the value of cells on which the formula depends; we refer to such cells as \emph{direct prerequisite} cells. If cell
When a cell is modified, the values of \emph{dependent} cells, i.e., cells that use its value, have to be updated.
That is, in contrast to a relational table, which can be updated by a sequence of imperative operations, the formulas of a spreadsheet are evaluated (conceptually) at the same time.
A cycle in the dependency graph (i.e., a cell being upstream of itself) is an error, and any cells participating in the cycle evaluate to a special error value $\errorval$.
In contrast to classical spreadsheets, where each cell is a completely independent entity, we adopt the Relational spreadsheet model~\cite{DBLP:conf/cidr/BakkeB11}, which focuses on so-called `tidy data,' where each row is one record, and each column represents a distinct (strongly typed) variable.
This approach incentivizes usage patterns that streamline data caching and simplify on-disk implementation: critically, columns and type information are available in a static context even before data is loaded, while the need for dynamic data access via caching is limited to a one-dimensional index on records.
@ -47,7 +51,7 @@ The layer also provides push access to cell values through notifications that fi
This layer also acts as a visibility filter over the dataset.
The user interface explicitly maintains a subset of cells that are ``active'' (i.e., in or near the viewable area).
The data update layer extends this subset to its transitive closure, adding every cell that is upstream of an active cell.
Only active cells are maintained.
We discuss the data update layer in greater depth in \Cref{sec:data}.
@ -65,7 +69,7 @@ To identify rows, we considered two approaches: (i) identifying rows by unique i
Assigning each row a unique identifier poses several scalability challenges.
First, this mapping makes caching more challenging, as row identifiers must be persisted in the original dataset.
Moreover, unique identifiers cannot be used to partition the source data into sets of rows of consistent size.
Finally, unique identifiers preclude rule-based updates, as we describe in \Cref{sec:data}.
Positional references can compactly encode contiguous ranges of rows.
@ -105,4 +109,8 @@ Errors in a transformation to an earlier reference frame indicate inserted rows,
\begin{example}
Some example of reference frames in practice. Insert, delete, move, etc...
\end{example}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: