%!TEX root=../main.tex
\section{System Overview}
\label{sec:overview}
\BG{define the spreadsheet model first?}
A spreadsheet is a regular grid of cells, which are defined by formulas.
A cell's formula may be a literal value, or an expression defining a computation that may be based on the value of other cells.
The value of a cell is the result of evaluating the cell's formula.
This may require obtaining the value of cells on which the formula depends; we refer to such cells as \emph{direct prerequisite} cells.
When a cell is modified, the values of its \emph{dependent} cells, i.e., cells that use its value, have to be updated.
That is, in contrast to a relational table, which can be updated by a sequence of imperative operations, the formulas of a spreadsheet are evaluated (conceptually) at the same time.
A cycle in the dependency graph (i.e., a cell being upstream of itself) is an error, and any cells participating in the cycle evaluate to a special error value $\errorval$.
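As an illustrative formalization (the symbols $\mathit{pre}$, $\mathit{up}$, $f_c$, and $v$ are shorthand assumed here for exposition, and we read `upstream' as the transitive closure of the direct-prerequisite relation), let $\mathit{pre}(c)$ denote the set of direct prerequisites of a cell $c$, and let $\mathit{up}(c)$ be the smallest set satisfying
\[
  \mathit{up}(c) \;=\; \mathit{pre}(c) \;\cup \bigcup_{c' \in \mathit{pre}(c)} \mathit{up}(c').
\]
Writing $f_c$ for the formula of $c$ and $v(c)$ for its value, the cycle rule can then be sketched as
\[
  v(c) \;=\; \left\{
  \begin{array}{ll}
    \errorval & \mbox{if } c \in \mathit{up}(c), \\
    f_c\bigl(v(c_1), \ldots, v(c_k)\bigr) & \mbox{otherwise, where } \mathit{pre}(c) = \{c_1, \ldots, c_k\}.
  \end{array}
  \right.
\]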
In contrast to classical spreadsheets, where each cell is a completely independent entity, we adopt the Relational spreadsheet model~\cite{DBLP:conf/cidr/BakkeB11}, which focuses on so-called `tidy data,' where each row is one record, and each column represents a distinct (strongly typed) variable.
This approach incentivizes usage patterns that streamline data caching and simplify an on-disk implementation: critically, columns and type information are available in a static context even before data is loaded, while the need for dynamic data access via caching is limited to a one-dimensional index on records.
\begin{figure}
\includegraphics[width=\columnwidth]{graphics/layers.png}
\caption{Stuff}
\end{figure}
\subsection{System Layers}
\paragraph{Source Data Layer}
An Overlay data source is primarily responsible for defining the initial shape (i.e., schema and number of rows) of the dataset, and providing random access to individual cell values.
To mitigate blocking, operations that rely on the data itself (counting rows or accessing cell values) are allowed to return Futures.
When random access is not available, for example when the input is an arbitrary Apache Spark dataframe, Overlay provides a simple caching layer.
The rows of the dataframe are subdivided into fixed-size (10k-row) pages that are loaded as needed and evicted on a least-recently-used basis.
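Concretely, and assuming zero-based row positions (an indexing convention we adopt here only for illustration), a page size of $P = 10{,}000$ places the row at position $r$ on page
\[
  p(r) = \lfloor r / P \rfloor \quad \mbox{at offset} \quad o(r) = r \bmod P,
\]
so that, for example, row $r = 123{,}456$ falls on page $12$ at offset $3{,}456$, and a cache with a (hypothetical) capacity of $k$ pages keeps at most $kP$ rows resident.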
\paragraph{Shape Update Layer}
The shape update layer manages updates to the shape of the dataset, including changes to the schema (column insertions, updates, or deletions), as well as changes to the data (row insertions or deletions, or changes in row order).
It acts as a select/project/union/order-by query over the source data, providing an updated schema, and random access to elements according to the new shape.
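As a rough illustration (the symbols are assumed for exposition, and the order of composition shown is only one plausible reading), let $S$ be the source rows, $\theta$ a predicate that rejects deleted rows, $C$ the current column list, $R$ the set of inserted rows over $C$, and $\prec$ the current row order.
The view exposed by the layer then has the form
\[
  \tau_{\prec}\bigl(\pi_{C}(\sigma_{\theta}(S)) \cup R\bigr),
\]
where $\sigma$, $\pi$, and $\tau$ denote selection, projection, and ordering, with $\pi$ understood as a generalized projection so that $C$ may include newly inserted columns.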
The key challenge at this layer is keeping references to cells consistent as the dataset changes shape.
We discuss this challenge in greater depth below, in \Cref{sec:cellidentity}.
\paragraph{Data Update Layer}
The data update layer manages updates to content by storing and indexing a set of update rules.
The layer provides random access to updated cell values (i.e., the results of evaluating their formulas), or pass-through access to the source data when the data has not been updated.
The layer also provides push access to cell values through notifications that fire when a cell is invalidated.
This layer also acts as a visibility filter over the dataset.
The user interface explicitly maintains a subset of cells that are ``active'' (i.e., in or near the viewable area).
The data update layer extends this subset to its transitive closure under dependencies, adding every cell that is upstream of an active cell.
Only active cells are maintained.
We discuss the data update layer in greater depth in \Cref{sec:data}.
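One way to state the extended active set precisely, using notation assumed here for exposition: write $\mathit{pre}(c)$ for the direct prerequisites of a cell $c$, and $A$ for the set of cells that the user interface marks as active.
Reading `upstream' as `reachable through one or more prerequisite steps', the set of maintained cells $A^{*}$ is the smallest set such that
\[
  A \subseteq A^{*} \quad \mbox{and} \quad \mathit{pre}(c) \subseteq A^{*} \mbox{ for every } c \in A^{*}.
\]
For example, if $A = \{c_3\}$, $\mathit{pre}(c_3) = \{c_2\}$, $\mathit{pre}(c_2) = \{c_1\}$, and $\mathit{pre}(c_1) = \emptyset$, then $A^{*} = \{c_1, c_2, c_3\}$; a cell that depends on $c_3$ but is itself outside the viewable area is not in $A^{*}$ and is therefore not maintained.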
\paragraph{User Interface}
\todo{Discuss relevant aspects of the UI}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: