paper-Vizier-SpreadsheetOve.../sections/data.tex

%!TEX root=../main.tex
\section{Data Updates}
\label{sec:data}

Given a fixed size grid of cells, as defined by the shape update layer, the data update layer is responsible for replacing values as directed by the user.
Naively, the data update layer maintains a collection of \emph{cell updates}.
Each update targets one cell, identified by a unique column identifier and row position (relative to a reference frame).
The update itself consists of a formula, a single-valued Spark \texttt{Expression} that defines the cell contents.
The formula may reference other cells, typically a row position (and reference frame) and a column identifier.


\subsection{Formula Rules}

A key objective for Overlay is avoiding scaling dependencies on the size of the dataset.
As we noted earlier, the number of distinct interactions that a user has with Overlay will (typically) be far smaller than the number of rows in the dataset.
However, individual actions may affect a number of cells that grows with the size of the dataset.
For example, a common usage pattern in spreadsheets is for a user to develop a formula in one cell, and then copy+paste it into the remaining cells of the column\footnote{In existing spreadsheet platforms, references to cells or formulas are labeled as being absolute or relative.  When a cell formula is pasted at an offset, relative references are offset by the same amount.}.
With cells implemented in the naive manner described above, this single interaction would create one new cell for every row of the dataset.

Instead, Overlay instantiates bulk updates as ``formula rules.''
A formula rule is defined by a range of target cells and a formula.
Due to Overlay's use of the relational spreadsheet model, we limit the range to the row dimension: Each rule targets a range of rows in a single column.
Cell references in the formula are given as absolute, or relative references; both with respect to a specific reference frame.
A formula rule may be evaluated with respect to any of its target cells; References are resolved relative to the target cell, after accounting for differences in reference frame.

\begin{example}
  Show an example of a formula rule with an insert in the middle of it.
\end{example}

\subsection{Indexing Formula Rules}

The data update layer provides 3 API methods

\begin{itemize}
  \item \texttt{get(cell)}: Retrieve the value of the specified cell.
  \item \texttt{put(range, rule)}: Apply the specified formula to cells in the specified range.
  \item \texttt{trigger(range, callback)}: Invoke the provided callback when the value of a cell in the specified range changes.
\end{itemize}

\paragraph{Get}
To retrieve the value of a cell, the layer first determines if the cell appears in the target range of a rule.
If no rule applies, the request is redirected to the source data through the shape update layer.
To accelerate point lookups, we build a simple range index over the rules for each column.

\todo{Add discussion of range index}

\paragraph{Put}
When a rule is written to a cell, any prior rule that applies to it is removed.
If a rule's target range overlaps with the start or end of an existing rule's range, the range is truncated.
If a rule's target range overlaps with the range of an existing rule, but not the start or end, the existing rule is split in two on the boundaries of the new rule.

\paragraph{Trigger}
As in \texttt{get}, triggers are recorded in a simple range index.
Any point lookup on this index returns the set of all trigger callbacks applicable to the point.
Crucially, an update to an upstream cell forces an update of the current cell; In order for the cell

If a cell has a formula that depends on other cells, an update to one of its upstream cells can invalidate it.
Thus, if a cell has dependencies and at least one trigger on it, the cell recursively registers triggers for upstream cells.
This process is streamlined by allowing trigger define a range to associate the trigger with.

\begin{itemize}
  \item Compute the (sub)ranges of all defined rules that overlap with the trigger target range.
  \item For each rule and subrange pair...
  \item For each cell reference in the rule formula...
  \item If the cell reference is absolute, include the cell reference as a range
  \item If the cell reference is relative, then apply the relative offset to the affected subrange; This may require propagating the affected subrange through a set of reference frames, which in turn may require splitting the affected subrange into multiple disconnected ranges (e.g., if cells were deleted from the middle).
  \item Recursively establish a trigger for the upstream range(s).
\end{itemize}

To avoid cycles, recursive invocations of trigger maintain a set of visited cell ranges (i.e., downstream dependencies from the cell being triggered).
If the intersection of the visited cells and the trigger range is non-empty, then cells in the intersection are assigned an error value and the trigger is not propagated further for these cells.

\subsection{Evaluating Cells}

Other scalable spreadsheets (e.g., DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19}) materialize all cells in the spreadsheet.
This approach violates a key objective of Overlay, to decouple time and memory complexities from data size.
An alternative, naive approach might be to evaluate every cell on an as-needed basis, evaluating the formula and recursively evaluating upstream dependencies.
However, this lazy evaluation strategy requires invoking every upstream dependency; in the worst case, this may still be linear in the size of the data (or exponential without memoization).

Instead, Overlay relies on the fact that in a typical interactive interface, the full dataset will not be viewable to the user at any given time.
Overlay maintains a list of ``active'' cells, seeded by a frontend interacting with it, and also including all cells with registered triggers.
The data update layer will not provide values for cells that have not yet been activated.
In Vizier~\cite{brachmann:2019:sigmod:data}, this is the set of rows visible on screen, and a buffer of preceding and following rows.

When a cell range is activated, it registers triggers for its upstream cells, transitively activating them.
Cell values are persisted as Scala Futures, containers that store values that may not be available yet.
We observe that futures work equally well for formulas in the process of being computed, as well as source data values that have not been cached yet.

To evaluate the cell, Overlay obtains a future for each of the cell's direct dependencies;
Once all direct dependencies are available, the formula is evaluated, and the result is passed to the Future.
When a cell's value is updated, the triggers for its downstream cells are invoked, invalidating the existing value (i.e., future) for these cells, and triggering a re-execution.

\subsection{Recursive Rules}

We note that it is possible for the set of active cells to scale faster than the number of user interactions plus the set of viewable cells, typically by defining one or more formula rules that are mutually recursive.
We note that recursive formula \emph{rules} need not imply a cycle in the dependency graph.
For example, a standard pattern in spreadsheets is to define an auto-incrementing row counter by defining a formula that adds one to the counter value from the preceding row.
$$\mathcal C_{c,r} = \mathcal C_{c,r-1} + 1$$
The general rule, applied to the entire column (modulo the base case of the first-row), is recursive.
However, each individual cell over which the formula is materialized depends on the cell immediately preceding it.

Continuing the example if the user scrolls to the very end of the dataset (activating the last row), then the counter value for every other row will be transitively activated.
Although not implemented in Overlay at this time, we consider several solutions to the transitive activation problem for future work.

In the ideal case, given the set of recursive rules, it would be possible to derive a purely analytical solution to the recursive formula, for exam
The example recursive row-counter formula example is based on the row position (modulo reference frame).

However, an analytical solution may also involve data dependencies.
For example, a common pattern is to define a ``running total'' column:
$$\mathcal C_{c,r} = \mathcal C_{c,r-1} + \mathcal C_{c',r}$$
Although similar to the prior example in structure, the closed form representation of this expression involves a summation over the elements of column $c'$ in the rule's target range.

In a situation like this, when the closed form solution includes an aggregate term over the base data, it is possible to avoid materializing the full transitive closure of the visible cells.
Instead, we can identify a set of cells that form a cut over the dependency graph, where all of the visible cells are on one side of the cut.
If the source data can evaluate SQL (e.g., an Apache Spark dataframe), an approach analogous to~\cite{freire:2016:hilda:exception} can outsource materialization of these cells to the database.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: