%!TEX root = ../main.tex
\section{System Design}
\label{sec:system}
Our prototype overlay spreadsheet is implemented within the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and data import/export.
Our prototype is designed to accept any Spark dataframe as a data source.
% The prototype's design is illustrated in \Cref{fig:systemdesign}
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates our internal model of a spreadsheet to a more natural interface.
The \textbf{Execution} layer evaluates spreadsheet cells and materializes cells currently visible to the user.
The \textbf{Indexing} layer provides efficient access to formulas, and an LRU cache provides efficient access to source dataframes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation Layer}
\label{sec:system-presentation}
User-facing client applications connect to the overlay spreadsheet through a presentation layer that serializes concurrent updates, and provides clients with the illusion of a fixed grid of cells.
Column operations (insertion, deletion, reordering) are handled at this layer, so lower levels can reference the small set of columns solely by column identity.
Other updates are serialized and forwarded to lower levels.
% \begin{figure}
% \includegraphics[width=0.4\columnwidth]{graphics/system-arch}
% \caption{Overlay system design.}
% \label{fig:systemdesign}
% \trimfigurespacing
% \end{figure}
The presentation layer expects the Executor to provide efficient random access to cell values and to support updating ranges of cells with pattern expressions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Executor}
\label{sec:system-executor}
The Executor provides efficient access to cell values and generates notifications about cell state changes.
Cell values are derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) A sequence of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the update index, as discussed below.
% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
% $$\spreadsheet[\column,\row] = \begin{cases}
% \spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
% \spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(r)] & \textbf{otherwise}
% \end{cases}$$
The naive approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) topologically sorts cells based on dependencies and evaluates cells in this order.
The Executor side-steps the linear (in the data size) cost of the naive approach through two insights:
(i) Updates applied over multiple cells are already available as patterns, and
(ii) Only a small fraction of cells will be visible at any one time.
Assuming the dependencies of a range of cells can be computed efficiently (we return to this assumption in \Cref{sec:system-index}), only the visible cells and their dependencies need to be evaluated.
The Executor only evaluates expressions for rows that are (close to being) visible to the user, and the transitive closure of their dependencies.
Some dependency chains (e.g., running sums) still require computation for each row of data.
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such pattern expressions can often be rewritten into a closed form.
For example, any cell in a running sum column is equivalent to a sum over the preceding cells.
Our preliminary experiments (\Cref{sec:experiments}) suggest promise in a hybrid evaluation strategy that evaluates visible cells individually and computes cells defined by patterns through closed form windowed aggregation queries.
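As an illustrative sketch of such a rewrite (the column names $C$ and $D$ and the base case are assumptions made for this example), suppose column $D$ holds a running sum over column $C$, defined by the recurrence
$$\cellRef{D}{1} = \cellRef{C}{1}, \qquad \cellRef{D}{k} = \cellRef{D}{k-1} + \cellRef{C}{k} \quad (k > 1).$$
Unrolling the recurrence $k-1$ times gives the closed form
$$\cellRef{D}{k} = \cellRef{C}{k} + \cellRef{C}{k-1} + \ldots + \cellRef{C}{1} = \sum_{i=1}^{k} \cellRef{C}{i},$$
which depends only on column $C$ and can therefore be evaluated for a whole range of rows as a single windowed aggregate, rather than one cell at a time.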
\partitle{Updates}
When the Executor receives a cell update, it uses the index to identify invalidated cells and begins re-evaluating them in topological order.
An update to the reference frame is applied to both the index and the data source.
Following typical spreadsheet semantics, an insertion or row move updates references in dependent formulas, so no re-evaluation is typically required.
If a row with dependent cells is deleted, the dependent cells need to be updated to indicate the error.
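For illustration, consider a hypothetical insertion of three rows at position $5$ (the specific numbers, and the name $T$ for the induced translation of row positions, are assumptions chosen for this example).
Row positions before the insertion map to positions after it as
$$T(\row) = \begin{cases}
\row & \textbf{if } \row < 5\\
\row+3 & \textbf{otherwise}
\end{cases}$$
so a formula that referenced row $7$ before the insertion references row $T(7) = 10$ afterwards.
Both positions denote the same underlying row, so the formula's value is unchanged and the cell need not be re-evaluated.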
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Update Index}
\label{sec:system-index}
The update index stores the sequence of updates ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the cells of an overlay spreadsheet (denoted $\spreadsheet_\overlay$) where undefined cells have the value $\errorval$.
This entails:
(i) cell expressions $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) upstream dependencies of a range (for topological sort and computing the active set); and
(iii) downstream dependents of a range (for cell invalidation after an update).
The key insight behind the index is that updates are stored as pattern-range tuples instead of as individual cells.
%As noted above, we assume that the number of columns is small and the number of rows is large.
\begin{figure}
\includegraphics[width=0.5\columnwidth]{graphics/rangemap.pdf}
\caption{A range map maps disjoint ranges to values.}
\label{fig:rangemap}
\trimfigurespacing
\end{figure}
\partitle{Range Maps}
The update index is built over a one\--di\-men\-sion\-al range map, an ordered map with integer keys.
In addition to the usual operations of an ordered map (e.g., \texttt{put}, \texttt{get}, \texttt{successorOf}), we define the operation \texttt{bulkPut(low, high, value)} which is equivalent to a \texttt{put} on every element in the range from \texttt{low} to \texttt{high}.
Implemented naively (e.g., over a binary tree with $N$ entries), this operation is $O((\texttt{high}-\texttt{low})\cdot\log(N))$.
A range map avoids the $(\texttt{high}-\texttt{low})$ factor by storing an ordered sequence of disjoint ranges, each mapped to a single value, as illustrated in \Cref{fig:rangemap}.
A binary tree provides efficient membership lookups over the ranges.
With a range map, the set of distinct values appearing in a range can be retrieved in $O(\log(N)+M)$ time (where $M$ is the number of distinct values); insertion and deletion have similar costs.
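As a small worked example (the ranges and the values $a$, $b$, and $c$ are hypothetical, and we assume that both endpoints of a range are inclusive), consider applying \texttt{bulkPut(5, 15, $c$)} to a range map that holds two ranges:
$$\begin{array}{ll}
\textrm{before:} & \{\, [1,10] \mapsto a,\;\; [11,20] \mapsto b \,\}\\
\textrm{after:} & \{\, [1,4] \mapsto a,\;\; [5,15] \mapsto c,\;\; [16,20] \mapsto b \,\}
\end{array}$$
Only the two ranges that overlap $[5,15]$ are trimmed, and one new range is inserted; a per-element map would instead perform $15 - 5 + 1 = 11$ separate \texttt{put} operations.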
\partitle{Cell Access}
The index layer maintains a ``forward'' index: an unordered map $\mathcal I$ that stores a range map $\mathcal I[\column]$ for each column.
The expression for a cell $\cellRef{\column}{\row}$ is stored at $\mathcal I[\column][\row]$.
\begin{algorithm}
\caption{\textbf{upstream}($\columnRange$, $\rowRange$)}
\label{alg:upstream}
\begin{algorithmic}[1]
\Require $\rangeOf{\columnRange}{\rowRange}$: A range to compute the upstream of.
\Ensure $\texttt{upstream}$: Cells on which $\rangeOf{\columnRange}{\rowRange}$ depends.
\State $\texttt{upstream} \leftarrow \{\}$
\State $\texttt{work} \leftarrow \comprehension{(\column, \rowRange, \{\})}{\column \in \columnRange}$
\While{$(\column', \rowRange', \texttt{lineage}) \leftarrow \texttt{work}.\textbf{dequeue}$}
\For{$(\rowRange'', \texttt{pattern}) \leftarrow \texttt{forwardIndex}(\column', \rowRange')$}
\For{$(\column_{d}, \rowRange_{d}, \texttt{offset})\hspace{-1mm} \leftarrow\hspace{-1mm} \textbf{deps}(\texttt{pattern}, \column', \rowRange'')$}
\State $(\column_{d}, \rowRange_{d}) \leftarrow (\column_{d}, \rowRange_{d}) - \texttt{upstream}$
\If{$(\column_{d}, \rowRange_{d})$ is non-empty}
\State $\texttt{upstream} \leftarrow \texttt{upstream} + (\column_{d}, \rowRange_{d})$
\State $\texttt{work}.\textbf{enqueue}( \column_{d}, \rowRange_{d},$\\
\hspace*{23mm}{\footnotesize $\{\;\texttt{p}' \rightarrow (\texttt{o}'+\texttt{offset})|\; (\texttt{p}' \rightarrow \texttt{o}' )\in \texttt{lineage}\}$}\\
\hfill {\footnotesize $\cup \{\texttt{pattern} \rightarrow \texttt{offset}\} )$}
\EndIf
\EndFor
\EndFor
\EndWhile
\end{algorithmic}
\end{algorithm}
\partitle{Upstream Reachability}
The execution layer needs to be able to derive the set of cells on which a specific target cell (or range) depends.
We refer to this set as the target's \emph{upstream}.
\Cref{alg:upstream} illustrates how to use breadth-first search to obtain the full upstream set for a given target range.
Each item in the BFS's work queue consists of a column, a row set, and a lineage; we return to the lineage shortly.
For each work item enqueued, we query the forward index to obtain patterns in the range (line 4), and iterate over the set of their dependencies (line 5).
If we discover a new dependency (lines 6-7), the newly discovered range is added to the return set and the work queue.
The lineage argument constructed when a new work item is enqueued is explained below.
The \textbf{deps} operation (line 5; \Cref{alg:getDeps}) computes the immediate dependencies of a range of cells $\rangeOf{\column}{\rowRange}$ that share a pattern.
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column}{\rowRange}$ such that $\cell \in \depsOf{\cell'}$.
The algorithm uses a recursive traversal (lines 6-7) to visit every cell reference (offset or explicit):
For offset references (lines 2-3), the provided range of rows is offset by the appropriate amount.
For explicit cell references (lines 4-5), the explicit reference is used.
\begin{algorithm}
\caption{\textbf{deps}($\texttt{pattern}, \column, \rowRange$)}
\label{alg:getDeps}
\begin{algorithmic}[1]
\Require \texttt{pattern}: An expression pattern
\Require $\rangeOf{\column}{\rowRange}$: A range of cells
\Ensure \texttt{deps}: Dependencies of $\rangeOf{\column}{\rowRange}$'s \texttt{pattern}
\State $\texttt{deps} \leftarrow \{\}$
\If{\texttt{pattern} \textbf{is an offset reference} $\cellRef{\column'}{\delta'}$}
\State $\texttt{deps} \leftarrow \texttt{deps} \cup \{ (\column', \rowRange+\delta', \delta') \}$
\ElsIf{\texttt{pattern} \textbf{is a direct reference} $\cellRef{\column'}{\row'}$}
\State $\texttt{deps} \leftarrow \texttt{deps} \cup \{ (\column', \row', \emptyset) \} $
\Else
\State $\texttt{deps} \leftarrow \texttt{deps} \underset{\texttt{child} \in \texttt{pattern}}{\bigcup} \textbf{deps}(\texttt{child}, \column, \rowRange) $
\EndIf
\end{algorithmic}
\end{algorithm}
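To make \Cref{alg:getDeps} concrete, consider a hypothetical pattern $\texttt{p}$ (assumed here purely for illustration) that adds the cell one row up in column $D$ to the cell in the same row of column $C$, applied to rows $2$ through $10$ of column $D$.
The recursive case visits the pattern's two references, both of which are offset references, with offsets $-1$ and $0$ respectively:
$$\textbf{deps}(\texttt{p}, D, [2,10]) = \{\, (D, [1,9], -1),\;\; (C, [2,10], 0) \,\}$$
The result contains one entry per reference in the pattern, regardless of how many rows the range spans.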
\partitle{Optimizing Recursive Reachability}
Consider a running sum, such as the one in \Cref{ex:recursive-running-sum}.
The $k$th element will have $O(k)$ upstream dependencies, and so naively following \Cref{alg:upstream} takes $O(k)$ time.
However, observe that a single pattern is responsible for all of these dependencies, suggesting that a more efficient option may be available.
This dependency chain arises from recursion over a single pattern; most cells depend on other cells defined by the same pattern.
We refer to such a pattern as \emph{recursive}, even if it does not create a dependency cycle over individual cells.
As with cell execution, the transitive closure of the dependencies of a recursive pattern has a closed-form representation.
In our running example, the upstream of any $\cellRef{D}{k}$ is exactly $\cellRef{D}{1-(k-1)}$ and $\cellRef{C}{1-k}$ (i.e., rows $1$ through $k-1$ of column $D$ and rows $1$ through $k$ of column $C$).
%
The \texttt{lineage} field of \Cref{alg:upstream} is used to track the set of patterns visited, and the offset(s) at which they were visited.
If the pattern being visited already appears in the lineage, then we know it is recursive and that we can extend out the sequence of upstream cells across the remaining cells of the pattern.
When the offset is $\pm 1$, the elements of this sequence are efficiently representable as a range of cells, computable in $O(1)$ time.
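As a sketch of how the lineage exposes recursion, consider a hypothetical running-sum pattern $\texttt{p}$ that references column $D$ at offset $-1$ and column $C$ at offset $0$, and suppose (as an assumption for this example) that $\texttt{p}$ covers rows $2$ through $N$ of column $D$.
Computing the upstream of $\cellRef{D}{k}$ for some $2 < k \leq N$ starts from the work item $(D, \{k\}, \{\})$, and the first iteration of \Cref{alg:upstream} enqueues
$$(D, \{k-1\}, \{\texttt{p} \rightarrow -1\}) \quad\textrm{and}\quad (C, \{k\}, \{\texttt{p} \rightarrow 0\}).$$
When the first of these items is dequeued, the forward index again returns $\texttt{p}$, which already appears in the item's lineage with offset $-1$.
Rather than enqueuing $\cellRef{D}{k-2}$, $\cellRef{D}{k-3}$, and so on one at a time, the cells of columns $D$ and $C$ reached by repeatedly applying $\texttt{p}$ can be added as contiguous ranges in a single step.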
\partitle{Downstream Reachability}
When a cell's expression is updated, cells that depend on it (even transitively) must be recomputed, so the index must support downstream reachability queries.
For efficient downstream lookups, the index maintains a ``backward'' index relating ranges to the set of patterns that depend on all cells in the range.
The resulting algorithm over the backward index is analogous to $\textbf{deps}$.
% \partitle{Column Insertions, Deletions, and Moves}
% At the index layer, columns are referenced by unique identifier; Ordering is imposed only at the presentation layer.
% Column insertion (or deletion) requires simply inserting (resp., removing) an entry from the forward and backward index.
% Column reordering requires no actions at the index level.
% \partitle{Row Insertions, Deletions, and Moves}
% Rows are identified by their position.
% When a row is inserted, deleted, or moved, references to the affected rows (and rows following them) change and must be updated.
% This update can be expensive, as it may require defining an entirely new set of patterns
% One alternative is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, where a new identifier can be allocated in between any two rows.
% An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
% Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to look up individual rows by their position.
% Instead, we adopt a lazy approach by associating every pattern with a construct that we call a reference frame, a function $\mathcal F$ a function mapping positions to specific rows $\mathcal F : \mathbb Z \rightarrow \mathcal R$ (where $\mathcal R$ denotes the set of all rows).
% In other words, the pair $\tuple{\row, \mathcal F}$ denotes the row $\mathcal F(\row)$.
% We observe that simple row insertions, deletions, or movement, apply a simple translation to a portion of the reference frame's domain.
% For example given initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
% $$\mathcal F'(\row) = \begin{cases}
% \mathcal F(\row) & \textbf{if } \row < 5\\
% \text{[new row } \row - 5\text{]} & \textbf{if } 5 \leq \row \leq 7\\
% \mathcal F(\row - 3) & \textbf{otherwise}
% \end{cases}$$
% Observe that, row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
% That is, we can define a reference frame transformation $T$:\tabularnewline
% $$T(\row) = \begin{cases}
% \row & \textbf{if } \row < 5\\
% \row+3 & \textbf{otherwise}
% \end{cases}$$
% For portions of the domain of $T$ that are defined, the function may be inverted:
% $$T^{-1}(\row) = \begin{cases}
% \row & \textbf{if } \row < 5\\
% \row-3 & \textbf{if } \row > 7\\
% error & \textbf{otherwise}
% \end{cases}$$
% Thus $\mathcal F(T(\row)) = \mathcal F'(\row)$, and $\mathcal F'(T^{-1}(\row)) = F(\row)$. Similar translations exist for deletions and row moves.
% Let $\mathcal F' = T_1 \circ \ldots \circ T_n \circ \mathcal F$, where $\circ$ denotes function composition.
% Given a history of transformations, any row $\tuple{i, \mathcal F}$ can be transformed into a later reference frame $\tuple{T_1(\ldots(T_n(\row))), \mathcal F'}$, or an earlier one.
% Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: