More trimming. Currently looking at S3.2
parent
d4735436f6
commit
4a416f70dc
|
@ -2,32 +2,31 @@
|
|||
\section{System Design}
|
||||
\label{sec:system}
|
||||
|
||||
We now overview our prototype overlay spreadsheet, implemented for use with the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
|
||||
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and import/export format compatibility.
|
||||
Our prototype likewise builds on Spark, using any dataframe as a data source.
|
||||
Our prototype overlay spreadsheet is implemented within the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
|
||||
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and data import/export.
|
||||
Our prototype is designed to accept any Spark dataframe as a data source.
|
||||
|
||||
The prototype's design is illustrated in \Cref{fig:systemdesign}.
|
||||
Client applications (e.g., Javascript-based frontends) connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and provides light syntactic sugar over the underlying data and update model.
|
||||
The data model itself is maintained by an \textbf{Execution} layer that is responsible for evaluating spreadsheet cells and materializing a subset of the cell values that are viewable.
|
||||
The execution layer applies an update overlay stored by an \textbf{Indexing} layer to an arbitrary Spark dataframe.
|
||||
A simple LRU \textbf{Cache} provides efficient random access to a subset of the dataframe's rows.
|
||||
% The prototype's design is illustrated in \Cref{fig:systemdesign}
|
||||
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates to our simplified spreadsheet model.
|
||||
An \textbf{Execution} layer is responsible for evaluating spreadsheet cells and materializing values for the viewable set of cells.
|
||||
An \textbf{Indexing} layer provides efficient access to the updates themselves, and a simple LRU cache provides efficient random access to the source dataframe.
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Presentation Layer}
|
||||
\label{sec:system-presentation}
|
||||
Multiple user-facing client applications connect to the overlay spreadsheet through a presentation layer.
|
||||
User-facing client applications connect to the overlay spreadsheet through a presentation layer.
|
||||
This layer mediates concurrent updates of the spreadsheet, allows clients to subscribe to push-based updates of cell state, and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns, as well as maintaining a bound over the number of rows in the spreadsheet.
|
||||
Operations over columns (insertion, deletion, reordering) are handled at this layer, allowing lower levels to reference the (comparatively small) set of columns by column identity.
|
||||
With the exception of updates to columns, most updates are coalesced into a serial order and relayed to lower levels.
|
||||
Column operations (insertion, deletion, reordering) are handled at this layer, so lower levels can reference the (comparatively small) set of columns solely by column identity.
|
||||
Other updates are put into a serial order and relayed to lower levels.
|
||||
|
||||
\begin{figure}
|
||||
\includegraphics[width=0.4\columnwidth]{graphics/system-arch}
|
||||
\caption{Overlay system design.}
|
||||
\label{fig:systemdesign}
|
||||
\trimfigurespacing
|
||||
\end{figure}
|
||||
% \begin{figure}
|
||||
% \includegraphics[width=0.4\columnwidth]{graphics/system-arch}
|
||||
% \caption{Overlay system design.}
|
||||
% \label{fig:systemdesign}
|
||||
% \trimfigurespacing
|
||||
% \end{figure}
|
||||
|
||||
The presentation layer expects the level below it to provide (i) efficient random access to cell values, and (ii) subscription access to state (e.g., value) updates for ranges of cells.
|
||||
|
||||
|
@ -37,11 +36,11 @@ The presentation layer expects the level below it to provide (i) efficient rando
|
|||
\subsection{Executor}
|
||||
\label{sec:system-executor}
|
||||
|
||||
The executor provides efficient access to cell values and is responsible for pushing notifications about cell state changes to clients.
|
||||
Cell values is derived from two sources:
|
||||
The executor provides efficient access to cell values and pushes notifications about cell state changes.
|
||||
Cell values are derived from two sources:
|
||||
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
|
||||
(ii) A series of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
|
||||
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an update index, respectively.
|
||||
(ii) Overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
|
||||
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the update index.
|
||||
% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
|
||||
% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
|
||||
% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
|
||||
|
@ -50,12 +49,12 @@ These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an upda
|
|||
% \spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(r)] & \textbf{otherwise}
|
||||
% \end{cases}$$
|
||||
|
||||
The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
|
||||
However, the computational cost of this approach can be proportional to the size of the data, as it requires expanding patterns out over all individual cells.
|
||||
The Executor and Index leverage the fact that updates are already provided as patterns to materialize the spreadsheet faster.
|
||||
We return to the index and efficient strategies for computing dependencies in \Cref{sec:system-index}, and first consider expression evaluation.
|
||||
|
||||
Specifically, we rely on the observation that only a small fraction of cells will be visible at any one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} uses this observation to prioritize evaluation of visible cells).
|
||||
The naive approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
|
||||
However, the computational cost of this approach can be proportional to the size of the data, as each cell may need to be evaluated.
|
||||
The Executor reduces this cost through two insights:
|
||||
(i) Updates are already provided over multiple cells in bulk as patterns, and
|
||||
(ii) Only a small fraction of cells will be visible at any one time.
|
||||
Assuming the dependencies of a range of cells can be computed efficiently (we return to this in \Cref{sec:system-index}), only the visible cells and their non-visible dependencies need to be evaluated.
|
||||
The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.
|
||||
|
||||
Note that some dependency chains (e.g., the running sum example) still require computation for each row of data (e.g., if the last row is visible).
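The evaluation strategy described above can be illustrated with a minimal Python sketch. It is not the prototype's code: the names `evaluate_visible`, `base`, and `overlay` are hypothetical, the overlay is modeled as a function from a cell to an optional (dependencies, function) pair so that a pattern is never expanded over all rows, and the dependency graph is assumed to be acyclic.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Cell = Tuple[str, int]  # (column, row)
Formula = Tuple[List[Cell], Callable[[List[float]], float]]


def evaluate_visible(
    base: Callable[[Cell], float],
    overlay: Callable[[Cell], Optional[Formula]],
    visible: Iterable[Cell],
) -> Dict[Cell, float]:
    """Evaluate only the visible cells and the transitive closure of
    their dependencies. Assumes the dependency graph has no cycles."""
    values: Dict[Cell, float] = {}
    for target in visible:
        stack = [target]  # explicit stack: long chains would overflow recursion
        while stack:
            cell = stack[-1]
            if cell in values:
                stack.pop()
                continue
            formula = overlay(cell)
            if formula is None:
                # No overlay update applies: defer to the source data.
                values[cell] = base(cell)
                stack.pop()
                continue
            deps, fn = formula
            pending = [d for d in deps if d not in values]
            if pending:
                stack.extend(pending)  # evaluate dependencies first
            else:
                values[cell] = fn([values[d] for d in deps])
                stack.pop()
    return values


# Hypothetical source data: column A holds the row number.
def base(cell: Cell) -> float:
    _, row = cell
    return row


# Hypothetical overlay patterns: C = 2 * A, and B = running sum of A.
def overlay(cell: Cell) -> Optional[Formula]:
    column, row = cell
    if column == "C":
        return ([("A", row)], lambda v: 2 * v[0])
    if column == "B":
        if row == 0:
            return ([("A", 0)], lambda v: v[0])
        return ([("B", row - 1), ("A", row)], lambda v: v[0] + v[1])
    return None
```

In this sketch, making two cells of column C visible touches only four cells (the two C cells and the two A cells they read), regardless of the size of the data. Making a single cell of the running sum column B visible at row $r$ still evaluates every row up to $r$, which matches the caveat about long dependency chains.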