More trimming. Currently looking at S3.2

main
Oliver Kennedy 2023-03-30 16:55:16 -04:00
parent d4735436f6
commit 4a416f70dc
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
1 changed files with 26 additions and 27 deletions

@@ -2,32 +2,31 @@
\section{System Design}
\label{sec:system}
We now overview our prototype overlay spreadsheet, implemented for use with the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and import/export format compatibility.
Our prototype likewise builds on Spark, using any dataframe as a data source.
Our prototype overlay spreadsheet is implemented within the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and data import/export.
Our prototype is designed to accept any Spark dataframe as a data source.
The prototype's design is illustrated in \Cref{fig:systemdesign}
Client applications (e.g., Javascript-based frontends) connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and provides light syntactic sugar over the underlying data and update model.
The data model itself is maintained by an \textbf{Execution} layer that is responsible for evaluating spreadsheet cells and materializing a subset of the cell values that are viewable.
The execution layer applies an update overlay stored by an \textbf{Indexing} layer to an arbitrary Spark dataframe.
A simple LRU \textbf{Cache} provides efficient random access to a subset of the dataframe's rows.
% The prototype's design is illustrated in \Cref{fig:systemdesign}
Client applications connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and translates client requests into our simplified spreadsheet model.
An \textbf{Execution} layer is responsible for evaluating spreadsheet cells and materializing values for the viewable set of cells.
An \textbf{Indexing} layer provides efficient access to the updates themselves, and a simple LRU cache provides efficient random access to the source dataframe.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation Layer}
\label{sec:system-presentation}
Multiple user-facing client applications connect to the overlay spreadsheet through a presentation layer.
User-facing client applications connect to the overlay spreadsheet through a presentation layer.
This layer mediates concurrent updates of the spreadsheet, allows clients to subscribe to push-based updates of cell state, and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns, as well as maintaining a bound over the number of rows in the spreadsheet.
Operations over columns (insertion, deletion, reordering) are handled at this layer, allowing lower levels to reference the (comparatively small) set of columns by column identity.
With the exception of updates to columns, most updates are coalesced into a serial order and relayed to lower levels.
Column operations (insertion, deletion, reordering) are handled at this layer, so lower levels can reference the (comparatively small) set of columns solely by column identity.
Other updates are put into a serial order and relayed to lower levels.
\begin{figure}
\includegraphics[width=0.4\columnwidth]{graphics/system-arch}
\caption{Overlay system design.}
\label{fig:systemdesign}
\trimfigurespacing
\end{figure}
% \begin{figure}
% \includegraphics[width=0.4\columnwidth]{graphics/system-arch}
% \caption{Overlay system design.}
% \label{fig:systemdesign}
% \trimfigurespacing
% \end{figure}
The presentation layer expects the level below it to provide (i) efficient random access to cell values, and (ii) subscription access to state (e.g., value) updates for ranges of cells.
@@ -37,11 +36,11 @@ The presentation layer expects the level below it to provide (i) efficient rando
\subsection{Executor}
\label{sec:system-executor}
The executor provides efficient access to cell values and is responsible for pushing notifications about cell state changes to clients.
Cell values is derived from two sources:
The executor provides efficient access to cell values and pushes notifications about cell state changes.
Cell values are derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) A series of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an update index, respectively.
(ii) Overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the update index.
% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
@@ -50,12 +49,12 @@ These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an upda
% \spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(r)] & \textbf{otherwise}
% \end{cases}$$
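As a minimal sketch of how the two sources can be combined (not the prototype's actual implementation), the following Python fragment resolves a cell by consulting an update index first and falling back to an LRU-cached read of the source data. The names \texttt{LRUCache}, \texttt{OverlaySheet}, \texttt{set\_cell}, and \texttt{get\_cell} are hypothetical, a list of dictionaries stands in for the Spark dataframe, updates hold literal values rather than expression patterns, and row transformations ($\rtrans_i$) are omitted.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache over source rows (illustrative stand-in for the Cache)."""

    def __init__(self, fetch_row, capacity=2):
        self.fetch_row = fetch_row      # callable: row index -> row (dict)
        self.capacity = capacity
        self.rows = OrderedDict()       # row index -> row, oldest first
        self.misses = 0

    def get(self, row):
        if row in self.rows:
            self.rows.move_to_end(row)  # mark as most recently used
            return self.rows[row]
        self.misses += 1
        value = self.fetch_row(row)
        self.rows[row] = value
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the least recently used row
        return value


class OverlaySheet:
    """Overlay of cell updates on top of a cached, read-only source."""

    def __init__(self, source_rows, capacity=2):
        self.cache = LRUCache(lambda r: source_rows[r], capacity)
        self.updates = {}               # (column, row) -> overriding value

    def set_cell(self, column, row, value):
        self.updates[(column, row)] = value

    def get_cell(self, column, row):
        # The overlay wins where it is defined; otherwise defer to the source.
        if (column, row) in self.updates:
            return self.updates[(column, row)]
        return self.cache.get(row)[column]


source = [{"A": 1, "B": 10}, {"A": 2, "B": 20}, {"A": 3, "B": 30}]
sheet = OverlaySheet(source)
sheet.set_cell("B", 1, 99)
```

In this sketch, reading an overridden cell never touches the source, so the cache is only exercised for cells where the overlay is undefined.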
The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
However, the computational cost of this approach can be proportional to the size of the data, as it requires expanding patterns out over all individual cells.
The Executor and Index leverage the fact that updates are already provided as patterns to materialize the spreadsheet faster.
We return to the index and efficient strategies for computing dependencies in \Cref{sec:system-index}, and first consider expression evaluation.
Specifically, we rely on the observation that only a small fraction of cells will be visible at any one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} uses this observation to prioritize evaluation of visible cells).
The naive approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
However, the computational cost of this approach can be proportional to the size of the data, as each cell may need to be evaluated.
The Executor reduces this cost through two insights:
(i) Updates are already provided over multiple cells in bulk as patterns, and
(ii) Only a small fraction of cells will be visible at any one time.
Assuming the dependencies of a range of cells can be computed efficiently (we return to this in \Cref{sec:system-index}), only the visible cells and their non-visible dependencies need to be evaluated.
The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.
Note that some dependency chains (e.g., the running sum example) still require computation for each row of data (e.g., if the last row is visible).
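The evaluation strategy above can be sketched as follows. This is an illustration under stated assumptions rather than the Executor's actual code: the function \texttt{evaluate\_visible}, the helper \texttt{build\_sheet}, and the representation of a cell as a (dependencies, function) pair are all hypothetical, and the dependency graph is assumed to be acyclic. The sketch evaluates only the requested cells and the transitive closure of their dependencies, using an explicit stack so that long dependency chains do not exhaust the call stack.

```python
def evaluate_visible(cells, visible):
    """Evaluate the visible cells and everything they transitively depend on.

    cells maps a cell id to (dependencies, function); the function receives
    the dependency values in order. Assumes the dependency graph is acyclic.
    """
    values = {}
    stack = list(visible)
    while stack:
        cell = stack[-1]
        if cell in values:              # already computed via another path
            stack.pop()
            continue
        deps, fn = cells[cell]
        pending = [d for d in deps if d not in values]
        if pending:
            stack.extend(pending)       # evaluate dependencies first
        else:
            values[cell] = fn(*(values[d] for d in deps))
            stack.pop()
    return values


def build_sheet(n):
    """Hypothetical sheet: A holds data, B doubles A, S is a running sum of A."""
    cells = {}
    for r in range(n):
        cells[("A", r)] = ((), lambda r=r: r + 1)
        cells[("B", r)] = ((("A", r),), lambda a: a * 2)
        if r == 0:
            cells[("S", r)] = ((("A", r),), lambda a: a)
        else:
            cells[("S", r)] = ((("S", r - 1), ("A", r)), lambda s, a: s + a)
    return cells


cells = build_sheet(1000)
local = evaluate_visible(cells, [("B", 998), ("B", 999)])  # row-local formulas
chain = evaluate_visible(cells, [("S", 999)])              # running sum
```

With two visible cells of the row-local formula, only four cells are evaluated; with the last running-sum cell visible, every row contributes to the closure, which matches the caveat about dependency chains that span the data.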