Pass through S3.2
parent
e5dfdc405b
commit
33e2e379ec
@@ -2,7 +2,7 @@
\section{System Design}
\label{sec:system}

We now outline the design of our prototype overlay spreadsheet, implemented as part of the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
We now overview our prototype overlay spreadsheet, implemented for use with the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and import/export format compatibility.
Our prototype likewise builds on Spark, using any dataframe as a data source.
@@ -41,29 +41,25 @@ The executor provides efficient access to cell values and is responsible for pus
Cell values are derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) A series of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
The executor decouples these sources into a cache around $\spreadsheet_{\ds}$ and an update index that stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
$$\spreadsheet[\column,\row] = \begin{cases}
\spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
\spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(\row)] & \textbf{otherwise}
\end{cases}$$
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an update index, respectively.
% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
% $$\spreadsheet[\column,\row] = \begin{cases}
% \spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
% \spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(\row)] & \textbf{otherwise}
% \end{cases}$$

The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) computes a topological sort over the full set of cells (in order of dependencies) and evaluates cells in this order.
However, this approach requires first expanding patterns (one per interaction) into individual expressions (one per cell).
Not only does this require a significant amount of memory, it also introduces unnecessary computational complexity:
In many cases, a topological sort can be derived from patterns rather than individual cells.
Moreover, it may be unnecessary to materialize the entire dataset.
The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
However, the computational cost of this approach can be proportional to the size of the data, as it requires expanding patterns out over all individual cells.
The Executor and Index leverage the fact that updates are already provided as patterns to materialize the spreadsheet faster.
We return to the index and efficient strategies for computing dependencies in \Cref{sec:system-index}, and first consider expression evaluation.

The executor relies on the Index to provide efficient topological sorts, as well as upstream and downstream dependency analysis.
We discuss these challenges below in \Cref{sec:system-index}.
Specifically, we rely on the observation that only a small fraction of cells will be visible at any one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} uses this observation to prioritize evaluation of visible cells).
The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.

To address the materialization concern, we rely on the observation that in most spreadsheet applications, only a small fraction of cells will be visible at one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} relies on this observation to prioritize evaluation of visible cells).
Each user-facing client maintains a range of visible rows with the executor.
The executor materializes only visible cells and the transitive closure of their dependencies.

Some dependencies (e.g., as in the running sum column example) may require computations over more rows than fit in cache.
Although we leave a detailed exploration of this challenge to future work, we observe that in many cases the concrete value of any one cell can be rewritten into a closed form.
Note that some dependency chains (e.g., the running sum example) still require computation for each row of data (e.g., if the last row is visible).
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such cells' expressions can often be rewritten into a closed form.
For example, any given cell in a running sum column may be expressed in terms of the sum of all preceding cells.
Our preliminary experiments show that when a chain of dependencies becomes sufficiently long, bulk computation can be used to provide a more responsive interface.
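As a hedged illustration of such a rewrite (the column names $A$ and $B$ and the 1-based row numbering are assumptions made for this sketch, not details of the prototype), suppose a running sum column $B$ over a data column $A$ is defined by the pattern
$$\spreadsheet[B,\row] = \begin{cases}
\spreadsheet[A,\row] & \textbf{if } \row = 1\\
\spreadsheet[B,\row-1] + \spreadsheet[A,\row] & \textbf{otherwise}
\end{cases}$$
Unrolling this recurrence yields the closed form
$$\spreadsheet[B,\row] = \sum_{i=1}^{\row} \spreadsheet[A,i]$$
Under these assumptions, the value of a cell in row $\row$ could be obtained from a single bulk aggregate over column $A$ rather than by following a chain of $\row - 1$ cell-to-cell dependencies.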