Pass through S3.2

main
Oliver Kennedy 2023-03-27 20:50:25 -04:00
parent e5dfdc405b
commit 33e2e379ec
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
1 changed files with 17 additions and 21 deletions

@ -2,7 +2,7 @@
\section{System Design}
\label{sec:system}
We now outline the design of our prototype overlay spreadsheet, implemented as part of the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
We now give an overview of our prototype overlay spreadsheet, implemented for use with the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and import/export format compatibility.
Our prototype likewise builds on Spark, using any dataframe as a data source.
@ -41,29 +41,25 @@ The executor provides efficient access to cell values and is responsible for pus
Cell values are derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) A series of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
The executor decouples these sources into a cache around $\spreadsheet_{\ds}$ and an update index that stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
$$\spreadsheet[\column,\row] = \begin{cases}
\spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
\spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(\row)] & \textbf{otherwise}
\end{cases}$$
These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an update index, respectively.
% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
% $$\spreadsheet[\column,\row] = \begin{cases}
% \spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
% \spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(\row)] & \textbf{otherwise}
% \end{cases}$$
The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) computes a topological sort over the full set of cells (in order of dependencies) and evaluates cells in this order.
However, this approach requires first expanding patterns (one per interaction) into individual expressions (one per cell).
Not only does this require a significant amount of memory, it also introduces unnecessary computational complexity:
In many cases, a topological sort can be derived from patterns rather than individual cells.
Moreover, it may be unnecessary to materialize the entire dataset.
The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
However, the computational cost of this approach can be proportional to the size of the data, as it requires expanding patterns out over all individual cells.
The Executor and Index leverage the fact that updates are already provided as patterns to materialize the spreadsheet faster.
We return to the index and efficient strategies for computing dependencies in \Cref{sec:system-index}, and first consider expression evaluation.
The executor relies on the Index to provide efficient topological sorts, as well as upstream and downstream dependency analysis.
We discuss these challenges below in \Cref{sec:system-index}.
Specifically, we rely on the observation that only a small fraction of cells will be visible at any one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} uses this observation to prioritize evaluation of visible cells).
The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.
To address the materialization concern, we rely on the observation that in most spreadsheet applications, only a small fraction of cells will be visible at one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} relies on this observation to prioritize evaluation of visible cells).
Each user-facing client maintains a range of visible rows with the executor.
The executor materializes only visible cells and the transitive closure of their dependencies.
Some dependencies (e.g., as in the running sum column example) may require computations over more rows than fit in cache.
Although we leave a detailed exploration of this challenge to future work, we observe that in many cases the concrete value of any one cell can be rewritten into a closed form.
Note that some dependency chains (e.g., the running sum example) still require computation for each row of data (e.g., if the last row is visible).
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such cells' expressions can often be rewritten into a closed form.
For example, any given cell in a running sum column may be expressed in terms of the sum of all preceding cells.
Our preliminary experiments show that when a chain of dependencies becomes sufficiently long, bulk computation can be used to provide a more responsive interface.
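As a minimal illustration of such a closed form (the column names $A$ and $B$ and the zero-based row index are assumptions introduced here for the example, not part of the prototype), suppose a running sum column $B$ over a data column $A$ is defined by the pattern
$$\spreadsheet[B,\row] = \begin{cases}
\spreadsheet[A,0] & \textbf{if } \row = 0\\
\spreadsheet[B,\row-1] + \spreadsheet[A,\row] & \textbf{otherwise}
\end{cases}$$
Unrolling the recurrence removes the dependency chain through $B$:
$$\spreadsheet[B,\row] = \sum_{i=0}^{\row} \spreadsheet[A,i]$$
Under these assumptions, a visible cell in $B$ could be obtained from a single aggregate over $A$, rather than by evaluating each cell of $B$ above it in turn.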