Pass through S3.2

2023-03-27 20:50:25 -04:00 · 2023-03-27 20:50:25 -04:00 · 33e2e379ec
parent e5dfdc405b
commit 33e2e379ec
1 changed files with 17 additions and 21 deletions
--- a/sections/system.tex
+++ b/sections/system.tex
@ -2,7 +2,7 @@
 \section{System Design}
 \label{sec:system}

-We now outline the design of our prototype overlay spreadsheet, implemented as part of the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
+We now give an overview of our prototype overlay spreadsheet, implemented for use with the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
 Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and import/export format compatibility. 
 Our prototype likewise builds on Spark, using any dataframe as a data source.

@ -41,29 +41,25 @@ The executor provides efficient access to cell values and is responsible for pus
 Cell values are derived from two sources: 
 (i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and 
 (ii) A series of overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
-The executor decouples these sources into a cache around $\spreadsheet_{\ds}$ and an update index that stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
-Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$. 
-The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
-$$\spreadsheet[\column,\row] = \begin{cases}
-\spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
-\spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(r)] & \textbf{otherwise}
-\end{cases}$$
+These sources are implemented by a cache around $\spreadsheet_{\ds}$ and an update index, respectively.
+% The update index stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$.
+% Here, $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$. 
+% The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
+% $$\spreadsheet[\column,\row] = \begin{cases}
+% \spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
+% \spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(r)] & \textbf{otherwise}
+% \end{cases}$$

-The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) computes a topological sort over the full set of cells (in order of dependencies) and evaluates cells in this order.
-However, this approach requires first expanding patterns (one per interaction) into individual expressions (one per cell); 
-Not only does this requires a significant amount of memory, it introduces unnecessary computational complexity: 
-In many cases, a topological sort can be derived from patterns rather than individual cells.
-Moreover, it may be unnecessary to materialize the entire dataset.
+The direct approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
+However, the computational cost of this approach can be proportional to the size of the data, as it requires expanding patterns out over all individual cells.
+The Executor and Index leverage the fact that updates are already provided as patterns to materialize the spreadsheet faster.
+We return to the index and efficient strategies for computing dependencies in \Cref{sec:system-index}, and first consider expression evaluation.

-The executor relies on the Index to provide efficient topological sorts, as well as upstream and downstream dependency analysis.
-We discuss these challenges below in \Cref{sec:system-index}. 
+Specifically, we rely on the observation that only a small fraction of cells will be visible at any one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} uses this observation to prioritize evaluation of visible cells).  
+The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.

-To address the materialization concern, we rely on the observation that in most spreadsheet applications, only a small fraction of cells will be visible at one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} relies on this observation to prioritize evaluation of visible cells).  
-Each user-facing client maintains a range of visible rows with the executor.
-The executor materializes only visible cells and the transitive closure of their dependencies.
-
-Some dependencies (e.g., as in the running sum column example) may require computations over more rows than fit in cache.  
-Although we leave a detailed exploration of this challenge to future work, we observe that in many cases the concrete value of any one cell can be rewritten into a closed form.
+Note that some dependency chains (e.g., the running sum example) still require computation for each row of data (e.g., if the last row is visible).
+Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such cells' expressions can often be rewritten into a closed form.
 For example, any given cell in a running sum column may be expressed in terms of the sum of all preceding cells.  
 Our preliminary experiments show that when a chain of dependencies becomes sufficiently long, bulk computation can be used to provide a more responsive interface.