Back down to 6pp

main
Oliver Kennedy 2023-03-30 20:23:24 -04:00
parent 4a416f70dc
commit cbfc0553fd
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
4 changed files with 60 additions and 61 deletions

View File

@ -3,9 +3,9 @@
\section{Conclusions and Future Work}
\label{sec:conclusions}
In this work, we introduced overlay spreadsheets as a potential direction for reproducible spreadsheets where a user's edits can be re-applied to updated input data, and thus used directly in classical workflow and provenance analysis systems like Vizier.
In this work, we introduced overlay spreadsheets as a potential direction for reproducible spreadsheets in workflow and provenance analysis systems like Vizier.
This novel capability is powered by overlays that decouple the user's edits from the source data they are applied to.
We also demonstrated how updates to ranges of cells can be represented declaratively, improving performance and introducing several avenues for optimized evaluation of recursive patterns.
We also demonstrated how updates to ranges of cells can be represented declaratively, improving performance and enabling optimized evaluation of recursive patterns.
Recursive patterns remain the source of several open challenges for us.
Most notably, in the absence of recursive patterns, the depth of a dependency chain is bounded by the number of user interactions.

View File

@ -7,20 +7,20 @@
}
\subcaptionbox{Fix Data, Move View}{
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varystart.pdf}
}
}\\[1mm]
\subcaptionbox{Scale Data, View Last}{
\includegraphics[width=0.47\columnwidth]{results/laptop-init-varystartandsize.pdf}
}
\subcaptionbox{Scale Data, View First}{
\includegraphics[width=0.47\columnwidth]{results/laptop-update_one-varysize.pdf}
}
}\\[-2mm]
% \subcaptionbox{Fix Data, Move View}{
% \includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystart.pdf}
% }
% \subcaptionbox{Scale Data, View Last}{
% \includegraphics[width=0.28\textwidth]{results/laptop-update_one-varystartandsize.pdf}
% }
\caption{System Initialization costs (a-c) and cost to update one cell (d-f)}
\caption{Time to initialize the spreadsheet (a-b) and cost to update one cell (c-d)}
\label{fig:experiments}
\trimfigurespacing
\end{figure}
@ -37,22 +37,25 @@ Concretely, we are interested in two questions:
% Laptop
Experiments were run on an 8-core 2.3 GHz Intel i7-11800H running Linux (Kernel 5.19), with 32G of DDR4-3200 RAM, and a 2TB 970 EVO NVME solid state drive.
We compare three systems:
(i) \textbf{DataSpread}: Dataspread version 0.5~\cite{bendre-15-d}, the most recent version at the time of submission;
(i) \textbf{DataSpread}: Dataspread version 0.5~\cite{bendre-15-d};
(ii) \textbf{Vizier}: Our prototype implementation of overlay spreadsheets; and
(iii) \textbf{Vizier (Simulated Batching)}: Our prototype with simulated hybrid batch processing (see Setup, below).
(iii) \textbf{Vizier (Simulated Batching)}: Simulated hybrid batch processing (see Setup, below).
All experiments were performed with a warm cache.
\partitle{Setup}
We address our questions through a simple microbenchmark modeled after query 1 from the TPC-H benchmark~\cite{tpc-h}: The spreadsheet is defined by the TPC-H \texttt{lineitem} dataset with $\texttt{N}$ rows and four additional columns defined by the patterns:\\[-5mm]
We address our questions through a microbenchmark modeled after TPC-H query 1~\cite{tpc-h}: The spreadsheet is defined by the TPC-H \texttt{lineitem} dataset with $\texttt{N}$ rows and four additional columns defined by the patterns:\\[-2mm]
{\footnotesize
\begin{verbatim}
base_price[1-N] = ext_price[+0]
disc_price[1-N] = base_price[+0] * (1 - discount[+0])
charge[1-N] = disc_price[+0] * (1 + tax[+0])
sum_charge[1] = charge[1]
sum_charge[2-N] = charge[+0] + sum_charge[-1]
\end{verbatim}
Note that the \texttt{sum\_charge} column is a running total and the length of the dependency chain on row $i$ is proportional to $i$. Thus, as the user scrolls down the page (under normal usage), the runtime to compute individual cells grows linearly.
Each system under test is allowed to load the spreadsheet with a viewable area of 50 rows.
}
\noindent The \texttt{sum\_charge} column is a running total, creating a dependency chain that grows linearly with row index.
As the user scrolls down the page (under normal usage), the runtime to compute individual cells grows linearly.
Each system loads the spreadsheet with a viewable area of 50 rows and updates a single cell.
We measure (i) the cost of initialization and (ii) the cost of a single update.
Time is measured until quiescence.
To emulate batch processing, we replace the formula for $\texttt{sum\_charge}[i-1]$ (where $i$ is the first visible row) with a formula that computes the analogous aggregate query.
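Concretely, since \texttt{sum\_charge} is a running total of \texttt{charge}, the replacement formula is equivalent to $\texttt{sum\_charge}[i-1] = \sum_{j=1}^{i-1} \texttt{charge}[j]$, which the data source can evaluate as a single aggregate query rather than a chain of $i-1$ individual cell evaluations.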
@ -69,14 +72,15 @@ To emulate batch processing, we replace the formula for the $\texttt{sum\_change
% \label{fig:perf-scale-visible}
% \trimfigurespacing
% \end{figure}
\Cref{fig:experiments}(a,c) shows initialization and update costs, with a fixed dataset size of approximately 600,000 rows, and a variable viewport position.
Due to the running sum, the longest visible dependency chain grows as the visible region moves further into the dataset.
\Cref{fig:experiments}(a,c) shows costs for a fixed dataset size of approximately 600,000 rows, varying the viewport position.
Due to the running sum, later rows require more computation.
Costs for Vizier and DataSpread grow with the length of the dependency chain, while batch processing computes the updated sum significantly faster.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Scaling Data}
\Cref{fig:experiments}(b,d) shows the initialization and update costs when the viewport is on the first cell. Vizier only needs to compute the visible cell formulas, and so is significantly faster.
\Cref{fig:experiments}(b,d) shows costs when varying data size, with the view fixed on the first cell.
Because dependencies in the visible area are of constant size, Vizier is faster.
% \Cref{fig:experiments}(c,f) show these costs when the viewport is on the last cell; as before, the costs for Vizier grow with the length of the longest visible dependency chain, supporting the value of batching.

View File

@ -22,14 +22,13 @@ A key challenge is that classical database techniques, which exploit common stru
\cite{DBLP:conf/icde/BendreVZCP18} explores data structures that can leverage partial structure; for example, when a range of cells are structured as a relational table.
\cite{DBLP:conf/sigmod/BendreWMCP19} explores strategies for quickly invalidating cells and computing dependencies, by leveraging a (lossy) compressed dependency graph that can efficiently bound a cell's downstream.
\cite{tang-23-efcsfg} introduces a different type of compressed dependency graph which is lossless, instead exploiting repeating patterns in formulas.
This is analogous to our own approach, but focuses on the dependency graph;
As we demonstrate, expression patterns create more optimization opportunities.
This is analogous to our own approach, but focuses on the dependency graph rather than expressions, limiting opportunities for optimization.
In summary, several efficient algorithms for storing, accessing, and updating spreadsheets have been developed and adapted in the context of the DataSpread.
In summary, DataSpread introduced multiple efficient algorithms for storing, accessing, and updating spreadsheets.
The virtual approach is often less efficient, but has the advantage of supporting light-weight versioning and provenance tracking.
Crucially, this approach also enables replaying a user's updates, originally applied to one dataset, on a new dataset (e.g., to re-apply curation work on an updated version of the data).
The overlay approach we present in this work has the potential to retain these benefits while enabling performance competitive with, or exceeding that of DataSpread.
Furthermore, overlays with reference frames enable more efficient support for insertion and deletion for rows and columns as this only affects reference frames, but not the formulas of cells.
The overlay approach we present in this work has the potential to retain these benefits while enabling performance competitive with DataSpread.
% Furthermore, overlays with reference frames allow more efficient insertion and deletion for rows and columns as this only affects reference frames, but not the formulas of cells.
%%% Local Variables:

View File

@ -36,7 +36,7 @@ The presentation layer expects the level below it to provide (i) efficient rando
\subsection{Executor}
\label{sec:system-executor}
The executor provides efficient access to cell values and pushes notifications about cell state changes.
The executor provides efficient access to cell values and notifications about cell state changes.
Cell values are derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) Overlay updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) that extend the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
@ -49,37 +49,35 @@ These sources are implemented by a cache around $\spreadsheet_{\ds}$ and the upd
% \spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(r)] & \textbf{otherwise}
% \end{cases}$$
The naive approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) first computes a topological sort over all cells (in order of dependencies) and evaluates them in this order.
However, the computational cost of this approach can be proportional to the size of the data, as each cell may need to be evaluated.
The Executor reduces this cost through two insights:
The naive approach to materializing $\spreadsheet$ (e.g., as in~\cite{DBLP:conf/sigmod/BendreWMCP19}) computes a topological sort over cell dependencies and evaluates cells in this order.
The Executor side-steps the linear (in the data size) cost of the naive approach through two insights:
(i) Updates are already provided over multiple cells in bulk as patterns, and
(ii) Only a small fraction of cells will be visible at any one time.
Assuming the dependencies of a range of cells can be computed efficiently (we return to this in \Cref{sec:system-index}), only the visible cells and their not visible dependencies need to be evaluated.
The Executor only evaluates cell expressions on rows that are (close to being) visible to the user, and the transitive closure of their dependencies.
Note that some dependency chains (e.g., the running sum example) still require computation for each row of data (e.g., if the last row is visible).
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such cell's expressions can often be rewritten into a closed form.
For example, any given cell in a running sum column may be expressed in terms of the sum of all preceding cells.
Our preliminary experiments show that when a chain of dependencies becomes sufficiently long, bulk computation can be used to provide a more responsive interface.
Some dependency chains (e.g., running sums) still require computation for each row of data.
Although we leave a detailed exploration of this challenge to future work, we observe that the fixed point of such pattern expressions can often be rewritten into a closed form.
For example, any cell in a running sum column is equivalent to a sum over the preceding cells.
Our preliminary experiments (\Cref{sec:experiments}) suggest promise in a hybrid evaluation strategy that evaluates visible cells individually and computes hidden pattern cells through closed form aggregate queries.
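To make the hybrid strategy concrete, the following is a minimal illustrative sketch (in Python, with hypothetical names; not the actual Vizier executor) of evaluating a running-sum column by collapsing the hidden prefix of the dependency chain into a single bulk aggregate:
{\footnotesize
\begin{verbatim}
# Illustrative sketch only: hybrid evaluation of a running-sum column.
# Visible cells are evaluated individually; the hidden prefix of the
# recursive chain is collapsed into one closed-form aggregate.
def eval_running_sum(charge, first_visible, last_visible):
    # Returns {row: sum_charge[row]} for the visible rows only
    # (rows are 0-indexed here for simplicity).
    # Closed form for the hidden prefix: one bulk aggregate instead of
    # first_visible individual cell evaluations.
    prefix = sum(charge[:first_visible])

    visible = {}
    acc = prefix
    for row in range(first_visible, last_visible + 1):
        # Visible cells still follow the recursive pattern
        # sum_charge[r] = charge[r] + sum_charge[r-1].
        acc += charge[row]
        visible[row] = acc
    return visible

# Example: 1,000,000 rows, a 50-row viewport near the end of the data.
charge = [1.0] * 1_000_000
print(eval_running_sum(charge, 999_950, 999_999)[999_999])  # 1000000.0
\end{verbatim}
}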
\partitle{Incremental Updates}
\partitle{Updates}
When the executor receives an update to a cell, it uses the index to compute the set of invalidated cells, marks them as ``pending,'' and begins re-evaluating them in topological order.
An update to the reference frame is applied to both the index and the data source.
Following typical spreadsheet semantics, an insertion or row move updates references in dependent formulas, so modulo changes in the set of visible rows, no re-evaluation is required.
Following typical spreadsheet semantics, an insertion or row move updates references in dependent formulas, so no re-evaluation is typically required.
If a row with dependent cells is deleted, the dependent cells need to be updated to indicate the error.
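As an illustrative sketch (not the actual executor; helper names are hypothetical), the update loop can be summarized as follows:
{\footnotesize
\begin{verbatim}
# Sketch (assumed, not the Vizier executor): invalidate downstream cells,
# mark them pending, and re-evaluate them in dependency order.
from graphlib import TopologicalSorter

def apply_update(cell, new_expr, exprs, deps, downstream_of, evaluate):
    # exprs: cell -> expression, deps: cell -> set of cells it reads,
    # downstream_of(cell): transitive dependents, evaluate(cell): recompute.
    exprs[cell] = new_expr
    pending = {cell} | downstream_of(cell)       # cells marked "pending"
    order = TopologicalSorter({c: deps[c] & pending for c in pending})
    for c in order.static_order():               # dependencies first
        evaluate(c)

# Toy usage: B reads A, C reads B.
exprs = {"A": "1", "B": "A + 1", "C": "B + 1"}
deps = {"A": set(), "B": {"A"}, "C": {"B"}}
downstream = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
values = {}
apply_update("A", "10", exprs, deps, downstream.get,
             lambda c: values.update({c: eval(exprs[c], {}, values)}))
print(values)  # {'A': 10, 'B': 11, 'C': 12}
\end{verbatim}
}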
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Update Index}
\label{sec:system-index}
The update index provides efficient positional access to the spreadsheet (denoted $\spreadsheet_\overlay$) defined by a sequence of updates ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$), with $\errorval$ for all undefined cells.
Specifically, the index is required to support:
(i) Access to the expressions for individual cells $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) Computing the upstream of a range of cells (for topological sort and computing the active set), and
(iii) Computing the downstream of a range of cells (for cell invalidation after an update).
The key insight behind the index is that it stores updates in the form of pattern-range tuples to avoid materializing the full spreadsheet.
As noted above, we assume that the number of columns is comparatively small, and the number of rows is comparatively large.
The update index stores a sequence of updates ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the cells of an overlay spreadsheet (denoted $\spreadsheet_\overlay$), where undefined cells have the value $\errorval$.
This entails:
(i) cell expressions $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) the upstream of a range of cells (for topological sort and computing the active set), and
(iii) the downstream of a range of cells (for cell invalidation after an update).
The key insight behind the index is that updates are stored in the form of pattern-range tuples.
%As noted above, we assume that the number of columns is small and the number of rows is large.
\begin{figure}
\includegraphics[width=0.7\columnwidth]{graphics/rangemap.pdf}
@ -89,24 +87,24 @@ As noted above, we assume that the number of columns is comparatively small, and
\end{figure}
\partitle{Range Maps}
The core building block for the update index is a one-dimensional range map, an ordered map with integer keys.
The update index is built over a one-dimensional range map, an ordered map with integer keys.
In addition to the usual operations of an ordered map (e.g., \texttt{put}, \texttt{get}, \texttt{successorOf}), we define the operation \texttt{bulkPut(low, high, value)} which is equivalent to a \texttt{put} on every element in the range from \texttt{low} to \texttt{high}.
Implemented naively through a binary tree over $N$ elements, this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$ time.
Implemented naively (e.g., as a binary tree over $N$ elements), this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$ time.
A range map avoids the $(\texttt{high}-\texttt{low})$ factor (and correspondingly reduces $N$) by storing an ordered sequence of disjoint ranges, each mapping one specific value as illustrated in \Cref{fig:rangemap}.
A binary tree provides efficient membership lookups over the ranges.
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values), with similar insertion and deletion costs.
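For illustration, a simplified Python sketch of such a range map (list-based for clarity, with hypothetical names; the stated bounds assume a balanced tree):
{\footnotesize
\begin{verbatim}
# Simplified sketch of a 1-D range map: sorted, disjoint integer ranges,
# each mapped to a single value.  A list rebuild is used for clarity; the
# O(log N + M) bounds in the text assume a balanced tree.
import bisect

class RangeMap:
    def __init__(self):
        self.ranges = []                 # sorted [low, high, value] triples

    def bulk_put(self, low, high, value):
        # Map every key in [low, high] to value (like a put on each key).
        new = []
        for lo, hi, v in self.ranges:
            if hi < low or lo > high:    # no overlap: keep unchanged
                new.append([lo, hi, v])
            else:                        # overlap: keep the remainders
                if lo < low:
                    new.append([lo, low - 1, v])
                if hi > high:
                    new.append([high + 1, hi, v])
        new.append([low, high, value])
        self.ranges = sorted(new)

    def get(self, key, default=None):
        # Binary search for the range containing key, if any.
        i = bisect.bisect_right(self.ranges, [key, float("inf"), None]) - 1
        if i >= 0 and self.ranges[i][0] <= key <= self.ranges[i][1]:
            return self.ranges[i][2]
        return default

    def values_in(self, low, high):
        # Distinct values appearing anywhere in [low, high].
        return {v for lo, hi, v in self.ranges if lo <= high and hi >= low}
\end{verbatim}
}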
\partitle{Cell Access}
The index layer maintains a ``forward" index: An unordered map that stores a range map for each column.
To compute the expression for a cell $\cellRef{\column}{\row}$, the index layer (i) looks up the range map for $\column$ in the unordered map, (ii) looks up $\row$ in the range map to obtain a pattern (and returns $\emptyset$ if the row is undefined), and (iii) computes the expression by applying the pattern to $\cellRef{\column}{\row}$.
The index layer maintains a ``forward'' index: An unordered map $\mathcal I$ that stores a range map $\mathcal I[\column]$ for each column.
The expression for a cell $\cellRef{\column}{\row}$ is stored at $\mathcal I[\column][\row]$.
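Continuing the illustrative sketch above (hypothetical names), the forward index is then simply a map from columns to range maps:
{\footnotesize
\begin{verbatim}
# Forward index sketch: one RangeMap (see above) per column.
forward = {"sum_charge": RangeMap()}
forward["sum_charge"].bulk_put(1, 1, "charge[1]")
forward["sum_charge"].bulk_put(2, 1000, "charge[+0] + sum_charge[-1]")

def cell_expr(col, row):
    # Look up the column's range map, then the row's pattern; None stands
    # in for the error value of an undefined cell.  (Instantiating the
    # pattern for the specific row is omitted here.)
    rm = forward.get(col)
    return rm.get(row) if rm is not None else None

print(cell_expr("sum_charge", 500))   # charge[+0] + sum_charge[-1]
\end{verbatim}
}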
\begin{algorithm}
\caption{\textbf{upstream}($\columnRange$, $\rowRange$)}
\label{alg:upstream}
\begin{algorithmic}[1]
\Require $\rangeOf{\columnRange, \rowRange}$: A range of cells to compute the upstream of.
\Ensure $\texttt{upstream}$: A set of cells on which $\rangeOf{\column}{\rowRange}$ is a dependency.
\Require $\rangeOf{\columnRange, \rowRange}$: A range to compute the upstream of.
\Ensure $\texttt{upstream}$: Cells on which $\rangeOf{\columnRange, \rowRange}$ depends (transitively).
\State $\texttt{upstream} \leftarrow \{\}$
\State $\texttt{work} \leftarrow \comprehension{(\column, \rowRange, \{\})}{\column \in \columnRange}$
\While{$(\column', \rowRange', \texttt{lineage}) \leftarrow \texttt{work}.\textbf{dequeue}$}
@ -116,7 +114,8 @@ To compute the expression for a cell $\cellRef{\column}{\row}$, the index layer
\If{$(\column_{d}, \rowRange_{d})$ is non-empty}
\State $\texttt{upstream} \leftarrow \texttt{upstream} + (\column_{d}, \rowRange_{d})$
\State $\texttt{work}.\textbf{enqueue}( \column_{d}, \rowRange_{d},$\\
\hfill$\comprehension{ \texttt{p}' \rightarrow (\texttt{o}'+\texttt{offset})}{ ((\texttt{p}' \rightarrow \texttt{o}' )\in \texttt{lineage}}$\\
\hspace*{37mm}$\{\;\texttt{p}' \rightarrow (\texttt{o}'+\texttt{offset})$~~~~~~~~~\\
\hspace*{40mm}$|\; (\texttt{p}' \rightarrow \texttt{o}' )\in \texttt{lineage}\}$\\
\hfill $\cup \{\texttt{pattern} \rightarrow \texttt{offset}\} )$
\EndIf
\EndFor
@ -132,7 +131,7 @@ We refer to this set as the target's \emph{upstream}.
Each item in the BFS's work queue consists of a column, a row set, and a lineage; we will return to the lineage shortly.
For each work item enqueued, we query the forward index to obtain the set of patterns in the range (line 4), and iterate over the set of their dependencies (line 5).
If we discover a new dependency (lines 6-7), the newly discovered range is added to the return set and the work queue.
We will explain line 10 shortly.
We will explain lines 10-12 shortly.
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) computes the immediate dependencies of a range of cells $\rangeOf{\column, \rowRange}$ that share a pattern.
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column}{\rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf{\cell'}$.
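For illustration, a minimal sketch of this dependency computation (hypothetical encoding of pattern references; not the actual implementation):
{\footnotesize
\begin{verbatim}
# Sketch of getDeps: immediate dependencies of column col, rows [lo, hi],
# which all share one pattern.  A reference is either
#   ("offset", column, delta)  e.g. sum_charge[-1]
#   ("absolute", column, row)  an explicit cell reference
def get_deps(col, lo, hi, pattern_refs):
    deps = []
    for kind, dep_col, k in pattern_refs:
        if kind == "offset":
            deps.append((dep_col, lo + k, hi + k))  # shift the whole range
        else:
            deps.append((dep_col, k, k))            # shared by every cell
    return deps

# charge[2-1000] = disc_price[+0] * (1 + tax[+0])
print(get_deps("charge", 2, 1000,
               [("offset", "disc_price", 0), ("offset", "tax", 0)]))
\end{verbatim}
}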
@ -161,26 +160,23 @@ For explicit cell references (lines 4-5), the explicit reference is used.
\partitle{Optimizing Recursive Reachability}
Consider a running sum, such as the one in \Cref{ex:recursive-running-sum}.
Observe that the $k$th element will have $O(k)$ upstream dependencies, and so naively following \Cref{alg:upstream} requires $O(k)$ compute.
The $k$th element will have $O(k)$ upstream dependencies, and so naively following \Cref{alg:upstream} requires $O(k)$ compute.
However, observe that a single pattern is responsible for all of these dependencies, suggesting that a more efficient option may be available.
Specifically, this dependency chain is defined by recursion over single pattern; all but the first cell depend on another cell defined by the same pattern.
We refer to a pattern that references cells defined by the same pattern as \emph{recursive}.
Note that a recursive pattern need not indicate a dependency cycle between individual cells.
This dependency chain arises from recursion over a single pattern; most cells depend on other cells defined by the same pattern.
We refer to such a pattern as \emph{recursive}, even if it does not create a dependency cycle over individual cells.
Our key insight is that for some (mutually) recursive patterns, the transitive closure of the dependencies will have a closed-form representation.
As with cell execution, the transitive closure of the dependencies of a recursive pattern often has a closed-form representation.
In our running example, the upstream of any $\cellRef{D}{k}$ is exactly $\cellRef{D}{1-(k-1)}$ and $\cellRef{C}{1-k}$.
%
The \texttt{lineage} field of \Cref{alg:upstream} is used to track the set of patterns visited, and the offset(s) at which they were visited.
If the pattern being visited already appears in the lineage, then we know it is recursive and that we can extend the sequence of upstream cells across the remaining cells of the pattern.
If the offset is $\pm 1$, then the elements of this sequence are efficiently representable as a range of cells and can return it in $O(1)$ time.
When the offset is $\pm 1$, the elements of this sequence are efficiently representable as a range of cells, computable in $O(1)$ time.
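To illustrate (hypothetical names; only the offset $-1$ case is shown), the shortcut for a self-recursive pattern can be sketched as:
{\footnotesize
\begin{verbatim}
# Sketch of the recursive shortcut: if the lineage shows a pattern depends
# on itself at offset -1, its upstream collapses to contiguous ranges
# instead of k individual cells.
def recursive_upstream(col, row, pattern_lo, external_refs):
    # Pattern of the form  col[r] = ... + col[r-1]  defined on rows
    # [pattern_lo, N]; external_refs = [(other_col, delta)].
    upstream = []
    if row > pattern_lo:
        # Self-recursion at offset -1: the whole prefix, in O(1).
        upstream.append((col, pattern_lo, row - 1))
    for other_col, delta in external_refs:
        # Every step of the recursion also reads other_col at this offset,
        # so those reads collapse to a single range as well.
        upstream.append((other_col, pattern_lo + delta, row + delta))
    return upstream

# Running sum D[r] = C[r] + D[r-1]: upstream of D[500] is D[1-499], C[1-500].
print(recursive_upstream("D", 500, 1, [("C", 0)]))
\end{verbatim}
}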
\partitle{Downstream Reachability}
When a cell's expression is updated, cells that depend on it (even transitively) must be recomputed.
The index must thus also support downstream reachability queries.
To support these efficiently, we maintain a backward index that relates cell ranges to the ranges of patterns that depend on it.
Analog to $\textbf{getDeps}$ inferring cells immediately upstream of a range of cells, we can infer the cells downstream of any cell or set of cells, with one caveat.
When the cell identified by an absolute reference in a pattern is modified, all cells using the pattern are invalidated, so we track the set of ranges over which any given pattern is defined.
When a cell's expression is updated, cells that depend on it (even transitively) must be recomputed, so the index must support downstream reachability queries.
To support efficient downstream lookups, the index maintains a ``backward'' index that relates ranges to the set of patterns that depend on all cells in the range.
The resulting algorithm over the backward index is analogous to $\textbf{getDeps}$.
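As an illustration (hypothetical encoding; offset references only), a single step of such a downstream lookup might look as follows; transitive invalidation repeats the lookup on the newly discovered cells:
{\footnotesize
\begin{verbatim}
# Sketch of a downstream lookup: backward[col] holds entries
# (dep_col, dep_lo, dep_hi, delta) meaning that dep_col[r], for r in
# [dep_lo, dep_hi], reads col[r + delta].
def downstream(col, row, backward):
    hits = []
    for dep_col, dep_lo, dep_hi, delta in backward.get(col, []):
        r = row - delta                  # invert the offset reference
        if dep_lo <= r <= dep_hi:
            hits.append((dep_col, r))    # this cell must be invalidated
    return hits

# disc_price[1-1000] reads base_price[+0]:
backward = {"base_price": [("disc_price", 1, 1000, 0)]}
print(downstream("base_price", 42, backward))   # [('disc_price', 42)]
\end{verbatim}
}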
% \partitle{Column Insertions, Deletions, and Moves}