Merge branch 'main' of git.odin.cse.buffalo.edu:VizierDB/paper-Vizier-SpreadsheetOverlay

2023-03-19 17:27:04 -05:00 · 2023-03-19 17:27:04 -05:00 · 4e74337553
parent 39c98f7204 dd809554b3
commit 4e74337553
3 changed files with 114 additions and 5 deletions
--- a/graphics/rangemap.png
+++ b/graphics/rangemap.png
--- a/main.tex
+++ b/main.tex
@@ -5,6 +5,9 @@
 \usepackage{amsmath}
 \usepackage{xspace}
 \usepackage{colortbl}
+\usepackage{listings}
+\usepackage{algorithm}
+\usepackage{algpseudocode}

 \input{macros}

--- a/sections/system.tex
+++ b/sections/system.tex
@@ -19,15 +19,121 @@ Finally, a \textbf{Presentation} layer defines syntactic sugar over the executio
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Update Index}
-The update index stores an encoded spreadsheet: a mapping from ranges of cells to expression patterns.  
+The update index stores an encoded spreadsheet: a mapping from ranges of cells to the expression patterns that define those cells.  
+The goal of the index is efficient access to expressions for each cell and the dependency graph that their expressions define.
+As noted above, we assume that the number of columns is comparatively small, while the number of rows is comparatively large.
+Accordingly, columns are referenced by unique identifiers, while rows are referenced by position or by a range set of positions.
+
+\begin{figure}
+  \includegraphics[width=\columnwidth]{graphics/rangemap.png}
+  \caption{A range map defines a mapping from disjoint ranges to values.}
+  \label{fig:rangemap}
+\end{figure}
+
+\paragraph{Range Maps}
+The core building block for the update index is a one-dimensional range map, an ordered map with integer keys, optimized for storing contiguous ranges of mappings.
+In addition to the usual operations defined over an ordered map (e.g., \lstinline{put}, \lstinline{get}, \lstinline{successorOf}), a range map defines:
+\begin{lstlisting}
+  bulkPut(low, high, value)
+\end{lstlisting}
+The semantics of this operation are defined as a \lstinline{put} on every element in the range from \lstinline{low} to \lstinline{high}.  
+Implemented naively through a binary tree over $N$ elements, this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$ time.
+
+A range map avoids the $(\texttt{high}-\texttt{low})$ factor (and correspondingly reduces $N$) by storing only one mapping for each \lstinline{bulkPut} operation, as illustrated in \Cref{fig:rangemap}.
+Analogous to a range set, we implement a range map as an ordered sequence of tuples of (disjoint) ranges and values.  
+To provide logarithmic-time access by position, we build a classical ordered map (e.g., a red-black tree) over the lower bound of each range.
+Since the map enforces disjoint ranges, ordering on the lower bound guarantees ordering over the full range.
+With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values), with similar costs for insertion and deletion.
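+
+As an illustrative sketch (the concrete ranges and values are hypothetical, and we assume that both bounds of \lstinline{bulkPut} are inclusive), consider a range map that initially holds a single mapping, $[1, 10] \mapsto A$.
+Because \lstinline{bulkPut(4, 6, B)} behaves like a \lstinline{put} on rows 4 through 6 only, rows 1 through 3 and 7 through 10 must remain mapped to $A$, and one way to keep the stored ranges disjoint is to split the existing entry:
+\begin{align*}
+  \text{before:} &\quad \{\; [1, 10] \mapsto A \;\} \\
+  \text{after:}  &\quad \{\; [1, 3] \mapsto A,\;\; [4, 6] \mapsto B,\;\; [7, 10] \mapsto A \;\}
+\end{align*}
+In this example the map grows by two entries, independent of the width of the range being written.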
+
+\paragraph{Cell Access}
+Efficient access to individual cells is obtained through a two-layered forward index, consisting of an unordered map over a set of range maps.
+The pattern for a specific cell is obtained by looking up the cell's column in the unordered map, and the cell's row in the corresponding range map.
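+
+A minimal sketch of this lookup is given below; the names \lstinline{forwardIndex} and \lstinline{rangeMap} are hypothetical and chosen for illustration only:
+\begin{lstlisting}
+  rangeMap = forwardIndex.get(column)  // unordered map, keyed by column
+  pattern  = rangeMap.get(row)         // range map, keyed by row position
+\end{lstlisting}
+Assuming a hash-based unordered map, the first step takes expected constant time, while the second takes time logarithmic in the number of ranges stored for the column.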


+\begin{algorithm}
+\caption{\textbf{upstream}($\columnRange$, $\rowRange$)}
+\label{alg:upstream}
+\begin{algorithmic}[1]
+  \Require $\rangeOf{\columnRange}{\rowRange}$: The range of cells whose upstream set is computed.
+  \Ensure $\texttt{upstream}$: The set of cells on which $\rangeOf{\columnRange}{\rowRange}$ depends.
+  \State $\texttt{upstream} \leftarrow \{\}$
+  \State $\texttt{work} \leftarrow \comprehension{(\column, \rowRange, \{\})}{\column \in \columnRange}$
+  \While{$(\column', \rowRange', \texttt{lineage}) \leftarrow \texttt{work}.\textbf{dequeue}$}
+    \For{$(\rowRange'', \texttt{pattern}) \leftarrow \texttt{forwardIndex}(\column', \rowRange')$}
+      \For{$(\column_{d}, \rowRange_{d}, \texttt{offset}) \leftarrow \textbf{getDeps}(\texttt{pattern}, \column', \rowRange'')$}
+        \State $(\column_{d}, \rowRange_{d}) \leftarrow (\column_{d}, \rowRange_{d}) - \texttt{upstream}$
+        \If{$(\column_{d}, \rowRange_{d})$ is non-empty}
+          \State $\texttt{upstream} \leftarrow \texttt{upstream} + (\column_{d}, \rowRange_{d})$
+          \State $\texttt{work}.\textbf{enqueue}( \column_{d}, \rowRange_{d},$
+          \State \hfill$\texttt{lineage} \cup \{\texttt{pattern} \rightarrow \texttt{offset}\} )$
+        \EndIf
+      \EndFor
+    \EndFor
+  \EndWhile
+\end{algorithmic}
+\end{algorithm}

-An update index stores an encoded spreadsheet 
+\paragraph{Upstream Reachability}
+To develop a materialization plan, the execution layer must be able to derive the set of cells on which a specific cell (or range of cells) depends.
+We refer to this as the set of upstream cells for the specific target.
+\Cref{alg:upstream} illustrates a naive breadth-first search to obtain the full upstream set for a given target range.  
+Each item in the BFS's work queue consists of a column, a row set, and a lineage set that pertains to an optimization we will discuss below.
+For each work item dequeued, we use the forward index range map to obtain the set of patterns appearing in the range specified by the work item (line 4), and iterate over the set of their dependencies (line 5).
+If we discover a new dependency (lines 6--7), the newly discovered range is added to the return set and the work queue (lines 8--10).
+We will explain the lineage set maintained on line 10 shortly.
+
+The \textbf{getDeps} operation (line 5; \Cref{alg:getDeps}) returns the set of dependencies of a pattern applied to a specific range of cells $\rangeOf{\column}{\rowRange}$.
+Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column}{\rowRange}$ such that $\cell \in \depsOf(\cell')$.
+The algorithm uses a recursive traversal (lines 6-7) to visit every cell reference (offset or explicit):
+For offset references (lines 2-3), the provided range of rows is offset by the appropriate amount.
+For explicit cell references (lines 4-5), the explicit reference is used.
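+
+As a worked example, consider a hypothetical pattern $\cellRef{\texttt{value}}{@0} + \cellRef{\texttt{total}}{@-1}$ applied to rows $[2, 1000]$ of a column \texttt{total} (the column names and the row range are assumptions made for illustration).
+The pattern is neither an offset reference nor an explicit reference, so \textbf{getDeps} recurses into its two children, each of which is an offset reference:
+\begin{align*}
+  \cellRef{\texttt{value}}{@0}  &\;\mapsto\; (\texttt{value},\; [2, 1000] + 0,\; 0) = (\texttt{value},\; [2, 1000],\; 0) \\
+  \cellRef{\texttt{total}}{@-1} &\;\mapsto\; (\texttt{total},\; [2, 1000] - 1,\; -1) = (\texttt{total},\; [1, 999],\; -1)
+\end{align*}
+The returned set \texttt{deps} is the union of these two tuples.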
+
+\begin{algorithm}
+  \caption{\textbf{getDeps}($\texttt{pattern}, \column, \rowRange$)}
+  \label{alg:getDeps}
+  \begin{algorithmic}[1]
+    \Require \texttt{pattern}: An expression pattern
+    \Require $\rangeOf{\column}{\rowRange}$: A range of cells
+    \Ensure \texttt{deps}: The dependencies of \texttt{pattern} applied to $\rangeOf{\column}{\rowRange}$
+    \State $\texttt{deps} \leftarrow \{\}$
+    \If{\texttt{pattern} \textbf{is an offset reference} $\cellRef{\column'}{\delta'}$}
+      \State $\texttt{deps} \leftarrow \texttt{deps} \cup \{ (\column', \rowRange+\delta', \delta') \}$
+    \ElsIf{\texttt{pattern} \textbf{is a direct reference} $\cellRef{\column'}{\row'}$}
+      \State $\texttt{deps} \leftarrow \texttt{deps} \cup \{ (\column', \row', \emptyset) \} $
+    \Else 
+      \State $\texttt{deps} \leftarrow \texttt{deps} \cup \underset{\texttt{child} \in \texttt{pattern}}{\bigcup} \textbf{getDeps}(\texttt{child}, \column, \rowRange) $
+    \EndIf
+  \end{algorithmic}
+\end{algorithm}


-Evaluating a cell in a spreadsheet requires evaluating transitive dependencies; 
-The spreadsheet may thus be viewed as a graph, with one node for each cell and one edge for each dependency.
-The update index maintains 
+\paragraph{Optimizing Recursive Reachability}

+Consider a computation analogous to that of \Cref{fig:overlay}, where one column computes a running total of a second column.  Such a column might be defined by a pattern:
+\[\rangeOf{\texttt{total}}{[1, 1000]} \rightarrow \cellRef{\texttt{value}}{@0} + \cellRef{\texttt{total}}{@-1}\]
+Naively implemented, as in \Cref{alg:upstream}, computing reachability for the cell $\cellRef{\texttt{total}}{1000}$ will require visiting every distinct cell in the range 1--999: $\cellRef{\texttt{total}}{1000}$ depends on $\cellRef{\texttt{total}}{999}$, which depends on $\cellRef{\texttt{total}}{998}$, and so forth.
+
+Observe that this dependency chain is defined entirely by a single pattern: Each cell defined by the pattern depends on another cell defined by the pattern.
+We refer to a pattern that references cells defined by the same pattern as \emph{recursive}.
+Note that a pattern may be recursive even if there is no dependency cycle among the individual cells that it defines.
+
+When we encounter a recursive pattern, it may be possible to compute a closed-form representation of its dependencies without visiting each individual dependency.
+Continuing the example above, each of the pattern's cells depends on all of the preceding cells that the pattern defines.
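+
+As a sketch of what such a closed form might look like for the example pattern (an illustrative derivation, not a description of the final algorithm), unrolling the offset reference $\cellRef{\texttt{total}}{@-1}$ starting from a row $r \in [1, 1000]$ gives:
+\begin{align*}
+  \text{rows of \texttt{total} reached from row } r &= \{\, r - k \mid 1 \leq k \leq r - 1 \,\} = [1, r-1] \\
+  \text{rows of \texttt{value} reached from row } r &= \{\, r - k \mid 0 \leq k \leq r - 1 \,\} = [1, r]
+\end{align*}
+For $r = 1000$, this yields rows $[1, 999]$ of \texttt{total} and rows $[1, 1000]$ of \texttt{value}, without enumerating the intermediate cells.
+The reference from row 1 to row 0 of \texttt{total} falls outside the range that the pattern defines and is omitted here.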
+
+TODO
+
+\paragraph{Downstream Reachability}
+
+TODO: 
+Same algorithm as above, but use a reverse index.  
+Talk about maintaining the reverse index efficiently.
+
+\paragraph{Column Insertions and Deletions}
+
+TODO:
+short: just note that there is no implicit ordering, so these just involve updating the unordered maps encoding column values.
+
+\paragraph{Row Insertions/Deletions}
+
+TODO: probably just migrate the reference frame text here.