finishing out S3

main
Oliver Kennedy 2023-03-21 23:02:21 -04:00
parent 45d2c0e127
commit 7623914c8b
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
1 changed files with 53 additions and 52 deletions

@ -61,20 +61,24 @@ The executor only materializes active cells.
If the active cells remain confined to a specific set of rows that fit in cache, virtually all accesses can be serviced out of cache.
However, as we discuss below, the active region may scale to the full size of the dataset; we will return to this problem when we discuss future work.
\paragraph{Incremental Updates}
Updates to the reference frame ($\rtrans$) are passed through to both the data source and the update index.
Insertions and moves of rows and columns will not affect any dependencies; although the effects must be reflected in the materialized view, they do not generally trigger re-evaluation.
Conversely, column or row deletions, as well as cell updates ($\oup$), may affect cells downstream of the deleted or updated cells.
When such an update occurs, the executor uses the index to compute the full downstream of the set of affected cells, places them in a ``pending'' state, and triggers re-evaluation.
When a set of cell updates $\oup$ is applied to the spreadsheet, the executor identifies the set of cells invalidated by the update and triggers their re-evaluation.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Update Index}
The update index stores an encoded spreadsheet: a mapping from ranges of cells to the expression patterns that define those cells.
The goal of the index is efficient access to expressions for each cell and the dependency graph that their expressions define.
As noted above, we assume that the number of columns and rows are comparatively small and large, respectively.
Accordingly, columns are referenced by unique identifiers, while rows are referenced by position or range set of positions.
The update index stores a series of update operations ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the resulting overlay spreadsheet ($\spreadsheet_\overlay$):
(i) Access to the expressions for individual cells $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) Computing the upstream of a range of cells (for topological sort and computing the active set), and
(iii) Computing the downstream of a range of cells (for cell invalidation after an update).
Crucially, the update index avoids materializing the expressions for each cell by storing cell updates in the form of pattern-range tuples.
As noted above, we assume that the number of columns is comparatively small, and the number of rows is comparatively large.
\begin{figure}
\includegraphics[width=\columnwidth]{graphics/rangemap.png}
@ -84,10 +88,7 @@ Accordingly, columns are referenced by unique identifiers, while rows are refere
\paragraph{Range Maps}
The core building block for the update index is a one-dimensional range map, an ordered map with integer keys, optimized for storing contiguous ranges of mappings.
In addition to the usual operations defined over an ordered map (e.g., \lstinline{put}, \lstinline{get}, \lstinline{successorOf}), a range map defines:
\begin{lstlisting}
bulkPut(low, high, value)
\end{lstlisting}
In addition to the usual operations of an ordered map (e.g., \lstinline{put}, \lstinline{get}, \lstinline{successorOf}), a range map defines an operation \lstinline{bulkPut(low, high, value)}.
The semantics of this operation are defined as a \lstinline{put} on every element in the range from \lstinline{low} to \lstinline{high}.
Implemented naively through a binary tree over $N$ elements, this operation takes $O((high-low)\cdot\log(N))$ time.
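To make the bound concrete, consider a rough worked instance (the sizes are illustrative assumptions, not measurements): with $N = 2^{20}$ keys and a range of $high - low = 1000$ rows,
$$(high - low)\cdot\log_2(N) = 1000 \cdot 20 = 20{,}000$$
elementary tree steps are needed, up to constant factors, for a single \lstinline{bulkPut}.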
@ -136,7 +137,7 @@ If we discover a new dependency (lines 6-7), the newly discovered range is added
We will explain line 10 shortly.
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) returns the set of dependencies of a pattern applied to a specific range of cells $\rangeOf{\column, \rowRange}$.
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column, \rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf(\cell')$.
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column, \rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf{\cell'}$.
The algorithm uses a recursive traversal (lines 6-7) to visit every cell reference (offset or explicit):
For offset references (lines 2-3), the provided range of rows is offset by the appropriate amount.
For explicit cell references (lines 4-5), the explicit reference is used.
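As a small worked instance of the two cases (a sketch; the column names \texttt{A} and \texttt{B} and the row numbers are hypothetical), suppose a pattern applied to rows $[10,20]$ contains an offset reference to column \texttt{A} at offset $-1$ and an explicit reference to row $3$ of column \texttt{B}. The two references then contribute
$$\texttt{A}: [10 - 1,\; 20 - 1] = [9, 19] \qquad \text{and} \qquad \texttt{B}: [3, 3]$$
respectively: the offset reference shifts the whole row range, while the explicit reference is independent of the rows the pattern is applied to.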
@ -163,7 +164,7 @@ For explicit cell references (lines 4-5), the explicit reference is used.
\paragraph{Optimizing Recursive Reachability}
Consider a computation analogous to that of \Cref{fig:overlay}, where one cell computes a running total of a second cell. Such a cell might be defined by a pattern:
$$\rangeOf{\texttt{total}}{[1, 1000]} \rightarrow \cellRef{\texttt{value}}{@0} + \cellRef{\texttt{total}}{@-1}$$
$$\rangeOf{\texttt{total}}{[1, 1000]} \rightarrow \cellRef{\texttt{value}}{+0} + \cellRef{\texttt{total}}{-1}$$
Naively implemented, as in \Cref{alg:upstream}, computing reachability for the cell $\cellRef{\texttt{total}}{1000}$ will require visiting every distinct cell in the range 1-999: $\cellRef{\texttt{total}}{1000}$ depends on $\cellRef{\texttt{total}}{999}$, which depends on $\cellRef{\texttt{total}}{998}$, and so forth.
Observe that this dependency chain is defined entirely by a single pattern: Each cell defined by the pattern depends on another cell defined by the pattern.
@ -172,7 +173,7 @@ Note that the pattern's value may be self-referential, even if there is not a de
Patterns allow absolute references or offset references, but the former cannot trigger a recursive pattern without creating a cycle in the dependency graph.
Thus, recursive dependencies must be at fixed offsets, and the transitive closure must have a closed form representation.
For example, consider a cell $\cellRef{\texttt{total}}{500}$ defined by a recursive pattern over rows $[1,1000]$, with a recursive pattern dependency on $\cellRef{\texttt{total}}{@-2}$.
For example, consider a cell $\cellRef{\texttt{total}}{500}$ defined by a recursive pattern over rows $[1,1000]$, with a recursive pattern dependency on $\cellRef{\texttt{total}}{-2}$.
The transitive closure of the cell's dependency thus includes exactly the set of even rows (given the offset of $-2$) in the range $[1,500]$ (the cell through the start of the pattern's range).
Unfortunately, the size of the encoding of the range set needed to represent these dependencies scales with the number of rows, due to the gaps.
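Written out for the example above (a sketch; the index $k$ is introduced here only for illustration), the rows upstream of row 500 are
$$\{\, 500 - 2k \mid k \geq 1,\; 500 - 2k \geq 1 \,\} = \{498, 496, \ldots, 2\},$$
that is, 249 rows (250 if row 500 itself is counted). Because no two of these rows are adjacent, a range set needs one singleton range per row, which is the linear growth noted above.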
@ -192,48 +193,48 @@ To support these efficiently, we maintain a backward index that relates cell ran
Analogous to $\textbf{getDeps}$ inferring the cells immediately upstream of a range of cells, we can infer the cells downstream of any cell or set of cells, with one caveat.
When the cell identified by an absolute reference in a pattern is modified, all cells using the pattern are invalidated, so we track the set of ranges over which any given pattern is defined.
\paragraph{Column Insertions, Deletions, and Moves}
% \paragraph{Column Insertions, Deletions, and Moves}
At the index layer, columns are referenced by unique identifier; Ordering is imposed only at the presentation layer.
Column insertion (or deletion) requires simply inserting (resp., removing) an entry from the forward and backward index.
Column reordering requires no actions at the index level.
% At the index layer, columns are referenced by unique identifier; Ordering is imposed only at the presentation layer.
% Column insertion (or deletion) requires simply inserting (resp., removing) an entry from the forward and backward index.
% Column reordering requires no actions at the index level.
\paragraph{Row Insertions, Deletions, and Moves}
% \paragraph{Row Insertions, Deletions, and Moves}
Rows are identified by their position.
When a row is inserted, deleted, or moved, references to the affected rows (and rows following them) change and must be updated.
This update can be expensive, as it may require defining an entirely new set of patterns
% Rows are identified by their position.
% When a row is inserted, deleted, or moved, references to the affected rows (and rows following them) change and must be updated.
% This update can be expensive, as it may require defining an entirely new set of patterns
One alternative is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, where a new identifier can be allocated in between any two rows.
An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to look up individual rows by their position.
% One alternative is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, where a new identifier can be allocated in between any two rows.
% An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
% Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to look up individual rows by their position.
Instead, we adopt a lazy approach by associating every pattern with a construct that we call a reference frame: a function $\mathcal F : \mathbb Z \rightarrow \mathcal R$ mapping positions to specific rows (where $\mathcal R$ denotes the set of all rows).
In other words, the pair $\tuple{\row, \mathcal F}$ denotes the row $\mathcal F(\row)$.
We observe that simple row insertions, deletions, or movement, apply a simple translation to a portion of the reference frame's domain.
For example given initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
$$\mathcal F'(\row) = \begin{cases}
\mathcal F(\row) & \textbf{if } \row < 5\\
\text{[new row } \row - 5\text{]} & \textbf{if } 5 \leq \row \leq 7\\
\mathcal F(\row - 3) & \textbf{otherwise}
\end{cases}$$
Observe that, row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
That is, we can define a reference frame transformation $T$:\tabularnewline
$$T(\row) = \begin{cases}
\row & \textbf{if } \row < 5\\
\row+3 & \textbf{otherwise}
\end{cases}$$
For portions of the domain of $T$ that are defined, the function may be inverted:
$$T^{-1}(\row) = \begin{cases}
\row & \textbf{if } \row < 5\\
\row-3 & \textbf{if } \row > 7\\
error & \textbf{otherwise}
\end{cases}$$
Thus $\mathcal F'(T(\row)) = \mathcal F(\row)$, and $\mathcal F(T^{-1}(\row)) = \mathcal F'(\row)$ wherever $T^{-1}$ is defined. Similar translations exist for deletions and row moves.
% Instead, we adopt a lazy approach by associating every pattern with a construct that we call a reference frame: a function $\mathcal F : \mathbb Z \rightarrow \mathcal R$ mapping positions to specific rows (where $\mathcal R$ denotes the set of all rows).
% In other words, the pair $\tuple{\row, \mathcal F}$ denotes the row $\mathcal F(\row)$.
% We observe that simple row insertions, deletions, or movement, apply a simple translation to a portion of the reference frame's domain.
% For example given initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
% $$\mathcal F'(\row) = \begin{cases}
% \mathcal F(\row) & \textbf{if } \row < 5\\
% \text{[new row } \row - 5\text{]} & \textbf{if } 5 \leq \row \leq 7\\
% \mathcal F(\row - 3) & \textbf{otherwise}
% \end{cases}$$
% Observe that, row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
% That is, we can define a reference frame transformation $T$:\tabularnewline
% $$T(\row) = \begin{cases}
% \row & \textbf{if } \row < 5\\
% \row+3 & \textbf{otherwise}
% \end{cases}$$
% For portions of the domain of $T$ that are defined, the function may be inverted:
% $$T^{-1}(\row) = \begin{cases}
% \row & \textbf{if } \row < 5\\
% \row-3 & \textbf{if } \row > 7\\
% error & \textbf{otherwise}
% \end{cases}$$
% Thus $\mathcal F'(T(\row)) = \mathcal F(\row)$, and $\mathcal F(T^{-1}(\row)) = \mathcal F'(\row)$ wherever $T^{-1}$ is defined. Similar translations exist for deletions and row moves.
Let $\mathcal F' = T_1 \circ \ldots \circ T_n \circ \mathcal F$, where $\circ$ denotes function composition.
Given a history of transformations, any row $\tuple{i, \mathcal F}$ can be transformed into a later reference frame $\tuple{T_1(\ldots(T_n(\row))), \mathcal F'}$, or an earlier one.
Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
% Let $\mathcal F' = T_1 \circ \ldots \circ T_n \circ \mathcal F$, where $\circ$ denotes function composition.
% Given a history of transformations, any row $\tuple{i, \mathcal F}$ can be transformed into a later reference frame $\tuple{T_1(\ldots(T_n(\row))), \mathcal F'}$, or an earlier one.
% Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.