finishing out S3
parent
45d2c0e127
commit
7623914c8b
|
@ -61,20 +61,24 @@ The executor only materializes active cells.
|
|||
If the active cells remain confined to a specific set of rows that fit in cache, virtually all accesses can be serviced out of cache.
|
||||
However, as we discuss below, the active region may scale to the full size of the dataset; we will return to this problem when we discuss future work.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
\paragraph{Incremental Updates}
|
||||
Updates to the reference frame ($\rtrans$) are passed through to both the data source and the update index.
|
||||
Insertions and moves of rows and columns do not affect any dependencies; although their effects must be reflected in the materialized view, they do not generally trigger re-evaluation.
|
||||
Conversely, column or row deletions, as well as cell updates ($\oup$), may affect cells downstream of the deleted or updated cells.
|
||||
When such an update occurs, the executor uses the index to compute the full downstream of the set of affected cells, places them in a ``pending'' state, and triggers re-evaluation.
|
||||
When a set of cell updates $\oup$ is applied to the spreadsheet, the executor identifies the set of cells invalidated by the update and triggers their re-evaluation.
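As a minimal sketch of the invalidation step, the following computes the full downstream of a set of updated cells by breadth-first traversal. It is an illustration only: it tracks individual cells rather than the pattern ranges stored in the update index, and the names \lstinline{downstream} and \lstinline{dependents} are hypothetical.

```python
from collections import deque


def downstream(dependents, updated):
    """Return every cell reachable from `updated` through `dependents`.

    `dependents` maps a cell to the cells whose expressions read it
    (a per-cell stand-in for the backward index; an assumption of this sketch).
    """
    pending = set()
    queue = deque(updated)
    while queue:
        cell = queue.popleft()
        for reader in dependents.get(cell, ()):
            if reader not in pending:
                # Newly invalidated: mark it pending and keep traversing.
                pending.add(reader)
                queue.append(reader)
    return pending
```

The cells in the returned set would be placed in the ``pending'' state and re-evaluated in topological order.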
|
||||
|
||||
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
||||
\subsection{Update Index}
|
||||
The update index stores an encoded spreadsheet: a mapping from ranges of cells to the expression patterns that define those cells.
|
||||
The goal of the index is efficient access to expressions for each cell and the dependency graph that their expressions define.
|
||||
As noted above, we assume that the number of columns and rows are comparatively small and large, respectively.
|
||||
Accordingly, columns are referenced by unique identifiers, while rows are referenced by position or by a range set of positions.
|
||||
The update index stores a series of update operations ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the resulting overlay spreadsheet ($\spreadsheet_\overlay$):
|
||||
(i) Access to the expressions for individual cells $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
|
||||
(ii) Computing the upstream of a range of cells (for topological sort and computing the active set), and
|
||||
(iii) Computing the downstream of a range of cells (for cell invalidation after an update).
|
||||
|
||||
Crucially, the update index avoids materializing the expressions for each cell by storing cell updates in the form of pattern-range tuples.
|
||||
As noted above, we assume that the number of columns is comparatively small, and the number of rows is comparatively large.
|
||||
|
||||
\begin{figure}
|
||||
\includegraphics[width=\columnwidth]{graphics/rangemap.png}
|
||||
|
@ -84,10 +88,7 @@ Accordingly, columns are referenced by unique identifiers, while rows are refere
|
|||
|
||||
\paragraph{Range Maps}
|
||||
The core building block for the update index is a one-dimensional range map, an ordered map with integer keys, optimized for storing contiguous ranges of mappings.
|
||||
In addition to the usual operations defined over an ordered map (e.g., \lstinline{put}, \lstinline{get}, \lstinline{successorOf}), a range map defines:
|
||||
\begin{lstlisting}
|
||||
bulkPut(low, high, value)
|
||||
\end{lstlisting}
|
||||
In addition to the usual operations of an ordered map (e.g., \lstinline{put}, \lstinline{get}, \lstinline{successorOf}), a range map defines an operation \lstinline{bulkPut(low, high, value)}.
|
||||
The semantics of this operation are defined as a \lstinline{put} on every element in the range from \lstinline{low} to \lstinline{high}.
|
||||
Implemented naively through a binary tree over $N$ elements, this operation takes $O((\texttt{high}-\texttt{low})\cdot\log N)$ time.
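To illustrate why storing ranges avoids the per-element cost, the sketch below keeps one entry per contiguous range, so a \lstinline{bulkPut} touches only the ranges it overlaps. This is an assumed illustration, not the structure used in the paper: it is backed by sorted Python lists (so splicing is linear in the number of stored ranges), whereas a balanced tree would be needed for logarithmic bounds.

```python
import bisect


class RangeMap:
    """Integer-keyed map that stores disjoint inclusive ranges, not single keys."""

    def __init__(self):
        # Parallel lists sorted by range start: [lows[i], highs[i]] -> values[i].
        self.lows, self.highs, self.values = [], [], []

    def get(self, key):
        i = bisect.bisect_right(self.lows, key) - 1
        if i >= 0 and key <= self.highs[i]:
            return self.values[i]
        return None

    def bulk_put(self, low, high, value):
        # Stored ranges with index in [start, end) overlap [low, high].
        start = bisect.bisect_left(self.highs, low)
        end = bisect.bisect_right(self.lows, high)
        new = []
        if start < end and self.lows[start] < low:
            # Keep the part of the first overlapped range left of `low`.
            new.append((self.lows[start], low - 1, self.values[start]))
        new.append((low, high, value))
        if start < end and self.highs[end - 1] > high:
            # Keep the part of the last overlapped range right of `high`.
            new.append((high + 1, self.highs[end - 1], self.values[end - 1]))
        self.lows[start:end] = [r[0] for r in new]
        self.highs[start:end] = [r[1] for r in new]
        self.values[start:end] = [r[2] for r in new]
```

Overwriting rows 400 to 500 of a 1000-row range, for example, leaves three stored entries rather than 1000.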
|
||||
|
||||
|
@ -136,7 +137,7 @@ If we discover a new dependency (lines 6-7), the newly discovered range is added
|
|||
We will explain line 10 shortly.
|
||||
|
||||
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) returns the set of dependencies of a pattern applied to a specific range of cells $\rangeOf{\column, \rowRange}$.
|
||||
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column, \rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf(\cell')$.
|
||||
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column, \rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf{\cell'}$.
|
||||
The algorithm uses a recursive traversal (lines 6-7) to visit every cell reference (offset or explicit):
|
||||
For offset references (lines 2-3), the provided range of rows is offset by the appropriate amount.
|
||||
For explicit cell references (lines 4-5), the explicit reference is used.
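The two cases can be sketched as follows. This is a simplified illustration that returns only the immediate dependencies of a pattern applied to a row range. The expression classes (\lstinline{OffsetRef}, \lstinline{AbsRef}, \lstinline{BinOp}) are hypothetical stand-ins for the pattern representation, which this excerpt does not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRef:
    column: str
    offset: int  # row offset relative to the cell being evaluated


@dataclass(frozen=True)
class AbsRef:
    column: str
    row: int  # explicit row position


@dataclass(frozen=True)
class BinOp:
    left: object
    right: object


def get_deps(expr, rows):
    """Return {(column, (low, high))} read by `expr` applied to the rows `rows`."""
    low, high = rows
    if isinstance(expr, OffsetRef):
        # Offset reference: shift the whole row range by the offset.
        return {(expr.column, (low + expr.offset, high + expr.offset))}
    if isinstance(expr, AbsRef):
        # Explicit reference: the same single cell for every row in the range.
        return {(expr.column, (expr.row, expr.row))}
    if isinstance(expr, BinOp):
        # Recurse into sub-expressions and merge their dependencies.
        return get_deps(expr.left, rows) | get_deps(expr.right, rows)
    return set()  # constants have no dependencies
```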
|
||||
|
@ -163,7 +164,7 @@ For explicit cell references (lines 4-5), the explicit reference is used.
|
|||
\paragraph{Optimizing Recursive Reachability}
|
||||
|
||||
Consider a computation analogous to that of \Cref{fig:overlay}, where one cell computes a running total of a second cell. Such a cell might be defined by a pattern:
|
||||
$$\rangeOf{\texttt{total}}{[1, 1000]} \rightarrow \cellRef{\texttt{value}}{@0} + \cellRef{\texttt{total}}{@-1}$$
|
||||
$$\rangeOf{\texttt{total}}{[1, 1000]} \rightarrow \cellRef{\texttt{value}}{+0} + \cellRef{\texttt{total}}{-1}$$
|
||||
Naively implemented, as in \Cref{alg:upstream}, computing reachability for the cell $\cellRef{\texttt{total}}{1000}$ will require visiting every distinct cell in the range 1-999: $\cellRef{\texttt{total}}{1000}$ depends on $\cellRef{\texttt{total}}{999}$, which depends on $\cellRef{\texttt{total}}{998}$, and so forth.
|
||||
|
||||
Observe that this dependency chain is defined entirely by a single pattern: Each cell defined by the pattern depends on another cell defined by the pattern.
|
||||
|
@ -172,7 +173,7 @@ Note that the pattern's value may be self-referential, even if there is not a de
|
|||
|
||||
Patterns allow absolute references or offset references, but the former cannot trigger a recursive pattern without creating a cycle in the dependency graph.
|
||||
Thus, recursive dependencies must be at fixed offsets, and the transitive closure must have a closed form representation.
|
||||
For example, consider a cell $\cellRef{\texttt{total}}{500}$ defined by a recursive pattern over rows $[1,1000]$, with a recursive pattern dependency on $\cellRef{\texttt{total}}{@-2}$.
|
||||
For example, consider a cell $\cellRef{\texttt{total}}{500}$ defined by a recursive pattern over rows $[1,1000]$, with a recursive pattern dependency on $\cellRef{\texttt{total}}{-2}$.
|
||||
The transitive closure of the cell's dependency thus includes exactly the set of even rows (given the offset of $-2$) in the range $[1,500]$ (the cell through the start of the pattern's range).
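A closed form for this example can be sketched with an arithmetic progression. The function below is an illustration under the assumption of a single fixed recursive offset; the name \lstinline{recursive_closure} is hypothetical, and the result excludes the starting cell itself.

```python
def recursive_closure(row, offset, low, high):
    """Rows reached from `row` by repeatedly applying `offset` within [low, high].

    Returns a `range`, i.e. a constant-size (start, stop, step) description
    of the closure rather than an enumerated set of rows.
    """
    if offset == 0:
        raise ValueError("a zero offset would make the cell depend on itself")
    if offset < 0:
        return range(row + offset, low - 1, offset)
    return range(row + offset, high + 1, offset)
```

For the cell at row 500 with offset $-2$ over rows $[1,1000]$, this yields the even rows from 498 down to 2 without visiting them individually.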
|
||||
|
||||
Unfortunately, the size of the encoding of the range set needed to represent these dependencies scales with the number of rows, due to the gaps.
|
||||
|
@ -192,48 +193,48 @@ To support these efficiently, we maintain a backward index that relates cell ran
|
|||
Analogous to $\textbf{getDeps}$ inferring the cells immediately upstream of a range of cells, we can infer the cells downstream of any cell or set of cells, with one caveat.
|
||||
When the cell identified by an absolute reference in a pattern is modified, all cells using the pattern are invalidated, so we track the set of ranges over which any given pattern is defined.
|
||||
|
||||
\paragraph{Column Insertions, Deletions, and Moves}
|
||||
% \paragraph{Column Insertions, Deletions, and Moves}
|
||||
|
||||
At the index layer, columns are referenced by unique identifier; ordering is imposed only at the presentation layer.
|
||||
Column insertion (or deletion) requires simply inserting (resp., removing) an entry from the forward and backward index.
|
||||
Column reordering requires no actions at the index level.
|
||||
% At the index layer, columns are referenced by unique identifier; Ordering is imposed only at the presentation layer.
|
||||
% Column insertion (or deletion) requires simply inserting (resp., removing) an entry from the forward and backward index.
|
||||
% Column reordering requires no actions at the index level.
|
||||
|
||||
\paragraph{Row Insertions, Deletions, and Moves}
|
||||
% \paragraph{Row Insertions, Deletions, and Moves}
|
||||
|
||||
Rows are identified by their position.
|
||||
When a row is inserted, deleted, or moved, references to the affected rows (and rows following them) change and must be updated.
|
||||
This update can be expensive, as it may require defining an entirely new set of patterns.
|
||||
% Rows are identified by their position.
|
||||
% When a row is inserted, deleted, or moved, references to the affected rows (and rows following them) change and must be updated.
|
||||
% This update can be expensive, as it may require defining an entirely new set of patterns
|
||||
|
||||
One alternative is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, where a new identifier can be allocated in between any two rows.
|
||||
An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
|
||||
Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to look up individual rows by their position.
|
||||
% One alternative is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, where a new identifier can be allocated in between any two rows.
|
||||
% An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
|
||||
% Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to look up individual rows by their position.
|
||||
|
||||
Instead, we adopt a lazy approach by associating every pattern with a construct that we call a reference frame: a function $\mathcal F : \mathbb Z \rightarrow \mathcal R$ mapping positions to specific rows (where $\mathcal R$ denotes the set of all rows).
|
||||
In other words, the pair $\tuple{\row, \mathcal F}$ denotes the row $\mathcal F(\row)$.
|
||||
We observe that row insertions, deletions, and moves each apply a simple translation to a portion of the reference frame's domain.
|
||||
For example, given an initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
|
||||
$$\mathcal F'(\row) = \begin{cases}
|
||||
\mathcal F(\row) & \textbf{if } \row < 5\\
|
||||
\text{[new row } \row - 5\text{]} & \textbf{if } 5 \leq \row \leq 7\\
|
||||
\mathcal F(\row - 3) & \textbf{otherwise}
|
||||
\end{cases}$$
|
||||
Observe that row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
|
||||
That is, we can define a reference frame transformation $T$:
|
||||
$$T(\row) = \begin{cases}
|
||||
\row & \textbf{if } \row < 5\\
|
||||
\row+3 & \textbf{otherwise}
|
||||
\end{cases}$$
|
||||
On the positions that $T$ maps to (that is, excluding the newly inserted rows), the function may be inverted:
|
||||
$$T^{-1}(\row) = \begin{cases}
|
||||
\row & \textbf{if } \row < 5\\
|
||||
\row-3 & \textbf{if } \row > 7\\
|
||||
error & \textbf{otherwise}
|
||||
\end{cases}$$
|
||||
Thus $\mathcal F'(T(\row)) = \mathcal F(\row)$, and $\mathcal F(T^{-1}(\row)) = \mathcal F'(\row)$ wherever $T^{-1}$ is defined. Similar translations exist for deletions and row moves.
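The insertion example can be sketched in code as a pair of constant-time functions. This is an illustration only; \lstinline{make_insert_transform} is a hypothetical name, and the use of \lstinline{LookupError} to signal the ``error'' case is an assumption of the sketch.

```python
def make_insert_transform(position, count):
    """Build (T, T_inv) for inserting `count` rows at `position`.

    T maps a position in the old reference frame to the position of the
    same row in the new frame; T_inv maps back and fails on inserted rows.
    """

    def forward(row):
        # Rows before the insertion point keep their position; others shift down.
        return row if row < position else row + count

    def inverse(row):
        if row < position:
            return row
        if row >= position + count:
            return row - count
        # Positions occupied by newly inserted rows have no earlier counterpart.
        raise LookupError(f"row {row} was inserted and has no earlier position")

    return forward, inverse
```

With \lstinline{make_insert_transform(5, 3)}, position 5 in the old frame maps to position 8 in the new one, while inverting positions 5 through 7 raises an error because those rows did not exist before the insertion.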
|
||||
% Instead, we adopt a lazy approach by associating every pattern with a construct that we call a reference frame, a function $\mathcal F$ a function mapping positions to specific rows $\mathcal F : \mathbb Z \rightarrow \mathcal R$ (where $\mathcal R$ denotes the set of all rows).
|
||||
% In other words, the pair $\tuple{\row, \mathcal F}$ denotes the row $\mathcal F(\row)$.
|
||||
% We observe that simple row insertions, deletions, or movement, apply a simple translation to a portion of the reference frame's domain.
|
||||
% For example given initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
|
||||
% $$\mathcal F'(\row) = \begin{cases}
|
||||
% \mathcal F(\row) & \textbf{if } \row < 5\\
|
||||
% \text{[new row } \row - 5\text{]} & \textbf{if } 5 \leq \row \leq 7\\
|
||||
% \mathcal F(\row - 3) & \textbf{otherwise}
|
||||
% \end{cases}$$
|
||||
% Observe that, row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
|
||||
% That is, we can define a reference frame transformation $T$:\tabularnewline
|
||||
% $$T(\row) = \begin{cases}
|
||||
% \row & \textbf{if } \row < 5\\
|
||||
% \row+3 & \textbf{otherwise}
|
||||
% \end{cases}$$
|
||||
% For portions of the domain of $T$ that are defined, the function may be inverted:
|
||||
% $$T^{-1}(\row) = \begin{cases}
|
||||
% \row & \textbf{if } \row < 5\\
|
||||
% \row-3 & \textbf{if } \row > 7\\
|
||||
% error & \textbf{otherwise}
|
||||
% \end{cases}$$
|
||||
% Thus $\mathcal F(T(\row)) = \mathcal F'(\row)$, and $\mathcal F'(T^{-1}(\row)) = F(\row)$. Similar translations exist for deletions and row moves.
|
||||
|
||||
Let $\mathcal F' = T_1 \circ \ldots \circ T_n \circ \mathcal F$, where $\circ$ denotes function composition.
|
||||
Given a history of transformations, any row $\tuple{\row, \mathcal F}$ can be transformed into a later reference frame $\tuple{T_1(\ldots(T_n(\row))), \mathcal F'}$, or an earlier one.
|
||||
Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
|
||||
% Let $\mathcal F' = T_1 \circ \ldots \circ T_n \circ \mathcal F$, where $\circ$ denotes function composition.
|
||||
% Given a history of transformations, any row $\tuple{i, \mathcal F}$ can be transformed into a later reference frame $\tuple{T_1(\ldots(T_n(\row))), \mathcal F'}$, or an earlier one.
|
||||
% Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
|
||||
|
||||
|
||||
|
||||
|
|