Finished brain dump of system

main
Oliver Kennedy 2023-03-20 17:13:32 -04:00
parent 772ba3aaa7
commit ab74420bca
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 59 additions and 56 deletions


@ -165,7 +165,7 @@
\input{sections/overview}
\input{sections/formalism}
\input{sections/system}
\input{sections/data}
% \input{sections/data}
\input{sections/relwork}
\input{sections/conclusions}


@ -59,56 +59,6 @@ We discuss the data update layer in greater depth in \Cref{sec:data}.
\todo{Discuss relevant aspects of the UI}
\subsection{Reference Frames}
\label{sec:cellidentity}
A reference to a cell (e.g., in a formula) is given as the intersection of a specific row and column.
Because Overlay adopts the relational spreadsheet model, the set of columns is available in a static context, making it easy to assign unique identifiers (e.g., column names).
To identify rows, we considered two approaches: (i) identifying rows by unique identifiers, and (ii) identifying rows by their position.
Assigning each row a unique identifier poses several scalability challenges.
First, this mapping makes caching more challenging, as row identifiers must be persisted in the original dataset.
Moreover, unique identifiers cannot be used to partition the source data into sets of rows of consistent size.
Finally, unique identifiers preclude rule-based updates, as we describe in \Cref{sec:data}.
Positional references can compactly encode contiguous ranges of rows.
However, whenever a row is inserted or deleted, every reference to a row following the update must be updated.
A similar approach is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, which allocates new identifiers in sequential order by using the midpoint of the predecessor and successor rows.
An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to point lookups of rows by their position.
Overlay adopts a positional style of reference, but augments it with a construct that we call a reference frame.
Concretely, a row is identified by a 2-tuple $\tuple{i, \mathcal F}$, where $i$ is an integer position, and $\mathcal F$ is a reference frame.
A reference frame is a function mapping positions to specific rows $\mathcal F : \mathbb Z \rightarrow \mathcal R$ (where $\mathcal R$ denotes the set of all rows).
In other words, a row $\tuple{i, \mathcal F}$ denotes the row $\mathcal F(i)$.
We observe that row insertions, deletions, and moves each apply a simple translation to a portion of the reference frame's domain.
For example, given an initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
$$\mathcal F'(x) = \begin{cases}
\mathcal F(x) & \textbf{if } x < 5\\
\text{[new row } x - 5\text{]} & \textbf{if } 5 \leq x \leq 7\\
\mathcal F(x - 3) & \textbf{otherwise}
\end{cases}$$
Observe that row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
That is, we can define a reference frame transformation $T$:
$$T(x) = \begin{cases}
x & \textbf{if } x < 5\\
x+3 & \textbf{otherwise}
\end{cases}$$
For portions of the domain of $T$ that are defined, the function may be inverted:
$$T^{-1}(x) = \begin{cases}
x & \textbf{if } x < 5\\
x-3 & \textbf{if } x > 7\\
error & \textbf{otherwise}
\end{cases}$$
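Thus $\mathcal F'(T(x)) = \mathcal F(x)$ for every $x$, and $\mathcal F(T^{-1}(x)) = \mathcal F'(x)$ wherever $T^{-1}$ is defined. Similar translations exist for deletions and row moves.
Let $T_n, \ldots, T_1$ be the history of transformations (applied in that order) that takes $\mathcal F$ to $\mathcal F'$, so that $\mathcal F = \mathcal F' \circ T_1 \circ \ldots \circ T_n$ wherever the composition is defined, where $\circ$ denotes function composition.
Given such a history, any row $\tuple{i, \mathcal F}$ can be transformed into the later reference frame as $\tuple{T_1(\ldots(T_n(i))), \mathcal F'}$, or, by applying the inverses, into an earlier one.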
Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
\begin{example}
Some example of reference frames in practice. Insert, delete, move, etc...
\end{example}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"


@ -140,12 +140,65 @@ To support these efficiently, we maintain a backward index that relates cell ran
Analogously to how $\textbf{getDeps}$ infers the cells immediately upstream of a range of cells, we can infer the cells downstream of any cell or set of cells, with one caveat.
When the cell identified by an absolute reference in a pattern is modified, all cells using the pattern are invalidated, so we track the set of ranges over which any given pattern is defined.
\paragraph{Column Insertions and Deletions}
\paragraph{Column Insertions, Deletions, and Moves}
At the index layer, columns are referenced by unique identifier; ordering is imposed only at the presentation layer.
Column insertion (or deletion) requires simply inserting an entry into (resp., removing an entry from) the forward and backward index.
Column reordering requires no actions at the index level.
\paragraph{Row Insertions, Deletions, and Moves}
Rows are identified by their position.
When a row is inserted, deleted, or moved, references to the affected rows (and rows following them) change and must be updated.
This update can be expensive, as it may require defining an entirely new set of patterns.
One alternative is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, where a new identifier can be allocated in between any two rows.
An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to look up individual rows by their position.
Instead, we adopt a lazy approach by associating every pattern with a construct that we call a reference frame: a function $\mathcal F : \mathbb Z \rightarrow \mathcal R$ mapping positions to specific rows (where $\mathcal R$ denotes the set of all rows).
In other words, the pair $\tuple{\row, \mathcal F}$ denotes the row $\mathcal F(\row)$.
We observe that row insertions, deletions, and moves each apply a simple translation to a portion of the reference frame's domain.
For example, given an initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
$$\mathcal F'(\row) = \begin{cases}
\mathcal F(\row) & \textbf{if } \row < 5\\
\text{[new row } \row - 5\text{]} & \textbf{if } 5 \leq \row \leq 7\\
\mathcal F(\row - 3) & \textbf{otherwise}
\end{cases}$$
Observe that row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
That is, we can define a reference frame transformation $T$:
$$T(\row) = \begin{cases}
\row & \textbf{if } \row < 5\\
\row+3 & \textbf{otherwise}
\end{cases}$$
For portions of the domain of $T$ that are defined, the function may be inverted:
$$T^{-1}(\row) = \begin{cases}
\row & \textbf{if } \row < 5\\
\row-3 & \textbf{if } \row > 7\\
error & \textbf{otherwise}
\end{cases}$$
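Thus $\mathcal F'(T(\row)) = \mathcal F(\row)$ for every $\row$, and $\mathcal F(T^{-1}(\row)) = \mathcal F'(\row)$ wherever $T^{-1}$ is defined. Similar translations exist for deletions and row moves.
Let $T_n, \ldots, T_1$ be the history of transformations (applied in that order) that takes $\mathcal F$ to $\mathcal F'$, so that $\mathcal F = \mathcal F' \circ T_1 \circ \ldots \circ T_n$ wherever the composition is defined, where $\circ$ denotes function composition.
Given such a history, any row $\tuple{\row, \mathcal F}$ can be transformed into the later reference frame as $\tuple{T_1(\ldots(T_n(\row))), \mathcal F'}$, or, by applying the inverses, into an earlier one.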
Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
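As an illustrative sketch of the deletion case (the name $T_{del}$, the target frame $\mathcal F''$, and the specific positions are assumptions chosen for this example), consider deleting the two rows at positions 5 and 6 of $\mathcal F'$ to obtain $\mathcal F''$.
One plausible forward transformation and its inverse are:
$$T_{del}(\row) = \begin{cases}
\row & \textbf{if } \row < 5\\
error & \textbf{if } 5 \leq \row \leq 6\\
\row - 2 & \textbf{otherwise}
\end{cases}
\qquad
T_{del}^{-1}(\row) = \begin{cases}
\row & \textbf{if } \row < 5\\
\row + 2 & \textbf{otherwise}
\end{cases}$$
Under these assumptions, the forward direction is the partial one, which matches the observation that forward errors indicate deleted rows.
Composing with the insertion $T$ defined earlier, position 6 of $\mathcal F$ maps to $T(6) = 9$ in $\mathcal F'$ and then to $T_{del}(9) = 7$ in $\mathcal F''$, while position 4 is unchanged by both steps.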
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Execution Layer}
The execution layer is responsible for providing efficient access to cell values, the results of executing cells.
As in \cite{DBLP:conf/sigmod/BendreWMCP19}, this can be accomplished by (i) deriving a topological sort over the cells of the spreadsheet in dependency order, and (ii) materializing cells in this order.
However, materializing the full spreadsheet becomes impractical for a sufficiently large dataset.
Instead, the execution layer maintains an \emph{active region} that includes all of the rows that are on a client's screen, a small surrounding buffer, and all of their upstream dependencies (\Cref{alg:upstream}).
Only cells in the active region are materialized.
When the user's view changes, a new set of cells (typically in the surrounding buffer) are recomputed.
We observe that recursive patterns (as discussed above) create situations where an active region may scale to the full size of the dataset.
Although it is beyond the scope of this work, note that any such form of recursion may be expressed as a window function over the base dataset, and is likely well suited for evaluation in a batch-processing system.
TODO:
short: just note that there is no implicit ordering, so these just involve updating the unordered maps encoding column values.
\paragraph{Row Insertions/Deletions}
TODO: probably just migrate the reference frame text here.