Finished brain dump of system

2023-03-20 17:13:32 -04:00 · ab74420bca
parent 772ba3aaa7
commit ab74420bca
3 changed files with 59 additions and 56 deletions
--- a/main.tex
+++ b/main.tex
@@ -165,7 +165,7 @@
 \input{sections/overview}
 \input{sections/formalism}
 \input{sections/system}
-\input{sections/data}
+% \input{sections/data}
 \input{sections/relwork}

 \input{sections/conclusions}
--- a/sections/overview.tex
+++ b/sections/overview.tex
@@ -59,56 +59,6 @@ We discuss the data update layer in greater depth in \Cref{sec:data}.

 \todo{Discuss relevant aspects of the UI}

-\subsection{Reference Frames}
-\label{sec:cellidentity}
-
-A reference to a cell (e.g., in a formula) is given as the intersection of a specific row and column.
-Because Overlay adopts the relational spreadsheet model, the set of columns is available in a static context, making it easy to assign unique identifiers (e.g., column names).
-To identify rows, we considered two approaches: (i) identifying rows by unique identifiers, and (ii) identifying rows by their position.
-
-Assigning each row a unique identifier poses several scalability challenges.
-First, this mapping makes caching more challenging, as row identifiers must be persisted in the original dataset.
-Moreover, unique identifiers can not be used to partition the source data into set of rows of consistent size.
-Finally, unique identifiers preclude rule based updates, as we describe in \Cref{sec:data}.
-
-Positional references can compactly encode contiguous ranges of rows.
-However whenever a row is inserted or deleted, every reference to a row following the update must be updated.
-A similar approach is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, which allocate new identifiers in sequential order by using the midpoint of the predecessor and successor rows.
-An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
-Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to point lookups of rows by their position.
-
-Overlay adopts a positional style of reference, but augments it with a construct that we call a reference frame.
-Concretely, a row is identified by a 2-tuple $\tuple{i, \mathcal F}$, where $i$ is an integer position, and $\mathcal F$ is a reference frame.
-A reference frame is a function mapping positions to specific rows $\mathcal F : \mathbb Z \rightarrow \mathcal R$ (where $\mathcal R$ denotes the set of all rows).
-In other words, a row $\tuple{i, \mathcal F}$ denotes the row $\mathcal F(i)$.
-We observe that simple row insertions, deletions, or movement, apply a simple translation to a portion of the reference frame's domain.
-For example given initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
-$$\mathcal F'(x) = \begin{cases}
-  \mathcal F(x) & \textbf{if } x < 5\\
-  \text{[new row } x - 5\text{]} & \textbf{if } 5 \leq x \leq 7\\
-  \mathcal F(x - 3) & \textbf{otherwise}
-\end{cases}$$
-Observe that, row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
-That is, we can define a reference frame transformation $T$:\tabularnewline
-$$T(x) = \begin{cases}
-  x & \textbf{if } x < 5\\
-  x+3 & \textbf{otherwise}
-\end{cases}$$
-For portions of the domain of $T$ that are defined, the function may be inverted:
-$$T^{-1}(x) = \begin{cases}
-  x & \textbf{if } x < 5\\
-  x-3 & \textbf{if } x > 7\\
-  error & \textbf{otherwise}
-\end{cases}$$
-Thus $\mathcal F(T(x)) = \mathcal F'(x)$, and $\mathcal F'(T^{-1}(x)) = F(x)$.  Similar translations exist for deletions and row moves.
-
-Let $\mathcal F' = T_1 \circ \ldots \circ T_n \circ \mathcal F$, where $\circ$ denotes function composition.
-Given a history of transformations, any row $\tuple{i, \mathcal F}$ can be transformed into a later reference frame $\tuple{T_1(\ldots(T_n(x))), \mathcal F'}$, or an earlier one.
-Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
-
-\begin{example}
-  Some example of reference frames in practice.  Insert, delete, move, etc...
-\end{example}
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: "../main"
--- a/sections/system.tex
+++ b/sections/system.tex
@@ -140,12 +140,65 @@ To support these efficiently, we maintain a backward index that relates cell ran
 Analog to $\textbf{getDeps}$ inferring cells immediately upstream of a range of cells, we can infer the cells downstream of any cell or set of cells, with one caveat.
 When the cell identified an absolute reference in a pattern is modified, all cells using the pattern are invalidated, so we track the set of ranges over which any given pattern is defined.

-\paragraph{Column Insertions and Deletions}
+\paragraph{Column Insertions, Deletions, and Moves}
+
+At the index layer, columns are referenced by unique identifier; ordering is imposed only at the presentation layer.
+Column insertion (resp., deletion) simply requires inserting an entry into (resp., removing an entry from) the forward and backward indexes.
+Column reordering requires no actions at the index level.
+
+\paragraph{Row Insertions, Deletions, and Moves}
+
+Rows are identified by their position. 
+When a row is inserted, deleted, or moved, references to the affected rows (and rows following them) change and must be updated.
+This update can be expensive, as it may require defining an entirely new set of patterns.
+
+One alternative is fractional indexing~\cite{DBLP:journals/jidm/HausteinHM010,DBLP:conf/sigmod/ONeilOPCSW04}, where a new identifier can be allocated in between any two rows.
+An analogous approach uses tree structures with counts to accelerate positional access to rows~\cite{DBLP:conf/icde/BendreVZCP18}.
+Both of these approaches avoid the overhead of reference updates, but impose a logarithmic cost to look up individual rows by their position.
+
+Instead, we adopt a lazy approach by associating every pattern with a construct that we call a reference frame: a function $\mathcal F : \mathbb Z \rightarrow \mathcal R$ mapping positions to specific rows (where $\mathcal R$ denotes the set of all rows).
+In other words, the pair $\tuple{\row, \mathcal F}$ denotes the row $\mathcal F(\row)$.
+We observe that row insertions, deletions, and moves each apply a simple translation to a portion of the reference frame's domain.
+For example, given an initial reference frame $\mathcal F$, an insertion of three rows at position 5 defines a new reference frame $\mathcal F'$ as follows:
+$$\mathcal F'(\row) = \begin{cases}
+  \mathcal F(\row) & \textbf{if } \row < 5\\
+  \text{[new row } \row - 5\text{]} & \textbf{if } 5 \leq \row \leq 7\\
+  \mathcal F(\row - 3) & \textbf{otherwise}
+\end{cases}$$
+Observe that row positions defined with respect to $\mathcal F$ may be transformed in constant time into positions with respect to $\mathcal F'$.
+That is, we can define a reference frame transformation $T$:
+$$T(\row) = \begin{cases}
+  \row & \textbf{if } \row < 5\\
+  \row+3 & \textbf{otherwise}
+\end{cases}$$
+On the image of $T$ (i.e., positions in $\mathcal F'$ that correspond to a row of $\mathcal F$), the function may be inverted:
+$$T^{-1}(\row) = \begin{cases}
+  \row & \textbf{if } \row < 5\\
+  \row-3 & \textbf{if } \row > 7\\
+  \text{error} & \textbf{otherwise}
+\end{cases}$$
+Thus $\mathcal F'(T(\row)) = \mathcal F(\row)$, and $\mathcal F(T^{-1}(\row)) = \mathcal F'(\row)$ wherever $T^{-1}$ is defined.  Similar translations exist for deletions and row moves.
+
+Let $\mathcal F = \mathcal F' \circ T_1 \circ \ldots \circ T_n$, where $\circ$ denotes function composition.
+Given a history of transformations, any row $\tuple{\row, \mathcal F}$ can be transformed into a later reference frame as $\tuple{T_1(\ldots(T_n(\row))), \mathcal F'}$, or, by applying the inverse transformations, into an earlier one.
+Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Execution Layer}
+
+The execution layer is responsible for providing efficient access to cell values, i.e., the results of executing cells.
+As in \cite{DBLP:conf/sigmod/BendreWMCP19}, this can be accomplished by (i) deriving a topological sort over the cells of the spreadsheet in dependency order, and (ii) materializing cells in this order.
+
+However, materializing the full spreadsheet becomes impractical for a sufficiently large dataset.
+Instead, the execution layer maintains an \emph{active region} that includes all of the rows that are on a client's screen, a small surrounding buffer, and all of their upstream dependencies (\Cref{alg:upstream}).
+Only cells in the active region are materialized.
+When the user's view changes, a new set of cells (typically in the surrounding buffer) is recomputed.
+
+We observe that recursive patterns (as discussed above) create situations where an active region may scale to the full size of the dataset.  
+Although it is beyond the scope of this work, we note that any such form of recursion may be expressed as a window function over the base dataset, and is likely well suited for evaluation in a batch-processing system.

-TODO:
-short: just note that there is no implicit ordering, so these just involve updating the unordered maps encoding column values.

-\paragraph{Row Insertions/Deletions}

-TODO: probably just migrate the reference frame text here.