WIP: system design

main
Oliver Kennedy 2023-03-19 12:55:32 -04:00
parent 287c40e733
commit 02d1605c0a
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
4 changed files with 39 additions and 12 deletions

@@ -157,6 +157,7 @@
\input{sections/introduction}
\input{sections/overview}
\input{sections/formalism}
\input{sections/system}
\input{sections/data}
\input{sections/relwork}

@@ -43,6 +43,8 @@ $$E = \bigcup_{\cell \in \columnDomain \times \rowDomain}
\{\;\cell \rightarrow \cell'\;|\;\cell' \in \depsOf{\spreadsheet(\cell)}\;\} $$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Compressed Updates}
We adopt a common assumption in relational data: that $\columnDomain$ is small and $\rowDomain$ is large for a typical spreadsheet.
Accordingly, we define $\rowDomain$ for a spreadsheet with $N$ rows to be the range $[1,N]$, with rows identified by their position in the spreadsheet.
@@ -70,12 +72,3 @@ $$\spreadsheet \equiv \comprehension{
}$$
Informally, the expression at a cell $\cell$ is defined by identifying the range in $\encodedSpreadsheet$ that contains $\cell$, and expanding the pattern in the context of $\cell$'s row. We require that the set of ranges in $\encodedSpreadsheet$ be disjoint.
If the set of ranges is not complete, the expression for cells not covered by a range is defined to be the literal null.
\subsection{Update Index}
Evaluating a cell in a spreadsheet requires evaluating transitive dependencies;
The spreadsheet may thus be viewed as a graph, with one node for each cell and one edge for each dependency.
The update index maintains

@@ -10,11 +10,11 @@ We refer to these as the relational and architectural approaches, respectively.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\includegraphics[width=0.6\columnwidth]{graphics/overlay.png}
\includegraphics[width=0.9\columnwidth]{graphics/forms-of-spreadsheet.pdf}
\Description{
Two layers. At the bottom, a classical relational table including column headers and miscellaneous data values. Stacked over it is a set of edits: cells with dependencies marked by arrows.
Three architectures: First, a materialized approach depicted by a 4 by 3 grid of cells with arrows indicating dependencies. The cells of the 3rd column point to cells in the same row in the 1st and 2nd columns. The cells of the 4th column are marked red and point to the cell in the 3rd column of the same row, and the cell of the 4th column in the preceding row. Second, a virtual approach depicted by three boxes: The first box contains a 2 by 3 grid of cells. An arrow labeled with a pi (projection) connects it to the second box, which contains a 3 by 3 grid of cells. An arrow labeled with a sigma (aggregation) connects it to the third box, which contains a 4 by 3 grid of cells. The sigma, its arrow, and the 4th column of cells are marked in red. Finally, an overlay approach depicted by 2 boxes. The 2nd box is nearly identical to the materialized approach, but the 1st and 2nd columns of the grid are greyed out with arrows pointing to the first box; The first box is the same 2 by 3 grid from the virtual approach.
}
\caption{Architecture of an overlay spreadsheet}
\caption{Approaches to scalable spreadsheet design}
\label{fig:overlay}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

sections/system.tex Normal file
@@ -0,0 +1,33 @@
%!TEX root = ../main.tex
\section{System Design}
\label{sec:system}
\begin{figure}
\includegraphics[width=0.8\columnwidth]{graphics/systemdesign.png}
\caption{Overlay system design.}
\label{fig:systemdesign}
\end{figure}
As illustrated in \Cref{fig:systemdesign}, we decompose Overlay into several distinct layers.
The \textbf{Update Index} is responsible for storing a mutable encoded spreadsheet ($\encodedSpreadsheet$), providing efficient access to cells, and answering reachability queries over the dependency graph.
The \textbf{Execution} layer is responsible for evaluating cell values and maintaining a materialized view over the spreadsheet's values.
Specifically, this layer overlays values computed from expressions stored in the update index on top of a raw dataset obtained from Apache Spark.
A thin \textbf{Cache} layer provides the execution layer with random access to the cells of the underlying Spark dataframe.
Finally, a \textbf{Presentation} layer defines syntactic sugar over the execution layer to provide a spreadsheet-like API to client applications.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Update Index}
The update index stores an encoded spreadsheet: a mapping from ranges of cells to expression patterns.
Evaluating a cell in a spreadsheet requires evaluating its transitive dependencies.
The spreadsheet may thus be viewed as a graph, with one node for each cell and one edge for each dependency.
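
% Illustrative sketch only: the columns A, B, C, the cell names A_1, ..., C_2, and the summation pattern are hypothetical and are not taken from the formalism.
As a hypothetical illustration of this graph view (the columns $A$, $B$, $C$ and the summation pattern are assumptions made for this example only), consider a spreadsheet with two rows in which a single entry of $\encodedSpreadsheet$ maps the range covering column $C$, rows $[1,2]$, to the pattern ``the sum of the cells in columns $A$ and $B$ of the same row.''
Writing $C_1$ for the cell in column $C$ of row $1$, expanding the pattern at each row would give the edge set
$$E = \{\; C_1 \rightarrow A_1,\;\; C_1 \rightarrow B_1,\;\; C_2 \rightarrow A_2,\;\; C_2 \rightarrow B_2 \;\}$$
with each edge pointing from a cell to one of its dependencies.
Under these assumptions, the cells reachable from $C_2$ are exactly $A_2$ and $B_2$, so evaluating $C_2$ would not require visiting any cell in row $1$.
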
The update index maintains