Oliver Kennedy 2023-03-21 17:00:03 -04:00
parent ab74420bca
commit 45d2c0e127
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 81 additions and 26 deletions


@ -180,3 +180,24 @@ Nigel Westbury},
}
@inproceedings{DBLP:conf/sigmod/ArmbrustXLHLBMK15,
author = {Michael Armbrust and
Reynold S. Xin and
Cheng Lian and
Yin Huai and
Davies Liu and
Joseph K. Bradley and
Xiangrui Meng and
Tomer Kaftan and
Michael J. Franklin and
Ali Ghodsi and
Matei Zaharia},
title = {Spark {SQL:} Relational Data Processing in Spark},
booktitle = {{SIGMOD} Conference},
pages = {1383--1394},
publisher = {{ACM}},
year = {2015}
}


@ -83,8 +83,6 @@
%% Of note is the shared affiliation of the first two authors, and the
%% "authornote" and "authornotemark" commands
%% used to denote shared contribution to the research.
\author{Victoria Dib}
\email{vdib@buffalo.edu}
\author{Oliver Kennedy}
\email{okennedy@buffalo.edu}
\affiliation{%
@ -162,8 +160,8 @@
\input{sections/introduction}
\input{sections/model}
\input{sections/overview}
\input{sections/formalism}
% \input{sections/overview}
% \input{sections/formalism}
\input{sections/system}
% \input{sections/data}
\input{sections/relwork}


@ -8,12 +8,64 @@
\label{fig:systemdesign}
\end{figure}
As illustrated in \Cref{fig:systemdesign}, we decompose Overlay into several distinct layers.
The \textbf{Update Index} is responsible for storing a mutable encoded spreadsheet ($\encodedSpreadsheet$), providing efficient access to cells, and answering reachability queries over the dependency graph.
The \textbf{Execution} layer is responsible for evaluating cell values, and maintaining a materialized view over the spreadsheet's values.
This layer specifically overlays values computed from expressions stored in the update index, on top of a raw dataset obtained from Apache Spark.
A thin \textbf{Cache} layer provides the execution layer with random access to the cells of the dataframe.
Finally, a \textbf{Presentation} layer defines syntactic sugar over the execution layer to provide a spreadsheet-like API to client applications.
We now outline the design of our prototype overlay spreadsheet, implemented as part of the Vizier reproducible notebook platform~\cite{brachmann:2020:cidr:your,brachmann:2019:sigmod:data,kennedy:2022:ieee-deb:right}.
Vizier leverages Apache Spark~\cite{DBLP:conf/sigmod/ArmbrustXLHLBMK15} for data provenance, processing, and import/export format compatibility.
Our prototype likewise builds on Spark, using any dataframe as a data source.
The prototype's design is illustrated in \Cref{fig:systemdesign}.
Client applications (e.g., Javascript-based frontends) connect through a thin \textbf{Presentation} layer that mediates concurrent access to the spreadsheet and provides light syntactic sugar over the underlying data and update model.
The data model itself is maintained by an \textbf{Execution} layer that is responsible for evaluating spreadsheet cells and materializing a subset of the cell values that are viewable.
The execution layer applies an update overlay stored by an \textbf{Indexing} layer to an arbitrary Spark dataframe.
A simple LRU \textbf{Cache} provides efficient random access to a subset of the dataframe's rows.
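As a minimal illustration of the kind of cache described above, the following Python sketch keeps a bounded number of rows and evicts the least recently used one. It is not the prototype's implementation: the names \texttt{RowCache} and \texttt{fetch\_row} are hypothetical, and the callback stands in for whatever mechanism actually reads a row from the dataframe.

```python
from collections import OrderedDict

# Hypothetical sketch of an LRU row cache; not the prototype's actual code.
class RowCache:
    def __init__(self, capacity, fetch_row):
        self.capacity = capacity      # maximum number of rows kept in memory
        self.fetch_row = fetch_row    # assumed callback: row index -> row
        self.rows = OrderedDict()     # ordered from least to most recently used

    def get(self, index):
        if index in self.rows:
            # Cache hit: mark the row as most recently used.
            self.rows.move_to_end(index)
            return self.rows[index]
        # Cache miss: load the row, then evict the oldest entry if over capacity.
        row = self.fetch_row(index)
        self.rows[index] = row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)
        return row

fetched = []  # records which rows had to be loaded from the source

def fetch_row(index):
    fetched.append(index)
    return {"id": index}

cache = RowCache(capacity=2, fetch_row=fetch_row)
for index in (0, 1, 0, 2, 1):
    cache.get(index)
```

In this run, the second access to row 0 is served from the cache, while row 1 is evicted when row 2 is loaded and must be fetched again.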
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Presentation Layer}
Multiple user-facing client applications connect to the overlay spreadsheet through a presentation layer.
This layer mediates concurrent updates of the spreadsheet, allows clients to subscribe to push-based updates of cell state, and provides clients with the illusion of a fixed grid of cells by defining and maintaining an explicit order over columns, as well as maintaining a bound over the number of rows in the spreadsheet.
With the exception of updates to column order, most updates are placed in a serial order and relayed to lower levels.
The presentation layer expects the level below it to provide (i) efficient random access to cell values, and (ii) subscription access to state (e.g., value) updates for ranges of cells.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Executor}
The executor's role is to provide efficient access to cell values and to push notifications about cell state changes.
Cell state is derived from two sources:
(i) A data source ($\ds, \rframe$) defines a base spreadsheet $\spreadsheet_{\ds}[\column, \row] = \ds[\column,\rframe^{-1}(\row)]$, and
(ii) a series of updates ($\overlay_{1}\ldots \overlay_k$; where $\overlay_i = \ol{\rtrans_i}{\oup_i}$) extends the spreadsheet $\spreadsheet = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\ds)$.
The executor decouples these sources into a cache around $\spreadsheet_{\ds}$ and an update index that stores an overlay spreadsheet defined as: $\spreadsheet_\overlay = (\overlay_k\circ\ldots\circ\overlay_1)(\spreadsheet_\errorval)$, where $\spreadsheet_\errorval$ denotes a spreadsheet that maps every cell to $\errorval$.
The full spreadsheet can be obtained by deferring to the source data for cells where the overlay is undefined:
$$\spreadsheet[\column,\row] = \begin{cases}
\spreadsheet_\overlay[\column,\row] & \textbf{if } \spreadsheet_\overlay[\column,\row] \neq \errorval\\
\spreadsheet_{\ds}[\column, (\rtrans_k^{-1} \circ \ldots \circ \rtrans_1^{-1})(\row)] & \textbf{otherwise}
\end{cases}$$
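The case split above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the overlay is modeled as a sparse dictionary, the source as one list per column, and the composed inverse row transformation as a single function. The names \texttt{ERROR}, \texttt{lookup}, and \texttt{to\_source\_row} are hypothetical.

```python
# Hypothetical sketch: names and data layout are illustrative assumptions.
ERROR = object()  # sentinel for cells the overlay leaves undefined

def lookup(overlay, source, to_source_row, column, row):
    """Return the overlay's value if defined, else defer to the source."""
    value = overlay.get((column, row), ERROR)
    if value is not ERROR:
        return value
    # Translate the current row position into the source's reference frame.
    return source[column][to_source_row(row)]

source = {"A": [10, 20, 30]}   # base data, one list per column
overlay = {("A", 0): 99}       # a single cell written into an inserted row

def to_source_row(row):
    # Assumes exactly one row was inserted at position 0.
    return row - 1

overwritten = lookup(overlay, source, to_source_row, "A", 0)
deferred = lookup(overlay, source, to_source_row, "A", 2)
```

Here the first lookup is answered by the overlay alone, while the second falls through to the source row that the insertion shifted down by one position.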
To materialize $\spreadsheet$, we could follow \cite{DBLP:conf/sigmod/BendreWMCP19} and compute the full set of values by (i) materializing expressions for each cell, (ii) computing a topological sort over the cells in order of dependencies, and (iii) evaluating cells in that order.
However, this approach has several shortcomings.
First, it may require an impractical amount of memory to materialize every cell's expression individually, particularly when our goal is to allow users to apply expressions in bulk to entire datasets.
Second, given that patterns share a common structure, it can be more efficient to compute a topological sort over blocks of patterns rather than over individual cells.
Finally, materializing every cell's individual value can be just as impractical as materializing their expressions.
The first two points are addressed by the update index, which is expected to provide bulk access to cells.
In addition to allowing bulk updates to ranges of cells (via pattern), the executor assumes that the index can efficiently compute dependencies (upstream), and sets of dependent cells (downstream) for entire ranges at once.
To address the third point, we rely on the observation that in most spreadsheet applications, only a small fraction of cells will be visible at one time (e.g., \cite{DBLP:conf/sigmod/BendreWMCP19} relies on this observation to prioritize evaluation of visible cells).
Instead of materializing the full spreadsheet, the executor relies on user-facing clients to register interest in regions of cells.
We refer to the union of these cells and their upstream cells (dependencies) as the set of \emph{active} cells.
The executor only materializes active cells.
If the active cells remain confined to a specific set of rows that fit in cache, virtually all accesses can be serviced out of cache.
However, as we discuss below, the active region may scale to the full size of the dataset; we will return to this problem when we discuss future work.
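To make the notion of active cells concrete, the following Python sketch computes the registered cells together with their transitive upstream dependencies. It is a per-cell simplification: the executor described here works over ranges and patterns through the update index, whereas this sketch assumes a hypothetical dictionary \texttt{upstream} that maps each cell to the cells it reads.

```python
from collections import deque

# Hypothetical per-cell sketch; the real index is assumed to work on ranges.
def active_cells(registered, upstream):
    """Registered cells plus everything they transitively depend on."""
    active = set(registered)
    frontier = deque(registered)
    while frontier:
        cell = frontier.popleft()
        for dependency in upstream.get(cell, ()):
            if dependency not in active:
                active.add(dependency)
                frontier.append(dependency)
    return active

# C1 reads B1, which reads A1; C2 reads B2 but is not on screen.
upstream = {"C1": ["B1"], "B1": ["A1"], "C2": ["B2"]}
active = active_cells(["C1"], upstream)
```

Only the cells reachable from the registered region are returned, so C2 and B2 are never materialized in this example. A chain of dependencies that runs through every row would make the result grow to the full dataset.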
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -184,21 +236,5 @@ Given a history of transformations, any row $\tuple{i, \mathcal F}$ can be trans
Errors in a transformation to an earlier reference frame indicate inserted rows, while errors moving forward through reference frames indicate deleted rows.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Execution Layer}
The execution layer is responsible for providing efficient access to cell values (i.e., the results of evaluating cells).
As in \cite{DBLP:conf/sigmod/BendreWMCP19}, this can be accomplished by (i) deriving a topological sort over the cells of the spreadsheet in dependency order, and (ii) materializing cells in this order.
However, materializing the full spreadsheet becomes impractical for a sufficiently large dataset.
Instead, the execution layer maintains an \emph{active region} that includes all of the rows that are on a client's screen, a small surrounding buffer, and all of their upstream dependencies (\Cref{alg:upstream}).
Only cells in the active region are materialized.
When the user's view changes, a new set of cells (typically in the surrounding buffer) is recomputed.
We observe that recursive patterns (as discussed above) create situations where an active region may scale to the full size of the dataset.
Although it is beyond the scope of this work, note that any such form of recursion may be expressed as a window function over the base dataset, and is likely well suited for evaluation in a batch-processing system.