Graphs, experiments, and a bit of trimming

main
Oliver Kennedy 2023-03-28 23:10:46 -04:00
parent f3f21f0d5a
commit b082e1e8c9
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
11 changed files with 90 additions and 26 deletions


@ -201,3 +201,9 @@ Nigel Westbury},
year = {2015}
}
@misc{tpc-h,
author = {The Transaction Processing Performance Council},
title = {TPC Benchmark H (Decision Support), Revision 2.18.0},
howpublished = {https://www.tpc.org/tpch/default5.asp},
year = {2018}
}


@ -7,7 +7,7 @@
\usepackage{colortbl}
\usepackage{listings}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage[noend]{algpseudocode}
\newcommand{\trimfigurespacing}{\vspace*{-5mm}}
@ -168,7 +168,7 @@
\input{sections/system}
% \input{sections/data}
\input{sections/relwork}
\input{sections/experiments}
\input{sections/conclusions}

(Binary result images changed; not shown.)

@ -81,7 +81,10 @@ print(stages)
def plot_one(testbed, stage):
global data
fig, ax = plt.subplots()
fig, ax = plt.subplots(
figsize=(4, 2),
constrained_layout=True
)
# ax.set_title(f"{stage} ({testbed})")
ax.set_ylabel(f"{stage} (s)")
@ -97,7 +100,7 @@ def plot_one(testbed, stage):
], key=lambda x: x[0])
ax.plot(
[pt[0] for pt in points],
[pt[0] / 1000 for pt in points],
[pt[1] for pt in points],
label=system
)

sections/experiments.tex Normal file

@ -0,0 +1,59 @@
%!TEX root=../main.tex
\section{Experiments}
\label{sec:experiments}
In this section we explore the performance of the overlay approach.
Concretely, we are interested in two questions:
(i) How does data size affect the performance of each system?
(ii) How does dependency chain length affect the performance of each system?
Experiments were run on a 10-core 1.7 GHz Intel i7-12700H running Linux (kernel 6.0), with 64 GB of DDR-3200 RAM and a 2 TB 970 EVO NVMe solid state drive.
We compare three systems:
(i) \textbf{dataspread}: Dataspread version 0.5~\cite{bendre-15-d}, the most recent version at the time of submission;
(ii) \textbf{vizier}: Our prototype implementation of overlay spreadsheets; and
(iii) \textbf{vizier-batch}: Our prototype implementation with simulated hybrid batch processing.
All experiments were performed with a warm cache.
\partitle{Setup}
We address our questions through a simple microbenchmark modeled after query 1 from the TPC-H benchmark~\cite{tpc-h}: The spreadsheet is defined by the TPC-H \texttt{lineitem} dataset with $\texttt{N}$ rows and four additional columns defined by the patterns:\\[-5mm]
\begin{verbatim}
base_price[1-N] = ext_price[+0]
disc_price[1-N] = base_price[+0] * (1 - discount[+0])
charge[1-N] = disc_price[+0] * (1 + tax[+0])
sum_charge[1] = charge[1]
sum_charge[2-N] = charge[+0] + sum_charge[-1]
\end{verbatim}
Note that the \texttt{sum\_charge} column is a running total and the length of the dependency chain on row $i$ is proportional to $i$. Thus, as the user scrolls down the page (under normal usage), the runtime to compute individual cells grows linearly.
Each system under test is allowed to load the spreadsheet with a viewable area of 50 rows.
We measure (i) the cost of initialization and (ii) the cost of a single update.
Time is measured until quiescence.
To emulate batch processing, we replace the formula for $\texttt{sum\_charge}[i-1]$ (where $i$ is the first visible row) with a formula that computes the analogous aggregate query.
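For intuition, a minimal Python sketch of this substitution (variable names are illustrative, not those of the benchmark harness):
\begin{verbatim}
# Recursive pattern: sum_charge[i] = charge[i] + sum_charge[i-1].
# The batch replacement computes the value for row i-1 (the row just
# above the first visible row) with a single aggregate over all
# preceding rows instead of following the dependency chain.
def batch_sum_charge(charge, i):
    # charge: per-row charge values; rows 1..i-1 map to charge[:i-1]
    return sum(charge[:i - 1])
\end{verbatim}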
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Scaling Data}
\begin{figure}
\includegraphics[width=0.7\columnwidth]{results/desktop-init_formulas.png}
\vspace*{-4mm}
\caption{Performance as data size scales.}
\label{fig:perf-scale-size}
\trimfigurespacing
\end{figure}
\Cref{fig:perf-scale-size} shows performance as the size of the dataset grows.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\partitle{Viewport}
\begin{figure}
\includegraphics[width=0.7\columnwidth]{results/desktop-update_one.png}
\vspace*{-4mm}
\caption{Performance based on viewable range.}
\label{fig:perf-scale-visible}
\trimfigurespacing
\end{figure}
\Cref{fig:perf-scale-visible} shows performance as the viewable area moves lower.


@ -64,7 +64,7 @@ For example, any given cell in a running sum column may be expressed in terms of
Our preliminary experiments show that when a chain of dependencies becomes sufficiently long, bulk computation can be used to provide a more responsive interface.
\paragraph{Incremental Updates}
\partitle{Incremental Updates}
When the executor receives an update to a cell, it uses the index to compute the set of invalidated cells, marks them as ``pending,'' and begins re-evaluating them in topological order.
An update to the reference frame is applied to both the index and the data source.
Following typical spreadsheet semantics, an insertion or row move updates references in dependent formulas, so modulo changes in the set of visible rows, no re-evaluation is required.
@ -74,12 +74,12 @@ If a row with dependent cells is deleted, the dependent cells need to be updated
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Update Index}
\label{sec:system-index}
The update index stores a series of update operations ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the resulting overlay spreadsheet ($\spreadsheet_\overlay$):
The update index stores a sequence of updates ($\overlay = \overlay_k \circ \ldots \circ \overlay_1$) and provides efficient access to the spreadsheet (denoted $\spreadsheet_\overlay$) defined by $\overlay$, with $\errorval$ for all undefined cells.
Efficient access entails:
(i) Access to the expressions for individual cells $\spreadsheet_\overlay[\column, \row]$ (for cell evaluation);
(ii) Computing the upstream of a range of cells (for topological sort and computing the active set), and
(iii) Computing the downstream of a range of cells (for cell invalidation after an update).
Crucially, the update index avoids materializing the expressions for each cell by storing cell updates in the form of pattern-range tuples.
The key insight behind the index is that it stores updates in the form of pattern-range tuples to avoid materializing the full spreadsheet.
As noted above, we assume that the number of columns is comparatively small, and the number of rows is comparatively large.
\begin{figure}
@ -89,19 +89,16 @@ As noted above, we assume that the number of columns is comparatively small, and
\trimfigurespacing
\end{figure}
\paragraph{Range Maps}
The core building block for the update index is a one-dimensional range map, an ordered map with integer keys, optimized for storing contiguous ranges of mappings.
In addition to the usual operations of an ordered map (e.g., \lstinline{put}, \lstinline{get}, \lstinline{successorOf}), a range map defines an operation \lstinline{bulkPut(low, high, value)}.
The semantics of this operation are defined as a \lstinline{put} on every element in the range from \lstinline{low} to \lstinline{high}.
Implemented naively through a binary tree over $N$ elements, this operation takes $O((high-low)\cdot\log(N))$ time.
\partitle{Range Maps}
The core building block for the update index is a one-dimensional range map, an ordered map with integer keys.
In addition to the usual operations of an ordered map (e.g., \texttt{put}, \texttt{get}, \texttt{successorOf}), the operation \texttt{bulkPut(low, high, value)} has semantics identical to a \texttt{put} on every element in the range from \texttt{low} to \texttt{high}.
Implemented naively through a binary tree over $N$ elements, this operation takes $O((\texttt{high}-\texttt{low})\cdot\log(N))$ time.
A range map avoids the $(high-low)$ factor (and correspondingly reduces $N$) by storing only one mapping for each \lstinline{bulkPut} operation, as illustrated in \Cref{fig:rangemap}.
Analogous to a range set, we implement a range map as an ordered sequence of tuples of (disjoint) ranges and values.
To provide logarithmic-time access by position, we build a classical ordered map (e.g., a red-black tree) over the lower bound of each range.
Since the map enforces disjoint ranges, ordering on the lower bound guarantees ordering over the full range.
A range map avoids the $(\texttt{high}-\texttt{low})$ factor (and correspondingly reduces $N$) by storing an ordered sequence of disjoint ranges, each mapped to a single value, as illustrated in \Cref{fig:rangemap}.
A binary tree provides efficient access to the ranges.
With a range map, the set of distinct values appearing in a range can be accessed in $O(\log(N)+M)$ time (where $M$ is the number of distinct values), with similar insertion and deletion costs.
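For intuition only, the following Python sketch models a range map as a sorted list of disjoint \texttt{(low, high, value)} entries; the sorted list stands in for the balanced tree used for logarithmic access, and the names are illustrative rather than the prototype's API:
\begin{verbatim}
import bisect

class RangeMap:
    # Disjoint integer ranges -> values, kept sorted by lower bound.
    def __init__(self):
        self.entries = []                 # list of (low, high, value)

    def bulkPut(self, low, high, value):
        # Trim or drop any entries overlapping [low, high] ...
        kept = []
        for (l, h, v) in self.entries:
            if h < low or l > high:       # disjoint: keep unchanged
                kept.append((l, h, v))
            else:
                if l < low:               # keep the prefix below the new range
                    kept.append((l, low - 1, v))
                if h > high:              # keep the suffix above the new range
                    kept.append((high + 1, h, v))
        # ... then store one entry for the whole new range.
        kept.append((low, high, value))
        self.entries = sorted(kept)

    def put(self, key, value):
        self.bulkPut(key, key, value)

    def get(self, key):
        # Last entry whose lower bound is <= key, if it covers key.
        i = bisect.bisect_right(self.entries, (key, float("inf"))) - 1
        if i >= 0:
            l, h, v = self.entries[i]
            if l <= key <= h:
                return v
        return None
\end{verbatim}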
\paragraph{Cell Access}
\partitle{Cell Access}
Efficient access to individual cells is obtained through a two-layered forward index, consisting of an unordered map over a set of range maps.
The pattern for a specific cell is obtained by looking up the cell's column in the unordered map, and the cell's row in the corresponding range map.
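Continuing the sketch above (and under the same caveat that this is illustrative Python rather than the prototype's implementation), the forward index is then simply a dictionary from column identifiers to range maps:
\begin{verbatim}
class ForwardIndex:
    # Two-layered forward index: column id -> RangeMap over row numbers.
    def __init__(self):
        self.columns = {}                      # unordered map of range maps

    def putPattern(self, column, low, high, pattern):
        # Record that `pattern` defines rows [low, high] of `column`.
        self.columns.setdefault(column, RangeMap()).bulkPut(low, high, pattern)

    def getPattern(self, column, row):
        # Column lookup in the unordered map, then row lookup in its range map.
        rangemap = self.columns.get(column)
        return rangemap.get(row) if rangemap is not None else None
\end{verbatim}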
@ -130,7 +127,7 @@ The pattern for a specific cell is obtained by looking up the cell's column in t
\end{algorithmic}
\end{algorithm}
\paragraph{Upstream Reachability}
\partitle{Upstream Reachability}
In order to develop a materialization plan, the execution layer needs to be able to derive the set of cells on which a specific cell (or range of cells) depends.
We refer to this as the set of upstream cells for the specific target.
\Cref{alg:upstream} illustrates a naive breadth-first search to obtain the full upstream set for a given target range.
@ -140,7 +137,7 @@ If we discover a new dependency (lines 6-7), the newly discovered range is added
We will explain line 10 shortly.
The \textbf{getDeps} operation (Line 5; \Cref{alg:getDeps}) returns the set of dependencies of a pattern applied to a specific range of cells $\rangeOf{\column}{\rowRange}$.
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column, \rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf{\cell'}$.
Concretely, it returns a set of cells $\texttt{deps}$ such that for each cell $\cell \in \texttt{deps}$, there exists at least one cell $\cell' \in \rangeOf{\column}{\rowRange}$ such that $\cell$ is in the transitive closure of $\depsOf{\cell'}$.
The algorithm uses a recursive traversal (lines 6-7) to visit every cell reference (offset or explicit):
For offset references (lines 2-3), the provided range of rows is offset by the appropriate amount.
For explicit cell references (lines 4-5), the explicit reference is used.
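A rough Python rendering of this traversal follows (a sketch only: the pattern and index representations are simplified stand-ins, not the structures used by the prototype):
\begin{verbatim}
from collections import deque

# A pattern's references are modeled as tuples:
#   ("offset", column, k)      -- same row shifted by k (e.g., +0, -1)
#   ("explicit", column, row)  -- one fixed cell
def get_deps(refs, low, high):
    # Ranges reachable in one step from rows [low, high] via refs.
    deps = []
    for kind, column, k in refs:
        if kind == "offset":
            deps.append((column, low + k, high + k))   # shift the whole range
        else:
            deps.append((column, k, k))                # explicit single cell
    return deps

def upstream(index, target):
    # index: column -> list of ((low, high), refs) pattern entries.
    # Naive breadth-first search over ranges of cells.
    frontier, seen = deque([target]), set()
    while frontier:
        column, low, high = frontier.popleft()
        for (plo, phi), refs in index.get(column, []):
            lo, hi = max(low, plo), min(high, phi)     # rows this pattern covers
            if lo > hi:
                continue
            for dep in get_deps(refs, lo, hi):
                if dep not in seen:                    # newly discovered range
                    seen.add(dep)
                    frontier.append(dep)
    return seen
\end{verbatim}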
@ -164,10 +161,10 @@ For explicit cell references (lines 4-5), the explicit reference is used.
\end{algorithm}
\paragraph{Optimizing Recursive Reachability}
\partitle{Optimizing Recursive Reachability}
Consider a computation analogous to that of \Cref{fig:overlay}, where one cell computes a running total of a second cell. Such a cell might be defined by a pattern:
$$\rangeOf{\texttt{total}}{[1, 1000]} \rightarrow \cellRef{\texttt{value}}{+0} + \cellRef{\texttt{total}}{-1}$$
$$\rangeOf{\texttt{total}}{1-1000} \gets \cellRef{\texttt{value}}{+0} + \cellRef{\texttt{total}}{-1}$$
Naively implemented, as in \Cref{alg:upstream}, computing reachability for the cell $\cellRef{\texttt{total}}{1000}$ will require visiting every distinct cell in the range 1-999: $\cellRef{\texttt{total}}{1000}$ depends on $\cellRef{\texttt{total}}{999}$, which depends on $\cellRef{\texttt{total}}{998}$, and so forth.
Observe that this dependency chain is defined entirely by a single pattern: Each cell defined by the pattern depends on another cell defined by the pattern.
@ -188,21 +185,20 @@ In its naive implementation, \textbf{upstream} attempts to advance the frontier
However, prior to line 5, we can check the lineage object to determine if the pattern defining the cell we are currently examining has previously been encountered along the path being advanced at an offset of $\pm 1$.
If so, we add the remainder of the range over which the pattern is defined in the direction indicated by the offset to the active range.
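For the running-total pattern above, the effect of this shortcut can be sketched as follows (again illustrative Python, not the prototype's code):
\begin{verbatim}
# total[1-1000] -> value[+0] + total[-1]: the naive search visits
# total[999], total[998], ... one row at a time.  Because the pattern
# references itself at offset -1, the shortcut adds the rest of the
# pattern's own range in that direction as a single range.
def self_reference_shortcut(pattern_low, pattern_high, row, offset):
    # Rows of the same pattern reachable by repeatedly following a
    # +1 / -1 self-reference from `row`, returned as one range.
    if offset < 0:
        return (pattern_low, row - 1) if row > pattern_low else None
    return (row + 1, pattern_high) if row < pattern_high else None

# self_reference_shortcut(1, 1000, 1000, -1) == (1, 999)
\end{verbatim}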
\paragraph{Downstream Reachability}
\partitle{Downstream Reachability}
When a cell's expression is updated, cells that depend on it (even transitively) must be recomputed.
The index must thus also support downstream reachability queries.
To support these efficiently, we maintain a backward index that relates cell ranges to the ranges of patterns that depend on them.
Analogous to $\textbf{getDeps}$ inferring the cells immediately upstream of a range of cells, we can infer the cells downstream of any cell or set of cells, with one caveat:
When the cell identified by an absolute reference in a pattern is modified, all cells using the pattern are invalidated, so we track the set of ranges over which any given pattern is defined.
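Concretely, a minimal sketch of such a backward index (illustrative Python; the names and representation are not the prototype's):
\begin{verbatim}
class BackwardIndex:
    # For each column, records which pattern ranges read from which cells.
    def __init__(self):
        self.readers = {}   # column -> [((low, high), dependent pattern range)]

    def addReader(self, src_column, low, high, dependent):
        # `dependent` identifies the range of cells defined by the reading pattern.
        self.readers.setdefault(src_column, []).append(((low, high), dependent))

    def downstream(self, column, low, high):
        # Pattern ranges that directly depend on any cell in [low, high].
        hits = []
        for (slo, shi), dependent in self.readers.get(column, []):
            if not (shi < low or slo > high):          # the ranges overlap
                hits.append(dependent)
        return hits
\end{verbatim}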
% \paragraph{Column Insertions, Deletions, and Moves}
% \partitle{Column Insertions, Deletions, and Moves}
% At the index layer, columns are referenced by unique identifier; Ordering is imposed only at the presentation layer.
% Column insertion (or deletion) requires simply inserting (resp., removing) an entry from the forward and backward index.
% Column reordering requires no actions at the index level.
% \paragraph{Row Insertions, Deletions, and Moves}
% \partitle{Row Insertions, Deletions, and Moves}
% Rows are identified by their position.
% When a row is inserted, deleted, or moved, references to the affected rows (and rows following them) change and must be updated.