Related work, abstract, conclusions

main
Oliver Kennedy 2023-03-29 22:00:00 -04:00
parent a078dd8525
commit 9f701be095
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
5 changed files with 96 additions and 12 deletions

@ -206,4 +206,48 @@ Nigel Westbury},
title = {TPC Benchmark H (Decision Support), Revision 2.18.0},
howpublished = {https://www.tpc.org/tpch/default5.asp},
year = {2018}
}
}
@inproceedings{DBLP:conf/sigmod/JagadishCEJLNY07,
author = {H. V. Jagadish and
Adriane Chapman and
Aaron Elkiss and
Magesh Jayapandian and
Yunyao Li and
Arnab Nandi and
Cong Yu},
title = {Making database systems usable},
booktitle = {{SIGMOD} Conference},
pages = {13--24},
publisher = {{ACM}},
year = {2007}
}
@inproceedings{DBLP:conf/icde/LiuJ09,
author = {Bin Liu and
H. V. Jagadish},
title = {A Spreadsheet Algebra for a Direct Data Manipulation Query Interface},
booktitle = {{ICDE}},
pages = {417--428},
publisher = {{IEEE} Computer Society},
year = {2009}
}
@inproceedings{DBLP:conf/chi/BakkeKM11,
author = {Eirik Bakke and
David R. Karger and
Rob Miller},
title = {A spreadsheet-based user interface for managing plural relationships
in structured data},
booktitle = {{CHI}},
pages = {2541--2550},
publisher = {{ACM}},
year = {2011}
}

@ -121,8 +121,15 @@
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
While databases provide extensive functionality and guarantees for working with data, the majority of humanity manages their data in spreadsheets. Past work on database usability has argued that spreadsheet interfaces backed by an efficient storage engine can enable effective data management. The spreadsheet interface has the advantage that it enables one-off fixes to data that are inconvenient to implement in a query language. However, building a spreadsheet interface on top of a database-like system is challenging as it requires dealing with formulas (which are essentially large numbers of ``small'' views) and dealing with positional access to data. Specifically, maintaining the mapping between positions and data under updates (e.g., deleting or inserting a row) can be challenging. Based on the observation that a user can only view a small portion of a spreadsheet at each point in time, we argue that a light-weight overlay mechanism that records updates to the spreadsheet in a compact way and applies such overlays (updates) to the currently visible portion of the spreadsheet is superior to eagerly applying updates to the full spreadsheet. Furthermore, by tracking the computation (updates) rather than the current values of all cells, our approach allows efficient versioning of spreadsheets and enables a user's edits to be translated when a dataset is replaced with a more up-to-date version. While we share this benefit with the implementation of spreadsheets in the Vizier system, overlays significantly improve performance compared to that system and are comparable to DataSpread, a scalable spreadsheet system that materializes the results of updates eagerly.
\BG{Results and update if not correct}
Spreadsheets provide a convenient, friendly direct manipulation interface to datasets.
Efforts to scale spreadsheets have taken two approaches: a ``virtual'' strategy that imposes a spreadsheet-like interface over an existing database engine, and a ``materialized'' strategy based on re-engineering the spreadsheet engine around standard database optimizations like indexes.
Because database engines are typically optimized for bulk query processing rather than interactive latencies, the materialized approach has better performance.
However, the virtual approach offers several key advantages that cannot be easily replicated in the materialized approach, notably the ability to re-apply user interactions to an updated version of the same dataset.
We propose a hybrid of the materialized and virtual approaches, where patterns of user updates are indexed (as in the materialized approach) and overlaid on an existing dataset (as in the virtual approach).
We introduce the overlay update model, and outline strategies for efficiently accessing a spreadsheet defined in this way.
A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as ``patterns'' that can be leveraged to reduce execution costs.
We implement an overlay spreadsheet over Apache Spark and compare it against DataSpread, a popular materialized spreadsheet.
Our preliminary results show that overlay spreadsheets can significantly reduce execution costs.
\end{abstract}
%%

@ -1,8 +1,19 @@
%!TEX root=../main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusions}
\section{Conclusions and Future Work}
\label{sec:conclusions}
In this work, we introduced overlay spreadsheets as a potential solution for implementing scalable, versioned spreadsheets, which have the important advantage that a user's edits can (where possible) be reapplied when there are updates to the input data, a common issue in practice. This novel capability is powered by overlays, which allow updates to the spreadsheet to be represented declaratively. While representing updates as ``views'' over the original spreadsheet has been applied in the Vizier project to enable provenance-tracking and light-weight versioning for computational notebooks whose datasets can be accessed and edited through a spreadsheet interface, the overlay approach we present in this work significantly improves performance. \BG{How do we fare against data spread?}
In this work, we introduced overlay spreadsheets as a potential direction for reproducible spreadsheets where a user's edits can be re-applied to updated input data, and thus used directly in classical workflow and provenance analysis systems like Vizier.
This novel capability is powered by overlays that decouple the user's edits from the source data they are applied to.
We also demonstrated how updates to ranges of cells can be represented declaratively, improving performance and introducing several avenues for optimized evaluation of recursive patterns.
Recursive patterns remain the source of several open challenges for us.
Most notably, in the absence of recursive patterns, the depth of a dependency chain is bounded by the number of user interactions.
We suggested two strategies for improving performance in the presence of recursive patterns: (i) closed-form computation of dependencies, and (ii) using bulk processing to avoid individual evaluation of cells that are not being shown to the user.
We also observe two additional challenges of adapting a dataset to new source data.
As we noted, row identity is a critical challenge for updating source data, as each row in the updated dataset needs to be mapped to its corresponding row in the original.
Additionally, the spreadsheet itself may need to change, for example extending patterns to incorporate newly introduced rows in the dataset.
%%% Local Variables:
%%% mode: latex

@ -3,7 +3,7 @@
\label{sec:introduction}
Spreadsheets are a popular tool for data exploration, transformation, and visualization, but have historically had challenges managing ``big data'': as few as fifty thousand rows of data can create problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others, relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system like Apache Spark.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others~\cite{DBLP:conf/icde/LiuJ09}, relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system like Apache Spark.
In this model, the spreadsheet is a chain of versions, each linked by a lightweight transformation function~\cite{freire:2016:hilda:exception}.
A more recent approach, employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the entire spreadsheet runtime around database primitives like indexes and incremental maintenance specialized for spreadsheet access patterns.
We refer to these as the virtual and materialized approaches, respectively, and illustrate them in \Cref{fig:overlay}.

@ -1,15 +1,37 @@
%!TEX root=../main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Related Work}
\label{sec:related-work}
\BG{needs to be compressed}
Scaling up spreadsheets has been identified as an issue in prior work by the database community. One noteworthy project is \emph{DataSpread}~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}. \cite{DBLP:conf/icde/BendreVZCP18} introduced several storage layouts for spreadsheet data (e.g., a position-to-value mapping that is efficient for sparse spreadsheets, or encoding the rows / columns as rows of a single relation, which is more efficient for dense spreadsheets), introduced a heuristic for self-tuning storage by selecting an appropriate layout for individual parts of a large spreadsheet, and introduced a tree index structure that enables the positions of cells to be maintained under insertions and deletions of rows in time logarithmic in the size of the spreadsheet while also supporting look-ups (retrieving the cell at a certain position) in logarithmic time.
%
\cite{DBLP:conf/sigmod/BendreWMCP19} introduced asynchronous algorithms for updating the values of cells with formulas when a cell is updated. This work compresses the dependency graph of a spreadsheet, which stores dependencies between cells (the formula of a cell references another cell), into a table that compactly over-approximates the transitive closure of the inverse dependency relation for cells using a constant number of cell ranges. When a cell is updated, this table is then used to determine a super-set of the cells that depend on the cell directly or indirectly and may need to be refreshed. \cite{tang-23-efcsfg} introduces a different type of compressed dependency graph which is lossless and exploits repetitive patterns in formulas, which are common in spreadsheets due to features like auto-fill and the fact that a formula only determines the value of a single cell, e.g., when all cells of a column are computed based on other columns within the same row. While these techniques enable fast re-computation of cell values, they do not enable the input dataset to be updated as they do not track updates. Furthermore, how to efficiently support updates like inserting and deleting rows, which potentially affect large parts of the dependency graph, has not been addressed in this work.\footnote{\cite{tang-23-efcsfg} may be better equipped to handle such updates, but as this is a lossless data structure, this may still require modifying a large number of, or even all, entries in a compressed formula graph.}
Although spreadsheets present a convenient, direct-manipulation interface to data, they lack the scalability to manage large data.
A common approach to scaling spreadsheets --- what we term the ``virtual'' approach --- is to reformulate the interface to an existing database or workflow system using spreadsheet-style direct manipulation metaphors~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/icde/LiuJ09,freire:2016:hilda:exception,DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/chi/KandelPHH11}.
The resulting systems bear varying levels of resemblance to existing spreadsheets, usually introducing concepts from relational databases like explicit tables, attributes, and records.
Vizier~\cite{brachmann:2019:sigmod:data, kennedy:2022:ieee-deb:right, kumari:2021:cidr:datasense, brachmann:2020:cidr:your} is a computational notebook system that automatically versions notebooks while they are edited by a user. This is achieved using a light-weight versioning scheme based on workflow evolution provenance, i.e., storing updates to the code (computation) of a notebook rather than the (input, intermediate, and result) data. In Vizier, any dataset used in a computational notebook can be accessed and edited through a spreadsheet interface.
Vizier~\cite{brachmann:2019:sigmod:data, kennedy:2022:ieee-deb:right, kumari:2021:cidr:datasense, brachmann:2020:cidr:your} is a computational notebook system that automatically versions notebooks as they are edited by users.
In Vizier, any dataset used in a computational notebook can be accessed and edited through a spreadsheet interface; the resulting edits are integrated into the workflow.
In summary, several efficient algorithms for storing, accessing, and updating spreadsheets have been developed and adapted in the context of DataSpread. The approach developed for Vizier is often less efficient, but has the advantage of supporting light-weight versioning and tracking the provenance of the evolution of a dataset (and the computational notebook containing it) under spreadsheet operations. Importantly, this approach enables replaying a user's updates that were originally applied to a dataset $D_{old}$ when $D_{old}$ is replaced with an updated dataset $D_{new}$ (e.g., the user may have downloaded a new version of an open dataset and wants to keep the manual fixes they have applied to the original version of the dataset). The overlay approach we present in this work has the potential to retain these benefits while enabling performance close to or exceeding that of DataSpread \BG{fix if not true}. Furthermore, overlays with reference frames enable more efficient support for insertion and deletion of rows and columns as this only affects reference frames, but not the formulas of cells. \BG{That right? How much do we gain by that?}
Wrangler~\cite{DBLP:conf/chi/KandelPHH11} is an ETL workflow development tool with an interface inspired by spreadsheets.
Users open a small sample of a dataset in Wrangler and use spreadsheet-style direct manipulations to indicate a desired change to the dataset.
Wrangler, in turn, proposes ETL workflow steps that can achieve the user's desired effect on the target cell, as well as the remainder of the dataset.
Other approaches more directly mimic relational databases through spreadsheet-style interfaces.
The Spreadsheet Algebra~\cite{DBLP:conf/sigmod/JagadishCEJLNY07,DBLP:conf/icde/LiuJ09} allows users to specify any SPJGA-query purely through spreadsheet-style user interactions.
Related Worksheets~\cite{DBLP:conf/cidr/BakkeB11,DBLP:conf/chi/BakkeKM11} re-imagines the classical spreadsheet-style interface by introducing relational structure, as well as nested display of foreign-key dependencies.
A second class of approaches, which we term the ``materialized'' approach, instead redesigns the spreadsheet engine itself around database concepts.
The primary example in this space is DataSpread~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}.
A key challenge that the materialized approach faces is that classical database techniques, which exploit common structures in a dataset, are not directly applicable.
\cite{DBLP:conf/icde/BendreVZCP18} explores data structures that can leverage partial structure; for example, when a range of cells is structured as a relational table.
\cite{DBLP:conf/sigmod/BendreWMCP19} explores strategies for quickly invalidating cells and computing dependencies, by leveraging a (lossy) compressed dependency graph that can efficiently bound the set of cells downstream of a given cell.
\cite{tang-23-efcsfg} introduces a different type of compressed dependency graph which is lossless, instead exploiting repeating patterns in formulas.
This is analogous to our own approach, but focuses on the dependency graph;
as we demonstrate here, applying a similar approach to expressions as well creates multiple optimization opportunities.
In summary, several efficient algorithms for storing, accessing, and updating spreadsheets have been developed and adapted in the context of DataSpread.
The approach developed for Vizier is often less efficient, but has the advantage of supporting light-weight versioning and tracking the provenance of the evolution of a dataset (and the computational notebook containing it) under spreadsheet operations.
Importantly, this approach enables replaying a user's updates that were originally applied to a dataset $D_{old}$ when $D_{old}$ is replaced with an updated dataset $D_{new}$ (e.g., the user may have downloaded a new version of an open dataset and wants to keep the manual fixes they have applied to the original version of the dataset).
The overlay approach we present in this work has the potential to retain these benefits while enabling performance competitive with, or exceeding, that of DataSpread.
Furthermore, overlays with reference frames enable more efficient support for insertion and deletion of rows and columns as this only affects reference frames, but not the formulas of cells.
%%% Local Variables: