Merge branch 'main' of git.odin.cse.buffalo.edu:VizierDB/paper-Vizier-SpreadsheetOverlay

Oliver Kennedy 2023-03-19 12:55:40 -04:00
commit 1a5562591b
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 355 additions and 22 deletions

.gitignore

@@ -7,4 +7,321 @@ comment.cut
*.synctex.gz
*.bbl
*.blg
# Created by https://www.toptal.com/developers/gitignore/api/latex
# Edit at https://www.toptal.com/developers/gitignore?templates=latex
### LaTeX ###
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
*.fot
*.cb
*.cb2
.*.lb
## Intermediate documents:
*.dvi
*.xdv
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf
## Generated if empty string is given at "Please type another file name for output:"
.pdf
## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.run.xml
## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex(busy)
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync
## Build tool directories for auxiliary files
# latexrun
latex.out/
## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa
# achemso
acs-*.bib
# amsthm
*.thm
# beamer
*.nav
*.pre
*.snm
*.vrb
# changes
*.soc
# comment
*.cut
# cprotect
*.cpt
# elsarticle (documentclass of Elsevier journals)
*.spl
# endnotes
*.ent
# fixme
*.lox
# feynmf/feynmp
*.mf
*.mp
*.t[1-9]
*.t[1-9][0-9]
*.tfm
#(r)(e)ledmac/(r)(e)ledpar
*.end
*.?end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R
# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
*.glsdefs
*.lzo
*.lzs
*.slg
*.slo
*.sls
# uncomment this for glossaries-extra (will ignore makeindex's style files!)
# *.ist
# gnuplot
*.gnuplot
*.table
# gnuplottex
*-gnuplottex-*
# gregoriotex
*.gaux
*.glog
*.gtex
# htlatex
*.4ct
*.4tc
*.idv
*.lg
*.trc
*.xref
# hyperref
*.brf
# knitr
*-concordance.tex
# TODO Uncomment the next line if you use knitr and want to ignore its generated tikz files
# *.tikz
*-tikzDictionary
# listings
*.lol
# luatexja-ruby
*.ltjruby
# makeidx
*.idx
*.ilg
*.ind
# minitoc
*.maf
*.mlf
*.mlt
*.mtc[0-9]*
*.slf[0-9]*
*.slt[0-9]*
*.stc[0-9]*
# minted
_minted*
*.pyg
# morewrites
*.mw
# newpax
*.newpax
# nomencl
*.nlg
*.nlo
*.nls
# pax
*.pax
# pdfpcnotes
*.pdfpc
# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd
# scrwfile
*.wrt
# svg
svg-inkscape/
# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/
# pdfcomment
*.upa
*.upb
# pythontex
*.pytxcode
pythontex-files-*/
# tcolorbox
*.listing
# thmtools
*.loe
# TikZ & PGF
*.dpth
*.md5
*.auxlock
# titletoc
*.ptc
# todonotes
*.tdo
# vhistory
*.hst
*.ver
# easy-todo
*.lod
# xcolor
*.xcp
# xmpincl
*.xmpi
# xindy
*.xdy
# xypic precompiled matrices and outlines
*.xyc
*.xyd
# endfloat
*.ttt
*.fff
# Latexian
TSWLatexianTemp*
## Editors:
# WinEdt
*.bak
*.sav
# Texpad
.texpadtmp
# LyX
*.lyx~
# Kile
*.backup
# gummi
.*.swp
# KBibTeX
*~[0-9]*
# TeXnicCenter
*.tps
# auto folder when using emacs and auctex
./auto/*
*.el
# expex forward references with \gathertags
*-tags.tex
# standalone packages
*.sta
# Makeindex log files
*.lpz
# xwatermark package
*.xwm
# REVTeX puts footnotes in the bibliography by default, unless the nofootinbib
# option is specified. Footnotes are the stored in a file with suffix Notes.bib.
# Uncomment the next line to have this generated file ignored.
#*Notes.bib
### LaTeX Patch ###
# LIPIcs / OASIcs
*.vtc
# glossaries
*.glstex
# End of https://www.toptal.com/developers/gitignore/api/latex
/main.pdf


@@ -2,11 +2,13 @@
\section{Introduction}
\label{sec:introduction}
\BG{In the next pass over the intro we can start to reduce redundancy and compact}
Spreadsheets are a popular tool for data exploration, as they provide a simple environment for programmatically accessing and manipulating data.
However, spreadsheets have historically had challenges managing ``big data'', with as few as fifty thousand rows of data creating problems for existing spreadsheet engines~\cite{DBLP:conf/sigmod/RahmanMBZKP20}.
One approach to scalability, employed by \emph{Wrangler}~\cite{DBLP:conf/chi/KandelPHH11}, \emph{Vizier}~\cite{freire:2016:hilda:exception,brachmann:2020:cidr:your}, and others, relies on translating spreadsheet interactions into declarative transformations (dataflows) that can be deployed to a database or dataflow system like Spark. As demonstrated in the context of Vizier, this enables light-weight versioning of spreadsheets, as only multiple versions of the computation (which are small), rather than multiple versions of the data (which are large), have to be stored.
A more recent approach, employed by \emph{DataSpread}~\cite{DBLP:conf/sigmod/BendreWMCP19,DBLP:conf/sigmod/RahmanMBZKP20,DBLP:conf/icde/BendreVZCP18}, instead re-architects the entire spreadsheet runtime around database primitives like indexes and incremental maintenance, specialized for spreadsheet access patterns.
We refer to these as the virtual and materialized approaches, respectively.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
@@ -23,40 +25,53 @@ The data access patterns of spreadsheets have several characteristics~\cite{DBLP:
\begin{itemize}
\item \textbf{Positional Order:} Both the rows and columns in a spreadsheet are ordered, and cells can be referenced by position. As already observed in \cite{DBLP:conf/icde/BendreVZCP18}, cell positions need to be maintained under operations (e.g., inserting or deleting rows and columns).
\item \textbf{Local Updates:} Editing the content of cells in a spreadsheet results in local updates that typically affect one or a small number of cells. Operations like inserting \& deleting rows or columns affect the whole spreadsheet and require updates to the positions of a large number of cells (\cite{DBLP:conf/icde/BendreVZCP18} presented an index structure that allows fast maintenance of positions under updates and access to a cell at a certain position).
\item \textbf{Small Number of Updates:} Since spreadsheet updates are typically made manually, the number of updates is limited by the speed of a human interacting with the system.
\item \textbf{Limited Visibility:} At each point in time, the user can only observe a small portion of the spreadsheet (that fits on the screen). This enables lazy materialization of cell values and updates to the spreadsheet, as the values of cells that are not currently visible to the user only have to be computed if they affect the value of a currently visible cell (e.g., because the visible cell uses a formula that references the other cell).
\item \textbf{Formulas:} An important feature of spreadsheets is that cells may contain formulas, which are expressions that reference other cells in the spreadsheet. This essentially empowers the user to write highly localized ``views''. Formulas complicate the implementation of spreadsheets, because to compute the value of a cell, the values of all cells it depends on directly or indirectly have to be determined first.
\end{itemize}
The implementation of the materialized approach in DataSpread targets such access patterns by (i) using a specialized index structure to maintain positional order in logarithmic time under typical spreadsheet updates, (ii) developing specialized compressed representations of dependencies between cells for efficient computation of the values of cells with formulas, and (iii) prioritizing the refresh of the values of cells that are currently visible to the user. Maintaining positional order efficiently, and exploiting the fact that the user can only view a small portion of the spreadsheet
at each point in time, is much harder in the virtual approach, as the results of updates and their effects on cell positions are only materialized when data is accessed. Thus, the virtual approach is often less efficient. However, the virtual approach also has several important benefits: because we store only the updates made by the user (insert a row at position $x$, replace the value of cell $c$ with $v$, \ldots), multiple versions of the spreadsheet can be retained at very low storage cost by storing changes to the sequence of transformations that were applied to the data, rather than changes to the data itself (linear in the number of operations, independent of the size of the spreadsheet).
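The virtual approach's storage model can be illustrated with a minimal sketch (the operation names and grid representation here are hypothetical simplifications; the actual systems support far richer operations): the log of edits is small and independent of the data size, and replaying it against a revised input re-derives the user's spreadsheet.

```python
from dataclasses import dataclass

@dataclass
class InsertRow:
    position: int   # row index at which a new (empty) row appears

@dataclass
class SetCell:
    row: int
    col: int
    value: object   # literal overwrite for the cell

def replay(source, log):
    """Apply the edit log to a fresh copy of the source grid."""
    grid = [list(row) for row in source]
    width = len(grid[0]) if grid else 0
    for op in log:
        if isinstance(op, InsertRow):
            grid.insert(op.position, [None] * width)
        elif isinstance(op, SetCell):
            grid[op.row][op.col] = op.value
    return grid

# The same log replays against a revised version of the input:
v1 = [[1, 2], [3, 4]]
log = [InsertRow(1), SetCell(1, 0, "total")]
v2 = [[1, 2], [30, 40]]   # provider uploads new data
```

Note that the storage cost of `log` is linear in the number of user operations, regardless of whether the source has four rows or four million.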
In Vizier, the user's manipulations are encoded in the lineage of a Spark dataframe, facilitating detailed provenance analysis.
In fact, this is how versioning of datasets (spreadsheets) is implemented in Vizier. Furthermore, spreadsheets are often used for data preparation: a user loads a dataset and then iteratively curates the data. If the original dataset is updated (e.g., the user may have downloaded the dataset from an open data portal, and the provider of the portal has uploaded a new version of the dataset), then the user may want to reapply their edits to the new version of the dataset. This translation is often possible in the virtual approach, but is not possible in the materialized approach, where there is no log of the update operations that were applied to the spreadsheet.
\BG{Because relational engines are not optimized for low-latency update queries, the architectural approach can provide significantly better interactive performance than the relational approach.}\BG{Furthermore, an important feature of spreadsheets is that cells may contain expressions (\emph{formulas}) which reference other cells. This enables users to create ``local'' views which populate some parts of the spreadsheet automatically. The evaluation of formulas is an inherently recursive process (a cell with a formula may refer to another cell that itself contains a formula and so on) which is hard to express efficiently in SQL.}
% However, the relational design an important benefit:
% User interactions manipulate a data transformation process, rather than the data itself.
% In Wrangler, the resulting data transformation process can be easily upscaled from an interaction-friendly sample of the data to the entire dataset.
% In Vizier, the user's manipulations are encoded in the lineage of a Spark's dataframe, facilitating detailed provenance analysis and effective versioning.
In this paper, we present an optimized hybrid of the virtual and materialized approaches to scalable spreadsheets: \emph{Overlay Spreadsheets}.
An Overlay Spreadsheet keeps source data in-situ, decoupled from the user's edits to a spreadsheet ``overlaid'' on top of the source data, as illustrated in \Cref{fig:overlay}.
Users interact with the resulting structure much like they would an ordinary spreadsheet, inserting or removing rows or columns, overwriting data with formulas or literals, and reorganizing the data.
Crucially, the overlay virtualizes references to the source dataset, allowing users to replay their actions on a new, updated dataset. In contrast to the purely virtual approach, which expresses updates as relational operations, overlays store more concrete information about updates to the positional order of cells and about which cells were modified. We demonstrate that this different virtual representation of edits enables more efficient exploitation of spreadsheet access patterns, including computing the values of the cells currently visible to the user.
We outline a preliminary implementation of Overlay Spreadsheets within Vizier~\cite{brachmann:2019:sigmod:data,brachmann:2020:cidr:your,kennedy:2022:ieee-deb:right}, a multi-modal, reproducibility-oriented, notebook-style workflow system built on Apache Spark.
Users of Vizier define sequences of data transformation steps that may include scripts, templated widgets, or other operations.
A key feature of Vizier is that users can define data transformations (including limited formula support) through a spreadsheet-style interface;
Following~\cite{freire:2016:hilda:exception}, user interactions are applied to a dataframe, and the results are updated and displayed.
As mentioned above, in spite of this approach's performance limitations, it remains preferable, as it allows user actions to be reapplied to new source data (a necessity in Vizier's workflow model), and enables fine-grained provenance analysis and light-weight versioning (other key features of Vizier).
Concretely, our objective is to demonstrate a spreadsheet-style interface that provides interactive latencies (i.e., like the materialized approach), while simultaneously supporting replay, provenance, and updates to the input data (i.e., like the virtual approach).
The resulting interface can be used for data exploration, data preparation, and preliminary analysis, but also provides a low-friction, visual environment for defining bulk data transformations.
The overlay approach also carries one additional benefit.
As mentioned above, typically the number of interactions that the user performs on the dataset will be small compared to the size of the dataset --- the user is unlikely to need to manually inspect and update each individual row of a million-row dataset.
Rather, we expect a common pattern to involve fine-grained manipulation of a small fragment of the dataset to derive new formulas, followed by a bulk application of the formula to the remainder of the dataset.
Under the assumption that the majority of cell updates will be bulk applications of a common formula ``pattern,'' the overlay only needs to record the pattern and the range of cells it was applied to. This is akin to optimizations applied in DataSpread~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}, but we create patterns of updates to the spreadsheet, rather than patterns of dependencies in a particular version of the spreadsheet.
Such patterns reference external cells by offsets to the cell on which the formula is applied, allowing one pattern to define an entire data-parallel computation.
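As a rough illustration (a hypothetical encoding, not the paper's actual implementation), a single pattern that references source cells by relative offsets can define the formula for an entire range of target cells:

```python
def apply_pattern(grid, rows, col, offsets, combine):
    """For each target cell (r, col), read the cells at (r + dr, col + dc)
    for every stored relative offset (dr, dc), then combine them into one
    value. One stored pattern thus covers the whole target range."""
    results = {}
    for r in rows:
        args = [grid[r + dr][col + dc] for dr, dc in offsets]
        results[(r, col)] = combine(args)
    return results

# Pattern "C = A + B", recorded once and applied to rows 0..2 of column 2:
grid = [[1, 10, None], [2, 20, None], [3, 30, None]]
filled = apply_pattern(grid, range(3), 2, [(0, -2), (0, -1)], sum)
```

The overlay stores only the offset list and the target range, so the encoding size is independent of how many cells the pattern populates.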
This form of compression can substantially reduce the size of the overlay's encoding, but its use of offset positions becomes problematic if the shape of the dataset changes.
For example, if a new row is inserted, the offset for a given formula changes, an issue that was not addressed in~\cite{DBLP:conf/sigmod/BendreWMCP19, tang-23-efcsfg}.
We explore how the compression can be preserved through versioned ``reference frames'' that record and facilitate low-overhead transformations between different versions of the mapping between positions and cells that defines a spreadsheet.
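The intended semantics of such a position translation can be sketched as follows (a simplified model under assumed semantics, covering only row insertions and deletions; the actual reference-frame mechanism is more general):

```python
def translate(position, frame):
    """Map a row position recorded in an old reference frame to the new
    frame. `frame` is a list of ("insert" | "delete", pos) operations
    describing how the new frame was derived from the old one."""
    p = position
    for kind, pos in frame:
        if kind == "insert" and pos <= p:
            p += 1                    # row shifted down by the insertion
        elif kind == "delete":
            if pos == p:
                return None           # the referenced row no longer exists
            if pos < p:
                p -= 1                # row shifted up by the deletion
    return p
```

Because only the frame (the list of structural edits) is stored, stored offsets never need to be rewritten; they are translated on demand.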
As a further advantage of the overlay approach, user interactions with the overlay can be translated into other representations~\cite{freire:2016:hilda:exception}.
For example, a user's edits on a spreadsheet can be transformed into a series of transformations over a dataframe, allowing seamless integration of existing approaches to provenance management~\cite{brachmann:2020:cidr:your,kumari:2021:cidr:datasense} and workflow execution~\cite{kennedy:2022:ieee-deb:right}.
@@ -64,7 +79,7 @@ For example, a user's edits on a spreadsheet can be transformed into a series of
In this paper, we introduce Overlay Spreadsheets, and present the details of our prototype implementation.
We implement the concept in the Vizier notebook~\cite{kennedy:2022:ieee-deb:right,brachmann:2020:cidr:your,brachmann:2019:sigmod:data}, a workflow-style notebook built over Apache Spark.
We explore the challenges of integrating overlay spreadsheets with Apache Spark dataframes, and discuss preliminary work in translating an overlay spreadsheet to derive a dataframe.
\BG{Experimental result take-aways}


@@ -2,6 +2,7 @@
\section{Related Work}
\label{sec:related-work}
\BG{needs to be compressed}
Scaling up spreadsheets has been identified as an issue in prior work by the database community. One noteworthy project is \emph{DataSpread}~\cite{DBLP:conf/icde/BendreVZCP18, DBLP:conf/sigmod/RahmanMBZKP20, DBLP:conf/sigmod/BendreWMCP19}. \cite{DBLP:conf/icde/BendreVZCP18} introduced several storage layouts for spreadsheet data (e.g., a position-to-value mapping that is efficient for sparse spreadsheets, or encoding the rows / columns as rows of a single relation, which is more efficient for dense spreadsheets), introduced a heuristic for self-tuning storage that selects an appropriate layout for individual parts of a large spreadsheet, and introduced a tree index structure that maintains the positions of cells under insertions and deletions of rows in time logarithmic in the size of the spreadsheet, while also supporting look-ups (retrieving the cell at a certain position) in logarithmic time.
%
\cite{DBLP:conf/sigmod/BendreWMCP19} introduced asynchronous algorithms for updating the values of cells with formulas when a cell is updated. This work compresses the dependency graph of a spreadsheet, which stores dependencies between cells (the formula of a cell references another cell), into a table that compactly over-approximates the transitive closure of the inverse dependency relation using a constant number of cell ranges. When a cell is updated, this table is used to determine a super-set of the cells that depend on the cell directly or indirectly and may need to be refreshed. \cite{tang-23-efcsfg} introduces a different type of compressed dependency graph which is lossless and exploits repetitive patterns in formulas; such patterns are common in spreadsheets due to features like auto-fill and the fact that a formula only determines the value of a single cell, e.g., when all cells of a column are computed based on other columns within the same row. While these techniques enable fast re-computation of cell values, they do not allow the input dataset to be updated, as they do not track updates. Furthermore, how to efficiently support updates like inserting and deleting rows, which potentially affect large parts of the dependency graph, has not been addressed in this work.\footnote{\cite{tang-23-efcsfg} may be better equipped to handle such updates, but as this is a lossless data structure, this may still require modifying a large number of, or all, entries in a compressed formula graph.}
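The range-based over-approximation idea can be sketched as follows (a loose illustration of the concept only, not DataSpread's actual data structure or algorithm):

```python
def dependents_superset(table, cell):
    """Return the dependent rectangles of every table entry whose source
    rectangle (r1, c1, r2, c2) contains `cell`. Because each entry is a
    coarse rectangle, the result is a superset of the true dependents:
    cheap to probe, but possibly including cells that need no refresh."""
    r, c = cell
    return [dep for (r1, c1, r2, c2), dep in table
            if r1 <= r <= r2 and c1 <= c <= c2]

# Cells in column 0, rows 0..2 feed formulas somewhere in column 1, rows 0..4
# (over-approximated as one rectangle):
table = [((0, 0, 2, 0), (0, 1, 4, 1))]
```

An update to a cell inside the source rectangle triggers a refresh of the whole dependent rectangle; an update anywhere else touches nothing.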