%!TEX root = ../main.tex
In spite of the availability of powerful automated curation, cleaning, and analysis tools, spreadsheets and notebook UIs (e.g., IPython) are still the predominant tools used by most data scientists. Some of their ubiquity can be attributed to the fact that typical users prefer a known interface over an unknown one, even if the unknown interface may be better suited for the task at hand. However, we argue that while both types of interfaces have shortcomings, each is also a good fit for some use cases. Based on this observation, we propose a combined spreadsheet and notebook UI for data curation that combines the best of both worlds and augments it with new functionality such as automated curation operators, deployment of curation workflows over large datasets, and support for exploratory curation tasks. We first review the spreadsheet and notebook UI paradigms and discuss their individual advantages and disadvantages, and then give an overview of how they may be combined into a single coherent interface.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Spreadsheets}
Spreadsheets allow users to specify views (micro-computations) over data inline and display the results of these computations inline with their inputs.
\oksays{I don't like the word generalize here. It implies that an operation is being altered or extended in some way. I think that's a large part of what we need to add.}
Furthermore, they can sometimes help users to repeatedly apply such computations to ``more'' data.
For example, by using a spreadsheet's copy/paste or fill feature, a single formula can be mapped over many cells.
However, this feature is quite limited in current spreadsheet implementations and sometimes leads to unexpected results \bgsays{(sort example). }
BAD: The order in which operations have been applied is not exposed in a spreadsheet. Thus, spreadsheets are poorly suited to complex data curation and exploration operations, because it is hard to keep track of which operations were applied to the data and in which order (needs workflow + provenance). A notebook UI such as Jupyter and the visualizations provided by workflow systems are much better suited in this regard. However, their disadvantage is that they are not well suited for small modifications to data, exploratory changes, and generalization of operations.
%
Based on these observations we propose a hybrid notebook and spreadsheet interface that combines the best of both worlds. In this paper, we discuss how such an interface facilitates data curation and exploration.
%
Furthermore, we propose additional novel functionality for addressing some of the shortcomings of both spreadsheets and notebooks. Some of these are based on changes to the UI, while others rely on changes to the underlying data processing platform.
UI:
1) a more flexible mechanism for generalize-and-apply: the user defines an operation on data, and the system generalizes it and applies it to more data
PLATFORM:
1) by keeping track of a user's operations and representing them as a declarative program, we can support better exploration, including automatic updates of derived data based on changes to operations, 2) provenance tracking to understand and explore the derivation of data, 3) recommendations to let
\bgsays{NOT SURE WHETHER ALL OF THESE SHOULD BE EXPLORED IN THIS PAPER}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Notebook-style UIs}
\begin{itemize}
\item
\end{itemize}
\begin{itemize}
\item
\end{itemize}
Spreadsheets are a ubiquitous data processing tool. Their simplicity, generality, and adaptability make them ideal for ``playing'' with data through predominantly visual programming metaphors. In particular, spreadsheets provide powerful but entirely visual metaphors for programming both data transformations and visualizations. In this paper, we explore how similar visual metaphors can be adapted for use with relational databases and discuss how this exploration informs the design of our prototype data curation tool, called \sysname. \sysname's user interface combines elements of spreadsheets, so-called notebook interfaces, and classical relational queries, enabling easy data manipulation, summarization, and visualization.
One especially powerful feature of the spreadsheet user interface is that it is easy to define both bulk, set-at-a-time operations and exceptional, singleton data operations. The former class, already a strength of relational databases, is crucial for analyzing a dataset of any significant size. However, many applications require users to ``break the rules'' and apply one-off modifications or transformations to individual fields or records. For example, (1) hypothetical what-if scenarios require users to apply small ad-hoc updates to adjust the inputs under test; (2) repairs for data errors may be easy to define for individual cases, but far harder to define in the general case; (3) complex data transformations that need to be generalized may still be easier to define for individual test cases than for bulk data.
By making it easy for users to break the rules, even if only temporarily, spreadsheets empower users to explore data, evaluate options, and better understand the effects of their curation efforts.
Such rule violations, or \textit{singleton} operations, are not handled gracefully by existing relational DBMSes.
\tinysection{Combining Spreadsheets and Notebooks}
Spreadsheets give users an infinite 2-dimensional grid of cells that can hold either constant values or computed values derived from other cells through \textit{formulas}.
Collections, as they exist, are defined implicitly as any 1- or 2-dimensional region of cells that has meaning to the user.
Thus, instead of classical programmatic specification of bulk, set-at-a-time operations as operations over named collection objects, spreadsheets use the metaphor of copying code to a (user-specified) range of cells, combined with relative, positional data dependencies to quite literally ``map'' singleton operations over entire collections.
In addition to making it easy to perform bulk operations over the entire collection, this approach provides a clear affordance for declaring exceptions: even after being copied, each cell's formula is (and is presented as) a singleton, logically independent of the formulas in other cells.
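
% Illustrative sketch (not part of \sysname).
The copy-and-fill metaphor described above can be illustrated with a minimal Python sketch. It is only an illustration under simplifying assumptions, not a description of any real spreadsheet engine or of \sysname: cells are stored in a dictionary keyed by A1-style names, all references are relative, and formulas are plain arithmetic. The helper names \texttt{shift\_rows}, \texttt{fill\_down}, and \texttt{evaluate} are hypothetical.

```python
import re

# Assumption: cells live in a dict keyed by A1-style names; formulas start with "=".
REF = re.compile(r"([A-Z]+)(\d+)")


def shift_rows(formula: str, offset: int) -> str:
    """Shift the row number of every (relative) cell reference by `offset`."""
    return REF.sub(lambda m: f"{m.group(1)}{int(m.group(2)) + offset}", formula)


def fill_down(sheet: dict, source: str, count: int) -> None:
    """Copy the formula in `source` into the `count` cells directly below it."""
    col, row = REF.fullmatch(source).groups()
    for k in range(1, count + 1):
        sheet[f"{col}{int(row) + k}"] = shift_rows(sheet[source], k)


def evaluate(sheet: dict, cell: str):
    """Evaluate a cell; formulas are restricted to arithmetic over other cells."""
    value = sheet[cell]
    if isinstance(value, str) and value.startswith("="):
        expr = REF.sub(lambda m: f"({evaluate(sheet, m.group(0))!r})", value[1:])
        return eval(expr, {"__builtins__": {}})  # toy evaluator, arithmetic only
    return value


sheet = {"A1": 10, "A2": 20, "A3": 30, "B1": "=A1*2"}
fill_down(sheet, "B1", 2)   # "maps" the singleton formula over B2 and B3
sheet["B2"] = "=A2*2+1"     # an exception: B2 is still an independent singleton
```

After the fill, every cell in column B holds its own copy of the formula, which is why overriding \texttt{B2} leaves \texttt{B1} and \texttt{B3} untouched.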
\oksays{Need to discuss how creating / modifying figures is easier in spreadsheets - same deal as before, allowing users to spontaneously declare collections by selecting a region.}
Between the simplicity of creating singleton operations and the simplicity of creating visualizations, spreadsheets are a powerful tool for data curation and exploration. Indeed, spreadsheet users often ``do not appear inclined to use other software packages for their tasks, even if these packages might be more suitable''~\cite{Chan1996119}. Our goal in \sysname is to empower users with a similar level of flexibility for transforming, visualizing, and exploring relational data by making it easier to define singleton operations in a relational setting, and by using this capability to create a bi-directional mapping between a spreadsheet-style graphical interface and a notebook-style programmatic interface.
% This means preserving the illusion that all operations on the spreadsheet are singletons, while still synthesizing a human-readable workflow that compactly encodes these singletons. Our primary goal in this paper is to explore the challenges that must be overcome to implement this bi-directional mapping.
To enable singleton transformations within the framework of a classical relational database, we propose a new mechanism for data exploration called interactive views. An interactive view begins its life as a classical database view, presented to the user in tabular form. In contrast to a classical view, however, an interactive view can be edited much like a spreadsheet. Users can modify fields, add new rows and columns, use a spreadsheet-style equation editor to define derived values, and more. As the user edits the view, the user's actions are seamlessly transformed into a program of relational(-ish) data transformation operators that derive the new, edited view. This program provides two major benefits. First, the action trail serves as a form of history, allowing the user to revisit and revise earlier edits, even out of order. Second, the trail defines a workflow, albeit one highly specialized to a specific dataset. Even this is sufficient to provide classical benefits of workflow provenance such as auditability and explainability for derived data. Once an interactive view is developed for one dataset, however, it can more readily be adapted to new data or to react to changes in its inputs. Recasting the user's actions programmatically allows us to leverage existing work on algebraic equivalences and program rewriting to first obtain multiple interpretations of sequences of user actions, and then to extrapolate more general expressions of the user's intent.
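
% Illustrative sketch (not the \sysname implementation).
The idea of recording edits as a replayable program can be sketched as follows. This is a minimal illustration under our own assumptions rather than the operator set or API of \sysname: tables are lists of dictionaries with an \texttt{id} field, and the names \texttt{Op}, \texttt{set\_cell}, \texttt{derive\_column}, and \texttt{InteractiveView}, as well as the operator labels, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]


@dataclass
class Op:
    """One recorded user action: a readable label plus a table transformation."""
    label: str
    apply: Callable[[Table], Table]


def set_cell(row_id, column: str, value) -> Op:
    """A singleton edit: overwrite one field of the row with the given id."""
    def run(table: Table) -> Table:
        return [{**r, column: value} if r["id"] == row_id else dict(r) for r in table]
    return Op(f"UPDATE {column} = {value!r} WHERE id = {row_id}", run)


def derive_column(name: str, fn: Callable[[Row], object], description: str) -> Op:
    """A bulk edit: add a column computed from each row."""
    def run(table: Table) -> Table:
        return [{**r, name: fn(r)} for r in table]
    return Op(f"EXTEND {name} := {description}", run)


class InteractiveView:
    """Keeps the base data untouched and derives the view by replaying the trail."""

    def __init__(self, base: Table):
        self.base = base
        self.trail: List[Op] = []

    def edit(self, op: Op) -> None:
        self.trail.append(op)

    def revise(self, index: int, op: Op) -> None:
        self.trail[index] = op  # revisit an earlier edit, out of order

    def rows(self) -> Table:
        table = [dict(r) for r in self.base]
        for op in self.trail:
            table = op.apply(table)
        return table


base = [{"id": 1, "price": 10, "qty": 2}, {"id": 2, "price": 5, "qty": 4}]
view = InteractiveView(base)
view.edit(set_cell(2, "price", 6))
view.edit(derive_column("total", lambda r: r["price"] * r["qty"], "price * qty"))
before = [r["total"] for r in view.rows()]
view.revise(0, set_cell(2, "price", 7))  # change the first edit after the fact
after = [r["total"] for r in view.rows()]
```

Because the view is always recomputed from the base table and the trail, revising the first edit automatically propagates to the derived \texttt{total} column, and the list of labels doubles as a simple provenance record.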
\tinysection{Overview} In this paper, we outline the technical challenges of implementing interactive views and sketch our proposed solutions.
The core challenge that we consider is how to minimize unexpected side effects.
As users transform a spreadsheet, for example by sorting data or pasting formulas, existing spreadsheet software takes great pains to provide a consistent mental model of its behavior.
Unlike a relational database, where query semantics are explicit, visual interactions in a spreadsheet require many implicit behaviors.
In this paper, we address one specific example, a spreadsheet's coordinate system: user actions that move cells or reposition operations trigger implicit secondary effects (e.g., formulas for other cells may need to be updated).
We develop a data model that allows us to link relational data with a spreadsheet's coordinate system, allowing us to better gauge the secondary effects of user actions.
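
% Illustrative sketch of a secondary effect (not the \sysname data model).
To make the kind of secondary effect discussed above concrete, the following Python sketch shows how inserting a row forces references in other cells to be rewritten. It is a toy illustration only and does not represent the data model developed in this paper: cells are stored in a dictionary keyed by A1-style names, all references are relative, and the helper \texttt{insert\_row} is hypothetical.

```python
import re

REF = re.compile(r"([A-Z]+)(\d+)")


def insert_row(sheet: dict, at: int) -> dict:
    """Insert an empty row at 1-based position `at` and return the new sheet.

    Cells at or below `at` move down by one, and every formula that refers
    to a moved cell is rewritten so that it still points at the same data.
    """
    def bump(row: int) -> int:
        return row + 1 if row >= at else row

    def rewrite(value):
        if isinstance(value, str) and value.startswith("="):
            return REF.sub(lambda m: f"{m.group(1)}{bump(int(m.group(2)))}", value)
        return value

    moved = {}
    for cell, value in sheet.items():
        col, row = REF.fullmatch(cell).groups()
        moved[f"{col}{bump(int(row))}"] = rewrite(value)
    return moved


sheet = {"A1": 1, "A2": 2, "B2": "=A1+A2"}
shifted = insert_row(sheet, 2)  # the user inserts a row above row 2
```

Here a single user action touches a formula the user never selected: the formula that was in \texttt{B2} moves to \texttt{B3} and its reference to \texttt{A2} becomes \texttt{A3}.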
Our core contributions are as follows:
(1) We present \sysname, a hybrid relational notebook/spreadsheet environment for exploration-based data curation, and outline the capabilities of each interface mode.
(2) We outline the challenges of mapping actions back and forth between the two modes.
(3) We define a data model for \sysname that allows us to precisely characterize the side effects of users' actions on a spreadsheet.
(4) We apply this model through a case study on existing spreadsheet software, and show how user actions in spreadsheet software follow a specific heuristic that we believe captures the principle of least surprise.
(5) We outline our future research directions, including a readability-enhancing optimizer and a tool for generalizing singleton-based workflows.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: