%!TEX root = ../main.tex
In spite of the availability of powerful automated curation, cleaning, and analysis tools, spreadsheets and notebook UIs (e.g., Jupyter/IPython) are still the predominant tools used by most data scientists. Some of their ubiquity can be attributed to the fact that typical users prefer a known interface over an unknown one, even if the unknown interface may be better suited for the task at hand~\cite{Chan1996119}. However, we argue that while both interfaces lack the scalability and power of relational databases, they offer several compelling benefits for curation workloads.
Based on this observation, we propose a combined spreadsheet and notebook UI for data curation over relational data % and non-relational data
that combines the best of spreadsheets, notebooks, and relational DBMSes.
This UI will support functionality not commonly found in either spreadsheets or notebooks, including automated curation operators~\cite{Yang:2015:LOA:2824032.2824055}, deployment of curation workflows over large datasets~\cite{Kandel:2011:WIV:1978942.1979444}, declarative queries~\cite{AG14a,Olston:2008:PLN:1376616.1376726}, and support for exploratory curation tasks~\cite{SV08}.
We first review the spreadsheet and notebook UI paradigms and then evaluate how they may be combined into a single coherent interface: a system that we call \sysname.
This hybrid UI enables powerful relational queries, while still being flexible enough to permit easy data manipulation, summarization, and visualization.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Spreadsheets}
%
Spreadsheets are a ubiquitous data processing tool. Their simplicity, generality, and adaptability make them ideal for ``playing'' with data through predominantly visual programming metaphors.
%In particular, spreadsheets provide powerful, but entirely visual metaphors for programming both data transformations and visualizations.
Spreadsheets provide several important features that are useful during data curation:\\
%\begin{itemize}
%\item
\emph{Convenient modification of values and computations.} The user can update any cell's value or formula immediately from the user interface. This enables manual curation operations like resolving missing values and undoing earlier errors.\\
%\item
\emph{Manual operations with inline results.} By using formulas in cells, the user defines a computation and the result of this computation is shown inline with its input data.\\
%\item
\emph{Visual mapping over data collections.}
Most spreadsheet systems enable the user to take a formula (computation) and map it to a range of cells through \textit{position-relative} references in cell formulas.
For example, this can be done by copy/paste or by a fill operation. We refer to this mechanism as \textit{adapt\&apply}. This approach to bulk, set-at-a-time functionality is very useful in data curation: A fix to repair one piece of data (e.g., conversion between units) can be deployed over the whole dataset, while the appearance of independent formulas provides an affordance for declaring exceptions to the bulk rule.
%\end{itemize}
% One especially powerful feature of the spreadsheet user interface is that it is easy to define both bulk, set-at-a-time operations, as well as exceptional, singleton data operations. The former class, already a strength of relational databases, is crucial for analyzing a dataset of any significant size.
% Spreadsheets allow inline specification of views (micro-computations) over data and displaying the results of these computations inline with their inputs.
% \oksays{I don't like the word generalize here. It implies that an operation is being altered or extended in some way. I think that's a large part of what we need to add.}
Indeed, many curation applications require users to ``break the rules'' and apply one-off modifications or transformations to individual fields or records.
For example,
(1) hypothetical what-if scenarios require users to apply small ad-hoc updates to adjust the inputs under test;
(2) repairs for data errors may be easy to define for individual cases, but far harder to define in the general case;
(3) complex data transformations that need to be generalized would still be easier to define for individual test cases than on bulk data.
By making it easy for users to break the rules, even if only temporarily, spreadsheets empower users to explore data, evaluate options, and better understand the effects of their curation efforts.
Such rule violations, or \textit{singleton} operations, are not handled gracefully by existing relational DBMSes.
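
The adapt\&apply mechanism and its affordance for singleton exceptions can be illustrated with a small sketch. This is only an approximation under simplifying assumptions: the sheet is modeled as a dictionary from cell names to formula strings, only row-relative references of the form \texttt{A1} are adapted, and the helper names \texttt{shift\_rows} and \texttt{adapt\_and\_apply} are hypothetical rather than part of any spreadsheet system or of \sysname.

```python
import re

# Hypothetical helper names; real spreadsheet engines handle many more cases
# (absolute references such as $A$1, ranges, sheet names, ...).
CELL_REF = re.compile(r"\b([A-Z]+)([0-9]+)\b")


def shift_rows(formula: str, row_offset: int) -> str:
    """Rewrite every relative reference in `formula` by `row_offset` rows."""
    def bump(match: re.Match) -> str:
        column, row = match.group(1), int(match.group(2))
        return f"{column}{row + row_offset}"
    return CELL_REF.sub(bump, formula)


def adapt_and_apply(sheet: dict, source: str, first_row: int, last_row: int) -> None:
    """Copy the formula in cell `source` down its column, adapting references."""
    column, source_row = CELL_REF.fullmatch(source).groups()
    for row in range(first_row, last_row + 1):
        sheet[f"{column}{row}"] = shift_rows(sheet[source], row - int(source_row))


# Column A holds lengths in inches; B1 converts the first one to centimeters.
sheet = {"A1": "10", "A2": "20", "A3": "30", "B1": "=A1*2.54"}
adapt_and_apply(sheet, "B1", 2, 3)   # bulk rule: fill B1 down to B2:B3
sheet["B2"] = "=A2"                  # singleton exception: A2 was already in cm
```

After the fill, every cell in column B holds its own independent formula, which is why overriding \texttt{B2} requires no special mechanism. The same independence means that nothing in the sheet records that \texttt{B1} and \texttt{B3} originated from one bulk operation.
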
\noindent The spreadsheet UI also has several drawbacks:\\
% Furthermore, they can sometimes help users to repeatedly apply such computations to ``more'' data.
% For example, by using a spreadsheet's copy/fill paste feature, a single formula can be mapped over a many cells.
% \bgsays{(sort example). }
%\begin{itemize}
%\item
\emph{Non-adaptive computations over collections.}
While spreadsheets support an adapt\&apply approach to apply a computation defined for a specific set of cells to a larger collection of cells, the two dominant spreadsheets, Microsoft Excel and Google Sheets, do not support automatically extending such a cell collection if more data is added.\footnote{We note that Apple's Numbers does exhibit this behavior.}
This is in contrast to relational databases with their declarative query languages that provide data independence. \\
%Furthermore, the visual specification of an adapt\&apply operation is not suited well for very large datasets since the user has to manually select a range of cells.\\
%\item
\emph{Collection operation intent is not explicit.}
While adapt\&apply allows a computation to be mapped over a collection, there is no visual evidence that a set of cells stores formulas that were mapped in this way. That is, there is no generalization/abstraction mechanism in spreadsheets to represent a higher-level bulk operation such as a view query in a database.\\
%\bgsays{provenance should help to identify that a ``map'' was applied}
%\item
\emph{No order among operations and tracking of workflow branches.}
Spreadsheets use cell highlighting as a visual metaphor for dependency tracking.
These visualizations are specific to single cells and do not lend themselves to tracking large curation workflows.
Furthermore, these visualizations are limited to tracing one dependency at a time, which makes tracking transitive dependencies cumbersome.\\
%which makes tracking of transitive dependencies cumbersome. It is unfeasible to assume that the user has a complete mental model of the whole workflow and all its branches. \\
%\bgsays{Workflow provenance would help to the higher-level view, database provenance for the transitive dependencies}
%\item
\emph{Unintuitive results for adapt\&apply.} As we will discuss further in Section~\ref{sec:language}, adapt\&apply functionality as implemented in many spreadsheet systems can sometimes lead to unexpected results.
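
The first drawback above, non-adaptive computations over collections, can be contrasted with declarative queries in a minimal sketch. The sketch rests on loud assumptions: the fixed spreadsheet range is modeled by a Python \texttt{slice} standing in for \texttt{A1:A3}, the table name \texttt{readings} is invented for illustration, and an in-memory SQLite database stands in for a relational DBMS.

```python
import sqlite3

# A spreadsheet-style aggregate over a fixed range (think =SUM(A1:A3)):
# the range is part of the formula and does not grow with the data.
column_a = [10.0, 20.0, 30.0]
fixed_range = slice(0, 3)                     # hypothetical stand-in for A1:A3
total_before = sum(column_a[fixed_range])

column_a.append(40.0)                         # a new row arrives
total_after = sum(column_a[fixed_range])      # still covers only three cells

# The same aggregate as a declarative query over a named collection.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (value REAL)")
db.executemany("INSERT INTO readings VALUES (?)", [(10.0,), (20.0,), (30.0,)])
query = "SELECT SUM(value) FROM readings"
sql_before = db.execute(query).fetchone()[0]

db.execute("INSERT INTO readings VALUES (40.0)")
sql_after = db.execute(query).fetchone()[0]   # unchanged query sees the new row
```

In this model the range-based total ignores the appended value until the user edits the range by hand, whereas the query text never changes because it names the collection rather than a set of positions.
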
%\end{itemize}
%BAD: The order in which operations have been applied is not exposed in a spreadsheet. Thus, spreadsheets are not great for complex data curation and exploration operations, because it is hard to keep track of what operations where applied to the data and in which order (needs workflow + provenance).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Notebook-style UIs}
%
Systems like IPython expose an interactive, interpreted programming environment through a notebook-like interface where the user can mix documentation (text) with code. The output of code blocks is shown directly in the notebook, a feature that is widely used to produce data visualizations. Notebooks provide the following benefits for data curation:
% A notebook UI such as in Jupyther and the visualization provided by workflow systems are much better suited in this regard. However, their disadvantage is that they are not suited well for small modifications to data, exploratory changes, and generalization of operations.
\begin{itemize}
\item \textbf{Inline documentation of operations.} The notebook enables the user to write documentation and inline it together with the code.
\item \textbf{Incremental development of complex curation workflows.} Notebooks help an analyst incrementally build a curation workflow by adding and revising operations one at a time. The structure of the workflow is made explicit through the notebook interface as long as the workflow is linear.
\end{itemize}
However, some operations that are supported well in spreadsheets are harder to express in notebooks. Furthermore, some disadvantages are shared among both paradigms.
\begin{itemize}
\item \textbf{Small modifications to data are cumbersome.} In contrast to spreadsheets, modifying even a few data values of a table requires the user to write code.
\item \textbf{No support for mapping a curation operation developed for an example over a data collection.}
\item \textbf{Non-linear workflows and development with backtracking are not supported well.}
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Visualizations}
%
Both spreadsheets and notebook UIs make it very easy for users to create visualizations from data on the fly and to show these visualizations inline with the data. Both paradigms also allow these visualizations to be tweaked and to be refreshed when their inputs change. Spreadsheets in particular provide a very easy-to-use interface for selecting what data should be visualized.
% \begin{itemize}
% \item
% \end{itemize}
% \begin{itemize}
% \item
% \end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Combining Spreadsheets and Notebooks}
%
Based on these observations, we propose a hybrid notebook and spreadsheet interface that combines the best of both worlds. In this paper, we discuss how such an interface facilitates data curation and exploration.
%
\bgsays{NOT SURE WHETHER ALL OF THESE SHOULD BE EXPLORED IN THIS PAPER
Furthermore, we propose additional novel functionality for addressing some of the shortcomings of both spreadsheets and notebooks. Some of these are based on changes to the UI while others rely on changes to the underlying data processing platform: UI:
1) more flexible mechanism for generalize\&apply: user defines operation of data, system generalizes and applies it to more data
PLATFORM:
1) by keeping track of a user's operations and by tracking them as a declarative program we can support better exploration including automatic updates of derived data based on changes to operations, 2) provenance tracking to understand and explore the derivation of data, 3) recommendations to let
}
Spreadsheets give users an infinite 2-dimensional grid of cells that can hold either constant values or computed values derived from other cells through \textit{formulas}.
Collections, as they exist, are defined implicitly as any 1- or 2-dimensional region of cells that has meaning to the user.
Thus, instead of the classical programmatic specification of bulk, set-at-a-time operations as operations over named collection objects, spreadsheets use the metaphor of copying code to a (user-specified) range of cells, combined with relative, positional data dependencies, to quite literally ``map'' singleton operations over entire collections.
In addition to making it easy to perform bulk operations over the entire collection, this approach provides a clear affordance for declaring exceptions: Even after being copied, each cell's formula is (and is presented as) a singleton, logically independent of other cells' formulas.
\oksays{Need to discuss how creating / modifying figures is easier in spreadsheets - same deal as before, allowing users to spontaneously declare collections by selecting a region.}
Between the simplicity of creating singleton operations and the simplicity of creating visualizations, spreadsheets are a powerful tool for data curation and exploration. Indeed, spreadsheet users often ``do not appear inclined to use other software packages for their tasks, even if these packages might be more suitable''~\cite{Chan1996119}. Our goal in \sysname is to empower users with a similar level of flexibility for transforming, visualizing, and exploring relational data by making it easier to define singleton operations in a relational setting, and using this capability to create a bi-directional mapping between a spreadsheet-style graphical interface and a notebook-style programmatic interface.
% This means preserving the illusion that all operations on the spreadsheet are singletons, while still synthesizing a human-readable workflow that compactly encodes these singletons. Our primary goal in this paper is to explore the challenges that must be overcome to implement this bi-directional mapping.
To enable singleton transformations within the framework of a classical relational database, we propose a new mechanism for data exploration called \textit{interactive views}. An interactive view begins its life as a classical database view, presented to the user in tabular form. In contrast to a classical view, however, an interactive view can be edited much like a spreadsheet. Users can modify fields, add new rows and columns, use a spreadsheet-style equation editor to define derived values, and more. As the user edits the view, the user's actions are seamlessly transformed into a program of relational(-ish) data transformation operators that derive the new, edited view. This program provides two major benefits. First, the action trail serves as a form of history, allowing the user to revisit and revise earlier edits, even out of order. Second, the trail defines a workflow, albeit one highly specialized to a specific dataset. Even this is sufficient to provide classical benefits of workflow provenance such as auditability and explainability for derived data. Moreover, once an interactive view is developed for one dataset, it can more readily be adapted to new data or to react to changes in its inputs. Recasting the user's actions programmatically allows us to leverage existing work on algebraic equivalences and program rewriting to first obtain multiple interpretations of sequences of user actions, and then to extrapolate more general expressions of the user's intent.
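
The idea of recording edits as a replayable program can be sketched as follows. This is a toy model rather than the design of \sysname: tables are lists of dictionaries, each recorded action is a Python function from table to table instead of a relational operator, and all names (\texttt{Action}, \texttt{set\_cell}, \texttt{derive\_column}, \texttt{replay}) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, float]
Table = List[Row]


@dataclass
class Action:
    """One recorded user edit, stored as a table-to-table transformation."""
    label: str
    apply: Callable[[Table], Table]


def set_cell(row_index: int, column: str, value: float) -> Action:
    """Singleton edit: overwrite one field of one row."""
    def run(table: Table) -> Table:
        return [{**row, column: value} if i == row_index else dict(row)
                for i, row in enumerate(table)]
    return Action(f"set row {row_index} {column}={value}", run)


def derive_column(name: str, formula: Callable[[Row], float]) -> Action:
    """Bulk edit: add a column computed from each row."""
    def run(table: Table) -> Table:
        return [{**row, name: formula(row)} for row in table]
    return Action(f"derive {name}", run)


def replay(base: Table, trail: List[Action]) -> Table:
    """Recompute the edited view by applying the trail to a base table."""
    view = base
    for action in trail:
        view = action.apply(view)
    return view


base = [{"inches": 10.0}, {"inches": -1.0}]            # -1.0 marks a bad reading
trail = [set_cell(1, "inches", 12.0),                  # manual repair
         derive_column("cm", lambda r: r["inches"] * 2.54)]
view = replay(base, trail)

# Revising an earlier edit out of order: change the repair, keep the rest.
trail[0] = set_cell(1, "inches", 11.0)
revised = replay(base, trail)

# Re-running the same trail on a different input table.
other = replay([{"inches": 1.0}, {"inches": 2.0}, {"inches": 3.0}], trail)
```

Because the base table is never modified, revising the first action and replaying is enough to update the derived column. Replaying on a different table also shows why such a trail is specialized to its dataset: the positional repair still overwrites row 1, whether or not that row needs it.
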
\tinysection{Overview} In this paper, we outline the technical challenges of implementing interactive views and sketch our proposed solutions.
The core challenge that we consider is how to minimize unexpected side effects.
As users transform a spreadsheet, for example by sorting data or pasting formulas, existing spreadsheet software takes great pains to provide a consistent mental model of its behavior.
Unlike a relational database, where query semantics are explicit, visual interactions in a spreadsheet require many implicit behaviors.
In this paper, we address one specific example, a spreadsheet's coordinate system: user actions that move cells or reposition operations trigger implicit secondary effects (e.g., formulas for other cells may need to be updated).
We develop a data model that allows us to link relational data with a spreadsheet's coordinate system, allowing us to better gauge the secondary effects of user actions.
Our core contributions are as follows:
(1) We present \sysname, a hybrid relational notebook/spreadsheet exploration-based data curation environment, and outline the capabilities of each interface mode.
(2) We outline the challenges of mapping actions back and forth between the two modes.
(3) We define a data model for \sysname that allows us to precisely characterize the side effects of a user's actions on a spreadsheet.
(4) We apply this model through a case study on existing spreadsheet software, and show how user actions in spreadsheet software follow a specific heuristic that we believe captures the principle of least surprise.
(5) We outline our future research directions, including a readability-enhancing optimizer, and a tool for generalizing singleton-based workflows.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: