%!TEX root = ../main.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\centering
\includegraphics[width=\columnwidth]{graphics/vizir-ui-menu}
\caption{An example of \sysname's UI}
\label{fig:hybridinterface}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\sysname's interface (illustrated in Figure~\ref{fig:hybridinterface}) combines elements of both notebooks and spreadsheets. Notebook interfaces like Jupyter use the analogy of pages in a notebook, where each page consists of a block of code together with the block's output, such as a table, visualization, or documentation text. Blocks are part of a continuous program, allowing a user to quickly probe intermediate state by creating new visualizations or views of the data, or to safely insert hypothetical, exploratory modifications by adding or disabling pages.
Spreadsheets give users an infinite 2-dimensional grid of cells that can hold either constant values or computed values derived from other cells through \textit{formulas}.
Collections, as they exist, are defined implicitly as any 1- or 2-dimensional region of cells that has meaning to the user.
Thus, instead of classical programmatic specification of bulk, set-at-a-time operations as operations over named collection objects, spreadsheets use the metaphor of copying code to a (user-specified) range of cells, combined with relative, positional data dependencies to quite literally ``map'' singleton operations over entire collections.
In addition to making it easy to perform bulk operations over the entire collection, this approach provides a clear affordance for declaring exceptions: Even after being copied, each cell's formula is (and is presented as) a singleton, logically independent of other cells' formulas.
\oksays{Need to discuss how creating / modifying figures is easier in spreadsheets - same deal as before, allowing users to spontaneously declare collections by selecting a region.}
Between the simplicity of creating singleton operations and the simplicity of creating visualizations, spreadsheets are a powerful tool for data curation and exploration. Indeed, spreadsheet users often ``do not appear inclined to use other software packages for their tasks, even if these packages might be more suitable''~\cite{Chan1996119}. Our goal in \sysname is to empower users with a similar level of flexibility for transforming, visualizing, and exploring relational data. This means preserving the illusion that all operations on the spreadsheet are singletons, while still synthesizing a human-readable workflow that compactly encodes these singletons. Our primary goal in this paper is to explore the challenges that must be overcome to implement this bi-directional mapping.
\subsection{The Notebook UI}
Each page in a \sysname notebook can be thought of as a block of SQL-like code that generates a table or visualization. Pages are evaluated in sequential order. Code defining later pages may reference preceding pages as if they were (materialized) views, and edits to a page may result in cascading changes to the pages following it. Thus, unlike classical databases where views are static entities, a page in \sysname acts as an \textit{interactive view} that enables changes in its definition, both by making it easier for users to define such changes, and by optimizing the system to enact such changes.
To make it easy to declare edits to views, interactive views in \sysname are backed by an imperative-flavored language: The \sysname user action language (\langname).
Although designed to give an imperative flavor, operators in \langname form a monad that can be compiled down to a slightly generalized form of relational algebra.
Our goal in the design of \langname is to define a form of relational algebra that naturally admits singleton operations and positional semantics, enabling it to mirror actions on the front-end.
We will first sketch the language itself through several examples so that we can better introduce the front-end, before returning to define it more precisely in Section~\ref{sec:language}.
\begin{figure}
\begin{verbatim}
LOAD 'lineitem.csv';
ADD COLUMN total;
UPDATE total = price * (1 - discount);
UPDATE total = 1020 WHERE ID = 90;
INSERT ROW (
  name = 'table', price = 10, discount = 0.05,
  total = price * (1 - discount)
);
\end{verbatim}
\caption{An example \langname page script}
\label{fig:program}
\end{figure}
Figure~\ref{fig:program} illustrates an example \langname script that loads a CSV file, extends it with a new column named \texttt{total}, defines a value for the column (derived from the remaining attributes), and applies two minor edits to the result (a single value update and a row pasted into the result). \langname operates over ordered relations, and its language primitives are based on SQL's DDL and DML. As a result, while operations in \langname appear imperative, they actually define a sequence of declarative transformations on the data imported by the \texttt{LOAD} operation on the first line. The entire script can be rewritten into a SQL query:
\begin{lstlisting}[morekeywords={LOAD}]
SELECT *, total = CASE WHEN ID = 90 THEN 1020
                  ELSE price * (1 - discount) END
FROM LOAD('lineitem.csv')
UNION ALL
SELECT name = 'table', price = 10,
       discount = 0.05, total = 9.5
\end{lstlisting}
Imperative-flavored declarative language syntax has been repeatedly found to be more user-friendly than classic declarative syntax~\cite{Olston:2008:PLN:1376616.1376726,Sowell:2009aa}. Here, however, it also serves to highlight the compositional nature of interactive views: every user action that changes the view's schema or contents is reflected in the script by a new statement appended to its end. Thus, we aim, in principle at least, for a one-to-one mapping between user actions and operations in \langname.
\subsection{The Spreadsheet UI}
\sysname's users can edit tables and visualizations directly and have those edits reflected in the corresponding table's page, and propagated to subsequent pages. As a result, the user's edits, whether applied via the spreadsheet or notebook UI, are recorded as a form of workflow provenance. Note again that our goal is not to reproduce the full interface of a spreadsheet, but rather to replicate as many of the
flexible data and schema manipulation features of spreadsheets as possible within a more structured framework. Concretely, \sysname's UI allows users to:
\begin{itemize}
\item \textbf{Overwrite arbitrary values with constants or formulas}: As in a spreadsheet, the user may click on any cell in the output to overwrite its contents with a constant value or a new formula defined interactively by clicking on cells and typing code.
\item \textbf{Cast cells to a new type}: Dropdown menus in an inspector allow the user to apply general transformations like typecasting. The transformation is applied in bulk to entire regions of selected cells.
\item \textbf{Copy/Paste cells}: As in a spreadsheet, users can copy and paste regions of cells. The formula of the copied cell(s) is replicated in the target region. Dependencies in the formulas are re-mapped to preserve the formula's positional semantics; we discuss these semantics in more detail in Section~\ref{sec:language}. As in a spreadsheet, if the target region is larger than the source region in either or both dimensions, cells in the source region are tiled to scale over the entire target.
\item \textbf{Add/Delete/Reorder columns or rows}: Users may drag columns or rows to reposition them. Tabs at the bottom and right edges of the displayed table allow users to lengthen or widen the table, adding new rows or columns respectively. Finally, several interface elements allow users to insert rows (resp., columns) before or after any existing row (column).
\item \textbf{Sort data}: A dropdown menu allows users to sort data according to values in one or more columns.
\item \textbf{Filter data}: Another dropdown menu allows users to filter out rows according to a formula defined over the row.
\end{itemize}
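The tiling behavior of copy/paste described above can be illustrated with a short Python sketch. This is not \sysname's implementation: the function name \texttt{tile} and the representation of a region as a list of rows are assumptions made for the example, the re-mapping of formula dependencies is omitted, and a partial tile at the edge of the target is simply truncated, which is only one possible choice.
\begin{lstlisting}[language=Python]
def tile(source, target_rows, target_cols):
    """Tile a rectangular source region over a target region.

    `source` is a non-empty list of equal-length rows (a
    hypothetical representation). Target cell (i, j) receives
    the source cell at (i mod height, j mod width).
    """
    height, width = len(source), len(source[0])
    return [
        [source[i % height][j % width]
         for j in range(target_cols)]
        for i in range(target_rows)
    ]


# A 1x2 source pasted into a 2x4 target is repeated in both
# dimensions.
pasted = tile([["a", "b"]], target_rows=2, target_cols=4)
\end{lstlisting}
In this sketch, a cell's contents stand in for its formula; a fuller version would also shift each formula's relative references by the offset between source and target cells.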
Many of these operations (e.g., paste, typecast) require the user to define a target, most commonly in the form of a rectangular area of cells selected by clicking and dragging with the cursor. We refer to these selections, defined as the projection/selection of a set of columns and rows, as \textit{regions}, and discuss them in greater depth below.
\subsection{Spreadsheet to Notebook and Back}
Our goal is to create a seamless interface between the spreadsheet and notebook interfaces. To accomplish this, we need to be able to easily map operational semantics and effects back and forth between the two interaction models. We now outline several of these challenges and sketch our proposed solutions.
\tinysection{Identifying Singletons}
To allow singleton operations, \langname must be able to uniquely identify specific rows and columns of the dataset, including rows and columns introduced in the code itself. More importantly, these identifying markers must persist through the program: A user edit applied to row 10 of \texttt{'lineitem.csv'} must continue to apply to that same row, even if an operation inserts a new row between rows 8 and 9 and thereby shifts its position. A \texttt{LOAD} operation generates a unique tag for each row and column, and every \texttt{INSERT} operation is silently associated with a unique identifier for the row(s) it generates.
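The following Python sketch illustrates why persistent identifiers matter. It is a minimal model under stated assumptions, not \langname's actual mechanism: the class \texttt{Table} and its methods are hypothetical names, identifiers are plain integers drawn from a counter, and only row identifiers are modeled.
\begin{lstlisting}[language=Python]
import itertools


class Table:
    """Ordered rows, each tagged with an id that never changes."""

    def __init__(self, rows):
        self._ids = itertools.count()
        # Each entry is (row id, row data); list order is the
        # display order.
        self.rows = [(next(self._ids), dict(r)) for r in rows]

    def insert(self, position, row):
        """Insert a row at a display position with a fresh id."""
        rid = next(self._ids)
        self.rows.insert(position, (rid, dict(row)))
        return rid

    def id_at(self, position):
        """Id of the row currently displayed at `position`."""
        return self.rows[position][0]

    def position_of(self, rid):
        """Current display position of the row with id `rid`."""
        return [row_id for row_id, _ in self.rows].index(rid)

    def update(self, rid, column, value):
        """Apply a singleton edit to the row with id `rid`."""
        for row_id, data in self.rows:
            if row_id == rid:
                data[column] = value
                return
        raise KeyError(rid)


table = Table({"price": p} for p in range(12))
target = table.id_at(9)         # the 10th row (0-based 9)
table.insert(8, {"price": -1})  # between the 8th and 9th rows
table.update(target, "price", 1020)
# The edit follows the row, now displayed at 0-based position 10.
\end{lstlisting}
Had the edit been keyed on the display position instead, it would have landed on the row that moved into position 9 after the insertion.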
\tinysection{Positional vs Qualitative Semantics}
Unique row and column identifiers solve one critical mismatch between spreadsheets and \langname's relational semantics: They allow cell positions to be defined as properties of the cell, in turn allowing relational operators to qualitatively identify the cell. However, a second issue arises with respect to positional semantics: Spreadsheets allow formulas to reference other cells by relative position. For example, a cell might compute a row-by-row running total by adding the current row's value to the prior row's running total. Put another way, \texttt{UPDATE} operations in \langname must permit a form of implicit windowing, where cell formulas can reference earlier or later rows.
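The running-total example can be sketched in Python as follows. This is an illustration of the idea of implicit windowing only: the function \texttt{evaluate}, the two-argument formula signature, and the column names \texttt{amount} and \texttt{running} are all assumptions made for the example rather than part of \langname.
\begin{lstlisting}[language=Python]
def evaluate(rows, formula):
    """Evaluate `formula` once per row, in order.

    `formula(current, previous)` receives the current input row
    and the previously computed output row (None for the first
    row), mimicking a cell formula that refers to "the row above".
    """
    output = []
    previous = None
    for row in rows:
        result = dict(row)
        result["running"] = formula(row, previous)
        output.append(result)
        previous = result
    return output


rows = [{"amount": 5}, {"amount": 7}, {"amount": 3}]
totals = evaluate(
    rows,
    lambda cur, prev: cur["amount"]
    + (prev["running"] if prev else 0),
)
running = [r["running"] for r in totals]  # [5, 12, 15]
\end{lstlisting}
The formula itself never names a specific row; it only refers to the row one position earlier, which is what makes the same formula reusable across the whole column.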
\tinysection{Readability}
An interactive spreadsheet interface encourages many small, iterative transformations. By comparison, code tends to encourage abstraction and terse expressions that precisely convey the user's intent. As a result, directly translating visually generated operations into code is likely to produce a large, hard-to-follow script. Our goal is to create a source-to-source compiler that optimizes for readability rather than efficiency. For example, this compiler might rewrite a sequence of \texttt{UPDATE} operations that modify adjacent cells into a single \texttt{UPDATE} operation that modifies a range of cells (i.e., the reverse of loop-unrolling, a common compiler optimization). We note that this compiler will need to balance the tension between conciseness and comprehension: Code that is too dense can be just as unreadable as code that is too verbose.
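One such rewrite, merging single-cell updates on consecutive rows into a range update, might look like the following Python sketch. It is deliberately simplistic and is not the compiler we propose: the function \texttt{coalesce} and the tuple encoding of updates are hypothetical, rows are identified by integer position, and only updates that are adjacent in the script are considered.
\begin{lstlisting}[language=Python]
def coalesce(updates):
    """Merge single-cell updates on consecutive rows.

    `updates` is a list of (row, column, formula) triples in
    script order. The result is a list of
    (first_row, last_row, column, formula) tuples. Two updates
    are merged only if they are adjacent in the script, share
    the column and formula, and touch consecutive rows.
    """
    merged = []
    for row, column, formula in updates:
        if merged:
            first, last, prev_col, prev_formula = merged[-1]
            same = (prev_col, prev_formula) == (column, formula)
            if same and row == last + 1:
                merged[-1] = (first, row, column, formula)
                continue
        merged.append((row, row, column, formula))
    return merged


script = [
    (1, "total", "price * (1 - discount)"),
    (2, "total", "price * (1 - discount)"),
    (3, "total", "price * (1 - discount)"),
    (7, "total", "1020"),
]
compact = coalesce(script)
# Three per-row updates collapse into one update over rows 1-3;
# the constant overwrite of row 7 stays a singleton.
\end{lstlisting}
Restricting the merge to statements that are adjacent in the script keeps the rewrite trivially order-preserving, at the cost of missing merges that a dependency analysis could prove safe.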
\tinysection{Formula Extraction}
An important challenge arises in the reverse direction as well. When a user clicks on a formula to edit it, we need to be able to reconstruct the formula that derived the cell's value. However, obtaining the precise formula may not be as simple as tracing the provenance of the cell's value: operations (e.g., reordering rows) may alter dependencies. We address this specific issue in more depth next as part of our data model.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: