
%!TEX root = ../main.tex

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
  \centering
  \includegraphics[width=0.8\columnwidth]{graphics/vizir-ui-new-two-columns}
  \caption{An example of \sysname's UI}
    \label{fig:hybridinterface}
    \vspace*{-3mm}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Figure~\ref{fig:hybridinterface} illustrates the interface for \sysname, our proposed tool for data curation and exploration.  This interface combines elements of both notebooks and spreadsheets.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The Notebook UI}

% JF: this in not clear to me -- what do you mean by probing intermediate states?
Notebook interfaces such as Jupyter's use the analogy of pages in a notebook, where each page consists of a block of code and the block's output, e.g., a table, visualization, or documentation.  Blocks are part of one continuous program, so a user can quickly inspect intermediate states by creating new visualizations or views of the data, or safely try hypothetical, exploratory modifications by adding or disabling pages.

Each page in a \sysname notebook can be thought of as a block of SQL DML/DDL code that imperatively manipulates a single relation, which is displayed as a table or visualization.  Pages are evaluated in sequential order.  Code defining later pages may reference preceding pages as if they were (materialized) views, and edits to a page may cascade to the pages that follow it.
We refer to this SQL-based language as the \sysname user action language (\langname).
In spite of its imperative flavor, operators in \langname form a monad that can be compiled down to a generalized form of relational algebra~\cite{AG16,AG14a}.
%This same imperative flavor also carries several benefits: (1) Singleton operations are easy to express as explicif updates, (2) Positional semantics are clearly defined by context, (3) It is easier to translate a sequence of user interactions with a spreadsheet into a sequence of imperative operations.
Due to space constraints, we will only sketch the language through examples in this paper; a full description is left to future work.
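
As a rough illustration of the view analogy, and not of \sysname's actual implementation, two consecutive pages could be modeled in plain SQL as chained views.  The table \texttt{lineitem} and the view names \texttt{page1} and \texttt{page2} are hypothetical.
\begin{lstlisting}
-- Hypothetical base table for the loaded data
CREATE TABLE lineitem (
  id       INTEGER,
  name     VARCHAR(64),
  price    DECIMAL(10,2),
  discount DECIMAL(4,2)
);

-- Page 1: extend the data with a derived column
CREATE VIEW page1 AS
  SELECT id, name, price, discount,
         price * (1 - discount) AS total
  FROM lineitem;

-- Page 2: built on page 1 as if it were a view
CREATE VIEW page2 AS
  SELECT name, total
  FROM page1
  WHERE total > 100;
\end{lstlisting}
Because \texttt{page2} is defined over \texttt{page1}, a change to the data beneath \texttt{page1} is visible in \texttt{page2} the next time it is evaluated, which mirrors the cascading behavior described above.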

\begin{figure}
\begin{lstlisting}[morekeywords={LOAD,ROW}]
LOAD 'lineitem.csv';
ADD COLUMN total;
UPDATE total = price * (1 - discount);
UPDATE total = 1020 WHERE ID = 90;
INSERT ROW ( name = 'table', price = 10,
             discount = 0.05, total = 9.5 );
\end{lstlisting}
\caption{An example \langname script}
\label{fig:program}
\vspace*{-6mm}
\end{figure}

Figure~\ref{fig:program} shows an example \langname script that loads a CSV file, extends it with a new column named \texttt{total}, defines a value for the column (derived from the remaining attributes), and applies two minor \emph{singleton} edits to the result (a single value update and a row pasted into the result).
\langname operates over ordered relations, and its language primitives are based on SQL's DDL and DML.  As a result, while operations in \langname appear imperative, they actually define a sequence of declarative transformations on the data imported by the \texttt{LOAD} operation in the first line.  The entire script can be rewritten into a SQL query:
\begin{lstlisting}[morekeywords={LOAD}]
SELECT *, total = CASE WHEN ID = 90 THEN 1020
                  ELSE price*(1-discount) END
FROM LOAD('lineitem.csv')
UNION ALL
SELECT name = 'table', price = 10,
       discount = 0.05, total = 9.5
\end{lstlisting}
Imperative-flavored declarative language syntax has been repeatedly found to be more user-friendly than classic declarative syntax~\cite{Olston:2008:PLN:1376616.1376726,Sowell:2009aa}.  Here, however, it also serves to highlight the compositional nature of interactive views: each user action that changes the view's schema or contents is reflected in the script by a new statement appended to its end.  Thus, we aim, in principle at least, for a bi-directional mapping between user actions and operations in \langname.

In addition to enabling singletons and being easy to integrate with spreadsheets, the imperative flavor of \langname also enables a form of backtracking and branching.  As illustrated in Figure~\ref{fig:hybridinterface}, users can quickly try out hypothetical changes by checkpointing program state and applying a variant sequence of edits.  \sysname will support a comprehensive suite of branching and merging capabilities for both data~\cite{NA16} and workflows~\cite{SV08}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The Spreadsheet UI}

% JF: this is a bit confusing -- are we assuming that the pages are linear and each page depends on the previous page? what about branches?
As a user edits tables and visualizations directly, these edits are reflected in the page where the table resides and they are also propagated to subsequent pages. The user's edits, whether applied via the spreadsheet or notebook UI, are recorded as a form of workflow provenance~\cite{SV08,CF12a,AD11c,DC07}.  Note that our goal is not to reproduce the full interface of a spreadsheet, but rather to replicate many of the
flexible data and schema manipulation features within a more structured framework.  Concretely, \sysname's UI allows users to:\\
%\begin{compactitem}
\inlineitem{Overwrite arbitrary cells with constants, formulas, or regular expressions} Users may click on any cell in the output to overwrite its contents (as in a spreadsheet).  \\
% JF rm to save space -- it sounded redundant: with a constant value or a new formula
\inlineitem{Cast cells to a new type} Dropdown menus allow the user to apply general transformations like typecasting.  The transformation is applied in bulk to every cell in a selected region.  \\
% JF: here, a concrete example would help
\inlineitem{Copy/Paste cells} Users can copy and paste regions of cells.  The formula of the copied cell(s) is replicated in the target region, preserving the formula's positional semantics.  If the target region is larger than the source, cells in the source region are tiled to cover the entire target.\\
\inlineitem{Add/Delete/Reorder columns or rows} Users may drag columns or rows to reposition them.  A tab at the bottom and right edges of the displayed table allows users to widen or lengthen the table, adding new columns or rows respectively.  Other interface elements allow users to insert rows (resp., columns) before or after any existing row (column).\\
\inlineitem{Sort data} A dropdown menu allows users to sort data according to values in one or more columns.\\
\inlineitem{Filter data} A dropdown menu allows users to filter out rows according to a formula defined over the row.\\
%\end{compactitem}
Many of these operations (e.g., paste, typecast) require the user to define a target, normally specified as a rectangular area selected by clicking and dragging with the cursor; we also propose to support declarative regions, as discussed below.

\subsection{Spreadsheet to Notebook and Back}
To create a seamless interface between the spreadsheet and notebook interfaces, we need to map operational semantics and effects between the two interaction models.  We now sketch solutions to several of the resulting challenges.

\tinysection{Identifying Singletons}
To allow singleton operations, \langname must be able to uniquely identify specific rows and columns of the dataset, including rows and columns introduced in the code itself.  More importantly, these identifying markers must persist through the program: a user edit applied to row 10 of \texttt{'lineitem.csv'} must continue to apply to that same row, even if an insert operation between rows 8 and 9 shifts it to position 11.  We address this challenge through provenance: each operation that creates rows generates a unique tag for each row, column, and cell, which persists for the lifetime of that row, column, or cell.
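
The difference between a persistent tag and a display position can be sketched in plain SQL.  This is a simplified illustration under assumed names, not \sysname's provenance mechanism: the columns \texttt{row\_id} (the tag) and \texttt{pos} (the current position), as well as all values, are hypothetical.
\begin{lstlisting}
-- row_id: persistent tag; pos: display position
CREATE TABLE lineitem (
  row_id INTEGER PRIMARY KEY,
  pos    INTEGER,
  name   VARCHAR(64),
  price  DECIMAL(10,2)
);
INSERT INTO lineitem VALUES (108, 8, 'chair', 40);
INSERT INTO lineitem VALUES (109, 9, 'desk', 120);
INSERT INTO lineitem VALUES (110, 10, 'lamp', 15);

-- Insert a new row between positions 8 and 9
UPDATE lineitem SET pos = pos + 1 WHERE pos >= 9;
INSERT INTO lineitem VALUES (201, 9, 'shelf', 60);

-- An edit recorded against tag 110 still reaches
-- the same row, now displayed at position 11
UPDATE lineitem SET price = 1020
WHERE row_id = 110;
\end{lstlisting}
Had the edit been recorded as \texttt{WHERE pos = 10}, it would have been redirected to the row tagged 109 after the insert.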

\tinysection{Positional vs Qualitative Semantics}
Spreadsheets allow formulas to reference other cells by relative position.  For example, a cell's formula might compute a cumulative average over all rows up to that point.  To capture these semantics, bulk update operations must permit a form of implicit windowing, semantics that can be unintuitive if handled incorrectly.  We address positional semantics as part of \langname's data model.
%Unique row and column identifiers solve one critical mismatch between spreadsheets and \langname's relational semantics: They allow cell positions to be defined as properties of the cell, in turn allowing relational operators to qualitatively identify the cell.  However, a second issue arises with respect to positional semantics: Spreadsheets allow formulas to reference other cells by relative position.  For example, a cell might compute a row-by-row running total by adding the current row's value to the prior row's running total.  Put another way, \texttt{UPDATE} operations in \langname must permit a form of implicit windowing, where cell formulas can reference earlier or later rows.
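
For intuition, the cumulative-average example corresponds to an explicit window in standard SQL.  The sketch below assumes a hypothetical table \texttt{lineitem} with a position column \texttt{pos}; it is written in plain SQL rather than \langname, where the window would be implicit.
\begin{lstlisting}
-- Average of price over all rows up to and
-- including the current one, in position order
SELECT pos, price,
       AVG(price) OVER (
         ORDER BY pos
         ROWS BETWEEN UNBOUNDED PRECEDING
                  AND CURRENT ROW
       ) AS running_avg
FROM lineitem
ORDER BY pos;
\end{lstlisting}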

\tinysection{Readability}
Interactive spreadsheet interfaces encourage many small transformations.  In contrast, code promotes abstraction and terse expressions that precisely convey the user's intent.  As a result, directly translating visually generated operations into code is likely to produce a long, repetitive script that is hard to follow.  We address this by proposing a source-to-source compiler that optimizes for readability.
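
One rewrite that such a compiler might perform is to merge a run of single-cell edits into one statement over a range.  The sketch below uses plain SQL with a hypothetical table \texttt{lineitem} and integer key \texttt{id}; it illustrates the idea only and is not a description of the compiler's actual rules.
\begin{lstlisting}
-- Before: one statement per edited cell
UPDATE lineitem SET total = 0 WHERE id = 1;
UPDATE lineitem SET total = 0 WHERE id = 2;
UPDATE lineitem SET total = 0 WHERE id = 3;

-- After: one equivalent statement over the range
UPDATE lineitem SET total = 0
WHERE id BETWEEN 1 AND 3;
\end{lstlisting}
The two forms are equivalent here because \texttt{id} is assumed to be an integer, so the range contains exactly the three edited rows.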

%  Our goal is to create a source-to-source compiler that optimizes for readability rather than efficiency.  For example, this compiler might rewrite a sequence of \texttt{UPDATE} operations that modify adjacent cells into a single \texttt{UPDATE} operation that modifies a range of cells (i.e., the reverse of loop-unrolling, a common compiler optimization).  We note that this compiler will need balance the tension between conciseness and comprehension: Code that is too dense can be just as unreadable as code that is too verbose.

\tinysection{Formula Extraction}
An important challenge arises in the reverse direction as well.  When a user clicks on a formula to edit it, we need to reconstruct the formula that derived the cell's value.  However, obtaining the precise formula may not be as simple as tracing the provenance of the cell's value, since operations (e.g., reordering rows) may alter dependencies.  We address this specific issue as part of our data model.


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: