paper-HILDA-2016-Spreadsheets/sections/language.tex

%!TEX root = ../main.tex

The fundamental unit of data in \langname is a \textit{cell}, a 3-tuple: $C_i = \tuple{id_i, f_i, v_i}$, consisting of a globally unique identifier $id_i$, a formula expression $f_i$, and a value $v_i$.
This is similar to a provenance-aware data management system, where each record is associated with metadata describing how it was computed.
Here, this metadata serves two purposes.
First is a practical concern: As noted above, we need to be able to reliably materialize the formula backing each cell so that it may be edited.  We need to ensure that each operator defines precise semantics for how it affects formulas.
Second, and perhaps more importantly, we track both values and the formula used to derive them as a way to define operational semantics that minimize user surprise.  As we will discuss shortly, one specific update to a spreadsheet may have many secondary, incidental effects on formulas and/or values in a spreadsheet.  By tracking both, we can better understand these effects and ensure that the complexities and unexpected side-effects of an operation are minimized.

Cells are arranged into a 2-dimensional grid of rows and columns indexed by a coordinate system, a function $s : \mathbb N \times \mathbb N \rightarrow id$ that maps positions in the grid to the cell occupying that position.  The function $s$ need not be complete, but must be one-to-one; A cell may only appear in one position in the spreadsheet.

A formula is a primitive-valued expression that may include references to the values of other cells, identified by the cell's global id, by absolute coordinates (explicit and absolute references, respectively).  A formula evaluated in the context of a cell may also specify coordinate references as being relative to the cell (relative references).  Columns are usually denoted by letters, Rows by numbers,
A \textit{state} is the 2-tuple $\tuple{ C, s }$, consisting of a set of cells $C = \{C_i\}$ and a coordinate system.
We say that a formula $f$ evaluates to a value $v$ in the context of a given state ($f \mapsto_{\tuple{C,s}} v$) if, after replacing all references (coordinate references using $s$ and $C$, and explicit references using $C$), the formula reduces to $v$~\footnote{Similar operational semantics were previously proposed by Krishnamurthi and Ramakrishnan~\cite{Erwig2002}}.
%
We say that a state $\tuple{C, s}$ is \textit{valid} if each cell's formula evaluates to the cell's value:
$$\forall \tuple{id_i, f_i, v_i} \in C\;:\; f_i \mapsto_{\tuple{C,s}} v_i$$

User \textit{actions} in \langname, transform a state $\tuple{C_1, s_1}$ into a new state $\tuple{C_2, s_2}$.   %
We call the semantics for an action correct if they ensure that if the input to an action is valid, then the output is also valid.
%We also focus on two classes of action: (1) \textit{data actions} that change only the spreadsheet's cells (i.e., for which $s_1 = s_2$), and (2) \textit{structural actions} that alter the spreadsheet's coordinate system and only modify the spreadsheet's cells to the extent necessary to preserve validity under the new coordinate system.

\begin{figure*}
\centering
\begin{subfigure}{0.3\textwidth}
  \centering
  \begin{tabular}{>{\tiny}rc|c|c}
   & \tiny A & \tiny B & \tiny C \\
    1& Alice & 10 & \texttt{=B1} (10)\\ \hline
    2& Bob & 4 & \texttt{=B2+C1} (14)\\ \hline
    3& Carol & 8 & \texttt{=B3+C2} (22)\\ \hline
    4& Dave & 9 & \texttt{=B4+C3} (31)
  \end{tabular}
  \caption{Initial State}
  \label{fig:rearrange:initial}
\end{subfigure}
%
\begin{subfigure}{0.3\textwidth}
  \centering
  \begin{tabular}{>{\tiny}rc|c|c}
   & \tiny A & \tiny B & \tiny C \\
    1& Alice & 10 & \texttt{=B1} (10)\\ \hline
    2& Carol & 8 & \texttt{=B2+C3} (22)\\ \hline
    3& Bob & 4 & \texttt{=B3+C1} (14)\\ \hline
    4& Dave & 9 & \texttt{=B4+C3} (31)
  \end{tabular}
  \caption{After swapping rows 2 and 3}
  \label{fig:rearrange:manual}
\end{subfigure}
%
\begin{subfigure}{0.3\textwidth}
  \centering
  \begin{tabular}{>{\tiny}rc|c|c}
   & \tiny A & \tiny B & \tiny C \\
    1& Alice & 10 & \texttt{=B1} (10)\\ \hline
    2& Dave & 9 & \texttt{=B2+C1} (19)\\ \hline
    3& Carol & 8 & \texttt{=B3+C2} (27)\\ \hline
    4& Bob & 4 & \texttt{=B4+C3} (31)
  \end{tabular}
  \caption{After sorting on column 'B'}
  \label{fig:rearrange:sort}
\end{subfigure}
\caption{Examples of both swapping rows and sorting rows in commercial database systems.}
\end{figure*}

\subsection{Unsurprising Inconsistencies}
As noted earlier, user actions on a spreadsheet have not only direct, intended effects, but may also have indirect, \textit{incidental} effects.  Examples include changing a formula (dependent formulas are recomputed), repositioning a row (formulas depending on the row are modified), or sorting (formulas are recomputed based on the new, sorted coordinate system).
In modern commercial spreadsheet systems, the semantics of indirect effects at first appear to be inconsistent.  Take, for example, two mechanisms for rearranging rows.  For example, consider the table given in Figure~\ref{fig:rearrange:initial}, which shows a list of players (column A), scores (column B), and a cumulative total score (column C).
%
A user might manually drag row 3 to a position between rows 1 and 2, effecting a swap of rows 2 and 3.
Microsoft Excel, Apple's Numbers, and Google's Sheets~\footnote{These and other behaviors described were evaluated on Excel for Mac version 15.20, Numbers version 3.6.1, and Google Sheets as of April 2016} all have identical behavior, each resulting in the table shown in Figure~\ref{fig:rearrange:manual}.
Note that the formulas for C2 and C3 have changed to ensure that each cell retains its original value under the transposed coordinate system.  In other words, the user's \texttt{MOVE} action treats formula references as being explicit references.
%
Conversely, a user might sort the rows of the table in descending order on Column B.  The resulting table in all three systems is identical, and shown in Figure~\ref{fig:rearrange:sort}.  Here, the formulas in column C are changed only in appearance; each continues to reference the cells immediately to the left and above.  However the values of each cell have changed as a result.  In other words, the user's \texttt{SORT} action treats formula references as being relative references.

At a high level, both actions are structural, as they only transform the coordinate system by rearranging the coordinate mapping; The effects on cells (the $C$ part of a state) are only incidental consequences of the new coordinate scheme.  For each dependent cell that changes coordinates, the action must also something else to be correct.
The distinction between the two example actions is quite significant, because it underlies a fundamental tradeoff in minimizing the ``surprising'' incidental effects~\cite{saltzer2009principles} of a change in coordinates: For \texttt{MOVE}, cell formulas are \textit{translated} into the new coordinate system to ensure that each cell's values stay the same, while for \texttt{SORT}, cell formulas are \textit{re-evaluated} in the new coordinate system, changing the values to ensure that the formulas stay the same.
We leave the details of this tradeoff to future work, but observe that virtually all structural actions we explored (drag cell, rearrange rows, filter, etc\ldots) favor minimizing changes in values.

\subsection{Regions to Relations}
Many operations in \sysname operate over sets or collections of cells.
For example, aggregates in formulas, the `paste' operation, and type conversions all target or reference entire regions of cells.
In a typical spreadsheet, such regions are specified as rectangular regions of cells in the current coordinate system (e.g., \texttt{[A3\;:\;B99]} or \texttt{[A\;:\;A]}).
Conversely, in a relational setting, sets of target values are specified qualitatively through selection predicates.

The former semantics are critical for enabling the singleton-based interaction model of spreadsheets, while the latter is important if the curation workflow is to be generalized beyond the upfront dataset.
Thus, existing data curation systems focus on the latter approach; Even Wrangler~\cite{Kandel:2011:WIV:1978942.1979444}, which does allow users to write one-off singleton curation operators, still requires users to resolve each singleton operation into a generalized predicate before allowing them to continue.

In \langname, regions combine both semantics.  Concretely, a \textit{region} is a 3-tuple $\tuple{R, C, f}$, where $R$ is a (possibly infinite) set of rows, $C$ is a (possibly infinite) set of columns, and $f$ is a boolean-valued formula defining a predicate over cells that fall into the specified range.