
%!TEX root = ../main.tex

The fundamental unit of data in \langname is a \textit{cell}, a 3-tuple: $C = \tuple{id, f, v}$, consisting of a globally unique identifier $id$, a formula expression $f$, and a value $v$. The identifier of a cell is assigned to it when it is first accessed and is immutable --- even if the cell is moved to a different position in the spreadsheet.
By storing both a formula $f$ and its result $v$, a cell records provenance much as a provenance-aware data management system does, where each record is associated with metadata describing how it was computed.
Here, this metadata serves two purposes.
First, as noted above, we need to be able to reliably materialize the formula backing each cell so that it may be edited.  We need to ensure that each operator defines precise semantics for how it affects formulas.
Second, and perhaps more importantly, we track both values and the formula used to derive them as a way to define operational semantics that minimize user surprise.  As we will discuss shortly, one specific update to a spreadsheet may have many secondary, incidental effects on formulas and/or values in a spreadsheet.  By tracking both, we can better understand these effects and ensure that the complexities and unexpected side-effects of an operation are minimized.

\tinysection{Coordinate System} Cells are arranged into a 2-dimensional grid of rows and columns indexed by a coordinate system, a function $s : \mathbb N \times \mathbb N \rightarrow id$ that maps positions in the grid to the cell occupying that position.  The function $s$ need not be complete, but must be one-to-one; a cell may appear in only one position in the spreadsheet.

\tinysection{Formulas} A formula is a primitive-valued expression that may include references to the values of other cells, identified by the cell's global id or by absolute coordinates (explicit and absolute references, respectively).  A formula evaluated in the context of a cell may also specify coordinate references as being relative to the cell (relative references).  Columns are usually denoted by letters and rows by numbers.
A \textit{state} is the 2-tuple $\tuple{ C, s }$, consisting of a set of cells $C = \{C_i\}$ and a coordinate system.
We say that a formula $f$ evaluates to a value $v$ in the context of a given state ($f \mapsto_{\tuple{C,s}} v$) if, after replacing all references (coordinate references using $s$ and $C$, and explicit references using $C$), the formula reduces to $v$.\footnote{Similar operational semantics were previously proposed by Krishnamurthi and Ramakrishnan~\cite{Erwig2002}.}
%
We say that a state $\tuple{C, s}$ is \textit{valid}\footnote{Note that this definition does not preclude direct or indirect circular references as long as the computations defined by the cell formulas have a fixpoint. However, such a fixpoint computation may be hard to understand for a user and, thus, we disallow circular references for now.} if each cell's formula evaluates to the cell's value:
$\forall \tuple{id_i, f_i, v_i} \in C\;:\; f_i \mapsto_{\tuple{C,s}} v_i$.
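
To illustrate the definition, consider cell C2 of Figure~\ref{fig:rearrange:initial}.  Write $c_{\mathrm{C2}}$ for the identifier that $s$ assigns to position C2; this name is hypothetical and introduced only for this sketch.  The cells at positions B2 and C1 hold the values 4 and 10, so resolving the two coordinate references gives
\[
\texttt{=B2+C1} \;\mapsto_{\tuple{C,s}}\; 4 + 10 \;=\; 14,
\]
which matches the value stored in the cell $\tuple{c_{\mathrm{C2}}, \texttt{=B2+C1}, 14}$.  The same check succeeds for every other cell in the figure, so the depicted state is valid.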

User \textit{actions} in \langname transform a state $\tuple{C_1, s_1}$ into a new state $\tuple{C_2, s_2}$.   %
We call the semantics of an action \textit{correct} if they guarantee that a valid input state always yields a valid output state.
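Spelled out with the validity condition above, and as a restatement rather than an additional requirement, correctness of an action that maps $\tuple{C_1, s_1}$ to $\tuple{C_2, s_2}$ reads:
\[
\begin{array}{l}
\left(\forall \tuple{id, f, v} \in C_1 \;:\; f \mapsto_{\tuple{C_1,s_1}} v\right) \\
\quad\Longrightarrow\; \left(\forall \tuple{id, f, v} \in C_2 \;:\; f \mapsto_{\tuple{C_2,s_2}} v\right)
\end{array}
\]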
%We also focus on two classes of action: (1) \textit{data actions} that change only the spreadsheet's cells (i.e., for which $s_1 = s_2$), and (2) \textit{structural actions} that alter the spreadsheet's coordinate system and only modify the spreadsheet's cells to the extent necessary to preserve validity under the new coordinate system.

\begin{figure*}
\centering
{\small
\begin{subfigure}{0.3\textwidth}
  \centering
  \begin{tabular}{>{\tiny}rc|c|c}
   & \tiny A & \tiny B & \tiny C \\
    1& Alice & 10 & \texttt{=B1} (10)\\ \hline
    2& Bob & 4 & \texttt{=B2+C1} (14)\\ \hline
    3& Carol & 8 & \texttt{=B3+C2} (22)\\ \hline
    4& Dave & 9 & \texttt{=B4+C3} (31)
  \end{tabular}
  \caption{Initial State}
  \label{fig:rearrange:initial}
\end{subfigure}
%
\begin{subfigure}{0.3\textwidth}
  \centering
  \begin{tabular}{>{\tiny}rc|c|c}
   & \tiny A & \tiny B & \tiny C \\
    1& Alice & 10 & \texttt{=B1} (10)\\ \hline
    2& Carol & 8 & \texttt{=B2+C3} (22)\\ \hline
    3& Bob & 4 & \texttt{=B3+C1} (14)\\ \hline
    4& Dave & 9 & \texttt{=B4+C2} (31)
  \end{tabular}
  \caption{After swapping rows 2 and 3}
  \label{fig:rearrange:manual}
\end{subfigure}
%
\begin{subfigure}{0.3\textwidth}
  \centering
  \begin{tabular}{>{\tiny}rc|c|c}
   & \tiny A & \tiny B & \tiny C \\
    1& Alice & 10 & \texttt{=B1} (10)\\ \hline
    2& Dave & 9 & \texttt{=B2+C1} (19)\\ \hline
    3& Carol & 8 & \texttt{=B3+C2} (27)\\ \hline
    4& Bob & 4 & \texttt{=B4+C3} (31)
  \end{tabular}
  \caption{After sorting on column B (descending)}
  \label{fig:rearrange:sort}
\end{subfigure}
}
\caption{Examples of swapping rows and sorting rows in commercial spreadsheet systems.}
\vspace*{-3mm}
\end{figure*}

\subsection{Unsurprising Inconsistencies}
User actions on a spreadsheet have direct, intended effects, and may also have indirect, \textit{incidental} effects.  Examples include changing a formula (dependent formulas are recomputed), repositioning a row (formulas depending on the row are modified), or sorting (formulas are recomputed based on the new, sorted coordinate system).
In commercial spreadsheet systems, indirect effect semantics can sometimes be inconsistent.  Take, for example, two mechanisms for rearranging rows in the table given in Figure~\ref{fig:rearrange:initial}.
%
A user might manually drag row 3 to a position between rows 1 and 2, effecting a swap of rows 2 and 3.
Microsoft Excel, Apple's Numbers, and Google's Sheets\footnote{These and other behaviors described were evaluated on Excel for Mac version 15.20, Numbers version 3.6.1, and Google Sheets as of April 2016.} all behave identically, each producing the table shown in Figure~\ref{fig:rearrange:manual}.
Note that the formulas of the moved cells (now at C2 and C3) have been rewritten so that each cell retains its original value under the rearranged coordinate system.  In other words, the user's \texttt{MOVE} action treats formula references as being \textit{explicit} references.
%
Conversely, a user might sort the rows of the table in descending order on column B.  The resulting table is identical in all three systems, and is shown in Figure~\ref{fig:rearrange:sort}.  Here, the formulas in column C are changed only in appearance; each continues to reference the cells immediately to the left and above.  However, the values of the cells have changed as a result.  In other words, the user's \texttt{SORT} action treats formula references as being \textit{relative} references.

At a high level, both actions are structural, as they only transform the coordinate system by changing the positions of cells in the coordinate system; the effects on cell formulas and values (the $C$ part of a state) are only incidental consequences of the new coordinate scheme. To be correct, the action must also adapt (1) references in the formulas of cells that have been moved and (2) references in the formulas of other cells that reference a moved cell.
% For each dependent cell that changes coordinates, the action must also something else to be correct.
The distinction between the two example actions is significant because it reflects a fundamental tradeoff in minimizing the ``surprising'' incidental effects~\cite{saltzer2009principles} of a change in coordinates: for \texttt{MOVE}, cell formulas are \textit{translated} into the new coordinate system to ensure that each cell's value stays the same (the ids of the cells referenced in every formula are the same before and after the operation), while for \texttt{SORT}, cell formulas are \textit{re-evaluated} under the new coordinate system (the relative position of the referencing cell and the referenced cell is fixed), changing the values to ensure that the formulas stay the same.
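The tradeoff can be traced for a single cell.  The following summary, read directly off Figures~\ref{fig:rearrange:initial}, \ref{fig:rearrange:manual}, and \ref{fig:rearrange:sort}, follows the column-C cell in Carol's row through both actions:
\[
\begin{array}{l|c|c|c}
 & \mbox{position} & f & v \\ \hline
\mbox{initial} & \mbox{C3} & \texttt{=B3+C2} & 22 \\
\mbox{after \texttt{MOVE}} & \mbox{C2} & \texttt{=B2+C3} & 22 \\
\mbox{after \texttt{SORT}} & \mbox{C3} & \texttt{=B3+C2} & 27
\end{array}
\]
Under \texttt{MOVE} the value is preserved and the formula text changes; under \texttt{SORT} the formula text is preserved and the value changes.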

We tested a range of actions that affect the coordinate system of a spreadsheet, and all consistently exhibited one of these two types of stability: either on values (formulas are translated) or on formulas (new values are computed).  Our results are shown in Figure~\ref{fig:stability}.  Virtually all action semantics favor value stability, clearly the simpler case in general.  With one exception (cut/paste in Numbers), semantics that enforce formula stability are used only in the one operation where the coordinate transformation required to enforce value stability is non-intuitive: sorting applies an effectively arbitrary coordinate transformation, and translating formulas through it would mangle them beyond recognition.
This suggests a heuristic for defining action semantics in \langname: Prefer semantics that enforce value stability, as long as the resulting coordinate transformation is simple and easy to understand.

\begin{figure}
{\small
\begin{center}
\begin{tabular}{r|ccc}
 & \multicolumn{3}{c}{\textbf{Stability}} \\ \hline
\textbf{Action} & \textbf{Excel} & \textbf{Numbers} & \textbf{Sheets} \\ \hline
Cut/Paste & V & F & V \\
Drag Cell/Row/Col & n/a & V & V\\
Insert Row/Col & V & V & V \\
Sort & F & F & F \\
Filter & V & V & V
\end{tabular}
\caption{Interface actions and whether they are \textbf{F}ormula-stable, or \textbf{V}alue-stable.  Excel does not support dragging.}
\label{fig:stability}
\end{center}
}
\vspace*{-7mm}
\end{figure}


\subsection{Regions to Relations}
Many operations in \sysname operate over sets or collections of cells.
For example, aggregates in formulas, the `paste' operation, and type conversions all target or reference entire regions of cells.
In a typical spreadsheet, such regions are specified as rectangular regions of cells in the current coordinate system (e.g., \texttt{[A3\;:\;B99]} or \texttt{[A\;:\;A]}).
Conversely, in a relational setting, sets of target values are specified qualitatively through selection predicates.

The former semantics are critical for enabling the spreadsheet interaction model, while the latter are important for generalizing the curation workflow beyond the initial dataset.
Existing data curation systems focus on the latter approach; even Wrangler~\cite{Kandel:2011:WIV:1978942.1979444}, which does allow users to initially write curation operators as singletons, still forces users to define a generalized predicate before moving on.

In \langname, regions combine both semantics. Concretely, a \textit{region} is defined through a 3-tuple $\tuple{R, C, f}$, where $R$ is a (possibly infinite) set of rows, $C$ is a (possibly infinite) set of columns, and $f$ is a boolean-valued formula defining a predicate over cells that fall into the specified range. All cells within the intersection of $R$ and $C$ that fulfill $f$ are part of the region.
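As a sketch of how such a region might be read, the cells selected by $\tuple{R, C, f}$ under a coordinate system $s$ can be written in set-builder form.  This form, the argument order $s(r, c)$, and the convention that $f$ is evaluated in the context of each candidate cell are assumptions made for illustration and are not part of the definition above:
\[
\{\, s(r, c) \;\mid\; r \in R,\; c \in C,\; s(r, c) \mbox{ is defined},\; f \mbox{ holds at } s(r, c) \,\}.
\]
For the table in Figure~\ref{fig:rearrange:initial}, taking $R = \{1, \ldots, 4\}$, $C = \{\mbox{B}\}$, and $f$ to be the predicate ``value $> 8$'' would select the cells at B1 and B4 (values 10 and 9).  Leaving $R$ unbounded keeps the column reference of the spreadsheet model, while $f$ contributes the selection predicate of the relational model.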


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: