The Surprising semantics of Sort

Oliver Kennedy 2016-04-11 18:47:51 -04:00
parent 5132b2d1b9
commit 3898c8aaf8
6 changed files with 169 additions and 69 deletions

1
.gitignore vendored

@ -4,4 +4,5 @@
*.fdb_latexmk
*.fls
*.log
*.out
/main.pdf


@ -1,10 +0,0 @@
\BOOKMARK [1][-]{section.1}{Introduction}{}% 1
\BOOKMARK [1][-]{section.2}{Interface}{}% 2
\BOOKMARK [2][-]{subsection.2.1}{The Vizier Notebook}{section.2}% 3
\BOOKMARK [1][-]{section.3}{Language}{}% 4
\BOOKMARK [2][-]{subsection.3.1}{Data Model}{section.3}% 5
\BOOKMARK [2][-]{subsection.3.2}{Anatomy of a VizUAL script}{section.3}% 6
\BOOKMARK [2][-]{subsection.3.3}{VizUAL}{section.3}% 7
\BOOKMARK [1][-]{section.4}{Generalizing Singletons}{}% 8
\BOOKMARK [1][-]{section.5}{Related Work}{}% 9
\BOOKMARK [1][-]{section.6}{References}{}% 10

Binary file not shown.


@ -1,13 +1,37 @@
%% This BibTeX bibliography file was created using BibDesk.
%% http://bibdesk.sourceforge.net/
%% Created for Oliver Kennedy at 2016-04-04 12:02:57 -0400
%% Created for Oliver Kennedy at 2016-04-11 16:51:47 -0400
%% Saved with string encoding Unicode (UTF-8)
@book{saltzer2009principles,
Author = {Saltzer, Jerome H and Kaashoek, M Frans},
Date-Added = {2016-04-11 20:51:46 +0000},
Date-Modified = {2016-04-11 20:51:46 +0000},
Publisher = {Morgan Kaufmann},
Title = {Principles of computer system design: an introduction},
Year = {2009}}
@inbook{Erwig2002,
Address = {Berlin, Heidelberg},
Author = {Erwig, Martin and Burnett, Margaret},
Chapter = {Adding Apples and Oranges},
Date-Added = {2016-04-11 18:48:43 +0000},
Date-Modified = {2016-04-11 18:48:43 +0000},
Doi = {10.1007/3-540-45587-6_12},
Editor = {Krishnamurthi, Shriram and Ramakrishnan, C. R.},
Isbn = {978-3-540-45587-5},
Pages = {173--191},
Publisher = {Springer Berlin Heidelberg},
Title = {Practical Aspects of Declarative Languages: 4th International Symposium, PADL 2002 Portland, OR, USA, January 19--20, 2002 Proceedings},
Url = {http://dx.doi.org/10.1007/3-540-45587-6_12},
Year = {2002},
Bdsk-Url-1 = {http://dx.doi.org/10.1007/3-540-45587-6_12}}
@inproceedings{ives2015looking,
Author = {Ives, Zachary G and Yan, Zhepeng and Zheng, Nan and Litt, Brian and Wagenaar, Joost B},
Date-Added = {2016-04-04 15:50:57 +0000},


@ -18,6 +18,7 @@
\usepackage{wrapfig}
\usepackage[textsize=tiny]{todonotes}
\usepackage{cleveref}
\usepackage{tabu}
%%%%%% Package Configuration %%%%%%
%%% Listings


@ -1,69 +1,153 @@
%!TEX root = ../main.tex
Interactive views in \sysname are backed by an imperative-flavored language: The \sysname user action language (\langname). Although appearing imperative, operators in \langname form a monad that can be compiled down to a slightly generalized form of relational algebra. We now overview \langname, its general properties, and how it connects to user actions in \sysname's UI. Recall that our goal is not to reconstruct the full spreadsheet interface, but rather to define a form of relational algebra that naturally admits singleton operations and positional semantics, enabling it to mirror actions on the frontend.
Interactive views in \sysname are backed by an imperative-flavored language: The \sysname user action language (\langname). Although appearing imperative, operators in \langname form a monad that can be compiled down to a slightly generalized form of relational algebra. We now overview \langname, its data model, general properties, and how it connects to user actions in \sysname's UI. Recall that our goal is not to reconstruct the full spreadsheet interface, but rather to define a form of relational algebra that naturally admits singleton operations and positional semantics, enabling it to mirror actions on the frontend.
\subsection{Data Model}
A \langname script is a monad over \textit{data frames}, or ordered lists of uniform-width tuples of primitive typed values, or cells. For clarity of presentation, we will first discuss \langname considering only real-valued primitives, before defining the more complex type-system actually used.
Each attribute position, or column of a data frame has a globally unique identifier (a column id) and an optional human-readable name. Each tuple, or row of a data frame also has a globally unique identifier (a row id). Cells are thus uniquely identified by a pair of row and column ids.
The fundamental unit of data in \langname is a \textit{cell}, a 3-tuple: $C_i = \tuple{id_i, f_i, v_i}$, consisting of a globally unique identifier $id_i$, a formula expression $f_i$, and a value $v_i$. Note that we maintain both a value and the formula used to derive it for each cell, a property we will exploit below when discussing the interface.
Cells are also arranged into a 2-dimensional grid of rows and columns indexed by a coordinate system, a function $s : \mathbb N \times \mathbb N \rightarrow id$ that maps positions in the grid to the cell occupying that position. The function $s$ need not be defined at every position, but must be one-to-one: a cell may appear in only one position in the spreadsheet.
\subsection{Anatomy of a \langname script}
A formula is a primitive-valued expression that may include references to the values of other cells, identified either by the cell's global id or by absolute coordinates (explicit and absolute references, respectively). A formula evaluated in the context of a cell may also specify coordinate references as being relative to the cell (relative references). Columns are usually denoted by letters and rows by numbers.
A \textit{state} is the 2-tuple $\tuple{ C, s }$, consisting of a set of cells $C = \{C_i\}$ and a coordinate system.
We say that a formula $f$ evaluates to a value $v$ in the context of a given state ($f \mapsto_{\tuple{C,s}} v$) if, after replacing all references (coordinate references using $s$ and $C$, and explicit references using $C$) with the values of the referenced cells, the formula reduces to $v$\footnote{Similar operational semantics were previously proposed by Erwig and Burnett~\cite{Erwig2002}.}.
%
We say that a state $\tuple{C, s}$ is \textit{valid} if each cell's formula evaluates to the cell's value:
$$\forall \tuple{id_i, f_i, v_i} \in C\;:\; f_i \mapsto_{\tuple{C,s}} v_i$$
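As an illustrative check of this condition, consider three cells from the initial state in Figure~\ref{fig:rearrange:initial}. The labels $id_{B2}$, $id_{C1}$, and $id_{C2}$ are hypothetical names introduced only for this example, and we assume that coordinates are written column first, with the column letter standing in for its index:
$$\tuple{id_{B2},\; 4,\; 4}, \quad \tuple{id_{C1},\; \texttt{B1},\; 10}, \quad \tuple{id_{C2},\; \texttt{B2+C1},\; 14}$$
$$s(\mathrm{B}, 2) = id_{B2}, \quad s(\mathrm{C}, 1) = id_{C1}, \quad s(\mathrm{C}, 2) = id_{C2}$$
Replacing the coordinate references in the formula of $id_{C2}$ with the values of the cells that $s$ maps them to yields $4 + 10 = 14$, so $\texttt{B2+C1} \mapsto_{\tuple{C,s}} 14$ and the condition holds for this cell. The state is valid if the same check succeeds for every cell in $C$.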
\begin{figure}
\begin{verbatim}
LOAD 'input.csv'
ADD COLUMN total;
UPDATE total = price * (1 - discount)
INSERT ROW x AT LINE 9;
UPDATE name = 'table', price = 10, discount = 0.05,
total = price * (1-discount) WHERE id = x;
\end{verbatim}
\caption{An example \langname program}
\label{fig:program}
\end{figure}
User \textit{actions} in \langname transform a state $\tuple{C_1, s_1}$ into a new state $\tuple{C_2, s_2}$. We consider two classes of action: (1) \textit{data actions} that change only the spreadsheet's cells (i.e., for which $s_1 = s_2$), and (2) \textit{structural actions} that alter the spreadsheet's coordinate system.
%
We call the semantics for an action correct if they ensure that if the input to an action is valid, then the output is also valid.
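Stated symbolically, and only as a sketch of the definition above (the name $a$ for an action is shorthand introduced here), an action $a$ that maps $\tuple{C_1, s_1}$ to $\tuple{C_2, s_2}$ has correct semantics when validity of the input implies validity of the output:
$$\left(\forall \tuple{id_i, f_i, v_i} \in C_1 \;:\; f_i \mapsto_{\tuple{C_1,s_1}} v_i\right) \;\Rightarrow\; \left(\forall \tuple{id_j, f_j, v_j} \in C_2 \;:\; f_j \mapsto_{\tuple{C_2,s_2}} v_j\right)$$
For a data action the two coordinate systems coincide ($s_1 = s_2$), so only the cells in $C_2$ need to be rechecked. For a structural action every formula containing a coordinate reference is evaluated against the new $s_2$.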
An example \langname script is shown in Figure~\ref{fig:program}. Scripts begin with a \texttt{LOAD} statement that initializes the frame, declaring a set of columns and populating the frame with data drawn from either a CSV file or a frame defined by a previous page.
\begin{figure*}
\centering
\begin{subfigure}{0.3\textwidth}
\centering
\begin{tabular}{>{\tiny}rc|c|c}
& \tiny A & \tiny B & \tiny C \\
1& Alice & 10 & \texttt{=B1} (10)\\ \hline
2& Bob & 4 & \texttt{=B2+C1} (14)\\ \hline
3& Carol & 8 & \texttt{=B3+C2} (22)\\ \hline
4& Dave & 9 & \texttt{=B4+C3} (31)
\end{tabular}
\caption{Initial State}
\label{fig:rearrange:initial}
\end{subfigure}
%
\begin{subfigure}{0.3\textwidth}
\centering
\begin{tabular}{>{\tiny}rc|c|c}
& \tiny A & \tiny B & \tiny C \\
1& Alice & 10 & \texttt{=B1} (10)\\ \hline
2& Carol & 8 & \texttt{=B2+C3} (22)\\ \hline
3& Bob & 4 & \texttt{=B3+C1} (14)\\ \hline
4& Dave & 9 & \texttt{=B4+C2} (31)
\end{tabular}
\caption{After swapping rows 2 and 3}
\label{fig:rearrange:manual}
\end{subfigure}
%
\begin{subfigure}{0.3\textwidth}
\centering
\begin{tabular}{>{\tiny}rc|c|c}
& \tiny A & \tiny B & \tiny C \\
1& Alice & 10 & \texttt{=B1} (10)\\ \hline
2& Dave & 9 & \texttt{=B2+C1} (19)\\ \hline
3& Carol & 8 & \texttt{=B3+C2} (27)\\ \hline
4& Bob & 4 & \texttt{=B4+C3} (31)
\end{tabular}
\caption{After sorting on column 'B'}
\label{fig:rearrange:sort}
\end{subfigure}
\caption{Examples of both swapping rows and sorting rows in commercial spreadsheet systems.}
\end{figure*}
\subsection{Unsurprising Inconsistencies}
User actions on a spreadsheet have not only direct, intended effects, but may also have indirect, \textit{incidental} effects. Examples include changing a formula (dependent formulas are recomputed), repositioning a row (formulas depending on the row are modified), or sorting (formulas are recomputed based on the new, sorted coordinate system).
In modern commercial spreadsheet systems, the semantics of indirect effects at first appear to be inconsistent. Take, for example, two mechanisms for rearranging rows, applied to the table given in Figure~\ref{fig:rearrange:initial}, which shows a list of players (column A), scores (column B), and a cumulative total score (column C).
%
A user might manually drag row 3 to a position between rows 1 and 2, effecting a swap of rows 2 and 3.
Microsoft Excel, Apple's Numbers, and Google's Sheets\footnote{These and other behaviors described were evaluated on Excel for Mac version 15.20, Numbers version 3.6.1, and Google Sheets as of April 2016.} all have identical behavior, each resulting in the table shown in Figure~\ref{fig:rearrange:manual}.
Note that the formulas in column C have been rewritten to ensure that each cell retains its original value under the transposed coordinate system. In other words, the user's \texttt{MOVE} action treats formula references as being explicit references.
%
Conversely, a user might sort the rows of the table in descending order on column B. The resulting table in all three systems is identical, and shown in Figure~\ref{fig:rearrange:sort}. Here, the formulas in column C are changed only in appearance; each continues to reference the cells immediately to the left and above. However, the values of each cell have changed as a result. In other words, the user's \texttt{SORT} action treats formula references as being relative references.
At a high level, both actions are structural, as they only transform the coordinate system by rearranging the coordinate mapping; the effects on cells (the $C$ part of a state) are only incidental consequences of the new coordinate scheme. For each dependent cell that changes coordinates, the action must also change something else (the cell's formula or its value) to be correct.
The distinction between the two example actions is quite significant, because it underlies a fundamental tradeoff in minimizing the ``surprising'' incidental effects~\cite{saltzer2009principles} of a change in coordinates: for \texttt{MOVE}, cell formulas are \textit{translated} into the new coordinate system to ensure that each cell's value stays the same, while for \texttt{SORT}, cell formulas are \textit{re-evaluated} in the new coordinate system, changing the values to ensure that the formulas stay the same. We leave the optimization of this tradeoff to future work, but observe that virtually all structural actions we explored (drag cell, rearrange rows, filter, etc.) favor minimizing changes in values.
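To make the tradeoff concrete, the following traces a single cell, Bob's cumulative total, through both actions using the formulas and values shown in Figures~\ref{fig:rearrange:manual} and~\ref{fig:rearrange:sort}. The identifier $id_{Bob}$ and the arrow notation are shorthand introduced only for this illustration:
$$\texttt{MOVE}:\quad \tuple{id_{Bob},\; \texttt{B2+C1},\; 14} \;\longrightarrow\; \tuple{id_{Bob},\; \texttt{B3+C1},\; 14}$$
$$\texttt{SORT}:\quad \tuple{id_{Bob},\; \texttt{B2+C1},\; 14} \;\longrightarrow\; \tuple{id_{Bob},\; \texttt{B4+C3},\; 31}$$
Under \texttt{MOVE}, the formula is translated so that it still refers to the same two cells (Bob's score and Alice's total), and the value $4 + 10 = 14$ is preserved. Under \texttt{SORT}, the formula keeps its relative shape (the cell to the left plus the cell above) and is re-evaluated at Bob's new position in row 4, giving $4 + 27 = 31$.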
A \textit{selector} in \langname identifies a rectangular region of cells and consists of a column selector and a row selector. A column selector identifies a set of columns by their ids. Row selectors operate according to one of two semantics: by row id, or by some predicate over the row's attributes. We refer to these semantics as universal and qualitative, respectively.
\subsection{\langname}
\newcommand{\vizcommand}[2]{\noindent\texttt{#1}\\{#2}}
\vizcommand{LOAD \{frame | file\}}{
The load operation initializes a data frame and is the first line of any \langname script. The frame is initialized either as a copy of an existing frame identified by name, or by importing a CSV file.
}
\vizcommand{UPDATE \{formula\} WHERE \{selector\}}{
The update operation modifies values in a rectangular region defined by a selector according to the specified formula.
}
Ways to define a collection
\begin{itemize}
\item Set of IDs
\item Range of Positions
\item Property of the Data
\item Everything
\end{itemize}
Context:
\begin{itemize}
\item Sort order
\item Unique Identifiers
\item
\end{itemize}
Operations:
\begin{itemize}
\item UPDATE ... WHERE ...
\item DELETE WHERE ...
\item SORT BY ... / ARRANGE AS ...
\item INSERT ROW[S] ... / INSERT ... / LOAD ...
\item ADD COLUMN ... / DROP COLUMN ... / ALTER COLUMN ...
\item GROUP BY ... (\`a la pivot tables)
\end{itemize}
%
%
%
%
%
%
%
%
%
%
%A \langname script is a monad over \textit{data frames}, or ordered lists of uniform-width tuples of primitive typed values, or cells. For clarity of presentation, we will first discuss \langname considering only real-valued primitives, before defining the more complex type-system actually used.
%Each attribute position, or column of a data frame has a globally unique identifier (a column id) and an optional human-readable name. Each tuple, or row of a data frame also has a globally unique identifier (a row id). Cells are thus uniquely identified by a pair of row and column ids.
%
%\subsection{Anatomy of a \langname script}
%
%\begin{figure}
%\begin{verbatim}
%LOAD 'input.csv'
%ADD COLUMN total;
%UPDATE total = price * (1 - discount)
%INSERT ROW x AT LINE 9;
%UPDATE name = 'table', price = 10, discount = 0.05,
% total = price * (1-discount) WHERE id = x;
%\end{verbatim}
%\caption{An example \langname program}
%\label{fig:program}
%\end{figure}
%
%An example \langname script is shown in Figure~\ref{fig:program}. Scripts begin with a \texttt{LOAD} statement that initializes the frame, declaring a set of columns and populating the frame with data drawn from either a CSV file or a frame defined by a previous page.
%
%
%
%
%A \textit{selector} in \langname identifies a rectangular region of cells and consists of a column selector and a row selector. A column selector identifies a set of columns by their ids. Row selectors operate according to one of two semantics: by row id, or by some predicate over the row's attributes. We refer to these semantics as universal and qualitative, respectively.
%
%\subsection{\langname}
%
%\newcommand{\vizcommand}[2]{\noindent\texttt{#1}\\{#2}}
%
%\vizcommand{LOAD \{frame | file\}}{
%The load operation initializes a data frame and is the first line of any \langname script. The frame is initialized either as a copy of an existing frame identified by name, or by importing a CSV file.
%}
%
%\vizcommand{UPDATE \{formula\} WHERE \{selector\}}{
%The update operation modifies values in a rectangular region defined by a selector according to the specified formula.
%}
%
%
%
%Ways to define to a collection
%\begin{itemize}
%\item Set of IDs
%\item Range of Positions
%\item Property of the Data
%\item Everything
%\end{itemize}
%
%
%Context:
%\begin{itemize}
%\item Sort order
%\item Unique Identifiers
%\item
%\end{itemize}
%
%Operations:
%\begin{itemize}
%\item UPDATE ... WHERE ...
%\item DELETE WHERE ...
%\item SORT BY ... / ARRANGE AS ...
%\item INSERT ROW[S] ... / INSERT ... / LOAD ...
%\item ADD COLUMN ... / DROP COLUMN ... / ALTER COLUMN ...
%\item GROUP BY ... (a'la pivot tables)
%\end{itemize}