Paper formatting --- widows, orphans, overfull lines, etc...

master
Oliver Kennedy 2016-05-30 22:25:56 -04:00
parent 35fed72eb5
commit ba33201c3e
3 changed files with 13 additions and 13 deletions

@@ -1,6 +1,6 @@
%!TEX root = ../main.tex
\begin{abstract}
The database community has developed a plethora of tools and techniques for data curation and exploration, from declarative languages, to specialized techniques for data repair, and more.
The database community has developed numerous tools and techniques for data curation and exploration, from declarative languages, to specialized techniques for data repair, and more.
Yet, there is currently no consensus on how to best expose these powerful tools to an analyst in a simple, intuitive, and above all, flexible way.
Thus, analysts continue to rely on tools such as spreadsheets, imperative languages, and notebook style programming environments like Jupyter for data curation.
In this work, we explore the %intersection

@@ -1,11 +1,11 @@
%!TEX root = ../main.tex
In spite of the availability of powerful automated curation, cleaning, and analysis tools, spreadsheets and notebook UIs (e.g., Jupyter/iPython) are still the predominant tools used by most data scientists. Although their ubiquity is in part a matter of user familiarity~\cite{Chan1996119}, we argue that they also offer several compelling benefits for curation workloads.
Key among these is the simplicity with which users can define exceptions to bulk set-at-a-time operations in both a spreadsheet and a notebook setting.
In this paper, we examine the spreadsheet and notebook interface models, and explore how lessons from both can be incorporated into relational database interfaces.
We present a new user interface for data curation and a tool implementing this interface called \sysname.
\sysname will combine UI elements from both spreadsheets and notebooks and will support functionality not commonly found in either spreadsheets or notebooks, including automated curation operators~\cite{Yang:2015:LOA:2824032.2824055}, deployment of curation workflows over large datasets~\cite{Kandel:2011:WIV:1978942.1979444}, declarative queries~\cite{AG16,Olston:2008:PLN:1376616.1376726}, and support for exploratory curation tasks~\cite{SV08}.
This hybrid UI enables powerful relational queries, while still being flexible enough to permit easy data manipulation, summarization, and visualization.
In spite of the availability of powerful automated curation, cleaning, and analysis tools, spreadsheets and notebook UIs (e.g., Jupyter/iPython) are still the predominant tools used by most data scientists for manipulating and visualizing virtually all but the largest datasets. Although their ubiquity is in part a matter of user familiarity~\cite{Chan1996119}, we argue that they also offer several compelling benefits, specifically for curation workloads.
Key among these is the simplicity with which users can define \emph{exceptions} to bulk set-at-a-time operations in both spreadsheets and notebooks.
In this paper, we examine these two interface models and explore how lessons from both can be incorporated into relational database interfaces.
We present a new \emph{data curation} user interface and a tool implementing this interface called \sysname.
\sysname will combine UI elements from both spreadsheets and notebooks and support functionality not commonly found in either spreadsheets or notebooks, including automated curation operators~\cite{Yang:2015:LOA:2824032.2824055}, deployment of workflows over large datasets~\cite{Kandel:2011:WIV:1978942.1979444}, declarative queries~\cite{AG16,Olston:2008:PLN:1376616.1376726}, and support for exploratory curation tasks~\cite{SV08}.
This hybrid UI enables powerful relational queries, but remains flexible enough to permit easy data manipulation, summarization, and visualization.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -48,11 +48,11 @@ However spreadsheets also have several drawbacks compared to a DBMS:\\
% \bgsays{(sort example). }
%\begin{itemize}
%\item
\inlineitem{Non-Adaptive Computations over Collections}
While spreadsheets
\inlineitem{Non-Adaptive Computation over Collections}
Spreadsheets
%support adapt\&apply for applying a
% JF: I don't understand the connection to data independence here, this problem seems to be related to scope, and not how the data are structured?=
allow a computation defined in one cell to be adapted to larger collections of cells, the two dominant systems, i.e., Microsoft Excel and Google Sheets, do not support automatically extending formulas to new cells as data is added~\footnote{We note that Apple Numbers does exhibit this behavior.}.
allow a computation defined in one cell to be adapted to larger collections of cells. However, the two dominant systems, i.e., Microsoft Excel and Google Sheets, do not support automatically extending formulas to new cells as data is added~\footnote{We note that Apple Numbers does exhibit this behavior.}.
This is in stark contrast to relational databases with their declarative, data-independent query languages. \\
%Furthermore, the visual specification of an adapt\&apply operation is not suited well for very large datasets since the user has to manually select a range of cells.\\
%\item
@@ -89,7 +89,7 @@ Systems like Jupyter expose an interactive, interpreted programming environment
%\end{itemize}
\smallskip
However, some operations that are supported well in spreadsheets are harder to express in notebooks, and some disadvantages are shared among both paradigms:\\
However, some operations that are efficiently supported in spreadsheets are harder to express in notebooks, and some disadvantages are shared among both paradigms:\\
%\begin{itemize}
%\item
\inlineitem{Small edits are cumbersome} Compared to spreadsheets, modifying individual data values requires users to write code.\\
@@ -107,7 +107,7 @@ Both spreadsheets and notebook UIs make it very easy for users to create visual
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Combining Spreadsheets and Notebooks}
\tinysection{Combined Spreadsheets and Notebooks}
%
Spreadsheets permit intuitive visual interactions with data, while notebooks provide a clearer expression of the user's intent that can actually be reproduced. We propose a hybrid UI that combines elements of both interfaces, augmenting them with capabilities common to relational data processing. We discuss the challenges of developing such an integrated interface and how it facilitates data curation and exploration. We also introduce our proposed system, called \sysname, which empowers users with spreadsheet-like flexibility for transforming, visualizing, and exploring relational data, while still retaining the expressiveness and workflow capabilities of a notebook. At the heart of our approach is support for singleton operations in a relational setting, which in turn enables a bi-directional mapping between a spreadsheet-style graphical interface, and a notebook-style programmatic interface.

@@ -6,7 +6,7 @@ Here, this metadata serves two purposes.
First, as noted above, we need to be able to reliably materialize the formula backing each cell so that it can be edited. We need to ensure that each operator defines precise semantics for how it affects formulas.
Second, and perhaps more importantly, we track both values and the formula used to derive them as a way to define operational semantics that minimize user surprise. As we discuss shortly, one specific update to a spreadsheet may have many secondary, incidental effects on the spreadsheet's formulas and/or values. By tracking both, we can better understand these effects and minimize the complexities and unexpected side-effects of each operation.
\tinysection{Coordinate System} Cells are arranged into a 2-dimensional grid of rows and columns indexed by a coordinate system, a function $s : \mathbb N \times \mathbb N \rightarrow id$ that maps positions in the grid to the cell occupying that position. The function $s$ need not be complete, but must be one-to-one: a cell may only appear in one position in the spreadsheet.
\tinysection{Coordinate System} Cells are arranged in a 2-dimensional grid of rows and columns indexed by a coordinate system, a function $s : \mathbb N \times \mathbb N \rightarrow id$ that maps positions in the grid to the cell occupying that position. The function $s$ need not be complete, but must be one-to-one: a cell may only appear in one position in the spreadsheet.
\tinysection{Formulas} A formula is a primitive-valued expression that may include references to the values of other cells, identified by the cell's global id or by absolute coordinates (explicit and absolute references, respectively). A formula evaluated in the context of a cell may also specify coordinate references as being relative to the cell (relative references). Columns are usually denoted by letters and rows by numbers.
A \textit{state} is a 2-tuple $\tuple{ C, s }$ consisting of a set of cells $C = \{C_i\}$ and a coordinate system.
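
The coordinate system and state definitions in this hunk can be illustrated with a minimal Python sketch. It is not taken from the paper or from any \sysname code: the names `State`, `place`, and `lookup`, the use of strings for cell ids and formulas, and the (column, row) ordering are all assumptions made for illustration. It models $s$ as a partial map from grid positions to cell ids and rejects any placement that would break the one-to-one requirement.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Grid position; the (column, row) ordering is an assumption of this sketch.
Coord = Tuple[int, int]


@dataclass
class State:
    """Illustrative state <C, s>: a set of cells plus a coordinate system."""
    # C: cell id -> formula text (hypothetical representation of a cell)
    cells: Dict[str, str] = field(default_factory=dict)
    # s: position -> cell id; a dict is naturally partial (need not be complete)
    s: Dict[Coord, str] = field(default_factory=dict)

    def place(self, cell_id: str, formula: str, pos: Coord) -> None:
        """Add a cell at pos, rejecting anything that would break one-to-one."""
        if pos in self.s:
            raise ValueError(f"position {pos} is already occupied")
        if cell_id in self.cells:
            raise ValueError(f"cell {cell_id!r} already appears in the grid")
        self.cells[cell_id] = formula
        self.s[pos] = cell_id

    def lookup(self, pos: Coord) -> Optional[str]:
        """Return the id of the cell at pos, or None where s is undefined."""
        return self.s.get(pos)


state = State()
state.place("c1", "=1+2", (0, 0))
state.place("c2", "=c1*2", (1, 0))  # explicit reference to another cell by id
```

Rejecting a second placement of the same cell id is one way to enforce that a cell appears in only one position. Refusing to overwrite an occupied position is a design choice of this sketch, not something the text prescribes.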