This commit is contained in:
Boris Glavic 2016-04-24 16:05:14 -05:00
parent ef47522bc0
commit fe014dd397
3 changed files with 59 additions and 18 deletions

@ -3,6 +3,7 @@
\newcommand{\oksays}[1]{\todo[inline]{\textbf{Oliver says:} #1}}
\newcommand{\bgsays}[1]{\todo[inline,color=green!40]{\textbf{Boris says:} #1}}
\newcommand{\bgdel}[1]{\todo[inline,color=green!40]{\textbf{Boris has deleted:} #1}}
\newcommand{\jfsays}[1]{\todo[inline,color=blue!40]{\textbf{Juliana says:} #1}}
\newcommand{\sysname}{Vizier\xspace}

@ -1,7 +1,7 @@
%!TEX root = ../main.tex
\begin{abstract}
The database community has developed a plethora of tools and techniques for supporting data curation and analysis including declarative query languages, data cleaning approaches, entity resolution and data fusion algorithms, schema matching and mapping, and many more. While usability has recently been observed to be a problem with databases~\cite{JC07}, there is currently no consensus on the best way of exposing these powerful tools to an analyst in a fashion that aids exploratory data curation and analysis. Thus, analysts continue to rely on tools such as spreadsheets and notebook-style programming environments~\cite{Chan1996119} (e.g., iPython notebook) for their data curation needs and cannot benefit from the contributions made by the database community. In this work we argue that both spreadsheets and notebooks have their advantages and disadvantages, and that a user-friendly curation tool should expose automated data curation techniques through an interface that is an extended hybrid between the spreadsheet and notebook UIs. To support exploratory data curation and analysis, additional functionality is needed that is found in neither spreadsheets nor notebooks: in particular, support for changing past decisions on the fly and having these changes propagate through an analysis workflow, the ability to combine small idiosyncratic manual curation steps to form a larger, more readable computation, and support for higher-level, automated data curation operations.
We also discuss the technical challenges of supporting such a hybrid UI over large-scale datasets, present our vision of \sysname - a system that exposes data curation operations through such an interface - and its declarative spreadsheet language \langname.
We also discuss the technical challenges of supporting such a hybrid UI over large-scale datasets, present our vision of \sysname \,- a system that exposes data curation operations through such an interface - and its declarative spreadsheet language \langname.
\end{abstract}

@ -1,34 +1,64 @@
%!TEX root = ../main.tex
In spite of the availability of powerful automated curation, cleaning, and analysis tools, spreadsheets and notebook UIs (e.g., iPython) are still the predominant tools used by most data scientists. Some of their ubiquity can be attributed to the fact that typical users will prefer a known over an unknown interface - even if the unknown interface may be better suited for the task at hand. However, we argue that while both types of interfaces have shortcomings, each is also a good fit for some use cases. Based on this observation we propose a combined spreadsheet and notebook UI for data curation that combines the best of both worlds and augments it with new functionality such as automated curation operators, deployment of curation workflows over large datasets, and support for exploratory curation tasks. We first review the spreadsheet and notebook UI paradigms and discuss their individual advantages and disadvantages, and then give an overview of how they may be combined into a single coherent interface.
In spite of the availability of powerful automated curation, cleaning, and analysis tools, spreadsheets and notebook UIs (e.g., iPython) are still the predominant tools used by most data scientists. Some of their ubiquity can be attributed to the fact that typical users will prefer a known over an unknown interface - even if the unknown interface may be better suited for the task at hand. However, we argue that while both types of interfaces have shortcomings, each is also a good fit for some use cases. Based on this observation we propose a combined spreadsheet and notebook UI for data curation over relational and non-relational data that combines the best of both worlds and augments it with new functionality such as automated curation operators, deployment of curation workflows over large datasets, declarative queries, and support for exploratory curation tasks. We first review the spreadsheet and notebook UI paradigms and discuss their individual advantages and disadvantages, and then give an overview of how they may be combined into a single coherent interface. The UI of \sysname, the system we are planning to build, thus enables powerful relational queries as well as easy data manipulation, summarization, and visualization.
\bgdel{In this paper, we explore how similar visual metaphors can be adapted for use with relational databases and discuss how this exploration informs the design of our prototype data curation tool, called \sysname.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Spreadsheets}
Spreadsheets are a ubiquitous data processing tool. Their simplicity, generality, and adaptability make them ideal for ``playing'' with data through predominantly visual programming metaphors.
In particular, spreadsheets provide powerful, but entirely visual metaphors for programming both data transformations and visualizations.
\begin{itemize}
\item
\end{itemize}
One especially powerful feature of the spreadsheet user interface is that it is easy to define both bulk, set-at-a-time operations, as well as exceptional, singleton data operations. The former class, already a strength of relational databases, is crucial for analyzing a dataset of any significant size.
Spreadsheets allow inline specification of views (micro-computations) over data and displaying the results of these computations inline with their inputs.
\oksays{I don't like the word generalize here. It implies that an operation is being altered or extended in some way. I think that's a large part of what we need to add.}
However, many applications require users to ``break the rules'' and apply one-off modifications or transformations to individual fields or records. For example, (1) hypothetical what-if scenarios require users to apply small ad-hoc updates to adjust the inputs under test; (2) repairs for data errors may be easy to define for individual cases, but far harder to define in the general case; (3) complex data transformations that need to be generalized may still be easier to define for individual test cases than on bulk data.
By making it easy for users to break the rules, even if only temporarily, spreadsheets empower users to explore data, evaluate options, and better understand the effects of their curation efforts.
Such rules violations, or \textit{singleton} operations, are not handled gracefully by existing relational DBMSes.
Furthermore, spreadsheets can sometimes help users repeatedly apply such computations to ``more'' data.
For example, by using a spreadsheet's copy/paste or fill feature, a single formula can be mapped over many cells.
However, this feature is quite limited in current implementations of spreadsheets and sometimes leads to unexpected results \bgsays{(sort example). }
BAD: The order in which operations have been applied is not exposed in a spreadsheet. Thus, spreadsheets are not great for complex data curation and exploration operations, because it is hard to keep track of what operations were applied to the data and in which order (needs workflow + provenance). A notebook UI such as in Jupyter and the visualization provided by workflow systems are much better suited in this regard. However, their disadvantage is that they are not well suited for small modifications to data, exploratory changes, and generalization of operations.
%
Based on these observations we propose a hybrid notebook and spreadsheet interface that combines the best of both worlds. In this paper, we discuss how such an interface facilitates data curation and exploration.
%
Furthermore, we propose additional novel functionality for addressing some of the shortcomings of both spreadsheets and notebooks. Some of these are based on changes to the UI while others rely on changes to the underlying data processing platform: UI:
1) a more flexible mechanism for generalize\&apply: the user defines an operation on data, and the system generalizes it and applies it to more data
PLATFORM:
1) by keeping track of a user's operations and by tracking them as a declarative program we can support better exploration including automatic updates of derived data based on changes to operations, 2) provenance tracking to understand and explore the derivation of data, 3) recommendations to let
\bgsays{NOT SURE WHETHER ALL OF THESE SHOULD BE EXPLORED IN THIS PAPER}
\begin{itemize}
\item While spreadsheets support an adapt\&apply approach to apply a computation defined for a specific set of cells to a larger collection of cells, there is no support to automatically extend such a cell collection if more data is added. This is in contrast to relational databases with their declarative query languages that provide data independence. In addition, the visual specification of an adapt\&apply operation is not well suited for very large datasets since the user has to manually select a range of cells.
\item The visual metaphors that spreadsheets provide for keeping track of dependencies of operations are specific to single cells (highlighting the cells on which a cell's formula directly depends) and, thus, do not lend themselves to keeping track of large workflows of curation operations. It is infeasible to assume that the user has a complete mental model of the whole workflow and all its branches.
\end{itemize}
BAD: The order in which operations have been applied is not exposed in a spreadsheet. Thus, spreadsheets are not great for complex data curation and exploration operations, because it is hard to keep track of what operations were applied to the data and in which order (needs workflow + provenance).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Notebook-style UIs}
A notebook UI such as in Jupyter and the visualization provided by workflow systems are much better suited in this regard. However, their disadvantage is that they are not well suited for small modifications to data, exploratory changes, and generalization of operations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Visualizations}
Both spreadsheets and notebook UIs make it very easy for users to create visualizations from data on the fly.
Both spreadsheets and notebook UIs make it very easy for users to create visualizations from data on the fly and show these visualizations inline with the data. Both paradigms also allow these visualizations to be tweaked and to be refreshed based on changes to their inputs. Spreadsheets in particular make
\begin{itemize}
\item
@ -38,14 +68,24 @@ Both spreadsheets and notebook UIs make it very easy for users to create visual
\item
\end{itemize}
Spreadsheets are a ubiquitous data processing tool. Their simplicity, generality, and adaptability make them ideal for ``playing'' with data through predominantly visual programming metaphors. In particular, spreadsheets provide powerful, but entirely visual metaphors for programming both data transformations and visualizations. In this paper, we explore how similar visual metaphors can be adapted for use with relational databases and discuss how this exploration informs the design of our prototype data curation tool, called \sysname. \sysname's user interface combines elements of spreadsheets, so-called notebook interfaces, and classical relational queries, enabling easy data manipulation, summarization, and visualization.
One especially powerful feature of the spreadsheet user interface is that it is easy to define both bulk, set-at-a-time operations, as well as exceptional, singleton data operations. The former class, already a strength of relational databases, is crucial for analyzing a dataset of any significant size. However, many applications require users to ``break the rules'' and apply one-off modifications or transformations to individual fields or records. For example, (1) hypothetical what-if scenarios require users to apply small ad-hoc updates to adjust the inputs under test; (2) repairs for data errors may be easy to define for individual cases, but far harder to define in the general case; (3) complex data transformations that need to be generalized may still be easier to define for individual test cases than on bulk data.
By making it easy for users to break the rules, even if only temporarily, spreadsheets empower users to explore data, evaluate options, and better understand the effects of their curation efforts.
Such rules violations, or \textit{singleton} operations, are not handled gracefully by existing relational DBMSes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Combining Spreadsheets and Notebooks}
%
Based on these observations we propose a hybrid notebook and spreadsheet interface that combines the best of both worlds. In this paper, we discuss how such an interface facilitates data curation and exploration.
%
\bgsays{NOT SURE WHETHER ALL OF THESE SHOULD BE EXPLORED IN THIS PAPER
Furthermore, we propose additional novel functionality for addressing some of the shortcomings of both spreadsheets and notebooks. Some of these are based on changes to the UI while others rely on changes to the underlying data processing platform: UI:
1) a more flexible mechanism for generalize\&apply: the user defines an operation on data, and the system generalizes it and applies it to more data
PLATFORM:
1) by keeping track of a user's operations and by tracking them as a declarative program we can support better exploration including automatic updates of derived data based on changes to operations, 2) provenance tracking to understand and explore the derivation of data, 3) recommendations to let
}
Spreadsheets give users an infinite 2-dimensional grid of cells that can hold either constant values or computed values derived from other cells through \textit{formulas}.
Collections, as they exist, are defined implicitly as any 1- or 2-dimensional region of cells that has meaning to the user.
Thus, instead of classical programmatic specification of bulk, set-at-a-time operations as operations over named collection objects, spreadsheets use the metaphor of copying code to a (user-specified) range of cells, combined with relative, positional data dependencies to quite literally ``map'' singleton operations over entire collections.
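To make the ``map'' metaphor concrete, consider the following informal sketch. The notation ($v$, $f$, $\delta$, $R$) is introduced here purely for illustration; it is not part of \sysname and is not meant to describe any particular spreadsheet implementation. Let $v(r,c)$ denote the value of the cell in row $r$ and column $c$, and suppose the formula at cell $(r_0,c_0)$ reads its $k$ inputs at relative offsets $(\delta^r_1,\delta^c_1),\ldots,(\delta^r_k,\delta^c_k)$:
\[
  v(r_0,c_0) = f\bigl(v(r_0+\delta^r_1,\,c_0+\delta^c_1),\,\ldots,\,v(r_0+\delta^r_k,\,c_0+\delta^c_k)\bigr).
\]
Copying this formula to a user-selected range $R$ of cells keeps $f$ and the offsets fixed and only changes the anchor cell, so that for every $(r,c) \in R$
\[
  v(r,c) = f\bigl(v(r+\delta^r_1,\,c+\delta^c_1),\,\ldots,\,v(r+\delta^r_k,\,c+\delta^c_k)\bigr).
\]
For instance, with $f(x,y) = x \cdot y$ and offsets $(0,-2)$ and $(0,-1)$, filling the third column over rows $2$ to $n$ yields $v(r,3) = v(r,1) \cdot v(r,2)$ for each such row $r$, i.e., $f$ is mapped over the rows of the implicit collection formed by the first two columns. Under this reading, the range $R$ is fixed at the time of the copy, which is one way to see why the result does not automatically extend when more rows are added.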