From bd509e303849f2cdd8429e597c6c5ffefb0c30d0 Mon Sep 17 00:00:00 2001 From: Boris Glavic Date: Sun, 24 Apr 2016 17:51:38 -0500 Subject: [PATCH] intro --- sections/introduction.tex | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/sections/introduction.tex b/sections/introduction.tex index a1fd75f..d0cbd78 100644 --- a/sections/introduction.tex +++ b/sections/introduction.tex @@ -1,6 +1,7 @@ %!TEX root = ../main.tex -In spite of the availability of powerful automated curation, cleaning, and analysis tools, spreadsheets and notebook UIs (e.g., iPython) are still the predominant tools used by most data scientists. Some of their ubiquity can be attributed to the fact that typical users will prefer a known over an unknown interface - even if the unknown interface may be better suited for the task at hand. However, we argue that while both types of interfaces have shortcomings, each are also good fits for some use cases. Based on this observation we propose a combined spreadsheet and notebook UI for data curation over relational and non-relational data that combines the best of both worlds and augments it new functionality such as automated curation operators, deployment of curation workflows over large datasets, declarative queries, and support for exploratory curation tasks. We first review the spreadsheet and notebook UI paradigms and discusse their individual advantages and disadvantages, and then give an overview of how they may be combined into a single coherent interface. The UI of \sysname, the system we are planning to build, thus enables powerful relational queries as well as easy data manipulation, summarization, and visualization. +In spite of the availability of powerful automated curation, cleaning, and analysis tools, spreadsheets and notebook UIs (e.g., iPython) are still the predominant tools used by most data scientists. 
Some of their ubiquity can be attributed to the fact that typical users will prefer a known over an unknown interface - even if the unknown interface may be better suited for the task at hand. However, we argue that while both types of interfaces have shortcomings, each is also a good fit for some use cases. Based on this observation, we propose a combined spreadsheet and notebook UI for data curation over relational % and non-relational data +that combines the best of both worlds and augments it with new functionality such as automated curation operators, deployment of curation workflows over large datasets, declarative queries, and support for exploratory curation tasks. We first review the spreadsheet and notebook UI paradigms and discuss their individual advantages and disadvantages, and then give an overview of how they may be combined into a single coherent interface. The UI of \sysname, the system we are planning to build, thus enables powerful relational queries as well as easy data manipulation, summarization, and visualization. \bgdel{In this paper, we explore how similar visual metaphors can be adapted for use with relational databases and discuss how this exploration informs the design of our prototype data curation tool, called \sysname.} @@ -8,16 +9,16 @@ In spite of the availability of powerful automated curation, cleaning, and analy %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \tinysection{Spreadsheets} - +% Spreadsheets are a ubiquitous data processing tool. Their simplicity, generality, and adaptability make them ideal for ``playing'' with data through predominantly visual programming metaphors. -In particular, spreadsheets provide powerful, but entirely visual metaphors for programming both data transformations and visualizations. - - +In particular, spreadsheets provide powerful, but entirely visual metaphors for programming both data transformations and visualizations. 
Spreadsheets provide several important features that are useful during data curation. \begin{itemize} -\item +\item \textbf{Convenient modification of values and computations.} The user can update any cell's value or formula immediately from the user interface. This enables flexible manual curation operations, e.g., to resolve missing values and to easily modify or overwrite any past choices. +\item \textbf{Singleton operations with inline results.} By using formulas in cells, the user defines a computation and the result of this computation is shown inline with its input data. +\item \textbf{Mapping singleton operations to data collections.} \end{itemize} @@ -40,9 +41,9 @@ For example, by using a spreadsheet's copy/fill paste feature, a single formula However, this feature is quite limited in current implementations of speadsheets and sometimes leads to unexpected results \bgsays{(sort example). } \begin{itemize} -\item While spreadsheets support an adapt\&apply approach to apply a computation defined for a specific set of cells to a larger collection of cells, there is no support to automatically extend such a cell collection if more data is added. This is in contrast to relational databases with their declarative query languages that provide data independence. Plus the visual specification of an adapt\&apply operation is not suited well for very large datasets since the user has to manually select a range of cells -\item The visual metaphors that spreadsheets provide for keeping track of dependencies of operations are specific to single cells (highlighting cells on which the formula of cell depends on directly) and, thus, do not lend themselves to keep track of a large workflows of curation operations. It is unfeasible to assume that the user has a complete mental model of the whole workflow and all its branches. 
-\item +\item \textbf{Computations over collections do not adapt to inserts and deletes.} While spreadsheets support an adapt\&apply approach to apply a computation defined for a specific set of cells to a larger collection of cells, there is no support for automatically extending such a cell collection if more data is added. This is in contrast to relational databases with their declarative query languages that provide data independence. Moreover, the visual specification of an adapt\&apply operation is not well suited for very large datasets since the user has to manually select a range of cells. +\item \textbf{Collection operations are not made explicit.} While adapt\&apply allows a computation to be mapped over a collection, there is no visual evidence that indicates that a set of cells stores formulas that were mapped in this way. That is, there is no generalization/abstraction mechanism in spreadsheets to represent a higher-level bulk operation (e.g., a view query in databases).\bgsays{provenance should help to identify that a ``map'' was applied} +\item \textbf{No order among operations and no tracking of workflow branches.} The visual metaphors that spreadsheets provide for keeping track of dependencies of operations are specific to single cells (highlighting the cells on which the formula of a cell directly depends) and, thus, do not lend themselves to keeping track of large workflows of curation operations. Furthermore, these visualizations are limited to tracing one step at a time, which makes tracking of transitive dependencies cumbersome. It is infeasible to assume that the user has a complete mental model of the whole workflow and all its branches. \bgsays{Workflow provenance would help with the higher-level view, database provenance for the transitive dependencies} \end{itemize} BAD: The order in which operations have been applied is not exposed in a spreadsheet. 
Thus, spreadsheets are not great for complex data curation and exploration operations, because it is hard to keep track of what operations where applied to the data and in which order (needs workflow + provenance).