\documentclass[11pt, oneside]{article} % use "amsart" instead of "article" for AMSLaTeX format
\usepackage{fullpage}
\usepackage{wrapfig}
\usepackage[textsize=tiny]{todonotes}
\usepackage[margin=0.78in]{geometry}
%\usepackage[disable]{todonotes}
\pagestyle{empty}
\begin{document}
\vspace*{-5mm}
\noindent \textsc{NSF:DIBBS - PIs: Freire, Kennedy, Glavic}\hfill
DIBBs PI Meeting; Jan 2017\\[-5mm]
\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
\textbf{\large Streamlining and Understanding Curation with Vizier} (Award \#1640864)
\end{center}
\hrule
\bigskip
\begin{wrapfigure}{R}{0.5\textwidth}
\vspace*{-8mm}
\begin{flushright}
\includegraphics[width=0.48\textwidth]{VizierUISmall}
\vspace*{-8mm}
\end{flushright}
\caption{Prototype Vizier User Interface}
\vspace*{-4mm}
\end{wrapfigure}
Data curation, or wrangling, is a critical stage in data science in
which raw data is validated and repaired to establish trust in the
data, and structured to streamline analytics. Traditionally, data
curation has been performed as a pre-processing task: only after all
the data selected for a study (or application) are curated, are they
ready to be loaded into an analytics system for use. This is
problematic because while some cleaning constraints can be easily
defined (e.g., checking for valid attribute ranges), others are only
discovered as one analyzes the data.
%We posit that curation and exploration should be part of an ongoing,
%\emph{iterative} process.
Vizier will link exploration and curation. Analysts will
leverage the full power of existing SQL-based analytics platforms
to explore data and iteratively curate it.
%even if it has not yet been fully curated. As the
As a user queries the data, Vizier will present her with
provenance information, quality assessments, and opportunities for
quality improvement relevant to her current exploration efforts.
Provenance and quality information help the analyst to evaluate
whether she can trust the data she's looking at. If she decides that
more curation effort is required, Vizier can help her to direct her
curation efforts.
Through an intuitive hybrid notebook-spreadsheet interface, the system
will enable analysts to gracefully transition from preliminary data
surveys (best done in a spreadsheet) to more rigorous, procedural data
manipulations (best done in an imperative language). As illustrated in the
prototype interface above, Vizier presents users with both a tabular,
spreadsheet-style interface and a scripting interface. Edits to the
spreadsheet are reflected as operations in the script, while bulk
procedural transformations in the script are decomposed into a single,
editable expression in the spreadsheet. As a result, analysts can
transform data both through quick, visual edits and through bulk,
set-at-a-time scripting and automated data curation heuristics.
%, e.g., missing value imputation.
%\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
%\textbf{Preliminary Challenges}\vspace*{-2mm}
%\end{center}
%\hrule
%\vspace{2mm}
Vizier will combine and extend three existing systems: GProM, a
system that captures fine-grained provenance for queries; Mimir, a
system that supports probabilistic data curation; and VisTrails, a
scientific workflow and provenance management system that provides
support for exploratory computations. During the first year of our
project, we will focus on integrating these systems and will
address the following challenges.
\newcommand{\challenge}[1]{\medskip \noindent \textbf{#1}:~~}
\challenge{Integrating Provenance with Different Granularities}
The three systems adopt different provenance models. We will develop
a multi-level provenance model with the attribute-level detail
required by Mimir, but without sacrificing the simplicity of the
coarse-grained provenance supported by VisTrails.
\challenge{Defining Cleaning Workflows}
Vizier requires a scripting language that can gracefully capture
spreadsheet interactions and common curation tasks. We will design a
first draft of this language.
\challenge{Improvements to GProM and Mimir} GProM and Mimir are both
academic prototypes. We will improve their robustness and extend them to meet
the requirements of the Vizier system.
\challenge{Interface Design}
Over the first year, we will work with domain experts --- particularly
those without an extensive programming background --- to understand
their existing curation workflows and gather the requirements for the
design of the Vizier user interface.
% can be designed to best fit their needs.
\challenge{Systems Integration} Finally, there is the mechanical
challenge of getting systems written in different languages to talk to
one another. We have already made progress on this front and do not
anticipate significant roadblocks.
\end{document}
%
% -------
%
%
% If this is just one page, I am not sure how you are going to be able to fit all of this.
%
% I would focus on the motivation and provide a high-level overview of the solution -- stressing % provenance+usability as the enablers, and list some of the challenges at a higher level, e.g.,%
% integrating fine- and coarse-grained provenance, definition of cleaning operations (language?)% ,% % usable interface targeted to domain experts wo computing background, integration of existing % systems (Mimim, GProM, VisTrails).%
%
% Best,
% Juliana
%
% --- Mimir ---
% - The mechanics of integrating GProM
% - New Lenses (better imputation, sequence alignment, entity resolution)
% - Materialization (performance is becoming a concern)
% - Bulletproofing
%
% --- GProM ---
% - Attribute-level Uncertainty
% - Metadata tagging
% - Bulletproofing
% - Better backend support (e.g., UDFs/UDAs).
%
% --- VisTrails/Vizier UI ---
% - Notebook UI
% - Spreadsheet Editor
% - Translating VisTrails workflows to Vizier workflows
% - Integrating classical and spreadsheet provenance models?
%
% --- All ---
% - Spec for VizUAL v0.1
%
%
% ----
%
%