\documentclass[11pt, oneside]{article} % use "amsart" instead of "article" for AMSLaTeX format
\usepackage{fullpage}
\usepackage{wrapfig}
\usepackage[textsize=tiny]{todonotes}
\usepackage[margin=0.78in]{geometry}
%\usepackage[disable]{todonotes}
\pagestyle{empty}

\begin{document}
\vspace*{-5mm}
\noindent \textsc{NSF:DIBBS - PIs: Freire, Kennedy, Glavic}\hfill
DIBBs PI Meeting; Jan 2017\\[-5mm]
\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
\textbf{\large Streamlining and Understanding Curation with Vizier} (Award \#1640864)
\end{center}
\hrule
\bigskip

\begin{wrapfigure}{R}{0.5\textwidth}
\vspace*{-8mm}
\begin{flushright}
\includegraphics[width=0.48\textwidth]{VizierUISmall}
\vspace*{-8mm}
\end{flushright}
\caption{Prototype Vizier User Interface}
\vspace*{-4mm}
\end{wrapfigure}

Data curation, or wrangling, is a critical stage in data science in
which raw data is validated and repaired to establish trust in the
data, and structured to streamline analytics. Traditionally, data
curation has been performed as a pre-processing task: only after all
the data selected for a study (or application) are curated are they
ready to be loaded into an analytics system for use. This is
problematic because, while some cleaning constraints can be easily
defined (e.g., checking for valid attribute ranges), others are only
discovered as one analyzes the data.

%We posit that curation and exploration should be part of an ongoing,
%\emph{iterative} process.
Vizier will link exploration and curation. Analysts will
leverage the full power of existing SQL-based analytics platforms
to explore data and iteratively curate it.
%even if it has not yet been fully curated. As the
As a user queries the data, Vizier will present her with
provenance information, quality assessments, and opportunities for
quality improvement relevant to her current exploration efforts.
Provenance and quality information help the analyst evaluate
whether she can trust the data she is looking at. If she decides that
more curation effort is required, Vizier can help her direct her
curation efforts.

Through an intuitive hybrid notebook-spreadsheet interface, the system
will enable analysts to gracefully transition from preliminary data
surveys (best done in a spreadsheet) to more rigorous, procedural data
manipulations (best done in an imperative language). As illustrated in the
prototype interface above, Vizier presents users with both a tabular,
spreadsheet-style interface and a scripting interface. Edits to the
spreadsheet are reflected as operations in the script, while bulk
procedural transformations in the script are decomposed into a single,
editable expression in the spreadsheet. As a result, analysts can
transform data both through quick, visual edits and through bulk,
set-at-a-time scripting and automated data curation heuristics.
%, e.g., missing value imputation.

%\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
%\textbf{Preliminary Challenges}\vspace*{-2mm}
%\end{center}
%\hrule
%\vspace{2mm}
Vizier will combine and extend three existing systems: GProM, a
system that captures fine-grained provenance for queries; Mimir, a
system that supports probabilistic data curation; and VisTrails, a
scientific workflow and provenance management system that provides
support for exploratory computations. During the first year of our
project, we will focus on the integration of these systems. We will
address the following challenges.

\newcommand{\challenge}[1]{\medskip \noindent \textbf{#1}:~~}
\challenge{Integrating Provenance with Different Granularities}
The three systems adopt different provenance models. We will develop
a multi-level provenance model with the attribute-level detail
required by Mimir, but without sacrificing the simplicity of the
coarse-grained provenance supported by VisTrails.

\challenge{Defining Cleaning Workflows}
Vizier requires a scripting language that can gracefully capture
spreadsheet interactions and common curation tasks. We will design a
first draft of this language.

\challenge{Improvements to GProM and Mimir} GProM and Mimir are both
academic prototypes. We will improve their robustness and extend them to meet
the requirements of the Vizier system.

\challenge{Interface Design}
Over the first year, we will work with domain experts, particularly
those without an extensive programming background, to understand
their existing curation workflows and gather the requirements for the
design of the Vizier user interface.
% can be designed to best fit their needs.

\challenge{Systems Integration} Finally, there is the mechanical
challenge of getting systems written in different languages to talk to
one another. We have already made progress on this front and do not
anticipate significant roadblocks.

\end{document}
%
% -------
%
%
% If this is just one page, I am not sure how you are going to be able to fit all of this.
%
% I would focus on the motivation and provide a high-level overview of the solution -- stressing
% provenance+usability as the enablers, and list some of the challenges at a higher level, e.g.,
% integrating fine- and coarse-grained provenance, definition of cleaning operations (language?),
% usable interface targeted to domain experts without computing background, integration of existing
% systems (Mimir, GProM, VisTrails).
%
% Best,
% Juliana
%
% --- Mimir ---
% - The mechanics of integrating GProM
% - New Lenses (better imputation, sequence alignment, entity resolution)
% - Materialization (performance is becoming a concern)
% - Bulletproofing
%
% --- GProM ---
% - Attribute-level Uncertainty
% - Metadata tagging
% - Bulletproofing
% - Better backend support (e.g., UDFs/UDAs).
%
% --- VisTrails/Vizier UI ---
% - Notebook UI
% - Spreadsheet Editor
% - Translating VisTrails workflows to Vizier workflows
% - Integrating classical and spreadsheet provenance models?
%
% --- All ---
% - Spec for VizUAL v0.1
%
%
% ----
%
%