pass over white paper

master
Juliana Freire 2016-12-05 21:43:08 -05:00
parent 89fd69b0cd
commit 2a17dc9ce1
1 changed files with 67 additions and 29 deletions

@ -29,49 +29,87 @@ DIBBs PI Meeting; Jan 2017\\[-5mm]
\end{wrapfigure}
Data curation, or wrangling, is a critical stage in data science in which raw data is validated and repaired to establish trust in the data, and structured to streamline analytics.
Traditionally, data curation has been performed as a pre-processing task:
only after all the data selected for a study (or application) are curated, are they ready to be loaded into an analytics system for use.
This is problematic because while some cleaning constraints can be easily defined (e.g., checking for valid attribute ranges), others are only discovered as one analyzes the data.
In short, curation and exploration are both part of an ongoing, \emph{iterative} process.
Data curation, or wrangling, is a critical stage in data science in
which raw data is validated and repaired to establish trust in the
data, and structured to streamline analytics. Traditionally, data
curation has been performed as a pre-processing task: only after all
the data selected for a study (or application) are curated, are they
ready to be loaded into an analytics system for use. This is
problematic because while some cleaning constraints can be easily
defined (e.g., checking for valid attribute ranges), others are only
discovered as one analyzes the data.
Vizier will link exploration and curation and allow analysts to leverage the full power of their existing SQL-based analytics platform to explore data, even if it has not yet been fully curated.
As the analyst explores and queries her data, Vizier will present her with provenance information, quality assessments, and opportunities for quality improvement relevant to her current exploration efforts.
Provenance and quality information help the analyst to evaluate whether she can trust the data she's looking at.
If she decides that more curation effort is required, Vizier can help her to direct her curation efforts.
%We posit that curation and exploration should be part of an ongoing,
%\emph{iterative} process.
Vizier's intuitive hybrid notebook-spreadsheet interface will make it easy for analysts to gracefully transition from preliminary data surveys (best done in a spreadsheet) to more rigorous, procedural data manipulations (best done in an imperative language).
As shown in the prototype interface above, Vizier presents users with both a tabular spreadsheet-style interface, and a script interface.
Edits to the spreadsheet are reflected as operations in the script, while bulk procedural transformations in the script are decomposed into a single, editable expression in the spreadsheet.
As a result, analysts can transform data both through quick, visual edits and through bulk, set-at-a-time scripting and automated data curation heuristics, e.g., missing value imputation.
Vizier will link exploration and curation. Analysts will
leverage the full power of their existing SQL-based analytics platform
to explore data and iteratively curate it.
%even if it has not yet been fully curated. As the
As a user queries the data, Vizier will present her with
provenance information, quality assessments, and opportunities for
quality improvement relevant to her current exploration efforts.
Provenance and quality information help the analyst to evaluate
whether she can trust the data she's looking at. If she decides that
more curation effort is required, Vizier can help her to direct her
curation efforts.
\begin{center}
Through an intuitive hybrid notebook-spreadsheet interface, the system
will enable analysts to gracefully transition from preliminary data
surveys (best done in a spreadsheet) to more rigorous, procedural data
manipulations (best done in an imperative language). As illustrated in the
prototype interface above, Vizier presents users with both a tabular,
spreadsheet-style interface and a scripting interface. Edits to the
spreadsheet are reflected as operations in the script, while bulk
procedural transformations in the script are decomposed into a single,
editable expression in the spreadsheet. As a result, analysts can
transform data both through quick, visual edits and through bulk,
set-at-a-time scripting and automated data curation heuristics.
%, e.g., missing value imputation.
%\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
\textbf{Preliminary Challenges}\vspace*{-2mm}
\end{center}
\hrule
\vspace{2mm}
Vizier will be composed of three existing systems: GProM --- a system for generic fine-grained provenance queries, Mimir --- a system for probabilistic data curation, and VisTrails --- a system for workflow management and provenance. Our first year is largely dedicated to getting these systems to work together.
%\textbf{Preliminary Challenges}\vspace*{-2mm}
%\end{center}
%\hrule
%\vspace{2mm}
Vizier will combine and extend three existing systems: GProM --- a
system that captures fine-grained provenance for queries, Mimir --- a
system that supports probabilistic data curation, and VisTrails --- a
scientific workflow and provenance management system that provides
support for exploratory computations. During the first year of our
project, we will focus on the integration of these systems. We will
address the following challenges.
\newcommand{\challenge}[1]{\medskip \noindent \textbf{#1}:~~}
\challenge{Integrating Different Provenance Granularities}
All three systems adopt different provenance models. Over the coming year, we will need to develop a multi-level provenance model with the attribute-level detail required by Mimir, but without sacrificing the simplicity of coarse-grained provenance.
\challenge{Integrating Provenance with Different Granularities}
The three systems adopt different provenance models. We will develop
a multi-level provenance model with the attribute-level detail
required by Mimir, but without sacrificing the simplicity of
coarse-grained provenance supported by VisTrails.
\challenge{Defining Cleaning Workflows}
Vizier requires a scripting language that can gracefully capture spreadsheet interactions and common curation tasks. Our first year goals include a draft of this language.
Vizier requires a scripting language that can gracefully capture
spreadsheet interactions and common curation tasks. We will design a
first draft of this language.
\challenge{Bulletproofing}
GProM and Mimir are both academic projects. Our goals for the first year include bulletproofing, testing, and extending these systems to meet the requirements of the Vizier system.
\challenge{Improvements to GProM and Mimir} GProM and Mimir are both
academic prototypes. We will improve their robustness and extend them to meet
the requirements of the Vizier system.
\challenge{Interface Design}
Over the first year, we will work with domain experts --- particularly those without an extensive programming background --- to understand their existing curation workflows and how Vizier's interface can be designed to best fit their needs.
Over the first year, we will work with domain experts --- particularly
those without an extensive programming background --- to understand
their existing curation workflows and gather the requirements for the
design of the Vizier user interface.
% can be designed to best fit their needs.
\challenge{Systems Integration}
Finally, there is the mechanical challenge of getting systems written in different languages to talk to one another. We are making progress and do not anticipate significant roadblocks.
\challenge{Systems Integration} Finally, there is the mechanical
challenge of getting systems written in different languages to talk to
one another. We have already made progress on this front and do not
anticipate significant roadblocks.