pass over white paper
parent
89fd69b0cd
commit
2a17dc9ce1
@ -29,49 +29,87 @@ DIBBs PI Meeting; Jan 2017\\[-5mm]
\end{wrapfigure}

Data curation, or wrangling, is a critical stage in data science in which raw data is validated and repaired to establish trust in the data, and structured to streamline analytics.
Traditionally, data curation has been performed as a pre-processing task:
only after all the data selected for a study (or application) are curated, are they ready to be loaded into an analytics system for use.
This is problematic because, while some cleaning constraints can be easily defined (e.g., checking for valid attribute ranges), others are only discovered as one analyzes the data.
In short, curation and exploration are both part of an ongoing, \emph{iterative} process.
Data curation, or wrangling, is a critical stage in data science in
which raw data is validated and repaired to establish trust in the
data, and structured to streamline analytics. Traditionally, data
curation has been performed as a pre-processing task: only after all
the data selected for a study (or application) are curated, are they
ready to be loaded into an analytics system for use. This is
problematic because, while some cleaning constraints can be easily
defined (e.g., checking for valid attribute ranges), others are only
discovered as one analyzes the data.

Vizier will link exploration and curation and allow analysts to leverage the full power of their existing SQL-based analytics platform to explore data, even if it has not yet been fully curated.
As the analyst explores and queries her data, Vizier will present her with provenance information, quality assessments, and opportunities for quality improvement relevant to her current exploration efforts.
Provenance and quality information help the analyst to evaluate whether she can trust the data she's looking at.
If she decides that more curation effort is required, Vizier can help her to direct her curation efforts.
%We posit that curation and exploration should be part of an ongoing,
%\emph{iterative} process.

Vizier's intuitive hybrid notebook-spreadsheet interface will make it easy for analysts to gracefully transition from preliminary data surveys (best done in a spreadsheet) to more rigorous, procedural data manipulations (best done in an imperative language).
As shown in the prototype interface above, Vizier presents users with both a tabular spreadsheet-style interface and a script interface.
Edits to the spreadsheet are reflected as operations in the script, while bulk procedural transformations in the script are decomposed into a single, editable expression in the spreadsheet.
As a result, analysts can transform data both through quick, visual edits and through bulk, set-at-a-time scripting and automated data curation heuristics, e.g., missing value imputation.
Vizier will link exploration and curation. Analysts will
leverage the full power of their existing SQL-based analytics platforms
to explore data and iteratively curate it.
%even if it has not yet been fully curated. As the
As a user queries the data, Vizier will present her with
provenance information, quality assessments, and opportunities for
quality improvement relevant to her current exploration efforts.
Provenance and quality information help the analyst to evaluate
whether she can trust the data she's looking at. If she decides that
more curation effort is required, Vizier can help her to direct her
curation efforts.

\begin{center}
Through an intuitive hybrid notebook-spreadsheet interface, the system
will enable analysts to gracefully transition from preliminary data
surveys (best done in a spreadsheet) to more rigorous, procedural data
manipulations (best done in an imperative language). As illustrated in the
prototype interface above, Vizier presents users with both a tabular,
spreadsheet-style interface and a scripting interface. Edits to the
spreadsheet are reflected as operations in the script, while bulk
procedural transformations in the script are decomposed into a single,
editable expression in the spreadsheet. As a result, analysts can
transform data both through quick, visual edits and through bulk,
set-at-a-time scripting and automated data curation heuristics.
%, e.g., missing value imputation.

%\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
\textbf{Preliminary Challenges}\vspace*{-2mm}
\end{center}
\hrule
\vspace{2mm}

Vizier will be composed of three existing systems: GProM --- a system for generic fine-grained provenance queries, Mimir --- a system for probabilistic data curation, and VisTrails --- a system for workflow management and provenance. Our first year is largely dedicated to getting these systems to work together.
%\textbf{Preliminary Challenges}\vspace*{-2mm}
%\end{center}
%\hrule
%\vspace{2mm}

Vizier will combine and extend three existing systems: GProM --- a
system that captures fine-grained provenance for queries, Mimir --- a
system that supports probabilistic data curation, and VisTrails --- a
scientific workflow and provenance management system that provides
support for exploratory computations. During the first year of our
project, we will focus on the integration of these systems. We will
address the following challenges.

\newcommand{\challenge}[1]{\medskip \noindent \textbf{#1}:~~}


\challenge{Integrating Different Provenance Granularities}
All three systems adopt different provenance models. Over the coming year, we will need to develop a multi-level provenance model with the attribute-level detail required by Mimir, but without sacrificing the simplicity of coarse-grained provenance.
\challenge{Integrating Provenance with Different Granularities}
The three systems adopt different provenance models. We will develop
a multi-level provenance model with the attribute-level detail
required by Mimir, but without sacrificing the simplicity of the
coarse-grained provenance supported by VisTrails.

\challenge{Defining Cleaning Workflows}
Vizier requires a scripting language that can gracefully capture spreadsheet interactions and common curation tasks. Our first-year goals include a draft of this language.
Vizier requires a scripting language that can gracefully capture
spreadsheet interactions and common curation tasks. We will design a
first draft of this language.

\challenge{Bulletproofing}
GProM and Mimir are both academic projects. Our goals for the first year include bulletproofing, testing, and extending these systems to meet the requirements of the Vizier system.
\challenge{Improvements to GProM and Mimir} GProM and Mimir are both
academic prototypes. We will improve their robustness and extend them to meet
the requirements of the Vizier system.

\challenge{Interface Design}
Over the first year, we will work with domain experts --- particularly those without an extensive programming background --- to understand their existing curation workflows and how Vizier's interface can be designed to best fit their needs.
Over the first year, we will work with domain experts --- particularly
those without an extensive programming background --- to understand
their existing curation workflows and gather the requirements for the
design of the Vizier user interface.
% can be designed to best fit their needs.

\challenge{Systems Integration}
Finally, there is the mechanical challenge of getting systems written in different languages to talk to one another. We are making progress and do not anticipate significant roadblocks.
\challenge{Systems Integration} Finally, there is the mechanical
challenge of getting systems written in different languages to talk to
one another. We have already made progress on this front and do not
anticipate significant roadblocks.