Initial commit
commit
76d48e7d0c
@ -0,0 +1,4 @@
*.aux
*.log
DIBBSWhitepaper.pdf
*.synctex.gz
@ -0,0 +1,116 @@
\documentclass[11pt, oneside]{article} % use "amsart" instead of "article" for AMSLaTeX format
\usepackage{fullpage}
\usepackage{wrapfig}
\usepackage[textsize=tiny]{todonotes}
\usepackage[margin=0.78in]{geometry}
%\usepackage[disable]{todonotes}
\pagestyle{empty}


\begin{document}
\vspace*{-5mm}
\noindent \textsc{NSF:DIBBS - PIs: Freire, Kennedy, Glavic}\hfill
DIBBs PI Meeting; Jan 2017\\[-5mm]
\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
\textbf{\large Streamlining and Understanding Curation with Vizier} (Award \#1640864)
\end{center}
\hrule
\bigskip

\begin{wrapfigure}{R}{0.5\textwidth}
\vspace*{-8mm}
\begin{flushright}
\includegraphics[width=0.48\textwidth]{VizierUI}
\vspace*{-8mm}
\end{flushright}
\caption{Prototype Vizier User Interface}
\vspace*{-4mm}
\end{wrapfigure}
Data curation, or wrangling, is a critical stage in data science in which raw data is validated and repaired to establish trust in the data, and structured to streamline analytics.
Traditionally, data curation has been performed as a pre-processing task:
only after all the data selected for a study (or application) are curated are they ready to be loaded into an analytics system for use.
This is problematic because, while some cleaning constraints can be easily defined (e.g., checking for valid attribute ranges), others are only discovered as one analyzes the data.
In short, curation and exploration are both part of an ongoing, \emph{cyclic} process.

Vizier will link exploration and curation, allowing analysts to leverage the full power of their existing SQL-based analytics platform to explore data, even if it has not yet been fully curated.
As the analyst explores and queries her data, Vizier will present her with provenance information, quality assessments, and opportunities for quality improvement relevant to her current exploration efforts.
Provenance and quality information help the analyst evaluate whether she can trust the data she is looking at.
If she decides that more curation effort is required, Vizier can help her direct that effort.

Vizier's intuitive hybrid notebook-spreadsheet interface will make it easy for analysts to transition gracefully from preliminary data surveys (best done in a spreadsheet) to more rigorous, procedural data manipulations (best done in an imperative language).
As shown in the prototype interface above, Vizier presents users with both a tabular, spreadsheet-style interface and a script interface.
Edits to the spreadsheet are reflected as operations in the script, while bulk procedural transformations in the script are decomposed into a single, editable expression in the spreadsheet.
As a result, analysts can transform data both through quick, visual edits and through bulk, set-at-a-time scripting.
\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
\textbf{Preliminary Challenges}\vspace*{-2mm}
\end{center}
\hrule
\vspace{2mm}
Vizier will be composed of three existing systems: GProM, a system for generic fine-grained provenance queries; Mimir, a system for probabilistic data curation; and VisTrails, a system for workflow management and provenance. Our first year is largely dedicated to getting the systems to work together.

\newcommand{\challenge}[1]{\medskip \noindent \textbf{#1}:~~}

\challenge{Integrating Different Provenance Granularities}
The three systems each adopt a different provenance model. Over the coming year, we will need to develop a multi-level provenance model with the attribute-level detail required by Mimir, but without sacrificing the simplicity of coarse-grained provenance.

\challenge{Defining Cleaning Workflows}
Vizier requires a scripting language that can gracefully capture spreadsheet interactions and common curation tasks. Our first-year goals include a draft of this language.

\challenge{Bulletproofing}
GProM and Mimir are both academic projects. Our goals for the first year include bulletproofing, testing, and extending these systems to meet the requirements of the Vizier system.

\challenge{Interface Design}
Over the first year, we will work with domain experts, particularly those without an extensive programming background, to understand their existing curation workflows and how Vizier's interface can be designed to best fit their needs.

\challenge{Systems Integration}
Finally, there is the mechanical challenge of getting systems written in different languages to talk to one another. We are making progress and do not anticipate significant roadblocks.
\end{document}
%
% -------
%
%
% If this is just one page, I am not sure how you are going to be able to fit all of this.
%
% I would focus on the motivation and provide a high-level overview of the solution -- stressing
% provenance+usability as the enablers, and list some of the challenges at a higher level, e.g.,
% integrating fine- and coarse-grained provenance, definition of cleaning operations (language?),
% usable interface targeted to domain experts w/o computing background, integration of existing
% systems (Mimir, GProM, VisTrails).
%
% Best,
% Juliana
%
% --- Mimir ---
% - The mechanics of integrating GProM
% - New Lenses (better imputation, sequence alignment, entity resolution)
% - Materialization (performance is becoming a concern)
% - Bulletproofing
%
% --- GProM ---
% - Attribute-level Uncertainty
% - Metadata tagging
% - Bulletproofing
% - Better backend support (e.g., UDFs/UDAs).
%
% --- VisTrails/Vizier UI ---
% - Notebook UI
% - Spreadsheet Editor
% - Translating VisTrails workflows to Vizier workflows
% - Integrating classical and spreadsheet provenance models?
%
% --- All ---
% - Spec for VizUAL v0.1
%
%
% ----
%
%
Binary file not shown.