Initial commit

master
Oliver Kennedy 2016-11-27 16:53:28 -05:00
commit 76d48e7d0c
3 changed files with 120 additions and 0 deletions

4
.gitignore vendored Normal file

@@ -0,0 +1,4 @@
*.aux
*.log
DIBBSWhitepaper.pdf
*.synctex.gz

116
DIBBSWhitepaper.tex Normal file

@@ -0,0 +1,116 @@
\documentclass[11pt, oneside]{article} % use "amsart" instead of "article" for AMSLaTeX format
\usepackage{fullpage}
\usepackage{wrapfig}
\usepackage[textsize=tiny]{todonotes}
\usepackage[margin=0.78in]{geometry}
%\usepackage[disable]{todonotes}
\pagestyle{empty}
\begin{document}
\vspace*{-5mm}
\noindent \textsc{NSF:DIBBs - PIs: Freire, Kennedy, Glavic}\hfill
DIBBs PI Meeting; Jan 2017\\[-5mm]
\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
\textbf{\large Streamlining and Understanding Curation with Vizier} (Award \#1640864)
\end{center}
\hrule
\bigskip
\begin{wrapfigure}{R}{0.5\textwidth}
\vspace*{-8mm}
\begin{flushright}
\includegraphics[width=0.48\textwidth]{VizierUI}
\vspace*{-8mm}
\end{flushright}
\caption{Prototype Vizier User Interface}
\vspace*{-4mm}
\end{wrapfigure}
Data curation, or wrangling, is a critical stage in data science in which raw data is validated and repaired to establish trust in the data, and structured to streamline analytics.
Traditionally, data curation has been performed as a pre-processing task:
only after all the data selected for a study (or application) are curated, are they ready to be loaded into an analytics system for use.
This is problematic because while some cleaning constraints can be easily defined (e.g., checking for valid attribute ranges), others are only discovered as one analyzes the data.
In short, curation and exploration are both part of an ongoing, \emph{cyclic} process.
Vizier will link exploration and curation, allowing analysts to leverage the full power of their existing SQL-based analytics platform to explore data, even if it has not yet been fully curated.
As the analyst explores and queries her data, Vizier will present her with provenance information, quality assessments, and opportunities for quality improvement relevant to her current exploration efforts.
Provenance and quality information help the analyst to evaluate whether she can trust the data she's looking at.
If she decides that more curation effort is required, Vizier can help her to direct her curation efforts.
Vizier's intuitive hybrid notebook-spreadsheet interface will make it easy for analysts to gracefully transition from preliminary data surveys (best done in a spreadsheet) to more rigorous, procedural data manipulations (best done in an imperative language).
As shown in the prototype interface above, Vizier presents users with both a tabular, spreadsheet-style interface and a script interface.
Edits to the spreadsheet are reflected as operations in the script, while bulk procedural transformations in the script are decomposed into a single, editable expression in the spreadsheet.
As a result, analysts can transform data through both quick, visual edits and bulk, set-at-a-time scripting.
\begin{center}
%\textbf{\Large Making Data Science Easier with \sysname}
\textbf{Preliminary Challenges}\vspace*{-2mm}
\end{center}
\hrule
\vspace{2mm}
Vizier will be composed of three existing systems: GProM (a system for generic fine-grained provenance queries), Mimir (a system for probabilistic data curation), and VisTrails (a system for workflow management and provenance). Our first year is largely dedicated to getting the systems to work together.
\newcommand{\challenge}[1]{\medskip \noindent \textbf{#1}:~~}
\challenge{Integrating Different Provenance Granularities}
All three systems adopt different provenance models. Over the coming year, we will need to develop a multi-level provenance model with the attribute-level detail required by Mimir, but without sacrificing the simplicity of coarse-grained provenance.
\challenge{Defining Cleaning Workflows}
Vizier requires a scripting language that can gracefully capture spreadsheet interactions and common curation tasks. Our first-year goals include a draft of this language.
\challenge{Bulletproofing}
GProM and Mimir are both academic projects. Our goals for the first year include bulletproofing, testing, and extending these systems to meet the requirements of the Vizier system.
\challenge{Interface Design}
Over the first year, we will work with domain experts --- particularly those without an extensive programming background --- to understand their existing curation workflows and how Vizier's interface can be designed to best fit their needs.
\challenge{Systems Integration}
Finally, there is the mechanical challenge of getting systems written in different languages to talk to one another. We are making progress and do not anticipate significant roadblocks.
\end{document}
%
% -------
%
%
% If this is just one page, I am not sure how you are going to be able to fit all of this.
%
% I would focus on the motivation and provide a high-level overview of the solution -- stressing
% provenance+usability as the enablers, and list some of the challenges at a higher level, e.g.,
% integrating fine- and coarse-grained provenance, definition of cleaning operations (language?),
% usable interface targeted to domain experts w/o computing background, integration of existing
% systems (Mimir, GProM, VisTrails).
%
% Best,
% Juliana
%
% --- Mimir ---
% - The mechanics of integrating GProM
% - New Lenses (better imputation, sequence alignment, entity resolution)
% - Materialization (performance is becoming a concern)
% - Bulletproofing
%
% --- GProM ---
% - Attribute-level Uncertainty
% - Metadata tagging
% - Bulletproofing
% - Better backend support (e.g., UDFs/UDAs).
%
% --- VisTrails/Vizier UI ---
% - Notebook UI
% - Spreadsheet Editor
% - Translating VisTrails workflows to Vizier workflows
% - Integrating classical and spreadsheet provenance models?
%
% --- All ---
% - Spec for VizUAL v0.1
%
%
% ----
%
%

BIN
VizierUI.pdf Normal file

Binary file not shown.