Initial commit

2016-11-27 16:53:28 -05:00 · 2016-11-27 16:53:28 -05:00 · 76d48e7d0c
commit 76d48e7d0c
3 changed files with 120 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,4 @@
+*.aux
+*.log
+DIBBSWhitepaper.pdf
+*.synctex.gz
--- a/DIBBSWhitepaper.tex
+++ b/DIBBSWhitepaper.tex
@@ -0,0 +1,116 @@
+\documentclass[11pt, oneside]{article}   	% use "amsart" instead of "article" for AMSLaTeX format
+\usepackage{fullpage}
+\usepackage{wrapfig}
+\usepackage[textsize=tiny]{todonotes}
+\usepackage[margin=0.78in]{geometry}
+%\usepackage[disable]{todonotes}
+\pagestyle{empty}
+
+
+\begin{document}
+\vspace*{-5mm}
+\noindent \textsc{NSF:DIBBS - PIs: Freire, Kennedy, Glavic}\hfill
+DIBBs PI Meeting; Jan 2017\\[-5mm]
+\begin{center}
+%\textbf{\Large Making Data Science Easier with \sysname}
+\textbf{\large Streamlining and Understanding Curation with Vizier} (Award \#1640864)
+\end{center}
+\hrule
+\bigskip
+
+\begin{wrapfigure}{R}{0.5\textwidth}
+\vspace*{-8mm}
+\begin{flushright}
+\includegraphics[width=0.48\textwidth]{VizierUI}
+\vspace*{-8mm}
+\end{flushright}
+\caption{Prototype Vizier User Interface}
+\vspace*{-4mm}
+\end{wrapfigure}
+
+
+Data curation, or wrangling, is a critical stage in data science in which raw data is validated and repaired to establish trust in the data, and structured to streamline analytics.
+Traditionally, data curation has been performed as a pre-processing task: 
+only after all the data selected for a study (or application) are curated, are they ready to be loaded into an analytics system for use.
+This is problematic because while some cleaning constraints can be easily defined (e.g., checking for valid attribute ranges), others are only discovered as one analyzes the data. 
+In short, curation and exploration are both part of an ongoing, \emph{cyclic} process.
+
+Vizier will link exploration and curation, allowing analysts to leverage the full power of their existing SQL-based analytics platform to explore data, even if it has not yet been fully curated.  
+As the analyst explores and queries her data, Vizier will present her with provenance information, quality assessments, and opportunities for quality improvement relevant to her current exploration efforts.
+Provenance and quality information help the analyst to evaluate whether she can trust the data she's looking at.
+If she decides that more curation effort is required, Vizier can help her to direct her curation efforts.
+
+Vizier's intuitive hybrid notebook-spreadsheet interface will make it easy for analysts to gracefully transition from preliminary data surveys (best done in a spreadsheet) to more rigorous, procedural data manipulations (best done in an imperative language). 
+As shown in the prototype interface above, Vizier presents users with both a tabular, spreadsheet-style interface and a script interface.  
+Edits to the spreadsheet are reflected as operations in the script, while bulk procedural transformations in the script are decomposed into a single, editable expression in the spreadsheet.
+As a result, analysts can transform data through quick, visual edits and through bulk, set-at-a-time scripting, moving freely between the two.
+
+\begin{center}
+%\textbf{\Large Making Data Science Easier with \sysname}
+\textbf{Preliminary Challenges}\vspace*{-2mm}
+\end{center}
+\hrule
+\vspace{2mm}
+
+Vizier will be composed of three existing systems: GProM, a system for generic fine-grained provenance queries; Mimir, a system for probabilistic data curation; and VisTrails, a system for workflow management and provenance.  Our first year is largely dedicated to getting the systems to work together.
+
+
+\newcommand{\challenge}[1]{\medskip \noindent \textbf{#1}:~~}
+
+
+\challenge{Integrating Different Provenance Granularities}
+All three systems adopt  different provenance models.  Over the coming year, we will need to develop a multi-level provenance model with the attribute-level detail required by Mimir, but without sacrificing the simplicity of coarse-grained provenance.
+
+\challenge{Defining Cleaning Workflows}
+Vizier requires a scripting language that can gracefully capture spreadsheet interactions and common curation tasks.  Our first-year goals include a draft of this language.
+
+\challenge{Bulletproofing}
+GProM and Mimir are both academic projects.  Our goals for the first year include bulletproofing, testing, and extending these systems to meet the requirements of the Vizier system.
+
+\challenge{Interface Design}
+Over the first year, we will work with domain experts --- particularly those without an extensive programming background --- to understand their existing curation workflows and how Vizier's interface can be designed to best fit their needs.
+
+\challenge{Systems Integration}
+Finally, there is the mechanical challenge of getting systems written in different languages to talk to one another.  We are making progress and do not anticipate significant roadblocks.
+
+
+
+\end{document}  
+
+% 
+% -------
+% 
+% 
+% If this is just one page, I am not sure how you are going to be able to fit all of this.
+% 
+% I would focus on the motivation and provide a high-level overview of the solution -- stressing provenance+usability as the enablers, and list some of the challenges at a higher level, e.g.,
+% integrating fine- and coarse-grained provenance, definition of cleaning operations (language?), usable interface targeted to domain experts w/o computing background, integration of existing systems (Mimir, GProM, VisTrails).
+% 
+% Best,
+% Juliana
+% 
+% --- Mimir ---
+%  - The mechanics of integrating GProM
+%  - New Lenses (better imputation, sequence alignment, entity resolution)
+%  - Materialization (performance is becoming a concern)
+%  - Bulletproofing
+% 
+% --- GProM --- 
+%  - Attribute-level Uncertainty
+%  - Metadata tagging
+%  - Bulletproofing
+%  - Better backend support (e.g., UDFs/UDAs).
+% 
+% --- VisTrails/Vizier UI ---
+%  - Notebook UI
+%  - Spreadsheet Editor
+%  - Translating VisTrails workflows to Vizier workflows
+%  - Integrating classical and spreadsheet provenance models?
+% 
+% --- All ---
+%  - Spec for VizUAL v0.1
+% 
+% 
+% ----
+% 
+% 
--- a/VizierUI.pdf
+++ b/VizierUI.pdf