Compare commits

...

10 Commits

Author SHA1 Message Date
Juliana Freire b32021ef1d fixed ack 2020-02-10 02:04:42 -05:00
Juliana Freire 5369620c4e fixed 2 typos 2020-02-10 02:02:46 -05:00
Boris Glavic 756fb1b614 acks 2020-02-09 23:46:24 -06:00
Boris Glavic a74a49e2cf CR 2019-12-16 21:13:11 -06:00
Boris Glavic 018ec9dbf1 layout fixes 2019-12-16 21:12:11 -06:00
Boris Glavic 6bee5b44c1 text abstract 2019-12-16 21:07:24 -06:00
Boris Glavic b35782ac75 break urls 2019-12-16 21:06:58 -06:00
Boris Glavic 9c92b54fcd CR 2019-12-16 20:58:26 -06:00
Boris Glavic b4395b969a changes 2019-12-16 20:46:52 -06:00
Boris Glavic 5b7d18aef5 simplified algorithm 2019-12-16 19:03:33 -06:00
11 changed files with 128 additions and 159 deletions

Binary file not shown.

abstract.txt Normal file

@@ -0,0 +1 @@
Notebook and spreadsheet systems are currently the de-facto standard for data collection, preparation, and analysis. However, these systems have been criticized for their lack of reproducibility, versioning, and support for sharing. These shortcomings are particularly detrimental for data curation where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis. We present Vizier, an open-source tool that helps analysts to build and refine data pipelines. Vizier combines the flexibility of notebooks with the easy-to-use data manipulation interface of spreadsheets. Combined with advanced provenance tracking for both data and computational steps this enables reproducibility, versioning, and streamlined data exploration. Unique to Vizier is that it exposes potential issues with data, no matter whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as \emph{data caveats}. Caveats are propagated alongside data using principled techniques from uncertain data management. Vizier provides extensive user interface support for caveats, e.g., exposing them as summaries in a dedicated error view and highlighting cells with caveats in spreadsheets.


@@ -66,30 +66,12 @@ Paris, France, April 16-19, 2018},
number = {1},
pages = {51--62},
projects = {GProM; Reenactment},
title = {GProM - A Swiss Army Knife for Your Provenance Needs},
title = {{GProM} - A Swiss Army Knife for Your Provenance Needs},
volume = {41},
year = {2018}
}
@article{DBLP:journals/tkde/ArabGKRG18,
author = {Bahareh Sadat Arab and Dieter Gawlick and Vasudha Krishnaswamy and Venkatesh Radhakrishnan and Boris Glavic},
journal = {IEEE Trans. Knowl. Data Eng.},
number = {3},
pages = {599--612},
title = {Using Reenactment to Retroactively Capture Provenance for Transactions},
volume = {30},
year = {2018}
}
@article{DBLP:journals/debu/ArabFGLNZ17,
author = {Bahareh Sadat Arab and Su Feng and Boris Glavic and Seokki Lee and Xing Niu and Qitian Zeng},
journal = {IEEE Data Eng. Bull.},
number = {1},
pages = {51--62},
title = {GProM - A Swiss Army Knife for Your Provenance Needs},
volume = {41},
year = {2018}
}
@incollection{BC04a,
author = {Bertossi, Leopoldo and Chomicki, Jan},
@@ -99,12 +81,7 @@ Paris, France, April 16-19, 2018},
year = {2004}
}
@article{BB14,
author = {Bhardwaj, Anant and Bhattacherjee, Souvik and Chavan, Amit and Deshpande, Amol and Elmore, Aaron J and Madden, Samuel and Parameswaran, Aditya G},
journal = {arXiv preprint arXiv:1409.0798},
title = {DataHub: Collaborative Data Science \& Dataset Version Management at Scale},
year = {2014}
}
@article{BD15,
author = {Bhardwaj, Anant and Deshpande, Amol and Elmore, Aaron J and Karger, David and Madden, Sam and Parameswaran, Aditya and Subramanyam, Harihar and Wu, Eugene and Zhang, Rebecca},
@@ -171,7 +148,7 @@ H. V. Jagadish},
@inproceedings{CF06b,
author = {Callahan, Steven P and Freire, Juliana and Santos, Emanuele and Scheidegger, Carlos E and Silva, Claudio T and Vo, Huy T},
booktitle = {Data Engineering Workshops, 2006. Proceedings. 22nd International Conference on},
booktitle = {ICDE Workshops},
pages = {71--71},
title = {Managing the evolution of dataflows with vistrails},
year = {2006}
@@ -312,12 +289,6 @@ H. V. Jagadish},
year = {2016}
}
@phdthesis{F07a,
author = {Fuxman, A.D.},
school = {University of Toronto},
title = {Efficient query processing over inconsistent databases},
year = {2007}
}
@inproceedings{FM05,
author = {Fuxman, Ariel D and Miller, Renée J},
@@ -475,12 +446,8 @@ century},
}
@inproceedings{koop@tapp2017,
address = {Berkeley, CA, USA},
author = {Koop, David and Patel, Jay},
booktitle = {TaPP},
numpages = {1},
pages = {17--17},
series = {TaPP'17},
title = {Dataflow Notebooks: Encoding and Tracking Dependencies of Cells},
year = {2017}
}
@@ -592,8 +559,7 @@ Shankar Pal and
Istvan Cseri and
Gideon Schaller and
Nigel Westbury},
booktitle = {Proceedings of the ACM SIGMOD International Conference on Management
of Data, Paris, France, June 13-18, 2004},
booktitle = {SIGMOD},
pages = {903--908},
title = {ORDPATHs: Insert-Friendly XML Node Labels},
year = {2004}
@@ -752,7 +718,10 @@ Xibei Jia},
@inproceedings{XH,
author = {Xu, Liqi and Huang, Silu and Hui, Sili and Elmore, A and Parameswaran, Aditya},
title = {ORPHEUSDB: A lightweight approach to relational dataset versioning}
title = {{OrpheusDB}: {A} Lightweight Approach to Relational Dataset Versioning},
booktitle = {SIGMOD},
pages = {1655--1658},
year = {2017},
}
@article{yang2015lenses,


@@ -47,7 +47,6 @@
\documentclass{sig-alternate}
\usepackage{cleveref}
\usepackage{listings}
\usepackage{todonotes}
\usepackage{xspace}
@@ -75,6 +74,12 @@
\newcommand{\trimfigurespacing}{\vspace*{-5mm}}
\newcommand{\hide}{}
\usepackage{url}
\def\UrlBreaks{\do\/\do-}
\usepackage{breakurl}
\usepackage[bookmarks=false,breaklinks]{hyperref}
\usepackage{cleveref}
\begin{document}
% Copyright
@@ -106,7 +111,7 @@
% --- End of Author Metadata ---
% \title{What needs to be REPLaced in notebooks}
\title{Your notebook is not crumby enough, REPLace it.}
\title{Your notebook is not crumby enough, REPLace it}
%
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
@@ -256,6 +261,15 @@
\label{sec:future}
\input{sections/future.tex}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Acknowledgments}
This work is supported in part by NSF Awards OAC-1640864 and
CNS-1229185, the NYU Moore Sloan Data Science Environment, and DARPA
D3M. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not
necessarily reflect the views of the funders.
%ACKNOWLEDGMENTS are optional
% \section{Acknowledgments}


@@ -1,26 +1,26 @@
% Optional fields: author, title, howpublished, month, year, note
@MISC{vanderplass:2017:reproducibility,
howpublished = {https://twitter.com/jakevdp/status/935178916490223616},
howpublished = {\url{https://twitter.com/jakevdp/status/935178916490223616}},
title = {Idea: Jupyter notebooks could have a "reproducibility mode"},
author = {Jake VanderPlas}
}
% Optional fields: author, title, howpublished, month, year, note
@MISC{zelnicki:2017:nodebook,
howpublished = {https://multithreaded.stitchfix.com/blog/2017/07/26/nodebook/},
howpublished = {\url{https://multithreaded.stitchfix.com/blog/2017/07/26/nodebook/}},
author = {Kevin Zielnicki},
title = {Nodebook}
}
% Optional fields: author, title, howpublished, month, year, note
@MISC{jobevers:2018:jupyterOrderOfExec,
howpublished = {https://github.com/jupyter/notebook/issues/3229},
howpublished = {\url{https://github.com/jupyter/notebook/issues/3229}},
author = {Job Evers-Meltzer},
title = {Enforce a top-down order of execution}
}
@MISC{nyt:wrangling,
howpublished = {http://nyti.ms/1Aqif2X},
howpublished = {\url{http://nyti.ms/1Aqif2X}},
author = {S. Lohr},
title = {For big-data scientists, janitor work is key hurdle to insights.},
year = {2014}


@@ -1,35 +1,22 @@
% -*- root: ../paper.tex -*-
Notebook and spreadsheet systems are currently the de-facto standard for data collection, preparation, and analysis.
However, these systems have been criticized for their lack of
reproducibility, versioning, and support for sharing.
%
These shortcomings are particularly detrimental for
data curation where data scientists iteratively
build workflows to clean up and integrate data as a prerequisite for
analysis.
% \hide{ JF: here, there is a disconnect, since in the prev parag we
% talk about spreadsheets too. Also, we get into details that may not
% be clear for readers without giving some background first-- I
% suggest we remove the sentence below. %
% A key reason for these shortcomings is an impedence mismatch between
% the notebook user interface (as a sequence of steps) and the
% underlying implementation of most notebooks (as a library of code
% snippets).}
%
We present Vizier, an open-source tool that helps analysts to
iteratively build and refine data pipelines. Vizier combines the flexibility
of notebooks with the easy-to-use data manipulation
interface of spreadsheets.
%a publicly available, open-source workflow-based notebook system aimed at helping analysts to iteratively build and refine data pipelines.
%We highlight two features of Vizier: A spreadsheet interface for
%simultaneous exploration and direct manipulation of data, and caveats,
%an advanced approach for tracking potential data errors.
Combined with advanced provenance tracking for both data
and computational steps this enables reproducibility, versioning, and
streamlined data exploration.
% caveats
Unique to Vizier is that it exposes potential issues with data, no matter whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as \emph{data caveats}. Caveats are propagated alongside data using principled techniques from uncertain data management. Vizier provides extensive user interface support for caveats, e.g., exposing them as summaries in a dedicated error view and highlighting cells with caveats in spreadsheets.
Notebook and spreadsheet systems are currently the de-facto standard for data
collection, preparation, and analysis. However, these systems have been
criticized for their lack of reproducibility, versioning, and support for
sharing. These shortcomings are particularly detrimental for data curation where
data scientists iteratively build workflows to clean up and integrate data as a
prerequisite for analysis. We present Vizier, an open-source tool that helps
analysts to build and refine data pipelines. Vizier combines the flexibility of
notebooks with the easy-to-use data manipulation interface of spreadsheets.
Combined with advanced provenance tracking for both data and computational steps
this enables reproducibility, versioning, and streamlined data exploration.
Unique to Vizier is that it exposes potential issues with data, no matter
whether they already exist in the input or are introduced by the operations of a
notebook. We refer to such potential errors as \emph{data caveats}. Caveats are
propagated alongside data using principled techniques from uncertain data
management. Vizier provides extensive user interface support for caveats, e.g.,
exposing them as summaries in a dedicated error view and highlighting cells with
caveats in spreadsheets.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../paper"


@@ -104,7 +104,7 @@ When needed, Vizier can undertake the more expensive task of deriving the full s
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Instrumenting Queries}
Even just determining whether a data value or row is affected by a caveat is analogous to determining certain answers for a query applied to an incomplete database~\cite{Imielinski:1984:IIR:1634.1886} (i.e., CoNP-complete for realtively simple types of queries~\cite{feng:2019:sigmod:uncertainty}).
Even just determining whether a data value or row is affected by a caveat is analogous to determining certain answers for a query applied to an incomplete database~\cite{Imielinski:1984:IIR:1634.1886} (i.e., CoNP-complete for relatively simple types of queries~\cite{feng:2019:sigmod:uncertainty}).
Thus, Vizier adopts a conservative approximation: All rows or cells that depend on a caveatted value are guaranteed to be marked.
It is theoretically possible, although rare in practice~\cite{feng:2019:sigmod:uncertainty} for the algorithm to unnecessarily mark cells or rows.
Specifically, queries are rewritten recursively using an extension of the scheme detailed in~\cite{feng:2019:sigmod:uncertainty} to add Boolean-valued attributes that indicate whether a column, or the entire row depends on a caveatted value.
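The rewrite scheme described above can be illustrated with a small, self-contained sketch. This is a hypothetical toy example, not Vizier's actual rewriting engine: each attribute gets a companion Boolean flag column, and the rewritten query marks an output whenever any value it depends on is marked — the conservative over-approximation the text describes.

```python
import sqlite3

# Hypothetical sketch of the conservative rewrite: every attribute A gets a
# companion Boolean column A_caveat. A rewritten query marks an output value
# whenever any value it depends on is marked (an over-approximation).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts(id TEXT, cost TEXT, cost_caveat INTEGER);
    INSERT INTO parts VALUES ('12345', '19.99', 0),
                             ('12346', 'N/A',   1);  -- failed cast, caveatted
""")

# Original query:  SELECT AVG(CAST(cost AS REAL)) FROM parts
# Rewritten query: the aggregate depends on every input cost, so its caveat
# flag is the OR (here: MAX) of all input flags.
avg_cost, avg_caveat = conn.execute("""
    SELECT AVG(CAST(cost AS REAL)), MAX(cost_caveat) FROM parts
""").fetchone()
```

The aggregate result is flagged even though only one of its inputs carries a caveat, which is exactly the guaranteed-to-mark behavior described above.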
@@ -176,7 +176,7 @@ In the spreadsheet view, clicking on a field or row-header opens up a pop-up lis
Vizier also provides a dedicated view to list caveats on any dataset and its rows, fields, or columns.
As before, we adopt a conservative approximation --- it is possible, though rare, for a caveat to be displayed in this list unnecessarily.
The first of generating the caveat details views is to statically analyze the query.
The first step of generating the caveat details views is to statically analyze the query.
This analysis produces a \emph{CaveatSet} query, which computes the \lstinline{id} and \lstinline{message} parameters for every relevant call to the \lstinline{caveat} function.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -185,7 +185,7 @@ Bob now asks for caveats affecting the average price of part \lstinline{12345}.
The \lstinline{caveat} function is used exactly once, so we obtain a single CaveatSet:
%
\begin{lstlisting}
SELECT cost&' is not an int' AS message,
SELECT cost || ' is not an int' AS message,
ROWID AS id
FROM parts_uncast WHERE id = '12345'
\end{lstlisting}
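Run against a toy `parts_uncast` table (illustrative data only; SQLite stands in for the actual engine, with an explicit row-id column in place of the engine-specific `ROWID`), the CaveatSet query above produces one message per relevant row:

```python
import sqlite3

# Toy data standing in for Bob's parts_uncast table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parts_uncast(row_id INTEGER, id TEXT, cost TEXT);
    INSERT INTO parts_uncast VALUES (1, '12345', 'N/A'),
                                    (2, '67890', '3.50');
""")

# The CaveatSet query from the listing above, evaluated directly.
caveats = conn.execute("""
    SELECT cost || ' is not an int' AS message, row_id AS id
    FROM parts_uncast WHERE id = '12345'
""").fetchall()
# caveats == [('N/A is not an int', 1)]
```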


@@ -25,15 +25,15 @@ scientists spend most of their time on the complex tasks of data curation and ex
% and exploration are complex and time-consuming tasks.
As a prerequisite for analysis, a data scientist has to find the right datasets
for their task and then iteratively construct a pipeline of curation operations
to clean and integrate the data.
Typically, this is not a linear process, but requires backtracking, e.g., to fix a problem with a curation step that causes errors in later stages of the pipeline.
to clean and integrate the data.
Typically, this is not a linear process, but requires backtracking, e.g., to fix a problem with a curation step that causes errors in later stages of the pipeline.
The lack of support for automatic dependency tracking in existing tools is already detrimental, as problems with later stages introduced as a result of changing an earlier stage might never be detected.
However, dependency tracking alone is not sufficient to aid an analyst in iterative refinement of a dataset.
As the name implies, data problems are often repaired a little at a time:
However, dependency tracking alone is not sufficient to aid an analyst in iterative refinement of a dataset.
As the name curation implies, data problems are often repaired one bit at a time:
In early stages of data exploration, data quality is less of a concern than data structure and content, and investing heavily in cleaning could be wasteful if the data turns out to be inappropriate for the analyst's needs.
Similarly, in the early stages of data preparation, it is often necessary to take shortcuts like heuristic or manual repairs that, although sufficient for the current dataset and analysis, may not generalize.
As progressively more critical decisions are made based on the data, such deferred curation tasks are typically revisited and addressed if neccessary.
However, deferring cleaning tasks requires analysts to keep fastidious notes, and to continously track possible implications.
As progressively more critical decisions are made based on the data, such deferred curation tasks are typically revisited and addressed if necessary.
However, deferring cleaning tasks requires analysts to keep fastidious notes, and to continuously track possible implications of the heuristic choices they make during curation.
There is substantial work on detecting data errors (e.g., \cite{DBLP:journals/pvldb/AbedjanCDFIOPST16,DBLP:conf/sigmod/ChuIKW16}) and streamlining the cleaning process (e.g., \cite{DBLP:conf/sigmod/ChuIKW16,DBLP:journals/pvldb/FanGJ08a}).
However, effective use of error detection still requires \emph{upfront} cleaning effort, and with only rare exceptions (e.g., \cite{DBLP:journals/pvldb/BeskalesIG10}) automatic curation obscures the potential implications of its heuristic choices.
@@ -41,10 +41,10 @@ Ideally a data exploration system would supplement automatic error detection and
\noindent In short, we target four limitations of existing work:\\
%\begin{itemize}
\textbf{• Reproducibility.} The nature of most existing notebook systems as wrappers around REPLs leads to non-reproducible analysis and unintuitive and hard to track errors during iterative pipeline construction. \\
\textbf{• Reproducibility.} The nature of most existing notebook systems as wrappers around REPLs leads to non-reproducible analysis and unintuitive and hard to track errors during iterative pipeline construction. \\[3mm]
\textbf{• Direct Manipulation.} It is often necessary to manually manipulate data (e.g., to apply simple one-off repairs), pulling users out of the notebook environment and limiting the notebook's ability to serve as a historical record.\\
\textbf{• Versioning and Sharing.} Existing notebook and spreadsheet systems often lack versioning capabilities, forcing users to rely on manual versioning using version control systems like git and hosting platforms (git forges) like github.\\
\textbf{• Uncertainty and Error Tracking.} Existing systems do not expose, track, or manage deferred curation tasks, nor their interactions with data transformations and analyses.
\textbf{• Uncertainty and Error Tracking.} Existing systems do not expose, track, or manage issues with data and deferred curation tasks, nor their interactions with data transformations and analyses.
\medskip
@@ -61,7 +61,7 @@ precluding out-of-order execution, a common source of
frustration\footnote{ {\small
\url{https://twitter.com/jakevdp/status/935178916490223616}\\
\url{https://multithreaded.stitchfix.com/blog/2017/07/26/nodebook}\\
\url{https://github.com/jupyter/notebook/issues/3229} }} and
\url{https://github.com/jupyter/notebook/issues/3229} }} and a cause of
non-reproducible workflows~\cite{pimentel:2019:msr:large}. To aid
reproducibility, Vizier maintains a full version history for
% of edits to
@@ -71,7 +71,7 @@ Vizier facilitates debugging and re-usability of data and workflows by tracking
Vizier can propagate these annotations through operations of a notebook based on a principled, yet lightweight, uncertainty model called UA-DBs~\cite{feng:2019:sigmod:uncertainty}.
While some aspects of Vizier such as automated dependency tracking for notebooks, versioning, and workflow provenance tracking are also supported by other approaches, the combination of these features and the support for caveats leads to a system that is more than the sum of its components and provides unique capabilities that to the best of our knowledge are not supported by any other approach.
Many aspects of Vizier, including parts of its user interface~\cite{freire:2016:hilda:exception,kumari:2016:qdb:communicating}, provenance models~\cite{DBLP:journals/debu/ArabFGLNZ17,DBLP:conf/visualization/BavoilCSVCSF05}, and caveats~\cite{yang2015lenses,feng:2019:sigmod:uncertainty}, were explored independently in prior work.
Many aspects of Vizier, including parts of its user interface~\cite{freire:2016:hilda:exception,kumari:2016:qdb:communicating}, provenance models~\cite{AG17c,DBLP:conf/visualization/BavoilCSVCSF05}, and caveats~\cite{yang2015lenses,feng:2019:sigmod:uncertainty}, were explored independently in prior work.
In this paper, we focus on the challenge of unifying these components into a cohesive, notebook-style system for data exploration and curation, and in particular on the practical implementation of caveats and Vizier's spreadsheet interface.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -92,7 +92,7 @@ Such \emph{breadcrumbing} can be very powerful if performed correctly.
Results produced from the notebook can be reproduced on new data;
intermediate state resulting from earlier cells can be re-used for different analyses; or
the notebook itself can be used as a prototype for a production system like a dashboard or data curation pipeline.
However, breadcrumbing in a REPL-based notebook also requires extreme diligence from the user.
However, breadcrumbing in a REPL-based notebook requires extreme diligence from the user.
She must manually divide tasks into independent logical chunks.
She must mentally track dependencies between cells to ensure that revisions to earlier steps do not break downstream cells.
She must also explicitly design around cells with side effects like relational database updates or file writes.
@@ -110,19 +110,19 @@ First, particularly in the presence of native code (e.g., for popular libraries
Thus, if an existing cell is edited, % it is most efficient to simply execute
it will be executed on the REPL's current state, requiring the \emph{user} to ensure that it is operating on a consistent input state (e.g., by manually ensuring that each cell is idempotent).
%
Second, treating the state as an opaque blob makes it difficult to automatically invalidate and recompute dependencies of a modified cell.
Dependency tracking through REPL state is possible --- For example Nodebook~\cite{zelnicki:2017:nodebook} automatically tracks data dependencies across Python cells and caches cell outputs.
However, such dependency tracking is often limited --- Nodebook, for example, makes a strong, user-enforced assumption that all variables are immutable.
Second, treating the state as an opaque blob makes it difficult to automatically invalidate and recompute dependencies of a modified cell.
Dependency tracking through REPL state is possible. However, such dependency tracking is often limited.
For example, Nodebook~\cite{zelnicki:2017:nodebook} tracks data dependencies across Python cells and caches cell outputs, but makes the strong assumption that all variables are immutable. Even worse, the user is held responsible for enforcing this assumption.
Vizier addresses both problems through a simple, but robust, solution: isolating cells.
Vizier addresses both problems through a simple, but robust, solution: isolating cells.
In lieu of a REPL, cells execute in independent interpreter contexts.
Our approach is similar to Koop's~\cite{koop@tapp2017} proposal of a dataflow notebook where notebook cells can refer to specific outputs of specific cells, but without the need to manually manage inter-cell dependencies.
Our approach is similar to Koop's~\cite{koop@tapp2017} proposal of a dataflow notebook where notebook cells can refer to specific outputs of specific cells, but without the need to manually manage inter-cell dependencies and with strong isolation guarantees for cells.
Specifically, cells can communicate by consuming and producing \textit{datasets}. % (relational tables). %, but through a well-defined, structured data model: a relational database.
This well-defined API enables efficient state snapshots, as well as dependency analysis across cell types (e.g., a Scala cell may depend on an SQL cell's output). %, which in turn makes it possible for cell execution order in Vizier to mirror the order in which the cells appear.
Whenever Alice updates a cell, Vizier automatically re-executes all cells that directly or indirectly depend on the updated cell.
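Because cells communicate only through declared dataset reads and writes, invalidation reduces to a single pass over the notebook in execution order. A minimal sketch, with a hypothetical cell representation that is not Vizier's actual API:

```python
def cells_to_rerun(cells, edited):
    """Given cells as (name, reads, writes) tuples in notebook order,
    return the cells to re-execute after `edited` changes, propagating
    dirtiness through the datasets each cell produces."""
    dirty = set()   # datasets whose contents may have changed
    rerun = []
    for name, reads, writes in cells:
        if name == edited or dirty & set(reads):
            rerun.append(name)
            dirty |= set(writes)
    return rerun

# A toy notebook: "stats" reads the raw data directly, so editing
# "clean" does not invalidate it.
notebook = [
    ("load",  set(),    {"raw"}),
    ("clean", {"raw"},  {"tidy"}),
    ("plot",  {"tidy"}, {"figure"}),
    ("stats", {"raw"},  {"summary"}),
]
```

Here `cells_to_rerun(notebook, "clean")` returns only `["clean", "plot"]`, while editing `"load"` cascades to every cell.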
In Vizier each change to a notebook cell or edit in a spreadsheet creates, conceptually, a new version of the notebook (and of the results of all cells of the notebook).
Vizier maintains a full history of these versions.
In Vizier each change to a notebook cell or edit in a spreadsheet creates, conceptually, a new version of the notebook (and of the results of all cells of the notebook).
Vizier maintains a full history of these versions.
With Vizier, if Alice's notebook evolves in a non-productive direction, she can always backtrack to any previous version. Furthermore, any version of a notebook and dataset has a unique URL that she can share with collaborators.
% Vizier further leverages its structured data model % and provenance tracking capabilities
@@ -144,7 +144,7 @@ Intuitively, a caveat indicates that an element is \emph{potentially} erroneous
% This is a critical problem for data curation, where time and resource constraints often make it necessary to rely on quick-fixes: heuristic data transformations that are sufficient for the current dataset and the questions currently being asked of it.
The need for caveats arises, because decisions made during error detection and
cleaning typically are uncertain or depend on assumptions about the data. That
is true no matter these decisions are made by the user or by an automated data
is true no matter whether these decisions are made by the user or by an automated data
curation method. For example, data values may be incorrectly flagged as errors
or automated data cleaning techniques may choose an incorrect repair from a space
of possible repair options. A way to model this uncertainty is to encode the
@@ -166,7 +166,7 @@ Notebook systems used for data curation should be able to model and track the un
We emphasize that caveats are orthogonal to any specific error detection or cleaning schemes.
A wide range of such tools can be wrapped to expose their heuristic assumptions and any resulting uncertainty to Vizier, integrating them into the Vizier ecosystem.
Vizier then automatically tracks caveats and handles other advanced features such as versioning and cell dependency tracking.
Vizier then automatically tracks caveats and handles other advanced features such as versioning and cell dependency tracking.
In our experience, extending methods to expose caveats is often quite straight-forward (e.g., see \cite{yang2015lenses} for several examples), and Vizier already supports a range of caveat-enabled data cleaning and error detection operations (see \Cref{fig:cellTypes} for a list of currently supported cell types).
Similarly, Vizier's data load operation also relies on caveats as a non-invasive and robust (data errors do not block the notebook) way to communicate records that fail to load correctly.
To support one-off curation transformations, Vizier also allows users to create caveats manually through its spreadsheet interface or programmatically via \texttt{Python}, \texttt{Scala}, or \texttt{SQL} cells.
@@ -198,7 +198,7 @@ However, to do so safely, he needs to (i) understand all of the heuristic assump
Current best practices suggest documenting heuristic data transformations out-of-band.
For example, Alice might note her assumption that ``all prices are in US Dollars'' in a README file attached to the dataset, or in a comment in her data preparation script.
However, even when such documentation exists, it can get lost in pile of similar notes, many of which may not be relevant \emph{to a specific user asking a specific question}. Moreover, the documentation is not directly associated with the individual data values that it refers to. Out-of-band documentation makes it difficult for a user making changes to understand how the assumptions affect their code and data. Using Vizier, Alice can ensure that her assumptions are documented and will transition when her workflow is re-used in the future.
However, even when such documentation exists, it can get lost in a pile of similar notes, many of which may not be relevant \emph{to a specific user asking a specific question}. Moreover, the documentation is not directly associated with the individual data values that it refers to. Out-of-band documentation makes it difficult for a user making changes to understand how the assumptions affect their code and data. Using Vizier, Alice can ensure that her assumptions are documented and will transition when her workflow is re-used in the future.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{example}
@@ -214,18 +214,18 @@ Later, when Bob re-uses Alice's script, any prices containing non-digit characte
\tinysection{Automatic Propagation of Data Caveats}
%
% \emph{Caveat}s are lightweight, human-readable annotations attached to data values or records that mark the data value as the result of a heuristic data transformation, and thus possibly suspect or subject to change.
Documenting errors and uncertainty is important.
Documenting errors and uncertainty is important.
However, for complex workflows the user needs to understand how caveats for one dataset affect data produced by cells that directly or indirectly depend on this dataset.
% Crucially, Vizier propagates caveats through all notebook cells:
% The outputs of a cell automatically reflect caveats in its input.
Vizier supports this functionality by automatically propagating caveats through data transformations based on uncertain data management principles. Formally, we may think of a data value annotated with a caveat as representing a \textit{labeled null} (the correct value is unknown) complemented with a guess for the unknown value (the annotated data value).
Vizier supports this functionality by automatically propagating caveats through data transformations based on uncertain data management principles. Formally, we may think of a data value annotated with a caveat as representing a \textit{labeled null} (the correct value is unknown) complemented with a guess for the unknown value (the annotated data value).
Then a dataset with caveats can be thought of as an approximation of an incomplete database~\cite{Imielinski:1984:IIR:1634.1886} where the values encode one possible world and the caveats encode an under-approximation of the certain values in the incomplete database: any row not marked with a caveat
is a \textit{certain answer}.
Wherever possible, Vizier preserves this property when propagating caveats.
We note that unlike traditional incomplete (or probabilistic) databases, we can not assume that users will be able to (or have the time) to completely, precisely characterize the uncertainty in their data and workflows.
Thus precisely characterizing the set of certain answers is not generally possible, and we apply our conservative approximation from~\cite{feng:2019:sigmod:uncertainty} to propagate caveats.
For certain cell types no techniques for propagating incompleteness in this fashion are known.
Thus, Vizier propagates caveats based on fine-grained provenance (data-dependencies) when supported (e.g., for SQL queries) or based on coarse-grained provenance when fine-grained provenance is not supported for a cell type (e.g., a Python cell).
is a \textit{certain answer}.
Wherever possible, Vizier preserves this property when propagating caveats.
We note that unlike traditional incomplete (or probabilistic) databases, we cannot assume that users will be able (or have the time) to completely and precisely characterize the uncertainty in their data and workflows.
Thus, precisely characterizing the set of certain answers is in general not possible. We apply our conservative approximation from~\cite{feng:2019:sigmod:uncertainty} to propagate caveats.
For certain cell types no techniques for propagating incompleteness in this fashion are known.
Thus, Vizier propagates caveats based on fine-grained provenance (data-dependencies) when supported (e.g., for SQL queries) or based on coarse-grained provenance when fine-grained provenance is not supported for a cell type (e.g., a Python cell).
The rationale is that a value is most likely affected by caveats associated with values it depends on\footnote{Of course, this is not guaranteed to be the case, e.g., if a missing value with caveats should have been included.}.
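The contrast between fine- and coarse-grained propagation can be made concrete with a toy sketch. The representation here is purely illustrative, not Vizier's internal one:

```python
# Illustrative only: rows carry a Boolean caveat flag.
rows = [(10, False), (25, True), (40, False)]  # (value, has_caveat)

# Fine-grained propagation (e.g., SQL cells): each output keeps only the
# caveats of the inputs it actually depends on, so unflagged outputs
# remain certain answers.
doubled = [(v * 2, c) for v, c in rows]

# Coarse-grained propagation (e.g., an opaque Python cell): without
# value-level provenance, every output is assumed to depend on every
# input, so the result is flagged if any input carries a caveat.
total = (sum(v for v, _ in rows), any(c for _, c in rows))
```

In the fine-grained case only the middle output is flagged; in the coarse-grained case the single caveatted input taints the aggregate result.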
%\tinysection{Approximation}


\tinysection{Error Detection, Data Curation and Cleaning}
Automated data curation and cleaning tools help users to prepare their data for analysis by detecting and potentially repairing errors~\cite{DBLP:journals/pvldb/AbedjanCDFIOPST16,DBLP:conf/sigmod/ChuIKW16}.
These tools employ techniques such as constraint-based data cleaning~\cite{DBLP:journals/pvldb/FanGJ08a}, transformation scripts aka wrangling~\cite{DBLP:conf/chi/KandelPHH11}, entity resolution~\cite{GM12,DBLP:books/daglib/0030287} and data fusion~\cite{BN09}, and many others.
While great progress has been made, error detection and repair are typically heuristic in nature, since there is insufficient information to determine which data values are erroneous let alone what repair is correct. % ather than proposing yet another data repair and error detection scheme,
Vizier enhances existing data cleaning and curation techniques by exposing the uncertainty in their decisions as data caveats and tracks the effect of caveats on further curation and analysis steps using a principled, yet efficient solution for incomplete data management~\cite{yang2015lenses,feng:2019:sigmod:uncertainty}. Thus, our solution enhances existing techniques with new functionality instead of replacing them. Based on our experience, wrapping existing techniques to expose uncertainty is often surprisingly straightforward. We expect to extend Vizier with many additional error detection and cleaning techniques in the future.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Versioning and Provenance}
Another problem with notebook systems is their lack of versioning capabilities. For reproducibility and collaboration, it is essential to keep track of both versions of the datasets produced and consumed by notebooks as well as versions of the notebook itself. Versioning is closely related to data provenance which tracks the creation process of data keeping track of both dependencies among data items and the processes and actors involved in the creation process. The W3C PROV standard~\cite{MB13a} has been proposed as an application-independent way of representing provenance information.
Provenance in workflow systems has been studied intensively in the past~\cite{CW17a,BM08,DF08,SV08,DC07,FS12,SV08,CF06b,DBLP:conf/visualization/BavoilCSVCSF05}. So-called retrospective provenance, the data and control-dependencies of a workflow execution, can be used to reproduce a result and understand how it was derived. Koop~\cite{DBLP:conf/ipaw/Koop16} and \cite{CF06b} propose to track the provenance of how a workflow evolves over time in addition to tracking the provenance of its executions. Niu et al.~\cite{DBLP:journals/pvldb/NiuALFZGKLG17} use a similar model to enable ``provenance-aware data workspaces'' which allow analysts to non-destructively change their workflows and update their data.
In the context of dataset versioning, prior work has investigated optimized storage for versioned datasets~\cite{XH,BD15,MG16a}. Bhattacherjee et al.~\cite{BC15a} study the trade-off between storage and recreation cost for versioned datasets.
The version graphs used in this work essentially track coarse-grained provenance.
The Nectar system~\cite{GR10} automatically caches intermediate results of distributed dataflow computations also trading storage versus computational cost.
Similarly, metadata management systems like Ground and Apache Atlas (\url{https://atlas.apache.org/}) manage coarse-grained provenance for da\-ta\-sets in a data lake.
In contrast to workflow provenance which is often coarse-grained, i.e., at the level of datasets, database provenance is typically more fine-grained, e.g., at the level of rows~\cite{CC09,HD17,AF18,AG17c,GM13,SJ18,MD18}. Many systems capture database provenance by annotating data and propagating these annotations during query processing.
Vizier's version and provenance management techniques integrate several lines of prior work by the authors including tracking the provenance of workflow versions~\cite{DBLP:journals/concurrency/ScheideggerKSVCFS08,XN16}, provenance tracking for updates and reenactment~\cite{AG17c,DBLP:journals/pvldb/NiuALFZGKLG17}, and using provenance-based techniques for tracking uncertainty annotations~\cite{yang2015lenses,feng:2019:sigmod:uncertainty}.
The result is a system that is more than the sum of its components and, to the best of our knowledge, is the first system to support all of these features.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Uncertain Data}
Vizier's caveats are a practical application of uncertain data management. Incomplete~\cite{GM18a,CL03b,M98b,IL84a}, inconsistent~\cite{FM05,BC04a,CL03b}, and probabilistic databases~\cite{SO11,OH10,WT08,AK07,RD06a} have been studied for several decades. % in the past.
However, even simple types of queries become intractable when evaluated over uncertain data. While approximation techniques have been proposed (e.g., ~\cite{GM18a,OH10,GP17}), these techniques are often still not efficient enough, ignore useful, albeit uncertain, data, or do not support complex queries. In~\cite{feng:2019:sigmod:uncertainty} we formalized \emph{uncertainty-annotated databases} (\emph{UA-DBs}), a light-weight model for uncertain data where rows are annotated as either certain or uncertain.
In~\cite{yang2015lenses} we introduced Lenses which are uncertain versions of data curation and cleaning operators that represent the uncertainty inherent in a curation step using an attribute-level version of the UA-DB model. Data caveats in Vizier generalize this idea to support non-relational operations and to enrich such annotations with additional information to record more details about data errors.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Data Spreadsheets}
Approaches like DataSpread and others~\cite{DBLP:conf/icde/BendreVZCP18,DBLP:conf/icde/LiuJ09,DBLP:conf/sigmod/BakkeK16} utilize spreadsheet interfaces as front-ends for databases. Vizier stands out through its seamless integration of spreadsheets and notebooks~\cite{freire:2016:hilda:exception}. Like other approaches that improve the usability of databases~\cite{DBLP:journals/debu/LiJ12}, Vizier provides a simple user interface that can be used effectively by both experts and non-experts and does not require any background in relational data processing to be understood. Furthermore, we argue in~\cite{freire:2016:hilda:exception} that the spreadsheets and notebook interfaces complement each other well for data curation and exploration tasks. For example, spreadsheets are suited well for handling rare exceptions by manually updating cells and are convenient for certain schema-level operations (e.g., creating or deleting columns) while notebooks are more suited for complex workflows and bulk operations (e.g., automated data repair).
Integrating the spreadsheet paradigm, which heavily emphasizes updates (e.g., a user overwrites the value of a cell), with Vizier's functional, data-flow model of notebook workflows would have been challenging if not for our prior work on \emph{reenactment}~\cite{AG17c,AF18,DBLP:journals/pvldb/NiuALFZGKLG17}. Reenactment enables us to translate updates into queries (side-effect free functions).
%%% Local Variables:


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Spreadsheet Data Types}
The lightweight interface offered by typical spreadsheets has two impedance mismatches with the more strongly typed relational data model used by Vizier's datasets.
First, types in a spreadsheet are assigned on a per-value basis, but on a per-column basis in a typical relational table.
A spreadsheet allows users to enter arbitrary text into a column of integers.
Because Vizier's history makes undoing a mistake trivial, Vizier assumes the user's action is intentional: column types are escalated (e.g., \texttt{int} to \texttt{float} to \texttt{string}) to allow the newly entered value to be represented as-is.
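The escalation rule can be sketched as follows. This is an illustrative sketch of the int \(\to\) float \(\to\) string chain described above; the helper names are our assumptions, not Vizier's actual code.

```python
# Illustrative sketch of column type escalation (int -> float -> string);
# the chain and helper names are assumptions, not Vizier's implementation.
ESCALATION = ["int", "float", "string"]

def value_type(v):
    """Infer the narrowest type in the chain that can represent the value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(v)
            return name
        except ValueError:
            pass
    return "string"

def escalate(column_type, new_value):
    """Widen the column type just enough to hold the newly entered value."""
    return max(column_type, value_type(new_value), key=ESCALATION.index)
```

For example, entering \texttt{"3.5"} into an \texttt{int} column escalates the column to \texttt{float}, while entering free text escalates it to \texttt{string}; existing values remain representable at every step of the chain.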
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reenactment for Declarative Updates}
Through the spreadsheet interface, users can create, rename, reorder, or delete rows and columns, or alter data --- a standard set of DDL and DML operations for spreadsheets.
These operations cannot be applied in-place without sacrificing the immutability of versions.
To preserve versioning and avoid unnecessary data copies, Vizier builds on a technique called reenactment~\cite{DBLP:journals/pvldb/NiuALFZGKLG17,AF18}, which translates sequences of DML operations into equivalent queries.
We emphasize that the SQL code examples shown in this section are produced automatically as part of the translation of Vizual into SQL queries. Users do not need to write SQL queries to express spreadsheet operations. The user's actions in the spreadsheet are automatically added as Vizual cells to the notebook and these Vizual operations are automatically translated into equivalent SQL DDL/DML expressions~\cite{freire:2016:hilda:exception}.
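To give the flavor of this translation, a simplified sketch follows: an in-place \lstinline{UPDATE} over the previous dataset version is reenacted as a side-effect-free query that defines the new version. The generator below is our simplification for illustration, not the actual translation rules from the cited work.

```python
# Simplified sketch of reenactment: an UPDATE ... SET col = expr WHERE pred
# over the previous dataset version becomes an equivalent SELECT that defines
# the new version, leaving the old version untouched.

def reenact_update(table, columns, set_col, set_expr, where):
    projected = ", ".join(
        f"CASE WHEN {where} THEN {set_expr} ELSE {c} END AS {c}"
        if c == set_col else c
        for c in columns
    )
    return f"SELECT {projected} FROM {table}"
```

For instance, overwriting a suspicious tip value in a spreadsheet cell might be reenacted as `SELECT id, CASE WHEN tip > 1000 THEN NULL ELSE tip END AS tip FROM taxi` (a hypothetical schema), which can be materialized as a view over the prior version.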
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%Column identifiers are already defined by the source table.
%For row identifiers,
For derived data, Vizier uses a row identity model based on GProM's~\cite{AF18} encoding of provenance.
Derived rows, such as those produced by declaratively specified table updates, are identified as follows:
(1) Rows in the output of a projection or selection use the identifier of the source row that produced them;
(2) Rows in the output of a \lstinline{UNION ALL} are identified by the identifier of the source row and an identifier marking which side of the union the row came from\footnote{To preserve associativity and commutativity during optimization, union-handedness is recorded during parsing};
(3) Rows in the output of a cross product or join are identified by combining identifiers from the source rows that produced them into a single identifier; and
(4) Rows in the output of an aggregate are identified by each row's group-by attribute values.
What remains is the base case: datasets loaded into Vizier or created through the workflow API.
We considered three approaches for identifying rows in raw data: order-, hash-, and key-based.
None of these approaches is ideal:
If rows are identified by position, changes to the source data (e.g., uploading a new version) may change row identities.
Worse, identifiers are re-used, potentially re-targeting spreadsheet operations in unintended ways.
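The four derivation rules for derived rows can be sketched as follows. The concrete string encoding of combined identifiers is our assumption for illustration; only the structure of each rule matters.

```python
# Sketch of the four derived-row identifier rules (illustrative encoding).

def id_select(src_id):
    return src_id                      # (1) projection/selection: keep source id

def id_union(src_id, side):
    return f"{side}:{src_id}"          # (2) union: tag which input the row came from

def id_join(left_id, right_id):
    return f"{left_id}|{right_id}"     # (3) join/cross product: combine source ids

def id_aggregate(group_values):
    return "g:" + ",".join(map(str, group_values))  # (4) group-by values identify rows
```

Because identifiers compose structurally, a row of a query result can be traced back through joins and unions to the base rows that produced it.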


\begin{figure}
\centering
\renewcommand{\arraystretch}{1.3}
\begin{tabular}{p{0.8in}|p{1.7in}|c}
\textbf{Category} & \textbf{Cell Type Examples} & \textbf{API} \\ \hline
Script & Python, Scala & Workflow \\
Query & SQL & Dataflow \\
Documentation & Markdown & n/a \\
Point/Click & Plot, Load Data, Export Data & Workflow \\
Spreadsheet DML/DDL & Add/Delete/Move Row,
Add/Delete/Move/Rename Column,
Edit Cell, Sort, Filter & Dataflow \\
Cleaning & Infer Types, Repair Key,
Impute,
Repair Sequence, Merge Columns,
Geocode & Dataflow \\
\end{tabular}
\caption{Cell Types in Vizier}
\label{fig:cellTypes}
\subsection{Cells and Workflow State}
Dataflow in a typical workflow system (e.g., VisTrails~\cite{SV08,DBLP:conf/visualization/BavoilCSVCSF05,DF08}) is explicit.
Steps in the workflow define outputs that are explicitly bound to the inputs expected by subsequent steps.
Conversely, data flow in a notebook is implicit: each cell manipulates a global, shared state.
For example, in Jupyter, this state is the internal state of the REPL interpreter itself (variables, allocated objects, etc\ldots).
Jupyter cells are executed in the context of this interpreter, resulting in a new state visible to subsequently executed cells.
The workflow API, illustrated in \Cref{fig:wfVsDFCells}.a, targets cells where only collecting coarse-grained provenance information is presently feasible.
This includes \texttt{Python} and \texttt{Scala} cells, which implement Turing-complete languages; as well as cells like \texttt{Load Dataset} that manipulate entire datasets.
To discourage out-of-band communication between cells (which hinders reproducibility), as well as to avoid remote code execution attacks when Vizier is run in a public setting, workflow cells are executed in an isolated environment.
Vizier presently supports execution in a fresh interpreter instance (for efficiency) or a docker container (for safety).
Vizier's workflow API is designed accordingly, providing three operations: \emph{Read dataset} (copy a named dataset from Vizier to the isolated execution environment), \emph{Checkpoint dataset} (copy an updated version of a dataset back to Vizier), and \textit{Create dataset} (allocate a new dataset in Vizier).
A more efficient asynchronous, paged version of the \emph{read} operation is also available, and the \emph{Create dataset} operation can optionally initialize a dataset from a URL, S3 Bucket, or Google Sheet to avoid unnecessary copies.
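From the isolated execution environment's perspective, the three operations can be sketched as a small client. The class and method names below are hypothetical; they illustrate the read/checkpoint/create contract, not Vizier's actual API.

```python
# Hypothetical sketch of the three workflow-API operations, seen from the
# isolated execution environment. Names and signatures are illustrative.

class WorkflowClient:
    def __init__(self, store, scope):
        self.store = store            # dataset-version id -> immutable contents
        self.scope = dict(scope)      # human-readable name -> version id

    def read_dataset(self, name):
        """Copy a named dataset from Vizier into the isolated environment."""
        return list(self.store[self.scope[name]])

    def checkpoint_dataset(self, name, rows):
        """Copy an updated dataset back to Vizier as a fresh immutable version."""
        version_id = f"{name}@v{len(self.store)}"
        self.store[version_id] = list(rows)
        self.scope[name] = version_id

    def create_dataset(self, name, rows=()):
        """Allocate a new dataset in Vizier."""
        self.checkpoint_dataset(name, rows)
```

Note that checkpointing never overwrites an existing version; every write allocates a new version id, which is what makes branching and history recovery cheap.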
\tinysection{Dataflow Cells}
Updated versions of the state are defined as views based on these queries.
We emphasize that most dataflow cells do not require the user to write % actual
SQL. SQL is the language used by the implementation of such a cell to communicate with Vizier.
For example, spreadsheet operation cells created as a consequence of edits in the spreadsheet and interactively configured cleaning operations are both dataflow cells.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Under typical usage, edits are applied to a leaf of the tree to create chains of edits.
Editing a prior version of the notebook creates a \emph{branch} in the version history.
Vizier requires users to explicitly name branches when they are created. This explicit branch management makes it easier for users to follow the intent of a notebook's creator.
Internally however, each notebook version is identified by a 2-tuple consisting of a randomly generated, unique branch identifier and a monotonically increasing notebook version number.
\tinysection{Cell Versions}
A cell version is an immutable specification of one step of a notebook workflow.
In its simplest form, the cell version stores the cell's type, as well as any parameters of the cell.
This can include simple parameters like the table and columns to plot for a \texttt{Plot} cell, scripts like those used in the \texttt{SQL}, \texttt{Scala}, and \texttt{Python} cells, as well as references to files like those uploaded into the \texttt{Load Dataset} cell.
In short, the cell configuration contains everything required to deterministically re-execute the cell\footnote{Of course, we assume here that the computation of the cell itself is deterministic. For cells with non-deterministic computation, e.g., random number generators, we cannot guarantee that multiple executions of the same cell yield the same result.}.
A cell version is identified by an identifier derived from a hash of its parameters.
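A minimal sketch of such content-derived identification follows; the choice of hash function, serialization, and truncation length are our assumptions for illustration.

```python
import hashlib
import json

# Sketch: deriving a cell-version identifier from a hash of the cell's
# configuration. Hash choice and serialization are illustrative assumptions.

def cell_version_id(cell_type, params):
    payload = json.dumps({"type": cell_type, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Because the identifier is a pure function of the configuration, two cells with identical type and parameters share an identifier, which is what enables result caching across notebook versions and branches.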
\tinysection{Dataset Versions}
A dataset is presented to the user as a mutable relational table identified by a human-readable name.
Internally however, a dataset version is an immutable Spark dataframe identified by a randomly generated, globally unique identifier.
Keeping dataset versions immutable makes it possible to quickly recover notebook state in between cells and to safely share state across notebook branches.
To preserve the illusion of mutability, Vizier maintains a \emph{scope} that maps human-readable dataset names to the appropriate identifiers.
When a cell is executed, it receives a scope that maps dataset names to dataset version identifiers.
To create or modify a dataset, a cell first initializes a new dataset version: uploading a new dataframe (workflow cells) or creating a Spark view (dataflow cells).
$i \leftarrow \texttt{min($\mathcal D$)}$;
\texttt{eval($\mathcal N$[$i$])}
\Comment{Eval first dirty cell}
\For{$j \in 1 \ldots i$}
\Comment{Mark cells up-to-date}
% \If{$\mathcal N\texttt{[$j$].state} = \texttt{waiting}$}
\State $\mathcal N\texttt{[$j$].state} \leftarrow \texttt{ready}$
% \Comment{... until the first \texttt{dirty} cell}
% \EndIf
\EndFor
% \State $\mathcal N\texttt{[$i$].state} \leftarrow \texttt{ready}$
% \Comment{Mark cell up-to-date}
\For{$j \in i+1 \ldots \texttt{len($\mathcal N$)}$}
\Comment{Mark dependencies}
\If{$\texttt{reads($\mathcal N$[$j$])} \cap \texttt{writes($\mathcal N$[$i$])} \neq \emptyset$}
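A Python rendering of this simplified scheduling step may help; since the listing is truncated here, the action taken on dependent cells is our assumption (re-marking them dirty).

```python
# Sketch of one scheduling step: evaluate the first dirty cell, mark everything
# up to it ready, and re-dirty downstream cells that read what it wrote.
# (The action on dependents is an assumption; the listing above is truncated.)

def step(cells, eval_cell):
    """cells: list of dicts with keys 'state', 'reads', 'writes'."""
    dirty = [i for i, c in enumerate(cells) if c["state"] == "dirty"]
    if not dirty:
        return None
    i = dirty[0]
    eval_cell(cells[i])                     # evaluate first dirty cell
    for c in cells[: i + 1]:                # mark cells up-to-date
        c["state"] = "ready"
    for c in cells[i + 1:]:                 # mark dependencies of cell i
        if c["reads"] & cells[i]["writes"]:
            c["state"] = "dirty"
    return i
```

Repeating this step until no dirty cell remains re-executes exactly the cells that (transitively) depend on the edited cell, in notebook order.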
\tinysection{Analyzing Caveats}
When a dataset version is accessed, the dataflow manager marks data elements (i.e., cells and rows) that have a caveat applied.
To avoid impacting data access latencies, these markings simply indicate the presence or absence of a caveat, but not the associated message or metadata.
Thus, the dataflow manager exposes an interface that allows callers to retrieve the specific caveats affecting a given cell, row, column, or dataset on an as-needed basis.
We discuss caveat handling in more detail in \Cref{sec:caveats}.
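The two-level design (cheap presence flags up front, full records on demand) can be sketched as a small store. The class and method names are our assumptions, not Vizier's actual interface.

```python
# Illustrative sketch of the two-level caveat interface: constant-time
# presence flags at access time, full caveat records fetched on demand.

class CaveatStore:
    def __init__(self):
        self._details = {}               # element id -> list of caveat messages

    def add(self, element, message):
        self._details.setdefault(element, []).append(message)

    def has_caveat(self, element):
        """Cheap presence check used when marking cells/rows in a dataset view."""
        return element in self._details

    def caveats_for(self, element):
        """As-needed retrieval of the caveats affecting a given element."""
        return list(self._details.get(element, []))
```

The presence check is what the spreadsheet and table views consult when highlighting cells; the detailed records are only materialized when the user opens a caveat.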
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Spreadsheet View}
Although data curation tasks can often be scripted, there are numerous situations where a manual override is more efficient.
Practical examples include:
(i) Cleaning tasks requiring manual data validation (e.g., personally contacting a cab driver to confirm the \$1000 tip that the dataset claims that they received),
(ii) Subjective data entry tasks (e.g., ``tagging" user study transcripts),
(iii) One-off repairs requiring human attention (e.g., standardizing notation in a free-entry text field from a survey), or
(iv) Transient ``what-if" exploration (e.g., how is an analysis affected when outliers are removed).
Manual overrides are often performed in a text editor or through a spreadsheet.
In Vizier, manual data overrides are supported through a spreadsheet-style interface, illustrated in \Cref{fig:spreadsheet}.
Opening a dataset version in this spreadsheet view displays the dataset as a relational table.
Users may modify the contents of cells; insert, reorder, rename, or delete columns and rows; or apply simple data transformations like sorting.
We note that in addition to allowing manual data overrides, the spreadsheet interface can be simpler and more accessible to novice users.
As discussed in \Cref{sec:caveats}, the caveat list is a summary, with caveats organized into groups based on the type of error.
The interface also allows caveats to be acknowledged by clicking on the caveat and then clicking ``Acknowledge."
An acknowledged caveat is still displayed in the caveat list, but otherwise ignored.
For example, Vizier will not highlight cells that depend on it. % will not be highlighted in datasets.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{History View}
As noted above, Vizier maintains a branching notebook history.
The history view shown in \Cref{fig:history} displays the history of the current branch: the sequence of edits that led to the currently displayed version of the notebook.
Any prior version of the notebook may be opened in a read-only form.
If the user wishes to edit a prior version of the notebook, they can create a new branch from that version.