master
Oliver Kennedy 2019-08-20 16:01:44 -04:00
parent 717dc62169
commit 71db73fa9b
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
10 changed files with 64 additions and 63 deletions

0
.paper.aux.~2f901976 Normal file

@ -49,7 +49,8 @@ stringstyle=\color{lstreddark},
commentstyle=\color{lstgreen},
mathescape=true,
escapechar=@,
sensitive=true
sensitive=true,
showstringspaces=false
}

File diff suppressed because one or more lines are too long

@ -42,7 +42,7 @@
author = {Louis Bavoil and Steven P. Callahan and Carlos Eduardo Scheidegger and Huy T. Vo and Patricia Crossno and Cl{\'{a}}udio T. Silva and Juliana Freire},
booktitle = {{IEEE} Visualization},
pages = {135--142},
publisher = {{IEEE} Computer Society},
publisher = {{IEEE} CS},
title = {VisTrails: Enabling Interactive Multiple-View Visualizations},
year = 2005
}

@ -244,6 +244,8 @@
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
{\small
\bibliography{oliver,quotes} % sigproc.bib is the name of the Bibliography in this case
}
\end{document}

@ -1,14 +1,14 @@
% -*- root: ../paper.tex -*-
Workflow systems like Vizier often encounter data errors like un-parseable integers, or a line in a CSV file with too many columns, or range errors in user-defined functions.
Analytics tools often encounter data errors like un-parseable integers or CSV files with uneven column counts.
Existing systems take one of three approaches to handling such errors:
(1) The error stops the workflow, often discarding all work done up to this point or leaving the system in an ambiguous state,
(2) The error is silently dropped as the system heuristically recovers (e.g., by deleting spurious characters, or through \lstinline{NULL}s), making debugging efforts significantly harder, or
(3) The system heuristically recovers, but logs the event to a log file that no one looks at.
(1) The error stops the workflow, discarding work done or leaving the system in an ambiguous state;
(2) The error is silently dropped as the system heuristically recovers, impeding debugging efforts; or
(3) The system heuristically recovers and logs the event to a log file that no one looks at.
As described in \Cref{sec:system:caveats}, Vizier takes a fourth approach: by annotating affected fields, rows, columns, and tables with a caveat.
Caveats include a human-readable description of the problem, and are propagated through queries to any dependent elements derived from them, serving as a form of automatic documentation for the dataset.
As described in \Cref{sec:system:caveats}, Vizier takes a fourth approach: annotating affected fields, rows, columns, and tables with a caveat.
Caveats include a human-readable description of the problem, and are propagated through queries to any dependent elements derived from them. %, serving as a form of automatic documentation for the dataset.
Propagation follows a simple rule:
\emph{Is it possible to change the derived value by changing the caveatted value?}
If so, the caveat propagates to the derived value.
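The propagation rule can be illustrated on a single derived value. The following Python sketch is only an illustration under stated assumptions: the helper name `caveat_propagates` and the finite list of alternative values are hypothetical, and this brute-force check is not how Vizier decides propagation (Vizier rewrites queries instead). It simply asks whether substituting another value for the caveatted input can change the derived value.

```python
from typing import Any, Callable, Iterable, Sequence


def caveat_propagates(
    fn: Callable[..., Any],
    inputs: Sequence[Any],
    caveated_index: int,
    alternatives: Iterable[Any],
) -> bool:
    """Return True if replacing the caveatted input with any of the
    given alternatives changes the derived value fn(*inputs)."""
    baseline = fn(*inputs)
    for alt in alternatives:
        trial = list(inputs)
        trial[caveated_index] = alt  # hypothetical "what if" value
        if fn(*trial) != baseline:
            return True
    return False


# The derived value depends on pay, so a caveat on pay propagates.
depends = caveat_propagates(lambda pay, bonus: pay + bonus, (100, 5), 0, [0, 200])

# The derived value ignores pay, so a caveat on pay does not propagate.
ignores = caveat_propagates(lambda pay, dept: dept, (100, "HR"), 0, [0, 200])
```

A check over a finite set of alternatives can only under-approximate the rule, which is one reason a practical system would reason about the query structure rather than enumerate values.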
@ -20,73 +20,68 @@ Here, we focus on the practical challenges of realizing caveats in Vizier.
\subsection{Applying \abbrCaveatsCap}
Caveats in Vizier are implemented by Mimir, a simple query rewriting frontend over Spark's dataframes.
To annotate data values with caveats, Mimir provides a new function: \lstinline{caveat(id, value, message)}.
The function takes the value to annotate, and a message describing the caveat.
A unique identifier (e.g., derived from the row id) is also included for book-keeping purposes --- we will omit this identifier from examples for conciseness.
Rows are annotated when the caveat appears in a WHERE clause, and we leave the discussion of how Vizier implements schema-level caveats (column and table) to future work.
Mimir provides a new function, \lstinline{caveat(id, value, message)}, to annotate data.
The function takes a value to annotate and a message describing the caveat.
A unique identifier (e.g., derived from the row id) is used for book-keeping purposes, and omitted from examples for conciseness.
Rows are annotated when the caveat appears in a \lstinline{WHERE} clause.\footnote{We leave a discussion of how Vizier implements column and table caveats to future work.}
We now highlight examples of how Vizier instruments several operators to capture heuristically recoverable errors.
\tinysection{Instrumenting Type Parsing}
\tinysection{Instrumenting String Parsing}
Many file formats lack type information (e.g., CSV), or have minimal type systems (e.g., JSON).
Thus a commonly used operation is type parsing: converting a string encoding of a typed value into its native representation.
However, when the string is improperly formatted or contains spurious characters, it can not be safely parsed. Consider the following parsing operation in a normal database system.
Thus extracting native representations from strings is common, as in the following:
\begin{lstlisting}
SELECT CAST(pay as int) AS pay, ... FROM EMP;
\end{lstlisting}
In Vizier, this cast would instead be implemented as:
Vizier rewrites \lstinline{CAST} to emit a \lstinline{NULL} annotated with a caveat when a string cannot be safely parsed:
\begin{lstlisting}
SELECT CASE WHEN CAST(pay as int) IS NULL
THEN caveat(NULL, pay & ' is not an int')
ELSE CAST(pay as int) END AS pay, ...
FROM EMP_uncast;
CASE WHEN CAST(pay as int) IS NULL
THEN caveat(NULL, pay & ' is not an int')
ELSE CAST(pay as int) END
\end{lstlisting}
If the cast fails, it is replaced by a null value caveatted with a message indicating that the invalid string could not be cast.
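The effect of the rewritten cast can be mimicked outside SQL. The Python sketch below is illustrative only: the `Caveated` record and the `cast_int_with_caveat` helper are hypothetical names, not part of Mimir or Vizier, and the parsing rules are those of Python's built-in `int`, which may differ from Spark's `CAST`. A value that parses keeps no message; a value that fails to parse becomes a null carrying a description of the problem.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Caveated:
    """A value paired with an optional human-readable caveat message."""
    value: Optional[int]
    message: Optional[str] = None


def cast_int_with_caveat(raw: str) -> Caveated:
    """Parse raw as an int; on failure return a caveatted null
    instead of raising or silently dropping the error."""
    try:
        return Caveated(int(raw))
    except ValueError:
        return Caveated(None, f"{raw} is not an int")


ok = cast_int_with_caveat("1200")
bad = cast_int_with_caveat("$1200")
```

Keeping the message next to the null is what lets later stages report why the value is missing rather than only that it is missing.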
\tinysection{Instrumenting CSV Parsing}
There are numerous ways in which a CSV file can be misread, including un-escaped commas or newlines, blank lines, or comment lines.
CSV parsing necessarily happens prior to having a dataframe available, so the \lstinline{caveat} function is not available.
Instead, Vizier adopts an instrumented version of Spark's CSV parser that emits an additional field containing an error description and a string representation of the original (pre-parsed) version of the erroneous line.
This field is \lstinline{NULL} on lines that successfully parse.
After the CSV file is loaded, a \lstinline{caveat} is applied as needed by a post-processing step.
CSV files are subject to data errors like un-escaped commas or newlines, blank lines, or comment lines.
CSV parsing necessarily happens prior to having a dataframe available, so a purely rewrite-based approach is not possible.
Instead, Vizier adopts an instrumented version of Spark's CSV parser that emits an additional field that is \lstinline{NULL} on lines that successfully parse and contains error-related metadata otherwise.
Then, caveats are applied in a post-processing step.
\begin{lstlisting}
SELECT * FROM EMP_raw
WHERE CASE WHEN _error_msg IS NOT NULL
THEN caveat(true, _error_msg)
ELSE true END;
SELECT * FROM EMP_raw WHERE
CASE WHEN _error_msg IS NOT NULL
THEN caveat(true, _error_msg) ELSE true END;
\end{lstlisting}
This use of the \lstinline{WHERE} clause seems unintuitive at first, but is a deliberate decision rooted in caveats' origin in incomplete databases.
Due to the parse error, we are not certain that the row is valid. Here, \lstinline{caveat(true, ...)} captures that the choice to include the row (i.e., \lstinline{WHERE true}) is in question.
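The row-level post-processing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: rows are modeled as dictionaries, the `_error_msg` key mirrors the field used in the query above, and the helper name `caveat_rows` is hypothetical. Every row is kept, matching the `WHERE true` behavior, and rows with a parse error additionally carry the error message as a row caveat.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def caveat_rows(rows: List[Row]) -> List[Tuple[Row, List[str]]]:
    """Keep every row, attaching the parser's error message (if any)
    as a caveat on the decision to include that row."""
    annotated = []
    for row in rows:
        msg = row.get("_error_msg")
        caveats = [msg] if msg is not None else []
        annotated.append((row, caveats))
    return annotated


emp_raw = [
    {"name": "Alice", "pay": "1200", "_error_msg": None},
    {"name": "Bob", "pay": None, "_error_msg": "too many columns"},
]
result = caveat_rows(emp_raw)
```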
\tinysection{Other Caveats}
Vizier additionally provides a range of data curation operators called lenses~\cite{yang2015lenses}.
Lenses apply heuristic data cleaning rules: Missing value imputation, key repair, sequence repair, or geocoding.
In each case, lenses annotate repaired values with a brief description of the repair applied.
Additionally, users may also manually annotate values, for example to identify unusual values.
% \tinysection{Other Caveats}
% Vizier additionally provides a range of data curation operators called lenses~\cite{yang2015lenses}.
% Lenses apply heuristic data cleaning rules: Missing value imputation, key repair, sequence repair, or geocoding.
% In each case, lenses annotate repaired values with a brief description of the repair applied.
% Additionally, users may also manually annotate values, for example to identify unusual values.
\subsection{Implementing \abbrCaveatsCap}
\label{sec:tracking-abbrcaveats}
Applied directly to datasets, caveats serve as little more than a glorified error list.
The true strength of caveats is in Vizier's ability to propagate them through queries.
If the user's query result is independent of a particular caveat, there is no need for the user to carefully address an issue since it does not affect her analysis.
Caveats can serve as a supplemental form of documentation that is automatically filtered down based on the user's needs, but to do so it is necessary for our system to track \abbrCaveats throughout the dataflow encoded by a Vizier notebook.
%Applied directly to datasets, caveats serve as little more than a glorified error list.
The utility of caveats arises from Vizier's ability to propagate them through queries, as users can avoid costly data repairs if those repairs are not relevant.
In this section, we give an overview of Vizier's lightweight strategy for propagating caveats through a Vizier notebook.
\begin{example}
A small number of remote employees in Alice's dataset have had their pay entered with foreign currency markers (e.g., \texttt{\$}).
In lieu of aborting the initial data ingest (and wasting several hours of work), Alice's analytics platform simply parses these values as NULLs.
Alice inspects the first few dozen rows to find everything in order, and begins to explore her data by adding a notebook cell to ask:
Pay for a small number of employees in Alice's dataset includes foreign currency markers (e.g., \texttt{\$}).
%In lieu of aborting the initial data ingest (and wasting several hours of work),
Alice's analytics tool silently converts these numbers to \lstinline{NULL}.
Alice begins exploring with a query:
\begin{lstlisting}
SELECT dept, avg(pay) FROM EMP GROUP BY dept;
\end{lstlisting}
On a database that silently hides data ingest errors, Alice would simply get an incorrect result to her query.
Vizier however, caveats the average pay of departments with remote employees, helping her to discover and repair the error.
The result that Alice gets is incorrect.
Vizier, however, caveats the average pay of departments with remote employees, helping Alice to discover and repair the error.
\end{example}
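Alice's scenario can be reproduced in miniature. The Python sketch below is an illustration only, not Vizier's implementation: the function name `avg_pay_by_dept` and the `(dept, pay, caveated)` triple layout are assumptions made for this example. It computes the per-department average while ignoring nulls, as SQL's `avg` does, and flags a department whenever any contributing row carried a caveat.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def avg_pay_by_dept(
    rows: Iterable[Tuple[str, Optional[int], bool]]
) -> Dict[str, Tuple[Optional[float], bool]]:
    """rows holds (dept, pay, caveated) triples; pay may be None.
    Returns dept -> (average pay over non-null values, caveat flag)."""
    totals = defaultdict(lambda: [0, 0, False])  # sum, count, caveated
    for dept, pay, caveated in rows:
        entry = totals[dept]
        if pay is not None:
            entry[0] += pay
            entry[1] += 1
        # A caveat on any input row marks the whole aggregate.
        entry[2] = entry[2] or caveated
    return {
        dept: (s / n if n else None, flagged)
        for dept, (s, n, flagged) in totals.items()
    }


emp = [("eng", 100, False), ("eng", None, True), ("ops", 50, False)]
report = avg_pay_by_dept(emp)
```

Here the `eng` average is computed from one row only, and the flag is what tells Alice that the number may be wrong.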
Propagating caveats directly through query processing is expensive.
Thus Vizier adopts a three stage scheme for materializing caveats, where each stage provides a progressively more detailed picture of the caveats affecting a query result.
To avoid the overheads of propagating annotated data directly, Vizier simulates propagation in three stages: Presence, Static, and Detail.
Each stage provides a progressively more detailed picture of the caveats affecting a query result.
\tinysection{Presence}
The first stage simply identifies which fields or rows are affected.
@ -108,13 +103,15 @@ Consider Alice's query over \lstinline{EMP}, instrumented as above to cast \lsti
% FROM EMP_uncast
% ) SELECT dept, avg(pay) FROM EMP GROUP BY dept;
% \end{lstlisting}
% SELECT CASE WHEN CAST(pay as int) IS NULL
% THEN caveat(NULL, pay&' is not an int')
% ELSE CAST(pay as int) END AS pay,
% dept, ... FROM EMP_uncast
\begin{lstlisting}
WITH EMP AS (
SELECT CASE WHEN CAST(pay as int) IS NULL
THEN caveat(NULL, pay&' is not an int')
ELSE CAST(pay as int) END AS pay,
dept, ... FROM EMP_uncast
) SELECT dept, avg(pay) FROM EMP GROUP BY dept;
WITH EMP AS ( /* cast pay as above */ )
SELECT dept, avg(pay) FROM EMP GROUP BY dept;
\end{lstlisting}
This query would be transparently instrumented and optimized by Vizier into:
\begin{lstlisting}
@ -136,7 +133,7 @@ However, the algorithm is capable of spuriously marking cells or rows.
We find unwarranted marks to be rare in practice~\cite{feng:2019:sigmod:uncertainty}.
As we show in our experimental results, caveat propagation also has minimal computational overhead relative to native query processing.
\tinysection{Static Analysis}
\tinysection{Static}
The initial phase simply determines which fields and rows are affected by caveats.
At this point, the user can request more detailed information through the Vizier user interface.
In the spreadsheet view, clicking on a field or row header opens a pop-up that lists caveats on the field or row.

@ -12,10 +12,11 @@ All datasets were loaded through Vizier and cached locally in Parquet format.
\begin{figure}[htbp]
\centering
\begin{tabular}{c|c|c|c}
\textbf{Dataset} & \textbf{Rows} & \textbf{Caveats} & \textbf{CaveatSets} \\ \hline
\texttt{Shootings} & 2890 & 121 & 51\\
\texttt{Crime}
\begin{tabular}{c|c|c|c|c}
\textbf{Dataset} & \textbf{Rows} & \textbf{Cols} & \textbf{Caveats} & \textbf{CaveatSets} \\ \hline
\texttt{Shootings} & 2890 & 43 & 121 & 51\\
\texttt{Graffiti} & 985K & 15 & ? & ? \\
% \texttt{Crime} & 6.6M & 17 & ? & ?
\end{tabular}
\caption{Datasets Evaluated}
\label{fig:datasets}
@ -36,7 +37,7 @@ A hybrid distributed/local query engine may be beneficial for local execution.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.9\columnwidth,trim={0 6mm 8mm 0}]{measurements/waterfall-warm-shootings.pdf}
\includegraphics[width=0.95\columnwidth,trim={0 6mm 8mm 0}]{measurements/waterfall-warm-shootings.pdf}
\caption{Materializing caveats for \texttt{Shootings}}
\label{fig:waterfall:shooting}
\trimfigurespacing

@ -1,8 +1,8 @@
% -*- root: ../paper.tex -*-
In this paper, we discuss the design and implementation of Vizier, a novel system for data curation and exploration with a combination of a spreadsheet and a notebook interface. In contrast to other notebook systems that are wrappers around REPLs, Vizier is \textit{data-centric}, i.e., notebooks are dataflows which are versioned (both the dataflow itself and the results it produces).
Vizier supports iterative notebook construction through automated data-dependency tracking and debugging through the automated detection and propagation of \emph{data caveats}.
In future work, we will investigate propagation of caveats for new cell types, e.g., displaying annotations in plots, and explore trade-offs between performance overhead and accuracy when propagating caveats through cells with turing-complete languages. Furthermore, we plan to develop caching and incremental maintenance techniques for datasets in Vizier workflows to speed-up reexecution of cells in response to an update to a notebook. Finally, to support very large datasets, we will investigate how to build vizier notebooks over samples of a full dataset and automatically generalize such workflows for deployment over a full dataset.
In this paper, we discuss the design and implementation of Vizier, a novel system for data curation and exploration with a combination of a spreadsheet and a notebook interface. In contrast to library-manager-style notebook systems (i.e., wrappers around REPLs), Vizier is a notebook system built as a versioned workflow manager.
Vizier supports iterative notebook construction through automated data-dependency tracking and debugging through the automated detection and propagation of \emph{caveats}.
In future work, we will investigate propagation of caveats for new cell types, e.g., displaying caveats in plots, and explore trade-offs between performance overhead and accuracy when propagating caveats through cells with Turing-complete languages. Furthermore, we plan to develop caching and incremental maintenance techniques for datasets in Vizier workflows to speed up re-execution of cells in response to an update to a notebook. Finally, to support very large datasets, we will investigate how to build Vizier notebooks over samples of a full dataset and automatically generalize such workflows for deployment over a full dataset.
% \begin{itemize}
% \item Plotting caveats

@ -9,7 +9,7 @@ However, a recent study by Pimentel et. al.~\cite{pimentel:2019:msr:large}
found only 4\% of notebooks sampled from GitHub to be reproducible, and only 24\% to be directly re-usable.% without error.
These unfortunate statistics stem from Jupyter's heritage as a thin facade over a read-evaluate-print-loop (REPL).
Many existing notebooks, like Jupyter, are not designed as a historical log --- as one would want for reproducibility --- but rather as library managers for code snippets (i.e., cells).
Reproducibility and re-usability require an active effort organizing cells, keeping cells up to date, managing inter-cell dependencies, and manual version control (e.g., git); frustrating users~\cite{grus:2018:notebooks}.
Reproducibility and re-usability require an active effort organizing cells, keeping cells up to date, managing inter-cell dependencies, and manual version control (e.g., git), all of which frustrates users.
In this paper, we present Vizier\footnote{~\texttt{pip2 install vizier-webapi} \hfill \url{https://vizierdb.info}}, a notebook-style data exploration system designed from the ground up to encourage notebook reproducibility and re-use.
Vizier eschews REPLs in favor of a more powerful state model: A versioned relational database and workflow.