Finishing read through

- Mostly minor edits for flow/breaking up long sentences/typos
- Also blurred out a public spreadsheet URL in a png.
master
Oliver Kennedy 2019-12-14 21:19:39 -05:00
parent 8c407279b0
commit e9d3dfafff
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
5 changed files with 63 additions and 66 deletions

Binary file not shown (size before: 18 KiB; after: 24 KiB).

@ -2,24 +2,24 @@
%!TEX root=../paper.tex
Vizier's central feature is support for annotations on groups of cells or rows called caveats.
As introduced above, caveats consist of a human-readable description of the concern and a reference to the cell where the caveat was applied.
More concretely, a caveat applied to a cell indicates that the cell value is tied to a heuristic decision and thus potentially suspect / uncertain.
Similarly, caveats applied to rows (resp., excluded rows) indicate the presence (resp., absence) of a row in a dataset is tied to a heuristic decision.
In each case, if the heuristic is inappropriate (e.g., because a dataset is re-used for a new analysis), the analysis must be revisited.
As introduced above, caveats consist of a human-readable description of a concern about the row or value, and a reference to the cell where the caveat was applied.
More concretely, a caveat applied to a cell indicates that the cell value is potentially suspect or uncertain --- typically because it is an outlier or its validity is tied to a heuristic assumption made during data preparation.
Similarly, caveats applied to rows (resp., excluded rows) indicate that the presence (resp., absence) of a row in a dataset is suspect.
In each case, if the heuristic is inappropriate (e.g., because a dataset is re-used for a new analysis) or the outlier indeed indicates an error, the analysis must be revisited.
Vizier propagates caveats based on data-dependencies:
\emph{Is it possible to change the derived value by changing the caveatted values?}
If so, the derived value is likewise tied to the heuristic choice and likewise suspect if the heuristic turns out to be wrong.
If so, the derived value is likewise tied to the heuristic choice or outlier and likewise suspect if the heuristic/outlier turns out to be wrong.
Caveats were originally introduced as \emph{uncertain values} in \cite{yang2015lenses}.
We later formalized propagation of row caveats in \cite{feng:2019:sigmod:uncertainty} and proposed a minimal-overhead rewrite-based implementation. For dataflow cells that utilize only query features that are supported by these approaches, we use these techniques to provide the following strong guarantee: if in the input a superset of the data that is uncertain is marked as such through caveats then in the result of the operation data elements not marked with caveats are guaranteed to be certain.
Here, we focus on the practical challenges of realizing caveats in Vizier\footnote{As before, the SQL code shown here is for strictly pedagogical purposes or the result of automated processed in Vizier; managing caveats does not require the user to write SQL!}
We later formalized propagation of row caveats in \cite{feng:2019:sigmod:uncertainty} and proposed a minimal-overhead rewrite-based implementation. For dataflow cells that utilize only query features that are supported by these approaches, we use these techniques to provide the following strong guarantee: If all invalid (uncertain) data values in the input are marked by a caveat, all outputs not marked by a caveat are guaranteed to be valid (certain).
Here, we focus on the practical challenges of realizing caveats in Vizier.\footnote{As before, the SQL code shown here is strictly for pedagogical purposes or the result of automated processes in Vizier. Managing caveats does not require the user to write SQL.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{example}
Consider a simpler version of our running example where Alice uses Vizier's default (caveat-enabled) string parsing operation, as described above.
This operation converts numbers with unknown characters to \lstinline{NULL}.
Bob, taking Alice's script begins exploring with a query:
Consider a simpler version of our running example where Alice uses Vizier's default (caveat-enabled) string parsing operation, as described below.
This operation replaces unparseable numbers with \lstinline{NULL}.
Bob applies Alice's script to his data and begins exploring with a query:
\begin{lstlisting}
SELECT id, avg(cost) FROM parts GROUP BY id;
\end{lstlisting}
@ -30,19 +30,19 @@ SELECT id, avg(cost) FROM parts GROUP BY id;
\subsection{Applying \abbrCaveatsCap}
Vizier's dataflow layer exposes a new scalar function to annotate data with caveats:
\lstinline{caveat(id, value, message)}. This function is meant to be used inline in SQL queries written by the user (an SQL cell) or produced by dataflow cells.
\lstinline{caveat(id, value, message)}. This function is meant to be used inline in SQL queries written by the user (a SQL cell) or produced by dataflow cells.
\lstinline!caveat! takes as input a value to annotate and a message describing the caveat.
A unique identifier (e.g., derived from the row id) is used for book-keeping purposes, and omitted from examples for conciseness.
Rows (as well as excluded rows) are annotated when the caveatted values is accessed in a \lstinline{WHERE} clause that evaluates to \texttt{true}.
Rows (as well as excluded rows) are annotated when the caveatted value is accessed in a \lstinline{WHERE} clause that evaluates to \texttt{true}.
Vizier also provides a manual annotation cell. % although users may also use the caveat function to apply caveats inline.
The workflow dataset API also support passing a list of caveats to apply when a dataset is uploaded.
The workflow dataset API also supports passing a list of caveats to apply when a dataset is uploaded.
Additionally, several existing Vizier cells make use of caveats to encode heuristic choices about a dataset or to communicate heuristic recovery from errors.
We now use two of Vizier's cell types to illustrate how caveats are used to annotate data.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Instrumenting String Parsing}
Many file formats lack type information (e.g., CSV), or have minimal type systems (e.g., JSON).
Thus extracting native representations from strings is a common task. For example consider the following query: % as in the following:
Thus, extracting native representations from strings is a common task. For example, consider the query: % as in the following:
\begin{lstlisting}
SELECT CAST(cost as int) AS cost, $\ldots$ FROM parts;
\end{lstlisting}
@ -107,11 +107,11 @@ When needed, Vizier can undertake the more expensive task of deriving the full s
Even just determining whether a data value or row is affected by a caveat is analogous to determining certain answers for a query applied to an incomplete database~\cite{Imielinski:1984:IIR:1634.1886} (i.e., CoNP-complete for relatively simple types of queries~\cite{feng:2019:sigmod:uncertainty}).
Thus, Vizier adopts a conservative approximation: All rows or cells that depend on a caveatted value are guaranteed to be marked.
It is theoretically possible, although rare in practice~\cite{feng:2019:sigmod:uncertainty} for the algorithm to unnecessarily mark cells or rows.
Specifically, queries are rewritten recursively using an extension of the scheme detailed in~\cite{feng:2019:sigmod:uncertainty} to add Boolean-valued attributes that indicate that a column, or the entire row depends on a caveatted value.
Specifically, queries are rewritten recursively using an extension of the scheme detailed in~\cite{feng:2019:sigmod:uncertainty} to add Boolean-valued attributes that indicate whether a column, or the entire row, depends on a caveatted value.
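As a rough illustration of the Boolean-attribute idea, the following Python sketch carries a caveat flag alongside each parsed value and marks an aggregate as caveatted whenever any input in its group is. The function names (\lstinline{parse_cost}, \lstinline{avg_cost_by_id}) and the sample rows are hypothetical and chosen only for this illustration; the sketch does not reproduce Vizier's actual SQL rewrite.

```python
from collections import defaultdict


def parse_cost(raw):
    """Parse a string as an integer.

    Returns (value, caveatted). An unparseable string becomes
    (None, True): a NULL value that carries a caveat flag.
    """
    try:
        return int(raw), False
    except ValueError:
        return None, True


def avg_cost_by_id(rows):
    """Average cost per id, with a conservative caveat flag per group.

    rows: iterable of (id, raw_cost_string).
    Returns {id: (average_or_None, caveatted)}.
    """
    groups = defaultdict(lambda: [0, 0, False])  # [sum, count, caveat flag]
    for part_id, raw in rows:
        cost, flagged = parse_cost(raw)
        group = groups[part_id]
        if cost is not None:  # like SQL avg(), skip NULL inputs
            group[0] += cost
            group[1] += 1
        # The aggregate depends on every row of its group, so one
        # caveatted input is enough to mark the output.
        group[2] = group[2] or flagged
    return {
        pid: (total / count if count else None, flag)
        for pid, (total, count, flag) in groups.items()
    }


# Hypothetical sample data: "2o" is a typo that cannot be parsed.
parts_uncast = [(1, "10"), (1, "2o"), (2, "7"), (2, "9")]
result = avg_cost_by_id(parts_uncast)
```

In this sketch the average for id 1 is flagged because one of its inputs could not be parsed, while the average for id 2 is left unmarked.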
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{example}
Consider Bob's \lstinline{parts} query, instrumented as above to cast \lstinline{cost} to an integer and assuming no other operations that create caveats have been applied. Let \lstinline!parts_uncast! be the dataset before the cast operation and \lstinline!part! be the dataset after the operation. Recall the Bob's query is:
Consider Bob's \lstinline{parts} query, instrumented as above to cast \lstinline{cost} to an integer, and assume no other operations that create caveats have been applied. Let \lstinline!parts_uncast! be the dataset before the cast operation and \lstinline!parts! be the dataset after the operation. Recall that Bob's query is:
% \begin{lstlisting}
% WITH EMP_uncast AS (
% SELECT * FROM EMP_raw
@ -132,7 +132,7 @@ Consider Bob's \lstinline{parts} query, instrumented as above to cast \lstinline
% dept, ... FROM EMP_uncast
\begin{lstlisting}
WITH parts AS ( /* as string parsing, above */ )
WITH parts AS (/* as string parsing, above */)
SELECT id, avg(cost) FROM parts GROUP BY id;
\end{lstlisting}
This query would be instrumented and optimized into:


@ -167,7 +167,7 @@ Notebook systems used for data curation should be able to model and track the un
We emphasize that caveats are orthogonal to any specific error detection or cleaning schemes.
A wide range of such tools can be wrapped to expose their heuristic assumptions and any resulting uncertainty to Vizier, integrating them into the Vizier ecosystem.
Vizier then automatically tracks caveats and handles other advanced features such as versioning and cell dependency tracking.
In our experience, extending methods to expose caveats is often quite straight-forward (e.g., see \cite{yang2015lenses} for several examples), and Vizier already supports a range of caveat-enabled data cleaning and error detection operations (see \Cref{fig:notebook:cells} for a list of currently supported cell types).
In our experience, extending methods to expose caveats is often quite straightforward (e.g., see \cite{yang2015lenses} for several examples), and Vizier already supports a range of caveat-enabled data cleaning and error detection operations (see \Cref{fig:cellTypes} for a list of currently supported cell types).
Similarly, Vizier's data load operation also relies on caveats as a non-invasive and robust (data errors do not block the notebook) way to communicate records that fail to load correctly.
To support one-off curation transformations, Vizier also allows users to create caveats manually through its spreadsheet interface or programmatically via \texttt{Python}, \texttt{Scala}, or \texttt{SQL} cells.
% Multiple Vizier cell types also make use of caveats.


@ -8,13 +8,13 @@ In this section, we explore three challenges we had to overcome to implement DML
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Spreadsheet Data Types}
The lightweight interface offered by typical spreadsheets has two minor impedance mismatches with the more strongly typed relational data model used by Vizier's datasets.
First, types in a spreadsheet are assigned on a per-value basis, while they are assigned on a per-column basis in a typical relational table.
First, types in a spreadsheet are assigned on a per-value basis, but on a per-column basis in a typical relational table.
A spreadsheet allows users to enter arbitrary text into a column of integers.
Because Vizier's history makes undoing a mistake trivial, Vizier assumes the user's action is intentional: column types are escalated (e.g., \texttt{int} to \texttt{float} to \texttt{string}) to allow the newly entered value to be represented as-is.
The second impedance mismatch pertains to \texttt{NULL} values.
Spreadsheets do not make a distinction between empty strings and missing values (i.e., a relational \texttt{NULL}).
Thus, when a user replaces a value with the empty string, the user's intended action is ambiguous.
Spreadsheets do not distinguish between empty strings and missing values (i.e., a relational \texttt{NULL}).
Thus, when a user replaces a value with the empty string, the user's intention is ambiguous.
To resolve this ambiguity, Vizier relies on the type of the column:
If the column is string-typed, an empty value is treated as the empty string.
Otherwise, an empty value is treated as \texttt{NULL}.
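The two rules above (type escalation and type-dependent handling of empty input) can be sketched in a few lines of Python. The function \lstinline{enter_value} and the type names are hypothetical stand-ins used only for illustration; a real implementation would also need to convert the values already stored in the column.

```python
# Escalation order taken from the text: int, then float, then string.
ESCALATION = ["int", "float", "string"]
CASTERS = {"int": int, "float": float, "string": str}


def fits(text, col_type):
    """Return True if `text` can be represented as-is in `col_type`."""
    if col_type == "string":
        return True
    try:
        CASTERS[col_type](text)
        return True
    except ValueError:
        return False


def enter_value(col_type, text):
    """Simulate typing `text` into a column of type `col_type`.

    Returns (new_column_type, stored_value).
    """
    if text == "":
        # Ambiguous input: empty string for string columns, NULL otherwise.
        return (col_type, "") if col_type == "string" else (col_type, None)
    idx = ESCALATION.index(col_type)
    # Escalate until the value fits; "string" always fits, so this ends.
    while not fits(text, ESCALATION[idx]):
        idx += 1
    new_type = ESCALATION[idx]
    return new_type, CASTERS[new_type](text)
```

For example, entering \lstinline{3.5} into an integer column escalates the column to \lstinline{float}, while clearing a cell in an integer column stores \lstinline{NULL} (represented here by \lstinline{None}).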
@ -24,13 +24,12 @@ Otherwise, an empty value is treated as \texttt{NULL}.
Through the spreadsheet interface, users can create, rename, reorder, or delete rows and columns, or alter data --- the standard array of DDL and DML operations.
These operations cannot be applied in-place without sacrificing the immutability of versions.
To preserve versioning and avoid unnecessary data copies, Vizier builds on a technique called reenactment~\cite{DBLP:journals/pvldb/NiuALFZGKLG17,DBLP:journals/tkde/ArabGKRG18}, which translates sequences of DDL or DML operations into equivalent queries\footnote{
We emphasize that our use of the SQL code examples shown in this section are produced automatically as part of the translation of Vizual into SQL queries. Users will not need to write SQL queries to express spreadsheet operations. The user' actions in the spreadsheet are automatically added as Vizual cells to the notebook and these Vizual operations are automatically translated into equivalent SQL DDL/DML expressions~\cite{freire:2016:hilda:exception}.
}.
To preserve versioning and avoid unnecessary data copies, Vizier builds on a technique called reenactment~\cite{DBLP:journals/pvldb/NiuALFZGKLG17,DBLP:journals/tkde/ArabGKRG18}, which translates sequences of DML operations into equivalent queries.
We emphasize that the SQL code examples shown in this section are produced automatically as part of the translation of Vizual into SQL queries. Users do not need to write SQL queries to express spreadsheet operations. The user's actions in the spreadsheet are automatically added as Vizual cells to the notebook, and these Vizual operations are automatically translated into equivalent SQL DDL/DML expressions~\cite{freire:2016:hilda:exception}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{example}
Consider a table \lstinline{EMP} and the SQL statements:
Consider a table \lstinline{EMP} and the SQL:
\begin{lstlisting}
UPDATE EMP SET pay=pay*1.1 WHERE type = 1;
INSERT INTO EMP(name, pay, type)
@ -47,12 +46,12 @@ SELECT 'Bob' AS name, 100000 AS pay, 2 AS type;
\end{example}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Vizier translates primitive cell operations into an equivalent queries, resulting in each table version being defined (declaratively) as a view.
Due to limited space, we refer the interested reader to \cite{DBLP:journals/pvldb/NiuALFZGKLG17} for an introduction to reenactment, with a further discussion of its applicability to spreadsheets in \cite{freire:2016:hilda:exception}.
Vizier translates primitive cell operations into equivalent queries, resulting in each table version being defined (declaratively) as a view.
We refer the interested reader to \cite{DBLP:journals/pvldb/NiuALFZGKLG17} for an introduction to reenactment, with a further discussion of its applicability to spreadsheets in \cite{freire:2016:hilda:exception}.
Vizual~\cite{freire:2016:hilda:exception}, and by extension our primitive cell types, also cover DDL operations like column renaming, reordering, and creation.
We implemented these through a similar view-based approach, adopting transformations similar to the ones from the Prism Workbench~\cite{Curino:2008:GDS:1453856.1453939} and \cite{DBLP:journals/vldb/HerrmannVPL18}.
For example, column creation is analogous to projecting on the full schema of the table plus an additional column initialized with \lstinline{NULL}.
For example, column creation is analogous to projecting on the full schema of the table plus an additional column initialized with the new column's default value (or \lstinline{NULL}).
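To make the view-based translation concrete, the following Python sketch uses the standard-library \lstinline{sqlite3} module to check that a single query reproduces the effect of an update, an insert, and a column creation on \lstinline{EMP}. The initial rows and the \lstinline{bonus} column are assumptions made for this illustration, and the query is a hand-written approximation of reenactment rather than the output of Vizier or GProM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP(name TEXT, pay REAL, type INTEGER);
    INSERT INTO EMP VALUES ('Alice', 50000, 1), ('Carol', 80000, 2);
""")

# Hand-written reenactment query:
#   UPDATE     -> CASE expression in a projection
#   INSERT     -> UNION ALL with the inserted tuple
#   ADD COLUMN -> projection with one extra NULL-initialized column
REENACTMENT = """
    SELECT name, pay, type, NULL AS bonus FROM (
        SELECT name,
               CASE WHEN type = 1 THEN pay * 1.1 ELSE pay END AS pay,
               type
        FROM EMP
        UNION ALL
        SELECT 'Bob' AS name, 100000 AS pay, 2 AS type
    )
"""
# Evaluate the query first: it reads EMP without modifying it.
reenacted = sorted(conn.execute(REENACTMENT).fetchall())

# Now apply the same operations in place, for comparison.
conn.execute("UPDATE EMP SET pay = pay * 1.1 WHERE type = 1")
conn.execute(
    "INSERT INTO EMP(name, pay, type) "
    "SELECT 'Bob' AS name, 100000 AS pay, 2 AS type"
)
conn.execute("ALTER TABLE EMP ADD COLUMN bonus REAL")
in_place = sorted(
    conn.execute("SELECT name, pay, type, bonus FROM EMP").fetchall()
)
```

Because the query never modifies \lstinline{EMP}, the original version of the table remains available, which is the property that makes this style of translation attractive for versioning.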
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -62,7 +61,7 @@ For example, column creation is analogous to projecting on the full schema of th
Identifying update targets for Vizual cell operations presented a challenge.
In SQL DML, update operations specify target rows by a predicate. % , making the user's intent explicit.
By contrast, spreadsheet users specify the target of an update by explicitly modifying a cell at a certain position in the spreadsheet, e.g., overwriting the value of the cell in the second column of the 3rd row. To deal with such updates and to be able to represent unordered relational data as a spreadsheet we need to maintain a mapping between rows and their positions in the spreadsheet. Since we record both the position of a row and a unique stable identifier for it, we can ensure that a Vizual operation always applies to the same cell even when, e.g., rows/columns are deleted or added.
However, when source data changes --- for example when a new cell is added earlier in the notebook --- determining how to reapplying the user's update is more challenging. % to the new data.
However, when source data changes --- for example when a new cell is added earlier in the notebook --- determining how to reapply the user's update is more challenging. % to the new data.
Ideally, we would like to use row identifiers that are stable through such changes.%\footnote{fields can be identified by column/row intersections.}.
%Column identifiers are already defined by the source table.
@ -70,10 +69,9 @@ Ideally, we would like to use row identifiers that are stable through such chang
For derived data, Vizier uses a row identity model based on GProM's~\cite{DBLP:journals/debu/ArabFGLNZ17} encoding of provenance.
Derived rows, such as those produced by declaratively specified table updates, are identified as follows:
(1) Rows in the output of a projection or selection use the identifier of the source row that produced them,
(2) Rows in the output of a \lstinline{UNION ALL} are identified by the identifier of the source row and an identifier marking which side of the union the row came from.
To preserve associativity and commutativity during optimization, union-handedness is recorded during parsing.
(2) Rows in the output of a \lstinline{UNION ALL} are identified by the identifier of the source row and an identifier marking which side of the union the row came from\footnote{To preserve associativity and commutativity during optimization, union-handedness is recorded during parsing}.
(3) Rows in the output of a cross product or join are identified by combining identifiers from the source rows that produced them into a single identifier, and
(4) Rows in the output of an aggregate are identified by group-by attribute values.
(4) Rows in the output of an aggregate are identified by each row's group-by attribute values.
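The four rules above can be illustrated with a small Python sketch in which every relation is a list of \lstinline{(identifier, tuple)} pairs. The operator functions and the \lstinline{"L"}/\lstinline{"R"} side markers are hypothetical and only approximate the GProM-based encoding.

```python
def select(rows, pred):
    """Rule 1: selected rows keep the identifier of their source row."""
    return [(rid, t) for rid, t in rows if pred(t)]


def project(rows, fn):
    """Rule 1: projected rows keep the identifier of their source row."""
    return [(rid, fn(t)) for rid, t in rows]


def union_all(left, right):
    """Rule 2: source identifier plus a marker for the side of the union."""
    return ([((rid, "L"), t) for rid, t in left] +
            [((rid, "R"), t) for rid, t in right])


def join(left, right, pred):
    """Rule 3: combine the identifiers of both contributing rows."""
    return [((lid, rid), lt + rt)
            for lid, lt in left
            for rid, rt in right
            if pred(lt, rt)]


def aggregate(rows, key, agg):
    """Rule 4: each output row is identified by its group-by value."""
    groups = {}
    for _, t in rows:
        groups.setdefault(key(t), []).append(t)
    return [(k, agg(ts)) for k, ts in groups.items()]


# Hypothetical data: (name, type) tuples with per-table row identifiers.
emp = [(1, ("Alice", 1)), (2, ("Carol", 2))]
bob = [(1, ("Bob", 2))]

everyone = union_all(emp, bob)
union_ids = [rid for rid, _ in everyone]
per_type = aggregate(everyone, key=lambda t: t[1], agg=len)
```

Note that both inputs to the union contain a row with identifier 1; the side marker keeps the two output rows distinguishable.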
What remains is the base case: datasets loaded into Vizier or created through the workflow API.
We considered three approaches to identifying rows in raw data: order-, hash-, and key-based.
@ -84,7 +82,7 @@ Using hashing preserves row identity through re-ordering, but risks collisions o
Using keys addresses both concerns, but requires users to manually specify a key column, assuming one exists in the first place.
Our prototype implementation combines the first two approaches: deriving identifiers from both sequence and hash code.
Such row-identifiers are stable under appends. % but not re-used.
While techniques for creating identifier that are stable under updates has been studied extensively for XML databases (e.g., ORDPATH~\cite{DBLP:conf/sigmod/ONeilOPCSW04}) and recently also for spreadsheet views of relational databases~\cite{DBLP:journals/pvldb/BendreSZZCP15}, the main challenge we face in Vizier is how to retain row identity when a new version of a dataset is loaded into Vizier. In this scenario we only have access to two snapshot of the dataset and no further information on how they relate to each other.
While techniques for creating identifiers that are stable under updates have been studied extensively for XML databases (e.g., ORDPATH~\cite{DBLP:conf/sigmod/ONeilOPCSW04}) and recently also for spreadsheet views of relational databases~\cite{DBLP:journals/pvldb/BendreSZZCP15}, the main challenge we face in Vizier is how to retain row identity when a new version of a dataset is loaded into Vizier. In this scenario, we only have access to two (identifier-free) snapshots of the dataset and no further information on how they relate to each other.
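One plausible way to combine sequence and hash code into an identifier is sketched below in Python. The exact combination used by the prototype is not specified here, so the \lstinline{position:digest} format, the use of SHA-1, and the function name \lstinline{row_identifiers} are all assumptions made for illustration.

```python
import hashlib


def row_identifiers(rows):
    """Derive one identifier per row from its position and its content.

    The position separates duplicate rows (which share a hash), and the
    content hash ties the identifier to the row's actual values.
    """
    ids = []
    for position, row in enumerate(rows):
        digest = hashlib.sha1(repr(row).encode("utf-8")).hexdigest()[:8]
        ids.append(f"{position}:{digest}")
    return ids


# Hypothetical snapshots: v2 appends one row to v1.
v1 = [("widget", "10"), ("gadget", "7"), ("widget", "10")]
v2 = v1 + [("gizmo", "3")]
```

Under this scheme, appending rows leaves the identifiers of all existing rows unchanged, whereas inserting a row in the middle of the dataset would shift the positions, and hence the identifiers, of every later row.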
%%% Local Variables:
%%% mode: latex


@ -3,8 +3,8 @@
Vizier is an interactive code notebook similar to Jupyter\footnote{\url{https://jupyter.org/}} or Apache Zeppelin\footnote{\url{https://zeppelin.apache.org/}}.
An analytics workflow is broken down into individual steps called cells.
Workflow cells are displayed in order, with each cell shown with code (e.g., Python or SQL) implementing the step, along with any outputs (e.g., console output, data visualizations, or datasets) produced by this code.
Each cell sees the effects of all preceding cells, and can create effects visible to subsequent cells.
Workflow cells are displayed in order, with each cell shown with code (e.g., Python or SQL) implementing the step, and any outputs (e.g., console output, data visualizations, or datasets) produced by this code.
Each cell sees the effects of all preceding (upstream) cells, and can create effects visible to subsequent (downstream) cells.
The user may edit, add, or delete any cell in the notebook, which \emph{triggers re-execution of all dependent cells appearing below it in the notebook.}
Vizier workflows allow different cell types to be mixed within the same notebook.
@ -27,7 +27,7 @@ Vizier additionally supports cells that use point-and-click interfaces to stream
Edit Cell, Sort, Filter & Dataflow \\
Cleaning & Infer Types, Repair Key,
Impute,
Repair Timeline, Merge Columns,
Repair Sequence, Merge Columns,
Geocode & Dataflow \\
\end{tabular}
\caption{Cell Types in Vizier}
@ -45,9 +45,9 @@ Jupyter cells are executed in the context of this interpreter, resulting in a ne
In designing the state model for Vizier, we considered four requirements.
Of these, only one is satisfied by an opaque, REPL-based state model.
\begin{enumerate}
\item \textbf{Enforcing in-order execution}: We needed a state representation that could be effectively and efficiently checkpointed to allow replay from any point in the notebook. A typical REPL's state is a complex graph of mutable objects that can not be efficiently checkpointed.
\item \textbf{Enforcing in-order execution}: We needed a state representation that could be correctly and efficiently checkpointed to allow replay from any point in the notebook. A typical REPL's state is a complex graph of mutable objects that can be checkpointed completely or efficiently, but not both simultaneously.
\item \textbf{Workflow-style execution}: We needed a state representation that permits at least coarse-grained dependency analysis between cells. Operations in a REPL can have arbitrary side effects, making dependency analysis challenging.
\item \textbf{Fine-grained provenance}: To support provenance-based features of Vizier like caveats, we wanted a state representation that supported declarative state transformations. A typical REPL, built around an imperative language like Python, makes such analysis challenging.
\item \textbf{Fine-grained provenance}: To support provenance-based features of Vizier like caveats, we wanted a state representation that supported declarative state transformations. A typical REPL for an imperative language like Python makes such analysis challenging.
\item \textbf{Big data support}: We needed a state representation that could manage large, structured datasets. This criterion is met by Python REPLs, which offer limited support for large, in-memory datasets through libraries like NumPy and Pandas.
\end{enumerate}
A versioned relational database satisfies all four criteria, providing efficient checkpointing, instrumentable and declarative updates, as well as natural parallelism for scalability.
@ -55,7 +55,7 @@ Thus, relational tables provide the foundation for Vizier's state model:
A notebook's state is a set of named relational tables called datasets\footnote{Support for primitive values and BLOB state is in progress.}.
Cells access, manipulate, create, or destroy datasets through one of two instrumented APIs, illustrated in \Cref{fig:wfVsDFCells}.
As we discuss below, the behavior of some cell types can be expressed declaratively, making fine-grained instrumentation for provenance and caveat tracking at the cell level possible.
As we discuss below, the behavior of some cell types can be expressed declaratively, making fine-grained instrumentation for provenance and caveat tracking at the level of individual data values possible.
Implementations of such \emph{dataflow} cells access datasets through a functional % write-only
API that allows new datasets to be defined as SQL views over existing datasets, i.e., datasets are treated as immutable from the viewpoint of the code using the API.
For example, the dataflow cell in \Cref{fig:wfVsDFCells} is translated into a SQL view definition that redefines the dataset named \lstinline{R} in terms of the dataset \lstinline{S} and the previous version of \lstinline{R}.
@ -73,9 +73,9 @@ For example, the workflow cell in \Cref{fig:wfVsDFCells} reads from datasets \ls
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Workflow Cells}
The workflow API, illustrated in \Cref{fig:wfVsDFCells}.a, is aimed at cells where only collecting coarse-grained provenance information is feasible.
This includes the \texttt{Python} and \texttt{Scala} cells that implement Turing-complete languages; as well as cells like \texttt{Load Dataset} that manipulate entire datasets.
To discourage out-of-band communication between cells that hinders reproducibility, as well as to avoid remote code execution attacks when Vizier is run in a public setting, workflow cells are typically executed in an isolated environment.
The workflow API, illustrated in \Cref{fig:wfVsDFCells}.a, targets cells where only collecting coarse-grained provenance information is presently feasible.
This includes \texttt{Python} and \texttt{Scala} cells, which implement Turing-complete languages; as well as cells like \texttt{Load Dataset} that manipulate entire datasets.
To discourage out-of-band communication between cells (which hinders reproducibility), as well as to avoid remote code execution attacks when Vizier is run in a public setting, workflow cells are executed in an isolated environment.
Vizier presently supports execution in a fresh interpreter instance (for efficiency) or a Docker container (for safety).
Vizier's workflow API is designed accordingly, providing three operations: \emph{Read dataset} (copy a named dataset from Vizier to the isolated execution environment), \emph{Checkpoint dataset} (copy an updated version of a dataset back to Vizier), and \emph{Create dataset} (allocate a new dataset in Vizier).
A more efficient asynchronous, paged version of the \emph{read} operation is also available, and the \emph{Create dataset} operation can optionally initialize a dataset from a URL, S3 Bucket, or Google Sheet to avoid unnecessary copies.
@ -86,7 +86,7 @@ Dataflow cells are compiled down to equivalent SQL queries.
Updated versions of the state are defined as views based on these queries.
We emphasize that most dataflow cells do not require the user to write % actual
SQL. SQL is the language used by the implementation of such a cell to communicate with Vizier.
For example, spreadsheet operation cells are created as a consequence of edits in the spreadsheet, while cleaning operations are configured through a form displayed in the cell.
For example, spreadsheet operation cells created as a consequence of edits in the spreadsheet and interactively configured cleaning operations are both types of dataflow cells.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -110,26 +110,26 @@ This version graph tracks three types of objects: Notebooks, Cells, and Datasets
\tinysection{Notebook Versions}
A single version of the notebook is an immutable sequence of references to the immutable cells that specify a workflow.
Every edit a user makes to a notebook, whether adding a new cell, editing an existing cell, or deleting a cell creates a new notebook version, preserving a full, reproducible history of the notebook.
Following the workflow provenance model from \cite{CF06b}, we record for each notebook version the operation performed to create the new version, and the previous, \emph{parent version} of the notebook.
Following the workflow provenance model from \cite{CF06b}, we record for each notebook version the operation performed to create the version, and the previous, \emph{parent version} of the notebook.
The result is a version graph, a tree-like structure that shows the notebook's evolution over time.
Under typical usage, edits are applied to a leaf of the tree to create chains of edits.
Editing a prior version of the notebook creates a \emph{branch} in the version history.
Vizier requires users to explicitly name branches when they are created; This explicit branch management makes it easier to users to follow the intent of a notebook's creator.
Internally however, each notebook version is identified by a 2-tuple consisting of a randomly generated, unique branch identifier and a monotonically increasing version number.
Vizier requires users to explicitly name branches when they are created; this explicit branch management makes it easier for users to follow the intent of a notebook's creator.
Internally, however, each notebook version is identified by a 2-tuple consisting of a randomly generated, unique branch identifier and a monotonically increasing notebook version number.
\tinysection{Cell Versions}
A cell version is an immutable specification of one step of a notebook workflow.
In its simplest form, the cell version stores the cell's type, as well as any parameters to the cell.
This can include simple parameters like the table and columns to plot for a \texttt{Plot} cell, scripts like those used in the \texttt{SQL}, \texttt{Scala}, and \texttt{Python} cells, as well as references to files like those uploaded into the \texttt{Load Dataset} cell.
In short, the cell configuration contains everything required to deterministically re-execute the cell.\footnote{Of course, we assume here that the computation of the cell itself is deterministic. For cells with non-deterministic computation , e.g., random number generators, we cannot guarantee that multiple execution of the same cell yield the same result.}
In short, the cell configuration contains everything required to deterministically re-execute the cell\footnote{Of course, we assume here that the computation of the cell itself is deterministic. For cells with non-deterministic computation, e.g., random number generators, we cannot guarantee that multiple executions of the same cell yield the same result.}.
A cell version is identified by an identifier derived from a hash of its parameters.
Alongside the cell version, Vizier also stores a cache of results derived from executing the cell.
Unlike the immutable cell version itself, this cache is mutable, and is updated every time the cell is executed.
The cell cache specifically includes:
(i) Outputs like console output or HTML-formatted data plots, excluding datasets produced by the cell;
(i) Outputs for presentation to the user, including console output and HTML-formatted data plots;
(ii) Coarse-grained (workflow) provenance information, including a list of datasets the cell read from and wrote to; and
(iii) A reference to each dataset version produced by the cell.
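The split between an immutable cell version and its mutable result cache might be modeled as in the following Python sketch. The class and field names, the canonical-JSON encoding, and the choice of SHA-256 are assumptions for illustration; the text above only states that the identifier is derived from a hash of the cell's parameters.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CellVersion:
    """Immutable specification of one workflow step."""
    cell_type: str
    params_json: str  # canonical (sorted-key) JSON encoding of the parameters

    @staticmethod
    def create(cell_type, params):
        return CellVersion(cell_type, json.dumps(params, sort_keys=True))

    @property
    def identifier(self):
        """Content-derived identifier: equal specifications hash equally."""
        payload = f"{self.cell_type}:{self.params_json}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


@dataclass
class CellCache:
    """Mutable results of the most recent execution of a cell version."""
    outputs: list = field(default_factory=list)    # (i) console output, plots
    read_set: set = field(default_factory=set)     # (ii) datasets read
    write_map: dict = field(default_factory=dict)  # (ii)/(iii) name -> version


# Two specifications that differ only in the order of their parameters.
plot_a = CellVersion.create("Plot", {"table": "parts", "x": "id"})
plot_b = CellVersion.create("Plot", {"x": "id", "table": "parts"})
plot_c = CellVersion.create("Plot", {"table": "parts", "x": "cost"})

cache = CellCache()
cache.outputs.append("rendered plot")  # the cache may change on each run
```

Sorting the parameter keys before hashing means that two cells with the same configuration receive the same identifier regardless of how the parameters were ordered when the cell was created.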
@ -143,7 +143,7 @@ When a cell is executed, it receives a scope that maps dataset names to dataset
To create or modify a dataset, a cell first initializes a new dataset version: uploading a new dataframe (workflow cells) or creating a Spark view (dataflow cells).
The scope entry for the corresponding dataset is updated accordingly, and passed to the next cell.
The result is the illusion of mutability, but with state checkpoints between each cell.
Of course, it may be more efficient to compute deltas between version instead taking full copies or even to trade computation for storage (recompute the dataset when required). For instance, such ideas have been applied in Nectar~\cite{GR10} and DataHub~\cite{BC15a}. We leave such optimization to future work.
Of course, it may be more efficient to compute deltas between versions instead of taking full copies, or even to trade computation for storage (i.e., recompute the dataset when required). For instance, such ideas have been applied in Nectar~\cite{GR10} and DataHub~\cite{BC15a}. We leave such optimizations to future work.
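The scope-passing scheme described above can be illustrated with a small sketch. It assumes that a scope is a plain mapping from dataset names to version identifiers and that a deleted dataset is signalled by \texttt{None}; the function name \texttt{apply\_writes} and the version identifiers are hypothetical.

```python
def apply_writes(scope, writes):
    """Return the scope handed to the next cell.

    The incoming scope is never modified, so every cell boundary keeps
    its own checkpoint of which dataset versions were visible there.
    """
    next_scope = dict(scope)  # copy: earlier checkpoints stay intact
    for name, version_id in writes.items():
        if version_id is None:
            next_scope.pop(name, None)     # the cell deleted the dataset
        else:
            next_scope[name] = version_id  # the cell created or replaced it
    return next_scope


scope0 = {"R": "v1"}
scope1 = apply_writes(scope0, {"R": "v2"})  # a cell "modifies" R
scope2 = apply_writes(scope1, {"S": "v3"})  # a cell creates S
checkpoints = [scope0, scope1, scope2]
```

From the point of view of a cell, \texttt{R} appears to have been updated in place, yet the version visible before the update is still reachable through the earlier checkpoint.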
@ -181,12 +181,11 @@ Of course, it may be more efficient to compute deltas between version instead ta
\end{figure}
\subsection{Dependency Models}
\tinysection{Dependencies}
As illustrated in \Cref{fig:virtualProvenance} Vizier manages two graphs of dependencies: one at the dataflow and one at the workflow level.
As illustrated in \Cref{fig:virtualProvenance}, Vizier manages two graphs of dependencies: one at the dataflow and one at the workflow level.
Workflow dependencies form a coarse-grained view of which cells depend on which other cells, and are used to manage cell execution order.
When a cell is executed, dataset accesses are instrumented to compute
a \textbf{read set} which records the user-facing names of datasets accessed by the cell and
a \textbf{write map} which records user-facing names of datasets created by the cell with the identifier of the new/updated version (or \texttt{NULL} if the cell deleted the dataset).
a \textbf{read set} that records the user-facing names of datasets accessed by the cell and
a \textbf{write map} that records user-facing names of dataset versions created by the cell with the new version's identifier (or \texttt{NULL} if the cell deleted the dataset).
For each dataset in a cell's read set, Vizier creates a workflow edge to the most recent cell to modify that dataset (i.e., the last preceding cell with that dataset in its write map).
For example, in \Cref{fig:virtualProvenance}, the \texttt{Python} cell creates a new version of dataset \lstinline{R} that is read by both the \texttt{Merge} and \texttt{Insert Row} cells.
Thus both of the latter cells depend on the former.
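The edge construction can be sketched in a few lines. The sketch assumes each executed cell is summarized as a triple of name, read set, and write map, listed in notebook order; the function name \texttt{workflow\_edges}, the version identifiers, and the datasets written by the \texttt{Merge} and \texttt{Insert Row} cells are hypothetical and chosen only to mirror the example above.

```python
def workflow_edges(cells):
    """Derive coarse-grained workflow edges from read sets and write maps.

    cells: list of (cell_name, read_set, write_map) in notebook order.
    Returns (reader, writer) pairs: the reader depends on the writer.
    """
    last_writer = {}  # dataset name -> most recent cell that modified it
    edges = []
    for name, read_set, write_map in cells:
        # Resolve reads first: a cell that reads and rewrites the same
        # dataset depends on the previous writer, not on itself.
        for dataset in sorted(read_set):
            if dataset in last_writer:
                edges.append((name, last_writer[dataset]))
        for dataset in write_map:
            last_writer[dataset] = name
    return edges


notebook = [
    ("Python", set(), {"R": "R@1"}),
    ("Merge", {"R"}, {"M": "M@1"}),
    ("Insert Row", {"R"}, {"R": "R@2"}),
]
edges = workflow_edges(notebook)
```

Both \texttt{Merge} and \texttt{Insert Row} end up with an edge to the \texttt{Python} cell, matching the dependency described above.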
@ -200,11 +199,11 @@ Instead, Vizier creates coarse-grained provenance records based on the cell's re
\subsection{Execution Model}
\Cref{fig:system_diagram} overviews the Vizier system, including the its two core components: The workflow manager and the Dataflow manager.
\Cref{fig:system_diagram} overviews the Vizier system, including its two core components: the workflow manager and the dataflow manager.
The workflow manager is a simplified version of the VisTrails~\cite{DBLP:conf/visualization/BavoilCSVCSF05} workflow system, and is responsible for managing cells, inter-cell dependencies, scheduling workflow cell execution, and the version history.
The Dataflow manager is implemented using the Mimir incomplete database~\cite{yang2015lenses,feng:2019:sigmod:uncertainty,kumari:2016:qdb:communicating}, and is responsible for fine-grained provenance, data loading, Vizier's ``lens"~\cite{yang2015lenses} data curation operators (the cleaning cell types supported by Vizier), and functionality required to support caveats.
Mimir, in turn, is implemented as a query-rewriting front-end over Apache Spark, which handles the heavy lifting of query processing.
At the other end of the Vizier stack, most user interactions go through a user interface that provides a range of modalities for interacting with and analyzing the workflow.
The dataflow manager is implemented using the Mimir incomplete database~\cite{yang2015lenses,feng:2019:sigmod:uncertainty,kumari:2016:qdb:communicating}, and is responsible for fine-grained provenance, data loading, Vizier's ``lens''~\cite{yang2015lenses} data curation operators (the cleaning cell types supported by Vizier), and functionality required to support caveats.
Mimir, in turn, is implemented as a query-rewriting front-end over Apache Spark, which handles query evaluation.
At the other end of the Vizier stack is a user interface that provides a range of modalities for interacting with and analyzing the workflow.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure*}
@ -222,7 +221,7 @@ Clients can create or delete notebooks; add, update, or delete cells; or create
The workflow manager tracks notebook and cell versions, as well as a dependency graph between cells.
The workflow manager executes cells asynchronously.
As a \emph{workflow} cell is added or updated, it is scheduled for execution through either a lightweight worker process, or through a celery distributed task queue (\url{http://www.celeryproject.org/}) if one is configured.
As a \emph{workflow} cell is added or updated, it is scheduled for execution through either a lightweight worker process or, if one is configured, a Celery distributed task queue\footnote{\url{http://www.celeryproject.org/}}.
Both execution engines establish a connection to the dataflow manager through which the cell can access existing datasets or upload new dataset versions.
When a \emph{dataflow} cell is added or updated, the workflow manager allocates a new, globally unique view identifier and synchronously creates the view in the dataflow manager.
@ -230,7 +229,7 @@ When a \emph{dataflow} cell is added or updated, the workflow manager allocates
\tinysection{Execution Scheduling}
Anytime a cell is added, updated, or deleted, all cells that transitively depend on it must be re-executed as well.
The challenge is that, at least for workflow cells, dependencies are not inferred statically --- they are observed from execution traces.
As noted above, we require cell execution to be deterministic: if a cell's dependencies are unchanged, the cell does not need re-execution.
We require cell execution to be deterministic: If a cell's dependencies are unchanged, the cell does not need re-execution.
When a cell does need re-execution, we pessimistically assume all downstream datasets could be affected until execution completes.
Scheduling and execution strategies are presented in simplified forms in Algorithms \ref{alg:schedUpdates} and \ref{alg:asyncEval}, respectively.
When the notebook cell at index $i$ changes, \Cref{alg:schedUpdates} marks it as \texttt{dirty} and marks all downstream cells as \texttt{waiting} (i.e., potentially dirty).
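The marking step can be sketched as follows. This is a simplified illustration, not the algorithm from the paper: it assumes that ``downstream'' means every cell after index $i$ in notebook order, and the state name \texttt{done} as well as both function names are hypothetical.

```python
def schedule_update(states, i):
    """Mark cell i as dirty and every later cell as waiting.

    states: list of per-cell states in notebook order.
    Returns a new list; the input list is left unchanged.
    """
    updated = list(states)
    updated[i] = "dirty"
    for j in range(i + 1, len(updated)):
        updated[j] = "waiting"  # potentially dirty until proven otherwise
    return updated


def resolve_waiting(read_set, changed_datasets):
    """Decide the fate of a waiting cell once its upstream cells finished.

    Assumption: because cells are deterministic, a cell whose inputs did
    not change can keep its cached result instead of re-executing.
    """
    return "dirty" if read_set & changed_datasets else "done"


states = schedule_update(["done", "done", "done", "done"], 1)
```

Here the cell at index 1 becomes \texttt{dirty}, the two cells after it become \texttt{waiting}, and the cell before it is untouched.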
@ -317,14 +316,14 @@ When a dataset is created from scratch, the caller may optionally create virtual
Alternatively, a \emph{declarative} dataset version may be created as a view in one of three ways.
The caller provides:
(i) a standard SQL view definition,
(ii) a script in an imperative, DDL/DML-style language called Vizual~\cite{freire:2016:hilda:exception} used to implement spreadsheed-style operations, or
(ii) a script in an imperative, DDL/DML-style language called Vizual~\cite{freire:2016:hilda:exception} used to implement spreadsheet operations, or
(iii) a lens definition~\cite{yang2015lenses} used to implement cleaning cells.
We discuss Vizual further in \Cref{sec:spreadsheet} and lenses further in \Cref{sec:caveats}.
Regardless of how the views are specified, the dataflow manager records the corresponding view definition in an intermediate representation based on Relational Algebra~\cite{nandi:2016:arxiv:mimir} and returns a unique identifier to the caller.
\tinysection{Dataset Access}
The dataflow layer is built on Apache Spark~\cite{DBLP:conf/nsdi/ZahariaCDDMMFSS12} for scalability.
As noted above, literal datasets are stored by the dataflow manager as URLs.
Literal datasets are stored by the dataflow manager as URLs.
The dataflow manager selects a Spark data loader (e.g., CSV, parquet) appropriate for the URL and creates a Spark dataframe.
Queries and views defined over these dataframes are compiled from Vizier's intermediate representation to Spark's intermediate query representation, and optimized and executed directly in Spark.
During this translation, queries are instrumented through a lightweight rewriting scheme~\cite{yang2015lenses,feng:2019:sigmod:uncertainty} that marks caveatted cells and rows as discussed in \Cref{sec:caveats}.
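One plausible way to select a loader for a URL is to dispatch on the file extension. The sketch below is an assumption about how such a selection could work, not a description of Mimir's logic; the table \texttt{FORMATS} and the function \texttt{pick\_format} are hypothetical, and the format names are the standard Spark data source names.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to Spark data source name.
FORMATS = {".csv": "csv", ".parquet": "parquet", ".json": "json"}


def pick_format(url, default="csv"):
    """Choose a data source name from the extension of the URL's path."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = posixpath.splitext(path)[1].lower()
    return FORMATS.get(extension, default)


fmt = pick_format("https://example.org/data/trips.parquet")

# With an active PySpark session named `spark`, the dataframe could then
# be created with: spark.read.format(fmt).load(url)
```

Falling back to a default format is a design choice of this sketch; a real system might instead reject URLs it cannot classify.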
@ -397,9 +396,9 @@ When the cell completes, the status area can be used to show:
(i) Console output (i.e., from script cells),
(ii) HTML data visualizations generated by the cell (e.g., from the Plot cell or Python plotting libraries like Bokeh), or
(iii) A tabular view of any dataset version as it would appear immediately after the cell.
When a dataset is displayed, as in \Cref{fig:notebook:cells}, the UI highlights caveatted cells in red.
When a dataset is displayed (\Cref{fig:notebook:cells}), caveatted cells are highlighted in red.
Users may manually switch the status view to display any available output or any dataset version at that point in the notebook.
This allows users to, for example, debug a cell by simultaneously viewing before and after versions of the dataset.
This allows users to, for example, debug a cell by simultaneously viewing before and after dataset versions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -436,9 +435,9 @@ As spreadsheet edits are typically lightweight and likely to occur in rapid succ
After a Vizual command has been created in response to a spreadsheet action, the spreadsheet view must be refreshed.
However, the normal cell execution workflow is too heavy-weight to allow this refresh to happen at interactive speeds.
To make value updates ``feel'' instantaneous, refreshes are handled asynchronously;
To make value updates ``feel'' instantaneous, refreshes are asynchronous;
while the spreadsheet view is being updated, the client displays a placeholder value, typically the value the user typed in.
Leveraging the caveat metaphor, placeholder values are highlighted in red until the refresh completes.
Following the caveat metaphor, placeholder values are highlighted in red until the refresh completes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -460,7 +459,7 @@ The Vizier UI provides one caveat tab for each dataset present at the end of the
When a caveat tab is opened, the dataflow manager is queried (see \Cref{sec:caveats}) to generate a list of all caveats affecting the specified dataset as illustrated in \Cref{fig:caveatView}.
Each caveat is displayed with a human-readable description of the error, for example as an input datum that could not be properly cast to the type of the column.
As discussed in \Cref{sec:caveats}, the caveat list is a summary, with caveats organized into groups based on the type of error.
The interface also allows caveats to be acknowledged by clicking on the caveat and then clicking "Acknowledge"
The interface also allows caveats to be acknowledged by clicking on the caveat and then clicking ``Acknowledge.''
An acknowledged caveat is still displayed in the caveat list, but otherwise ignored.
For example, cells that depend on it will not be highlighted in data tables shown in Vizier.
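The effect of acknowledging a caveat can be sketched as a filter applied when deciding which cells to highlight. The \texttt{Caveat} record and the function \texttt{highlighted\_cells} are hypothetical; in particular, representing the affected cells as (row, column) pairs is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Caveat:
    message: str                # human-readable description of the concern
    cells: frozenset            # affected (row, column) pairs
    acknowledged: bool = False  # set when the user clicks "Acknowledge"


def highlighted_cells(caveats):
    """Cells to highlight: those affected by an unacknowledged caveat."""
    result = set()
    for caveat in caveats:
        if not caveat.acknowledged:
            result |= caveat.cells
    return result


caveats = [
    Caveat("Could not cast 'n/a' to int", frozenset({(3, "age")})),
    Caveat("Outlier in column 'price'", frozenset({(7, "price")}), acknowledged=True),
]
visible = highlighted_cells(caveats)
```

The acknowledged caveat remains in the list, so it can still be shown in the caveat tab, but it no longer contributes to highlighting.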