State model
parent
fa84972f52
commit
3741bcc7dd
@@ -1,26 +1,108 @@
% -*- root: ../paper.tex -*-

Vizier is an interactive code notebook in the style of Jupyter or Apache Zeppelin.
As in these systems, a Vizier notebook is a sequence of cells, each of which describes one step of an analytics workflow.
A cell displays the code (e.g., Python or SQL) implementing its step and, once the cell is executed, displays any outputs (e.g., console output, data visualizations, or datasets) produced by this code inline as part of the cell.
Each cell sees the effects of all preceding cells, and can apply changes visible to subsequent cells.
The user may edit, add, or delete any cell in the notebook, which triggers re-execution of all dependent cells appearing below it in the notebook.

In contrast to REPL-based notebook systems (i.e., library managers), Vizier is a
workflow and data manager: A notebook consists of a workflow specification (the
cells and their configurations) as well as multiple datasets (relational tables).
Vizier workflows allow different cell types to be mixed within the same notebook.
\Cref{fig:cellTypes} lists the cell types presently supported by Vizier.
Classical notebook workhorses like Python, SQL, and Markdown cells are supported.
Vizier additionally supports cells providing point-and-click interfaces that streamline (1) common notebook tasks like data ingest/export and visualization; (2) primitive dataset manipulations (\Cref{sec:spreadsheet}); and (3) data cleaning operations (\Cref{sec:errors}).
\begin{figure}[htbp]
\centering
\begin{tabular}{r|p{1.7in}|c}
\textbf{Category} & \textbf{Cells} & \textbf{API} \\ \hline
Script & Python, Scala & Workflow \\
Query & SQL & Dataflow \\
Information & Markdown & n/a \\
Point/Click & Plot, Load Data, Export Data & Workflow \\
Primitive & Add/Delete/Move Row,
Add/Delete/Move/Rename Column,
Edit Cell, Sort, Filter,
Rename Dataset & Dataflow \\
Cleaning & Infer Types, Repair Key,
Impute, Shape Checker,
Repair Timeline, Merge Columns,
Geocode & Dataflow \\
\end{tabular}
\caption{Cell Types in Vizier}
\label{fig:cellTypes}
\end{figure}
\subsection{Cell State}
Data flow in a workflow system is explicit, with the outputs of one step explicitly bound to the inputs expected by another.
In a notebook, data flow is implicit, with each cell manipulating a global, shared state.
For example, in Jupyter, this state is the internal state of the interpreter itself (variables, allocated objects, etc.).
Jupyter cells are executed in the context of this interpreter, leaving behind a new state for subsequently executed cells.
The design of Vizier's state model is driven by four factors:
(i) \textbf{In-order execution}: We needed state that could be quickly and efficiently checkpointed to allow replay from any point in the notebook;
(ii) \textbf{Dependency analysis}: To minimize the number of cells that need to be re-executed, we needed to be able to instrument each cell's interactions (reads/writes) with the state;
(iii) \textbf{Big Data}: Given Vizier's focus on data exploration, we wanted a state model that supports large, structured datasets;
(iv) \textbf{Caveats}: Provenance-based features of Vizier benefit from fine-grained provenance, making it useful to have support for declarative state transformations.

A natural starting point for Vizier's state model, satisfying all four criteria, was relational tables.
Concretely, state in Vizier is a set of named relational tables, called datasets.\footnote{Support for primitive-valued and blob/text-like state is a work in progress.}
Cells can access, manipulate, or destroy existing datasets, as well as create new ones.
However, the specific way in which this happens differs by cell type.
Ideally, we would like to collect fine-grained provenance information about each cell's interactions with the datasets.
However, fine-grained provenance is not always feasible:
For example, while instrumenting Python cells to collect this information is possible under limited circumstances~\cite{}, further generalization is not practical.
Instead, Vizier provides a two-tiered dataset API for cells to interact with datasets; \Cref{fig:cellTypes} identifies which cell types adopt which API.
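To make the dependency analysis of factor (ii) concrete, the following Python sketch computes which cells need to be re-executed after an edit, given each cell's recorded dataset reads and writes.
It is an illustration under stated assumptions rather than Vizier's implementation: the \texttt{Cell} record and the \texttt{cells\_to\_rerun} function are hypothetical names, and a write that is not accompanied by a read of the same dataset is assumed to replace that dataset entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical record of the datasets one cell reads and writes."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def cells_to_rerun(cells, edited_index):
    """Return the names of the cells to re-execute after editing cells[edited_index]."""
    # Datasets whose contents may differ from the last execution.
    stale = set(cells[edited_index].writes)
    rerun = [cells[edited_index].name]
    for cell in cells[edited_index + 1:]:
        if cell.reads & stale:
            # The cell consumed a stale dataset, so its outputs are stale too.
            rerun.append(cell.name)
            stale |= cell.writes
        else:
            # Assumption: a write without a stale read fully replaces the
            # dataset, so later readers see an unchanged (fresh) value.
            stale -= cell.writes
    return rerun


cells = [
    Cell("load", set(), {"raw"}),
    Cell("clean", {"raw"}, {"clean"}),
    Cell("lookup", set(), {"ref"}),
    Cell("plot", {"clean"}, set()),
    Cell("report", {"ref"}, set()),
]
after_clean_edit = cells_to_rerun(cells, 1)
after_load_edit = cells_to_rerun(cells, 0)
```

In this example, editing \texttt{clean} re-runs only \texttt{clean} and \texttt{plot}; the \texttt{lookup} and \texttt{report} cells never touch a stale dataset and keep their previous results.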
\begin{figure}[htbp]
\centering
\includegraphics[width=0.95\textwidth]{graphics/wf_vs_df_cells.pdf}
\caption{Workflow vs Dataflow State Interactions}
\label{fig:wfVsDFCells}
\end{figure}
\tinysection{Workflow Cells}
The workflow API, illustrated in \Cref{fig:wfVsDFCells}.a, is aimed at cells for which only coarse-grained provenance information can feasibly be collected.
This includes the \texttt{Python} and \texttt{Scala} script cells, where instrumentation poses a challenge, as well as cells like \texttt{Load Dataset} that manipulate entire datasets.
To discourage out-of-band communication between cells, as well as to avoid remote code execution attacks when Vizier is run in a public setting, workflow cells are typically executed in an isolated environment.
Vizier presently supports execution in an independent interpreter (for efficiency) or a Docker container (for safety).
Vizier's workflow API is designed accordingly, providing three primitive operations: Read dataset (copy a named dataset from Vizier to the isolated execution environment), Checkpoint dataset (copy an updated version of a dataset back to Vizier), and Create dataset (allocate a new dataset in Vizier).
A more efficient asynchronous, paged version of the read operation is also available, and the Create dataset operation can optionally initialize a dataset from a URL, S3 bucket, or Google Sheet to avoid unnecessary copies.
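The copy-in/copy-out behavior of the three primitives can be sketched in a few lines of Python.
The \texttt{DatasetStore} class and its method names below are hypothetical and do not reflect Vizier's actual API; datasets are modeled as in-memory lists of dictionaries, and the paged read is a synchronous generator standing in for the asynchronous version.

```python
import copy


class DatasetStore:
    """Hypothetical stand-in for Vizier's dataset storage (not the real API)."""

    def __init__(self):
        # Dataset name -> list of versions; each version is a list of row dicts.
        self._versions = {}

    def create_dataset(self, name, rows=()):
        """Create: allocate a new named dataset, optionally with initial rows."""
        if name in self._versions:
            raise ValueError(f"dataset {name!r} already exists")
        self._versions[name] = [copy.deepcopy(list(rows))]

    def read_dataset(self, name):
        """Read: copy the latest version out to the (isolated) caller."""
        return copy.deepcopy(self._versions[name][-1])

    def checkpoint_dataset(self, name, rows):
        """Checkpoint: copy an updated version back; returns its version number."""
        self._versions[name].append(copy.deepcopy(list(rows)))
        return len(self._versions[name]) - 1

    def read_pages(self, name, page_size):
        """Paged read: yield the latest version in chunks of page_size rows."""
        rows = self._versions[name][-1]
        for start in range(0, len(rows), page_size):
            yield copy.deepcopy(rows[start:start + page_size])


store = DatasetStore()
store.create_dataset("people", [{"name": "Ada", "age": 36},
                                {"name": "Alan", "age": 41}])
local = store.read_dataset("people")   # the cell works on its own copy
local[0]["age"] = 37                   # invisible to Vizier until checkpointed
unchanged = store.read_dataset("people")[0]["age"]
version = store.checkpoint_dataset("people", local)
updated = store.read_dataset("people")[0]["age"]
```

Because every operation copies, a cell's local modifications only become visible to later cells through an explicit checkpoint, which is also the point at which coarse-grained provenance (which cell wrote which dataset version) could be recorded.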
\tinysection{Dataflow Cells}
The dataflow cell API, illustrated in \Cref{fig:wfVsDFCells}.b, is aimed at cell types that implement declarative dataset transformations.
Dataflow cells are first compiled down to equivalent SQL queries.
Thus, in contrast to workflow cells, where datasets are defined explicitly, dataflow cells define and update relations as materialized views.
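As a rough illustration of this compilation step, the Python sketch below turns two declarative cells (a filter and an added column) into SQL queries and stacks them as views, so that each cell defines a new version of the dataset in terms of the previous one.
The cell dictionaries, \texttt{compile\_cell}, and \texttt{run\_dataflow} are hypothetical, and the sketch uses Python's \texttt{sqlite3} module with ordinary views, since SQLite has no materialized views; it is not how Vizier itself compiles cells.

```python
import sqlite3


def compile_cell(cell, source):
    """Compile one hypothetical dataflow cell into a SQL query over `source`."""
    # Sketch only: fragments are interpolated directly and must be trusted.
    if cell["type"] == "filter":
        return f"SELECT * FROM {source} WHERE {cell['condition']}"
    if cell["type"] == "add_column":
        return f"SELECT *, {cell['expression']} AS {cell['name']} FROM {source}"
    raise ValueError(f"unsupported cell type: {cell['type']}")


def run_dataflow(conn, base_table, cells):
    """Define one view per cell, each reading from the previous version."""
    current = base_table
    for version, cell in enumerate(cells, start=1):
        view = f"{base_table}_v{version}"
        conn.execute(f"CREATE VIEW {view} AS {compile_cell(cell, current)}")
        current = view
    return current  # name of the latest version of the dataset


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, price REAL, qty INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", 1.5, 10), ("ink", 4.0, 0), ("pad", 2.0, 3)],
)
cells = [
    {"type": "filter", "condition": "qty > 0"},
    {"type": "add_column", "name": "total", "expression": "price * qty"},
]
latest = run_dataflow(conn, "sales", cells)
rows = conn.execute(f"SELECT item, total FROM {latest} ORDER BY item").fetchall()
```

Since every cell is a query over its predecessor, no data is copied when a cell runs, and the chain of view definitions doubles as a fine-grained record of how each output row was derived.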
\subsection{Caveats}

- What is a Caveat
- Annotation
- Comparison to Coloring
- General definition
- Propagation Rules
- Dataflow Cells
- Workflow Cells

\subsection{Execution Model}

\subsection{User Interface}
Both the
workflow and the data are versioned, as we will explain in more detail in
Section~\ref{sec:version-model}. The cells of the notebook are steps in a
workflow that each run in an isolated execution environment. Thus, no unforeseen
interactions between cells are possible. The datasets of a notebook are either
imported from the outside world using a load dataset cell or are produced as
output by other cell types.

Cells can consume, produce, and update datasets through language-native APIs provided by Vizier for each cell type. For instance, in SQL cells, datasets are accessed as regular tables, while Python and Scala cells access datasets through a dataframe-style API. By isolating cells and by storing datasets in a
relational database, Vizier can support many different cell types in the same
notebook, including Python, SQL, and Scala cells, as well as
point-and-click cell types to streamline common notebook tasks like data ingest,
missing value imputation, or visualization. In addition, using a relational database for state makes it possible to link other interaction modalities
to the Vizier notebook. In particular, Vizier allows users to directly interact
with notebook state through a spreadsheet-like view. User interactions in this
view are translated back into (reproducible) notebook cells.