State model

master
Oliver Kennedy 2019-10-31 19:27:17 -04:00
parent fa84972f52
commit 3741bcc7dd
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 99 additions and 17 deletions

BIN
graphics/wf_vs_df_cells.pdf Normal file

Binary file not shown.


@ -1,26 +1,108 @@
% -*- root: ../paper.tex -*-
Vizier is an interactive code notebook similar to Jupyter or Apache Zeppelin.
As in these systems, a Vizier notebook is a sequence of cells.
Each cell describes one unit of computation, for example a Python or SQL script.
Once a cell is executed, its results are displayed inline as part of the cell.
Vizier is an interactive code notebook in the style of Jupyter or Apache Zeppelin.
An analytics workflow is broken down into individual steps called cells.
A workflow cell displays the code (e.g., Python or SQL) implementing the step, as well as any outputs (e.g., console output, data visualizations, or datasets) produced by this code.
Each cell sees the effects of all preceding cells, and can apply changes visible to subsequent cells.
The user may edit, add, or delete any cell in the notebook, which triggers re-execution of all dependent cells appearing below it in the notebook.
In contrast to REPL-based notebook systems (i.e., library managers), Vizier is a
workflow and data manager: A notebook consists of a workflow specification (the
cells and their configurations) as well as multiple datasets (relational tables). Both the
Vizier workflows allow different cell types to be mixed within the same notebook.
\Cref{fig:cellTypes} lists the cell types presently supported by Vizier.
Classical notebook workhorses like Python, SQL, and Markdown cells are supported.
Vizier additionally supports cells providing point-and-click interfaces that streamline (1) common notebook tasks like data ingest/export and visualization; (2) primitive dataset manipulations (\Cref{sec:spreadsheet}); and (3) data cleaning operations (\Cref{sec:errors}).
\begin{figure}[htbp]
\centering
\begin{tabular}{r|p{1.7in}|c}
\textbf{Category} & \textbf{Cells} & \textbf{API} \\ \hline
Script & Python, Scala & Workflow \\
Query & SQL & Dataflow \\
Information & Markdown & n/a \\
Point/Click & Plot, Load Data, Export Data & Workflow \\
Primitive & Add/Delete/Move Row,
Add/Delete/Move/Rename Column,
Edit Cell, Sort, Filter,
Rename Dataset & Dataflow \\
Cleaning & Infer Types, Repair Key,
Impute, Shape Checker,
Repair Timeline, Merge Columns,
Geocode & Dataflow \\
\end{tabular}
\caption{Cell Types in Vizier}
\label{fig:cellTypes}
\end{figure}
\subsection{Cell State}
Data flow in a workflow system is explicit, with the outputs of one step explicitly bound to the inputs expected by another.
In a notebook, data flow is implicit, with each cell manipulating a global, shared state.
For example, in Jupyter, this state is the internal state of the interpreter itself (variables, allocated objects, etc.).
Jupyter cells are executed in the context of this interpreter, leaving behind a new state for subsequently executed cells.
The design of Vizier's state model is driven by four factors:
(i) \textbf{In-order execution}: We needed state that could be quickly and efficiently checkpointed to allow replay from any point in the notebook;
(ii) \textbf{Dependency analysis}: To minimize the number of cells that need to be re-executed, we need to be able to instrument each cell's interactions (reads/writes) with the state;
(iii) \textbf{Big Data}: Given Vizier's focus on data exploration, we wanted a state model that supports large, structured datasets;
(iv) \textbf{Caveats}: Provenance-based features of Vizier benefit from fine-grained provenance, making it useful to have support for declarative state transformations.
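
As a minimal sketch of the dependency analysis in (ii), assuming each cell's reads and writes are known at the granularity of whole datasets, the set of cells to re-execute after an edit could be computed as follows. The \texttt{Cell} record and the \texttt{cells\_to\_rerun} function are hypothetical names introduced for illustration; they are not part of Vizier.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical cell record: the names of datasets it reads and writes."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def cells_to_rerun(cells, edited_index):
    """Return the names of cells that must re-execute after an edit.

    The edited cell always re-runs. A later cell re-runs only if it reads
    a dataset that was written by a cell already marked for re-execution.
    """
    stale = set()   # datasets whose contents may have changed
    rerun = []
    for i, cell in enumerate(cells):
        if i < edited_index:
            continue  # cells above the edit are unaffected
        if i == edited_index or cell.reads & stale:
            rerun.append(cell.name)
            stale |= cell.writes
        else:
            # An unaffected cell overwrites its outputs with unchanged values.
            stale -= cell.writes
    return rerun


# Illustrative notebook (all cell and dataset names are assumptions).
notebook = [
    Cell("load_a", writes={"A"}),
    Cell("load_b", writes={"B"}),
    Cell("clean_a", reads={"A"}, writes={"A"}),
    Cell("plot_a", reads={"A"}),
    Cell("export_b", reads={"B"}),
]
affected_by_load_a = cells_to_rerun(notebook, 0)
affected_by_load_b = cells_to_rerun(notebook, 1)
```

In this sketch, editing \texttt{load\_a} leaves \texttt{load\_b} and \texttt{export\_b} untouched, since neither reads a dataset affected by the edit.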
A natural starting point for Vizier's state model, satisfying all four criteria, was relational tables.
Concretely, state in Vizier is a set of named relational tables, called datasets\footnote{Support for primitive-valued and blob/text-like state is a work in progress.}.
Cells can access, manipulate, or destroy existing datasets, as well as create new ones.
However, the specific way in which this happens differs by cell.
Ideally we would like to collect fine-grained provenance information about each cell's interactions with the dataset.
However, fine-grained provenance is not always feasible:
For example, while instrumenting Python cells to collect this information is possible under limited circumstances\cite{}, further generalization is not practical.
Instead, Vizier provides a two-tiered dataset API for cells to interact with datasets; \Cref{fig:cellTypes} identifies which cell types adopt which API.
\begin{figure}[htbp]
\centering
\includegraphics[width=0.95\textwidth]{graphics/wf_vs_df_cells.pdf}
\caption{Workflow vs Dataflow State Interactions}
\label{fig:wfVsDFCells}
\end{figure}
\tinysection{Workflow Cells}
The workflow API, illustrated in \Cref{fig:wfVsDFCells}.a, is aimed at cells where collecting only coarse-grained provenance information is feasible.
This includes the \texttt{Python} and \texttt{Scala} script cells, where instrumentation poses a challenge, as well as cells like \texttt{Load Dataset} that manipulate entire datasets.
To discourage out-of-band communication between cells, as well as to avoid remote code execution attacks when Vizier is run in a public setting, workflow cells are typically executed in an isolated environment.
Vizier presently supports execution in an independent interpreter (for efficiency) or a Docker container (for safety).
Vizier's workflow API is designed accordingly, providing three primitive operations: Read dataset (copy a named dataset from Vizier to the isolated execution environment), Checkpoint dataset (copy an updated version of a dataset back to Vizier), and Create dataset (allocate a new dataset in Vizier).
A more efficient asynchronous, paged version of the read operation is also available, and the Create dataset operation can optionally initialize a dataset from a URL, S3 bucket, or Google Sheet to avoid unnecessary copies.
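
The copy semantics of the three primitive operations can be illustrated with a minimal in-memory sketch. The \texttt{DatasetStore} class and its method names are hypothetical stand-ins chosen for this example; they do not reproduce Vizier's actual API, and the paged read and URL-based initialization are omitted.

```python
import copy


class DatasetStore:
    """Hypothetical in-memory stand-in for Vizier's dataset state."""

    def __init__(self):
        self._datasets = {}  # dataset name -> list of row dicts

    def create_dataset(self, name, rows=()):
        """Create dataset: allocate a new named dataset."""
        if name in self._datasets:
            raise KeyError(f"dataset {name!r} already exists")
        self._datasets[name] = [dict(row) for row in rows]

    def read_dataset(self, name):
        """Read dataset: hand the cell its own copy of the rows."""
        return copy.deepcopy(self._datasets[name])

    def checkpoint_dataset(self, name, rows):
        """Checkpoint dataset: copy an updated version back into the store."""
        if name not in self._datasets:
            raise KeyError(f"dataset {name!r} does not exist")
        self._datasets[name] = copy.deepcopy(rows)


# Illustrative use from inside an isolated cell (names are assumptions).
store = DatasetStore()
store.create_dataset("sales", [{"region": "east", "total": 10}])

local = store.read_dataset("sales")
local[0]["total"] = 12                                  # edit the private copy
before = store.read_dataset("sales")[0]["total"]        # store is unchanged

store.checkpoint_dataset("sales", local)                # publish the update
after = store.read_dataset("sales")[0]["total"]
```

Because reads return copies, a cell's local modifications become visible to subsequent cells only once they are checkpointed, which matches the isolation goal described above.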
\tinysection{Dataflow Cells}
The dataflow cell API, illustrated in \Cref{fig:wfVsDFCells}.b, is aimed at cell types that implement declarative dataset transformations.
Dataflow cells are first compiled down to equivalent SQL queries.
Thus, in contrast to workflow cells, where datasets are defined explicitly, dataflow cells define and update relations as materialized views.
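
The idea of defining each dataflow cell's output as a view over its input can be sketched with Python's built-in \texttt{sqlite3} module. This is an illustration under stated assumptions, not Vizier's compiler: the two \texttt{*\_to\_sql} helpers, the table name \texttt{sales}, and the view names are hypothetical, and SQLite offers only ordinary (non-materialized) views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("west", 3), ("north", 7)],
)


def filter_cell_to_sql(source, predicate):
    """Hypothetical compiler for a Filter cell (no input sanitization)."""
    return f"SELECT * FROM {source} WHERE {predicate}"


def rename_column_cell_to_sql(source, columns, old, new):
    """Hypothetical compiler for a Rename Column cell."""
    projection = ", ".join(f"{c} AS {new}" if c == old else c for c in columns)
    return f"SELECT {projection} FROM {source}"


# Each cell defines a new version of the dataset as a view over the previous one.
conn.execute(
    f"CREATE VIEW sales_v1 AS {filter_cell_to_sql('sales', 'total > 5')}"
)
conn.execute(
    "CREATE VIEW sales_v2 AS "
    + rename_column_cell_to_sql("sales_v1", ["region", "total"], "total", "amount")
)

rows = conn.execute(
    "SELECT region, amount FROM sales_v2 ORDER BY amount"
).fetchall()
```

Since every version is defined declaratively in terms of its predecessor, the chain of views itself records how each dataset was derived.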
\subsection{Caveats}
- What is a Caveat
- Annotation
- Comparison to Coloring
- General definition
- Propagation Rules
- Dataflow Cells
- Workflow Cells
\subsection{Execution Model}
\subsection{User Interface}
Both the
workflow and the data are versioned as we will explain in more detail in
Section~\ref{sec:version-model}. The cells of the notebook are steps in a
workflow that each run in an isolated execution environment. Thus, no unforeseen
interactions between cells are possible. The datasets of a notebook are either
imported from the outside world using a load dataset cell or are produced as
output by other cell types.
Section~\ref{sec:version-model}.
Cells can consume, produce, and update datasets through language-native APIs provided by Vizier for each cell type. For instance, in SQL cells, datasets are accessed as regular tables, while Python and Scala cells access datasets through a dataframe-style API. By isolating cells and by storing datasets in a
relational database, Vizier can support many different cell types in the same
notebook including Python, SQL, and Scala cells, as well as
point-and-click cell types to streamline common notebook tasks like data ingest,
missing value imputation, or visualization. In addition, using a relational database for state makes it possible to link other interaction modalities
In addition, using a relational database for state makes it possible to link other interaction modalities
to the Vizier notebook. In particular, Vizier allows users to directly interact
with notebook state through a spreadsheet-like view. User interactions in this
view are translated back into (reproducible) notebook cells.