State model

2019-10-31 19:27:17 -04:00 · 3741bcc7dd
parent fa84972f52
commit 3741bcc7dd
3 changed files with 99 additions and 17 deletions
--- a/graphics/wf_vs_df_cells.pdf
+++ b/graphics/wf_vs_df_cells.pdf
--- a/measurements/a5598dfd.tar.gz
+++ b/measurements/a5598dfd.tar.gz
--- a/sections/system.tex
+++ b/sections/system.tex
@@ -1,26 +1,108 @@

 % -*- root: ../paper.tex -*-

-Vizier is an interactive code notebook similar to Jupyter or Apache Zeppelin.
-As in these systems, a Vizier notebook is a sequence of cells.
-Each cell describes one unit of computation, for example a Python or SQL script.
-Once a cell is executed, its results are displayed inline as part of the cell.
+Vizier is an interactive code notebook in the style of Jupyter or Apache Zeppelin.
+An analytics workflow is broken down into individual steps called cells.
+A workflow cell displays code (e.g., Python or SQL) implementing the step, as well as any outputs (e.g., console output, data visualizations, or datasets) produced by this code.
+Each cell sees the effects of all preceding cells, and can apply changes visible to subsequent cells. 
+The user may edit, add, or delete any cell in the notebook, which triggers re-execution of all dependent cells appearing below it in the notebook.

-In contrast to REPL-based notebook systems (i.e., library managers), Vizier is a
-workflow and data manager: A notebook consists of a workflow specification (the
-cells and their configurations) as well as multiple datasets (relational tables). Both the
+Vizier workflows allow different cell types to be mixed within the same notebook.
+\Cref{fig:cellTypes} lists the cell types presently supported by Vizier.  
+Classical notebook workhorses like Python, SQL, and Markdown cells are supported.
+Vizier additionally supports cells providing point-and-click interfaces that streamline (1) common notebook tasks like data ingest/export and visualization; (2) primitive dataset manipulations (\Cref{sec:spreadsheet}); and (3) data cleaning operations (\Cref{sec:errors}).
+
+
+\begin{figure}[htbp]
+  \centering
+  \begin{tabular}{r|p{1.7in}|c}
+    \textbf{Category} & \textbf{Cells} & \textbf{API} \\ \hline
+    Script            & Python, Scala                  & Workflow \\
+    Query             & SQL                            & Dataflow \\
+    Information       & Markdown                       & n/a \\
+    Point/Click       & Plot, Load Data, Export Data   & Workflow \\
+    Primitive         & Add/Delete/Move Row, 
+                        Add/Delete/Move/Rename Column, 
+                        Edit Cell, Sort, Filter, 
+                        Rename Dataset                 & Dataflow \\
+    Cleaning          & Infer Types, Repair Key, 
+                        Impute, Shape Checker, 
+                        Repair Timeline, Merge Columns, 
+                        Geocode                        & Dataflow \\
+  \end{tabular}
+  \caption{Cell Types in Vizier}
+  \label{fig:cellTypes}
+\end{figure}
+
+
+\subsection{Cell State}
+Data flow in a workflow system is explicit, with the outputs of one step explicitly bound to the inputs expected by another.
+In a notebook, data flow is implicit, with each cell manipulating a global, shared state.
+For example, in Jupyter, this state is the internal state of the interpreter itself (variables, allocated objects, etc.).
+Jupyter cells are executed in the context of this interpreter, leaving behind a new state for subsequently executed cells.
+The design of Vizier's state model is driven by four factors:
+(i) \textbf{In-order execution}: We needed state that could be quickly and efficiently checkpointed to allow replay from any point in the notebook;
+(ii) \textbf{Dependency analysis}: To minimize the number of cells that need to be re-executed, we needed to be able to instrument each cell's interactions (reads/writes) with the state;
+(iii) \textbf{Big Data}: Given Vizier's focus on data exploration, we wanted a state model that supports large, structured datasets;
+(iv) \textbf{Caveats}: Provenance-based features of Vizier benefit from fine-grained provenance, making it useful to have support for declarative state transformations.
+
+A natural starting point for Vizier's state model, satisfying all four criteria, was relational tables.
+Concretely, state in Vizier is a set of named relational tables, called datasets.\footnote{Support for primitive-valued and blob/text-like state is a work in progress.}
+Cells can access, manipulate, or destroy existing datasets, as well as create new ones.
+However, the specific way in which this happens differs by cell.
+Ideally, we would like to collect fine-grained provenance information about each cell's interactions with datasets.
+However, fine-grained provenance is not always feasible: 
+For example, while instrumenting Python cells to collect this information is possible under limited circumstances~\cite{}, further generalization is not practical.
+Instead, Vizier provides a two-tiered dataset API for cells to interact with datasets; \Cref{fig:cellTypes} identifies which cell types adopt which API.
+
+\begin{figure}[htbp]
+  \centering
+  \includegraphics[width=0.95\textwidth]{graphics/wf_vs_df_cells.pdf}
+  \caption{Workflow vs Dataflow State Interactions}
+  \label{fig:wfVsDFCells}
+\end{figure}
+
+\tinysection{Workflow Cells}
+The workflow API, illustrated in \Cref{fig:wfVsDFCells}.a, is aimed at cells where collecting only coarse-grained provenance information is feasible. 
+This includes the \texttt{Python} and \texttt{Scala} script cells, where instrumentation poses a challenge, as well as cells like \texttt{Load Data} that manipulate entire datasets.
+To discourage out-of-band communication between cells, as well as to avoid remote code execution attacks when Vizier is run in a public setting, workflow cells are typically executed in an isolated environment.
+Vizier presently supports execution in an independent interpreter (for efficiency) or a Docker container (for safety).
+Vizier's workflow API is designed accordingly, providing three primitive operations: Read dataset (copy a named dataset from Vizier to the isolated execution environment), Checkpoint dataset (copy an updated version of a dataset back to Vizier), and Create dataset (allocate a new dataset in Vizier).
+A more efficient asynchronous, paged version of the read operation is also available, and the Create dataset operation can optionally initialize a dataset from a URL, S3 bucket, or Google Sheet to avoid unnecessary copies.
+
+\tinysection{Dataflow Cells}
+The dataflow cell API, illustrated in \Cref{fig:wfVsDFCells}.b, is aimed at cell types that implement declarative dataset transformations.
+Dataflow cells are first compiled down to equivalent SQL queries.
+Thus, in contrast to workflow cells, where datasets are defined explicitly, dataflow cells define and update relations as materialized views.
+
+
+\subsection{Caveats}
+
+
+
+- What is a Caveat
+- Annotation
+  - Comparison to Coloring
+  - General definition
+- Propagation Rules
+  - Dataflow Cells
+  - Workflow Cells
+
+
+
+
+\subsection{Execution Model}
+
+
+
+\subsection{User Interface}
+
+
+ Both the
 workflow and the data are versioned as we will explain in more detail in
-Section~\ref{sec:version-model}.  The cells of the notebook are steps in a
-workflow that each run in an isolated execution environment. Thus, no unforeseen
-interactions between cells are possible.  The datasets of a notebook are either
-imported from the outside world using a load dataset cell or are produced as
-output by other cell types.
+Section~\ref{sec:version-model}.  

-Cells can consume, produce, and update datasets through language-native APIs provided by Vizier for each cell type. For instance, in SQL cells, datasets are accessed as regular tables, while Python and Scala cells access datasets through a dataframe-style API. By isolating cells and by storing datasets in a
-relational database, Vizier can support many different cell types in the same
-notebook including Python, SQL, and Scala cells, as well as
-point-and-click cell types to streamline common notebook tasks like data ingest,
-missing value imputation, or visualization.  In addition, using a relational database for state makes it possible to link other interaction modalities
+In addition, using a relational database for state makes it possible to link other interaction modalities
 to the Vizier notebook.  In particular, Vizier allows users to directly interact
 with notebook state through a spreadsheet-like view.  User interactions in this
 view are translated back into (reproducible) notebook cells.