intro + make compact itemize

Boris Glavic 2016-04-25 09:09:13 -05:00
parent 7a2c7ccd3d
commit 9d08fa6cbc
4 changed files with 28 additions and 27 deletions

View file

@ -19,6 +19,7 @@
\usepackage[textsize=tiny]{todonotes}
\usepackage{cleveref}
\usepackage{tabu}
\usepackage{paralist}
%%%%%% Package Configuration %%%%%%
%%% Listings
@ -46,3 +47,4 @@
\newcommand{\ccomment}[1]{{\small\texttt{/*} #1 \texttt{*/}}}
\newcommand{\tinysection}[1]{\smallskip\noindent \textbf{#1.}$\,$}
\newcommand{\keyword}[1]{\textcolor{blue}{\texttt{#1}}}
\newcommand{\inlineitem}[1]{\noindent \textbullet \emph{{#1}:}}

View file

@ -11,13 +11,13 @@ Recording them directly in this form is likely to produce long, hard to follow \
Thus, it will be necessary for \sysname to dynamically rewrite scripts being modified in a principled way that optimizes for readability.
We consider readability to be a tradeoff between minimizing two measures: size and complexity.
For example, consider a sequence of 10 update actions with the form:\\[-5mm]
\begin{verbatim}
\begin{lstlisting}[morekeywords={ROWID}]
UPDATE A = 3 WHERE ROWID = ?
\end{verbatim}~\\[-5mm]
\end{lstlisting}~\\[-5mm]
with \texttt{?} assigned values from 1 to 10. Instead, we could express all 10 updates in a single expression:\\[-5mm]
\begin{verbatim}
\begin{lstlisting}[morekeywords={ROWID}]
UPDATE A = 3 WHERE ROWID BETWEEN 1 AND 10
\end{verbatim}~\\[-5mm]
\end{lstlisting}~\\[-5mm]
The latter script (a) represents the same concept more concisely, (b) has a similar level of complexity, and (c) is semantically equivalent.
Similar transformations appear in optimizing compilers --- the above equivalence inverts a common compiler optimization called loop unrolling. Although there has been substantial research effort on obfuscating compilers, we are not aware of any source-to-source compilers designed to increase code readability.
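As a rough illustration of the rewrite above, the following Python sketch merges singleton updates that share one constant assignment into range updates. The function names \texttt{coalesce\_rowids} and \texttt{rewrite\_updates} are hypothetical and not part of \sysname; the sketch also assumes integer ROWIDs and that the merged updates are order-independent.

```python
def coalesce_rowids(rowids):
    """Group ROWIDs into maximal runs of consecutive integers."""
    runs = []
    for rid in sorted(set(rowids)):
        if runs and rid == runs[-1][1] + 1:
            # Extend the current run by one row.
            runs[-1] = (runs[-1][0], rid)
        else:
            # Start a new run at this row.
            runs.append((rid, rid))
    return runs


def rewrite_updates(assignment, rowids):
    """Emit one statement per run instead of one per ROWID.

    Assumes every singleton update applies the same constant assignment,
    so merging them does not change the result.
    """
    statements = []
    for lo, hi in coalesce_rowids(rowids):
        if lo == hi:
            statements.append(f"UPDATE {assignment} WHERE ROWID = {lo}")
        else:
            statements.append(
                f"UPDATE {assignment} WHERE ROWID BETWEEN {lo} AND {hi}"
            )
    return statements


if __name__ == "__main__":
    for statement in rewrite_updates("A = 3", range(1, 11)):
        print(statement)
```

Non-contiguous ROWIDs yield several shorter statements rather than one, which is one place where the size/complexity tradeoff discussed above would have to be made explicit.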
@ -28,4 +28,8 @@ However, when the user needs to adapt their preliminary data cleaning solution t
Although they put more control over the curation process in the user's hands, singleton actions increase the size and complexity of a \langname script, with no benefits beyond the initial dataset.
In addition to considering readability-enhancing rewrites that preserve semantic equivalence, it will be necessary for \sysname to evaluate how singleton actions can be generalized, which is effectively a form of query (or curation, in this case) by example~\cite{Zloof:1975:QE:1499949.1500034}. Concretely, given a set of similar statements with singleton targets, we would like to propose to the user a rewrite that applies the same update to a region covering all the singletons.
%\oksays{Should put more here... but I don't really know what.}
%\oksays{Should put more here... but I don't really know what.}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End:

View file

@ -12,6 +12,7 @@
\sysname is a tool for data curation and exploration. \sysname's interface (illustrated in Figure~\ref{fig:hybridinterface}) combines elements of both notebooks and spreadsheets. Notebook interfaces like Jupyter use an analogy of pages in a notebook that consist of a block of code, as well as an output for the block like a table, visualization, or documentation text. Blocks are part of a continuous program, allowing a user to quickly probe intermediate state by creating new visualizations or views of the data, or to safely insert hypothetical, exploratory modifications by adding or disabling pages.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The Notebook UI}
Each page in a \sysname notebook can be thought of as a block of SQL DML/DDL code that generates a table or visualization. Pages are evaluated in sequential order. Code defining later pages may reference preceding pages as if they were (materialized) views, and edits to a page may result in cascading changes to the pages following it. Thus, unlike classical databases where views are static entities, a page in \sysname acts as an \textit{interactive view} that supports changes to its definition, both by making such changes easier for users to define and by optimizing the system to enact them.
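The cascading behavior described above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not \sysname's implementation: the \texttt{Notebook} class and its methods are hypothetical, each page is a function over the results of the preceding pages, and an edit naively re-evaluates every later page rather than only those that depend on the edited one.

```python
class Notebook:
    """Toy model: pages are evaluated in order and may read earlier pages."""

    def __init__(self):
        self.pages = []    # list of (name, compute) in notebook order
        self.results = {}  # page name -> most recently computed output

    def add_page(self, name, compute):
        """Append a page; compute takes a dict of preceding pages' results."""
        self.pages.append((name, compute))
        self._evaluate_from(len(self.pages) - 1)

    def edit_page(self, name, compute):
        """Replace a page's definition and cascade to the pages after it."""
        index = next(i for i, (n, _) in enumerate(self.pages) if n == name)
        self.pages[index] = (name, compute)
        self._evaluate_from(index)

    def _evaluate_from(self, start):
        # Naive cascade: recompute the edited page and every later page.
        for i in range(start, len(self.pages)):
            name, compute = self.pages[i]
            preceding = {n: self.results[n] for n, _ in self.pages[:i]}
            self.results[name] = compute(preceding)


if __name__ == "__main__":
    nb = Notebook()
    nb.add_page("raw", lambda prev: [1, 2, 3])
    nb.add_page("doubled", lambda prev: [2 * x for x in prev["raw"]])
    nb.edit_page("raw", lambda prev: [10, 20])
    print(nb.results["doubled"])
```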
@ -47,18 +48,19 @@ SELECT name = 'table', price = 10,
\end{lstlisting}
Imperative-flavored declarative language syntax has been repeatedly found to be more user-friendly than classic declarative syntax~\cite{Olston:2008:PLN:1376616.1376726,Sowell:2009aa}. Here, however, it also serves to highlight the compositional nature of interactive views: every user action that changes the view's schema or contents is reflected in the script by a new statement appended to its end. Thus, we aim, in principle at least, for a one-to-one mapping between user actions and operations in \langname.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{The Spreadsheet UI}
\sysname's users can edit tables and visualizations directly and have those edits reflected in the corresponding table's page and propagated to subsequent pages. As a result, the user's edits, whether applied via the spreadsheet or notebook UI, are recorded as a form of workflow provenance~\cite{SV08,CF12a,AD11c,DC07}. Note that our goal is not to reproduce the full interface of a spreadsheet, but rather to replicate as many of the
flexible data and schema manipulation features of spreadsheets as possible within a more structured framework. Concretely, \sysname's UI allows users to:
\begin{itemize}
\begin{compactitem}
\item \textbf{Overwrite arbitrary cells with constants, formulas, or regular expressions}: The user may click on any cell in the output to overwrite its contents with a constant value or a new formula defined interactively by clicking on cells and typing code.
\item \textbf{Cast cells to a new type}: Dropdown menus in an inspector allow the user to apply general transformations like typecasting. The transformation is applied in bulk to entire regions of selected cells.
\item \textbf{Copy/Paste cells}: Users can copy and paste regions of cells. The formula of the copied cell(s) is replicated in the target region, preserving the formula's positional semantics. If the target region is larger than the source region in either or both dimensions, cells in the source region are tiled to scale over the entire target.
\item \textbf{Add/Delete/Reorder columns or rows}: Users may drag columns or rows to reposition them. A tab at the bottom and right edges of the displayed table allows users to widen or lengthen the table, adding new columns or rows respectively. Finally, several interface elements allow users to insert rows (resp., columns) before or after any existing row (column).
\item \textbf{Sort data}: A dropdown menu allows users to sort data according to values in one or more columns.
\item \textbf{Filter data}: A dropdown menu allows users to filter out rows according to a formula defined over the row.
\end{itemize}
\end{compactitem}
Many of these operations (e.g., paste, typecast) require the user to define a target, most commonly in the form of a rectangular area of cells selected by clicking and dragging with the cursor. We refer to these targets, defined as the projection/selection of a set of columns and rows, as \textit{regions}, and discuss them in greater depth below.
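To make the tiling behavior of copy/paste concrete, the following Python sketch fills a rectangular target region by repeating the source region in both dimensions. The function \texttt{tile\_paste} is hypothetical; the sketch assumes regions are plain rectangles given as lists of rows, copies raw cell contents only (it does not adjust positional references in formulas), and truncates the last tile when the target size is not a multiple of the source size.

```python
def tile_paste(source, target_rows, target_cols):
    """Tile a rectangular source region over a target of the given size.

    source is a non-empty list of equally long rows. The cell at target
    position (r, c) receives the source cell at (r mod h, c mod w), where
    h and w are the height and width of the source region.
    """
    src_rows, src_cols = len(source), len(source[0])
    return [
        [source[r % src_rows][c % src_cols] for c in range(target_cols)]
        for r in range(target_rows)
    ]


if __name__ == "__main__":
    # A 1x2 source pasted into a 2x3 target region.
    for row in tile_paste([["a", "b"]], 2, 3):
        print(row)
```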

View file

@ -16,11 +16,11 @@ Spreadsheets are a ubiquitous data processing tool. Their simplicity, generalit
Spreadsheets provide several important features that are useful during data curation:\\
%\begin{itemize}
%\item
\emph{(1) Convenient modification of values and computations.} The user can update any cell's value or formula immediately from the user interface. This enables manual curation operations like resolving missing values and undoing earlier errors.\\
\inlineitem{Convenient modification of values and computations} The user can update any cell's value or formula immediately from the user interface. This enables manual curation operations like resolving missing values and undoing earlier errors.\\
%\item
\emph{(2) Manual operations with inline results.} By using formulas in cells, the user defines a computation and the result of this computation is shown inline with its input data.\\
\inlineitem{Manual operations with inline results} By using formulas in cells, the user defines a computation and the result of this computation is shown inline with its input data.\\
%\item
\emph{(3) Visual mapping over data collections.}
\inlineitem{Visual mapping over data collections}
Most spreadsheet systems enable the user to take a formula (computation) and map it to a range of cells through \textit{position-relative} references in cell formulas.
For example, this can be done by copy/paste or by a fill operation. We refer to this mechanism as \textit{adapt\&apply}. This approach to bulk, set-at-a-time functionality is very useful in data curation: A fix to repair one piece of data (e.g., conversion between units) can be deployed over the whole dataset, while the appearance of independent formulas provides an affordance for declaring exceptions to the bulk rule.
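The following Python sketch illustrates adapt\&apply with position-relative references, using a unit conversion as the fix being deployed. It is a simplified model under stated assumptions: \texttt{adapt\_and\_apply} is a hypothetical helper, a formula is a function that reads other cells through offsets relative to its own position, and the target cells do not depend on one another.

```python
def adapt_and_apply(grid, formula, target_cells):
    """Evaluate one position-relative formula at every target cell.

    formula receives a reader get(dr, dc) that returns the cell dr rows
    and dc columns away from the cell being computed. Inputs are read
    from the original grid, so targets must not depend on each other.
    """
    result = [row[:] for row in grid]
    for r, c in target_cells:
        def get(dr, dc, r=r, c=c):
            return grid[r + dr][c + dc]
        result[r][c] = formula(get)
    return result


if __name__ == "__main__":
    # Column 0 holds lengths in meters; column 1 should hold centimeters.
    grid = [[1, None], [2, None], [10, None]]
    to_centimeters = lambda get: get(0, -1) * 100  # "cell to my left" * 100
    targets = [(0, 1), (1, 1), (2, 1)]
    for row in adapt_and_apply(grid, to_centimeters, targets):
        print(row)
```

Because each target cell ends up with its own copy of the formula, an exception to the bulk rule can be declared by overwriting a single cell afterwards.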
%\end{itemize}
@ -43,23 +43,24 @@ The spreadsheet UI also has several drawbacks:\\
% \bgsays{(sort example). }
%\begin{itemize}
%\item
\emph{(1) Non-Adaptive Computations over Collections}
\inlineitem{Non-Adaptive Computations over Collections}
While spreadsheets support an adapt\&apply approach to apply a computation defined for a specific set of cells to a larger collection of cells, the two dominant spreadsheets, Microsoft Excel and Google Sheets, do not support automatically extending such a cell collection if more data is added.\footnote{We note that Apple's Numbers does exhibit this behavior.}
This is in contrast to relational databases with their declarative query languages that provide data independence. \\
%Furthermore, the visual specification of an adapt\&apply operation is not suited well for very large datasets since the user has to manually select a range of cells.\\
%\item
\emph{Collection operation intent is not explicit:}
\inlineitem{Collection operation intent is not explicit}
Adapt\&apply allows a computation to be mapped over a collection, but there is no visual evidence that indicates that a set of cells stores formulas that were mapped in this way. That is, there is no generalization/abstraction mechanism in spreadsheets to represent a higher-level bulk operation such as a view query in a database.\\
%\bgsays{provenance should help to identify that a ``map'' was applied}
%\item
\emph{(3) No order among operations and tracking of workflow branches.}
\inlineitem{No order among operations and tracking of workflow branches}
Spreadsheets use cell highlighting as a visual metaphor for dependency tracking.
These visualizations are specific to single cells and do not lend themselves to tracking large curation workflows.
Furthermore, these visualizations are limited to tracing one dependency at a time, making tracking transitive dependencies cumbersome.\\
%which makes tracking of transitive dependencies cumbersome. It is unfeasible to assume that the user has a complete mental model of the whole workflow and all its branches. \\
%\bgsays{Workflow provenance would help to the higher-level view, database provenance for the transitive dependencies}
%\item
\emph{(4) Unintuitive results for adapt\&apply.} As we will discuss further in Section~\ref{sec:language}, adapt\&apply functionality as implemented in many spreadsheet systems can sometimes lead to unexpected results.
\inlineitem{Unintuitive results for adapt\&apply}
As we will discuss further in Section~\ref{sec:language}, adapt\&apply functionality as implemented in many spreadsheet systems can sometimes lead to unexpected results.
%\end{itemize}
%BAD: The order in which operations have been applied is not exposed in a spreadsheet. Thus, spreadsheets are not great for complex data curation and exploration operations, because it is hard to keep track of what operations were applied to the data and in which order (needs workflow + provenance).
@ -74,20 +75,20 @@ Systems like Jupyter expose an interactive, interpreted programming environment
% A notebook UI such as in Jupyter and the visualization provided by workflow systems are much better suited in this regard. However, their disadvantage is that they are not well suited for small modifications to data, exploratory changes, and generalization of operations.
%\begin{itemize}
%\item
\emph{Inline documentation:} Notebooks allow users to integrate comments, descriptive text, notes, and formatting details, making it easier for others to retrace their steps.\\
\inlineitem{Inline documentation} Notebooks allow users to integrate comments, descriptive text, notes, and formatting details, making it easier for others to retrace their steps.\\
%\item
\emph{Incremental development of complex workflows:} Notebooks allow users to incrementally build curation workflows by modifying one page at a time. The (always linear) structure of the workflow is made explicit through the notebook interface.
\inlineitem{Incremental development of complex workflows} Notebooks allow users to incrementally build curation workflows by modifying one page at a time. The (always linear) structure of the workflow is made explicit through the notebook interface.
%\end{itemize}
\smallskip
However, some operations that are supported well in spreadsheets are harder to express in notebooks, and some disadvantages are shared among both paradigms:\\
%\begin{itemize}
%\item
\emph{Small edits are cumbersome:} Compared to spreadsheets, modifying individual data values requires users to write code.\\
\inlineitem{Small edits are cumbersome} Compared to spreadsheets, modifying individual data values requires users to write code.\\
%\item
\emph{Linear workflows and no backtracking:} Notebooks are inherently linear and do not allow users to backtrack or branch their development efforts.\\
\inlineitem{Linear workflows and no backtracking} Notebooks are inherently linear and do not allow users to backtrack or branch their development efforts.\\
%\item
\emph{All-at-once processing:}
\inlineitem{All-at-once processing}
Both spreadsheets and notebooks operate over datasets in their entirety, which is not feasible for large files at gigabyte or terabyte scale.
%\end{itemize}
@ -96,14 +97,6 @@ Both spreadsheets and notebooks operate over datasets in their entirety, somethi
%
Both spreadsheet and notebook UIs make it very easy for users to create visualizations from data on the fly and show these visualizations inline with the data. Both paradigms also allow these visualizations to be tweaked and refreshed when their inputs change. Spreadsheets in particular provide a very easy-to-use interface for selecting what data should be visualized.
% \begin{itemize}
% \item
% \end{itemize}
% \begin{itemize}
% \item
% \end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\tinysection{Combining Spreadsheets and Notebooks}