% paper-HILDA-2016-Spreadsheets/sections/generalizing.tex

%!TEX root = ../main.tex

As the user makes edits in the spreadsheet interface, the corresponding actions are recorded in the notebook as a \langname script.
Although these scripts encode the evaluation logic that generates the displayed spreadsheet, they also serve as an audit trail, a tool for reverting or altering older edits, and a vector for generalizing the same curation process to new data.
As such, \langname is subject to a different set of optimization goals than most programming languages.
Rather than optimizing for performance or resource usage, as a conventional optimizer would, \langname needs an optimizer that prioritizes both \textit{readability} and \textit{generality}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Rewriting for Readability}
User actions on the spreadsheet are expected to be small, isolated changes.
Recording them directly in this form is likely to produce long, hard-to-follow \langname scripts.
Thus, \sysname will need to rewrite scripts dynamically as they are modified, in a principled way that optimizes for readability.
We consider readability to be a tradeoff between minimizing two measures: size and complexity.
For example, consider a sequence of 10 update actions with the form:\\[-5mm]
\begin{lstlisting}[morekeywords={ROWID}]
UPDATE A = 3 WHERE ROWID = ?
\end{lstlisting}~\\[-5mm]
with \texttt{?} taking values from 1 to 10.  Instead, we could express all 10 updates as a single statement using a \texttt{BETWEEN} predicate that (a) represents the same concept more concisely, (b) has a similar level of complexity, and (c) is semantically equivalent.
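As a minimal sketch, assuming \langname accepts a SQL-style \texttt{BETWEEN} predicate with inclusive bounds (the exact syntax shown here is an assumption), the ten updates might collapse to:\\[-5mm]
\begin{lstlisting}[morekeywords={ROWID}]
UPDATE A = 3 WHERE ROWID BETWEEN 1 AND 10
\end{lstlisting}~\\[-5mm]
Such a rewrite presumably preserves equivalence only when the ten updates are adjacent in the script, or when no intervening statement reads or writes column \texttt{A} in rows 1 through 10.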
Similar transformations appear in optimizing compilers --- the above equivalence inverts a common compiler optimization called loop unrolling.  Although there has been substantial research effort on obfuscating compilers, we are not aware of any source-to-source compilers designed to increase code readability.
Vizier will not only record a \langname script for a workflow, but also track which user operations each operation in the script is derived from. This provenance can be exploited during rewriting. For example, a set of formulas created by an adapt\&apply operation is a good candidate for rewriting, because all formulas in such a set are known to follow the same pattern.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Generalizing Singletons}
Singletons allow users to try out hypotheticals, explore cleaning solutions, and conduct small-scale tests.
It is often easier for users to perform one-off curation steps initially, repairing errors in the data as singletons, rather than expending the mental effort to generalize the repair upfront.
However, when the user needs to adapt their preliminary data cleaning solution to new data, to a larger dataset, or to an updated dataset, these singleton operations can become a burden.
Although they put more control over the curation process in the user's hands, singleton actions increase the size and complexity of a \langname script, with no benefits beyond the initial dataset.
In addition to considering readability-enhancing rewrites that preserve semantic equivalence, \sysname will need to evaluate how singleton actions can be generalized, effectively a form of query (or, in this case, curation) by example~\cite{Zloof:1975:QE:1499949.1500034}.  Concretely, given a set of similar statements with singleton targets, we would like to propose to the user a rewrite that applies the same update to a region covering all the singletons.
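
To illustrate, consider a hypothetical script fragment in which the user repaired three cells individually (the row identifiers and values are invented for this example, and the \texttt{BETWEEN} syntax is assumed):
\begin{lstlisting}[morekeywords={ROWID}]
UPDATE A = 3 WHERE ROWID = 2
UPDATE A = 3 WHERE ROWID = 5
UPDATE A = 3 WHERE ROWID = 9
\end{lstlisting}
One candidate generalization that \sysname could propose is a single update over the smallest contiguous region covering all three singletons:
\begin{lstlisting}[morekeywords={ROWID}]
UPDATE A = 3 WHERE ROWID BETWEEN 2 AND 9
\end{lstlisting}
Unlike a readability rewrite, this proposal is not semantically equivalent to the original statements: it also overwrites column \texttt{A} in rows 3, 4, 6, 7, and 8. This is why such a rewrite would be offered to the user for confirmation rather than applied automatically.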

%\oksays{Should put more here... but I don't really know what.}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End: