layout fixes

master
Boris Glavic 2019-12-16 21:12:11 -06:00
parent 6bee5b44c1
commit 018ec9dbf1
2 changed files with 2 additions and 2 deletions

@@ -41,7 +41,7 @@ Ideally a data exploration system would supplement automatic error detection and
\noindent In short, we target four limitations of existing work:\\
%\begin{itemize}
-\textbf{• Reproducibility.} The nature of most existing notebook systems as wrappers around REPLs leads to non-reproducible analysis and unintuitive and hard to track errors during iterative pipeline construction. \\
+\textbf{• Reproducibility.} The nature of most existing notebook systems as wrappers around REPLs leads to non-reproducible analysis and unintuitive and hard to track errors during iterative pipeline construction. \\[3mm]
\textbf{• Direct Manipulation.} It is often necessary to manually manipulate data (e.g., to apply simple one-off repairs), pulling users out of the notebook environment and limiting the notebook's ability to serve as a historical record.\\
\textbf{• Versioning and Sharing.} Existing notebook and spreadsheet systems often lack versioning capabilities, forcing users to rely on manual versioning using version control systems like Git and hosting platforms (Git forges) like GitHub.\\
\textbf{• Uncertainty and Error Tracking.} Existing systems do not expose, track, or manage issues with data and deferred curation tasks, nor their interactions with data transformations and analysts

@@ -20,7 +20,7 @@ Provenance in workflow systems has been studied intensively in the past~\cite{CW
In the context of dataset versioning, prior work has investigated optimized storage for versioned datasets~\cite{XH,BD15,MG16a}. Bhattacherjee et al.~\cite{BC15a} study the trade-off between storage versus recreation cost for versioned datasets.
The version graphs used in this work essentially track coarse-grained provenance.
The Nectar system~\cite{GR10} automatically caches intermediate results of distributed dataflow computations also trading storage versus computational cost.
-Similarly, metadata management systems like Ground and Apache Atlas (\url{https://atlas.apache.org/}) manage coarse-grained provenance for datasets in a data lake.
+Similarly, metadata management systems like Ground and Apache Atlas (\url{https://atlas.apache.org/}) manage coarse-grained provenance for da\-ta\-sets in a data lake.
In contrast to workflow provenance which is often coarse-grained, i.e., at the level of datasets, database provenance is typically more fine-grained, e.g., at the level of rows~\cite{CC09,HD17,AF18,AG17c,GM13,SJ18,MD18}. Many systems capture database provenance by annotating data and propagating these annotations during query processing.
Vizier's version and provenance management techniques integrate several lines of prior work by the authors including tracking the provenance of workflow versions~\cite{DBLP:journals/concurrency/ScheideggerKSVCFS08,XN16}, provenance tracking for updates and reenactment~\cite{AG17c,DBLP:journals/pvldb/NiuALFZGKLG17}, and using provenance-based techniques for tracking uncertainty annotations~\cite{yang2015lenses,feng:2019:sigmod:uncertainty}.
The result is a system that is more than the sum of its components and, to the best of our knowledge, is the first system to support all of these features.