layout fixes
parent 6bee5b44c1
commit 018ec9dbf1
@@ -41,7 +41,7 @@ Ideally a data exploration system would supplement automatic error detection and
 \noindent In short, we target four limitations of existing work:\\
 %\begin{itemize}
-\textbf{• Reproducibility.} The nature of most existing notebook systems as wrappers around REPLs leads to non-reproducible analysis and unintuitive and hard to track errors during iterative pipeline construction. \\
+\textbf{• Reproducibility.} The nature of most existing notebook systems as wrappers around REPLs leads to non-reproducible analysis and unintuitive and hard to track errors during iterative pipeline construction. \\[3mm]
 \textbf{• Direct Manipulation.} It is often necessary to manually manipulate data (e.g., to apply simple one-off repairs), pulling users out of the notebook environment and limiting the notebook's ability to serve as a historical record.\\
 \textbf{• Versioning and Sharing.} Existing notebook and spreadsheet systems often lack versioning capabilities, forcing users to rely on manual versioning using version control systems like git and hosting platforms (git forges) like github.\\
 \textbf{• Uncertainty and Error Tracking.} Existing systems do not expose, track, or manage issues with data and deferred curation tasks, nor their interactions with data transformations and analysts
@@ -20,7 +20,7 @@ Provenance in workflow systems has been studied intensively in the past~\cite{CW
 In the context of dataset versioning, prior work has investigated optimized storage for versioned datasets~\cite{XH,BD15,MG16a}. Bhattacherjee et al.~\cite{BC15a} study the trade-off between storage versus recreation cost for versioned datasets.
 The version graphs used in this work essentially track coarse-grained provenance.
 The Nectar system~\cite{GR10} automatically caches intermediate results of distributed dataflow computations also trading storage versus computational cost.
-Similarly, metadata management systems like Ground and Apache Atlas (\url{https://atlas.apache.org/}) manage coarse-grained provenance for datasets in a data lake.
+Similarly, metadata management systems like Ground and Apache Atlas (\url{https://atlas.apache.org/}) manage coarse-grained provenance for da\-ta\-sets in a data lake.
 In contrast to workflow provenance which is often coarse-grained, i.e., at the level of datasets, database provenance is typically more fine-grained, e.g., at the level of rows~\cite{CC09,HD17,AF18,AG17c,GM13,SJ18,MD18}. Many systems capture database provenance by annotating data and propagating these annotations during query processing.
 Vizier's version and provenance management techniques integrate several lines of prior work by the authors including tracking the provenance of workflow versions~\cite{DBLP:journals/concurrency/ScheideggerKSVCFS08,XN16}, provenance tracking for updates and reenactment~\cite{AG17c,DBLP:journals/pvldb/NiuALFZGKLG17}, and using provenance-based techniques for tracking uncertainty annotations~\cite{yang2015lenses,feng:2019:sigmod:uncertainty}.
 The result is a system that is more than the sum of it components and to the best of our knowledge is the first system to support all of these features.