adding applicationsd

master
Oliver Kennedy 2021-09-20 20:46:43 -04:00
parent 64ee16c4ff
commit 86687bfab0
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
2 changed files with 134 additions and 0 deletions

@@ -324,6 +324,48 @@ In contrast, known approximation techniques in set-\abbrPDB\xplural are at most
%\OK{Atri's (and most theoretician's) statements about complexity always need to be suffixed with ``to within a log factor''}
%(iii) We finally observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Applications}
Probabilistic databases have been explored extensively for a variety of specialized tasks from probabilistic programming~\cite{DBLP:journals/tods/OlteanuS16}, to simulations~\cite{DBLP:conf/sigmod/GaoLPJ17,DBLP:conf/sigmod/CaiVPAHJ13}, and more.
Especially notable is work on data cleaning~\cite{yang:2015:pvldb:lenses,DBLP:journals/vldb/SaRR0W0Z17,DBLP:journals/pvldb/RekatsinasCIR17,DBLP:journals/pvldb/BeskalesIG10}, where probabilistic databases facilitate queries over heuristically cleaned datasets.
This work is particularly crucial, as larger and noisier datasets make fully manual data cleaning progressively less feasible.
As observed by Feng et al.~\cite{feng:2019:sigmod:uncertainty}, users of classical, deterministic databases are forced to choose between discarding potentially relevant data because cleaning it properly is too costly, and ignoring the errors in the hope that query outputs remain informative.
Thanks to existing work on probabilistic data cleaning, probabilistic databases can be leveraged to provide a convenient middle ground between these extremes, allowing users to avoid the upfront costs of data cleaning while still receiving meaningful query outputs (or at least an output that indicates the need for manual cleaning).
Unfortunately, probabilistic databases remain impractically slow (e.g., by orders of magnitude~\cite{feng:2019:sigmod:uncertainty}).
Detecting data errors, outliers, or other corner cases is often easier than
Manual validation of data is becoming increasingly intractable as data sizes grow.
Probabilistic databases are a powerful tool for managing data that,
As data sizes grow and manual validation becomes more difficult, it becomes more critical than ever for tools to treat data unc
Probabilistic databases are a valuable tool for
provide a
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly$. In fact, it turns out that for the \abbrTIDB (and \abbrBIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the \abbrTIDB/\abbrBIDB. Next, we motivate this reduced polynomial.
Consider the query $\query$ defined as follows over the bag relations of \Cref{fig:two-step}:
\begin{lstlisting}

@@ -664,3 +664,95 @@ Maximilian Schleich},
biburl = {https://dblp.org/rec/journals/siamcomp/AtseriasGM13.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DBLP:journals/vldb/SaRR0W0Z17,
author = {Christopher De Sa and
Alexander Ratner and
Christopher R{\'{e}} and
Jaeho Shin and
Feiran Wang and
Sen Wu and
Ce Zhang},
title = {Incremental knowledge base construction using DeepDive},
journal = {{VLDB} J.},
volume = {26},
number = {1},
pages = {81--105},
year = {2017}
}
@article{DBLP:journals/pvldb/RekatsinasCIR17,
author = {Theodoros Rekatsinas and
Xu Chu and
Ihab F. Ilyas and
Christopher R{\'{e}}},
title = {HoloClean: Holistic Data Repairs with Probabilistic Inference},
journal = {Proc. {VLDB} Endow.},
volume = {10},
number = {11},
pages = {1190--1201},
year = {2017}
}
@article{DBLP:journals/pvldb/BeskalesIG10,
author = {George Beskales and
Ihab F. Ilyas and
Lukasz Golab},
title = {Sampling the Repairs of Functional Dependency Violations under Hard
Constraints},
journal = {Proc. {VLDB} Endow.},
volume = {3},
number = {1},
pages = {197--207},
year = {2010}
}
@article{DBLP:journals/tods/OlteanuS16,
author = {Dan Olteanu and
Sebastiaan J. van Schaik},
title = {ENFrame: {A} Framework for Processing Probabilistic Data},
journal = {{ACM} Trans. Database Syst.},
volume = {41},
number = {1},
pages = {3:1--3:44},
year = {2016}
}
@inproceedings{DBLP:conf/sigmod/GaoLPJ17,
author = {Zekai J. Gao and
Shangyu Luo and
Luis Leopoldo Perez and
Chris Jermaine},
title = {The {BUDS} Language for Distributed Bayesian Machine Learning},
booktitle = {{SIGMOD} Conference},
pages = {961--976},
publisher = {{ACM}},
year = {2017}
}
@inproceedings{DBLP:conf/sigmod/CaiVPAHJ13,
author = {Zhuhua Cai and
Zografoula Vagena and
Luis Leopoldo Perez and
Subramanian Arumugam and
Peter J. Haas and
Christopher M. Jermaine},
title = {Simulation of database-valued {M}arkov chains using {SimSQL}},
booktitle = {{SIGMOD} Conference},
pages = {637--648},
publisher = {{ACM}},
year = {2013}
}