master
Oliver Kennedy 2021-09-20 21:23:22 -04:00
parent 86687bfab0
commit 67b3f750be
Signed by: okennedy
GPG Key ID: 3E5F9B3ABD3FDB60
3 changed files with 40 additions and 39 deletions


@ -340,6 +340,18 @@ Progress can be made on this as follows:
For any positive integer $m$ we can compute the $m$-th moment of the multiplicities, allowing us to, e.g., use the Chebyshev inequality or other higher-moment probability bounds on the events we might be interested in.
We leave further investigations for future work.
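For instance (written in generic probability notation, since the paper's macros are not reproduced here), the first two moments already give a Chebyshev-style tail bound on a tuple multiplicity $N$:
\[
  \Pr\bigl[\lvert N - \mathbb{E}[N]\rvert \geq a\bigr]
    \;\leq\; \frac{\mathbb{E}[N^2] - \mathbb{E}[N]^2}{a^2},
\]
and, more generally for even $m$, applying Markov's inequality to $(N - \mathbb{E}[N])^m$ yields $\Pr\bigl[\lvert N - \mathbb{E}[N]\rvert \geq a\bigr] \leq \mathbb{E}\bigl[(N - \mathbb{E}[N])^m\bigr]/a^m$.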
% \section{The Karp-Luby Estimator}
% \label{sec:karp-luby}
% %
% Computing the marginal probability of a tuple in the output of a set-probabilistic database query has been studied extensively.
% To the best of our knowledge, the current state-of-the-art approximation algorithm for this problem is the Karp-Luby estimator~\cite{DBLP:journals/jal/KarpLM89}, which was first applied to this setting in MayBMS/Sprout~\cite{DBLP:conf/icde/OlteanuHK10}, and more recently as part of an online ``anytime'' approximation algorithm~\cite{FH13,heuvel-19-anappdsd}.
% The estimator works by observing that for any $\ell$ random events $X_1, \ldots, X_\ell$, the probability that at least one occurs, $\probOf\inparen{X_1 \vee \ldots \vee X_\ell}$, is bounded from above by the sum of the individual event probabilities (i.e., $\probOf\inparen{X_1 \vee \ldots \vee X_\ell} \leq \probOf\inparen{X_1} + \ldots + \probOf\inparen{X_\ell}$).
% Starting from this (easy to compute and large) value, the estimator proceeds to ``adjust'' the estimate by computing the expectation of the number of
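The sampling loop behind the Karp-Luby estimator can be sketched as follows. This is a minimal illustration for monotone DNF events over independent Boolean variables with unit coefficients; the function name and clause representation are hypothetical, not taken from MayBMS/Sprout or the cited papers:

```python
import random

def karp_luby(clauses, probs, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(C_1 or ... or C_m) for a monotone DNF
    over independent Boolean variables.  clauses[i] is a tuple of
    variable indices (a conjunction); probs[v] = P(variable v = True)."""
    rng = random.Random(seed)
    # Per-clause probabilities: product of marginals, by independence.
    p = [1.0] * len(clauses)
    for i, clause in enumerate(clauses):
        for v in clause:
            p[i] *= probs[v]
    total = sum(p)  # the easy-to-compute (and large) union bound
    hits = 0
    for _ in range(n_samples):
        # Pick a clause i with probability p[i] / total ...
        i = rng.choices(range(len(clauses)), weights=p)[0]
        # ... and sample a world conditioned on clause i being true.
        world = [rng.random() < probs[v] for v in range(len(probs))]
        for v in clauses[i]:
            world[v] = True
        # Credit the sample only if i is the *first* satisfied clause,
        # so each satisfying world is counted exactly once.
        first = next(j for j, clause in enumerate(clauses)
                     if all(world[v] for v in clause))
        hits += (first == i)
    return total * hits / n_samples
```

The point of the "first satisfied clause" check is that it deflates the loose union bound `total` into an unbiased estimate of the true probability of the disjunction.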
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "main"


@ -325,45 +325,6 @@ In contrast, known approximation techniques in set-\abbrPDB\xplural are at most
%(iii) We finally observe that our results trivially extend to higher moments of the tuple multiplicity (instead of just the expectation).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Applications}
Probabilistic databases have been explored extensively for a variety of specialized tasks from probabilistic programming~\cite{DBLP:journals/tods/OlteanuS16}, to simulations~\cite{DBLP:conf/sigmod/GaoLPJ17,DBLP:conf/sigmod/CaiVPAHJ13}, and more.
Especially notable is work on data cleaning~\cite{yang:2015:pvldb:lenses,DBLP:journals/vldb/SaRR0W0Z17,DBLP:journals/pvldb/RekatsinasCIR17,DBLP:journals/pvldb/BeskalesIG10}, where probabilistic databases facilitate queries over heuristically cleaned datasets.
This work is particularly crucial, as growing, noisier datasets make fully manual data cleaning progressively less feasible.
As observed by Feng et al.~\cite{feng:2019:sigmod:uncertainty}, users of classical, deterministic databases are forced to choose between discarding potentially irrelevant data due to the cost of cleaning it properly, and ignoring the error and hoping that query outputs remain informative.
Thanks to existing work on probabilistic data cleaning, probabilistic databases can be leveraged to provide a convenient middle ground between these extremes, allowing users to avoid the upfront costs of data cleaning, while simultaneously receiving meaningful query outputs (or at least an output that indicates the need for manual cleaning).
Unfortunately, probabilistic databases remain impractically slow (e.g., by orders of magnitude~\cite{feng:2019:sigmod:uncertainty}).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Overview of our Techniques} All of our results rely on working with a {\em reduced} form of the lineage polynomial $\poly$. In fact, it turns out that for the \abbrTIDB (and \abbrBIDB) case, computing the expected multiplicity is {\em exactly} the same as evaluating this reduced polynomial over the probabilities that define the \abbrTIDB/\abbrBIDB. Next, we motivate this reduced polynomial.
@ -448,6 +409,16 @@ we get that $\poly^2\inparen{\vct{\prob}}$ is in the range $[\inparen{p_0}^3\cdo
%
To get a $(1\pm \epsilon)$-multiplicative approximation we uniformly sample monomials from the \abbrSMB representation of $\poly$ and `adjust' their contribution to $\widetilde{\poly}\left(\cdot\right)$.
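As a rough illustration of this sampling step, the following sketch estimates the value of a multilinear polynomial given as a list of monomials, each with coefficient one. The representation and the function name are assumptions made for this sketch; the paper's actual algorithm additionally handles coefficients and signs when it `adjusts' each sample's contribution:

```python
import random

def sample_estimate(monomials, probs, n_samples=50_000, seed=0):
    """Unbiased estimate of sum over monomials of
    prod_{v in monomial} probs[v], by uniform monomial sampling.
    Each monomial is a tuple of variable indices."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        # Draw a monomial uniformly at random ...
        m = monomials[rng.randrange(len(monomials))]
        # ... and evaluate it at the probability vector.
        val = 1.0
        for v in m:
            val *= probs[v]
        acc += val
    # Rescale: (number of monomials) * (average sampled value)
    # is an unbiased estimator of the full sum.
    return len(monomials) * acc / n_samples
```

Since each monomial is selected with probability $1/\text{(number of monomials)}$, the rescaled sample average converges to the exact sum, with error controlled by the number of samples.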
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Applications}
Probabilistic databases have been applied to everything from probabilistic programming~\cite{DBLP:journals/tods/OlteanuS16} to simulations~\cite{DBLP:conf/sigmod/GaoLPJ17,DBLP:conf/sigmod/CaiVPAHJ13}.
Especially notable is work on heuristic data cleaning~\cite{yang:2015:pvldb:lenses,DBLP:journals/vldb/SaRR0W0Z17,DBLP:journals/pvldb/RekatsinasCIR17,DBLP:journals/pvldb/BeskalesIG10} that emits a \abbrPDB when insufficient data exists to select the correct data repair.
This work is especially critical, as the alternative is to simply select one repair and `hope' that queries over it produce meaningful results.
\abbrPDB queries can convey how trustworthy a result is~\cite{kumari:2016:qdb:communicating}, but are impractically slow~\cite{feng:2019:sigmod:uncertainty,feng:2021:sigmod:efficient}.
This work lays the foundation for probabilistic database systems that can be competitive with deterministic databases.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Paper Organization} We present relevant background and notation in \Cref{sec:background}. We then prove our main hardness results in \Cref{sec:hard} and present our approximation algorithm in \Cref{sec:algo}. %We present some (easy) generalizations of our results in \Cref{sec:gen}.
%and also discuss extensions from computing expectations of polynomials to the expected result multiplicity problem


@ -756,3 +756,21 @@ Maximilian Schleich},
year = {2013}
}
@inproceedings{kumari:2016:qdb:communicating,
  author    = {Kumari, Poonam and Achmiz, Said and Kennedy, Oliver},
  title     = {Communicating Data Quality in On-Demand Curation},
  booktitle = {QDB},
  year      = {2016}
}
@inproceedings{feng:2021:sigmod:efficient,
  author    = {Feng, Su and Glavic, Boris and Huber, Aaron and Kennedy, Oliver},
  title     = {Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds},
  booktitle = {SIGMOD},
  year      = {2021}
}