Compare commits

10 Commits

Author SHA1 Message Date
Boris Glavic 01265d76a3 full print 2018-10-02 13:03:00 -05:00
Boris Glavic ee2b9d0d4b sub 2018-10-02 12:57:53 -05:00
Boris Glavic 26fd49936e new 2018-10-02 12:56:50 -05:00
Boris Glavic a6c66f5978 new 2018-10-02 12:48:30 -05:00
Boris Glavic 651f2f95ad thrust I 2018-10-02 12:31:12 -05:00
Boris Glavic 6cd245713b sec 3 2018-10-02 12:18:46 -05:00
Oliver Kennedy d7afde2fcc Minor tweaks 2018-10-02 13:10:51 -04:00
Boris Glavic e09f4b7c9f 3 2018-10-02 12:02:26 -05:00
Oliver Kennedy 68d1750d90 Minor edits 2018-10-02 12:56:24 -04:00
Boris Glavic 04d3e63d2f 2 + 3 2018-10-02 11:51:16 -05:00
12 changed files with 38 additions and 39 deletions

BIN
figures/crime.pdf Normal file

Binary file not shown.

View File

@ -12,7 +12,7 @@ Assume a universal domain of attribute values $\aDom$. A tuple with schema $\rel
A set relation $\rel$ with schema $\relschema$ is a set of tuples with schema $\relschema$, i.e., $\rel \subseteq \aDom^{\arity{\relschema}}$.
A bag relation $\rel$ with schema $\relschema$ is a bag (multiset) of tuples with schema $\relschema$.
% In either case, a tuple is an element from $\aDom^{\arity{\relschema}}$.
We use $\tupDom$ to denote the set of all tuples over domain $\aDom$.
%We use $\tupDom$ to denote the set of all tuples over domain $\aDom$.
%\BG{ADD BACK IF WE ACTUALLY USE THIS SOMEWHERE: Given a relation $\rel$ with schema $\relschema$ we define $\tupDom(\rel)$, the set of tuples with values from $\aDom$ and schema $\relschema$. For a database $\db$, we define $\tupDom(\db)$ as the union of $\tupDom(\rel)$ for all relations $\rel$ in $\db$. }
\subsection{Possible Worlds Semantics}
@ -23,7 +23,6 @@ We use $\tupDom$ to denote the set of all tuples over domain $\aDom$.
% data
Incomplete databases model uncertainty and its impact on query results.
An \emph{incomplete database} $\pdb$ is a set of deterministic database instances $\db_1, \ldots, \db_{w}$ called \emph{possible worlds}.
\BG{I THINK THIS IS NO LONGER USED? We write $\pwDom$ to denote the set of world indexes $\{1 \ldots w\}$, and follow most existing incomplete and probabilistic literature in assuming that $\pwDom$ is finite.}
We write $\tup \in \db$ to denote that a tuple $\tup$ appears in a specific possible world $\db$.
Most approaches adopt the so-called ``possible worlds'' semantics for querying incomplete databases:
The result of evaluating a deterministic query $\query$ over an incomplete database is the set of relation instances resulting from evaluating $\query$ over each possible world individually using standard deterministic query semantics: $\query(\pdb) = \comprehension{\query(\db)}{\db \in \pdb}$.
@ -204,7 +203,7 @@ The $\semK$-relation framework defines a non-standard semantics for positive rel
% for propagating annotations through queries into query results using only the two semiring operations $\addK$ and $\multK$.
When using the boolean semiring, i.e., the semiring over the boolean constants $\semB = \{T,F\}$ with disjunction $(\vee)$ as addition and conjunction ($\wedge$) as multiplication, the resulting $\semB$-relations exactly mirror set-semantics (a tuple is annotated with $T$ if and only if it appears in a relation).
Similarly, the semiring $(\semN, +, \times, 0, 1)$ of natural numbers $\semN$ with standard addition and multiplication over natural numbers mirrors bag semantics (each tuple is annotated with its multiplicity).
Other semirings model, for example, security policies and provenance.
Other semirings can be used to model, for example, security policies and provenance.
% \BG{I THINK THIS WOULD REQUIRE FURTHER EXPLANATION TO MAKE SENSE HERE, or relations based on multi-valued boolean logic.}
% \BG{Do we need to talk about provenance now? Data provenance documents historical record of data and its origins. One standard way of modeling provenance in databases is to annotate provenance on data items and compute the output provenance by propagate annotations. \textbf{Provenance semiring} is defined as $\semNX=(\semNX,+,\times,0,1)$ where $X$ is set of tuple ids and $\semNX$ is all \textbf{provenance polynomials} with variables from $X$ and coefficients from $\semN$}.
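The semiring view above can be illustrated with a minimal sketch. It assumes a $\semK$-relation is represented as a Python dict from tuples to annotations, and it shows only projection, where annotations of merged tuples are combined with the semiring addition. The names `Semiring`, `BOOL`, `NAT`, `project`, and `possible_worlds` are hypothetical and not part of the proposal.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (K, add, mul, zero, one)."""
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]


# Boolean semiring: mirrors set semantics.
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
# Natural numbers: mirrors bag semantics (annotation = multiplicity).
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)


def project(rel, idxs, K):
    """Projection over a K-relation given as {tuple: annotation}.

    Input tuples that collapse to the same output tuple have their
    annotations combined with the semiring addition.
    """
    out = {}
    for tup, ann in rel.items():
        key = tuple(tup[i] for i in idxs)
        out[key] = K.add(out.get(key, K.zero), ann)
    return out


def possible_worlds(query, pdb):
    """Possible worlds semantics: evaluate the query in every world.

    A list stands in for the set of result relations.
    """
    return [query(db) for db in pdb]
```

With `NAT`, projecting `{("a", 1): 2, ("a", 2): 3}` onto the first attribute sums the two multiplicities; with `BOOL`, the same operation is a disjunction of the input annotations.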

View File

@ -13,7 +13,8 @@
\end{wrapfigure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Before discussing our proposed contributions, we first introduce the core concepts behind them. %We begin by introducing a incomplete $\semK$-relations.
We begin with incomplete $\semK$-relations, an extension of incomplete databases to $\semK$-relations. Recall that $\semK$-relations generalize set semantics, bag semantics, provenance, and many other extension of the relational model. The same holds for incomplete $\semK$-relations. We introduce certain annotations as a sensible generalization of certain answers to incomplete $\semK$-relations, exploiting the natural order and its corresponding lattice structure that are defined for many semirings. Importantly, this generalization coincides with the standard definition of certain answers for set semantics and the definition for bag semantics proposed by Guagliardo et al.~\cite{DBLP:journals/sigmod/GuagliardoL17}.
We begin with incomplete $\semK$-relations, an extension of incomplete databases to $\semK$-relations. Recall that $\semK$-relations generalize set semantics, bag semantics, provenance, and many other extensions of the relational model. The same holds for incomplete $\semK$-relations. We introduce certain annotations as a sensible generalization of certain answers to incomplete $\semK$-relations. Exploiting the natural order, an order relation over the domain of a semiring that is defined for many semirings, we define the certain annotation of a tuple as a lower bound of its annotation across all possible worlds. % wrt. the natural order.
Importantly, this generalization coincides with the standard definition of certain answers for set semantics and the definition for bag semantics proposed by Guagliardo et al.~\cite{DBLP:journals/sigmod/GuagliardoL17}.
We then introduce Uncertainty-Annotated Databases (\abbrUADBs{}), which are databases where each tuple is annotated with an under-approximation as well as an over-approximation of its certain annotation.
By choosing the annotation of a tuple in the \termBGW{} as the over-approximation, \abbrUADBs{} are a strict improvement over \termBGQP{}.
Figure~\ref{fig:queryTypeBreakdown} shows the relationship of certain answers, \termBGA{}, and \abbrUADBs{} for set semantics. Recall that set semantics corresponds to $\semB$-relations: tuples are either annotated with $T$ (true) to indicate that they exist or $F$ (false) to indicate that they do not exist.
@ -32,7 +33,7 @@ Queries over \abbrUADBs{} treat the two annotations of a tuple independently and
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Incomplete $\semK$-Relations}
An \emph{incomplete $\semK$-database} $\pdb$ is a set of possible worlds $\pdb = \{\db_1, \ldots, \db_n\}$ where each possible world $D_i$ is a $\semK$-database. Incomplete $\semK$-relations are defined analogously. Queries over incomplete $\semK$-databases are evaluated using possible worlds semantics, but specifically under $\semK$-relational semantics (i.e., query results are a set of $\semK$-relations).
An \emph{incomplete $\semK$-database} $\pdb$ is a set of possible worlds $\pdb = \{\db_1, \ldots, \db_n\}$ where each possible world $\db_i$ is a $\semK$-database. Incomplete $\semK$-relations are defined analogously. Queries over incomplete $\semK$-databases are evaluated using possible worlds semantics, i.e., by evaluating the query over each possible world using $\semK$-relational semantics. That is, the result of a query is a set of $\semK$-relations (possible worlds).
Incomplete $\semK$-databases can be trivially extended to probabilistic $\semK$-databases by defining a distribution $\prob: \pdb \mapsto [0,1]$ such that $\sum_{\db \in \pdb} \prob(\db) = 1$.
%\AR{Technically $\semK$-database has not been defined, though $\semK$-relations have been.}
@ -104,7 +105,7 @@ Conversely for the third tuple we get $\pwCertOf{\semB}(\{F,T\}) = F \wedge T =
\end{example}
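The computation in the example can be sketched in a few lines. This is an illustration only, assuming that the certain annotation is taken as the greatest lower bound of a tuple's annotations across all worlds with respect to the natural order: conjunction for the boolean semiring and `min` for the natural numbers. The function names are hypothetical.

```python
from functools import reduce


def certain_annotation(annotations, meet):
    """Greatest lower bound of a non-empty list of annotations."""
    return reduce(meet, annotations)


def certain_relation(worlds, meet, zero):
    """Certain annotation of every tuple across all possible worlds.

    Each world is a dict {tuple: annotation}; a tuple missing from a
    world is treated as annotated with the semiring's zero element.
    Tuples whose certain annotation is zero are left out of the result.
    """
    tuples = set().union(*(w.keys() for w in worlds))
    cert = {}
    for t in tuples:
        c = certain_annotation([w.get(t, zero) for w in worlds], meet)
        if c != zero:
            cert[t] = c
    return cert


# Boolean semiring: a tuple is certain only if it is in every world.
bool_meet = lambda a, b: a and b
# Natural numbers: the certain multiplicity is the minimum over worlds.
nat_meet = min
```

For annotations `[False, True]` under `bool_meet` the result is `False`, matching the third tuple in the example above.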
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
Just as like classical incomplete databases, incomplete $\semK$-databases are used only as an abstract model to define clear semantics.
Just like classical incomplete databases, incomplete $\semK$-databases are used only as an abstract model to define clear semantics.
More compact representations are needed.
For example, existing incomplete data models like c-tables can be used to concisely encode incomplete $\semB$-databases.
@ -184,7 +185,8 @@ For example, existing incomplete data models like c-tables can be used to conci
\subsection{UA-Databases} \label{sec:UA-model}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{wrapfigure}[15]{r}[0pt]{9.5cm}
\begin{wrapfigure}[13]{r}[0pt]{9.5cm}
\vspace*{-8mm}
\centering
\includegraphics[width=9cm]{../figures/uadb-in-comparison}
\caption{Representations of incomplete data and relationships between them.}
@ -201,17 +203,17 @@ D(\tup) = [ \bgdb(\tup), \TUL(\tup) ] &&
\query(D)(\tup) = [ \query(\bgdb)(\tup), \query(\TUL)(\tup) ]
\end{align*}
That is, the \abbrUADB $D$ annotates each tuple with (1) the annotation from $\bgdb$, and (2) the annotation from $\TUL$.
Similarly, the result of query $\query(D)$ is a relation that labels its tuples with the annotations obtained by querying $\bgdb$ and $\TUL$ independently. Since $\bgdb$ and $\TUL$ are $\semK$-databases this amounts to standard $\semK$ query evaluation. Observe that this trivially preserves the over-approximation of certain answers, because by definition the certain answers are a lower bound on the result of a query over one possible world. The fact that the under-approximation is preserved by this approach is less obvious. In~\cite{FH18} we have proven that positive relational algebra over $\semK$-relations preserves under-approximations. For specific incomplete data models, e.g., tuple-independent database, \abbrUADBs{} \abbrUADBs{} compute precisely the certain answers. The same holds for certain classes of queries.
Similarly, the result of query $\query(D)$ is a relation that labels its tuples with the annotations obtained by querying $\bgdb$ and $\TUL$ independently. Since $\bgdb$ and $\TUL$ are $\semK$-databases this amounts to standard $\semK$ query evaluation. Observe that this trivially preserves the over-approximation of certain answers, because by definition the certain answers are a lower bound on the result of a query over one possible world. The fact that the under-approximation is preserved by this approach is less obvious. In~\cite{FH18} we have proven that positive relational algebra over $\semK$-relations preserves under-approximations. For specific incomplete data models, e.g., tuple-independent databases, \abbrUADBs{} compute precisely the certain answers. The same holds for certain classes of queries.
% Our key theoretical contribution for \abbrUADBs thus far is demonstrating that positive relational algebra queries over \abbrUADBs{} preserve the under- and over-approximation of certain answers encoded by a \abbrUADBs{} constructed for any l-semiring.
To demonstrate the backward-compatibility of \abbrUADBs{} with existing incomplete and probabilistic data models we have developed methods~\cite{FH18} for computing best-guess worlds and labelings (under-approximations of certain answers) for these models. We refer to the later as \textit{labeling schemes} in this work.
To demonstrate the backward compatibility of \abbrUADBs{} with existing incomplete and probabilistic data models we have developed methods~\cite{FH18} for computing best-guess worlds and labelings (under-approximations of certain answers) for these models. We refer to the latter as \textit{labeling schemes}.
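The component-wise query evaluation described above can be sketched as follows. This is a minimal illustration under bag semantics, assuming relations are dicts from tuples to multiplicities; `project`, `eval_uadb`, and the sample data are hypothetical and are not taken from the prototype in~\cite{FH18}.

```python
def project(rel, idxs):
    """Bag-semantics projection over {tuple: multiplicity}."""
    out = {}
    for tup, mult in rel.items():
        key = tuple(tup[i] for i in idxs)
        out[key] = out.get(key, 0) + mult
    return out


def eval_uadb(query, best_guess, labeling):
    """Evaluate a query over both components independently.

    Each result tuple is annotated with the pair
    [Q(best_guess)(t), Q(labeling)(t)], following the equations above.
    """
    bg_result = query(best_guess)
    lab_result = query(labeling)
    return {t: (bg_result[t], lab_result.get(t, 0)) for t in bg_result}


# Hypothetical data: three tuples in the best-guess world, two of
# which the labeling marks as certain.
best_guess = {("a", "NY"): 1, ("b", "NY"): 1, ("c", "AZ"): 1}
labeling = {("a", "NY"): 1, ("c", "AZ"): 1}

result = eval_uadb(lambda r: project(r, [1]), best_guess, labeling)
```

Here `result` maps `("NY",)` to `(2, 1)`: the best-guess world contains two such tuples, of which at least one is certain.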
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{example}\label{example:kwlabels}
Consider a probabilistic version of the incomplete $\semB$-relation from Example~\ref{ex:incomplete-k-databases} where $P(D_1) = 0.4$ and $P(D_2) = 0.6$.
One way to encode this database more compactly is to use a tuple-independent probabilistic database~\cite{DBLP:series/synthesis/2011Suciu} (\abbrTIDB) as shown below. In a \abbrTIDB{}, tuples are assumed to be independent probabilistic events. The marginal probability $P(t)$ of each tuple $t$ is stored in an attribute $p$. A world with the highest probability for a \abbrTIDB can be computed as the set of tuples that have probability larger than or equal to $0.5$.\footnote{Each subset of a \abbrTIDB is a possible world and the probability of a possible world is computed by multiplying the probabilities of tuples that are in the possible world and $1-p(t)$ for tuples that are not in the possible world. The probability can be maximized by including tuples if $p(t) \geq 1-p(t)$ which is the case then $p(t) \geq 0.5$.} For \abbrTIDBs, certain answers can be computed efficiently by returning all tuples that have a probability of $1$ and we can use this method as a labeling scheme. For instance, storing the annotations of a tuple in attributes \texttt{isBG} and \texttt{isCert}\footnote{Note that the \texttt{isBG} attribute is redundant since it will be \lstinline!'T'! for every tuple in the result of the query.}, the table \texttt{loc} from the previous example can be computed using the SQL query: % \lstinline{SELECT locale, state, 'T' AS is BG, CASE WHEN p >= 0.5 THEN 'T' ELSE 'F' END AS isCert FROM R WHERE p = 1}
One way to encode this database more compactly is to use a tuple-independent probabilistic database~\cite{DBLP:series/synthesis/2011Suciu} (\abbrTIDB) as shown below. In a \abbrTIDB{}, tuples are assumed to be independent probabilistic events. The marginal probability $P(t)$ of each tuple $t$ is stored in an attribute $p$. A world with the highest probability for a \abbrTIDB can be computed as the set of tuples that have probability larger than or equal to $0.5$.\footnote{Each subset of a \abbrTIDB is a possible world and the probability of a possible world is computed by multiplying the probabilities of tuples that are in the possible world and $1-p(t)$ for tuples that are not in the possible world. The probability can be maximized by including tuples if $p(t) \geq 1-p(t)$ which is the case if $p(t) \geq 0.5$.} For \abbrTIDBs, certain answers can be computed efficiently by returning all tuples that have a probability of $1$ and we can use this method as a labeling scheme. For instance, storing the annotations of a tuple in attributes \texttt{isBG} and \texttt{isCert}\footnote{Note that the \texttt{isBG} attribute is redundant since it will be \lstinline!'T'! for every tuple in the result of the query. Our implementation omits this attribute.}, we can compute a \abbrUADB{} from the TIP table \texttt{loc} using the SQL query: % \lstinline{SELECT locale, state, 'T' AS is BG, CASE WHEN p >= 0.5 THEN 'T' ELSE 'F' END AS isCert FROM R WHERE p = 1}
\begin{lstlisting}
SELECT locale, state, 'T' AS isBG, CASE WHEN p = 1 THEN 'T' ELSE 'F' END AS isCert
FROM R WHERE p >= 0.5

View File

@ -11,7 +11,7 @@ However, the heuristic steps in a data curation workflow frequently admit altern
%
\abbrUADBs{} have the potential for great practical impact since they combine the practicality and performance of \termBGQP with the rigor of certain answers.
Our proposed techniques can significantly improve many real world use cases which currently make decisions based on uncertain data with severe negative impact.
In addition to the potential of the proposed research itself, this grant will support three Ph.D. students and s one postdoctoral researcher.
In addition to the potential of the proposed research itself, this grant will support three Ph.D. students and one postdoctoral researcher.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\mypar{Integration of Research and Education}

View File

@ -4,7 +4,7 @@
Uncertainty is prevalent in data analysis, no matter what the size of the data, the application domain, or the type of analysis.
Common types of uncertainty include missing values, sensor errors, bias, outliers, mismatched data, and many more.
If ignored, data uncertainty can result in hard to trace errors in analytical results, which in turn can have severe real world implications such as unfounded scientific discoveries, financial damages, or even effects on peoples physical well-being (e.g., medical decisions based on incorrect data).
If ignored, data uncertainty can result in hard-to-trace errors in analytical results, which in turn can have severe real-world implications such as unfounded scientific discoveries, financial damages, or even effects on people's physical well-being (e.g., medical decisions based on incorrect data).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -66,13 +66,13 @@ The result is a strict generalization of classical data management that also cle
\begin{example}
Consider the \lstinline{customers} table shown in Figure~\ref{fig:curation-running-example}, which tracks customer features that a bank uses to determine a customer's eligibility for loans.
This data is missing values including Peter's income and whether Alice owns property, and is thus uncertain.
This data is missing values, including Peter's income and whether Alice owns property, and is thus uncertain.
Ideally, the bank would curate the data manually (e.g., a service representative calls Alice to request the missing information) before acting on it.
However, if the bank is sufficiently large, it may be more productive to mine existing data (e.g., property tax records) to ``fill in'' missing values.
Such mining often also involves uncertainty, for example as a result of entity matching.
For this example, assume that the best match for Alice in her town's property tax database has a 40\% likelihood.
Because no match exists with greater than 50\% likelihood, the $\sqlNull$ value would typically be replaced with a ``no''.
From the bank's perspective, the data quality issue is now resolved.
From the bank's perspective, the data quality issue is resolved.
However, there is now at least a 40\% chance that any loan decision the bank makes will be based on faulty evidence.
\end{example}
@ -113,11 +113,11 @@ Unfortunately, in this sort of \emph{\termBEA}, information about ambiguous inte
\begin{tabular}{p{2.4cm}rrrrr}\toprule
& $Q1$ & $Q2$ & $Q3$ & $Q4$ & $Q5$\\ \midrule
Runtime Overhead & 2.28\% & 1.81\% & 1.32\% & 2.88\% & 3.51\%\\
Misclassified Certain Answers & 0.55\% & 0.37\% & 0\% & 0.92\% & 0.29\%\\
Misclassified Answers & 0.55\% & 0.37\% & 0\% & 0.92\% & 0.29\%\\
\bottomrule
\end{tabular}
}
\includegraphics[width=0.45\textwidth, trim=0cm 4cm 1cm 5cm, clip]{../figures/incomp_bn}
\includegraphics[width=0.45\textwidth]{../figures/crime}
\bfcaption{Experimental results using crime data~\cite{CHICRIME}: (top) 5 real-world queries;
(bottom) a large number of randomly generated projection queries.
\abbrUADBs have low overhead, never incorrectly flag an answer as certain, and rarely mis-classify certain results as uncertain.}
@ -126,8 +126,8 @@ Misclassified Certain Answers & 0.55\% & 0.37\% & 0\% & 0.92\% & 0.29\%\\
An abstraction recently proposed by the PIs~\cite{FH18} called \textbf{Uncertainty-Annotated} \textbf{Da\-ta\-bases} (\textbf{\abbrUADBs}) bridges the gap between certain query answering and \termBEA{} by fusing the latter with a lightweight approximation of the former.
Specifically, queries in a \abbrUADB behave exactly as in \termBEA{}, but results include sufficient information to distinguish between result tuples and values that are certain, from those that might not be.
As illustrated in Figure~\ref{fig:UADBPerformance} (see~\cite{FH18} for more details), \abbrUADBs only introduce a small performance penalty compared to \termBEA{}, never falsely flag an answer as certain, and (depending on dataset quality and query specifics) only rarely incorrectly flag answers as uncertain.
Specifically, queries in a \abbrUADB behave exactly as in \termBEA{}, but results also include sufficient information to distinguish result tuples and values that are certain from those that might not be.
As illustrated in Figure~\ref{fig:UADBPerformance} (see~\cite{FH18} for more details), \abbrUADBs only introduce a small performance penalty compared to \termBEA{}, never falsely flag an answer as certain, and (depending on dataset quality and query specifics) only rarely incorrectly flag answers as being uncertain.
\abbrUADBs (and our preliminary implementation of them) extend the state-of-the-art in the following ways:
%\begin{itemize}
@ -144,7 +144,7 @@ A superset of the certain answers is annotated such that the subset labeled as c
%\item \textbf
\mypar{Backward Compatibility}
A \abbrUADB can be efficiently constructed from any existing model of incomplete, probabilistic, or fuzzy data as long as we can compute (1) a distinguished possible world (the over-approximation) and (2) a labeling of the tuples from this world as (un-)certain (the under-approximation). In preliminary work~\cite{FH18}, we have already demonstrated compatibility with several common incomplete database models such as V-tables~\cite{DBLP:conf/sigmod/AbiteboulKG87}, c-tables~\cite{Imielinski:1984:IIR:1634.1886}, and tuple- and block-independent probabilistic databases~\cite{DBLP:series/synthesis/2011Suciu}.
A \abbrUADB can be efficiently constructed from any existing model of incomplete, probabilistic, or fuzzy data, as long as we can compute (1) a distinguished possible world (the over-approximation) and (2) a labeling of the tuples from this world as (un-)certain (the under-approximation). In preliminary work~\cite{FH18}, we have already demonstrated compatibility with several common incomplete database models such as V-tables~\cite{DBLP:conf/sigmod/AbiteboulKG87}, c-tables~\cite{Imielinski:1984:IIR:1634.1886}, and tuple- and block-independent probabilistic databases~\cite{DBLP:series/synthesis/2011Suciu}.
%\item \textbf
\mypar{Combining \utermBGQP{} with Certain Answers}
@ -170,7 +170,7 @@ The proposed work follows two major research thrusts:
We will develop the theory of \abbrUADBs to take advantage of this generality and investigate how the choice of annotation domain affects the accuracy of our approximations of certain answers.
%\item \textbf
\textit{(c) Attribute-level Annotations}:
We will extend \abbrUADBs{} with attribute-level annotations that encode bounds on the values of an attribute across all possible worlds, enabling a more precise encoding of the uncertainty inherent in data. We conjecture that this extension will not result in a significant increase in computational complexity.
We will extend \abbrUADBs{} with attribute-level annotations that encode bounds on the values of an attribute across all possible worlds, enabling a more precise encoding of the uncertainty inherent in a dataset. We conjecture that this extension will not result in a significant increase in computational complexity.
%\item \textbf
\textit{(d) Non-monotone Queries}:
Under-approximating certain answers is easy only for monotone queries.
@ -187,13 +187,13 @@ The proposed work follows two major research thrusts:
We will develop adapters expressed as relational queries that translate data from incomplete and probabilistic data models into \abbrUADBs{}. Furthermore, we propose to extend SQL's DML and DDL constructs to encode operations (e.g., key-repair~\cite{DBLP:conf/icde/AntovaKO07a}) that introduce and manage uncertainty. Importantly, this will include the study of how update operations can be executed over \abbrUADBs{}.
%\item \textbf
\textit{(b) Query Rewriting for \abbrUADBs{}}:
Building on promising results with a prototype query-rewriting middleware implementing tuple-level \abbrUADBs{}, we will implement and evaluate support for attribute-level annotations, aggregation (non-monotone queries), and additional types of annotations (e.g., uncertain provenance annotations).
Building on promising results with a prototype query-rewriting middleware implementing tuple-level \abbrUADBs{}, we will implement and evaluate support for attribute-level annotations, aggregation (and non-monotone queries), and additional types of annotations (e.g., uncertain provenance annotations).
%\item \textbf
\textit{(c) A Specialized Database Engine for \abbrUADBs{}}:
To further improve performance of query evaluation, we will design and implement a specialized \abbrUADB{} query processing engine that will exploit compact representations based on factorization, functional aggregate queries~\cite{faq-2} (FAQ), and geometrical interpretations of relations~\cite{tetris}.
% \item \textbf
\textit{(d) Online Refinement of \abbrUADBs{}}:
Since \abbrUADBs{} encode both super- and subsets of certain answers, they can be used as a pruning step to speed up any exact algorithm for computing certain answers (and/or probabilities).
Since \abbrUADBs{} encode both super- and subsets of certain answers, they can be used as a pruning step to speed up exact algorithms for computing certain answers (and/or probabilities).
% We will investigate the use of \abbrUADBs{} as an efficient preprocessing step for exact methods for computing certain answers. Thus, we can reduce amount of data that needs to be processed by an exact, significantly more expensive, method for computing certain answers by only applying it to answers which may be labeled incorrectly by the \abbrUADB{}.
% \end{itemize}
% \end{enumerate}

View File

@ -6,14 +6,12 @@
% We propose to develop a solid theoretical foundation for \abbrUADBs{}. % based on our preliminary work in this area.
% \end{sectionsummary}
In preliminary work on \abbrUADBs~\cite{FH18}, we extended the classical notion of certain and possible answers in incomplete databases to admit support for any type of semiring-annotated relation (generalizing bag semantics, various types of provenance, access control, \ldots).
In preliminary work on \abbrUADBs~\cite{FH18}, we extended the classical notion of certain and possible answers for incomplete databases to support any type of semiring-annotated relation (generalizing bag semantics, various types of provenance, access control, \ldots).
% Furthermore, we have generalized the concept of certain answers to certain annotations.
\BG{MOVED TO SEC 3, TOO DETAILED HERE AND SHOULD HAVE BEEN COVERED IN SEC 3 ANYWAYS: The fundamental ideas behind of our preliminary work build on $\semK$-databases, assuming only that $\semK$ is an l-semiring --- a semiring where the (natural) order of their elements (i.e., based on addition) forms a lattice structure.
Our key theoretical contribution for \abbrUADBs thus far is demonstrating that positive relational algebra queries over \abbrUADBs{} preserve the under- and over-approximation of certain answers encoded by a \abbrUADBs{} constructed for any l-semiring.}
We established \abbrUADBs as a light-weight model that labels \termBGA according to an approximation of certain answers.
We established \abbrUADBs as a light-weight model that labels \termBGA according to an approximation of certain answers.
The \termBGA provide an over-approximation of certain answers, while the labeled tuples provide an under-approximation.
The work conducted as part of this proposal will significantly extend the formal foundation of \abbrUADBs by studying under which conditions (data, query, or semiring properties) the approximations of certain answers provided \abbrUADBs{} are bounded (or exact).
The work conducted as part of this proposal will significantly extend the formal foundation of \abbrUADBs by studying under which conditions (data, query, or semiring properties) the approximations of certain answers provided by \abbrUADBs{} are bounded (or exact).
Furthermore, we will extend \abbrUADBs{} to support attribute-level annotations that bound the values of attributes across possible worlds.
Finally, using attribute-level bounds and extending ideas from functional aggregate queries~\cite{faq-1,faq-2,ajar} to concisely encode possible answers, we will approximate certain answers for non-monotone queries such as queries with aggregation and negation.
@ -37,7 +35,7 @@ Furthermore, we will investigate whether we can guarantee any bounds on the appr
In preliminary work we have demonstrated that positive queries (SPJ-U) over an \abbrUADB preserve the under-approximation and over-approximation of certain answers.
However, we observe that under certain circumstances \abbrUADBs exactly encode the certain answers of a query.
For instance, evaluating a positive query over any \abbrUADB generated from a \abbrTIDB returns a labeling that is precisely the certain answers. % (the under-approximation is correct).
Even when this is not the case, initial experiments~\cite{FH18} suggest that in many real-world settings, labelings are close approximations of (or exactly equal to) certain answers. In~\cite{FH18} we did evaluate a large number of randomly generated projection queries over the result of missing value imputation for 9 real-world dataset and found that the percentage of certain answers mis-classified by our approach as uncertain is typically less than 5\%. When using real-world queries, the error rate is even lower (less than 1\% for the example queries we have tested).
Even when this is not the case, initial experiments~\cite{FH18} suggest that in many real-world settings, labelings are close approximations of (or exactly equal to) certain answers. In~\cite{FH18} we evaluated a large number of randomly generated projection queries over the result of missing value imputation for 9 real-world datasets and found that the percentage of certain answers mis-classified by our approach as uncertain is typically less than 5\%. When using real-world queries, the error rate is even lower (less than 1\% for the example queries we have tested).
We propose to study how characteristics of the input \textit{data} (e.g., the incomplete or probabilistic data model a \abbrUADB is derived from) and of the \textit{query} (e.g., structural parameters such as whether the query is hierarchical~\cite{DBLP:journals/vldb/DalviS07}) relate to the tightness of the approximation provided by a \abbrUADB.
This information is useful for keeping the user aware of the degree of uncertainty in a \abbrUADB's query results.
For example, if we can establish that the error rate of a labeling is no more than 1\% then it is reasonable to trust a \abbrUADB{} if it labels a result as uncertain. % labeling of any given answer as certain or uncertain.
@ -51,7 +49,7 @@ Although \abbrUADBs admit a natural extension to incomplete databases and certai
We will study the properties of these non-traditional cases and will investigate how approximation of certain answers is affected by the choice of annotation domain.
\end{sectionsummary}
Our incomplete $\semK$-relations and \abbrUADBs generalize incomplete data and certain answers beyond sets and bags. This opens up new use cases such as uncertain provenance where we keep track of which parts of the provenance of a data item are certain and which are uncertain or uncertain fine-grained access-control where a query result can be exposed to a user if its certain confidentiality level is one that the user is allowed to see. Extending the result of research thrust I-a, we will study how the choice of semiring affects the precision of our approximation of certain answers. Furthermore, we will investigate novel applications enabled by incomplete databases beyond set semantics. For instance, the aforementioned incomplete databases with access control annotations would enable a rigorous treatment of access control for applications applications that analyze data that is the results of information extraction, data cleaning, or data wrangling which are inherently uncertain.
Our incomplete $\semK$-relations and \abbrUADBs generalize incomplete data and certain answers beyond sets and bags. This opens up new use cases such as uncertain provenance, where we keep track of which parts of the provenance of a data item are certain and which are uncertain, or uncertain fine-grained access control, where a query result can be exposed to a user if its certain confidentiality level is one that the user is allowed to see. Extending the result of research thrust I-a, we will study how the choice of semiring affects the precision of our approximation of certain answers. Furthermore, we will investigate novel applications enabled by incomplete databases beyond set semantics. For instance, the aforementioned incomplete databases with access control annotations would enable a rigorous treatment of access control for applications that analyze data that is the inherently uncertain result of information extraction, data cleaning, or data wrangling.
@ -63,13 +61,13 @@ Our incomplete $\semK$-relations and \abbrUADBs generalize incomplete data and c
We will extend the \abbrUADB{} model with attribute-level annotations that bound an attribute's values across all possible worlds, extend the definition of certain answers accordingly, study how these annotations propagate through queries, and investigate what certainty guarantees are provided.
\end{sectionsummary}
Like many existing models for incomplete and probabilistic databases, our preliminary work with \abbrUADBs tracks uncertainty at the row-level. % through under- and over-approximations of certain annotations.
Like many existing models for incomplete and probabilistic databases, our preliminary work on \abbrUADBs tracks uncertainty at the row level. % through under- and over-approximations of certain annotations.
However, as repeated efforts have shown~\cite{DBLP:conf/icde/KennedyK10,Jampani:2008:MMC:1376616.1376686,widom2004trio,Singh:2008:ONS:1376616.1376744,DBLP:journals/corr/NandiYKGFLG16,Antova:2009:WBE:1644245.1644253}, tracking uncertainty at the attribute level can lead to more concise and precise representations of uncertainty.
\begin{example}
Consider an employee table with two attributes: name and salary.
Assume we know with certainty that Peter is an employee, but only know that his salary is either \$25,000, \$30,000, or \$40,000.
This information modeled as an incomplete database with 3 possible worlds: $D_1 = \{ (Peter, 25000)\}$, $D_2 = \{ (Peter, 30000) \}$, and $D_3 = \{(Peter, 40000)\}$.
This information can be modeled as an incomplete database with 3 possible worlds: $D_1 = \{ (Peter, 25000)\}$, $D_2 = \{ (Peter, 30000) \}$, and $D_3 = \{(Peter, 40000)\}$.
Let us assume that we choose $D_1 = \{(Peter, 25000)\}$ as the \termBGW{}.
Even though part of the information in the single tuple of $D_1$ is certain (the name),
it would be labeled as uncertain in a \abbrUADB{} because the tuple itself is not certain.
@ -89,7 +87,7 @@ Because this tuple matches exactly one tuple in each possible world that has a n
Of course, the same would be true if we chose a strictly wider bound for the salary (e.g., $[1000,70000]$).
Intuitively, the first bound is more precise and thus preferable.
We plan to formalize this concept into a measure for \AR{So this metric will be valid for any totally ordered semi-ring or will it be like integers?} how precisely an \abbrUADB with attribute-level annotations represents an incomplete database. For instance, this measure may take the form of a partial order that models whether an tuple with attribute-level annotations ``dominates'' another tuple with attribute-level annotations.\BG{@Atri: does this clarify things you think?}
We plan to formalize this concept into a measure for how precisely an \abbrUADB with attribute-level annotations represents an incomplete database. For instance, this measure may take the form of a partial order that models whether a tuple with attribute-level annotations ``dominates'' another tuple with attribute-level annotations.
Furthermore, we will investigate algorithms for computing queries over attribute-level \abbrUADBs and for translating incomplete data models into our model, and study the computational complexity of these problems.
Finally, we will investigate alternatives to bounds for data types that do not have a meaningful ordering (e.g., name in the example):
For example, as with rows, we might label attributes as certain (i.e., taking exactly one value across all possible worlds where the tuple appears) or uncertain (taking any value).\footnote{This would be closely related to models that allow variables as values, such as Codd-tables or V-tables.}
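As a purely illustrative sketch of the ``dominates'' relation mentioned above, the following example uses interval containment as the ordering. Both the containment rule and the tighter bound $[25000,40000]$ are assumptions made for this illustration, not the measure that this thrust will ultimately define.
\begin{example}
Let $t = (Peter, [25000,40000])$ and $t' = (Peter, [1000,70000])$.
Both tuples bound Peter's salary in all three possible worlds, since $25000$, $30000$, and $40000$ lie in both intervals.
Under a containment-based order, an interval $[l,u]$ is at least as precise as $[l',u']$ if $l' \leq l$ and $u \leq u'$.
Because $1000 \leq 25000$ and $40000 \leq 70000$, $t$ would dominate $t'$, whereas two partially overlapping bounds such as $[25000,40000]$ and $[30000,50000]$ would be incomparable, which is why a partial rather than a total order may be the appropriate notion.
\end{example}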


@ -2,7 +2,7 @@
\section{Research Thrust II - Design and Implementation of \sysname}
\label{sec:research-thrust-ii}
Thrust II addresses the challenge of translating principles developed in Thrust I into practice through a prototype \abbrUADB called \textit{\sysname} (\textit{Uncertainty for You}).
Thrust II addresses the challenge of translating the principles developed in Thrust I into practice through a prototype \abbrUADB called \textit{\sysname} (\textit{Uncertainty for You}).
We will address logistical challenges involved in managing labeled data and then realize \sysname in three stages.
In stage 1, we will realize a purely rewrite-based implementation based on our preliminary work~\cite{FH18} for incomplete bag semantics databases.
Then, in stage 2, we will explore how augmenting the database through new data structures, algorithms, and cost-based optimization strategies can improve performance and accuracy.
@ -62,7 +62,7 @@ We will develop techniques for rewriting queries with uncertainty annotations, e
\end{sectionsummary}
In \cite{FH18} and \cite{DBLP:journals/corr/NandiYKGFLG16}, we developed query rewriting middleware that implemented a bag semantics \abbrUADB with tuple-level uncertainty using a classical relational database.
In \cite{FH18} and \cite{DBLP:journals/corr/NandiYKGFLG16}, we developed a query rewriting middleware that implemented a bag semantics \abbrUADB with tuple-level uncertainty using a classical relational database.
As the first stage of realizing \sysname{}, we will extend this middleware with support for (1) attribute-level uncertainty, (2) aggregation, and, if time permits, (3) semirings other than bags.
The key challenge we will focus on at this stage is ensuring performance, while retaining backwards compatibility by not requiring any fundamental changes to the underlying database.
@ -87,7 +87,7 @@ Similarly, schema-level rules (e.g., the upper bound is always infinite) preclud
% Similarly,
exact bounds may not be available (e.g., if an uncertain value is transformed by a non-monotone user-defined function).
For such attributes, we will explore gracefully degrading to simpler boolean (i.e., certain vs uncertain) attribute-level annotations.
\BG{Time permitting, we will also explore different user interfaces for uncertain relational data~\cite{kumari:2016:qdb:communicating}, as a range may preclude the need for \termBGA in some settings.}
% \BG{Time permitting, we will also explore different user interfaces for uncertain relational data~\cite{kumari:2016:qdb:communicating}, as a range may preclude the need for \termBGA in some settings.}
% For example, after rewriting, the schema of $R$ becomes
@ -129,7 +129,7 @@ For such attributes, we will explore gracefully degrading to simpler boolean (i.
\mypar{Supporting Aggregation}
Aggregate queries introduce an additional layer of complexity.
For example, consider the query \lstinline{SELECT SUM(guests) AS total FROM PARTICIPANTS}.
The aggregation function result (the value of \lstinline{total}) can be uncertain in multiple ways: (1) if the existence of at least one input tuple is uncertain, then the aggregation function result cannot be the same in worlds that include this tuple and worlds that do not include this tuple (unless that \lstinline{guests} value of this tuple is $0$ in every possible world and (2) even if the existence of all input tuples is certain, the aggregation function result may still be uncertain if the \lstinline!guests! attribute of one of these tuples differs across possible worlds (is uncertain).
The aggregation function result (the value of \lstinline{total}) can be uncertain in multiple ways: (1) if the existence of at least one input tuple is uncertain, then the aggregation function result cannot be the same in worlds that include this tuple and worlds that do not include this tuple (unless the \lstinline{guests} value of this tuple is $0$ in every possible world); and (2) even if the existence of all input tuples is certain, the aggregation function result may still be uncertain if the \lstinline!guests! attribute of one of these tuples differs across possible worlds (is uncertain).
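To make these two sources of uncertainty concrete, the following sketch works through a small hypothetical instance. The tuples, the interval notation, and the bounding rule are illustrative assumptions rather than the definitions that will be developed in this thrust.
\begin{example}
Suppose \lstinline{PARTICIPANTS} contains two tuples.
Tuple $t_1$ exists in every possible world, but its \lstinline{guests} value is only known to lie in $[2,4]$.
Tuple $t_2$ has a \lstinline{guests} value of $3$ in every world in which it appears, but its existence is uncertain.
Assuming that \lstinline{guests} is non-negative, the result is bounded by
\[
2 + 0 \;\leq\; \mathit{total} \;\leq\; 4 + 3 ,
\]
i.e., $\mathit{total} \in [2,7]$.
The lower bound combines the smallest value of $t_1$ with the absence of $t_2$, and the upper bound combines the largest value of $t_1$ with the presence of $t_2$, so no possible world can yield a result outside of this range.
Even if $t_2$ were certain, case (2) alone would leave the result uncertain, with $\mathit{total} \in [5,7]$.
\end{example}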
% Uncertainty in the result attribute (e.g., the value of \lstinline{total}) requires one of the source rows to be different across possible worlds.
% \lstinline{total} being uncertain requires that (1) there must exist a certain tuple in \lstinline{PARTICIPANTS} with an uncertain value of \lstinline{guests}; or
% (2) there must exist a possible, but not certain, tuple in \lstinline{PARTICIPANTS}.
@ -175,11 +175,11 @@ Conversely, PI Glavic's GProM~\cite{DBLP:journals/debu/ArabFGLNZ17} encodes prov
\subsection{II-c: Database Engine Specializations for \abbrUADBs}
\begin{sectionsummary}
We will identify, realize, and evaluate new data-structures, algorithms, optimizer passes, and other internal improvements that will make \abbrUADBs{} more efficient and expressive than they could be made through query rewriting alone.
We will identify, realize, and evaluate new data structures, algorithms, optimization techniques, and other internal improvements that will make \abbrUADBs{} more efficient and expressive than they could be made through query rewriting alone.
\end{sectionsummary}
Our first goal aims at supporting uncertainty-aware data management within existing relational databases.
Our second goal aims at supporting uncertainty-aware data management within existing relational databases.
However, ensuring full backwards-compatibility precludes many opportunities for optimization.
The third goal of this thrust is to consider architectural changes including new algorithms, data structures, and optimization techniques for improving \abbrUADB{} performance and utility.
@ -213,7 +213,7 @@ We will extend \sysname to support exact certain (and probabilistic) queries ove
The heavyweight machinery of computing certain answers using incomplete, probabilistic, and fuzzy databases (e.g., C-Tables~\cite{Imielinski:1984:IIR:1634.1886,green2006models}, U-Relations~\cite{Antova:2009:WBE:1644245.1644253}, VG-Functions~\cite{Jampani:2008:MMC:1376616.1376686}, or VC-Tables~\cite{DBLP:journals/pvldb/YangMFLK15}) is wasteful when most tuples (and/or attributes) in a dataset are certain.
We will explore a novel approach to incomplete and probabilistic query processing that works in three stages: (1) Compute a preliminary result using \abbrUADBs, (2) Compute exact certainty, probabilities, expectations, and other relevant measures for tuples and attributes that the \abbrUADB result labels as uncertain (as having a range), and (3) Compute counts, marginal distributions, and other relevant measures over tuples not in the \termBGA.
This three stage approach significantly benefits interactive analytics: (1) Based on preliminary experimental experience with real-world data~\cite{FH18}, \abbrUADBs mis-classify comparatively few tuples, so the need for heavyweight computations is limited and
This three-stage approach significantly benefits interactive analytics: (1) Based on preliminary experimental experience with real-world data~\cite{FH18}, \abbrUADBs misclassify comparatively few tuples, so the need for heavyweight computations is limited, and
%\BG{I think this is too strong of a claim. Even though we only have to apply the methods to determine whether tuples from the BGW are certain, doing so may very well require inspecting tuples not in the BGW: (2) Using \termBGA avoids the problem of polynomial explosion of \emph{possible} query results common to probabilistic databases}
(2) Results are produced incrementally as in online aggregation~\cite{Hellerstein:1997:OA:253260.253291} and approximate query processing~\cite{Acharya:1999:AAQ:304181.304581}\footnote{Our proposed approach complements anytime probabilities~\cite{DBLP:journals/vldb/FinkHO13}, which targets existential queries and individual tuples.}.
% Based on preliminary we observe that the amount of certain answers misclassified by our approach as uncertain is often quite low. Thus, this technique could lead to quite significant performance improvements.

BIN
submitted/fullprint.pdf Normal file

Binary file not shown.

BIN
submitted/proposal-only.pdf Normal file

Binary file not shown.

BIN
submitted/references.pdf Normal file

Binary file not shown.