master
Boris Glavic 2018-10-02 12:31:12 -05:00
parent 6cd245713b
commit 651f2f95ad
1 changed files with 7 additions and 7 deletions

@@ -6,12 +6,12 @@
% We propose to develop a solid theoretical foundation for \abbrUADBs{}. % based on our preliminary work in this area.
% \end{sectionsummary}
In preliminary work on \abbrUADBs~\cite{FH18}, we extended the classical notion of certain and possible answers in incomplete databases to admit support for any type of semiring-annotated relation (generalizing bag semantics, various types of provenance, access control, \ldots).
In preliminary work on \abbrUADBs~\cite{FH18}, we extended the classical notion of certain and possible answers for incomplete databases to support any type of semiring-annotated relation (generalizing bag semantics, various types of provenance, access control, \ldots).
% Furthermore, we have generalized the concept of certain answers to certain annotations.
We established \abbrUADBs as a light-weight model that labels \termBGA according to an approximation of certain answers.
The \termBGA provide an over-approximation of certain answers, while the labeled tuples provide an under-approximation.
The work conducted as part of this proposal will significantly extend the formal foundation of \abbrUADBs by studying under which conditions (data, query, or semiring properties) the approximations of certain answers provided \abbrUADBs{} are bounded (or exact).
The work conducted as part of this proposal will significantly extend the formal foundation of \abbrUADBs by studying under which conditions (data, query, or semiring properties) the approximations of certain answers provided by \abbrUADBs{} are bounded (or exact).
Furthermore, we will extend \abbrUADBs{} to support attribute-level annotations that bound the values of attributes across possible worlds.
Finally, using attribute-level bounds and extending ideas from functional aggregate queries~\cite{faq-1,faq-2,ajar} to concisely encode possible answers, we will approximate certain answers for non-monotone queries such as queries with aggregation and negation.
@@ -35,7 +35,7 @@ Furthermore, we will investigate whether we can guarantee any bounds on the appr
In preliminary work we have demonstrated that positive queries (SPJ-U) over an \abbrUADB preserve the under-approximation and over-approximation of certain answers.
However, we observe that under certain circumstances \abbrUADBs exactly encode the certain answers of a query.
For instance, evaluating a positive query over any \abbrUADB generated from a \abbrTIDB returns a labeling that is precisely the certain answers. % (the under-approximation is correct).
Even when this is not the case, initial experiments~\cite{FH18} suggest that in many real-world settings, labelings are close approximations of (or exactly equal to) certain answers. In~\cite{FH18} we did evaluate a large number of randomly generated projection queries over the result of missing value imputation for 9 real-world dataset and found that the percentage of certain answers mis-classified by our approach as uncertain is typically less than 5\%. When using real-world queries, the error rate is even lower (less than 1\% for the example queries we have tested).
Even when this is not the case, initial experiments~\cite{FH18} suggest that in many real-world settings, labelings are close approximations of (or exactly equal to) certain answers. In~\cite{FH18} we evaluated a large number of randomly generated projection queries over the result of missing value imputation for 9 real-world datasets and found that the percentage of certain answers misclassified by our approach as uncertain is typically less than 5\%. When using real-world queries, the error rate is even lower (less than 1\% for the example queries we have tested).
We propose to study how characteristics of the input \textit{data} (e.g., the incomplete or probabilistic data model a \abbrUADB is derived from) and of the \textit{query} (e.g., structural parameters such as whether the query is hierarchical~\cite{DBLP:journals/vldb/DalviS07}) relate to the tightness of the approximation provided by a \abbrUADB.
This information is useful for keeping the user aware of the degree of uncertainty in a \abbrUADB's query results.
For example, if we can establish that the error rate of a labeling is no more than 1\% then it is reasonable to trust a \abbrUADB{} if it labels a result as uncertain. % labeling of any given answer as certain or uncertain.
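One way to make the error rate mentioned above precise, given purely as an illustrative sketch under set semantics: the symbols $\mathit{cert}(Q)$ (the certain answers of a query $Q$, assumed non-empty) and $\mathit{lab}(Q)$ (the answers that the \abbrUADB{} labels as certain) are notation assumed here and are not taken from~\cite{FH18}.
\[
  \mathit{err}(Q) = \frac{|\mathit{cert}(Q) \setminus \mathit{lab}(Q)|}{|\mathit{cert}(Q)|}
\]
Since the labeling under-approximates the certain answers, $\mathit{lab}(Q) \subseteq \mathit{cert}(Q)$, and thus $\mathit{err}(Q) = 0$ exactly when the labeling coincides with the certain answers.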
@@ -49,7 +49,7 @@ Although \abbrUADBs admit a natural extension to incomplete databases and certai
We will study the properties of these non-traditional cases and will investigate how approximation of certain answers is affected by the choice of annotation domain.
\end{sectionsummary}
Our incomplete $\semK$-relations and \abbrUADBs generalize incomplete data and certain answers beyond sets and bags. This opens up new use cases such as uncertain provenance where we keep track of which parts of the provenance of a data item are certain and which are uncertain or uncertain fine-grained access-control where a query result can be exposed to a user if its certain confidentiality level is one that the user is allowed to see. Extending the result of research thrust I-a, we will study how the choice of semiring affects the precision of our approximation of certain answers. Furthermore, we will investigate novel applications enabled by incomplete databases beyond set semantics. For instance, the aforementioned incomplete databases with access control annotations would enable a rigorous treatment of access control for applications applications that analyze data that is the results of information extraction, data cleaning, or data wrangling which are inherently uncertain.
Our incomplete $\semK$-relations and \abbrUADBs generalize incomplete data and certain answers beyond sets and bags. This opens up new use cases such as uncertain provenance, where we keep track of which parts of the provenance of a data item are certain and which are uncertain, or uncertain fine-grained access control, where a query result can be exposed to a user if its certain confidentiality level is one that the user is allowed to see. Extending the results of research thrust I-a, we will study how the choice of semiring affects the precision of our approximation of certain answers. Furthermore, we will investigate novel applications enabled by incomplete databases beyond set semantics. For instance, the aforementioned incomplete databases with access control annotations would enable a rigorous treatment of access control for applications that analyze data that is the inherently uncertain result of information extraction, data cleaning, or data wrangling.
@@ -61,13 +61,13 @@ Our incomplete $\semK$-relations and \abbrUADBs generalize incomplete data and c
We will extend the \abbrUADB{} model with attribute-level annotations that bound an attribute's values across all possible worlds, extend the definition of certain answers accordingly, study how these annotations propagate through queries, and investigate what certainty guarantees are provided.
\end{sectionsummary}
Like many existing models for incomplete and probabilistic databases, our preliminary work with \abbrUADBs tracks uncertainty at the row-level. % through under- and over-approximations of certain annotations.
Like many existing models for incomplete and probabilistic databases, our preliminary work on \abbrUADBs tracks uncertainty at the row-level. % through under- and over-approximations of certain annotations.
However, as repeated efforts have shown~\cite{DBLP:conf/icde/KennedyK10,Jampani:2008:MMC:1376616.1376686,widom2004trio,Singh:2008:ONS:1376616.1376744,DBLP:journals/corr/NandiYKGFLG16,Antova:2009:WBE:1644245.1644253}, tracking uncertainty at the attribute-level can lead to more concise and precise representations of uncertainty.
\begin{example}
Consider an employee table with two attributes: name and salary.
Assume we know with certainty that Peter is an employee, but only know that his salary is either \$25,000, \$30,000, or \$40,000.
This information modeled as an incomplete database with 3 possible worlds: $D_1 = \{ (Peter, 25000)\}$, $D_2 = \{ (Peter, 30000) \}$, and $D_3 = \{(Peter, 40000)\}$.
This information can be modeled as an incomplete database with 3 possible worlds: $D_1 = \{ (Peter, 25000)\}$, $D_2 = \{ (Peter, 30000) \}$, and $D_3 = \{(Peter, 40000)\}$.
Let us assume that we choose $D_1 = \{(Peter, 25000)\}$ as the \termBGW{}.
Even though part of the information in the single tuple of $D_1$ is certain (the name),
it would be labeled as uncertain in a \abbrUADB{} because the tuple itself is not certain.
@@ -87,7 +87,7 @@ Because this tuple matches exactly one tuple in each possible world that has a n
Of course, the same would be true if we chose a strictly wider bound for the salary (e.g., $[1000,70000]$).
Intuitively, the first bound is more precise and thus preferable.
We plan to formalize this concept into a measure for how precisely an \abbrUADB with attribute-level annotations represents an incomplete database. For instance, this measure may take the form of a partial order that models whether an tuple with attribute-level annotations ``dominates'' another tuple with attribute-level annotations.
We plan to formalize this concept into a measure for how precisely an \abbrUADB with attribute-level annotations represents an incomplete database. For instance, this measure may take the form of a partial order that models whether a tuple with attribute-level annotations ``dominates'' another tuple with attribute-level annotations.
Furthermore, we will investigate algorithms for computing queries over attribute-level \abbrUADBs and for translating incomplete data models into our model, and study the computational complexity of these problems.
Finally, we will investigate alternatives to bounds for data types that do not have a meaningful ordering (e.g., name in the example):
For example, as with rows, we might label attributes as certain (i.e., taking exactly one value across all possible worlds where the tuple appears) or uncertain (taking any value).\footnote{This would be closely related to models which allow variables as values such as Codd-tables or V-tables.}
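As an illustrative sketch of a dominance order for attributes that do have a meaningful ordering (the interval notation below is an assumption made for this illustration, not a definition we commit to), consider two annotated tuples $t$ and $t'$ over attributes $A_1, \ldots, A_n$ that represent the same incomplete database, where $t$ bounds attribute $A_i$ by $[l_i, u_i]$ and $t'$ bounds it by $[l'_i, u'_i]$. We could say that $t$ dominates $t'$ if
\[
  \forall i \in \{1, \ldots, n\}: \; l'_i \leq l_i \;\wedge\; u_i \leq u'_i,
\]
i.e., if every interval of $t$ is contained in the corresponding interval of $t'$. For the salaries in the example, the bound $[25000,40000]$ would dominate $[1000,70000]$. Interval containment is reflexive, antisymmetric, and transitive, so this relation is a partial order; tuples whose intervals overlap without containment are incomparable.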