{\rtf1\ansi\ansicpg1252\cocoartf1347\cocoasubrtf570
{\fonttbl\f0\fswiss\fcharset0 ArialMT;\f1\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;}
{\*\listtable{\list\listtemplateid1\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid1\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid1}
{\list\listtemplateid2\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid101\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid2}
{\list\listtemplateid3\listhybrid{\listlevel\levelnfc23\levelnfcn23\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace360\levelindent0{\*\levelmarker \{disc\}}{\leveltext\leveltemplateid201\'01\uc0\u8226 ;}{\levelnumbers;}\fi-360\li720\lin720 }{\listname ;}\listid3}}
{\*\listoverridetable{\listoverride\listid1\listoverridecount0\ls1}{\listoverride\listid2\listoverridecount0\ls2}{\listoverride\listid3\listoverridecount0\ls3}}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\deftab720
\pard\pardeftab720\pardirnatural
\f0\b\fs24 \cf0 Executive Summary:\
\pard\pardeftab720\pardirnatural
\f1\b0 \cf0 Applications in domains such as healthcare, data as a service, and business analytics regularly face noisy or incomplete data. Querying such data in a traditional database is risky, as it may produce erroneous results. The field of uncertain databases has arisen to address this issue. Queries over an uncertain database produce sets of "possible results" instead of unique values. For example, each tuple in a query result might be labeled with the probability that it will appear in the result set. Although such information provides useful feedback about result quality, it gives end-users no guidance about how to improve that quality. We propose to extend work in uncertain databases with a 'what's missing?' query, through which users can quickly identify sources of uncertainty in their query results and interactively refine those results.\
\pard\pardeftab720\pardirnatural
\f0\b \cf0 Project Background and Description:\
\pard\pardeftab720\pardirnatural
\i\b0\fs18 \cf0 Please include a description of the participation by other Oracle Investigators\
\pard\pardeftab720\pardirnatural\qj
\f1\i0\fs24 \cf0 In application domains like healthcare, data as a service, and data cleaning, the data being queried may be noisy, incomplete, or the result of imprecise computations (e.g., classifiers or models). When such a database is queried, it may be necessary to gather information from outside of the database before precise results can be provided to end-users. Such information-gathering tasks are often non-trivial: end-users are unlikely to have sufficient information to decide what data is relevant, while the database might not be able to actively gather the data itself. \
\
For example, in a hospital setting, precisely diagnosing a patient might require multiple tests, based on an assessment of the patient's symptoms. Although a medical database could easily be queried for diagnoses that fit a patient's profile, determining the specific set of tests that could differentiate between diagnoses is a more daunting challenge. As another example, in the data-as-a-service model or on a dataset that has not been cleaned, a gradual, pay-as-you-go approach (where data is purchased or cleaned as it becomes relevant to do so) is often preferable to a bulk purchase/cleaning effort.\
\
Two themes emerge from each of these example applications: (1) The database application issues queries to be (imprecisely) answered over incomplete, noisy, or otherwise uncertain data, and (2) The database application needs to know what outside information can be provided to obtain more precise query results. \
\
We generalize these themes with two new abstractions: a new construct called an interpretive view (or iview), and a new class of query called a \'91what\'92s missing?\'92 query. Abstractly, an iview identifies sources of uncertainty in the data, and a \'91what\'92s missing?\'92 query relates sources of uncertainty in the query results to the iview that introduced them.\
\
Concretely, an iview is built around a materialized view, and annotates the view's contents as being incomplete, not yet available, or open to judgment. For example, in a medical database, the results of a blood test might not be immediately available. Even so, different doctors might be able to make (different) educated guesses about what those results might turn out to be. An iview: (1) extends the materialized view relation with uncertain attributes, where the value of each such attribute in any given row is characterized by a (placeholder) variable; (2) annotates uncertain attributes with constraint statements that range-restrict each variable's value; and (3) annotates each range restriction with a relative confidence level in that restriction. For example, an iview might define a variable corresponding to a patient's blood sugar level, and range-restrict it to a precise value if a measurement was taken in the last ten minutes, with confidence decaying from that point on.\
\
When an iview is queried, the query processor returns all possible answers, with annotations indicating an (approximate) confidence level in each row of the result. If a stronger confidence is required, the user may enter an interactive 'what's missing?' mode to refine the query results. In this mode, the database queries the user for information potentially relevant to the query (e.g., asking the doctor to run a blood sugar test). As information is returned to the database, it refines the query results, presenting the user with progressively more accurate results. \
\
We note that information outside of the database is unlikely to be free (e.g., blood tests cost money). Consequently, the order in which information is requested is especially important, and is a major challenge to be overcome in the proposed work.\
\
The majority of the system-building and evaluation work will be performed by the SUNY Buffalo investigators. Oracle investigators will provide expertise regarding the Oracle database, relevant APIs, evaluation strategies, and domain-specific knowledge that will help to ensure compatibility with existing Oracle collaborations.\
\
\pard\pardeftab720\pardirnatural\qj
\f0\b \cf0 Objectives:\
\pard\pardeftab720\li709\pardirnatural
\i \cf0 Technical Objectives\
\pard\pardeftab720\li1440\pardirnatural\qj
\f1\i0\b0 \cf0 We will use a combination of query rewriting, database introspection, and user-defined types and functions to provide iview functionality within Oracle. Although it will be necessary to initially build portions of the system\'92s functionality into a middleware layer, SUNY Buffalo and Oracle investigators will work closely to ensure the possibility of transitioning these features into Oracle itself.\
\
Our concrete 1-year goals are as follows:\
\pard\tx220\tx720\pardeftab720\li720\fi-720\pardirnatural\qj
\ls1\ilvl0\cf0 {\listtext \'95 }To develop a formal theory for iviews, including a syntactic mapping from iviews and queries over iviews to equivalent view definitions and queries posed over a traditional database.\
{\listtext \'95 }To implement a middleware layer capable of interpreting SQL statements that manipulate and query iviews. This will, by necessity, include support within Oracle for processing and answering uncertain queries.\
{\listtext \'95 }To implement, in the middleware and Oracle layers as needed, support for what\'92s missing queries. This includes one or more ranking heuristics that produce information-gathering requests in priority order. \
\pard\pardeftab720\li1440\pardirnatural\qj
\cf0 \
With the resulting system, a user will be able to create iviews with restrictions (see the proposed semantics in the appendix), pose queries over the iview, obtain (uncertain) results, and issue both interactive and non-interactive what\'92s missing queries. For a detailed example, see the appendix.\
\
We plan to subject our implementation to a thorough evaluation, including an analysis of:\
\pard\tx220\tx720\pardeftab720\li720\fi-720\pardirnatural\qj
\ls2\ilvl0\cf0 {\listtext \'95 }Different strategies for encoding iviews in a traditional database system.\
{\listtext \'95 }Different heuristics for directing information gathering when resolving a 'what's missing?' query.\
{\listtext \'95 }Different workloads with varying degrees of uncertainty.\
{\listtext \'95 }Any additional functionality developed to streamline the interactive uncertainty refinement process.\
\pard\pardeftab720\li1440\pardirnatural\qj
\cf0 \
The evaluation will include a variety of performance, scalability, and resource consumption tests. Concretely, we are interested in:\
\pard\tx220\tx720\pardeftab720\li720\fi-720\pardirnatural\qj
\ls3\ilvl0\cf0 {\listtext \'95 }Expressiveness/Developer Effort: How easy is it to integrate what\'92s missing functionality into an existing database application? How much effort would it take to incorporate similar capabilities manually?\
{\listtext \'95 }Overhead Measurements: Does support for uncertain queries and what\'92s missing queries impose a performance penalty? If so, how does it compare to other (i.e., manual) approaches to performing similar queries, in terms of LoC or time?\
{\listtext \'95 }Information-Gathering Efficiency: How effective is each heuristic at prioritizing information-gathering requests for a what\'92s missing query?\
}