Mimir analysis

pull/1/head
Oliver Kennedy 2015-12-10 10:21:33 -05:00
parent ba9d181ab3
commit f4c3a9f04e
1 changed files with 33 additions and 2 deletions


@@ -14,13 +14,44 @@ Many analytics tasks are based on information that is initially incomplete, inco
Mimir takes a step back and accepts that uncertainty is a fact of life.  Rather than trying to fight it, Mimir embraces uncertainty, and helps users to understand it better.  Combining automated data cleaning and data analysis techniques, Mimir's goal is to help users clean and query uncertain data, and to understand the impact of that uncertainty on the results of their analyses.
------
## Active Research
## Active Research Efforts
### On-Demand Data Certainty
Asserting the correctness, consistency, and completeness of a dataset is an extremely expensive, time-consuming process. Worse still, the effort expended on this task may be disproportionately high compared to the fragment of the dataset that end-users actually query. This initiative looks for strategies that make it easier to perform data-cleaning and data-gathering tasks on demand, as the data is queried. Our current focus is on a form of targeted crowdsourcing, where users querying uncertain data are presented with a prioritized list of data-cleaning and data-collection tasks that will increase confidence in the result set.
Curating data, or making sure that it is correct, consistent, and complete,
can be very slow and expensive. Much of this effort is often wasted, since
only a small portion of the curated data will ever be relevant to the
analysts using it. Unfortunately, unless an analysis is based on trustworthy,
curated data, it is currently foolish to trust its results. Our
on-demand certainty effort uses a provenance model called Virtual C-Tables
to link query results to the potential sources of uncertainty that could
affect them. Seeing the impact of uncertainty can help analysts to evaluate
the quality and trustworthiness of those results.
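
As a rough illustration of linking results to sources of uncertainty, the sketch below tags each value with the names of the uncertain guesses it depends on and carries those tags through a simple aggregate. It is a simplified stand-in, not the Virtual C-Tables model itself; the `Cell` type, the `add` helper, and the `imputed_price(row=2)` label are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cell:
    """A value plus the names of the uncertain guesses it depends on (hypothetical type)."""
    value: float
    sources: frozenset = field(default_factory=frozenset)


def add(a: Cell, b: Cell) -> Cell:
    # Combining two cells unions their sources, so provenance survives the query.
    return Cell(a.value + b.value, a.sources | b.sources)


# Hypothetical order totals; the second value was imputed because a price was missing.
orders = [
    Cell(10.0),
    Cell(12.5, frozenset({"imputed_price(row=2)"})),
    Cell(7.0),
]

total = Cell(0.0)
for cell in orders:
    total = add(total, cell)

# total.sources now names every guess that could change the answer.
print(total.value, sorted(total.sources))
```

An analyst looking at `total` can see that only one guess affects it, which is the kind of targeted information that makes curating on demand cheaper than curating everything up front.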
### Transparent Probabilistic Databases
Mimir is built around a probabilistic database system. Classical
deterministic databases assume that all of your data is fixed: Every
cell has exactly one value, and every table has a fixed set of rows in it.
Probabilistic databases instead track multiple possibilities: for example,
OCR software might read a glyph as either a 4 or a 9. Tracking those
possibilities can be useful, but no one really wants to move their data to an
entirely new database system. We're exploring ways to enable probabilistic
database functionality within existing deterministic database engines,
allowing legacy database applications to transparently co-exist with
probability-aware applications.
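
One way to picture this co-existence is to store alternatives and their probabilities as ordinary rows in a deterministic engine. The sketch below uses SQLite through Python's standard `sqlite3` module; the `readings` table, its columns, and the probabilities are assumptions made up for the example, and this is not a description of how Mimir encodes uncertainty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per possible reading of each glyph.
conn.execute("CREATE TABLE readings (item TEXT, digit INTEGER, prob REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("glyph_1", 4, 0.75), ("glyph_1", 9, 0.25), ("glyph_2", 7, 1.0)],
)

# A legacy application sees only the most likely alternative for each item.
best_guess = dict(conn.execute(
    """
    SELECT r.item, r.digit FROM readings r
    WHERE r.prob = (SELECT MAX(prob) FROM readings WHERE item = r.item)
    ORDER BY r.item
    """
).fetchall())

# A probability-aware application can ask for an expectation instead.
expected = dict(conn.execute(
    "SELECT item, SUM(digit * prob) FROM readings GROUP BY item ORDER BY item"
).fetchall())

print(best_guess)
print(expected)
```

Both queries run against the same plain table, so the existing engine needs no changes; the cost is that every alternative has to be written out explicitly, which would not scale to large or continuous sets of possibilities.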
### Sensitivity Analysis
Quantitative metrics like standard deviations and probabilities help to
measure how reliable query results are, but don't really provide a good
sense of why the results aren't reliable or what can be done to fix them.
Mimir can provide users with a list of explanations of why a particular
result is uncertain, and rank that list in order of relevance. We are
exploring what contextual cues make an explanation relevant, and ways of
efficiently ranking explanations in bulk.
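
To make the idea of ranking explanations concrete, the sketch below scores each uncertain input by how far it can move a query result when varied on its own, then sorts by that score. One-at-a-time perturbation is an assumption chosen for simplicity, not the relevance measure Mimir uses; `rank_explanations`, `total_cost`, and the input names are hypothetical.

```python
def rank_explanations(query, best_guess, alternatives):
    """Rank uncertain inputs by how far each one can move the query result."""
    baseline = query(best_guess)
    scored = []
    for name, options in alternatives.items():
        # Vary one input at a time, holding the others at their best guess.
        results = [query({**best_guess, name: value}) for value in options]
        impact = max(abs(result - baseline) for result in results)
        scored.append((impact, name))
    # Largest impact first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, impact) for impact, name in scored]


# Hypothetical uncertain inputs and the alternative values each might take.
best_guess = {"price_row_2": 12.5, "qty_row_5": 3}
alternatives = {"price_row_2": [12.0, 12.5, 13.0], "qty_row_5": [3, 8]}


def total_cost(inputs):
    # Stand-in for a query whose answer depends on both uncertain inputs.
    return inputs["price_row_2"] * 2 + inputs["qty_row_5"] * 4.0


ranking = rank_explanations(total_cost, best_guess, alternatives)
print(ranking)
```

This approach re-runs the query once per alternative, so it is only practical for small examples; ranking explanations efficiently in bulk is exactly the open problem the paragraph above describes.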
{{!
### Consistent Query Semantics
Minor differences in data semantics can easily combine to produce subtle errors in the correctness of a query. For example, when a table listing historical orders is joined with a table of current currency conversions, the result may be inaccurate (depending on the user's intent): the exchange rate listed will be the rate as of today, not the rate when the order was placed. Unfortunately, detecting these errors is difficult, as it is not generally possible to gauge user intent, or to ask users to provide such fine-grained semantic information about data. Using a combination of natural language processing and usage modeling, we instead seek to answer a simpler, though closely related question: "Will the answer to my query be the same if I ask it tomorrow?"
}}
------