Mimir analysis

2015-12-10 10:21:33 -05:00 · f4c3a9f04e
parent ba9d181ab3
commit f4c3a9f04e
1 changed file with 33 additions and 2 deletions
--- a/src/research/mimir/index.md
+++ b/src/research/mimir/index.md
@@ -14,13 +14,44 @@ Many analytics tasks are based on information that is initially incomplete, inco
 Mimir takes a step back and accepts that uncertainty is a fact of life.  Rather than trying to fight it, Mimir embraces uncertainty, and helps users to understand it better.  Combining automated data cleaning and data analysis techniques, Mimir's goal is to help users clean and query uncertain data, and to understand the impact of that uncertainty on the results of their analyses.

 ------
-## Active Research 
+## Active Research Efforts

 ### On-Demand Data Certainty
-Asserting the correctness, consistency, and completeness of a dataset is an extremely expensive, time-consuming process. Worse still, the effort expended on this task may be disproportionately high when compared to the fragment of the dataset that is actually queried by end-users. This initiative looks for strategies that make it easier for data-cleaning and data-gathering tasks to be performed on-demand -- as the data is queried. Our current focus is on a form of targetted crowdsourcing, where users querying uncertain data are presented with a prioritized list of data-cleaning/-collection tasks that will increase confidence in the result set.
+Curating data, or making sure that it is correct, consistent, and complete,
+can be very slow and expensive.  Most of this effort is often wasted, since
+only a small portion of the curated data will ever be relevant to analysts 
+using it.  Unfortunately, without basing an analysis on trustworthy, curated 
+data, it's currently foolish to trust the analysis' results.  Our 
+on-demand certainty effort links query results to potential sources of
+uncertainty that could affect them using a provenance model called Virtual
+C-Tables.  Seeing the impact of uncertainty can help analysts to evaluate
+the quality and trustworthiness of those results.

+### Transparent Probabilistic Databases
+Mimir is built around a probabilistic database system.  Classical
+deterministic databases assume that all of your data is fixed: Every
+cell has exactly one value, and every table has a fixed set of rows in it.
+Probabilistic databases instead track multiple possibilities: for example,
+the output of OCR software that reads a glyph as either a 4 or a 9.  
+That could be useful, but no one really wants to move their data to an 
+entirely new database system.  We're exploring ways to enable probabilistic
+database functionality within existing deterministic database engines, 
+allowing legacy database applications to transparently co-exist with 
+probability-aware applications.
+
+### Sensitivity Analysis
+Quantitative metrics like standard deviations and probabilities help to 
+measure how reliable query results are, but don't really provide a good
+sense of why the results aren't reliable or what can be done to fix them.
+Mimir can provide users with a list of explanations of why a particular 
+result is uncertain, and rank that list in order of relevance.  We are
+exploring what contextual cues make an explanation relevant, and ways of
+efficiently ranking explanations in bulk.
+
+{{!
 ### Consistent Query Semantics
 Minor differences in data semantics can easily combine to produce subtle errors in the correctness of a query. For example, when a table listing historical orders is joined with a table of current currency conversions, the result may be inaccurate (depending on what the user's intent is): The exchange rate listed will be valid as of today, and not when the order was placed. Unfortunately, detecting these errors is difficult, as it is not generally possible to gauge user intent, or to ask users to provide such fine-grained semantic information about data. Using a combination of natural language processing, and usage modeling, we instead seek to answer a simpler, though closely related question: "Will the answer to my query be the same if I ask it tomorrow?"
+}}

 ------
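
Reviewer note: the "Transparent Probabilistic Databases" text added in this commit contrasts cells that hold exactly one value with cells that track several possibilities. The sketch below illustrates that idea only; it is not Mimir's implementation. The class name `UncertainCell`, its methods, and the 70/30 split between the two OCR readings are all assumptions made up for this example.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class UncertainCell:
    """Hypothetical cell that holds several candidate values.

    `alternatives` is a tuple of (value, probability) pairs. A classical
    deterministic cell is the special case with one pair of probability 1.
    """
    alternatives: tuple

    def __post_init__(self):
        # Reject inputs whose probabilities do not form a distribution.
        if sum(p for _, p in self.alternatives) != 1:
            raise ValueError("probabilities must sum to 1")

    def probability(self, predicate):
        """Total probability of the alternatives that satisfy `predicate`."""
        return sum(p for value, p in self.alternatives if predicate(value))

    def most_likely(self):
        """The single value a deterministic application would see."""
        return max(self.alternatives, key=lambda pair: pair[1])[0]


# OCR read a glyph as either a 4 or a 9 (the probabilities are assumed).
digit = UncertainCell(((4, Fraction(7, 10)), (9, Fraction(3, 10))))

# An ordinary deterministic cell expressed in the same representation.
certain = UncertainCell(((4, Fraction(1)),))
```

A probability-aware query can ask `digit.probability(lambda v: v > 5)`, while a legacy application that expects one value per cell can read `digit.most_likely()`. This loosely mirrors the co-existence goal described in the added section.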