From 2481ab8e574b7dcc6793ac3f1dcecc811b053095 Mon Sep 17 00:00:00 2001
From: Oliver Kennedy
Date: Wed, 24 Aug 2016 14:20:50 -0400
Subject: [PATCH] Mimir whitepaper

---
 Rakefile                         |   9 +-
 lib/bootstrap_markdown.rb        |   7 ++
 lib/gemsmith.rb                  |   4 +-
 src/research/mimir/whitepaper.md | 203 +++++++++++++++++++++++++++++++
 4 files changed, 220 insertions(+), 3 deletions(-)
 create mode 100644 lib/bootstrap_markdown.rb
 create mode 100644 src/research/mimir/whitepaper.md

diff --git a/Rakefile b/Rakefile
index f71bdee1..6b18f329 100644
--- a/Rakefile
+++ b/Rakefile
@@ -6,6 +6,7 @@ require "lab_metadata.rb"
 require "util.rb"
 require "cv.rb"
 require "nsfcp.rb"
+require "bootstrap_markdown.rb"
 include GemSmith
 
 $db = JDB.new("db")
@@ -24,7 +25,13 @@ site :odin_lab, out: "build" do
 
   ## Render specialized formats
   for_files(/\.md$/) do
-    render_markdown
+    render_markdown(BootstrapMarkdown.new,
+      tables: true,
+      disable_indented_code_blocks: true,
+      fenced_code_blocks: true,
+      autolink: true,
+      no_intra_emphasis: true
+    )
   end
   for_files(/\.erb$/) do
     render_erb
diff --git a/lib/bootstrap_markdown.rb b/lib/bootstrap_markdown.rb
new file mode 100644
index 00000000..7f2652c2
--- /dev/null
+++ b/lib/bootstrap_markdown.rb
@@ -0,0 +1,7 @@
+
+
+class BootstrapMarkdown < Redcarpet::Render::HTML
+  def table(header, body)
+    "\n#{header}\n#{body}
"
+  end
+end
\ No newline at end of file
diff --git a/lib/gemsmith.rb b/lib/gemsmith.rb
index a814b937..704cb24f 100644
--- a/lib/gemsmith.rb
+++ b/lib/gemsmith.rb
@@ -167,8 +167,8 @@ module GemSmith
     }
   end
 
-  def render_markdown(renderer = Redcarpet::Render::HTML.new())
-    markdown = Redcarpet::Markdown.new(renderer)
+  def render_markdown(renderer = Redcarpet::Render::HTML.new(), options = {})
+    markdown = Redcarpet::Markdown.new(renderer, options)
     apply { |f|
       f[:out_path] = File.join(File.dirname(f[:out_path]), "#{File.basename(f[:out_path], ".*")}.html")
       f[:stream].transform_all { |body|
diff --git a/src/research/mimir/whitepaper.md b/src/research/mimir/whitepaper.md
new file mode 100644
index 00000000..755a6f2f
--- /dev/null
+++ b/src/research/mimir/whitepaper.md
@@ -0,0 +1,203 @@
+---
+title: Mimir Whitepaper
+---
+
+# Mimir
+
+Historically, database management systems have assumed that source data is reliable. However, obtaining reliable data is becoming increasingly difficult. Big data management systems operate at scales too large for the data to be validated reliably by hand. IoT data management systems operate in highly resource-constrained environments where obtaining reliable data in real time may not be feasible.
+
+As the adage goes: garbage in, garbage out. Classical data management systems allow unreliable data to be loaded and queried as if it were correct. From the perspective of a user or application posing queries over a dataset, there is no visible difference between reliable and unreliable query results.
+
+Mimir changes all of that.
+
+As a probabilistic database engine, Mimir makes uncertainty a first-class primitive through annotations on potentially invalid data. These annotations and their effect on the data are propagated through queries, helping users to understand not only why their results are unreliable, but also how much trust they can put into those results.
+
+## Using Mimir
+
+By default, Mimir behaves as an ordinary relational database. In fact, it's possible to use Mimir as an ordinary database client without using any of its uncertainty management features (Mimir actually uses existing backend databases that do the bulk of the query processing and data management). Mimir's main capabilities are accessed through two new primitives called Lenses and EXPLAIN operations, both described below.
+
+### Mimir's User Interface
+
+1. The query field accepts SQL queries. Mimir accepts an extended form of SQL92, although support for aggregation is still in progress.
+2. Results are shown below. If you are querying unreliable data, uncertain fields will be highlighted in red, and uncertain rows will be highlighted in grey and have a red bar to the right.
+3. A schematic view of the query lineage is shown alongside the results.
+4. Mimir can connect to databases through JDBC. The Database menu allows connections to other databases, and quick at-a-glance access to tables and views available in the current database.
+5. Mimir features a quick and easy CSV data loader.
+6. Lens construction wizards (see 'Lenses' below).
+7. The notification area shows contextual hints about the query.
+
+![Mimir Overview](screenshots/Mimir-Overview.png)
+
+
+### Lenses
+
+```
+lens := CREATE LENS _name_
+          AS _query_
+          USING _model_
+
+_model_ :=
+  | MISSING_VALUE(_col_ [, _col_ [, ...]])
+  | SCHEMA_MATCHING(_col_ _type_ [, _col_ _type_ [, ...]])
+  | TYPE_INFERENCE()
+  | ARCHIVAL(_period_)
+  | DOMAIN_CONSTRAINT(_col_ _constraint_ [, _col_ _constraint_ [, ...]])
+  | XML_EXTRACT(_xmlschema_)
+  | MARKOV(.?.)
+```
+
+Lenses are a family of data processing components that model knowledge. A lens can be queried as a normal relational table or view; the contents of the lens depend on the specific type of lens `_model_` being applied. Mimir presently includes lenses for three common ETL tasks, and we are rapidly developing new lenses.
Each lens takes a SQL `_query_` as input and applies a specific data transformation to the result. Examples of possible lenses include:
+
+* **Schema Matching**: The schema matching lens automatically remaps the columns of a relation to a new schema. Schema mappings are inferred based on the edit distances between the attributes of the source and target schemas.
+
+* **Missing Value**: The missing value lens replaces all `NULL` values in a column. Replacement is done using a tree-based classifier trained on the remaining rows of the input table.
+
+* **Type Inference**: The type inference lens automatically assigns types to each 'string' column in the input data. Types are guessed based on a majority vote over the best fit for all values in the column.
+
+* **Archival**: The archival lens runs the source query periodically (once per `_period_`) and caches the result. The lens uses a combination of periodic sampling and bi-temporal query support in the underlying database to model the volatility of the source data and guess at the correct *current* value without needing to refresh the cache. (The Archival lens is not yet complete.)
+
+* **Domain Constraint**: A more powerful version of the missing value lens. The domain constraint lens enforces type constraints on the data columns. Data not conforming to the constraint is replaced by a best-guess estimate based on the original value and the remaining attributes of the column. (The Domain Constraint lens is not yet complete.)
+
+* **XML Extraction**: A variant of the schema matching lens that uses example schemas to extract relational data from heterogeneous XML inputs. (The XML Extraction lens is not yet complete.)
+
+* **Markov Process**: Assuming the source data models transition events, the Markov process lens produces a view of the data that represents the state of the Markov process at any given time.
+
+The Mimir GUI further streamlines the process of lens creation through several lens creation 'wizards' (marker 6 in the diagram above).
For most lenses, the only additional information required is a name for the lens.
+
+### The EXPLAIN Operation
+
+Query results in Mimir are highlighted in red if the result is uncertain. That is, a result is highlighted if it is affected in any way by data being manipulated by a lens. For additional information, you may click on any uncertain row or cell to EXPLAIN it.
+
+![Mimir's EXPLAIN feature](screenshots/MimirExplain.png)
+
+The EXPLAIN popup contains two areas: quantitative statistics and qualitative explanations.
+
+At the top is a set of quantitative statistics about the value's uncertainty. The precise set of statistics presented varies based on the value's type. Generally, they include at least one numerical measure of Mimir's confidence in the quality of the value, such as variance or entropy, and one measure of the level of variation in the result values, such as sample values or a confidence interval.
+
+Below are qualitative explanations of why the value is uncertain. (What follows has not been implemented yet.) Explanations are ranked in terms of their contribution to the uncertainty of the value. Each explanation is associated with both an 'Accept' and a 'Fix' button. Clicking 'Accept' acknowledges the explanation. If no unacknowledged sources of uncertainty affect the result, its highlight color changes to green. Clicking 'Fix' brings up a lens-specific dialog box that allows you to override how the lens chose to repair the value in question.
+
+## Example Applications
+
+
+
+### Internet-of-Things
+
+Consider the next-generation smart house.
+
+
+
+### On-Demand ETL
+
+Meet Alice. Alice is an analyst at HappyBuy, your friendly local electronics retailer. Alice is in charge of developing a big promotional offer. As an analyst, she starts by sifting through the data HappyBuy has available.
She first wants to explore what products HappyBuy sells, so she finds and looks over the relevant table:
+
+```
+SELECT * FROM Product;
+```
+
+id     | name               | brand   | category
+------ | ------------------ | ------- | --------
+P123   | Apple 6s, White    | `NULL`  | phone
+P124   | Apple 5s, Black    | `NULL`  | phone
+P125   | Samsung Note2      | Samsung | phone
+P2345  | Sony to inches     | `NULL`  | `NULL`
+P34234 | Dell, Intel 4 core | Dell    | laptop
+P34235 | HP, AMD 2 core     | HP      | laptop
+
+Ugh! So much messy, missing data. Fortunately, Alice is using Mimir. She wraps the product data in a **Missing Value** lens.
+
+```
+CREATE LENS SaneProduct AS SELECT * FROM Product
+  USING MISSING_VALUE('category', 'brand');
+
+SELECT * FROM SaneProduct;
+```
+
+id     | name               | brand             | category
+------ | ------------------ | ----------------- | ---------
+P123   | Apple 6s, White    | Apple_*_          | phone
+P124   | Apple 5s, Black    | Black & Decker_*_ | phone
+P125   | Samsung Note2      | Samsung           | phone
+P2345  | Sony to inches     | Sony_*_           | laptop_*_
+P34234 | Dell, Intel 4 core | Dell              | laptop
+P34235 | HP, AMD 2 core     | HP                | laptop
+
+It's not perfect, but it's a start. Next, she turns to some product rating information that HappyBuy has collected over the years.
+
+```
+SELECT * FROM ratings1;
+```
+
+pid   | rating | review_ct
+----- | ------ | ---------
+P123  | 4.5    | 50
+P2345 | `NULL` | 245
+P124  | 4      | 100
+
+```
+SELECT * FROM ratings2;
+```
+pid    | evaluation | num_ratings
+------ | ---------- | -----------
+P125   | 3          | 50
+P34234 | 5          | 245
+P34235 | 4.5        | 100
+
+It looks like `ratings1` is missing some data, and `ratings2` has a mismatched schema. Alice creates a **Missing Value** lens for the former, and a **Schema Matching** lens for the latter.
+
+```
+CREATE LENS ratings1clean AS SELECT * FROM ratings1
+  USING MISSING_VALUE('rating');
+CREATE LENS ratings2clean AS SELECT * FROM ratings2
+  USING SCHEMA_MATCHING(pid string, rating float, review_ct int);
+CREATE VIEW allratings AS
+  SELECT * FROM ratings1clean UNION ALL SELECT * FROM ratings2clean;
+SELECT * FROM allratings;
+```
+pid       | rating | review_ct
+--------- | ------ | ---------
+P123      | 4.5    | 50
+P124      | 4      | 100
+P125_*_   | 3_*_   | 50_*_
+P2345     | 4_*_   | 245
+P34234_*_ | 5_*_   | 245_*_
+P34235_*_ | 4.5_*_ | 100_*_
+
+Now she's getting somewhere. Alice combines this new information with what she had before.
+
+```
+SELECT name, rating, category FROM allratings r, SaneProduct p WHERE p.id = r.pid;
+```
+name               | rating | category  | notes
+------------------ | ------ | --------- | -----
+Apple 6s, White    | 4.5    | phone     |
+Apple 5s, Black    | 4      | phone     |
+Samsung Note2      | 3_*_   | phone     | _*_
+Sony to inches     | 4_*_   | laptop_*_ |
+Dell, Intel 4 core | 5_*_   | laptop    | _*_
+HP, AMD 2 core     | 4.5_*_ | laptop    | _*_
+
+Alice decides to explore moderately rated laptops:
+
+```
+SELECT name FROM allratings r, SaneProduct p
+  WHERE p.id = r.pid AND p.category = 'laptop'
+    AND r.rating <= 4.5 AND r.rating >= 4.0;
+```
+name               | notes
+------------------ | -----
+Sony to inches     | _*_
+HP, AMD 2 core     | _*_
+
+
+
+## Technology
+
+![DCR Lens](screenshots/DCRLens.png)
+
+### Generalized C-Tables
+
+
+### Virtual C-Tables
+
+
+### CPI
\ No newline at end of file