diff --git a/slides/talks/2017-5-Tour-Mimir/graphics/Kliponius-Cardboard-box-package.svg b/slides/talks/2017-5-Tour-Mimir/graphics/Kliponius-Cardboard-box-package.svg new file mode 100644 index 00000000..ce48f4d1 --- /dev/null +++ b/slides/talks/2017-5-Tour-Mimir/graphics/Kliponius-Cardboard-box-package.svg @@ -0,0 +1,274 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + image/svg+xml + + + + + Openclipart + + + Cardboard box / package + 2009-05-21T05:05:20 + A cardboard packing box with tape holding it shut. Classic packing box / software package image. Enjoy! + https://openclipart.org/detail/26101/cardboard-box-/-package-by-kliponius + + + Kliponius + + + + + box + cardboard + icon + package + packaging + + + + + + + + + + + diff --git a/slides/talks/2017-5-Tour-Mimir/index.html b/slides/talks/2017-5-Tour-Mimir/index.html index f02c1de2..37d34a3d 100644 --- a/slides/talks/2017-5-Tour-Mimir/index.html +++ b/slides/talks/2017-5-Tour-Mimir/index.html @@ -264,7 +264,7 @@

@@ -893,18 +893,21 @@ -->

-

VGTerm

-

$VGTerm(\ldots)$ constructs new variables
(it's a skolem function)

+

VGTerms

+

A $VGTerm(\ldots)$ references configuration parameters
(aka "variables").

Lenses: An On-Demand Approach to ETL; Yang et al.; VLDB 2015 +
-

$VGTerm()$s behave like normal expressions

+

$VGTerm()$s can be used like normal expressions


                   SELECT A, VGTerm('X', B) AS C FROM R;
 					
@@ -914,16 +917,16 @@ R |AB - |12 - |34 - |54 + |12 + |34 + |54 - - - + + +
AC
1$X_2$
3$X_4$
5$X_4$
1$X_2$
3$X_4$
5$X_4$
 
@@ -932,6 +935,7 @@

+
-

Schema Matching

+

Schema Matching

$$ratings2(pid, num\_ratings, evaluation) \rightarrow (pid, rating)$$
@@ -972,10 +977,13 @@ END AS rating FROM ratings2; +

+ One global configuration variable decides which column gets mapped to "rating". +

-

Missing Value Imputation

+

Missing Value Imputation

$$ratings1(pid, rating, review\_ct) \text{ s.t. } rating \text{ is not NULL}$$
@@ -989,15 +997,157 @@ review_ct FROM ratings1; +

+ A family of variables, indexed by ROWID, represents the imputed values. +

+
+ + + +
+
+

Defining Configurations

+ + + + + Config. + + + + + Model + + + Model + + + Model + + + + + + All assignments for one family. + + + + Description of the family in English. + + + + Other feasible assignments. + + + + + + Config. + + Config. + + + + + (Best) + + + + +

+ Models designate one "best-guess" configuration. +

-

Non-Deterministic Queries

+

Example Models

+ -

... so which branch gets taken?

+
+ +
+

Convenience Operators: Lenses

+ +

Lenses instantiate/train a model and wrap a query

+ + + †Lenses: An On-Demand Approach to ETL; Yang et al.; VLDB 2015
+ *Adaptive Schema Databases; Spoth et al.; CIDR 2017 +
@@ -1006,52 +1156,32 @@

Uncertainty as Provenance

+ +

(aka fun with query compilers)

Data Curation ends up with one canonical good dataset.

-

Mimir picks one "best-guess" configuration

+

Mimir starts with the default "guess" configuration.

+ +

As users explore, they validate or refine guesses for configuration variables as necessary.

+ +
-

But...

+

Are there any relevant configuration variables that I haven't validated yet?

-

We might be wrong!?!

- -

If there's a possibility we might be wrong we need to...

- -
- -
-

Each VG-Term family is associated with a Model object that facilitates introspection. -

-

-
- -
-

Uncertainty as Provenance

- - - +
    +
  1. How much of my query result is affected by unvalidated variables?
  2. +
  3. Which variables affect my query results?
  4. +
  5. How bad is the situation?
  6. +
@@ -1059,9 +1189,18 @@
-

Which cells/rows depend on tagged inputs?

+

How much of my query result is affected by unvalidated variables?

-

Idea: Extend query schemas with taint annotations.

+

Idea: Mark values in query results that depend on unvalidated variables.

+
+ +
+ + Communicating Data Quality in On-Demand Curation; Kumari et al.; QDB 2016 +
+ +
+
@@ -1077,8 +1216,21 @@ TRUE AS B_TAINTED FROM R; -

Add *_TAINTED fields to each row.

+

The Mimir compiler adds *_TAINTED fields to each row.

+ +
+

Non-Determinism Taint

+ +
+
A row is untainted if...
+
... we can guarantee that it (or a counterpart) appears in the result regardless of configuration.
+
A cell is untainted if...
+
... we can guarantee that its value in the result is independent of the configuration.
+
+
+ +

Non-Determinism Taint


@@ -1096,7 +1248,7 @@
         (B IS NULL) AS B_TAINTED
   FROM R;
 					
-

Sometimes outputs are independent of VGTerms.

+

Expressions with VGTerms can be conditionally tainted.

@@ -1109,6 +1261,7 @@ CREATE VIEW R_CLEANED AS
+

Non-Determinism Taint

@@ -1132,29 +1286,32 @@ CREATE VIEW R_CLEANED AS

  SELECT A, SUM(B) AS B, 
         FALSE AS A_TAINTED,
-        GROUP_OR(A_TAINTED OR B_TAINTED OR ROW_TAINTED) AS B_TAINTED
+        GROUP_OR(B_TAINTED OR ROW_TAINTED) 
+          OR (SELECT GROUP_OR(A_TAINTED) FROM R_CLEANED) AS B_TAINTED,
         GROUP_AND(A_TAINTED OR ROW_TAINTED) AS ROW_TAINTED
  FROM R_CLEANED;
 					
-

Aggregates: Group-taint affects rows, not group-by attrs.

+

Aggregates work too!

+
+ +
+

Taint Benefits

+

Taint Limitations

In spite of this, taint works well in practice.

-
- -
- -
- -
- + *Ongoing work w/ Su Feng, Aaron Huber, Boris Glavic
@@ -1162,24 +1319,9 @@ CREATE VIEW R_CLEANED AS
-

One more thing...

-

- SELECT A, VGTerm('X', ROWID) AS B FROM R;
-					
-↓    ↓    ↓    ↓ -

- SELECT A, BEST_GUESS('X', ROWID) AS B, 
-        FALSE AS ROW_TAINTED,
-        FALSE AS A_TAINTED,
-        NOT IS_ACKNOWLEDGED('X', ROWID) AS B_TAINTED
-  FROM R;
-					
-

Allow users to turn-off taint for specific tags.

-
- -
-

What are the tags affecting a result?

-

Solution: Static dependency analysis produces a list of tag families and queries to generate all relevant indexes.

+

Which variables affect my query results?

+

Idea: Static dependency analysis produces a list of variable families and queries to generate all relevant indexes.

+ Mimir: Bringing CTables into Practice; Nandi et al.; arXiv
@@ -1188,17 +1330,23 @@ CREATE VIEW R_CLEANED AS
-

How much do results depend on a tag?

-

Solution: Sensitivity analysis (Kanagal & Deshpande; SIGMOD 2011).

-
- -
-

... but sensitivity analysis requires sampling...

+

How bad is the situation?

+

Idea: Sample from the space of alternatives to... +

    +
  • Estimate error, expectations, or other statistical measures.
  • +
  • Highlight other possible query results.
  • +
  • Compute sensitivity (Kanagal & Deshpande; SIGMOD 2011)
  • +
+

+ +
+

Sampling is slooooow

+

Trivial Sampling

@@ -1230,7 +1378,7 @@ CREATE VIEW R_CLEANED AS

Mimir isn't committed to one fixed data representation.

-

(work in progress)

+

(optimization is a work in progress)