diff --git a/slides/talks/2017-5-Tour-Mimir/index.html b/slides/talks/2017-5-Tour-Mimir/index.html
index ce3af00e..ee5acf71 100644
--- a/slides/talks/2017-5-Tour-Mimir/index.html
+++ b/slides/talks/2017-5-Tour-Mimir/index.html
@@ -290,7 +290,8 @@
-

Data Cleaning is Hard!

+

Loading requires curation...

+

Data Curation is Hard!

@@ -299,21 +300,39 @@
 (skilledup.com)
-

Alice spends weeks cleaning her data before using it.

+

Alice spends weeks curating her data before using it.

+
+ +
+

Relational databases make this worse...

+

The data needs... +

+

+

This is all required upfront. Before asking a single question.

-

The database is in the way

-

Why?

+

Relational DBs are useless in early stages of curation.

+

Why?

- In the name of Codd, thou shalt not give the user a wrong answer.
+ In the name of Codd, thou shalt not give the user a wrong answer.

+

There are tons of good heuristics available for guessing how to clean data.

+
+ +
+

+ Thou shalt not give the user a wrong answer.
+

... but what if we did?
@@ -417,7 +436,7 @@
 width="93" height="103" x="0" y="10" />
- + - Probab. - Cert. A. - - + + Probability + Expectation + Variance + Histogram + + -

+

We've gotten good at query processing on uncertain data.
- But not at "sourcing" uncertain data
- ... or communicating results.
+ But not sourcing uncertain data
+ ... or communicating results to humans.

Challenges

A small shift in how we think about PDBs addresses all three points.

@@ -948,18 +966,18 @@
- R | A B
+ R A B
- | 1 2
- | 3 4
- | 5 4
+ 1 2
+ 3 4
+ 5 4
- A C
- 1 $X_2$
- 3 $X_4$
- 5 $X_4$
+ Q(R) A C
+ 1 $X_2$
+ 3 $X_4$
+ 5 $X_4$
 
@@ -1223,7 +1241,7 @@
-

ETL Question 1

+

Provenance Question 1

How much of my query result is affected by unvalidated variables?

Idea: Mark values in query results that depend on unvalidated variables.

@@ -1360,7 +1378,8 @@ CREATE VIEW R_CLEANED AS
-

Which variables affect my query results?

+

Provenance Question 2

+

Which variables affect my query results?

Idea: Static dependency analysis produces a list of variable families and queries to generate all relevant indexes.

Mimir: Bringing CTables into Practice; Nandi et al.; arXiv
@@ -1371,7 +1390,8 @@ CREATE VIEW R_CLEANED AS
-

How bad is the situation?

+

Provenance Question 3

+

How bad is the situation?

Idea: Sample from the space of alternatives to...

  • Estimate error, expectations, or other statistical measures.
@@ -1403,21 +1423,21 @@ CREATE VIEW R_CLEANED AS
- $R_1$ A B
- 1 2
- 3 4
+ $R_1$ A B
+ 1 2
+ 3 4
- $R_2$ A B
- 1 5
+ $R_2$ A B
+ 1 5
- $R_{sparse}$ A B S#
- 1 2 1
- 3 4 1
- 1 5 2
+ $R_{sparse}$ A B S#
+ 1 2 1
+ 3 4 1
+ 1 5 2
    @@ -1428,20 +1448,20 @@ CREATE VIEW R_CLEANED AS
- $R_1$ A B
- 1 2
- 3 4
+ $R_1$ A B
+ 1 2
+ 3 4
- $R_2$ A B
- 1 5
+ $R_2$ A B
+ 1 5
- $R_{bundle}$ A B $\phi$
- 1 [2,5] [T,T]
- 3 4 [T,F]
+ $R_{bundle}$ A B $\phi$
+ 1 [2,5] [T,T]
+ 3 4 [T,F]