@@ -972,10 +977,13 @@
END AS rating
FROM ratings2;
+
+ One global configuration variable decides which column gets mapped to "rating".
+
-
Missing Value Imputation
+
Missing Value Imputation
$$ratings1(pid, rating, review\_ct) \text{ s.t. } rating \text{ is not NULL}$$
@@ -989,15 +997,157 @@
review_ct
FROM ratings1;
+
+ Each imputed value is represented by one variable from a family indexed by ROWID.
+
+
+
+
+
+
+
+
Defining Configurations
+
+
+
+
+ Models designate one "best-guess" configuration.
+
-
Non-Deterministic Queries
+
Example Models
+
-
VG-Terms create non-deterministic branch-points in queries.
-
Non-deterministic branch points are uniquely identified.
+
Imputation using a SparkML classifier
+
Heuristic detection of order-by columns for interpolation
+
Schema matching based on edit-distance
+
MayBMS-style probabilistic repair-key
+
And more...
-
... so which branch gets taken?
+
+
+
+
Convenience Operators: Lenses
+
+
Lenses instantiate/train a model and wrap a query
+
+
Domain Constraint Repair / Missing Value Imputation †
+
Schema Matching †
+
Sequence Repair
+
Key Repair
+
Arbitrary Choice
+
Type Detection *
+
Header Detection *
+
JSON Shredder *
+
+
+ †Lenses: An On-Demand Approach to ETL; Yang et al.; VLDB 2015
+ *Adaptive Schema Databases; Spoth et al.; CIDR 2017
+
@@ -1006,52 +1156,32 @@
Uncertainty as Provenance
+
+
(aka fun with query compilers)
Data Curation ends up with one canonical good dataset.
-
Mimir picks one "best-guess" configuration
+
Mimir starts with the default "guess" configuration.
+
+
As users explore, they validate or refine guesses for configuration variables as necessary.
+
+
-
But...
+
Are there any relevant configuration variables that I haven't validated yet?
-
We might be wrong!?!
-
-
If there's a possibility we might be wrong we need to...
-
-
... communicate the fact to users.
-
... help users understand why.
-
... help users to fix it.
-
-
-
-
-
Each VG-Term family is associated with a Model object that facilitates introspection.
-
-
Selecting Best Guesses
-
Enumerating/Sampling alternatives
-
Generating Human-Readable descriptions of the branch
-
-
-
-
-
-
Uncertainty as Provenance
-
-
-
Replace each VG-Term with a "best-guess" lookup function.
-
Best guess values are "tagged" with nondeterminsm.
-
-
Which result cells/rows depend on tagged inputs?
-
What are the tag dependencies for a specific result?
-
How much do cells/rows depend on specific tags?
-
-
-
+
+
How much of my query result is affected by unvalidated variables?
+
Which variables affect my query results?
+
How bad is the situation?
+
@@ -1059,9 +1189,18 @@
-
Which cells/rows depend on tagged inputs?
+
How much of my query result is affected by unvalidated variables?
-
Idea: Extend query schemas with taint annotations.
+
Idea: Mark values in query results that depend on unvalidated variables.
+
+
+
+
+ Communicating Data Quality in On-Demand Curation; Kumari et al.; QDB 2016
+
+
+
+
@@ -1077,8 +1216,21 @@
TRUE AS B_TAINTED
FROM R;
-
Add *_TAINTED fields to each row.
+
The Mimir compiler adds *_TAINTED fields to each row.
+
+
+
Non-Determinism Taint
+
+
+
A row is untainted if...
+
... we can guarantee that it (or a counterpart) appears in the result regardless of configuration.
+
A cell is untainted if...
+
... we can guarantee that its value in the result is independent of the configuration.
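The two definitions above can be checked by brute force on a toy example. The sketch below is only an illustration, not how Mimir works: the table, the variable names `x1`/`x2`, their candidate values, and the `B > 3` filter are all hypothetical. It enumerates every configuration and tests whether a row (matched by its key `A`) always appears, and whether its cell value is the same everywhere.

```python
from itertools import product

# Hypothetical configuration variables and their candidate values.
CANDIDATES = {"x1": [0, 5], "x2": [4, 7]}

# Toy base table: (A, B, variable that supplies B when B is missing).
R = [(1, 10, None), (2, None, "x1"), (3, None, "x2")]

def query(config):
    """SELECT A, B FROM R_CLEANED WHERE B > 3, under one configuration."""
    out = {}
    for a, b, var in R:
        value = b if b is not None else config[var]
        if value > 3:
            out[a] = value
    return out

def all_configs():
    """Yield every assignment of candidate values to the variables."""
    names = sorted(CANDIDATES)
    for values in product(*(CANDIDATES[n] for n in names)):
        yield dict(zip(names, values))

results = [query(c) for c in all_configs()]

def row_untainted(a):
    # The row appears in the result regardless of configuration.
    return all(a in r for r in results)

def cell_untainted(a):
    # The row always appears and its B value never changes.
    return row_untainted(a) and len({r[a] for r in results}) == 1

taint = {a: (row_untainted(a), cell_untainted(a)) for a in (1, 2, 3)}
print(taint)
```

Row 1 is untainted in both senses, row 2 can drop out of the result, and row 3 always appears but with a configuration-dependent value. Enumeration grows exponentially with the number of variables, so it is only useful as a reference point on tiny inputs.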
+
+
+
+
Non-Determinism Taint
@@ -1096,7 +1248,7 @@
(B IS NULL) AS B_TAINTED
FROM R;
-
Sometimes outputs are independent of VGTerms.
+
Expressions with VGTerms can be conditionally tainted.
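A minimal runnable sketch of conditional taint, assuming SQLite as the backend and a fixed default of 0 as a stand-in for the model's best guess (both are assumptions for illustration; the table contents are made up). The `B_TAINTED` flag is set only on rows where the guess was actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R(A INTEGER, B INTEGER);
    INSERT INTO R VALUES (1, 10), (2, NULL), (3, 30);
""")

# Hypothetical stand-in for a model's best guess.
BEST_GUESS_B = 0

rows = conn.execute("""
    SELECT A,
           COALESCE(B, ?) AS B,
           0              AS A_TAINTED,   -- A never depends on a guess
           (B IS NULL)    AS B_TAINTED,   -- B is a guess only where it was NULL
           0              AS ROW_TAINTED  -- no filter depends on a guess
    FROM R
    ORDER BY A
""", (BEST_GUESS_B,)).fetchall()

for row in rows:
    print(row)
```

Only the row with `A = 2` comes back with `B_TAINTED = 1`; the other two rows carry their original values and stay untainted.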
@@ -1109,6 +1261,7 @@ CREATE VIEW R_CLEANED AS
+
Non-Determinism Taint
@@ -1132,29 +1286,32 @@ CREATE VIEW R_CLEANED AS
SELECT A, SUM(B) AS B,
FALSE AS A_TAINTED,
- GROUP_OR(A_TAINTED OR B_TAINTED OR ROW_TAINTED) AS B_TAINTED
+ GROUP_OR(B_TAINTED OR ROW_TAINTED)
+ OR (SELECT GROUP_OR(A_TAINTED) FROM R_CLEANED) AS B_TAINTED,
GROUP_AND(A_TAINTED OR ROW_TAINTED) AS ROW_TAINTED
FROM R_CLEANED;
-
Aggregates: Group-taint affects rows, not group-by attrs.
+
Aggregates work too!
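The aggregate rewrite can be tried out in SQLite under a few assumptions: `GROUP_OR` and `GROUP_AND` are emulated with `MAX` and `MIN` over 0/1 flags, the query groups by `A`, and the contents of `R_CLEANED` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R_CLEANED(
        A INTEGER, B INTEGER,
        A_TAINTED INTEGER, B_TAINTED INTEGER, ROW_TAINTED INTEGER
    );
    INSERT INTO R_CLEANED VALUES
        (1, 10, 0, 0, 0),
        (1,  5, 0, 1, 0),   -- this B is a guess
        (2,  7, 0, 0, 0);
""")

# MAX/MIN over 0/1 flags stand in for GROUP_OR/GROUP_AND (an assumption).
TAINTED_SUM = """
    SELECT A, SUM(B) AS B,
           0 AS A_TAINTED,
           (MAX(B_TAINTED OR ROW_TAINTED)
             OR (SELECT MAX(A_TAINTED) FROM R_CLEANED)) AS B_TAINTED,
           MIN(A_TAINTED OR ROW_TAINTED) AS ROW_TAINTED
    FROM R_CLEANED
    GROUP BY A
    ORDER BY A
"""

before = conn.execute(TAINTED_SUM).fetchall()

# Make one group-by value uncertain: its row could belong to any group.
conn.execute("UPDATE R_CLEANED SET A_TAINTED = 1 WHERE A = 2")
after = conn.execute(TAINTED_SUM).fetchall()

print(before)
print(after)
```

In the first run only the sum for group 1 is tainted. Once any `A` value is tainted, the subquery taints every group's sum, because the uncertain row might contribute to any of them, and the group built solely from that row is itself row-tainted.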
+
+
+
+
Taint Benefits
+
+
Much faster than classical Prob. DBs (comparable to deterministic queries).
+
At-a-glance visual of how bad your data is.
+
Can help to focus subsequent analysis.
+
Taint Limitations
-
Taint is (probably) C-Sound, but not C-Complete.
-
Taint on group-by aggregates can be misleading.
-
Taint does not work well with set difference.
+
Taint is (probably *) C-Sound, but (usually *) not C-Complete.
+
Taint on group-by aggregates can be misleading.
+
Taint does not work well with set difference.
In spite of this, taint works well in practice.
-
-
-
-
-
-
-
-
+ *Ongoing work w/ Su Feng, Aaron Huber, Boris Glavic
@@ -1162,24 +1319,9 @@ CREATE VIEW R_CLEANED AS
-
One more thing...
-
- SELECT A, VGTerm('X', ROWID) AS B FROM R;
-
-↓ ↓ ↓ ↓
-
- SELECT A, BEST_GUESS('X', ROWID) AS B,
- FALSE AS ROW_TAINTED,
- FALSE AS A_TAINTED,
- NOT IS_ACKNOWLEDGED('X', ROWID) AS B_TAINTED
- FROM R;
-
-
Allow users to turn-off taint for specific tags.
-
-
-
-
What are the tags affecting a result?
-
Solution: Static dependency analysis produces a list of tag families and queries to generate all relevant indexes.
+
Which variables affect my query results?
+
Idea: Static dependency analysis produces a list of variable families and queries to generate all relevant indexes.
+ Mimir: Bringing CTables into Practice; Nandi et al.; arXiv
@@ -1188,17 +1330,23 @@ CREATE VIEW R_CLEANED AS
-