Alice wants to analyze two unaligned time series.
+Time | Reading |
---|---|
1575731001 | 0 |
⚠ | +
+ The data included an unexpected value: 'Non-Hispanic White'. The most similar known value is 'White Non-Hispanic'. + |
+
Declare a caveat when violating an assumption might...
-... this is what Eve saw:
+So what is a caveat?
+A brief digression...
Possible tuples exist in at least one possible world. $$possible(\mathcal R) = \bigcup_{R \in \mathcal R} R$$
Certain tuples exist in all possible worlds. $$certain(\mathcal R) = \bigcap_{R \in \mathcal R} R$$
+Uncertain tuples exist in at least one,
but not all possible worlds. $$uncertain(\mathcal R) = \bigcup_{R \in \mathcal R} R - certain(\mathcal R)$$
(not limited to set semantics)
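As a small illustration of the three definitions above, the following Python sketch models each possible world as a set of tuples. The function names mirror the formulas; the example worlds are invented for illustration, and the sketch assumes a non-empty list of worlds under set semantics.

```python
from functools import reduce

def possible(worlds):
    """Tuples that appear in at least one possible world (union)."""
    return set().union(*worlds)

def certain(worlds):
    """Tuples that appear in every possible world (intersection).

    Assumes `worlds` is non-empty.
    """
    return reduce(lambda a, b: a & b, worlds)

def uncertain(worlds):
    """Tuples that appear in some, but not all, possible worlds."""
    return possible(worlds) - certain(worlds)

# Hypothetical example: two worlds that disagree on the second reading.
worlds = [
    {(1575731001, 0), (1575731002, 5)},
    {(1575731001, 0), (1575731002, 7)},
]
```

Here the first reading is certain, while both candidate values for the second reading are uncertain.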
A caveat is an assumption tied to one or more data elements (cells or rows).
+If the assumption is wrong, so is the element.
+An element has a caveat → The element is uncertain.
+... and btw, here's why.
+
+ SELECT setting_1, setting_2, estimate
+ FROM Simulation;
+
+
+ We want to indicate that the estimate column is only accurate if (for example) P ≠ NP.
+caveat(value, assumption)
+returns value, annotated with assumption.
+
SELECT setting_1, setting_2,
- caveat(estimate, 'Only correct if phi is 42')
+ caveat(estimate, 'Only correct if P ≠ NP')
AS estimate
FROM Simulation;
- is the same as
-
- SELECT setting_1, setting_2, estimate
- FROM Simulation;
-
- Caveat: If it turns out that phi ≠ 42,
all estimate values could be wrong.
(The first query annotates all `estimate` values with the caveat)
+annotation is just a human-readable string.
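One way to picture caveat(value, assumption) outside of SQL is as a thin wrapper that carries the value together with its human-readable assumptions. The sketch below is only an analogy: the `Caveated` class and the sample row are hypothetical and are not taken from the system described in these slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Caveated:
    """A value plus the assumptions it depends on (hypothetical helper)."""
    value: object
    assumptions: tuple = ()

def caveat(value, assumption):
    """Return value annotated with a human-readable assumption string."""
    if isinstance(value, Caveated):
        # Already annotated: keep earlier assumptions and append the new one.
        return Caveated(value.value, value.assumptions + (assumption,))
    return Caveated(value, (assumption,))

# Hypothetical stand-in for the Simulation table.
rows = [{"setting_1": 1, "setting_2": 2, "estimate": 0.5}]

# Annotate every estimate, as the caveat() query does.
annotated = [
    {**row, "estimate": caveat(row["estimate"], "Only correct if P ≠ NP")}
    for row in rows
]
```

The value itself is unchanged; only the annotation travels with it.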
caveat(value, assumption)
-Each call fragments reality into multiple possible worlds.
-a few examples...
+caveat() creates 2 sets of possible worlds:
Mark multi-valued buckets (key repair).
SELECT bucket,
@@ -412,20 +504,24 @@
FIRST(reading) AS reading,
COUNT(*) AS bucket_size
FROM sensor
+ GROUP BY bucket;
)
Interpolation is more complex... but similar.
Mark unexpected values the model wasn't trained on.
SELECT
CASE WHEN race_ethnicity
- IN ('white non-hispanic', 'black non-hispanic', /* ... */)
+ IN ('White Non-Hispanic', 'Black Non-Hispanic', /* ... */)
THEN race_ethnicity
+
ELSE caveat(race_ethnicity,
'Unexpected race_ethnicity: ' & race_ethnicity)
+
END, /* ... */
FROM R
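The same check can be sketched in Python: pass known values through untouched and attach a message to anything the model was not trained on. The `KNOWN` set below is a hypothetical two-value subset (the slide elides the full list), and returning a (value, message) pair is just one possible way to represent the annotation.

```python
# Hypothetical subset of the values the model was trained on.
KNOWN = {"White Non-Hispanic", "Black Non-Hispanic"}

def check_race_ethnicity(value):
    """Return (value, caveat_message); the message is None for known values."""
    if value in KNOWN:
        return value, None
    return value, "Unexpected race_ethnicity: " + str(value)
```

Note that the comparison is case-sensitive, which is why the capitalization of the literals in the IN list matters.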
@@ -433,41 +529,29 @@
Spark's CSV loader can augment tables with a $\texttt{parse_error}$ column.
+
- SELECT * FROM csv_file
- WHERE
- CASE WHEN parse_error IS NULL THEN TRUE ELSE
- caveat(FALSE, parse_error)
- END;
+ SELECT /* ... */,
+ CASE WHEN CAST(salary AS float) IS NULL THEN
+
+ caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
+
+ ELSE CAST(salary AS float) END AS salary
+ FROM raw_csv_data;
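For comparison, here is a minimal Python sketch of the same pattern: attempt the cast, and when it fails return a null value together with the caveat message. The function name `cast_salary` is hypothetical, and Python's `float()` accepts slightly different inputs than a SQL CAST would.

```python
def cast_salary(raw):
    """Return (value, caveat_message).

    caveat_message is None when the cast succeeds; otherwise the value is
    None and the message records the raw input that could not be parsed.
    """
    try:
        return float(raw), None
    except (TypeError, ValueError):
        return None, f"Could not cast [ {raw} ] to float."
```

A downstream consumer still sees a null salary, but can now tell why it is null.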
What semantics do we want?
-Caveatted data elements could be wrong.
-Certain Data Elements: Elements guaranteed to be in the result in all possible worlds.
@@ -578,8 +647,9 @@ If a caveatted element can't affect an output element, don't propagate its caveats!
Propagate caveats to any data elements that could be affected by a change.
Challenge: How do we propagate caveats
without penalizing query evaluation?
Challenge: How do we propagate caveats
without penalizing query evaluation?
Don't!
- CREATE VIEW by_language AS
+ CREATE VIEW survey_responses AS
SELECT language,
- CASE WHEN CAST(salary AS float) IS NOT NULL THEN
-
+ CASE WHEN CAST(salary AS float) IS NULL THEN
caveat(NULL, 'Could not cast [ '&salary&' ] to float.')
-
ELSE CAST(salary AS float) END AS salary
FROM raw_csv_data;
- CREATE VIEW by_language AS
+ CREATE VIEW survey_responses AS
SELECT language, CAST(salary AS float) AS salary,
FALSE AS _caveat_field_language,
CAST(salary as float) IS NULL AS _caveat_field_salary
@@ -674,7 +742,7 @@
SELECT salary
- FROM by_language
+ FROM survey_responses
WHERE language = 'Scala'
@@ -683,7 +751,7 @@
SELECT salary,
_caveat_field_salary AS _caveat_field_salary,
_caveat_row OR _caveat_field_language AS _caveat_row
- FROM by_language
+ FROM survey_responses
WHERE language = 'Scala'
@@ -692,7 +760,7 @@
SELECT AVG(salary) AS salary
- FROM by_language
+ FROM survey_responses
becomes
@@ -700,7 +768,7 @@
SELECT AVG(salary) AS salary,
GROUP_OR(_caveat_field_salary) AS _caveat_field_salary,
FALSE AS _caveat_row
- FROM by_language
+ FROM survey_responses
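The aggregate rewrite can be mimicked in a few lines of Python: compute the average as usual, and OR together the per-row salary flags so the result is marked whenever any input could be wrong. The sample rows are invented, and the sketch assumes GROUP_OR behaves like a logical OR over the group.

```python
# Hypothetical rows from survey_responses, with the boolean flag column
# produced by the view rewrite.
rows = [
    {"language": "Scala", "salary": 100.0, "_caveat_field_salary": False},
    {"language": "Scala", "salary": None, "_caveat_field_salary": True},
    {"language": "Java", "salary": 90.0, "_caveat_field_salary": False},
]

def avg_salary(rows):
    """AVG(salary) plus a GROUP_OR-style flag over the salary caveats."""
    # Like SQL's AVG, skip NULL (None) salaries.
    values = [r["salary"] for r in rows if r["salary"] is not None]
    return {
        "salary": sum(values) / len(values) if values else None,
        # The average is caveated if any contributing row is caveated.
        "_caveat_field_salary": any(r["_caveat_field_salary"] for r in rows),
        # With no GROUP BY, the single output row always exists.
        "_caveat_row": False,
    }

result = avg_salary(rows)
```

The flags are ordinary boolean columns, so the extra work is a single OR per row rather than a separate pass over the data.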
@@ -708,21 +776,21 @@
SELECT language, AVG(salary) AS salary
- FROM by_language
+ FROM survey_responses
GROUP BY language
... first we evaluate
SELECT GROUP_OR(_caveat_field_language)
- FROM by_language
+ FROM survey_responses
Can often be evaluated statically.
- If TRUE
+ If GROUP BY has caveats
SELECT language, AVG(salary) AS salary
@@ -736,7 +804,7 @@
- If FALSE
+ If no GROUP BY caveats
SELECT language, AVG(salary) AS salary
@@ -749,10 +817,6 @@
-
- Ongoing work with Boris Glavic + Su Feng @ IIT
-
-