diff --git a/src/talks/2022-02-16-MaterialsDB.erb b/src/talks/2022-02-16-MaterialsDB.erb
index 73de4c64..cea881f1 100644
--- a/src/talks/2022-02-16-MaterialsDB.erb
+++ b/src/talks/2022-02-16-MaterialsDB.erb
@@ -7,7 +7,7 @@ title: "Caveatting your data"
Adding explainability to incomplete datasets
Oliver Kennedy
University at Buffalo
- (Based on joint work with Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...
+ (Based on joint work with Olga Wodo, Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...)
@@ -117,6 +117,16 @@ title: "Caveatting your data"
Bob needs to know Alice's assumptions (and how to use the workflow)?
+
+
+ In summary
+
+
+ Shortcuts are unavoidable
+ ... but introduce ambiguity into datasets
+ ... and processes
+
+
@@ -169,7 +179,7 @@ title: "Caveatting your data"
- The curse of small data
+ Incomplete Coverage
@@ -183,13 +193,19 @@ title: "Caveatting your data"
Let the dataset users figure it out.
- Small datasets must trade off between utility and trustworthiness.
+ The Curse of Small Data: Utility vs. Trustworthiness.
+
+
+
+ Incomplete Data Management
+
+
What is Incomplete Data?
@@ -204,31 +220,283 @@ title: "Caveatting your data"
Data that is not known precisely.
+
+ Data Model
+
+ (General Idea)
+
+
- A 'placeholder' value or record.
+ A 'placeholder' value or record.
n/a or None
+ "The value is probably 3.2"
- Constraints describing what we do know.
- The value must be between $0$ and $1$.
- The value is normally distributed ($\mu = 10, \sigma^2 = 2$).
- There are at most 10 records with values from $0.9$ to $1$.
+ Constraints describing what we do know.
+ "The value must be between $0$ and $0.7$."
+ "There are at most 10 records with values from $0.9$ to $1$."
- Metadata describing the source of the error.
- The experiment hasn't been run yet.
- Alignment issues between datasets $A$ and $B$.
+ Metadata describing the source of the error.
+ "The experiment hasn't been run yet."
+ "Alignment issues between datasets $A$ and $B$."
-
+
+
+ locale
+ rate
+ size
+
+
+ Los Angeles
+ $[3\%,4\%]^1$
+ metro
+
+
+ Austin
+ $18\%$
+ [city, metro]$^2$
+
+
+ Houston
+ $14\%$
+ metro
+
+
+ Berlin
+ $1\%$, town or $3\%$, city$^2$
+
+
+ Sacramento
+ $1\%$
+ null
+
+
+ Springfield
+ null
+ town
+
+
+
+ $1:$ Conflict between CDC and locality-reported statistics.
+ $2:$ Multiple localities with this name.
+
+
+ Feng et al., "Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds"
+
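One way to picture the ranged entries in the table above (for example, the Los Angeles rate of $[3\%,4\%]$) is as a pair of bounds whose comparisons return True, False, or Unknown. This is only a minimal sketch: the `Bounded` class and its `less_than` method are hypothetical names chosen for illustration, not the data model of the cited paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounded:
    """A value known only to lie somewhere in [low, high] (hypothetical helper)."""
    low: float
    high: float

    def less_than(self, bound: float) -> Optional[bool]:
        """Return True/False when every value in the range agrees, else None (Unknown)."""
        if self.high < bound:
            return True   # even the largest possible value is below the bound
        if self.low >= bound:
            return False  # even the smallest possible value is not below the bound
        return None       # the range straddles the bound


los_angeles_rate = Bounded(0.03, 0.04)   # [3%, 4%] from the table
austin_rate = Bounded(0.18, 0.18)        # a precisely known value is a zero-width range
```

A precisely known value is just a range whose bounds coincide, so the same comparison code covers both complete and incomplete cells.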
+
+
+
+
+
+
+
+ Incomplete Databases
+
+ "Incompleteness" as a first-class database primitive.
+
+
+ Start with a Data Model (e.g., Relations/Data Frames)
+ Add a (formal) notion of "Incompleteness"
+
+ Pick a Query Language (e.g., SQL, Pandas, Spark)
+ Define semantics for queries under incompleteness.
+ Optimize.
+
+
+
- How is Incomplete Data used?
-
+ Using Incomplete Data
+
+ "Certain" and "Possible" answers.
+ Summary Statistics.
+ Presenting Incompleteness.
+ Incompleteness-Aware ML.
+
+
+
+
+
+ UMAMI
+
+
+
+
+
+ Certain, Possible Answers
+
+
+ result = df[ (df["NORMALIZED_INTERFERENCE"] < 10)
+              & (df["ABS_wf_D"] > 2.8) ]
+
+ $3.1 > 2.8$ is True
+ $\text{n/a} < 10$ is Uncertain
+ $(3.1 > 2.8) \wedge (\text{n/a} < 10)$ is ???
+
+
+
+
+ Records meeting both conditions are certain matches.
+ Records with ABS_wf_D $\leq$ 2.8 are definitely not matches.
+ Records with ABS_wf_D > 2.8 but missing NORMALIZED_INTERFERENCE are possible matches.
+
+
+
+ 3-Valued Boolean Logic
+
+ AND
+
+
+ True Unknown False
+ True True Unknown False
+ Unknown Unknown Unknown False
+ False False False False
+
+
+ similar truth tables for OR, NOT, etc...
+
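The AND table above, and the OR and NOT tables it alludes to, can be sketched in a few lines of plain Python. Using `None` to stand for Unknown is an assumption of this sketch, and the function names `and3`, `or3`, and `not3` are hypothetical.

```python
from typing import Optional

# None stands in for "Unknown" (an assumption of this sketch).
TriBool = Optional[bool]


def and3(a: TriBool, b: TriBool) -> TriBool:
    """3-valued AND: False wins, then Unknown, and True only if both are True."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a: TriBool, b: TriBool) -> TriBool:
    """3-valued OR: True wins, then Unknown, and False only if both are False."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a: TriBool) -> TriBool:
    """3-valued NOT: Unknown stays Unknown."""
    return None if a is None else (not a)
```

Note that `and3(None, False)` is `False` rather than Unknown: one definite False is enough to settle a conjunction, which is what lets some records be ruled out despite missing values.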
+
+
+ Certain vs Possible
+
+
+
+
+ Result
+ Includes...
+
+
+
+ Certain
+ Filter = True
+
+
+ Possible
+ Filter $\in$ { True, Unknown }
+
+
+
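The certain/possible split in the table above can be sketched without any database machinery. The records, their `id` field, and the helper names below are made up for illustration; missing values are modeled as `None`, and the thresholds follow the worked comparisons on the earlier slide.

```python
def less_than(value, bound):
    """3-valued comparison: None (Unknown) when the value is missing."""
    return None if value is None else value < bound


def greater_than(value, bound):
    """3-valued comparison: None (Unknown) when the value is missing."""
    return None if value is None else value > bound


def and3(a, b):
    """3-valued AND, with None standing in for Unknown."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


# Illustrative records only; None marks a missing measurement.
records = [
    {"id": 1, "NORMALIZED_INTERFERENCE": 4.0, "ABS_wf_D": 3.1},
    {"id": 2, "NORMALIZED_INTERFERENCE": None, "ABS_wf_D": 3.1},
    {"id": 3, "NORMALIZED_INTERFERENCE": None, "ABS_wf_D": 1.0},
    {"id": 4, "NORMALIZED_INTERFERENCE": 12.0, "ABS_wf_D": 3.1},
]


def row_filter(rec):
    return and3(
        less_than(rec["NORMALIZED_INTERFERENCE"], 10),
        greater_than(rec["ABS_wf_D"], 2.8),
    )


# Certain: the filter is definitely True.
certain = [r["id"] for r in records if row_filter(r) is True]
# Possible: the filter is True or Unknown (anything not definitely False).
possible = [r["id"] for r in records if row_filter(r) is not False]
```

Record 3 is excluded from both results even though one of its values is missing, because the known value already fails the filter.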
+
+
+ Summary Statistics
+
+
+ result.count()
+
+
+ There are at least certain.count() records.
+ There are at most possible.count() records.
+ The output is also incomplete
+ (but in a different way)
+
+
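As a rough sketch of the bounds above: given one 3-valued filter result per record (with `None` standing for Unknown, an assumption of this sketch), the lower and upper bounds on the count fall out of two tallies. The function name `count_bounds` is hypothetical.

```python
def count_bounds(filter_results):
    """Return (at_least, at_most) for a count over 3-valued filter results."""
    # Only rows that definitely pass are guaranteed to be counted.
    at_least = sum(1 for r in filter_results if r is True)
    # Rows that pass or might pass could all end up being counted.
    at_most = sum(1 for r in filter_results if r is not False)
    return at_least, at_most


bounds = count_bounds([True, None, False, None])
```

The result is a range rather than a single number, which is the sense in which the output is itself incomplete.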
+
+
+
+
+
+ Presenting Incompleteness
+
+ Database: This nanostructure is possibly a result.
+ You: Why isn't it certain?
+
+
+
+ Incompleteness as a provenance problem (aka lineage, pedigree)
+ Yang et al., "Lenses: An On-Demand Approach to ETL"
+
+
+
+ Caveats
+
+ Annotate incomplete values with a small note
+
+ Challenge: Propagating annotations through queries.
+
+
+
+
+ Filter = Uncertain $\rightarrow$ The annotation propagates to the row.
+
+ df.count() $\rightarrow$ Row annotations propagate back to the value.
+
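The two propagation rules above can be illustrated with a toy model. This is a simplified sketch, not the Spark implementation mentioned in the talk: the `Caveated` class and both functions are hypothetical, and the sketch assumes each caveated cell stores a best-guess value that the filter is evaluated on.

```python
from dataclasses import dataclass, field


@dataclass
class Caveated:
    """A value plus any notes explaining why it may be unreliable (hypothetical)."""
    value: object
    notes: list = field(default_factory=list)


def filter_rows(rows, column, predicate):
    """Filter on the best-guess value; notes on the tested cell move to the row."""
    kept = []
    for row in rows:
        cell = row[column]
        if predicate(cell.value):
            # A caveated input makes the row's membership uncertain,
            # so the cell's notes become row-level notes.
            kept.append({"data": row, "row_notes": list(cell.notes)})
    return kept


def count_rows(kept):
    """Count rows; row-level notes propagate back onto the resulting value."""
    notes = [note for row in kept for note in row["row_notes"]]
    return Caveated(len(kept), notes)


rows = [
    {"x": Caveated(3.1)},
    {"x": Caveated(3.2, ["The value is probably 3.2"])},
    {"x": Caveated(1.0)},
]
kept = filter_rows(rows, "x", lambda v: v > 2.8)
total = count_rows(kept)
```

One simplification to be aware of: rows whose best-guess value fails the filter are dropped together with their notes, so this sketch does not report rows that might have matched.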
+
+
We have caveats working with Apache Spark, with negligible overhead.
+
Brachmann et al., "Your notebook is not crumby enough, REPLace it"
+
Feng et al., "Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers"
+
+
+
+
+
+
+
+
+
+ Incompleteness-Aware ML
+
+ ... a work in progress
+
+ ... and initially focused on explainable models like Bayes Nets
+
+
+
+ Bayes Net
+
+ Fitting a model involves repeatedly computing:
+
+ $P[A | B, C]$ $= \frac{\sum_{D, E, \ldots} P[A, B, C, D, E, \ldots]}{\sum_{A, D, E, \ldots} P[A, B, C, D, E, \ldots]}$
+
+
+
+ # Count each (A, B, C) combination, then normalize by the matching (B, C) total
+ counts = df.groupby(["A", "B", "C"]).size()
+ by_bc = counts.groupby(level=["B", "C"]).transform("sum")
+ probs = counts / by_bc  # P[A | B, C]
+
+
+
+
+ Bayes Net
+ Fitting graphical models is just filtering & counting🤞
+
+ 30 years of work on incomplete databases come "for free"
+
+
+
+ Incomplete Bayes Nets
+
+
+ "$P[A < 23 | B = 2] \in [0.7, 0.99]$"
+ "The estimate is inaccurate because 100 records are missing attribute C in the following 3 datasets."
+ "Improve precision by at least 50% by running simulation C on the following three materials."
+
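To make the first statement above concrete, here is a minimal sketch of how a range for a conditional probability could be derived by counting certain and possible matches. It assumes equality conditions, missing values stored as `None`, and missing values that can be filled in independently of one another; the function names and the sample records are hypothetical.

```python
def matches(value, target):
    """3-valued equality: None (Unknown) when the value is missing."""
    return None if value is None else value == target


def conditional_bounds(records, a, b):
    """Bound P[A = a | B = b] over all ways of filling in missing values."""
    certain_ab = possible_ab = 0            # records with A = a and B = b
    certain_b_not_a = possible_b_not_a = 0  # records with A != a and B = b
    for rec in records:
        is_a = matches(rec["A"], a)
        is_b = matches(rec["B"], b)
        if is_b is False:
            continue  # can never satisfy the condition B = b
        if is_a is not False:
            possible_ab += 1
            if is_a is True and is_b is True:
                certain_ab += 1
        if is_a is not True:
            possible_b_not_a += 1
            if is_a is False and is_b is True:
                certain_b_not_a += 1
    # Lowest ratio: fewest matches in the numerator, most non-matches alongside them.
    low_den = certain_ab + possible_b_not_a
    # Highest ratio: most matches in the numerator, fewest non-matches alongside them.
    high_den = possible_ab + certain_b_not_a
    lower = certain_ab / low_den if low_den else 0.0   # fall back to a trivial bound
    upper = possible_ab / high_den if high_den else 1.0
    return lower, upper


# Illustrative records only; None marks a missing attribute.
sample = [
    {"A": 1, "B": 2},
    {"A": 1, "B": 2},
    {"A": 0, "B": 2},
    {"A": None, "B": 2},
    {"A": 1, "B": None},
    {"A": 1, "B": 3},
]
lower, upper = conditional_bounds(sample, a=1, b=2)
```

Because the records responsible for the gap between the two bounds are exactly the ones with a missing attribute, the same tallies could also drive explanations like the second and third statements above.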
+
+
+
+
+
+ Open Questions...
+
+
+ Measuring "unknown unknowns"
+ Combining models of incompleteness
+ Presenting/summarizing result incompleteness
+ ... and your questions too 😀
+
+
+
\ No newline at end of file
diff --git a/src/talks/graphics/2022-02-16/caveat_dataframe.png b/src/talks/graphics/2022-02-16/caveat_dataframe.png
new file mode 100644
index 00000000..65216480
Binary files /dev/null and b/src/talks/graphics/2022-02-16/caveat_dataframe.png differ
diff --git a/src/talks/graphics/2022-02-16/microstructure_filter.png b/src/talks/graphics/2022-02-16/microstructure_filter.png
new file mode 100644
index 00000000..536f8f9d
Binary files /dev/null and b/src/talks/graphics/2022-02-16/microstructure_filter.png differ