diff --git a/src/talks/2022-02-16-MaterialsDB.erb b/src/talks/2022-02-16-MaterialsDB.erb
index 73de4c64..cea881f1 100644
--- a/src/talks/2022-02-16-MaterialsDB.erb
+++ b/src/talks/2022-02-16-MaterialsDB.erb
@@ -7,7 +7,7 @@ title: "Caveatting your data"

Adding explainability to incomplete datasets

Oliver Kennedy

University at Buffalo

-

(Based on joint work with Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...

+

(Based on joint work with Olga Wodo, Boris Glavic, Juliana Freire, William Spoth, Poonam Kumari, Ying Yang, Michael Brachmann, and many more...)

@@ -117,6 +117,16 @@ title: "Caveatting your data"

Bob needs to know Alice's assumptions
(and how to use the workflow)?

+ +
+

In summary

+ + +
@@ -169,7 +179,7 @@ title: "Caveatting your data"
-

The curse of small data

+

Incomplete Coverage

@@ -183,13 +193,19 @@ title: "Caveatting your data"
  • Let the dataset users figure it out.
  • -

    Small datasets must trade off between utility and trustworthiness.

    +

    The Curse of Small Data: Utility vs Trustworthiness.

    + +
    +
    +

    Incomplete Data Management

    +
    +

    What is Incomplete Data?

    @@ -204,31 +220,283 @@ title: "Caveatting your data"

    Data that is not known precisely.

    +
    +

    Data Model

    + +

    (General Idea)

    +
    +
    -  • A 'placeholder' value or record.
    +  • A 'placeholder' value or record.
          • n/a or None
    +     • "The value is probably 3.2"
    -  • Constraints describing what we do know.
    -     • The value must be between $0$ and $1$.
    -     • The value is normally distributed ($\mu = 10, \sigma^2 = 2$).
    -     • There are at most 10 records with values from $0.9$ to $1$.
    +  • Constraints describing what we do know.
    +     • "The value must be between $0$ and $0.7$."
    +     • "There are at most 10 records with values from $0.9$ to $1$."
    -  • Metadata describing the source of the error.
    -     • The experiment hasn't been run yet.
    -     • Alignment issues between datasets $A$ and $B$.
    +  • Metadata describing the source of the error.
    +     • "The experiment hasn't been run yet."
    +     • "Alignment issues between datasets $A$ and $B$."
    - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    locale      | rate          | size
    Los Angeles | $[3\%,4\%]^1$ | metro
    Austin      | $18\%$        | [city, metro]$^2$
    Houston     | $14\%$        | metro
    Berlin      | $1\%$, town  or  $3\%$, city$^2$
    Sacramento  | $1\%$         | null
    Springfield | null          | town
    +

    + $1:$ Conflict between CDC and locality-reported statistics.
    + $2:$ Multiple localities with this name. +

    + + Feng et al., "Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds" +
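
    A minimal sketch of how a range-valued cell such as the Los Angeles rate might be compared against a threshold. The `Bounded` class and `less_than` helper are hypothetical names introduced here for illustration; they are not taken from the cited paper or from any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounded:
    """Hypothetical helper: a value known only to lie within [lo, hi]."""
    lo: float
    hi: float


def less_than(x: Bounded, threshold: float) -> Optional[bool]:
    """Compare a bounded value to a threshold.

    Returns True or False when every possible value agrees,
    and None when the answer depends on the unknown true value.
    """
    if x.hi < threshold:
        return True       # every value in the range passes
    if x.lo >= threshold:
        return False      # no value in the range passes
    return None           # the range straddles the threshold


la_rate = Bounded(0.03, 0.04)        # the $[3\%, 4\%]$ cell from the table
houston_rate = Bounded(0.14, 0.14)   # a precisely known value is a degenerate range
```

    With these definitions, `less_than(la_rate, 0.05)` is certain, while `less_than(la_rate, 0.035)` cannot be decided without resolving the conflict behind the range.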
    + + +
    + +
    + +
    +

    Incomplete Databases

    + +

    "Incompleteness" as a first-class database primitive.

    + + +
    -

    How is Incomplete Data used?

    - +

    Using Incomplete Data

    +
    + +
    + +
    +

    UMAMI

    + + +
    + +
    +

    Certain, Possible Answers

    + +
    
    +      result = df[ (df["NORMALIZED_INTERFERENCE"] < 10)
    +                    & (df["ABS_wf_D"] > 0.28) ]
    +    
    +

    $3.1 > 2.8$ is True

    +

    $\text{n/a} < 10$ is Uncertain

    +

    $(3.1 > 2.8) \wedge (\text{n/a} < 10)$ is ???

    + +
    + +
    +

    Records meeting both conditions are certain matches.

    +

    Records with ABS_wf_D < 2.8 are definitely not matches.

    +

    Records with ABS_wf_D > 2.8 but missing NORMALIZED_INTERFERENCE are possible matches.
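
    The three cases above can be sketched in pandas using its nullable `boolean` dtype, whose `&` operator follows three-valued logic. The toy data and the `three_valued` helper are assumptions made for illustration, and the thresholds follow the slide text rather than any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN stands in for a missing ("n/a") measurement.
df = pd.DataFrame({
    "NORMALIZED_INTERFERENCE": [5.0, np.nan, 12.0, np.nan],
    "ABS_wf_D":                [3.1, 3.1, 3.1, 2.0],
})


def three_valued(comparison: pd.Series, source: pd.Series) -> pd.Series:
    """Turn a plain comparison into True / False / <NA>,
    where <NA> marks rows whose source value is missing."""
    return comparison.astype("boolean").mask(source.isna())


low_interference = three_valued(df["NORMALIZED_INTERFERENCE"] < 10,
                                df["NORMALIZED_INTERFERENCE"])
high_absorption = three_valued(df["ABS_wf_D"] > 2.8, df["ABS_wf_D"])

# Nullable booleans combine with three-valued AND.
cond = low_interference & high_absorption

# Certain matches: the filter is definitely True.
certain = df[cond.fillna(False).astype(bool)]
# Possible matches: the filter is True or unknown.
possible = df[cond.fillna(True).astype(bool)]
```

    Row 3 is missing `NORMALIZED_INTERFERENCE` but still fails the filter outright, because its `ABS_wf_D` is below the threshold.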

    +
    + +
    +

    3-Valued Boolean Logic

    + +

    AND

    + + + + + + +
            | True    | Unknown | False
    True    | True    | Unknown | False
    Unknown | Unknown | Unknown | False
    False   | False   | False   | False
    + +

    similar truth tables for OR, NOT, etc...
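
    As a rough sketch, the AND table and its OR and NOT counterparts can be written as small Python functions, with `None` standing in for Unknown. The function names are illustrative only.

```python
from typing import Optional

Truth = Optional[bool]  # True, False, or None for Unknown


def and3(a: Truth, b: Truth) -> Truth:
    """Three-valued AND: a single False decides the result."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a: Truth, b: Truth) -> Truth:
    """Three-valued OR: a single True decides the result."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a: Truth) -> Truth:
    """Three-valued NOT: Unknown stays Unknown."""
    return None if a is None else (not a)
```

    Note that `and3(None, False)` is `False`: the unknown operand cannot change the outcome, so the result is still certain.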

    +
    + +
    +

    Certain vs Possible

    + + + + + + + + + + + + + + + + +
    Result   | Includes...
    Certain  | Filter = True
    Possible | Filter $\in$ { True, Unknown }
    +
    + +
    +

    Summary Statistics

    + +
    
    +      result.count()
    +    
    + +

    There are at least certain.count() records.

    +

    There are at most possible.count() records.

    +

    The output is also incomplete

    +

    (but in a different way)
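
    One way to picture the incomplete output: a count over an uncertain filter becomes a pair of bounds. The sketch below assumes the filter has already been evaluated per record to `True`, `False`, or `None` (unknown); `count_bounds` is a hypothetical helper name.

```python
from typing import Iterable, Optional, Tuple


def count_bounds(filter_results: Iterable[Optional[bool]]) -> Tuple[int, int]:
    """Return (at_least, at_most) for the number of matching records."""
    results = list(filter_results)
    # Certain matches: the filter is definitely True.
    at_least = sum(1 for r in results if r is True)
    # Possible matches: the filter is True or unknown.
    at_most = sum(1 for r in results if r is not False)
    return at_least, at_most


# One certain match, two unknown records, one definite non-match.
bounds = count_bounds([True, None, False, None])
```

    Here the true count is somewhere between 1 and 3, so the result carries a range rather than a missing value.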

    +
    + +
    + +
    + +
    +

    Presenting Incompleteness

    + +

    Database: This nanostructure is possibly a result.

    +

    You: Why isn't it certain?

    +
    + +
    +

    Incompleteness as a provenance problem
    (aka lineage, pedigree)

    + Yang et al., "Lenses: An On-Demand Approach to ETL" +
    + +
    +

    Caveats

    + +

    Annotate incomplete values with a small note

    + +

    Challenge: Propagating annotations through queries.

    +
    + + +
    +

    Filter = Uncertain $\rightarrow$ The annotation propagates to the row.

    + +

    df.count() $\rightarrow$ Row annotations propagate back to the value.
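
    A simplified sketch of the two propagation rules, in plain Python. The `Caveated` class and both functions are hypothetical and are not the Spark implementation; in particular, this version ignores caveated rows that the filter drops, which a fuller treatment would also need to track.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Caveated:
    """A best-guess value plus notes explaining why it may be wrong."""
    value: float
    notes: Tuple[str, ...] = ()


def filter_rows(rows: List[Dict[str, Caveated]], column: str,
                predicate: Callable[[float], bool]) -> List[dict]:
    """Filter on best-guess values; a caveated input caveats the whole row."""
    kept = []
    for row in rows:
        cell = row[column]
        if predicate(cell.value):
            kept.append({"row": row, "row_notes": cell.notes})
    return kept


def count_rows(filtered: List[dict]) -> Caveated:
    """Count rows; row-level notes flow back onto the aggregate value."""
    notes = tuple(n for r in filtered for n in r["row_notes"])
    return Caveated(len(filtered), notes)


rows = [
    {"x": Caveated(3.1)},
    {"x": Caveated(3.2, ("The experiment hasn't been run yet.",))},
    {"x": Caveated(1.0)},
]
kept = filter_rows(rows, "x", lambda v: v > 2.8)
total = count_rows(kept)
```

    The resulting count is 2, and it carries the note from the guessed value, so a reader of the aggregate can ask why it might be wrong.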

    + +
    +

    We have caveats working with Apache Spark,
    with negligible overhead.

    + Brachmann et al., "Your notebook is not crumby enough, REPLace it" + Feng et al., "Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers" +
    +
    + +
    + +
    + (https://mimirdb.info | https://vizierdb.info) +
    +
    + +
    + +
    +

    Incompleteness-Aware ML

    + +

    ... a work in progress

    + +

    ... and initially focused on explainable models like Bayes Nets

    +
    + +
    +

    Bayes Net

    + +

    Fitting a model involves repeatedly computing:

    +

    + $P[A | B, C]$ $= \frac{\sum_{D, E, \ldots} P[A, B, C, D, E, \ldots]}{\sum_{A, D, E, \ldots} P[A, B, C, D, E, \ldots]}$ +

    + +
    
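    +    # Estimate P[A | B, C] from co-occurrence counts.
    +    counts = df.groupby(["A", "B", "C"]).size().rename("count").reset_index()
    +    # Normalize each count by the total for its (B, C) group.
    +    by_bc = counts.groupby(["B", "C"])["count"].transform("sum")
    +    counts["prob"] = counts["count"] / by_bc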
    +    
    +
    + +
    +

    Bayes Net

    +

    Fitting graphical models is just filtering & counting🤞

    + +

    3 decades of work on incomplete databases come "for free"

    +
    + +
    +

    Incomplete Bayes Nets

    + + +
    + +
    + +
    +

    Open Questions...

    + + + +
    \ No newline at end of file
diff --git a/src/talks/graphics/2022-02-16/caveat_dataframe.png b/src/talks/graphics/2022-02-16/caveat_dataframe.png
new file mode 100644
index 00000000..65216480
Binary files /dev/null and b/src/talks/graphics/2022-02-16/caveat_dataframe.png differ
diff --git a/src/talks/graphics/2022-02-16/microstructure_filter.png b/src/talks/graphics/2022-02-16/microstructure_filter.png
new file mode 100644
index 00000000..536f8f9d
Binary files /dev/null and b/src/talks/graphics/2022-02-16/microstructure_filter.png differ