Motivation for Semantic Type Detection in Longitudinal Schema Integration
We presented DDM at HILDA 2024. DDM uses similarity scores between sentence embeddings of attribute (column) names to suggest mergeable attributes. However, some attribute names appear in multiple places, under different question categories. For instance, "remarks" appears after "which social group are you a member of?" and again after "schoolmates provenance". The meaning of such an attribute is only clear in the context of the questions it accompanies. Because DDM compares names alone, it cannot tell these occurrences apart.
Semantic type detection is not a perfect solution for these cases. However, Sherlock and Sato provide insight and inspiration into how local and global context can be embedded into column names to detect their true matching pairs.
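The "remarks" problem can be illustrated with a minimal sketch. It does not reproduce DDM or any real sentence-embedding model: character trigram counts stand in for embeddings, and the dictionaries attr_a and attr_b, along with the helper with_context, are hypothetical names introduced only for this illustration. The point is that a name-only comparison cannot separate the two attributes, while prepending the preceding question changes the score.

```python
import math
from collections import Counter


def ngram_vector(text, n=3):
    """Toy stand-in for a sentence embedding: counts of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two attributes with the same name but different preceding questions
# (hypothetical records, built from the example in the text).
attr_a = {"name": "remarks", "context": "which social group are you a member of?"}
attr_b = {"name": "remarks", "context": "schoolmates provenance"}


def with_context(attr):
    """Assumed way of injecting local context: prepend the question to the name."""
    return f'{attr["context"]} {attr["name"]}'


# Name-only comparison: identical names are indistinguishable.
name_only = cosine(ngram_vector(attr_a["name"]), ngram_vector(attr_b["name"]))

# Context-aware comparison: the differing questions lower the similarity.
contextual = cosine(ngram_vector(with_context(attr_a)),
                    ngram_vector(with_context(attr_b)))

print(f"name only: {name_only:.2f}, with context: {contextual:.2f}")
```

With a real embedding model the numbers would differ, but the same contrast between name-only and context-aware inputs would be the thing to check.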
Paper: Sherlock: A Deep Learning Approach to Semantic Data Type Detection
Sherlock is a deep learning model for detecting semantic data types, trained on a large collection of real-world data. Most systems reliably detect atomic types such as strings and decimals; semantic types provide a finer-grained description of the data by establishing correspondences between columns and real-world concepts. Correctly detecting semantic types is crucial for data science tasks such as schema matching. Sherlock builds on previous work that matches types from DBpedia with the WebTables corpus. The authors then match these types against column headers from VizNet (previous work), a large-scale repository of real-world datasets.
Each column is characterized by 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of its values. Each of the 686,765 columns in the training data is labeled with its semantic type (ground truth from the VizNet corpus).
How is the training data prepared?
For instance, a column named col_xyz is labeled l1. The data values of this column are converted into a 1,588-dimensional vector. These dimensions are the independent variables, and the label of col_xyz is the variable to predict.
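The preparation step can be sketched as follows. This is an illustration under stated assumptions, not Sherlock's feature extractor: the function featurize is hypothetical and computes only seven simple statistical and character-distribution features instead of 1,588, and the column values and the second column col_abc with label l2 are made up for the example.

```python
import statistics
import string


def featurize(values):
    """Map a column's values to a small fixed-length feature vector.

    Sherlock uses 1,588 features; this toy version keeps a handful of
    statistical and character-distribution features for illustration.
    """
    texts = [str(v) for v in values]
    lengths = [len(t) for t in texts]
    joined = "".join(texts)
    total_chars = max(len(joined), 1)  # avoid division by zero on empty columns
    return [
        float(len(texts)),                                # number of values
        len(set(texts)) / max(len(texts), 1),             # fraction of unique values
        statistics.mean(lengths) if lengths else 0.0,     # mean value length
        statistics.pstdev(lengths) if lengths else 0.0,   # spread of value lengths
        sum(c.isdigit() for c in joined) / total_chars,   # share of digit characters
        sum(c.isalpha() for c in joined) / total_chars,   # share of letters
        sum(c in string.punctuation for c in joined) / total_chars,  # share of punctuation
    ]


# Hypothetical labeled columns: (column name, values, semantic type label).
columns = [
    ("col_xyz", ["Berlin", "Paris", "Rome"], "l1"),
    ("col_abc", ["1990", "2005", "2013"], "l2"),
]

# X holds the independent variables (one feature vector per column);
# y holds the label to predict. The column name itself is not a feature.
X = [featurize(values) for _, values, _ in columns]
y = [label for _, _, label in columns]

for (name, _, label), features in zip(columns, X):
    print(name, label, [round(f, 2) for f in features])
```

Each row of X paired with its entry in y is one training example for a multiclass classifier; the column name only serves to look up the label.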