Field and Record Comparison
At the heart of the data matching process lies the detailed comparison of records with each other. These comparisons are usually performed on several attributes (or fields) of the records, yielding a vector of numerical similarity values for each compared record pair. These similarity values are then used to decide whether the two records in a pair are a match (i.e. correspond to the same entity) or a non-match (i.e. correspond to two different entities). Even after the data to be matched have been pre-processed (cleaned, standardised and segmented), attribute values from different input databases are still likely to contain variations and errors, so some kind of approximate or ‘fuzzy’ comparison function is required to calculate the similarities between attribute values. Most attributes used in data matching contain string values (such as names and addresses). This chapter presents the most commonly used approximate string comparison functions in detail and gives an overview of several more recently developed ones. An experimental comparison of the presented functions on a data set that contains real name values shows how the calculated similarity values differ. Comparison functions for numerical data, as well as for dates, ages, times, geographic locations and more complex types of data, are also discussed.
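As a minimal sketch of the comparison vector idea described above, the following Python code compares two records attribute by attribute and collects one similarity value per attribute. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in approximate string comparison function; this is an assumption made for illustration and is not necessarily one of the functions presented in this chapter. The function names, field names and record values are likewise hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Approximate string similarity between 0.0 and 1.0 (1.0 = identical).

    Stand-in comparison function (assumption): difflib's ratio, applied
    case-insensitively.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_records(rec_a: dict, rec_b: dict, fields: list) -> list:
    """Return the comparison vector: one similarity value per attribute."""
    return [
        string_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field in fields
    ]


# Hypothetical records from two input databases (made-up values).
rec_a = {"given_name": "Gail", "surname": "Smith", "suburb": "Dickson"}
rec_b = {"given_name": "Gayle", "surname": "Smith", "suburb": "Dixon"}
fields = ["given_name", "surname", "suburb"]

vector = compare_records(rec_a, rec_b, fields)
for field, similarity in zip(fields, vector):
    print(f"{field}: {similarity:.2f}")
```

The identical surnames give a similarity of 1.0, while the variant spellings of the given name and suburb give values strictly between 0 and 1. A subsequent classification step would take such a vector as input to decide between match and non-match.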