A note on using the F-measure for evaluating record linkage algorithms

Abstract

Record linkage is the process of identifying and linking records about the same entities from one or more databases. Record linkage can be viewed as a classification problem where the aim is to decide whether a pair of records is a match (i.e. two records refer to the same real-world entity) or a non-match (two records refer to two different entities). Various classification techniques—including supervised, unsupervised, semi-supervised and active learning based—have been employed for record linkage. If ground truth data in the form of known true matches and non-matches are available, the quality of classified links can be evaluated. Due to the generally high class imbalance in record linkage problems, standard accuracy or misclassification rate are not meaningful for assessing the quality of a set of linked records. Instead, precision and recall, as commonly used in information retrieval and machine learning, are used. These are often combined into the popular F-measure, which is the harmonic mean of precision and recall. We show that the F-measure can also be expressed as a weighted sum of precision and recall, with weights which depend on the linkage method being used. This reformulation reveals that the F-measure has a major conceptual weakness: the relative importance assigned to precision and recall should be an aspect of the problem and the researcher or user, but not of the particular linkage method being used. We suggest alternative measures which do not suffer from this fundamental flaw.
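The reformulation mentioned above can be checked numerically. The sketch below assumes the standard definitions of precision, TP / (TP + FP), and recall, TP / (TP + FN), computed from counts of true positives, false positives and false negatives. The abstract does not state the weights, so the ones used here are an assumption: they are one algebraically valid way to write the harmonic mean as a weighted sum. The function names and the counts for the two linkage methods are hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Return precision, recall and their harmonic mean (the F-measure)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def f_as_weighted_sum(tp, fp, fn):
    """Return (recall weight, precision weight, weighted sum).

    Assumed weights: each is that measure's own denominator divided by
    2*TP + FP + FN, so the two weights sum to one.
    """
    denom = 2 * tp + fp + fn
    w_recall = (tp + fn) / denom      # true matches / (true + predicted matches)
    w_precision = (tp + fp) / denom   # predicted matches / (true + predicted matches)
    precision, recall, _ = precision_recall_f(tp, fp, fn)
    return w_recall, w_precision, w_recall * recall + w_precision * precision


# Two hypothetical linkage methods applied to the same data (100 true matches).
methods = {
    "method A": dict(tp=80, fp=20, fn=20),
    "method B": dict(tp=60, fp=0, fn=40),
}

for name, counts in methods.items():
    _, _, f = precision_recall_f(**counts)
    w_r, w_p, f_weighted = f_as_weighted_sum(**counts)
    print(f"{name}: F={f:.3f}  weighted sum={f_weighted:.3f}  "
          f"recall weight={w_r:.3f}  precision weight={w_p:.3f}")
```

Under these assumptions the weighted sum reproduces the F-measure for both methods, but the recall weight differs between them (0.5 for method A, 0.625 for method B) even though the data and the number of true matches are the same. The weight shifts with the number of record pairs each method classifies as matches, which is the dependence on the linkage method that the abstract identifies as the conceptual weakness.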

Keywords

Data linkage, Entity resolution, Classification, Precision, Recall, Class imbalance

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Imperial College, London, UK
  2. Winton Group Limited, London, UK
  3. The Australian National University, Canberra, Australia
