A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage

  • Robespierre Pita
  • Everton Mendonça
  • Sandra Reis
  • Marcos Barreto
  • Spiros Denaxas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10440)


Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common, uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indices for pairwise comparisons. A key component of record linkage is accuracy assessment: the process of manually verifying and validating matched pairs to further refine the linkage parameters and increase the overall effectiveness of the linkage. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer’s intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from large Brazilian socioeconomic and public health care data sources. The models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values obtained from 10-fold cross-validation. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized, highly accurate model to validate linkage results.
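To make two notions from the abstract concrete, the sketch below computes a pairwise similarity index and the three assessment measures named above (sensitivity, specificity and positive predictive value). It is a minimal illustration, not the implementation used in the paper: the Dice coefficient over character bigrams is assumed here as one common choice of similarity index, and the function names `bigrams`, `dice_similarity` and `assessment_metrics`, along with the toy labels, are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of character bigrams of a lower-cased, stripped string."""
    s = text.lower().strip()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_similarity(a, b):
    """Dice coefficient over bigram multisets: 2 * |A & B| / (|A| + |B|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters: no bigrams to compare.
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total


def assessment_metrics(true_labels, predicted_labels):
    """Compute sensitivity, specificity and PPV for binary labels (1 = true match).

    A ratio with a zero denominator is reported as 0.0 (an assumption of this sketch).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true matches that were found
        "specificity": ratio(tn, tn + fp),  # share of non-matches correctly rejected
        "ppv": ratio(tp, tp + fp),          # share of predicted matches that are true
    }


if __name__ == "__main__":
    # Toy comparison of two name spellings (made-up values).
    print(dice_similarity("maria silva", "maria silvia"))
    # Toy reviewer labels versus classifier output (made-up values).
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    print(assessment_metrics(truth, predicted))
```

In a 10-fold cross-validation setting, a function like `assessment_metrics` would be applied to the held-out fold of each split and the resulting values summarised across the ten folds.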



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Robespierre Pita (1, corresponding author)
  • Everton Mendonça (1)
  • Sandra Reis (2)
  • Marcos Barreto (1, 3)
  • Spiros Denaxas (3)
  1. Computer Science Department, Federal University of Bahia (UFBA), Salvador, Brazil
  2. Centre for Data and Knowledge Integration for Health (CIDACS), Oswaldo Cruz Foundation (FIOCRUZ), Rio de Janeiro, Brazil
  3. Farr Institute of Health Informatics Research, University College London, London, UK
