Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data

  • Special Feature
  • New Generation Computing

Abstract

Probabilistic record linkage is a well-established topic in the literature. Fellegi–Sunter probabilistic record linkage and its enhanced versions are commonly used methods that calculate match and non-match weights for each pair of records. Bayesian network classifiers, namely the naive Bayes classifier and the tree augmented naive Bayes classifier (TAN), have also been used successfully for this task. Recently, an extended version of TAN (called ETAN) has been developed and shown to be superior to conventional TAN in classification accuracy. However, no previous work has applied ETAN to record linkage or investigated the benefits of using the hierarchical feature level information that exists naturally in the datasets, together with their parsed fields. In this work, we extend the naive Bayes classifier with such hierarchical feature level information. We then illustrate the benefits of our method over previously proposed methods on four datasets in terms of linkage performance (\(F_1\) score). We also show that the results can be further improved by additionally parsing the fields of these datasets.
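
As a rough illustration of the two quantities mentioned above, the following Python sketch computes Fellegi–Sunter style agreement and disagreement weights for each field, sums them into a score for a record pair, and computes an \(F_1\) score from match counts. It uses the textbook binary-agreement form of the weights, not the extended method proposed in the paper. The function names, the m and u probabilities (the chance that a field agrees given a true match or a true non-match), and the sample numbers are all assumptions made for illustration.

```python
import math

def fs_weights(m, u):
    """Agreement and disagreement weights for one field.

    m: assumed P(field agrees | records match)
    u: assumed P(field agrees | records do not match)
    """
    agreement = math.log2(m / u)
    disagreement = math.log2((1 - m) / (1 - u))
    return agreement, disagreement

def pair_score(agreements, m_probs, u_probs):
    """Sum the per-field weights for one pair of records."""
    total = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        w_agree, w_disagree = fs_weights(m, u)
        total += w_agree if agree else w_disagree
    return total

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from pair-level counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical two-field comparison (for example name and address).
m_probs = [0.95, 0.90]
u_probs = [0.05, 0.10]
print(pair_score([True, True], m_probs, u_probs))   # both fields agree
print(pair_score([True, False], m_probs, u_probs))  # second field disagrees
print(f1_score(tp=8, fp=2, fn=2))
```

A pair whose score exceeds a chosen upper threshold would be declared a match, and one below a lower threshold a non-match; the thresholds themselves are left out of this sketch.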

Notes

  1. Note that in the conventional PRL-FS method [8], two fields are either matched or unmatched, so the index k of \(m_{k,i}\) can be omitted in this case.

  2. These datasets can be found at http://yzhou.github.io/.

  3. Because the phone number is unique to each restaurant, it can identify duplicates on its own, without resorting to probabilistic record linkage techniques. This field is therefore not used in our experiments.

  4. In each dataset, we introduce only one hierarchical restriction, between the name and address fields.

References

  1. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th international conference on Very Large Data Bases, VLDB Endowment, pp. 586–597 (2002)

  2. de Campos, C.P., Cuccu, M., Corani, G., Zaffalon, M.: Extended tree augmented naive classifier. In: van der Gaag, L.C., Feelders, A.J. (eds.) Probabilistic Graphical Models, pp. 176–189. Springer, Berlin (2014)

  3. de Campos, C.P., Corani, G., Scanagatta, M., Cuccu, M., Zaffalon, M.: Learning extended tree augmented naive structures. Int. J. Approx. Reason. 68, 153–163 (2016)

  4. Christen, P., Belacic, D.: Automated probabilistic address standardisation and verification. In: Australasian Data Mining Conference (AusDM05), pp. 53–67 (2005)

  5. Churches, T., Christen, P., Lim, K., Zhu, J.X.: Preparation of name and address data for record linkage using hidden Markov models. BMC Med. Inf. Decis. Making 2(1), 1 (2002)

  6. Dunn, H.L.: Record linkage. Am. J. Public Health Nations Health 36(12), 1412–1416 (1946)

  7. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  8. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

  9. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2–3), 131–163 (1997)

  10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)

  11. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)

  12. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)

  13. Köpcke, H., Rahm, E.: Frameworks for entity matching: a comparison. Data Knowl. Eng. 69(2), 197–210 (2010)

  14. Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Proceedings of the sixteenth ACM conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’07, pp. 293–302 (2007)

  15. Leitão, L., Calado, P., Herschel, M.: Efficient and effective duplicate detection in hierarchical data. IEEE Trans. Knowl. Data Eng. 25(5), 1028–1041 (2013)

  16. Li, X., Guttmann, A., Cipiere, S., Maigne, L., Demongeot, J., Boire, J.Y., Ouchchane, L.: Implementation of an extended Fellegi–Sunter probabilistic record linkage method using the Jaro–Winkler string comparator. In: 2014 IEEE-EMBS international conference on biomedical and health informatics (BHI), IEEE, pp. 375–379 (2014)

  17. Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)

  18. Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, AUAI Press, pp. 454–461 (2004)

  19. Tromp, M., Ravelli, A.C., Bonsel, G.J., Hasman, A., Reitsma, J.B.: Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J. Clin. Epidemiol. 64(5), 565–572 (2011)

  20. Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage. In: Proceedings of the section on survey research, pp. 354–359 (1990)

  21. Winkler, W.E.: The state of record linkage and current research problems. In: Statistical research division, US Census Bureau, Citeseer (1999)

  22. Zhou, Y., Fenton, N., Neil, M.: Bayesian network approach to multinomial parameter learning using data and expert judgments. Int. J. Approx. Reason. 55(5), 1252–1268 (2014)

  23. Zhou, Y., Fenton, N., Hospedales, T., Neil, M.: Probabilistic graphical models parameter learning with transferred prior and constraints. In: Proceedings of the 31st conference on uncertainty in artificial intelligence, AUAI Press, pp. 972–981 (2015a)

  24. Zhou, Y., Howroyd, J., Danicic, S., Bishop, J.: Extending naive bayes classifier with hierarchy feature level information for record linkage. In: Suzuki, J., Ueno, M. (eds.) Advanced Methodologies for Bayesian Networks, Lecture Notes in Computer Science, vol. 9505, pp. 93–104. Springer, Berlin. doi:10.1007/978-3-319-28379-1_7 (2015b)

Author information

Corresponding author

Correspondence to Yun Zhou.

Additional information

The authors would like to thank the Tungsten Network for their financial support.

About this article

Cite this article

Zhou, Y., Wang, M., Haberland, V. et al. Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data. New Gener. Comput. 35, 87–104 (2017). https://doi.org/10.1007/s00354-016-0008-5

  • DOI: https://doi.org/10.1007/s00354-016-0008-5
