Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage

  • Conference paper
Advanced Methodologies for Bayesian Networks (AMBN 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9505)

Abstract

Probabilistic record linkage has been well investigated in recent years. The Fellegi-Sunter probabilistic record linkage method and its enhanced version are commonly used; they calculate match and non-match weights for each pair of corresponding fields of record-pairs. Bayesian network classifiers, namely the naive Bayes classifier and the tree augmented naive Bayes classifier (TAN), have also been used successfully for this task. Very recently, an extended version of TAN (called ETAN) has been developed and shown to be superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage or investigated the benefits of using naturally existing hierarchical feature-level information. In this work, we extend the naive Bayes classifier with such information. Finally, we apply all the methods to four datasets and estimate their \(F_1\) scores.
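
As a rough illustration of the match and non-match weights mentioned in the abstract, the sketch below uses the standard Fellegi-Sunter form, where m is the probability that a field agrees given a true match and u is the probability that it agrees given a non-match. The function names, field names and probability values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def fs_weights(m: float, u: float) -> tuple[float, float]:
    """Return (agreement weight, disagreement weight) in bits for one field.

    m = P(field agrees | records match), u = P(field agrees | records do not match).
    """
    if not (0.0 < m < 1.0 and 0.0 < u < 1.0):
        raise ValueError("m and u must lie strictly between 0 and 1")
    agree = math.log2(m / u)
    disagree = math.log2((1.0 - m) / (1.0 - u))
    return agree, disagree


def pair_score(agreements: dict[str, bool],
               params: dict[str, tuple[float, float]]) -> float:
    """Sum the per-field weights for one record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]
        agree_w, disagree_w = fs_weights(m, u)
        total += agree_w if agrees else disagree_w
    return total


# Hypothetical (m, u) probabilities for two fields; assumed values only.
PARAMS = {"name": (0.9, 0.1), "address": (0.8, 0.2)}

if __name__ == "__main__":
    print(pair_score({"name": True, "address": True}, PARAMS))
    print(pair_score({"name": True, "address": False}, PARAMS))
```

A pair is then classified by comparing the summed score against thresholds; how those thresholds are chosen is outside the scope of this sketch.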

Y. Zhou: The authors would like to thank the Tungsten Network for their financial support.

Notes

  1. Note that in the conventional PRL-FS method [3], two fields are either matched or unmatched, so the index k of \(m_{k,i}\) can be omitted in this case.

  2. Here \(\overrightarrow{\boldsymbol{f}} = \{ f_i \mid f_i = I_k,\ i = 1,\ldots,n \}\) contains n elements whose values indicate the distances between two records on specific fields; \(I_k\) is the state/interval discretised from \(cf(a_i,b_i)\).

  3. These datasets can be found at http://yzhou.github.io/.

  4. Because the phone number is unique for each restaurant, it can identify duplicates on its own, without resorting to probabilistic record linkage techniques. This field is therefore not used in our experiments.

  5. In each dataset, we introduce only one hierarchy restriction, between the name and address fields.
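
Note 2 describes a feature vector whose entries are states discretised from a field comparison function \(cf(a_i,b_i)\). A minimal sketch of that idea follows, under loud assumptions: the comparison function here is a stand-in built on Python's `difflib.SequenceMatcher` (a similarity in [0, 1], not necessarily the comparator used in the paper), and the cut points, field names and helper names are hypothetical.

```python
from bisect import bisect_right
from difflib import SequenceMatcher

# Assumed cut points: [0, 0.5) -> state 0, [0.5, 0.8) -> state 1, [0.8, 1] -> state 2.
CUTS = (0.5, 0.8)


def cf(a: str, b: str) -> float:
    """Stand-in comparison function: case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discretise(score: float, cuts: tuple[float, ...] = CUTS) -> int:
    """Map a comparison score to the index k of the interval I_k it falls in."""
    return bisect_right(cuts, score)


def feature_vector(rec_a: dict[str, str], rec_b: dict[str, str],
                   fields: list[str]) -> list[int]:
    """Build one discretised feature per compared field of a record pair."""
    return [discretise(cf(rec_a[f], rec_b[f])) for f in fields]


if __name__ == "__main__":
    # Made-up records, for illustration only.
    a = {"name": "Cafe Roma", "address": "12 Main Street"}
    b = {"name": "Caffe Roma", "address": "12 Main St."}
    print(feature_vector(a, b, ["name", "address"]))
```

The resulting integer states can be fed to any discrete classifier, for example a naive Bayes model with one categorical feature per field.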

References

  1. Dunn, H.L.: Record linkage. Am. J. Public Health Nations Health 36(12), 1412–1416 (1946)

  2. Tromp, M., Ravelli, A.C., Bonsel, G.J., Hasman, A., Reitsma, J.B.: Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J. Clin. Epidemiol. 64(5), 565–572 (2011)

  3. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

  4. Winkler, W.E.: The state of record linkage and current research problems. In: Statistical Research Division, US Census Bureau, Citeseer (1999)

  5. Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research, pp. 354–359 (1990)

  6. Li, X., Guttmann, A., Cipiere, S., Maigne, L., Demongeot, J., Boire, J.Y., Ouchchane, L.: Implementation of an extended Fellegi-Sunter probabilistic record linkage method using the Jaro-Winkler string comparator. In: 2014 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pp. 375–379. IEEE (2014)

  7. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)

  8. Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 454–461. AUAI Press (2004)

  9. de Campos, C.P., Zaffalon, M., Corani, G., Cuccu, M.: Extended tree augmented naive classifier. In: van der Gaag, L.C., Feelders, A.J. (eds.) PGM 2014. LNCS, vol. 8754, pp. 176–189. Springer, Heidelberg (2014)

  10. de Campos, C.P., Corani, G., Scanagatta, M., Cuccu, M., Zaffalon, M.: Learning extended tree augmented naive structures. Int. J. Approximate Reasoning 68, 153–163 (2016)

  11. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 586–597. VLDB Endowment (2002)

  12. Leitao, L., Calado, P., Herschel, M.: Efficient and effective duplicate detection in hierarchical data. IEEE Trans. Knowl. Data Eng. 25(5), 1028–1041 (2013)

  13. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)

  14. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2–3), 131–163 (1997)

  15. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)

  16. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newslett. 11(1), 10–18 (2009)

  17. Zhou, Y., Fenton, N., Neil, M.: Bayesian network approach to multinomial parameter learning using data and expert judgments. Int. J. Approximate Reasoning 55(5), 1252–1268 (2014)

Corresponding author

Correspondence to Yun Zhou.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhou, Y., Howroyd, J., Danicic, S., Bishop, J.M. (2015). Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage. In: Suzuki, J., Ueno, M. (eds) Advanced Methodologies for Bayesian Networks. AMBN 2015. Lecture Notes in Computer Science, vol 9505. Springer, Cham. https://doi.org/10.1007/978-3-319-28379-1_7

  • DOI: https://doi.org/10.1007/978-3-319-28379-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-28378-4

  • Online ISBN: 978-3-319-28379-1

  • eBook Packages: Computer Science, Computer Science (R0)
