Abstract
Probabilistic record linkage has been well investigated in recent years. The Fellegi-Sunter probabilistic record linkage and its enhanced version are commonly used methods, which calculate match and non-match weights for each pair of corresponding fields of record-pairs. Bayesian network classifiers – naive Bayes classifier and TAN have also been successfully used here. Very recently, an extended version of TAN (called ETAN) has been developed and proved superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN in record linkage and investigated the benefits of using a naturally existing hierarchy feature level information. In this work, we extend the naive Bayes classifier with such information. Finally we apply all the methods to four datasets and estimate the \(F_1\) scores.
Y. Zhou—The authors would like to thank the Tungsten Network for their financial support.
Similar content being viewed by others
Notes
- 1.
Note in conventional PRL-FS method [3], two fields are either matched or unmatched. Thus the k of \(m_{k,i}\) can be omitted in this case.
- 2.
Here \(\overrightarrow{\varvec{f}} = \{ f_i | f_i = I_k, i = 1,...,n \}\) contains n elements, whose values indicate the distances between two records on specific fields, \(I_k\) is the state/interval discretised from \(cf(a_i,b_i)\).
- 3.
These datasets can be found at http://yzhou.github.io/.
- 4.
Because the phone number is unique for each restaurant, it, on its own, can be used to identify duplicates without the need to resort to probabilistic record linkage techniques. Thus, this field is not used in our experiments.
- 5.
In each dataset, we only introduce one hierarchy restriction between the name and address fields.
References
Dunn, H.L.: Record linkage*. Am. J. Public Health Nations Health 36(12), 1412–1416 (1946)
Tromp, M., Ravelli, A.C., Bonsel, G.J., Hasman, A., Reitsma, J.B.: Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J. Clin. Epidemiol. 64(5), 565–572 (2011)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Winkler, W.E.: The state of record linkage and current research problems. In: Statistical Research Division, US Census Bureau, Citeseer (1999)
Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In: Proceedings of the Section on Survey Research, pp. 354–359 (1990)
Li, X., Guttmann, A., Cipiere, S., Maigne, L., Demongeot, J., Boire, J.Y., Ouchchane, L.: Implementation of an extended Fellegi-Sunter probabilistic record linkage method using the Jaro-Winkler string comparator. In: 2014 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), pp. 375–379. IEEE (2014)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 454–461. AUAI Press (2004)
de Campos, C.P., Zaffalon, M., Corani, G., Cuccu, M.: Extended tree augmented naive classifier. In: van der Gaag, L.C., Feelders, A.J. (eds.) PGM 2014. LNCS, vol. 8754, pp. 176–189. Springer, Heidelberg (2014)
de Campos, C.P., Corani, G., Scanagatta, M., Cuccu, M., Zaffalon, M.: Learning extended tree augmented naive structures. Int. J. Approximate Reasoning 68, 153–163 (2016)
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 586–597. VLDB Endowment (2002)
Leitao, L., Calado, P., Herschel, M.: Efficient and effective duplicate detection in hierarchical data. IEEE Trans. Knowl. Data Eng. 25(5), 1028–1041 (2013)
Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)
Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2–3), 131–163 (1997)
Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. ACM SIGKDD Explor. Newslett. 11(1), 10–18 (2009)
Zhou, Y., Fenton, N., Neil, M.: Bayesian network approach to multinomial parameter learning using data and expert judgments. Int. J. Approximate Reasoning 55(5), 1252–1268 (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhou, Y., Howroyd, J., Danicic, S., Bishop, J.M. (2015). Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage. In: Suzuki, J., Ueno, M. (eds) Advanced Methodologies for Bayesian Networks. AMBN 2015. Lecture Notes in Computer Science(), vol 9505. Springer, Cham. https://doi.org/10.1007/978-3-319-28379-1_7
Download citation
DOI: https://doi.org/10.1007/978-3-319-28379-1_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28378-4
Online ISBN: 978-3-319-28379-1
eBook Packages: Computer ScienceComputer Science (R0)