Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data

Zhou, Yun; Wang, Minlue; Haberland, Valeriia; Howroyd, John; Danicic, Sebastian; Bishop, J. Mark

doi:10.1007/s00354-016-0008-5

Special Feature
Published: 10 January 2017

Volume 35, pages 87–104, (2017)
New Generation Computing

Yun Zhou¹,
Minlue Wang²,
Valeriia Haberland²,
John Howroyd²,
Sebastian Danicic² &
…
J. Mark Bishop²

Abstract

Probabilistic record linkage is a well-established topic in the literature. Fellegi–Sunter probabilistic record linkage and its enhanced versions are commonly used methods, which calculate match and non-match weights for each pair of records. Bayesian network classifiers, namely the naive Bayes classifier and the tree augmented naive Bayes classifier (TAN), have also been used successfully for this task. Recently, an extended version of TAN (called ETAN) has been developed and shown to be superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage or investigated the benefits of using the naturally existing hierarchical feature level information and parsed fields of the datasets. In this work, we extend the naive Bayes classifier with such hierarchical feature level information. Finally, we illustrate the benefits of our method over previously proposed methods on four datasets in terms of linkage performance (\(F_1\) score). We also show that the results can be further improved by additionally parsing the fields of these datasets.
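As a rough illustration of the quantities named in the abstract, the sketch below computes Fellegi–Sunter style match and non-match weights for a record pair under the conventional binary agree/disagree comparison, together with the \(F_1\) score. It is not the authors' implementation: the function names and the m- and u-probabilities are hypothetical values chosen for the example, and fields are assumed to be conditionally independent.

```python
import math


def field_weight(agrees, m, u):
    """Fellegi-Sunter weight for one field.

    m: P(field agrees | records match)
    u: P(field agrees | records do not match)
    Agreement contributes log2(m/u); disagreement contributes
    log2((1-m)/(1-u)).
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def record_pair_weight(agreements, m_probs, u_probs):
    """Sum the per-field weights (assumes independent fields)."""
    return sum(
        field_weight(a, m, u)
        for a, m, u in zip(agreements, m_probs, u_probs)
    )


def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical m/u values for two fields, e.g. name and address.
m_probs = [0.9, 0.8]
u_probs = [0.1, 0.2]
score = record_pair_weight([True, False], m_probs, u_probs)
```

A pair whose total weight exceeds a chosen upper threshold would be declared a match, and one below a lower threshold a non-match; the thresholds themselves are a separate design choice not shown here.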

Notes

Note that in the conventional PRL-FS method [8], two fields are either matched or unmatched. Thus, the index k of \(m_{k,i}\) can be omitted in this case.
These datasets can be found at http://yzhou.github.io/.
Because the phone number is unique for each restaurant, it, on its own, can be used to identify duplicates without the need to resort to probabilistic record linkage techniques. Thus, this field is not used in our experiments.
In each dataset, we introduce only one hierarchical restriction, between the name and address fields.

References

[1] Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of the 28th international conference on Very Large Data Bases, VLDB Endowment, pp. 586–597 (2002)
[2] de Campos, C.P., Cuccu, M., Corani, G., Zaffalon, M.: Extended tree augmented naive classifier. In: van der Gaag, L.C., Feelders, A.J. (eds.) Probabilistic Graphical Models, pp. 176–189. Springer, Berlin (2014)
[3] de Campos, C.P., Corani, G., Scanagatta, M., Cuccu, M., Zaffalon, M.: Learning extended tree augmented naive structures. Int. J. Approx. Reason. 68, 153–163 (2016)
[4] Christen, P., Belacic, D.: Automated probabilistic address standardisation and verification. In: Australasian Data Mining Conference (AusDM05), pp. 53–67 (2005)
[5] Churches, T., Christen, P., Lim, K., Zhu, J.X.: Preparation of name and address data for record linkage using hidden Markov models. BMC Med. Inf. Decis. Making 2(1), 1 (2002)
[6] Dunn, H.L.: Record linkage*. Am. J. Public Health Nations Health 36(12), 1412–1416 (1946)
[7] Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
[8] Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
[9] Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2–3), 131–163 (1997)
[10] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
[11] Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. 20(3), 197–243 (1995)
[12] Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)
[13] Köpcke, H., Rahm, E.: Frameworks for entity matching: a comparison. Data Knowl. Eng. 69(2), 197–210 (2010)
[14] Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Proceedings of the sixteenth ACM conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’07, pp. 293–302 (2007)
[15] Leitão, L., Calado, P., Herschel, M.: Efficient and effective duplicate detection in hierarchical data. IEEE Trans. Knowl. Data Eng. 25(5), 1028–1041 (2013)
[16] Li, X., Guttmann, A., Cipiere, S., Maigne, L., Demongeot, J., Boire, J.Y., Ouchchane, L.: Implementation of an extended Fellegi–Sunter probabilistic record linkage method using the Jaro–Winkler string comparator. In: 2014 IEEE-EMBS international conference on biomedical and health informatics (BHI), IEEE, pp. 375–379 (2014)
[17] Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
[18] Ravikumar, P., Cohen, W.W.: A hierarchical graphical model for record linkage. In: Proceedings of the 20th conference on uncertainty in artificial intelligence, AUAI Press, pp. 454–461 (2004)
[19] Tromp, M., Ravelli, A.C., Bonsel, G.J., Hasman, A., Reitsma, J.B.: Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J. Clin. Epidemiol. 64(5), 565–572 (2011)
[20] Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage. In: Proceedings of the section on survey research, pp. 354–359 (1990)
[21] Winkler, W.E.: The state of record linkage and current research problems. In: Statistical research division, US Census Bureau, Citeseer (1999)
[22] Zhou, Y., Fenton, N., Neil, M.: Bayesian network approach to multinomial parameter learning using data and expert judgments. Int. J. Approx. Reason. 55(5), 1252–1268 (2014)
[23] Zhou, Y., Fenton, N., Hospedales, T., Neil, M.: Probabilistic graphical models parameter learning with transferred prior and constraints. In: Proceedings of the 31st conference on uncertainty in artificial intelligence, AUAI Press, pp. 972–981 (2015a)
[24] Zhou, Y., Howroyd, J., Danicic, S., Bishop, J.: Extending naive bayes classifier with hierarchy feature level information for record linkage. In: Suzuki, J., Ueno, M. (eds.) Advanced Methodologies for Bayesian Networks, Lecture Notes in Computer Science, vol. 9505, pp. 93–104. Springer, Berlin. doi:10.1007/978-3-319-28379-1_7 (2015b)

Author information

Authors and Affiliations

Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, China
Yun Zhou
Tungsten Centre for Intelligent Data Analytics (TCIDA), Goldsmiths, University of London, London, UK
Minlue Wang, Valeriia Haberland, John Howroyd, Sebastian Danicic & J. Mark Bishop

Corresponding author

Correspondence to Yun Zhou.

Additional information

The authors would like to thank the Tungsten Network for their financial support.

About this article

Cite this article

Zhou, Y., Wang, M., Haberland, V. et al. Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data. New Gener. Comput. 35, 87–104 (2017). https://doi.org/10.1007/s00354-016-0008-5

Received: 12 March 2016
Accepted: 27 July 2016
Published: 10 January 2017
Issue Date: January 2017
DOI: https://doi.org/10.1007/s00354-016-0008-5
