Abstract
Name variants that differ by more than a few characters can seriously hamper record linkage. A method is described by which variants of first names and surnames can be learned automatically from records that contain more information than is needed for a true link decision. Post-processing and limited manual intervention (active learning) are unavoidable, however, to distinguish errors in the original and the digitised data from genuine variants. The method is demonstrated on the basis of an analysis of 14.8 million records from the Dutch vital registration.
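The idea of learning variants from inexact high-confidence matches can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so the field names, the thresholds `min_agree` and `min_count`, and the toy records below are all assumptions made for illustration. Two records are treated as a confident match when enough name fields agree exactly; if exactly one field differs, the two differing values are counted as a candidate variant pair. Rarely observed pairs are set aside for manual review, since they may be transcription or digitisation errors rather than variants.

```python
from collections import Counter
from itertools import combinations

# Hypothetical name fields; real records would carry more context.
NAME_FIELDS = ["groom_first", "groom_surname", "bride_first", "bride_surname"]


def learn_variant_pairs(records, name_fields, min_agree):
    """Count candidate variant pairs from records that agree on at least
    `min_agree` name fields and differ on exactly one."""
    counts = Counter()
    # Naive all-pairs comparison (quadratic); a real data set would need blocking.
    for a, b in combinations(records, 2):
        differ = [f for f in name_fields if a[f] != b[f]]
        agree = len(name_fields) - len(differ)
        if agree >= min_agree and len(differ) == 1:
            field = differ[0]
            # Sort so that (x, y) and (y, x) are counted as the same pair.
            counts[tuple(sorted((a[field], b[field])))] += 1
    return counts


def split_for_review(counts, min_count):
    """Accept frequently observed pairs; send rare ones to manual review."""
    accepted = {pair for pair, n in counts.items() if n >= min_count}
    review = {pair for pair, n in counts.items() if n < min_count}
    return accepted, review


# Invented toy records, for illustration only.
records = [
    dict(zip(NAME_FIELDS, row))
    for row in [
        ("Jan", "Jansen", "Maria", "de Vries"),
        ("Johannes", "Jansen", "Maria", "de Vries"),
        ("Jan", "Jansen", "Maria", "de Vries"),
        ("Piet", "Bakker", "Anna", "Visser"),
        ("Piet", "Bakker", "Anna", "Vissser"),
    ]
]

counts = learn_variant_pairs(records, NAME_FIELDS, min_agree=3)
accepted, review = split_for_review(counts, min_count=2)
print("accepted:", accepted)
print("needs review:", review)
```

In this toy data the pair Jan/Johannes is observed twice and is accepted, while Visser/Vissser is observed once and is routed to review. A frequency threshold is only one possible way to separate variants from errors, and is used here purely as a stand-in for the post-processing and active learning the abstract mentions.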
Notes
1. A web-based query interface is available at https://familysearch.org/stdfinder/NameStandardLookup.jsp.
Acknowledgments
This work is part of the research programme LINKS (LINKing System for historical family reconstruction, http://www.iisg.nl/hsn/projects/links.html), which is financed by the Netherlands Organisation for Scientific Research (NWO), grant 640.004.804.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Bloothooft, G., Schraagen, M. (2015). Learning Name Variants from Inexact High-Confidence Matches. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_4
DOI: https://doi.org/10.1007/978-3-319-19884-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19883-5
Online ISBN: 978-3-319-19884-2
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)