Learning Name Variants from Inexact High-Confidence Matches

Bloothooft, Gerrit; Schraagen, Marijn

doi:10.1007/978-3-319-19884-2_4

Chapter
First Online: 01 January 2015

Abstract

Name variants that differ by more than a few characters can seriously hamper record linkage. A method is described by which variants of first names and surnames can be learned automatically from records that contain more information than is needed for a true link decision. Post-processing and limited manual intervention (active learning) are unavoidable, however, to distinguish genuine variants from errors in the original and the digitised data. The method is demonstrated on an analysis of 14.8 million records from the Dutch vital registration.
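The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that two records count as a high-confidence match when they agree on all but one name field, and it uses a simple support count as a stand-in for the post-processing step. The field names, the `max_mismatches` and `min_support` parameters, and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names; real vital-registration records are laid out differently.
NAME_FIELDS = ("first_name", "surname", "father_first_name", "mother_first_name")


def candidate_variants(rec_a, rec_b, max_mismatches=1):
    """Return the name pairs that differ between two otherwise agreeing records.

    An empty list is returned when the records are identical or when more
    than `max_mismatches` fields differ (not a high-confidence match).
    """
    diffs = [
        tuple(sorted((rec_a[f], rec_b[f])))
        for f in NAME_FIELDS
        if rec_a[f] != rec_b[f]
    ]
    return diffs if 0 < len(diffs) <= max_mismatches else []


def learn_variants(records, min_support=2):
    """Count candidate pairs and split them into accepted and to-review sets."""
    counts = Counter()
    # All-pairs comparison is quadratic; acceptable only for a toy example.
    for rec_a, rec_b in combinations(records, 2):
        counts.update(candidate_variants(rec_a, rec_b))
    accepted = {pair: n for pair, n in counts.items() if n >= min_support}
    review = {pair: n for pair, n in counts.items() if n < min_support}
    return accepted, review


def make(first, sur, father, mother):
    """Build one toy record from four name strings."""
    return dict(zip(NAME_FIELDS, (first, sur, father, mother)))


# Invented sample data for illustration only.
records = [
    make("Johannes", "Jansen", "Pieter", "Maria"),
    make("Johannis", "Jansen", "Pieter", "Maria"),
    make("Klaas", "Visser", "Johannes", "Geertje"),
    make("Klaas", "Visser", "Johannis", "Geertje"),
    make("Anna", "Bakker", "Willem", "Grietje"),
    make("Anna", "Bakker", "Wlilem", "Grietje"),
]

accepted, review = learn_variants(records)
print("accepted:", accepted)
print("needs manual review:", review)
```

In this toy data the pair Johannes/Johannis is observed in two independent matches and is accepted, while Willem/Wlilem is seen only once and is routed to manual review, mirroring the distinction the abstract draws between variants and errors. A realistic pipeline would need blocking or indexing instead of all-pairs comparison, and a much richer decision rule.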

Notes

1. A web-based query interface is available at https://familysearch.org/stdfinder/NameStandardLookup.jsp.

Acknowledgments

This work is part of the research programme LINKS (LINKing System for historical family reconstruction, http://www.iisg.nl/hsn/projects/links.html), which is financed by the Netherlands Organisation for Scientific Research (NWO), grant 640.004.804.

Author information

Authors and Affiliations

Utrecht Institute of Linguistics-OTS, Utrecht University, Trans 10, 3512 JK, Utrecht, The Netherlands
Gerrit Bloothooft
Leiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands
Marijn Schraagen

Corresponding author

Correspondence to Gerrit Bloothooft.

Editor information

Editors and Affiliations

Utrecht University, Utrecht, The Netherlands
Gerrit Bloothooft
The Australian National University, Canberra, Australian Capital Territory, Australia
Peter Christen
International Institute of Social History, Amsterdam, The Netherlands
Kees Mandemakers
Leiden University, Leiden, The Netherlands
Marijn Schraagen

About this chapter

Cite this chapter

Bloothooft, G., Schraagen, M. (2015). Learning Name Variants from Inexact High-Confidence Matches. In: Bloothooft, G., Christen, P., Mandemakers, K., Schraagen, M. (eds) Population Reconstruction. Springer, Cham. https://doi.org/10.1007/978-3-319-19884-2_4

DOI: https://doi.org/10.1007/978-3-319-19884-2_4
Published: 23 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19883-5
Online ISBN: 978-3-319-19884-2
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
