Abstract
Predicting whether two names refer to the same entity is an important task in many domains, such as information retrieval, record linkage, and data integration. In this paper, we propose creating name embeddings with a Doc2Vec methodology, in which each name is viewed as a document and each letter in the name is treated as a word. Our hypothesis is that representing names as documents, with letters as words, helps capture the internal structure of names and the relationships among letters. We present and discuss an experimental study in which we explore the effect of various parameters and assess the stability of the models built for embedding names. Our results show that the proposed method predicts with high accuracy whether a pair of names matches.
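The "name as document, letter as word" idea can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (name_to_tokens, build_corpus, cosine_similarity, train_name_embeddings) and the hyperparameter values are assumptions chosen for illustration, and the training step assumes the gensim library (version 4 or later) is installed.

```python
import math
from typing import List, Sequence, Tuple


def name_to_tokens(name: str) -> List[str]:
    """Treat a name as a document whose 'words' are its individual letters."""
    return [ch for ch in name.lower() if not ch.isspace()]


def build_corpus(names: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Pair each name's letter tokens with the name itself as the document tag."""
    return [(name_to_tokens(n), n) for n in names]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; ranges over [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train_name_embeddings(names, vector_size=32, window=2, epochs=40):
    """Fit a Doc2Vec model on letter-tokenised names.

    Hyperparameter defaults are illustrative assumptions, not values
    taken from the paper. Requires gensim >= 4.
    """
    # Imported lazily so the helpers above work without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [
        TaggedDocument(words=tokens, tags=[tag])
        for tokens, tag in build_corpus(names)
    ]
    model = Doc2Vec(vector_size=vector_size, window=window,
                    min_count=1, epochs=epochs)
    model.build_vocab(docs)
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

With a trained model, a pair of names could be scored by inferring a vector for each, for example model.infer_vector(name_to_tokens("jon")), and comparing the two vectors with cosine_similarity. Because embedding components can be negative, the raw score may fall below zero, so any match threshold would need to be tuned on labelled name pairs.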
Notes
- 1. This is due to the fact that some of the vectors could have negative values as well.
References
Name2Vec implementation and results (2019). https://github.com/foxcroftjn/CanAI-Name2Vec
Antonie, L., Inwood, K., Lizotte, D.J., Ross, J.A.: Tracking people over time in 19th century Canada for longitudinal analysis. Mach. Learn. 95(1), 129–146 (2014)
Carvalho, V.R., Kiran, Y., Borthwick, A.: The Intelius nickname collection: quantitative analyses from billions of public records. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 607–610 (2012)
Christen, P.: A comparison of personal name matching: techniques and practical issues. In: Proceedings of IEEE International Conference on Data Mining - Workshops, pp. 290–294 (2006)
Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string metrics for matching names and records. In: KDD Workshop on Data Cleaning and Object Consolidation, vol. 3, pp. 73–78 (2003)
Jaro, M.A.: Probabilistic linkage of large public health data files. Stat. Med. 14(5–7), 491–498 (1995)
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Müller, M.-C.: Semantic author name disambiguation with word embeddings. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L., Karydis, I. (eds.) TPDL 2017. LNCS, vol. 10450, pp. 300–311. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67008-9_24
Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50, May 2010
Sim, A., Borthwick, A.: Record2Vec: unsupervised representation learning for structured records. In: IEEE International Conference on Data Mining, ICDM 2018, Singapore, 17–20 November 2018, pp. 1236–1241 (2018)
Sukharev, J., Zhukov, L., Popescul, A.: Parallel corpus approach for name matching in record linkage. In: Proceedings of IEEE International Conference on Data Mining, ICDM, pp. 995–1000 (2014)
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Foxcroft, J., d’Alessandro, A., Antonie, L. (2019). Name2Vec: Personal Names Embeddings. In: Meurs, MJ., Rudzicz, F. (eds) Advances in Artificial Intelligence. Canadian AI 2019. Lecture Notes in Computer Science, vol 11489. Springer, Cham. https://doi.org/10.1007/978-3-030-18305-9_52
DOI: https://doi.org/10.1007/978-3-030-18305-9_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18304-2
Online ISBN: 978-3-030-18305-9
eBook Packages: Computer Science, Computer Science (R0)