Abstract
In this study, we investigate several statistical techniques for estimating the popularity of personal names and perform a record linkage experiment guided by the resulting popularity estimates. The results show that name popularity estimates can improve personal name matching in databases and may be of interest in many other domains.
The work was carried out while the authors were at Kontur Labs, the research department of SKB Kontur, https://kontur.ru/eng/. The authors were supported by the Russian Ministry of Education and Science, project no. 1.3253.2017, and the Competitiveness Enhancement Program of Ural Federal University.
Notes
- 1.
- 2. Note that in this case C(x) corresponds to the number of persons bearing name x in S (not in \(S_{train}\), as in the equations above).
- 3. We also performed an experiment with first-last name pairs, which showed similar behavior of the models. We do not report the results here due to limited space.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhagorina, K., Braslavski, P., Gusev, V. (2018). Personal Names Popularity Estimation and Its Application to Record Linkage. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_9
DOI: https://doi.org/10.1007/978-3-030-00063-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00062-2
Online ISBN: 978-3-030-00063-9
eBook Packages: Computer Science, Computer Science (R0)