Personal Names Popularity Estimation and Its Application to Record Linkage

Zhagorina, Ksenia; Braslavski, Pavel; Gusev, Vladimir

doi:10.1007/978-3-030-00063-9_9

Part of the book series: Communications in Computer and Information Science (CCIS, volume 909)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

In this study, we investigate several statistical techniques for estimating the popularity of personal names and perform a record linkage experiment guided by these estimates. The results show that name popularity can improve personal name matching in databases and may be of interest in many other domains.

The work was carried out while the authors were at Kontur Labs, the research department of SKB Kontur (https://kontur.ru/eng/). The authors acknowledge support from the Russian Ministry of Education and Science, project no. 1.3253.2017, and from the Competitiveness Enhancement Program of Ural Federal University.

Notes

1. http://egrul.nalog.ru/
2. Note that in this case C(x) corresponds to the number of persons bearing name x in S (not in \(S_{train}\), as in the equations above).
3. We also performed an experiment with first name and last name pairs, which showed similar behavior of the models. We do not report the results here for lack of space.

Author information

Authors and Affiliations

Yandex, Yekaterinburg, Russia
Ksenia Zhagorina
Ural Federal University, Yekaterinburg, Russia
Pavel Braslavski & Vladimir Gusev
JetBrains Research, Saint Petersburg, Russia
Pavel Braslavski

Corresponding author

Correspondence to Pavel Braslavski.

Editor information

Editors and Affiliations

Eötvös Loránd University, Budapest, Hungary
András Benczúr
Abt. Informatik, Universität Kiel, Kiel, Germany
Bernhard Thalheim
Eötvös Loránd University, Budapest, Hungary
Tomáš Horváth
Politecnico di Torino, Turin, Italy
Silvia Chiusano
Polytechnic University of Turin, Turin, Italy
Tania Cerquitelli
Hungarian Academy of Sciences, Budapest, Hungary
Csaba Sidló
University of Nebraska–Lincoln, Lincoln, NE, USA
Peter Z. Revesz

About this paper

Cite this paper

Zhagorina, K., Braslavski, P., Gusev, V. (2018). Personal Names Popularity Estimation and Its Application to Record Linkage. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_9

DOI: https://doi.org/10.1007/978-3-030-00063-9_9
Published: 31 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00062-2
Online ISBN: 978-3-030-00063-9
eBook Packages: Computer Science, Computer Science (R0)
