Personal Names Popularity Estimation and Its Application to Record Linkage

  • Conference paper
New Trends in Databases and Information Systems (ADBIS 2018)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 909))

Abstract

In this study, we investigate several statistical techniques for estimating the popularity of personal names and perform a record linkage experiment guided by the resulting estimates. The results show that name popularity estimates can improve personal name matching in databases and may be of interest in many other domains.
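
To make the idea of popularity-guided matching concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' method: it estimates name popularity with simple additive smoothing (one of many possible estimators) and turns the estimate into an agreement weight, so that two records sharing a rare name count as stronger evidence of a match than two records sharing a common one. The function names, the `alpha` parameter, and the sample names are all hypothetical.

```python
import math
from collections import Counter


def fit_name_popularity(names, alpha=1.0):
    """Return a function mapping a name to its smoothed relative frequency.

    Assumption: additive smoothing with one extra vocabulary slot that
    stands in for every name not seen in the sample.
    """
    counts = Counter(names)
    total = sum(counts.values())
    vocab = len(counts) + 1  # distinct seen names plus one "unseen" slot

    def popularity(name):
        return (counts.get(name, 0) + alpha) / (total + alpha * vocab)

    return popularity


def agreement_weight(name, popularity):
    """Weight of two records agreeing on `name`: rarer names weigh more."""
    return -math.log2(popularity(name))


# Hypothetical toy sample of surnames.
sample = ["ivanov", "ivanov", "ivanov", "petrov", "petrov", "zhagorina"]
popularity = fit_name_popularity(sample)

for name in ("ivanov", "zhagorina", "unseen-name"):
    print(name, round(popularity(name), 3), round(agreement_weight(name, popularity), 3))
```

In a linkage pipeline, a weight of this kind could be added to the score of a candidate record pair whenever the names agree. How the estimates are actually used in the paper's experiment is not specified on this page.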

The work was carried out while the authors were at Kontur Labs, the research department of SKB Kontur, https://kontur.ru/eng/. The authors acknowledge support from the Russian Ministry of Education and Science, project no. 1.3253.2017, and the Competitiveness Enhancement Program of Ural Federal University.

Notes

  1. http://egrul.nalog.ru/.

  2. Note that in this case C(x) is the number of persons bearing name x in S (not in \(S_{train}\), as in the equations above).

  3. We also ran an experiment with first-last name pairs, in which the models behaved similarly. We do not report those results here for lack of space.

Author information

Corresponding author

Correspondence to Pavel Braslavski.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhagorina, K., Braslavski, P., Gusev, V. (2018). Personal Names Popularity Estimation and Its Application to Record Linkage. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_9

  • DOI: https://doi.org/10.1007/978-3-030-00063-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00062-2

  • Online ISBN: 978-3-030-00063-9

  • eBook Packages: Computer Science, Computer Science (R0)
