Detecting Gender by Full Name: Experiments with the Russian Language

Part of the Communications in Computer and Information Science book series (CCIS, volume 436)

Abstract

This paper describes a method that detects the gender of a person from his or her full name. While some approaches have been proposed for English, little has been done so far for Russian. We fill this gap and present a large-scale experiment on a dataset of 100,000 Russian full names from Facebook. Our method is based on three types of features (word endings, character \(n\)-grams and a dictionary of names) combined within a linear supervised model. Experiments show that this simple and computationally efficient approach achieves an accuracy of up to 96%.
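
The three feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy name dictionary, the six training names, the feature naming scheme, and the choice of logistic regression trained by plain stochastic gradient descent as the "linear supervised model" are all assumptions made here for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy dictionary of first names (assumption, not the paper's resource).
NAME_DICT = {"анна": "f", "мария": "f", "иван": "m", "сергей": "m"}

# Hypothetical toy training set: (full name, label), where 1 = female, 0 = male.
TOY_SAMPLES = [
    ("Иванова Анна", 1), ("Иванов Иван", 0),
    ("Петрова Мария", 1), ("Петров Сергей", 0),
    ("Смирнова Ольга", 1), ("Смирнов Алексей", 0),
]


def extract_features(full_name, n=2, max_ending=3):
    """Return the set of binary features that fire for a full name."""
    feats = set()
    for token in full_name.lower().split():
        # 1. Word endings: the last 1..max_ending characters of each token.
        for k in range(1, min(max_ending, len(token)) + 1):
            feats.add(f"end={token[-k:]}")
        # 2. Character n-grams, with ^ and $ marking the token boundaries.
        padded = f"^{token}$"
        for i in range(len(padded) - n + 1):
            feats.add(f"ng={padded[i:i + n]}")
        # 3. Dictionary lookup: the gender recorded for a known first name.
        if token in NAME_DICT:
            feats.add(f"dict={NAME_DICT[token]}")
    return feats


def train(samples, epochs=50, lr=0.1):
    """Fit logistic regression weights over binary features with plain SGD."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for name, label in samples:
            feats = extract_features(name)
            z = sum(weights[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "female"
            for f in feats:
                weights[f] += lr * (label - p)  # gradient step on the log-likelihood
    return weights


def predict(weights, full_name):
    """Classify by the sign of the linear score; unseen features contribute 0."""
    z = sum(weights.get(f, 0.0) for f in extract_features(full_name))
    return "female" if z > 0 else "male"


if __name__ == "__main__":
    w = train(TOY_SAMPLES)
    for name in ("Кузнецова Елена", "Кузнецов Дмитрий"):
        print(name, "->", predict(w, name))
```

Even on this toy data the ending features (for example the surname ending "-ова" versus "-ов") carry most of the signal for names absent from the dictionary, which is consistent with the abstract's emphasis on word endings. The accuracy reported in the paper was obtained on the authors' own dataset and cannot be reproduced from this sketch.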

Keywords

  • Gender detection
  • Short text classification

Notes

  1. http://www.yandex.com/

  2. http://www.mail.ru/

  3. http://research.digsolab.com/gender

  4. http://www.natcorp.ox.ac.uk/

  5. http://www.ruscorpora.ru/en/

  6. https://developers.facebook.com/tools/explorer

  7. http://scikit-learn.org/

  8. http://imena-list.ru/

  9. http://www.gramota.ru/slovari/info/ag/

  10. http://ru.wiktionary.org/wiki/

  11. http://ru.wikisource.org/wiki/

  12. Available at http://panchenko.me/gender/wiki-gender-dict.csv.

Acknowledgments

This research was supported by Digital Society Laboratory LLC. We thank Kirill Shileev, Sergei Objedkov and three anonymous reviewers for their helpful comments, which significantly improved the quality of this paper.

Author information

Corresponding author

Correspondence to Alexander Panchenko.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Panchenko, A., Teterin, A. (2014). Detecting Gender by Full Name: Experiments with the Russian Language. In: Ignatov, D., Khachay, M., Panchenko, A., Konstantinova, N., Yavorsky, R. (eds) Analysis of Images, Social Networks and Texts. AIST 2014. Communications in Computer and Information Science, vol 436. Springer, Cham. https://doi.org/10.1007/978-3-319-12580-0_17

  • DOI: https://doi.org/10.1007/978-3-319-12580-0_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12579-4

  • Online ISBN: 978-3-319-12580-0

  • eBook Packages: Computer Science, Computer Science (R0)