Detecting Gender by Full Name: Experiments with the Russian Language

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 436)

Abstract

This paper describes a method that detects the gender of a person from his or her full name. While several approaches have been proposed for English, little has been done so far for Russian. We fill this gap and present a large-scale experiment on a dataset of 100,000 Russian full names from Facebook. Our method is based on three types of features (word endings, character \(n\)-grams, and a dictionary of names) combined within a linear supervised model. Experiments show that this simple and computationally efficient approach achieves accuracy of up to 96%.
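As an illustration only, the three feature types named in the abstract can be sketched as a sparse feature extractor feeding a linear classifier. Everything specific here is an assumption and not taken from the paper: the toy name dictionary `NAME_DICT`, the six training names in `TOY_DATA`, the ending length (up to 3 characters), the n-gram orders (2 and 3), and the choice of plain logistic regression trained by stochastic gradient descent as the linear model.

```python
import math
from collections import Counter

# Hypothetical toy dictionary of first names (not the dictionary used in the paper).
NAME_DICT = {"анна": "female", "мария": "female", "иван": "male", "сергей": "male"}

# Hypothetical toy training set: label 1 = female, 0 = male.
TOY_DATA = [
    ("Анна Иванова", 1), ("Мария Петрова", 1), ("Ольга Смирнова", 1),
    ("Иван Иванов", 0), ("Сергей Петров", 0), ("Олег Смирнов", 0),
]

def extract_features(full_name, n_min=2, n_max=3, max_ending=3):
    """Map a full name to a sparse dict of binary features."""
    feats = {}
    for tok in full_name.lower().split():
        # 1) Word endings: the last 1..max_ending characters of each token.
        for k in range(1, min(max_ending, len(tok)) + 1):
            feats["end=" + tok[-k:]] = 1.0
        # 2) Character n-grams, with ^ and $ marking token boundaries.
        padded = "^" + tok + "$"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                feats["ng=" + padded[i:i + n]] = 1.0
        # 3) Dictionary lookup: fires only for tokens found in NAME_DICT.
        label = NAME_DICT.get(tok)
        if label is not None:
            feats["dict=" + label] = 1.0
    return feats

def score(weights, feats):
    """Linear score: dot product of the weight vector and the sparse features."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def train_logreg(samples, epochs=50, lr=0.5):
    """Fit logistic regression weights with plain SGD (no regularization)."""
    weights = Counter()
    for _ in range(epochs):
        for name, y in samples:
            feats = extract_features(name)
            # Clamp the score so math.exp cannot overflow.
            z = max(-30.0, min(30.0, score(weights, feats)))
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in feats.items():
                weights[f] += lr * (y - p) * v
    return weights

def predict(weights, full_name):
    """Return 'female' if the linear score is positive, otherwise 'male'."""
    return "female" if score(weights, extract_features(full_name)) > 0 else "male"

weights = train_logreg(TOY_DATA)
print(predict(weights, "Елена Кузнецова"))
print(predict(weights, "Дмитрий Кузнецов"))
```

On such a tiny training set the classifier leans mostly on surname endings such as "-ова" and "-ов"; the sketch shows only how the features combine in one linear score and says nothing about the accuracy reported above.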

Keywords

Gender detection · Short text classification

Notes

Acknowledgments

This research was supported by Digital Society Laboratory LLC. We thank Kirill Shileev, Sergei Objedkov, and three anonymous reviewers for their helpful comments, which significantly improved the quality of this paper.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Digital Society Laboratory LLC, Moscow, Russia
  2. Université catholique de Louvain, Louvain-la-Neuve, Belgium
