Detecting Gender by Full Name: Experiments with the Russian Language

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 436)

Abstract

This paper describes a method for detecting a person's gender from their full name. While several approaches have been proposed for English, little has been done so far for Russian. We fill this gap and present a large-scale experiment on a dataset of 100,000 Russian full names from Facebook. Our method is based on three types of features (word endings, character \(n\)-grams, and a dictionary of names) combined within a linear supervised model. Experiments show that the proposed approach, which is simple and computationally efficient, achieves an accuracy of up to 96%.
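
The three feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the feature parameters or the exact linear model, so the code assumes endings of 1 to 3 characters, character 2- and 3-grams with boundary markers, a tiny hypothetical name dictionary (`NAME_DICT`), and logistic regression trained by plain stochastic gradient descent. The training names are made-up toy examples, not the Facebook dataset.

```python
import math
from collections import defaultdict

# Hypothetical toy dictionary of first names (NOT the resource used in the paper).
NAME_DICT = {
    "иван": "male", "алексей": "male", "дмитрий": "male", "сергей": "male",
    "анна": "female", "мария": "female", "елена": "female", "ольга": "female",
}

def extract_features(full_name, max_ending=3, n_min=2, n_max=3):
    """Return a set of binary feature names for one full name."""
    feats = set()
    for token in full_name.lower().split():
        # 1) Word endings: the last 1..max_ending characters of the token.
        for k in range(1, min(max_ending, len(token)) + 1):
            feats.add("end=" + token[-k:])
        # 2) Character n-grams over the token, with ^ and $ as boundary markers.
        padded = "^" + token + "$"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                feats.add("ng=" + padded[i:i + n])
        # 3) Dictionary lookup of known first names.
        if token in NAME_DICT:
            feats.add("dict=" + NAME_DICT[token])
    return feats

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(examples, epochs=100, lr=0.1):
    """Fit logistic regression by SGD; label 1 = female, 0 = male (assumed encoding)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for name, label in examples:
            feats = extract_features(name)
            p = sigmoid(sum(weights[f] for f in feats))
            for f in feats:
                weights[f] += lr * (label - p)  # gradient step on the log-likelihood
    return weights

def predict(weights, full_name):
    """Classify a full name with the learned linear model."""
    score = sum(weights.get(f, 0.0) for f in extract_features(full_name))
    return "female" if score > 0 else "male"

# Made-up toy training data for illustration only.
TRAIN = [
    ("Иванов Иван", 0), ("Иванова Анна", 1),
    ("Смирнов Алексей", 0), ("Смирнова Мария", 1),
    ("Попов Дмитрий", 0), ("Попова Елена", 1),
    ("Кузнецов Сергей", 0), ("Кузнецова Ольга", 1),
]

weights = train(TRAIN)
# Names not seen in training; the surname ending carries most of the signal here.
print(predict(weights, "Петрова Татьяна"))
print(predict(weights, "Петров Николай"))
```

In this sketch all features are binary indicators and the model is a single weight vector, so prediction costs one pass over a few dozen features per name, which is in line with the abstract's emphasis on computational efficiency.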

Keywords

Gender detection, Short text classification

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. Digital Society Laboratory LLC, Moscow, Russia
  2. Université catholique de Louvain, Louvain-la-Neuve, Belgium
