Semantic Feature Aggregation for Gender Identification in Russian Facebook

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 789)


The goal of the current work is to evaluate semantic feature aggregation techniques in a task of gender classification of public social media texts in Russian. We collect Facebook posts of Russian-speaking users and apply them as a dataset for two topic modelling techniques and a distributional clustering approach. The output of the algorithms is applied as a feature aggregation method in a task of gender classification based on a smaller Facebook sample. The classification performance of the best model is favorably compared against the lemmas baseline and the state-of-the-art results reported for a different genre or language. The resulting successful features are exemplified, and the difference between the three techniques in terms of classification performance and feature contents are discussed, with the best technique clearly outperforming the others.


Feature Aggregation Method Author Profiling Task Author-topic Model (ATM) Mal Language Feature Assembly Modeling 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



The authors acknowledge Saint-Petersburg State University for a research grant 8.38.351.2015. The reported study is also supported by RFBR grant 16-06-00529.


  1. 1.
    Aletras, N., Stevenson, M.: Labelling topics using unsupervised graph-based methods. In: ACL, vol. 2, pp. 631–636 (2014)Google Scholar
  2. 2.
    Álvarez-Carmona, M.A., López-Monroy, A.P., Montes-y-Gómez, M., Villaseñor-Pineda, L., Meza, I.: Evaluating topic-based representations for author profiling in social media. In: Montes-y-Gómez, M., Escalante, H.J., Segura, A., Murillo, J.D. (eds.) IBERAMIA 2016. LNCS (LNAI), vol. 10022, pp. 151–162. Springer, Cham (2016). Scholar
  3. 3.
    Amir, S., Coppersmith, G., Carvalho, P., Silva, M.J., Wallace, B.C.: Quantifying mental health from social media with neural user embeddings. arXiv preprint arXiv:1705.00335 (2017)
  4. 4.
    Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc.: Ser. B (Methodol.) 57(1), 289–300 (1995)MathSciNetzbMATHGoogle Scholar
  5. 5.
    Biemann, C.: Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In: Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pp. 73–80. Association for Computational Linguistics (2006)Google Scholar
  6. 6.
    Bird, S., Klein, E., Loper, E.: Natural Language Processing With Python: Analyzing Text With The Natural Language Toolkit. O’Reilly Media Inc, Sebastopol (2009)zbMATHGoogle Scholar
  7. 7.
    Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)zbMATHGoogle Scholar
  8. 8.
    Bogolyubova, O., Tikhonov, R., Ivanov, V., Panicheva, P., Ledovaya, Y.: Violence exposure, posttraumatic stress, and subjective well-being in a sample of russian adults: a facebook-based study. J. Interpersonal Violence 30, 1153–1167 (2017).
  9. 9.
    Ding, T., Pan, S., Bickel, W.K.: \(1 today or \)2 tomorrow? the answer is in your facebook likes. arXiv preprint arXiv:1703.07726 (2017)
  10. 10.
    Gliozzo, A., Biemann, C., Riedl, M., Coppola, B., Glass, M.R., Hatem, M.: Jobimtext visualizer: a graph-based approach to contextualizing distributional similarity. In: Graph-Based Methods for Natural Language Processing, p. 6 (2013)Google Scholar
  11. 11.
    Hulpus, I., Hayes, C., Karnstedt, M., Greene, D.: Unsupervised graph-based topic labelling using dbpedia. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 465–474. ACM (2013)Google Scholar
  12. 12.
    Jones, E., Oliphant, T., Peterson, P., et al.: SciPy: open source scientific tools for python (2001).
  13. 13.
    Korobov, M.: Morphological analyzer and generator for russian and ukrainian languages. In: Khachay, M.Y., Konstantinova, N., Panchenko, A., Ignatov, D.I., Labunets, V.G. (eds.) AIST 2015. CCIS, vol. 542, pp. 320–332. Springer, Cham (2015). Scholar
  14. 14.
    Kosinski, M., Matz, S.C., Gosling, S.D., Popov, V., Stillwell, D.: Facebook as a research tool for the social sciences: opportunities, challenges, ethical considerations, and practical guidelines. Am. Psychol. 70(6), 543 (2015)CrossRefGoogle Scholar
  15. 15.
    Kou, W., Li, F., Baldwin, T.: Automatic labelling of topic models using word vectors and letter trigram vectors. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds.) AIRS 2015. LNCS, vol. 9460, pp. 253–264. Springer, Cham (2015). Scholar
  16. 16.
    Kulkarni, V., Kern, M.L., Stillwell, D., Kosinski, M., Matz, S., Ungar, L., Skiena, S., Schwartz, H.A.: Latent human traits in the language of social media: an open-vocabulary approach (2017)Google Scholar
  17. 17.
    Kutuzov, A., Andreev, I.: Texts in, meaning out: neural language models in semantic similarity task for Russian. arXiv preprint arXiv:1504.08183 (2015)
  18. 18.
    Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 1536–1545. Association for Computational Linguistics (2011)Google Scholar
  19. 19.
    Lau, J.H., Newman, D., Karimi, S., Baldwin, T.: Best topic word selection for topic labelling. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 605–613. Association for Computational Linguistics (2010)Google Scholar
  20. 20.
    Litvinova, T., Seredin, P., Litvinova, O., Zagorovskaya, O., Sboev, A., Gudovskih, D., Moloshnikov, I., Rybka, R.: Gender prediction for authors of Russian texts using regression and classification techniques. In: CDUD 2016–The 3rd International Workshop on Concept Discovery in Unstructured Data, p. 44 (2016).
  21. 21.
    Lui, M., Baldwin, T.: Langid. py: an off-the-shelf language identification tool. In: Proceedings of the ACL 2012 System Demonstrations, pp. 25–30. Association for Computational Linguistics (2012)Google Scholar
  22. 22.
    Magatti, D., Calegari, S., Ciucci, D., Stella, F.: Automatic labeling of topics. In: Ninth International Conference on Intelligent Systems Design and Applications ISDA 2009, pp. 1227–1232. IEEE (2009)Google Scholar
  23. 23.
    Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving lda topic models for microblogs via tweet pooling and automatic labeling. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 889–892. ACM (2013)Google Scholar
  24. 24.
    Mei, Q., Shen, X., Zhai, C.: Automatic labeling of multinomial topic models. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 490–499. ACM (2007)Google Scholar
  25. 25.
    Mihalcea, R., Tarau, P.: Textrank: bringing order into texts. Association for Computational Linguistics (2004)Google Scholar
  26. 26.
    Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)Google Scholar
  27. 27.
    Mirzagitova, A., Mitrofanova, O.: Automatic assignment of labels in topic modelling for Russian corpora. In: Proceedings of 7th Tutorial and Research Workshop on Experimental Linguistics, ExLing, pp. 115–118 (2016)Google Scholar
  28. 28.
    Panchenko, A., Loukachevitch, N., Ustalov, D., Paperno, D., Meyer, C., Konstantinova, N.: Russe: the first workshop on Russian semantic similarity. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference. Dialogue, vol. 2, pp. 89–105 (2015)Google Scholar
  29. 29.
    Panicheva, P., Ledovaya, Y., Bogoliubova, O.: Revealing interpetable content correlates of the dark triad personality traits. In: Russian Summer School in Information Retrieval (2016)Google Scholar
  30. 30.
    Panicheva, P., Ledovaya, Y., Bogolyubova, O.: Lexical, morphological and semantic correlates of the dark triad personality traits in Russian facebook texts. In: Artificial Intelligence and Natural Language Conference (AINL) IEEE, pp. 1–8. IEEE (2016)Google Scholar
  31. 31.
    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetzbMATHGoogle Scholar
  32. 32.
    Pennebaker, J.W., Francis, M.E., Booth, R.J.: Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates 71 (2001)Google Scholar
  33. 33.
    Prince, S.J.: Computer Vision: Models, Learning and Inference. Cambridge University Press, Cambridge (2012)CrossRefGoogle Scholar
  34. 34.
    Rangel, F., Rosso, P., Potthast, M., Trenkmann, M., Stein, B., Verhoeven, B., Daeleman, W., et al.: Overview of the 2nd author profiling task at pan 2014. In: CEUR Workshop Proceedings, vol. 1180, pp. 898–927. CEUR Workshop Proceedings.
  35. 35.
    Rehurek, R., Sojka, P.: Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno (2011)Google Scholar
  36. 36.
    Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494. AUAI Press (2004)Google Scholar
  37. 37.
    Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E., et al.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE 8(9), e73791 (2013)CrossRefGoogle Scholar
  38. 38.
    Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems, pp. 649–657 (2015)Google Scholar
  39. 39.
    Zhiqiang, T., Wenting, W.: Dlirec: aspect term extraction and term polarity classification system. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014) (2014)Google Scholar

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. 1.St. Petersburg State UniversitySt. PetersburgRussia

Personalised recommendations