
Profiling the Age of Russian Bloggers

  • Tatiana Litvinova
  • Alexandr Sboev
  • Polina Panicheva
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 930)

Abstract

Predicting the demographics of social media users, bloggers, and authors of other online texts is important for applications such as marketing and security. However, most work on authorship profiling addresses author gender prediction. In addition, most studies use English-language corpora, and very little work has been done in this area for Russian. Filling this gap extends multilingual insights into age-specific linguistic features and is a step towards online security management in social networks. We present the first age-annotated dataset in Russian. The dataset contains blogs by 1260 authors from LiveJournal and is balanced by both age group and gender of the author. We perform age classification experiments (for the age groups 20–30, 30–40, and 40–50) on this data using basic linguistic features (lemmas, part-of-speech unigrams and bigrams, etc.) and obtain a solid baseline for age classification in Russian. We also treat age as a continuous variable and build regression models to predict it. Finally, we analyze the significant features and interpret them where possible.
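
As a rough illustration of the kind of features the abstract mentions, the following minimal sketch counts unigrams and bigrams over a token sequence (lemmas or part-of-speech tags) and maps an age to one of the three groups. It is not the authors' pipeline: the function names `age_group` and `ngram_counts`, the half-open bin boundaries, and the sample tag sequence are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

# Assumption: half-open bins [20, 30), [30, 40), [40, 50). The abstract only
# names the groups 20-30, 30-40 and 40-50 and does not say where boundary
# ages are assigned.
AGE_GROUPS: List[Tuple[int, int]] = [(20, 30), (30, 40), (40, 50)]


def age_group(age: int) -> Optional[str]:
    """Map an age in years to a group label, or None if it is out of range."""
    for low, high in AGE_GROUPS:
        if low <= age < high:
            return f"{low}-{high}"
    return None


def ngram_counts(tokens: Sequence[str],
                 n_values: Sequence[int] = (1, 2)) -> Dict[str, int]:
    """Count n-grams (unigrams and bigrams by default) over a token sequence.

    The tokens can be lemmas or part-of-speech tags; each n-gram is joined
    with an underscore to form a feature name.
    """
    counts: Counter = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return dict(counts)


# Hypothetical, hand-written POS tags for a four-word sentence.
pos_tags = ["PRON", "VERB", "ADJ", "NOUN"]
features = ngram_counts(pos_tags)
```

Feature dictionaries of this shape could then be vectorized and passed to a classifier for the age groups or to a regression model for age in years; which models and tag set the paper actually uses is not stated in the abstract.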

Keywords

  • Authorship profiling
  • Age prediction
  • Russian language
  • Text classification

Notes

Acknowledgment

Funding of the project “Identifying the Gender and Age of Online Chatters Using Formal Parameters of their Texts” from the Russian Science Foundation (no. 16-18-10050) is gratefully acknowledged.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Voronezh State Pedagogical University, Voronezh, Russia
  2. National Research Center “Kurchatov Institute”, Moscow, Russia
