Profiling the Age of Russian Bloggers

  • Conference paper
Artificial Intelligence and Natural Language (AINL 2018)

Abstract

Predicting the demographics of social media users, bloggers and authors of other types of online texts is crucial for marketing, security and other applications. However, most papers in authorship profiling deal with author gender prediction. In addition, most studies are performed on English-language corpora, and very little work in this area has been done for the Russian language. Filling this gap will extend multilingual insights into age-specific linguistic features and will provide a crucial step towards online security management in social networks. We present the first age-annotated dataset in Russian. The dataset contains blogs of 1260 authors from LiveJournal and is balanced by both age group and gender of the author. We perform age classification experiments (for age groups 20–30, 30–40, 40–50) on the presented data using basic linguistic features (lemmas, part-of-speech unigrams and bigrams, etc.) and obtain a considerable baseline in age classification for Russian. We also treat age as a continuous variable and build regression models to predict it. Finally, we analyze the significant features and provide an interpretation where possible.
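The feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the half-open bin edges for the age groups, and the toy tag sequence are all assumptions made for illustration, and the abstract does not say how boundary ages such as 30 or 40 are assigned. The sketch only shows how part-of-speech unigram and bigram frequencies could be computed for one text and how an age could be mapped to one of the three groups.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical bin edges: the abstract names the groups 20-30, 30-40, 40-50
# but does not say which group a boundary age (30 or 40) falls into.
# Half-open intervals [low, high) are an assumption made for this sketch.
AGE_GROUPS = [(20, 30), (30, 40), (40, 50)]


def age_group(age: int) -> Optional[str]:
    """Map an age to a group label, or None if it is outside all groups."""
    for low, high in AGE_GROUPS:
        if low <= age < high:
            return f"{low}-{high}"
    return None


def ngram_counts(tags: Iterable[str], n: int) -> Counter:
    """Count contiguous n-grams over a sequence of tags (or lemmas)."""
    seq = list(tags)
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def pos_features(tags: Iterable[str]) -> dict:
    """Relative frequencies of POS unigrams and bigrams for one text."""
    seq = list(tags)
    features = {}
    for n in (1, 2):
        counts = ngram_counts(seq, n)
        total = sum(counts.values())
        for gram, count in counts.items():
            # Join the tags of an n-gram into a single feature name.
            features["_".join(gram)] = count / total
    return features


# Toy, made-up tag sequence (not taken from the dataset).
example_tags = ["NOUN", "VERB", "NOUN", "VERB", "ADJ"]
features = pos_features(example_tags)
```

Feature dictionaries of this kind could then be passed to any standard classifier for the age-group task or to a regressor when age is treated as a continuous variable; which models the authors used is not stated in the abstract.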

Notes

  1. https://pan.webis.de/index.html (last accessed 2018/05/21).

Acknowledgment

Funding of the project “Identifying the Gender and Age of Online Chatters Using Formal Parameters of their Texts” from the Russian Science Foundation (no. 16-18-10050) is gratefully acknowledged.

Author information

Corresponding author

Correspondence to Tatiana Litvinova.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Litvinova, T., Sboev, A., Panicheva, P. (2018). Profiling the Age of Russian Bloggers. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-01204-5_16

  • DOI: https://doi.org/10.1007/978-3-030-01204-5_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01203-8

  • Online ISBN: 978-3-030-01204-5

  • eBook Packages: Computer Science, Computer Science (R0)
