Abstract
Predicting the demographics of social media users, bloggers, and authors of other types of online texts is crucial for marketing, security, and other applications. However, most work in authorship profiling deals with author gender prediction. In addition, most studies are performed on English-language corpora, and very little work in this area has been done for Russian. Filling this gap will add to multilingual insights into age-specific linguistic features and will be a crucial step towards online security management in social networks. We present the first age-annotated dataset in Russian. The dataset contains blogs of 1260 authors from LiveJournal and is balanced with respect to both the age group and the gender of the author. We perform age classification experiments (for age groups 20–30, 30–40, 40–50) on the presented data using basic linguistic features (lemmas, part-of-speech unigrams and bigrams, etc.) and obtain a considerable baseline for age classification in Russian. We also treat age as a continuous variable and build regression models to predict it. Finally, we analyze the significant features and provide interpretations where possible.
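As an illustration only, the two ingredients named in the abstract (age groups as class labels, and part-of-speech unigrams and bigrams as features) might be sketched as follows. This is not the authors' code. The function names, the half-open bin boundaries, and the use of relative frequencies are assumptions made for the example, and the tag sequence is presumed to come from an external morphological tagger.

```python
from collections import Counter

# Assumption: half-open bins [20, 30), [30, 40), [40, 50). The abstract does
# not say which group a boundary age such as 30 belongs to.
AGE_GROUPS = [(20, 30), (30, 40), (40, 50)]


def age_group(age):
    """Return a label such as '20-30', or None if the age is out of range."""
    for low, high in AGE_GROUPS:
        if low <= age < high:
            return f"{low}-{high}"
    return None


def pos_ngram_features(tags):
    """Relative frequencies of POS unigrams and bigrams in one tag sequence."""
    unigrams = Counter(tags)
    bigrams = Counter(zip(tags, tags[1:]))
    features = {}
    for tag, count in unigrams.items():
        features[tag] = count / len(tags)
    for (first, second), count in bigrams.items():
        # A sequence of n tags contains n - 1 adjacent pairs.
        features[f"{first}_{second}"] = count / (len(tags) - 1)
    return features
```

The resulting dictionaries could be passed to any standard classifier (with the group label as target) or regressor (with the raw age as target). Which models and feature weighting the paper actually uses is not stated in the abstract.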
Notes
1. https://pan.webis.de/index.html (last accessed 2018/05/21).
Acknowledgment
Funding of the project “Identifying the Gender and Age of Online Chatters Using Formal Parameters of their Texts” from the Russian Science Foundation (no. 16-18-10050) is gratefully acknowledged.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Litvinova, T., Sboev, A., Panicheva, P. (2018). Profiling the Age of Russian Bloggers. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-01204-5_16
DOI: https://doi.org/10.1007/978-3-030-01204-5_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01203-8
Online ISBN: 978-3-030-01204-5
eBook Packages: Computer Science, Computer Science (R0)