A Comparative Study of Feature Types for Age-Based Text Classification

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12602)


The ability to automatically determine the target age audience of a novel opens up many opportunities for the development of information retrieval tools. First, developers of book recommendation systems and electronic libraries may want to filter texts by the age of their most likely readers. Second, parents may want to select literature for their children. Finally, writers and publishers can benefit from knowing which features influence whether a text is suitable for children. In this article, we compare the empirical effectiveness of various types of linguistic features for age-based classification of fiction texts. For this purpose, we collected a corpus of book previews, each labeled with one of two categories: children’s or adult. We evaluated the following types of features: readability indices, sentiment, lexical, grammatical, and general features, and publishing attributes. The results show that features describing the text at the document level can significantly increase the quality of machine learning models.
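
As a rough illustration of what document-level features of a text can look like, the sketch below computes a few simple statistics: sentence and word counts, average sentence and word length, and type-token ratio. The function name `document_level_features`, the choice of statistics, and the regex-based tokenization are assumptions made for illustration; they are not the feature set or preprocessing used in the paper.

```python
import re


def document_level_features(text: str) -> dict:
    """Compute a few simple document-level statistics (illustrative only)."""
    # Naive sentence split on runs of ., ! or ?; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    # Naive word tokenization: runs of Unicode letters, lowercased.
    words = [w.lower() for w in re.findall(r"[^\W\d_]+", text)]
    # Guard against division by zero on empty input.
    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    return {
        "n_sentences": len(sentences),
        "n_words": len(words),
        "avg_sentence_length": len(words) / n_sent,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
    }
```

A dictionary like this could be turned into a numeric vector and concatenated with other feature types before training a classifier; how the paper actually combines its features is not stated in the abstract.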


Keywords: Text classification, Fiction, Corpus, Age audience, Content rating, Text difficulty, RuBERT, Neural network, Natural language processing, Machine learning



Copyright information

© Springer Nature Switzerland AG 2021

Authors and Affiliations

  1. University of Tyumen, Tyumen, Russia
  2. Organization of cognitive associative systems LLC, Tyumen, Russia
