Automatic Estimation of Web Bloggers’ Age Using Regression Models

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9319)


In this article, we address the problem of automatically estimating the age of web users from their posts. Most studies on age identification treat the task as a classification problem over age categories. Instead, we treat it as a regression problem and evaluate several well-known, widely used machine learning algorithms for numerical estimation to examine how well they suit this task. We used a set of 42 text features. The experimental results showed that the Bagging algorithm with a RepTree base learner performed best, estimating web users’ age with a mean absolute error of 5.44 and a root mean squared error of approximately 7.14.
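The two error measures reported above can be illustrated with a minimal sketch. The function names and the toy ages below are hypothetical and chosen only for illustration; they are not the authors' data or implementation.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences |actual - predicted|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference (actual - predicted)^2."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical ages (assumption: invented values, not taken from the paper).
true_ages = [23, 35, 17, 48]
predicted_ages = [26, 31, 17, 41]

mae = mean_absolute_error(true_ages, predicted_ages)
rmse = root_mean_squared_error(true_ages, predicted_ages)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

Because the squared term weights large errors more heavily, the root mean squared error is never smaller than the mean absolute error on the same predictions, which is consistent with the reported pair of 5.44 and 7.14.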


Author’s age estimation, Text processing, Regression algorithms



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Multidimensional Data Analysis and Knowledge Management Laboratory, Department of Computer Engineering and Informatics, University of Patras, Rion, Greece
