Gender Classification of Web Authors Using Feature Selection and Language Models

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9319)


This paper addresses the problem of automatic gender classification of web blog authors. Specifically, we use eight widely used machine learning algorithms to study how effectively feature selection improves the accuracy of gender classification. Feature ranking is performed over a set of statistical, part-of-speech tagging, and language model features. The classification models are based on decision trees, support vector machines, and lazy-learning algorithms. An experimental evaluation on blog author gender classification data demonstrates the importance of language model features for this task and shows that feature selection significantly improves classification accuracy, regardless of the type of machine learning algorithm used.
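The general idea of ranking features and then classifying on the top-ranked subset can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the ranking method or the exact features, so the sketch assumes a simple standardized mean-difference filter score, a 1-nearest-neighbour classifier as a stand-in for a lazy learner, and synthetic numeric data. All function names and the toy dataset are hypothetical.

```python
import math
import random


def filter_scores(X, y):
    """Score each feature by the gap between class means, scaled by its spread.

    Assumption: a generic filter score chosen for illustration only.
    """
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        var_a = sum((v - mean_a) ** 2 for v in a) / len(a)
        var_b = sum((v - mean_b) ** 2 for v in b) / len(b)
        spread = math.sqrt((var_a + var_b) / 2) + 1e-9  # avoid division by zero
        scores.append(abs(mean_a - mean_b) / spread)
    return scores


def rank_features(X, y):
    """Return feature indices ordered from highest to lowest score."""
    scores = filter_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def loo_accuracy_1nn(X, y, columns):
    """Leave-one-out accuracy of 1-nearest-neighbour using only `columns`."""
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for k, xk in enumerate(X):
            if k == i:
                continue  # never compare a sample with itself
            dist = sum((xi[j] - xk[j]) ** 2 for j in columns)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def make_toy_data(n_per_class=20, n_noise=10, seed=0):
    """Synthetic data: feature 0 separates the classes, the rest is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            informative = label + rng.uniform(-0.1, 0.1)
            noise = [rng.uniform(-1.0, 1.0) for _ in range(n_noise)]
            X.append([informative] + noise)
            y.append(label)
    return X, y


X, y = make_toy_data()
ranking = rank_features(X, y)
acc_all = loo_accuracy_1nn(X, y, columns=range(len(X[0])))
acc_top = loo_accuracy_1nn(X, y, columns=ranking[:1])
print("ranking:", ranking)
print("accuracy, all features:", acc_all)
print("accuracy, top-1 feature:", acc_top)
```

For brevity the sketch ranks features on the full dataset before evaluating. In a real experiment the ranking would need to be recomputed inside each training fold so that the held-out samples do not influence which features are selected.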


Keywords: Text classification, Gender identification, Feature selection



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Multidimensional Data Analysis and Knowledge Management Laboratory, Department of Computer Engineering and Informatics, University of Patras, Rion, Greece
