Gender Classification of Web Authors Using Feature Selection and Language Models

  • Conference paper
Speech and Computer (SPECOM 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9319)

Abstract

In this article, we address the problem of automatically classifying the gender of web blog authors. Specifically, we employ eight widely used machine learning algorithms to study how effectively feature selection improves the accuracy of gender classification. Feature ranking is performed over a set of statistical, part-of-speech tagging, and language model features. The classification models in our experiments are based on decision trees, support vector machines, and lazy-learning algorithms. The experimental evaluation on blog author gender classification data demonstrates the importance of language model features for this task and shows that feature selection significantly improves the accuracy of gender classification, regardless of the type of machine learning algorithm used.
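
As an illustration of the general idea of ranking features and then classifying with a reduced feature set, a minimal Python sketch follows. It is not the authors' pipeline: the abstract does not name the ranking criterion, so the class-mean separation score used here is an assumption, the 1-nearest-neighbour classifier merely stands in for a lazy learner, and the data is synthetic (two informative columns and eight noise columns) rather than blog text features. All function names are hypothetical.

```python
import random
import statistics


def rank_features(X, y):
    """Rank feature indices by a simple two-class separation score.

    Assumed criterion (not from the paper): absolute difference of the
    class means divided by the sum of the class standard deviations.
    """
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        spread = statistics.pstdev(a) + statistics.pstdev(b)
        if spread == 0:
            spread = 1.0  # avoid division by zero for constant columns
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / spread)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def knn_predict(train_X, train_y, x, features):
    """1-nearest-neighbour prediction using only the selected columns."""
    def dist(u):
        return sum((u[j] - x[j]) ** 2 for j in features)

    best = min(range(len(train_X)), key=lambda i: dist(train_X[i]))
    return train_y[best]


def accuracy(train_X, train_y, test_X, test_y, features):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_X, train_y, x, features) == t
        for x, t in zip(test_X, test_y)
    )
    return hits / len(test_y)


def make_data(n, rng):
    """Synthetic stand-in data: columns 0-1 are informative, 2-9 are noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        informative = [rng.gauss(2.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
        X.append(informative + noise)
        y.append(label)
    return X, y


rng = random.Random(0)
train_X, train_y = make_data(200, rng)
test_X, test_y = make_data(100, rng)

# The ranking is computed on the training split only, so the test split
# does not influence which features are kept.
ranking = rank_features(train_X, train_y)
top2 = ranking[:2]

acc_all = accuracy(train_X, train_y, test_X, test_y, list(range(10)))
acc_top = accuracy(train_X, train_y, test_X, test_y, top2)
print("selected features:", top2)
print("accuracy with all features:", acc_all)
print("accuracy with top-2 features:", acc_top)
```

Comparing the two printed accuracies shows how discarding low-ranked features can change a distance-based classifier's behaviour. The numbers come from synthetic data and say nothing about the results reported in the paper.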

Author information

Corresponding author

Correspondence to Christina Aravantinou.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Aravantinou, C., Simaki, V., Mporas, I., Megalooikonomou, V. (2015). Gender Classification of Web Authors Using Feature Selection and Language Models. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds) Speech and Computer. SPECOM 2015. Lecture Notes in Computer Science, vol 9319. Springer, Cham. https://doi.org/10.1007/978-3-319-23132-7_28

  • DOI: https://doi.org/10.1007/978-3-319-23132-7_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23131-0

  • Online ISBN: 978-3-319-23132-7

  • eBook Packages: Computer Science, Computer Science (R0)
