Gender Profiling from PhD Theses Using k-Nearest Neighbour and Sequential Minimal Optimisation

  • Hoshiladevi Ramnial
  • Shireen Panchoo
  • Sameerchand Pudaruth
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 385)


Author profiling is a subfield of text categorisation in which the aim is to predict some characteristics of a writer. In this paper, our objective is to determine the gender of an author based on their writings. Our corpus consists of 10 PhD theses, which were split into equal-sized segments of 1000, 5000 and 10000 words. From this corpus, a total of 446 features were extracted. Some new features, such as combined words, new word endings and new POS tags, were used in this study. The features were not separated into categories. Two machine learning classifiers, namely the k-nearest neighbour (kNN) classifier and a support vector machine classifier trained with the sequential minimal optimisation (SMO) algorithm, were used to assess the practicability and utility of our study. We were able to achieve 100% accuracy using SMO with 40 document parts. Surprisingly, the simple and lazy kNN classifier, which is often discarded in gender profiling studies, achieved 98% accuracy with the same group of documents. Furthermore, 5-NN and 7-NN even outperformed SMO when using 400 document parts of 1000 words each. These values are much higher than those obtained in previous studies. However, we have used a new dataset and the results are therefore not directly comparable. Overall, our experiments provide further evidence that it is possible to infer the gender of an author using a computational linguistic approach.
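The segmentation and nearest-neighbour steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the two toy features (standing in for the 446 features of the study), the Euclidean distance, and the choice to drop a trailing partial segment are all assumptions made for illustration.

```python
import math
from collections import Counter


def split_into_segments(text, segment_size):
    """Split a text into consecutive segments of exactly `segment_size` words.

    Assumption: a trailing remainder shorter than `segment_size` is dropped
    so that all segments are of equal size.
    """
    words = text.split()
    n_full = len(words) // segment_size
    return [words[i * segment_size:(i + 1) * segment_size] for i in range(n_full)]


def toy_features(segment):
    """Return two illustrative features for a non-empty list of words.

    These are hypothetical stand-ins, not the features used in the paper:
    mean word length and the fraction of words ending in "ly".
    """
    n = len(segment)
    mean_length = sum(len(word) for word in segment) / n
    ly_ratio = sum(1 for word in segment if word.endswith("ly")) / n
    return (mean_length, ly_ratio)


def knn_predict(training_set, query, k):
    """Classify `query` by majority vote among its k nearest neighbours.

    `training_set` is a list of (feature_vector, label) pairs; distance is
    Euclidean (an assumption, the abstract does not name the metric).
    """
    nearest = sorted(training_set, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In use, each thesis would be passed through `split_into_segments` with a size of 1000, 5000 or 10000 words, every segment mapped to a feature vector, and held-out segments classified with `knn_predict` for odd values of `k` such as 5 or 7, which avoids tied votes in a two-class problem.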


Keywords: Gender profiling, Text classification, Machine learning





Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Hoshiladevi Ramnial (1)
  • Shireen Panchoo (1)
  • Sameerchand Pudaruth (2)
  1. School of Innovative Technologies and Engineering, University of Technology, Port Louis, Mauritius
  2. Department of Ocean Engineering and ICT, Faculty of Ocean Studies, University of Mauritius, Port Louis, Mauritius
