Author Profiling: Predicting Gender from Document

  • Sunakshi Mamgain
  • Rakesh C. Balabantaray
  • Ajit K. Das
  • Srikant Kumar
Conference paper
Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 37)


As the Internet matures, a massive amount of data is being created on the web, most of it text. Consequently, establishing the authorship of content and predicting the characteristics of its author is becoming a new domain of data analytics, making author profiling a research area with broad scope. The ability to describe the traits of an author has key applications in security and forensics. The PAN labs provide a platform for researchers by organizing author profiling tasks such as language and gender prediction. In this paper, we attempt to predict the gender of an author using the English dataset of PAN 2017.


NLP, Classification, Cross-validation, Logistic regression, SVM and Multinomial Naïve Bayes



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Sunakshi Mamgain (1)
  • Rakesh C. Balabantaray (2)
  • Ajit K. Das (2)
  • Srikant Kumar (1)

  1. IIIT, Bhubaneswar, India
  2. Department of CS-IT, IIIT, Bhubaneswar, India
