Chat Mining for Gender Prediction

  • Tayfun Kucukyilmaz
  • B. Barla Cambazoglu
  • Cevdet Aykanat
  • Fazli Can
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4243)


The aim of this paper is to investigate the feasibility of predicting the gender of a text document’s author using linguistic evidence. For this purpose, term- and style-based classification techniques are evaluated over a large collection of chat messages. Prediction accuracies of up to 84.2% are achieved, illustrating the applicability of these techniques to gender prediction. Moreover, the reverse problem is explored, and the effect of gender on writing style is discussed.


Keywords: Feature Selection, Word Length, Punctuation Mark, Female User, Stylistic Feature
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Tayfun Kucukyilmaz (1)
  • B. Barla Cambazoglu (1)
  • Cevdet Aykanat (1)
  • Fazli Can (1)
  1. Department of Computer Engineering, Bilkent University, Bilkent, Ankara, Turkey
