Automatic Turkish Text Categorization in Terms of Author, Genre and Gender

  • M. Fatih Amasyalı
  • Banu Diri
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3999)


This study presents a first comprehensive text classification for Turkish using an n-gram model. We address three tasks: automatically identifying the author of a Turkish document, classifying documents by genre, and identifying the gender of the author. Naive Bayes, Support Vector Machine, C4.5 and Random Forest were used as classification methods, and their results are reported comparatively. The success rates for identifying the author of the text, the genre of the text and the gender of the author were 83%, 93% and 96%, respectively.
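The general idea of classifying text from n-gram counts can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say whether the n-grams are character-level or word-level, so character bigrams are assumed here, and the tiny training texts, the labels and the names `char_ngrams` and `NgramNaiveBayes` are hypothetical. Of the four classifiers mentioned, only a simple multinomial Naive Bayes with add-one smoothing is shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased text.

    Assumption: character-level n-grams. Note that str.lower() is not
    Turkish-aware (it maps "I" to "i", not to the dotless "ı").
    """
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def __init__(self, n=2):
        self.n = n
        self.class_docs = Counter()               # training documents per class
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_docs[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label in self.class_docs:
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(self.class_docs[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: two made-up snippets with invented genre labels.
train_texts = ["takım maçı kazandı ve gol attı", "borsa düştü faiz ve dolar arttı"]
train_labels = ["sports", "economy"]
model = NgramNaiveBayes(n=2).fit(train_texts, train_labels)
predicted = model.predict("gol attı")
```

The same feature extraction could feed any of the other classifiers named in the abstract; Naive Bayes is used here only because it fits in a few lines without external libraries.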


Support Vector Machine · Feature Selection · Random Forest · Subset Feature · Feature Selection Algorithm
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • M. Fatih Amasyalı (1)
  • Banu Diri (1)
  1. Computer Engineering Department, Yıldız Technical University, Beşiktaş, İstanbul, Turkey
