Automatic Turkish Text Categorization in Terms of Author, Genre and Gender
This study presents a first comprehensive text classification for Turkish using an n-gram model. We worked on three tasks: automatically identifying the author of a Turkish document, classifying documents by genre, and identifying the gender of the author. Naive Bayes, Support Vector Machine, C4.5 and Random Forest were used as classification methods, and their results are compared. The success rates for identifying the author, the genre of the text and the gender of the author were 83%, 93% and 96%, respectively.
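As a rough illustration of the general idea (n-gram features fed to a classifier), the sketch below builds character bigram counts and trains a small multinomial Naive Bayes model. It is not the authors' implementation: the abstract does not say whether the n-grams are character-level or word-level, which n was used, or how features were selected, so the bigram choice, the add-one smoothing, the names `char_ngrams` and `NgramNaiveBayes`, and the toy genre-labelled sentences are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts.

    Hypothetical helper for illustration; uses add-one (Laplace) smoothing.
    """

    def __init__(self, n=2):
        self.n = n
        self.class_docs = Counter()                # documents seen per class
        self.ngram_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()                         # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_docs[label] += 1
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        # n-grams never seen in training carry no evidence, so skip them.
        grams = [g for g in char_ngrams(text, self.n) if g in self.vocab]
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_docs.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(doc_count / total_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, invented training data (not from the paper): two genres.
train_texts = [
    "takım maçı kazandı ve gol attı",
    "futbol maçında iki gol oldu",
    "borsa bugün yükseldi ve dolar düştü",
    "merkez bankası faiz kararını açıkladı",
]
train_labels = ["spor", "spor", "ekonomi", "ekonomi"]

model = NgramNaiveBayes(n=2).fit(train_texts, train_labels)
print(model.predict("maçta gol attı"))
```

The same feature vectors could be passed to the other classifiers named in the abstract (Support Vector Machine, C4.5, Random Forest). Swapping the labels for author names or author gender would give the other two tasks under the same assumptions.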
Keywords: Support Vector Machine, Feature Selection, Random Forest, Subset Feature, Feature Selection Algorithm