Feature Selection for Enhanced Author Identification of Turkish Text

Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 363)

Abstract

The rapid growth of the Internet and the increasing availability of electronic documents pose problems such as identifying the author of an anonymous text and detecting plagiarism. This study aims to determine the author of a given document from a set of text documents whose authors are known. Although a large number of author identification studies have been conducted on English text over the last century, Turkish and other languages have gained interest only in the last decade. This study therefore addresses the author identification problem using two Turkish datasets collected from two different Turkish newspapers. The datasets comprise 850 columns in total, written by 17 columnists, with 50 columns from each columnist. Four machine learning algorithms (Naive Bayes, Support Vector Machine, K-Nearest Neighbor, and Decision Tree) were employed, and 99.7% accuracy was achieved with the K-Nearest Neighbor algorithm. The authors were fully recognized once the Chi-square feature selection method reduced the number of features from 20 to 17.

Keywords

Author identification, Text classification, Machine learning, Feature selection

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Management Information Systems, Cyprus International University, Lefkosa, Turkey
  2. Computer Engineering Department, Cyprus International University, Lefkosa, Turkey
