N-Gram Feature Selection for Authorship Identification

  • John Houvardas
  • Efstathios Stamatatos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4183)


Automatic authorship identification is a valuable tool for supporting criminal investigation and security. It can be viewed as a multi-class, single-label text categorization task. Character n-grams are a very successful way to represent text for stylistic purposes because they capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few members in common. Moreover, we explore the significance of digits for distinguishing between authors and show that simple text pre-processing can increase performance.
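To make the ingredients named in the abstract concrete, the following is a minimal sketch of character n-gram extraction, an information gain score per n-gram, and one possible digit pre-processing step. It is not the authors' implementation: the function names, the `@` placeholder for digits, the n-gram lengths, and the presence/absence formulation of information gain are all assumptions chosen for illustration, and the variable-length selection method proposed in the paper is not reproduced here.

```python
import math
import re
from collections import Counter


def mask_digits(text, symbol="@"):
    """Replace every digit with one placeholder symbol (assumed pre-processing)."""
    return re.sub(r"\d", symbol, text)


def char_ngrams(text, n_values=(3, 4, 5)):
    """Return the set of character n-grams of the given lengths found in text."""
    return {text[i:i + n] for n in n_values for i in range(len(text) - n + 1)}


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, n_values=(3, 4, 5)):
    """Score each n-gram by H(author) - H(author | n-gram present or absent)."""
    class_counts = Counter(labels)
    base = entropy(class_counts.values())
    n_docs = len(docs)

    # For each n-gram, count the documents per author that contain it.
    present = {}
    for doc, label in zip(docs, labels):
        for gram in char_ngrams(doc, n_values):
            present.setdefault(gram, Counter())[label] += 1

    scores = {}
    for gram, with_gram in present.items():
        n_with = sum(with_gram.values())
        n_without = n_docs - n_with
        without_gram = [class_counts[a] - with_gram[a] for a in class_counts]
        conditional = (n_with / n_docs) * entropy(with_gram.values())
        if n_without:
            conditional += (n_without / n_docs) * entropy(without_gram)
        scores[gram] = base - conditional
    return scores


# Toy data (invented for illustration, not taken from the Reuters corpus).
docs = [
    "Profit rose 12% in 1997.",
    "Profit rose 7% in 1996.",
    "shares fell, analysts said",
    "shares fell, traders said",
]
labels = ["A", "A", "B", "B"]

masked = [mask_digits(d) for d in docs]
scores = information_gain(masked, labels, n_values=(3,))
top = sorted(scores, key=scores.get, reverse=True)[:5]
```

In this toy example, masking turns the distinct years into the same string `@@@@`, so the 3-gram `@@@` becomes a feature shared by both texts of author A. That is one way digits could carry stylistic signal once their specific values are abstracted away.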


Keywords (added by machine, not by the authors): Feature Selection; Information Gain; Feature Selection Method; Authorship Identification; Authorship Attribution





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • John Houvardas (1)
  • Efstathios Stamatatos (1)
  1. Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Greece
