Automatically Determining an Anonymous Author’s Native Language

  • Moshe Koppel
  • Jonathan Schler
  • Kfir Zigdon
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3495)


Text authored by an unidentified assailant can offer valuable clues to the assailant’s identity. In this paper, we show that stylistic text features can be exploited to determine an anonymous author’s native language with high accuracy.


Keywords: Support Vector Machine, Native Language, Error Type, Function Word, Definite Article
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Moshe Koppel (1)
  • Jonathan Schler (1)
  • Kfir Zigdon (1)
  1. Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel
