Using Relative Entropy for Authorship Attribution

  • Ying Zhao
  • Justin Zobel
  • Phil Vines
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4182)


Authorship attribution is the task of deciding who wrote a particular document. Several attribution approaches have been proposed in recent research, but none is particularly satisfactory; some are ad hoc, and most have defects in scalability, effectiveness, or efficiency. In this paper, we propose a principled approach, motivated by information theory, that identifies authors based on elements of writing style. We make use of the Kullback-Leibler divergence, a measure of how different two distributions are, and explore several approaches to tokenizing documents to extract style markers. We use several data collections to examine the performance of our approach. We have found that the proposed approach is as effective as the best existing attribution methods for two-class attribution, and is superior for multi-class attribution. It also has lower computational cost and is cheaper to train. Finally, our results suggest that this approach is a promising alternative for other categorization problems.
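To make the central idea concrete, the following is a minimal sketch of attribution by relative entropy, not the authors' implementation. The Kullback-Leibler divergence of a distribution p from a distribution q is D(p || q) = sum over x of p(x) log(p(x) / q(x)); it is zero when the two distributions are identical and grows as they diverge. The sketch builds a token distribution for the unattributed document and for each candidate author, then picks the author whose distribution is closest. Everything specific here is an assumption made for illustration: the whitespace tokenizer stands in for the style-marker extraction, the add-alpha smoothing stands in for whatever smoothing the paper uses, and the function names (`tokenize`, `smoothed_distribution`, `kl_divergence`, `attribute`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercased whitespace tokens stand in for style markers.
    return text.lower().split()


def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    # Assumption: add-alpha smoothing, so every vocabulary item has
    # non-zero probability and the divergence stays finite.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def kl_divergence(p, q):
    # D(p || q) = sum_x p(x) * log(p(x) / q(x)), over the shared vocabulary.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def attribute(document, author_samples, alpha=1.0):
    # author_samples maps an author name to that author's training text.
    doc_tokens = tokenize(document)
    author_tokens = {a: tokenize(t) for a, t in author_samples.items()}

    # Use one shared vocabulary so all distributions are comparable.
    vocabulary = set(doc_tokens)
    for tokens in author_tokens.values():
        vocabulary.update(tokens)

    p = smoothed_distribution(doc_tokens, vocabulary, alpha)
    scores = {
        author: kl_divergence(p, smoothed_distribution(tokens, vocabulary, alpha))
        for author, tokens in author_tokens.items()
    }
    # The attributed author is the one with the smallest divergence.
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    samples = {
        "author_a": "the cat sat on the mat the cat",
        "author_b": "whereas hitherto notwithstanding whereas moreover",
    }
    best, scores = attribute("the cat sat on the mat", samples)
    print(best, scores)
```

Because each author model is only a table of smoothed token frequencies, training in this sketch amounts to counting tokens, which is consistent with the abstract's claim of low training cost. A multi-class decision needs no extra machinery: the same minimum is taken over however many candidate authors are supplied.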


Support Vector Machine, Bayesian Network, Language Model, Principal Component Analysis, Relative Entropy
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Ying Zhao (1)
  • Justin Zobel (1)
  • Phil Vines (1)
  1. School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
