Using Relative Entropy for Authorship Attribution

Zhao, Ying; Zobel, Justin; Vines, Phil

doi:10.1007/11880592_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4182)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Authorship attribution is the task of deciding who wrote a particular document. Several attribution approaches have been proposed in recent research, but none of them is particularly satisfactory; some are ad hoc, and most have defects in terms of scalability, effectiveness, and efficiency. In this paper, we propose a principled approach, motivated by information theory, to identify authors based on elements of writing style. We make use of the Kullback-Leibler divergence, a measure of how different two distributions are, and explore several different approaches to tokenizing documents to extract style markers. We use several data collections to examine the performance of our approach. We have found that our proposed approach is as effective as the best existing attribution methods for two-class attribution, and is superior for multi-class attribution. It has lower computational cost and is cheaper to train. Finally, our results suggest that this approach is a promising alternative for other categorization problems.
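
As a rough illustration of the idea in the abstract, the sketch below attributes a document to the author whose token distribution has the smallest Kullback-Leibler divergence from the document's distribution, using D(p || q) = sum over x of p(x) log2(p(x) / q(x)). It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the add-alpha smoothing, the direction of the divergence (document against author), and the use of pre-tokenized word lists are all illustrative choices that the abstract does not specify.

```python
import math
from collections import Counter


def token_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed probability of each vocabulary token.

    Add-alpha smoothing is an assumption made here so that every
    probability is non-zero and the divergence stays finite.
    """
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """D(p || q) in bits; p and q must share the same keys."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def attribute(document_tokens, author_tokens):
    """Return (best_author, scores) for a tokenized document.

    author_tokens maps a (hypothetical) author label to the list of
    tokens drawn from that author's training documents.
    """
    # Build one shared vocabulary so all distributions are comparable.
    vocabulary = set(document_tokens)
    for tokens in author_tokens.values():
        vocabulary.update(tokens)
    vocabulary = sorted(vocabulary)

    p = token_distribution(document_tokens, vocabulary)
    scores = {
        author: kl_divergence(p, token_distribution(tokens, vocabulary))
        for author, tokens in author_tokens.items()
    }
    # The smallest divergence marks the closest author model.
    return min(scores, key=scores.get), scores


# Toy data, invented purely for illustration.
training = {
    "A": "the the of of and".split(),
    "B": "but but yet yet so".split(),
}
best, scores = attribute("the of the and".split(), training)
print(best, scores)
```

The token lists stand in for whatever style markers a tokenizer extracts; swapping in function words or other markers would only change how the lists are built, not the comparison itself.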

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne, Australia
Ying Zhao, Justin Zobel & Phil Vines

Editor information

Editors and Affiliations

Department of Computer Science, National University of Singapore, 3 Science Drive 2, 117543, Singapore
Hwee Tou Ng
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613, Singapore
Mun-Kew Leong
Department of Computer Science, School of Computing, National University of Singapore, 117543, Singapore
Min-Yen Kan
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, P.O. Box, 119613, Singapore
Donghong Ji

Cite this paper

Zhao, Y., Zobel, J., Vines, P. (2006). Using Relative Entropy for Authorship Attribution. In: Ng, H.T., Leong, MK., Kan, MY., Ji, D. (eds) Information Retrieval Technology. AIRS 2006. Lecture Notes in Computer Science, vol 4182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880592_8

DOI: https://doi.org/10.1007/11880592_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45780-0
Online ISBN: 978-3-540-46237-8
eBook Packages: Computer Science, Computer Science (R0)
