Effective and Scalable Authorship Attribution Using Function Words

Zhao, Ying; Zobel, Justin

doi:10.1007/11562382_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3689)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

Techniques for identifying the author of an unattributed document can be applied to problems in information analysis and in academic scholarship. A range of methods has been proposed in the research literature, using a variety of features and machine learning approaches, but these methods have been tested on very different data, so their results cannot be compared. It is not even clear whether the differences in performance are due to feature selection or to other variables. In this paper we examine the use of a large, publicly available collection of newswire articles as a benchmark for comparing authorship attribution methods. To demonstrate the value of having a benchmark, we experimentally compare several recent feature-based techniques for authorship attribution and test how well these methods perform as the volume of data is increased. We show that the benchmark clearly distinguishes between different approaches, and that the scalability of the best methods based on function-word features is acceptable, with only a moderate decline as the difficulty of the problem is increased.
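
As a rough illustration of what attribution from function-word features can look like, the sketch below builds a relative-frequency profile over a small word list and assigns a document to the author with the nearest average profile. Everything in it is an assumption made for illustration: the ten-word `FUNCTION_WORDS` list, the helper names, and the nearest-centroid rule with Euclidean distance are not taken from the paper, which compares several established classifiers on a much larger feature set.

```python
from collections import Counter
import math
import re

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; practical studies use far longer lists).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def centroid(profiles):
    """Average a list of equal-length profiles component-wise."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


def attribute(text, training):
    """Return the author whose centroid is nearest to the profile of `text`.

    `training` maps an author name to a list of documents known to be
    written by that author. Distance is plain Euclidean distance.
    """
    target = function_word_profile(text)
    best_author, best_dist = None, math.inf
    for author, docs in training.items():
        c = centroid([function_word_profile(d) for d in docs])
        dist = math.dist(target, c)
        if dist < best_dist:
            best_author, best_dist = author, dist
    return best_author
```

A call such as `attribute(unknown_text, {"author_a": [...], "author_b": [...]})` returns the name of the closest author. Because function words are largely independent of topic, profiles of this kind are intended to capture style rather than subject matter, which is the usual motivation for using them in attribution work.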

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne, Australia
Ying Zhao & Justin Zobel

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, 790-784, Pohang, Korea
Gary Geunbae Lee
Computer and Communication Media Research, NEC Corp., Miyazaki 4-1-1, Miyamae-ku, 216-8555, Kawasaki, Japan
Akio Yamada
Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Helen Meng
School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng

About this paper

Cite this paper

Zhao, Y., Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_14

DOI: https://doi.org/10.1007/11562382_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science, Computer Science (R0)
