Language Resources and Evaluation, Volume 45, Issue 1, pp 83–94

Authorship attribution in the wild

Article

Abstract

Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. We also show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
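To make the idea of combining a similarity measure with multiple randomized feature sets concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the feature type, the similarity measure, or how the robustness score is computed, so everything here is an assumption for illustration: character 4-grams as features, cosine similarity, a score equal to the share of random feature subsets on which one candidate is the closest match, and the hypothetical names `attribute`, `iterations`, `fraction`, and `threshold`. Returning `None` when the score falls below the threshold is one way to allow for the actual author being absent from the candidate set.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (an assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity of two count vectors restricted to `features`."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(anonymous, known, iterations=100, fraction=0.4,
              threshold=0.9, seed=0):
    """Return (author or None, score).

    `known` maps each candidate author to their known text. The score is the
    share of random feature subsets on which the winning candidate was the
    most similar one (an assumed definition of robustness). The author is
    reported only if the score reaches `threshold`.
    """
    anon_vec = char_ngrams(anonymous)
    known_vecs = {author: char_ngrams(text) for author, text in known.items()}
    vocabulary = sorted(set(anon_vec).union(*known_vecs.values()))
    if not known_vecs or not vocabulary:
        return None, 0.0

    rng = random.Random(seed)
    subset_size = max(1, int(fraction * len(vocabulary)))
    wins = Counter()
    for _ in range(iterations):
        # Each iteration compares texts on a different random feature subset.
        subset = rng.sample(vocabulary, subset_size)
        best = max(known_vecs,
                   key=lambda a: cosine(anon_vec, known_vecs[a], subset))
        wins[best] += 1

    author, count = wins.most_common(1)[0]
    score = count / iterations
    return (author if score >= threshold else None), score
```

Raising `threshold` in this sketch trades coverage for precision: fewer anonymous texts receive an attribution, but those that do are backed by agreement across more feature subsets.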

Keywords

Authorship attribution; Open candidate set; Randomized feature set

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. Bar-Ilan University, Ramat-Gan, Israel
  2. Illinois Institute of Technology, Chicago, USA
