Tensor Space Models for Authorship Identification

  • Conference paper
Artificial Intelligence: Theories, Models and Applications (SETN 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5138)

Abstract

Authorship identification can be viewed as a text categorization task. In this task, however, the most frequent features appear to be the most important discriminators, training texts are usually scarce, and they are rarely evenly distributed over the authors. To cope with these problems, we propose second-order tensors for representing the stylistic properties of texts. Our approach requires far fewer parameters to be estimated than the traditional vector space representation. We examine various methods for building appropriate tensors, taking into account that similar features should be placed in the same neighborhood. Using an existing generalization of SVM that can handle tensors, we perform experiments on corpora controlled for genre and topic and show that the proposed approach can effectively handle cases where only limited training texts are available.
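
The parameter saving claimed in the abstract can be illustrated with a minimal sketch. It assumes the tensor SVM uses a rank-one weight, f(X) = u^T X v + b, as in the support tensor machine formulation; the abstract itself does not spell this out. The function names, the row-by-row feature layout, and the 50 x 50 size used in the note below are hypothetical choices for illustration, not the authors' setup.

```python
def to_tensor(features, n_rows, n_cols):
    """Arrange a flat feature list row by row into an n_rows x n_cols matrix.

    Zero-pads the tail if the matrix has more cells than features.
    The row-by-row layout is an assumption; the paper studies several layouts.
    """
    if n_rows * n_cols < len(features):
        raise ValueError("matrix too small for the feature vector")
    padded = list(features) + [0.0] * (n_rows * n_cols - len(features))
    return [padded[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def tensor_decision(X, u, v, b=0.0):
    """Rank-one decision function f(X) = u^T X v + b (assumed form)."""
    return sum(
        u[i] * X[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    ) + b


def parameter_counts(n_rows, n_cols):
    """Parameters to estimate: one weight per feature plus a bias for the
    vector model, versus the entries of u and v plus a bias for the tensor model."""
    return {
        "vector": n_rows * n_cols + 1,
        "tensor": n_rows + n_cols + 1,
    }
```

For a hypothetical set of 2,500 features arranged as a 50 x 50 matrix, `parameter_counts(50, 50)` gives 2,501 parameters for the vector model and 101 for the rank-one tensor model. Because each entry of `u` and `v` is shared by a whole row or column, the placement of features in the matrix matters, which is why the abstract stresses putting similar features in the same neighborhood.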


Editor information

John Darzentas, George A. Vouros, Spyros Vosinakis, Argyris Arnellos

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Plakias, S., Stamatatos, E. (2008). Tensor Space Models for Authorship Identification. In: Darzentas, J., Vouros, G.A., Vosinakis, S., Arnellos, A. (eds) Artificial Intelligence: Theories, Models and Applications. SETN 2008. Lecture Notes in Computer Science, vol 5138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87881-0_22

  • DOI: https://doi.org/10.1007/978-3-540-87881-0_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-87880-3

  • Online ISBN: 978-3-540-87881-0

  • eBook Packages: Computer Science, Computer Science (R0)
