Tensor Space Models for Authorship Identification

  • Spyridon Plakias
  • Efstathios Stamatatos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5138)

Abstract

Authorship identification can be viewed as a text categorization task. In this task, however, the most frequent features appear to be the most important discriminators, training texts are usually scarce, and they are rarely distributed evenly over the authors. To cope with these problems, we propose second-order tensors for representing the stylistic properties of texts. Our approach requires estimating far fewer parameters than the traditional vector space representation. We examine various methods for building appropriate tensors, taking into account that similar features should be placed in the same neighborhood. Using an existing generalization of SVM that can handle tensors, we perform experiments on corpora controlled for genre and topic and show that the proposed approach can effectively handle cases where only limited training texts are available.
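
The parameter saving mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a rank-one bilinear decision function f(X) = u^T X v + b, as commonly used in tensor generalizations of SVM, and it assumes that frequency-ranked features are laid out row by row so that features of similar rank end up near each other. The function names, the 50 x 50 shape, and the toy values are all hypothetical.

```python
def to_matrix(features, n_rows, n_cols):
    """Arrange a flat feature vector as an n_rows x n_cols matrix (row-major).

    Assumption: `features` is already ordered (e.g. by frequency rank), so
    neighbouring cells hold features of similar rank.
    """
    if len(features) != n_rows * n_cols:
        raise ValueError("feature vector length must equal n_rows * n_cols")
    return [list(features[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


def tensor_decision(X, u, v, b=0.0):
    """Rank-one bilinear decision value f(X) = u^T X v + b (assumed form)."""
    return sum(u[i] * X[i][j] * v[j]
               for i in range(len(u))
               for j in range(len(v))) + b


def parameter_counts(n_rows, n_cols):
    """Weights of a linear vector-space model vs. a rank-one tensor model."""
    return n_rows * n_cols, n_rows + n_cols


# Hypothetical setting: 2,500 features arranged as a 50 x 50 matrix.
vector_params, tensor_params = parameter_counts(50, 50)
print(vector_params, tensor_params)

# Toy decision value on a 2 x 3 matrix with all-ones weight vectors.
X = to_matrix([0, 1, 2, 3, 4, 5], 2, 3)
print(tensor_decision(X, [1.0, 1.0], [1.0, 1.0, 1.0]))
```

Under these assumptions the vector-space model learns one weight per feature (n1 · n2), whereas the rank-one tensor model learns only n1 + n2 weights, which is one way to read the claim that fewer parameters help when training texts are limited.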

Keywords

Authorship identification, Tensor space representation, Text categorization

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Spyridon Plakias (1)
  • Efstathios Stamatatos (1)
  1. Dept. of Information and Communication Systems Eng., University of the Aegean, Karlovassi, Greece
