Tensor Space Models for Authorship Identification

Plakias, Spyridon; Stamatatos, Efstathios

doi:10.1007/978-3-540-87881-0_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5138)

Included in the following conference series:

Hellenic Conference on Artificial Intelligence

Abstract

Authorship identification can be viewed as a text categorization task. In this task, however, the most frequent features appear to be the most important discriminators, training texts are usually in short supply, and the available training texts are rarely distributed evenly over the authors. To cope with these problems, we propose second-order tensors for representing the stylistic properties of texts. Our approach requires far fewer parameters to be estimated than the traditional vector space representation. We examine various methods for building appropriate tensors, taking into account that similar features should be placed in the same neighborhood. Using an existing generalization of SVM that can handle tensors, we perform experiments on corpora controlled for genre and topic and show that the proposed approach can effectively handle cases where only limited training texts are available.
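The parameter saving mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature count, the 50 x 50 tensor shape, the row-major placement of frequency-ranked features, and the helper names `vector_to_tensor` and `tensor_decision` are all assumptions made for illustration. The bilinear decision function f(X) = uᵀXv + b is the general form used by tensor generalizations of linear classifiers; the abstract does not spell out the exact formulation.

```python
import numpy as np


def vector_to_tensor(x, n_rows, n_cols):
    """Reshape a feature vector into a second-order tensor (a matrix).

    Assumption: features arrive sorted by rank (e.g. by frequency), so a
    row-major fill puts features of adjacent rank in the same row. The
    vector is zero-padded if it is shorter than n_rows * n_cols.
    """
    x = np.asarray(x, dtype=float)
    if x.size > n_rows * n_cols:
        raise ValueError("vector does not fit in the requested tensor shape")
    padded = np.zeros(n_rows * n_cols)
    padded[: x.size] = x
    return padded.reshape(n_rows, n_cols)


def tensor_decision(X, u, v, b):
    """Bilinear decision function f(X) = u^T X v + b."""
    return float(u @ X @ v + b)


# Hypothetical sizes, chosen only to make the comparison concrete.
n_features = 2500
n_rows, n_cols = 50, 50

# A linear model on vectors learns one weight per feature plus a bias.
vector_params = n_features + 1
# A bilinear model on an n_rows x n_cols tensor learns u, v and a bias.
tensor_params = n_rows + n_cols + 1

rng = np.random.default_rng(0)
x = rng.random(n_features)            # stand-in for a text's feature vector
X = vector_to_tensor(x, n_rows, n_cols)
u = rng.standard_normal(n_rows)       # untrained, random weights
v = rng.standard_normal(n_cols)
score = tensor_decision(X, u, v, b=0.0)

print(vector_params, tensor_params)
```

With these assumed sizes the vector model has 2501 free parameters and the tensor model 101, which is the kind of reduction that helps when few training texts are available. The price is a constraint: the bilinear model is equivalent to a linear model whose weight matrix is the rank-one outer product of u and v, so how features are arranged in the tensor affects what the model can express.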

Author information

Authors and Affiliations

Dept. of Information and Communication Systems Eng., University of the Aegean, 83200, Karlovassi, Greece
Spyridon Plakias & Efstathios Stamatatos

Editor information

John Darzentas, George A. Vouros, Spyros Vosinakis, Argyris Arnellos

Cite this paper

Plakias, S., Stamatatos, E. (2008). Tensor Space Models for Authorship Identification. In: Darzentas, J., Vouros, G.A., Vosinakis, S., Arnellos, A. (eds) Artificial Intelligence: Theories, Models and Applications. SETN 2008. Lecture Notes in Computer Science, vol 5138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87881-0_22

DOI: https://doi.org/10.1007/978-3-540-87881-0_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87880-3
Online ISBN: 978-3-540-87881-0
eBook Packages: Computer Science, Computer Science (R0)
