Abstract
Authorship identification can be viewed as a text categorization task. However, the task has distinctive properties: the most frequent features appear to be the most important discriminators, training texts are usually scarce, and they are rarely evenly distributed over the authors. To cope with these problems, we propose second-order tensors for representing the stylistic properties of texts. Our approach requires estimating far fewer parameters than the traditional vector space representation. We examine various methods for building appropriate tensors, taking into account that similar features should be placed in the same neighborhood. Using an existing generalization of SVM that can handle tensors, we perform experiments on corpora controlled for genre and topic and show that the proposed approach can effectively handle cases where only limited training texts are available.
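The parameter saving mentioned in the abstract can be illustrated with a minimal sketch. It assumes the rank-one bilinear decision function u^T X v + b commonly used by support tensor machines; the abstract itself does not state the exact form, and all names, sizes, and weights below are hypothetical choices for illustration, not taken from the paper.

```python
# Illustrative sketch only: function names, sizes, and weights are assumptions.

def to_tensor(features, n_rows, n_cols):
    """Arrange a flat feature vector into an n_rows x n_cols matrix (row-major)."""
    if len(features) != n_rows * n_cols:
        raise ValueError("feature count must equal n_rows * n_cols")
    return [features[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def bilinear_score(X, u, v, b=0.0):
    """Decision value u^T X v + b of a rank-one (second-order) tensor classifier."""
    return sum(u[i] * X[i][j] * v[j]
               for i in range(len(u))
               for j in range(len(v))) + b


def parameter_counts(n_rows, n_cols):
    """Weights to estimate: (vector space model, second-order tensor model)."""
    return n_rows * n_cols, n_rows + n_cols


# Twelve toy feature values arranged as a 3 x 4 tensor.
features = [float(k) for k in range(12)]
X = to_tensor(features, 3, 4)

# Hypothetical weight vectors: one per tensor mode.
u = [1.0, 0.0, -1.0]
v = [0.5, 0.5, 0.0, 0.0]
score = bilinear_score(X, u, v)

# For 2500 features arranged as 50 x 50, compare the number of weights.
vector_params, tensor_params = parameter_counts(50, 50)
```

Under this assumption, a 50 x 50 arrangement of 2500 features needs 100 weights instead of 2500, which is why such a model may be easier to fit when few training texts are available. How features are assigned to rows and columns matters, since each weight is shared across a whole row or column.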
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Plakias, S., Stamatatos, E. (2008). Tensor Space Models for Authorship Identification. In: Darzentas, J., Vouros, G.A., Vosinakis, S., Arnellos, A. (eds) Artificial Intelligence: Theories, Models and Applications. SETN 2008. Lecture Notes in Computer Science, vol 5138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87881-0_22
DOI: https://doi.org/10.1007/978-3-540-87881-0_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87880-3
Online ISBN: 978-3-540-87881-0