The Use of Orthogonal Similarity Relations in the Prediction of Authorship

Sapkota, Upendra; Solorio, Thamar; Montes-y-Gómez, Manuel; Rosso, Paolo

doi:10.1007/978-3-642-37256-8_38

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7817)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Recent work on Authorship Attribution (AA) proposes the use of meta characteristics to train author models. The meta characteristics are orthogonal sets of similarity relations between the features from the different candidate authors. In that approach, the features are grouped and processed separately according to the type of information they encode, the so-called linguistic modalities. For instance, syntactic, stylistic and semantic features are each considered a different modality, as they represent different aspects of the texts. The assumption is that the independent extraction of meta characteristics results in more informative feature vectors, which in turn result in higher accuracies. In this paper we set out to study the empirical value of this modality-specific process. We experimented with different ways of generating the meta characteristics on different data sets with different numbers of authors and genres. Our results show that extracting the meta characteristics after splitting the features by their linguistic dimension yields a consistent improvement in prediction accuracy.
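
The modality-specific idea described in the abstract can be illustrated with a small sketch. The abstract does not say how the similarity relations are computed, so everything concrete here is an assumption made for illustration only: each author is summarized by the mean feature vector (centroid) of their training documents within a modality, similarity is cosine similarity, and the names `cosine`, `author_centroids` and `meta_features` are hypothetical. It should not be read as the authors' actual procedure.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def author_centroids(vectors, authors):
    """Mean feature vector per author for ONE modality (assumed author summary)."""
    grouped = defaultdict(list)
    for vec, author in zip(vectors, authors):
        grouped[author].append(vec)
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in grouped.items()
    }


def meta_features(doc_by_modality, centroids_by_modality):
    """Concatenate, modality by modality, the similarity of a document to every
    candidate author. Modalities are processed independently of each other."""
    meta = []
    for modality in sorted(centroids_by_modality):
        centroids = centroids_by_modality[modality]
        for author in sorted(centroids):
            meta.append(cosine(doc_by_modality[modality], centroids[author]))
    return meta


# Toy data (made up): two authors, two modalities, two features per modality.
train_authors = ["A", "B"]
train = {
    "stylistic": [[1.0, 0.0], [0.0, 1.0]],
    "syntactic": [[2.0, 0.0], [0.0, 3.0]],
}
centroids = {m: author_centroids(vecs, train_authors) for m, vecs in train.items()}

# A new document resembles author A stylistically and author B syntactically.
doc = {"stylistic": [1.0, 0.0], "syntactic": [0.0, 1.0]}
meta = meta_features(doc, centroids)
print(meta)
```

In this sketch the resulting vector has one entry per (modality, author) pair and would be passed to an ordinary classifier in place of, or alongside, the raw features. Computing the similarities over all features pooled together instead of per modality would be the non-modality-specific alternative that the paper compares against.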

Author information

Authors and Affiliations

University of Alabama at Birmingham, Birmingham, AL, 35294, USA
Upendra Sapkota & Thamar Solorio
Instituto Nacional de Astrofísica, Optica y Electrónica, Puebla, Mexico
Manuel Montes-y-Gómez
NLE Lab - ELiRF, Universitat Politècnica de València, Valencia, Spain
Paolo Rosso

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico D.F., Mexico
Alexander Gelbukh

Cite this paper

Sapkota, U., Solorio, T., Montes-y-Gómez, M., Rosso, P. (2013). The Use of Orthogonal Similarity Relations in the Prediction of Authorship. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7817. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37256-8_38

DOI: https://doi.org/10.1007/978-3-642-37256-8_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37255-1
Online ISBN: 978-3-642-37256-8
eBook Packages: Computer Science, Computer Science (R0)
