Abstract
Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful way to represent text for stylistic purposes because they capture nuances at the lexical, syntactic, and structural levels. So far, only character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few members in common. Moreover, we explore the significance of digits for distinguishing between authors, showing that an increase in performance can be achieved with simple text pre-processing.
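To make the ingredients named in the abstract concrete, here is a minimal Python sketch of character n-gram extraction, a digit pre-processing step, and information gain scoring. It is an illustration under stated assumptions, not the implementation used in the paper: the function names, the `@` placeholder for digits, the n-gram lengths 3 to 5, and the binary "n-gram occurs in the document" feature are all choices made here for the example. The sketch covers only the information gain baseline and does not implement the proposed variable-length selection method.

```python
import math
import re
from collections import Counter


def mask_digits(text, symbol="@"):
    """Replace every digit with one placeholder symbol.

    Assumption: the abstract only says "simple text pre-processing";
    masking digits with a single symbol is one plausible reading.
    """
    return re.sub(r"\d", symbol, text)


def char_ngrams(text, n_values=(3, 4, 5)):
    """Count overlapping character n-grams for each length in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, labels, ngram):
    """Information gain of the binary feature 'ngram occurs in the document'
    with respect to the author labels."""
    present = [lab for doc, lab in zip(docs, labels) if ngram in doc]
    absent = [lab for doc, lab in zip(docs, labels) if ngram not in doc]
    total = len(labels)
    # Weighted entropy of the labels after splitting on the feature.
    conditional = sum(len(part) / total * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional


# Usage: score every n-gram seen in a toy corpus and keep the top few.
docs = [mask_digits(d) for d in ["Shares rose 12% in 2005.",
                                 "Profit fell by 7 pct.",
                                 "Shares rose again in 2006.",
                                 "Profit fell sharply."]]
labels = ["author_a", "author_b", "author_a", "author_b"]
vocabulary = set().union(*(char_ngrams(d) for d in docs))
ranked = sorted(vocabulary,
                key=lambda g: information_gain(docs, labels, g),
                reverse=True)
top_features = ranked[:10]
```

Scoring n-grams of several lengths in one pool, as done here, is only a convenience for the example. A fixed-length setup would pass a single value in `n_values`.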
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Houvardas, J., Stamatatos, E. (2006). N-Gram Feature Selection for Authorship Identification. In: Euzenat, J., Domingue, J. (eds) Artificial Intelligence: Methodology, Systems, and Applications. AIMSA 2006. Lecture Notes in Computer Science, vol 4183. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11861461_10
DOI: https://doi.org/10.1007/11861461_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40930-4
Online ISBN: 978-3-540-40931-1
eBook Packages: Computer Science (R0)