Abstract
Most existing research on authorship attribution classifies texts using lexical, syntactic, and structural features. Some of these features are not meaningful for short texts such as email messages. In this paper we demonstrate an effective use of one syntactic feature of an author’s writing, the characteristics of the text’s parse trees, for authorship analysis of email messages. We define author templates consisting of the frequencies of context-free grammar (CFG) productions occurring in an author’s training set of email messages. We then extract the same frequencies from a new email message and match them against each author’s template to identify the best match. We evaluate our approach on the Enron email dataset and show that CFG production frequencies work very well and are robust in attributing the authorship of email messages.
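The template-matching idea described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: parse trees are written by hand as nested tuples (in practice they would come from a parser), lexical productions such as `NN -> report` are ignored, and the match score is a smoothed symmetric Kullback-Leibler divergence, which may differ from the measure the authors actually use. The function names `productions`, `template`, `divergence`, and `attribute` are hypothetical.

```python
from collections import Counter
from math import log

# Assumed representation: a parse tree is a nested tuple (label, child, ...),
# and leaves (words) are plain strings.

def productions(tree):
    """Yield non-lexical CFG productions such as 'S -> NP VP'."""
    label, *children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if kids:  # nodes with only word children (POS tags) yield nothing
        yield f"{label} -> {' '.join(k[0] for k in kids)}"
        for k in kids:
            yield from productions(k)

def template(trees):
    """Relative frequency of each production over a set of parse trees."""
    counts = Counter(p for t in trees for p in productions(t))
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def divergence(p, q, eps=1e-6):
    """Symmetric KL-style divergence; eps is a crude smoothing assumption."""
    score = 0.0
    for key in set(p) | set(q):
        a = p.get(key, 0.0) + eps
        b = q.get(key, 0.0) + eps
        score += (a - b) * log(a / b)
    return score

def attribute(message_trees, author_templates):
    """Return the author whose template is closest to the message."""
    msg = template(message_trees)
    return min(author_templates, key=lambda a: divergence(msg, author_templates[a]))

# Toy, hand-written parse trees (not taken from the Enron dataset).
alice_tree = ("S", ("NP", ("PRP", "I")),
              ("VP", ("VBP", "think"),
               ("SBAR", ("S", ("NP", ("PRP", "we")), ("VP", ("VBP", "agree"))))))
bob_tree = ("S", ("VP", ("VB", "send"), ("NP", ("DT", "the"), ("NN", "report"))))
new_tree = ("S", ("VP", ("VB", "call"), ("NP", ("DT", "the"), ("NN", "client"))))

templates = {"alice": template([alice_tree]), "bob": template([bob_tree])}
print(attribute([new_tree], templates))  # bob
```

The new message shares all of its productions with the second toy author, so its divergence from that template is zero and it is attributed there. With real email, each template would be built from many parsed sentences per author.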
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Patchala, J., Bhatnagar, R., Gopalakrishnan, S. (2015). Author Attribution of Email Messages Using Parse-Tree Features. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_21
DOI: https://doi.org/10.1007/978-3-319-21024-7_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science (R0)