Stylometric Analysis for Authorship Attribution on Twitter
Abstract
Authorship Attribution (AA), the science of inferring the author of a given piece of text from its characteristics, is a problem with a long history. In this paper, we study authorship attribution for forensic purposes and present machine learning techniques and stylometric features that enable authorship to be determined at rates significantly better than chance for texts of 140 characters or less. The analysis targets the micro-blogging site Twitter, where people share their interests and thoughts in the form of short messages called "tweets". Millions of tweets are posted daily via this service, and the possibility of sensitive and illegitimate text being shared cannot be ruled out. The technique discussed in this paper is a two-stage process: in the first stage, stylometric information is extracted from the collected dataset, and in the second stage, different classification algorithms are trained to predict the authors of unseen text. The aim is to maximize prediction accuracy with an optimal amount of data and number of users under consideration.
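The two-stage process described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the seven features are hypothetical examples of stylometric measurements, the nearest-centroid classifier is a dependency-free stand-in for the classification algorithms the paper evaluates, and the authors and tweets are invented toy data.

```python
import re
from collections import defaultdict


def extract_features(tweet):
    """Stage 1: map a tweet to a small stylometric vector (hypothetical feature set)."""
    words = tweet.split()
    n_chars = max(len(tweet), 1)   # avoid division by zero on empty input
    n_words = max(len(words), 1)
    return [
        len(tweet) / 140.0,                              # length relative to the 140-character limit
        sum(len(w) for w in words) / n_words / 10.0,     # mean word length, scaled down
        sum(c.isupper() for c in tweet) / n_chars,       # uppercase ratio
        sum(c.isdigit() for c in tweet) / n_chars,       # digit ratio
        sum(c in "!?.,;:" for c in tweet) / n_chars,     # punctuation ratio
        len(re.findall(r"#\w+", tweet)) / n_words,       # hashtags per word
        len(re.findall(r"@\w+", tweet)) / n_words,       # mentions per word
    ]


def train_centroids(samples):
    """Stage 2 (stand-in classifier): average the feature vectors of each author."""
    grouped = defaultdict(list)
    for author, tweet in samples:
        grouped[author].append(extract_features(tweet))
    return {
        author: [sum(column) / len(vectors) for column in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def predict(centroids, tweet):
    """Return the author whose centroid is closest in squared Euclidean distance."""
    vector = extract_features(tweet)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))

    return min(centroids, key=lambda author: squared_distance(centroids[author]))


# Invented toy data: two authors with visibly different writing habits.
training_data = [
    ("alice", "OMG!!! LOVE this #sunset #beach"),
    ("alice", "WOW!!! BEST day EVER #happy"),
    ("bob", "meeting moved to 3pm, see agenda at 2."),
    ("bob", "report due on the 14th, draft sent at 9."),
]
centroids = train_centroids(training_data)
guess_one = predict(centroids, "SO EXCITED!!! #party #fun")
guess_two = predict(centroids, "call at 5pm, bring the 2 files.")
```

Keeping feature extraction separate from the classifier mirrors the two stages in the abstract, so a different classifier could be substituted without touching the feature code.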
Keywords
Online Social Media, Twitter, Authorship Attribution, Machine Learning Classifier, Stylometry Analysis