Stylometric Analysis for Authorship Attribution on Twitter

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8302)

Abstract

Authorship Attribution (AA), the science of inferring the author of a given piece of text from its characteristics, is a problem with a long history. In this paper, we study the problem of authorship attribution for forensic purposes and present machine learning techniques and stylometric features of the authors that enable authorship to be determined at rates significantly better than chance for texts of 140 characters or less. This analysis targets the micro-blogging site Twitter, where people share their interests and thoughts in the form of short messages called "tweets". Millions of tweets are posted daily via this service, and the possibility of sharing sensitive and illegitimate text cannot be ruled out. The technique discussed in this paper is a two-stage process: in the first stage, stylometric information is extracted from the collected dataset, and in the second stage, different classification algorithms are trained to predict the authors of unseen text. The effort is directed towards maximizing prediction accuracy with an optimum amount of data and number of users under consideration.
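
The two-stage process described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the seven features, the toy messages, and the author names are invented, and a simple nearest-centroid rule stands in for the classification algorithms, which the abstract does not name. It is not the authors' feature set or implementation.

```python
import math
import re
from collections import defaultdict


def extract_features(text):
    """Stage 1 (illustrative): map one short message to a stylometric vector."""
    words = re.findall(r"[A-Za-z']+", text)
    n_chars = max(len(text), 1)   # guard against empty messages
    n_words = max(len(words), 1)
    return [
        len(text) / 140.0,                               # length relative to the 140-character limit
        sum(len(w) for w in words) / n_words / 10.0,     # average word length, scaled
        sum(c.isupper() for c in text) / n_chars,        # share of uppercase characters
        sum(c.isdigit() for c in text) / n_chars,        # share of digits
        sum(c in "!?.,;:" for c in text) / n_chars,      # share of punctuation marks
        text.count("#") / n_words,                       # hashtags per word
        text.count("@") / n_words,                       # mentions per word
    ]


def train_centroids(samples):
    """Stage 2 (stand-in classifier): average the feature vectors of each author."""
    grouped = defaultdict(list)
    for author, text in samples:
        grouped[author].append(extract_features(text))
    return {
        author: [sum(column) / len(vectors) for column in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def predict(centroids, text):
    """Return the author whose centroid is closest in Euclidean distance."""
    vector = extract_features(text)
    return min(centroids, key=lambda author: math.dist(vector, centroids[author]))


# Hypothetical toy data, invented for this sketch only.
samples = [
    ("alice", "OMG!!! BEST DAY EVER!!! #winning"),
    ("alice", "WOW!!! SO HAPPY!!! #blessed"),
    ("bob", "heading to the office now, long meeting ahead"),
    ("bob", "reading a quiet book by the window this evening"),
]
centroids = train_centroids(samples)
print(predict(centroids, "YES!!! GREAT NEWS!!! #excited"))
```

Keeping feature extraction separate from the classifier mirrors the two stages: the same vectors could be passed to any other learning algorithm without changing the first stage.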

Keywords

Online Social Media; Twitter; Authorship Attribution; Machine Learning Classifier; Stylometry Analysis

Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  1. Jaypee Institute of Information Technology, India
