Improving Authorship Attribution in Twitter Through Topic-Based Sampling

  • Conference paper
AI 2017: Advances in Artificial Intelligence (AI 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10400)

Abstract

Aliases are used as a means of anonymity on the Internet in environments such as IRC (Internet Relay Chat), forums and micro-blogging websites such as Twitter. While there are genuine reasons for the use of aliases, such as journalists operating in politically oppressive countries, they are increasingly being used by cybercriminals and extremist organisations. In recent years, we have seen increased research on authorship attribution of Twitter messages, including authorship analysis of aliases. Previous studies have shown that anti-aliasing of randomly generated sub-aliases yields high accuracy when linking the sub-aliases, but becomes much less accurate when topic-based sub-aliases are used. N-gram methods have previously been demonstrated to perform better than other methods in this situation. This paper investigates the effect of topic-based sampling on authorship attribution accuracy for the popular micro-blogging website Twitter. Features are extracted using character n-grams, which accurately capture differences in authorship style. These features are classified using support vector machines in a one-versus-all configuration. The predictive performance of the algorithm is then evaluated using two different sampling methodologies: authors sampled through a context-sensitive topic-based search, and authors sampled randomly. Topic-based sampling of authors is found to produce more accurate authorship predictions. This paper presents several theories as to why this might be the case.
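The pipeline summarised in the abstract (character n-gram features fed to one-versus-all support vector machines) can be illustrated with a minimal sketch. This is not the authors' implementation: the n-gram length of 3, the L2 normalisation, the plain subgradient-descent trainer (used here in place of a library SVM solver), the hyperparameters, and the toy tweets and author names are all assumptions chosen only to make the example self-contained.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def normalise(counts):
    """Scale a sparse count vector to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {gram: v / norm for gram, v in counts.items()}


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(gram, 0.0) * v for gram, v in features.items())


def train_binary_svm(samples, labels, epochs=200, lam=0.01, lr=0.1):
    """Linear SVM fitted by subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 or -1. Returns (weights, bias).
    """
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            violated = y * (dot(w, x) + b) < 1
            for gram in w:                  # shrink step from the L2 penalty
                w[gram] *= 1.0 - lr * lam
            if violated:                    # hinge-loss subgradient step
                for gram, v in x.items():
                    w[gram] = w.get(gram, 0.0) + lr * y * v
                b += lr * y
    return w, b


def train_one_vs_all(tweets, authors, n=3):
    """Train one binary SVM per author: that author's tweets versus all others."""
    samples = [normalise(char_ngrams(t, n)) for t in tweets]
    return {
        author: train_binary_svm(samples, [1 if a == author else -1 for a in authors])
        for author in sorted(set(authors))
    }


def predict(models, tweet, n=3):
    """Return the author whose binary SVM gives the highest decision score."""
    x = normalise(char_ngrams(tweet, n))
    return max(models, key=lambda a: dot(models[a][0], x) + models[a][1])


# Hypothetical toy data, invented for illustration only.
TWEETS = [
    "omg cant wait 4 the weekend!!! lol",
    "lol omg this is 2 funny!!!",
    "Regarding the quarterly report, please see attached.",
    "Please review the attached report at your convenience.",
    "coffee first... then code... then more coffee...",
    "rainy day... perfect for coffee and code...",
]
AUTHORS = ["alice", "alice", "bob", "bob", "carol", "carol"]

if __name__ == "__main__":
    models = train_one_vs_all(TWEETS, AUTHORS)
    print(predict(models, "omg lol so excited!!!"))
```

Character n-grams pick up punctuation habits, abbreviations and spacing as well as word fragments, which is why they are a common choice for short texts such as tweets. In practice an established SVM library would normally replace the hand-written trainer; the sketch only shows how the per-author binary classifiers combine into a single prediction.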

Author information

Corresponding author

Correspondence to Luoxi Pan.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Pan, L., Gondal, I., Layton, R. (2017). Improving Authorship Attribution in Twitter Through Topic-Based Sampling. In: Peng, W., Alahakoon, D., Li, X. (eds) AI 2017: Advances in Artificial Intelligence. AI 2017. Lecture Notes in Computer Science, vol 10400. Springer, Cham. https://doi.org/10.1007/978-3-319-63004-5_20

  • DOI: https://doi.org/10.1007/978-3-319-63004-5_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63003-8

  • Online ISBN: 978-3-319-63004-5

  • eBook Packages: Computer Science, Computer Science (R0)
