
An Ensemble Learning Approach for Addressing the Class Imbalance Problem in Twitter Spam Detection

  • Shigang Liu
  • Yu Wang
  • Chao Chen
  • Yang Xiang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9722)

Abstract

As an important channel for real-time information dissemination, Twitter has inevitably become a prime target for spammers. It has been shown that the damage caused by Twitter spam can reach far beyond the social media platform itself. To mitigate the threat, many recent studies use machine learning techniques to classify Twitter spam and report very satisfactory results. However, most of these studies overlook a fundamental issue that is widely seen in real-world Twitter data: the class imbalance problem. In this paper, we show that the unequal distribution between spam and non-spam classes in the data has a great impact on the spam detection rate. To address the problem, we propose an ensemble learning approach that involves three steps. In the first step, we adjust the class distribution in the imbalanced data set using various strategies, including random oversampling, random undersampling, and fuzzy-based oversampling. In the second step, a classification model is built on each of the redistributed data sets. In the final step, a majority voting scheme is introduced to combine all the classification models. Experimental results obtained using real-world Twitter data indicate that the proposed approach can significantly improve the spam detection rate in data sets with imbalanced class distribution.
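The three steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it rests on several assumptions. The base classifier is a simple k-nearest-neighbour rule chosen only to keep the example self-contained, and the fuzzy-based oversampling strategy is not specified in the abstract, so a second random undersample stands in for it here. All function names (`random_oversample`, `random_undersample`, `knn_predict`, `build_ensemble`, `ensemble_predict`) are hypothetical.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect feature rows under their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(X, y, rng):
    """Step 1a: duplicate random minority rows until every class matches the largest."""
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, rng):
    """Step 1b: keep a random subset of each class, sized to the smallest class."""
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


def knn_predict(train_X, train_y, x, k=3):
    """Step 2 (assumed base learner): label x by the majority of its k nearest rows."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row, x))
    ranked = sorted(zip(train_X, train_y), key=lambda pair: squared_distance(pair[0]))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def majority_vote(votes):
    """Step 3: return the label predicted by most ensemble members."""
    return Counter(votes).most_common(1)[0][0]


def build_ensemble(X, y, seed=0):
    """Create three redistributed data sets; each one backs one k-NN model.

    Assumption: the third set is a second random undersample, used as a
    placeholder for the fuzzy-based oversampling named in the abstract.
    """
    rng = random.Random(seed)
    return [
        random_oversample(X, y, rng),
        random_undersample(X, y, rng),
        random_undersample(X, y, rng),
    ]


def ensemble_predict(datasets, x, k=3):
    """Query every member model and combine the answers by majority vote."""
    votes = [knn_predict(train_X, train_y, x, k) for train_X, train_y in datasets]
    return majority_vote(votes)


# Toy, made-up data: six non-spam points and two spam points (imbalanced).
toy_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (0.0, 0.3),
         (5.0, 5.0), (5.2, 4.9)]
toy_y = ["ham"] * 6 + ["spam"] * 2
toy_ensemble = build_ensemble(toy_X, toy_y, seed=42)
```

Calling `ensemble_predict(toy_ensemble, (5.1, 5.0))` asks each of the three members for a label and returns the majority. Using an odd number of members avoids tied votes in the two-class case.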

Keywords

Online social networks, Twitter spam, Machine learning, Class imbalance

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. School of Information Technology, Deakin University, Geelong, Australia
