Attribute selection for improving spam classification in online social networks: a rough set theory-based approach

  • Soumi Dutta
  • Sujata Ghatak
  • Ratnadeep Dey
  • Asit Kumar Das
  • Saptarshi Ghosh
Original Article


As online social network (OSN) sites become increasingly popular, they are targeted by spammers who post malicious content on the sites. Hence, it is important to filter out spam accounts and spam posts from OSNs. Several prior works on spam classification in OSNs use various features to distinguish between spam and legitimate entities. The objective of this study is to improve such spam classification by developing an attribute selection methodology that finds a smaller subset of the attributes which leads to better classification. Specifically, we apply the concepts of rough set theory to develop the attribute selection algorithm. We perform experiments on five different spam classification datasets from diverse OSNs and compare the performance of the proposed methodology with that of several baseline methodologies for attribute selection. We find that, for most of the datasets, the proposed methodology selects an attribute subset that is smaller than those selected by the baseline methodologies, yet achieves better classification performance than the other methods.
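The abstract does not spell out the selection algorithm itself. As background only, the sketch below illustrates a standard rough-set notion, the dependency degree of the class label on an attribute subset, combined with a generic greedy forward selection. It is not the authors' method: the function names, the toy rows, and the attribute names (`has_url`, `many_hashtags`, `new_account`) are all invented for illustration, and the attributes are assumed to be already discretized.

```python
from collections import defaultdict

def dependency(rows, attrs, decision):
    """Rough-set dependency degree: the fraction of rows whose equivalence
    class under `attrs` contains a single decision value (positive region)."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attrs)
        classes[key].append(row[decision])
    consistent = sum(len(labels) for labels in classes.values()
                     if len(set(labels)) == 1)
    return consistent / len(rows)

def greedy_reduct(rows, attrs, decision):
    """Greedy forward selection: repeatedly add the attribute that raises the
    dependency degree most, until it matches that of the full attribute set.
    A heuristic only; it stops early if no single attribute improves the score."""
    target = dependency(rows, attrs, decision)
    selected = []
    best = dependency(rows, selected, decision)
    while best < target:
        candidate, gain = None, best
        for a in attrs:
            if a in selected:
                continue
            score = dependency(rows, selected + [a], decision)
            if score > gain:
                candidate, gain = a, score
        if candidate is None:
            break
        selected.append(candidate)
        best = gain
    return selected

# Hypothetical toy data: an account is spam only if it posts URLs AND is new.
ATTRS = ["has_url", "many_hashtags", "new_account"]
ROWS = [
    {"has_url": 1, "many_hashtags": 1, "new_account": 1, "label": "spam"},
    {"has_url": 1, "many_hashtags": 0, "new_account": 1, "label": "spam"},
    {"has_url": 1, "many_hashtags": 1, "new_account": 0, "label": "ham"},
    {"has_url": 0, "many_hashtags": 0, "new_account": 0, "label": "ham"},
    {"has_url": 0, "many_hashtags": 1, "new_account": 1, "label": "ham"},
    {"has_url": 0, "many_hashtags": 0, "new_account": 1, "label": "ham"},
]

print(greedy_reduct(ROWS, ATTRS, "label"))
```

On this toy table the selection keeps `has_url` and `new_account` and drops `many_hashtags`, since the two retained attributes already separate the classes as well as all three do.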


Keywords: Online social networks; Spam detection; Classification; Attribute selection; Rough set theory



We thank the anonymous reviewers for their valuable comments and suggestions, which helped to improve the paper. We also acknowledge useful discussions with Arpan Das and Anirban Majumder in the early phases of the work.

Supplementary material

Supplementary material 1: 13278_2017_484_MOESM1_ESM.csv (CSV, 31725 KB)



Copyright information

© Springer-Verlag GmbH Austria, part of Springer Nature 2018

Authors and Affiliations

  • Soumi Dutta (1, 2)
  • Sujata Ghatak (2)
  • Ratnadeep Dey (1)
  • Asit Kumar Das (1)
  • Saptarshi Ghosh (1, 3)
  1. Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, India
  2. Department of Computer Science and Engineering, Institute of Engineering & Management, Kolkata, India
  3. Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India
