SPADE: a social-spam analytics and detection framework

  • De Wang
  • Danesh Irani
  • Calton Pu
Original Article

Abstract

Social media such as Facebook, MySpace, and Twitter have become increasingly important, attracting millions of users. Consequently, spammers are increasingly using such networks to propagate spam. Although existing filtering techniques such as collaborative filters and behavioral analysis filters can significantly reduce spam, each social network needs to build its own independent spam filter and support a spam team to keep spam prevention techniques current. To alleviate these problems, we propose a framework for spam analytics and detection that can be used across all social network sites. Specifically, the proposed framework, SPADE, has numerous benefits, including: (1) new spam detected on one social network can quickly be identified across social networks; (2) accuracy of spam detection will be improved through cross-domain classification and associative classification; (3) other techniques (such as blacklists and message shingling) can be integrated and centralized; (4) new social networks can plug into the system easily, preventing spam at an early stage. In SPADE, we present a uniform schema model to allow cross-social network integration. In this paper, we define the user, message, and web page models. Moreover, we provide an experimental study of real datasets from social networks to demonstrate the flexibility and feasibility of our framework. We extensively evaluated two major classification approaches in SPADE: cross-domain classification and associative classification. In cross-domain classification, SPADE achieved over 0.92 F-measure and over 91 % detection accuracy on the web page model using a Naïve Bayes classifier. In associative classification, SPADE achieved 0.89 F-measure on the message model and 0.87 F-measure on the user profile model. Both detection accuracies exceed 85 %. These results demonstrate that SPADE is a competitive spam detection solution for social media.

Keywords

Social spam, Framework, Detection, Classification

Notes

Acknowledgments

This research has been partially funded by the National Science Foundation through the CNS/SAVI (1250260), IUCRC/FRP (1127904), CISE/CNS (1138666), RAPID (1138666), CISE/CRI (0855180), and NetSE (0905493) programs, and by gifts, grants, or contracts from DARPA/I2O, the Singapore Government, Fujitsu Labs, and the Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other funding agencies and companies mentioned above.

Copyright information

© Springer-Verlag Wien 2014

Authors and Affiliations

  1. College of Computing, Georgia Institute of Technology, Atlanta, USA