Electronic Commerce Research, Volume 17, Issue 1, pp 51–81

Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection

  • Long Song
  • Raymond Yiu Keung Lau
  • Ron Chi-Wai Kwok
  • Kristijan Mirkovski
  • Wenyu Dou


With the rise of the social web, concern has grown about the quality of user-generated content on social media sites (SMSs). Deceptive comments harm users’ trust in online social media and cause financial loss to firms. Previous studies have used various features and classification algorithms to detect and filter social spam on several social media platforms. However, to the best of our knowledge, previous studies have not exploited both probabilistic topic modeling and incremental learning to detect social spam on SMSs. Thus, the main contribution of this paper is the design of a novel detection methodology that combines topic- and user-based features to improve the effectiveness of social spam detection. The proposed methodology exploits a probabilistic generative model, namely labeled latent Dirichlet allocation (L-LDA), for mining the latent semantics from user-generated comments, and an incremental learning approach for tackling the changing feature space. An experiment based on a large dataset extracted from YouTube demonstrates the effectiveness of our proposed methodology, which achieves an average accuracy of 91.17% in social spam detection. Our statistical analysis reveals that topic-based features significantly improve social spam detection, a finding with important implications for business practice.
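As a rough illustration of the idea of learning incrementally over a changing feature space, the following minimal sketch trains a sparse online perceptron one comment at a time on merged topic- and user-based features. It is not the authors' implementation: the perceptron is a stand-in learner, the topic weights are assumed to come from some topic model that is not shown (L-LDA in the paper), and every name in it (`OnlinePerceptron`, `combine`, `promo`, `giveaway`, `links_per_comment`) is hypothetical.

```python
class OnlinePerceptron:
    """Mistake-driven linear classifier updated one example at a time.

    Weights live in a dict keyed by feature name, so features that first
    appear late in the stream need no retraining or resizing.
    """

    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight; absent means 0.0
        self.bias = 0.0

    def score(self, features):
        return self.bias + sum(
            self.weights.get(name, 0.0) * value
            for name, value in features.items()
        )

    def predict(self, features):
        # +1 = spam, -1 = legitimate (label encoding is an assumption)
        return 1 if self.score(features) > 0 else -1

    def partial_fit(self, features, label):
        # Standard perceptron rule: update only when the prediction is wrong.
        if self.predict(features) != label:
            for name, value in features.items():
                old = self.weights.get(name, 0.0)
                self.weights[name] = old + self.learning_rate * label * value
            self.bias += self.learning_rate * label


def combine(topic_features, user_features):
    """Merge topic- and user-based features into one namespaced dict."""
    combined = {f"topic:{k}": v for k, v in topic_features.items()}
    combined.update({f"user:{k}": v for k, v in user_features.items()})
    return combined


# Hypothetical stream of (features, label) pairs; values are made up.
stream = [
    (combine({"promo": 0.9, "music": 0.1}, {"links_per_comment": 1.0}), 1),
    (combine({"promo": 0.1, "music": 0.9}, {"links_per_comment": 0.0}), -1),
    # A topic not seen before enters the feature space here.
    (combine({"giveaway": 0.8, "music": 0.2}, {"links_per_comment": 1.0}), 1),
]

model = OnlinePerceptron()
for features, label in stream:
    model.partial_fit(features, label)

print(model.predict(combine({"music": 1.0}, {"links_per_comment": 0.0})))
```

Keying the weights by feature name is the design choice that matters here: a new topic or user attribute simply becomes a new dictionary entry the first time an update touches it, which is one simple way to cope with a feature space that changes over time.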


Keywords: Social spam, Spam detection, Topic modeling, Incremental learning, Machine learning, Big data



This work was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project: CityU 11502115), and the Shenzhen Municipal Science and Technology R&D Funding - Basic Research Program (Project No. JCYJ20140419115614350).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Department of Information Systems, College of Business, City University of Hong Kong, Kowloon Tong, People’s Republic of China
  2. School of Information Management, Victoria Business School, Victoria University of Wellington, Wellington, New Zealand
  3. Department of Marketing, College of Business, City University of Hong Kong, Kowloon Tong, People’s Republic of China
