An efficient approach to comment spam identification

Journal of Electronics (China)

Abstract

This paper proposes a novel approach to comment spam identification based on content analysis. Three main features are used: the number of links, content repetitiveness, and text similarity. Content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments, precision reaches 93% on Chinese comments and 82% on English comments. These results show the validity and language independence of the approach. Unlike conventional spam filtering approaches, the method requires no training, no rule sets, and no link relationships. It handles both new and existing comments.
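The three features named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the URL pattern, the whitespace-style tokenization, and the reading of "frequency" as the number of comments containing the substring are all assumptions made here, and the abstract gives no thresholds for combining the features. For Chinese text, the simple tokenizer below would need to be replaced by a word segmenter or character n-grams.

```python
import math
import re
from collections import Counter

# Assumed URL pattern; the abstract does not say how links are detected.
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def count_links(comment: str) -> int:
    """Feature 1: number of links in a comment."""
    return len(LINK_RE.findall(comment))


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (dynamic programming)."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match in a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def substring_frequency(sub: str, comments: list[str]) -> int:
    """Feature 2 (assumed reading): how many comments contain the substring."""
    if not sub:
        return 0
    return sum(1 for c in comments if sub in c)


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts; an assumed, very simple tokenizer."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Feature 3: cosine of the angle between two term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A comment with many links, a long substring repeated across many other comments, or low similarity to the post it is attached to would then be a candidate for spam. How the paper weighs these signals is not stated in the abstract.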

Author information

Corresponding author

Correspondence to Yuhang Yang.

Additional information

Supported by the National Natural Science Foundation of China (Nos. 60736044 and 60803094).

Corresponding author: Yang Yuhang, born in 1983, male, Ph.D. candidate.

About this article

Cite this article

Yang, Y., Zhao, T., Zheng, D. et al. An efficient approach to comment spam identification. J. Electron.(China) 26, 644–650 (2009). https://doi.org/10.1007/s11767-007-0115-z

  • DOI: https://doi.org/10.1007/s11767-007-0115-z
