An efficient approach to comment spam identification
Abstract
This paper proposes a novel approach to comment spam identification based on content analysis. Three main features are used: the number of links, content repetitiveness, and text similarity. Content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments on Chinese and English comments, precision reaches 93% and 82%, respectively. The results indicate that the approach is valid and language independent. Unlike conventional spam filtering approaches, the method requires no training, no rule sets, and no link relationships. It can handle new comments as well as existing ones.
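The three features named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the URL pattern, the use of raw term frequencies, and the regex word tokenizer are all assumptions made for illustration. The abstract also does not say what the comment is compared against for text similarity; the sketch simply compares any two texts, for instance a comment and the post it is attached to.

```python
import math
import re
from collections import Counter

# Assumption: a link is any http(s) URL; the paper may count links differently.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def count_links(comment: str) -> int:
    """Feature 1: number of links found in the comment text."""
    return len(URL_PATTERN.findall(comment))


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def repetitiveness(comment: str, others: list[str]) -> tuple[int, int]:
    """Feature 2: length of the longest substring the comment shares with
    any other comment, and the number of other comments containing it."""
    best = ""
    for other in others:
        candidate = longest_common_substring(comment, other)
        if len(candidate) > len(best):
            best = candidate
    frequency = sum(1 for other in others if best and best in other)
    return len(best), frequency


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Feature 3: cosine similarity of term-frequency vectors.

    Assumption: regex word tokens and raw counts. Chinese text would need
    a word segmenter or character n-grams instead of this tokenizer.
    """
    va = Counter(re.findall(r"\w+", text_a.lower()))
    vb = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0
```

How the three values are combined into a spam decision is not stated in the abstract, so the sketch stops at feature extraction. Because none of the functions needs labelled data or a rule list, they can be applied to a newly posted comment as readily as to stored ones.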
Key words
Comment spam, Automatic identification, Content analysis, Blog
CLC index
TP391.1, TP391.3