An efficient approach to comment spam identification

Yang, Yuhang; Zhao, Tiejun; Zheng, Dequan; Yu, Hao

doi:10.1007/s11767-007-0115-z

Published: 19 December 2009

Volume 26, pages 644–650, (2009)

Journal of Electronics (China)

Yuhang Yang¹,², Tiejun Zhao¹, Dequan Zheng¹ & Hao Yu¹

Abstract

This paper proposes a novel approach to comment spam identification based on content analysis. Three main features are used: the number of links, content repetitiveness, and text similarity. In practice, content repetitiveness is determined by the length and frequency of the longest common substring, and text similarity is calculated using the vector space model. In preliminary experiments on comment spam identification, precision reaches 93% on Chinese and 82% on English. The results show the validity and language independence of the approach. Unlike conventional spam filtering approaches, the method requires no training, no rule sets, and no link relationships. It can also handle new comments as well as existing ones.
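The three features named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the features are computed or combined, so the URL pattern, the whitespace tokenization, the raw term-frequency weighting, and every function name below are assumptions made for illustration. Whitespace tokenization in particular would not suit Chinese text, where character-level units would be a more plausible choice.

```python
import math
import re
from collections import Counter

# Assumed URL pattern; the paper's definition of a "link" is not given.
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def count_links(comment: str) -> int:
    """Feature 1: number of URLs found in the comment."""
    return len(LINK_RE.findall(comment))


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def repetitiveness(comment: str, others: list[str]) -> tuple[int, int]:
    """Feature 2 (assumed form): length of the longest substring shared with
    any other comment, and how many other comments contain that substring."""
    best = ""
    for other in others:
        candidate = longest_common_substring(comment, other)
        if len(candidate) > len(best):
            best = candidate
    frequency = sum(best in other for other in others) if best else 0
    return len(best), frequency


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Feature 3: cosine of raw term-frequency vectors over whitespace tokens."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

A comment could then be scored by calling `count_links`, `repetitiveness` against the other comments on the same page, and `cosine_similarity` against some reference text such as the post body. Which text the similarity is measured against, and what thresholds turn these values into a spam decision, are not stated in the abstract and are left open here.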

Author information

Authors and Affiliations

MOE-MS Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, Harbin, 150001, China
Yuhang Yang, Tiejun Zhao, Dequan Zheng & Hao Yu
Harbin Institute of Technology, Room 611, New Technology Building, Box 321, Harbin, 150001, China
Yuhang Yang

Corresponding author

Correspondence to Yuhang Yang.

Additional information

Supported by the National Natural Science Foundation of China (No.60736044, 60803094).

Communication author: Yang Yuhang, born in 1983, male, Ph.D. candidate.

About this article

Cite this article

Yang, Y., Zhao, T., Zheng, D. et al. An efficient approach to comment spam identification. J. Electron.(China) 26, 644–650 (2009). https://doi.org/10.1007/s11767-007-0115-z

Received: 27 June 2007
Revised: 24 August 2008
Published: 19 December 2009
Issue Date: September 2009
DOI: https://doi.org/10.1007/s11767-007-0115-z
