A Comparative Study of Blog Comments Spam Filtering with Machine Learning Techniques

  • Christian Romero
  • Mario Garcia-Valdez
  • Arnulfo Alanis
Part of the Studies in Computational Intelligence book series (SCI, volume 312)


In this paper we compare four machine learning techniques for spam filtering in blog comments: Naïve Bayes, k-nearest neighbors, neural networks, and support vector machines. We used a corpus of 1021 blog comments, 67% of which are spam, and we present the filtering results obtained with 10-fold cross-validation.
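The evaluation protocol named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the comments were tokenized or how the folds were drawn, so the whitespace tokenizer, the add-one smoothing, the shuffled fold assignment, and all function names (`k_fold_indices`, `train_nb`, `predict_nb`, `cross_validate`) are assumptions. Only Naïve Bayes is shown, and the toy comments are invented for the example and are not taken from the paper's corpus.

```python
import math
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (word counts per class)."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest log-posterior, add-one smoothed."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        for w in doc.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best


def cross_validate(docs, labels, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(docs), k, seed):
        held_out = set(test_idx)
        train_docs = [d for i, d in enumerate(docs) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_nb(train_docs, train_labels)
        correct = sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Hypothetical toy comments, invented for illustration only.
toy_spam = ["cheap pills online", "casino bonus click here",
            "buy cheap watches now", "free casino money click"]
toy_ham = ["great post thanks for sharing", "I disagree with your second point",
           "thanks this tutorial helped me", "interesting post about the topic"]
toy_docs = (toy_spam + toy_ham) * 3
toy_labels = (["spam"] * 4 + ["ham"] * 4) * 3

print(cross_validate(toy_docs, toy_labels, k=10))
```

With a corpus of 1021 comments, this fold assignment yields nine folds of 102 comments and one of 103. Because about two thirds of the corpus is spam, a stratified split that preserves the class ratio in every fold would be a reasonable alternative to the plain shuffle used here.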


Keywords: Support Vector Machine, Feature Vector, Lagrange Multiplier, Decision Boundary, Training Instance
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Christian Romero (1)
  • Mario Garcia-Valdez (1)
  • Arnulfo Alanis (1)
  1. Tijuana Institute of Technology, Tijuana, México
