Collective Classification for Spam Filtering

  • Carlos Laorden
  • Borja Sanz
  • Igor Santos
  • Patxi Galán-García
  • Pablo G. Bringas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6694)


Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. Many solutions rely on machine-learning algorithms trained on statistical representations of the terms that usually appear in e-mails. These methods, however, require a training step with labelled data. When labelled training instances are scarce, the progress of filtering systems slows down, which gives spammers an advantage. Many current approaches therefore turn to Semi-Supervised Learning (SSL). SSL lies halfway between supervised and unsupervised learning: in addition to unlabelled data, it receives some supervision information, such as the target labels of some of the examples. Collective classification, as applied to text classification, is an interesting method for optimising the classification of partially-labelled data. We therefore propose, for the first time, the use of collective classification algorithms for spam filtering, to cope with the large number of unclassified e-mails that are sent every day.
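
As a rough illustration of the semi-supervised setting described above, the following sketch trains a small naive Bayes filter on two labelled e-mails and then lets it label a pool of unlabelled ones. It uses self-training, a simple semi-supervised technique chosen here for brevity; it is not one of the collective classification algorithms evaluated in the paper. The function names (`train_nb`, `predict_nb`, `self_train`), the `min_margin` threshold and the toy messages are all assumptions made for this example.

```python
import math
from collections import Counter

CLASSES = ("spam", "ham")

def tokenize(text):
    # Naive whitespace tokenizer; a real filter would also remove stop words.
    return text.lower().split()

def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model (assumes both classes are present)."""
    word_counts = {label: Counter() for label in CLASSES}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def predict_nb(model, doc):
    """Return (label, margin), where margin is the log-score gap between classes."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in CLASSES:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for tok in tokenize(doc):
            # Add-one smoothing, with one extra slot for unseen tokens.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocab) + 1))
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, abs(scores["spam"] - scores["ham"])

def self_train(labelled, unlabelled, min_margin=1.0, max_rounds=5):
    """Repeatedly move confidently predicted e-mails into the training set."""
    docs = [doc for doc, _ in labelled]
    labels = [label for _, label in labelled]
    pool = list(unlabelled)
    for _ in range(max_rounds):
        model = train_nb(docs, labels)
        confident, remaining = [], []
        for doc in pool:
            label, margin = predict_nb(model, doc)
            if margin >= min_margin:
                confident.append((doc, label))
            else:
                remaining.append(doc)
        if not confident:
            break  # nothing new was learned in this round
        for doc, label in confident:
            docs.append(doc)
            labels.append(label)
        pool = remaining
    return train_nb(docs, labels)

# Toy data invented for this sketch: two labelled and four unlabelled e-mails.
labelled = [("win free money now", "spam"),
            ("meeting agenda for monday", "ham")]
unlabelled = ["free money prize", "monday meeting notes",
              "claim your prize now", "notes from the agenda review"]

model = self_train(labelled, unlabelled)
for message in ("free prize", "meeting notes"):
    print(message, "->", predict_nb(model, message)[0])
```

In this toy run, the word "prize" never appears in the labelled e-mails; the filter can only associate it with spam after absorbing unlabelled messages, which is the kind of benefit a semi-supervised approach aims for when labels are scarce.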


Spam filtering, collective classification, semi-supervised learning



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Carlos Laorden
    • 1
  • Borja Sanz
    • 1
  • Igor Santos
    • 1
  • Patxi Galán-García
    • 1
  • Pablo G. Bringas
    • 1
  1. DeustoTech Computing - S3Lab, University of Deusto, Bilbao, Spain
