The Impact of Noise in Spam Filtering: A Case Study

  • I. Cid
  • L. R. Janeiro
  • J. R. Méndez
  • D. Glez-Peña
  • F. Fdez-Riverola
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5077)


Unsolicited commercial e-mail (UCE), more commonly known as spam, is a growing problem on the Internet. Every day, people receive large numbers of unwanted advertising e-mails that flood their mailboxes. Fortunately, several spam-filtering approaches can detect and automatically delete such messages. However, spammers have adopted techniques that reduce the effectiveness of these filters by introducing noise into their messages. This work presents a new pre-processing technique for noise identification and reduction, and reports preliminary results obtained when it is applied with a Flexible Bayes classifier. The experimental analysis confirms the advantages of the proposed technique for improving spam filter accuracy.


Keywords: Feature Selection; Mutual Information; Feature Selection Method; Document Frequency; Feature Selection Technique





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • I. Cid (1)
  • L. R. Janeiro (1)
  • J. R. Méndez (1)
  • D. Glez-Peña (1)
  • F. Fdez-Riverola (1)
  1. Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain
