A Comparative Impact Study of Attribute Selection Techniques on Naïve Bayes Spam Filters

  • J. R. Méndez
  • I. Cid
  • D. Glez-Peña
  • M. Rocha
  • F. Fdez-Riverola
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5077)


The main problem of the Internet e-mail service is the massive delivery of spam messages. Every day, Internet users receive millions of unwanted and unhelpful messages that clutter their mailboxes. Fortunately, there are now different kinds of filters able to automatically identify and delete most of these messages. In order to reduce the amount of information to be handled, only the most distinctive attributes are selected to represent spam and legitimate e-mails. This work presents a comparative study of the performance of five well-known feature selection techniques when they are applied in conjunction with four different types of Naïve Bayes classifier. The results obtained from the experiments show the relevance of choosing an appropriate feature selection technique in order to obtain the most accurate results.
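As a rough illustration of the pipeline the abstract describes, the sketch below ranks terms with one feature selection criterion and feeds the selected terms to one Naïve Bayes variant. The abstract does not name the five techniques or the four classifier types, so the choice of information gain and of a Bernoulli-style Naïve Bayes with Laplace smoothing is an assumption made here for illustration only. The toy corpus and all function names are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """Information gain of the binary attribute 'term occurs in the message'."""
    classes = sorted(set(labels))
    base = entropy([labels.count(c) for c in classes])
    remainder = 0.0
    for present in (True, False):
        subset = [y for d, y in zip(docs, labels) if (term in d) == present]
        if subset:
            weight = len(subset) / len(docs)
            remainder += weight * entropy([subset.count(c) for c in classes])
    return base - remainder

def select_terms(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: -information_gain(docs, labels, t))
    return ranked[:k]

def train_bernoulli_nb(docs, labels, terms):
    """Estimate class priors and Laplace-smoothed P(term present | class)."""
    model = {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior = len(class_docs) / len(docs)
        probs = {t: (sum(t in d for d in class_docs) + 1) / (len(class_docs) + 2)
                 for t in terms}
        model[c] = (prior, probs)
    return model

def classify(model, doc):
    """Return the class with the highest log posterior score."""
    def log_score(c):
        prior, probs = model[c]
        return math.log(prior) + sum(
            math.log(p if t in doc else 1 - p) for t, p in probs.items())
    return max(model, key=log_score)

# Hypothetical toy corpus: each message is reduced to a set of tokens.
docs = [
    {"free", "viagra", "offer"},
    {"free", "money", "offer"},
    {"win", "money", "free"},
    {"meeting", "project", "report"},
    {"project", "deadline", "money"},
    {"lunch", "meeting", "report"},
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

terms = select_terms(docs, labels, k=3)
model = train_bernoulli_nb(docs, labels, terms)
print(classify(model, {"free", "offer", "now"}))
print(classify(model, {"project", "meeting"}))
```

Swapping `information_gain` for a different scoring function changes which terms survive selection, and therefore what the classifier sees. That interaction between selection technique and classifier is what the study compares.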





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • J. R. Méndez (1)
  • I. Cid (1)
  • D. Glez-Peña (1)
  • M. Rocha (2)
  • F. Fdez-Riverola (1)
  1. Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain
  2. Dept. Informática, University of Minho, Centro de Ciências e Tecnologias da Computação, Braga, Portugal
