A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

  • Conference paper
Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining (ICDM 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4065)

Abstract

In this paper we analyse the strengths and weaknesses of the most widely used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ²-test, Mutual Information and Document Frequency feature selection methods are analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From these experiments, the underlying ideas behind the feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.
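
As an illustration of the four feature selection methods named above, the following minimal sketch scores a single term against the spam class from a 2x2 table of message counts. It uses the textbook definitions commonly given for text categorization (Mutual Information is taken here in its pointwise form); the paper's exact variants are not shown on this page, so the formulas, the function name `feature_scores` and its parameter names are assumptions made for illustration only.

```python
import math


def feature_scores(n11, n10, n01, n00):
    """Score one term for the spam class (hypothetical helper, not from the paper).

    n11: spam messages containing the term
    n10: legitimate messages containing the term
    n01: spam messages without the term
    n00: legitimate messages without the term
    """
    n = n11 + n10 + n01 + n00

    # Document Frequency: number of messages in which the term occurs.
    df = n11 + n10

    # Chi-square statistic of the 2x2 term/class contingency table.
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0

    # Pointwise Mutual Information between "term present" and "spam".
    p_t = (n11 + n10) / n
    p_c = (n11 + n01) / n
    p_tc = n11 / n
    mi = math.log2(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf")

    # Information Gain: expected reduction in class entropy from knowing
    # whether the term is present, summed over the four table cells.
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    ig = 0.0
    for joint, row_total, col_total in cells:
        if joint > 0:
            ig += (joint / n) * math.log2(joint * n / (row_total * col_total))

    return {"df": df, "chi2": chi2, "mi": mi, "ig": ig}


# Made-up counts for one term over 100 messages (illustrative data only).
scores = feature_scores(n11=40, n10=10, n01=10, n00=40)
print(scores)
```

In a filter, each vocabulary term would be scored this way and only the highest-ranked terms kept as features; which score to rank by is exactly the choice the study compares.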

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Méndez, J.R., Fdez-Riverola, F., Díaz, F., Iglesias, E.L., Corchado, J.M. (2006). A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain. In: Perner, P. (ed.) Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining. ICDM 2006. Lecture Notes in Computer Science, vol 4065. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11790853_9

  • DOI: https://doi.org/10.1007/11790853_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-36036-0

  • Online ISBN: 978-3-540-36037-7

  • eBook Packages: Computer Science, Computer Science (R0)
