
A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

  • J. R. Méndez
  • F. Fdez-Riverola
  • F. Díaz
  • E. L. Iglesias
  • J. M. Corchado
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4065)

Abstract

In this paper we analyse the strengths and weaknesses of the most widely used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ² test, Mutual Information and Document Frequency feature selection methods are analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From these experiments, the underlying ideas behind feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.
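
As a rough illustration of the four feature selection scores named in the abstract, the following minimal sketch computes them for a single term on a toy two-class corpus. It uses the standard textbook definitions from the text categorization literature, not necessarily the exact variants or preprocessing used in the paper. The corpus, the labels, and the names `DOCS`, `LABELS`, `entropy` and `term_scores` are assumptions made up for this example.

```python
import math

# Hypothetical toy corpus: each message is reduced to its set of tokens.
DOCS = [
    {"free", "viagra", "offer"},
    {"free", "money"},
    {"meeting", "monday"},
    {"free", "lunch", "monday"},
]
LABELS = ["spam", "spam", "legit", "legit"]


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def term_scores(docs, labels, term, target="spam"):
    """Score one term against the target class with four common criteria."""
    # 2x2 contingency table of term presence versus class membership.
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, target class
        elif has_term:
            b += 1  # term present, other class
        elif in_class:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d

    # Document Frequency: number of messages containing the term.
    df = a + b

    # Information Gain: class entropy minus the expected class entropy
    # after observing whether the term is present.
    ig = (
        entropy([a + c, b + d])
        - (df / n) * entropy([a, b])
        - ((c + d) / n) * entropy([c, d])
    )

    # Chi-square statistic for the 2x2 table (0 if a margin is empty).
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi2 = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # Pointwise Mutual Information between the term and the target class.
    mi = math.log2(a * n / ((a + c) * (a + b))) if a > 0 else float("-inf")

    return {"df": df, "ig": ig, "chi2": chi2, "mi": mi}


if __name__ == "__main__":
    for word in ("free", "viagra", "monday"):
        print(word, term_scores(DOCS, LABELS, word))
```

On this toy data the frequent term "free" and the rare term "viagra" obtain the same Information Gain and χ² values, while Mutual Information ranks the rare term higher and Document Frequency ranks the frequent one higher. This only hints at how the criteria can disagree and says nothing about their behaviour on real spam corpora.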

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • J. R. Méndez (1)
  • F. Fdez-Riverola (1)
  • F. Díaz (2)
  • E. L. Iglesias (1)
  • J. M. Corchado (3)

  1. Dept. Informática, University of Vigo, Ourense, Spain
  2. Dept. Informática, University of Valladolid, Segovia, Spain
  3. Dept. Informática y Automática, University of Salamanca, Salamanca, Spain
