Advertisement

Tracking Concept Drift at Feature Selection Stage in SpamHunting: An Anti-spam Instance-Based Reasoning System

  • J. R. Méndez
  • F. Fdez-Riverola
  • E. L. Iglesias
  • F. Díaz
  • J. M. Corchado
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4106)

Abstract

In this paper we propose a novel feature selection method able to handle concept drift problems in spam filtering domain. The proposed technique is applied to a previous successful instance-based reasoning e-mail filtering system called SpamHunting. Our achieved information criterion is based on several ideas extracted from the well-known information measure introduced by Shannon. We show how results obtained by our previous system in combination with the improved feature selection method outperforms classical machine learning techniques and other well-known lazy learning approaches. In order to evaluate the performance of all the analysed models, we employ two different corpus and six well-known metrics in various scenarios.

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Oard, D.W.: The state of the art in text filtering. User Modeling and User-Adapted Interaction 7, 141–178 (1997)CrossRefGoogle Scholar
  2. 2.
    Wittel, G.L., Wu, S.F.: On Attacking Statistical Spam Filters. In: Proc. of the First Conference on E-mail and Anti-Spam CEAS (2004)Google Scholar
  3. 3.
    Androutsopoulos, I., Paliouras, G., Michelakis, E.: Learning to Filter Unsolicited Commercial E-Mail. Technical Report 2004/2, NCSR Demokritos (2004)Google Scholar
  4. 4.
    Delany, S.J., Cunningham, P., Coyle, L.: An Assessment of Case-base Reasoning for Spam Filtering. In: Proc. of Fifteenth Irish Conference on Artificial Intelligence and Cognitive Science: AICS 2004, pp. 9–18 (2004)Google Scholar
  5. 5.
    Cunningham, P., Nowlan, N., Delany, S.J., Haahr, M.: A Case-Based Approach to Spam Filtering that Can Track Concept Drift. In: Ashley, K.D., Bridge, D.G. (eds.) ICCBR 2003. LNCS, vol. 2689. Springer, Heidelberg (2003)CrossRefGoogle Scholar
  6. 6.
    Fdez-Riverola, F., Iglesias, E.L., Díaz, F., Méndez, J.R., Corchado, J.M.: SpamHunting: An Instance-Based Reasoning System for Spam Labelling and Filtering. In: Decision Support Systems (to appear, 2006)Google Scholar
  7. 7.
    Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk e-mail. In: Learning for Text Categorization – Papers from the AAAI Workshop, Technical Report WS-98-05, pp. 55–62 (1998)Google Scholar
  8. 8.
    Carreras, X., Màrquez, L.: Boosting trees for anti-spam e-mail filtering. In: Proc. of the 4th International Conference on Recent Advances in Natural Language Processing, pp. 58–64 (2001)Google Scholar
  9. 9.
    Vapnik, V.: The Nature of Statistical Learning Theory, 2nd edn. Statistics for Engineering and Information Science (1999)Google Scholar
  10. 10.
    Lee, H., Ng, A.Y.: Spam Deobfuscation using a Hidden Markov Model. In: Proc. of the Second Conference on E-mail and Anti-Spam CEAS (2005)Google Scholar
  11. 11.
    Druker, H., Vapmik, V.: Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks 10(5), 1048–1054 (1999)CrossRefGoogle Scholar
  12. 12.
    Platt, J.: Fast training of Support Vector Machines using Sequential Minimal Optimization. In: Sholkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999)Google Scholar
  13. 13.
    Schapire, R.E., Singer, Y.: BoosTexter: a boosting-based system for text categorization. Machine Learning 39(2/3), 135–168 (2000)zbMATHCrossRefGoogle Scholar
  14. 14.
    Rigoutsos, I., Huynh, T.: Chung-Kwei: a Pattern-discovery-based System for the Automatic Identification of Unsolicited E-mail Messages (SPAM). In: Proc. of the First Conference on E-mail and Anti-Spam CEAS (2004)Google Scholar
  15. 15.
    Graham, P.: Better Bayesian filtering. In: Proc. of the MIT Spam Conference (2003)Google Scholar
  16. 16.
    Hovold, J.: Naïve Bayes Spam Filtering Using Word-Position-Based Attributes. In: Proc. of the Second Conference on Email and Anti-Spam CEAS (2005), http://www.ceas.cc/papers-2005/144.pdf
  17. 17.
    Kolcz, A., Alspector, J.: SVM-based filtering of e-mail spam with content specific misclassification costs. In: Proc. of the ICDM Workshop on Text Mining (2001)Google Scholar
  18. 18.
    Gama, J., Castillo, G.: Adaptive Bayes. In: Garijo, F.J., Riquelme, J.-C., Toro, M. (eds.) IBERAMIA 2002. LNCS, vol. 2527, pp. 765–774. Springer, Heidelberg (2002)CrossRefGoogle Scholar
  19. 19.
    Scholz, M., Klinkenberg, R.: An Ensemble Classifier for Drifting Concepts. In: Proc. of the Second International Workshop on Knowledge Discovery from Data Streams, pp. 53–64 (2005)Google Scholar
  20. 20.
    Syed, N.A., Liu, H., Sung, K.K.: Handling Concept Drifts in Incremental Learning with Support Vector Machines. In: Proc. of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp. 317–321 (1999)Google Scholar
  21. 21.
    Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden contexts. Machine Learning 23(1), 69–101 (1996)Google Scholar
  22. 22.
    Lenz, M., Auriol, E., Manago, M.: Diagnosis and Decision Support. In: Lenz, M., Bartsch-Spörl, B., Burkhard, H.-D., Wess, S. (eds.) Case-Based Reasoning Technology. LNCS (LNAI), vol. 1400, pp. 51–90. Springer, Heidelberg (1998)CrossRefGoogle Scholar
  23. 23.
    Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proc. of the Fourteenth International Conference on Machine Learning ICML 1997, pp. 412–420 (1997)Google Scholar
  24. 24.
    Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley, Reading (1999)Google Scholar
  25. 25.
    Méndez, J.R., Iglesias, E.L., Fdez-Riverola, F., Díaz, F., Corchado, J.M.: Analyzing the Impact of Corpus Preprocessing on Anti-Spam Filtering Software. Research on Computing Science 17, 129–138 (2005)Google Scholar
  26. 26.
    Shannon, C.E.: The mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656 (1997)MathSciNetGoogle Scholar
  27. 27.
    Salton, G., McGill, M.: Introduction to mosdern information retrieval. McGraw-Hill, New York (1983)zbMATHGoogle Scholar
  28. 28.
    Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence IJCAI 1995, pp. 1137–1143 (1995)Google Scholar
  29. 29.
    Oliver, J.J., Hand, D.J.: Averaging over decision stumps. In: Bergadano, F., De Raedt, L. (eds.) ECML 1994. LNCS, vol. 784, pp. 231–241. Springer, Heidelberg (1994)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • J. R. Méndez
    • 1
  • F. Fdez-Riverola
    • 1
  • E. L. Iglesias
    • 1
  • F. Díaz
    • 2
  • J. M. Corchado
    • 3
  1. 1.Dept. InformáticaUniversity of Vigo, Escuela Superior de Ingeniería Informática, Edificio PolitécnicoOurenseSpain
  2. 2.Dept. InformáticaUniversity of Valladolid, Escuela Universitaria de InformáticaSegoviaSpain
  3. 3.Dept. Informática y AutomáticaUniversity of SalamancaSalamancaSpain

Personalised recommendations