Relaxing Feature Selection in Spam Filtering by Using Case-Based Reasoning Systems

  • J. R. Méndez
  • F. Fdez-Riverola
  • D. Glez-Peña
  • F. Díaz
  • J. M. Corchado
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4874)

Abstract

This paper compares two alternative strategies for feature selection in SpamHunting, a well-known case-based reasoning spam filtering system. The first strategy uses the k most predictive features; the second is a percentage-based strategy that exploits our amount of information measure. The results confirm that the percentage-based feature selection method is better suited to the spam filtering domain.
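
The two strategies named in the abstract can be contrasted with a minimal sketch. The abstract does not define the amount of information measure or the exact selection procedure, so everything here is an assumption: the per-term scores are hypothetical stand-ins for that measure, the function names are illustrative, and the percentage strategy is read as "keep the best terms until their scores reach a given fraction of the message's total score".

```python
from typing import Dict, List


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Fixed-size selection: keep the k highest-scoring terms."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def select_by_percentage(scores: Dict[str, float], fraction: float) -> List[str]:
    """Variable-size selection: keep the highest-scoring terms until their
    scores sum to at least `fraction` of the total score (assumed reading)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(scores.values())
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    selected: List[str] = []
    accumulated = 0.0
    for term in ranked:
        if accumulated >= fraction * total:
            break
        selected.append(term)
        accumulated += scores[term]
    return selected


# Hypothetical scores for the terms of one message (not data from the paper).
message_scores = {"viagra": 5.0, "free": 3.0, "meeting": 1.0, "hello": 1.0}
top2 = select_top_k(message_scores, 2)
half = select_by_percentage(message_scores, 0.5)
```

Under this reading, the top-k strategy always returns the same number of terms, while the percentage strategy returns fewer terms for messages dominated by a few high-scoring terms and more for messages whose scores are spread evenly.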

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • J. R. Méndez (1)
  • F. Fdez-Riverola (1)
  • D. Glez-Peña (1)
  • F. Díaz (2)
  • J. M. Corchado (3)

  1. Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
  2. Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Plaza Santa Eulalia, 9-11, 40005, Segovia, Spain
  3. Dept. Informática y Automática, University of Salamanca, Plaza de la Merced s/n, 37008, Salamanca, Spain
