Tokenising, Stemming and Stopword Removal on Anti-spam Filtering Domain

  • J. R. Méndez
  • E. L. Iglesias
  • F. Fdez-Riverola
  • F. Díaz
  • J. M. Corchado
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4177)


Junk e-mail detection and filtering can be considered a cost-sensitive classification problem. Nevertheless, the preprocessing methods and noise reduction strategies used to improve computational efficiency in text classification may not be as effective in e-mail filtering. This is demonstrated here through a comparative study of stopword removal, stemming and different tokenising schemes. The final goal is to preprocess the training e-mail corpora of several content-based techniques for spam filtering (machine learning approaches and case-based systems). Sound conclusions are drawn from the experiments carried out, in which different scenarios are taken into consideration.
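As a rough illustration of the three preprocessing steps named in the abstract, the following minimal Python sketch applies two tokenising schemes, stopword removal and stemming to a single message. It is not the authors' implementation: the stopword list, the function names, the sample message and the toy suffix stripper (which is not the Porter algorithm) are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "now"}


def tokenise_whitespace(text):
    """Scheme A: split on blanks only, so punctuation stays attached to tokens."""
    return text.lower().split()


def tokenise_alnum(text):
    """Scheme B: keep only runs of letters and digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Toy suffix stripper (NOT the Porter algorithm): removes one common suffix."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, tokeniser, use_stopwords=True, use_stemming=True):
    """Tokenise, then optionally remove stopwords and stem each token."""
    tokens = tokeniser(text)
    if use_stopwords:
        tokens = remove_stopwords(tokens)
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


# Made-up sample message, used only to exercise the functions above.
message = "Buy V1agra now!!! Limited offers for the first 100 customers"

print(preprocess(message, tokenise_alnum))
print(preprocess(message, tokenise_whitespace))
```

In this sketch the whitespace scheme keeps the token `now!!!`, which no longer matches the stopword `now`, so the choice of tokeniser changes what stopword removal can discard. This is only meant to show how the steps interact, not to reproduce the experiments reported in the paper.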


Support Vector Machine Model; Feature Selection Method; Concept Drift; Punctuation Mark; Stopword Removal
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • J. R. Méndez (1)
  • E. L. Iglesias (1)
  • F. Fdez-Riverola (1)
  • F. Díaz (2)
  • J. M. Corchado (3)

  1. Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain
  2. Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Segovia, Spain
  3. Dept. Informática y Automática, University of Salamanca, Salamanca, Spain
