On Feature Extraction for Spam E-Mail Detection

  • Serkan Günal
  • Semih Ergin
  • M. Bilginer Gülmezoğlu
  • Ö. Nezih Gerek
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4105)


Electronic mail is an important communication method for most computer users. Spam e-mails, however, consume bandwidth, fill up server storage, and waste the time of those who have to deal with them. The general way to label an e-mail as spam or non-spam is to set up a finite set of discriminative features and use a classifier for the detection. In most cases, the selection of such features is verified empirically. In this paper, two different methods are proposed to select the most discriminative features from a set of reasonably arbitrary features for spam e-mail detection. The selection methods are developed using the Common Vector Approach (CVA), which is a subspace-based pattern classifier. Experimental results indicate that the proposed feature selection methods considerably reduce the number of features without affecting recognition rates.
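To make the idea of a subspace-based classifier concrete, here is a minimal Python sketch of a CVA-style decision rule. It rests on assumptions not stated in the abstract: each class has more training samples than features, so the "indifference" subspace is taken from the eigenvectors of the class covariance matrix with the smallest eigenvalues. The function names (`fit_cva`, `predict_cva`), the toy three-feature vectors, and the subspace dimension are all hypothetical. This is not the authors' implementation, and it does not reproduce the two feature selection methods proposed in the paper.

```python
import numpy as np


def fit_cva(samples, n_indiff):
    """Fit one class: return its mean and a basis of the indifference subspace.

    Assumption: the indifference subspace is spanned by the n_indiff
    eigenvectors of the class covariance with the smallest eigenvalues.
    """
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    # np.linalg.eigh returns eigenvalues in ascending order, so the first
    # columns of `vecs` are the directions of least within-class variation.
    _, vecs = np.linalg.eigh(cov)
    return mean, vecs[:, :n_indiff]


def predict_cva(x, models):
    """Assign x to the class whose common vector is closest in that class's
    indifference subspace."""
    distances = {
        label: np.linalg.norm(basis.T @ (x - mean))
        for label, (mean, basis) in models.items()
    }
    return min(distances, key=distances.get)


# Hypothetical toy feature vectors (three arbitrary features per e-mail).
spam_train = np.array([[0.0, 1.0, 0.0],
                       [1.0, 1.0, 0.0],
                       [2.0, 1.0, 0.0],
                       [3.0, 1.0, 0.0]])
ham_train = np.array([[0.0, 0.0, 1.0],
                      [0.0, 1.0, 1.0],
                      [0.0, 2.0, 1.0],
                      [0.0, 3.0, 1.0]])

models = {
    "spam": fit_cva(spam_train, n_indiff=2),
    "non-spam": fit_cva(ham_train, n_indiff=2),
}

label_a = predict_cva(np.array([5.0, 1.0, 0.0]), models)
label_b = predict_cva(np.array([0.0, 7.0, 1.0]), models)
```

In this toy data the spam class varies only along the first feature and the non-spam class only along the second, so a test vector is assigned to the class whose invariant features it matches, however far it lies along that class's direction of variation.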


Feature Vector; Feature Selection; Feature Selection Method; Discriminative Feature; Irrelevant Feature
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Serkan Günal (1)
  • Semih Ergin (1)
  • M. Bilginer Gülmezoğlu (1)
  • Ö. Nezih Gerek (2)
  1. The Department of Electrical and Electronics Engineering, Eskişehir Osmangazi University, Eskişehir, Türkiye
  2. The Department of Electrical and Electronics Engineering, Anadolu University, Eskişehir, Türkiye
