Investigating the Effect of Combining Text Clustering with Classification on Improving Spam Email Detection

  • Doaa Hassan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 557)


Email has become an easy and fast means of communication, and filtering unsolicited (spam) email has consequently become an important challenge. Recent research in text mining has combined text clustering with classification to improve classification performance. In this paper, we investigate whether combining text clustering, using the K-means algorithm, with various supervised classification mechanisms improves the classification of emails as spam or non-spam. The two mechanisms are combined by adding extra features from the clustering step to the feature space used for classification. Our results show that combining K-means clustering with supervised classification in this way does not always improve classification performance. Moreover, in the cases where clustering does improve a classifier's performance, the gain in accuracy is very small and does not justify the additional time needed to build a learning model that combines both mechanisms. The experimental results are reported on the Enron-Spam datasets.
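The combination step described above can be illustrated with a minimal sketch. The abstract does not say which clustering-derived features are added, so the one-hot cluster-membership encoding below is an assumption, as are the function names (`kmeans`, `augment`) and the toy term-frequency vectors; this is not the paper's implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style K-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids


def augment(points, centroids):
    """Append a one-hot cluster-membership vector to each feature vector.

    Assumption: cluster membership is the extra feature; the paper may
    use a different encoding.
    """
    k = len(centroids)
    augmented = []
    for p in points:
        one_hot = [0.0] * k
        one_hot[nearest(p, centroids)] = 1.0
        augmented.append(list(p) + one_hot)
    return augmented


# Hypothetical two-term frequency vectors, not taken from Enron-Spam.
emails = [[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]]
centroids = kmeans(emails, k=2)
augmented = augment(emails, centroids)
```

The augmented vectors would then be passed to any supervised classifier in place of the original ones. Comparing accuracy and model-building time with and without the extra columns mirrors the comparison the abstract describes.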


Keywords: Text clustering, Classification, Spam email detection



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Computers and Systems, National Telecommunication Institute, Cairo, Egypt
