
Significance of Feature Selection and Pruning Algorithms in Machine Learning Classification of E-Mails

Chapter in: Artificial Intelligence for Cyber Security: Methods, Issues and Possible Horizons or Opportunities

Part of the book series: Studies in Computational Intelligence (SCI, volume 972)

Abstract

Email classification with Machine Learning (ML) algorithms is an application of Artificial Intelligence (AI) in which the system learns by itself to distinguish ham from spam mail. These algorithms predict a class based on probabilities estimated from the features of the particular class label. Classifier performance degrades when there is a large number of features with complex distributions and only limited data instances. This paper illustrates the necessity of a suitable pre-processing stage that eliminates irrelevant and redundant features before classification, in order to enhance classifier performance, facilitate data visualization, and reduce both storage requirements and modelling time. The proposed methodology investigates and evaluates several feature selection and ranking algorithms in the email classification process. The approaches used here are classifier-based feature selection methods, scheme-dependent and scheme-independent pruning methods, ranking-based fast feature selection methods, a cost-sensitive classifier, and a cost-sensitive learner. A quantitative analysis was conducted using different feature selection and pruning algorithms. The main effects observed are a reduced feature space and a learning algorithm whose computational requirements remain manageable. The experimental results show an accuracy improvement of at least 10%, with better precision values. The paper concludes with significant outcomes: validation of various ranker evaluators, the computational advantages of scheme-independent methods, and the effects of the cost-sensitive classifier and the cost-sensitive learner. Our experiments also illustrate the importance of classifier dependency in determining the sensitivity of feature selection algorithms, and the peril of over-fitting the model in the email domain.
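To make the idea of filter-style feature ranking concrete, the sketch below ranks binary word-presence features by their chi-square score against the ham/spam label and prunes the feature space to the top-k features before any classifier is trained. This is a minimal illustration of the general technique, not the chapter's own implementation (which uses WEKA); the toy emails and the vocabulary ["free", "winner", "meeting", "report"] are hypothetical.

```python
# Illustrative sketch: chi-square feature ranking for spam filtering.
# Features are binary word-presence flags; labels are 1 = spam, 0 = ham.
from collections import Counter

def chi2_score(feature_col, labels):
    """Chi-square statistic for one binary feature against binary labels."""
    n = len(labels)
    obs = Counter(zip(feature_col, labels))  # observed (feature, label) counts
    f1 = sum(feature_col)                    # e-mails where the word is present
    c1 = sum(labels)                         # spam e-mails
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            expected = ((f1 if f else n - f1) * (c1 if c else n - c1)) / n
            if expected > 0:
                score += (obs[(f, c)] - expected) ** 2 / expected
    return score

def rank_features(X, y):
    """Return feature indices sorted by descending chi-square score."""
    n_features = len(X[0])
    scores = [chi2_score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])

# Toy data: rows are e-mails, columns correspond to the hypothetical
# vocabulary ["free", "winner", "meeting", "report"].
X = [
    [1, 1, 0, 0],  # spam
    [1, 0, 0, 0],  # spam
    [1, 1, 0, 1],  # spam
    [0, 0, 1, 1],  # ham
    [0, 0, 1, 0],  # ham
    [0, 1, 0, 1],  # ham
]
y = [1, 1, 1, 0, 0, 0]

ranking = rank_features(X, y)
top_k = sorted(ranking[:2])  # prune to the 2 most discriminative features
print("feature ranking:", ranking)   # "free" and "meeting" score highest
print("selected features:", top_k)
```

Because this is a filter method, the scores are computed from the data alone, independently of any classifier (scheme-independent in WEKA's terminology); wrapper and classifier-based methods discussed in the chapter would instead re-evaluate candidate subsets through the learning algorithm itself.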




Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Bindu, V., & Thomas, C. (2021). Significance of Feature Selection and Pruning Algorithms in Machine Learning Classification of E-Mails. In: Misra, S., & Kumar Tyagi, A. (Eds.), Artificial Intelligence for Cyber Security: Methods, Issues and Possible Horizons or Opportunities. Studies in Computational Intelligence, vol. 972. Springer, Cham. https://doi.org/10.1007/978-3-030-72236-4_2
