Abstract
Email classification using Machine Learning (ML) algorithms is an application of Artificial Intelligence (AI) in which the system learns on its own to classify mails as ham or spam. These algorithms predict a class based on probabilities estimated from the features associated with each class label. Classifier performance degrades when there is a large number of features with complex distributions and only a limited number of data instances. This paper illustrates the necessity of a suitable pre-processing stage that eliminates irrelevant and redundant features before classification, in order to enhance classifier performance, facilitate data visualization, and reduce both storage requirements and modelling time. The proposed methodology investigates and evaluates several feature selection and ranking algorithms for the email classification process. The approaches used here are classifier-based feature selection methods, scheme-dependent and scheme-independent pruning methods, ranking-based fast feature selection methods, a cost-sensitive classifier, and a cost-sensitive learner. A quantitative analysis was conducted using different feature selection and pruning algorithms. The principal effects observed are a reduction of the feature space and more manageable computational requirements for the learning algorithm. The experimental results show an accuracy improvement of at least 10%, together with better precision values. The paper concludes with significant outcomes: validation of various ranker evaluators, the computational advantages of scheme-independent methods, and the effects of the cost-sensitive classifier and cost-sensitive learner. Our experiments also illustrate the importance of classifier dependency in determining the sensitivity of feature selection algorithms, and the peril of over-fitting the model in the email domain.
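The following is not the authors' experimental setup (the chapter's experiments were run with WEKA on the UCI Spambase corpus); it is a minimal scikit-learn sketch of the general pipeline the abstract describes: rank features, keep the top-k, and compare a classifier's accuracy and precision with and without selection. Synthetic data stands in for the email corpus; the feature count (57) mirrors Spambase, and all parameter choices here (k=15, the mutual-information ranker, Naive Bayes) are illustrative assumptions, not the paper's configuration.

```python
# Sketch: ranking-based feature selection before a Naive Bayes classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score

# Synthetic stand-in for an email corpus: 57 features (as in Spambase),
# only some of which are informative; the rest are redundant or noise.
X, y = make_classification(n_samples=2000, n_features=57, n_informative=10,
                           n_redundant=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

def evaluate(X_tr, X_te):
    """Train Naive Bayes on the given feature view and score it."""
    clf = GaussianNB().fit(X_tr, y_train)
    pred = clf.predict(X_te)
    return accuracy_score(y_test, pred), precision_score(y_test, pred)

# Baseline: classifier trained on all 57 features.
acc_all, prec_all = evaluate(X_train, X_test)

# Ranker-style selection: score every feature independently by mutual
# information with the class label and keep only the 15 best-ranked ones.
selector = SelectKBest(mutual_info_classif, k=15).fit(X_train, y_train)
acc_sel, prec_sel = evaluate(selector.transform(X_train),
                             selector.transform(X_test))

print(f"all 57 features: accuracy={acc_all:.3f} precision={prec_all:.3f}")
print(f"top-15 ranked  : accuracy={acc_sel:.3f} precision={prec_sel:.3f}")
```

A run of this sketch shows the effect the abstract attributes to pre-processing: pruning irrelevant and redundant features shrinks the feature space (and hence training cost) while keeping, and often improving, accuracy and precision.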
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Bindu, V., Thomas, C. (2021). Significance of Feature Selection and Pruning Algorithms in Machine Learning Classification of E-Mails. In: Misra, S., Kumar Tyagi, A. (eds) Artificial Intelligence for Cyber Security: Methods, Issues and Possible Horizons or Opportunities. Studies in Computational Intelligence, vol 972. Springer, Cham. https://doi.org/10.1007/978-3-030-72236-4_2
DOI: https://doi.org/10.1007/978-3-030-72236-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72235-7
Online ISBN: 978-3-030-72236-4
eBook Packages: Intelligent Technologies and Robotics (R0)