Abstract
Email classification using Machine Learning (ML) algorithms is an application of Artificial Intelligence (AI) in which the system learns on its own to classify mails as ham or spam. These algorithms predict a class based on probabilities estimated from the features associated with each class label. Classifier performance degrades when there is a large number of features with complex distributions and only a limited number of data instances. This paper illustrates the necessity of a suitable pre-processing stage that eliminates irrelevant and redundant features before classification, in order to enhance classifier performance, facilitate data visualization, and reduce both storage requirements and modelling time. The proposed methodology investigates and evaluates several feature selection and ranking algorithms for the email classification process. The approaches used here are classifier-based feature selection methods, scheme-dependent and scheme-independent pruning methods, ranking-based fast feature selection methods, a cost-sensitive classifier, and a cost-sensitive learner. A quantitative analysis was conducted using different feature selection and pruning algorithms. The principal effects observed are a reduction of the feature space and more manageable computational requirements for the learning algorithm. The experimental results show an accuracy improvement of at least 10%, together with better precision values. The paper concludes with significant outcomes: validation of various ranker evaluators, the computational advantages of scheme-independent methods, and the effects of the cost-sensitive classifier and cost-sensitive learner. Our experiments also illustrate the importance of classifier dependency in determining the sensitivity of feature selection algorithms, and the peril of over-fitting the model in the email domain.
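The following is not the authors' experimental setup (the chapter's experiments were run with WEKA on the UCI Spambase corpus); it is a minimal scikit-learn sketch of the general pipeline the abstract describes: rank features, keep the top-k, and compare a classifier's accuracy and precision with and without selection. Synthetic data stands in for the email corpus; the feature count (57) mirrors Spambase, and all parameter choices here (k=15, the mutual-information ranker, Naive Bayes) are illustrative assumptions, not the paper's configuration.

```python
# Sketch: ranking-based feature selection before a Naive Bayes classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score

# Synthetic stand-in for an email corpus: 57 features (as in Spambase),
# only some of which are informative; the rest are redundant or noise.
X, y = make_classification(n_samples=2000, n_features=57, n_informative=10,
                           n_redundant=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

def evaluate(X_tr, X_te):
    """Train Naive Bayes on the given feature view and score it."""
    clf = GaussianNB().fit(X_tr, y_train)
    pred = clf.predict(X_te)
    return accuracy_score(y_test, pred), precision_score(y_test, pred)

# Baseline: classifier trained on all 57 features.
acc_all, prec_all = evaluate(X_train, X_test)

# Ranker-style selection: score every feature independently by mutual
# information with the class label and keep only the 15 best-ranked ones.
selector = SelectKBest(mutual_info_classif, k=15).fit(X_train, y_train)
acc_sel, prec_sel = evaluate(selector.transform(X_train),
                             selector.transform(X_test))

print(f"all 57 features: accuracy={acc_all:.3f} precision={prec_all:.3f}")
print(f"top-15 ranked  : accuracy={acc_sel:.3f} precision={prec_sel:.3f}")
```

A run of this sketch shows the effect the abstract attributes to pre-processing: pruning irrelevant and redundant features shrinks the feature space (and hence training cost) while keeping, and often improving, accuracy and precision.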
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Bindu, V., Thomas, C. (2021). Significance of Feature Selection and Pruning Algorithms in Machine Learning Classification of E-Mails. In: Misra, S., Kumar Tyagi, A. (eds) Artificial Intelligence for Cyber Security: Methods, Issues and Possible Horizons or Opportunities. Studies in Computational Intelligence, vol 972. Springer, Cham. https://doi.org/10.1007/978-3-030-72236-4_2
DOI: https://doi.org/10.1007/978-3-030-72236-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-72235-7
Online ISBN: 978-3-030-72236-4
eBook Packages: Intelligent Technologies and Robotics (R0)