Abstract
Email is an important communication medium used by almost everyone for official and personal purposes, and this has encouraged some users to exploit it to send spam, either for marketing or for potentially harmful purposes. The massive increase in the number of spam messages has created a need for ways to identify and filter them, which has prompted many researchers to work in this field. In this paper, we present a method for detecting spam email messages based on their content. The approach uses mutual information content to quantify the relationship between the text an email contains and its class, and thereby to select the text most frequently used in spam emails. A random forest classifier is used to classify emails as legitimate or spam because of its performance and its ability to mitigate the overfitting associated with individual decision tree classifiers. The proposed algorithm was applied to a dataset containing 3000 features and 5150 instances, and the results obtained are studied and discussed. The algorithm performed well: accuracy reached 97% in some cases, and the optimum accuracy was 96.4%.
Keywords
- Spam email
- Mutual information content
- Random forest classifier
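The word-selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `mutual_information` and `rank_words`, the binary present/absent encoding of each word, and the four toy emails are all assumptions made for illustration. The sketch scores each word by the mutual information between its presence in an email and the email's class, then keeps the highest-scoring words.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))  # counts of (feature value, class) pairs
    px = Counter(feature)                  # marginal counts of feature values
    py = Counter(labels)                   # marginal counts of classes
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) written in terms of raw counts
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi


def rank_words(emails, labels, top_k):
    """Return the top_k words whose presence is most informative about the class."""
    tokenized = [set(email.lower().split()) for email in emails]
    vocabulary = sorted(set().union(*tokenized))
    scores = {}
    for word in vocabulary:
        present = [int(word in tokens) for tokens in tokenized]
        scores[word] = mutual_information(present, labels)
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical toy data: 1 = spam, 0 = legitimate.
emails = [
    "win free prize now",
    "free offer win cash",
    "meeting agenda for monday",
    "project report for review",
]
labels = [1, 1, 0, 0]

print(rank_words(emails, labels, top_k=3))
```

Mutual information is symmetric with respect to the class, so a high score marks a word as informative about either class, not necessarily as typical of spam. In the toy data, "for" ranks as highly as "free" and "win" even though it appears only in legitimate emails. Restricting the selection to spam-associated words would need an additional check on which class each word co-occurs with. The selected words would then serve as the input features of a classifier such as a random forest.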
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Al-Azawi, M.A.N. (2022). Identifying the Most Frequently Used Words in Spam Mail Using Random Forest Classifier and Mutual Information Content. In: Verma, P., Charan, C., Fernando, X., Ganesan, S. (eds) Advances in Data Computing, Communication and Security. Lecture Notes on Data Engineering and Communications Technologies, vol 106. Springer, Singapore. https://doi.org/10.1007/978-981-16-8403-6_2
DOI: https://doi.org/10.1007/978-981-16-8403-6_2
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-8402-9
Online ISBN: 978-981-16-8403-6
eBook Packages: Intelligent Technologies and Robotics (R0)