Identifying the Most Frequently Used Words in Spam Mail Using Random Forest Classifier and Mutual Information Content

Part of the Lecture Notes on Data Engineering and Communications Technologies book series (LNDECT, volume 106)

Abstract

Email is now an important medium of communication used by almost everyone, whether for official or personal purposes, and this has encouraged some users to exploit it to send spam, either for marketing or for potentially harmful purposes. The massive increase in the number of spam messages has created a need for ways to identify and filter these emails, which has encouraged many researchers to work in this field. In this paper, we present a method for identifying and detecting spam email messages based on their contents. The approach uses the mutual information content method to measure the relationship between the text an email contains and its class, in order to select the text most frequently used in spam emails. A random forest classifier is used to classify emails as legitimate or spam because of its performance and its ability to overcome the overfitting associated with regular decision tree classifiers. The proposed algorithm was applied to a dataset containing 3000 features and 5150 instances, and the results obtained were studied and discussed. The algorithm performed well: accuracy reached 97% in some cases, and the optimum accuracy reached 96.4%.
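The word-selection step described in the abstract can be illustrated with a minimal sketch. It assumes each word is reduced to a binary present/absent feature and scores it by its mutual information with the class label (1 for spam, 0 for legitimate). The function names `mutual_information` and `rank_words` and the four toy emails are hypothetical, introduced only for illustration; they are not the paper's implementation or dataset.

```python
import math
from collections import Counter


def mutual_information(word_present, labels):
    """Mutual information (in bits) between a binary word-presence
    feature and the class label, estimated from empirical counts."""
    n = len(labels)
    joint = Counter(zip(word_present, labels))  # counts of (x, y) pairs
    px = Counter(word_present)                  # marginal counts of x
    py = Counter(labels)                        # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def rank_words(emails, labels, top_k=3):
    """Score every vocabulary word by mutual information with the label
    and return the top_k (word, score) pairs, ties broken alphabetically."""
    tokenized = [set(e.split()) for e in emails]
    vocab = sorted(set().union(*tokenized))
    scores = {}
    for word in vocab:
        present = [int(word in tokens) for tokens in tokenized]
        scores[word] = mutual_information(present, labels)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical toy corpus: 1 = spam, 0 = legitimate.
emails = [
    "win free prize now",
    "free offer click now",
    "meeting agenda attached",
    "project report attached",
]
labels = [1, 1, 0, 0]

top = rank_words(emails, labels)
for word, score in top:
    print(f"{word}: {score:.3f} bits")
```

Mutual information is symmetric, so a word that appears only in legitimate mail scores as highly as one that appears only in spam; separating the two requires also checking which class the word co-occurs with. In a full pipeline, the selected words would then be passed to a classifier such as a random forest, which is not shown here.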

Keywords

  • Spam email
  • Mutual information content
  • Random forest classifier

Author information

Corresponding author

Correspondence to Mohammad A. N. Al-Azawi.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Al-Azawi, M.A.N. (2022). Identifying the Most Frequently Used Words in Spam Mail Using Random Forest Classifier and Mutual Information Content. In: Verma, P., Charan, C., Fernando, X., Ganesan, S. (eds) Advances in Data Computing, Communication and Security. Lecture Notes on Data Engineering and Communications Technologies, vol 106. Springer, Singapore. https://doi.org/10.1007/978-981-16-8403-6_2
