Detecting ham and spam emails using feature union and supervised machine learning models

Multimedia Tools and Applications

Abstract

Spam emails are a cyber nuisance that poses serious security threats to personal and financial information. Although several spam detection approaches exist, detecting new strains of spam messages remains challenging and requires a reliable, efficient, and intelligent detection approach. This study uses features extracted from the text of emails to determine whether an email is spam or normal (ham). Multiple features are combined to obtain higher accuracy for spam email detection. Experiments involve machine learning and deep learning models, and the influence of data resampling is also investigated. Performance is analyzed using F1 score, recall, precision, and accuracy, and is compared with state-of-the-art approaches. Random forest and logistic regression achieve the highest accuracy scores of 0.991 and 0.990, respectively, outperforming existing models.
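The abstract does not say which text features are combined, so the following is only a minimal sketch of the general idea of a feature union followed by the four reported metrics. It assumes, purely for illustration, that the two feature sets are bag-of-words counts and TF-IDF weights over the same vocabulary; the function names (`fit_vocab`, `bow_features`, `tfidf_features`, `feature_union`, `binary_metrics`) are hypothetical and are not taken from the article. In practice, scikit-learn's `FeatureUnion` provides the same kind of concatenation for vectorizers.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def fit_vocab(docs):
    """Return a sorted vocabulary and per-term document frequencies."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return sorted(doc_freq), doc_freq


def bow_features(doc, vocab):
    """Bag-of-words: raw term counts, one entry per vocabulary term."""
    counts = Counter(tokenize(doc))
    return [float(counts[w]) for w in vocab]


def tfidf_features(doc, vocab, doc_freq, n_docs):
    """TF-IDF using one common smoothed idf variant: ln((1+n)/(1+df)) + 1."""
    counts = Counter(tokenize(doc))
    return [
        counts[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
        for w in vocab
    ]


def feature_union(doc, vocab, doc_freq, n_docs):
    """Feature union: concatenate both feature vectors into one."""
    return bow_features(doc, vocab) + tfidf_features(doc, vocab, doc_freq, n_docs)


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy emails, invented for this sketch only.
    emails = ["win cash now", "meeting at noon", "win a prize now"]
    vocab, doc_freq = fit_vocab(emails)
    combined = feature_union(emails[0], vocab, doc_freq, len(emails))
    print(len(vocab), len(combined))
    print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

The combined vector is twice the vocabulary length because each extractor contributes one value per term; a classifier such as random forest or logistic regression would then be trained on these concatenated vectors.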

Data Availability

The datasets used in this study are publicly available at the following links: https://www.kaggle.com/datasets/karthickveerakumar/spam-filter and https://www.kaggle.com/washingtongold/spam-or-ham-emp-week-2-ml-hw-dataset

Funding

“This research was supported by the Florida Center for Advanced Analytics and Data Science funded by Ernesto.Net (under the Algorithms for Good Grant).”

Author information

Corresponding author

Correspondence to Imran Ashraf.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rustam, F., Saher, N., Mehmood, A. et al. Detecting ham and spam emails using feature union and supervised machine learning models. Multimed Tools Appl 82, 26545–26561 (2023). https://doi.org/10.1007/s11042-023-14814-2

  • DOI: https://doi.org/10.1007/s11042-023-14814-2
