Detecting ham and spam emails using feature union and supervised machine learning models

Rustam, Furqan; Saher, Najia; Mehmood, Arif; Lee, Ernesto; Washington, Sandrilla; Ashraf, Imran

doi:10.1007/s11042-023-14814-2

Published: 08 March 2023

Volume 82, pages 26545–26561, (2023)

Multimedia Tools and Applications

Furqan Rustam¹, Najia Saher², Arif Mehmood², Ernesto Lee³, Sandrilla Washington⁴ & Imran Ashraf⁵ (ORCID: orcid.org/0000-0002-8271-6496)

Abstract

Spam emails are a cyber nuisance that poses serious security threats, including the exposure of personal and financial information. Although several spam detection approaches exist, detecting new strains of spam messages remains challenging and requires a reliable, efficient, and intelligent detection approach. This study uses features extracted from the text of emails to determine whether a message is spam or normal (ham). Multiple features are combined to obtain higher accuracy for spam email detection. Experiments involve machine learning and deep learning models, and the influence of data resampling is also investigated. Performance is analyzed using F1 score, recall, precision, and accuracy, and is compared with state-of-the-art approaches. Random forest and logistic regression achieve the highest accuracy scores, 0.991 and 0.990 respectively, which is better than the existing models used for comparison.
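As an illustration of the four evaluation measures named in the abstract, the following minimal sketch computes accuracy, precision, recall, and F1 score for a binary spam/ham task in plain Python. The function name `classification_metrics`, the choice of "spam" as the positive class, and the toy labels are assumptions made for this example; they are not taken from the article, and the article's own evaluation code is not shown here.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1 for one positive class.

    Hypothetical helper for illustration; treats `positive` as the
    class of interest and every other label as negative.
    """
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for this example only.
y_true = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam", "ham", "ham"]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On these toy labels there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so accuracy is 6/8 while precision, recall, and F1 are each 2/3. Accuracy alone can look strong on imbalanced data, which is one reason to report it alongside precision, recall, and F1.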

Data Availability

The datasets used in this study are publicly available at the following links:
https://www.kaggle.com/datasets/karthickveerakumar/spam-filter
https://www.kaggle.com/washingtongold/spam-or-ham-emp-week-2-ml-hw-dataset

Funding

This research was supported by the Florida Center for Advanced Analytics and Data Science funded by Ernesto.Net (under the Algorithms for Good Grant).

Author information

Authors and Affiliations

School of Computer Science, University College Dublin, D04 V1W8, Dublin, Ireland
Furqan Rustam
Department of CS and IT, The Islamia University of Bahawalpur, Bahawalpur, 63100, Pakistan
Najia Saher & Arif Mehmood
College of Engineering and Technology, Miami Dade College, Miami, FL, 33132, USA
Ernesto Lee
Department of Computer and Information Sciences, Spelman College, Atlanta, GA, USA
Sandrilla Washington
Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, 38544, South Korea
Imran Ashraf

Corresponding author

Correspondence to Imran Ashraf.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Rustam, F., Saher, N., Mehmood, A. et al. Detecting ham and spam emails using feature union and supervised machine learning models. Multimed Tools Appl 82, 26545–26561 (2023). https://doi.org/10.1007/s11042-023-14814-2

Received: 16 April 2022
Revised: 15 June 2022
Accepted: 05 February 2023
Published: 08 March 2023
Issue Date: July 2023
DOI: https://doi.org/10.1007/s11042-023-14814-2
