Abstract
The increasing number of phishing attacks is one of the major concerns of security researchers today. Traditional tools for identifying phishing websites use signature-based approaches, which cannot detect newly created phishing webpages. Researchers are therefore developing machine learning-based methods that can detect and classify phishing webpages with high accuracy, provided that a large and varied set of features is considered. However, building a classification model from a large number of features takes time, which hampers the timely detection of phishing websites. It is therefore pertinent to shortlist a set of features using a feature selection method so that high-performance classification models can be developed in less time. In this chapter, we study the role of feature selection methods in detecting phishing webpages efficiently and effectively. A comparative analysis of machine learning algorithms is carried out on the basis of their performance with and without feature selection. Experiments are conducted on a phishing dataset with 30 features containing 4898 phishing and 6157 benign webpages. Several machine learning algorithms are applied to find the best-performing model. Afterward, a feature selection method is employed to improve the efficiency of the models. Random forest obtains the best accuracy both before and after feature selection, with a significant improvement in model building time. The experiments demonstrate that employing a feature selection method along with machine learning algorithms can reduce the build time of classification models for phishing detection without compromising their accuracy.
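The feature-shortlisting step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the feature selection method used in the chapter, so the information-gain ranking below is an assumed stand-in, not the authors' procedure. The data is synthetic (30 discrete features, of which three are made informative by construction) and is not the dataset used in the experiments; the function names, the label coding, and the choice of k are all hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction obtained by splitting the labels on one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_top_k(rows, labels, k):
    """Return the (sorted) indices of the k features with the highest information gain."""
    n_features = len(rows[0])
    gains = [information_gain([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return sorted(ranked[:k])


def make_synthetic_data(n_rows=2000, n_features=30, informative=(0, 1, 2), seed=0):
    """Synthetic stand-in data: only the `informative` columns track the label."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n_rows):
        label = rng.choice([-1, 1])  # assumed coding: -1 = phishing, 1 = benign
        row = [rng.choice([-1, 0, 1]) for _ in range(n_features)]
        for j in informative:
            if rng.random() < 0.9:  # informative feature copies the label 90% of the time
                row[j] = label
        rows.append(row)
        labels.append(label)
    return rows, labels


rows, labels = make_synthetic_data()
selected = select_top_k(rows, labels, k=3)

# Keep only the shortlisted columns; a classifier would then be trained on this smaller table.
reduced_rows = [[r[j] for j in selected] for r in rows]
print("selected feature indices:", selected)
```

A classifier trained on `reduced_rows` sees 3 columns instead of 30, which is the source of the build-time saving the abstract refers to. Whether accuracy is preserved depends on the data and on the value of k, so both models should be evaluated side by side.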
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Gandotra, E., Gupta, D. (2021). An Efficient Approach for Phishing Detection using Machine Learning. In: Giri, K.J., Parah, S.A., Bashir, R., Muhammad, K. (eds) Multimedia Security. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-15-8711-5_12
DOI: https://doi.org/10.1007/978-981-15-8711-5_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-8710-8
Online ISBN: 978-981-15-8711-5
eBook Packages: Intelligent Technologies and Robotics (R0)