Abstract
Cybercriminals widely use Malicious URL, a.k.a. malicious website as a primary mechanism to host unsolicited content, such as spam, malicious advertisements, phishing, and drive-by exploits, to name a few. Previous studies used blacklisting, regular expression, and signature matching approaches to detect malicious URLs. However, these approaches are limited to detect variants of existing or newly generated malicious URLs. Over the last decade, classic machine learning techniques have been used to detect malicious URLs. In this work, we evaluate various state-of-the-art deep learning-based character level embedding methods for malicious URL detection. To leverage and transform the performance improvement, we propose DeepURLDetect (DURLD) in which raw URLs are encoded using character level embedding. To capture several types of information in URL, we used the hidden layers in deep learning architectures to extract features from character level embedding and then employ a non-linear activation function to estimate the probability of the URL as malicious or not. Experimental evaluation demonstrates that DURLD can detect variants of malicious URLs, and it is computationally inexpensive when compared to various relevant deep learning-based character level embedding methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Abadi, Martín, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, and Michael Isard. 2016. Tensorflow: A system for large-scale machine learning. In 12th \(\{\)USENIX\(\}\)symposium on operating systems design and implementation (\(\{\)OSDI\(\}\)16), 265–283.
Alazab, M., R. Layton, R. Broadhurst, and B. Bouhours. 2013. Malicious spam emails developments and authorship attribution. In 2013 fourth cybercrime and trustworthy computing workshop, 58–68.
Alazab, Mamoun, and Roderic Broadhurst. 2016. Spam and criminal activity. Trends and Issues in Crime and Criminal Justice (Australian Institute of Criminology) (526). https://www.aic.gov.au/publications/tandi/tandi526.
Alazab, Mamoun, Robert Layton, Roderic Broadhurst, and Brigitte Bouhours. 2013. Malicious spam emails developments and authorship attribution. In 2013 fourth cybercrime and trustworthy computing workshop, 58–68. IEEE, 2013.
Alazab, Mamoun, Sitalakshmi Venkatraman, Paul Watters, and Moutaz Alazab. 2010. Zero-day malware detection based on supervised learning algorithms of api call signatures.
Alazab, Mamoun, Sitalakshmi Venkatraman, Paul Watters, and Moutaz Alazab. 2013. Information security governance: the art of detecting hidden malware. In IT security governance innovations: theory and research, 293–315. IGI Global.
Anderson, Hyrum S., Jonathan Woodbridge, and Bobby Filar. 2016. Deepdga: Adversarially-tuned domain generation and detection. In Proceedings of the 2016 ACM workshop on artificial intelligence and security, 13–21.
Azab, A., M. Alazab, and M. Aiash. 2016. Machine learning based botnet identification traffic. In 2016 IEEE Trustcom/BigDataSE/ISPA, 1788–1794.
Azab, A., R. Layton, M. Alazab, and J. Oliver. 2014. Mining malware to detect variants. In 2014 fifth cybercrime and trustworthy computing conference, 44–53.
Bahnsen, A.C., E.C. Bohorquez, S. Villegas, J. Vargas, and F.A. González. 2017. Classifying phishing urls using recurrent neural networks. In 2017 APWG symposium on electronic crime research (eCrime), 1–8.
Blum, Aaron, Brad Wardman, Thamar Solorio, and Gary Warner. 2010. Lexical feature based phishing url detection using online learning. In Proceedings of the 3rd ACM Workshop on Artificial Intelligence and Security, 54–60.
Broadhurst, Roderic, Peter Grabosky, Mamoun Alazab, Brigitte Bouhours, and Steve Chon. 2014. An analysis of the nature of groups engaged in cyber crime. An Analysis of the Nature of Groups engaged in Cyber Crime, International Journal of Cyber Criminology 8 (1): 1–20.
Cao, Jian, Qiang Li, Yuede Ji, Yukun He, and Dong Guo. 2016. Detection of forwarding-based malicious urls in online social networks. International Journal of Parallel Programming 44 (1): 163–180.
Chiba, Daiki, Kazuhiro Tobe, Tatsuya Mori, and Shigeki Goto. 2012. Detecting malicious websites by learning ip address features. In 2012 IEEE/IPSJ 12th international symposium on applications and the internet, 29–39. IEEE.
Choi, Hyunsang, Bin B. Zhu, and Heejo Lee. 2011. Detecting malicious web links and identifying their attack types. WebApps 11 (11): 218.
Chollet, François. 2015. keras.
Dhingra, Bhuwan, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W Cohen. 2016. Tweet2vec: Character-based distributed representations for social media. arXiv:1605.03481.
Felegyhazi, Mark, Christian Kreibich, and Vern Paxson. 2010. On the potential of proactive domain blacklisting. LEET 10: 6.
Harikrishnan, N.B., R. Vinayakumar, K.P. Soman, and Prabaharan Poornachandran. 2019. Time split based pre-processing with a data-driven approach for malicious url detection. In Cybersecurity and secure information systems, 43–65. Springer.
Kolari, Pranam, Tim Finin, and Anupam Joshi. 2006. Svms for the blogosphere: Blog identification and splog detection. In AAAI spring symposium on computational approaches to analysing weblogs.
Lee, S., and J. Kim. 2013. Warningbird: A near real-time detection system for suspicious urls in twitter stream. IEEE Transactions on Dependable and Secure Computing 10 (3): 183–195.
Ma, Justin, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. 2009. Beyond blacklists: learning to detect malicious web sites from suspicious urls. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 1245–1254.
Ma, Justin, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. 2009. Identifying suspicious urls: an application of large-scale online learning. In Proceedings of the 26th annual international conference on machine learning, 681–688.
Kevin McGrath, D., and Minaxi Gupta. 2008. Behind phishing: An examination of phisher modi operandi. LEET 8: 4.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, and Vincent Dubourg. 2011. Scikit-learn: Machine learning in python. the Journal of Machine Learning Research, 12: 2825–2830.
R., V., M. Alazab, A. Jolfaei, S. K.P., and P. Poornachandran. 2019. Ransomware triage using deep learning: Twitter as a case study. In 2019 cybersecurity and cyberforensics conference (CCC), 67–73
S, S., V. R, M. Alazab, and S. KP. 2020. Network flow based iot botnet attack detection using deep learning. In IEEE INFOCOM 2020 - IEEE conference on computer communications workshops (INFOCOM WKSHPS), 189–194.
S, S., V. R, S. V, M. Alazab, and S. KP. 2020. Multi-scale learning based malware variant detection using spatial pyramid pooling network. In IEEE INFOCOM 2020 - IEEE conference on computer communications workshops (INFOCOM WKSHPS), 740–745.
Sahoo, Doyen, Chenghao Liu, and Steven CH Hoi. 2017. Malicious url detection using machine learning: A survey. arXiv:1701.07179.
Sanders, Hillary, and Joshua Saxe. 2017. Garbage in, garbage out: How purport-edly great ml models can be screwed up by bad data. Technical report.
Saxe, Joshua, and Konstantin Berlin. 2017. expose: A character-level convolutional neural network with embeddings for detecting malicious urls, file paths and registry keys. arXiv:1702.08568.
Schiappa, Madeline. 2009. Machine learning: How to build a better threat detection model. Accessed July 3, 2020.
Sommer, R., and V. Paxson. 2010. Outside the closed world: On using machine learning for network intrusion detection. In 2010 IEEE symposium on security and privacy, 305–316.
Srinivasan, S., V. Ravi, S. V., M. Krichen, D. Ben Noureddine, S. Anivilla, and S. K. P. 2020. Deep convolutional neural network based image spam classification. In 2020 6th conference on data science and machine learning applications (CDMA), 112–117.
Tran, Khoi-Nguyen, Mamoun Alazab, and Roderic Broadhurst. 2014. Towards a feature rich model for predicting spam emails containing malicious attachments and URLs.
Verma, Rakesh. 2018. Security analytics: Adapting data science for security challenges. In Proceedings of the fourth ACM international workshop on security and privacy analytics, 40–41.
Vinayakumar, R., M. Alazab, K.P. Soman, P. Poornachandran, A. Al-Nemrat, and S. Venkatraman. 2019. Deep learning approach for intelligent intrusion detection system. IEEE Access 7: 41525–41550.
Vinayakumar, R., M. Alazab, K.P. Soman, P. Poornachandran, and S. Venkatraman. 2019. Robust intelligent malware detection using deep learning. IEEE Access 7: 46717–46738.
Vinayakumar, R., M. Alazab, S. Srinivasan, Q. Pham, S.K. Padannayil, and K. Simran. 2020. A visualized botnet detection system based deep learning for the internet of things networks of smart cities. IEEE Transactions on Industry Applications 56 (4): 4436–4456.
Vinayakumar, R., Prabaharan Poornachandran, and K.P. Soman. 2018. Scalable framework for cyber threat situational awareness based on domain name systems data analysis. In Big data in engineering applications, 113–142. Springer.
Vinayakumar, R., K.P. Soman, and Prabaharan Poornachandran. 2018. Evaluating deep learning approaches to characterize and classify malicious url’s. Journal of Intelligent & Fuzzy Systems, 34(3):1333–1343.
Vinayakumar, R., K.P. Soman, Prabaharan Poornachandran, Mamoun Alazab, and Sabu Thampi 2019. Amritadga: a comprehensive data set for domain generation algorithms (dgas) based domain name detection systems and application of deep learning. In Big data recommender systems-Volume 2: application paradigms, 455–485. Institution of Engineering and Technology (IET).
Vosoughi, Soroush, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2vec: Learning tweet embeddings using character-level cnn-lstm encoder-decoder. In Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval, 1041–1044.
Zhang, Xiang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, 649–657.
Acknowledgements
This work was supported by the Department of Corporate and Information Services, Northern Territory Government of Australia and in part by Paramount Computer Systems and Lakhshya Cyber Security Labs. We are grateful to NVIDIA India, for the GPU hardware support to the research grant. We are also grateful to Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, for encouraging this research.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Srinivasan, S., Vinayakumar, R., Arunachalam, A., Alazab, M., Soman, K. (2021). DURLD: Malicious URL Detection Using Deep Learning-Based Character Level Representations. In: Stamp, M., Alazab, M., Shalaginov, A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-62582-5_21
Download citation
DOI: https://doi.org/10.1007/978-3-030-62582-5_21
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62581-8
Online ISBN: 978-3-030-62582-5
eBook Packages: Computer ScienceComputer Science (R0)