DURLD: Malicious URL Detection Using Deep Learning-Based Character Level Representations

Malware Analysis Using Artificial Intelligence and Deep Learning

Abstract

Cybercriminals widely use malicious URLs, also known as malicious websites, as a primary mechanism to host unsolicited content such as spam, malicious advertisements, phishing, and drive-by exploits. Previous studies used blacklisting, regular expression, and signature matching approaches to detect malicious URLs. However, these approaches are limited in their ability to detect variants of existing malicious URLs or newly generated ones. Over the last decade, classic machine learning techniques have been used to detect malicious URLs. In this work, we evaluate various state-of-the-art deep learning-based character level embedding methods for malicious URL detection. To build on the performance improvement these methods offer, we propose DeepURLDetect (DURLD), in which raw URLs are encoded using character level embedding. To capture several types of information in a URL, we use the hidden layers of deep learning architectures to extract features from the character level embedding and then employ a non-linear activation function to estimate the probability that the URL is malicious. Experimental evaluation demonstrates that DURLD can detect variants of malicious URLs and that it is computationally inexpensive compared with other relevant deep learning-based character level embedding methods.
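The pipeline outlined in the abstract (character-level encoding, an embedding lookup, feature extraction, and a non-linear output) can be illustrated with a minimal sketch. This is not the DURLD architecture: the vocabulary, `MAX_LEN`, `EMBED_DIM`, the function names, and the randomly initialised weights are all assumptions made for illustration, and simple mean pooling stands in for the hidden layers, which the abstract does not specify. The output of this untrained sketch is therefore meaningless as a prediction and only shows the data flow.

```python
import math
import random
import string

# Assumed vocabulary: printable ASCII characters.
# Index 0 is reserved for padding and for characters outside the vocabulary.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
MAX_LEN = 64    # assumed fixed input length
EMBED_DIM = 8   # assumed embedding size

# Randomly initialised parameters; a real model would learn these from data.
rng = random.Random(0)
EMBEDDING = [[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]
             for _ in range(len(VOCAB) + 1)]
WEIGHTS = [rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]


def encode_url(url, max_len=MAX_LEN):
    """Map each character to an integer index, then truncate or right-pad with 0."""
    ids = [VOCAB.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


def sigmoid(x):
    """Non-linear activation that squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_url(url):
    """Embed the characters, mean-pool them, and apply a sigmoid output unit."""
    vectors = [EMBEDDING[i] for i in encode_url(url) if i != 0]
    if not vectors:
        return 0.5  # nothing to embed: the logit is zero
    # Mean pooling is a stand-in for the hidden layers of a deep architecture.
    pooled = [sum(column) / len(vectors) for column in zip(*vectors)]
    logit = sum(w * p for w, p in zip(WEIGHTS, pooled))
    return sigmoid(logit)


if __name__ == "__main__":
    print(encode_url("http://a.io")[:12])
    print(round(score_url("http://example.com/login"), 3))
```

Because the encoding works on raw characters, no hand-crafted lexical features are needed, which is the property the abstract relies on when it describes detecting variants of known malicious URLs.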

Acknowledgements

This work was supported by the Department of Corporate and Information Services, Northern Territory Government of Australia, and in part by Paramount Computer Systems and Lakhshya Cyber Security Labs. We are grateful to NVIDIA India for the GPU hardware support to the research grant. We are also grateful to the Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, for encouraging this research.

Author information

Corresponding author

Correspondence to Sriram Srinivasan.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Srinivasan, S., Vinayakumar, R., Arunachalam, A., Alazab, M., Soman, K. (2021). DURLD: Malicious URL Detection Using Deep Learning-Based Character Level Representations. In: Stamp, M., Alazab, M., Shalaginov, A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-62582-5_21

  • DOI: https://doi.org/10.1007/978-3-030-62582-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62581-8

  • Online ISBN: 978-3-030-62582-5

  • eBook Packages: Computer Science (R0)
