Everything Is in the Name – A URL Based Approach for Phishing Detection

Part of the Lecture Notes in Computer Science book series (LNSC, volume 11527)


Phishing, in which a user is tricked into revealing sensitive information on a spoofed website, is one of the most common threats to cybersecurity. Most modern web browsers counter phishing attacks using a blacklist of confirmed phishing URLs. However, one major disadvantage of the blacklist method is that it is ineffective against newly generated phishes. Machine learning based techniques that rely on features extracted from the URL (e.g., URL length and bag-of-words) or the web page (e.g., TF-IDF and form fields) are considered to be more effective in identifying new phishing attacks. The main benefit of using URL based features over page based features is that the machine learning model can classify new URLs on the fly, even before the page is loaded by the web browser, thus avoiding other potential dangers such as drive-by download attacks and cryptojacking attacks.

In this work, we focus on improving the performance of URL based detection techniques. We show that, although a classifier trained on traditional bag-of-words features (tokenized using special characters) works well in many cases, it fails to recognize a very prevalent class of phishing URLs that combine a popular brand with one or more words. To overcome this limitation, we explore various alternative feature extraction techniques based on word segmentation and \(n\)-grams. We also construct and use a phishy-list of popular words that are highly indicative of phishing attacks. We verify the efficacy of each of these feature sets by training a logistic regression classifier on a large dataset consisting of 100,000 URLs. Our experimental results reveal that features based on word segmentation, the phishy-list and numerical features (e.g., URL length) perform better than all other features, as measured by misclassification and false negative rates.
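The contrast between special-character tokens, character \(n\)-grams and word segmentation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the URL and its brand name are hypothetical, the three-word vocabulary is a toy stand-in for a real word list, and the `segment` helper is a simple dynamic-programming substitute for a real word-segmentation library.

```python
import re


def special_char_tokens(url):
    """Bag-of-words tokens: split the URL on every non-alphanumeric character."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def char_ngrams(token, n):
    """All contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def segment(token, vocab):
    """Split a token into vocabulary words where possible.

    Dynamic programming over prefixes: minimize the number of characters
    not covered by a vocabulary word, then the number of pieces.
    """
    n = len(token)
    # best[i] = (uncovered_chars, n_pieces, pieces) for the prefix token[:i]
    best = [(0, 0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(i):
            piece = token[j:i]
            uncovered = 0 if piece in vocab else len(piece)
            cand = (best[j][0] + uncovered, best[j][1] + 1, best[j][2] + [piece])
            if best[i] is None or cand[:2] < best[i][:2]:
                best[i] = cand
    return best[n][2]


# Hypothetical URL: a made-up brand ("example") glued to two suspicious words.
url = "http://examplebanklogin.test/verify"
toy_vocab = {"bank", "login", "verify"}  # toy stand-in for a real word list

tokens = special_char_tokens(url)
# tokens == ['http', 'examplebanklogin', 'test', 'verify']
# "login" never appears as a token of its own, so a bag-of-words model misses it.

five_grams = char_ngrams("examplebanklogin", 5)   # contains 'login'
words = segment("examplebanklogin", toy_vocab)    # ['example', 'bank', 'login']
```

In this toy case, both the 5-grams and the segmented words expose `login`, which the special-character tokens hide inside one long, previously unseen token.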


  • Phishing detection
  • Machine learning
  • Social engineering attacks




Author information

Corresponding author

Correspondence to Harshal Tupsamudre.

Appendix A

The phishy-list consisting of 105 words extracted from the phishing dataset is given below:

{limited, securewebsession, confirmation, page, signin, team, sign, access, protection, active, manage, redirectme, http, secure, customer, account, client, information, recovery, verify, secured, busines, refund, help, safe, bank, event, promo, webservis, giveaway, card, webspace, user, notify, servico, store, device, payment, webnode, drive, shop, gold, violation, random, upgrade, webapp, dispute, setting, banking, activity, startup, review, email, approval, admin, browser, webapp, billing, advert, protect, case, temporary, alert, portal, login, servehttp, center, client, restore, secure, blob, smart, fortune, gift, server, security, page, confirm, notification, core, host, central, service, account, servise, support, apps, form, info, compute, verification, check, storage, setting, digital, update, token, required, resolution, ebayisapi, webscr, login, free, lucky, bonus}
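One way such a list might be turned into classifier inputs is sketched below. It is an assumption-laden illustration rather than the paper's feature pipeline: the eight-word subset, the `url_features` helper, the feature names and the URL are all hypothetical, and tokens are obtained by a plain special-character split.

```python
import re

# Small subset of the phishy-list above, chosen for illustration only.
PHISHY_SUBSET = frozenset(
    {"secure", "signin", "login", "verify", "account", "bank", "update", "confirm"}
)


def url_features(url, phishy_words=PHISHY_SUBSET):
    """Return simple numerical features for one URL.

    url_length : number of characters in the URL
    num_tokens : number of tokens after splitting on non-alphanumerics
    num_phishy : number of tokens that appear in the phishy word set
    """
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    return {
        "url_length": len(url),
        "num_tokens": len(tokens),
        "num_phishy": sum(1 for t in tokens if t in phishy_words),
    }


# Hypothetical URL; four of its seven tokens are in the subset.
features = url_features("http://secure-login.example.test/verify/account")
```

A feature dictionary of this kind could then be vectorized and passed to a logistic regression classifier, the model named in the abstract.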

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Tupsamudre, H., Singh, A.K., Lodha, S. (2019). Everything Is in the Name – A URL Based Approach for Phishing Detection. In: Dolev, S., Hendler, D., Lodha, S., Yung, M. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2019. Lecture Notes in Computer Science, vol 11527. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20950-6

  • Online ISBN: 978-3-030-20951-3

  • eBook Packages: Computer Science, Computer Science (R0)