A Self-training Method for Detection of Phishing Websites

  • Conference paper
Data Mining and Big Data (DMBD 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10943)

Abstract

Machine-learning-based phishing detection often lacks training data with high-confidence labels. To reduce the impact of scarce labels in the training set on phishing detection performance, this paper proposes an improved self-training method for semi-supervised learning. It applies the divide-and-conquer principle, decomposing the original problem into a number of smaller sub-problems that are similar to the original one. We compare classification quality among supervised learning, traditional semi-supervised learning, and the proposed method using four classifiers, and we compare the running time of the two semi-supervised methods. When the improved method divides the unlabeled dataset into equal parts, the running time can be reduced by 50% while keeping classification performance equal to that of the traditional self-training method. Furthermore, the running time continues to decrease significantly as the number of partitions of the unlabeled dataset increases. The experimental results show that the proposed improved self-training method outperforms the traditional self-training method.
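
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of "self-training over an equally divided unlabeled set", not the authors' implementation. The nearest-centroid toy classifier, the confidence score, the `threshold` value, and all function names (`split_equally`, `chunked_self_training`, and so on) are assumptions made for illustration; the paper itself evaluates four classifiers that are not named here.

```python
def centroid_fit(X, y):
    """Toy stand-in classifier (an assumption): one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def centroid_predict(model, x):
    """Return (label, score); the score is a simple distance-based confidence."""
    dists = {
        c: sum((a - b) ** 2 for a, b in zip(x, m)) ** 0.5
        for c, m in model.items()
    }
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    score = 1.0 - dists[label] / total if total > 0 else 1.0
    return label, score


def split_equally(items, k):
    """Split a list into k consecutive chunks of (nearly) equal size."""
    size, rem = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < rem else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def chunked_self_training(X_lab, y_lab, X_unlab, k=2, threshold=0.8):
    """Self-training that visits the unlabeled data one chunk at a time.

    For each chunk, confident predictions are added to the training set
    as pseudo-labels, and the model is refit before the next chunk.
    """
    X_train, y_train = list(X_lab), list(y_lab)
    model = centroid_fit(X_train, y_train)
    for chunk in split_equally(list(X_unlab), k):
        for x in chunk:
            label, score = centroid_predict(model, x)
            if score >= threshold:  # keep only confident pseudo-labels
                X_train.append(x)
                y_train.append(label)
        model = centroid_fit(X_train, y_train)
    return model, X_train, y_train


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors; 0 = legitimate, 1 = phishing.
    X_lab, y_lab = [[0.0, 0.0], [10.0, 10.0]], [0, 1]
    X_unlab = [[0.5, 0.2], [9.5, 9.8], [0.1, 0.9], [9.9, 9.0]]
    model, X_train, y_train = chunked_self_training(X_lab, y_lab, X_unlab, k=2)
    print(y_train)
```

In this sketch each chunk is scored once against the current model, so increasing `k` means each pass handles a smaller slice of the unlabeled pool. Whether that matches the source of the running-time reduction reported in the abstract cannot be confirmed from the abstract alone.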



Author information

Corresponding author

Correspondence to Xiao-feng Rong.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Jia, Xp., Rong, Xf. (2018). A Self-training Method for Detection of Phishing Websites. In: Tan, Y., Shi, Y., Tang, Q. (eds) Data Mining and Big Data. DMBD 2018. Lecture Notes in Computer Science, vol 10943. Springer, Cham. https://doi.org/10.1007/978-3-319-93803-5_39

  • DOI: https://doi.org/10.1007/978-3-319-93803-5_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93802-8

  • Online ISBN: 978-3-319-93803-5

  • eBook Packages: Computer Science, Computer Science (R0)
