Abstract
In a phishing attack, an unsuspecting victim is lured, typically via an email, to a web site designed to steal sensitive information such as bank or credit card account numbers and account login credentials. Each year Internet users lose billions of dollars to this scourge. In this paper, we present a general semantic feature selection method for text problems based on the statistical t-test and WordNet, and we show its effectiveness on phishing email detection by designing classifiers that combine semantics and statistics in analyzing the text of the email. Our feature selection method is general and is useful for other applications involving text-based analysis as well. Our classifier, which uses only the email body text, achieves more than 95% accuracy in detecting phishing emails with a false positive rate of 2.24%. Because it uses semantics, our feature selection method is robust against adaptive attacks and avoids the frequent retraining needed by machine learning classifiers.
Research supported in part by NSF grants DUE 1241772 and CNS 1319212.
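To make the statistical half of the abstract's idea concrete, the following is a minimal sketch of t-test-based feature scoring, written under stated assumptions rather than taken from the paper. It assumes whitespace tokenization, per-email word counts as the samples, and Welch's unequal-variance t statistic; the WordNet-based semantic step is omitted entirely. The names `welch_t`, `word_counts`, and `select_features` are hypothetical.

```python
import math
from collections import Counter


def welch_t(xs, ys):
    """Welch's t statistic for two samples (each needs at least 2 values)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    # Unbiased sample variances.
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    if se == 0.0:
        # Both samples are constant: no evidence if the means agree,
        # otherwise treat the separation as unbounded.
        return 0.0 if mx == my else math.copysign(math.inf, mx - my)
    return (mx - my) / se


def word_counts(doc):
    """Assumed tokenization: lowercase, split on whitespace."""
    return Counter(doc.lower().split())


def select_features(phish_docs, ham_docs, k):
    """Return the k words whose counts differ most between the two classes."""
    phish = [word_counts(d) for d in phish_docs]
    ham = [word_counts(d) for d in ham_docs]
    vocab = set().union(*phish, *ham)
    scores = {
        w: welch_t([c[w] for c in phish], [c[w] for c in ham])
        for w in vocab
    }
    # Rank by |t|, breaking ties alphabetically so the result is deterministic.
    return sorted(scores, key=lambda w: (-abs(scores[w]), w))[:k]
```

On a real corpus one would likely remove stopwords and stem tokens before scoring, since on tiny samples common words can separate the classes perfectly by chance.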
Notes
- 1.
- 2. The ‘Other’ category is explained in Table 2.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Verma, R., Hossain, N. (2014). Semantic Feature Selection for Text with Application to Phishing Email Detection. In: Lee, HS., Han, DG. (eds) Information Security and Cryptology -- ICISC 2013. ICISC 2013. Lecture Notes in Computer Science, vol 8565. Springer, Cham. https://doi.org/10.1007/978-3-319-12160-4_27
DOI: https://doi.org/10.1007/978-3-319-12160-4_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12159-8
Online ISBN: 978-3-319-12160-4
eBook Packages: Computer Science, Computer Science (R0)