Semantic Feature Selection for Text with Application to Phishing Email Detection

Verma, Rakesh; Hossain, Nabil

doi:10.1007/978-3-319-12160-4_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8565)

Included in the following conference series:

International Conference on Information Security and Cryptology

Abstract

In a phishing attack, an unsuspecting victim is lured, typically via an email, to a web site designed to steal sensitive information such as bank/credit card account numbers, login information for accounts, etc. Each year Internet users lose billions of dollars to this scourge. In this paper, we present a general semantic feature selection method for text problems based on the statistical t-test and WordNet, and we show its effectiveness on phishing email detection by designing classifiers that combine semantics and statistics in analyzing the text in the email. Our feature selection method is general and useful for other applications involving text-based analysis as well. Our email body-text-only classifier achieves more than 95 % accuracy on detecting phishing emails with a false positive rate of 2.24 %. Due to its use of semantics, our feature selection method is robust against adaptive attacks and avoids the problem of frequent retraining needed by machine learning classifiers.
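The abstract does not spell out the selection procedure, so the following is only a minimal sketch of how a t-test can rank candidate words, not the authors' method. It assumes whitespace tokenization, per-email word counts as the samples, Welch's unequal-variance form of the t-statistic, and an arbitrary cutoff; the names `welch_t`, `select_features`, and `threshold` are hypothetical. The WordNet-based semantic step is omitted entirely.

```python
import math
from collections import Counter


def welch_t(a, b):
    """Welch's two-sample t-statistic (unequal variances) for two samples."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(var_a / na + var_b / nb)
    if se == 0.0:
        # Both samples are constant: no difference, or perfect separation.
        if mean_a == mean_b:
            return 0.0
        return math.copysign(math.inf, mean_a - mean_b)
    return (mean_a - mean_b) / se


def select_features(phishing, legitimate, threshold=2.0):
    """Return {word: t} for words markedly more frequent in phishing emails.

    Assumption: the cutoff `threshold` is illustrative, not taken from the paper.
    """
    phish_counts = [Counter(text.lower().split()) for text in phishing]
    legit_counts = [Counter(text.lower().split()) for text in legitimate]
    vocabulary = set().union(*phish_counts, *legit_counts)
    selected = {}
    for word in vocabulary:
        t = welch_t([c[word] for c in phish_counts],
                    [c[word] for c in legit_counts])
        if t > threshold:
            selected[word] = t
    return selected
```

Calling `select_features` with two lists of email bodies returns the words whose counts are significantly higher in the first list. A fuller version might stem the tokens and group related words before testing, which is where a lexical resource such as WordNet could come in.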

Research supported in part by NSF grants DUE 1241772 and CNS 1319212.

Notes

1. http://www.cs.cmu.edu/~enron/
2. The ‘Other’ category is explained in Table 2.

Author information

Authors and Affiliations

Department of Computer Science, University of Houston, 4800 Calhoun Road, Houston, TX, USA
Rakesh Verma & Nabil Hossain

Corresponding author

Correspondence to Rakesh Verma.

Editor information

Editors and Affiliations

EWHA Womans University, Seoul, Korea, Republic of (South Korea)
Hyang-Sook Lee
Kookmin University, Seoul, Korea, Republic of (South Korea)
Dong-Guk Han

About this paper

Cite this paper

Verma, R., Hossain, N. (2014). Semantic Feature Selection for Text with Application to Phishing Email Detection. In: Lee, HS., Han, DG. (eds) Information Security and Cryptology -- ICISC 2013. ICISC 2013. Lecture Notes in Computer Science, vol 8565. Springer, Cham. https://doi.org/10.1007/978-3-319-12160-4_27

DOI: https://doi.org/10.1007/978-3-319-12160-4_27
Published: 19 October 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12159-8
Online ISBN: 978-3-319-12160-4
eBook Packages: Computer Science, Computer Science (R0)