Text Mining for Phishing E-mail Detection

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 308)


Phishing e-mails threaten online banking transactions because they mislead customers into disclosing valuable information, which results in monetary losses. A common approach is to extract specific features from phishing e-mails semi-automatically using small scripts, which is a tedious process. This paper proposes text mining to extract distinguishing features from a collection of both phishing and legitimate e-mails for better detection of phishing attacks. The proposed method first converts the e-mails to a vector representation and then applies feature selection techniques to choose the best features for classification. The method is evaluated on a data set collected from the HamCorpus of the SpamAssassin project (legitimate e-mail) and the publicly available PhishingCorpus (phishing e-mail). The results indicate that text mining-based phishing detection is simple, fast, and more accurate than the state-of-the-art methods.
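The two steps named in the abstract (vector representation, then feature selection) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four toy e-mails, the term-count representation, the choice of information gain as the selection criterion, and the helper names `tokenize`, `to_vectors`, `information_gain`, and `select_top_k`. The abstract does not state which representation or selection technique the paper actually uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_vectors(emails):
    """Build a sorted vocabulary and one term-count vector per e-mail."""
    vocab = sorted({tok for e in emails for tok in tokenize(e)})
    vectors = []
    for e in emails:
        counts = Counter(tokenize(e))
        vectors.append([counts.get(term, 0) for term in vocab])
    return vocab, vectors


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Entropy reduction from splitting on presence/absence of one term."""
    present = [y for x, y in zip(column, labels) if x > 0]
    absent = [y for x, y in zip(column, labels) if x == 0]
    n = len(labels)
    return (entropy(labels)
            - (len(present) / n) * entropy(present)
            - (len(absent) / n) * entropy(absent))


def select_top_k(vocab, vectors, labels, k):
    """Rank terms by information gain (ties broken alphabetically)."""
    scored = []
    for j, term in enumerate(vocab):
        column = [v[j] for v in vectors]
        scored.append((information_gain(column, labels), term))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [term for _, term in scored[:k]]


# Toy data, invented for illustration only (not from either corpus).
emails = [
    "verify your account password now",
    "your account is suspended verify password",
    "meeting notes for the project attached",
    "lunch on friday with the project team",
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]

vocab, vectors = to_vectors(emails)
print(select_top_k(vocab, vectors, labels, 6))
```

In a full pipeline, the reduced vectors (only the selected terms) would be passed to a classifier; that step is omitted here because the abstract does not name one.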


Text mining, Phishing, Classification, Feature selection



Copyright information

© Springer India 2015

Authors and Affiliations

  1. Department of Computer Science, Jamia Hamdard University, New Delhi, India
