Consensus Clustering and Supervised Classification for Profiling Phishing Emails in Internet Commerce Security

  • Richard Dazeley
  • John L. Yearwood
  • Byeong H. Kang
  • Andrei V. Kelarev
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6232)


This article investigates internet commerce security applications of a novel method that combines unsupervised consensus clustering algorithms with supervised classification methods. First, a variety of independent clustering algorithms are applied to a randomized sample of the data. Second, several consensus functions and sophisticated algorithms are used to combine these independent clusterings into one final consensus clustering. Third, the consensus clustering of the randomized sample serves as a training set for several fast supervised classification algorithms. Finally, these fast classification algorithms are used to classify the whole large data set. One advantage of this approach is that domain experts can contribute by adjusting the training set created by consensus clustering. We apply this approach to profiling phishing emails selected from a very large data set supplied by the industry partners of the Centre for Informatics and Applied Optimization. Our experiments compare the performance of several classification algorithms incorporated in this scheme.
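The four steps described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: emails are stood in for by synthetic 2-D feature vectors, the independent clusterings are plain k-means runs, the consensus function is a majority vote over a co-association matrix, and the fast classifier is a nearest-centroid rule. The function names (`kmeans`, `consensus`, `train_centroids`, `classify`) are hypothetical.

```python
import random
from itertools import combinations

def dist2(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (assumed base clusterer); one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def consensus(clusterings, n, threshold=0.5):
    """Assumed consensus function: link two points when more than `threshold`
    of the clusterings co-cluster them, then take connected components."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        agree = sum(c[i] == c[j] for c in clusterings) / len(clusterings)
        if agree > threshold:
            parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    return [roots.index(find(i)) for i in range(n)]

def train_centroids(points, labels):
    """Assumed fast classifier: one mean vector per consensus cluster."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    return {l: tuple(sum(x) / len(ps) for x in zip(*ps)) for l, ps in groups.items()}

def classify(p, centroids):
    """Assign a point to the cluster with the nearest centroid."""
    return min(centroids, key=lambda l: dist2(p, centroids[l]))

rng = random.Random(0)
# Synthetic stand-in for the large data set: two blobs of 200 points each.
data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
        for cx, cy in [(0, 0), (10, 10)] for _ in range(200)]

sample = rng.sample(data, 40)                                    # randomized sample
clusterings = [kmeans(sample, k, rng) for k in (2, 2, 2, 3, 3)]  # step 1
sample_labels = consensus(clusterings, len(sample))              # step 2
centroids = train_centroids(sample, sample_labels)               # step 3
all_labels = [classify(p, centroids) for p in data]              # step 4

print(len(set(sample_labels)), "consensus clusters;", len(all_labels), "items labelled")
```

In this sketch the expensive pairwise consensus step touches only the 40 sampled points, while the whole data set is labelled by a single pass of the cheap classifier. A domain expert could edit `sample_labels` between steps 2 and 3, which is where the abstract places expert adjustment of the training set.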


Keywords (machine-generated, not supplied by the authors): Cluster Algorithm; Consensus Function; Inverse Document Frequency; Cluster Ensemble; Consensus Cluster





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Richard Dazeley (1)
  • John L. Yearwood (1)
  • Byeong H. Kang (2)
  • Andrei V. Kelarev (1)

  1. Centre for Informatics and Applied Optimization, Graduate School of ITMS, University of Ballarat, Ballarat, Australia
  2. School of Computing and Information Systems, University of Tasmania, Hobart, Australia
