Clustering for malware classification

  • Swathi Pai
  • Fabio Di Troia
  • Corrado Aaron Visaggio
  • Thomas H. Austin
  • Mark Stamp
Original Paper


In this research, we apply clustering techniques to the malware classification problem. We compute clusters using the well-known K-means and Expectation Maximization (EM) algorithms, with the underlying scores based on hidden Markov models (HMMs). We compare the results obtained from these two clustering approaches, and we carefully consider the interplay between the dimension (i.e., the number of models used for clustering) and the number of clusters, with respect to the accuracy of the clustering.
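As a rough illustration of the kind of clustering described above, the following minimal sketch runs plain K-means on score vectors, where each sample is represented by one score per model, so the dimension equals the number of models. The score values, the function name `kmeans`, and the choice of seeding the centroids with the first k points are all assumptions made for this example; they are not taken from the paper, and the sketch does not cover the Expectation Maximization variant.

```python
import math

def kmeans(points, k, iters=100):
    """Plain Lloyd's K-means. Seeding with the first k points is an assumption."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing, so the clustering has converged
        labels = new_labels
    return labels, centroids

# Hypothetical 2-dimensional score vectors (one score per model, two models).
# The numbers are made up for illustration only.
scores = [
    (-2.1, -9.8), (-9.7, -2.2),
    (-2.4, -10.1), (-10.2, -1.8),
    (-1.9, -9.5), (-9.9, -2.5),
]

labels, centroids = kmeans(scores, k=2)
print(labels)  # -> [0, 1, 0, 1, 0, 1]
```

Increasing the dimension here would mean adding more scores to each tuple, while increasing the number of clusters means raising `k`; the two can be varied independently, which is the interplay the abstract refers to.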


Receiver Operating Characteristic Curve · Hidden Markov Model · Expectation Maximization · Silhouette Coefficient



Copyright information

© Springer-Verlag France 2016

Authors and Affiliations

  • Swathi Pai (1)
  • Fabio Di Troia (2)
  • Corrado Aaron Visaggio (2)
  • Thomas H. Austin (1)
  • Mark Stamp (1)
  1. Department of Computer Science, San Jose State University, San Jose, USA
  2. Department of Engineering, Università degli Studi del Sannio, Benevento, Italy
