Clustering versus SVM for malware detection

  • Usha Narra
  • Fabio Di Troia
  • Visaggio Aaron Corrado
  • Thomas H. Austin
  • Mark Stamp
Original Paper


Previous work has shown that cluster analysis can be used to effectively classify malware into meaningful families. In this research, we apply cluster analysis to the challenging problem of classifying previously unknown malware. We perform several experiments involving malware clustering. We compare our clustering results to those obtained when a support vector machine (SVM) is trained on the malware family. Using clustering, we are able to classify malware with an accuracy comparable to that of an SVM. An advantage of the clustering approach is that a new malware family can be classified before a model has been trained specifically for the family.
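The idea of classifying a sample by the cluster it falls into can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-dimensional feature vectors, the family labels, the hand-picked initial centroids, and the helper names `kmeans` and `classify` are hypothetical, and this abstract does not specify the features or clustering configuration used in the experiments. The sketch runs plain K-means, labels each cluster by majority vote, and assigns a new sample to its nearest centroid.

```python
import math
from collections import Counter

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def kmeans(points, centroids, iterations=20):
    """Plain Lloyd-style K-means; returns final centroids and assignments."""
    centroids = list(centroids)
    for _ in range(iterations):
        assign = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]

# Hypothetical feature vectors (for example, scores from two models).
points = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9),
          (5.0, 5.1), (4.8, 5.3), (5.2, 4.9),
          (9.0, 1.0), (9.2, 0.8), (8.9, 1.1)]
labels = ["famA"] * 3 + ["famB"] * 3 + ["benign"] * 3

centroids, assign = kmeans(points, [points[0], points[3], points[6]])

# Label each cluster by the majority label of its members.
cluster_label = {
    j: Counter(l for l, a in zip(labels, assign) if a == j).most_common(1)[0][0]
    for j in set(assign)
}

def classify(sample):
    """Assign a sample the label of its nearest cluster."""
    return cluster_label[nearest(sample, centroids)]

print(classify((5.1, 5.0)))  # famB
```

Note that the clusters themselves are formed without using the labels; the labels only name the clusters afterwards. A supervised model such as an SVM, by contrast, needs labeled samples of a family before it can be trained for that family.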


Keywords

  • Support Vector Machine
  • Receiver Operating Characteristic Curve
  • Hidden Markov Model
  • Expectation Maximization
  • Support Vector Machine Model

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag France 2015

Authors and Affiliations

  • Usha Narra (1)
  • Fabio Di Troia (2)
  • Visaggio Aaron Corrado (2)
  • Thomas H. Austin (1)
  • Mark Stamp (1)

  1. Department of Computer Science, San Jose State University, San Jose, USA
  2. Department of Engineering, Università degli Studi del Sannio, Benevento, Italy
