Advertisement

Hidden Markov models for malware classification

  • Chinmayee Annachhatre
  • Thomas H. Austin
  • Mark Stamp
Original Paper

Abstract

Previous research has shown that hidden Markov model (HMM) analysis is useful for detecting certain challenging classes of malware. In this research, we consider the related problem of malware classification based on HMMs. We train multiple HMMs on a variety of compilers and malware generators. More than 8,000 malware samples are then scored against these models and separated into clusters based on the resulting scores. We observe that the clustering results could be used to classify the malware samples into their appropriate families with good accuracy. Since none of the malware families in the test set were used to generate the HMMs, these results indicate that our approach can effective classify previously unknown malware, at least in some cases. Thus, such a clustering strategy could serve as a useful tool in malware analysis and classification.

Keywords

Hide Markov Model Hide State Forward Algorithm Silhouette Coefficient Initial Centroid 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. 1.
    Annachhatre, C.: Hidden Markov models for malware classification. Department of Computer Science, San Jose State University, Master’s report (2013)Google Scholar
  2. 2.
    Attaluri, S., McGhee, S., Stamp, M.: Profile hidden Markov models and metamorphic virus detection. J. Comput. Virol. 5(2), 151–169 (2009)CrossRefGoogle Scholar
  3. 3.
    Austin, T., Filiol, E., Josse, S., Stamp, M.: Exploring hidden Markov models for virus analysis: a semantic approach. In: 46th Hawaii International Conference on System Sciences (HICSS 46), pp. 5039–5048 (2013)Google Scholar
  4. 4.
    Baysa, D., Low, R.M., Stamp, M.: Structural entropy and metamorphic malware. J. Comput. Virol. Hacking Tech. 9(4), 179–192 (2013)CrossRefGoogle Scholar
  5. 5.
    Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30, 1145–1159 (1997)Google Scholar
  6. 6.
    Canzanese, R., Kam, M., Mancoridis, S.: Toward an automatic, online behavioral malware classification system. https://www.cs.drexel.edu/~spiros/papers/saso2013.pdf (2013)
  7. 7.
    Cesare, S., Xiang, Y.: Classification of Malware using structured control flow. In: 8th Australasian Symposium on Parallel and Distributed Computing, vol. 107, pp. 61–70 (2010)Google Scholar
  8. 8.
    Do, C.B., Batzoglou, S.: What is the expectation maximization algorithm? Nat. Biotechnol. 26(8), 897–899. http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf (2008)
  9. 9.
    Indika: Difference between hierarchical and partitional clustering. http://www.differencebetween.com/difference-between-hierarchical-and-vs-partitional-clustering (2011)
  10. 10.
    Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)zbMATHGoogle Scholar
  11. 11.
    Jin, R.: Cluster validation. http://www.cs.kent.edu/~jin/DM08/ClusterValidation.pdf (2008)
  12. 12.
    Jones, K.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. 28(1), 11–21 (1972)CrossRefGoogle Scholar
  13. 13.
    Kolter, S., Maloof, M.: Learning to detect and classify malicious executables in the wild. J. Mach. Learn. Res. 7, 2721–2744 (2006)MathSciNetzbMATHGoogle Scholar
  14. 14.
    Krogh, A.: An Introduction to Hidden Markov Models for Biological Sequences. Computational Methods in Molecular Biology. Elsevier, Lyngby (1998)zbMATHGoogle Scholar
  15. 15.
    Krogh, A., et al.: Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol. 235(5), 1501–1531 (1994)CrossRefGoogle Scholar
  16. 16.
    Lakhotia, A., Walenstein, A., Miles, C., Singh, A.: VILO: a rapid learning nearest-neighbor classifier for malware triage. J. Comput. Virol. 9(3), 109–123 (2013)Google Scholar
  17. 17.
    MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)Google Scholar
  18. 18.
    Maloof, M.A.: Machine Learning and Data Mining for Computer Security: Methods and Applications. Springer, Berlin (2006)CrossRefGoogle Scholar
  19. 19.
    Ming, X., et al.: A similarity metric method of obfuscated malware using function-call graph. J. Comput. Virol. Hacking Tech. 9(1), 35–47 (2013)CrossRefGoogle Scholar
  20. 20.
    MITRE: Malware attribute enumeration and characterization. http://maec.mitre.org (2013)
  21. 21.
    Moore, A.W.: \(K\)-Means and hierarchical clustering. http://www.autonlab.org/tutorials/kmeans11.pdf (2001)
  22. 22.
    Nappa, A., Zubair Rafique, M., Caballero, J.: Driving in the cloud: an analysis of drive-by download operations and abuse reporting of viruses. In: Proceedings of the 10th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (2013)Google Scholar
  23. 23.
    Park, Y., Reeves, D.S., Stamp, M.: Deriving common malware behavior through graph clustering. Comput. Secur. 39(B), 419–430 (2013)CrossRefGoogle Scholar
  24. 24.
    Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)CrossRefGoogle Scholar
  25. 25.
    Runwal, N., Low, R., Stamp, M.: Opcode graph similarity and metamorphic detection. J. Comput. Virol. 8, 37–52 (2012)CrossRefGoogle Scholar
  26. 26.
    Saleh, M., Mohamed, A., Nabi, A.: Eigenviruses for metamorphic virus recognition. IET Inf. Secur. 5(4), 191–198 (2011)CrossRefGoogle Scholar
  27. 27.
    Skulason, F., Solomon, A., Bontchev, V.: CARO naming scheme. http://www.caro.org/naming/scheme.html (1991)
  28. 28.
    Sorokin, I.: Comparing files using structural entropy. J. Comput. Virol. 7(4), 259–265 (2011)MathSciNetCrossRefGoogle Scholar
  29. 29.
    Sridhara, S.M., Stamp, M.: Metamorphic worm that carries its own morphing engine. J. Comput. Virol. Hacking Tech. 9(2), 49–58 (2013)CrossRefGoogle Scholar
  30. 30.
    Stamp, M.: A revealing introduction to hidden Markov models. http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf (2012)
  31. 31.
    Swimmer, M.: Response to the proposal for a “C virus” database. ACM SIGSAC Review, vol. 8, pp. 1–5. http://www.odysci.com/article/1010112993890087 (1990)
  32. 32.
  33. 33.
    Symantec Security Response: Trojan.Zeroaccess. http://www.symantec.com/security_response/writeup.jsp?docid=2011-071314-0410-99 (2011)
  34. 34.
    Virus Removal Services: Beware of FAKE antivirus—Winwebsec. http://virus.myfirstattempt.com/2012/11/beware-of-fake-anti-virus-winwebsec.html (2012)
  35. 35.
    VX Heavens. http://vx.netlux.org/ (2013)
  36. 36.
    Wong, W., Stamp, M.: Hunting for metamorphic engines. J. Comput. Virol. 2(3), 211–229 (2006)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag France 2014

Authors and Affiliations

  • Chinmayee Annachhatre
    • 1
  • Thomas H. Austin
    • 1
  • Mark Stamp
    • 1
  1. 1.Department of Computer ScienceSan Jose State UniversitySan JoseUSA

Personalised recommendations