VILO: a rapid learning nearest-neighbor classifier for malware triage

  • Arun Lakhotia
  • Andrew Walenstein
  • Craig Miles
  • Anshuman Singh
Original Paper


VILO is a lazy learner system designed for malware classification and triage. It implements a nearest neighbor (NN) algorithm with similarities computed over Term Frequency \(\times\) Inverse Document Frequency (TFIDF) weighted opcode mnemonic permutation features (N-perms). Being an NN-classifier, VILO makes minimal structural assumptions about class boundaries, and thus is well suited for the constantly changing malware population. This paper presents an extensive study of the application of VILO in malware analysis. Our experiments demonstrate that (a) VILO is a rapid learner of malware families, i.e., VILO’s learning curve stabilizes at high accuracies quickly (training on fewer than 20 variants per family is sufficient); (b) similarity scores derived from TFIDF weighted features should primarily be treated as ordinal measurements; and (c) VILO with N-perm feature vectors outperforms traditional N-gram feature vectors when used to classify real-world malware into their respective families.
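The pipeline described in the abstract can be illustrated with a minimal sketch. It is not VILO's implementation: the function names, the smoothed IDF formula, the use of cosine similarity, and the toy opcode sequences are all assumptions made for illustration. An N-perm is taken here to be an N-gram of opcode mnemonics with the order inside the window ignored.

```python
import math
from collections import Counter


def n_perms(opcodes, n=2):
    """Order-insensitive n-grams: each window of n mnemonics is sorted."""
    return [tuple(sorted(opcodes[i:i + n])) for i in range(len(opcodes) - n + 1)]


def fit_idf(train_docs, n=2):
    """Count N-perms per training sample and derive a smoothed IDF (assumed formula)."""
    counts = [Counter(n_perms(doc, n)) for doc in train_docs]
    doc_freq = Counter(feature for c in counts for feature in c)
    total = len(train_docs)
    idf = {f: math.log((1 + total) / (1 + k)) + 1.0 for f, k in doc_freq.items()}
    return counts, idf


def weigh(count, idf):
    """TF x IDF weights; features never seen in training are dropped."""
    return {f: tf * idf[f] for f, tf in count.items() if f in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (an assumed similarity measure)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(query, train_docs, labels, n=2):
    """Return the label and similarity score of the single nearest neighbor."""
    counts, idf = fit_idf(train_docs, n)
    vectors = [weigh(c, idf) for c in counts]
    q = weigh(Counter(n_perms(query, n)), idf)
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores[best]


# Hypothetical toy data: two "families", one training sample each.
train = [
    ["push", "mov", "call", "pop", "ret"],
    ["xor", "add", "jmp", "xor", "add"],
]
labels = ["famA", "famB"]

# A variant of the first sample with adjacent instructions swapped.
label, score = classify(["mov", "push", "call", "ret", "pop"], train, labels)
print(label, round(score, 3))
```

Because the windows are sorted, swapping adjacent instructions leaves some features unchanged, so the reordered variant still lands nearest to the first sample. In line with finding (b), the returned score is best read as a ranking signal rather than as a calibrated measure of how similar two samples are.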



The authors are grateful for Prof. Mihai Giurcanu’s help in identifying proper statistical evaluation methods. Furthermore, we wish to thank Suresh Golconda, Chris Parich, Michael Venable, Matthew Hayes, and Christopher Thompson for their past work, without which this paper would not have been possible.



Copyright information

© Springer-Verlag France 2013

Authors and Affiliations

  • Arun Lakhotia (1)
  • Andrew Walenstein (2)
  • Craig Miles (1)
  • Anshuman Singh (1)
  1. Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, USA
  2. School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, USA
