Unknown Malcode Detection Using OPCODE Representation

  • Robert Moskovitch
  • Clint Feher
  • Nir Tzachar
  • Eugene Berger
  • Marina Gitelman
  • Shlomi Dolev
  • Yuval Elovici
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5376)

Abstract

The recent growth in network usage has motivated the creation of new malicious code for various purposes, including economic ones. Today’s signature-based anti-virus tools are very accurate but cannot detect new malicious code. Recently, classification algorithms have been employed successfully to detect unknown malicious code. However, most studies represent the binary code of the executables as byte-sequence n-grams. We propose instead the use of operation codes (OpCodes), generated by disassembling the executables, and use n-grams of the OpCodes as features for the classification process. We present a full methodology for the detection of unknown malicious code, based on text categorization concepts. We performed an extensive evaluation on a test collection of more than 30,000 files, in which we evaluated the OpCode n-gram representation and investigated the imbalance problem, referring to real-life scenarios in which malicious files are expected to make up about 10% of the total files. Our results indicate that an accuracy greater than 99% can be achieved using a training set with a malicious file percentage lower than 15%, an accuracy higher than in our previous experience with byte-sequence n-gram representation [1].
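As a rough illustration of the OpCode n-gram representation, the sketch below turns a disassembled OpCode sequence into n-grams and weights them by a normalised term frequency, in the spirit of text categorization. The function names, the sample sequence, and the choice of weighting (count divided by the count of the most frequent n-gram in the file) are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return every contiguous n-gram (as a tuple) in an OpCode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def term_frequencies(opcodes, n):
    """Map each n-gram to its count divided by the largest count in the file.

    This normalisation is an assumed, commonly used text-categorization
    weighting, not necessarily the one used in the paper.
    """
    counts = Counter(opcode_ngrams(opcodes, n))
    if not counts:
        return {}
    top = max(counts.values())
    return {gram: count / top for gram, count in counts.items()}


# Hypothetical OpCode sequence, as a disassembler might emit for one file.
sample = ["push", "mov", "push", "mov", "call", "pop", "ret"]
tf = term_frequencies(sample, 2)
```

Each file then becomes a vector over a shared vocabulary of n-grams, which a standard classifier can consume in the same way as a bag-of-words document.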

Keywords

Malicious Code Detection, OpCode, Classification

References

  1. Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Elovici, Y.: Unknown Malcode Detection via Text Categorization and the Imbalance Problem. In: IEEE Intelligence and Security Informatics, Taiwan (2008)
  2. Gryaznov, D.: Scanners of the Year 2000: Heuristics. In: The 5th International Virus Bulletin (1999)
  3. Shin, S., Jung, J., Balakrishnan, H.: Malware Prevalence in the KaZaA File-Sharing Network. In: Internet Measurement Conference (IMC), Brazil (October 2006)
  4. Schultz, M., Eskin, E., Zadok, E., Stolfo, S.: Data mining methods for detection of new malicious executables. In: Proceedings of the IEEE Symposium on Security and Privacy (2001)
  5. Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R.: N-gram Based Detection of New Malicious Code. In: Proceedings of the 28th Annual International Computer Software and Applications Conference, COMPSAC 2004 (2004)
  6. Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 470–478. ACM Press, New York (2004)
  7. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
  8. Kolter, J., Maloof, M.: Learning to Detect and Classify Malicious Executables in the Wild. Journal of Machine Learning Research 7, 2721–2744 (2006)
  9. Henchiri, O., Japkowicz, N.: A Feature Selection and Evaluation Scheme for Computer Virus Detection. In: Proceedings of ICDM 2006, Hong Kong, pp. 891–895 (2006)
  10. Dolev, S., Tzachar, N.: Malware signature builder and detection for executable code, patent application
  11. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations Newsletter 6(1), 1–6 (2004)
  12. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communications of the ACM 18, 613–620 (1975)
  13. Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., Bloomfield, C., Lander, E.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
  14. Bishop, C.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford (1995)
  15. Quinlan, J.R.: C4.5: Programs for machine learning. Morgan Kaufmann Publishers, Inc., San Francisco (1993)
  16. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29, 103–130 (1997)
  17. Freund, Y., Schapire, R.E.: A brief introduction to boosting. In: International Joint Conference on Artificial Intelligence (1999)
  18. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann Publishers, Inc., San Francisco (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Robert Moskovitch (1)
  • Clint Feher (1)
  • Nir Tzachar (1)
  • Eugene Berger (1)
  • Marina Gitelman (1)
  • Shlomi Dolev (1)
  • Yuval Elovici (1)

  1. Deutsche Telekom Laboratories at Ben Gurion University, Ben Gurion University, Be’er Sheva, Israel
