Unknown malcode detection and the imbalance problem

  • Robert Moskovitch
  • Dima Stopel
  • Clint Feher
  • Nir Nissim
  • Nathalie Japkowicz
  • Yuval Elovici
Original Paper


The recent growth in network usage has motivated the creation of new malicious code for various purposes. Today’s signature-based antivirus tools are very accurate for known malicious code, but cannot detect new malicious code. Recently, classification algorithms have been used successfully to detect unknown malicious code. However, these studies involved test collections of limited size and used the same malicious-to-benign file ratio in both the training and test sets, a situation that does not reflect real-life conditions. We present a methodology for the detection of unknown malicious code that applies concepts from text categorization, based on n-gram extraction from the binary code and feature selection. We performed an extensive evaluation on a test collection of more than 30,000 files, in which we investigated the class imbalance problem. In real-life scenarios, the proportion of malicious files is expected to be low, about 10% of all files. For practical purposes, it is unclear what the corresponding percentage in the training set should be. Our results indicate that greater than 95% accuracy can be achieved using a training set in which malicious files make up less than 33.3%.
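
As a rough illustration of the ingredients named in the abstract, the sketch below extracts byte n-grams from binary content, ranks them by document frequency, and draws a training set with a chosen malicious percentage. It is not the authors' implementation: the function names, the default n-gram length, the normalization by the most frequent n-gram, and the top-k selection rule are all assumptions made for the example.

```python
from collections import Counter
import random


def byte_ngrams(data: bytes, n: int = 5) -> Counter:
    """Count every overlapping n-byte window in a binary blob."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def term_frequencies(counts: Counter) -> dict:
    """Scale counts by the most frequent n-gram in the file (assumed scheme)."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {gram: c / peak for gram, c in counts.items()}


def top_by_document_frequency(files: list, n: int, k: int) -> list:
    """Keep the k n-grams that occur in the largest number of files."""
    df = Counter()
    for data in files:
        # set() so each file contributes at most once per n-gram
        df.update(set(byte_ngrams(data, n)))
    # Sort by descending document frequency, ties broken by byte order
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:k]]


def sample_with_malicious_fraction(malicious, benign, fraction, size, seed=0):
    """Draw `size` files so that about `fraction` of them are malicious."""
    rng = random.Random(seed)
    n_mal = round(size * fraction)
    n_ben = size - n_mal
    return rng.sample(malicious, n_mal), rng.sample(benign, n_ben)
```

Varying `fraction` while holding the test set at roughly 10% malicious files is one way to probe the question the abstract raises about training-set composition.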


Keywords: Support Vector Machine; Feature Selection; True Positive Rate; Term Frequency; Document Frequency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag France 2009

Authors and Affiliations

  • Robert Moskovitch (1)
  • Dima Stopel (1)
  • Clint Feher (1)
  • Nir Nissim (1)
  • Nathalie Japkowicz (2)
  • Yuval Elovici (1)

  1. Deutsche Telekom Laboratories, Department of Information Systems Engineering, Ben Gurion University, Be’er Sheva, Israel
  2. School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada
