Journal in Computer Virology, Volume 2, Issue 3, pp 231–239

N-gram analysis for computer virus detection

  • D Krishna Sandeep Reddy
  • Arun K Pujari
Original Paper


Generic computer virus detection is the need of the hour, as most commercial antivirus software fails to detect new and unknown viruses. Motivated by the success of data mining and machine learning techniques in intrusion detection systems, recent research on detecting malicious executables is directed towards devising efficient non-signature-based techniques that can profile program characteristics from a set of training examples. Byte sequences and byte n-grams are considered to be the basis of feature extraction. Because the number of distinct n-grams is very large, several feature selection methods have been proposed in the literature. A recent report on the use of information-gain-based feature selection yielded the best-known result in classifying malicious executables from benign ones. We observe that information gain models the presence of an n-gram in one class and its absence in the other. Through a simple example we show that this may lead to erroneous results. In this paper, we describe a new feature selection measure, the class-wise document frequency of byte n-grams. We empirically demonstrate that the proposed measure is a better method for feature selection. For detection, instead of relying on any single classifier, we combine several classifiers using Dempster-Shafer theory for better classification accuracy. Our experimental results show that such a scheme detects virus programs far more efficiently than earlier known methods.
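The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact definitions, so the sketch assumes that class-wise document frequency means the number of files of one class that contain a given n-gram, and that features are chosen as the union of the top k n-grams of each class. The combination step is the standard Dempster rule restricted to the two-class frame {virus, benign}. All function names, the `THETA` label, and the parameters `n` and `k` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set

THETA = "either"  # the whole frame {virus, benign}: mass left uncommitted


def byte_ngrams(data: bytes, n: int) -> Set[bytes]:
    """Distinct byte n-grams in one file (empty if the file is shorter than n)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def classwise_df(files: Iterable[bytes], n: int) -> Counter:
    """For one class, count how many files contain each n-gram."""
    df: Counter = Counter()
    for data in files:
        df.update(byte_ngrams(data, n))  # a set, so each file counts once
    return df


def select_features(virus_files, benign_files, n: int, k: int) -> Set[bytes]:
    """ASSUMED rule: union of the k highest-frequency n-grams of each class."""
    selected: Set[bytes] = set()
    for files in (virus_files, benign_files):
        selected.update(g for g, _ in classwise_df(files, n).most_common(k))
    return selected


def dempster_combine(m1: Dict[str, float], m2: Dict[str, float]) -> Dict[str, float]:
    """Dempster's rule for two mass functions over 'virus', 'benign', THETA."""
    # Conflict: mass where the two sources commit to opposite classes.
    conflict = m1["virus"] * m2["benign"] + m1["benign"] * m2["virus"]
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    norm = 1.0 - conflict
    combined = {}
    for label in ("virus", "benign"):
        # Products whose set intersection is exactly {label}.
        agree = (m1[label] * m2[label]
                 + m1[label] * m2[THETA]
                 + m1[THETA] * m2[label])
        combined[label] = agree / norm
    combined[THETA] = m1[THETA] * m2[THETA] / norm
    return combined
```

In this sketch each classifier's output would first have to be turned into a mass function (for example, some mass on `"virus"`, some on `"benign"`, and the remainder on `THETA` to express uncertainty); how the paper derives those masses is not stated in the abstract.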


Keywords: Feature Selection, Information Gain, Intrusion Detection System, Belief Function, Malicious Code
(These keywords were added by machine and not by the authors.)





Copyright information

© Springer-Verlag France 2006

Authors and Affiliations

  1. Artificial Intelligence Lab, University of Hyderabad, Hyderabad, India
