Information Systems Frontiers, Volume 10, Issue 1, pp 33–45

A scalable multi-level feature extraction technique to detect malicious executables

  • Mohammad M. Masud
  • Latifur Khan
  • Bhavani Thuraisingham (Email author)


We present a scalable, multi-level feature extraction technique for detecting malicious executables. We propose a novel combination of three kinds of features at different levels of abstraction: binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique and apply it to a large corpus of real benign and malicious executables. These features are extracted from the corpus and used to train a classifier, which achieves high accuracy and a low false positive rate in detecting malicious executables. Our approach is knowledge-based for two reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features and to build a classification model. We compare our model against other feature-based approaches for malicious code detection and find it to be more efficient in terms of detection accuracy and false alarm rate.
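As a rough illustration of the lowest-level feature type, the sketch below collects binary n-grams from raw byte strings and ranks them by how well their presence separates malicious from benign samples. The abstract does not name the selection statistic, so information gain is used here purely as an assumption, and the function names (`byte_ngrams`, `entropy`, `top_ngrams_by_information_gain`) are hypothetical. It holds everything in memory and so does not reflect the scalable extraction technique the paper proposes.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> set:
    """Return the set of distinct n-byte sequences occurring in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a group with pos malicious and neg benign samples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def top_ngrams_by_information_gain(samples, n, k):
    """samples: list of (bytes, is_malicious) pairs.

    Returns the k n-grams whose presence/absence yields the highest
    information gain (an assumed criterion, not taken from the abstract).
    """
    total_pos = sum(1 for _, label in samples if label)
    total_neg = len(samples) - total_pos
    base = entropy(total_pos, total_neg)

    # Count, per n-gram, how many malicious / benign samples contain it.
    present_pos, present_neg = Counter(), Counter()
    for data, label in samples:
        target = present_pos if label else present_neg
        target.update(byte_ngrams(data, n))

    gains = {}
    for gram in set(present_pos) | set(present_neg):
        p1, n1 = present_pos[gram], present_neg[gram]  # samples containing gram
        p0, n0 = total_pos - p1, total_neg - n1        # samples lacking gram
        w1 = (p1 + n1) / len(samples)
        w0 = (p0 + n0) / len(samples)
        gains[gram] = base - w1 * entropy(p1, n1) - w0 * entropy(p0, n0)

    # Highest gain first; ties broken by byte value for a deterministic order.
    return sorted(gains, key=lambda g: (-gains[g], g))[:k]
```

Each selected n-gram would then become one binary feature (present or absent) in the vector passed to a classifier.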


Disassembly; Feature extraction; Malicious executable; n-gram analysis



The work reported in this paper is supported by AFOSR under contract FA9550-06-1-0045 and by the Texas Enterprise Funds. We thank Dr. Robert Herklotz of AFOSR and Prof. Robert Helms, Dean of the School of Engineering at the University of Texas at Dallas, for funding this research.



Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Mohammad M. Masud (1)
  • Latifur Khan (2)
  • Bhavani Thuraisingham (2) (Email author)

  1. Department of Computer Science, The University of Texas at Dallas, Richardson, USA
  2. Department of Computer Science, The University of Texas at Dallas, Richardson, USA
