A scalable multi-level feature extraction technique to detect malicious executables

Information Systems Frontiers

Abstract

We present a scalable, multi-level feature extraction technique to detect malicious executables. We propose a novel combination of three kinds of features at different levels of abstraction: binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique and apply it to a large corpus of real benign and malicious executables. These features are extracted from the corpus and used to train a classifier, which achieves high accuracy and a low false positive rate in detecting malicious executables. Our approach is knowledge-based for two reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval (AFR) algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features and to build a classification model. We compare our model against other feature-based approaches for malicious code detection and find that it achieves higher detection accuracy and a lower false alarm rate.

Acknowledgment

The work reported in this paper is supported by AFOSR under contract FA9550-06-1-0045 and by the Texas Enterprise Funds. We thank Dr. Robert Herklotz of AFOSR and Prof. Robert Helms, Dean of the School of Engineering at the University of Texas at Dallas, for funding this research.

Author information

Corresponding author

Correspondence to Bhavani Thuraisingham.

Appendices

Appendix A

Here we illustrate an example run of the Assembly Feature Retrieval (AFR) algorithm. The algorithm scans each hexdump file, sliding a window of n bytes and checking each n-gram against the binary feature set (BFS). When a match is found, we collect the corresponding assembly instruction sequence, i.e., the one at the same offset address in the assembly program file. In this way, we collect all possible instruction sequences for every feature in the BFS. We then select the best sequence for each n-gram using information gain (IG).

Example-III: Table 8 shows an example of the collected assembly sequences and their IG values for the n-gram “00005068.” This n-gram has 90 occurrences across all hexdump files; only 5 of them are shown for brevity. The bolded portion of the op-code in Table 8 represents the n-gram. According to the Most Distinguishing Instruction Sequence (MDIS) heuristic, sequence #29 attains the highest information gain and is therefore selected as the derived assembly feature (DAF) of the n-gram. In this way, we select one DAF per binary n-gram and return all DAFs.

Table 8 Assembly code sequence for binary 4-gram “00005068”
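
The matching and selection steps described in this appendix can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function names (`retrieve_assembly_features`, `covering_sequence`, `information_gain`), the `(offset, length, text)` instruction layout, and the toy listings are all hypothetical. We also assume that the "corresponding" sequence consists of the instructions whose bytes overlap the matched n-gram, and that information gain is computed on whether a sequence occurs in each file.

```python
import math
from collections import defaultdict


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos/neg class counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(present, labels):
    """IG of a binary feature; present[i] says whether file i has it."""
    pos = sum(labels)
    gain = entropy(pos, len(labels) - pos)
    for value in (True, False):
        subset = [y for has, y in zip(present, labels) if has == value]
        if subset:
            p = sum(subset)
            gain -= len(subset) / len(labels) * entropy(p, len(subset) - p)
    return gain


def covering_sequence(instructions, start, n):
    """Assumption: the sequence is every instruction overlapping [start, start + n)."""
    return tuple(text for off, length, text in instructions
                 if off < start + n and off + length > start)


def retrieve_assembly_features(files, bfs, n):
    """files: list of (raw_bytes, instructions, label); bfs: set of n-byte n-grams.

    Returns a dict mapping each matched n-gram to its highest-IG sequence.
    """
    labels = [label for _, _, label in files]
    # n-gram -> sequence -> indices of the files in which that sequence occurs
    seen = defaultdict(lambda: defaultdict(set))
    for idx, (raw, instructions, _) in enumerate(files):
        for start in range(len(raw) - n + 1):      # slide an n-byte window
            gram = raw[start:start + n]
            if gram in bfs:                         # match against the BFS
                seq = covering_sequence(instructions, start, n)
                if seq:
                    seen[gram][seq].add(idx)

    best = {}
    for gram, sequences in seen.items():
        def gain(seq):
            present = [i in sequences[seq] for i in range(len(files))]
            return information_gain(present, labels)
        best[gram] = max(sequences, key=gain)       # keep one sequence per n-gram
    return best


# Toy, hand-written listings (not real disassembler output); label 1 = malicious.
gram = bytes.fromhex("00005068")
files = [
    (bytes.fromhex("ff00005068"), [(0, 3, "mov"), (3, 2, "push")], 1),
    (bytes.fromhex("ff00005068"), [(0, 3, "mov"), (3, 2, "push")], 1),
    (bytes.fromhex("ff00005068"), [(0, 5, "call")], 0),
    (bytes.fromhex("ffffffffff"), [(0, 5, "call")], 0),
]
best = retrieve_assembly_features(files, {gram}, n=4)
print(best[gram])
```

In this toy corpus the n-gram maps to two candidate sequences, and the one that occurs only in the malicious files separates the classes perfectly, so it is the one retained. A real corpus would need the disassembler's offsets to be aligned with the hexdump offsets, which this sketch simply takes for granted.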

Appendix B

Here we summarize the time and space complexities of our algorithms in Table 9.

Table 9 Time and space complexities of different algorithms

B is the total size of the training set in bytes, C is the average number of assembly sequences found per binary n-gram, K is the maximum number of nodes in the AVL tree (i.e., the threshold), N is the total number of n-grams collected, n is the size of each n-gram in bytes, and S is the total number of selected n-grams. The worst-case analysis assumes B > N and SC > K.

About this article

Cite this article

Masud, M.M., Khan, L. & Thuraisingham, B. A scalable multi-level feature extraction technique to detect malicious executables. Inf Syst Front 10, 33–45 (2008). https://doi.org/10.1007/s10796-007-9054-3
