A scalable multi-level feature extraction technique to detect malicious executables
We present a scalable, multi-level feature extraction technique for detecting malicious executables. We propose a novel combination of three kinds of features at different levels of abstraction: binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique and apply it to a large corpus of real benign and malicious executables. These features are extracted from the corpus, and a classifier is trained on them that achieves high accuracy and a low false positive rate in detecting malicious executables. Our approach is knowledge-based for two reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features and to build a classification model. We compare our model against other feature-based approaches for malicious code detection and find that it performs better in terms of detection accuracy and false alarm rate.
Keywords: Disassembly, Feature extraction, Malicious executable, n-gram analysis
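As a rough illustration of the lowest-level feature described in the abstract, the sketch below extracts binary n-grams from raw byte strings and ranks them with a statistical score. It is not the authors' implementation: the abstract does not name the selection criterion, so information gain is used here as an assumption, and the function names (`byte_ngrams`, `information_gain`, `select_ngrams`), the label encoding (1 for malicious, 0 for benign), and the toy byte strings are all hypothetical.

```python
import math


def byte_ngrams(data: bytes, n: int) -> set:
    """Return the set of distinct n-byte sequences occurring in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Entropy in bits of a two-class split with the given counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(present, labels) -> float:
    """Information gain of a binary feature with respect to 0/1 labels.

    Assumption: the abstract only says "statistical knowledge" is used
    for selection; information gain is one common choice, used here
    purely for illustration.
    """
    total = len(labels)
    gain = entropy(sum(labels), total - sum(labels))
    for value in (True, False):
        subset = [y for x, y in zip(present, labels) if x == value]
        if subset:
            weight = len(subset) / total
            gain -= weight * entropy(sum(subset), len(subset) - sum(subset))
    return gain


def select_ngrams(samples, labels, n=2, top_k=3):
    """Rank every n-gram seen in the samples and keep the top_k by gain."""
    per_sample = [byte_ngrams(s, n) for s in samples]
    vocabulary = set().union(*per_sample)
    scored = []
    for gram in vocabulary:
        # Binary feature: does this n-gram occur in the sample at all?
        present = [gram in grams for grams in per_sample]
        scored.append((information_gain(present, labels), gram))
    # Highest gain first; ties broken by byte order for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [gram for _, gram in scored[:top_k]]


# Toy data (hypothetical): two "malicious" and two "benign" byte strings.
samples = [b"\x90\x90\xcc", b"\x90\x90\x00", b"\x01\x02\x03", b"\x01\x02\x04"]
labels = [1, 1, 0, 0]
best = select_ngrams(samples, labels, n=2, top_k=2)
```

This version holds every distinct n-gram in memory, which is only practical for toy inputs. A corpus-scale extractor would need a more careful data structure or disk-backed processing, which is the scalability concern the abstract refers to.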
The work reported in this paper is supported by AFOSR under contract FA9550-06-1-0045 and by the Texas Enterprise Funds. We thank Dr. Robert Herklotz of AFOSR and Prof. Robert Helms, Dean of the School of Engineering at the University of Texas at Dallas, for funding this research.