A scalable multi-level feature extraction technique to detect malicious executables

Masud, Mohammad M.; Khan, Latifur; Thuraisingham, Bhavani

doi:10.1007/s10796-007-9054-3

Published: 23 October 2007

Information Systems Frontiers, Volume 10, pages 33–45 (2008)

Mohammad M. Masud¹, Latifur Khan² & Bhavani Thuraisingham²

Abstract

We present a scalable, multi-level feature extraction technique to detect malicious executables. We propose a novel combination of three kinds of features at different levels of abstraction: binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls, extracted from binary executables, disassembled executables, and executable headers, respectively. We also propose an efficient and scalable feature extraction technique and apply it to a large corpus of real benign and malicious executables. These features are extracted from the corpus data and a classifier is trained, which achieves high accuracy and a low false positive rate in detecting malicious executables. Our approach is knowledge-based for several reasons. First, we apply the knowledge obtained from the binary n-gram features to extract assembly instruction sequences using our Assembly Feature Retrieval algorithm. Second, we apply the statistical knowledge obtained during feature extraction to select the best features and to build a classification model. Our model is compared against other feature-based approaches for malicious code detection and is found to be more efficient in terms of detection accuracy and false alarm rate.



Acknowledgment

The work reported in this paper is supported by AFOSR under contract FA9550-06-1-0045 and by the Texas Enterprise Funds. We thank Dr. Robert Herklotz of AFOSR and Prof. Robert Helms, Dean of the School of Engineering at the University of Texas at Dallas, for funding this research.

Author information

Authors and Affiliations

Department of Computer Science, The University of Texas at Dallas, 2700 Waterview Pkwy, #5116, Richardson, TX, 75080, USA
Mohammad M. Masud
Department of Computer Science, The University of Texas at Dallas, Box 830688, EC 31, Richardson, TX, 75083-0688, USA
Latifur Khan & Bhavani Thuraisingham


Corresponding author

Correspondence to Bhavani Thuraisingham.

Appendices

Appendix A

Here we illustrate an example run of the Assembly Feature Retrieval (AFR) algorithm. The algorithm scans each hexdump file by sliding a window of n bytes and checking each n-gram against the binary feature set (BFS). When a match is found, we collect the corresponding assembly instruction sequence, i.e., the one at the same offset address, from the assembly program file. In this way, we collect all possible instruction sequences for every feature in BFS. We then select the best sequence for each feature using information gain (IG).

Example-III: Table 8 shows the assembly sequences collected for the n-gram “00005068” together with their IG values. This n-gram has 90 occurrences across all hexdump files; only 5 of them are shown for brevity. The bolded portion of the op-code in Table 8 represents the n-gram. Under the Most Distinguishing Instruction Sequence (MDIS) heuristic, sequence #29 attains the highest information gain and is therefore selected as the DAF of the n-gram. In this way, we select one DAF per binary n-gram and return all DAFs.

Table 8 Assembly code sequence for binary 4-gram “00005068”
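
The scan-and-select procedure described in this appendix can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`collect_sequences`, `select_best`), the in-memory representation of a disassembly as `(offset, size, text)` tuples, the rule that an n-gram maps to every instruction whose bytes overlap the window, and the toy byte strings and labels are all hypothetical. The MDIS heuristic in the paper may include details (for example tie-breaking) that are not modeled here.

```python
from collections import defaultdict
from math import log2


def entropy(pos, neg):
    """Entropy of a set holding `pos` malicious and `neg` benign items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(present, labels):
    """IG of a boolean feature; present[i] and labels[i] describe file i."""
    n = len(labels)
    pos = sum(labels)
    gain = entropy(pos, n - pos)
    for value in (True, False):
        subset = [lab for has, lab in zip(present, labels) if has == value]
        if subset:
            s_pos = sum(subset)
            gain -= len(subset) / n * entropy(s_pos, len(subset) - s_pos)
    return gain


def collect_sequences(executables, bfs, n):
    """Slide an n-byte window over each file. On a BFS hit, record the
    instructions whose bytes overlap the window (assumed mapping rule)."""
    hits = defaultdict(lambda: defaultdict(set))  # n-gram -> sequence -> file ids
    for file_id, (raw, disasm) in enumerate(executables):
        for start in range(len(raw) - n + 1):
            gram = raw[start:start + n]
            if gram not in bfs:
                continue
            seq = tuple(
                text for offset, size, text in disasm
                if offset < start + n and offset + size > start
            )
            if seq:
                hits[gram][seq].add(file_id)
    return hits


def select_best(hits, labels):
    """For each n-gram, keep the candidate sequence with the highest IG."""
    best = {}
    for gram, candidates in hits.items():
        def gain(seq):
            files = candidates[seq]
            return information_gain([i in files for i in range(len(labels))], labels)
        best[gram] = max(candidates, key=gain)
    return best


# Toy corpus (hypothetical): two malicious files, two benign files.
mal = (bytes.fromhex("b801000000506810204000"),
       [(0, 5, "mov eax, 1"), (5, 1, "push eax"), (6, 5, "push 0x402010")])
ben = (bytes.fromhex("a100300000506800104000"),
       [(0, 5, "mov eax, [0x3000]"), (5, 1, "push eax"), (6, 5, "push 0x401000")])
other = (bytes.fromhex("9090c3"), [(0, 1, "nop"), (1, 1, "nop"), (2, 1, "ret")])

executables = [mal, mal, ben, other]
labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign
gram = bytes.fromhex("00005068")

hits = collect_sequences(executables, {gram}, n=4)
best = select_best(hits, labels)
print(best[gram])
```

In this toy corpus the n-gram occurs in three files with two distinct instruction sequences. The sequence that appears only in the malicious files separates the classes perfectly, so it receives the higher information gain and is the one kept for the n-gram.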


Appendix B

Here we summarize the time and space complexities of our algorithms in Table 9.

Table 9 Time and space complexities of different algorithms


B is the total size of the training set in bytes, C is the average number of assembly sequences found per binary n-gram, K is the maximum number of nodes in the AVL tree (i.e., the threshold), N is the total number of n-grams collected, n is the size of each n-gram in bytes, and S is the total number of selected n-grams. The worst case assumes B > N and SC > K.

About this article

Cite this article

Masud, M.M., Khan, L. & Thuraisingham, B. A scalable multi-level feature extraction technique to detect malicious executables. Inf Syst Front 10, 33–45 (2008). https://doi.org/10.1007/s10796-007-9054-3

Published: 23 October 2007
Issue Date: March 2008
DOI: https://doi.org/10.1007/s10796-007-9054-3
