Abstract
Malicious code is increasingly organized into families of related variants, so research on malware homology (the classification of malicious code into families) is of great significance for maintaining network security. To better express the overall characteristics of malicious code and to improve detection and homology analysis, this paper proposes AIHGAT, a method for malware detection and homology analysis based on heterogeneous graphs of assembly instructions. We take the assembly instructions of malicious families as the research object and analyze their importance and correlation across families. Detection and homology analysis proceed in three stages: feature extraction, feature preprocessing, and model construction. In feature extraction, static features are difficult to extract from malicious samples that use countermeasures such as packing and obfuscation; to alleviate this problem, we obtain binary files from dynamic memory through a sandbox and then analyze their assembly instruction sets. In feature preprocessing, we divide the assembly instructions into N-tuples (N-grams) and construct a heterogeneous graph of assembly instructions according to the internal correlation of the gene sequence formed by the assembly N-gram features. Finally, in model construction, we analyze the homology determination performance of traditional graph neural networks and construct ResGAT, a Graph Attention Network with residual connections, to analyze the homology of malicious code. The experimental results show that ResGAT can aggregate the core characteristics of malicious families and enhance the ability to recognize malicious family variants. Our model achieves an accuracy of 98.83%, which is better than traditional machine learning detection methods, and can effectively determine the homology of malicious code families.
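The feature preprocessing stage described above (splitting assembly instructions into N-grams and linking them in a heterogeneous graph) can be illustrated with a minimal sketch. The abstract does not specify how edges are defined, so the two edge types below (N-gram occurrence counts per sample, and co-occurrence of N-grams across samples) are assumptions made purely for illustration, and the names `opcode_ngrams` and `build_hetero_graph` are hypothetical rather than taken from the paper.

```python
from collections import Counter
from itertools import combinations


def opcode_ngrams(opcodes, n=3):
    """Return the contiguous n-grams (as tuples) of an opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def build_hetero_graph(samples, n=3):
    """Build a toy heterogeneous graph with two node types.

    Node types: ("sample", name) and ("ngram", tuple_of_opcodes).
    Edge types (both are illustrative assumptions, not the paper's definition):
      - sample-ngram: weight = number of times the n-gram occurs in the sample
      - ngram-ngram:  weight = number of samples in which both n-grams occur
    """
    sample_ngram = {}
    ngram_ngram = Counter()
    for name, opcodes in samples.items():
        counts = Counter(opcode_ngrams(opcodes, n))
        for gram, count in counts.items():
            sample_ngram[(("sample", name), ("ngram", gram))] = count
        # Sort so each unordered n-gram pair always gets the same key.
        for a, b in combinations(sorted(counts), 2):
            ngram_ngram[(("ngram", a), ("ngram", b))] += 1
    return sample_ngram, dict(ngram_ngram)


# Hypothetical opcode traces for two samples.
samples = {
    "s1": ["push", "mov", "call", "mov", "call"],
    "s2": ["push", "mov", "call", "ret"],
}
sample_edges, ngram_edges = build_hetero_graph(samples, n=2)
```

In this toy setting, N-grams shared by several samples receive heavier N-gram-to-N-gram edges, which is one simple way to encode correlation between features; a graph attention model would then operate on node features attached to such a graph.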
Data availability statements
All data generated or analyzed during this study are included in this article. The dataset that supports the findings of this study is available at virusshare.com.
Acknowledgements
This work was supported by the Double First-Class Innovation Research Project of the People’s Public Security University of China (No. 2023SYL07).
Author information
Contributions
RW contributed to the acquisition and analysis of data, the conception and design of the methodology, writing the original draft, and review and editing. JG contributed to the conception and design of the methodology, supervision, and review. SH contributed to the conception and design of the methodology, supervision, and validation.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Wang, R., Gao, J. & Huang, S. AIHGAT: A novel method of malware detection and homology analysis using assembly instruction heterogeneous graph. Int. J. Inf. Secur. 22, 1423–1443 (2023). https://doi.org/10.1007/s10207-023-00699-7
DOI: https://doi.org/10.1007/s10207-023-00699-7