Research on Malicious Code Analysis Method Based on Semi-supervised Learning

He, Tingting; Xue, Jingfeng; Fu, Jianwen; Wang, Yong; Shan, Chun

doi:10.1007/978-981-10-7080-8_17

Part of the book series: Communications in Computer and Information Science (CCIS, volume 704)

Included in the following conference series:

Chinese Conference on Trusted Computing and Information Security

Abstract

Research on malicious code classification helps analysts understand attack characteristics quickly and helps reduce losses to users and even to states. Most current malware classification methods are based on supervised learning algorithms, which perform poorly when only a small number of labeled samples is available. In this paper, we therefore propose a new malware classification method based on a semi-supervised learning algorithm. First, we extract impactful static and dynamic features and serialize them to obtain high-dimensional feature vectors. Then, we select features with an Ensemble Feature Grader consisting of Information Gain, Random Forest, and Logistic Regression with \(L_1\) and \(L_2\) regularization, and reduce the dimension further with PCA. Finally, we classify malware using the Learning with Local and Global Consistency (LLGC) algorithm combined with K-means. Experiments comparing SVM, LLGC, and K-means + LLGC show that, with this feature extraction, feature reduction, and classification method, K-means + LLGC outperforms LLGC in both classification accuracy and efficiency, improving accuracy by 2% to 3%, and achieves higher accuracy than SVM when the number of labeled samples is small.
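
For readers unfamiliar with LLGC, the classification step can be illustrated with a minimal sketch of the standard closed-form algorithm of Zhou et al. (2004), in which labels spread from the few labeled samples to unlabeled ones over a similarity graph. This is not the authors' implementation: the abstract does not say how K-means is combined with LLGC, so that step is omitted. The function name `llgc`, the Gaussian kernel width `sigma`, the value `alpha=0.99`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def llgc(X, y, n_classes, alpha=0.99, sigma=1.0):
    """Standard LLGC label propagation (illustrative sketch).

    X: (n, d) feature matrix.
    y: length-n integer array; class index for labeled samples, -1 for unlabeled.
    Returns the predicted class index for every sample.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]

    # Gaussian affinity matrix W with a zero diagonal.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    # Assumes every sample has at least one neighbor with nonzero affinity.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Initial label matrix Y: one-hot rows for labeled samples, zeros otherwise.
    Y = np.zeros((n, n_classes))
    idx = np.flatnonzero(y >= 0)
    Y[idx, y[idx]] = 1.0

    # Closed-form solution F* = (I - alpha * S)^{-1} Y.
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return F.argmax(axis=1)

# Toy data (hypothetical): two well-separated groups, one labeled sample each.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])
y_demo = np.array([0, -1, -1, -1, 1, -1, -1, -1])
pred_demo = llgc(X_demo, y_demo, n_classes=2)
print(pred_demo)
```

In this toy setting each unlabeled point should take the label of the single labeled point in its own group, which is the behavior that makes the approach attractive when labeled malware samples are scarce. The dense solve costs O(n³), so a sketch like this only suits small sample sets.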

Acknowledgments

This work was supported by the National Key R&D Program of China (Grant No. 2016YFB0801304).

Author information

Authors and Affiliations

School of Software, Beijing Institute of Technology, Beijing, China
Tingting He, Jingfeng Xue, Jianwen Fu, Yong Wang & Chun Shan

Corresponding author

Correspondence to Yong Wang.

Editor information

Editors and Affiliations

National University of Defense Technology, Changsha, China
Ming Xu
Hunan University, Changsha, Hunan, China
Zheng Qin
Wuhan University, Wuhan, China
Fei Yan
National University of Defense Technology, Changsha, Hunan, China
Shaojing Fu

About this paper

Cite this paper

He, T., Xue, J., Fu, J., Wang, Y., Shan, C. (2017). Research on Malicious Code Analysis Method Based on Semi-supervised Learning. In: Xu, M., Qin, Z., Yan, F., Fu, S. (eds) Trusted Computing and Information Security. CTCIS 2017. Communications in Computer and Information Science, vol 704. Springer, Singapore. https://doi.org/10.1007/978-981-10-7080-8_17

DOI: https://doi.org/10.1007/978-981-10-7080-8_17
Published: 23 November 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7079-2
Online ISBN: 978-981-10-7080-8
eBook Packages: Computer Science; Computer Science (R0)
