Abstract
Research on malicious code classification helps researchers understand attack characteristics quickly and helps reduce losses to users and even to states. Most current malware classification methods are based on supervised learning algorithms, which perform poorly when only a small number of labeled samples is available. In this paper, we therefore propose a new malware classification method based on a semi-supervised learning algorithm. First, we extract impactful static and dynamic features and serialize them to obtain high-dimensional features. Then, we select features with an Ensemble Feature Grader consisting of Information Gain, Random Forest, and Logistic Regression with \(L_1\) and \(L_2\) regularization, and reduce the dimension further with PCA. Finally, we use the Learning with Local and Global Consistency (LLGC) algorithm combined with K-means to classify malware. Experiments comparing SVM, LLGC, and K-means + LLGC show that, with this feature extraction, feature reduction, and classification method, K-means + LLGC outperforms LLGC in both classification accuracy and efficiency: accuracy increases by 2% to 3%, and it exceeds that of SVM when the number of labeled samples is small.
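To make the final classification step concrete, the following is a minimal sketch of plain LLGC label propagation as published by Zhou et al. (2004), written in dependency-free Python. It is not the authors' implementation: the abstract does not say how K-means is combined with LLGC, so that step is left out. The function name `llgc`, the parameters `sigma`, `alpha`, and `n_iter`, and the toy data are assumptions made for illustration only.

```python
import math


def llgc(points, labels, n_classes, sigma=1.0, alpha=0.99, n_iter=200):
    """Plain LLGC label propagation (Zhou et al., 2004).

    points: list of feature vectors (hypothetical, already reduced features).
    labels: class index for labeled samples, None for unlabeled ones.
    Returns one predicted class index per sample.
    """
    n = len(points)

    # Affinity matrix W: Gaussian kernel on squared distance, zero diagonal.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))

    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}, D = diag of row sums.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in w]
    s = [[inv_sqrt_deg[i] * w[i][j] * inv_sqrt_deg[j] for j in range(n)]
         for i in range(n)]

    # Initial label matrix Y: one-hot rows for labeled samples, zeros otherwise.
    y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]

    # Iterate F <- alpha * S * F + (1 - alpha) * Y, starting from F = Y.
    f = [row[:] for row in y]
    for _ in range(n_iter):
        f = [[alpha * sum(s[i][k] * f[k][c] for k in range(n))
              + (1.0 - alpha) * y[i][c]
              for c in range(n_classes)]
             for i in range(n)]

    # Each sample takes the class with the largest score in its row of F.
    return [max(range(n_classes), key=lambda c: f[i][c]) for i in range(n)]


# Toy data (assumed): two well-separated groups, one labeled sample in each.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
toy_labels = [0, None, None, None, 1, None, None, None]
predicted = llgc(toy_points, toy_labels, n_classes=2)
```

On this toy input each unlabeled point receives the label of the single labeled point in its own group, which is the small-labeled-set situation the abstract targets. The iteration here has the closed-form limit \((1-\alpha)(I-\alpha S)^{-1}Y\), which a matrix library could compute directly.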
Acknowledgments
This work was supported by the National Key R&D Program of China (Grant No. 2016YFB0801304).
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
He, T., Xue, J., Fu, J., Wang, Y., Shan, C. (2017). Research on Malicious Code Analysis Method Based on Semi-supervised Learning. In: Xu, M., Qin, Z., Yan, F., Fu, S. (eds) Trusted Computing and Information Security. CTCIS 2017. Communications in Computer and Information Science, vol 704. Springer, Singapore. https://doi.org/10.1007/978-981-10-7080-8_17
DOI: https://doi.org/10.1007/978-981-10-7080-8_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7079-2
Online ISBN: 978-981-10-7080-8
eBook Packages: Computer Science (R0)