Abstract
In the face of Internet attacks, malware classification is a promising approach in intrusion detection and digital forensics. In previous work, researchers performed dynamic analysis, or static analysis after reverse engineering. However, malware developers use anti-virtual-machine (VM) and obfuscation techniques to evade malware classifiers. By deploying honeypots, malware source code can be collected and analyzed. Source code analysis provides a better classification for forensics and for understanding the purpose of attackers. In this paper, a novel classification approach is proposed, based on content similarity and directory structure similarity. Such a classification avoids re-analyzing known malware and frees resources for new malware. Malware classification also lets network administrators know the purpose of attackers. The experimental results demonstrate that the proposed system classifies malware efficiently with a small misclassification ratio and that its performance is better than that of VirusTotal.
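The abstract names two signals, content similarity and directory structure similarity, but does not say how either is measured. As an illustration only, the following sketch assumes Jaccard similarity over token shingles for content and Jaccard similarity over relative file paths for directory structure, combined with a hypothetical weight `w_content`. All function names and parameters are assumptions made for this example, not the authors' method.

```python
import os
import re


def token_shingles(text, k=3):
    """Return the set of k-token shingles of a source text (assumed content feature)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def relative_paths(root):
    """Collect file paths relative to root (assumed directory-structure feature)."""
    paths = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            paths.add(os.path.relpath(full, root).replace(os.sep, "/"))
    return paths


def read_all_source(root):
    """Concatenate every file under root, ignoring undecodable bytes."""
    chunks = []
    for rel in sorted(relative_paths(root)):
        with open(os.path.join(root, rel), encoding="utf-8", errors="ignore") as fh:
            chunks.append(fh.read())
    return "\n".join(chunks)


def sample_similarity(root_a, root_b, w_content=0.5):
    """Weighted mix of content and structure similarity (weight is hypothetical)."""
    content = jaccard(token_shingles(read_all_source(root_a)),
                      token_shingles(read_all_source(root_b)))
    structure = jaccard(relative_paths(root_a), relative_paths(root_b))
    return w_content * content + (1 - w_content) * structure
```

Under these assumptions, a newly collected sample whose `sample_similarity` to a known sample exceeds some chosen threshold could be assigned to that sample's family rather than re-analyzed. The threshold and the weighting would need to be tuned on real data.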
Additional information
Foundation item: the Project of the Ministry of Science and Technology, Taiwan, China (Nos. NSC 100-2218-E-110-004-MY3 and NSC 100-2218-E-110-011)
About this article
Cite this article
Chia-mei, C., Gu-hsin, L. Research on classification of malware source code. J. Shanghai Jiaotong Univ. (Sci.) 19, 425–430 (2014). https://doi.org/10.1007/s12204-014-1519-1
DOI: https://doi.org/10.1007/s12204-014-1519-1