Abstract
With the development of software security technology, more and more malicious programs use new obfuscation and feature-hiding techniques, so malware detection technology urgently needs to be upgraded. This paper presents a malware detection method based on a sandbox, binary instrumentation and multidimensional feature extraction. We introduce the design and implementation of the sandbox, the feature extractor and the classifier. Finally, we merge multiple models and obtain a well-performing classifier for malware detection.
Keywords
- Multi-dimensional Feature Extraction
- Malware Detection
- Sandbox Mode
- Dynamic Instruction
1 Introduction
With the development of the Internet and information security, malicious programs constantly use new obfuscation and feature-hiding techniques, such as polymorphism, metamorphism, code obfuscation, packing and other means, to disturb static analysis and malware detection.
Static malware detection mainly uses discrete program features, such as PE structure information, strings, byte code or statically disassembled instructions [1,2,3]. Because this method does not execute the program, its accuracy is affected by packing, encryption and junk code, and it needs a combination of various static features [4].
In this work, we present a new malware detection method with multidimensional feature extraction (both static and dynamic) and build a sandbox using Intel Pin [8]. First, we use an IDA Python script and IDA command-line parameters to get the immediates and opcodes from the static disassembly result. Second, we run the program in PinFWSandBox for dynamic execution and collect the system call information. Third, under the protection of the sandbox, we use binary instrumentation to get a snapshot of the dynamic instruction flow. Fourth, we extract static, system call and dynamic instruction flow features. Last, we use the Naive Bayes model on the feature attributes to generate a classifier for each feature, and the classifiers vote for the final classification result.
This paper is organized as follows. Section 2 describes the design and implementation of the sandbox. Section 3 explains the general idea of our multidimensional feature extraction and classification method. Section 4 gives the details of the classification model and our experimental data. Finally, Sect. 5 concludes our work and outlines future work directions.
2 PinFWSandBox
2.1 General Idea
PinFWSandBox is designed to protect the system as far as possible without influencing the normal execution of malicious programs. PinFWSandBox also works well with the Flowwalker [10] recorder module and protects it from harmful behaviors.
We intercept Windows system calls at both entry and exit, obtain the system call number and parameter information of the malicious program, and classify the calls for processing. Depending on how the sandbox handles them, we classify malicious behavior into four categories.
Malicious registry operations. These mainly include: setting auto-startup registry items, setting the executable association for a program suffix, traversing and querying registry keys, maliciously modifying the key values of other programs, etc.
Malicious file system operations. These mainly include: malware self-copying, code self-decryption and release, encrypting data files on disk, modifying system programs and dynamic link library files, etc.
Malicious network operations. These mainly include: starting a backdoor service, listening on a local UDP/TCP port, executing a DDoS attack, connecting to a remote control server, uploading local data and downloading server instructions, etc.
Malicious system operations. These mainly include: creating a mutex, searching for a specified window, traversing disk information and files, changing file attributes, creating a user process, allocating remote memory, modifying IE settings, privilege escalation, restarting or shutting down the computer, terminating other processes, formatting the disk, adding a system account, etc.
2.2 Registry Sandbox Module
The main idea of the registry sandbox module is to record information about opened registry entries (including handle, path, key, value and registry type), log every change to the registry, and roll the changes back after the program finishes running.
The rollback log includes the following items:
- OPER_TYPE. Records the registry operation. For registry keys the operations are “RegCreateKey” and “RegDelKey”; for registry values they are “RegAddValue”, “RegSetValue” and “RegDelValue”.
- RegPath. Records the full registry path of the operation.
- RegKey. Records the registry key of the operation.
- RegValue. If the operation modified an existing value, records the original value; otherwise records the new registry value.
- ValueType. Records the type of the value if necessary; otherwise it is recorded as -1.
To obtain the above behavioral information, we mainly intercept the following registry system calls.
- NtOpenKey/NtOpenKeyEx. Record the opened registry information and obtain the registry handle.
- NtCreateKey. Record the opened or created registry key information and obtain the registry handle.
- NtSetValueKey. Record the added or modified registry value information and the original value.
- NtDeleteKey/NtDeleteValueKey. Record the deleted registry key or value information.
- NtClose. Remove the closed handle from the handle list.
Because a PinTool may not work well with the Windows API, we write a separate program that reverses the logged modifications. For example, for a “RegCreateKey” action, we can use the RegDeleteKey function to delete the added registry key.
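The rollback step can be illustrated with a minimal sketch. It is not the authors' rollback program: the `RollbackEntry` class, the dictionary used as a stand-in for the registry, and the mapping from each OPER_TYPE to its inverse are assumptions made for illustration. A real implementation would call the Windows registry API (for example RegDeleteKey, as mentioned above) instead of editing a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RollbackEntry:
    """One line of the rollback log (field names follow the list above)."""
    oper_type: str                    # RegCreateKey, RegDelKey, RegAddValue, ...
    reg_path: str                     # full registry path
    reg_key: str                      # key or value name touched by the operation
    reg_value: Optional[str] = None   # original value, when one existed
    value_type: int = -1              # -1 when no type applies


# Hypothetical stand-in for the registry: {(path, key): (value, type)}.
Registry = Dict[Tuple[str, str], Tuple[Optional[str], int]]


def rollback(registry: Registry, log: List[RollbackEntry]) -> None:
    """Undo the logged operations, newest first."""
    for entry in reversed(log):
        slot = (entry.reg_path, entry.reg_key)
        if entry.oper_type in ("RegCreateKey", "RegAddValue"):
            # Something new was added: remove it again.
            registry.pop(slot, None)
        elif entry.oper_type in ("RegSetValue", "RegDelValue", "RegDelKey"):
            # Something existing was changed or deleted: restore the original.
            registry[slot] = (entry.reg_value, entry.value_type)
        else:
            raise ValueError(f"unknown OPER_TYPE: {entry.oper_type}")
```

Replaying the log in reverse order matters when the same value is modified several times: the oldest entry holds the true original value and is therefore applied last.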
2.3 File Sandbox Module
The main idea of the file sandbox module is to record file operation information and to redirect the file path of dangerous create/modify operations. If the file already exists, we copy it to the redirected path.
In this module, we mainly focus on the NtCreateFile function. This function opens or creates a disk file and returns its file handle. The second argument (DesiredAccess) holds the read/write access flags, and we use it to detect write permission. If the operation may write data to a file, we change the third argument (ObjectAttributes) to redirect the file path to a new location. This argument is a pointer to an OBJECT_ATTRIBUTES structure. We can allocate a new structure and change its “RootDirectory” handle and “ObjectName” value to change the file path.
This argument supports both absolute and relative paths. If the “RootDirectory” handle is not NULL, the file is opened through a relative path. Note that Pin sets the working directory of the program to an unpredictable path. On Windows 7, the working directory may be set to the location of pin.exe; on Windows 8.1, it may be set to the location of the cmd.exe that was used to start pin.exe. We therefore take the following two measures to make sure the sandbox works properly.
(1) Use the SetCurrentDirectory function to reset the working directory to the program path.
(2) Always set “RootDirectory” to NULL and transform the relative path into an absolute path.
We also intercept the NtOpenFile, NtReadFile, NtWriteFile and NtClose functions to log file handles and the data read or written.
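The redirection decision described in this section can be sketched as follows. This is an illustration only: the helper names, the sandbox directory layout and the use of DOS-style paths (rather than NT object paths) are assumptions, and the set of access bits treated as "may write" is one reasonable choice, not necessarily the one used in PinFWSandBox. The access-mask constants are the standard Windows values.

```python
import ntpath
from typing import Optional

# Standard Windows access-mask bits.
FILE_WRITE_DATA = 0x00000002
FILE_APPEND_DATA = 0x00000004
DELETE = 0x00010000
GENERIC_ALL = 0x10000000
GENERIC_WRITE = 0x40000000
GENERIC_READ = 0x80000000

# Assumption: any of these bits means the call may change the file.
WRITE_MASK = FILE_WRITE_DATA | FILE_APPEND_DATA | DELETE | GENERIC_ALL | GENERIC_WRITE


def may_write(desired_access: int) -> bool:
    """True if DesiredAccess requests any write-like permission."""
    return bool(desired_access & WRITE_MASK)


def to_absolute(object_name: str, root_dir: Optional[str]) -> str:
    """Measure (2): resolve a RootDirectory-relative name to an absolute path."""
    if root_dir is not None and not ntpath.isabs(object_name):
        object_name = ntpath.join(root_dir, object_name)
    return ntpath.normpath(object_name)


def redirect(abs_path: str, sandbox_root: str) -> str:
    """Map C:\\dir\\file to <sandbox_root>\\C\\dir\\file (hypothetical layout)."""
    drive, tail = ntpath.splitdrive(abs_path)
    return ntpath.join(sandbox_root, drive.rstrip(":"), tail.lstrip("\\"))


def plan_open(desired_access: int, object_name: str,
              root_dir: Optional[str], sandbox_root: str) -> str:
    """Return the path the intercepted call should actually open."""
    abs_path = to_absolute(object_name, root_dir)
    if may_write(desired_access):
        return redirect(abs_path, sandbox_root)
    return abs_path
```

Read-only opens keep their original path, so the program still sees real system files, while every write-capable open lands under the sandbox directory.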
2.4 Network Sandbox Module
The main purpose of the network sandbox module is to collect the network behaviors of the program, especially the communication IP addresses and ports used by the client and the server.
The most important system call for the network sandbox is NtDeviceIoControlFile. With the matching “AFD_CODE” and “InputBuffer” structure, we can obtain both the connection information and the transmitted data.
This structure can be used to get the IP address and port that a server binds, for both TCP and UDP. For other network behaviors, we can use the relevant “AFD_CODE” and structures.
2.5 System Malicious Operations Sandbox Module
The functions of the system malicious operations sandbox module are scattered. Because there are many kinds of malicious system operations, we have to intercept different system calls.
We mainly pay attention to the following system calls.
- NtCreateMutant. Record the mutant name used by the program.
- NtUserCallOneParam/NtUserCallNoParam. Prevent shutdown, reboot and logoff actions by checking the second/first argument value (Routine).
- NtShutdownSystem. Prevent shutdown, reboot and logoff actions.
- NtCreateUserProcess. Check whether the command line or path name in the sub-process parameters contains a dangerous string.
- NtAdjustPrivilegesToken. Record information about the privilege adjustments made by the program.
If the sandbox encounters an irreversible harmful operation, such as formatting a disk or rebooting the computer, it calls the PIN_ExitApplication function to terminate the process and runs the PinTool fini function to save the log.
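A minimal sketch of this policy is given below. The list of dangerous substrings and the two action names ("log" and "terminate") are hypothetical; the paper does not specify which strings are checked. In the real sandbox the "terminate" action would correspond to calling PIN_ExitApplication from the PinTool.

```python
# Hypothetical examples of command-line fragments treated as dangerous.
DANGEROUS_SUBSTRINGS = ("format ", "shutdown", "net user")


def is_dangerous_command(command_line: str) -> bool:
    """Case-insensitive substring check on a sub-process command line."""
    lowered = command_line.lower()
    return any(fragment in lowered for fragment in DANGEROUS_SUBSTRINGS)


def decide(syscall: str, command_line: str = "") -> str:
    """Return 'terminate' for irreversible operations, otherwise 'log'."""
    if syscall == "NtShutdownSystem":
        return "terminate"
    if syscall == "NtCreateUserProcess" and is_dangerous_command(command_line):
        return "terminate"
    return "log"
```

Substring matching is deliberately crude: it can over-block harmless commands, which is acceptable in a sandbox whose priority is protecting the host.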
3 Method and Implementation
3.1 General Idea
We also divide our detection system into three modules. The flowchart of the method model is shown below (Fig. 1).
3.2 Collector
The collector module performs the static analysis and the dynamic execution of the program. It collects static disassembly information, dynamic instruction snapshot information and system call information. In this module, we use PinFWSandBox for the dynamic execution of malware. The system call log is also generated by the sandbox.
(1) Static Analysis Module
We use IDA and an IDA Python script to traverse the user code section of the program and save the disassembly result. To run the script automatically, we start IDA with the command-line arguments “-A -c -S”. Even when a program is packed or obfuscated, the static analysis part still obtains the disassembled code sequence of the self-unpacking stub.
(2) Instruction Recorder Module
Under the protection of the sandbox, we use binary instrumentation as in previous work [8]. We record the dynamic instruction snapshot as BBL (basic block) numbers together with a file of each BBL's instructions, and only for user functions in the main image file.
(3) Syscall Recorder Module
For dynamic behavior information, we mainly log system call information. We save the system call details in four separate files (reg_log.txt, file_log.txt, net_log.txt, command_log.txt) and record the name and return value of each system call in another file.
3.3 Extractor
(1) Static Feature Extraction
In this work, we mainly focus on the immediate feature. We traverse every instruction and pick out each immediate value together with its opcode.
Because the static analysis covers only the code section of the main image, this part of the feature set does not become too large.
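As an illustration of this step, the sketch below turns disassembly text into “OPCODE_IMM” words (the word format named in the dynamic feature description). The input format, one instruction per line as `opcode operand, operand`, and the rule that only whole-operand numeric literals count as immediates are assumptions; displacements inside memory operands are ignored here.

```python
import re
from typing import List

# Whole-operand numeric literals: 0x10, 0FFh, or plain decimal.
IMMEDIATE_RE = re.compile(r"^(?:0x[0-9a-f]+|[0-9][0-9a-f]*h|\d+)$", re.IGNORECASE)


def opcode_imm_words(disassembly: List[str]) -> List[str]:
    """Return one 'opcode_immediate' word per immediate operand."""
    words = []
    for line in disassembly:
        parts = line.strip().split(None, 1)
        if len(parts) < 2:
            continue  # instruction without operands, e.g. 'ret'
        opcode, operands = parts
        for operand in operands.split(","):
            operand = operand.strip()
            if IMMEDIATE_RE.match(operand):
                words.append(f"{opcode}_{operand}")
    return words
```

For example, the lines `mov eax, 0x10` and `push 4` yield the words `mov_0x10` and `push_4`, while `add eax, ebx` yields nothing.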
(2) Dynamic Instruction Feature Extraction
In this work, dynamic instruction feature extraction mainly includes two parts.
The first part is the immediates of the dynamic disassembly. We again use the format “OPCODE_IMM”. However, because dynamic execution may contain loop structures or run a very large number of instructions, the feature file may be too big to classify. Some feature files are even larger than 1 GB. For this reason, we use the following algorithm to compress the feature file size (Fig. 2).
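The algorithm of Fig. 2 is not reproduced in this text, so the following is not the authors' method. It is a simple stand-in that shows one way to get a comparable size reduction, assuming the bloat comes from loops that emit the same feature word many times: each distinct word is kept at most `max_repeat` times (a hypothetical parameter).

```python
from collections import Counter
from typing import Iterable, Iterator


def cap_repeats(words: Iterable[str], max_repeat: int = 3) -> Iterator[str]:
    """Yield words in order, dropping any word already seen max_repeat times."""
    seen = Counter()
    for word in words:
        if seen[word] < max_repeat:
            seen[word] += 1
            yield word
```

Because the function is a generator, a feature file of more than 1 GB can be streamed through it line by line; memory use grows only with the number of distinct words.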
After applying the compression algorithm, the size of the feature file is significantly reduced. Here is an example of the dynamic instruction immediate feature of a virus before compression (Fig. 3):
The second part is the opcode feature of the dynamic instructions. We count the frequency of each opcode in the dynamic instruction flow, but in this work we handle the opcode feature differently: we record the opcode of each instruction as a feature word and apply the compression algorithm above. Here is an example of the dynamic instruction opcode feature of a virus before compression (Fig. 4):
(3) System Call Feature Extraction
For the system call feature, we directly extract words in the format “SyscallName_RetValue”. Here is an example of the system call feature of a virus (Fig. 5):
The data in our experiments show that if we use only the system call name as the word feature, the f1-score is 10%–15% lower than when the return value is included.
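A small sketch of the word construction follows. The record layout (a name and an integer return value) and the 8-digit hexadecimal rendering of the return value are assumptions; the paper only fixes the “SyscallName_RetValue” pattern.

```python
from typing import Iterable, List, Tuple


def syscall_words(records: Iterable[Tuple[str, int]]) -> List[str]:
    """Build 'SyscallName_RetValue' words from (name, return value) pairs."""
    # Mask to 32 bits so negative NTSTATUS values print as unsigned hex.
    return [f"{name}_{ret & 0xFFFFFFFF:#010x}" for name, ret in records]


def syscall_name_words(records: Iterable[Tuple[str, int]]) -> List[str]:
    """Name-only variant, shown for comparison."""
    return [name for name, _ in records]
```

With the return value included, a successful NtOpenKey and one that fails with 0xC0000034 (STATUS_OBJECT_NAME_NOT_FOUND) become two different words, whereas the name-only variant merges them. That extra distinction is a plausible reason for the f1-score gap reported above.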
4 Test and Analysis of Result
To verify the correctness of the methods and models, we use a malicious sample set and a non-malicious sample set in a two-class classification test.
The malicious samples are 1058 Win32 viruses downloaded from “vxheaven.org”. The non-malicious samples are random Win32 programs that other users uploaded to “malwr.com”. We use “virscan.org” to pick out 770 samples that fewer than 4 of 39 antivirus engines reported as a virus. We use 80% of the samples as the training set and 20% as the test set.
4.1 Single Model Classifier Result
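Each single-feature classifier is a Naive Bayes model over feature words. The sketch below is a generic multinomial Naive Bayes with add-one (Laplace) smoothing, written for illustration; the smoothing choice, the handling of unseen words and the class name `WordNaiveBayes` are assumptions, not details taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, Sequence


class WordNaiveBayes:
    """Multinomial Naive Bayes over lists of feature words."""

    def fit(self, docs: Sequence[Sequence[str]], labels: Sequence[int]) -> "WordNaiveBayes":
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_scores(self, words: Sequence[str]) -> Dict[int, float]:
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.classes:
            score = math.log(self.doc_counts[c] / n_docs)  # class prior
            for word in words:
                if word not in self.vocab:
                    continue  # assumption: words never seen in training are ignored
                # Add-one smoothing avoids log(0) for words absent from class c.
                likelihood = (self.word_counts[c][word] + 1) / (self.totals[c] + vocab_size)
                score += math.log(likelihood)
            scores[c] = score
        return scores

    def predict(self, words: Sequence[str]) -> int:
        scores = self.log_scores(words)
        return max(scores, key=scores.get)
```

Working in log space keeps the score numerically stable even when a sample contains many thousands of feature words.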
(1) Static Immediate Classifier Result
First, we use the static immediate feature for classification. The generator result and the classifier result are shown below (Tables 1 and 2).
As the test result shows, the static immediate feature alone is not enough for the classifier. The final f1-score is 87.39% on the test set.
4.1.1 System Call Classifier Result
Secondly, we use system call feature to generate the classifier. The result is shown below (Tables 3 and 4).
Using the system call name together with its return value gives a good result. The final f1-score of the system call classifier is 95.34% on the test set.
4.1.2 Dynamic Immediate Classifier Result
Third, we use the dynamic immediate feature to generate a classifier and compare it with the static immediate result. The result is shown below (Tables 5 and 6).
The final f1-score of this model is 92.83%.
4.1.3 Dynamic Opcode Classifier Result
Last, we use the dynamic opcode features. The generator result and the classifier result are shown below (Tables 7 and 8).
Unfortunately, using discrete dynamic opcodes with the feature extraction method above yields a final f1-score of only 87.75%.
4.2 Merged Model Classifier Result
Using the f1-scores of the four classifiers, we calculate the weights \( \alpha_i \) as follows and get the final result (Tables 9 and 10).
Finally, the merged model raises the malware classification f1-score to close to 96%. We believe that a more stable non-malicious sample set may give a better classification result.
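The formula for \( \alpha_i \) is not reproduced in this text. As a hedged illustration of linear weighted fusion, the sketch below assumes the simplest possible rule, weights proportional to each classifier's f1-score, and a 0.5 decision threshold; both are assumptions and may differ from the weighting actually used. The f1-scores are the single-model values reported above.

```python
from typing import Dict

# Single-model f1-scores reported in Sect. 4.1.
F1_SCORES = {
    "static_imm": 0.8739,
    "syscall": 0.9534,
    "dynamic_imm": 0.9283,
    "dynamic_opcode": 0.8775,
}


def fusion_weights(f1_scores: Dict[str, float]) -> Dict[str, float]:
    """Assumed rule: alpha_i = f1_i / sum of all f1 scores."""
    total = sum(f1_scores.values())
    return {name: f1 / total for name, f1 in f1_scores.items()}


def fused_vote(predictions: Dict[str, int], weights: Dict[str, float],
               threshold: float = 0.5) -> int:
    """Weighted vote over 0/1 predictions; 1 means malicious."""
    score = sum(weights[name] * label for name, label in predictions.items())
    return 1 if score >= threshold else 0
```

Under this rule a two-against-two split is decided by the stronger pair: if only the system call and dynamic immediate classifiers flag a sample, it is still labeled malicious, whereas the static immediate and dynamic opcode classifiers together are not enough.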
5 Conclusion and Future Work
This paper presents a malware detection method based on a sandbox, binary instrumentation and multidimensional feature extraction. First, we design and implement a malware sandbox called PinFWSandBox, which intercepts and filters system call behaviors. Second, we extract the static immediate feature, the dynamic instruction discrete features and the system call feature as the multidimensional feature. We then build a single-model classifier for each feature. Last, we merge the models using a linear weighted fusion method and obtain a final f1-score close to 96%.
The limitations of this work are that it uses only discrete instruction features and the Naïve Bayes classifier. In future work, we will consider sequence features such as instruction sequence similarity or function sequence features. Furthermore, to achieve better results, we will try to use more static and dynamic features.
References
Gandotra, E., Bansal, D., Sofat, S.: Malware analysis and classification: a survey. J. Inf. Secur. 2014 (2014)
Baldangombo, U., Jambaljav, N., Horng, S.J.: A static malware detection system using data mining methods. arXiv preprint arXiv:1308.2831 (2013)
Divandari, H., Pechaz, B., Jahan, M.V.: Malware detection using Markov Blanket based on opcode sequences. In: International Congress on Technology, Communication and Knowledge (ICTCK) 2015, pp. 564–569. IEEE (2015)
Lee, J., Im, C., Jeong, H.: A study of malware detection and classification by comparing extracted strings. In: Proceedings of the 5th International Conference on Ubiquitous Information Management and Communication, p. 75. ACM (2011)
Xiao, H., Stibor, T.: A supervised topic transition model for detecting malicious system call sequences. In: Proceedings of the 2011 Workshop on Knowledge Discovery, Modeling and Simulation, pp. 23–30. ACM (2011)
Gui, X., Liu, J., Chi, M., et al.: Analysis of malware application based on massive network traffic. China Commun. 13(8), 209–221 (2016)
Alazab, M., Venkatraman, S., Watters, P., et al.: Zero-day malware detection based on supervised learning algorithms of API call signatures. In: Proceedings of the Ninth Australasian Data Mining Conference, vol. 121, pp. 171–182. Australian Computer Society, Inc. (2011)
Cui, B., Wang, F., Guo, T., et al.: Flowwalker: a fast and precise off-line taint analysis framework. In: Fourth International Conference on Emerging Intelligent Data and Web Technologies (EIDWT), 2013, pp. 583–588. IEEE (2013)
Jingling, Z., Shilei, C., Mengchen, C.A.O., et al.: Malware algorithm recognition based on offline instruction-flow analyse. J. Tsinghua Univ. (Sci. Technol.) 65(5), 484–492 (2016)
Cepeda, C., Tien, D.L.C., Ordóñez, P.: Feature selection and improving classification performance for malware detection. In: IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom)(BDCloud-SocialCom-SustainCom), 2016, pp. 560–566. IEEE (2016)
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Wang, C., Ding, J., Guo, T., Cui, B. (2018). A Malware Detection Method Based on Sandbox, Binary Instrumentation and Multidimensional Feature Extraction. In: Barolli, L., Xhafa, F., Conesa, J. (eds) Advances on Broad-Band Wireless Computing, Communication and Applications. BWCCA 2017. Lecture Notes on Data Engineering and Communications Technologies, vol 12. Springer, Cham. https://doi.org/10.1007/978-3-319-69811-3_39
DOI: https://doi.org/10.1007/978-3-319-69811-3_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69810-6
Online ISBN: 978-3-319-69811-3