1 Introduction

With the development of the Internet and information security, malicious programs constantly use new obfuscation and feature-hiding techniques, such as polymorphism, metamorphism and packing, to disturb static analysis and malware detection.

Static malware detection mainly uses discrete program features, such as PE structure information, strings, byte code or statically disassembled instructions [1,2,3]. Because this approach does not execute the program, its accuracy is affected by packing, encryption and junk code, and it needs to combine various static features [4].

In this work, we present a new malware detection method with multidimensional feature extraction (both static and dynamic) and build a sandbox using Intel Pin [8]. First, we use an IDAPython script and IDA command-line parameters to obtain the immediates and opcodes from the static disassembly. Second, we run the program in PinFWSandBox and collect its system call information. Third, under the protection of the sandbox, we use binary instrumentation to capture a snapshot of the dynamic instruction flow. Fourth, we extract static, system call and dynamic instruction flow features. Last, we use the Naive Bayes model to partition the feature attributes, generate a classifier for each feature, and vote on the final classification result.

This paper is organized as follows. Sect. 2 describes the design and implementation of the sandbox. Sect. 3 explains the general idea of our multidimensional feature extraction and classification method. Sect. 4 details the classification model and our experimental data. Finally, Sect. 5 concludes our work and outlines future work.

2 PinFWSandBox

2.1 General Idea

The PinFWSandBox is designed to protect the system as far as possible without influencing the normal execution of malicious programs. PinFWSandBox can also work well with the Flowwalker [10] recorder module and protect it from harmful behaviors.

We intercept each Windows system call at both entry and exit, obtain the system call number and parameters used by the malicious program, and handle the call according to its category. Depending on how the sandbox handles them, we classify malicious behaviors into four categories.

Registry malicious operations. These mainly include setting auto-start registry items, setting the executable association of a file suffix, traversing and querying registry keys, and maliciously modifying the key values of other programs.

File system malicious operations. These mainly include malware self-copying, code self-decryption and release, encrypting data files on disk, and modifying system programs and dynamic link library files.

Network malicious operations. These mainly include starting a backdoor service, listening on a local UDP/TCP port, executing DDoS attacks, connecting to a remote control server, uploading local data and downloading server instructions.

System malicious operations. These mainly include creating mutexes, searching for a specified window, traversing disk information and files, changing file attributes, creating user processes, allocating remote memory, modifying IE settings, escalating privileges, restarting or shutting down the computer, terminating other processes, formatting the disk and adding system accounts.
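As an illustration, the four categories can be expressed as a lookup from system call name to category. The sketch below is a minimal Python example, not the sandbox's actual dispatch code: the table `SYSCALL_CATEGORY` and the helpers `categorize` and `summarize` are hypothetical names, and the table lists only the system calls named in Sects. 2.2 to 2.5.

```python
from collections import Counter

# Hypothetical grouping of intercepted system calls into the four categories.
# Only calls named in this paper are listed. NtDeviceIoControlFile is treated
# as "network" here, although it also serves other device requests.
SYSCALL_CATEGORY = {
    "NtOpenKey": "registry", "NtOpenKeyEx": "registry",
    "NtCreateKey": "registry", "NtSetValueKey": "registry",
    "NtDeleteKey": "registry", "NtDeleteValueKey": "registry",
    "NtCreateFile": "file", "NtOpenFile": "file",
    "NtReadFile": "file", "NtWriteFile": "file",
    "NtDeviceIoControlFile": "network",
    "NtCreateMutant": "system", "NtShutdownSystem": "system",
    "NtCreateUserProcess": "system", "NtAdjustPrivilegesToken": "system",
    "NtUserCallOneParam": "system", "NtUserCallNoParam": "system",
}


def categorize(syscall_name):
    """Return the sandbox category of a system call, or "other"."""
    return SYSCALL_CATEGORY.get(syscall_name, "other")


def summarize(trace):
    """Count how many calls of each category appear in a trace of names."""
    return Counter(categorize(name) for name in trace)
```

Calling `summarize` on a recorded trace gives a quick per-category profile of a sample before any feature extraction is applied.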

2.2 Registry Sandbox Module

The main idea of the registry sandbox module is to record information about opened registry keys (including handle, path, key, value and registry type), log the changes made to the registry, and roll them back after the program finishes running.

The rollback log includes the following items:

  • OPER_TYPE. Records the registry operation: “RegCreateKey” and “RegDelKey” for registry keys, and “RegAddValue”, “RegSetValue” and “RegDelValue” for registry values.

  • RegPath. Records the full registry path of the operation.

  • RegKey. Records the registry key of the operation.

  • RegValue. If the operation modified the value, records the original value; otherwise records the new registry value.

  • ValueType. Records the type of the value if necessary; otherwise recorded as -1.

To obtain the above behavioral information, we mainly intercept the following registry system calls.

  • NtOpenKey/NtOpenKeyEx. Record information about the opened registry key and obtain the registry handle.

  • NtCreateKey. Record information about the opened or created registry key and obtain the registry handle.

  • NtSetValueKey. Record information about the added or modified registry value, together with the original value.

  • NtDeleteKey/NtDeleteValueKey. Record information about the deleted registry key or value.

  • NtClose. Remove the closed handle from the handle list.

Because a PinTool may not work well with the Windows API, we wrote a separate program that reverses the modifications. For example, for a “RegCreateKey” action, we can use the RegDeleteKey function to delete the added registry key.
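The rollback idea can be sketched as replaying the log in reverse and applying the inverse of each operation. The example below is a simplified Python model under stated assumptions: the registry is simulated as a nested dictionary rather than accessed through the Windows API, each log entry is a tuple in the field order listed above, the RegKey field is taken to hold the value name for value operations, and the function name `rollback` is hypothetical.

```python
# Each log entry: (oper_type, reg_path, reg_key, reg_value, value_type).
# The registry is simulated as {path: {name: (value, type)}}; a real rollback
# tool would call the Windows registry API instead.

def rollback(registry, log):
    """Undo the logged operations, newest first, on the simulated registry."""
    for oper, path, name, value, vtype in reversed(log):
        if oper == "RegCreateKey":
            # Inverse of creating a key: delete it.
            registry.pop(path, None)
        elif oper == "RegDelKey":
            # Inverse of deleting a key: recreate it (values are restored
            # only if separate value entries were logged).
            registry.setdefault(path, {})
        elif oper == "RegAddValue":
            # Inverse of adding a value: delete it.
            registry.get(path, {}).pop(name, None)
        elif oper in ("RegSetValue", "RegDelValue"):
            # Inverse of modifying or deleting: restore the original value.
            registry.setdefault(path, {})[name] = (value, vtype)
        else:
            raise ValueError(f"unknown operation type: {oper}")
    return registry
```

Processing the log in reverse order matters: if a value is added and then modified, the modification must be undone before the addition.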

2.3 File Sandbox Module

The main idea of the file sandbox module is to record file operation information and redirect the file path of dangerous create/modify operations. If the file already exists, we copy it to our redirect path.

In this module, we mainly focus on the NtCreateFile function, which opens or creates a disk file and returns its file handle. The second argument (DesiredAccess) specifies the requested read/write access, and we use it to detect write permission. If the operation may write data to a file, we change the third argument (ObjectAttributes) to redirect the file path to a new location. This argument is a pointer to an OBJECT_ATTRIBUTES structure. We can allocate a new structure and change its “RootDirectory” handle and “ObjectName” value to change the file path.

This argument supports both absolute and relative paths. If the “RootDirectory” handle is not NULL, the file is opened by relative path. It is noteworthy that PinTool sets the program’s working directory to an unpredictable path. On Windows 7, the working directory may be set to the location of pin.exe, and on Windows 8.1 it may be set to the location of the cmd.exe used to start pin.exe. We therefore take the following two measures to make sure the sandbox works properly.

  (1) Using the SetCurrentDirectory function to reset the working directory to the program path.

  (2) Always setting the “RootDirectory” to NULL and transforming the relative path to an absolute path.
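The redirect decision can be illustrated with a small path-rewriting sketch. It is a Python model only, not the PinTool code: the access-mask constants are the standard Windows values, but which bits the sandbox actually tests is not stated here, so the `WRITE_MASK` selection, the `C:\sandbox` root and the function names are assumptions.

```python
import ntpath

# Standard Windows access-mask bits that imply a write (assumed selection).
FILE_WRITE_DATA = 0x0002
FILE_APPEND_DATA = 0x0004
GENERIC_ALL = 0x10000000
GENERIC_WRITE = 0x40000000
WRITE_MASK = FILE_WRITE_DATA | FILE_APPEND_DATA | GENERIC_WRITE | GENERIC_ALL


def wants_write(desired_access):
    """True if the DesiredAccess mask requests any write access."""
    return bool(desired_access & WRITE_MASK)


def redirect_path(object_name, desired_access, root_dir=None,
                  sandbox_root=r"C:\sandbox"):
    """Return the path the sandbox should open instead of object_name."""
    # Measure (2): resolve a relative name against its root directory so the
    # RootDirectory handle can be set to NULL.
    if root_dir is not None and not ntpath.isabs(object_name):
        object_name = ntpath.join(root_dir, object_name)
    path = ntpath.normpath(object_name)
    if not wants_write(desired_access):
        return path  # read-only access is left untouched
    # Mirror the original location under the sandbox root, e.g.
    # C:\mal\out.txt -> C:\sandbox\C\mal\out.txt
    drive, tail = ntpath.splitdrive(path)
    return ntpath.join(sandbox_root, drive.rstrip(":"), tail.lstrip("\\"))
```

A full implementation would also copy an existing file to the redirected location before the write, as described at the start of this section.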

We also intercept the NtOpenFile, NtReadFile, NtWriteFile and NtClose functions to log file handles and the data read or written.

2.4 Network Sandbox Module

The main purpose of the network sandbox module is to collect the program’s network behaviors, especially the IP addresses and ports used for communication by the client and server.

The most important system call for the network sandbox is NtDeviceIoControlFile. With the matching “AFD_CODE” and “Inputbuffer” structure, we can obtain both connection information and transmitted data.

This structure can be used to obtain the IP address and port that a server binds, for both TCP and UDP. For other network behaviors, we can use the relevant “AFD_CODE” and structures.
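The layout of the input buffer for each “AFD_CODE” is not specified in this text, so the following Python sketch covers only the last decoding step: reading an IPv4 `sockaddr_in` once its position in the buffer is known. The function name `parse_sockaddr_in` and the `offset` parameter are hypothetical; the field layout (2-byte family, 2-byte big-endian port, 4-byte address) is the standard `sockaddr_in` layout.

```python
import socket
import struct


def parse_sockaddr_in(buf, offset=0):
    """Decode (ip, port) from an IPv4 sockaddr_in located at buf[offset:]."""
    (family,) = struct.unpack_from("<H", buf, offset)    # sin_family, host order
    (port,) = struct.unpack_from(">H", buf, offset + 2)  # sin_port, network order
    if family != socket.AF_INET:
        raise ValueError(f"not an IPv4 address (family={family})")
    ip = socket.inet_ntoa(bytes(buf[offset + 4:offset + 8]))  # sin_addr
    return ip, port
```

The caller is assumed to have already located the address inside the intercepted buffer; that offset depends on the specific request and is not derived here.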

2.5 System Malicious Operations Sandbox Module

The functions of the system malicious operations sandbox module are scattered. Because malicious system operations are varied, we have to intercept a different system call for each kind.

We mainly pay attention to the following system calls.

  • NtCreateMutant. Records the mutant name used by the program.

  • NtUserCallOneParam/NtUserCallNoParam. Prevents shutdown, reboot and logoff actions by checking the second/first argument value (Routine).

  • NtShutdownSystem. Prevents shutdown, reboot and logoff actions.

  • NtCreateUserProcess. Checks whether the command line or path name in the sub-process parameters contains a dangerous string.

  • NtAdjustPrivilegesToken. Records information about the program’s privilege adjustment operations.

If the sandbox encounters an irreversible harmful operation, such as formatting a disk or rebooting the computer, it calls the PIN_ExitApplication function to terminate the process and runs the PinTool fini function to save the log.
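The NtCreateUserProcess check can be sketched as a substring search over the sub-process parameters. The strings actually checked by the sandbox are not listed in this paper, so the `DANGEROUS_SUBSTRINGS` tuple below is a made-up example, and the function names are hypothetical.

```python
# Hypothetical blacklist; the strings the sandbox really checks are not given.
DANGEROUS_SUBSTRINGS = ("format ", "shutdown", "net user", "del /")


def check_process_creation(image_path, command_line):
    """Return the dangerous substrings found in the sub-process parameters."""
    haystack = f"{image_path} {command_line}".lower()
    return [s for s in DANGEROUS_SUBSTRINGS if s in haystack]


def should_terminate(image_path, command_line):
    """True if the sandbox should stop the sample before the call proceeds."""
    return bool(check_process_creation(image_path, command_line))
```

In the sandbox, a positive result would correspond to the point where PIN_ExitApplication is called.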

3 Method and Implementation

3.1 General Idea

We divide our detection system into three modules. The flowchart of the method model is shown below (Fig. 1).

Fig. 1. Flowchart of the method model

3.2 Collector

The collector module performs static analysis and dynamic execution of the program. It collects static disassembly information, dynamic instruction snapshots and system call information. In this module, we use PinFWSandBox for the dynamic execution of malware; the sandbox also generates the system call log.

  (1) Static Analysis Module

We use IDA and an IDAPython script to traverse the user code section of the program and save the disassembly. To run the script automatically, we start IDA with the command-line arguments “-A -c -S”. Even when a program is packed or obfuscated, the static analysis part still obtains the disassembled code sequence of its self-unpacking code.
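A batch driver for this step might assemble the IDA command line as follows. This is a sketch only: the executable and script names used in the test are placeholders, and the helper `build_ida_command` is hypothetical. The switches are the ones named above: `-A` for autonomous mode, `-c` to disassemble a fresh database, and `-S` followed directly by the script to run.

```python
def build_ida_command(ida_exe, script, sample):
    """Build the argument list for one unattended IDA run on a sample."""
    return [
        ida_exe,
        "-A",            # autonomous mode: no dialog boxes
        "-c",            # create a new database for the input file
        f"-S{script}",   # script to execute once the database is opened
        sample,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; the script itself is expected to save the disassembly and exit IDA when it finishes.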

  (2) Instruction Recorder Module

Under the protection of the sandbox, we use the same binary instrumentation technique as in previous work [8]. We record the dynamic instruction snapshot as basic block (BBL) numbers together with a file of each BBL’s instructions, and only for user functions in the main image file.

  (3) Syscall Recorder Module

For dynamic behavior information, we mainly log system calls. We save the detailed system call information in four separate files (reg_log.txt, file_log.txt, net_log.txt, command_log.txt) and record the name and return value of each system call in another file.

3.3 Extractor

  (1) Static Feature Extraction

In this work, we mainly focus on the immediate feature. We traverse every instruction and pick out each immediate value together with its opcode.

Because the static analysis covers only the code section of the main image, this part of the feature set does not grow too large.
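The extraction of opcode and immediate pairs can be illustrated on textual disassembly. The sketch below is a simplified Python example, not the IDAPython script used in this work: it assumes one instruction per line in Intel syntax, recognises only operands that are bare numeric literals, and emits words in the “OPCODE_IMM” format. The names `IMM_RE` and `static_features` are hypothetical.

```python
import re

# A bare immediate operand: hexadecimal (0x10 or 10h) or decimal.
IMM_RE = re.compile(r"^(?:0x[0-9a-f]+|[0-9][0-9a-f]*h|[0-9]+)$", re.IGNORECASE)


def static_features(disassembly_lines):
    """Return OPCODE_IMM feature words for every immediate operand found."""
    features = []
    for line in disassembly_lines:
        line = line.split(";", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        opcode, _, operands = line.partition(" ")
        for operand in operands.split(","):
            operand = operand.strip()
            if IMM_RE.match(operand):
                features.append(f"{opcode.upper()}_{operand}")
    return features
```

Memory operands such as `[ebp+8]` are deliberately ignored here; inside IDA the operand type reported by the disassembler would be used instead of a regular expression.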

  (2) Dynamic Instruction Feature Extraction

In this work, dynamic instruction feature extraction mainly includes two parts.

The first part is the immediates of the dynamically executed instructions, again in the format “OPCODE_IMM”. However, because dynamic execution may contain loops or run a very large number of instructions, the feature file may be too large to classify; some feature files exceed 1 GB. We therefore use the following algorithm to compress the feature file (Fig. 2).

Fig. 2. Instruction feature compression algorithm
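The exact algorithm is the one given in Fig. 2. As a rough illustration only, and not a reproduction of that algorithm, one simple way to shrink a feature-word stream that is dominated by loops is to cap how often each distinct word may appear. The function name `compress_features` and the parameter `max_count` are assumptions made for this sketch.

```python
from collections import Counter


def compress_features(words, max_count=10):
    """Keep each distinct feature word at most max_count times.

    Words are emitted in first-seen order, so a word repeated millions of
    times inside a loop contributes at most max_count entries.
    """
    counts = Counter(words)            # Counter preserves first-seen order
    compressed = []
    for word, n in counts.items():
        compressed.extend([word] * min(n, max_count))
    return compressed
```

With such a cap, the output size is bounded by the number of distinct words times `max_count`, regardless of how long the sample runs.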

After applying the compression algorithm, the size of the feature file is significantly reduced. Here is a dynamic instruction immediate feature example of a virus before compression (Fig. 3):

Fig. 3. Dynamic instruction immediate feature example of a virus

The second part is the opcode feature of the dynamic instructions. We count the frequency of each opcode in the dynamic instruction flow. In this work, we handle the opcode feature differently: we record each instruction’s opcode as a feature word and apply the compression algorithm above. Here is a dynamic instruction opcode feature example of a virus before compression (Fig. 4):

Fig. 4. Dynamic instruction opcode feature example of a virus

  (3) System Call Feature Extraction

For the system call feature, we directly extract words in the format “SyscallName_RetValue”. Here is a system call feature example of a virus (Fig. 5):

Fig. 5. System call feature example of a virus

Our experimental data show that if we use only the system call name as the word feature, the f1-score is 10%–15% lower than when the return value is included.
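Building the “SyscallName_RetValue” words from the recorded log can be sketched as follows. The textual form of the return value is an assumption: the sketch prints it as an 8-digit hexadecimal status code, which may differ from the format shown in Fig. 5. The function name `syscall_features` is hypothetical.

```python
def syscall_features(records, with_return_value=True):
    """Turn (name, return_value) records into feature words.

    With with_return_value=False only the name is used, which corresponds to
    the weaker name-only variant compared above.
    """
    words = []
    for name, ret in records:
        if with_return_value:
            words.append(f"{name}_{ret:#010x}")   # e.g. NtOpenKey_0xc0000034
        else:
            words.append(name)
    return words
```

Including the return value separates, for example, a successful NtOpenKey from one that failed because the key does not exist, which the name alone cannot express.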

4 Test and Analysis of Result

To verify the correctness of the methods and models, we run a two-class classification test on a malicious sample set and a non-malicious sample set.

The malicious samples are 1058 win32 viruses downloaded from “vxheaven.org”. The non-malicious samples are random win32 programs from “malwr.com” that were uploaded by other users. We use “virscan.org” to select 770 samples that fewer than 4 of 39 antivirus engines report as a virus. We use 80% of the samples as the training set and 20% as the test set.
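The 80/20 split can be reproduced with a seeded shuffle. How the samples were actually shuffled, and whether the split was made per class, is not stated here; the sketch below simply splits one list, and the function name `split_samples` and the default seed are assumptions.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle samples reproducibly and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Applying the function separately to the malicious and the non-malicious set keeps the class ratio the same in both parts.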

4.1 Single Model Classifier Result

  (1) Static Immediate Classifier Result

Firstly, we use the static immediate feature for classification. The generator result and the classifier result are shown below (Tables 1 and 2).

Table 1. Static immediate generator result
Table 2. Static immediate classifier result

As the test result shows, the static immediate feature alone is not enough for the classifier. The final f1-score is 87.39% on the test set.
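For reference, a word-based Naive Bayes classifier and the f1-score can be written in a few lines. The sketch below is a generic multinomial Naive Bayes with add-one smoothing; which Naive Bayes variant and smoothing were used in these experiments is not stated, so this should be read as an assumed stand-in. The class name `NaiveBayes` and the function `f1_score` are defined here for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over feature words with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for wc in self.word_counts.values() for w in wc}
        return self

    def log_scores(self, words):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / total_docs)  # class prior
            for w in words:
                if w in self.vocab:        # unseen words are ignored
                    score += math.log((self.word_counts[c][w] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)


def f1_score(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Each feature type (static immediate, system call, dynamic immediate, dynamic opcode) would get its own instance trained on its own feature words.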

4.1.1 System Call Classifier Result

Secondly, we use the system call feature to generate the classifier. The results are shown below (Tables 3 and 4).

Table 3. System call generator result
Table 4. System call classifier result

Using the system call name together with its return value gives a good result. The final f1-score of the system call classifier is 95.34% on the test set.

4.1.2 Dynamic Immediate Classifier Result

Thirdly, we use the dynamic immediate feature to generate a classifier and compare it with the static immediate result. The results are shown below (Tables 5 and 6).

Table 5. Dynamic immediate generator result
Table 6. Dynamic immediate classifier result

The final f1-score of this model is 92.83%.

4.1.3 Dynamic Opcode Classifier Result

Lastly, we use the dynamic opcode feature. The generator result and the classifier result are shown below (Tables 7 and 8).

Table 7. Dynamic opcode generator result
Table 8. Dynamic opcode classifier result

Unfortunately, the discrete dynamic opcode feature with the extraction method above reaches a final f1-score of only 87.75%.

4.2 Merged Model Classifier Result

Using the f1-scores of the four classifiers, we calculate the weights \( \alpha_i \) as follows and get the final result (Tables 9 and 10).

Table 9. Weights of the different classifiers
Table 10. Merged model classifier result

Finally, the merged model raises the malware classification f1-score to close to 96%. We believe that a more stable non-malicious sample set may yield a better classification result.
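The weighted vote can be sketched as follows. The f1-scores are the single-model results reported in Sect. 4.1; the weighting rule itself is an assumption, since the sketch simply normalises the f1-scores so that the weights sum to 1, which may differ from the formula used for Table 9. The names `fusion_weights` and `fuse` and the 0.5 threshold are hypothetical.

```python
# Single-model f1-scores on the test set, as reported in Sect. 4.1.
F1_SCORES = {
    "static_immediate": 0.8739,
    "system_call": 0.9534,
    "dynamic_immediate": 0.9283,
    "dynamic_opcode": 0.8775,
}


def fusion_weights(f1_scores):
    """Assumed rule: weight of classifier i is f1_i divided by the sum of all f1."""
    total = sum(f1_scores.values())
    return {name: f1 / total for name, f1 in f1_scores.items()}


def fuse(votes, weights, threshold=0.5):
    """Linear weighted fusion of per-classifier votes (1 = malware, 0 = benign).

    Returns (is_malware, fused_score).
    """
    score = sum(weights[name] * votes[name] for name in weights)
    return score > threshold, score
```

Under this rule, a sample flagged only by the two strongest classifiers (system call and dynamic immediate) is still labelled malware, because their combined weight exceeds one half.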

5 Conclusion and Future Work

This paper presents a malware detection method based on a sandbox, binary instrumentation and multidimensional feature extraction. Firstly, we design and implement a malware sandbox called PinFWSandBox, which intercepts and filters system call behaviors. Secondly, we extract the static immediate feature, the dynamic instruction discrete feature and the system call feature as the multidimensional feature. We then build a single-model classifier for each feature. Lastly, we merge the models using a linear weighted fusion method and obtain a final f1-score close to 96%.

The limitations of this work are that it uses only discrete instruction features and the Naive Bayes classifier. In future work, we will consider sequence features such as instruction sequence similarity or function sequence features. Furthermore, to achieve better results, we will try to use more static and dynamic features.