Development of an Automatic Document Malware Analysis System

  • Hong-Koo Kang
  • Ji-Sang Kim
  • Byung-Ik Kim
  • Hyun-Cheol Jeong
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 215)

Abstract

Malware attacks that use document files such as PDF and HWP have been increasing rapidly. In particular, there has been a sharp rise in social engineering infections in which document-based malware is distributed through Web/SNS postings that exploit political or cultural issues, or through spam mail that pretends to come from a work colleague. The threat is expected to grow because most PC users open document files routinely and commercial antivirus (vaccine) programs detect this type of malware at a relatively low rate. This paper therefore proposes an automatic document malware analysis system that performs static and dynamic analysis of document files such as PDF and HWP and reports the result. The static analysis identifies and extracts the scripts and shell code that produce the malicious behavior, and detects obfuscated code and the use of functions that are reported to be vulnerable. The dynamic analysis monitors behavior at the kernel level and generates a log, which is then compared with the malicious behavior rules to detect suspicious documents. In a performance test on actual document malware samples, the system detected 63 of 88 samples (71.6 %).

Keywords

Document, Malware, Automatic analysis system

1 Introduction

Malware attacks that use document files, such as Advanced Persistent Threat (APT) attacks and spam mail, have been increasing rapidly. Most of these attacks rely on social engineering: a Web/SNS posting on a political or cultural issue induces users to download the malware, or spam mail that pretends to come from a work colleague carries the document malware as an attachment [1, 2]. Because most PC users open document files routinely, they are more exposed to document-based malware than to conventional PE (Portable Executable) malware. Moreover, commercial antivirus (vaccine) programs rely on signature-based detection, which has a low detection rate for document malware, so the threat is expected to keep growing [3, 4].

This paper therefore proposes an automatic document malware analysis system that performs static and dynamic analysis of document files such as PDF and HWP and reports the result. The static analysis identifies and extracts the scripts and shell code that produce the malicious behavior, and detects obfuscated code and the use of functions that are reported to be vulnerable. The dynamic analysis monitors behavior at the kernel level and generates a log, which is then compared with the malicious behavior rules to detect suspicious documents. In a performance test on actual document malware samples, the system detected 71.6 % of the samples.

2 Related Studies

The leading automatic malware analysis systems include Anubis [5], CWSandbox [6], and Wepawet [7].

Figure 1 shows the result of a malicious code analysis by Anubis. Anubis provides a Web-based automatic malware analysis service. It reports the attributes of the input file as well as behavior data for the registry, files, and processes, and it also analyzes the files and processes that the malicious code creates [5]. However, Anubis analyzes only PE-type malicious code and does not support document-based malicious code.

Fig. 1 The Anubis analysis result

Figure 2 shows the result of a malicious code analysis by CWSandbox. Like Anubis, CWSandbox automatically analyzes PE-type malicious code. Unlike Anubis, it delivers the result of the PE file analysis by e-mail, and it reports each behavior together with the time at which it occurred [6]. However, like Anubis, CWSandbox does not support document-based malicious code.

Fig. 2 CWSandbox analysis result

Lastly, Wepawet performs static and dynamic analysis of PDF files and reports the resulting behavior [7]. Figure 3 shows the result of a PDF file analyzed by Wepawet.

Fig. 3 PDF analysis result

Wepawet extracts the scripts and shell code contained in the PDF file and reports the behavior of the extracted code. For any generated file, it shows the result of scanning it with commercial antivirus programs. However, Wepawet is limited in that it analyzes only PDF-format document malware.

3 System Design

The automatic document malware analysis system proposed in this paper targets PDF, MS-Office, and HWP files. The system consists mainly of the analysis management module, the static analysis module, and the dynamic analysis module. Figure 4 shows the overall architecture of the system.

Fig. 4 Overall system architecture

As shown in Fig. 4, the analysis management module receives analysis requests and manages the data. The static analysis module and the dynamic analysis module retrieve the file for which analysis was requested from the DB, perform the analysis, and store the results in the DB. Figure 5 shows the analysis management module.

Fig. 5 Analysis management module

As shown in Fig. 5, when a Web user or an external system requests an analysis, the analysis management module saves the request data in the management DB so that the system can perform the static and dynamic analyses.

The static analysis module relies on the fact that most malicious behavior in a document file is carried out by scripts or shell code: it checks whether the document file contains any script or shell code and extracts it if so. For example, it checks whether the /JS or /JavaScript name is used in a PDF file and extracts the corresponding JavaScript, and it extracts the VB scripts contained in the macros of an MS-Office file. In a PDF file, it also decodes five types of obfuscated code that are applied through /Filter and detects whether any functions reported to be vulnerable are used.
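The PDF checks described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the function names, the regular expressions, and the list of risky JavaScript calls are invented for this example (the paper does not list which functions it flags), and only Flate-compressed streams are decoded rather than all five obfuscation types.

```python
import re
import zlib

# Name tokens that indicate embedded JavaScript in a PDF.
SCRIPT_NAMES = (b"/JS", b"/JavaScript")

# Acrobat JavaScript calls that have been publicly reported as exploited.
# Illustrative list chosen for this sketch, not taken from the paper.
VULNERABLE_CALLS = ("util.printf", "Collab.collectEmailInfo",
                    "Collab.getIcon", "media.newPlayer")


def normalise_names(data: bytes) -> bytes:
    """Undo #xx hex escapes so that an obfuscated /J#53 reads as /JS."""
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def find_script_names(data: bytes) -> list[str]:
    """Return the script-related names that occur in the data."""
    text = normalise_names(data)
    found = []
    for name in SCRIPT_NAMES:
        # The lookahead keeps /JS from matching a longer name such as /JSx.
        if re.search(re.escape(name) + rb"(?![A-Za-z0-9])", text):
            found.append(name.decode())
    return found


def list_filters(data: bytes) -> set[str]:
    """Collect filter names declared after /Filter (single name or array)."""
    filters = set()
    for m in re.finditer(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)", data):
        filters.update(n.decode()
                       for n in re.findall(rb"/([A-Za-z0-9]+)", m.group(1)))
    return filters


def inflate_streams(data: bytes) -> list[bytes]:
    """Decompress every stream body that turns out to be zlib/Flate data."""
    bodies = []
    for m in re.finditer(rb"stream\r?\n(.*?)\r?\nendstream", data, re.DOTALL):
        try:
            bodies.append(zlib.decompressobj().decompress(m.group(1)))
        except zlib.error:
            pass  # stream uses another filter or is damaged
    return bodies


def scan_pdf(data: bytes) -> dict:
    """Summarise script names, declared filters, and risky calls."""
    bodies = [data] + inflate_streams(data)
    names = sorted({n for b in bodies for n in find_script_names(b)})
    calls = sorted(c for c in VULNERABLE_CALLS
                   if any(c.encode() in b for b in bodies))
    return {"script_names": names,
            "filters": sorted(list_filters(data)),
            "vulnerable_calls": calls}
```

Searching the decompressed stream bodies as well as the raw bytes matters because a script hidden behind /Filter is invisible to a plain byte search of the file.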

The dynamic analysis module checks whether there is an analysis request in the DB and initializes the environment to begin the dynamic analysis. It then analyzes the document file and extracts the file, network, registry, process, and memory behavior data as the analysis result. The extracted behavior data is compared with the malicious behavior rules saved in the DB to assess potential maliciousness, and the results are recorded in the DB. Figure 6 shows the dynamic analysis process for the document file.

Fig. 6 The dynamic analysis process
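The comparison of logged behavior against the malicious behavior rules can be illustrated with a small Python sketch. The event fields, the rule format, the example rules, the weights, and the threshold are all assumptions made for this illustration; the paper does not describe the rule syntax or how a verdict is scored.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    category: str  # "file", "registry", "process", "network" or "memory"
    action: str    # e.g. "create", "set_value", "connect"
    target: str    # path, registry key, process name or address


@dataclass(frozen=True)
class Rule:
    name: str
    category: str
    action: str
    target_pattern: str  # regular expression, matched case-insensitively
    weight: int


# Hypothetical rules; a real rule set would be loaded from the DB.
RULES = [
    Rule("drops executable", "file", "create", r"\.(exe|dll|scr)$", 5),
    Rule("autorun registry key", "registry", "set_value",
         r"\\CurrentVersion\\Run", 4),
    Rule("spawns command shell", "process", "create", r"(^|\\)cmd\.exe$", 3),
]


def match_rules(events, rules=RULES):
    """Return (rule name, target) pairs for every event that hits a rule."""
    hits = []
    for ev in events:
        for rule in rules:
            if (ev.category == rule.category and ev.action == rule.action
                    and re.search(rule.target_pattern, ev.target,
                                  re.IGNORECASE)):
                hits.append((rule.name, ev.target))
    return hits


def verdict(events, rules=RULES, threshold=5):
    """Sum the weights of the distinct rules that fired and compare."""
    weights = {r.name: r.weight for r in rules}
    fired = {name for name, _ in match_rules(events, rules)}
    score = sum(weights[name] for name in fired)
    return ("suspicious" if score >= threshold else "not flagged"), score
```

Counting each rule once, however many events trigger it, keeps a noisy but harmless behavior from inflating the score.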

Most malicious document files generate the malicious code that performs the malicious behavior while the file is being executed. Since an executable file generated by a document file is highly likely to be malicious, the files created by a document file need to be managed in the same group as that document and analyzed statically and dynamically as well. Figure 7 shows the process of analyzing the files generated by a document file.

Fig. 7 Secondary generated file analysis process
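One way to picture the grouping of secondary files is the sketch below. It is a hypothetical illustration only: the `Sample` record, the `group_id` field, and the `run_dynamic` callback are names introduced here, and the paper does not specify how groups are stored or how dropped files are collected.

```python
import hashlib
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sample:
    name: str
    content: bytes
    group_id: str                 # shared by a document and all files it drops
    parent: Optional[str] = None  # name of the sample that created this file
    sha1: str = field(init=False)

    def __post_init__(self):
        self.sha1 = hashlib.sha1(self.content).hexdigest()


def analyse_group(document, run_dynamic):
    """Analyse a document and, breadth-first, every file it generates.

    run_dynamic(sample) is a caller-supplied function that returns a list of
    (name, content) pairs for the files created while the sample ran.
    """
    pending = deque([document])
    seen = {document.sha1}
    group = []
    while pending:
        sample = pending.popleft()
        group.append(sample)
        for name, content in run_dynamic(sample):
            child = Sample(name, content, document.group_id,
                           parent=sample.name)
            if child.sha1 not in seen:  # skip byte-identical copies
                seen.add(child.sha1)
                pending.append(child)
    return group
```

Deduplicating by hash also guarantees termination when a dropped file re-creates a copy of itself.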

The static and dynamic analysis modules are configured as a virtual environment that consists of many GuestOS systems. Each GuestOS analyzes one input document file, so having many GuestOSs enables multiple files to be analyzed simultaneously.
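The idea of handing files to a pool of guests can be sketched as follows. This is an assumed design for illustration, using Python's standard `queue` and `concurrent.futures` modules; the guest names and the `analyse` callback are hypothetical, and the paper does not describe how its scheduler works.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def analyse_all(files, guests, analyse):
    """Analyse files concurrently, with at most one file per guest at a time.

    analyse(guest, path) is a caller-supplied function that runs one analysis
    inside the named guest and returns its result.
    """
    idle = queue.Queue()
    for guest in guests:
        idle.put(guest)

    def job(path):
        guest = idle.get()  # blocks until some guest is free
        try:
            return path, guest, analyse(guest, path)
        finally:
            # A real system would revert the guest to a clean snapshot here.
            idle.put(guest)

    # One worker thread per guest; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=len(guests)) as pool:
        return list(pool.map(job, files))
```

With this arrangement the number of guests bounds the number of simultaneous analyses, so adding guests raises throughput without changing the calling code.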

4 System Implementation

The automatic document malware analysis system is operated through a Web interface. The .NET Framework and an IIS Web server were deployed on Windows Server 2003, and ASP was used to build the Web pages. The system not only analyzes document files uploaded by administrators but also provides an I/O interface for linking with external systems. While an analysis is running, the administrator can monitor the progress of the static and dynamic analyses in real time. When the analyses of a document file are complete, the result appears in the analysis result list. Figure 8 shows the document file analysis result list.

Fig. 8 Document file analysis result list

The document file analysis result list, as shown in Fig. 8, displays the analyzed files. Users can query the files by conditions such as file type, hash value (MD5, SHA1), and period, and can view the detailed information of a specific file in the query result. Figure 9 shows the detailed analysis information.

Fig. 9 Detailed document file information

Figure 9 shows the detailed static and dynamic analysis results of a document file. Users can check the scripts and shell code that were extracted by the static analysis and whether code obfuscation or functions reported to be vulnerable were used. For the dynamic analysis, users can check the result of the behavior analysis. The extracted behaviors are compared with the malicious behavior rules to determine the level of maliciousness. The rules are divided into file, registry, process, network, and memory categories, and rules can be added or edited.

5 Performance Test

The test used actual document malware samples that had been reported. The samples were provided by a vaccine developer that is conducting joint research with KISC. Table 1 shows the number of document malware samples used in the test.

Table 1 Number of samples

Type         No. of samples
PDF          50
MS-Office    28
HWP          10
Total        88

Table 2 shows how many of the document malware samples in Table 1 were detected by the proposed system. Across the PDF, MS-Office, and HWP documents, the system detected 63 of the 88 samples, an overall detection rate of 71.6 %.

Table 2 Number of detected malware

Type         No. of detected malware
PDF          40
MS-Office    18
HWP          5
Total        63
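The reported rate can be reproduced from the two tables, and the same arithmetic gives a per-format breakdown that the paper does not state explicitly. The short Python calculation below uses only the counts from Tables 1 and 2; the variable and function names are chosen for this example.

```python
# Counts taken from Table 1 (samples) and Table 2 (detected).
samples = {"PDF": 50, "MS-Office": 28, "HWP": 10}
detected = {"PDF": 40, "MS-Office": 18, "HWP": 5}


def detection_rate(hit: int, total: int) -> float:
    """Detection rate in percent, rounded to one decimal place."""
    return round(100.0 * hit / total, 1)


rates = {kind: detection_rate(detected[kind], samples[kind])
         for kind in samples}
overall = detection_rate(sum(detected.values()), sum(samples.values()))

for kind, rate in rates.items():
    print(f"{kind:<10} {detected[kind]:>2}/{samples[kind]:<2} {rate:5.1f} %")
print(f"{'Total':<10} {sum(detected.values())}/{sum(samples.values())} {overall:5.1f} %")
```

The breakdown is 80.0 % for PDF, 64.3 % for MS-Office, and 50.0 % for HWP, so the overall 71.6 % is carried mainly by the PDF samples, while the HWP figure rests on only ten samples.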

6 Conclusions

This paper proposed an automatic document malware analysis system that analyzes document files automatically. The static analysis extracted the scripts and shell code from the document file and detected any obfuscation or use of functions reported to be vulnerable. The dynamic analysis monitored behaviors and judged maliciousness against the malicious behavior rules to detect document files suspected of being malicious. In the test on actual document malware samples, the system achieved an overall detection rate of 71.6 %.

Although obtaining new samples is very important for increasing the detection rate of document-based malware, there is no efficient sample collection channel in Korea. In the future, a Web-based document file analysis service that is open to general users, like Wepawet, is needed to secure new samples.

Notes

Acknowledgments

This research was supported by the KCC (Korea Communications Commission), Korea, under the R&D program supervised by the KCA (Korea Communications Agency) (KCA-2012-(10912-06001)).

References

  1. Park CS (2010) An email vaccine cloud system for detecting malcode-bearing documents. J KMS 13(5):754–762
  2. Han KS, Shin YH, Im EG (2010) A study of spam spread malware analysis and countermeasure framework. J SE 7(4):363–383
  3.
  4. Fratantonio Y, Kruegel C, Vigna G (2011) Shellzer: a tool for the dynamic analysis of malicious shellcode. In: Proceedings of the international symposium on RAID, pp 61–80
  5. Ulrich B, Imam H, Davide B, Engin K, Christopher K (2009) Insights into current malware behavior. In: 2nd USENIX workshop on LEET, 2009
  6. CWSandbox: Behavior-based Malware Analysis. http://mwanalysis.org/
  7. Marco C, Christopher K, Giovanni V (2010) Detection and analysis of drive-by-download attacks and malicious JavaScript code. In: Proceedings of the WWW conference, 2010

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Hong-Koo Kang (1)
  • Ji-Sang Kim (1)
  • Byung-Ik Kim (1)
  • Hyun-Cheol Jeong (1)
  1. Team of Security R&D, Korea Internet and Security Agency, Songpa-gu, South Korea
