Detecting obfuscated malware using reduced opcode set and optimised runtime trace
The research presented investigates the optimal set of operational codes (opcodes) that creates a robust indicator of malicious software (malware), and also determines the program execution duration required for accurate classification of benign and malicious software. The features extracted from the dataset are opcode density histograms, captured during program execution. The classifier used is a support vector machine, configured to select those features that produce the optimal classification of malware over different program run lengths. The findings demonstrate that malware can be detected using dynamic analysis with relatively few opcodes.
Keywords: Packers, Polymorphism, Metamorphism, Malware, Obfuscation, Dynamic analysis, Machine learning, SVM
The malware industry has evolved into a well-organized, billion-dollar marketplace operated by well-funded, multi-player syndicates that have invested large sums of money into malicious technologies capable of evading traditional detection systems. To combat these advancements in malware, new detection approaches that mitigate the obfuscation methods employed by malware need to be found. A detection strategy that analyzes malicious activity on the host environment at run-time can foil malware attempts to evade detection. The proposed approach is the detection of malware using a support vector machine (SVM) on features (opcode density histograms) extracted during program execution. The experiments use feature filtering and feature selection to investigate all the Intel opcodes recorded during program execution.
While the full spectrum of opcodes is recorded, feature filtering is applied to narrow the search scope of the feature selection algorithm, which is applied across different program run-lengths. This research confirms that malware can be detected during the early phases of execution, possibly prior to any malicious activity.
“System overview” section describes the experimental framework and “Test platform” section details the test platform used to capture the program traces. “Dataset creation” section explains the dataset creation and is followed in “Opcode pre-filter” section with a description of the filtering method used. “Support vector machine” section introduces an SVM and describes the feature selection process. The results and observations are reviewed in “Discussion” section. Finally, “Conclusion” section concludes with a summary of the findings.
This research is an investigation into malware detection using N-gram analysis and is an extension of the work presented in . However, a summary of the related research is given here to aid the discussion within this paper. Typical analysis approaches involve Control Flow Graphs (CFG), State Machines (modelling behaviour), analysing stack operations, taint analysis, API calls and N-gram analysis.
Code obfuscation is a popular weapon used by malware writers to evade detection . Code obfuscation modifies the program code to produce a new version with the same functionality but with different Portable Executable (PE) file contents that are not known to the antivirus scanner. Obfuscation techniques such as packing are used by malware authors as well as legitimate software developers to compress and encrypt the PE. However, a second technique, polymorphism , is used by malware. Polymorphic malware uses encryption to change the body of the malware, governed by a decryption key that is changed each time the malware is executed, creating a new permutation of the malware on each new infection. Eskandari et al.  propose the use of program graph mining techniques for detecting polymorphic malware. However, these works employ sub-graph matching to classify and detect malware. Such API-based methods are easily subverted by changing the API call sequence or adding extra API calls that have no effect except to disrupt the call graph.
Sung et al.  proposed an anomaly-based detection using API call sequences to detect unknown and polymorphic malware, using a Euclidean distance measurement between the alignments of different API call sequences. The API sequence alignment approach proposed by Sung is effectively a signature-based approach, since it ignores the frequency of the API calls.
Tian et al.  explored a method for classifying Trojan malware and demonstrated that function length plays a significant role in classifying malware and if combined with other features could result in an improvement in malware classification. Unfortunately, these techniques are easily subverted with the addition of innocuous API calls. Sami et al.  also propose a method of detecting malware based on mining API calls statically gathered from the Import Address Tables (IAT) of PE files.
Lakhotia et al.  investigated stack operations as a means to detect obfuscated function calls. Their method modelled stack operations based on the push, pop and ret opcodes. However, the approach failed to detect obfuscation when the stack is manipulated using other opcodes.
Bilar  demonstrated using static analysis that Windows PE files contain different opcode distributions for obfuscated and non-obfuscated code. Bilar’s findings showed that opcodes such as adc, add, inc, ja, and sub could be used to detect malware.
In other research, Bilar  used statically generated CFG to show that a difference in program flow control structure exists between benign and malicious programs. Bilar concluded that malware has a simpler program flow structure, less interaction, fewer branches and less functionality than benign software.
More recently, research carried out by Agrawal et al.  also demonstrated a difference in the program flow control of malicious and benign software. Agrawal used an abstracted CFG that considered only the external artefacts of the program and used an ‘edit distance’ to compare the CFGs of programs. His findings show a difference in the flow control structure between benign and malicious programs.
N-gram analysis is the examination of sequences of bytes that can be used to detect malware. Using a machine learning algorithm, Santos et al.  demonstrated that N-gram analysis could be used to detect malware.
Santos et al.  perform static analysis on PE files to examine the similarity between malware families and the differences between benign and malicious software. Analysis with N-gram (N = 1) showed considerable similarity between families of malware, but no significant difference between benign and malicious software could be established. In a later paper, Santos et al. evaluated several machine learning algorithms  and showed that malware detection is possible using opcodes. Anderson et al.  combine both static and dynamic features in a multiple kernel learning framework to find a weighted combination of the data sources that produced an effective classification.
Shabtai et al.  used static analysis to evaluate the influence of N-gram sizes (N = 1–6) to detect malware using several classifiers and concluded that N = 2 performed best. Moskovitch et al.  also used N-gram analysis to investigate malware detection using opcodes and his findings concurred with Shabtai. Song et al.  explored the effects of polymorphism and confirmed that signature detection is easily evaded using polymorphism and is potentially on the brink of failure.
Due to the weaknesses of static analysis and the increase in obfuscated malware, it is difficult to ensure that all the code is thoroughly inspected. With the increasing amount of obfuscated malware being deployed, this research focuses on dynamic analysis (program run-time traces). Other dynamic analysis approaches use API calls to classify malware, but these can easily be obfuscated by malware writers. Therefore, these experiments seek to identify run-time features (below the API calls) that can be used to identify malware. For this reason, the research investigates opcode density histograms obtained during program run-time as a means to identify malware.
‘Test Platform’: The program samples are executed within the controlled environment to create program run-time traces.
‘Dataset Creation’: Each program trace is parsed and sliced into 14 different program run-lengths, creating 14 unique datasets defined by the number of opcodes executed.
‘Pre-Filtering’: A filter is applied to reduce the number of opcodes (features) that the SVM needs to process; thereby reducing the computational overhead during the SVM training phase.
‘SVM Model Selection’: is a process of selecting hyper-parameters (regularisation and kernel parameters) to achieve good out-of-sample generalisation.
A native environment would provide the best platform in terms of the least tell-tale signs of a test environment and thereby mitigate any attempts by the malware to detect the test environment and exit early. However, other considerations need to be taken into account, such as ease of running the malware trace analysis.
A virtual platform is selected (QEMU-KVM), as the hypervisor provides isolation of the guest platform (Windows 7 OS test environment) from the underlying host OS and incorporates a backup and recovery tool that simplifies the removal of infected files. In addition to the virtual platform, a debugger is used to record the run-time behaviour of the programs under investigation. A plethora of debugging tools exist, with popular choices for malware analysis being IDA Pro, Ollydbg and WinDbg32 .
The Ollydbg debugger is chosen to record the program traces as it utilizes the StrongOD plug-in, which conceals the debugger’s presence from the malware. When a debugger loads a program, the environment settings are changed, which enables the debugger to control the loaded program. Malware uses techniques to detect debuggers and avoid being analysed. StrongOD mitigates many of the anti-analysis techniques employed by malware; for an in-depth discussion of these techniques see the work in [19, 20].
Operational codes (Opcodes) are referred to as assembly language or machine language instructions and are CPU operations. They are usually represented by assembly language mnemonics.
An instruction consists of an opcode and operands. Opcodes, by themselves, are significant  and, therefore, only the opcodes are harvested, with the operands being redundant.
The program traces are created by recording the run-time opcodes that are executed when a program is run; the opcode densities for each program trace are then calculated using the parser described below.
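As a minimal sketch, a parser of this kind might strip each trace line down to its mnemonic and tally densities. The trace format assumed here (one `address mnemonic operands` line per executed instruction) is an illustrative assumption, not the authors' actual format:

```python
from collections import Counter

def opcode_densities(trace_lines):
    """Count opcode mnemonics in a run-time trace and normalise to densities.

    Each trace line is assumed (for illustration) to look like
    '00401000 mov eax, ebx'; only the mnemonic (second field) is kept,
    and the operands are discarded as redundant.
    """
    counts = Counter()
    for line in trace_lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1].lower()] += 1
    total = sum(counts.values())
    # Densities sum to 1 so that traces of different lengths are comparable.
    return {op: n / total for op, n in counts.items()} if total else {}
```

The resulting dictionary is the density histogram for one trace; one histogram per program forms a row of the dataset.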
The dataset is created by expressing the features as a set of opcode densities, extracted from the run-time traces of Windows PE files. The dataset consists of 300 benign Windows PE files taken from the ‘Windows Program Files’ directory, and 350 malware files (Windows PE) downloaded from Vxheaven . The datasets are constructed from different program run lengths, creating 14 distinct datasets. These new datasets are created by cropping the trace files into lengths based on the number of opcodes (1k-opcodes, 2k-opcodes, etc.) prior to constructing a density histogram for each cropped trace file. The dataset creation starts by cropping the original traces to 1k opcodes and creating a density histogram; this is repeated for run lengths of 2k, 4k, 8k, 16k, …, 4096k and 8192k opcodes.
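The cropping step can be sketched as follows; this is a simplified illustration (assuming 1k = 1000 opcodes), not the authors' parser:

```python
def crop_runs(opcodes, lengths_k=(1, 2, 4, 8, 16, 32, 64, 128,
                                  256, 512, 1024, 2048, 4096, 8192)):
    """Slice one trace (a list of mnemonics) into prefixes of 1k, 2k, ...,
    8192k opcodes, yielding one histogram-ready sub-trace per run length.

    Run lengths longer than the trace itself are skipped, since a shorter
    trace cannot be cropped to them.
    """
    for k in lengths_k:
        n = k * 1000  # '1k' taken as 1000 opcodes here, for illustration
        if n > len(opcodes):
            break
        yield k, opcodes[:n]
```

Feeding each cropped prefix through the density-histogram parser produces the 14 datasets, one per run length.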
The computational effort associated with N-gram analysis is often referred to as the ‘Curse of dimensionality’ and was first coined by Bellman in 1961 to describe the exponential increase in computational effort associated with adding extra dimensions to a domain space. Using an SVM to examine all the opcode permutations over the complete opcode range creates a computational problem due to the high number of feature permutations produced.
To reduce the computational effort, the search is restricted to those features that contain the most information. This is achieved by applying a filtering process that ranks features according to the information that they contain and that is likely to be useful to the SVM . Each feature is assigned an importance value using eigenvectors, thereby ranking the feature’s usefulness as a means of classification.
PCA compresses the data by mapping it into subspace (feature space) and creating a set of new variables. These new variables (feature space) that define the original data are called principal components (PCs), and retain all of the original information in the data. The new variables (PCs) are ordered by their contribution (usefulness/eigenvalue) to the total information.
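A minimal sketch of such an eigenvalue-based ranking is given below. Scoring each feature by its absolute PC loadings weighted by the eigenvalues is one common choice, assumed here for illustration; the paper's exact filter may differ:

```python
import numpy as np
from sklearn.decomposition import PCA

def rank_features_by_pca(X):
    """Rank features by their eigenvalue-weighted contribution to the PCs.

    X: (samples x features) matrix of opcode densities.
    Returns feature indices ordered most informative first.
    """
    pca = PCA().fit(X)
    # components_ is (n_pcs x n_features); explained_variance_ holds the
    # eigenvalues. Weight each feature's |loading| by the eigenvalue of
    # the PC it loads onto, then sum across PCs.
    scores = np.abs(pca.components_.T) @ pca.explained_variance_
    return np.argsort(scores)[::-1]
```

Only the top-ranked opcodes then need to be explored by the SVM's feature selection, narrowing the search space.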
However, high ranking features such as rep, mov, add, etc. remain consistently high over the different program run lengths and the lowest ranking features such as lea, loopd, etc. remain consistently low over the different program run lengths. Considering the mid-ranking features, it can be seen that significant variations occur with different program run lengths.
Splitting these features into their opcode categories: arithmetic (sub, dec); logic (xor); and flow control (je, jb, jmp, pop, nop and call) infers that the program structure (flow control) changes with different program run lengths. Therefore, in the following experiment, the filter is run for each program run length to ensure the optimum feature selection.
Support vector machine
SVMs are classifiers that rely heavily on the optimal selection of hyper-parameters. A poor choice of values for a hyper-parameter can lead to poor performance in terms of an overly complex hypothesis that leads to poor out-of-sample generalisation. The task of searching for optimal hyper-parameters, with respect to the performance measures (validation), is called ‘SVM Model Selection’.
Parameter grid search
Herbrich et al.  demonstrated that, without normalisation, large values can lead to over-fitting, thereby reducing the out-of-sample generalisation. Normalisation can be performed in either the ‘input space’ or the ‘feature space’.
Input space normalisation, as defined in (11), is implemented in the experiments presented in this paper.
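Eq. (11) is not reproduced here; as a sketch, per-feature min-max scaling into [0, 1] is assumed purely for illustration of what input-space normalisation looks like:

```python
import numpy as np

def normalise_input_space(X):
    """Scale each feature (column) of X into [0, 1].

    This per-feature min-max scaling is an assumed stand-in for the
    paper's Eq. (11), which is not shown in this excerpt.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    return (X - lo) / span
```

Scaling in the input space keeps all opcode densities on a comparable range before the kernel is applied, so no single high-density opcode dominates the distance computation.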
The selection of an appropriate kernel is key to the success of any machine learning algorithm. A linear kernel generally generalises well from the training phase to good test results where the data can be linearly separated. However, as shown in Fig. 5, the data is not linearly separable. Therefore, an RBF kernel (a non-linear decision boundary) is used, as it yields a greater accuracy than a linear kernel, as illustrated in Figs. 5 and 6.
The correct adjustment of the RBF kernel parameters significantly affects the SVM’s ability to classify correctly, and poorly adjusted parameters can lead to either overfitting or underfitting. There are two parameters, C and λ. C is used to adjust the trade-off between bias and variance errors, and λ determines the width of the decision boundary in feature space.
Two grid searches are performed to find the values of λ and C that produce an optimal SVM configuration. The first search is a coarse-grained search, ranging from λ = 1e−5 to 1e5 and C = 0–10. This is followed by a fine-grained search (increments of 0.1) over a reduced range (λ = ±10, C = 0–3). The optimal performance was established with λ = 1 and C = 0.8.
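The coarse-then-fine search can be sketched with scikit-learn, where the paper's λ maps onto the library's `gamma` parameter. The exact grid steps below are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def tune_rbf_svm(X, y):
    """Coarse-then-fine grid search over the RBF-SVM parameters.

    Coarse pass: gamma (the paper's lambda) on a log scale 1e-5..1e5,
    C from 0.1 to 10. Fine pass: 0.1 increments around the coarse
    optimum for gamma, and C over 0.1..3.
    """
    coarse = GridSearchCV(
        SVC(kernel="rbf"),
        {"gamma": np.logspace(-5, 5, 11), "C": np.linspace(0.1, 10, 10)},
        cv=10,
    ).fit(X, y)
    g = coarse.best_params_["gamma"]
    fine = GridSearchCV(
        SVC(kernel="rbf"),
        {"gamma": np.arange(max(g - 1, 0.1), g + 1, 0.1),
         "C": np.arange(0.1, 3.0, 0.1)},
        cv=10,
    ).fit(X, y)
    return fine.best_estimator_, fine.best_params_
```

Each candidate is scored by tenfold cross-validation, mirroring the validation used throughout the experiments.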
Before continuing with the experiments, the results need to be placed in context. The measure of malware detection is based on:
False positive (FP): this is also known as a false alarm and can have a significant impact on malware detection systems. For example, if an antivirus program is configured to delete or quarantine infected files, a false positive can render a system or application unusable.
False negative (FN): this occurs when an anti-virus security product fails to detect an instance of malware. This can be due to a zero-day attack or to malware using obfuscation techniques to evade detection . The impact of this security threat depends on whether the detection method is the last line of defence in the overall malware detection system.
False positives present a major problem, in that networks and host machines can be taken out of service by the protective actions taken as a consequence of alarms, such as quarantining or deleting a critical file. However, this paper focuses on end-point detection, where false negatives present a security threat. Therefore, this research focuses on the minimisation of the FN rate along with the detection accuracy.
D = 1 produces a detection accuracy ranging from 72.3 to 90.8 % (average 85.1 %) and a FN rate ranging from 0 to 10.79 % (average 5.4 %);
D = 1.5 produces a detection accuracy ranging from 70.8 to 90.8 % (average 84.4 %) and a FN rate ranging from 0 to 9.25 % (average 4.96 %);
D = 2 produces a detection accuracy ranging from 70.8 to 90.8 % (average 84.4 %) and a FN rate ranging from 0 to 6.18 % (average 2.98 %);
D = 4 produces a detection accuracy ranging from 70.8 to 81.5 % (average 75.1 %) and a FN rate ranging from 0 to 3.1 % (average 0.44 %).
Considering the average results: D = 1 and D = 1.5 yield very similar results, with good detection accuracies of 85.1 and 84.4 % respectively, but both produce a high FN rate of approximately 5 %. D = 4 produces an excellent FN rate of 0.44 %; however, the corresponding detection accuracy is low at 75.1 %. D = 2 yields a compromise between D = 1.5 and D = 4, with a detection accuracy of 84.4 % and a FN rate of 2.98 %.
The results show that a lower value of D achieves a higher detection rate at the expense of the FN rate. A greater value of D results in lower FN rate at the cost of the detection rate. D = 2 delivers a low FN rate without overly penalising the detection accuracy and is therefore chosen as the steering function (13) for the remainder of the experiments carried out in this paper.
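Eq. (13) itself is not reproduced in this excerpt. Purely to illustrate the trade-off that D controls, a hypothetical steering function of the observed shape (reward accuracy, penalise FN with weight D) might look like:

```python
def steering_score(accuracy, fn_rate, D=2.0):
    """Hypothetical steering function illustrating the role of D.

    This is NOT the paper's Eq. (13), which is not shown here; it only
    reproduces the described behaviour: a larger D punishes false
    negatives more heavily, trading detection accuracy for FN rate.
    """
    return accuracy - D * fn_rate
```

Under such a score, a configuration with slightly lower accuracy but a much lower FN rate wins once D is large enough, which matches the D = 2 compromise chosen above.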
Program run length versus %optimisation value
Note, the columns ‘1 to 20’ represent the number of opcodes in each test, while the rows ‘1, 2, 4, 8, …, 8192’ represent the program run lengths in k-opcodes. The optimisation value is shown against each number of opcodes and program run length; i.e. the first row shows the cost function value (a measure of performance) for a single opcode feature, with the maximum optimisation value for each program run length, and the second row shows the cost function values for two opcode features, with the maximum optimisation value for each program run length, and so on. In Table 1, the maximum values are identified with an underscore. It can be seen that a point is reached where adding more features results in a reduction of the maximum value; the assumption made is that over-fitting is occurring. As already mentioned, the grid search is guided by the performance metric in Eq. (13) and is measured using tenfold cross-validation.
While an optimal detection rate is a vital characteristic of any detection system, FP and FN rates need to be considered. These experiments are aimed at end host detection, and it can be argued that FN rates outweigh the importance of FP rates. Therefore, the aim of our approach is to convict all suspicious files and let further malware analysis determine their true status.
In a final testing phase, bootstrapping is introduced to ensure a robust measure of out-of-sample generalisation performance. The concern is that sample clustering may occur, as many of the malware samples belong to the same malware family and often have similar file names. The parser reads files from the directory (in alphabetical order) and creates the density histograms, which may result in clustering of malware samples that belong to the same family. Therefore, randomly selecting test samples prior to the SVM processing ensures that the validation data is random.
The premise of bootstrapping is that, in the absence of the true distribution, conclusions about the distribution can be drawn from the samples obtained. Parke et al.  suggest that 200 iterations are sufficient to obtain mean and standard deviation values of statistical importance.
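A minimal bootstrap sketch of this procedure (the metric function and iteration count are parameters; 200 iterations follows the suggestion attributed to Parke et al.):

```python
import numpy as np

def bootstrap_metric(metric_fn, X, y, iterations=200, seed=0):
    """Estimate the mean and standard deviation of a performance metric
    by resampling the dataset with replacement.

    metric_fn(X_res, y_res) -> float evaluates the metric on one resample.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    vals = []
    for _ in range(iterations):
        idx = rng.integers(0, n, n)  # sample n indices with replacement
        vals.append(metric_fn(X[idx], y[idx]))
    vals = np.asarray(vals)
    return vals.mean(), vals.std()
```

Because each resample draws samples independently of directory order, family clustering in the original file ordering no longer biases the validation estimate.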
While there is no universally defined value that specifies a ‘good detection’ system, the values obtained in these experiments need to be placed in context. Curtsinger et al.  defined 0.003 % FN as an ‘extremely low false negative’ system, and Dahl  classified a system with < 5 % FN as having a ‘reasonably low’ false negative rate. Ye et al.  examined several detection methods and found that FN rates varied significantly with different classifiers, such as Naive Bayes with 10.4 % FN; SVM with 1.8 % FN; Decision Tree (J48) with 2.2 % FN; and the Intelligent Malware Detection System (IMDS) with 1.6 % FN.
While our approach fails to satisfy the criteria of ‘extremely low’ FN, it does meet the criteria for a ‘reasonably low’ FN rate for the program run lengths of 1k and above 8k.
It can be seen (Fig. 6), that adding more features does not always improve the results. The performance of both the detection accuracy and the FN rate peaks at 13 features (average), above which the performance degrades. This degradation is pervasive in all the program run lengths. It is believed that this is likely due to over-fitting caused by too much variance being introduced by the additional features. Again, the smallest variance occurs with 13 features (average).
Optimum features for malware detection at selected run lengths (K-opcodes)
The performance rates are listed in the right-hand column (taken from Table 1) and correspond to different program run lengths as indicated in the left-most columns i.e. 1k-opcodes, 2k-opcodes, 4k-opcodes, 8k-opcodes, etc. The central columns list the opcodes used to achieve these results.
Encryption-based malware often uses the xor opcode to perform its encryption and decryption. Table 2 shows that xor frequently appears in the shorter program run lengths. This frequent appearance of xor is expected, as the unpacking/decrypting occurs at the start of a program. An exception is that the 4k-opcode run length does not use xor to classify benign and malicious software.
More is not always best; the optimum number of features varies with the program run length, but typically (average) 13 opcodes yield the best results. As an example, the maximum detection accuracy (83.4 %) for the 1k-opcode program run length is achieved with 14 features. However, adding more features decreases the detection accuracy, which is typical of all the program run lengths.
Table 2 shows that xor is used as an indicator of malware for shorter program run lengths, i.e. 1k-opcodes to 128k-opcodes (excluding 4k-opcodes). This is expected behaviour, as encrypted malware frequently uses xor to perform its decryption, which is normally exercised in the early stages of program execution.
An exception is the absence of xor in the 4k-opcode run length, which is not clearly understood beyond the fact that the machine learning algorithm did not choose it as an optimal feature for this program run length, i.e. other features performed better for this particular run length.
While the FN rates are not ideal, many of the program run lengths (excluding 2 and 4k-opcodes) can be considered to have a ‘reasonably low’ FN rate (FN < 5 %). The relatively short program run lengths of 2 and 4k-opcodes have high FN rates of 8.47 and 13.49 % respectively. The other program run lengths present good detection rates of 81–89 %, with FN rates between 1.58 and 5.87 %.
The maximum detection accuracy of 86.3 %, with the lowest FN rate (1.58 %), is obtained for a program run length of 32k-opcodes. However, a program run length of 1k-opcodes produces a good detection accuracy of 83.4 %, with a respectable FN rate of 4.2 %.
The bottom row (Occur) of Table 2 shows the number of times a particular opcode was selected by the classifier (SVM) as an indicator of malware. For example, opcode add was chosen 13 times out of 14 program run lengths, whereas opcode lods was only chosen once, for the 8k-opcode run length. What is clear is that the opcodes chosen (by the SVM) change relative to different program run lengths. Our observations show that shorter program run lengths rely on ‘logic and arithmetic’ and ‘flow control’ opcodes, whereas the longer program run lengths rely more on ‘flow control’ opcodes. This infers that detection at longer program run lengths relies on the complexity of the call structure of a program. This is consistent with Bilar’s  finding that malware has a less complex call structure than non-malicious software.
The experimental work carried out in this research investigated the use of an SVM to detect malware. The features used by the SVM were derived from program traces obtained during program execution. The findings indicate that encrypted malware can be detected using opcodes obtained during program execution. The investigation continued to establish an optimal program run length for malware detection. The dataset was constructed from run-time opcodes, compiled into density histograms and then filtered prior to SVM analysis. A feature selection cost function was identified and used to steer the SVM for optimal performance. The full spectrum of opcodes was examined for information, and the search for the optimal opcodes was quickly narrowed using an eigenvector filter.
The findings show that malware detection is possible for very short program run lengths of 1k-opcodes, producing a detection rate of 83.41 % and a FN rate of 4.2 %. Using mid-range program run lengths also yields sound detection rates; however, their corresponding FN rates deteriorate. The 1k-opcode characteristics provide a basis to detect malware during run-time, potentially before a program can complete its malicious activity, i.e. during its unpacking and deciphering phase.
The research presented provides an alternative malware detection approach that is capable of detecting obfuscated malware and possibly zero-day attacks. With a small group of features and a short program run length, a real-world application could be implemented that detects malware with minimal computation, enabling a practical real-world solution to detecting obfuscated malware.
We have read the ICMJE guidelines and can confirm that the authors PO, SS, KM contributed intellectually to the material presented in this manuscript. All authors read and approved the final manuscript.
We the authors of this paper confirm that we do not have any competing financial, professional or personal interests that would influence the performance or presentation of the work described in this manuscript.
- 4. Sung A, Xu J, Chavez P, Mukkamala S, et al (2004) Static analyzer of vicious executables (SAVE). In: Proceedings of the 20th annual computer security applications conference, 2004
- 5. Tian R, Batten L, Islam R, et al (2009) An automated classification system based on the strings of trojan and virus families. In: Proceedings of the 4th international conference on malicious and unwanted software: MALWARE, 2009, pp 23–30
- 6. Sami A, Yadegari B, Rahimi H, et al (2010) Malware detection based on mining API calls. In: Proceedings of the 2010 ACM symposium on applied computing, 2010, pp 1020–1025
- 9. Bilar D (2007) Callgraph properties of executables and generative mechanisms. AI Communications, special issue on Network Analysis in Natural Sciences and Engineering 20(4):231–243
- 10. Agrawal H (2011) Detection of global metamorphic malware variants using control and data flow analysis. WIPO Patent No. 2011119940, 30 September 2011
- 11. Santos I, Penya YK, Devesa J, Garcia PG (2009) N-grams-based file signatures for malware detection. S3Lab, Deusto Technological Foundation
- 12. Santos I, Brezo F, Nieves J, Penya YK, Sanz B, Laorden C, Bringas PG (2010) Opcode-sequence-based malware detection. In: Proceedings of the 2nd international symposium on engineering secure software and systems (ESSoS), Pisa (Italy), 3–4 February 2010, LNCS 5965, pp 35–43
- 14. Anderson B, Storlie C, Lane T (2012) Improving malware classification: bridging the static/dynamic gap. In: Proceedings of the 5th ACM workshop on security and artificial intelligence, October 2012, pp 3–14. ACM
- 16. Moskovitch R, Feher C, Tzachar N, Berger E, Gitelman M, Dolev S, Elovici Y (2008) Unknown malcode detection using opcode representation. In: Proceedings of the 1st European conference on intelligence and security informatics (EuroISI08), 2008, pp 204–215
- 17. Song Y, Locasto M, Stavrou A (2007) On the infeasibility of modeling polymorphic shellcode. In: ACM CCS, 2007, pp 541–551
- 18. Eilam E (2011) Reversing: secrets of reverse engineering. Wiley, New York
- 19. Ferrie P (2011) The ultimate anti-debugging reference. http://pferrie.host22.com/papers/antidebug.pdf. Written May 2011, last accessed 11 October 2012
- 20. Chen X, Andersen J, Mao ZM, Bailey M, Nazario J (2008) Towards an understanding of anti-virtualization and anti-debugging behavior in modern malware. In: ICDSN proceedings, 2008, pp 177–186
- 21. VX Heaven (2013) Malware collection. http://vxheaven.org/vl.php. Last accessed Oct 2013
- 28. Dahl G, Stokes JW, Deng L, Yu D (2013) Large-scale malware classification using random projections and neural networks. Poster (MLSP-P5.4), ICASSP 2013, Vancouver, Canada, IEEE Signal Processing Society, 2013
- 29. Ye Y, Wang D, Li T, Ye D (2007) IMDS: intelligent malware detection system. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 2007
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.