Introduction

Cybersecurity has become a pressing concern for enterprises across the globe, given that information is now among their most valuable and sensitive assets. In this information age, organizations face an ever-expanding and increasingly sophisticated spectrum of malware-based cyber threats. Malware, short for “malicious software”, is a set of instructions intended to cause severe damage to enterprises, infrastructure, industrial processes and digital systems. It is a potent cyber weapon used for unauthorized access, cyber espionage, cyber terror, identity theft, data exfiltration or corruption, service interruption or failure, holding data hostage for ransom, etc. In 2018, more than 430 million unique malware samples were detected, an annual increase of 36% [1]. Significant annual increases of 25% and 1000% were also observed in the use of malware and malicious PowerShell scripts, respectively. According to Kaspersky Labs (2019), more than 100 million different hosts were attacked by mid-2019. In 2018, more than 889,452 internet banking users were targeted by banking Trojans, an increase of 15.9% over the previous year. According to a recent analysis by Juniper Research, the financial impact of data breaches will increase by 11% per year, rising from $3 trillion to $5 trillion in 2024. Therefore, every business must protect its information-based assets, since even a single attack can result in critical data loss. There are several classes of malware, including Ransomware [11], Trojan [14], Key Logger [3], Backdoor [21], Launcher [13], Remote Access Toolkits (RAT) [33], Spam-Sending malware [34], etc. The approaches for malware detection are either signature-based [2] or behavior-based [31]. The former is good at identifying known attacks without producing an overwhelming number of false alarms [3], but it requires frequent manual updates of the database of rules and signatures.
The latter approach, on the other hand, can generalize host- and network-related signatures to identify the presence of unwanted code or activity on victim computers or networks. Packers [46], encryption [5], polymorphism [31] and obfuscation [28] techniques can easily bypass signature-based detection, because such detection only performs pattern or string matching [11]. Behavior-based approaches [36], which focus on identifying patterns in file activity, registry activity and API calls [8], are based on either static [7] or dynamic analysis [6]. Dynamic analysis requires executing the malicious code [35] in a controlled setup, i.e., a sandbox, and is often slow, resource intensive and unsuitable for deployment in a production environment, as also discussed in [22]. Moreover, owing to the geometric rise in zero-day malware, existing approaches have become less effective at detecting zero-day attacks, and there is a dire need for automated malware detection and classification systems equipped with machine learning techniques [9]. Machine learning can be either supervised or unsupervised. Supervised learning, or discriminative deep architectures, trains over labelled data for tasks such as classification, regression or predictive analytics, whereas unsupervised learning, or so-called generative architectures, draws inferences from datasets consisting of input data without labels [43].

Given the ongoing growth in the number of malware samples, the time complexity of malware analysis, the acute shortage of domain experts and the demand for the earliest possible detection, considerable research is being conducted on machine-learning-based techniques for automated malware analysis and classification [10, 19]. However, most static analysis-based approaches are supervised in nature. The availability of an up-to-date, labelled malware dataset is also a major hurdle for malware analysts. These limitations and gaps motivated the development of an automated unsupervised malware analysis system that inspects portable executables and makes a classification decision based on static analysis. Moreover, a suitable representation of feature vectors is essential for making malware classification decisions. This paper proposes a progressive deep unsupervised framework (PROUD-MAL) for classifying Windows PE files using static analysis of executables. The major contributions are described as follows:

  1. (a)

    The purpose of this research is to present a framework for unsupervised classification of Portable Executables (PEs) using static features. We term this framework PROUD-MAL. To this end, we propose a two-phase cascaded formulation of progressive unsupervised clustering followed by an attention-based deep neural network for static feature-based malware classification.

  2. (b)

    Attention models have shown promising results in various domains such as image analysis and natural language processing but, to the best of our knowledge, there is no research on applying an attention-based mechanism to malware classification using unsupervised clustering over static features. To this end, we propose a Feature Attention-based Neural Network (FANN) architecture for malware classification. The Attention Block (AB) considers the correlation of a feature with the target or with other features, in addition to the feature weight. It puts relatively more weight on the features that contributed more to minimizing the validation loss and maximizing the classification accuracy.

  3. (c)

    We also collected a novel real-time malware dataset comprising 15,457 (25 GB) PE samples gathered over a period of seven months (200 days). The dataset was collected by deploying low- and high-interaction honeypots as well as an enterprise endpoint security solution over a large organizational network. It is publicly available to the research community.

  4. (d)

    The quantitative assessment shows that the proposed model outperformed state-of-the-art supervised as well as unsupervised approaches. The high classification accuracy demonstrates the significance and utility of the proposed framework.

The rest of the paper is organized as follows: Section “Background and context” describes the background and structure of Windows-based PEs. Section “Related work” reviews the related work. Section “Methodology and architecture” describes the dataset acquisition, data pre-processing, feature extraction and the proposed framework, i.e., PROUD-MAL, followed by the FANN architecture. Section “Experiments and results” presents the implementation details, including the experimental setup, the results obtained and a discussion. Section “Conclusion and future direction” gives the concluding remarks followed by future directions.

Background and context

Malware can be an executable or a non-executable binary, and its classification is based on either dynamic or static analysis. Dynamic analysis executes a PE in a controlled environment to study its behavior, including auto-start extensibility points, function calls, parameter analysis and data flow tracking [11]. It is more time consuming and computationally expensive, so dynamic analysis is rarely adopted in production environments. Static analysis inspects the code [12] without executing it in a controlled setup. It involves decompressing or unpacking the PE if it has been encoded by a third-party packer [11], and disassembling it to obtain the code residing in memory [14]. Disassembler and memory dumper software packages, e.g., OllyDump and LordPE, can be used for this purpose. A Windows-based PE file can be an executable, a Dynamic Link Library (DLL) [13] or object code, and it inherits many features from the Unix-specific Common Object File Format (COFF). The PE content is semantically structured [21], and understanding this structure is important for good analysis. The format is supported by various architectures, including Intel, variants of ARM and AMD instruction sets. A PE has numerous predefined blocks, including a number of headers and sections. Each section has a header that provides information about its address and size. The predefined blocks are explained as follows:

  1. (a)

    DOS Header: Identifies the file as an executable binary; it is also called the MZ header. It provides the four-byte offset address of the PE header.

  2. (b)

    DOS Stub: A small embedded program that displays an appropriate message whenever there is an attempt to run a PE file in DOS.

  3. (c)

    PE File Header (Signature): Identifies an executable file as a PE. It provides information about machine compatibility, the number of sections and symbols, the compiler time-stamp and the size of the optional header, i.e., the next unit.

  4. (d)

    Optional Header or Image Optional Header: Mandatory, contrary to its title. It provides details including the entry point address, OS version, image and data base, image and subsystem version, linker version and the sizes of the code, initialized and uninitialized data, stack and heap.

  5. (e)

    Data Directories: Follow the Optional Header. They give details about directories including export, import, exception, relocation table, global pointer, debug and load configuration.

  6. (f)

    Section Table & Header: Follows the data directories and provides PE file attributes, instructions to load the PE in memory, the virtual address, section name, characteristics, size of raw data, etc.

  7. (g)

    Sections: Contain the executable code, resources and operands of the PE, unlike the headers, which provide information about the executable. There are nine predefined sections. The name and description of each section are listed in Table 1. Not all sections are necessarily present in a PE. A missing .idata section does not mean there is no import table, as it may reside in the .data or .edata section.
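
The header layout described above can be illustrated with a minimal sketch that follows the DOS header to the PE signature and reads the fields of the PE file header. It uses only the Python standard library. The function names, the dictionary keys and the synthetic demo image are assumptions made for illustration; this is not the custom parser used in this work.

```python
import struct

# Field names of the 20-byte PE (COFF) file header, in on-disk order.
# The key names are illustrative choices, not taken from the paper.
FILE_HEADER_FIELDS = (
    "machine",
    "number_of_sections",
    "time_date_stamp",
    "pointer_to_symbol_table",
    "number_of_symbols",
    "size_of_optional_header",
    "characteristics",
)


def parse_pe_file_header(data: bytes) -> dict:
    """Return the PE file header fields of an image held in memory."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) header")
    # The four-byte offset of the PE header is stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(FILE_HEADER_FIELDS, values))


def build_demo_image() -> bytes:
    """Build a tiny synthetic header for the demo (not a loadable executable)."""
    dos_header = b"MZ" + b"\x00" * 58 + struct.pack("<I", 0x40)
    file_header = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 0xE0, 0x0102)
    return dos_header + b"PE\x00\x00" + file_header


if __name__ == "__main__":
    for name, value in parse_pe_file_header(build_demo_image()).items():
        print(f"{name}: {value:#x}")
```

For a real sample, the bytes would be read from disk inside an isolated environment and passed to `parse_pe_file_header`. The optional header and section table follow immediately after these 20 bytes.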

Table 1 PE sections

Related work

Several machine-learning-based approaches for malware detection have been proposed in the literature. Some of the research on machine learning algorithms specific to PE file malware classification is discussed here. Malware is a set of instructions developed to bring harmful consequences to organizations, their processes, networks and infrastructure. It can be an executable or a non-executable entity, and its detection is based on either static or dynamic analysis. In 2001, a machine learning framework was proposed for the classification of PE files using static analysis, with data mining techniques used to extract string and byte sequence features from the PE [15]. In 2009, researchers [16] extracted 5-gram byte sequences from the file header and applied a term frequency-inverse document frequency approach for classification. In 2013, a malware detection system was proposed for the analysis of PE files using byte sequences, also known as n-gram sequences, an approach that is less efficient and computationally expensive [17]. In 2015, researchers [18] proposed a heuristic-based detection technique for metamorphic malware that used static features for PE analysis. In that model, the file was disassembled using IDA Pro to extract the features. Multiple classification algorithms (j48, j48graft, LADTree, NBTree, Random Forest, REPTree) were used for the analysis and classification of PE files. The researchers highlighted that the classification accuracy depends on the model applied as well as on the disassembler chosen. In 2018, it was also shown that a machine learning model can learn from a sequence of raw bits without the explicit feature extraction of conventional malware classification practice [19]. The use of machine learning-based classifiers for malware intrusion detection is also a well-known approach in network analysis [25].
In addition to string extraction, researchers [30] have also used statistical approaches such as raw byte and byte entropy histograms. In [20], researchers presented an approach based on static analysis of features from the PE Optional Header fields, employing the Phi (ϕ) coefficient and the Chi-square (KHI2) score. In [23], features were extracted from system calls and submitted to a neural network for classification; using 170 samples, the approach obtained an Area Under the Curve (AUC) of 0.96. In [24], experiments were performed to identify the critical point at which to quarantine the activity of malicious code communicating with a remote command and control server. Researchers [26] presented a framework that protects application programs against malware on mobile platforms. In 2017, researchers used static analysis to extract key information, i.e., header strings and sequences, from the metadata of PE files. The model was trained over a dataset of 4783 samples using Random Forest and achieved 96% accuracy. The researchers in [42] designed a malware classification method for several malware variants based on signature prediction. The proposed solution was based on static analysis of features including strings, n-grams and hashes extracted from the PE header. In [27], the researchers proposed a malware detection system based on supervised learning. They devised a tool for feature extraction from the header of PE files. The system was then trained using supervised machine learning classification algorithms such as Support Vector Machine and Decision Trees. In [47], the authors proposed a machine learning-based approach built on Virtual Machine Introspection for malware detection in virtualized environments. The researchers extracted opcodes using static analysis and trained the classifier with selected features. Later, Term Frequency-Inverse Document Frequency (TF-IDF) and Information Gain (IG) were also applied as classification algorithms.
In 2019, researchers [29] proposed a malware detection approach for the IoT environment based on similarity hashing. In the proposed technique, scores of binaries were calculated to identify the similarity between malicious PEs. Numerous hashing techniques [21], including PEHash, Imphash and Ssdeep, were used. The researchers then integrated the hash results using fuzzy logic. Recently, attention models have shown promising results in tasks such as image analysis, machine translation, computer vision and natural language processing [32]. The attention mechanism helps the model focus on the most relevant features as required. Therefore, we employ an attention-based mechanism over static features, together with unsupervised clustering, for malware classification.

Methodology and architecture

This section explains the design of our proposed unsupervised framework, PROUD-MAL, for classifying Windows-based PEs using clustering based on static analysis. PROUD-MAL is a custom-built unsupervised framework composed of multiple modules, including novel dataset collection, dataset pre-processing and feature extraction, and unsupervised clustering of the malicious and benign PE samples, as illustrated in Figs. 1 and 2. The designed Feature Attention-based Neural Network (FANN) is then trained over pseudo labels. The proposed classifier is evaluated over the test dataset, which was kept hidden during training, as depicted in Fig. 3.

Fig. 1

Malware dataset collection and pre-processing

Fig. 2

Unsupervised framework for windows-based PE malware classification

Fig. 3

PROUD-MAL validation

Malware dataset acquisition

The first stage of the proposed framework is the collection of an indigenous dataset. In this work, a pilot attempt is made to collect a dataset of malware and benign samples, which will be extended in future research. A major obstacle to leveraging machine learning techniques for malware analysis is the lack of sufficiently large, labelled datasets containing both malicious and benign samples. Moreover, it is very important to keep the malware dataset up to date because of the ever-changing evasion approaches adopted by malware authors. Collecting malicious samples was difficult, and collecting benign samples was not easy either. To this end, we used two different approaches for collecting the malicious and benign samples, as illustrated in Fig. 1. First, we deployed low- and high-interaction honeypots as production units and intentionally configured them in a vulnerable way to collect malicious files and log unauthorized behavior. The low-interaction honeypot, i.e., Honeyd [34], and the high-interaction honeypot, i.e., SMB Honey Pot [4], were deployed over the enterprise organizational network to emulate the services frequently targeted by attackers and the production systems, respectively. Second, the Kippo-Malware collector and the Kaspersky endpoint security solution were also deployed over the enterprise organizational network to collect malware as well as benign samples. Benign PEs, including .exe and .dll files, were also collected from machines running licensed and updated versions of the Windows operating system, including Windows XP, 7, 8 and 10. Special precautions were taken to comply with licensing and regulatory requirements while collecting benign samples. Additional precautionary measures, such as the establishment and configuration of a sandbox environment for dataset collection and further processing, were also taken.
We collected 19,000 samples (31 GB) over a period of seven months (200 days). After dataset verification, these were reduced to 15,457 (25 GB) PE samples, comprising 8775 (17 GB) malicious and 6681 (8 GB) benign samples. The reduction resulted from filtering out corrupt and duplicate samples. Sample verification and duplicate removal were done using hash values. The dataset is divided in a 60:20:20 ratio for training, validation and testing, respectively, of the proposed model.
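
The hash-based duplicate removal and the 60:20:20 split described above can be sketched as follows. This is a minimal illustration using the Python standard library. The function names, the use of MD5 for de-duplication at this step, the random shuffle and the fixed seed are assumptions; the text does not state how samples were ordered or whether the split was stratified.

```python
import hashlib
import random


def deduplicate(samples):
    """Keep the first sample seen for each distinct MD5 digest."""
    seen = set()
    unique = []
    for blob in samples:
        digest = hashlib.md5(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique


def split_60_20_20(items, seed=0):
    """Shuffle and split items into training, validation and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # the seed is an illustrative choice
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    raw = [b"sample-a", b"sample-b", b"sample-a", b"sample-c"]
    unique = deduplicate(raw)
    train, val, test = split_60_20_20(range(10))
    print(len(unique), len(train), len(val), len(test))
```

Integer arithmetic is used for the subset sizes so that any remainder falls into the test subset rather than being lost to rounding.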

Data pre-processing and feature extraction

To prepare the dataset, a series of pre-processing steps was performed to transform the raw data into a meaningful format: identification of the file type, removal of corrupt and duplicate samples, unpacking of the binaries and verification of labels. MD5 hashes were used to ensure that the dataset does not contain any duplicate binaries. To ensure that only unpacked binaries are submitted for feature extraction, section names were examined using the PEStudio tool [45] to see whether any of them indicates a popular packer [46] such as UPX, ASPack or FSG.

Moreover, verification of labels is a significant activity. It was performed by deploying signature-based anti-virus solutions in parallel and, finally, by using the cloud-based VirusTotal service. We used the VirusTotal API [44] as well as the VirusTotal web interface to submit the binaries for verification. The VirusTotal API does not require the web interface for file submission. It is pertinent to mention that, whereas labelling samples in text, image or speech datasets is a relatively easy task, labelling samples as benign or malicious and verifying those labels was very time intensive. Handling malicious files requires extra precautionary measures, such as the establishment and configuration of a sandbox environment. During the process, several findings were observed for packed PE files, such as overlapping segments, non-standard version details, non-standard section names and a zero size of raw data that also results in a high virtual size of the section. It was also observed that some packers attempt to reduce entropy by embedding zero bytes in the data to bypass screening. Moreover, in malicious files, the data section is missing or has a relatively lower value (if present), and the permissions assigned to the section are inconsistent with standard practice. The resource size was also observed to be relatively small, as malicious files are mostly non-GUI. The study of compilation times revealed that malware is mostly compiled on non-working days and often does not have a genuine creation time. After dataset preparation and pre-processing, feature extraction was performed. Features were extracted by parsing the headers of Portable Executables (PEs). A custom parser was developed to read the PE headers and tokenize the features and their respective values. Finally, the tokens were organized in a CSV-compatible format. More than 35 features were extracted; a brief description of selected features is given below.

  • MD5 is a cryptographic hash of the file. It is a 128-bit value, usually written as 32 hexadecimal digits, and serves as a practically unique identifier for each file.

  • Machine represents the target machine, such as Intel 386, MIPS little endian, Motorola, etc.

  • Size of optional header is a mandatory feature irrespective of the name and provides information related to PE. It is included only for executable files and not for object files.

  • Characteristics represent attributes of the file such as base relocation address, local symbol, user program or system file, little-endian or big-endian architecture or whether file is DLL or not etc.

  • Major/Minor Linker Version is the version number of the linker that produced the file, stored in the header of the .dll or .exe file.

  • Code Size represents size of code (text) section.

  • Size of Uninitialized Data is the size of the uninitialized data (BSS) section.

  • Address of Entry Point is the address where the PE loader will begin execution; this address is relative to image base of the executable. It is the address of the initialization function for device drivers and is optional for DLL.

  • Base of Code is the pointer to beginning of the code section, which is relative to image base.

  • Image Base is the preferred address of the first byte of the executable when it is loaded in memory.

  • Section Alignment: The address assignment to PE requires section loading. The section alignment is set to 0x2000. This means that the code section starts at 0x2000 and the section after that starts at 0x4000.

  • File Alignment: Just like the section alignment, the data also needs to be loaded. It is set to 512 bytes or 0x200.

  • Major/Minor Operating System Version is the version supported by PE.

  • Major Image Version is the major version number of image.

  • Size of Image is the size of executable after being loaded into memory. It must be multiple of section alignment.

  • Size of headers represents the size of all headers, i.e., PE header, the optional header, DOS header.

  • Checksum is used for file validation at load time and to confirm whether a file is undamaged or has been corrupted.

  • Subsystem indicates the type of user interface required by the operating system.

  • Size of Stack Reserve is number of bytes allocated for stack and determines the stack region utilized by threads.

  • Size of Stack Commit is the amount of memory committed to the stack at startup.

  • Size of Heap represents the space to reserve for loading.

  • Loader Flags indicates whether to break on loading, debug on loading or use the default behavior.

  • Number of RVA is the number of relative virtual addresses in rest of the optional header. Each entry describes a location and size. The structures contain critical information about specific regions of the PE file.

  • Load Configuration size is usually used for exceptions. It is only utilized in Windows NT, 2000 and XP.

  • Section Minimum/Maximum/Mean Entropy is the minimum, maximum and mean entropy of the sections of a file, expressed as numeric values, and is used to check whether a file is packed or not. Higher entropy usually means that the file is malicious.
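
The entropy features in the last item can be illustrated with a short sketch. It computes the Shannon entropy of a byte string in bits per byte (0 for constant data, 8 for uniformly distributed bytes) and summarizes it over a list of sections. The function names and the representation of sections as raw byte strings are assumptions made for illustration; the exact entropy definition used by the custom parser is not stated in the text.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def section_entropy_features(sections):
    """Minimum, maximum and mean entropy over a non-empty list of sections."""
    values = [shannon_entropy(section) for section in sections]
    return {
        "min_entropy": min(values),
        "max_entropy": max(values),
        "mean_entropy": sum(values) / len(values),
    }


if __name__ == "__main__":
    demo_sections = [b"\x00" * 512, b"ab" * 256, bytes(range(256)) * 2]
    print(section_entropy_features(demo_sections))
```

Padding a section with zero bytes lowers this value, which is consistent with the packer behavior noted during pre-processing.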

PROUD-MAL

PROUD-MAL is a progressive unsupervised framework for malware classification based on static analysis of executables. To this end, we propose an architecture with a two-phase cascaded formulation of unsupervised clustering followed by an attention-based deep neural network. As 80% of the dataset was unlabeled, k-means clustering was employed to predict pseudo labels. Subsequently, the deep neural network was trained using the pseudo labels while applying attention over the input features. The trained model was then evaluated over the test dataset against standard performance metrics.

Unsupervised clustering

Clustering is a ubiquitous and widely accepted instrument for categorizing data, with diverse application domains including medical imaging, natural language processing, biotechnology and cyber security. It is used for data exploration, where the objective is to learn from data that is not well defined or understood [41]. Several algorithms are available; in this work, unsupervised clustering is performed with the k-means algorithm, with the aim of finding a representative, stable clustering solution that can be further utilized for classification as per the framework architecture. The cluster prediction using the unsupervised formulation, i.e., k-means clustering, is depicted in Fig. 4a. If the number of classes is known in advance, as in the problem at hand, it is intuitive to set k equal to that number. Nevertheless, we validated the value of k using the elbow method, which gives an empirical validation of the appropriate number of clusters in a dataset, as depicted in Fig. 4b. If such information is not known in advance, other clustering algorithms, e.g., mean-shift [20] or unsupervised deep embedding [21], may be more appropriate. The extracted features F = {f1, f2,..., fN} are therefore submitted to the k-means algorithm, which clusters similar features (i.e., the corresponding binaries). Using k-means allows us to obtain a set C = {c1, c2,..., ck} of k (= M) cluster centroids by minimizing the following objective function:

$$ C \leftarrow \underset{c}{\arg\min} \sum_{i = 1}^{N} \sum_{j = 1}^{M} \lVert f_{i} - c_{j} \rVert^{2}, $$
(1)

where cj, j = 1, ..., M, are the M cluster centroids. k-means clustering iteratively optimizes a Euclidean clustering objective with a self-training distribution to obtain the predicted clusters. This progressive clustering is important for refining the obtained pseudo labels, which improves the classification accuracy of the model and, subsequently, its convergence. It also helps to reduce incorrect assignments, which make the model more vulnerable to getting stuck in a bad local optimum. The clustering performance, visualized using silhouette analysis and the elbow method, is illustrated in Fig. 4a and b, respectively.
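
The pseudo-labelling step and the elbow check can be sketched with a plain implementation of Lloyd's k-means algorithm. The sketch below uses only the Python standard library and toy two-dimensional points. The function names, the random initialization from the data, the seed and the iteration cap are assumptions made for illustration, and the toy data stands in for the extracted PE feature vectors.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's k-means; returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia


def elbow_curve(points, max_k):
    """Within-cluster sum of squares for k = 1..max_k (the elbow plot data)."""
    return {k: kmeans(points, k)[2] for k in range(1, max_k + 1)}


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    _, pseudo_labels, _ = kmeans(toy, 2)
    print(pseudo_labels)
    print(elbow_curve(toy, 3))
```

The labels returned for k = 2 play the role of the pseudo labels on which the classifier is later trained, and the sharp drop in the elbow curve between k = 1 and k = 2 is the kind of evidence used to validate the choice of k.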

Fig. 4

a k-mean clusters of PE binaries. b Elbow score for cluster prediction

Classification using feature attention-based neural network

For malware classification in the PROUD-MAL framework, we designed a Feature Attention-based Neural Network (FANN) to learn the patterns within a dataset. The FANN is designed to learn feature representations without ground-truth cluster membership labels and is trained over pseudo labels. The pseudo labels are obtained using k-means clustering, which iteratively optimizes a Euclidean objective function with a self-training distribution. The FANN comprises an input layer, an output layer, an Attention Block (AB) and three hidden layers, as illustrated in Fig. 5. All layers are densely connected. A feature vector is input to the FANN and fed forward through the densely connected layers. The first hidden dense layer contains 38 neurons, equal to the number of static features, and uses the rectified linear unit (ReLU) as its activation function. The output of the first hidden layer is propagated to the embedded AB. The AB encodes contextual information by probing feature weights and yields a more refined representation by focusing on features of interest. The AB consists of two parallel attention networks/layers. Each network computes the attention for a feature subsequence of a PE instance and also incorporates prior knowledge to predict new weights. The attention mechanism is discussed in more detail in the following subsection. The third and fourth layers comprise 13 neurons each and are followed by the output layer, which uses sigmoid activation for binary classification. The model is further fine-tuned by adjusting the hyperparameters to achieve optimum results.

Fig. 5

Feature attention-based neural network (FANN)’s architecture

Attention mechanism

As introduced above, the PE header has numerous features, some of which may have a higher impact on identifying malicious PEs. Therefore, we employ an attention mechanism to prioritize the important features while penalizing the “noise” fields. The main principle behind the proposed Attention Block (AB) is that selecting significant features, rather than examining the entire feature set, improves classification. To this end, a feature vector sequence of length n is extracted from the PE header. After the feature vector is processed in the first iteration, a significant combination of length k is selected based on an attention threshold. This subsequence is then used as prior knowledge to train the model to predict the classification of PEs. The sequence can be represented as \( \{{F}_{1},{F}_{2}, \ldots, {F}_{n}\}\). The weighted vector containing the weight \({W}_{i}\) of each feature \({F}_{i}\) in the feature combination sequence is represented as \( \left\{\left({F}_{1},{W}_{1}\right),\left({F}_{2},{W}_{2}\right),\dots , \left({F}_{n},{W}_{n}\right)\right\}\). Next, we extract the subsequence with the k highest weights: \( \left\{ {\left( {F_{1}^{\prime } ,W_{1}^{\prime } } \right),\left( {F_{2}^{\prime } ,W_{2}^{\prime } } \right), \ldots ,~\left( {F_{k}^{\prime } ,W_{k}^{\prime } } \right)} \right\} \). As discussed earlier, the AB connects two parallel attention networks/layers of opposite directions to the same output. Each network/layer computes attn(i,h) for the features of a PE instance given as input, where i represents the features and h represents the number of units. One network processes the sequence from top to bottom (forwards) and the other processes it from bottom to top (backwards). Let \({x}_{t}\) denote the current step of the input sequence and \({h}_{t-1}\) the previous hidden state. The next hidden state \({h}_{t}\) can be calculated as follows:

$${h}_{t}=f\left(A{x}_{t}+W{h}_{t-1}\right),$$
(2)

where \(f\) is a non-linear activation function, and A and W are the weight matrices of the current input vector \({x}_{t}\) and the previous hidden state \({h}_{t-1}\), respectively. At each time step t, the forward pass calculates the hidden state \({h}_{t}\) from the previous hidden state \({h}_{t-1}\) and the new input \({x}_{t}\). At the same time, the backward flow computes the hidden state \({h}_{t}\) from the future hidden state \({h}_{t+1}\) and the current input \({x}_{t}\). Afterward, the best output among the forward \({h}_{t}\) and the backward \({h}_{t}\) is selected to obtain a refined vector representation. The first network or set of layers in the AB uses the sigmoid function while the other uses the ReLU function. Finally, the best output is applied to the feature importance map by taking the product of the learned parameters with the respective probabilities. Each layer computes attn(i,h) for the features of a PE instance given as input, where i represents the features and h represents the number of units. The feature weights for the first layer can be learned as in Eq. (3).

$$ a_{(i,h)} = \sigma (x_{(i,h)}, W), $$
(3)

where \(x_{(i,h)}\) is the input to the layer, W denotes the weights of the layer and \(\sigma\) represents the sigmoid activation function applied by the first attention layer to the feature map w (i, h). The second attention layer is given by Eq. (4).

$$ b_{(i,h)} = \partial (x_{(i,h)}, W). $$
(4)

Similarly, \(\partial\) represents the ReLU activation function applied by the second attention layer to the feature map w (i, h). The maximum of Eqs. (3) and (4) is then selected:

$$ w_{(i,h)} = \max\left(a_{(i,h)}, b_{(i,h)}\right). $$
(5)

The attention is computed by multiplying w(i, h) with the output of the sigmoid function applied to it:

$$ \text{attn}_{(i,h)} = \left[ \frac{\exp(w_{(i,h)})}{\exp(w_{(i,h)}) + 1} \right] \otimes w_{(i,h)}. $$
(6)
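Equations (3) to (6), together with the selection of the k highest-weighted features, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the way x and W are combined inside the activation is not spelled out, so the sketch takes an already computed pre-activation value per feature, and the feature names and numbers are toy inputs. `feature_attention` and `top_k_features` are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def feature_attention(pre_activations):
    """Apply Eqs. (3)-(6) to one pre-activation value per feature."""
    attn = []
    for z in pre_activations:
        a = sigmoid(z)               # Eq. (3): first attention layer
        b = relu(z)                  # Eq. (4): second attention layer
        w = max(a, b)                # Eq. (5): keep the larger response
        attn.append(sigmoid(w) * w)  # Eq. (6): exp(w) / (exp(w) + 1) is sigmoid(w)
    return attn

def top_k_features(names, weights, k):
    """Return the k (feature, weight) pairs with the highest weights."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy pre-activations for four header fields (illustrative values only).
names = ["section_entropy", "image_base", "checksum", "machine"]
scores = feature_attention([2.0, 1.0, 0.0, -1.0])
selected = top_k_features(names, scores, k=2)
```

With these toy inputs the two largest pre-activations yield the two largest attention values, so `selected` holds `section_entropy` and `image_base`, in that order.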

The feature attention-based layer learns to put relatively more weight on those features that contribute more to minimizing the validation loss while learning an accurate classification. It does so by applying the sigmoid function to the feature importance map and subsequently multiplying the learned parameters with the respective probabilities. The dataset based on validated predicted clusters is split in a 60:20:20 ratio for classification training, validation, and testing, respectively. The model is trained over the predicted cluster dataset using classification algorithms including Random Forest, Support Vector Machine (SVM), Gradient Boost, Ada Boost, Naive Bayes and PROUD-MAL. Training is performed for 60 epochs (i.e., approx. 23,185 iterations); the input is submitted to the network in batches of 32, with Adam as the optimizer and a learning rate initialized at 0.001 (\(10^{-3}\)) with stepwise decay. Binary cross-entropy is used for loss calculation over the training data. After training, the model is used to make predictions against the validation dataset. Finally, the empirical validation of the proposed PROUD-MAL approach is performed against standard metrics over the test dataset, which is kept hidden during the training phase.
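A 60:20:20 partition of this kind can be sketched as below. The shuffling strategy and the seed are assumptions, since the text does not say how samples were assigned to the three subsets, and `split_60_20_20` is a hypothetical helper name.

```python
import random

def split_60_20_20(samples, seed=0):
    """Shuffle the samples and split them into train/validation/test subsets."""
    items = list(samples)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    n_train = len(items) * 60 // 100     # integer arithmetic avoids float rounding
    n_val = len(items) * 20 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test subset
    return train, val, test

train, val, test = split_60_20_20(range(100))
```

The test subset is produced once and then set aside, which mirrors the requirement that it stays hidden during training.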

Experiments and results

Implementation details

This section describes the configuration and performance metrics used in the experiments to classify Windows-based PEs. The runtime environment is a workstation with an Intel Core i5-9500 processor @ 3.0 GHz with 6 cores and 6 logical processors, 32 GB RAM, 20.0 GB of virtual memory with virtualization enabled, an NVIDIA GeForce GTX 1650 graphics card with 4 GB RAM, and the Windows 10 Pro 64-bit operating system. In terms of software, both Keras and TensorFlow were employed as the backend for the implementation of our proposed framework. Training is performed for 60 epochs (i.e., approx. 23,185 iterations); the input is submitted to the network in batches of 32, with Adam as the optimizer and a learning rate initialized at 0.001 (\(10^{-3}\)) with stepwise decay. Dropout regularization of 0.5 is applied after the fully connected layers, which helps to prevent overfitting. Generally, dropout randomly removes neurons and their connections during training. Moreover, we adopted the binary cross-entropy loss function, which minimizes the negative log-likelihood between the prediction and the ground-truth data. Momentum helps accelerate Adam in the relevant direction and mitigates oscillations by adding a fraction of the update vector of the past time step to the current update vector. The accuracy and loss values reported by Keras are visualized using TensorBoard and console logs. The summary of hyper-parameters is provided in Table 2.
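For reference, the binary cross-entropy loss named above can be written out in a few lines. This is the textbook definition rather than the Keras source, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Toy labels (1 = malicious, 0 = benign) and predicted probabilities.
loss = binary_cross_entropy([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
```

Confident correct predictions contribute values close to zero, while confident wrong predictions are penalized heavily, which is why the loss suits a two-class malicious/benign output.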

Table 2 Hyper-parameters and associated values

Results and discussion

We compared our proposed method with state-of-the-art supervised approaches. Despite this challenging comparison, the utility of our proposed framework is demonstrated by its high classification accuracy. To assess the model quantitatively, we used the standard metrics of classification accuracy, F1 score, precision, Receiver Operating Characteristic (ROC) curve, area under the ROC curve (AUC), and True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) rates. The accuracy and other results for PROUD-MAL are given in Tables 3 and 4. In our experiments, we took the results of Random Forest (RF) as the baseline for the comparative study, due to its high classification accuracy. Other classification algorithms are also included for a detailed comparison: Support Vector Machine (SVM), Gradient Boost (GB), Ada Boost (AB) and Naive Bayes (NB). The experiments test each classifier against the standard evaluation metrics on the novel dataset, part of which is kept hidden during the training phase. Tables 3 and 4 show the quantitative results of the comparative analysis of PROUD-MAL and the other classification algorithms over the collected dataset. As can be seen in Tables 3 and 4, the best performance is achieved by PROUD-MAL with a classification accuracy of 98.09%. RF, SVM and AB also performed well, with classification accuracies of 94.27%, 93.01% and 91.91%, respectively. However, GB and NB achieved the lowest classification accuracies, i.e., 56.71% and 56.68%, respectively.
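The count-based metrics listed above follow directly from the four confusion-matrix cells. The sketch below uses hypothetical counts chosen only to make the arithmetic easy to follow; they are not the confusion matrix of the reported experiments, and `classification_metrics` is an illustrative helper name.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive standard metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)  # true positive rate (recall)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "f1": 2 * precision * tpr / (precision + tpr),
        "tpr": tpr,
        "fpr": fp / (fp + tn),  # false positive rate
        "tnr": tn / (tn + fp),  # true negative rate
        "fnr": fn / (fn + tp),  # false negative rate
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=98, fp=2, tn=98, fn=2)
```

The ROC curve and AUC are not covered here because they require the predicted scores at varying thresholds rather than a single set of counts.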

Table 3 Quantitative assessment of PROUD-MAL with supervised approaches—results
Table 4 Quantitative assessment of PROUD-MAL with unsupervised approach—results

A detailed analysis of the confusion matrix shows that the proposed PROUD-MAL framework with the Feature Attention-based Neural Network (FANN) achieved the best classification accuracy of 98.09% against standard evaluation metrics on our indigenously collected novel dataset. RF, SVM and AB also performed well, with classification accuracies of 94.27%, 93.01% and 91.91%, respectively. GB and NB achieved the lowest classification accuracies, i.e., 56.71% and 56.68%, respectively. It is worth mentioning that, in relative terms, PROUD-MAL achieved 4%, 5.46%, 72%, 6.72% and 73% higher accuracy than the classical machine learning classification models Random Forest, SVM, Gradient Boost, Ada Boost and Naive Bayes, respectively. Moreover, our experiments show that FANN achieved an overall higher AUC of 99.55% than the other classifiers, which indicates better predictive power and allows better sensitivity tuning. To the best of our knowledge, this is due to unsupervised clustering cascaded with a classifier with embedded attention layers. RF, SVM and NB also performed well, with AUCs of 98.78%, 97.40% and 95.37%, respectively. GB and AB achieved a relatively lower AUC, i.e., 90.99%. The comparison with an unsupervised approach [Hyrum S. Anderson et al. 2018] also showed superior performance: our approach achieved 5.2% higher classification accuracy. The detailed comparative assessment with supervised approaches as well as the unsupervised one shows the utility and significance of the proposed architecture. It is also pertinent to mention that, for classification of an unknown PE by anti-virus software, the training time is not important because a pre-trained neural network can be used. As the test time of the FANN model is less than 21 ms per step, the model is suitable for use in real anti-virus software.
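The accuracy gains quoted above match relative improvements over each baseline, i.e., (new − old) / old, rather than differences in percentage points. A short check using the accuracies reported in this section (the helper name `relative_gain` is illustrative):

```python
PROUD_MAL_ACCURACY = 98.09

# Baseline accuracies (%) as reported in this section.
baseline_accuracy = {
    "Random Forest": 94.27,
    "SVM": 93.01,
    "Gradient Boost": 56.71,
    "Ada Boost": 91.91,
    "Naive Bayes": 56.68,
}

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

gains = {name: relative_gain(PROUD_MAL_ACCURACY, acc)
         for name, acc in baseline_accuracy.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}% relative")
```

Measured in percentage points instead, the margin over Gradient Boost and Naive Bayes is about 41 points, so the two readings differ substantially for the weakest baselines.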

Experiments show that the True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) rates for FANN are 0.98, 0.02, 0.98 and 0.02, respectively. The quantitative assessment was conducted over 60 epochs with a batch size of 32. The training and validation accuracy as well as the training and validation loss are depicted in Fig. 6, which shows that PROUD-MAL is trained well by around 60 epochs. We also employed an early stopping criterion to discontinue further training at an appropriate stage. It is worth mentioning that as the number of iterations increases, the loss or learning rate descends gradually (not shown in the figure for later iterations because the change is insignificant).

As our dataset has 15,457 binaries comprising 8775 (17 GB) malicious and 6681 (8 GB) benign samples, we also calculated the area under the ROC curve (AUC), reported in Tables 3 and 4, which is a widely used performance metric for imbalanced datasets. A visual inspection of the Receiver Operating Characteristic (ROC) curve (Fig. 7) shows that our framework outperforms the other state-of-the-art supervised approaches. PROUD-MAL achieved an ROC of 0.99 with a small discrepancy of 0.01. The visualization of the cluster prediction is generated by applying t-SNE to the dataset and is depicted in Fig. 8. The blue dots represent the malicious binaries and the yellow marks represent the benign PEs. The plot shows minor overlap between the malicious and benign samples.

There were 38 features in the vector space. Applying the attention mechanism revealed that features with numerical values, e.g., section entropy, size of sections and image base, were given more weight by the Attention Block. On the other hand, features that represent either a unique numerical value or a fixed-length value with a specific format, e.g., MD5 and checksum, are given relatively less weight than ordinary numerical values such as section entropy. These attributes are nevertheless given more consideration than features with string values, e.g., machine, characteristics and compiler. The proposed scheme of using a feature subsequence combination selected by the attention mechanism resulted in a more refined feature representation. Accordingly, the quantitative results of the comparative assessment demonstrate the utility of the attention mechanism for unsupervised classification of PEs using static features.

Fig. 6 Validation and training accuracy as well as loss of PROUD-MAL
Fig. 7 ROC curve of PROUD-MAL
Fig. 8 t-SNE cluster visualization of PEs

Conclusion and future direction

We have proposed and presented a progressive deep unsupervised malware classification framework, PROUD-MAL, with a deep neural network architecture that uses dense layers and an attention block for binary classification of Windows-based PEs based on features extracted statically from the header. Our proposed feature attention mechanism-based neural network for malware classification learns to put relatively more weight on those features that contribute more to minimizing the validation loss while learning an accurate classification. To validate the proposed framework, we also collected a novel real-time malware dataset by deploying low- and high-interaction honeypots as well as an endpoint security solution on an enterprise organizational computer network. This indigenously collected dataset is novel and has been made public for the research community. We also intend to expand the volume of this dataset. The quantitative assessment shows that the proposed PROUD-MAL framework achieved an accuracy of 98.09%, with better performance on standard evaluation metrics on the indigenously collected novel dataset, and outperformed the other conventional machine learning algorithms. As a way forward, our framework can be extended to explore behavioral analysis based on API calls [49] using reinforcement learning [50] for malware analysis. Future work also includes transforming PEs into malware images and performing entropy-based semantic segmentation of those images. This could help malware analysts use malware visualization to analyze zero-day malware samples more effectively. The scope of future work may also include Non-Portable Executable (NPE) files.