1 Introduction

Ransomware is malicious software that uses symmetric and asymmetric cryptography to encrypt user data, effectively mounting a Denial-of-Service (DoS) attack on the targeted user [1]. This distinctive functional process makes ransomware more damaging than other malware attacks and can cause irreversible losses. ‘Crypto-viral Extortion’, the functional process of ransomware, consists of three main steps [2], as depicted in Fig. 1. In the first step, the attacker creates a key pair comprising a private key K1 and a public key K2, embeds the public key K2 in the ransomware, and then launches the ransomware. In the second step, after entering a computer, the ransomware activates itself and generates a random symmetric session key K3 to encrypt the victim’s files or data. Next, the ransomware uses K2 to encrypt K3, creating a small asymmetric ciphertext E1, and then zeroizes K3 and the plaintext on the victim’s drive. A communication bundle P1 is then created, containing the previously generated E1, a payment note M, and a medium for contacting the attacker. The ransomware then informs the victim of the attack by displaying the payment note M and demands payment through a transaction medium within a set amount of time in order to decrypt the files. In the final step, once the payment is made, the communication bundle P1 is reduced to P2, which contains only the asymmetric ciphertext E1, and is routed back to the attacker. The attacker receives P2, decrypts E1 with K1, and obtains K3, which is then sent back to the victim to decrypt the files. Finally, upon receiving K3, the victim decrypts the files. The victim usually pays the ransom in untraceable cryptocurrency [3]. However, paying the ransom does not guarantee that the decryption key will recover the encrypted files, which can be the worst outcome of any type of ransomware attack [4].

According to a report by Symantec in 2015, there are two types of ransomware [5]:

  • Locker ransomware: denies access to the system or device

  • Crypto ransomware: denies access to the files or data

However, according to [6], ransomware can be categorized into four groups based on its functionality:

  • Encrypting ransomware: encrypts and denies access to the victim’s files and data (e.g., AIDS Trojan, CryptoLocker, WannaCry, CryptoWall) [6]

  • Non-encrypting ransomware: does not encrypt files but threatens to do so if the ransom is not paid (e.g., WinLock, NotPetya) [6]

  • Leak-ware: does not encrypt files but threatens to publish information stolen from the victim’s system if the ransom is not paid [7]

  • Mobile ransomware: targets the Android platform [8]

Fig. 1 Workflow of a ransomware

All these categories of ransomware have contributed to the recent upsurge in ransomware attacks. Owing to the growing number of ransomware variants and attacks, researchers have been earnestly searching for efficient ways to improve the situation. Some researchers analyze the distinctive behaviors of ransomware by executing it in a secure environment, known as Dynamic Analysis [1, 9,10,11,12,13], while others analyze the ransomware without any execution, referred to as Static Analysis [14,15,16]. A good number of researchers combine these two approaches into a Hybrid Analysis approach [17,18,19]. Although static analysis takes less time and does not require executing malicious files, it struggles to trace new ransomware variants because of ever-evolving code obfuscation techniques. Dynamic analysis, in contrast, may take longer to process and analyze a ransomware program, but it can detect ransomware with higher accuracy because it executes the program in a secure virtual environment and performs real-time behavioral analysis. The underlying idea is that, despite code changes in new ransomware variants, they still exhibit the same behavioral patterns. Therefore, for this study, we have opted for the dynamic analysis approach for its ability to detect and classify ransomware families based on behavioral patterns, regardless of the code obfuscation techniques deployed by ransomware programmers [20, 21]. Among the broad range of behavioral characteristics obtained from dynamic analysis, selecting critical features for a robust ransomware classification or detection system has been a constant challenge. Although feature selection techniques play a pivotal role in this regard, investigating whether the selected features are actually crucial for the corresponding system has largely been ignored, and so has the efficiency of the feature selection technique itself. In particular, when an iterative feature selection technique is applied, a crucial feature may not be included in the best-performing subset, and the number of selected features may not be optimal for a given problem. Hence, the prime objective of this study is to investigate the efficiency of RFECV, an iterative feature selection technique, by incorporating Explainable AI (XAI) into our work, and to demonstrate how XAI can be employed to derive highly contributing features that facilitate building robust ransomware detection systems. Furthermore, despite the relentless efforts of the investigated research works, room for improvement remains: data collection is largely manual, large or diverse datasets covering both Crypto and Locker types of ransomware are lacking [1, 10, 12, 22,23,24,25,26], the Cuckoo sandbox environment often falls short of providing in-depth and accurate analysis reports [1, 9,10,11,12,13], and the features that contribute most to the output of each ML model are rarely reported. Therefore, this study also automates the data collection process by developing a Web-Crawler, covers both the Crypto and Locker types of ransomware, utilizes an advanced sandbox environment, namely the Falcon Sandbox, for the dynamic analysis of ransomware, and presents the highly contributing features to the output of each ML model.

The main contributions of this study are:

  • Developing a Web-Crawler, ‘GetRansomware’, to automate the collection of Windows Portable Executable (PE) files of 15 different ransomware families from the VirusShare repository. The Web-Crawler automates searching for and downloading the samples, cutting down the manual workload.

  • Constructing two different ransomware datasets by analyzing two types of binaries, namely, Windows Portable Executables (PE) and Packet Capture (PCAP) files of both Crypto and Locker types of ransomware.

  • Examining and comparing the performance of six Supervised ML models in identifying ransomware families, both with and without the use of the RFECV feature selection approach. Since our approach includes utilizing RFECV for selecting the optimum number of features and RandomSearchCV for selecting the optimum hyperparameter values for each classifier, this study attempts to optimize each model’s performance in both scenarios before the comparison is made.

  • Presenting the efficiency of the RFECV feature selection technique in ransomware classification. For this task, we first utilize SHapley Additive exPlanations (SHAP) to obtain the highly contributing features in the without-feature selection scenario. Next, we obtain the RFECV-selected features in the with-feature selection scenario. Finally, we report how the set of important features varies for each ML model in the two scenarios and how this affects the final outcome. Thus, this study also demonstrates the application of SHAP to identify the critical features that significantly contribute to the classification of ransomware.

The rest of this paper is structured as follows: Sect. 2 presents the related works. Section 3 details our methodology. The experimental results and discussions are illustrated in Sect. 4. Section 5 concludes the paper with the direction for future works.

2 Related works

In this section, we present several prior approaches to ransomware detection or classification. Although ransomware is a particular kind of malware and many previous approaches include ransomware families within broader malware datasets, our investigation mainly focuses on the binary and multiclass classification of ransomware through the dynamic analysis approach. First, we present recent research on API sequence- and frequency-based ransomware detection and classification techniques. Next, we introduce a few investigations of network traffic feature-based methods. Then, we mention several works that combine other significant features with API call features and network traffic features for ransomware detection and classification. All of these approaches are related to our method, since we consider both API call features and network traffic features when comparing the performance of ML models with and without the RFECV feature selection technique.

A good number of researchers analyzed API call behaviors and proposed ransomware detection or classification methods based on API call sequences or frequencies. Maniath et al. [10] analyzed the API call behavior of 157 ransomware samples and presented a Long Short-Term Memory (LSTM)-based ransomware detection method that focuses on the API call sequence and compensates for ransomware that delays its execution. However, this work lacks complete information about the ransomware families/variants and the number of benign programs used for the experiment. Vinayakumar et al. [11] proposed a Multilayer Perceptron (MLP)-based ransomware detection method focusing on API call frequency, but they deployed a simple MLP network that failed to distinguish CryptoWall from CryptoLocker. Chen et al. [23] used an API Call Flow Graph (CFG), generated from API sequences extracted with the API Monitor tool, to detect ransomware. However, the work is based on a small dataset that includes only four ransomware families, and graph-similarity analysis requires computational power that some systems may not provide. Takeuchi et al. [12] used API call sequences to identify zero-day ransomware attacks and applied kernel tricks to tune a Support Vector Machine (SVM). However, the accuracy of this work decreases when a standardized vector representation is used, owing to the less diverse dataset. Bae et al. [27] extracted API call sequences using the Intel Pin tool. Their sequential process includes generating an n-gram sequence, an input vector, and Class Frequency Non-Class Frequency (CF-NCF) values for every sample before fitting their model. Nevertheless, their work lacks complete information about the ransomware families/variants used for the experiment, and its accuracy could be improved with the help of deception-based techniques. Hwang et al. [13] analyzed API calls and used two Markov chains, one for ransomware and another for benign software, to capture API call sequence patterns. They complement the Markov chains with a Random Forest (RF) to control the False Positive Rate (FPR) and False Negative Rate (FNR) and achieve better performance. However, their model produces a high FPR, which could be reduced with the help of signature-based techniques.

In contrast to API call behaviors, some researchers analyzed the network traffic behaviors of different ransomware families. Cabaj et al. [24] proposed two real-time Software-Defined Networking (SDN)-based mitigation methods, developed using OpenFlow, that ensure a prompt reaction to the threat without decreasing overall network performance. However, the proposed method is based only on the features of the CryptoWall ransomware. Tseng et al. [25] proposed a method that can identify specific network traffic types and detect in-network behavior sequences, and their approach detects ransomware before encryption starts. However, the work lacks complete information about the ransomware families/variants as well as the benign software used for the experiment. Alhawi et al. [26] used TShark to capture and analyze malicious network traffic activities and then used the WEKA ML tool to detect ransomware based on only 9 extracted features. Nonetheless, because it uses few features from only 210 ransomware samples, the proposed method may fall short of recognizing new ransomware variants. Almashhadani et al. [22] built a dedicated testbed for executing the sample ransomware and capturing its network traffic, and proposed a multi-classifier that works at two different levels: a packet-based and a flow-based classifier. Their method employed a language-independent algorithm for detecting algorithmically generated domain names. However, the proposed method is based only on the Locky ransomware. Almashhadani et al. [28] thoroughly analyze ransomworm network traffic, focusing on WannaCry and NotPetya. They extract 21 informative features at the session-based and time-based flow levels to distinguish the propagation traffic of compromised hosts, and build two machine learning classifiers on these features. Moreover, they developed MFMCNS, a multi-feature and multi-classifier network system, which achieves 99.8% detection accuracy. Nevertheless, the research relies heavily on WannaCry traffic analysis due to the greater availability of WannaCry PCAP files compared to those of NotPetya. Singh et al. [29] present SINN-RD, an innovative Neural Network-based Ransomware Detection system employing Spline Interpolation. They outline data normalization and feature generation from log files, and their security analysis confirms SINN-RD’s robustness against potential threats. The practical evaluation assesses its impact on key performance metrics, including accuracy, precision, recall, and F1-score, and a comparative analysis demonstrates that SINN-RD outperforms existing schemes, achieving 99.83% accuracy.

Table 1 Synopsis of the literature review

Instead of considering only API call behavior or only network traffic behavior, some researchers combined these two categories of behavior with other malicious indicators (e.g., registry key operations, file extensions, file/directory operations) in their models. Sgandurra et al. [9] analyzed API calls, registry key operations, embedded strings, file extensions, file/directory operations, and dropped file extensions before developing their model. The features were selected using the mutual information criterion, and their proposed method, ‘EldeRan’, was able to deal with sophisticated ransomware encryption methods at an early stage. However, the limitation of ‘EldeRan’ is that it produces a higher False Positive Rate. Continella et al. [30] analyzed filesystem operations and presented two models: a process-centric model trained on each process and a system-centric model trained on the whole system. They developed ‘ShieldFS’, an add-on driver for the OS that can detect malicious file activities and roll back the effects of an attack. However, their system-centric model produces high false positives, and the system may suffer performance degradation due to the add-on driver. Lu et al. [31] analyzed API calls, network features, registry operations, file operations, directory operations, and memory usage to develop a ransomware detection method based on the Artificial Immune System (AIS). They applied real-valued detector generation based on V-detector negative selection while optimizing the AIS parameters (i.e., the hypersphere detector distribution) to improve the ransomware detection rate. However, their system also produces a high false alarm rate. Hasan et al. [1] considered API calls, network features, registry key operations, process operations, function length frequency, and printable string information for their model. They proposed ‘RansHunt’, a framework that takes a hybrid approach to identify potential static and dynamic features for an SVM classifier and outperforms traditional antivirus tools. However, the proposed method focuses only on the Crypto category, so it may not be effective for the Locker category. In [32], Zahoora et al. analyzed API requests, file/directory operations, file extensions, file processes, registry key operations, strings, and dropped file records and introduced CSPE-R, a Cost-Sensitive Pareto Ensemble strategy for detecting new ransomware attacks. Initially, an unsupervised deep Contractive Auto-Encoder is used to transform the feature space. CSPE-R then explores different semantic spaces and uses a novel Pareto Ensemble-based estimator selection strategy to balance false positives and false negatives. The experimental results demonstrate 93% accuracy against zero-day ransomware attacks, although the dataset includes only 11 crypto-ransomware families. Masum et al. [33] introduce a feature selection-based framework, incorporating various machine learning algorithms, particularly neural network-based classifiers, for efficient ransomware classification and detection. The framework uses variance and VIF thresholds as feature selection tools to eliminate low-variance and highly correlated features. The models’ performance was evaluated through a comprehensive comparative analysis of DT, RF, NB, LR, and NN classifiers, and the experimental findings indicate that the Random Forest classifier outperforms the others, achieving the highest accuracy of 99%.

Table 1 presents the synopsis of the previous research works conducted on the analysis, detection, and classification of ransomware.

3 Methodology

The methodology of this study consists of three subsequent steps as illustrated in Fig. 2: Data Collection, Feature Engineering, and Classification.

Fig. 2 Process overview of our methodology

3.1 Data collection

We have developed a Web-Crawler- ‘GetRansomware’ to automate collecting the Windows Portable Executable (PE) files of 15 different ransomware families from the VirusShare repository [34]. We have also shared the Web-Crawler on our GitHub repository for public access [35]. About 95% of the PE files were collected from VirusShare using GetRansomware. The rest of the PE files were collected from theZoo [36] and Hybrid-Analysis.com [37]. In addition, we have collected the Packet Capture (PCAP) files of those ransomware families from the malware-traffic-analysis [38]. Table 2 presents the number of collected samples.

Table 2 Number of collected samples

3.2 Feature engineering

The scarcity of ransomware datasets is one of the major challenges hindering research in this area [39]. Therefore, for this study, we construct two different datasets from two types of binaries through separate feature engineering processes. In the first process, we create the first dataset by analyzing the PE files, while in the second process, we create the second dataset by analyzing the PCAP files.

3.2.1 Process 1: creation of the first dataset- ‘Data1’

The feature engineering step for the first process is composed of two phases. The phases are:

  • Phase 1: Feature Extraction

  • Phase 2: Feature Selection

Phase 1: feature extraction From the wide range of distinct behavioral features, we have chosen to utilize API call frequencies for our study. API calls are made by applications or programs running at the user level to request services, as depicted in Fig. 3; they are the mechanism through which data or information is exchanged between the requesting application and the operating system. The OS performs the requested services, and the outcomes are returned to the calling user application. Thus, the API calls made by a ransomware program allow attackers to explore and obtain control of the system and perform malicious activities. Since analyzing API call behavior helps researchers better understand a program’s behavior [40, 41], we have opted to extract API call frequencies by executing the PE files of the ransomware.

Fig. 3 Communication through the API call

We have analyzed the PE files with the help of Hybrid-Analysis.com [37], powered by the CrowdStrike Falcon Sandbox [42]. Falcon Sandbox provides a free API key, obtainable from an authorized user account, to automate submitting malicious binaries, pulling the analysis reports once the analysis completes, and performing advanced search queries on the database. For the analysis, we have used our API key and the Falcon Sandbox Python API Connector, the VxAPI wrapper [43], to automatically submit the binaries from our system. After submission, Falcon Sandbox runs each binary in a Virtual Machine (VM) and captures its run-time behavior, as illustrated in Fig. 4. It then shows the analysis results on the web interface.
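For illustration, the sketch below shows how such an automated submission could look using plain Python and the requests library. The endpoint path, header names, and environment ID are assumptions based on the public Hybrid Analysis v2 API and are not taken from this paper; in practice, the authors used the VxAPI wrapper, which hides these details.

```python
import os
import requests

API_KEY = os.environ["HA_API_KEY"]  # personal Falcon Sandbox/Hybrid Analysis API key
SUBMIT_URL = "https://www.hybrid-analysis.com/api/v2/submit/file"  # assumed v2 endpoint
HEADERS = {"api-key": API_KEY, "user-agent": "Falcon Sandbox"}     # assumed header names
ENV_WIN7_64 = 120  # assumed identifier of the Windows 7 64-bit analysis environment

def submit_sample(path: str) -> str:
    """Submit one PE sample for dynamic analysis and return its job identifier."""
    with open(path, "rb") as fh:
        resp = requests.post(
            SUBMIT_URL,
            headers=HEADERS,
            data={"environment_id": ENV_WIN7_64},
            files={"file": (os.path.basename(path), fh)},
        )
    resp.raise_for_status()
    return resp.json().get("job_id", "")

# Submit every collected PE file from a local directory of samples
for name in os.listdir("samples"):
    print(name, "->", submit_sample(os.path.join("samples", name)))
```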

Fig. 4 Block diagram of the PE file execution process

Contrary to prior works in which the analysis tasks were performed using the Cuckoo Sandbox [1, 9,10,11,12,13], we have analyzed the PE files using the Falcon Sandbox, which executes them in a VM (Windows 7, 64-bit). Falcon Sandbox incorporates many other services, such as VirusTotal, the Thug honeyclient, OPSWAT Metadefender, TOR, NSRL (whitelist), Phantom, and a large number of antivirus engines, to provide more integrated and in-depth analysis reports than other sandboxes. While executing the binaries, we set the run-time to the maximum duration available in the Falcon Sandbox to cope with the delayed-execution techniques deployed by attackers. The total analysis time was approximately (1460 PE files × 7 min) = 170 h ≈ 7 days. Next, we obtained the analysis reports using the API key, from which we sorted and computed the frequency of each API call. At the end of the PE file analysis process, we obtained our first dataset, ‘Data1’, consisting of the frequencies of 68 distinct API calls associated with the 15 ransomware families, as presented in Table 3.
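As a sketch of the frequency computation, the snippet below counts API call occurrences in downloaded analysis reports and assembles them into a samples-by-APIs table. The report directory and the ‘api_calls’/‘api’ field names are hypothetical placeholders; the real Falcon Sandbox report layout may nest this information differently.

```python
import json
from collections import Counter
from pathlib import Path

import pandas as pd

def api_call_frequencies(report_path: Path) -> Counter:
    """Count how often each API name appears in one analysis report.

    The flat 'api_calls' list and its 'api' key are assumptions; adjust the
    traversal to match the actual report structure returned by the sandbox.
    """
    report = json.loads(report_path.read_text())
    return Counter(call["api"] for call in report.get("api_calls", []))

# One row per sample, one column per distinct API call, zero where absent
rows = {p.stem: api_call_frequencies(p) for p in Path("reports").glob("*.json")}
data1 = pd.DataFrame(rows).T.fillna(0)
print(data1.shape)  # e.g., (number of samples, 68 distinct API calls)
```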

Table 3 List of features in the ‘Data1’ dataset

Phase 2: feature selection At the beginning of the feature selection phase, we have evenly divided our dataset (stratified train-test split) into train data (80%) and test data (20%) to avoid data leakage. Next, we have applied Recursive Feature Elimination with Cross-Validation (RFECV) [44] to our train data. RFECV is a wrapper-style feature selection method that wraps a given ML model, as depicted in Fig. 5, and selects the optimal number of features for each model by recursively eliminating the least important features in each iteration. It then selects the best-performing subset of features based on the accuracy or score of the cross-validation. RFECV also removes the dependencies and collinearity existing in the model. Using RFECV, we have selected 6 distinct subsets of features for the 6 ML classifiers. These features have been selected by setting ‘min_features_to_select’ to 34 (half of the features), cv=5, and ‘scoring’=‘accuracy’, so that RFECV selects at least half of the features based on the optimum accuracy over 5-fold cross-validation.
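A minimal sketch of this configuration with scikit-learn is shown below; it assumes pandas objects X and y holding the ‘Data1’ features and family labels, and uses Logistic Regression as a stand-in for any of the six classifiers.

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stratified 80/20 split, as described above (X, y assumed to hold 'Data1')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Wrap one of the six classifiers; the same settings are repeated for each model
selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    min_features_to_select=34,  # at least half of the 68 API-call features
    cv=5,
    scoring="accuracy",
)
selector.fit(X_train, y_train)

selected = X_train.columns[selector.support_]  # the RFECV-selected feature subset
print(f"{selector.n_features_} features selected:", list(selected))
```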

Fig. 5 RFECV feature selection technique

3.2.2 Process 2: creation of the second dataset- ‘Data2’

The feature engineering step for the second process is composed of four phases. The phases are:

  • Phase 1: Feature Extraction

  • Phase 2: Exploratory Data Analysis (EDA)

  • Phase 3: Data Preprocessing

  • Phase 4: Feature Selection

Phase 1: feature extraction We have chosen to utilize network traffic features for the second dataset of our study. The Transmission Control Protocol (TCP) is a core transport-layer protocol in the standardized suite of communication protocols that specify how computers communicate over a network. According to our literature review, the communication between the infected host machine (source) and the attacker (destination) is conducted through the transport layer [45]. In addition, HTTP GET or POST methods are used to send information back to the attacker [22]. Hence, we have opted to capture TCP traffic and HTTP traffic information by analyzing the PCAP files of the ransomware.

Again, ransomware often spreads through spam emails containing malicious attachments, such as macro-enabled Word documents. By executing a script, these attachments download the ransomware executable from a URL and install it on the system. After installation, the ransomware continuously tries to find and connect to its Command and Control (C&C) servers to exchange the encryption key and launch the attack session. First, it uses an encrypted list of IP addresses to create a TCP session with the C&C servers. If this fails because the IP addresses are unreachable or blacklisted or the session is disrupted, the ransomware then tries to locate its C&C server by executing a Domain Generation Algorithm (DGA), recurrently producing a large number of pseudo-random domain names. It then keeps sending Domain Name System (DNS) requests to those domain names until the actual C&C server is found, as illustrated in Fig. 6. Here, DNS converts human-readable domain names into machine-readable IP addresses. Upon successful establishment of a TCP session, the attacker guides the victim in delivering the payload. The domain names queried in this burst of DNS requests look like arbitrary strings of characters, and meaningful statistical information can be derived from them as well as from the pattern of randomness they exhibit [46]. If a ransomware detection method can trace this randomness, which occurs before the actual C&C server is found, the ransomware can be stopped before it begins encrypting files. This is an efficient approach for zero-day attacks, as it does not require deriving information from known ransomware. Therefore, we have opted to extract DNS traffic information by analyzing the PCAP files of the ransomware.
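To make the notion of randomness concrete, the sketch below computes the Shannon entropy of domain labels, one illustrative statistic that tends to separate DGA-generated names from dictionary-based ones. This particular metric is an assumption used here for illustration; it is not necessarily among the statistical features referenced in [46] or extracted in this study.

```python
import math
from collections import Counter

def shannon_entropy(label: str) -> float:
    """Shannon entropy (bits per character) of a domain label."""
    counts = Counter(label)
    total = len(label)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# DGA-style labels (hypothetical examples) tend to score higher than real words
for name in ["google", "wikipedia", "qkzjv1hw8xno2f", "u7t3b9qkd0me5rwp"]:
    print(f"{name:18s} {shannon_entropy(name):.2f}")
```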

Fig. 6 Finding out the actual C&C server by sending DNS requests [Author’s own processing]

We have analyzed the PCAP files using Wireshark, a network protocol analyzer [47, 48]. This manual process involved three identical systems with Wireshark installed and two volunteers analyzing the PCAP files. We have extracted 18 network traffic features that, according to [49], convey important statistical information that enhances the ability of classification algorithms to classify ransomware. These features have then been merged, resulting in ‘Data2’. Table 4 presents the list of network traffic features.
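Although the extraction in this study was performed manually in Wireshark, a comparable export can be scripted with Wireshark’s command-line companion tshark, as sketched below. The selected fields are standard Wireshark display-filter fields that roughly correspond to several Table 4 features; the input filename is a hypothetical placeholder, and the exact 18-feature set of ‘Data2’ was still derived by hand.

```python
import subprocess

FIELDS = [
    "ip.src", "tcp.srcport",   # IP and port of the client
    "ip.dst", "tcp.dstport",   # IP and port of the server
    "tcp.len",                 # TCP payload bytes per packet
    "http.request.method",     # HTTP GET/POST of the HTTP request
    "http.response.code",      # response code to the HTTP request
    "http.request.uri",        # URL requested in the HTTP request
    "dns.qry.name",            # DNS request
    "dns.a",                   # DNS response (A record)
]

def dump_fields(pcap_path: str, out_csv: str) -> None:
    """Export selected TCP/HTTP/DNS fields from a PCAP file to CSV using tshark."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields",
           "-E", "header=y", "-E", "separator=,"]
    for field in FIELDS:
        cmd += ["-e", field]
    with open(out_csv, "w") as fh:
        subprocess.run(cmd, stdout=fh, check=True)

dump_fields("sample_ransomware.pcap", "sample_fields.csv")  # hypothetical filenames
```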

Table 4 List of features in the ‘Data2’ dataset

Phase 2: exploratory data analysis (EDA) At the beginning of Phase 2, we have evenly divided the dataset (stratified train-test split) into train data (80%) and test data (20%) to avoid data leakage. Next, we have performed exploratory data analysis to better understand the raw data so that it could be preprocessed as required. The findings from this phase are:

  • Categorical data: We have found 11 features containing categorical data. They are the IP and port of the client, IP and port of the server, Bytes sent from the client to the server, Bytes sent from the server to the client, HTTP method GET or POST of the HTTP requests, Response code to the HTTP requests, URL requested in the HTTP request, IP and port of the client.1, IP and port of the DNS server, DNS request, and DNS response. These categorical data need to be encoded into numerical values, since the classifiers require numerical input in order to be trained and to make predictions.

  • Random missing values: Since different ransomware families create different numbers of conversations over the network, the number of instances captured from the PCAP files differs for each ransomware sample. Hence, we have observed missing values in the network traffic information. Handling missing values is an essential part of the feature engineering process, as ML models may generate biased or inaccurate results if the missing values are not handled properly. There are two ways of dealing with missing values: deleting them or imputing them. Since deleting missing values removes the entire row or column that contains them, useful information may be lost from the dataset. We have therefore opted to impute the missing values.

Phase 3: data preprocessing In the data preprocessing phase, we first encoded the categorical data into numerical data by applying One-Hot Encoding [50] using the ‘get_dummies’ function of the Pandas package, which generates dummy variables for those 11 features. To prevent the ‘dummy variable trap’, we set the ‘drop_first’ parameter to ‘True’. To normalize the data and prevent the imputer from producing biased numerical replacements for the missing data, we scaled the numerical values between 0 and 1. After normalizing the data, we used Scikit-Learn’s Impute package to apply the KNNImputer and fill in the missing values.
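A minimal sketch of this preprocessing chain is shown below; the DataFrame name, the abbreviated list of categorical columns, and the number of neighbors for the imputer (not stated in the text) are assumptions.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

CATEGORICAL = ["IP and port of the client", "HTTP method", "DNS request"]  # abbreviated list

# One-hot encode the categorical columns, dropping the first dummy of each
# feature to avoid the dummy-variable trap
encoded = pd.get_dummies(train_df, columns=CATEGORICAL, drop_first=True)

# Scale every value into [0, 1] so the imputer is not biased by feature ranges
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(encoded), columns=encoded.columns)

# Fill the remaining missing values from the k nearest complete rows
imputer = KNNImputer(n_neighbors=5)  # n_neighbors is an assumed default
data2_train = pd.DataFrame(imputer.fit_transform(scaled), columns=scaled.columns)
```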

Phase 4: feature selection We have selected the network traffic features using RFECV, setting ‘min_features_to_select’ to 9 (half of the features), cv=5, and ‘scoring’=‘accuracy’, so that RFECV selects at least half of the features based on the optimum accuracy over the 5-fold cross-validation applied to our train data.

3.3 Classification

We have employed Supervised Machine Learning algorithms to classify 15 ransomware families into their corresponding categories. Supervised learning algorithms are trained on a labeled dataset to make decisions on an unseen test dataset. These algorithms are generally of two types: classification-based and regression-based. Classification-based algorithms perform both binary and multi-class classification, where instances from the test dataset are assigned to one of an array of known classes; examples include Naïve Bayes, Random Forest, and K-Nearest Neighbor. Regression-based algorithms, on the other hand, model the relationship between independent features (input variables) and a dependent target or continuous output variable to make a prediction; examples include Linear Regression, Neural Network Regression, and Lasso Regression. As this study focuses on classifying 15 ransomware families, we have employed the following algorithms, which are widely used for both binary and multi-class classification as required:

  • Logistic Regression (LR): is a type of statistical analysis that predicts the probability of a dependent variable from a set of independent variables using their linear combination.

  • Stochastic Gradient Descent (SGD): is an optimization algorithm that finds the model parameters by updating them for each training example so that the best fit between predicted and actual outputs is reached.

  • K-Nearest Neighbor (KNN): estimates the likelihood of a new data point being a member of a specific group by measuring the distance between neighboring data points and the new data point.

  • Naïve Bayes (NB): is based on Bayes’ theorem and predicts the probability of an instance belonging to a particular class.

  • Random Forest (RF): constructs multiple decision trees during the training phase and finally determines the class selected by the maximum number of trees.

  • Support Vector Machine (SVM): takes one or more data points from different classes as inputs and generates hyperplanes as outputs that best distinguish the classes.

Since this study focuses on multi-class classification and some classifiers are designed only for binary classification problems (e.g., Logistic Regression and Support Vector Machine), such classifiers cannot be applied directly to multi-class classification problems. Therefore, Heuristic Methods [51] can be applied to divide a multi-class classification problem into several binary classification problems. There are two types of heuristic methods, as illustrated in Fig. 7:

  • One-vs-Rest (OvR) which splits the dataset into one class against all other classes each time [52].

  • One-vs-One (OvO) which splits the dataset into one class against every other class each time [53].

We have applied the OvR method in our experiment to reduce time and computational complexity. All the classifiers are built together with ‘RandomSearchCV’ [54], a hyperparameter optimization technique, to find the best combination of hyperparameters and maximize each model’s performance in a reasonable time. Instead of exhaustively searching for optimal hyperparameter values over a manually determined set of values (as in Grid Search), RandomSearchCV randomly samples the grid space and selects the best combination of hyperparameter values based on the accuracy or the cross-validation score. Since we have used RFECV for feature selection and RandomSearchCV for hyperparameter optimization, a Nested Cross-Validation technique has been implemented in the pipeline used to build each model.
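The sketch below illustrates how such a nested setup can be wired together in scikit-learn (where the randomized search class is named RandomizedSearchCV): RFECV feeds a One-vs-Rest classifier inside a pipeline, the pipeline is tuned by randomized search, and an outer cross-validation scores the tuned pipeline. The hyperparameter distribution and the choice of Logistic Regression as the example classifier are illustrative assumptions, not the exact grids used in this study; X_train and y_train refer to the training split described in Sect. 3.2.

```python
from scipy.stats import loguniform
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Inner pipeline: RFECV-based feature selection feeding a One-vs-Rest classifier
base = LogisticRegression(max_iter=1000)
pipe = Pipeline([
    ("select", RFECV(estimator=base, min_features_to_select=34,
                     cv=5, scoring="accuracy")),
    ("clf", OneVsRestClassifier(base)),
])

# Randomized hyperparameter search over the classifier step (assumed search space)
param_dist = {"clf__estimator__C": loguniform(1e-3, 1e2)}
search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=5,
                            scoring="accuracy", random_state=42)

# Outer loop of the nested cross-validation gives a less biased performance estimate
outer_scores = cross_val_score(search, X_train, y_train, cv=5)
print("Nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```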

Fig. 7 Heuristic methods: (a) One-vs-Rest and (b) One-vs-One

4 Experimental results and discussions

4.1 Experimental results

We have evaluated the models in terms of Precision, Recall, F1-score, and Accuracy. These performance metrics are measured as follows:

$$\begin{aligned}&\text {Precision} = \frac{TP}{TP+FP} \\&\text {Recall} = \frac{TP}{TP+FN} \\&\text {F1-score} = \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \\&\text {Accuracy} = \frac{TP+TN}{TP+TN+FP+FN} \times 100 \end{aligned}$$

where TP = True Positives, FP = False Positives (Type 1 errors), TN = True Negatives, and FN = False Negatives (Type 2 errors).
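For reference, these metrics can be computed directly with scikit-learn as sketched below; the fitted model and the macro averaging over the 15 families are assumptions, since the paper does not state the averaging scheme used for its reported averages.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = model.predict(X_test)  # 'model' stands for any of the six fitted classifiers

# Macro averaging weights the 15 ransomware families equally
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall:   ", recall_score(y_test, y_pred, average="macro"))
print("F1-score: ", f1_score(y_test, y_pred, average="macro"))
print("Accuracy: ", accuracy_score(y_test, y_pred) * 100)
```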

Table 5 presents the performance comparison of the Machine Learning models with and without feature selection for the ‘Data1’ dataset. It shows that LR outperforms the other classifiers both with and without feature selection, securing 98.20% and 99.30% overall accuracy, respectively. Although there is a slight performance degradation in all the classifiers in the with-feature selection scenario, a remarkable improvement in processing time has been observed. As shown in Table 6, with feature selection, the average processing time of all the classifiers improved by 26.97%. We present the classification accuracy for each class of the best-performing supervised machine learning model among these classifiers in the two scenarios. Figure 8 illustrates the normalized confusion matrix of the LR classifier. As shown in Fig. 8a, when the features are not selected, the classifier distinguishes 13 of the 15 classes with 100% accuracy. However, the classifier produces 1% false negatives when classifying CryptoLocker ransomware and 11% false positives when classifying Shade ransomware. On the other hand, Fig. 8b shows the confusion matrix of the LR classifier with feature selection. Although the classifier distinguishes 10 classes with 100% accuracy, it produces 1% false negatives when classifying Cerber, 22% false positives when classifying CryptoLocker, 10% false positives when classifying Mole, 10% false positives when classifying Sage, and 11% false positives when classifying Shade ransomware.

Table 5 Performance comparison between LR, SGD, KNN, NB, RF, and SVM with respect to with-feature selection and without-feature selection using the ‘Data1’ dataset (P(avg) = Average performance, w FS = With-Feature Selection, and wo FS = Without-Feature Selection)
Table 6 Classifier’s processing time comparison without-feature selection and with-feature selection using the ‘Data1’ dataset

Table 7 presents the performance comparison of the Machine Learning models with and without feature selection for the ‘Data2’ dataset. It shows that NB outperforms the other classifiers both with and without feature selection, securing 97.89% and 98.95% overall accuracy, respectively. Even though all the classifiers show a minor performance deterioration in the with-feature selection scenario, a notable improvement in processing time has been observed. As shown in Table 8, with feature selection, the average processing time of all the classifiers improved by 34.72%. We present the classification accuracy for each class of the best-performing supervised machine learning model among these classifiers in the two scenarios. Figure 9 illustrates the normalized confusion matrix of the NB classifier.

Fig. 8 Confusion matrix of (a) Logistic Regression without feature selection, and (b) Logistic Regression with feature selection for the ‘Data1’ dataset [Author’s own processing]

As shown in Fig. 9a, when the features are not selected, the classifier distinguishes 10 of the 15 classes with 100% accuracy. However, the classifier produces 2% false negatives when classifying CryptoLocker and 1% false negatives when classifying Maze ransomware. On the other hand, Fig. 9b shows the confusion matrix of the NB classifier with feature selection. The classifier distinguishes 9 classes with 100% accuracy and no false negatives. However, with feature selection, the classifier produces more false positives than without feature selection.

Table 7 Performance comparison between LR, SGD, KNN, NB, RF, and SVM with respect to with-feature selection and without-feature selection using the ‘Data2’ dataset (P(avg) = Average performance, w FS = With-Feature Selection, and wo FS = Without-Feature Selection)
Table 8 Classifier’s processing time comparison without-feature selection and with-feature selection using the ‘Data2’ dataset

4.2 Discussions

Fig. 9 Confusion matrix of (a) Naïve Bayes without feature selection, and (b) Naïve Bayes with feature selection for the ‘Data2’ dataset [Author’s own processing]

In this section, we compare the RFECV-selected features from the with-feature selection scenario with the highly contributing features from the without-feature selection scenario to examine the efficiency of the RFECV feature selection technique for ransomware classification. For this task, we apply SHapley Additive exPlanations (SHAP), a visualization tool that helps explain the results of machine learning models. SHAP is grounded in coalition game theory: it measures each feature’s individual contribution to the final output while ensuring that the sum of the contributions equals the final result [55]. Unlike other explanation techniques that are limited to specific models, SHAP values can be used to explain a wide variety of models, e.g., DeepExplainer for deep neural networks (Multi-Layer Perceptron, Convolutional Neural Networks, etc.), TreeExplainer for tree-based models (Random Forest, XGBoost, etc.), and KernelExplainer for any model [56, 57]. For our study, we have used TreeExplainer to obtain the highly contributing features from the Random Forest classifier, and KernelExplainer for the other classifiers.
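The sketch below shows how these explainers are typically invoked with the shap library; the fitted models (rf_model, lr_model), the background sample size, and the indexing of the per-class outputs (which assumes the older list-per-class SHAP API) are illustrative assumptions rather than the exact code used in this study.

```python
import shap

# Tree-based model: fast, exact TreeExplainer
tree_explainer = shap.TreeExplainer(rf_model)        # fitted Random Forest
rf_shap_values = tree_explainer.shap_values(X_test)

# Any other model: model-agnostic KernelExplainer over a small background sample
background = shap.sample(X_train, 100)               # keeps KernelExplainer tractable
kernel_explainer = shap.KernelExplainer(lr_model.predict_proba, background)
lr_shap_values = kernel_explainer.shap_values(X_test)

# Global feature-importance view (Fig. 11-style summary plot)
shap.summary_plot(lr_shap_values, X_test, plot_type="bar", max_display=40)

# Local explanation of one prediction for one class (Fig. 10-style force plot)
shap.force_plot(kernel_explainer.expected_value[0],  # base value for class 0
                lr_shap_values[0][0, :],             # SHAP values of the first test row
                X_test.iloc[0, :], matplotlib=True)
```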

Fig. 10 Force plot for a single instance of the ‘Data1’ dataset

Fig. 11 Summary plot showing the top 40 highly contributing features of the ‘Data1’ dataset for each ML classifier in the without-feature selection scenario

In the context of a classification model, the SHAP values are represented as a two-dimensional array. Each column corresponds to a feature used in the model, while each row represents an individual prediction made by the model. Each SHAP value in this array indicates the contribution of a specific feature to the output of the corresponding prediction. A positive SHAP value indicates that the feature pushes the prediction above the base value (the expected value), whereas a negative SHAP value indicates that the feature pushes the prediction below it. The base value, or average model output, is calculated over the training data. To visualize this explanation for a single prediction, the Force plot can be utilized, as illustrated in Fig. 10. In Fig. 10, the features with positive SHAP values (highlighted in red) push the prediction above the base value, while the features with negative SHAP values (highlighted in blue) push it below the base value.

Passing the array of SHAP values to the ‘summary_plot’ function creates a feature importance plot, as shown in Fig. 11. Here, we illustrate the 40 most highly contributing features (since 40 is the largest number of features RFECV selects for any classifier, namely KNN) for each classifier in the without-feature selection scenario for the ‘Data1’ dataset. The x-axis denotes the mean absolute SHAP value of each feature, which indicates the feature’s total contribution to the model, and the y-axis lists the features used for classification. The features are ordered from top to bottom by how strongly they influence the model’s decision. As illustrated in Fig. 11, the set of highly contributing features and their order vary across classifiers. However, for our study, we only examine how the RFECV-selected features differ from the highly contributing features of the corresponding classifier. Table 9 presents the set of optimum features selected by RFECV for each ML classifier from the ‘Data1’ dataset, and Table 10 lists, for each ML classifier, the RFECV-selected features that are not among the top 40 highly contributing features. By comparing these two tables, we identify the features that cause the performance deterioration in the with-feature selection scenario and produce more false alarms than in the without-feature selection scenario.

Table 9 Set of optimum features selected by RFECV from the ‘Data1’ dataset
Table 10 List of RFECV-selected features from the ‘Data1’ dataset for each ML classifier that is not present in the top 40 highly contributing features
Fig. 12 Summary plot showing the features of the ‘Data2’ dataset in descending order based on their contribution to each ML classifier’s decision

Similarly, for the ‘Data2’ dataset, we compare the RFECV-selected features in the with-feature selection scenario with the highly contributing features in the without-feature selection scenario. Figure 12 shows the features of the ‘Data2’ dataset ordered from top to bottom by how strongly they influence each model’s decision. The order of the features varies across classifiers, except for the ‘Bytes sent from the client to the server’ feature. As in the previous step, we only examine the variation of the RFECV-selected features. Table 11 presents the set of optimum features selected by RFECV from the ‘Data2’ dataset for each ML classifier, and Table 12 lists the features that were not selected by RFECV. By comparing these two tables, we identify the features that cause performance deterioration even for the best-performing ML classifier in the with-feature selection scenario and produce more false alarms than in the without-feature selection scenario.

Table 11 Set of the optimum number of features selected by RFECV from the ‘Data2’ dataset
Table 12 List of features from the ‘Data2’ dataset that were not selected by the RFECV

Although SHAP importance shows the effect of a given feature on the model output while disregarding the correctness of the prediction, our study, by comparing the highly contributing features in the without-feature selection scenario with the RFECV-selected features in the with-feature selection scenario, finds that the RFECV feature selection technique often fails to select crucial features that have a high impact on the model output, resulting in both Type 1 and Type 2 errors. Moreover, for both ransomware datasets, the selected features are all ranked 1, while the non-selected features are ranked greater than 1; hence, the importance-based order of the selected features remains unknown in the RFECV feature selection technique. In addition, this study reveals that RFECV falls short of improving the performance of our ML models. Our ML models achieve better classification accuracies without RFECV (for the ‘Data1’ dataset, LR secures 98.20% and 99.30% overall accuracy with and without feature selection, respectively; for the ‘Data2’ dataset, NB secures 97.89% and 98.95% overall accuracy with and without feature selection, respectively), and thus this study also substantiates the performance of our ML models against the existing literature. While using API call features, Vinayakumar et al. [11] achieved 100% accuracy for binary classification, but their model secured 98% accuracy for multiclass classification using only 7 classes. Again, despite achieving a good detection rate using network traffic features, Almashhadani et al. [22] did not extend their work to multiclass classification, and in [28], they rely heavily on WannaCry traffic analysis due to the greater availability of WannaCry PCAP files compared to those of NotPetya. In contrast to these prior approaches, we improved the data collection process by developing a Web-Crawler to automate the collection of 15 different ransomware families and created two ransomware datasets based on API call features (‘Data1’) and network traffic features (‘Data2’) from two types of binaries (‘PE’ and ‘PCAP’ files, respectively) of both the Crypto and Locker types of ransomware. Our LR and NB models also offer comparable performance to the existing literature, with an explainability component that demonstrates the application of SHAP to identify the critical features that significantly contribute to the classification of ransomware.

4.2.1 Advantages of SHAP over iterative feature selection technique

SHAP is a powerful tool in explainable AI that can be useful for ransomware detection in the following ways:

  • Feature Importance: SHAP helps determine the importance of each feature in a machine learning model’s decision-making process. By analyzing the SHAP values assigned to each feature, one can identify which features contribute the most to the prediction of ransomware attacks. This information can aid in understanding the key indicators or patterns associated with ransomware.

  • Model Interpretability: Ransomware detection models are often complex, involving various algorithms and techniques. SHAP provides a way to interpret and explain the predictions made by these models. It can help cybersecurity experts, analysts, and investigators understand the factors that contribute to a system being classified as potentially affected by ransomware. By analyzing the explanations provided by SHAP, they can gain insights into the decision-making process of the model.

  • Anomaly Detection: Ransomware attacks often exhibit anomalous behavior compared to normal system usage. SHAP can help identify these anomalies by providing explanations for individual predictions. If a particular prediction has a high SHAP value for certain features, it indicates that those features strongly contributed to the model’s decision. Unusual values or combinations of features can then be flagged as potential indicators of ransomware activity.

  • Early Warning System: By training a model with historical ransomware attack data, SHAP can provide valuable insights into early warning signs. It can identify the specific indicators that are most indicative of ransomware attacks, allowing organizations to proactively monitor and detect potential threats. This can help security teams respond quickly and prevent or mitigate the impact of ransomware attacks.

  • Vulnerability Assessment: SHAP can be used to assess the vulnerability of a system to ransomware attacks. By analyzing the contributions of different features to the model’s predictions, security professionals can identify the weak points in their systems. They can then focus on improving the security measures for those vulnerable areas, reducing the risk of successful ransomware attacks.

Overall, SHAP can enhance ransomware detection by providing interpretability, feature importance analysis, anomaly detection, and a proactive approach to identifying and mitigating potential threats. Thus, SHAP offers advantages over iterative feature selection techniques. By leveraging SHAP, organizations can better understand the factors driving ransomware predictions and strengthen their cybersecurity defenses accordingly.

5 Conclusion

Early detection of ransomware is a key research area in cybersecurity. While feature selection techniques aim to improve detection accuracy while reducing overfitting and time complexity, the selected features must be crucial to support the technique’s effectiveness. This research thoroughly analyzes the performance of an iterative feature selection technique, Recursive Feature Elimination with Cross-Validation (RFECV), with widely used Supervised Machine Learning models on two different ransomware datasets. Using the SHapley Additive exPlanations (SHAP) framework, the critical features are determined when RFECV is not integrated with the ML models and are then compared to the RFECV-selected features. The study reveals that the classification accuracies are better without RFECV than with it (for the ‘Data1’ dataset, LR secures 98.20% and 99.30% overall accuracy with and without feature selection, respectively; for the ‘Data2’ dataset, NB secures 97.89% and 98.95% overall accuracy with and without feature selection, respectively). Moreover, RFECV occasionally fails to select impactful features from both datasets, leading to both Type 1 and Type 2 errors, and it does not disclose the importance-based order of the selected features, reducing its efficacy in ransomware classification. Consequently, the study highlights the significance of integrating explainability techniques to identify highly contributing features, as relying solely on iterative feature selection techniques is not sufficient for strengthening ransomware detection systems. However, the research concentrates exclusively on the RFECV feature selection technique and does not assess the performance of Deep Learning models. Therefore, future investigations should explore other iterative feature selection methods and incorporate Deep Learning models to extend this research further.