1 Introduction

The introduction of fuzzy systems by Lotfi A. Zadeh gave rise to a new research field in soft computing [1]. Although fuzzy logic was proposed for modeling uncertainties, ambiguity, and inaccuracy, one of the known shortcomings of type-I fuzzy systems is that while membership functions are used to express uncertainties, the outputs of these functions are crisp values. Lotfi A. Zadeh introduced type-II fuzzy systems in 1975 to address the problem [2]. Nonetheless, fewer studies were published on type-II fuzzy systems until late in the previous century [3]. Liang and Mendel first investigated type-II non-singleton fuzzy systems [4]. After Mendel’s work on type-II fuzzy systems, these systems drew the interest of more researchers. The results obtained from the literature and the application of interval type-II fuzzy systems in practical situations with high noise levels, dynamic environments, and high complexity have demonstrated the superior performance of type-II over type-I fuzzy systems [5]. General type-II fuzzy systems can produce better results than interval type-II fuzzy systems due to their ability to handle higher uncertainty levels [6].

Type-III fuzzy systems, introduced in [7], are the evolved versions of the type-II fuzzy systems. They have been recently employed in various applications, such as robot control [8,9,10,11], time series prediction [12, 13], fault detection in gas turbine [14], and controller gain adjustment [15].The obtained results indicate that these systems can outperform type-I and type-II fuzzy systems by dealing with higher uncertainty levels.

Malware is among the most serious security threats on the Internet and leads to economic consequences for governments and businesses in addition to breaches of privacy [16]. Microsoft Windows is the most popular operating system, having a 74.79% share of the market [17]. This has resulted in the annual creation of a large number of malware programs for this operating system although it does not mean that other operating systems are safe from malware. According to reports by AVTEST, the number of malware programs created for Microsoft Windows was 793299151 in June 2023, 730142412 in June 2022, and 644297811 in June 2021, indicating an annual increase of 8.64% and 11.75%, respectively [18].

The act of identifying and subsequently categorizing malware is referred to as Malware Analysis. Static, Dynamic and both analyses [19] stand out as the predominant techniques employed in malware analysis. Static analysis generally involves parsing the executable code and extracting features and signatures that can aid in detection. Dynamic analysis involves executing the executable file in a controlled environment to observe its behavior. During dynamic malware analysis, the executable’s interactions with the system are monitored, including system calls, executed instructions, and accessed registers. Dynamic analysis has been proven to be a highly efficient method of expediting the identification of malicious code, owing to the multifarious dynamic behaviors that malware exhibits during its execution [20]. Although dynamic techniques for analyzing malware are usually more accurate than entirely static methods, their primary limitation is that they can only detect malicious actions when those actions occur during the analysis process. This limitation has led to fewer studies incorporating dynamic features in their research [20].

Deep learning is now widely used for malware classification tasks. To use deep neural networks for malware detection, first a raw binary malware file is converted into an image, which can be grayscale or colored. As this process is repeated, a dataset is created that can be used to train the network. The network can then be utilized as a malware classifier. It’s essential to emphasize the research gap regarding the classification of malware, particularly in the context of dynamic features, which have received comparatively less attention in previous studies. The article introduces a novel classifier that integrates a type-III fuzzy decision maker with an ensemble of deep neural networks and support vector machines to address this gap and offer a fresh perspective on malware classification. Additionally, it’s worth highlighting that the paper pioneers the utilization of fuzzy type-III as a decision maker, aiming to better model uncertainty and enhance classification accuracy. This marks the first instance where fuzzy type-III has been employed in such a capacity, underscoring the innovative nature of the proposed approach.

The contributions of this article can be summarized as follows:

  • Introducing a unique hybrid classifier that combines deep neural networks with a type-III fuzzy system.

  • Implementing the type-III fuzzy system for decision-making, as opposed to traditional methods like voting.

  • Showcasing an enhanced CGO optimization algorithm for training the type-III fuzzy system.

  • To assess how well ICGO performs in addressing optimization challenges, it is subjected to testing across a spectrum of 126 BFs, encompassing different categories such as UM, MM, as well as the CEC 2019 and CEC 2017 sets. This evaluation spans dimensions ranging from 10 to 100, and it also includes five EDPs to examine how the problem dimensions influence the efficacy of the ICGO algorithm.

  • Employing both balanced and imbalanced datasets and offering an in-depth analysis to provide a clearer insight into the performance of the proposed model.

The article is structured into five sections. Section 2 focuses on related work, while Sect. 3 covers material and methods. Section 4 presents simulation results and compares the performance of the different models. In Sect. 5, we will explain the proposed methodology. The Sect. 6 and final section provides conclusions based on the article’s findings.

2 Related works

The field of operating system security extensively utilizes machine learning techniques for various applications, particularly for effectively classifying different malware families. In the following paragraphs, we will examine a selection of articles published in recent years that have employed machine learning techniques for the purposes of detecting and/or classifying malware.

Machine learning offers two main approaches for detecting and classifying malware: image-based and operational code (Opcode) families-based methods. The former involves analyzing visual representations of malware, while the latter leverages the sequence of operations, known as opcodes, within the program to identify malicious code. Parildi et al. [21] presented an alternative method for malware detection using assembly opcode sequences, utilizing natural language processing and deep learning techniques for deeper behavioral features, and achieving MCC scores of up to 0.95. In another similar study, Santos et al. [22] proposed a method that involved examining the frequency of opcode sequences and building a semi-supervised machine-learning classifier using a set of labeled and unlabeled malware and legitimate software instances. Empirical validation was performed to demonstrate that the labeling efforts required for this method were lower than those of supervised learning while still maintaining a high level of accuracy. Results indicate that by labeling only 50% of the software, more than 83% accuracy rates can be achieved.

The method of visualizing images is of great interest to many security researchers due to its ability to eliminate the need for feature engineering. The authors of [23] first created a feature vector by concatenating the extracted features from AlexNet and ResNet-152 and then used three fully connected layers and a Softmax function to classify malware. They evaluated the proposed method using Malimg, MBIG2015, and Malevis datasets. The classification accuracy for the proposed classifier was reported to be 97.78%, 94.88%, and 96.5% for the mentioned datasets, respectively. A lightweight deep neural network called IMCLNet has been introduced by [16] for classifying malware. To evaluate the proposed network, two datasets, Malimg and MBIG2015 (Microsoft BIG 2015), were used, and their classification accuracy was reported as 99.785% and 98.942%, respectively, by the proposed classifier.

A DNN-based malware classifier for Windows programs was proposed to address vulnerabilities in adversarial perturbation attacks [24]. A defensive mechanism uses a generative adversary network (GAN). The GAN-based adversarial samples achieve high-quality samples with medium cost, and the enhanced DNN achieves satisfactory accuracy with a 90.20% evasion ratio. GAN secures the DNN-based malware classifier with minimal performance degradation and minimizes evasion ratios when faced with powerful adversarial attacks. During [25], after feature extraction with VGG16, the features pass through two BiLSTM layers. Finally, the outputs generated by the BiLSTM layers and the features extracted by VGG16 are combined for malware classification. In order to mitigate the problem of imbalanced data, data augmentation techniques such as image shifting, vertical flipping, horizontal flipping, 45-degree clockwise rotation, and 45-degree counterclockwise rotation were employed in this article. A graph convolutional network malware classifier was developed to adapt to malware characteristics, achieving 98.32% accuracy and superior performance compared to existing methods [26].

A feature selection technique based on frequent Android permissions is explored to reduce computational effort [27]. The authors of [28] have introduced IMCFN, a classifier designed to identify various types of malware and enhance detection by implementing a deep learning architecture based on CNN. The method converts raw malware binaries into color images, using the fine-tuned CNN architecture to identify malware families. The IMCFN outperforms other CNN models, with an accuracy of 98.82% in the Malimg malware dataset and over 97.35% in the IoT-android mobile dataset. Hosseini et al. [29] demonstrated the effectiveness of Deep Neural Networks in malware classification, primarily using a combined convolutional neural network and RNNs. The proposed algorithm achieves maximum accuracy of 98.8% using fivefold cross-validation, surpassing CNN, Ensemble-learning, and SVM algorithms. Improvements are needed to enhance robustness and detect malware families for higher accuracy.

MAPAS is a malware detection system that uses Grad-CAM to analyze malicious applications' behaviors and API call graphs. Grad-CAM stands as a method that preserves the structure of complex models while providing insight into their decisions without sacrificing precision. It's lauded as a localization technique that delineates classes, providing visual insights for CNN-based networks sans the need for altering their architecture or undergoing re-training. MAPAS classifies applications 145.8% faster and uses ten times less memory than MaMaDroid, with higher accuracy (91.27%) for detecting unknown malware and any type of malware with high accuracy. This innovative approach offers a cost-effective solution for protecting users from emerging malware [30]. Reference [31] aims to propose a hybrid deep learning model called DeepVisDroid for detecting Android malware samples using image-based features. Four grayscale datasets were constructed, and a 1D-convolutional layers-based neural network model was trained using extracted local and global features. The model achieved classification accuracy of over 98% with efficient run-time overhead. Current deep CNN-based models require higher resources and heavy training operations, making them insufficient for IoT applications. Reference [32] proposes a lightweight CNN model for malware image classification, achieving 96.64% accuracy and suitable for resource-constrained applications.

In the subsequent section, we discuss various researches that have employed different machine learning techniques for the detection and categorization of malware. An example of this is the research carried out by Aurangzeb and his team [33]. In this article, a combination of five classifiers, namely Gradient Boosting, KNN, Random Forest, XGBoost, and Multilayer Perceptron, has been utilized to detect malware in software programs operating on the Android platform. The proposed classifier employs a classification method based on the voting mechanism among the mentioned classifiers. Authors of [34] present a hybrid approach for Android malware classification using fuzzy C-means clustering and LightGBM. Fuzzy clustering generates clusters of app permissions, while LightGBM classifies apps as malware or good ware after training, offering high learning efficiency and precise classification.

Reference [35] aimed to improve the detection performance of Android malware using machine learning-based malware detectors and investigate the impact of adversarial attacks on classifiers. The framework integrates static and dynamic features, with machine learning algorithms achieving 97.59% accuracy and random forest at 95.64%. When combined, deep neural network models achieve 99.28% accuracy and 99.59% accuracy. The paper also evaluates classifiers’ robustness against evasion and poisoning attacks, revealing a significant drop in performance when simulating evasion attacks using static features. Dynamic features are also vulnerable to attack but exhibit more resilience than static features. A machine-learning approach was tested for detecting Android malware using a SVM classifier and Harris Hawks Optimization algorithm (HHO) [36].

Reference [37] proposed an Android malware detection technique using supervised learning to detect malware behavior. The supervised model achieves 97% accuracy in detecting malware, malicious API calls, and unusual app behavior. A simulated annealing algorithm and fuzzy logic were used in feature selection and neighbor generation stages to test ten feature sets, achieving 99.02% accuracy in feature selection with the KNN classifier [38].

Table 1 provides a summary of the selected related methods. The table indicates the use of k-fold cross-validation (CV), feature extraction (FE), classification algorithms (C), optimization techniques (Opt) for parameter tuning, introduction of new datasets (DS), reported accuracy (ACC), feature selection (FS), and feature processing (FP) in each work. This allows for a concise overview of the methodologies and contributions of previous studies relevant to the current research. Despite the critical role that these dynamic attributes play in identifying and analyzing malware, however, there has been a dearth of research focused on integrating them into the malware analysis process [39]. Moreover, a limited number of articles concerning the classification of malware have undertaken assessments of their classification techniques employing another dataset, including the renowned Fashion MNIST dataset and the MNIST dataset.

Table 1 The summary of selected related work is reported

3 Material and methods

3.1 Dataset

This article uses the public dataset presented in [39] comprising seven distinct datasets. In order to generate this data, 65,536 malicious samples were extracted from the VirusShare repository and then filtered to yield 15,872 viable executable malware files. The Cuckoo Sandbox software by Linux Software was utilized to conduct dynamic analysis on these files safely. This software executed the malware in an isolated environment and logged the behaviors and functions, outputting 15,872 report.json files. The Cuckoo Sandbox system also integrates with the VirusTotal scan service to identify files containing viruses and other malware. The AVClass2 was then applied to automatically label the malware samples into categories based on their attributes and actions [40]. The outcome of this process was a dataset including 3749 real malware samples categorized into 11 distinct classes. The distribution of the Malware Family is reported in Table 2, and the visual representation of it can be observed in Fig. 1.

Table 2 The distribution of Malware Family
Fig. 1
figure 1

The distribution of the malware family

From the analysis report generated in the previous stage, we identified three dynamic features included in the report. Subsequently, we will elaborate on each of these features, namely, The Application Programming Interface (API) calls and their frequency, Loaded Dynamic Link Libraries (DLL) files, and Registry operations (REG), to provide a detailed explanation of how each feature can reveal insights into the behavior of the malware during its execution. API calls refer to the requests made by Malware to initiate an interaction with the operating system or other software components. By analyzing the type and frequency of these API calls, we can deduce the malware’s behavior and intentions such as network access, file creation, and modification [41]. DLL files contain functions and data used by other programs. Using DLL files helps promote code modularization, code reuse, and efficient memory usage [42]. Malware often loads specific DLL files to utilize their functions and perform malicious activities. Identifying the DLL files loaded by the malware can expose the malware’s dependencies and reveal its tactics. REG is a database that stores configurations and settings for the operating system, hardware, and software [43]. Malware frequently manipulates the registry to achieve persistence, disable security features, or modify system settings [44]. Examining the operations performed on the registry can provide insights into the malware’s objectives and persistence mechanisms.

3.1.1 Malware visualization

In this work same as [39], the total number of API calls, DLL files loaded and malware registry keys present were 249, 730, and 5013 respectively. To improve the extraction of features and attain a more thorough understanding of malware properties, the following methodology was employed:

The aforementioned features were converted into three distinct vectors with varying dimensions (\({\Gamma }_{1}\), \({\Gamma }_{2}\), \({\Gamma }_{3}\)), specifically 249, 730, and 5013. These vectors were subsequently transformed into 2D arrays, resulting in new vectors with dimensions 16 × 16, 28 × 28, and 71 × 71, analogous to an RGB channel image. Nonetheless, due to the differing dimensions of the three vectors, it was necessary to employ a technique known as bilinear interpolation to adjust their sizes for compatibility [39]. In particular, using \({{\Gamma }_{2}}^{\prime}\) served as a reference point, with \({{\Gamma }_{1}}^{\prime}\) being expanded to 28 × 28 and being reduced to \({{\Gamma }_{3}}^{\prime}\) to 28 × 28. Finally, Malware visualization maps API calls, DLLs, and registry accesses to the red, green and blue (RGB) channels of an image, respectively, then combined into a single RGB image, with the intensity of each color representing the volume and specifics of the corresponding behavior by mapping these malware behaviors to color channels, visualization provides an intuitive representation of the runtime actions of a malware sample [39]. Figure 2 illustrates the complete process.

Fig. 2
figure 2

Steps of generating malware dataset for classification

3.1.2 K-fold cross validation

To mitigate the impact of partitioning the dataset on the classifier’s performance, we employed k-fold cross-validation in our study. The entire dataset is divided into \(k\) distinct subsets of similar sizes. Then, \(k-1\) subsets are used as the training set to train a model, and the remaining subsets are used as the test set to evaluate the model. This process is repeated \(k\) times, and each subset is used exactly once as the test set. The most common values for \(k\) are 5 and 10. In this article, a fivefold cross-validation method, similar to the Fig. 3, is employed. This method significantly reduces bias (a model with high bias pays very little attention to the training data, resulting in an overly simplistic model and high errors in both training and test data) and variance (a model with high variance overfits the training data and fails to generalize to unseen data), as it ensures that each sample from the entire dataset has a chance to appear in both the training and test sets.

Fig. 3
figure 3

Fivefold cross-validation

3.2 Efficient capsule network

In CNN, the use of a type of pooling operation called max pooling can result in some loss of information regarding object location and affine transformations. The authors of [45] solved two problems of CNN by replacing a vector capsule with a scalar CNN feature detector, a method known as routing. To resolve the second problem, they encoded the location information into the low-level capsule. In the first approach, the authors used a Capsule Network (CapsNet) for their problem because of the advantages discussed in this section, such as biomedical image segmentation [46], breaking CAPTCHA [47], text classification [48], classification of lung cancer [49], drowsiness detection [50], detecting fake news [51].

In the second approach, they improved and modified the structure of this neural network, as in RPI-CapsGAN [52], Dual-attentional spatial-aware CapsNet [53], Multi-Column CapsNet [54], Multi-Lane CapsNet [55], CapsNet with non-iterative cluster routing [56], Graph CapsNet [57], Adaptive CapsNet [58], Transformer CapsNet [59].

Although CapsNets have many advantages, they have a computational efficiency problem, which is why we used Efficient-CapsNet in our research. Efficient-CapsNet proposed in [60] by reducing parameters solves the traditional CapsNet’s computational efficiency problem. The efficient-CapsNet structure is represented in Fig. 4. The Efficient-CapsNet consists of three main components: a convolutional layer, a primary capsule layer, and a self-attention mechanism for routing between capsules. To prepare for the creation of the capsule, the input (image) passes respectively through the convolution multi-layer, the ReLU activation functions, and the bath normalization multi-layer. In the Efficient-CapsNet, the main capsule layer passes high-dimensional features from a depth-separable convolution. A depth-separable convolution combined with a linear activation function creates a vector representation of the features. In this part, the output size is 128 1 × 1.

Fig. 4
figure 4

Overall efficient-CapsNet structure

The primary capsule layer is crucial in decreasing parameters and enhancing the efficiency of this network. The final output of the preceding layer is reshaped according to user-defined parameters for the primary capsules, such as their number and dimensionality. The primary capsules then go through the squash function, and the resulting values are used to determine the digits capsules and classification through the fully connected capsules layer based on a self-attention routing.

Our attention in this part turns to the specific details of the Efficient-CapsNet architecture utilized in our research. During the forward pass of the network, 28 × 28 RGB images are fed into the input layer and then go through four convolution layers. The first convolution layer consists of 32 convolution kernels, each with a size of 5 × 5, a stride of 1, and valid padding, resulting in 32 feature maps of 24 × 24 dimensions. The second convolution layer has 64 kernels, each 3 × 3 in size, with a stride of 1 and valid padding, while the third convolution layer also has 64 kernels, with the same specifications. The second and third layers generate 62 feature maps of 22 × 22 and 20 × 20 dimensions, respectively. Lastly, the fourth convolution layer contains 128 convolution kernels of 3 × 3 size, a stride of 2, and valid padding, producing 128 feature maps with 9 × 9 dimensions. In order to train deep neural networks, Batch Normalization (BN) standardizes each mini-input batch to a layer. As a result, the learning process becomes more stable, and the number of training epochs necessary to create deep networks is greatly diminished. This transformation keeps the average output around zero, and the output’s standard deviation is around one [61]. A layer with \(dim\) dimensional input \(\left({\mathcalligra{x}}^{\left(1\right)}\cdots {\mathcalligra{x}}^{(dim)}\right)\), normalize each dimension using Eq. (1):

$${\mathcalligra{x}}_{\mathrm{Normalize }}^{(k)}=\frac{{\mathcalligra{x}}^{k}-E\left[{\mathcalligra{x}}^{k}\right]}{\sqrt{Var\left[{\mathcalligra{x}}^{k}\right]}}.$$
(1)

Mazzia et al. [60] proposed a transformation reconstruction algorithm in order to prevent the features learned by each layer from being destroyed by normalization Eq. (1). Through Eqs. (1) and (2), it is possible to restore the learned distribution of features for each layer [61]:

$${\mathcalligra{y}}^{(k)}={{\mathcal{T}}}^{(k)}{\mathcalligra{x}}_{\mathrm{Normalize }}^{(k)}+{{\mathcalligra{g}}}^{(k)}.$$
(2)

In Eq. (2) \({\mathcalligra{x}}_{\mathrm{Normalize }}^{(k)}\) is the input after k-layer normalization, and \({\mathcal{T}}\) and \({\mathcalligra{g}}\) are learnable parameters. Due to the advantages of BN, in Efficient-CapsNet, a batch normalization layer is embedded between the mentioned four convolution layers. Depth-separable convolution is a 9 × 9 convolution kernel with 128 channels. The output shape of this convolutional layer is 128 1 × 1. We set the number and dimension, 16 and 8, respectively, for primary capsules. The multiplication of these values gives 128 (16 × 8). The construction of the primary capsules is now complete. The self-attention routing (Fig. 5) part is similar to the fully connected network. The upper capsule takes in a combination of all “prediction vectors” from the lower capsule, and the output capsules correspond to different classes of Malware. Where \({U}_{{n}^{l},{d}^{l}}^{l}\) shown in Fig. 5, \(l\) represents \(l\)th capsule layer, \({n}^{l}\) represent number of capsule layers and \({d}^{l}\) represents the dimensions of each capsule layer. We set \({U}_{{16,8}}^{l}\) in this article. \({U}_{{11,16}}^{l+1}\), number capsule in \(l+1\) capsule layer (digit capsules) is set equal to the number of classes (11 in this work). \({W}_{16,{11,8},16}^{l}\) stands for a weight matrix with a dimension size of (1\({6,11,8},16\)). \({\widehat{U}}_{{16,8}}^{l}\) refers to all predictions of the front capsule, while \({C}^{l}\) represents the coupling coefficient matrix generated by the self-attention algorithm, as shown in Eqs. (3) and (4).

Fig. 5
figure 5

Self-attention routing in an efficient-CapsNet

Where, \({n}^{l}\) denotes the number of capsules present in layer \(l\), while \({n}^{l+1}\) indicates the number of capsules present in the next layer, which is \(l+1\). \({d}^{l}\) refers to the dimension of \(l\)-layer capsule, and \(\sqrt{{d}^{l}}\) is utilized to maintain equilibrium between the coupling coefficient and logarithmic priority, ensuring a stable training process. The self-attention matrix is represented by \({A}^{l}\), and each capsule is associated with a self-attention matrix:

$${A}_{\left(:,:,{n}^{l+1}\right)}^{l}=\frac{{\widehat{U}}_{\left(:,{n}^{l+1},:\right)}^{l}\times {\widehat{U}}_{\left(:,{n}^{l+1},:\right)}^{Tl}}{\sqrt{{d}^{l}}}$$
(3)
$${C}_{\left(:,{n}^{l+1}\right)}^{l}=\frac{{\text{exp}}\left(\sum_{{n}^{l}}{A}_{\left(:,{n}^{l},{n}^{l+1}\right)}^{l}\right) }{ \sum_{{n}^{l+1}}{\text{exp}}\left(\sum_{{n}^{l}}{A}_{\left(:,{n}^{l},{n}^{l+1}\right)}^{l}\right)}.$$
(4)

The \({B}^{l}\) matrix holds all weight discrimination information and is predetermined. Calculating all capsules \({S}_{{n}^{l+1},{n}^{l} }^{l+1}\) in the \(l+1\) layer can be accomplished using Eq. (5); this equation compresses the length of all capsule vectors in layer \(l+1\) between zero and one using the squeeze function to obtain \({U}_{{11,16}}^{l+1}\).

There is a difference between the squeeze function of the network and that of CapsNet. A variant of the squeeze function is proposed to prevent the network from converging due to the vector length of zero, as in Eq. (6):

$${s}_{n}^{l+1}={\widehat{U}}_{\left(:,{n}^{l+1},:\right)}^{Tl}\times \left({C}_{\left(:,{n}^{l+1}\right)}^{l}+{B}_{\left(:,{n}^{l+1}\right)}^{l}\right)$$
(5)
$$squash\left({s}_{n}^{l}\right)=\left(1-\frac{1}{{e}^{\parallel {s}_{n}^{l}\parallel }}\right)\frac{{s}_{n}^{l}}{\parallel {s}_{n}^{l}\parallel }.$$
(6)

In the end, the output of digit capsules layer provides the classification by using a final operation that reduces the shape 11 × 16 into 11 × 1. This represents a one-hot classification vector. The total loss function includes two losses, the first margin loss and the second reconstruction regularizer as Eq. (7) [62].

$${\mathcal{L}}_{{n}^{l}}={T}_{{n}^{L}}{{\text{max}}\left(0,{m}^{+}-\parallel {u}_{n}^{L}\parallel \right)}^{2}+\lambda \left(1-{T}_{{n}^{L}}\right){{\text{max}}\left(0,\parallel {u}_{n}^{L}\parallel -{m}^{-}\right)}^{2}$$
(7)

where, \(n\) represents the class. If class \(n\) exists, then \({T}_{n}\) is equal to one. On the other hand, if \(n\) does not exist, then \({T}_{n}\) is equal to zero. \(\lambda\) is a reference to a coefficient that is used for down weighting. During the initial learning phase, it is possible to prevent the shrinkage of all vectors if \(n\) is not present. Training options for the Efficient-CapsNet in our work are reported in Table 3. The performance curve of the Efficient-CapsNet for one of the API Dataset classification runs is illustrated in Fig. 6.

Table 3 Training options efficient-capsule
Fig. 6
figure 6

Performance curve of efficient-CapNet

3.3 Residual network

The Residual Network (ResNet) was proposed by Microsoft researchers in 2015 to address the challenge of training very deep networks. One of the primary drivers for the development of ResNet was to mitigate the issue of vanishing gradient. ResNets use shortcut connections to bypass one or more layers (Fig. 7), which helps in solving this problem [63].

Fig. 7
figure 7

Bypass layer in residual network

3.3.1 ResNet-18

The ResNet-18 architecture is a variant of the ResNet family that has been widely used in image recognition tasks. This network is a 72-layer architecture with 18 deep learning layers. Composed of 18 layers, ResNet-18 includes 17 convolutional layers that operate on the input image and extract features at various levels of abstraction. In addition, ResNet-18 has a fully connected layer and a Softmax function for classification. The convolutional layers in ResNet-18 use a 3 × 3 kernel size, which has been shown to be effective in capturing important spatial features in images. In ResNet-18, the residual shortcut connections skip two convolutional layers. This design has been carefully optimized to balance the trade-off between model complexity and performance [63]. A visual representation of the ResNet-18 architecture can be seen in Fig. 8.

Fig. 8
figure 8

ResNet-18 model architecture

Table 4 shows color what mean in Figs. 8 and 10.

Table 4 Legend of means color in Figs. 8 and 10

To use ResNet-18 (architecture in Table 5), we first must resize images dataset 28 × 28 RGB to 224 × 224 RGB. This model is trained with summation cross-entropy loss function and \({L}_{2}\) regularization. The cross-entropy loss function aims to minimize the distance between predicted and ground-truth probabilities. \({L}_{2}\) regularization reduce the overfitting possibility. It is defined as follows:

Table 5 ResNet-18 architecture summary [65]
$$Loss=-\frac{1}{N}\left(\sum_{i=1}^{N}{t}_{i}log\left({p}_{i}\right)+\frac{\lambda }{2}\sum {\Vert w\Vert }_{2}^{2}\right).$$
(8)

In Eq. (8) \(\lambda\) is the hyper-parameter scale of the regularization term to have a gentle influence on the \(Loss\), \({\Vert w\Vert }_{2}^{2}\) is the L2-norm expression for the entire set of weights in the model, \({t}_{i}\) is the truth label and \({p}_{i}\) is the Softmax probability for the \(i\)th class.

To update parameters, it uses the Adam (Adaptive Moment Estimation) optimizer. This optimization technique employs an updating process for parameters akin to RMSProp. However, it distinguishes itself by incorporating a momentum term into Adam [64]. The update rule of the Adam optimizer is described as:

$${\theta }_{{\ell}+1}={\theta }_{{\ell}}-\frac{\eta {m}_{l}}{\sqrt{{v}_{l}}+\epsilon }$$
(9)
$${m}_{l}=\frac{{m}_{{\ell}}}{1-{\beta }_{1}^{{\ell}}}$$
(10)
$${v}_{l}=\frac{{v}_{{\ell}}}{1-{\beta }_{2}^{{\ell}}}$$
(11)
$${m}_{{\ell}}={\beta }_{1}{m}_{{\ell}-1}+(1-{\beta }_{1})\times \nabla E({\theta }_{{\ell}})$$
(12)
$${v}_{{\ell}}={\beta }_{2}{v}_{{\ell}-1}+\left(1-{\beta }_{2}\right)\times {(\nabla E({\theta }_{{\ell}}))}^{2}.$$
(13)

In this set of equations \(\theta\) denotes the network’s parameters, \(E\) is the loss function to optimize, \({m}_{{\ell}}\) and \({v}_{{\ell}}\) are exponential average of gradients along \({\theta }_{{\ell}}\) and exponential average of squares of gradients along \({\theta }_{{\ell}}\) and \(\eta\) is initial learning rate.

The hyperparameters \({\beta }_{1}^{{\ell}}\) and \({\beta }_{2}^{{\ell}}\) are gradient decay factor and squared gradient factor. Training options Resnet-18 in our proposed model is reported in Table 6. Additionally, Fig. 9 illustrates the training progress of ResNet-18 for one of the API Dataset classification runs.

Table 6 Training options Resnet-18 in the proposed model
Fig. 9
figure 9

Training progress of ResNet-18 (API dataset)

3.3.2 ResNet-50

ResNet-50 is another variant of the ResNet family. This network is a 176-layer architecture with 50 deep learning layers. It consists of 48 convolutional layers, 1 max pooling layer, and 1 average pooling layer. The ResNet-50 consists of five convolutional layers as shown in Table 7. The Input image with 224 ×224 RGB passed through a convolutional layer (Conv1) with 64 filters, kernel size 7 × 7 and a stride of size 2 followed by a max pooling layer with stride size same previous layer. The second convolutional layer (Conv2_x) consists of three convolutional layers: (1) convolutional layer with 64 filters, kernel size 1 \(\times 1\), (2) convolutional layer with 64 filters, kernel size 3 \(\times 3\), and (3) convolutional layer with 256 filters, kernel size 1 \(\times 1\) and this layer repeated 3 times. This process is the same for the other three convolutional layers. Next, the features are passed through five convolutional layers and an average pooling layer. ResNet-50 is employed as the feature extractor, with the fully connected layer and Softmax function removed. Instead, the output from the average pooling layer is utilized [66]. A visual representation of the ResNet-50 architecture can be seen in Fig. 10.

Table 7 ResNet-50 architecture summary [65]
Fig. 10
figure 10

ResNet-50 model architecture

3.4 Feature extraction

One common application of convolutional neural networks is feature extraction, where each convolutional layer extracts feature maps from an image. Features of an image, such as edge information, gradient information, are obtained after each convolutional layer [67]. In contrast to deep learning, machine learning require feature extraction, making it crucial to select a technique that not only extracts suitable features but also maintains a desirable computational cost. CNN or HOG approaches have been employed in most recent studies, such as [68,69,70,71]. In this article, ResNet-50 and HOG are employed as feature extractor.

3.4.1 Histogram of oriented gradients (HOG)

The HOG method is a highly prevalent technique for feature extraction. Its favorable computational efficiency and robustness properties have made it useful in diverse domains, including medical applications [72], facial recognition [73], and fault detection in wind turbines [74]. The implementation steps of the HOG feature extraction algorithm are as follows:

  1. (1)

    To utilize this method, the images must have fixed dimensions and be in grayscale.

  2. (2)

    Firstly, the images are divided into smaller sections called cells and then the horizontal gradient (\({G}_{x}\)) and vertical gradient (\({G}_{y}\)) are calculated for each cell. Equations (1415)

    $${G}_{x}=I\left(x+1,y\right)-I\left(x-1,y\right)$$
    (14)
    $${G}_{y}=I\left(x,y+1\right)-I\left(x,y-1\right).$$
    (15)

Therefore, it is possible to calculate both the magnitude and orientation of the gradient by:

$$m\left(x,y\right)=\sqrt{{\left({G}_{x}\right)}^{2}+{\left({G}_{y}\right)}^{2}}$$
(16)
$$\theta \left(x,y\right)={tan}^{-1}\left(\frac{{G}_{y}}{Gx}\right).$$
(17)

In this method, a histogram is generated for each cell by binning the gradient orientations of pixels within that cell. To normalize the histograms, adjacent cells are combined into larger spatial regions known as blocks, and the information obtained is used to standardize all cells within the block. We define cell size at 24 \(\times 24\), block size 4 \(\times 4\) and number of bins is 5. Figure 11 shows random image of dataset with HOG descriptors. Finally, a feature vector with 720 columns is extracted for the datasets using the HOG feature extraction algorithm.

Fig. 11
figure 11

Random image of dataset with HOG descriptors

3.5 Neighborhood component analysis (NCA)

Neighborhood component analysis is a well-known method in feature selection or dimensionality reduction. This method can be used in both classification and regression. The findings presented in the publications show that the NCA method has chosen more valuable features in reducing dimensions than methods such as PCA, GA, and Relief, leading to improved classification accuracy compared to the mentioned methods [75, 76]. The advantages of feature selection methods include reducing computational costs and, in some cases, improving classifier performance. In this article, 1720 features were extracted, with 1000 features obtained by ReNet-50 (Sect. 3.3.2) and the rest extracted by the HOG algorithm (Sect. 3.4.1). NCA reduces these features to 80 features with the highest feature weights. These 80 features are used as input for the SVM classifier, discussed in the next section. Figure 12 displays the 80 features selected by NCA from 1720 extracted features of the API dataset.

Fig. 12
figure 12

8o features selected by NCA (API dataset)

3.6 Support vector machine (SVM)

Support vector machine is one of the most well-known classification methods of machine learning. Since SVM is inherently a binary classifier, when applied to datasets with more than two classes, multiple binary classifiers are used in combination to form a multiclass classifier. In this work, we have employed the one-vs-one approach to utilize the SVM multiclass classifier. Due to the requirement of correctly classifying all samples without any errors in the hard margin SVM, the soft margin SVM is often employed where a certain degree of classification error is permitted. The soft margin SVM allows for the violation of this constraint, albeit to a limited extent. However, the soft margin must minimize the number of samples that violate this constraint while maximizing the margin [77]. This optimization problem is typically formulated according to the Eq. (18):

$${\text{min}}\left(\frac{1}{2}{\Vert {\mathcalligra{w}}\Vert }^{2}+{\mathcal{C}}\sum_{{\mathcalligra{i}}=1}^{{\mathcal{M}}}{\xi }_{{\mathcalligra{i}}}\right)$$
$$subject\ to{:}\ \left\{\begin{array}{c}{\mathcalligra{y}}_{{\mathcalligra{i}}}\left({\mathcalligra{w}}.{\phi (\mathcalligra{x}}_{{\mathcalligra{i}}})+{\mathcalligra{b}}\right)\ge 1-{\xi }_{{\mathcalligra{i}}}\\ {\xi }_{{\mathcalligra{i}}}\ge 0 \end{array}, \right.{\mathcalligra{i}}=1,\ldots ,{\mathcal{M}}$$
(18)

where \({\mathcal{C}}\) is a penalty factor, \({\xi }_{{\mathcalligra{i}}}\) indicates the distance between the margin and \({\mathcalligra{x}}_{{\mathcalligra{i}}}\) is the feature vector. The set Eq. (18) dual to Eq. (19):

$$maximize\ L\left({\mathcalligra{a}}\right)=\sum_{{\mathcalligra{i}}=1}^{{\mathcal{M}}}{{\mathcalligra{a}}}_{{\mathcalligra{i}}}-\frac{1}{2}\sum_{{\mathcalligra{i}},{\mathcalligra{j}}=0}^{{\mathcal{M}}}{{\mathcalligra{a}}}_{{\mathcalligra{i}}}{{\mathcalligra{a}}}_{{\mathcalligra{j}}}{\mathcalligra{y}}_{{\mathcalligra{i}}}{\mathcalligra{y}}_{{\mathcalligra{j}}}{\mathcalligra{k}}({\mathcalligra{x}}_{{\mathcalligra{i}}},{\mathcalligra{x}}_{{\mathcalligra{j}}})$$
$$subject\ to{:}\ \sum_{{\mathcalligra{i}}=1}^{{\mathcal{M}}}{{\mathcalligra{a}}}_{{\mathcalligra{i}}}{\mathcalligra{y}}_{{\mathcalligra{i}}}=0, 0\le {{\mathcalligra{a}}}_{{\mathcalligra{i}}}\le {\mathcal{C}}, {\mathcalligra{i}}=1,\ldots ,{\mathcal{M}}$$
(19)

where \({\mathcalligra{k}}({\mathcalligra{x}}_{{\mathcalligra{i}}},{\mathcalligra{x}}_{{\mathcalligra{j}}})\) denotes a kernel function, and commonly used kernel functions in SVM include the linear kernel, RBF kernel, and polynomial kernel. Our work utilizes the RBF kernel function (Eq. (20)):

$${\mathcalligra{k}}\left({\mathcalligra{x}}_{{\mathcalligra{i}}},{\mathcalligra{x}}_{{\mathcalligra{j}}}\right)={\text{exp}}\left(-\frac{{\Vert {\mathcalligra{x}}_{{\mathcalligra{i}}}-{\mathcalligra{x}}_{{\mathcalligra{j}}}\Vert }^{2}}{2{\upsigma }^{2}}\right).$$
(20)

The decision function of the nonlinear SVM can be described by:

$$f\left(\mathcalligra{x}\right)=sign\left({\sum_{{\mathcalligra{i}},{\mathcalligra{j}}=1}^{{\mathcal{M}}}{{\mathcalligra{a}}}_{{\mathcalligra{i}}}\mathcalligra{y}}_{{\mathcalligra{i}}}{\mathcalligra{k}}\left({\mathcalligra{x}}_{{\mathcalligra{i}}},{\mathcalligra{x}}_{{\mathcalligra{j}}}\right)+{\mathcalligra{b}}\right).$$
(21)

Note that the samples closest to the separating hyperplanes are those whose coefficients \({{\mathcalligra{a}}}_{{\mathcalligra{i}}}\) are nonzero.

3.7 Fuzzy type-III

The bases of fuzzy type-III is presented in [7]. The general structure of fuzzy type-III is illustrated in Fig. 13. Due to the ability of fuzzy type-III logic to model higher levels of uncertainty compared to fuzzy type-II, researchers have become interested in its use for practical applications. For example, fuzzy type-III logic has been utilized for flowmeter fault detection [14], aggregation neural networks for predication [12], forecasting [78] and control how improvement over fuzzy type-II because its membership functions have upper and lower uncertainties, unlike type-II. The type-III membership functions (MFs) offer multi degree of freedom due to their uncertain secondary memberships and footprint of uncertainty (FOU). Type-III MFs can be formulated as Eq. (22):

Fig. 13
figure 13

The structure scheme of the fuzzy type-III

$$\sum_{{\mathcal{X}}\in X}\sum_{{\mathcal{P}}\in {\mathcal{J}}_{{\mathcal{X}}}}\frac{\left[\frac{1}{{\mathcal{S}}_{\widetilde{{\mathcal{M}}}}\left({\mathcal{X}},{\mathcal{P}}\right)}\right]}{\left({\mathcal{X}},{\mathcal{P}}\right)}, {\mathcal{J}}_{{\mathcal{X}}}\subseteq \left[{0,1}\right].$$
(22)

In Eq. (22) \(\sum \sum\) represent s-norm and \({\mathcal{S}}_{\widetilde{{\mathcal{M}}}}\left({\mathcal{X}},{\mathcal{P}}\right)\) is a type-II MF.

As shown in Fig. 13 inputs defined by \({{\mathcal{X}}}_{i},i=1,\ldots ,n\) and for each input \({\mathcal{Q}}_{i}\) MFs is considered. The \({\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}\) denoted \({\mathcalligra{j}}\)-th MF in \(i\)-th input. Subsequent to this, firing degree are computed for \({\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}\). Four primary memberships at each horizontal slice \({\alpha }_{k}\) (secondary membership) resulting two bounds for right side and two bounds for left side. \({\overline{{\mathcal{P}}}}_{ {\mathcalligra{i}},r,{\alpha }_{k}}\) are upper bounds of right side, \({\underset{\_}{{\mathcal{P}}}}_{ {\mathcalligra{i}},r,{\alpha }_{k}}\) is lower bounds of right side, \({\overline{{\mathcal{P}}}}_{ {\mathcalligra{i}},l,{\alpha }_{k}}\) is upper bounds of left side and \({\underset{\_}{{\mathcal{P}}}}_{ {\mathcalligra{i}},l,{\alpha }_{k}}\) is lower bounds of left side (Eqs. (2326)). Figure 14 is illustrated MF of the type-III fuzzy system.

Fig. 14
figure 14

Type-III MF

$${\overline{{\mathcal{P}}}}_{ {\mathcalligra{i}},r,{\alpha }_{k} }=\left\{\begin{array}{ll}1-{\left(\frac{\left|{{\mathcal{X}}}_{{\mathcalligra{i}}}-{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\right|}{{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}}\right)}^{{\Delta }_{r}\left(1-{\alpha }_{k}\right)+1},&\quad if\ {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}-{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}<{{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\\ 1-{\left(\frac{\left|{{\mathcal{X}}}_{{\mathcalligra{i}}}-{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\right|}{{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}}\right)}^{{\Delta }_{r}\left(1-{\alpha }_{k}\right)+1},&\quad if\ {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}<{{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}+{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\\ 0,&\quad if\ {{\mathcal{X}}}_{{\mathcalligra{i}}}>{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}+{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}} or {{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}-{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\end{array}\right.$$
(23)
$${\underset{\_}{{\mathcal{P}}}}_{ {\mathcalligra{i}},r,{\alpha }_{k} }=\left\{\begin{array}{ll}1-{\left(\frac{\left|{{\mathcal{X}}}_{{\mathcalligra{i}}}-{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\right|}{{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}}\right)}^{\frac{1}{{\Delta }_{r}\left(1-{\alpha }_{k}\right)+1}},&\quad if\ {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}-{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}<{{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\\ 1-{\left(\frac{\left|{{\mathcal{X}}}_{{\mathcalligra{i}}}-{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\right|}{{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}}\right)}^{\frac{1}{{\Delta }_{r}\left(1-{\alpha }_{k}\right)+1}},&\quad if\ {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}<{{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}+{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\\ 0,&\quad if\ {{\mathcal{X}}}_{{\mathcalligra{i}}}>{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}+{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}} or {{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}-{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\end{array}\right.$$
(24)
$${\overline{{\mathcal{P}}}}_{ {\mathcalligra{i}},l,{\alpha }_{k} }=\left\{\begin{array}{ll}1-{\left(\frac{\left|{{\mathcal{X}}}_{{\mathcalligra{i}}}-{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\right|}{{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}}\right)}^{{\Delta }_{l}\left(1-{\alpha }_{k}\right)+1},&\quad if\ {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}-{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}<{{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\\ 1-{\left(\frac{\left|{{\mathcal{X}}}_{{\mathcalligra{i}}}-{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\right|}{{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}}\right)}^{{\Delta }_{l}\left(1-{\alpha }_{k}\right)+1},&\quad if\ {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}<{{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}+{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\\ 0,&\quad if\ {{\mathcal{X}}}_{{\mathcalligra{i}}}>{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}+{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}} or {{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}-{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\end{array}\right.$$
(25)
$${\underset{\_}{{\mathcal{P}}}}_{ {\mathcalligra{i}},l,{\alpha }_{k} }=\left\{\begin{array}{ll}1-{\left(\frac{\left|{{\mathcal{X}}}_{{\mathcalligra{i}}}-{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\right|}{{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}}\right)}^{\frac{1}{{\Delta }_{l}\left(1-{\alpha }_{k}\right)+1}},&\quad if\ {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}-{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}<{{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\\ 1-{\left(\frac{\left|{{\mathcal{X}}}_{{\mathcalligra{i}}}-{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\right|}{{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}}\right)}^{\frac{1}{{\Delta }_{l}\left(1-{\alpha }_{k}\right)+1}},&\quad if\ {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}<{{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}+{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\\ 0,&\quad if\ {{\mathcal{X}}}_{{\mathcalligra{i}}}>{c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}+{\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}} or {{\mathcal{X}}}_{{\mathcalligra{i}}}\le {c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}-{\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\end{array}\right.$$
(26)

In Eqs. (2326) \({c}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\), \({\overline{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\) and \({\underset{\_}{\vartheta }}_{{\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}}\) is center of \({\mathcalligra{j}}\)-th MF in \({\mathcalligra{i}}\)-th input and are the widths of left and right side \({\mathcalligra{j}}\)-th MF in \({\mathcalligra{i}}\)-th input, respectively. \({\alpha }_{k}\ k=1,\ldots ,K\) represent the value of the horizontal slice. \({\Delta }_{r}\) and \({\Delta }_{l}\) represent the width of the MFs. Considering the memberships of the UB and LB of \({\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}\),the rule firings formulated as Eqs. (2730):

$${\overline{z}}_{r,{\alpha }_{k}}^{h}={\overline{s}}_{{\widetilde{{\mathcal{M}}}}_{1}^{{h}_{1}}|r,{\alpha }_{k}}\left({{\mathcal{X}}}_{1}\right){\overline{s}}_{{\widetilde{{\mathcal{M}}}}_{2}^{{h}_{2}}|r,{\alpha }_{k}}\left({{\mathcal{X}}}_{2}\right)\dots {\overline{s}}_{{\widetilde{{\mathcal{M}}}}_{n}^{{h}_{n}}|r,{\alpha }_{k}}\left({{\mathcal{X}}}_{n}\right)$$
(27)
$${\underline{z}}_{r,{\alpha }_{k}}^{h}={\underline{s}}_{{\widetilde{{\mathcal{M}}}}_{1}^{{h}_{1}}|r,{\alpha }_{k}}\left({{\mathcal{X}}}_{1}\right){\underline{s}}_{{\widetilde{{\mathcal{M}}}}_{2}^{{h}_{2}}|r,{\alpha }_{k}}\left({{\mathcal{X}}}_{2}\right)\dots {\underline{s}}_{{\widetilde{{\mathcal{M}}}}_{n}^{{h}_{n}}|r,{\alpha }_{k}}\left({{\mathcal{X}}}_{n}\right)$$
(28)
$${\overline{z}}_{l,{\alpha }_{k}}^{h}={\overline{s}}_{{\widetilde{{\mathcal{M}}}}_{1}^{{h}_{1}}|l,{\alpha }_{k}}\left({{\mathcal{X}}}_{1}\right){\overline{s}}_{{\widetilde{{\mathcal{M}}}}_{2}^{{h}_{2}}|l,{\alpha }_{k}}\left({{\mathcal{X}}}_{2}\right)\dots {\overline{s}}_{{\widetilde{{\mathcal{M}}}}_{n}^{{h}_{n}}|l,{\alpha }_{k}}\left({{\mathcal{X}}}_{n}\right)$$
(29)
$${\underline{z}}_{l,{\alpha }_{k}}^{h}={\underline{s}}_{{\widetilde{{\mathcal{M}}}}_{1}^{{h}_{1}}|l,{\alpha }_{k}}\left({{\mathcal{X}}}_{1}\right){\underline{s}}_{{\widetilde{{\mathcal{M}}}}_{2}^{{h}_{2}}|l,{\alpha }_{k}}\left({{\mathcal{X}}}_{2}\right)\dots {\underline{s}}_{{\widetilde{{\mathcal{M}}}}_{n}^{{h}_{n}}|l,{\alpha }_{k}}\left({{\mathcal{X}}}_{n}\right).$$
(30)

In Eqs. (2730) \(k=1,\ldots ,K\) \({\widetilde{{\mathcal{M}}}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}\) denote the \({h}_{{\mathcalligra{i}}}\)-th MF for input \({x}_{{\mathcalligra{i}}}\). The \(h\)-th rule is given as:

$$If {x}_{1}is {\widetilde{{\mathcal{M}}}}_{1}^{{h}_{1}},\ldots ,and\dots {x}_{n}is {\widetilde{{\mathcal{M}}}}_{n}^{{h}_{n}}, Then {y}_{h}\in \left[\left[{\underset{\_}{\Psi }}_{{\text{l}},{\text{h}}},{\underset{\_}{\Psi }}_{{\text{r}},{\text{h}}}\right]\left[{\overline{\Psi } }_{{\text{l}},{\text{h}}},{\overline{\Psi } }_{{\text{r}},{\text{h}}}\right]\right].$$
(31)

In Eq. (31) \(\left[{\underset{\_}{\Psi }}_{{\text{l}},{\text{h}}},{\underset{\_}{\Psi }}_{{\text{r}},{\text{h}}}\right]\) denote the lower bounds fuzzy number and \(\left[{\overline{\Psi } }_{{\text{l}},{\text{h}}},{\overline{\Psi } }_{{\text{r}},{\text{h}}}\right]\) are upper bounds fuzzy number. At last, output’s fuzzy type-III is defined as Eq. (32):

$$y=\frac{\sum_{k=1}^{K}\left({\overline{\phi }}_{k}+{\underset{\_}{\phi }}_{k}\right){\alpha }_{k}}{\sum_{k=1}^{K}{\alpha }_{k}}$$
(32)

\({\overline{\phi }}_{k}\) and \({\underset{\_}{\phi }}_{k}\) defined as Eq. (32) and Eq. (33):

$${\overline{\phi }}_{k}=\frac{\sum_{h=1}^{N}\left({\overline{z}}_{r,{\alpha }_{k}}^{h}{\overline{\Psi } }_{{\text{r}},{\text{k}}}^{h}+{\underline{z}}_{l,{\alpha }_{k}}^{h}{\underset{\_}{\Psi }}_{{\text{l}},{\text{k}}}^{h}\right)}{\sum_{h=1}^{N}{\overline{z}}_{r,{\alpha }_{k}}^{h}+{\underline{z}}_{l,{\alpha }_{k}}^{h}}$$
(33)
$${\underset{\_}{\phi }}_{k}=\frac{\sum_{h=1}^{N}\left({\underline{z}}_{r,{\alpha }_{k}}^{h}{\underset{\_}{\Psi }}_{{\text{r}},{\text{k}}}^{h}+{\overline{z}}_{l,{\alpha }_{k}}^{h}{\overline{\Psi }}_{{\text{l}},{\text{k}}}^{h}\right)}{\sum_{h=1}^{N}{\underline{z}}_{r,{\alpha }_{k}}^{h}+{\overline{z}}_{l,{\alpha }_{k}}^{h}}.$$
(34)

The inputs to FIS are the outputs from three classifiers: Efficient-Capsule, ResNet-18, and SVM. The output of the FIS is the final decision regarding malware classification. In other words, FIS-III makes a choice based on its inputs to which class the data (image) belongs. This means FIS-III decides which output of classifiers is through to output, or when all classifiers are misclassified, it can be modified to achieve the best accuracy. For tuning the parameters of the FIS, meta-heuristic optimization algorithms are used instead of derivative-based methods. The meta-heuristic optimization algorithm utilized for tuning the rule parameters of the FIS-III (\({\overline{\Psi }}_{{\text{r}},{\text{k}}}^{h}\), \({\underset{\_}{\Psi }}_{{\text{r}},{\text{k}}}^{h}\),\({\overline{\Psi }}_{{\text{l}},{\text{k}}}^{h}\),\({\underset{\_}{\Psi }}_{{\text{l}},{\text{k}}}^{h}\)) is ICGO (Sect. 3.8.1).

Since, the outputs of the classifiers can be integer numbers ranging from 0 to 10, the total number of rules considered, based on the 3 classifiers used, is the combination \((\begin{array}{c}{11}\\ {3}\end{array})\) or 165 rules. Now, intelligently and by utilizing the associative memory property of fuzzy systems, these 165 rules can be reduced to just 43 rules. The intelligent selection of rules refers to first identifying the unique combinations of train data, and then choosing the output decisions to maximize accuracy. We present an example implementation to provide a simplified illustration of this intelligent rule selection approach. Suppose our fuzzy system gets an input of \(\left[{1,9},1\right]\). In train data, we have combination where the target is 1 for this input, and combination where the target is 9. We will look how many times each target occurs for this specific input pattern. If there are more cases where the target is 9 when the input is \(\left[{1,9},1\right]\), then we will set the fuzzy system output to be 9 for this rule. If the train data contains an equal number of target 1 and target 9 for input \(\left[{1,9},1\right]\), then the rule consequent will be randomly chosen as either 1 or 9 with equal probability. Given the described structure for the FIS-III with 43 rules, this result in 43 × 4 = 172 rule parameters. The rule parameters must be tuned by ICGO. The membership all three inputs are considered similar to each other and is reported in Table 8. The MFs of the FIS-III in Fig. 15 it has been shown.

Table 8 Specifications of type-III fuzzy system
Fig. 15
figure 15

MFs for three inputs FIS-III mentioned a SVM, b ResNet-18, c efficient-CapsNet

Remark 1

It should be noted that the proposed model results in increased classification accuracy only if the neural networks used as base classifiers are not overfitted.

Remark 2

Another limitation of our proposed method is that as the number of classes increases, the number of membership functions also increases.

3.8 Optimization

Optimization is fundamental in the realm of machine learning. It revolves around the meticulous adjustment of model parameters to ensure peak performance. This process is particularly crucial because even slight parameter changes can significantly impact a model’s overall accuracy and efficiency. Traditional optimization techniques might not always be the best fit in the context of fuzzy systems and neural networks, which are inherently complex and nonlinear. Meta-heuristic interpolation algorithms have emerged as a preferred choice in such scenarios. Unlike derivative-based methods, which rely on gradient information and can sometimes get stuck in local optima, meta-heuristic algorithms explore the solution space more broadly, making them more adept at finding global optima. This adaptability and flexibility make them especially effective for the intricate landscapes of fuzzy systems and neural networks. In intrusion detection and malware classification research, metaheuristic algorithms have also been utilized, such as [79,80,81,82,83,84,85].

3.8.1 Improved chaos game optimization (ICGO)

ICGO is an enhanced version of the Chaos Game Optimization (CGO) algorithm, with improvements made to the mutation phase of the traditional CGO method. Talatahari et al. [86] introduced this algorithm in 2020. The foundational mathematics of CGO is rooted in the self-resemblance properties of fractals in chaos theory and the fundamental principles of creating the Sierpinski triangle [87]. The Sierpinski triangle is crafted by segmenting each equilateral triangle into three smaller triangles, each having half the edge length of the original. When the count of initial fractal vertices rises to N, a Sierpinski triangle of N − 1 dimensions can be established, as illustrated in Fig. 16. The reason for choosing the CGO algorithm in this study is that its performance has been evaluated by 228 BFs. Additionally, 11 EDPs were assessed, and three comparative analyses were conducted to assess the outcomes of the CGO algorithm. It is rare to find an algorithm that has been evaluated with this number of tests, and the evaluation results indicate that the CGO algorithm is capable of solving various optimization problems. Therefore, there is hope that it can also demonstrate good performance for the problem addressed in this article. Both CGO and ICGO exhibit a computational complexity of \({\mathcal{O}}\left({\mathcal{N}}{\mathcalligra{m}}\left(1+4{\mathcal{T}}\right)\right)\).

Fig. 16
figure 16

Sierpinski triangle and self-similarity in ICGO

Firstly, an initialization procedure is configured by determining the initial positions of the solution candidates from the following equations:

$${\mathcal{X}}=\left[\begin{array}{c}{{\mathcal{X}}}_{1}\\ {{\mathcal{X}}}_{2}\\ \vdots \\ {{\mathcal{X}}}_{{\mathcalligra{i}}}\\ \vdots \\ {{\mathcal{X}}}_{{\mathcalligra{n}}}\end{array}\right]=\left[\begin{array}{cccccc}{\mathcalligra{x}}_{1}^{1}& {\mathcalligra{x}}_{1}^{2}& \cdots & {\mathcalligra{x}}_{1}^{{\mathcalligra{j}}}& \cdots & {\mathcalligra{x}}_{1}^{{\mathcalligra{d}}}\\ {\mathcalligra{x}}_{2}^{1}& {\mathcalligra{x}}_{2}^{2}& \cdots & {\mathcalligra{x}}_{2}^{{\mathcalligra{j}}}& \cdots & {\mathcalligra{x}}_{2}^{{\mathcalligra{d}}}\\ \vdots & \vdots & \cdots & \vdots & \cdots & \vdots \\ {\mathcalligra{x}}_{{\mathcalligra{i}}}^{1}& {\mathcalligra{x}}_{{\mathcalligra{i}}}^{2}& \cdots & {\mathcalligra{x}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}& \cdots & {\mathcalligra{x}}_{{\mathcalligra{i}}}^{{\mathcalligra{d}}}\\ {\mathcalligra{x}}_{{\mathcalligra{n}}}^{1}& {\mathcalligra{x}}_{{\mathcalligra{n}}}^{2}& \cdots & {\mathcalligra{x}}_{{\mathcalligra{n}}}^{{\mathcalligra{j}}}& \cdots & {\mathcalligra{x}}_{{\mathcalligra{n}}}^{{\mathcalligra{d}}}\end{array}\right],{\mathcalligra{i}}={1,2},\ldots ,{\mathcalligra{n}} ,{\mathcalligra{j}}={1,2},\ldots ,{\mathcalligra{d}}$$
(35)
$${\mathcalligra{x}}_{{\mathcalligra{i}}}^{{\mathcalligra{j}}}(0)={\mathcalligra{x}}_{{\mathcalligra{i}},min}^{{\mathcalligra{j}}}+{\mathcal{r}}.({\mathcalligra{x}}_{{\mathcalligra{i}},max}^{{\mathcalligra{j}}}-{\mathcalligra{x}}_{{\mathcalligra{i}},min}^{{\mathcalligra{j}}})$$
(36)

where \({\mathcalligra{d}}\) denotes the dimension of problem, \({\mathcalligra{n}}\) number of the solution candidates, \({\mathcal{r}}\) random number in the range [0,1], \({\mathcalligra{x}}_{{\mathcalligra{i}},min}^{{\mathcalligra{j}}}\) and \({\mathcalligra{x}}_{{\mathcalligra{i}},max}^{{\mathcalligra{j}}}\) the lower and upper bounds, \({\mathcalligra{j}}\) specifies the decision variable, and \({\mathcalligra{i}}\) specifies the solution number. In this Algorithm 3 seeds and a dice are utilized for creating new seeds. Seeds can be calculated as follows:

$$See{d}_{{\mathcalligra{i}}}^{1}={{\mathcal{X}}}_{{\mathcalligra{i}}}+{\alpha }_{{\mathcalligra{i}}}\times \left({\beta }_{{\mathcalligra{i}}}\times {\mathcal{G}}{\mathcal{B}}-{\gamma }_{{\mathcalligra{i}}}\times {\mathcal{M}}{{\mathcal{G}}}_{{\mathcalligra{i}}}\right),{\mathcalligra{i}}={1,2},\ldots ,{\mathcalligra{n}}$$
(37)
$$See{d}_{{\mathcalligra{i}}}^{2}={\mathcal{G}}{\mathcal{B}}+{\alpha }_{{\mathcalligra{i}}}\times \left({\beta }_{{\mathcalligra{i}}}\times {{\mathcal{X}}}_{{\mathcalligra{i}}}-{\gamma }_{{\mathcalligra{i}}}\times {\mathcal{M}}{{\mathcal{G}}}_{{\mathcalligra{i}}}\right), {\mathcalligra{i}}={1,2},\ldots ,{\mathcalligra{n}}$$
(38)
$$See{d}_{{\mathcalligra{i}}}^{3}={\mathcal{M}}{{\mathcal{G}}}_{{\mathcalligra{i}}}+{\alpha }_{{\mathcalligra{i}}}\times \left({\beta }_{{\mathcalligra{i}}}\times {{\mathcal{X}}}_{{\mathcalligra{i}}}-{\gamma }_{{\mathcalligra{i}}}\times {\mathcal{G}}{\mathcal{B}}\right), {\mathcalligra{i}}={1,2},\ldots ,{\mathcalligra{n}}.$$
(39)

In Eqs. (3739) \({\mathcal{G}}{\mathcal{B}}\) is the global best, \({\alpha }_{{\mathcalligra{i}}}\) is the movement limitation factor, \({\beta }_{{\mathcalligra{i}}}\) and \({\gamma }_{{\mathcalligra{i}}}\) represents vectors randomly created by numbers in range of [0, 1] and \({\mathcal{M}}{{\mathcal{G}}}_{{\mathcalligra{i}}}\) is the \({\mathcalligra{i}}\)th candidate’s mean group (\({{\mathcal{X}}}_{{\mathcalligra{i}}}).\)

In conventional CGO algorithm, the fourth seed is considered a dice or mutation operator. The update of position for this seed is accomplished by making random modifications to the decision variables that are randomly selected. This seed formulated as Eq. (40):

$$See{d}_{{\mathcalligra{i}}}^{4}={{\mathcal{X}}}_{{\mathcalligra{i}}}\left({\mathcalligra{x}}_{{\mathcalligra{i}}}^{{\mathcalligra{k}}}={\mathcalligra{x}}_{{\mathcalligra{i}}}^{{\mathcalligra{k}}}+{\mathcal{R}}\right),{\mathcalligra{k}}={1,2},\ldots ,{\mathcalligra{d}}, {\mathcalligra{i}}={1,2},\ldots ,{\mathcalligra{n}}.$$
(40)

In Eq. (40) \({\mathcal{R}}\) is a vector with a random number in range of [0, 1].

In this article for improve CGO algorithm replace simple mutation in convectional CGO algorithm to wavelet mutation. The \(See{d}_{{\mathcalligra{i}}}^{4}\) in ICGO defined as Eq. (41):

$$See{d}_{{\mathcalligra{i}}}^{4}=\left\{\begin{array}{ll}GB+\sigma \left({{\mathcal{X}}}_{i,max}-{\mathcal{G}}{\mathcal{B}}\right),&\quad if\ o<0.5\\ GB+\sigma \left({\mathcal{G}}{\mathcal{B}}-{{\mathcal{X}}}_{i,min}\right),&\quad if\ o\ge 0.5.\end{array}\right.$$
(41)

In Eq. (41), \({{\mathcal{X}}}_{i,max}\) and \({{\mathcal{X}}}_{i,min}\) are the lower and upper bounds, \({\mathcalligra{o}}\) is a random number and \(\sigma\) is defined as Eq. (42):

$$\sigma =\frac{1}{\sqrt{\alpha }}\psi \left(\frac{\varphi }{\alpha }\right).$$
(42)

In Eq. (42), \(\psi \left(\frac{\varphi }{\alpha }\right)={e}^{-\frac{{\left(\frac{\varphi }{\alpha }\right)}^{2}}{2}}.{\text{cos}}\left(\frac{5\varphi }{\alpha }\right)\) is the Morlet wavelet function (Fig. 17) and \(\alpha ={\mathcalligra{s}}.{\left(\frac{1}{{\mathcalligra{s}}}\right)}^{\left(1-\frac{{\mathcal{T}}}{{{\mathcal{T}}}_{max}}\right)}{\mathcalligra{s}}\) is a random integer number, The current iteration is denoted by \({\mathcal{T}}\), and \({{\mathcal{T}}}_{max}\) represents the maximum number of iterations. The pseudocode is provided in Table 9.

Fig. 17
figure 17

Morlet wavelet

Table 9 Pseudocode improved chaos game optimization

The cost function used to adjust the rule parameters is described in Sect. 3.7 and is given by Eq. (47):

$${\text{Cost}}=1-\sum_{{\mathcalligra{i}}=1}^{{\ell}}\frac{{{\mathcal{T}}{\mathcal{P}}}_{{\mathcalligra{i}}}+{\mathcal{T}}{{\mathcal{N}}}_{{\mathcalligra{i}}}}{{{\mathcal{T}}{\mathcal{P}}}_{{\mathcalligra{i}}}+{\mathcal{T}}{N}_{{\mathcalligra{i}}}+{\mathcal{F}}{{\mathcal{P}}}_{{\mathcalligra{i}}}+{{\mathcal{F}}{\mathcal{N}}}_{{\mathcalligra{i}}}}$$
(47)

where, \({{\mathcal{T}}{\mathcal{P}}}_{{\mathcalligra{i}}}\) is the number of samples correctly put in the \({\mathcalligra{i}}\)th class, \({\mathcal{F}}{{\mathcal{P}}}_{{\mathcalligra{i}}}\) is the number of samples incorrectly put in the ith class and \({{\mathcal{F}}{\mathcal{N}}}_{{\mathcalligra{i}}}\) is the number of samples that belonged in the \({\mathcalligra{i}}\)th class but were put in other classes. Moreover,\({\ell}\) is the number of the total labels. In all separate runs in order to adjust rule parameters FIS-III the population size is 30 and the maximum iteration is 200 of repetitions considered. Figure 18 depicts the average convergence and 0.5 standard division of the ICGO optimization algorithm across multiple runs (5 separate runs per dataset) for adjusting the rule parameters of the fuzzy system.

Fig. 18
figure 18

The mean and 0.5 standard deviation were calculated for seven datasets to optimize rule parameters across five folds

3.8.2 Evolution of CEC 2019 BFs, CEC 2017 BFs, and EDPs

We evaluate the effectiveness of ICGO in addressing challenges posed by CEC 2019 BFs, CEC 2017 BFs, and five EDPs scenarios in 30 independent runs and 1000 iterations. Our study juxtaposes the performance of ICGO against twelve established metaheuristic algorithms CGO, LSHADE-EnSin, AOA, SSA, SCA, TLBO, GOA, PSO, WOA to gauge its efficacy in achieving optimal outcomes. The modification of control parameters is described according to the specifications provided in Table 10. The ICGO is applied to resolve these issues, with a total of 30,000 evaluations conducted. The population size for both CGO and ICGO is kept constant at 15 members. The TLBO's population size is maintained at a constant of 30 members. The population size of other algorithms is maintained at a constant of 60 members.

Table 10 Control parameters of algorithms
3.8.2.1 CEC 2017 BFs

The ability of the ICGO to tackle optimization challenges has been questioned when tested on the latest functions from the CEC 2017 test suite. This suite comprises thirty BFs categorized into four groups: three UMBFs (F1 to F3), seven MMBFs (F4 to F10), ten hybrid functions (F11 to F20), and ten composition functions (F21 to F30). We exclude the use of the F2 test function from the CEC 2017 set due to its unpredictable behavior, a decision shared by other researchers in their respective papers. Comprehensive details and information regarding these BFs can be found in [88]. The experiments are conducted across various dimensions of BFs, specifically 10, 30, 50, and 100. Results from optimizing BFs within the CEC 2017 set using the ICGO and competitor algorithms are presented in Table B.1B.4. Furthermore, boxplots and convergence curves of algorithms illustrating the performance of the ICGO and competitor algorithms across the CEC 2017 BFs for different dimensionalities are depicted in Fig. B.1B.8. The ICGO algorithm obtained the best solution among all algorithms for all 116 CEC 2017 BFs.

3.8.2.2 CEC 2019 BFs

The CEC 2019 BFs comprise ten intricate functions outlined in [89]. F1 and F10 functions, part of the CEC 2019 test suite, are tailored for single-objective real parameter optimization, targeting the discovery of globally optimal solutions. These functions serve as valuable tools for evaluating the efficacy of algorithms in conducting comprehensive searches for optimal solutions. ICGO algorithm achieved the best results in functions F1–F3, F5, and F7–F10 compared to the competing algorithms. In function F6, ICGO outperformed in all criteria except the Best criterion (see Table B.5). Boxplots and convergence curves of algorithms illustrating the performance of the ICGO and competitor algorithms across the CEC 2017 BFs are depicted in Fig. B.9 and Fig. B.10.

3.8.2.3 EDPs

The statistical results obtained through different methodologies are presented in Tables B.6. Furthermore, Fig. B.11 and Fig. B.12 display boxplots and convergence diagrams illustrating the algorithms’ performance. PV Design (Fig. D.1), TCS Design (Fig. D.2), WB Design (Fig. D.3), TBT Design (Fig. D.4), CSI Design (Fig. D.5) [90] The ICGO achieved the lowest values compared to other algorithms.

3.8.3 Optimization algorithms’ statistical evaluation

To meticulously assess the effectiveness of the ICGO, we perform an extensive statistical analysis, comparing it against the examined algorithms. The Wilcoxon nonparametric signed-rank test examines whether there’s a substantial contrast between pairs of data (Table C.1). It evaluates the magnitudes of differences (disregarding their direction) by assigning ranks and computing a statistic based on these ranks. This figure aids in discerning whether distinctions are probably attributable to random variation or if they carry significance. A low p-value indicates a substantial disparity between the paired data, while a high p-value suggests uncertainty regarding the existence of a noteworthy difference.

The Friedman test is indeed a non-parametric statistical test used to determine if there are statistically significant differences among multiple related groups (Table C.2). This research divided the BFs into five distinct groups to ensure the test’s reliability. The first, second, third, and fourth groups included CEC 2017 BFs in different dimensions, respectively (Tables B.1B.4), while the fifth group is formed by CEC 2019 BFs illustrated in Table B.5 [91].

A post-hoc Nemenyi test was utilized to delve deeper into the distinctions among the algorithms. If the null hypothesis is rejected, a post-hoc test can be conducted. The Nemenyi test is employed when conducting pairwise comparisons among all algorithms. The performance disparity between two classifiers is deemed significant if their respective average ranks exhibit a difference equal to or exceeding the \(CD\) (Eq. (48)) [91].

$$CD={q}_{\alpha }\sqrt{\frac{k(k+1)}{6N}}$$
(48)

\(N\) represents the number of BFs in each group, \(k\) represents the number of algorithms under comparison and in each group, we selected the top 10 algorithms for comparison. At a significance level of \(\alpha = 0.05\), the critical value for 10 algorithms, the associated \(CD\) for each group has been specified in Fig. C.1 and \({q}_{\alpha }=3.164\). To identify distinctions among the ten algorithms, the \(CD\) derived from the Nemenyi test was employed. The \(CD\) diagrams depicted in Fig. C.1 offer straightforward and intuitive visualizations of the outcomes from a Nemenyi post-hoc test. This test is specifically designed to assess the statistical significance of differences in average ranks among a collection of ten algorithms, each evaluated on a set of five groups.

Following the revelation of notable variations in performance among various algorithms, it becomes imperative to identify which algorithms exhibit significantly different performances compared to ICGO. ICGO is regarded as control algorithm in this context. Figure C.1 displays the average ranking of each method across five groups, with significance levels of 0.05 in 30 distinct runs. ICGO demonstrates significant superiority over algorithms whose average ranking exceeds the threshold line indicated in the figure.

A post-hoc analysis determines that if the disparity in mean Friedman values between the two algorithms falls below the \(CD\) threshold, there is no notable distinction between them; conversely, if it surpasses the \(CD\) value, a significant difference between the algorithms exists. In Table C.3, a comparison has been conducted between 10 algorithms and ICGO across all five BF groups. Algorithms that are not significantly different from the ICGO algorithm are highlighted with a red mark. Conversely, algorithms that are deemed significantly different from the ICGO algorithm are highlighted with a green mark in this table. In accordance with Table C.3, none of the examined algorithms in this article can serve as a substitute for algorithm ICGO. This observation underscores the necessity of the existence of algorithm ICGO, which can potentially address limitations not covered by other algorithms.

4 Proposed method

The implementation steps of our proposed model are as follows:

  1. (1)

    The Efficient-CapsNet was trained on a training set of RGB dimensions 28 × 28, according to the current k-fold cross-validation partitioning.

  2. (2)

    The ResNet-18 was trained on the training set of resized images with RGB dimensions of 224 × 224, according to the current k-fold cross-validation partitioning.

  3. (3)

    Features are extracted from the resized images of the training set using the ResNet-50 network and the HOG algorithm, according to the current k-fold cross-validation partitioning.

  4. (4)

    The NCA algorithm reduces the 1720 feature vectors obtained from step 3 to 80 features.

  5. (5)

    The 80 feature vectors obtained from the previous step are used to train the SVM classifier.

  6. (6)

    To train the type-III fuzzy system as an appropriate decision maker, the results obtained from the three classifiers are considered as a new dataset for the type-III fuzzy system. Its special cases are separated, and the rest are removed, as explained in Sect. 3.7. Finally, we have a dataset of 43 samples with labels that provide the highest classification accuracy (for the original dataset under investigation).

  7. (7)

    The fuzzy system is trained using the dataset created in step 6, and its rule parameters are fine-tuned by the ICGO algorithm (Sect. 3.8.1).

  8. (8)

    The proposed model’s performance is now evaluated using the test set.

Figure 19 presents a concise and graphical overview of the steps involved in implementing the proposed model.

Fig. 19
figure 19

The proposed model’s steps

Malware’s impact on infiltrating diverse operating systems underscores the urgency of creating a precise classifier for effective categorization. This article introduces a novel hybrid classifier that combines type-III fuzzy decision-making with a deep neural network ensemble. An enhanced version of the CGO optimization algorithm is utilized to efficiently adjust fuzzy system parameters, boosting accuracy. The proposed classifier consistently achieves accuracy rates higher than 96% in datasets related to malware classification. Furthermore, comparisons with recently introduced classifiers on MNIST and Fashion-MNIST datasets showcased the supremacy of the proposed classifier in multiple performance metrics, highlighting its superiority over well-known networks. While a minor rise in parameters when compared to ResNet-18 could be viewed as a minor weakness of the proposed method, it's important to note that the proposed approach significantly outperforms ResNet-18 and even networks with higher parameter counts like AlexNet in terms of accuracy. In this context, this aspect might be seen as a strength of the proposed method.

5 Results and discussion

In this section, the results have been thoroughly examined to familiarize other researchers with the strengths of our proposed classifier and facilitate easy comparison with other classifiers. The proposed method was evaluated using publicly available 11-class datasets from Ref. [39]. As noted in Sect. 3.1, the mentioned dataset comprises seven datasets. We compared the models proposed in [39] and other classifiers employed in this paper with our proposed method to evaluate its performance. Further, we utilized fivefold cross-validation (Sect. 3.1.2) in all classifiers used in this study to reduce the impact of data grouping (test and train) on performance. Lastly, a total of 750 samples, representing the test data, and 2999 samples, for the train data, are used in the simulations throughout the dataset, which are classified through fivefold cross-validation. In the course of our analysis, the macro average metrics was employed. However, only the basic formula of the metric is mentioned herein for brevity and to minimize the complexity of the presented equations. Metrics such as accuracy (Eq. 48), precision (Eq. 51), sensitivity (Eq. 50), specificity (Eq. 49), F1-score (Eq. 52). Also, to avoid prolixity, only the confusion matrix of the most optimal result obtained from the proposed model was depicted in the Sect. 5, whereas other results were detailed in Tables E (1–7), Tables F (1–7) and Figs. E (1–14). Since deep learning and metaheuristic optimization methods do not generally provide mathematical proof to guarantee performance, various models were carefully analyzed in the simulation section during separate runs. The advantages or disadvantages of the methods can only be discussed through their statistical analysis and multiple runs [92,93,94]:

$${\text{Accuracy}}=\frac{{\mathcal{T}}{\mathcal{P}}+{\mathcal{T}}{\mathcal{N}}}{{\mathcal{T}}{\mathcal{P}}+{\mathcal{T}}{\mathcal{N}}+{\mathcal{F}}{\mathcal{P}}+{\mathcal{F}}{\mathcal{N}}}$$
(49)
$${\text{Sensitivity}}=\frac{{\mathcal{T}}{\mathcal{N}}}{{\mathcal{T}}{\mathcal{N}}+{\mathcal{F}}{\mathcal{P}}}$$
(50)
$${\text{Specificity}}=\frac{{\mathcal{T}}{\mathcal{P}}}{{\mathcal{T}}{\mathcal{P}}+{\mathcal{F}}{\mathcal{N}}}$$
(51)
$${\text{Precision}}=\frac{{\mathcal{T}}{\mathcal{P}}}{{\mathcal{T}}{\mathcal{P}}+{\mathcal{F}}{\mathcal{P}}}$$
(52)
$${\text{F}}1{\text{-score}}=2\times \frac{Precision\times Sensitivity}{Precision+Sensitivity}.$$
(53)

5.1 API dataset

The malware made frequent API calls to the operating system for network access and file creation, represented. The R channel in the malware visualization is used to represent the API dataset, which causes the main characteristic of this dataset to be the presence of the R channel and the absence of the other two channels (G and B).

Figure E (1) and Fig. 20 demonstrate the confusion matrix of the best results in fivefold of Efficient-CapsNet, SVM, ResNet-18 methods our proposed method for test and train. Average accuracy: 0.9829, 0.9504, 0.9411, 0.9902, average sensitivity: 0.9810, 0.9351, 0.9231, 0.9920, average specificity: 0.9891, 0.9853, 0.9844, 0.9965, average precision: 0.9822, 0.9524, 0.9908, 0.9402, and average F1-score: 0.9819, 0.9380, 0.9269 and 0.9911 for the train data classification of the API dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, and the proposed classifier, respectively. Table (F.1) reports the results comparison of the classifiers for the train data of the API dataset. Throughout this study, we have used the abbreviation N/A to indicate Not Available information to maintain clarity and precision in our reporting.

Fig. 20
figure 20

Confusion matrix of the best result in fivefold of proposed model, a train, b test

Average accuracy: 0.9520, 0.9190, 0.9036, 0.9713 and N/A, average sensitivity: 0.9544, 0.9013, 0.9035, 0.9668, average specificity: 0.9870, 0.9821,0.8863, 0.9950 and N/A, average precision: 0.9618, 0.9237, 0.9107, 0.9770 and 0.93038, and average F1-score: 0.9530, 0.9056, 0.8891, 0.9688 and 0.90948 for the test data classification of the API dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, the proposed classifier, and DACN proposed classifier in Article [39], respectively. As can be observed, the most optimal average of various metrics belongs to our proposed classifier. The variance of the testing classification accuracy for our proposed method is only 0.0018, whereas it stands at 0.0062 for classifier DACN which is more than three times our proposed method, doubling the value of this performance improvement. Table 11 and Table (E.1) depict the performance comparison of the classifiers for the test data of the API dataset.

Table 11 The performance comparison proposed model using the fivefold CV procedure (test)

To better depict the robustness of the proposed classifier, the box plot of the models was also depicted in this article (Fig. E(2)). In this chart, the test data accuracy of the proposed classifier, and the models analyzed in this article and the DACN were illustrated in various folds. It is noteworthy that not only is the accuracy dispersion obtained through variant folds classifier substantially less than that of the other classifiers, but also the accuracy of the proposed classifier in various folds is more than other classifiers.

5.2 DLL dataset

The malware loaded DLLs that contained encryption functions, represented by G channel in the malware visualization. The main characteristic of these images is the presence of the G channel and the absence of the R and B channels.

Figure E(3) and Fig. 21 demonstrate the confusion matrix of the best results in fivefold of Efficient-CapsNet, SVM, ResNet-18 methods and our proposed method for test and train.

Fig. 21
figure 21

Confusion matrix of the best result in fivefold of proposed model, a train, b test

Average accuracy: 0.9753, 0.9727, 0.9556, 0.9854, average sensitivity: 0.9766, 0.9681, 0.9495, 0.9867, average separation: 0.9893, 0.9874, 0.9858, 0.9971, average precision: 0.9782, 0.9710, 0.9512, 0.9881, and average F1-score: 0.9759, 0.9701, 0.9522, 0.9860 for the train data classification of the DLL dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, and the proposed classifier, respectively. Table (F.2) reports the results of the classifiers for the train data of the DLL dataset.

Average accuracy: 0.9479, 0.9601, 0.9426, 0.9634 and N/A, average sensitivity: 0.9349, 0.9598, 0.9427, 0.9524 and N/A, average specificity: 0.9867, 0.9863, 0.9953 and N/A, average precision: 0.9544, 0.9600, 0.9431, 0.9714 and 0.96212, and average F1-score: 0.9403, 0.9597, 0.9424, 0.9572 and 0.96504 for the test data classification of the DLL dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, the proposed classifier, and DACN, respectively. (Table 11 and Table (E.2)) Our proposed classifier also achieved the most optimal performance in this dataset as per different metrics. With a slight difference of 0.96% in the average classification accuracy compared to classifier DACN, the proposed classifier ranks 1st. We must take into consideration that our proposed classifier, with a 0.0014 standard deviation of average classification accuracy, and classifier DACN, with a 0.0090 standard deviation of average classification accuracy, were ranked 1st and 2nd, respectively. This shows that our proposed classification has achieved a more proper classification in different folds.

The proposed classifier for the DLL dataset achieved the highest accuracy compared to the models evaluated in this article and DACN. The box plot of the classification accuracy for different models evaluated and the DACN model for the DLL dataset is plotted in Fig. E(4). As can be seen, the proposed classifier in this case also achieved the highest classification accuracy and the lowest amount of dispersion around the best classification among all the evaluated models, and thus is placed in the first position.

5.3 REG dataset

The malware manipulated the registry to achieve persistence, represented by the B channel in the malware visualization.

Figure E(5) and Fig. 22 demonstrate the confusion matrix of the best results in fivefold of Efficient-CapsNet, SVM, ResNet-18 methods and our proposed method for test and train. Average accuracy: 0.9808, 0.9352, 0.9267, 0.9914, average sensitivity: 0.9763, 0.9222, 0.9039, 0.9908, average specificity: 0.9878, 0.9864, 0.9845, 0.9899, average precision: 0.9800, 0.9561, 0.9353, 0.9956, and average F1-score: 0.9784, 0.9221, 0.9897and 0.9911 for the train data classification of the REG dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, and the proposed classifier, respectively. Table (F.3) report the results of the classifiers for the train data of the REG dataset.

Fig. 22
figure 22

Confusion matrix of the best result in fivefold of proposed model, a train, b test

Average accuracy: 0.9335, 0.9260, 0.9239, 0.9451 and N/A, average sensitivity: 0.9057, 0.9162, 0.9101, 0.9389 and N/A, average specificity: 0.9839, 0.9854, 0.9848, 0.9939 and N/A, average precision: 0.9406, 0.9457, 0.9379, 0.9626 and 0.83703, and average F1-score: 0.9102, 0.9158, 0.9047, 0.9397 and 0.72883 for the test data classification of the DLL dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, the proposed classifier, and DACN, respectively (Table 11 and Table (E.3)).

Compared to DACN, our proposed classifier has improved by roughly 15% regarding the average classification accuracy and 28% regarding the average F1-score. Compared to DACN, our proposed classification has improved the standard deviation of the average classification accuracy by over 92%. (The standard deviation of our proposed classifier: 0.0020, and the standard deviation of classifier DACN: 0.0258) Compared to the other models analyzed in this article, classifier DACN showed the poorest performance for the test data of the REG dataset. Aside from exhibiting the weakest classification accuracy, this classifier had the highest classification accuracy dispersion in various folds. With insignificant dispersion and the highest classification accuracy in various folds, our proposed classifier had the best result among the other models (Fig. E(6)).

5.4 API + DLL dataset

The malware visualization showed integrated R channel and G channel representing API calls and loaded DLLs. The R channel indicated network and file activities. The G channel revealed encryption functions from external libraries.

Figure E(7) and Fig. 23 demonstrate the confusion matrix of the best results in fivefold of Efficient-CapsNet, SVM, ResNet-18 methods and our proposed method for test and train.

Fig. 23
figure 23

Confusion matrix of the best result in fivefold of proposed model, a train, b test

Average accuracy: 0.9854, 0.9818, 0.9608, 0.9937, average sensitivity: 0.9819, 0.9826, 0.9478, 0.9911, average specificity: 0.9892, 0.9894, 0.9850, 0.9974, average precision: 0.9842, 0.9816, 0.9514, 0.9924, and average F1-score: 0.9836, 0.9821, 0.9524 and 0.9923 for the train data classification of the API + DLL dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, and the proposed classifier, respectively. Table (F.4) reports the results of the classifiers for the train data of the API + DLL dataset.

Average accuracy: 0.9675, 0.9691, 0.9604, 0.9782 and N/A, average sensitivity: 0.9615, 0.9664, 0.9398, 0.9667 and N/A, average specificity: 0.9872, 0.9874, 0.9838, 0.9955 and N/A, average precision: 0.9658, 0.9635, 0.9410, 0.9752 and 0.96612, and average F1-score: 0.9643, 0.9675, 0.9472, 0.9720 and 0.96572 for the test data classification of the APP + DLL dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, the proposed classifier, and DACN, respectively (Table 11 and Table (E.4)).

The proposed classifier has performed better than other evaluated models in terms of all the classification metrics. In the test data of the API + DLL dataset, our proposed classifier had the highest classification accuracy and the lowest classifier accuracy dispersion in various folds. ResNet-18 showed the weakest performance among the analyzed models (Fig. E(8)). Our proposed classification had significantly better performance, considering the mentioned points, compared to the proposed classifier DACN.

5.5 API + REG dataset

The malware visualization displayed a combination of R channel and B channel for API calls and registry manipulations. The R channel pointed to network and file behaviors. The B channel showed registry changes for persistence.

Figure E(9) and Fig. 24 demonstrate the confusion matrix of the best results in fivefold of Efficient-CapsNet, SVM, ResNet-18 methods and our proposed method for test and train.

Fig. 24
figure 24

Confusion matrix of the best result in fivefold of proposed model, a train, b test

Average accuracy: 0.9814, 0.9691, 0.9603, 0.9788, average sensitivity: 0.9773, 0.9549, 0.9495, 0.9739, average specificity: 0.9880, 0.9884, 0.9880, 0.9960 average precision: 0.9806, 0.9775, 0.9679, 0.9870, and average F1-score: 9791, 0.9625, 0.9555, 0.9738 for the train data classification of the API + REG dataset belongs to the classifiers SVM, Efficient-CapsNet, ResNet-18, and the proposed classifier, respectively. Table (F.5) reports the results of the classifiers for the train data of the API + REG dataset.

Average accuracy: 0.9013, 0.8954, 0.8889, 0.9148 and N/A, average sensitivity: 0.9073, 0.9082, 0.9002, 0.9208 and N/A, average specificity: 0.9851, 0.9861, 0.9862, 0.9939 and N/A, average precision: 0.9528, 0.9574, 0.9514, 0.9676, and 0.95466, and average F1-score: 0.9026, 0.8933, 0.8829, 0.9161 and 0.93815 for the test data classification of the API + REG dataset belongs to the classifiers SVM, Efficient-CapsNet, ResNet-18, the proposed classifier, and DACN, respectively (Tables 11 and Table (E.5)).

In terms of average F1-score, classifier DACN ranks 1st and our proposed classifier ranks 2nd. However, in terms of average classification accuracy, our proposed method ranks 1st. The standard deviation of the average classification accuracy of our proposed classifier stands at 0.0019 while this number is 0.0092 for classifier DACN, indicating the proposed class is more robust. DACN classifier has the highest dispersion of classification accuracy in different folds for API + REG dataset classification. The lowest dispersion of classification accuracy and the highest classification accuracy in different folds belong to our proposed classifier (Fig. E(10)). The results show the efficiency of our proposed classifier for different datasets.

5.6 DLL + REG dataset

The malware visualization had integrated G channel and B channel representing loaded DLLs and registry activities. The G channel highlighted encryption functions. The B channel indicated registry modifications for persistence.

Figure E(11) and Fig. 25 demonstrate the confusion matrix of the best results in fivefold of Efficient-CapsNet, SVM, ResNet-18 methods and our proposed method for test and train.

Fig. 25
figure 25

Confusion matrix of the best result in fivefold of proposed model, a train, b test

Average accuracy: 0.9750, 0.9793, 0.9653, 0.9893, average sensitivity: 0.9708, 0.9785, 0.9610, 0.9854, average specificity: 0.9877, 0.9778, 0.9660, 0.9967, average precision: 0.9734, 0.9865, 0.9831, 0.9896, and average F1-score: 0.9726, 0.9771, 0.9597 and 0.9872 for the train data classification of the DLL + REG dataset belongs to the classifiers SVM, Efficient-CapsNet, ResNet-18, and the proposed classifier, respectively. Table (F.6) reports the result of the classifiers for the train data of the DLL + REG dataset.

Average accuracy: 0.9668, 0.9687, 0.9350, 0.9820 and N/A, average sensitivity: 0.9570, 0.9565, 0.9386, 0.9773 and N/A, average specificity: 0.9869, 0.9877, 0.9858, 0.9964 and N/A, average precision:0.9654, 0.9733, 0.9509, 0.9868 and 0.97306 and average F1-score: 0.9611, 0.9622, 0.9339, 0.9791 and 0.97036 for the test data classification of the DLL + REG dataset belongs to the classifiers SVM, Efficient-CapsNet, ResNet-18, the proposed classifier, and DACN, respectively (Tables 11 and Table (E.6)).

Considering various metrics, our proposed classification performed better compared to other classifiers examined. The order of the ranking of classifiers based on the average classification accuracy: our proposed method, Efficient-CapsNet, DACN, SVM, and RestNet-18. The order of the ranking of the classifiers according to the standard deviation of the average classification accuracy: our proposed method, ResNet-18, Efficient-CapsNet, SVM, and DACN. Similar to the API + DLL dataset, in the DLL + REG dataset, RestNet-18 accounts for the high classification accuracy dispersion, as well as the lowest classification accuracy in various folds. The highest classification accuracy in various folds and the lowest classification accuracy dispersion, such as other analyzed datasets so far, belong to our proposed classification (Fig. F(12)).

5.7 API + DLL + REG dataset

The malware visualization showed a combination of R channel, G channel, and B channel representing API calls, loaded DLLs, and registry manipulations respectively. The R channel indicated network and file activities. The G channel pointed to encryption functions. And the B channel revealed registry changes for persistence. Together the RGB visualization provided an integrated profile of how the malware leveraged the operating system, external libraries, and registry to operate.

Figure E(13) and Fig. 26 demonstrate the confusion matrix of the best results in fivefold of Efficient-CapsNet, SVM, ResNet-18 methods and our proposed method for test and train.

Fig. 26
figure 26

Confusion matrix of the best result in fivefold of proposed model, a train, b test

Average accuracy: 0.9800, 0.9822, 0.9604, 0.9937, average sensitivity: 0.9800, 0.9799, 0.9517, 0.9892, average specificity: 0.9884, 0.9892, 0.9859, 0.9975, average precision: 0.9800, 0.9824, 0.9548, 0.9928, and average F1-score: 0.9800, 0.9809, 0.9546 and 0.9913 for the train data classification of the APP + DLL + REG dataset belongs to the classifiers SVM, Efficient-CapsNet, ResNet-18, and the proposed classifier, respectively. Table (F.7) reports the results of the classifiers for the train data of the APP + DLL + REG dataset.

Average accuracy: 0.9650, 0.9784, 0.9488, 0.9944, N/A, N/A, N/A, N/A and N/A, average sensitivity: 0.9459, 0.9704, 0.9334, 0.9886 N/A, N/A, N/A, N/A and N/A, average specificity: 0.9857, 0.9884, 0.9845, 0.9973, N/A, N/A, N/A, N/A and N/A, average precision: 0.9559, 0.9753, 0.9914, 0.9414, 0.97519, 0.94319, 0.62440, 0.96079 and 0.97199 and average F1-score: 0.9545, 0.9740, 0.9383, 0.9913, 0.97396, 0.93445, 0.42215, 0.95351 and 0.97032 for the test data classification of the APP + DLL + REG dataset belong to the classifiers SVM, Efficient-CapsNet, ResNet-18, the proposed classifier, and DACN, respectively (Table 11 and Table (E.7)).

Similar to the API + DLL and DLL + REG datasets, in the test data of the APP + DLL + REG dataset, ResNet-18 accounts for the lowest classification accuracy and highest classification accuracy dispersion (Fig. E(14)).

5.8 Main result

Considering the seven analyzed datasets, the efficiency of our proposed classifier for various datasets can be confirmed. Further, given the assessment of this class with 35 separate runs, it is possible to discuss the improvement of classification accuracy and the robustness of the proposed classifier in finding the proper classification with more confidence. Hence, in addition to the k-fold cross validation, the impact of data grouping significantly decreases due to the number of runs. The significance of this matter lies in the fact that neural networks and optimization algorithms often do not evaluate their performance through a mathematical proof and mostly compare them in terms of statistics, especially average and variance in numerous runs. This paper analyzed all the aforementioned, and the proposed classifier obtained the most optimal result.

Figure 27 demonstrates the best classification accuracy for different models analyzed in this article and DACN. It can easily be observed that all the datasets of our proposed classifier had the most optimal performance in terms of classification accuracy. Given the obtained results and their comparison with various models in this article, it can be claimed that the proposed classifier has properly and significantly improved compared to the classifiers from which it feeds. Further, it has performed better than highly used classifiers such as AlexNet, ResNet-18, and GoogleNet.

Fig. 27
figure 27

The best classification precision for different models (previous work = DACN)

Nonetheless, increased classification accuracy has not led to changes in the order of computational complexity (despite the primary objective of this paper being solely the improvement of classification accuracy). Figure 28 compares the learning parameters of the three networks of AlexNet, ResNet-18, and our proposed classification. Compared to ResNet-18, the number of the parameters added by our proposed classifier was only 142,956, which can be ignored considering the performance improvement and the number of ResNet-18 parameters (11.6 million). The trainable parameters in AlexNet reach 60.9 million, which substantially exceeds our proposed network. Although designing a low-parameter network was not one of our objectives, it can be stated that our proposed network has not caused exponential growth of parameters compared to the other networks. In the conclusion and future work (Sect. 6), we will discuss ideas regarding the design of this network so that it would be significantly lightweight for other researchers to use if found useful.

Fig. 28
figure 28

Comparing the number of train parameters

In addition to the seven mentioned datasets, the performance improvement of the proposed classifier was also evaluated on two well-known datasets, MNIST [95] and Fashion-MNIST [96] (Table (A)). The reason for choosing these two datasets was to thoroughly investigate the proposed classifier's performance from various perspectives by evaluating it on two large balanced datasets and one unbalanced dataset. The MNIST and Fashion-MNIST datasets are 10-class datasets; consequently, the output of classifiers are integer numbers between 0 to 9. The specifications of the FIS-III are similar to those reported in Table 8, with the difference that the parameters associated with the eleventh membership function have been removed by eliminating it. The total number of rules for these two datasets is 120; however, as described in Sect. 3.7, this number decreases to 28 and 32, respectively.

Table 12 presents a comparison between the proposed model in this article and recent models for detecting and classifying malware. The results indicate that the proposed classification method achieves high performance in identifying and categorizing malware when all three extracted dynamic features are utilized (API + DLL + REG) and it is ranked among the top-performing classifiers for malware classification. Even when the proposed method uses only one dynamic feature, such as REG, it still performs well and achieves a classification accuracy above 0.96 in all conditions.

Table 12 Comparing existing models for malware detection and classification

5.8.1 Classifiers’ statistical evaluation

In this section, the Friedman test and the Nemenyi post hoc test are employed to evaluate if the overall variances among accuracy and F1-score measurements are statistically meaningful. The Friedman test assesses algorithms individually for each dataset, ranking them based on their performance (Table 13). Subsequently, the Nemenyi test compares the average ranks of the algorithms. Then, it computes based on the \({\chi }_{F}^{2}\) distribution with \(k-1\) degrees of freedom (where \(k\) represents the number of algorithms). To identify a statistically notable variance in performance, a post hoc examination is necessary to pinpoint the algorithms responsible for these variances. The Nemenyi test is extensively employed in this scenario. This examination indicates notable discrepancies in their performance when the mean ranks of the two algorithms deviate from certain critical margins. Initially, we compute the mean rank for each algorithm examined in our trials, setting \(k=5\) and \(n=7\), as there are a total of 5 methods and 7 datasets involved. If the Friedman test statistic yields accuracy and F1-score values of \({\chi }_{F}^{2}=21.6\) and \({\chi }_{F}^{2}=16.8\) respectively, with \(4(k-1)\) degrees of freedom, \(\alpha =0.05\) and the critical value for the Friedman test, given for \(k=5\) and \(n=7\), is 9.49 at a significance level of, we can infer that the accuracy and F1-score values of the evaluated methods exhibit significant differences (\(21.6>9.49\), \(16.8>9.49\) respectively). Since the null hypothesis has been disproved, we can move forward with a post hoc examination. The Nemenyi test becomes applicable when all classifiers are assessed against one another [104]. The critical value in our experiments with \(k=5\) and \(\alpha =0.05\) is \({CD}_{\alpha }=2.3056\), \({q}_{\alpha }=2.728\). As a result, the accuracy and F1-score of the proposed method is significantly different from SVM, Efficient-CapsNet, ResNet-18 and DACN. Figures 29 and 30 depicts the statistical comparison of the approaches examined in our experiments using the Nemenyi test.

Table 13 The ordinal rank of the superior outcomes achieved by five classifiers when evaluated against each other
Fig. 29
figure 29

The crucial charts for F1-score metric the outcomes derived from the Nemenyi post hoc examination, alongside the mean rank scores obtained from the Friedman test

Fig. 30
figure 30

The crucial charts for Accuracy metric the outcomes derived from the Nemenyi post hoc examination, alongside the mean rank scores obtained from the Friedman test

6 Conclusion and future work

Scholarly evidence indicates that the utilization of artificial intelligence (AI) techniques has resulted in a significant enhancement of productivity across various domains. As smart devices continue to proliferate and become more deeply integrated into daily routines, the significance of safeguarding these devices against potential security threats becomes increasingly paramount. Malware constitutes a critical factor in the infiltration of diverse operating systems. Consequently, the development of a classifier that is capable of accurately categorizing malware becomes. For this reason, this article proposes a novel hybrid classifier based on type-III fuzzy decision-making and an ensemble of deep neural networks. To increase the speed of adjusting fuzzy system parameters and achieve higher accuracy, an improved version of CGO optimization algorithm was introduced and it was evaluated on 126 BFs and 5 EDPs, where it achieved first place in 124 BFs and all of EDPs. To evaluate the efficiency of the proposed classifier, we tested it on nine datasets. In 7 datasets related to malware classification, the classification accuracy was always over 0.96 and showed an improvement of 0.24–13.3% compared to other classifiers. Additionally, the statistical analysis results indicate that both ICGO and the proposed classifier exhibit superior performance compared to their competitors.

In addition, the performance of our proposed classifier was compared with six classifiers introduced in recent years for the well-known MNIST and Fashion-MNIST datasets. The proposed classifier ranked first in accuracy, specificity, sensitivity precision, and F1-score for these two datasets. The 45 separate runs of this classifier on nine datasets show that it is a suitable classifier for various applications and has significantly higher accuracy than well-known networks such as Efficient capsule, AlexNet, ResNet-18, and GoogleNet. This subject becomes important when we consider that the number of parameters in our proposed classifier is less than classifiers such as AlexNet, ResNet-18 and SIRE CNN [105].

Suggestions for future work include utilizing multi-objective optimization to optimize the parameters of a fuzzy system by considering a cost function that incorporates classification accuracy and F1-score. Another avenue for exploration involves utilizing a non-singleton Type-III fuzzy system as a decision-maker and employing posterior probabilities of classifiers instead of predicted classes by the classifiers.