1 Introduction

Some types of cancer are considered genetic diseases [1] that occur when one or more cells begin to mutate in an uncontrolled manner. A sequence of such mutations may spread cancer to other cells [2]. Microarray technology is one of the methods used to monitor gene expression levels in several tissues of the body [1]. In the biomedical field, cancer identification based on microarray gene expression data is a complex task because these datasets are high-dimensional with few observations: the number of features is vast (and most of them are irrelevant), whereas the number of observations is limited, usually fewer than 100 [3, 4]. Predictive models built on such datasets are unstable and prone to overfitting. Feature (or gene) selection methods are widely used to tackle this challenge by identifying the essential genes that lead to high predictive results [2]. Feature selection methods can be divided into filter-based, embedded-based, wrapper-based, ensemble-based, and hybrid-based methods [3].

On the one hand, in filter-based methods, each subset of features is assessed based on intrinsic data characteristics [5]. Several studies addressed the gene selection problem using this type of method; for example, Rafael Arias-Michel and co-workers [5] modified the approximated Markov blanket to consider the relationships among input data features. The correlation-based feature selection method was used to evaluate the relationships, while the fast correlation method was applied as a search approach. The experiments were conducted with support vector machines (SVM) with a linear kernel function and Naïve Bayes (NB). In the same context, the experts in [6] developed a two-stage filter selection method using Spearman's correlation and distributed feature selection. This method was tested using decision tree (DT), k-nearest neighbor (kNN), SVM, and NB. In 2019, the authors in [7] used Relief and the least absolute shrinkage and selection operator (LASSO) as feature selection approaches. The selected subsets of features were evaluated using three classifiers: multilayer perceptron networks (MLP), random forest (RF), and SVM. The combination of LASSO and SVM outperformed the other developed approaches. The benefits of filter-based methods are their rapid and simple computation and their ability to cope easily with high-dimensional datasets. In contrast, their major drawback is that they disregard the interaction with the classification methods, which may result in poor predictive results [4].

On the other hand, wrapper-based methods, such as meta-heuristic algorithms, assess the usefulness of features by optimizing the error rate or accuracy during the training phase of a classifier [8]. For instance, Aiguo Wang and co-workers [9] suggested an efficient wrapper method by introducing the Markov blanket approach to reduce the required wrapper evaluation time. Similarly, the authors in [10] used a binary learning-based optimization algorithm to select the relevant features, and discriminant analysis, SVM, DT, and kNN were used as classifiers. In [11], the authors developed a firefly-based feature selection method to identify the most valuable genes. SVM with leave-one-out cross-validation was used as a classifier. Another study on this topic eliminated the redundant features from microarray gene expression datasets using a binary bat algorithm as a wrapper method, with an extreme learning machine as a classifier. The authors in [2] introduced the cuckoo search algorithm as a method for gene selection aided by a memory-based technique to identify the most valuable genes. In [12], the author used a simulated annealing optimizer as a wrapper algorithm, and RF, DT, and SVM were used as machine learning methods. These methods achieve superior predictive results but require high computational costs compared with filter-based methods.

Contrarily, embedded-based methods are comparable to wrapper-based methods in that they also rely on classifiers. In embedded methods, feature selection is conducted during the classifier's training [13]. Embedded-based methods require less computational time than wrapper-based methods because they examine each feature subset during the algorithm's learning process and avoid repeated execution [14]. For example, the authors in [15] developed clustering-guided sparse structural learning as an embedded feature selection method, which was also used as a classifier. Similarly, the authors in [16] developed an embedded method by combining a particle swarm optimization (PSO) algorithm with a C4.5 classifier. In another work [17], a support vector machine Bayesian T-test recursive feature elimination algorithm was developed for gene selection. The selected genes were evaluated using an SVM classifier. In [18], the authors used a weighted feature selection algorithm embedded in the bacterial colony optimization algorithm to select the most relevant features. This method decreased the computational time and enhanced the search ability. This study used an SVM and a kNN with k = 5 as classifiers. However, the primary disadvantage of embedded-based methods is that the selection process relies on the classifier used. As a result, the selected subset of genes may lead to poor results when it is used as input to other classifiers.

On the contrary, ensemble-based methods address a variety of problems. The main idea behind ensemble methods is to combine the results of two or more feature selection methods to produce a more stable feature subset [3]. The main benefits of these methods are that they are stable, reduce overfitting, and are more scalable to high-dimensional spaces. To cite an example, the authors in [19] developed an ensemble-based method that combined both embedded and filter methods. Consistency-based filter, ReliefF, correlation-based feature selection, and information gain were the filters used, whereas feature selection–perceptron and support vector machine recursive feature elimination were used as embedded methods. The classification method used in this work was an ensemble classifier that integrated the instance-based learning method, C4.5, and NB. Similarly, the authors in [20] proposed an ensemble method by integrating two wrapper methods: the cuckoo optimization algorithm and the genetic algorithm (GA). SVM and MLP were used as classifiers to evaluate the proposed method. In the same context, in [21], maximizing global diversity, error-correcting output codes, and maximizing local diversity were integrated to form a hierarchical ensemble-based feature selection method. The resulting features were evaluated using SVM and kNN. The authors in [22] developed a gene selection method that combined G-Forest and GA. RF was used to assess the selected features. The main drawbacks of ensemble-based methods are the considerable computation time and memory they consume.

Conversely, hybrid-based feature selection methods have been proposed to exploit the benefits of two or more feature selection methods, commonly filter- and wrapper-based, and to address the severe flaws of the previously mentioned methods [3]. For instance, the authors in [23] developed a hybrid method that combined GA with mutual information maximization, and the selected features were evaluated using SVM. Similarly, ReliefF and PSO were used as a two-phase feature selection method. In [24], the authors used ReliefF with a recursive binary gravitational search algorithm for gene selection. The expert in [25] proposed a hybrid method that integrated the analysis of variance (ANOVA), Ejaya, and the forest optimization algorithm. The picked subset of genes was evaluated using an SVM classifier. Contextually, in [26], Relief and a stacked autoencoder were developed as a hybrid gene selection method, while SVM and convolutional neural networks were used for classification. Moreover, in [27], the authors developed a hybrid method that integrated information gain and the barnacles mating optimizer, and the resulting subset of genes was evaluated using SVM. The experts in [28] developed a hybrid method that combined minimum redundancy maximum relevance (mRMR) and the moth flame optimization (MFO) algorithm with quantum computation.

Another study on this topic, [29], proposed a hybrid method that integrated an ensemble of Chi-square, ReliefF, and information gain with PSO to select the most relevant genes. In [30], the authors used mRMR and Manta ray foraging optimization for gene selection, and SVM was used as a classifier. A two-phase gene selection method was developed in [31], using Pasi Luukka’s filter-based feature ranking algorithm in the first stage to remove irrelevant genes. In the second stage, an enhanced version of the whale optimization algorithm (WOA), the altruistic whale optimization algorithm (AltWOA), was applied. This version applied the idea of altruism to the whale population. The experts in [13] developed a two-phase feature selection method. Firstly, an ensemble of several filter-based methods, including the Chi-square test, information gain ratio, and ReliefF, was used. Secondly, a recursive flower pollination search algorithm was used to obtain the final subset of genes. In [1], the experts developed a new hybrid method that consisted of two stages. Firstly, one-class SVM was used for anomaly detection. Secondly, a guided GA was developed to find the final subset of genes. The final subset of genes was evaluated using an SVM classifier.

Lastly, swarm optimization algorithms have attracted more attention because they achieve the highest performance in addressing various issues, such as wind energy optimization [32], sustainable energy [33], appointment scheduling problems [34, 35], and breast cancer diagnosis [36]. In gene selection, these algorithms were used as a wrapper-based or a stage in hybrid-based gene selection method. The commonly used swarm algorithms in the literature are artificial bee colony (ABC) [37, 38], harmony search algorithm (HSA) [39, 40], flower pollination search algorithm [13, 41, 42], GA [43,44,45], WOA [31, 46], MFO [28, 47], and PSO [29, 48,49,50].

As a result of the interaction among genes in microarray gene expression data, the expanded search space, and the stochastic nature of swarm algorithms, most of them are exposed to the local optima issue and may experience degraded performance. Thus, there is an opportunity to enhance the search strategies to effectively explore the high-dimensional space, avoid local optima, and exploit the global solution more reliably, which is needed to tackle the gene selection problem properly. The gene selection issue can be successfully tackled by combining two or more swarm algorithms or by altering and improving existing ones. The main drawbacks of swarm algorithms stated in the literature are their slow convergence rate, their tendency to get stuck in local optima, the impact of the algorithm's parameters on its performance, and the poor balance between the exploration and exploitation phases.

Consequently, this paper studies the most recent swarm algorithms and their characteristics in order to use one of them in developing a new hybrid-based gene selection method that addresses the gene selection issue. Specifically, the spider wasp optimizer (SWO) [51] is a recently developed swarm optimization algorithm that simulates what spider wasps accomplish in nature when they hunt, build nests, and mate. The SWO algorithm has many novel updating mechanisms; therefore, it can address a variety of optimization problems with various exploration and exploitation strategies. The SWO algorithm has several benefits. The stability of its performance was evaluated using 23 classical test functions, the CEC2014, CEC2017, and CEC2020 benchmark functions, and two engineering design problems. SWO was compared with several optimization algorithms, including recently published and commonly used ones. It outperforms the artificial gorilla troops optimizer, sine–cosine algorithm, slime mold algorithm, equilibrium optimizer, gray wolf optimizer, fox optimizer, African vultures optimization algorithm, whale optimization algorithm, and marine predators algorithm (MPA).

Even though the SWO algorithm has yielded favorable results, it is not entirely impervious to the shortcomings that swarm algorithms may experience. Like any meta-heuristic, it has weaknesses that can degrade functionality, lead to slow convergence, or cause it to get stuck in local optima.

Motivated by the necessity of feature selection for more rapid computation and accurate classification results, this paper introduces an updated SWO version, known as RSWO-MPA, that utilizes the MPA during initialization. Afterward, this updated version recursively uses SWO to decrease the number of chosen features. In each invocation, the SWO algorithm operates on the reduced dataset produced by the previous invocation. In addition, a two-phase gene selection method is proposed. In the first phase, ReliefF is used to remove redundant genes. In the second phase, RSWO-MPA is proposed as a wrapper algorithm to address the limitations of existing gene selection algorithms.

Key contributions of this study are as follows:

  • Develop a new version of SWO named a recursive spider wasp optimizer guided by marine predators algorithm (RSWO-MPA).

  • Propose a two-stage feature selection method for gene expression analysis.

  • Use ReliefF filter-based method with RSWO-MPA as a wrapper-based feature selection method.

  • Assess RSWO-MPA using eight benchmark microarray gene expression datasets to prove its efficacy.

  • Compare the proposed RSWO-MPA with seven commonly used and recently developed algorithms. These algorithms include Kepler optimization algorithm (KOA), Harris hawks optimization (HHO), social ski-driver optimization (SSD), whale optimization algorithm (WOA), ABC, original MPA, and original SWO.

  • Demonstrate that the proposed RSWO-MPA outperforms other state-of-the-art gene selection methods.

The rest of the paper is structured as follows: Sect. 2 presents an overview of the SWO algorithm used in the proposed method. Section 3 introduces the proposed method. The experimental findings obtained from the proposed method to tackle gene selection are discussed in Sect. 4. Section 5 presents the conclusion of this paper and discusses future work.

2 Materials and methods

This section discusses the materials and methods required to implement the proposed method.

2.1 Spider wasp optimizer (SWO)

SWO is a newly proposed swarm optimization algorithm that mimics what female spider wasps do in nature when they hunt, build nests, and mate; it was proposed in 2023 by Mohamed Abdel-Basset and co-workers [51]. In the following lines, we briefly explain the behaviors of these wasps that are imitated in the SWO algorithm. Firstly, searching behavior targets the food/prey at the beginning of the optimization steps to obtain a spider suitable for larval development. Secondly, nesting behavior mimics the process of pulling the prey to a nest of a size suitable for the egg and prey. Thirdly, mating behavior emulates the characteristics of the offspring produced by hatching the egg, using a uniform crossover operator between the male and female wasps with a distinct probability referred to as the crossover rate (CR). The following subsections provide a mathematical model of these behaviors.

2.1.1 Creation of the initial population

Each spider wasp is considered a solution in the current generation. Equation 1 shows the encoding of this solution as an N-dimensional vector.

$$\begin{aligned} \overrightarrow{SpW} = [p_1,p_2,...,p_N] \end{aligned}$$
(1)

A random population that consists of M vectors (i.e., solutions) can be created between the predefined upper bound \(\textit{ub}\) and lower bound \(\textit{lb}\) using Eq. 2.

$$\begin{aligned} SpW_{popu} = \begin{bmatrix} spw_{1,1} & spw_{1,2} & spw_{1,3} & \cdots & spw_{1,N}\\ spw_{2,1} & spw_{2,2} & spw_{2,3} & \cdots & spw_{2,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ spw_{M,1} & spw_{M,2} & spw_{M,3} & \cdots & spw_{M,N} \end{bmatrix}, \end{aligned}$$
(2)

where \(SpW_{popu}\) refers to the initial population of spider wasps. Equation 3 is applied to create any random solution in the search area.

$$\begin{aligned} \overrightarrow{SpW_p^g} = \overrightarrow{lb}+ \overrightarrow{rand} \times (\overrightarrow{ub}- \overrightarrow{lb}), \end{aligned}$$
(3)

where g indicates the index of the current generation, and p refers to the index within the population (p = 1, 2, ..., M). \(\overrightarrow{rand}\) denotes an N-dimensional vector with random initial values between 0 and 1.
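For illustration, the population initialization of Eqs. 1–3 can be sketched in Python/NumPy as follows; the function and parameter names are ours and not from the authors' implementation:

```python
import numpy as np

def init_population(M, N, lb, ub, rng=None):
    """Create M random N-dimensional spider-wasp solutions (Eq. 3).

    lb and ub may be scalars or length-N arrays of per-dimension bounds.
    """
    rng = np.random.default_rng(rng)
    lb = np.broadcast_to(np.asarray(lb, dtype=float), (N,))
    ub = np.broadcast_to(np.asarray(ub, dtype=float), (N,))
    # SpW_p = lb + rand * (ub - lb), with rand ~ U(0, 1) per dimension
    return lb + rng.random((M, N)) * (ub - lb)
```

Each row of the returned M×N matrix corresponds to one solution vector \(\overrightarrow{SpW}\) of Eq. 1, and the matrix itself corresponds to \(SpW_{popu}\) of Eq. 2.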

2.1.2 The attitudes of seeking prey and nesting

The female wasps randomly search for food/spiders within the search regions to obtain the most suitable spider; this is referred to as the searching or exploration phase. Thereafter, the spider wasp surrounds the spider and hunts it by flying or running. This phase is known as the surrounding and chasing phase. In the final phase, the spider wasp pulls the paralyzed prey into the pre-prepared nest and lays an egg on its abdomen.

2.1.3 Exploration stage: searching stage

This phase mimics the behavior of the spider wasps in finding the best prey to feed their larvae. In this phase, the spider wasp randomly searches the search area with a fixed step, as explained earlier, to obtain prey that will be suitable for its offspring. Equation 4 models this behavior.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} =\overrightarrow{SpW_p^{g}} +const_1* (\overrightarrow{SpW_x^{g}} -\overrightarrow{SpW_y^{g}}), \end{aligned}$$
(4)

where \(\overrightarrow{SpW_p^{g+1}}\) is the updated position of each female wasp, moved with a static motion (\(const_1\)) along the current direction. g refers to the index of the current generation, and p is the index within the population (p = 1, 2, ..., M). \(\overrightarrow{SpW_x^{g}}\) and \(\overrightarrow{SpW_y^{g}}\) are two random solutions used to identify the direction of exploration followed by the female wasps, and x and y represent their indices. The following formula is used to compute \(const_1\).

$$\begin{aligned} const_1 = |rand_{norm}|*rand_1, \end{aligned}$$
(5)

where \(rand_{norm}\) is a random number that follows the normal distribution, while \(rand_1\) indicates a random number between 0 and 1.

Female wasps occasionally lose the path of prey that has fallen from the orb web. Therefore, they inspect the whole area surrounding the precise location where the prey fell. To model this behavior, a new formula with a distinct exploration strategy was developed to allow the SWO algorithm to search the area surrounding the fallen prey with a smaller step size than that of Eq. 4.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} =\overrightarrow{SpW_{curr}^g} +const_2* (\overrightarrow{lb} + \overrightarrow{rand_2} *(\overrightarrow{ub} -\overrightarrow{lb})), \end{aligned}$$
(6)

where \(\overrightarrow{SpW_{curr}^g}\) is a randomly selected solution with index curr. \(const_2\) is a static motion used to specify the direction of search, and \(\overrightarrow{rand_2}\) indicates a vector of random numbers between 0 and 1. lb and ub indicate the lower and upper boundaries, respectively. \(const_2\) is computed using Eq. 7.

$$\begin{aligned} const_2 =\frac{1}{1+e^{lr}} * cos(2 \pi lr), \end{aligned}$$
(7)

where lr is a random number between -2 and 1.

Equations 4 and 6 are used together to explore the search regions and discover the most encouraging locations. Eventually, the selection between these equations to produce the updated position for the female wasp is executed randomly as described in Eq. 8.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = {\left\{ \begin{array}{ll} Equation \, 4 &{} rand_3 < rand_4 \\ Equation \, 6 &{} otherwise\end{array}\right. }, \end{aligned}$$
(8)

where \(rand_3\) and \(rand_4\) are random numbers in [0,1].
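A minimal sketch of the exploration updates in Eqs. 4–8, assuming the notation defined above (all function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def explore_step(pop, p, lb, ub, rng):
    """One exploration update for wasp p, choosing between Eq. 4 and Eq. 6."""
    M, N = pop.shape
    x, y = rng.choice(M, size=2, replace=False)         # two random solutions
    const1 = abs(rng.normal()) * rng.random()           # Eq. 5
    lr = rng.uniform(-2.0, 1.0)                         # random number in [-2, 1]
    const2 = 1.0 / (1.0 + np.exp(lr)) * np.cos(2 * np.pi * lr)      # Eq. 7
    if rng.random() < rng.random():                     # Eq. 8: random selection
        new = pop[p] + const1 * (pop[x] - pop[y])       # Eq. 4: coarse search
    else:
        curr = pop[rng.integers(M)]                     # random solution SpW_curr
        new = curr + const2 * (lb + rng.random(N) * (ub - lb))      # Eq. 6: finer search
    return np.clip(new, lb, ub)                         # keep within the bounds
```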

2.1.4 Exploration and exploitation stage: tracking and getting free stage

After locating the spider, the spider wasp seeks to attack it in the middle of the web. However, the spider may drop to the ground to escape from the wasp. Afterward, there are two scenarios. In the first scenario, the spider wasp tracks the fallen prey, catches it, and puts it in the pre-prepared nest. This scenario is modeled in Eq. 9. In the second scenario, the wasp cannot catch the dropped prey. Equation 11 models this behavior, and the trade-off between the two scenarios is made randomly, as formulated in Eq. 13.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = \overrightarrow{SpW_p^{g}} +C *| \overrightarrow{rand_5} * \overrightarrow{SpW_x^{g}} -\overrightarrow{SpW_p^{g}}|, \end{aligned}$$
(9)

where g is the index of the current generation, p refers to the index within the population, \(\overrightarrow{SpW_x^{g}}\) is a random solution, and x represents its index. \(\overrightarrow{rand_5}\) denotes a vector of random values in the interval [0,1]. C is a distance-controlling parameter that determines how fast the wasp moves; it begins at 2 and linearly decreases to 0. Equation 10 is used to compute C.

$$\begin{aligned} C = (2-2*(g/g_{max}))*rand_6, \end{aligned}$$
(10)

where \(g_{max}\) indicates the maximum number of generations, and \(rand_6\) denotes a random value in the interval [0,1].

In the second scenario, the distance between the spider wasp and the spider gradually increases. This phase is initially exploitation and turns into exploration as the distance increases.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = \overrightarrow{SpW_p^{g}}*\overrightarrow{vec}, \end{aligned}$$
(11)

where \(\overrightarrow{vec}\) indicates a vector whose values are randomly generated between -v and v. Equation 12 computes the value of v; this mechanism gradually raises the distance between the spider wasp and the prey.

$$\begin{aligned} v= 1-(g/g_{max}), \end{aligned}$$
(12)
$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = {\left\{ \begin{array}{ll} Equation \, 9 &{} rand_3 < rand_4 \\ Equation \, 11 &{} otherwise\end{array}\right. }, \end{aligned}$$
(13)

At the beginning of the optimization procedure, all the wasps use the exploration strategy to globally explore the search area of the optimization problem and locate the most promising region that might include the near-optimal solution. As the iterations pass, the algorithm uses the following (tracking) and escaping mechanisms to explore and exploit the region close to the current wasps, hoping to avoid getting stuck in local optima. Eventually, the balance between both stages is adapted based on Eq. 14.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = {\left\{ \begin{array}{ll} Equation \, 8 &{} rand_p < v \\ Equation \, 13 &{} otherwise\end{array}\right. }, \end{aligned}$$
(14)

where \(rand_p\) indicates a random number in the interval of 0 and 1.
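Under the same assumptions, the tracking and escaping scenarios of Eqs. 9–13 can be sketched as follows, with g and \(g_{max}\) denoting the current and maximum generation (names are illustrative):

```python
import numpy as np

def follow_or_escape(pop, p, g, g_max, rng):
    """Tracking/escaping update for wasp p at generation g (Eqs. 9-13)."""
    M, N = pop.shape
    v = 1.0 - g / g_max                                  # Eq. 12: shrinks over time
    if rng.random() < rng.random():                      # Eq. 13: pick a scenario
        x = rng.integers(M)                              # random solution SpW_x
        C = (2.0 - 2.0 * g / g_max) * rng.random()       # Eq. 10: decreases to 0
        new = pop[p] + C * np.abs(rng.random(N) * pop[x] - pop[p])  # Eq. 9: track
    else:
        vec = rng.uniform(-v, v, size=N)                 # values between -v and v
        new = pop[p] * vec                               # Eq. 11: escape
    return new
```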

2.1.5 Exploitation stage: nesting attitude

Female wasps drag the immobilized spider to an already prepared nest. Spider wasps exhibit a variety of nesting behaviors, such as forming cells in the earth, constructing mud nests on leaves or rocks, and utilizing existing cavities such as spider or beetle burrows. Given these diverse nesting behaviors, the SWO algorithm emulates them using two distinct equations. The first equation (Eq. 15) entails pulling the spider toward the best spider candidate found so far, considered the optimal location for nest-building, where the incapacitated spider is deposited and an egg laid on its abdomen.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} =\overrightarrow{SpW^*} + cos(2 \pi lr)*(\overrightarrow{SpW^*}-\overrightarrow{SpW_p^{g}}), \end{aligned}$$
(15)

where the term \(\overrightarrow{SpW^*}\) denotes the optimal solution found so far.

The second equation aims to construct a spider’s nest at the location of one female spider selected randomly from the group while considering a separate step size to prevent two nests from being built in the same spot. This equation was created in the following manner:

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = \overrightarrow{SpW_x^{g}} +rand_3 *|\wp |*(\overrightarrow{SpW_x^{g}}-\overrightarrow{SpW_p^{g}}) + (1-rand_3)*\overrightarrow{V}*(\overrightarrow{SpW_y^{g}}-\overrightarrow{SpW_z^{g}}), \end{aligned}$$
(16)

where \(rand_3\) is a randomly generated number within the range [0,1]. The value of \(\wp\) is generated using the Lévy flight method. x, y, and z are indices representing three solutions randomly chosen from the population. \(\overrightarrow{V}\) is a binary vector that helps decide when to apply a step size to avoid creating two nests at the same spot; it is calculated using Eq. 17.

$$\begin{aligned} \overrightarrow{V} ={\left\{ \begin{array}{ll}1 &{} \overrightarrow{rand_4} > \overrightarrow{rand_5}\\ 0 &{} otherwise \end{array}\right. }, \end{aligned}$$
(17)

where \(\overrightarrow{rand_4}\) and \(\overrightarrow{rand_5}\) indicate two vectors representing random values in interval [0,1].

According to the formula given in Eq. 18, Eqs. 15 and 16 are exchanged in a random manner.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = {\left\{ \begin{array}{ll}Equation \, 15 &{} rand_3 <rand_4\\ Equation \, 16&{} otherwise\end{array}\right. }, \end{aligned}$$
(18)

Ultimately, the balance between the prey-seeking and nesting behaviors is attained through the use of Eq. 19. All spider wasps search for their respective spiders at the beginning of the optimization procedure and then pull their paralyzed prey into the pre-arranged nests.

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = {\left\{ \begin{array}{ll}Equation \, 14 &{} p <M*v\\ Equation \,18&{} otherwise\end{array}\right. }, \end{aligned}$$
(19)
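The nesting updates of Eqs. 15–18 can be sketched as follows; the Lévy step (\(\wp\)) is generated here with Mantegna's algorithm, one common implementation choice, and all names are illustrative:

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(rng, beta=1.5):
    """A Levy-distributed step via Mantegna's algorithm (one common choice)."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    return rng.normal(0.0, sigma) / abs(rng.normal()) ** (1 / beta)

def nesting_step(pop, p, best, rng):
    """Nesting update for wasp p, choosing between Eq. 15 and Eq. 16."""
    M, N = pop.shape
    if rng.random() < rng.random():                       # Eq. 18: random selection
        lr = rng.uniform(-2.0, 1.0)
        new = best + np.cos(2 * np.pi * lr) * (best - pop[p])         # Eq. 15
    else:
        x, y, z = rng.choice(M, size=3, replace=False)    # three distinct solutions
        r3 = rng.random()
        V = (rng.random(N) > rng.random(N)).astype(float)             # Eq. 17
        new = (pop[x] + r3 * abs(levy_step(rng)) * (pop[x] - pop[p])  # Eq. 16
               + (1 - r3) * V * (pop[y] - pop[z]))
    return new
```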

2.1.6 Mating behavior

The SWO algorithm also takes into account the mating behavior of wasps. One of the key features of spider wasps is their ability to determine the gender of their offspring, which is governed by the size of the host on which an egg is deposited: small hosts yield male offspring, while larger hosts yield female offspring. In this approach, every spider wasp represents a potential solution in the current iteration, and a spider wasp egg signifies a newly created candidate solution for that iteration. Equation 20 is used to generate these spider wasp eggs (i.e., new solutions).

$$\begin{aligned} \overrightarrow{SpW_p^{g+1}} = Crossover(\overrightarrow{SpW_p^{g}},\overrightarrow{SpW_m^{g}},CrRa), \end{aligned}$$
(20)

where Crossover is an operator that performs uniform crossover between the solutions \(\overrightarrow{SpW_p^{g}}\) and \(\overrightarrow{SpW_m^{g}}\) with a probability called the crossover rate (CrRa). The vectors \(\overrightarrow{SpW_p^{g}}\) and \(\overrightarrow{SpW_m^{g}}\) correspond to the female and male spider wasps, respectively. The SWO algorithm generates male spider wasps that are distinct from the female wasps using Eq. 21.

$$\begin{aligned} \overrightarrow{SpW_m^{g+1}} = \overrightarrow{SpW_p^{g}} + e^{l}*|B|*\overrightarrow{vec_1}+(1-e^{l})*|B_1|*\overrightarrow{vec_2}, \end{aligned}$$
(21)

where B and \(B_1\) are two randomly generated values that follow a normal distribution, and e is the exponential constant. Additionally, the formula includes the vectors \(\overrightarrow{vec_1}\) and \(\overrightarrow{vec_2}\), which are calculated using the following equations:

$$\begin{aligned} \overrightarrow{vec_1}= & {} {\left\{ \begin{array}{ll} \overrightarrow{x_a}-\overrightarrow{x_i} &{} f(\overrightarrow{x_a}) < f(\overrightarrow{x_i}) \\ \overrightarrow{x_i}-\overrightarrow{x_a} &{} otherwise\end{array}\right. }, \end{aligned}$$
(22)
$$\begin{aligned} \overrightarrow{vec_2}= & {} {\left\{ \begin{array}{ll} \overrightarrow{x_b}-\overrightarrow{x_c} &{} f(\overrightarrow{x_b}) < f(\overrightarrow{x_c}) \\ \overrightarrow{x_c}-\overrightarrow{x_b} &{} otherwise\end{array}\right. }, \end{aligned}$$
(23)

These formulas involve randomly selecting three solutions from the population using indices a, b, and c, which are distinct from each other as well as from the current solution i. Crossover is then employed to combine genetic material from the two parent wasps, resulting in an offspring (or egg) that inherits traits from both parents. The balance between the tracking and mating behaviors is determined by a trade-off rate (TR).
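The uniform crossover of Eq. 20 can be sketched as follows (a generic implementation; the authors' operator may differ in detail):

```python
import numpy as np

def uniform_crossover(female, male, cr, rng):
    """Uniform crossover between a female and a male wasp (Eq. 20).

    Each position is inherited from the male parent with probability cr
    (the crossover rate) and from the female parent otherwise.
    """
    mask = rng.random(female.shape) < cr
    return np.where(mask, male, female)
```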

2.1.7 Reduce the population size and reserving the memory

Once a female spider wasp lays its egg on the host's abdomen, it seals the nest and departs the area to avoid detection. This indicates that the female's role in the optimization steps is largely fulfilled, and the remaining wasps can take over its function evaluations for the rest of the procedure, leading to potentially improved outcomes. During the iterations, certain wasps in the swarm are eliminated to grant more function evaluations to the remaining wasps. This also decreases population diversity and speeds up convergence toward a near-optimal solution. The size of the current population at each evaluation is adjusted using the following equation:

$$\begin{aligned} M =M_{min} + (M-M_{min}) \times v, \end{aligned}$$
(24)

Equation 24 involves setting a minimum population size (\(M_{min}\)) to prevent getting trapped in local optima at various phases of the optimization process. In addition, SWO incorporates a memory-saving technique to retain each wasp's best spider position for use in subsequent generations. Essentially, each solution generated by a wasp is compared to its equivalent in the previous generation, and if it is a better fit, it replaces the current one. Algorithm 1 provides the pseudo-code of the SWO algorithm.
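The population reduction of Eq. 24 can be sketched as follows, assuming v decreases linearly with the generation count as in Eq. 12 (names are illustrative):

```python
def reduced_size(M, M_min, g, g_max):
    """Population size at generation g (Eq. 24)."""
    v = 1.0 - g / g_max              # shrinks linearly from 1 to 0
    return int(M_min + (M - M_min) * v)
```

For example, with M = 100 and \(M_{min}\) = 20, the size decreases linearly from 100 at the first generation to 20 at the last one.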

Algorithm 1: Pseudo-code of the SWO algorithm

3 Methodology

This section explains the proposed methodology in detail. Section 3.1 outlines the framework of the RSWO-MPA optimization algorithm. In Sect. 3.2, the proposed hybrid gene selection method based on the RSWO-MPA is explained. Finally, Sects. 3.3 and 3.4 introduce the important settings for RSWO-MPA.

3.1 The proposed RSWO-MPA

The SWO algorithm is a new optimization algorithm that can deal with challenging problems, such as gene selection, by selecting the most valuable genes with high results. However, it cannot guarantee that it will select few genes. To address this issue, RSWO leverages recursion, as used in computer science, to determine a smaller set of genes without sacrificing accuracy. Firstly, the MPA is used to reduce the search space and obtain the initial solution (i.e., genes). The MPA has an expanded foraging strategy, specifically the Lévy and Brownian movements of ocean predators, together with the optimal encounter rate policy in the biological interaction between predator and prey; it follows the laws that inherently govern optimal foraging techniques and the encounter rate policy between predator and prey in marine ecosystems [52]. Secondly, RSWO uses this solution as input to search for an even smaller subset of genes in subsequent stages. This iterative process does not harm classification accuracy and reduces the search space at each step, resulting in the selection of a smaller number of highly predictive genes. The recursive procedure stops when further reduction of the gene subset leads to reduced prediction results.

The key stages of the proposed RSWO-MPA are summarized as follows:

  1. 1.

    Initialization stage: In the first run of RSWO, MPA is used to obtain the initial optimal genes (\(\overrightarrow{OG_{initial}}\)) to reduce the search space of SWO.

  2.

    Recursive stage: In this stage, the SWO is run recursively. The first invocation of the SWO algorithm uses \(\overrightarrow{OG_{initial}}\) to obtain a reduced microarray gene expression dataset (\(D_{reduced}\)), and the obtained fitness (BF) is stored in bestFitness. Afterward, each invocation of SWO runs on the current \(D_{reduced}\) dataset to select optimal genes (\(\overrightarrow{SpW^*}\)), which are used to further reduce the dimension of \(D_{reduced}\). The obtained fitness is compared with bestFitness, and the higher value updates bestFitness. Then, the SWO is rerun using the new reduced dataset.

  3.

    Termination stage: The recursive process continues until an \(\overrightarrow{SpW^*}\) is obtained that reduces the prediction results (i.e., BF), or until the number of selected features becomes one.

Figure 1 shows the flowchart of the proposed RSWO-MPA.

Fig. 1

Flowchart of the proposed RSWO-MPA algorithm
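The recursive procedure described above can be sketched in a few lines. The sketch below is illustrative only: `stub_swo` and `stub_fitness` are simple stand-ins we invented for the real SWO optimizer and the SVM-based fitness, and the interface is an assumption.

```python
import numpy as np

def rswo(X, y, swo, fitness, genes=None, best=-np.inf):
    """Recursively rerun the optimizer on an ever-smaller gene subset
    until fitness drops or a single gene remains."""
    if genes is None:
        genes = np.arange(X.shape[1])
    picked = swo(X[:, genes], y)            # indices into `genes`
    new_genes = genes[picked]
    f = fitness(X[:, new_genes], y)
    if f < best or len(new_genes) >= len(genes):
        return genes, best                  # reduction hurt fitness or stalled
    if len(new_genes) == 1:
        return new_genes, f                 # cannot shrink further
    return rswo(X, y, swo, fitness, new_genes, f)

def stub_swo(Xs, ys):
    # stand-in for SWO: keep the half of the genes with the highest variance
    k = max(1, Xs.shape[1] // 2)
    return np.argsort(Xs.var(axis=0))[-k:]

def stub_fitness(Xs, ys):
    # stand-in for SVM accuracy: mean |correlation| between genes and labels
    return float(np.mean([abs(np.corrcoef(Xs[:, j], ys)[0, 1])
                          for j in range(Xs.shape[1])]))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 16))
X[:, 0] *= 5                                # one strongly informative gene
y = (X[:, 0] > 0).astype(int)

genes, score = rswo(X, y, stub_swo, stub_fitness)
print(len(genes), 0 in genes)               # the informative gene survives
```

Each recursion level halves the candidate set in this toy version, so the search space shrinks quickly; the real RSWO-MPA instead lets SWO itself decide the subset selected at each invocation.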

3.2 The proposed gene selection method

The proposed gene selection model comprises three phases: data preprocessing, gene selection, and classification. The following subsections explain the phases of the developed gene selection method. Algorithm 2 summarizes these phases.

3.2.1 Data preprocessing phase

Data preprocessing plays a significant role in the performance of machine learning algorithms. Unfortunately, gene expression datasets are not always clean, so examining and enhancing data quality is a crucial step, because low-quality input data have a significant influence on machine learning algorithms. The following steps explain the applied data preprocessing methods.

  1.

    Splitting: A stratified train–test split is used to divide each dataset into two sets, one for training and one for testing. Stratification ensures that each set preserves the class proportions. The dataset is split into 75% for training and 25% for testing. In addition, k-fold cross-validation with k = 3 is applied to the training set to tune the meta-parameters of the SVM.

  2.

    Imputation: Missing gene values are imputed using the kNN algorithm: the missing values of each instance are replaced by the mean values of its k nearest neighbors from the training set. Two observations are considered near if their existing gene values are close. If an instance has no class label, it is removed rather than imputed.

  3.

    Normalization: Normalization aims to distribute weights fairly among genes and to balance the model’s sensitivity to their magnitudes. This step has a significant impact on several classifiers, such as kNN and SVM. In this research, the gene values are scaled to the range [0, 1] using a min–max normalizer.
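The three preprocessing steps can be sketched with scikit-learn; the library choice, neighbor count, and synthetic data below are our assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 50))              # synthetic "gene expression"
X[rng.random(X.shape) < 0.05] = np.nan      # simulate missing gene values
y = rng.integers(0, 2, size=100)

# 1. stratified 75/25 split keeps the class proportions in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 2. kNN imputation: mean of the k nearest training observations
imputer = KNNImputer(n_neighbors=5)
X_tr = imputer.fit_transform(X_tr)
X_te = imputer.transform(X_te)

# 3. min-max normalization to [0, 1], fitted on the training set only
scaler = MinMaxScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)
```

Fitting the imputer and scaler on the training set only, then applying them to the test set, avoids information leakage.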

3.2.2 Gene selection phase

A two-stage hybrid gene selection method is designed that combines a filter-based with a wrapper-based method. In the first stage, ReliefF, an efficient and accurate filter-based method, is used to select the most relevant genes. ReliefF reduces the dimensionality of the dataset, efficiently identifying a subset of promising genes for further evaluation and leading to faster convergence in the second (wrapper) stage. In this stage, the top 100 genes are selected, as suggested in previous studies [6, 27, 29]. In the second stage, the proposed RSWO-MPA is used as a wrapper method to explore the reduced search space (containing 100 genes) and select the most beneficial genes. The RSWO-MPA evaluates each candidate subset of genes with an SVM classifier, whose accuracy serves as the fitness function. The meta-parameters of the SVM are tuned with a grid search algorithm.
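A minimal sketch of the wrapper-stage evaluation: a grid-searched SVM scored with 3-fold cross-validation on a candidate gene subset. The parameter grid and the synthetic data are our assumptions; the paper does not list the exact grid values.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fitness(X, y, gene_idx):
    """Mean 3-fold CV accuracy of a grid-searched SVM on the chosen genes."""
    grid = GridSearchCV(SVC(),
                        {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                        cv=3)
    grid.fit(X[:, gene_idx], y)
    return grid.best_score_

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 100))
y = (X[:, 3] + X[:, 7] > 0).astype(int)     # genes 3 and 7 carry the signal
informative = fitness(X, y, [3, 7])
noise = fitness(X, y, [40, 41])
print(informative > noise)                  # informative subset scores higher
```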

3.2.3 Classification phase

Following the execution of the two-stage gene selection method, an optimized SVM is used to evaluate the effectiveness of the developed approach. SVM has been reported to achieve superior performance compared with other classifiers, such as logistic regression, DT, RF, NB, MLP, and kNN, for cancer classification based on microarray gene expression datasets [3, 6, 7].

Algorithm 2

Pseudo-code of proposed gene selection method

3.3 Solution representation

Overall, to use an optimization algorithm, such as SWO or RSWO-MPA, as a feature selection method, the search space is typically represented either as a set of possible feature indices or as binary solutions. In this paper, each candidate solution (i.e., spider wasp) is represented by a decimal vector with d items, \(\overrightarrow{SpW_p} =(g_1,g_2,...,g_d)\), where d is the problem dimension (i.e., the number of genes given as input to the optimizer) and g is a gene. In \(\overrightarrow{SpW_p}\), each \(g_i\) has a decimal value that represents the index of a gene in the dataset, where \(\overrightarrow{SpW_p}\) denotes the candidate solution in continuous space at iteration p. \(fun_1 ()\) is used to round each gene value, and \(fun_2 ()\) is applied to eliminate duplicated gene indices. \(\overrightarrow{SpW_p^{new}}\) represents a candidate solution in decimal space. Figure 2 depicts a diagrammatic example of mapping a continuous solution to a decimal one.

Fig. 2

A graphical example of feature representation
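The mapping can be sketched as follows; the function names \(fun_1\) and \(fun_2\) follow the text, while their implementations (rounding with clipping to the index bounds, then order-preserving deduplication) are our assumption:

```python
import numpy as np

def fun_1(spw, lb=0, ub=99):
    # round each continuous value and clip it to a valid gene index
    return np.clip(np.rint(spw).astype(int), lb, ub)

def fun_2(indices):
    # drop duplicate gene indices while preserving their first occurrence
    _, first = np.unique(indices, return_index=True)
    return indices[np.sort(first)]

spw = np.array([12.7, 3.2, 13.1, 3.4, 87.9])
print(fun_2(fun_1(spw)))   # -> [13  3 88]
```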

3.4 Fitness function and its evaluation

The fitness function is a key factor in designing optimization methods. It assesses how well each solution performs during optimization. Discovering the best gene subset in wrapper-based gene selection methods is challenging because it requires identifying a subset with minimal genes and maximal accuracy; a superior solution has both high classification accuracy and fewer genes. An effective fitness function must therefore balance these two competing goals. This paper employs a fitness function, depicted in Fig. 3, that considers both accuracy and the number of selected genes (length(\(\overrightarrow{SpW_p^{new}}\))). The fitness is computed as the average accuracy of k-fold cross-validation with k = 3. The accuracy and gene count obtained from this function are compared to those of the best global solution (\(\overrightarrow{SpW^*}\)) and its corresponding fitness (BF).

Fig. 3

A flowchart of fitness computation
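Based on Fig. 3, the comparison against the global best can be read as a simple lexicographic rule: higher accuracy wins, and at equal accuracy the smaller gene subset wins. The tie-breaking below is our reading of the flowchart, not a stated formula:

```python
def better(acc, n_genes, best_acc, best_n):
    # higher accuracy wins; at equal accuracy, fewer genes win
    return acc > best_acc or (acc == best_acc and n_genes < best_n)

print(better(0.95, 10, 0.95, 14))   # equal accuracy, fewer genes -> True
print(better(0.93, 5, 0.95, 14))    # lower accuracy loses -> False
```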

4 Experimental results and discussion

4.1 Experimental setup

The proposed method was implemented in Python. To assess how well the RSWO-MPA method works, eight publicly available high-dimensional benchmark microarray datasets covering various disease types were used. Table 1 presents the characteristics of the gene expression datasets employed in this study.

Table 1 List of publicly available microarray datasets used in this paper and corresponding URLs

4.2 Parameter settings

The experiments were conducted on an Intel(R) Core(TM) i7-10750H processor operating at 2.60 GHz with 16.0 GB of memory. This study used a population size of 20, a maximum of 150 iterations, a lower bound of 0, and an upper bound of 99. The RSWO-MPA was compared with other swarm optimization algorithms and with cutting-edge gene selection approaches based on the average classification accuracy and the average number of selected genes over 20 independent runs. Table 2 shows the common parameter settings for the proposed and the other swarm algorithms; the default parameters were used for all swarm algorithms.

Table 2 Parameter configuration for RSWO-MPA and other swarm algorithms which are used for comparison

4.3 Assessment criteria

The proposed algorithm is assessed on each dataset using several metrics, namely accuracy, the number of selected genes [26, 28, 29, 61], and a t-test, to maintain consistency with prior studies.

4.4 RSWO-MPA results and discussion

We conducted a four-stage experimental analysis. In the first stage, we compared the effectiveness of the proposed RSWO-MPA with various filter-based algorithms in Sect. 4.4.1. In the second stage, we compared the performance of RSWO-MPA with other swarm algorithms in Sect. 4.4.2. We applied several statistical metrics in the third stage to validate the proposed algorithm in Sect. 4.4.3. Finally, we compared the effectiveness of our algorithm with the state-of-the-art gene selection algorithms in Sect. 4.4.4.

4.4.1 Comparison of proposed RSWO-MPA with existing filter-based methods

In Table 3, various filter-based algorithms are compared with the proposed RSWO-MPA method based on the performance of the SVM trained with them. These algorithms are ReliefF, Fisher score, information gain, and minimum redundancy maximum relevance (mRMR). The best results are highlighted in bold. The proposed RSWO-MPA method outperforms the other filter-based methods on six of the eight datasets; on DS 6 and DS 8, all methods achieved 100% accuracy, but the filter-based methods selected more genes than the proposed method.

Figure 4 shows the average accuracy over all datasets for several filter-based feature selection methods. As shown in Fig. 4, ReliefF achieved the best average accuracy across the eight datasets, followed by mRMR. Therefore, ReliefF is selected for the first phase of the proposed gene selection method.

Table 3 The performance of raw data, the proposed RSWO-MPA method, and other commonly used filter-based methods, in terms of accuracy
Fig. 4

The average prediction results of several filter-based feature selection methods over all datasets

4.4.2 Comparison of proposed RSWO-MPA with other swarm algorithms

The proposed RSWO-MPA was compared with seven swarm optimization algorithms, comprising both recent and well-known algorithms: KOA, HHO, SSD, WOA, ABC, the original MPA, and the original SWO. This stage of the experiments compares the RSWO-MPA with the other swarm algorithms on two factors: accuracy and the number of selected features.

Table 4 displays the number of selected features along with the corresponding fitness values; the best results are highlighted in boldface. Figure 5 shows the average accuracy and number of selected features over twenty independent runs across all datasets. The RSWO-MPA achieved the best accuracy among all algorithms while using the fewest features. As illustrated, the performance of the RSWO-MPA on the gene selection problem is satisfactory, since its results surpass those of the other algorithms on the given datasets.

Table 4 Average performance of several swarm algorithms over 20 independent runs
Fig. 5

The average performance of various swarm algorithms over all datasets, averaged over 20 independent runs

4.4.3 Statistical measurements

Table 5 compares the p-values obtained via a parametric, two-sample t-test between RSWO-MPA and alternative algorithms with respect to their accuracy and the number of selected genes. This analysis aims to determine if there are any notable variations between the RSWO-MPA and others in terms of these two factors.

A p-value less than 0.05 (5%) is considered statistically significant, meaning there is less than a 5% probability that the observed difference between the two algorithms occurred by chance, whereas a p-value greater than 0.05 indicates insufficient evidence of a difference. A p-value of 1 indicates no statistical difference between the two groups being compared [62]. The NaN p-values in this table arise when the standard deviation of one of the groups is 0, which happens when all values in that group are identical; a zero standard deviation makes the denominator of the t-statistic zero, resulting in a division by zero when computing the p-value.
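The behavior described above can be reproduced with SciPy's two-sample t-test; the accuracy values below are synthetic, not taken from Table 5:

```python
import numpy as np
from scipy import stats

acc_a = np.array([0.98, 0.97, 0.99, 0.98, 0.97])   # runs of one algorithm
acc_b = np.array([0.91, 0.93, 0.92, 0.90, 0.92])   # runs of a competitor

t, p = stats.ttest_ind(acc_a, acc_b)
print(p < 0.05)            # clearly separated means: significant difference

# two zero-variance groups yield a NaN p-value (division by zero)
t2, p2 = stats.ttest_ind(np.full(5, 1.0), np.full(5, 1.0))
print(np.isnan(p2))
```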

We first assess the p-value for accuracy between the proposed algorithm and each competitor. If there is no statistically significant difference, we then compare the number of selected genes. Bold highlighting indicates a significant difference at p < 0.05, suggesting that RSWO-MPA performs better than the compared algorithm; the remaining results show similar or worse performance than the compared algorithms. Based on this table, the proposed RSWO-MPA outperforms the other algorithms.

Table 5 A comparison of p-values obtained through the t-test between RSWO-MPA and other algorithms in terms of their accuracy and the number of genes they select

Table 6 displays the statistical performance measurements for the developed optimization algorithms, precisely their means and standard deviations over 20 independent runs. The best outcomes are highlighted in boldface. Generally, RSWO-MPA outperforms the other algorithms regarding average mean accuracy with the lowest standard deviation across all datasets.

Table 6 Statistical measures, including mean and standard deviation (Std.), calculated from the results using a stopping criterion of 150 iterations. All swarm optimization algorithms used an optimized SVM to obtain the accuracy (fitness) values over 20 independent runs

4.4.4 Comparison with the state-of-the-art feature selection methods

Table 7 shows a comparative analysis of the classification accuracy for all datasets used in the study against some cutting-edge methods. The selected publications belong to various types of feature selection algorithms; the best results are emphasized in bold. The proposed method recorded improved performance, achieving 100%, 94.51%, 98.13%, 95.63%, 100%, 100%, 92.67%, and 100% on the DS 1, DS 2, DS 3, DS 4, DS 5, DS 6, DS 7, and DS 8 datasets, respectively. The datasets with the highest accuracy were DS 1, DS 5, DS 6, and DS 8, each achieving 100%, followed by DS 3 with 98.13%. Overall, the proposed gene selection method based on RSWO-MPA achieved much better performance than the other methods, with a significant reduction in the number of selected genes, particularly for DS 1, DS 2, DS 3, and DS 4.

Table 7 Comparison between the proposed RSWO-MPA method with some state-of-the-art algorithms, where the accuracy is given in percentage (%)

4.4.5 The drawbacks of the suggested RSWO-MPA method

Although the proposed algorithm demonstrates greater accuracy and selects fewer genes than other state-of-the-art algorithms, some limitations must be addressed in future research. These limitations can be summarized as follows:

  • Computational Complexity: Hybrid gene selection and hybrid swarm optimization methods can be computationally expensive, particularly for large-scale datasets, such as microarray gene expression.

  • Longer Execution Time, Smaller Gene Subset: RSWO-MPA may require a longer time to execute than the original SWO, but this is generally acceptable because it selects a smaller subset of genes.

  • Parameter Tuning Challenges: Hybrid swarm optimization algorithms require careful tuning of various parameters to achieve optimal performance. This can be challenging, as the best combination of parameter settings may vary depending on the dataset and the specific problem being addressed.

  • Limited Generalizability: Because hybrid gene selection and hybrid swarm optimization methods often rely on specific assumptions and modeling strategies, their generalizability to other problems/domains may be limited.

Overall, despite these limitations, hybrid gene selection and hybrid swarm optimization methods offer promising avenues for improving gene selection in bioinformatics applications, provided their specific limitations are carefully considered.

4.4.6 Health-care implications

Implementing gene selection methods in health-care systems offers profound implications for health-care management and societal well-being. These methods enable health-care leaders and policymakers to make informed, strategic decisions aimed at improving disease prediction and personalizing medical treatments. By identifying genetic predispositions to diseases, doctors can treat patients early. Moreover, gene detection facilitates the development of new medical protocols and therapies, allowing for a more effective allocation of health-care resources.

5 Conclusion and future work

To conclude, this paper proposed a novel gene selection method to address the challenges of high dimensionality and overfitting in biological datasets such as microarray gene expression. The developed method consists of two phases: ReliefF is employed as a filter in the first phase to reduce the number of genes, and the proposed RSWO-MPA is used in the second phase to identify the most informative genes. The methodology was thoroughly evaluated on eight microarray gene expression datasets and compared to seven existing meta-heuristic algorithms. According to the experimental results, the developed method outperformed all compared algorithms and methods, including state-of-the-art methods, in terms of accuracy, number of selected features, and stability across all datasets used. These findings demonstrate the potential utility of the developed method for addressing gene selection challenges in biological research.

As a future direction, our goal is to assess the efficacy of the suggested methods on various dataset modalities. Additionally, we aim to construct a fusion model that can address multi-omics datasets such as RNA and DNA, not exclusively gene expression data. Future studies might concentrate on investigating the effectiveness of incorporating deep learning techniques. Deep learning has shown remarkable successes in various fields, including image and speech recognition, natural language processing, and bioinformatics. Therefore, integrating deep learning algorithms such as convolutional neural networks or recurrent neural networks with traditional feature selection techniques may lead to improved performance in gene selection analysis. Additionally, exploring other optimization algorithms or meta-heuristic approaches could be another direction for future research to enhance the efficiency and effectiveness of gene selection methods.