1 Introduction

The rapid advancement of information technology has led to an enormous increase in data of various kinds. These data sets are also described by more and more attributes, which makes their analysis and interpretation difficult and computationally expensive. For this purpose, feature selection (FS) methods are used to choose from the original features the best subset of informative features [1]. Removing non-informative features improves both learning speed and classification accuracy [2]. FS has been widely employed in a variety of applications, including text clustering [3], bioinformatics [4], image processing [5], and others [6,7,8].

Depending on whether or not a classifier algorithm is used, feature selection techniques fall into two primary categories: wrapper and filter methods. Wrapper techniques choose the best predictive feature subset using a learning algorithm; the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) algorithms are the most common in this category [9, 10]. Filter techniques, without using a classifier, discover the best feature subset by maximizing specific criteria based on the statistical properties of the datasets, such as Relief [11] and mutual information (MI) [12].

Wrapper approaches, which are guided by the learning model, normally produce better results than filter methods, but they are time-consuming because each selected feature subset is evaluated using a learning process. This is especially true for high-dimensional datasets. Wrappers also risk overfitting, especially on high-dimensional datasets where the number of features exceeds the number of samples [13]. The search space expands exponentially with the number of features, making an exhaustive search for the best feature subset infeasible [14]. Meta-heuristics (MH) have been frequently employed to solve the FS problem because of their ability to find optimal or near-optimal solutions without exploring the complete space of options. Such MH techniques include the arithmetic optimization algorithm (AOA) [15], Equilibrium Optimization Algorithm (EO) [16], Reptile Search Algorithm (RSA) [17], Ant Lion Optimizer, Crow Search Algorithm (CSA) [18], Genetic Algorithm (GA) [19], and Aquila Optimizer (AO) [20]. Each of these methods has its advantages and disadvantages; for example, the exploration of AOA, AO, and RSA is better than their exploitation, which can lead to stagnation around attractive points. In contrast, CSA and GA have better exploitation than exploration, which can hinder convergence toward the optimal solution.

Despite the effectiveness of both families of techniques, strategies that improve classification accuracy while selecting a reduced number of features, whether by combining filters and meta-heuristic algorithms or by using them as independent approaches, are still under active development. The No-Free-Lunch (NFL) theorem [21] also applies here: over the years, several scholars have introduced new MH with improved performance based on the principle that no single optimization method can solve all problems. The major goal of this research is to develop a new hybrid FS approach based on a recent meta-heuristic, the Capuchin search algorithm (CapSA).

The Capuchin search algorithm (CapSA) is a meta-heuristic (MH) approach introduced in [22]. It is inspired by the foraging behavior of capuchin monkeys in the wild. In the CapSA, the capuchin population consists of two groups: the alpha capuchins, which represent the leaders, and the followers. While foraging for food, the capuchins move according to five strategies: (1) jumping on trees, (2) jumping over banks, (3) swinging on trees, (4) climbing on trees, and (5) moving normally and randomly on the ground [23]. Based on these behaviors, the CapSA has been utilized as an optimization algorithm to solve a variety of problems, including global optimization and engineering challenges [22]. In [23], the modeling of an industrial winding process was improved by a modified CapSA that uses multi-gene genetic programming.

Despite CapSA's excellent performance, its full potential has yet to be realized, and to the best of our knowledge there is room to improve it and apply it to feature selection problems. The standard CapSA is an iterative process in which a population of capuchins is characterized by their positions and the velocities with which they move through the solution space. As in PSO, capuchins change their positions according to both their best past position and the position of the global best capuchin found so far. In the standard CapSA, the velocity is adjusted according to three parts. The first part is the current velocity of the capuchin. The second part is the cognitive component, which represents the impact of personal experience on a capuchin's trajectory and helps the capuchin move toward its best known position. The third part is the social component, which represents the effect of group experience on the movement of a capuchin and guides it toward the best position found so far. The three parts are weighted by the inertia weight and two acceleration coefficients. Although CapSA is robust and efficient compared with other population-based methods, its ability to fine-tune solutions and escape from local optima is weakened; its performance is largely determined by these three parameters, which must be tuned to avoid a lack of diversity and premature convergence. In CapSA [22], the inertia weight takes the value 1.0; in [23], a linearly decreasing weight (LDW) strategy is adopted to adjust the inertia weight dynamically. However, the linearly decreasing inertia weight strategy is known to suffer from premature convergence, and therefore CapSA can easily be trapped in a local extremum [24]. To address this shortcoming, we propose a dynamically decreasing inertia weight based on the logistic map, combined with the velocity update, to effectively balance the global and local search of the basic CapSA. The goal is to adapt the inertia weight through chaotic optimization to improve the convergence of CapSA and avoid falling into local optima during the optimization process.

In addition, in the original CapSA, the cognitive and social components influence the velocity of a capuchin through two constant acceleration coefficients, as in the basic PSO. These two components are very important for convergence performance [25, 26], and many strategies have been developed to select their optimal values and improve the basic PSO [27,28,29]. Based on these studies, the constant acceleration coefficients are replaced in this paper with sine cosine acceleration coefficients to mitigate premature convergence and stagnation. To add more diversity to the movement of capuchins and help them explore the search space more efficiently, the velocity update formula is also modified. Unlike the original CapSA, in which capuchins update their velocities based only on their own historical best information, a learning strategy is introduced to help capuchins learn from other good individuals in their local neighborhood and in each dimension [30, 31]. Finally, a Levy random walk is applied instead of a purely random movement to avoid unnecessary exploration that would decrease the convergence speed of the algorithm.

The main objectives and contributions of the paper can be summarized as follows:

  1. Propose an alternative wrapper-based FS approach based on modifying a new meta-heuristic technique named the Capuchin search algorithm (CapSA).

  2. Use sine cosine acceleration coefficients, a chaotic inertia weight strategy, a stochastic learning strategy, and a Levy random walk to improve the performance of CapSA and accelerate its convergence.

  3. Apply the developed ECapSA approach to solve the problem of the high dimensionality of real-world datasets, and compare the performance of ECapSA with other well-known FS methods on different UCI datasets.

The rest of the paper is structured as follows: Section 2 presents related works. The conventional CapSA is presented in Sect. 3, and the suggested Enhanced CapSA (ECapSA) is described in Sect. 4. Section 5 introduces and discusses the experimental results. The conclusion and future work are presented in Sect. 6.

2 Related works

Recently, more metaheuristic techniques have been introduced to address the FS problem as wrapper techniques, and such techniques show high efficiency when compared with traditional techniques [32]. In [14], the authors combine Pareto optimization with the harmony search algorithm (HS) to handle FS in high-dimensional data classification problems. A wrapper-based approach is proposed in [33] based on the binary bat algorithm (BA). The authors in [34] combined particle swarm optimization (PSO) with a local search to perform feature selection. The authors of [35] introduced a PSO-based feature selection with an adaptive update mechanism in which the value of the inertia weight parameter is adapted to the rank of each particle in the swarm.

Emary et al. [36] propose a binary variant of the Gray wolf optimization (GWO) for the feature selection domain. Abd ELAZIZ et al. [37] introduce an opposition-based SCA technique for solving the feature selection problem. Pashaei and Aydin [38] introduce the binary black hole technique for feature selection (BBHA). The BBHA is a binarization-assisted extension of the conventional BHA.

Recently, Kumar and Bharti [39] combined the PSO algorithm with SCA to take advantage of each optimizer in solving the feature selection problem. Abualigah and Dulaimi [40] introduced SCAGA for FS, a hybrid of the SCA and GA algorithms that takes advantage of each method. Xue et al. [41] hybridized PSO with forward and backward FS strategies. Gu et al. [42] proposed a novel wrapper-based approach for selecting the optimal feature subset using the binary CSO. Further, Kaya [43] developed a binary cuckoo search to select relevant features and enhance classification accuracy. In [44], a chaotic dragonfly method is developed to eliminate irrelevant features. Ouadfel et al. [45] introduced a novel version of CSA to determine the optimal features; they proposed an adaptive awareness probability and a global search method to improve the convergence performance of the basic CSA. Jia et al. [46] presented a hybrid FS approach based on the combination of simulated annealing (SA) and spotted hyena optimization (SHO). Ghosh et al. introduced in [47] an enhanced version of the binary sailfish optimizer combined with β-hill climbing for determining relevant features. Hammouri et al. proposed an improved version of the Dragonfly Algorithm (DA) for solving the feature selection problem [48], and in [44] the DA is combined with chaotic maps to accelerate its convergence rate. Zhang et al. combined the Harris' Hawk Optimization (HHO) algorithm with SalpSA to select the best feature subset [49]; the proposed approach performed well and presented a good compromise between exploration and exploitation. Too et al. proposed a binary quadratic version of HHO for feature selection, and experiments prove the superiority of their method in comparison with other metaheuristics [50]. Rodrigues et al. introduced the swap mutation into the basic KH and applied the new variant for feature selection on high-dimensional data in text clustering [51]. Sadeghian et al. combined Information Gain with the binary Butterfly Optimization Algorithm (BOA) for the feature selection problem in [52]; experiments performed on UCI datasets demonstrate the ability of the proposed approach to select the smallest feature subset. An improved version of the salp swarm algorithm (SSA) was proposed by Tubishat et al. for the feature selection problem [53]; the authors used Opposition-Based Learning to initialize the population instead of random initialization and introduced a new local search to improve the exploitation performance. Faris et al. [54] proposed a wrapper approach based on SSA for the feature selection problem, using eight transfer functions to discretize the continuous search space and a crossover operator to enhance the exploratory performance of the basic SSA. Sindu et al. [55] combined the sine cosine algorithm (SCA) with an elitism approach. Rodrigues et al. [56] developed a new binary-constrained flower pollination algorithm and applied it to the feature selection problem; the approach defines a Boolean lattice as the search space such that each solution indicates whether a feature is selected or not. Yan et al. [57] developed a new variant of the Coral Reefs Optimization algorithm for selecting the optimal feature subset; tournament selection is used to increase the diversity of the population, and the KNN classifier is used to evaluate the quality of the feature subset encoded by each individual in the population.

3 Capuchin search algorithm: background material

The Capuchin Search Algorithm (CapSA) is a MH technique [22]. It takes its inspiration from the natural foraging behavior of capuchin monkeys in the wild. Like any metaheuristic, CapSA uses a population X of N capuchins such that each capuchin represents a candidate solution in a d-dimensional search space. X can be expressed as a two-dimensional matrix of size N × d.

During the process of searching within the d-dimensional space, the position of the \(i\)th capuchin is represented by \({x}^{i}=[{x}_{1}^{i},{x}_{2}^{i},\dots ,{x}_{d}^{i}]\) and its velocity by \({v}^{i}=[{v}_{1}^{i},{v}_{2}^{i},\dots ,{v}_{d}^{i}]\). Moreover, capuchin \(i\) will hold on to its prior best position \({\mathrm{pbest}}^{i}=[{\mathrm{pbest}}_{1}^{i},{\mathrm{pbest}}_{2}^{i},\dots ,{\mathrm{pbest}}_{d}^{i}]\).

In CapSA, the position and velocity vectors are randomly initialized; then, \({v}^{i}\) at the \(j\)th dimension is updated as

$${v}_{j}^{i}\left(t+1\right)=\rho {v}_{j}^{i}\left(t\right)+{a}_{1}\left({x}_{{best}_{j}}^{i}\left(t\right)-{x}_{j}^{i}\left(t\right)\right){r}_{1}+{a}_{2}\left({Best}_{j}-{x}_{j}^{i}\left(t\right)\right){r}_{2}$$
(1)

where \({x}_{j}^{i}\) is the current position of the \(i\)th leader solution at dimension \(j\). \({v}_{j}^{i}\left(t+1\right)\) and \({v}_{j}^{i}\left(t\right)\) refer to the new and old velocities of \({x}_{j}^{i}\). \({x}_{{best}_{j}}^{i}\) is the best position found so far by the \(i\)th agent, and \({Best}_{j}\) is the best solution found so far by the whole population at the \(j\)th dimension. \({a}_{1}\) and \({a}_{2}\) refer to two acceleration constants that balance the influence of \({x}_{{best}_{j}}^{i}\) and \({Best}_{j}\) on the velocity. \({r}_{1}\) and \({r}_{2}\) denote uniformly distributed random values ranging from 0 to 1. \(\rho\) stands for the inertia weight, which takes the value 1.0 in [22]; in [23], its value is updated according to Eq. (2).

$$\rho ={w}_{\mathrm{max}}-\left({w}_{\mathrm{max}}-{w}_{\mathrm{min}}\right){\left({t}/{\mathrm{maxite}}\right)}^{2}$$
(2)

where \({w}_{\mathrm{max}}\) and \({w}_{\mathrm{min}}\) represent the inertia weight's maximum and minimum values, respectively.
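
To make the notation concrete, the following Python sketch implements the velocity update of Eq. (1) together with the inertia weight of Eq. (2) for a whole population stored as a NumPy array; the default values of \(a_1\), \(a_2\), \(w_{\mathrm{max}}\), and \(w_{\mathrm{min}}\) are illustrative assumptions, not the values reported in [22, 23].

```python
import numpy as np

def inertia_weight(t, maxite, w_max=0.9, w_min=0.4):
    """Quadratically decreasing inertia weight rho of Eq. (2)."""
    return w_max - (w_max - w_min) * (t / maxite) ** 2

def update_velocity(v, x, pbest, gbest, t, maxite, a1=1.25, a2=1.5):
    """Velocity update of Eq. (1) for a whole N x d population.

    v, x, pbest: N x d arrays; gbest: length-d array (broadcast over rows);
    r1 and r2 are fresh uniform draws in [0, 1] for every entry.
    """
    rho = inertia_weight(t, maxite)
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    return rho * v + a1 * (pbest - x) * r1 + a2 * (gbest - x) * r2
```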

Capuchins are divided into two groups in CapSA: (1) leaders (alpha capuchins), which are in charge of discovering food sources, and (2) followers, which update their positions by following the group's leaders. The leaders direct the group, while the followers pursue the leaders, either directly or indirectly [22]. During the evolutionary process, the leaders of the community use five different movement techniques to find food:

  1. In the first technique (jumping on trees), leaders update their positions using Eq. (3) as follows:

    $${x}_{j}^{i}(t+1)={\mathrm{Best}}_{j}+\frac{{P}_{bf}{\left({v}_{j}^{i} (t+1)\right)}^{2}\mathrm{sin}\left(2\theta \right)}{g}, i<n/2;0.1\le \mathcal{E}\le 0.20$$
    (3)

where \(\mathcal{E}\) is a uniformly distributed random number generated within [0, 1] and \({P}_{bf}\) is the probability of the capuchins' tails providing balance. \(g=9.81\) stands for the gravitational acceleration, and \(\theta\) is the capuchins' jumping angle, formulated in Eq. (4):

$$\theta =\frac{3}{2}{r}_{4}$$
(4)

In Eq. (4), \({r}_{4}\in [0, 1]\) represents a uniformly distributed random number.

  2. In the second technique, named jumping on the ground, the position of the leader's solution is updated as

    $${x}_{j}^{i}(t+1)={\mathrm{Best}}_{j}+\frac{{{P}_{ef}P}_{bf}{\left({v}_{j}^{i}(t+1)\right)}^{2}\mathrm{sin}\left(2\theta \right)}{g}, \quad i<n/2;0.2\le \mathcal{E}\le 0.30$$
    (5)

where \({P}_{ef}\) stands for the probability of elasticity of the capuchin's motion on the ground.

  3. The third technique is normal walking; the position of alpha capuchins seeking food on the ground is updated as follows [22]:

    $${x}_{j}^{i}\left(t+1\right)={x}_{j}^{i}\left(t\right)+{v}_{j}^{i}\left(t+1\right), \quad i<n/2;\ 0.3< \mathcal{E}\le 0.5$$
    (6)
  4. The fourth technique is swinging on trees; while swinging on trees in search of food, the position of alpha capuchins is updated as follows [22]:

    $${x}_{j}^{i}\left(t+1\right)={\mathrm{Best}}_{j}+{ P}_{bf}\times \mathrm{sin}\left(2\theta \right), i<n/2;0.5\le \mathcal{E}\le 0.75$$
    (7)
  5. Climbing trees is the fifth technique, and the position of alpha capuchins is updated using the following equation [22]:

    $${x}_{j}^{i}(t+1)={\mathrm{Best}}_{j}+{ P}_{bf}\left({v}_{j}^{i}(t+1)-{v}_{j}^{i}(t)\right),\qquad i<n/2;0.75<\mathcal{E}\le 1.0$$
    (8)

In order to find a better solution, alpha leaders are randomly relocated, as shown in Eq. (9) [22].

$${x}_{j}^{i}=\tau \left({lb}_{j}+\left({ub}_{j}-{lb}_{j}\right)\mathrm{rand}\right) i<n/2; \mathcal{E}\le {P}_{r}$$
(9)

where \({P}_{r}=0.1\) is the probability of a random walk search. \({ub}_{j}\) and \({lb}_{j}\) are the upper and lower boundaries of the search domain at dimension \(j\) and \(\tau\) is a parameter formulated as

$$\tau =2{e}^{-21{\left(\frac{t}{\mathrm{maxite}}\right)}^{2}}$$
(10)

where t and \(\mathrm{maxite}\) stand for the current iterations and total iterations, respectively.

According to Eq. (11), the positions of the followers of the capuchin leaders in CapSA are updated as follows:

$${x}_{j}^{i}(t+1)= \frac{1}{2}\left( {{x}^{\prime}}_{j}^{i}(t+1)+{x}_{j}^{i}(t)\right) n/2\le i\le n$$
(11)

where \({x}_{j}^{i}(t)\) represents the follower's prior position at dimension \(j\) and \({{x}^{\prime}}_{j}^{i}(t+1)\) is the current position of its leader at dimension \(j\).

Algorithm 1 Pseudocode of the basic CapSA
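
The following Python sketch illustrates how a leader's position update can be dispatched on the random value \(\mathcal{E}\) according to the thresholds of Eqs. (3)–(10), together with the follower update of Eq. (11). It is a simplified reading of the rules above: the default values of \(P_{bf}\), \(P_{ef}\), and \(P_{r}\) and the handling of boundary values of \(\mathcal{E}\) are assumptions made for illustration.

```python
import numpy as np

G = 9.81  # gravitational acceleration used in Eqs. (3) and (5)

def update_leader(x, v_new, v_old, best, t, maxite, lb, ub,
                  P_bf=0.7, P_ef=9.0, P_r=0.1):
    """One leader's position update, dispatched on the random value eps
    following the thresholds of Eqs. (3)-(9); x, v_new, v_old, best, lb, ub
    are length-d arrays."""
    eps = np.random.rand()
    theta = 1.5 * np.random.rand()                     # jumping angle, Eq. (4)
    tau = 2.0 * np.exp(-21.0 * (t / maxite) ** 2)      # Eq. (10)
    if eps <= P_r:                                     # random relocation, Eq. (9)
        return tau * (lb + (ub - lb) * np.random.rand(x.size))
    if eps <= 0.20:                                    # jumping on trees, Eq. (3)
        return best + P_bf * v_new ** 2 * np.sin(2.0 * theta) / G
    if eps <= 0.30:                                    # jumping on the ground, Eq. (5)
        return best + P_ef * P_bf * v_new ** 2 * np.sin(2.0 * theta) / G
    if eps <= 0.50:                                    # normal walking, Eq. (6)
        return x + v_new
    if eps <= 0.75:                                    # swinging on trees, Eq. (7)
        return best + P_bf * np.sin(2.0 * theta)
    return best + P_bf * (v_new - v_old)               # climbing trees, Eq. (8)

def update_follower(x_follower, x_leader):
    """Follower update of Eq. (11): midpoint of the leader's new position
    and the follower's own previous position."""
    return 0.5 * (x_leader + x_follower)
```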

With these properties of CapSA, it still needs to be improved, and this motivated us to propose a modified version of CapSA as discussed in the following section.

4 Enhanced CapSA for feature selection

The FS process can be viewed as an NP-hard search problem [38]. Selecting the optimal feature subset from high-dimensional data is a difficult task that requires expensive computation time. Metaheuristics are stochastic methods that have been applied with great success to complex optimization problems for which exact methods cannot be applied. The CapSA algorithm is a novel metaheuristic that shows high performance when tackling various optimization problems [22, 23]. However, like any metaheuristic, CapSA requires specific parameters to be tuned, and therefore local optima stagnation and premature convergence may occasionally occur.

The merits of CapSA motivate us to develop, for the first time, a novel FS approach based on an enhanced version of the CapSA optimizer (ECapSA) and to use it as a wrapper search strategy. ECapSA aims to enhance the performance of the basic CapSA by incorporating four improvements: (1) a dynamically decreasing inertia weight based on the logistic map is combined with the velocity update to effectively balance the global and local search of the basic CapSA; (2) a sine cosine-based adjustment of the acceleration coefficients enhances the convergence rate; (3) a stochastic learning strategy adds more diversity to the movement of the capuchins; and (4) a Lévy flight-based search strategy is integrated into the position update to strengthen the global search ability. The overall framework of the proposed ECapSA for the FS problem is presented in detail in the following subsections.

4.1 Population initialization

ECapSA begins by assigning an initial position to each agent of the population. The population \({X}^{t}\) (\(t = 0,\dots ,maxite\)) of \(N\) capuchins in a \(d\)-dimensional space is formulated as

$${X}^{t}=\left[\begin{array}{c}{x}_{1}^{1},{x}_{2}^{1},\dots ,{x}_{d}^{1}\\ {x}_{1}^{2},{x}_{2}^{2},\dots ,{x}_{d}^{2}\\ \vdots \\ {x}_{1}^{N},{x}_{2}^{N},\dots ,{x}_{d}^{N}\end{array}\right]$$

The initial population is generated by random methods, which ensure it covers as much solution space as possible. Therefore, \({X}^{0}\) is generated by the uniform distribution as follows:

$${x}_{j}^{i}={lb}_{j}+\left({ub}_{j}-{lb}_{j}\right)\mathrm{rand},\quad i=1,\dots ,N,\ j=1,\dots ,d$$
(12)

where \({lb}_{j}\) and \({ub}_{j}\) stand for the lower and upper boundaries of the solution \({x}^{i}\in X\) at the \(j\)th dimension, respectively. In this paper, we set \({lb}_{j}=0\) and \({ub}_{j}=1\). Then, the fitness value of each solution \({x}^{i}\) is computed as discussed in the following section.
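
A minimal sketch of this initialization step (Eq. 12), assuming the population is stored as an \(N \times d\) NumPy array:

```python
import numpy as np

def init_population(N, d, lb=0.0, ub=1.0, seed=None):
    """Uniform random initialization of N capuchins in [lb, ub]^d (Eq. 12)."""
    rng = np.random.default_rng(seed)
    return lb + (ub - lb) * rng.random((N, d))
```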

4.2 Fitness function

The feature subset encoded in each individual is assessed using the KNN classifier. We utilize the fitness function given by the following equation to discover a subset with the smallest number of features and the highest classification accuracy:

$$fit=\alpha .\mathrm{ErrClass}+\left(1-\alpha \right).\frac{{D}_{s}}{d}$$
(13)

where ErrClass stands for the classification error, \({D}_{s}\) refers to the number of selected features, and \(\alpha \in [0, 1]\) weights the relevance of feature reduction against the classification error rate. After that, the sine cosine acceleration coefficients, the chaotic inertia weight strategy, the stochastic learning strategy, and the Levy random walk are used to enhance the current solutions. The details of these methods are given in the following subsections.
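
The following sketch shows how the fitness of Eq. (13) can be evaluated with a KNN classifier and fivefold cross-validation; binarizing a continuous position with a 0.5 threshold and returning the worst fitness for an empty subset are assumptions, since the discretization rule is not detailed here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(solution, X, y, alpha=0.5, threshold=0.5, k=5):
    """Weighted sum of classification error and selected-feature ratio (Eq. 13)."""
    selected = np.asarray(solution) > threshold   # binarize the continuous position
    if not selected.any():                        # guard against an empty subset
        return 1.0                                # worst possible fitness
    knn = KNeighborsClassifier(n_neighbors=k)
    acc = cross_val_score(knn, X[:, selected], y, cv=5).mean()
    err_class = 1.0 - acc
    return alpha * err_class + (1.0 - alpha) * selected.sum() / X.shape[1]
```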

4.3 Sine cosine acceleration coefficients

Looking at the velocity update Eq. (1), it is made up of three terms. The first one, \(v(t)\), represents the previous velocity; the second term is the cognitive component directed toward the personal best position visited by the capuchin; and the third one is the social component that regulates the velocity of the capuchin toward the global best solution (\(Gbest\)). The second and third terms drive the algorithm to execute global and local searches, respectively. As can be seen, Eq. (1) is similar to the PSO velocity update formula, and it has been demonstrated that the second term of Eq. (1) can decrease the convergence rate quickly, while the third term of Eq. (1) can cause premature convergence [58].

In the original CapSA, the acceleration coefficients are set to a fixed value as in the original PSO. The optimizer's solution quality is influenced by the relative values of cognitive and social components. When the social component \({a}_{2}\) is relatively high in comparison with the cognitive component \({a}_{1}\), particles arrive at a local optimum sooner, and when the cognitive component is relatively high, particles meander over the search space [29]. Many studies have been conducted to determine the ideal mix of these elements [29, 59, 60].

These coefficients are updated in such a way that the cognitive component is lowered, and the social component is boosted as iteration progresses to improve the solution quality. Based on the work [61], the following equations are used to update the two acceleration coefficients:

$${a}_{1}=-2\times \mathrm{sin}\left(\frac{\pi }{2}\times \frac{t}{\mathrm{maxite}}\right)+2.5$$
(14)
$${a}_{2}=-2\times \mathrm{cos}\left(\frac{\pi }{2}\times \frac{t}{\mathrm{maxite}}\right)+2.5$$
(15)

According to [29], \({a}_{1}\) decreases during the search from 2.5 to 0.5, while \({a}_{2}\) increases from 0.5 to 2.5.
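
A minimal sketch of the two schedules of Eqs. (14)–(15):

```python
import numpy as np

def acceleration_coefficients(t, maxite):
    """Sine cosine schedules of Eqs. (14)-(15): a1 decreases from 2.5 to 0.5
    while a2 increases from 0.5 to 2.5 over the run."""
    a1 = -2.0 * np.sin(np.pi / 2.0 * t / maxite) + 2.5
    a2 = -2.0 * np.cos(np.pi / 2.0 * t / maxite) + 2.5
    return a1, a2
```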

4.4 Chaotic inertia weight strategy

In CapSA, the velocity update process is mainly based on the inertia weight, as in some previously introduced algorithms such as PSO [58] and BAT [62]. The inertia weight technique is critical in maintaining a balance between global and local search: it determines the contribution of a particle's former velocity to its new one at the current iteration. In [63], Shi and Eberhart introduced the inertia weight (IW) in PSO as a constant and illustrated that exploration is enhanced by a large value of the IW, whereas exploitation is improved when the value of the IW is small; that is, a large IW facilitates a global search, while a small IW facilitates a local search [63]. Many dynamic IW techniques have already been presented to augment PSO's capabilities, among them time-varying IW techniques, in which the value of the IW is modified according to the number of iterations [64,65,66].

Chaos has ergodic and stochastic properties. In a dynamic system, a global optimum or a good approximation of it can be attained with high probability by following chaotic orbits. Based on the work of [67], we introduce a chaotic optimization mechanism into CapSA and propose the use of the logistic map to tune the IW \(\rho\). The purpose of using a chaotic IW instead of the decreasing strategy used in the basic CapSA is to improve the population diversity during the search, as well as to enhance the ability to converge to the global optimum. The logistic map is applied to update the IW as described in Eq. (16):

$$r\left(t+1\right)=4\times r\left(t\right)\times \left(1-r\left(t\right)\right)\,\,\, r\left(0\right)=\mathrm{rand}$$
(16)

where \(r\left(0\right)\notin \left\{0, 0.25, 0.5, 0.75, 1\right\}\).

$$\rho \left(t\right)=r\left(t\right)\times {\rho }_{\mathrm{min}}+\frac{t\times \left({\rho }_{\mathrm{max}}-{\rho }_{\mathrm{min}}\right)}{\mathrm{maxite}}$$
(17)

where \({\rho }_{\mathrm{max}}\) and \({\rho }_{\mathrm{min}}\) are the maximum and minimum values of the IW, \(t\) stands for the current iteration, \(\mathrm{maxite}\) stands for the maximum number of generations, and \(\rho \left(t\right)\) represents the IW value at iteration \(t\). \(r\left(t\right)\) is a number between 0 and 1 generated by the logistic chaotic map.
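
The following sketch implements the logistic map of Eq. (16) and the inertia weight of Eq. (17) as written above; the values of \(\rho_{\mathrm{min}}\) and \(\rho_{\mathrm{max}}\) are assumptions.

```python
import numpy as np

def logistic_map(r):
    """One step of the logistic chaotic map (Eq. 16)."""
    return 4.0 * r * (1.0 - r)

def chaotic_inertia_weight(r, t, maxite, rho_min=0.4, rho_max=0.9):
    """Inertia weight of Eq. (17), driven by the chaotic value r(t)."""
    return r * rho_min + t * (rho_max - rho_min) / maxite

# Typical usage inside the main loop (assumed):
#   r = np.random.rand()              # r(0) not in {0, 0.25, 0.5, 0.75, 1}
#   for t in range(maxite):
#       r = logistic_map(r)
#       rho = chaotic_inertia_weight(r, t, maxite)
```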

4.5 Stochastic learning strategy

In CapSA, during each iteration, a capuchin updates its velocity using Eq. (1), which consists of three weighted terms: the first term (\({v}_{i}(t)\)) denotes the old velocity of the capuchin at the previous iteration; the second term \(({\mathrm{pbest}}_{i}\left(t\right)-{x}_{i}(t))\) is the "cognitive part," which reflects the capuchin's memory of its own historical experience; and the third term \((\mathrm{Gbest}\left(t\right)-{x}_{i}(t))\) is the "social part," which represents the information sharing and cooperation among capuchins. According to Eq. (1), capuchins update their positions by moving toward their personal best solution (pbest) and the global best solution (gbest). However, this strategy can lead to premature convergence and poor performance of CapSA. In order to add more diversity to the movement of capuchins, we propose in this paper a new velocity update equation, inspired by [30] and [31], as follows:

$${v}_{i}^{d}\left(t+1\right)=\rho .{v}_{i}^{d}\left(t\right)+{a}_{1}.\mathrm{rand}.\left({\mathrm{pbest}}_{{f}_{i}^{d}}^{d}(t)-{x}_{i}^{d}(t)\right)+{a}_{2}.\mathrm{rand}.\left(\mathrm{gBest}-{x}_{i}^{d}(t)\right)$$
(18)

where \({f}_{i}^{d}\) defines which capuchin's personal best the \(i\)th capuchin should follow at dimension \(d\). For each dimension of capuchin \({x}_{i}\), two capuchins are chosen randomly from its local neighborhood (a ring topology is used). Then, the fitness values of the personal best positions of these two capuchins are compared with that of the capuchin whose velocity is being updated, and the personal best of the capuchin with the better fitness is used in Eq. (18). This learning strategy helps capuchins learn from other good individuals in their local neighborhood, providing ECapSA with fast convergence and better global exploration ability.
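
A minimal sketch of this dimension-wise exemplar selection and of the velocity update of Eq. (18) is given below; the radius of the ring neighborhood, the inclusion of the capuchin itself in that neighborhood, and the assumption that fitness is minimized are choices made for illustration only.

```python
import numpy as np

def select_exemplar(i, pbest, pbest_fit, N):
    """Dimension-wise exemplar for capuchin i: for each dimension, pick the
    fitter (lower fitness) of two capuchins drawn at random from the ring
    neighborhood {i-1, i, i+1}."""
    d = pbest.shape[1]
    neighborhood = [(i - 1) % N, i, (i + 1) % N]
    exemplar = np.empty(d)
    for j in range(d):
        a, b = np.random.choice(neighborhood, size=2, replace=False)
        winner = a if pbest_fit[a] <= pbest_fit[b] else b
        exemplar[j] = pbest[winner, j]
    return exemplar

def update_velocity_sl(v_i, x_i, exemplar, gbest, rho, a1, a2):
    """Velocity update of Eq. (18) using the dimension-wise exemplar."""
    r1 = np.random.rand(x_i.size)
    r2 = np.random.rand(x_i.size)
    return rho * v_i + a1 * r1 * (exemplar - x_i) + a2 * r2 * (gbest - x_i)
```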

4.6 Levy random walk

In the exploration phase of CapSA, according to Eq. (9), the previous solution is moved to a randomly chosen position to generate a new individual, and the direction of the search is random. However, this strategy can lead to excessive exploration, which tends to decrease the convergence speed of the algorithm [68,69,70]. To deal with this issue, some optimizers replace the simple random walk with a Lévy flight random walk; we adopt the same idea to enhance the performance of CapSA. Indeed, the Lévy random walk helps to generate solutions far from existing ones and thus enables a better exploration of the search space [68,69,70].

Motivated by the interesting properties of the Lévy flight walk, and in order to improve the global search of CapSA, we reformulate Eq. (9) as follows:

$${x}_{i}\left(t+1\right)={x}_{i}\left(t\right)+\mathrm{Levy}\left(M\right).\left({x}_{r}\left(t\right)-{x}_{i}\left(t\right)\right)$$
(19)

where \({x}_{i}\) is the current solution to update, \({x}_{r}\) refers to a randomly picked solution through random permutation, and \(Levy(M)\) is the Levy flight step size.

$$\mathrm{Levy}\left(M\right)=0.01\times \frac{{r}_{1}\times \sigma }{{\left|{r}_{2}\right|}^{\frac{1}{\beta }}}$$
(20)

where \({r}_{1}\) and \({r}_{2}\) are two random numbers drawn from the range [0, 1], \(\beta\) is a constant, and

$$\sigma ={\left\{\frac{\Gamma \left(1+\beta \right)\times sin\left(\frac{\pi \beta }{2}\right)}{\Gamma \left(\frac{1+\beta }{2}\right)\times \beta \times {2}^{\left(\beta -1\right)/2}}\right\}}^{\frac{1}{\beta }},\Gamma \left(z\right)=\left(z-1\right)!$$
(21)

where \(\Gamma\) denotes the gamma function, with \(\Gamma \left(x+1\right)=x!\) for integer \(x\).
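
A possible implementation of this Lévy step is sketched below, following Eqs. (19)–(21) as written (with uniform \(r_1\) and \(r_2\)); the value \(\beta = 1.5\) is an assumption. The complete ECapSA procedure is described in Algorithm 2.

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(dim, beta=1.5):
    """Levy flight step of Eqs. (20)-(21); r1 and r2 are uniform in [0, 1]."""
    sigma = (gamma(1.0 + beta) * sin(pi * beta / 2.0)
             / (gamma((1.0 + beta) / 2.0) * beta * 2.0 ** ((beta - 1.0) / 2.0))) ** (1.0 / beta)
    r1 = np.random.rand(dim)
    r2 = np.random.rand(dim)
    return 0.01 * r1 * sigma / np.abs(r2) ** (1.0 / beta)

def levy_relocate(x_i, x_r):
    """Exploration move of Eq. (19): step from x_i toward a randomly picked x_r."""
    return x_i + levy_step(x_i.size) * (x_r - x_i)
```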

Algorithm 2 Pseudocode of the proposed ECapSA

5 Experimental results

5.1 Datasets description

To assess the quality of ECapSA, 16 well-known UCI benchmark datasets were employed in this work. Many scholars utilize these datasets to compare performance in the field of FS. Table 1 summarizes the properties of these datasets. Training and test sets are generated using fivefold cross-validation for each dataset. The KNN classifier (with K = 5 in this article) is used to evaluate each feature subset obtained by each capuchin.

Table 1 The tested datasets

5.2 Evaluation measures and value of parameter

The following evaluation criteria, applied in many works on FS problems, are utilized to evaluate and compare the methods employed in this paper.

Average accuracy: It is given by Eq. (22) and is the mean of the accuracy values of the method over \(M\) runs:

$${\text{AvgAcc}} \, = \frac{1}{M}\sum \limits_{i=1}^{M}{\mathrm{Acc}}^{\mathrm{i}} \,$$
(22)

where \({\mathrm{Acc}}^{i}\) is the accuracy of the optimal agent at ith run.

Average fitness: It is determined by Eq. (23) as the average of the method's best fitness value across \(M\) runs.

$${\text{AvgFit}} \, = \frac{1}{M}\sum \limits_{i=1}^{M}{\mathrm{fit}}_{\mathrm{Best}}^{i} \,$$
(23)

Average feature selection ratio: It is calculated as the average of the ratio of the number of selected features to \(D\) over \(M\) runs, as given in Eq. (24).

$$\mathrm{AvgFR} = \frac{1}{M}\sum \limits_{i=1}^{M}\frac{{\mathrm{FSN}}^{i}}{D}$$
(24)

where \({\mathrm{FSN}}^{i}\) is the number of selected features obtained using the best solution at the \(i\)th run.

To assess the efficiency of the developed ECapSA, 16 well-regarded UCI datasets [71] are listed in Table 1. ECapSA is compared with the original CapSA and with five other metaheuristics: DE [72], BPSO [58], SSA [73], SCA [74], and the binary DA [75]. The tuning parameters of all algorithms are listed in Table 2. The control parameters of the original CapSA are taken from [23].

Table 2 Parameter setting of each method

Because the compared algorithms are stochastic, they use the same general parameters during the optimization phase to ensure a fair comparison: \(maxite\) is set to 100, the population size is set to 20, and each algorithm is run in ten separate executions. The parameter \(\alpha\) is set to 0.5, as commonly used in the FS literature. We also use fivefold cross-validation, and a KNN classifier with K = 5 is used to assess the quality of the feature subset encoded in each individual solution.

5.3 Results analysis and discussion

The classification results of all optimizers are reported in Tables 3, 4 and 5 according to the accuracy (ACC), fitness (Fit), and number of selected features (FSN) measures. In addition, we rank the ECapSA algorithm and the comparative approaches using the nonparametric Friedman test to see whether the difference between the suggested strategy and the others is significant based on the three assessment metrics. The mean rank of each FS approach among the statistical findings is presented in the last two rows of each table; for each measure, the approach with the best mean rank is considered the best.

Table 3 Comparison of the developed method and the other optimizers in terms of average accuracy and STD
Table 4 Comparison of the average fitness values of ECapSA and six other optimizers
Table 5 Comparison of the average number of selected features of ECapSA and the other optimizers

According to the results in Tables 3, 4 and 5, we make the following notices:

  • Table 3 shows that ECapSA outperforms the alternative optimizers for almost all datasets in terms of the evaluation measure ACC. Except for the Exactly2 dataset, where the original CapSA delivers a greater average accuracy value, ECapSA outperforms the original CapSA on all datasets. This finding demonstrates that the changes made to the classic CapSA have a positive effect on classification performance. For 12 of the sixteen datasets analyzed, ECapSA outperforms the other algorithms. It delivers the same average accuracy and STD values as CapSA and DE for the M-of-N and Exactly datasets. SCA gives the best accuracy for only one dataset: the Zoo dataset. According to Table 3, ECapSA is placed first with an overall rank value of 6.5938, followed by the original CapSA in second place with an overall rank value of 5.7813. DE, SCA, and SSA are ranked third, fourth, and fifth, respectively. The BPSO and BDA are placed sixth and seventh, respectively.

  • Table 4 demonstrates ECapSA's superiority in terms of average fitness. For 75 percent of the datasets, ECapSA surpasses the other six optimizers and has the greatest average fitness value. Except for the Exactly2 dataset, where CapSA offers the least value, ECapSA performs better than the original CapSA for all datasets. For two datasets, Exactly and M-Of-N, DE, CapSA, and ECapSA produce equivalent results. SCA, on the other hand, provides a greater average fitness value for the Zoo dataset. Table 4 also shows the mean rank of each FS technique in terms of average fitness value. As can be seen, the suggested ECapSA is ranked higher than CapSA. DE and SCA are ranked third and fourth, respectively, while SSA and BPSO are ranked fifth and sixth. BDA, once again, is the worst of all optimizers, coming in last place.

  • Considering the ratio of selected attributes presented in Table 5, ECapSA beats all optimizers in the majority of datasets. It produces a better average ratio value for 75% of the datasets, whereas CapSA, BDA, and BPSO each produce the best average ratio value for one dataset: IonosphereEW, SpectEW, and Tic-tac-toe, respectively. For the Exactly and M-of-N datasets, ECapSA and CapSA yield similar average ratio values. ECapSA has the lowest mean rank (1.4688) and is therefore placed first. With an overall rank of 2.7813, CapSA is ranked second, followed by BDA and BPSO in third and fourth position, respectively. SSA and SCA are tied with the same average rank, followed by DE in last place.

Figure 1 shows the average fitness value attained by ECapSA and CapSA for each dataset. Regarding Fig. 1, except for the Exactly2 dataset, ECapSA converges faster than the original CapSA on nearly all of the datasets studied. The adjustments introduced to improve the CapSA algorithm are the main reason for this increase in the rate of convergence toward the optimal solution. When comparing the behavior of ECapSA to that of the original CapSA, it is clear that ECapSA outperforms the traditional CapSA.

Fig. 1 The convergence curves of CapSA and ECapSA

Figure 2 shows the average convergence curves of seven optimizers on different datasets. It is clear from the figure that ECapSA outperforms other algorithms in terms of average fitness value in the majority of datasets. It should be mentioned that ECapSA presents faster convergence because the improvements introduced allowed for a better balance between exploration and exploitation capabilities.

Fig. 2 The convergence curves of ECapSA and other optimizers

Table 6 compares the results of ECapSA with several state-of-the-art methodologies in terms of classification accuracy to further evaluate the performance of the suggested methodology: WOA based on crossover and mutation (WOA-CM) [76], the Chaotic Interior Search Algorithm (CISA) [77], BBO [78], the Satin Bowerbird Optimizer (SBO) [78], the Binary Bat Algorithm (BBA) [79], and the Binary Grasshopper Optimization Algorithm (BGOA). Missing values in Table 6 are indicated with "–". We can see from the findings in Table 6 that ECapSA produces greater classification accuracy for eight datasets and gives optimal performance for the M-Of-N and Exactly datasets. ECWSA-4 is ranked second because it provides higher accuracy values for four datasets: BreastEW, Exactly2, HeartEW, and Zoo. CISA is ranked third thanks to its performance on the Zoo dataset. The results of the other optimizers were respectable, although they did not achieve the best accuracy on any dataset.

Table 6 Comparison with other FS methods from literature

6 Conclusion

In this paper, an improvement of the capuchin search algorithm (CapSA) has been presented as a feature selection approach. The enhancement of CapSA, named ECapSA, relies on four improvements: a dynamically decreasing inertia weight using the logistic map, a sine cosine-based adjustment of the acceleration coefficients, a stochastic learning strategy, and a Lévy flight-based search strategy to improve the convergence rate. To justify the efficiency of ECapSA, it has been compared with different MH techniques, including CapSA, BDA, SSA, SCA, DE, and BPSO. In addition, a set of sixteen datasets has been used to evaluate the performance of the competing algorithms. According to the obtained results, the developed ECapSA is more efficient than the competing MH and other state-of-the-art techniques. In addition, the convergence of the developed method is better than that of the other methods, as can be noticed from the convergence curves, and according to the Friedman test, the developed ECapSA obtained the first mean rank among the comparative algorithms for the considered performance metrics. Despite these advantages, ECapSA still needs some improvements, especially regarding the time required to determine the relevant features. This can be tackled by reducing the population size or by using a local search method to improve the global solution by removing irrelevant features and adding relevant ones. In addition, the sine cosine acceleration coefficients and the chaotic inertia weight strategy could be applied according to some criterion, for example only during the first half of the iterations or according to a random value (rand > 0.5), which would reduce the cost of updating them at each iteration.

Besides its superiority, ECapSA can be applied to other real-life areas, including task scheduling in cloud computing and Internet of Things problems. Most of these problems are similar to feature selection problems since they are discrete problems to which ECapSA can be extended. Moreover, ECapSA can be used within prediction techniques (i.e., classification or regression) to improve the performance of different machine learning methods; for example, it can be combined with a random vector functional link network or an artificial neural network (ANN). In addition, the proposed method can be applied as a multi-objective optimization technique to minimize or maximize several objective functions; however, it must then be combined with the concepts of Pareto front and archive to save the optimal solutions.