Introduction

The Internet is currently being adopted by people, governments, and institutions all over the world in almost all aspects of life. As a result, a large amount of data is generated every day in a wide variety of forms, with these datasets typically serving as an extension of knowledge [1]. In this regard, society creates data across a broad range of sectors, including health, agriculture, and industry, among many others [2, 3]. These datasets may be categorized and then used to derive information, forecasts, and insights. Thanks to rapidly advancing data collection and storage technologies, organizations and governments routinely collect and use these vast volumes of data. Raw datasets are frequently fruitless and possibly unusable unless a proper automated method is used to extract useful information from them; in any case, obtaining usable information has proved to be a very difficult task [4]. Traditional data analysis techniques often fail when attempting to analyze huge datasets. Even when a dataset is very small, the atypical form of the data might make it challenging to process and analyze it effectively with traditional methods. In many cases, the problems that need to be addressed cannot be solved with existing data analysis methods, and so new methods have to be developed [5]. Based on these considerations, the key principles of data mining (DM) can be used to identify observable patterns in unprocessed datasets [6]. DM approaches attempt to learn about various features within the data by extracting and modeling the content of the data. DM is thus a generic process that involves a number of transformation operations, starting from pre-processing the data and ending with post-processing the output produced by a pre-built DM technique [7]. Because raw datasets often include redundant, irrelevant, or unimportant data [8], they cannot be utilized directly in post-processing steps. Pre-processing steps must therefore be applied to the gathered data to clean it up and prepare it for the subsequent phases of machine learning (ML) methods [7]. Data pre-processing is perhaps the most prolonged and difficult phase of the knowledge discovery process due to the variety of ways in which data may be collected. From this perspective, this work specifically focuses on feature selection (FS) to confront the above obstacles.

Feature selection is an important area of study in ML tasks, where ML is one of the key areas of cognitive computation. In many ML tasks, high dimensionality augments the information content of the data on the one hand, but leads to the curse of dimensionality problem on the other [6]. In fact, for many real-world applications, a small subset of informative and discriminative features can perform better than the full feature set. Data dimensionality reduction [9, 10] may be thought of as a cognitive method for analyzing the inherent properties of data. Feature selection [1, 7] can favorably tackle the problems of high dimensions by lowering the dimensionality in ML tasks. It constitutes one of the most important pre-processing steps, aiming to remove duplicate or irrelevant data from the dataset to be analyzed [11]. On this account, the benefits of FS stretch from reducing data dimensionality and over-fitting to eliminating noisy data, improving classification accuracy, speeding up the model's learning cycle, and reducing complexity within the dataset, among many other merits [12]. Due to the above blessings, FS techniques have become active research areas and have been properly used in various fields such as facial recognition [13], image classification [14], and micro-array analysis [15].

Recently, many COVID-19 cases have been collected, and a dataset has been established [16]. Regardless of the location, case volume, pandemic wave, or time of collection, the gathered COVID-19 dataset consists of fifteen features, and FS methods attempt to locate the subset of the most informative features. This subset is then used in machine learning or deep learning methods for classification purposes. Moreover, some methods have used COVID-19 data to discover chronic lung diseases, with the aim of estimating the severity and mortality rate among COVID-19 patients [17,18,19]. As of now, there are more than 900 million people in China who have been infected with COVID-19, and that number is rising by the millions every day. China has since ceased providing daily COVID figures and abandoned its zero-COVID policies. An increase in COVID cases is also expected in rural China this year, and the COVID wave in China is expected to peak within 2 to 3 months. COVID-19 and chronic obstructive pulmonary disease (COPD) have several potential adverse interrelations that may influence infection course and clinical results. There are mechanisms that may account for increased COVID-19 infection susceptibility in COPD, such as ineffective immunity and decreased antiviral defense [20]. COPD is linked with worse clinical results from COVID-19 [20], and COVID-19 has had a large impact on the routine care of COPD patients. There is evidence that COPD patients have worse outcomes from COVID-19 [18, 19]. One consequence of COVID-19 has been isolation and increased anxiety in COPD patients, with conceivably deleterious long-term repercussions [20].

Returning to the essence of this study: in general, during the FS process, a subset of candidate features is selected from the original feature set, and its relevance is then measured by an evaluation criterion. The process of selecting and evaluating a subset of features is repeated until a preset stopping condition is met, and the best obtained subset is then validated on the test dataset [21]. Feature selection methods optimize the search over the search space with respect to two opposing goals, namely, reducing the redundancy of the selected features and increasing their relevance to the class label [22]. Various search techniques can be used to pursue these goals, and they can be classified into single-objective and multi-objective methods [23]. In single-objective FS methods, the solution is improved by evaluating a particular objective function; the chosen objective function therefore affects the quality of the solution obtained by the optimization algorithm. Moreover, no single objective is suitable for all optimization problems, so defining a fitness function that optimizes a single objective can lessen the performance of the optimization method. Considering several conflicting goals in a fitness function can overcome these obstacles and may return a non-dominated set of solutions, that is, several feature subsets that meet different objectives. The chief handicap of multi-objective optimization methods is the increased complexity of the search space [24]. Furthermore, the intricacy of most real-world problems poses a further challenge for FS methods, which in turn requires exploring a large solution space due to the dependencies and non-linear relations among the dataset characteristics [6]. Each subset produced during the generation of candidate subsets must be examined to determine which subsets best satisfy assessment criteria such as maximizing the classification rate or reducing the error [25]. This procedure is computationally costly and cumbersome, especially for high-dimensional datasets, where creating all possible subsets becomes impractical. Hence, dealing with such complex problems is difficult using conventional FS methods. These and other difficulties have led researchers to investigate many different strategies to attain excellent performance levels in classification tasks [26]. Thus, meta-heuristics have been targeted in the search for improved solutions to FS problems with an optimal rate of performance [5]. These techniques have generally proven effective in tackling real-world optimization problems in reasonable amounts of time with minimal computing effort across a wide range of engineering and science fields.

From the standpoint of cognitive computation, the challenges posed by high-dimensionality problems in ML tasks may be overcome by adopting highly reliable FS approaches as well as robust classification models that learn from data. In this study, a newly developed meta-heuristic, named the capuchin search algorithm (CSA) [27], was adopted to solve broadly available feature selection problems in the field of medical diagnosis. Although CSA has the ability to reach the optimal solution when solving diverse optimization problems [28, 29], it is customarily confined to local optima, especially when it encounters complex problems with many local optimums. This may be ascribed to its narrow search ability and modest convergence property. Hence, this study made some improvements to the basic CSA to efficiently solve such convoluted FS problems, in keeping with the ideology of continuous improvement to come up with highly powerful solutions to real-world problems. This is the first and main motive for this work. The basic CSA has two essential parameters in the velocity updating model, referred to as the cognitive and social parameters, which help the capuchins reach the optimal solution. However, these parameters are constant during the iterative process of CSA, which may impair exploration and exploitation when addressing hard optimization problems. As well-adapted cognitive and social parameters together with an efficient inertia weight mechanism are expected to influence the performance of CSA, it is worth investigating how to enhance CSA through adaptive strategies for these parameters. Based on the underlying CSA, a promising velocity updating model with proper control parameters can be implemented to guide the local and global search stages of the capuchins in the surrounding environment. In this respect, three improved versions of CSA were developed to deal with the premature convergence and low search ability of CSA. Each version adds a reasonable improvement to the parent CSA by adopting a different growth function to update the values of the cognitive and social models during the iterations of CSA. These versions are called exponential CSA (ECSA), power CSA (PCSA), and S-shaped CSA (SCSA). Subsequently, a new inertia weight was proposed for all of these versions to further control the velocity model. This improvement is intended to empower these versions with stronger exploration and exploitation abilities. These functions not only provide effective guidance for the capuchins in the search areas, but are also useful in alleviating the stagnation of CSA. As the optimization frameworks of the proposed and basic variants of CSA are continuous, they can only handle continuous search spaces and have trouble tackling problems with binary search spaces. In light of this, binary versions of these variants were created by adjusting their key operators and parameters to align with the nature of the search space of FS problems. This is the second motive of the current work. In this work, we address the shortcomings of CSA based on cognitive models to provide efficient and reliable optimization algorithms, and we expand the work utilizing growth computation models to strengthen the efficacy of the proposed methods on FS problems.
For many of the datasets considered in this study, we found that the proposed FS methods, when combined with an adaptive inertia weight during their iterative process, consistently provide higher performance levels than other FS methods. To sum up, the theoretical contributions of the proposed work can be summarized as follows:

  • Three enhanced binary versions of CSA were proposed and applied, together with the basic binary CSA, to solve a variety of 24 datasets collected from the UCI machine learning repository.

  • Three new cognitive and three new social models were embedded into the proposed methods to improve the diversity of solutions and to better balance exploration and exploitation, thereby promoting the best-found solutions.

  • The performance of the evolved FS algorithms was compared with that of other highly effective FS methods in terms of several relevant criteria.

The rest of this work is arranged as follows: a literature review of several feature selection methods is presented in the “Related Works” section. The “Basic Capuchin Search Algorithm” section provides a brief description of the parent CSA. The following “Proposed Algorithms of CSA” section presents the proposed algorithms in detail. Next, the “Proposed Algorithms for Feature Selection” section describes the binary versions of the proposed feature selection methods. In the “Experimental Results and Discussions” section, the experimental results are presented and discussed. Finally, conclusions and several future directions are provided in the “Conclusion and Future Works” section.

Related Works

More recently, meta-heuristic algorithms have been widely used by many researchers to address different kinds of FS problems of varied levels of complexity [4, 5]. Meta-heuristics have reported notable advantages and delivered impressive accuracy when used as wrapper-based methods for solving FS problems. Well-known classes of meta-heuristics, including swarm intelligence, evolutionary algorithms, and physics-based algorithms, as well as several hybridization algorithms that combine two algorithms of the same class or of different classes, have been used to solve FS problems [6]. The following is a review of selected FS methods, classified according to the class of algorithms used, which have reported promising performance in the literature.

Swarm Intelligence-Based FS

There are many prominent examples of applications of swarm intelligence (SI) algorithms as search methods for wrapper-based approaches to solve FS problems in different domains [2, 4, 5, 30]. An efficient study for solving FS problems was presented by Arora et al. [2], who evolved two diversified binary variants of the butterfly optimization algorithm (BOA). S- and V-shaped transfer functions were applied to generate the two binary versions of BOA, referred to as BBOA. These versions were assessed on 21 datasets collected from the UCI repository. It was found in [2] that BBOA with the S-shape function is better than BBOA with the V-shape function, as well as many other similar FS algorithms, across all evaluation methods and all studied datasets. Xian-Fang et al. [30] evolved a three-stage FS method as follows: (1) in the first stage, irrelevant features were eliminated using C-relevance; (2) in the second stage, the kth feature cluster was used to collect analogous features in the same cluster; and (3) an improved version of particle swarm optimization (PSO) was used to determine the optimal feature set. This algorithm, referred to as HFS-C-P, was assessed on 18 datasets collected from public repositories. Xian-Fang et al. stated that the HFS-C-P algorithm achieved promising results with respect to fitness score, number of selected features, and computational time, outperforming other algorithms on all considered datasets. In a more recent and effective work on FS problems, an improved binary variant of the rat swarm optimizer (RSO) combined with the local search paradigm of PSO was proposed [4]. In this method, three crossover mechanisms, controlled by a switch probability, were embedded into RSO to improve the diversity of its solutions. This method was examined on 24 datasets collected from various repositories and assessed using several evaluation methods. While this method revealed promising levels of performance, its convergence rate is somewhat modest. Finally, another notable work for solving FS problems is presented in [5] using a binary version of the horse herd optimization algorithm (HOA), referred to as BHOA. In this algorithm, three transfer functions, namely, S-shape, V-shape, and U-shape, were used to obtain the binary domain of HOA, and these variants were integrated with three types of crossover mechanisms to produce fifteen different variants of BHOA. The performance of these versions was examined on 24 real-world datasets and evaluated using a set of six metric measures. The best-performing of the proposed versions is BHOA with an S-shape function and a one-point crossover. A comparative evaluation was conducted against 21 FS methods. The BHOA method was able to find very competitive results against these comparative methods, but the implementation of the 15 versions of BHOA demands a large computational burden.

Evolutionary Algorithms-Based FS

Evolutionary algorithms (EAs) represent another broad class of meta-heuristics inspired by the natural processes of evolution. These algorithms have been broadly adapted into appropriate approaches for solving FS problems with a promising degree of accuracy, a low computational burden, and a small number of selected features [31, 32]. Appropriately, many variants of EAs, such as the genetic algorithm (GA) [33], genetic programming (GP) [31], and binary differential evolution (DE) [32], have been used in the literature to tackle several types of FS problems. Recently, Awadallah et al. [34] developed a FS method based on a binary version of the JAYA algorithm, denoted as BJAM, with the help of an adaptive mutation operator. This operator was used to diversify the population during the iterative process and was managed using a pre-defined mutation rate. The JAYA algorithm was converted to binary utilizing an S-shape transfer function. The BJAM algorithm was assessed on 22 datasets selected from the UCI repository, where a good degree of performance was realized compared to other FS methods.

Physics-Based Algorithms-Based FS

Simulated annealing (SA) [35] is one of the most popular and broadly used physics-based (PB) algorithms for tackling FS problems. In [36], SA was used as a FS method for detecting various denial-of-service attacks. Several hybridizations of PB algorithms with SI algorithms or EAs have been presented in the literature to solve FS problems. For example, a hybridization of the whale optimization algorithm (WOA) with SA was used to solve FS problems on 18 datasets taken from the UCI repository [37]. In this method, SA was utilized to improve the neighborhood search ability of WOA and thereby ameliorate the exploitation ability of the native algorithm. The performance of this hybrid model is promising and superior to the parent and other algorithms. Other examples of hybrid-based FS methods include a hybridization of a binary coral reefs algorithm with SA [38] and a binary spotted hyena algorithm with SA [39]. Many other hybrid FS methods have evolved in the literature, such as a hybrid approach of genetic algorithms and artificial bee colony (ABC) [40] and a hybrid approach of the Harris hawks optimization algorithm with SA [41]. These hybrid model-based FS methods have achieved acceptable performance, but were subject to significant computation and complexity. Many other researchers have combined wrapper-based with filter-based methods to solve FS problems, as discussed below.

Hybrid Filter-Wrapper Model-Based FS

Hybridization of filter- and wrapper-based methods has been extensively used in the literature to strengthen the performance of FS tasks [42]. These hybrid models mainly comprise two stages: the first stage applies a filter method to select the most important features, while the second stage applies a wrapper method to select a subset of these selected features. For example, a hybridization of mutual information (MI) as a filter method and binary cuckoo search as a wrapper method was presented in [43] to solve FS problems. Lai et al. [44] evolved a hybrid FS model using information gain (IG) as a filter method and an improved simplified swarm optimization (ISSO) as a wrapper method. In this method, IG was applied to select the most important features representing genes, while ISSO was applied to search for the optimal subset of genes. Another hybrid filter-wrapper method for gene selection from microarray data for gene expression is reported in [45], where improved swarm optimization was implemented as the wrapper-based method. The above hybrid filter-wrapper models for FS problems revealed reasonable performance. However, despite their sensible performance, the filter methods may exclude some important features. Also, the wrapper methods that used some meta-heuristics may suffer from poor stability [46], where random search affected the stability of the selected features. A promising FS technique using a hybridization of binary biogeography optimization (BBO) with support vector machine recursive feature elimination (SVM-RFE), known as BBO-SVM-RFE, was developed in [6]. The SVM-RFE is embedded into the mutation operator of BBO to improve the quality of the obtained solutions, in order to reinforce the exploitation capability as well as to strike an adequate balance between exploitation and exploration of the original BBO. The BBO-SVM-RFE was assessed on 18 benchmark datasets. Comparative results showed that BBO-SVM-RFE revealed a wise degree of performance in terms of accuracy and number of selected features against other existing FS methods. However, the structure of BBO-SVM-RFE is complex and exhibits slow convergence behavior, where the optimal solutions require high computational effort. While the aforesaid feature selection methods realized promising levels of performance in a fair-minded time in addressing the FS problems deemed in the targeted applications and datasets, they cannot ensure that the optimal solutions will be identified in all experimental runs. This insinuates that locating the optimal feature subset is not ensured. Moreover, they demonstrated that they can address the FS problems considered in their studies by identifying a minimal number of attributes from a selected subset. However, each of these FS methods behaved properly on the problems considered and might fall short in other real-world FS problems, especially those with high-dimensional datasets. Besides, some FS methods, such as the one reported in [4], suffered from large computational time and local optima. Also, the method presented in [5] provided sensible results on some datasets but not on all datasets considered in that study. The deficiencies of the above FS systems might be attributed to the possibility that meta-heuristic algorithms may fall into local optimum solutions, especially when tackling complex FS problems with varied degrees of complexity and high dimensionality.
Hence, there is still a need for further amelioration of FS methods, particularly for datasets with a large number of features and a high level of complexity. This motivated this study to strengthen the performance of the basic CSA to deal with FS problems of high complexity. This is carried out by developing three improved variants of the basic algorithm and using binary versions of the core and proposed algorithms of CSA to explore their capacities and efficiencies in handling familiar FS problems with different numbers of features, samples, and dimensions.

Basic Capuchin Search Algorithm

CSA is a new swarm intelligence algorithm developed to imitate the foraging activity and locomotion practices of capuchins while roaming in forests. The population of CSA is divided into two groups: leaders (i.e., alpha capuchins) and followers (i.e., the remaining capuchins). Leaders lead the followers, who follow each other and the leaders directly or indirectly. Leaders are accountable for locating food sources for themselves and the other capuchins in the group, while the followers update their positions by pursuing the leaders. As reported in [27], the leaders employ the following movement strategies while foraging, which can be presented as shown below:

  • The leaders’ positions when jumping on trees are determined as follows:

    $$\begin{aligned} \begin{aligned} x_j^i =&F_j + \frac{P_{bf}(v_{j}^{i})^2 sin(2\theta )}{g} \\&i< n/2; \;\;\; 0.1 < rand \le 0.25 \end{aligned} \end{aligned}$$
    (1)

    where \(x_j^i\) denotes the current position of the leaders at dimension j, \(F_j\) is the position of the food of the capuchins found so far at dimension j, \(P_{bf}\) indicates the equilibrium probability provided by the capuchins’ tails which is equal to 0.75, \(v_{j}^{i}\) is the current velocity of the ith capuchin at dimension j which is defined in Eq. 3, \(\theta\) is the capuchins’ leaping angle defined in Eq. 2, g is the force of gravity which is equal to 9.81, n is the number of capuchins, and rand is a random value generated in the interval [0, 1].

    $$\begin{aligned} \theta = \frac{3}{2} r \end{aligned}$$
    (2)

    where r is a random value produced in the range [0, 1].

    $$\begin{aligned} v_j^i = \rho v_j^i + a_{1} \left( x^{i}_{best_j}-x_j^i\right) r_1 + a_{2} \left( F_j-x_j^i \right) r_2 \end{aligned}$$
    (3)

    where \(x^{i}_{best_j}\) stands for the best position of capuchin i at dimension j, \(a_{1}\) and \(a_{2}\) are positive acceleration coefficients that are both equal to 1.5, \(r_1\) and \(r_2\) are random values in the interval [0, 1], and \(\rho\) stands for the inertia weight of the velocity defined as given by Eq. 4.

    $$\begin{aligned} \rho = w_{max} - \left( w_{max} - w_{min} \right) \frac{k}{K} \end{aligned}$$
    (4)

    where k is the iteration index representing the current number of iterations, K is a predefined maximum number of iterations, and \(w_{min}\) and \(w_{max}\) are the minimum and maximum weight values that were set to 0.2 and 0.9, respectively.

  • The leaders’ positions while foraging on the banks of rivers using the jumping strategy can be determined as follows:

    $$\begin{aligned} \begin{aligned} x_j^i =&F_j + \frac{P_{ef} P_{bf}(v_{j}^{i})^2 sin(2 \theta )}{g} \\&i< n/2; \;\;\; 0.25 < rand \le 0.50 \end{aligned} \end{aligned}$$
    (5)

    where \(P_{ef}\) stands for the elasticity probability of capuchins’ motion on the ground which is equal to 9.

  • The leaders’ position while foraging on the ground using normal walking can be decided as follows:

    $$\begin{aligned} \begin{aligned} x_j^i =&x_j^i + v_{j}^{i} \\&i< n/2; \;\;\; 0.5 < rand \le 0.70 \end{aligned} \end{aligned}$$
    (6)
  • The leaders’ position while swaying on trees can be decided as follows:

    $$\begin{aligned} \begin{aligned} x_j^i =&F_j + \tau P_{bf}\times sin(2\theta )\\&i< n/2; \;\;\; 0.7 < rand \le 0.8 \end{aligned} \end{aligned}$$
    (7)

    where \(\tau\) is a dominant parameter defined in Eq. 8.

    $$\begin{aligned} \tau = 2 e^{-21 (\frac{k}{K})^2} \end{aligned}$$
    (8)
  • The leaders’ position while climbing trees can be decided as follows:

    $$\begin{aligned} \begin{aligned} x_j^i=&F_j + \tau P_{bf} (v_j^i-v_{j-1}^i)\\&i< n/2; \;\;\; 0.80 < rand \le 1.0 \end{aligned} \end{aligned}$$
    (9)

    where \(v_{j-1}^i\) is the velocity of capuchin i at the preceding dimension j-1.

  • The random movement of leaders while foraging can be decided as follows:

    $$\begin{aligned} \begin{aligned} x_j^i =&\tau \times \left[ lb_{j} + rand \times (ub_{j}-lb_{j})\right] \\&i < n/2; \;\;\; rand \le Pr \end{aligned} \end{aligned}$$
    (10)

    where Pr denotes the random search probability of the leaders that has a value of 0.1, and \(ub_{j}\) and \(lb_{j}\) stand for the upper and lower limits of the search space at dimension j.

The followers’ positions can be updated as per Eq. 11:

$$\begin{aligned} x_j^i = \frac{1}{2}\left( {\acute{x}}_j^i + x_j^{i-1} \right) \;\;\;\;\; n/2 \le \;i\le n \end{aligned}$$
(11)

where \(x_j^{i-1}\) and \(x_j^{i}\) stand for the former and current position of the followers at dimension j, respectively, and \(\acute{x}_j^i\) represents the current leaders’ position at dimension j.

Each new solution for each capuchin's position is assessed using a pre-defined fitness criterion. The optimization process of CSA is implemented through iterative steps, whereby capuchins' positions are evaluated and updated. These steps are reiterated at each iteration until the maximum number of iterations is reached, at which point the convergence behavior can be observed. Algorithm 1 presents a short illustration of the iterative steps of CSA.

Algorithm 1 Pseudo-code of the basic CSA
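For concreteness, the following is a minimal NumPy sketch of this iterative loop, assembled from Eqs. 1 to 11; it is not the authors' reference implementation. The boundary clipping, the handling of the velocity-difference term of Eq. 9, and the follower averaging of Eq. 11 are simplifying assumptions, and `fitness`, `lb`, and `ub` are placeholders supplied by the caller.

```python
import numpy as np

def csa(fitness, lb, ub, dim, n=30, K=100):
    """Minimal sketch of the basic CSA loop (Eqs. 1-11); lower fitness is better."""
    P_bf, P_ef, g, Pr = 0.75, 9.0, 9.81, 0.1       # balance/elasticity probabilities, gravity, random-search prob.
    w_max, w_min = 0.9, 0.2                        # inertia weight bounds (Eq. 4)
    a1, a2 = 1.5, 1.5                              # constant cognitive/social coefficients of the basic CSA
    x = lb + np.random.rand(n, dim) * (ub - lb)    # capuchin positions
    v = np.zeros((n, dim))                         # capuchin velocities
    fit = np.array([fitness(s) for s in x])
    x_best = x.copy()                              # personal best positions
    F, F_fit = x[fit.argmin()].copy(), fit.min()   # best food position found so far

    for k in range(1, K + 1):
        rho = w_max - (w_max - w_min) * k / K      # inertia weight, Eq. 4
        tau = 2.0 * np.exp(-21.0 * (k / K) ** 2)   # dominant parameter, Eq. 8
        for i in range(n):
            theta = 1.5 * np.random.rand()         # leaping angle, Eq. 2
            r1, r2 = np.random.rand(), np.random.rand()
            v[i] = rho * v[i] + a1 * (x_best[i] - x[i]) * r1 + a2 * (F - x[i]) * r2  # Eq. 3
            if i < n // 2:                                         # leaders (alpha capuchins)
                r = np.random.rand()
                if r <= Pr:                                        # random movement, Eq. 10
                    x[i] = tau * (lb + np.random.rand(dim) * (ub - lb))
                elif r <= 0.25:                                    # jumping on trees, Eq. 1
                    x[i] = F + P_bf * v[i] ** 2 * np.sin(2 * theta) / g
                elif r <= 0.50:                                    # jumping on river banks, Eq. 5
                    x[i] = F + P_ef * P_bf * v[i] ** 2 * np.sin(2 * theta) / g
                elif r <= 0.70:                                    # normal walking, Eq. 6
                    x[i] = x[i] + v[i]
                elif r <= 0.80:                                    # swaying on trees, Eq. 7
                    x[i] = F + tau * P_bf * np.sin(2 * theta)
                else:                                              # climbing trees, Eq. 9
                    dv = v[i] - np.roll(v[i], 1)                   # per-dimension velocity difference (assumed)
                    x[i] = F + tau * P_bf * dv
            else:                                                  # followers, Eq. 11 (averaging assumed)
                x[i] = 0.5 * (x[i - 1] + x[i])
            x[i] = np.clip(x[i], lb, ub)
            f_i = fitness(x[i])
            if f_i < fit[i]:
                fit[i], x_best[i] = f_i, x[i].copy()
            if f_i < F_fit:
                F, F_fit = x[i].copy(), f_i
    return F, F_fit
```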

As CSA has affirmed its reliability and efficacy in addressing many broadly well-known real-world problems [27, 28], we concluded that CSA could be an appropriate candidate algorithm to serve as a feature selection method.

Proposed Algorithms of CSA

Issues of CSA

Although CSA can search for optimal solutions while solving optimization problems, its search ability is limited by its original mathematical model and the defects of the velocity update model, where these flaws often lead to local optima when addressing complex optimization problems [27, 28]. The reason is that capuchins in CSA update their velocities toward food sources by relying on constant social and cognitive models for locomotion and repetitive foraging. However, such fixed values for these key parameters cannot guarantee that CSA escapes stagnation or that it is not confined to local optima. In addition, CSA faces another issue of feeble exploration and exploitation competencies. This obviously arises because the update of the original capuchin positions in CSA does not take into account the control of the key parameters of the velocity model during the iterative process. Therefore, there must be strategies that help update the velocity model of the capuchins as well as fine-tune the key parameters of CSA. Furthermore, the exploration aptitude of CSA is insufficient in the initial search stage, and its modest exploitation makes it difficult to find the global solution in the late search stage; in this case, a local optimum is usually obtained. Thus, a reasonable compromise must be made between exploration and exploitation in order to enhance the search capacity of CSA. For this purpose, a new mechanism is needed to reinforce the swarming behavior among capuchins, in which the best capuchins play a spirited role in leading the others and can help them avoid premature convergence once they are bounded into local optima with no capability of escaping this circumstance. To deal with the aforementioned issues in CSA, an update was made to the velocity model of the basic CSA that uses adaptive social and cognitive models, with the goal of improving the performance score of CSA. The following subsections provide detailed descriptions of the proposed variants of CSA, referred to as exponential CSA (ECSA), power CSA (PCSA), and S-shaped CSA (SCSA).

Exponential Model of CSA (ECSA)

During the iterative process of CSA, the characteristics of the population distribution vary not only with the iteration number but also with the iterative state. For example, at a premature stage, the capuchins may be dispersed in different areas of the search space, and thus the distribution of the population is scattered. In the proposed ECSA, two steps, adaptation of the inertia weight and control of the acceleration coefficients, were carried out for the velocity model as shown below:

Adaptation of the Inertia Weight

The inertia weight \(\rho\) in CSA is used to balance global and local search potentials. For optimal performance, the value of \(\rho\) is expected to be large in the case of exploration and small in the case of exploitation. However, it is not necessarily true that \(\rho\) in CSA should simply be reduced with time. Thus, an iterative factor f was proposed to be used in the inertia weight \(\rho\) so as to share some properties with \(\rho\). In this, the factor f is also relatively large during the exploration case and becomes relatively small in the convergence condition. Thus, it would be useful to enable \(\rho\) to adapt to the iterative state using a sigmoid mapping \(\rho (f)\): \([0, 1]\rightarrow [0.4, 0.9]\). Here, the inertia weight \(\rho\) shown in Eq. 12 was proposed to be utilized in the velocity model of ECSA.

$$\begin{aligned} \rho (f) = \frac{1}{1+1.5e^{-2.6f}} \in \left[ 0.4, 0.9\right] \;\;\;\;\;\;\;\;\;\;\;\; \forall f\in \left[ 0, 1\right] \end{aligned}$$
(12)

In this work, \(\rho\) is initialized to 0.9. Since \(\rho\) is not necessarily monotonic over time, but monotonic with the iterative factor f, \(\rho\) will thus adapt to the search environment characterized by f. In the jumping-out or exploration case, large f and \(\rho\) will benefit the global search, as noted earlier. On the contrary, when f is small, an exploitation or convergence case is identified, and \(\rho\) decreases to benefit the local search. In view of this, the movements of the capuchins in the proposed ECSA model are implemented by a new integrated velocity-updating model with adaptation of the inertia weight throughout the iterative process. To be more specific, this new inertia weight was proposed to help the capuchins move adaptively toward food.
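For illustration, a minimal sketch of this adaptive inertia weight follows, assuming the sigmoid form of Eq. 12 as reconstructed above; the "1 +" term in the denominator is inferred from the stated [0.4, 0.9] range.

```python
import numpy as np

def inertia_weight(f):
    """Adaptive inertia weight rho(f) of Eq. 12 for an iterative factor f in [0, 1]."""
    return 1.0 / (1.0 + 1.5 * np.exp(-2.6 * f))

# rho grows with f: inertia_weight(0.0) ~= 0.40 (convergence case),
# inertia_weight(1.0) ~= 0.90 (exploration case), matching the range in Eq. 12.
```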

Control of the Acceleration Coefficients

In addition to the inertia weight, the acceleration coefficients \(a_1\) and \(a_2\) are also important parameters in CSA that control the overall velocity of the capuchins. Accordingly, adaptive control can be devised for these coefficients on the basis of the following idea. Parameter \(a_{1}\) represents the “self-cognition” that attracts the capuchins to their own historical best positions, helping to explore local niches and to preserve the diversity of capuchins; it reveals how much confidence a capuchin has in itself. Parameter \(a_{2}\) represents the “social effect” that drives the capuchins to converge towards the current globally best area, which aids in rapid convergence; it divulges how much trust a capuchin has in its neighbors. Braik et al. [27] stated in their original work on CSA that their experiments showed that both the “cognitive-only” model and the “social-only” model are essential for the success of CSA, for which a constant value of 1.50 was used for each of the acceleration coefficients. However, it is anticipated that using problem-adapted values of \(a_{1}\) and \(a_{2}\) instead of a constant value of 1.50 could result in better performance. To this end, the exponential models introduced in [47] were used to define the cognitive and social models in order to originate an enhanced variant of CSA, referred to as exponential CSA (ECSA). The exponential growth function shown in Eq. 13 was proposed to represent the cognitive model of the capuchins in ECSA.

$$\begin{aligned} a_1 (t; \beta _0, \beta _1) = \beta _0 (1 - e^{-\beta _1 t}) \end{aligned}$$
(13)

where \(t=\frac{k}{K}\), the parameter \(\beta _0\) denotes the initial estimate of the cognitive parameter, and the parameter \(\beta _1\) represents the final estimate of the social parameter approximately achievable by the capuchins at the end of the ECSA's iterative process.

It is important to note that the exponential function \(a_2\) is derived from the exponential function \(a_1\) as defined in Eq. 14.

$$\begin{aligned} a_2(t; \beta _0, \beta _1) = \frac{\partial a_1(t; \beta _0, \beta _1)}{\partial t} \end{aligned}$$
(14)

According to Eqs. 13 and 14, the exponential function of the social model of the capuchins in ECSA is established in Eq. 15.

$$\begin{aligned} a_2 (t; \beta _0, \beta _1)= \beta _0 \beta _1 e^{-\beta _1 t} \end{aligned}$$
(15)

To estimate the parameters \(\beta _0\) and \(\beta _1\) of the cognitive and social parameter functions, several conventional and intelligent methods are mentioned in the literature [27]. One of the common traditional estimation methods is the least squares estimation method [48]. This method has many issues with estimation accuracy and needs a large number of measurements to give a good estimate of the parameters. Other methods include the use of meta-heuristics for estimating the parameters [27]; these methods may demand considerable computational effort.

The parameters \(\beta _0\) and \(\beta _1\) of the exponential models used for \(a_1\) and \(a_2\) were selected through experimental design by examining the proposed ECSA on feature selection problems. For all of the feature selection problems solved in this work, \(\beta _0\) and \(\beta _1\) are equal to 2.0 and 1.0, respectively. These values presented a high level of efficiency in solving feature selection problems, as shown in the results section. However, such values are obtained empirically and are, perhaps, not the “best” values; these parameters can therefore be adapted to other problems as demanded.

The values of \(a_1\) and \(a_2\) are updated exponentially at each iteration loop of ECSA. In Fig. 1, the curve of the cognitive parameter of ECSA shows that this parameter varies in an exponential manner. This has an impact on the conduct of the capuchins in ECSA, which can be shifted towards more exploration and exploitation. Due to that, the capuchins can finish their foraging by locating the food at the end of their wanderings, which may also help avoid local optimal solutions.
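A small sketch of the ECSA coefficient schedules of Eqs. 13 and 15 follows, with \(\beta _0 = 2.0\) and \(\beta _1 = 1.0\) as stated above; t is the normalized iteration index, assumed here to be k/K.

```python
import numpy as np

beta0, beta1 = 2.0, 1.0  # estimates used for all FS problems in this work

def a1_exponential(t):
    """Cognitive coefficient of ECSA, Eq. 13."""
    return beta0 * (1.0 - np.exp(-beta1 * t))

def a2_exponential(t):
    """Social coefficient of ECSA, Eq. 15 (derivative of Eq. 13 with respect to t)."""
    return beta0 * beta1 * np.exp(-beta1 * t)
```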

Fig. 1 Proposed exponential functions for \(a_1\) and \(a_2\) in the proposed ECSA: left, \(a_1\); right, \(a_2\)

As can be observed in Fig. 1, the ECSA is proposed with time-varying acceleration coefficients, where a larger \(a_{1}\) and a smaller \(a_{2}\) were initially set and gradually reversed during the search process. In this manner, it is expected that ECSA could divulge better overall performance than the basic CSA. This may be due to the time-varying \(a_{1}\) and \(a_{2}\), which can balance the global and local search capabilities; that is, the adaptation of \(a_{1}\) and \(a_{2}\) can be encouraging in improving the performance of the basic CSA. As the iterative process of ECSA continues, the capuchins clump together and converge into a locally or globally optimal region, and thus the population distribution information differs from that in the premature stage. In other words, the curve in Fig. 1 that represents the cognitive parameter evolves exponentially until the capuchins realize and find the position of the food sources. In this way, the exploration and exploitation stages of the basic CSA are ameliorated, and the capuchins can eventually find food without losing the other comrade capuchins in the group while foraging.

Bounds of the Acceleration Coefficients

As discussed earlier, the aforementioned adjustments of the inertia weight and acceleration coefficients should not be too abrupt. Thus, the maximum increase or decrease between two iterations is bounded by

$$\begin{aligned} \left| a_i(t+1) - a_i(t)\right| \le \delta , \;\;\;\;\;\;\;\; i = 1, 2 \end{aligned}$$
(16)

where \(\delta\) is called the “acceleration rate.”

Experiments revealed that a uniformly created random value for \(\delta\) in the range of [0.05, 0.1] performed better in most of the feature selection problems under study. Note that 0.05 was used for \(\delta\), as it is recommended to make only “slight” changes.
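A one-line sketch of the bound of Eq. 16 is given below, assuming the clamp is applied immediately after each coefficient update.

```python
import numpy as np

def clamp_coefficient(a_new, a_old, delta=0.05):
    """Limit the per-iteration change of an acceleration coefficient (Eq. 16)."""
    return a_old + np.clip(a_new - a_old, -delta, delta)
```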

Power Model-Based CSA (PCSA)

The inertia weight \(\rho\) used in PCSA is the same as that used in ECSA, given in Eq. 12. The difference between the PCSA and ECSA models lies in the functions used to control the acceleration coefficients of the velocity in these models. In PCSA, the cognitive and social coefficients are adapted using the power model described in [49], which is why this algorithm is referred to as power CSA (PCSA). This model is based upon the heterogeneous Poisson process model. The mathematical function formulated in Eq. 17 was utilized to carry out \(a_1\) in the velocity model in PCSA.

$$\begin{aligned} a_1 (t; \beta _0, \beta _1)= \beta _0 t^{\beta _1} \end{aligned}$$
(17)

where \(t=\frac{k}{K}\). The power model of \(a_2\) is derived from the power model of \(a_1\) as defined in Eq. 14. According to Eqs. 17 and 14, the power function of the social model of the capuchins in PCSA can be defined as shown in Eq. 18.

$$\begin{aligned} a_2 (t; \beta _0, \beta _1)= \beta _0\beta _1 t^{\beta _1 - 1} \end{aligned}$$
(18)

Equations 17 and 18 were employed to update \(a_1\) and \(a_2\) during the iterative process of PCSA. A pool of values in the range from 0 to 5 was tested for each of \(\beta _0\) and \(\beta _1\). For all of the feature selection problems tackled in this work, \(\beta _0\) and \(\beta _1\) are equal to 2.0 and 0.1, respectively. These parameters were selected by experimental testing on a large subset of test datasets of varied complexities, where these values reported the best accuracy of the proposed PCSA. However, these parameters can be adapted for other problems as needed. The values of \(a_1\) and \(a_2\) were updated in PCSA in non-linear form, as displayed in Fig. 2.

Fig. 2 Proposed power functions for \(a_1\) and \(a_2\) in the proposed PCSA: left, \(a_1\); right, \(a_2\)

It is evident from Fig. 2 that the tendencies of \(a_1\) and \(a_2\) in PCSA are decreasing and increasing non-linearly, respectively. This has an impact on the search conduct of PCSA, which can be moved towards more exploration or exploitation at a faster pace when compared to CSA, which utilizes constant values for the parameters \(a_1\) and \(a_2\).

It is clear from Eqs. 17 and 18 that \(a_1\) starts from an extreme value and swiftly lessens to a minimum value, while \(a_2\) starts from a small value and progressively augments towards its maximum value. In this context, the capuchins in PCSA can find a food source at the end of their foraging activity. In detail, \(a_1\) starts from a large value and declines little by little to a small value, denoting that the capuchins find a source of food. Conversely, \(a_2\) starts with a small value and gradually expands to a maximum value, indicating that the capuchins ultimately become aware of the location of the food source. This scenario of using power functions for \(a_1\) and \(a_2\) can ameliorate exploration and exploitation, as presented in the evaluation results. The maximum increase or decrease of the acceleration coefficients between two successive iterations in PCSA is bounded using Eq. 16.
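A corresponding sketch of the PCSA schedules of Eqs. 17 and 18 follows, with the stated \(\beta _0 = 2.0\) and \(\beta _1 = 0.1\); note that Eq. 18 requires t > 0 because of the negative exponent.

```python
def a1_power(t, beta0=2.0, beta1=0.1):
    """Cognitive coefficient of PCSA, Eq. 17."""
    return beta0 * t ** beta1

def a2_power(t, beta0=2.0, beta1=0.1):
    """Social coefficient of PCSA, Eq. 18; valid only for t > 0."""
    return beta0 * beta1 * t ** (beta1 - 1.0)
```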

Delayed S-Shaped Model-Based CSA (SCSA)

The inertia weight \(\rho\) used in SCSA is the same as that used in ECSA and PCSA, defined in Eq. 12. The difference between the formerly proposed algorithms of CSA and SCSA is the model used to control the acceleration coefficients of the velocity model of SCSA. In this case, the acceleration coefficients of SCSA are adapted using the S-shaped model introduced in [50], which is why this algorithm is called S-shaped CSA (SCSA). The S-shaped mathematical growth model used to define the parameter \(a_1\) is given in Eq. 19.

$$\begin{aligned} a_1(t; \beta _0, \beta _1) = \beta _0 \left( 1- \left( 1+ \beta _1 t\right) e^{-\beta _1 t}\right) \end{aligned}$$
(19)

where \(t=\frac{k}{K}\).

Equation 14 was used to derive \(a_2\) from the \(a_1\) defined in Eq. 19, yielding \(a_2\) as presented in Eq. 20.

$$\begin{aligned} a_2 (t; \beta _0, \beta _1) = \beta _0 \beta _1^2 t e^{-\beta _1 t} \end{aligned}$$
(20)

For all of the feature selection problems addressed in this work, \(\beta _0\) and \(\beta _1\) are equal to 2.0 and 1.0, respectively. These parameters were determined by empirical testing on a large number of feature selection problems, where they were changed several times until a trustworthy solution was acquired by the proposed SCSA. The growth functions representing the parameters \(a_1\) and \(a_2\) are updated in a non-linear shape, as displayed in Fig. 3.
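Analogously, the delayed S-shaped schedules of Eqs. 19 and 20 can be sketched as below, with \(\beta _0 = 2.0\) and \(\beta _1 = 1.0\) and t again assumed to be the normalized iteration index.

```python
import numpy as np

def a1_s_shaped(t, beta0=2.0, beta1=1.0):
    """Cognitive coefficient of SCSA, Eq. 19."""
    return beta0 * (1.0 - (1.0 + beta1 * t) * np.exp(-beta1 * t))

def a2_s_shaped(t, beta0=2.0, beta1=1.0):
    """Social coefficient of SCSA, Eq. 20 (derivative of Eq. 19 with respect to t)."""
    return beta0 * beta1 ** 2 * t * np.exp(-beta1 * t)
```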

Fig. 3 Proposed S-shaped functions for \(a_1\) and \(a_2\) in the proposed SCSA: left, \(a_1\); right, \(a_2\)

In Fig. 3, the curve of \(a_1\) in SCSA shows that the trend of this parameter changes gradually in a non-linear manner. Briefly, the parameters \(\beta _0\) and \(\beta _1\) in Eqs. 13, 15, 17, 18, 19, and 20 can be fine-tuned for other problems as demanded. The three new algorithms of the basic CSA were proposed to lay out a suitable setting for \(a_1\) and \(a_2\) in order to enhance the exploration and exploitation features of CSA. These proposed algorithms are anticipated to achieve efficient convergence and ameliorate the performance of the parent CSA in solving the feature selection problems under study. Besides, the proposed algorithms of CSA could deliver outstanding potential to reliably evade stagnation in local optima regions and help find the global optima.

Briefly, in the original mechanism of updating the capuchins' velocity presented in Eq. 3, the values of \(a_1\) and \(a_2\) are constant during the iterative process of CSA. This points out that the exploration and exploitation processes in CSA are based on a mathematical model that relies on stationary parameters, which strongly affects the global and local search abilities and can only yield sensible exploration and exploitation without a strict structure. In the proposed ECSA, PCSA, and SCSA, the parameters \(a_1\) and \(a_2\) were applied as interactive operators to foster the exploration and exploitation capabilities of these proposed versions of CSA. With a set of different values for \(a_1\) and \(a_2\), these proposed algorithms can switch between global and local searches to promote the convergence performance of CSA and realize optimality.

Supportive Positioning Update Process

For further exploration and exploitation, the positions of leaders and followers in ECSA, PCSA, and SCSA are then updated as per the mechanism proposed in Eq. 21.

$$\begin{aligned} x^{i}_{k+1} = {\left\{ \begin{array}{ll} x^{i}_{k}+ v^{i}_{k} \;\;\;\;\; rand \ge 0.5 \\ F_j + \lambda \left[ lb_{j} + r \left( ub_{j}-lb_{j}\right) \right] \;\; rand<0.5 \end{array}\right. } \end{aligned}$$
(21)

where \(x^{i}_{k+1}\) and \(x^{i}_{k}\) denote the next and present positions of the ith capuchin at the next and current iterations, respectively, \(v^{i}_{k}\) stands for the present velocity of the ith capuchin at iteration k, r and rand are uniformly random values generated in the interval from 0 to 1, and \(\lambda\) is defined as a function of iterations as drafted in Eq. 22.

$$\begin{aligned} \lambda = \mu _0 e^{-(\mu _1 k/K)^{\mu _2}} \end{aligned}$$
(22)

The parameters \(\mu _0\), \(\mu _1\), and \(\mu _2\) are constant values used to automatically update the parameter \(\lambda\) at each iteration. These parameters are useful for strengthening exploration and exploitation conducts. For all of the test problems subsequently addressed in this work, \(\mu _0\), \(\mu _1\), and \(\mu _2\) are set to 2, 4, and 2, respectively. These constants were captured by pilot testing on a bunch of test problems. However, they can be refined to suit the requirements of other problems.

The parameter \(\lambda\) is defined as a function of the iteration index k to control the random movement of capuchins iteratively, and it thus decreases with the number of iterations. Specifically, this parameter was proposed for dynamic system optimization to secure convergence by diminishing the search speed as well as to enhance the exploration and exploitation of the proposed algorithms. This parameter enables capuchins to scout more of the search space and to exploit each region while looking for food or other capuchins' food, so as to arrive at an efficient convergence process, which can further improve the performance of ECSA, PCSA, and SCSA in solving feature selection problems. In this light, it is anticipated that the combination of the parameter \(\lambda\) with ECSA, PCSA, and SCSA will intensify the exploration capacity of these versions and bring them closer to the optimal solution.

The first case of Eq. 21 (i.e., when \(rand \ge 0.5\)) was suggested to allow capuchins to advance toward food sources. The second case of Eq. 21 (i.e., when \(rand< 0.5\)) was suggested to empower capuchins to scout several random positions in the search domain, to improve local and global search capabilities, and to strike a sufficient balance between exploration and exploitation. This gives capuchins in ECSA, PCSA, and SCSA a great deal of ability to explore every potential position in the search space. In sum, Eq. 21 was combined with the mathematical models of ECSA, PCSA, and SCSA during their search iterations to address a number of FS problems from various domains, where the optimum feature subset characterizing the dataset is chosen. This is performed to achieve better classification efficiency with these versions than can be achieved with the basic CSA. In addition, the length of the feature subset is anticipated to be smaller with these versions than that realizable by the basic binary CSA.
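A minimal sketch of this supportive update for one capuchin is shown below, combining Eqs. 21 and 22; F, lb, and ub denote the food position and the search bounds, and the 0.5 coin-flip threshold follows Eq. 21.

```python
import numpy as np

def supportive_update(x, v, F, lb, ub, k, K, mu0=2.0, mu1=4.0, mu2=2.0):
    """Position update of Eq. 21 driven by the iteration-dependent lambda of Eq. 22."""
    lam = mu0 * np.exp(-((mu1 * k / K) ** mu2))        # Eq. 22
    if np.random.rand() >= 0.5:
        return x + v                                   # first case: advance toward food sources
    r = np.random.rand(*np.shape(x))
    return F + lam * (lb + r * (ub - lb))              # second case: scout a random position
```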

To summarize, the proposed exponential, power, and S-shaped functions for the cognitive and social models of ECSA, PCSA, and SCSA were utilized to improve the mobility of capuchins. In this work, the proposed binary ECSA, PCSA, and SCSA were applied to identify the most significant features from small, medium, and high-dimensional datasets in binary search spaces, on top of reducing the redundant and irrelevant features.

Complexity Analysis of ECSA, PCSA, and SCSA

The time complexity of an optimization algorithm can be expressed using a function that relates the algorithm's running time to the size of the optimization problem; for this, Big-O notation can be used. The time complexity analysis of such optimization methods basically relies on the following steps: problem definition, initialization procedure, population update, fitness evaluation, and selection method. The computational time of the fitness assessment is highly dependent on the particular optimization problem. Accordingly, the general computational complexity of each of ECSA, PCSA, and SCSA is the same and can be computed as follows:

$$\begin{aligned} \begin{aligned} \mathcal {O} \left( ECSA\right) =&\mathcal {O} \left( problem \; def.\right) + \mathcal {O} \left( init.\right) \\&+ \mathcal {O}\left( K \left( population \; update\right) \right) \\&+ \mathcal {O}\left( K \left( fitness \; eval.\right) \right) \\&+ \mathcal {O}\left( K \left( selection \; procedure\right) \right) \end{aligned} \end{aligned}$$
(23)

As Eq. 23 implies, the time complexity of ECSA, PCSA, and SCSA mainly depends on the total number of iterations (K), the population size (n), the dimension of the problem under study (d), and the cost of the fitness function (c). Also, the S-shaped transfer function used to obtain binary versions of these proposed algorithms is applied when updating the solutions. In specific form, the general computational complexity of ECSA, PCSA, and SCSA can be formulated in the worst case as follows:

$$\begin{aligned} \begin{aligned} \mathcal {O} \left( ECSA\right)&= \mathcal {O}\left( 1\right) + \mathcal {O}\left( nd\right) + \mathcal {O}\left( VKnd\right) \\&+ \mathcal {O}\left( VKnc\right) + \mathcal {O}\left( VKn^2d\right) \end{aligned} \end{aligned}$$
(24)

where V stands for the number of assessment experiments.

The number of iterations (K) is often greater than the population size (n), the problem dimension (d), and the cost of the fitness function (c). In this regard, the main parameters K and n are essential in evaluating the complexity of optimization algorithms. Also, as \(nd \ll VKnd\) and \(nd \ll VKcn\), the components 1 and nd can be ruled out from the time complexity given in Eq. 24. Consequently, the general time complexity of ECSA, PCSA, and SCSA can be viewed as follows:

$$\begin{aligned} \mathcal {O} (ECSA) \cong \mathcal {O}\left( VKnd + VKnc + VKn^2d\right) \end{aligned}$$
(25)

Proposed Algorithms for Feature Selection

The proposed ECSA, PCSA, and SCSA algorithms aspire to extend the exploration and exploitation characteristics of CSA and to enhance CSA so that it can deal with the complex search spaces of challenging feature selection problems. The proposed binary FS algorithms aim to find the optimal subset of attributes for classification tasks by locating the most relevant attributes and abolishing the unimportant ones. In this work, a wrapper-based FS approach was designed using four different binary algorithms to address a pool of benchmark feature selection datasets that vary in complexity, number of attributes, and characteristics. However, CSA was originally introduced to handle continuous optimization problems, whereby each search agent in CSA updates its position based on its current position, the position of the best individual found so far, and the positions of all other remaining search agents [27]. Thus, the proposed algorithms of CSA can only deal with continuous search spaces. As these algorithms are tailored to solve FS problems, whose search spaces are naturally delineated by binary values, there is a need for binary versions of CSA, ECSA, PCSA, and SCSA to evolve a search strategy for FS problems with binary search spaces. In light of this, some operators of these methods needed to be altered to create binary versions of them, which provide output in binary form as either “0” or “1.” This allows the search agents of these algorithms (i.e., capuchins) to have binary solutions while solving FS problems. Accordingly, the capuchins can change their positions in the search space during the iterative procedural loops, where the movements of the capuchins are associated with values of “1” and “0.” Considering this, an S-shape transfer function (TF) was used to transform the continuous CSA, ECSA, PCSA, and SCSA into binary ones, referred to as BCSA, BECSA, BPCSA, and BSCSA, as described in the “S-Shape Transfer Function” section, with the objective function of these algorithms given in the “Fitness Function” section.

S-Shape Transfer Function

As per the assertions made in [51], one of the most practical ways of transforming an optimization method that deals with continuous problems into one that can address binary problems is to create a binary version of the proposed optimization method. To do this without modifying the basis of the algorithm, a transfer function is the most widely used norm in this area. This strategy establishes the probability that an element of the feature subset, \(x^i\), will be constrained to a binary form that can be either “0” or “1.” An element value of “0” indicates that the corresponding feature is not chosen, while a value of “1” indicates that the feature is chosen. In this study, an S-shape TF was utilized to conduct the transformation task. The S-shape TF that was applied to transform the positions of the capuchins (i.e., search agents) in the proposed algorithms of CSA from continuous to binary is presented in Eq. 26:

$$\begin{aligned} S\left( x_{m, j}^{k}\right) = \frac{1}{1+e^{- x_{m, j}^{k}}} \end{aligned}$$
(26)

where S refers to the transfer function, \(S\left( x_{m, j}^{k}\right)\) represents the probability value of the S-shape function, \(x_{m, j}^{k}\) stands for the jth element of the mth solution x (i.e., search agent), and k denotes the current iteration.

The values of the elements in the solution x are converted to either “1” or “0” using Eq. 27.

$$\begin{aligned} x_{m, j}^{k} ={\left\{ \begin{array}{ll} 1 &{} \qquad rand < S\left( x_{m, j}^{k}\right) ,\\ 0 &{} \qquad rand \ge S\left( x_{m, j}^{k}\right) \end{array}\right. } \end{aligned}$$
(27)

where rand stands for a function that generates a uniform random number in the range [0, 1].
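A compact sketch of this binarization step, combining the S-shape transfer of Eq. 26 with the stochastic thresholding of Eq. 27, is shown below.

```python
import numpy as np

def binarize(x):
    """Map a continuous capuchin position to a 0/1 feature mask (Eqs. 26 and 27)."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(x)))           # selection probability per feature, Eq. 26
    return (np.random.rand(*s.shape) < s).astype(int)  # 1 = feature chosen, 0 = not chosen, Eq. 27
```

Applying `binarize` to each capuchin position yields a 0/1 mask whose nonzero entries index the selected features of the dataset.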

Fitness Function

A wrapper-based FS method needs a learning algorithm to be involved in the evaluation of the selected subset of features. Here, the k-nearest neighbor (k-NN) classifier [52] was used to gauge the classification accuracy of the obtained solutions. In addition, each optimization problem must be solved using a fitness function, which is essential to any wrapper-based FS approach. When drafting a FS method, two key considerations must be made: how to represent the solution and how to evaluate it. In this study, a binary vector with a length equal to the number of features in the dataset is used to encode a feature subset. As a multi-objective optimization problem, feature selection imposes two conflicting objectives: (1) minimize the number of features that are chosen, and (2) maximize the classification accuracy of the k-NN classifier. It is broadly known that as the number of features selected in a solution diminishes, the quality of the solution improves, whereby the solution with the lowest number of features associated with the largest classification value is the best solution to be achieved. These two opposing goals should be combined into one objective function in the proposed FS algorithms. Accordingly, these goals were implemented as one objective function for each solution in the iterative process of the proposed BCSA, BECSA, BPCSA, and BSCSA, computed using the k-NN classifier as presented below:

$$\begin{aligned} fitness=\alpha \zeta _{k}+\beta \frac{|R|}{|N|} \end{aligned}$$
(28)

where \(\zeta _{k}\) is the classification error rate obtained by k-NN, |R| and |N| are the numbers of selected and original features, respectively, and \(\alpha\) and \(\beta\) are the weights of the classification quality and the selection ratio of the chosen features, respectively; they are two complementary parameters in the interval [0, 1], with \(\beta = 1 - \alpha\).
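As an illustration of how Eq. 28 is evaluated inside a wrapper, the sketch below uses scikit-learn's k-NN; the weight \(\alpha = 0.99\) and the neighborhood size k = 5 are illustrative assumptions, not the settings reported in this work:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X_train, y_train, X_test, y_test, alpha=0.99):
    """Eq. 28: alpha * error_rate + (1 - alpha) * |R|/|N|.
    alpha = 0.99 and k = 5 are illustrative choices."""
    selected = np.asarray(mask, dtype=bool)
    if not selected.any():                  # guard: empty subsets are worst-case
        return 1.0
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train[:, selected], y_train)
    error_rate = 1.0 - knn.score(X_test[:, selected], y_test)
    return alpha * error_rate + (1.0 - alpha) * selected.sum() / selected.size
```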


The proposed BECSA, BPCSA, and BSCSA as FS methods are evaluated using the fitness function shown in Eq. 28, which takes into account the trade-off between the classification accuracy rate of the learning classifier (i.e., maximization) and the selected features in each solution vector (i.e., minimization). The schematic diagram of the proposed work is presented in Fig. 4.

Fig. 4 Schematic diagram of the proposed algorithms of CSA for feature selection

Experimental Results and Discussions

The effectiveness and robustness of the proposed algorithms of CSA and the basic CSA in solving FS problems are studied and analyzed in this section. First, a brief description of the datasets used for the evaluation tasks is given in the “Dataset Description” section. Then, the parameter settings of the proposed algorithms and the characteristics of the system used to run the algorithms are provided in the “Parameter Settings” section, while the evaluation metrics are formulated in the “Performance Measures” section. The results of the evaluation and analysis of the proposed algorithms are summarized in the “Evaluation of the Proposed Methods” section. Finally, the outcomes of the fundamental CSA and the best developed algorithm are compared with those of other algorithms in the “Performance Comparison with Other Methods” section.

Dataset Description

The performance of the proposed algorithms of CSA is evaluated using 24 benchmark datasets with different numbers of features and varying complexity. These datasets were extracted from patients’ medical diagnoses. It should be noted that these datasets were collected from the UCI repository, the KEEL repository, Kaggle, and another well-known medical website. Table 1 presents a brief description of the first 23 datasets used in the current study.

Table 1 A brief description of the first 23 datasets

Table 1 presents the main characteristics of each dataset: the number of features, the number of samples, the number of classes, and the source of each dataset. Across these datasets, the number of samples ranges from 62 to 1151, the number of features ranges from 9 to 5966, and the number of classes ranges from 2 to 6.

Table 2 shows a description of the real-world coronavirus disease (i.e., SARS-CoV-2, broadly known as COVID-19) dataset, which was gathered from [16] and comprises 15 features.

Table 2 A characterization of the COVID-19 dataset utilized

In order to prevent overfitting when solving the feature selection problems under study with the proposed feature selection models, and to ensure the stability of the outcomes, the data is randomly split into training and testing datasets. In this, the total number of instances in the datasets shown in Tables 1 and 2 was split into two partitions, where 80% of the instances were used for training and the remaining 20% were used for testing, as recommended in many related works in the literature [5, 37]. For parameter tuning, the number of iterations of the training model was set to 100 in conjunction with early stopping criteria, which also helps prevent overfitting. Parameter optimization of the proposed methods can be found in the “Sensitivity Analysis” section. Many other methods can also reduce model overfitting, such as 10-fold cross-validation [53], early stopping criteria [54], and image augmentation, which increases the number of training instances by altering the existing ones [55].
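A minimal sketch of this 80/20 hold-out split, using scikit-learn on placeholder data (the synthetic arrays and the stratification choice are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for one of the benchmark datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))       # 200 samples, 30 features
y = rng.integers(0, 2, size=200)     # binary class labels

# 80% of the instances for training, the remaining 20% for testing;
# a fixed random_state keeps the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
```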

Parameter Settings

The parameter settings of the proposed algorithms are recorded in Table 3. It should be pointed out that each algorithm was repeated 30 independent times to give an idea of its stability. Thereafter, the results of the proposed algorithms were collected and summarized in terms of average and standard deviation values. Furthermore, all experiments were conducted on a personal computer with an Intel(R) Core(TM) i7-7700HQ CPU at 2.80 GHz and 16.0 GB of RAM. The proposed algorithms and the other comparative algorithms were implemented in Matlab 2022a.

Table 3 Parameter settings of the developed FS methods

Performance Measures

This study verifies the effectiveness and accuracy of the proposed algorithms against other comparative ones. Several performance measures are utilized for these purposes, which can be defined as follows [4, 5]:

  1. Classification accuracy: This measure gives the ratio of correctly classified cases in the test dataset, as shown in Eq. 29.

    $$\begin{aligned} \text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
    (29)

    where true positives (TP) are the correctly identified classes, true negatives (TN) are the correctly rejected classes, false positives (FP) are the incorrectly identified classes, and false negatives (FN) are the incorrectly rejected classes.

  2. Sensitivity: This metric calculates the proportion of all true positive cases in the test dataset, as given by Eq. 30.

    $$\begin{aligned} \text {Sensitivity} = \frac{TP}{TP + FN} \end{aligned}$$
    (30)

  3. Specificity: This metric computes the proportion of all true negative cases in the test dataset, as given by Eq. 31.

    $$\begin{aligned} \text {Specificity} = \frac{TN}{FP + TN} \end{aligned}$$
    (31)

  4. Fitness value: This metric determines the quality of the obtained solution, as given by the formula defined in Eq. 28.

  5. Number of selected features: This criterion gives the number of features in the obtained solution.

For comparison purposes, the average results and standard deviations of these metrics were computed, where each algorithm was executed 30 independent times, as aforementioned.
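For reference, the three classification measures above reduce to simple ratios over the confusion-matrix counts; the short sketch below is an illustration, not the evaluation harness used in the experiments:

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification task."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # Eq. 29
sensitivity = tp / (tp + fn)                    # Eq. 30
specificity = tn / (fp + tn)                    # Eq. 31
```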

Evaluation of the Proposed Methods

The developed algorithms of CSA, together with the original CSA, are evaluated for feature selection in this section. In this, the performance levels of the proposed algorithms (binary exponential CSA (BECSA), binary power CSA (BPCSA), and binary S-shaped CSA (BSCSA)) are compared with that of the binary version of the parent CSA (BCSA). This is necessary to identify which version of CSA delivers the best level of performance. Tables 4, 5, and 6 display the results of the different variants of CSA based on the following evaluation measures: (1) classification accuracy rates, (2) fitness values, (3) sensitivity, (4) specificity, and (5) number of selected features. The average (AV) and standard deviation (SD) values of the results of each algorithm, obtained over 30 independent runs, for each dataset and each evaluation measure are recorded in Tables 4, 5, and 6. On top of that, the average ranking of each proposed algorithm for each evaluation measure was obtained using Friedman’s test and is provided in the last line of these tables. The best results are highlighted in bold to give them more weight over the other results.

Table 4 A comparison of the four binary variants of the CSA algorithm based on average classification accuracy and fitness values

Initially, the average results and standard deviations of the different variants of CSA in terms of classification accuracy and fitness values are summarized in Table 4. Higher average accuracy results mean better algorithm performance. It can be clearly seen that the four variants of CSA performed similarly by obtaining the same accuracy results in 8 datasets (i.e., Breast, Prognostic, Coimbra, Saheart, PimaDiabetes, Leukemia, Colon, and ProstateGE). In addition, BECSA, BPCSA, and BSCSA performed better than the basic BCSA in the ILPD and COVID-19 datasets. Besides, BECSA performed better than or comparably to the other versions by obtaining the best accuracy results in 18 datasets. BSCSA obtained the best results in 14 datasets, while BCSA achieved the best accuracy results in 12 datasets. Surprisingly, the performance of BPCSA was better than or similar to that of the other proposed versions in 11 datasets. On the other hand, the standard deviation (SD) results reflect the robustness of the algorithm, with the lowest SD results being the best. Reading the results in Table 4 once more, it can be seen that the proposed BECSA was more robust than the other variants by attaining the minimum SD values in 17 out of the 24 datasets considered in this study. Last but not least, it can be observed that the proposed BECSA ranked first by having the minimum average ranking using Friedman’s test, while BSCSA and BPCSA ranked second and third, respectively. BCSA ranked last with the largest average ranking. This proves the efficiency of the proposed changes to the main structure of the original CSA while favoring the proposed BECSA.

Figure 5 shows the average accuracy results of the four variants of CSA on all datasets using a radar chart. A shape with a larger area indicates that the algorithm obtains higher accuracy results and thus better performance. Obviously, the proposed BECSA, BPCSA, and BSCSA have almost the same shape; however, there are slight differences among the BECSA, BPCSA, BSCSA, and BCSA algorithms.

Fig. 5 Radar graph for BCSA, BECSA, BPCSA, and BSCSA based on the average classification accuracy results for all datasets

The average fitness values and standard deviations of the four variants of CSA are also presented in Table 4. A lower fitness value indicates better performance. From Table 4, it can be seen that the four variants obtained the same best results in the Breast, Saheart, and PimaDiabetes datasets. In addition, BECSA, BPCSA, and BSCSA obtained the same best fitness figures, better than those of BCSA, in the Coimbra and ILPD datasets. Besides, BECSA performed better than or similarly to the other variants by having the best results in 14 datasets. BCSA came second with the best results in 10 datasets. BSCSA and BPCSA came third and fourth with the best results in 8 and 6 datasets, respectively. In the same vein, it can be observed that BECSA performed more robustly than its competitors by having the lowest SD values in 18 out of 24 datasets. Reading the results given in Table 4 one more time, one can see that the proposed BECSA ranked first by obtaining the minimum average ranking using Friedman’s test. As per this, BSCSA, BCSA, and BPCSA ranked second, third, and fourth, respectively.

Figure 6 shows the convergence behavior of the proposed BCSA, BECSA, BPCSA, and BSCSA according to the fitness results. In these plots, the x-axis reflects the number of iterations, while the y-axis represents the fitness values. The convergence behavior of the best solution obtained by each algorithm over 30 independent runs is illustrated in these plots. The preferred algorithm is the one that attains the minimum fitness results within few iterations and, at the same time, does not get stuck in local optima. From Fig. 6, it can be noticed that there are slight differences between the behaviors of the four proposed algorithms when navigating the search space of each problem. This is due to how well each algorithm balances its exploitation and exploration capabilities. Upon examining the plots in Fig. 6 again, it can be seen that the convergence behaviors of the four proposed algorithms are nearly identical in the first 50 iterations in the Coimbra, Breast, ILPD, PimaDiabetes, and Saheart datasets. The convergence behavior of the four proposed algorithms becomes identical after half of the iterations in BreastEW, Retinopathy, Dermatology, Lymphography, Cleveland, HeartEW, Thyroid, and Heart. The convergence behavior of the four algorithms is not identical in most iterations and becomes indistinguishable only in the last few iterations, as shown in the plots of the Spectfheart and COVID-19 datasets. However, the convergence behavior of the proposed BECSA is superior to the others in the Parkinsons, Prognostic, and Hepatitis datasets. Finally, the convergence behavior of BCSA is better than that of the other variants in the Leukemia, Colon, ProstateGE, ParkinsonC, and SPECT datasets. It should be noted, though, that the proposed BECSA outperforms BCSA in terms of average fitness values in these datasets, as displayed in Table 4.

Fig. 6 Convergence characteristic curves of BCSA, BECSA, BPCSA, and BSCSA for all studied datasets

Table 5 A comparison of the four binary variants of the CSA algorithm in terms of sensitivity and specificity values

The average sensitivity and specificity results, as well as the standard deviations, of the four proposed binary variants of CSA are given in Table 5. Higher sensitivity and specificity indicate better performance. According to the sensitivity results, it can be clearly seen that BCSA got the best sensitivity results in 9 datasets, while BECSA acquired the best sensitivity results in 8 datasets. BPCSA achieved the best sensitivity results in 4 datasets, while BSCSA ranked last by having the best scores in 3 datasets. Therefore, both BCSA and BECSA ranked first by sharing the same minimum average ranking using Friedman’s test, while the remaining two variants occupied the last two rankings. When reading the sensitivity results again, it can be noticed that the performance of BCSA, BECSA, and BPCSA is almost the same as per the SD results, while the performance of these three algorithms was better than that of BSCSA.

The specificity outcomes of the developed binary versions of CSA are recorded in Table 5. Apparently, BECSA performed better than or similar to the other variants by having the best specificity scores in 13 out of 24 datasets. BCSA and BSCSA came in second place with each having the best specificity results in 4 datasets. BPCSA finally came out with the best results in 3 datasets. In the same vein, BECSA ranked first by obtaining the minimum average ranking using Friedman’s test, while BSCSA, BPCSA, and BCSA ranked second, third, and fourth, respectively.

Table 6 A comparison of the four binary variants of the CSA algorithm in terms of the number of selected features

Finally, the average number of selected features and the standard deviations of the four variants of CSA are presented in Table 6. It is worth noting that a lower average number of selected features indicates a better performance score. From Table 6, it can be seen that BCSA obtained the lowest average number of selected features in 15 out of 24 datasets. At the same time, each of the remaining variants got the lowest average number of selected features in 7 datasets. Additionally, BCSA placed first by achieving the minimum average ranking using Friedman’s test, while BECSA ranked second. BPCSA placed third using Friedman’s test, and finally, BSCSA ranked last.

Discussion of the Results

As evidenced by a second reading of the findings in Table 4, BECSA exclusively outperformed its competitors in terms of accuracy in 7 out of the 24 problems and had the highest accuracy rate in a total of 18 problems. Similarly, BPCSA, BSCSA, and BCSA were exclusively best in 0, 2, and 3 problems, respectively, and had the greatest accuracies in 11, 14, and 12 problems, respectively. It is evident that BECSA, BPCSA, BSCSA, and BCSA achieved the optimal accuracy of \(100\%\) in one dataset, namely, Leukemia. In addition, BECSA achieved accuracy rates close to \(100\%\) in other datasets, including Diagnostic, Breast, BreastEW, Dermatology, and Thyroid. In these datasets, BECSA performed quite well, consistently locating a solution close to the near-optimal solution over 30 independent runs. In the Dermatology dataset, BPCSA achieved an accuracy value of \(99.30\%\), whereas BCSA, BECSA, and BSCSA attained accuracy values of \(99.20\%\), \(99.67\%\), and \(99.58\%\), respectively. For the Diagnostic dataset, BPCSA and BSCSA obtained an accuracy of \(96.49\%\), while BECSA obtained an accuracy of \(96.55\%\). BSCSA exclusively obtained an accuracy rate of \(95.29\%\) in the Lymphography dataset, and BCSA exclusively arrived at an accuracy rate of \(88.05\%\) in the SPECT dataset. BCSA has the greatest accuracy in Spectfheart, coming in at \(91.70\%\), while BECSA, BPCSA, and BSCSA have accuracy values of \(91.20\%\), \(91.45\%\), and \(91.32\%\), respectively. In the Thyroid dataset, BCSA has the greatest accuracy, coming in at \(99.11\%\), while BECSA, BPCSA, and BSCSA have accuracy rates of \(99.08\%\), \(99.03\%\), and \(99.06\%\), respectively. For COVID-19, the proposed BECSA, BSCSA, and BPCSA achieved a classification rate of \(95.93\%\), whereas BCSA realized a classification rate of \(95.89\%\). Furthermore, BCSA has an SD of 0.0015, while the others have an SD value of 0.0000. Retinopathy, ILPD, Cleveland, Saheart, and PimaDiabetes were among the datasets for which it was challenging to obtain accuracy levels close to \(80\%\). This can be attributed to the difficult nature of these problems. Having said that, as we shall show in a moment, the proposed BECSA algorithm attained accuracy levels that were on par with or better than those reported in the literature. In terms of best outcomes, BECSA is able to obtain good solutions for all datasets under consideration, with the exception of the Cleveland dataset, where the performance was subpar at \(60.96\%\). This suggests that BECSA became stuck in a local optimum solution which, although not the best solution to this problem, is nonetheless not far from the best solution. In the Breast, Prognostic, Coimbra, ILPD, Saheart, Leukemia, PimaDiabetes, ProstateGE, COVID-19, and Colon datasets, the proposed BECSA, BSCSA, and BPCSA have the lowest standard deviations, with a value of 0.0000. These findings support the effectiveness of BECSA, BSCSA, and BPCSA in achieving appropriate exploration and exploitation, in addition to a proper balance between these two features.

The developed binary methods of CSA for FS problems have the goal of lowering the fitness scores produced by the criterion defined in Eq. 28. This is shown in the average fitness values in Table 4, which also includes the averages and standard deviations of the standard and developed algorithms of CSA over 30 independent runs for each of the 24 FS problems. Overall, Table 4 shows that BECSA is capable of finding the global optimal solution consistently and exclusively in 9 datasets and of achieving the best fitness outcomes in a total of 13 datasets. Evidently, it came first by getting these outcomes. BCSA came second with the exclusively lowest average fitness values in seven problems and the lowest fitness scores in a total of ten datasets. Third place was kept for BSCSA, with the best exclusive fitness values in 2 datasets and the best fitness values in a total of 8 problems. BPCSA placed last, with no unique best fitness results in any dataset and the best fitness outcomes in a total of 6 datasets. For the Retinopathy, ILPD, Saheart, and PimaDiabetes datasets, although the best solutions were not consistently found, the outcomes obtained are not far from the global optimum solution, which can be substantiated by the very small standard deviations obtained. BECSA obtained a fitness score of 0.0212 for the Parkinsons dataset; this value is not too far from the value of 0.0189 obtained by both BPCSA and BSCSA. For the Hepatitis dataset, BECSA revealed the best fitness of 0.0534, while BCSA, BSCSA, and BPCSA showed comparable small fitness values of 0.0631, 0.0556, and 0.0577, respectively.

It is important to note that higher sensitivity values correspond to higher levels of performance. Table 5 reveals that BECSA ranked first by having the top exclusive sensitivity findings in 9 of the 24 datasets. The BCSA algorithm came second by providing the highest exclusive sensitivity findings in 7 datasets (Prognostic, Coimbra, BreastEW, Lymphography, Cleveland, Hepatitis, and Saheart) and had the highest sensitivity outcomes in a total of 8 problems, the eighth being the Dermatology dataset. BPCSA appeared in third place with the best exclusive sensitivity outcomes in 4 problems, namely, Spectfheart, PimaDiabetes, Leukemia, and ProstateGE, whereas BSCSA appeared in last place with the best exclusive sensitivity outcomes in only 3 datasets, namely, SPECT, Heart, and Colon. Additionally, a sensitivity value of \(99.09\%\) was reported by both BCSA and BECSA, compared to \(99.08\%\) and \(98.88\%\) reported by BPCSA and BSCSA, respectively, for the Dermatology dataset. BCSA obtained an exclusive sensitivity result of \(73.32\%\) for the Coimbra dataset, while BECSA, BSCSA, and BPCSA recorded sensitivity values of \(72.25\%\), \(66.91\%\), and \(66.95\%\), respectively. Returning to Table 5, one can note that there is a relatively large discrepancy between the findings produced by BCSA and those recorded by its developed counterparts.

Table 5 shows the specificity findings of the fundamental binary and proposed binary versions of CSA in relation to their average and standard deviation outcomes. It is obvious that BECSA came first, obtaining the best exclusive specificity outcomes in 13 datasets. This binary variant of CSA is successively followed by BCSA, BSCSA, and BPCSA, with these versions having the best exclusive specificity results in 4, 4, and 3 datasets, respectively. There is only a slight difference between the results of these versions for the Breast, Lymphography, Cleveland, Hepatitis, Thyroid, Leukemia, and COVID-19 datasets. For the Coimbra dataset, there is an extremely small difference between the specificity result obtained by BECSA and that obtained by BPCSA, with the former having a specificity value of \(82.02\%\) and the latter a specificity score of \(81.99\%\). For the Diagnostic dataset, the proposed BECSA and BPCSA algorithms reported specificity values of \(97.03\%\) and \(96.63\%\), respectively, while BCSA and BSCSA reported specificity values of \(96.25\%\) and \(96.49\%\), respectively. The SD results in Table 5 are either zero or almost zero, which indicates the robustness of the proposed FS methods.

The average number of features selected in the classification of the 24 datasets for each of the binary algorithms proposed for CSA is presented in Table 6. A comparison of the four proposed FS methods in Table 6 reveals a significant difference in the results. BCSA was the most beneficial FS method in reducing the features needed for classification, where, compared to the other algorithms, its average number of selected attributes was the fewest. BCSA has the exclusively fewest number of features in 13 out of 24 datasets, with the lowest average number of selected features overall in 15 datasets. BECSA has the exclusively minimal number of features in the BreastEW and Spectfheart datasets, shared the minimal number of chosen features for the Breast and PimaDiabetes datasets with BCSA, BPCSA, and BSCSA, and shared the minimum number of selected features for the Coimbra, ILPD, and Saheart datasets with BPCSA and BSCSA. The proposed BPCSA exclusively captured the minimum number of features in the Retinopathy and Lymphography datasets, and BSCSA exclusively captured the minimum number of features in the SPECT and Cleveland datasets.

Finally, it is preferable to determine the relative performance of one or all of the developed FS methods by comparing their performance levels with those of other FS methods. This is because comparing the proposed FS methods only against each other is not a fair basis for grading them. In this, it is clear from a review of the results in Tables 4, 5, and 6 that BECSA beat the competing algorithms in most evaluation criteria. Because of this, the outcomes of both the basic BCSA and the proposed BECSA are compared with those of widely used algorithms from the pertinent literature, as presented in a subsection below.

Limitations of the Proposed Methods

Although the proposed FS algorithms of CSA have shown outstanding performance levels in addressing low-, medium-, and high-dimensional FS problems in binary search spaces, they do not guarantee global optimality. Furthermore, for several datasets, including Retinopathy, ILPD, Cleveland, Saheart, and PimaDiabetes, the classification accuracy, sensitivity, and specificity rates were fairly modest, and the corresponding fitness values obtained by the proposed FS methods are not as small as required. Accordingly, the proposed FS algorithms have some limitations, such as falling into local optima or departing from the global optimal solution, which affects how well they perform when tackling complex FS problems with large numbers of features and dimensions. Besides, the number of features selected by the proposed FS methods in some datasets for the classification tasks is merely reasonable and is not better than that of the basic binary version of CSA. Despite having good performance in the majority of the test FS problems, the proposed FS methods may experience poor convergence rates and may become stuck in local optima in certain datasets, as in the case of the Cleveland dataset. The sensitivity and specificity results for some datasets, such as Retinopathy, Lymphography, SPECT, and Cleveland, are not as high as needed. In order to cope with these limitations, further work on improving the performance of the proposed FS methods may be considered. These issues might be avoided by enhancing the exploration and exploitation capabilities of the proposed algorithms within the binary search spaces of the relevant datasets.

Sensitivity Analysis

A thorough sensitivity analysis based on the Design of Experiments (DoE) technique was carried out in order to pinpoint the best parameter settings of the proposed FS methods, since the experimental results can be greatly affected by these settings. The proposed FS methods, which employ k-NN as a classifier, used DoE to examine the sensitivity of the key control parameters (\(\beta _0\) and \(\beta _1\)). The ranges of the key control parameters were first established, and the values of these parameters were defined to assess whether the best values fell within the range or whether more experiments were required. Each experiment then varied one parameter over the generated DoE values in the designated range, leaving the other parameters at their starting values. These experiments were performed in a systematic manner in order to study the influence of the input parameters on the accuracy level of the proposed binary algorithms and to arrive at a sensible setting. In this situation, a comprehensive analysis was conducted using the previously indicated parameters, and the average classification accuracy was obtained for each experiment, for each proposed FS method, over all datasets. The values of each parameter used in this experimental analysis are as follows: (1) the control parameters for BECSA are \(\beta _0\) = 0.5, 1.0, 1.5, 2.0 and \(\beta _1\) = 0.5, 1.0, 1.5, 2.0; (2) the control parameters for BPCSA are \(\beta _0\) = 0.5, 1.0, 1.5, 2.0 and \(\beta _1\) = 0.1, 0.3, 0.5, 1.0; (3) the control parameters for BSCSA are \(\beta _0\) = 0.5, 1.0, 1.5, 2.0 and \(\beta _1\) = 0.5, 1.0, 1.5, 2.0. The number of capuchins and the maximum number of iterations in this study were set to 100 and 30, respectively, over 30 separate runs.

Table 7 Classification accuracy of BECSA using different values of \(\beta _0\) and \(\beta _1\) with the k-NN classifier
Table 8 Classification accuracy of BPCSA using different values of \(\beta _0\) and \(\beta _1\) with the k-NN classifier
Table 9 Classification accuracy of BSCSA using different values of \(\beta _0\) and \(\beta _1\) with the k-NN classifier
  1. The control parameters \(\beta _0\) and \(\beta _1\) of BECSA: the proposed BECSA was tested for various values of the parameters \(\beta _0\) and \(\beta _1\). The average classification accuracy of BECSA with different values of \(\beta _0\) and \(\beta _1\) when employed to solve the feature selection problems under investigation, with the other parameters left intact, is shown in Table 7. The computational results in Table 7 show that BECSA presented the best classification rates when the parameters \(\beta _0\) and \(\beta _1\) of BECSA are equal to 2.0 and 1.0, respectively. This demonstrates the significance of employing a sensible range of these crucial parameters to augment the robustness of BECSA.

  2. The control parameters \(\beta _0\) and \(\beta _1\) of BPCSA: the proposed BPCSA was tested for a variety of values of \(\beta _0\) and \(\beta _1\). Table 8 displays the average classification accuracy rate of BPCSA when applied to the feature selection problems under study, while the other parameters were left unchanged. It is evident from Table 8 that the optimal classification accuracy for BPCSA is achieved when the values of the parameters \(\beta _0\) and \(\beta _1\) are equal to 2.0 and 0.1, respectively. This highlights the need to study the sensitivity of BPCSA to various values of these parameters.

  3. The control parameters \(\beta _0\) and \(\beta _1\) of BSCSA: the proposed BSCSA was tested for a variety of values of the parameters \(\beta _0\) and \(\beta _1\). The average classification accuracy of BSCSA when applied to the 24 feature selection problems under study, with varying values of \(\beta _0\) and \(\beta _1\) and all other parameter values left unaltered, is shown in Table 9. Table 9 clearly shows that the classification accuracy results change considerably with the values of \(\beta _0\) and \(\beta _1\), demonstrating how sensitive BSCSA is to these parameters. Additionally, it should be noted that the values of \(\beta _0\) and \(\beta _1\) were adjusted to 2.0 and 1.0, respectively, for BSCSA to produce the optimal classification accuracy.

The best-performing parameter values out of all those examined in Tables 7, 8, and 9 were chosen after the sensitivity analysis. As mentioned, when addressing feature selection problems, the classification accuracy of the proposed BECSA, BPCSA, and BSCSA can be affected by different values of the parameters \(\beta _0\) and \(\beta _1\). For datasets with different degrees of dimensionality, the standard deviation values of the classification accuracy of BECSA, BPCSA, and BSCSA are low, indicating that these proposed methods are fairly stable under changes to the relevant control parameters. In sum, the findings in Tables 7, 8, and 9 demonstrate that the classification accuracy of BECSA, BPCSA, and BSCSA is nevertheless quite sensitive to the settings of \(\beta _0\) and \(\beta _1\).
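The DoE sweep described above amounts to a grid search over the candidate values of \(\beta _0\) and \(\beta _1\). The sketch below illustrates the idea; evaluate_accuracy is a hypothetical stand-in for a full 30-run experiment, replaced here by a toy surrogate that peaks at the reported BECSA optimum of (2.0, 1.0) so the snippet runs end-to-end:

```python
import itertools

# Candidate values taken from the sensitivity analysis above (BECSA case).
beta0_grid = [0.5, 1.0, 1.5, 2.0]
beta1_grid = [0.5, 1.0, 1.5, 2.0]

def evaluate_accuracy(beta0, beta1):
    """Hypothetical stand-in: in the real study this would run the FS
    method 30 independent times with (beta0, beta1) and return the
    mean classification accuracy. A toy surrogate is used instead."""
    return -(beta0 - 2.0) ** 2 - (beta1 - 1.0) ** 2  # peaks at (2.0, 1.0)

best = max(itertools.product(beta0_grid, beta1_grid),
           key=lambda p: evaluate_accuracy(*p))
print("best (beta0, beta1):", best)                  # -> (2.0, 1.0)
```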

Performance Comparison with Other Methods

To study the performance of the proposed BECSA in depth, its findings on FS problems were compared with those of standard BCSA and other FS methods, namely, Binary Biogeography-Based Optimization (BBBO) [56], Binary Moth-Flame Optimization (BMFO) algorithm [57], Binary Teaching-Learning-Based Optimization (BTLBO) [58], Binary Success-History based Adaptive Differential Evolution with Linear population size reduction (BLSHADE) [59], Binary Particle Swarm Optimization (BPSO) [60], Binary Ali Baba and the Forty Thieves (BAFT) algorithm [1], and Binary Honey Badger Algorithm (BHBA) [61]. Not only were the details of the experimental results provided in this comparison, but also an in-depth and careful comparison with those comparative optimization methods was made. The same experimental conditions, including maximum number of iterations and population size, were applied to all of the examined FS methods in order to make an equitable comparison of the proposed BECSA with the comparative optimization methods. Table 10 contains the parameter settings for each of the competing algorithms.

Table 10 Parameter settings of the comparative binary algorithms

The experimental runs were performed 30 independent times in total to obtain statistically meaningful findings. The statistical results were then derived from the overall outcomes reached throughout these runs. The dimension of each problem equals the number of attributes in the associated dataset. Classification accuracy, sensitivity, specificity, fitness values, and the number of selected features were used to evaluate the performance of the proposed BECSA, and this performance was compared with that of the other methods in terms of all the above criteria. The best outcomes are highlighted in bold in all comparison tables to give them more weight than the other findings. In Table 11, the average classification accuracy results of the basic BCSA, the proposed BECSA, and the other competing methods are tabulated along with their respective standard deviation results.

Table 11 Comparison results between the proposed BECSA and other methods based on classification accuracy

Again, a higher accuracy value means better performance, while a lower SD value reflects the stability of the algorithm. It can be seen from Table 11 that the proposed BECSA ranked first by exclusively having the best accuracy results in 5 datasets and achieving the highest accuracy in a total of 13 datasets. BCSA ranked second by exclusively acquiring the best accuracy results in 3 datasets and realizing the highest accuracy in a total of 10 datasets. BHBA came third by exclusively achieving the best results in 5 datasets and attaining the highest accuracy in a total of 11 datasets. BBBO ranked fourth by exclusively having the best results in 2 datasets, namely, Cleveland and Hepatitis. BLSHADE came fifth by achieving the best results in 3 datasets, while BAFT came sixth by attaining the best result only in the ProstateGE dataset. BPSO, BMFO, and BTLBO did not achieve the best results in any of the datasets. However, both BCSA and BECSA obtained 100% accuracy in the Leukemia dataset. When reading the SD results recorded in Table 11, it can be noticed that the proposed BECSA was more robust than its rivals by having the minimum SD values in 13 out of the 24 datasets.

The average fitness value and standard deviation of all rivals in each dataset are presented in Table 12.

Table 12 Comparison results between the proposed BECSA and other methods based on fitness values

It should be noted that minimum fitness values reveal better performance for the optimization algorithms. From Table 12, it can be observed that BECSA, BCSA, and BHBA ranked first, second, and third, respectively, where each exclusively got the minimum fitness values in a total of 7 datasets. BTLBO ranked fourth by exclusively achieving the minimum fitness values in 3 datasets, namely, Parkinsons, Cleveland, and Hepatitis. Finally, BLSHADE, BBBO, BAFT, BMFO, and BPSO did not report the lowest fitness results in any of the datasets examined in this work, and they ranked fifth, sixth, seventh, eighth, and ninth, respectively. When reading the standard deviation results tabulated in Table 12 one more time, it can be seen that the performance of BECSA is more robust than that of the other competitors, as it acquired the lowest SD results in 11 out of 24 datasets.

The sensitivity results of the proposed BECSA compared to BCSA and the other comparative methods are given in Table 13.

Table 13 Comparison results between the proposed BECSA and other methods based on sensitivity results

We then move on to compare the proposed BECSA with the other competing algorithms with respect to the sensitivity results, which the FS algorithms intend to increase. It should be noted that higher sensitivity results mean a preferable performance level. The results of these competing algorithms are summarized in terms of average sensitivity results together with their standard deviation values in Table 13. Regarding the results presented in this table, without a doubt, BLSHADE has the largest sensitivity figures compared to all other rivals. Specifically, BLSHADE exclusively achieved the highest sensitivity values in 12 out of 24 datasets. Evidently, BCSA ranked second by obtaining the best sensitivity results in 3 out of 24 datasets. BHBA ranked third even though it did not exclusively achieve the best sensitivity result in any dataset, as its results are rather high overall, while the proposed BECSA ranked fourth although it has the best sensitivity results in 6 datasets. BBBO and BMFO came in fifth and sixth places with the best sensitivity results in only 2 and 1 datasets, respectively. Finally, BAFT, BPSO, and BTLBO did not report any distinct sensitivity result in any of the datasets and thus occupied the seventh, eighth, and ninth places among all competing algorithms. Regarding the standard deviation values of the proposed BECSA, they are small, indicating that the stability of BECSA is well established.

Similarly, the average and standard deviations of the specificity results for BECSA, BCSA, and all other comparative methods are summarized in Table 14.

Table 14 Comparison results between the proposed BECSA and other methods based on specificity results

Reading the specificity results listed in Table 14, one can notice that BCSA and BECSA ranked first and second, with each exclusively having the highest specificity values in a total of 4 datasets out of 24. BHBA placed third by getting the best specificity results in 3 datasets. BLSHADE, BTLBO, BAFT, and BBBO exclusively got the highest specificity values in 6, 3, 2, and 2 datasets, respectively. BMFO and BPSO did not achieve the highest specificity results in any of the datasets, but their obtained results are reasonable and better than those of other competing algorithms such as BTLBO and BAFT. In terms of standard deviation results, the proposed BECSA has very small SD values in most of the test datasets in comparison with the other rivals. These findings confirm that the superiority of BECSA is stable.

The number of selected features is also considered to study the performance of the proposed BECSA against BCSA and other methods available in the literature, as shown in Table 15.

Table 15 Comparison results between the proposed BECSA and other methods based on the average number of selected features

The number of features selected during the classification process is as important as the classification accuracy in assessing any feature selection algorithm. When comparing the proposed algorithm to the eight rival algorithms, Table 15 reveals a wide variety in the results. According to the results for the average number of features, the nine competing algorithms can be split as follows: BCSA outperformed the other rivals by exclusively acquiring the fewest selected features in 11 out of 24 datasets. BECSA ranked second with the exclusively minimum number of features in 4 datasets, and shared the minimum number of features for the Saheart dataset with BHBA, which exclusively attained the minimum number of features in the Coimbra, SPECT, and Hepatitis datasets. BMFO had the lowest number of features in the Breast, ILPD, Lymphography, and Heart datasets, while BPSO reduced the number of features the most only in the PimaDiabetes dataset. The BLSHADE, BAFT, BBBO, and BTLBO algorithms did not reach the minimum number of features in any dataset considered. It may be deduced from a second reading of the findings in Table 15 that BECSA performed more robustly than its rivals because it achieved the lowest SD results in 10 out of 24 datasets. This implies that, over 30 independent executions, BECSA was able to reach around the same number of selected features.

The results shown in Tables 11, 12, 13, 14, and 15 reveal the robustness of the proposed BECSA in comparison with other state-of-the-art feature selection algorithms available in the literature. By taking a closer look at these results and observing the margins between BECSA and the other competing algorithms, one can see that algorithms such as BTLBO, BBBO, and BPSO lag far behind BECSA. Moreover, the standard deviations of the proposed BECSA are tiny and smaller than those of the other competing algorithms, which confirms that the superiority of this proposed algorithm is solid. The key factor behind the reasonable degree of performance of BECSA is the sought-after balance between the exploration and exploitation features of this algorithm, on account of the proposed cognitive and social models for the velocities of the capuchins as well as the proposed mathematical model of this algorithm. This mathematical model of BECSA assisted the capuchins in exploring and exploiting each promising area in the search space, thus striking a sensible balance between exploitation and exploration. In this respect, if the capuchins become stuck in local optima, they have a chance to leave their local neighborhood.

Statistical Test

For further evaluation of the proposed methods, a non-parametric Friedman’s statistical test was used to highlight the algorithm that has superior results compared to other comparative algorithms. Table 16 shows the average ranking results of the statistical evaluation of BECSA, BCSA, and other comparative methods using Friedman’s test based on classification accuracy, fitness value, sensitivity, specificity, and number of selected features.

Table 16 Average rankings of all competitor algorithms using Friedman’s test

As can be inferred from Table 16, a lower ranking value reflects better performance. The p-values were calculated using Friedman’s test as shown in Table 16, where all p-values were below the significance level \(\alpha\) = 0.05. This leads to the rejection of the null hypothesis and the acceptance of the alternative hypothesis. The null hypothesis states that all the compared algorithms have the same performance behavior when used to solve an optimization problem, while the alternative hypothesis means that there is a difference between the performance behaviors of the algorithms. According to the statistical results presented in Table 16, BECSA is statistically significant and is the most effectual method among all others. In this, it can be seen that the proposed BECSA ranked first in classification accuracy, first in fitness value, and first in specificity, while it ranked fourth in sensitivity and second in number of selected features. Furthermore, BCSA ranked first in number of selected features, second in classification accuracy, second in fitness, second in sensitivity, and fourth in specificity. These outcomes denote the robustness of the proposed BECSA and BCSA among the other rivals. Finally, it is clear that BLSHADE obtained the first rank with respect to sensitivity, where it got a rank of 3.5416, which is the lowest of all the ranks that the other algorithms have.
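For reference, a Friedman test and average rankings of this kind can be reproduced with SciPy; the accuracy vectors below are made-up illustrations, not the values behind Table 16:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Toy per-dataset accuracies for three algorithms over six datasets
# (illustrative numbers only).
becsa = [0.96, 0.91, 0.99, 0.88, 0.95, 0.97]
bcsa  = [0.95, 0.90, 0.99, 0.86, 0.95, 0.96]
bpso  = [0.92, 0.88, 0.97, 0.84, 0.93, 0.94]

stat, p_value = friedmanchisquare(becsa, bcsa, bpso)
if p_value < 0.05:      # reject H0: all algorithms perform alike
    print(f"significant difference (p = {p_value:.4f})")

# Average Friedman ranks per algorithm (rank 1 = best accuracy per dataset).
ranks = rankdata(-np.array([becsa, bcsa, bpso]).T, axis=1)
print("average ranks:", ranks.mean(axis=0))
```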

Holm’s test was then utilized as a post-hoc approach to show the significant differences between the control algorithm and the other competitors. In view of this, the algorithm ranked first in each assessment measure using Friedman’s test is the control algorithm. The statistical results obtained by Holm’s procedure are presented in Table 17. In Table 17, \(R_0\) is the Friedman rank assigned to the control algorithm, \(R^i\) is the Friedman rank assigned to algorithm i, ES is the effect size of the control method on method i, and z represents the statistical difference between the two methods.

Table 17 Holm’s test results between the control algorithm and all other comparative methods

A comparison of BECSA with the other FS methods was conducted by applying Holm’s test, as presented in Table 17, where this test discards hypotheses with p-values \(\le 0.02500\), \(\le 0.01666\), \(\le 0.00833\), \(\le 0.00833\), and \(\le 0.01250\) in classification accuracy, fitness value, sensitivity, specificity, and number of selected features, respectively. Reading the results given in Table 17, one can conclude that there is a significant difference between BECSA and BBBO, BLSHADE, BMFO, BAFT, BPSO, and BTLBO in terms of classification accuracy, while there is no significant difference between BECSA and the remaining two algorithms (i.e., BCSA and BHBA). As per the statistical fitness results computed according to Friedman’s and Holm’s tests, there is no significant difference between BECSA and two other competing algorithms (i.e., BCSA and BHBA); however, there is a notable difference between BECSA and the remaining algorithms (i.e., BTLBO, BLSHADE, BBBO, BAFT, BMFO, and BPSO). According to the specificity results, there is a significant difference between BECSA and three other algorithms (i.e., BMFO, BTLBO, and BPSO), while there is no significant difference between BECSA and the other competing algorithms, including BLSHADE, BBBO, BCSA, BHBA, and BAFT. On the other hand, regarding the sensitivity results, there is no significant difference between BECSA and the control algorithm (i.e., BLSHADE). Similarly, in terms of the number of selected features, there is no significant difference between BECSA and the control algorithm (i.e., BCSA). As can be realized from the results in Tables 16 and 17, BECSA and BCSA are effective FS methods that obtain promising results for the datasets under study, and they are much better than the other competing methods.
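Holm’s step-down procedure itself is straightforward to sketch: sort the unadjusted p-values and compare the i-th smallest against \(\alpha /(k-i+1)\), stopping at the first non-rejection. The implementation below is an illustration with made-up p-values, not the statistical package used here:

```python
def holm_test(p_values, alpha=0.05):
    """Return a reject/accept decision for each hypothesis using
    Holm's step-down procedure at family-wise error rate alpha."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * k
    for step, i in enumerate(order):
        threshold = alpha / (k - step)   # alpha/k, alpha/(k-1), ..., alpha
        if p_values[i] <= threshold:
            reject[i] = True
        else:
            break                        # stop at the first non-rejection
    return reject

# Example: made-up p-values of eight methods compared against a control.
print(holm_test([0.001, 0.20, 0.003, 0.04, 0.008, 0.0005, 0.30, 0.01]))
```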

One important conclusion drawn from the statistical analysis results discussed above is that, on average, BECSA surpassed the other state-of-the-art FS techniques mentioned in the literature, including BLSHADE, BHBA, BPSO, and BMFO. This highlights the strong performance of BECSA and shows that this algorithm can effectively explore the search space whether a single optimum or many optima are present, and whether the feature selection problems are low, medium, or high dimensional. Additionally, the average ranking of the algorithms in terms of the sensitivity results reveals that the performance score of BECSA is not far behind those of BLSHADE and BHBA, whereas the performance of BECSA, BLSHADE, and BHBA is well ahead of all other rivals, such as BPSO and BMFO. Specifically, we may infer that the superiority of BECSA in addressing feature selection problems is due to its thoughtful mathematical model. In conclusion, the results of this statistical study show that BECSA is a sound and trustworthy method with reasonable exploration and exploitation capabilities. These conclusions offer positive reasons to utilize the proposed method to address more challenging real-world applications in the field of healthcare.

Conclusion and Future Works

In this paper, three enhanced binary cognitive computation methods based on the capuchin search algorithm (CSA) are proposed for feature selection (FS) problems in medical diagnostic applications. These methods are referred to as binary exponential CSA (BECSA), binary power CSA (BPCSA), and binary S-shaped CSA (BSCSA). Each version utilizes a different growth function to update the values of the cognitive and social parameters during the iterative process. The goals of these FS algorithms include creating simple and comprehensive models, enhancing data-mining performance, and helping prepare clean and non-redundant data. In the meantime, these proposed methods could be successfully used to reduce the dimensionality of data for machine learning tasks. The performance of these methods was assessed on 24 datasets using several assessment criteria. Initially, the results produced by the three proposed versions of CSA were compared with each other as well as with those produced by the native binary version of CSA. For comparative evaluations, the proposed BECSA and the basic binary CSA were compared with other well-established algorithms. Evaluation based on Friedman’s and Holm’s tests showed that BECSA ranks first in terms of classification accuracy, fitness value, and specificity. As the proposed binary versions of CSA revealed attractive performance in handling FS problems, further extensions of these versions could be made in future research. For example, these methods might be used by researchers working on multi-objective optimization problems. Gene selection, as a high-dimensional task, could also be used to further validate the suitability of these methods. Other transfer functions, such as U-shape, V-shape, and X-shape functions, could be examined to check their effect on the performance of the proposed methods. Finally, because classification accuracy differs between classes when they come from different datasets, convolutional neural networks (CNNs) could be used to enrich the features according to the properties of each dataset, possibly in combination with ensemble methods.