1 Introduction

Data mining is a crucial approach for extracting meaningful information and knowledge, supporting sound decisions based on high-quality information models. As database models accumulate large numbers of features (relevant, irrelevant, redundant, and noisy) [1], serious concerns arise regarding the degradation of machine learning performance [2]. A huge number of attributes can lead to increased computational complexity, long execution times, and low precision, a phenomenon known as the curse of dimensionality. Feature Selection (FS) is considered the most effective way to deal with these challenges. In recent years, feature selection has attracted considerable attention because of its ability to select the most relevant features from high-dimensional data [3]. Feature selection techniques fall into two categories: filtering and wrapping [4,5,6]. Filtering techniques [5] rely on simple and inexpensive numerical computations based on statistical analysis and mutual information; they are independent of any learning algorithm and only require the relationships between inputs and outputs. Wrapper techniques rely on a learning algorithm to determine significant features, at a higher computational cost than filtering techniques [6]. However, their more accurate results make them highly attractive to academics and practitioners [7].

Feature selection is an NP-hard problem, owing to the exponential growth of the search space with the number of attributes. The complex relations between features cannot be neglected, especially as the number of features grows, and must be accounted for in feature selection applications [8]. These applications often rely heavily on metaheuristic optimization techniques [9,10,11,12,13,14,15], such as genetic and chimp optimization algorithms, to achieve higher classification accuracy. Such methods improve performance, but they share a serious weakness: it is difficult to reach a good consensus between classification accuracy, an optimal number of features, and fast computational time. To fill these gaps, we look for a solution that provides efficient optimization with a high accuracy rate and acceptable execution time across application domains, and we believe that combining binary optimization with machine learning can achieve this.

This paper proposes an efficient feature selection technique that reduces dimensionality and improves classification accuracy with acceptable execution time, using two variants of the binary Chimp optimization algorithm (BChimp1 and BChimp2) and six well-known classifiers (KNN, DT, SVM, RF, QDA, and MLP). All tests are applied to 23 different datasets [47] to evaluate classification accuracy and execution time. Overall, we make the following contributions:

  • Two variants of the binary chimp optimization algorithm, named BChimp1 and BChimp2, are proposed to select optimal features in multidimensional datasets, and machine learning models are established for both the training and testing steps.

  • To enhance the quality of the selected features with acceptable execution time, a machine learning model is developed. The selected features are validated using well-known classifiers, namely KNN, DT, SVM, RF, QDA, and MLP.

  • An evaluation of the proposed methods (BChimp1-ML and BChimp2-ML) on 23 datasets [47] with different numbers of features in various domains (19 of which are high-dimensional), to assess feature selection quality and execution time. In addition, the two proposed methods are compared with GWO, WOA, GA, and PSO.

The rest of the paper is organized as follows. Section 2 discusses the related work of this study. Section 3 presents an overview of feature selection and Chimp Optimization Algorithm (ChOA). Section 4 specifies the proposed hybrid approach. Section 5 provides an experimental demonstration of the proposed framework. Section 6 provides a summary of the paper.

2 Related Works

There are two main approaches to feature selection: filtering and wrapping. The filtering approach directly extracts relevant features based on their correlation with the dependent variable. The wrapper approach determines optimal features using classifier models. Our work follows the wrapper feature selection approach.

2.1 Filtering Based Approaches

This section discusses some recent and relevant filtering-based techniques from the literature [7, 16,17,18]. Peng et al. [7] integrated mutual-information criteria, namely maximum dependency, maximum relevance, and minimum redundancy (MRMR). Wang et al. [16] proposed a feature selection approach based on statistical tests. Gao et al. [17] designed a feature selection algorithm combining significant features. Labani et al. [18] proposed a new multivariate filtering method for text classification.

2.2 Wrapper Based Approaches

Wrapper approaches can be divided into: 1) evolutionary algorithms, which rely on evolution mechanisms such as mutation and crossover; 2) swarm intelligence algorithms, which imitate the collective behavior of animals or insects; 3) physics-based algorithms, which are inspired by physical phenomena that are highly sensitive to initial conditions and natural behaviors [19]; and 4) deep learning-based algorithms.

2.2.1 Evolutionary Algorithms

The Genetic Algorithm (GA) is an efficient and useful metaheuristic optimization approach based on the theory of genetic evolution. Candidate solutions are encoded and evaluated with a problem-specific fitness function [20]. GA can explore high-dimensional search spaces efficiently [21], but it suffers from high computational complexity and from the randomness of its mutation and selection processes, which strongly affect the feature selection process [22]. Siedlecki et al. [23] were the first to apply GA to selecting optimal features. Subsequently, Huang et al. [21] used GA-based feature selection to improve the overall effectiveness of SVM. Nemati et al. [24] proposed an optimization technique combining genetic and ant colony algorithms to enhance accuracy in protein function prediction. Stefano et al. [25] used a genetic feature optimization algorithm on large text data. Rejer et al. [26] introduced a mutation-driven GA to optimize features for brain-machine interfaces. Jingwei et al. [27] employed a combination of selection and crossover operators to optimize features in multidimensional databases.

2.2.2 Swarm Intelligence Algorithms

Kennedy et al. [28] proposed Particle Swarm Optimization (PSO), a simple and effective algorithm that has been applied to feature optimization. However, PSO tends to get trapped in local minima, and various PSO variants have been proposed to address this issue. Chuang et al. [29] applied a binary PSO (BPSO) to gene selection to attain better performance; their algorithm counters premature convergence by resetting the local search space if no better solution is found after three iterations. Unler et al. [30] proposed a novel discrete PSO (DPSO) based on probability-weighted estimation to improve selection. Moreover, to enhance effectiveness and solution quality, Ayeche et al. [31] proposed a micro-GA for facial expression recognition based on swarm optimization algorithms. Tan et al. [32] designed an efficient attribute selection algorithm using PSO to improve skin cancer detection. Too et al. [33] proposed a BPSO variant to select significant attributes, in which the population is first divided according to weight-based segmentation strategies and some important individuals are incorporated into the new population to enhance diversity. To improve effectiveness and adaptability, Xue et al. [34] proposed an adaptive-parameter PSO approach for attribute selection with multiple classifiers in complex applications. Al-Tashi et al. [12] designed an attribute selection scheme based on a multi-objective function and grey wolf optimization. Too et al. [35] also proposed a new quadratic binary Harris Hawks Optimization (HHO) for feature selection. Moghaddam et al. [36] presented a novel multi-objective forest optimization algorithm for attribute selection.

Too et al. [13] introduced the Hyper Learning Binary Dragonfly Algorithm (HLBDA), capable of determining significant features in COVID-19 datasets. Hegazy et al. [37] developed the Chaotic Salp Swarm Algorithm (CSSA), which enhances data classification performance while significantly reducing the number of attributes. Niu et al. [38] proposed a multi-objective feature selection approach based on bacterial foraging optimization. In addition, Piri et al. [39] used a multi-objective approach based on Ant Lion optimization to enhance efficiency.

Recently, Jayashree et al. [49] proposed an efficient selection method using a Binary Multi-Objective ChOA (BMOChOA) and a KNN classifier, which selects relevant features from medical data. They evaluated their approach with four multi-objective performance indicators on 14 medical databases of different sizes. The main gains of this approach are a reduced number of features and enhanced classification accuracy; however, the authors did not fully optimize the selected features and did not evaluate execution time.

In [50], ChOA is combined with binarization and transfer functions, applied together with a crossover operator to exploit the excellent exploratory behavior of ChOA. This approach ensures efficient and direct selection of significant features. However, it shows weaknesses on multidimensional data and consumes considerable execution time.

2.2.3 Physics-Based Algorithms

Numerous physics-based algorithms have been proposed, such as the Henry's law gas solubility optimization algorithm [40]. Kaveh et al. [41] employed the physical laws of Newton and Coulomb to select significant attributes. Desuky et al. [42] recently introduced a new Archimedes optimization algorithm to select critical features for data classification.

2.2.4 Deep Learning Algorithms

Khishe et al. [51] proposed an improved ChOA to automatically discover optimal DCNN architectures. The model introduces three enhancements: digitized-based coding for easier encoding, variable-length DCNNs, and the use of only part of the dataset for fitness assessment. It outperformed 23 classifiers on 9 benchmark datasets, achieved the best performance on the Fashion dataset (5.08%), and ranked second in model size (750k parameters). In 87 out of 95 investigations it surpassed the other benchmarks, making it the first ChOA-based effort to autonomously evolve DCNN architectures.

Cai et al. [52] introduced a deep neural network tuned by ChOA capable of diagnosing COVID-19 from X-ray images. The approach allows fast training and parallel implementation. The COVID-Xray-5K and COVIDetectioNet databases were used to evaluate its effectiveness, and it achieved superior performance with an accuracy of 99.11%, compared with standard DCNN, DCNN-GA, and MSAD models. In addition, the work identified infected areas with high accuracy using activation maps.

In [53], the authors designed a feature-reduction algorithm combining the Chimp Optimization Algorithm (ChOA), Greedy Search (GS), and Opposition-Based Learning (OBL) to select the most significant features. The GSOBL-ChOA approach was evaluated on benchmark datasets from various engineering fields, demonstrated its potential for attribute selection, and outperformed existing benchmarks in most experimental cases.

Khishe et al. [54] proposed a new variant of ChOA, the Multi-Objective ChOA (MO-ChOA), addressing multidimensional optimization challenges in engineering problems. This work uses a distributed leader-selection mechanism to explore the search space efficiently and to keep non-dominated solutions. The authors show that the obtained results achieve better performance with a lighter-weight structure than some existing algorithms.

Saffari et al. [55] used ChOA as an artificial trainer to identify and classify marine mammals, a high-dimensional task. The approach is based on a Fuzzy-ChOA with controlled parameters to balance exploration and exploitation. Underwater marine mammal sound recordings were collected, pre-processed, and used for ANN classification. The proposed approach yields remarkable results in convergence speed and recognition performance compared with various benchmark algorithms.

3 Background Techniques

3.1 Feature Selection

Feature selection filters and selects important features by reducing or eliminating redundant, unnecessary, and irrelevant ones. This technique is necessary to ensure better data understanding and reduced processing time. In contrast to other dimensionality reduction approaches, feature selection preserves the original features and therefore remains interpretable. It involves a four-phase process, as shown in Fig. 1: 1) subset generation, 2) subset evaluation, 3) stopping criterion, and 4) result validation [3]. The first phase adopts a search strategy to generate candidate subsets. The second phase assesses each candidate using an evaluation criterion. Depending on the stopping criterion (phase 3), phases 1 and 2 are repeated until the best combination of features is determined. Finally, the selected subset is validated on an independent dataset or using prior domain-specific knowledge according to the task performed. A minimal sketch of this wrapper-style loop is given after Fig. 1.

Fig. 1
figure 1

Feature selection process
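To make the four phases concrete, the following Python sketch (illustrative only; the paper's own implementation is in MATLAB) shows a generic wrapper loop in which random subset generation stands in for the search strategy and a cross-validated KNN plays the role of the evaluation criterion:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def wrapper_feature_selection(X, y, n_iters=100, seed=0):
    """Generic wrapper loop: generate subset -> evaluate -> stop -> validate."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    best_mask, best_score = None, -np.inf
    for _ in range(n_iters):                     # phase 3: stopping criterion (iteration budget)
        mask = rng.random(n_features) > 0.5      # phase 1: subset generation (random here)
        if not mask.any():
            continue
        clf = KNeighborsClassifier(n_neighbors=5)
        score = cross_val_score(clf, X[:, mask], y, cv=5).mean()  # phase 2: subset evaluation
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score                 # phase 4: validate the winner on held-out data
```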

3.2 Chimp Optimization Algorithm (ChOA)

The ChOA was proposed by Khishe et al. [14] in 2020. It is based on chimps' hunting behavior, which is guided and controlled by their intelligence and sexual motivation, as shown in Fig. 2. ChOA is well suited to choosing important features because it offers excellent global and local exploration [14]. It is easy to implement, tends to offer higher stability and lower feature-assessment cost than PSO or GA, and, as an optimization approach, does not require many initial parameters to ensure convergence. In a chimpanzee colony, individuals do not all have the same abilities; each uses its own strategy to predict the location of the prey, and this diversity is essential to the success of the hunt. Chimpanzees are organized into four roles: attackers, barriers, chasers, and drivers. Drivers follow and surround the prey without attacking it and drive it toward the barriers. The barriers build a dam along the path of the prey, eventually trapping it in a tree. The chasers move quickly after the prey to cause panic and catch up with it. Finally, the attackers predict the prey's escape route, either catching it or pushing it back toward the other hunters.

Fig. 2
figure 2

The hunting process of chimps (exploration and exploitation)

The ChOA technique represents driving and chasing the prey as follows:

$$X\left( {t + 1} \right) = X_{p} \left( t \right) - A.D$$
(1)
$$\mathrm{with}\;\;D = \left| {C.X_{p} \left( t \right) - M.X\left( t \right)} \right|$$
(2)

where

  • \(t\) : current iteration;

  • \(\mathrm{X}\): current position of chimp;

  • \({X}_{p}\): vector of prey position;

  • \(A\) : coefficient vector;

  • \(D\) : travel distance that separates the prey and the chimp.

  • \(C,M\) are vectors of coefficients.

The coefficient vectors \(A\) and \(C\) and the parameter \(f\) are calculated using equations (3)–(5):

$$\begin{gathered} A_{1} = 2.f.r_{11} - f \hfill \\ A_{2} = 2.f.r_{21} - f \hfill \\ A_{3} = 2.f.r_{31} - f \hfill \\ A_{4} = 2.f.r_{41} - f{ } \hfill \\ \end{gathered}$$
(3)
$$\begin{gathered} C_{1} = 2.r_{12} \hfill \\ C_{2} = 2.r_{22} \hfill \\ C_{3} = 2.r_{32} \hfill \\ C_{4} = 2.r_{42} { } \hfill \\ \end{gathered}$$
(4)

where \(f\) decreases from \(2\) to \(0\) over the course of the iterations, as computed by equation (5):

$$f = 2 - 2 \left( \frac{t}{T} \right){ }$$
(5)

The random coefficients \({r}_{i1}\) and \({r}_{i2}\) of the four groups are calculated from dynamic coefficients and a uniform random number as follows:

$$\begin{gathered} c_{1} g_{1} = 1.95 - 2.\left( {\frac{{t^{\frac{1}{3}} }}{{T^{\frac{1}{3}} }}} \right){ },{ }r_{11} = { }c_{1} g_{1} { }.{\text{ rand}}() \hfill \\ c_{2} g_{1} = 2.\left( {\frac{{t^{\frac{1}{3}} }}{{T^{\frac{1}{3}} }}} \right) + 0.5{ },{ }r_{12} = { }c_{2} g_{1} { }.{\text{ rand}}() \hfill \\ c_{1} g_{2} = 1.95 - 2.\left( {\frac{{t^{\frac{1}{3}} }}{{T^{\frac{1}{3}} }}} \right){ },{ }r_{21} = { }c_{1} g_{2} { }.{\text{ rand}}() \hfill \\ c_{2} g_{2} = 2.\left( {\frac{{t^{\frac{1}{3}} }}{{T^{\frac{1}{3}} }}} \right) + 0.5{ },{ }r_{22} = { }c_{2} g_{1} { }.{\text{ rand}}() \hfill \\ c_{1} g_{3} = - 2.\left( {\frac{{t^{3} }}{{T^{3} }}} \right) + 2.5{ },{ }r_{31} = { }c_{1} g_{3} { }.{\text{ rand}}() \hfill \\ c_{2} g_{3} = 2.\left( {\frac{{t^{\frac{1}{3}} }}{{T^{\frac{1}{3}} }}} \right) + 0.5,{ }r_{32} = { }c_{2} g_{3} { }.{\text{ rand}}() \hfill \\ c_{1} g_{4} = - 2.\left( {\frac{{t^{3} }}{{T^{3} }}} \right) + 2.5{ },{ }r_{41} = { }c_{1} g_{4} { }.{\text{ rand}}() \hfill \\ c_{2} g_{4} = 2.\left( {\frac{{t^{\frac{1}{3}} }}{{T^{\frac{1}{3}} }}} \right) + 0.5,{ }r_{42} = { }c_{2} g_{4} { }.{\text{ rand}}() \hfill \\ \end{gathered}$$
(6)

where \(rand()\) returns a uniformly distributed random number in the range [0, 1].
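A minimal NumPy sketch of these coefficient computations, following Eqs. (3)–(6) as written (the dynamic slopes of the four groups differ only in their first term), might look as follows; the function name and signature are illustrative:

```python
import numpy as np

def chimp_coefficients(t, T, dim, rng=np.random.default_rng()):
    """Coefficient vectors A_i and C_i of the four chimp groups (Eqs. 3-6)."""
    f = 2.0 - 2.0 * t / T                                # Eq. (5): decreases from 2 to 0
    u, v = (t / T) ** (1 / 3), (t / T) ** 3              # dynamic terms of Eq. (6)
    c1g = [1.95 - 2 * u, 1.95 - 2 * u, 2.5 - 2 * v, 2.5 - 2 * v]
    c2g = [2 * u + 0.5] * 4
    A, C = [], []
    for i in range(4):
        r1 = c1g[i] * rng.random(dim)                    # r_{i1} = c1g_i * rand()
        r2 = c2g[i] * rng.random(dim)                    # r_{i2} = c2g_i * rand()
        A.append(2 * f * r1 - f)                         # Eq. (3)
        C.append(2 * r2)                                 # Eq. (4)
    return A, C
```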

In Equation (2), the variable M represents a chaotic value ranging from 0 to 1, which is obtained from one of the chaotic maps listed below:

$$\begin{gathered} Quadratic:\;M = x_{i + 1} = x_{i}^{2} - c,\;c = 1 \hfill \\ Gaussian:\;M = x_{i + 1} = \left\{ \begin{gathered} 1,\quad \quad \quad \quad \quad \;if\;x_{i} = 0 \hfill \\ \frac{1}{{\bmod \left( {x_{i} ,1} \right)}},\quad \;else \hfill \\ \end{gathered} \right. \hfill \\ Logistic:\;M = x_{i + 1} = \alpha x_{i} \left( {1 - x_{i} } \right),\;\alpha = 4 \hfill \\ Bernoulli:\;M = x_{i + 1} = 2x_{i} \,\bmod \,1 \hfill \\ \end{gathered}$$
(7)

where \(t\), \(T\) are the current iteration, and the maximum number of iterations, respectively.
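The chaotic maps of Eq. (7) are straightforward to iterate; the sketch below follows the equations as written in this section (note that the Gaussian map is more commonly stated as mod(1/x, 1)):

```python
import numpy as np

def chaotic_m(name, x):
    """One iteration of the chaotic maps of Eq. (7), used to produce the value M."""
    if name == "quadratic":
        return x ** 2 - 1.0                        # c = 1
    if name == "gaussian":
        return 1.0 if x == 0 else 1.0 / np.mod(x, 1.0)
    if name == "logistic":
        return 4.0 * x * (1.0 - x)                 # alpha = 4
    if name == "bernoulli":
        return np.mod(2.0 * x, 1.0)
    raise ValueError(f"unknown chaotic map: {name}")
```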

To ensure an efficient hybrid search, ChOA uses four independent groups, and two main strategies, named ChOA1 and ChOA2, have been proposed to achieve superior classification accuracy regardless of the optimization problem. In ChOA, the attacker first informs the driver, barrier, and chaser about the location of the prey. The four best leaders then guide the other chimps toward the prey based on their four positions. The new position of a chimp can be expressed as:

$$X\left(t+1\right)= \frac{{X}_{1}+ {X}_{2}+ {X}_{3}+ {X}_{4}}{4}$$
(8)

Where \({X}_{1},{X}_{2}, {X}_{3}\mathrm{ and }{X}_{4}\) are the positions calculated using Eq. (9)–(12) respectively.

$${\mathrm{X}}_{1}=|{\mathrm{X}}_{\mathrm{a}}- {\mathrm{A}}_{1}. {\mathrm{D}}_{\mathrm{a}}|$$
(9)
$${\mathrm{X}}_{2}=|{\mathrm{X}}_{\mathrm{b}}- {\mathrm{A}}_{2}. {\mathrm{D}}_{\mathrm{b}}|$$
(10)
$${\mathrm{X}}_{3}=|{\mathrm{X}}_{\mathrm{c}}- {\mathrm{A}}_{3}. {\mathrm{D}}_{\mathrm{c}}|$$
(11)
$${\mathrm{X}}_{4}=|{\mathrm{X}}_{\mathrm{d}}- {\mathrm{A}}_{4}. {\mathrm{D}}_{\mathrm{d}}|$$
(12)

where

  • \({X}_{a}\), \({X}_{b}\), \({X}_{c}\) and \({X}_{d}\) are the positions of chimps respectively at iteration \(t\);

  • \({\mathrm{A}}_{1}\), \({\mathrm{A}}_{2}\), \({\mathrm{A}}_{3}\) and \({\mathrm{A}}_{4}\) are the coefficient vectors calculated by equations (3).

  • \({\mathrm{D}}_{\mathrm{a}}\), \({\mathrm{D}}_{\mathrm{b}}\), \({\mathrm{D}}_{\mathrm{c}}\) and \({\mathrm{D}}_{\mathrm{d}}\) calculated using Eq. (13)–(16).

    $${\mathrm{D}}_{\mathrm{a}}=|{{\mathrm{C}}_{1}.\mathrm{X}}_{\mathrm{a}}- {\mathrm{M}}_{1}.\mathrm{ X}|$$
    (13)
    $${\mathrm{D}}_{\mathrm{b}}=|{{\mathrm{C}}_{2}.\mathrm{X}}_{\mathrm{b}}- {\mathrm{M}}_{2}.\mathrm{ X}|$$
    (14)
    $${\mathrm{D}}_{\mathrm{c}}=|{{\mathrm{C}}_{3}.\mathrm{X}}_{\mathrm{c}}- {\mathrm{M}}_{3}.\mathrm{ X}|$$
    (15)
    $${\mathrm{D}}_{\mathrm{d}}=|{{\mathrm{C}}_{4}.\mathrm{X}}_{\mathrm{d}}- {\mathrm{M}}_{4}.\mathrm{ X}|$$
    (16)

where

  • \({\mathrm{C}}_{1}\), \({\mathrm{C}}_{2}\), \({\mathrm{C}}_{3}\) and \({\mathrm{C}}_{4}\) are calculated using equation (4).

  • \({\mathrm{M}}_{1}\), \({\mathrm{M}}_{2}\), \({\mathrm{M}}_{3}\) and \({\mathrm{M}}_{4}\) are the chaotic vectors obtained from the chaotic maps of Eq. (7).

According to equations (8) to (16), chimpanzees constantly update their positions based on the locations of their group leaders, which represent the best positions found so far by the population. The distance \(D\) between a chimpanzee and its prey determines the exploration area, and the parameters \(A\) and \(C\) control its search range. The chaotic variable \(M\) models chimpanzee behavior during the final stages of the hunt, when sexual motivation drives them to temporarily abandon their hunting duties and focus solely on obtaining meat, symbolizing their dominance. This erratic and impulsive behavior helps them explore new strategies and overcome difficulties in locating the prey. A vital ingredient of ChOA is therefore the use of this chaotic behavior in the position update: it makes the search highly flexible, helps escape local optima traps, and allows otherwise intractable problems to be addressed. By incorporating a controlled chaotic component into the optimization process, chimps steer their search toward more promising solutions, which is valuable for complex problems in various scientific and engineering domains.

Algorithm 1 illustrates the steps of ChOA for computing the optimal position of the attacker. The algorithm initializes a population of chimpanzees and sets the positions of the four leading chimps, \({X}_{Attacker}\), \({X}_{Barrier}\), \({X}_{Chaser}\), and \({X}_{Driver}\), together with their scores, to zero. Each chimp is evaluated with the fitness function, and the positions and scores of the four leading chimps are updated accordingly. The coefficients \(f\), \({r}_{1}\), and \({r}_{2}\) are updated using Eqs. (5) and (6), and the coefficients \(A\) and \(C\) are computed from them using Eqs. (3) and (4), respectively. The chimpanzees' new positions are then updated using Eq. (8). Finally, the best solution \({X}_{best}\) is returned.
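The core position update of Eqs. (8)–(16) can be condensed into a few lines; the sketch below assumes the leader positions and the coefficient and chaotic vectors have already been computed, and is only an illustration of one continuous ChOA step:

```python
import numpy as np

def choa_update(X, leaders, A, C, M):
    """One continuous ChOA position update (Eqs. 8-16).

    X       : (dim,) current chimp position
    leaders : [X_a, X_b, X_c, X_d], positions of attacker, barrier, chaser, driver
    A, C, M : lists of four coefficient/chaotic vectors (Eqs. 3, 4 and 7)
    """
    candidates = []
    for i in range(4):
        D = np.abs(C[i] * leaders[i] - M[i] * X)           # Eqs. (13)-(16)
        candidates.append(np.abs(leaders[i] - A[i] * D))   # Eqs. (9)-(12)
    return np.mean(candidates, axis=0)                     # Eq. (8): average of X_1..X_4
```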

Generally, the Chimp Optimization Algorithm is used to solve continuous optimization problems, in which chimpanzees can move to any point in the space. However, some optimization problems, such as feature selection, have solutions that are bit strings of 0s and 1s, so a binary version of ChOA is required. In this work, two novel binary chimp optimization algorithms, named BChimp1 and BChimp2, are proposed to improve selection accuracy and convergence rate. The positions of the chimpanzees are updated using the positions of the first four chimpanzees, so that every chimpanzee in the flock is attracted toward the four leaders. At any time, each solution is binary, located at a corner of the hypercube. In other words, the location of a chimpanzee is changed according to the principles of ChOA while maintaining the binary constraints.

In the following subsections, we will describe the two approaches proposed to achieve the binary aspect of the chimpanzee optimization algorithm.

figure a

Algorithm 1: Pseudo-code of original ChOA

4 Our Proposal

The proposed system selects important features using enhanced binary Chimp optimization algorithms and machine learning classifiers. To develop our proposal, we used several classifiers, objective functions, and optimization methods to choose the optimal features. We combined two variants of ChOA with six standard classifiers to ensure a good balance between relevance, convergence level, and training time. Figure 3 describes the proposed system in detail; it consists of two key stages: dimensionality reduction and machine learning. The first stage applies the two enhanced binary chimp optimization algorithms to reduce dimensionality. The second stage applies six well-known classifiers to achieve a high level of accuracy.

Fig. 3
figure 3

Overall process of binary Chimp feature selection and machine learning approach

4.1 BChimp1: First Binary Chimp Optimization Algorithm

We propose BChimp1, a novel binary approach built on the principle of ChOA [14]. It tracks the movements of an attacker, a driver, a barrier, and a chaser, which collaborate during the hunting phase to locate potential prey.

4.1.1 Mathematical Formalization of BChimp1

The position of a chimpanzee in BChimp1 is updated using a crossover operator expressed as:

$$X\left(t+1\right)= Crossover({Y}_{1}, {Y}_{2}, {Y}_{3}, {Y}_{4})$$
(17)

where \({Y}_{1}, {Y}_{2}, {Y}_{3}\) and \({Y}_{4}\) are binary vectors representing the movements of the first four chimpanzees, and \(Crossover({Y}_{1}, {Y}_{2}, {Y}_{3}, {Y}_{4})\) is the crossover operation between these solutions. The vectors \({Y}_{1}, {Y}_{2}, {Y}_{3}\) and \({Y}_{4}\), induced by the movements of the first four chimps, respectively, are calculated as:

$${Y}_{1}^{d}= \left\{\begin{array}{c}1, if \left({X}_{A}^{d}+ {bstep}_{A}^{d}\right)\ge 1 \\ 0, otherwise \end{array}\right.$$
(18)

where \({X}_{A}^{d}\) is the position vector of an attacker and the search space has the dimension \(d\) and \({bstep}_{A }^{d}\) is the binary step that is calculated by equation (19) as follows:

$${bstep}_{A}^{d}= \left\{\begin{array}{c}1, if {cstep}_{A}^{d}\ge {r}_{3}\\ 0, otherwise \end{array}\right.$$
(19)

where \({r}_{3}\) is a random vector with values between 0 and 1 and \({cstep}_{A }^{d}\) is continuous-valued step size, as defined by equation (20) as follows:

$${cstep}_{A}^{d}= \frac{1}{1+exp(-10 ({A}_{1}^{d}.{D}_{A}^{d}- 0.5))}$$
(20)

where \({\mathrm{A}}_{1}^{\mathrm{d}}\) and \({D}_{A}^{d}\) are calculated by equations (3) and (13).

$${Y}_{2}^{d}= \left\{\begin{array}{c}1, if \left({X}_{B}^{d}+ {bstep}_{B}^{d}\right)\ge 1\\ 0, otherwise \end{array}\right.$$
(21)

Where \({X}_{B}^{d}\) is the position vector of the barrier and the search space has the dimension \(d\) and \({bstep}_{B}^{d}\) is the binary step that is calculated by equation (22) as follows:

$${bstep}_{B}^{d}= \left\{\begin{array}{c}1, if {cstep}_{B}^{d}\ge {r}_{4}\\ 0, otherwise \end{array}\right.$$
(22)

Where \({r}_{4}\) is a random vector with values between 0 and 1 and \({cstep}_{B}^{d}\) is continuous-valued step size, as defined by equation (23) as follows:

$${cstep}_{B}^{d}= \frac{1}{1+exp(-10 ({A}_{2}^{d}.{D}_{B}^{d}- 0.5))}$$
(23)

where \({A}_{2}^{d}\) and \({D}_{B}^{d}\) are calculated by equations (3) and (14).

$${Y}_{3}^{d}= \left\{\begin{array}{c}1, if \left({X}_{C}^{d}+ {bstep}_{C}^{d}\right)\ge 1\\ 0, otherwise \end{array}\right.$$
(24)

where \({X}_{\mathrm{C}}^{d}\) is the chaser's position vector and the search space has the dimension \(d\) and \({bstep}_{C}^{d}\) is the binary step that is calculated by equation (25) as follows:

$${bstep}_{c}^{d}= \left\{\begin{array}{c}1, if {cstep}_{C}^{d}\ge {r}_{5}\\ 0, otherwise \end{array}\right.$$
(25)

where \({r}_{5}\) is a random vector with values between 0 and 1; \({cstep}_{\mathrm{C}}^{d}\) is continuous-valued step size, as defined by equation (26) as follows:

$${cstep}_{C}^{d}= \frac{1}{1+exp(-10 ({A}_{3}^{d}.{D}_{C}^{d}- 0.5))}$$
(26)

where \({A}_{3}^{d}\) and \({D}_{C}^{d}\) are calculated by equations (3) and (15).

$${Y}_{4}^{d}= \left\{\begin{array}{c}1, if \left({X}_{D}^{d}+ {bstep}_{D}^{d}\right)\ge 1\\ 0, otherwise \end{array}\right.$$
(27)

where \({X}_{D}^{d}\) is the driver's position vector and the search space has the dimension \(d\) and \({bstep}_{D}^{d}\) is the binary step that is calculated by equation (28) as follows:

$${bstep}_{D}^{d}= \left\{\begin{array}{c}1, if {cstep}_{D}^{d}\ge {r}_{6}\\ 0, otherwise \end{array}\right.$$
(28)

where \({r}_{6}\) is a random vector with values between 0 and 1 and \({cstep}_{\mathrm{D}}^{d}\) is continuous-valued step size, as defined by equation (29) as follows:

$${cstep}_{D}^{d}= \frac{1}{1+exp(-10 ({A}_{4}^{d}.{D}_{D}^{d}- 0.5))}$$
(29)

where \({A}_{4}^{d}\) and \({D}_{D }^{d}\) are calculated by applying equations (3) and (16).

4.1.2 Position Updating Mechanism in BChimp1

The chimp position update is a crossover of the four vectors \({Y}_{1}^{d}, {Y}_{2}^{d}, {Y}_{3}^{d}\) and \({Y}_{4}^{d}\). The new chimp position is expressed as:

$${X}^{d}\left(t+1\right)= \left\{\begin{array}{c}{Y}_{1}^{d}, {r}_{7}<\frac{1}{4}\\ {Y}_{2}^{d}, \frac{1}{4}{\le r}_{7}<\frac{1}{2}\\ \begin{array}{c}{Y}_{3}^{d}, \frac{1}{2}{\le r}_{7}<\frac{3}{4}\\ {Y}_{4}^{d}, otherwise\end{array}\end{array}\right.$$
(30)

Where \({r}_{7}\) is a random vector with values between 0 and 1.
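Putting Eqs. (18)–(30) together, one BChimp1 update of a single chimp can be sketched as follows (a NumPy illustration under the same notation; the leader positions here are the binary positions of the four best chimps):

```python
import numpy as np

def bchimp1_update(X, leaders, A, C, M, rng=np.random.default_rng()):
    """BChimp1 binary update: binary candidates Y_1..Y_4 (Eqs. 18-29)
    merged by the stochastic crossover of Eq. (30)."""
    dim = X.shape[0]
    Y = []
    for i in range(4):
        D = np.abs(C[i] * leaders[i] - M[i] * X)                  # Eqs. (13)-(16)
        cstep = 1.0 / (1.0 + np.exp(-10.0 * (A[i] * D - 0.5)))    # Eqs. (20), (23), (26), (29)
        bstep = (cstep >= rng.random(dim)).astype(int)            # Eqs. (19), (22), (25), (28)
        Y.append(((leaders[i] + bstep) >= 1).astype(int))         # Eqs. (18), (21), (24), (27)
    r7 = rng.random(dim)                                          # Eq. (30): dimension-wise choice
    return np.where(r7 < 0.25, Y[0],
           np.where(r7 < 0.50, Y[1],
           np.where(r7 < 0.75, Y[2], Y[3])))
```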

4.1.3 Algorithm of BChimp1

The BChimp1 approach reduces a dimension \(D\) by choosing relevant features. The binary representation translates each continuous value into a binary value of 0 or 1, and each dimension \(d\) corresponds to the index of a feature. Feature \(i\) is selected if the corresponding position of the chimp is greater than \(0.5\), which can be expressed as:

$$\left\{ {\begin{array}{*{20}l} {Feature\;i\;is\;selected,} & {if\;X_{i}^{d} > 0.5} \\ {Feature\;i\;is\;not\;selected,} & {otherwise} \\ \end{array} } \right.$$
(31)

Where \({X}_{i}^{d}\) denotes the solution \(i\) of the dimension \(d\).

Algorithm 2 shows the steps of BChimp1 and Fig. 4 presents the associated flowchart. Each solution (set of selected features) is defined by a vector of n attributes whose elements are 1 (selected) or 0 (not selected). First, we randomly initialize the chimps' positions (\({X}_{i}, i=1..N\)) in the population. The fitness score of each chimp's position is computed to obtain the four best solutions \({X}_{Attacker}^{d}\), \({X}_{Barrier}^{d}\), \({X}_{Chaser}^{d}\), and \({X}_{Driver}^{d}\). Then the parameters \(f\), \(m\), \(a,\) and \(c\) are initialized. During each iteration, the binary solutions \({Y}_{1}, {Y}_{2}, {Y}_{3}\) and \({Y}_{4}\) of each chimpanzee are calculated by equations (18), (21), (24), and (27), respectively, and a crossover between these binary solutions updates the chimpanzee's current solution. The new fitness values are then computed, and the positions of the four leading chimps and the parameters \(f\), \(m\), \(a,\) and \(c\) are updated. This process is repeated until the stopping criterion is met. Finally, the optimal binary solution \({X}_{best}\) is returned.

figure b

Algorithm 2: First Binary Chimp Optimization Algorithm

Fig. 4
figure 4

Flowchart of BChimp1

4.2 BChimp2: Second Binary Chimp Optimization Algorithm

4.2.1 Positions Updating Mechanism in BChimp2

In this section, we propose BChimp2. In this approach, a binary vector is used to update the chimps' positions, as shown in equation (32):

$${X}^{d}\left(t+1\right)= \left\{\begin{array}{ll}1,& if\; s\left(\frac{{X}_{1}^{d}+{X}_{2}^{d}+ {X}_{3}^{d}+{X}_{4}^{d}}{4}\right)\ge {r}_{8}\\ 0,& otherwise \end{array}\right.$$
(32)

where \({\mathrm{r}}_{8}\) is a random vector in the range [0, 1], \(d\) is a dimension of the search space, and \(s\left(x\right)\) is the sigmoid function calculated as follows:

$$s\left(x\right)= \frac{1}{1+exp(-10(x-0.5))}$$
(33)

The output of the sigmoid function in BChimp2 is interpreted as the probability of selecting the corresponding attribute: the attribute is selected when this probability exceeds the random threshold (e.g., 0.5). Iterating this computation within the objective function leads BChimp2 toward an optimal subset.
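A compact sketch of this update rule (Eqs. 32–33), assuming the four continuous leader-guided positions \(X_1\)–\(X_4\) of Eqs. (9)–(12) are available as NumPy arrays, is:

```python
import numpy as np

def sigmoid(x):
    """Eq. (33)."""
    return 1.0 / (1.0 + np.exp(-10.0 * (x - 0.5)))

def bchimp2_update(X1, X2, X3, X4, rng=np.random.default_rng()):
    """BChimp2 binary update (Eq. 32): average the four positions (as in Eq. 8),
    squash through the sigmoid, and threshold against a random vector r_8."""
    mean_pos = (X1 + X2 + X3 + X4) / 4.0
    r8 = rng.random(mean_pos.shape[0])
    return (sigmoid(mean_pos) >= r8).astype(int)
```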

4.2.2 Algorithm of BChimp2

Algorithm 3 depicts the pseudo-code of BChimp2. First, we randomly initialize the positions of the chimps in the population (each bit set to 0 or 1). Second, we evaluate the fitness of the chimpanzees, and the algorithm keeps the four best solutions. During each iteration, we calculate \({X}_{1}, {X}_{2}, {X}_{3}\) and \({X}_{4}\) using Eqs. (9)–(12), and the new position of each chimpanzee is updated using equation (32). Then, we evaluate the fitness of each chimpanzee and update the parameters \(f\), \(m\), \(a,\) and \(c\). This process is repeated for each chimp of the population until the stopping criterion is met.

figure c

Algorithm 3: Second Binary Chimp Optimization Algorithm

4.3 Feature Selection Using BChimp 1 and BChimp 2

The aim of our work is an intelligent selection of the most important features, achieving the best accuracy rates and fast processing time for large-dataset classification. BChimp1 and BChimp2 are applied to select a small subset of attributes from a large dataset while keeping a high predictive power with respect to the target attribute [43]. However, when the dimension of the dataset is high, an exhaustive search cannot select relevant features in practice: if the dataset has \(N\) features, the search space contains \({2}^{N}\) subsets (for \(N = 1000\), this is \(2^{1000}\) possibilities). We therefore opt for the proposed binary chimp algorithms, which are designed to handle such hard problems.

4.3.1 Solution Representation

The search space of the proposed BChimp1 and BChimp2 is composed of binary solutions that encode a subset of significant features. Each chimpanzee defines a solution: a vector of \(d\) elements, where \(d\) is the number of attributes in the original database. If the \({i}^{th}\) feature of the original database is selected, element \(i\) takes the value \(1\); otherwise it takes \(0\), where \(0\le i\le d-1\). Figure 5 illustrates the binary representation of each solution, and a small decoding example is given after the figure.

Fig. 5
figure 5

Solution representation example
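For illustration, the toy example below decodes one such binary solution into the reduced dataset actually passed to the classifier (all values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 8))                         # toy dataset: 100 samples, d = 8 attributes

solution = np.array([1, 0, 0, 1, 1, 0, 1, 0])    # chimp position: 1 = feature kept, 0 = dropped
selected_idx = np.flatnonzero(solution)          # indices of the selected features
X_reduced = X[:, selected_idx]                   # reduced dataset fed to the classifier
print(X_reduced.shape)                           # (100, 4)
```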

4.3.2 Selection Criteria

BChimp1 and BChimp2 optimize classification accuracy by selecting relevant features. Both explore the search space effectively: BChimp1 selects the solution from the four best candidates and applies a stochastic crossover to the four moving solutions to speed up convergence, while BChimp2 uses a sigmoid function to select the most relevant features.

4.3.3 Objective Function

The objective function plays a crucial role in selecting the best attributes with wrapper-based algorithms. In general, the choice of objective function for feature selection varies with the classification problem; the main aim is to maximize the relevance of the selected features while minimizing the number of irrelevant ones. BChimp1 and BChimp2 are evaluated using the Classification Error Rate (CEE) metric [44], used as the fitness function (\(Fitness\)):

$$Fitness=CEE= \frac{Number\, of\, incorrectly\, classified\, instances}{Total\, number\, of\, instances}$$
(34)

A \(CEE\) close to 0 indicates a good classifier, whereas a value close to 1 indicates a poor one.
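A possible implementation of this fitness function, wrapping the KNN classifier (k = 5) used in this work and the 70/30 split described later, is sketched below; the helper name and split seed are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def fitness_cee(solution, X, y, seed=0):
    """Classification Error Rate (Eq. 34) of a KNN (k = 5) trained on the selected features."""
    idx = np.flatnonzero(solution)
    if idx.size == 0:                            # an empty subset is the worst possible solution
        return 1.0
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, idx], y, test_size=0.3, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)    # Euclidean distance by default (Eq. 35)
    y_pred = knn.fit(X_tr, y_tr).predict(X_te)
    return np.mean(y_pred != y_te)               # incorrectly classified / total instances
```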

4.3.4 Choice of an Efficient Classification Model

An important aspect of the feature selection problem is choosing the classifiers used for cross-validation while maintaining high accuracy. Machine learning approaches can be used to validate feature selection algorithms, including various types of classifiers such as KNN, SVM, NB, ANN, RF, KLM, OPF, ID3, and C4.5. Figure 6 illustrates how frequently different classifiers are used in feature selection studies. KNN is the most commonly used classifier for attribute selection, owing to its efficiency, especially on large databases, compared to other classifiers. The proposed work relies on a widely accepted metric for the KNN algorithm, the Classification Error Rate (CEE), with \(k = 5\) [45]. The KNN algorithm uses the Euclidean Distance (ED) defined in [46] as follows:

Fig. 6
figure 6

Different classifiers rates in solving feature selection problems

$$ED\left(X, Y\right)= \sqrt{\sum_{d=1}^{D}{\left({X}^{d}- {Y}^{d}\right)}^{2}}$$
(35)

where \(X\) and \(Y\) are two feature vectors and \(D\) is the total number of features.

  • K-Nearest Neighbor (KNN): The key idea of KNN is to find the closest neighbors based on the Euclidean distance. KNN is easy to implement and easy to understand, making it suitable for basic classification tasks. However, it requires high computation time on large datasets, and selecting the correct value of K is crucial for its performance.

  • Decision Tree (DT): an intuitive model for classification and regression that is popular and widely used in machine learning. Based on the input features, the model builds a tree. Decision trees are easy to understand, handle both numerical and categorical data, and capture non-linear relationships. However, they can overfit, especially when the tree becomes too complex; ensembles such as Random Forest and Gradient Boosting are often more effective for strongly non-linear problems.

  • Support Vector Machine (SVM): is used to separate different class data points by a hyperplane. The maximum margin between classes makes it robust to outliers and noise. The kernel trick handles non-linear relationships between features and classes. However, SVM may not scale well with large datasets, and choosing the right kernel and tuning parameters is important for optimal performance. Still, SVM remains a popular and powerful choice for many classification tasks in machine learning.

  • Random Forest (RF): an efficient model that combines a group of decision trees to produce accurate predictions. Random Forest is robust and resistant to overfitting. It is a versatile and powerful classifier that excels on multidimensional databases, is applicable in many real-world domains, and is suitable when robustness and generalization are essential.

  • Quadratic Discriminant Analysis (QDA): is useful when the data does not follow a linear separation pattern. It leverages probability density functions to create quadratic decision boundaries, which allows it to handle more complex datasets.

  • Multi-Layer Perceptron (MLP): a powerful and versatile algorithm capable of handling complex relationships between attributes and target classes. The MLP is a fundamental building block of modern deep learning models and is commonly used in various fields.

To assess the performance of the proposed approach, we split each database into a 70% training set and a 30% testing set, and evaluate feature selection on both. A K-fold cross-validation with K = 10 is used to limit overfitting: the training set is divided into 10 folds of equal size, training is performed on K-1 folds, and validation on the remaining fold. The process is repeated 10 times, rotating the validation fold each time, and the average results over the folds are recorded.
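The evaluation protocol can be reproduced along the following lines (a scikit-learn sketch on a stand-in dataset; the paper's experiments use MATLAB and the 23 benchmark databases):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)          # stand-in for a reduced feature matrix

# 70% training / 30% testing hold-out split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)
cv_scores = cross_val_score(clf, X_train, y_train, cv=10)   # 10-fold cross-validation
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

clf.fit(X_train, y_train)
print("Test accuracy: %.3f" % clf.score(X_test, y_test))    # final check on the 30% held-out set
```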

5 Experimental Setup and Evaluation Criteria

We begin this section with an overview of the experimental setup and parameter settings. Then, we present different performance metrics and descriptions of multidimensional databases. The flow diagram of the proposed system is shown in Fig. 7.

Fig. 7
figure 7

A general model for proposed work using binary Chimp-machine learning approach

5.1 Experimental Setup

In this section, the experiments aim to demonstrate the efficiency of the proposed optimization techniques compared with existing optimization algorithms, in terms of accuracy rates and convergence levels. All techniques are implemented in MATLAB 2020 with the same objective function, on a PC with an Intel(R) Core(TM) i5-2410M CPU at 2.30 GHz, 4 GB of RAM, and a 64-bit operating system.

5.2 Parameter Settings

The parameters used in our experiments are shown in Table 1. The population size is 10 and the maximum number of iterations \(T\) is set to 100. The feature selection process is repeated 10 times for each dataset, and the average values of the metrics over the 10 runs are reported. For the BGA optimizer, the crossover rate is set to 0.8 and the mutation rate to 0.2. For the discrete optimizer BPSO, the coefficients c1 and c2 are both set to 2.

Table 1 Experimental Parameter settings

5.3 Databases Description

To properly test and evaluate the effectiveness of our techniques, large databases are required, since databases with many features are representative of real-world scenarios. All tests are applied to 23 large databases collected from the University of Arizona repository [47]; 18 of the 23 databases have more than 1000 features and can be considered high-dimensional. A description of the datasets used in the experiments is given in Table 2.

Table 2 Experimental databases.

5.4 Performance Metrics

To verify the performance of proposed algorithms, four main metrics are measured:

1) The mean fitness (\(\mu\)) and standard deviation (\(SD\)) of the fitness values obtained by each technique over the 10 runs are computed to identify the best-performing method. The standard deviation is defined by the following formula:

$$SD=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{({x}_{i}-\mu )}^{2}}$$
(36)

where \({x}_{i}\) indicates the fitness of each of the candidate solutions, \(\mu\) indicates the mean fitness, calculated as \(\mu = \Sigma (xi) / n\) and \(n\) indicates the number of the candidate solutions.

2) The difference in dataset size before and after the attribute selection process is computed to show how many features are removed. It is defined as follows:

$$D\_size=F\_size-O\_size$$
(37)

where \(F\_size\) and \(O\_size\) represent dataset size before and after the feature selection respectively.

3) The average accuracy rate over 10 runs is computed to identify the best-performing method. For quality evaluation, our experiments employ accuracy (Ac), precision (Pre), recall, specificity, and F-score as metrics to assess the classification quality achieved, as shown in Eq. (38); a small computation sketch is given after this list of metrics.

$$Ac= \frac{TP+TN}{TP+TN+FP+FN}$$
(38)
$$Pre= \frac{TP}{TP+FP}$$
$$Recall= \frac{TP}{TP+FN}$$
$$Specificity= \frac{TN}{TN+FP}$$
$$\mathrm{F}\_\mathrm{score}= 2*\frac{Prec*Recall}{Prec+Recall}$$

where

  • \(TP\) is True Positive,

  • \(TN\) is True Negative,

  • \(FP\) is False Positive,

  • \(FN\) is False Negative.

The average accuracy value is defined as follows:

$$\overline{{Ac}} = \frac{{\sum _{{i = 1}}^{n} Ac_{i} }}{n}$$
(39)

where \({Ac}_{i}\) represents the accuracy value for each run.

4) The convergence rate of each method is measured to study its convergence speed. It is defined as follows:

$$Cr=It\to {f }^{*}$$
(40)

where \(It\) represents the number of iterations required to reach the optimal fitness \({f}^{*}\).
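As referenced in metric 3) above, the classification metrics of Eq. (38) follow directly from the confusion-matrix counts; a small sketch (with toy run accuracies for Eq. (39), shown only for illustration) is:

```python
import numpy as np

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity and F-score (Eq. 38)."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_score     = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f_score

# Average accuracy over n independent runs (Eq. 39), with toy values.
run_accuracies = np.array([0.91, 0.89, 0.93])
mean_accuracy = run_accuracies.mean()
```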

6 Performance Comparison

6.1 Comparison of the Average Fitness Values

To demonstrate the efficiency of the proposed algorithms, the average fitness values of BChimp1 and BChimp2 were compared with PSO [28], GWO [13], WOA [48], and GA [9] after 10 executions using 23 multidimensional databases.

Table 3 presents the comparison results, which show that the mean fitness values of BChimp1 and BChimp2 are consistently low and reach the exact minimum. From the mean fitness values, we can confirm that BChimp2 is the more efficient of the two.

Table 3 Fitness assessment by different methods

6.2 Comparison of the Standard Deviation of Fitness Values

Table 4 compares the standard deviation of the fitness values of the proposed selection techniques with PSO [28], GWO [13], WOA [48], and GA [9]. It confirms that the fitness values of the different selection algorithms are very close to each other, although the GWO algorithm outperforms the others in several cases.

Table 4 The standard deviation assessment by different methods

6.3 Comparison of Convergence of Features Selection Algorithms

This section compares the convergence of the proposed BChimp1 and BChimp2 with relevant optimization techniques when solving high-dimensional feature selection problems. All feature selection algorithms are evaluated over 10 runs, and their convergence curves are shown in Fig. 8. The convergence behavior and speed of BChimp1 and BChimp2 are very encouraging compared with the techniques in [13, 28, 48], and [9]: both converge quickly and deeply toward the exact minimum compared to PSO [28], GWO [13], WOA [48], and GA [9]. This is because the two proposed algorithms combine the good exploration and exploitation qualities of the chimpanzee hunting process with an optimal binary solution representation. From the convergence levels, we can confirm that BChimp2 converges faster than BChimp1 and the other algorithms on all the experimented datasets.

Fig. 8
figure 8figure 8

Comparison of convergence of the proposed techniques with other techniques

6.4 Comparison of Dimensionality Reduction Before and After Feature Selection

This section presents the dimensionality reduction obtained before and after selection with the two proposed techniques (BChimp1 and BChimp2) on the 23 datasets. During the selection process, the features are reduced to a subset of the most relevant ones. Figure 9 and Table 5 compare the dimensionality before and after binary feature selection on the 23 datasets. The proposed techniques (BChimp1 and BChimp2) offer good dimensionality reduction on all the datasets of Table 5. BChimp2 achieves an average reduction rate of 72.54%, with its lowest ratio of retained features, 0.0037, obtained on the COIL dataset. BChimp1 achieves an average reduction rate of 42.77%, with its lowest ratio, 0.01, observed on the GLIOMA dataset.

Fig. 9
figure 9

Dimensionality reduction before and after features selection

Table 5 Dimensionality feature reduction before and after selection

From the dimensionality reduction values, we conclude that BChimp2 is the more effective technique, achieving the greatest reduction in dimensionality.

6.4.1 Accuracy Comparison of KNN Classifier With/Without Binary Chimp Methods

Figure 10 compares the accuracy of the KNN classifier with and without feature selection by BChimp1 and BChimp2 on the 23 datasets. We observe that the proposed BChimp2 algorithm combined with the KNN classifier achieves higher accuracy than the BChimp1 model with KNN, and that selecting relevant features with the proposed binary chimp methods improves the classification accuracy of KNN.

Fig. 10
figure 10

Accuracy comparison of KNN classifier with/without Binary chimpanzee methods

6.4.2 Accuracy Comparison of DT Classifier With/Without Binary Chimp Methods

To assess the quality of the solutions, the DT machine learning model is combined with the proposed algorithms (BChimp1 and BChimp2) on the 23 datasets described in Section 5.3 and compared with the plain DT model. The accuracy comparison of the DT classifier with and without the binary chimp methods (BChimp1 and BChimp2) is depicted in Fig. 11. The accuracy scores before and after selection by BChimp1 and BChimp2 are very close to each other; however, BChimp1 outperformed the plain DT classifier in nine cases, while BChimp2 outperformed it in seven cases.

Fig. 11
figure 11

Accuracy comparison of DT classifier with/without binary chimpanzee methods

6.4.3 Accuracy Comparison of SVM Classifier With/Without Binary Chimp Methods

Furthermore, when the SVM classifier is compared with BChimp1-SVM and BChimp2-SVM, as shown in Fig. 12, the recognition rates before and after selection by BChimp1 and BChimp2 are very close to each other; BChimp1 outperformed the plain SVM in six cases and BChimp2 in ten cases.

Fig. 12
figure 12

Accuracy comparison of SVM classifier with/without binary chimpanzee methods

6.4.4 Accuracy Comparison of RF Classifier With/Without Binary Chimp Methods

The comparison of the RF classifier with BChimp1-RF and BChimp2-RF is shown in Fig. 13. The accuracy values before and after selection by BChimp1 and BChimp2 are very close to each other; BChimp1 outperformed the plain RF in eight cases and BChimp2 in six cases.

Fig. 13
figure 13

Accuracy comparison of RF classifier with/without binary chimpanzee methods

6.4.5 Accuracy Comparison of QDA Classifier With/Without Binary Chimp Methods

The comparison of the QDA classifier with BChimp1-QDA and BChimp2-QDA is shown in Fig. 14. The recognition rates before and after selection by BChimp1 and BChimp2 are very close to each other; BChimp1 outperformed the plain QDA in 4 cases and BChimp2 in 7 cases.

Fig. 14
figure 14

Accuracy comparison of QDA classifier with/without binary chimpanzee methods

6.4.6 Accuracy Comparison of MLP Classifier With/Without Binary Chimp Methods

Figure 15 depicts the accuracy comparison of the MLP classifier with and without the proposed binary chimp methods (BChimp1 and BChimp2). The recognition rates before and after selection are very close to each other; BChimp1 outperformed the plain MLP in 7 cases and BChimp2 in 3 cases.

Fig. 15
figure 15

Accuracy comparison of MLP classifier with/without Binary chimpanzee methods

6.4.7 Accuracy Comparison of Six Classifiers With/Without Proposed Binary Chimp Methods

Table 6 shows the accuracy comparison of the six classifiers KNN, DT, SVM, RF, QDA, and MLP with and without the proposed binary chimp techniques over the 23 datasets. The average accuracy over all six classifiers and all 23 datasets is 82.02% for BChimp2 and 83.83% for BChimp1.

Table 6 Accuracy values before and after features selection and machine learning methods

6.4.8 Comparison of Classification Results Across Different Datasets

To evaluate BChimp2 with different classifiers (KNN, DT, SVM, and RF), the accuracy, precision, recall, and F-score are averaged over 10 runs across the 23 datasets. According to the classification results in Figs. 16, 17, 18, and 19, the proposed BChimp2-SVM achieves the best results compared to BChimp2-KNN, BChimp2-DT, and BChimp2-RF on all datasets. This indicates that the BChimp2 algorithm enhances the efficiency of the SVM classifier by increasing precision and F-score, since classification is performed only on the most significant features. This outcome highlights the benefit of introducing BChimp2 as a feature selection step before classification, and the superiority of SVM over KNN, DT, and RF, which guarantees high classification results in terms of accuracy, precision, recall, and F-score.

Fig. 16
figure 16

Average classification results of KNN classifier for 23 datasets

Fig. 17
figure 17

Average classification results of DT classifier for 23 datasets

Fig. 18
figure 18

Average classification results of SVM classifier for 23 datasets

Fig. 19
figure 19

Average classification metrics of RF classifier for 23 datasets

6.4.9 Discussion and Lessons Learned

The main lesson learned from the above sections is the importance of selecting optimal and significant features before the classification phase. The selection should rely on optimization methods that offer high classification accuracy, a high convergence level, and acceptable execution time, criteria that are particularly important for high-dimensional data.

As demonstrated in Figs. 10–15, the classification accuracies of the six learning machines before and after feature selection with the two proposed algorithms BChimp1 and BChimp2 are very close to each other. This indicates that BChimp1 and BChimp2 were able to select the important attributes in all the experimented databases. The selection time of BChimp1 and BChimp2 is acceptable, on average about 10 times and 4 times, respectively, the processing time of the datasets before selection, which remains practicable. In addition, the accuracy is high on all datasets for all six learning machines, which means that the tested classifiers combined with BChimp1 or BChimp2 are suitable for various application domains. The proposed BChimp1-ML (respectively BChimp2-ML) method outperformed the classifiers without feature selection in achieving high accuracy in 6 cases (respectively 4 cases) over the 23 datasets. The comparison study showed an average classification accuracy of 82.02% for BChimp2-ML and 83.83% for BChimp1-ML, with a data size reduction of 72.54% for BChimp2-ML and 42.77% for BChimp1-ML over all datasets, which proves the ability and effectiveness of the suggested systems for feature selection and classification. Finally, from the detailed results and discussion, the QDA classifier ranked first with 90.18% accuracy, followed by MLP with 89.98% and RF with 81.19% in third place. However, the proposed methods suffer from some shortcomings:

  • The convergence levels are not perfect on some databases for MLP and QDA: when implementing these classifiers on the MATLAB platform, some errors appeared, and their high computational complexity noticeably affects the selection of relevant features on databases of different natures, as shown in Table 6.

  • The precision is high but not optimal, since the features used in the training and testing phases are limited to those selected by BChimp1 and BChimp2.

  • No mechanism is investigated for deeply selecting features in unbalanced datasets, particularly when the positive samples become highly sparse during the selection, training, and testing phases. This is very important for high-dimensional data.

Future work will address these limitations.

7 Conclusion

This paper introduces a new hybrid approach to feature selection, based on enhanced binary chimp optimization algorithms (BChimp1 and BChimp2) combined with six standard classifiers. The approach was evaluated on different types of datasets to select relevant features that help machine learning classifiers produce highly accurate results. Additionally, we compared the proposed approach to four widely used and efficient feature selection algorithms (GA, PSO, GWO, and WOA). The results were evaluated with well-known metrics such as dimensionality reduction and convergence rate; they are very encouraging and validate the efficiency of our approach. To confirm that the proposed feature selection process extracts the most significant and relevant attributes, we applied standard classifiers, namely KNN, DT, SVM, RF, QDA, and MLP, before and after the feature selection process on 23 benchmarks. In almost all cases, the classifier accuracy is very similar before and after feature selection; in the remaining cases, the accuracy after feature selection is higher, which confirms the relevance and significance of the proposed approach. The approach suffers from non-convergence of the recognition process on some datasets for the QDA and MLP classifiers. In general, the two proposed algorithms BChimp1-ML and BChimp2-ML give acceptable convergence times compared to the other techniques on the 23 databases for all six classifiers. In future research, we aim to test BChimp1 and BChimp2 on medical datasets and to extend them to multi-criteria dynamic optimization problems.