1 Introduction

With the rapid development of Internet technologies, vast amounts of data are being generated and accumulated in numerous fields such as social media (Hu et al. 2022a), financial services (Huang and Tsai 2009), and telecommunication applications (Abdel-Basset et al. 2021a). Because these data contain a considerable amount of irrelevant and redundant information, data preprocessing becomes necessary to improve the efficiency of data acquisition. As a common and effective data preprocessing method, feature selection has received wide attention and has been applied in several fields, such as data mining (Hichem et al. 2019), medical diagnosis (Georges et al. 2020), and pattern recognition (Hu et al. 2022a). Feature selection is defined as the process of eliminating redundant features and selecting the optimal feature subset to describe the problem (Sadeghian et al. 2021). This process involves two components: an evaluation criterion, which measures the quality of candidate feature subsets to guide the search, and a search approach, which explores the search space for the optimal feature subset. Concerning the evaluation criterion, three techniques are used, i.e., filter-based, wrapper-based, and embedded techniques (Senawi et al. 2017). The filter-based technique evaluates the quality of feature subsets based on statistical dependencies in the data (Sadeghian et al. 2021). The wrapper-based technique assesses feature subsets with a classifier (Xue et al. 2015). The embedded technique selects feature subsets during the training process of the classifier (Nguyen et al. 2020). Compared to the other two techniques, the wrapper-based technique achieves better results, although it requires additional computational resources (Lin et al. 2014).

From an optimization perspective, feature selection is a hard combinatorial optimization problem (Hu et al. 2020); the key task is to select the best feature subset using a search approach, such as complete, random, or heuristic search. With a complete search, the best feature subset is selected among the \(2^{N}\) possible subsets of an original dataset with \(N\) features, so the computational burden grows exponentially with the number of features (Guha et al. 2020). A second approach is random search, in which better subsets are progressively produced over iterations (Aljarah et al. 2018). The third approach is heuristic search, which mainly comprises the sequential forward method and the sequential backward method (Hu et al. 2020).

Recently, owing to their simplicity and easy implementation, meta-heuristic algorithms have aroused widespread interest. According to their design paradigms, these algorithms are categorized into four groups: (1) evolution-based algorithms, (2) physics-based algorithms, (3) swarm intelligence-based algorithms, and (4) human-based algorithms. The first category is inspired by natural selection. For example, the genetic algorithm (GA) is a well-known evolution-based algorithm that mimics the processes of biological evolution (Siedlecki and Sklansky 1993). Similarly, the differential evolution (DE) algorithm consists of three steps, i.e., mutation, crossover, and selection (Deng et al. 2021); the direction of the optimization search in DE is guided by cooperation and competition among individuals.

The second category simulates various physical phenomena and laws. For instance, the simulated annealing (SA) algorithm, a physics-based algorithm, is inspired by the melting and cooling processes of metallurgical materials (Chantar et al. 2021). The gravitational search algorithm (GSA) is inspired by the law of gravity (Mittal et al. 2020). Moreover, the momentum search algorithm (MSA) is based on two physical laws, i.e., the momentum conservation law and the kinetic energy law (Dehghani and Samet 2020). Other algorithms in this category are Henry gas solubility optimization (HGSO) (Hashim et al. 2019), the spring search algorithm (SSA) (Dehghani et al. 2020) and the flow direction algorithm (FDA) (Karami et al. 2021).

The third category is inspired by the movements and hunting behaviors of animals. For instance, the particle swarm optimization (PSO) algorithm is a famous optimization method that imitates the foraging behaviors of bird flocks (Sharkawy et al. 2011). The grey wolf optimizer (GWO) simulates the cooperative hunting behaviors of wolves (Panda et al. 2019). The whale optimization algorithm (WOA) mimics the bubble-net foraging behavior of humpback whales in three steps, i.e., encircling the prey, bubble-net attacking, and searching for the prey (Gharehchopogh and Gholizadeh 2019). Other swarm intelligence-based algorithms include the African vultures optimization algorithm (AVOA) (Abdollahzadeh et al. 2021), golden jackal optimization (GJO) (Chopra and Ansari 2022), the jellyfish search optimizer (JSO) (Chou and Truong 2021), the snake optimizer (SO) (Hashim and Hussien 2022), and the artificial hummingbird algorithm (AHA) (Zhao et al. 2022).

The final category is inspired by human behaviors. A representative example is the simple human learning optimization algorithm, which is based on human learning mechanisms (Wang et al. 2014). The poor and rich optimization algorithm simulates the behaviors of the rich and the poor as they accumulate wealth and improve their economic situations (Moosavi and Bardsiri 2019). Other algorithms in this category include the teamwork optimization algorithm (Dehghani and Trojovský 2021), past present future (Naik and Satapathy 2021) and the coronavirus herd immunity optimizer (CHIO) (Al-Betar et al. 2021).

Nevertheless, most meta-heuristic algorithms address only continuous optimization problems, whereas feature selection is a discrete optimization problem. Therefore, corresponding discrete algorithms, especially binary algorithms, must be designed to perform feature selection. As a novel meta-heuristic algorithm inspired by the basic arithmetic operators, the arithmetic optimization algorithm (AOA) performs well on continuous optimization problems (Abualigah et al. 2021). Consequently, binary arithmetic optimization algorithms (BAOAs) are proposed in this paper to perform feature selection. The main contributions of this paper are as follows:

  • Multiple BAOAs utilizing different strategies are proposed to perform feature selection.

  • Six algorithms are formed based on six different transfer functions that convert the continuous search space into a discrete one. Moreover, to enhance the search speed and the ability to escape from local optima, six further algorithms are developed by integrating the transfer functions with Lévy flight.

  • Based on 20 common UCI datasets, the performance of the proposed algorithms is evaluated, and the results illustrate that BAOA_S1LF is the best among all the proposed algorithms. Furthermore, the performance of BAOA_S1LF is compared with that of other meta-heuristic algorithms on 26 UCI datasets, and the results demonstrate that BAOA_S1LF outperforms them in feature selection.

The rest of this paper is organized as follows: Sect. 2 presents a literature review of the application of meta-heuristic algorithms for feature selection. Section 3 introduces the AOA. Section 4 presents the proposed BAOAs utilizing different strategies. Section 5 compares the results of our proposed algorithms and other meta-heuristic algorithms on the UCI datasets. Section 6 concludes the paper.

2 Literature review

Over the previous decades, meta-heuristic algorithms have been utilized as search strategies in feature selection and have demonstrated superior efficiency compared to exact methods. Among evolution-based algorithms, a binary DE integrating a mutation operator, a one-bit purifying search operator, and an efficient non-dominated sorting operator has been developed to tackle feature selection (Zhang et al. 2020). Xue et al. (2021) propose a multi-objective binary GA integrating an adaptive operator selection mechanism (MOBGA-AOS), and the experimental results on 10 datasets reveal that it performs well. Among physics-based algorithms, a binary version of the multi-verse optimizer (MVO) has been introduced to select the optimal feature subset (Hans and Kaur 2020b). Neggaz et al. (2020) adopt the HGSO to select significant features and improve classification accuracy. Guha et al. (2020) propose a new approach based on the GSA, where a clustering technique is used to overcome premature convergence.

Concerning swarm intelligence-based algorithms, a new variant of GWO with an improved spread strategy and a chaotic local search (CLS) mechanism is proposed in Hu et al. (2022b) to select the best feature subset; the spread strategy enhances the agents' ability to avoid local optima, their global exploration capability, and the randomness of individual movements, while the CLS mechanism accelerates the convergence of the evolving agents. Agrawal et al. (2020) propose a quantum WOA for feature selection, in which modified mutation and crossover operators are applied to the quantum-based exploration, shrinking, and spiral movements of whales. Ouadfel and Abd (2020) propose a new approach based on the crow search algorithm (CSA) with a global search strategy and an adaptive awareness probability. Hu et al. (2020) analyze the parameter ranges and introduce a new updating formula for these parameters to balance global and local search in the proposed binary GWO (BGWO). Mafarja et al. (2018) propose a wrapper-based feature selection approach based on the binary dragonfly algorithm (BDA), into which eight transfer functions are integrated; the experimental results on 18 datasets illustrate that BDA outperforms the compared approaches. Based on the sine cosine algorithm (SCA) and the ant lion optimizer (ALO), a hybrid sine cosine ant lion optimizer (SCALO) is proposed to tackle feature selection, and the experimental results indicate that SCALO performs better than the compared algorithms (Hans and Kaur 2020a). Chaudhuri and Sahu (2021) propose a novel feature selection approach based on the CSA, in which a time-varying flight length strategy is adopted to avoid being trapped in local optima. Djellali et al. (2018) investigate two hybrid versions of the artificial bee colony (ABC) algorithm, combined with PSO and GA, to solve feature selection; the use of particles contributes to the effectiveness of the ABC algorithm, and mutation operators are applied in the onlooker and scout stages. In Hu et al. (2022a), the slime mould algorithm (SMA) is embedded with a dispersed foraging strategy and a transfer function for feature selection.

For human-based algorithms, Allam and Nandhini (2022) propose a new feature selection approach based on the teaching–learning based optimization algorithm, and the experimental results show that this approach achieves high classification accuracy with a minimum number of features on the Wisconsin diagnosis breast cancer dataset. Alweshah et al. (2022) propose two feature selection approaches based on the CHIO and a greedy crossover operator, and the experimental results reveal that the adopted strategy improves the performance of the CHIO in feature selection. Furthermore, Table 1 summarizes some existing approaches, from which the following observations are made:

  • Previous studies are often inefficient at identifying the optimal feature subset with high classification accuracy.

  • The test datasets used in previous studies lack diversity.

  • Only a few, or outdated, algorithms are compared in previous studies.

Table 1 Comparative analysis of existing approaches for feature selection

Considering the aforementioned points, this study proposes multiple binary versions of the AOA for feature selection. First, six algorithms are formed by converting the continuous search space into a discrete one with different transfer functions. Second, six further algorithms are developed by integrating the transfer functions with Lévy flight to enhance the search speed and the ability to escape from local optima. The proposed algorithms are compared on UCI datasets, and the best-performing one is then compared with other meta-heuristic algorithms.

3 AOA

As a novel meta-heuristic algorithm, the AOA is motivated by the basic arithmetic operators, i.e., addition, subtraction, multiplication, and division (Khatir et al. 2021). The search process of the AOA consists of two phases, exploration and exploitation, and the phase executed at each iteration is chosen as follows:

$$ \begin{cases} \text{execute the exploration phase}, & r_{1} > MOA\left( t \right) \\ \text{execute the exploitation phase}, & \text{otherwise} \end{cases} $$
(1)

where \(r_{1}\) follows a uniform distribution in (0, 1), \(t\) is the current iteration, and the math optimizer accelerated (MOA) function is a coefficient calculated as follows:

$$ MOA\left( t \right) = Min + t \times \frac{Max - Min}{T_{\max}} $$
(2)

where \(T_{\max}\) denotes the maximum number of iterations, and \(Max\) and \(Min\) represent the maximum and minimum values of MOA, respectively.
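As a brief illustration of Eqs. (1)-(2), the Python sketch below computes the MOA schedule and the phase decision; the values Min = 0.2 and Max = 1.0 are assumptions, not fixed by this section.

```python
import random

def moa(t, t_max, moa_min=0.2, moa_max=1.0):
    """Eq. (2): MOA rises linearly from moa_min to moa_max over the iterations."""
    return moa_min + t * (moa_max - moa_min) / t_max

def choose_phase(t, t_max):
    """Eq. (1): explore while r1 > MOA(t); exploitation thus dominates late iterations."""
    r1 = random.random()
    return "exploration" if r1 > moa(t, t_max) else "exploitation"
```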

3.1 Exploration

In the exploration phase, the solutions are searched randomly with the division and multiplication operators, and are updated as follows:

$$ y_{i,j} \left( t + 1 \right) = \begin{cases} y_{j}^{*} \div \left( MOP + \delta \right) \times \left( \left( ub_{j} - lb_{j} \right) \times k + lb_{j} \right), & r_{2} < 0.5 \\ y_{j}^{*} \times MOP \times \left( \left( ub_{j} - lb_{j} \right) \times k + lb_{j} \right), & r_{2} \ge 0.5 \end{cases} $$
(3)

where \(y_{i,j}\left( t + 1 \right)\) refers to the j-th position of the i-th solution at iteration \(t + 1\), \(y_{j}^{*}\) is the j-th position of the best solution, \(ub_{j}\) and \(lb_{j}\) denote the upper and lower bounds of the j-th position, respectively, \(\delta\) is a small constant that prevents division by zero, \(k\) is a parameter that controls the search process of the exploration phase, and \(r_{2}\) follows a uniform distribution in (0, 1). Additionally, the math optimizer probability (MOP) is calculated as follows:

$$ MOP\left( t \right) = 1 - \frac{t^{1/\beta}}{T_{\max}^{1/\beta}} $$
(4)

where \(MOP\left( t \right)\) denotes the function value at iteration \(t\), and \(\beta\) is a sensitive parameter in the range (0, 10) that defines the exploitation accuracy over the iterations.

3.2 Exploitation

During the exploitation phase, the solutions are further refined with the subtraction and addition operators, and are updated as follows:

$$ y_{i,j} \left( t + 1 \right) = \begin{cases} y_{j}^{*} - MOP \times \left( \left( ub_{j} - lb_{j} \right) \times k + lb_{j} \right), & r_{3} < 0.5 \\ y_{j}^{*} + MOP \times \left( \left( ub_{j} - lb_{j} \right) \times k + lb_{j} \right), & r_{3} \ge 0.5 \end{cases} $$
(5)

where \(r_{{3}}\) follows a uniform distribution in (0,1).
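Eqs. (3)-(5) can be condensed into a single per-dimension update rule; the sketch below assumes k = 0.5 and β = 5 (the defaults examined later in Sect. 5.1) and an illustrative δ of 1e-6.

```python
import random

def mop(t, t_max, beta=5.0):
    """Eq. (4): math optimizer probability, decreasing over the iterations."""
    return 1.0 - (t ** (1.0 / beta)) / (t_max ** (1.0 / beta))

def update_dimension(best_j, lb_j, ub_j, t, t_max, explore, k=0.5, delta=1e-6):
    """Eqs. (3) and (5): move the j-th coordinate around the best solution."""
    scale = (ub_j - lb_j) * k + lb_j
    m = mop(t, t_max)
    if explore:                                   # Eq. (3): division / multiplication
        if random.random() < 0.5:
            return best_j / (m + delta) * scale
        return best_j * m * scale
    if random.random() < 0.5:                     # Eq. (5): subtraction / addition
        return best_j - m * scale
    return best_j + m * scale
```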

Consequently, the search process of the AOA is shown in Fig. 1 (Abualigah et al. 2021).

Fig. 1 The search process of the AOA

4 Proposed algorithms

Feature selection aims to select the optimal feature subset and thereby eliminate redundant features. Owing to its high flexibility and excellent computational performance, the AOA can solve continuous optimization problems well; however, discrete optimization problems, especially combinatorial ones, cannot be handled directly by existing continuous meta-heuristic algorithms, including the AOA. Therefore, multiple BAOAs with different strategies are proposed to perform feature selection: transfer functions are applied to convert the continuous search space into a discrete one, and Lévy flight is used to enhance the search speed and the ability to escape from local optima.

4.1 Transfer function

Because feature selection is a combinatorial optimization problem, it cannot be solved directly by the continuous AOA. According to Mirjalili and Lewis (2013), transfer functions are highly suitable for mapping continuous values to discrete ones. Therefore, six transfer functions belonging to two families (the S-shaped and V-shaped families) are applied to implement the discretization. For example, an S-shaped transfer function was applied to convert PSO into its binary form, with the transfer probability calculated as follows (Kennedy and Eberhart 1997):

$$ S\left( y_{i,j}^{k} \left( t \right) \right) = \frac{1}{1 + e^{- y_{i,j}^{k} \left( t \right)}} $$
(6)

where \(y^{k}_{i,j} \left( t \right)\) refers to the position in dimension \(k\) at iteration \(t\), and the position is updated as follows:

$$ y_{i,j}^{k} \left( t + 1 \right) = \begin{cases} 0, & rand < S\left( y_{i,j}^{k} \left( t + 1 \right) \right) \\ 1, & rand \ge S\left( y_{i,j}^{k} \left( t + 1 \right) \right) \end{cases} $$
(7)

On the other hand, a V-shaped transfer function is applied to convert the continuous form to the binary form, with the transfer probability calculated as follows (Mirjalili 2016):

$$ V\left( y_{i,j}^{k} \left( t \right) \right) = \frac{\left| y_{i,j}^{k} \left( t \right) \right|}{\sqrt{1 + \left( y_{i,j}^{k} \left( t \right) \right)^{2}}} $$
(8)

Then, the position is updated as follows:

$$ y_{i,j}^{k} \left( t + 1 \right) = \begin{cases} \neg y_{i,j}^{k} \left( t \right), & rand < V\left( y_{i,j}^{k} \left( t + 1 \right) \right) \\ y_{i,j}^{k} \left( t \right), & rand \ge V\left( y_{i,j}^{k} \left( t + 1 \right) \right) \end{cases} $$
(9)

where \(\neg y_{i,j}^{k} \left( t \right)\) represents the complement of \(y_{i,j}^{k} \left( t \right)\). The formulas of six transfer functions are presented in Table 2 and the corresponding curves are depicted in Fig. 2.

Table 2 Formulas of six transfer functions (Faris et al. 2018)
Fig. 2 Transfer functions: a S-shaped and b V-shaped
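For concreteness, a sketch of one representative from each family is given below, assuming the sigmoid of Eq. (6) and the form of Eq. (8) as the chosen functions from Table 2; the bit-update rules follow Eqs. (7) and (9) exactly as written above.

```python
import math
import random

def s_transfer(y):
    """Eq. (6): S-shaped (sigmoid) transfer probability."""
    return 1.0 / (1.0 + math.exp(-y))

def v_transfer(y):
    """Eq. (8): V-shaped transfer probability."""
    return abs(y) / math.sqrt(1.0 + y * y)

def s_update(y_cont):
    """Eq. (7): draw the new bit against the S-shaped probability."""
    return 0 if random.random() < s_transfer(y_cont) else 1

def v_update(y_cont, bit_prev):
    """Eq. (9): flip the previous bit with the V-shaped probability."""
    return 1 - bit_prev if random.random() < v_transfer(y_cont) else bit_prev
```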

4.2 Lévy flight

The Lévy flight is a non-Gaussian random walk whose steps take isotropic random directions in a multidimensional space (Liu and Cao 2020). Moreover, it has been reported that the drastic changes caused by Lévy flight help the search jump out of local optima and attain the optimal solution (Zhao et al. 2020). Therefore, to address the AOA's slow convergence and weak search efficiency, Lévy flight is adopted and integrated into our algorithms, and the corresponding solution is updated as follows:

$$ y_{i} (t + 1) = y_{i} (t) + rand \cdot sign\left[ {rand - 0.5} \right] \otimes Levy $$
(10)

where \(y_{i} (t + 1)\) represents the i-th solution at iteration \(t + 1\), \(sign\left[ {rand - 0.5} \right]\) takes one of the three values 1, 0, and −1, \(\otimes\) denotes entry-wise multiplication, and Levy denotes the random search path, which follows the Lévy distribution given below:

$$ {\text{Levy}}\sim u = S^{ - \lambda } ,\;\;1 < \lambda \le 2 $$
(11)

where \(\lambda\) determines the shape of the Lévy distribution, and \(S\) is the step length of the Lévy flight, given as follows:

$$ S = \frac{\mu}{\left| v \right|^{1/\eta}} $$
(12)

where \(\mu \sim N\left( 0,\sigma_{\mu }^{2} \right)\), \(v \sim N\left( 0,1 \right)\), and \(\sigma_{\mu }\) is given as follows:

$$ \sigma_{\mu } = \left\{ {\frac{\Gamma (1 + \eta )\sin (\pi \eta /2)}{{\eta \cdot \Gamma [(\eta + 1)/2] \cdot 2^{(\eta - 1)/2} }}} \right\}^{1/\eta } $$
(13)

where Γ is the standard Gamma function and \(\eta \in [0,2]\).
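Eqs. (10)-(13) amount to Mantegna's procedure for drawing a Lévy-distributed step plus a signed perturbation of each coordinate; a minimal sketch follows, with η = 1.5 as an assumed, commonly used setting within the stated range.

```python
import math
import random

def levy_step(eta=1.5):
    """Eqs. (12)-(13): one Lévy step length via Mantegna's procedure."""
    sigma_mu = (math.gamma(1.0 + eta) * math.sin(math.pi * eta / 2.0)
                / (eta * math.gamma((eta + 1.0) / 2.0)
                   * 2.0 ** ((eta - 1.0) / 2.0))) ** (1.0 / eta)
    mu = random.gauss(0.0, sigma_mu)      # mu ~ N(0, sigma_mu^2)
    v = random.gauss(0.0, 1.0)            # v  ~ N(0, 1)
    return mu / abs(v) ** (1.0 / eta)

def levy_perturb(y, eta=1.5):
    """Eq. (10): add a randomly signed, Lévy-distributed move to each coordinate."""
    out = []
    for y_j in y:
        sign = 1.0 if random.random() > 0.5 else -1.0   # sign[rand - 0.5]
        out.append(y_j + random.random() * sign * levy_step(eta))
    return out
```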

4.3 Fitness function

During the feature selection process, the solutions are represented as binary sequences of ones and zeros; in the encoding adopted here, ones indicate that the corresponding features are not selected and zeros indicate that they are selected. These solutions are then evaluated using the wrapper-based technique with the K-nearest neighbor (KNN) classifier, chosen for its ease of implementation (Altman 1992). Moreover, minimizing the number of selected features and maximizing the classification accuracy are the two crucial factors in evaluating a solution; therefore, both are integrated into the fitness function. Accordingly, the fitness function is formulated as follows (AbuKhurma et al. 2022):

$$ Fitness = \omega \cdot ERR_{R} \left( D \right) + \lambda \cdot \frac{\left| C \right|}{\left| N \right|} $$
(14)

where \({\text{ERR}}_{R} \left( D \right)\) represents the classification error rate obtained from the KNN classifier, \(\left| N \right|\) is the total number of features, \(\left| C \right|\) is the number of selected features, \(\omega \in \left[ {0,1} \right]\) weights the classification quality, and \(\lambda\) weights the feature reduction rate (Emine and Ülker 2020).
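A minimal sketch of Eq. (14) is given below, assuming scikit-learn's KNeighborsClassifier with K = 5 (K is an assumption, not fixed by the paper), NumPy arrays for the data, and the weights ω = 0.99 and λ = 0.01 used in Sect. 5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(bits, X_train, y_train, X_test, y_test, omega=0.99, lam=0.01, k_neighbors=5):
    """Eq. (14): weighted sum of KNN error rate and feature-reduction ratio."""
    mask = np.asarray(bits) == 0          # zeros mark selected features (Sect. 4.3)
    if not mask.any():                    # guard: an empty subset gets the worst fitness
        return 1.0
    knn = KNeighborsClassifier(n_neighbors=k_neighbors)
    knn.fit(X_train[:, mask], y_train)
    err = 1.0 - knn.score(X_test[:, mask], y_test)       # ERR_R(D)
    return omega * err + lam * mask.sum() / len(bits)    # lam * |C| / |N|
```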

Consequently, three algorithms, denoted BAOA_S1, BAOA_S2, and BAOA_S3, are formed from the S-shaped transfer functions, and three algorithms, denoted BAOA_V1, BAOA_V2, and BAOA_V3, are formed from the V-shaped transfer functions. Moreover, by integrating the transfer functions with Lévy flight, six further algorithms, denoted BAOA_S1LF, BAOA_S2LF, BAOA_S3LF, BAOA_V1LF, BAOA_V2LF, and BAOA_V3LF, are developed; the corresponding flowchart is shown in Fig. 3.

Fig. 3 The flowchart of the proposed BAOA

5 Experimental results and discussions

The effectiveness of the proposed algorithms is evaluated on the UCI datasets detailed in Table 3 (Lichman 2013), which lists the size of features (Size), number of features (No. of features), number of instances (No. of instances) and type of each dataset (Type). All experiments are implemented in MATLAB R2021a on a PC with an Intel Core i7 2.9 GHz CPU and 8 GB RAM. To obtain statistically meaningful results, every algorithm is run independently 30 times. Moreover, for all the proposed algorithms, \(\omega\) and \(\lambda\) in (14) are set to 0.99 and 0.01 (Emine and Ülker 2020; Faris et al. 2018; Dhiman et al. 2021), and the population size and the number of iterations are set to 30 and 100, respectively.

Table 3 List of datasets

To verify the optimality of the results, each dataset is randomly divided by the hold-out strategy into 80% for training and 20% for testing (Faris et al. 2018). The evaluation metrics include the average fitness value (AVG), the standard deviation of the fitness value (STD), the average classification accuracy (ACC), the average number of selected features (FEA), and the average running time in seconds (TIME); the first four metrics are expressed as follows (Too et al. 2019):

$$ AVG = \frac{1}{R}\sum\limits_{n = 1}^{R} G_{n} $$
(15)
$$ STD = \sqrt{\frac{\sum\nolimits_{n = 1}^{R} \left( G_{n} - AVG \right)^{2}}{R - 1}} $$
(16)
$$ ACC = 1 - \frac{1}{R}\sum\limits_{n = 1}^{R} \frac{error\;predicted_{n}}{Total\;instances} $$
(17)
$$ FEA = \frac{1}{R}\sum\limits_{n = 1}^{R} \left| C_{n} \right| $$
(18)

where \(R\) is the total number of runs, \(n\) indexes the runs, \(G_{n}\) is the best fitness obtained in run \(n\), and \(\left| C_{n} \right|\) is the number of features selected in run \(n\).
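A small sketch of Eqs. (15)-(18) follows; the argument names are illustrative.

```python
import statistics

def summarize_runs(G, errors, total_instances, selected_counts):
    """Eqs. (15)-(18): AVG, STD, ACC and FEA over R independent runs.
    G: best fitness per run; errors: misclassified count per run;
    selected_counts: |C_n| per run."""
    R = len(G)
    avg = sum(G) / R                                          # Eq. (15)
    std = statistics.stdev(G)                                 # Eq. (16), R - 1 denominator
    acc = 1.0 - sum(e / total_instances for e in errors) / R  # Eq. (17)
    fea = sum(selected_counts) / R                            # Eq. (18)
    return avg, std, acc, fea
```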

5.1 Sensitivity analysis

In the BAOAs, β (used in Eq. (4)) is a sensitive parameter that defines the exploitation accuracy during the iterations, and k (used in Eqs. (3) and (5)) controls the search process of the exploration phase. These two parameters are inherited from the AOA, where they are set to 5 and 0.5, respectively (Abualigah et al. 2021). To obtain the most appropriate values of β and k for the BAOAs with different transfer functions, a sensitivity analysis is conducted in which β is varied from 2 to 9 and k from 0.25 to 0.9. As the CNAE_9 dataset is the most sensitive for this evaluation, the corresponding experiment is conducted on it (Emine and Ülker 2020). The results for the different parameter settings are presented in Table 4.

Table 4 The effects of β and k on performance of BAOAs with different transfer functions

As presented in Table 4, different values of β and k affect the performance of the BAOAs with different transfer functions on the CNAE_9 dataset, and all of them achieve their best performance when β and k are set to 5 and 0.5, respectively. Therefore, these values of β and k are used in the remaining experiments.

5.2 Evaluating BAOAs with transfer functions

The performance of the BAOAs with the different transfer functions (namely, BAOA_S1, BAOA_S2, BAOA_S3, BAOA_V1, BAOA_V2, and BAOA_V3) is investigated; the corresponding results are presented in terms of AVG, STD and TIME in Table 5, and in terms of ACC and FEA in Table 6.

Table 5 Comparison of BAOAs with the transfer functions in terms of AVG, STD and TIME
Table 6 Comparison of BAOAs with the transfer functions in terms of ACC and FEA

As shown in Table 5, BAOA_S2 and BAOA_S3 exhibit superior performance on 40% and 30% of the datasets, respectively, in terms of AVG. Concerning AVG and STD together, BAOA_S2 clearly performs better than the other algorithms. Moreover, BAOA_V2 obtains the smallest running time on 90% of the datasets.

From Table 6, it can be observed that BAOA_S1, BAOA_S2 and BAOA_S3 achieve better results on 25%, 40% and 30% of the datasets, respectively, in terms of ACC. With respect to FEA, BAOA_S1 and BAOA_S2 offer the best results on 25% and 30% of the datasets, while BAOA_V1 and BAOA_V2 obtain the best results on 15% of the datasets each.

Synthesizing the results in Tables 5 and 6, the S-shaped transfer functions outperform the V-shaped transfer functions in terms of AVG, STD, ACC, and FEA. Therefore, the S-shaped transfer functions are more suitable for the BAOA than the V-shaped ones.

5.3 Evaluating BAOAs with transfer functions and Lévy flight

The effectiveness of the BAOAs with transfer functions and Lévy flight is evaluated on the same twenty datasets. First, for the S-shaped functions, the results of BAOA_S1, BAOA_S2, BAOA_S3, BAOA_S1LF, BAOA_S2LF and BAOA_S3LF are compared in terms of AVG, STD, TIME, ACC and FEA in Tables 7 and 8. Second, for the V-shaped functions, BAOA_V1, BAOA_V2, BAOA_V3, BAOA_V1LF, BAOA_V2LF and BAOA_V3LF are compared on the same metrics in Tables 9 and 10. Furthermore, to select the best algorithm, the results of BAOA_S1LF, BAOA_S2LF, BAOA_S3LF, BAOA_V1LF, BAOA_V2LF and BAOA_V3LF are compared across the evaluation metrics in Tables 11 and 12.

Table 7 Comparison of BAOAs with the S-shaped functions and BAOAs with the S-shaped functions and Lévy flight in terms of AVG, STD and TIME
Table 8 Comparison of BAOAs with the S-shaped functions and BAOAs with the S-shaped functions and Lévy flight in terms of ACC and FEA
Table 9 Comparison of BAOAs with the V-shaped functions and BAOAs with the V-shaped functions and Lévy flight in terms of AVG, STD and TIME
Table 10 Comparison of BAOAs with the V-shaped functions and BAOAs with the V-shaped functions and Lévy flight in terms of ACC and FEA
Table 11 Comparison of BAOAs with the transfer functions and Lévy flight in terms of AVG, STD, the worst and best values, TIME
Table 12 Comparison of BAOAs with the transfer functions and Lévy flight in terms of ACC and FEA

As shown in Table 7, BAOA_S1LF outperforms BAOA_S1 on 80% of the datasets, BAOA_S2LF outperforms BAOA_S2 on 65% of the datasets, and BAOA_S3LF outperforms BAOA_S3 on 65% of the datasets in terms of AVG. Concerning TIME, the BAOAs with the S-shaped functions run faster than those that also use Lévy flight, because the Lévy flight strategy incurs additional running time in exchange for better results.

From Table 8, it can be observed that BAOA_S1LF, BAOA_S2LF, and BAOA_S3LF perform better than BAOA_S1, BAOA_S2, and BAOA_S3 on most of the datasets in terms of ACC and FEA, because the integration of Lévy flight enhances their search efficiency. Consequently, when premature convergence occurs, the BAOAs with the S-shaped functions and Lévy flight have a higher chance of escaping from local optima.

Synthesizing the results in Tables 7 and 8, the BAOAs with the S-shaped functions and Lévy flight perform better than those with the S-shaped functions alone in terms of AVG, STD, ACC, and FEA. In particular, BAOA_S1LF is superior to all other S-shaped variants, with or without Lévy flight.

Table 9 shows that BAOA_V1LF outperforms BAOA_V1 on all the datasets in terms of AVG and on 90% of the datasets in terms of STD. Meanwhile, BAOA_V2LF and BAOA_V3LF outperform BAOA_V2 and BAOA_V3 on 90% of the datasets in terms of AVG and STD. With respect to TIME, the BAOAs with the V-shaped functions require a shorter running time than those that also use Lévy flight, again because Lévy flight trades additional running time for better results.

From Table 10, it is observed that BAOA_V1LF, BAOA_V2LF and BAOA_V3LF obtain better results than BAOA_V1, BAOA_V2, and BAOA_V3 on 100%, 85% and 85% of the datasets, respectively, in terms of ACC. Furthermore, with respect to FEA, BAOA_V2LF outperforms BAOA_V2 on 60% of the datasets. The reason is that Lévy flight expands the search range and thereby helps the algorithms jump out of local optima, whereas the compared algorithms without it are easily trapped.

Overall, the BAOAs with the V-shaped functions and Lévy flight show superior performance to those with the V-shaped functions alone in terms of AVG, STD, ACC, and FEA. In particular, BAOA_V2LF performs best among all V-shaped variants, with or without Lévy flight.

Based on the results in Tables 7, 8, 9 and 10, the BAOAs with the transfer functions and Lévy flight outperform the BAOAs with the transfer functions alone in terms of AVG, STD, ACC, and FEA. These results further confirm that Lévy flight expands the search space and helps the algorithm escape from local optima.

Furthermore, BAOA_S1LF, BAOA_S2LF, BAOA_S3LF, BAOA_V1LF, BAOA_V2LF, and BAOA_V3LF are compared and the algorithm that shows the best performance in terms of AVG, STD, TIME, the worst and best values, ACC and FEA is selected.

Inspecting the results in Table 11, BAOA_S1LF achieves the best results on 85% of the datasets in terms of AVG. Concerning STD, BAOA_S3LF attains the best results on 40% of the datasets. Furthermore, BAOA_S1LF exhibits the best performance on most of the datasets in terms of the worst and best values, while BAOA_V2LF obtains the best TIME results on most of the datasets.

From Table 12, it is clear that BAOA_S1LF achieves the best results on 85% of the datasets in terms of ACC, and outperforms most of the algorithms in terms of FEA.

Based on the results in Tables 11 and 12, BAOA_S1LF is superior to the other BAOAs with transfer functions and Lévy flight in terms of AVG, STD, ACC, and FEA. Its performance is therefore compared with that of other meta-heuristic algorithms in the next subsection.

5.4 Comparison with other meta-heuristics algorithms

In this subsection, BAOA_S1LF is compared with other meta-heuristic algorithms, i.e., BWOA (Hussien et al. 2020), BSCA (Reddy et al. 2018), BPSO (Kushwaha and Pant 2018), BGWO (Hu et al. 2020), BMVO (Al-Madi et al. 2019), BGSA (Taradeh et al. 2019), binary GJO (BGJO) (Chopra and Ansari 2022), binary JSO (BJSO) (Chou and Truong 2021), the binary artificial electric field algorithm (BAEFA) (Chauhan and Yadav 2022), the binary chameleon swarm algorithm (BCSA) (Braik 2021), binary SO (BSO) (Hashim and Hussien 2022), and binary AHA (BAHA) (Zhao et al. 2022). Table 13 presents the main parameter settings used in the comparison.

Table 13 Parameter settings

The performance of BAOA_S1LF and the other meta-heuristic algorithms is reported in terms of AVG, STD, TIME, ACC and FEA in Tables 14 and 15.

Table 14 Comparison of BAOA_S1LF and other algorithms in terms of AVG, STD and TIME
Table 15 Comparison of BAOA_S1LF and other algorithms in terms of ACC and FEA

From Table 14, it can be seen that BAOA_S1LF obtains the best results on 92.3% of the datasets in terms of AVG, and concerning STD, it performs better than all other meta-heuristic algorithms on most of the datasets. With respect to TIME, however, the advantage of BAOA_S1LF is not obvious.

From Table 15, it can be observed that although BAHA is competitive with BAOA_S1LF on the Heart and CNAE_9 datasets, BAOA_S1LF obtains better results than the compared algorithms on 24 out of 26 datasets in terms of ACC. Additionally, concerning FEA, BAOA_S1LF is superior to the compared algorithms on most of the datasets. Overall, based on Tables 14 and 15, BAOA_S1LF outperforms all the compared algorithms in feature selection.

Furthermore, to illustrate the convergence performance, the convergence curves of BAOA_S1LF and the other meta-heuristic algorithms on all the datasets are shown in Fig. 4. From these curves, BAOA_S1LF shows superior performance on 24 out of 26 datasets, and its convergence exhibits an accelerating trend. On the Wine, Landsat, Ecolidata and Wdbc datasets, although some algorithms converge at a speed similar to that of BAOA_S1LF, the best final performance is obtained by BAOA_S1LF. On the Receptor dataset, although BWOA achieves better final performance, BAOA_S1LF shows the fastest convergence. On the Diabetes and LSVT datasets, although some algorithms converge slightly faster, the best final performance is again obtained by BAOA_S1LF. On the whole, the convergence speed of BAOA_S1LF is faster than that of most compared meta-heuristic algorithms, and its search speed and ability to escape from local optima are enhanced by the use of Lévy flight.

Fig. 4 Convergence curves of the algorithms

Although the results on the above datasets show the superiority of BAOA_S1LF to a certain extent, meta-heuristic algorithms are stochastic, so the differences between the algorithms must also be examined statistically. Therefore, an analysis of variance (ANOVA) and the Wilcoxon rank sum test are conducted at the 5% significance level; the p-values of the ANOVA based on fitness are reported in Table 16, and those of the Wilcoxon rank sum test based on fitness in Table 17, where 'NaN' means that the Wilcoxon rank sum test is not applicable. Furthermore, \(p \ge 0.05\) means that there is no significant difference from the other algorithm, and the corresponding values are marked in bold.
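For reference, the sketch below shows how such a per-dataset comparison could be carried out with SciPy's f_oneway and ranksums; the function name is illustrative, and the inputs are two algorithms' per-run samples (fitness or accuracy values).

```python
from scipy.stats import f_oneway, ranksums

def compare_algorithms(runs_a, runs_b, alpha=0.05):
    """Per-dataset significance check: one-way ANOVA and Wilcoxon rank-sum
    on the per-run results of two algorithms; p >= alpha means no
    significant difference at the chosen level."""
    _, p_anova = f_oneway(runs_a, runs_b)
    _, p_rank = ranksums(runs_a, runs_b)
    return p_anova, p_rank, (p_rank < alpha)
```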

Table 16 p-values of ANOVA for the classification accuracy results of BAOA_S1LF and other algorithms (p ≥ 0.05 are in bold)
Table 17 p-values of Wilcoxon rank sum test for the classification accuracy results of BAOA_S1LF and other algorithms (p ≥ 0.05 are in bold)

From Table 16, it can be seen that the performance of BAOA_S1LF is significantly different from that of the other compared algorithms on the Letter to SCADI datasets. Although BAOA_S1LF, BWOA, BPSO and BGSA show similar performance on the Segmentation dataset, BAOA_S1LF differs from BWOA, BPSO and BGSA on the remaining datasets. Likewise, although there is no significant difference between BAOA_S1LF and BGWO, BCSA, BSO and BAHA on the Heart dataset, BAOA_S1LF differs from them on the other datasets. Additionally, BAOA_S1LF, BSCA, BGJO and BSO show similar performance on the Bupa dataset, but BAOA_S1LF is significantly different from them on most of the datasets.

As can be noted from Table 17, the performance of BAOA_S1LF is significantly different from that of the other compared algorithms on the Zoo to Breast datasets. Although BAOA_S1LF performs similarly to BSCA and BGJO on the Ecolidata dataset, it differs from them on the remaining datasets. Meanwhile, no significant difference is noted between BAOA_S1LF and BJSO or BCSA on the Msplice dataset. In addition, BWOA, BGJO, BJSO and BAHA show less significant differences from BAOA_S1LF on the Receptor dataset, but BAOA_S1LF is significantly different from them on most of the datasets.

6 Conclusion

In this study, the performance of BAOAs is investigated and the following key conclusions are made:

  • Multiple BAOAs utilizing different strategies are proposed for performing feature selection.

  • Six algorithms are formed based on six different transfer functions by converting the continuous search space to the discrete search space.

  • By integrating the transfer functions and Lévy flight, six other algorithms are developed to enhance the speed of searching and the ability of escaping from the local optima.

  • Based on various evaluation metrics and 20 UCI datasets, the performance of the proposed algorithms is evaluated, and the results demonstrate that BAOA_S1LF shows the best performance among all the proposed algorithms.

  • The effectiveness of BAOA_S1LF is compared with that of other meta-heuristic algorithms on 26 UCI datasets, and the corresponding results show that BAOA_S1LF is superior to other meta-heuristic algorithms for performing feature selection.

In the future, further research will proceed as follows:

  • Although the use of Lévy flight improves the ability of BAOA_S1LF, it prolongs the running time. Therefore, additional strategies, such as opposition-based learning and crossover operators, will be adopted to improve the effectiveness of the BAOAs with different transfer functions.

  • To perform feature selection in wider applications, more algorithms will be compared to further demonstrate the advantages of the proposed algorithms.

  • In addition to feature selection, BAOA_S1LF will be applied to other practical optimization problems, such as vehicle routing, knapsack, and facility location problems.