1 Introduction

Our world is witnessing a vast increase in digital data generation daily. These massive data come in various forms, such as text, images, numbers, audio, video, and graphs, and can help humans advance knowledge. Such data are generated in many fields of human endeavor, including engineering, healthcare, science, industry, and education, and can be grouped for insight, prediction, and knowledge purposes. Unfortunately, a significant part of these data is raw, irrelevant, redundant, or otherwise unusable and must be transformed to become suitable for modeling. Before modeling, the data are preprocessed and organized into relevant datasets for training models that yield insights, make predictions, and contribute to knowledge.

Feature selection addresses a major real-world problem: reducing the dimensionality of the large datasets available in various scientific endeavors while maintaining predictive accuracy. Typically, there are too few objects to adequately represent the distribution of a high-dimensional feature space [21]. Reducing dimensionality is therefore a critical issue in numerous scientific fields, and researchers have proposed different approaches to feature reduction in the literature [77, 111]. Feature reduction is classified into two main categories: feature extraction (or construction) and feature selection [8]. Feature extraction builds a new set of features from the original dataset, while feature selection chooses an appropriate subset of the original features without any transformation.

Feature selection poses a significant challenge in machine learning. Because the number of candidate subsets, and hence the time required to locate the best features, grows rapidly with the dimensionality of a dataset, feature selection is considered an NP-hard problem [279]. To find the best subset, researchers have proposed several techniques such as exhaustive search, greedy search, and random search [8]. Many of these techniques suffer from premature convergence, immense complexity, and high computational cost. Hence, metaheuristics have become more prevalent in dealing with these challenges. Metaheuristic algorithms are efficient and effective at locating a good subset of a dataset while maintaining the model's accuracy. Owing to these strengths, this study focuses on feature selection problems solved with metaheuristic algorithms.

In recent years, researchers have proposed various metaheuristic algorithms to solve optimization problems. This study reviews modern metaheuristic algorithms that solve feature selection for multiclass classification problems. Some review works exist in the literature, but to the best of our knowledge none addresses the multiclass feature selection problem. Agrawal et al. [8] conducted a comprehensive review of metaheuristic algorithms for binary feature selection covering the decade 2009 to 2019. However, real-world problems are not always dichotomous. A comprehensive survey on evolutionary computation for feature selection was conducted by Xue et al. [261, 266]. Their survey enumerated state-of-the-art approaches, focusing on the genetic algorithm (GA), particle swarm optimization (PSO), ant colony optimization (ACO), and genetic programming.

Also, Jović et al. [119] reviewed feature selection methods with applications. The review focused on feature selection in four application domains: text mining, image processing and computer vision, industrial applications, and bioinformatics. However, the study considered only one research work in each domain, which is not comprehensive. A comprehensive literature review on feature selection algorithms in swarm intelligence was conducted by Brezočnik et al. [32]. The study categorized the sixty-four algorithms examined into eight taxonomy categories and presented the most common application areas. Similarly, Rostami et al. [203] introduced a comparative analysis of swarm intelligence-based feature selection methods, considering the strengths and weaknesses of these methods. Moreover, Abiodun et al. [5] conducted an organized review of evolving feature selection methods for text classification optimization tasks. Their work covered 2015 to 2021 and reviewed over 200 articles concerning metaheuristic and hyper-heuristic procedures. To this end, the following research question guides this study:

What articles employed metaheuristic algorithms to solve multiclass feature selection between 2000 and 2022?

The following questions were formulated to answer the question above.

  a.

    What metaheuristic algorithms have been used to solve the multiclass feature selection problems?

  b.

    What are the major feature selection methods used by the algorithms identified in (a) above?

  c.

    What key factors of the methods identified in (b) above are used by the metaheuristic algorithms?

  d.

    What are the issues or challenges of metaheuristic algorithms in multiclass feature selection?

  e.

    What are the research gaps and future directions in multiclass feature selection problems?

Only a few related studies have examined metaheuristic feature selection for the multiclass problem [50, 261, 266], and these focus on evolutionary computation approaches to feature selection and on high-dimensional feature selection for microarray data, respectively. Therefore, this study reviews the various categories of metaheuristic algorithms used to solve multiclass problems over two decades (2000–2022), namely human-based, physics-based, evolutionary-based, and swarm intelligence-based approaches. The motivation for this study is the lack of reviews that explore the multiclass feature selection problem holistically, together with the fact that many real-world classification problems are not binary. This study is therefore unique in presenting a holistic review of multiclass feature selection with metaheuristic algorithms. Accordingly, we make the following contributions:

  1.

    The existing body of knowledge on feature selection found in the literature is examined and compiled, followed by a classification of metaheuristic algorithms and lists of the algorithms in each category.

  2.

    A systematic review of multiclass wrapper-based metaheuristic algorithms for feature selection problems is presented, and the variations of each examined metaheuristic algorithm are discussed.

  3.

    The key factors of the wrapper-based approach, such as the classifiers used, the datasets, and the respective evaluation metrics, are presented.

  4.

    Issues and challenges of metaheuristic algorithms for multiclass feature selection are presented, and various approaches to addressing them are discussed.

  5.

    Lastly, research gaps and future research directions deduced from the literature are presented to assist researchers in this domain.

The remainder of this paper is structured as follows: Sect. 2 presents the background on feature selection and metaheuristic algorithms. Section 3 presents the methodology used to collect papers for this study. Section 4 covers the various categories of metaheuristic algorithms for solving multiclass feature selection problems. After that, Sect. 5 presents the hybrid techniques found in the literature for solving multiclass problems. Section 6 discusses the issues and challenges of metaheuristic algorithms, and Sect. 7 outlines the application areas in which multiclass feature selection problems have been solved. The review discussion, with future directions and other research areas in metaheuristic algorithms, is presented in Sect. 8, while Sect. 9 concludes the paper.

2 Background

Feature selection is a necessary data preprocessing or preparation procedure that identifies the most relevant, applicable, and essential feature space(s). It entails choosing a subset of the most discriminative and appropriate features from a large set of features to represent records in a dataset for predictive modeling [62]. It is an aspect of feature engineering in which the dataset attributes are used to reduce the dimensionality of the problem to be tackled and, as a result, ease the classification phase. The main goal of feature selection is to minimize the dimensionality of large datasets [5]. In turn, metaheuristic algorithms contribute substantially to the optimization challenge of selecting the best or near-best solution. Therefore, this section gives a comprehensive description of the feature selection problem and the categorization of metaheuristic algorithms.

2.1 Feature selection

Feature selection has drawn the research community's attention in recent years. It aims at selecting the best features from an original dataset and has been adopted to improve the quality of feature sets in many machine learning tasks, i.e., classification, regression, clustering, and time-series prediction [261, 266]. In machine learning, feature selection is a crucial task, and it has been applied in several areas of specialization to solve classification problems, e.g., text mining, image analysis, and biomedical problems. When a large number of features are present, learning algorithms tend to overfit, which degrades their performance. Dimensionality reduction is therefore a central branch of data mining and machine learning research and a widely adopted strategy among practitioners; it aims at selecting the best subset of relevant features from an original dataset using specific evaluation criteria and produces better learning performance, such as better model interpretability, lower computational cost, and higher learning accuracy [7]. Figures 1 and 2 show the general concept of feature selection, where an original feature set undergoes a selection process and some relevant or best feature subset is finally chosen.

Fig. 1

Generalization of the concept of feature selection

Fig. 2

Illustration of the general feature selection process

Generally, various feature selection methods have been developed to obtain the optimal subsets. Depending on whether the training set is labeled or not, feature selection algorithms can be grouped into supervised [7, 227], unsupervised [58, 160], and semi-supervised [258, 260, 285]. Supervised feature selection can be further classified into three approaches: filter, wrapper, and embedded methods. Filter methods isolate feature selection from classifier learning, so that no bias of the learning algorithm interferes with the feature selection algorithm [7]; they usually concentrate on the general characteristics of the data [258, 260].
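As a minimal illustration of the filter idea (a sketch using scikit-learn's univariate utilities; the wine dataset and the mutual information criterion are illustrative assumptions, not taken from the cited works), features can be ranked and cut without training any classifier:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_wine(return_X_y=True)            # 13 original features, 3 classes

# Score each feature against the labels and keep the 5 highest-scoring ones;
# no classifier is involved, so the ranking reflects general data characteristics.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print(selector.get_support(indices=True))    # indices of the retained features
```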

On the other hand, the wrapper method uses the predictive accuracy of a predetermined learning algorithm to assess the quality of the selected features. These feature selection models are often expensive to run on data with many features. Usually, the wrapper approach includes the classification algorithm and interacts with the classifier. Although this method usually gives better results than the filter approach, it is slower and computationally expensive because it depends on the resource demands of the modeling algorithm [119].

Therefore, wrapper methods rely on a modeling algorithm, with every candidate subset generated and assessed. Subset generation in the wrapper method is based on various search strategies. Exponential, sequential, and random techniques are the three search strategies used under the wrapper method [119]. Exponential algorithms evaluate a number of subsets that grows exponentially with the size of the feature space. This approach is almost impossible in practice because of its high computational cost, even though it gives accurate results. Sequential algorithms add or remove features in sequence (one or more at a time); this may result in local minima because once a feature is added to or removed from the selected subset, the decision cannot be reversed. Random algorithms integrate randomness into their search, which helps prevent them from being trapped in local minima; they are often referred to as population-based methods, e.g., metaheuristic algorithms, random generation, and simulated annealing. Figure 2 displays the typical feature selection process flowchart from the initial feature set through to subset evaluation. Finally, the embedded approach combines the wrapper and filter methods. In this approach, feature selection is included in the training process and is specific to the learning algorithm [101]. This method can be more efficient in several respects because it reaches a solution faster by avoiding retraining a predictor from scratch for every investigated variable subset [85].
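To make the sequential strategy concrete, the sketch below implements a simplified greedy forward search with a KNN classifier (an assumed setup for illustration, not the procedure of any reviewed paper): one feature is added per iteration and, once added, it is never removed, which is exactly why such searches can stall in local minima.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
improved = True
while improved and remaining:
    improved = False
    # Evaluate every remaining feature added to the current subset.
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] > best_score:            # keep the addition only if it helps
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
        improved = True

print(selected, round(best_score, 3))          # chosen feature indices and CV accuracy
```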

We exclude a detailed description of each feature selection method because detailed explanations are contained in [261, 266]. Figure 3 shows the general wrapper framework used to solve feature selection problems. The categorization of feature selection methods is shown in Fig. 4, where the highlighted spheres represent the path followed in this paper toward the metaheuristic algorithms.

Fig. 3

Framework for wrapper feature selection

Fig. 4

Categorization of feature selection methods

2.2 Metaheuristic algorithms

Metaheuristic algorithms are higher-level techniques that seek to generate sufficiently good solutions to optimization problems [50] that are too complicated and challenging to solve to optimality. In a world of limited resources such as time and computational capacity, it has become vital to locate good solutions from insufficient or partial information, and the advent of metaheuristics for such optimization problems has become one of the most remarkable research accomplishments of the past decades. Metaheuristic algorithms behave stochastically because they begin their optimization by picking random solutions. The technique works as a black box, avoiding local optima while exploring the search space effectively and efficiently thanks to the stochastic nature of the algorithms, and one main strength of metaheuristic algorithms is their ability to avoid premature convergence. As a class of global optimization techniques, metaheuristic algorithms have capabilities of both exploration and exploitation. Nevertheless, they can still become trapped in local optima rather than reaching the global optimum; the primary reason is the difficulty of trading off exploration and exploitation appropriately [257]. These two are also known as diversification and intensification, respectively. In a general sense, diversification means the ability to search many different sections of the search space, whereas intensification implies finding superior solutions within those sections [142]. Although these are sometimes conflicting goals, a search algorithm should still balance them. A trade-off between exploration and exploitation is essential for all optimization methods, and a compelling trade-off between the two helps reduce computational cost and yields efficient optimization. Metaheuristic algorithms have been successfully utilized to solve science and engineering problems, for example in civil engineering to design bridges and buildings, in electrical engineering to find optimal solutions for power generation, in data mining for prediction, classification, and clustering, and in the industrial sector for transportation, job scheduling, and facility location problems [8].

The trade-off between exploration and exploitation in multiclass feature selection and classification problems has received attention from researchers, and different methods have been adopted to balance these conflicting objectives. For example, Sahebi et al. [207] made random changes to the intelligent crossover (IC), the intelligent mutation (IM), and their developed inverse operator; the inverse operator was employed to investigate the monitored genes accurately, which increased the intelligence of their approach because exploration and exploitation were performed with greater knowledge of the search space. In Nagpal et al. [170], the study selected just the k-best particles to apply force on each other in order to maintain an adequate balance between exploration and exploitation; k was a function of time that decreased linearly to 1, and random weights were assigned as the coefficients. The two main categories of metaheuristic algorithms are:

  a.

    Single solution-based metaheuristic algorithms: the goal of most existing feature selection methods is to maximize the classification performance during the search procedure alone or to combine the classification performance and the number of selected features into one objective function [261, 266]. Algorithms that work on one solution at a time and involve local search-based metaheuristics, such as tabu search, variable neighborhood search (VNS), and iterated local search [161], are referred to as trajectory methods. They may often be trapped in local optima due to non-comprehensive exploration of the search space.

  b.

    Population-based (or multiple-solution) metaheuristic algorithms: this class of metaheuristics performs the search from numerous initial points, as swarm-based metaheuristics do [161]. Population-based algorithms benefit from scouring the search space in a useful way for exploration. This approach is appropriate for searching for global solutions because it can provide global exploration and local exploitation, and such algorithms are not easily trapped in local optima because multiple solutions help each other and explore the search space extensively. Because they iterate toward the promising parts of the search space, they are mostly used in real-world problems (a minimal sketch of such a loop is given after this list).
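The following sketch illustrates the population-based idea for wrapper feature selection in its most generic form (random binary masks, random bit-flip exploration, and elitist exploitation; the dataset, classifier, and parameters are assumptions for illustration and do not correspond to any specific algorithm reviewed here):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

def fitness(mask):
    # Wrapper fitness: cross-validated accuracy on the selected columns only.
    if not mask.any():
        return 0.0
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(10, X.shape[1]))        # 10 random binary masks
for _ in range(20):                                    # main optimization loop
    fit = np.array([fitness(ind) for ind in pop])
    best = pop[fit.argmax()].copy()                    # exploitation: remember the leader
    flips = rng.random(pop.shape) < 0.1                # exploration: random bit flips
    pop = np.where(flips, 1 - pop, pop)
    pop[0] = best                                      # elitism keeps the best mask alive
print(best, round(fitness(best), 3))
```

Metaheuristic algorithms differ mainly in how the perturbation step is designed; the overall loop structure above is common to most of them.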

Metaheuristic algorithms have drawn significant attention from researchers seeking to solve feature selection problems in recent decades. In solving this problem, metaheuristic algorithms are categorized into four main groups depending on their behaviors: evolutionary-based, physics-based, swarm intelligence-based, and human-based [162]. Figure 5 illustrates the categorization of metaheuristic algorithms.

  1.

    Evolutionary-based algorithms: algorithms in this category are nature-inspired and begin their process by randomly generating a population of solutions. The most renowned algorithm in this group is the genetic algorithm (GA), which John Holland developed in the 1960s based on Darwin's evolutionary theory. GA has received much attention, with different variants and improvements from researchers, and has been applied to various real-world problems. Other popular algorithms developed in this group include genetic programming, tabu search, evolution strategy, differential evolution, the flower pollination algorithm [270], the memetic algorithm [165], and many more.

  2.

    Swarm intelligence-based algorithms: these are inspired by the social interactions or behavior of birds, animals, insects, fish, and others. Numerous metaheuristic algorithms have been established in this category within the last two decades, and more are being developed; several researchers have developed variants of the popular ones, while others have hybridized algorithms from this category. The most renowned algorithm in this category is particle swarm optimization (PSO) [59], inspired by a flock of birds flying through the search space to find the best location. It was developed by Kennedy and Eberhart in 1995 and has gained much attention due to its rich mathematical basis for solving problems. Other algorithms in this group are cuckoo search [267], the gray wolf optimizer [154], and the krill herd algorithm [248].

  3.

    Physics-based algorithms: algorithms in this group draw their inspiration from the laws of physics. Some common algorithms in this group include ray optimization [125], the gravitational search algorithm [201], the galaxy-based search algorithm [100], the equilibrium optimizer [72], and the atom search optimizer [287].

  4.

    Human-based algorithms: the algorithms here are inspired by activities performed by, or behaviors of, humans. Human beings perform various activities that affect their performance, and researchers use these behaviors to develop algorithms. The two most popular algorithms in this category are teaching–learning-based optimization (TLBO) by Rao et al. [200] and the league championship algorithm (LCA) by Husseinzadeh Kashan [105]. Others include the exchange market algorithm (EMA) by Ghorbani and Babaei [81], the seeker optimization algorithm (SOA) by Dai et al. [49], and the social-based algorithm (SBA) by Ramezani and Lotfi [197].

Fig. 5

Classification of metaheuristic algorithms

2.3 Scientific background

This subsection presents the mathematical background of the feature selection problem and its multiclass formulation. The feature selection problem can be stated as follows: given a dataset \(s\) containing \(d\) features, \(s = \left\{ {f_{1} ,f_{2} , \ldots ,f_{d} } \right\}\), the aim is to select the best subset of features \(\left\{ {f_{1} ,f_{2} , \ldots ,f_{n} } \right\}\) with \(n < d\), where \(f_{1} ,f_{2} , \ldots ,f_{n}\) denote the selected features of the dataset.
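In optimization terms, the task can be restated as a subset search (a generic formulation added here for clarity, not an equation from the cited works):

$$S^{*} = \mathop {\arg \min }\limits_{{S \subseteq \left\{ {f_{1} , \ldots ,f_{d} } \right\},\;\left| S \right| = n < d}} E\left( S \right),$$

where \(E\left( S \right)\) denotes the estimated classification error of a model trained only on the subset \(S\). The wrapper approaches reviewed below differ mainly in how this search over subsets is carried out.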

In Sánchez-Maroño et al. [211], a wrapper-based method combining ANOVA with functional networks, known as AFN-FS, was proposed to estimate a function \(f\) of \(n\) input variables \(x_{1} ,x_{2} ,x_{3} , \ldots ,x_{n}\) through the approximation of its component functions; it serves as our reference point. This function may also be written as the sum of \(2^{n}\) orthogonal summands:

$$f\left( {x_{1} ,x_{2} , \ldots ,x_{n} } \right) = f_{0} + \mathop \sum \limits_{v = 1}^{{2^{n} - 1}} f_{v} \left( {x_{v} } \right)$$
(1)

Here, \(v\) denotes each possible subset of the \(n\) input variables, and the constant \(f_{0}\) corresponds to the function with no arguments. Each functional component \(f_{v} \left( {x_{v} } \right)\) in the above expression is approximated by the AFN technique as:

$$f_{v} \left( {x_{v} } \right) = \mathop \sum \limits_{j = 1}^{{k_{v} }} c_{vj} p_{vj} \left( {x_{v} } \right)$$
(2)

In the equation above, \(c_{vj}\) are the parameters to be estimated and \(p_{vj}\) denote the orthonormalized basis functions [35].

The \(c_{vj}\) parameters are learned by solving the following optimization problem:

$${\text{Minimize}}\;J = \sum\limits_{s = 1}^{m} {\varepsilon_{s}^{2} } = \sum\limits_{s = 1}^{m} {\left( {y_{s} - \hat{y}_{s} } \right)^{2} }$$
(3)

In Eq. 3, \(m\) is the number of available samples, \(y_{s}\) is the desired output for sample \(s\), and \(\hat{y}_{s}\) is the estimated output, obtained by:

$$\hat{y}_{s} = \hat{f}\left( {x_{s1} ,x_{s2} , \ldots ,x_{sn} } \right) = f_{0} + \mathop \sum \limits_{v = 1}^{{2^{n} - 1}} \mathop \sum \limits_{j = 1}^{{k_{v} }} c_{vj} p_{vj} \left( {x_{sv} } \right)$$
(4)

Once the \(c_{vj}\) parameters have been learned, two sets of indices can be derived: the global sensitivity index (GSI) and the total sensitivity index (TSI). The former measures the relevance of each functional component in Eq. 1, while the latter measures the relevance of each feature. These two indices can be expressed as follows:

$${\text{GSI}}_{v} = \mathop \sum \limits_{j = 1}^{{k_{v} }} c_{vj}^{2} ,\quad v = 1,2, \ldots ,2^{n} - 1;\qquad {\text{TSI}}_{i} = \mathop \sum \limits_{{v\,:\,x_{i} \in v}} {\text{GSI}}_{v} ,\quad i = 1, \ldots ,n.$$
(5)

These two indices make it possible to ascertain the relevance of features individually or in combination with others. To widen its scope of application, and because of the limited complexity of the function in Eq. 4, the AFN-FS was modified by employing an incremental approximation as follows:

$$\hat{y}_{s} = f_{0} + \mathop \sum \limits_{i = 1}^{n} \mathop \sum \limits_{j = 1}^{{k_{i} }} c_{ij} p_{j} \left( {x_{si} } \right)$$
(6)

In Eq. 6, \(s\) indicates a particular sample, \(n\) is the number of initial features, and \(k_{i}\) represents the set of functions used to estimate the univariate component of feature \(i\). The complexity of the approximation can be increased, once some features have been discarded, by adding further components. This process is iterated to remove as many features as possible without diminishing the mean accuracy of the approximation.

Further, since this study focuses on the wrapper approach to multiclass feature selection, we present the mathematical background of the multiclass concept and its problem structure. The multiclass output is denoted using a distributed representation: given a set of \(L\) classes, the output \(y_{s}\) is a vector of \(L\) elements such that \(y_{{{\text{sl}}}} = 1\) indicates that sample \(s\) belongs to class \(l\) and \(y_{{{\text{sl}}}} = 0\) indicates that it does not. This changes the optimization problem of multiclass classification to:

$$J = - \mathop \sum \limits_{s} \mathop \sum \limits_{l = 1}^{L} \left\{ {y_{{{\text{sl}}}} \ln \hat{y}_{{{\text{sl}}}} + \left( {1 - y_{{{\text{sl}}}} } \right) \ln \left( {1 - \hat{y}_{{{\text{sl}}}} } \right)} \right\}$$
(7)

Here, \(\hat{y}_{{{\text{sl}}}}\) is the estimate of the desired output \(y_{{{\text{sl}}}}\) and indicates how probable it is that sample \(s\) belongs to class \(l\). The sample is assigned to the class with the highest value of \(\hat{y}_{{{\text{sl}}}}\).
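As a minimal numerical illustration of Eq. 7 (a sketch with hypothetical array names, not code from the cited works), the objective can be computed directly from a one-hot target matrix and the estimated outputs:

```python
import numpy as np

def multiclass_objective(y, y_hat, eps=1e-12):
    """Cross-entropy objective of Eq. 7 for an m x L one-hot target matrix y
    and estimated outputs y_hat with entries in (0, 1)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)    # avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Two samples, three classes: sample 0 belongs to class 0, sample 1 to class 2.
y = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
y_hat = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]])
print(round(multiclass_objective(y, y_hat), 3))   # value of J
print(y_hat.argmax(axis=1))                       # assigned classes: [0 2]
```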

The other approach to multiclass problems entails transforming the problem into several binary ones. One way of achieving this is the method of error-correcting output codes (ECOC) [56]. This technique transforms the output by utilizing a matrix \(M_{k \times c}\) with the number of classes \(\left( c \right)\) as columns and the number of classifiers (\(k\)) as rows. These schemes are often represented [99] as follows (a small sketch comparing the two decompositions follows the list):

  • One-versus-rest: one classifier is used for each class (\(k = c\)), which transforms a problem with \(c\) classes into \(c\) binary ones.

  • One-versus-one: a classifier is generated for every pair of classes, so that \(k = \frac{{c\left( {c - 1} \right)}}{2}\) classifiers are used.
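As an illustration of the two schemes (a minimal sketch using scikit-learn with a decision tree as the assumed base binary learner; it is not the setup of any specific reviewed paper), the number of binary classifiers each decomposition builds can be checked directly:

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)            # c = 10 classes
c = len(set(y))

ovr = OneVsRestClassifier(DecisionTreeClassifier()).fit(X, y)
ovo = OneVsOneClassifier(DecisionTreeClassifier()).fit(X, y)

print(len(ovr.estimators_), c)                 # one-versus-rest: k = c        -> 10
print(len(ovo.estimators_), c * (c - 1) // 2)  # one-versus-one:  k = c(c-1)/2 -> 45
```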

3 Methodology and technique of paper collection

This section explains the procedure employed to select, collect, and review papers. The search keywords, the databases or data sources, and the criteria for inclusion and exclusion are discussed. The study follows a systematic review approach [8] and is guided by the work of Agushaka and Ezugwu [12].

3.1 Search keywords

The study selected keywords as the search criteria applied to the data sources or databases to achieve the review objectives. These include multiclass, feature selection, metaheuristics, and algorithms. The search activities began in September 2021, and the final search for articles was done in March 2022. The papers in the search output were scanned for relevance, and additional articles were collected from their in-text citations and reference lists.

3.2 Academic data sources

This study used the appropriate keywords to search for and select relevant articles from the available literature. The target was restricted to articles published in trustworthy peer-reviewed journals, conference proceedings, and edited books indexed in the Scopus, Elsevier, IEEE Xplore, SpringerLink, ResearchGate, Google Scholar, and Web of Science databases. The study considered these repositories to contain high-quality, highly ranked, and internationally recognized articles published in SCI-indexed journals and conference proceedings. The number of articles reviewed sums up to 221. The keywords above formed the search basis in the named repositories for the period 2000 to 2022.

3.3 Inclusion or exclusion criteria

To extract only relevant articles from the literature, we framed a few criteria for inclusion and exclusion. After scanning the titles, abstracts, methodologies, conclusions, or, in some cases, the entire content, we included or excluded the articles using the set criteria. Table 1 shows the criteria for selection.

Table 1 Criteria for inclusion or exclusion

3.4 Eligibility

The eligibility of the selected articles was determined by the set inclusion or exclusion criteria. A total of 40 related articles were selected from Scopus, 67 from Elsevier, 40 from IEEE, 39 from Springer, 13 from Google Scholar, and 27 from Web of Science. On Scopus, 39 articles were published in peer-reviewed journals and one in conference proceedings; Elsevier contributed the highest number of articles; IEEE contributed 29 peer-reviewed journal articles and 11 articles in conference proceedings; from Springer, 32 articles in peer-reviewed journals and two articles in conference proceedings were included; 25 articles in peer-reviewed journals and two articles in conference proceedings were reviewed from Web of Science; and 11 journal articles and two book chapters were reviewed from Google Scholar, summing to 221 articles. The numbers of articles presented here were obtained after applying the inclusion and exclusion criteria; more articles were returned by the keyword search and were then excluded based on the exclusion criteria.

4 Metaheuristic algorithms for multiclass feature selection

Most metaheuristic algorithms for feature selection in classification address binary problems. However, several real-world problems are not dichotomous, i.e., 0 or 1, yes or no, true or false, present or absent. Therefore, extending binary feature selection to the multiclass setting becomes essential. Multiclass classification is more complex than the binary variant: in the binary case only the decision boundary of one class needs to be learned, while multiclass classification involves numerous boundaries and constraints beyond those of binary feature selection. This study considers multiclass variants of metaheuristic algorithms for obtaining the relevant features. We considered four categories of multiclass variants of metaheuristic algorithms: evolutionary-based, swarm intelligence-based, human-based, and physics-based; finally, we examined some hybrid versions that combine two or more metaheuristic algorithms to solve multiclass feature selection problems.

4.1 Evolutionary algorithms

We observed that this category of algorithms seems to have received the least attention in solving multiclass feature selection problems, with only a limited number developed within the period considered, i.e., from 2000 to 2022. Table 2 shows the list of available metaheuristic algorithms developed within the specified years. The popular genetic algorithm was combined with a support vector machine (SVM) classifier to solve binary and multiclass feature selection problems and parameter optimization in a hospital expense model [235]; its year of development is outside the scope of this study, although we consider it under the hybrid methods. Simon [222] proposed biogeography-based optimization (BBO), which was used to solve a real-world sensor selection problem for the health estimation of aircraft engines. The algorithm's performance was tested on fourteen standard benchmarks and compared with seven population-based optimization algorithms such as GA, DE, ACO, SGA, and PSO. The results show that BBO and SGA performed best on seven of the fourteen benchmarks. Table 2 presents the list of evolutionary-based metaheuristic algorithms for feature selection from 2000 to 2022.

Table 2 List of evolutionary-based metaheuristic algorithms

4.2 Swarm intelligence

Swarm-based metaheuristic algorithms have recently received the most significant attention from researchers. Swarm intelligence systems comprise a population of simple agents interacting locally with one another and with their immediate environment. They usually draw their inspiration from biological systems in nature. The agents follow simple rules, even though no central management structure controls how an individual agent should behave [213]. A clear benefit here is autonomy, since the agents are not controlled by external management and each agent represents a solution to a particular problem. Also, due to their self-coordination, the swarm is robust, with no single point of failure. Another benefit deduced from their behavior is self-organization [32]. Examples of renowned algorithms in this group include particle swarm optimization (PSO), ant colony optimization (ACO), artificial bee colony (ABC) optimization, and others. Many of these algorithms have been shown to provide good outcomes in a wide range of real-world applications [59]. Figure 6 shows the general framework of swarm intelligence algorithms, which follow some fundamental phases.

Fig. 6

General framework of swarm intelligence algorithms

This section describes swarm intelligence metaheuristic algorithms used for feature selection in multiclass classification problems, along with their modifications and the datasets used. Table 3 presents a comprehensive list of SI metaheuristic algorithms developed between 2000 and 2022 to solve feature selection problems. Although this might not be exhaustive, it shows the attention given to SI in the last two decades. Moreover, Fig. 7 illustrates the role of classifiers in feature selection.

Table 3 List of swarm intelligence-based metaheuristic algorithms
Fig. 7

Role of classifiers in feature selection

4.3 Ant lion optimizer

The ant lion optimizer (ALO) imitates the hunting mechanism of antlions in nature. It was developed by Mirjalili [152, 153] and is implemented using five significant steps: the random walk of ants, building traps, entrapping ants in traps, catching prey, and re-building traps. A new feature selection technique based on a modified ALO, called MALO, combined with a WSVM was proposed by Wang et al. [250] to reduce the dimensionality of hyperspectral images. They compared MALO with other well-known algorithms on several hyperspectral image datasets, and the results showed that MALO outperformed the other methods.

In Zawbaa et al. [277], an optimization technique for feature selection problems was proposed that studied a chaotic variant of the antlion optimizer. The chaotic system attempted to improve the trade-off between the exploitation and exploration phases, and the method was tested with various chaotic maps on several feature selection datasets. Medjahed et al. [149] presented a feature selection method and a complete cancer diagnosis procedure using kernel-based learning. SVM recursive feature elimination (SVM-RFE) was utilized to prefilter the genes, and SVM-RFE was then improved by utilizing the binary dragonfly algorithm (BDF). Experiments were conducted on six microarray datasets from the literature, and the results demonstrated the approach's efficacy, providing a higher classification accuracy with a reduced number of genes. Moreover, in [27], a novel search technique for minimal attribute reduction based on rough sets and ALO was proposed. The experiments used University of California Irvine (UCI) repository datasets, and the results indicated that the features selected by ALO are classified with satisfactory accuracy. Emary and Zawbaa [65] modified the ALO using Lévy flights, producing the Lévy antlion optimization (LALO) algorithm, which was embedded in a wrapper-based model to pick the optimal feature combination that maximizes classification accuracy while minimizing the number of selected features.

4.4 Artificial bee colony algorithm

ABC draws its inspiration from the intelligent food-search behavior of bee populations. Dervis Karaboga originally developed the ABC in 2005. The base algorithm performs a local search combined with a random search and has been used in hybrid optimizations. Shunmugapriya and Kanmani [221] proposed a new hybrid technique called AC-ABC, in which the features of the ABC and ant colony algorithms were combined to improve feature selection. They attempted to eliminate the stagnation behavior of the ants, as well as the time consumed by the global search for initial solutions, by using the selected bees. The ACO used the exploitation ability of the ABC to ascertain the best ant with the best feature subset, and the bees used the generated feature subsets as their food sources. Thirteen UCI benchmark datasets were used to evaluate the algorithm. Hancer et al. [90] proposed a new multi-objective ABC-based feature selection approach combined with non-dominated sorting and genetic operators. They developed two main strategies: ABC with binary representation and ABC with continuous representation. They examined the proposed algorithm's performance on 12 benchmark datasets, and the results show that the binary model performs better in dimensionality reduction and classification accuracy than the continuous model.

Further, Zhang et al. [284] studied a multi-objective feature selection approach known as the two-archive multi-objective artificial bee colony algorithm (TMABC-FS). They proposed two novel operators, a diversity-guiding search for onlooker bees and a convergence-guiding search for employed bees, to find a set of non-dominated feature subsets with good distribution and convergence. Two archives, an external archive and a leader archive, were employed to improve the search ability of the different types of bees. The hybrid algorithm was validated on several UCI datasets and compared with a few traditional algorithms and multi-objective techniques, with evidence that TMABC-FS is a robust and efficient method for solving cost-sensitive feature selection problems. Arslan and Ozturk [23] offered a new feature selection method based on ABC programming (ABCP), referred to as multi hive ABCP (MHABCP), to solve high-dimensional SR problems. They investigated the general performance of MHABCP and its learning capability using synthetic and actual high-dimensional SR datasets and compared it with other existing methods. The method outperformed the others in choosing relevant features and reducing the data dimensionality.

The authors in [252] proposed a fast multi-objective evolutionary feature selection technique known as FMABC-FS (fast multi-objective ABC). The algorithm was applied to many UCI datasets, and the results confirmed that the variable sample size strategy is well suited to FMABC-FS, which obtained optimal feature subsets in less running time than the other algorithms. Almarzouki [18] utilized the ABC algorithm to select fewer genes for the classification of cancer. This method employed CNNs for tumor classification without including labels. In the testing and training phases, three cancer datasets were utilized: kidney, lung, and brain cancer. The study also presented suggestions on how to further pre-process and modify gene expression data to improve cancer detection accuracy in future research.

4.5 Bat algorithm

The bat algorithm is inspired by the echolocation behavior of bats. The echolocation ability of microbats is interesting because the bats can locate their prey and distinguish different types of insects in total darkness [269]. Bats send out sound waves and receive the reflections to find their prey's path and location; on receiving the reflected waves, the bat that transmitted them builds an audio image of the surrounding obstacles and perceives them clearly [203]. Since its development in 2011, the bat algorithm has attracted considerable attention, and researchers have modified it and developed versions of it to solve feature selection problems. Although a year after its creation Nakamura et al. [171] produced a binary version called the binary bat algorithm (BBA), this review's focus is not on binary classification. Taha et al. [231] presented a hybrid of the nature-inspired bat algorithm and the naive Bayes classifier known as BANB. This approach was explored from four perspectives: classification accuracy, number of features, feature generalization, and stability. The algorithm's performance was assessed on twelve datasets taken from various domains and compared with three other popular feature selection algorithms. The results indicated that BANB outperformed the other methods in choosing fewer features while maintaining classification accuracy, and was more stable in the feature subsets it yielded.

Again, Jeyasingh and Veluchamy [116] proposed a modified bat algorithm (MBA). They used simple random sampling to choose random instances from the dataset and to remove irrelevant features from the original feature set, and applied the MBA to enhance random forest classification for identifying breast cancer occurrence. The MBA was modified by using simple random sampling to choose random instances of the dataset, which reduced the dimension of the feature set, and a random forest was trained on the selected features for classification. However, simple random sampling might not be appropriate in all situations and may lead to information loss. Tawhid and Dsouza [237] proposed a hybrid variant of the BA with an enhanced PSO, which was used to improve feature selection performance; the PSO algorithm was used to increase the convergence ability of the hybridized algorithm. Saleem et al. [209] developed a modified niche-based bat algorithm (NBBA) with a KNN classifier to solve the feature subset selection problem. The study used over twenty standard UCI repository datasets, and the results showed the capacity of the HBBEPSO in searching the feature space. Hammouri et al. [2, 89] adopted the BA to explore feature subsets optimally and increase cancer classification accuracy. They combined the wrapper and filter approaches to improve feature selection performance, where a robust mRMR was used as the filter method to choose the most appropriate features, and the improved BA served as the search strategy in the wrapper stage that selected the final features.

4.6 Chimp optimization algorithm

The chimp optimization algorithm (ChOA) is a recently developed metaheuristic method [130] inspired by the sexual motivation and individual intelligence of chimps in their collective hunting, which is quite different from that of other predators. It was originally developed to address two major problems: local optima trapping and slow convergence on high-dimensional datasets. After it was developed, Wu et al. [256] proposed an enhanced version called the enhanced chimp optimization algorithm (EChOA) to solve feature selection problems. That study found that, despite the division of the baseline algorithm's hunting strategy into four groups to diversify the population, the algorithm was still limited by local optimum trapping, which led the EChOA to introduce highly disruptive polynomial mutation (HDPM) to increase population diversity. Three strategies were introduced to improve the population's balance between the exploration and exploitation phases. The study employed a support vector machine as the learning algorithm, in a combination known as EChOA-SVM. The results of the method were compared with other well-known methods and evaluated using seventeen benchmark datasets from the UCI repository, showing the superiority of the method in terms of higher classification accuracy and fewer selected features on most datasets.

In the following year, Piri et al. [184] proposed a binary multi-objective chimp optimization algorithm (BMOChOA) with a dual archive and a k-nearest neighbors (KNN) classifier to mine relevant aspects of medical data. The study was evaluated using fourteen medical datasets of various dimensions, and the results showed better performance of the BMOChOA in terms of the selected features and classification accuracy. The main shortfalls of this method are its computational complexity and limited scalability. Furthermore, Pashaei and Pashaei [180] presented two binary versions of the ChOA to tackle the feature selection problem. In the first, S- and V-shaped transfer functions were utilized to convert the continuous search space to binary, and in the second, the crossover operator was used to improve its exploratory behavior. Five high-dimensional biomedical datasets and other datasets from domains such as text, life, and image were adopted to evaluate the performance of the approach. The study was compared with six popular feature selection methods, including PSO, GA, and ACO, and the method outperformed the others in the number of genes selected and the classification accuracy on most datasets.

4.7 Cuckoo search optimization algorithm

The CSO algorithm was inspired by careful observation of the behavior and reproduction tactics of cuckoo birds, and it is a popular algorithm for solving real-world problems. The way cuckoo birds lay their eggs and reproduce formed the basis of this algorithm. Like other evolutionary algorithms, the CSO algorithm commences with an initial population of cuckoos, which lay their eggs in the nests of host birds. Some eggs are more likely to grow into adult birds, while others are recognized and destroyed by the host bird. Surviving eggs indicate the higher suitability of the area and nest in the search space, and the aim of the cuckoo optimization algorithm is to discover the places with the highest egg survival rate [203]. In 2012, Tiwari [238] proposed a cuckoo-algorithm-based feature selection to solve the face recognition problem, applying the algorithm to an array of feature vectors extracted by the 2-D discrete cosine transform of an image. In the study, the algorithm searched the feature space for an optimal feature subset and employed a classifier to locate the best-matching image in the dataset using Euclidean distance. The algorithm outperformed PSO, leading to the conclusion that the cuckoo algorithm is more efficient for face recognition. Mousavirad and Ebrahimpour-Komleh [166] introduced a new approach based on the cuckoo search algorithm: the best feature subsets were first chosen using the CSO algorithm, and KNN was then used as the classifier. The proposed algorithm was assessed using several UCI repository datasets, i.e., Iris, Wine, Pima, Glass, and Breast Cancer, and compared with forward feature selection (FFS), backward feature selection (BFS), GA-based feature selection (GA-FS), and PSO-based feature selection (PSO-FS). The results indicate improved classification performance.

Similarly, Elyasigomari et al. [63] proposed a combination of cuckoo optimization and harmony search called minimum redundancy maximum relevance cuckoo optimization algorithm–harmony search (MRMR-COA-HS) to solve gene selection problems for cancer classification. First, MRMR was used to select the relevant genes, and these selected genes were then passed into the wrapper stage, which combined the novel COA-HS algorithm with an SVM classifier. They applied the method to several microarray datasets, evaluated using leave-one-out cross-validation. The performance assessment was conducted against other evolutionary algorithms, and the method notably outperformed them by selecting the fewest genes while maintaining the highest classification accuracy. Jayaraman and Sultana [113] introduced an artificial gravitational cuckoo search algorithm (AGCSA) with a particle bee optimized associative memory neural network (PBAMNN) to manage heart disease-related information obtained from the UCI heart disease dataset, which has 303 instances and 75 attributes of high dimensionality. The dimensionality of the features was notably reduced by the AGCSA, and the selected features were processed by the PBAMNN, which improved the heart disease recognition rate. In Mehedi et al. [150], a method was proposed to improve the performance of the multiclass support vector machine (MSVM) classifier by adopting the modified cuckoo search (MCS), called MCS-MSVM, and applied to the challenge of power quality disturbances. The method showed 100% classification accuracy when simulated under noise-free conditions and over 98% under various signal-to-noise ratio conditions; it was utilized in the electric power network for detecting and classifying power quality disturbances. The MCS reduced the feature dimension and the computational bottleneck by selecting a smaller subset of features.

4.8 Dragonfly algorithm

The dragonfly algorithm (DA) was inspired by the peculiar flocking behavior of dragonflies in nature. It was developed by Mirjalili [156] and has been applied to various problems. Zhang et al. [283] proposed a new feature selection technique known as IG-MBKH, in which information gain (IG) serves as a pre-screening feature ranking method and is integrated with an improved binary krill herd (MBKH) algorithm. The results indicate the ability of the IG-MBKH algorithm to achieve improved convergence, a reduced number of selected features, and accurate classification compared with several newer algorithms. Cui et al. [48] presented a new feature selection algorithm called the hybrid improved dragonfly algorithm (HIDA), which combines the advantages of an improved dragonfly algorithm (IDA) and mRMR to produce a suitable feature subset with higher classification accuracy. The performance of HIDA was examined using ten gene datasets and eight datasets from the UCI repository, and the results showed its superiority. Moreover, in 2021, Chantar et al. [36] also proposed an improved version of the DA, combining it with simulated annealing (SA) to resolve the local optima challenge of the DA and to enhance the technique's ability to select the best feature subsets for effective classification. The approach was tested on a set of frequently used datasets from the UCI repository, and the results showed the superiority of the hybridized approach. Other DA variations and application areas can be found in [198]. Abdelaziz et al. [2] presented an improved DA for feature selection challenges, proposing three variants of the binary DA (BDA) evaluated on 19 datasets from the UCI repository, 3 of which were multiclass, and comparing them with other methods from the literature; according to the study, these variants outperformed the other methods. However, their efficacy was not tested on more multiclass and larger datasets, which would prove their effectiveness in real-world scenarios.

4.9 Firefly algorithm

The firefly algorithm (FA) was inspired by the optical communication between fireflies during mating, in which information is exchanged through light flashes; it was introduced in 2010 by Yang [268]. Kanimozhi and Latha [120] proposed a new technique combining the SVM classifier with an earlier proposed binary form of the firefly algorithm to retrieve images. The algorithm was evaluated on Corel, Caltech, and Pascal database images to increase its performance with optimal features. Zhang et al. [281] proposed a novel variant of the FA for discriminative feature selection in classification and regression models to support decision-making procedures employing data-based learning approaches. The variant incorporated simulated annealing (SA)-enhanced local and global solutions, chaotic-accelerated attractiveness parameters, and diversion mechanisms for weak solutions to circumvent local optimum entrapment and lessen the premature convergence problem of the baseline FA. The study evaluated the efficiency of the proposed model using twenty-nine classification and eleven regression benchmark datasets. The technique showed statistically significant improvements over other well-known FA variants and traditional search methods on various feature selection problems. Chikara et al. [39] presented an upgraded firefly algorithm, dynamic FA (DyFA), for feature selection, which improved the convergence rate and reduced computational complexity using dynamic adaptation in blind image steganalysis. The study further reduced the computational complexity by using a hybrid DyFA designed with collaborative filter and wrapper techniques, employing an incremental SVM classifier with a radial basis function kernel under tenfold cross-validation to estimate the efficiency of the DyFA algorithm. Selvakumar and Muneeswaran [215] presented another FA-based feature selection technique for network intrusion detection; the method combines filter and wrapper feature selection, using C4.5 and a Bayesian network to pick the final feature subsets, and was applied to the KDD CUP 99 dataset. Marie-Sainte and Alalyani [148] proposed a feature selection approach based on the FA, in which the selected features were used with an SVM classifier to categorize Arabic texts.

4.10 Flower pollination algorithm

The flower pollination algorithm (FPA) was inspired by the pollination process of flowers. Some researchers have proposed binary versions of this algorithm, while others have modified or combined the binary versions to solve multiclass feature selection problems. In 2015, Zawbaa et al. [278] proposed a multi-objective flower pollination algorithm (MOFPA) hybridized with rough sets for feature selection, exploring the abilities of wrapper- and filter-based feature selection techniques; they tested the MOFPA on eight datasets from the UCI data repository. In 2017, Rajamohana et al. [195] proposed an algorithm that selected features using the adaptive binary flower pollination algorithm (ABFPA), a global optimization method applied to spam detection problems, with a naive Bayes (NB) classifier as the objective function. The results showed that the ABFPA technique selected only the informative feature set and gave higher classification accuracy compared with other competitive techniques. In 2019, Singh and Kaur [226] proposed the FPA to select optimal features, using three predefined feature selection algorithms to select the most crucial attributes for detecting anomalies. The performance was compared on over ten features from the Kyoto 2006+ dataset for intrusion detection systems (IDS).

4.11 Fireworks algorithm

The fireworks algorithm (FWA) by Tan and Zhu [232] is a swarm-based metaheuristic algorithm inspired by observing firework explosions, developed for the global optimization of complex functions. Xue et al. [261, 266] proposed a novel mathematical optimization model for supervised classification problems using the FWA directly for classification without modification. Based on the training set, a set of linear equations was constructed and an objective function was proposed, which was optimized by the FWA. Seventy percent of the samples were employed as the training set, and four different datasets were utilized in the experiment. The outcome indicated that the new approach could accurately identify the label set. Xue et al. [264] presented a self-adaptive FWA (SaFWA) to solve the optimization classification model competently. The SaFWA utilized four main candidate solution generation strategies (CSGSs) to increase the diversity of solutions. Eight datasets were used in the experiments to estimate the performance of SaFWA, and the results indicate that solving classification problems through SaFWA and the optimization classification model is feasible.

4.12 Gray wolf optimizer

The gray wolf optimization algorithm (GWO) was inspired by the hunting procedure of a pack of gray wolves in their natural environment [154]; it emulates their leadership hierarchy and hunting approach. The GWO has recently been used to solve feature selection problems in data mining. Emary et al. [64] proposed a feature selection method based on a multi-objective GWO that searches for the most appropriate and useful features, reducing the feature dimension. The hybrid approach exploited the lower computational complexity of the filter method to improve the wrapper method's performance. It was tested on different UCI datasets and achieved considerable robustness and stability. Li et al. [138] proposed a novel prediction framework that hybridized an improved GWO (IGWO) and a kernel extreme learning machine (KELM), known as IGWO-KELM, and applied it to medical diagnosis problems. The approach was compared with the base GA and GWO on some well-known disease diagnosis problems using performance metrics such as classification accuracy, number of selected features, G-mean, specificity, precision, and F-measure, and its results proved superior to its counterparts.

Moreover, Too et al. [239] proposed a novel binary variant of the gray wolf optimizer (CBGWO) to solve the feature selection challenge in the classification of electromyography (EMG) signals. They extracted time–frequency features from the short-time Fourier transform (STFT) coefficients, and the new method was used to evaluate the optimal subset from the initial dataset. Experimental results showed that the CBGWO was superior in feature reduction and classification performance, and its very low computational cost makes it appropriate for real-world applications. Sreedharan et al. [229] developed a facial emotion recognition (FER) system that can analyze essential human facial expressions such as normal, smiling, unhappy, angry, amazed, terrified, and irritated. The FER system’s recognition process was categorized into four activities: pre-processing, feature extraction, feature selection, and classification. After pre-processing, scale-invariant feature transform-based feature extraction was used to extract features from the facial points, and a neural network (NN) based on GWO was utilized to categorize the various emotions from the selected features. Kitonyi and Segera [133] presented a hybridization of the GWO and the gradient descent algorithm to resolve feature selection issues. They first compared the approach with the baseline GWO on twenty-three test functions, then developed three binary implementations and compared the final implementation against two implementations of the binary GWO and the binary GWPSO using six medical datasets from the UCI repository, considering accuracy, the number of selected features, precision, F-measure, and sensitivity.

4.13 Grasshopper optimization algorithm

The grasshopper optimization algorithm (GOA) was developed by Saremi et al. [212]. It mathematically models and mimics the swarming behavior of grasshoppers in nature and was applied to solving challenging problems in structural optimization. Aljarah et al. [16] proposed a hybridized approach based on the GOA to tune the SVM model’s parameters and simultaneously find the optimal feature subset. The method was tested on eighteen benchmark datasets of different dimensionality, and its performance was assessed against seven other popular algorithms; the results indicated that the approach outperformed the other methods on most datasets in terms of classification accuracy while reducing the number of selected features. Zakeri and Hokmabadi [276] proposed a new feature selection method called GOFS, based on mathematically modeling the interaction between grasshoppers when finding food sources. They modified the base GOA to make it suitable for feature selection and enhanced it with statistical measures during the iterations to replace duplicated features with the most promising ones. The approach was tested on various publicly available datasets, and its results were compared with twelve other prominent feature selection methods, indicating the significance of GOFS. The study by Khurana and Verma [132] proposed a combination of a tuned grasshopper optimization algorithm with classifiers. Their metaheuristic method aims to determine a significantly reduced feature subset from all features and improve classification performance. They used a random search technique for tuning the classifiers, adopted KNN and SVM, evaluated the approach on five multiclass datasets in terms of accuracy and the area under the curve (AUC), and computed the results with cross-validation techniques. The proposed method was compared with other algorithms and outperformed all the compared prominent methods.

4.14 Harris hawk optimization

After the HHO was proposed in 2019, some researchers applied, improved, and hybridized it to solve feature selection problems. The studies by Hu et al. [102] and Wei et al. [254] presented hybrid approaches of HHO and KELM. Wei et al. worked on an intelligent prototype to predict students’ entrepreneurial intention: the HHO was utilized to tune the KELM model, and Gaussian barebone was then used to improve the HHO algorithm, strengthening its ability to optimize KELM’s parameters and identify compact feature sets. The method was referred to as GBHHO-KELM. The study used the thirty benchmark problems of CEC 2014 and re-evaluated some popular techniques; the results showed that GBHHO-KELM achieved higher stability and better classification accuracy. Hu et al., in contrast, proposed an improved binary version of the HHO known as HHSORL and applied it to predicting the severity of COVID-19 infection. Hussain et al. [104] later proposed a hybrid version of HHO with the sine–cosine algorithm; the SCA was employed to address the ineffective exploration of HHO, and the new method, called SCHHO, was tested against other well-known methods. The outcome was increased convergence speed and significantly better search results with no added computational cost. Qu et al. [192] proposed a hybrid feature selection technique known as the variable neighborhood learning Harris hawks optimizer (VNLHHO). They also presented a new activation function to convert the continuous solutions of VNLHHO into binary values, and an NB classifier was used to select genes that can assist in classifying tissues of binary and multiclass cancers.

4.15 Krill herd algorithm

The herding behavior of individual krill inspired the krill herd (KH) algorithm, a swarm intelligence method introduced by Gandomi and Alavi [79] that also draws on bacterial foraging concepts. The objective function of the krill movement is determined by each krill’s minimum distance from food and from the highest density of the herd. After it was developed, other modifications and hybridizations were proposed for feature selection problems. In Wang et al. [247], the study proposed a series of chaotic particle swarm krill herd (CPKH) algorithms to solve optimization tasks within restricted time situations, using various one-dimensional chaotic maps in place of the method’s parameters. They assessed the approach on thirty-two benchmark functions and a gear train design problem, and the results revealed the accuracy and effectiveness of the CPKH method. Hafez et al. [87] proposed a hybridization of the monkey algorithm (MA) and the KH algorithm for feature selection, called MAKHA. The system was developed to quickly search the feature space for a near-optimal subset and minimize a given fitness function; it was utilized to choose the optimal combination of features, thereby reducing the data dimension, increasing classification performance, and selecting fewer features. The fitness function employed classification accuracy as the main objective and data size reduction as a secondary one. It was evaluated on eighteen datasets and proved its advancement over other methods. In Rani and Ramyachitra [199], a fish swarm optimization algorithm with SVM and random forest (RF) techniques for cancer feature selection and classification first reduced the datasets to a few features; an enhanced krill herd optimization (KHO) technique was then used to select the genes, and the RF technique, chosen for its classification accuracy, was utilized to categorize the types of cancer. They tested the efficiency of this method on ten different gene microarray cancer datasets, and the KHO/RF method outperformed the other methods with 100% accuracy on most datasets.

4.16 Polar bear optimization

The PBO algorithm was created by Połap and Woźniak [185], with its inspiration drawn from the survival of polar bears hunting for food in harsh weather conditions. The study modeled polar bear behavior as a search mechanism for optimal solutions: the simulated adaptation of polar bears to harsh winters served the exploration and exploitation phases, while a birth-and-death mechanism controlled the population. From the literature, we found that not much work has been done on developing other variants of the PBO applied to feature selection. In Haq et al. [91], a binary polar bear optimization was designed to solve the combinatorial problems of scalable unit commitment (UC); the method was also employed to solve the economic dispatch problem using the conventional lambda iteration method. The work in Mirkhan and Çelebi [159] proposed a hybrid technique of rough sets and polar bear optimization to find optimal feature reducts, which is the main goal of feature selection. Rough set theory is a potent technique for measuring the influence of every attribute in a dataset and the effect of removing an attribute on the dataset’s accuracy, while the heuristic algorithm plays an important role in avoiding the evaluation of all combinations of features. This method harnesses the polar bear’s dynamic population and its birth-and-death mechanism to quickly locate optimal solutions by removing irrelevant candidates and retaining promising ones. The method was able to find better optimal reducts considering population size, execution time, and number of iterations.

4.17 Red fox optimization

The red fox optimization was developed by Połap and Woźniak [186] and was inspired by the hunting methods of the red fox in nature. The study modeled the exploration phase as a global search based on the red fox’s territorial search for food when it spots prey far away, and the exploitation phase as a local search based on movement within the habitat to get closer to the prey before attacking. The algorithm was initially applied to solve optimization problems in engineering and has since been used to tackle feature selection problems. After its development in 2021, Khorami et al. [131] proposed a hybrid of red fox optimization and a convolutional neural network to detect the COVID-19 virus from sets of X-ray images. Their approach examined chest X-ray images using a new pipeline machine-vision-based system for more precise results: after the input X-ray images were pre-processed, the region of interest was segmented, and a combination of gray-level co-occurrence matrix (GLCM) and discrete wavelet transform (DWT) features was extracted from the processed images. The results suggest adequate efficiency in diagnosing the COVID-19 virus compared with other methods in the literature.

Moreover, Vaiyapuri et al. [244] developed a hybridized red fox optimizer with deep-learning-enabled microarray gene expression classification (RFODL-MGEC). Genes were the features, and the study aimed to improve classification performance by choosing suitable features: the RFO was utilized to obtain optimal feature subsets, and a bidirectional deep neural network was employed to classify microarray gene expression, with the network parameters tuned optimally using the CGO algorithm. The approach performed satisfactorily on the datasets used; however, it was not applied to large-scale datasets, which may reduce the model’s performance. Furthermore, Fu et al. [78] presented an improved variant of the red fox, called developed red fox optimization (DRFO), which was applied to detect skin cancer. Like Khorami et al. [131], this study employed a pipeline method to accurately diagnose melanoma from dermoscopy images: the region of interest was segmented using the kernel fuzzy C-means approach after image pre-processing, and an optimized multi-layer perceptron classifier was used for the final diagnosis.

4.18 Salp swarm optimization

The salp swarm algorithm (SSA) is a recently developed algorithm inspired by nature [158] to solve different optimization issues. The base algorithm was inspired by the swarming behavior of salps while navigating and foraging in aquatic habitats and was initially employed to solve engineering problems. Ibrahim et al. [106] presented a hybridized optimization technique for feature selection problems. The proposed algorithm combined the SSA and PSO, called SSAPSO, to improve the efficacy of both the exploitation and exploration phases. They tested the performance of SSAPSO in two experimental sequences: the first compared it with related methods using benchmark functions, while the second was used to determine the best feature set and remove irrelevant ones from the original datasets using various datasets from the UCI ML repository.

Also, Tubishat et al. [242] proposed a wrapper technique for selecting optimal feature subsets and solving feature selection problems. They introduced two enhancements into the base SSA: first, opposition-based learning at the initialization phase of SSA to improve its population diversity in the search space; second, a new local search algorithm used with SSA to enhance its exploitation. The KNN classifier was used on the training data as a fitness function. To validate the performance of the improved SSA (ISSA), they applied it to eighteen datasets from the UCI repository and compared it with four common optimization algorithms on four assessment criteria. The results showed that ISSA performed better than all the baseline algorithms in fitness-function accuracy, feature reduction on most datasets, and convergence behavior. Neggaz et al. [172] developed a new version of SSA for feature selection known as the improved follower salp swarm algorithm with sine cosine and disrupt operator (ISSAFD), which updates the followers’ positions in the SSA using sinusoidal mathematical functions inspired by the sine cosine algorithm (SCA). Twenty datasets were evaluated, four of which were multiclass and high-dimensional. The results showed the efficacy of ISSAFD in reducing the feature dimension and the number of selected features while improving specificity, accuracy, and sensitivity; the enhancement improved the exploration phase and avoided getting stuck in local regions. Hegazy et al. [96] improved the structure of the basic SSA to enhance solution accuracy, reliability, and convergence speed, also calling the result ISSA. Inertia weight was added as a new control parameter to adjust the best solution. The new method was merged with KNN for the feature selection task, and twenty-three UCI datasets were employed to test the performance of the ISSA algorithm. Tubishat et al. [242] and Hegazy et al. [96] thus did similar work in the same year on the same SSA and, with a few differences in their methods, both called their improved variant ISSA. Jain and Dharavath [112] presented a feature selection technique that improved the salp swarm optimization algorithm (SSOA), called memetic SSOA (MSSOA), which they transformed into a binary version to obtain the best classification accuracy. They compared the efficacy of MSSOA with five other metaheuristic algorithms on UCI datasets, and the approach was applied to detect plant diseases with superior performance.

4.19 Whale optimization algorithm

The WOA was inspired by the bubble-net hunting strategy of humpback whales and was proposed by Mirjalili and Lewis [157]. The algorithm comprises three operators simulating the humpback whales’ search for prey, bubble-net foraging, and prey-encircling behavior, and it has been used to solve many optimization problems in recent times. Mafarja and Mirjalili [146] proposed a novel wrapper feature selection method based on the WOA, noting that the algorithm had not been thoroughly applied to feature selection problems. They proposed two binary modifications of the WOA to search for the best feature subsets for classification: (1) roulette wheel and tournament selection mechanisms were used instead of a random operator in the search process, and (2) crossover and mutation operators were employed to enhance the WOA’s exploitation phase. They used twenty benchmark datasets in the approach.

Earlier, in 2017, the same researchers proposed a hybrid of WOA and a simulated annealing technique for feature selection; simulated annealing was adopted to enhance exploitation by searching the most promising regions located by the WOA. Nematzadeh et al. [173] presented a filter method of feature selection, which falls outside the scope of this study. Zheng et al. [289] presented a novel hybrid algorithm for feature selection known as the maximum Pearson maximum distance improved whale optimization algorithm (MPMDIWOA). First, a filter algorithm named maximum Pearson maximum distance (MPMD) was proposed based on the Pearson coefficient and correlation distance, with parameters introduced in MPMD to fine-tune the weights of redundancy and relevance. Second, the revised whale optimization algorithm acted as a wrapper algorithm. They verified this method using ten benchmark UCI machine learning datasets, and the outcomes indicated that the algorithm’s classification accuracy was significantly higher than that of the other compared algorithms.

In Bui et al. [33], a hybridization of the WOA and an adaptive neuro-fuzzy inference system, called WANFIS, was proposed to solve feature selection and land-pattern classification problems. The capital of Vietnam, Hanoi, was selected as a case study due to its complex surface. They compared the model’s performance with many benchmark classifiers using standard indicators such as the Kappa index and receiver operating characteristics, and the results showed that WANFIS outperformed the other approaches. In Mandal et al. [147], the researchers presented a three-stage wrapper–filter framework for feature selection to detect medical diseases. The approach adopted three classifiers, naive Bayes, SVM, and KNN, with the whale optimization algorithm as the wrapper-based feature selection method to reduce the feature subset and achieve higher accuracy in the third stage. In the first stage, filter techniques were applied, including Chi-square, ReliefF, and mutual information; in the second stage, the XGBoost algorithm was applied to obtain the best feature set. The efficacy of the approach was evaluated on UCI datasets, and the results displayed better performance than other popular techniques. In Too et al. [241], the authors developed two variants of the WOA, called the spatial bound whale optimization algorithm (SBWOA) and its simplified version S-SBWOA, to solve the multiclass high-dimensional feature selection problem. The study utilized sixteen high-dimensional feature selection datasets from the Arizona State University repository. The two variants outperformed other methods, producing higher validation accuracies and better mean fitness values, standard deviations, numbers of selected features, and computational times. The study showed significantly reduced features and increased accuracy, precision, and F-measure compared with the other eight methods; however, classifiers other than KNN could be adopted in the future to enhance the validation accuracy.

4.20 Human-related

This section presents a summary of human-based metaheuristic algorithms for multiclass feature selection. Since human activities differ, researchers have developed, and continue to propose, various algorithms that mimic human behavior to solve complex problems. Although the literature shows a limited number of metaheuristic algorithms developed under this approach, the most popular are teaching–learning-based optimization, brainstorm optimization, and the league championship algorithm. This section examines some of these algorithms applied to solve feature selection problems and their application areas. Table 4 lists the human-based metaheuristic algorithms developed to solve feature selection challenges over the years under consideration.

Table 4 List of human-based metaheuristic algorithms

4.21 Brainstorm optimization

Brainstorm optimization (BSO), inspired by the human brainstorming process, was proposed in [220] and has been applied to data classification. Pourpanah et al. [71] proposed a novel hybrid of BSO and the fuzzy ARTMAP (FAM) model, called FAM-BSO, as a feature selection and optimization technique for classification problems. The researchers employed ten benchmark problems and a real-world case study to appraise the hybrid’s ability. They statistically quantified the results using the bootstrap approach with high confidence intervals, which showed promising results compared with the original FAM and other procedures. Yun-Tao et al. [275] proposed a new two-phase evolutionary feature selection technique called the clustering-guided integer brainstorm optimization algorithm (IBSO-C). The study introduced a new strategy and an integer update scheme to improve the search performance of individuals in BSO. They compared the results with many existing algorithms on real-world datasets, and the results indicated that IBSO-C could select fewer features with high classification accuracy at a lower computational cost. Furthermore, Song et al. [228] developed an adaptive mechanism-based BSO (ABSO) algorithm built on chaotic local search. They tested its performance on twenty-nine benchmark functions to ascertain its effectiveness and stability and compared the results with other optimization algorithms; the results showed that ABSO outperformed the five compared algorithms in stability and convergence accuracy.

4.22 Gaining sharing knowledge-based algorithm

The GSK mimics the acquisition and sharing of knowledge over the human lifespan. It was developed by Mohamed et al. [163] for solving continuous-space optimization problems and is built on two essential stages: the junior and senior gaining-and-sharing phases. The performance of GSK was verified and analyzed using thirty test problems from the CEC2017 benchmark against ten recent well-known metaheuristic algorithms; the results showed the robustness, convergence, and solution quality of GSK, with excellent performance in solving optimization problems, particularly high-dimensional ones. Agrawal et al. [10] presented an improved form of the GSK algorithm to search for optimal feature subsets. They first presented a binary version of GSK using a probability estimation operator (Bi-GSK) on the two main pillars of GSK and then introduced chaotic maps to improve its performance. They used twenty-one UCI repository datasets to test the performance of Bi-GSK, and the results showed that Chebyshev’s chaotic map gave improved accuracy and convergence rate; it also outperformed other metaheuristic algorithms in fitness value, efficiency, and the number of selected features, thereby reducing the dimension of the original datasets. Agrawal et al. [11] extended the GSK algorithm from continuous search spaces to binary search spaces using a total of eight S-shaped and V-shaped transfer functions. They tested its performance on twenty-one UCI repository benchmark datasets, compared the results with some commonly used metaheuristic algorithms, and performed two nonparametric tests, which showed its statistical superiority.
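To make the role of these transfer functions concrete, the sketch below gives a minimal illustration of one S-shaped (sigmoid) and one V-shaped choice and of how a continuous position vector can be mapped to a binary feature mask; the specific functions, names, and values are illustrative and are not the exact eight functions used in [11].

```python
import numpy as np

def s_shaped(x):
    """S-shaped (sigmoid) transfer function mapping a continuous value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def v_shaped(x):
    """A common V-shaped transfer function based on the absolute hyperbolic tangent."""
    return np.abs(np.tanh(x))

def binarize(position, transfer=s_shaped, rng=None):
    """Convert a continuous position vector into a 0/1 feature-selection mask."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = transfer(np.asarray(position, dtype=float))
    return (rng.random(probs.shape) < probs).astype(int)

# Example: a six-dimensional continuous position becomes a binary feature mask.
print(binarize([0.8, -1.2, 2.5, 0.1, -3.0, 1.7]))
```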

4.23 Teaching–learning-based optimization

Rao et al. [200] developed teaching–learning-based optimization (TLBO) to optimize mechanical design problems. TLBO works on the effect of a teacher’s influence on learners and, as a population-based approach, moves solutions toward a global solution. They evaluated the algorithm’s effectiveness on five diverse constrained benchmark test functions with different characteristics and on real-world applications. Since it was first proposed, researchers have developed several versions and hybridizations to solve feature selection problems. Allam and Nandhini [17] utilized TLBO for optimizing features in the automatic diagnosis of breast disease. They employed the naive Bayes classifier to evaluate an individual’s fitness and a multilayer perceptron (MLP), J48, and random forest with logistic regression algorithms to estimate its efficacy. The results confirmed that the scheme achieved a higher accuracy rate on the Wisconsin Diagnosis Breast Cancer (WDBC) dataset in categorizing benign and malignant tumors. Sevin and Dökeroglu [217] hybridized the TLBO algorithm and extreme learning machines (ELM), called TLBO-ELM, to solve data classification problems, under which feature selection falls. It was tested on a set of UCI benchmark datasets, and its performance proved viable for binary and multiclass data classification problems compared with some commonly used algorithms.
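To illustrate the teacher–learner mechanism described above, the sketch below shows one TLBO iteration (teacher phase followed by learner phase) on a continuous minimization problem. It follows the commonly published form of the update rules rather than any of the cited variants, and the population handling and parameter choices are illustrative.

```python
import numpy as np

def tlbo_step(pop, fitness, rng=None):
    """One illustrative TLBO iteration (teacher phase + learner phase) for minimization."""
    if rng is None:
        rng = np.random.default_rng(1)
    scores = np.array([fitness(x) for x in pop])
    teacher, mean = pop[scores.argmin()], pop.mean(axis=0)

    # Teacher phase: move each learner toward the teacher and away from the class mean.
    for i in range(len(pop)):
        tf = rng.integers(1, 3)  # teaching factor, either 1 or 2
        cand = pop[i] + rng.random(pop.shape[1]) * (teacher - tf * mean)
        if fitness(cand) < scores[i]:  # greedy acceptance
            pop[i], scores[i] = cand, fitness(cand)

    # Learner phase: each learner interacts with a randomly chosen classmate.
    for i in range(len(pop)):
        j = rng.choice([k for k in range(len(pop)) if k != i])
        step = (pop[j] - pop[i]) if scores[j] < scores[i] else (pop[i] - pop[j])
        cand = pop[i] + rng.random(pop.shape[1]) * step
        if fitness(cand) < scores[i]:
            pop[i], scores[i] = cand, fitness(cand)
    return pop

# Example usage: a population of 10 five-dimensional learners on the sphere function.
rng = np.random.default_rng(1)
population = rng.uniform(-5, 5, size=(10, 5))
population = tlbo_step(population, lambda x: float(np.sum(x ** 2)), rng)
```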

Das et al. [51] proposed FSTLBO, a TLBO-based feature selection method for finding optimal feature subsets. The method replaces weak features with strong ones, and the results indicate a substantial performance enhancement compared with some other feature selection models. The method fulfilled its objectives of increasing performance, reducing computational cost, increasing accuracy, removing irrelevant data, and enabling faster model learning. Muhammad et al. [167] presented a novel text feature selection technique that combined rough set theory (RST) and TLBO, known as RSTLBO. The RSTLBO framework comprises four steps: acquisition of standard datasets, dataset pre-processing, feature selection using the RSTLBO method, and use of the selected feature set with the SVM technique. The results showed that the algorithm produced improved sentiment analysis. Rajinikanth and Pavithra [196] presented a density-based modified teaching–learning-based optimization (DMTLO) to select features, with KNN used to handle NaN values and classification done by SVM and ensemble classifiers; the results showed that DMTLO outperformed existing methods by generating the required number of attributes. Some binary variants and application areas of TLBO are not covered in this study; an example is the work by Chen et al. [38].

4.24 Physics-based

Physics-based metaheuristic algorithms were established based on the inspiration of physical laws. Some of these algorithms have been applied to feature selection problems in different application areas. This section examines a few physics-based algorithms and their variants proposed for solving feature selection problems in different areas of application. Table 5 presents the list of physics-based metaheuristic algorithms for feature selection issues.

Table 5 List of physics-based metaheuristic algorithms

4.25 Equilibrium optimizer

EO is a new metaheuristic algorithm in the physics-based category, inspired by the control-volume mass balance models used to estimate equilibrium and dynamic states, and was developed to solve various engineering problems [72]. It is regarded as one of the most influential, fast, and best-performing population-based optimization algorithms. A new wrapper-based feature selection algorithm combining EO with chaos theory was proposed in [214]. In this approach, the principles of chaos theory were employed to overcome the slow convergence rate and the tendency to become trapped in local optima that exist in the original EO. The approach embeds ten different chaotic maps into EO to overcome these difficulties and achieve a more robust and efficient search mechanism. The researchers also used eight different S-shaped and V-shaped transfer functions and tested the performance on fifteen benchmark datasets and four large-scale NLP datasets from the UCI repository. The results showed that this technique is highly competitive in finding optimal feature subsets.
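As a rough illustration of how chaotic maps can replace a uniform random number generator inside such an optimizer, the sketch below uses the logistic map, one of the maps commonly embedded in chaotic variants; the particular map, its parameters, and the way [214] injects the sequence into EO’s update equations are assumptions made for illustration only.

```python
import numpy as np

def logistic_map(length, x0=0.7, r=4.0):
    """Chaotic sequence in (0, 1) from the logistic map x_{k+1} = r * x_k * (1 - x_k)."""
    seq, x = np.empty(length), x0
    for k in range(length):
        x = r * x * (1.0 - x)
        seq[k] = x
    return seq

# The optimizer would draw its "random" coefficients from this deterministic
# chaotic sequence instead of calling a pseudo-random generator.
chaotic_coefficients = logistic_map(10)
print(chaotic_coefficients.round(3))
```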

In Elgamal et al. [61], the study produced a novel metaheuristic optimizer called the improved equilibrium optimization algorithm (IEOA) with two significant advancements over the original EO. The first applied elite opposition-based learning (EOBL) to improve population diversity, while the second integrated three new local search strategies to prevent entrapment in local optima. The IEOA improved population diversity, classification accuracy, the number of selected features, and convergence speed. They tested the performance on twenty-one biomedical benchmark UCI repository datasets, and the results showed that IEOA outperformed the original EO and other algorithms on most of the datasets used. Moreover, in [177], the researchers presented a hybrid feature selection approach based on the ReliefF filter technique and EO, called RBEO-LS, which has two phases: the first utilized the ReliefF algorithm as a pre-processing step to assign feature weights, and the second used binary EO (BEO) as a wrapper search technique. The performance was tested on sixteen UCI datasets and other high-dimensional biological datasets, and the results showed that RBEO-LS was superior to other well-known algorithms. Further work on feature selection for biological data classification using EO can be found in Too and Mirjalili [240].

4.26 Gravitational search algorithm

The GSA was introduced by Rashedi et al. [201]. The algorithm is based on the law of gravity and mass interactions: the search agents are a collection of masses that interact according to Newtonian gravity and the laws of motion. The algorithm was compared with some popular search techniques, and the results showed the high performance of the GSA in solving different nonlinear functions. Papa et al. [179] proposed a combination of algorithms that utilized the optimization behavior of GSA and the speed of the optimum-path forest (OPF) classifier, called OPF-GSA, to provide an accurate and fast framework for feature selection. Experiments on datasets obtained from the UCI and NTL repositories, covering image classification, vowel recognition, power distribution systems, and fraud detection, were conducted to evaluate the robustness of OPF-GSA against linear discriminant analysis (LDA), principal component analysis (PCA), and a PSO-based feature selection algorithm. Nagpal et al. [170] explored the power of the wrapper-based GSA in solving feature selection issues using biomedical datasets; this approach utilized the GSA and KNN to reduce the number of selected features while improving prediction accuracy. Ing et al. [107] presented a new system to determine the best daily network configuration based on variable photovoltaic (PV) output generation and load profile data; the results indicated that the proposed technique could improve distribution network performance in terms of power loss reduction, voltage profile improvement, and switching minimization.

Furthermore, Zhu et al. [290] proposed an improved GSA known as IGSA, which adopted the concept of global memory and an exponential Kbest definition to improve the baseline GSA. In this approach, the authors improved the exploitation ability of IGSA by memorizing the optimal solution obtained, thereby preventing the particles from premature convergence and slow movement and maintaining an equilibrium between exploration and exploitation. Moreover, in Taradeh et al. [236], a GSA-based algorithm with evolutionary mutation and crossover operators was presented to solve multiclass feature selection problems. The method used both KNN and decision tree classifiers on eighteen popular UCI datasets to assess its performance; it was compared with PSO, GA, and the baseline GSA and outperformed all of them. Kumar and John [135] presented a hybridized Gaussian-based particle swarm optimization gravitational search algorithm for large-scale feature selection. This method overcame the shortcomings of getting stuck in local optima and heavy parameter usage that affect GSA, PSO, and PSOGSA; SVM was used as the classifier, and the method was assessed on different benchmark datasets.

4.27 Sine cosine algorithm

Sindhu et al. [224] developed a sine cosine algorithm (SCA) for global optimization with a novel position update approach in which each search agent’s update is governed by two coefficients controlling the exploration and exploitation rates. These coefficients are updated in every run of the algorithm to provide an appropriate balance between the two phases, and the performance was evaluated using benchmark functions; experimental results showed faster convergence and attainment of the global best with higher accuracy. Hafez et al. [88] produced a feature selection system that utilizes the SCA. Typically, the SCA can search the feature space rapidly to find the best or a near-best subset of features by minimizing a given fitness function; they evaluated the approach on eighteen datasets, which showed an advancement compared with other procedures such as PSO and GA. Khamees and Rashed [129] presented a novel hybrid feature selection method combining SCA and CS to exploit and explore the search space and reach the best solution. The performance was tested on four UCI repository medical datasets and compared with the baseline SCA and other algorithms; the results indicated the method’s efficacy in exploring and exploiting the search space, selecting the fewest possible features, reducing the datasets’ dimensions, and achieving a high classification rate with low run-time on all datasets used. Moreover, Rehman et al. [202] developed a novel multi sine cosine algorithm (MSCA), which employs several swarm clusters to explore and exploit the search space and evade local minima or maxima. The researchers evaluated its performance on different benchmark functions, and the results showed the statistical superiority of MSCA in terms of convergence when tested against some commonly used metaheuristic algorithms.
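For reference, the sketch below implements the sine–cosine position-update rule in the form most commonly reported in the literature, with a coefficient that decays linearly to shift the search from exploration to exploitation; the variant of Sindhu et al. [224] modifies this update, so the parameters and form shown here are illustrative.

```python
import numpy as np

def sca_update(positions, best, t, max_iter, a=2.0, rng=None):
    """One iteration of the commonly used sine-cosine position update."""
    if rng is None:
        rng = np.random.default_rng(42)
    n, dim = positions.shape
    r1 = a - t * (a / max_iter)  # decays linearly: exploration early, exploitation late
    new_pos = positions.copy()
    for i in range(n):
        for d in range(dim):
            r2 = rng.uniform(0.0, 2.0 * np.pi)
            r3 = rng.uniform(0.0, 2.0)
            step = abs(r3 * best[d] - positions[i, d])
            if rng.uniform() < 0.5:
                new_pos[i, d] = positions[i, d] + r1 * np.sin(r2) * step
            else:
                new_pos[i, d] = positions[i, d] + r1 * np.cos(r2) * step
    return new_pos
```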

5 Hybrid methods

Hybrid metaheuristic algorithms combine two or more of the best operators of different metaheuristic algorithms to present a new, enhanced version of existing ones. Many hybrid algorithms have been proposed recently, and hybridization is attracting much attention from the research community. In feature selection, several hybrid metaheuristic algorithms have been proposed for applications such as biomedical data, image classification, and data mining, to obtain optimal feature subsets from original datasets containing irrelevant and redundant features. These hybrid algorithms help avoid entrapment in local optima and premature convergence and enable more efficient and effective exploration of the search space. This section explains some of the recently proposed hybrid approaches to selecting the best feature subset in different application areas.

Yousefpour et al. [274] presented a hybrid method with two metaheuristic algorithms to discover the best feature subset. The study treated the task in two stages: first, local solutions were obtained using filter–wrapper methods to reduce the huge-dimensional feature space; second, harmony search (HS) and GA were integrated to select the optimal subset. Experimental results on three widely used sentiment analysis datasets proved the approach’s superiority over other baseline methods in classification accuracy. Sindhu et al. [225] proposed a hybrid wrapper-based feature selection technique combining biogeography-based optimization (BBO) and the sine cosine algorithm (SCA), called IBBO, for solving feature selection issues. The method introduced the position update mechanism of the SCA into BBO to improve diversity within the habitats. When tested on fourteen benchmark datasets from the UCI repository and seven benchmark test functions, the method outperformed the compared approaches on most of the datasets.

Furthermore, Pirgazi et al. [183] presented a hybridized filter–wrapper metaheuristic gene selection approach based on the shuffled frog-leaping algorithm (SFLA) and the IWSSr method for high-dimensional datasets. The system has two main phases: a filter phase, which used ReliefF for feature weighting, and a wrapper phase, which used the SFLA and IWSSr algorithms to perform an effective feature search. The experimental results showed a more compact feature set with high classification accuracy. In Mohmmadzadeh [164], the study combined the whale optimization and flower pollination algorithms with an opposition-based learning approach to improve accuracy and convergence speed. The proposed algorithm’s performance was evaluated using ten UCI datasets for spam email detection; compared with some metaheuristic algorithms, the results were efficacious in classification accuracy and the average reduction of the selected feature set. Moreover, in Dey et al. [52], the study proposed a hybrid metaheuristic feature selection method combining golden ratio optimization (GRO) and equilibrium optimization (EO), called the golden ratio-based equilibrium optimization (GREO) algorithm, applied to speech emotion recognition. The features selected by the model were fed into an XGBoost classifier; linear prediction cepstral coefficient (LPCC) and linear predictive coding (LPC)-based features were considered as input and optimized using the GREO algorithm. Two standard datasets were used, and the method achieved high recognition accuracy and outperformed other well-known metaheuristic feature selection algorithms.

Similarly, Qian et al. [190] proposed an upgraded combined feature selection algorithm comprising nonlinear inertia weight binary particle swarm optimization with shrinking encircling and exploration mechanism (NBPSOSEE) and sequential backward selection (SBS), known as NBPSOSEE-SBS, to select the best feature subset, applied to electric charge recovery (ECR) risk prediction for power customers. The experimental results proved the effectiveness of NBPSOSEE-SBS in removing a significant number of irrelevant features and improving prediction results with lower execution time compared with one well-known algorithm and seven other popular wrapper-based feature subset selection techniques. Xue et al. [263] proposed a modern hybrid selection algorithm combining GA and PSO to enhance the model’s search capability, with KNN utilized as the classifier; the method’s performance was tested in simulations using learning arrays from UCI as benchmark datasets. Also in this section is the work of [6], which presented a new hybrid binary variant of an improved chaotic crow search and particle swarm optimization algorithm (ECCSPSOA) with KNN as the classifier for solving feature selection challenges. The CSA variant was hybridized with PSO for better search strategies and convergence to the global optimum within the search space. The method was evaluated on 15 UCI datasets against popular optimization algorithms using six different performance metrics. The findings demonstrated the efficacy of ECCSPSOA in obtaining a median accuracy rate of 89.67% over the fifteen datasets; the approach also outperformed some commonly used methods in standard deviation and fitness value, obtaining the best values on thirteen and eight of the datasets considered, respectively. However, a major drawback of ECCSPSOA is its selection of more features on seven of the fifteen datasets used. A newly proposed hybridized technique comprised the extended binary cuckoo search, genetic algorithm, and whale optimization algorithm, aimed at reducing the time required to search a huge database during image retrieval; this approach was compared with popular classification algorithms such as KNN, NB, RF, and CatBoost, considering recall, precision, error rate, and F-measure [118]. Isaac et al. [109] proposed a hybrid competitive coevolution model for feature selection that utilized two nature-inspired algorithms, the paddy field algorithm and spider monkey optimization. This approach employed SVM as the classifier and was evaluated on two datasets, producing better results than each algorithm individually; the hybridization was applied to diagnose pulmonary emphysema. Basu et al. [29] presented a combination of harmony search and an adaptive hill climbing approach to feature selection, using deep learning based on convolutional neural networks (CNNs) to extract features; the performance was better than other popular algorithms in detecting the COVID-19 virus on CT scan datasets.

6 Issues and challenges

Although metaheuristic algorithms have solved a series of feature selection problems, there are still some noticeable issues and challenges that we will discuss in this section.

6.1 Scalability and stability

In our world today, datasets are growing exponentially, ranging from thousands to millions of instances and features. Metaheuristic algorithms developed to solve feature selection problems must be scalable to handle the increasing volume involved in selecting the best feature subsets from the high-dimensional datasets available. The scalability of feature selection algorithms is regarded as a major problem because a sufficient number of samples is required to obtain accurate statistics. Bolón-Canedo et al. [31] noted that little attention has been given to the scalability of feature selection methods compared with the training of classifiers. The classifier used by the algorithm must also be scalable; otherwise, it cannot handle the classification task. We therefore conclude that scalability must be incorporated into the design of algorithms from the start, and extra attention needs to be given to the scalability of feature selection and classification to keep pace with the increasing growth of data.

The stability of designed algorithms is another vital issue to be addressed. As defined by Aggarwal et al. [7], stability is the sensitivity of the selection process to data perturbation in the training set. It has been observed that, after perturbation is introduced into the training samples, common feature selection techniques may select features with very low stability. In most cases, feature selection algorithms do not find the same subset for different sample datasets when attempting to obtain the best subsets for classification. Alelyani et al. [15] found that the fundamental characteristics of the data can significantly disturb an algorithm’s stability. A stable feature selection algorithm is as vital as classification accuracy. The authors in [31, 128, 246] proposed some solutions to stabilize feature selection algorithms; however, developing feature selection algorithms with both high precision and stable selection remains a challenge.
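One common way to quantify the stability described here, although not necessarily the measure used by Aggarwal et al. [7], is the average pairwise Jaccard similarity between the feature subsets selected on perturbed versions of the training data, as sketched below; the example subsets are invented for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two selected-feature index sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def selection_stability(subsets):
    """Average pairwise Jaccard similarity; 1.0 means the same subset was always chosen."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Subsets selected on three perturbed (e.g., bootstrap) samples of the same dataset.
print(round(selection_stability([{0, 3, 7, 12}, {0, 3, 8, 12}, {0, 2, 3, 12}]), 3))
```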

6.2 Multiclass classifiers

The choice of an appropriate classifier for any feature selection algorithm is key to its success in obtaining the best solution. Different classifiers have been used with metaheuristic algorithms to solve feature selection problems, such as k-nearest neighbor (KNN), support vector machine (SVM), naive Bayes (NB), artificial neural network (ANN), random forest (RF), kernel extreme learning machine (KELM), fuzzy rule-based (FR) classifiers, C4.5, ID3, and optimum-path forest (OPF). KNN is the most popular classifier in the literature, as indicated in Fig. 7, and can be applied to high-dimensional datasets. SVM is the next most used classifier for feature selection but was mainly applied to medical datasets such as cancer detection, artery disease, and intrusion detection systems. The proportions of the other classifiers are also shown in Fig. 7. Regarding classifiers for multiclass classification problems, Lin [140] presented an efficient classifier based on multivariate statistical analyses to deal with continuous data explosion and the computational complexity that degrades the performance and accuracy of classification models. The study exploits two advantages of multivariate statistical analyses: their ability to explore the relationships between variables and locate the most descriptive features of the examined data, and their ability to cope with problems affected by high dimensionality. The first advantage was applied to select relevant feature subsets and the second to generate the multivariate classifier. The experimental results indicated that their model could significantly reduce classification training time while maintaining accuracy in multiclass classification problems.

Also, their classifier’s discrimination degree outperformed other well-known classifiers. Additional studies modified or hybridized different classifiers for practical multiclass feature selection problems. Among these, the study in [103] hybridized feature selection with SVM recursive feature elimination (SVM-RFE) to examine classification accuracy in multiclass problems on the Dermatology and Zoo databases. Atallah et al. [26] designed an intelligent kidney transplant prediction method that solves the prediction problem by modifying the KNN. A system that categorized mammogram images into malignant, benign, and normal was created by Punithavathi and Devakumari [189]; the images were pre-processed, and features were extracted from the region of interest to train modified SVM and KNN classifiers. Moreover, Ezenkwu et al. [69] combined the SVM and random forest classifiers, and Sesmero et al. [216] formalized and evaluated an ensemble of classifiers designed to resolve multiclass problems.

In Yijing et al. [272], the study observed that recent work in the literature rarely focuses on imbalanced learning in multiclass settings; more attention has been given to binary imbalance and to balanced situations where standard classifiers perform well. They argued that, because imbalanced datasets differ in imbalance ratio, number of classes, and dimension, classifiers perform differently when learning from them. This motivated the proposal of a system of multiple classifiers called the adaptive multiclass classifier system (AMCS), which handles multiclass imbalanced learning and can differentiate various types of imbalanced data. AMCS combines three major components, feature selection, ensemble learning, and resampling, each selected discriminatively for different kinds of imbalanced data. The proposed AMCS was applied to the recognition of oil-bearing reservoirs, and the results showed its accuracy in recognizing layer characteristics from well logging data. Mustaqeem et al. [169] conducted a study to classify cardiac arrhythmia disease into one of sixteen categories using a wrapper-based feature selection method and SVM for multiclass classification. A dataset from the UCI repository was employed to test the performance of their approach, and the one-against-one SVM technique showed superiority over other classifiers, achieving accuracies of 81.11% on an 80/20 split and 92.07% on a 90/10 split. In Lausser et al. [137], the study incorporated feature selection processes into multiclass classifiers for low-cardinality, high-dimensional datasets. Their feature selection method does not fit precisely into the wrapper, filter, or embedded categories; they regarded their two feature selection network examples as loosely coupled to the structure of the multiclass classifier system and applied them to diagnostic phenotype prediction. They evaluated the approach using RD, SVM, and KNN on several multiclass microarray datasets, where it proved superior to other methods on most of the datasets used.
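As a small illustration of how a candidate feature subset is typically scored with the two most common classifiers in this literature, the sketch below evaluates a hypothetical subset on a small multiclass dataset with KNN and a one-against-one SVM; the dataset, subset indices, and parameter values are illustrative and not taken from any of the studies above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)           # small three-class dataset with 13 features
mask = np.zeros(X.shape[1], dtype=bool)
mask[[0, 6, 9, 12]] = True                   # hypothetical selected-feature subset

for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM (one-against-one)", SVC(decision_function_shape="ovo"))]:
    acc = cross_val_score(clf, X[:, mask], y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```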

6.3 Datasets

In proving the performance of a metaheuristic algorithm for feature selection problems, choosing suitable datasets is key to selecting the best optimal subsets for classification and for training the classifiers. The datasets provide scientifically proven data from diverse application areas for the performance evaluation of any algorithm. From the reviewed literature, we found that most researchers employed datasets such as Iris, Wine, Breast Cancer, and Heart Disease from the UCI machine learning repository for performance evaluation. Only a few studies employed datasets from other sources: [63, 149] employed microarray datasets, [226] utilized the Kyoto 2006+ dataset, [229] tested their approach using the JAFFE and Cohn–Kanade databases, [179] experimented with both UCI and NTL datasets, and Dey et al. [52] evaluated their hybrid approach using the speech emotion recognition (SER) datasets Surrey Audio-Visual Expressed Emotion (SAVEE) and Emotional DB (EmoDB). Benchmarking is sometimes challenging because performance evaluation depends on the particular dataset, classes, and classifiers; comparing feature selection metaheuristic algorithms on the same dataset with the same kind of classifiers is preferable. Moreover, some classifiers are well suited to multiclass feature selection problems, while others are suitable for binary problems. Table 6 presents a summary of commonly used datasets for feature selection problems, references to research works that used them, and the datasets’ attributes or the resource location for their attributes.

Table 6 Summary of commonly used datasets for feature selection problems

6.4 Objective function construction

A wrapper-based algorithm optimizes a particular objective function when selecting the best feature subset(s). Because most feature selection work in the literature is binary, the objective functions formulated typically involve maximizing classification accuracy, minimizing the number of selected features, or both. To reconcile these conflicting objectives, many studies constructed multi-objective formulations of the feature selection problem [9, 62, 65, 146, 223, 277] and converted them into a single-objective problem by applying weights to the two objectives before running the algorithm. This method has effectively and efficiently optimized the fitness function and located the optimal feature subset within particular datasets.
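A minimal sketch of such a weighted wrapper fitness function is given below, assuming a KNN classifier and a weight close to 1 on the error term, as is common in this literature; the function name, parameter values, and cross-validation setup are illustrative rather than taken from any specific study.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def feature_subset_fitness(mask, X, y, alpha=0.99):
    """Weighted single-objective fitness for wrapper feature selection (lower is better):
    alpha * classification error + (1 - alpha) * (selected features / total features)."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():          # an empty subset is assigned the worst possible fitness
        return 1.0
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X[:, mask], y, cv=5).mean()
    return alpha * (1.0 - acc) + (1.0 - alpha) * (mask.sum() / mask.size)
```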

Mathematically, an objective function can be represented as \(Z = ax + by\), where \(a\) and \(b\) are constant coefficients, \(x\) and \(y\) are decision variables, and \(Z\) is the objective to be minimized or maximized. Any problem that seeks to minimize or maximize such a function subject to constraints expressed as a set of linear inequalities is an optimization problem; it can be a single-objective or a multi-objective optimization problem. A generic multi-objective optimization problem [86] can formally be expressed as:

$$\begin{gathered} \mathop {{\text{minimize}}}\limits_{\left( x \right)} :\left( {f_{1} \left( x \right),f_{2} \left( x \right), \ldots ,f_{M} \left( x \right)} \right) \hfill \\ {\text{subject}}\;{\text{to}}:\;x_{{\text{L}}} \le x \le x_{{\text{U}}} \hfill \\ g_{a} \left( x \right) \le 0,\;a = 1, \ldots ,p \hfill \\ h_{b} \left( x \right) = 0,\;b = 1, \ldots ,q \hfill \\ \end{gathered}$$
(8)

Here, \(x \in {\mathbb{R}}^{D}\), where \(D\) is the number of problem variables, \(x_{{\text{L}}}\) and \(x_{{\text{U}}}\) represent the lower and upper bounds of the variables, respectively, \(f_{1} ,f_{2} , \ldots ,f_{M}\) are the \(M\) objective functions to be optimized, and \(p\) and \(q\) are the numbers of inequality and equality constraints, respectively. To extend population-based optimization algorithms to multi-objective problems, the authors in [82] used three broad categories of methodologies, which are discussed below:

1. Pareto-based methods: These methods use Pareto-dominance relations to assess the quality of the population. They are still in use today; however, their ability to handle problems with more than three objectives (many-objective problems) appears somewhat limited [110]. Fonseca and Fleming [76] first used the Pareto dominance relation in a multi-objective genetic algorithm, where the individuals in the population were ranked using Eq. 9 after evaluating the objective functions.

    $$r_{i} = 1 + p_{i},$$
    (9)

    where \(p_{i}\) represents the number of individuals that dominate the decision vector \(x_{i}\) (a small numeric sketch of this ranking appears after this list).

2. Decomposition-based methods: These methods use weighting vectors and a scalarizing function to break a multi-objective problem into a set of single-objective subproblems. Once this set of subproblems is solved, a reasonable approximation of the Pareto front is assumed to be obtained. Available evidence suggests that this way of handling multi-objective problems scales well to problems with many objectives, although difficulties still exist with this class of methods [82].

3. Indicator-based methods: This class of methods for solving multi-objective problems is considered promising, as in [251]; it relies on indicators developed to measure the quality of the solution set obtained from a population-based optimization technique. The hypervolume is the most notable of the many indicators proposed in the multi-objective optimization setting [292]. Zitzler et al. [291] noted that the need for comparative analyses of the strengths and weaknesses of various algorithms motivated the introduction of these indicators.
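A small numeric sketch of the ranking in Eq. 9 is given below for a minimization problem with two objectives (classification error and number of selected features); the candidate solutions are invented for illustration.

```python
import numpy as np

def pareto_ranks(objectives):
    """Rank each solution as r_i = 1 + (number of solutions that dominate it), minimization."""
    F = np.asarray(objectives, dtype=float)
    ranks = np.ones(len(F), dtype=int)
    for i in range(len(F)):
        for j in range(len(F)):
            if j != i and np.all(F[j] <= F[i]) and np.any(F[j] < F[i]):
                ranks[i] += 1  # solution j dominates solution i
    return ranks

# Four candidate subsets described by (classification error, number of selected features).
print(pareto_ranks([[0.10, 5], [0.12, 3], [0.15, 6], [0.09, 4]]))  # -> [2 1 4 1]
```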

In Mirjalili et al. [158], the researchers applied multi-objective optimization to engineering design problems. The formulation in their study, without loss of generality, is given as:

$$\begin{gathered} {\text{Minimize:}}\;F\left( {\vec{x}} \right) = \left\{ {f_{1} \left( {\vec{x}} \right), f_{2} \left( {\vec{x}} \right), \ldots ,f_{o} \left( {\vec{x}} \right)} \right\} \hfill \\ {\text{Subject}}\;{\text{to}}:\;g_{i} \left( {\vec{x}} \right) \ge 0,\;i = 1, 2, \ldots ,m \hfill \\ h_{i} \left( {\vec{x}} \right) = 0,\;i = 1,2, \ldots ,p \hfill \\ lb_{i} \le x_{i} \le ub_{i} , \;i = 1, 2, \ldots ,n \hfill \\ \end{gathered}$$
(10)

where \(o\) represents the number of objectives, \(m\) and \(p\) the numbers of inequality and equality constraints, respectively, and \(lb_{i}\) and \(ub_{i}\) are the lower and upper bounds of the \(i\)th variable.

6.5 Evaluation criteria for performance checking

Assessing a model’s performance reveals how well it performs on unseen data. In practical terms, making predictions about future data is the essence of a model, and it is the significant problem that predictive models intend to solve. As a result, there is a real need to understand the context before deciding on a suitable metric; each model addresses a problem with an objective using a separate dataset [5]. Numerous evaluation metrics have been used in the literature to assess the performance of wrapper-based feature selection metaheuristic algorithms. Recall and precision have been used in data classification in computer science, the area under the curve (AUC) in radar signal analysis, and specificity and sensitivity in medical classification. There are other ways to measure the performance of algorithms; a few well-known performance metrics not reviewed in [8] are discussed here:

1. Standard deviation (SD): This measures the variability among the solutions obtained over repeated runs. A high SD indicates that the solutions change substantially as the algorithm is executed repeatedly. It can be formulated mathematically as:

$$\sigma^{j} = \sqrt {\frac{{\sum\nolimits_{i = 1}^{N} {\left( {S_{i}^{j} - \mu^{j} } \right)^{2} } }}{N}}$$
(11)
2. Average of solutions: This is the average, over the runs, of the solution value, for example, the ratio of the number of selected features to the total number of features in the original dataset (a small numeric illustration of both metrics follows this list). It can be represented mathematically as:

    $$\mu^{j} = \frac{{\sum\nolimits_{i = 0}^{N} {S_{i}^{j} } }}{N}$$
    (12)
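A minimal sketch of how Eqs. (11) and (12) can be computed from the results of N independent runs is given below; the run results and the assumed total feature count are illustrative.

```python
import numpy as np

def run_statistics(S):
    """S[i] is the solution value (e.g., number of selected features) obtained
    on run i.  Returns (mean, standard deviation) following Eqs. (12) and (11)
    with N runs."""
    S = np.asarray(S, dtype=float)
    N = len(S)
    mu = S.sum() / N                                   # Eq. (12): average of solutions
    sigma = np.sqrt(((S - mu) ** 2).sum() / N)         # Eq. (11): standard deviation
    return mu, sigma

# Hypothetical results: number of selected features over 10 independent runs
selected_counts = [12, 14, 13, 12, 15, 13, 12, 14, 13, 12]
mu, sigma = run_statistics(selected_counts)
print(f"average = {mu:.2f}, SD = {sigma:.2f}")

# Average selection ratio relative to the original feature count (assumed here to be 60)
print(f"average selection ratio = {mu / 60:.3f}")
```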

7 Application area

Multiclass feature selection has been applied in different areas of human endeavor. This section details the diverse application areas of multiclass feature selection. Moreover, a few studies have combined feature selection methods with deep learning and machine learning approaches and applied them to real-world scenarios; these are also mentioned in this section.

The major areas of multiclass feature selection found in the literature include facial expression classification [229], cancer diagnosis or detection [66, 116, 149, 187], network intrusion detection [80, 194, 215, 262], text classification [132, 148, 265], classification and detection of other diseases [37, 113, 196, 206], image retrieval [118, 120]. Other application areas include oil-bearing recognition [169], cardiac arrhythmia disease detection [169], phenotype diagnosis [137], classification of power quality disturbance [150], human activity recognition [98], network traffic classification [30, 70, 219], emotion detection from speech signals [273], and sentiment classification [84].

In Bhardwaj et al. [30] and Shi et al. [219], robust techniques were proposed to solve traffic classification challenges on imbalanced datasets, although the two studies employed different approaches. The former called their method the global optimization approach (GOA), which selects the best features and recognizes stable ones. This method combined several popular feature selection methods to obtain optimal feature subsets from different traffic datasets and proposed a novel goodness measure inside the random forest, outperforming well-known feature selection methods in traffic classification. The latter adopted a combination of feature selection and deep learning, which proved better at overcoming the negative effects of multiclass imbalance and concept drift associated with machine learning methods. This approach comprised three main phases: first, useless features were removed using symmetric uncertainty; next, deep learning was applied to the selected features to reduce dimensionality and generate features; and finally, redundant features were removed using weighted symmetric uncertainty (WSU) (a minimal sketch of symmetric uncertainty is given after Fig. 8). In [34, 83], approaches were proposed that use radiographic or X-ray images to detect the COVID-19 virus by combining feature selection techniques with deep learning models, i.e., a CNN and a deep neural network, respectively. The latter study was carried out in Pakistan with an outstanding result and has been adopted for use in the Pakistan radiology department. Both studies reported accuracy above 98%, suggesting that combining deep learning and feature selection techniques can produce highly accurate models. Figure 8 shows the areas of application of multiclass feature selection in various fields of human endeavor. Recent research works have been conducted to improve models’ accuracy.

Fig. 8 Multiclass feature selection application areas
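Since the first phase in Shi et al. [219] filters useless features with symmetric uncertainty, the following minimal sketch shows how symmetric uncertainty between a discrete feature and the class label can be computed; the data, the entropy-based implementation details, and the idea of thresholding are illustrative assumptions rather than the authors’ exact procedure.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a discrete variable."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))        # joint entropy H(X, Y)
    mi = h_x + h_y - h_xy                  # mutual information I(X; Y)
    return 0.0 if h_x + h_y == 0 else 2.0 * mi / (h_x + h_y)

# Illustrative discrete feature and class label
feature = [0, 0, 1, 1, 0, 1, 0, 1]
label   = [0, 0, 1, 1, 0, 1, 1, 0]
su = symmetric_uncertainty(feature, label)
print(f"SU = {su:.3f}")   # features with SU below a chosen threshold would be discarded
```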

8 Discussion and future directions

This survey presents a study of the various multiclass feature selection techniques in the literature that have been applied to high-dimensional dataset classification. The study reveals the strengths and weaknesses of the feature selection metaheuristic algorithms discussed in this section and highlights some gaps for future research in this domain.

Furthermore, we found that variants of several metaheuristic algorithms have not yet been developed to solve multiclass feature selection problems. These algorithms include AAA, ACS, BCO, BMO, CGS, CHIO, CSO, EMA, EPC, EVOA, ES, FBIO, GbSA, HSO, HGSO, LCA, MBA, PFA, SFL, SSA, SSDO, TCO, TGSR, VCS, VPL, WSA, WSO, and many more. Developing their binary and multiclass versions will advance classification. Several existing methods could also be extended to solve real-world feature selection problems, which are often multiclass. The literature shows that researchers have faced challenges in obtaining the best subset for classification problems.

Metaheuristic algorithms perform differently on different datasets and problems; however, some of them have the following strengths:

  1.

    Ability to converge to a true global optimum.

  2.

    Good exploration and exploitation.

  3.

    Ease of implementation.

  4.

    Ability to perform both local and global search.

  5.

    Suitability for dynamic applications, since some can quickly adapt to change.

The limitations of some of them and possible solutions include:

  1.

    Existing methods are not scalable and are usually unstable when dealing with multiclass and higher-dimensional datasets.

  2.

    Fewer multiclass classifiers exist in the literature compared with binary ones. Strategies for tackling multiclass problems include using one of the limited multiclass classifiers or decomposing the problem into multiple binary subproblems and solving them iteratively with a binary classifier, which may be computationally costly (a minimal one-vs-rest sketch is given after this list).

  3.

    Existing algorithms suffer from slow convergence rates due to random generation of movement.

  4.

    They are prone to becoming trapped in local optima or converging prematurely. The use of metaheuristic algorithms in feature selection is certainly still evolving, and upcoming research can concentrate on combining metaheuristic algorithms to offset the shortcomings of single ones. Although developing such a framework may be demanding, it will present more effective and satisfactory results.

  5.

    They contain complex operators for selection and crossover.

  6.

    High computational time. Many of the alerts generated by algorithms in intrusion detection systems (IDS) are false alarms, caused by irrelevant and incomplete features and by duplication in IDS datasets, which undermines the detection rate. To overcome these challenges and build more accurate and efficient IDS models, researchers have applied preprocessing methods such as feature selection, normalization, and hybrid modeling techniques. We therefore recommend future work on hybridizing IDS models with metaheuristic algorithms to create feature selection for IDS with higher predictive capacity. IDS should also be modeled as multiclass problems in which detections are classified as, for example, severe, mild, high, medium, or low. Moreover, future work should reduce the false alert rate of metaheuristic feature selection methods.

  7.

    Many metaheuristic algorithms require tuning numerous hyperparameter values, and poor settings can lead to premature convergence. Future research should be directed toward validating the control parameters of metaheuristic algorithms, an area that few works in the literature explore. One helpful direction is testing metaheuristic hyperparameters under different control parameter values during the evaluation stage when assessing an algorithm’s practicality, since the accuracy of a model for a particular task depends on how its parameters are configured.

  8.

    Many of the algorithms were designed for binary or real-valued search spaces only.
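As noted in item 2 above, one strategy for multiclass problems is to decompose them into binary subproblems; the following minimal one-vs-rest sketch illustrates this decomposition using scikit-learn’s LogisticRegression as the binary classifier (the classifier choice and toy data are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class OneVsRest:
    """Decompose a multiclass problem into one binary problem per class and
    predict the class whose binary model yields the highest score."""
    def __init__(self, make_clf=lambda: LogisticRegression(max_iter=1000)):
        self.make_clf = make_clf

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = []
        for c in self.classes_:
            clf = self.make_clf()
            clf.fit(X, (y == c).astype(int))   # class c vs. the rest
            self.models_.append(clf)
        return self

    def predict(self, X):
        # Probability of the "positive" (class c) label from each binary model
        scores = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]

# Illustrative usage on a toy three-class dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4)) + np.repeat(np.arange(3), 30)[:, None]
y = np.repeat(np.arange(3), 30)
print(OneVsRest().fit(X, y).predict(X[:5]))
```

Each binary model is trained and queried separately, which is why this decomposition can become computationally costly as the number of classes grows.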

Given the limitations identified, we propose that, rather than presenting new swarm- and physics-based algorithms, emphasis should be placed on improving and hybridizing existing metaheuristic algorithms in these categories to minimize the identified disadvantages and apply them directly to multiclass feature selection problems. Comparatively few algorithms have been proposed in the evolutionary and human-based categories; we therefore suggest more novel work on evolutionary and human-based metaheuristic algorithms for feature selection difficulties that are specifically multiclass, since there is still a significant gap in the development of metaheuristic algorithms designed for multiclass classification. Many techniques used for multiclass problems in the literature combine two or more existing binary methods or rely on other forms of hybridization; however, little has been done specifically for multiclass problems.

The literature moreover reveals that researchers have faced diverse challenges in obtaining the best-selected feature sets in multiclass classification problems, as single classifiers seem less effective than combining more than one, which can merge the strengths of those classifiers into one improved approach. The most utilized classifiers for feature selection problems in the literature are KNN and SVM, which implies that future studies could consider other, less-used classifiers. Furthermore, we found that much work has been done applying feature selection to areas such as intrusion detection systems, cancer detection, image classification, multimedia, text classification [139], and many more, but little or no work has been done on drug classification, theft detection, and weather prediction. These can be good application areas for researchers in this domain to explore.

Finally, a few works were found in the literature where deep learning models combined with feature selection techniques achieved classification accuracy of no less than 98% [34, 83, 219]. Therefore, we conclude that future studies can be undertaken to harness the deep learning approach in combination with feature selection techniques. Our study also discovered that this hybrid approach has so far been applied only to disease diagnosis or detection, which mainly benefits the medical field. Future research can extend it to other real-world situations, such as technology (spam mail detection, security threat classification) and industry (customer purchase behavior prediction), rather than only the medical problems that currently dominate research. The future development of heuristic or metaheuristic approaches may tend toward the use of other popular classifiers, the creation of more hybrid local search techniques with metaheuristic methods for more accurate prediction as studied in [20, 36, 60, 136], the creation of multi-objective binary techniques, application to medical diagnosis challenges [180], the development of hybridized wrapper–filter approaches, and more real-world application areas [184].

9 Conclusion

The current value of data for knowledge discovery in the digital world has placed feature selection in an active and evolving state. Selecting meaningful data helps remove noisy or useless subsets from an original dataset, which in turn helps machine learning models learn from meaningful data for prediction and for solving real-life problems. Moreover, the rise in the quantity of digital data from sources such as web pages, image repositories, databases, social media platforms, and many more calls for improvements in data classification.

The recent rapid creation, exchange, and sharing of data make data analysis, extraction, and knowledge retrieval a very daunting task. First, there is a need to reduce the dimensionality of the available irrelevant or redundant data in order to extract knowledge and obtain insight from such data. The feature selection process is a crucial data preprocessing stage that helps minimize the data dimension for predictive models. Although this is a complicated and computationally demanding task, if not done correctly it may defeat the purpose of selecting relevant feature subsets and undermine the suitability of any predictive model in real life.

Feature selection is essential for enabling faster model performance, eliminating noisy and less useful data, improving a model’s accuracy and precision, removing unwanted features, and increasing generalization to test data. Although traditional feature selection methods have been adopted for classification tasks, these approaches have failed to significantly reduce the high dimension of the feature space, producing inaccurate and inefficient prediction models. Evolving methods such as metaheuristic optimization have provided an emerging standard for feature selection, as they yield precise classification results compared with traditional approaches. Metaheuristic techniques have the inherent ability to reduce computational and storage demands while improving categorization accuracy, and they have therefore been applied in more and more fields. Nevertheless, little is known about best practices for case-by-case usage of these evolving feature selection approaches. The findings in the literature continue to reveal the most effective way(s), which, if not accurately followed, may limit the predictive model’s real-world application, precision, and performance.

Based on the reviewed literature, feature selection in machine learning has attracted much attention. A systematic review was conducted in this paper, emphasizing wrapper-based metaheuristic algorithms for solving multiclass feature selection problems. Other studies in this area center on binary problems, text classification, multi-objective feature selection that introduces reliability to cater for missing data, swarm-based approaches, evolutionary-based approaches, and many more, but we found no work in the literature on multiclass feature selection problems considering the four categories mentioned earlier. Several techniques aimed at improving the performance of metaheuristics have been considered. The little work done on multiclass feature selection found in the literature combines two or more binary approaches to handle multiclass situations. Nevertheless, some metaheuristic algorithms can deliver higher performance than others in solving feature selection difficulties in high-dimensional datasets.