1 Introduction

Data mining is a rapidly growing paradigm in data science, driven by the massive volumes of data gathered daily and the pressing demand to turn these data into useful knowledge [46]. Data mining includes multiple preprocessing steps (e.g., integration, filtering, transformation, and reduction), knowledge presentation, and pattern evaluation [69]. Among these prerequisites, Feature Selection (FS), also known as variable (feature) subset selection, aims to reduce the high computational costs of heavy mining tasks by discarding potentially noisy, redundant, and irrelevant features that may degrade classification accuracy. The major objective of an FS procedure is to support the classification task by first finding a suitable subset of variables: starting from an initially given set of features, an optimal subset is sought that yields high classification performance.

The term FS denotes an algorithm that takes a feature set as input and outputs a subset of it. In supervised learning, a set of instances sharing the same attributes is available, along with an additional attribute representing the corresponding class. From these instances, a rule or a set of rules can be derived and generalized, enabling the instance set to be classified with high precision. Research on FS has been conducted since the early 1960s (see [27] or [103] for an overview and bibliographical notes).

Raw datasets by themselves rarely provide much information, so raw data must first be processed to extract useful knowledge or patterns. Deriving informative knowledge from such data therefore calls for the development of new effective and efficient methods, a need that is becoming increasingly pressing [37]. A real-world dataset typically contains as many features as are deemed necessary. Which features end up in the dataset depends on: i) the domain knowledge and, in turn, the essential features that capture this knowledge; ii) the availability of these features; iii) the resources available for collecting them; and iv) the resources available for storing, maintaining, and retrieving them. Hence, the full set of features stored in the resulting dataset is usually neither necessary nor sufficient for learning the concept of interest. Assuming the relevant features are present in the data, FS is the art of selecting a small feature subset that is ideally necessary and sufficient to describe the concept of interest [61].

FS is a recommended data-preprocessing phase in most machine learning tasks due to its potential to alleviate the curse of dimensionality in large and complex datasets. This helps reduce the data processing workload and decrease the computational cost of classification, thus saving resources while enhancing the algorithm's overall performance. A large number of applications depend on high-dimensional datasets with many features, and many of these features may be redundant or obsolete. Additionally, the database at hand may be affected by a high noise level, which may degrade the overall classification performance and increase the computational cost [69]. Hence, an efficient FS method is crucial to find the best subset of highly relevant and informative features while maintaining a representation of the original dataset with high classification accuracy. Ultimately, FS has three main advantages [46]: i) enhancing the predictor’s performance; ii) gaining a better understanding of the underlying process, leading to faster data processing; and iii) giving more cost-effective prediction models.

1.1 Motivation

Searching for an optimal variable subset is an NP-hard problem [19, 22, 36, 64]. Consequently, the optimal solution cannot be guaranteed except by an exhaustive search, which is quite unfeasible in this situation. Meta-heuristic techniques, which do not have to search the whole solution space comprehensively, permit obtaining reasonably good solutions; the quality of each “heuristic” solution obtained depends heavily on the method adopted. Meta-heuristic methodologies have proven highly effective on other optimization problems [23, 52, 56, 84, 110, 117]. In this field, we mention Differential Evolution (DE) [11], Ant Colony Optimization (ACO) [58], Particle Swarm Optimization (PSO) [77], the Genetic Algorithm (GA) [29], and, most recently, Harris Hawks Optimization (HHO) [132], the Whale Optimization Algorithm (WOA) [74], and the Salp Swarm Algorithm (SSA) [35]. In real-world applications, practitioners favor obtaining a good solution in a timely manner rather than insisting on optimal solutions. Therefore, we have opted for meta-heuristic techniques due to their practicality in real-world applications.

Among various meta-heuristics, swarm-based optimization algorithms have been used to tackle the problem of choosing the best solution (optimal feature subset) for FS [78]. The Sparrow Search Algorithm (SSA) is a novel swarm-based optimization algorithm recently introduced by Xue et al. [119] for solving continuous optimization problems. In this work, an improved binary version, called iBSSA, is implemented on top of SSA for solving the problem of wrapper-based FS. The SSA algorithm was chosen for the FS problem for several reasons. First, based on the detailed discussion in Sect. 2, meta-heuristic algorithms have proven superior in solving the FS problem; the performance of the new SSA algorithm, a novel member of this toolbox, is therefore worth testing. Second, SSA is a recently proposed meta-heuristic with fast convergence speed, stability, high searching precision, and the ability to escape from local optima [119], and it has yet to be systematically applied to the FS problem. Last, when the proposed iBSSA algorithm was compared with some high-performance optimization algorithms as well as other recent well-developed ones, it strongly supported the above claim by providing higher efficiency (i.e., fewer iterations or less computation time) and an optimal or near-optimal solution for most of the datasets involved. To deal with the FS problem more effectively, this work proposes a novel improved binary SSA that maps the continuous SSA into discrete values using a set of common S-shaped and V-shaped TFs, and incorporates two promising improvements: a roaming agents' (sparrows') re-positioning strategy and a local search method.

On the other hand, k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), and Random Forest (RF) are prevalent learning techniques in the machine learning paradigm. In this study, while iBSSA is adopted as a search optimization approach for finding the most relevant features, the diverse classifiers adopted herein (i.e., k-NN, SVM, and RF) are used as fitness evaluation functions (evaluators), so as to compose a new wrapper FS method. This research addresses only supervised FS methods. As mentioned, meta-heuristics have achieved great success in the FS domain. However, most of the proposed methods considered only the k-NN classifier and omitted SVM in many cases, while RF has been almost completely ignored, although SVM and RF typically provide better results than k-NN in diverse classification tasks [7, 25, 80]. Therefore, in this study, the presented iBSSA method is tested with the three aforementioned classifiers, which are used to estimate the performance of the proposed approach in terms of the fitness value inferred from the classification error rate.

A last motive is that many researchers have overlooked the roaming behavior of individuals as well as the premature convergence problem in meta-heuristic optimization when solving the FS problem [1, 71, 82]. For this reason, the present work accounts for both the re-positioning of roaming individuals, which promotes the search within feasible regions and helps amend the best solution faster, and a local search algorithm for enhancing the exploitation capability.

1.2 Contribution

To improve the exploration of SSA and to promote its exploratory power within feasible regions, a Random Re-positioning of Roaming Agents (3RA) strategy was proposed. Further, a novel Local Search Algorithm (LSA) was incorporated into the standard SSA algorithm to boost the exploitation process by enhancing the optimum solution obtained at the end of each main loop (iteration) of the algorithm. First, iBSSA was tested with nine common S-shaped and V-shaped TFs in order to determine the most effective one, i.e., the TF that allows iBSSA to exhibit the best convergence speed as well as the best averages of fitness, accuracy, and number of selected features. Then, in pursuit of a fair comparison, the best-performing TF was introduced into the proposed iBSSA as well as into the binary versions of the other competitor algorithms. To demonstrate the ability of the proposed approach to select the most relevant features, popular expert systems such as k-NN, SVM, and RF were used to estimate the average fitness value based on the classification error rate, averaged over 30 independent runs of the algorithm, on 18 multi-scale benchmark datasets carefully selected from the University of California Irvine (UCI) machine learning repository [38]. To sum up, the main contributions of this paper are outlined as follows:

  1. An improved binary version of SSA (iBSSA), which is enhanced through the 3RA strategy and the LSA method, is proposed for the first time based on a feature transformation method for wrapper feature selection in classification tasks.

  2. iBSSA was evaluated based on nine different TFs, including S-shaped and V-shaped functions, over 18 benchmark multifaceted, multi-scale UCI datasets.

  3. For a fair comparison, the TF most effective with SSA was also integrated into other well-known meta-heuristics, including SSA, ABC, PSO, BA, GWO, WOA, GOA, SFO, HHO, BSA, ASO, and HGSO, for binary conversion, which revealed the supremacy of the proposed iBSSA method for feature selection in classification when compared to other competitor algorithms, over all three classifiers adopted in this study (i.e., k-NN, SVM, and RF), for the majority of the datasets used.

  4. The final results were validated based on different performance metrics, including mean classification accuracy, mean fitness, and mean number of selected features, along with the respective standard deviation values.

  5. The supremacy of the proposed iBSSA method with the three classifiers was affirmed compared to competitors, based on Wilcoxon’s non-parametric statistical test at a significance level \(\alpha =0.05\).

1.3 Structure

The rest of the paper is organized as follows: Section 2 reviews the existing FS techniques; Section 3 presents a simplified formulation of the FS problem, a description of the original SSA, a brief discussion of the nine TFs, a discussion of the two improvements embedded into the SSA, as well as a discussion of the three classifiers adopted in this study: k-NN, SVM, and RF; Section 4 elaborates on and investigates the proposed iBSSA algorithm; Section 5 presents the computational results and comparisons with a few prominent rival algorithms; and the conclusions, together with some suggested prospective works, follow in Sect. 6.

2 An overview of feature selection methods

Dimensionality reduction is crucial, especially when dealing with a high-dimensional feature space; it is therefore commonly used in machine learning, where the original feature space is mapped onto a new space of reduced dimensionality. In classification tasks, it is extremely important to identify relevant features (reducing computational costs while maintaining the highest possible accuracy) as well as to recognize the features’ relative significance. Two different paths are usually taken for dimensionality reduction: constructing new dimensions or picking a subset of the original dimensions. The research literature has often distinguished feature selection from feature extraction. In feature selection, the aim is to find, from among a large set of features, the individual features that can most effectively boost the classification capability. More general methods, termed feature extraction algorithms, adopt transformation or combination techniques to create new features from the original feature set. Different weighting schemes have traditionally been used in feature extraction to generate a smaller number of ideally uncorrelated features, either by linearly combining features (as in Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA)) or by combining them non-linearly (as in neural networks).

Whether to use feature selection or feature extraction is the first critical issue. Each has its own limitations, and no clear-cut evidence exists for the superiority of one over the other across diverse tasks. Feature extraction methods such as PCA [39], which linearly combine all the available features to produce an ideally uncorrelated group of features, may struggle if the features are heterogeneous or if the data exhibit little of the correlation these linear combinations rely on. Similarly, for high-dimensional datasets it is unfeasible to perform an exhaustive search in FS, and even sequential search methods become impractical. Examining features individually is simple; however, it can yield poor feature subsets. On the other hand, trying different feature combinations consumes too much time and may exhaust computer memory. Finally, features that are important on training data do not necessarily exhibit the same level of importance on test data.

As mentioned, the FS problem is the only focus in this work.

Much applied research has been concerned with feature selection (also called variable selection) for datasets in which tens or hundreds of thousands of features are available. Machine learning tasks such as time-series prediction, regression, and classification may all involve FS problems.

A resurgence of interest has been devoted to applying FS methods to handle the large numbers of features encountered in different types of problems, for example:

  • Information fusion of multiple sensors’ data. For instance, both color and shape features have been merged in a trademark image database to provide a higher retrieval accuracy [54].

  • Pooling the parameters from different mathematical models for classification purposes, as in the integration of multiple models in [97].

  • Discovering the hidden relationships correlating a plethora of features, as the data mining application in [89].

The effectiveness of an inference model can be enhanced by appropriate feature selection, which grants significant merits such as [69]: i) improving the overall performance (simplicity of rules, predictive accuracy, or speed of learning); ii) supporting model selection via data visualization; and iii) dimensionality reduction and noise removal. Although feature selection offers many advantages, the risks of over-fitting or decreased accuracy may be encountered; these risks must be managed carefully to achieve the desired classification performance.

2.1 Typical Feature Selection techniques

FS techniques typically involve two procedures: exploring the feature space with an underlying search or ranking algorithm, and guiding that algorithm with a cost function, for example one measuring the classification error rate. The most important proposed approaches to evaluating this cost function are described below. FS algorithms usually belong to one of three modalities: filters, wrappers, and embedded methods [43, 63]. The filter-based approach [61, 65] does not refer to or learn from the target classifier while determining the fitness of an examined feature subset; the cost function is evaluated independently of the target classifier that will later classify independent datasets using the pre-selected feature subset. Instead, the cost function value is computed from a generic error estimation function that guides the search for an optimal subset, or from the ranking of the individual features in the feature search space. This procedure is flawed in that it ignores the potential effect of the feature subset on the learning algorithm. In contrast, in the wrapper-based approach [27, 120], the fitness of an examined feature subset is determined by submitting the subset to the target classifier (used as a black box) and estimating the resulting classification error rate, thereby ranking feature subsets according to their predictive power. Wrapper approaches can usually select more relevant feature subsets than filter approaches and thus give better classification accuracy. However, assessing the fitness of each examined subset through classification error estimation is more computationally intensive than in filter-based methods. Consequently, since wrapper-based methods must evaluate many subsets, they are often criticized as “brute force”, and an additional computational overhead may arise from the classifier-based evaluations.

Obviously, when the number of variables is not too large, an exhaustive search can be performed under either of these two approaches. However, the problem is well known to be NP-hard [19, 22, 36, 64], and the search quickly becomes computationally intractable. When the number of variables is too large, both approaches may incur high computational costs. In contrast to wrapper approaches, embedded methods determine the feature subset as part of the classifier design itself.

To solve the FS problem more pragmatically, other techniques have been adopted to find a good approximate solution, hopefully as close as possible to the optimal subset. Some of the potential feature subsets are examined intelligently and sequentially, based on “forward” selection or “backward” elimination criteria, and the optimal subset among all examined subsets is selected according to the best cost function evaluation. In the following, the most commonly used methods in this family are discussed. Sequential Forward/Backward Selection (SFS/SBS) [5] typically operates according to simple greedy deterministic heuristics. SFS [115] starts with an empty subset and sequentially incorporates new features into it, whereas in SBS [75] the selection process is reversed: SBS begins with the original set of all features and sequentially removes the least promising ones until a certain termination criterion is satisfied. Both methods suffer from the so-called “nesting effect”: features once selected by SFS cannot be discarded later, whereas features discarded by SBS cannot be re-selected. As a result, these methods are permanently suboptimal. In addition, forward selection and backward elimination require high computation time when the dimensionality is very high [63].

The nesting effect can be avoided by fusing SFS and SBS. To prevent the “nesting” impact, a method called “plus-l-take-away-r” has been suggested [104]. It applies SFS and SBS l and r times, respectively, and this fixed cycle of forward and backward selection is iterated until a predefined number of features is reached. Thus, the nesting effect can be avoided, since some of the features added in earlier steps can be removed in later ones. The method enables “fixed backtracking” in a top-down or bottom-up search, as determined by the values of l and r. Although this procedure partially overcomes the problem of nested features, another problem arises: it is not theoretically easy to determine appropriate values of l and r such that good enough solutions can be obtained with a moderate amount of computational resources. The plus-l-take-away-r method basically aims to counteract the nesting effect, and it can be implemented more efficiently by conditionally including and excluding features, guided by the value of the fitness itself.

In 1994, the concept of “floating feature search” was introduced by Pudil et al. [87], together with two “floating” search methods: Sequential Forward Floating Selection (SFFS) and Sequential Backward Floating Selection (SBFS). Floating selection methods are related to plus-l-take-away-r; unlike the latter, however, the numbers of forward and backward steps are not fixed beforehand but are controlled dynamically. In SFFS and SBFS, the number of features included or excluded changes at different stages of the procedure. These two methods are probably among the most effective FS techniques [53]. In a forward search, a null feature set is initialized and, at each step, the best feature satisfying the fitness function is added to the current feature set (i.e., one SFS step is performed). At the same time, the algorithm verifies whether the fitness can be improved by excluding some feature; if so, the set is refined by eliminating the worst feature, as judged by the fitness value (i.e., one SBS step is performed). Therefore, SFFS proceeds by dynamically increasing and decreasing the number of features in the subset until the desired number is reached. Similarly, SBFS starts with the full feature set and alternates SBS and SFS steps until the desired dimension is reached. Even though these two methods cannot always find the best feature subset, they perform very well in comparison with other search methods.
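The sequential and floating procedures just described can be illustrated with a short sketch. This is a minimal reading rather than the exact procedures of [5, 87, 104]: `score_fn` stands for an assumed wrapper evaluator (e.g., cross-validated accuracy of a classifier restricted to the candidate subset), `D` is the total number of features, and `d_target` is the desired subset size.

```python
import numpy as np

def sfs_step(selected, D, score_fn):
    # One SFS step: add the single feature that most improves the score.
    candidates = [f for f in range(D) if f not in selected]
    best_score, best_f = max((score_fn(selected + [f]), f) for f in candidates)
    return selected + [best_f], best_score

def sbs_step(selected, score_fn):
    # One SBS step: drop the feature whose removal hurts the score least.
    best_score, worst_f = max((score_fn([g for g in selected if g != f]), f)
                              for f in selected)
    return [g for g in selected if g != worst_f], best_score

def sffs(D, d_target, score_fn):
    """Sequential Forward Floating Selection: one forward step, then as many
    backward steps as keep improving the best score recorded for each size."""
    selected, best_by_size = [], {}
    while len(selected) < d_target:
        selected, score = sfs_step(selected, D, score_fn)
        best_by_size[len(selected)] = max(best_by_size.get(len(selected), -np.inf), score)
        while len(selected) > 2:                     # floating (conditional) SBS steps
            smaller, score = sbs_step(selected, score_fn)
            if score > best_by_size.get(len(smaller), -np.inf):
                selected, best_by_size[len(smaller)] = smaller, score
            else:
                break
    return selected
```

Plain SFS (or SBS) corresponds to calling only `sfs_step` (or `sbs_step`) in a loop, which makes the nesting effect described above explicit: a feature, once added or removed, is never reconsidered.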

Compared with the plus-l-take-away-r methods, floating search methods are particularly characterized by their ability to achieve good performance by making more than one sweep while pursuing the best subset of features. In practice, dynamic backtracking search performs very robustly and, among the different feature set search methods, the floating search procedures would be the first choice of many scholars [86]. Moreover, floating methods are just as efficient as the best sequential methods.

Despite the considerable intelligence exhibited by the floating methods, they still suffer from suboptimality. These sequential floating methods are criticized for being highly likely to fall into the trap of local optimal solutions, even when the problem scale is quite small and the fitness function is monotonic.

Other literature methods or algorithms adopted for the FS problem include the following:

  • The Branch-and-Bound (BB) FS algorithm was invented by Narendra and Fukunaga [81] to find the optimal feature subset. This method has one drawback: it requires the FS criterion function to be monotonic, so that adding new features to a feature subset can never reduce the criterion value. Unfortunately, the monotonicity condition is seldom satisfied. Moreover, in large feature spaces the computational cost is prohibitive: in the worst case an exhaustive search is performed, and the time complexity grows exponentially with the expansion of the feature space dimensionality. As pointed out in [53], it is still impractical to apply the BB method to very-large-scale problems.

  • The Max–Min (MM) FS method has been invented by Backer and Shipper [10]. It is a computationally efficient method in which only individual and pairwise merits of features are evaluated. This method invariably achieves unsatisfactory results [53, 62].

  • The use of GA for FS was first introduced by Siedlecki and Sklansky in 1989 [103]. In a GA approach, a binary string (a “chromosome”) of total length D represents a given feature subset, with a one or zero in the j-th position denoting the presence or absence of feature j in the set. The algorithm maintains a population of chromosomes, and the “fitness” of each chromosome determines how likely it is to survive and breed into the next generation. New chromosomes are created from old ones by two processes: i) crossover, where offspring are created by mixing parts (genes) of two different parent chromosomes; and ii) mutation, where a child is created by randomly perturbing the bits of a single parent (a minimal sketch of this encoding and these operators is given after this list). Other works that apply GA to the FS problem include those of Emary et al. [32], Jiang et al. [55], Raman et al. [90], Dong et al. [29], Bouktif et al. [14], and Das et al. [21].

  • The Tabu Search (TS) method was developed for feature subset selection by Zhang and Sun [128]. TS uses an adaptive memory (the tabu list) to keep track of solutions that have been visited and should be avoided for a number of iterations. The “tabu tenure” determines how long the tabu list retains a solution. In [128], a comparative analysis of the TS-based technique and other FS techniques (SFS, SBS, plus-l-take-away-r, SFFS, SBFS, and GA) was conducted, and the performance analysis shows TS to be a promising search “heuristic” for the FS problem.
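As forward-referenced in the GA item above, the following minimal sketch illustrates the binary chromosome encoding and the crossover and mutation operators. The population size, mutation rate, and random seed are illustrative assumptions; the fitness of a chromosome would be the wrapper accuracy of a classifier trained on the selected columns.

```python
import numpy as np

rng = np.random.default_rng(42)

def init_population(pop_size, D):
    """Each chromosome is a binary string of length D: 1 = feature kept, 0 = dropped."""
    return rng.integers(0, 2, size=(pop_size, D))

def crossover(parent_a, parent_b):
    """One-point crossover: the children exchange complementary gene segments."""
    cut = rng.integers(1, parent_a.size)
    return (np.concatenate([parent_a[:cut], parent_b[cut:]]),
            np.concatenate([parent_b[:cut], parent_a[cut:]]))

def mutate(chromosome, rate=0.02):
    """Bit-flip mutation: each gene is flipped independently with a small probability."""
    flips = rng.random(chromosome.size) < rate
    return np.where(flips, 1 - chromosome, chromosome)
```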

Despite some progress, the FS techniques currently available for large feature sets are not yet completely satisfactory. They are either computationally feasible but far from optimal, or optimal (or near-optimal) but unable to cope with the computational complexity of real-world FS problems. More powerful FS methods need to be developed that provide very good results with more efficient computation. Therefore, in the present work, a novel improved meta-heuristic strategy, called iBSSA, is proposed to solve the FS problem.

Several challenges usually arise when attempting to find an optimal subset of features using the aforementioned traditional techniques. Multiple search methods, such as breadth-first search, random search, depth-first search, or hybridizations of them, have additionally been tried to determine the optimal subset of features, and further techniques have been proposed in [2, 3]. However, the exhaustive search typically adopted there has been found time-consuming and unfavorable, especially with high-dimensional datasets [106]. Feature selection is arguably formulated as a combinatorial NP-hard optimization problem [19, 22, 36, 64]. For example, if the number of features in a given dataset is D, it is computationally expensive to find the optimal subset among \(2^D\) candidate combinations (solutions), since a full search requires \(2^D\) evaluations; forward selection and backward elimination, as used recursively in SFS and SBS, only partially alleviate this burden. In addition, local optima can be avoided more efficiently by using stochastic methods. Therefore, such methods, especially meta-heuristics (discussed below), have increasingly been attempted by researchers as wrapper-based approaches to find the best feature subset, thanks to a global search process that remains effective for large feature spaces [4, 101, 124].

2.2 Meta-heuristics-based feature selection techniques

In recent years, meta-heuristic optimization algorithms, which are very close to stochastic methods in generating and using random variables, have been incorporated into different application areas, including building construction, health-care, agriculture, computer engineering, and many more [117]. This may be due to the obvious merits of these algorithms, including their gradient-free nature, flexibility, simplicity, and independence of the problem of interest [51, 79]. Moreover, these algorithms can often find near-optimal solutions in comparatively small amounts of time, even for very large-scale problems. There is therefore a continuing appeal in solving complex optimization problems, such as the FS problem, using different advanced meta-heuristic algorithms [28]. Among many others, swarm intelligence and physics-based methods are two major categories of meta-heuristic algorithms [50, 130, 131]. Swarm intelligence mimics the collective social behavior of various species (i.e., animals or insects) in swarms for hunting or foraging, in which all individuals cooperate and share information during the optimization process. Representative algorithms include Artificial Bee Colony (ABC) [57], PSO [30], the Bat Algorithm (BA) [122], the Grey Wolf Optimizer (GWO) [79], the Firefly Algorithm (FA) [121], and Cuckoo Search (CS) [123]. The latest algorithms in this category include the WOA [78], the Bird Swarm Algorithm (BSA) [76], HHO [49], the Sailfish Optimizer (SFO) [98], the Grasshopper Optimization Algorithm (GOA) [93], and the Butterfly Optimization Algorithm (BOA) [9]. Moreover, the laws of physics in nature have been utilized to devise physics-based optimization methods, including Simulated Annealing (SA) [112], the Gravitational Search Algorithm (GSA) [91], Atom Search Optimization (ASO) [133], and Henry Gas Solubility Optimization (HGSO) [47]. To learn more about meta-heuristic algorithms, the reader is referred to [28].

In an attempt to tackle the FS problem, meta-heuristic methods have been tried by many researchers, owing to their simplicity and flexibility, and some of these works are reviewed here. Sharawi et al. [99] introduced a new variant of the WOA algorithm that employed the wrapper-based technique to select the feature subset yielding the best mean classification accuracy. Subsequently, Eid [31] suggested the S-shaped function with WOA for tackling FS problems. Two versions of WOA were introduced by Mafarja and Mirjalili [70]: in the first, roulette wheel and tournament selection were applied instead of a random operator, while in the second, mutation and crossover operators were used to boost the performance of the proposed algorithm. Furthermore, Sayed et al. [94] hybridized WOA with chaotic search to address the slow convergence speed and the stagnation in local optima potentially encountered while solving FS problems. Mafarja et al. [73] introduced two binary versions of the GOA algorithm: the first used S-shaped and V-shaped Transfer Functions (TFs), while the second employed a mutation operator to improve the exploratory ability of the proposed algorithm. Mafarja et al. [72] employed evolutionary population dynamics and selection operators to enhance the performance of the conventional GOA. Zakeri and Hokmabadi [125] hybridized the GOA with some statistical metrics to replace duplicated features with the most favorable ones. In a related context, Emary et al. [32, 33] introduced the GWO algorithm and two binary versions of the Lion Optimization Algorithm (LOA), based on S-shaped and V-shaped TFs. Additionally, Arora and Anand [8] proposed a novel binary variant of the BOA based on some of the above-mentioned TFs. Chen et al. [17] introduced a discrete PSO algorithm based on a logistic map sequence to improve swarm diversity. De Souza et al. [24] suggested a new variant of the Crow Search Optimization (CSO) algorithm using a V-shaped TF. Sayed et al. [96] hybridized the Flower Pollination Optimization Algorithm (FPOA) with clonal selection search, in which the optimum-path forest accuracy was applied to evaluate solution performance. Zawbaa et al. [127] integrated rough set theory into FPOA to solve FS. Sayed et al. [95] validated the performance of 10 chaos maps incorporated into both the CSO and SSA algorithms. Hegazy et al. [48] introduced a new control parameter to refine the best solution obtained so far, using the k-Nearest Neighbor classifier as an expert system to evaluate the relevance of the selected features. Moreover, Zhang et al. [129] hybridized the FA and SA algorithms with the aim of escaping from local optima while increasing the accuracy of the final solutions; several enhancements, such as return-cost attractiveness, Pareto dominance-based selection, and binary movement with adaptive jump, were employed in [129] in order to solve the underlying FS problems more effectively. Faris et al. [34] introduced a multi-verse optimization algorithm adopting the Support Vector Machine (SVM) classifier as a fitness evaluator. Gu et al. [41] suggested a new version of the PSO algorithm to solve high-dimensional FS problems. Many other meta-heuristic-based approaches were proposed for FS by, for example, Hafez et al. [44], Kashef and Nezamabadi-pour [58], Li et al. [67], Tabakhi et al. [105], Wang et al. [114], and many others.
In fact, the introduction of meta-heuristic algorithms to solve FS problems has been a trend for some years now. To learn more about meta-heuristic algorithms for FS, please refer to the review paper [100]. As discussed earlier, meta-heuristics have various advantages; does this mean the FS problem can be solved sufficiently by existing methods? The No Free Lunch (NFL) theorem [116] provides the appropriate answer: no single algorithm can optimally solve all optimization problems. For FS, an algorithm that performs very well on one kind of dataset may perform poorly on another. Hence, the search for an advanced meta-heuristic approach capable of handling almost all possible types of FS datasets is still an open research question.

3 Preliminaries

3.1 The Feature Selection (FS) problem

The problem of selecting a feature subset can be formulated as a combinatorial optimization problem whose major aim is to obtain superior classification performance, as follows. Let \(\mathbf {f}\) be an initial feature set with cardinality D, and let d be the number of features to be pursued in the selected subset \(\mathbf {s}\). Let \(f(\mathbf {s})\) be the fitness function of the selected subset \(\mathbf {s}\), and let \(c(\mathbf {s})\) be the misclassification error rate of a given classifier when presented with the feature subset \(\mathbf {s}\). Formally, the FS problem is to find a subset \(\mathbf {s} \subseteq \mathbf {f}\) such that \(|\mathbf {s}|=d\) and

$$\begin{aligned} f(\mathbf{s})=\min _{\mathbf {s}\subseteq \mathbf {f},|\mathbf {s}| =d,d<D}c(\mathbf{s}). \end{aligned}$$

Thus, a lower value of \(c(\mathbf {s})\) indicates a more desirable feature subset. In this case, the “goodness” of a selected subset of features is assessed and computed in the form of a fitness function as follows. Let C be a set of cases (i.e., all the instances or samples in a given dataset), where the class of each case is known (here, binary classification with only two classes is considered). C is partitioned as \(C=C_1 \cup C_2\), where \(C_1\) (training data) and \(C_2\) (test data) contain approximately the same number of cases and the same representation ratio for each class. The Euclidean distance is calculated from each case in \(C_2\) to every case in \(C_1\), and the class of the nearest case is assigned. The value of \(f(\mathbf{s})\) is the proportion of mishits among the assigned classes, that is, how many times the assigned class was NOT the real (true) class.
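A minimal sketch of this nearest-neighbour estimate of \(c(\mathbf{s})\) is given below; the array names and the use of NumPy broadcasting are assumptions for illustration, with `X_train`/`y_train` playing the role of \(C_1\) and `X_test`/`y_test` the role of \(C_2\).

```python
import numpy as np

def error_rate(X_train, y_train, X_test, y_test, subset):
    """c(s): fraction of cases in C2 whose nearest case in C1 (Euclidean distance,
    restricted to the selected features) belongs to a different class."""
    A, B = X_train[:, subset], X_test[:, subset]
    # Pairwise Euclidean distances between every test case and every training case
    dists = np.linalg.norm(B[:, None, :] - A[None, :, :], axis=2)
    predicted = y_train[np.argmin(dists, axis=1)]
    return float(np.mean(predicted != y_test))
```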

An exhaustive approach to this problem would require investigating all \(\begin{pmatrix}D\\ d\end{pmatrix}\) possible subsets of size d that can be derived from the feature set \(\mathbf {f}\). However, the exponential growth of possibilities makes exhaustive search impractical even for moderate values of D. As mentioned, FS is an NP-hard problem [19, 22, 36, 64], and heuristic and meta-heuristic techniques are therefore the most appropriate choice.

3.2 Sparrow Search Algorithm (SSA)

Sparrow Search Algorithm (SSA) [119] is a meta-heuristic optimization algorithm under the umbrella of swarm intelligence and computational intelligence. SSA is a natural extension of the BSA algorithm [76], which is derived from the social behavior and interactions of bird swarms. Sparrows are gregarious birds of several species distributed across the globe that prefer to live in areas inhabited by people. They belong to the family of omnivorous birds that usually feed on weed or grain seeds. The sparrow is intelligent in comparison with other small birds and has a strong memory, reflected in its anti-predation and foraging behaviors. Captive house sparrows are divided into two main types based on their foraging behaviors: producers and scroungers [12]. The producers actively pursue potential sources of food, while the scroungers obtain food from those producers. In addition, it has been shown that birds usually switch between producing and scrounging by flexibly applying behavioral strategies [13]. That said, sparrows usually use these same producer and scrounger strategies to find their food [68].

Literature studies have shown that each individual in a sparrow flock monitors the behavior of its neighbors. Meanwhile, individuals in the flock compete for the food resources of companions with high food intake, thereby increasing their own predation rate [66]. Moreover, when different foraging strategies are adopted by the sparrows, the energy reserves of the individuals play an important role in the pursuit of food, and the scrawny sparrows scrounge even more [66]. It should be noted that birds at the edge of the group are more vulnerable to attack by predators and therefore need to seek a better (safer) location [15], while the birds at the centre can minimize their domain of danger by moving closer to their neighbors [45]. It is well known that the bird family, especially sparrows, shows a natural instinct of curiosity along with constant vigilance: for example, when a predator is detected by a bird, one or more individuals give a chirp so that the entire group flies away from the source of danger [88]. Based on this description of sparrows, a mathematical model can be formulated to construct the SSA algorithm. In simulation experiments, the food source is pursued by virtual sparrows whose positions are expressed as:

$$\begin{aligned} \mathbf {x} = \begin{bmatrix} x_{1,1} & x_{1,2} & \ldots & x_{1,D} \\ x_{2,1} & x_{2,2} & \ldots & x_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \ldots & x_{N,D} \end{bmatrix}, \end{aligned}$$
(1)

where N represents the number of sparrows and D the number of dimensions to be optimized. The fitness values of all sparrows can then be represented by the following vector:

$$\begin{aligned} \mathbf {f}(\mathbf {x}) = \begin{bmatrix} f([x_{1,1}\; x_{1,2}\; \ldots\; x_{1,D}]) \\ f([x_{2,1}\; x_{2,2}\; \ldots\; x_{2,D}]) \\ \vdots \\ f([x_{N,1}\; x_{N,2}\; \ldots\; x_{N,D}]) \end{bmatrix}, \end{aligned}$$
(2)

where the value of each row in \(\mathbf {f}(\mathbf {x})\) represents the objective function value of the corresponding individual. In SSA, the producers with the best fitness values are given higher priority to obtain food during the search process. Moreover, because producers are primarily responsible for foraging and for guiding the movement of the entire swarm, they are able to search for food over a wider range than the scroungers. At each iteration t, the position of each producer i is updated in every dimension j as:

$$\begin{aligned} x_{i,j}^{t+1} = \left\{ \begin{array}{ll} x_{i,j}^{t} \cdot \exp \left( \frac{-i}{\alpha \cdot T} \right) & \mathrm{if}\ R_2 < ST, \\ x_{i,j}^{t} + Q \cdot L_{i,j} & \mathrm{if}\ R_{2} \ge ST, \end{array} \right. \end{aligned}$$
(3)

where \(j \in \{1,2,\dotsc ,D\}\) and T represents the maximum number of iterations. \(R_2 \in [0, 1]\) represents a warning (alarm) value, while \(ST \in [0.5, 1.0]\) denotes the safety threshold. \(\alpha \in (0, 1]\) and the normally distributed Q are two independent random numbers. L is a \(1\times D\) matrix in which every element is 1. \(R_2 < ST\) means that there are no predators around, allowing producers to search more widely. On the contrary, \(R_2 \ge ST\) implies that some sparrows have discovered predators, and consequently all sparrows need to move quickly to safer regions.

As mentioned above, some scroungers keep track of the producers until those producers find a good food source, at which point the scroungers leave their current position to pursue that source of food. If they win the competition, they immediately obtain food from the producers; otherwise, they continue to perform their main task of vigilance. The scroungers update their positions according to Eq. (4):

$$\begin{aligned} x_{i,j}^{t+1}= \left\{ \begin{array}{ll} Q \cdot \exp \left( \frac{x_{worst,j}^{t}-x_{i,j}^{t}}{i^2} \right) & \mathrm{if}\ i > N/2, \\ x_{p,j}^{t+1}+\left| x_{i,j}^{t}-x_{p,j}^{t+1} \right| \cdot A_{0,j}^{+} \cdot L & \mathrm{if}\ i \le N/2, \end{array} \right. \end{aligned}$$
(4)

where \(x_{p,j}\) is the best position obtained so far by the producers at the j-th dimension, and \(x_{worst,j}\) denotes the current global worst position at the j-th dimension. A is a \(1 \times D\) matrix whose elements are randomly assigned the value 1 or \(-1\), and \(A^+ = A^T {(A{A^T})}^{-1}\). L represents a \(1\times D\) matrix all of whose entries are 1. The case \(i > N/2\) implies that the i-th scrounger, whose fitness value is among the worst, is starving.

In the simulation experiments, the sparrows that are aware of the danger typically account for \(10\%\) to \(20\%\) of the whole swarm. The initial positions of these sparrows are randomly generated within the swarm, and their positions are updated as:

$$\begin{aligned} x_{i,j}^{t+1}= \left\{ \begin{array}{ll} x_{worst,j}^{t}+\beta \cdot \left| {x_{i,j}^{t} + x_{best,j}^{t}} \right| & \mathrm{if}\ f_i > f_g, \\ x_{i,j}^{t} + K \cdot \left( \frac{\left| x_{i,j}^{t} + x_{worst,j}^{t} \right| }{(f_i - f_w) + \varepsilon } \right) & \mathrm{if}\ f_i=f_g, \\ x_{i,j}^{t} & \mathrm{if}\ f_i < f_g. \end{array} \right. \end{aligned}$$
(5)

where \(x_{best,j}\) is the current global best position at the j-th dimension, \(\beta\) is a normally distributed random number (mean 0, variance 1) acting as a step-size control parameter, \(K \in [-1, 1]\) is a uniform random number, \(f_i\) is the fitness value of the i-th sparrow, \(f_g\) and \(f_w\) are the current global best and worst fitness values, respectively, and \(\varepsilon\) is a small constant that avoids division by zero. The algorithmic structure of the standard SSA is exhibited in Algorithm 1.

Algorithm 1 Pseudo-code of the standard SSA
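Since only the caption of Algorithm 1 is reproduced here, the following NumPy sketch illustrates one iteration of the three position updates in Eqs. (3)-(5). It is a simplified reading rather than the authors' exact implementation: the handling of \(\alpha\), \(\beta\), K, the producer count, and the 10-20% fraction of danger-aware sparrows follows the description above, and the signs inside the absolute values reproduce Eq. (5) as printed.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssa_update(X, fit, T, n_producers, ST=0.8, eps=1e-50):
    """One iteration of the standard SSA position updates, following Eqs. (3)-(5).
    X: (N, D) positions sorted by ascending fitness (best sparrow first);
    fit: (N,) fitness values aligned with X; n_producers: number of producers."""
    N, D = X.shape
    best, worst = X[0].copy(), X[-1].copy()
    f_g, f_w = fit[0], fit[-1]

    # Producers, Eq. (3)
    R2 = rng.random()                                # alarm value in [0, 1]
    for i in range(n_producers):
        if R2 < ST:                                  # no predator: search widely
            alpha = rng.uniform(1e-12, 1.0)          # alpha in (0, 1]
            X[i] = X[i] * np.exp(-(i + 1) / (alpha * T))
        else:                                        # predator detected: move away
            X[i] = X[i] + rng.normal() * np.ones(D)  # Q * L

    # Scroungers, Eq. (4)
    xp = X[0]                                        # best producer position
    for i in range(n_producers, N):
        if (i + 1) > N / 2:                          # worst-ranked scroungers are starving
            X[i] = rng.normal() * np.exp((worst - X[i]) / (i + 1) ** 2)
        else:
            A = rng.choice([-1.0, 1.0], size=D)
            X[i] = xp + np.abs(X[i] - xp) * (A / D)  # A^+ = A^T (A A^T)^{-1} for a 1xD row

    # Danger-aware sparrows, Eq. (5): a random 10-20% of the swarm
    n_aware = max(1, int(rng.uniform(0.1, 0.2) * N))
    for i in rng.choice(N, size=n_aware, replace=False):
        beta = rng.normal()                          # step-size control (assumed N(0, 1))
        if fit[i] > f_g:
            X[i] = worst + beta * np.abs(X[i] + best)
        elif fit[i] == f_g:
            K = rng.uniform(-1.0, 1.0)
            X[i] = X[i] + K * (np.abs(X[i] + worst) / ((fit[i] - f_w) + eps))
        # fit[i] < f_g: the position is left unchanged
    return X
```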

3.3 Transfer Functions (TFs)

Since the final solution obtained by SSA is composed of continuous values, SSA cannot be used directly to solve an FS problem. Therefore, a mapping (transfer) function is needed to transform the continuous values into binary 0s and 1s. Transfer Functions (TFs) [77] determine the rate of change in the values of the decision variables from 0 to 1 and back. When selecting a TF to transform values from continuous to binary, some concepts should be taken into consideration, from an SSA perspective, as follows:

  • The range of values obtained from a TF should be within the interval [0, 1], representing the probability whether a sparrow will change its current location.

  • If the alarm value \(R_2\) is lower than the safety threshold \(ST\), then the TF should present a higher probability of changing the current location at the next iteration, since sparrows whose \(R_2\) exceeds \(ST\) are probably moving far away from the best solution.

  • When \(R_2\) is small, the TF should provide a small probability of changing the current location.

  • In summary, the probability returned by the TF should increase as \(R_2\) approaches \(ST\), so that sparrows that are moving away from the best solution can have a higher probability of changing their location vector, which enables them to return to their previous best personal location as soon as possible throughout the next iterations.

  • Also, the probability obtained from a TF should decrease as \(R_2\) value keeps away from \(ST\) value.

These concepts ensure that TFs are capable of mapping the continuous search process into a binary one for each individual \(\mathbf {x}\), using Eq. (6):

$$\begin{aligned} {\left( x_{i,j}^{t+1}\right) }_{\mathrm {bin}}=\left\{ \begin{array}{ll} \left\{ \begin{array}{ll} 0 & \mathrm{if}\ rand<TF\left( x_{i,j}^{t+1}\right) ,\\ 1 & \mathrm{if}\ rand\ge TF\left( x_{i,j}^{t+1}\right) , \end{array}\right. & \text {if TF is S-shaped},\\ \left\{ \begin{array}{ll} \lnot {\left( x_{i,j}^t\right) }_{\mathrm {bin}} & \mathrm{if}\ rand<TF\left( x_{i,j}^{t+1}\right) ,\\ {\left( x_{i,j}^t\right) }_{\mathrm {bin}} & \mathrm{if}\ rand\ge TF\left( x_{i,j}^{t+1}\right) , \end{array}\right. & \text {if TF is V-shaped}, \end{array}\right. \end{aligned}$$
(6)

where \({\left( x_{i,j}^{t+1}\right) }_{\mathrm {bin}}\) represents the j-th dimension of the i-th individual at the current iteration \(t+1\), \(rand\) is a number selected randomly from within the range [0, 1], and \(TF\left( x_{i,j}^{t+1}\right)\) is the probability value obtained when applying a given TF to the continuous value of every j-th component of agent i. It is clear from Eq. (6) that there are two cases: i) if the TF is S-shaped, then if \(rand\) is less than the probability returned by the involved TF, the j-th dimension of the original individual is set to 0; otherwise, it is set to 1; and ii) if the TF is V-shaped, then if \(rand\) is less than the probability returned by the involved TF, the j-th dimension is negated; otherwise, it remains unchanged. Thus, by using the S-shaped and V-shaped TFs and Eq. (6), continuous variables are successfully mapped into binary ones.
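A short sketch of Eq. (6) is given below. The specific `s1` and `v1` expressions are the usual first members of the two TF families (an assumption here, since the body of Table 1 is not reproduced), and `rng` is a NumPy random generator.

```python
import numpy as np

def s1(x):
    """S1 transfer function: the sigmoid 1 / (1 + e^(-x)), with values in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-x))

def v1(x):
    """V1 transfer function: |tanh(x)|, also with values in [0, 1]."""
    return np.abs(np.tanh(x))

def binarize(x_new, x_old_bin, tf, shape, rng):
    """Component-wise application of Eq. (6) to a continuous position x_new;
    x_old_bin is the previous binary position (needed for V-shaped TFs)."""
    prob = tf(x_new)
    r = rng.random(x_new.shape)
    if shape == "S":                                       # S-shaped branch of Eq. (6)
        return np.where(r < prob, 0, 1)
    return np.where(r < prob, 1 - x_old_bin, x_old_bin)    # V-shaped: flip or keep
```

For example, `binarize(x, x_bin, s1, "S", np.random.default_rng())` maps the continuous position `x` to a binary vector using the S1 transfer function.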

Table 1 reports the two families of TFs, while Fig. 1 exhibits their behavior visually, divided into S-shaped and V-shaped transfer functions. It should be pointed out that the proposed iBSSA method was evaluated based on the nine TFs whose mathematical expressions are shown in Table 1.

Table 1 S-shaped and V-shaped families of TFs
Fig. 1 Families of transfer functions (S-shaped and V-shaped)

3.4 Learning algorithms adopted in this study

k-NN, SVM, and RF are widespread learning algorithms from different families in the machine learning paradigm. In this article, while iBSSA is used as a search optimization strategy, these classifiers are used to design a new wrapper FS model that assesses the goodness of each feature subset. These classifiers (or expert systems) are described as follows:

3.4.1 k-Nearest Neighbor (k-NN)

k-Nearest Neighbor (k-NN) [18] is a highly popular pattern recognition and machine learning algorithm. It is commonly used due to its simplicity of implementation compared with other, more complicated supervised machine learning algorithms [118]. As a pattern classification technique, k-NN is applied in diverse fields, such as healthcare, forestry, image and video recognition, finance, and many more. Within a wrapper FS method, k-NN serves as the evaluator whose classification rules are produced directly by the training instances. After k-NN learns from the training process, the unknown instances in the test set are classified based on their vicinity to the instances of the training set, so that an unlabeled instance is assigned to the category with the largest probability among its neighbors. However, the selection of k in k-NN is crucial and is therefore typically chosen after various trial-and-error experiments. In this study’s empirical experiments, the selected feature subsets are validated using the k-NN classifier (\(k=5\) [70, 71, 108]) with the Euclidean distance metric.
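A minimal sketch of how this k-NN evaluator could be wired up with scikit-learn is shown below; the 80/20 stratified hold-out split is an illustrative assumption, as the exact validation protocol is specified elsewhere in the paper.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_error_rate(X, y, subset):
    """Error rate of a 5-NN classifier (Euclidean distance) restricted to the
    feature columns indexed by `subset`."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, subset], y, test_size=0.2, stratify=y, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    clf.fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)
```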

3.4.2 Support Vector Machine (SVM)

Support Vector Machine (SVM) [113] is a widely used classifier, employed here within the wrapper framework, in which hyper-planes are applied to separate multiple classes. SVM is widespread in the data science community as it can classify with reliable accuracy while requiring relatively modest computational resources. This is achieved by mapping the primary data from the original input space, using a non-linear function \(\phi\), into a higher-dimensional space in which the data can be separated linearly by finding a hyper-plane with maximal (positive) margin that defines the boundaries between the input classes. However, this approach faces two main critical challenges: the selection of an acceptable basis (kernel) function and the adjustment of its parameters [109]. Logically, selecting the best decision plane is treated as an optimization task in which a kernel function helps find the space in which the categories can be divided linearly through one non-linear transformation.

In SVM models, letting \(k(x_i,x_j)=\phi (x_i)^T \phi (x_j)\) denote the kernel function, a number of common kernels are available, including:

  • Linear kernel, where \(k(x_i,x_j) = x_i \cdot x_j\),

  • Polynomial kernel of degree d, where \(k(x_i,x_j) = (x_i \cdot x_j + 1)^{d}\),

  • RBF kernel, where \(k(x_i,x_j) = \exp (-{\left\| x_i - x_j \right\| }^{2}/2\sigma ^2)\), and

  • Sigmoid kernel, where \(k(x_i,x_j) =\tanh (x_i \cdot x_j + 1)\).

Typically, the polynomial and RBF (Radial Basis Function) kernels are more applicable and suitable for non-linear problems in diverse areas, including intrusion detection, classification, and image processing, as they have proven efficient in delivering better performance. The polynomial kernel provides high performance, especially on high-dimensional databases, while consuming less computational time. According to the works [83] and [102] in the fields of satellite data analysis and intrusion detection, the results exhibit a slight advantage for the polynomial kernel in comparison with the RBF kernel. Additionally, multiple works suggest that the optimum value of the polynomial degree d is 2, which avoids an extensive search over this parameter while still yielding high accuracy. On the other hand, the value of \(\sigma\) in the RBF kernel has a greater impact both on the mapping transformation of the data space and on the ultimately obtained average classification accuracy; moreover, the mandatory tuning of the RBF kernel's controlled parameters complicates the situation because of the time required. For these reasons, the polynomial kernel with \(d=2\) was chosen as the most adequate kernel type for SVM. To solve the above-mentioned dilemma more thoroughly, a hyper-heuristic mechanism could be incorporated for the automatic selection of kernel types as well as the tuning of the controlled parameters.
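A one-line sketch of the SVM evaluator with these settings, using scikit-learn; the feature-scaling step is a practical assumption not stated in the text.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# SVM evaluator with a polynomial kernel of degree 2, as argued above.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=2))
```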

3.4.3 Random Forest (RF)

Random Forest (RF) [126] is a well-known machine learning algorithm which is typically involved in heavy-duty tasks, including image classification, action recognition and detection, visual tracking, facial expression recognition, label distribution learning, time-series forecasting, and so on. RF is formally defined as an ensemble of decision trees featuring characteristics such as robustness to label noise, inherent multi-class handling capacity, built-in FS, parallel processing, few tunable parameters, and efficiency in handling numerical and categorical data. Although RF was proposed two decades ago, it is still employed in various applications [16], thanks to its simplicity of interpretation and implementation as well as its significant computational performance [59]. The decision trees in RF recursively partition the training data into small sets, which greatly assists the classification task, by optimizing an impurity criterion such as information gain or the Gini index [20].

Admittedly, a large number of trees (estimators) in the forest increases the execution time of this algorithm, making it ineffective for real-time applications where the run time is an important factor. Moreover, the maximum depth of the tree (the height of the tree) is a critical parameter. Once again, this dilemma can be overcome by adopting a hyper-heuristic approach for automatic selection of the split strategy and tuning of the algorithm's parameters. In our experiments, in order to achieve relatively high performance in terms of classification accuracy, the RF algorithm was implemented and executed with a moderate number of estimators (\(n=10\)) and maximum depth (\(d=5\)).
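The corresponding scikit-learn configuration is a single line; the random seed is an assumption added for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier

# RF evaluator with the settings reported above: 10 estimators, maximum depth 5.
rf_clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
```

Any of the three evaluators (k-NN, SVM, or RF) can then be plugged into the same wrapper loop: fit on the selected columns, score on held-out data, and feed the resulting error rate into the fitness function of Eq. (7) below.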

4 Proposed improved Binary SSA (iBSSA) for feature selection

Owing to the high performance derived from the careful design of the canonical SSA algorithm in balancing exploration and exploitation capabilities, and the absence of applications of this algorithm to FS problems, this article adopts SSA as a search strategy for wrapper-based FS. In the continuous version of SSA, any point in the search space can be the optimum position within the swarm. FS, however, is a discrete optimization problem, and only binary values (0 or 1) can be used to represent the components of a sparrow's position in the TF-based binary SSA variant. The underlying motivation is the appeal of the simplicity of binary operators over continuous ones. An improved binary SSA variant is presented in this study to solve the feature selection problem, in which the binary conversion of continuous SSA mainly depends on the TFs. It is worth mentioning that two improvements were also introduced to correct the roaming behavior of the algorithm as well as to enhance its exploitation capability.

The two improvements embedded into the standard SSA algorithm are described throughout this section. The first is the 3RA method, which is proposed in order to bring roaming solutions back within the feature space and thus amend the best solution faster. The second is the LSA algorithm, which enhances the exploitation phase of SSA to avoid becoming stuck in local optima. As shown in Algorithm 4, the proposed iBSSA algorithm works by initially generating a swarm of N sparrows. The main loop then iterates through these N sparrows to adjust their positions according to Eqs. (3), (4), or (5). Eventually, LSA is applied to the current \(\mathbf {x}_{best}\) at the end of the iBSSA main loop, hoping to obtain a better solution than the best one obtained so far. At the end, iBSSA returns the best solution found, composed of \(\mathbf {x}_{best}\) and \(f_g\).

4.1 Improvements embedded into the standard SSA

To improve the exploration of iBSSA and to rectify the roaming behavior usually exhibited by individuals in the swarm, the 3RA strategy was proposed. Further, a novel LSA method was incorporated into the standard SSA algorithm to boost the exploitation process by enhancing the optimum solution obtained at the end of each main loop (iteration) of the algorithm. These two arguably promising improvements are discussed in detail in the following subsections.

4.1.1 Improving the exploration by Random Re-positioning of Roaming Agents (3RA)

While updating the positions of the sparrows, they may violate the limits of the feature space, especially given the high dimensionality of the FS problem at hand. An appropriate repair technique is therefore expected to be important for restraining roaming individuals (sparrows) that go beyond the search space. Numerous past investigations have added such a repair step to their works [111]. However, the technique used to re-position roaming agents within the search space (e.g., clipping the out-of-bounds position to the search-space edges, random re-positioning within the search space, or any other) is often not clearly stated. Irrespective of that, we believe that randomly re-positioning roaming agents to an arbitrary area of the search space provides improved outcomes, inspired by the stochastic nature of meta-heuristics. Therefore, we adopted it as a repair step intended to combat the wasteful cycling of sparrows and correct the haphazardness potentially caused by the original SSA. The pseudo-code for the 3RA strategy is presented in Algorithm 2.

Algorithm 2 Pseudo-code of the 3RA strategy
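A minimal sketch of one reading of the 3RA strategy is shown below: any sparrow that leaves the bounds is re-positioned to a uniformly random point of the search space. Whether the whole position vector or only the offending components are reset is a detail the text leaves open; the sketch resets the whole agent.

```python
import numpy as np

def reposition_roaming_agents(X, lb, ub, rng):
    """3RA: every sparrow with at least one coordinate outside [lb, ub] is
    re-positioned to a uniformly random point of the search space."""
    roaming = np.any((X < lb) | (X > ub), axis=1)   # agents that left the bounds
    X[roaming] = rng.uniform(lb, ub, size=(int(roaming.sum()), X.shape[1]))
    return X
```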

4.1.2 Improving the exploitation by Local Search Algorithm (LSA)

A new LSA algorithm is developed and presented in Algorithm 3. In the proposed iBSSA, at the end of each iteration \(t+1\), LSA is called to enhance the currently obtained best solution \(\mathbf {x}_{best}^{t+1}\) by further omitting potentially irrelevant features that remain in it. At first, LSA stores the value of \(\mathbf {x}_{best}^{t+1}\) produced at the end of the iBSSA iteration in a variable Temp. To improve Temp, LSA runs iteratively LT times. At each iteration Lt of LSA, four features \(rand\_feat\) are randomly selected from Temp and the value of each of these variables is flipped. Then, the fitness value f(Temp) of the new solution (the new Temp) is evaluated; if it is better than \(f \left( \mathbf {x}_{best}^{t+1} \right)\), then \(\mathbf {x}_{best}^{t+1}\) is set to Temp; otherwise, \(\mathbf {x}_{best}^{t+1}\) and \(f_g\) are kept unchanged.

Algorithm 3 Pseudo-code of the proposed LSA
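A minimal sketch of the LSA step is given below, assuming a binary solution encoded as a NumPy array of 0/1 values and a fitness function to be minimized; whether Temp is re-copied from the best solution at every local iteration is one possible reading of Algorithm 3.

```python
import numpy as np

def local_search(x_best, f_best, fitness_fn, n_flips=4, LT=20,
                 rng=np.random.default_rng()):
    """Local Search Algorithm (LSA) sketch: repeatedly flip a few randomly
    chosen bits of the best solution and keep the change only if it improves
    the fitness (lower is better)."""
    for _ in range(LT):
        temp = x_best.copy()                              # Temp <- current best
        rand_feat = rng.choice(temp.size, size=min(n_flips, temp.size),
                               replace=False)             # pick 4 random features
        temp[rand_feat] = 1 - temp[rand_feat]             # flip their values
        f_temp = fitness_fn(temp)
        if f_temp < f_best:                               # accept only improvements
            x_best, f_best = temp, f_temp
    return x_best, f_best
```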

Furthermore, the classification techniques k-NN, SVM, and RF were adopted to work with the proposed iBSSA algorithm in wrapper mode for solving FS problems. At each iteration, each classifier is individually applied to train and test the data with the best subset of features selected by iBSSA. Binary values are used to represent selected and deselected features: a “1” at a given index means the corresponding feature is selected, while a “0” means it is deselected. The proposed iBSSA algorithm operates as follows:

4.2 Initialization

The initialization phase is the first step in a population-based algorithm, wherein a swarm of N sparrows (search agents) is generated randomly. In this study, each candidate solution i is limited by lower and upper bounds in the range \([-1,1]\), in order to allow individuals to search relatively broadly, yet restrictively, within the continuous search space domain. Each search agent represents a potential solution with a dimension D which, in the paradigm of feature selection, equals the original number of features in the dataset at hand. The problem of FS for classification purposes can, in brief, be described as selecting a minimal relevant feature subset that can help maximize (or even maintain) the mean classification accuracy. Therefore, in this study, we aim to identify the salient features (“one” values) and reject the other ones (“zero” values). The initial position of each sparrow in the swarm is first converted to random binary values (0s and 1s) by discretizing the position at each dimension to either 0 (not selected) or 1 (selected) according to Eq. (6) and as shown in Fig. 2, prior to initiating the fitness evaluation process.

Fig. 2 Binary solution representation
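The following sketch illustrates the initialization and discretization just described; the V-shaped transfer function used for the probabilistic mapping is an assumption standing in for Eq. (6).

```python
import numpy as np

rng = np.random.default_rng(42)

def init_swarm(N, D, lb=-1.0, ub=1.0):
    """Generate N continuous sparrow positions uniformly in [lb, ub]^D."""
    return rng.uniform(lb, ub, size=(N, D))

def binarize(positions):
    """Discretize continuous positions into 0/1 vectors: a transfer function
    maps each coordinate to a selection probability, which is compared with a
    uniform random number (1 = feature selected, 0 = not selected)."""
    prob = np.abs(np.tanh(positions))         # assumed V-shaped transfer function
    return (rng.random(positions.shape) < prob).astype(int)

# Example: a swarm of 10 sparrows for a 13-feature dataset such as WineEW
positions = init_swarm(N=10, D=13)
binary_positions = binarize(positions)
```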

4.3 Fitness evaluation

When more than one objective must be optimized simultaneously, the underlying optimization problem is described as many- or multi-objective, and the objectives must be balanced to obtain the target optimum solution. Accordingly, the FS problem can be considered multi-objective as it must achieve two conflicting targets: minimizing the number of selected features and maximizing (or even maintaining) the classification accuracy, given a wrapper classifier [1, 85]. The FS method mainly aims at achieving high classification accuracy with as small a feature subset as possible. The aforementioned two conflicting objectives are formulated in this work by adopting an objective function to evaluate the optimality of the solution obtained at the end of each iteration:

$$\begin{aligned} Fit_i=\alpha \times Err_i + \beta \times \frac{\left| d^*\right| }{|D|}, \end{aligned}$$
(7)

where \(Err_i\) is the classification error rate produced by wrongly predicting instances in a test set, computed using the classifiers adopted herein, i.e., k-NN, SVM, and RF. While |D| represents the number of all features in the original dataset, \(\left| d^*\right|\) denotes the length of the selected feature subset. As for \(\alpha\) and \(\beta\), they respectively weigh the importance of the classification accuracy and the selected feature subset’s size, where \(\alpha \in [0,1]\) and \(\beta =1-\alpha\). Based on extensive experiments in previous studies [40, 71], \(\alpha =0.99\) and \(\beta =0.01\) has been prescribed as an adequate setting, so that the classification accuracy carries far greater weight than the number of selected features. Intuitively, if only the classification accuracy were considered in the evaluation function, the result might disregard solutions having the same accuracy but fewer selected features, which contribute greatly to alleviating the curse of dimensionality. In this study, we utilize the classification methods k-NN, SVM, and RF as expert systems (or evaluators) to assess the feature selection process based on the classification error rate.
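A direct implementation of Eq. (7) in wrapper mode might look as follows; the scikit-learn k-NN evaluator and the pre-split training/test arrays are assumptions about the surrounding pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def fitness(binary_solution, X_train, y_train, X_test, y_test,
            alpha=0.99, beta=0.01):
    """Eq. (7): Fit = alpha * classification error + beta * |d*| / |D|."""
    selected = np.flatnonzero(binary_solution)
    if selected.size == 0:                    # an empty subset gets the worst fitness
        return 1.0
    clf = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    clf.fit(X_train[:, selected], y_train)
    err = 1.0 - accuracy_score(y_test, clf.predict(X_test[:, selected]))
    return alpha * err + beta * selected.size / binary_solution.size
```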

4.4 Position updating

The positions of sparrows in iBSSA are updated according to Eqs. (3), (4), or (5). If the current sparrow is a producer, Eq. (3) is applied to update its position; otherwise, Eq. (4) is applied to update the scrounger’s position (i.e., the rest of the swarm apart from producers), while Eq. (5) is used for the sparrows that become aware of danger. Note that, after the position is updated at each iteration, the continuous values of the position vector are preserved for use in the continuous position updates throughout the successive iterations. These values are also discretized using Eq. (6), so as to evaluate the fitness value of the produced binary solution based on the classification error rate obtained by the classifier involved, using the features selected by iBSSA. This process iterates until a stopping criterion is met, which is, in this study, the maximum number of iterations (which has proved sufficient to quantify the quality of the iBSSA algorithm). Finally, the overall pseudo-code of iBSSA can be found in Algorithm 4.

Algorithm 4 Pseudo-code of the proposed iBSSA algorithm
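Putting the pieces together, the overall flow of Algorithm 4 can be sketched as below. The sketch reuses the init_swarm, binarize, fitness, repair_roaming, and local_search helpers from the earlier sketches, and update_position is a simple placeholder standing in for Eqs. (3)–(5), which are not reproduced here; none of this should be read as the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

def update_position(positions, i, x_best):
    """Placeholder for the SSA position-update rules (Eqs. (3)-(5)):
    a small random move toward the best binary solution, for illustration only."""
    return (positions[i]
            + 0.1 * rng.standard_normal(positions.shape[1])
            + 0.1 * (x_best - 0.5))

def ibssa(X_train, y_train, X_test, y_test, N=10, T=100, lb=-1.0, ub=1.0):
    D = X_train.shape[1]
    positions = init_swarm(N, D, lb, ub)
    bins = binarize(positions)
    fits = np.array([fitness(b, X_train, y_train, X_test, y_test) for b in bins])
    best = int(np.argmin(fits))
    x_best, f_g = bins[best].copy(), fits[best]
    for _ in range(T):
        for i in range(N):
            positions[i] = update_position(positions, i, x_best)   # Eqs. (3)-(5)
            positions[i] = repair_roaming(positions[i], lb, ub)    # 3RA repair
            b = binarize(positions[i:i + 1])[0]                    # Eq. (6) step
            f = fitness(b, X_train, y_train, X_test, y_test)
            if f < f_g:                                            # track global best
                x_best, f_g = b.copy(), f
        x_best, f_g = local_search(                                # LSA refinement
            x_best, f_g,
            lambda s: fitness(s, X_train, y_train, X_test, y_test))
    return x_best, f_g
```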

5 Experiments and discussion

This section highlights the computational experiments for the iBSSA algorithm against different competing meta-heuristic algorithms. Parameter settings, benchmark datasets, and performance measures adopted to validate the proposed approaches are articulated in this section as well.

5.1 Description of datasets

In order to extensively evaluate and validate the performance of the techniques proposed in this manuscript, 18 multifaceted, multi-scale benchmark datasets from the UCI data repository [38], drawn from different fields (e.g., biology, politics, electromagnetics, games, physics, chemistry, and artificial data), are used in all experiments. These datasets are very beneficial for verifying the methods proposed herein, given the different numbers of instances and features involved. Table 2 shows the details of these datasets.

Table 2 Description of the datasets used in this study

5.2 Environment and parameter settings

The proposed iBSSA algorithm was compared against a binary variant of the standard SSA algorithm (BSSA) as well as binary variants of 11 other state-of-the-art meta-heuristic algorithms, including BABC, BPSO, BBA, BGWO, BWOA, BGOA, BSFO, BHHO, BBSA, BASO, and BHGSO, all of which were implemented in this study. Concerning the classifiers adopted in this study, k-NN uses the Euclidean distance metric with \(k=5\), while SVM builds on a polynomial kernel with degree \(d=2\). In RF, the number of estimators is \(n=10\) and the maximum depth is \(d=5\). Due to the stochastic nature of meta-heuristics, 30 independent runs were performed for each method, and the average values of the performance metrics were recorded over the 30 experiments. In order for the comparison to be fair, the maximum number of iterations and the swarm size were respectively set to 100 and 10 for all algorithms. Besides, the number of features in each dataset represents the problem size, while the continuous search domain is set to \([-1,1]\) in order to allow individuals to search relatively broadly, yet restrictively, within the continuous search space. Lastly, the number of local iterations in the proposed LSA algorithm was set to 20.
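For reference, plausible scikit-learn counterparts of these classifier settings are shown below; the use of scikit-learn itself and any parameters not stated above are assumptions.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # k-NN: k = 5, Euclidean
svm = SVC(kernel="poly", degree=2)                             # SVM: polynomial kernel, d = 2
rf = RandomForestClassifier(n_estimators=10, max_depth=5)      # RF: 10 estimators, depth 5
```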

In the proposed approach, the optimality of the results was validated using the hold-out strategy, in which each dataset is randomly divided into two parts: 80% for training and the remaining 20% for testing [70]. The results in the following tables represent the average values over 30 runs in terms of the mean fitness value, the mean classification accuracy, and the mean number of selected features. The remaining parameters of each algorithm are set according to the standard versions and the information provided in their original publications. The common settings of all algorithms, along with parameter settings for each algorithm, are described in Table 3. All experiments in this study were run in Python on a computing environment with a dual Intel® Xeon® Gold 5115 2.4 GHz CPU and 128 GB of RAM, under Microsoft Windows Server 2019.

Table 3 Parameter setup for all algorithms

5.3 Performance metrics

In this study, to validate the performance of the proposed iBSSA against competitors, each optimization algorithm is evaluated independently 30 times for each dataset, in order to make the empirical results statistically more significant. To this end, some key performance measures in the FS problem were adopted as follows:

  • Mean accuracy \(\left( \mu _{Acc}\right) \): This metric estimates the rate at which the data is classified correctly. The mean classification accuracy \(\left( \mu _{Acc}\right) \), which is obtained by executing the algorithm 30 independent times, is expressed as Eq. (8):

    $$\begin{aligned} \mu _{Acc}=\frac{1}{30} \frac{1}{m} \sum _{k=1}^{30} \sum _{r=1}^{m} match\left( PL_r, AL_r \right) , \end{aligned}$$
    (8)

    where m represents the size of samples in the test dataset, \(PL_r\) and \(AL_r\) respectively denote the output label of the predicted class and the reference class label for sample r, while \(match\left( PL_r, AL_r \right) \) denotes a discriminant comparison function. If \(PL_r==AL_r\), then \(match\left( PL_r, AL_r \right) =1\); otherwise, \(match\left( PL_r, AL_r \right) =0\).

  • Mean fitness value \(\left( \mu _{Fit} \right) \): This metric measures the average fitness value obtained by executing the algorithm 30 independent times, which relates minimizing the classification error rate to minimizing the number of selected features as per Eq. (7). A lower value implies a better solution, and it is determined using Eq. (9):

    $$\begin{aligned} \mu _{Fit}= \frac{1}{30}\sum _{k=1}^{30} f_*^{k}, \end{aligned}$$
    (9)

    where \(f_*^{k}\) represents the optimal fitness value obtained from k-th run.

  • Mean size of selected features \(\left( \mu _{Feat} \right) \): This represents the average ratio of selected features (the feature selection ratio) obtained by executing the algorithm 30 independent times, and is estimated as:

    $$\begin{aligned} \mu _{Feat}= \frac{1}{30}\sum _{k=1}^{30} \frac{d_*^{k}}{D}, \end{aligned}$$
    (10)

    where \(d_*^{k}\) is the number of selected features for k-th run, and D is the dimensionality or the number of all features in the original dataset.

  • Standard deviation (\(\sigma _{Y}\)): Corresponding to the aforementioned measures, the final results achieved over the 30 independent runs for each algorithm on every dataset are evaluated and analyzed in terms of stability as follows (a short computational sketch covering Eqs. (8)–(11) is given after this list):

    $$\begin{aligned} \sigma _{Y}=\sqrt{\frac{1}{29}\sum \nolimits _{k=1}^{30} \left( Y_*^k-\mu _Y\right) ^2}, \end{aligned}$$
    (11)

    where Y denotes the metric to be measured, \(Y_*^k\) is the value of the metric Y in the k-th run, and \(\mu _Y\) is the average of the metric over the 30 independent runs.

  • Wilcoxon’s rank-sum test: For a better understanding of the significance of the underlying technique, its impact should be statistically proved. Thus, the quality of the final results obtained from the involved approaches is often verified using Wilcoxon’s rank-sum non-parametric test. This test, among many other statistical tests, is widely used to statistically assess the significance and reliability of competing approaches [26]. Using this test, the present study assesses the significance of the proposed iBSSA approach against the competitor algorithms, where the null hypothesis states that the iBSSA algorithm does not differ significantly – in terms of performance – from the other method in a pair-wise comparison; rejecting the null hypothesis indicates that iBSSA performs significantly better than the other method. Wilcoxon’s rank-sum test calculates a so-called p-value, based on which the difference between paired groups is analyzed in terms of the results obtained from 30 independent runs of the proposed iBSSA and each of the other algorithms in the comparison.
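The following sketch aggregates the per-run results into the reported metrics (Eqs. (8)–(11)); the arrays acc, fit, and feat holding the 30 per-run values are assumptions about how the results are stored.

```python
import numpy as np

def summarize(acc, fit, feat, D):
    """Aggregate 30 independent runs for one algorithm on one dataset:
    acc  - per-run classification accuracy (30 values),
    fit  - per-run best fitness (30 values),
    feat - per-run number of selected features (30 values),
    D    - total number of features in the dataset."""
    return {
        "mu_Acc": acc.mean(),                 # Eq. (8): mean accuracy over runs
        "mu_Fit": fit.mean(),                 # Eq. (9): mean best fitness
        "mu_Feat": (feat / D).mean(),         # Eq. (10): mean feature selection ratio
        "sigma_Fit": fit.std(ddof=1),         # Eq. (11): sample std (divisor 29)
    }
```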

The experimental results are closely analyzed in the subsequent subsections, in which boldface numbers indicate the best results.

5.4 Assessment of the impact of the nine TFs on the fitness values

SSA was originally proposed for continuous optimization problems; therefore, a proper transfer function should be identified to adapt the algorithm’s mechanism to search efficiently in discrete (or binary) space. To this end, iBSSA performance was evaluated based on nine different TFs, including S-shaped and V-shaped functions, across the selected 18 UCI benchmark datasets to solve the wrapper FS problem, pursuing the best-performing approach. The different TFs were tested with iBSSA based on the k-NN classifier in terms of the mean fitness value \(\left( \mu _{Fit} \right) \), and the results are reported in Table 4, along with the respective standard deviations. The same analysis is performed for the SVM and RF classifiers in Tables 5 and 6, respectively. In the FS optimization problem, the objective function (see Eq. (7)) is formulated based on both the mean classification error rate and the mean feature selection ratio; therefore, the mean fitness value was mainly considered in solving such a problem. The approaches were built upon nine TFs, called Sv1, Sv1c, Sv2, Sv3, and Sv4 for the S-shaped TFs, and Vv1, Vv2, Vv3, and Vv4 for the V-shaped TFs. Consequently, throughout the following discussion, the proposed methods are termed “iBSSA-TF”, where TF is any of the nine TFs. \(\mathrm {W|T|L}\) at the bottom of the tables represents how many times each competing method wins/ties/loses compared to the others. By analyzing and comparing these experimental results, the optimal iBSSA variant corresponding to each classifier is obtained based on the best-performing TF for that classifier.
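The exact definitions of the nine TFs (Sv1–Sv4, Sv1c, Vv1–Vv4) are not reproduced in this section, so the sketch below only shows one representative member of each family and how a TF turns a continuous coordinate into a selection probability; the specific formulas are illustrative stand-ins, not the paper's definitions.

```python
import numpy as np

def s_shaped(x):
    """Representative S-shaped TF (the classic sigmoid); an illustrative stand-in."""
    return 1.0 / (1.0 + np.exp(-x))

def v_shaped(x):
    """Representative V-shaped TF (|tanh|); an illustrative stand-in."""
    return np.abs(np.tanh(x))

def tf_binarize(x, tf, rng=np.random.default_rng()):
    """Map a continuous position to a binary one via the chosen TF's probability."""
    x = np.asarray(x, dtype=float)
    return (rng.random(x.shape) < tf(x)).astype(int)
```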

5.4.1 Assessment using the k-NN classifier

Table 4 shows the mean fitness values \(\left( \mu _{Fit} \right) \) and standard deviations \(\left( \sigma _{Fit} \right) \) of iBSSA based on the nine TFs, using the k-NN classifier. It is observed that \(\hbox {iBSSA-V}_{\mathrm{v4}}\) shows a significant performance in 11 out of the 18 datasets, followed by \(\hbox {iBSSA-V}_{\mathrm{v3}}\), \(\hbox {iBSSA-V}_{\mathrm{v1}}\), \(\hbox {iBSSA-V}_{\mathrm{v2}}\), and \(\hbox {iBSSA-S}_{\mathrm{v4}}\) with 9, 7, 6, and 4 datasets, respectively. Therefore, in terms of the mean fitness values, \(\hbox {iBSSA-V}_{\mathrm{v4}}\) ranks first among all methods.

Table 4 Assessment of the impact of the nine TFs on iBSSA based on k-NN in terms of the mean fitness value \(\left( \mu _{Fit} \right) \)

5.4.2 Assessment using the SVM classifier

Table 5 presents the mean fitness values \(\left( \mu _{Fit} \right) \) of the proposed iBSSA with the SVM classifier across the nine TFs. Although \(\hbox {iBSSA-V}_{\mathrm{v2}}\) and \(\hbox {iBSSA-V}_{\mathrm{v3}}\) seemingly share the first place on 50% of the datasets based on the number of wins and ties, \(\hbox {iBSSA-V}_{\mathrm{v2}}\) achieved the best overall performance compared with \(\hbox {iBSSA-V}_{\mathrm{v3}}\) based on the number of wins. In addition, \(\hbox {iBSSA-V}_{\mathrm{v1}}\), \(\hbox {iBSSA-S}_{\mathrm{v4}}\), and \(\hbox {iBSSA-V}_{\mathrm{v4}}\) ranked first on 6, 5, and 5 datasets, respectively, all fewer than \(\hbox {iBSSA-V}_{\mathrm{v2}}\). Thus, it can be concluded that the \(\hbox {iBSSA-V}_{\mathrm{v2}}\) method generally performs the best with SVM based on the mean fitness values.

Table 5 Assessment of the impact of the nine TFs on iBSSA based on SVM in terms of the mean fitness value \(\left( \mu _{Fit} \right) \)

5.4.3 Assessment using the RF classifier

When testing the iBSSA algorithm with the nine TFs for FS in classification using RF, iBSSA-Vv2 showed strong competitiveness against its peers. In Table 6, among the 18 datasets, iBSSA-Vv2 is marked as winner or tied on 9 datasets, which ranks the method first over the other methods in terms of the mean fitness value. From this comparison, it is clear that iBSSA-Vv2 is the best-performing binary variant with the RF classifier in the proposed iBSSA algorithm.

Table 6 Assessment of the impact of the nine TFs on iBSSA based on the RF classifier in terms of the mean fitness value \(\left( \mu _{Fit} \right) \)

All in all, Table 7 shows the best combinations of iBSSA variant and the three classifiers across the nine TFs, based on which the upcoming experiments in this section were conducted. The finally selected, most effective models can be termed iBSSA-Vv4–k-NN, iBSSA-Vv2–SVM, and iBSSA-Vv2–RF. In this regard, it should be noted that, for simplicity, these models are abbreviated according to the three classifiers as iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF, respectively, and used consistently hereafter without including the respective TF’s name.

Table 7 The three classifiers along with corresponding best-performing binary variants in the proposed iBSSA algorithm

5.4.4 Overall assessment

In Table 8, iBSSA–k-NN is compared with iBSSA–SVM and iBSSA–RF. In terms of the mean fitness value, it can be seen that iBSSA–RF ranks first on 8 of the 18 datasets, while iBSSA–k-NN and iBSSA–SVM each rank first on 5 datasets. In the final statistical ranking for the mean fitness value, iBSSA–RF ranked first, followed by iBSSA–SVM and iBSSA–k-NN. In the comparison of mean classification accuracy, iBSSA–RF won on 5 and tied on 4 of the 18 datasets, whereas the other methods won on only 4 datasets each, fewer than iBSSA–RF. In particular, on the datasets M-of-n, Vote, WineEW, and Zoo, the mean accuracy of iBSSA–k-NN and iBSSA–SVM reached 100%, while iBSSA–RF reached that mean accuracy on the latter 3 datasets only. iBSSA–RF also ranks first in the final statistics, followed by iBSSA–k-NN and iBSSA–SVM. The experimental results on the mean number of selected features for all methods are also shown in Table 8. As can be seen, iBSSA–SVM ranks first on 7 out of the 18 datasets. From the numerical perspective, iBSSA–SVM can find a relatively smaller feature subset; for example, on the high-dimensional dataset PenglungEW, the mean number of selected features is nearly 30% of the original number of features. Similarly, in the final ranking, iBSSA–SVM ranked first, followed by iBSSA–RF and iBSSA–k-NN in terms of the mean number of selected features. From the experimental results analyzed above, it is asserted that both SVM and RF are strongly competitive in comparison with k-NN in terms of mean fitness, mean classification accuracy, and mean number of features selected by the proposed iBSSA algorithm. The harmony between the mean fitness values and the mean classification accuracy implies a slight advantage of iBSSA–RF among the three methods.

Table 8 Performance comparisons of iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF in terms of the mean fitness value \(\left( \mu _{Fit} \right) \), mean classification accuracy \(\left( \mu _{Acc} \right) \), and mean number of selected features \(\left( \mu _{Feat} \right) \)

These findings can be justified by the nature of the classifiers themselves. k-NN can be described as a lazy classifier in which a learning step is not even required, as the classification task is done by simply calculating distances between the test and training sets. On the other hand, SVM is more complex, using a learning phase to update the bias and weights employed in the testing step in order to create the model. RF is also considered a complex learning method, based on training multiple deep decision trees on different parts of the same training set and averaging the results in order to reduce the variance. This comes at the expense of some loss of interpretability and a slight increase in the bias. Although not quite similar, forests act somewhat like K-fold cross-validation. The aforementioned characteristics give SVM and RF a greater capacity than k-NN for determining the pertinent features with high discrimination.

Thus, in order to fully verify the effectiveness of the algorithm proposed in this study for feature selection, and given their different yet complementary natures, the three classifiers, k-NN, SVM, and RF, were each considered independently in all other experiments in this section.

5.5 Performance comparison of classification using the original classifier with all features (before FS), and the iBSSA–classifier method (after FS)

In this experiment, to determine the breadth of iBSSA’s influence, the average number of features selected by the proposed iBSSA algorithm was compared with the original number of features in each dataset. In addition, the magnitude of the increase in classification accuracy over the three classifiers was quantified.

5.5.1 Comparisons of k-NN and iBSSA–k-NN

On the one hand, Table 9 shows the classification accuracy based on the original k-NN classifier (before FS), along with the original number of features, for each of the 18 datasets. On the other hand, the table shows the mean classification accuracy \(\left( \mu _{Acc} \right)\) and the mean number of selected features \(\left( \mu _{Feat} \right)\) based on the proposed iBSSA–k-NN method (after FS). It is remarkable that \(\mu _{Acc}\) based on the iBSSA–k-NN method increased by more than 10% on 10 out of the 18 datasets, with a significant increase of up to 25% on 4 of them, while the mean accuracy itself reached more than 90% on 12 out of the 18 datasets, achieving 100% mean accuracy on 4 datasets. It is also notable that \(\mu _{Feat}\) based on the iBSSA–k-NN method was reduced by more than 50% on 15 out of the 18 datasets. Unfortunately, iBSSA–k-NN was unable to enhance both \(\mu _{Acc}\) and \(\mu _{Feat}\) on the dataset Tic-tac-toe; the dataset’s particular nature and data characteristics may have hindered iBSSA from improving the feature subset selection. Regardless, the iBSSA–k-NN method significantly outperformed the original k-NN on 17 datasets in terms of the two metrics. Thus, it is clear that iBSSA–k-NN provides a genuinely promising solution to the FS problem compared to k-NN alone, across the selected datasets.

Table 9 Performance comparisons of classification using the original k-NN classifier on all the features (before FS) and classification using the proposed iBSSA–k-NN (after FS) in terms of the mean classification accuracy \(\left( \mu _{Acc} \right)\) and the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.5.2 Comparisons of SVM and iBSSA–SVM

For the SVM classifier, Table 10 shows that iBSSA–SVM outperformed the original SVM in terms of \(\mu _{Acc}\) and \(\mu _{Feat}\): iBSSA generally selected fewer features over the 18 datasets, with a reduction in feature size of more than 50% on 14 out of the 18 datasets, while at the same time improving \(\mu _{Acc}\) on 15 out of 18 datasets (with 100% accuracy on 3 datasets) and preserving \(\mu _{Acc}\) on the remaining 3 datasets. It is also observed that, although \(\mu _{Acc}\) did not increase on the datasets Exactly2, M-of-n, and Tic-tac-toe, the respective number of selected features was reduced on average over these 3 datasets. From this analysis, the iBSSA–SVM method proposed herein was superior to the original SVM alone on most datasets.

Table 10 Performance comparisons of classification using the original SVM classifier on all the features (before FS) and classification using the proposed iBSSA–SVM (after FS) in terms of the mean classification accuracy \(\left( \mu _{Acc} \right)\) and the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.5.3 Comparisons of RF and iBSSA–RF

As shown in Table 11, iBSSA–RF outperformed the original RF classifier in terms of the accuracy and the number of selected features, respectively, in 17 and 18 out of all 18 datasets. Among the 18 datasets, the \(\mu _{Acc}\) in 10 datasets increased by up to 10%, including 3 datasets attaining 100% classification accuracy. On the other hand, the \(\mu _{Feat}\) in 15 datasets was reduced by 50%, thereby showing highly significant improvements in the two FS main metrics (\(\mu _{Acc}\) and \(\mu _{Feat}\)). This shows that iBSSA–RF has much better performance.

Table 11 Performance comparisons of classification using the original RF classifier on all the features (before FS) and classification using the proposed iBSSA–RF (after FS) in terms of the mean classification accuracy \(\left( \mu _{Acc} \right)\) and the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.6 Performance comparison of BSSA and iBSSA

In this experiment, to quantify the extent of the improvement, iBSSA was compared with the original BSSA based on the three classifiers, k-NN, SVM, and RF, in terms of three metrics: \(\mu _{Fit}\), \(\mu _{Acc}\), and \(\mu _{Feat}\).

5.6.1 Comparisons based on the k-NN classifier

As shown in Table 12, which presents the mean metrics based on iBSSA and BSSA both with k-NN, iBSSA outperforms the original BSSA algorithm over the 18 datasets in terms of the mean fitness values. Thus, iBSSA achieved much higher performance than the original BSSA algorithm based on the fitness. Moreover, iBSSA outperformed BSSA in terms of mean classification accuracy on 12 of the 18 datasets, accounting for 67% of all datasets, with 100% classification accuracies on 4 datasets (M-of-n, Vote, WineEW, and Zoo). Furthermore, in terms of the mean number of selected features, iBSSA outperformed BSSA because a smaller number of features was selected by iBSSA over 15 (83%) of all datasets. In addition, given the standard deviations, iBSSA was more stable on the vast majority of datasets, for the three metrics. From these results, it is clearly shown that iBSSA with k-NN has significantly improved both the FS and classification tasks compared to the original BSSA algorithm.

Table 12 Performance comparisons of BSSA and iBSSA in the FS problem based on k-NN in terms of the mean fitness value \(\left( \mu _{Fit} \right)\), the mean classification accuracy \(\left( \mu _{Acc}\right)\), and the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.6.2 Comparisons based on the SVM classifier

Table 13 shows the experimental results of using both BSSA and iBSSA with SVM for FS in the classification task. iBSSA ranked first on 16 (89%), 12 (67%), and 16 (89%) out of 18 datasets for the mean fitness, mean classification accuracy, and mean number of selected features, respectively, which confirms a highly considerable overall performance of the proposed iBSSA algorithm over BSSA. It is also worth mentioning that the mean classification accuracy reached 100% on 4 datasets (M-of-n, Vote, WineEW, and Zoo) with fewer features selected by iBSSA. Generally, comparing the two methods in terms of standard deviation, iBSSA has relatively higher stability. Thus, iBSSA with SVM significantly outperformed BSSA in solving the FS problem.

Table 13 Performance comparisons of BSSA and iBSSA in the FS problem based on SVM in terms of the mean fitness value \(\left( \mu _{Fit} \right)\), the mean classification accuracy \(\left( \mu _{Acc}\right)\), and the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.6.3 Comparisons based on the RF classifier

From Table 14, it is most notable that iBSSA with RF was able to find a smaller feature subset on 13 out of the 18 datasets, while achieving a higher (or maintaining the same) level of mean accuracy on those datasets. Furthermore, the mean fitness values obtained by iBSSA over the 18 datasets confirmed its substantial superiority compared to the original BSSA. Additionally, the stability of iBSSA–RF was very apparent in terms of the three metrics. Notably, iBSSA based on the RF classifier showed higher quality than BSSA alone.

Table 14 Performance comparisons of BSSA and iBSSA in the FS problem based on the RF classifier in terms of the mean fitness value \(\left( \mu _{Fit} \right)\), the mean classification accuracy \(\left( \mu _{Acc}\right)\), and the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.7 Performance comparison with some other promising meta-heuristic algorithms

Previous experiments compared iBSSA with BSSA, which originates from the original SSA, and iBSSA performed superiorly over the original BSSA approach. These improvements in performance are attributed to the iBSSA capability to balance exploration and exploitation over iterations, as well as its ability to escape from local optima. In this section, in order to further confirm the superiority of iBSSA over other peers in the literature, the best-performing iBSSA method with the three classifiers, k-NN, SVM, and RF, is compared with other advanced meta-heuristics implemented under the same conditions. The comparison with iBSSA includes binary variants of some promising optimization algorithms, namely BSSA, BABC, BPSO, BBA, BGWO, BWOA, BGOA, BSFO, BHHO, BBSA, BASO, and BHGSO. In the same way, the comparison and analysis of performance is based on the mean fitness value, the mean classification accuracy, and the mean number of selected features.

5.7.1 Comparisons based on the k-NN classifier

Table 15 reports the mean fitness values for iBSSA and other advanced meta-heuristics in the FS problem based on k-NN. As can be seen in Table 15, iBSSA won on 10 and tied on 3 datasets in the FS task, thereby performing well on 13 out of 18 datasets, accounting for 72% of all datasets. Additionally, the benchmark includes both large and small datasets, which demonstrates the ability of iBSSA to perform consistently regardless of dataset size. For the 5 datasets lost by iBSSA, comparing the mean fitness values reveals that the values obtained by iBSSA are very close to those of the other algorithms. The same observation holds for the 5 datasets won by BSFO, whose mean fitness values are very close to those obtained by iBSSA. This confirms that the best overall performance is achieved by iBSSA. Apart from BSFO, none of the other methods compared with iBSSA ranked first on any of the 18 datasets, which confirms the remarkability of the proposed method with k-NN. Furthermore, the stability of iBSSA with k-NN is relatively strong, based on the standard deviations of the various methods. It can thus be concluded that iBSSA is highly advantageous, with a proven capability of balancing exploration and exploitation in the search space over iterations and avoiding local optima, whereas other algorithms may become trapped in local optima.

Table 15 Comparisons of iBSSA against a few promising algorithms based on k-NN in terms of the mean fitness value \(\left( \mu _{Fit} \right)\)

Table 16 shows the mean classification accuracy results of k-NN with iBSSA compared to the other methods. From Table 16, it is observed that iBSSA outperformed all other algorithms in terms of mean classification accuracy on 7 datasets and tied with other algorithms on 8 datasets, achieving overall high accuracy on 15 out of the 18 datasets. Additionally, note that the second-ranked optimizer (BSFO) acquired a slight advantage on only 3 datasets, Exactly2, PenglungEW, and SpectEW, with margins of 0.0013%, 0.0022%, and 0.25%, respectively.

Table 16 Comparisons of iBSSA against a few promising algorithms based on k-NN in terms of the mean classification accuracy \(\left( \mu _{Acc}\right)\)

In addition, as shown in Table 17, iBSSA has a better exploration ability than the other methods in terms of the mean number of selected features, as it selected fewer features on 12 (11 wins and one tie) out of the 18 datasets, whereas BSFO, BHHO, BBSA, and BASO selected a slightly lower number of features than iBSSA and the other algorithms on only 2, 1, 1, and 2 datasets, respectively. This proves the ability of iBSSA with k-NN to explore the most feasible regions and avoid searching irrelevant areas of the search space. Thus, iBSSA based on k-NN can reduce the feature search space by identifying the most informative features while maintaining high classification accuracy.

Table 17 Comparisons of iBSSA against a few promising algorithms based on k-NN in terms of the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.7.2 Comparisons based on the SVM classifier

By inspecting Table 18, iBSSA with SVM outperformed the other optimizers, assessed in the same experimental environment, on 14 out of 18 datasets, achieving the minimal classification error compared to the other methods, while BSFO held a slight advantage on one dataset only. Additionally, the standard deviation values accompanying the mean fitness are small, which implies a good balance between exploration and exploitation of the SVM-based iBSSA method.

Table 18 Comparisons of iBSSA against a few promising algorithms based on SVM in terms of the mean fitness value \(\left( \mu _{Fit} \right)\)

In Table 19, iBSSA is compared with other algorithms in terms of mean classification accuracy. By analyzing the results, iBSSA based on SVM had a perfect recognition in terms of 100% classification accuracy on the datasets M-of-n, Vote, WineEW, and Zoo; whereas, it achieved an accuracy varying from 94.76% to 98.54% on the datasets SonarEW, BreastEW, IonosphereEW, CongressEW, BreastCancer, and KrVsKpEW. The results also revealed that iBSSA based on SVM obtained the best mean accuracy on 8 out of the 18 datasets, while the second ranked optimizer (BSFO) scored the best results over one dataset only.

Table 19 Comparisons of iBSSA against a few promising algorithms based on SVM in terms of the mean classification accuracy \(\left( \mu _{Acc}\right)\)

In terms of the mean number of selected features, the results of iBSSA and other competitors based on SVM are depicted in Table 20. By analyzing the results, favorable observations are noted for iBSSA with SVM, which achieved better results than the other optimizers on 12 out of the 18 datasets adopted in this study. It is also observed that the second-ranked method (BHHO) outperforms the others on two datasets only. The superiority of iBSSA with SVM in this regard is attributed to the proposed 3RA and LSA methods, which enable the proposed iBSSA to select fewer features, affirming its potential to explore the most important areas of the search space while avoiding infeasible areas.

Table 20 Comparisons of iBSSA against a few promising algorithms based on SVM in terms of the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.7.3 Comparisons based on the RF classifier

Table 21 compares the mean fitness and standard deviation values of iBSSA with other competitors based on RF, under the same implementation conditions. Notably, iBSSA based on RF shows a higher quality than other approaches. By analyzing Table 21, evidence shows that RF-based iBSSA obtained the lowest fitness values along with competitive standard deviations over 12 out of the 18 datasets. In addition, the second ranked optimizer (BSFO) acquired a slight advantage over 4 datasets only: BreastCancer, Exactly, Exactly2, and WineEW with a margin of 0.02%, 0.35%, 0.06%, and 0.01%, respectively.

Table 21 Comparisons of iBSSA against a few promising algorithms based on RF in terms of the mean fitness value \(\left( \mu _{Fit} \right)\)

Table 22 shows the comparison of iBSSA and other algorithms with RF classifier in terms of accuracy results which are assessed fairly in the same environment. The experimental results reveal that RF-based iBSSA classifier is superior on 8 out of 18 datasets, where BSFO, which ranks second, gave the best results on two datasets only. It should also be pointed out that BSFO produced the same results on 6 datasets as iBSSA based on RF.

Table 22 Comparisons of iBSSA against a few promising algorithms based on RF in terms of the mean classification accuracy \(\left( \mu _{Acc}\right)\)

Table 23 presents the mean number of features selected by each algorithm. It is evident that iBSSA excelled the other algorithms on 7 out of 18 datasets in feature reduction. Furthermore, BASO excelled on only 4 datasets, BSFO on only 3 datasets, and BGWO on only 2 datasets, whereas BWOA and BBSA excelled on only one dataset each. The iBSSA ability to select fewer features is believed to be attributable to the superiority of the proposed 3RA and LSA methods in selecting the most informative features by searching within feasible areas while at the same time considering improved classification accuracy.

Table 23 Comparisons of iBSSA against a few promising algorithms based on RF in terms of the mean number of selected features \(\left( \mu _{Feat} \right)\)

5.8 Convergence analysis

This subsection presents a convergence analysis of the proposed iBSSA with the three classifiers, k-NN, SVM, and RF, for FS on the different datasets. These analyses reveal the convergence capacity of the proposed iBSSA in the form of convergence graphs. While the mean classification accuracy can be adopted to quantify the performance of iBSSA in FS problems, the algorithm’s convergence behavior strongly determines its efficiency. Figures 3, 4, and 5 depict the relation between the number of iterations and the optimal fitness value obtained so far with k-NN, SVM, and RF, respectively, for iBSSA against binary variants of other optimizers, including BSSA, BABC, BPSO, BBA, BGWO, BWOA, BGOA, BSFO, BHHO, BBSA, BASO, and BHGSO, which were executed under the same experimental settings, including the number of agents and the number of iterations.

Fig. 3 Convergence curves of the proposed iBSSA and other algorithms over all datasets with the k-NN classifier

Fig. 4 Convergence curves of the proposed iBSSA and other algorithms over all datasets with the SVM classifier

Fig. 5 Convergence curves of the proposed iBSSA and other algorithms over all datasets with the RF classifier

Based on Figs. 3, 4, and 5, iBSSA with k-NN excelled the other algorithms in terms of fast convergence over the 18 datasets. The superior performance of iBSSA with k-NN is thus revealed through its speedy convergence behavior in comparison with other methods, especially for large datasets such as PenglungEW, where convergent behavior was exhibited after only 25 iterations. This is comparable to iBSSA with SVM, which shows speedy convergent behavior on 16 datasets (all except BreastEW and SonarEW), and to iBSSA with RF, which shows divergent behavior only on the dataset SonarEW. Moreover, the convergence behavior shows that the proposed iBSSA effectively balances exploitation and exploration, which is well demonstrated by the close agreement between the optimal mean fitness values and the optimal mean accuracy values. This contributed greatly to alleviating premature convergence compared to other optimization algorithms. Notably, this stability in iBSSA performance results from incorporating the two improvements, 3RA and LSA, into the original BSSA algorithm. It is worth mentioning that iBSSA based on the three classifiers, k-NN, SVM, and RF, exhibited convergent behavior, in most cases, throughout the 100 iterations, which were thus sufficient for convergence. In addition, the fastest convergence was relatively exhibited by iBSSA with the three classifiers on the vast majority of datasets.
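For completeness, a convergence curve such as those in Figs. 3–5 can be produced by recording the best fitness obtained so far at every iteration and plotting it against the iteration index; the sketch below assumes such per-iteration histories are available as arrays, with one array per algorithm.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_convergence(histories, labels):
    """histories: list of (T,) arrays of best-so-far fitness, one per algorithm."""
    for best_so_far, label in zip(histories, labels):
        plt.plot(np.arange(1, len(best_so_far) + 1), best_so_far, label=label)
    plt.xlabel("Iteration")
    plt.ylabel("Best fitness obtained so far")
    plt.legend()
    plt.show()
```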

5.9 Statistical analysis (Wilcoxon’s rank-sum test)

Based on Tables 15, 18, and 21, it is obvious that iBSSA with k-NN, SVM, and RF, respectively, shows a clear superiority over other competing algorithms on 13, 14, and 12 out of the 18 datasets. Therefore, in this section, Wilcoxon’s rank-sum test is applied to iBSSA with the three classifiers, k-NN, SVM, and RF, compared to BSSA, BABC, BPSO, BBA, BGWO, BWOA, BGOA, BSFO, BHHO, BBSA, BASO, and BHGSO. Wilcoxon’s rank-sum test [26] is used to statistically detect a significant difference between two samples of results. Accordingly, for assessing the significance of the proposed iBSSA against its counterparts, Wilcoxon’s rank-sum statistical analysis is conducted at a 0.05 significance level, below which a statistically significant difference of the proposed algorithm compared to the other algorithms is exhibited (values with \(p<0.05\) are bold). The results shown in Tables 24, 25, and 26 reveal that the suggested iBSSA approach is arguably the best compared to the other optimizers over the three classifiers, k-NN, SVM, and RF, based on the fitness value results.
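A minimal sketch of this pair-wise test is given below, assuming the 30 per-run fitness values of iBSSA and of one competitor on one dataset are stored as arrays; the variable names are illustrative.

```python
from scipy.stats import ranksums

def wilcoxon_rank_sum(fit_ibssa, fit_other, alpha=0.05):
    """Wilcoxon's rank-sum test between two sets of 30 fitness values;
    p < alpha indicates a statistically significant difference."""
    statistic, p_value = ranksums(fit_ibssa, fit_other)
    return p_value, p_value < alpha
```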

By inspecting the results in Table 24, iBSSA with k-NN is significantly different from BHHO and BASO over the 18 datasets, since the \(p\)-value is less than 0.05. Additionally, compared to BPSO, BBA, and BHGSO, iBSSA with k-NN remains statistically significant on 17 datasets, while it is significantly different from BGOA, BWOA, BBSA, BGWO, BABC, BSSA, and BSFO on 16, 15, 15, 12, 11, 11, and 5 datasets, respectively. On the other hand, iBSSA and BSFO exhibit the same performance with k-NN on the dataset BreastCancer, where the mean fitness value is 0.0201 and the standard deviation is zero. The same observation can be made for iBSSA, BABC, and BSFO, which exhibit the same performance with k-NN on the dataset Tic-tac-toe, where the mean fitness value is 0.1544 and the standard deviation is zero. Furthermore, iBSSA with k-NN does not differ significantly on the datasets Lymphography and M-of-n compared to BSSA, BABC, BGWO, and BSFO. In addition, iBSSA with k-NN is not significantly different from BSSA, BGWO, BWOA, BGOA, BSFO, and BBSA on the datasets Tic-tac-toe and Vote. Lastly, iBSSA with k-NN does not differ significantly from BSSA, BABC, BPSO, BWOA, and BSFO on the dataset WineEW. In general, it is clear from the table that iBSSA has statistically significant differences on most of the datasets compared to the other algorithms.

Table 24 \(p\)-values of the Wilcoxon’s rank-sum test for the mean fitness values \(\left( \mu _{Fit}\right)\) of iBSSA vs. other approaches based on k-NN (values with \(p<0.05\) are bold)

From the \(p\)-values in Table 25, it can be seen that, based on the mean fitness values, iBSSA with SVM has a statistically significant difference compared to BBA and BASO over the 18 datasets. In addition, iBSSA with SVM remains significantly different from BGOA, BHGSO, BPSO, BBSA, BABC, BWOA, BHHO, BSSA, and BGWO over 17, 17, 16, 16, 15, 15, 14, 12, and 12 out of the 18 datasets, respectively. On the other hand, iBSSA, BSFO, and BHHO exhibit the same performance with SVM on the dataset Exactly2, where the mean fitness value is 0.2483 and the standard deviation is zero. The same observation holds for iBSSA, BABC, BGWO, and BSFO, which exhibit the same performance with SVM on the dataset Tic-tac-toe, where the mean fitness value is 0.1017 and the standard deviation is zero. Moreover, it is very notable that iBSSA with SVM does not have a statistically significant difference from BSSA, BPSO, BGWO, BWOA, and BHGSO on the dataset M-of-n. Furthermore, iBSSA with SVM does not have a statistically significant difference from BWOA, BGOA, BHHO, and BBSA on the dataset Tic-tac-toe.

Table 25 \(p\)-values of the Wilcoxon’s rank-sum test for the mean fitness values \(\left( \mu _{Fit}\right)\) of iBSSA vs. other approaches based on SVM (values with \(p<0.05\) are bold)

Table 26 shows the \(p\)-values of iBSSA with RF compared to the other algorithms. It is remarkable that iBSSA has statistically significant differences from the other algorithms over the vast majority of datasets, especially in the cases of BPSO, BBA, BHHO, and BHGSO, where a statistically significant difference is exhibited over all 18 datasets.

Table 26 \(p\)-values of the Wilcoxon’s rank-sum test for the mean fitness values \(\left( \mu _{Fit}\right)\) of iBSSA vs. other approaches based on RF (values with \(p<0.05\) are bold)

5.10 Comparison with various FS methods from the literature

This subsection compares the mean classification accuracy results for the three models proposed in this article (i.e., iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF) versus various algorithms from the literature, including BGOA_EPD_Tour [72], TCSSA3 [6], HGSA [107], BSSA_S3_CP [35], GA [32], SA [42], and WOASAT-2 [74], which were conducted under the same experimental settings. Table 27 reports the experimental data of the mean classification accuracy of iBSSA–k-NN, iBSSA–SVM, iBSSA–RF, and various methods in the literature.

Table 27 Comparison of iBSSA and other algorithms from the specialized literature in terms of the mean classification accuracy \(\left( \mu _{Acc}\right)\)

By inspecting Table 27, it can be seen that, across the 18 datasets, the iBSSA–RF proposed in this article dominated all approaches in terms of mean classification accuracy on 4 datasets. Moreover, iBSSA–SVM and iBSSA–k-NN take the second and third places with 3 and 1 datasets, respectively, followed by HGSA, BGOA_EPD_Tour, WOASAT-2, BSSA_S3_CP, GA, and TCSSA3. In addition, comparing specific values, it is important to highlight that the mean classification accuracy of iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF has great advantages over competitors. Specifically, on the datasets M-of-n, Vote, WineEW, and Zoo, it reaches up to 100%; these datasets are of low dimensionality, demonstrating the high performance of iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF on low-dimensional datasets. Although HGSA outperformed iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF on 4 datasets, the numerical margin was relatively small. Therefore, the iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF proposed in this work performed better in solving the FS problem than the other FS methods from the literature, ultimately taking the top three places in the competition.

5.11 Discussions

This study fused an improved binary version of the SSA algorithm and three well-known classifiers into three models (i.e., iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF) for FS in the classification task. Based on the empirical analyses performed above in this section, it can be seen that the proposed iBSSA algorithm with the three classifiers typically performs better than other state-of-the-art and well-known algorithms. Specifically, iBSSA with RF efficiently achieved higher mean accuracy over the majority of datasets, followed by iBSSA with SVM and iBSSA with k-NN. Moreover, the iBSSA algorithm performed the search more stably in comparison with its peers, based on the small standard deviation values in the results. This subsection analyzes the underlying reasons.

First, in an attempt to resolve the feature selection problem more effectively, the original SSA algorithm in its continuous version was first mapped into binary using a transfer function. To make this approach as fruitful as possible, the present study compared nine S-shaped and V-shaped transfer functions. A probability is obtained using the transfer function to map each real value into a binary one. Moreover, three well-known classifiers, k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), and Random Forest (RF), are utilized as fitness evaluators both with the proposed iBSSA and with many other competitor algorithms, using 18 multifaceted, multi-scale benchmark datasets from the UCI repository. Through experimental comparisons, iBSSA-Vv2–RF, i.e., iBSSA with the transfer function Vv2 and the RF classifier, performed the best. Nine transfer functions were compared in this work because different transfer functions have different curve slopes: a transfer function with a better-suited slope yields a more reasonable probability mapping, so that the algorithm can perform better.

Second, based on the above introduction, the SSA algorithm can be understood as a novel swarm-based meta-heuristic algorithm. It has strong exploration and exploitation capabilities, which were further boosted in the proposed iBSSA method by incorporating a strategy for random re-positioning of roaming agents as well as a novel local search algorithm. Moreover, the variable parameters and the various mechanisms in the algorithm balance its exploration and exploitation. Herein, iBSSA only maps real values into binary ones and introduces two improvements (3RA and LSA) into the original structure; even so, the proposed iBSSA algorithm exhibits superior performance.

Moreover, apart from the minimized feature subset and the better mean accuracy, iBSSA with k-NN exhibited better convergence behavior than other optimizers on most of the datasets. Although iBSSA has, in general, achieved better performance compared to other methods, including the original BSSA algorithm, the convergence graphs reveal that iBSSA still has a problem with high-dimensional datasets, specifically becoming trapped in local optima. Additionally, the mean accuracy of iBSSA with k-NN is inferior to other approaches on high-dimensional datasets. In the future, these shortcomings will be examined in depth. Apart from iBSSA, k-NN was used in this study as a classification algorithm due to its ease of implementation. However, k-NN performance is often degraded compared to SVM and RF in terms of mean accuracy and mean number of selected features. This can be justified by the adopted wrapper-based feature selection approach, wherein iBSSA works with k-NN as a lazy classifier and with SVM and RF as complex classifiers. Lastly, it should be noted that, because the optimization results are not exactly repeatable, different executions of the algorithm can give a different subset of features, which may confuse the user. Hence, on different applications or occasions, iBSSA or the other meta-heuristic algorithms applied herein may find a different subset of features.

In summary, three fine-tuned classification models are presented in this study, from their algorithmic design perspective as well as within a general framework. First, we briefly discussed the conventional SSA algorithm, different binary conversion techniques, and the proposed learning algorithms. Then, the proposed iBSSA was duly presented and a theoretical study was conducted on FS approaches. We also presented an adequate analysis of some state-of-the-art FS approaches in a comparative manner. Furthermore, a comparison of existing FS algorithms was conducted based on standard performance evaluation metrics. Finally, based on the FS problem formulation, research challenges were discussed along with some future directions. This study could be a good starting point for researchers new to the FS community, providing them with a better understanding of recent developments on the FS problem. The current experimental study on FS identifies several research possibilities in data classification. Although much work has been done on FS, many issues remain to be uncovered. Future work in FS should focus on increasing efficiency, robustness, scalability, and effectiveness on large-scale datasets. Along with scalability, one of the more challenging issues is the role of context in FS, which must be addressed in the future.

6 Conclusions and future directions

A novel approach was proposed in this study for solving the FS problem using an improved binary version of the SSA algorithm (iBSSA). First, a suitable method for binary transformation of the continuous SSA into a binary version was determined through probability mapping using common S-shaped and V-shaped transfer functions. In addition, a strategy for Random Re-positioning of Roaming Agents (3RA) along with a novel Local Search Algorithm (LSA) were embedded into the proposed iBSSA algorithm, in order to increase the algorithm’s capability to search within feasible regions and to prevent it from becoming trapped in local optima, respectively. Then, the resulting methods were tested on 18 multifaceted, multi-scale benchmark UCI datasets with three well-known classifiers: k-NN, SVM, and RF. The experimental results showed that iBSSA-Vv4–k-NN, iBSSA-Vv2–SVM, and iBSSA-Vv2–RF were the best-performing of the proposed methods. Each of these methods was then compared with the better-performing and most popular methods in the literature, independently for every classifier. Comparisons were driven by a number of measures, including mean classification accuracy, mean fitness, and mean number of selected features. By analyzing the experimental results, the three models proposed in this article (i.e., iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF) had relatively the highest performance among the competing algorithms for solving FS problems. Specifically, iBSSA with RF (iBSSA–RF) performed better than the other methods; therefore, iBSSA–RF can be given higher consideration when solving FS problems.

Moreover, the proposed iBSSA wrapper feature selection model has an important practical implication. Based on the various types of datasets used over the conducted experiments, iBSSA can be introduced into other domains, including medical applications, engineering applications, data mining, data science, and many more. However, the proposed iBSSA has one limitation: selecting more features than those selected by competitors over 6, 5, and 11 out of the 18 datasets based on iBSSA–k-NN, iBSSA–SVM, and iBSSA–RF, respectively. Therefore, the proposed algorithm can be reinforced by using a new selection strategy to select fewer features, especially on high-dimensional small instance datasets, which calls for further potential future research.

For future work, different conversion forms of iBSSA can be attempted for classification purposes. In addition, the binary version successfully derived from the continuous SSA can help tackle other discrete optimization problems, e.g., task scheduling, the Travelling Salesman Problem (TSP), the Knapsack Problem (KP), etc., as well as real-world problems, including Intrusion Detection Systems (IDS), cancer detection, and sentiment analysis. On the other hand, a sensitivity analysis of the main parameters of the proposed algorithm can be explored as a future extension of this work. Moreover, embedding the proposed LSA algorithm into other optimization algorithms could also be investigated. Lastly, other classifiers (e.g., Artificial Neural Networks (ANNs), Decision Tree (DT), etc.) could be employed to further investigate the performance of iBSSA in feature selection for classification.