An improved binary sparrow search algorithm for feature selection in data classification

Gad, Ahmed G.; Sallam, Karam M.; Chakrabortty, Ripon K.; Ryan, Michael J.; Abohany, Amr A.

doi:10.1007/s00521-022-07203-7

An improved binary sparrow search algorithm for feature selection in data classification

Original Article
Open access
Published: 27 April 2022

Volume 34, pages 15705–15752, (2022)
Cite this article

Download PDF

You have full access to this open access article

Neural Computing and Applications Aims and scope Submit manuscript

An improved binary sparrow search algorithm for feature selection in data classification

Download PDF

3951 Accesses
29 Citations
Explore all metrics

A Correction to this article was published on 31 July 2022

This article has been updated

Abstract

Feature Selection (FS) is an important preprocessing step that is involved in machine learning and data mining tasks for preparing data (especially high-dimensional data) by eliminating irrelevant and redundant features, thus reducing the potential curse of dimensionality of a given large dataset. Consequently, FS is arguably a combinatorial NP-hard problem in which the computational time increases exponentially with an increase in problem complexity. To tackle such a problem type, meta-heuristic techniques have been opted by an increasing number of scholars. Herein, a novel meta-heuristic algorithm, called Sparrow Search Algorithm (SSA), is presented. The SSA still performs poorly on exploratory behavior and exploration-exploitation trade-off because it does not duly stimulate the search within feasible regions, and the exploitation process suffers noticeable stagnation. Therefore, we improve SSA by adopting: i) a strategy for Random Re-positioning of Roaming Agents (3RA); and ii) a novel Local Search Algorithm (LSA), which are algorithmically incorporated into the original SSA structure. To the FS problem, SSA is improved and cloned as a binary variant, namely, the improved Binary SSA (iBSSA), which would strive to select the optimal or near-optimal features from a given dataset while keeping the classification accuracy maximized. For binary conversion, the iBSSA was primarily validated against nine common S-shaped and V-shaped Transfer Functions (TFs), thus producing nine iBSSA variants. To verify the robustness of these variants, three well-known classification techniques, including k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), and Random Forest (RF) were adopted as fitness evaluators with the proposed iBSSA approach and many other competing algorithms, on 18 multifaceted, multi-scale benchmark datasets from the University of California Irvine (UCI) data repository. Then, the overall best-performing iBSSA variant for each of the three classifiers was compared with binary variants of 12 different well-known meta-heuristic algorithms, including the original SSA (BSSA), Artificial Bee Colony (BABC), Particle Swarm Optimization (BPSO), Bat Algorithm (BBA), Grey Wolf Optimization (BGWO), Whale Optimization Algorithm (BWOA), Grasshopper Optimization Algorithm (BGOA) SailFish Optimizer (BSFO), Harris Hawks Optimization (BHHO), Bird Swarm Algorithm (BBSA), Atom Search Optimization (BASO), and Henry Gas Solubility Optimization (BHGSO). Based on a Wilcoxon’s non-parametric statistical test ($\alpha =0.05$), the superiority of iBSSA with the three classifiers was very evident against counterparts across the vast majority of the selected datasets, achieving a feature size reduction of up to 92% along with up to 100% classification accuracy on some of those datasets.

Puma optimizer (PO): a novel metaheuristic optimization algorithm and its application in machine learning

Article 19 January 2024

A review of unsupervised feature selection methods

Article 29 January 2019

Hybrid approaches to optimization and machine learning methods: a systematic literature review

Article Open access 24 January 2024

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Data mining is a rapidly growing paradigm in data science, due to the massive data gathered daily as well as the excessive demand to turn this data into useful knowledge [46]. Data mining includes multiple preprocessing steps (i.e., integration, filtering, transformation, reduction, etc.), knowledge presentation, and pattern evaluation [69]. Amongst them, Feature Selection (FS), also known as variable (feature) subset selection, is one of such prerequisites, which is reliable to strive to reduce the high computational costs required by heavy mining tasks by discarding potential noisy, redundant, and irrelevant features that may degrade the classification accuracy. Basically, the major objective of an FS procedure is to optimally carry out the classification task by first finding a subset of variables. A given set of features are usually initially given and then an optimal subset of features is thought, pursuing high classification performance.

The term FS resembles an algorithm using an input feature set to output a subset of it. In supervised learning, a set of instances of the same attributes exists along with another attribute representing the corresponding class. Via those instances, a rule or set of rules can be created and generalized, enabling the classification of the instance set highly possible precision. Research on FS has been since the early 1960s (see [27] or [103] for an overview and biographical notes).

Raw datasets by themselves are well-known for not providing much information. So, to extract useful knowledge or patterns, raw data must first be processed. To this end, deriving informative knowledge from these data calls for the development of new effective and efficient methods. This is becoming increasingly substantial [37]. Typically, many as deemed necessary as are features are involved to compose a real-world dataset. This is closely related to: i) knowledge and, in turn, essential features capturing this knowledge concerning the domain of interest; ii) the availability of these features; iii) the resources available for collecting these available features; and (iv) the resources available for storing, maintaining, and retrieving these features. Thus, it is obvious the unnecessity or insufficiency of the all features stored in the resulting dataset to learn the concerned concept. Assuming the presence of the relevant features in this data, FS is the art of electing a small feature subset ideally necessary and sufficient to describe the concept of interest [61].

FS is a recommended data-preprocessing phase in most machine learning tasks due to its potential to alleviate the curse of high dimensionality in large and complex datasets. This helps in reducing the data processing workload and decreasing the computational cost of classification, thus saving resources while enhancing the algorithm overall performance. There are a large number of applications that depend on high-dimensional datasets with a large number of features. Many of these features may be redundant or obsolete. Additionally, the database at hand may be inspected by a high noise level, which may result in degrading the overall classification performance and increasing computational cost [69]. Hence, proposing an efficient FS method is crucial to find the best subset of highly relevant and informative features while maintaining the representation of the original dataset with high classification accuracy as well. Ultimately, FS has three main advantages [46]: i) enhancing the predictor’s performance; ii) gaining a better understanding of the underlying process leading to faster data processing; and iii) giving more cost-effective prediction models.

1.1 Motivation

It is an NP-hard problem searching for a variable subset [19, 22, 36, 64]. For this reason, it cannot be guaranteed to acquire the optimal solution except when conducting an exhaustive search which is quite unfeasible for this situation. Meta-heuristic techniques, being not having to do a comprehensive search for the whole solution space, permits obtaining reasonably good solutions. The quality of each “heuristic” solution obtained is tightly based on the method adopted. In other optimization problems, meta-heuristic methodologies have been found so superior [23, 52, 56, 84, 110, 117]. In this field, we mention the Differential Evolution (DE) [11], Ant Colony Optimization (ACO) [58], Particle Swarm Optimization (PSO) [77], Genetic Algorithm (GA) [29], and most recently, Harris Hawks Optimization (HHO) [132], Whale Optimization Algorithm (WOA) [74], and Salp Swarm Algorithm (SSA) [35]. In real-world applications, people favor a timely obtaining of good solution rather than being obsessed with optimal solutions. Therefore, we have opted the meta-heuristic techniques due to their practicality in real-world applications.

Among various meta-heuristics, swarm-based optimization algorithms have been used to shatter the obstacle to choosing the best solution (optimal feature subset) to the FS problem [78]. The Sparrow Search Algorithm (SSA) is a novel swarm-based optimization algorithm recently introduced by Xue et al. [119] for solving continuous optimization problems. An improved binary version called iBSSA is implemented and introduced based on SSA for solving the problem of wrapper-based FS. In this study, the SSA algorithm was chosen for solving the FS problem, driven by a number of motivations. First, based on the rigorous discussion through Sect. 2, meta-heuristic algorithms have proven superior in solving the FS problem. Therefore, the performance of the new SSA algorithm, being a novel member of this toolbox, is tested. Second, SSA is a recently proposed meta-heuristic algorithm with the capabilities of fast convergence speed, stability, searching precision, and escaping from local optima [119], which is yet to be systematically applied to solving the problem of FS. Last, when the proposed iBSSA algorithm was compared with some high-performance optimization algorithms as well as other latest well-developed ones, it highly supoorted the above claim by providing higher efficiency (i.e., fewer number of iterations or less calculation time) and providing an optimal or near-optimal solution to most of the datasets involved. In order for the FS problem to be dealt more effectively, this work proposes a novel improved binary SSA algorithm by mapping the continuous SSA into discrete values using a set of common S-shaped and V-shaped TFs, along with incorporating the two promising improvements, roaming agents’ (sparrows’) resetting strategy and the local search method.

On the other hand, k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), and Random Forest (RF) are prevalent learning techniques in the machine learning paradigm. In this study, while iBSSA is adopted as a search optimization approach for finding out the most relevant features, the diverse classifiers adopted herein (i.e., k-NN, SVM, and RF) are used as fitness evaluation functions (or evaluators), in order to, at the last, compose a new wrapper FS method. This research will only address supervised FS methods. As mentioned, meta-heuristics have achieved great success in the FS domain. However, most of the proposed methods typically only considered the k-NN classifier and omitted SVM in many cases, while RF has been almost completely ignored, although SVM and RF typically provide better results than k-NN in diverse classification tasks [7, 25, 80]. Therefore, in this study, the presented iBSSA method is tested with the three referred classifiers which are used for estimating the performance of the proposed approach in terms of the fitness value inferred from the classification error rate.

A last motive, many researchers have turned a blind eye to further addressing the roaming behavior of individuals as well as the premature convergence problem in meta-heuristic optimization while solving the FS problem [1, 71, 82]. For this reason, the present work accounts for both re-positioning of roaming individuals to promote the search within feasible regions to amend the best solution randomly faster, combined with a local search algorithm for enhancing the exploitation capability.

1.2 Contribution

To improve the exploration of SSA as well as to promote its exploratory power within feasible regions, a Random Re-positioning of Roaming Agents (3RA) strategy was proposed. Further, a novel Local Search Algorithm (LSA) was also incorporated into the standard SSA algorithm to boost the exploitation process by enhancing the optimum solution obtained at the end of each main loop or iteration in the algorithm. First of all, iBSSA was tested over nine common S-shaped and V-shaped TFs, in order to determine the most effective one that allows iBSSA to exhibit the best convergence speed as well as the best averages of fitness, accuracy, and number of selected features. Then, in pursuit of a fair comparison, the best-performing TF was introduced into the proposed iBSSA as well as into the binary versions of other competitor algorithms. To demonstrate the potential of the proposed approach to significantly select the most relevant features, such popular expert systems as k-NN, SVM, and RF were used to estimate the average fitness value based on the classification error rate averaged on 30 independent runs of the algorithm, over 18 multi-scale benchmark datasets carefully selected from the University of California Irvine (UCI) machine learning repository [38]. To sum up, the main contributions of this paper are outlined as follows:

1.
An improved binary version of SSA (iBSSA), which is enhanced through the 3RA strategy and the LSA method, is proposed for the first time based on a feature transformation method for wrapper feature selection in classification tasks.
2.
iBSSA was evaluated based on nine different TFs, including S-shaped and V-shaped functions, over 18 benchmark multifaceted, multi-scale UCI datasets.
3.
For a fair comparison, the TF most effective with SSA was also integrated into other well-known meta-heuristics, including SSA, ABC, PSO, BA, GWO, WOA, GOA, SFO, HHO, BSA, ASO, and HGSO, for binary conversion, which revealed the supremacy of the proposed iBSSA method for feature selection in classification when compared to other competitor algorithms, over all three classifiers adopted in this study (i.e., k-NN, SVM, and RF), for the majority of the datasets used.
4.
The final results were validated based on different performance metrics, including mean classification accuracy, mean fitness, mean number of selected features, along with the respective standard deviation values.
5.
The supremacy of the proposed iBSSA method with the three classifiers was affirmed compared to competitors, based on a Wilcoxon’s non-parametric statistical test at a significance level $\alpha =0.05$.

1.3 Structure

The rest of the paper is organized as follows: Section 2 reviews the existing FS techniques; Section 3 presents a simplified formulation of the FS problem, a description of the original SSA, a brief discussion on the nine TFs, a discussion on the two improvements embedded into the SSA, as well as a debate on the three classifiers adopted in this study: k-NN, SVM, and RF; Section 4, elaborates for and investigates the proposed iBSSA algorithm; Section 5 introduces the computational results and comparisons with a few prominent rival algorithms, and the conclusions as well as some suggested prospective works follows in Sect. 6.

2 An overview of feature selection methods

Dimensionality reduction is so crucial especially when experiencing a high-dimensional feature space, therefore it is commonly used in machine learning where a new space of reduced dimensions is produced by mapping the original feature space onto it. It is extremely important to Identify relevant features in classification tasks (reducing computational costs while maintaining the highest possible accuracy), as well as to recognize the features’ relative significance. Two different paths are usually taken for dimensionality reduction, constructing new dimensions or picking a subset of the original dimensions. Research literature has often distinguished feature selection from feature extraction. In feature selection, the aim is to individually—from among a large set of features—find the best features that can effectively boost the classification capability. More general methods, termed feature extraction algorithms, adopt transformation or combination techniques to create new features form the original feature set. Different weighting schemes have been traditionally used in feature extraction to generate a smaller number of ideally uncorrelated features either by linearly combining features (as in Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA)) or combining them non-linearly (as in neural networks).

To use feature selection or feature extraction is the first critical issue. Each has its own limitations, and no clear-cut evidence exists for the superiority of one to the other for diverse tasks. Feature extraction methods, including PCA [39]—which linearly combine all the available features to produce an, ideally uncorrelated, group of features—may struggle if some of the features are not heterogeneous at all, and there is no correlation between data assumptions. Similarly, it is unfeasible to use sequential search methods or do exhaustive search for high-dimensional datasets in FS. It is simple to examine features individually; however, poor feature subsets can be then yielded. On the other hand, it is too much time consumed, as well as running out of computer memory to try different feature combinations. Finally, it is not necessarily that important features on training data exhibit the same importance level on test data.

As mentioned, the FS problem is the only focus in this work.

Much research on applications have concerned with feature selection (so-called variable selection) for which tens or hundreds of thousands of features are available in datasets. Machine learning tasks, such as time-series prediction, regression, classification, etc. may involve FS problems.

A resurgence of interest has been paid to applying FS methods to treat the large numbers of features encountered in different types of problems, for example:

Information fusion of multiple sensors’ data. For instance, both color and shape features have been merged in a trademark image database to provide a higher retrieval accuracy [54].
Pooling the parameters from different mathematical models for classification purpose, as the integration of multiple models in [97].
Discovering the hidden relationships correlating a plethora of features, as the data mining application in [89].

The effectiveness of an inference model can be enhanced using an appropriate feature selection. Feature selection can grant such significant merits as follows [69]: i) improving the overall performance (simplicity of rules, predictive accuracy, or speed of learning); ii) model selection via data visualization; and iii) dimensionality reduction and noise removal. Although many advantages are offered by feature selection, the risk of over-fitting or decreasing accuracy may be encountered. Thus, it should be well managed how to avoid these risks to get the desired classification performance.

2.1 Typical Feature Selection techniques

There are two typical procedures in FS techniques: exploring the feature space using an underlying search or ranking algorithm; and guiding the underlying algorithm by, for example, measuring the classification error rate in the form of a cost function. On how to evaluate this cost function, proposed approaches most important are described below. The universal algorithms of FS often belong to one of three modalities: filters, wrappers, and embedded [43, 63]. The filter-based approach [61, 65] does not refer to or learn from the target classifier while determining the fitness of an examined feature subset. The cost function is evaluated independently of the target classifier in the following classification of independent datasets using the pre-selected subset of features. Rather, the cost function value can be computed based on a generic error estimation function guides the search for an optimal subset or the ranking of the individual features in the feature search space. This procedure is flawed by ignoring the potential effect of the feature subset on the learning algorithm. In contrast, in the wrapper-based approach [27, 120] the fitness of an examined feature subset is determined by underlying the subset to the design of the target classifier (used as a black box) to get an estimated classification error rate through ranking features in the subset based on their predictive power. Usually, wrappers approach can select more relevant subsets of features than filters approach and thus give better classification accuracy. However, each examined subset’s fitness needs a more computationally intensiveness for fitness assessment than the filter-based methods in terms of the classification error estimation. Consequently, since the wrapper-based methods must evaluate many subsets, they are often criticized by “brute force” and an additional computational overhead may emerge as a result of the target classifier-based evaluations.

Obviously, in the case of the not-too large variables’ number, exhaustive search can be adopted and performed under these two approaches. However, it is well-known that the problem is NP-hard [19, 22, 36, 64] and the computational intractability quickly results from the search. When it is too large the number of variables, high computational costs may be incurred by the approaches. In contrast to wrapper approaches, embedded methods determine the feature subset given the classifier design.

To solve the FS problem more pragmatically, other techniques have been adopted to find a good approximate solution, hopefully closest possible to the optimal subset. Some of the potential subsets of features are intelligently sequentially examined based on “forward” selection or “backward” elimination criteria to select the optimal subset amongst all the examined subsets based on the best cost function evaluation. In the following, the most commonly used methods—under these techniques—for performing FS are discussed. Sequential Forward/Backward Selection (SFS/SBS) [5] typically performs according to simple greedy deterministic heuristics. SFS [115] starts by an empty subset and new features are sequentially incorporated into this subset, whereas in SBS [75], the selection process is reversed, SBS begins with the original subset of all features and sequentially removes the least promising ones until satisfying a certain termination criterion. Both methods suffer from what is called the “nesting effect”; that is, features once selected in the SFS method cannot be discarded later, whereas the features discarded in the SBS method cannot be re-selected. The result is the permanent suboptimality of these methods. In addition, high computation time is required by forward selection and backward elimination when the dimensionality is very high [63].

The nesting effect can be definitely avoided by fusing SFS and SBS. To prevent the “nesting” impact, a method called “plus-l-take-away-r” has been suggested. This method [104] does by applying SFS followed by SBS l and r times, respectively. This forward and backward selection fixed cycle is iterated until reaching a predefined number of features. Thus, the nesting effect can be avoided by removing (in posterior steps) some of the features that have been priorly added. This method enables “fixed backtracking” based on top-down or bottom-up search and this is defined by l’s or r’s values. Although this procedure can partially overcome the problem of nested features, another problem arises: it is not theoretically easy to appropriately determine the land r’s value so that good enough solutions can—with a moderate amount of computational resources—be obtained. The plus-l-take-away-r method basically aims to counteract the nesting effect, and its implementation can be improved more efficiently by conditionally including and excluding features, guided by the value of the fitness itself.

In 1994, the concept of “floating feature search” was introduced by Pudil et al. [87]. In addition, Sequential Forward Floating Selection (SFFS) and Sequential Backward Floating Selection (SBFS) were introduced as two “floating” search methods. Floating selection methods are related to the plus-l-takeaway-r. Unlike the latter, the number of forward and backward steps is not beforehand fixed but controlled dynamically. In the methods of SFFS and SBFS, number of features included or excluded is changing at the different stages of the procedure. These two methods are probably the most effective FS techniques [53]. Mainly, in a forward search, a null feature set is initialized and, for each step, the current feature set contains the best feature that satisfies some fitness function (i.e., one SFS step is performed). At the same time, it is verified by the algorithm if it is possible to improve the fitness by excluding some feature. if so, the set is then refined by eliminating the worst feature, controlled by the fitness value (i.e., one SBS step is performed). Therefore, the SFFS proceeds by dramatically tuning the number of features in subset (increasingly and decreasingly) until reaching the desired number. Similarly, SFBS initiates with the full feature set and uses SBS and SFS steps to perform the search until reaching the desired dimension. Even though these two methods cannot always get the best feature subset, they perform very well in comparison with other search methods.

Compared to the plus-l-take-away-r methods, floating search methods are particularly characterized by the potential to achieve good performance by making more than one sweep while pursuing the best subset of features. In practice, dynamic backtracking search performs very robustly and, among different feature set search methods, the first choice for many scholars would be the floating search procedures [86]. Plus, floating methods behave just as efficiently as the best sequential methods.

Despite the significant intelligence exhibited by the floating methods, they still suffer suboptimality. These sequential floating methods are criticized by being highly elected to fall into the local optimal solutions’ trap, even if the problem scale is quite small and the fitness function is monotonic.

Other literature methods or algorithms adopted for the FS problem include the following:

The Branch-and-Bound (BB) FS algorithm has been invented by Narendra and Fukunaga [81] to find the optimal feature subset. This method has one drawback: it requires the monotonousness of the FS criterion function. Thus, adding new features to a feature subset cannot enable to reduce the fitness function value. Unfortunately, it is seldom to satisfy the monotonic condition. Nevertheless, in large feature spaces, the computational cost is prohibitive: in the worst scenario, an exhaustive search is performed and the time exponentially becomes more complex in accordance with the expansion in feature space dimensionality. As pointed out in [53], it is still impractical to apply the BB method to very-large-scale problems.
The Max–Min (MM) FS method has been invented by Backer and Shipper [10]. It is a computationally efficient method in which only individual and pairwise merits of features are evaluated. This method invariably achieves unsatisfactory results [53, 62].
The use of GA for FS has been firstly introduced by Siedlecki and Sklansky in 1989 [103]. In a GA approach, a binary string (a “chromosome”) is used to represent a given feature subset of total length D, with a value of one or zero in the j-th position to denote the presence or absence of feature j in the set. The algorithm maintains a population of chromosomes and the “fitness” of each chromosome is evaluated in terms of how likely is the survival and breed of the chromosome into the next generation. Old chromosomes are used to create new ones by the following processes: i) crossover, where offspring is created by mixing parts (genes) of two different parent chromosomes; and ii) mutation, where a child is created by randomly disturbing the bits of a single parent. Other works that apply GA to the FS problem include those of Emary et al. [32], Jiang et al. [55], Raman et al. [90], Dong et al. [29], Bouktif et al. [14], and Das et al. [21].
The Tabu Search (TS) method has been developed for feature subset selection by Zhang and Sun [128]. An adaptive memory (the tabu list) is used by TS to keep track of solutions that have been visited and should–—for a number of iterations—be avoided. The “tabu tenure” determines how long the tabu list should reserve a solution. In [128], a comparative analysis of the TS-based technique and other FS techniques (SFS, SBS, plus-l-take-away-r, SFFS, SFBS, and GA) has been conducted and the performance analysis shows TS as a promising search “heuristic” for the FS problem.

Despite some progress, it is not yet completely satisfactory the FS techniques available now for large feature sets. They are either enriched with computational feasibility but far from optimality, or they are enriched with optimality (or near-optimality) but cannot cope with the computational complexity of real-world FS problems. More powerful methods for FS are required to be developed, to provide very good results with more efficient computations. Therefore, in the present work, a novel improved meta-heuristic strategy, called iBSSA, is proposed to solve the FS problem.

Several challenges usually arise when getting up to find an optimal subset of features using the aforementioned traditional techniques. Multiple search methods, such as breadth search, random search, depth search, or hybridization of them have been additionally tried to determine the optimal subset of features. Further techniques have been proposed in [2, 3]. However, exhaustive search technique typically adopted there has been found time-consuming and unfavorable, especially with high-dimensional datasets [106]. Feature selection is arguably formulated as a combinatorial NP-hard optimization problem [19, 22, 36, 64]. For example, if we suppose that the number of features in a given dataset is D, it will be computationally expensive to find the optimal subset of features among $2^D$ different candidate combinations (solutions), given the fact that $2^D$ different evaluations are required for conducting a full search, methods of forward selection or backward elimination are recursively used in SFS and SBS. In addition, local optima can be more efficiently avoided by using stochastic methods. Therefore, those methods, especially meta-heuristics (discussed below), have been increasingly attempted by researchers as wrapper-based approaches to find out the best feature subset based on their global search process that is significantly effective with large feature spaces [4, 101, 124].

2.2 Meta-heuristics-based feature selection techniques

In recent years, meta-heuristic optimization algorithms, which are very close to stochastic methods in generating and using random variables, have been incorporated into different application areas, including building construction, health-care, agriculture, computer engineering, and many more [117]. This may be due to the obvious merits of these algorithms, including gradient-free nature, flexibility, simplicity, and independency of the problem of interest [51, 79]. Moreover, these algorithms can potentially approximately find the best (optimal) solution in small, comparable amounts of time, even for very large-scale problems. There is therefore still an appeal of solving complex optimization problems, such as the FS problem, using different advanced meta-heuristic algorithms [28]. Among many others, swarm intelligence and physics-based methods are two major categories of meta-heuristic algorithms [50, 130, 131]. Swarm intelligence mimics the collective social behavior of various species (i.e., animals or insects) in swarms for hunting or foraging, in which all individuals cooperate and share information during the optimization process. Representative algorithms include Artificial Bee Colony (ABC) [57], PSO [30], Bat Algorithm (BA) [122], Grey Wolf Optimizer (GWO) [79], Firefly Algorithm (FA) [121], and Cuckoo search (CS) [123]. This latest algorithms in this category include the WOA [78], Bird Swarm Algorithm (BSA) [76], HHO [49], Sailfish Optimizer (SFO) [98], Grasshopper Optimization Algorithm (GOA) [93], and Butterfly Optimization Algorithm (BOA) [9]. Moreover, the laws of physics in nature were utilized to come up with physics-based optimization methods, including Simulated Annealing (SA) [112], Gravitational Search Algorithm (GSA) [91], Atom Search Optimization (ASO) [133], and Henry Gas Solubility Optimization (HGSO) [47]. To learn more about meta-heuristic algorithms, the reader is referred to [28].

In an attempt to tackle the FS problem, meta-heuristics methods have been, due to their simplicity and flexibility, tried by many researchers, and some of them are investigated here. Sharawi et al. [99] introduced a new variant of the WOA algorithm that employed the wrapper-based technique to select the optimal feature number to yield the best mean classification accuracy. Subsequently, Eid [31] suggested the S-shaped function with WOA for tackling FS problems. Two versions of WOA were introduced by Mafarja and Mirjalili [70]. In the first, instead of a random operator, the roulette wheel and tournament selection were applied. The mutation and crossover operators were used in the second version to boost the performance of the proposed algorithm. Furthermore, Sayed et al. [94] hybridized WOA with the chaotic search to solve the slow convergence speed and stagnation to local optima issues potentially encountered while solving FS problems. Mafarja et al. [73] introduced two binary versions of the GOA algorithm. The first method used the S-shaped and V-shaped functions as Transfer Functions (TFs), while the second version employed the mutation operator to improve the exploratory ability of the proposed algorithm. Mafarja et al. [72] employed evolutionary population dynamics and selection operators to enhance the performance of the conventional GOA. Zakeri and Hokmabadi [125] hybridized the GOA with some statistical metrics for replacing the duplicated features with the most favorable one. In a related context, Emary et al. [32, 33] introduced the GWO algorithm and two binary versions of the Lion Optimization Algorithm (LOA), based on S-shaped and V-shaped TFs. Additionally, Arora and Anand [8] proposed a novel binary variant of the BOA based on some of the above-mentioned TFs. Chen et al. [17] introduced a discrete PSO algorithm based on a logistic map sequence to improve the swarm diversity. De Souza et al. [24] suggested a new variant of Crow Search Optimization (CSO) algorithm using a V-shaped TF. Sayed et al. [96] hybridized the Flower Pollination Optimization Algorithm (FPOA) with the clonal selection search, in which the optimum path forest accuracy was applied to evaluate the solution performance. Zawbaa et al. [127] managed to integrate the rough theory into FPOA to solve FS. Sayed et al. [95] validated the performance of 10 chaos maps incorporated into both the CSO and SSA algorithms. Hegazy et al. [48] introduced a new control parameter which was involved to refine the best solution obtained so far, using the k-Nearest Neighbor as an expert system for the evaluation of the relevance of the selected features. Moreover, Zhang et al. [129] hybridized the FA and SA algorithms with the aim to escape from the local optima while increasing the accuracy of the final solutions. Several enhancements, such as return-cost attractiveness, Pareto dominance-based selection, and binary movement with adaptive jump, were employed in [129] in order to solve the underlying FS problems more effectively. Faris et al. [34] introduced a multi-verse optimization algorithm, adopting the Support Vector Machine (SVM) classifier as a fitness evaluator. Gu et al. [41] suggested a new version of the PSO algorithm to solve a high-dimensional FS problem. Many other meta-heuristic based approaches were proposed for FS by, for example, Hafez et al. [44], Kashef and Nezamabadi-pour [58], Li et al. [67], Tabakhi et al. [105], Wang et al. [114], and many others. In fact, the introduction of meta-heuristic algorithms to solve FS problems is a trend for some years now. To learn more about meta-heuristic algorithms for FS, please refer to the review paper [100]. As discussed earlier, there are various advantages of meta-heuristics, so will the FS problem be solved using existing methods sufficiently? The No Free Lunch theorem (NFL) [116] is the appropriate answer to this question. This theorem indicates that a single algorithm cannot optimally solve all optimization problems. For FS on a dataset, while an algorithm performs very well for one kind of dataset, another one may perform poorly. Thence, the search for an advanced meta-heuristic approach for solving almost all possible FS dataset types is still an open research question.

3 Preliminaries

3.1 The Feature Selection (FS) problem

The problem of selecting a feature subset can be formulated as a combinatorial optimization problem with the major aim of obtaining superior classificatory performance as follows: Let an initial feature set, $\mathbf {f}$, with cardinality D. Let the number of feature to be pursued be represented by d in the selected subset $\mathbf{s}$. Let the fitness function of the selected subset $\mathbf {s}$ be represented by $f(\mathbf {s})$. Let the misclassification error rate, $c(\mathbf {s})$, of a given classifier to be minimized when presented with the feature subset $\mathbf {s}$. Formally, the FS problem is to find a subset $\mathbf {s} \subseteq \mathbf {f}$ such that $|\mathbf {s}|=d$ and

$$\begin{aligned} f(\mathbf{s})=\min _{\mathbf {s}\subseteq \mathbf {f},|\mathbf {s}| =d,d<D}c(\mathbf{s}). \end{aligned}$$

So, a lower value of $c(\mathbf {s})$ indicate a highly feasible feature subset. In this case, the “goodness” of a selected subset of features is assessed and computed in the form of a fitness function as follows: Let C be a set of cases (i.e., the whole instances or samples in a given dataset). We know the class to which each case relates (here binary classifications, with only two classes, is considered). A certain partition is made in C, $C=C_1 \cup C_2$, where $C_1$ (training data) and $C_2$ (test data) are in the vicinity of the same cases’ number and the same representation ratio for each class. The Euclidean distance is calculated from each case in $C_2$ to every case in $C_1$ and the class corresponding to nearest case is assigned. $f(\mathbf{s})$’s value is the proportion of mishits in the assigned classes. That is, how much times the assigned class was NOT the real (true) class.

If we adopted an exhaustive approach that is applied to this problem, this would require investigating all $\begin{pmatrix}D\\ d\end{pmatrix}$ possible d subsets (which can be derived from the feature set $\mathbf {f}$). But the exponential growth of possibilities makes exhaustive search impractical even with reasonable D’s values. As mentioned, FS is an NP-hard problem [19, 22, 36, 64] and heuristic and meta-heuristic techniques are therefore the most appropriate choice.

3.2 Sparrow Search Algorithm (SSA)

Sparrow Search Algorithm (SSA) [119] is a meta-heuristic optimization algorithm under the umbrella of swarm intelligence, meta-heuristics and computational intelligence. SSA is a natural extension of the BSA algorithm [76] extracted from the social behavior and interactions of bird swarms. Sparrows are gregarious birds of several species distributed across the globe and prefer to live in areas where people live. Besides, it belongs to the family of omnivorous birds that usually feed on weed or grain seeds. The sparrow is intelligent in comparison with other small birds and has a strong memory inspired by behaviors of anti-predation and foraging. In fact, captive house sparrows are divided into two main types based on their foraging behaviors: scrounger and producer [12]. The producers energetically pursue potential sources of food, while the scroungers grab food via those producers. In addition, it has been evidenced that birds usually switch between producing and scrounging by utilizing behavioral strategies flexibly [13]. With that being said, sparrows usually use the same strategies which are adopted by producers and scroungers to find their food [68].

Literature studies have shown that each individual in SSA keeps monitoring the behavior of its neighbors. Meanwhile, attackers against the bird flock are competing for foraging high intakes of food for the companions, thereby increasing their own predation probability [66]. Though, when different foraging strategies are adopted by the sparrows, the energy retained by the individuals may be then well utilized in the pursuit of food, so that the scrawny sparrows can scrounge even more [66]. It should be noted that birds within the search space are more vulnerable to attack by predators and therefore are required to seek a better (safer) location [15]. While the birds at the centre can minimize their domain of danger by moving closer to the neighbors [45]. It is well known that the bird family, especially sparrows, shows a natural instinct of curiosity along with constant vigilance. For example, when a predator is detected by a bird, a chirp is given by one or more individuals so that the entire group flies away from the source of danger [88]. According to the previous description of sparrows, a mathematical model can be formulated to construct the SSA algorithm. In simulation experiments, the food source is pursued by using virtual sparrows with a position vector expressed as:

$$\begin{aligned} \mathbf {x} = \begin{bmatrix} x_{1,1} &{} x_{1,2} &{} \ldots &{}x_{1,D} \\ x_{2,1} &{} x_{2,2} &{} \ldots &{}x_{2,D} \\ \vdots &{} \vdots &{} \ldots &{} \vdots \\ x_{N,1} &{} x_{N,2} &{} \ldots &{}x_{N,D} \end{bmatrix}, \end{aligned}$$

(1)

where the number of sparrows are represented by N, and the number of dimensions to be optimized are represented by D. Then, the following vector can represent the fitness value of all sparrows:

$$\begin{aligned} \mathbf {f}(\mathbf {x}) = \begin{bmatrix} f([x_{1,1} &{} x_{1,2} &{} \ldots &{} x_{1,D}]) \\ f([x_{2,1} &{} x_{2,2} &{} \ldots &{} x_{2,D}]) \\ \vdots &{} \vdots &{} \ldots &{} \vdots \\ f[(x_{N,1} &{} x_{N,2} &{} \ldots &{} x_{N,D}]) \end{bmatrix}, \end{aligned}$$

(2)

where the value of each row in $\mathbf {f}(\mathbf {x})$ represents each individual’s objective function. In SSA, the producers with the best fitness for getting food during the search process are given a higher priority. Besides, because producers are primarily responsible for foraging, in addition to managing the movement of the entire swarm, they have the ability to forage for food in a wider range than the scroungers do. At each iteration t, the position of each producer i is adjusted per every dimension j as:

$$\begin{aligned} x_{i,j}^{t+1} = \left\{ \begin{array}{ll} x_{i,j}^{t} \cdot \exp \left( \frac{-i}{\alpha \cdot T} \right) &{} \mathrm {if} \, R_2 < ST, \\ x_{i,j}^{t} + Q \cdot L_{i,j} &{} \mathrm {if} \, R_{2} \ge ST, \end{array} \right. \ \end{aligned}$$

(3)

where $j \in \{1,2,\dotsc ,D\}$, and T represents the maximum number of iterations. $R_2 \in [0, 1]$ represents a warning (alarm) value, while $ST \in [0.5, 1.0]$ denotes the safety threshold. $\alpha \in (0, 1]$ and the normally distributed Q are two independent random numbers. L is a matrix of $1\times D$, inside which each element is 1. $R_2 < ST$ means no predators around, allowing producers to search wider. On the contrary, $R_2 \ge ST$ implies the existence of predators discovered by some sparrows, and because of this, all sparrows are needed to quickly move to more secure regions.

As mentioned above, some scroungers continue to track down producers until those producers find a good food source, so that they can then leave their current position to pursue the target source of food. If they could win, they would automatically have a chance to get food from the producers; otherwise, they continue to execute their main task of vigilance. The scroungers update their position according to Eq. (4):

$$\begin{aligned} x_{i,j}^{t+1}= \left\{ \begin{array}{ll} Q \cdot \exp \left( \frac{x_{worst,j}^{t}-x_{i,j}^{t}}{i^2} \right) &{} \mathrm {if} \, {i > N/2}, \\ x_{p,j}^{t+1}+\left| x_{i,j}^{t}-x_{p,j}^{t+1} \right| \cdot A_{0,j}^{+} \cdot L &{} \mathrm {if} \, {i \le N/2}, \\ \end{array} \right. \ \end{aligned}$$

(4)

where $x_{p,j}$ is the personal best position obtained so far by the producers at the j-th dimension, and $x_{worst,j}$ denotes the current global worst location at the j-th dimension. A shows a matrix of $1 \times D$ with the randomly assigned value 1 or $-1$ for each of its elements, while $A^+ = A^T {(A{A^T})}^{-1}$. L represents a matrix of $1\times D$, all of whose entries are 1. When $i > N/2$, this implies the starvation of the i-th scrounger whose fitness value is among the worst.

In the simulation experiments, those sparrows who are aware of the danger (i.e., expectant sparrows) typically account for $10\%$ to $20\%$ of the whole swarm. These sparrows’ positions are initially randomly generated through the swarm as:

$$\begin{aligned} x_{i,j}^{t+1}= \left\{ \begin{array}{ll} x_{worst,j}^{t}+\beta \cdot \left| {x_{i,j}^{t} + x_{best,j}^{t}} \right| &{} \mathrm {if} \, {f_i > f_g}, \\ x_{i,j}^{t} + K \cdot \left( \frac{\left| x_{i,j}^{t} + x_{worst,j}^{t} \right| }{(f_i - f_w) + \varepsilon } \right) &{} \mathrm {if} \, {f_i=f_g}, \\ x_{i,j}^{t} &{} \mathrm {if} \, {f_i < f_g}. \\ \end{array} \right. \ \end{aligned}$$

(5)

The algorithmic structure of the standard SSA is exhibited in Algorithm 1.

3.3 Transfer Functions (TFs)

Since the final solution obtained by SSA is composed of continuous values, SSA cannot directly be used to solve a FS problem. Therefore, it is necessary to use a mapping (transfer) function to transform the continuous values into binary 0 or 1. Transfer Functions (TFs) [77] determine the rate of changing in the values of the decision variables from 0 to 1 and back. On selecting a TF to transform the values from continuous to binary, some concepts should be taken into consideration—from an SSA perspective—as follows:

The range of values obtained from a TF should be within the interval [0, 1], representing the probability whether a sparrow will change its current location.
If the alarm value $R_2$ is lower than the safety threshold $ST$, then the TF should present a higher probability of changing the current location at the next iteration as sparrows having $R_2$ larger than $ST$ are probably going so far from the best solution.
When $R_2$ is small, the TF should provide a small probability of changing the current location.
In summary, the probability returned by the TF should increase as $R_2$ approaches $ST$, so that sparrows that are moving away from the best solution can have a higher probability of changing their location vector, which enables them to return to their previous best personal location as soon as possible throughout the next iterations.
Also, the probability obtained from a TF should decrease as $R_2$ value keeps away from $ST$ value.

These concepts judge the high capability of TFs to map the process of continuous search into binary for each individual $\mathbf {x}$, using Eq. (6):

$$\begin{aligned} {\left( x_{i,j}^{t+1}\right) }_{\mathrm {bin}}=\left\{ \begin{array}{ll}\left\{ \begin{array}{ll}0&{}\mathrm {if}\,rand<TF\left( x_{i,j}^{t+1}\right) ,\\ 1&{}\mathrm {if}\,rand\ge TF\left( x_{i,j}^{t+1}\right) ,\end{array}\right. &{}\mathrm {if} \text { TF is S-shaped},\\ \left\{ \begin{array}{ll}\lnot {\left( x_{i,j}^t\right) }_{\mathrm {bin}}&{}\mathrm {if}\,rand<TF\left( x_{i,j}^{t+1}\right) ,\\ {\left( x_{i,j}^t\right) }_{\mathrm {bin}}&{}\mathrm {if}\,rand\ge TF\left( x_{i,j}^{t+1}\right) ,\end{array}\right.&\mathrm {if}\text { TF is V-shaped},\end{array}\right. \end{aligned}$$

(6)

where ${\left( x_{i,j}^{t+1}\right) }_{\mathrm {bin}}$ represents the j-th dimension of the i-th individual at the current iteration $t+1$, $rand$ is a number selected randomly from within the range [0, 1], and $TF\left( x_{i,j}^{t+1}\right)$ is the probability value obtained when applying a given TF to every j-th component’s continuous value of agent i. It is clear from Eq. (6) that we have two cases: i) if the TF is S-shaped, then if $rand$ is less than the probability returned by the involved TF, the j-th dimension of the original individual is set to 0; otherwise, it is set 1; and ii) if the TF is V-shaped, then if $rand$ is less than the probability returned by the involved TF, the j-th dimension is negated; otherwise, it remains unchanged. Thus, by using the S-shaped and V-shaped TFs and Eq. (6), continuous variables are successfully mapped into binary.

Table 1 reports two families of TFs while Fig. 1 exhibits visually their behavior, divided into S-shaped and V-shaped transfer functions. Here, it should be pointed out that the proposed iBSSA method was evaluated based on those nine TFs whose mathematical expressions are shown in Table 1.

Table 1 S-shaped and V-shaped families of TFs

An improved binary sparrow search algorithm for feature selection in data classification

Abstract

Similar content being viewed by others

Puma optimizer (PO): a novel metaheuristic optimization algorithm and its application in machine learning

A review of unsupervised feature selection methods

Hybrid approaches to optimization and machine learning methods: a systematic literature review

1 Introduction

1.1 Motivation

1.2 Contribution

1.3 Structure

2 An overview of feature selection methods

2.1 Typical Feature Selection techniques

2.2 Meta-heuristics-based feature selection techniques

3 Preliminaries

3.1 The Feature Selection (FS) problem

3.2 Sparrow Search Algorithm (SSA)

3.3 Transfer Functions (TFs)

3.4 Learning algorithms adopted in this study

3.4.1 k-Nearest Neighbor (k-NN)

3.4.2 Support Vector Machine (SVM)

3.4.3 Random Forest (RF)

4 Proposed improved Binary SSA (iBSSA) for feature selection

4.1 Improvements embedded into the standard SSA

4.1.1 Improving the exploration by Random Re-positioning of Roaming Agents (3RA)

4.1.2 Improving the exploitation by Local Search Algorithm (LSA)

4.2 Initialization

4.3 Fitness evaluation

4.4 Position updating

5 Experiments and discussion

5.1 Description of datasets

5.2 Environment and parameter settings

5.3 Performance metrics

5.4 Assessment of the impact of the nine TFs on the fitness values

5.4.1 Assessment using the k-NN classifier

5.4.2 Assessment using the SVM classifier

5.4.3 Assessment using the RF classifier

5.4.4 Overall assessment

5.5 Performance comparison of classification using the original classifier with all features (before FS), and the iBSSA–classifier method (after FS)

5.5.1 Comparisons of k-NN and iBSSA–k-NN

5.5.2 Comparisons of SVM and iBSSA–SVM

5.5.3 Comparisons of RF and iBSSA–RF

5.6 Performance comparison of BSSA and iBSSA

5.6.1 Comparisons based on the k-NN classifier

5.6.2 Comparisons based on the SVM classifier

5.6.3 Comparisons based on the RF classifier

5.7 Performance comparison with some other promising meta-heuristic algorithms

5.7.1 Comparisons based on the k-NN classifier

5.7.2 Comparisons based on the SVM classifier

5.7.3 Comparisons based on the RF classifier

5.8 Convergence analysis

5.9 Statistical analysis (Wilcoxon’s rank-sum test)

5.10 Comparison with various FS methods from the literature

5.11 Discussions

6 Conclusions and future directions

Change history

31 July 2022

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation