1 Introduction

Feature selection aims to reduce both the redundancy and the dimensionality of high-dimensional datasets during classification. It can also be viewed as the process of finding an optimal feature set by adding relevant, complementary features. Feature selection algorithms are used to speed up the learning process of classifiers and to improve classification accuracy. In general, feature selection methods are categorized into five types: filter, wrapper, embedded, ensemble, and hybrid. In filter methods, redundant and irrelevant features are removed, and an optimal feature subset is selected. Filter methods do not use a classification algorithm in the selection process; therefore, they have a low computational cost and typically perform best on high-dimensional datasets. In contrast, wrapper methods rely on the predictive performance of a classifier to evaluate the quality of the selected feature set. A wrapper method has two steps: first, a subset of features is selected; second, the selected features are evaluated. The two steps are repeated until a stopping criterion is met, and the feature set with the highest learning performance is returned as the selected subset. Wrapper methods suffer from a high computational cost; however, they generally provide better results. Embedded methods combine aspects of the filter and wrapper methods by integrating feature selection into the model learning of the learning algorithm. As they do not evaluate the feature set repeatedly, they are much more efficient than wrapper methods. Ensemble methods build on the assumption that combining the outputs of multiple feature selection methods is better than using the output of any single method. Creating a feature selection ensemble involves applying different feature selection methods, each providing its own output, and then combining the outputs of the single models [1]. 
The hybrid methods combine two or more feature selection algorithms to form a new strategy for solving the problem. These methods exploit the advantages of their sub-algorithms and, as a consequence, are stronger than traditional methods. In this study, a novel hybrid feature selection method is introduced by combining our proposed filter method with a genetic algorithm (GA) to eliminate redundant and irrelevant features and to reduce the final feature set size. The experiments conducted in this study demonstrate the potency of the proposed feature selection methods by comparing their classification accuracy with that of other existing feature selection methods. Three different classifiers were applied to the selected datasets to test the validity of the proposed algorithm. The main contributions of our work are described in detail below.

  • Novelty

    1. Introducing a rank-based filter feature selection method, SLI-γ, that minimizes feature redundancy and maximizes the relevancy between features and the class label.

    2. Introducing a novel hybrid approach based on our proposed filter method and GA that is applicable to high-dimensional problems. The aim is to identify the optimal subset of relevant features, maximize classification accuracy, and minimize the size of the final feature set and the execution time.

  • Effectiveness

    The hybrid algorithm capitalizes on the advantages of the SLI-γ and GA. The features selected by our algorithm show more accurate identification rates compared with the existing feature selection approaches. The proposed algorithm maximizes the classification accuracy and, at the same time, minimizes the number of selected features and the execution time of GA when applied to the feature selection problems.

  • Robustness

    Three different classifiers were tested in this study on the selected datasets. All classifiers produced stable classification accuracy.

    To achieve the objectives defined for the current paper, a GA-based wrapper feature selection method is combined with a ranking-based filter feature selection method.

The rest of this paper is organized as follows: Sect. 2 reviews the related work. Then, Sect. 3 presents the background of the above-mentioned algorithms. Next, Sect. 4 discusses the proposed filter method and the proposed hybrid algorithm. Afterward, Sect. 5 presents and analyzes the experimental results. Finally, the conclusions of the research are given in Sect. 6.

2 Related work

Many researchers have shown the effectiveness of feature selection methods in removing unrelated features and improving the learning performance. Research in this field can be divided into two categories:

  1. Studies that have examined the efficiency of the filter, wrapper, embedded, and ensemble feature selection methods.

  2. Studies that have used a hybrid feature selection method to achieve an optimized feature subset.

A number of the studies that have used filter, wrapper, embedded, and ensemble methods to determine a practical feature subset are discussed in the following.

Maleki et al. [2] used GA to select features. In their method, the data are first preprocessed, and missing data are deleted or replaced with appropriate values. Incomplete rows of the datasets were filled by a software function with the average of the other values. GA was then applied to this dataset to find the best combination of features, which provided the most relevant features. Pardo et al. [3] compared six advanced feature selection methods: Interact, CFS, Chi-square, ReliefF, Infogain, and MRMR. For this comparison, they introduced new criteria based on accuracy, execution time, and stability. The results indicated that the filter methods are the most scalable methods of selecting features, while the Interact, ReliefF, and MRMR methods are the most accurate ones. Gu et al. [4] studied the scalability of modern feature selection algorithms. They compared the performance and power of these algorithms in selecting related features and removing irrelevant features without allowing the number of features to increase. For the scalability analysis, they set new evaluation criteria based on the selection accuracy, execution time, and stability of the selected features. Abasabadi et al. [1] introduced a heterogeneous ensemble feature selection method with automatic thresholding capability. To create diversity in the ensemble, they used three filter methods, i.e., ReliefF, Mutual Congestion, and a proposed filter method called Sorted Label Inference. They used the non-dominated sorting method to aggregate the ranking lists, which allows automatic thresholding in the proposed ensemble method. The experimental evaluation of the method using six different high-dimensional datasets and two standard datasets showed the efficiency of the proposed method. Pardo et al. [5] used the filter and embedded methods instead of the usual ensemble method. 
Their study aimed to diversify and increase the order of the feature selection process to take advantage of the strengths of individual selectors and overcome their weaknesses. Depending on how the data was distributed and the feature selection method was used, two methods were employed, which were tested with SVM classifier. The experimental evaluation of the methods using seven different datasets showed the competence of the proposed ensemble.

Some researchers proposed the use of the hybrid feature selection methods to remove unrelated features and to achieve high accuracy. A number of these studies are discussed below.

Sadeghian et al. [6] proposed the Information Gain binary Butterfly Optimization Algorithm (IG-bBOA) to overcome the S-bBOA constraints. In the first phase of this algorithm, 80% of the irrelevant and redundant features are removed using the Minimal Redundancy–Maximal New Classification Information (MRMNCI) method. Then, in the second phase, the best feature subset is selected using IG-bBOA. Finally, a similarity-based ranking method is used to select the final feature subset. The experimental results on six standard datasets showed the efficiency of the proposed method in improving the classification accuracy and selecting the minimum number of features in most cases. Nematzadeh et al. [7] presented a hybrid feature selection method for large binary medical datasets. Their proposed method consists of three parts. First, the whale algorithm is used to remove half of the unrelated features. Next, in the second stage, the remaining features are ranked using a frequency-based method called mutual congestion. Finally, a majority vote with a threshold of 10 is applied to the best set of features obtained by the selection of the features. Jain et al. [8] proposed a hybrid feature selection method for sentiment classification. They used GA and a combination of filter methods such as IG, CHI Square, and GINI Index. First, they acquired features from the filter methods and then applied the UNION SET Operation to obtain the reduced feature set. Afterward, GA was used to refine the feature set further. They used an ensemble method based on the error rate. To test their proposed hybrid feature selection and ensemble classification approach, they considered four support vector machine (SVM) classifier variants. They applied their proposed method to the UCI ML Datasets. The experimental results showed that their proposed method performed best on all datasets.

Although GA can accurately select meaningful features, it suffers from long execution times, and its convergence is highly dependent on the initial population. Using the results of filter methods as the initial population for GA can therefore reduce the execution time and help select the optimal feature subset.

3 Preliminaries

In this section, the basic concepts exploited within the proposed method, i.e., filter feature selection method, genetic algorithm (GA), and artificial neural network (ANN), are reviewed.

3.1 Filter feature selection methods

Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithms. Instead, features are selected based on their scores in various statistical tests for their correlation with outcome variable. ReliefF, SVM-Rank, Information Gain, MRMR, and Fisher score are well-known filter methods.

3.2 Genetic algorithm

The genetic algorithm (GA) was first introduced by John H. Holland in the 1960s and was developed further by Holland and his students in the 1960s and 1970s [9]. The basic idea behind GA is to evolve a generation of candidate solutions to a problem using several operators. At first, an initial population is created using a random process. Next, reproduction, crossover, and mutation operators are used to generate successive populations from the initial population. Reproduction is a process based on the fitness function of each string, which identifies how “good” a candidate solution is. Thus, strings with higher fitness values have a higher probability of contributing offspring to the next generation. Crossover is a process in which members of the current population are mated at random in the mating pool. Combining two parents generates a pair of offspring that, ideally, have improved fitness values. Mutation is a genetic operator used to maintain genetic diversity from one generation of GA chromosomes to the next. Algorithm 1 shows the pseudo-code of GA.

Algorithm 1 Pseudo-code of GA
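The operators described above can be sketched in a few lines of Python. This is only an illustrative, minimal generational GA over binary strings, not a reproduction of Algorithm 1: the function name `run_ga`, the roulette-wheel selection, the single-point crossover, the bit-flip mutation, the elitism step, and all default parameter values are assumptions made for the example.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Maximize a non-negative `fitness` over binary chromosomes (illustrative sketch)."""
    assert pop_size % 2 == 0, "this sketch pairs parents two by two"
    rng = random.Random(seed)
    # Initial population: created by a random process.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(pop, key=fitness)[:]
    for _ in range(generations):
        # Reproduction: fitter strings are more likely to enter the mating pool
        # (roulette-wheel selection is an assumption of this sketch).
        weights = [fitness(ind) + 1e-9 for ind in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        offspring = []
        for i in range(0, pop_size, 2):
            a, b = parents[i][:], parents[i + 1][:]
            # Crossover: two parents exchange their tails at a random cut point.
            if n_genes > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n_genes - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            offspring += [a, b]
        # Mutation: occasional bit flips maintain genetic diversity.
        for ind in offspring:
            for j in range(n_genes):
                if rng.random() < mutation_rate:
                    ind[j] = 1 - ind[j]
        pop = offspring
        # Keep track of the best string seen so far.
        candidate = max(pop, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate[:]
    return best
```

For instance, `run_ga(sum, 12)` searches for a 12-bit string with as many ones as possible; in feature selection, `fitness` would instead score the feature subset encoded by the string.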

3.3 Artificial neural network

The artificial neural network (ANN) was first introduced by McCulloch and Pitts [10] as a simulation of the human brain. The term derives from the biological neural networks that make up the human brain. Similar to the human brain, an ANN has neurons connected to one another in several layers of the network. These neurons are known as nodes. An ANN is a network of connected input–output units in which a weight is associated with each connection. It consists of one input layer, one or more hidden layers, and one output layer. The multilayer perceptron is the most widely used type of ANN; it consists of one input layer and one output layer with one or more hidden layers in between. A neural network learns by adjusting the weights of its connections. By updating the weights iteratively, the performance of the network is improved. Neurons of the neural network are activated by the weighted sum of their inputs, which is given by Eq. (1) [11].

$$\sum_{i = 1}^{n} W_{i} X_{i} + b$$
(1)

where \(W_i\) (i = 1, 2, …, n) and \(b\) are the synaptic weights and bias of the perceptron, respectively, and \(X_i\) are the inputs. The interconnection weights are optimized during training until the network reaches the specified level of accuracy [12].
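As a small illustration of Eq. (1), the weighted sum for a single neuron can be computed as follows. The function name `weighted_sum` and the numbers in the usage note are hypothetical; the activation function that would normally be applied to this sum is deliberately left out, since the text does not specify one.

```python
def weighted_sum(weights, inputs, bias):
    """Net input of one neuron as in Eq. (1): sum_i W_i * X_i + b."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias
```

With weights `[0.5, -1.0]`, inputs `[2.0, 3.0]`, and bias `0.25`, the net input is 0.5·2.0 − 1.0·3.0 + 0.25 = −1.75.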

4 The proposed hybrid feature selection method

In this section, a new, simple, and effective hybrid feature selection method, called Garank&rand, is introduced. The goal is to achieve the smallest subset of features as well as the highest classification accuracy. Figure 1 shows the framework of our proposed hybrid feature selection method.

Fig. 1 Framework of the Garank&rand hybrid feature selection method

As shown in this figure, our proposed method uses a two-stage procedure to obtain the optimal feature subset. In the first stage, our proposed filter feature selection method, called SLI-γ, is applied to the datasets to rank the features by their predictive ability; the features at the top of the ranking are more likely to classify the samples well. The top 1% of the ranked features are then selected. In the second stage, to start with the best initial solutions, the GA population is initialized with these top 1% of features. GA is applied to the dataset containing only the selected 1% of features, searching for the best feature subset in the solution space. To calculate the fitness of each individual, two different classifiers (i.e., KNN and ANN) are used, as ANN provides more accurate results and KNN reduces the execution time. Algorithm 2 shows the Garank&rand hybrid feature selection method.

Algorithm 2 The Garank&rand hybrid feature selection method

4.1 Feature ranking by applying SLI-γ

The current paper proposes a filter feature selection method, called SLI-γ, whose schema is illustrated in Fig. 2. The first step is to sort the samples by their values for each feature. Sorting by a feature relocates each sample together with its corresponding label. In the second step, the interference between the relocated labels, i.e., the area where the positive and negative labels intersect, must be found. If the sorted labels start with a positive label, the interference region begins where, after a sequence of positive labels, the first negative label appears, and it continues until the last positive label appears. Conversely, if the sorted labels start with a negative label, the interference region begins where, after a sequence of negative labels, the first positive label appears, and it continues until the last negative label appears. In both cases, the area between these two appearances is called the interference region. Figure 3 shows a graphical presentation of the interference region, which is marked in green. Next, the positive labels in the interference region and in the non-interference region are counted. The number of positive labels in the interference region and the number of positive labels in the non-interference region are denoted by \(n_2^t\) and \(n_1^t\), respectively. In the example shown in Fig. 3a, the number of positive labels in the interference region, \(n_2^t\), is 2, while the number of positive labels in the non-interference region, \(n_1^t\), is 3.

Fig. 2 The schema of the proposed filter method

Fig. 3 a Interference region when labels start in positive, b interference region when labels start in negative

Algorithm 3 shows the pseudocode for extracting the interference region, and Algorithm 4 shows the pseudocode for extracting \(n_2^t\) and \(n_1^t\). In the third step, the frequency of positive labels in each region is determined based on our proposed criterion, γ. To quantify the frequency of positive labels in the interference region, the interference coefficient γ is computed from \(n_2^t\) and \(n_1^t\) according to Eq. (2):

$$\gamma = \frac{n_{2}^{t}}{n_{1}^{t} + n_{2}^{t}}$$
(2)

where γ is the number of positive labels in the interference region divided by the total number of positive labels; it takes a value between 0 and 1. The value 1 is obtained when all positive labels are in the interference region. On the other hand, when the interference region contains no positive labels, the obtained value is 0. The best feature is the one with the lowest γ. A lower interference indicates more similarity between the classification based on this single feature and the overall classification based on all features. This process is repeated for all features in the dataset. Finally, in the fourth step, all features are sorted by their γ values.

Algorithm 3 Extraction of the interference region
Algorithm 4 Extraction of \(n_2^t\) and \(n_1^t\)
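The four steps of SLI-γ described in this subsection can be sketched as follows. This is a hedged reading of the text rather than a transcription of Algorithms 3 and 4: the function names `sli_gamma` and `rank_features`, the 0/1 label encoding, the inclusive treatment of both region boundaries, and the handling of tied feature values (Python's stable sort) are assumptions of the example.

```python
def sli_gamma(values, labels):
    """Interference coefficient of Eq. (2) for one feature.

    values: feature values, one per sample.
    labels: 1 for positive samples, 0 for negative samples (assumed encoding).
    """
    # Step 1: sort the samples by feature value; labels move with their samples.
    order = sorted(range(len(values)), key=lambda i: values[i])
    s = [labels[i] for i in order]
    total_pos = sum(s)
    if total_pos == 0 or total_pos == len(s):
        raise ValueError("both classes must be present")
    # Step 2: the region runs from the first label of the other class
    # to the last label of the starting class (both ends included here).
    first = s[0]
    start = s.index(1 - first)
    end = len(s) - 1 - s[::-1].index(first)
    region = s[start:end + 1] if end >= start else []
    # Step 3: count positive labels inside (n2) and outside (n1) the region.
    n2 = sum(region)
    n1 = total_pos - n2
    return n2 / (n1 + n2)

def rank_features(columns, labels):
    """Step 4: feature indices sorted by ascending gamma (best first)."""
    gammas = [sli_gamma(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: gammas[j])
```

As a made-up example, a feature whose sorted labels read `+ + + - + - + - -` has an interference region covering `- + - +`, so n2 = 2, n1 = 3, and γ = 2/5 = 0.4. A feature that separates the classes perfectly has an empty region and γ = 0.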

4.2 Applying GA

In this step, the high-ranked features of SLI-γ are given as input to GA, as explained in detail in the following subsections. To preserve the simplicity and efficiency of the proposed method, the feature subsets are represented as binary vectors. More precisely, the size of the vector is equal to the number of features in the dataset. When the ith component of a vector is equal to 1, the subset includes the ith feature; otherwise, the component is set to 0.

4.3 Initial population generation

The initialization of the population vectors plays an important role in the GA method. In Garank&rand, some vectors are initialized by randomly selecting features from among the 1% of features with the highest SLI-γ rank. The other vectors are generated by the basic method of evolutionary algorithms, i.e., random initialization. These random vectors inevitably contain less relevant features, which relieves the excessive pressure that the first method exerts on some areas of the search space.
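The mixed initialization described above, together with the binary encoding of Sect. 4.2, might look like the following sketch. The function name `initial_population` and the parameters `ranked_share`, `top_fraction`, and `pick_prob` are hypothetical; in particular, the probability with which a top-ranked feature is switched on is an assumption, since the text only states that features are chosen randomly from the top 1%.

```python
import random

def initial_population(ranked, n_features, pop_size, ranked_share=0.5,
                       top_fraction=0.01, pick_prob=0.5, seed=0):
    """Binary chromosomes: a share seeded from top-ranked features, the rest random.

    ranked: feature indices ordered best first (e.g. by ascending gamma).
    """
    rng = random.Random(seed)
    n_top = max(1, int(n_features * top_fraction))
    top = ranked[:n_top]
    n_ranked = round(pop_size * ranked_share)
    population = []
    for k in range(pop_size):
        if k < n_ranked:
            # Rank-seeded vector: only top-ranked features may be switched on.
            chrom = [0] * n_features
            for f in top:
                if rng.random() < pick_prob:
                    chrom[f] = 1
        else:
            # Basic GA initialization: every bit is drawn at random.
            chrom = [rng.randint(0, 1) for _ in range(n_features)]
        population.append(chrom)
    return population
```

A rank-seeded vector in this sketch can end up with no feature selected; a practical implementation would probably redraw such empty subsets.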

4.4 Cost function

While running GA, in each iteration, the feature sets in the current population are evaluated and ranked by their fitness, which is based on a KNN-based or ANN-based classification error. When KNN is used as the classifier, feature sets with a lower cost have a higher chance of being passed on to the next generation. As GA iterates, the classification error decreases, and the feature set with the lowest cost, i.e., the smallest error rate, is finally selected. When ANN is used as the classifier, a fitness value must be assigned to each feature set in the population after the initialization. A neural network is trained for each feature set, and its error is evaluated on the selection instances. A large selection error means a high cost. Feature sets with a lower cost are more likely to be selected for recombination. The objective of feature subset selection is to reduce the dimensionality of the dataset as well as to decrease the classification error in order to achieve higher accuracy. However, these two objectives conflict: reducing the number of features can lower the classification accuracy. Thus, the classification error alone is not a sufficient cost function for obtaining the optimal feature set. Here, the conflict is addressed by combining both objectives into a single objective function. The cost function is defined as a weighted linear aggregation function that can be calculated using Eq. (3).

$$Z = E \times \left( {1 + \beta \times \left( {\frac{{{\text{SF}}}}{F}} \right)} \right)$$
(3)

where SF and F represent the number of features in the selected feature subset and the total number of features, respectively, whereas \(\beta\) is a control parameter used to adjust the weight of the feature set size in the fitness function. β ∈ [0,1] is the cost of adding features. Higher β values lead to the selection of fewer features, and vice versa. The above fitness function can be used in conjunction with various evolutionary algorithms to find the optimal feature subset. The fitness function Z thus evaluates a feature set in terms of both the classifier error E and the reduction of the feature set.
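Eq. (3) translates directly into code. The sketch below is illustrative only: the function name `cost` and the default value of `beta` are assumptions, and `error` stands for whichever classifier error E (KNN-based or ANN-based) is being used.

```python
def cost(error, n_selected, n_total, beta=0.5):
    """Cost Z = E * (1 + beta * SF / F) from Eq. (3); lower is better."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [0, 1]")
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("need 0 <= n_selected <= n_total and n_total > 0")
    return error * (1.0 + beta * n_selected / n_total)
```

With the same error of 0.2 and β = 0.5, a subset of 10 out of 100 features costs about 0.21, whereas a subset of 50 features costs about 0.25, so the smaller subset is preferred. Because the penalty is multiplicative, a subset with zero error has zero cost regardless of its size.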

5 Experimental study

In this section, the datasets are first briefly described. Then, the results of applying SLI-γ to high-dimensional and standard datasets are analyzed. Subsequently, our proposed hybrid feature selection technique is analyzed.

5.1 Datasets and parameter setting

The Garank&rand was tested using the KNN classifier and the SLI-γ on MATLAB (Ver. 2016) on the HP G9 server with 180 GB of memory and 36 processor CPU e5-2699 v3 2.6 GHz, and Windows Server 2012 operating system. The results are for an average of 100 runs, using the tenfold Monte Carlo cross-validation. SLI-γ was tested on 11 different binary datasets selected from the bioinformatics and biomedical domains, in which Liver, Diabetes, Kidney, and Wisconsin are standard datasets. In contrast, SMK_CAN_187, LeukemiaB, GLI_85, Ovarian, CNS, Colon, and COVID-19 are high-dimensional datasets. COVID-19 is originally a three-label dataset ("no virus," "other virus," and "SC2"); the current study considered "no virus" and "other virus" as one label as "non covid" and SC2 as "covid," so that our proposed method could be applied to the COVID-19 for the purpose of accuracy evaluation. Table 1 shows the specifications of these datasets, which include the number of samples, features and classes. Garank&rand was tested on four different high-dimensional binary datasets of LeukemiaB, GLI_85, CNS, and Colon. Meanwhile, the Garank&rand tests using ANN classifier were performed by means of MATLAB (Ver. 2016) on the HP G9 server with 40 GB of memory and 24 core CPU e5-2690 v3 2.6 GHz, and Windows Server 2012 operating system.
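For readers unfamiliar with Monte Carlo cross-validation, the sketch below shows one common form of it, repeated random train/test splitting. It is an assumption that "tenfold" here means ten random splits per run; the held-out fraction of 10% and the names `monte_carlo_splits` and `test_fraction` are likewise hypothetical, as the text does not state them.

```python
import random

def monte_carlo_splits(n_samples, n_splits=10, test_fraction=0.1, seed=0):
    """Yield (train_indices, test_indices) for repeated random subsampling."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_splits):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # The first n_test shuffled indices are held out; the rest are used for training.
        yield idx[n_test:], idx[:n_test]
```

The accuracy reported for one run would then be the mean over the splits, and the final figure the mean over all runs.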

Table 1 Datasets used in the study

5.2 Evaluation metric

Accuracy is the criterion used to evaluate the efficiency of the proposed method. Accuracy indicates the proportion of positive and negative samples that are classified correctly. True positives (TP) and true negatives (TN) are positive and negative samples, respectively, that are correctly classified. False positives (FP) are negative samples that are incorrectly classified as positive, and false negatives (FN) are positive samples that are incorrectly classified as negative.
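The text does not give the formula explicitly; under the standard definition, accuracy is the share of correctly classified samples, (TP + TN) / (TP + TN + FP + FN). A minimal sketch with a hypothetical function name and made-up counts:

```python
def accuracy(tp, tn, fp, fn):
    """Standard accuracy: correctly classified samples over all samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    return (tp + tn) / total
```

For example, 40 true positives, 45 true negatives, 5 false positives, and 10 false negatives give an accuracy of 85/100 = 0.85.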

5.3 Classification methods

Implementing feature selection methods on popular classification algorithms is an accepted way to test the performance of any feature selection method [1]. However, class prediction depends on the applied classification algorithm. In this research, several classification methods were used to test the feature selection method in order to obtain results that are independent of the classifier. From the classifiers available in the literature, SVM, Naïve Bayes (NB), and decision tree (DT) were selected for the accuracy evaluation of SLI-γ. SVM, a classification method that has received much attention in recent years, is based on the idea of structural risk minimization (SRM). SVM methods have performed extremely well in a wide range of applications and have been used as powerful tools for solving classification problems [3]. In this research, SVM uses the linear kernel function with a box constraint of 2. Naïve Bayes can achieve relatively good performance on classification tasks. It greatly simplifies learning by assuming that features are independent given the class variable. In simple terms, a Naïve Bayes classifier assumes that the presence of a particular feature of a class is unrelated to the presence of any other feature. In spite of their naïve design and apparently over-simplified assumptions, Naïve Bayes classifiers have worked quite well in many complex real-world situations [13]. Decision trees are favored prediction tools because they produce a model that is easy to explain. Each leaf node can be expressed as an if/then rule. The logical rules derived from a decision tree resemble human reasoning and are appealing to decision-makers, who are comfortable with models that they can understand. Decision trees are also multivariate and can model a wide range of data distributions. Furthermore, decision trees can handle data of different types without requiring any transformation of the data. 
Most importantly, decision trees have the capability to break down a complex decision-making process into a collection of simpler decisions, thus providing a solution that is often easier to interpret [14]. Moreover, ANN and KNN were used for the evaluation of the Garank&rand accuracy. Table 2 shows the calculated accuracy of each classifier before applying the proposed method to the dataset using the Monte Carlo cross-validation.

Table 2 Accuracy results of classifiers for datasets without applying the feature selection methods

5.4 Numerical results and discussion

Regarding the performance analysis of the proposed methods, the effectiveness of SLI-γ needs to be discussed first. Thus, Sect. 5.5.1 shows experimentally to what extent this algorithm should be used to discard irrelevant features. Then, Sect. 5.5.2 shows experimentally why GA should be used as the evolutionary algorithm to discard the irrelevant features, and discusses the convergence of the compared algorithms. In Sect. 5.5.3, Garank&rand is applied to the benchmark datasets to identify the best features in the proposed method. Moreover, the Garank&rand method is compared with GA.

5.5 Evaluation of SLI-γ

In this section, a comprehensive experiment is performed on four standard datasets and seven high-dimensional medical datasets to evaluate SLI-γ in terms of accuracy. The performance of SLI-γ was first evaluated and compared with that of three other rankers. The experiments indicated that SLI-γ is able to effectively eliminate the unrelated features. Tables 3, 4 and 5 show that SLI-γ increased the classification accuracy on all datasets compared to the data presented in Table 2. For the accuracy comparison, we used four feature sets for each dataset, and the highest accuracy on each feature set of each dataset among all rankers is shown in bold. More specifically, in Colon, CNS, Diabetes, Kidney, GLI_85, Liver, and Ovarian, SLI-γ increased the accuracy with the smallest number of features with all classifiers in comparison with the data presented in Table 2.

Table 3 Comparison of SLI-γ with common rankers using NB classifier
Table 4 Comparison of SLI-γ with common rankers using SVM classifier
Table 5 Comparison of SLI-γ with Common Rankers using DT Classifier

In CNS, Colon, and COVID19, SLI-γ increased accuracy with least number of features in DT in comparison with other rankers. In GLI_85 and SMK_CAN_187, SLI-γ increased accuracy with least number of features in NB and SVM in comparison with other rankers. In Liver, LeukemiaB, and Ovarian, SLI-γ increased accuracy with least number of features in SVM in comparison with other rankers. In LeukemiaB, SLI-γ increased accuracy with the least number of features with SVM and DT in comparison with Table 2. In SMK_CAN_187 and Wisconsin, SLI-γ increased accuracy with least number of features with NB and DT in comparison with Table 2. In Wisconsin, SLI-γ increased accuracy with least number of features in NB and DT in comparison with other rankers. In COVID19, SLI-γ increased accuracy with the least number of features in SVM and DT in comparison with other rankers. SLI-γ seems to be a powerful ranking for high-dimensional and standard datasets. As indicated clearly by the obtained results, SLI-γ increased the classification accuracy of all datasets compared to the results presented formerly in Table 2. Therefore, SLI-γ creates a better model with NB, DT, and SVM. In addition, it increases accuracy in all classifiers. As a result, SLI-γ is used as a ranker in this paper.

Number of features: Abasabadi et al. [1] introduced an automatic thresholding feature selection method, called ATFS, which specifies the threshold value for each dataset. To determine the number of features for testing and comparison of our proposed filter method on benchmark datasets, the ATFS method results were used for these datasets. In Colon, the ATFS method resulted in 14 features as the best and smallest final feature subset. With this threshold value, SLI-γ increased the accuracy compared to other filter methods with all classifiers. In CNS, 8 features were obtained where SLI-γ surpassed other filter methods by NB and SVM. However, in SMK_CAN_187, 9 features were obtained where SLI-γ outperformed other rankers with all classifiers. In GLI-85, 6 features were selected by ATFS, where SLI-γ obtained the best accuracy by all classifiers. In LeukemiaB, with only 4 features that were selected by ATFS, SLI-γ with all classifiers achieved the best accuracy. In COVID19, 10 features were obtained by ATFS, where SLI-γ surpassed other filter methods by DT and SVM. In Wisconsin, with 6 features introduced by ATFS, SLI-γ received the best accuracy for all classifiers. In Ovarian with only 3 features presented by ATFS, SLI-γ improved accuracy with all classifiers. In Liver, SLI-γ surpassed all rankers with 5 features presented by ATFS. In diabetes, the ATFS ensemble method resulted in 4 features as the best and smallest final feature subset. With this threshold value, SLI-γ increased the accuracy compared to other filter methods with SVM, NB, and DT. In Kidney, the ATFS ensemble method resulted in 7 features as the best and smallest final feature subset. With this threshold value, SLI-γ increased the accuracy compared to other filter methods with all classifiers.

5.5.1 Removal rate analysis

After ranking the features by their relevancy using SLI-γ, a single final feature set size must be determined for all datasets. Ten experiments were carried out on each dataset with different discarding rates: 99, 90, 80, 70, 60, 50, 40, 30, 20, and 10%. We used three different classifiers (SVM, NB, and DT) to obtain the best results. As shown in Fig. 4, the best accuracy with most classifiers was achieved when the SLI-γ method discarded 99% of all features. This means that the SLI-γ algorithm could effectively identify relevant features by correctly discarding a huge number of irrelevant features.
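Applying a discarding rate to a ranked feature list amounts to keeping the top of the list. The helper below is a hypothetical sketch: the name `keep_top_features`, the integer percentage argument, the floor rounding, and the rule of always keeping at least one feature are assumptions not stated in the text.

```python
def keep_top_features(ranked, discard_percent):
    """Keep the best-ranked features after discarding the given percentage.

    ranked: feature indices ordered best first.
    discard_percent: integer in [0, 100), e.g. 99 keeps the top 1%.
    """
    if not 0 <= discard_percent < 100:
        raise ValueError("discard_percent must be in [0, 100)")
    # Integer arithmetic avoids floating-point surprises such as 1 - 0.99.
    n_keep = max(1, len(ranked) * (100 - discard_percent) // 100)
    return ranked[:n_keep]
```

For a dataset with 2000 ranked features, a discarding rate of 99% leaves the 20 best-ranked features in this sketch.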

Fig. 4 Classification accuracy by applying different numbers of selected features

5.5.2 Wrapper method selection

To select the appropriate evolutionary algorithm, a comparison was made between GA, the whale optimization algorithm (WOA), and gray wolf optimization (GWO). Figure 5 shows the results of this comparison, indicating that GA was clearly more successful than the other methods at finding the optimal solution in fewer iterations on all datasets. For that reason, GA was chosen among the evolutionary algorithms to build the hybrid method in this study.

Fig. 5 Convergence of evolutionary feature selection methods

5.5.3 Performance evaluation of the proposed hybrid method

This section evaluates the performance of the proposed hybrid method, Garank&rand. Table 6 shows the different parameters of the proposed method.

Table 6 Adjustment of parameters for experiments

To implement Garank&rand, the SLI-γ filter method was first applied to datasets. Then 1% of the features with the highest SLI-γ rank were selected to generate the initial population. A portion of initial population was generated by the basic method of GA, and the others were randomly selected from among 1% of the features with the highest rank to improve the initial solutions. With part of the population randomly selected from all features, the pressure exerted by the related features on some areas will decrease.

5.5.3.1 Performance evaluation of Garank&rand using ANN as classifier

Figures 6 and 7 show the results obtained for Garank&rand on all datasets using ANN as the classifier. Two experiments were carried out to evaluate the accuracy of Garank&rand with ANN. In the first experiment, the top-ranked 1% of features were obtained from SLI-γ, and GA was then forced to generate its initial population from these selected features. The results show that using high-ranking features to generate the initial population of GA reduces the classifier error (that is, increases the classification accuracy) on all datasets. In the second experiment, 50% of the initial population of GA was generated from the top-ranked 1% of SLI-γ features, and the remaining 50% was produced from all features using the basic initial population generation of GA. The results show that the proposed method also reduces the classifier error on all datasets in this setting.

Fig. 6 Applying Garank&rand to the first experiment using ANN

Fig. 7 Applying Garank&rand to the second experiment using ANN

5.5.3.2 Performance evaluation of Garank&rand using KNN as classifier

Figure 8 shows the results obtained from Garank&rand for all datasets using KNN as the classifier. In these experiments, 100, 90, 80, 70, 60, and 50% of the initial population of GA was generated from the top-ranked 1% of SLI-γ features, and the remainder was generated randomly as in the basic GA. In Colon and CNS, the highest accuracy was achieved when 80% of the initial population was generated from the top-ranked 1% of SLI-γ features and 20% was generated randomly. In GLI_85, the highest accuracy was achieved when 60% of the initial population was generated from the top-ranked 1% of SLI-γ features and 40% was generated randomly. In LeukemiaB, the highest accuracy was achieved when 50% of the initial population was generated from the top-ranked 1% of SLI-γ features and 50% was generated randomly. The results indicate that using the best-ranked SLI-γ features for initial population generation increases the classification accuracy.

Fig. 8 Applying Garank&rand to the datasets using KNN

5.5.3.3 Execution time comparison between GA and Garank&rand using ANN

This subsection compares the execution time of the basic GA on all datasets with that of Garank&rand under two settings:

  • 100% of initial population generation using 1% of SLI-γ ranked features

  • 50% of initial population generation using 1% of SLI-γ ranked features

As can be seen in Fig. 9, applying the Garank&rand method to the datasets with the ANN classifier led to a significant reduction in execution time. The shortest execution time was obtained when 100% of the initial population was generated from the top-ranked 1% of features. The results indicate that using the best-ranked SLI-γ features for initial population generation significantly reduces the execution time.

Fig. 9 Comparison of the execution time of Garank&rand with that of GA using ANN

5.5.3.4 Comparison of the execution time between GA and proposed hybrid methods using KNN

For the execution time comparison using the KNN classifier, 100, 90, 80, 70, 60, and 50% of the initial population of GA was generated from the top-ranked 1% of SLI-γ features, and the remainder was generated randomly as in the basic GA. On all datasets, the longest execution time was recorded when the whole population was generated randomly from all features (as in the classic GA). As can be seen in Fig. 10, applying the Garank&rand method to the datasets led to a significant reduction in execution time.

Fig. 10 Comparison of the execution time using KNN

5.5.3.5 Performance comparison of GA and proposed hybrid methods using ANN

In Table 7, Garank&rand is compared with the basic GA in terms of the initial population size, final feature set size, and classification error.

In CNS, using the top-ranked 1% of SLI-γ features led to a significant reduction in the final feature set size, execution time, and classifier error. Applying Garank&rand with 50% of the population generated from the top-ranked 1% of features resulted in a 40% reduction in classifier error and a 0.1 reduction in the final feature set size. Applying Garank&rand with the whole population generated from the top-ranked 1% of features resulted in a 0.01 reduction in classifier error and in the number of final features.

In Colon, applying Garank&rand with 50% of the population generated from the top-ranked 1% of features resulted in a 0.7% reduction in classifier error and a substantial reduction in the final feature set size. Generating the whole population from the top-ranked 1% of features significantly decreased the classifier error and reduced the number of final features to 0.01 of the features selected by the basic GA.

In GLI_85, applying Garank&rand with 50% of the population generated from the top-ranked 1% of features reduced the classifier error, execution time, and final feature set size. Generating 100% of the population from the top-ranked 1% of features also reduced the classifier error, and it reduced the number of final features to 0.001 of the GLI_85 features.

In LeukemiaB, applying Garank&rand with 50% of the population generated from the top-ranked 1% of features resulted in a 27% reduction in classifier error and a 0.1 reduction in the number of final features. Generating 100% of the population from the top-ranked 1% of features considerably reduced the classifier error relative to the basic GA and reduced the number of final features to 0.01 of the features selected by the basic GA.

As shown in Table 8, applying Garank&rand to all datasets caused a substantial increase in classification accuracy.
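The two kinds of figures quoted in this comparison, a percentage reduction in classifier error and a feature set size expressed as a fraction of the basic GA result, can be computed as in the sketch below. The helper names are hypothetical, and reading a statement such as "0.01 of the selected features of the basic GA" as a size ratio is an assumption made here for illustration.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`.

    A result of 0.4 means `value` is 40% lower than `baseline`.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - value) / baseline


def size_ratio(baseline_size, new_size):
    """New feature set size as a fraction of the baseline size.

    A result of 0.01 means the new set has 1% as many features as the baseline.
    """
    if baseline_size <= 0:
        raise ValueError("baseline_size must be positive")
    return new_size / baseline_size
```

For example, with purely hypothetical numbers, a drop in classifier error from 0.25 to 0.15 is a relative reduction of 0.4 (40%), and shrinking a 3000-feature subset to 30 features gives a size ratio of 0.01.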

Table 7 Comparison of the proposed hybrid methods with Genetic Algorithm

In summary, Garank&rand, with either 50% or 100% of the population generated from the top-ranked 1% of features, is a hybrid method suitable for high-dimensional datasets. With 100% of the population generated from the top-ranked 1% of features, Garank&rand achieved high accuracy with the fewest features and the shortest execution time.

5.5.3.6 Comparison of the Accuracy and Execution time of GA and Garank&rand using KNN

For the accuracy evaluation of Garank&rand using KNN, the top-ranked 1% of SLI-γ features was used to generate 100, 90, 80, 70, 60, 50, and 0% of the initial population. Generating the whole population from only the top-ranked 1% of features resulted in the highest classification accuracy in GLI_85, LeukemiaB, and CNS. In Colon, the highest accuracy was achieved when 50% of the GA population was generated from the top-ranked 1% of features. On the other hand, the lowest accuracy was recorded when 90% of the GA population was generated from the top-ranked 1% of features. As shown in Fig. 11, Garank&rand significantly improved accuracy on all datasets.

Fig. 11 Accuracy comparison of different percentages of population generation using KNN

For the final feature set size comparison of Garank&rand using KNN, the 1% most relevant features of each dataset were used to generate 100, 90, 80, 70, 60, and 50% of the initial population. As shown in Fig. 12, generating the whole initial population randomly, as in the basic GA, results in a larger final feature set on all datasets. In contrast, generating the whole population from only the top-ranked 1% of features resulted in the smallest final feature set on all datasets. Overall, Garank&rand significantly reduced the final feature set size on all datasets compared with GA.

Fig. 12 Final feature set size comparison in Garank&rand using KNN

6 Conclusion

This paper proposed a hybrid feature selection method, called Garank&rand, which combines a proposed filter feature selection method (SLI-γ) with a wrapper GA-based feature selection approach. In the first phase, SLI-γ was used to discard the 99% of features ranked least relevant. Then, GA used the most relevant features identified by SLI-γ to optimize the first-phase solutions. Eleven well-known datasets (7 high-dimensional datasets and 4 standard datasets) were used to evaluate the measurement criteria. The experimental results showed that SLI-γ produced better results than the other rankers considered in this study on all the datasets. Moreover, SLI-γ had a significant impact on the performance of GA in terms of the classification accuracy and the number of selected features. Furthermore, the execution time of Garank&rand decreased significantly when the top-ranked 1% of features were used for GA population generation. As future work, we will consider using filter methods as the fitness function in other state-of-the-art evolutionary algorithms, which should reduce the complexity and execution time of wrapper methods.