Introduction

With the development of bioinformatics, gene expression profile analysis has become an essential means of oncogene identification and plays a critical role in cancer classification and prediction. Microarray technology [1] uses a large number of probes in each assay, and the gene information covers many aspects, which makes microarray data high-dimensional. At the same time, sample preparation is costly and complex, which leads to small sample sizes and an uneven distribution of sample categories. Gene expression profile data are therefore typical high-dimensional small-sample data [2], with strong feature redundancy and all the characteristics of such data. When these data are used directly to build predictive models, problems such as long training time, low model performance, and overfitting easily arise. Feature selection is required to mitigate the curse of dimensionality, improve model performance, reduce runtime, and extract useful information [3].

Literature [4] summarizes popular feature selection methods, which are broadly divided into filter, wrapper, and embedded methods [5]. Wrapper methods select subsets of features from the initial feature collection, train a learner such as a support vector machine (SVM) classifier, and evaluate the subsets based on the learner's performance. Wrappers perform well in classification but are computationally expensive and prone to overfitting. Embedded methods automatically select features during training; although fast, they rely too heavily on the classifier. Filter methods scale efficiently to high-dimensional datasets, are independent of the classifier, and are therefore particularly widely used for high-dimensional data.

In high-dimensional data, the removal of redundancy is a hot research topic [6]. Filter methods that address feature redundancy are often based on information metrics [7], such as maximum relevancy and minimum redundancy feature selection (mRMR) [8], the fast correlation-based filter (FCBF) [9], Markov blanket based feature selection [10], and improved algorithms built on them [11]. These methods are not well suited to high-dimensional data because of their heavy computation and high time complexity.

This paper proposes a feature selection algorithm using a non-dominant features-guided search (NDFS) to solve the above problems. The main ideas of this method are as follows: (1) Based on a framework combining feature ranking and a search strategy, irrelevant and redundant features can be quickly filtered out. (2) Fisher score and cosine distance measure the class correlation and the similarity of features, respectively. (3) The concept of non-dominant features is proposed, and a two-way search strategy is adopted in the non-dominant feature-guided search. (4) Finally, a feature subset with maximum correlation and minimum redundancy is selected to improve the performance of subsequent classification.

The rest of this paper is organized as follows. “Related work” discusses related work. “Preliminaries” presents some preliminaries for this work. A novel feature selection method is given in “Proposed architecture and methods”. “Experimental results and discussion” gives experimental results and discussion. “Conclusion” concludes the paper.

Related work

This section first briefly introduces the nature of the feature selection problem, and then discusses the existing mainstream feature selection approaches.

Feature selection problem

Mathematically, the feature selection problem can be expressed as follows. Assume a dataset S contains d features. The essence of feature selection is to select relevant features among the d features so as to optimize a given classification performance index as far as possible. Given a dataset S = {f1, f2, f3, ⋯, fd}, the goal is to select an optimal subset D = {f1, f2, f3, ⋯, fn} from S, where n < d and f1, f2, f3, ⋯, fn are features of the dataset.
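As a minimal illustration (not from the paper), selecting D from S amounts to keeping n < d columns of the data matrix; the sizes and indices below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(62, 2000))     # hypothetical dataset: 62 samples, d = 2000 features
selected = [3, 17, 256, 1024]       # indices of a chosen subset D with n = 4 < d
D = S[:, selected]                  # reduced data actually passed to the classifier
print(D.shape)                      # (62, 4)
```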

Existing feature selection approaches

Feature selection plays a critical role in classification problems, especially for data sets with many features [12]. Features need to be measured in two ways: the correlation between features and the class, and the redundancy between features. Combined with a corresponding search strategy, the final feature subset is obtained [13].

Early feature selection methods only consider selecting features that are more relevant to the categories, where relevance is derived from label information. Features are ranked by scoring them against a relevancy criterion, as represented by Relief [14], ReliefF [15], Fisher score [16] and the Maximal Information Coefficient (MIC) [17]. Relief and its multi-class extension ReliefF select features from instances belonging to different classes: the algorithm randomly selects an instance, finds its nearest neighbors of the same and opposite class, and updates the weight of each feature accordingly. The Relief family operates efficiently and places no restrictions on data types. The Fisher score algorithm uses probability distance as the evaluation criterion of a feature: distances between samples of the same class should be small, and distances between samples of different classes should be large. Fisher score is versatile, has low time complexity, and is particularly suitable for high-dimensional datasets. However, because these algorithms ignore the relationships among features, many redundant features remain after selection.

Redundant features provide no additional information beyond noise for the classification algorithm, so they should be removed. Typical algorithms include mRMR, correlation-based feature selection (CFS) [18], and composition of feature relevancy (CFR) [19]. mRMR is based on mutual information: it minimizes the mutual information between features and maximizes the mutual information between features and class labels. Many works modify the mRMR method, for example by using normalized mutual information [20] or various monotonic dependence measures in place of mutual information. CFS evaluates a subset by the predictive power of its individual features and the correlation among them; subsets whose features have strong predictive power and low mutual correlation perform well. CFR computes the relevance score of a candidate feature through its joint conditional information with the class given the already selected feature set, and feature redundancy is given by the joint information of the candidate feature, the class, and the selected features.

The calculation of feature redundancy is computationally expensive on high-dimensional data, so scholars adopt two-stage feature selection methods that balance correlation and redundancy to improve efficiency [21]. The basic idea is to first compute the correlation between features and categories for ranking, and then use a search-based algorithm to remove redundant features. A typical algorithm is FCBF, proposed in 2004. It first calculates the symmetrical uncertainty (SU) between each feature and the class, sorts the features in descending order, and removes features below a pre-set threshold, i.e., irrelevant features. It then selects the feature with the largest SU with the class in the current feature set and computes, one by one, the SU between this feature and each remaining feature and between each remaining feature and the class, until all features redundant with respect to it are removed. The feature selection method based on the maximum information coefficient and approximate Markov blanket (FCBF-MIC) [22] still uses symmetric uncertainty to measure the feature-class correlation in the first stage; in the second stage, an approximate Markov blanket combined with the maximum information coefficient criterion is used to remove redundant features. The normal max-relevance and min-redundancy (nmRMR) [23] algorithm is also based on the approximate Markov blanket: it first sorts the features using the maximum relevance minimum redundancy criterion and then removes irrelevant and redundant features according to the approximate Markov blanket condition. Mini-batch K-means normalized mutual information feature inclusion (KNFI) [24], proposed in 2019, combines filter and wrapper techniques: after clustering by mini-batch K-means, it uses normalized mutual information to rank the features, and the ranked features of the first stage are then added to the subset one by one.

The above methods can select relevant features and eliminate redundant features. However, they still need improvement in determining the optimal feature subset efficiently and in improving classification performance.

Preliminaries

Fisher score has been widely applied to feature selection problems. Pareto dominance theory is mainly used to deal with multi-objective optimization problems, and feature selection is in essence a multi-objective problem. This section introduces the Fisher score algorithm and Pareto dominance theory.

Fisher score algorithm based on probability distance standard

Fisher score is a correlation-based feature evaluation criterion built on probability distance and is a practical feature selection method. In the Fisher score algorithm, the intra-class dispersion Sw represents the intra-class distance and the inter-class dispersion Sb represents the inter-class distance. The class-distinguishing ability of a feature is the ratio of Sb to Sw; the larger this value, the stronger the category correlation of the feature.

Assume a binary classification problem. Positive samples are labeled 1 and negative samples 0; the number of positive samples is n1, the number of negative samples is n0, the total number of samples is n, and the number of features is m.

The Sw and Sb of feature f are defined in Eqs. (1) and (2).

$$ S_{w}^{\left( f \right)} = n_{0} \left( {\sigma_{0}^{\left( f \right)} } \right)^{2} + n_{1} \left( {\sigma_{1}^{\left( f \right)} } \right)^{2} $$
(1)
$$ S_{b}^{\left( f \right)} = n_{0} \left( {\mu_{0}^{\left( f \right)} - \mu^{\left( f \right)} } \right)^{2} + n_{1} \left( {\mu_{1}^{\left( f \right)} - \mu^{\left( f \right)} } \right)^{2} $$
(2)

The correlation between feature f and the class calculated by Fisher score is defined in Eq. (3).

$$ Correlation\,\left( f \right) = FS\left( f \right) = \frac{{S_{b}^{\left( f \right)} }}{{S_{w}^{\left( f \right)} }} $$
(3)

where \(\mu_{0}^{\left( f \right)}\) is the mean of feature f in the negative samples, \(\mu_{1}^{\left( f \right)}\) is the mean of feature f in the positive samples, and \(\mu^{\left( f \right)}\) is the mean of feature f over all samples. \(\sigma_{0}^{\left( f \right)}\) and \(\sigma_{1}^{\left( f \right)}\) are the variances of feature f in the negative and positive samples, respectively.

It can be seen from formula (3) that the greater the inter-class dispersion and the smaller the intra-class dispersion of a feature, the larger its Fisher score and the better its classification effect.
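As a minimal sketch (not the authors' code), Eqs. (1)–(3) can be computed per feature with NumPy; the toy data below are hypothetical.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score of each feature for a binary problem labelled {0, 1},
    following Eqs. (1)-(3): FS(f) = S_b / S_w."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    mu, mu0, mu1 = X.mean(axis=0), X0.mean(axis=0), X1.mean(axis=0)
    s_w = n0 * X0.var(axis=0) + n1 * X1.var(axis=0)    # intra-class dispersion, Eq. (1)
    s_b = n0 * (mu0 - mu) ** 2 + n1 * (mu1 - mu) ** 2  # inter-class dispersion, Eq. (2)
    return s_b / (s_w + 1e-12)                         # Eq. (3); epsilon guards against zero variance

# toy check: feature 0 is made clearly discriminative and should get the largest score
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, 0] += 3.0
print(np.argmax(fisher_scores(X, y)))   # expected: 0
```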

Pareto dominance theory

Multi-objective optimization generally involves maximizing or minimizing multiple objective functions. A minimization multi-objective optimization problem [25] can be described as formula (4).

$$ \begin{aligned} \min j\left( x \right) & = \left( {j_{1} \left( x \right),j_{2} \left( x \right), \ldots ,j_{k} \left( x \right)} \right)^{T} \\ s.t.\quad g_{i} \left( x \right) & \ge 0,\quad i = 1,2, \ldots ,m; \\ h_{j} \left( x \right) & = 0,\quad j = 1,2, \ldots ,l \\ \end{aligned} $$
(4)

In formula (4), x is the decision variable, \(j_{1} \left( x \right),j_{2} \left( x \right), \ldots ,j_{k} \left( x \right)\) are the k objective functions to be minimized, and \(g_{i} \left( x \right)\) and \(h_{j} \left( x \right)\) are the constraint conditions of the problem.

In this minimization multi-objective problem, for the k objective components and any two decision variables \(x_{a}\) and \(x_{b}\), \(x_{a}\) dominates \(x_{b}\) if the following two conditions hold [26]:

  • \((1)\,\,\forall i \in 1,2, \ldots ,k,\quad j_{i} \left( {x_{a} } \right) \le j_{i} \left( {x_{b} } \right)\)

  • \((2)\,\,\exists i \in 1,2, \ldots ,k,\quad j_{i} \left( {x_{a} } \right) < j_{i} \left( {x_{b} } \right)\)

When the objective values of solution \(x_{a}\) are better than those of solution \(x_{b}\) in this sense, \(x_{a}\) is said to Pareto-dominate \(x_{b}\). A solution \(x_{a}\) that is not dominated by any other solution is called a non-dominated solution.
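A minimal sketch of this dominance check for a minimization problem (illustrative only, not from the paper):

```python
def dominates(ja, jb):
    """True if solution a Pareto-dominates solution b in a minimization problem:
    a is no worse on every objective (condition 1) and strictly better on at
    least one (condition 2)."""
    return all(x <= y for x, y in zip(ja, jb)) and any(x < y for x, y in zip(ja, jb))

print(dominates((1, 2), (2, 2)))  # True: better on the first objective, equal on the second
print(dominates((1, 3), (3, 1)))  # False: the two solutions are mutually non-dominated
```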

Proposed architecture and methods

In this paper, feature selection using a non-dominant features-guided search is proposed. The Fisher score algorithm, based on the probability distance criterion, measures the category correlation of features and extracts a set of highly correlated pre-selected features. Cosine similarity, a geometric distance measure, is used to measure the similarity between features, so that a lower-dimensional set of features can represent more of the sample information. Pareto dominance theory is introduced to determine the non-dominant features that guide the search. Finally, the feature subset with the greatest category correlation and the least redundancy between features is selected. Figure 1 shows the overall architecture of this method.

Fig. 1
figure 1

NDFS architecture

Cosine similarity measure

Cosine similarity [27] is a method to measure the similarity of two vectors, and this paper introduces it to calculate the similarity between features. For any two features f1 and f2 from the feature set, the cosine of the angle between them measures their similarity. The cosine similarity of f1 and f2 is calculated by Eq. (5).

$$ Similarity\,Info\left( {f_{1} ,f_{2} } \right) = \cos \left( \theta \right) = \frac{{f_{1} \cdot f_{2} }}{{\left\| {f_{1} } \right\| \times \left\| {f_{2} } \right\|}} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {f_{{1_{i} }} f_{{2_{i} }} } \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} f_{{1_{i} }}^{2} } \times \sqrt {\mathop \sum \nolimits_{i = 1}^{n} f_{{2_{i} }}^{2} } }} $$
(5)

In formula (5), f1 and f2 are the feature vectors and n is the total number of instances. \(f_{{1_{i} }}\) and \(f_{{2_{i} }}\) are the values of f1 and f2 for the ith instance, respectively.

Therefore, the range of feature similarity is [−1, 1]. The closer the value is to 1, the more similar the two features are.
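A minimal sketch of Eq. (5) with hypothetical feature vectors (illustrative, not the authors' code):

```python
import numpy as np

def cosine_similarity(f1, f2):
    """Cosine similarity between two feature vectors, Eq. (5); the result lies in [-1, 1]."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

f1 = np.array([1.0, 2.0, 3.0])
f2 = np.array([2.0, 4.1, 5.9])
print(cosine_similarity(f1, f2))   # close to 1: the two features carry nearly the same information
```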

Pareto dominance feature

In the process of feature selection, not only the correlation between features and the class but also the correlation among features should be considered. Features with high class correlation may have similar data distributions, so the similarity between such features may be high, which means there may be redundancy between them. High redundancy does not improve the model's performance and may even make it decline sharply, so redundancy must be removed. Feature selection can therefore be regarded as a multi-objective optimization problem whose goal is to select features with the highest class correlation and the lowest redundancy among features. Inspired by Pareto theory in multi-objective optimization, the concept of the Pareto dominance feature is defined as follows.

$$ \left\{ \begin{array}{ll} f_{1} > f_{2} \;\left( {f_{1} \text{ dominates } f_{2} } \right), & \text{if } C\left( {f_{1} } \right) > C\left( {f_{2} } \right) \text{ and } S\left( {f_{1} ,f_{2} } \right) > \mu \\ f_{1} \ge f_{2} \;\left( {f_{1} \text{ weakly dominates } f_{2} } \right), & \text{else if } C\left( {f_{1} } \right) \ge C\left( {f_{2} } \right) \text{ and } S\left( {f_{1} ,f_{2} } \right) > \mu \\ f_{1} \sim f_{2} \;\left( {f_{1} \text{ non-dominates } f_{2} } \right), & \text{else} \\ \end{array} \right. $$
(6)

For two features f1 and f2, if the category correlation of f1 is higher than that of f2 and the similarity between f1 and f2 is higher than the given threshold, then f1 dominates f2; otherwise, f1 is a non-dominated feature with respect to f2.

Definition 1

Non-dominance feature. For any two features f1 and f2, the binary relations \(>\), \(\ge\) and \(\sim\) are defined as in formula (6).

Here, \( C\left( {f_{i} } \right) \) is the category correlation of feature fi, and \(C\left( {f_{1} } \right) > C\left( {f_{2} } \right)\) means that the category correlation of f1 is higher than that of f2. \( {\text{S}}\left( {f_{1} ,f_{2} } \right)\) is the feature similarity between f1 and f2, and \(\mu\) is the given similarity threshold.

In the NDFS algorithm proposed in this paper, the category correlation of features is calculated by Fisher score, and the similarity between features is calculated by cosine similarity.
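A minimal sketch of the relation in formula (6), with Fisher scores as the class correlations and cosine similarity as the feature similarity (the numeric inputs are hypothetical; 0.68 echoes the threshold used later in the experiments):

```python
def feature_relation(c1, c2, sim, mu):
    """Relation between features f1 and f2 according to formula (6).
    c1, c2: class correlations (Fisher scores); sim: cosine similarity; mu: similarity threshold."""
    if c1 > c2 and sim > mu:
        return "f1 dominates f2"
    if c1 >= c2 and sim > mu:
        return "f1 weakly dominates f2"
    return "f1 and f2 are non-dominated"

print(feature_relation(0.8, 0.3, 0.9, mu=0.68))  # "f1 dominates f2": f2 is redundant
print(feature_relation(0.8, 0.3, 0.4, mu=0.68))  # "f1 and f2 are non-dominated": keep both
```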

The algorithm procedure

The specific process of the algorithm is shown in Table 1.

Table 1 Pseudocode of NDFS

Assume the original dataset has n samples, m genes, and feature set \(F = \left\{ {f_{1} ,f_{2} , \cdots ,f_{m} } \right\}\). Let the threshold used in the Fisher score stage be \(\mu_{1}\) and the similarity threshold used when eliminating redundant features based on the non-dominance relation be \(\mu_{2}\). The proposed algorithm proceeds as follows.


Step 1: The Fisher score of each feature is calculated using the Fisher score algorithm. Features with small class correlation are removed using threshold \(\mu_{1}\) to obtain the remaining feature subset F.


Step 2: Initialize an empty set S and select from the remaining feature subset F the feature \(f_{k}\) with the largest Fisher score with respect to the class target Y.

The feature \(f_{k}\) is added to the selected key feature set S (i.e., \(S \leftarrow S \cup f_{k}\)) and then removed from the set F (i.e., \(F \leftarrow F\backslash f_{k}\)).


Step 3: Select the feature \(f_{d}\) with the largest Fisher score in the remaining feature subset; every feature \(f_{s}\) in the key feature subset has a class correlation greater than that of \(f_{d}\) (i.e., \(C\left( {f_{s} } \right) > C\left( {f_{d} } \right)\)). The similarity between \(f_{s}\) and \(f_{d}\) is calculated. If \(S\left( {f_{s} ,f_{d} } \right) > \mu_{2}\), the feature \(f_{d}\) is dominated by \(f_{s}\) and is removed from the remaining set. If \(S\left( {f_{s} ,f_{d} } \right) \le \mu_{2}\), the feature \(f_{d}\) is not dominated by \(f_{s}\). If no feature in the key feature subset dominates \(f_{d}\), it is added to the key feature subset.


Step 4: If the remaining set F is empty, terminate the algorithm; otherwise, go to Step 3. The final output is the key feature subset S containing the non-dominated features, i.e., the feature subset with maximum correlation and minimum redundancy.
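The following is a minimal sketch of Steps 1–4 under our reading of the procedure; it reuses the hypothetical fisher_scores and cosine_similarity helpers sketched earlier and implements a simple forward pass rather than the full two-way search.

```python
import numpy as np

def ndfs(X, y, mu1, mu2):
    """Sketch of the NDFS procedure (Steps 1-4). Returns column indices of the key feature subset S."""
    scores = fisher_scores(X, y)
    # Step 1: keep only features whose Fisher score exceeds mu1, ordered by decreasing relevance
    remaining = [i for i in np.argsort(scores)[::-1] if scores[i] > mu1]
    selected = []                                  # Step 2: S starts empty; the top feature enters first
    while remaining:                               # Step 4: stop when the remaining set is empty
        fd = remaining.pop(0)                      # Step 3: next candidate = largest remaining Fisher score
        dominated = any(cosine_similarity(X[:, fs], X[:, fd]) > mu2 for fs in selected)
        if not dominated:                          # no selected feature dominates fd -> keep it
            selected.append(fd)
    return selected
```

The returned indices can then be used to train any of the downstream classifiers on X[:, selected].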

Experimental results and discussion

This section discusses the experimental results of the proposed algorithm on six high-dimensional small-sample data sets. Section “Experimental data sets” introduces the experimental data sets in detail, Section “Experimental setup” gives the experimental setup, and Section “Feature selection experiment” describes the feature selection process of NDFS in detail. The feature selection results of the proposed method are compared with six algorithms in Section “Experimental result analysis”.

Experimental data sets

To verify the effectiveness and applicability of the method for feature selection on high-dimensional, small-sample gene data, six public gene data sets are selected. The HeadNeck data set is obtained from the GEO database [28]; the Colon and Leukemia data sets are obtained from Kaggle; the Lung and 11_Tumors data sets are obtained from http://www.gemssystem.org/; and the LIHC data set is from The Cancer Genome Atlas (TCGA). These data sets have been widely used in the literature and serve as standard benchmarks. Table 2 summarizes the six public data sets used in this study, which include four binary data sets and two multi-category data sets.

Table 2 Datasets used in experiments

The Colon dataset consists of 62 samples collected from colon cancer patients, including 40 tumor samples and 22 normal samples. The Leukemia dataset contains 72 samples of 2 different leukemias, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). The 11_Tumors dataset contains 174 samples and genetic data covering 11 common human cancers: prostate cancer, bladder/urethral cancer (transitional cell carcinoma and squamous cell carcinoma), invasive breast ductal carcinoma, rectal cancer, gastric adenocarcinoma, clear cell kidney carcinoma, liver cancer, ovarian serous papillary adenocarcinoma, pancreatic cancer, and lung cancer (adenocarcinoma and squamous cell carcinoma). The Lung dataset contains four different lung tumors (139 cases of adenocarcinoma, 6 cases of small cell lung cancer, 21 cases of squamous cell carcinoma, and 20 cases of lung carcinoid) and 17 cases of normal lung tissue. The HeadNeck dataset contains 55 samples with local recurrence and 50 samples without local recurrence. The LIHC dataset from TCGA includes 374 liver cancer samples and 50 paracancerous samples. The number of samples in these datasets ranges from 62 to 424 and the number of features from 2000 to 54,869, so all of them are high-dimensional small-sample data.

Experimental setup

To evaluate the classification performance of the selected feature sets, SVM, decision tree (DT), random forest (RF), logistic regression (LR), and multi-layer perceptron (MLP) classifiers are used to construct prediction models. The parameter settings of each model are shown in Table 3. AUC, accuracy, F1-score, and ROC curves are used as evaluation indicators to assess the different feature selection results and the constructed prediction models.

Table 3 Parameter table of each model

To avoid overfitting and improve data reusability, fivefold cross-validation is used in this experiment. The original data set is randomly divided into five equal parts, and in each part the proportion of positive and negative samples is kept consistent with that of the original data. One part is used as the test set and the other four as the training set. Five models are obtained after five runs, and the average of their evaluation indexes is taken as the final evaluation result of the prediction model.
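As an illustrative sketch (not the authors' exact setup), the stratified fivefold protocol can be reproduced with scikit-learn; the data shape and the SVM's default parameters below are placeholders, since Table 3 gives the actual settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# placeholder data shaped like the Colon dataset after feature selection (62 samples, 8 features)
rng = np.random.default_rng(0)
X_sel = rng.normal(size=(62, 8))
y = np.array([1] * 40 + [0] * 22)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps class proportions in every fold
acc = cross_val_score(SVC(), X_sel, y, cv=cv, scoring="accuracy")
print(acc.mean())   # the mean over the five folds is reported as the final estimate
```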

For performance comparison, the following algorithms are selected: Fisher score, MIC, FCBF-MIC, CFR, KNFI, and mRMR. Since the Fisher score, MIC, CFR, and mRMR algorithms can directly select a given number of features, the number of features selected by these four methods is kept consistent with the number of features finally selected by NDFS. Since the number of features that FCBF-MIC and KNFI eventually produce cannot be known in advance, no limit is imposed on the number of features these methods select. FCBF-MIC uses the same pre-selection process as NDFS.

Feature selection experiment

When using the NDFS algorithm for feature selection, two thresholds need to be determined. One is the Fisher score threshold \(\mu_{1}\) for pre-selection using the Fisher score algorithm, and the other is the similarity threshold \(\mu_{2}\) for removing redundant features based on Pareto dominance theory.

To determine the threshold \(\mu_{1}\), the original features are sorted according to their Fisher score values. For visual observation, the top 50, 100, 200, 300, 400, and 500 features are selected to form a series of feature subsets. For each subset, SVM and logistic regression (LR) classification models are constructed, and the average classification accuracy is used to evaluate the subset. Finally, the Fisher score value corresponding to the appropriate number of features is used to determine the threshold \(\mu_{1}\) of each data set.
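A sketch of this threshold search, assuming the fisher_scores helper above and scikit-learn classifiers with default/placeholder parameters, could look as follows.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def accuracy_vs_topk(X, y, ks=(50, 100, 200, 300, 400, 500)):
    """Average SVM/LR cross-validated accuracy for the top-k Fisher-ranked features;
    the Fisher score of the chosen k then determines the threshold mu1."""
    order = np.argsort(fisher_scores(X, y))[::-1]
    results = {}
    for k in ks:
        Xk = X[:, order[:k]]
        acc_svm = cross_val_score(SVC(), Xk, y, cv=5).mean()
        acc_lr = cross_val_score(LogisticRegression(max_iter=1000), Xk, y, cv=5).mean()
        results[k] = (acc_svm + acc_lr) / 2
    return results
```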

Figure 2 shows the average classification accuracy corresponding to different numbers of features on the six datasets. As seen in Fig. 2a, for the Colon dataset the accuracy of the SVM and LR classifiers is relatively high when the number of features is about 400; the corresponding Fisher score is 0.0627, so the threshold \(\mu_{1}\) of the Colon dataset is set to 0.06 and the number of selected features is 414. As seen in Fig. 2b, for the Leukemia dataset the accuracy of the SVM and LR classifiers is relatively high when the number of features is about 400, so the threshold \(\mu_{1}\) is set to 0.2 and the number of selected features is 406. As seen in Fig. 2c, for the 11_Tumors dataset the classifier accuracy is relatively high when the number of features is about 300, so the threshold \(\mu_{1}\) is set to 1.13 and the number of selected features is 302. As seen in Fig. 2d, for the Lung dataset the classifier accuracy is relatively high when the number of features is about 300, so the threshold \(\mu_{1}\) is set to 1.17 and the number of selected features is 299. As seen in Fig. 2e, for the HeadNeck dataset the LR classifier performs best when the number of features is about 300; although the accuracy of the SVM classifier increases as the number of features decreases, the performance of the LR classifier gradually declines, so to avoid losing important features the number of features is kept at about 300, the threshold \(\mu_{1}\) is set to 0.05, and the number of selected features is 287. As seen in Fig. 2f, for the LIHC dataset the classifier accuracy is relatively high when the number of features is about 100, so the threshold \(\mu_{1}\) is set to 1.53 and the number of selected features is 99.

Fig. 2
figure 2

Average classification accuracy of different features on six data sets

In summary, when the NDFS algorithm uses Fisher score to pre-select features in the first stage, the corresponding thresholds and numbers of pre-selected features for the different datasets are shown in Table 4.

Table 4 Fisher thresholds and the number of pre-selected features in different data sets

For the second stage, which removes redundant features based on the similarity measure and Pareto dominance theory, the similarity threshold \(\mu_{2}\) must be determined. In this experiment, cosine similarity is used to measure the similarity between features: the closer the value is to 1, the stronger the redundancy between features, and features with higher class correlation dominate more of the other features. Once the similarity threshold is determined, the features dominated under that threshold can be removed. The larger the threshold, the fewer features are dominated; the smaller the threshold, the more features are dominated and deleted.

Different candidate feature subsets are selected according to the thresholds \(\mu_{1}\) in Table 4, and the classification accuracy corresponding to different similarity thresholds is calculated for each candidate subset. After many experiments, the threshold is finally set to \(\mu_{2} = 0.68\). The resulting numbers of features in the optimal subset of each dataset are shown in Table 5.
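A hypothetical sketch of this second threshold search, reusing the ndfs sketch above and a default-parameter SVM, evaluates a grid of candidate values for \(\mu_{2}\).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def accuracy_vs_mu2(X, y, mu1, mu2_grid=np.arange(0.50, 0.95, 0.05)):
    """Cross-validated accuracy of the NDFS subset for a range of similarity thresholds mu2."""
    results = {}
    for mu2 in mu2_grid:
        idx = ndfs(X, y, mu1, float(mu2))
        if not idx:                 # skip if pre-selection left no features
            continue
        results[round(float(mu2), 2)] = cross_val_score(SVC(), X[:, idx], y, cv=5).mean()
    return results
```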

Table 5 Similarity threshold and feature number of optimal feature subset for each dataset

Therefore, the final number of features selected by this method was determined as follows: Colon-8, Leukemia-15, 11_Tumors-11, Lung-4, HeadNeck-109, and LIHC-3.

Experimental result analysis

In this section, the proposed algorithm is compared with the other six algorithms (Fisher score, MIC, FCBF-MIC, CFR, KNFI, and mRMR) on the six data sets. Different classifiers are used to construct prediction models and evaluate the performance of the feature subsets selected by NDFS.

Table 6 lists the number of features selected by the proposed method and the six comparison algorithms. According to the experimental setup in Section “Experimental setup”, Fisher score, MIC, CFR, and mRMR all select the same number of features as NDFS. NDFS first retains only features that are strongly related to the category and more discriminative, greatly reducing the feature dimension, and then performs a guided search based on non-dominated features to retain the best set of features. The second stage of FCBF-MIC uses the MIC between features and between features and classes to compute an approximate Markov blanket; the number of features deleted under this condition is smaller than the number of redundant features discarded by the non-dominated feature-guided search of NDFS. Therefore, on all data sets, the final number of features selected by FCBF-MIC is much higher than that of NDFS.

Table 6 The number of features selected by 7 feature selection methods for 6 high-dimensional datasets

Tables 7, 8, and 9 show the classification accuracy, F1-score, and AUC values on the 6 datasets for the 7 algorithms and 5 classifiers. Figure 3 shows the ROC curves of the proposed method and the other 6 methods under the SVM classifier. The ROC curve of the feature subset obtained by NDFS is closer to the upper left corner, indicating better performance than the six comparison algorithms.

Table 7 Classification accuracy, F1 and AUC of Colon and Leukemia datasets under 7 algorithms and different classifiers
Table 8 Classification accuracy, F1 and AUC of 11_Tumors and Lung datasets under 7 algorithms and different classifiers
Table 9 Classification accuracy, F1 and AUC of HeadNeck and LIHC datasets under 7 algorithms and different classifiers
Fig. 3
figure 3

Receiver Operating Characteristic of six feature selection methods on Colon and Leukemia datasets under SVM classifier

Fisher score and MIC compute only the relationship between features and categories, ignoring the relationships among features. On top of removing irrelevant features, NDFS also computes the relationships among features, eliminating redundancy and completing the multi-objective calculation in feature selection. Therefore, compared with the Fisher score and MIC algorithms, NDFS can remove redundant features. In the experimental results, the evaluation indexes of NDFS are superior to Fisher score and MIC under four of the five classifiers on all data sets; the accuracy of the RF classifier is slightly lower than that of MIC on three data sets, but the generalization ability of NDFS across different learners is still better than that of these two algorithms on the whole.

Compared with FCBF-MIC, the classification results of NDFS are better on the Leukemia, HeadNeck, Colon, and Lung datasets, while on the 11_Tumors and LIHC data sets each algorithm has advantages under different classifiers. FCBF-MIC uses SU for correlation analysis in the first stage and MIC for redundancy removal in the second stage, and the two stages are not connected. NDFS selects non-dominated features by their Fisher score values and uses a two-way search strategy to make full use of the feature information from each calculation.

Compared with CFR, its classification effect with RF is better than that of NDFS only on the Leukemia dataset; on the other datasets NDFS has an obvious advantage. The mutual information and conditional mutual information used by CFR to measure the relationships among features leave too much redundancy and yield low performance. CFR also uses a greedy search strategy, but its final number of selected features is not well justified. NDFS computes feature similarity among non-dominated features and selects the final optimal subset according to the established similarity threshold.

When selecting features, KNFI focuses more on the influence of the currently selected features on classifier accuracy and ignores the information of the features themselves. NDFS makes full use of the information of the features themselves throughout the process, while classification information is also used as a reference for selecting the feature correlation threshold. The experimental results show that KNFI with DT outperforms NDFS only on the HeadNeck dataset; under the other dataset-classifier combinations, NDFS obtains feature subsets with better classification performance. mRMR achieves the best classification indicators with SVM on the HeadNeck dataset, achieves the best results on some indicators with DT, LR, and MLP, and its F1 is also optimal under the DT classifier on the Lung dataset, but it lacks a competitive advantage on the remaining datasets. mRMR uses mutual information to measure feature information, so the features it retains may remain highly correlated with one another while still being correlated with the class variable. NDFS considers the relationships among features through feature similarity to remove dispensable features and generate a reliable feature subset.

Overall, on the six datasets the proposed NDFS algorithm selects features that are effective for the problems at hand. It performs well in both binary and multi-class classification tasks and generalizes well across different classifiers.

To further compare the differences among the algorithms, the Friedman nonparametric test is applied, and the average rank of each algorithm is used to judge whether there are significant differences between the algorithms. The average ranks of the 7 algorithms on the 6 datasets are calculated and compared with the F-distribution critical value at a confidence level of 0.1, giving Table 10. Table 10 shows that, for the five classifiers, the \(T_{F}\) values of classification accuracy and F1-score are both greater than the corresponding critical value, indicating significant differences between algorithms. Therefore, the proposed NDFS algorithm has good performance.
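For reference, SciPy exposes the chi-square form of the Friedman test (the paper's \(T_{F}\) statistic is the F-distribution variant derived from it); the accuracy values below are hypothetical placeholders, with the real values coming from Tables 7, 8, and 9.

```python
from scipy.stats import friedmanchisquare

# hypothetical per-dataset accuracies (one list per algorithm, one entry per dataset)
ndfs_acc   = [0.93, 0.97, 0.91, 0.99, 0.72, 0.99]
mrmr_acc   = [0.88, 0.95, 0.89, 0.98, 0.73, 0.98]
fisher_acc = [0.85, 0.93, 0.87, 0.97, 0.68, 0.97]

stat, p = friedmanchisquare(ndfs_acc, mrmr_acc, fisher_acc)
print(stat, p)   # a small p-value indicates a significant difference among the algorithms
```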

Table 10 Friedman test

In the best case of the NDFS algorithm, every candidate feature is dominated by the first feature of the key feature subset, and the time complexity is O(n). In the worst case, every candidate feature is non-dominated with respect to the key feature subset, and the time complexity is O(n²). Since the NDFS algorithm first deletes many irrelevant features, the scale of the Pareto dominance calculations is reduced and the overall speed of the method improves, so NDFS runs relatively quickly. The feature selection experiments on each data set were repeated 10 times, and the mean was taken as the estimate of the running time of feature selection. The final experimental results are shown in Table 11.

Table 11 Average running time(s) of 7 algorithms from 6 datasets

Because NDFS performs pre-selection based on Fisher score and then an additional search, its running time is higher than that of the Fisher score and MIC algorithms. However, the classification experiments show that the feature subsets produced by these two algorithms have poorer classification performance and still contain highly redundant features, so the time cost of the proposed algorithm is acceptable. The table shows that, compared with the FCBF-MIC, CFR, and KNFI algorithms, NDFS has a lower running time on all six datasets, and compared with the mRMR algorithm it has a lower running time on three datasets.

In summary, when the same number of features is selected, the classification performance of the feature subsets selected by the NDFS algorithm is better than that of Fisher score, MIC, CFR, and mRMR under most classifiers. NDFS selects important features and eliminates redundant features, which preserves useful feature information. Compared with the FCBF-MIC and KNFI algorithms, NDFS adopts distance-based measures, which are more accurate than the probability estimates of information theory, so the selected feature subsets perform better.

Conclusion

In this paper, feature selection using a non-dominant features-guided search is proposed. The Fisher score algorithm is used to measure the category correlation of features, and cosine similarity, based on a geometric distance criterion, is used to measure the similarity between features. Specifically, the algorithm uses Pareto dominance theory to gradually remove the features dominated by the feature with the largest category correlation (i.e., the redundant features), obtaining a feature subset with maximum correlation and minimum redundancy.

The algorithm exploits the speed and effectiveness of the Fisher score algorithm to select relevant features and makes up for Fisher score's failure to consider the correlation among features. The proposed method is compared with six competing feature selection methods on six real-world data sets. It achieves better classification performance than the Fisher score, MIC, CFR, and mRMR algorithms, since it does not consider only category correlation or only feature redundancy. Compared with the FCBF-MIC and KNFI algorithms, it obtains feature subsets with better classification ability while running faster. The above experimental results show that the NDFS method outperforms the other compared feature selection methods.

NDFS is able to extract low-redundancy features that affect the gene category. Analyzing which of the selected genes affect the disease is also worth studying. Further work will establish an interpretable association model between the selected features and the final results to provide a scientific basis for personalized diagnosis and treatment.