1 Introduction

With the development of machine learning and artificial intelligence, ever richer information is obtained from the objects under study, but irrelevant and redundant features are introduced along with it, which raises the feature dimension. For classification learning, good training samples are the key to training a classifier. Irrelevant or redundant information in the samples increases the complexity of the classification algorithm and the time needed for training and prediction, and directly degrades classification performance. It is therefore necessary to select the optimal features that best represent the characteristics of the object and to eliminate irrelevant and redundant features, so as to improve the efficiency of the classifier [1].

Feature selection is the process of selecting the optimal feature subset from the raw feature set in order to reduce the dimension of the feature space [2]. Its purpose is to simplify the data structure, aid interpretation of the data, and improve the robustness, stability and discriminative power of the model. According to whether the evaluation criterion of the feature subset depends on a classifier, feature selection methods fall into three categories: filter, wrapper and embedded [3]. Filter methods rely on the properties of the feature space itself and use feature metrics to retain as much of the information in the data as possible; feature selection and classifier training are two independent processes. For example, Zhao et al. [4] applied the constraint score method to score audio features in environmental sound classification and selected the optimal feature subset to improve classification performance. Saqlain et al. [5] used the Fisher score method to select feature subsets and passed them to an RBF kernel-based SVM for diagnosing heart disease. Yong et al. [6] divided features into three levels, strong correlation, sub-strong correlation and other features, based on mutual information and the correlation between each feature and the label, and then simplified the features. Wrapper methods rely on a preselected classifier, take the performance of that classifier as the evaluation criterion for a feature subset, and remove features that negatively affect classification performance. For example, Ye et al. [7] proposed a multimodal wrapper feature selection method based on effective distance, which considers the global relationship between samples. Mafarja et al. [8] wrapped the whale optimization algorithm around the classifier and used the crossover and mutation characteristics of the optimization algorithm to improve the speed and accuracy of the feature search. To improve the classification performance of a financial credit scoring model, Jadhav et al. [9] proposed an information-gain-directed feature selection algorithm and passed the top m features through a genetic algorithm wrapper. Embedded methods place the feature selection process inside the learning algorithm and are a compromise between the filter and wrapper approaches. For example, Xu et al. [10] combined the intra-class variation coefficient and inter-class mutual information into a joint index to evaluate feature contribution and performed feature selection with an embedded method. Maldonado et al. [11] penalized the cardinality of the feature set through a scale factor technique and used SVM as the embedded base classifier for feature selection; the method was applied to high-dimensional datasets with extremely imbalanced classes. Liu et al. [12] proposed an unsupervised adaptive neighborhood embedded feature selection method that captures the intrinsic geometric structure of the data based on the K-nearest neighbor method. Both the wrapper and embedded methods depend on the classification learning algorithm and are prone to problems such as overfitting and low efficiency. The filter method, in contrast, is efficient, suitable for various data types, highly versatile and of low algorithmic complexity.

In the filter approach, the feature metric is the key factor. The maximal relevance and minimal redundancy (mRMR) criterion is a filter feature metric for pattern recognition [13]; it aims to select, from the raw feature set, the feature subset with maximal dependency, maximal relevance and minimal redundancy. However, mRMR only considers the contribution of a single feature: the feature with the maximum contribution in the raw feature set is repeatedly selected and added to form the optimal feature subset. In this way, the relevance between features is ignored and the joint contribution of multiple features to classification is not sufficiently considered, so the selected feature subset may not be optimal. There is thus room for improvement in the selection of the optimal feature subset. Therefore, this paper proposes ImRMR (Improved mRMR) on the basis of the mRMR method. First, the relevance between each feature and the class is calculated with the Pearson correlation coefficient and mutual information, and a weight factor trades off the two metrics to rank the features. Secondly, candidate feature subsets are generated from the ranked features with the proposed equal grouping method (EGM). The joint contribution of each feature subset is evaluated by the relevance and redundancy of the subset, and a ranking of the feature subsets is obtained with an incremental search method. Finally, the optimal feature subset is obtained by combining the sequential forward search (SFS) method. Experiments are conducted on seven datasets to verify the effectiveness of ImRMR.

This paper adopts a supervised learning setting, performs feature selection on these datasets with ImRMR, and classifies the selected features with the random forest algorithm. In this study, 70% of each dataset is randomly selected as the training set and the remaining 30% as the test set. We conduct extensive experiments and evaluate various performance metrics (accuracy, dimensionality reduction rate, comprehensive rate, precision, recall and F-measure) to determine the effectiveness of the proposed method. The experiments show that the proposed feature selection method with a random forest classifier outperforms the original mRMR and other methods.

The rest of this paper is organized as follows: Sect. 2 reviews related work. Section 3 describes the mRMR method. Section 4 presents the proposed ImRMR method in detail. The experiments and results are analyzed in Sect. 5. Finally, Sects. 6 and 7 give the limitations and the conclusion.

2 Related work

Based on mRMR, Lu et al. [14] extracted the key information most relevant to fault location in transmission line faults; by mining the implicit relationship between key features and the fault location and combining multiple feature information, the fault location result was obtained. The method achieves better fault location accuracy and adapts better to the transient components that appear after a fault. Billah and Waheed [15] effectively reduced the dimensionality of features extracted from endoscopic video frames with mRMR. Toğaçar et al. [16] used mRMR to select, from the features extracted by three deep models from lung X-ray images for pneumonia diagnosis, an efficient combined feature set; the selected set provided robust and consistent features for pneumonia detection, and mRMR effectively reduced its dimension. Gu et al. [17] used mRMR to calculate the relevance between power transformer fault features and fault types and the redundancy among the features, and obtained the optimal transformer fault feature set; experiments show that the optimal feature set is more efficient than the traditional feature set in transformer fault diagnosis. Eroğlu et al. [18] used mRMR to select breast ultrasound image features extracted by a hybrid CNN structure and achieved good classification and recognition results. Fan et al. [19] used mRMR to select important lead-rhythm features extracted from electroencephalogram recordings to build a model predicting the prognostic effect of acupuncture on depression. Tuncer et al. [20] decomposed the electroencephalogram signal with the discrete wavelet transform, used mRMR to select the most discriminative features from the texture features generated from the wavelet subbands, and classified the selected features with an SVM. In a study of COVID-19 disease classification using supervised optimized machine learning, Sharma et al. [21] used mRMR to remove irrelevant and misleading features from high-dimensional COVID-19 data, reducing the search space of the cuckoo search algorithm and improving learning efficiency. Baliarsingh et al. [22] employed mRMR to select relevant subsets of genes from microarray datasets and hybridized simulated annealing with the Rao algorithm to improve the solution quality after each iteration; the discriminant genes selected on the SRBCT dataset gave high classification accuracy.

In addition, many scholars have improved mRMR to obtain better feature selection results. Feng and Zhang [23] improved mRMR by using a conditional mutual information formula and three-dimensional calculation to determine the candidate connections of each attribute node in a Bayesian network classifier; the method enhanced the reliability and robustness of small-sample calculation. Yao et al. [24] introduced mRMR into the search process of particle swarm optimization to select feature subsets, and the feasibility and effectiveness of the algorithm were verified on UCI datasets with SVM as the classifier. Li and Wang [25] proposed a new mRMR method that uses several different evaluation criteria to measure the redundancy between features and the relevance between features and classes, and introduced an indicator vector λ to describe the user's actual requirement on data dimension. Wang et al. [26] improved mRMR by merging relevance measurement coefficients to obtain a primary feature subset, binary-coded the feature subset, and then combined a genetic algorithm to search for the optimal or suboptimal feature subset. Jo et al. [27] improved mRMR by using the Pearson correlation coefficient as the redundancy metric and the R value as the relevance metric. Ahmed et al. [28] put forward an enhanced mRMR (EmRMR) filtering method to remove noisy features of ransomware and selected the most relevant feature subset to characterize the real behavior of ransomware; EmRMR requires only a small number of evaluations, avoiding unnecessary calculations inherent in the original mRMR. Combining EmRMR with term frequency-inverse document frequency (TF-IDF), a weighted mRMR (WmRMR) technique was proposed to filter out runtime noise behavior according to the weights calculated by TF-IDF [29]. Compared with mRMR, WmRMR has lower dimensional complexity and fewer evaluations, and better estimates the feature significance in data captured in the early stage of ransomware attacks.

The existing improvements to mRMR are based on calculating the correlation and redundancy of individual features; they ignore the joint contribution of multiple features and do not consider the redundancy and correlation between feature subsets. We therefore propose an improved mRMR (ImRMR) based on feature subsets. The proposed method and its performance evaluation are discussed in the following sections.

3 mRMR feature selection

3.1 mRMR definition

Maximal relevance and minimal redundancy (mRMR) is a filter feature measurement criterion that calculates the redundancy between features and the relevance between features and the class based on mutual information \(I\left( {x;y} \right)\). It selects, from the raw feature set, the features that are most relevant to the class and have the least redundancy with the other features.

The mutual information \(I\left( {x;y } \right)\) is defined as Eq. 1.

$$I\left( {x;y} \right) = \iint {p\left( {x,y} \right)\log \frac{{p\left( {x,y} \right)}}{p\left( x \right)p\left( y \right)}{\text{d}}x{\text{d}}y}$$
(1)

where \(p\left( {x,y} \right)\) is the joint probability density of random variables \(x\) and \(y\), and \(p\left( x \right)\) and \(p\left( y \right)\) are the marginal probability densities of \(x\) and \(y\), respectively.
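As an illustration (not from the original paper), Eq. 1 can be estimated for discrete or discretized variables by replacing the integral with a sum over a joint histogram. A minimal numpy sketch, assuming the two variables are supplied as 1-D arrays of discrete values:

```python
import numpy as np

def mutual_information(x, y):
    """Estimate I(x; y) in nats from two discrete (or discretized) 1-D arrays."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1)        # joint frequency counts
    joint /= joint.sum()                       # joint probability p(x, y)
    px = joint.sum(axis=1, keepdims=True)      # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)      # marginal p(y)
    nz = joint > 0                             # skip zero cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))
```

Continuous features would need to be binned (or handled with a density estimator) before applying this estimate.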

Given a sample feature set \(S = \{ f_{1} ,f_{2} , \ldots ,f_{n} \}\) and a sample class \(c\), the relevance between \(S\) and \(c\) is the mean of the mutual information between each feature \(f_{i}\) and the class \(c\), as shown in Eq. 2.

$$D\left( {S,c} \right) = \frac{1}{\left| S \right|}\mathop \sum \limits_{{f_{i} \in S}} I\left( {f_{i} ;c} \right)$$
(2)

where \(\left| S \right|\) is the number of features in \(S\), and \(I\left( {f_{i} ;c} \right)\) is the mutual information between the feature \(f_{i}\) and the class \(c\).

The redundancy of all features in \(S\) is the mean of the mutual information between all pairs of features \(f_{i}\) and \(f_{j}\) in \(S\), as shown in Eq. 3.

$$R\left( S \right) = \frac{1}{{\left| S \right|^{2} }}\mathop \sum \limits_{{f_{i} ,f_{j} \in S}} I\left( {f_{i} ;f_{j} } \right)$$
(3)

The mRMR method seeks the optimal features of the samples with maximal relevance \(D\left( {S,c} \right)\) and minimal redundancy \(R\left( S \right)\), so the mRMR feature measurement criterion can be written as Eq. 4.

$${\text{mRMR}} = \mathop {\max }\limits_{S} \left( {D\left( {S,c} \right) - R\left( S \right)} \right)$$
(4)

3.2 Incremental search method

Incremental search is used to select the optimal feature set quickly and efficiently. Given the raw feature set \(X\), if the optimal feature set \(S_{m - 1}\) of \(m-1\) features has already been selected, the search continues for the next optimal feature in the remaining feature space \(X - S_{m - 1}\). Equation 4 can then be written in incremental form as Eq. 5.

$$\mathop {\max }\limits_{{f_{j} \in X - S_{m - 1} }} \left[ {I\left( {f_{j} ;\,c} \right) - \frac{1}{m - 1}\mathop \sum \limits_{{f_{i} \in S_{m - 1} }} I\left( {f_{j} ;f_{i} } \right)} \right]$$
(5)
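The incremental rule of Eq. 5 amounts to a greedy loop. The following is a minimal sketch under our assumptions, reusing the `mutual_information` helper from the previous sketch; `k`, the number of features to select, is a hypothetical parameter:

```python
import numpy as np

def mrmr_incremental(X, c, k):
    """Greedily select k column indices of X (samples x features) using Eq. 5."""
    n_features = X.shape[1]
    # I(f_j; c) for every feature (mutual_information is defined in the earlier sketch)
    relevance = [mutual_information(X[:, j], c) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]            # start from the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in set(range(n_features)) - set(selected):
            redundancy = np.mean([mutual_information(X[:, j], X[:, i]) for i in selected])
            score = relevance[j] - redundancy         # the bracketed term of Eq. 5
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```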

4 Improved mRMR based on feature subset

The mRMR method measures the contribution of a feature by calculating the relevance and redundancy of that individual feature; the joint contribution of multiple features is ignored, and relevance and redundancy are measured by mutual information only. Therefore, to obtain the optimal feature subset from the perspective of feature subsets, this paper uses two measurement criteria, the Pearson correlation coefficient and mutual information, to evaluate the relevance and redundancy of feature subsets, and introduces a weight factor to trade off the two criteria.

The whole process of the proposed ImRMR is shown in Fig. 1. Firstly, the equal grouping method (EGM) is adopted to divide the features into initial candidate feature subsets. Then the contribution of each candidate feature subset is calculated from its relevance and redundancy according to the Pearson correlation coefficient and mutual information. Finally, the ranking of the feature subsets is converted into a feature ranking, and the SFS search strategy is combined to obtain the final preferred subset. The optimal feature subset is then mapped back to the raw sample set for classification and recognition.

Fig. 1

The flow chart of feature selection with the ImRMR method

4.1 Initialize the candidate feature subsets with EGM

Given a sample set of size \(N*M\), \(N\) represents the number of samples and \(M\) the feature dimension. Following the number of candidate features considered by a base decision tree in a random forest [30], the size of each candidate feature subset is set to \(\log_{2} \left( M \right)\). The relevance between each feature and the class is calculated with the Pearson correlation coefficient and mutual information, and a weight factor α is introduced to adjust the weight of the two indicators. The features are then ranked by relevance. The relevance \(Ic\left( {f_{i} ;c} \right)\) between a single feature and the sample class is calculated as shown in Eqs. 6 and 7.

$$Ic\left( {f_{i} ;\,c} \right) = \alpha *I\left( {f_{i} ;\,c} \right) + \left( {1 - \alpha } \right)*{\text{corr}}\left( {f_{i} ;\,c} \right)$$
(6)
$${\text{corr}}\left( {f_{i} ;\,c} \right) = \frac{{\sum f_{i} c - \frac{{\sum f_{i} \sum c}}{N}}}{{\sqrt {\left( {\sum f_{i}^{2} - \frac{{\left( {\sum f_{i} } \right)^{2} }}{N}} \right)\left( {\sum c^{2} - \frac{{\left( {\sum c} \right)^{2} }}{N}} \right)} }}$$
(7)

where \(\alpha \in \left[ {0.1,1} \right]\), with a step size of 0.1, \(I\left( {f_{i} ;\,c} \right)\) represents the mutual information between each feature and class, and \({\text{corr}}\left( {f_{i} ;\,c} \right)\) refers to the Pearson correlation coefficient between each feature and class.
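A short sketch of Eq. 6 under the same assumptions (a discretized feature column `f`, a class vector `c`, and the `mutual_information` helper from the Sect. 3 sketch); `np.corrcoef` provides the Pearson coefficient of Eq. 7:

```python
import numpy as np

def combined_relevance(f, c, alpha):
    """Ic(f; c) of Eq. 6: weighted mix of mutual information and Pearson correlation."""
    mi = mutual_information(f, c)        # defined in the Sect. 3 sketch
    corr = np.corrcoef(f, c)[0, 1]       # Pearson correlation coefficient, Eq. 7
    return alpha * mi + (1 - alpha) * corr
```

Whether the correlation should be taken in absolute value is not specified in the paper; the sketch follows Eq. 7 literally.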

The range \(r\) of feature subset indices is given by Eqs. 8 and 9.

$$r = \left[ {1,{\text{round}}\left( \frac{M}{Fc} \right)} \right]$$
(8)
$$Fc = {\text{round}}\left( {\log_{2} \left( M \right)} \right)$$
(9)

where \(Fc\) is the number of features in the feature subset.

This paper exploits an equal grouping method (EGM) to generate the candidate feature subsets. Following the relevance ranking, the features are assigned to the groups one by one; when one round of assignment is finished, the next round assigns features to the groups in the reverse direction, and the assignment keeps alternating direction in this way until all features have been assigned to groups (a sketch of this snake-order assignment is given after Fig. 2).

For example, suppose the number of features is \(M\) = 20; then \(Fc\) = 4 and there are \(r\) = 5 feature subsets. The feature subset generation process is shown in Fig. 2.

Fig. 2

Generate feature subsets with equal grouping method
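A minimal sketch of the snake-order assignment as we read it from the description above (how a final, incomplete backward round is filled is our own assumption):

```python
def equal_grouping(ranked_features, r):
    """Deal relevance-ranked feature indices into r groups in snake order."""
    groups = [[] for _ in range(r)]
    forward = True
    for start in range(0, len(ranked_features), r):
        round_features = ranked_features[start:start + r]
        if forward:
            slots = range(len(round_features))                     # groups 0, 1, ..., r-1
        else:
            slots = range(r - 1, r - 1 - len(round_features), -1)  # groups r-1, r-2, ...
        for slot, feat in zip(slots, round_features):
            groups[slot].append(feat)
        forward = not forward                                      # reverse direction each round
    return groups
```

For the example above, `equal_grouping(list(range(20)), 5)` yields five groups of four features each, with the first five ranked features dealt left to right, the next five right to left, and so on.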

4.2 Generate ranking feature subsets sequence

The ranking sequence of the candidate feature subsets is obtained by calculating the contribution of each candidate feature subset. This is done in the following three steps.

  1. 1.

    Calculate the relevance of all individual features in each candidate feature subset according to Eq. 6, and accumulate them to obtain the contribution of the candidate feature subset. The candidate feature subset with the largest contribution is taken as the first selected feature subset \(F_{y}\);

  2. 2.

    Calculate the relevance \(D\left( {F_{x} ;\,c} \right)\) and redundancy \(R\left( {F_{x} ;\,F_{y} } \right)\) of the remaining candidate feature subsets according to Eqs. 10 and 12, respectively;

  3. 3.

    Combining Eq. 14 with the incremental search method, feature subsets are selected one by one, which yields the contribution ranking of all feature subsets (a sketch of this subset-level search is given after the equations below).

    $$D\left( {F_{x} ;\,c} \right) = \beta *I\left( {F_{x} ;\,c} \right) + \left( {1 - \beta } \right)*{\text{corr}}\left( {F_{x} ;\,c} \right)$$
    (10)
    $$I\left( {F_{x} ;\,c} \right) = \mathop \sum \limits_{i = 1}^{Fc} I\left( {f_{i} ;\,c} \right),{\text{corr}}\left( {F_{x} ;\,c} \right) = \mathop \sum \limits_{i = 1}^{Fc} {\text{corr}}\left( {f_{i} ;\,c} \right)$$
    (11)
    $$R\left( {F_{x} ;\,F_{y} } \right) = \frac{1}{m - 1}\mathop \sum \limits_{{F_{y} \in S_{m - 1} }} \left( {\beta *I\left( {F_{x} ;\,F_{y} } \right) + \left( {1 - \beta } \right)*{\text{corr}}\left( {F_{x} ;\,F_{y} } \right)} \right)$$
    (12)
    $$I\left( {F_{x} ;\,F_{y} } \right) = \mathop \sum \limits_{i,j = 1}^{Fc} I\left( {f_{i} ;\,f_{j} } \right), {\text{corr}}\left( {F_{x} ;\,F_{y} } \right) = \mathop \sum \limits_{i,j = 1}^{Fc} {\text{corr}}\left( {f_{i} ;\,f_{j} } \right)$$
    (13)
    $${\text{ImRMR}} = \mathop {\max }\limits_{{F_{x} \in X - S_{m - 1} }} \left[ {D\left( {F_{x} ;\,c} \right) - R\left( {F_{x} ;\,F_{y} } \right)} \right]$$
    (14)

    where \(\beta \in \left[ {0.1,1} \right]\), with a step size of 0.1.
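To make the three steps concrete, the following sketch ranks the candidate subsets with Eqs. 10–14. The names are ours: `subsets` is the list of index groups produced by EGM, and `mutual_information` and `combined_relevance` are the helpers sketched in Sects. 3 and 4.1.

```python
import numpy as np
from itertools import product

def rank_subsets(X, c, subsets, alpha, beta):
    """Order candidate feature subsets (lists of column indices) by Eq. 14."""
    def relevance(Fx):                              # D(Fx; c), Eq. 10 via Eq. 11
        return sum(beta * mutual_information(X[:, i], c)
                   + (1 - beta) * np.corrcoef(X[:, i], c)[0, 1] for i in Fx)

    def redundancy(Fx, Fy):                         # one term of Eq. 12 via Eq. 13
        return sum(beta * mutual_information(X[:, i], X[:, j])
                   + (1 - beta) * np.corrcoef(X[:, i], X[:, j])[0, 1]
                   for i, j in product(Fx, Fy))

    remaining = list(range(len(subsets)))
    # Step 1: the subset with the largest accumulated Eq. 6 relevance is selected first.
    first = max(remaining,
                key=lambda x: sum(combined_relevance(X[:, i], c, alpha) for i in subsets[x]))
    ranking = [first]
    remaining.remove(first)
    # Steps 2-3: incremental search with Eq. 14 over the remaining subsets.
    while remaining:
        def score(x):
            mean_red = np.mean([redundancy(subsets[x], subsets[y]) for y in ranking])
            return relevance(subsets[x]) - mean_red
        best = max(remaining, key=score)
        ranking.append(best)
        remaining.remove(best)
    return ranking                                  # subset indices in contribution order
```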

4.3 Acquisition of optimal feature subset

The features of the ranked feature-subset sequence are taken out in turn to form a ranked feature sequence. The optimal feature subset is then obtained from this ranked feature sequence with the SFS search strategy.

The feature selection algorithm based on ImRMR and SFS is described as follows.

figure a

In addition, for the ImRMR algorithm, the same \(\alpha\) with different \(\beta\) settings makes the sample set generate different feature subsets when the contribution of a feature subset is calculated according to Eq. 14. Therefore, different feature subsets \(b_{1} ,b_{2} , \ldots ,b_{m}\) can be obtained, where \(m\) is the number of \(\beta\) settings, and the feature subset \(b_{\max }\) with the highest accuracy is selected as the optimal feature subset for the current \(\alpha\).

Different \(\alpha\) settings make the sample set generate different optimal feature subsets \(CS_{1} ,CS_{2} , \ldots ,CS_{n}\), where \(n\) is the number of \(\alpha\) settings. The feature subset \(CS^{*}\) with the highest accuracy among these \(n\) feature subsets is selected as the final optimal feature subset.
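The outer search of this subsection can be summarized as the sketch below. It reflects one plausible reading of the SFS step (keep a feature only if it improves accuracy, stop once the accuracy of the growing subset reaches the raw-feature accuracy); `rank_for` and `evaluate` are caller-supplied placeholders, the former returning the ImRMR feature ranking for a given (α, β) pair (Sects. 4.1–4.2) and the latter returning random forest classification accuracy for a feature list.

```python
import numpy as np

def sfs_select(ranked_features, evaluate, baseline_acc):
    """Sequential forward search over a ranked feature sequence (one plausible reading)."""
    selected, best_acc = [], 0.0
    for f in ranked_features:                 # features in ImRMR ranking order
        acc = evaluate(selected + [f])
        if acc > best_acc:                    # keep the feature only if it helps
            selected, best_acc = selected + [f], acc
        if best_acc >= baseline_acc:          # stop once the raw-feature accuracy is reached
            break
    return selected

def imrmr_optimal_subset(rank_for, evaluate, baseline_acc,
                         grid=np.arange(0.1, 1.01, 0.1)):
    """Keep the subset CS* with the highest accuracy over the alpha/beta grid."""
    best_subset, best_acc = None, -np.inf
    for alpha in grid:
        for beta in grid:                     # yields b_1, ..., b_m for the current alpha
            subset = sfs_select(rank_for(alpha, beta), evaluate, baseline_acc)
            acc = evaluate(subset)
            if acc > best_acc:                # b_max per alpha, CS* overall
                best_subset, best_acc = subset, acc
    return best_subset
```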

5 Experiment and result analysis

5.1 Experimental environment

The hardware platform is a laptop with 16 GB of RAM and an 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30 GHz CPU. The operating system is Windows 10 64-bit. MATLAB 2018b and WEKA 3.8.5 are used as the data processing, programming and running platforms.

5.2 Experimental datasets

Seven datasets are used in the experiments. Five of them are provided by the UCI Machine Learning Repository [31]: the Musk dataset (Musk), Urban land cover dataset (Urban), Glass dataset (Glass), Libras Movement dataset (Movement) and Ionosphere dataset (Ionosphere). One is the commonly used hyperspectral Pavia University dataset (PU) [32], and the other is a crane song dataset (Crane).

Each dataset is divided into a training set and a test set at a ratio of 7:3. The information of the datasets is listed in Table 1.

Table 1 The information of datasets

5.3 The design of experiments

Two groups of experiments are designed. One group compares the proposed EGM with random feature grouping for generating candidate feature subsets. The other compares ImRMR with other feature selection methods, namely mRMR, InfoGain (IG) [33], Symmetrical Uncert (SU) [34], GainRatio (GR) [35] and ReliefF (RfF) [36].

The four feature selection methods IG, SU, GR and RfF and the random forest classifier used in the experiments are all built-in methods of WEKA 3.8.5, invoked from MATLAB 2018b through the WEKA 3.8.5 interface, and each feature selection method uses its default parameters. The mRMR method has no parameter settings, and \(\alpha\) and \(\beta\) in ImRMR are set in \(\left[ {0.1,1} \right]\) with a step size of 0.1.

Random forest is an ensemble learning method based on bagging; it handles both classification and regression problems well and is one of the most widely used machine learning methods at present. To verify the effectiveness of the proposed ImRMR feature selection method, the random forest classifier is adopted to obtain the classification results of the selected optimal feature subset. Each experiment is repeated 30 times independently, and the average of the results is taken as the final result. The evaluation indicators include accuracy rate, dimensionality reduction rate, comprehensive rate, precision, recall and F-measure.

The accuracy rate is the proportion of correctly identified samples among all samples in the prediction results. Its calculation is shown in Eq. 15.

$${\text{Acc}} = \frac{{P_{a} }}{N}*100\%$$
(15)

where \(P_{a}\) is the number of the correctly classified samples, and \(N\) represents the number of all samples.

The dimensionality reduction rate is introduced as an evaluation indicator, as shown in Eq. 16.

$$Dr = \left( {1 - \frac{Sc}{{Oc}}} \right)*100\%$$
(16)

where \(Sc\) is the number of selected features, and \(Oc\) represents the number of raw features. The larger the \(Dr\) value, the stronger the ability to reduce dimensions.

The comprehensive rate combines the accuracy rate and the dimensionality reduction rate, as shown in Eq. 17.

$$Z = \theta *{\text{Acc}} + \left( {1 - \theta } \right)*Dr$$
(17)

where \(Z\) is the comprehensive rate, \({\text{Acc}}\) is the accuracy rate, \(Dr\) is the dimensionality reduction rate, and \(\theta\) is the tradeoff factor. In the experiments, \(\theta\) is set to 0.5.

Precision is the proportion of samples predicted as positive that are actually positive. Its calculation is shown in Eq. 18.

$${\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(18)

where TP is the number of true positives (samples correctly predicted as positive) and FP is the number of false positives (samples predicted as positive that are actually negative).

Recall is the proportion of actually positive samples that are correctly predicted as positive. Its calculation is shown in Eq. 19.

$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(19)

where TP is, as above, the number of true positives and FN is the number of false negatives (actually positive samples predicted as negative).

F-measure is the weighted harmonic mean of precision and recall. Its calculation is shown in Eq. 20.

$$F{\text{-measure}} = 2 \times \frac{{{\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}$$
(20)
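A small sketch that computes the indicators of Eqs. 15–20 from a prediction run; it assumes a binary task with positive label 1 for precision, recall and F-measure (the multi-class averaging actually reported by WEKA is not reproduced here):

```python
import numpy as np

def evaluation_indicators(y_true, y_pred, n_selected, n_raw, theta=0.5):
    """Acc (Eq. 15), Dr (Eq. 16), Z (Eq. 17) and precision/recall/F-measure (Eqs. 18-20)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_true == y_pred) * 100                    # Eq. 15
    dr = (1 - n_selected / n_raw) * 100                      # Eq. 16
    z = theta * acc + (1 - theta) * dr                       # Eq. 17, theta = 0.5 in the paper
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true != 1))
    fn = np.sum((y_pred != 1) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0           # Eq. 18
    recall = tp / (tp + fn) if tp + fn else 0.0              # Eq. 19
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)             # Eq. 20
    return acc, dr, z, precision, recall, f_measure
```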

5.4 Result analysis

5.4.1 Comparison of EGM with random grouping

In this group of experiments, EGM and random grouping are applied, respectively, to the candidate-feature-subset generation step of ImRMR, giving ImRMR-EGM and ImRMR-RS. Random grouping differs from EGM only in that the features are assigned to groups randomly; all other parts are identical to EGM.

Using the SFS search strategy, the dimensionality reduction rate, accuracy rate, precision, recall, F-measure and comprehensive rate are recorded at the point where the classification accuracy obtained by ImRMR-EGM and ImRMR-RS on each of the seven datasets reaches its maximum, and the performance of the two methods is compared. The experimental results are shown in Table 2 and Fig. 3.

Table 2 Experimental results of ImRMR-EGM and ImRMR-RS
Fig. 3

Performance of ImRMR-EGM and ImRMR-RS

As seen from Table 2 and Fig. 3, ImRMR-EGM performs well on the seven datasets. The dimensionality reduction rate reaches 55.56–90.96%, and the classification accuracy is 2.34–18.75% higher than that obtained with the raw features. ImRMR-EGM not only effectively reduces the dimension of the datasets but also improves the classification accuracy.

On the seven datasets, the comprehensive rate of ImRMR-EGM is 0.14–5.42% higher than that of ImRMR-RS, and in most cases the accuracy, precision, recall and F-measure of ImRMR-EGM are higher than those of ImRMR-RS. Therefore, EGM achieves good results in generating candidate feature subsets and outperforms random grouping.

5.4.2 Comparison of ImRMR with other feature selection methods

Experiments with ImRMR and the other five feature selection methods are conducted on the seven datasets. Combined with the SFS search strategy, the features selected by each method are classified by random forest. Their performance is compared by the dimensionality reduction rate, accuracy rate, precision, recall, F-measure and comprehensive rate recorded when the classification accuracy reaches its maximum.

The experimental results are shown in Table 3 and Fig. 4. The abscissa labels 1, 2, 3, 4, 5, 6 and 7 in Fig. 4 represent the raw features and the six feature selection methods ImRMR, mRMR, IG, SU, GR and RfF, respectively.

Table 3 Experimental results of ImRMR and other five methods
Fig. 4

Performance of ImRMR with other methods

As seen from Table 3 and Fig. 4, ImRMR outperforms the five methods (mRMR, IG, SU, GR and RfF) on the seven datasets Musk, Urban, Ionosphere, Glass, PU, Movement and Crane. The comprehensive rate of ImRMR is 1.27–3.82%, 6.34–24.48%, 3.8–6.94%, 4.75–20.36% and 0.83–6% higher than the five methods on Urban, Glass, Movement, PU and Crane, respectively. On the Musk dataset, the comprehensive rate of ImRMR is only 0.6% lower than that of mRMR and 2.67–4.18% higher than the other four methods; on the Ionosphere dataset, it is only 0.99% and 0.04% lower than mRMR and SU, respectively, and 1.95–6.36% higher than the other three methods. The accuracy of ImRMR on six datasets (Urban, Ionosphere, Glass, PU, Movement and Crane) is higher than that of the five feature selection methods, and on the Musk dataset it equals mRMR and exceeds the other four methods. In addition, in most cases ImRMR is superior to the other five methods in recall, F-measure and precision on the seven datasets. Therefore, in a comprehensive comparison, the ImRMR feature selection method is superior to the five feature selection methods.

The above comparative experiments show that the proposed ImRMR method can effectively remove irrelevant and redundant features, which not only reduces the dimension of the sample features but also achieves better classification and recognition results.

5.4.3 Comparison with state-of-the-art methods

Recently, various researchers have proposed feature selection methods with improved accuracy. Table 4 compares the proposed method with other methods on the same datasets. It can be seen that the proposed method achieves higher accuracy (Acc) and comprehensive rate (Z) in feature selection than the other methods on the same datasets.

Table 4 Comparative analysis with other methods

5.5 Discussion of using SFSFs for ranking feature subsets sequence

To further explore the effectiveness of the feature subsets selected by ImRMR, the sequential forward selection of feature subsets (SFSFs) is applied to the ranked feature-subset sequence to verify the quality of the feature subsets.

The steps of SFSFs are as follows.

  1. 1.

    The optimal feature subset starts with an empty set;

  2. 2.

    The first feature subset in the ranked feature-subset sequence is added first; subsequent feature subsets are then added iteratively in ranking order, each being merged with the already selected feature subsets to form a new feature subset;

  3. 3.

    This process continues until the classification accuracy on the newly mapped training set is greater than or equal to the raw classification accuracy; the corresponding feature subset is the selected optimal feature subset.

The feature selection method based on SFSFs is shown in Fig. 5 (a minimal sketch is given after Fig. 5). The SFSFs search strategy is combined to obtain the final preferred subset, and the optimal feature subset is mapped back to the raw sample set for classification and recognition. The effects of ImRMR and the other five feature selection methods are compared by the dimensionality reduction rate, accuracy, comprehensive rate, recall, F-measure and precision when the classification accuracy with feature selection is greater than or equal to that of the raw datasets.

Fig. 5

Feature selection method based on SFSFs
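A minimal sketch of the SFSFs steps under our own naming, where `ranked_subsets` is the ImRMR-ranked feature-subset sequence, `evaluate` is a caller-supplied function mapping a feature list to classification accuracy (e.g. a random forest on the training split), and `raw_acc` is the accuracy obtained with all raw features:

```python
def sfsfs_select(ranked_subsets, evaluate, raw_acc):
    """Sequential forward selection over whole feature subsets (steps 1-3 above)."""
    selected = []                               # step 1: start from the empty set
    for subset in ranked_subsets:               # step 2: add subsets in ranked order
        selected = selected + list(subset)      # merge with the already selected features
        if evaluate(selected) >= raw_acc:       # step 3: stop once the raw accuracy is matched
            break
    return selected
```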

The experimental results are shown in Table 5 and Fig. 6. The abscissa labels 1, 2, 3, 4, 5, 6 and 7 in Fig. 6 represent the raw features and the six feature selection methods ImRMR, mRMR, IG, SU, GR and RfF, respectively.

Table 5 Experimental results of ImRMR and other five methods based SFSFs
Fig. 6

Comparison of the effects of six methods

From Table 5 and Fig. 6, the performance of ImRMR based on SFSFs is better than that of the five feature selection methods (mRMR, IG, SU, GR and RfF) on the seven datasets. The comprehensive rate of ImRMR is 5.32–26.46%, 15.64–24.39%, 21.57–31.02%, 9.39–30.21% and 11.33–29.33% higher than the five feature selection methods on Musk, Urban, Movement, PU and Crane, respectively. The comprehensive rate of ImRMR on Ionosphere is only 0.52% lower than that of mRMR and 0.48–23.01% higher than the other four methods; on Glass it is 3.3% and 8.08% lower than that of RfF and mRMR, respectively, and 1.47–12.58% higher than the other three methods. The accuracy of ImRMR on the six datasets Musk, Urban, Ionosphere, Glass, Movement and PU is higher than that of the five feature selection methods, and on the Crane dataset it equals the five methods. Furthermore, ImRMR outperforms the other five feature selection methods in recall, F-measure and precision on the seven datasets in most cases. Therefore, in a comprehensive comparison, ImRMR with SFSFs is superior to the five feature selection methods (mRMR, IG, SU, GR and RfF) and can effectively measure the contribution of feature subsets to obtain an effective ranked feature-subset sequence.

6 Limitations and future scope

This paper improves mRMR on the basis of feature subsets and proposes the ImRMR method, which is used to select efficient feature sets, reduce the dimension of the feature set and improve the classification performance on the samples. Several issues remain for future work, the most important of which are listed below:

  • Extend the method to more datasets.

  • This study proposes EGM to divide the candidate feature subsets. Other feature subset division methods need to be explored to improve the quality of the candidate feature subsets.

  • This study only uses the Pearson correlation coefficient and mutual information to measure correlation and redundancy. In future work, we will explore other measurement criteria to improve the effectiveness of the selected features.

7 Conclusion

In this paper, considering the joint contribution of multiple features, the proposed ImRMR method extends the feature selection process to the level of feature subsets.

EGM is used to divide the raw feature set into multiple candidate feature subsets. Two criteria, the Pearson correlation coefficient and mutual information, are used to calculate the correlation and redundancy of the candidate feature subsets, and a weight factor is introduced to trade off the two criteria. The SFS search strategy is then applied to the ranked feature-subset sequence to obtain the optimal feature subset. Compared with five methods, including mRMR, InfoGain, Symmetrical Uncert, GainRatio and ReliefF, the experimental results on seven datasets verify the effectiveness of ImRMR. In most cases, ImRMR outperforms the other methods and can effectively obtain the optimal feature subset and improve classification performance.