1 Introduction

The classification of data is an indispensable subject in machine learning and data mining [1]. It serves as a pervasive technique for discerning and categorizing different system states and therefore finds extensive applications in many fields. As a pre-processing technique, feature selection (FS) is widely used to eliminate redundant and irrelevant features from the original dataset and thereby improve the efficiency and accuracy of the classification model [2]. However, the number of measurements used to describe the state of a system has grown dramatically with the development of information technology. High-dimensional datasets tend to contain more redundant and irrelevant features [3], which places even greater pressure on FS to filter out the best feature subset. For FS on high-dimensional datasets it is therefore important not only to ensure classification accuracy but also to minimize the number of selected features so as to enhance model efficiency. This creates an urgent need for an efficient method for solving multi-objective FS (MOFS) on high-dimensional datasets.

FS can be regarded as a combinatorial optimization problem [4]. The exhaustive method can enumerate all possible combinations of features and determine the subset that optimizes the evaluation criteria. Nevertheless, the solution space for a set of \(n\) features contains \({2}^{n}-1\) candidate subsets, and the computational time required by the exhaustive method grows exponentially with the number of feature dimensions [5]. Thus, despite its simplicity, the exhaustive method is difficult to apply in practice [6]. To address this problem, many heuristic algorithms have been proposed for FS. Among them, sequential forward selection (SFS) and sequential backward selection (SBS) [7] are two prominent approaches. They start from an empty set or the complete set, respectively, and then iteratively add or remove features to improve the feature subset. Although they improve search efficiency relative to exhaustive search, a limitation of both SFS and SBS is that once a feature has been included or excluded, the decision cannot be reconsidered, whereas some features are functionally interconnected and exhibit a correlated coupling effect. To address this drawback, the plus-l minus-r selection method [8] was introduced. It alleviates the aforementioned problem of SFS and SBS, but the numbers of added and removed features are fixed. Building on this, sequential floating forward selection [9] and sequential floating backward selection [10] were designed, in which the numbers of added and removed features can change at each step. Although heuristic methods significantly reduce the time complexity compared with exhaustive search, their fixed search strategies make them susceptible to local optima. Thus, intelligent algorithms (IAs) have received increasing attention in recent years for their ability to achieve near-optimal or optimal solutions within a reasonable time [11, 12].

However, IAs are easily trapped in suboptimal solutions when handling FS on high-dimensional datasets [13, 14]. Recently, many search strategies have been designed to improve the search efficiency of meta-heuristic algorithms. Based on the Pearson correlation coefficient, Kabir et al. [15] developed a local search operation that relied on the distinct and informative nature of input features, as computed through correlation information. Moradi and Gholampour [16] introduced a local search strategy that selected distinct features by considering their correlation information. However, the Pearson correlation coefficient can only measure linear relationships between variables, not non-linear relationships, and it is sensitive to outliers. To overcome this shortcoming, mutual information has been used to measure the correlation between variables [17]. Vinh et al. [18] introduced a correlation-redundancy metric (CRM) to characterize each feature, which combines the correlation between a feature and the labels with the redundancy between pairs of features. Furthermore, Wang et al. [19, 20] designed a "relevance-redundancy index" based on the CRM for evaluating feature importance in an ant colony optimization approach to classification. Song et al. [21] proposed supplementary and deletion operators to enhance the local search ability. However, the CRM only scores a feature highly when it has a high correlation with the class labels and a low correlation with the other features. When both the correlation and the redundancy of a feature are high, its score approaches 0, so highly correlated features may be discarded. Thus, a novel evaluation function is designed here to measure the correlation-redundancy (CR) of each feature. Based on this, a novel local search strategy is proposed that considers both the search depth and the CR of the features simultaneously.

Additionally, population initialization significantly impacts the performance of IAs. Most existing studies use traditional random initialization (TRI) [1, 4, 16, 22], which may lower the quality of the initial population and slow down the overall convergence rate [23]. To handle this issue, Xu et al. [24] devised a split initialization approach to enhance the diversity of the initial population and thereby broaden the exploration process. Xue et al. [25] introduced an interval initialization technique that constrained the number of selected features. Xue et al. [26] devised three warm initialization strategies, namely small, large, and mixed initialization. The above papers focus on controlling the number of selected features to enhance diversity, disregarding the importance of the selected features themselves, such as the correlation and redundancy of each feature. To overcome this problem, Deniz et al. [27] employed an information gain technique to evaluate the significance of each feature and initialized the population based on the obtained scores. Song et al. [21] developed a swarm initialization strategy guided by label correlation. Mohsen et al. [28] employed cosine similarity in their pheromone initialization approach. Li and Ren [29] designed a swarm initialization method based on the maximal information coefficient between features and labels. Although these studies considered the correlation between features and labels, the redundancy between features is ignored. Thus, this paper introduces a novel initialization strategy that uses a new evaluation function to assess the CR of each feature and determines the probability of selecting a feature from it. This provides an efficient way to obtain a high-quality initial population and reduce the solution space, thereby expediting the convergence speed.

To sum up, this study proposes a correlation-redundancy guided evolutionary algorithm, named CRGEA, to solve high-dimensional MOFS. A new correlation-redundancy assessment method (CRAM) is designed for measuring the CR of each feature, which guides the entire evolutionary process. In CRGEA, a novel initialization strategy combined with a multiple threshold selection mechanism (ISMTS) is developed to produce a high-quality initial population. A local acceleration evolution strategy (LAES) based on a parallel simulated annealing algorithm and a pruning method is developed to improve the local search ability of the CRGEA. The contributions of this paper can be summarized as follows:

(1) A new correlation-redundancy assessment method is designed for measuring the CR of each feature, paying more attention to features with high relevance and low redundancy in high-dimensional datasets to speed up the entire evolutionary process;

(2) A correlation-redundancy guided evolutionary algorithm is designed for solving MOFS of high-dimensional datasets. In CRGEA, CRAM first evaluates the CR of each feature and produces a high-quality initial population. LAES then performs deep searches combining the annealing stage around the best solutions to improve the local search ability of the CRGEA, which helps individuals escape local optima and obtain the best feature subset;

(3) Numerical experiments are conducted based on 16 high-dimensional datasets, where nine datasets contain more than 1000 dimensions, three datasets have over 6000 dimensions, and two datasets have over 10,000 dimensions. The experiment results show that CRGEA can efficiently reduce redundant features while ensuring high accuracy compared to other state-of-the-art intelligent algorithms.

The remainder of this paper is structured as follows: Sect. 2 provides related work on MOFS, mutual information, the K-Nearest Neighbor algorithm, etc. Section 3 presents a comprehensive explanation of the designed CRGEA. Section 4 describes the results obtained on 16 publicly available high-dimensional datasets. In Sect. 5, a summary and discussion are provided based on the outcomes of the three numerical experiments. Finally, Sect. 6 concludes with an overview of the findings and future research directions.

2 Related Work

2.1 Multi-objective Feature Selection for High-Dimensional Datasets

In most existing studies, FS has been regarded as a single-objective problem aimed at improving classification accuracy [21, 30,31,32,33,34]. However, the resulting feature subset often contains redundant features, which reduces algorithm efficiency and classification accuracy. Therefore, some researchers also include the number of features in FS to remove redundant and irrelevant features [35,36,37,38,39,40,41,42]. In many of these studies, the average classification error and the number of features are combined into a single goal by setting a weight. Choosing the weight for the two targets requires expert experience and may differ across datasets, which makes this approach hard to put into practice. In addition, early FS research focused on small and medium-sized datasets. With the advent of the information era, more and more research has begun to focus on FS for high-dimensional datasets. However, most of the literature on high-dimensional datasets focuses on improvements to search strategies [13, 22, 23, 25, 43], single-objective formulations [44,45,46], and grouping features by the relevance between them [47, 48], ignoring the characteristics of the dataset itself, such as the correlation and redundancy of each feature. This can degrade the algorithm's search effectiveness. Thus, this study investigates MOFS for high-dimensional datasets considering the correlation-redundancy of each feature, minimizing the classification error and the number of features simultaneously. The MOFS problem can be described as follows:

Let the input be the normalized dataset \(X\in {R}^{n\times D}\), where \(n\) denotes the number of samples in \(X\) and \(D\) denotes its dimension. The feature set \(S=\left\{{f}_{1},{f}_{2},\dots ,{f}_{D}\right\}\) contains the \(D\) features of \(X\). The aim of MOFS is to optimize a function vector \(f\left(x\right)=\left\{{f}_{1}\left(x\right),{f}_{2}\left(x\right),\dots ,{f}_{m}\left(x\right)\right\}\), where \(m\) represents the number of objectives to be optimized.

2.2 Mutual Information

Mutual information (MI) can evaluate the relationship between two discrete variables [49] and is widely used to evaluate the relevance between features and labels as well as the redundancy between features [50,51,52]. However, in the above research the number of features has to be specified in advance, so the optimal number of features and the classification accuracy cannot be obtained automatically. Thus, in this paper MI is used within an intelligent algorithm to optimize classification accuracy and the number of features simultaneously, which offers end-to-end convenience. MI is defined in Eq. (1):

$$I\left(x,y\right)=\sum_{y}\sum_{x}p\left(x,y\right)lg\frac{p\left(x,y\right)}{p\left(x\right)p(y)}$$
(1)

where x and y are two random variables. The marginal probabilities of x and y are denoted as \(p(x)\) and \(p(y)\) respectively, while \(p(x,y)\) represents their joint probability. Building upon this foundation, Vinh et al. [18] introduced a CR metric to characterize the correlation and redundancy of each feature. CR combines the correlation between a feature and the sample labels with the redundancy between two features. For the \(i\)-th feature within a feature subset, its CR value is defined by Eq. (2).

$${\varphi }\left({x}_{i}\right)=I\left({x}_{i},c\right)-\frac{1}{|D|}\sum_{{x}_{l}\in L}I\left({x}_{i},{x}_{l}\right)$$
(2)

where \({\varphi }\left({x}_{i}\right)\) represents the CR of feature \({x}_{i}\), \(c\) denotes the class label vector, \(L\) is the current feature subset, and \(|D|\) is its size.
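For concreteness, a minimal Python sketch of how Eq. (1) and the CR metric of Eq. (2) can be evaluated for discrete feature vectors is given below. It is only an illustration (the experiments in Sect. 4 are implemented in MATLAB); the function names and the choice of log base 2 are ours.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information I(x, y) of Eq. (1) for two discrete vectors
    (log base 2 is an implementation choice; the paper writes lg)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))   # joint probability
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

def crm(X, c, i):
    """CR value of feature i in the subset X (Eq. (2)): relevance to the
    label vector c minus the averaged redundancy with the other features."""
    D = X.shape[1]
    relevance = mutual_information(X[:, i], c)
    redundancy = sum(mutual_information(X[:, i], X[:, l])
                     for l in range(D) if l != i) / D
    return relevance - redundancy
```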

2.3 Non-dominated Sorting Genetic Algorithm II

The non-dominated sorting genetic algorithm II (NSGA-II) [53] employs the concept of Pareto optimality, which allows multiple objectives to be optimized simultaneously and balanced against one another [54]. It has been successfully used in many research fields [55,56,57,58]. NSGA-II combines non-dominated ranking with crowding distance and provides a set of non-dominated solutions (NDSs). Solutions are compared based on their non-dominance rank and crowding distance: if two solutions belong to different fronts, the one with the lower non-dominated rank is preferred; otherwise, the one with the larger crowding distance is preferred.

2.4 Simulated Annealing

The simulated annealing algorithm (SA) [59] accepts an inferior solution with a certain probability and finds better solutions through multiple cooling cycles, which gives it a strong local search capability and helps individuals escape local optima [60]. Equation (3) is widely applied in single-objective problems to determine the probability of accepting a new solution, but it is not suitable for multi-objective problems. Thus, the probability in Eq. (4) is adopted in this paper [61].

$${P}_{SA}=\left\{\begin{array}{l}1, {f}^{new}<{f}^{cur} \\ {{\exp}}\left(-\frac{{f}^{new}-{f}^{cur}}{T}\right), {f}^{new}\ge {f}^{cur}\end{array}\right.$$
(3)
$${P}_{SA}=\left\{\begin{array}{l}1, \exists m, {f}_{m}^{new}<{f}_{m}^{cur} \\ {\prod }_{m=1}^{M}{{\exp}}(-\frac{{f}_{m}^{new}-{f}_{m}^{cur}}{T}), \exists m, {f}_{m}^{new}\ge {f}_{m}^{cur}\end{array}\right.$$
(4)

where \(m\in \{1, 2, \dots, M\}\), and \({f}_{m}^{new}\) and \({f}_{m}^{cur}\) are the \(m\)-th objective function values of the new solution \({X}_{new}\) and the current solution \({X}_{cur}\), respectively.
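As an illustration of the acceptance rule in Eq. (4) (not the authors' implementation; objectives are assumed to be minimized and the function names are ours):

```python
import math
import random

def accept_probability(f_new, f_cur, T):
    """Acceptance probability of Eq. (4) for minimised objective vectors."""
    if any(n < c for n, c in zip(f_new, f_cur)):
        return 1.0  # the new solution improves at least one objective
    # otherwise: product of per-objective Boltzmann factors
    return math.prod(math.exp(-(n - c) / T) for n, c in zip(f_new, f_cur))

def accept(f_new, f_cur, T):
    """Accept the new solution with the probability above."""
    return random.random() < accept_probability(f_new, f_cur, T)
```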

2.5 K-Nearest Neighbor Algorithm

The K-Nearest Neighbor algorithm (KNN) is a simple classifier that can produce competitive and easily interpretable results [33]. It is widely recognized and extensively employed in the field of pattern recognition for its effectiveness in both classification and regression tasks [42]. In the classification phase, the distance between a new test sample and all training samples is computed according to Eq. (5), and the majority label of the \(k\) training samples closest to the unlabeled sample is then assigned to it.

$${\text{d}}\left(X,Z\right)={\left(\sum_{i=1}^{D}{\left({x}_{i}-{z}_{i}\right)}^{2}\right)}^{1/2}$$
(5)

where \(X\) and \(Z\) are D-dimensional vectors in the feature space, and \({x}_{i}\) and \({z}_{i}\) are their \(i\)-th components. In addition, cross-validation [62] is used to assess the classification accuracy of KNN, which improves the generalizability of the evaluation (Fig. 1).

Fig. 1 The flowchart of the designed CRGEA

3 Proposed CRGEA for Feature Selection

This section presents the details of CRGEA for solving the MOFS. The overall flowchart of the CRGEA is shown in Fig. 1, and Algorithm 1 describes its main procedures.

Fig. 2 The flowchart of the ISMTS

As shown in Fig. 1, a novel initialization strategy combined with a multiple threshold selection mechanism is first developed to produce a high-quality initial population. Then, the multi-point crossover operator and the hybrid mutation strategy are applied to search in multiple directions and improve the search ability, and the potential NDSs are obtained from these genetic operations. Next, a local acceleration evolution strategy is applied to enhance the quality of the NDSs and of the overall population for the next generation, which further improves the local search ability of the CRGEA. Superior individuals are then selected to participate in the next generation of evolution. The genetic operations and the local search strategy are performed iteratively until the stopping criterion is met, at which point the algorithm stops and the best solutions are reported. Additional details of the CRGEA are described in the following subsections.

3.1 Binary Encoding Strategy and Decoding

The aim of FS is to choose an appropriate subset from the pool of available features so that the optimization objectives are optimal. Each feature can be in one of two states: selected or unselected. We therefore use a binary (0-1) encoding strategy: 1 indicates that the corresponding feature is selected, whereas 0 denotes its exclusion. For example, if an individual is encoded as the vector 01001000, the total number of features is 8 and the second and fifth features are selected. The objectives are then calculated using a KNN classifier with K-fold cross-validation. The first objective, the average classification error, is given by Eq. (6).

$${{\min}}\,{f}_{1}\left(x\right)=\frac{1}{K}\sum_{k=1}^{K}\frac{{Nerr}_{k}}{{Nval}_{k}}$$
(6)

where \(k\) indexes the fold in the \(K\)-fold cross-validation and \(x\) is a feasible solution. \({Nerr}_{k}\) and \({Nval}_{k}\) are the number of misclassified samples and the total number of validation samples in the \(k\)-th fold, respectively. The second objective, the number of selected features, is given by Eq. (7).

$${{\min}}{f}_{2}\left(x\right) = \sum_{i=1}^{D}{x}_{i}$$
(7)

where \({x}_{i}\) indicates the value at the \(i\)-th position of the solution \(x\), and \(D\) represents the total number of features.
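The following Python sketch illustrates how the two objectives can be evaluated for a 0-1 encoded individual; it is only an illustration of Eqs. (6) and (7), here using scikit-learn's KNN classifier and 5-fold cross-validation rather than the authors' MATLAB code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate(individual, X, y, n_neighbors=5, n_folds=5):
    """Return (f1, f2) of Eqs. (6)-(7) for a 0/1-encoded individual.
    X: (n_samples, D) feature matrix, y: labels."""
    mask = np.asarray(individual, dtype=bool)
    f2 = int(mask.sum())
    if f2 == 0:                       # no feature selected: worst possible error
        return 1.0, 0
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    acc = cross_val_score(knn, X[:, mask], y, cv=n_folds, scoring='accuracy')
    return 1.0 - acc.mean(), f2       # average error over the K folds, subset size

# e.g. f1, f2 = evaluate([0, 1, 0, 0, 1, 0, 0, 0], X, y)
```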

3.2 Initialization Strategy

The quality of the initial population strongly influences the convergence speed of a meta-heuristic algorithm. A novel initialization strategy combined with a multiple threshold selection mechanism (ISMTS) is therefore developed to produce a high-quality initial population. Firstly, CRAM is designed to assess the CR of each feature. This strategy overcomes the shortcoming of evaluating the CR of each feature with Eq. (2): when both the correlation and the redundancy of a feature are high, its score approaches 0 and such features are discarded. Moreover, a multiple-threshold selection strategy is proposed to restrict the number of selected features and improve the distribution of the population in the objective space. This ensures that the initial solutions maintain a high level of quality and diversity. The threshold vector is set to \(T=[0.50, 0.65, 0.80]\), and \({T}_{m}\) denotes the \(m\)-th value in \(T\). The following definitions describe the details of the designed ISMTS:

Definition 1

(\({C}_{iy}\)) Let \(y\) be the label vector associated with feature \({x}_{i}\) \((i=1, 2, \dots, D)\), where \(D\) is the number of features. The correlation between \({x}_{i}\) and \(y\), denoted \({C}_{iy}\), is calculated by Eq. (8) as follows:

$${C}_{iy}=I\left({x}_{i},y\right) = \sum_{y}\sum_{{x}_{i}}p\left({x}_{i},y\right)lg\frac{p\left({x}_{i},y\right)}{p\left({x}_{i}\right)p(y)}$$
(8)

where \({x}_{i}\) and \(y\) are two random variables. The marginal probabilities of \({x}_{i}\) and \(y\) are denoted as \(p({x}_{i})\) and \(p(y)\) respectively, while \(p({x}_{i},y)\) represents their joint probability. The importance of the feature to the class labels decreases as the value of \({C}_{iy}\) decreases. In addition to retaining the most relevant features, redundant features should be removed where possible. In this paper, \({R}_{i}\) is used to assess the redundancy of feature \(i\) with respect to the other features.

Definition 2

(\({R}_{i}\)) Let \({x}_{i}\) and \({x}_{j}\) \((i, j=1, 2, \dots, D)\) be the sample vectors of features \(i\) and \(j\). The redundancy of feature \(i\), denoted \({R}_{i}\), is defined as follows:

$${R}_{i}=\frac{1}{|D|}\sum_{j=1,\, j\ne i}^{D}I\left({x}_{i},{x}_{j}\right),\quad i=1, 2, \dots, D$$
(9)

Definition 3

(\({CR}_{i}\)) A new correlation-redundancy assessment method is designed to describe the CR of feature \(i\), denoted \({CR}_{i}\), which is calculated by:

$${CR}_{i}={C}_{iy}-\alpha \frac{{\min}\left({C}_{iy}, {R}_{i}\right)}{{\max}\left({C}_{iy}, {R}_{i}\right)},\quad i=1, 2, \dots, D$$
(10)

where \(\alpha\) (\(\alpha =0.3\) in this paper) is a moderator used to adjust the possibility of a feature being selected when both its correlation and its redundancy are high. As shown in Eq. (10), the measure prefers features with high correlation and low redundancy, and it reduces, but does not eliminate, the score of a feature whose correlation and redundancy are both high. Based on this idea, we normalize \({CR}_{i}\) and define \({P}_{i}\) as the probability that a feature is selected.

Definition 4

(\({P}_{i}\)) The probability that feature \(i\) is selected, denoted \({P}_{i}\), is calculated by:

$${P}_{i}=\frac{{CR}_{i}-\underset{j=1,\dots ,D}{{\min}}\left({CR}_{j}\right)}{\underset{j=1,\dots ,D}{{\max}}\left({CR}_{j}\right)-\underset{j=1,\dots ,D}{{\min}}\left({CR}_{j}\right)},\quad i=1, 2, \dots, D$$
(11)
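A minimal Python sketch of the CRAM scores in Eqs. (8)-(11) is given below for illustration; `mi` is assumed to be a discrete mutual-information function such as the one sketched in Sect. 2.2, and `alpha` corresponds to \(\alpha =0.3\). The function and parameter names are ours.

```python
import numpy as np

def cram_probabilities(X, y, mi, alpha=0.3):
    """Selection probability P_i of each feature (Eqs. (8)-(11), illustrative).
    X: (n_samples, D) discrete feature matrix; y: label vector;
    mi: a mutual-information function such as the one sketched in Sect. 2.2."""
    D = X.shape[1]
    C = np.array([mi(X[:, i], y) for i in range(D)])                         # Eq. (8)
    R = np.array([sum(mi(X[:, i], X[:, j]) for j in range(D) if j != i) / D
                  for i in range(D)])                                        # Eq. (9)
    CR = C - alpha * np.minimum(C, R) / np.maximum(np.maximum(C, R), 1e-12)  # Eq. (10)
    span = CR.max() - CR.min()
    return (CR - CR.min()) / span if span > 0 else np.ones(D)                # Eq. (11)
```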

Whether feature \(i\) is selected is decided by comparing \({P}_{i}\) with the threshold \({T}_{m}\) in \(T\). The proposed ISMTS proceeds as follows, and the corresponding flowchart is shown in Fig. 2.

Step 1: Parameter initialization: initial population \(IP=\varnothing\), population size \(N\), number of features \(D\), and the selection probability \({P}_{i}\) of each feature \(i\);

Step 2: Set \(n=1\), the index of individual \({x}_{n}\), and \(m=1\), the index into the threshold vector \(T\);

Step 3: Create individual \({x}_{n}=\left({x}_{n1},\dots ,{x}_{nj},\dots ,{x}_{nD}\right)\) and set the feature index \(j=1\);

Step 4: If \({P}_{j}>{T}_{m}\), set \({x}_{nj}=1\); otherwise set \({x}_{nj}=0\);

Step 5: If \(j\le D\), set \(j=j+1\) and go to Step 4; else if \(n\le \left(\left(0.75N\right)/3\right)m\), add \({x}_{n}\) to \(IP\), set \(n=n+1\) and \(j=1\), and go to Step 4; else if \(m\le 3\), set \(m=m+1\) and go to Step 4; otherwise set \(n=n+1\), \(j=1\), and go to Step 6.

Step 6: If \({P}_{j}>rand\), set \({x}_{nj}=1\); otherwise set \({x}_{nj}=0\);

Step 7: If \(j\le D\), set \(j=j+1\) and go to Step 6; else if \(n\le N\), add \({x}_{n}\) to \(IP\), set \(n=n+1\) and \(j=1\), and go to Step 6; otherwise output the obtained initial population \(IP\).
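The following Python sketch transcribes Steps 1-7 under our reading of the procedure (roughly 75% of the individuals are generated with the three thresholds and the rest by sampling against \({P}_{i}\)); it is illustrative only and the names are ours.

```python
import numpy as np

def ismts_init(P, N, thresholds=(0.50, 0.65, 0.80), rng=None):
    """Initial population built from the selection probabilities P (Eq. (11)).
    About 75% of the N individuals come from thresholding P against T
    (Steps 3-5); the rest are sampled feature-wise with probability P (Steps 6-7)."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.asarray(P)
    D, pop = len(P), []
    per_threshold = int(0.75 * N / len(thresholds))
    for t in thresholds:                         # threshold-guided individuals
        for _ in range(per_threshold):
            pop.append((P > t).astype(int))
    while len(pop) < N:                          # probability-guided individuals
        pop.append((rng.random(D) < P).astype(int))
    return np.array(pop)
```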

3.3 Multi-points Crossover Operator

The crossover operator performs a global search by exchanging the parents' good genes, which helps to improve the global search ability. The crossed individuals are compared with their parents using Pareto dominance, and the two best of these individuals are selected to enter the next evolution. Furthermore, an external file (EF) is designed to store solutions that are not dominated by the current solutions during the search; it is used in the selection operator to improve the overall population. Procedure 1 shows the detailed crossover process.

Procedure 1 The procedures of the proposed multi-points crossover operator

3.4 Hybrid Parallel Mutation Operator

The mutation operator improves the diversity of the population by changing the structure of existing chromosomes. Considering that the aim of FS is to obtain subsets with strong correlation and low redundancy, the proposed mutation operator uses two strategies that change the number of selected features, i.e., an addition strategy and a deletion strategy. This guides individuals to search in different directions and improves the local search ability. The novel evaluation function in Eq. (10) is adopted to decide which gene to add or delete in a selected individual. The mutated individual \(x\) is selected according to the mutation probability \({P}_{m}\). The set of selected features of \(x\) is \({S}_{s}\), and the set of unselected features is \({S}_{us}\). According to Eq. (11), \({P}_{i}\) is calculated for all unselected features in \({S}_{us}\). The addition strategy and deletion strategy are presented below:

Addition strategy: the unselected feature with the maximum \({P}_{i}\) is added to \({S}_{s}\) and its corresponding gene is set to 1, yielding the mutated individual \({x}_{1}\);

Deletion strategy: the selected feature with the smallest CR value is removed from \({S}_{s}\) and its corresponding gene is set to 0, yielding the mutated individual \({x}_{2}\).

As shown in Fig. 3, the designed mutation strategies change the number of selected features in a directed manner, which makes the solutions search in different directions and improves the local search ability. The details of the hybrid parallel mutation operator are stated in Procedure 2.
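For illustration, a possible Python sketch of the addition and deletion strategies is given below (not the authors' implementation; `P` and `CR` are the per-feature values of Eqs. (11) and (10), and the function name is ours):

```python
import numpy as np

def hybrid_mutation(x, P, CR):
    """Return the two offspring of the addition and deletion strategies.
    x: 0/1 individual; P, CR: per-feature values of Eqs. (11) and (10)."""
    x = np.asarray(x)
    selected, unselected = np.flatnonzero(x == 1), np.flatnonzero(x == 0)

    x1 = x.copy()                                     # addition strategy
    if unselected.size:
        x1[unselected[np.argmax(P[unselected])]] = 1  # add the most promising feature

    x2 = x.copy()                                     # deletion strategy
    if selected.size:
        x2[selected[np.argmin(CR[selected])]] = 0     # drop the least valuable feature
    return x1, x2
```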

Procedure 2 The procedures of the proposed hybrid parallel mutation operator

Fig. 3 The graphical representation of the two search strategies in the mutation operator

3.5 Local Acceleration Evolution Strategy

In this paper, a local acceleration evolution strategy (LAES) based on a parallel simulated annealing algorithm (PSA) with two search strategies, i.e., the addition strategy and the deletion strategy, is employed to enhance the quality of the NDSs derived from the genetic operations. This allows LAES to perform a deep parallel search combining the annealing stage around the NDSs, which helps individuals escape local optima. Unlike traditional SA, the parallel SA exploits the different characteristics of the addition and deletion strategies, so it can search in different directions and improve search efficiency. In addition, a pruning strategy is introduced into LAES to avoid invalid searches and further improve search efficiency: if the new solution obtained by a search strategy is dominated by its parent, the strategy is disabled in the next search phase; if the new solution dominates its parent, the strategy is retained in the next search phase, which indicates that this search direction reduces the classification error. The LAES is described in Procedure 3.
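The pruning idea can be sketched as follows (an illustrative reading, not the authors' implementation; `evaluate` is assumed to return the objective vector of a solution, each `move` is an addition or deletion strategy, and `accept` is a multi-objective SA acceptance test such as the one sketched in Sect. 2.4):

```python
def dominates(a, b):
    """Pareto dominance for minimisation: a dominates b."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def laes_step(parent, f_parent, strategies, evaluate, accept, T):
    """One annealing step around a non-dominated solution.
    strategies: dict name -> (move_function, enabled_flag); a strategy whose
    offspring is dominated by its parent is disabled for the next phase."""
    best, f_best = parent, f_parent
    for name, (move, enabled) in strategies.items():
        if not enabled:
            continue
        child = move(parent)
        f_child = evaluate(child)
        if dominates(f_parent, f_child):          # pruning: this direction is invalid
            strategies[name] = (move, False)
        elif dominates(f_child, f_parent) or accept(f_child, f_parent, T):
            best, f_best = child, f_child         # keep the improved/accepted child
    return best, f_best, strategies
```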

Procedure 3 The procedures of the proposed LAES

3.6 Selection Operator

The selection operator aims to identify individuals with relatively superior performance for inclusion in the next-generation population, and the NDSs are considered such individuals. The fast non-dominated sorting technique and the crowding distance algorithm [53] are employed to identify the NDSs after the genetic operations. The selection operator is executed as follows: first, the offspring population (\(OP\)) after LAES and the \(EF\) are combined into \(NP\); then, fast non-dominated sorting is executed on \(NP\). Assuming that individuals need to be selected from the first \(n\) fronts, the individuals from the first \(n-1\) fronts, whose number is \(m\), are added to the next-generation population \(GP\), and the remaining \(N-m\) individuals are selected from the \(n\)-th front according to the crowding distance.
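A compact Python sketch of this environmental selection (our illustration of the standard NSGA-II scheme, not the authors' code) is given below; `objs` holds the objective vectors of the combined population \(NP\) and the function returns the indices of the \(N\) retained individuals.

```python
import numpy as np

def environmental_selection(objs, N):
    """Return the indices of the N individuals kept for the next generation.
    Whole non-dominated fronts are copied until one no longer fits; that
    front is then filled according to the crowding distance."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    remaining, chosen = list(range(len(objs))), []
    while len(chosen) < N and remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        if len(chosen) + len(front) <= N:
            chosen += front                       # the whole front fits
        else:                                     # partial front: crowding distance
            F = np.array([objs[i] for i in front], dtype=float)
            dist = np.zeros(len(front))
            for m in range(F.shape[1]):
                order = np.argsort(F[:, m])
                span = F[order[-1], m] - F[order[0], m] or 1.0
                dist[order[0]] = dist[order[-1]] = np.inf
                dist[order[1:-1]] += (F[order[2:], m] - F[order[:-2], m]) / span
            chosen += [front[i] for i in np.argsort(-dist)[:N - len(chosen)]]
        remaining = [i for i in remaining if i not in front]
    return chosen
```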

4 Experimental Study

This section conducts three experiments to validate the effectiveness of the designed CRGEA and its search strategies for solving MOFS. The first experiment compares the CRGEA with using all features to show the necessity of performing FS in classification. The second experiment verifies its effectiveness and superiority against six other well-known algorithms. In the last experiment, an ablation experiment and a statistical analysis are used to show that the designed strategies are effective. KNN is used as the classifier to evaluate the selected feature subsets, and K-fold (\(K=5\)) cross-validation is introduced to assess its classification accuracy. All experiments in this study are coded in MATLAB R2018a and run on a PC with an Intel (R) Core (TM) i5-12400F CPU and 16 GB of RAM. The remaining parameters of the CRGEA were determined experimentally and set as follows: the population size is 60, the maximum number of iterations is 200, the crossover probability is 0.9, the mutation probability is 0.2, the initial temperature is 10, the final temperature is 1, the cooling coefficient is 0.7, and the number of cycles at each temperature level is 3.

All experiments are based on 16 well-regarded high-dimensional datasets from different domains, taken from public research repositories. The first five datasets are sourced from research [21], datasets 6-8 are sourced from research [1], and the last nine datasets are sourced from research [63]. Note that the data are preprocessed first, i.e., missing data and features with a constant value are removed to reduce the search space. Table 1 presents the details of the datasets, including the numbers of features, samples, and classes. Unlike many existing studies, the cases in this paper are high-dimensional datasets.

Table 1 The information of the high-dimensional datasets

4.1 Verification of FS Model

This section compares the CRGEA with the method using all features (without FS) to show the necessity of FS in classification. The results are shown in Table 2. The CRGEA obtains a set of non-dominated solutions; the table reports the solution with the best \({f}_{1}\). \({RR}_{1}\) and \({RR}_{2}\) denote the reduction rates of \({f}_{1}\) and \({f}_{2}\), respectively. As described in Table 2, significant improvements are achieved for both \({f}_{1}\) and \({f}_{2}\). The CRGEA achieves the greatest improvement in \({f}_{1}\) on WarpAR10P and WarpPIE10P, where the error is reduced by 100%. The CRGEA achieves the largest dimensionality reduction on WarpPIE10P, whose subset size is reduced by 99.6694%. These results verify the necessity of performing FS in classification, which reduces the classification error while reducing the size of the datasets.

Table 2 Experimental results of CRGEA and using all features

4.2 Performance Analysis of the Designed CRGEA

To analyze the performance of the designed CRGEA, the widely applied NSGA-II [53], MOEA/D [64], SPEA-II [65], MOBGA-AOS [4], MOEA-ISa [25] and ISAGA [60] are used as the comparison algorithms. To ensure fairness, all algorithms run for the same number of iterations, i.e., 200 generations, and use the same population size. The parameters of the comparison algorithms were determined experimentally as follows. The \({P}_{c}\) and \({P}_{m}\) of NSGA-II equal 0.9 and 0.2, respectively. The T of MOEA/D (the number of weight vectors in the neighborhood of each weight vector) equals 30. The \({P}_{c}\) and \({P}_{m}\) of MOBGA-AOS equal 0.9 and 1/D, respectively, as in [4]. For MOEA-ISa, the crossover probability equals 1. For ISAGA, the initial temperature, the number of cycles, the cooling rate, and the number of iterations in the simulated annealing operation are set to 1, 10, 0.9 and 3, respectively.

The best and average values of \({f}_{1}\) and \({f}_{2}\) obtained by the seven algorithms over 30 runs are shown in Table 3. The results indicate that CRGEA is competitive against all comparison algorithms across all datasets. CRGEA obtains the best values of both \({f}_{1}\) and \({f}_{2}\) on all datasets, which means that the designed algorithm has a stronger search capability and its solutions are closer to the ideal Pareto front. MOEA-ISa achieves the best \({f}_{1}\) among the comparison algorithms by employing interval initialization, which constrains the number of selected features; however, its solutions are not as good as those obtained by CRGEA. In addition, the remaining five algorithms, which use random initialization, tend to fall into local optima on high-dimensional problems such as Isolet1, WarpPIE10P, Orl and Qsar, and only find sub-optimal solutions. On the WarpAR10P dataset, CRGEA obtains the minimum best value of \({f}_{1}\), but its average performance is worse than that of the other algorithms. This is because the solutions of CRGEA are more widely distributed over \({f}_{1}\), whereas the search of the comparison algorithms in the \({f}_{1}\) direction falls into a local optimum; Fig. 4 supports this conclusion. The above results verify the superiority of the designed CRGEA.

Table 3 The best and average values of \({f}_{1}\) and \({f}_{2}\) obtained by seven algorithms
Fig. 4 Normalized NDSs obtained by seven comparison algorithms on 16 datasets

Figure 4 shows the NDSs obtained by the seven comparison algorithms over 30 runs to illustrate the distribution of the obtained solutions. Each algorithm yields multiple solutions, and for clearer visualization only the normalized NDSs are presented, represented by seven distinct colors. The NDSs obtained by CRGEA clearly dominate a significant portion of the non-dominated solutions obtained by the other comparison algorithms. The solutions obtained by CRGEA also have better convergence and diversity, as they closely approximate the true Pareto front.

The distribution and convergence of the solutions of each algorithm are evaluated using two widely recognized performance metrics, i.e., the hypervolume (HV) [66] and the inverted generational distance (IGD) [67]. Because the true Pareto front of each dataset is not known, we aggregate the solutions obtained by all algorithms and select the first non-dominated front as the ideal non-dominated front. Figure 5 describes the IGD and HV results over 30 runs. The results show that the IGD and HV obtained by the proposed CRGEA are superior to those of the other algorithms when dealing with high-dimensional datasets.

Fig. 5 The best and average values of IGD and HV on 16 datasets

Additionally, the performance differences between the other algorithms and CRGEA are examined using the Wilcoxon rank sum test with a significance level of 0.05. This statistical analysis helps determine whether the comparison algorithms differ significantly from CRGEA. The HV is used as the comprehensive evaluation metric in the Wilcoxon test. The results in Table 4 reveal that the advantages of CRGEA over the other comparison algorithms are statistically significant on all the high-dimensional datasets.

Table 4 p-values of the Wilcoxon test between CRGEA and five comparison algorithms

To illustrate the relationship between performance improvement and computational cost, Table 5 summarizes the average CPU time of the algorithms. The comparison algorithms show similar time consumption on all datasets. A classifier is usually necessary in FS to evaluate candidate subsets, and it consumes most of the algorithm's computation time [21]; moreover, the number of samples and features is the main determinant of the classifier's time cost. Because the local search operator requires additional fitness evaluations, the designed CRGEA consumes more runtime; however, it achieves the best classification performance. Overall, although the proposed CRGEA adds several new operators, its running time remains acceptable and similar to that of the other comparison algorithms.

Table 5 Running time consumed by the seven algorithms on the seventeen datasets (unit: minute)

4.3 Effectiveness Validation of the Proposed Strategies

To validate the effectiveness of the designed search strategies in CRGEA, an ablation experiment is conducted, with HV employed to evaluate the performance of the algorithms. The original algorithm, NSGA-II, is denoted MOEA; MOEA combined with the proposed initialization strategy is denoted MOEA-ISMTS; and MOEA combined with the proposed local acceleration evolution strategy is denoted MOEA-LAES. The average HV values obtained by CRGEA and the other variants over 30 runs are shown in Table 6.

Table 6 The results of the designed CRGEA and other variants

As described in Table 6, both MOEA-ISMTS and MOEA-LAES exhibit superior performance compared to MOEA on all datasets. This strongly suggests that the proposed ISMTS and LAES are indeed effective. Moreover, MOEA-ISMTS performs better than MOEA-LAES when the number of features exceeds 1000, suggesting that the designed ISMTS is more effective for solving high-dimensional MOFS, whereas MOEA-LAES performs better than MOEA-ISMTS when the number of features is below 1000, confirming that the local search is more effective on smaller-scale problems. The results obtained by CRGEA are superior to those of MOEA-ISMTS, which indicates that the local acceleration strategy further improves the search ability and helps individuals escape local optima. In addition, the Wilcoxon test is applied to analyze the statistical significance of the differences between MOEA and the other three variants. The results of the Wilcoxon test, shown in Table 7, confirm that the results in Table 6 are statistically significant on all datasets.

Table 7 p-values of the Wilcoxon test between MOEA and three comparison algorithms

Additionally, to verify the superiority of the designed ISMTS, a comparison experiment is conducted against the traditional random initialization (TRI) method. Taking three datasets of different scales (WarpAR10P, Multiple Features and DrivFace) as examples, the distributions of the individuals generated by ISMTS and TRI are shown in Fig. 6. The results show that the proposed ISMTS exhibits superior convergence and diversity across these datasets in comparison to TRI, which further confirms the effectiveness of the proposed ISMTS.

Fig. 6 The initial populations' distribution generated by two different initialization methods

To verify the effectiveness of the proposed novel evaluation function, CRGEA is run with two different methods for evaluating the CR of each feature on the 16 high-dimensional datasets, i.e., the method using the designed Eq. (10) and the existing method using Eq. (2), denoted CRGEA and CRGEA-I respectively. The fourth column in Table 8 gives the p-values of the Wilcoxon test between CRGEA and CRGEA-I. It is evident that the proposed CRGEA achieves the best results on all datasets, and the Wilcoxon test supports this conclusion. This demonstrates the effectiveness of the designed novel evaluation function.

Table 8 The results of the CRGEA and CRGEA-I on 16 high-dimensional datasets

5 Summary and Discussions

In Sect. 4.1, the CRGEA is compared with the approach without FS, which demonstrates the necessity of performing MOFS on high-dimensional datasets. In Sect. 4.2, we analyze the \({f}_{1}\), \({f}_{2}\), IGD and HV metrics of each comparison algorithm and perform a statistical analysis employing the Wilcoxon test. The performance superiority of CRGEA in tackling high-dimensional MOFS problems is clearly demonstrated. Furthermore, we present the final Pareto fronts of all algorithms, which show that the Pareto front obtained by CRGEA exhibits greater diversity and better convergence than those of the comparison algorithms. In Sect. 4.3, an ablation experiment is performed to prove the effectiveness of the designed ISMTS and LAES and to reveal that both affect the performance of CRGEA. Among them, the designed initialization method significantly improves the performance of CRGEA in solving high-dimensional MOFS problems, while the local acceleration strategy further improves the search ability and helps individuals jump out of local optima. Then, we compare the distribution in the solution space of the designed ISMTS with that of TRI; the findings indicate that the proposed ISMTS exhibits superior convergence and diversity. Finally, we compare the performance of CRGEA using two different methods for evaluating the CR of each feature on 16 high-dimensional datasets, and the results demonstrate the effectiveness of the proposed correlation-redundancy assessment method.

6 Conclusions and Future Work

Most of the literature on FS for high-dimensional datasets focuses on improvements to search strategies, ignoring the characteristics of the dataset itself such as the correlation and redundancy of each feature, which can degrade the algorithm's search effectiveness. Thus, a correlation-redundancy guided evolutionary algorithm is designed to address high-dimensional FS with the objective of optimizing classification accuracy and the number of features simultaneously. A new correlation-redundancy assessment method is proposed to select features with high relevance and low redundancy and to speed up the entire evolutionary process. A novel initialization strategy combined with a multiple threshold selection mechanism is developed to produce a high-quality initial population. A local acceleration evolution strategy based on a parallel simulated annealing algorithm and a pruning method is developed to improve the local search ability of the CRGEA. Numerical experiments comparing CRGEA with state-of-the-art algorithms on 16 high-dimensional datasets demonstrate its superiority in solving MOFS of high-dimensional datasets. The results show that CRGEA outperforms the other comparison algorithms in the \({f}_{1}\), \({f}_{2}\), IGD and HV metrics on all datasets, and this advantage becomes increasingly evident as the dimensionality of the data increases. The statistical analysis supports these conclusions. Additionally, an ablation experiment was performed to evaluate the impact of the proposed ISMTS and LAES on the algorithm's performance; the results demonstrate a significant enhancement in both the convergence speed and the search capability of CRGEA when these techniques are incorporated. We also compared the performance of the algorithm using two different methods for evaluating the CR of each feature on the 16 high-dimensional datasets, and the results indicate that the proposed evaluation function better captures the correlation and redundancy of each feature.

Although the series of experiments above highlights the excellent performance of the proposed CRGEA for solving MOFS of high-dimensional datasets, this work mainly focuses on the effect of the correlation and redundancy of features on classification. The interactions between features, which are common in biology [68,69,70], are disregarded in this paper. In the future, we will improve the proposed algorithm to handle interactions between features on biological datasets.