1 Introduction

In this era of data explosion, as the number of instances and the dimensionality of data continue to increase, processing and analyzing data have become increasingly challenging. Feature selection (FS) is a mainstream data reduction technique that aims to eliminate redundant and noisy attributes. Its primary objective is to select the smallest subset of features from the original feature set according to an FS criterion [1]. The advantage of FS lies in its ability to compress the search space of the learning algorithm and reduce the size of the feature set, thereby diminishing the dimensionality of the data, easing the learning task, and improving model efficiency [2].

In recent years, numerous researchers have applied swarm intelligence evolutionary algorithms (EAs) to FS. Swarm intelligence optimization algorithms are simple to implement, converge quickly, and have robust global search ability, making them well-suited for tackling intricate optimization problems. Swarm intelligence algorithms, including the genetic algorithm (GA) [3, 4], the artificial bee swarm algorithm (ABO) [5], the grey wolf algorithm (GWO) [6], and the particle swarm algorithm (PSO) [7], have demonstrated promising results. Among them, PSO stands out as one of the most frequently employed optimization techniques. PSO is not only applied to feature selection problems but is also widely used in other fields, and many scholars have proposed improvements to it in different settings. Wang et al. [8] proposed a particle swarm optimization algorithm based on reinforcement learning level (RLLPSO) for large-scale problems, which increases the diversity of the population and improves its search performance and convergence speed. Inspired by conditional integrals in automatic control, Xiang et al. [9] proposed an adaptive search direction learning method for PSO (ISPSO), which achieves faster global convergence and higher solution accuracy. Xia et al. [10] proposed the MFCPSO algorithm to address the shortcomings of fitness-based selection; it exhibits promising characteristics on large-scale complex functions. However, these evolutionary optimization algorithms also have certain limitations. Most of them are designed for single-objective FS problems, whereas FS can be viewed as a multi-objective optimization problem. Typically, two optimization objectives are considered: maximizing the classification accuracy of the selected feature subset and minimizing the size of the subset. In fact, researchers have explored the use of multi-objective EAs, including MOPSO algorithms, for solving the FS problem. Pradip et al. [11] proposed a two-phase multi-objective FS method aimed at selecting the most relevant features. The first phase performs a global search using PSO, while the second phase, a combination of PSO and GWO based on a modified form of Newton's second law of motion, performs a local search starting from the results obtained in the global search. Wang et al. [12] introduced a multi-objective evolutionary FS algorithm that incorporates a correlation metric and a novel redundancy metric for class-correlation redundancy. The method uses Pareto optimality to assess candidate feature subsets and find a compact feature subset with maximum correlation and minimum redundancy. Xue et al. [13] proposed an adaptive multi-objective genetic algorithm for FS, which incorporates an adaptive mechanism to dynamically select among five different crossover operators during various evolutionary stages, allowing the algorithm to remove multiple features while maintaining classification performance. Feng et al. [14] modified the PSO component of their model using genetic operators and Levy flight to improve the global search capability and mitigate stagnation in local optima. These algorithms strive to discover a collection of solutions that strike a balance between classification accuracy and the size of the selected feature subset.

However, these algorithms ignore the prior information contained in the feature data during initialization and use random initialization to generate initial solutions, which may be far from the true Pareto front and slow the convergence of the population. To alleviate this problem, Han et al. [15] introduced an improved feature selection method that sets a selection threshold according to the correlation between features and classes in the initialization stage so as to select higher-quality feature subsets. Yu et al. [16] presented a swarm initialization strategy that combines blended initialization and threshold selection techniques; additionally, PCA is employed to rank the importance of features. Although these methods take the prior information contained in the feature data into account during initialization, the particles are still prone to falling into local optima. To address this problem, Fu et al. [17] introduced a novel multi-objective binary GWO method that incorporates a guided mutation strategy. The method utilizes the Pearson correlation coefficient to guide local search, enhancing the population's ability to explore local regions. Additionally, a dynamic perturbation mechanism is employed for mutation, preventing population stagnation caused by a single strategy; this dynamic adjustment maintains population diversity and improves the algorithm's exploration capability. Zhou et al. [18] presented an adaptive hierarchical-update PSO algorithm to overcome the tendency of particle swarm algorithms to get trapped in local optima and struggle to escape. The proposed method incorporates multi-level update formulas for both the global exploration subgroup and the local exploitation subgroup, enhancing resistance to local optima and improving the algorithm's ability to find globally optimal solutions. Wei et al. [19] employed a neighborhood search strategy to enhance the local search capability of the swarm during stagnation periods. Xiang et al. [20] proposed a PID-based PSO strategy (PBS-PSO) to avoid premature convergence of particle swarm optimization, accelerating convergence and adjusting the search direction to escape local optima. Xue et al. [13] introduced a search stagnation detection mechanism aimed at mitigating premature convergence in PSO. Although these methods can keep particle swarms from getting trapped in local optima, most of them lack prior-information guidance, which restricts the search performance of swarm intelligence algorithms and hinders their convergence towards the global optimum.

Based on the above analysis, incorporating prior knowledge into both the population initialization and the search process can expedite the algorithm's search and enhance the explainability of the selected features. Introducing prior information during initialization brings the generated initial solutions closer to the true Pareto front, accelerates population convergence, and increases the diversity of the particles. Coupling prior knowledge into the search process can effectively guide particles to search in better directions and improve the search performance of the population. Therefore, this paper proposes an adaptive multi-objective particle swarm feature selection algorithm guided by feature-label relevance information, which combines the advantages of filter and wrapper FS algorithms while taking full account of prior information. The primary factors that differentiate this paper from other algorithms can be summarized as follows:

Firstly, a strategy for setting feature encoding intervals is proposed, which determines the interval boundaries based on the magnitude of correlation between features and categories. This strategy increases the probability of selecting features with higher correlation to the categories, thereby enhancing the explainability of the selected features.

Secondly, a novel swarm initialization method based on feature-label correlation is proposed. This method improves the quality of the initial solutions and the distribution of the particles, resulting in initial solutions that are significantly closer to the true Pareto front. Additionally, it expedites the rate at which the population converges.

Finally, an adaptive hybrid perturbation strategy is proposed to help the particles escape from local optima, taking into account the performance of each particle, the selection probability of each feature, and whether the feature is currently selected.

The remainder of the paper is organized as follows: Sect. 2 presents an overview of existing work related to MOPSO and information entropy. Sect. 3 describes the proposed FS algorithm. In Sect. 4, the experimental results are presented and analyzed, providing a comprehensive discussion of the findings. Finally, Sect. 5 gives the conclusions of this paper.

2 Preliminaries

2.1 Multi-objective Optimization Problems (MOPs)

Problems with multiple optimization objectives are called multi-objective problems; since the objectives conflict with one another, no single solution can be optimal for all objectives. The solutions that satisfy the Pareto optimality criteria in such problems are referred to as Pareto optimal solutions. These solutions allow for a trade-off among different objective functions, as improving one objective may come at the expense of another [21, 22]. A minimization MOP can be described in the following manner:

$$\begin{aligned} minimize\ F(X)=(f_{1}(X),f_{2}(X),\ldots ,f_{n}(X))\nonumber \\ subject\ to: u_{i}(X)\le 0,\ i = 1,2,\ldots ,k\nonumber \\ e_{j}(X) = 0,\ j = 1,2,\ldots ,k \end{aligned}$$
(1)

where \(X=(x_1,x_2,x_3,\ldots ,x_D)\) represents the D-dimensional vector in the decision space and n is the number of objectives; \(f_i(X)\) denotes the ith minimized objective function, and \(u_i(X)\) and \(e_j(X)\) are the inequality and equality constraints, respectively. Given two feasible solutions \(X_1\) and \(X_2\), \(X_1\) dominates \(X_2\) if and only if \(f_a(X_1)\leqslant f_a(X_2)\) for all \(a\in \{1,2,\ldots ,n\}\) and \(f_b(X_1)< f_b(X_2)\) for at least one \(b\in \{1,2,\ldots ,n\}\). If no other solution dominates \(X^{*}\), then \(X^{*}\) is known as a Pareto-optimal solution. The set of all Pareto-optimal solutions is known as the Pareto-optimal set, while the objective values associated with these solutions form the Pareto front.
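As an illustration of the dominance relation defined above, the following sketch (illustrative only, not part of the proposed method) checks whether one objective vector dominates another in a minimization setting:

```python
import numpy as np

def dominates(f1, f2):
    """Return True if objective vector f1 Pareto-dominates f2 (minimization)."""
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    # no worse in every objective and strictly better in at least one
    return bool(np.all(f1 <= f2) and np.any(f1 < f2))

# (feature-subset ratio, classification error) pairs
print(dominates([0.2, 0.10], [0.3, 0.10]))  # True
print(dominates([0.2, 0.10], [0.1, 0.20]))  # False: the two solutions are incomparable
```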

2.2 Particle Swarm Optimization

PSO has been widely used in a diverse range of optimization problems [23, 24]. In the particle swarm algorithm, each particle corresponds to a prospective solution to an optimization problem, and collectively, all particles form a set of candidate solutions. Each particle possesses two fundamental properties: velocity and position. The update of velocity and position for the particle swarm is performed as follows.

$$\begin{aligned} v_i(t+1)~=~\omega *v_i(t)+c_1*r_1*(pbest_i-x_i(t))+c_2*r_2*(gbest_i-x_i(t)) \end{aligned}$$
(2)
$$\begin{aligned} x_i(t+1)=x_i(t)+v_i(t+1),i=1,2,\ldots ,n \end{aligned}$$
(3)

where \(\omega \) represents the inertia weight, t is the current iteration number, \(c_1\) and \(c_2\) are the learning factors, \(r_1\) and \(r_2\) are two random values uniformly distributed in the interval [0, 1], and \(pbest_i\) and \(gbest_i\) denote the individual best position and the global best position of particle i, respectively.
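For reference, a minimal NumPy sketch of the velocity and position updates of Eqs. (2) and (3) is shown below; the parameter values are illustrative placeholders rather than the settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_update(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One velocity/position update for a swarm of shape (N, D), following Eqs. (2)-(3)."""
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x_new = x + v_new
    return x_new, v_new

# Example: 4 particles in a 5-dimensional search space
x = rng.random((4, 5)); v = np.zeros((4, 5))
x_new, v_new = pso_update(x, v, pbest=x, gbest=x[0])
```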

2.3 Information Entropy

2.3.1 Entropy

Entropy quantifies the level of uncertainty associated with a random variable; higher entropy corresponds to greater uncertainty. The entropy of a discrete random variable X, denoted as H(X), is defined by the following equation:

$$\begin{aligned} H(X)=-\sum _{x\in X}p(x)log(p(x)) \end{aligned}$$
(4)

where X denotes the random variable and p(x) is its probability mass function.

2.3.2 Relative Entropy

Relative entropy (also known as the Kullback-Leibler divergence) quantifies the difference or dissimilarity between two probability distributions. Specifically, it measures the additional amount of information needed to encode data drawn from one distribution using a code optimized for another distribution. The relative entropy between probability distributions p(x) and q(x) is defined as follows:

$$\begin{aligned} D(p||q)=\sum _{x\in X}p(x)log\frac{p(x)}{q(x)} \end{aligned}$$
(5)

2.3.3 Mutual Information (MI)

MI is a measure used to quantify the amount of information that one random variable contains about another random variable [25]. It reflects the degree of correlation between the variables, with higher values indicating stronger correlation. The MI between two discrete variables X and Y is defined as follows:

$$\begin{aligned} I(X;Y)=\sum _{x\in X}\sum _{y\in Y}p(x,y)log\frac{p(x,y)}{p(x)p(y)}=D(p(x,y)||p(x)p(y)) \end{aligned}$$
(6)

where p(x,y) denotes the joint probability distribution of x and y, and p(x) and p(y) refer to the marginal probability distributions of x and y, respectively.

The relationship between MI and entropy can be described as follows:

$$\begin{aligned} I(X;Y)=H(X)+H(Y)-H(X,Y) \end{aligned}$$
(7)
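To make the use of entropy and MI concrete, the following sketch estimates H(X) and I(X;Y) for discrete samples from empirical frequencies (Eqs. (4) and (6)). Continuous features would first need to be discretized, and the simple plug-in estimator shown here is only one of several possibilities.

```python
import numpy as np

def entropy(x):
    """H(X) of a discrete sample, Eq. (4), in nats."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, y):
    """Estimate I(X;Y) of two discrete samples from empirical frequencies, Eq. (6)."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

# A feature that perfectly matches the class label carries I(X;Y) = H(Y)
labels  = np.array([0, 0, 1, 1, 1, 0])
feature = np.array([0, 0, 1, 1, 1, 0])
print(mutual_information(feature, labels), entropy(labels))  # both ~0.693 nats
```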

3 The Proposed Method

In this section, in order to improve the quality of the initial solutions and to expedite convergence, a novel particle swarm initialization strategy is proposed that couples prior information into the initialization process and enhances the explainability of the selected features. At the same time, an adaptive hybrid perturbation strategy is proposed to keep the PSO algorithm from falling into local optima. The details of the two strategies are as follows.

3.1 A Novel Initialization Strategy

To enhance the dispersion of particles and improve the quality of the initial solutions, the interrelation between features and class labels must be thoroughly taken into account. In this paper, mutual information is used as the metric to assess the correlation between features and labels; a higher mutual information value indicates a stronger relevance between a feature and the labels. To ensure the diversity of particles, half of the particles in the population are initialized using feature-label guidance, while the other half are initialized randomly. The overall process is illustrated in Algorithm 1.

Algorithm 1 The proposed initialization strategy

3.1.1 The Initialization Strategy Based on Feature-Label Correlation Information

FS can be viewed as a binary optimization problem since it entails making decisions on whether to select or exclude features. While binary PSO can directly encode particle positions as binary values, continuous PSO has shown better performance in FS [26]. Therefore, in this paper, continuous PSO is employed to adjust the position information of particles in the FS algorithm. Nonetheless, evaluating fitness in continuous particle swarm algorithms is challenging, requiring the conversion of real values to binary values before fitness evaluation. In the conversion process, most PSO-based feature selection algorithms encode particle position information in the range of [0, 1] and use a fixed conversion threshold. However, this fixed feature encoding interval and conversion threshold do not adequately incorporate the correlation information between features and categories. To tackle this problem, we propose a feature encoding interval setting strategy based on feature-label correlation.

Different feature coding intervals are set according to the magnitude of the correlation value. This paper divides the encoding interval of features into two categories: one sets the lower bound of the feature encoding interval (\(X_{lb}\)), and the other sets the upper bound of the feature encoding interval (\(X_{ub}\)). The rules for setting the interval bounds are as follows.

\(X_{lb}\) is set when the correlation value between features and categories exceeds the average correlation value across all features and labels. Conversely, \(X_{ub}\) is set when the correlation value is below the average. The calculation formulas are shown in Eq.(8) and Eq.(9).

$$\begin{aligned} X_{lb}=\alpha *\frac{I(f_{j},C)}{max(I(f,C))},\quad j=1,2,\ldots ,D,\ \alpha =0.2 \end{aligned}$$
(8)
$$\begin{aligned} X_{ub}=T+\beta *\frac{I(f_j,C)}{max(I(f,C))},\quad j=1,2,\ldots ,D,\ \beta =0.4 \end{aligned}$$
(9)

where \(I(f_j,C)\) represents the MI between feature \(f_j\) and the class label C, and T represents the selection threshold. \(\alpha \) and \(\beta \) are two adjustment factors, whose exact values are discussed in Sect. 4.6.

The encoding process of the features is as follows. Taking a dataset with D-dimensional features as an example, the position information of the ith particle can be represented by a string of D-dimensional real-valued data, denoted as the vector \(F_{i}=(x_{i,1},x_{i,2},x_{i,3},\ldots ,x_{i,D})\). The range of values of each component in \(F_i\) is divided into two cases, as shown in Eq. (10):

$$\begin{aligned} x_{i,j}\in {\left\{ \begin{array}{ll}[X_{lb},1],&{}I(f_j,C)>MeanMI\\ [0,X_{ub}],&{}I(f_j,C)\le MeanMI\end{array}\right. },\quad i=1,2,\ldots ,N,\ j=1,2,\ldots ,D \end{aligned}$$
(10)

where MeanMI represents the mean of all feature-label relevance values.

Based on the equation above, it can be inferred that, given a fixed selection threshold, a higher mutual information value means that a larger proportion of the feature's encoding interval lies above the selection threshold. This ensures that features with stronger relevance have a higher probability of being selected.

The random initialization of particle position information is shown in Eq.(11).

$$\begin{aligned} x_{i,j}\in [0,1],i=1,2,3,\dots ,N,j=1,2,3,\dots ,D \end{aligned}$$
(11)

Similar to the approach used in HMPSOFS [27], this paper uses a fixed binarization threshold. The particle's position in each dimension is converted into a binary value according to this threshold, as shown in Eq. (12): \(F_{i,j}\) is set to 1 when \(x_{i,j}\) is greater than T, and to 0 otherwise.

$$\begin{aligned} {F_{i,j}}=\left\{ \begin{matrix}1,\quad x_{i,j}>T\\ 0,\quad x_{i,j}\le T\end{matrix}\right. \end{aligned}$$
(12)

where \(F_{i,j}\) denotes the jth feature of the feature subset \(F_i\); \(F_{i,j}=1\) indicates that the feature is selected and \(F_{i,j}=0\) that it is not. Following previous studies [13, 22, 23], the threshold T is set to 0.6 in this paper.
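The sketch below is a simplified re-implementation of the interval setting and binarization described by Eqs. (8)-(12), not the authors' MATLAB code. It takes a precomputed vector of feature-label MI values, and the default parameter values simply mirror the ones quoted above (their tuning is discussed in Sect. 4.6).

```python
import numpy as np

def guided_initialization(mi, n_particles, alpha=0.2, beta=0.4, T=0.6, seed=0):
    """Sample guided particle positions from a vector of feature-label MI values.

    Implements the encoding intervals of Eqs. (8)-(10) and the binarization of Eq. (12).
    """
    rng = np.random.default_rng(seed)
    mi = np.asarray(mi, dtype=float)
    D = mi.size
    mi_norm = mi / mi.max()
    mean_mi = mi.mean()

    lb = np.zeros(D)                            # default encoding interval is [0, 1]
    ub = np.ones(D)
    strong = mi > mean_mi                       # features with above-average relevance
    lb[strong] = alpha * mi_norm[strong]        # Eq. (8): raise the lower bound
    ub[~strong] = T + beta * mi_norm[~strong]   # Eq. (9): lower the upper bound

    positions = lb + rng.random((n_particles, D)) * (ub - lb)   # Eq. (10)
    masks = (positions > T).astype(int)                         # Eq. (12)
    return positions, masks

# Five features, the first two strongly related to the class label
pos, masks = guided_initialization(mi=[0.80, 0.60, 0.10, 0.05, 0.02], n_particles=4)
print(masks)  # features with higher MI are selected with much higher probability
```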

3.2 The Adaptive Hybrid Mutation Strategy

To leverage the correlation information between features and labels more effectively and prevent the particle swarm from converging to local optima, an adaptive hybrid mutation strategy is introduced. The age threshold from dMOPSO [28] is adopted to determine whether a particle has fallen into a local optimum. In the early stage of the algorithm, the particle swarm exhibits strong search ability and the individual best positions of the particles are updated continuously during the search process. However, as the population evolves, the search ability of the particles gradually declines and the particles easily enter a stagnant state. When the age of a particle is below the predetermined threshold, the particle still possesses good search ability, so it is slightly perturbed using non-uniform mutation [29]. On the other hand, when a particle's age exceeds the preset limit, the particle is likely trapped in a local optimum and requires a larger perturbation, so an adaptive mutation approach is applied to help the particle break free from the local optimum and explore other regions of the solution space. The detailed process is outlined in Algorithm 2.

Algorithm 2 The adaptive hybrid mutation strategy

3.2.1 Non-uniform Mutation

The non-uniform mutation operator \(\varphi \) produces a perturbation whose magnitude decreases as the number of iterations increases. Throughout the iterations, the PSO algorithm pursues a balance between exploration and exploitation. In the early stage of the iteration, increasing the exploration intensity makes the algorithm more likely to find the global optimum or a solution close to it, so a larger perturbation improves the global search ability of the particles. In the later stages, when the search space has narrowed and the swarm is closer to the global optimum, local search becomes more important; the perturbation is therefore reduced and the exploitation of existing good solutions is increased.

$$\begin{aligned} x_{i,j}={\left\{ \begin{array}{ll}x_{i,j}+Pbest_{i,j}*(1-r^{(\varphi )^\lambda }),r\le 0.5\\ x_{i,j}-Pbest_{i,j}*(1-r^{(\varphi )^\lambda }),r>0.5\end{array}\right. } \end{aligned}$$
(13)
$$\begin{aligned} \varphi =1-\frac{t}{maxIt} \end{aligned}$$
(14)

where r is a random number in the range [0, 1], t is the current iteration number of the population, and maxIt is the maximum number of iterations. \(\lambda \) is a system parameter that determines how strongly the random perturbation depends on the iteration number; based on related research [30, 31], \(\lambda \) is set to 3 in the proposed algorithm.
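A minimal sketch of the non-uniform mutation of Eqs. (13) and (14), applied to a single position component, is given below; the handling of out-of-range positions is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def non_uniform_mutation(x_ij, pbest_ij, t, max_it, lam=3):
    """Perturb one position component according to Eqs. (13)-(14)."""
    phi = 1.0 - t / max_it                       # Eq. (14): shrinks as t grows
    r = rng.random()
    step = pbest_ij * (1.0 - r ** (phi ** lam))  # perturbation magnitude
    return x_ij + step if r <= 0.5 else x_ij - step

# Early iterations yield large perturbations, late iterations nearly none
print(non_uniform_mutation(0.7, 0.9, t=1,  max_it=100))
print(non_uniform_mutation(0.7, 0.9, t=99, max_it=100))
```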

3.2.2 Adaptive Mutation

The adaptive mutation strategy further utilizes the prior information contained in the feature data, calculating the mutation probability of each feature according to the performance of the particle itself, combined with the selection probability of the feature and whether the feature is selected.

Firstly, the performance of the particle is defined. Without considering any preferences, the Euclidean distance between the particle’s position in the target space and the origin of the target space is employed as a metric to evaluate the particle’s performance. A smaller distance indicates better performance for the particle. It is calculated as follows:

$$\begin{aligned} performance_i = \left\| \overline{f(x)}\right\| _2 \end{aligned}$$
(15)

where \(performance_i\) denotes the performance of the ith particle and \(\overline{f(x)}\) denotes the target vector of particle i.

Next, the probability of a feature being selected is calculated based on the feature encoding interval set during initialization. The feature coding interval length is \(1-X_{lb}\) or \(X_{ub}\) and a fixed selection threshold T is used. The selection probability of a feature (\(P_s\)) is defined as follows:

$$\begin{aligned} P_s={\left\{ \begin{array}{ll}\dfrac{1-T}{1-X_{lb}},&{}x_{i,j}\in [X_{lb},1]\\ \dfrac{X_{ub}-T}{X_{ub}},&{}x_{i,j}\in [0,X_{ub}]\end{array}\right. } \end{aligned}$$
(16)

For a feature \(f_j\), as shown in Eq. (16), different feature encoding intervals correspond to different selection probabilities.

Finally, the mutation probability is calculated based on whether the feature is selected, which leads to the following two cases.

Case 1: If the feature is selected, its mutation probability will be calculated as follows:

$$\begin{aligned} MP=exp(-P_s)*(1-performance_i) \end{aligned}$$
(17)

Case 2: If the feature is not selected, the probability of its mutation will be calculated as follows:

$$\begin{aligned} MP=(1-exp(-P_s))*(1-performance_i) \end{aligned}$$
(18)

An example of a specific perturbation is shown in Fig. 1. When the MP of a feature exceeds a generated random number, the mutation operation is executed; otherwise, it is not.

Fig. 1 An example of a specific perturbation
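The sketch below combines Eqs. (15)-(18) for a single particle: the mutation probability of each feature depends on the particle's distance to the origin in objective space, the selection probability implied by its encoding interval, and whether the feature is currently selected. The bound vectors lb and ub are assumed to come from the initialization strategy of Sect. 3.1.1, and the numeric example values are hypothetical.

```python
import numpy as np

def adaptive_mutation_probability(objectives, x, lb, ub, T=0.6):
    """Per-feature mutation probabilities for one particle, Eqs. (15)-(18)."""
    x = np.asarray(x, dtype=float)
    lb = np.asarray(lb, dtype=float)
    ub = np.asarray(ub, dtype=float)

    performance = np.linalg.norm(np.asarray(objectives, dtype=float))  # Eq. (15)
    selected = x > T                                                   # current mask, Eq. (12)

    # Eq. (16): selection probability implied by each feature's encoding interval
    strong = lb > 0                                  # features encoded in [X_lb, 1]
    p_s = np.where(strong, (1 - T) / (1 - lb), (ub - T) / ub)

    # Eq. (17) for selected features, Eq. (18) for unselected ones
    return np.where(selected,
                    np.exp(-p_s) * (1 - performance),
                    (1 - np.exp(-p_s)) * (1 - performance))

# Two objectives (subset ratio, error rate) and a hypothetical 3-feature particle
mp = adaptive_mutation_probability(objectives=[0.4, 0.15],
                                   x=[0.8, 0.3, 0.7],
                                   lb=[0.2, 0.0, 0.0],
                                   ub=[1.0, 0.65, 0.62])
print(mp)
```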

3.3 The Framework of the Proposed Method

Algorithm 3 outlines the general framework of fMOPSO-FS, which primarily comprises two phases. The first is the initialization stage. The mixed initialization method is used to initialize the population: as outlined in Algorithm 1, some particles are initialized based on prior information derived from the feature data, while the remaining particles are initialized randomly. Additionally, the external archive and the particle ages are initialized. The main loop constitutes the second phase, which mainly involves the evaluation of particles, the adaptive hybrid mutation, and the update of the external archive. As depicted in Algorithm 2, when the age of a particle exceeds the predefined age threshold, adaptive mutation is performed; otherwise, non-uniform mutation is applied. The adaptive hybrid perturbation strategy helps particles break out of local optima and increases population diversity. As the external archive is continuously updated, the final leader archive serves as the final outcome. In this paper, minimizing the feature subset size and minimizing the classification error rate are chosen as the evaluation functions, which are conflicting objectives. Minimizing the feature subset size is denoted as \(f_1\) and minimizing the classification error rate as \(f_2\). These two evaluation functions are calculated according to Eq. (19) and Eq. (20), respectively.

$$\begin{aligned} f_1=\frac{S_i}{D},\quad S_i=\sum _{j=1}^{D}F_{i,j} \end{aligned}$$
(19)

where \(S_i\) represents the number of features in the feature subset \(F_i\) and D represents the total count of features in the dataset.

$$\begin{aligned} f_{2}=\frac{(FP+FN)}{(FP+FN+TP+TN)} \end{aligned}$$
(20)

where FP, FN, TP and TN represent false positive, false negative, true positive and true negative respectively.
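As a concrete example of the two evaluation functions, the sketch below computes f1 from a binary feature mask (Eq. (19)) and f2 as the misclassification rate implied by Eq. (20); the classifier that would produce the predictions is omitted here.

```python
import numpy as np

def evaluate_particle(mask, y_true, y_pred):
    """Objective values of one particle: f1 = subset-size ratio, f2 = error rate."""
    mask = np.asarray(mask)
    f1 = mask.sum() / mask.size                               # Eq. (19)
    f2 = np.mean(np.asarray(y_true) != np.asarray(y_pred))    # Eq. (20)
    return f1, f2

print(evaluate_particle([1, 0, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]))  # (0.4, 0.25)
```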

Algorithm 3 Framework of fMOPSO-FS

3.4 Computational Complexity Analysis

The proposed algorithm mainly includes two stages: initialization and the main loop. The initialization phase mainly involves initializing the velocities and positions of the particles as well as the external archive. The main loop phase includes the particle search process and the selection of the globally best particles. The main time cost of the initialization stage is calculating the correlation between features and labels, with a time complexity of \(O(D+N)\), where D is the feature dimension and N is the number of particles. The time complexity of the main loop stage is mainly determined by the search and update of the particles. The main cost of the particle search is calculating the mutation probabilities of the features, with a time complexity of \(O(N^{2}+D)\), while in the particle update the time complexity mainly depends on the selection of leader particles, which is O(N). If the selection and update of particles are carried out serially, the time complexity of the main loop stage is \(O(N^{2}+D+N)\). Since \(N^{2}\) is much larger than N and D, the overall time complexity of the proposed algorithm is \(O(N^{2})\). Compared with other similar PSO-based feature selection algorithms, although the proposed algorithm adds the calculation of feature-label correlations and feature mutation probabilities, the added time complexity is constant-level, so the overall time complexity does not increase.

4 Experiments and Discussion

4.1 Methods of Comparison and Corresponding Parameter Configurations

In this section, we select a series of multi-objective FS algorithms for comparison with fMOPSO-FS, covering several classical and state-of-the-art multi-objective optimization algorithms. The classical multi-objective optimization algorithms are MOPSO [32], NSGAIII [33] and MOEA/D [34], and the advanced multi-objective optimization algorithms are HMPSOFS [27], RFPSOFS [35], MOEA/D-COPSO [36] and AGMOPSO [15]. All four of these advanced algorithms employ PSO to discover optimal solutions.

To guarantee the impartiality of the comparative experiments, each dataset is first randomly partitioned into two subsets: the training set comprises 70% of the data, while the remaining 30% is designated as the test set. Additionally, 10-fold cross validation is employed to evaluate the model; this helps mitigate the risk of overfitting on the training set and enhances the reliability of the training process. The classification error rate of each particle is computed using the K Nearest Neighbor (KNN) classifier with k set to 5. The parameter settings of the different algorithms are presented in Table 1. All algorithms are implemented in MATLAB R2020b and run on an Intel(R) Core(TM) i5-8265U CPU at 1.80 GHz with 8 GB RAM.
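The evaluation protocol described above can be sketched as follows using scikit-learn (70/30 split, 10-fold cross-validation on the training portion, 5-NN as the wrapper classifier). The actual experiments were run in MATLAB, so this is only an illustrative equivalent, and the function name is ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def training_error_rate(X, y, mask, seed=0):
    """10-fold CV error of a 5-NN wrapper on the selected feature columns."""
    X_sel = X[:, np.asarray(mask, dtype=bool)]
    # 70/30 split; the held-out 30% would be used for the test-set metrics
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    return 1.0 - cross_val_score(knn, X_tr, y_tr, cv=10).mean()
```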

Table 1 Parameter configurations for seven algorithms

4.2 Performance Metrics

To gauge the effectiveness of the comparison algorithms and the fMOPSO-FS algorithm, two commonly used metrics, hypervolume (HV) and inverted generational distance (IGD), are employed. HV and IGD are among the most representative measures for evaluating the performance of multi-objective optimization algorithms.

The HV metric was initially introduced by Zitzler et al. [37]. The diversity and convergence of an algorithm are assessed by measuring the volume of the hypercube formed by the individuals in the Pareto solution set and the reference point in the objective space; the larger the HV value, the better the Pareto front set. In this paper, the reference point is set to (1.0, 1.0) based on the design of the objective functions. The formula to calculate the HV is as follows:

$$\begin{aligned} HV=\delta (\cup _{i=1}^{|S|}v_i) \end{aligned}$$
(21)

where \(\delta \) is the Lebesgue measure, |S| denotes the number of non-dominated solutions obtained by the algorithm, and \(v_i\) denotes the hypervolume formed by the reference point and the ith solution in the solution collection.
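For the bi-objective case considered here, the HV with reference point (1.0, 1.0) can be computed by summing the rectangles contributed by consecutive points of a non-dominated front, as in the following sketch.

```python
import numpy as np

def hypervolume_2d(front, ref=(1.0, 1.0)):
    """HV of a non-dominated bi-objective front (minimization) w.r.t. a reference point."""
    pts = np.asarray(front, dtype=float)
    pts = pts[np.argsort(pts[:, 0])]          # sort by the first objective
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # rectangle contributed by this point
        prev_f2 = f2
    return hv

print(hypervolume_2d([[0.2, 0.3], [0.4, 0.1]]))  # 0.8*0.7 + 0.6*0.2 = 0.68
```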

The IGD is a comprehensive metric for evaluating algorithm performance, and is mainly used to evaluate the convergence performance and distribution performance of the algorithm [38]. A lower IGD value indicates better overall performance of the algorithm in terms of convergence and distribution. However, in multi-objective feature selection problems, there is no true PF available. Therefore, in this paper, the set of non-dominated solutions generated by all compared algorithms and the proposed algorithm in 30 independent runs is considered as the surrogate Pareto front. The calculation of the IGD is performed as follows:

$$\begin{aligned} IGD(P_s,P^*)=\frac{\Sigma _{x\in P^*}min_{y\in P_s}Dis(x,y)}{|P^*|} \end{aligned}$$
(22)

where \(P_s\) represents the set of Pareto optimal solutions obtained by the algorithm and \(P^*\) denotes a collection of uniformly distributed reference points sampled from the true PF. Dis(x, y) is the Euclidean distance between point x in \(P^*\) and point y in \(P_s\).
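Likewise, Eq. (22) amounts to averaging, over the reference points, the distance to the nearest obtained solution, e.g.:

```python
import numpy as np

def igd(obtained, reference):
    """Mean distance from each reference-front point to its nearest obtained solution, Eq. (22)."""
    P_s = np.asarray(obtained, dtype=float)
    P_star = np.asarray(reference, dtype=float)
    d = np.linalg.norm(P_star[:, None, :] - P_s[None, :, :], axis=2)
    return d.min(axis=1).mean()

print(igd(obtained=[[0.2, 0.3], [0.4, 0.1]],
          reference=[[0.2, 0.25], [0.35, 0.1], [0.5, 0.05]]))
```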

4.3 Experimental Analysis on UCI Datasets

To evaluate the performance of fMOPSO-FS, seven UCI datasets are selected as experimental datasets in this subsection. The details of the datasets are presented in Table 2 [39].

Table 2 Details of relevant UCI datasets

Tables 3, 4, 5 and 6 show the mean and standard deviation of the HV and IGD values obtained by the fMOPSO-FS algorithm and the comparison algorithms on the seven UCI datasets. The symbols '\(\uparrow \)', '\(\downarrow \)' and '\(\circ \)' indicate that the comparison algorithm outperforms, underperforms, or approximates fMOPSO-FS, respectively, while the values before and after the symbol '±' are the mean and standard deviation of the corresponding algorithm on the dataset. Since the sample data are not normally distributed, non-parametric tests are used to compare the differences, with judgments based on the obtained P values. If the difference is not significant, the performance of the two algorithms is considered close, denoted by '\(\circ \)'. If there is a significant difference, the evaluation follows the convention of each indicator: a larger HV value and a smaller IGD value indicate better algorithm performance, and '\(\uparrow \)' or '\(\downarrow \)' is assigned accordingly. Bold font marks the best-performing algorithm.

Table 3 HV values obtained for each algorithm on the training sets of each dataset
Table 4 IGD values obtained for each algorithm on the training sets of each dataset
Table 5 HV values obtained for each algorithm on the test sets of each dataset
Table 6 IGD values obtained for each algorithm on the test sets of each dataset
Fig. 2 Pareto fronts of the maximum HV value in the training set for each algorithm for each dataset

Fig. 3 Pareto fronts of the maximum HV value in the test set for each algorithm for each dataset

When analyzing the results from the training set perspective, as presented in Tables 3 and 4, it is evident that fMOPSO-FS outperforms MOPSO, MOEA/D and NSGAIII, and the HV and IGD values it obtains are close to those of MOEA/D-COPSO and AGMOPSO on the German dataset. The HV value obtained by the MOEA/D-COPSO algorithm is better than that of the fMOPSO-FS algorithm, but its IGD value is worse. The stability of fMOPSO-FS is slightly lower than that of MOEA/D-COPSO. However, compared with the other comparison algorithms, fMOPSO-FS consistently achieves better HV and IGD values across the various datasets.

According to the test set results shown in Tables 5 and 6, fMOPSO-FS achieves HV and IGD values comparable to HMPSOFS, RFPSOFS, MOEA/D-COPSO, and AGMOPSO on the German dataset. On the Sonar dataset, fMOPSO-FS performs similarly to HMPSOFS and MOEA/D-COPSO. On the Musk1 dataset, fMOPSO-FS exhibits HV and IGD values similar to RFPSOFS and MOEA/D-COPSO, but it outperforms them to emerge as the best algorithm overall. Meanwhile, AGMOPSO obtains a better IGD value on CNAE, but fMOPSO-FS obtains a better HV value. The MOEA/D-COPSO algorithm performs well on LSVT, with better HV and IGD values than fMOPSO-FS, and shows strong stability across the Sonar, Hillvalley, Musk1, and CNAE datasets. In contrast, fMOPSO-FS demonstrates excellent stability specifically on the Sonar and Hillvalley datasets. Nevertheless, fMOPSO-FS does not yield significantly improved results on the German dataset, which could be attributed to the dataset's weak feature-class correlations or to the method's disregard for potential redundancies among features. Overall, compared to the other algorithms, the fMOPSO-FS algorithm consistently achieves better performance.

To visually verify the aforementioned conclusions, Figs. 2 and 3 show the PFs with the maximum HV values obtained by each algorithm over 30 runs on the training and test sets of each dataset, respectively. In Fig. 2, it is evident that fMOPSO-FS is able to obtain Pareto solution sets with low classification error rates and compact feature subset sizes on the majority of the datasets. For the MultipleFeatures dataset, although the diversity of the solutions obtained by the fMOPSO-FS algorithm is not as extensive as that of the other comparative algorithms, the fMOPSO-FS algorithm obtains candidate solutions of better quality. Moving to Fig. 3, fMOPSO-FS continues to show its advantages on the test sets. When the same number of features is selected, the fMOPSO-FS algorithm outperforms both HMPSOFS and MOEA/D-COPSO on the Isolet5 dataset. While HMPSOFS and MOEA/D-COPSO exhibit a broader distribution of solution sets, their classification error rates are considerably higher than those of the fMOPSO-FS algorithm.

4.4 Comparison of Seven Feature Selection Methods using Different Classifiers

To verify the effectiveness of the proposed FS approach, this section classifies the feature subsets obtained by all algorithms using the classical classifiers SVM, Naive Bayes and KNN; the classification results are given in Tables 7, 8 and 9. In these tables, the mean classification accuracy on the training set is denoted by \(Tr_{Acc}\), while \(Te_{Acc}\) represents the mean classification accuracy on the test set.

Table 7 The median value of the classification accuracy achieved by each algorithm on the SVM classifier
Table 8 The median value of the classification accuracy achieved by each algorithm on the Naive Bayes classifier
Table 9 The median value of the classification accuracy achieved by each algorithm on the KNN classifier

Tables 7 and 8 reveal that for the German, Isolet5 and MultipleFeatures datasets, the feature subsets selected by fMOPSO-FS show better classification performance with the SVM and Naive Bayes classifiers than those of the other comparison algorithms. Although MOEA/D and NSGAIII demonstrate superior classification accuracy on most datasets, Figs. 2 and 3 show that it is difficult for these two algorithms to find solution sets with good diversity, and the feature subsets they select are significantly larger than those chosen by fMOPSO-FS. In Table 7, the classification accuracy of the fMOPSO-FS algorithm on the LSVT dataset differs significantly from that of the other comparison algorithms. This may be due to the low correlation between features and classes in the LSVT dataset, which hampers the selection of representative features and degrades classification performance. According to Table 9, the feature subsets obtained by MOEA/D demonstrate superior classification performance on most datasets. However, for the Hillvalley dataset, the feature subset chosen by RFPSOFS performs better than the other comparison algorithms, while for the Musk1 and Isolet5 datasets, fMOPSO-FS exhibits better classification performance. Table 10 shows the average number of features in the subsets selected by each algorithm. It can be seen that on most datasets the fMOPSO-FS algorithm selects fewer features than the other comparison algorithms, except on the German dataset. Although some comparison algorithms achieve better classification accuracy on some datasets, taking into account the number of selected features, the diversity of candidate solutions and the HV and IGD results, the fMOPSO-FS algorithm still performs better than the other algorithms.

4.5 Experimental Analysis on Gene Expression Datasets

The preceding subsection showcases the satisfactory performance of the proposed algorithm on conventional datasets, which are typically characterized by low feature dimensionality and a large number of samples. To verify that the fMOPSO-FS algorithm can also demonstrate its advantages on high-dimensional datasets, we selected six gene expression profile datasets, Colon, SRBCT, Lymphoma, Leukemia3, Lung and Kolod, which have high dimensionality and a small number of instances. Table 11 [40, 41] presents the specifications and details of these datasets. Tables 12 and 13 show the HV values obtained by the fMOPSO-FS algorithm and the other comparison algorithms on the training and test sets of the above six datasets. From the perspective of the training set (Table 12), the fMOPSO-FS algorithm obtains HV values on the Colon and Lung datasets that are close to those of the AGMOPSO algorithm, and better HV values than the other comparison algorithms. On the SRBCT and Leukemia3 datasets, the HV values obtained by fMOPSO-FS are only slightly lower than those obtained by the AGMOPSO algorithm. The HV value obtained by fMOPSO-FS on the Lung dataset is only slightly lower than that obtained by the MOEA/D-COPSO algorithm. On the Kolod dataset, fMOPSO-FS achieves better HV values than the other comparison algorithms. From the perspective of the test set (Table 13), the HV values obtained by fMOPSO-FS and AGMOPSO are similar on most datasets, but lower than those obtained by MOEA/D-COPSO on the Lung dataset. The HV values obtained on the Leukemia3 dataset are close to those obtained by the HMPSOFS and RFPSOFS algorithms. In summary, the fMOPSO-FS algorithm also performs well on high-dimensional datasets.

Table 10 Average number of selected features for each algorithm on different datasets
Table 11 Details of relevant gene expression profile datasets
Table 12 HV values obtained for each algorithm on the training sets of each dataset
Table 13 HV values obtained for each algorithm on the test sets of each dataset

4.6 Parameter Analysis

The proposed initialization strategy contains two adjustment factors \(\alpha \) and \(\beta \). Since the selection threshold is fixed at 0.6 and \(\beta \) affects the upper bound of the coding interval for less relevant features, setting \(\beta \) to 0.4 allows the feature coding interval to span [0, 1] in line with the random initialization interval, while also ensuring that features with higher relevance have a higher selection probability. Thus, this section focuses on the effect of \(\alpha \) on the quality and distribution of the particles. Fig. 4 depicts the initialization of the fMOPSO-FS algorithm on various datasets with different values of \(\alpha \). The figure reveals a gradual decline in the quality of the generated initial solutions as \(\alpha \) increases. Hence, it is unnecessary to set a large \(\alpha \) value, while a very small \(\alpha \) value has a negligible impact on the population. Consequently, we restrict the candidate values of \(\alpha \) to {0.1, 0.2, 0.3, 0.4, 0.5}. Notably, when \(\alpha \) is set to 0.1, the particles in the target space are positioned closer to the origin, which means the obtained initial solutions are of higher quality. Therefore, the value of \(\alpha \) is set to 0.1.

Fig. 4 Distribution of particle populations on different datasets with different \(\alpha \) values

4.7 Analysis of the Proposed Strategies

To further analyze the effectiveness of the algorithm, the proposed feature-label correlation-guided initialization strategy and adaptive perturbation strategy as well as the introduced mutual information theory are validated separately.

4.7.1 Initialization Strategy Analysis

In order to assess the effectiveness of the initialization strategy, we compare it with the random initialization strategy. In Fig. 5, PF represents the Pareto front approximated by the non-dominated solutions produced by fMOPSO-FS and all comparison algorithms over 30 independent runs. Fig. 5 clearly shows that the initial solutions generated by the proposed initialization strategy are closer to the Pareto front on most datasets. Although the impact on the Musk1 and SRBCT datasets is not significant, this is likely due to the limited correlation between the feature data and the class labels. It is worth noting that the hybrid initialization method still effectively enhances the diversity of initial solutions compared to a single initialization method: the initial solutions generated by a single initialization method are confined to a certain area of the target space, whereas the mixed initialization method covers more areas and expands the search range of the particles.

Fig. 5 Distribution of particles with different initialization strategies on different datasets

4.7.2 Analysis of Adaptive Hybrid Perturbation Strategies

To validate the efficacy of the adaptive hybrid perturbation strategy, the MOPSO-FL-FS algorithm retains only the feature-label-guided initialization strategy, and the fMOPSO-FS-F algorithm retains the feature-label-guided initialization strategy while using a perturbation strategy with a fixed mutation probability. Each algorithm is run independently 30 times, and Tables 14 and 15 show the HV values of the algorithms on the training and test sets of each dataset, respectively. From Tables 14 and 15, it can be seen that the HV values obtained by the fMOPSO-FS algorithm are significantly better than those obtained by the MOPSO-FL-FS algorithm on both the training and test sets. Compared with the fMOPSO-FS-F algorithm, although the HV values obtained on most datasets are similar, fMOPSO-FS still shows an advantage on some datasets. One possible reason is that the adaptive hybrid perturbation strategy dynamically adjusts the mutation probability based on the performance of the particles themselves, unlike a fixed mutation probability. This adaptiveness allows the strategy to fine-tune the trade-off between exploration and exploitation, leading to improved results.

Table 14 HV values obtained for each algorithm on the training sets of each dataset
Table 15 HV values obtained for each algorithm on the test sets of each dataset

4.7.3 Validation of the Validity of the Mutual Information Theory

The effectiveness of the MI-based guidance is also verified on the different datasets. In this section, the feature subsets obtained by running the fMOPSO-FS algorithm 30 times independently on each dataset are analysed, and Fig. 6 shows, for these datasets, the number of times each feature was selected together with its correlation with the class labels.

Fig. 6 Statistics on the frequency of feature selection

In Fig. 6, the X-axis denotes the feature index, determined by ranking the features' relevance to the class labels from highest to lowest. The left and right Y-axes represent the frequency of feature selection and the correlation between features and classes, respectively. From Fig. 6, it is evident that on the MultipleFeatures and CNAE datasets, as the correlation between a feature and the label increases, the frequency with which that feature is selected also increases; conversely, as the correlation decreases, the corresponding features are selected less often. On the gene expression profile datasets Colon, SRBCT and Lung, the trends of the correlation and the selection frequency of the selected features are roughly the same. For the other datasets, since the correlation between features and labels is less pronounced, no obvious regularity appears in the plots. It can be concluded that incorporating prior information guides the algorithm to select features with higher relevance to the class labels, thus obtaining a higher-quality feature subset and enhancing the explainability of the selected features.

5 Conclusions

The randomness and lack of knowledge guiding the initialization process of most existing MOPSO-based feature selection methods may lead the initial solutions to search, or even repeatedly search, meaningless regions of the search space during evolution, and the generated initial population may be far from the true Pareto front. Furthermore, the absence of sufficient selection pressure on the particle population during the later stages of iterative evolution predisposes the population to converge towards local optima. In order to improve the distribution of the initial population and the quality of the initial solutions, while keeping the particle swarm from getting stuck in local optima, an adaptive multi-objective particle swarm FS method guided by feature-label correlation is proposed in this paper. The method adopts a novel initialization strategy that makes full use of the prior knowledge in the feature data to obtain higher-quality initial solutions. Simultaneously, an adaptive hybrid mutation strategy is proposed to enable the particle swarm to escape local optima. This mutation strategy dynamically adjusts the mutation rate based on the convergence status of the swarm, facilitating exploration of the search space and reducing the likelihood of getting trapped in suboptimal solutions.

The experimental findings validate that the method has clear advantages in solving the multi-objective FS problem, but some problems remain to be tackled. Firstly, obtaining prior information on the feature data can be time-consuming, especially for datasets with large feature dimensions. Secondly, the method mainly considers the correlation between features and classes, so a small number of redundant features may still remain in the obtained feature subset. Hence, improving the efficiency of obtaining such prior information, incorporating inter-feature correlation, and further eliminating redundant features should be the primary areas of focus for future research.