1 Introduction

In this era of data explosion, as the number of instances and the dimensionality of data continue to increase, processing and analyzing data have become increasingly challenging. Feature selection (FS) is a mainstream data reduction technique that aims to eliminate redundant and noisy attributes. Its primary objective is to select the smallest subset of features from the original feature set according to an FS criterion [1]. The advantage of FS lies in its ability to compress the search space of the learning algorithm and reduce the size of the feature set, thereby diminishing the dimensionality of the data, easing the learning task, and improving model efficiency [2].

In recent years, numerous researchers have applied swarm intelligence evolutionary algorithms (EAs) to FS. Swarm intelligence optimization algorithms are simple to implement, converge quickly, and have robust global search ability, making them well-suited for tackling intricate optimization problems. Swarm intelligence algorithms, including the genetic algorithm (GA) [3, 4], the artificial bee swarm algorithm (ABO) [5], the grey wolf algorithm (GWO) [6], and the particle swarm algorithm (PSO) [7], have demonstrated promising results. Among them, PSO stands out as one of the most frequently employed optimization techniques. PSO is not only applied to feature selection problems but is also widely used in other fields, and many scholars have proposed improvements to it in different settings. Wang et al. [8] proposed a particle swarm optimization algorithm based on reinforcement learning level (RLLPSO) for large-scale problems, which increases the diversity of the population and improves its search performance and convergence speed. Inspired by conditional integrals in automatic control, Xiang et al. [9] proposed an adaptive search direction learning method for PSO (ISPSO), which achieves faster global convergence and higher solution accuracy. Xia et al. [10] proposed the MFCPSO algorithm to address the shortcomings of fitness-based selection; it exhibits promising characteristics on large-scale complex functions. However, these evolutionary optimization algorithms also have certain limitations. Most of them are designed for single-objective FS problems, whereas FS can be viewed as a multi-objective optimization problem. Typically, two optimization objectives are considered: maximizing the classification accuracy of the selected feature subset and minimizing the size of the subset. In fact, researchers have explored the use of multi-objective EAs, including MOPSO algorithms, for solving the FS problem. Pradip et al. [11] proposed a two-phase multi-objective FS method aimed at selecting the most relevant features. The first phase performs a global search using PSO, while the second phase, a combination of PSO and GWO based on a modified form of Newton's second law of motion, performs a local search starting from the results obtained in the global search. Wang et al. [12] introduced a multi-objective evolutionary FS algorithm that incorporates a correlation metric and a novel redundancy metric for class-correlation redundancy. The method uses Pareto optimality to assess candidate feature subsets and find a compact feature subset with maximum correlation and minimum redundancy. Xue et al. [13] proposed an adaptive multi-objective genetic algorithm for FS, which incorporates an adaptive mechanism to dynamically select among five different crossover operators during various evolutionary stages, allowing the algorithm to remove multiple features while maintaining classification performance. Feng et al. [14] modified the PSO component of their model using genetic operators and Levy flight to improve the global search capability and mitigate stagnation in local optima. These algorithms strive to discover a collection of solutions that strike a balance between classification accuracy and the size of the selected feature subset.

However, these algorithms ignore the prior information contained in the feature data during initialization and use random initialization to generate initial solutions, which may be far from the true Pareto front and slow the convergence of the population. To alleviate this problem, Han et al. [15] introduced an improved feature selection method that sets a selection threshold according to the correlation between features and classes in the initialization stage so as to select higher-quality feature subsets. Yu et al. [16] presented a swarm initialization strategy that combines blended initialization and threshold selection techniques; additionally, PCA is employed to rank the importance of features. Although these methods take the prior information contained in the feature data into account during initialization, the particles are still prone to falling into local optima. To address this problem, Fu et al. [17] introduced a novel multi-objective binary GWO method that incorporates a guided mutation strategy. The method utilizes the Pearson correlation coefficient to guide local search, enhancing the population's ability to explore local regions. Additionally, a dynamic perturbation mechanism is employed for mutation, preventing population stagnation caused by a single strategy; this dynamic adjustment maintains population diversity and improves the algorithm's exploration capability. Zhou et al. [18] presented an adaptive hierarchical-update PSO algorithm to overcome the tendency of particle swarm algorithms to get trapped in local optima and struggle to escape. The proposed method incorporates multi-level update formulas for both the global exploration subgroup and the local exploitation subgroup, enhancing resistance to local optima and improving the algorithm's ability to find globally optimal solutions. Wei et al. [19] employed a neighborhood search strategy to enhance the local search capability of the swarm during stagnation periods. Xiang et al. [20] proposed a PID-based PSO strategy (PBS-PSO) to avoid premature convergence of particle swarm optimization, accelerating convergence and adjusting the search direction to escape local optima. Xue et al. [13] introduced a search stagnation detection mechanism aimed at mitigating premature convergence in PSO. Although these methods can keep particle swarms from getting trapped in local optima, most of them lack prior-information guidance, which restricts the search performance of swarm intelligence algorithms and hinders their convergence towards the global optimum.

Based on the above analysis, incorporating prior knowledge into both the population initialization and the search process can expedite the algorithm's search and enhance the explainability of the selected features. Introducing prior information during initialization brings the generated initial solutions closer to the true Pareto front, accelerates population convergence, and increases the diversity of the particles. Coupling prior knowledge into the search process can effectively guide particles to search in better directions and improve the search performance of the population. Therefore, this paper proposes an adaptive multi-objective particle swarm feature selection algorithm guided by feature-label relevance information, which combines the advantages of filter and wrapper FS algorithms while taking full account of prior information. The primary factors that differentiate this paper from other algorithms can be summarized as follows:

Firstly, a strategy for setting feature encoding intervals is proposed, which determines the interval boundaries based on the magnitude of correlation between features and categories. This strategy increases the probability of selecting features with higher correlation to the categories, thereby enhancing the explainability of the selected features.

Secondly, a novel swarm initialization method based on feature-label correlation is proposed. This method improves the quality of the initial solutions and the distribution of the particles, resulting in initial solutions that are significantly closer to the true Pareto front. Additionally, it expedites the rate at which the population converges.

Finally, an adaptive hybrid perturbation strategy is proposed to help the particles escape from local optima, taking into account the performance of each particle, the selection probability of each feature, and whether the feature is currently selected.

The remainder of the paper is organized as follows: Sect. 2 presents an overview of existing work related to MOPSO and information entropy. Sect. 3 describes the proposed FS algorithm. In Sect. 4, the experimental results are presented and analyzed, providing a comprehensive discussion of the findings. Finally, Sect. 5 gives the conclusions of this paper.

2 Preliminaries

2.1 Multi-objective Optimization Problems (MOPs)

Problems with multiple optimization objectives are called multi-objective problems; since the objectives conflict with one another, no single solution can be optimal for all objectives. The solutions that satisfy the Pareto optimality criteria in such problems are referred to as Pareto optimal solutions. These solutions allow for a trade-off among different objective functions, as improving one objective may come at the expense of another [21, 22]. A minimization MOP can be described in the following manner:

$$\begin{aligned} minimize\ F(X)=(f_{1}(X),f_{2}(X),\ldots ,f_{n}(X))\nonumber \\ subject\ to: u_{i}(X)\le 0,\ i = 1,2,\ldots ,k\nonumber \\ e_{j}(X) = 0,\ j = 1,2,\ldots ,k \end{aligned}$$
(1)

where \(X=(x_1,x_2,x_3,\ldots ,x_D)\) represents the D-dimensional vector in the decision space and n is the number of objectives; \(f_i(X)\) denotes the ith minimized objective function, and \(u_i(X)\) and \(e_j(X)\) are the inequality and equality constraints, respectively. Given two feasible solutions \(X_1\) and \(X_2\), \(X_1\) dominates \(X_2\) if and only if \(f_a(X_1)\leqslant f_a(X_2)\) for all \(a\in \{1,2,\ldots ,n\}\) and \(f_b(X_1)< f_b(X_2)\) for at least one \(b\in \{1,2,\ldots ,n\}\). If no other solution dominates \(X^{*}\), then \(X^{*}\) is known as a Pareto-optimal solution. The set of all Pareto-optimal solutions is known as the Pareto-optimal set, while the objective values associated with these solutions form the Pareto front.
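As an illustration of the dominance relation defined above, the following sketch (illustrative only, not part of the proposed method) checks whether one objective vector dominates another in a minimization setting:

```python
import numpy as np

def dominates(f1, f2):
    """Return True if objective vector f1 Pareto-dominates f2 (minimization)."""
    f1, f2 = np.asarray(f1, dtype=float), np.asarray(f2, dtype=float)
    # no worse in every objective and strictly better in at least one
    return bool(np.all(f1 <= f2) and np.any(f1 < f2))

# (feature-subset ratio, classification error) pairs
print(dominates([0.2, 0.10], [0.3, 0.10]))  # True
print(dominates([0.2, 0.10], [0.1, 0.20]))  # False: the two solutions are incomparable
```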

2.2 Particle Swarm Optimization

PSO has been widely used in a diverse range of optimization problems [23, 24]. In the particle swarm algorithm, each particle corresponds to a prospective solution to an optimization problem, and collectively, all particles form a set of candidate solutions. Each particle possesses two fundamental properties: velocity and position. The update of velocity and position for the particle swarm is performed as follows.

$$\begin{aligned} v_i(t+1)~=~\omega *v_i(t)+c_1*r_1*(pbest_i-x_i(t))+c_2*r_2*(gbest_i-x_i(t)) \end{aligned}$$
(2)
$$\begin{aligned} x_i(t+1)=x_i(t)+v_i(t+1),i=1,2,\ldots ,n \end{aligned}$$
(3)

where \(\omega \) represents the inertia weight, t is the current iteration number, \(c_1\) and \(c_2\) are the learning factors, \(r_1\) and \(r_2\) are two random values uniformly distributed in the interval [0, 1], and \(pbest_i\) and \(gbest_i\) denote the individual best position and the global best position of particle i, respectively.
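For reference, a minimal NumPy sketch of the velocity and position updates of Eqs. (2) and (3) is shown below; the parameter values are illustrative placeholders rather than the settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_update(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One velocity/position update for a swarm of shape (N, D), following Eqs. (2)-(3)."""
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x_new = x + v_new
    return x_new, v_new

# Example: 4 particles in a 5-dimensional search space
x = rng.random((4, 5)); v = np.zeros((4, 5))
x_new, v_new = pso_update(x, v, pbest=x, gbest=x[0])
```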

2.3 Information Entropy

2.3.1 Entropy

Entropy quantifies the level of uncertainty associated with a random variable; higher entropy corresponds to greater uncertainty. The entropy of a discrete random variable X, denoted as H(X), is defined by the following equation:

$$\begin{aligned} H(X)=-\sum _{x\in X}p(x)log(p(x)) \end{aligned}$$
(4)

where X denotes the random variable and p(x) is its probability mass function.

2.3.2 Relative Entropy

Relative entropy (also known as the Kullback-Leibler divergence) quantifies the difference or dissimilarity between two probability distributions. Specifically, it measures the additional amount of information needed to encode data drawn from one distribution using a code optimized for another distribution. The relative entropy between probability distributions p(x) and q(x) is defined as follows:

$$\begin{aligned} D(p||q)=\sum _{x\in X}p(x)log\frac{p(x)}{q(x)} \end{aligned}$$
(5)

2.3.3 Mutual Information (MI)

MI is a measure used to quantify the amount of information that one random variable contains about another random variable [25]. It reflects the degree of correlation between the variables, with higher values indicating stronger correlation. The MI between two discrete variables X and Y is defined as follows:

$$\begin{aligned} I(X;Y)=\sum _{x\in X}\sum _{y\in Y}p(x,y)log\frac{p(x,y)}{p(x)p(y)}=D(p(x,y)||p(x)p(y)) \end{aligned}$$
(6)

where p(x,y) denotes the joint probability distribution of x and y, and p(x) and p(y) refer to the marginal probability distributions of x and y, respectively.

The relationship between MI and entropy can be described as follows:

$$\begin{aligned} I(X;Y)=H(X)+H(Y)-H(X,Y) \end{aligned}$$
(7)
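To make the use of entropy and MI concrete, the following sketch estimates H(X) and I(X;Y) for discrete samples from empirical frequencies (Eqs. (4) and (6)). Continuous features would first need to be discretized, and the simple plug-in estimator shown here is only one of several possibilities.

```python
import numpy as np

def entropy(x):
    """H(X) of a discrete sample, Eq. (4), in nats."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, y):
    """Estimate I(X;Y) of two discrete samples from empirical frequencies, Eq. (6)."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

# A feature that perfectly matches the class label carries I(X;Y) = H(Y)
labels  = np.array([0, 0, 1, 1, 1, 0])
feature = np.array([0, 0, 1, 1, 1, 0])
print(mutual_information(feature, labels), entropy(labels))  # both ~0.693 nats
```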

3 The Proposed Method

In this section, in order to improve the quality of the initial solutions and to expedite convergence, a novel particle swarm initialization strategy is proposed that couples prior information into the initialization process and enhances the explainability of the selected features. At the same time, an adaptive hybrid perturbation strategy is proposed to keep the PSO algorithm from falling into local optima. The details of the two strategies are as follows.

3.1 A Novel Initialization Strategy

To enhance the dispersion of particles and improve the quality of the initial solutions, the interrelation between features and class labels must be thoroughly taken into account. In this paper, mutual information is used as the metric to assess the correlation between features and labels; a higher mutual information value indicates a stronger relevance between a feature and the labels. To ensure the diversity of particles, half of the particles in the population are initialized using feature-label guidance, while the other half are initialized randomly. The overall process is illustrated in Algorithm 1.

Algorithm 1 The proposed initialization strategy

3.1.1 The Initialization Strategy Based on Feature-Label Correlation Information

FS can be viewed as a binary optimization problem since it entails making decisions on whether to select or exclude features. While binary PSO can directly encode particle positions as binary values, continuous PSO has shown better performance in FS [26]. Therefore, in this paper, continuous PSO is employed to adjust the position information of particles in the FS algorithm. Nonetheless, evaluating fitness in continuous particle swarm algorithms is challenging, requiring the conversion of real values to binary values before fitness evaluation. In the conversion process, most PSO-based feature selection algorithms encode particle position information in the range of [0, 1] and use a fixed conversion threshold. However, this fixed feature encoding interval and conversion threshold do not adequately incorporate the correlation information between features and categories. To tackle this problem, we propose a feature encoding interval setting strategy based on feature-label correlation.

Different feature coding intervals are set according to the magnitude of the correlation value. This paper divides the encoding interval of features into two categories: one sets the lower bound of the feature encoding interval (\(X_{lb}\)), and the other sets the upper bound of the feature encoding interval (\(X_{ub}\)). The rules for setting the interval bounds are as follows.

\(X_{lb}\) is set when the correlation value between features and categories exceeds the average correlation value across all features and labels. Conversely, \(X_{ub}\) is set when the correlation value is below the average. The calculation formulas are shown in Eq.(8) and Eq.(9).

$$\begin{aligned} X_{lb}=\alpha *\frac{I(f_{j},C)}{max(I(f,C))},\quad j=1,2,\ldots ,D,\ \alpha =0.2 \end{aligned}$$
(8)
$$\begin{aligned} X_{ub}=T+\beta *\frac{I(f_j,C)}{max(I(f,C))},\quad j=1,2,\ldots ,D,\ \beta =0.4 \end{aligned}$$
(9)

where \(I(f_j,C)\) represents the MI between feature \(f_j\) and the class label C, and T represents the selection threshold. \(\alpha \) and \(\beta \) are two adjustment factors, whose exact values are discussed in Sect. 4.6.

The encoding process of the features is as follows. Taking a dataset with D-dimensional features as an example, the position information of the ith particle can be represented by a string of D-dimensional real-valued data, denoted as the vector \(F_{i}=(x_{i,1},x_{i,2},x_{i,3},\ldots ,x_{i,D})\). The range of values of each component in \(F_i\) is divided into two cases, as shown in Eq. (10):

$$\begin{aligned} x_{i,j}\in {\left\{ \begin{array}{ll}[X_{lb},1],&{}I(f_j,C)>MeanMI\\ [0,X_{ub}],&{}I(f_j,C)\le MeanMI\end{array}\right. },\quad i=1,2,\ldots ,N,\ j=1,2,\ldots ,D \end{aligned}$$
(10)

where MeanMI represents the mean of all feature-label relevance values.

Based on the equation above, it can be inferred that, given a fixed selection threshold, a higher mutual information value means that a larger proportion of the feature's encoding interval lies above the selection threshold. This ensures that features with stronger relevance have a higher probability of being selected.

The random initialization of particle position information is shown in Eq.(11).

$$\begin{aligned} x_{i,j}\in [0,1],i=1,2,3,\dots ,N,j=1,2,3,\dots ,D \end{aligned}$$
(11)

Similar to the approach used in HMPSOFS [27], this paper uses a fixed binarization threshold. The particle's position in each dimension is converted into a binary value according to this threshold, as shown in Eq. (12): \(F_{i,j}\) is set to 1 when \(x_{i,j}\) is greater than T, and to 0 otherwise.

$$\begin{aligned} {F_{i,j}}=\left\{ \begin{matrix}1,\quad x_{i,j}>T\\ 0,\quad x_{i,j}\le T\end{matrix}\right. \end{aligned}$$
(12)

where \(F_{i,j}\) denotes the jth feature of the feature subset \(F_i\); \(F_{i,j}=1\) indicates that the feature is selected and \(F_{i,j}=0\) that it is not. Following previous studies [13, 22, 23], the threshold T is set to 0.6 in this paper.
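The sketch below is a simplified re-implementation of the interval setting and binarization described by Eqs. (8)-(12), not the authors' MATLAB code. It takes a precomputed vector of feature-label MI values, and the default parameter values simply mirror the ones quoted above (their tuning is discussed in Sect. 4.6).

```python
import numpy as np

def guided_initialization(mi, n_particles, alpha=0.2, beta=0.4, T=0.6, seed=0):
    """Sample guided particle positions from a vector of feature-label MI values.

    Implements the encoding intervals of Eqs. (8)-(10) and the binarization of Eq. (12).
    """
    rng = np.random.default_rng(seed)
    mi = np.asarray(mi, dtype=float)
    D = mi.size
    mi_norm = mi / mi.max()
    mean_mi = mi.mean()

    lb = np.zeros(D)                            # default encoding interval is [0, 1]
    ub = np.ones(D)
    strong = mi > mean_mi                       # features with above-average relevance
    lb[strong] = alpha * mi_norm[strong]        # Eq. (8): raise the lower bound
    ub[~strong] = T + beta * mi_norm[~strong]   # Eq. (9): lower the upper bound

    positions = lb + rng.random((n_particles, D)) * (ub - lb)   # Eq. (10)
    masks = (positions > T).astype(int)                         # Eq. (12)
    return positions, masks

# Five features, the first two strongly related to the class label
pos, masks = guided_initialization(mi=[0.80, 0.60, 0.10, 0.05, 0.02], n_particles=4)
print(masks)  # features with higher MI are selected with much higher probability
```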

3.2 The Adaptive Hybrid Mutation Strategy

To leverage the correlation information between features and labels more effectively and prevent the particle swarm from converging to local optima, an adaptive hybrid mutation strategy is introduced. The age threshold from dMOPSO [28] is adopted to determine whether a particle has fallen into a local optimum. In the early stage of the algorithm, the particle swarm exhibits strong search ability and the individual best positions of the particles are updated continuously during the search process. However, as the population evolves, the search ability of the particles gradually declines and the particles easily enter a stagnant state. When the age of a particle is below the predetermined threshold, the particle still possesses good search ability, so it is slightly perturbed using non-uniform mutation [29]. On the other hand, when a particle's age exceeds the preset limit, the particle is likely trapped in a local optimum and requires a larger perturbation, so an adaptive mutation approach is applied to help the particle break free from the local optimum and explore other regions of the solution space. The detailed process is outlined in Algorithm 2.

Algorithm 2 The adaptive hybrid mutation strategy

3.2.1 Non-uniform Mutation

The non-uniform mutation operator \(\varphi \) produces a perturbation whose magnitude decreases as the number of iterations increases. Throughout the iterations, the PSO algorithm pursues a balance between exploration and exploitation. In the early stage of the iteration, increasing the exploration intensity makes the algorithm more likely to find the global optimum or a solution close to it, so a larger perturbation improves the global search ability of the particles. In the later stages, when the search space has narrowed and the swarm is closer to the global optimum, local search becomes more important; the perturbation is therefore reduced and the exploitation of existing good solutions is increased.

$$\begin{aligned} x_{i,j}={\left\{ \begin{array}{ll}x_{i,j}+Pbest_{i,j}*(1-r^{(\varphi )^\lambda }),r\le 0.5\\ x_{i,j}-Pbest_{i,j}*(1-r^{(\varphi )^\lambda }),r>0.5\end{array}\right. } \end{aligned}$$
(13)
$$\begin{aligned} \varphi =1-\frac{t}{maxIt} \end{aligned}$$
(14)

where r is a random number in the range [0, 1], t is the current iteration number of the population, and maxIt is the maximum number of iterations. \(\lambda \) is a system parameter that determines how strongly the random perturbation depends on the iteration number; based on related research [30, 31], \(\lambda \) is set to 3 in the proposed algorithm.
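A minimal sketch of the non-uniform mutation of Eqs. (13) and (14), applied to a single position component, is given below; the handling of out-of-range positions is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def non_uniform_mutation(x_ij, pbest_ij, t, max_it, lam=3):
    """Perturb one position component according to Eqs. (13)-(14)."""
    phi = 1.0 - t / max_it                       # Eq. (14): shrinks as t grows
    r = rng.random()
    step = pbest_ij * (1.0 - r ** (phi ** lam))  # perturbation magnitude
    return x_ij + step if r <= 0.5 else x_ij - step

# Early iterations yield large perturbations, late iterations nearly none
print(non_uniform_mutation(0.7, 0.9, t=1,  max_it=100))
print(non_uniform_mutation(0.7, 0.9, t=99, max_it=100))
```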

3.2.2 Adaptive Mutation

The adaptive mutation strategy further utilizes the prior information contained in the feature data, calculating the mutation probability of each feature according to the performance of the particle itself, combined with the selection probability of the feature and whether the feature is selected.

Firstly, the performance of the particle is defined. Without considering any preferences, the Euclidean distance between the particle’s position in the target space and the origin of the target space is employed as a metric to evaluate the particle’s performance. A smaller distance indicates better performance for the particle. It is calculated as follows:

$$\begin{aligned} performance_i = \left\| \overline{f(x)}\right\| _2 \end{aligned}$$
(15)

where \(performance_i\) denotes the performance of the ith particle and \(\overline{f(x)}\) denotes the target vector of particle i.

Next, the probability of a feature being selected is calculated based on the feature encoding interval set during initialization. The feature coding interval length is \(1-X_{lb}\) or \(X_{ub}\) and a fixed selection threshold T is used. The selection probability of a feature (\(P_s\)) is defined as follows:

$$\begin{aligned} P_s={\left\{ \begin{array}{ll}\dfrac{1-T}{1-X_{lb}},&{}x_{i,j}\in [X_{lb},1]\\ \dfrac{X_{ub}-T}{X_{ub}},&{}x_{i,j}\in [0,X_{ub}]\end{array}\right. } \end{aligned}$$
(16)

For a feature \(f_j\), as shown in Eq. (16), different feature encoding intervals correspond to different selection probabilities.

Finally, the mutation probability is calculated based on whether the feature is selected, which leads to the following two cases.

Case 1: If the feature is selected, its mutation probability will be calculated as follows:

$$\begin{aligned} MP=exp(-P_s)*(1-performance_i) \end{aligned}$$
(17)

Case 2: If the feature is not selected, the probability of its mutation will be calculated as follows:

$$\begin{aligned} MP=(1-exp(-P_s))*(1-performance_i) \end{aligned}$$
(18)

An example of a specific perturbation is shown in Fig. 1. When the MP of a feature exceeds a generated random number, the mutation operation is executed; otherwise, it is not.

Fig. 1 An example of a specific perturbation
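The sketch below combines Eqs. (15)-(18) for a single particle: the mutation probability of each feature depends on the particle's distance to the origin in objective space, the selection probability implied by its encoding interval, and whether the feature is currently selected. The bound vectors lb and ub are assumed to come from the initialization strategy of Sect. 3.1.1, and the numeric example values are hypothetical.

```python
import numpy as np

def adaptive_mutation_probability(objectives, x, lb, ub, T=0.6):
    """Per-feature mutation probabilities for one particle, Eqs. (15)-(18)."""
    x = np.asarray(x, dtype=float)
    lb = np.asarray(lb, dtype=float)
    ub = np.asarray(ub, dtype=float)

    performance = np.linalg.norm(np.asarray(objectives, dtype=float))  # Eq. (15)
    selected = x > T                                                   # current mask, Eq. (12)

    # Eq. (16): selection probability implied by each feature's encoding interval
    strong = lb > 0                                  # features encoded in [X_lb, 1]
    p_s = np.where(strong, (1 - T) / (1 - lb), (ub - T) / ub)

    # Eq. (17) for selected features, Eq. (18) for unselected ones
    return np.where(selected,
                    np.exp(-p_s) * (1 - performance),
                    (1 - np.exp(-p_s)) * (1 - performance))

# Two objectives (subset ratio, error rate) and a hypothetical 3-feature particle
mp = adaptive_mutation_probability(objectives=[0.4, 0.15],
                                   x=[0.8, 0.3, 0.7],
                                   lb=[0.2, 0.0, 0.0],
                                   ub=[1.0, 0.65, 0.62])
print(mp)
```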

3.3 The Framework of the Proposed Method

Algorithm 3 outlines the general framework of fMOPSO-FS, which primarily comprises two phases. The first is the initialization stage. The mixed initialization method is used to initialize the population: as outlined in Algorithm 1, some particles are initialized based on prior information derived from the feature data, while the remaining particles are initialized randomly. Additionally, the external archive and the particle ages are initialized. The main loop constitutes the second phase, which mainly involves the evaluation of particles, the adaptive hybrid mutation, and the update of the external archive. As depicted in Algorithm 2, when the age of a particle exceeds the predefined age threshold, adaptive mutation is performed; otherwise, non-uniform mutation is applied. The adaptive hybrid perturbation strategy helps particles break out of local optima and increases population diversity. As the external archive is continuously updated, the final leader archive serves as the final outcome. In this paper, minimizing the feature subset size and minimizing the classification error rate are chosen as the evaluation functions, which are conflicting objectives. Minimizing the feature subset size is denoted as \(f_1\) and minimizing the classification error rate as \(f_2\). These two evaluation functions are calculated according to Eq. (19) and Eq. (20), respectively.

$$\begin{aligned} f_1=\frac{S_i}{D},\quad S_i=\sum _{j=1}^{D}F_{i,j} \end{aligned}$$
(19)

where \(S_i\) represents the number of features in the feature subset \(F_i\) and D represents the total count of features in the dataset.

$$\begin{aligned} f_{2}=\frac{(FP+FN)}{(FP+FN+TP+TN)} \end{aligned}$$
(20)

where FP, FN, TP and TN represent false positive, false negative, true positive and true negative respectively.
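As a concrete example of the two evaluation functions, the sketch below computes f1 from a binary feature mask (Eq. (19)) and f2 as the misclassification rate implied by Eq. (20); the classifier that would produce the predictions is omitted here.

```python
import numpy as np

def evaluate_particle(mask, y_true, y_pred):
    """Objective values of one particle: f1 = subset-size ratio, f2 = error rate."""
    mask = np.asarray(mask)
    f1 = mask.sum() / mask.size                               # Eq. (19)
    f2 = np.mean(np.asarray(y_true) != np.asarray(y_pred))    # Eq. (20)
    return f1, f2

print(evaluate_particle([1, 0, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]))  # (0.4, 0.25)
```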

Algorithm 3 Framework of fMOPSO-FS

3.4 Computational Complexity Analysis

The proposed algorithm mainly includes two stages: initialization and the main loop. The initialization phase mainly involves initializing the velocities and positions of the particles as well as the external archive. The main loop phase includes the particle search process and the selection of the globally best particles. The main time cost of the initialization stage is calculating the correlation between features and labels, with a time complexity of \(O(D+N)\), where D is the feature dimension and N is the number of particles. The time complexity of the main loop stage is mainly determined by the search and update of the particles. The main cost of the particle search is calculating the mutation probabilities of the features, with a time complexity of \(O(N^{2}+D)\), while in the particle update the time complexity mainly depends on the selection of leader particles, which is O(N). If the selection and update of particles are carried out serially, the time complexity of the main loop stage is \(O(N^{2}+D+N)\). Since \(N^{2}\) is much larger than N and D, the overall time complexity of the proposed algorithm is \(O(N^{2})\). Compared with other similar PSO-based feature selection algorithms, although the proposed algorithm adds the calculation of feature-label correlations and feature mutation probabilities, the added time complexity is constant-level, so the overall time complexity does not increase.

4 Experiments and Discussion

4.1 Methods of Comparison and Corresponding Parameter Configurations

In this section, we select a series of multi-objective FS algorithms for comparison with fMOPSO-FS, covering several classical and state-of-the-art multi-objective optimization algorithms. The classical multi-objective optimization algorithms are MOPSO [32], NSGAIII [33] and MOEA/D [34], and the advanced multi-objective optimization algorithms are HMPSOFS [27], RFPSOFS [35], MOEA/D-COPSO [36] and AGMOPSO [15]. All four of these advanced algorithms employ PSO to discover optimal solutions.

To guarantee the impartiality of the comparative experiments, each dataset is first randomly partitioned into two subsets: the training set comprises 70% of the data, while the remaining 30% is designated as the test set. Additionally, 10-fold cross validation is employed to evaluate the model; this helps mitigate the risk of overfitting on the training set and enhances the reliability of the training process. The classification error rate of each particle is computed using the K Nearest Neighbor (KNN) classifier with k set to 5. The parameter settings of the different algorithms are presented in Table 1. All algorithms are implemented in MATLAB R2020b and run on an Intel(R) Core(TM) i5-8265U CPU at 1.80 GHz with 8 GB RAM.
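The evaluation protocol described above can be sketched as follows using scikit-learn (70/30 split, 10-fold cross-validation on the training portion, 5-NN as the wrapper classifier). The actual experiments were run in MATLAB, so this is only an illustrative equivalent, and the function name is ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def training_error_rate(X, y, mask, seed=0):
    """10-fold CV error of a 5-NN wrapper on the selected feature columns."""
    X_sel = X[:, np.asarray(mask, dtype=bool)]
    # 70/30 split; the held-out 30% would be used for the test-set metrics
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    return 1.0 - cross_val_score(knn, X_tr, y_tr, cv=10).mean()
```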

Table 1 Parameter configurations for seven algorithms

4.2 Performance Metrics

To gauge the effectiveness of the comparison algorithms and the fMOPSO-FS algorithm, two commonly used metrics, hypervolume (HV) and inverted generational distance (IGD), are employed. HV and IGD are among the most representative measures for evaluating the performance of multi-objective optimization algorithms.

The HV metric was initially introduced by Zitzler et al. [37]. The diversity and convergence of an algorithm are assessed by measuring the volume of the hypercube formed by the individuals in the Pareto solution set and the reference point in the objective space; the larger the HV value, the better the Pareto front set. In this paper, the reference point is set to (1.0, 1.0) based on the design of the objective functions. The formula to calculate the HV is as follows:

$$\begin{aligned} HV=\delta (\cup _{i=1}^{|S|}v_i) \end{aligned}$$
(21)

where \(\delta \) is the Lebesgue measure, |S| denotes the number of non-dominated solutions obtained by the algorithm, and \(v_i\) denotes the hypervolume formed by the reference point and the ith solution in the solution collection.
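For the bi-objective case considered here, the HV with reference point (1.0, 1.0) can be computed by summing the rectangles contributed by consecutive points of a non-dominated front, as in the following sketch.

```python
import numpy as np

def hypervolume_2d(front, ref=(1.0, 1.0)):
    """HV of a non-dominated bi-objective front (minimization) w.r.t. a reference point."""
    pts = np.asarray(front, dtype=float)
    pts = pts[np.argsort(pts[:, 0])]          # sort by the first objective
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # rectangle contributed by this point
        prev_f2 = f2
    return hv

print(hypervolume_2d([[0.2, 0.3], [0.4, 0.1]]))  # 0.8*0.7 + 0.6*0.2 = 0.68
```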

The IGD is a comprehensive metric for evaluating algorithm performance, and is mainly used to evaluate the convergence performance and distribution performance of the algorithm [38]. A lower IGD value indicates better overall performance of the algorithm in terms of convergence and distribution. However, in multi-objective feature selection problems, there is no true PF available. Therefore, in this paper, the set of non-dominated solutions generated by all compared algorithms and the proposed algorithm in 30 independent runs is considered as the surrogate Pareto front. The calculation of the IGD is performed as follows:

$$\begin{aligned} IGD(P_s,P^*)=\frac{\Sigma _{x\in P^*}min_{y\in P_s}Dis(x,y)}{|P^*|} \end{aligned}$$
(22)

where \(P_s\) represents the set of Pareto optimal solutions obtained by the algorithm and \(P^*\) denotes a collection of uniformly distributed reference points sampled from the true PF. Dis(x, y) is the Euclidean distance between point x in \(P^*\) and point y in \(P_s\).
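Likewise, Eq. (22) amounts to averaging, over the reference points, the distance to the nearest obtained solution, e.g.:

```python
import numpy as np

def igd(obtained, reference):
    """Mean distance from each reference-front point to its nearest obtained solution, Eq. (22)."""
    P_s = np.asarray(obtained, dtype=float)
    P_star = np.asarray(reference, dtype=float)
    d = np.linalg.norm(P_star[:, None, :] - P_s[None, :, :], axis=2)
    return d.min(axis=1).mean()

print(igd(obtained=[[0.2, 0.3], [0.4, 0.1]],
          reference=[[0.2, 0.25], [0.35, 0.1], [0.5, 0.05]]))
```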

4.3 Experimental Analysis on UCI Datasets

To evaluate the performance of fMOPSO-FS, seven UCI datasets are selected as experimental datasets in this subsection. The details of the datasets are presented in Table 2 [39].

Table 2 Details of relevant UCI datasets

Tables 3, 4, 5 and 6 show the mean and standard deviation of the HV and IGD values obtained by the fMOPSO-FS algorithm and the comparison algorithms on the seven UCI datasets. The symbols '\(\uparrow \)', '\(\downarrow \)' and '\(\circ \)' indicate that the comparison algorithm outperforms, underperforms, or approximates fMOPSO-FS, respectively, while the values before and after the symbol '±' are the mean and standard deviation of the corresponding algorithm on the dataset. Since the sample data are not normally distributed, non-parametric tests are used to compare the differences, with judgments based on the obtained P values. If the difference is not significant, the performance of the two algorithms is considered close, denoted by '\(\circ \)'. If there is a significant difference, the evaluation follows the convention of each indicator: a larger HV value and a smaller IGD value indicate better algorithm performance, and '\(\uparrow \)' or '\(\downarrow \)' is assigned accordingly. Bold font marks the best-performing algorithm.

Table 3 HV values obtained for each algorithm on the training sets of each dataset
Table 4 IGD values obtained for each algorithm on the training sets of each dataset
Table 5 HV values obtained for each algorithm on the test sets of each dataset
Table 6 IGD values obtained for each algorithm on the test sets of each dataset
Fig. 2 Pareto fronts of the maximum HV value in the training set for each algorithm for each dataset

Fig. 3 Pareto fronts of the maximum HV value in the test set for each algorithm for each dataset

When analyzing the results from the training set perspective, as presented in Tables 3 and 4, it is evident that fMOPSO-FS outperforms MOPSO, MOEA/D and NSGAIII, and the HV and IGD values it obtains are close to those of MOEA/D-COPSO and AGMOPSO on the German dataset. The HV value obtained by the MOEA/D-COPSO algorithm is better than that of the fMOPSO-FS algorithm, but its IGD value is worse. The stability of fMOPSO-FS is slightly lower than that of MOEA/D-COPSO. However, compared with the other comparison algorithms, fMOPSO-FS consistently achieves better HV and IGD values across the various datasets.

According to the test set results shown in Tables 5 and 6, fMOPSO-FS achieves HV and IGD values comparable to HMPSOFS, RFPSOFS, MOEA/D-COPSO, and AGMOPSO on the German dataset. On the Sonar dataset, fMOPSO-FS performs similarly to HMPSOFS and MOEA/D-COPSO. On the Musk1 dataset, fMOPSO-FS exhibits HV and IGD values similar to RFPSOFS and MOEA/D-COPSO, but it outperforms them to emerge as the best algorithm overall. Meanwhile, AGMOPSO obtains a better IGD value on CNAE, but fMOPSO-FS obtains a better HV value. The MOEA/D-COPSO algorithm performs well on LSVT, with better HV and IGD values than fMOPSO-FS, and shows strong stability across the Sonar, Hillvalley, Musk1, and CNAE datasets. In contrast, fMOPSO-FS demonstrates excellent stability specifically on the Sonar and Hillvalley datasets. Nevertheless, fMOPSO-FS does not yield significantly improved results on the German dataset, which could be attributed to the dataset's weak feature-class correlations or to the method's disregard for potential redundancies among features. Overall, compared to the other algorithms, the fMOPSO-FS algorithm consistently achieves better performance.

To visually verify the aforementioned conclusions, Figs. 2 and 3 show the PFs with the maximum HV values obtained by each algorithm over 30 runs on the training and test sets of each dataset, respectively. In Fig. 2, it is evident that fMOPSO-FS is able to obtain Pareto solution sets with low classification error rates and compact feature subset sizes on the majority of the datasets. For the MultipleFeatures dataset, although the diversity of the solutions obtained by the fMOPSO-FS algorithm is not as extensive as that of the other comparative algorithms, the fMOPSO-FS algorithm obtains candidate solutions of better quality. Moving to Fig. 3, fMOPSO-FS continues to show its advantages on the test sets. When the same number of features is selected, the fMOPSO-FS algorithm outperforms both HMPSOFS and MOEA/D-COPSO on the Isolet5 dataset. While HMPSOFS and MOEA/D-COPSO exhibit a broader distribution of solution sets, their classification error rates are considerably higher than those of the fMOPSO-FS algorithm.

4.4 Comparison of Seven Feature Selection Methods using Different Classifiers

To verify the effectiveness of the proposed FS approach, this section classifies the feature subsets obtained by all algorithms using the classical classifiers SVM, Naive Bayes and KNN; the classification results are given in Tables 7, 8 and 9. In these tables, the mean classification accuracy on the training set is denoted by \(Tr_{Acc}\), while \(Te_{Acc}\) represents the mean classification accuracy on the test set.

Table 7 The median value of the classification accuracy achieved by each algorithm on the SVM classifier
Table 8 The median value of the classification accuracy achieved by each algorithm on the Naive Bayes classifier
Table 9 The median value of the classification accuracy achieved by each algorithm on the KNN classifier

Tables 7 and 8 reveal that for the German, Isolet5 and MultipleFeatures datasets, the feature subsets selected by fMOPSO-FS show better classification performance with the SVM and Naive Bayes classifiers than those of the other comparison algorithms. Although MOEA/D and NSGAIII demonstrate superior classification accuracy on most datasets, Figs. 2 and 3 show that it is difficult for these two algorithms to find solution sets with good diversity, and the feature subsets they select are significantly larger than those chosen by fMOPSO-FS. In Table 7, the classification accuracy of the fMOPSO-FS algorithm on the LSVT dataset differs significantly from that of the other comparison algorithms. This may be due to the low correlation between features and classes in the LSVT dataset, which hampers the selection of representative features and degrades classification performance. According to Table 9, the feature subsets obtained by MOEA/D demonstrate superior classification performance on most datasets. However, for the Hillvalley dataset, the feature subset chosen by RFPSOFS performs better than the other comparison algorithms, while for the Musk1 and Isolet5 datasets, fMOPSO-FS exhibits better classification performance. Table 10 shows the average number of features in the subsets selected by each algorithm. It can be seen that on most datasets the fMOPSO-FS algorithm selects fewer features than the other comparison algorithms, except on the German dataset. Although some comparison algorithms achieve better classification accuracy on some datasets, taking into account the number of selected features, the diversity of candidate solutions and the HV and IGD results, the fMOPSO-FS algorithm still performs better than the other algorithms.

4.5 Experimental Analysis on Gene Expression Datasets

The preceding subsection showcases the satisfactory performance of the proposed algorithm on conventional datasets, which are typically characterized by low feature dimensionality and a large number of samples. To verify that the fMOPSO-FS algorithm can also demonstrate its advantages on high-dimensional datasets, we selected six gene expression profile datasets, Colon, SRBCT, Lymphoma, Leukemia3, Lung and Kolod, which have high dimensionality and a small number of instances. Table 11 [40, 41] presents the specifications and details of these datasets. Tables 12 and 13 show the HV values obtained by the fMOPSO-FS algorithm and the other comparison algorithms on the training and test sets of the above six datasets. From the perspective of the training set (Table 12), the fMOPSO-FS algorithm obtains HV values on the Colon and Lung datasets that are close to those of the AGMOPSO algorithm, and better HV values than the other comparison algorithms. On the SRBCT and Leukemia3 datasets, the HV values obtained by fMOPSO-FS are only slightly lower than those obtained by the AGMOPSO algorithm. The HV value obtained by fMOPSO-FS on the Lung dataset is only slightly lower than that obtained by the MOEA/D-COPSO algorithm. On the Kolod dataset, fMOPSO-FS achieves better HV values than the other comparison algorithms. From the perspective of the test set (Table 13), the HV values obtained by fMOPSO-FS and AGMOPSO are similar on most datasets, but lower than those obtained by MOEA/D-COPSO on the Lung dataset. The HV values obtained on the Leukemia3 dataset are close to those obtained by the HMPSOFS and RFPSOFS algorithms. In summary, the fMOPSO-FS algorithm also performs well on high-dimensional datasets.

Table 10 Average number of selected features for each algorithm on different datasets
Table 11 Details of relevant gene expression profile datasets
Table 12 HV values obtained for each algorithm on the training sets of each dataset
Table 13 HV values obtained for each algorithm on the test sets of each dataset

4.6 Parameter Analysis

The proposed initialization strategy contains two adjustment factors \(\alpha \) and \(\beta \). Since the selection threshold is fixed at 0.6 and \(\beta \) affects the upper bound of the coding interval for less relevant features, setting \(\beta \) to 0.4 allows the feature coding interval to span [0, 1] in line with the random initialization interval, while also ensuring that features with higher relevance have a higher selection probability. Thus, this section focuses on the effect of \(\alpha \) on the quality and distribution of the particles. Fig. 4 depicts the initialization of the fMOPSO-FS algorithm on various datasets with different values of \(\alpha \). The figure reveals a gradual decline in the quality of the generated initial solutions as \(\alpha \) increases. Hence, it is unnecessary to set a large \(\alpha \) value, while a very small \(\alpha \) value has a negligible impact on the population. Consequently, we restrict the candidate values of \(\alpha \) to {0.1, 0.2, 0.3, 0.4, 0.5}. Notably, when \(\alpha \) is set to 0.1, the particles in the target space are positioned closer to the origin, which means the obtained initial solutions are of higher quality. Therefore, the value of \(\alpha \) is set to 0.1.

Fig. 4 Distribution of particle populations on different datasets with different \(\alpha \) values

4.7 Analysis of the Proposed Strategies

To further analyze the effectiveness of the algorithm, the proposed feature-label correlation-guided initialization strategy and adaptive perturbation strategy as well as the introduced mutual information theory are validated separately.

4.7.1 Initialization Strategy Analysis

In order to assess the effectiveness of the initialization strategy, we compare it with the random initialization strategy. In Fig. 5, PF represents the Pareto front approximated by the non-dominated solutions produced by fMOPSO-FS and all comparison algorithms over 30 independent runs. Fig. 5 clearly shows that the initial solutions generated by the proposed initialization strategy are closer to the Pareto front on most datasets. Although the impact on the Musk1 and SRBCT datasets is not significant, this is likely due to the limited correlation between the feature data and the class labels. It is worth noting that the hybrid initialization method still effectively enhances the diversity of initial solutions compared to a single initialization method: the initial solutions generated by a single initialization method are confined to a certain area of the target space, whereas the mixed initialization method covers more areas and expands the search range of the particles.

Fig. 5 Distribution of particles with different initialization strategies on different datasets

4.7.2 Analysis of Adaptive Hybrid Perturbation Strategies

To validate the efficacy of the adaptive hybrid perturbation strategy, the MOPSO-FL-FS algorithm retains only the feature-label-guided initialization strategy, and the fMOPSO-FS-F algorithm retains the feature-label-guided initialization strategy while using a perturbation strategy with a fixed mutation probability. Each algorithm is run independently 30 times, and Tables 14 and 15 show the HV values of the algorithms on the training and test sets of each dataset, respectively. From Tables 14 and 15, it can be seen that the HV values obtained by the fMOPSO-FS algorithm are significantly better than those obtained by the MOPSO-FL-FS algorithm on both the training and test sets. Compared with the fMOPSO-FS-F algorithm, although the HV values obtained on most datasets are similar, fMOPSO-FS still shows an advantage on some datasets. One possible reason is that the adaptive hybrid perturbation strategy dynamically adjusts the mutation probability based on the performance of the particles themselves, unlike a fixed mutation probability. This adaptiveness allows the strategy to fine-tune the trade-off between exploration and exploitation, leading to improved results.

Table 14 HV values obtained for each algorithm on the training sets of each dataset
Table 15 HV values obtained for each algorithm on the test sets of each dataset

4.7.3 Validation of the Validity of the Mutual Information Theory

The effectiveness of the MI-based guidance is also verified on the different datasets. In this section, the feature subsets obtained by running the fMOPSO-FS algorithm 30 times independently on each dataset are analysed, and Fig. 6 shows, for these datasets, the number of times each feature was selected together with its correlation with the class labels.

Fig. 6 Statistics on the frequency of feature selection

In Fig. 6, the X-axis denotes the feature index, determined by ranking the features' relevance to the class labels from highest to lowest. The left and right Y-axes represent the frequency of feature selection and the correlation between features and classes, respectively. From Fig. 6, it is evident that on the MultipleFeatures and CNAE datasets, as the correlation between a feature and the label increases, the frequency with which that feature is selected also increases; conversely, as the correlation decreases, the corresponding features are selected less often. On the gene expression profile datasets Colon, SRBCT and Lung, the trends of the correlation and the selection frequency of the selected features are roughly the same. For the other datasets, since the correlation between features and labels is less pronounced, no obvious regularity appears in the plots. It can be concluded that incorporating prior information guides the algorithm to select features with higher relevance to the class labels, thus obtaining a higher-quality feature subset and enhancing the explainability of the selected features.

5 Conclusions

The randomness and lack of knowledge guiding the initialization process of most existing MOPSO-based feature selection methods may lead the initial solutions to search, or even repeatedly search, meaningless regions of the search space during evolution, and the generated initial population may be far from the true Pareto front. Furthermore, the absence of sufficient selection pressure on the particle population during the later stages of iterative evolution predisposes the population to converge towards local optima. In order to improve the distribution of the initial population and the quality of the initial solutions, while keeping the particle swarm from getting stuck in local optima, an adaptive multi-objective particle swarm FS method guided by feature-label correlation is proposed in this paper. The method adopts a novel initialization strategy that makes full use of the prior knowledge in the feature data to obtain higher-quality initial solutions. Simultaneously, an adaptive hybrid mutation strategy is proposed to enable the particle swarm to escape local optima. This mutation strategy dynamically adjusts the mutation rate based on the convergence status of the swarm, facilitating exploration of the search space and reducing the likelihood of getting trapped in suboptimal solutions.

The experimental findings validate that the method has clear advantages in solving the multi-objective FS problem, but some problems remain to be tackled. Firstly, obtaining prior information on the feature data can be time-consuming, especially for datasets with large feature dimensions. Secondly, the method mainly considers the correlation between features and classes, so a small number of redundant features may still remain in the obtained feature subset. Hence, improving the efficiency of obtaining such prior information, incorporating inter-feature correlation, and further eliminating redundant features should be the primary areas of focus for future research.