Enhanced SparseEA for large-scale multi-objective feature selection problems

Chu, Shu-Chuan; Zhuang, Zhongjie; Pan, Jeng-Shyang; Mohamed, Ali Wagdy; Hu, Chia-Cheng

doi:10.1007/s40747-023-01177-2


Original Article
Open access
Published: 25 July 2023

Volume 10, pages 485–507, (2024)

Complex & Intelligent Systems


Shu-Chuan Chu^1,2,
Zhongjie Zhuang¹,
Jeng-Shyang Pan ORCID: orcid.org/0000-0002-3128-9025^1,3,
Ali Wagdy Mohamed^4,5 &
…
Chia-Cheng Hu⁶


Abstract

Large-scale multi-objective feature selection problems arise widely in text classification, image processing, and biological omics. A large number of features usually means more correlation and redundancy among them, so the effective features are usually sparse. SparseEA is an evolutionary algorithm for solving Large-scale Sparse Multi-objective Optimization Problems (i.e., problems in which most decision variables of the optimal solutions are zero). It determines feature Scores by calculating the fitness of individual features, which does not reflect the correlation between features well. In this manuscript, ReliefF is used to calculate the weights of the features, and unimportant features are removed first. The weights calculated by ReliefF are then combined with the Scores of SparseEA to guide the evolution process. Moreover, the Scores of features remain constant throughout all iterations of SparseEA. Therefore, the fitness values of excellent and poor individuals in each iteration are used to update the Scores. In addition, difference operators of Differential Evolution are introduced into SparseEA to increase the diversity of solutions and help the algorithm escape from local optima. Comparative experiments are performed on large-scale datasets selected from the scikit-feature repository. The results show that the proposed algorithm is superior to the original SparseEA and to state-of-the-art algorithms.


Introduction

Feature engineering is an important and critical part of machine learning [1]. Essentially, feature engineering is a process of representing data. In practice, feature engineering aims to remove defects and redundancy in the raw data and design more efficient features to describe the relationship between the solved problem and the prediction model. It is generally accepted that data and features determine the upper bound of the performance of machine learning, and the models and algorithms can only approximate this bound as best they can. Thereby, it can be seen that good data and features are the premise for models and algorithms to play an essential role. In detail, feature engineering usually includes feature availability assessment, feature cleaning, feature storage, feature selection, feature extraction, and so on. Among them, feature selection is an important part of feature engineering [2, 3]. The main application fields of feature selection include text classification [4, 5], image recognition [6], bio-information analysis [7], time series [8], intrusion detection [9], and software defect prediction [10].

The main idea of feature selection is to select the most valuable feature subset by deleting irrelevant and redundant features from the feature space of the original dataset, in order to improve the prediction accuracy, robustness, and interpretability of the models. The feature selection method was first proposed by Dash and Liu [11]. It can be divided into four steps: generating a feature subset, evaluating the feature subset, checking whether the stopping criterion is satisfied, and verifying the final result. Suppose there are n features, each of which can be selected or not; then there are $2^{n}$ possible feature subsets. When n is very large, obtaining the optimal feature subset by exhaustive search is clearly infeasible because of the time complexity. Therefore, finding the optimal feature subset in the feature space quickly and effectively is an important problem that must be solved [12].

Feature selection algorithms can be divided into different categories according to different criteria. Depending on whether class labels are used, they can be divided into supervised and unsupervised feature selection algorithms. In terms of search strategy, they can be categorized into global search, sequential search, and random search. Depending on how feature selection is combined with the machine learning algorithm, there are four types [13]: filter, wrapper, embedded, and ensemble. According to the evaluation criterion, feature selection algorithms can be based on distance measurement [14, 15], dependency measurement [16, 17], information measurement [18], or accuracy/error rate measurement [19]. In detail, the core of distance measurement is the distance formula; commonly used distances include the Euclidean distance, Hamming distance, and probability distance. Algorithms based on dependency measures use statistical principles to evaluate the correlation between features and categories, such as the t-test, Pearson correlation coefficient, and Fisher score. The information metrics include mutual information, information gain, minimum description length, etc. In particular, the algorithms based on the measurement of accuracy/error rate have the best overall performance. They train the classifier using the selected feature subset and measure the quality of the feature subset by the resulting accuracy/error rate.

Meta-heuristic algorithms are widely used because of their simplicity and generality [20, 21]. In recent years, more and more feature selection algorithms based on meta-heuristics have been proposed, and they rely on the measurement of accuracy/error rate [22]. Feature selection algorithms can be divided into single-objective and multi-objective feature selection according to the number of evaluation criteria. For a long time, feature selection was regarded as a single-objective optimization problem, which optimized either the weighted sum of the accuracy/error rate and the number of selected features, or the accuracy/error rate alone. There are many excellent studies on solving the single-objective feature selection problem with meta-heuristic algorithms. In 2020, an improved Binary Grey Wolf Optimizer was proposed and achieved good performance on the single-objective feature selection problem [23]. A surrogate-assisted evolutionary algorithm was proposed in [24], which solves the single-objective feature selection problem by decomposing the large-scale original problem into several small subproblems and establishing a surrogate-assisted model for each subproblem. In [25], a hybrid version of the Simulated Normal Distribution Optimizer with Simulated Annealing was proposed for feature selection, which uses Simulated Annealing as a local search to achieve higher classification accuracy. The Whale Optimization Algorithm was used for feature selection of high-dimensional data based on a spatial boundary strategy in [26]. A time-varying transfer function was applied to the Binary Dragonfly Algorithm for feature selection to balance exploration and exploitation, and it obtained excellent results [27]. A hybrid feature selection method based on Chi-square and the binary Particle Swarm Optimization algorithm was designed and applied to Arabic email authorship analysis in 2021 [28].

If the fitness value of a single-objective feature selection algorithm is set to a weighted sum with predetermined weights, the algorithm is not flexible enough. Algorithms that only consider the accuracy/error rate ignore the sparsity of the selected features. As a consequence, like most engineering and scientific problems in practice, feature selection can also be regarded as a multi-objective optimization problem [29, 30]. Multi-objective optimization algorithms usually optimize multiple conflicting objectives simultaneously [31, 32]. Evolutionary multi-objective optimization algorithms have gained popularity over the past decade and beyond [33, 34]. Multi-objective feature selection problems mainly optimize two objectives: maximizing the classification accuracy and minimizing the number of features. Multi-objective feature selection algorithms can provide a set of trade-off solutions for users to choose from, instead of a single solution. There are relatively few studies on the multi-objective feature selection problem. Two variants of the differential evolution (DE) algorithm, one using an angle competitive mechanism and one using a Euclidean distance competitive mechanism, were proposed in [35] and applied to the feature selection problem. In [36], a binary multi-objective grey wolf algorithm was proposed, and a wrapper-based Artificial Neural Network was used to assess the classification performance of the selected features for multi-objective feature selection. Paper [37] studies a new multi-objective feature selection approach based on Binary DE with self-learning, which achieves a trade-off between local exploitation and global exploration. A fast multi-objective evolutionary feature selection algorithm was proposed in [38] by embedding an improved Artificial Bee Colony algorithm [39] based on the particle update model. The authors of [40] combine binary encoding with real-valued encoding to exploit the advantages of the Genetic Algorithm and Direct Multi-Search for multi-objective feature selection on unbalanced production data, and obtain significantly good search performance. A multi-objective evolutionary algorithm was proposed for feature selection in learning to rank in [41] and achieved excellent performance.

In large-scale data, because of the large number of features, traditional feature selection algorithms become inefficient or even unable to process the data. However, Large-scale Sparse Multi-Objective Feature Selection Problems (LSMFSPs) have many application scenarios in real life. For example, in the field of text classification [5], the number of words commonly used in everyday life is on the order of $10^4$. In the field of image processing [42], if the image features are pixels, the number of features of a picture with a resolution of $1024\times 1024$ easily reaches the order of $10^6$. Biological omics data also usually have large-scale features: a DNA microarray chip can detect and obtain thousands of gene expression values at the same time [43]; there are hundreds of protein mass spectrum peaks and related biomarkers in protein mass spectrum data [44]; and there are often hundreds of chromatographic peaks in metabolic mass spectrometry data.

Large-scale data usually contain a lot of redundancy and require dedicated research. However, few studies specifically address LSMFSPs. LSMFSPs are a class of Large-scale Multi-Objective Problems (LMOPs). Evolutionary algorithms for solving LMOPs can generally be divided into three categories: divide-and-conquer, dimensionality reduction, and enhanced search-based approaches. A method based on random decomposition was proposed in [45], which improves the MOEA/D framework to enable it to handle LMOPs. Paper [46] proposes a customized evolutionary algorithm based on a decision variable clustering method. It uses k-means to divide the decision variables into convergence-related variables and diversity-related variables, and optimizes the two groups separately. A general, theoretically grounded yet simple approach was proposed in [47], which uses random embedding to scale current derivative-free multi-objective algorithms to high-dimensional non-convex multi-objective functions with low effective dimensions. As a dimension reduction method, it transforms the original decision space into a low-dimensional subspace. In [48], an enhanced search-based large-scale multi-objective algorithm was proposed, which incorporates a new solution generator with an external archive, thus forcing the search toward different subregions of the Pareto front using a dual local search mechanism. Paper [49] proposes a novel multi-objective large-scale cooperative co-evolutionary algorithm for three-objective feature selection, and designs a cooperative searching framework for seeking the optimal feature subset efficiently and effectively.

Paper [50] put forward the concept of Large-scale Sparse Multi-objective Optimization Problems (LSMOPs) in 2019, in which most decision variables of the optimal solutions are zero. That paper designs an evolutionary algorithm named SparseEA, which solves LSMOPs by constructing sparse solutions. In particular, LSMFSPs are specific applications of LSMOPs. The experimental results show that SparseEA performs excellently in solving LSMOPs. At present, few papers are dedicated to LSMOPs. The authors of [51] use two unsupervised neural networks, a restricted Boltzmann machine and a denoising autoencoder, to learn a sparse distribution and a compact representation of the decision variables for LSMOPs. The algorithm proposed in [52] uses an evolutionary pattern mining approach to detect the maximum and minimum candidate sets of the nonzero variables in the Pareto optimal solutions, and uses them to limit the dimensions when generating offspring solutions for LSMOPs. An improved SparseEA was proposed in [53] to enhance the connection between real variables and binary variables within the two-layer encoding scheme with the assistance of variable grouping techniques for LSMOPs.

Therefore, this manuscript proposes an enhanced SparseEA algorithm based on ReliefF with difference operators for solving LSMFSPs. The main contributions of this paper are summarized as follows:

1. A filter feature selection method is combined with SparseEA. ReliefF is used to calculate the weights of the features, and unimportant features are removed first.
2. The weights calculated by ReliefF are combined with the Scores of SparseEA to guide the evolution process. Meanwhile, an adaptive score update strategy is designed to address the problem that the Scores of the decision variables remain constant throughout all iterations.
3. Difference operators of DE are introduced into SparseEA to increase the diversity of solutions and help the algorithm escape from local optima.
4. SparseEA with hybrid difference operators is proposed to balance exploration and exploitation.
5. The proposed algorithm is compared with excellent algorithms proposed in the last 3 years for solving LSMFSPs. The experimental results verify the superiority of the proposed algorithm.

The rest of the paper is organized as follows. “SparseEA” presents the original SparseEA algorithm. “SparseEA based on reliefF” describes the proposed SparseEA based on the ReliefF strategy. “RA-SparseEA with difference operator” details SparseEA with binary difference operators. “Experiments” reports the experimental results and analysis. “Conclusion” summarizes the main work of the paper and gives some suggestions for future work.

SparseEA

SparseEA is an evolutionary algorithm for solving large-scale sparse multi-objective optimization problems. In SparseEA, a solution x consists of two components: a real vector (denoted as Dec), which records the best decision variables found so far, and a binary vector (denoted as Mask), which records the decision variables that should be set to zero. For instance, suppose the number of variables is $D=5$, $Dec = (0.5,0.3,0.2,0.8,0.1)$, and $Mask = (1,0,0,1,0)$. Then x is obtained by Eq. (1), and the final solution is $x=(0.5,0,0,0.8,0)$.

$$\begin{aligned}&(x_{1},x_{2},\ldots ,x_{D})\nonumber \\ {}&\quad =(Dec_{1}\times Mask_{1},Dec_{2}\times Mask_{2}, \ldots , Dec_{D}\times Mask_{D}). \end{aligned}$$

(1)
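Eq. (1) is an element-wise product, which the following minimal Python sketch reproduces for the example above. The function name `decode` is a hypothetical choice made for this illustration.

```python
def decode(dec, mask):
    """Element-wise product of the real vector Dec and the binary vector Mask (Eq. 1)."""
    if len(dec) != len(mask):
        raise ValueError("Dec and Mask must have the same length")
    return [d * m for d, m in zip(dec, mask)]


dec = [0.5, 0.3, 0.2, 0.8, 0.1]
mask = [1, 0, 0, 1, 0]
x = decode(dec, mask)
print(x)  # [0.5, 0.0, 0.0, 0.8, 0.0]
```

For a binary problem such as feature selection, Dec is all ones, so the decoded solution is simply the Mask.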

The framework of SparseEA, shown in Algorithm 1, is very similar to that of NSGA-II. However, the strategies for generating the initial population and the offspring differ from those of NSGA-II, and they ensure the sparsity of the generated solutions. To begin with, the Score of each variable is obtained from the fitness values and the population P of size N is initialized, as described in Algorithm 2. After that, fast non-dominated sorting and crowding distance calculation are performed on P. In the main loop, binary tournament selection is used to obtain 2N parent solutions. Then, N offspring are generated from the 2N parents by the new genetic operator, which is shown in detail in Algorithm 3. Finally, environmental selection is executed based on the front number and the crowding distance.

The initialization process of SparseEA consists of calculating the Scores of the variables and generating the initial population. In the first step, for real variables, a $D \times D$ random matrix is generated as Dec and a $D \times D$ identity matrix is used as Mask. The solutions are obtained by Eq. (1). Then, the fitness values of each solution are calculated and non-dominated sorting is executed to obtain the Score of each variable. For a binary problem, the Dec is a $D \times D$ matrix of ones and the Mask is again a $D \times D$ identity matrix, so the solutions also form a $D \times D$ identity matrix, equal to the Mask. In the ith solution $x_{i}$, all elements are 0 except the ith element, which is 1. Therefore, the fitness of $x_{i}$ can be viewed as the importance of the ith variable. In SparseEA, the Pareto front number of $x_{i}$ is used as the Score of the ith variable. In the next step, the initial population is obtained from an $N\times D$ Dec and an $N\times D$ Mask. The Dec is generated uniformly at random for real variables, whereas it is a matrix of ones for a binary problem. For every solution in Mask, binary tournament selection is performed $rand() \times D$ times on the variables, and each time the variable with the lower Score is set to 1. Thereby, at most $rand() \times D$ variables are set to 1 in a solution. This strategy ensures the sparsity of the population.
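The initialization for a binary problem can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_evaluate`, `RELEVANT`, and the helper names are hypothetical, the two minimized objectives (a made-up error rate and the fraction of selected features) stand in for a real classifier, and the non-dominated sort is a simple quadratic-time version.

```python
import random


def front_numbers(objs):
    """Pareto front number (1 = best) of each objective vector; all objectives minimized."""
    def dominates(a, b):
        return all(u <= v for u, v in zip(a, b)) and any(u < v for u, v in zip(a, b))

    fronts = [0] * len(objs)
    remaining = set(range(len(objs)))
    level = 0
    while remaining:
        level += 1
        current = [i for i in remaining
                   if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        for i in current:
            fronts[i] = level
        remaining -= set(current)
    return fronts


def init_scores(D, evaluate):
    """Score of variable i = front number of the solution that selects only variable i."""
    identity = [[1 if j == i else 0 for j in range(D)] for i in range(D)]
    return front_numbers([evaluate(x) for x in identity])


def init_population(N, D, scores, rng):
    """Build N sparse masks by repeated binary tournaments on the Scores (lower is better)."""
    population = []
    for _ in range(N):
        mask = [0] * D
        for _ in range(int(rng.random() * D)):      # rand() * D tournaments
            m, n = rng.randrange(D), rng.randrange(D)
            mask[m if scores[m] < scores[n] else n] = 1
        population.append(mask)
    return population


RELEVANT = {0, 3}  # hypothetical informative features of the toy problem


def toy_evaluate(mask):
    """Hypothetical objectives: (error rate, fraction of selected features)."""
    hits = sum(mask[i] for i in RELEVANT)
    return (1.0 - 0.4 * hits, sum(mask) / len(mask))


scores = init_scores(6, toy_evaluate)
population = init_population(4, 6, scores, random.Random(0))
print(scores)  # [1, 2, 2, 1, 2, 2]
```

Because every single-feature solution has the same sparsity, the front number here depends only on the toy error rate, so the two informative features land in the first front.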

The genetic operator is another key component that distinguishes SparseEA from NSGA-II. As shown in Algorithm 3, it consists of generating the Mask of the offspring and generating the Dec of the offspring. SparseEA adopts existing genetic operators for the Dec of the offspring. Specifically, if the decision variables are real numbers, the Dec is obtained by simulated binary crossover and polynomial mutation; if the decision variables are binary, it is simply set to a matrix of ones. The main contribution of the genetic operator of SparseEA is the crossover and mutation of the binary Mask. Each time, two parents p and q are randomly selected from $P'$ to generate an offspring o, and the Mask of o is first set equal to that of p. The crossover then flips one variable on which p.Mask and q.Mask differ. A random number decides, with equal probability, whether this variable is taken from the nonzero or the zero elements of p.Mask. If the random number is less than 0.5, two decision variables are randomly selected from $p.Mask \cap \overline{q.Mask}$ and the one with the larger Score is set to 0. Otherwise, two decision variables are chosen from $\overline{p.Mask} \cap q.Mask$ and the one with the smaller Score is set to 1. The mutation operator likewise flips a single variable: two decision variables are randomly selected from the nonzero elements of o.Mask or of $\overline{o.Mask}$, and the one with the larger Score is set to 0 or the one with the smaller Score is set to 1, respectively.
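The Mask crossover and mutation can be sketched as below. The sketch assumes, as in the initialization, that a lower Score marks a more promising variable, so the variable with the larger Score is switched off and the one with the smaller Score is switched on. The function names and the choice to skip a step when no candidate variable exists are assumptions of this illustration.

```python
import random


def tournament(candidates, scores, rng, prefer_low):
    """Binary tournament among candidate indices; returns None when there are none."""
    if not candidates:
        return None
    m, n = rng.choice(candidates), rng.choice(candidates)
    if prefer_low:
        return m if scores[m] < scores[n] else n
    return m if scores[m] > scores[n] else n


def offspring_mask(p, q, scores, rng):
    """Generate one offspring Mask from parent Masks p and q."""
    o = list(p)  # the offspring starts as a copy of parent p
    D = len(o)

    # Crossover: flip one variable on which p and q differ.
    if rng.random() < 0.5:
        cand = [i for i in range(D) if p[i] == 1 and q[i] == 0]
        i = tournament(cand, scores, rng, prefer_low=False)
        if i is not None:
            o[i] = 0  # drop the variable with the larger Score
    else:
        cand = [i for i in range(D) if p[i] == 0 and q[i] == 1]
        i = tournament(cand, scores, rng, prefer_low=True)
        if i is not None:
            o[i] = 1  # add the variable with the smaller Score

    # Mutation: flip one more variable of the offspring.
    if rng.random() < 0.5:
        cand = [i for i in range(D) if o[i] == 1]
        i = tournament(cand, scores, rng, prefer_low=False)
        if i is not None:
            o[i] = 0
    else:
        cand = [i for i in range(D) if o[i] == 0]
        i = tournament(cand, scores, rng, prefer_low=True)
        if i is not None:
            o[i] = 1
    return o


rng = random.Random(1)
scores = [1, 3, 2, 1, 2, 3]
p = [1, 1, 0, 0, 1, 0]
q = [0, 1, 1, 1, 0, 0]
child = offspring_mask(p, q, scores, rng)
print(child)
```

Each call changes at most two bits relative to p, one in crossover and one in mutation.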

SparseEA based on reliefF

It can be observed from line 11 in Algorithm 2 that the Score of a variable in SparseEA is the non-dominated Pareto front number of the corresponding solution. For the feature selection problem, the Score of the ith feature is the front number of the solution in which only the ith element is 1 and the rest are all 0. The fitness of a solution for the feature selection problem consists of sparsity and error rate. Since the sparsity of each of these solutions is 1/D, the Pareto front number of the solution is uniquely determined by the error rate. That is, the Score of the ith feature is decided only by the error rate in SparseEA. Moreover, the Scores remain constant throughout all iterations. Because features are correlated, calculating the fitness of single features only in the initial stage cannot reflect the importance of features well. However, evaluating all possible combinations of features is an NP-hard problem. Therefore, in this manuscript, the fitness values of excellent and poor individuals in each iteration are used to update the Scores of features.

In addition, the fitness value of a solution can only reflect the importance of the features from one point of view. Many traditional feature selection methods evaluate the importance of features based on different criteria. Therefore, in this section, we combine a traditional feature selection method with the SparseEA algorithm. Relief is a filter feature selection algorithm that updates feature weights by looking for the nearest neighbours of each sample. It evaluates the correlation and redundancy of features using neighbouring samples of the same class and of different classes. However, the Relief algorithm was designed to handle only two-class problems, so Kononenko extended Relief in 1994 into the ReliefF algorithm, which can handle multi-class data with better performance. The ReliefF algorithm determines the feature weights according to weighted differences between each sample in the original sample set, its neighbours of the same class, and its neighbours of different classes. The weights are then used, according to certain evaluation criteria, to distinguish features with strong correlation, weak correlation, and no correlation. In sparse large-scale feature selection problems, there are many redundant features. Therefore, in this manuscript, the ReliefF algorithm is first used to eliminate unimportant features and build a feature subset. This reduces the number of features and accelerates the algorithm, which to some extent offsets the time spent on running ReliefF. Furthermore, the weights calculated by the ReliefF algorithm are combined with the Scores of SparseEA to guide the evolution process.
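The following Python sketch shows one simplified way to compute ReliefF-style weights and to prune low-weight features. It is an illustration under assumptions, not the implementation used in the experiments: it visits every sample instead of drawing random ones, uses a Manhattan distance on range-normalized numeric features, and the names `relieff_weights` and `keep_top` are hypothetical.

```python
def relieff_weights(X, y, k=1):
    """Simplified ReliefF weights: reward features that separate classes, penalize
    features that differ between same-class neighbours."""
    n, d = len(X), len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]  # avoid division by zero

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    classes = sorted(set(y))
    prior = {c: y.count(c) / n for c in classes}
    w = [0.0] * d
    for i in range(n):
        for c in classes:
            neighbours = sorted((t for t in range(n) if t != i and y[t] == c),
                                key=lambda t: dist(X[i], X[t]))[:k]
            if not neighbours:
                continue
            # Hits (same class) lower the weight; misses raise it, scaled by class prior.
            coef = -1.0 if c == y[i] else prior[c] / (1.0 - prior[y[i]])
            for t in neighbours:
                for j in range(d):
                    w[j] += coef * diff(X[i], X[t], j) / (n * len(neighbours))
    return w


def keep_top(weights, fraction):
    """Indices of the best `fraction` of the features, ranked by weight."""
    d_prime = max(1, int(fraction * len(weights)))
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return sorted(ranked[:d_prime])


# Toy data: feature 0 separates the two classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.9, 0.4], [1.0, 0.8], [0.8, 0.1]]
y = [0, 0, 0, 1, 1, 1]
w = relieff_weights(X, y, k=1)
selected = keep_top(w, 0.5)
print(selected)  # [0]
```

On this toy data the informative feature receives a positive weight and the noisy one a negative weight, so halving the feature set keeps only feature 0.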

The framework of RA-SparseEA for feature selection is shown in Algorithm 4. First, the ReliefF algorithm is executed to obtain the feature weights $W_{rlf}$. Features with low $W_{rlf}$ values are removed, which reduces the number of features from D to $D'$. In this manuscript, we set $D' = 0.5 \times D$ for datasets with more than 1000 features; otherwise, $D' = D$. Then, as in SparseEA, Algorithm 2 is used to initialize the population in $D'$ dimensions, and the $Scores= [s_{1},\ldots ,s_{D'}]$ are obtained. The difference is that the Scores are set to the fitness values rather than the Pareto front numbers. Next, $W_{rlf}$ is used to guide the updating of the Scores: for the good features according to $W_{rlf}$, a value $\tau $ is added, $s_{i} = s_{i} + \tau $, and for the poor features, $s_{i} = s_{i} - \tau $ is calculated. After that, fast non-dominated sorting and crowding distance calculation are performed on P. In the main loop, the selection of the parent individuals $P'$ and the genetic operator are the same as in SparseEA. In the subsequent environmental selection stage, duplicated solutions are deleted and non-dominated sorting is performed on P first. The features selected in the non-dominated solutions are considered to deserve higher Scores, while the features selected in the solutions of the last Pareto front should have lower Scores. Accordingly, $s_{i} = s_{i} + \varepsilon $ and $s_{i} = s_{i} - \varepsilon $ are applied to the features selected in $F_{1}$ and $F_{last}$, respectively, where $F_{last}$ is the last Pareto front. To balance exploration and exploitation at different evolution stages, $\varepsilon $ is designed as a linearly increasing function, shown in Eq. (2), where t is the number of times a feature is selected in all non-dominated solutions, $\alpha $ is the step parameter, which is usually set to 0.01, and $\mathrm{Iter}/\mathrm{MaxIter}$ is the fraction of the search budget consumed so far, i.e., the number of consumed function evaluations FE divided by the maximum number of function evaluations MaxFE:

$$\begin{aligned} \varepsilon = t\times \alpha \times (\mathrm{{Iter}}/\mathrm{{MaxIter}}). \end{aligned}$$

(2)

The rest is the same as the environmental selection in SparseEA and is not repeated here.
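The Score update driven by Eq. (2) can be sketched in Python as follows. The text defines t as the selection count over the non-dominated solutions; how t is counted for features that appear only in $F_{last}$ is not spelled out, so this sketch assumes the count is taken within whichever front is being processed. The function names and the `progress` argument (the ratio in Eq. (2)) are hypothetical.

```python
def epsilon(t, progress, alpha=0.01):
    """Eq. (2): step size that grows linearly with search progress in [0, 1]."""
    return t * alpha * progress


def update_scores(scores, first_front, last_front, progress, alpha=0.01):
    """Raise the Scores of features used in F_1 and lower those used in F_last.

    first_front and last_front are lists of binary masks; a feature that is never
    selected in a front has t = 0 there and is left unchanged by that front.
    """
    updated = list(scores)
    for i in range(len(scores)):
        t_first = sum(mask[i] for mask in first_front)  # occurrences in F_1
        t_last = sum(mask[i] for mask in last_front)    # assumption: occurrences in F_last
        updated[i] += epsilon(t_first, progress, alpha)
        updated[i] -= epsilon(t_last, progress, alpha)
    return updated


scores = [0.5, 0.5, 0.5]
first_front = [[1, 0, 0], [1, 1, 0]]
last_front = [[0, 0, 1]]
new_scores = update_scores(scores, first_front, last_front, progress=0.5)
print(new_scores)
```

Halfway through the budget, feature 0 (selected twice in $F_{1}$) gains 0.01, feature 1 gains 0.005, and feature 2 loses 0.005; early in the run the same counts would move the Scores much less.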

RA-SparseEA with difference operator

In this section, RA-SparseEA with the difference operator (RA-SparseDO) is described in detail. As shown in the previous section, RA-SparseEA flips only one element per individual in the crossover operation and one in the mutation operation in each generation. This limits the diversity of the population and makes the algorithm prone to falling into local optima. Moreover, the Mask of the offspring is first copied from one parent ($o.Mask = p.Mask$); therefore, the parent p has a great influence on the offspring, while the parent q does not. The difference operator proposed in DE can obtain genetic information from multiple parents [54,55,56].

Feature selection is a binary problem, and there are some important binary variants of DE. In [57], a sigmoid transfer function is used to convert the mutation operator into binary form. A new taper-shaped transfer function was proposed and used to transform the continuous DE algorithm into binary form in [58]. Paper [59] makes use of binary operators such as xor, and, or, and not to generate trial solutions. An adaptive quantum-inspired DE was designed in [60] for solving the 0–1 Knapsack Problem. Pampara et al. [61] presented angle-modulated DE, which uses angle modulation to evolve the coefficients of a trigonometric function, thus allowing a mapping from continuous space to binary space. Scholars have also done some excellent work on multi-objective binary algorithms. A binary differential evolution algorithm with a self-learning strategy for multi-objective feature selection problems was designed in [37]. Non-dominated sorting binary differential evolution was proposed for cascading failure protection in complex networks in [62]. Paper [63] proposed a binary version of generalized differential evolution for multi-label feature selection based on majority voting of solutions and opposition-based learning. There are not many studies on using binary differential evolution to solve large-scale problems. A new self-adaptive binary variant of the differential evolution algorithm based on a measure of dissimilarity was proposed in [64] and used for solving high-dimensional knapsack problems.
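To make the transfer-function idea concrete, the sketch below applies the classical DE/rand/1 mutation to 0/1 vectors and maps the real-valued result back to bits with a sigmoid. It is a generic illustration of the technique, not the operator of any cited paper or of the proposed algorithm; the function name, the scale factor of 0.5, and the sampling rule are assumptions.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def binary_de_rand_1(population, target_index, rng, scale=0.5):
    """DE/rand/1 mutation on 0/1 vectors, binarized by a sigmoid transfer function."""
    others = [i for i in range(len(population)) if i != target_index]
    r1, r2, r3 = rng.sample(others, 3)  # three distinct individuals, none the target
    x1, x2, x3 = population[r1], population[r2], population[r3]
    mutant = []
    for a, b, c in zip(x1, x2, x3):
        v = a + scale * (b - c)                  # real-valued donor component
        mutant.append(1 if rng.random() < sigmoid(v) else 0)
    return mutant


rng = random.Random(0)
population = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 1],
]
mutant = binary_de_rand_1(population, target_index=0, rng=rng)
print(mutant)
```

Unlike the single-bit flips of the Mask operator, each mutant bit here depends on three parents. With inputs in {0, 1} and a scale of 0.5 the donor component stays in [-0.5, 1.5], so this plain sigmoid never produces probabilities close to 0 or 1.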

Table 1 The different mutation schemes of DE
