Introduction

Feature engineering is a critical part of machine learning [1]. Essentially, it is the process of representing data. In practice, feature engineering aims to remove defects and redundancy from the raw data and to design more effective features that describe the relationship between the problem to be solved and the prediction model. It is generally accepted that data and features determine the upper bound of machine learning performance, and that models and algorithms can only approximate this bound. Good data and features are therefore a prerequisite for models and algorithms to be effective. Feature engineering usually includes feature availability assessment, feature cleaning, feature storage, feature selection, feature extraction, and so on. Among these, feature selection is an important part [2, 3]. The main application fields of feature selection include text classification [4, 5], image recognition [6], bio-information analysis [7], time series [8], intrusion detection [9], and software defect prediction [10].

The main idea of feature selection is to select the most valuable feature subset by deleting irrelevant and redundant features from the feature space of the original dataset, so as to improve the prediction accuracy, robustness, and interpretability of the models. The feature selection method was first proposed by Dash and Liu [11]. It can be divided into four steps: generating a feature subset, evaluating the feature subset, checking whether the stopping criterion is met, and verifying the final result. Suppose there are n features, each of which can be selected or not; then there are \(2^{n}\) possible feature subsets. When n is very large, obtaining the optimal feature subset by exhaustive search is not feasible because of the time complexity. Therefore, finding the optimal feature subset in the feature space quickly and effectively is an important problem that must be solved [12].

Feature selection algorithms can be divided into different categories according to different separability criteria. According to the classification features, it can be divided into supervised and unsupervised feature selection algorithms. In terms of search strategies, feature selection algorithms can be categorized into global search, sequential search, and random search. Depending on the combination form of feature selection and machine learning algorithms, it includes four types [13]: filter, wrapper, embedded, and ensemble. With different evaluation criteria, feature selection algorithms can be divided into several categories: based on distance measurement [14, 15], dependency measurement [16, 17], information measurement [18], and accuracy/error rate measurement [19]. In detail, the core of distance measurement is distance formula, and commonly used distances are Euclidean distance, Hamming distance, Probability distance, and so on. Algorithms based on dependency measures use statistical principles to evaluate the correlation between features and categories, such as T test, Pearson correlation coefficient, and Fisher scores. The information metrics include mutual information, information gain, minimum description length, etc. In particular, the algorithms based on the measurement of accuracy/error rate have the best overall performance. They train the classifier using the selected feature subset and measure the performance of the feature subset by the accuracy/error rate.

Meta-heuristic algorithms are widely used because of their simplicity and generality [20, 21]. In recent years, more and more feature selection algorithms using meta-heuristic algorithms have been proposed, which are based on the measurement of accuracy/error rate [22]. Feature selection algorithms can be divided into single-objective and multi-objective feature selection according to the number of evaluation criteria. For a long time, feature selection was regarded as a single-objective optimization problem that optimized either the weighted sum of the accuracy/error rate and the number of selected features, or the accuracy/error rate alone. There are many excellent studies on solving the single-objective feature selection problem with meta-heuristic algorithms. In 2020, an improved Binary Grey Wolf Optimizer was proposed and achieved good performance on the single-objective feature selection problem [23]. A surrogate-assisted evolutionary algorithm was proposed in paper [24]; it solves the single-objective feature selection problem by decomposing the large-scale original problem into several small subproblems and establishing a surrogate-assisted model for each subproblem. In paper [25], a hybrid of the Simulated Normal Distribution Optimizer and Simulated Annealing was proposed for feature selection, using Simulated Annealing as a local search to achieve higher classification accuracy. In [26], the Whale Optimization Algorithm was used for feature selection of high-dimensional data based on a spatial boundary strategy. A time-varying transfer function was applied to the Binary Dragonfly Algorithm for feature selection to balance exploration and exploitation, and obtained excellent results [27]. A hybrid feature selection method based on Chi-square and the binary Particle Swarm Optimization algorithm was designed and applied to Arabic email authorship analysis in 2021 [28].

If the fitness value of a single-objective feature selection algorithm is set to a weighted sum whose weights are predetermined, the algorithm is not flexible enough. For algorithms that only consider the accuracy/error rate, the sparsity of the selected features is ignored. Consequently, like most engineering and scientific problems in practice, feature selection can also be regarded as a multi-objective optimization problem [29, 30]. Multi-objective optimization algorithms optimize multiple conflicting objectives simultaneously [31, 32]. Evolutionary multi-objective optimization algorithms have gained popularity over the past decade and beyond [33, 34]. Multi-objective feature selection problems mainly optimize two objectives: maximizing the classification accuracy and minimizing the number of features. Multi-objective feature selection algorithms provide a set of relatively optimal solutions for users to choose from, instead of a single solution. There are relatively few studies on the multi-objective feature selection problem. Two variants of the differential evolution (DE) algorithm, using an angle competitive mechanism and a Euclidean distance competitive mechanism, are proposed in paper [35] and applied to the feature selection problem. In [36], a binary multi-objective grey wolf algorithm was proposed, and a wrapper-based Artificial Neural Network was used to assess the classification performance of the selected features for multi-objective feature selection. Paper [37] studies a new multi-objective feature selection approach based on Binary DE with self-learning, which achieves a trade-off between local exploitation and global exploration. A fast multi-objective evolutionary feature selection algorithm is proposed in [38] by embedding an improved Artificial Bee Colony algorithm [39] based on the particle update model. The authors of paper [40] combine binary encoding with real-value encoding to exploit the advantages of the Genetic Algorithm and Direct Multi-Search for multi-objective feature selection of unbalanced production data, and obtain significantly good search performance. A multi-objective evolutionary algorithm for feature selection in learning to rank is proposed in paper [41] and achieves excellent performance.

In large-scale data, the large number of features reduces the efficiency of traditional feature selection algorithms, or even makes the data impossible for them to process. However, there are many application scenarios of Large-scale Sparse Multi-Objective Feature Selection Problems (LSMFSPs) in real life. For example, in the field of text classification [5], the number of words commonly used in everyday life is on the order of \(10^4\). In the field of image processing [42], if the image features are pixels, the number of features of a picture with a resolution of \(1024\times 1024\) easily reaches the order of \(10^6\). Biological omics data also usually have large-scale features: a DNA microarray chip can measure thousands of gene expression values at the same time [43]; there are hundreds of protein mass spectrum peaks and related biomarkers in protein mass spectrum data [44]; and there are often hundreds of chromatographic peaks in metabolic mass spectrometry data.

Large-scale data usually have a lot of redundancy and require special research. However, there are few studies that are specifically used to deal with LSMFSPs. LSMFSPs is one of Large-scale Multi-Objective Problems (LMOPs). Evolutionary algorithms for solving LMOPs can generally be divided into three categories: the divide-and-conquer, dimensionality reduction, and enhanced search-based approaches. A similar method based on random decomposition is proposed in [45], which improves the MOEA/D framework to enable it to handle LMOPs. Paper [46] proposes a customized evolutionary algorithm based on decision variable clustering method. It uses k-means to divide decision variables into convergence-related variables and diversity-related variables, and optimizes the two variables, respectively. A general, theoretically grounded yet simple approach was proposed in paper [47], which can scale current derivative-free multi-objective algorithms to the high-dimensional non-convex multi-objective functions with low effective dimensions, using random embedding. Based on dimension reduction, it transforms the original decision space into a low-dimensional subspace. In paper [48], an enhanced large-scale multi-objective algorithm based on search is proposed, which incorporates a new solution generator with an external archive, thus forcing the search toward different subregions of the Pareto front using a dual local search mechanism. Paper [49] proposes a novel multi-objective large-scale cooperative co-evolutionary algorithm for three-objective feature selection, and it designs a cooperative searching framework for seeking the optimal feature subset efficiently and effectively.

Paper [50] put forward the concept of Large-scale Sparse Multi-objective Optimization Problems (LSMOPs) in 2019, meaning problems in which most decision variables of the Pareto optimal solutions are zero. In that paper, an evolutionary algorithm named SparseEA is designed, which solves LSMOPs by constructing sparse solutions. In particular, LSMFSPs are specific applications of LSMOPs. The experimental results show that SparseEA performs very well in solving LSMOPs. At present, there are few papers dedicated to LSMOPs. The authors of paper [51] use two unsupervised neural networks, a restricted Boltzmann machine and a denoising autoencoder, to learn a sparse distribution and a compact representation of the decision variables for LSMOPs. The algorithm proposed in paper [52] uses an evolutionary pattern mining approach to detect the maximum and minimum candidate sets of the nonzero variables in the Pareto optimal solutions, and uses them to limit the dimensions in generating offspring solutions for LSMOPs. An improved SparseEA was proposed in paper [53] to strengthen the connection between real variables and binary variables within the two-layer encoding scheme with the assistance of variable grouping techniques for LSMOPs.

Therefore, this manuscript proposes an enhanced SparseEA algorithm based on ReliefF with difference operators for solving LSMFSPs. The main contributions of this paper are summarized as follows:

  1. A filter feature selection method is combined with SparseEA. ReliefF is used to calculate the weights of the features, and unimportant features are removed first.

  2. The weights calculated by ReliefF are combined with the Scores of SparseEA to guide the evolution process. In addition, an adaptive score update strategy is designed to address the problem that the Scores of decision variables remain constant throughout all iterations.

  3. Difference operators of DE are introduced into SparseEA to increase the diversity of solutions and help the algorithm escape from local optima.

  4. SparseEA with hybrid difference operators is proposed to balance exploration and exploitation.

  5. The proposed algorithm is compared with excellent algorithms proposed in the last 3 years for solving LSMFSPs. The experimental results verify the superiority of the proposed algorithm.

The rest of the paper is organized as follows. “SparseEA” presents the original SparseEA algorithm. “SparseEA based on reliefF” describes the proposed SparseEA based on the ReliefF strategy. “RA-SparseEA with difference operator” details SparseEA with binary difference operators. “Experiments” reports the experimental results and analysis. “Conclusion” summarizes the main work of the paper and gives some suggestions for further work.

SparseEA

SparseEA is an evolutionary algorithm for solving large-scale sparse multi-objective optimization problems. In SparseEA, a solution x consists of two components: a real vector (denoted as Dec) that records the best decision variables found so far, and a binary vector (denoted as Mask) that records which decision variables should be set to zero. For instance, suppose the number of variables is \(D=5\), \(Dec = (0.5,0.3,0.2,0.8,0.1)\), and \(Mask = (1,0,0,1,0)\). Then, x is obtained by Eq. (1), which gives the final solution \(x=(0.5,0,0,0.8,0)\).

$$\begin{aligned}&(x_{1},x_{2},\ldots ,x_{D})\nonumber \\ {}&\quad =(Dec_{1}\times Mask_{1},Dec_{2}\times Mask_{2}, \ldots , Dec_{D}\times Mask_{D}). \end{aligned}$$
(1)
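As a small illustration of Eq. (1), the element-wise product can be reproduced with the values of the example above. The use of NumPy and the helper name `decode` are assumptions made only for this sketch.

```python
import numpy as np

# Values taken from the worked example in the text (D = 5).
dec = np.array([0.5, 0.3, 0.2, 0.8, 0.1])  # real-valued part (Dec)
mask = np.array([1, 0, 0, 1, 0])           # binary part (Mask)


def decode(dec, mask):
    """Element-wise product of Eq. (1): x_i = Dec_i * Mask_i."""
    return dec * mask


x = decode(dec, mask)
print(x)
```

For a binary problem such as feature selection, Dec is a vector of ones, so the decoded solution coincides with Mask.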

The framework of SparseEA, shown in Algorithm 1, is very similar to that of NSGAII. However, the strategies for generating the initial population and the offspring differ from those of NSGAII and ensure the sparsity of the generated solutions. To begin with, the Scores of the variables are obtained from the fitness values and the population P of size N is initialized, as described in Algorithm 2. After that, fast non-dominated sorting and crowding distance calculation are performed on P. In the main loop, binary tournament selection is used to obtain 2N parent solutions. Then, N offspring are generated from the 2N parent solutions by the new genetic operator detailed in Algorithm 3. Finally, environmental selection is executed based on front number and crowding distance.

Algorithm 1 Framework of the SparseEA

Algorithm 2 Initialization strategy of SparseEA

The initialization process of SparseEA consists of calculating the Scores of the variables and generating the initial population. In the first step, for real variables, a \(D \times D\) random matrix is generated as Dec and a \(D \times D\) identity matrix is used as Mask. The solutions are obtained by Eq. (1). Then, the fitness values of each solution are calculated and non-dominated sorting is executed to obtain the Scores of each variable. For binary problems, however, Dec is a \(D \times D\) matrix of ones and Mask is again a \(D \times D\) identity matrix, so the solutions also form a \(D \times D\) identity matrix equal to Mask. In the ith solution \(x_{i}\), all elements are 0 except the ith element, which is 1. Therefore, the fitness of \(x_{i}\) can be viewed as the importance of the ith variable. In SparseEA, the Pareto front number of \(x_{i}\) is used as the Score of the ith variable. In the next step, an initial population is obtained from an \(N\times D\) Dec and an \(N\times D\) Mask. Dec is generated uniformly at random for real variables, whereas it is a matrix of ones for binary problems. For every solution of Mask, binary tournament selection is performed \(rand() \times D\) times on the variables, and each time the variable with the lower Score is set to 1. Thereby, at most \(rand() \times D\) variables are set to 1 in a solution. This strategy ensures the sparsity of the population.
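The second step of the initialization can be sketched for the binary case as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `init_masks`, the NumPy random generator, the example Scores, and the truncation of \(rand() \times D\) to an integer are all choices made here.

```python
import numpy as np


def init_masks(scores, n_solutions, rng):
    """Build sparse binary masks by repeated binary tournaments on Scores.

    scores[j] is the Score of variable j; a lower value is treated as
    better, as with a Pareto front number.
    """
    d = len(scores)
    masks = np.zeros((n_solutions, d), dtype=int)
    for i in range(n_solutions):
        # rand() * D tournaments, truncated to an integer (assumption).
        n_tournaments = int(rng.random() * d)
        for _ in range(n_tournaments):
            a, b = rng.integers(0, d, size=2)
            winner = a if scores[a] <= scores[b] else b
            masks[i, winner] = 1  # the tournament winner is switched on
    return masks


rng = np.random.default_rng(0)
scores = np.array([1, 3, 2, 5, 4, 1, 6, 2])  # hypothetical front numbers
masks = init_masks(scores, n_solutions=4, rng=rng)
print(masks)
```

Because the same variable can win several tournaments, the number of ones in a row never exceeds the number of tournaments, which is what keeps the initial population sparse.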

Algorithm 3 Genetic Operator of SparseEA

The genetic operator is another key component that distinguishes SparseEA from NSGAII. As shown in Algorithm 3, it consists of generating the Mask of the offspring and generating the Dec of the offspring. SparseEA adopts existing genetic operators for the Dec of the offspring. Specifically, if the decision variables are real numbers, Dec is obtained by performing simulated binary crossover and polynomial mutation; if the decision variables are binary, Dec is simply set to a matrix of ones. The main contribution of the genetic operator of SparseEA is the crossover and mutation of the binary mask. Each time, two parents p and q are randomly selected from \(P'\) to generate an offspring o, and the binary vector Mask of o is first set to that of p. The crossover then flips one variable on which p.Mask and q.Mask differ. In detail, a random number determines, with equal probability, whether this variable is selected from the zero elements or the nonzero elements of the binary vector Mask. If the random number is less than 0.5, two decision variables are randomly selected from \(p.Mask \cap \overline{q.Mask}\) and the one with the bigger fitness is set to 0. Otherwise, two decision variables are chosen from \(\overline{p.Mask} \cap q.Mask\) and the one with the smaller fitness is set to 1. The mutation operator also flips a single variable: two decision variables are randomly selected either from the nonzero elements of o.Mask, in which case the one with the bigger fitness is set to 0, or from the nonzero elements of \(\overline{o.Mask}\), in which case the one with the smaller fitness is set to 1.
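The mask crossover and mutation can be sketched as below. This is only an illustration under stated assumptions: the function names `pick_two` and `vary_mask`, the use of NumPy, sampling the two candidates with replacement, and the convention that a smaller score marks a more promising variable (as with a Pareto front number) are not taken from the original algorithm listing.

```python
import numpy as np


def pick_two(indices, rng):
    """Pick two candidate positions (with replacement) from indices."""
    return rng.choice(indices, size=2)


def vary_mask(p_mask, q_mask, scores, rng):
    """Return an offspring mask built from parents p and q."""
    o = p_mask.copy()  # the offspring starts as a copy of p.Mask

    # Crossover: flip one position on which p and q disagree.
    if rng.random() < 0.5:
        cand = np.flatnonzero((p_mask == 1) & (q_mask == 0))
        if cand.size:
            a, b = pick_two(cand, rng)
            o[a if scores[a] > scores[b] else b] = 0  # drop the worse one
    else:
        cand = np.flatnonzero((p_mask == 0) & (q_mask == 1))
        if cand.size:
            a, b = pick_two(cand, rng)
            o[a if scores[a] < scores[b] else b] = 1  # add the better one

    # Mutation: flip one more position of the offspring mask.
    if rng.random() < 0.5:
        cand = np.flatnonzero(o == 1)
        if cand.size:
            a, b = pick_two(cand, rng)
            o[a if scores[a] > scores[b] else b] = 0
    else:
        cand = np.flatnonzero(o == 0)
        if cand.size:
            a, b = pick_two(cand, rng)
            o[a if scores[a] < scores[b] else b] = 1
    return o


rng = np.random.default_rng(1)
p = np.array([1, 0, 1, 0, 0, 1])
q = np.array([0, 1, 1, 0, 1, 0])
scores = np.array([3, 1, 2, 5, 4, 6])  # hypothetical scores
o = vary_mask(p, q, scores, rng)
print(o)
```

Since crossover and mutation each change at most one position, the offspring differs from p.Mask in at most two positions, which is the limitation on diversity that motivates the difference operators introduced later.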

SparseEA based on reliefF

It can be observed from line 11 of Algorithm 2 that the Score of a variable in SparseEA is the non-dominated front number of the corresponding solution. For the feature selection problem, the Score of the ith feature is the front number of the solution in which only the ith element is 1 and the rest are all 0. The fitness of a solution for the feature selection problem consists of sparsity and error rate. Since the sparsity of each such solution is 1/D, its Pareto front number is uniquely determined by the error rate. That is, the Score of the ith feature is decided only by the error rate in SparseEA. Moreover, the Scores remain constant throughout all iterations. Because features are correlated, calculating the fitness of single features only in the initial stage cannot reflect the importance of features well. However, evaluating all possible combinations of features is an NP-hard problem. Therefore, in this manuscript, the fitness values of excellent and poor individuals in each iteration are used to update the Scores of features.

In addition, the fitness value of a solution can only reflect the importance of the features from one view. Many traditional feature selection methods evaluate the importance of features based on different criteria. Therefore, in this section, we combine a traditional feature selection method with the SparseEA algorithm. Relief is a filter feature selection algorithm that updates feature weights by looking for the nearest neighbours of each sample. It evaluates the correlation and redundancy of features by examining adjacent samples of the same and of different classes. However, the Relief algorithm was designed to handle only two-class problems, so Kononenko extended Relief in 1994 into the ReliefF algorithm, which can handle multiple types of data with better performance. The ReliefF algorithm determines the feature weights according to certain weight measures between each sample in the original sample set and its similar and different samples. Strongly correlated, weakly correlated, and uncorrelated features are then distinguished according to certain evaluation criteria. For sparse large-scale feature selection problems, there are a lot of redundant features. Therefore, in this manuscript, the ReliefF algorithm is used first to eliminate unimportant features and build feature subsets. As a result, the number of features is reduced and the algorithm runs faster. To some extent, this offsets the time spent on running the ReliefF algorithm. Furthermore, the weights calculated by the ReliefF algorithm are combined with the Scores of SparseEA to guide the evolution process.
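To make the weighting idea concrete, the sketch below implements the basic two-class Relief update (one nearest hit and one nearest miss per sample), not the full ReliefF, which averages over several neighbours and handles multiple classes. The function names, the Manhattan distance on min-max scaled features, the toy data, and the `ratio` parameter are assumptions made for illustration; each class is assumed to contain at least two samples.

```python
import numpy as np


def relief_weights(X, y):
    """Basic Relief weights: reward features that separate the nearest
    miss and penalize features that separate the nearest hit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                      # avoid division by zero
    Xn = (X - X.min(axis=0)) / span            # scale features to [0, 1]
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(Xn - Xn[i]).sum(axis=1)  # Manhattan distance
        dist[i] = np.inf                       # exclude the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        w += np.abs(Xn[i] - Xn[miss]) - np.abs(Xn[i] - Xn[hit])
    return w / n


def select_top_features(w, ratio=0.5):
    """Indices of the highest-weighted features, keeping a fraction ratio."""
    k = max(1, int(ratio * len(w)))
    return np.sort(np.argsort(w)[::-1][:k])


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
y = [0, 0, 1, 1]
w = relief_weights(X, y)
keep = select_top_features(w, ratio=0.5)
print(w, keep)
```

In this toy case the class-separating feature receives the larger weight and is the one retained when half of the features are kept.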

Algorithm 4 Framework of the RA-SparseEA for Feature Selection

The framework of RA-SparseEA for feature selection is shown in Algorithm 4. First, the ReliefF algorithm is executed to obtain the feature weights \(W_{rlf}\). Features with low \(W_{rlf}\) values are removed, reducing the number of features from D to \(D'\). In this manuscript, we set \(D' = 0.5*D\) for datasets with more than 1000 features; otherwise, \(D' = D\). Then, as in SparseEA, Algorithm 2 is used to initialize the population in \(D'\) dimensions, and the \(Scores= [s_{1},...,s_{D'}]\) are obtained. The difference is that the Scores are set to the fitness values rather than the Pareto front numbers. Next, \(W_{rlf}\) is used to guide the updating of the Scores: for the good features in \(W_{rlf}\), a value \(\tau \) is added to the scores \(s_{i}\), and for the poor features, \(s_{i} = s_{i} - \tau \) is applied. After that, fast non-dominated sorting and crowding distance calculation are performed on P. In the main loop, the selection of the parent individuals \(P'\) and the genetic operator are the same as in SparseEA. In the subsequent environmental selection stage, duplicated solutions are deleted and non-dominated sorting is performed on P first. The features selected in the non-dominated solutions are considered to deserve higher Scores, while the features selected in the solutions of the last Pareto front should have lower Scores. Accordingly, \(s_{i} = s_{i} + \varepsilon \) and \(s_{i} = s_{i} - \varepsilon \) are applied to the features selected in \(F_{1}\) and \(F_{last}\), respectively, where \(F_{last}\) is the last Pareto front. To balance exploration and exploitation at different evolution stages, \(\varepsilon \) is designed as a linearly increasing function, shown in Eq. (2), where t is the number of times a feature is selected in all non-dominated solutions, \(\alpha \) is the step parameter, which is usually set to 0.01, FE is the number of consumed function evaluations, and MaxFE is the maximum number of function evaluations

$$\begin{aligned} \varepsilon = t\times \alpha \times (\mathrm{{FE}}/\mathrm{{MaxFE}}). \end{aligned}$$
(2)

The remaining steps are the same as the environmental selection in SparseEA and are not repeated here.
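The adaptive update of Eq. (2) can be sketched as follows. Several points are assumptions made for this illustration: the function name `update_scores`, the argument `progress` standing for the elapsed fraction of the search budget in Eq. (2), and counting t within \(F_{last}\) for the decrement (the text defines t only for the non-dominated solutions).

```python
import numpy as np


def update_scores(scores, first_front, last_front, progress, alpha=0.01):
    """Adaptive Score update sketched from Eq. (2).

    first_front, last_front: 2-D 0/1 arrays (solutions x features) holding
    the masks of the solutions in F_1 and F_last.
    progress: elapsed fraction of the search budget, in [0, 1].
    """
    t_first = first_front.sum(axis=0)  # selection counts in F_1
    t_last = last_front.sum(axis=0)    # assumption: counts in F_last
    new_scores = np.asarray(scores, dtype=float).copy()
    new_scores += t_first * alpha * progress  # s_i = s_i + epsilon
    new_scores -= t_last * alpha * progress   # s_i = s_i - epsilon
    return new_scores


scores = np.zeros(4)
f1 = np.array([[1, 0, 0, 1],
               [1, 1, 0, 0]])      # hypothetical first front
f_last = np.array([[0, 0, 1, 1]])  # hypothetical last front
updated = update_scores(scores, f1, f_last, progress=0.5)
print(updated)
```

Because the step grows with `progress`, early iterations change the Scores only slightly, while later iterations rely more strongly on what the current fronts suggest.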

RA-SparseEA with difference operator

In this section, the SparseEA with difference operator (RA-SparseDO) is described in detail. As shown in the previous section, RA-SparseEA flips only one element per individual in each of the mutation and crossover operations in every round. This limits the diversity of the population and makes the algorithm prone to falling into local optima. Moreover, the Mask of the offspring is first set to that of one parent (\(o.Mask = p.Mask\)); therefore, parent p has a great influence on the offspring, while parent q does not. The difference operator proposed in DE can obtain genetic information from multiple parents [54,55,56].

Feature selection is a binary problem, and there are some important binary variants of the DE. In paper [57], sigmoid transfer function is used to convert the mutation operator into binary form. A new Taper-shaped transfer function was proposed and used to transform the continuous DE algorithm into binary form in [58]. Paper [59] makes use of binary operators such as xor, and, or, and not operators to generate trial solutions. An adaptive quantum-inspired DE was designed in [60] for solving 0–1 Knapsack Problem. Pampara et al. [61] presented angle-modulated DE, which uses angle modulation to evolve the coefficients of the trigonometric function, thus allowing mapping from continuous space to binary space. For multi-objective binary algorithm, scholars also have done some excellent work. A binary differential evolution algorithm with a self-learning strategy for multi-objective feature selection problems was designed in paper [37]. Non-dominated sorting binary differential evolution was proposed for cascading failures protection in complex networks in Ref. [62]. Paper [63] proposed a binary version of generalized differential evolution for multi-label feature selection based on majority voting of solutions and opposition-based learning. There are not many studies on using binary differential evolution to solve large-scale problems. A new self-adaptive binary variant of a differential evolution algorithm based on measure of dissimilarity was proposed in [64], and used for solving high-dimensional knapsack problems.

Table 1 The different mutation schemes of DE

Therefore, DE is effective for solving binary problems. Thus, this manuscript attempts to introduce the difference operator into SparseEA to increase the diversity of solutions for solving LSMFSPs. SparseEA selects parents via binary tournament selection and then produces offspring. We introduce four commonly used difference operators of DE (“DE/rand/1”, “DE/rand/2”, “DE/best/1”, and “DE/best/2”) into SparseEA. Table 1 shows the details of the four difference operators. In the DE algorithm, mutation is performed first, followed by crossover, which is different from GA. This manuscript takes “DE/rand/2” and “DE/best/2” as examples to introduce the difference operator for SparseEA in detail.

Algorithm 5 Difference Operator of SparseDO(Rand/2)

Algorithm 5 describes the pseudo-code of the “DE/rand/2” difference operator of SparseDO. First, five individuals \(x_{r1}, x_{r2},x_{r3}, x_{r4},x_{r5}\) are randomly selected from \(P'\) and removed from it. The Mask of the offspring o is first set to that of \(x_{r1}\). Then, the elements on which \(x_{r2}\) and \(x_{r3}\) differ are marked in R1. Similarly, the elements on which \(x_{r4}\) and \(x_{r5}\) differ are marked in R2. R is then obtained by \(R = R1 + R2\), and the marked elements in R are the candidate variables to change. In paper [59], the OR operator is applied to the produced R and \(x_{1}\). However, in this manuscript, to increase the sparsity and randomness of the offspring, the elements of o.Mask at the nonzero positions of R are directly replaced according to a random number. That is, for each nonzero element in R, a random number determines whether the corresponding variable of o.Mask is set to 0 or 1 with equal probability. After mutation, crossover is performed to determine whether the mutated genes are eventually passed on to the offspring. In this algorithm, one individual p is randomly selected from \(P'\) as the parent in the crossover step. Two decision variables are randomly selected from the nonzero elements in \(o.Mask \cap \overline{p.Mask}\), and the one with the bigger fitness is set to 0; or they are selected from \( \overline{o.Mask} \cap p.Mask\), and the one with the smaller fitness is set to 1. The Dec of the offspring is generated in the same way as in SparseEA. Because feature selection is a binary problem, the Dec of the offspring is set to a vector of ones.
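The mutation step of this operator can be sketched as below; the crossover step and the removal of the chosen parents from \(P'\) are omitted. The function name `rand2_mask_mutation`, the NumPy generator, and the example masks are assumptions made for this illustration rather than part of Algorithm 5.

```python
import numpy as np


def rand2_mask_mutation(masks, rng):
    """Mutation step of a binary "DE/rand/2"-style operator on masks.

    masks: 2-D 0/1 array with at least five rows (the parent masks).
    Returns the offspring mask, the index of the base parent x_r1,
    and the positions marked in R.
    """
    r1, r2, r3, r4, r5 = rng.choice(len(masks), size=5, replace=False)
    offspring = masks[r1].copy()                # o.Mask = x_r1.Mask
    diff1 = masks[r2] != masks[r3]              # R1
    diff2 = masks[r4] != masks[r5]              # R2
    r = diff1.astype(int) + diff2.astype(int)   # R = R1 + R2
    marked = np.flatnonzero(r)
    # Each marked position becomes 0 or 1 with equal probability.
    offspring[marked] = rng.integers(0, 2, size=marked.size)
    return offspring, r1, marked


rng = np.random.default_rng(2)
masks = rng.integers(0, 2, size=(6, 10))  # hypothetical parent masks
offspring, base, marked = rand2_mask_mutation(masks, rng)
print(offspring, marked)
```

Unlike the single flip of the original operator, every position on which either parent pair disagrees may change here, so the offspring can move further away from its base parent.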

Algorithm 6 Difference Operator of SparseDO(Best/2)

Algorithm 7 Framework of the RA-HSparseDO(Rand/2-Best/2) for Feature Selection

Similarly, Algorithm 6 describes the “DE/best/2” difference operator of SparseDO. The main difference lies in lines 5 and 6 of Algorithm 6: environmental selection is executed based on front number and crowding distance, and the Mask of the offspring is then initialized from a randomly selected non-dominated solution. The rest is similar to Algorithm 5 and is not repeated here.

The difference operator “DE/rand/2” is good for exploration, whereas “DE/best/2” is good for exploitation. To balance exploration and exploitation, SparseEA with hybrid difference operators based on ReliefF for feature selection, named RA-HSparseDO, is described in Algorithm 7. As can be seen from lines 11 to 15 of Algorithm 7, Algorithm 5 is applied in the first half of the iterations, while Algorithm 6 is applied in the second half.
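The switching rule of the hybrid variant can be written in a few lines. In this sketch the function name `choose_operator` and the string labels are hypothetical, and progress is measured by consumed function evaluations, which is an assumption consistent with the stopping criterion used in the experiments.

```python
def choose_operator(evaluations_used, max_evaluations):
    """Return the difference operator to apply at this point of the run."""
    if evaluations_used < max_evaluations / 2:
        return "rand/2"  # first half: favour exploration
    return "best/2"      # second half: favour exploitation


print(choose_operator(100, 1000), choose_operator(900, 1000))
```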

Experiments

Experiments’ settings

In this section, some experiments are designed to test the performance of the proposed algorithms. All the experiments are conducted on the evolutionary multi-objective optimization platform PlatEMO [65].

Datasets’ description

The scikit-feature repository, an open-source feature selection repository developed at Arizona State University (https://jundongl.github.io/scikit-feature/index.html), is selected for LSMFSPs. It serves as a platform for facilitating feature selection application, research, and comparative study. The dataset name, number of instances, number of features, number of classes, and keyword description are shown in Table 2. These datasets include 4 face image datasets (ORL, warpAR10P, warpPIE10P, and Yale), 6 biological datasets (lung, lung-discrete, lymphoma, nci9, Prostate-GE, and TOX-171), and 2 text datasets (BASEHOCK and RELATHE). The dataset with the largest number of features is Prostate-GE, with up to 9712 features. The dataset with the fewest features is lung-discrete, which still has 325 features.

Table 2 The datasets of scikit-feature repository

To verify the effect of the proposed algorithm on small-scale datasets, UCI datasets with less than 500 features (http://archive.ics.uci.edu/ml/index.php) are selected in this manuscript. Table 3 details the dataset name, number of instances, number of features, number of classes, and the keyword description of these datasets, i.e., iris, Lung Cancer, Person-Classification, MUSK1, heart, ionosphere, Parkinson, and COVID-19 Surveillance. It can be found that data types include integer, real, and category.

Table 3 The datasets of UCI machine learning repository

Stopping condition and performance metrics

For the sake of efficient and fair experiments, the maximum number of function evaluations is adopted as the stopping criterion.

Two objectives are used in this manuscript for multi-objective feature selection problems, namely the validation error and the ratio of selected features. Since the Pareto fronts of MOPs in applications are unknown, the hypervolume (HV) is adopted to measure each obtained solution set. The HV index was first proposed by Zitzler et al.; it represents the volume of the region in the objective space enclosed by the individuals of the solution set and the reference point. The reference point for calculating HV in this manuscript is set to (1,1).
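For two minimized objectives, the HV with respect to the reference point (1,1) reduces to a sum of rectangle areas. The sketch below is a plain illustration of that computation, not the routine used by the experimental platform; the function name `hypervolume_2d` and the example points are assumptions, and no normalization of the objectives is applied.

```python
def hypervolume_2d(points, ref=(1.0, 1.0)):
    """Hypervolume of 2-D minimization points with respect to ref."""
    # Keep only points that are strictly better than the reference point.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:      # ascending in the first objective
        if f2 < best_f2:    # the point is non-dominated so far
            hv += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv


# Hypothetical (validation error, ratio of selected features) pairs;
# the third point is dominated and contributes nothing.
solutions = [(0.2, 0.6), (0.4, 0.3), (0.5, 0.7)]
hv = hypervolume_2d(solutions)
print(hv)
```

A larger HV indicates a solution set that is closer to the origin and covers the trade-off between error and feature ratio more widely.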

Experiment on SMOP test suite

In this section, the performance of SparseDO in generating offspring solutions of real variables will be tested. The sparse multi-objective test suite [50] is adopted in this experiment, which is widely used in assessing the performance of existing multi-objective evolutionary algorithms in obtaining sparse Pareto optimal solutions. The test suite contains eight benchmark problems SMOP1–SMOP8 with scalable number of decision variables. In the experiments, the number of objectives of these problems is set to 2, and the number of decision variables is set to 500, 1000, and 1500.

The algorithms used in this experiment are the original SparseEA; SparseDO with “DE/best/1”, denoted as SparseDO (Best1); SparseDO with “DE/best/2”, denoted as SparseDO (Best2); SparseDO with “DE/rand/1”, denoted as SparseDO (Rand1); SparseDO with “DE/rand/2”, denoted as SparseDO (Rand2); Hybrid SparseDO with “DE/best/1” and “DE/rand/1”, denoted as HSparseDO (Rand1Best1); and Hybrid SparseDO with “DE/best/2” and “DE/rand/2”, denoted as HSparseDO (Rand2Best2). Each algorithm performs 1000 function evaluations on each function, the population size is set to 10, and the HV index is used to measure the results. Each experiment is run 30 times, and the mean and standard deviation are reported. The experimental results are shown in Table 4.

Table 4 The results of the SparseEA and the proposed SparseDO algorithms with different DO on SMOP

The results of the proposed SparseDO (Best1), SparseDO (Best2), SparseDO (Rand1), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) are compared with those of SparseEA. Numbers marked in red indicate that the proposed algorithm is better than the original SparseEA. The last row in Table 4 shows the number of times each proposed algorithm obtains better results than SparseEA. As can be seen in Table 4, SparseDO (Best1), SparseDO (Best2), SparseDO (Rand1), and SparseDO (Rand2) obtain better results than SparseEA on 17, 19, 16, and 13 functions, respectively. Hence, the four SparseDO variants are effective in generating offspring solutions of real variables. HSparseDO (Rand1Best1) and HSparseDO (Rand2Best2) obtain 19 and 16 better values, respectively. Therefore, the proposed HSparseDO can also improve the performance of the original SparseEA algorithm.

Experiments of diversity

In this section, the effect of binary differential operators on solution diversity will be tested. The most intuitive measure of diversity is the degree of difference between individuals in a population. Since the increase or decrease of individual diversity within a population is caused by the change of individual gene loci, the diversity in terms of gene loci should also be considered. In general, population diversity can be considered from both macro- and micro-perspectives. Therefore, both individual diversity and genetic diversity are composed of internal diversity and external diversity [66]. Listed below are four definitions about diversity.

Population P is a set of N individuals, denoted as \(P = [p_{1}, p_{2},...,p_{N}]^\mathrm{{T}}\), where \(p_{i} =[p_{i}^{1}, p_{i}^{2},...,p_{i}^{D}]\) is a D-dimensional vector. P is an \(N\times D\) matrix, where each row represents an individual and each column represents a gene. The element \(p_{i}^{j}\) is the jth gene value of the ith individual.

Definition 1

The average of the population P is defined as

$$\begin{aligned} \bar{P}=\frac{1}{N\times D}\sum _{i=1}^{N}\sum _{j=1}^{D}p_{i}^{j}. \end{aligned}$$
(3)

Definition 2

The overall diversity of the population P is defined as

$$\begin{aligned} \mathrm{{DP}}=\frac{1}{N\times D}\sum _{i=1}^{N}\sum _{j=1}^{D}[p_{i}^{j}-\bar{P}]^{2}. \end{aligned}$$
(4)

Definition 3

Genetic internal diversity is defined as

$$\begin{aligned} \mathrm{{DG}}_{I}=\frac{1}{N\times D}\sum _{i=1}^{N}\sum _{j=1}^{D}[p_{i}^{j}-\bar{G^{j}}]^{2}, \,\,\mathrm{{where}}\,\, \bar{G^{j}} = \frac{1}{N}\sum _{i=1}^{N}p_{i}^{j}. \end{aligned}$$
(5)

Genetic external diversity is defined as

$$\begin{aligned} \mathrm{{DG}}_{E}=\frac{1}{D}\sum _{j=1}^{D}[\bar{G^{j}}-\bar{P}]^{2}. \end{aligned}$$
(6)

Genetic diversity is defined as

$$\begin{aligned} \mathrm{{DG}} = \mathrm{{DG}}_{I}+\mathrm{{DG}}_{E}. \end{aligned}$$
(7)

Definition 4

Individual internal diversity is defined as

$$\begin{aligned} \mathrm{{DI}}_{I}=\frac{1}{N\times D}\sum _{i=1}^{N}\sum _{j=1}^{D}[p_{i}^{j}-\bar{P_{i}}]^{2}, \,\,\mathrm{{where}}\,\, \bar{P_{i}} = \frac{1}{D}\sum _{j=1}^{D}p_{i}^{j}. \end{aligned}$$
(8)

Individual external diversity is defined as

$$\begin{aligned} \mathrm{{DI}}_{E}=\frac{1}{N}\sum _{i=1}^{N}[\bar{P_{i}}-\bar{P}]^{2}. \end{aligned}$$
(9)

Individual diversity is defined as

$$\begin{aligned} \mathrm{{DI}} = \mathrm{{DI}}_{I}+\mathrm{{DI}}_{E}. \end{aligned}$$
(10)
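The definitions in Eqs. (3)–(10) can be sketched in a few lines of NumPy, assuming the population is stored as an \(N\times D\) array. The function name `diversity_indicators` and the toy population are illustrative assumptions. One property worth noting is that each internal/external pair, as written, is a decomposition of the overall variance in Eq. (4), so the sums in Eqs. (7) and (10) coincide with DP up to rounding; the sketch checks this numerically.

```python
import numpy as np


def diversity_indicators(P):
    """Diversity indicators of Eqs. (3)-(10) for an (N, D) population matrix."""
    P = np.asarray(P, dtype=float)
    p_bar = P.mean()              # Eq. (3): average over all genes of all individuals
    gene_means = P.mean(axis=0)   # \bar{G^j}: one mean per gene (column)
    indiv_means = P.mean(axis=1)  # \bar{P_i}: one mean per individual (row)

    DP = np.mean((P - p_bar) ** 2)                   # Eq. (4)
    DG_I = np.mean((P - gene_means) ** 2)            # Eq. (5)
    DG_E = np.mean((gene_means - p_bar) ** 2)        # Eq. (6)
    DI_I = np.mean((P - indiv_means[:, None]) ** 2)  # Eq. (8)
    DI_E = np.mean((indiv_means - p_bar) ** 2)       # Eq. (9)
    return {
        "DP": DP,
        "DG_I": DG_I, "DG_E": DG_E, "DG": DG_I + DG_E,  # Eq. (7)
        "DI_I": DI_I, "DI_E": DI_E, "DI": DI_I + DI_E,  # Eq. (10)
    }


# Toy binary population with N = 2 individuals and D = 3 genes.
toy = [[1, 0, 1],
       [0, 0, 1]]
result = diversity_indicators(toy)
print(result)
# Both decompositions recover the overall diversity.
assert np.isclose(result["DG"], result["DP"])
assert np.isclose(result["DI"], result["DP"])
```

The internal and external terms are therefore the informative quantities when comparing the gene-wise and individual-wise views of the same population.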

In this manuscript, the overall diversity of the population DP, genetic diversity DG, and individual diversity DI are used as the evaluation indicators. The experiments are performed on five scikit-feature datasets, namely lymphoma, warpPIE10P, ORL, lung-discrete, and warpAR10P. Each algorithm is executed for 5000 function evaluations on each dataset and the HV index is used to measure the results. Each experiment is run 30 times, and the mean and standard deviation are used to assess the results. To test the influence of population size on the algorithm, two experiments are conducted in this section, with 100 and 200 individuals, respectively. The experimental results of population diversity, gene diversity, and individual diversity with population size set to 100 are shown in Table 5, where DP is population diversity, DG is genetic diversity, and DI is individual diversity. The TOTAL rows under the three diversities are the sums of the values of each diversity over the five datasets. A value marked in red indicates that the proposed algorithm performs better than the original SparseEA.

Table 5 Results of population diversity, gene diversity, and individual diversity on scikit-feature repository (with population size set to 100)

It can be seen from Table 5 that the population diversity values of SparseDO (Best2), SparseDO (Rand1), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) on the five datasets are better than those of SparseEA, so their TOTAL values are also higher. However, the values of SparseDO (Best1) on all five datasets are lower than those of SparseEA. Thus, the population diversity of SparseDO (Best1) is worse than that of SparseEA at a population size of 100. Meanwhile, the values of SparseDO (Best2) are higher than those of SparseDO (Best1), and the values of SparseDO (Rand2) are higher than those of SparseDO (Rand1). Moreover, the values of SparseDO (Rand1) are all higher than those of SparseDO (Best1), and the values of SparseDO (Rand2) are all higher than those of SparseDO (Best2). Thus, random difference operators increase the diversity of solutions more than best difference operators. In addition, the diversity values of HSparseDO (Rand1Best1) are higher than those of SparseDO (Rand1) and SparseDO (Best1). Similarly, the values of HSparseDO (Rand2Best2) are higher than those of SparseDO (Rand2) and SparseDO (Best2). Therefore, the HSparseDO algorithms are effective in terms of population diversity. The same conclusion can be reached for genetic diversity and individual diversity.

Table 6 shows the results of population diversity, gene diversity, and individual diversity on scikit-feature datasets with population size set to 200. It can be seen from Table 6 that SparseDO (Best2), SparseDO (Rand1), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) are effective for increasing the diversity of SparseEA. SparseDO (Best1) is worse than SparseEA with population size set to 200. Furthermore, comparing Tables 5 and 6, it is not difficult to find that the values in Table 6 are all greater than the corresponding values in Table 5. Therefore, the diversity of solutions can be increased when the population size is large.

Ablation study

There are two components in the algorithm, the relief-based component and the difference operator-based component. In this subsection, the influence of these two components on the performance of the proposed algorithm for LSMFSPs is tested and analyzed.

In this experiment, five datasets of the scikit-feature repository are selected, namely BASEHOCK, RELATHE, lymphoma, nci9, and warpPIE10P. As before, to test the influence of population size on the algorithm, two experiments are conducted in this section, with 10 and 30 individuals, respectively. Each algorithm is executed for 5000 function evaluations on each dataset. The population size and the number of function evaluations are the same for all algorithms. All algorithms in the experiment are run 30 times, and AVG and STD are the mean and standard deviation of the results over the 30 runs, respectively.

Table 7 shows the experimental results of SparseEA, RA-SparseEA, SparseDO (Best1), SparseDO (Best2), SparseDO (Rand1), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) on scikit-feature repository with population size set to 10. The value marked in red indicates that the performance of the algorithm is better than that of the original SparseEA. The penultimate row of Table 7 (AVG TOTAL) represents the sum of the algorithm’s AVG values over the five datasets. The last row (COUNT) of the table indicates the number of times the algorithm is superior to the SparseEA algorithm in five datasets.
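The summary rows described above can be reproduced from raw per-run HV values with a short script. The sketch below is only an illustration: the data layout (a nested dictionary), the function name `summarize`, the use of the population standard deviation, and the numbers in the example are all assumptions and are not taken from Table 7.

```python
import numpy as np


def summarize(hv_runs, baseline="SparseEA"):
    """Compute AVG, STD, AVG TOTAL and COUNT from per-run HV values.

    hv_runs : {algorithm: {dataset: [HV of run 1, HV of run 2, ...]}}
    """
    avg = {a: {d: float(np.mean(v)) for d, v in runs.items()}
           for a, runs in hv_runs.items()}
    # Population standard deviation (ddof=0) is an assumption of this sketch.
    std = {a: {d: float(np.std(v)) for d, v in runs.items()}
           for a, runs in hv_runs.items()}
    # AVG TOTAL: sum of the AVG values over all datasets.
    avg_total = {a: sum(per_dataset.values()) for a, per_dataset in avg.items()}
    # COUNT: number of datasets on which the algorithm beats the baseline (higher HV).
    count = {a: sum(avg[a][d] > avg[baseline][d] for d in avg[a])
             for a in avg if a != baseline}
    return avg, std, avg_total, count


# Made-up values for two hypothetical datasets "A" and "B" and two runs each.
example = {
    "SparseEA": {"A": [0.5, 0.7], "B": [0.8, 0.8]},
    "Variant":  {"A": [0.7, 0.9], "B": [0.7, 0.7]},
}
avg, std, avg_total, count = summarize(example)
print(avg_total, count)
```

Because AVG TOTAL sums HV values over datasets of different difficulty, it is best read together with COUNT rather than on its own.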

It can be seen from the last row of Table 7 that RA-SparseEA performs better than SparseEA on all five datasets. In particular, the value of RA-SparseEA is 0.083 higher than that of SparseEA on BASEHOCK, 0.061 higher on RELATHE, and 0.147 higher on nci9. Hence, the ReliefF-based SparseEA is effective for LSMFSPs. SparseDO (Best1), SparseDO (Best2), SparseDO (Rand1), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) are superior to SparseEA on 4, 4, 4, 5, 4, and 4 datasets, respectively. In particular, the value of SparseDO (Rand1) is at least 0.02 higher than that of SparseEA on the BASEHOCK, RELATHE, and lymphoma datasets. Therefore, the difference operator is also effective for feature selection problems. From the perspective of AVG TOTAL, all proposed algorithms are superior to the original algorithm, among which RA-SparseEA has the best performance.

The results of SparseEA, RA-SparseEA, SparseDO (Best1), SparseDO (Best2), SparseDO (Rand1), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) with population size set to 30 are shown in Table 8. Again, RA-SparseEA performs better than SparseEA on all five datasets and is the best among all algorithms. In particular, the value of RA-SparseEA is 0.065 higher than that of SparseEA on BASEHOCK, 0.046 higher on RELATHE, and 0.140 higher on nci9. Compared with SparseEA, SparseDO (Best1), SparseDO (Best2), SparseDO (Rand1), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) perform better on 4, 3, 5, 5, 4, and 4 datasets, respectively. On the nci9 dataset, the values of SparseDO (Best1), SparseDO (Best2), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) are at least 0.02 higher than that of SparseEA. As can be seen from the penultimate row of Table 8, both the difference operator and the ReliefF-based strategy can improve the performance of the original SparseEA.

Table 6 Results of population diversity, gene diversity, and individual diversity on scikit-feature repository (with population size set to 200)
Table 7 Results of ablation study on scikit-feature repository (with population size set to 10)
Table 8 Results of ablation study on scikit-feature repository (with population size set to 30)

Comparative experiments for large-scale sparse multi-objective feature selection problems

In this subsection, comparative experiments for LSMFSPs are conducted on the scikit-feature repository.

Fig. 1 Obtained Pareto fronts of SparseEA, RA-SparseDO (Best1), RA-SparseDO (Best2), RA-SparseDO (Rand1), RA-SparseDO (Rand2), RA-HSparseDO (Rand1Best1), and RA-HSparseDO (Rand2Best2)

Comparative experiments with 30 individuals

Each algorithm is executed for 5000 function evaluations on each dataset. In these experiments, ten datasets from the scikit-feature repository are used to verify the effectiveness of the proposed algorithm for LSMFSPs, namely lung, lung-discrete, lymphoma, nci9, ORL, Prostate-GE, TOX-171, warpAR10P, warpPIE10P, and Yale. The numbers of features in the ten datasets are 3312, 325, 4026, 9712, 1024, 5966, 5748, 2400, 2420, and 1024, respectively. First, the ReliefF-based SparseEA with different difference operators is tested. Then, the proposed algorithm is compared with state-of-the-art algorithms.

Table 9 shows the HV values of SparseEA, RA-SparseDO (Best1), RA-SparseDO (Best2), RA-SparseDO (Rand1), RA-SparseDO (Rand2), RA-HSparseDO (Rand1Best1), and RA-HSparseDO (Rand2Best2) on the scikit-feature repository with population size set to 30. As before, the values marked in red indicate that the algorithm is better than the original SparseEA. The penultimate row of Table 9 (AVG TOTAL) represents the sum of the algorithm’s AVG values over the ten datasets. The last row (COUNT) indicates the number of datasets, out of ten, on which the algorithm is superior to SparseEA. As can be seen from the last row of Table 9, RA-SparseDO (Rand2), RA-HSparseDO (Rand1Best1), and RA-HSparseDO (Rand2Best2) all obtain better results than SparseEA on all ten datasets. They are followed by RA-SparseDO (Rand1), which performs better than SparseEA on nine datasets. RA-SparseDO (Best1) and RA-SparseDO (Best2) obtain higher HV values than SparseEA on 6 and 5 datasets, respectively. From the values of AVG TOTAL, we can see that the most effective algorithm is RA-SparseDO (Rand2), followed by RA-SparseDO (Rand1). RA-SparseDO (Best2) is the least effective of the proposed algorithms. Specifically, on Prostate-GE, the values of all six proposed algorithms are at least 0.03 higher than that of SparseEA. On nci9, all six proposed algorithms improve by at least 0.04. On TOX-171, all algorithms improve by at least 0.09. Moreover, the values of RA-SparseDO (Rand2) are at least 0.03 higher than those of SparseEA on five datasets.

To show the non-dominated solutions of these algorithms, the Pareto fronts on four datasets (nci9, ORL, TOX-171, and warpPIE10P) obtained by SparseEA, RA-SparseDO (Best1), RA-SparseDO (Best2), RA-SparseDO (Rand1), RA-SparseDO (Rand2), RA-HSparseDO (Rand1Best1), and RA-HSparseDO (Rand2Best2) are plotted in Fig. 1. It can be observed from Fig. 1 that the non-dominated solutions of all proposed algorithms are clearly superior to those of the original SparseEA algorithm on nci9 and TOX-171. For ORL and warpPIE10P, SparseEA achieves a low ratio of selected features, while most proposed algorithms achieve a low validation error.
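The notion of non-dominated solutions used when plotting such fronts can be sketched as follows, assuming that both objectives (validation error and ratio of selected features) are minimized. The function names and the sample points are illustrative assumptions rather than the plotting code used for Fig. 1.

```python
def dominates(a, b):
    """True if solution `a` is no worse than `b` in every objective and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points):
    """Return the points that are not dominated by any other point (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative objective vectors: (validation error, ratio of selected features).
solutions = [(0.10, 0.50), (0.20, 0.40), (0.30, 0.60), (0.25, 0.45)]
print(non_dominated(solutions))
```

This pairwise check is quadratic in the number of solutions, which is adequate for the small populations considered here.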

Then, RA-SparseDO (Rand2), RA-HSparseDO (Rand2Best2), and the original SparseEA are compared with other algorithms. Four comparison algorithms are selected in this subsection, i.e., ARMOEA [67], DAEA [68], DEAGNG [69], and MOEAPSL [51]. All four comparison algorithms have been proposed in the past 3 years. ARMOEA is an adaptive geometry estimation-based many-objective evolutionary algorithm proposed in 2019. In 2021, a duplication analysis-based evolutionary algorithm (DAEA) was proposed for solving bi-objective feature selection in classification. DEAGNG is a decomposition-based evolutionary algorithm guided by growing neural gas, designed in 2020. MOEAPSL is an evolutionary algorithm proposed in 2021 to solve sparse large-scale multi-objective problems by learning the Pareto optimal subspace.

The experimental results are shown in Table 10. First, RA-SparseDO (Rand2) is compared with ARMOEA, DAEA, DEAGNG, MOEAPSL, and the original SparseEA. Among the six algorithms, the one with the best performance is marked in red, and the penultimate row of Table 10 indicates the number of best results obtained by each algorithm. It can be seen that RA-SparseDO (Rand2) performs best on seven datasets. Next is DAEA, which performs best on three datasets. ARMOEA, DEAGNG, and SparseEA do not obtain any best result. Similarly, RA-HSparseDO (Rand2Best2) is compared with the five comparison algorithms, and the maximum HV values among all algorithms are marked in bold. The last row of Table 10 gives the number of times the best value is obtained. RA-HSparseDO (Rand2Best2) obtains the best values on 6 datasets, and DAEA obtains the best values on 4 datasets. Therefore, the proposed algorithm has the best performance, and is significantly better than ARMOEA, DEAGNG, MOEAPSL, and the original SparseEA.

Similarly, the Pareto fronts obtained by ARMOEA, DAEA, DEAGNG, MOEAPSL, SparseEA, RA-SparseDO (Rand2), and RA-HSparseDO (Rand2Best2) are plotted in Fig. 2. As shown in Fig. 2, the solutions obtained by ARMOEA and DEAGNG are not sparse enough and are clearly worse than those obtained by the other algorithms. RA-SparseDO (Rand2) obtains non-dominated solutions on lymphoma, warpAR10P, and Yale with lower validation error than those of the other algorithms. The non-dominated solutions of DAEA have a lower ratio of selected features than those of RA-SparseDO (Rand2) and RA-HSparseDO (Rand2Best2) on warpAR10P and Yale.

Table 9 The comparison between SparseEA and RA-SparseDO(Best1), RA-SparseDO(Best2), RA-SparseDO(Rand1), RA-SparseDO(Rand2), RA-HSparseDO(Rand1Best1), and RA-HSparseDO(Rand2Best2) on scikit-feature repository (with population size set to 30)
Table 10 Comparisons of different algorithms on scikit-feature repository (with population size set to 30)
Fig. 2 Obtained Pareto fronts of ARMOEA, DAEA, DEAGNG, MOEAPSL, SparseEA, RA-SparseDO (Rand2), and RA-HSparseDO (Rand2Best2)

Comparative experiments with 50 individuals

The next experiment is to verify the effectiveness of the proposed algorithm when the population size is 50. Similarly, the performances of the proposed algorithms under different difference operators are compared with the original SparseEA. The HV values of SparseEA, RA-SparseDO (Best1), RA-SparseDO (Best2), RA-SparseDO (Rand1), RA-SparseDO (Rand2), RA-HSparseDO (Rand1Best1), and RA-HSparseDO (Rand2Best2) are shown in Table 11.

Table 11 The comparison between SparseEA and RA-SparseDO(Best1), RA-SparseDO(Best2), RA-SparseDO(Rand1), RA-SparseDO(Rand2), RA-HSparseDO(Rand1Best1), and RA-HSparseDO(Rand2Best2) on scikit-feature repository (with population size set to 50)
Table 12 Comparisons of different algorithms on scikit-feature repository (with population size set to 50)

It can be seen from the last row of Table 11 that RA-SparseDO (Rand2) performs better than SparseEA on all ten datasets. Next are RA-SparseDO (Best1), RA-SparseDO (Rand1), and RA-HSparseDO (Rand1Best1), all of which have better results on nine datasets. RA-SparseDO (Best2) and RA-HSparseDO (Rand2Best2) obtain 7 and 8 better values, respectively. As can be seen from the penultimate row of Table 11, the six proposed algorithms are all superior to the original SparseEA. The highest AVG TOTAL value, 9.41E+00, is obtained by RA-SparseDO (Rand1), followed by RA-HSparseDO (Rand1Best1) with 9.38E+00. In detail, the values of all six proposed algorithms are at least 0.03 higher than those of SparseEA on lung-discrete, nci9, Prostate-GE, and TOX-171. Specifically, all six proposed algorithms improve by at least 0.09 on TOX-171. On lung-discrete, RA-SparseDO (Rand1), RA-SparseDO (Rand2), RA-HSparseDO (Rand1Best1), and RA-HSparseDO (Rand2Best2) improve by at least 0.05. On nci9, five proposed algorithms improve by at least 0.08. Moreover, the values of RA-SparseDO (Rand1) are at least 0.03 higher than those of SparseEA on six datasets. Similarly, the values of RA-SparseDO (Rand2) and RA-HSparseDO (Rand1Best1) are also at least 0.03 higher than those of SparseEA on six datasets.

Next, RA-SparseDO (Rand2) and RA-HSparseDO (Rand1Best1) are selected for comparison with the comparison algorithms. Table 12 shows the results of ARMOEA, DAEA, DEAGNG, MOEAPSL, SparseEA, RA-SparseDO (Rand2), and RA-HSparseDO (Rand1Best1) with population size set to 50. The best results are marked in red. The results of RA-SparseDO (Rand2) compared with ARMOEA, DAEA, DEAGNG, MOEAPSL, and SparseEA are shown in the penultimate row of Table 12. It can be seen that RA-SparseDO (Rand2) obtains seven best values, and DAEA obtains three best values. The last row of Table 12 is the result of the comparison between RA-HSparseDO (Rand1Best1) and ARMOEA, DAEA, DEAGNG, MOEAPSL, and SparseEA. The best results are marked in bold. In this comparison, RA-HSparseDO (Rand1Best1) wins 6 times and DAEA wins 4 times. Therefore, the proposed algorithm is slightly better than DAEA, but significantly better than ARMOEA, DEAGNG, MOEAPSL, and the original SparseEA for solving LSMFSPs.

Comparative experiments with different population sizes

It is well known that different algorithms usually have different optimal settings of the population size. The third experiment verifies the effectiveness of the algorithms with different population sizes. Each algorithm is executed for 5000 function evaluations on each dataset. Table 13 shows the results of RA-SparseDO (Rand2) with the population size set to 10, 30, 50, 70, 90, 110, 130, 150, 170, and 190. As before, the best results across the different population sizes are marked in red. The optimal population size is 30, for which 6 red values are obtained. Next are 50 and 70, both with 2 best results. When the population size is 10, the best result is obtained on 1 dataset. In particular, the algorithm obtains the same value on the lung-discrete dataset when the population size is set to 30 and 50. When the population size is greater than or equal to 90, the algorithm does not obtain any best result.

Table 13 RA-SparseDO (Rand2) with different population size on scikit-feature repository

Table 14 shows the results of SparseEA with different population sizes. It can be seen that SparseEA obtains its best results when the population size is set to 30, 50, 70, 90, and 110. The optimal population size is 70, for which 3 red values are obtained. Next are 90 and 110, both with 2 best results. When the number of individuals is 30 or 50, only one best result is obtained.

Table 14 SparseEA with different population size on scikit-feature repository

Similarly, Tables 15, 16, 17, and 18 show the HV results of ARMOEA, DAEA, DEAGNG, and MOEAPSL with different population sizes. The optimal population size of ARMOEA is 70, for which 5 red values are obtained. This is followed by 10 and 50, with 3 and 2 best values, respectively. For DAEA, the optimal population size is 50, for which four red values are obtained. It also obtains red values at 30, 70, and 150. DEAGNG obtains red values at 10, 70, 90, and 110. When the population size is 70 or 90, three red values are obtained, while when the population size is 10 or 110, two red values are obtained. MOEAPSL obtains best values at seven population sizes, so the population size has relatively little influence on it. MOEAPSL obtains two red results with the population size set to 50, 70, and 150. It obtains one best value with 30, 90, 110, and 170 individuals.

Table 15 ARMOEA with different population size on scikit-feature repository
Table 16 DAEA with different population size on scikit-feature repository
Table 17 DEAGNG with different population size on scikit-feature repository
Table 18 MOEAPSL with different population size on scikit-feature repository
Table 19 The best results with different population size of ARMOEA, DAEA, DEAGNG, MOEAPSL, SparseEA, and RA-SparseDO (Rand2) on scikit-feature repository
Table 20 Comparisons of different algorithms on UCI machine learning repository (with population size set to 100)
Table 21 The time consumption of NSGAII, DEAGNG, ARMOEA, SparseEA, and RA-SparseDO on scikit-feature repository (all results were performed on 10 independent runs)

Table 19 shows the best results of ARMOEA, DAEA, DEAGNG, MOEAPSL, SparseEA, and RA-SparseDO (Rand2) over the different population sizes on the scikit-feature repository. That is, the red values of Tables 13, 14, 15, 16, 17, and 18 are collected in Table 19. Among the six algorithms, the one with the best performance is marked in red. It can be seen from Table 19 that the proposed RA-SparseDO (Rand2) performs best on 7 datasets. Next is DAEA, which performs best on three datasets. ARMOEA, DEAGNG, MOEAPSL, and SparseEA do not obtain any red results.

Comparative experiments for small-scale multi-objective feature selection problems

In this subsection, experiments are conducted to further verify the effectiveness of the proposed algorithm for small-scale multi-objective feature selection problems. The comparison algorithms used in this experiment are NSGAII [70], DEAGNG, ARMOEA, and the original SparseEA. Specifically, NSGAII is one of the most classical and popular multi-objective optimization algorithms and is based on the genetic algorithm. The datasets chosen in this subsection are shown in Table 3. In this experiment, RA-SparseDO (Rand1) and RA-HSparseDO (Rand1Best1) are selected for comparison with the comparison algorithms. The number of function evaluations for each algorithm is 10,000, and the population size of each algorithm is set to 100.

Table 20 shows the HV values obtained by NSGAII, DEAGNG, ARMOEA, SparseEA, and RA-SparseDO (Rand1). As before, the best results of the five algorithms are marked in red. From the penultimate row of Table 20, we can see that DEAGNG and ARMOEA each have the best performance on one dataset. RA-SparseDO (Rand1) has the best performance overall, obtaining the best value on six datasets. The best results of RA-HSparseDO (Rand1Best1) compared with NSGAII, DEAGNG, ARMOEA, and SparseEA are marked in bold. It can be seen from the last row of Table 20 that RA-HSparseDO (Rand1Best1) obtains the best results on four datasets. Next is ARMOEA, which obtains two best values. NSGAII and DEAGNG each perform best on only one dataset. Therefore, the proposed algorithm also has certain advantages for small-scale datasets.

Running time

Finally, the computational efficiency of the five algorithms is compared in this subsection. The experiment is performed on the scikit-feature repository and the results are shown in Table 21. The last row gives the total time over the ten datasets. It can be seen that the original SparseEA algorithm has the longest running time and MOEAPSL has the shortest running time. DAEA takes slightly more time to run than MOEAPSL. ARMOEA and DEAGNG take about the same time. Since the ReliefF algorithm first reduces the dimensions for LSMFSPs, the running times of the proposed RA-SparseDO (Best2), RA-SparseDO (Best2), and RA-HSparseDO (Rand2Best2) are all less than that of the original SparseEA. The reduction is most pronounced on the nci9, Prostate-GE, and TOX-171 datasets, because the number of features in these datasets exceeds 5000.

Conclusion

There are many application scenarios of LSMFSPs in real life, but little research is dedicated to solving this problem. SparseEA is an excellent algorithm for solving LSMOPs, and LSMFSPs are specific applications of LSMOPs. Therefore, this manuscript proposes an enhanced SparseEA algorithm based on ReliefF with difference operators to specifically solve LSMFSPs. SparseEA determines feature scores by calculating the fitness of individual features, which does not reflect the correlation between features well. Therefore, by combining the filter feature selection algorithm ReliefF with SparseEA, a filter-wrapper feature selection method is proposed in this manuscript for LSMFSPs. Furthermore, to improve the performance of SparseEA, difference operators and an adaptive score strategy are used in this manuscript.

Experiments on the SMOP test suite show that SparseDO is effective in generating offspring solutions of real variables. To verify the effect of binary differential operators on solution diversity, we conducted experiments on the scikit-feature repository. It can be seen from the three diversity indicators (the overall diversity of the population DP, genetic diversity DG, and individual diversity DI) that SparseDO (Best2), SparseDO (Rand1), SparseDO (Rand2), HSparseDO (Rand1Best1), and HSparseDO (Rand2Best2) are effective for increasing the diversity of SparseEA. The ablation experiments show that both the ReliefF-based component and the difference operator-based component are effective for solving LSMFSPs. Comparative experiments for LSMFSPs are conducted on the scikit-feature repository. The experimental results show that the proposed algorithm is significantly better than ARMOEA, DEAGNG, MOEAPSL, and the original SparseEA for solving LSMFSPs. Meanwhile, experiments on the UCI repository show that the proposed algorithm also has certain advantages for small-scale datasets. In addition, since the ReliefF algorithm first reduces the dimensions for LSMFSPs, the running times of the proposed algorithms are less than that of the original SparseEA.

Future work will study other ways of combining difference operators with SparseEA. The combination of other traditional feature selection algorithms with meta-heuristic algorithms is also a key direction for future work.