Introduction

Various LSMOPs [1, 2] arise in many scientific and engineering fields. In general, their optimization objectives conflict with each other, so each LSMOP requires a set of solutions that balance the conflicting objectives rather than a single optimal solution. LSMOPs present a high-dimensional search space that poses a severe challenge for optimization algorithms [3]: approximating the Pareto front effectively becomes harder as the number of decision variables increases. To address this problem, several large-scale multiobjective evolutionary algorithms (LSMOEAs) have been proposed in recent years. They are broadly classified into three types [4]. The first category is based on decision variable grouping [5], which divides the variables into groups and then alternately optimizes each group. The grouping strategy can be random, as in CCGDE3 [6], or heuristic, as in MOEA/DVA [7]. The second category is based on decision space simplification, which is achieved by problem reconstruction (e.g., LSMOF [8]) or dimensionality reduction techniques [9]. The third category employs novel search strategies that solve LSMOPs directly in the original decision space through new reproduction operators (e.g., DGEA [10]) or probabilistic models (e.g., GMOEA [11]).

Fig. 1 (a, b) Parallel coordinate plots of the decision variables of the solutions obtained by SparseEA and PM-MOEA on SMOP8 and by SparseEA and SparseEA2 on SMOP3, respectively

The above algorithms solve general LSMOPs well. However, many practical applications involve MOPs [12] whose optimal solutions are sparse, called sparse LSMOPs: most decision variables of their Pareto optimal solutions are zero. Such problems are receiving increasing attention in scientific research. For example, in neural network training [13], many connection layers contain structural redundancy and call for sparse structural representations; to reduce model complexity [14] and improve accuracy, many neurons should not be connected, and the corresponding weights should be set to zero. Feature selection in classification [15] requires selecting as few features as possible to reduce the dimensionality of the dataset and improve the performance of the classifier. In the network critical node detection problem [16], only a few critical nodes play an important role in the whole network; the problem aims to reduce the number of selected nodes while improving the efficiency of the whole network. Although existing LSMOEAs perform well on general LSMOPs [17], they are invariably inefficient when applied to sparse LSMOPs, because most LSMOEAs [18, 19] evolve sparse MOPs without considering the sparsity of their Pareto optimal solutions. Excessive computational resources are wasted on optimizing zero variables (i.e., useless decision variables), resulting in slow convergence on sparse optimization problems. With limited computational resources, it is therefore difficult for the population to approach the Pareto front in such a large search space.

Several sparse MOEAs have been tailored for this class of problems in recent years. SparseEA [20] uses a new genetic operator to control the generation of sparse solutions. MOEA/PSL [21] uses the hidden layer sizes of unsupervised neural networks to estimate the sparsity of nondominated solutions. PM-MOEA [22] employs pattern mining techniques to mine the sparsity of Pareto optimal solutions. MDR-SAEA [23] performs multistage detection using feature selection to remove as many decision variables as possible that are zero in the optimal solution. Unlike these detection methods based on two-layer encoding, S-ECSO [24] applies a strongly convex sparse operator that generates sparse solutions directly during the search process. ST-CCPSO [25] introduces a sparse truncation operator, which uses the cumulative gradient value to determine whether a variable is zero.

Figure 1 plots the parallel coordinate plots of the solutions obtained by SparseEA, PM-MOEA and SparseEA2, where all variables outside the gray area are zero in the Pareto optimal solution. The value in the upper right corner of each subplot is the median IGD value over 30 runs of the corresponding algorithm. The sparsity of these two problems is set to 0.1, so 90\(\%\) of the decision variables in the Pareto optimal solution are zero. In Fig. 1a, both SparseEA and PM-MOEA use the same genetic operators for the real variables, i.e., simulated binary crossover [26] and polynomial mutation [27], so they optimize the decision variables to the same extent. As shown in the figure, PM-MOEA performs the best sparsity detection and obtains the best IGD value. The detection results of SparseEA are not as good as those of PM-MOEA, and its optimization result is slightly worse. Accurate sparse detection is therefore crucial for improving algorithm performance. Moreover, when the accuracy of sparse detection is the same, the importance of variable optimization cannot be ignored either. As shown in Fig. 1b, although SparseEA and SparseEA2 adopt the same detection mechanism, SparseEA2 optimizes the decision variables better by strengthening the connection between the real variables and the mask, and thus performs better than SparseEA. This shows that variable optimization is also critical for improving algorithm performance.

Based on the above analysis, this paper proposes MOEA-ESD. It uses an adaptive sparse genetic strategy to detect the sparsity of decision variables. To prevent the inadequate sparse detection caused by purely local detection, the ESD strategy is proposed to mine the sparse distribution of the population globally. In addition, an improved weighted optimization strategy is used to fully optimize the nonzero variables so that the population can converge to the sparse Pareto front. Specifically, the main contributions of this paper are as follows:

1.

An adaptive sparse genetic operator is proposed. It adjusts the number of flips at different stages of the algorithm according to the specific sparse problem to better detect the sparsity of individuals. In addition, exploiting the properties of sparse problems, this operator narrows the mutation range to maximize the optimization of critical nonzero variables and avoid wasting computational resources on useless decision variables.

2.

To overcome the limitations of localized detection, this paper proposes the ESD strategy, which learns the sparse information of the current population through linear combinations of mask vectors to mine the sparse distribution of decision variables on a global scale. The linear combination process is based on the idea of problem transformation, i.e., the evolution of the mask is transformed into the optimization of a coefficient vector. We optimize the coefficients to find the best combination and then use the transformation function to obtain the corresponding mask vector, thereby enhancing the sparsity of the solutions. In addition, the ESD strategy can be easily embedded into other sparse MOEAs to improve their sparse detection capability.

3.

We use an improved weighted optimization strategy to find improved nonzero variables in a reduced subspace. The genetic operator is performed in the reduced subspace, where the nonzero variables are easier to optimize, resulting in a good balance between exploration and exploitation. Based on these components, MOEA-ESD is proposed, and the experimental results show that it significantly outperforms the compared LSMOEAs (including three sparse LSMOEAs) in enhancing the sparsity of the solutions and improving the optimization ability of the algorithm.

The rest of the paper is organized as follows. “Related work and motivation” reviews the existing representative general LSMOEAs and sparse LSMOEAs. “Proposed algorithm” elaborates the details of the proposed algorithm. “Experimental results and analysis” conducts comparative experiments and analyzes the experimental results. The final section summarizes the conclusions and outlines future work.

Related work and motivation

Sparse multiobjective optimization problem

Without loss of generality, an unconstrained MOP [28] can be formulated as:

$$\begin{aligned} \begin{aligned}&\text {Minimize} \quad F({\textbf{x}})=(f_{1}({\textbf{x}}),f_{2}({\textbf{x}}),\dots ,f_{M}({\textbf{x}})), \\&\text {subject to} \qquad {\textbf{x}} \in \varOmega \end{aligned} \end{aligned}$$
(1)

where \(F({\textbf{x}})\) consists of M real-valued continuous objectives \(f_{1}({\textbf{x}}),f_{2}({\textbf{x}}),\dots ,f_{M}({\textbf{x}})\) and \(\varOmega \) denotes the decision space. A solution \({\textbf{x}}\) dominates a solution \({\textbf{y}}\) if \(f_i({\textbf{x}})\le f_i({\textbf{y}})\) for all \(i\in \{1,2,\dots ,M\}\) and there exists at least one \(j\in \{1,\dots ,M\}\) such that \(f_j({\textbf{x}})<f_j({\textbf{y}})\). A solution is called Pareto optimal if no solution dominates it. If an MOP is characterized by a large number of decision variables and sparse optimal solutions, it is known as a sparse multiobjective optimization problem.
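For concreteness, the dominance relation above can be checked as in the following minimal Python sketch (minimization is assumed, and the function name is ours, not part of the paper):

```python
import numpy as np

def dominates(fx, fy):
    """Return True if objective vector fx Pareto-dominates fy in a
    minimization sense: no worse in every objective and strictly
    better in at least one."""
    fx, fy = np.asarray(fx, dtype=float), np.asarray(fy, dtype=float)
    return bool(np.all(fx <= fy) and np.any(fx < fy))

# A hypothetical bi-objective example
print(dominates([1.0, 2.0], [1.5, 2.0]))   # True: better in f1, equal in f2
print(dominates([1.0, 3.0], [1.5, 2.0]))   # False: the two vectors are incomparable
```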

Sparse large-scale optimization algorithms

The sparsity of Pareto optimal solutions poses a great challenge to extant LSMOEAs. In contrast to general LSMOPs, only a few decision variables play a role, while the rest are zero. To efficiently solve sparse LSMOPs, several algorithms dedicated to this class of problems have been proposed.

The first category is based on new search strategies. SparseEA [20] has a framework similar to that of NSGA-II [29]; its innovation lies in a new population initialization strategy and a sparse genetic operator that ensure the sparsity of the solutions. However, SparseEA attaches importance to sparse detection while neglecting the full optimization of useful variables. To solve this problem, SparseEA2 [30] enhances the connection between real and binary variables using variable grouping techniques: when a binary variable is flipped, the decision variables at the corresponding positions are optimized. MSKEA [31] is also inspired by SparseEA. Since SparseEA only utilizes invariant prior knowledge and does not update it dynamically, which may degrade optimization performance, MSKEA uses a multistage evolution mechanism based on knowledge fusion. The mechanism introduces three different kinds of sparse knowledge to guide the evolution, maintains a good balance between exploration and exploitation, and improves the performance of the algorithm. TS-SparseEA [32] proposes a two-stage evolutionary framework tailored for sparse LSMOPs. It first uses a binary weight optimization framework to obtain a well-approximated population; then, hybrid coding is used to evolve the real vectors and masks separately, and finally both are combined based on similarity.

Unlike the two-layer encoding approach, S-ECSO [24] uses a strongly convex sparse operator that can directly generate promising sparse solutions. In addition, it uses an enhanced competitive swarm optimizer that updates particles in different ways, balancing exploration and exploitation. Inspired by the gradient descent method, ST-CCPSO [25] applies a sparse truncation operator that determines whether to set a variable to zero according to its cumulative gradient value. Similar to S-ECSO, ST-CCPSO employs a cluster-based competitive particle swarm optimizer to achieve a balance between exploration and exploitation.

The second category is based on dimensionality reduction methods. MOEA/PSL [21] adopts a restricted Boltzmann machine (RBM) [33] and a denoising autoencoder (DAE) [34] to learn the sparse distribution and a compact representation of the decision variables, respectively. The combination of the sparse distribution and the compact representation approximates the Pareto-optimal subspace, and the genetic operator is conducted in the learned subspace, so the search space is greatly reduced. PM-MOEA [22] uses pattern mining techniques [35] to mine two candidate sets of the Pareto optimal solutions, i.e., the maximum candidate set and the minimum candidate set. The state of each decision variable of an offspring solution is jointly determined by the two candidate sets: variables in the minimum candidate set are always set to 1, variables in the maximum candidate set are determined by the genetic operator, and the remaining variables are fixed to 0. MDR-SAEA [23] proposes a multistage dimensionality reduction framework for expensive sparse LSMOPs that filters out the useful decision variables; finally, a surrogate-assisted evolutionary algorithm [36] is applied to optimize the key nonzero variables.

Motivation

SparseEA detects the sparsity of decision variables by flipping potentially useful variables. However, it fixes the number of flipped variables throughout the evolutionary process, which may prevent the algorithm from adapting well to different sparse problems and different stages of the search. In Fig. 2, SparseEA\('\) indicates the correct detection rate of SparseEA on the set of useful variables, and SparseEA\(''\) indicates the error detection rate of SparseEA on the set of useless variables. The detection accuracy of SparseEA is clearly insufficient. We can infer that adjusting the number of flips at different stages of the algorithm according to the specific sparse problem would benefit the detection of variable sparsity.

Fig. 2 Ratio of nonzero decision variables in each solution set

It is worth noting that each individual flips its decision variables randomly using only its own local information. This may result in some critical variables not being detected within a limited number of evaluations. As shown in Fig. 3, the key nonzero variables are highlighted. Although most individuals detect the useful variables at positions \(x_1\), \(x_2\) and \(x_3\), a few individuals may miss these variables due to the random nature of the flips. Some individuals may also mistakenly flip zero variables, such as those at positions \(x_{(1)}\), \(x_{(2)}\) and \(x_{(3)}\). All these operations eventually lead to inaccurate sparse detection. To address this, we need to learn the global information of the population and mine the sparse distribution on a global scale.

Fig. 3 The table represents the mask matrix of the population. The numbers at the top of the table indicate the importance of each variable, and the gray areas indicate the useful decision variables

Sparse detection obviously has a great impact on the performance of the algorithm. However, when the detection accuracy is the same, the influence of variable optimization on performance cannot be ignored either. Sparse LSMOPs naturally have a high-dimensional search space, and as the number of decision variables increases, evolutionary algorithms often face the “curse of dimensionality”. SparseEA does not take effective measures to alleviate this curse, which may prevent it from effectively optimizing the decision variables.

Fig. 4 Parallel coordinate plot of the decision variables of solutions obtained by MOEA/PSL and SparseEA on SMOP3 with 3000 variables

MOEA/PSL learns Pareto optimal subspaces during evolution. The genetic operator is conducted in the learned subspace, and the search space is greatly reduced. Compared to SparseEA, MOEA/PSL can achieve better optimization for useful variables. Figure 4 plots the parallel coordinate plot of the solutions obtained by MOEA/PSL and SparseEA run on SMOP3. The sparsity of the problem is set to 0.1. Both algorithms accurately detect the locations of nonzero variables, but MOEA/PSL achieves better experimental results than SparseEA. This shows that the optimization of decision variables also has a significant impact on the performance of the algorithm. As a result, to solve sparse LSMOPs, we need to take the necessary measures to overcome the dimensionality curse and achieve better optimization of nonzero variables within a limited function evaluation budget.

Based on the above analysis, MOEA-ESD applies the ESD strategy to the entire population, in addition to the adaptive sparse genetic operator for individuals, in the optimization process. This will further improve the sparse detection capability of the algorithm. In addition, to avoid the lack of optimization of variables due to the curse of dimensionality, the algorithm employs an improved weighted optimization strategy to fully optimize the decision variables. Next, we describe the algorithm in detail.

Proposed algorithm

Framework of the MOEA-ESD

Algorithm 1 General framework of MOEA-ESD

The general framework of MOEA-ESD is shown in Algorithm 1. A solution x is represented by the product of a binary vector mask and a real vector dec, i.e., \(x_i=dec_i \times mask_i\), where \(dec_i\) is the i-th real value of the solution and \(mask_i\) indicates whether the i-th position of the solution is zero (Line 4). Before the population evolves, the importance of the decision variables is measured and the initial population is generated (Lines 5–6). The quality of each solution is directly related to the importance of its nonzero variables. As in Eq. (2), the score of the i-th decision variable is the sum of S(j) over all individuals j that dominate individual i in the population, where S(j) denotes the number of individuals dominated by individual j. The score reflects the importance of the variable: the lower the score, the lower the probability of the variable being set to zero, i.e., the more important the variable. The decision variables are then divided into r groups according to their importance.

$$\begin{aligned} Score(i)=\sum _{j\,:\, x_j \prec x_i} S(j) \end{aligned}$$
(2)

In the main loop of the algorithm, \(t_1\) function evaluations are first used to optimize the initial population Q (Line 8). To ensure that the decision variables are fully optimized, we exploit the fact that the weighted optimization strategy reduces the search space, so that the nonzero variables can be optimized more efficiently in the subspace (Line 9). Environmental selection is then performed on the union of the weighted population and the parent population. After these evaluations are consumed, the remaining budget is assigned to the next stage. We first use the ESD strategy to further mine the sparse distribution of the population (Line 13). Then, Algorithm 2 uses \(t_4\) function evaluations to optimize the population Q (Line 15). For simplicity, tr is set to 0.5, i.e., \(50\%\) of all FEs is allocated to each stage. The other important components of Algorithm 1 are described in the following subsections.
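As a minimal sketch of the two-layer encoding used throughout the algorithm (the numeric values below are illustrative only):

```python
import numpy as np

def decode(dec, mask):
    """Two-layer encoding described above: the evaluated solution is the
    element-wise product of the real vector dec and the binary mask, so
    every position with mask == 0 is exactly zero."""
    return dec * mask

# Hypothetical example with D = 6 decision variables
dec  = np.array([0.7, 0.1, 0.9, 0.4, 0.2, 0.8])
mask = np.array([1,   0,   1,   0,   0,   0])
x = decode(dec, mask)   # -> [0.7, 0.0, 0.9, 0.0, 0.0, 0.0]
```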

Sparse detection strategy

Algorithm 2 Adaptive sparse detection strategy

This section introduces the adaptive sparse detection strategy, which ensures the sparsity of the solutions. General sparse LSMOEAs initialize a random population and then set the useless variables to zero. Due to random initialization, there is a 50\(\%\) probability that each variable in the mask is set incorrectly. Typically, sparse application problems have a sparsity of 10\(\%\) or less, so starting from a randomly initialized mask implies a greater evolution distance than starting from an all-zero mask. Based on this property of sparse problems, we set the masks of the initial population to all zeros and adaptively flip potentially useful variables. The saved evaluations can then be spent on optimizing the key variables to provide stronger convergence pressure. Based on this idea, an adaptive sparse detection strategy is proposed in this paper. The strategy adaptively adjusts the number of flips for different stages of the algorithm. In addition, we only flip zero elements to find as many useful variables as possible; correcting incorrectly flipped zero variables and finding missed useful variables is left to the ESD strategy. Next, we describe the specific steps of the algorithm in detail.

Before the algorithm starts, a pool \(N_s=\{n_1, n_2, \dots , n_k\}\) containing k different numbers is preset. Before each generation of evolution, a roulette wheel selection mechanism is used to select a number n, which gives the number of zero variables to be flipped (Line 4). To select the number appropriately, its selection probability is determined by the performance improvement it produced. Corresponding to the pool \(N_s\), the performance improvement table is defined as \(R=\{r_1,r_2,\dots ,r_k\}\), which stores the relative performance improvement of the selected number in the current evolutionary round. In the initial stage, each element of R is set to 1, indicating that every number has the same probability of being selected. The strategy does not fix the number of flips, so that a more appropriate value can be found for different stages of the algorithm. At the beginning of the algorithm, the number of detected useful decision variables is small, and we need to flip as many zero variables of high importance as possible. As the algorithm proceeds, more useful decision variables are detected, and the number of flips required decreases. In this way, we can detect as many useful decision variables as possible as quickly as possible, which facilitates the ESD strategy in mining the sparse distribution of the population as a whole.
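The roulette-wheel choice of the flip count described above can be sketched as follows (a hypothetical helper, not the authors' implementation; the pool values follow the experimental settings):

```python
import numpy as np

def select_flip_count(Ns, R, rng=None):
    """Roulette-wheel selection of the flip count n from the pool Ns,
    where R stores the relative performance improvement of each entry;
    all entries start at 1, i.e., equal selection probability."""
    rng = rng or np.random.default_rng()
    R = np.asarray(R, dtype=float) + 1e-12   # guard against an all-zero table
    prob = R / R.sum()
    return Ns[rng.choice(len(Ns), p=prob)]

Ns = [1, 2, 3, 5]           # candidate flip counts (the pool used in the experiments)
R  = [1.0, 1.0, 1.0, 1.0]   # initial performance improvement table
n = select_flip_count(Ns, R)
```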

First, nondominated sorting [37] is applied to the population Q, and the crowding distance is computed. Then, N cycles are executed. Each cycle selects two parents p and q by binary tournament selection and produces one offspring o, with o.mask initialized to p.mask (Lines 6–7). Then, an element is selected from the zero elements of \(p.mask \cap q.{\overline{mask}}\), and that element of o.mask is set to 1 (Lines 8–9). Next, n loops are executed; each iteration selects one element from the zero elements of o.mask via tournament selection and sets it to 1, and the selected element is then removed to prevent duplicate selections (Lines 10–14). p.dec and q.dec then undergo simulated binary crossover to generate o.dec.

In general, polynomial mutation mutates each variable with probability 1/D only. However, the key variables of sparse problems generally account for only a small percentage of the decision variables, and performing genetic operations on each variable with the same probability makes it difficult to optimize the nonzero variables. As shown in Fig. 5, the mutated variables are highlighted: even though the key useful variables have been precisely found, the nonzero variables are hard to evolve in this iteration due to the large proportion of zero variables. We therefore need to narrow the mutation range and focus on the key nonzero variables and the zero variables with potential value, without wasting limited computational resources on useless decision variables. We perform polynomial mutation on the union of the nonzero variables and the most important \(2\%*n\) zero variables. This focuses the limited computational resources on the optimization of the important decision variables and greatly reduces the search space of the genetic operator. The proportion of zero variables undergoing genetic operations is dynamically adjusted during the evolutionary process, because the number of useful variables detected in the earlier stages is small and zero variables with high importance are more likely to be potentially useful; as the number of detected useful variables increases, the number of potentially useful variables decreases. Finally, environmental selection is executed on \(O\cup Q\), and the N solutions with better quality form the next generation population (Lines 15–18).
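A minimal sketch of how the restricted mutation index set can be built (the helper name is ours; following Eq. (2), lower scores mean more important variables, and applying the 2% fraction to the total number of variables is an assumption):

```python
import numpy as np

def mutation_index_set(mask, score, zero_ratio=0.02):
    """Indices eligible for polynomial mutation: all nonzero variables plus
    the most important zero variables (those with the lowest scores)."""
    nonzero = np.flatnonzero(mask == 1)
    zero = np.flatnonzero(mask == 0)
    k = max(1, int(zero_ratio * mask.size))
    top_zero = zero[np.argsort(score[zero])[:k]]   # k zero variables with the lowest scores
    return np.concatenate([nonzero, top_zero])

# Hypothetical example: D = 10 variables, 3 of them currently nonzero
mask  = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
score = np.arange(10)                 # pretend variable 0 has the lowest score
idx = mutation_index_set(mask, score) # -> indices [0, 3, 6] plus zero variable 1
```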

Fig. 5 An example of the SparseEA mutation process, where the variable undergoing mutation is highlighted

Finally, the relative performance improvement of the selected number is calculated according to Eq. 3.

$$\begin{aligned} r_i = C(CUR,PRE)=\frac{|\{ u \in PRE \mid \exists v \in CUR: v\prec u \}|}{N}, \end{aligned}$$
(3)

where N denotes the size of the population, and CUR and PRE are two PF approximations of an MOP. In this paper, we use the C-metric [38] to measure the relative performance improvement. This metric is defined as the proportion of solutions in PRE that are dominated by at least one solution in CUR, where CUR has the same size as PRE. When C(CUR, PRE) is 1, all solutions in PRE are dominated by some solution in CUR; when C(CUR, PRE) is 0, no solution in PRE is dominated by any solution in CUR. Finally, based on the updated R, the number n is reselected from the pool \(N_s\) (Line 19).
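Equation (3) translates directly into the following sketch (minimization is assumed):

```python
import numpy as np

def c_metric(CUR, PRE):
    """C(CUR, PRE) of Eq. (3): the fraction of solutions in PRE dominated by
    at least one solution in CUR (rows are objective vectors)."""
    CUR, PRE = np.asarray(CUR, float), np.asarray(PRE, float)
    dominated = sum(
        any(np.all(v <= u) and np.any(v < u) for v in CUR) for u in PRE
    )
    return dominated / len(PRE)

# Hypothetical PF approximations of a bi-objective problem
PRE = [[0.6, 0.6], [0.2, 0.9]]
CUR = [[0.5, 0.5], [0.9, 0.1]]
print(c_metric(CUR, PRE))   # 0.5: only [0.6, 0.6] is dominated (by [0.5, 0.5])
```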

Weighted optimization strategy

Algorithm 3 Improved weighted optimization strategy

To achieve a good balance between exploration and exploitation, this paper uses an improved weighted optimization strategy to further optimize the real variables. The pseudocode of this strategy is shown in Algorithm 3. It first selects the best \(M+1\) solutions from the nondominated solutions of the current population based on the crowding distance. Next, \(M+1\) transformations and optimizations are performed on the weights in the weighted optimization step. At the beginning of each loop, the original problem is transformed to obtain a new optimization problem \(P_k\) using the transformation function. Then, the weighted population \(W_k\) is randomly initialized and is optimized using a metaheuristic algorithm for a total of \(t_2\) evaluations. The whole process is executed \(M+1\) times, and a set of weighted populations of size \(M+1\) is obtained (Lines 5–9).
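The transformation function itself is not restated here; as one plausible form, in the spirit of the WOF-style weighted optimization the paper builds on, a group-wise weight vector can rescale a base solution, as in the following sketch (the group-wise form and all names are assumptions, not the authors' exact transformation):

```python
import numpy as np

def transformed_objective(w, base_x, groups, F, lower, upper):
    """Weighted transformation sketch: w has one entry per variable group and
    rescales the base solution base_x; the original objectives F are then
    evaluated on the clipped result, so only the low-dimensional w is searched."""
    x = np.clip(base_x * w[groups], lower, upper)
    return F(x)

# Hypothetical usage: D = 6 variables split into r = 2 groups
groups = np.array([0, 0, 0, 1, 1, 1])            # group index of each variable
base_x = np.array([0.7, 0.0, 0.9, 0.0, 0.0, 0.2])
F = lambda x: np.array([np.sum(x ** 2), np.sum((x - 1) ** 2)])  # toy bi-objective
print(transformed_objective(np.array([1.2, 0.5]), base_x, groups, F, 0.0, 1.0))
```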

$$\begin{aligned} \begin{aligned} d(X,Y)&= \sum _{i=1}^{D}(x_i-y_i)^2 \\ c(X,Y)&= \frac{\sum _{i=1}^{D}(x_i \times y_i)}{\sqrt{\sum _{i=1}^{D}(x_i)^2}\times \sqrt{\sum _{i=1}^{D}(y_i)^2}} \\ S(X,Y)&= \frac{1}{1+d(X,Y)} + 0.5+0.5 \times c(X,Y). \end{aligned} \end{aligned}$$
(4)

In the next step, the obtained weights are assigned to the original population. To reduce the number of evaluations, the solution \(w_k\) with the largest crowding distance is selected from the nondominated solutions of each weight population \(W_k\), and the weight vector \(w_k\) is applied to all real vectors dec of the original population. We then need to find the most appropriate mask vector for each weighted real vector. An improper combination of mask and real vector may produce a weak solution, wasting the computational resources needed to find a good one. As shown in Fig. 6, suppose \(x_p=(1,0.2)\) is the Pareto optimal solution. Given a real vector x = (0.9, 0.1), the solution \(x_{(1)}\) obtained by combining x with the mask (1, 0) is closer to the Pareto front and better than the solution \(x_{(2)}\) obtained by combining x with the mask (0, 1), because the real vector x has found a better-matching mask vector. If x is incorrectly combined with the mask (0, 1), the weak solution \(x_{(2)}\) is generated, leading away from the Pareto front. However, returning the best mask in the traditional exhaustive way requires on the order of \(N^2\) evaluations, which is prohibitive. Therefore, we propose a matching mechanism to solve this problem. The mechanism uses the Euclidean distance and the cosine similarity (Eq. 4) to jointly measure the degree of matching between a real vector and a mask. If the combined solution is more similar to the Pareto optimal solution \(x_p\), we can assume that it is also better, indicating that the mask matches the real vector better (Lines 10–16). To prevent the deterioration of diversity, a final duplicate-removal step is performed on the new population.
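Equation (4) can be computed as in the following sketch, reproducing the Fig. 6 example (the small constant in the denominator guards against zero vectors and is our addition):

```python
import numpy as np

def similarity(X, Y):
    """Matching degree of Eq. (4): the reciprocal of one plus the squared
    Euclidean distance, plus a cosine-similarity term rescaled to [0, 1]."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    d = np.sum((X - Y) ** 2)
    c = np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y) + 1e-12)
    return 1.0 / (1.0 + d) + 0.5 + 0.5 * c

# Fig. 6 example: x = (0.9, 0.1) combined with masks (1, 0) and (0, 1)
x_p = np.array([1.0, 0.2])                      # assumed Pareto optimal solution
print(similarity(np.array([0.9, 0.0]), x_p))    # mask (1, 0): higher matching degree
print(similarity(np.array([0.0, 0.1]), x_p))    # mask (0, 1): lower matching degree
```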

Fig. 6 Example of real vector and mask matching

Enhanced sparse detection strategy (ESD)

Algorithm 4 Enhanced sparse detection strategy (ESD)

This section describes the proposed ESD strategy. In metaheuristic optimization algorithms, promising solutions are typically generated by the crossover of other good solutions. Differently, xNSGA-II [39] generates new solutions by linear combination of existing solutions; the idea is to extract useful information from the optimization results through linear combination. Inspired by this idea, the ESD strategy is proposed. During evolution, the optimization result contains the currently sparsest solutions. The ESD strategy integrates all mask vectors of this solution set in order to analyze the proportion of individuals that regard a variable as useful and thus determine the usefulness of each variable more accurately through a global voting mechanism. This allows more precise sparsity detection on a global scale and also speeds up the convergence of the population to the Pareto front to some extent.

Suppose the population comprises N solution vectors, each of dimension D. We define the mask set of the population as M, i.e., \(M=\{m^{(1)},m^{(2)},\dots ,m^{(N)}\}\), where each mask vector is \(m^{(i)}=(m_1^{(i)},m_2^{(i)},\dots ,m_D^{(i)})\) and each element \(m_j^{(i)}\) of vector \(m^{(i)}\) is 0 or 1. The set M defines a search subspace over the mask vectors, whose dimensionality is given by the population size of the optimization problem.

$$\begin{aligned} M=\left( \begin{array}{cccc} m_{1}^{(1)} &{} m_{2}^{(1)} &{} \ldots &{} m_{D}^{(1)} \\ m_{1}^{(2)} &{} m_{2}^{(2)} &{} \ldots &{} m_{D}^{(2)} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ m_{1}^{(N)} &{} m_{2}^{(N)} &{} \ldots &{} m_{D}^{(N)} \\ \end{array}\right) \end{aligned}$$
(5)

The linear combination of mask vectors in the population is defined as follows:

$$\begin{aligned} m' = \frac{k_1 \times m^{(1)}+k_2 \times m^{(2)}+\cdots +k_N \times m^{(N)}}{N} \end{aligned}$$
(6)

We use k to denote the coefficient vector of the above mask combination; i.e., \(k = (k_1,k_2,\dots ,k_N)\), \(k_i \in \{0,1\}\).

Fig. 7 Schematic diagram of the combination process

The pseudocode framework for the ESD is given in Algorithm 4. In the ESD, M denotes the mask vector matrix of the nondominated solutions, where each row corresponds to the mask vector of one nondominated solution (Line 2). The ESD performs a transformation and optimization of the coefficients. To do this, we first draw a good solution \(\textbf{x}\) from the current population (Line 3). In each transformation and optimization step, an \(N'\times N\) binary coefficient vector matrix is randomly initialized (Line 4), where \(N'\) denotes the population size of the suboptimization problem. Because the transformed problem requires it, the real vector of \(\textbf{x}\) acts as the benchmark real vector in the transformation process (Line 5). Next, the transformed problem \(P_{\textbf{x}}\) is constructed from the benchmark real vector \(\textbf{x}\), the original problem P, the binary coefficient vector matrix Cm and the mask vector matrix M (Line 6). The coefficient population is initialized as follows, with Cm and M being matrices of size \(N' \times N\) and \(N\times D\), respectively. Based on Eq. 6, Cm and M are multiplied to obtain an \(N' \times D\) mask vector matrix; each mask vector in this matrix is then multiplied by the benchmark real vector \(\textbf{x}.dec\) to obtain the coefficient population \(C_{k}\) (Line 7). Next, we run NSGA-II on this newly formed coefficient population to find a suitable combination (Line 8). It is worth noting that, since the elements of the mask vectors are 0 or 1, each element of the combined vector \(m'\) satisfies \(m'_i \in \{0,1/N,\dots ,1\}\). In this paper, we assume that if \(m'_i\ge t\), the decision variable corresponding to \(m'_i\) is flipped to 1. A schematic diagram of the combination process is shown in Fig. 7. The threshold t means that the fraction of nonzero entries at a position, relative to the entire population size, must reach t for the combined mask to be 1 at that position; otherwise, it is 0. The more individuals that have a 1 at a position, the more important the corresponding variable is and the higher its probability of being set to 1.
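A minimal sketch of the combination and thresholding step described above (the matrix values are hypothetical; t = 0.3 follows the experimental settings):

```python
import numpy as np

def combine_masks(k, M, t=0.3):
    """ESD combination of Eq. (6) followed by thresholding: k is a binary
    coefficient vector of length N, M is the N x D mask matrix of the
    population; positions whose averaged value reaches the threshold t are
    set to 1 in the combined mask."""
    m_prime = (k @ M) / M.shape[0]        # each entry lies in {0, 1/N, ..., 1}
    return (m_prime >= t).astype(int)

# Hypothetical example: N = 4 mask vectors of D = 5 variables
M = np.array([[1, 0, 1, 0, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]])
k = np.array([1, 1, 0, 1])                # which masks enter the combination
print(combine_masks(k, M))                # -> [1 0 1 0 0] with t = 0.3
```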

The result of this low-dimensional optimization step is a coefficient population C. To save limited computational resources, one individual is selected from the set of nondominated solutions of the coefficient population C, and the mask vector corresponding to it is applied to the real vector of each solution (Lines 9–12). Finally, the environmental selection mechanism is executed on \(Q\cup Q'\) (Line 13). Note that in each round of the transformation and optimization step, we use only the optimal member of the coefficient population, which represents a mask vector with a good sparse distribution.

Since the optimization of the coefficient population C depends on the subspace defined by the population Q, and all mask vectors are created randomly in the initial stage, there is no guarantee that good solutions lie in this subspace. The algorithm should therefore be allowed to run for some time to find an approximately optimal subspace before executing the ESD, so the ESD is executed in the later stage of the algorithm. In particular, to prevent the algorithm from searching only locally within the N-dimensional subspace defined above, the ESD strategy should be used in rotation with Algorithm 2.

Table 1 IGD values obtained for different LSMOEAs on SMOP1\(\sim \)SMOP8 with decision variables from 500 to 3000, with the best results highlighted in each row

Experimental results and analysis

This section verifies the effectiveness of MOEA-ESD on eight benchmark sparse LSMOPs. The LSMOEAs used for performance comparison include CCGDE3 [6], LMEA [40], WOF-NSGA-II [41], and LMOCSO [42], which are based on decision variable grouping, decision variable analysis, decision space simplification, and new search strategies, respectively. The sparse LSMOEAs include PM-MOEA [22], MOEA/PSL [21] and SparseEA [20], which are all MOEAs dedicated to solving sparse LSMOPs. All experiments are implemented on PlatEMO [43].

Experimental settings

Specific parameter settings: For CCGDE3 [6], the number of groups is set to 2, and random grouping is used. For WOF-NSGA-II [41], the evaluation number \(t_1\) of the original problem is set to 1000, the evaluation number \(t_2\) of the transformation problem is set to 500, the solution number q for weight optimization selection is set to 3, the number of groups is set to 4, and ordered grouping is used. In LMEA, the number of selected solutions for variable clustering is set to 2, the number of perturbations for each solution is set to 4, and the number of solutions for variable interaction analysis is set to 5. For PM-MOEA, the population size of the pattern mining method is set to 20, and the number of generations is set to 10. In MOEA-ESD, the group size r is set to 8, the pool \(N_s\) is set to [1, 2, 3, 5] and the threshold t is set to 0.3. The evaluation numbers \(t_1\) and \(t_4\) of Algorithm 2 are set to 1000 and 3000, respectively. The evaluation numbers \(t_2\) and \(t_3\) of the transformed problem are set to 500 and 200, respectively.

Stopping condition and population size: The population size is set to 100. The maximum number of evaluations per MOEA is set to \(100\times D\), and D is the number of decision variables.

Test problems: Eight benchmark sparse multiobjective test problems SMOP1 \(\sim \) SMOP8 [20] are used to test the performance of the compared MOEAs. These problems are deceptive and have low intrinsic dimensionality; hence, it is challenging for existing MOEAs to obtain a sparse set of optimal solutions. In this experiment, the number of objectives for these sparse multiobjective test suites is set to 2. The sparsity of the Pareto optimal solutions is set to 0.1.

Performance metrics: The IGD indicator [44] is used to measure the results of MOEAs on the benchmark problems.

Performance analysis of benchmarking problems

Performance comparison with general LSMOEAs

Table 1 shows the IGD values obtained by CCGDE3, LMEA, WOF-NSGA-II, LMOCSO and the proposed MOEA-ESD on SMOP1\(\sim \)SMOP8 with 500, 1000 and 3000 decision variables. Each algorithm is run 30 times independently. The experimental results are statistically analyzed using the Wilcoxon rank sum test [45] with a significance level of 0.05. When solving sparse MOPs, MOEA-ESD performs significantly better than the other four LSMOEAs, whose results are not satisfactory. This experiment confirms that general LSMOEAs cannot deal with sparse LSMOPs well.

Figure 8 plots the nondominated solution sets with median IGD obtained by the compared algorithms after 30 runs on SMOP3, SMOP5 and SMOP8 with 3000 decision variables. It can be clearly seen that the MOEAs designed for general LSMOPs cannot converge to the true Pareto front. Figure 9 shows the parallel coordinate plots [46] of the decision variables of the solutions obtained by the compared algorithms on SMOP3 with 3000 variables; all variables outside the gray area are zero in the Pareto optimal solution. Since these LSMOEAs do not consider the sparsity of the Pareto optimal solution during optimization, their final solutions contain a large number of nonzero decision variables. The decision variables of the solutions obtained by CCGDE3, LMEA, LMOCSO and WOF-NSGA-II are mostly far from zero. This illustrates that general LSMOEAs are not suitable for solving sparse LSMOPs, so specialized algorithms with sparse detection need to be tailored for this type of problem.

Performance comparison with sparse LSMOEAs

Table 2 lists the means and standard deviations of the IGD values obtained by SparseEA, MOEA/PSL, PM-MOEA, and the proposed MOEA-ESD on SMOP1\(\sim \)SMOP8 with 500, 1000, 3000, and 5000 decision variables. According to the Wilcoxon rank sum test with a significance level of 0.05, the statistical results for the three compared algorithms relative to MOEA-ESD are 0/31/1, 0/32/0, and 8/24/0, respectively. The table shows that, among the 32 benchmark test instances, PM-MOEA obtains the best results on SMOP4 and SMOP6; the results of MOEA-ESD on these two problems are slightly worse than those of PM-MOEA but better than those of SparseEA and MOEA/PSL. On all other problems, MOEA-ESD obtains the best results, so MOEA-ESD achieves the best results overall. Tables 1 and 2 also show that, on the 24 benchmark test problems they share, the four MOEAs tailored specifically for sparse MOPs clearly outperform the general LSMOEAs on the sparse test suite, which demonstrates the need to tailor MOEAs for such problems.

Fig. 8 Nondominated solution sets with median IGD obtained by CCGDE3, LMEA, WOF-NSGA-II, LMOCSO and the proposed MOEA-ESD on SMOP3, SMOP5 and SMOP8 with 3000 decision variables

Fig. 9 Parallel coordinate plots of the decision variables of solutions obtained by CCGDE3, LMEA, WOF-NSGA-II, LMOCSO, and the proposed MOEA-ESD on SMOP3 with 3000 variables. All variables outside the gray area are zero in the Pareto-optimal solutions

Table 2 IGD values obtained by different sparse LSMOEAs on SMOP1 \(\sim \) SMOP8 with decision variables from 500 to 5000, with the best results highlighted in each row
Fig. 10 Nondominated solution sets with median IGD obtained by SparseEA, MOEA/PSL, PM-MOEA, and the proposed MOEA-ESD on SMOP2, SMOP3, and SMOP7 with 5000 decision variables

Fig. 11 Boxplots of the IGD values of the four compared algorithms on SMOP1, SMOP2, SMOP4 and SMOP7 with 5000 decision variables

Fig. 12 Convergence curves of IGD values obtained by SparseEA, PM-MOEA, MOEA/PSL, and the proposed MOEA-ESD on SMOP1, SMOP2, SMOP3 and SMOP7 with 3000 decision variables

Fig. 13 Parallel coordinate plots of the decision variables of solutions obtained by SparseEA, MOEA/PSL, PM-MOEA, and the proposed MOEA-ESD on SMOP1, SMOP7 and SMOP8 with 5000 variables. All variables outside the gray area are zero in the Pareto optimal solution

Figure 10 plots the nondominated solution sets with median IGD obtained by the four sparse LSMOEAs run 30 times on SMOP2, SMOP3 and SMOP7 with 5000 decision variables, where the red line shows the true Pareto front of each test problem and the black dots show the approximate Pareto front. The convergence of the proposed MOEA-ESD is clearly better than that of the other algorithms. On SMOP2 and SMOP7, SparseEA, MOEA/PSL and PM-MOEA do not converge to the Pareto front, and on SMOP3 the convergence of SparseEA and PM-MOEA is not as good as that of MOEA/PSL. MOEA/PSL performs better than SparseEA and PM-MOEA in terms of convergence on high-dimensional problems but is slightly worse than PM-MOEA in terms of diversity.

To further analyze and compare the experimental results, Fig. 11 shows boxplots of the IGD values of the four compared algorithms on SMOP1, SMOP2, SMOP4 and SMOP7 with 5000 decision variables. MOEA-ESD has a significantly better mean IGD on SMOP1, SMOP2 and SMOP7 than the other three algorithms, while PM-MOEA achieves the best results on SMOP4. Figure 11a, b and d also show that the stability of MOEA-ESD is better than that of the other algorithms. Since MOEA/PSL searches in the learned Pareto optimal subspace, it alleviates the curse of dimensionality, and when solving optimization problems with 5000 decision variables it performs better than SparseEA and PM-MOEA. However, the performance of MOEA/PSL fluctuates more on the four test problems than that of the other three algorithms, producing more outliers.

For further study, Fig. 12 plots the convergence curves of the IGD values obtained by the four compared algorithms on the SMOPs. MOEA-ESD converges significantly faster than the other algorithms throughout the evolution process and reaches the Pareto front with fewer evaluations. Figure 13 shows the parallel coordinate plots of the decision variables of the solutions obtained by the compared algorithms on the SMOPs with 5000 decision variables, where the variables outside the gray area are zero in the Pareto optimal solution. MOEA-ESD detects the sparsity of the variables more adequately, and the obtained solutions are sparser than those of the other algorithms. Although the solutions of PM-MOEA are sparser than those of the other two algorithms, its results are not satisfactory because PM-MOEA focuses more on maintaining variable sparsity and neglects variable optimization. These experiments confirm the necessity of the additional optimization of the real variables performed by the proposed algorithm in the preliminary stage, which optimizes the nonzero variables more adequately without compromising the maintenance of variable sparsity.

Performance of MOEA-ESD on real-world SMOPs

In this section, we compare the performance of MOEA-ESD with other MOEAs on two real-world problems to verify the effectiveness of the proposed algorithm in solving sparse MOPs in real-world applications. The two practical problems are the neural network training problem and the portfolio optimization problem. The goal of the neural network training problem is to find a sparse network structure with the lowest classification error rate. The portfolio optimization problem aims to find the portfolio with the maximum expected return and the lowest risk.

Table 3 Parameter settings of two real-world SMOPs
Table 4 HV values obtained by LMEA, WOF-NSGA-II, LMOCSO, and the proposed MOEA-ESD on NN1-NN2 and PO1-PO2, where the best results for each row are highlighted

The corresponding parameter settings of the compared algorithms remain the same as in the previous section. The details of each problem are given in Table 3, where NN and PO denote the neural network training problem and the portfolio optimization problem, respectively. Because the true PFs of these real-world problems are unknown, we use the HV indicator with the reference point (1, 1) to evaluate the performance of the algorithms, and the Wilcoxon rank sum test with a significance level of 0.05 is used for the statistical analysis of the experimental results.

The means and standard deviations of the HV values obtained by CCGDE3, LMEA, WOF-NSGA-II, LMOCSO, SparseEA, MOEA/PSL, PM-MOEA, and MOEA-ESD in the two practical applications are presented in Tables 4 and 5, respectively. In the experimental results, ‘+’, ‘–’ and ‘\(\approx \)’ indicate that MOEA-ESD shows statistically better, worse and approximately the same performance, respectively, compared with the corresponding algorithm. MOEA-ESD achieves the best results in all four test cases. General LSMOEAs are clearly not as good as the specialized sparse MOEAs in solving real-world sparse MOPs; SparseEA, MOEA/PSL and PM-MOEA obtain better results than the general LSMOEAs because the sparsity of the problem is considered.

Figure 14 shows the parallel coordinate plots of the decision variables of the solutions obtained by the compared algorithms on a portfolio optimization problem with 5000 decision variables. The Pareto optimal solutions obtained by the general LSMOEAs are clearly not sparse. PM-MOEA obtains sparser solutions than SparseEA and MOEA/PSL. MOEA-ESD obtains the sparsest Pareto optimal solutions because it further mines the sparse distribution of the population with the ESD strategy.

Ablation experiment

The contributions of MOEA-ESD mainly arise from three components, namely, the adaptive sparse genetic operator, the improved weighted optimization strategy, and the enhanced sparse detection strategy. In this subsection, we perform ablation experiments to verify the effectiveness of each component. Specifically, the first variant, MOEA-ESD\('\), omits the improved weighted optimization strategy; the second variant, MOEA-ESD\(''\), omits the ESD strategy; the third variant, MOEA-ESD\('''\), does not use the adaptive sparse genetic operator but instead flips one zero variable at a time and uses the traditional genetic operator on all real variables.

Table 6 lists the IGD values obtained by MOEA-ESD and its three variants on SMOP1 to SMOP8 with 1000 decision variables. MOEA-ESD achieves the best results on all tested problems, which indicates that all three components are effective in solving sparse LSMOPs. The results of MOEA-ESD\('''\) show that adaptive flipping is necessary for different sparse problems. Furthermore, the results of MOEA-ESD\(''\) make clear that global detection allows the algorithm to explore the sparse distribution of decision variables more fully, thereby improving performance. The results of MOEA-ESD\('\) confirm that the optimization of useful variables is also critical when solving sparse LSMOPs and equally affects algorithm performance. In conclusion, when solving sparse LSMOPs, variable optimization should not be neglected in addition to considering the sparsity of variables.

Table 5 HV values obtained by SparseEA, MOEA/PSL, PM-MOEA, and the proposed MOEA-ESD on NN1-NN2 and PO1-PO2, where the best results for each row are highlighted
Fig. 14 Parallel coordinate plots of the decision variables of solutions obtained by LMEA, WOF-NSGA-II, LMOCSO, SparseEA, PM-MOEA, MOEA/PSL, and the proposed MOEA-ESD on PO with 5000 variables. All variables outside the gray area are zero in the Pareto optimal solution

Table 6 IGD values obtained by MOEA-ESD and its variants on SMOP1\(\sim \)SMOP8 with 1000 decision variables, with the best results highlighted in each row

For further analysis, Fig. 15 plots the convergence curves of the IGD values obtained by MOEA-ESD and its variants on SMOP7 with 1000 decision variables. As seen from the figure, MOEA-ESD converges faster than the other variants overall. Compared with MOEA-ESD, MOEA-ESD\('\) converges slightly more slowly in the early stage, indicating that the improved weighted optimization strategy can quickly guide the population toward the Pareto front early on. In addition, although MOEA-ESD\(''\) converges quickly in the early stage, it tends to stagnate in the later stage and gradually falls into a local optimum. This indicates that global detection of the population is necessary in the late stage; the ESD strategy can break this deadlock and guide the population to further convergence.

Parameter discussion

In this section, we analyze the sensitivity to changes in several main parameters of MOEA-ESD. The parameters tested are the threshold t and the group size r.

Next, we analyze the sensitivity of the algorithm to the parameter t. In the late stage of the algorithm, the ESD strategy judges whether a variable at a certain position is useful based on the threshold t. Since each element has an equal probability of being 0 or 1 when the binary coefficient vector evolves, the maximum value of t is 0.5. When the number of nonzero variables at a position exceeds \(2*t*N\), the variable is flipped. Figure 16 shows the IGD results of MOEA-ESD with different t values, ranging from 0.05 to 0.5, on SMOP2 and SMOP7 with 1000 decision variables. The figure shows that the parameter t affects algorithm performance to some extent and that \(t=0.3\) yields relatively good performance. Therefore, \(t=0.3\) is adopted as the threshold in the experiments.

Fig. 15 Convergence curves of the IGD values obtained by MOEA-ESD and its variants on SMOP7 with 1000 decision variables

Fig. 16 Sensitivity analysis of parameter t from 0.05 to 0.5 on SMOP2 and SMOP7 with 1000 decision variables

Table 7 shows the IGD results of MOEA-ESD using different group sizes r on SMOP1 \(\sim \) SMOP8 with 1000 decision variables. For the group size r, we consider the values [1, 2, 4, 8, 16, 32, 64] to verify the impact of r on algorithm performance. The table shows that if r is too large, for example \(r=64\), or too small, for example \(r=1\), the performance of MOEA-ESD is very poor, whereas \(r=8\) gives the best performance. Therefore, \(r=8\) is used as the group size in the experiments.

Table 7 IGD values obtained by different group sizes on SMOP1\(\sim \)SMOP8 with 1000 decision variables, with the best results highlighted in each row

Conclusions

Many researchers have tried to employ MOEAs to solve practical optimization problems, some of which are characterized by sparse Pareto optimal solutions. General MOEAs often encounter difficulties when solving this type of problem. Consequently, this paper proposes an algorithm for solving sparse MOPs, called MOEA-ESD. MOEA-ESD uses an adaptive sparse genetic operator to generate sparse solutions. To overcome the deficiencies of local detection, the algorithm adopts the ESD strategy to mine the sparse distribution of the population globally, which further improves the accuracy of sparse detection. In addition, the algorithm applies an improved weighted optimization strategy to fully optimize the key nonzero variables, achieving a better balance between detection and optimization.

The experiments on sparse test suites and real-world problems show that MOEA-ESD has superior performance compared to seven state-of-the-art algorithms.

For future work, we would like to further explore the ability of the proposed algorithm in solving real-world problems, e.g., performing ultralarge-scale feature selection and simplifying complex deep neural network architectures. In addition, the proposed algorithm can be combined with multimodal processing techniques or constraint processing techniques to solve sparse multimodal MOPs [47] and sparsely constrained MOPs [48].

However, real-life sparse MOPs often have very complex variable interactions, for which this paper does not provide an effective solution; handling such problems better will therefore be the focus of our future work.