Cardinality-Constrained Multi-Objective Optimization: Novel Optimality Conditions and Algorithms

In this paper, we consider multi-objective optimization problems with a sparsity constraint on the vector of variables. For this class of problems, inspired by the homonymous necessary optimality condition for sparse single-objective optimization, we define the concept of L-stationarity and we analyze its relationships with other existing conditions and Pareto optimality concepts. We then propose two novel algorithmic approaches: the first one is an Iterative Hard Thresholding method aiming to find a single L-stationary solution, while the second one is a two-stage algorithm designed to construct an approximation of the whole Pareto front. Both methods are characterized by theoretical properties of convergence to points satisfying necessary conditions for Pareto optimality. Moreover, we report numerical results establishing the practical effectiveness of the proposed methodologies.


Introduction
Multi-objective optimization (MOO) is a mathematical tool that has received considerable attention from the research community over the last 25 years. As a matter of fact, MOO problems turn out to be relevant in different application fields, where objectives that are in contrast with each other must be taken into account (see, e.g., [11,13,25,43,45]).
With respect to single-objective optimization (SOO), the multi-objective case presents an additional complexity: in general, a solution minimizing all the objectives at once does not exist. This fact requires the introduction of optimality conditions based on Pareto's theory, together with new, nontrivial optimization schemes. Among the latter, scalarization [24,42] and evolutionary [20,36] approaches are very popular. However, although these two algorithm classes have some appealing features, they also present important shortcomings. In particular, scalarization approaches strongly depend on the problem domain and structure for the weight choices; moreover, even under strong regularity assumptions, they may lead to unbounded scalar problems for unfortunate weight selections [26, Section 7]. On the other hand, despite excelling at handling very difficult tasks, evolutionary approaches come with convergence properties that are hard to prove [26].
A different branch of MOO algorithms that is receiving increasing attention is that of descent algorithms [15,22,26,27,34]. These methods basically extend the classic SOO descent methods, and theoretical convergence properties have been proved for most of them. The earliest algorithms of this class were only able to produce a single Pareto-stationary solution; in order to generate an approximation of the entire Pareto front, they were run multiple times in a multi-start fashion. More recently, some of these algorithms have been extended to overcome this limitation. These new approaches (see, e.g., [16-19,33]) are capable of dealing with sequences of sets of points and thus producing a Pareto front approximation in an efficient and effective manner. Inspired by works for SOO [38,40], front-oriented descent methods have also been combined with genetic algorithms for MOO (see, e.g., [35]).
A second topic recently investigated by the optimization community concerns problems where solutions with few nonzero components are required [46]. Solution sparsity is often induced by the direct introduction of a cardinality constraint on the vector of variables. However, setting an upper bound on the ℓ0 pseudo-norm makes the problem partly combinatorial and thus NP-hard [8,41]. For this reason, many approaches aiming to solve this problem approximately have been proposed; we refer the reader to [46] for a thorough survey of these methods. Algorithms dealing exactly with the ℓ0 pseudo-norm can nonetheless be found in the literature. In particular, the Iterative Hard Thresholding (IHT) algorithm [1], the Penalty Decomposition (PD) approach [39] and the Sparse Neighborhood Search (SNS) method [32] are designed to be employable in the most general cases, even without convexity assumptions. With these methods, problems are tackled by means of continuous local optimization steps, and convergence to solutions satisfying necessary optimality conditions is guaranteed.
Although the two challenges have been thoroughly investigated separately, the combination of sparsity and multiple objectives has remained almost unexplored. The theoretical foundation for cardinality-constrained MOO was recently laid in [31], where a Penalty Decomposition approach was also proposed for sparse MOO tasks along with its convergence analysis. Moreover, a theoretical study extending the work from [10] to the MOO case was presented in [28]. The development of high-performing procedures to deal with this class of problems is beneficial for many real-world applications. For instance, there are several reasons in machine learning for inducing sparsity within classification/regression models (e.g., interpretability [4], robustness [47], lightness [12]). In addition, there are approaches in the literature where learning tasks are tackled from a Pareto-based, multi-objective perspective: fitting quality and model complexity are just two examples of conflicting objectives for which a good trade-off may be useful [29].
In this paper, we continue the theoretical analysis started in [31], introducing new optimality conditions for MOO problems with cardinality constraints. In particular, we define the concept of L-stationarity in MOO, which is directly inspired by the homonymous condition for sparse SOO tasks [1]. Then, we introduce two new algorithms to solve these problems. The first one is an extension of the IHT method to the MOO case and is designed to retrieve an L-stationary solution; we call this method Multi-Objective Iterative Hard Thresholding (MOIHT) and prove that it is indeed guaranteed to converge to points satisfying the newly introduced necessary optimality condition. The second algorithm, on the other hand, is a two-stage approach whose ultimate goal is to approximate the whole Pareto front. This method, which we call Sparse Front Steepest Descent (SFSD), is also analyzed theoretically and then shown, numerically, to reconstruct the solution sets well.
The remainder of the paper is organized as follows. In Section 2, we first review the main MOO concepts along with some notions for cardinality-constrained MOO problems. In Section 3, we define L-stationarity for the considered class of problems and state its theoretical relations with Pareto's theory and other existing optimality conditions. In Section 4, we provide a description of the MOIHT algorithm, along with its convergence analysis; moreover, in this section we propose the SFSD methodology. In Section 5, we report the results of some computational experiments, showing the effectiveness of our novel approaches. Finally, in Section 6, we provide some concluding remarks.

Preliminaries
In this paper, we consider problems of the following form:

min_x F(x)  s.t. ‖x‖_0 ≤ s,   (1)

where F : R^n → R^m is a continuously differentiable vector-valued function, ‖·‖_0 denotes the ℓ0 pseudo-norm, i.e., the number of nonzero components of a vector, and s < n is a positive integer. In what follows, we indicate with ‖·‖ the Euclidean norm in R^n. We denote by Ω = {x ∈ R^n | ‖x‖_0 ≤ s} the closed and non-empty set induced by the upper bound on the ℓ0 pseudo-norm.
To deal with the multi-objective setting, we need to define a partial ordering in R^m: given u, v ∈ R^m, we say that u < v if and only if u_j < v_j for all j ∈ {1, ..., m}; analogous definitions can be stated for the operators ≤, >, ≥. Furthermore, given the objective function F, we say that x dominates y w.r.t. F if F(x) ≤ F(y) and F(x) ≠ F(y), and we denote this relation by F(x) ⪇ F(y).
In multi-objective optimization, a solution which simultaneously minimizes all the objectives is unlikely to exist. In this case, optimality concepts are based on Pareto's theory.

Definition 1. A point x ∈ Ω is Pareto optimal for problem (1) if there does not exist any y ∈ Ω such that F(y) ⪇ F(x). If there exists a neighborhood N(x) such that the property holds in Ω ∩ N(x), then x is locally Pareto optimal.
Pareto optimality is a strong property and, as a consequence, it is often hard to achieve in practice. A weaker but more affordable condition can then be introduced.

Definition 2. A point x ∈ Ω is weakly Pareto optimal for problem (1) if there does not exist any y ∈ Ω such that F(y) < F(x). If there exists a neighborhood N(x) such that the property holds in Ω ∩ N(x), then x is locally weakly Pareto optimal.
We refer to the set of Pareto optimal solutions as the Pareto set; the image of the latter under F is called Pareto front.
Additional notation: Given an index set S ⊆ {1, ..., n}, the cardinality of S is indicated with |S|, while we denote by S̄ = {1, ..., n} \ S its complement; we call S a singleton if |S| = 1. Letting x ∈ R^n, we denote by x_S the sub-vector of x induced by S, i.e., the vector composed of the components x_i with i ∈ S; S1(x) = {i ∈ {1, ..., n} | x_i ≠ 0} represents the support set of x, that is, the set of indices corresponding to the nonzero components of x; S0(x) = {1, ..., n} \ S1(x) is its complement. Furthermore, according to [2], we say that an index set J is a super support set for x if S1(x) ⊆ J and |J| = s; the set of all super support sets at x is denoted by J(x), and it is a singleton if and only if ‖x‖_0 = s. Finally, we denote by 1_N and 0_N, with N ∈ N_+, the N-dimensional vectors of all ones and all zeros, respectively.
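To make this notation concrete, the following sketch (Python, with illustrative helper names not taken from the paper) enumerates S1(x), S0(x) and the super support sets J(x) for a small vector:

```python
import itertools
import numpy as np

def supports(x, s):
    """Enumerate the support-set notation of the preliminaries: the support
    S1(x), its complement S0(x), and the super support sets J(x), i.e., all
    index sets J with S1(x) contained in J and |J| = s."""
    S1 = [i for i in range(x.size) if x[i] != 0]   # support set of x
    S0 = [i for i in range(x.size) if x[i] == 0]   # complement of S1(x)
    assert len(S1) <= s, "x must satisfy the cardinality constraint"
    # complete S1(x) with every choice of s - |S1(x)| positions taken from S0(x)
    J = [sorted(S1 + list(extra)) for extra in itertools.combinations(S0, s - len(S1))]
    return S1, S0, J

S1, S0, J = supports(np.array([3.0, 0.0, -1.0, 0.0]), s=3)
# here S1 = [0, 2] and J(x) contains the super supports [0, 1, 2] and [0, 2, 3]
```

Consistently with the remark above, J(x) returned by this sketch is a singleton exactly when ‖x‖_0 = s.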

The Proximal Operator in Multi-Objective Optimization
Thorough analyses of proximal methods in the multi-objective setting can be found in the literature (see, e.g., [9,44]). For the scope of this work, we refer to the discussion carried out in [44], where the considered MOO problems are of the form

min_{x∈R^n} F(x) = (f_1(x) + g_1(x), ..., f_m(x) + g_m(x))^T.   (2)

For all j ∈ {1, ..., m}, f_j is assumed to be continuously differentiable, whereas g_j is lower semi-continuous and proper convex, but not necessarily smooth.
Let x^k ∈ R^n. A proximal step at x^k can be carried out according to

min_{d∈R^n} max_{j=1,...,m} ∇f_j(x^k)^T d + g_j(x^k + d) − g_j(x^k) + (L/2)‖d‖²,   (3)

where L > 0. An optimal solution of problem (2) is such that 0_n is a solution to (3).
Interestingly, and similarly to the scalar case, problem (3) can be seen as a generalization of well-known schemes to define the search direction:
• if g_j = 0 for all j ∈ {1, ..., m}, then (3) coincides with the problem of finding the steepest common descent direction [27];
• if, for all j ∈ {1, ..., m}, g_j is the indicator function of a convex set C, then (3) becomes equivalent to the projected gradient direction problem [22].
In the next section, we are going to show that the proximal operator can be used to handle the nonconvex set Ω, in line with the work [1] for scalar optimization.

Optimality Conditions
Under differentiability assumptions on the objective function F, a Pareto-stationarity condition was proved in [31]: a point x ∈ Ω is Pareto-stationary for problem (1) if θ(x) = 0, where

θ(x) = min_{d∈D(x)} max_{j=1,...,m} ∇f_j(x)^T d + (1/2)‖d‖²   (4)

and D(x) denotes the set of feasible directions at x considered in [31]. We denote by v(x) the set of optimal solutions of problem (4) at x.
The second lemma states that, assuming the convexity of the objective functions, the stationarity condition is also sufficient for local weak Pareto optimality.

Lemma 2. Assume F is component-wise convex and let x ∈ Ω be a Pareto-stationary point for problem (1). Then, x is locally weakly Pareto optimal for (1).
Proof.See Appendix A.
Moreover, in [31], the Lu-Zhang first-order optimality conditions for scalar cardinality-constrained problems [39] have been extended to the multi-objective optimization setting; we refer to the resulting conditions as the MOLZ conditions.

Lemma 3 ([31]). Let x ∈ Ω be a Pareto-stationary point for problem (1). Then, x satisfies the MOLZ conditions.

As pointed out in [31], the converse is not always true; in order to obtain an equivalence between the two conditions, we need a stronger requirement.

Lemma 4 ([31, Proposition 3.10]). A point x ∈ Ω is a Pareto-stationary point for problem (1) if and only if it satisfies the MOLZ conditions for all J ∈ J(x).
The Pareto-stationarity condition can be interpreted as a direct extension of the basic feasibility concept in cardinality-constrained SOO [1,2]. As such, the limitations of scalar basic feasibility naturally transfer to the MOO case; in particular, Pareto-stationarity is only a local optimality condition and it does not provide information about the quality of the current support set. The MOLZ conditions emphasize this issue even more, being generally less restrictive than Pareto-stationarity.
With the above consideration in mind, we are motivated to extend the stronger L-stationarity condition from [1] to the MOO case. In order to do so, we shall reinterpret L-stationarity in terms of proximal operators. Specifically, we can employ problem (3) to define L-stationarity for MOO.
Let us consider the problem

θ_L(x) = min_{d∈D_L(x)} max_{j=1,...,m} ∇f_j(x)^T d + (L/2)‖d‖²,   (6)

where D_L(x) = {d ∈ R^n | x + d ∈ Ω}, and let us denote by v_L(x) the set of optimal solutions at x (since Ω is not a convex set, the solution is not necessarily unique). It is easy to notice that problem (6) is equivalent to (3) where, for all j ∈ {1, ..., m}, g_j is the indicator function of the set Ω.
Lemma 5. Let x ∈ Ω and L > 0. Then: 1. problem (6) admits an optimal solution, i.e., v_L(x) ≠ ∅; 2. θ_L(x) ≤ 0; 3. θ_L(·) is continuous.

Proof. 1. The proof is trivial since:
• the feasible set D_L(x) is closed and non-empty;
• max_{j=1,...,m} ∇f_j(x)^T d + (L/2)‖d‖² is strongly convex and continuous on D_L(x).
2. Given d = 0_n ∈ D_L(x), we have that max_{j=1,...,m} ∇f_j(x)^T 0_n + (L/2)‖0_n‖² = 0, so that θ_L(x) ≤ 0 and we get the thesis.
3. The proof is identical to the one of Proposition 4 in [22]. The argument is not spoiled by the set D_L(x) being nonconvex.
We are now ready to introduce the definition of L-stationarity in MOO.
Definition 5. A point x ∈ Ω is L-stationary for problem (1) if θ_L(x) = 0.

By simple algebraic manipulations, the problem in (6) can be rewritten, setting y = x + d, as

θ_L(x) = min_{y∈Ω} max_{j=1,...,m} ∇f_j(x)^T (y − x) + (L/2)‖y − x‖².   (7)

We can now observe that, if m = 1, Definition 5 actually coincides with scalar L-stationarity. Indeed, exploiting (7),

θ_L(x) = min_{y∈Ω} (L/2)‖y − (x − (1/L)∇f_1(x))‖² − (1/(2L))‖∇f_1(x)‖².

The minimum in the above problem is attained for y* ∈ Π_Ω(x − (1/L)∇f_1(x)), with Π_Ω being the (not unique) Euclidean projection onto the nonconvex set Ω. We thus have that θ_L(x) = 0 if and only if x ∈ Π_Ω(x − (1/L)∇f_1(x)), i.e., x is L-stationary according to [1]. In the rest of the section, we analyze the relations between L-stationarity, Pareto optimality, Pareto-stationarity and the MOLZ conditions. We begin by showing that, for any L > 0, each L-stationary point is Pareto-stationary.

Proposition 1. Let x ∈ Ω be an L-stationary point for problem (1) with L > 0. Then, x is Pareto-stationary for (1).
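For m = 1, the characterization above can be checked numerically; a minimal sketch follows, assuming the usual hard-thresholding projection (keep the s largest-magnitude entries), which is one valid choice of Π_Ω:

```python
import numpy as np

def proj_omega(y, s):
    """One Euclidean projection of y onto Omega = {x : ||x||_0 <= s}: zero out
    all but the s largest-magnitude entries (not unique in case of ties)."""
    z = np.zeros_like(y)
    keep = np.argsort(-np.abs(y))[:s]
    z[keep] = y[keep]
    return z

def is_L_stationary_scalar(x, grad, L, s, tol=1e-10):
    """m = 1 case of Definition 5: x is L-stationary iff x belongs to
    Pi_Omega(x - grad f(x)/L); here we test the particular projection above."""
    return np.linalg.norm(x - proj_omega(x - grad(x) / L, s)) <= tol

# example: f(x) = 0.5*||x - a||^2, so grad f(x) = x - a
a = np.array([2.0, 0.0, 1.0])
print(is_L_stationary_scalar(np.array([2.0, 0.0, 0.0]), lambda x: x - a, L=1.0, s=1))  # True
```

In the example, x keeps only the largest entry of a, so the gradient step followed by the projection maps x back to itself.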
Proof. By contradiction, we assume that x is not Pareto-stationary for (1), i.e., there exists d̄ ∈ D(x) such that

max_{j=1,...,m} ∇f_j(x)^T d̄ < −(1/2)‖d̄‖² ≤ 0,   (8)

where the second inequality is justified by the non-negativity of the norm operator.
We now define the direction d(t) = t d̄. Given the definition of D(x) and the feasibility of d̄, there exists t̄ > 0 such that, for all t ∈ (0, t̄],

x + d(t) ∈ Ω, i.e., d(t) ∈ D_L(x).   (9)

By (6), it follows that θ_L(x) = θ̄_L(x, d̄_L), where d̄_L ∈ v_L(x) and θ̄_L(x, d) = max_{j=1,...,m} ∇f_j(x)^T d + (L/2)‖d‖², and also

θ_L(x) ≤ θ̄_L(x, d)  for all d ∈ D_L(x).   (10)

Combining the definitions of d(t) and θ̄_L(x, d), we get that

θ̄_L(x, d(t)) = t max_{j=1,...,m} ∇f_j(x)^T d̄ + (L/2) t² ‖d̄‖² < −(t/2)‖d̄‖²(1 − L t),

where the inequality follows from (8) and the right-hand side is a negative quantity for any t ∈ (0, 1/L), as L > 0 and (8) holds.

Then, taking into account the feasibility of d(t), (9) and (10), we can choose t̂ ∈ (0, min{t̄, 1/L}) and obtain θ_L(x) ≤ θ̄_L(x, d(t̂)) < 0. We finally get a contradiction since, by hypothesis, x is an L-stationary point for (1), i.e., θ_L(x) = 0.
The last result also highlights the relation between L-stationarity and the MOLZ conditions, which, as stated in Lemma 3, are necessary for Pareto-stationarity. We formalize it in the next corollary.

Corollary 1. Let x ∈ Ω be an L-stationary point for problem (1) with L > 0. Then, x satisfies the MOLZ conditions.
Given Proposition 1 and Lemma 2, we can also state that, under convexity assumptions on F, L-stationarity is a sufficient condition for local weak Pareto optimality.

Corollary 2. Assume that F is component-wise convex and let x ∈ Ω be an L-stationary point for problem (1) with L > 0. Then, x is locally weakly Pareto optimal for (1).
In order to continue the analysis, we need to introduce a couple of notions. The first one is an assumption similar to the one used for L-stationarity in [1], while the second one is an adaptation of the descent lemma to MOO.

Assumption 1. For all j ∈ {1, ..., m}, ∇f_j is Lipschitz-continuous over R^n with constant L(f_j), i.e., ‖∇f_j(x) − ∇f_j(y)‖ ≤ L(f_j)‖x − y‖ for all x, y ∈ R^n.

In what follows, we indicate with L(F) ∈ R^m the vector of the Lipschitz constants, i.e., L(F) = (L(f_1), ..., L(f_m))^T.

Lemma 6 ([3, Proposition A.24]). Let f_j, j = 1, ..., m, be a continuously differentiable function satisfying Assumption 1. Then, for all L ≥ L(f_j) and any x, d ∈ R^n, we have that

f_j(x + d) ≤ f_j(x) + ∇f_j(x)^T d + (L/2)‖d‖².

We are now ready to show that, for specific L values, the L-stationarity condition is necessary for weak Pareto optimality.

Proposition 2. Let Assumption 1 hold, x ∈ Ω be a weakly Pareto optimal point for problem (1), and L > max_{j=1,...,m} L(f_j). Then, x is L-stationary for (1). Moreover, we have that v_L(x) = {0_n}, i.e., the set v_L(x) is a singleton.
Proof. By contradiction, let us assume that either x is not L-stationary for (1) or v_L(x) \ {0_n} ≠ ∅. Then, there exists a direction d ∈ D_L(x) such that d ≠ 0_n and

max_{j∈{1,...,m}} ∇f_j(x)^T d + (L/2)‖d‖² ≤ 0.   (11)

By Lemma 6, we have that, for all h ∈ {1, ..., m},

f_h(x + d) ≤ f_h(x) + ∇f_h(x)^T d + (L(f_h)/2)‖d‖².   (12)

From Equation (11), we get that ∇f_h(x)^T d ≤ max_{j∈{1,...,m}} ∇f_j(x)^T d ≤ −(L/2)‖d‖², where the first inequality comes from the definition of the maximum operator. Recalling the hypothesis on L and the non-negativity of the norm, we combine (12) and the last result, obtaining that

f_h(x + d) ≤ f_h(x) + (‖d‖²/2)(L(f_h) − L) < f_h(x) + (‖d‖²/2)(L(f_h) − max_{j∈{1,...,m}} L(f_j)).

Thus, for all h ∈ {1, ..., m}, we have (‖d‖²/2)(L(f_h) − max_{j∈{1,...,m}} L(f_j)) ≤ 0, leading to the conclusion that we have found a point x + d ∈ Ω such that F(x + d) < F(x). This is a contradiction since, by hypothesis, x is weakly Pareto optimal for (1). Thus, we get the thesis.
The analysis of L-stationarity highlights how the choice of the value of L can be crucial: if L is too small, L-stationarity might not be a necessary optimality condition; on the other hand, if L gets too large, all Pareto-stationary points also become L-stationary. This behavior is better illustrated by an example.

Example 1. Let us consider the following optimization problem: The Lipschitz constant of the gradient of both objective functions f_j is L(f_j) = 1. In Figure 1, the Pareto optimal solutions and the Pareto front are plotted: the problem has global optimal solutions corresponding to points with x_1 = 0; the local ones are characterized by the second component x_2 = 0. By Lemmas 1-3, it follows that all the considered points are Pareto-stationary and satisfy the MOLZ conditions. In Figure 2, we show which Pareto solutions are L-stationary, considering four different choices of L. If L is chosen too small (Figure 2a), some global Pareto solutions turn out not to be L-stationary. As stated in Proposition 2, the L-stationarity condition turns out to be necessary for Pareto optimality for an L value greater than the Lipschitz constants (Figure 2b, where L = 1.01). On the other hand, a too high value makes the condition rather weak: in Figure 2c (L = 1.25), even some local Pareto solutions are L-stationary. The situation is further stressed in Figure 2d, where L = 2 and all Pareto-stationary points are also L-stationary.

New Algorithmic Approaches for Sparse MOO Problems
In this section, we propose two procedures to solve cardinality-constrained MOO problems. The first one can be seen as an extension of the Iterative Hard Thresholding (IHT) algorithm [1]; the second one is a front approximation approach that takes as input candidate solutions, possibly associated with different support sets, and then spans the portions of the Pareto front associated with those supports. We report their schemes and discuss their properties in separate sections.

Multi-Objective Iterative Hard Thresholding
The first procedure we introduce is the Multi-Objective Iterative Hard Thresholding algorithm; in the remainder of the paper, we refer to it as MOIHT. The scheme of the method is reported in Algorithm 1. At each iteration of MOIHT, the current solution x^k is updated by solving problem (13). The execution continues until an L-stationary point for (1) is found.
Remark 2. At each iteration k, the solution x^{k+1} generated by MOIHT is feasible for (1). Indeed, the feasibility easily follows by the definition of D_L(x^k).
Remark 3. It is very important to underline that Step 4 is a practical operation that can be effectively implemented in the general case. Problem (13) can indeed be solved to global optimality, for example with mixed-integer programming techniques (see, e.g., [6,7]). Defining a sufficiently large scalar M > 0, the problem can be equivalently reformulated as

min_{d,z} max_{j=1,...,m} ∇f_j(x^k)^T d + (L/2)‖d‖²
s.t. −M z_i ≤ x_i^k + d_i ≤ M z_i, i = 1, ..., n,
     Σ_{i=1}^n z_i ≤ s,  z ∈ {0,1}^n.

Algorithm 1: Multi-Objective Iterative Hard Thresholding
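For intuition on what Step 4 computes, problem (13) can also be solved by brute force on tiny instances, enumerating the size-s supports and minimizing the (nonsmooth) max objective on each subspace. The following SciPy sketch is an illustrative alternative to the MIP reformulation, with hypothetical helper names; it is viable only for small n:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def moiht_subproblem(x, G, L, s):
    """Solve min over y in Omega of max_j G[j].(y - x) + (L/2)||y - x||^2
    (problem (13) with y = x + d) by enumerating all size-s supports.
    G is the (m, n) matrix whose rows are the gradients at x."""
    n = x.size
    best_val, best_y = np.inf, x.copy()
    for J in map(list, itertools.combinations(range(n), s)):
        Jc = [i for i in range(n) if i not in J]
        const = -G[:, Jc] @ x[Jc]                # linear term from entries set to zero
        quad0 = 0.5 * L * np.dot(x[Jc], x[Jc])   # quadratic term from those entries
        def phi(yJ):                             # max over objectives, support fixed to J
            diff = yJ - x[J]
            return (G[:, J] @ diff + const + 0.5 * L * np.dot(diff, diff) + quad0).max()
        res = minimize(phi, x0=x[J], method="Nelder-Mead")
        if res.fun < best_val:
            best_val, best_y = res.fun, np.zeros(n)
            best_y[J] = res.x
    return best_y, best_val
```

As a sanity check, for m = 1 with f(x) = ½‖x‖² (so the gradient is x itself) and L = 1, the optimal value is −½‖x‖² and the computed point moves towards the hard-thresholded origin, as in scalar IHT.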

Convergence Analysis
In this section, we analyze the method from a theoretical perspective. Before proving the main convergence result, we need to state an additional assumption on F and prove a technical lemma.

Assumption 2. F has bounded level sets in the multi-objective sense, i.e., the set L_F(z) = {x ∈ Ω | F(x) ≤ z} is bounded for any z ∈ R^m.

Lemma 7. Let Assumptions 1-2 hold and {x^k} be the sequence generated by Algorithm 1 with constant L > max_{j∈{1,...,m}} L(f_j). Then:
1. for all k and all j ∈ {1, ..., m}, f_j(x^{k+1}) ≤ f_j(x^k) − ((L − L(f_j))/2)‖x^{k+1} − x^k‖²;
2. if x^{k+1} ≠ x^k, then F(x^{k+1}) < F(x^k);
3. for all j ∈ {1, ..., m}, the sequence {f_j(x^k)} is non-increasing;
4. the sequence {F(x^k)} converges;
5. lim_{k→∞} ‖x^{k+1} − x^k‖ = 0.

Proof.
1. The thesis can be proved with an argument similar to that of Proposition 2, recalling the update rule x^{k+1} = x^k + d_L^k with d_L^k ∈ v_L(x^k) (Step 6 of Algorithm 1).
2. It follows directly from Point 1, recalling that L > L(f_j) for all j ∈ {1, ..., m} and ‖x^k − x^{k+1}‖ > 0.

5. From Point 1, we have that, for all k and j ∈ {1, ..., m},

((L − L(f_j))/2)‖x^{k+1} − x^k‖² ≤ f_j(x^k) − f_j(x^{k+1}).

Since f_j is continuous, we can take the limit for k → ∞ on both sides of the inequality:

lim_{k→∞} ((L − L(f_j))/2)‖x^{k+1} − x^k‖² ≤ lim_{k→∞} (f_j(x^k) − f_j(x^{k+1})) = 0,

where the equality comes from Point 4. By the definition of L and the non-negativity of the norm, the statement is proved.

Proposition 3. Let Assumptions 1-2 hold and {x^k} be the sequence generated by Algorithm 1 with constant L > max_{j∈{1,...,m}} L(f_j). Then, the sequence admits cluster points, each one being L-stationary for problem (1).
Proof. First, we prove that the sequence admits limit points. By Lemma 7, we deduce that, for all k, F(x^k) ≤ F(x^{k−1}) ≤ ... ≤ F(x^0). Moreover, as noted in Remark 2, x^k ∈ Ω for all k. Thus, we have that x^k ∈ L_F(F(x^0)) for all k. Since Assumption 2 holds, we conclude that the sequence {x^k} is bounded and thus admits limit points. Now, by contradiction, let us suppose there exists a subsequence K ⊆ {0, 1, ...} such that

lim_{k→∞, k∈K} x^k = x̄   (14)

and x̄ is not L-stationary for problem (1). Then, there exists ε > 0 such that θ_L(x^k) ≤ −ε < 0 for all k ∈ K.
By the instructions of the algorithm, at each iteration k, d_L^k ∈ v_L(x^k) solves problem (13). Moreover, by Step 6, x^{k+1} = x^k + d_L^k, so that, for all k ∈ K,

max_{j=1,...,m} ∇f_j(x^k)^T d_L^k + (L/2)‖d_L^k‖² = θ_L(x^k) ≤ −ε.   (15)

By Lemma 6 and Step 6 of the algorithm, we have that, given h ∈ {1, ..., m},

f_h(x^{k+1}) ≤ f_h(x^k) + ∇f_h(x^k)^T d_L^k + (L(f_h)/2)‖d_L^k‖².   (16)

Given the definition of the maximum operator, we can combine (15)-(16) to obtain f_h(x^{k+1}) ≤ f_h(x^k) − ε + ((L(f_h) − L)/2)‖x^{k+1} − x^k‖² and, thus,

f_h(x^{k+1}) ≤ f_h(x^k) − ε.   (17)

By Point 3 of Lemma 7, we know that, for all k ≥ 1, f_h(x^{k−1}) ≥ f_h(x^k). Given the latter result and (17), for k ∈ K such that k ≥ 1, we obtain

f_h(x^{k+1}) ≤ f_h(x^{k−1}) − ε.   (18)

Recursively applying the reasoning used for (18), we can conclude that, for any two indices k', k ∈ K with k' < k,

f_h(x^{k+1}) ≤ f_h(x^{k'+1}) − ε.   (19)

Now, taking into account Point 5 of Lemma 7, (14) and the continuity of the norm, we get that lim_{k→∞, k∈K} x^{k+1} = x̄. Thus, taking the limit in (19) for k → ∞, k ∈ K, recalling again Point 5 of Lemma 7, (14) and the continuity of f_h, we conclude that 0 = f_h(x̄) − f_h(x̄) ≥ ε. We get a contradiction since ε > 0. Thus, the thesis is proved.
Remark 4. By Proposition 3 and the continuity of θ_L (Lemma 5), we are guaranteed that, for any ε > 0, Algorithm 1 produces a point x^k such that θ_L(x^k) > −ε in a finite number of iterations. Thus, we can effectively employ this condition as a practical stopping criterion for the MOIHT procedure.

Sparse Front Steepest Descent
In what follows, we describe and analyze the Sparse Front Steepest Descent (SFSD) methodology. The algorithm can be seen as a two-phase approach, based on the following consideration: in problems of the form (1), the Pareto front is usually an irregular set made up of several distinct smooth parts; each of these portions of the front is typically the image of a set of solutions sharing the same structure, i.e., associated with the same support set. The rationale of the proposed algorithm is thus to first define a set of starting solutions, whose support sets should ideally be diverse and define subspaces where portions of the Pareto set lie. Then, an adaptation of the front steepest descent algorithm [17,33] can be run from this initial set of solutions to span the front exhaustively. To the best of our knowledge, SFSD is the first front-oriented approach for cardinality-constrained MOO.

Phase One: Initialization
The first phase of the SFSD procedure deals with the identification of a set of starting solutions. The most direct way of proceeding would arguably be an exhaustive enumeration of the super support sets, selecting a solution for each of them. However, the number of possible supports is high, growing as fast as the binomial coefficient (n choose s), while only a small fraction of them contributes to the Pareto front. Thus, this strategy is inefficient, up to being totally impractical for problems of nontrivial size.
A totally random initialization might also appear to be a possible path, but, for similar reasons, it would end up being a completely luck-based operation. Therefore, we suggest exploiting single-point solvers to retrieve Pareto-stationary solutions. Indeed, by the mechanisms of this kind of algorithm, the obtained points are not only stationary but usually also good solutions from a global optimization perspective. We explored the following (not exhaustive) list of options:
• Using the MOIHT algorithm discussed in Section 4.1 in a multi-start fashion. Since the algorithm finds L-stationary solutions, optimization should be driven away from "bad" supports.
• Using the Multi-Objective Sparse Penalty Decomposition (MOSPD) method from [31] in a multi-start fashion; in brief, at each iteration k of MOSPD, a pair (x^{k+1}, y^{k+1}) is found such that x^{k+1} is (approximately) Pareto-stationary for the penalty function associated with problem (1). The pair (x^{k+1}, y^{k+1}) is obtained by means of an alternate minimization scheme; for further details, we refer the reader to [31]. MOSPD is proved to converge to points satisfying the multi-objective Lu-Zhang conditions for problem (1), which are even weaker than Pareto-stationarity; however, Penalty Decomposition methods have been shown to retrieve solutions, both in the scalar [30] and in the multi-objective [31] case, that are good from a global optimization perspective.
• Combining the strategies of the two preceding points: for each point of a multi-start random initialization, we can first run the MOSPD procedure to exploit its exploration capabilities; then, we can run MOIHT in cascade, so that bad Lu-Zhang points that are not L-stationary can eventually be escaped. We refer to this approach as MOHyb.
• Solving the scalarized (single-objective) problem for different values of the trade-off parameter.
Once the starting set of solutions is obtained by one of the above strategies, a further step has to be carried out: we need to associate each solution with a super support set. If a solution has full support, then the super support set is unique and there is no ambiguity. However, there might be solutions with incomplete support; these solutions might not be Pareto-stationary (for example, if obtained with MOSPD), in which case we carry out a descent step along the steepest feasible descent direction; if, on the other hand, we actually have a Pareto-stationary point with incomplete support, we associate it with any of its super support sets.
Obviously, we can complete this first phase with a filtering operation, where dominated solutions are discarded. To sum up, the first phase of the algorithm provides a set of starting solutions, each one associated with a super support set.

Phase Two: Front steepest descent
In Algorithm 2, we report the scheme of the proposed algorithmic framework (SFSD).
The method starts from the set of solutions produced by the Initialize step, i.e., phase one of the algorithm; the obtained set X^0 is then given by pairs (x, J_x), i.e., solutions associated with a corresponding super support set. Given any two pairs (x, J_x), (y, J_y) ∈ X^0 with J_x = J_y, we assume that x and y are mutually nondominated w.r.t. F.
Basically, the SFSD algorithm employs the instructions of the front steepest descent algorithm [17], modified as suggested in [33], treating points associated with different super support sets separately.
Specifically, for any nondominated point x_c in the current Pareto front approximation, a common descent step in the subspace corresponding to the support J_{x_c} is carried out, using a standard Armijo-type line search [27]; the search direction is given by d_{J_{x_c}}(x_c) according to (5). Then, further searches w.r.t. subsets of the objectives are carried out from the obtained point, as long as it is not dominated by any other point y in the set with J_y = J_{x_c}. These additional explorations are carried out along partial descent directions [16,17] in the reference subspace of the point at hand. Considering a subset of objective indices I ⊆ {1, ..., m}, we define

θ_J^I(x) = min_{d∈R^n} max_{j∈I} ∇f_j(x)^T d + (1/2)‖d‖²  s.t. d_{J̄} = 0_{|J̄|}.

Similarly to (5), the problem has a unique solution, which we denote by d_J^I(x).
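Problems of this form can be solved through a standard dual reformulation (assumed here, not taken from the paper's code): the minimizer is d = −Σ_{j∈I} λ_j ∇f_j(x) restricted to the subspace induced by J, with λ the minimum-norm convex combination of the restricted gradients. A SciPy sketch:

```python
import numpy as np
from scipy.optimize import minimize

def partial_descent_direction(grads, J, n):
    """Sketch of d_J^I(x): minimize max over j in I of g_j.d + 0.5*||d||^2
    subject to d = 0 outside J. Dual form: d_J = -G^T lam with lam in the unit
    simplex and G the gradients of the objectives in I restricted to J."""
    G = np.asarray(grads)[:, J]                             # rows: restricted gradients
    m = G.shape[0]
    dual = lambda lam: 0.5 * np.dot(G.T @ lam, G.T @ lam)   # 0.5*||sum lam_j g_j||^2 on J
    res = minimize(dual, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0.0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}])
    d = np.zeros(n)
    d[J] = -G.T @ res.x                                     # steepest partial descent direction
    return d
```

With a single objective, this reduces to the negative gradient restricted to the super support subspace; with conflicting gradients, the direction shrinks towards zero, signaling stationarity w.r.t. the chosen subset of objectives.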
Since solutions are compared only if associated with the same super support set, the subspaces induced by different super support sets are explored separately in SFSD. As a result, we basically obtain separate Pareto front approximations, each one corresponding to a super support set. At the end of the SFSD execution, all the points can be compared and the dominated ones can finally be filtered out in order to obtain the final Pareto front approximation for problem (1).
Note that, conceptually, the algorithm can be seen as if multiple independent runs of the front steepest descent method were carried out, each time constraining the optimization process to a particular subspace; however, exploration in SFSD is carried out "in parallel" across different supports, so that the front approximation is constructed uniformly and we avoid cases where all the computational budget is spent on the optimization w.r.t. the first few considered supports.

Remark 5. Since each point is considered for search steps only in the subspace induced by its associated super support set, it easily follows that every new solution is feasible for (1).

Algorithm Theoretical Analysis
In this section, we state the convergence property of the SFSD methodology. We refer the reader to [33] for the proofs of the properties inherited by SFSD directly from the front steepest descent method.
Before proving the convergence result, we need to introduce the set X_J^k = {x | ∃(x, J) ∈ X^k}, with J denoting a super support set, and to recall the definition of linked sequences, firstly introduced in [37].

Definition 6. Let {X_J^k} be the sequence of sets of nondominated points, associated with the super support set J, produced by Algorithm 2. We define a linked sequence as a sequence {x_{j_k}^J} such that, for all k, the point x_{j_{k+1}}^J ∈ X_J^{k+1} is generated by the search procedures of the algorithm starting from the point x_{j_k}^J ∈ X_J^k.

Proposition 4. Let us assume that X_J^0 is a set of mutually nondominated points and that there exists z ∈ R^m such that L_F(z) is bounded and F(x) ≤ z for all x ∈ X_J^0. Moreover, let {X_J^k} be the sequence of sets of nondominated points, associated with the super support set J, produced by Algorithm 2, and {x_{j_k}^J} be a linked sequence. Then, the latter admits accumulation points, each one satisfying the MOLZ conditions for problem (1).
Proof. By the instructions of the algorithm, each linked sequence {x_{j_k}^J} can be seen as a linked sequence generated by applying the front steepest descent algorithm from [33] to the problem of minimizing F(x) subject to x_{J̄} = 0_{|J̄|}. Thus, we can follow the proof of [33, Proposition 3.4] to show that each accumulation point x̄ of the linked sequence is such that θ_J(x̄) = 0, i.e., x̄ satisfies the MOLZ conditions for (1).

Computational Experiments
In this section, we report the results of some computational experiments aimed at assessing the numerical potential of the proposed approaches. The code for the experiments was written in Python 3 and run on a computer with the following characteristics: Ubuntu 22.04, Intel Xeon Processor E5-2430 v2, 6 cores, 2.50 GHz, 16 GB RAM. In order to solve instances of problems (4), (5) and (13), the Gurobi optimizer (version 9.1) was employed.

Experiments Setup
In our numerical experiments, we considered two classes of problems: cardinality-constrained quadratic problems and sparse logistic regression tasks.
The quadratic MOO problems, which often represent a useful test benchmark in optimization, have the form

min_x F(x) = (x^T Q_1 x + c_1^T x, x^T Q_2 x + c_2^T x)  s.t.  ‖x‖_0 ≤ s,

where Q_1, Q_2 ∈ R^{n×n} are random positive semi-definite matrices and c_1, c_2 ∈ R^n are vectors whose values are randomly sampled in the range [−1, 1). In the experiments, we varied the following problem parameters: the size n ∈ {10, 25, 50}; the condition number κ ∈ {1, 10, 100} of the matrices; the cardinality upper bound s. In particular, the latter was set as follows: for n = 10, s ∈ {2, 5, 8}; for n = 25, s ∈ {5, 10, 20}; for n = 50, s ∈ {5, 15, 30}. Moreover, we used 3 different seeds for the pseudo-random number generator, leading to a total of 81 quadratic problems. For each instance, Q_1 and Q_2 are characterized by the same condition number, i.e., L(f_1) = L(f_2) = κ.
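Instances of this family can be generated, for example, as in the following sketch. The construction (eigenvalues spread geometrically in [1, κ], random orthogonal basis obtained via QR factorization) is our own assumption for illustration; the paper does not specify the exact procedure used.

```python
# Illustrative generator of cardinality-constrained quadratic MOO instances.
# Assumptions (not from the paper): spectrum geomspaced in [1, kappa] and a
# random orthogonal basis, so that cond(Q) = kappa by construction.
import numpy as np

def random_psd(n, kappa, rng):
    """Random symmetric positive definite matrix with condition number kappa."""
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal basis
    eigvals = np.geomspace(1.0, kappa, n)             # eigenvalues in [1, kappa]
    return U @ np.diag(eigvals) @ U.T

def quadratic_instance(n, kappa, seed):
    rng = np.random.default_rng(seed)
    Q1, Q2 = random_psd(n, kappa, rng), random_psd(n, kappa, rng)
    c1 = rng.uniform(-1.0, 1.0, n)                    # values sampled in [-1, 1)
    c2 = rng.uniform(-1.0, 1.0, n)
    return Q1, Q2, c1, c2

Q1, Q2, c1, c2 = quadratic_instance(10, 100, seed=0)
```

With this construction, both matrices share the same condition number κ, matching the experimental setup described above.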
As for the sparse logistic regression problem [5,14], it is a relevant task in machine and statistical learning. Given a dataset of N samples with n features, R = (r_1, . . ., r_N)^T ∈ R^{N×n}, and N corresponding labels t_1, . . ., t_N belonging to {−1, 1}, the regularized sparse logistic regression problem is given by

min_x (1/N) Σ_{i=1}^N log(1 + exp(−t_i r_i^T x)) + (λ/2) ‖x‖²  s.t.  ‖x‖_0 ≤ s,

where λ ≥ 0. The logistic loss aims to fit the training data, while the regularization term helps to avoid overfitting. The two functions are clearly in contrast with each other. For our experiments, we employed the multi-objective reformulation considered in [31]:

min_x F(x) = ((1/N) Σ_{i=1}^N log(1 + exp(−t_i r_i^T x)), (1/2) ‖x‖²)  s.t.  ‖x‖_0 ≤ s.

For this problem, L(F) = (‖R^T R‖/N, 1). The dataset suite we considered is composed of 7 binary classification datasets from the UCI Machine Learning Repository [23] (Table 1). We tested the algorithms on instances of the problem with s ∈ {2, 5, 8, 12, 20}. For each dataset, the samples with missing values were removed. Moreover, the categorical variables were one-hot encoded, while the remaining ones were standardized to zero mean and unit standard deviation.
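The two objectives can be sketched as follows. The scaling f_2(x) = ½‖x‖² is our assumption, chosen for consistency with L(f_2) = 1; helper names are illustrative.

```python
# Sketch of the two objectives of the bi-objective sparse logistic regression
# reformulation: average logistic loss and (our assumed) scaled l2 regularizer.
import math

def f1(x, R, t):
    """Average logistic loss over the N samples (r_i, t_i)."""
    N = len(R)
    return sum(
        math.log1p(math.exp(-ti * sum(rij * xj for rij, xj in zip(ri, x))))
        for ri, ti in zip(R, t)
    ) / N

def f2(x):
    """Regularization term, assumed here to be ||x||^2 / 2."""
    return 0.5 * sum(xj * xj for xj in x)

R = [[1.0, -2.0], [0.5, 1.0]]   # toy data: N = 2 samples, n = 2 features
t = [1, -1]
# at x = 0 every sample contributes log(2) to the loss, and the regularizer vanishes
loss0, reg0 = f1([0.0, 0.0], R, t), f2([0.0, 0.0])
```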
In order to evaluate the relative performance of the algorithms, we employed performance profiles [21]. In brief, performance profiles show the probability that the metric value achieved by a method on a problem is within a factor τ ∈ R of the best value obtained by any of the solvers considered in the comparison. We refer the reader to [21] for more details. As performance metrics, we used some classical ones from the multi-objective literature: purity, Γ-spread, ∆-spread [19] and hyper-volume [48]. Note that, since purity and hyper-volume take higher values for better solutions, the performance profiles w.r.t. them were produced based on the inverse of the obtained values.
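A minimal sketch of how such profiles can be computed (assuming positive, lower-is-better metric values, as obtained after inverting purity and hyper-volume):

```python
# Sketch of performance profiles [21]: rho[m](tau) is the fraction of problems
# on which method m's metric is within a factor tau of the best method's value.

def performance_profile(values, taus):
    """values[m][p] = metric of method m on problem p (positive, lower is better)."""
    n_methods, n_problems = len(values), len(values[0])
    best = [min(values[m][p] for m in range(n_methods)) for p in range(n_problems)]
    ratios = [[values[m][p] / best[p] for p in range(n_problems)]
              for m in range(n_methods)]
    return [[sum(r <= tau for r in ratios[m]) / n_problems for tau in taus]
            for m in range(n_methods)]

# toy comparison: two methods on three problems; method 0 is best on two of them
rho = performance_profile([[1.0, 2.0, 3.0], [2.0, 1.0, 3.0]], taus=[1.0, 2.0])
# rho[0] = [2/3, 1.0]: best on two problems, within a factor 2 of the best on all
```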
Note that, for both classes of problems, in the following we will also consider solution approaches based on scalarization, i.e., tackling the problem min_{x∈Ω} f_1(x) + λ f_2(x), where λ ≥ 0. In the quadratic case, the problem can be solved by means of commercial solvers such as Gurobi, exploiting an MIQP reformulation. In the logistic regression case, we instead use the Greedy Sparse-Simplex (GSS) algorithm [1], starting at the initial solution 0_n ∈ Ω. Note that, as opposed to the MIQP approach in quadratic problems, GSS is not guaranteed to produce a Pareto optimal solution.
As anticipated in Section 4.2.1, our SFSD methodology was tested taking as starting solutions the ones generated by the single-point methods mentioned above, i.e., MOIHT, MOSPD, MOHyb, MIQP and GSS. Similarly to what is done in [33], we employed a strategy to limit the number of points used for partial descent searches, in order to improve the efficiency of SFSD and prevent the production of too many, very close solutions. In detail, we added a condition based on the crowding distance [20] to determine whether a point should be considered for further exploration after the common descent step.
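The crowding distance of [20] can be computed as in the following sketch (a generic NSGA-II-style implementation; the exact variant used inside SFSD may differ):

```python
# Sketch of the crowding distance [20]: for each objective, each interior point
# accumulates the normalized gap between its two neighbors on the sorted front;
# extreme points get infinite distance so they are never discarded.

def crowding_distances(front):
    """front: list of objective vectors; returns one distance per point."""
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for j in range(m):
        order = sorted(range(n), key=lambda i: front[i][j])
        fmin, fmax = front[order[0]][j], front[order[-1]][j]
        dist[order[0]] = dist[order[-1]] = float("inf")  # keep boundary points
        if fmax == fmin:
            continue
        for pos in range(1, n - 1):
            i = order[pos]
            dist[i] += (front[order[pos + 1]][j] - front[order[pos - 1]][j]) / (fmax - fmin)
    return dist

d = crowding_distances([[0.0, 3.0], [1.0, 2.0], [3.0, 0.0]])
# the two extreme points get infinite distance; the middle one a finite value
```

Points whose crowding distance falls below a threshold can then be skipped for further exploration, which curbs the proliferation of near-duplicate solutions.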
Every execution had a time limit of 4 minutes. In particular, each single-point method was tested in a multi-start fashion: it had to process as many input points as possible within 2 minutes; in the remaining time, the MOSD procedure was employed as a refiner, starting at each returned point and keeping its zero variables fixed so that the cardinality constraint remained satisfied. In SFSD, we set a time limit of 2 minutes for each of the two phases of the algorithm. For MOIHT, MOSPD and MOHyb, we considered 2n initial solutions randomly sampled from a box ([−2, 2]^n for the quadratic problems; [0, 1]^n for logistic regression). In order to be feasible, each initial point is first projected onto Ω. These algorithms were executed 5 times with different seeds for the pseudo-random number generator to reduce the sensitivity to the random initialization. The five generated fronts were then compared based on the purity metric and only the best and worst ones were chosen for the comparisons. The scalarization-based approaches were run once, considering 2n values for λ, i.e., λ ∈ {2
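For the projection onto Ω, the plain sparsity set {x : ‖x‖_0 ≤ s} admits the well-known closed form that keeps the s entries of largest magnitude; the sketch below assumes Ω carries no further constraints.

```python
# Sketch of the Euclidean projection onto {x : ||x||_0 <= s}: keep the s
# largest-magnitude components and zero out the rest (assumes Omega is the
# plain sparsity set, with no additional constraints).

def project_sparse(x, s):
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:s])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]

p = project_sparse([0.3, -2.0, 0.1, 1.5], 2)  # → [0.0, -2.0, 0.0, 1.5]
```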

Preliminary Assessment of MOIHT, MOSPD and MOHyb
We start by analyzing the effectiveness of MOIHT, MOSPD and MOHyb, comparing them in Figure 3 on a selection of quadratic problems. In order to show the differences among the algorithms as clearly as possible, only for this experiment we considered a single run where the methods took as input the same 25 randomly extracted initial points. Moreover, we set no time limit, so that all the algorithms could process each initial solution until the respective stopping criteria were met.
The MOSPD and MOHyb performance was investigated for τ_0 ∈ {1, 100} (results for τ_0 = 100 are shown in the left column of the figure, τ_0 = 1 on the right). The black dots indicate the reference front: the latter is obtained by combining the fronts retrieved by running SFSD with all the proposed initialization strategies and discarding the dominated solutions.
In well-conditioned problems (κ = 1), the MOIHT algorithm performs well, reaching solutions that belong to the reference front. As for MOSPD, the results with τ_0 = 100 are worse, with MOSPD obtaining solutions far from the reference front. This behavior becomes more pronounced as the problem dimension n grows. This is reasonable, since setting τ_0 = 100 in Penalty Decomposition schemes binds the variables close to the initial feasible solution and, as a consequence, prevents the exploitation of the exploration capabilities of MOSPD. A better choice for τ_0 (τ_0 = 1) improves the performance of the algorithm, although MOIHT still performs better. This result is somewhat in line with the theory: MOIHT generates L-stationary points, whereas MOSPD converges to solutions only guaranteed to satisfy the (weaker) MOLZ conditions. In this scenario, MOHyb inherits the effectiveness of MOIHT: regardless of the value of τ_0, it succeeded in reaching solutions on the reference front.
In ill-conditioned problems (κ > 1), the MOIHT performance degrades: the method struggled to reach the reference front. This might be explained by the larger values of L that have to be used with these problems (L = 1.1κ). As the value of L grows, the L-stationarity condition provides less information on the quality of the solution support set. As a consequence, MOIHT can end up in many L-stationary points with a "bad" support, i.e., far from the actual Pareto front of the problem. MOSPD with τ_0 = 1 obtained better solutions in these cases. Employing lower values of τ_0, the approach is initially allowed to search for a good point minimizing F: this feature can be crucial to avoid a large portion of "bad" L-stationary points and to reach solutions in the reference front. MOHyb (τ_0 = 1) proved to be effective in these scenarios too. The hybrid approach, in these ill-conditioned cases, profited from the exploration capabilities of MOSPD, reaching the same solutions. However, as in the well-conditioned case, MOHyb also proved to be less sensitive than MOSPD w.r.t. the value of τ_0, taking advantage of the MOIHT mechanisms to reach at least L-stationary solutions when τ_0 = 100.

Evaluation of the SFSD Methodology
We start the analysis of the SFSD algorithm performance on the quadratic problems with Figure 4, where we show how the front descent phase improves the results of the basic multi-start approaches corresponding to phase one, i.e., MOIHT, MOSPD, MOHyb and MIQP. According to the results of Section 5.2.1, for MOSPD and MOHyb we set τ_0 = 1.
As anticipated in Section 4.2, in cardinality-constrained MOO the Pareto fronts are typically irregular and made up of several smooth parts. The plots in the figure clearly reflect these characteristics: each front portion can indeed be associated with a specific support set. Starting from the solutions generated by the single-point methods, the SFSD methodology proves to be effective in exhaustively spanning each portion associated with a support set. This feature allowed our novel front-oriented approach to identify regions of the Pareto front that would otherwise hardly be covered by the multi-start strategy.
As mentioned in Section 4.2, to the best of our knowledge, SFSD is the first front-oriented approach for cardinality-constrained MOO. In the absence of other specialized algorithms, it is difficult to quantitatively assess the potential of our algorithm, and we need to resort to visual inspection of the solutions. In the rest of the section, we thus focus our attention on the different options outlined for phase one of the SFSD algorithm, comparing its performance as the initialization strategy varies. The comparisons were made by means of performance profiles (Figure 5) on the entire benchmark of quadratic problems.
Looking at the purity and hyper-volume metrics, SFSD proved more robust with MOHyb as the initialization strategy than with MOIHT or MOSPD. These results reflect the behavior of the three single-point algorithms already shown in Figure 3: while MOIHT was more effective on the well-conditioned problems, MOSPD, with the right choice of τ_0, performed better on the (larger) set of ill-conditioned problems; MOHyb, inheriting the mechanisms of both, managed to obtain good results on problems of both types. As for the Γ-spread, MIQP proves to be more capable than the other single-point methods in generating solutions in the extreme regions of the objectives space, and this fact allowed SFSD to obtain wider front reconstructions. The performance of our front-oriented approach with MOHyb, MOIHT and MOSPD employed in phase one was quite similar in this scenario. Regarding the ∆-spread metric, i.e., the uniformity of the Pareto front approximation, all the initialization strategies led to comparable results.

Logistic Regression
In this last section, we analyze the performance profiles (Figure 6) on the logistic regression problems for SFSD with the different possible choices for the first phase of the algorithm. The values for the parameters of the algorithms were again chosen based on preliminary experiments, which are not reported for the sake of brevity. In particular, we set: L = 1.1 max{L(f_1), L(f_2)} for MOIHT; ε = 10^{−7} for both MOIHT and MOSD; τ_0 = 1, τ_{k+1} = 1.3 τ_k, ε_0 = 10^{−5}, ε_{k+1} = 0.9 ε_k and ‖x^{k+1} − y^{k+1}‖ ≤ 10^{−3} as the stopping condition for MOSPD; α_0 = 1, δ = 0.5 and γ = 10^{−4} for all Armijo-type line searches. Again, the parameters for MOIHT and MOSPD were also employed in MOHyb. Finally, since the objective functions have different scales, similarly to what is done in [31], when computing the spread metrics we considered the logarithm (base 10) of the f_2 values and then re-scaled both f_1 and f_2 to have values in [0, 1].
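An Armijo-type backtracking line search with the reported parameter values can be sketched as follows (toy single-objective function, illustrative helper names):

```python
# Sketch of an Armijo-type backtracking line search with the parameters used in
# the experiments: alpha0 = 1, delta = 0.5 (shrink factor), gamma = 1e-4.

def armijo(f, grad, x, d, alpha0=1.0, delta=0.5, gamma=1e-4):
    """Shrink the step until the sufficient-decrease condition holds."""
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(grad(x), d))  # directional derivative
    alpha = alpha0
    while f([xi + alpha * di for xi, di in zip(x, d)]) > fx + gamma * alpha * slope:
        alpha *= delta
    return alpha

f = lambda x: x[0] ** 2
grad = lambda x: [2.0 * x[0]]
alpha = armijo(f, grad, [1.0], [-2.0])  # steepest descent direction at x = 1
```

In the multi-objective setting the same scheme is applied with the sufficient-decrease condition imposed on every objective, but the backtracking mechanics are identical.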
With respect to the purity and hyper-volume metrics, SFSD proved more robust with MOIHT, MOSPD and MOHyb as initialization strategies, with MOHyb appearing to be slightly superior. As for the Γ-spread metric, GSS was the best algorithm for the SFSD phase one. However, although SFSD equipped with this setting was effective in reaching remote regions of the objectives space, it struggled to obtain uniform Pareto front approximations and, thus, good ∆-spread values. For this last metric, using MOIHT as the initialization strategy proved to be a better choice.

Remark 6. In the previous sections, we considered the best executions of SFSD equipped with MOIHT, MOSPD and MOHyb, and we compared them with the deterministic outputs of our front-oriented algorithm when MIQP/GSS was employed in phase one. For the sake of completeness, in Figure 7 we report the performance profiles w.r.t. the purity metric obtained considering the worst runs. Comparing these performance profiles with the ones in Figures 5a-6a, we observe only slight decreases in the performance of the non-deterministic strategies.

Figure 1: Pareto optimal solutions and Pareto front of the problem of Example 1.

Figure 2: L-stationary points in the Pareto front of the problem of Example 1 (L(F) = [1, 1]) for different values of L.
where t_k is a suitable stepsize and the descent direction d_k is obtained by solving

Table 1: Datasets used for the experiments on sparse logistic regression. The number of features takes the one-hot encoding of categorical variables into account.