A limited memory Quasi-Newton approach for multi-objective optimization

In this paper, we deal with the class of unconstrained multi-objective optimization problems. In this setting, we introduce, for the first time in the literature, a Limited Memory Quasi-Newton type method, which is especially well suited to large scale scenarios. The proposed algorithm approximates, through a suitable positive definite matrix, the convex combination of the Hessian matrices of the objectives; the update formula for the approximation matrix can be seen as an extension of the one used in the popular L-BFGS method for scalar optimization. Equipped with a Wolfe type line search, the considered method is proved to be well defined even in the nonconvex case. Furthermore, for twice continuously differentiable strongly convex problems, we state global and R-linear convergence to Pareto optimality of the sequence of generated points. The performance of the new algorithm is empirically assessed by a thorough computational comparison with state-of-the-art Newton and Quasi-Newton approaches from the multi-objective optimization literature. The results of the experiments highlight that the proposed approach is generally efficient and effective, outperforming the competitors in most settings. Moreover, the use of the limited memory method proves to be beneficial within a global optimization framework for Pareto front approximation.


Introduction
Multi-Objective Optimization (MOO) is a mathematical tool which has proved to be particularly well suited for many real-world problems over the years. The relevance of this framework is demonstrated, for example, by applications in statistics [1], design [2], engineering [3,4], management science [5] and space exploration [6]. The principal difficulty in handling MOO problems is that, in general, no single solution minimizes all the objective functions simultaneously. In this context, the definitions of optimality (global, local and stationarity) are based on Pareto's theory, whose intricacies make the design of new methods and optimization processes far from trivial.
A class of MOO methods that has been widely studied for the past two decades is the one concerning descent algorithms (first-order, second-order or derivative-free). These approaches are basically extensions of the classical iterative scalar optimization algorithms. Steepest Descent [7], Newton [8,9], Quasi-Newton [10], Augmented Lagrangian [11] and Conjugate Gradient [12] are only a few methods of this family. In addition to having theoretically relevant convergence properties, these algorithms, when used on problems with reasonable regularity assumptions, proved to be valid alternatives to the scalarization approaches [13] and the evolutionary ones [14], especially as the problem size grows [15,16]. In recent years, some of the descent methods were also extended to generate approximations of the Pareto front, instead of a single Pareto-stationary solution [15-18]. Moreover, following ideas developed for scalar optimization [19,20], descent methods have also been used as local search procedures within memetic algorithms [21].
Quasi-Newton methods are among the most popular algorithms for unconstrained single-objective optimization. Based on a quadratic model of the objective function, they do not require the calculation of the second derivatives in order to find the search direction: the real Hessian is replaced by an approximation matrix, which is updated at each iteration considering the newly generated solution and the previous one. The most famous update formula for the approximation matrix is the BFGS one, which is named after Broyden, Fletcher, Goldfarb and Shanno [22]. In the multi-objective setting, Quasi-Newton methods were proposed, for instance, in [10, 23-25].
Among the factors contributing to the success of Quasi-Newton methods in scalar optimization, the possibility of defining limited-memory variants of these approaches certainly stands out. The approximate Hessian matrix can in fact be roughly recovered using only a finite number M of previously generated solutions. In this way, its explicit management in memory, which could be extremely inefficient and time-consuming, is avoided. In particular, the L-BFGS algorithm, first proposed in [26], has managed over the years to achieve state-of-the-art performance in most settings, even with relatively small values of M.
This work concerns, to the best of our knowledge, the first attempt in the literature to define a multi-objective limited memory Quasi-Newton method. The key elements that characterize the proposed approach are: (i) a shared approximation of the Hessian matrices is employed to compute the search direction; (ii) the Hessian matrix approximation only requires information related to the most recent iterations to be computed; (iii) the method is in general well defined and, in the strongly convex case, is shown to possess R-linear global convergence properties to Pareto optimality.
The rest of the manuscript is organized as follows. In Sect. 2, we recall the main concepts related to MOO Quasi-Newton methods. In Sect. 3, we provide a description of the proposed limited memory Quasi-Newton approach; we then provide the convergence analysis in Sect. 4. In Sect. 5, we show through computational experiments the good performance of the limited memory approach w.r.t. the main state-of-the-art Newton and Quasi-Newton methods. Moreover, in this section we show the performance of the new approach used as a local search procedure in a global optimization method. Finally, in Sect. 6 we provide some concluding remarks.

Preliminaries
In this work, we consider the following unconstrained multi-objective optimization problem:

min_{x ∈ R^n} F(x) = (f_1(x), . . . , f_m(x))^T,    (1)

where F : R^n → R^m is a continuously differentiable function. We denote by J_F(·) = (∇f_1(·), . . . , ∇f_m(·))^T ∈ R^{m×n} the Jacobian matrix associated with F. Moreover, for all j ∈ {1, . . . , m}, the Hessian matrix of the function f_j(·), when it exists, is denoted by ∇²f_j(·). In what follows, the Euclidean norm in R^n will be denoted by ‖·‖.
Since we are in a multi-objective setting, we need a partial ordering in R^m: considering two points u, v ∈ R^m, we write u ≤ v if u_j ≤ v_j for all j ∈ {1, . . . , m}, and u < v if u_j < v_j for all j ∈ {1, . . . , m}. We say that u dominates v if u ≤ v and u ≠ v. In this case, we use the following notation: u ⪇ v. Similarly, we state that x ∈ R^n dominates y ∈ R^n w.r.t. F if F(x) ⪇ F(y).
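The dominance relation above is straightforward to state in code. The following is a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def dominates(fu, fv):
    """Return True if objective vector fu dominates fv, i.e.,
    fu <= fv componentwise and fu != fv."""
    fu, fv = np.asarray(fu), np.asarray(fv)
    return bool(np.all(fu <= fv) and np.any(fu < fv))
```

For instance, (1, 2) dominates (1, 3), while (1, 3) and (2, 2) are mutually non-dominated.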
In multi-objective optimization, a point minimizing all the objectives at once is unlikely to exist. For this reason, the concepts of optimality are based on Pareto's theory.
Definition 1 A point x̄ ∈ R^n is Pareto optimal for problem (1) if there does not exist y ∈ R^n such that F(y) ⪇ F(x̄). If there exists a neighborhood N(x̄) in which the previous property holds, then x̄ is locally Pareto optimal.
Since Pareto optimality is a strong property, it is hard to attain in practice. A slightly more affordable condition is weak Pareto optimality.

Definition 2 A point x̄ ∈ R^n is weakly Pareto optimal for problem (1) if there does not exist y ∈ R^n such that F(y) < F(x̄). If there exists a neighborhood N(x̄) in which the previous property holds, then x̄ is locally weakly Pareto optimal.
We define the Pareto set as the set of all the Pareto optimal solutions. Moreover, we refer to the image of the Pareto set w.r.t. F as the Pareto front.
We can now introduce the concept of Pareto stationarity. Under differentiability assumptions, this condition is necessary for all types of Pareto optimality. Moreover, assuming the convexity of the objective functions in problem (1), Pareto stationarity is also a sufficient condition for Pareto optimality.

Definition 3 A point x̄ ∈ R^n is Pareto-stationary for problem (1) if max_{j=1,...,m} ∇f_j(x̄)^T d ≥ 0 for all d ∈ R^n.
The concepts of Pareto stationarity, Pareto optimality and convexity are related according to the following lemma.

Lemma 1 Let x̄ ∈ R^n. The following statements hold: (i) if x̄ is locally weakly Pareto optimal, then x̄ is Pareto-stationary; (ii) if the objective functions are convex and x̄ is Pareto-stationary, then x̄ is weakly Pareto optimal; (iii) if the objective functions are strictly convex and x̄ is Pareto-stationary, then x̄ is Pareto optimal.
Lastly, we introduce a relaxation of Pareto stationarity, called ε-Pareto-stationarity. This concept was first introduced in [11]; here, we propose a slightly modified version.
In the following, we briefly review the basic concepts for Quasi-Newton algorithms in multi-objective optimization.

Quasi-Newton methods
If a point x̄ ∈ R^n is not Pareto-stationary, then there exists a descent direction w.r.t. all the objectives. The Quasi-Newton direction can be introduced as the solution of the following problem [10]:

min_{d ∈ R^n} max_{j=1,...,m} ∇f_j(x̄)^T d + (1/2) d^T B_j d,    (2)

where B_j ∈ R^{n×n}, with j ∈ {1, . . . , m}, approximates the second derivatives ∇²f_j(x̄). If the approximation matrices are positive definite, i.e., B_j ≻ 0 for all j ∈ {1, . . . , m}, then the function ∇f_j(x̄)^T d + (1/2) d^T B_j d is strongly convex for each j ∈ {1, . . . , m}. In this case, problem (2) has a unique minimizer: we denote it by d^{QN}(x̄). We also indicate with θ^{QN}(x̄) the optimal value of problem (2) at x̄. It is trivial to observe that θ^{QN}(x) ≤ 0 for any x ∈ R^n. If x̄ is Pareto-stationary, then θ^{QN}(x̄) = 0.
As in [24], we introduce the function D : R^n × R^n → R, defined by

D(x, d) = max_{j=1,...,m} ∇f_j(x)^T d.

Any direction d such that D(x̄, d) < 0 is a descent direction at x̄ for F. Moreover, the function D(·, ·) has some properties, which we report in the next lemma.

Lemma 2 ([24, Lemma 2.2])
The following statements hold:

As we thoroughly recall in Appendix A.1, the Lagrangian dual problem of (2) is given by:

max_{λ ≥ 0, Σ_j λ_j = 1} −(1/2) (J_F(x̄)^T λ)^T [Σ_{j=1}^m λ_j B_j]^{-1} J_F(x̄)^T λ.    (3)

Regarding problems (2) and (3), strong duality holds and the Karush-Kuhn-Tucker conditions are sufficient and necessary for optimality. Moreover, denoting by λ̄ the optimal Lagrange multipliers vector, we have that

λ̄ ≥ 0,  Σ_{j=1}^m λ̄_j = 1,    (4)

and

d^{QN}(x̄) = −[Σ_{j=1}^m λ̄_j B_j]^{-1} J_F(x̄)^T λ̄.    (5)

Due to the presence of the inverse of the convex combination of the approximation matrices, i.e., [Σ_{j=1}^m λ_j B_j]^{-1}, solving the dual problem can be computationally demanding. If, for all j ∈ {1, . . . , m}, B_j = I, where I ∈ R^{n×n} is the identity matrix, problem (2) is identical to the one proposed in [7] to find the steepest common descent direction. We denote the latter by d^{SD}(x̄) and the associated Lagrange multipliers vector by λ^{SD}(x̄). Obviously, equations (4)-(5) hold true in this particular case. We further recall a well-known result that will be used in our convergence analysis.

Lemma 3
The following statements hold: (i) the mappings d^{SD}(·) and θ^{SD}(·) are continuous; (ii) a point x̄ ∈ R^n is Pareto-stationary if and only if θ^{SD}(x̄) = 0 or, equivalently, d^{SD}(x̄) = 0.

Based on the concept of Quasi-Newton direction, a Quasi-Newton approach for multi-objective optimization of strongly convex objective functions is proposed in [10]. In this algorithm, a backtracking Armijo-type line search is used to guarantee a sufficient decrease w.r.t. all the objective functions. The result is formalized by the following lemma.

Lemma 4 ([7, Lemma 4])
If F is continuously differentiable and J_F(x̄)d < 0, then there exists some ε > 0, which may depend on x̄, d and γ ∈ (0, 1), such that

F(x̄ + td) < F(x̄) + γ t J_F(x̄)d  for all t ∈ (0, ε].

Remark 1 By the definition of D(·, ·), we have that J_F(x̄)d ≤ 𝟙 D(x̄, d), where 𝟙 ∈ R^m denotes the vector of all ones. Moreover, if B_j ≻ 0 for all j ∈ {1, . . . , m}, then it follows that D(x̄, d) < θ^{QN}(x̄). Using Lemma 4 and these results, we trivially obtain that, for all t ∈ (0, ε],

F(x̄ + td) < F(x̄) + γ t 𝟙 D(x̄, d).

In many works on MOO [10,25], the BFGS update formula is independently used for each B_j, with j ∈ {1, . . . , m}:

B_j^{k+1} = B_j^k − (B_j^k s_k s_k^T B_j^k)/(s_k^T B_j^k s_k) + (y_k^j (y_k^j)^T)/(s_k^T y_k^j),

where s_k = x_{k+1} − x_k and y_k^j = ∇f_j(x_{k+1}) − ∇f_j(x_k). We also introduce the formula for updating the inverse of the approximation matrix B_j, which we denote by H_j:

H_j^{k+1} = (I − ρ_k^j s_k (y_k^j)^T) H_j^k (I − ρ_k^j y_k^j s_k^T) + ρ_k^j s_k s_k^T,    (6)

where ρ_k^j = 1/(s_k^T y_k^j). Similarly to the scalar case [28], for each j ∈ {1, . . . , m}, if s_k^T y_k^j > 0 and B_j^k ≻ 0, then B_j^{k+1} is positive definite. The same property holds true if {H_j^k} is considered. When the objective functions are strictly convex, the condition s_k^T y_k^j > 0 is always satisfied for any pair (x_{k+1}, x_k) and for each j ∈ {1, . . . , m}. However, this property is not guaranteed to hold in the general case. In order to overcome this issue, in Quasi-Newton methods for scalar optimization, the Wolfe conditions are imposed at each iteration [28].
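One step of the inverse update, for a single objective j, can be sketched as follows (a minimal NumPy sketch; the function name is ours, and the curvature condition s^T y > 0 is assumed to hold):

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """One BFGS update of an inverse Hessian approximation:
    H+ = (I - rho s y^T) H (I - rho y s^T) + rho s s^T, rho = 1/(s^T y).
    Assumes the curvature condition s @ y > 0."""
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```

A quick sanity check is the secant equation: the updated matrix maps y to s, i.e., H⁺y = s, which is exactly what makes the approximation mimic the inverse Hessian along the latest step.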
The Wolfe conditions have been extended to MOO in [12]. However, even imposing them, we may have that, for some j ∈ {1, . . . , m}, s_k^T y_k^j ≤ 0. For this reason, a different formula for updating B_j is introduced in [24]. The corresponding update formula for H_j remains similar to (6), except that ρ_k^j is now defined by a safeguarded rule, reported in (10). Using this update rule, ρ_k^j is proved to be strictly positive even when s_k^T y_k^j ≤ 0. Thus, H_j^{k+1} and, consequently, B_j^{k+1} always remain positive definite.

Single Hessian matrix approximation
The use of a single positive definite matrix B was proposed in [23] to approximate ∇²f_1(x̄), . . . , ∇²f_m(x̄). In this case, problem (2) becomes

min_{d ∈ R^n} max_{j=1,...,m} ∇f_j(x̄)^T d + (1/2) d^T B d.    (11)

Now, the descent direction can accordingly be computed as d(x̄) = −B^{-1} J_F(x̄)^T λ̄, since the matrix [Σ_{j=1}^m λ_j B_j]^{-1} appearing in (3) is replaced by B^{-1}. As a consequence, the dual of problem (11) reduces to a linearly constrained, convex quadratic program which is easy to solve.
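With a single matrix, the dual has a closed-form solution in the bi-objective case (m = 2), since the simplex is one-dimensional. The following sketch illustrates this (names are ours; we write it in terms of H = B^{-1}, and H = I recovers the steepest common descent direction of [7]):

```python
import numpy as np

def direction_two_objectives(g1, g2, H):
    """Bi-objective instance of the single-matrix dual: minimize over
    lam in [0, 1] the quadratic q(lam) = 0.5 * g^T H g, where
    g = lam*g1 + (1-lam)*g2, then return lam and d = -H g."""
    diff = g1 - g2
    denom = diff @ H @ diff
    if denom <= 1e-16:                 # gradients essentially coincide
        lam = 0.5
    else:
        # unconstrained minimizer of the 1-D quadratic, clipped to [0, 1]
        lam = float(np.clip(-(g2 @ H @ diff) / denom, 0.0, 1.0))
    g = lam * g1 + (1.0 - lam) * g2
    return lam, -H @ g
```

Note that when the gradients point in opposite directions the optimal combination can vanish, correctly yielding d = 0 at a Pareto-stationary point.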
The unique matrix B is obtained as the approximation of a convex combination of the Hessian matrices. For this purpose, slightly modified BFGS update formulas are introduced in [23].

Algorithm 1 LM-Q-NWT
In the proposed approach, we use a single positive definite matrix H_k at each iteration k. In Sect. 3.2, we introduce the update formula for H_k, which is slightly different w.r.t. the one introduced in [23]. As in L-BFGS for scalar optimization, we maintain only a finite number M of vector pairs {(s_i, u_i)} in memory: the oldest one is discarded each time a new vector pair is calculated. These pairs are used in a two-loop recursive procedure to efficiently carry out the matrix product R_k = H_k J_F(x_k)^T. This procedure is essentially an extension to MOO of the one used in L-BFGS [28]. The matrix R_k is then used in problem (14) at step 3 of the algorithm: the latter is simply derived from problem (11) substituting H_k J_F(x_k)^T with R_k. We denote by θ^{LM}(x_k) the optimal value of problem (14) at x_k. Moreover, we respectively denote by λ^{LM}(x_k) (Line 4) and d^{LM}(x_k) (Line 5) the Lagrange multipliers vector and the direction corresponding to θ^{LM}(x_k). Note that (4) is valid in this context too. Finally, in Line 6, a Wolfe line search is carried out to find a step size α_k along the direction d^{LM}(x_k), satisfying the Wolfe conditions for MOO (Sect. 3.3).
In the following, we analyze in detail the various aspects of Algorithm 1.

Two-loop recursive procedure for MOO
In L-BFGS, one of the most relevant features is the two-loop recursive procedure which, at any iteration k, given the vector pairs saved in memory, allows to efficiently compute the product H_k ∇f(x_k), where f indicates the objective function [28]. We remind, indeed, that in scalar optimization the negative of this product identifies the Quasi-Newton descent direction. Using this procedure, we do not need to store the matrix H in memory. This property could be crucial when high dimensional problems are considered: in these cases, maintaining and updating the matrix H, which is dense in general, could be extremely inefficient. In this work, we propose an extension of this procedure to MOO: the algorithmic scheme is reported in Algorithm 2.

Algorithm 2 Two-Loop Recursive Procedure
With respect to the scalar optimization case, this procedure computes the product H_k J_F(x_k)^T. The same result could be obtained repeating m times the procedure for scalar optimization to find H_k ∇f_j(x_k) for all j ∈ {1, . . . , m}. In both cases, m(4Mn + n) multiplications are required. However, Algorithm 2 makes it possible to exploit the optimized matrix operations of software libraries for vector calculus.
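The idea can be sketched by running the classical two-loop recursion on all the columns of J_F(x_k)^T simultaneously (a NumPy sketch under our naming; pairs are stored oldest first, and H^0 is assumed diagonal, as is customary in L-BFGS):

```python
import numpy as np

def two_loop_moo(JT, pairs, H0_diag):
    """Two-loop recursion applied column-wise: returns R = H_k J_F(x_k)^T,
    where H_k is implicitly defined by the stored pairs (s_i, u_i)
    (oldest first) and the diagonal initial matrix H0_diag."""
    Q = JT.copy()                          # shape (n, m): one column per objective
    alphas = []
    for s, u in reversed(pairs):           # first loop: newest to oldest
        rho = 1.0 / (s @ u)
        a = rho * (s @ Q)                  # one coefficient per objective
        Q -= np.outer(u, a)
        alphas.append((rho, a))
    R = H0_diag[:, None] * Q               # apply the initial matrix H^0
    for (s, u), (rho, a) in zip(pairs, reversed(alphas)):  # oldest to newest
        b = rho * (u @ R)
        R += np.outer(s, a - b)
    return R
```

With M stored pairs, this performs the same arithmetic as m independent scalar recursions, but each inner product becomes a matrix-vector operation over all objectives at once.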
The employment of Algorithm 2 is possible thanks to some properties of (13). Indeed, the latter can be re-written in a compact form [28]. As in L-BFGS, the exact matrix H_{k−M} is substituted by a suitable sparse positive definite matrix H^0. From this compact representation, the two-loop recursive procedure to compute the product H_k J_F(x_k)^T is derived. We refer the reader to [28] for more details.

Definition of H
In the proposed approach, we use a single positive definite matrix H. As in [23], the update formula (13) is used. However, taking inspiration from (10), we use a different definition of ρ_k, reported in (19). As in [24], we carry out a line search to find a step size satisfying the Wolfe conditions for MOO (Sect. 3.3). However, recalling the reasoning in Sect. 2.1, in order to ensure that H_{k+1} ≻ 0, we force ρ_k, through (19), to be positive even when s_k^T u_k ≤ 0. We formalize this statement in the following proposition.

Proposition 1 Consider a generic iteration k of Algorithm 1 and let λ^{LM}(x_k) be the Lagrange multipliers vector obtained solving problem (14). If ρ_k is updated by (19), then ρ_k is positive.

Remark 2
In the single objective case, the update formula (13) for H_k coincides with the classical BFGS rule. Indeed, it is sufficient to realize that, since λ^{LM}(x_k) lies in the unit simplex by (4), then u_k = y_k. Moreover, the same reasoning can be applied with (19) to get that ρ_k = 1/(s_k^T y_k). Hence, the two-loop recursive procedure reduces to that of L-BFGS. In turn, the overall Algorithm 1 is nothing but L-BFGS, since the search direction reduces to the scalar Quasi-Newton one and the Wolfe conditions are imposed by the line search.

Remark 3
The procedure in Algorithm 2 cannot be used if we consider an approximation matrix for each objective function, as in problem (2). In such a case, both in the primal and in the dual problem (3) the matrices are tied to the problem variables; for example, when solving (3), the product [Σ_{j=1}^m λ_j B_j]^{-1} J_F(x̄)^T would have to be recomputed any time a different solution λ is considered. The use of a single positive definite matrix prevents this issue: the matrix product H J_F(x̄)^T can be computed only once, before solving subproblem (11), making it possible to exploit the efficiency of the two-loop recursive procedure.

Wolfe line search
In this section, we introduce a simple line search scheme to find a step size α along a given direction d_k satisfying the Wolfe conditions for MOO:

F(x_k + αd_k) ≤ F(x_k) + γ α 𝟙 D(x_k, d_k),    (20)

D(x_k + αd_k, d_k) ≥ σ D(x_k, d_k),    (21)

with 0 < γ < σ < 1. Before proceeding, we make a reasonable assumption. Then, we prove that there exists an interval of values satisfying the Wolfe conditions. Note that an analogous result has been obtained in [12,29] under the different assumptions also reported in Sect. 2.1 of this manuscript.

Assumption 1
The objective function F has bounded level sets in the multi-objective sense, i.e., the set L_F(F(x_0)) = {x ∈ R^n | F(x) ≤ F(x_0)} is bounded.

Proposition 2 Let Assumption 1 hold and let d_k be a descent direction at x_k. Then, there exists an interval of positive step sizes for which both (20) and (21) hold.
After proving the existence of an interval of values satisfying the Wolfe conditions, we report the algorithmic scheme of the considered line search.

Algorithm 3 Wolfe Line Search
Starting from α_l^0 = 0, α_u^0 = ∞, the core idea of the line search is that of reducing the interval [α_l^t, α_u^t] until a valid step size α^t is found. At the beginning of the for-loop, the Wolfe sufficient decrease condition (20) is checked. If it is not satisfied by α^t, we update α_u^t and we maintain the same value for α_l^t (Lines 4 and 5). Otherwise, α_u^t is not updated (Line 7) and we check if the Wolfe curvature condition (21) is satisfied by α^t: if it is, both Wolfe conditions are satisfied and, then, the current step size value is returned; else α_l^t is updated according to Line 9. After updating α_u^t or α_l^t, a new value for the step size α^t is chosen in the interval [α_l^t, α_u^t] and the process is repeated. In the next lemma, we state some properties related to the interval upper and lower bounds α_u^t and α_l^t.
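The interval-reduction logic can be sketched as follows (a simplified NumPy variant under our naming, using the midpoint of the interval, i.e., δ = 1/2, and the expansion factor η when the upper bound is still infinite; the default γ and σ mirror the values used in Sect. 5):

```python
import numpy as np

def D(J, d):
    """D(x, d) = max_j grad f_j(x)^T d, with J the Jacobian at x."""
    return np.max(J @ d)

def wolfe_line_search(F, Jac, x, d, gamma=1e-4, sigma=0.1, eta=2.5,
                      max_it=100):
    """Find a step size satisfying the MOO Wolfe conditions:
    componentwise F(x+a*d) <= F(x) + gamma*a*D(x,d) and
    D(x+a*d, d) >= sigma*D(x,d)."""
    Dxd = D(Jac(x), d)
    assert Dxd < 0, "d must be a descent direction"
    Fx = F(x)
    a_l, a_u, a = 0.0, np.inf, 1.0
    for _ in range(max_it):
        if np.any(F(x + a * d) > Fx + gamma * a * Dxd):
            a_u = a                      # sufficient decrease violated: shrink
        elif D(Jac(x + a * d), d) < sigma * Dxd:
            a_l = a                      # curvature violated: move right
        else:
            return a                     # both Wolfe conditions hold
        a = eta * a_l if np.isinf(a_u) else 0.5 * (a_l + a_u)
    return a
```

For example, on the bi-objective problem f_1(x) = ‖x‖², f_2(x) = ‖x − (2, 0)‖², starting from x = (1, 1) with d = (0, −2), the trial step 1 violates sufficient decrease, while the midpoint step 1/2 satisfies both conditions and is returned.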
Lemma 5 Consider a generic iteration t of Algorithm 3. Let x_k ∈ R^n and d_k be a direction such that D(x_k, d_k) < 0. Then, the following properties hold for α_u^t and α_l^t.

Proof See Appendix B.
In the following proposition, we state that the proposed line search is well defined, i.e., it terminates after a finite number of iterations returning a step size satisfying the Wolfe conditions.

Proposition 3 Let Assumption 1 hold, δ ∈ [1/2, 1), η > 1, and let {α_l^t, α_u^t, α^t} be the sequence generated by Algorithm 3. Then, Algorithm 3 is well defined, i.e., it stops after a finite number of iterations returning a step size ᾱ satisfying the Wolfe conditions for MOO.
Proof See Appendix B.

Remark 4
To the best of our knowledge, the first Wolfe line search for MOO was proposed in [29]. Ours is simply a more basic algorithm that is still guaranteed to produce a point satisfying the Wolfe conditions. In fact, we acknowledge that not using an inner solver, as done in [29], could be a performance disadvantage and that smarter strategies to set the trial step size may be integrated. We decided not to compare the two line searches, since finding new efficient methodologies for computing the step size is not the focus of this work. Moreover, we are confident that the experimental results of Sect. 5 would be similar regardless of the employed Wolfe line search.

Convergence analysis
In this section, we show the convergence properties of our Limited Memory Quasi-Newton approach. Before proceeding, similarly to what is done in [30], we need to make some assumptions about the objective function F and the initial approximation matrix H 0 .
Assumption 2 We assume that: (i) the objective functions f_1, . . . , f_m are twice continuously differentiable; (ii) there exist constants a, b > 0 such that a‖z‖² ≤ z^T ∇²f_j(x) z ≤ b‖z‖² for all j ∈ {1, . . . , m}, x ∈ L_F(F(x_0)) and z ∈ R^n; in particular, each f_j is strongly convex.

Assumption 3
The matrix H^0 is chosen such that the norms ‖H^0‖ and ‖B^0‖ are bounded.
Remark 5 Assumption 2 implies Assumption 1. Indeed, f_j(·) is strongly convex for all j ∈ {1, . . . , m}, and thus has all of its level sets bounded. Hence, since Assumption 1 holds, Propositions 2 and 3 concerning the line search remain valid.

Remark 6
By Assumption 2, we have s_k^T y_k^j > 0 for any k and for all j ∈ {1, . . . , m}. Then, considering (18) and since λ^{LM}(x_k) satisfies (4), we have that s_k^T u_k > 0. Then, according to (19), ρ_k = 1/(s_k^T u_k) and, thus, we update B_k and H_k using (12) and (13), respectively.
In order to carry out the theoretical analysis, we take as reference Algorithm 4, which is mathematically equivalent to Algorithm 1 but makes it more explicit how the approximation H_k is computed, i.e., by applying M times the update rule (13) starting from H^0. In the remainder of the section, we will consider the approximation matrix B_k for the sake of clarity; the results are obviously the same if we consider the matrix H_k. Finally, note that Algorithm 4 is only used in this section, since, unlike Algorithm 1, it requires storing the entire matrix in memory.
For the theoretical analysis, we also need to introduce the formulas for the trace and the determinant of the matrix B_{k+1}. Note that these expressions hold when (12) is used to update the matrix B_k, which is always the case here by Assumption 2. We also introduce some basic notation that will be useful in the following analysis.

Notation: we will denote by Λ(B_k) the eigenvalues set of the matrix B_k; by ω_m(B_k) and ω_M(B_k) we indicate the minimum and the maximum eigenvalue, respectively; we refer by β_k to the angle between the vectors s_k and B_k s_k, whose cosine is given by cos β_k = (s_k^T B_k s_k)/(‖s_k‖ ‖B_k s_k‖).
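Assuming the classical BFGS identities carry over with u_k in place of the scalar gradient difference, the trace and determinant of the updated matrix admit the standard closed forms:

```latex
\operatorname{tr}(B_{k+1}) = \operatorname{tr}(B_k)
  - \frac{\lVert B_k s_k \rVert^2}{s_k^T B_k s_k}
  + \frac{\lVert u_k \rVert^2}{s_k^T u_k},
\qquad
\det(B_{k+1}) = \det(B_k)\,\frac{s_k^T u_k}{s_k^T B_k s_k}.
```

These identities are the usual tools for bounding the eigenvalues of the BFGS matrices along the iterations.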

We are now able to begin the convergence analysis with three technical lemmas.
Proof The result follows as in Proposition 3.3 in [24], as the assumptions made in the latter are trivially implied by Assumption 2.
Proof The proof is analogous to the one of Lemma 4.2 in [24], taking into account that we have a single approximation matrix B_k. □

Proof By (15) and (17), we have that x_k + τ s_k ∈ L_F(F(x_0)). Also recalling that λ^{LM}(x_k) satisfies (4), we obtain for any z ∈ R^n that and, then, For z = s_k we thus obtain Defining and recalling (18), we solve the integral: Given this last result and equation (30), we obtain that a‖s_k‖² ≤ s_k^T u_k ≤ b‖s_k‖² and, thus, considering the left-hand side, Furthermore, if we consider z = I_k^{1/2} s_k in (29), with I_k^{1/2} being the positive definite square root of I_k, we get and, recalling (31), Then, given Remark 6 and equation (32), focusing on the right-hand side, we have Now, recalling Assumption 3 and equation (29), we apply recursively (26) and we obtain (34). From (34), we deduce that the greatest eigenvalue of B_k^{(l)} is smaller than b̃. Thus, given Assumption 3 and equation (33), we get (35), where ã > 0. Then, by (28), the min-max theorem and the triangle inequality, we have: We know that: • by definition of trace and determinant, recalling (34) and (35), we get and thus • considering the Euclidean norm and that B_k is a real positive definite matrix, Joining the last three results, we obtain that where the last inequality comes from the definitions of ã and b̃. Thus, we get the thesis choosing the constants accordingly. □

In the next proposition, we state that the sequence of points produced by Algorithm 4 converges to a Pareto optimal point.

Proposition 4 Let Assumptions 2 and 3 hold. Assume that {x_k} is the sequence generated by Algorithm 4. Then, {x_k} converges to a Pareto optimal point x⋆ for problem (1).
Proof By Lemmas 7 and 8, we know that there exists a constant δ > 0 such that, for all k ≥ 0, Considering this last result and Lemma 6, we obtain that and, thus, By (15), we know that, for all k ≥ 0, Recalling Lemma 3 and equation (36), we have that d^{SD}(x⋆) = 0 and, thus, x⋆ is Pareto-stationary for problem (1). Therefore, by Lemma 1 and Assumption 2, we conclude that x⋆ is Pareto optimal. Now, let us assume, by contradiction, that there exists another subsequence K̃ ⊆ {0, 1, . . .} such that lim_{k→∞, k∈K̃} x_k = x̃, with x̃ ≠ x⋆. We prove that F(x̃) ≠ F(x⋆). If it were false, since by Assumption 2 F is strongly convex and L_F(F(x_0)) is convex, for all t ∈ (0, 1), we would get that But, in this case, we would contradict the fact that x⋆ is Pareto optimal.
Then, given that x⋆ is Pareto optimal and that F(x̃) ≠ F(x⋆), there exists ĵ ∈ {1, . . . , m} such that f_ĵ(x⋆) < f_ĵ(x̃). Now, recalling (37) and (38), there exist k ∈ K and k̄ ∈ K̃ such that k < k̄ and f_ĵ(x_k) < f_ĵ(x_k̄).
But, since (15) holds at each iteration of Algorithm 4, we implicitly have that the sequence {f_j(x_k)} is decreasing, for all j ∈ {1, . . . , m}. Thus, we get a contradiction and we conclude that the entire sequence converges to x⋆, with x⋆ being Pareto optimal. □
In the rest of the section, we discuss the convergence rate of Algorithm 4. We first have to provide a technical result.

Lemma 9 Let Assumptions 2 and 3 hold. Moreover, let {x k } be the sequence generated by Algorithm 4 and x be the Pareto optimal point to which the sequence converges.
Then, for all k ≥ 0, the following bound holds. The proof is analogous to the one of Lemma 4.4 in [24], recalling that here a single approximation matrix B_k is considered.
We are now ready to prove that the sequence of points generated by Algorithm 4 R-linearly converges to Pareto optimality.

Proposition 5 Let Assumptions 2 and 3 hold. Furthermore, let {x_k} be the sequence generated by Algorithm 4 and x⋆ be the Pareto optimal limit point of the sequence. Then, {x_k} R-linearly converges to x⋆. In addition, we have that (39) holds.
Proof We first introduce the function f : R^n → R, defined as

f(x) = Σ_{j=1}^m λ_j^{SD}(x⋆) f_j(x),    (40)

where λ^{SD}(x⋆) is the multipliers vector associated with the steepest common descent direction at x⋆. Recalling Lemmas 1 and 3, that x⋆ is Pareto optimal and that (5) holds for d^{SD}(x⋆), we have that

∇f(x⋆) = J_F(x⋆)^T λ^{SD}(x⋆) = 0.    (41)

Now, for all k ≥ 0 and j ∈ {1, . . . , m}, by Assumption 2 and using Taylor's theorem, we get Multiplying this result by λ_j^{SD}(x⋆), summing over j ∈ {1, . . . , m}, recalling (4), which is valid for λ^{SD}(x⋆), and (41), we obtain that Given Lemma 9, from the right-hand side of the last result we get On the other side, (4), (15) and (40) imply that, for all k ≥ 0, which, by subtracting the term f(x⋆) on both sides and taking into account Lemmas 7 and 9, becomes Joining this last result and (43), we obtain that Now, for all k ≥ 0, we define It is easy to see that, by the definitions of γ and σ, Assumption 2 and Lemma 8, r_k ∈ (0, 1). In addition, by Lemma 8, we also have that there exists a constant δ > 0 such that, for all k ≥ 0, Then, recursively applying equation (44) and taking into account that, combining (4), (15) and (40), Considering this last result and the left-hand side of (42), we obtain that, and, thus, the sequence {x_k} R-linearly converges to x⋆. Summing the last result for all k ≥ 0 and recalling that r̄ < 1, we get that (39) holds. □

Computational experiments
In this section, we report the results of thorough computational experiments, comparing the performance of the proposed approach and other state-of-the-art methods from the literature. All the tests were run on a computer with the following characteristics: Ubuntu 20.04 OS, Intel Xeon Processor E5-2430 v2, 6 cores, 2.50 GHz, 16 GB RAM. For all the algorithms, the code was implemented in Python 3.¹ Finally, in order to solve the optimization problems used to determine the descent direction, e.g., problem (14), the Gurobi Optimizer (Version 9) was employed.

Experiments settings
In the next subsections, we provide detailed information on the settings used for the experiments.

Algorithms and parameters
We chose to compare the new limited memory Quasi-Newton approach, which we call LM-Q-NWT for the rest of the section, with some state-of-the-art Newton and Quasi-Newton methods for MOO from the literature.
The first competitor is the Multi-Objective Newton method (NWT) proposed in [8], which is an extension of the classical Newton method to multi-objective optimization. In this algorithm, the problem for finding the search direction is similar to problem (2): the difference is in the use of the real Hessian ∇²f_j(x) instead of the approximation matrix B_j, for all j ∈ {1, . . . , m}. Since this method is not designed to handle unconstrained multi-objective non-convex problems, we evaluated its performance only on the convex test instances.

¹ The implementation code can be found at https://github.com/pierlumanzu/limited_memory_method_for_MOO [37].
The other two competitors are the Quasi-Newton approach (Q-NWT) proposed in [24] and the Modified Quasi-Newton method (MQ-NWT) presented in [23]. We refer the reader back to Sect. 2.1 for further details. At the first iteration of all the Quasi-Newton approaches, including LM-Q-NWT, the approximation matrices are initialized to the identity matrix.
In order to make the comparisons as fair as possible, we decided to use the same line search strategy for all the approaches. In particular, we employed the proposed Wolfe line search (Algorithm 3). The values for the line search parameters were chosen according to some experiments on a subset of the tested problems and are as follows: γ = 10 −4 , σ = 10 −1 , η = 2.5 and δ = 0.5. We do not report these preliminary results for the sake of brevity. In order to efficiently use the proposed line search in MQ-NWT, we used equation (19) to compute ρ k at each iteration k.
Finally, the choice of the parameter M for the new limited memory approach is separately discussed in Sect. 5.2. Since it denotes the number of vector pairs maintained in memory across the iterations, it is the most critical among the LM-Q-NWT parameters.

Problems
In Table 1, we list the tested problems. In particular, we compared the algorithms on 78 convex and 83 non-convex problems. All the test instances are characterized by objective functions that are at least continuously differentiable almost everywhere. If a problem features singularities, these were counted as Pareto-stationary points. All the problems have objective functions for which Assumption 1 holds: the latter is essential to guarantee the finite termination of the proposed Wolfe line search.
Some problem names are characterized by the prefix M-. These problems are rescaled versions of the original ones and their formulations are provided in Appendix C. In this appendix, we also introduce a new convex test problem, which we call MAN_2.
For each algorithm, we tested each problem with 100 different initial points chosen from a uniform distribution. The latter was defined through lower and upper bounds specified for each problem. Since in this work we consider unconstrained multi-objective optimization problems, these bounds were only used to choose the random initial points. For the M-FDS_1, MMR_5, M-MOP_2, MOP and CEC problems, the lower and upper bounds values can be found in the referenced papers. For the others, the values are provided in Table 2.
Finally, starting from an initial point, we let the algorithms run until one of the following stopping conditions was met:
• the current solution is ε-Pareto-stationary (Definition 4); in the experiments, ε = 5·eps^{1/2}, where eps denotes the machine precision;
• a time limit of 2 min is reached.
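For double precision, the stationarity threshold above evaluates to roughly 7.45e-8. A one-line NumPy sketch (the variable name is ours):

```python
import numpy as np

# eps is the float64 machine precision (2**-52), so 5*sqrt(eps) ~= 7.45e-8.
epsilon = 5 * np.sqrt(np.finfo(float).eps)
```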

Metrics
For each algorithm and problem, the main metrics to be computed are the following.
• N_ε: the percentage of runs that ended with an ε-Pareto-stationary point.
• T: the computational time needed to reach ε-Pareto-stationarity from an initial point. If ε-Pareto-stationarity is not reached within the time limit, the value of T related to that point is set to ∞.
• T_M: the mean of the finite T values.
In Sect. 5.4, we employed the metrics proposed in [17]: purity, Γ-spread and Δ-spread. These metrics are used to evaluate the quality of Pareto front approximations. On the one hand, the purity metric indicates the ratio of the number of non-dominated points that a method obtained w.r.t. a reference front over the number of points produced by that method. The reference front is obtained by combining the fronts retrieved by all the considered algorithms and discarding the dominated points. On the other hand, the spread metrics measure the uniformity of the generated fronts in the objectives space. In particular, the Γ-spread is defined as the maximum ℓ∞ distance in the objectives space between adjacent points of the Pareto front, while the Δ-spread is similar to the standard deviation of this distance. Finally, we employed the performance profiles introduced in [36], which are a useful tool to appreciate the relative performance and robustness of the considered algorithms. The performance profile of a solver w.r.t. a certain metric is the (cumulative) distribution function of the ratio of the score obtained by the solver over the best score among those obtained by all the considered solvers. In other words, it is the probability that the score achieved by a method on a problem is within a factor τ ∈ R of the best value obtained by any of the algorithms on that problem. We refer the reader to [36] for additional information about this tool. Since N_ε and purity take higher values for better solutions, the performance profiles w.r.t. these metrics were produced based on the inverse of the obtained values. All the performance profiles were plotted with specific axes ranges in order to highlight the differences among the considered solvers.
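The performance-profile construction of [36] described above can be sketched as follows (the helper name is ours; lower scores are assumed better, as for T, while for N_ε and purity the inverse values would be passed, as noted in the text):

```python
import numpy as np

def performance_profile(scores, taus):
    """Performance profiles in the sense of [36].

    scores: (n_problems, n_solvers) array, lower is better; np.inf marks
            a failure (e.g. time limit reached).
    taus:   1-D array of ratio thresholds (tau >= 1).
    Returns a (n_taus, n_solvers) array: for each tau, the fraction of
    problems on which a solver's score is within a factor tau of the best.
    """
    best = np.min(scores, axis=1, keepdims=True)  # best score per problem
    ratios = scores / best                        # performance ratios r_{p,s}
    return np.array([(ratios <= t).mean(axis=0) for t in taus])
```

The profile of each solver is then plotted as a step function of τ; its value at τ = 1 is the fraction of problems on which the solver was the best.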

Selection of the parameter M
The parameter M indicates how many vector pairs {(s_i, u_i)} are maintained in memory at each iteration of LM-Q-NWT. A bad value for this parameter might compromise the overall performance of the approach, making it too slow or incapable of reaching ε-Pareto-stationary points within the time limit.
In order to select a proper value for M, we analyzed the performance of LM-Q-NWT with M ∈ {2, 3, 5, 10, 20} on a subset of the tested problems. In Fig. 1, we report the performance profiles for the five variants of the new limited memory method. The solvers with M ∈ {5, 10} turned out to be the best w.r.t. both N_ε and T, while the variant with M = 2 was outperformed by all the other methods. We conclude that too little information about the previous steps can compromise the performance of LM-Q-NWT. On the other hand, maintaining too many vector pairs and running the two-loop recursive procedure on them can incur considerable computational costs. A demonstration of this fact is the performance of the proposed approach with M = 20 on the T metric: although this solver performed well w.r.t. N_ε, it is only the fourth most robust algorithm in terms of computational time.
After analyzing the performance profiles, we decided to use the new limited memory approach with M = 5 for the rest of the section. However, the variant with M = 10 appears to be a good choice too.
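Although the paper's update formula modifies the standard pair definition, the memory mechanism is the classical L-BFGS two-loop recursion applied to the single common matrix. A minimal sketch of the scalar recursion (our own function name; pairs are assumed stored oldest first and to satisfy s_i^T u_i > 0):

```python
import numpy as np

def two_loop_recursion(grad, s_list, u_list):
    """Standard L-BFGS two-loop recursion: returns H_k @ grad, where H_k is
    the inverse-Hessian approximation implicitly defined by the last M pairs
    (s_i, u_i). Only the M stored pairs are kept; no matrix is ever formed.
    s_list, u_list: lists of 1-D arrays, ordered from oldest to newest.
    """
    q = grad.copy()
    rhos = [1.0 / np.dot(s, u) for s, u in zip(s_list, u_list)]
    alphas = []
    # First loop: newest pair to oldest.
    for s, u, rho in zip(reversed(s_list), reversed(u_list),
                         list(reversed(rhos))):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * u
    # Initial scaling gamma_k = s^T u / u^T u with the most recent pair.
    s, u = s_list[-1], u_list[-1]
    r = (np.dot(s, u) / np.dot(u, u)) * q
    # Second loop: oldest pair to newest (alphas were stored newest first).
    for (s, u, rho), a in zip(zip(s_list, u_list, rhos), reversed(alphas)):
        b = rho * np.dot(u, r)
        r += (a - b) * s
    return r  # H_k grad; the search direction is -r
```

The cost per iteration is O(Mn) in both time and memory, instead of the O(n^2) needed to store and apply a dense approximation matrix, which is what makes the approach attractive for large n.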

Overall comparisons
In this section, we compare the proposed approach with the Newton and Quasi-Newton algorithms described in Sect. 5.1.1. As already mentioned, we tested NWT only on the convex problems. We thus report the performance profiles for the convex and non-convex problems separately, in Figs. 2 and 3 respectively. In order to better highlight the differences among the methods, for each metric we show three plots concerning different sets of values for n. On the convex problems, for all the n values, the proposed approach proved to be the best algorithm, outperforming the competitors w.r.t. both metrics. Moreover, the gap between LM-Q-NWT and the others is sharper on the non-convex problems and on the high dimensional ones. For high values of n, the NWT and Q-NWT algorithms proved to suffer from maintaining the Hessians and the approximation matrices respectively. As a consequence, they turned out to be the least robust w.r.t. both metrics. Using a single approximation matrix allowed the MQ-NWT approach to perform better. However, on extremely high dimensional problems, even managing a single matrix proved to be an expensive task. In these cases, the performance of the limited memory approach was remarkable.
On the low dimensional problems, the NWT and Q-NWT algorithms had a good performance. The proposed approach behaved similarly w.r.t. the N_ε metric, but it was generally outperformed by these algorithms in terms of T. Managing the real Hessians or the approximation matrices turned out to be a tractable task when n is small enough. Moreover, by definition, these matrices provide more accurate information about the curvature of the objective functions than the single matrix of LM-Q-NWT and MQ-NWT. However, these two algorithms still proved to be competitive, obtaining good T metric results in most of the problems. (Figure caption: performance metric: a-c N_ε; d-f T; values for n: a, d all; b, e n ≥ 50; c, f n < 50; for interpretation of the references to color in the text, the reader is referred to the electronic version of the article.)
In order to analyze the performance of the algorithms more deeply, in Tables 3, 4, 5 and 6 we report the metric values obtained on two convex and two non-convex problems. In particular, we show the results for n ∈ {5, 20, 50, 200, 500, 1000}.
Regarding the N_ε metric, the proposed method outperformed the competitors regardless of the values of n and m. As in the performance profiles, the differences between LM-Q-NWT and the other approaches are clearer on the high dimensional problems. On some of these, NWT, Q-NWT and MQ-NWT were not able to obtain any ε-Pareto-stationary point.
On the problems with two objective functions, almost all the best results in terms of the T_M metric were obtained by the proposed approach. However, the same performance was not achieved on the problems with m = 3 and low values of n. The use of a single matrix seems not to provide accurate enough information about the curvature of the objectives when there are more than two of them. An additional indication of this fact is the similar performance of the MQ-NWT algorithm. On the other hand, the use of the real Hessian or of an approximation matrix for each objective function seems to overcome the issue: indeed, NWT and Q-NWT had the best performance in terms of T_M in these cases. LM-Q-NWT still obtained great results for this metric on the problems with three objective functions and high values of n, outperforming the other competitors. Even with m = 3, the employment of a single matrix turned out to be essential in high dimensional problems. Like the proposed approach, MQ-NWT proved to perform better than NWT and Q-NWT with m = 3 and high values of n, ranking as the second best algorithm in these cases. (In Tables 3-6, a value marked in bold is the best obtained for a metric on a specific problem.)

Results in a global optimization setting
In the previous section, we compared the LM-Q-NWT method with strongly related approaches from the state-of-the-art, in terms of efficiency and effectiveness at reaching approximate Pareto-stationarity. Now, we show the (positive) impact that the proposed procedure may have if used within a global multi-objective optimization framework.
In particular, here we consider the memetic algorithm proposed in [21], named NSMA. Starting with an initial population of N points, this method aims at approximating the Pareto front of the considered problem, combining the genetic operators of the NSGA-II algorithm [14] with the front-based projected gradient method FMOPG. The latter is employed, every n_opt iterations of NSMA, to refine selected solutions up to ε-Pareto-stationarity. Moreover, the selection of the points to be optimized is based on the ranking and the crowding distance values assigned to the population at each iteration. Finally, the algorithm exploits a front-based variant of the Armijo-type line search for MOO defined in [7]. For additional information about NSMA, we refer the reader to [21].
For the experiments of the present paper, we consider two modifications of the NSMA algorithm:
• NSMA-W, which employs the proposed Wolfe line search (Algorithm 3) in the FMOPG method;
• NSMA-L, which uses the new limited memory approach (Algorithm 1) as the local optimization procedure.
We compared these two approaches with NSGA-II and the original version of NSMA.
In the experiments, as in [21], we set N = 100 and n_opt = 5. Moreover, the points selected as starting solutions for the local search procedures were optimized only w.r.t. all the objective functions. In the original version of the NSMA method, the points can also be refined w.r.t. a subset of the objective functions I ⊂ {1, . . . , m}. However, Assumption 1 may not hold for some subset I and, thus, when trying to optimize a point w.r.t. I, the Wolfe line search could run for an infinite number of steps.
In Fig. 4, we report the performance profiles for the NSMA-L, NSMA-W, NSMA and NSGA-II algorithms on the problems listed in Table 1. The considered performance metrics are purity, Γ-spread and Δ-spread. In order to make the comparisons as independent of random operations as possible, for each algorithm and problem we executed five tests characterized by different seeds for the pseudo-random number generator. The five resulting Pareto front approximations were then compared based on the purity metric: the best among them was chosen as the output of the algorithm for the problem at hand.
In terms of purity, NSMA-L and NSMA-W turned out to be the two most robust algorithms. The proposed Wolfe line search improved the results of the original NSMA and, in addition, the use of the limited memory approach led to the best overall performance. Regarding the spread metrics, NSMA-L and NSMA-W had similar performance. The NSMA results on Γ-spread are comparable with those of the two variants; however, it was slightly outperformed in terms of Δ-spread. NSGA-II did not perform well w.r.t. any of the metrics: the variants of NSMA proved capable of finding more accurate and uniform Pareto front approximations. (Fig. 4 reports, for the problems of Table 1, the performance profiles w.r.t. a purity; b Γ-spread; c Δ-spread; for interpretation of the references to color in the text, the reader is referred to the electronic version of the article.)
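A simplified sketch of how the purity and Γ-spread metrics can be computed (our own helper names; the Γ-spread shown is a bi-objective simplification of the metric of [17], which in general also accounts for extreme reference points):

```python
import numpy as np

def purity(front, reference):
    """Fraction of a solver's points that are not dominated by any point of
    the reference front (rows = points, columns = objectives; minimization).
    """
    def dominated(p, ref):
        return any(np.all(r <= p) and np.any(r < p) for r in ref)
    nondom = sum(not dominated(p, reference) for p in front)
    return nondom / len(front)

def gamma_spread(front):
    """Gamma-spread (bi-objective sketch): maximum l-infinity distance
    between consecutive front points, sorted along the first objective."""
    f = front[np.argsort(front[:, 0])]
    gaps = np.max(np.abs(np.diff(f, axis=0)), axis=1)
    return gaps.max()
```

The reference front passed to `purity` would be built, as in the text, by pooling the fronts of all solvers and discarding dominated points.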

Conclusions
In this paper we proposed a new limited memory Quasi-Newton algorithm for unconstrained multi-objective optimization. To the best of our knowledge, it is the first attempt to define such an approach for MOO. As in [23], we use a single approximation matrix, contrary to what is done in the other Quasi-Newton approaches. The idea of a single matrix, whose update formula is slightly modified from the one used in the scalar case, allowed us to extend the L-BFGS two-loop recursive procedure to multi-objective optimization: the Hessian matrix approximation does not need to be maintained and managed in memory, but is computed using a finite number M of previously generated solutions. This feature proves to be crucial, especially when the approximation matrix is dense and/or high dimensional problems are handled. For the proposed approach, under assumptions similar to the ones made for L-BFGS in the strongly convex scalar case, we stated global and R-linear convergence to Pareto optimality of the produced sequence of points.
The results of thorough computational experiments show that the new limited memory algorithm consistently outperforms the state-of-the-art Newton and Quasi-Newton methods for MOO. Moreover, we show the substantial benefits of using the proposed algorithm as a local search procedure within a global optimization framework.
In this case, the Lagrangian function has the following form:

L(t, d, λ) = t + Σ_{j=1}^m λ_j (∇f_j(x)^T d + ½ d^T B_j d − t),

where λ_1, . . . , λ_m are the Lagrange multipliers. Denoting by λ ∈ R^m the vector of all the Lagrange multipliers, we also introduce the dual problem:

max_{λ ≥ 0} min_{(t,d)} L(t, d, λ).

As already mentioned in Sect. 2.1, if B_j ≻ 0 ∀ j ∈ {1, . . . , m}, then problem (2) has a unique solution. Moreover, the problem is convex and has a Slater point, i.e., (t, d) = (1, 0). Then, strong duality holds and the Karush-Kuhn-Tucker conditions are necessary and sufficient for optimality. We thus obtain:

Σ_{j=1}^m λ_j = 1,   d = −(Σ_{j=1}^m λ_j B_j)^{-1} Σ_{j=1}^m λ_j ∇f_j(x).   (A1)

Considering equation (A1), given λ ≥ 0, we can state that in MOO the Quasi-Newton direction depends on convex combinations of both the approximation matrices and the gradients. In scalar optimization, the second formula of (A1) reduces to the classical d = −B^{-1}∇f(x), where f indicates the objective function. Equation (A1) also leads to rewriting the dual function; in particular, t disappears since Σ_{j=1}^m λ_j = 1, while d is substituted:

θ(λ) = −½ (Σ_{j=1}^m λ_j ∇f_j(x))^T (Σ_{j=1}^m λ_j B_j)^{-1} (Σ_{j=1}^m λ_j ∇f_j(x)).

Finally, using this last equation, we can retrieve the final form of the dual problem:

max_{λ ≥ 0, Σ_j λ_j = 1} −½ (Σ_{j=1}^m λ_j ∇f_j(x))^T (Σ_{j=1}^m λ_j B_j)^{-1} (Σ_{j=1}^m λ_j ∇f_j(x)).
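In the single-matrix setting, all B_j coincide, so the dual above reduces to a quadratic program over the unit simplex. The following sketch (function name is ours, and the Frank-Wolfe solver is our choice for illustration, not the paper's method) shows how a direction of the form (A1) can be obtained:

```python
import numpy as np

def qn_direction(grads, H, iters=500):
    """Single-matrix Quasi-Newton direction for MOO (illustrative sketch).

    grads: (m, n) array of objective gradients at the current point.
    H:     (n, n) inverse approximation of the common Hessian matrix.
    The dual reduces to min_{lam >= 0, sum(lam) = 1} 0.5 * lam^T Q lam,
    with Q = G H G^T; here it is solved with a Frank-Wolfe loop using
    exact line search (valid since the objective is quadratic). The
    direction is d = -H G^T lam: the inverse approximation applied to a
    convex combination of the gradients, as in (A1).
    """
    m = grads.shape[0]
    Q = grads @ H @ grads.T
    lam = np.full(m, 1.0 / m)
    for _ in range(iters):
        g = Q @ lam                        # dual gradient
        v = np.eye(m)[np.argmin(g)] - lam  # move toward best simplex vertex
        slope = g @ v
        if slope >= -1e-12:                # first-order optimality on simplex
            break
        curv = v @ Q @ v
        step = 1.0 if curv <= 0 else min(1.0, -slope / curv)
        lam = lam + step * v
    return -H @ (grads.T @ lam), lam
```

When the combined gradient vanishes (d = 0), the current point satisfies the Pareto-stationarity condition, consistently with the role of the dual problem in the text.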

Proposition 1
Proof First, the case s_k^T u_k > 0 is trivial. Then, let s_k^T u_k ≤ 0 and consider equation (8). In Algorithm 1, the Wolfe conditions are imposed at each iteration k (Line 6): in particular, D(x_{k+1}, d_{LM}(x_k)) ≥ σ D(x_k, d_{LM}(x_k)). Then, also considering Lemma 2, we obtain an inequality whose last step comes from the fact that α_k > 0, σ − 1 < 0 and D(x_k, d_{LM}(x_k)) < 0. Using (A2) and (A3), we conclude the desired bound for the considered j. Since we have considered a generic j ∈ {1, . . . , m}, it trivially follows that this last relation is verified for all j. Now, given (4), which is valid for the Lagrange multipliers vector λ_{LM}(x_k), and (A4), we can state that, if the second formula in (19) is used, ρ_k > 0.

Proposition 2
Proof Given F continuously differentiable and d_k a descent direction for F at x_k, and recalling Lemma 4, we can state that there exists ε > 0 such that, for all α ∈ (0, ε], equation (B1) holds, where the last inequality follows from Remark 1. We now assume, by contradiction, that (B1) holds for all α > 0. This last result indicates that {x_k + α d_k | α > 0} ⊆ L_F(F(x_k)), which is absurd since Assumption 1 holds. Therefore, there exist ᾱ > ε and ĵ ∈ {1, . . . , m} for which the condition is violated and, by the continuity of F, equation (B1) holds for all α < ᾱ.

Lemma 5
Proof (1) Since α_u^0 = ∞ and α_u^t < ∞, the interval upper bound has been updated at least once through Line 4. Let t̄ be the iteration in which this update takes place: it follows that 0 ≤ t̄ < t, and it trivially follows that equation (22) holds for α_u^{t̄+1}. Now, suppose that t̄ + 1 < t and consider the iteration t̄ + 1. By the instructions of the algorithm, α_u^{t̄+2} is updated either as α_u^{t̄+2} = α^{t̄+1} (Line 4) or as α_u^{t̄+2} = α_u^{t̄+1} (Line 7). The first case is identical to the one of iteration t̄. In the second case, since (22) is satisfied for α_u^{t̄+1}, it is also satisfied for α_u^{t̄+2}. For t > t̄ + 2, the property follows by induction.
(2) Given σ ∈ (0, 1) and D(x_k, d_k) < 0, it is easy to prove that (23) and (24) hold at the first iteration. Thus, it simply follows that conditions (23) and (24) are satisfied in both cases. For t > 1, we get the thesis by induction.

Proposition 3
Proof By contradiction, we assume that the thesis is false, i.e., the algorithm does not stop in a finite number of iterations. First, we consider the case in which α_u^t = ∞ for all t, i.e., the interval upper bound is never updated. In this case, by the instructions of Algorithm 3, the Wolfe sufficient decrease condition is always satisfied, i.e., for each step size α^t we have that

F(x_k + α^t d_k) ≤ F(x_k) + 1 γ α^t D(x_k, d_k).   (B5)

Given (B7) and (B8), the definition of ĵ and the continuity of F, by taking the limit for t → ∞, with t ∈ T, we obtain that

f_ĵ(x_k + ᾱ d_k) = f_ĵ(x_k) + γ ᾱ D(x_k, d_k).   (B11)

Taking into account this result and (B7), we have that, for all t ∈ T, α_u^t > ᾱ. On the other hand, for t ∈ T, equation (B7) can be re-written and, using (B11) and simple algebraic manipulations, rearranged. Now, by taking the limit for t → ∞, with t ∈ T, and recalling (B10) and the continuous differentiability of F, we obtain (B12). Since D(x_k + ᾱ d_k, d_k) ≥ ∇f_ĵ(x_k + ᾱ d_k)^T d_k by definition of D(·, ·), σ > γ and D(x_k, d_k) < 0, from (B12) we obtain (B13). However, from Lemma 5, it follows that

D(x_k + α_l^t d_k, d_k) < σ D(x_k, d_k).

By taking the limit for t → ∞, with t ∈ T, and recalling (B10) and the continuity of D(·, ·), we get from the last equation that D(x_k + ᾱ d_k, d_k) ≤ σ D(x_k, d_k). The latter is in contradiction with (B13). We thus get the thesis.
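The bracketing mechanism analyzed in the proof can be sketched as follows (a simplified illustration of a Wolfe line search of the kind of Algorithm 3, with our own function names; F returns the vector of objectives, D(x, d) the common directional derivative max_j ∇f_j(x)^T d, and the sufficient decrease condition is checked componentwise):

```python
import numpy as np

def wolfe_line_search(F, D, x, d, gamma=1e-4, sigma=0.1, eta=2.5,
                      alpha0=1.0, max_iter=100):
    """Bracketing Wolfe line search for MOO (illustrative sketch).

    Expands the trial step while the curvature condition fails and the upper
    bound is still infinite; bisects once the bracket [a_l, a_u] is finite.
    """
    Fx, Dxd = F(x), D(x, d)
    a_l, a_u, a = 0.0, np.inf, alpha0
    for _ in range(max_iter):
        if np.any(F(x + a * d) > Fx + gamma * a * Dxd):
            a_u = a                      # sufficient decrease violated: shrink
        elif D(x + a * d, d) < sigma * Dxd:
            a_l = a                      # curvature condition violated: expand
        else:
            return a                     # both Wolfe conditions hold
        a = eta * a if np.isinf(a_u) else 0.5 * (a_l + a_u)
    return a
```

Under Assumption 1 (bounded level sets), the proposition above guarantees that such a scheme cannot loop forever: either the expansion phase eventually violates sufficient decrease, or an acceptable step is found.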

C Definition of new test problems
In this appendix, we introduce the new convex test problem which we call MAN_2. Moreover, we report the formulation of the rescaled versions of MAN_1 [21], FDS_1