1 Introduction

Multi-Objective Optimization (MOO) is a mathematical framework that has proved particularly well suited to many real-world problems over the years. Its relevance is demonstrated, for example, by applications in statistics [1], design [2], engineering [3, 4], management science [5] and space exploration [6]. The principal difficulty that makes MOO problems hard to handle is that, in general, no solution minimizes all the objective functions simultaneously. In this context, the definitions of optimality (global, local and stationarity) are based on Pareto’s theory, whose intricacies make the design of new methods and optimization processes challenging.

A class of MOO methods that has been widely studied over the past two decades is that of descent algorithms (first-order, second-order or derivative-free). These approaches are essentially extensions of classical iterative scalar optimization algorithms: Steepest Descent [7], Newton [8, 9], Quasi-Newton [10], Augmented Lagrangian [11] and Conjugate Gradient [12] are just a few members of this family. In addition to having theoretically relevant convergence properties, these algorithms, when used on problems with reasonable regularity assumptions, proved to be valid alternatives to scalarization approaches [13] and evolutionary ones [14], especially as the problem size grows [15, 16]. In recent years, some descent methods were also extended to generate approximations of the Pareto front, instead of a single Pareto-stationary solution [15,16,17,18]. Moreover, following ideas developed for scalar optimization [19, 20], descent methods have also been used as local search procedures within memetic algorithms [21].

Quasi-Newton methods are among the most popular algorithms for unconstrained single-objective optimization. Based on a quadratic model of the objective function, they do not require the calculation of second derivatives in order to find the search direction: the real Hessian is replaced by an approximation matrix, which is updated at each iteration using the newly generated solution and the previous one. The most famous update formula for the approximation matrix is BFGS, named after Broyden, Fletcher, Goldfarb and Shanno [22]. In the multi-objective setting, Quasi-Newton methods were proposed, for instance, in [10, 23,24,25].

Among the factors contributing to the success of Quasi-Newton methods in scalar optimization, the possibility of defining limited-memory variants certainly stands out. The approximate Hessian matrix can in fact be roughly recovered using only a finite number M of previously generated solutions. In this way, storing and managing the full matrix in memory, which could be extremely inefficient and time-consuming, is avoided. In particular, the L-BFGS algorithm, first proposed in [26], has achieved state-of-the-art performance in most settings over the years, even with relatively small values of M.

This work concerns, to the best of our knowledge, the first attempt in the literature to define a multi-objective limited memory Quasi-Newton method. The key elements that characterize the proposed approach are:

  (i) a shared approximation of the Hessian matrices is employed to compute the search direction;

  (ii) the Hessian matrix approximation only requires information related to the most recent iterations to be computed;

  (iii) the method is in general well defined and, in the strongly convex case, is shown to possess R-linear global convergence properties to Pareto optimality.

The rest of the manuscript is organized as follows. In Sect. 2, we recall the main concepts related to MOO Quasi-Newton methods. In Sect. 3, we provide a description of the proposed limited memory Quasi-Newton approach; we then provide the convergence analysis in Sect. 4. In Sect. 5, we show through computational experiments the good performance of the limited memory approach w.r.t. the main state-of-the-art Newton and Quasi-Newton methods. Moreover, in the same section we show the performance of the new approach when used as a local search procedure in a global optimization method. Finally, in Sect. 6 we provide some concluding remarks.

2 Preliminaries

In this work, we consider the following unconstrained multi-objective optimization problem:

$$\begin{aligned} \begin{aligned} \min _{x\in {\mathbb {R}}^n}\;&F\left( x\right) =\left( f_1\left( x\right) , \ldots ,f_m\left( x\right) \right) ^T, \end{aligned} \end{aligned}$$
(1)

where \(F: {\mathbb {R}}^n \rightarrow {\mathbb {R}}^m\) is a continuously differentiable function. We denote by \(J_F(\cdot ) = \left( \nabla f_1(\cdot ), \ldots , \nabla f_m(\cdot )\right) ^T \in {\mathbb {R}}^{m \times n}\) the Jacobian matrix associated with F. Moreover, for all \(j \in \left\{ 1,\ldots , m\right\} \), the Hessian matrix of the function \(f_j\left( \cdot \right) \), when it exists, is denoted by \(\nabla ^2 f_j\left( \cdot \right) \). In what follows, the Euclidean norm in \({\mathbb {R}}^n\) will be denoted by \(\left\| \cdot \right\| \).

Since we are in a multi-objective setting, we need a partial ordering in \({\mathbb {R}}^m\): considering two points \(u, v \in {\mathbb {R}}^m\), we have that

$$\begin{aligned} \begin{array}{ll} u< v \iff u_i < v_i, &{}\quad \forall i = 1,\ldots , m, \\ u \le v \iff u_i \le v_i, &{}\quad \forall i= 1,\ldots , m. \end{array} \end{aligned}$$

We can say that u dominates v if \(u \le v\) and \(u \ne v\). In this case, we use the following notation: \(u \lneqq v\). Similarly, we state that \(x \in {\mathbb {R}}^n\) dominates \(y \in {\mathbb {R}}^n\) w.r.t. F if \(F(x) \lneqq F(y)\).
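For illustration, the dominance relation can be checked componentwise. The following minimal Python sketch (our own illustration, not part of the paper) implements it for objective vectors given as sequences of floats.

```python
# Pareto dominance in objective space: u dominates v iff u <= v
# componentwise and u != v (i.e., strictly less in at least one component).
def dominates(u, v):
    """Return True iff u dominates v."""
    return all(ui <= vi for ui, vi in zip(u, v)) and \
        any(ui < vi for ui, vi in zip(u, v))
```

For instance, (1, 2) dominates (1, 3), while two identical objective vectors never dominate each other.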

In multi-objective optimization, a point minimizing all the objectives at once is unlikely to exist. For this reason, the concepts of optimality are based on Pareto’s theory.

Definition 1

A point \({\bar{x}} \in {\mathbb {R}}^n\) is Pareto optimal for problem (1) if there does not exist \(y \in {\mathbb {R}}^n\) such that \(F(y) \lneqq F({\bar{x}})\). If there exists a neighborhood \({\mathcal {N}}({\bar{x}})\) in which the previous property holds, then \({\bar{x}}\) is locally Pareto optimal.

Since Pareto optimality is a strong property, it is hard to attain in practice. A slightly more affordable condition is weak Pareto optimality.

Definition 2

A point \({\bar{x}} \in {\mathbb {R}}^n\) is weakly Pareto optimal for problem (1) if there does not exist \(y \in {\mathbb {R}}^n\) such that \(F(y) < F({\bar{x}})\). If there exists a neighborhood \({\mathcal {N}}({\bar{x}})\) in which the previous property holds, then \({\bar{x}}\) is locally weakly Pareto optimal.

We define the Pareto set as the set of all the Pareto optimal solutions. Moreover, we refer to the image of the Pareto set w.r.t. F as the Pareto front.

We can now introduce the concept of Pareto stationarity. Under differentiability assumptions, this condition is necessary for all types of Pareto optimality. Moreover, assuming the convexity of the objective functions in problem (1), Pareto stationarity is also a sufficient condition for Pareto optimality.

Definition 3

A point \({\bar{x}} \in {\mathbb {R}}^n\) is Pareto-stationary for problem (1) if we have that

$$\begin{aligned} \min _{d \in {\mathbb {R}}^n} \max _{j=1,\ldots ,m} \nabla f_j({\bar{x}})^Td = 0. \end{aligned}$$

The concepts of Pareto stationarity, Pareto optimality and convexity are related according to the following lemma.

Lemma 1

([8, Theorem 3.1]) The following statements hold:

  1. (i)

    if \({\bar{x}}\) is locally weakly Pareto optimal, then \({\bar{x}}\) is Pareto-stationary for problem (1);

  2. (ii)

    if F is convex and \({\bar{x}}\) is Pareto-stationary for problem (1), then \({\bar{x}}\) is weakly Pareto optimal;

  3. (iii)

    if F is twice continuously differentiable, \(\nabla ^2 f_j\left( x\right) \succ 0\) for all \(j \in \left\{ 1,\ldots , m\right\} \) and all \(x \in {\mathbb {R}}^n\), and \({\bar{x}}\) is Pareto-stationary for problem (1), then \({\bar{x}}\) is Pareto optimal.

Lastly, we introduce a relaxation of Pareto stationarity, called \(\varepsilon \)-Pareto-stationarity. This concept was first introduced in [11]; here, we propose a slightly modified version.

Definition 4

Let \(\varepsilon \ge 0\). A point \({\bar{x}} \in {\mathbb {R}}^n\) is \(\varepsilon \)-Pareto-stationary for problem (1) if

$$\begin{aligned} \min _{d \in {\mathbb {R}}^n} \max _{j=1,\ldots ,m} \nabla f_j({\bar{x}})^Td + \frac{1}{2}\left\| d \right\| ^2 \ge -\varepsilon . \end{aligned}$$

In the following, we briefly review the basic concepts for Quasi-Newton algorithms in multi-objective optimization.

2.1 Quasi-Newton methods

If a point \({\bar{x}} \in {\mathbb {R}}^n\) is not Pareto-stationary, then there exists a descent direction w.r.t. all the objectives. The Quasi-Newton direction can be introduced as the solution of the following problem [10]:

$$\begin{aligned} \min _{d\in {\mathbb {R}}^n}\max _{j=1,\ldots ,m} \nabla f_j\left( {\bar{x}}\right) ^Td + \frac{1}{2}d^TB_jd, \end{aligned}$$
(2)

where \(B_j \in {\mathbb {R}}^{n \times n}\), with \(j \in \{1,\ldots , m\}\), approximates the second derivatives \(\nabla ^2 f_j({\bar{x}})\). If the approximation matrices are positive definite, i.e., \(B_j \succ 0\ \forall j \in \{1,\ldots , m\}\), then the function \(\nabla f_j\left( {\bar{x}}\right) ^Td + \left( 1 / 2\right) d^TB_jd\) is strongly convex for each \(j \in \{1,\ldots , m\}\). In this case, problem (2) has a unique minimizer: we denote it by \(d_{QN}({\bar{x}})\). We also indicate with \(\theta _{QN}({\bar{x}})\) the optimal value of problem (2) at \({\bar{x}}\). It is trivial to observe that \(\theta _{QN}(x) \le 0\) for any \(x \in {\mathbb {R}}^n\). If \({\bar{x}}\) is Pareto-stationary, then \(\theta _{QN}({\bar{x}}) = 0\).
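Subproblem (2) can be solved numerically via its epigraph reformulation \(\min _{d, t} t\) s.t. \(\nabla f_j({\bar{x}})^Td + (1/2) d^TB_jd \le t\) for all j. The sketch below is only an illustration (not the solver used in the paper); it assumes SciPy's SLSQP method is available and returns both \(d_{QN}({\bar{x}})\) and \(\theta _{QN}({\bar{x}})\).

```python
import numpy as np
from scipy.optimize import minimize

def quasi_newton_direction(grads, Bs):
    """Solve min_d max_j grad_j^T d + 0.5 d^T B_j d through the
    epigraph reformulation min_{d,t} t s.t. each term <= t.
    grads: (m, n) array of objective gradients; Bs: list of m PD matrices.
    Returns (d_QN, theta_QN)."""
    m, n = grads.shape
    cons = [
        {"type": "ineq",  # feasibility: t - g^T d - 0.5 d^T B d >= 0
         "fun": lambda z, g=g, B=B: z[n] - g @ z[:n] - 0.5 * z[:n] @ B @ z[:n]}
        for g, B in zip(grads, Bs)
    ]
    res = minimize(lambda z: z[n], np.zeros(n + 1),
                   constraints=cons, method="SLSQP")
    return res.x[:n], res.x[n]
```

As a sanity check, with a single objective and \(B = I\) the solution reduces to \(d = -\nabla f({\bar{x}})\) with optimal value \(-\left\| \nabla f({\bar{x}})\right\| ^2/2\), while at a Pareto-stationary point the optimal value is zero.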

As in [24], we introduce the function \({\mathcal {D}}: {\mathbb {R}}^n\times {\mathbb {R}}^n \rightarrow {\mathbb {R}}\), defined by

$$\begin{aligned} {\mathcal {D}}(x, d) = \max _{j=1,\ldots ,m} \nabla f_j(x)^Td. \end{aligned}$$

Any direction d such that \({\mathcal {D}}({\bar{x}},d)<0\) is a descent direction at \({\bar{x}}\) for F. Moreover, the function \({\mathcal {D}}(\cdot , \cdot )\) has some properties, which we report in the next lemma.

Lemma 2

([24, Lemma 2.2]) The following statements hold:

  1. for any \(x \in {\mathbb {R}}^n\) and \(\alpha \ge 0\), we have \({\mathcal {D}}(x, \alpha d) = \alpha {\mathcal {D}}(x, d)\);

  2. the mapping \((x, d) \rightarrow {\mathcal {D}}(x, d)\) is continuous.

As we thoroughly recall in Appendix A.1, the Lagrangian dual problem of (2) is given by:

$$\begin{aligned} \begin{aligned} \max _{\lambda \in {\mathbb {R}}^m}\;&\quad -\frac{1}{2}\lambda ^T J_F\left( {\bar{x}}\right) \left[ \sum _{j=1}^{m}\lambda _jB_j\right] ^{-1} J_F\left( {\bar{x}}\right) ^T\lambda \\ \text {s.t. }&\quad \sum _{j=1}^m \lambda _j = 1,\qquad \lambda \ge {\textbf{0}}. \end{aligned} \end{aligned}$$
(3)

Regarding problems (2) and (3), strong duality holds and the Karush-Kuhn-Tucker conditions are necessary and sufficient for optimality. Moreover, denoting by \(\lambda ^{QN}\left( {\bar{x}}\right) = \left( \lambda _1^{QN}\left( {\bar{x}}\right) ,\ldots , \lambda _m^{QN}\left( {\bar{x}}\right) \right) ^T\) the optimal Lagrange multipliers vector, we have that

$$\begin{aligned} \sum _{j=1}^m\lambda _j^{QN}\left( {\bar{x}}\right) = 1, \qquad \lambda ^{QN}\left( {\bar{x}}\right) \ge {\textbf{0}} \end{aligned}$$
(4)

and

$$\begin{aligned} d_{QN}\left( {\bar{x}}\right) = - \left[ \sum _{j=1}^m\lambda _j^{QN}\left( {\bar{x}}\right) B_j\right] ^{-1} J_F\left( {\bar{x}}\right) ^T\lambda ^{QN}\left( {\bar{x}}\right) . \end{aligned}$$
(5)

Due to the presence of the inverse of the convex combination of the approximation matrices, i.e., \(\left[ \sum _{j=1}^m\lambda _jB_j\right] ^{-1}\), problem (3) is difficult to solve.

If, for all \(j \in \left\{ 1,\ldots , m\right\} \), \(B_j = I\), where \(I \in {\mathbb {R}}^{n \times n}\) is the identity matrix, problem (2) is identical to the one proposed in [7] to find the steepest common descent direction. We denote the latter by \(d_{SD}\left( {\bar{x}}\right) \) and the associated Lagrange multipliers vector by \(\lambda ^{SD}\left( {\bar{x}}\right) = \left( \lambda _1^{SD}\left( {\bar{x}}\right) ,\ldots , \lambda _m^{SD}\left( {\bar{x}}\right) \right) ^T\). Obviously, equations (4)-(5) hold true in this particular case. We further recall a well-known result that will be used in our convergence analysis.

Lemma 3

The following statements hold:

  (i) the mapping \(d_{SD}\left( \cdot \right) \) is continuous;

  (ii) \({\bar{x}} \in {\mathbb {R}}^n\) is Pareto-stationary for problem (1) if and only if \(d_{SD}\left( {\bar{x}}\right) = {\textbf{0}}\).

Proof

See [27, Lemma 3.3] and [8, Lemma 3.2]. \(\square \)

Based on the concept of Quasi-Newton direction, a Quasi-Newton approach for multi-objective optimization of strongly convex objective functions is proposed in [10]. In this algorithm, a backtracking Armijo-type line search is used to guarantee the sufficient decrease w.r.t. all the objective functions. The result is formalized by the following lemma.

Lemma 4

([7, Lemma 4]) If F is continuously differentiable and \(J_F(x)d<{\textbf{0}}\), then there exists some \(\varepsilon >0\), which may depend on x, d and \(\gamma \in \left( 0, 1\right) \), such that

$$\begin{aligned} F(x+td)<F(x) + \gamma t J_F(x)d \end{aligned}$$

for all \(t\in (0,\varepsilon ]\).

Remark 1

By the definition of \({\mathcal {D}}(\cdot , \cdot )\), we have that \(J_F(x)d \le {\textbf{1}}{\mathcal {D}}(x, d)\). Moreover, if \(B_j \succ 0\ \forall j \in \{1,\ldots , m\}\), then it follows that \({\mathcal {D}}(x, d) < \theta _{QN}(x)\). Using Lemma 4 and these results, we trivially obtain that, for all \(t\in (0,\varepsilon ]\),

$$\begin{aligned} \begin{aligned} F(x+td)&< F(x) + \gamma t J_F(x)d \\&\le F(x) + {\textbf{1}}\gamma t {\mathcal {D}}(x, d) \\&< F(x) + {\textbf{1}}\gamma t \theta _{QN}(x). \end{aligned} \end{aligned}$$

In many works for MOO [10, 25], the BFGS update formula is independently used for all \(B_j\), with \(j \in \{1,\ldots , m\}\):

$$\begin{aligned} B_{j}^{k + 1} = B_{j}^{k} - \frac{B_{j}^{k} s_ks_k^T B_{j}^{k}}{s_k^T B_{j}^{k} s_k} + \frac{y_{j}^{k}\left( y_{j}^{k}\right) ^T}{s_k^T y_{j}^{k}}, \end{aligned}$$

where \(s_k = x_{k + 1} - x_k\) and \(y_{j}^{k} = \nabla f_j\left( x_{k + 1}\right) - \nabla f_j\left( x_k\right) \).
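As a concrete illustration (our own sketch, not code from the paper), the update above takes only a few lines; when \(s_k^Ty_j^k > 0\) and \(B_j^k \succ 0\), the result stays positive definite and satisfies the secant equation \(B_j^{k+1}s_k = y_j^k\).

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of one objective's Hessian approximation:
    B+ = B - (B s s^T B) / (s^T B s) + (y y^T) / (s^T y).
    Assumes the curvature condition s^T y > 0 holds."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
```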

We also introduce the formula for updating the inverse of the approximation matrix \(B_j\), which we denote by \(H_{j}\):

$$\begin{aligned} H_{j}^{k + 1} = \left( I - \rho _{j}^{k} y_{j}^{k} s_k^T\right) ^T H_{j}^{k} \left( I - \rho _{j}^{k} y_{j}^{k} s_k^T\right) + \rho _{j}^{k} s_k s_k^T, \end{aligned}$$
(6)

where \(\rho _{j}^{k} = 1 / \left( s_k^Ty_{j}^{k}\right) \).

Similar to the scalar case [28], for each \(j \in \{1,\ldots , m\}\), if \(s_k^Ty_{j}^{k} > 0\) and \(B_{j}^{k} \succ 0\), then \(B_{j}^{k + 1}\) is positive definite. The same property holds true if \(\{H_j^{k}\}\) is considered. When the objective functions are strictly convex, the condition \(s_k^Ty_{j}^{k} > 0\) is always satisfied for any pair \((x_{k + 1}, x_k)\) and for each \(j \in \{1,\ldots , m\}\). However, this property is not guaranteed to hold in the general case. In order to overcome this issue, in Quasi-Newton methods for scalar optimization, Wolfe conditions are imposed at each iteration [28].

The Wolfe conditions have been extended to MOO in [12]:

$$\begin{aligned} F\left( x_k + \alpha d_{QN}(x_k)\right)&\le F(x_k) + {\textbf{1}}\gamma \alpha {\mathcal {D}}\left( x_k, d_{QN}(x_k)\right) ,\nonumber \\ {\mathcal {D}}\left( x_k + \alpha d_{QN}(x_k), d_{QN}(x_k)\right)&\ge \sigma {\mathcal {D}}(x_k, d_{QN}(x_k)). \end{aligned}$$
(7)

Assuming that \(d_{QN}(x_k)\) is a descent direction for F at \(x_k\) and there exists \({\mathcal {A}} \in {\mathbb {R}}^m\) such that \(F\left( x_k + \alpha d_{QN}(x_k)\right) \ge {\mathcal {A}}\) for all \(\alpha > 0\), an interval of values exists satisfying these conditions [12, Proposition 3.2]. The theoretical result can be further improved assuming the boundedness of at least one objective function [29, Proposition 1].

However, even if Wolfe conditions are satisfied, it may occur that \(s_k^Ty_{j}^{k} \le 0\) for some \(j \in \{1,\ldots , m\}\). In other words, considering that

$$\begin{aligned} s_k = x_{k + 1} - x_k = \alpha _kd_{QN}(x_k), \end{aligned}$$
(8)

we may have that, for some \(j \in \{1,\ldots , m\}\),

$$\begin{aligned} \left[ \nabla f_j(x_{k + 1}) - \nabla f_j(x_k)\right] ^T d_{QN}(x_k) \le 0, \end{aligned}$$

which can be also re-written in the form

$$\begin{aligned} \nabla f_j(x_{k + 1})^Td_{QN}(x_k) \le \nabla f_j(x_k)^Td_{QN}(x_k). \end{aligned}$$
(9)

For this reason, a different formula for updating \(B_{j}\) is introduced in [24]. The corresponding update formula for \(H_j\) remains similar to (6), except that \(\rho _{j}^{k}\) is now defined as

$$\begin{aligned} \rho _{j}^{k} = {\left\{ \begin{array}{ll} 1 / \left( s_k^Ty_{j}^{k}\right) &{} \quad \text {if } s_k^Ty_{j}^{k} > 0, \\ 1 / \left[ {\mathcal {D}}\left( x_{k + 1}, s_k\right) - \nabla f_j\left( x_k\right) ^T s_k\right] &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(10)

Using the above update rule, \(\rho _{j}^{k}\) is proved to be strictly positive even when \(s_k^Ty_{j}^{k} \le 0\). Thus, \(H_j^{k + 1}\) and, consequently, \(B_j^{k + 1}\) always remain positive definite.
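A minimal sketch of the safeguarded coefficient (10) together with the inverse update (6) follows (our illustration; the function names are not from the paper). With a positive \(\rho _j^k\), the update preserves positive definiteness.

```python
import numpy as np

def rho_safeguarded(s, J_new, grad_j_old, j):
    """Coefficient rho_j^k from (10): the usual 1 / (s^T y_j) when the
    curvature condition holds, otherwise the fallback
    1 / (D(x_{k+1}, s) - grad_j(x_k)^T s), with D(x, s) = max_i grad_i(x)^T s.
    J_new: (m, n) Jacobian at x_{k+1}; grad_j_old: gradient of f_j at x_k."""
    y_j = J_new[j] - grad_j_old
    sty = s @ y_j
    if sty > 0:
        return 1.0 / sty
    return 1.0 / (np.max(J_new @ s) - grad_j_old @ s)

def inverse_bfgs_update(H, s, y, rho):
    """Inverse update (6): H+ = (I - rho y s^T)^T H (I - rho y s^T)
    + rho s s^T; H+ is symmetric PD whenever rho > 0 and H is PD."""
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)
```

In the toy case below, \(s_k^Ty_j^k \le 0\), yet the fallback denominator is positive (as guaranteed under the Wolfe conditions), so the updated inverse approximation remains positive definite.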

2.2 Single Hessian matrix approximation

The use of a single positive definite matrix B was proposed in [23] to approximate \(\nabla ^2f_1(x),\ldots ,\nabla ^2f_m(x)\). In this case, problem (2) becomes

$$\begin{aligned} \min _{d\in {\mathbb {R}}^n}\max _{j=1,\ldots ,m} \nabla f_j\left( {\bar{x}}\right) ^Td + \frac{1}{2}d^TBd, \end{aligned}$$

while the dual (3) changes into

$$\begin{aligned} \begin{aligned} \max _{\lambda \in {\mathbb {R}}^m}\;&\quad -\frac{1}{2}\lambda ^T J_F\left( {\bar{x}} \right) B^{-1}J_F\left( {\bar{x}}\right) ^T\lambda \\ \text {s.t. }&\quad \sum _{j=1}^m \lambda _j = 1,\qquad \lambda \ge {\textbf{0}}. \end{aligned} \end{aligned}$$
(11)

Now, the descent direction can accordingly be computed as

$$\begin{aligned} d_{MQN}({\bar{x}}) = -B^{-1}J_F\left( {\bar{x}}\right) ^T\lambda ^{MQN}\left( {\bar{x}}\right) , \end{aligned}$$

where \(\lambda ^{MQN}\left( {\bar{x}}\right) = \left( \lambda _1^{MQN}\left( {\bar{x}}\right) ,\ldots , \lambda _m^{MQN}\left( {\bar{x}}\right) \right) ^T\) indicates the Lagrange multipliers vector.

The difficult term \(\left[ \sum _{j=1}^m\lambda _jB_j\right] ^{-1}\) appearing in (3) is replaced by \(B^{-1}\). As a consequence, problem (11) reduces to a linearly constrained, convex quadratic program which is easy to solve.

The single matrix B can be obtained as the approximation of a convex combination of matrices. For this purpose, slightly modified BFGS update formulas are introduced in [23]:

$$\begin{aligned} B^{k + 1}&= B^k - \frac{B^k s_ks_k^T B^k}{s_k^T B^k s_k} + \frac{u_ku_k^T}{s_k^T u_k}; \end{aligned}$$
(12)
$$\begin{aligned} H^{k + 1}&= \left( I - \rho ^k u_k s_k^T\right) ^T H^k \left( I - \rho ^k u_k s_k^T\right) + \rho ^k s_k s_k^T, \end{aligned}$$
(13)

with \(\rho ^k = 1 / \left( s_k^Tu_k\right) \) and \(u_k = \sum _{j=1}^{m}\lambda _j^{MQN}(x_k)y_{j}^{k}\).
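The pair (12)-(13) keeps B and H consistent: (13) is exactly the inverse of (12), so \(B^{k+1}H^{k+1} = I\). The sketch below (our illustration; \(u_k\) is passed in precomputed, as if obtained from the multipliers \(\lambda ^{MQN}(x_k)\)) makes this duality easy to verify numerically.

```python
import numpy as np

def direct_update(B, s, u):
    """(12): B+ = B - (B s s^T B)/(s^T B s) + (u u^T)/(s^T u)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(u, u) / (s @ u)

def inverse_update(H, s, u):
    """(13): H+ = (I - rho u s^T)^T H (I - rho u s^T) + rho s s^T,
    with rho = 1 / (s^T u)."""
    rho = 1.0 / (s @ u)
    V = np.eye(len(s)) - rho * np.outer(u, s)
    return V.T @ H @ V + rho * np.outer(s, s)
```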

3 A limited memory Quasi-Newton method

In this section, we introduce a new Limited Memory Quasi-Newton approach for MOO, whose algorithmic scheme is reported in Algorithm 1.

[Algorithm 1: the proposed limited memory Quasi-Newton scheme]

In the proposed approach, we use a single positive definite matrix \(H^k\) at each iteration k. In Sect. 3.2, we introduce the update formula for \(H^k\), which slightly differs from the one introduced in [23]. As in L-BFGS for scalar optimization, we maintain only a finite number M of vector pairs \(\left\{ \left( s_i, u_i\right) \right\} \) in memory: the oldest one is discarded each time a new pair is computed. These pairs are used in a two-loop recursive procedure to efficiently carry out the matrix multiplication \({\mathcal {R}}^k = H^kJ_F(x_k)^T\) (Sect. 3.1). This procedure is essentially an extension to MOO of the one used in L-BFGS [28]. The matrix \({\mathcal {R}}^k\) is then used in problem (14) at step 3 of the algorithm: the latter is simply derived from problem (11) by substituting \(H^kJ_F(x_k)^T\) with \({\mathcal {R}}^k\). We denote by \(\theta _{LM}(x_k)\) the optimal value of problem (14) at \(x_k\). Moreover, we denote by \(\lambda ^{LM}(x_k)\) (Line 4) and \(d_{LM}(x_k)\) (Line 5) the Lagrange multipliers vector and the direction corresponding to \(\theta _{LM}(x_k)\), respectively. Note that (4) is valid in this context too. Finally, in Line 6, a Wolfe line search is carried out to find a step size \(\alpha _k\) along the direction \(d_{LM}(x_k)\) satisfying the Wolfe conditions for MOO (Sect. 3.3).

In the following, we analyze in depth the various aspects of Algorithm 1.

3.1 Two-loop recursive procedure for MOO

In L-BFGS, one of the most relevant features is the two-loop recursive procedure which, at any iteration k, given the vector pairs stored in memory, allows one to efficiently compute the product \(H^k\nabla f(x_k)\), where \(f(\cdot )\) denotes the objective function [28]. Recall, indeed, that in scalar optimization the negative of this product is the Quasi-Newton descent direction: \(d(x_k) = -H^k\nabla f(x_k)\). Using this procedure, we do not need to store the matrix H in memory. This property can be crucial for high dimensional problems: in these cases, maintaining and updating the matrix H, which is dense in general, could be extremely inefficient. In this work, we propose an extension of this procedure to MOO: the algorithmic scheme is reported in Algorithm 2.

[Algorithm 2: the two-loop recursive procedure for MOO]

With respect to the scalar optimization case, this procedure computes the product \(H^kJ_F(x_k)^T\). The same result could be obtained by repeating the scalar procedure m times, computing \(H^k\nabla f_j(x_k)\) for each \(j \in \{1,\ldots , m\}\). In both cases, \(m\left( 4Mn + n\right) \) multiplications are required. However, Algorithm 2 makes it possible to exploit the optimized matrix operations of software libraries for vector calculus.

Algorithm 2 relies on some properties of (13). Indeed, the latter can be re-written in the following form [28]:

$$\begin{aligned} H^k =&\left[ \left( V^{k - 1}\right) ^T\ldots \left( V^{k - M}\right) ^T\right] H^{k - M}\left[ V^{k - M}\ldots V^{k - 1}\right] \\&+\rho ^{k - M}\left[ \left( V^{k - 1}\right) ^T\ldots \left( V^{k - M + 1}\right) ^T\right] s_{k - M}s_{k - M}^T\left[ V^{k - M + 1}\ldots V^{k - 1}\right] \\&+\rho ^{k - M + 1}\left[ \left( V^{k - 1}\right) ^T\ldots \left( V^{k - M + 2}\right) ^T\right] s_{k - M + 1}s_{k - M + 1}^T\left[ V^{k - M + 2}\ldots V^{k - 1}\right] \\&+\ldots \\&+\rho ^{k - 1}s_{k - 1}s_{k - 1}^T, \end{aligned}$$

where \(V^i = I - \rho ^iu_is_i^T\). As in L-BFGS, the exact matrix \(H^{k - M}\) is substituted by a suitable sparse positive definite matrix \(H^0\). From this last equation, the two-loop recursive procedure to compute the product \(H^kJ_F(x_k)^T\) is derived. We refer the reader to [28] for more details.
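A compact sketch of the idea (our illustration, not the paper's exact Algorithm 2) applies the classical two-loop recursion to all m columns of \(J_F(x_k)^T\) at once; an explicit reconstruction of H via (13) is included only to check the result.

```python
import numpy as np

def two_loop_moo(JT, pairs, gamma=1.0):
    """Return R = H^k J_F(x_k)^T without forming H^k, where H^k is
    implicitly defined by applying update (13) with H^0 = gamma * I
    to every stored pair (s_i, u_i), oldest first.
    JT: (n, m) matrix; pairs: list of (s_i, u_i), oldest first."""
    Q = np.array(JT, dtype=float)
    rhos = [1.0 / (s @ u) for s, u in pairs]
    alphas = []
    for (s, u), rho in zip(reversed(pairs), reversed(rhos)):
        a = rho * (s @ Q)            # one alpha per column of JT
        Q -= np.outer(u, a)
        alphas.append(a)
    Q *= gamma                       # multiply by H^0 = gamma * I
    for (s, u), rho, a in zip(pairs, rhos, reversed(alphas)):
        b = rho * (u @ Q)
        Q += np.outer(s, a - b)
    return Q

def explicit_H(pairs, n, gamma=1.0):
    """Build H by applying (13) explicitly (for checking only)."""
    H = gamma * np.eye(n)
    for s, u in pairs:
        rho = 1.0 / (s @ u)
        V = np.eye(n) - rho * np.outer(u, s)
        H = V.T @ H @ V + rho * np.outer(s, s)
    return H
```

Note that `two_loop_moo` only stores the M pairs and works column-wise on \(J_F(x_k)^T\), whereas `explicit_H` materializes the dense n x n matrix, which is exactly what the limited memory scheme avoids.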

3.2 Definition of H

In the proposed approach, we use a single positive definite matrix H, updated as in [23] through formula (13). However, taking inspiration from (10), we use a different definition of \(\rho ^k\):

$$\begin{aligned} \rho ^k = {\left\{ \begin{array}{ll} 1 / \left( s_k^Tu_k\right) &{} \quad \text {if } s_k^Tu_k > 0, \\ 1 / \left\{ \sum _{j=1}^m\lambda _j^{LM}(x_k)\left[ {\mathcal {D}} \left( x_{k + 1}, s_k\right) - \nabla f_j\left( x_k\right) ^T s_k\right] \right\} &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(19)

As in [24], we carry out a line search to find a step size satisfying the Wolfe conditions for MOO (Sect. 3.3). However, recalling the reasoning in Sect. 2.1, in order to ensure that \(H^{k + 1} \succ 0\), through (19) we force \(\rho ^k\) to be positive even when \(s_k^Tu_k \le 0\). We formalize this statement in the following proposition.

Proposition 1

Considering a generic iteration k of Algorithm 1, let \(x_k \in {\mathbb {R}}^n\), \(d_{LM}(x_k) \in {\mathbb {R}}^n\) be a direction such that \({\mathcal {D}}\left( x_k, d_{LM}(x_k)\right) < 0\), \(\alpha _k > 0\) be a step size along \(d_{LM}(x_k)\) and \(\lambda ^{LM}(x_k)\) be the Lagrange multipliers vector obtained solving problem (14). If \(\rho ^k\) is updated by (19), then \(\rho ^k\) is positive.

Proof

See Appendix A.2. \(\square \)

Remark 2

In the single objective case, the update formula (13) for \(H^k\) coincides with the classical BFGS rule. Indeed, it is sufficient to observe that, since \(\lambda ^{LM}\left( x_k\right) \) lies in the unit simplex by (4), then \(u_k = y^k\). Moreover, the same reasoning applied to (19) gives \(\rho ^k = 1 / \left( s_k^Ty^k\right) \). Hence, the two-loop recursive procedure reduces to that of L-BFGS. In turn, the overall Algorithm 1 is nothing but L-BFGS, since \(d^k=-H^k\nabla f(x_k)\) and the Wolfe conditions are imposed by the line search.

Remark 3

The procedure in Algorithm 2 cannot be used if we consider an approximation matrix for each objective function, as in problem (2). In such a case, both in the primal and in the dual problem (3), the matrices are tied to the problem variables; for example, when solving (3), the product \([\, \sum _{j=1}^m \lambda _j B_j]^{-1}J_F({\bar{x}})^T\) would have to be recomputed every time a different solution \(\lambda \) is considered. The use of a single positive definite matrix prevents this issue: the matrix multiplication \(HJ_F({\bar{x}})^T\) can be computed only once, before solving subproblem (11), making it possible to exploit the efficiency of the two-loop recursive procedure.

3.3 Wolfe line search

In this section, we introduce a simple line search scheme to find a step size \(\alpha \) along a given direction \(d_k\) satisfying the Wolfe conditions:

$$\begin{aligned} F\left( x_k + \alpha d_k\right)&\le F(x_k) + {\textbf{1}}\gamma \alpha {\mathcal {D}}\left( x_k, d_k\right) , \end{aligned}$$
(20)
$$\begin{aligned} {\mathcal {D}}\left( x_k + \alpha d_k, d_k\right)&\ge \sigma {\mathcal {D}}\left( x_k, d_k\right) . \end{aligned}$$
(21)

Before proceeding, we make a reasonable assumption. Then, we prove that there exists an interval of values satisfying the Wolfe conditions. Note that an analogous result has been obtained in [12, 29] under the different assumptions also reported in Sect. 2.1 of this manuscript.

Assumption 1

The objective function F has bounded level sets in the multi-objective sense, i.e., the set \({\mathcal {L}}_F(z)=\{x \in {\mathbb {R}}^n \mid F(x) \le z\}\) is bounded for any \(z \in {\mathbb {R}}^m\).

Proposition 2

Let Assumption 1 hold. Let \(x_k\in {\mathbb {R}}^n\) and assume that \(d_k \in {\mathbb {R}}^n\) is a direction such that \({\mathcal {D}}(x_k, d_k)<0\), \(\gamma \in \left( 0, 1/2\right) \) and \(\sigma \in \left( \gamma , 1\right) \). Then, there exists an interval of values \(\left[ \alpha _l, \alpha _u\right] \), with \(0< \alpha _l < \alpha _u\), such that for all \(\alpha \in \left[ \alpha _l, \alpha _u\right] \) equations (20) and (21) hold.

Proof

See Appendix B. \(\square \)

After proving the existence of an interval of values satisfying the Wolfe conditions, we report the algorithmic scheme of the considered line search.

[Algorithm 3: the Wolfe line search scheme]

Starting from \(\alpha _l^0 = 0, \alpha _u^0 = \infty \), the core idea of the line search is to reduce the interval \(\left[ \alpha _l^t, \alpha _u^t\right] \) until a valid step size \(\alpha ^t\) is found. At the beginning of the for-loop, the Wolfe sufficient decrease condition (20) is checked. If it is not satisfied by \(\alpha ^t\), we update \(\alpha _u^t\) and keep the same value for \(\alpha _l^t\) (Lines 4 and 5). Otherwise, \(\alpha _u^t\) is not updated (Line 7) and we check whether the Wolfe curvature condition (21) is satisfied by \(\alpha ^t\): if it is, both Wolfe conditions hold and the current step size is returned; otherwise, \(\alpha _l^t\) is updated according to Line 9. After updating \(\alpha _u^t\) or \(\alpha _l^t\), a new value for the step size \(\alpha ^t\) is chosen in the interval \(\left( \alpha _l^t, \alpha _u^t\right) \) and the process is repeated.
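A bare-bones sketch of this bracketing scheme follows (our simplified rendition, not the paper's exact Algorithm 3: we fix bisection as the interior choice and a constant expansion factor \(\eta \); the parameter names are ours).

```python
import numpy as np

def D(J, d):
    """D(x, d) = max_j grad_j(x)^T d, with J = J_F(x)."""
    return np.max(J @ d)

def wolfe_line_search(F, JF, x, d, gamma=1e-4, sigma=0.9,
                      alpha0=1.0, eta=2.0, max_iter=100):
    """Bracketing line search for the MOO Wolfe conditions (20)-(21):
    shrink [alpha_l, alpha_u] (expanding while no upper bound exists)
    until a valid step size is found.
    F: R^n -> R^m, JF: R^n -> R^{m x n}. Returns a step size."""
    Dk = D(JF(x), d)
    assert Dk < 0, "d must be a descent direction"
    Fx = F(x)
    al, au, a = 0.0, np.inf, alpha0
    for _ in range(max_iter):
        if np.any(F(x + a * d) > Fx + gamma * a * Dk):   # (20) fails
            au = a
        else:
            if D(JF(x + a * d), d) >= sigma * Dk:        # (21) holds
                return a
            al = a
        a = eta * max(al, alpha0) if au == np.inf else 0.5 * (al + au)
    return a
```

On a toy strongly convex bi-objective, e.g. \(F(x) = (\Vert x - e_1\Vert ^2, \Vert x + e_1\Vert ^2)\), a valid step is typically found in very few iterations.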

In the next lemma, we state some properties related to the interval upper and lower bounds \(\alpha _u^t\) and \(\alpha _l^t\).

Lemma 5

Consider a generic iteration t of Algorithm 3. Let \(x_k \in {\mathbb {R}}^n\) and \(d_k\) be a direction such that \({\mathcal {D}}(x_k, d_k) < 0\). Then, we have the following properties:

  (i) if \(\alpha _u^t < \infty \), then

    $$\begin{aligned} \exists \; j \left( \alpha _u^t\right) \text { s.t. }f_{j\left( \alpha _u^t\right) }(x_k + \alpha _u^td_k) > f_{j\left( \alpha _u^t\right) }(x_k) + \gamma \alpha _u^t{\mathcal {D}}(x_k, d_k); \end{aligned}$$
    (22)

  (ii) \(\alpha _l^{t}\) is such that

    $$\begin{aligned} F(x_k + \alpha _l^td_k)&\le F(x_k) + {\textbf{1}}\gamma \alpha _l^t{\mathcal {D}}(x_k, d_k), \end{aligned}$$
    (23)
    $$\begin{aligned} {\mathcal {D}}(x_k + \alpha _l^td_k, d_k)&< \sigma {\mathcal {D}}(x_k, d_k). \end{aligned}$$
    (24)

Proof

See Appendix B. \(\square \)

In the following proposition, we state that the proposed line search is well defined, i.e., it terminates after a finite number of iterations returning a step size satisfying the Wolfe conditions.

Proposition 3

Let Assumption 1 hold, \(\delta \in \left[ 1/2, 1\right) \), \(\eta > 1\) and let \(\{\alpha _l^t,\alpha _u^t,\alpha ^t\}\) be the sequence generated by Algorithm 3. Assume that:

  1. \(d_k \in {\mathbb {R}}^n\) is a descent direction for F at \(x_k \in {\mathbb {R}}^n\);

  2. for all \(t > 0\), the step size \(\alpha ^t\) is chosen so that

    a) if \(\alpha _{u}^{t} = \infty \),

      $$\begin{aligned} \alpha ^t \ge \eta \max \left\{ \alpha _{l}^{t}, \alpha ^0\right\} , \end{aligned}$$
      (25)

    b) if \(\alpha _{u}^{t} < \infty \),

      $$\begin{aligned} \max \left\{ \left( \alpha ^t - \alpha _{l}^{t}\right) , \left( \alpha _{u}^{t} - \alpha ^t\right) \right\} \le \delta \left( \alpha _{u}^{t} - \alpha _{l}^{t}\right) . \end{aligned}$$
Then Algorithm 3 is well defined, i.e., it stops after a finite number of iterations returning a step size \({\hat{\alpha }}\) satisfying the Wolfe conditions for MOO.

Proof

See Appendix B. \(\square \)

Remark 4

To the best of our knowledge, the first Wolfe line search for MOO was proposed in [29]. Our line search is just a simpler algorithm that is guaranteed to produce a point satisfying the Wolfe conditions. Admittedly, not employing an inner solver, as is instead done in [29], could be a performance disadvantage; in addition, smarter strategies to set the trial step size could be integrated. We decided not to compare the two line searches, since finding new efficient methodologies for computing the step size is not the focus of our work. Moreover, we are confident that the experimental results of Sect. 5 would be similar regardless of the employed Wolfe line search.

4 Convergence analysis

In this section, we show the convergence properties of our Limited Memory Quasi-Newton approach. Before proceeding, similarly to what is done in [30], we need to make some assumptions about the objective function F and the initial approximation matrix \(H^0\).

Assumption 2

We assume that:

  (i) F is twice continuously differentiable;

  (ii) the set \({\mathcal {L}}_F\left( F\left( x_0\right) \right) = \left\{ x \in {\mathbb {R}}^n \mid F\left( x\right) \le F\left( x_0\right) \right\} \) is convex;

  (iii) \(\exists a, b \in {\mathbb {R}}_+\) such that, for all \(j \in \left\{ 1,\ldots , m\right\} \),

    $$\begin{aligned} a\left\| z\right\| ^2 \le z^T\nabla ^2f_j(x)z \le b\left\| z\right\| ^2, \qquad \forall z \in {\mathbb {R}}^n, \forall x \in {\mathcal {L}}_F\left( F(x_0)\right) . \end{aligned}$$

Assumption 3

The matrix \(H^0\) is chosen such that the norms \(\left\| H^0 \right\| \) and \(\left\| B^0 \right\| \) are bounded.

Remark 5

Assumption 2 implies Assumption 1. Indeed, \(f_j\left( \cdot \right) \) is strongly convex for all \(j \in \left\{ 1,\ldots , m\right\} \) and thus has bounded level sets. Consequently, since Assumption 1 holds, Propositions 2 and 3 concerning the line search remain valid.

Remark 6

By Assumption 2, we have \(s_k^Ty_j^k > 0\) for any k and for all \(j \in \left\{ 1,\ldots , m\right\} \). Then, considering (18) and since \(\lambda ^{LM}\left( x_k\right) \) satisfies (4), we have that \(s_k^Tu_k > 0\). Then, according to (19), \(\rho ^k = 1 / \left( s_k^Tu_k\right) \) and, thus, we update \(B^k\) and \(H^k\) using (12) and (13), respectively.

In order to carry out the theoretical analysis, we take as reference Algorithm 4, which is mathematically equivalent to Algorithm 1 but makes more explicit how the approximation \(H^k\) is computed, i.e., by applying the update rule (13) M times starting from \(H^0\). In the remainder of the section, we will consider the approximation matrix \(B^k\) for the sake of clarity; the results are obviously the same if we consider the matrix \(H^k\). Finally, note that Algorithm 4 is only used in this section, since, unlike Algorithm 1, it requires storing the entire matrix in memory.
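In contrast, Algorithm 1 never materializes \(H^k\): the product of the implicit matrix with a vector is computed from the M stored pairs \(\left( s_i, u_i\right) \) via the classical two-loop recursion. The sketch below is ours, written under the assumption that (13) is the standard inverse-BFGS update \(H \leftarrow \left( I - \rho s u^T\right) H \left( I - \rho u s^T\right) + \rho s s^T\), that \(H^0 = \gamma I\), and that every stored pair satisfies \(s_i^Tu_i > 0\) (as guaranteed by Remark 6); it checks numerically, on synthetic data, that the recursion matches the explicitly built matrix:

```python
import numpy as np

def two_loop_direction(g, pairs, gamma=1.0):
    """Return H g, where H is the inverse BFGS approximation implicitly
    defined by H^0 = gamma * I and the stored curvature pairs (s_i, u_i)."""
    q = np.array(g, dtype=float)
    rhos = [1.0 / (s @ u) for s, u in pairs]   # requires s_i^T u_i > 0
    alphas = []
    for (s, u), rho in zip(reversed(pairs), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * u
    r = gamma * q                              # multiply by H^0 = gamma * I
    for (s, u), rho, a in zip(pairs, rhos, reversed(alphas)):
        b = rho * (u @ r)
        r = r + (a - b) * s
    return r

# numerical check against the explicit recursion:
# H <- (I - rho s u^T) H (I - rho u s^T) + rho s s^T, applied M times
rng = np.random.default_rng(0)
n, M = 6, 3
pairs = []
for _ in range(M):
    s = rng.standard_normal(n)
    pairs.append((s, s + 0.1 * rng.standard_normal(n)))  # s^T u > 0
H = np.eye(n)
for s, u in pairs:
    rho = 1.0 / (s @ u)
    V = np.eye(n) - rho * np.outer(u, s)
    H = V.T @ H @ V + rho * np.outer(s, s)
g = rng.standard_normal(n)
assert np.allclose(two_loop_direction(g, pairs), H @ g)
```

The recursion costs O(Mn) per direction computation, against the O(n²) storage and matrix-vector cost of keeping the full matrix.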

For the theoretical analysis, we also need to introduce the formulas for the trace and the determinant of the matrix \(B^{k + 1}\):

$$\begin{aligned} {{\,\textrm{Tr}\,}}(B^{k + 1}) = {{\,\textrm{Tr}\,}}(B^k) - \frac{\left\| B^ks_k\right\| ^2}{s_k^TB^ks_k} + \frac{\left\| u_k\right\| ^2}{s_k^Tu_k}, \end{aligned}$$
(26)
$$\begin{aligned} \det (B^{k + 1}) = \det (B^k)\frac{s_k^Tu_k}{s_k^TB^ks_k}. \end{aligned}$$
(27)

Note that these expressions hold when (12) is used to update the matrix \(B^k\), which is always the case here by Assumption 2. We also introduce some basic notation that will be useful in the following analysis.
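Both identities can be checked numerically against the rank-two BFGS update (12), here assumed to take the standard form \(B^{k+1} = B^k - \frac{B^ks_ks_k^TB^k}{s_k^TB^ks_k} + \frac{u_ku_k^T}{s_k^Tu_k}\); the data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)               # random symmetric positive definite B^k
s = rng.standard_normal(n)
u = B @ s + 0.1 * rng.standard_normal(n)  # curvature pair with s^T u > 0

Bs = B @ s
B_next = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(u, u) / (s @ u)

# identity (26): trace update
tr_pred = np.trace(B) - (Bs @ Bs) / (s @ Bs) + (u @ u) / (s @ u)
# identity (27): determinant update
det_pred = np.linalg.det(B) * (s @ u) / (s @ Bs)

assert np.isclose(np.trace(B_next), tr_pred)
assert np.isclose(np.linalg.det(B_next), det_pred)
```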

Notation: We will denote by \(\Omega \left( B^k\right) \) the set of eigenvalues of the matrix \(B^k\); by \(\omega _m\left( B^k\right) \) and \(\omega _M\left( B^k\right) \) we indicate its minimum and maximum eigenvalue, respectively; we denote by \(\beta ^k\) the angle between the vectors \(s_k\) and \(B^ks_k\). Concerning \(\beta ^k\), we also recall the formula for its cosine:

$$\begin{aligned} \cos \beta ^k = \frac{s_k^TB^ks_k}{\left\| s_k\right\| \left\| B^ks_k\right\| }. \end{aligned}$$
(28)

We are now able to begin the convergence analysis with three technical lemmas.

Lemma 6

Let Assumption 2 hold and consider the sequences \(\left\{ x_k\right\} \) and \(\left\{ d_{LM}\left( x_k\right) \right\} \) generated by Algorithm 4. Then,

$$\begin{aligned} \sum _{k \ge 0}\frac{{\mathcal {D}}\left( x_k, d_{LM}(x_k)\right) ^2}{\left\| d_{LM}(x_k)\right\| ^2} < \infty . \end{aligned}$$

Proof

The result follows as in Proposition 3.3 in [24], as the assumptions made in the latter are trivially implied by Assumption 2. \(\square \)

Lemma 7

Assume that Assumption 2 holds. Let \(\left\{ x_k\right\} \) be the sequence generated by Algorithm 4. Then, for all \(k \ge 0\), we have that

$$\begin{aligned} {\mathcal {D}}\left( x_k, d_{LM}(x_k)\right) \le -\frac{\cos \beta ^k}{2}\left\| d_{LM}(x_k)\right\| \left\| d_{SD}(x_k)\right\| . \end{aligned}$$

Proof

The proof is analogous to the one of Lemma 4.2 in [24], taking into account that we have a single approximation matrix \(B^k\). \(\square \)

Lemma 8

Let Assumptions 2 and 3 hold. Moreover, let \(\left\{ x_k\right\} \) be the sequence generated by Algorithm 4. Then, there exists a constant \(\delta > 0\) such that, for all \(k \ge 0\), we have that

$$\begin{aligned} \cos \beta ^k \ge \delta . \end{aligned}$$

Proof

Let us consider \(k \ge 0\), \(\tau \in \left[ 0, 1\right] \) and the point \(x_k + \tau s_k\). By Assumption 2 and equations (15) and (17), we have that \(x_k + \tau s_k \in {\mathcal {L}}_F\left( F\left( x_0\right) \right) \). Also recalling that \(\lambda ^{LM}\left( x_k\right) \) satisfies (4), we obtain for any \(z\in {\mathbb {R}}^n\) that

$$\begin{aligned} \int _{0}^{1} a\left\| z\right\| ^2\, d\tau \le \int _{0}^{1} z^T\sum _{j=1}^{m}\lambda _j^{LM}\left( x_k\right) \nabla ^2f_j(x_k + \tau s_k)z \, d\tau \le \int _{0}^{1} b\left\| z\right\| ^2\, d\tau \end{aligned}$$

and, then,

$$\begin{aligned} a\left\| z\right\| ^2 \le z^T\int _{0}^{1}\sum _{j=1}^{m}\lambda _j^{LM}\left( x_k\right) \nabla ^2f_j(x_k + \tau s_k)z \, d\tau \le b\left\| z\right\| ^2. \end{aligned}$$
(29)

For \(z=s_k\) we thus obtain

$$\begin{aligned} a\left\| s_k\right\| ^2 \le s_k^T\int _{0}^{1}\sum _{j=1}^{m}\lambda _j^{LM}\left( x_k\right) \nabla ^2f_j(x_k + \tau s_k)s_k \, d\tau \le b\left\| s_k\right\| ^2. \end{aligned}$$
(30)

Defining

$$\begin{aligned} I_k = \int _{0}^{1}\sum _{j=1}^{m}\lambda _j^{LM}\left( x_k\right) \nabla ^2f_j(x_k + \tau s_k) \, d\tau \end{aligned}$$
(31)

and recalling (18), we solve the integral:

$$\begin{aligned} \begin{aligned} I_ks_k&= \sum _{j=1}^{m}\lambda _j^{LM}\left( x_k\right) \int _{0}^{1}\nabla ^2f_j(x_k + \tau s_k)s_k \, d\tau \\ {}&= \sum _{j=1}^{m}\lambda _j^{LM}\left( x_k\right) \left[ \nabla f_j\left( x_{k + 1}\right) - \nabla f_j\left( x_k\right) \right] = u_k. \end{aligned} \end{aligned}$$
(32)

Given this last result and equation (30), we obtain that

$$\begin{aligned} a\left\| s_k\right\| ^2 \le s_k^Tu_k \le b\left\| s_k\right\| ^2 \end{aligned}$$

and, thus, considering the left-hand side,

$$\begin{aligned} \frac{s_k^Tu_k}{\left\| s_k\right\| ^2} \ge a. \end{aligned}$$
(33)

Furthermore, if we consider \(z=I_k^{1/2}s_k\) in (29), with \(I_k^{1/2}\) being the positive definite square root of \(I_k\), we get

$$\begin{aligned} a\left\| I_k^{1/2}s_k\right\| ^2 \le \left( I_k^{1/2}s_k\right) ^T\int _{0}^{1}\sum _{j=1}^{m}\lambda _j^{LM}\left( x_k\right) \nabla ^2f_j(x_k + \tau s_k)\, d\tau \left( I_k^{1/2}s_k\right) \le b\left\| I_k^{1/2}s_k\right\| ^2 \end{aligned}$$

and, recalling (31),

$$\begin{aligned} a\left( s_k^TI_ks_k\right) \le s_k^TI_k^2s_k \le b\left( s_k^TI_ks_k\right) . \end{aligned}$$

Then, given Remark 6 and equation (32), focusing on the right-hand side, we have

$$\begin{aligned} \frac{\left\| u_k\right\| ^2}{s_k^Tu_k} \le b. \end{aligned}$$

Now, recalling Assumption 3 and equation (29), we apply recursively (26) and we obtain that

$$\begin{aligned} \begin{aligned} {{\,\textrm{Tr}\,}}(B^{k + 1})&= {{\,\textrm{Tr}\,}}(B^k_{\left( 0\right) }) - \sum _{l=0}^{\min \left\{ k, M-1\right\} } \frac{\left\| B^k_{\left( l\right) }s_{l + h}\right\| ^2}{s_{l + h}^TB^k_{\left( l\right) }s_{l + h}} + \sum _{l=0}^{\min \left\{ k, M-1\right\} } \frac{\left\| u_{l + h}\right\| ^2}{s_{l + h}^Tu_{l + h}}\\&\le {{\,\textrm{Tr}\,}}(B^0) + \sum _{l=0}^{\min \left\{ k, M-1\right\} } \frac{\left\| u_{l + h}\right\| ^2}{s_{l + h}^Tu_{l + h}} \\&\le {{\,\textrm{Tr}\,}}(B^0) + \left( \min \left\{ k, M - 1\right\} + 1\right) b \le {\tilde{b}}, \end{aligned} \end{aligned}$$
(34)

for some \({\tilde{b}} > 0\), where the inequalities come from the fact that, for all \(k \ge 0\) and \(l=0,\ldots ,\min \{k,M-1\}\), \(B^k_{(l)}\) is positive definite (cf. the instructions of Algorithm 4 and Remark 6). We can apply a similar reasoning with the determinant formula (27):

$$\begin{aligned} \begin{aligned} \det (B^{k + 1})&= \det (B^k_{\left( 0\right) })\prod _{l=0}^{\min \left\{ k, M-1\right\} }\frac{s_{l + h}^Tu_{l + h}}{s_{l + h}^TB^k_{\left( l\right) }s_{l + h}} \\&= \det (B^0)\prod _{l=0}^{\min \left\{ k, M-1\right\} }\frac{s_{l + h}^Tu_{l + h}}{\left\| s_{l + h}\right\| ^2}\frac{\left\| s_{l + h}\right\| ^2}{s_{l + h}^TB^k_{\left( l\right) }s_{l + h}}. \end{aligned} \end{aligned}$$

From (34), we deduce that the greatest eigenvalue of \(B^k_{\left( l\right) }\) is smaller than \({\tilde{b}}\). Thus, given Assumption 3 and equation (33), we get that

$$\begin{aligned} \det \left( B^{k + 1}\right) \ge \det (B^0)\left( \frac{a}{{\tilde{b}}}\right) ^{\min \left\{ k, M - 1\right\} + 1} \ge {\tilde{a}}, \end{aligned}$$
(35)

where \({\tilde{a}} > 0\).

Then, by (28), the min-max theorem and the consistency of the Euclidean norm (i.e., \(\left\| B^ks_k\right\| \le \left\| B^k\right\| \left\| s_k\right\| \)), we have:

$$\begin{aligned} \cos \beta ^k = \frac{s_k^TB^ks_k}{\left\| s_k\right\| \left\| B^ks_k\right\| } \ge \frac{\omega _m\left( B^k\right) \left\| s_k\right\| ^2}{\left\| B^k\right\| \left\| s_k\right\| ^2} = \frac{\omega _{m}\left( B^k\right) }{\left\| B^k\right\| }. \end{aligned}$$

We know that:

  • by definition of trace and determinant, recalling (34) and (35), we get

    $$\begin{aligned} \det \left( B^k\right) = \prod _{\omega \in \Omega (B^k)}^{}\omega \le \omega _M\left( B^k\right) ^{n-1}\omega _m\left( B^k\right) , \end{aligned}$$

    and thus

    $$\begin{aligned} \omega _m\left( B^k\right) \ge \frac{\det \left( B^k\right) }{\omega _M\left( B^k\right) ^{n - 1}} \ge \frac{{\tilde{a}}}{{{\,\textrm{Tr}\,}}\left( B^k\right) ^{n - 1}} \ge \frac{{\tilde{a}}}{{\tilde{b}}^{n - 1}}; \end{aligned}$$
  • considering the Euclidean norm and that \(B^k\) is a real positive definite matrix,

    $$\begin{aligned} \left\| B^k\right\| \le \omega _M\left( B^k\right) \le {{\,\textrm{Tr}\,}}\left( B^k\right) \le {\tilde{b}}. \end{aligned}$$

Joining the last three results, we obtain that

$$\begin{aligned} \cos \beta ^k \ge \frac{\omega _m\left( B^k\right) }{\left\| B^k\right\| } \ge \frac{{\tilde{a}}}{{\tilde{b}}^{n - 1}\,{\tilde{b}}} = \frac{{\tilde{a}}}{{\tilde{b}}^{n}} > 0, \end{aligned}$$

where the last inequality comes from the definitions of \({\tilde{a}}\) and \({\tilde{b}}\). Thus, we get the thesis choosing

$$\begin{aligned} \delta = \frac{{\tilde{a}}}{{\tilde{b}}^{n}}. \end{aligned}$$

\(\square \)

In the next proposition, we state that the sequence of points produced by Algorithm 4 converges to a Pareto optimal point.

Proposition 4

Let Assumptions 2 and 3 hold. Assume that \(\left\{ x_k\right\} \) is the sequence generated by Algorithm 4. Then, \(\left\{ x_k\right\} \) converges to a Pareto optimal point \(x^\star \) for problem (1).

Proof

By Lemmas 7 and 8, we know that there exists a constant \(\delta > 0\) such that, for all \(k \ge 0\),

$$\begin{aligned} {\mathcal {D}}\left( x_k, d_{LM}(x_k)\right) \le -\frac{\cos \beta ^k}{2}\left\| d_{LM}(x_k)\right\| \left\| d_{SD}(x_k)\right\| \le -\frac{\delta }{2}\left\| d_{LM}(x_k)\right\| \left\| d_{SD}(x_k)\right\| . \end{aligned}$$

Considering this last result and Lemma 6, we obtain that

$$\begin{aligned} \infty > \sum _{k \ge 0}\frac{{\mathcal {D}}\left( x_k, d_{LM}(x_k)\right) ^2}{\left\| d_{LM}(x_k)\right\| ^2} \ge \sum _{k \ge 0}\frac{\delta ^2}{4}\left\| d_{SD}(x_k)\right\| ^2, \end{aligned}$$

and, thus,

$$\begin{aligned} \lim _{k \rightarrow \infty }d_{SD}(x_k) = {\textbf{0}}. \end{aligned}$$
(36)

By (15), we know that, for all \(k \ge 0\), \(x_k \in {\mathcal {L}}_F\left( F\left( x_0\right) \right) \). Since \({\mathcal {L}}_F\left( F\left( x_0\right) \right) \) is compact (Remark 5), there exists a subsequence \(K \subseteq \left\{ 0, 1, \ldots \right\} \) such that

$$\begin{aligned} \lim _{\begin{array}{c} k \rightarrow \infty \\ k \in K \end{array}}x_k = x^\star . \end{aligned}$$
(37)

Recalling Lemma 3 and equation (36), we have that \(d_{SD}\left( x^\star \right) = {\textbf{0}}\) and, thus, \(x^\star \) is Pareto-stationary for problem (1). Therefore, by Lemma 1 and Assumption 2, we conclude that \(x^\star \) is Pareto optimal.

Now, let us assume, by contradiction, that there exists another subsequence \({\tilde{K}} \subseteq \left\{ 0, 1,\ldots \right\} \) such that

$$\begin{aligned} \lim _{\begin{array}{c} k \rightarrow \infty \\ k \in {\tilde{K}} \end{array}}x_k = {\tilde{x}}, \end{aligned}$$
(38)

with \({\tilde{x}} \ne x^\star \).

We prove that \(F\left( {\tilde{x}}\right) \ne F\left( x^\star \right) \). If this were false, since, by Assumption 2, F is strongly convex and \({\mathcal {L}}_F\left( F\left( x_0\right) \right) \) is convex, for all \(t \in \left( 0, 1\right) \) we would get that

$$\begin{aligned} F\left( t{\tilde{x}} + \left( 1 - t\right) x^\star \right) < tF\left( {\tilde{x}}\right) + \left( 1 - t\right) F\left( x^\star \right) = F\left( x^\star \right) . \end{aligned}$$

But, in this case, we would contradict the fact that \(x^\star \) is Pareto optimal.

Then, given that \(x^\star \) is Pareto optimal and that \(F\left( {\tilde{x}}\right) \ne F\left( x^\star \right) \),

$$\begin{aligned} \exists {\tilde{j}} \in \left\{ 1,\ldots , m\right\} \text { such that } f_{{\tilde{j}}}\left( x^\star \right) < f_{{\tilde{j}}}\left( {\tilde{x}}\right) . \end{aligned}$$

Now, recalling (37) and (38), there exist \(k \in K\) and \({\tilde{k}} \in {\tilde{K}}\) such that \(k < {\tilde{k}}\) and

$$\begin{aligned} f_{{\tilde{j}}}\left( x_{k}\right) < f_{{\tilde{j}}}\left( x_{{\tilde{k}}}\right) . \end{aligned}$$

But, since (15) holds at each iteration of Algorithm 4, the sequence \(\left\{ f_j\left( x_k\right) \right\} \) is decreasing for all \(j \in \left\{ 1,\ldots , m\right\} \). Thus, we get a contradiction and we conclude that

$$\begin{aligned} \lim _{k \rightarrow \infty }x_k = x^\star , \end{aligned}$$

with \(x^\star \) being Pareto optimal. \(\square \)

In the rest of the section, we discuss the convergence rate of Algorithm 4. We first have to provide a technical result.

Lemma 9

Let Assumptions 2 and 3 hold. Moreover, let \(\left\{ x_k\right\} \) be the sequence generated by Algorithm 4 and \(x^\star \) be the Pareto optimal point to which the sequence converges. Then, for all \(k \ge 0\),

  1. (i)

    \(\left\| x_k - x^\star \right\| \le \frac{2}{a}\left\| d_{SD}(x_k)\right\| \);

  2. (ii)

    \(\left\| s_k\right\| \ge \frac{\left( 1 - \sigma \right) }{2b}\cos \beta ^k\left\| d_{SD}(x_k)\right\| \).

Proof

The proof is analogous to the one of Lemma 4.4 in [24], recalling that here a single approximation matrix \(B^k\) is considered. \(\square \)

We are now ready to prove that the sequence of points generated by Algorithm 4 R-linearly converges to Pareto optimality.

Proposition 5

Let Assumptions 2 and 3 hold. Furthermore, let \(\left\{ x_k\right\} \) be the sequence generated by Algorithm 4 and \(x^\star \) be the Pareto optimal limit point of the sequence. Then, \(\left\{ x_k\right\} \) R-linearly converges to \(x^\star \). In addition, we have that

$$\begin{aligned} \sum _{k \ge 0}\left\| x_k - x^\star \right\| < \infty . \end{aligned}$$
(39)

Proof

We first introduce the function \(f^\star : {\mathbb {R}}^n \rightarrow {\mathbb {R}}\), defined as

$$\begin{aligned} f^\star \left( x\right) = \sum _{j=1}^m\lambda _j^{SD}\left( x^\star \right) f_j\left( x\right) , \end{aligned}$$
(40)

where \(\lambda ^{SD}\left( x^\star \right) \) is the vector of multipliers associated with the steepest common descent direction at \(x^\star \). Recalling Lemmas 1 and 3, that \(x^\star \) is Pareto optimal and that (5) holds for \(d_{SD}\left( x^\star \right) \), we have that

$$\begin{aligned} \nabla f^\star \left( x^\star \right) = \sum _{j=1}^m\lambda _j^{SD}\left( x^\star \right) \nabla f_j\left( x^\star \right) = -d_{SD}\left( x^\star \right) = {\textbf{0}}. \end{aligned}$$
(41)

Now, for all \(k \ge 0\) and \(j \in \left\{ 1,\ldots , m\right\} \), by Assumption 2 and using Taylor’s theorem, we get

$$\begin{aligned} \frac{a}{2}\left\| x_k - x^\star \right\| ^2 \le f_j\left( x_k\right) - f_j\left( x^\star \right) - \nabla f_j\left( x^\star \right) ^T\left( x_k - x^\star \right) \le \frac{b}{2}\left\| x_k - x^\star \right\| ^2. \end{aligned}$$

Multiplying this result by \(\lambda _j^{SD}\left( x^\star \right) \), summing over \(j \in \left\{ 1,\ldots , m\right\} \), recalling (4), which is valid for \(\lambda ^{SD}\left( x^\star \right) \), and (41), we obtain that

$$\begin{aligned} \frac{a}{2}\left\| x_k - x^\star \right\| ^2 \le f^\star \left( x_k\right) - f^\star \left( x^\star \right) \le \frac{b}{2}\left\| x_k - x^\star \right\| ^2. \end{aligned}$$
(42)

Given Lemma 9, from the right-hand side of the last result we get

$$\begin{aligned} f^\star \left( x_k\right) - f^\star \left( x^\star \right) \le \frac{2b}{a^2}\left\| d_{SD}\left( x_k\right) \right\| ^2. \end{aligned}$$
(43)

On the other side, (4), (15) and (40) imply that, for all \(k \ge 0\),

$$\begin{aligned} f^\star \left( x_{k + 1}\right) \le f^\star \left( x_k\right) + \gamma \alpha _k{\mathcal {D}}\left( x_k, d_{LM}\left( x_k\right) \right) \end{aligned}$$

which, by subtracting the term \(f^\star \left( x^\star \right) \) from both sides and taking into account Lemmas 7 and 9, becomes

$$\begin{aligned} \begin{aligned} f^\star \left( x_{k + 1}\right) - f^\star \left( x^\star \right)&\le f^\star \left( x_k\right) - f^\star \left( x^\star \right) - \frac{\gamma \cos \beta ^k}{2}\left\| s_k\right\| \left\| d_{SD}\left( x_k\right) \right\| \\&\le f^\star \left( x_k\right) - f^\star \left( x^\star \right) - \frac{\gamma \left( 1 - \sigma \right) \cos ^2\beta ^k}{4b}\left\| d_{SD}\left( x_k\right) \right\| ^2. \end{aligned} \end{aligned}$$

Joining this last result and (43), we obtain that

$$\begin{aligned} f^\star \left( x_{k + 1}\right) - f^\star \left( x^\star \right) \le \left( 1 - \frac{\gamma \left( 1 - \sigma \right) a^2\cos ^2\beta ^k}{8b^2}\right) \left( f^\star \left( x_k\right) - f^\star \left( x^\star \right) \right) . \end{aligned}$$
(44)

Now, for all \(k \ge 0\), we define

$$\begin{aligned} r_k = 1 - \frac{\gamma \left( 1 - \sigma \right) a^2\cos ^2\beta ^k}{8b^2}. \end{aligned}$$

It is easy to see that, by the definitions of \(\gamma \) and \(\sigma \), Assumption 2 and Lemma 8, \(r_k \in \left( 0, 1\right) \). In addition, by Lemma 8, we also have that there exists a constant \(\delta > 0\) such that, for all \(k \ge 0\),

$$\begin{aligned} r_k \le 1 - \frac{\gamma \left( 1 - \sigma \right) a^2\delta ^2}{8b^2} = {\bar{r}} < 1. \end{aligned}$$

Then, recursively applying equation (44) and taking into account that, combining (4), (15) and (40), \(f^\star \left( x_0\right) - f^\star \left( x^\star \right) > 0\), we get

$$\begin{aligned} \begin{aligned} f^\star \left( x_{k + 1}\right) - f^\star \left( x^\star \right)&\le \left[ \prod _{l=0}^{k}r_l\right] \left( f^\star \left( x_0\right) - f^\star \left( x^\star \right) \right) \\&\le \left[ \prod _{l=0}^{k}{\bar{r}}\right] \left( f^\star \left( x_0\right) - f^\star \left( x^\star \right) \right) \\&= {\bar{r}}^{k + 1}\left( f^\star \left( x_0\right) - f^\star \left( x^\star \right) \right) . \end{aligned} \end{aligned}$$

Considering this last result and the left-hand side of (42), we obtain that

$$\begin{aligned} \left\| x_{k + 1} - x^\star \right\| \le \left( {\bar{r}}^{k + 1}\right) ^{1/2} \left[ \frac{2}{a}\left( f^\star \left( x_0\right) - f^\star \left( x^\star \right) \right) \right] ^{1/2}, \end{aligned}$$

and, thus, the sequence \(\left\{ x_k\right\} \) R-linearly converges to \(x^\star \).

Summing the last result for all \(k \ge 0\) and recalling that \({\bar{r}} < 1\), we get that (39) holds. \(\square \)

5 Computational experiments

In this section, we report the results of thorough computational experiments, comparing the performance of the proposed approach with that of other state-of-the-art methods from the literature. All the tests were run on a computer with the following characteristics: Ubuntu 20.04 OS, Intel Xeon Processor E5-2430 v2, 6 cores, 2.50 GHz, 16 GB RAM. For all algorithms, the code was implemented in Python 3. Finally, in order to solve the optimization problems for determining the descent direction, e.g., problem (14), the Gurobi Optimizer (Version 9) was employed.

5.1 Experimental settings

In the next subsections, we provide detailed information on the settings used for the experiments.

5.1.1 Algorithms and parameters

We chose to compare the new limited memory Quasi-Newton approach, which we call LM-Q-NWT for the rest of the section, with some state-of-the-art Newton and Quasi-Newton methods for MOO from the literature.

The first competitor is the Multi-Objective Newton method (NWT) proposed in [8], which is an extension of the classical Newton method to multi-objective optimization. In this algorithm, the problem for finding the search direction is similar to problem (2): the difference lies in the use of the real Hessians \(\nabla ^2 f_j({\bar{x}})\) instead of the approximation matrices \(B_j\), for all \(j \in \{1,\ldots , m\}\). Since this method is not designed to handle non-convex unconstrained multi-objective problems, we evaluated its performance only on the convex test instances.

The other two competitors are the Quasi-Newton approach (Q-NWT) proposed in [24] and the Modified Quasi-Newton method (MQ-NWT) presented in [23]. We refer the reader back to Sect. 2.1 for further details. At the first iteration of all the Quasi-Newton approaches, including LM-Q-NWT, the approximation matrices are initialized to the identity matrix.

In order to make the comparisons as fair as possible, we decided to use the same line search strategy for all the approaches. In particular, we employed the proposed Wolfe line search (Algorithm 3). The values of the line search parameters were chosen based on some preliminary experiments on a subset of the tested problems and are as follows: \(\gamma = 10^{-4}\), \(\sigma = 10^{-1}\), \(\eta = 2.5\) and \(\delta = 0.5\). We do not report these preliminary results for the sake of brevity. In order to efficiently use the proposed line search in MQ-NWT, we used equation (19) to compute \(\rho ^k\) at each iteration k.

Finally, the choice of the parameter M of the new limited memory approach is discussed separately in Sect. 5.2. Since it denotes the number of vector pairs maintained in memory during the iterations, it is the most critical among the LM-Q-NWT parameters.

5.1.2 Problems

In Table 1, we list the tested problems. In particular, we compared the algorithms on 78 convex and 83 non-convex problems. All the test instances are characterized by objective functions that are at least continuously differentiable almost everywhere. If a problem has singularities, these were counted as Pareto-stationary points. All the problems have objective functions for which Assumption 1 holds; the latter is essential to guarantee the finite termination of the proposed Wolfe line search.

Some problem names are characterized by the prefix M-. These problems are rescaled versions of the original ones and their formulations are provided in Appendix C. In this appendix, we also introduce a new convex test problem, which we call MAN_2.

Table 1 Problems used in the computational experiments

For each algorithm, we tested each problem with 100 different initial points drawn from a uniform distribution, defined through lower and upper bounds specified for each problem. Since in this work we consider unconstrained multi-objective optimization problems, these bounds were only used to choose the random initial points. For the M-FDS_1, MMR_5, M-MOP_2, MOP and CEC problems, the lower and upper bound values can be found in the referenced papers. For the others, the values are provided in Table 2.

Table 2 Bounds used to choose the initial points

Finally, starting from an initial point, we decided to let the algorithms run until one of the following stopping conditions was met:

  • the current solution is \(\varepsilon \)-Pareto-stationary (Definition 4); in the experiments,

    $$\begin{aligned} \varepsilon = 5\texttt {eps}^{1/2}, \end{aligned}$$

    where \(\texttt {eps}\) denotes the machine precision;

  • a time limit of 2 min is reached.
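Under IEEE double precision (an assumption on the implementation), the stationarity tolerance above is a fixed constant that can be computed as:

```python
import sys

eps = sys.float_info.epsilon   # machine precision: 2**-52 for float64
epsilon = 5.0 * eps ** 0.5     # the stationarity tolerance, about 7.45e-08
```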

5.1.3 Metrics

For each algorithm and problem, the main metrics to be computed are the following.

  • \(N_\varepsilon \): the percentage of runs ended with an \(\varepsilon \)-Pareto-stationary point.

  • T: the computational time to reach the \(\varepsilon \)-Pareto-stationarity from an initial point. If the \(\varepsilon \)-Pareto-stationarity is not reached within the time limit, the value of T related to that point is set to \(\infty \).

  • \(T_M\): the mean of the finite T values.

In Sect. 5.4, we employed the metrics proposed in [17]: purity, \(\Gamma \)spread and \(\Delta \)spread. These metrics are used to evaluate the quality of Pareto front approximations. On the one hand, the purity metric measures the fraction of the points produced by a method that are non-dominated w.r.t. a reference front. The reference front is obtained by combining the fronts retrieved by all the considered algorithms and discarding the dominated points. On the other hand, the spread metrics measure the uniformity of the generated fronts in the objectives space. In particular, the \(\Gamma \)spread is defined as the maximum \(\ell _\infty \) distance in the objectives space between adjacent points of the Pareto front, while the \(\Delta \)spread is similar to the standard deviation of this distance.
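By way of illustration, the purity computation can be sketched as follows (minimization is assumed throughout; the function names and data are ours, not from the test problems):

```python
def non_dominated(points):
    """Return the non-dominated subset of a list of objective vectors."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def purity(front, all_fronts):
    """Fraction of a solver's points surviving in the combined reference front."""
    reference = non_dominated([p for f in all_fronts for p in f])
    return sum(1 for p in front if p in reference) / len(front)

front_a = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0)]
front_b = [(0.5, 0.5)]
# (2.0, 2.0) is dominated by (0.5, 0.5) in the reference front, so
# solver A keeps 2 of its 3 points: purity 2/3
purity_a = purity(front_a, [front_a, front_b])
```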

Finally, we employed the performance profiles introduced in [36], which are a useful tool to appreciate the relative performance and robustness of the considered algorithms. The performance profile of a solver w.r.t. a certain metric is the (cumulative) distribution function of the ratio between the score obtained by the solver and the best score among those obtained by all the considered solvers. In other words, it is the probability that the score achieved by a method on a problem is within a factor \(\tau \in {\mathbb {R}}\) of the best value obtained by any of the algorithms on that problem. We refer the reader to [36] for additional information about this tool. Since \(N_\varepsilon \) and purity have increasing values for better solutions, the performance profiles w.r.t. these metrics were produced based on the inverse of the obtained values. All the performance profiles were plotted with specific axes ranges in order to highlight the differences among the considered solvers.
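For a metric where lower values are better (such as T, with \(\infty \) marking a failed run), the profile value at a given \(\tau \) can be sketched as follows (illustrative code with synthetic scores; the function name is ours):

```python
import math

def performance_profile(scores, tau):
    """scores[p][s]: value of the metric for solver s on problem p
    (lower is better, math.inf encodes a failure). Returns, for each
    solver, the fraction of problems on which its score is within a
    factor tau of the best score obtained on that problem."""
    n_solvers = len(scores[0])
    profile = []
    for s in range(n_solvers):
        hits = sum(1 for row in scores if row[s] <= tau * min(row))
        profile.append(hits / len(scores))
    return profile

scores = [[1.0, 2.0],       # problem 1: solver 0 is best
          [4.0, 2.0],       # problem 2: solver 1 is best
          [3.0, math.inf]]  # problem 3: solver 1 hit the time limit
rho_1 = performance_profile(scores, 1.0)  # fraction of wins per solver
rho_2 = performance_profile(scores, 2.0)  # fraction within a factor 2 of the best
```

Plotting the profile for a range of \(\tau \) values yields curves like those in the figures below; the value at \(\tau = 1\) measures efficiency (fraction of wins), while the limit for large \(\tau \) measures robustness (fraction of problems solved at all).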

5.2 Selection of the parameter M

The parameter M indicates how many vector pairs \(\left\{ \left( s_i, u_i\right) \right\} \) are maintained in memory at each iteration of LM-Q-NWT. A bad value for this parameter might compromise the overall performance of the approach, making it too slow or incapable of reaching \(\varepsilon \)-Pareto-stationary points within the time limit.

In order to select a proper value for M, we analyzed the performance of LM-Q-NWT with \(M \in \left\{ 2, 3, 5, 10, 20\right\} \) on a subset of the tested problems.

  • 2 convex problems: SLC_2 (\(m = 2\)), MAN_2 (\(m = 3\)).

  • 2 non-convex problems: CEC09_1 (\(m = 2\)), CEC09_10 (\(m = 3\)).

Fig. 1

Performance profiles for the LM-Q-NWT algorithm with \(M \in \left\{ 2, 3, 5, 10, 20\right\} \) on the SLC_2, MAN_2, CEC09_1 and CEC09_10 problems (for interpretation of the references to color in text, the reader is referred to the electronic version of the article). Performance metric: a \(N_\varepsilon \); b T

In Fig. 1, we report the performance profiles for the five variants of the new limited memory method. The solvers with \(M \in \left\{ 5, 10\right\} \) turned out to be the best w.r.t. both \(N_\varepsilon \) and T, while the variant with \(M=2\) was outperformed by all the other methods. We conclude that too little information on the previous steps can compromise the performance of LM-Q-NWT. On the other hand, maintaining too many vector pairs and using the two-loop recursive procedure can entail considerable computational costs. A demonstration of this fact is the performance of the proposed approach with \(M = 20\) on the T metric: although this solver performed well w.r.t. \(N_\varepsilon \), it is only the fourth most robust algorithm in terms of computational time.

After analyzing the performance profiles, we decided to use the new limited memory approach with \(M = 5\) for the rest of the section. However, the variant with \(M = 10\) appears to be a good choice too.

5.3 Overall comparisons

In this section, we compare the proposed approach with the Newton and Quasi-Newton algorithms described in Sect. 5.1.1. As already mentioned, we tested NWT only on the convex problems. We thus report the performance profiles for the convex and non-convex problems separately, in Figs. 2 and 3 respectively. In order to better highlight the differences among the methods, for each metric we show three plots concerning different ranges of values for n.

  • Figs. 2a, 2d, 3a, 3d: all the n values.

  • Figs. 2b, 2e, 3b, 3e: \(n \ge 50\).

  • Figs. 2c, 2f, 3c, 3f: \(n < 50\).

Fig. 2

Performance profiles for the LM-Q-NWT, NWT, Q-NWT and MQ-NWT algorithms on the convex problems of Table 1 (for interpretation of the references to color in text, the reader is referred to the electronic version of the article). Performance metric: ac \(N_\varepsilon \); df T. Values for n: a, d All; b, e \(n \ge 50\); c, f \(n < 50\)

Fig. 3

Performance profiles for the LM-Q-NWT, Q-NWT and MQ-NWT algorithms on the non-convex problems of Table 1 (for interpretation of the references to color in text, the reader is referred to the electronic version of the article). Performance metric: ac \(N_\varepsilon \); df T. Values for n: a, d All; b, e \(n \ge 50\); c, f \(n < 50\)

Regarding the performance on the convex problems over all the n values, the proposed approach proved to be the best algorithm, outperforming the competitors w.r.t. both metrics. Moreover, the gap between LM-Q-NWT and the others is sharper on the non-convex problems and the high dimensional ones. For high n values, the NWT and Q-NWT algorithms suffered from having to maintain the Hessians and the approximation matrices, respectively. As a consequence, they turned out to be the least robust w.r.t. both metrics. Using a single approximation matrix allowed the MQ-NWT approach to perform better. However, on extremely high dimensional problems, even managing a single matrix proved to be expensive. In these cases, the performance of the limited memory approach was remarkable.

On the low dimensional problems, the NWT and Q-NWT algorithms performed well. The proposed approach behaved similarly w.r.t. the \(N_\varepsilon \) metric, but it was generally outperformed by these algorithms in terms of T. Managing the real Hessians or the approximation matrices turns out to be tractable when n is small enough. Moreover, by definition, these matrices provide more accurate information about the curvature of the objective functions than the single matrix of LM-Q-NWT and MQ-NWT. However, these two algorithms still proved to be competitive, obtaining good T metric results in most of the problems.

In order to analyze the performance of the algorithms more deeply, in Tables 3, 4, 5 and 6 we report the metrics values obtained on two convex and two non-convex problems. In particular, we show the results for \(n \in \{5, 20, 50, 200, 500, 1000\}\).

Regarding the \(N_\varepsilon \) metric, the proposed method outperformed the competitors regardless of the values of n and m. As in the performance profiles, the differences between LM-Q-NWT and the other approaches are clearer on the high dimensional problems. In some of these, NWT, Q-NWT and MQ-NWT were not able to obtain any \(\varepsilon \)-Pareto-stationary point.

On the problems with two objective functions, almost all the best results in terms of the \(T_M\) metric were obtained by the proposed approach. However, the same performance was not achieved on the problems with \(m=3\) and low values of n. The use of a single matrix seems not to provide sufficiently accurate information about the curvature of the objectives when there are more than two of them; the similar performance of the MQ-NWT algorithm further supports this observation. On the other hand, the use of the real Hessian or of an approximation matrix for each objective function seems to overcome the issue: indeed, NWT and Q-NWT had the best performance in terms of \(T_M\) in these cases. LM-Q-NWT still obtained great results for this metric on the problems with three objective functions and high values of n, outperforming the other competitors. Even with \(m=3\), the employment of a single matrix turned out to be essential on high dimensional problems. Like the proposed approach, MQ-NWT proved to perform better than NWT and Q-NWT with \(m=3\) and high values of n, proving to be the second best algorithm in these cases.

Table 3 Metrics values achieved by the LM-Q-NWT, NWT, Q-NWT and MQ-NWT algorithms on the convex M-MAN_1 problem (\(m = 2\)) for \(n \in \left\{ 5, 20, 50, 200, 500, 1000\right\} \)
Table 4 Metrics values achieved by the LM-Q-NWT, NWT, Q-NWT and MQ-NWT algorithms on the convex M-FDS_1 problem (\(m = 3\)) for \(n \in \left\{ 5, 20, 50, 200, 500, 1000\right\} \)
Table 5 Metrics values achieved by the LM-Q-NWT, Q-NWT and MQ-NWT algorithms on the non-convex M-MOP_2 problem (\(m = 2\)) for \(n \in \left\{ 5, 20, 50, 200, 500, 1000\right\} \)
Table 6 Metrics values achieved by the LM-Q-NWT, Q-NWT and MQ-NWT algorithms on the non-convex CEC09_8 problem (\(m = 3\)) for \(n \in \left\{ 5, 20, 50, 200, 500, 1000\right\} \)

5.4 Results in a global optimization setting

In the previous section, we compared the LM-Q-NWT method with strongly related state-of-the-art approaches, in terms of efficiency and effectiveness at reaching approximate Pareto-stationarity. We now show the positive impact that the proposed procedure can have when used within a global multi-objective optimization framework.

In particular, here we consider the memetic algorithm proposed in [21], named NSMA. Starting from an initial population of N points, this method aims at approximating the Pareto front of the considered problem, combining the genetic operators of the NSGA-II algorithm [14] with the front-based projected gradient method FMOPG. The latter is employed, every \(n_{opt}\) iterations of NSMA, to refine selected solutions up to \(\varepsilon \)-Pareto-stationarity. The selection of the points to be optimized is based on the ranking and crowding distance values assigned to the population at each iteration. Finally, the algorithm exploits a front-based variant of the Armijo-Type Line Search for MOO defined in [7]. For additional information about NSMA, we refer the reader to [21].
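The overall memetic scheme can be sketched as follows on a toy bi-objective problem. This is a deliberately simplified stand-in, not NSMA itself: mutation replaces the full NSGA-II crossover/mutation operators, a nondominated-first ordering with an objective-sum tie-break replaces ranking and crowding distance, and a few common-descent steps applied to the whole population replace FMOPG on selected points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bi-objective problem: the Pareto set is the segment between a and b.
a, b = np.zeros(2), np.array([1.0, 0.0])
f = lambda x: np.array([np.sum((x - a) ** 2), np.sum((x - b) ** 2)])

def common_descent_dir(x):
    # Negative minimum-norm element of conv{grad f1, grad f2} (closed form, m = 2).
    g1, g2 = 2 * (x - a), 2 * (x - b)
    diff = g1 - g2
    denom = float(diff @ diff)
    lam = 0.5 if denom == 0.0 else float(np.clip(-(g2 @ diff) / denom, 0.0, 1.0))
    return -(lam * g1 + (1.0 - lam) * g2)

def nondominated_first(pop):
    # Order the population: nondominated points first, rest by objective sum
    # (a crude stand-in for NSGA-II ranking and crowding distance).
    F = np.array([f(x) for x in pop])
    keep = [i for i in range(len(pop))
            if not any(np.all(F[j] <= F[i]) and np.any(F[j] < F[i])
                       for j in range(len(pop)))]
    rest = sorted((i for i in range(len(pop)) if i not in keep),
                  key=lambda i: F[i].sum())
    return [pop[i] for i in keep + rest]

N, n_opt = 20, 5
pop = [rng.uniform(-2, 2, size=2) for _ in range(N)]
start = np.mean([f(x).sum() for x in pop])
for gen in range(30):
    offspring = [x + rng.normal(scale=0.2, size=2) for x in pop]  # mutation only
    pop = nondominated_first(pop + offspring)[:N]                 # survival selection
    if gen % n_opt == 0:                                          # periodic local search
        pop = [x + 0.1 * common_descent_dir(x) for x in pop]
end = np.mean([f(x).sum() for x in pop])
```

The interplay shown here is the essential one: evolutionary variation explores, survival selection keeps the nondominated points, and the periodic descent steps drive the population toward (Pareto-)stationarity.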

For the experiments of the present paper, we consider two modifications of the NSMA algorithm:

  • NSMA-W, which employs the proposed Wolfe line search (Algorithm 3) in the FMOPG method;

  • NSMA-L, which uses the new limited memory approach (Algorithm 1) as the local optimization procedure.

We compared these two approaches with NSGA-II and the original version of NSMA.

Fig. 4

Performance profiles for the NSMA-L, NSMA-W, NSMA and NSGA-II algorithms on the problems of Table 1 (for interpretation of the references to color in text, the reader is referred to the electronic version of the article). Performance metric: a purity; b \(\Gamma \)–spread; c \(\Delta \)–spread

In the experiments, as in [21], we set \(N = 100\) and \(n_{opt} = 5\). Moreover, the points selected as starting solutions for the local search procedures were always optimized w.r.t. all the objective functions. In the original version of the NSMA method, the points can also be refined w.r.t. a subset \(I \subset \left\{ 1,\ldots , m\right\} \) of the objective functions. However, Assumption 1 may not hold for some subset I; in that case, when trying to optimize a point w.r.t. I, the Wolfe line search could continue its execution for an infinite number of steps.
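For reference, a step size \(\alpha \) accepted by a Wolfe line search for MOO must typically satisfy a sufficient-decrease condition on every objective together with a curvature condition on the worst directional derivative. The checker below follows the standard extension of the Wolfe conditions to MOO; the acceptance test of Algorithm 3, and the constants \(c_1, c_2\), may differ in our implementation.

```python
def moo_wolfe_ok(fs0, fs1, grads0_d, grads1_d, alpha, c1=1e-4, c2=0.9):
    """Check the standard Wolfe conditions for MOO at step size alpha.

    fs0, fs1:           objective values f_j at x and at x + alpha*d
    grads0_d, grads1_d: directional derivatives grad f_j(.) @ d at the two points
    """
    D0 = max(grads0_d)  # worst directional derivative; < 0 for a descent direction
    # Sufficient decrease (Armijo) required on EVERY objective.
    armijo = all(f1 <= f0 + c1 * alpha * D0 for f0, f1 in zip(fs0, fs1))
    # Curvature condition on the worst directional derivative at the new point.
    curvature = max(grads1_d) >= c2 * D0
    return armijo and curvature
```

For instance, with \(f_1(x)=x^2\), \(f_2(x)=(x-1)^2\), \(x=2\) and \(d=-1\), the full step \(\alpha =1\) satisfies both conditions, while a tiny step \(\alpha =0.01\) fails the curvature condition, which is precisely what rules out arbitrarily short steps.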

In Fig. 4, we report the performance profiles for the NSMA-L, NSMA-W, NSMA and NSGA-II algorithms on the problems listed in Table 1. The considered performance metrics are purity, \(\Gamma \)–spread and \(\Delta \)–spread. In order to make the comparisons as independent of random effects as possible, for each algorithm and problem we executed five runs with different seeds for the pseudo-random number generator. The five resulting Pareto front approximations were then compared based on the purity metric: the best among them was chosen as the output of the algorithm for the problem at hand.
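Under one common definition, the purity of a solver is the fraction of its front approximation that survives in the reference front, i.e., in the nondominated subset of the union of all solvers' fronts. The sketch below assumes this definition (minimization of all objectives); the paper's exact formula may differ in normalization details.

```python
import numpy as np

def nondominated_mask(F):
    """Boolean mask over the rows of F (k x m objective vectors):
    True where no other row dominates (<= componentwise, < somewhere)."""
    mask = np.ones(F.shape[0], dtype=bool)
    for i in range(F.shape[0]):
        dominated = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominated.any():
            mask[i] = False
    return mask

def purity(fronts):
    """Purity of each solver's front w.r.t. the combined reference front."""
    union = np.vstack(fronts)
    ref_set = {tuple(p) for p in union[nondominated_mask(union)]}
    return [sum(tuple(p) in ref_set for p in f) / len(f) for f in fronts]
```

With two toy fronts, a solver whose points all survive in the reference front scores 1, while one with a dominated point is penalized proportionally.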

In terms of purity, NSMA-L and NSMA-W turned out to be the two most robust algorithms. The proposed Wolfe line search improved the results of the original NSMA, while the use of the limited memory approach led to the best overall performance. Regarding the spread metrics, NSMA-L and NSMA-W performed similarly. The NSMA results on \(\Gamma \)–spread are comparable with those of the two variants, but NSMA was slightly outperformed in terms of \(\Delta \)–spread. NSGA-II did not perform well w.r.t. any of the metrics: the NSMA variants proved capable of finding more accurate and uniform Pareto front approximations.

6 Conclusions

In this paper we proposed a new limited memory Quasi-Newton algorithm for unconstrained multi-objective optimization. To the best of our knowledge, it is the first attempt to define such an approach for MOO. As in [23], we use a single approximation matrix, in contrast to what is done in the other Quasi-Newton approaches. The idea of a single matrix, whose update formula is slightly modified w.r.t. the one used in the scalar case, allowed us to extend the L-BFGS two-loop recursive procedure to multi-objective optimization: the Hessian matrix approximation does not need to be stored and managed in memory, since its action is computed using a finite number M of previously generated solutions. This feature proves to be crucial, especially when the approximation matrix is dense and/or high dimensional problems are handled. For the proposed approach, under assumptions similar to the ones made for L-BFGS in the strongly convex scalar case, we established R-linear convergence of the produced sequence of points to Pareto optimality.
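For the reader unfamiliar with the mechanism being extended, the classical scalar L-BFGS two-loop recursion is sketched below: it applies the implicit inverse-Hessian approximation to a vector using only the M most recent curvature pairs \((s_k, y_k)\), never forming a matrix. Our multi-objective variant modifies the update formula and the vector it is applied to, so this sketch illustrates the memory mechanism rather than the proposed algorithm itself.

```python
import numpy as np

def two_loop_recursion(q, S, Y):
    """Compute H*q from the stored curvature pairs (S[k], Y[k]), oldest first,
    without ever forming the inverse-Hessian approximation H."""
    q = q.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(S), reversed(Y)):  # backward pass: newest pair first
        rho = 1.0 / float(y @ s)                # requires the curvature condition y@s > 0
        alpha = rho * float(s @ q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    # Standard initial scaling H0 = gamma * I from the most recent pair.
    gamma = float(S[-1] @ Y[-1]) / float(Y[-1] @ Y[-1])
    r = gamma * q
    for (s, y), alpha, rho in zip(zip(S, Y),    # forward pass: oldest pair first
                                  reversed(alphas), reversed(rhos)):
        beta = rho * float(y @ r)
        r += (alpha - beta) * s
    return r
```

A quick sanity check is the secant equation: with a single stored pair, applying the recursion to y must return s exactly, since any BFGS-type inverse approximation satisfies \(H y = s\).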

The results of thorough computational experiments show that the new limited memory algorithm consistently outperforms state-of-the-art Newton and Quasi-Newton methods for MOO. Moreover, we showed the substantial benefits of using the proposed algorithm as a local search procedure within a global optimization framework.