Fast Multiobjective Gradient Methods with Nesterov Acceleration via Inertial Gradient-like Systems

We derive efficient algorithms to compute weakly Pareto optimal solutions for smooth, convex and unconstrained multiobjective optimization problems in general Hilbert spaces. To this end, we define a novel inertial gradient-like dynamical system in the multiobjective setting, whose trajectories converge weakly to Pareto optimal solutions. Discretization of this system yields an inertial multiobjective algorithm which generates sequences that converge weakly to Pareto optimal solutions. We employ Nesterov acceleration to define an algorithm with an improved convergence rate compared to the plain multiobjective steepest descent method (Algorithm 1). A further improvement in terms of efficiency is achieved by avoiding the solution of a quadratic subproblem to compute a common step direction for all objective functions, which is usually required in first order methods. Using a different discretization of our inertial gradient-like dynamical system, we obtain an accelerated multiobjective gradient method that does not require the solution of a subproblem in each step (Algorithm 2). While this algorithm does not converge in general, it yields good results on test problems while being faster than standard steepest descent by two to three orders of magnitude.


Introduction
In many applications in industry, economics, medicine or transport, optimizing several criteria is of interest. In the latter, one wants to reach a destination as fast as possible with minimal power consumption. Drug development aims at maximizing efficacy while minimizing side effects. Even these elementary examples share an inherent feature: the different criteria one seeks to optimize are in general contradictory. There is no design choice that is best for all criteria simultaneously. This insight shifts the focus from finding a single optimal solution to finding a set of optimal compromises, the Pareto set. Given the Pareto set, a decision maker can select an optimal compromise according to their preferences. In this paper, we derive efficient gradient-based algorithms to compute elements of the Pareto set. Formally, a problem involving multiple criteria can be stated as an unconstrained multiobjective optimization problem

min_{x ∈ H} f(x) = (f_1(x), . . ., f_m(x)), (MOP)
where f_i : H → R for i = 1, . . ., m are the objective functions describing the different criteria. Popular approaches to tackle this problem in the differentiable case are first order methods, which exploit the smooth structure of the problem while being computationally less demanding than higher order methods involving exact or approximated Hessians.
While accelerated first order methods are very popular in single objective optimization, they are not yet sufficiently studied from a theoretical point of view in the multiobjective setting. A fruitful approach to analyzing accelerated gradient methods is to interpret them as discretizations of suitable gradient-like dynamical systems [1]. The analysis of the continuous dynamics is often easier and can later be transferred to the discrete setting. This perspective has so far not been fully taken advantage of in the area of multiobjective optimization. In this paper, we utilize this approach to derive accelerated gradient methods for multiobjective optimization. To this end, we define and analyze the following novel gradient-like dynamical system

ẍ(t) + αẋ(t) + proj_{C(x(t))}(−ẍ(t)) = 0, (IMOG')

with α > 0 and C(x) := co({∇f_i(x) : i = 1, . . ., m}), where co(•) denotes the convex hull and proj_{C(x(t))}(−ẍ(t)) is the projection of −ẍ(t) onto the convex set C(x(t)). The system (IMOG') is an inertial multiobjective gradient-like system. We choose the designation (IMOG') to emphasize its relation to the system

mẍ(t) + γẋ(t) + proj_{C(x(t))}(0) = 0, (IMOG)

with m, γ > 0, which was discussed in [2]. In the single objective setting (m = 1), both (IMOG') and (IMOG) reduce to the heavy ball with friction system

mẍ(t) + γẋ(t) + ∇f(x(t)) = 0, (HBF)

which is well studied for different types of objective functions f, see e.g. [3-5].
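Both (IMOG') and (IMOG) are driven by projections onto the convex hull C(x), so the basic computational kernel is the min-norm element proj_{C(x)}(0). For m = 2 objectives this simplex subproblem has a closed form; the following pure-Python sketch is illustrative and not from the paper (the function name and the list-based vector representation are assumptions).

```python
# Minimal sketch: the min-norm element of C(x) = co({g1, g2}) for m = 2,
# i.e. proj_{co({g1, g2})}(0). The subproblem
#   min_{theta in [0, 1]} ||theta*g1 + (1 - theta)*g2||^2
# is one-dimensional and solvable in closed form.

def min_norm_element(g1, g2):
    """Return proj_{co({g1, g2})}(0) for vectors given as lists of floats."""
    d = [a - b for a, b in zip(g1, g2)]              # g1 - g2
    dd = sum(c * c for c in d)
    if dd == 0.0:                                     # gradients coincide
        return list(g1)
    # Unconstrained minimizer of ||g2 + theta*(g1 - g2)||^2, clipped to [0, 1].
    theta = max(0.0, min(1.0, -sum(a * b for a, b in zip(g2, d)) / dd))
    return [theta * a + (1.0 - theta) * b for a, b in zip(g1, g2)]

# Opposed gradients: zero lies in the convex hull, i.e. a Pareto critical point.
print(min_norm_element([1.0, 0.0], [-1.0, 0.0]))  # -> [0.0, 0.0]
```

For m > 2 the analogous subproblem over the unit simplex Δ_m is a small quadratic program, typically solved with a QP solver in each iteration.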
We discretize (IMOG') to obtain an iterative scheme with appropriately chosen coefficients a, s > 0 and θ_k ∈ R^m. This scheme can be interpreted as an inertial gradient method for (MOP). We show that it shares many properties with its continuous counterpart and that the iterates defined by this algorithm converge weakly to Pareto critical points. To the best of our knowledge, this is the first multiobjective method involving a constant momentum term with guaranteed convergence to Pareto critical solutions.
In a further step, we introduce time-dependent friction and informally define a multiobjective gradient-like system (MAVD) with asymptotically vanishing damping. In the single objective setting, this system simplifies to the inertial system with asymptotically vanishing damping

ẍ(t) + (α/t)ẋ(t) + ∇f(x(t)) = 0. (AVD)

It is well known that (AVD) is naturally linked to Nesterov's accelerated gradient method [1,6-8]. Discretizing the dynamical system (MAVD) and using our knowledge about (IMOG'), we derive an accelerated gradient method for multiobjective optimization with appropriately chosen coefficients s > 0 and θ_k ∈ R^m. In a recent preprint, Tanabe, Fukuda and Yamashita derive an accelerated proximal gradient method for multiobjective optimization using the concept of merit functions [9]. We show that the method we derive from the differential equation (MAVD) achieves the same convergence rate of order O(k^{-2}) for the function values, measured with a so-called merit function. The remainder of the paper is organized according to the novelties described above. After introducing basic definitions and notation in Section 2, we prove that solutions to the system (IMOG') exist in finite-dimensional Hilbert spaces in Section 3 and that they converge to Pareto critical points in Section 4. Based on that, we derive a discrete optimization algorithm from an explicit discretization of (IMOG') and show that the iterates defined by this method converge weakly to Pareto critical points in Section 5. We then introduce Nesterov acceleration and prove an improved convergence result in Section 6; a further improvement in numerical efficiency is discussed in Section 7. These two central algorithms are summarized in Algorithms 1 and 2 in the respective sections. We compare them on convex and non-convex example problems in Section 8, and conclude our findings and list future research directions in Section 9.

Notations
Throughout this paper, H is a real Hilbert space with inner product ⟨•, •⟩ and induced norm ‖•‖. The set Δ_m := {α ∈ R^m : α ≥ 0 and ∑_{i=1}^m α_i = 1} is the unit simplex. We denote the open ball with radius δ > 0 and center x by B_δ(x) := {y ∈ H : ‖y − x‖ < δ}. The closed ball with radius δ > 0 and center x is denoted by B̄_δ(x). For a set of vectors {ξ_1, . . ., ξ_k} ⊆ H, we denote their convex hull by co({ξ_1, . . ., ξ_k}). For two vectors x, y ∈ R^k, we define the partial order x ≤ y :⇔ x_i ≤ y_i for all i = 1, . . ., k. We define ≥, <, > on R^k analogously. When we treat dynamical systems, t ∈ R and x ∈ H are the time and state variable, respectively. We denote trajectories in H by t → x(t), with first derivative ẋ(t) and second derivative ẍ(t).

Multiobjective Optimization
Consider the unconstrained multiobjective optimization problem

min_{x ∈ H} f(x) = (f_1(x), . . ., f_m(x)) (MOP)
with at least once differentiable objective functions f_i : H → R for i = 1, . . ., m. The definitions in this subsection are aligned with [10].
Definition 2.1. i) A point x* ∈ H is Pareto optimal if there does not exist another vector x ∈ H such that f_i(x) ≤ f_i(x*) for all i = 1, . . ., m, and f_j(x) < f_j(x*) for at least one index j. The set of all Pareto optimal points is the Pareto set, which we denote by P.
ii) A point x* ∈ H is locally Pareto optimal if there exists δ > 0 such that x* is Pareto optimal in B_δ(x*).
iii) A point x* ∈ H is weakly Pareto optimal if there does not exist another vector x ∈ H such that f_i(x) < f_i(x*) for all i = 1, . . ., m.
iv) A point x* ∈ H is locally weakly Pareto optimal if there exists δ > 0 such that x* is weakly Pareto optimal in B_δ(x*).
In this paper, we treat convex MOPs. Therefore, the objective functions f_i are convex for all i = 1, . . ., m. In this setting, every locally (weakly) Pareto optimal point is also (weakly) Pareto optimal. For unconstrained MOPs, the so-called Karush-Kuhn-Tucker (KKT) conditions can be written as follows.
Definition 2.2. A point x* ∈ H satisfies the Karush-Kuhn-Tucker conditions if there exists α ∈ Δ_m such that ∑_{i=1}^m α_i ∇f_i(x*) = 0. If x* satisfies the Karush-Kuhn-Tucker conditions, we call it Pareto critical. The condition 0 ∈ co({∇f_i(x*) : i = 1, . . ., m}) is equivalent to the Karush-Kuhn-Tucker conditions. Analogously to the single objective setting, criticality of a point is a necessary condition for optimality. In the convex setting, the KKT conditions are also sufficient for weak Pareto optimality. If we denote the Pareto set by P, the weak Pareto set by P_w and the Pareto critical set by P_c, then in the setting of smooth and convex multiobjective optimization we observe the relation P ⊆ P_w = P_c.
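As a small numerical illustration of Definition 2.2 (not from the paper; the quadratic objectives and the helper name are hypothetical): for f_1(x) = ‖x − a‖² and f_2(x) = ‖x − b‖², every point x = (1 − t)a + tb on the segment between the two minimizers satisfies the KKT conditions with multiplier α = 1 − t ∈ Δ_2.

```python
# Hypothetical illustration: for f1(x) = ||x - a||^2, f2(x) = ||x - b||^2,
# the point x = (1 - t)*a + t*b satisfies
#   alpha*grad_f1(x) + (1 - alpha)*grad_f2(x) = 0   with alpha = 1 - t,
# so x is Pareto critical (and, by convexity, weakly Pareto optimal).

def kkt_residual(x, a, b, alpha):
    """|| alpha*grad_f1(x) + (1 - alpha)*grad_f2(x) || for the two quadratics."""
    g1 = [2.0 * (xi - ai) for xi, ai in zip(x, a)]
    g2 = [2.0 * (xi - bi) for xi, bi in zip(x, b)]
    comb = [alpha * u + (1.0 - alpha) * v for u, v in zip(g1, g2)]
    return sum(c * c for c in comb) ** 0.5

a, b, t = [0.0, 0.0], [1.0, 2.0], 0.25
x = [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]
print(kkt_residual(x, a, b, alpha=1.0 - t))  # -> 0.0
```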

Accelerated Methods for Multiobjective Optimization
Accelerated methods for multiobjective optimization have not yet been sufficiently discussed from a theoretical point of view in the literature. In [11], El Moudden and El Moutasim propose an accelerated method for multiobjective optimization which incorporates the multiobjective descent direction by Fliege [12] and the same acceleration scheme as in Nesterov's accelerated method [13]. They prove convergence of the function values with rate O(k^{-2}). Their proof relies on the restrictive assumption that the Lagrange multipliers of the quadratic subproblem, which is used to compute the step direction in every iteration, remain fixed from a certain point on. Under this assumption, the method simplifies to Nesterov's method for single objective optimization applied to a weighted sum of the objective functions with fixed weights. Only recently, Tanabe, Fukuda and Yamashita derived an accelerated proximal gradient method for multiobjective optimization problems in [9]. They developed their method using the concept of merit functions (see Subsection 2.5) and show that the function values converge with rate O(k^{-2}) without additional assumptions on the Lagrange multipliers.

Dynamical Systems linked to Multiobjective Optimization
In [14], Smale presents the idea of treating multiobjective optimization problems from a continuous time perspective, motivated from an economical point of view using utility functions in a multi-agent framework. The simplest dynamical system for multiobjective optimization problems is the multiobjective gradient system

ẋ(t) + proj_{C(x(t))}(0) = 0, (MOG)

where C(x(t)) = co({∇f_i(x(t)) : i = 1, . . ., m}). This system is already treated in [15] and, in addition, by Cornet in [16-18] as a dynamical system for resource allocation from an economical point of view. In [19,20], the system (MOG) is introduced as a tool for multiobjective optimization. The system (MOG) can also be seen as a continuous version of the multiobjective steepest descent method by Fliege [12]. In the single objective setting (m = 1), the system (MOG) simplifies to the steepest descent dynamical system ẋ(t) + ∇f(x(t)) = 0. Generalizations of (MOG) are treated in [20,21]. In [20], Attouch and Goudou discuss a dynamical system for constrained minimization, and in [21], Attouch, Garrigos and Goudou present a differential inclusion for constrained nonsmooth optimization. In [2], Attouch and Garrigos introduce inertia in the system (MOG) and define the inertial multiobjective gradient-like dynamical system (IMOG) stated in the introduction. Trajectories of (IMOG) converge weakly to Pareto optimal solutions provided γ² > mL, where L is a common Lipschitz constant for the gradients of the objective functions.
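To make the continuous picture concrete, here is a minimal sketch (not from the paper) of an explicit discretization of (MOG), i.e. the multiobjective steepest descent iteration x_{k+1} = x_k − s · proj_{C(x_k)}(0) in the spirit of Fliege [12]; the two quadratic objectives, the step size, and the helper names are illustrative assumptions.

```python
# Sketch of multiobjective steepest descent (explicit discretization of (MOG))
# for f1(x) = ||x - a||^2 and f2(x) = ||x - b||^2 in R^2.

def min_norm(g1, g2):
    """proj_{co({g1, g2})}(0): closed-form min-norm element for m = 2."""
    d = [u - v for u, v in zip(g1, g2)]
    dd = sum(c * c for c in d)
    if dd == 0.0:
        return list(g1)
    t = max(0.0, min(1.0, -sum(u * v for u, v in zip(g2, d)) / dd))
    return [t * u + (1.0 - t) * v for u, v in zip(g1, g2)]

def steepest_descent(x, a, b, s=0.1, iters=200):
    for _ in range(iters):
        g1 = [2.0 * (xi - ai) for xi, ai in zip(x, a)]  # grad f1
        g2 = [2.0 * (xi - bi) for xi, bi in zip(x, b)]  # grad f2
        d = min_norm(g1, g2)                            # common descent direction
        x = [xi - s * di for xi, di in zip(x, d)]
    return x

# Starting outside the Pareto set, the iterates approach a Pareto critical
# point (here: a point on the segment between a and b), where the
# min-norm element of C(x) vanishes.
x = steepest_descent([5.0, -3.0], a=[0.0, 0.0], b=[1.0, 2.0])
```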

Merit Functions
A merit function associated with an optimization problem is a function that returns zero at an optimal solution and is strictly positive otherwise. An overview of merit functions used in multiobjective optimization is given in [22]. In our proofs, we use the merit function u_0, which satisfies the following statement.
Additionally, u_0(x) is lower semicontinuous. Therefore, if (x_k)_{k≥0} is a sequence with u_0(x_k) → 0, every cluster point of (x_k)_{k≥0} is weakly Pareto optimal. This motivates the usage of u_0(x) as a measure of complexity for multiobjective optimization methods. The function u_0(x) is not the only merit function for multiobjective optimization problems; see also [23-25] and further references in [22].
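Following [9, 22], the merit function u_0 referred to above can be written as

```latex
u_0(x) \;=\; \sup_{z \in H} \; \min_{i = 1,\dots,m} \bigl( f_i(x) - f_i(z) \bigr),
```

so that u_0(x) ≥ 0 for all x ∈ H, and u_0(x) = 0 if and only if x is weakly Pareto optimal; for m = 1 it reduces to the optimality gap f(x) − inf f.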

Global Existence in Finite Dimensions
In this section, we show that solutions exist for the Cauchy problem related to (IMOG'), i.e.,

ẍ(t) + αẋ(t) + proj_{C(x(t))}(−ẍ(t)) = 0, (CP)

with initial data x(0) = x_0, ẋ(0) = v_0 ∈ H. To this end, we show that solutions to this system exist if there exists a solution to a first order differential inclusion (u̇, v̇) ∈ F(u, v) with a set-valued mapping F : H × H ⇒ H × H. Then, we use an existence theorem for differential inclusions from [26]. Our argument works only in finite-dimensional Hilbert spaces. Thus, we assume dim(H) < +∞ from here on. In our context, the set-valued mapping defined in (2) is of interest. As stated above, C(u) := co({∇f_i(u) : i = 1, . . ., m}). We can show that (CP) has a solution if the differential inclusion (DI) with appropriate initial data u_0, v_0 has a solution.

Existence of Solutions to (DI)
To show that there exist solutions to (DI), we investigate the set-valued mapping (u, v) ⇒ F(u, v) defined in (2). The basic definitions for set-valued mappings used in this subsection can be found in [26].
Proof. The statement follows directly from the definition.
To use an existence theorem from [26], we need to show that (u, v) ⇒ F(u, v) is upper semicontinuous. Showing this is elementary; we omit the full proof here but sketch a possible way to prove this result.

Lemma 3.2. Let C(u) := co({c_i(u) : i = 1, . . ., m}) with c_i : H → H, u → c_i(u) continuous for i = 1, . . ., m. Let (u_0, v_0) ∈ H × H be fixed. Then, for all ε > 0 there exists a δ > 0 such that for all (u, v) ∈ H × H with ‖u − u_0‖ < δ and ‖v − v_0‖ < δ and for all z ∈ arg min_{z ∈ C(u)} ⟨z, v⟩, there exists z_0 ∈ arg min_{z_0 ∈ C(u_0)} ⟨z_0, v_0⟩ with ‖z − z_0‖ < ε.
Proof. The proof is straightforward.
Proof. Using Lemma 3.2, this can be shown in a straightforward manner using only continuity arguments. Then, the statement follows from the fact that (u, v) ⇒ F(u, v) is locally compact.
Proposition 3.4. Let H have finite dimension. Then, the mapping (u, v) ⇒ F(u, v) is locally compact.

Proof. If dim(H) < +∞, the claim follows easily since all images F(u, v) are compact and depend on (u, v) in a well-behaved manner. On the other hand, from φ being locally compact, we get that v → v is locally compact, which is equivalent to H being finite-dimensional.
The following existence theorem from [26] is applicable in our setting.

Theorem 3.5 ([26], p. 98, Theorem 3). Let X be a Hilbert space and Ω ⊂ R × X be an open subset containing (0, x_0). Let F be an upper semicontinuous map from Ω into the non-empty closed convex subsets of X. We assume that (t, x) → proj_{F(t,x)}(0) is locally compact. Then, there exist T > 0 and an absolutely continuous function x(•) defined on [0, T] which is a solution to the differential inclusion ẋ(t) ∈ F(t, x(t)), x(0) = x_0.
We are finally in the position to state an existence theorem for (DI).

Theorem 3.6. Assume H is finite-dimensional and that the gradients ∇f_i of the objective functions are globally Lipschitz continuous. Then, for all (u_0, v_0) ∈ H × H there exist T > 0 and an absolutely continuous function (u(•), v(•)) defined on [0, T] which is a solution to the differential inclusion (DI).
Proof. The statement follows immediately from Propositions 3.3 and 3.4, which show that the set-valued mapping satisfies all conditions required for Theorem 3.5.
In the following, we show that under additional conditions on the objective functions f_i there exist solutions defined on [0, +∞). The extension of the solution works by a standard argument: we show that solutions to (DI) remain bounded, and then use Zorn's Lemma to derive a contradiction if there is a maximal solution that is not defined on [0, +∞).

Theorem 3.7. Assume H is finite-dimensional and that the gradients ∇f_i of the objective functions are globally Lipschitz continuous. Then, for all (u_0, v_0) ∈ H × H there exists an absolutely continuous function (u(•), v(•)) defined on [0, +∞) which is a solution to the differential inclusion (DI).
Proof. Theorem 3.6 guarantees the existence of solutions defined on [0, T) for some T > 0. Using the domain of definition, we can define a partial order on the set of solutions to the problem (DI). Assuming there is no solution defined on [0, +∞), Zorn's Lemma guarantees the existence of a solution (u(•), v(•)) : [0, T) → H × H with T < +∞ which cannot be extended. We will show that (u(t), v(t)) does not blow up in finite time and can therefore be extended, which contradicts the claimed maximality. Define h(t) := ‖(u(t), v(t))‖²_{H×H}, where ‖(x, y)‖²_{H×H} = ‖x‖² + ‖y‖². We show that h(t) can be bounded by a real-valued function. Using the Cauchy-Schwarz inequality, we directly get an estimate of ḣ(t) in terms of ‖F(u(t), v(t))‖. We next derive a bound on ‖F(u, v)‖. The basic inequalities between the ℓ¹ and ℓ² norms yield ‖F(u, v)‖ ≤ c(1 + ‖(u, v)‖_{H×H}), where we have chosen c = √2 max{1 + α, L, max_{i=1,...,m} ‖∇f_i(0)‖}. Combining the last inequality with (3) and using a Gronwall-type argument (see Lemma A.4 and Lemma A.5 in [27]), just as in Theorem 3.5 in [2], we know that there exists c ∈ R bounding h on [0, T − ε] for an arbitrary ε > 0. Since this upper bound is independent of t and ε, it follows that h ∈ L^∞([0, T], R). Therefore, solutions to (DI) do not blow up in finite time and can be extended. This is a contradiction to the maximality of the solution (u(t), v(t)).

Existence of Solutions to (CP)
Using the findings of the previous subsection, we can proceed with the discussion of the Cauchy problem (CP). In this subsection, we show that solutions to the differential inclusion (DI) immediately give solutions to the Cauchy problem (CP).
Finally, we can state the full existence theorem for the Cauchy problem (CP).
Theorem 3.9. Assume H is finite-dimensional and that the gradients ∇f_i of the objective functions are globally Lipschitz continuous. Then, for all x_0, v_0 ∈ H, there exists a twice continuously differentiable function x(t) defined on [0, +∞), absolutely continuous with absolutely continuous first derivative ẋ(t), which is a solution to the Cauchy problem (CP) with initial values (x_0, v_0).
Proof. The statement follows immediately by combining Theorem 3.7 and Theorem 3.8.
Remark 3.10. Throughout this section, we have assumed that the gradients ∇f_i of the objective functions are globally Lipschitz continuous. One can relax this condition and only require the gradients to be Lipschitz continuous on bounded sets if one can guarantee that the solutions remain bounded. This holds, for example, if one of the objective functions f_i is coercive.

Asymptotic Analysis of Trajectories of (IMOG')
In this section, we drop the assumption dim(H) < +∞. We show that trajectories of the differential equation (IMOG') converge weakly to Pareto critical points of the optimization problem (MOP). This follows from a dissipative property of the system and an argument that relies on Opial's Lemma. We first define a so-called energy function for the system (IMOG') that has Lyapunov-type properties.
Proposition 4.1. Let x : [0, +∞) → H be a solution to (CP). For i = 1, . . ., m, define the global energy E_i(t) := f_i(x(t)) + ½‖ẋ(t)‖². Then, for all t ∈ (0, +∞) it holds that d/dt E_i(t) ≤ −α‖ẋ(t)‖².

Proof. From the definition of the differential equation (IMOG'), it follows that −αẋ(t) = proj_{C(x(t))+ẍ(t)}(0). By the variational characterization of the convex projection, we get for all i = 1, . . ., m

⟨αẋ(t) + ∇f_i(x(t)) + ẍ(t), αẋ(t)⟩ ≤ 0,

which immediately gives ⟨∇f_i(x(t)) + ẍ(t), ẋ(t)⟩ ≤ −α‖ẋ(t)‖². Applying the chain rule to d/dt E_i(t) yields the desired result.

Proposition 4.2. Let x : [0, +∞) → H be a bounded solution of (CP) and let further ∇f_i be Lipschitz continuous on bounded sets. Then, for all i = 1, . . ., m:
i) lim_{t→+∞} E_i(t) exists;
ii) ẋ ∈ L²([0, +∞)) ∩ L^∞([0, +∞));
iii) ẍ ∈ L^∞([0, +∞));
iv) there exists a monotonically increasing unbounded sequence (t_k)_{k≥0} with proj_{C(x(t_k))}(0) → 0.

Proof. i) From Proposition 4.1, we immediately get that E_i is monotonically decreasing. Since ∇f_i is bounded on bounded sets, we can conclude by the mean value theorem that f_i is bounded on bounded sets. Since x(t) remains bounded by assumption, we conclude that f_i(x(t)) is bounded from below, and hence E_i is bounded from below, so the limit exists.

ii) We know that f_i(x(t)) is bounded. Then, by the definition of E_i and the fact that E_i is monotonically decreasing, we immediately get that ẋ(t) is bounded for all t ≥ 0. Since ẋ is of class C¹, it follows that ẋ ∈ L^∞([0, +∞)). Using Proposition 4.1, it follows that ∫_0^{+∞} ‖ẋ(t)‖² dt < +∞ and therefore ẋ ∈ L²([0, +∞)).
iii) Since ẋ(t) and ∇f_i(x(t)) remain bounded for all t ≥ 0, it follows that ẍ(t) remains bounded for all t ≥ 0. By the fact that x is twice continuously differentiable, it follows that ẍ is continuous and hence ẍ ∈ L^∞([0, +∞)).

iv) Assume that the negation of statement iv) holds. Fix an arbitrary δ > 0. Since ẋ(t) → 0 and ∇f_i is Lipschitz continuous on a set containing x(t), there exists T_δ > T such that for all t > T_δ and s ∈ [t, t + δ) the estimate (5) holds. Define v := proj_{C(x(t))}(0)/‖proj_{C(x(t))}(0)‖. From (5), combined with (6) and the Cauchy-Schwarz inequality, we obtain a lower bound on ⟨ẋ(s), v⟩ on [t, t + δ). Using the Cauchy-Schwarz inequality again, and since we can choose δ arbitrarily large and independently of M, this contradicts ẋ(t) → 0.
We will use part iv) of Proposition 4.2 to show that a weak limit point of the trajectory x(t) is Pareto critical. To this end, we introduce the following lemma, which states a demiclosedness property of the set-valued mapping C : H ⇒ H, x → C(x).

Lemma 4.3. Assume that the objective functions f_i are continuously differentiable. Let (x_k)_{k≥0} be a sequence in H that converges weakly to x_∞, and assume there exists a sequence (g_k)_{k≥0} with g_k ∈ C(x_k) that converges strongly to zero. Then, 0 ∈ C(x_∞) and hence x_∞ is Pareto critical.
If we can show that the trajectories of (IMOG') converge weakly, Proposition 4.2 together with Lemma 4.3 guarantees that the limit points are Pareto critical. To show that the trajectories in fact converge, we require Opial's Lemma.

Lemma 4.4 (Opial's Lemma [28]). Let S ⊂ H be a nonempty subset of H and x : [0, +∞) → H. Assume that x(t) satisfies the following conditions.

i) Every weak sequential cluster point of x(t) belongs to S.

ii) For all z ∈ S, lim_{t→+∞} ‖x(t) − z‖ exists.
Then, x(t) converges weakly to an element x_∞ ∈ S.
To use Opial's Lemma, we need a suitable nonempty set S ⊂ H that we define in the following proposition.
Proposition 4.5. Let x(t) be a bounded solution to (CP). Then, the set S defined in (7) is nonempty.
for all i = 1, . . ., m. Since x(t) is bounded, it possesses at least one weak sequential cluster point x_∞. The objective functions f_i are convex and continuous and therefore weakly lower semicontinuous. From this we conclude x_∞ ∈ S.
For the set S defined in (7) and a bounded solution x(t) of (CP), the first part of Opial's Lemma is easy to obtain; it follows analogously to the proof of Proposition 4.5, where it is shown that S is nonempty. To show the second part of Opial's Lemma, we verify that h_z(t) := ½‖x(t) − z‖² satisfies a differential inequality. Then, the convergence can be deduced from the following lemma.
With these ingredients, we can formulate the main convergence theorem of this subsection.

Theorem 4.7. Assume that the objective functions f_i are convex with gradients ∇f_i that are Lipschitz continuous on bounded sets. Then, every solution x : [0, +∞) → H of (CP) with arbitrary initial conditions x_0, v_0 ∈ H that remains bounded converges weakly to a Pareto critical point of (MOP).
In addition, we know that every weak sequential cluster point of x(t) belongs to S by the weak lower semicontinuity of the objective functions f_i. Then, we can use Opial's Lemma 4.4 to prove that x(t) converges weakly to an element in S. Let x_∞ be the weak limit of x(t). By Proposition 4.2, there exists a monotonically increasing unbounded sequence (t_k)_{k≥0} with proj_{C(x(t_k))}(0) → 0 for k → +∞. Since x(t_k) converges weakly to x_∞, Lemma 4.3 states that x_∞ is Pareto critical.

An Inertial Multiobjective Gradient Algorithm
In this section, we derive an inertial first order method for multiobjective optimization problems from an explicit discretization of the differential equation (IMOG'). We write the system (IMOG') in an equivalent form and use an explicit discretization of the differential equation. Lemma A.2 states that x_{k+1} is uniquely defined; therefore, x_{k+1} can be written as in (11), where θ_k ∈ Δ_m is the solution to the quadratic optimization problem (12). In the following subsection, we analyze the asymptotic behavior of the sequence (x_k)_{k≥0} defined by equations (11) and (12).
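A hedged numerical sketch of an inertial iteration of this type follows; the momentum coefficient a = 1/(1 + αh) matches the constant appearing in the proof of Theorem 5.6, while the step size factor h²a, the problem data, and the helper names are illustrative assumptions rather than the paper's exact displays (11)-(12).

```python
# Hedged sketch of an inertial multiobjective gradient step: a constant
# momentum term plus a step along the min-norm direction from the
# simplex-constrained quadratic subproblem (closed form for m = 2).

def min_norm(g1, g2):
    d = [u - v for u, v in zip(g1, g2)]
    dd = sum(c * c for c in d)
    if dd == 0.0:
        return list(g1)
    t = max(0.0, min(1.0, -sum(u * v for u, v in zip(g2, d)) / dd))
    return [t * u + (1.0 - t) * v for u, v in zip(g1, g2)]

def inertial_mog(x0, grads, alpha=2.0, h=0.1, iters=500):
    a = 1.0 / (1.0 + alpha * h)         # constant momentum coefficient
    x_prev, x = list(x0), list(x0)      # x_0 = x_1, as required
    for _ in range(iters):
        d = min_norm(*[g(x) for g in grads])
        x_next = [xi + a * (xi - xp) - (h * h) * a * di
                  for xi, xp, di in zip(x, x_prev, d)]
        x_prev, x = x, x_next
    return x

g1 = lambda x: [2.0 * x[0], 2.0 * x[1]]           # f1 = x0^2 + x1^2, L = 2
g2 = lambda x: [2.0 * (x[0] - 1.0), 2.0 * x[1]]   # f2 = (x0 - 1)^2 + x1^2
# Lh = 0.2 < 2*alpha = 4, so the step-size condition of Section 5 holds.
x = inertial_mog([3.0, 2.0], [g1, g2])            # approaches the Pareto set [0,1] x {0}
```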

Asymptotic Analysis
The asymptotic analysis of the sequence (x_k)_{k≥0} defined by (11) and (12) is surprisingly similar to the asymptotic analysis of the trajectories x(t) of the differential equation (IMOG'). We start by proving that the sequence (x_k)_{k≥0} satisfies a dissipative property. To this end, we introduce the following preparatory lemma.
Lemma 5.1. Let (x_k)_{k≥0} be defined by (11) and (12) with x_0 = x_1 ∈ H and α, h > 0. Then, for all i = 1, . . ., m, the following estimate holds.

Proof. Using the variational characterization of the convex projection in the identity (10), we get for all i = 1, . . ., m an inequality which can be rearranged into the claimed estimate.

Using Lemma 5.1, we show that there exists an energy sequence which can be seen as a discretization of the energy function defined in Proposition 4.2.
Proposition 5.2. Assume that the gradients ∇f_i of the objective functions are globally L-Lipschitz continuous for all i = 1, . . ., m, and assume further that Lh < 2α. Then, the energy sequence (E_{i,k})_{k≥1} is monotonically decreasing.
Proof. We start by investigating the difference E_{i,k+1} − E_{i,k}. Using that f_i is convex with L-Lipschitz continuous gradient, we estimate this expression, and using Lemma 5.1, we estimate the resulting term further. For hL < 2α, it holds that L/2 − α/h < 0, and we obtain the claimed monotonicity, which completes the proof.
The following corollary is an immediate consequence of Proposition 5.2.
Corollary 5.3. Assume all conditions of Proposition 5.2 are met. Then, for all i = 1, . . ., m and all k ≥ 1, it holds that f_i(x_k) ≤ f_i(x_0).
Corollary 5.3 hints at a condition that guarantees that the sequence (x_k)_{k≥0} remains bounded: if the level set L_i(f_i(x_0)) := {x ∈ H : f_i(x) ≤ f_i(x_0)} of one objective function f_i is bounded, the sequence (x_k)_{k≥0} remains bounded. In the following proposition, we collect some immediate consequences of Proposition 5.2.
Proposition 5.4. Assume the gradients ∇f_i of the objective functions are L-Lipschitz continuous on a bounded set containing the sequence (x_k)_{k≥0} defined by equations (11) and (12), and assume Lh < 2α. Then, for all i = 1, . . ., m, the following statements hold.
Proof. i) Proposition 5.2 states that E_{i,k} is monotonically decreasing. Therefore, E_{i,k} → E_i^∞ holds, and we have to show that E_i^∞ > −∞. Since the objective functions f_i have Lipschitz continuous gradients on a bounded set containing (x_k)_{k≥0}, it follows by the mean value theorem that f_i is bounded on this set and in particular along (x_k)_{k≥0}. Therefore, we conclude E_i^∞ > −∞.

ii) From the inequality (15), we immediately obtain a bound on the summed differences. Since Lh < 2α, it holds that α/h − L/2 > 0, and therefore we get a corresponding bound for all K ≥ 1. From part i), we know that the right-hand side converges, which completes the proof of ii).

iii)

We use the following discrete version of Opial's Lemma to prove that (x_k)_{k≥0} converges weakly to a Pareto critical point of (MOP).

Lemma 5.5 (Opial's Lemma [28]). Let S ⊂ H be nonempty and let (x_k)_{k≥0} be a sequence in H that satisfies the following conditions.

i) For all z ∈ S, lim_{k→+∞} ‖x_k − z‖ exists.
ii) Every weak sequential cluster point of (x_k)_{k≥0} belongs to S.
Then, it follows that (x_k)_{k≥0} converges weakly to an element in S.
We will use Opial's Lemma on the set S defined in (17).

Theorem 5.6. Assume the gradients ∇f_i of the objective functions are L-Lipschitz continuous on a bounded set containing the sequence (x_k)_{k≥0} defined by (11) and (12), and assume further that Lh < 2α.
Then, (x_k)_{k≥0} converges weakly to a Pareto critical point of (MOP).
Proof. We show that (x_k)_{k≥0} satisfies the conditions of Opial's Lemma for the set S defined by (17). We start by showing a quasi-Fejér property of the sequence (x_k)_{k≥0}. For a fixed z ∈ S, define the sequence h_k := ½‖x_k − z‖². It is easy to check that Proposition 5.4 guarantees the monotonicity of E_{i,k}. Since z ∈ S, from the convexity of f_i we can deduce for all i = 1, . . ., m that f_i(x_k) − f_i(z) ≤ ⟨∇f_i(x_k), x_k − z⟩. Using this inequality, we can derive an estimate of the form h_{k+1} − h_k ≤ a(h_k − h_{k−1}) + δ_k with a := 1/(1 + αh). Proposition 5.4 states that ∑_{k=1}^∞ δ_k < +∞. Therefore, we can use Theorem 2.1 in [29] or Theorem 3.1 in [4] to show that h_k converges. To use Opial's Lemma, we also have to show that all weak sequential cluster points of (x_k)_{k≥0} belong to S. Since the sequence (x_k)_{k≥0} is bounded, it possesses at least one weak sequential cluster point, which we denote by x_∞, together with a subsequence (x_{k_l})_{l≥0} that converges weakly to x_∞. Since f_i is convex and continuous, it is also weakly lower semicontinuous, and it follows for all i = 1, . . ., m that f_i(x_∞) ≤ lim_{l→+∞} f_i(x_{k_l}), where the equality of limit and limit inferior follows from the fact that the limit exists. Therefore, x_∞ ∈ S, and hence S is nonempty. Then, Opial's Lemma 5.5 states that (x_k)_{k≥0} converges weakly to an element in S, which we again denote by x_∞. We will show that each weak sequential cluster point of (x_k)_{k≥0} is Pareto critical. By the definition of the sequence (x_k)_{k≥0} in (11), the iterates express ∑_{i=1}^m θ_i^k ∇f_i(x_k) through the differences x_{k+1} − x_k and x_k − x_{k−1}, whose squared norms are summable by part ii) of Proposition 5.4. Thus, we know that the sequence g_k := ∑_{i=1}^m θ_i^k ∇f_i(x_k) ∈ C(x_k) converges strongly to zero. Since x_k converges weakly to x_∞, Lemma 4.3 states that 0 ∈ C(x_∞), and hence x_∞ is Pareto critical.

An Accelerated Multiobjective Gradient Method
In this section, we define a multiobjective gradient method with Nesterov acceleration, based on the inertial method discussed in the previous section.

The Single Objective Case
In this subsection, we present Nesterov's method in the single objective setting and point out its relation to an inertial gradient-like dynamical system with asymptotically vanishing damping. Consider the problem min_{x ∈ H} f(x), where f : H → R is convex and differentiable with L-Lipschitz continuous gradient ∇f(x). For α ≥ 3, 0 < s ≤ 1/L and x_0, x_1 ∈ H, define the sequence (x_k)_{k≥0} by (18). If arg min f ≠ ∅, it can be shown that f(x_k) − min f = O(k^{−2}) and that (x_k)_{k≥0} converges weakly to an element of arg min f [6]. Nesterov's method is related to the gradient-like dynamical system (19) with asymptotically vanishing damping. The algorithm (18) can be derived as a discretization of (19). This relation is further investigated in [1,30].
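For a concrete impression of the recursion (18), here is a minimal sketch with α = 3 (the objective f(x) = ½‖x‖², the step size, and the function names are illustrative choices, not part of the paper).

```python
# Nesterov's accelerated gradient method with alpha = 3: an extrapolation
# step with momentum (k - 1)/(k + 2) followed by a gradient step at the
# extrapolated point y.

def nesterov(grad_f, x0, s, iters):
    x_prev, x = list(x0), list(x0)            # x_0 = x_1
    for k in range(1, iters + 1):
        beta = (k - 1) / (k + 2)              # (k - 1)/(k + alpha - 1) for alpha = 3
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        g = grad_f(y)
        x_prev, x = x, [yi - s * gi for yi, gi in zip(y, g)]
    return x

# f(x) = ||x||^2 / 2, so grad f(x) = x and L = 1; s = 0.5 <= 1/L.
x = nesterov(lambda y: list(y), [4.0, -2.0], s=0.5, iters=100)
# The function values decay at rate O(k^{-2}); here x is numerically near 0.
```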

Introducing Nesterov Acceleration in (IMOG')
We formally define the following gradient-like system with asymptotically vanishing damping for multiobjective optimization.
We have shown that the differential equation (20) can be derived from the scheme (22). Using Lemma A.1 on (22), we get that x_{k+1} is uniquely defined, and the last term can be written as a convex combination ∑_{i=1}^m θ_i^k ∇f_i(y_k), where θ_k ∈ R^m is a solution to a quadratic optimization problem. We want to drop the factor k/(k+3) in front of the term ∑_{i=1}^m θ_i^k ∇f_i(y_k) to get a method that more closely resembles (18). In addition, we perform a shift of the index k to transform k/(k+3) into (k−1)/(k+2). The final method of this subsection can then be defined as follows. Let x_0 = x_1 ∈ H and s > 0. Define the scheme (28), where in each step θ_k ∈ R^m is a solution to the quadratic optimization problem (29). The fact that we have to transform the quadratic optimization problem from (27) into (29) is an observation from the proof of Proposition 6.1. The presented method is still asymptotically equivalent to the scheme defined by (22). We summarize the defined method in Algorithm 1 for later reference.
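A hedged sketch of an accelerated multiobjective iteration of this type follows: a Nesterov extrapolation with factor (k − 1)/(k + 2), then a step along a convex combination of gradients at the extrapolated point. The use of the plain min-norm element in place of subproblem (29), as well as the problem data and step size, are illustrative assumptions.

```python
# Hedged sketch: Nesterov extrapolation + step along a convex combination of
# gradients at y_k; the combination is taken as the min-norm element of C(y_k)
# (closed form for m = 2), a simplification of the paper's subproblem (29).

def min_norm(g1, g2):
    d = [u - v for u, v in zip(g1, g2)]
    dd = sum(c * c for c in d)
    if dd == 0.0:
        return list(g1)
    t = max(0.0, min(1.0, -sum(u * v for u, v in zip(g2, d)) / dd))
    return [t * u + (1.0 - t) * v for u, v in zip(g1, g2)]

def accelerated_mo(x0, grads, s=0.25, iters=300):
    x_prev, x = list(x0), list(x0)                    # x_0 = x_1
    for k in range(1, iters + 1):
        beta = (k - 1) / (k + 2)
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        d = min_norm(*[g(y) for g in grads])          # combination at y_k
        x_prev, x = x, [yi - s * di for yi, di in zip(y, d)]
    return x

g1 = lambda y: [2.0 * y[0], 2.0 * y[1]]               # f1 = y0^2 + y1^2, L = 2
g2 = lambda y: [2.0 * (y[0] - 1.0), 2.0 * y[1]]       # f2 = (y0 - 1)^2 + y1^2
# sL = 0.5 <= 1, matching the assumption of Proposition 6.1.
x = accelerated_mo([3.0, 2.0], [g1, g2])              # tends toward the Pareto set [0,1] x {0}
```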

Algorithm 1. Accelerated multiobjective gradient method
Require: Choose x^0 = x^1 ∈ H, s > 0 and set k = 1.
1: Set y^k = x^k + ((k−1)/(k+2)) (x^k − x^{k−1}).
2: Compute a solution θ^k of the quadratic subproblem (29).
3: Set x^{k+1} = y^k − s Σ_{i=1}^m θ^k_i ∇f_i(y^k).
4: if a stopping criterion is satisfied then
5: Stop.
6: else
7: Update k ← k + 1 and go to step 1.
8: end if
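The structure of Algorithm 1 can be sketched in Python as follows. One caveat: in place of the subproblem (29), whose objective also involves the momentum term, we solve the plain Fliege–Svaiter min-norm subproblem (which has a closed form for m = 2), and the toy objectives f₁(x) = ‖x − e₁‖², f₂(x) = ‖x + e₁‖² are ours — so this is a structural sketch rather than a faithful implementation.

```python
def accelerated_mo_gradient(grads, x0, s, k_max):
    """Algorithm-1-style loop for m = 2 objectives:
    y^k = x^k + (k-1)/(k+2) (x^k - x^{k-1}),
    x^{k+1} = y^k - s (theta*g1(y^k) + (1-theta)*g2(y^k)),
    with theta from the min-norm subproblem (a simplification of (29)).
    """
    x_prev = x = list(x0)
    for k in range(1, k_max + 1):
        mom = (k - 1) / (k + 2)
        y = [xi + mom * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g1, g2 = grads[0](y), grads[1](y)
        diff = [a - b for a, b in zip(g1, g2)]        # g1 - g2
        denom = sum(d * d for d in diff)              # ||g1 - g2||^2
        theta = 0.0
        if denom > 0.0:
            # closed-form minimizer over the simplex, clipped to [0, 1]
            theta = min(1.0, max(0.0,
                        -sum(d * b for d, b in zip(diff, g2)) / denom))
        x_prev, x = x, [yi - s * (theta * a + (1.0 - theta) * b)
                        for yi, a, b in zip(y, g1, g2)]
    return x

# toy problem (ours): Pareto set is the segment [-1, 1] x {0}
g1 = lambda x: [2.0 * (x[0] - 1.0), 2.0 * x[1]]
g2 = lambda x: [2.0 * (x[0] + 1.0), 2.0 * x[1]]
x_end = accelerated_mo_gradient([g1, g2], x0=[2.0, 1.0], s=0.25, k_max=300)
```

For this toy problem the second coordinate is driven to zero while the first coordinate settles inside the Pareto segment.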

A Dissipative Property
We start our investigations of Algorithm 1 with an energy estimate analogous to Proposition 5.2 for the inertial method.

Proposition 6.1. Assume that the gradients ∇f_i of the objective functions are globally L-Lipschitz continuous for all i = 1, …, m, and assume further that sL ≤ 1. Define for all k ≥ 1 the energy sequence … Then, for all k ≥ 1, it holds that …

Proof. From the definition of x^k and y^k in (28) we get … Hence, for all i = 1, …, m, it holds that …, from which it follows that … Writing out the definition of y^k, one can easily verify that … Combining the inequalities above and using sL ≤ 1, we get …, which completes the proof.

Convergence of Function Values with Rate
The proof in this section relies on the proof by Fukuda, Tanabe and Yamashita [9] for their accelerated gradient method and on the proof of Attouch and Peypouquet [6] for the single objective case. The following definition is aligned with [22] and the concept of merit functions that was introduced in [22] and further utilized in [9, 31]. For z ∈ H define
\[ \sigma_k(z) := \min_{i=1,\dots,m} \left( f_i(x^k) - f_i(z) \right). \]

Lemma 6.3. …

Proof. The objective functions f_i are convex with L-Lipschitz continuous gradients. Therefore, for all i = 1, …, m it holds that … The definition of σ_k(z) gives … Combining (30) and (31), we get the desired inequality.
We want to find a similar inequality for the expression f_i(x^{k+1}) − f_i(x^k). To this end, we introduce the following lemma.

Lemma 6.4. Define the optimization problem … (32) Then, the dual problem to this problem is the quadratic problem (29). An optimal solution θ* to (29) satisfies …

Proof. Since H is potentially infinite-dimensional, we need duality statements for infinite-dimensional constrained optimization problems. The statements we use in this proof can be found in Sections 8.3 to 8.6 of [32]. Since the optimization problem (32) has a fairly simple structure, we will not recite every result we use. The duality between (32) and (29) follows from a straightforward computation. Since the objective function Φ(v, α) of (32) is convex and all constraints g_i(v, α) are linear, strong duality holds. Hence, a KKT point of (32) yields a solution to (29). From the KKT conditions for (32) we get that … For all i = 1, …, m it holds that g_i(v, α) ≤ 0 and hence … By the complementarity of θ*_i and g_i(v*, α*) we get … The second equality above follows from the fact that θ*_j > 0 holds for at least one j ∈ {1, …, m} as a consequence of dual feasibility. Using v* = −Σ_{i=1}^m θ*_i ∇f_i(y^k), we get s v* = x^{k+1} − y^k and therefore …

Lemma 6.5. …

Proof. For all a, b ∈ ℝ^m it holds that min_{i=1,…,m} … and therefore, for all z ∈ H, … Using that the objective functions f_i are convex with L-Lipschitz continuous gradients and the fact that sL ≤ 1, we can bound this expression by … Now we use Lemma 6.4 and get the equality … From here, we continue by using the definitions of x^k and y^k from (28) …

Theorem 6.6. The sequence (x^k)_{k≥0} defined by (28) satisfies …

Proof. Lemma 6.3 and Lemma 6.5 state … Taking a convex combination of the last inequalities with weights 2/(k+2) and k/(k+2) yields … Define … (34) and notice that … (35) Using the identity (35) in (33), we get … From the definition of z^k in (34) one can see that … Using this identity, we can simply compute the squared norm ‖z^{k+1} − z‖² as … Rearranging this identity and multiplying with … Combining (36) and (37), in total we get … Multiplying both sides with (k+2)² then yields … Summing this inequality from k = 1, …, K, we get for all z ∈ H … Similar computations to Lemma 6.3 yield … Then, for all k ≥ 1, we obtain …

The theorem above is not straightforward to interpret, since we only get convergence of order O(k^{−2}) for min_{i=1,…,m} (f_i(x^k) − f_i(z)). This on its own does not state that the vector f(x^k) = (f_1(x^k), …, f_m(x^k)) converges to an element of the Pareto front. However, we can refine the statement of Theorem 6.6 in the following way to get a stronger convergence statement under a weak additional assumption.

Theorem 6.7. Assume, in addition to the assumptions of Theorem 6.6, that for all x ∈ L(f(x^0)) there exists an x* ∈ L* := P_w ∩ L(f(x^0)) with … Then, there exists R ≥ 0 with …

Proof. Theorem 6.6 gives for all z ∈ H … Taking a supremum over this inequality, we get … It remains to show that … Writing out the definition of σ_k(z), we get …

The function u_0(x) = sup_{z∈H} min_{i=1,…,m} (f_i(x) − f_i(z)) attains the value zero if and only if x is weakly Pareto optimal. Theorem 6.7 shows that u_0(x^k) = O(k^{−2}).

Relation to Tanabe's Accelerated Multiobjective Gradient Method
In the recent preprint [9], Tanabe, Fukuda and Yamashita define an accelerated proximal gradient method for MOPs with objective functions that have a separable structure of the form f_i(x) = g_i(x) + h_i(x), where g_i : ℝ^n → ℝ is convex and continuously differentiable with L-Lipschitz continuous gradient, and h_i : ℝ^n → ℝ is convex, lower semicontinuous and proper for all i = 1, …, m. Since we only treat the case of smooth objective functions f_i, we set h_i ≡ 0 from here on. Tanabe et al. derived their method using techniques different from the ones used throughout this paper, namely the concept of merit functions. We will not recite their method here but refer the reader to [9]. To understand the similarity between their method and Algorithm 1, we investigate the quadratic optimization problems that have to be solved in each iteration of the respective methods. In the method from [9], the step direction is computed by solving a quadratic optimization problem with objective function Ψ : ℝ^m → ℝ, … Using a first order approximation, minimizing Ψ(θ) is equivalent to minimizing a function Φ : ℝ^m → ℝ, and we note that Φ(θ) is in fact the objective function of the quadratic optimization problem (29). After this observation, it is not surprising that the method by Tanabe et al. shows convergence behavior similar to Algorithm 1.

Improving the Numerical Efficiency
First order methods for multiobjective optimization that are based on the steepest descent method by Fliege and Svaiter [12] require the solution of a quadratic subproblem in each iteration. Computing the solutions of these subproblems is computationally demanding. In the following subsection, we present a possible approach to overcome this problem.
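For concreteness, the quadratic subproblem in question is the search for the minimal-norm element of the convex hull of the gradients, i.e. a quadratic program over the unit simplex. For m = 2 objectives it reduces to a one-dimensional problem with a closed-form solution; the following Python sketch (function name ours) shows this special case.

```python
def min_norm_direction(g1, g2):
    """Solve min_{theta in [0,1]} || theta*g1 + (1-theta)*g2 ||^2 (case m = 2).

    Returns the optimal simplex weight theta and the common descent
    direction d = -(theta*g1 + (1-theta)*g2).
    """
    diff = [a - b for a, b in zip(g1, g2)]           # g1 - g2
    denom = sum(d * d for d in diff)                 # ||g1 - g2||^2
    if denom == 0.0:
        theta = 0.0                                  # gradients coincide
    else:
        # unconstrained minimizer of ||g2 + theta*(g1 - g2)||^2, clipped to [0, 1]
        theta = min(1.0, max(0.0, -sum(d * b for d, b in zip(diff, g2)) / denom))
    return theta, [-(theta * a + (1.0 - theta) * b) for a, b in zip(g1, g2)]

theta, d = min_norm_direction([1.0, 0.0], [0.0, 1.0])   # orthogonal gradients
```

For orthogonal gradients of equal length the optimal weight is θ = 1/2; the clipping handles the case where one gradient dominates and the minimal-norm element is a vertex of the simplex.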

A Multiobjective Gradient Method without Quadratic Subproblems
In this subsection, we define a method based on Algorithm 1 which does not require the solution of a quadratic subproblem in each iteration. In Subsection 6.2, we derived Algorithm 1 from the scheme (22), which can be interpreted as a discretization of the differential equation (20). If, instead, we use the discretization (43), we obtain a different method. Lemma A.1 gives a formula to compute x^{k+1}, where the required selection problem can be solved efficiently by computing m inner products. After changing the factor (k−3)/k into (k−1)/(k+2) in (43), we define Algorithm 2. There is no proof of convergence for the method defined by Algorithm 2, but we discuss its numerical behavior in Section 8. Also, convergence can be guaranteed by switching from the significantly faster Algorithm 2 to Algorithm 1 as soon as some heuristic criterion is met.
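To illustrate the idea, the following Python sketch performs inertial steps in which the gradient component is a single objective's gradient, selected by comparing m inner products, so no quadratic program is solved. The concrete selection rule below (largest squared gradient norm) is a placeholder of ours; the actual rule in step 2 of Algorithm 2 is the one obtained from the discretization (43).

```python
def inertial_step_no_qp(grads, x, x_prev, k, s):
    """One inertial step whose gradient component is a single objective's
    gradient, chosen via m inner products (no quadratic subproblem).

    The selection rule (largest squared gradient norm) is a placeholder.
    """
    mom = (k - 1) / (k + 2)
    y = [xi + mom * (xi - xpi) for xi, xpi in zip(x, x_prev)]   # momentum point
    gs = [g(y) for g in grads]
    # m inner products <g_i(y), g_i(y)>; select the maximizing index
    i_star = max(range(len(gs)), key=lambda i: sum(c * c for c in gs[i]))
    return [yi - s * c for yi, c in zip(y, gs[i_star])]

# toy bicriteria problem (ours): f1 = ||x - e1||^2, f2 = ||x + e1||^2
g1 = lambda x: [2.0 * (x[0] - 1.0), 2.0 * x[1]]
g2 = lambda x: [2.0 * (x[0] + 1.0), 2.0 * x[1]]
x_prev, x = [2.0, 1.0], [2.0, 1.0]
for k in range(1, 201):
    x_prev, x = x, inertial_step_no_qp([g1, g2], x, x_prev, k, s=0.1)
```

On this toy problem the iterates settle near the middle of the Pareto set [−1, 1] × {0}, alternating between the two gradients — consistent with the behavior reported for AccG w\o Q in Section 8.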

Backtracking for Unknown Lipschitz Constants
In all presented algorithms, we can include backtracking if the Lipschitz constants of the gradients ∇f_i of the objective functions are unknown. We can do this as stated in [9, 33]. To include backtracking, we choose an initial step size s_0 > 0 and a parameter σ ∈ (0, 1). In all discussed algorithms there is a step x^{k+1} = w^k − s d^k, with d^k ∈ H and w^k = x^k or w^k = y^k. One can replace this step with x^{k+1} = w^k − s_k d^k, with a step size s_k that is determined using backtracking. We choose in every step s_k = σ^{l_k} s_{k−1}, where l_k ≥ 0 is the smallest nonnegative integer satisfying, for all i = 1, …, m, … The sequence (s_k)_{k≥0} is monotonically decreasing by definition. Under the condition that the objective functions possess L-Lipschitz continuous gradients, it is guaranteed that the sequence (s_k)_{k≥0} is constant from some k on. This is true since s_k can only decrease as long as s_k > 1/L. Therefore, s_k can only decrease finitely many times until it reaches a point where s_k ≤ 1/L. Using this observation, we can include backtracking in Algorithm 1 and still use the proofs of Theorem 6.6 and Theorem 6.7 to show that the same convergence results can be achieved.

Algorithm 2. Accelerated multiobjective gradient method without quadratic subproblems
Require: Choose x^0 = x^1 ∈ H, s > 0 and set k = 1.
1: Set …
5: Stop.
6: else
7: Update k ← k + 1 and go to step 1.
8: end if
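The backtracking rule can be sketched in Python as follows. The sufficient-decrease test below is the descent-lemma condition, which is guaranteed to hold once s_k ≤ 1/L, matching the argument above; whether it coincides verbatim with the condition of [9, 33] is an assumption, and all concrete functions and values are illustrative.

```python
def backtracking_step(fs, grads, w, d, s_prev, sigma=0.5, max_trials=60):
    """Backtracking step x_{k+1} = w - s_k * d (sketch).

    Shrinks s by sigma until, for all objectives,
        f_i(w - s*d) <= f_i(w) + <grad f_i(w), -s*d> + ||s*d||^2 / (2*s),
    which is guaranteed to hold once s <= 1/L.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    s = s_prev
    for _ in range(max_trials):
        x_new = [wi - s * di for wi, di in zip(w, d)]
        p = [-s * di for di in d]                      # the step actually taken
        if all(f(x_new) <= f(w) + dot(g(w), p) + dot(p, p) / (2.0 * s)
               for f, g in zip(fs, grads)):
            return s, x_new                            # condition satisfied
        s *= sigma                                     # shrink and retry
    return s, x_new

# single quadratic f(x) = x^2 (L = 2): the condition holds iff s <= 1/2
f = lambda x: x[0] ** 2
g = lambda x: [2.0 * x[0]]
s, x_new = backtracking_step([f], [g], w=[1.0], d=g([1.0]), s_prev=10.0)
```

Starting from s_0 = 10 with σ = 0.5, the step size is halved until it first drops below 1/L = 0.5, i.e. it is accepted at s = 0.3125, illustrating that backtracking stops after finitely many reductions.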

Numerical Examples
In this section, we present the typical behavior of our algorithms on two test problems. We compare Algorithms 1 and 2 with the steepest descent method by Fliege and Svaiter with constant step sizes [12]. Throughout this section, we denote the steepest descent method by SD, Algorithm 1 by AccG (accelerated gradient method) and Algorithm 2 by AccG w\o Q (accelerated gradient method without quadratic subproblems). We implemented all codes in Matlab R2021b and executed the algorithms on a machine with a 2.80 GHz Intel Core i7 processor and 48 GB memory. We solved the quadratic subproblems for SD and AccG using the built-in Matlab function quadprog.

Example 1: A Convex MOP with Three Objective Functions
In our first example, we choose a problem with input dimension n = 20 and three objective functions (m = 3). We define the objective functions using the following parameters. For p = 50 and i = 1, 2, 3, we generate matrices A_i = (a^i_1, …, a^i_p)^T ∈ ℝ^{p×n} with a^i_j ∈ ℝ^n for j = 1, …, p and vectors b_i ∈ ℝ^p. Then, for i = 1, 2, 3, we define the objective functions … For the first experiment, we randomly generate matrices A_i ∈ ℝ^{p×n} and vectors b_i ∈ ℝ^p with entries uniformly sampled in [−1, 1] for i = 1, 2, 3. The starting vector x^0 is uniformly randomly drawn from [−15, 15]^n. We use the step size s = 5e−2 and execute at most k_max = 1000 iterations. Figure 1 contains plots of the sequences (x^k)_{k≥0} for the different algorithms. In Figure 1a, one sees that the sequences generated with AccG and AccG w\o Q advance much faster in the beginning, while the velocity of the sequence generated with SD remains constant. The sequences generated by AccG and AccG w\o Q give very similar trajectories in the beginning. This result is intuitive given that the schemes in the algorithms are derived from different discretizations of the same differential equation. However, it is still surprising keeping in mind that in Algorithm 2 we do not solve a quadratic subproblem in each iteration. Only in Figure 1b do we see that the sequences differ more substantially in the long run. It is also noteworthy that the sequence generated by AccG is smoother compared to the trajectory generated by AccG w\o Q. This is due to the fact that in AccG w\o Q we choose one of the gradients of the objective functions as the gradient component of the step direction, while in AccG we choose an element of the convex hull of the gradients. AccG and AccG w\o Q are superior to SD in terms of convergence of the function values for all objective functions, as shown in Figure 2.

In a second experiment, we execute all algorithms for 50 starting values uniformly sampled in [−5, 5]^n with step size s = 5e−2. We use the stopping criterion ‖f(x^k) − f(x^{k−1})‖_∞ < 1e−4 to stop the algorithms if the function values do not change significantly. In Figure 3a, we perform up to k_max = 50, in Figure 3b up to k_max = 250 and in Figures 3c and 3d up to k_max = 1000 iterations. Similar to the results observed in Figures 1 and 2, AccG and AccG w\o Q advance much faster in the beginning compared to SD. Comparing Figures 2b and 2c, we see that after 250 iterations the function values for the accelerated methods have converged or the stopping conditions were met. The different behavior of the accelerated methods can be observed in Figure 3d. While the solutions of AccG are spread farther, the solutions of AccG w\o Q appear to be drawn towards the center of the Pareto front. Altogether, the accelerated methods perform better for this problem in terms of convergence speed of the function values. The total numbers of iterations and computation times are listed in Table 1.

Example 2: A Convex MOP with Two Objective Functions
For our second test problem, we choose an example from [34] with input dimension n = 2 and the two objective functions … with λ = 0.6. For the multiobjective optimization problem (MOP) with these objective functions, it can easily be verified that the Pareto set is … In the first experiment, we execute the algorithms SD, AccG and AccG w\o Q with the starting vector … In some regions, the gradients are similar in magnitude, which is why the solution zig-zags back and forth between the objectives in these locations. However, this disadvantage is compensated by the fact that we do not need to solve a quadratic subproblem in every step. We potentially need more iterations when choosing smaller step sizes, but every iteration is computationally cheaper in comparison to SD and AccG. Moreover, a small adaptation to step 2 of Algorithm 2, where we include a weighting parameter in the max problem, might allow us to diversify solutions, similar to the weighted sum method.
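The experimental loop with the stopping criterion ‖f(x^k) − f(x^{k−1})‖_∞ < 1e−4 can be sketched as follows. The bicriteria toy problem below is our own (not the test problem from [34]), and the quadratic subproblem is replaced by its m = 2 closed-form solution.

```python
def steepest_descent(grads, fs, x0, s, k_max, tol=1e-4):
    """Multiobjective steepest descent for m = 2 objectives with the
    stopping criterion ||f(x_k) - f(x_{k-1})||_inf < tol."""
    x = list(x0)
    for k in range(1, k_max + 1):
        g1, g2 = grads[0](x), grads[1](x)
        diff = [a - b for a, b in zip(g1, g2)]
        denom = sum(d * d for d in diff)
        theta = 0.0
        if denom > 0.0:
            # closed-form solution of the m = 2 simplex subproblem, clipped
            theta = min(1.0, max(0.0,
                        -sum(d * b for d, b in zip(diff, g2)) / denom))
        x_new = [xi - s * (theta * a + (1.0 - theta) * b)
                 for xi, a, b in zip(x, g1, g2)]
        if max(abs(f(x_new) - f(x)) for f in fs) < tol:  # stopping criterion
            return x_new, k
        x = x_new
    return x, k_max

# toy objectives (ours); the Pareto set is the segment [-1, 1] x {0}
f1 = lambda x: (x[0] - 1.0) ** 2 + x[1] ** 2
f2 = lambda x: (x[0] + 1.0) ** 2 + x[1] ** 2
g1 = lambda x: [2.0 * (x[0] - 1.0), 2.0 * x[1]]
g2 = lambda x: [2.0 * (x[0] + 1.0), 2.0 * x[1]]
x_end, iters = steepest_descent([g1, g2], [f1, f2], [2.0, 1.0], s=0.25, k_max=1000)
```

For this starting point the method terminates via the stopping criterion after a few dozen iterations, close to the Pareto point (1, 0).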

Conclusion and Open Questions
We present the novel inertial gradient-like dynamical system (IMOG') for Pareto optimization and show that trajectories of this system converge weakly to Pareto critical points of (MOP). Based on this, we define a novel inertial gradient method for multiobjective optimization and show weak convergence to Pareto critical points. We derive an accelerated gradient method from the informally introduced inertial gradient-like system (MAVD), which incorporates asymptotically vanishing damping. Using the concept of merit functions, we show that our method possesses an improved convergence rate. Using a different discretization of the system (MAVD), we define a further accelerated gradient method which does not require the solution of a quadratic optimization problem in every step. A comparison on selected test problems shows that the accelerated methods are in fact superior to the plain multiobjective steepest descent method.

Many open questions arise from the presented work. The gradient system (IMOG') can be analyzed for different problem classes. Also, we have not formally discussed the system (MAVD). It would be interesting to see whether we can derive solutions of this system for α > 3 and obtain stronger convergence results similar to the single objective setting. In addition, we can adapt our gradient systems and algorithms to treat problems with a separable smooth and nonsmooth structure using proximal methods. It would also be interesting to investigate the behavior of our algorithms for high-dimensional and nonconvex problems. Finally, one could apply the presented algorithms in the area of machine learning, e.g., for multitask learning problems [35].

A. Two Lemmas on Convex Projections
Lemma A.1. Let H be a real Hilbert space, C ⊂ H a convex and compact set and η ∈ H a fixed vector. Then, ξ ∈ H is a solution to the problem

Find ξ ∈ H such that η = proj_{C+ξ}(0),  (45)

if and only if it has the form ξ = η − μ, where μ is a solution to the constrained optimization problem min_{μ∈C} ⟨μ, η⟩.

Proof. First, we show that an element of the form ξ = η − μ, with μ a solution to min_{μ∈C} ⟨μ, η⟩, is a solution to problem (45). The set of minimizers of the problem min_{μ∈C} ⟨μ, η⟩ is nonempty, since C is compact. Let μ ∈ arg min_{μ∈C} ⟨μ, η⟩. Since C is convex, the first order optimality condition for this problem gives that for all x ∈ C it holds that ⟨x − μ, η⟩ ≥ 0 and hence ⟨x + ξ − (μ + ξ), η⟩ ≥ 0.

Since we have chosen ξ = η − μ, the inequality above reads ⟨x + ξ − η, η⟩ ≥ 0, which is equivalent to η = proj_{C+ξ}(0). The other direction works analogously. If the vector ξ is a solution to problem (45), this guarantees that μ = η − ξ satisfies the first order optimality condition for the problem min_{μ∈C} ⟨μ, η⟩. Since this problem is convex and defined over a convex set, the optimality condition is equivalent to μ being an optimal solution.
The uniqueness follows in the same way. Assume we have a solution ξ' to (46). By the same computations as above, it holds for all x ∈ C that ⟨x + (1 + a)ξ' + aν, ξ' + ν⟩ ≤ 0. This is equivalent to −((1 + a)ξ' + aν) = proj_C(ν), from which it follows that ξ' = ξ is the unique solution.
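Lemma A.1 can be checked numerically in a simple setting. For the compact convex set C = [a, b] ⊂ ℝ, the sketch below computes μ ∈ arg min_{μ∈C} ⟨μ, η⟩, sets ξ = η − μ, and verifies that η = proj_{C+ξ}(0); the interval example and function names are ours.

```python
def argmin_linear(a, b, eta):
    """Minimize <mu, eta> over the interval C = [a, b] in R."""
    return a if eta > 0 else b if eta < 0 else a  # any point is optimal if eta == 0

def proj_interval(lo, hi, p):
    """Euclidean projection of p onto the interval [lo, hi]."""
    return min(hi, max(lo, p))

a, b, eta = -1.0, 3.0, 2.0
mu = argmin_linear(a, b, eta)   # mu = -1, the minimizer of <mu, eta> over C
xi = eta - mu                   # xi = eta - mu, as in Lemma A.1
# C + xi = [a + xi, b + xi]; the lemma predicts proj_{C + xi}(0) = eta
```

Projecting the origin onto the shifted interval indeed recovers η, for both signs of η.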

Figure 1: Coordinates (x_1, x_2, x_3) of the sequences (x^k)_{k≥0} for SD, AccG and AccG w\o Q. Line plot for 1000 iterations with a filled circle every 50 iterations to compare the velocities.

Figure 2: Function values (f_i(x^k))_{k≥0} of the iterates for the objective functions i = 1, 2, 3 for the different algorithms.

Figure 3: Function values of the objective functions in the image space for the different algorithms and different maximum numbers of iterations k_max = 50, 250, 1000.

2)^T and perform k_max = 1000 iterations. The step size is set to s = 5e−3. The sequences and the function values of the objective functions are shown in Figure 4. Similarly to the first experiment, in Figure 4a the sequences (x^k)_{k≥0} of the accelerated methods advance faster in the beginning. While the algorithms SD and AccG converge to the same element of the Pareto set, the algorithm AccG w\o Q produces a trajectory that deviates from the trajectories of SD and AccG and moves to a different part of the Pareto set. The values of the objective functions in Figures 4b and 4c indicate a similar behavior. For the accelerated methods we observe a faster decrease in the beginning, and we note that the function values for SD and AccG converge to similar values. In Figure 5 we use 100 random starting points uniformly sampled in [−2, 2]^2. For the experiments we use different maximal numbers of iterations k_max. In addition, we stop the algorithms if ‖f(x^k) − f(x^{k−1})‖_∞ < 1e−4. Comparing Figures 5a, 5b and 5c, we

Figure 4: Sequences (x^k)_{k≥0} and function values (f_i(x^k))_{k≥0} of the iterates for i = 1, 2. For the sequences, we use a line plot for 1000 iterations with a filled circle every 50 iterations to compare the velocities.

Figure 5: Values of the objective functions in the image space for different maximum numbers of iterations k_max = 50, 250, 1000 for the different algorithms SD, AccG and AccG w\o Q.
see that AccG and AccG w\o Q experience fast convergence within the first 200 iterations. Comparing the different objective functions in Figures 2a, 2b and 2c, we see that AccG and AccG w\o Q yield outputs with similar function values for all objective functions.

Table 1
Total numbers of iterations and computation times for the experiments. The accelerated methods require fewer iterations: compared to SD, AccG requires only approximately 25 % and AccG w\o Q only approximately 50 % of the iterations. For the computation times, the results are different. SD and AccG behave similarly, with AccG requiring approximately 25 % of the computation time required for SD. However, AccG w\o Q needs less than 2 % of the time consumed by AccG. This improvement stems from the quadratic optimization problems, which are not required in AccG w\o Q.