Fast Optimistic Gradient Descent Ascent (OGDA) method in continuous and discrete time



Introduction
Let $\mathcal{H}$ be a real Hilbert space and $V \colon \mathcal{H} \to \mathcal{H}$ a monotone and continuous operator. We are interested in developing fast converging methods aimed at finding a zero of $V$, or, in other words, at solving the monotone equation
$$V(z) = 0, \qquad (1)$$
for which we assume that it has a nonempty solution set $\mathcal{Z}$. The monotonicity and the continuity of $V$ imply that $z^*$ is a solution of (1) if and only if it is a solution of the following variational inequality
$$\langle z - z^*, V(z) \rangle \geq 0 \quad \forall z \in \mathcal{H}. \qquad (2)$$
One of the main motivations to study (1) comes from minimax problems. More precisely, consider the problem
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \Phi(x, y), \qquad (3)$$
where $\mathcal{X}$ and $\mathcal{Y}$ are real Hilbert spaces and $\Phi \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a continuously differentiable and convex-concave function, i.e., $\Phi(\cdot, y)$ is convex for every $y \in \mathcal{Y}$ and $\Phi(x, \cdot)$ is concave for every $x \in \mathcal{X}$. A solution of (3) is a saddle point $(x^*, y^*) \in \mathcal{X} \times \mathcal{Y}$ of $\Phi$, which means that it fulfills
$$\Phi(x^*, y) \leq \Phi(x^*, y^*) \leq \Phi(x, y^*) \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}$$
or, equivalently,
$$\nabla_x \Phi(x^*, y^*) = 0 \quad \text{and} \quad -\nabla_y \Phi(x^*, y^*) = 0.$$
Taking into account that the mapping
$$(x, y) \mapsto \left( \nabla_x \Phi(x, y), -\nabla_y \Phi(x, y) \right) \qquad (5)$$
is monotone ([43]), the problem of finding a saddle point of $\Phi$ eventually brings us back to problem (1).
Both (1) and (3) are fundamental models in various fields such as optimization, economics, game theory, and partial differential equations. They have recently regained significant attention, in particular in the machine learning and data science communities, due to the fundamental role they play, for instance, in multi-agent reinforcement learning [37], robust adversarial learning [32] and generative adversarial networks (GANs) [24, 18].
In this paper we develop fast continuous time dynamics as well as numerical algorithms for solving (1) and investigate their asymptotic/convergence properties. First we formulate a second order dynamical system that combines a vanishing damping term with the time derivative of $V$ along the trajectory, which can be seen as an analogue of the Hessian-driven damping in case the operator originates from a potential. A continuously differentiable and nondecreasing function $\beta \colon [t_0, +\infty) \to (0, +\infty)$, which appears in the system, plays an important role in the analysis. If $\beta$ satisfies a specific growth condition, which is for instance satisfied by polynomials, including constant functions, then the method exhibits convergence rates of order $o\left(\frac{1}{t\beta(t)}\right)$ for $\|V(z(t))\|$, where $z(t)$ denotes the generated trajectory, and for the restricted gap function associated with (2). In addition, $z(t)$ converges weakly to a solution of (1) as $t \to +\infty$.
By considering a temporal discretization of the dynamical system we obtain an implicit numerical algorithm which exhibits convergence rates of order $o\left(\frac{1}{k\beta_k}\right)$ for $\|V(z_k)\|$ and for the restricted gap function associated with (2), where $(\beta_k)_{k \geq 0}$ is a nondecreasing sequence and $(z_k)_{k \geq 0}$ is the generated sequence of iterates. For the latter we also prove that it converges weakly to a solution of (1).
By a further, more involved discretization of the dynamical system we obtain an explicit numerical algorithm which, under the additional assumption that $V$ is Lipschitz continuous, exhibits convergence rates of order $o\left(\frac{1}{k}\right)$ for $\|V(z_k)\|$ and for the restricted gap function associated with (2), where $(z_k)_{k \geq 0}$ is the generated sequence of iterates, which is also shown to converge weakly to a solution of (1).
The resulting numerical schemes can be seen as accelerated versions of the Optimistic Gradient Descent Ascent (OGDA) method ([33, 42]) formulated in terms of a general monotone operator $V$. It should also be emphasized that the convergence rate statements for both the implicit and the explicit numerical algorithm are last iterate convergence results and are, to our knowledge, the best known convergence rate results for monotone equations.

Related works
In the following we discuss some discrete and continuous time methods from the literature designed to solve equations governed by monotone and (Lipschitz) continuous, but not necessarily cocoercive operators. It has been recognized that the simplest scheme one can think of, namely the forward algorithm, which, for a starting point $z_0 \in \mathcal{H}$ and a given step size $s > 0$, reads for $k \geq 0$
$$z_{k+1} := z_k - sV(z_k),$$
and mimics the classical gradient descent algorithm, does not converge. Except in trivial cases, the operator in (5), which arises in connection with minimax problems, is only monotone and Lipschitz continuous but not cocoercive. Therefore, it was recognized early on that explicit numerical methods for monotone equations require an operator corrector term.
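The non-convergence of the forward iteration can be checked directly on the planar rotation field $V(x, y) = (y, -x)$, a textbook example of a monotone, $1$-Lipschitz, non-cocoercive operator; the sketch below (plain NumPy, purely illustrative) shows the iterates spiraling away from the unique zero.

```python
import numpy as np

# Monotone, 1-Lipschitz, but not cocoercive operator: the rotation
# V(x, y) = (y, -x); its only zero is the origin.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
V = lambda z: A @ z

s = 0.1                      # any fixed step size s > 0
z = np.array([1.0, 0.0])
for _ in range(100):
    z = z - s * V(z)         # forward (explicit) iteration

# Since A is skew, ||z_{k+1}||^2 = (1 + s^2) ||z_k||^2: the iterates
# spiral outward at the rate (1 + s^2)^{k/2}, for any s > 0.
print(np.linalg.norm(z))
```

The same computation shows why no choice of constant step size can help: the expansion factor $\sqrt{1 + s^2}$ exceeds $1$ for every $s > 0$.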
In case $V$ is monotone and $L$-Lipschitz continuous, for $L > 0$, Korpelevich [30] and Antipin [2] proposed for solving (1) the nowadays very popular Extragradient (EG) method, which reads for $k \geq 0$
$$\bar{z}_k := z_k - sV(z_k), \qquad z_{k+1} := z_k - sV(\bar{z}_k), \qquad (6)$$
and converges for a starting point $z_0 \in \mathcal{H}$ and $0 < s < \frac{1}{L}$ to a zero of $V$. The last iterate convergence rate for the Extragradient method was only recently derived by Gorbunov-Loizou-Gidel in [25]. For $\bar{z} \in \mathcal{H}$ and $\delta > 0$ we denote $B(\bar{z}; \delta) := \{u \in \mathcal{H} : \|\bar{z} - u\| \leq \delta\}$. For $z^* \in \mathcal{Z}$ and $\delta(z_0) := \|z^* - z_0\|$, the restricted gap function associated with the variational inequality (2) is defined as (see [36])
$$\mathrm{Gap}(z) := \sup_{u \in B(z^*; \delta(z_0))} \langle z - u, V(u) \rangle \geq 0.$$
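On the same rotation field, the extrapolation step of (6) restores convergence; a minimal sketch, with the step size chosen in $(0, 1/L)$:

```python
import numpy as np

# Extragradient on the rotation field V(z) = Az (monotone, L = 1).
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
V = lambda z: A @ z

s = 0.5                          # 0 < s < 1/L with L = 1
z = np.array([1.0, 0.0])
for _ in range(200):
    z_half = z - s * V(z)        # extrapolation step
    z = z - s * V(z_half)        # correction step using V at the midpoint

print(np.linalg.norm(z))         # contracts towards the zero z* = 0
```

Here the one-step map is $(1 - s^2)I - sA$, with contraction factor $\sqrt{(1 - s^2)^2 + s^2} < 1$, which is exactly the effect the operator corrector term is meant to produce.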
In [25] it was shown that the Extragradient method exhibits a last iterate convergence rate of order $O(1/\sqrt{k})$ for $\|V(z_k)\|$ and for the restricted gap function. In the same setting, Popov introduced in [42] for minimax problems and the operator in (5) the following algorithm which, when formulated for (1), reads for $k \geq 1$
$$z_{k+1} := z_k - 2sV(z_k) + sV(z_{k-1}), \qquad (7)$$
and converges for starting points $z_0, z_1 \in \mathcal{H}$ and step size $0 < s < \frac{1}{2L}$ to a zero of $V$. This algorithm is usually known as the Optimistic Gradient Descent Ascent (OGDA) method, a name which we adopt also for the general formulation in (7). Recently, Chavdarova-Jordan-Zampetakis proved in [19] that for $0 < s < \frac{1}{16L}$ the scheme exhibits a best-iterate convergence rate of order $O(1/\sqrt{k})$. We notice also that, according to Golowich-Pattathil-Daskalakis-Ozdaglar (see [22, 23]), the lower bound for the restricted gap function for the algorithms (6) and (7) is of order $O(1/\sqrt{k})$ as $k \to +\infty$.

The solving of equation (1) can also be addressed in the general framework of continuous and discrete time methods for finding the zeros of a maximally monotone operator. Attouch-Svaiter introduced in [14] (see also [20]) a first order evolution equation linked to the Newton and the Levenberg-Marquardt methods, which, when applied to (1), reads
$$\dot{z}(t) + \lambda(t) \frac{d}{dt} V(z(t)) + \lambda(t) V(z(t)) = 0, \qquad (8)$$
where $t \mapsto \lambda(t)$ is a continuous mapping, and for which they proved that its trajectories converge weakly to a zero of $V$. Attouch-Peypouquet studied in [12] the following second order differential equation with vanishing damping
$$\ddot{z}(t) + \frac{\alpha}{t} \dot{z}(t) + A_{\gamma(t)}(z(t)) = 0, \qquad (9)$$
where $A \colon \mathcal{H} \to \mathcal{H}$ is a possibly set-valued maximally monotone operator, $A_\gamma := \frac{1}{\gamma}(\mathrm{Id} - J_{\gamma A})$ stands for the Yosida approximation of $A$ of index $\gamma > 0$, and $J_{\gamma A} = (\mathrm{Id} + \gamma A)^{-1} \colon \mathcal{H} \to \mathcal{H}$ for the resolvent of $\gamma A$. The dynamical system (9) gives rise via implicit discretization to the so-called Regularized Inertial Proximal Algorithm, where $z_0, z_1 \in \mathcal{H}$ are the starting points, $\alpha > 2$, $s > 0$ and $\gamma_k = (1 + \varepsilon) s \frac{k^2}{\alpha^2}$ for every $k \geq 1$, with $\varepsilon > 0$ fixed. In [12] it was shown that the discrete velocity $z_{k+1} - z_k$ vanishes with a rate of convergence of $O(1/k)$ as $k \to +\infty$ and that the sequence of iterates converges weakly to a zero of $A$. The continuous time approach in (9) has been extended by Attouch-László in [9] by adding a Newton-like correction term $\xi \frac{d}{dt}\left(A_{\gamma(t)}(z(t))\right)$, with $\xi \geq 0$, whereas the discrete counterpart of this scheme was proposed and investigated in [10].
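Popov's scheme (7) needs only one fresh evaluation of $V$ per iteration, reusing the previous one as a correction term; a minimal sketch on the same rotation field:

```python
import numpy as np

# OGDA (Popov) on the rotation field V(z) = Az: a single new evaluation
# of V per iteration, with the stale value V(z_{k-1}) as correction.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
V = lambda z: A @ z

s = 0.2                                  # 0 < s < 1/(2L), L = 1
z_prev = np.array([1.0, 0.0])
z = z_prev.copy()
for _ in range(400):
    z, z_prev = z - 2 * s * V(z) + s * V(z_prev), z

print(np.linalg.norm(z))                 # converges to the zero z* = 0
```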
For an inertial evolution equation with asymptotically vanishing damping terms approaching the set of primal-dual solutions of a smooth convex optimization problem with linear equality constraints, which can also be seen as the solution set of a monotone operator equation, and exhibiting fast convergence rates expressed in terms of the function values, the feasibility measure and the primal-dual gap, we refer to the recent works [6, 17].
We also want to mention the implicit method for finding the zeros of a maximally monotone operator proposed by Kim in [29], which relies on the performance estimation problem approach and makes use of computer-assisted tools.
In the case when $V$ is monotone and $L$-Lipschitz continuous, for $L > 0$, Yoon-Ryu recently proposed in [49] an accelerated algorithm for solving (1), called the Extra Anchored Gradient (EAG) algorithm, designed by using anchor variables, a technique that can be traced back to Halpern's algorithm (see [28]). The iterative scheme of the EAG algorithm reads for every $k \geq 0$
$$\bar{z}_k := z_k + \frac{1}{k+2}(z_0 - z_k) - s_k V(z_k), \qquad z_{k+1} := z_k + \frac{1}{k+2}(z_0 - z_k) - s_k V(\bar{z}_k), \qquad (10)$$
where $z_0 \in \mathcal{H}$ is the starting point and the sequence of step sizes $(s_k)_{k \geq 0}$ is either chosen to be equal to a constant in the interval $\left(0, \frac{1}{8L}\right]$ or defined by the recursive rule (11) with $s_0 \in \left(0, \frac{3}{4L}\right)$. This iterative scheme exhibits in both cases a convergence rate of order $O(1/k)$ for $\|V(z_k)\|$. Later, Lee-Kim proposed in [31] an algorithm formulated in the same spirit for the problem of finding the saddle points of a smooth nonconvex-nonconcave function.
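A sketch of the constant step size variant of (10) on the rotation field from above; the anchor coefficient $1/(k+2)$ and the admissible range $\left(0, \frac{1}{8L}\right]$ are taken from the display above, everything else is illustrative:

```python
import numpy as np

# Extra Anchored Gradient (EAG) with constant step size: each update
# pulls the iterate back toward the anchor z0 with vanishing weight.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
V = lambda z: A @ z

L = 1.0
s = 1.0 / (8 * L)                    # constant step in (0, 1/(8L)]
z0 = np.array([1.0, 0.0])
z = z0.copy()
for k in range(2000):
    beta = 1.0 / (k + 2)             # anchor coefficient
    z_half = z + beta * (z0 - z) - s * V(z)
    z = z + beta * (z0 - z) - s * V(z_half)

print(np.linalg.norm(V(z)))          # ||V(z_k)|| decays like O(1/k)
```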
Further variants of the anchoring based method have been proposed by Tran-Dinh in [47] and, together with Luo, in [48], which all exhibit the same convergence rate for $\|V(z_k)\|$ as EAG. Tran-Dinh in [47] and Park-Ryu in [40] pointed out the existence of connections between the anchoring approach and Nesterov's acceleration technique used for the minimization of smooth convex functions ([34, 35]).

Our contributions
The starting point of our investigations is a second order evolution equation associated with problem (1) that combines a vanishing damping term with the time derivative of $V$ along the trajectory, and which then leads via temporal discretizations to the implicit and the explicit algorithms. In [19] several dynamical systems of EG and OGDA type were proposed, mainly in the spirit of the heavy ball method, that is, with a constant damping term, exhibiting a convergence rate of $\|V(z(t))\| = O(1/\sqrt{t})$ as $t \to +\infty$ and, in case $V$ is bilinear, weak convergence of the trajectory $z(t)$ to a zero of the operator. One of the main discoveries of the last decade was that asymptotically vanishing damping terms (see [46, 4, 13]) lead to an acceleration of the convergence of the function values along the trajectories of inertial gradient systems. Moreover, when enhancing the evolution equations also with Hessian-driven damping terms, the rate of convergence of the gradient along the trajectories can be accelerated, too ([13, 45]). It is natural to ask whether asymptotically vanishing damping terms have the same accelerating impact on the norm of the governing operator along the trajectories of inertial dynamical systems associated with monotone (not necessarily potential) operators.
For $z^* \in \mathcal{Z}$ and the trajectory generated by this dynamical system we will prove convergence rates of order $o\left(\frac{1}{t\beta(t)}\right)$ for $\|V(z(t))\|$ and for the restricted gap function. Polynomial parameter functions $\beta(t) = \beta_0 t^\rho$, for $\beta_0 > 0$ and $\rho \geq 0$, satisfy the two growth conditions for $\alpha \geq \rho + 2$ and $\alpha > \rho + 2$, respectively.
Among the main contributions of this work is not only the improvement of the convergence rates in [19] in both continuous and discrete time, but in particular the surprising discovery that this can be achieved by means of asymptotically vanishing damping, respectively, as we will see below, of Nesterov momentum. This shows that the accelerating effect of inertial methods with asymptotically vanishing damping/Nesterov momentum goes beyond convex optimization and opens the gate towards new and unexpected research perspectives.
Further we provide two temporal discretizations of the dynamical system, one of implicit and one of explicit type.
Implicit Fast OGDA: Let $\alpha > 2$, $z_0, z_1 \in \mathcal{H}$, $s > 0$, and $(\beta_k)_{k \geq 1}$ a positive and nondecreasing sequence which satisfies the discrete growth condition stated below. For every $k \geq 1$ the iterates are given by the update rule of Algorithm 1 below. We will prove that, for $z^* \in \mathcal{Z}$, the method exhibits convergence rates of order $o\left(\frac{1}{k\beta_k}\right)$ for $\|V(z_k)\|$ and for the restricted gap function, and that the sequence $(z_k)_{k \geq 0}$ converges weakly to a solution in $\mathcal{Z}$. The constant sequence $\beta_k \equiv 1$ obviously satisfies the growth condition required in the implicit numerical scheme, and for this choice the general statement yields convergence rates of order $o\left(\frac{1}{k}\right)$, while $(z_k)_{k \geq 0}$ converges weakly to a solution in $\mathcal{Z}$. A further contribution of this work is therefore this numerical algorithm with Nesterov momentum for solving (1), obtained by implicit temporal discretization of the inertial evolution equation, which reproduces all its convergence properties in discrete time.
For the explicit discrete scheme only, we will additionally assume that the operator $V$ is $L$-Lipschitz continuous, with $L > 0$.
Explicit Fast OGDA: Let $\alpha > 2$, $z_0, z_1, \bar{z}_0 \in \mathcal{H}$, and $0 < s < \frac{1}{2L}$. For every $k \geq 1$ we perform the update given in Algorithm 2 below. When taking a closer look at its equivalent formulation, one can notice that the iterative scheme can be seen as an accelerated version of the OGDA method.
An important feature of the explicit Fast OGDA method is that it requires the evaluation of $V$ only at the elements of the sequence $(\bar{z}_k)_{k \geq 0}$, while the Extragradient method (6) and the Extra Anchored Gradient method (10) require the evaluation of $V$ at both sequences $(z_k)_{k \geq 0}$ and $(\bar{z}_k)_{k \geq 0}$. We will show that, for $z^* \in \mathcal{Z}$, convergence rates of order $o\left(\frac{1}{k}\right)$ hold for $\|V(z_k)\|$ and for the restricted gap function, and that also for this algorithm the generated sequence $(z_k)_{k \geq 0}$ converges weakly to a solution in $\mathcal{Z}$. Another main contribution of this work is thus the explicit Fast OGDA method with Nesterov momentum and operator correction terms, for which we show the best convergence rate results known in the literature for explicit algorithms for monotone inclusions, together with the convergence of the iterates to a zero of the operator. We illustrate the theoretical findings with numerical experiments, which show the clear superiority of our method over other numerical algorithms designed to solve monotone equations governed by monotone and Lipschitz continuous operators. These include the algorithms designed by using "anchoring" techniques, for which the tracing of the iterates back to the starting point seems to have a slowing effect on the convergence performance.
Remark 2. (the role of the time scaling parameter function $\beta$) The function $\beta$ which appears in the formulation of the dynamical system can be seen as a time scaling parameter function in the spirit of recent investigations on this topic (see, for instance, [5, 7]) in the context of the minimization of a smooth convex function. It was shown that, when used in combination with vanishing damping (and also with Hessian-driven damping) terms, time scaling functions improve the convergence rates of the function values and of the gradient. The positive effect of the time scaling on the convergence rates can be transferred to the numerical schemes obtained via implicit discretization, as was recently pointed out by Attouch-Chbani-Riahi in [8], and long ago by Güler in [26, 27] for the proximal point algorithm, which may exhibit convergence rates for the objective function values of order $o(1/k^\rho)$, for arbitrary $\rho > 0$. On the other hand, this does not hold for numerical schemes obtained via explicit discretization, as is the case for the inertial gradient method, for which it is known that the convergence rate of $o(1/k^2)$ for the objective function values (see [11]) cannot be improved in general ([34, 35]).
This explains why the discretization of the parameter function β appears only in the implicit numerical scheme and in the corresponding convergence rates, and not in the explicit numerical scheme.
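The benefit of growing step sizes for an implicitly discretized scheme can be observed already on the proximal point algorithm itself; the sketch below is illustrative only (a planar skew-symmetric linear field is chosen solely because its resolvent is a $2 \times 2$ linear solve), comparing a constant step size against steps growing like $k^2$:

```python
import numpy as np

# Proximal point iterations z_{k+1} = (I + g_k A)^{-1} z_k for the
# rotation field V(z) = Az. Each resolvent step shrinks the norm by
# the factor 1/sqrt(1 + g_k^2), so growing g_k accelerates the decay.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
I = np.eye(2)

def ppa(steps):
    z = np.array([1.0, 0.0])
    for g in steps:
        z = np.linalg.solve(I + g * A, z)      # resolvent of g*A
    return np.linalg.norm(z)

K = 50
r_const = ppa([1.0] * K)                        # constant step size
r_scaled = ppa([(k + 1) ** 2 for k in range(K)])  # g_k ~ k^2 ("time scaling")

print(r_const, r_scaled)     # the scaled run reaches a far smaller residual
```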

The continuous time approach
In this section we analyze the continuous time scheme proposed for (1), which we recall for convenience in the following.
Let $z^* \in \mathcal{Z}$ and $0 \leq \lambda \leq \alpha - 1$. We consider the following energy function $\mathcal{E}_\lambda \colon [t_0, +\infty) \to [0, +\infty)$, which will play a fundamental role in our analysis. By taking into consideration (2), for every $0 \leq \lambda \leq \alpha - 1$ we have $\mathcal{E}_\lambda(t) \geq 0$ for all $t \geq t_0$. The growth condition (13) guarantees that $w(t) \geq 0$ for every $t \geq t_0$. First we will show that the energy dissipates with time.
Lemma 3. Let $z \colon [t_0, +\infty) \to \mathcal{H}$ be a solution of (12), $z^* \in \mathcal{Z}$ and $0 \leq \lambda \leq \alpha - 1$. This, in combination with $\left\langle \dot{z}(t), \frac{d}{dt} V(z(t)) \right\rangle \geq 0$ for every $t \geq t_0$, which is a consequence of the monotonicity of $V$, leads to (16).
The following theorem provides first convergence rates, which follow as a direct consequence of the previous lemma. Since $\beta$ is positive and nondecreasing, we have $\lim_{t \to +\infty} t\beta(t) = +\infty$.
Theorem 4. Let $z \colon [t_0, +\infty) \to \mathcal{H}$ be a solution of (12) and $z^* \in \mathcal{Z}$. For every $t \geq t_0$ the stated estimates hold, and the corresponding improper integrals over $[t_0, +\infty)$ are finite. If we assume in addition the growth condition (21), then the trajectory $t \mapsto z(t)$ is bounded, a further improper integral over $[t_0, +\infty)$ is finite, and the limit $\lim_{t \to +\infty} \mathcal{E}_\lambda(t) \in \mathbb{R}$ exists for every $\lambda$ satisfying $0 \leq \lambda \leq \alpha - 1$.
The existence and uniqueness of solutions for (12) can be guaranteed in a very general setting, which includes that of continuously differentiable operators defined on finite-dimensional spaces, which are obviously Lipschitz continuous on bounded sets. The proof of Theorem 5 is provided in the Appendix and relies on showing that the maximal solution given by the Cauchy-Lipschitz Theorem is a global solution.
Theorem 5. Let $\alpha > 2$ and assume that $V \colon \mathcal{H} \to \mathcal{H}$ is continuously differentiable, that $\beta \colon [t_0, +\infty) \to (0, +\infty)$ is a continuously differentiable and nondecreasing function which satisfies condition (21), and that $V$ and $\dot{\beta}$ are Lipschitz continuous on bounded sets. Then for every initial conditions $z(t_0) = z_0 \in \mathcal{H}$ and $\dot{z}(t_0) = \dot{z}_0 \in \mathcal{H}$ the dynamical system (12) has a unique global twice continuously differentiable solution $z \colon [t_0, +\infty) \to \mathcal{H}$.
Further we prove that, under the slightly stronger growth condition (21), the trajectories of the dynamical system (12) converge to a zero of $V$. This phenomenon is also present in inertial gradient systems with asymptotically vanishing damping terms, where it concerns the coefficient $\alpha$, too.

Theorem 6. Let $\alpha > 2$, let $z \colon [t_0, +\infty) \to \mathcal{H}$ be a solution of (12), and assume that $\beta \colon [t_0, +\infty) \to (0, +\infty)$ satisfies the growth condition (21). Then $z(t)$ converges weakly to a solution of (1) as $t \to +\infty$.
Finally, let $\bar{z}$ be a weak sequential cluster point of the trajectory $z(t)$ as $t \to +\infty$. This means that there exists a sequence $(z(t_n))_{n \geq 0}$ such that $z(t_n) \rightharpoonup \bar{z}$ as $n \to +\infty$, where $\rightharpoonup$ denotes weak convergence. On the other hand, Theorem 4 ensures that $V(z(t_n)) \to 0$ as $n \to +\infty$. Since $V$ is monotone and continuous, it is maximally monotone (see, for instance, [16, Corollary 20.28]). Therefore, the graph of $V$ is sequentially closed in $\mathcal{H}_{\mathrm{weak}} \times \mathcal{H}_{\mathrm{strong}}$, which means that $V(\bar{z}) = 0$. In other words, the hypothesis (ii) of Opial's Lemma also holds, and the proof is complete.
Next we will see that under the growth condition (21) the convergence rates obtained in Theorem 4 can be improved from O to o, which is also a phenomenon known for inertial gradient systems with asymptotically vanishing damping terms.
Theorem 7. Let $\alpha > 2$, let $z \colon [t_0, +\infty) \to \mathcal{H}$ be a solution of (12), $z^* \in \mathcal{Z}$, and assume that $\beta \colon [t_0, +\infty) \to (0, +\infty)$ satisfies the growth condition (21). Then the stated rates of order $o$ hold.

Proof. For every $0 \leq \lambda \leq \alpha - 1$ the energy function of the system can be written as stated, where the last equation comes from the definition of $p(t)$ in (28) and the corresponding integration formula. Recalling that both limits $\lim_{t \to +\infty} \mathcal{E}_\lambda(t) \in \mathbb{R}$ and $\lim_{t \to +\infty} p(t) \in \mathbb{R}$ exist (see Theorem 4 and (30)), we conclude that the limit of $h \colon [t_0, +\infty) \to \mathbb{R}$, $h(t) := t^2 \|\dot{z}(t) + \beta(t) V(z(t))\|^2$, exists, and finally that $\lim_{t \to +\infty} t\beta(t) \|V(z(t))\| = 0$.
Remark 8. One of the anonymous referees made an excellent observation regarding the asymptotic behaviour of the trajectories, on which we elaborate in the following. For the first order system attached to (1),
$$\dot{u}(t) + V(u(t)) = 0, \qquad (34)$$
it is known that the solution trajectories converge weakly in an ergodic (averaged) sense towards a zero of $V$. In other words, there exists $z^* \in \mathcal{Z}$ such that $z(t) := \frac{1}{t} \int_0^t u(s) \, ds \rightharpoonup z^*$ as $t \to +\infty$ (see, for instance, [15, 41]).
This leads to the natural idea of considering the averaged trajectory $z$, which fulfills
$$\dot{z}(t) + \frac{1}{t}(z(t) - u(t)) = 0, \qquad (35)$$
and of deriving the equation of its dynamics from (34). For more details on this very powerful approach we refer the reader to [3]. A Taylor expansion then leads to a second order dynamical system with correction term $\frac{d}{dt} V(z(t))$, which is of the same type as (12). This approach suggests that one can expect the non-ergodic convergence of the solution trajectory of (12) to a zero of $V$. The function $\beta$ can be "inserted" into the system through time scaling approaches aimed at speeding up its convergence behaviour (see also [3, 5, 7, 8] for related ideas).
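The ergodic convergence for (34) can be visualized on the planar rotation field: the exact flow circles the zero forever, while its time average tends to it at a rate $O(1/t)$; a minimal sketch:

```python
import numpy as np

# For the planar rotation field V(u) = Au, A = [[0,1],[-1,0]], the flow
# u'(t) = -V(u(t)) with u(0) = (1, 0) has the exact solution
# u(t) = (cos t, sin t): it circles the zero z* = 0 without converging.
T = 200.0
t = np.linspace(0.0, T, 200001)
u = np.stack([np.cos(t), np.sin(t)], axis=1)

# Ergodic mean (1/T) \int_0^T u(s) ds, approximated on the uniform grid:
z_bar = u.mean(axis=0)

print(np.linalg.norm(u[-1]))   # the trajectory stays on the unit circle
print(np.linalg.norm(z_bar))   # the time average decays like O(1/T)
```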

An implicit numerical algorithm
In this section we formulate and investigate an implicit numerical algorithm which follows from a temporal discretization of the dynamical system (12). We recall that the latter can be equivalently written as a first order system (36) in $(z, u)$ (see the proof of Theorem 5), with
$$u(t) = 2(\alpha - 1) z(t) + 2t \dot{z}(t) + 2t \beta(t) V(z(t))$$
and with the initializations $z(t_0) = z_0$ and $\dot{z}(t_0) = \dot{z}_0$. We fix a time step $s > 0$, set $\tau_k := s(k+1)$ and $\sigma_k := sk$ for every $k \geq 1$, and approximate $z(\tau_k) \approx z_{k+1}$, $u(\tau_k) \approx u_{k+1}$, and $\beta(\sigma_k) \approx \beta_k$. The implicit finite-difference scheme for (36) at time $t := \tau_k$ for $(z, u)$ and at time $t := \sigma_k$ for $\beta$ gives the system (37) for every $k \geq 1$, with the initializations $u_1 := z_0$ and $u_0 := z_0 - s\dot{z}_0$. By subtracting consecutive relations, where the last relation comes from the first equation in (37), we deduce the update rule of the algorithm for every $k \geq 1$. The algorithm can further be equivalently written in resolvent form and is therefore well-defined due to the maximal monotonicity of $V$. We also want to point out that the discrete version of the growth condition (21) reads as in (41), where $(\beta_k)_{k \geq 0}$ is a positive and nondecreasing sequence. This means that there exists some $0 \leq \varepsilon < \alpha - 2$ such that (39) holds; in addition, a corresponding estimate holds for every $k \geq \lceil \alpha \rceil$. To sum up, the implicit algorithm we propose for solving (1) is formulated below.
Algorithm 1. (Implicit Fast OGDA) Let $\alpha > 2$, $z_0, z_1 \in \mathcal{H}$, $s > 0$, and $(\beta_k)_{k \geq 0}$ a positive and nondecreasing sequence which satisfies (41). For every $k \geq 1$ we perform the update stated above. Inspired by the continuous setting, we consider for $0 \leq \lambda \leq \alpha - 1$ the sequence $(\mathcal{E}_\lambda^k)_{k \geq 1}$ defined for every $k \geq 1$ as the discrete version of the energy function considered in the previous section. The following lemma shows that the discrete energy dissipates with every iteration of the algorithm. Its proof can be found in the Appendix. Lemma 9 is the essential ingredient for the derivation of the convergence rates in Theorem 10.

Lemma 9. Let $z^* \in \mathcal{Z}$ and $(z_k)_{k \geq 0}$ be the sequence generated by Algorithm 1 for $(\beta_k)_{k \geq 0}$ a positive and nondecreasing sequence which satisfies (41). Then the stated energy inequality holds for every $0 \leq \lambda \leq \alpha - 1$ and every $k \geq \lceil \alpha \rceil$, where $\varepsilon$ is chosen to fulfill (39).
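The resolvent step that makes the implicit scheme well-defined is computationally cheap whenever it admits a closed form; for an affine monotone operator it reduces to a linear solve. The sketch below is generic (the scalar c and the right-hand side rhs are placeholders, not the actual coefficients of Algorithm 1):

```python
import numpy as np

# A generic implicit step has the form  w + c * V(w) = rhs  with c > 0.
# For an affine monotone V(z) = Az + b (A skew-symmetric here) this
# resolvent evaluation is a single linear solve.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
b = np.array([0.5, -0.5])
V = lambda z: A @ z + b

c = 0.7                                   # placeholder coefficient
rhs = np.array([1.0, 2.0])                # placeholder right-hand side
w = np.linalg.solve(np.eye(2) + c * A, rhs - c * b)   # solves w + c V(w) = rhs

print(np.linalg.norm(w + c * V(w) - rhs))  # residual at machine precision
```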
From (39) we deduce that $(k + 2 - \alpha)\beta_k - k\beta_{k-1} \leq -\varepsilon \beta_k$ for every $k \geq 1$. Hence, for every $\lambda$ with $\alpha - 1 - \frac{\varepsilon}{4} < \lambda < \alpha - 1$, from Lemma 9 and (46) we have that the stated inequality holds for sufficiently large $k$, which means that the sequence $(\mathcal{E}_\lambda^k)_{k \geq 1}$ is nonincreasing for sufficiently large $k$; thus it is convergent, and the boundedness of $(z_k)_{k \geq 0}$ and the convergence rates follow from the definition of $\mathcal{E}_\lambda^k$ and (40). The remaining assertions follow from Lemma A.6.
Next we prove the weak convergence of the generated sequence of iterates.
Theorem 11. Let $z^* \in \mathcal{Z}$ and $(z_k)_{k \geq 0}$ be the sequence generated by Algorithm 1 for $(\beta_k)_{k \geq 0}$ a positive and nondecreasing sequence which satisfies (41). Then the sequence $(z_k)_{k \geq 0}$ converges weakly to a solution of (1).
From Theorem 10 we deduce that $(q_k)_{k \geq 1}$ is bounded. This allows us to apply Lemma A.5 and to conclude from here that $\lim_{k \to +\infty} q_k \in \mathbb{R}$ exists. Once again, by the definition of $q_k$ and the fact that the sequence $\left( \sum_{i=1}^{k} \beta_{i-1} \langle z_i - z^*, V(z_i) \rangle \right)_{k \geq 1}$ converges, it follows that $\lim_{k \to +\infty} \|z_k - z^*\| \in \mathbb{R}$ exists. In other words, the hypothesis (i) in Opial's Lemma (see Lemma A.3) is fulfilled. Now let $\bar{z}$ be a weak sequential cluster point of $(z_k)_{k \geq 0}$, meaning that there exists a subsequence $(z_{k_n})_{n \geq 0}$ such that $z_{k_n} \rightharpoonup \bar{z}$ as $n \to +\infty$. From Theorem 10 we have $V(z_{k_n}) \to 0$ as $n \to +\infty$. Since $V$ is monotone and continuous, it is maximally monotone [16, Corollary 20.28]. Therefore, the graph of $V$ is sequentially closed in $\mathcal{H}_{\mathrm{weak}} \times \mathcal{H}_{\mathrm{strong}}$, which gives that $V(\bar{z}) = 0$, thus $\bar{z} \in \mathcal{Z}$. This shows that hypothesis (ii) of Opial's Lemma is also fulfilled, and completes the proof.
We close the section with a result which improves the convergence rates derived in Theorem 10 for the implicit algorithm.
Theorem 12. Let $z^* \in \mathcal{Z}$ and $(z_k)_{k \geq 0}$ be the sequence generated by Algorithm 1 for $(\beta_k)_{k \geq 0}$ a positive and nondecreasing sequence which satisfies (41). Then the stated rates of order $o$ hold.

Proof. Let $0 \leq \varepsilon < \alpha - 2$ be such that (39) is satisfied and $0 < \alpha - 1 - \frac{\varepsilon}{4} < \lambda < \alpha - 1$. In view of (47), the discrete energy sequence can be written as stated. According to Theorem 10, together with the fact that the limits $\lim_{k \to +\infty} \mathcal{E}_\lambda^k \in \mathbb{R}$ and $\lim_{k \to +\infty} p_k \in \mathbb{R}$ exist (according to (49)), we deduce that the limit of the sequence $(h_k)_{k \geq 1} \subseteq [0, +\infty)$ exists. Furthermore, by taking into consideration the relation (40), Theorem 10 also guarantees that this limit vanishes. From here we conclude that $\lim_{k \to +\infty} h_k = 0$, and since $h_k$ is a sum of two nonnegative terms and $(\beta_k)_{k \geq 0}$ is nondecreasing, we further deduce that each of these terms vanishes. Using once again (40), we obtain the claimed rate for $\|V(z_k)\|$. Since $(z_k)_{k \geq 0}$ is bounded, we use the Cauchy-Schwarz inequality to derive the claimed rate for the restricted gap function, and the proof is complete.

An explicit algorithm
In this section, in addition to its monotonicity, we will assume that the operator $V$ is $L$-Lipschitz continuous, with $L > 0$. We propose and investigate an explicit numerical algorithm for solving (1), which follows from a temporal discretization of the dynamical system (12). The starting point is again its reformulation (36). We fix a time step $s > 0$, set $\tau_k := s(k+1)$ for every $k \geq 1$, and approximate $z(\tau_k) \approx z_{k+1}$ and $u(\tau_k) \approx u_{k+1}$. In addition, we choose $\beta(\tau_k) = 1$ for every $k \geq 1$ and refer to Remark 2 for the explanation of why the time scaling parameter function $\beta$ is discretized via a constant sequence. The finite-difference scheme for (36) at time $t := \tau_k$ gives the system (50) for every $k \geq 0$. By subtracting consecutive relations, where the last relation comes from the first equation in (50), we obtain (51) for every $k \geq 1$.

On the other hand, the second equation in (50) can be rewritten accordingly for every $k \geq 0$. To get an explicit choice for $\bar{z}_k$, we opt for the update stated above. From here, (51) yields a corresponding relation for all $k \geq 1$; thus, by subtracting (54) from (53), we obtain a further relation. This gives an important estimate, which holds for every $s > 0$ such that $sL \leq 1$ and every $k \geq 1$. Now we can formally state our explicit numerical algorithm.
Algorithm 2. (Explicit Fast OGDA) Let $\alpha > 2$, $z_0, z_1, \bar{z}_0 \in \mathcal{H}$, and $0 < s < \frac{1}{2L}$. For every $k \geq 1$ we perform the update stated above. For $z^* \in \mathcal{Z}$, $0 \leq \lambda \leq \alpha - 1$ and $0 < \gamma < 2$ we define first, in analogy to the implicit case, the discrete energy function $(\mathcal{E}_\lambda^k)_{k \geq 1}$ for every $k \geq 1$. In strong contrast to the implicit case, the discrete energy sequence $(\mathcal{E}_\lambda^k)_{k \geq 1}$ might not dissipate with every iteration of the algorithm and might even be negative. This is the reason why we consider instead the following regularized sequence of the energy function, defined for every $k \geq 2$. Its properties are collected in the following lemma, the proof of which is deferred to the Appendix.
Lemma 13. Let $z^* \in \mathcal{Z}$ and $(z_k)_{k \geq 0}$ be the sequence generated by Algorithm 2 for $0 < \gamma < 2$ and $0 \leq \lambda \leq \alpha - 1$. Then the stated inequalities hold, with coefficients including
$$\omega_0 := (2 - \gamma)\lambda + \gamma - \alpha + \gamma(\lambda + 1 - \alpha), \qquad \omega_2 := 2(\lambda + 1 - \alpha) \leq 0, \qquad \omega_4 := 2\gamma(2 - \alpha) < 0;$$
moreover, (ii) if $1 < \gamma < 2$, then the corresponding inequality holds for every $k \geq k_1$. In Lemma 13 we have two degrees of freedom in the choice of the parameters $\gamma$ and $\lambda$. The next result proves that there are choices of these parameters for which the discrete energy starts to dissipate with every iteration after a finite number of iterations and at the same time is bounded from below by a nonnegative term. These two statements are fundamental in the derivation of the convergence rates and finally in the proof of the convergence of the iterates. The proof of Lemma 14 can be found in the Appendix.

Lemma 14. The following statements are true: (i) if $1 + \frac{1}{\alpha - 1} < \gamma < 2$, then there exist two parameters $\underline{\lambda}(\alpha, \gamma) < \overline{\lambda}(\alpha, \gamma)$ such that for every $\lambda$ satisfying $\underline{\lambda}(\alpha, \gamma) < \lambda < \overline{\lambda}(\alpha, \gamma)$ one can find an integer $k_2(\lambda) \geq 1$ with the property that the stated inequality holds for every $k \geq k_2(\lambda)$; (ii) there exists a positive integer $k_3$ such that for every $k \geq k_3$ it holds
$$\mu_k \geq (2 - \gamma)(1 - 2sL)(k+1)^2. \qquad (66)$$
Now we are in a position to provide first convergence rate statements for Algorithm 2.
This means that for every $k \geq k_4(\lambda) := \max\{k_0, k_2(\lambda), k_3\}$, where $k_0$ is the positive integer provided by Lemma 13 (i), the stated inequality holds. Since $\omega_2 < 0$, $\omega_4 < 0$ and $\omega_3, \omega_5 \geq 0$, we can choose a suitable positive integer $k_5 > 0$, which then means that the inequality (68) holds for every $k \geq k_6 := \max\{k_4(\lambda), k_5\}$. In view of (61) and by taking into account that $\lambda < \frac{\gamma}{2}(\alpha - 1)$, we get that $F_\lambda^k \geq 0$ starting from the index $k_1$; thus the sequence $(F_\lambda^k)_{k \geq 2}$ is bounded from below. Under these premises, we can apply Lemma A.6 to (68) and obtain (i), as well as the convergence of the sequence $(F_\lambda^k)_{k \geq 1}$.

According to (68), the sequence $(F_\lambda^k)_{k \geq k_6}$ is also nonincreasing, which, according to (61), implies that the stated estimate holds for every $k \geq k_6$. This yields that the sequences $\left( 2\lambda(z_k - z^*) + k(z_k - z_{k-1}) + \gamma s k V(\bar{z}_{k-1}) \right)_{k \geq 1}$, $\left( k(z_k - z_{k-1}) \right)_{k \geq 1}$, and $(z_k)_{k \geq 0}$ are bounded. In particular, for every $k \geq k_6$ we have
$$\left\| 2\lambda(z_k - z^*) + k(z_k - z_{k-1}) + \gamma s k V(\bar{z}_{k-1}) \right\| \leq C_0$$
for a suitable constant $C_0 > 0$. Using the triangle inequality, we deduce from here a corresponding bound for every $k \geq k_6$. The statement (67b), together with (56), implies a bound on $\|V(z_{k+1})\|$ for every $k \geq k_6$. The last assertion in (ii) follows from the Cauchy-Schwarz inequality and the boundedness of $(z_k)_{k \geq 0}$, for every $k \geq k_6$. To complete the proof of (iii), we show that in fact the corresponding limit is zero. Indeed, we have already seen the required summability, which, by the Cauchy-Schwarz inequality and (56), yields the desired statement.
The following theorem addresses the convergence of the sequence of iterates to an element in Z.
Theorem 16. Let $z^* \in \mathcal{Z}$ and $(z_k)_{k \geq 0}$ be the sequence generated by Algorithm 2. Then the sequence $(z_k)_{k \geq 0}$ converges weakly to a solution of (1).
Proof. Let $1 + \frac{1}{\alpha - 1} < \gamma < 2$ and $\underline{\lambda}(\alpha, \gamma) < \overline{\lambda}(\alpha, \gamma)$ be the parameters provided by Lemma 14 such that (64) holds and with the property that for every $\underline{\lambda}(\alpha, \gamma) < \lambda < \overline{\lambda}(\alpha, \gamma)$ there exists an integer $k_2(\lambda) \geq 1$ such that for every $k \geq k_2(\lambda)$ the inequality (65) holds. The proof relies on Opial's Lemma and follows the lines of the proof of Theorem 11, by defining this time a suitable sequence $(q_k)_{k \geq 1}$. One can notice that its limit exists due to (67a) and the fact that the series $\sum_{k \geq 2} \langle z_k - \bar{z}_{k-1}, V(\bar{z}_{k-1}) \rangle$ is absolutely convergent, which follows from (56), (70), and (69), and from the constants $C_3, C_4$ defined in the proof of Theorem 15.
As for the implicit algorithm, the convergence rates can be improved for the explicit algorithm as well.
From (58) and (57) we have the stated identity for every $k \geq 1$. We set $h_k$ for every $k \geq 1$ and notice that, in view of (72), the corresponding decomposition holds. Theorem 15 asserts the convergence of the involved quantities, which, together with $\lim_{k \to +\infty} \mathcal{E}_\lambda^k \in \mathbb{R}$ and $\lim_{k \to +\infty} p_k \in \mathbb{R}$, yields that $\lim_{k \to +\infty} h_k$ exists. In addition, (67c) and (67d) in Theorem 15 guarantee that this limit vanishes. Consequently, $\lim_{k \to +\infty} h_k = 0$, which immediately implies $\lim_{k \to +\infty} k \|z_k - z_{k-1}\| = 0$. The fact that $\lim_{k \to +\infty} k \|V(z_k)\| = 0$ follows from (70) and (71). Finally, using the Cauchy-Schwarz inequality and the fact that $(z_k)_{k \geq 0}$ is bounded, we obtain that $\lim_{k \to +\infty} k \langle z_k - z^*, V(z_k) \rangle = 0$.

Numerical experiments
In this section we perform numerical experiments to illustrate the convergence rates derived for the explicit Fast OGDA method and to compare our algorithm with other numerical schemes from the literature designed to solve equations governed by a monotone and Lipschitz continuous operator. To this end we consider a minimax problem studied in [39], which has then been used in [49], with objective function $\mathcal{L}$ as defined there. We notice that $\mathcal{L}$ is nothing else than the Lagrangian of a linearly constrained quadratic minimization problem. It has been shown in [39] that $\|A\| \leq \frac{1}{2}$, thus $\|H\| \leq \frac{1}{2}$, and, consequently, for the monotone mapping $(x, y) \mapsto \left( \nabla_x \mathcal{L}(x, y), -\nabla_y \mathcal{L}(x, y) \right)$ we can take $L = 1$ as Lipschitz constant.
In the following we summarize all the algorithms we use in the numerical experiments together with the corresponding step sizes: (i) OGDA: the Optimistic Gradient Descent Ascent method (7) (see [42]) with $s := \frac{0.48}{L}$; (ii) EG: the Extragradient method (6) (see [30, 2]) with $s := \frac{0.96}{L}$; (iii) EAG-V: the Extra Anchored Gradient method (10) (see [49]) with variable step sizes $(s_k)_{k \geq 0}$ satisfying (11); (iv) Nesterov-EAG: Nesterov's accelerated variant of the Extra Anchored Gradient method, proposed in [47], which can be obtained from (10) by taking in the first update line the sequence $\left( \frac{k+1}{L(k+2)} \right)_{k \geq 0}$ as step sizes, and in the second one the constant step size $\frac{1}{L}$ (see [47, Theorem 5.1, Lemma 5.1, Theorem 5.2]); (v) Halpern-OGDA: OGDA combined with the Halpern anchoring scheme, proposed in [48], which can be obtained from the variant of (10) with variable step sizes by replacing $V(z^k)$ in the first update line by $V(\bar{z}^{k-1})$; (vi) Fast OGDA: our explicit algorithm with $s := \frac{0.48}{L}$ and various choices of $\alpha$. For the first numerical experiments we consider the same setting as in [49], namely, we take $n = 200$, which means that the underlying space is $\mathbb{R}^{400}$, and allow a maximum number of $5 \times 10^5$ iterations. Figure 1 presents the convergence behaviour of the different methods when solving (74) in logarithmic scale. One can see that the anchoring based methods perform better than the classical algorithms EG and OGDA, and that Nesterov-EAG performs better than Halpern-OGDA, which reconfirms a finding of [47] and is not surprising if one takes into account that the former allows for larger step sizes than EAG-V (and Halpern-OGDA). On the other hand, Fast OGDA outperforms all the other methods, in spite of the fact that its step size is restricted to the interval $\left( 0, \frac{1}{2L} \right)$.
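To make the baseline update rules concrete, here is a minimal sketch of the EG and textbook OGDA iterations for a monotone operator $V(z) = Mz$ with $M$ skew-symmetric, so that $z^* = 0$ is the unique zero. The operator, initial point and iteration budget are our own illustrative choices, not the experimental setup.

```python
import numpy as np

# Toy monotone operator: M skew-symmetric => V(z) = M z is monotone,
# with Lipschitz constant L = ||M|| = 1 for this rotation generator.
M = np.array([[0.0, 1.0], [-1.0, 0.0]])
L = 1.0
V = lambda z: M @ z

z_eg = np.array([1.0, 1.0])    # Extragradient iterate
z_og = np.array([1.0, 1.0])    # OGDA iterate
V_prev = V(z_og)               # convention V(z^{-1}) := V(z^0)
s_eg, s_og = 0.96 / L, 0.48 / L

for _ in range(500):
    # EG: extrapolate with V(z^k), then update with V at the midpoint
    z_half = z_eg - s_eg * V(z_eg)
    z_eg = z_eg - s_eg * V(z_half)
    # OGDA: z^{k+1} = z^k - 2 s V(z^k) + s V(z^{k-1}) (reuses the past value)
    V_cur = V(z_og)
    z_og = z_og - 2.0 * s_og * V_cur + s_og * V_prev
    V_prev = V_cur
```

Note that OGDA needs a single operator evaluation per iteration, while EG needs two; this is the usual motivation for the "optimistic" scheme.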
Figure 2 shows that the parameter $\alpha > 2$ significantly influences the convergence behaviour of the explicit Fast OGDA method. For this numerical experiment we take $n = 1000$, which means that the underlying space is $\mathbb{R}^{2000}$, and allow a maximum number of $5 \times 10^5$ iterations. The speed of convergence increases with increasing $\alpha$ and seems to be much better than $o(1/k)$. Let us mention that the minimax problem (74) was constructed to show lower complexity bounds of first order methods for convex-concave saddle point problems.
For Nesterov's dynamical systems with $\frac{\alpha}{t}$ as damping coefficient and the corresponding numerical algorithms approaching the minimization of a smooth and convex function, it is known that $\alpha$ influences the convergence rates of the objective function values in the same way. Another intriguing similarity with Nesterov's continuous and discrete schemes is the evident oscillatory behaviour of the trajectories; there, however, it occurs for the objective function values, while for explicit Fast OGDA it occurs for the norm of the operator along the trajectory/sequence of generated iterates. This suggests that Nesterov's acceleration approach can improve the convergence behaviour of continuous and discrete time approaches beyond the optimization setting.

Figure 2: The parameter $\alpha$ influences the convergence behaviour of explicit Fast OGDA
In the following we complement the comparative study of the above numerical methods by means of the performance profiles developed by Dolan and Moré ([21]). We denote by $S$ the set of algorithms/solvers (i)-(vi) from above, where for the Fast OGDA method we take $\alpha := 3$. We solve minimax problems of the form (74) with $L : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ for 10 different pairs $(n, m)$ such that $20 \leq m \leq n \leq 200$ and, for each such pair, for 100 randomly generated sparse matrices $A \in \mathbb{R}^{m \times n}$ and vectors $b \in \mathbb{R}^m$ and $h \in \mathbb{R}^n$, with $H := 2A^T A$. For each pair $(n, m)$ we also take 10 initial points drawn from a normal distribution, which leads to a set of problems $P$ with $N_p = 10 \times 100 \times 10 = 10000$ instances.
For each problem $p \in P$ and each solver $s \in S$, we denote by $t_{p,s}$ the number of iterations needed by solver $s$ to solve problem $p$ successfully, i.e., by satisfying the following stopping criteria before $k_{\max} := 10^5$ iterations: $$\frac{\|V(x^k, y^k)\|}{\|V(x^0, y^0)\|} \leq \mathrm{Tol}_{op} = 10^{-6} \quad \text{and} \quad \frac{\|(x^k, y^k) - (x^{k-1}, y^{k-1})\|}{\|(x^k, y^k)\| + 1} \leq \mathrm{Tol}_{vec} = 10^{-5}.$$
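The two criteria transcribe directly into code; the helper name `converged` and its array-based interface are our own choices, while the defaults mirror $\mathrm{Tol}_{op} = 10^{-6}$ and $\mathrm{Tol}_{vec} = 10^{-5}$ from the text.

```python
import numpy as np

def converged(v_k, v_0, z_k, z_prev, tol_op=1e-6, tol_vec=1e-5):
    # relative error of the operator norm
    rel_op = np.linalg.norm(v_k) / np.linalg.norm(v_0)
    # relative error of the discrete velocity
    rel_vel = np.linalg.norm(z_k - z_prev) / (np.linalg.norm(z_k) + 1.0)
    return bool(rel_op <= tol_op and rel_vel <= tol_vec)
```

The `+ 1` in the denominator of the second criterion guards against division by a vanishing iterate norm near a solution at the origin.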
The two stopping criteria quantify the relative errors measured for the operator norm and the discrete velocity. We define the performance ratio as $r_{p,s} := \frac{t_{p,s}}{\min\{t_{p,s'} : s' \in S\}}$ and the performance of solver $s$ as $\rho_s(\tau) := \frac{1}{N_p} \, \mathrm{size}\{p \in P : 0 < r_{p,s} \leq \tau\}$, where $\tau$ is a real factor. The performance $\rho_s(\tau)$ of solver $s$ gives the probability that the performance ratio $r_{p,s}$ is within a factor $\tau \in \mathbb{R}$ of the best possible ratio. Therefore, the value $\rho_s(1)$ gives the probability that solver $s$ exhibits the best numerical performance when compared to the others, while $\rho_s(\tau)$ for large values of $\tau$ measures its robustness. Figure 3 represents the performance profiles of the six solvers. We observe that the Fast OGDA method is the most efficient, followed by EAG-V and Halpern-OGDA. We note that for $\tau \geq 3$ these three solvers are robust and solve 90% of the problems, while Nesterov-EAG and EG solve 80% of the problems for $\tau \geq 4$.
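A hedged sketch of how a Dolan-Moré profile is computed; the iteration counts below are made-up toy numbers (with `np.inf` marking a failed run), not results from our experiments.

```python
import numpy as np

# Toy table of iteration counts t[p, s]: 3 problems (rows) x 3 solvers
# (columns); np.inf marks a solver that failed on a problem.
t = np.array([[10.0, 20.0, np.inf],
              [ 5.0,  5.0, 15.0],
              [30.0, 10.0, 20.0]])
# performance ratios r_{p,s} = t_{p,s} / min over solvers for problem p
r = t / t.min(axis=1, keepdims=True)

def rho(s, tau):
    # fraction of problems on which solver s is within a factor tau of the best
    return float(np.mean(r[:, s] <= tau))
```

Failed runs get ratio `inf` and therefore never count towards $\rho_s(\tau)$, which is exactly how robustness enters the profile.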

A Appendix
In the first subsection of the appendix we collect some fundamental auxiliary results for the analysis carried out in the paper. Further, we present the proof of the existence and uniqueness theorem for (12), as well as the proofs of the technical lemmas used in the convergence analysis of the numerical algorithms.

A.1 Auxiliary results
The following result can be found in [1, Lemma 5.1].
Lemma A.1. Let $\delta > 0$. Suppose that $f : [\delta, +\infty) \to \mathbb{R}$ is locally absolutely continuous and bounded from below, and that there exists $g \in L^1([\delta, +\infty))$ such that for almost every $t \geq \delta$ $$\frac{d}{dt} f(t) \leq g(t).$$ Then the limit $\lim_{t \to +\infty} f(t) \in \mathbb{R}$ exists.
Opial's Lemma ([38]) in continuous form is used in the proof of the weak convergence of the trajectory of the dynamical system (12).

Lemma A.2. Let $S$ be a nonempty subset of $H$ and $z : [t_0, +\infty) \to H$. Assume that (i) for every $z^* \in S$, $\lim_{t \to +\infty} \|z(t) - z^*\|$ exists; (ii) every weak sequential cluster point of the trajectory $z(t)$ as $t \to +\infty$ belongs to $S$. Then $z(t)$ converges weakly to a point in $S$ as $t \to +\infty$.
For the convergence proof of the iterates generated by the two numerical algorithms we use the discrete counterpart of Opial's Lemma.
Lemma A.3. Let $S$ be a nonempty subset of $H$ and $(z^k)_{k \geq 1}$ be a sequence in $H$. Assume that (i) for every $z^* \in S$, $\lim_{k \to +\infty} \|z^k - z^*\|$ exists; (ii) every weak sequential cluster point of the sequence $(z^k)_{k \geq 1}$ as $k \to +\infty$ belongs to $S$. Then $(z^k)_{k \geq 1}$ converges weakly to a point in $S$ as $k \to +\infty$.
The following result can be found in [13, Lemma A.2].
Lemma A.4. Let $a > 0$ and $q : [t_0, +\infty) \to H$ be a continuously differentiable function such that $$\lim_{t \to +\infty} \left( q(t) + \frac{t}{a} \dot{q}(t) \right) = l \in H.$$ Then it holds $\lim_{t \to +\infty} q(t) = l$.
The discrete counterpart of this result is stated below. We provide a proof for it, as we could not find any reference for this result in the literature.
Lemma A.5. Let $a \geq 1$ and $(q^k)_{k \geq 0}$ be a bounded sequence in $H$ such that $$\lim_{k \to +\infty} \left( q^{k+1} + \frac{k}{a} \left( q^{k+1} - q^k \right) \right) = l \in H.$$ Then it holds $\lim_{k \to +\infty} q^k = l$.
Proof. For every $k \geq 0$ we set $r^k := q^k - l$ and fix $\varepsilon > 0$. Then there exists $k_0 \geq 1$ such that for every $k \geq k_0$ $$\left\| r^{k+1} + \frac{k}{a} \left( r^{k+1} - r^k \right) \right\| \leq \varepsilon.$$ Multiplying both sides by $a k^{a-1}$, we obtain for every $k \geq k_0$ $$\left\| \left( a k^{a-1} + k^a \right) r^{k+1} - k^a r^k \right\| \leq \varepsilon a k^{a-1}.$$ Then, by applying the triangle inequality and using the fact that $r := \sup_{k \geq 0} \|r^k\| < +\infty$, we deduce that for every $k \geq k_0$ $$\left\| (k+1)^a r^{k+1} - k^a r^k \right\| \leq \varepsilon a k^{a-1} + \left| (k+1)^a - k^a - a k^{a-1} \right| r. \qquad (75)$$ The Lagrange error bound for the Taylor expansion guarantees that for every $k \geq k_0$ there exists $m_k \in (k, k+1)$ such that $$\left| (k+1)^a - k^a - a k^{a-1} \right| \leq \frac{1}{2} a |a - 1| m_k^{a-2}.$$
From here we consider two cases.
The case $1 \leq a < 2$: for every $k \geq k_0$ and every $m \in (k, k+1)$ we have $m^{a-2} \leq 1$, and thus (75) leads to $$\left\| (k+1)^a r^{k+1} - k^a r^k \right\| \leq \varepsilon a k^{a-1} + \frac{1}{2} a |a - 1| r.$$ We choose $K \geq k_0$, use a telescoping sum argument and, once again, the triangle inequality to conclude the corresponding bound. The case $a \geq 2$: for every $k \geq k_0$ and every $m \in (k, k+1)$ we have $m^{a-2} \leq (k+1)^{a-2}$, hence (75) leads to $$\left\| (k+1)^a r^{k+1} - k^a r^k \right\| \leq \varepsilon a k^{a-1} + \frac{1}{2} a (a-1) r (k+1)^{a-2}.$$ We choose also in this case $K \geq k_0$ and argue as above. Therefore, in both scenarios we obtain $\limsup_{k \to +\infty} \|r^k\| \leq \varepsilon a$, which leads to the desired conclusion, as $\varepsilon > 0$ was arbitrarily chosen.
The following result is a particular instance of [16, Lemma 5.31].
Lemma A.6. Let $(a_k)_{k \geq 1}$, $(b_k)_{k \geq 1}$ and $(d_k)_{k \geq 1}$ be sequences of real numbers. Assume that $(a_k)_{k \geq 1}$ is bounded from below, that $(b_k)_{k \geq 1}$ and $(d_k)_{k \geq 1}$ are nonnegative with $$a_{k+1} \leq a_k - b_k + d_k \quad \forall k \geq 1,$$ and that $\sum_{k \geq 1} d_k < +\infty$. Then the following statements are true: (i) the sequence $(b_k)_{k \geq 1}$ is summable, namely $\sum_{k \geq 1} b_k < +\infty$; (ii) the sequence $(a_k)_{k \geq 1}$ is convergent.
The following elementary result is used several times in the paper.
Lemma A.7. Let $a, b, c \in \mathbb{R}$ be such that $a \neq 0$ and $b^2 - ac \leq 0$. The following statements are true: (i) if $a > 0$, then it holds $a \|x\|^2 + 2b \langle x, y \rangle + c \|y\|^2 \geq 0$ for all $x, y \in H$; (ii) if $a < 0$, then it holds $a \|x\|^2 + 2b \langle x, y \rangle + c \|y\|^2 \leq 0$ for all $x, y \in H$.
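Since this elementary bound is used several times in the paper, here is a quick numerical sanity check of statement (i) over randomly sampled coefficients satisfying $b^2 - ac \leq 0$; the sampling ranges and dimension are our own arbitrary choices.

```python
import numpy as np

# Sanity check of Lemma A.7 (i): if a > 0 and b^2 - a c <= 0, then
# a ||x||^2 + 2 b <x, y> + c ||y||^2 >= 0 for all x, y.
rng = np.random.default_rng(0)
worst = np.inf
for _ in range(1000):
    a = rng.uniform(0.1, 5.0)
    c = rng.uniform(0.0, 5.0)
    b = rng.uniform(-1.0, 1.0) * np.sqrt(a * c)   # guarantees b^2 <= a c
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    q = a * (x @ x) + 2.0 * b * (x @ y) + c * (y @ y)
    worst = min(worst, q)
```

The check mirrors the proof idea: by Cauchy-Schwarz the expression dominates $a\|x\|^2 - 2|b|\|x\|\|y\| + c\|y\|^2$, a quadratic in $\|x\|$ with nonpositive discriminant $4(b^2 - ac)$.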

A.2 Proof of the existence and uniqueness theorem for the evolution equation
In this subsection we provide the proof of the existence and uniqueness of the trajectories of (12).
Since $G$ is Lipschitz continuous on bounded sets, the local existence and uniqueness theorem (see, for instance, [44, Theorems 46.2 and 46.3]) allows us to conclude that there exists a unique continuously differentiable solution $(z, u) \in H \times H$ of (76) defined on a maximal interval $[t_0, T_{\max})$, where $0 < t_0 < T_{\max} \leq +\infty$. Furthermore, either $T_{\max} = +\infty$ or $\lim_{t \to T_{\max}} \|(z(t), u(t))\| = +\infty$.
In the following we will show that indeed T max " `8.

A.3 Proof of the technical lemma used in the analysis of the implicit algorithm
In this subsection we provide the proof of Lemma 9 which shows that the discrete energy (42) dissipates with every iteration of the implicit Fast OGDA method.
By plugging (82) and (83) into (81), we get an identity valid for every $k \geq 1$. Next we are going to consider the remaining terms in the difference of the discrete energy functions. First we observe an identity valid for every $k \geq 0$, and some algebra shows a further identity for every $k \geq 1$. Finally, according to (39) and (40), we have for every $k \geq \lceil \alpha \rceil$ $$(k + \alpha + 1)(k+1) \beta_{k+1} - (k + \alpha) k \beta_{k-1} = (k + \alpha + 1)(k+1) \left( \beta_{k+1} - \beta_k \right) + (k + \alpha) k \left( \beta_k - \beta_{k-1} \right) + (2k + \alpha + 1) \beta_k.$$ After adding the relations (85)-(88) and by taking into consideration (84), we obtain (43). ■

A.4 Proofs of the technical lemmas used in the analysis of the explicit algorithm

In this subsection we provide the proofs of the two main technical lemmas used in the analysis of the explicit Fast OGDA method.
Proof of Lemma 13. Let $z^* \in Z$, $0 < \gamma < 2$ and $0 \leq \lambda \leq \alpha - 1$. First we prove an inequality valid for every $k \geq 1$: for every $k \geq 1$ we have the corresponding relations and, after subtraction, we deduce the claim from (52). Next we recall the identities in (81) and (86), respectively, as they are required also in the analysis of the explicit algorithm.
(i) Let $k \geq 2$ be fixed. By the definition of $F_k^{\lambda}$ in (59), we have for every $k \geq 2$ the corresponding expression. By using the definition of $\omega_0, \omega_1, \omega_2$ and $\omega_4$ in (60), the fact that $0 \leq \lambda \leq \alpha - 1$ and $0 < \gamma < 2$, from (89) we obtain an estimate valid for every $k \geq 1$. Plugging (100) into (99) yields, for every $k \geq 2$, a refined estimate. Our next aim is to derive upper estimates for the first two terms on the right-hand side of (101), which will eventually simplify the subsequent four terms. First we observe that (55) provides an estimate for every $k \geq 1$, and the monotonicity of $V$ together with relation (52) yields a further estimate for every $k \geq 1$. Since $\left( \omega_0^2 - \delta^2 \omega_2 \omega_4 \right) k^2$ is the dominant term in the above polynomial, it suffices to guarantee that $\omega_0^2 - \delta^2 \omega_2 \omega_4 < 0$ in order to be sure that there exists some integer $k_2(\lambda) \geq 1$ such that $\Delta_k^1 \leq 0$ for every $k \geq k_2(\lambda)$, and to obtain from here, due to Lemma A.7 (ii), that $R_k \leq 0$ for every $k \geq k_2(\lambda)$.

Figure 1: Explicit Fast OGDA outperforms all other explicit methods

Figure 3: The performance profiles of the six solvers