Fast Augmented Lagrangian Method in the convex regime with convergence guarantees for the iterates

This work aims to minimize a continuously differentiable convex function with Lipschitz continuous gradient under linear equality constraints. The proposed inertial algorithm results from the discretization of the second-order primal-dual dynamical system with asymptotically vanishing damping term addressed by Boţ and Nguyen (J. Differential Equations 303:369–406, 2021), and it is formulated in terms of the Augmented Lagrangian associated with the minimization problem. The general setting we consider for the inertial parameters covers the three classical rules by Nesterov, Chambolle–Dossal and Attouch–Cabot used in the literature to formulate fast gradient methods. For these rules, we obtain in the convex regime convergence rates of order O(1/k²) for the primal-dual gap, the feasibility measure, and the objective function value. In addition, we prove that the generated sequence of primal-dual iterates converges to a primal-dual solution in a general setting that covers the two latter rules. This is the first result which provides the convergence of the sequence of iterates generated by a fast algorithm for linearly constrained convex optimization problems without additional assumptions such as strong convexity. We also emphasize that all convergence results of this paper are compatible with the ones obtained in Boţ and Nguyen (J. Differential Equations 303:369–406, 2021) in the continuous setting.

(1.1)   min f(x)  subject to  Ax = b,

where H, G are real Hilbert spaces, f : H → ℝ is a convex and Fréchet differentiable function with L-Lipschitz continuous gradient, for L > 0, A : H → G is a continuous linear operator and b ∈ G. We assume that the set S of primal-dual optimal solutions of (1.1) (see Section 1.2 for a precise definition) is nonempty.
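To make the setting of (1.1) concrete, the following sketch instantiates it with a convex quadratic on H = ℝ² and a single equality constraint, and computes a primal-dual solution from the associated KKT linear system. All data (P, q, A, b) are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np

# Toy instance of (1.1): minimize f(x) = 0.5 x^T P x - q^T x  subject to  A x = b.
P = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite, so f is convex
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])               # one linear equality constraint
b = np.array([1.0])

def f(x):
    return 0.5 * x @ P @ x - q @ x

def grad_f(x):
    return P @ x - q                     # Lipschitz continuous with L = ||P||

# A primal-dual optimal pair (x*, lambda*) solves the KKT system
#   P x* - q + A^T lambda* = 0,   A x* = b,
# which for this quadratic instance is a linear system we can solve directly.
K = np.block([[P, A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([q, b]))
x_star, lam_star = sol[:2], sol[2:]
```
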
Unlike in the unconstrained case, for which fast continuous and discrete time approaches have been intensively investigated in recent years, the study of solution methods with fast convergence rates for linearly constrained convex optimization problems of the form (1.1) is still at an incipient stage.
Zeng, Lei, and Chen (in [30]) and He, Hu, and Fang (in [16]) have investigated a dynamical system with asymptotically vanishing damping attached to (1.1), and have shown a convergence rate of order O(1/t²) for the primal-dual gap, while Attouch, Chbani, Fadili and Riahi have considered in [2] a more general dynamical system with time rescaling. More recently, for a primal-dual dynamical system formulated in the spirit of [2,16,30], Boţ and Nguyen have obtained in [8] fast convergence rates for the primal-dual gap, the feasibility measure and the objective function value along the generated trajectory, and, additionally, have proved asymptotic convergence of the primal-dual trajectory to a primal-dual optimal solution.
Fast numerical methods for solving (1.1) have mainly been considered in the literature under additional assumptions such as strong convexity, and in several cases the convergence rate results have been formulated in terms of ergodic sequences. In the merely convex regime, no convergence result for the iterates has been provided so far for fast convergent algorithms. Among the works addressing fast convergent methods for linearly constrained convex optimization problems are [11,12,13,15,19,20,23,24,26,27,28,29], at which we will take a closer look in Section 1.3.
The aim of this paper is to propose a numerical algorithm for solving (1.1) which results from the discretization of the dynamical system in [8], exhibits fast convergence rates for the primal-dual gap, the feasibility measure, and the objective function value, and guarantees convergence of the sequence of iterates without additional assumptions such as strong convexity. Although there is an obvious interplay between continuous time dissipative dynamical systems and their discrete counterparts, one cannot directly and straightforwardly transfer asymptotic results from the continuous setting to numerical algorithms; thus, a separate analysis is needed for the latter. In this paper we will also comment on the similarities and the differences between the continuous and discrete time approaches. The Lagrangian associated with (1.1) is L : H × G → ℝ, L(x, λ) := f(x) + ⟨λ, Ax − b⟩. Since f is a convex function, L is convex with respect to x ∈ H and affine with respect to λ ∈ G. A pair (x*, λ*) ∈ H × G is said to be a saddle point of the Lagrangian function L if for every (x, λ) ∈ H × G

L(x*, λ) ≤ L(x*, λ*) ≤ L(x, λ*).

Augmented Lagrangian formulation
If (x*, λ*) ∈ H × G is a saddle point of L, then x* ∈ H is an optimal solution of (1.1) and λ* ∈ G is an optimal solution of its Lagrange dual problem. If x* ∈ H is an optimal solution of (1.1) and a suitable constraint qualification is fulfilled (see, for instance, [5,7]), then there exists an optimal solution λ* ∈ G of the Lagrange dual problem of (1.1) such that (x*, λ*) ∈ H × G is a saddle point of L.
The set of saddle points of L, also called the set of primal-dual optimal solutions of (1.1), will be denoted by S and, as stated above, will be assumed throughout this paper to be nonempty.
The set of feasible points of (1.1) will be denoted by F := {x ∈ H : Ax = b} and the optimal objective value of (1.1) by f*.
The system of primal-dual optimality conditions for (1.1) reads

(x*, λ*) ∈ S  ⇔  ∇f(x*) + A*λ* = 0 and Ax* − b = 0,

where A* : G → H denotes the adjoint operator of A. This optimality system can be equivalently written as T_L(x*, λ*) = 0, where T_L(x, λ) := (∇f(x) + A*λ, b − Ax) is the maximally monotone operator associated to the convex-concave function L. Indeed, it is immediate to verify that T_L is monotone. Since it is also continuous, it is maximally monotone (see, for instance, [5, Corollary 20.28]). Therefore S can be interpreted as the set of zeros of the maximally monotone operator T_L, which means that it is a closed convex subset of H × G (see, for instance, [5, Proposition 23.39]).
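The operator T_L and its two key properties (it vanishes exactly at primal-dual solutions and is monotone) can be checked numerically on a toy quadratic instance; the data below are illustrative assumptions, not from the paper.

```python
import numpy as np

# T_L attached to L(x, lam) = f(x) + <lam, Ax - b> is
#   T_L(x, lam) = (grad f(x) + A^T lam, b - A x),
# here for the illustrative quadratic f(x) = 0.5 x^T P x - q^T x.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

def T_L(x, lam):
    return np.concatenate([P @ x - q + A.T @ lam, b - A @ x])

# (x*, lam*) from the KKT system; T_L vanishes exactly at primal-dual solutions.
K = np.block([[P, A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([q, b]))
x_star, lam_star = sol[:2], sol[2:]

# Monotonicity check on random pairs: <T_L(u) - T_L(v), u - v> >= 0.
rng = np.random.default_rng(1)
mono = []
for _ in range(100):
    u, v = rng.normal(size=3), rng.normal(size=3)
    du = T_L(u[:2], u[2:]) - T_L(v[:2], v[2:])
    mono.append(du @ (u - v))
```
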

Related works
In this section we will recall the most significant fast primal-dual numerical approaches for linearly constrained convex optimization problems and for convex optimization problems involving compositions with continuous linear operators.
In [11], Chambolle and Pock have studied in a finite-dimensional setting the convergence rates of their celebrated primal-dual algorithm for solving the minimax problem for g the indicator function of the set {b}. For the primal-dual sequence of iterates {(x_k, λ_k)}_{k≥0}, the corresponding ergodic sequence {(x̄_k, λ̄_k)}_{k≥0} is defined for every k ≥ 0 as a weighted average with weights given by a sequence {σ_k}_{k≥0} of properly chosen positive step sizes.
whereas, if both f and g* are strongly convex, then even linear convergence can be achieved.
In [12], Chen, Lan and Ouyang have considered the same minimax problem (1.5), but for f : H → ℝ a convex and Fréchet differentiable function with L-Lipschitz continuous gradient, for L > 0, and have proposed a primal-dual algorithm that exhibits an ergodic convergence rate for the restricted primal-dual gap over X × Y. A stochastic counterpart of the primal-dual algorithm along with corresponding convergence rate results and, for both the deterministic and the stochastic setting, convergence rates when either X or Y is unbounded have also been provided.
Later on, Ouyang, Chen, Lan and Pasiliao Jr. have developed in [23] an accelerated ADMM algorithm for the optimization problem (1.6), with f assumed to be Fréchet differentiable with L-Lipschitz continuous gradient, for L > 0, on its effective domain. In the case when f and g have bounded domains, this method has been proved to exhibit an ergodic convergence rate for the objective function value of type (1.7), with the coefficient of 1/k² depending on L and the diameter of dom f, and the coefficient of 1/k depending on ‖A‖ and the diameter of dom g*. On the other hand, without assuming boundedness of the domains of f and g*, the accelerated ADMM algorithm has been proved to exhibit ergodic convergence rates for the feasibility measure and the objective function value of O(1/k) as k → +∞.
By using a smoothing approach, Tran-Dinh, Fercoq and Cevher have designed in [26] a primal-dual algorithm for solving (1.6) and its particular formulation (1.1) that exhibits last-iterate convergence rates for the objective function value and the feasibility measure of O(1/k) in the convex regime and of O(1/k²) in the strongly convex regime as k → +∞.
Goldstein, O'Donoghue, Setzer and Baraniuk have studied in [13] the two-block separable optimization problem with linear constraints

min f(x) + h(y),  subject to  Ax + By = b,   (1.8)

where K is another real Hilbert space, f : H → ℝ ∪ {+∞} and h : K → ℝ ∪ {+∞} are proper, convex and lower semicontinuous functions, A : H → G and B : K → G are continuous linear operators and b ∈ G. It is obvious that (1.1) can be reformulated as (1.8) and vice versa. In [13] a numerical algorithm for solving (1.8) has been proposed that exhibits, when f and h are strongly convex, convergence rates for the dual objective function of O(1/k²) and for the feasibility measure of O(1/k) as k → +∞. For a fast version of the Alternating Minimization Algorithm (see [27]), a convergence rate for the dual objective function of O(1/k²) as k → +∞ has also been proved.
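As a concrete illustration of the two-block structure (1.8), the sketch below runs the classical (unaccelerated) ADMM, not the accelerated variants of [13] or [23], on an assumed toy instance with f(x) = ½‖x‖², h(y) = ½‖y‖², A = B = Id and a fixed b; all data and the penalty parameter are our own illustrative choices.

```python
import numpy as np

# Classical ADMM for  min 0.5||x||^2 + 0.5||y||^2  s.t.  x + y = b  (A = B = Id).
b = np.array([1.0, 2.0])
rho = 1.0                        # penalty parameter (illustrative)
x = y = u = np.zeros(2)          # u is the scaled dual variable
for _ in range(200):
    # x-update: argmin_x 0.5||x||^2 + (rho/2)||x + y - b + u||^2  (closed form)
    x = rho * (b - y - u) / (1.0 + rho)
    # y-update: argmin_y 0.5||y||^2 + (rho/2)||x + y - b + u||^2  (closed form)
    y = rho * (b - x - u) / (1.0 + rho)
    # dual ascent on the scaled multiplier
    u = u + x + y - b
# By symmetry, the unique solution of the toy problem is x = y = b / 2.
```
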
Xu has proposed in [28] a linearized Augmented Lagrangian Method for the optimization problem (1.1), for which he has shown that for constant step sizes it exhibits ergodic convergence rates of O(1/k) as k → +∞ for the feasibility measure and the objective function value, whereas the sequence of primal-dual iterates has been shown to converge to a primal-dual solution. He has also proved that for appropriately chosen variable step sizes, in particular when allowing the dual step sizes to be unbounded, the convergence rates of the feasibility measure and the objective function value can be improved to O(1/k²) as k → +∞, without saying anything about the convergence of the primal-dual iterates in this setting. In addition, a linearized Alternating Direction Method of Multipliers for (1.8) has been proposed in [28], for which similar statements as for the linearized Augmented Lagrangian Method have been proved, whereby the fast convergence rates have been obtained by assuming that one of the summands in the objective function is strongly convex.
In [14], He and Yuan have enhanced the Augmented Lagrangian Method for the linearly constrained convex optimization problem (1.1) with a Nesterov momentum update rule for the sequence of dual iterates. They have proved that the expression L(x*, λ*) − L(x_k, λ_k) has an upper bound of order 1/k², where {(x_k, λ_k)}_{k≥0} denotes the generated sequence of primal-dual iterates and (x*, λ*) is an arbitrary optimal solution of the Wolfe dual problem of (1.1).
In [29], Yan and He have proposed for optimization problems of type (1.1), with a proper, convex and lower semicontinuous objective function, a numerical algorithm which combines the Augmented Lagrangian Method with a Bregman proximal evaluation of the objective. When choosing the sequence of proximal parameters to fulfil η_k := η(k + 1)^p for every k ≥ 0, where η > 0 and p ≥ 0, ergodic convergence rates for the restricted primal-dual gap over X × Y have been obtained. In [24], Sabach and Teboulle have considered a unified algorithmic framework for proving faster convergence rates for various Lagrangian-based methods designed to solve optimization problems of type (1.1) with a proper, convex and lower semicontinuous objective function. In the convex regime these methods exhibit a non-ergodic rate of convergence of O(1/k) as k → +∞ for the feasibility measure and the objective function value. In the strongly convex regime the convergence rates can be improved to O(1/k²) as k → +∞.
For the same class of optimization problems, He, Hu, and Fang have proposed in [15] an accelerated primal-dual Lagrangian-based method, with inertial parameters following the choice of Chambolle–Dossal, that achieves a convergence rate of O(1/k²) as k → +∞ for the feasibility measure and the objective function value without any strong convexity assumption.
Recently, in [19], Luo has introduced in the same context a unifying algorithmic scheme which covers both the convex and the strongly convex setting. In the convex regime a convergence rate of O(1/k) as k → +∞ is obtained for the primal-dual gap, the feasibility measure, and the objective function value, while in the strongly convex regime these rates are improved to O(1/k²) as k → +∞. These results have been extended to optimization problems of type (1.8) in [20], where it has been shown that, in order to achieve a convergence rate of O(1/k²) as k → +∞, it is enough to assume that only one of the functions in the objective is strongly convex.
Noticeably, none of these works has addressed the convergence of the sequences of primal-dual iterates, with very few exceptions in the strongly convex regime. This phenomenon could be noticed for unconstrained convex optimization problems, too. The convergence of the sequences of iterates generated by fast numerical methods was proved much later (by Chambolle and Dossal in [10] and by Attouch and Peypouquet in [3]) than the convergence rates for Nesterov's accelerated gradient method ([21]) and FISTA ([6]). One explanation for this is that the analysis of the former is much more involved.

Our contributions
We consider as starting point a second-order dynamical system with asymptotically vanishing damping term associated with the optimization problem (1.1). This dynamical system is formulated in terms of the augmented Lagrangian and has been studied in [8]. By an appropriate time discretization, this system gives rise to an inertial primal-dual numerical algorithm which allows a flexible choice of the inertial parameters. This choice covers the three classical inertial parameter rules by Nesterov ([6,21]), Chambolle–Dossal ([10]) and Attouch–Cabot ([1]) used in the literature to formulate fast gradient methods. We show that for these rules the resulting algorithm exhibits in the convex regime convergence rates of order O(1/k²) for the primal-dual gap, the feasibility measure, and the objective function value. In addition, we prove that the generated sequence of primal-dual iterates converges weakly to a primal-dual solution of the underlying problem, which is nothing else than a saddle point of the Lagrangian. The convergence of the iterates is stated in a general setting that covers the inertial parameter rules by Chambolle–Dossal and Attouch–Cabot. This is the first result which provides the convergence of the sequence of iterates generated by a fast algorithm for linearly constrained convex optimization problems without additional assumptions such as strong convexity. All convergence and convergence rate results of this paper are compatible with the ones obtained in [8] in the continuous setting.
The proposed Fast Augmented Lagrangian Method and all convergence results can easily be extended, via the product space approach, to two-block separable linearly constrained optimization problems of the form (1.8) with f and h convex and Fréchet differentiable functions with Lipschitz continuous gradients.

Notations and preliminaries
We denote by B(x; ε) := {y ∈ H : ‖x − y‖ ≤ ε} the closed ball centered at x ∈ H with radius ε > 0.
Let x, y ∈ H. We have

(1.9)   ‖x + y‖² = ‖x‖² + 2⟨x, y⟩ + ‖y‖².

We denote by S₊(H) the set of linear, continuous, self-adjoint and positive semidefinite operators W : H → H, and for W ∈ S₊(H) we write ‖x‖²_W := ⟨Wx, x⟩. The Loewner partial ordering on S₊(H) is defined for W, W′ ∈ S₊(H) by W ≥ W′ if and only if ‖x‖²_W ≥ ‖x‖²_{W′} for every x ∈ H. Thus W ∈ S₊(H) is nothing else than W ≥ 0. If there exists α > 0 such that W ≥ αId, then the semi-norm ‖·‖_W becomes a norm.
In the spirit of (1.9) and (1.11), respectively, analogous expansion identities hold for ‖x + y‖² and ‖x + y‖²_W. Moreover, since f is convex with L-Lipschitz continuous gradient, for every x, y ∈ H it holds

f(x) + ⟨∇f(x), y − x⟩ ≤ f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖².

The second inequality is also known as the Descent Lemma.
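Both the convexity lower bound and the Descent Lemma upper bound can be verified numerically on an illustrative quadratic (our own choice of P), for which the Lipschitz constant of the gradient is the largest eigenvalue of P:

```python
import numpy as np

# Check of the gradient inequality and the Descent Lemma for f(x) = 0.5 x^T P x.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
L = np.linalg.norm(P, 2)        # spectral norm = Lipschitz constant of grad f

def f(x): return 0.5 * x @ P @ x
def grad_f(x): return P @ x

rng = np.random.default_rng(2)
gaps = []
for _ in range(200):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lower = f(x) + grad_f(x) @ (y - x)                              # convexity bound
    upper = lower + 0.5 * L * np.sum((y - x) ** 2)                  # Descent Lemma bound
    gaps.append((upper - f(y), f(y) - lower))   # both entries should be nonnegative
```
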
The following result is a particular instance of [5, Lemma 5.31] and will be used several times in this paper.

Lemma 1.1. Let {a_k}_{k≥1}, {b_k}_{k≥1} and {d_k}_{k≥1} be sequences of real numbers. Assume that {a_k}_{k≥1} is bounded from below, and that {b_k}_{k≥1} and {d_k}_{k≥1} are nonnegative with ∑_{k≥1} d_k < +∞. Suppose further that for every k ≥ 1 it holds

(1.15)   a_{k+1} ≤ a_k − b_k + d_k.

Then the following statements are true:
(i) the sequence {b_k}_{k≥1} is summable, namely ∑_{k≥1} b_k < +∞;
(ii) the sequence {a_k}_{k≥1} is convergent.
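Lemma 1.1 can be illustrated numerically with the assumed toy choices b_k := a_k/2 ≥ 0 and d_k := 1/k² (summable), so that (1.15) holds with equality; the lemma then predicts summability of {b_k} and convergence of {a_k} (here to 0):

```python
# Numerical illustration of Lemma 1.1 (quasi-Fejér type recursion).
a = [1.0]
b_terms, d_terms = [], []
for k in range(1, 2001):
    b_k = a[-1] / 2.0          # nonnegative, since a_k stays positive here
    d_k = 1.0 / k**2           # summable perturbation
    a.append(a[-1] - b_k + d_k)
    b_terms.append(b_k)
    d_terms.append(d_k)
```
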
In order to establish the weak convergence of the iterates, we will use Opial's Lemma in discrete form (see, for instance, [5, Theorem 5.5]), which we recall as follows.
Lemma 1.2. Let C be a nonempty subset of H and {x_k}_{k≥1} a sequence in H. Assume that
(i) for every x* ∈ C, lim_{k→+∞} ‖x_k − x*‖ exists;
(ii) every weak sequential cluster point of {x_k}_{k≥1} belongs to C.
Then the sequence {x_k}_{k≥1} converges weakly to an element of C as k → +∞.

Continuous time approaches and their discrete counterparts
In this section we derive, by time discretization, a primal-dual numerical algorithm from the second-order dynamical system investigated in [8]. The employed discretization technique replicates the one used to relate fast gradient algorithms with the second-order dynamical system proposed by Su, Boyd and Candès in [25] in the unconstrained case.

Fast gradient scheme: from continuous to discrete time
We recall in this section, for the reader's convenience, the connection between the second-order dynamical system of Su, Boyd and Candès ([25]) and the fast gradient numerical methods formulated in [10,1] in the spirit of Nesterov's accelerated gradient algorithm ([21]). To this end we consider the unconstrained optimization problem

min_{x ∈ H} f(x),

where f : H → ℝ is a convex and Fréchet differentiable function with L-Lipschitz continuous gradient, for L > 0.
Defining z(t) := x(t) + (t/(α − 1)) ẋ(t), this leads to ż(t) = −(t/(α − 1)) ∇f(x(t)), and (AVD) can be written as a first-order ordinary differential equation in x(t) and z(t). (2.4) Let σ > 0. For every k ≥ 1 we take as time step σ_k := σ(k + α − 1)/k and set

τ_k := √σ_k · k = √(σ k (k + α − 1)) ≈ √σ (k + 1),  x(τ_k) ≈ x_{k+1} and z(τ_k) ≈ z_{k+1}.

We "approximate" τ_k by √σ (k + 1) since τ_k is closer to this value than to √σ k. This also explains why we consider x(τ_k) ≈ x_{k+1} and z(τ_k) ≈ z_{k+1} instead of the seemingly more natural choices x(τ_k) ≈ x_k and z(τ_k) ≈ z_k, respectively.
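The claim that τ_k is closer to √σ(k + 1) than to √σ k can be checked numerically; the values α = 3 and σ = 1 below are illustrative choices (for α = 3 one has τ_k/√σ = √(k(k + 2)) = √((k + 1)² − 1), which lies just below k + 1):

```python
import numpy as np

# Distance of tau_k = sqrt(sigma * k * (k + alpha - 1)) to sqrt(sigma)*(k+1) vs sqrt(sigma)*k.
alpha, sigma = 3.0, 1.0
ks = np.arange(1, 1000)
tau = np.sqrt(sigma * ks * (ks + alpha - 1.0))
dist_kp1 = np.abs(tau - np.sqrt(sigma) * (ks + 1))   # distance to sqrt(sigma)*(k+1)
dist_k = np.abs(tau - np.sqrt(sigma) * ks)           # distance to sqrt(sigma)*k
```
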
The implicit finite-difference scheme for (2.4) at time t := τ_k reads (2.5), where the gradient ∇f is evaluated at the point y_k, which is to be determined as a suitable convex combination of x_k and z_k such that x_{k+1} − y_k → 0 as k → +∞. Notice that, since ∇f is L-Lipschitz continuous, this implies that ∇f(x_{k+1}) − ∇f(y_k) → 0 as k → +∞.
The second equation in (2.5) is equivalent to a relation which suggests the choice of y_k as a convex combination of x_k and z_k. From the second equation in (2.5) we further obtain an expression for z_{k+1} in terms of x_{k+1} and x_k. Consequently, (2.5) can be equivalently written in an inertial form, which is nothing else than the algorithm considered by Chambolle and Dossal in [10] (see also [3]).
Furthermore, for every k ≥ 1 the scheme can be rewritten in the form (2.9). Modifications of the sequence {t_k}_{k≥1} which preserve its asymptotic behaviour lead to various acceleration schemes from the literature.
For instance, the classical Nesterov accelerated gradient method ([21]) is precisely (2.9), where the sequence {t_k}_{k≥1} satisfies the recurrence rule t_{k+1} = (1 + √(1 + 4t_k²))/2. Another example is the algorithm proposed by Attouch and Cabot in [1], which corresponds to (2.9) with a different choice of {t_k}_{k≥1}. It can also be interpreted as a discretization of (2.4) with a suitable time step and by setting τ_k := √σ_k (k − α + 1) = √(σ k (k − α + 1)) ≈ √σ (k + 1), x(τ_k) ≈ x_{k+1} and z(τ_k) ≈ z_{k+1}.
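A minimal sketch of the fast gradient scheme (2.9) with Nesterov's t_k-recurrence, run on an illustrative strongly convex quadratic (all data and the iteration count are our own assumptions), shows the rapid decay of the objective gap:

```python
import numpy as np

# Fast gradient scheme: y_k = x_k + ((t_k - 1)/t_{k+1})(x_k - x_{k-1}),
#                       x_{k+1} = y_k - sigma * grad f(y_k),  sigma = 1/L.
P = np.diag([10.0, 1.0])
q = np.array([1.0, 1.0])
def f(x): return 0.5 * x @ P @ x - q @ x
def grad_f(x): return P @ x - q
L = 10.0                              # largest eigenvalue of P
x_star = np.linalg.solve(P, q)
f_star = f(x_star)

sigma = 1.0 / L
x_prev = x = np.zeros(2)
t = 1.0
for k in range(1, 501):
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    y = x + ((t - 1.0) / t_next) * (x - x_prev)
    x_prev, x = x, y - sigma * grad_f(y)
    t = t_next
gap = f(x) - f_star                   # objective gap after 500 iterations
```
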

Fast Augmented Lagrangian Method
In this section we give a precise formulation of the fast Augmented Lagrangian Method for solving (1.1) and prove that it exhibits convergence rates of order O(1/k²) for the primal-dual gap, the feasibility measure, and the objective function value.

The algorithm
Algorithm 1. Let β ≥ 0 and m, γ, ρ, σ > 0 be such that (3.1) holds, and let {t_k}_{k≥1} be a nondecreasing sequence satisfying (3.2). Given x_0 = x_1 ∈ H and λ_0 = λ_1 ∈ G, for every k ≥ 1 we perform the updates (3.3). One can notice that Algorithm 1 can be written in a concise way only in terms of the sequences of primal-dual iterates {(x_k, λ_k)}_{k≥0}; however, this elaborate formulation using auxiliary sequences is more convenient for its analysis.
Even though the choice γ = 1 would give a simplified version of Algorithm 1 without affecting its fast convergence properties, we will see that, in order to guarantee the convergence of {(x_k, λ_k)}_{k≥0} to a primal-dual optimal solution, it will be crucial to choose γ ∈ (0, 1). A similar phenomenon is known from the continuous and discrete schemes in the unconstrained case, where fast convergence rates have been obtained for α ≥ 3, while the convergence of the sequence of iterates/trajectory could be shown only for α > 3. In view of (2.13), in order to be allowed to choose γ ∈ (0, 1), one must have α > 3.
as the dual sequence {λ_k}_{k≥0} can be neglected.
Remark 3.2. By denoting, for every k ≥ 1, the auxiliary quantities z^γ_k and ν^γ_k as in (3.3g), one obtains a first identity. On the other hand, (3.4c) with index k + 1 can be equivalently written as (3.7), and subtracting (3.5) from (3.7) we obtain a recursion. Furthermore, by the definition of z^γ_k and z_k in (3.3g) and (3.4a), a corresponding identity holds. By a similar argument, denoting ν^γ_k for every k ≥ 1, we can derive that −ν^γ_k = t_{k+1}(λ_{k+1} − μ_k) + (γ − 1)(λ_{k+1} − λ_k), which yields an expression for λ_{k+1} − μ_k.

Example 3.3 (The choice γ = 1). In case γ = 1 we denote z_k := z^1_k and ν_k := ν^1_k for every k ≥ 1, which is consistent with the notations in the remark above. Given 0 < σ ≤ 1/(L + β‖A‖²), a nondecreasing sequence {t_k}_{k≥1} fulfilling (3.2), x_0 = x_1 ∈ H and λ_0 = λ_1 ∈ G, Algorithm 1 simplifies accordingly for every k ≥ 1. The fact that this iterative scheme exhibits fast convergence rates for the primal-dual gap, the feasibility measure, and the objective function value will follow from the analysis we will carry out for Algorithm 1. However, nothing can be said about the convergence of the primal-dual iterates.
To this end we will have to assume that γ ∈ (0, 1), which will be a crucial assumption.
Remark 3.4. He, Hu and Fang have considered in [15], for α > 3 and σ, σ′ > 0, an iterative scheme which differs from Algorithm 1 with the choice γ = 1 (as formulated in the above example) in the way the primal-dual iterates {(x_k, λ_k)}_{k≥0} are defined. The formulation of the former allows a more direct derivation of the fast convergence rates for the feasibility measure and the objective function value. The convergence of {(x_k, λ_k)}_{k≥0} has not been addressed in [15], and it is by far not clear whether this sequence converges.
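Since the displayed updates of Algorithm 1 and of the scheme from [15] are not reproduced above, the following sketch shows only the classical (unaccelerated) Augmented Lagrangian Method that both schemes accelerate, on an assumed toy quadratic instance of (1.1); all data and the penalty parameter β are illustrative.

```python
import numpy as np

# Classical ALM for  min 0.5 x^T P x - q^T x  s.t.  A x = b:
#   x_{k+1} = argmin_x L_beta(x, lambda_k),
#   lambda_{k+1} = lambda_k + beta (A x_{k+1} - b).
P = np.array([[2.0, 0.5], [0.5, 1.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
beta = 1.0

lam = np.zeros(1)
M = P + beta * A.T @ A            # Hessian of the augmented Lagrangian in x
for _ in range(200):
    x = np.linalg.solve(M, q - A.T @ lam + beta * A.T @ b)  # exact primal minimization
    lam = lam + beta * (A @ x - b)                          # dual ascent step

# Reference primal-dual solution from the KKT system, for comparison.
K = np.block([[P, A.T], [A, np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.concatenate([q, b]))
x_star, lam_star = sol[:2], sol[2:]
```
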
The following lemma collects some properties of the sequence {t_k}_{k≥1} fulfilling (3.2).
Lemma 3.5. Let 0 < m ≤ 1 and {t_k}_{k≥1} be a nondecreasing sequence fulfilling (3.2).

Proof. Let k ≥ 1. From the assumption we obtain a first bound, which further gives the claimed estimate. We define the function ψ : [1, +∞) → ℝ, s ↦ (m + √(m² + 4s²))/2 − s, and analyze its sign. The statements in (3.14) then follow from the fact that t_k ≥ 1 for every k ≥ 1 and φ_m ≥ 0, and by telescoping arguments, respectively.

Some important estimates and an energy function
In this section we provide some important estimates which will be useful when proving that the sequence of values which a discrete energy function, to be associated with Algorithm 1, takes at a saddle point is nonincreasing.
Lemma 3.6. Let {(x_k, λ_k)}_{k≥0} be the sequence generated by Algorithm 1. Then for every x ∈ H and every k ≥ 1 the inequality (3.15) holds.

Proof. Let x ∈ H and k ≥ 1 be fixed. According to (3.3f) we have (3.16). On the other hand, from (3.3c), (3.3e) and (3.3h) we have (3.17), where the last equation follows from (3.11). Hence, replacing (3.17) in (3.16), we obtain

(3.18)   ∇f(y_k) = −(1/γ) A*ν^γ_{k+1} + ((1 − γ)/γ) A*(λ_k − λ_{k+1}) + (1/σ)(y_k − x_{k+1}) − βA*(Ay_k − b).

The Descent Lemma inequality (1.14) provides a further estimate. Summing up these relations yields precisely (3.15).
In the following we denote by Q the operator defined in (3.19) and use the quantities introduced in (3.20).

Proof. Let x ∈ F, which means that Ax = b, let λ ∈ G, and let k ≥ 1 be fixed. We deduce from Lemma 3.6 an estimate in which, by using the definition of Q, the last identity follows by direct computation. Recall that from (3.3h) we have relation (3.23); moreover, from (3.3d) and (3.3g) we have (3.24). Therefore, by summing up (3.23) and (3.24) and rearranging the terms, the estimate (3.21) follows.
Next we prove the second estimate. Taking x := x_k in inequality (3.15), we get an estimate in which, by using again the definition of Q, the last identity follows by direct computation. The identity (1.9) allows completing the squares, hence, by recalling relation (1.3), (3.25) can be equivalently rewritten. In addition, taking λ := λ_k in (3.24) yields a further identity, and from (3.3d) and (3.3g) we obtain the remaining relations. For (x, λ) ∈ F × G and k ≥ 1 we introduce the energy function E_k(x, λ) associated with Algorithm 1. According to (1.4), for every (x*, λ*) ∈ S and every k ≥ 1 the value E_k(x*, λ*) is nonnegative. The following estimate for the energy function will play a fundamental role in our analysis.
Proposition 3.8. Let {(x_k, λ_k)}_{k≥0} be the sequence generated by Algorithm 1. Then for every (x, λ) ∈ F × G and every k ≥ 1 the estimate (3.28) holds.

Proof. Let (x, λ) ∈ F × G and k ≥ 1 be fixed. Multiplying (3.22) by t_{k+1}(t_{k+1} − 1) ≥ 0 and (3.21) by γt_{k+1}, and adding the resulting inequalities, yields a first estimate. According to (3.3b), (3.3g) and (3.3h), its right-hand side can be further bounded, where in the last inequality we use that {t_k}_{k≥1} is nondecreasing and that t_k ≥ 1 for every k ≥ 1.
On the other hand, (3.3g), (3.9a) and (1.12) give a first estimate. From (1.13), (3.9b) and (3.8) we obtain a second one, which we combine with (3.31). By the same technique, an analogous estimate can be derived. Next we record some direct consequences of the above estimate.
Proposition 3.9. Let {(x_k, λ_k)}_{k≥0} be the sequence generated by Algorithm 1 and (x*, λ*) ∈ S.
Then the sequence {E_k(x*, λ*)}_{k≥1} is nonincreasing and the following statements are true.

Proof. Since {t_k}_{k≥1} is a nondecreasing sequence that satisfies (3.2) and 0 < m ≤ γ ≤ 1, we have for every k ≥ 1

(m − 1)t_{k+1} + (1 − γ)t_k ≤ (m − γ)t_k ≤ 0.

Moreover, as (x*, λ*) ∈ S, we must have x* ∈ F and L_β(x_k, λ*) − L_β(x*, λ_k) ≥ 0 for every k ≥ 1 due to (1.4). Combining these observations, we deduce the claimed monotonicity from inequality (3.28) applied to (x, λ) = (x*, λ*); consequently, for every k ≥ 1 we obtain (3.37).

Remark 3.11. Recall that from Proposition 3.9 we have a summability property whenever γ < 1. Taking into account the way γ has arisen in the context of the dynamical system (PD-AVD) (see (2.13)), this corresponds to the condition 1/(α − 1) < θ. In the continuous case it has been proved (see [8, Theorem 3.2]) that, if 1/(α − 1) < θ, then a corresponding integral is finite, which can be seen as the continuous counterpart of (3.38). Both statements play a crucial role in the proofs of the convergence of the sequence of iterates generated by Algorithm 1 and of the trajectory generated by (PD-AVD), respectively.
The following result, which complements the statements of Proposition 3.9, will also play a crucial role in the proof of the convergence of the sequence of iterates.

Proposition 3.12. Let {(x_k, λ_k)}_{k≥0} be the sequence generated by Algorithm 1 under an additional condition on the parameters, and let (x*, λ*) ∈ S. Then the statements (3.39a) and (3.39b) are true. In addition, there exists C_0 > 0 such that a corresponding bound holds for every k ≥ 1.

Proof. From (3.18), after rearranging some terms, we obtain for every k ≥ 1 an identity involving the terms + βA*A(x_{k+1} − y_k) − βA*(Ax_{k+1} − b).
It follows from Proposition 3.9, by using (3.20) and the fact that t_k ≥ 1, a first bound for every k ≥ 1. According to (3.3d), we have for every k ≥ 1 a relation to which we apply the identity (1.10). On the other hand, condition (3.2) and the fact that {t_k}_{k≥1} is nondecreasing provide a further estimate for every k ≥ 1. Combining (3.40) and (3.41) yields, for every k ≥ 1, an inequality which places us in the setting of (1.15). According to Lemma 1.1, (3.39a) and (3.39b) are fulfilled and the associated sequence is convergent, therefore bounded. Consequently, there exists C_0 > 0 such that the claimed bound holds for every k ≥ 1, which provides the conclusion.

On the boundedness of the sequences
In this section we discuss the boundedness of the sequence of primal-dual iterates {(x_k, λ_k)}_{k≥0} and also of other related sequences which play a role in the convergence analysis.
To this end we define on H × G the inner product ⟨·, ·⟩_Q, where Q is the operator defined in (3.19), which we proved to be positive definite under assumption (3.1); the norm induced by this inner product is denoted ‖·‖_Q. The condition on the sequence {t_k}_{k≥1} which we will assume in the next proposition in order to guarantee boundedness of the sequences generated by Algorithm 1 has been proposed in [4]. Later we will see that it is satisfied by the three classical inertial parameter rules by Nesterov, Chambolle–Dossal and Attouch–Cabot.

Proposition 3.13. Let {(x_k, λ_k)}_{k≥0} be the sequence generated by Algorithm 1 and suppose that condition (3.42) holds. Then the sequences {(x_k, λ_k)}_{k≥0}, {(z^γ_k, ν^γ_k)}_{k≥1} and {(t_{k+1}(x_{k+1} − x_k), t_{k+1}(λ_{k+1} − λ_k))}_{k≥0} are bounded. If, in addition, β > 0, then the sequence {t_{k+1}(t_{k+1} − 1)(Ax_{k+1} − Ax_k)}_{k≥0} is also bounded.
Proof. Let (x*, λ*) ∈ S be fixed. For brevity we write u* := (x*, λ*) ∈ S and u_k := (x_k, λ_k) ∈ H × G for every k ≥ 0.
By applying (1.13), we obtain from (3.3g) a first identity for every k ≥ 1; by applying (1.11), we obtain from (3.3d) a second one. This means that the energy function at (x*, λ*) can be rewritten accordingly for every k ≥ 1. According to Proposition 3.9, the sequence {E_k(x*, λ*)}_{k≥1} is nonincreasing, therefore it is bounded from above for every k ≥ 1. From here we conclude that the sequence {(z^γ_k, ν^γ_k)}_{k≥1} is bounded. In addition, for every k ≥ 1 an estimate holds in which the last inequality is due to (3.13), with the convention t_0 := 0. After telescoping, and thanks to (3.42), we obtain that {u_k = (x_k, λ_k)}_{k≥0} is bounded. That {(t_{k+1}(x_{k+1} − x_k), t_{k+1}(λ_{k+1} − λ_k))}_{k≥0} is bounded follows from the corresponding identity for all k ≥ 1. Finally, recall that from (3.11), (3.3h) and (3.3g) we have an identity for every k ≥ 1; the last statement of the proposition follows from here and (3.37).
In the following we will see that the two most prominent choices for the sequence {t_k}_{k≥1} from the literature, namely the ones following the rules of Nesterov and of Chambolle–Dossal, satisfy not only (3.2) but also (3.42).
Example 3.14 (Nesterov rule). The classical construction proposed by Nesterov in [21] for {t_k}_{k≥1} satisfies the rule

t_1 := 1 and t_{k+1} := (1 + √(1 + 4t_k²))/2 for every k ≥ 1.

The sequence {t_k}_{k≥1} is strictly increasing and verifies relation (3.2) for m := 1 with equality.
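The recurrence can be checked numerically: it implies the exact identity t_{k+1}² − t_{k+1} = t_k² (the equality case of (3.2) for m = 1, reading the garbled display (3.2) as t_{k+1}² − t_k² ≤ m t_{k+1}, which is our assumption), together with the well-known lower bound t_k ≥ (k + 1)/2 behind the O(1/k²) rates:

```python
import math

# Nesterov's sequence: t_1 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2.
t = [1.0]                                  # t[i] stores t_{i+1}
for _ in range(500):
    t.append((1.0 + math.sqrt(1.0 + 4.0 * t[-1] ** 2)) / 2.0)

# Exact identity t_{k+1}^2 - t_{k+1} = t_k^2, and growth bound t_k >= (k + 1) / 2.
identity_residuals = [t[k + 1] ** 2 - t[k + 1] - t[k] ** 2 for k in range(500)]
lower_bound_ok = all(t[i] >= (i + 2) / 2.0 - 1e-12 for i in range(501))
```
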
Example 3.15 (Chambolle–Dossal rule). The construction proposed by Chambolle and Dossal in [10] (see also [3]) for {t_k}_{k≥1} satisfies for α ≥ 3 the rule (3.45). First we show that this sequence fulfills (3.2) with m := 2/(α − 1) ≤ 1; indeed, this follows by a direct computation for every k ≥ 1. Furthermore, one can see that for every k ≥ 1 condition (3.42) is verified for κ = 1/(α − 1). Finally, we observe that, taking into consideration the choice of γ in (2.13) in the context of the dynamical system (PD-AVD) and assumption (3.1) in Algorithm 1, the parameter m in Algorithm 1 is connected with the parameter θ in (PD-AVD).
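Display (3.45) did not survive extraction; the Chambolle–Dossal sequence is presumably t_k = (k + α − 1)/(α − 1), an assumption of ours that is consistent with the stated values m = 2/(α − 1) and κ = 1/(α − 1) (since then t_{k+1} − t_k = 1/(α − 1)). Under this assumption, both conditions can be verified numerically, again reading (3.2) as t_{k+1}² − t_k² ≤ m t_{k+1}:

```python
# Presumed Chambolle-Dossal sequence t_k = (k + alpha - 1) / (alpha - 1);
# the exact display (3.45) is unavailable, so this form is an assumption.
alpha = 5.0                        # any alpha >= 3; value illustrative
m = 2.0 / (alpha - 1.0)
kappa = 1.0 / (alpha - 1.0)

def t(k):
    return (k + alpha - 1.0) / (alpha - 1.0)

cond_32 = all(t(k + 1) ** 2 - t(k) ** 2 <= m * t(k + 1) + 1e-12 for k in range(1, 1000))
cond_342 = all(abs((t(k + 1) - t(k)) - kappa) < 1e-12 for k in range(1, 1000))
```
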

Fast convergence rates for the primal-dual gap, the feasibility measure and the objective function value
We have seen in Remark 3.10 that, for the general choice of the sequence {t_k}_{k≥1} in (3.2), the convergence rate of the primal-dual gap is of order O(1/t_k²) as k → +∞. In addition, if β > 0, then the convergence rate of the feasibility measure is of order O(1/t_k) as k → +∞. In this section we will prove that the convergence rates of the feasibility measure and of the objective function value are O(1/t_k²) as k → +∞ when the sequence {t_k}_{k≥1} is chosen following the rules of Nesterov, Chambolle–Dossal and also Attouch–Cabot. In view of (3.42), this will lead, for the primal-dual sequence {(x_k, λ_k)}_{k≥0} generated by Algorithm 1 and a given primal-dual solution (x*, λ*), to fast convergence rates of order O(1/k²). We start with the following lemma, which holds in the very general setting of Algorithm 1.
Lemma 3.16. Let $\{(x_k, \lambda_k)\}_{k \geq 0}$ be the sequence generated by Algorithm 1 and $(x^*, \lambda^*) \in S$. Then the quantity $C_1 := \sup_{\mu \in B(\lambda^*; 1)} E_1(x^*, \mu)$ is finite.

Proof. Let $(x^*, \lambda^*) \in S$ and $\mu \in B(\lambda^*; 1)$. The Cauchy–Schwarz inequality gives an estimate of the $\mu$-dependent terms of $E_1(x^*, \mu)$. On the other hand, as $\nu_1^{\gamma} = \gamma \lambda_1$ and $\mu \in B(\lambda^*; 1)$, these terms are bounded by a quantity depending only on $\lambda_1$ and $\lambda^*$. Combining these estimates, as $z_1^{\gamma} = \gamma x_1$, we obtain an upper bound for $E_1(x^*, \mu)$ which is independent of $\mu$, which proves the statement.

The Nesterov ([21]) rule
We have seen that by choosing $\{t_k\}_{k \geq 1}$ as in (3.44), condition (3.2) is fulfilled with equality for $m = 1$, which also yields $\gamma = 1$ due to (3.1). Consequently, from Proposition 3.8 it follows that for every $(x, \lambda) \in F \times G$ and every $k \geq 1$ it holds $E_{k+1}(x, \lambda) \leq E_k(x, \lambda)$, which means that the sequence $\{E_k(x, \lambda)\}_{k \geq 1}$ is nonincreasing. This statement is stronger than the one in Proposition 3.9, where we have proved that the sequence of values of the energy function taken at a primal-dual optimal solution is nonincreasing, and it will play an important role in the following.
Theorem 3.17. Let $\{(x_k, \lambda_k)\}_{k \geq 0}$ be the sequence generated by Algorithm 1, with the sequence $\{t_k\}_{k \geq 1}$ chosen to satisfy the Nesterov rule (3.44), and $(x^*, \lambda^*) \in S$. Then for every $k \geq 1$ it holds
$$0 \leq \mathcal{L}(x_k, \lambda^*) - \mathcal{L}(x^*, \lambda_k) + \|Ax_k - b\| \leq \frac{C_1}{t_k^2}, \tag{3.49}$$
together with the corresponding $O(1/t_k^2)$ estimates for the feasibility measure and the objective function value.

Proof. As mentioned earlier in (3.48), for every $(x, \lambda) \in F \times G$ and every $k \geq 1$ we have (taking into account that $\gamma = 1$)
$$t_k^2 \left( f(x_k) - f(x) + \langle \lambda, Ax_k - b \rangle \right) \leq E_k(x, \lambda) \leq \cdots \leq E_1(x, \lambda). \tag{3.51}$$
We fix $n \geq 1$ and define $r_n := \lambda^* + \frac{Ax_n - b}{\|Ax_n - b\|}$ if $Ax_n - b \neq 0$, and $r_n := \lambda^*$ otherwise. Then $x^* \in F$ and $r_n \in B(\lambda^*; 1)$, hence $(x^*, r_n) \in F \times B(\lambda^*; 1)$, and, according to (3.51) and Lemma 3.16,
$$t_n^2 \left( f(x_n) - f(x^*) + \langle r_n, Ax_n - b \rangle \right) \leq E_1(x^*, r_n) \leq \sup_{\mu \in B(\lambda^*; 1)} E_1(x^*, \mu) = C_1. \tag{3.52}$$
If $Ax_n - b \neq 0$, then
$$f(x_n) - f(x^*) + \langle r_n, Ax_n - b \rangle = f(x_n) - f(x^*) + \langle \lambda^*, Ax_n - b \rangle + \|Ax_n - b\| = \mathcal{L}(x_n, \lambda^*) - \mathcal{L}(x^*, \lambda_n) + \|Ax_n - b\|.$$
On the other hand, if $Ax_n - b = 0$, we have
$$f(x_n) - f(x^*) + \langle r_n, Ax_n - b \rangle = f(x_n) - f(x^*) + \langle \lambda^*, Ax_n - b \rangle = \mathcal{L}(x_n, \lambda^*) - \mathcal{L}(x^*, \lambda_n) = \mathcal{L}(x_n, \lambda^*) - \mathcal{L}(x^*, \lambda_n) + \|Ax_n - b\|;$$
thus, in both scenarios, (3.52) becomes
$$0 \leq t_n^2 \left( \mathcal{L}(x_n, \lambda^*) - \mathcal{L}(x^*, \lambda_n) + \|Ax_n - b\| \right) \leq C_1.$$
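The identity invoked in both cases can be traced in one computation; recall $\mathcal{L}(x, \lambda) = f(x) + \langle \lambda, Ax - b \rangle$ and $Ax^* = b$, and take $r_n := \lambda^* + \frac{Ax_n - b}{\|Ax_n - b\|}$ when $Ax_n - b \neq 0$ (this choice of $r_n$ is the one consistent with $r_n \in B(\lambda^*; 1)$ and the equalities above):

```latex
\begin{align*}
\langle r_n , A x_n - b \rangle
  &= \langle \lambda^* , A x_n - b \rangle
     + \frac{\| A x_n - b \|^2}{\| A x_n - b \|}
   = \langle \lambda^* , A x_n - b \rangle + \| A x_n - b \| , \\
\mathcal{L}(x_n, \lambda^*) - \mathcal{L}(x^*, \lambda_n)
  &= f(x_n) + \langle \lambda^* , A x_n - b \rangle
     - f(x^*) - \langle \lambda_n , \underbrace{A x^* - b}_{= 0} \rangle \\
  &= f(x_n) - f(x^*) + \langle \lambda^* , A x_n - b \rangle .
\end{align*}
```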
As $\mathcal{L}(x_k, \lambda^*) - \mathcal{L}(x^*, \lambda_k) \geq 0$, a direct consequence of (3.49) is that for every $k \geq 1$
$$\|Ax_k - b\| \leq \frac{C_1}{t_k^2}.$$
From (3.49) and the Cauchy–Schwarz inequality, we deduce from here that for every $k \geq 1$
$$|f(x_k) - f(x^*)| \leq \left| \mathcal{L}(x_k, \lambda^*) - \mathcal{L}(x^*, \lambda_k) \right| + \|\lambda^*\| \, \|Ax_k - b\| \leq \frac{(1 + \|\lambda^*\|) \, C_1}{t_k^2}.$$
On the other hand, the convexity of $f$ together with (1.2) guarantees a matching lower bound for $f(x_k) - f(x^*)$ for every $k \geq 1$.

The Chambolle-Dossal ([10]) rule
In this section we prove fast convergence rates for the primal-dual gap, the feasibility measure and the objective function value for the sequence of inertial parameters $\{t_k\}_{k \geq 1}$ following, for $\alpha \geq 3$, the Chambolle–Dossal rule (3.45). We have seen in Example 3.15 that in this case $\{t_k\}_{k \geq 1}$ fulfills (3.2) for $m := \frac{2}{\alpha - 1}$ and (3.42) for $\kappa := \frac{1}{\alpha - 1}$. To begin, we observe that for $\frac{2}{\alpha - 1} = m \leq \gamma \leq 1$ and every $k \geq 1$ a corresponding estimate holds (see (3.46)). The analysis distinguishes between the parameters $m$ and $\gamma$: first we will assume that they are equal, which will then also cover the case $\alpha = 3$.
Theorem 3.18. Let $\{(x_k, \lambda_k)\}_{k \geq 0}$ be the sequence generated by Algorithm 1 with the sequence $\{t_k\}_{k \geq 1}$ chosen to satisfy the Chambolle–Dossal rule (3.45), $m := \frac{2}{\alpha - 1} = \gamma \leq 1$, $\beta > 0$, and $(x^*, \lambda^*) \in S$. Then for every $k \geq 2$ the estimates (3.56) and (3.57) hold, providing fast convergence rates for the primal-dual gap, the feasibility measure and the objective function value, with constants made explicit in the proof.

Proof. We fix $n \geq 2$ and define $r_n := \lambda^* + \frac{Ax_n - b}{\|Ax_n - b\|}$ if $Ax_n - b \neq 0$, and $r_n := \lambda^*$ otherwise. Then $x^* \in F$ and $r_n \in B(\lambda^*; 1)$. Since $\gamma(\alpha - 1) = 2$, according to (3.55), we have a corresponding identity for every $k \geq 1$. By taking $(x, \lambda) := (x^*, r_n) \in F \times B(\lambda^*; 1)$ in (3.28), we obtain for every $k \geq 1$ a chain of estimates (3.58a)–(3.58c), where (3.58b) follows from (3.37) and (3.58c) is due to (3.42). By a telescoping sum argument and Lemma 3.16 we conclude that for every $k \geq 1$ the resulting partial sums are bounded by $C_4 (\log k + 1)$, with a constant $C_4 > 0$ defined in terms of $C_1$. By choosing $k := n - 1$, this yields
$$t_n (t_n - 1 + \gamma) \left( f(x_n) - f(x^*) + \langle r_n, Ax_n - b \rangle \right) \leq E_n(x^*, r_n) \leq C_4 \left( \log(n - 1) + 1 \right).$$
We have seen in the proof of Theorem 3.17 that
$$f(x_n) - f(x^*) + \langle r_n, Ax_n - b \rangle = \mathcal{L}(x_n, \lambda^*) - \mathcal{L}(x^*, \lambda_n) + \|Ax_n - b\|;$$
thus, by taking into account (3.42), we obtain an upper bound for this quantity. Taking into account also Lemma 3.16 and the definition of $C_4$, we obtain a bound valid for every $k \geq 1$. Using this estimate in (3.58a), we obtain a refined inequality for every $k \geq 1$, and by using once again the telescoping sum argument we conclude a summed estimate for every $k \geq 1$. From here, (3.56) follows by choosing $k := n - 1$ and by using that $\gamma t_n^2 \leq t_n (t_n - 1 + \gamma)$ together with (3.59). Statement (3.57) follows from (3.56) by repeating the arguments at the end of the proof of Theorem 3.17.

Now we come to the second case, namely, when $m := \frac{2}{\alpha - 1} < \gamma \leq 1$, which implicitly requires that $\alpha > 3$. For the proof of the fast convergence rates we will make use of the following result, which can be found in [18, Lemma 2] (see also [17, Lemma 3.18]).

Lemma 3.19. Let $\{\zeta_k\}_{k \geq 1} \subseteq G$ be a sequence such that there exist $\delta > 1$ and $M > 0$ with the property that for every $K \geq 1$ the weighted partial sums of $\{\zeta_k\}_{k \geq 1}$ satisfy a growth condition expressed in terms of $\delta$ and $M$. Then for every $K \geq 1$ the partial sums $\sum_{k=1}^{K} \zeta_k$ satisfy a corresponding uniform bound.

Theorem 3.20. Let $\{(x_k, \lambda_k)\}_{k \geq 0}$ be the sequence generated by Algorithm 1 with the sequence $\{t_k\}_{k \geq 1}$ chosen to satisfy the Chambolle–Dossal rule (3.45), $m := \frac{2}{\alpha - 1} < \gamma \leq 1$, $\beta > 0$, and $(x^*, \lambda^*) \in S$. Then for every $k \geq 1$ the fast convergence rate estimates for the primal-dual gap, the feasibility measure and the objective function value hold, where $\omega_0 := \delta (\alpha - 2) - 2 (\alpha - 1)$ and $\omega_1 := (\delta - 1)(\alpha - 2) - 1$.
Proof. One first derives an estimate (3.64). Furthermore, it follows from (3.42) and (3.37) that a corresponding bound holds, where we also recall that, due to Proposition 3.13, it holds $\sup_{k \geq 1} \nu_k < +\infty$. Inequality (3.66) holds for every $K \geq 1$ with a constant $C_7$ independent of $K$; consequently, we can apply Lemma 3.19. By using again the triangle inequality and (3.66), we obtain a bound for every $K \geq 1$, and, using the inequality (3.14) in Lemma 3.5, we see that it holds for every $K \geq 1$. Statement (3.61) then follows by repeating the arguments at the end of the proof of Theorem 3.17.

The Attouch-Cabot ([1]) rule
Another inertial parameter rule used in the literature in the context of fast numerical algorithms is the one proposed by Attouch and Cabot in [1], which reads for $\alpha \geq 3$
$$t_k := \frac{k - 1}{\alpha - 1} \quad \forall k \geq 1.$$
This sequence is monotonically increasing and it fulfills (3.2) with $m := \frac{2}{\alpha - 1} \leq 1$, as can be checked directly for every $k \geq 1$. This shows that the sequence $\{t_k\}_{k \geq 1}$ has very much in common with the Chambolle–Dossal parameter rule. The only significant difference is that it starts at $0$ and that $t_k \geq 1$ holds only for $k \geq k_1 := \lfloor \alpha \rfloor + 1$. Consequently, the fast convergence rate results for the primal-dual gap, the feasibility measure and the objective function value are valid also for the Attouch–Cabot rule. This can be easily seen by slightly adapting the proofs given in the setting of the Chambolle–Dossal rule, taking into consideration that some of the estimates hold only for $k \geq k_1$. This exercise is left to the reader.
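Under the same reading of (3.2) as before, $t_{k+1}^2 - t_k^2 \leq m \, t_{k+1}$ (an assumption here), and assuming the closed form $t_k = \frac{k - 1}{\alpha - 1}$ suggested by the surrounding discussion, the stated properties follow directly:

```latex
\begin{align*}
t_1 &= 0 , \qquad t_{k+1} - t_k = \frac{1}{\alpha - 1} , \\
t_{k+1}^2 - t_k^2
  &= \frac{t_{k+1} + t_k}{\alpha - 1}
  \;\leq\; \frac{2 \, t_{k+1}}{\alpha - 1}
  = m \, t_{k+1} , \\
t_k \geq 1
  \;&\Longleftrightarrow\; k \geq \alpha ,
  \quad \text{which holds for every } k \geq k_1 := \lfloor \alpha \rfloor + 1 .
\end{align*}
```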

Convergence of the iterates
In this section we will turn our attention to the convergence of the sequence of primal-dual iterates generated by Algorithm 1 to a primal-dual solution of (1.1). First, we will prove that the first assumption of the Opial Lemma is verified, and to this end we will need the following technical lemma.

Lemma 4.1. Let $\{\theta_k\}_{k \geq 1}$, $\{a_k\}_{k \geq 1}$, $\{t_k\}_{k \geq 1}$ be real sequences such that $\{a_k\}_{k \geq 1}$ is bounded from below and $\{t_k\}_{k \geq 1}$ is nondecreasing and bounded from below by $1$, and let $\{d_k\}_{k \geq 1}$ be a nonnegative sequence such that for every $k \geq 1$
$$a_{k+1} \leq a_k + \theta_{k+1} \tag{4.1a}$$
and
$$t_{k+1} \theta_{k+1} \leq (t_k - 1) \theta_k + d_k. \tag{4.1b}$$
If $\sum_{k \geq 1} d_k < +\infty$, then the sequence $\{a_k\}_{k \geq 1}$ is convergent.

Proof. It follows from (4.1b) that for every $k \geq 1$
$$t_{k+1} \theta_{k+1} \leq (t_k - 1) \theta_k + d_k \leq (t_k - 1) [\theta_k]_+ + d_k, \tag{4.2}$$
where $[\cdot]_+$ denotes the positive part. Since the right-hand side of this inequality is nonnegative, it yields that for every $k \geq 1$
$$[\theta_k]_+ \leq t_k [\theta_k]_+ - t_{k+1} [\theta_{k+1}]_+ + d_k,$$
which, by telescoping cancellation, gives $\sum_{k \geq 1} [\theta_k]_+ < +\infty$. According to (4.1a), we have that for every $k \geq 1$ it holds
$$a_{k+1} \leq a_k + \theta_{k+1} \leq a_k + [\theta_{k+1}]_+.$$
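The telescoping step can be spelled out: the inequality for $[\theta_k]_+$ is summed from $k = 1$ to $K$, the intermediate terms $t_k [\theta_k]_+$ cancel, and the boundary term is discarded using $t_{K+1} [\theta_{K+1}]_+ \geq 0$:

```latex
\begin{equation*}
\sum_{k=1}^{K} [\theta_k]_+
  \;\leq\; t_1 [\theta_1]_+ - t_{K+1} [\theta_{K+1}]_+ + \sum_{k=1}^{K} d_k
  \;\leq\; t_1 [\theta_1]_+ + \sum_{k \geq 1} d_k
  \;<\; +\infty
  \qquad \forall K \geq 1 .
\end{equation*}
```

Letting $K \to +\infty$ gives $\sum_{k \geq 1} [\theta_k]_+ < +\infty$.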
By using Lemma 1.1 we obtain from here that the sequence $\{a_k\}_{k \geq 1}$ is convergent.

Proposition 4.2. Let $\{(x_k, \lambda_k)\}_{k \geq 0}$ be the sequence generated by Algorithm 1 with $0 < m < \gamma < 1$. Then for every $(x^*, \lambda^*) \in S$ the limit $\lim_{k \to +\infty} \|(x_k, \lambda_k) - (x^*, \lambda^*)\|_W$ exists.
Proof. Let $(x^*, \lambda^*) \in S$ be fixed. For brevity we will write $u^* := (x^*, \lambda^*) \in S$ and $u_k := (x_k, \lambda_k) \in H \times G$ for every $k \geq 0$.
It follows from (3.35) that $E_{k+1}(x^*, \lambda^*) \leq E_k(x^*, \lambda^*)$ for every $k \geq 1$. In view of (3.43), after rearranging some terms, we obtain for every $k \geq 1$ an inequality (4.3). We notice that for every $k \geq 1$ the estimate (4.3) becomes (4.1b) for a suitable choice of the sequences in Lemma 4.1, while (4.1a) obviously holds. As $0 < m < \gamma < 1$, it follows from Proposition 3.9 that $\sum_{k \geq 1} d_k < +\infty$. Hence, we can apply Lemma 4.1 to conclude that $\{\|(x_k, \lambda_k) - (x^*, \lambda^*)\|_W\}_{k \geq 1}$ is convergent.
The following result, Theorem 4.3, is the discrete counterpart of [8, Theorem 4.7] (see (2.2)); its proof is a direct consequence of Proposition 3.9 and Proposition 3.12. It guarantees that
$$\nabla f(x_k) + A^* \lambda_k \to 0 \quad \text{and} \quad Ax_k - b \to 0 \quad \text{as } k \to +\infty.$$
As seen in Section 3.4, if, in addition, $\{t_k\}_{k \geq 1}$ is chosen to satisfy the Chambolle–Dossal or the Attouch–Cabot rule and $m := \frac{2}{\alpha - 1}$, then these residuals vanish at corresponding fast rates. Now we can prove the main theorem of this section, establishing the convergence of the sequence of iterates generated by Algorithm 1.
Theorem 4.4. Let $\{(x_k, \lambda_k)\}_{k \geq 0}$ be the sequence generated by Algorithm 1 with $0 < m < \gamma < 1$. Then $\{(x_k, \lambda_k)\}_{k \geq 0}$ converges weakly to a primal-dual optimal solution of (1.1).

Proof. From Proposition 4.2 it follows that the limit $\lim_{k \to +\infty} \|(x_k, \lambda_k) - (x^*, \lambda^*)\|$ exists for every $(x^*, \lambda^*) \in S$. This proves the first condition of Lemma 1.2.
In order to prove condition (ii), let $(\tilde{x}, \tilde{\lambda}) \in H \times G$ be an arbitrary weak sequential cluster point of $\{(x_k, \lambda_k)\}_{k \geq 0}$. This means that there exists a subsequence $\{(x_{k_n}, \lambda_{k_n})\}_{n \geq 0}$ which converges weakly to $(\tilde{x}, \tilde{\lambda})$ as $n \to +\infty$. According to Theorem 4.3 we have $\nabla f(x_k) + A^* \lambda_k \to 0$ and $Ax_k - b \to 0$ as $k \to +\infty$; hence, $\nabla f(x_{k_n}) + A^* \lambda_{k_n} \to 0$ and $Ax_{k_n} - b \to 0$ as $n \to +\infty$.
Since the graph of the operator $T_L$ is sequentially closed in $(H \times G)^{\mathrm{weak}} \times (H \times G)^{\mathrm{strong}}$ (cf. [5, Proposition 20.38]), it follows from here that
$$\nabla f(\tilde{x}) + A^* \tilde{\lambda} = 0 \quad \text{and} \quad A \tilde{x} - b = 0.$$
In other words, $(\tilde{x}, \tilde{\lambda}) \in S$ and the proof is complete.

Remark 4.5. If the sequence $\{t_k\}_{k \geq 1}$ is chosen to satisfy the Chambolle–Dossal or the Attouch–Cabot rule with $\alpha > 3$, $m := \frac{2}{\alpha - 1} < \gamma < 1$, $0 < \sigma < \frac{\gamma}{L + \gamma \beta \|A\|^2}$ and $\beta > 0$, then Theorem 4.4 guarantees that the sequence $\{(x_k, \lambda_k)\}_{k \geq 0}$ converges weakly to a primal-dual optimal solution of (1.1). This statement comes in addition to the fast convergence rates of order $O(1/k^2)$ for the primal-dual gap, the feasibility measure, and the objective function value.
If the sequence $\{t_k\}_{k \geq 1}$ is chosen to satisfy the Nesterov rule, then, as we have seen, the fast convergence rate results also hold; however, since in this setting $m = \gamma = 1$, one cannot apply Theorem 4.4 to obtain the convergence of the iterates. This is consistent with the unconstrained case, for which it is also not known whether the sequence of iterates generated by the fast gradient method with inertial parameters following the Nesterov rule converges.

1 Problem formulation and motivation

Consider the optimization problem
$$\min f(x), \quad \text{subject to } Ax = b, \tag{1.1}$$
and the saddle point problem associated to problem (1.1)
$$\min_{x \in H} \max_{\lambda \in G} \mathcal{L}(x, \lambda),$$
where $\mathcal{L} : H \times G \to \mathbb{R}$ denotes the Lagrangian function
$$\mathcal{L}(x, \lambda) := f(x) + \langle \lambda, Ax - b \rangle.$$
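For reference, since $f$ is convex and differentiable, a pair $(x^*, \lambda^*)$ belongs to the set $S$ of primal-dual optimal solutions of (1.1) exactly when it satisfies the first-order optimality system (the same system that appears at the end of the proof of Theorem 4.4):

```latex
\begin{equation*}
(x^*, \lambda^*) \in S
\quad \Longleftrightarrow \quad
\begin{cases}
\nabla f(x^*) + A^* \lambda^* = 0 , \\
A x^* - b = 0 .
\end{cases}
\end{equation*}
```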

We recall that $S_+(H)$ denotes the family of self-adjoint and positive semidefinite continuous linear operators $W : H \to H$, and that every $W \in S_+(H)$ induces on $H$ a semi-norm defined by $\|x\|_W := \sqrt{\langle W x, x \rangle}$ for every $x \in H$. In this notation, assumption (3.1) guarantees that $\gamma Q - L \, \mathrm{Id} \in S_+(H)$.