Fixing and extending some recent results on the ADMM algorithm

We investigate the techniques and ideas used in Shefi and Teboulle (SIAM J Optim 24(1), 269–297, 2014) in the convergence analysis of two proximal ADMM algorithms for solving convex optimization problems involving compositions with linear operators. Besides this, we formulate a variant of the ADMM algorithm that is able to handle convex optimization problems involving an additional smooth function in its objective, which is evaluated through its gradient. Moreover, in each iteration, we allow the use of variable metrics, while the investigations are carried out in the setting of infinite-dimensional Hilbert spaces. This algorithmic scheme is investigated from the point of view of its convergence properties.


Introduction
One of the most popular numerical algorithms for solving optimization problems of the form

$$\inf_{x\in\mathbb{R}^n}\{f(x)+g(Ax)\},\qquad (1)$$

where f : ℝ^n → ℝ̄ := ℝ ∪ {±∞} and g : ℝ^m → ℝ̄ are proper, convex, lower semicontinuous functions and A : ℝ^n → ℝ^m is a linear operator, is the alternating direction method of multipliers (ADMM). Here, the spaces ℝ^n and ℝ^m are equipped with their usual inner products and induced norms, which we denote by ⟨·, ·⟩ and ‖·‖, respectively, as there is no risk of confusion. By introducing an auxiliary variable z, one can rewrite (1) as

$$\inf_{\substack{(x,z)\in\mathbb{R}^n\times\mathbb{R}^m\\ Ax=z}}\{f(x)+g(z)\}.\qquad (2)$$

The Lagrangian associated with problem (2) is

$$l(x,z,y)=f(x)+g(z)+\langle y,Ax-z\rangle,\qquad (3)$$

and we say that (x*, z*, y*) ∈ ℝ^n × ℝ^m × ℝ^m is a saddle point of the Lagrangian if

l(x*, z*, y) ≤ l(x*, z*, y*) ≤ l(x, z, y*)  ∀(x, z, y) ∈ ℝ^n × ℝ^m × ℝ^m.
It is known that (x*, z*, y*) is a saddle point of l if and only if z* = Ax*, (x*, z*) is an optimal solution of (2), y* is an optimal solution of the Fenchel-Rockafellar dual problem (see [3-5, 20, 30]) to (1),

$$\sup_{y\in\mathbb{R}^m}\{-f^*(-A^Ty)-g^*(y)\},\qquad (4)$$

and the optimal objective values of (1) and (4) coincide. Notice that f* and g* are the conjugates of f and g, defined by f*(u) = sup_{x∈ℝ^n} {⟨u, x⟩ − f(x)} for all u ∈ ℝ^n and g*(y) = sup_{z∈ℝ^m} {⟨y, z⟩ − g(z)} for all y ∈ ℝ^m, respectively. If (1) has an optimal solution and A(ri(dom f)) ∩ ri(dom g) ≠ ∅, then the set of saddle points of l is nonempty. Here, we denote by ri(S) the relative interior of a convex set S, which is the interior of S relative to its affine hull.
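As a quick numerical illustration of the conjugate functions just introduced, the following sketch checks the Moreau decomposition prox_g(v) + prox_{g*}(v) = v for the pair g = ‖·‖₁ and g* the indicator of the unit ℓ∞-ball; the function names are ours, purely for illustration.

```python
import numpy as np

# Numerical check of the conjugate pair g = ||.||_1 and g* = indicator of
# the unit l_inf-ball, via the Moreau decomposition
#   prox_g(v) + prox_{g*}(v) = v,
# where prox_g is soft-thresholding and prox_{g*} is the projection onto
# {y : ||y||_inf <= 1}.

def prox_l1(v):
    """Proximal map of ||.||_1 (soft-thresholding with threshold 1)."""
    return np.sign(v) * np.maximum(np.abs(v) - 1.0, 0.0)

def proj_linf_ball(v):
    """Projection onto the unit l_inf-ball, i.e., the prox of g*."""
    return np.clip(v, -1.0, 1.0)

v = np.random.default_rng(0).standard_normal(5)
assert np.allclose(prox_l1(v) + proj_linf_ball(v), v)  # Moreau identity
```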
For a fixed real number c > 0, we further consider the augmented Lagrangian associated with problem (2), which is defined as

$$l_c(x,z,y)=f(x)+g(z)+\langle y,Ax-z\rangle+\frac{c}{2}\|Ax-z\|^2.$$

The ADMM algorithm reads:

Algorithm 1 Choose z^0 ∈ ℝ^m and y^0 ∈ ℝ^m. For all k ≥ 0, set:

$$x^{k+1}\in\operatorname*{argmin}_{x\in\mathbb{R}^n} l_c(x,z^k,y^k),\qquad (5)$$
$$z^{k+1}=\operatorname*{argmin}_{z\in\mathbb{R}^m} l_c(x^{k+1},z,y^k),\qquad (6)$$
$$y^{k+1}=y^k+c(Ax^{k+1}-z^{k+1}).\qquad (7)$$

If A has full column rank, then the set of minimizers in (5) is a singleton, as is the set of minimizers in (6) without any further assumption, and the sequence (x^k, z^k, y^k)_{k≥0} generated by Algorithm 1 converges to a saddle point of the Lagrangian l (see, for instance, [19]). The alternating direction method of multipliers was first introduced in [25] and [23]. Gabay has shown in [24] (see also [19]) that ADMM is nothing else than the Douglas-Rachford algorithm applied to the monotone inclusion problem

$$0\in\partial\big(f^*\circ(-A^T)\big)(y)+\partial g^*(y).$$

For a function k : ℝ^n → ℝ̄, its (convex) subdifferential is the set-valued operator defined by ∂k(x) := {u ∈ ℝ^n : k(y) ≥ k(x) + ⟨u, y − x⟩ ∀y ∈ ℝ^n} whenever k(x) ∈ ℝ, and ∂k(x) := ∅ otherwise.
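To make the scheme concrete, here is a minimal sketch of Algorithm 1 for the toy instance f(x) = ½‖x − d‖² and g = ‖·‖₁, in which both subproblems (5) and (6) admit closed-form solutions; all names and parameter values are illustrative, not part of the paper.

```python
import numpy as np

# A minimal sketch of Algorithm 1 for the toy instance
#   f(x) = 0.5*||x - d||^2,  g = ||.||_1,
# in which both subproblems (5) and (6) have closed-form solutions.

def admm(A, d, c=1.0, iters=200):
    m, n = A.shape
    x, z, y = np.zeros(n), np.zeros(m), np.zeros(m)
    # The x-update (5) minimizes a strongly convex quadratic:
    #   (I + c A^T A) x = d - A^T y + c A^T z.
    K = np.eye(n) + c * A.T @ A
    for _ in range(iters):
        x = np.linalg.solve(K, d - A.T @ y + c * A.T @ z)
        # The z-update (6) is a proximal step for g: soft-thresholding at 1/c.
        w = A @ x + y / c
        z = np.sign(w) * np.maximum(np.abs(w) - 1.0 / c, 0.0)
        # The multiplier update (7).
        y = y + c * (A @ x - z)
    return x, z, y

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
d = rng.standard_normal(3)
x, z, y = admm(A, d)
print(np.linalg.norm(A @ x - z))  # the residual ||Ax - z|| should be small
```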
One of the limitations of the ADMM algorithm comes from the presence of the term Ax in the update rule for x^{k+1} (we refer to [14] for an approach to circumvent the limitations of ADMM). While (6) amounts to a proximal step for the function g, in (5) the function f and the operator A are not evaluated independently, which makes the ADMM algorithm less attractive for implementations than primal-dual splitting algorithms (see, for instance, [8-10, 12, 13, 16, 29]). Despite this fact, the ADMM algorithm has been widely used for solving convex optimization problems arising in real-life applications (see, for instance, [11, 17, 21]). For a version of the ADMM algorithm with inertial and memory effects, we refer the reader to [7].
In order to overcome the above-mentioned drawback of the classical ADMM method and to increase its flexibility, the following so-called proximal alternating direction method of multipliers has been considered in [28] (see also [22, 26]):

Algorithm 2 Choose x^0 ∈ ℝ^n, z^0 ∈ ℝ^m and y^0 ∈ ℝ^m. For all k ≥ 0, set:

$$x^{k+1}\in\operatorname*{argmin}_{x\in\mathbb{R}^n}\left\{l_c(x,z^k,y^k)+\frac{1}{2}\|x-x^k\|^2_{M_1}\right\},\qquad (8)$$
$$z^{k+1}=\operatorname*{argmin}_{z\in\mathbb{R}^m}\left\{l_c(x^{k+1},z,y^k)+\frac{1}{2}\|z-z^k\|^2_{M_2}\right\},\qquad (9)$$
$$y^{k+1}=y^k+c(Ax^{k+1}-z^{k+1}).\qquad (10)$$

Here, M_1 ∈ ℝ^{n×n} and M_2 ∈ ℝ^{m×m} are symmetric positive semidefinite matrices and ‖u‖²_{M_i} := ⟨u, M_i u⟩ denotes the squared semi-norm induced by M_i, for i ∈ {1, 2}.
Indeed, for M_1 = M_2 = 0, Algorithm 2 becomes the classical ADMM method, while for M_1 = μ_1 Id and M_2 = μ_2 Id, with μ_1, μ_2 > 0 and Id denoting the identity matrix of the corresponding dimension, it becomes the algorithm proposed and investigated in [18]. Furthermore, if M_1 = τ^{-1} Id − cA^T A with τ > 0 such that cτ‖A‖² < 1 and M_2 = 0, then one can show that Algorithm 2 is equivalent to one of the primal-dual algorithms formulated in [16].
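The equivalence just mentioned rests on the observation that with M_1 = τ^{-1} Id − cA^T A the quadratic coupling through A cancels in (8), so the x-update collapses to a single proximal step of f. Below is a sketch of this reduced update, instantiated for f = ‖·‖₁; this choice of f is ours, for illustration only.

```python
import numpy as np

# Sketch of the x-update in Algorithm 2 for the choice
#   M_1 = (1/tau) Id - c A^T A,  M_2 = 0:
# the quadratic coupling through A cancels in (8), which collapses to
#   x^{k+1} = prox_{tau f}( x^k - tau A^T (y^k + c (A x^k - z^k)) ).

def prox_l1(v, t):
    """Proximal map of t*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def linearized_x_update(A, x, z, y, c, tau):
    # c * tau * ||A||^2 < 1 keeps M_1 positive definite.
    return prox_l1(x - tau * A.T @ (y + c * (A @ x - z)), tau)
```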
The sequence (z^k)_{k≥0} generated by Algorithm 2 is uniquely determined, since the objective function in (9) is lower semicontinuous and strongly convex. On the other hand, the set of minimizers in (8) is in general not a singleton and may even be empty. However, if one imposes that M_1 + A^T A is positive definite, then (x^k)_{k≥0} is uniquely determined, too.
Shefi and Teboulle provide in [28], in connection with Algorithm 2, an ergodic convergence rate result for a primal-dual gap function formulated in terms of the Lagrangian l, from which they deduce a global convergence rate for the sequence of function values (f(x^k) + g(Ax^k))_{k≥0} to the optimal objective value of (1) when g is Lipschitz continuous. Furthermore, they formulate a global convergence rate result for the sequence (‖Ax^k − z^k‖)_{k≥0} to 0. Finally, Shefi and Teboulle prove the convergence of the sequence (x^k, z^k, y^k)_{k≥0} to a saddle point of the Lagrangian l, provided that either M_1 = 0 and A has full column rank, or M_1 is positive definite.
Algorithm 2 from [28] represents the starting point of our investigations. More precisely, in this paper:

• We point out some flaws in the proof of a statement in [28] which is fundamental for the derivation of the global convergence rate of (‖Ax^k − z^k‖)_{k≥0} to 0 and of the convergence of the sequence (x^k, z^k, y^k)_{k≥0}.
• We show how the statement in question can be proved by using different techniques.
• We formulate a variant of Algorithm 2 for solving convex optimization problems in infinite-dimensional Hilbert spaces involving an additional smooth function in their objective, which we evaluate through its gradient, and which allows the use of variable metrics in each iteration.
• We prove an ergodic convergence rate result for this algorithm involving a primal-dual gap function formulated in terms of the associated Lagrangian l, as well as a convergence result for the sequence of iterates to a saddle point of l.
2 Fixing some results from [28] related to the convergence analysis for Algorithm 2

In this section, we point out several flaws that have been made in [28] when deriving a result that is fundamental both for the rate of convergence of the sequence (‖Ax^k − z^k‖)_{k≥0} to 0 and for the convergence of the sequence (x^k, z^k, y^k)_{k≥0} to a saddle point of the Lagrangian l. We also show how these arguments can be fixed by relying on some of the building blocks of the analysis we will carry out in Section 3.
To proceed, we first recall some results from [28]. We start with a statement that follows from the variational characterization of the minimizers of (8)–(9).
which is nothing else than (15). From here, by using that v_{k+1} ≤ v_k for all k ≥ 0 and straightforward telescoping arguments, it follows immediately that (‖Ax^k − z^k‖)_{k≥0} converges to zero with a rate of O(1/√k). We will see in the following section that the inequality (40) also plays an essential role in the convergence analysis of the sequence of iterates. When applied to the particular context of the optimization problem (1) and Algorithm 2, Theorem 12 provides a rigorous formulation and a correct and clear proof of the convergence result stated in [28, Theorem 5.6].

3 A variant of the ADMM algorithm in the presence of a smooth function and involving variable metrics
In this section, we propose an extension of the ADMM algorithm considered in [28], which we also investigate from the perspective of its convergence properties. This extension is twofold: on the one hand, we consider an additional convex differentiable function in the objective of the optimization problem (1), which is evaluated in the algorithm through its gradient; on the other hand, instead of fixed matrices M_1 and M_2, we allow different operators in each iteration. Furthermore, we change the setting to infinite-dimensional Hilbert spaces. We start by describing the problem under investigation.

Problem 4
Let H and G be real Hilbert spaces, let f : H → ℝ̄ and g : G → ℝ̄ be proper, convex, and lower semicontinuous functions, let h : H → ℝ be a convex and Fréchet differentiable function with L-Lipschitz continuous gradient (where L ≥ 0), and let A : H → G be a linear continuous operator. Solve the convex optimization problem

$$\inf_{x\in\mathcal{H}}\{f(x)+h(x)+g(Ax)\}.\qquad (18)$$

The Lagrangian associated with the convex optimization problem (18) is

$$l(x,z,y)=f(x)+h(x)+g(z)+\langle y,Ax-z\rangle.\qquad (19)$$

We say that (x*, z*, y*) ∈ H × G × G is a saddle point of the Lagrangian l if the following inequalities hold:

l(x*, z*, y) ≤ l(x*, z*, y*) ≤ l(x, z, y*)  ∀(x, z, y) ∈ H × G × G.

Notice that (x*, z*, y*) is a saddle point if and only if z* = Ax*, x* is an optimal solution of (18), y* is an optimal solution of the Fenchel-Rockafellar dual problem to (18),

$$\sup_{y\in\mathcal{G}}\{-(f+h)^*(-A^*y)-g^*(y)\},\qquad (20)$$

and the optimal objective values of (18) and (20) coincide.

For the reader's convenience, we discuss some situations which lead to the existence of saddle points. This is, for instance, the case when (18) has an optimal solution and the Attouch-Brézis qualification condition

$$0\in\operatorname{sri}\big(A(\operatorname{dom}f)-\operatorname{dom}g\big)\qquad (21)$$

holds. Here, for a convex set S ⊆ G, we denote by sri S := {x ∈ S : ∪_{λ>0} λ(S − x) is a closed linear subspace of G} its strong relative interior. Notice that the classical interior is contained in the strong relative interior, int S ⊆ sri S; however, in general, this inclusion may be strict. If G is finite-dimensional, then for a nonempty and convex set S ⊆ G, one has sri S = ri S. Considering again the infinite-dimensional setting, we remark that condition (21) is fulfilled if there exists x' ∈ dom f such that Ax' ∈ dom g and g is continuous at Ax'.

The optimality conditions for the primal-dual pair of optimization problems (18)–(20) read

$$-A^*y^*-\nabla h(x^*)\in\partial f(x^*)\quad\text{and}\quad y^*\in\partial g(Ax^*).\qquad (22)$$

This means that if (18) has an optimal solution x* ∈ H and the qualification condition (21) is fulfilled, then there exists y* ∈ G, an optimal solution of (20), such that (22) holds and (x*, Ax*, y*) is a saddle point of the Lagrangian l. Conversely, if the pair (x*, y*) ∈ H × G satisfies relation (22), then x* is an optimal solution of (18), y* is an optimal solution of (20), and (x*, Ax*, y*) is a saddle point of the Lagrangian l. For further considerations on convex duality, we invite the reader to consult [3-5, 20, 30].

Furthermore, we discuss some conditions ensuring that (18) has an optimal solution. Suppose that (18) is feasible, which means that its optimal objective value is not +∞. The existence of optimal solutions to (18) is guaranteed if, for instance, f + h is coercive and g is bounded from below. Indeed, under these circumstances, the objective function of (18) is coercive, and the statement follows via [3, Corollary 11.15]. On the other hand, if f + h is strongly convex, then the objective function of (18) is strongly convex, too; thus, (18) has a unique optimal solution (see [3, Corollary 11.16]).
Some more notation is in order before we state the algorithm for solving Problem 4. We denote by S_+(H) the family of operators U : H → H which are linear, continuous, self-adjoint, and positive semidefinite. For U ∈ S_+(H), we consider the semi-norm defined by

$$\|x\|_U:=\sqrt{\langle x,Ux\rangle},\quad x\in\mathcal{H}.$$

We also make use of the Loewner partial ordering, defined for U_1, U_2 ∈ S_+(H) by

$$U_1\succcurlyeq U_2:\Longleftrightarrow\ \|x\|^2_{U_1}\ge\|x\|^2_{U_2}\quad\forall x\in\mathcal{H}.$$

Finally, for α > 0, we set

$$\mathcal{P}_\alpha(\mathcal{H}):=\{U\in\mathcal{S}_+(\mathcal{H}):U\succcurlyeq\alpha\,\mathrm{Id}\},$$

where Id denotes the identity operator on H. The algorithm we propose for solving Problem 4 reads:

Algorithm 3 Choose x^0 ∈ H, z^0 ∈ G, y^0 ∈ G and c > 0. For all k ≥ 0, set:

$$x^{k+1}\in\operatorname*{argmin}_{x\in\mathcal{H}}\left\{f(x)+\langle x-x^k,\nabla h(x^k)\rangle+\langle y^k,Ax-z^k\rangle+\frac{c}{2}\|Ax-z^k\|^2+\frac{1}{2}\|x-x^k\|^2_{M_1^k}\right\},\qquad (23)$$
$$z^{k+1}=\operatorname*{argmin}_{z\in\mathcal{G}}\left\{g(z)+\langle y^k,Ax^{k+1}-z\rangle+\frac{c}{2}\|Ax^{k+1}-z\|^2+\frac{1}{2}\|z-z^k\|^2_{M_2^k}\right\},\qquad (24)$$
$$y^{k+1}=y^k+c(Ax^{k+1}-z^{k+1}),\qquad (25)$$

where M_1^k ∈ S_+(H) and M_2^k ∈ S_+(G) for all k ≥ 0.

Remark 5

(i) If H and G are finite-dimensional, h = 0, and the operators M_1^k = M_1 and M_2^k = M_2 are constant in each iteration, then Algorithm 3 becomes Algorithm 2, which has been investigated in [28].
(ii) In order to ensure that the sequence (x^k)_{k≥0} is uniquely determined, one can assume that for all k ≥ 0 there exists α_k > 0 such that M_1^k + cA*A ∈ P_{α_k}(H). This is in particular the case when

there exists α > 0 such that A*A ∈ P_α(H).  (26)

Relying on [3, Fact 2.19], one can see that (26) holds if and only if A is injective and ran A* is closed. Hence, in finite-dimensional spaces, namely, if H = ℝ^n and G = ℝ^m with m ≥ n ≥ 1, (26) is nothing else than saying that A has full column rank.

(iii) One of the pioneering works addressing proximal ADMM algorithms in Hilbert spaces, in the particular case when h = 0 and M_1^k and M_2^k are, for all k ≥ 0, equal to the corresponding identity operators, is the paper by Attouch and Soueycatt [2]. We also refer the reader to [22, 26] for versions of the proximal ADMM algorithm stated in finite-dimensional spaces and with proximal terms induced by constant linear operators.
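Under illustrative closed-form choices (ours, not prescribed by the paper), one possible rendering of Algorithm 3 looks as follows; the metric M_1^k = (1/τ_k)Id − cA^T A makes the x-subproblem (23) explicit, and the τ_k are taken nondecreasing so that M_1^k ⪰ M_1^{k+1} in the Loewner order.

```python
import numpy as np

# A minimal sketch of Algorithm 3 under illustrative choices:
# f the indicator of the nonnegative orthant, g = ||.||_1,
# h(x) = 0.5*||B x - b||^2 (so grad h(x) = B^T (B x - b) and L = ||B||^2),
# and variable metrics M_1^k = (1/tau_k) Id - c A^T A, M_2^k = 0.

def algorithm3(A, B, b, c, taus, iters=200):
    m, n = A.shape
    x, z, y = np.zeros(n), np.zeros(m), np.zeros(m)
    for k in range(iters):
        tau = taus[min(k, len(taus) - 1)]
        grad_h = B.T @ (B @ x - b)       # h enters only through its gradient
        # x-update (23): with this M_1^k the subproblem collapses to a
        # projected gradient-type step (prox of the indicator of {x >= 0}).
        x = np.maximum(x - tau * (grad_h + A.T @ (y + c * (A @ x - z))), 0.0)
        # z-update (24) with M_2^k = 0: soft-thresholding at 1/c.
        w = A @ x + y / c
        z = np.sign(w) * np.maximum(np.abs(w) - 1.0 / c, 0.0)
        # Multiplier update (25).
        y = y + c * (A @ x - z)
    return x, z, y
```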

Remark 6
We show that the particular choices M_1^k = (1/τ_k)Id − cA*A, with τ_k > 0, and M_2^k = 0 for all k ≥ 0 lead to a primal-dual algorithm introduced in [16]. Here, Id : H → H denotes the identity operator on H. Let k ≥ 0 be fixed. The optimality condition for (23) reads (for x^{k+2}):

$$0\in\partial f(x^{k+2})+\nabla h(x^{k+1})+A^*y^{k+1}+cA^*(Ax^{k+2}-z^{k+1})+M_1^{k+1}(x^{k+2}-x^{k+1});\qquad (27)$$

thus, since the terms involving cA*Ax^{k+2} cancel and by (25),

$$x^{k+2}=\operatorname{prox}_{\tau_{k+1}f}\Big(x^{k+1}-\tau_{k+1}\big(\nabla h(x^{k+1})+A^*(2y^{k+1}-y^k)\big)\Big).\qquad (28)$$

Furthermore, from the optimality condition for (24), we obtain

$$y^k+c(Ax^{k+1}-z^{k+1})+M_2^k(z^k-z^{k+1})\in\partial g(z^{k+1}),\qquad (29)$$

which combined with (25) gives

$$y^{k+1}+M_2^k(z^k-z^{k+1})\in\partial g(z^{k+1}).\qquad (30)$$

Using that M_2^k = 0 and again (25), it further follows

$$y^k+cAx^{k+1}\in(\mathrm{Id}+c\,\partial g^*)(y^{k+1}),$$

which is equivalent to

$$y^{k+1}=\operatorname{prox}_{cg^*}(y^k+cAx^{k+1}).\qquad (31)$$

The iterative scheme obtained in (31) and (28) generates, for a given starting point (x^1, y^0) ∈ H × G and c > 0, the sequence (x^k, y^k)_{k≥1} as follows: for all k ≥ 0,

$$y^{k+1}=\operatorname{prox}_{cg^*}(y^k+cAx^{k+1}),$$
$$x^{k+2}=\operatorname{prox}_{\tau_{k+1}f}\Big(x^{k+1}-\tau_{k+1}\big(\nabla h(x^{k+1})+A^*(2y^{k+1}-y^k)\big)\Big).$$

For τ_k = τ > 0 for all k ≥ 0, one recovers a primal-dual algorithm from [16] that has been investigated under the assumption 1/τ − c‖A‖² > L/2 (see Algorithm 3.2 and Theorem 3.1 in [16]). We invite the reader to consult [8, 9, 13, 29] for more insights into primal-dual algorithms and their highlights. Primal-dual algorithms with dynamic step sizes have been investigated in [13] and [9], where it has been shown that clever strategies in the choice of the step sizes can improve the convergence behavior.
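A sketch of the resulting primal-dual iteration (31)/(28), for the illustrative choices (ours) f = ‖·‖₁, g = ‖·‖₁ (so that prox_{cg*} is, by the Moreau decomposition, the projection onto the unit ℓ∞-ball) and h(x) = ½‖x − d‖² (hence L = 1):

```python
import numpy as np

# Sketch of the primal-dual scheme (31)/(28) from Remark 6 for the toy
# choices f = ||.||_1, g = ||.||_1 (so prox_{c g*} is the projection onto
# the unit l_inf-ball) and h(x) = 0.5*||x - d||^2 (grad h(x) = x - d,
# L = 1). Constant step size tau with 1/tau - c*||A||^2 > L/2, as in [16].

def primal_dual(A, d, c, tau, iters=200):
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        y_old = y
        y = np.clip(y + c * (A @ x), -1.0, 1.0)            # (31): prox_{c g*}
        v = x - tau * ((x - d) + A.T @ (2 * y - y_old))    # (28): forward step
        x = np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)  # (28): prox_{tau f}
    return x, y
```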

Ergodic convergence rates for the primal-dual gap
In this section, we will provide a convergence rate result for a primal-dual gap function formulated in terms of the associated Lagrangian l. We start by proving a technical statement (see also [28]).

Lemma 7
In the context of Problem 4, let (x^k, z^k, y^k)_{k≥0} be a sequence generated by Algorithm 3. Then, for all k ≥ 0 and all (x, z, y) ∈ H × G × G, the following inequality holds

Moreover, we have for all k ≥ 0

Proof We fix k ≥ 0 and (x, z, y) ∈ H × G × G. Writing the optimality conditions for (23), we obtain

$$cA^*(z^k-Ax^{k+1}-c^{-1}y^k)+M_1^k(x^k-x^{k+1})-\nabla h(x^k)\in\partial f(x^{k+1}).\qquad (32)$$

From the definition of the convex subdifferential, we derive

where for the last equality we used (25). Furthermore, we claim that

By combining (33) and (34), we obtain

From the optimality condition for (24), we obtain

$$y^k+c(Ax^{k+1}-z^{k+1})+M_2^k(z^k-z^{k+1})\in\partial g(z^{k+1}),\qquad (36)$$

which, combined with (25), gives

$$y^{k+1}+M_2^k(z^k-z^{k+1})\in\partial g(z^{k+1}).\qquad (37)$$

From here, we derive the inequality

The first statement of the lemma follows by combining the inequalities (35) and (38) with the identity (see (25))

The second statement follows easily from the arithmetic-geometric mean inequality in Hilbert spaces (see [28, Proposition 5.3(a)]).
A direct consequence of the two inequalities in Lemma 7 is the following result.

Lemma 8

In the context of Problem 4, assume that M_1^k − (L/2)Id ∈ S_+(H) for all k ≥ 0, and let (x^k, z^k, y^k)_{k≥0} be the sequence generated by Algorithm 3. Then, for all k ≥ 0 and all (x, z, y) ∈ H × G × G, the following inequality holds

We can now state the main result of this subsection.

Theorem 9
In the context of Problem 4, assume that M_1^k − (L/2)Id ∈ S_+(H), M_1^k ⪰ M_1^{k+1} and M_2^k ⪰ M_2^{k+1} for all k ≥ 0, and let (x^k, z^k, y^k)_{k≥0} be the sequence generated by Algorithm 3. For all k ≥ 1, define the ergodic sequences

$$\bar x^k:=\frac{1}{k}\sum_{i=1}^{k}x^i,\qquad \bar z^k:=\frac{1}{k}\sum_{i=1}^{k}z^i,\qquad \bar y^k:=\frac{1}{k}\sum_{i=1}^{k}y^i.$$

Then, for all k ≥ 1 and all (x, z, y) ∈ H × G × G,

$$l(\bar x^k,\bar z^k,y)-l(x,z,\bar y^k)\le\frac{1}{2k}\left(\|x-x^0\|^2_{M_1^0}+\|z-z^0\|^2_{M_2^0+c\,\mathrm{Id}}+\frac{1}{c}\|y-y^0\|^2\right).$$

Proof We fix k ≥ 1 and (x, z, y) ∈ H × G × G. Summing up the inequalities in Lemma 8 for i = 0, ..., k − 1 and using classical arguments for telescoping sums, we obtain

$$\sum_{i=0}^{k-1}\big(l(x^{i+1},z^{i+1},y)-l(x,z,y^{i+1})\big)\le\frac{1}{2}\left(\|x-x^0\|^2_{M_1^0}+\|z-z^0\|^2_{M_2^0+c\,\mathrm{Id}}+\frac{1}{c}\|y-y^0\|^2\right).$$

Since l is convex in (x, z) and linear in y, the conclusion follows from the definition of the ergodic sequences.
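On the implementation side, the ergodic sequences need not be recomputed from scratch: the recursion x̄^k = x̄^{k−1} + (x^k − x̄^{k−1})/k maintains the average with O(1) extra memory. A small helper, assuming the iterates are numpy arrays:

```python
import numpy as np

# Incremental computation of the ergodic averages from Theorem 9:
#   xbar_k = xbar_{k-1} + (x_k - xbar_{k-1}) / k,
# which maintains (1/k) * sum_{i=1}^k x^i without storing past iterates.

def ergodic_averages(iterates):
    xbar = None
    for k, x in enumerate(iterates, start=1):
        xbar = x.astype(float).copy() if xbar is None else xbar + (x - xbar) / k
        yield xbar

# Example: the running averages of 1, 2, 3 are 1, 1.5, 2.
for xbar in ergodic_averages([np.array([1.0]), np.array([2.0]), np.array([3.0])]):
    print(xbar)
```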

Remark 10
Let (x*, z*, y*) be a saddle point for the Lagrangian l. By taking (x, z, y) := (x*, z*, y*) in the above theorem, it yields

$$0\le l(\bar x^k,\bar z^k,y^*)-l(x^*,z^*,\bar y^k)\le\frac{C}{k}\quad\forall k\ge 1,$$

with C > 0 the constant provided by Theorem 9, where l(x*, z*, y*) is the optimal objective value of the problem (18). Hence, if we suppose that the set of optimal solutions of the dual problem (20) is contained in a bounded set, then there exists R > 0 such that for all k ≥ 1

The set of dual optimal solutions of (20) is equal to the convex subdifferential of the infimal value function ψ of the problem (18) at 0. This set is weakly compact, and thus bounded, if 0 ∈ int(dom ψ) = int(A(dom f) − dom g) (see [3, 5, 30]).

Convergence of the sequence of generated iterates
In this subsection, we address the convergence of the sequence of iterates generated by Algorithm 3 (see also [6, Theorem 7]). One of the important tools in the proof of the convergence result will be [15, Theorem 3.3], which we recall below.
Lemma 11 (see [15, Theorem 3.3]) Let S be a nonempty subset of H and (x^k)_{k≥0} a sequence in H. Let α > 0 and W_k ∈ P_α(H) be such that W_k ⪰ W_{k+1} for all k ≥ 0. Assume that:

(i) for all z ∈ S and for all k ≥ 0: ‖x^{k+1} − z‖_{W_{k+1}} ≤ ‖x^k − z‖_{W_k};
(ii) every weak sequential cluster point of (x^k)_{k≥0} belongs to S.
Then, (x k ) k≥0 converges weakly to an element in S.
The proof of the convergence result relies on techniques specific to monotone operator theory and does not make use of the values of the objective function or of the Lagrangian l. This makes it different from the proofs in [28] and from the majority of other conventional convergence proofs for ADMM methods; among the few exceptions are [2] and [19].

Theorem 12 In the context of Problem 4, assume that the set of saddle points of the Lagrangian l is nonempty and that M_1^k ⪰ M_1^{k+1} and M_2^k ⪰ M_2^{k+1} for all k ≥ 0, and let (x^k, z^k, y^k)_{k≥0} be the sequence generated by Algorithm 3. If one of the following assumptions is fulfilled, then (x^k, z^k, y^k)_{k≥0} converges weakly to a saddle point of the Lagrangian l. This means that (x^k)_{k≥0} converges weakly to an optimal solution of problem (18), and (y^k)_{k≥0} converges weakly to an optimal solution of its dual problem (20).
Proof Let S ⊆ H × G × G denote the set of saddle points of the Lagrangian l and let (x*, z*, y*) be a fixed element of S. Then, z* = Ax* and the optimality conditions

$$-A^*y^*-\nabla h(x^*)\in\partial f(x^*),\qquad y^*\in\partial g(Ax^*)$$

hold.
By expressing some of the inner products in terms of norms, we obtain

We will apply Lemma 11 in the product space H × G × G to the sequence (x^k, z^k, y^k)_{k≥0}, with W_k := (M_1^k, M_2^k + cId, c^{-1}Id) for k ≥ 0, and with S ⊆ H × G × G the set of saddle points of the Lagrangian l.
From (42), we obtain that (Ax^{k_n+1})_{n∈ℕ} converges weakly to Ax (as n → +∞), which combined with (43) yields z = Ax. We now use the following notations for all n ≥ 0:

$$a_n^*:=cA^*(z^{k_n}-Ax^{k_n+1}-c^{-1}y^{k_n})+M_1^{k_n}(x^{k_n}-x^{k_n+1})+\nabla h(x^{k_n+1})-\nabla h(x^{k_n}),\qquad a_n:=x^{k_n+1},$$
$$b_n^*:=y^{k_n+1}+M_2^{k_n}(z^{k_n}-z^{k_n+1}),\qquad b_n:=z^{k_n+1}.$$

From (32) and (37), we have for all n ≥ 0

$$a_n^*\in\partial(f+h)(a_n)\quad\text{and}\quad b_n^*\in\partial g(b_n).$$
Finally, we have

$$a_n^*+A^*b_n^*=cA^*(z^{k_n}-Ax^{k_n+1})+A^*(y^{k_n+1}-y^{k_n})+M_1^{k_n}(x^{k_n}-x^{k_n+1})+A^*M_2^{k_n}(z^{k_n}-z^{k_n+1})+\nabla h(x^{k_n+1})-\nabla h(x^{k_n}).$$

By using the fact that ∇h is Lipschitz continuous, from (42)-(45) we conclude that a_n* + A*b_n* converges strongly to 0 (as n → +∞).
On the other hand, notice that both (39) and (40) yield

hence, (y^k)_{k≥0} and (z^k)_{k≥0} are bounded. Combining this with (25) and the condition imposed on M_1^k − (L/2)Id + A*A, we derive that (x^k)_{k≥0} is bounded, too. Hence, there exists a weakly convergent subsequence of (x^k, z^k, y^k)_{k≥0}. By using the same arguments as in the proof of (I), it follows that every weak sequential cluster point of (x^k, z^k, y^k)_{k≥0} is a saddle point of the Lagrangian l.

Now, we show that the set of weak sequential cluster points of (x^k, z^k, y^k)_{k≥0} is a singleton. Let (x_1, z_1, y_1) and (x_2, z_2, y_2) be two such weak sequential cluster points. Then, there exist sequences (k_p)_{p≥0} and (k_q)_{q≥0} with k_p → +∞ (as p → +∞) and k_q → +∞ (as q → +∞), a subsequence (x^{k_p}, z^{k_p}, y^{k_p})_{p≥0} which converges weakly to (x_1, z_1, y_1) (as p → +∞), and a subsequence (x^{k_q}, z^{k_q}, y^{k_q})_{q≥0} which converges weakly to (x_2, z_2, y_2) (as q → +∞). As seen above, (x_1, z_1, y_1) and (x_2, z_2, y_2) are saddle points of the Lagrangian l and z_i = Ax_i for i ∈ {1, 2}. From (52), which is true for every saddle point of the Lagrangian l, we derive that the limit

$$\lim_{k\to+\infty}\big(E(x^k,z^k,y^k;x_1,z_1,y_1)-E(x^k,z^k,y^k;x_2,z_2,y_2)\big)$$

exists, where, for a saddle point (x*, z*, y*), the expression E(x^k, z^k, y^k; x*, z*, y*) is defined as

$$E(x^k,z^k,y^k;x^*,z^*,y^*)=\frac{1}{2}\|x^k-x^*\|^2_{M_1^k}+\frac{1}{2}\|z^k-z^*\|^2_{M_2^k+c\,\mathrm{Id}}+\frac{1}{2c}\|y^k-y^*\|^2.$$

Further, we have for all k ≥ 0

which further implies that (52) holds.
From here, the conclusion follows by arguing as in the proof provided in the setting of Assumption (II).

Remark 13
Choosing, as in Remark 6, M_1^k = (1/τ_k)Id − cA*A, with τ_k > 0 such that τ := sup_{k≥0} τ_k < +∞, and M_2^k = 0 for all k ≥ 0, we have

$$M_1^k-\frac{L}{2}\mathrm{Id}=\frac{1}{\tau_k}\mathrm{Id}-cA^*A-\frac{L}{2}\mathrm{Id}\succcurlyeq\left(\frac{1}{\tau}-c\|A\|^2-\frac{L}{2}\right)\mathrm{Id},$$

which means that under the assumption 1/τ − c‖A‖² > L/2 (which recovers the one in Algorithm 3.2 and Theorem 3.1 in [16]), the operators M_1^k − (L/2)Id belong, for all k ≥ 0, to the class P_{α_1}(H), with α_1 := 1/τ − c‖A‖² − L/2 > 0.

Remark 14
By taking h = 0 and L = 0, and constant operators M_1^k = M_1 ≻ 0 and M_2^k = M_2 ⪰ 0 for all k ≥ 0, Theorem 12 in the context of Assumption (I) covers the first situation investigated in [28, Theorem 5.6], where, in finite-dimensional spaces, the matrix M_1 was assumed to be positive definite and the matrix M_2 positive semidefinite.
The arguments used in [28, Theorem 5.6] for proving convergence in the case when M_1 = 0 and A has full column rank contain flaws and rely on incorrect statements. Theorem 12 provides in the context of Assumption (III) (for h = 0, L = 0, M_1^k = 0 and M_2^k = M_2 ⪰ 0 for all k ≥ 0) the correct proof of this result.
Finally, we notice that the convergence theorem for the iterates of the classical ADMM algorithm (which corresponds to the situation where h = 0, L = 0, M_1 = M_2 = 0 and A has full column rank; see, for example, [19]) is covered by Theorem 12 in the context of Assumption (III).