On Convergence Rates of Proximal Alternating Direction Method of Multipliers

In this paper we consider from two different aspects the proximal alternating direction method of multipliers (ADMM) in Hilbert spaces. We first consider the application of the proximal ADMM to solve well-posed linearly constrained two-block separable convex minimization problems in Hilbert spaces and obtain new and improved non-ergodic convergence rate results, including linear and sublinear rates under certain regularity conditions. We next consider the proximal ADMM as a regularization method for solving linear ill-posed inverse problems in Hilbert spaces. When the data is corrupted by additive noise, we establish, under a benchmark source condition, a convergence rate result in terms of the noise level when the number of iterations is properly chosen.


Introduction
The alternating direction method of multipliers (ADMM) was introduced and developed in the 1970s by Glowinski, Marrocco [16] and Gabay, Mercier [15] for the numerical solution of partial differential equations. Due to its decomposability and superior flexibility, ADMM and its variants have gained renewed interest in recent years and have been widely used for solving large-scale optimization problems that arise in signal/image processing, statistics, machine learning, inverse problems and other fields, see [5,17,21]. Because of their popularity, many works have been devoted to the analysis of ADMM and its variants, see [5,8,10,14,19,26,33] for instance. This paper is devoted to deriving convergence rates of ADMM in two respects: its application to solving well-posed convex optimization problems and its use as a regularization method for solving linear ill-posed inverse problems.
In the first part of this paper we consider ADMM for solving linearly constrained two-block separable convex minimization problems. Let X, Y and Z be real Hilbert spaces with possibly infinite dimensions. We consider the convex minimization problem of the form

minimize H(x, y) := f(x) + g(y) subject to Ax + By = c, (1.1)

where f : X → (−∞, ∞] and g : Y → (−∞, ∞] are proper, lower semi-continuous, convex functions, A : X → Z and B : Y → Z are bounded linear operators, and c ∈ Z. With the augmented Lagrangian

L_ρ(x, y, λ) := f(x) + g(y) + ⟨λ, Ax + By − c⟩ + (ρ/2)∥Ax + By − c∥²,

the ADMM solves (1.1) by alternately minimizing L_ρ with respect to the primal variables x and y and then updating the dual variable λ; more precisely, starting from an initial guess y^0 ∈ Y and λ^0 ∈ Z, an iterative sequence {(x^k, y^k, λ^k)} is defined by

x^{k+1} = arg min_{x ∈ X} L_ρ(x, y^k, λ^k),
y^{k+1} = arg min_{y ∈ Y} L_ρ(x^{k+1}, y, λ^k),
λ^{k+1} = λ^k + ρ(Ax^{k+1} + By^{k+1} − c), (1.2)

where ρ > 0 is a given penalty parameter. The implementation of (1.2) requires determining x^{k+1} and y^{k+1} by solving two convex minimization problems at each iteration. Although f and g may have special structures so that their proximal mappings are easy to determine, solving the minimization problems in (1.2) is in general highly nontrivial due to the presence of the terms ∥Ax∥² and ∥By∥². In order to avoid this implementation issue, one may consider adding certain proximal terms to the x-subproblem and y-subproblem in (1.2) to remove the terms ∥Ax∥² and ∥By∥². For any bounded linear positive semi-definite self-adjoint operator D on a real Hilbert space H, we will use the notation ∥u∥²_D := ⟨u, Du⟩ for all u ∈ H. By taking two bounded linear positive semi-definite self-adjoint operators P : X → X and Q : Y → Y, we may add the terms ½∥x − x^k∥²_P and ½∥y − y^k∥²_Q to the x- and y-subproblems in (1.2) respectively to obtain the following proximal alternating direction method of multipliers ([4, 9, 19, 20, 22, 33]): given in addition x^0 ∈ X, define

x^{k+1} = arg min_{x ∈ X} { L_ρ(x, y^k, λ^k) + ½∥x − x^k∥²_P },
y^{k+1} = arg min_{y ∈ Y} { L_ρ(x^{k+1}, y, λ^k) + ½∥y − y^k∥²_Q },
λ^{k+1} = λ^k + ρ(Ax^{k+1} + By^{k+1} − c). (1.3)

The advantage of (1.3) over (1.2) is that, with judicious choices of P and Q, it is possible to remove the terms ∥Ax∥² and ∥By∥² and thus make the determination of x^{k+1} and y^{k+1} much easier.
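To make the role of P and Q concrete, the following is a minimal finite-dimensional sketch of (1.3); it is our illustration, not code from the paper. Choosing P = τI − ρAᵀA and Q = σI − ρBᵀB with τ ≥ ρ∥A∥² and σ ≥ ρ∥B∥² (so that P and Q stay positive semi-definite and A, B are assumed nonzero) cancels the troublesome quadratic terms, and each subproblem collapses to a single proximal step; the helpers prox_f and prox_g are assumed given.

```python
import numpy as np

def linearized_prox_admm(prox_f, prox_g, A, B, c, rho=1.0, n_iter=500):
    """Sketch of the proximal ADMM (1.3) with P = tau*I - rho*A^T A and
    Q = sigma*I - rho*B^T B, so both subproblems reduce to proximal steps.
    prox_f(v, t) should return argmin_x { f(x) + ||x - v||^2 / (2t) }."""
    m, n = A.shape
    p = B.shape[1]
    # tau >= rho*||A||^2 and sigma >= rho*||B||^2 keep P and Q psd
    tau = rho * np.linalg.norm(A, 2) ** 2
    sigma = rho * np.linalg.norm(B, 2) ** 2
    x, y, lam = np.zeros(n), np.zeros(p), np.zeros(m)
    for _ in range(n_iter):
        r = A @ x + B @ y - c                    # constraint residual
        x = prox_f(x - A.T @ (lam + rho * r) / tau, 1.0 / tau)
        r = A @ x + B @ y - c                    # residual with the new x
        y = prox_g(y - B.T @ (lam + rho * r) / sigma, 1.0 / sigma)
        lam = lam + rho * (A @ x + B @ y - c)    # dual update
    return x, y, lam
```

For instance, with A = I, B = −I, c = 0, f(x) = ½∥x − d∥² and g = μ∥·∥₁ (so that prox_g is soft-thresholding), this produces a familiar lasso-type splitting.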
In recent years, various convergence rate results have been established for ADMM and its variants in either the ergodic or the non-ergodic sense. In [19,25] the ergodic convergence rate

|H(x̄_k, ȳ_k) − H*| = O(1/k) and ∥Ax̄_k + Bȳ_k − c∥ = O(1/k)

has been derived in terms of the objective error and the constraint error, where H* denotes the minimum value of (1.1), k denotes the number of iterations, and x̄_k := (1/k) Σ_{j=1}^k x^j and ȳ_k := (1/k) Σ_{j=1}^k y^j denote the ergodic iterates of {x^k} and {y^k} respectively; see also [4, Theorem 15.4]. A criticism of ergodic results is that they may fail to capture the features of the sought solution of the underlying problem, because the ergodic iterate tends to average out the expected properties and thus destroy the structure of the solution. This is particularly undesirable in sparsity optimization and low-rank learning. In contrast, the non-ergodic iterate tends to share structural properties with the solution of the underlying problem. Therefore, the use of non-ergodic iterates is more favorable in practice. In [20] a non-ergodic convergence rate has been derived for the proximal ADMM (1.3) with Q = 0, and the result reads as

∥∆u^{k+1}∥²_G = o(1/k), (1.5)

where ∆u^{k+1} := u^{k+1} − u^k with u^k := (x^k, y^k, λ^k), and ∥·∥_G is a weighted semi-norm built from P, Q, B and ρ; see Section 2 for the precise definition. By exploiting the connection with the Douglas-Rachford splitting algorithm, the non-ergodic convergence rate

|H(x^k, y^k) − H*| = o(1/√k) and ∥Ax^k + By^k − c∥ = o(1/√k) (1.6)

in terms of the objective error and the constraint error has been established in [8] for the ADMM (1.2), and an example has been provided there to demonstrate that the estimates in (1.6) are sharp. However, the derivation of (1.6) in [8] relies on some unnatural technical conditions involving the convex conjugates of f and g, see Remark 2.1. Note that the estimate (1.5) implies the second estimate in (1.6); however, it does not directly imply the first estimate in (1.6). In Section 2 we will show, by a simpler argument, that an estimate similar to (1.5) can be established for the proximal ADMM (1.3) with arbitrary positive semi-definite Q. Based on this result and some additional properties of the method, we will further show that the non-ergodic rate (1.6) holds for the proximal ADMM (1.3) with arbitrary positive semi-definite P and Q. Our result does not require any of the technical conditions assumed in [8].
In order to obtain faster convergence rates for the proximal ADMM (1.3), certain regularity conditions should be imposed. In the finite dimensional situation, a number of linear convergence results have been established. In [9] some linear convergence results for the proximal ADMM have been provided under a number of scenarios involving the strong convexity of f and/or g, the Lipschitz continuity of ∇f and/or ∇g, together with further full row/column rank assumptions on A and/or B. Under a bounded metric subregularity condition, in particular under the assumption that both f and g are convex piecewise linear-quadratic functions, a global linear convergence rate has been established in [32] for the proximal ADMM (1.3) with P and Q given by (1.7), where A* and B* denote the adjoints of A and B respectively; the condition (1.7) plays an essential role in the convergence analysis in [32]. We will derive faster convergence rates for the proximal ADMM (1.3) in the general Hilbert space setting. To this end, we first need to consider the weak convergence of {(x^k, y^k, λ^k)} and demonstrate that every weak cluster point of this sequence is a KKT point of (1.1). This may not be an issue in finite dimensions; however, it is nontrivial in infinite dimensional spaces because extra care is required in dealing with weak convergence. In [6] the weak convergence of the proximal ADMM (1.3) has been considered by transforming the method into a proximal point method, and the result there requires restrictive conditions, see [6, Lemma 3.4 and Theorem 3.1]. These restrictive conditions have been weakened later in [31] by using machinery from maximal monotone operator theory. We will exploit the structure of the proximal ADMM and show by an elementary argument that every weak cluster point of {(x^k, y^k, λ^k)} is indeed a KKT point of (1.1) without any additional conditions. We will then consider the linear convergence of the proximal ADMM under a bounded metric subregularity condition and obtain linear convergence for any positive semi-definite P and Q; in particular, we obtain the linear convergence of |H(x^k, y^k) − H*| and ∥Ax^k + By^k − c∥. We also consider deriving convergence rates under a bounded Hölder metric subregularity condition, which is weaker than bounded metric subregularity. This weaker condition holds if both f and g are semi-algebraic functions, and thus a wider range of applications can be covered. We show that, under a bounded Hölder metric subregularity condition, among other things the convergence rates in (1.6) can be improved to

|H(x^k, y^k) − H*| = O(1/k^β) and ∥Ax^k + By^k − c∥ = O(1/k^β)

for some number β > 1/2; the value of β depends on the properties of f and g. To further weaken the bounded (Hölder) metric subregularity assumption, we introduce an iteration-based error bound condition which extends the one in [27] to the general proximal ADMM (1.3). It is interesting to observe that this error bound condition holds under any one of the scenarios proposed in [9]. Hence, we provide a unified analysis for deriving convergence rates under bounded (Hölder) metric subregularity or the scenarios in [9]. Furthermore, we extend the scenarios in [9] to the general Hilbert space setting and demonstrate that some conditions can be weakened and the convergence result can be strengthened; see Theorem 2.11.
In the second part of this paper, we consider using ADMM as a regularization method to solve linear ill-posed inverse problems in Hilbert spaces. Linear inverse problems have a wide range of applications, including medical imaging, geophysics, astronomy, signal processing, and more ([12,18,28]). We consider linear inverse problems of the form

Ax = b, (1.8)

where A : X → H is a compact linear operator between two Hilbert spaces X and H, C is a closed convex set in X, and b ∈ Ran(A), the range of A. In order to find a solution of (1.8) with desired properties, a priori available information on the sought solution should be incorporated into the problem. Assume that, under a suitable linear transform L from X to another Hilbert space Y with domain dom(L), the feature of the sought solution can be captured by a proper convex penalty function f; we are thus led to the constrained minimization problem

min{ f(Lx) : x ∈ dom(L) ∩ C and Ax = b }. (1.9)

A challenging issue related to the numerical resolution of (1.9) is its ill-posedness, in the sense that the solution of (1.9) does not depend continuously on the data, and thus a small perturbation of the data can lead to a large deviation in solutions. In practical applications, the exact data b is usually unavailable; instead only noisy data b^δ is at hand with ∥b^δ − b∥ ≤ δ for some small noise level δ > 0. To overcome the ill-posedness, regularization methods should be introduced to produce reasonable approximate solutions; one may refer to [7,12,23,29] for various regularization methods.
The common use of ADMM to solve (1.9) with noisy data b^δ first considers the variational regularization

min{ ½∥Ax − b^δ∥² + α f(Lx) : x ∈ dom(L) ∩ C }, (1.10)

then uses the splitting technique to rewrite (1.10) in the form (1.1), and finally applies the ADMM procedure to produce approximate solutions. The parameter α > 0 is the so-called regularization parameter, which should be adjusted carefully to achieve reasonably good performance; consequently one has to run ADMM to solve (1.10) for many different values of α, which can be time-consuming. In [21,22] ADMM has been considered for solving (1.9) directly to reduce the computational load. Note that (1.9) can be written as

min{ f(y) + ι_C(x) : Az = b, Lz = y, z = x },

where ι_C denotes the indicator function of C. With the noisy data b^δ we introduce the augmented Lagrangian function

L_{ρ1,ρ2,ρ3}(z, y, x, λ, µ, ν) := f(y) + ι_C(x) + ⟨λ, Az − b^δ⟩ + (ρ1/2)∥Az − b^δ∥²
    + ⟨µ, Lz − y⟩ + (ρ2/2)∥Lz − y∥² + ⟨ν, z − x⟩ + (ρ3/2)∥z − x∥²,

where ρ1, ρ2 and ρ3 are preassigned positive numbers. The proximal ADMM proposed in [22] for solving (1.9) then takes the form

z^{k+1} = arg min_z { L_{ρ1,ρ2,ρ3}(z, y^k, x^k, λ^k, µ^k, ν^k) + ½∥z − z^k∥²_Q },
y^{k+1} = arg min_y L_{ρ1,ρ2,ρ3}(z^{k+1}, y, x^k, λ^k, µ^k, ν^k),
x^{k+1} = arg min_x L_{ρ1,ρ2,ρ3}(z^{k+1}, y^k, x, λ^k, µ^k, ν^k),
λ^{k+1} = λ^k + ρ1(Az^{k+1} − b^δ),
µ^{k+1} = µ^k + ρ2(Lz^{k+1} − y^{k+1}),
ν^{k+1} = ν^k + ρ3(z^{k+1} − x^{k+1}), (1.11)

where Q is a bounded linear positive semi-definite self-adjoint operator. The method (1.11) is not a 3-block ADMM. Note that the variables y and x are not coupled in L_{ρ1,ρ2,ρ3}(z, y, x, λ, µ, ν). Thus, y^{k+1} and x^{k+1} can be updated simultaneously, i.e.

y^{k+1} = arg min_y { f(y) + ⟨µ^k, Lz^{k+1} − y⟩ + (ρ2/2)∥Lz^{k+1} − y∥² } and x^{k+1} = P_C(z^{k+1} + ν^k/ρ3),

where P_C denotes the metric projection onto C.
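As a concrete illustration (ours, not code from [21,22]), one sweep of (1.11) in a discretized setting can be organized as follows; A and L are matrices, proj_C is the metric projection onto C, prox_f(v, t) minimizes f(y) + ∥y − v∥²/(2t), and we take Q = 0 so that the z-update is a plain linear solve.

```python
import numpy as np

def admm_inverse_step(z, y, x, lam, mu, nu, A, L, b_delta, proj_C, prox_f,
                      rho1=10.0, rho2=1.0, rho3=1.0):
    """One iteration of scheme (1.11) with Q = 0 (a sketch under the stated
    assumptions; the helper names proj_C and prox_f are ours)."""
    n = A.shape[1]
    # z-update: the subproblem is quadratic, hence a linear solve
    M = rho1 * A.T @ A + rho2 * L.T @ L + rho3 * np.eye(n)
    rhs = A.T @ (rho1 * b_delta - lam) + L.T @ (rho2 * y - mu) + rho3 * x - nu
    z = np.linalg.solve(M, rhs)
    # y- and x-updates are decoupled and can run in parallel
    y = prox_f(L @ z + mu / rho2, 1.0 / rho2)
    x = proj_C(z + nu / rho3)
    # dual ascent steps
    lam = lam + rho1 * (A @ z - b_delta)
    mu = mu + rho2 * (L @ z - y)
    nu = nu + rho3 * (z - x)
    return z, y, x, lam, mu, nu
```

Note how the y- and x-updates depend only on z^{k+1} and the previous multipliers, which is exactly what allows them to be performed simultaneously.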
It should be highlighted that the well-established convergence results on the proximal ADMM for well-posed optimization problems are not applicable to (1.11) directly. Note that (1.11) uses the noisy data b^δ. If the convergence theory for well-posed optimization problems were applicable, one would obtain a solution of the perturbed problem

min{ f(Lx) : x ∈ dom(L) ∩ C and Ax = b^δ } (1.12)

of (1.9). Because A is compact, it is very likely that b^δ ∉ Ran(A) and thus (1.12) makes no sense as the feasible set is empty. Even if b^δ ∈ Ran(A) and (1.12) has a solution, this solution could be far away from the solution of (1.9) because of the ill-posedness. Therefore, if (1.11) is used to solve (1.9), a better result cannot be expected merely by performing a larger number of iterations. Instead, like all other iterative regularization methods, (1.11) exhibits the semi-convergence property when used to solve (1.9): the iterates approach the sought solution at the beginning; however, after a critical number of iterations, they drift away from it as the iteration proceeds. Thus, properly terminating the iteration is important to produce acceptable approximate solutions. One may hope to determine a stopping index k_δ, depending on δ and/or b^δ, such that ∥x^{k_δ} − x†∥ is as small as possible and ∥x^{k_δ} − x†∥ → 0 as δ → 0, where x† denotes the solution of (1.9). This has been done in our previous work [21,22], in which early stopping rules have been proposed for the method (1.11) to render it a regularization method, and numerical results have been reported to demonstrate its nice performance. However, the work in [21,22] does not provide convergence rates, i.e. estimates on ∥x^{k_δ} − x†∥ in terms of δ. Deriving convergence rates for iterative regularization methods involving general convex regularization terms is a challenging question, and only a limited number of results are available. In order to derive a convergence rate of a regularization method for ill-posed problems, a certain source condition should be imposed on the sought solution. In Section 3, under a benchmark source condition on the sought solution, we will provide a partial answer to this question by establishing a convergence rate result for (1.11) when the iteration is terminated by an a priori stopping rule.
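In simulations, the a priori rule of Section 3 amounts to stopping after O(1/δ) iterations. A trivial sketch of our own, where the proportionality constant c is a tuning parameter not fixed by the theory:

```python
import math

def a_priori_stop_index(delta, c=1.0):
    """A priori stopping rule k_delta ~ delta^{-1}; c is a tuning constant."""
    return max(1, math.ceil(c / delta))
```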

Proximal ADMM for convex optimization problems
In this section we will consider the proximal ADMM (1.3) for solving the linearly constrained convex minimization problem (1.1). For the convergence analysis, we will make the following standard assumptions.

Assumption 2.1. X, Y and Z are real Hilbert spaces, A : X → Z and B : Y → Z are bounded linear operators, P : X → X and Q : Y → Y are bounded linear positive semi-definite self-adjoint operators, and f : X → (−∞, ∞] and g : Y → (−∞, ∞] are proper, lower semi-continuous, convex functions.

Assumption 2.2. The problem (1.1) has a Karush-Kuhn-Tucker (KKT) point, i.e. there exists (x̄, ȳ, λ̄) ∈ X × Y × Z such that

−A*λ̄ ∈ ∂f(x̄), −B*λ̄ ∈ ∂g(ȳ) and Ax̄ + Bȳ = c.

It should be mentioned that, to guarantee that the proximal ADMM (1.3) is well-defined, certain additional conditions need to be imposed to ensure that the x- and y-subproblems do have minimizers. Since the well-definedness can easily be seen in concrete applications, to make the presentation more succinct we will not state these conditions explicitly.
By the convexity of f and g, it is easy to see that, for any KKT point (x̄, ȳ, λ̄) of (1.1), there hold

f(x) − f(x̄) ≥ ⟨−A*λ̄, x − x̄⟩ and g(y) − g(ȳ) ≥ ⟨−B*λ̄, y − ȳ⟩

for all x ∈ X and y ∈ Y. Adding these two inequalities and using Ax̄ + Bȳ = c gives

H(x, y) − H(x̄, ȳ) + ⟨λ̄, Ax + By − c⟩ ≥ 0 for all (x, y) ∈ X × Y. (2.1)

This in particular implies that (x̄, ȳ) is a solution of (1.1) and thus H* := H(x̄, ȳ) is the minimum value of (1.1). Based on Assumptions 2.1 and 2.2 we will analyze the proximal ADMM (1.3). For ease of exposition, we set Q̄ := ρB*B + Q and define

∥u∥²_G := ∥x∥²_P + ∥y∥²_{Q̄} + (1/ρ)∥λ∥² for u = (x, y, λ) ∈ X × Y × Z,

where G denotes the block-diagonal operator G := diag(P, Q̄, ρ^{−1}I). For the sequence {u^k := (x^k, y^k, λ^k)} defined by the proximal ADMM (1.3), we use the notation

∆x^k := x^k − x^{k−1}, ∆y^k := y^k − y^{k−1}, ∆λ^k := λ^k − λ^{k−1} and ∆u^k := u^k − u^{k−1}.

We start from the first order optimality conditions on x^{k+1} and y^{k+1}, which by definition can be stated as

0 ∈ ∂f(x^{k+1}) + A*λ^k + ρA*(Ax^{k+1} + By^k − c) + P∆x^{k+1},
0 ∈ ∂g(y^{k+1}) + B*λ^k + ρB*(Ax^{k+1} + By^{k+1} − c) + Q∆y^{k+1}. (2.2)

By using λ^{k+1} = λ^k + ρ(Ax^{k+1} + By^{k+1} − c) we may rewrite (2.2) as

−A*(λ^{k+1} − ρB∆y^{k+1}) − P∆x^{k+1} ∈ ∂f(x^{k+1}) and −B*λ^{k+1} − Q∆y^{k+1} ∈ ∂g(y^{k+1}), (2.3)

which will be used frequently in the following analysis. We first prove the following important result, which is inspired by [19, Lemma 3.1] and [4, Theorem 15.4].
Proposition 2.1. Let Assumption 2.1 hold. Then for the proximal ADMM (1.3) there holds

where σ_f and σ_g denote the moduli of convexity of f and g respectively.
Proof. Let λ̃^{k+1} := λ^{k+1} − ρB∆y^{k+1}. By using (2.3) and the convexity of f and g we have for any u = (x, y, λ)

Since ρ(Ax^{k+1} + By^{k+1} − c) = ∆λ^{k+1} we then obtain

By using the polarization identity and the definition of G, it follows that

Using the definition of λ̃^{k+1} gives

Therefore the asserted inequality follows. □

Corollary 2.2. Let Assumptions 2.1 and 2.2 hold and let ū := (x̄, ȳ, λ̄) be any KKT point of (1.1). Then for the proximal ADMM (1.3) there holds (2.4) for all k ≥ 0. Moreover, the sequence {∥u^k − ū∥²_G} is monotonically decreasing.

Proof. By taking u = ū in Proposition 2.1 and using Ax̄ + Bȳ − c = 0 we immediately obtain (2.4). According to (2.1) we have

Thus, from (2.4) we can obtain

which implies the monotonicity of the sequence {∥u^k − ū∥²_G}. □

The following lemma gives the monotonicity of {∥∆u^k∥²_G}. This result for the proximal ADMM (1.3) with Q = 0 has been established in [20] based on a variational inequality approach. We will establish it for the proximal ADMM (1.3) with general bounded linear positive semi-definite self-adjoint operators P and Q by a simpler argument.

Lemma 2.3. Let Assumption 2.1 hold. Then for the proximal ADMM (1.3) the sequence {∥∆u^k∥²_G} is monotonically decreasing, i.e. ∥∆u^{k+1}∥²_G ≤ ∥∆u^k∥²_G for all k ≥ 1.

Proof. Note that

We therefore have

By the polarization identity we then have

With the help of the definition of G, we obtain the asserted monotonicity, which completes the proof. □

Lemma 2.4. Let Assumptions 2.1 and 2.2 hold and let ū := (x̄, ȳ, λ̄) be any KKT point of (1.1). For the proximal ADMM (1.3) there holds

Proof. We will use (2.3) together with −A*λ̄ ∈ ∂f(x̄) and −B*λ̄ ∈ ∂g(ȳ). By using the monotonicity of ∂f and ∂g we have

By virtue of ρ(Ax^{k+1} + By^{k+1} − c) = ∆λ^{k+1} we further have

By using the second equation in (2.3) and the monotonicity of ∂g we have

By using the polarization identity we then obtain

Recalling the definition of G we then complete the proof. □

Proposition 2.5. Let Assumptions 2.1 and 2.2 hold. Then for the proximal ADMM (1.3) there holds

where [k/2] denotes the largest integer ≤ k/2. Note that (2.6) shows that the right hand side of (2.7) must converge to 0 as k → ∞.
As a byproduct of Proposition 2.5 and Corollary 2.2, we can prove the following non-ergodic convergence rate result for the proximal ADMM (1.3) in terms of the objective error and the constraint error.
Theorem 2.6. Let Assumptions 2.1 and 2.2 hold. Consider the proximal ADMM (1.3) for solving (1.1). Then

|H(x^k, y^k) − H*| = o(1/√k) and ∥Ax^k + By^k − c∥ = o(1/√k) (2.8)

as k → ∞.

Proof. Since ρ(Ax^k + By^k − c) = ∆λ^k and ∥∆λ^k∥² ≤ ρ∥∆u^k∥²_G, we may use Proposition 2.5 to obtain the estimate ∥Ax^k + By^k − c∥ = o(1/√k). In the following we will focus on deriving the estimate of |H(x^k, y^k) − H*|. Let ū := (x̄, ȳ, λ̄) be a KKT point of (1.1). By using (2.4) we have

By virtue of the monotonicity of {∥u^k − ū∥²_G} given in Corollary 2.2 we then obtain

On the other hand, by using (2.1) we have

H(x^k, y^k) − H* ≥ −⟨λ̄, Ax^k + By^k − c⟩ ≥ −∥λ̄∥ ∥Ax^k + By^k − c∥.

Therefore

Now we can use Proposition 2.5 to conclude the proof. □

Remark 2.1. By exploiting the connection between the Douglas-Rachford splitting algorithm and the classical ADMM (1.2), the non-ergodic convergence rate (2.8) has been established in [8] for the classical ADMM (1.2) under the conditions that

zero(∂d_f + ∂d_g) ≠ ∅ (2.12)

and (2.13), where d_f(λ) := f*(A*λ) and d_g(λ) := g*(B*λ) − ⟨λ, c⟩ with f* and g* denoting the convex conjugates of f and g respectively. The conditions (2.12) and (2.13) seem strong and unnatural because they are posed on the convex conjugates f* and g* instead of f and g themselves. In Theorem 2.6 we establish the non-ergodic convergence rate (2.8) for the proximal ADMM (1.3) with any positive semi-definite P and Q without requiring the conditions (2.12) and (2.13); therefore our result extends and improves the one in [8].
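In numerical experiments, rates such as (2.8) can be sanity-checked by estimating the decay exponent of recorded errors on a log-log scale. The following helper is our own illustration and assumes a positive error history has been recorded from a run of (1.3):

```python
import numpy as np

def empirical_decay_exponent(errors):
    """Fit errors[k] ~ C * k^(-p) by least squares in log-log coordinates and
    return the estimated exponent p; for rates like o(1/sqrt(k)) in (2.8),
    p should be at least about 0.5."""
    errors = np.asarray(errors, dtype=float)
    k = np.arange(1, len(errors) + 1)
    mask = errors > 0                      # guard against exact zeros
    slope, _ = np.polyfit(np.log(k[mask]), np.log(errors[mask]), 1)
    return -slope
```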
Next we will consider establishing faster convergence rates under suitable regularity conditions. As a basis, we first prove the following result, which tells us that any weak cluster point of {u^k} is a KKT point of (1.1). This result can be easily established for ADMM in finite-dimensional spaces; however, it is nontrivial for the proximal ADMM (1.3) in infinite-dimensional Hilbert spaces due to the required treatment of weak convergence. Proposition 2.1 plays a crucial role in our proof.
Theorem 2.7. Let Assumptions 2.1 and 2.2 hold. Consider the sequence {u^k := (x^k, y^k, λ^k)} generated by the proximal ADMM (1.3). Assume {u^k} is bounded and let u† := (x†, y†, λ†) be a weak cluster point of {u^k}. Then u† is a KKT point of (1.1). Moreover, for any weak cluster point u* of {u^k} there holds ∥u* − u†∥_G = 0.

Proof. We first show that u† is a KKT point of (1.1). According to Proposition 2.5 we have

as k → ∞. According to Theorem 2.6 we also have

Since u† is a weak cluster point of the sequence {u^k}, there exists a subsequence {u^{k_j} := (x^{k_j}, y^{k_j}, λ^{k_j})} of {u^k} such that u^{k_j} ⇀ u† as j → ∞. By using the first equation in (2.15) we immediately obtain

By using Proposition 2.1 with k = k_j − 1 we have for any u:

(2.17)

According to Corollary 2.2, {∥u^k∥_G} is bounded. Thus we may use Proposition 2.5 to conclude

for all (x, y) ∈ X × Y. Since f and g are convex and lower semi-continuous, they are also weakly lower semi-continuous (see [11, Chapter 1, Corollary 2.2]). Thus, by using x^{k_j} ⇀ x† and y^{k_j} ⇀ y† we obtain

Since (x†, y†) satisfies (2.16), we also have H(x†, y†) ≥ H*. Therefore H(x†, y†) = H*, and then it follows from (2.18) and (2.16) that

Using the definition of H we can immediately see that u† is a KKT point of (1.1).

Let u* be another weak cluster point of {u^k}. Then there exists a subsequence {u^{l_j}} of {u^k} such that u^{l_j} ⇀ u* as j → ∞. Since both u* and u† are KKT points of (1.1) as shown above, it follows from Corollary 2.2 that both {∥u^k − u†∥²_G} and {∥u^k − u*∥²_G} are monotonically decreasing and thus converge as k → ∞. By taking k = k_j and k = l_j in (2.19) respectively and letting j → ∞, we can see that, in both cases, the right hand side tends to the same limit. Therefore ∥u* − u†∥_G = 0. □

Remark 2.2. Theorem 2.7 requires {u^k} to be bounded. According to Corollary 2.2, {∥u^k∥²_G} is bounded, which implies the boundedness of {λ^k}. In the following we provide sufficient conditions to guarantee the boundedness of {(x^k, y^k)}.
(i) From (2.5) it follows that the sequences {σ_f ∥x^k − x̄∥²}, {σ_g ∥y^k − ȳ∥²} and {∥u^k − ū∥²_G} are bounded. By the definition of G, this in particular implies the boundedness of {λ^k} and {By^k}. Consequently, it follows from ∆λ^k = ρ(Ax^k + By^k − c) that {Ax^k} is bounded. Putting the above together we can conclude that both {(σ_f I + P + A*A)x^k} and {(σ_g I + Q + B*B)y^k} are bounded. Therefore, if both of the bounded linear self-adjoint operators σ_f I + P + A*A and σ_g I + Q + B*B are coercive, we can conclude the boundedness of {x^k} and {y^k}. Here a linear operator L : V → H between two Hilbert spaces V and H is called coercive if ∥Lv∥ → ∞ whenever ∥v∥ → ∞. It is easy to see that L is coercive if and only if there is a constant c > 0 such that c∥v∥ ≤ ∥Lv∥ for all v ∈ V.

(ii) If there exist β > H* and σ > 0 such that the set {(x, y) ∈ X × Y : H(x, y) ≤ β and ∥Ax + By − c∥ ≤ σ} is bounded, then {(x^k, y^k)} is bounded. In fact, since H(x^k, y^k) → H* and Ax^k + By^k − c → 0 as shown in Theorem 2.6, the sequence {(x^k, y^k)} is contained in the above set except for finitely many terms. Thus {(x^k, y^k)} is bounded.
Remark 2.3. It is interesting to investigate under what conditions {u^k} has a unique weak cluster point. According to Theorem 2.7, for any two weak cluster points u* := (x*, y*, λ*) and u† := (x†, y†, λ†) of {u^k} there holds ∥u* − u†∥_G = 0. By using the definition of G and the monotonicity of ∂f and ∂g we can deduce that

Therefore, if both σ_f I + P + A*A and σ_g I + Q + B*B are injective, then x* = x† and y* = y†, and hence {u^k} has a unique weak cluster point, say u†; consequently u^k ⇀ u† as k → ∞.

Remark 2.4. In [31] the proximal ADMM (with relaxation) has been considered under the condition that

P + ρA*A + ∂f and Q + ρB*B + ∂g are strongly maximal monotone, (2.20)

which requires both (P + ρA*A + ∂f)^{−1} and (Q + ρB*B + ∂g)^{−1} to exist as single valued mappings and to be Lipschitz continuous. It has been shown there that the iterative sequence converges weakly to a KKT point which is its unique weak cluster point. The argument in [31] uses the facts that the KKT mapping F(u), defined in (2.21) below, is maximal monotone and that maximal monotone operators are closed under the weak-strong topology ([2,3]). Our argument, which is essentially based on Proposition 2.1, is elementary and does not rely on any machinery from maximal monotone operator theory.
Based on Theorem 2.7, we now turn to deriving convergence rates of the proximal ADMM (1.3) under certain regularity conditions. To this end, we introduce the multifunction F : X × Y × Z ⇒ X × Y × Z defined by

F(u) := (∂f(x) + A*λ) × (∂g(y) + B*λ) × {Ax + By − c}, u = (x, y, λ). (2.21)

Then ū being a KKT point of (1.1) means 0 ∈ F(ū) or, equivalently, ū ∈ F^{−1}(0), where F^{−1} denotes the inverse multifunction of F. We will achieve our goal under certain bounded (Hölder) metric subregularity conditions on F. We need the following calculus lemma.
Lemma 2.8. Let {∆_k} be a sequence of nonnegative numbers satisfying

∆_k^θ ≤ C(∆_{k−1} − ∆_k) for all k ≥ 1,

where C > 0 and θ > 1 are constants. Then there is a constant C̃ > 0 such that

∆_k ≤ C̃ (k + 1)^{−1/(θ−1)} for all k ≥ 0.
Proof. Please refer to the proof of [1, Theorem 2]. □

Theorem 2.9. Let Assumptions 2.1 and 2.2 hold. Consider the sequence {u^k := (x^k, y^k, λ^k)} generated by the proximal ADMM (1.3). Assume {u^k} is bounded and let u† := (x†, y†, λ†) be a weak cluster point of {u^k}. Let R be a number such that ∥u^k − u†∥ ≤ R for all k, and assume that there exist κ > 0 and α ∈ (0, 1] such that

d(u, F^{−1}(0)) ≤ κ d(0, F(u))^α for all u with ∥u − u†∥ ≤ R. (2.23)

If α = 1, then there exists a constant 0 < q < 1 such that (2.24) holds for all k ≥ 0, and consequently there exist C > 0 and 0 < q < 1 such that (2.25) holds; if α ∈ (0, 1), then (2.26) holds, and consequently (2.27) holds for all k ≥ 0.

Proof. According to Theorem 2.7, u† is a KKT point of (1.1). Therefore we may use Lemma 2.4 with ū = u† to obtain

where η ∈ (0, 1) is any number. According to (2.3),

Thus, by using ∆λ^{k+1} = ρ(Ax^{k+1} + By^{k+1} − c) we can obtain

where

Combining this with (2.28) gives

Since ∥u^k − u†∥ ≤ R for all k and F satisfies (2.23), one can see that

Consequently, with

d_G(u, F^{−1}(0)) := inf{ ∥u − ū∥_G : ū ∈ F^{−1}(0) },

which measures the "distance" from u to F^{−1}(0) under the semi-norm ∥·∥_G, it is easy to see that d²_G(u, F^{−1}(0)) ≤ ∥G∥ d²(u, F^{−1}(0)), where ∥G∥ denotes the norm of the operator G. Then we have

Since ū ∈ F^{−1}(0) is arbitrary, we thus have

By using the fact ∥∆u^k∥_G → 0 established in Proposition 2.5, we can find a constant C > 0 such that

Using the inequality (a + b)^p ≤ 2^{p−1}(a^p + b^p) for a, b ≥ 0 and p ≥ 1, we then obtain

If α = 1, then we obtain the linear convergence (2.24). If α ∈ (0, 1), then Lemma 2.8 is applicable and gives (2.26); moreover,

for all integers 1 ≤ l < k. By using the monotonicity of {∥∆u^j∥²_G} shown in Lemma 2.3 and the estimate (2.26) we have

with a possibly different generic constant C. This shows the first estimate in (2.27).
Based on this, we can use (2.9) and (2.11) to obtain the last two estimates in (2.27). The proof is therefore complete. □

Remark 2.5. Let us give some comments on the condition (2.23). In finite dimensional Euclidean spaces, it has been proved in [30] that for every polyhedral multifunction Ψ : R^m ⇒ R^n there is a constant κ > 0 such that for any y ∈ R^n there is a number ε > 0 such that

d(x, Ψ^{−1}(y)) ≤ κ d(y, Ψ(x)) for all x with d(y, Ψ(x)) ≤ ε.

This result in particular implies the bounded metric subregularity of Ψ, i.e. for any r > 0 and any y ∈ R^n there is a number κ_r > 0 such that

d(x, Ψ^{−1}(y)) ≤ κ_r d(y, Ψ(x)) for all x with ∥x∥ ≤ r.

Therefore, if ∂f and ∂g are polyhedral multifunctions, then the multifunction F defined by (2.21) is also polyhedral and thus (2.23) with α = 1 holds. The bounded metric subregularity of polyhedral multifunctions in arbitrary Banach spaces has been established in [34].
On the other hand, if X, Y and Z are finite dimensional Euclidean spaces, and if f and g are semi-algebraic convex functions, then the multifunction F satisfies (2.23) for some α ∈ (0, 1]. Indeed, the semi-algebraicity of f and g implies that their subdifferentials ∂f and ∂g are semi-algebraic multifunctions with closed graphs; consequently F is semi-algebraic with closed graph. According to [24, Proposition 3.1], F is bounded Hölder metrically subregular at any point (ū, ξ) on its graph, i.e. for any r > 0 there exist κ > 0 and α ∈ (0, 1] such that

d(u, F^{−1}(ξ)) ≤ κ d(ξ, F(u))^α for all u with ∥u − ū∥ ≤ r,

which in particular implies (2.23).
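To get a feel for the sublinear rates produced by Lemma 2.8, one can iterate the equality case of the recursion numerically. The following sketch is our own, with arbitrary parameter values; it simply checks that ∆_n scaled by the predicted rate n^{1/(θ−1)} stays bounded.

```python
def check_lemma_2_8(theta=2.0, C=5.0, n=100000, delta0=1.0):
    """Iterate Delta_{k} = Delta_{k-1} - Delta_{k-1}**theta / C, i.e. the
    equality case of the recursion in Lemma 2.8, and return Delta_n scaled
    by n**(1/(theta-1)); boundedness of the result as n grows is consistent
    with the rate asserted in the lemma."""
    d = delta0
    for _ in range(n):
        d -= d ** theta / C        # stays positive when delta0**(theta-1) < C
    return d * n ** (1.0 / (theta - 1.0))
```

For theta = 2 and C = 5 the scaled value settles near the constant 5, matching the predicted O(1/k) decay.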
By inspecting the proof of Theorem 2.9, it is easy to see that the same convergence rate results can be derived with the condition (2.23) replaced by the weaker condition: there exist κ > 0 and α ∈ (0, 1] such that

d_G(u^k, F^{−1}(0)) ≤ κ ∥∆u^k∥_G^α for all k ≥ 1. (2.31)

Therefore we have the following result.

Theorem 2.10. Let Assumptions 2.1 and 2.2 hold. Consider the sequence {u^k := (x^k, y^k, λ^k)} generated by the proximal ADMM (1.3). Assume {u^k} is bounded. If there exist κ > 0 and α ∈ (0, 1] such that (2.31) holds, then, for any weak cluster point u† of {u^k}, the same convergence rate results as in Theorem 2.9 hold.
Remark 2.6. Note that the condition (2.31) is based on the iterative sequence itself. Therefore, it becomes possible to check the condition by exploiting not only the properties of the multifunction F but also the structure of the algorithm. The condition (2.31) with α = 1 has been introduced in [27] as an iteration based error bound condition to study the linear convergence of the proximal ADMM (1.3) with Q = 0 in finite dimensions.
Remark 2.7. The condition (2.31) is strongly motivated by the proof of Theorem 2.9. We would like to provide here an alternative motivation. Consider the proximal ADMM (1.3). We can show that if ∥∆u^k∥_G = 0 then u^k must be a KKT point of (1.1). Indeed, ∥∆u^k∥²_G = 0 implies P∆x^k = 0, Q̄∆y^k = 0 and ∆λ^k = 0. Since Q̄ = Q + ρB*B with Q positive semi-definite and ∆λ^k = ρ(Ax^k + By^k − c), we also have B∆y^k = 0, Q∆y^k = 0 and Ax^k + By^k − c = 0. Thus, it follows from (2.3) that

−A*λ^k ∈ ∂f(x^k) and −B*λ^k ∈ ∂g(y^k),

which together with Ax^k + By^k − c = 0 shows that u^k = (x^k, y^k, λ^k) is a KKT point, i.e., u^k ∈ F^{−1}(0). Therefore, it is natural to ask: if ∥∆u^k∥_G is small, can we guarantee d_G(u^k, F^{−1}(0)) to be small as well? This motivates us to propose a condition of the form

d_G(u^k, F^{−1}(0)) ≤ φ(∥∆u^k∥_G) for all k,

with some function φ satisfying φ(s) → 0 as s → 0. The condition (2.31) corresponds to φ(s) = κs^α for some κ > 0 and α ∈ (0, 1]. In finite dimensional Euclidean spaces some linear convergence results on the proximal ADMM (1.3) have been established in [9] under various scenarios involving strong convexity of f and/or g, Lipschitz continuity of ∇f and/or ∇g, together with further conditions on A and/or B, see [9, Theorem 3.1 and Table 1]. In the following theorem we will show that (2.31) with α = 1 holds under any one of these scenarios, and thus the linear convergence in [9, Theorem 3.1 and Theorem 3.4] can be established by using Theorem 2.10. Therefore, the linear convergence results based on the bounded metric subregularity of F or the scenarios in [9] can be treated in a unified manner.
Actually, our next theorem improves the results in [9] by establishing the linear convergence of {u^k} and {H(x^k, y^k)} and by relaxing the Lipschitz continuity of the gradient(s) to local Lipschitz continuity. Furthermore, our result is established in general Hilbert spaces. To formulate the scenarios from [9] in this general setting, we need to replace the full row/column rank conditions on matrices by the coercivity of linear operators. We also need the linear operator M : X × Y → Z defined by

M(x, y) := Ax + By, ∀(x, y) ∈ X × Y,

which is constructed from A and B. It is easy to see that the adjoint of M is given by M*z = (A*z, B*z) for any z ∈ Z.

Theorem 2.11. Let Assumptions 2.1 and 2.2 hold. Let {u^k} be the sequence generated by the proximal ADMM (1.3). Then {u^k} is bounded and there exists a constant C > 0 such that (2.32) holds for all k ≥ 1, provided any one of the following conditions holds:
(i) σ_g > 0, A and B* are coercive, g is differentiable and its gradient is Lipschitz continuous over bounded sets;
(ii) σ_f > 0, σ_g > 0, B* is coercive, g is differentiable and its gradient is Lipschitz continuous over bounded sets;
(iii) λ^0 = 0, σ_f > 0, σ_g > 0, M* restricted on N(M*)^⊥ is coercive, both f and g are differentiable and their gradients are Lipschitz continuous over bounded sets;
(iv) λ^0 = 0, σ_g > 0, A is coercive, M* restricted on N(M*)^⊥ is coercive, both f and g are differentiable and their gradients are Lipschitz continuous over bounded sets;
where N(M*) denotes the null space of M*. Consequently, there exist C > 0 and 0 < q < 1 such that

for all k ≥ 0, where u† := (x†, y†, λ†) is a KKT point of (1.1).
Proof. We will only consider scenario (i) since the proofs for the other scenarios are similar. In the following we will use C to denote a generic constant which may change from line to line but is independent of k.
We first show the boundedness of {u^k}.

Next we show (2.32). Let u† := (x†, y†, λ†) be a weak cluster point of {u^k}, whose existence is guaranteed by the boundedness of {u^k}. According to Theorem 2.7, u† is a KKT point of (1.1). Let (ξ, η, τ) ∈ F(u^k) be any element. Then

By using the monotonicity of ∂f and ∂g we have

Since σ_g > 0, it follows from (2.33) and the Cauchy-Schwarz inequality that

(2.35)

By the differentiability of g we have −B*λ† = ∇g(y†) and −B*λ^k − Q∆y^k = ∇g(y^k). Since B* is coercive and ∇g is Lipschitz continuous over bounded sets, we thus obtain

(2.36)

Adding (2.35) and (2.36) and then using (2.34), it follows that

which together with the Cauchy-Schwarz inequality then implies

(2.37)

Combining (2.34) and (2.37) we can obtain

With the help of (2.29), we then obtain

Because {u^k} is bounded and (2.32) holds, we may use Theorem 2.10 to conclude the existence of a constant q ∈ (0, 1) such that

Finally we may use (2.38) to complete the proof. We remark that, when Z is finite dimensional, the coercivity of M* restricted on N(M*)^⊥ required in the scenarios (iii) and (iv) holds automatically. Indeed, if it did not hold, then there would exist a sequence {z^k} ⊂ N(M*)^⊥ with M*z^k → 0; by rescaling we may assume ∥z^k∥ = 1 for all k. Since Z is finite-dimensional, by taking a subsequence if necessary, we may assume z^k → z for some z ∈ N(M*)^⊥ with ∥z∥ = 1; then M*z = 0, i.e. z ∈ N(M*) ∩ N(M*)^⊥ with ∥z∥ = 1, which is a contradiction. □

Proximal ADMM for linear inverse problems
In this section we consider the method (1.11) as a regularization method for solving (1.9) and establish a convergence rate result under a benchmark source condition on the sought solution. Throughout this section we make the following assumptions on the operators Q, L, A, the constraint set C and the function f.

Assumption 3.1.
(i) A : X → H is a bounded linear operator, Q : X → X is a bounded linear positive semi-definite self-adjoint operator, and C ⊂ X is a closed convex subset.
(ii) L is a densely defined, closed, linear operator from X to Y with domain dom(L).
(iii) There is a constant c_0 > 0 such that ∥Ax∥² + ∥Lx∥² ≥ c_0∥x∥² for all x ∈ dom(L).
(iv) f : Y → (−∞, ∞] is proper, lower semi-continuous, and strongly convex.
These assumptions are standard in the literature on regularization methods and have been used in [21,22]. Since L is densely defined and closed, its adjoint L* is well defined; it is also closed and densely defined, and z ∈ dom(L*) if and only if ⟨L*z, x⟩ = ⟨z, Lx⟩ for all x ∈ dom(L). Under Assumption 3.1, it has been shown in [21,22] that the proximal ADMM (1.11) is well-defined and, if the exact data b is consistent in the sense that there exists x ∈ X such that x ∈ dom(L) ∩ C, Lx ∈ dom(f) and Ax = b, then the problem (1.9) has a unique solution, denoted by x†. Furthermore, there holds the following monotonicity result, see [22, Lemma 2.3]; alternatively, it can also be derived from Lemma 2.3.

Lemma 3.1. Let {(z^k, y^k, x^k, λ^k, µ^k, ν^k)} be defined by the proximal ADMM (1.11) with noisy data and let E_k be defined by (3.1). Then {E_k} is monotonically decreasing with respect to k.
In the following we will always assume the exact data b is consistent. We will derive a convergence rate of x^k to the unique solution x† of (1.9) under the source condition

∃ µ† ∈ ∂f(Lx†), ν† ∈ ∂ι_C(x†) and λ† ∈ H such that L*µ† + ν† = A*λ†. (3.2)

Note that when L = I and C = X, (3.2) becomes the benchmark source condition

∂f(x†) ∩ Ran(A*) ≠ ∅,

which has been widely used to derive convergence rates for regularization methods, see [7,13,23,29] for instance. We have the following convergence rate result.
Theorem 3.2. Let Assumption 3.1 hold, let the exact data b be consistent, and let the sequence {(z^k, y^k, x^k, λ^k, µ^k, ν^k)} be defined by the proximal ADMM (1.11) with noisy data b^δ satisfying ∥b^δ − b∥ ≤ δ. Assume the unique solution x† of (1.9) satisfies the source condition (3.2). Then for the integer k_δ chosen such that k_δ ∼ δ^{−1} there hold

∥x^{k_δ} − x†∥ = O(δ^{1/4}) and ∥z^{k_δ} − x†∥ = O(δ^{1/4})

as δ → 0.
In order to prove this result, let us start from the formulation of the algorithm (1.11) and derive some useful estimates. For simplicity of exposition, we set

∆x^{k+1} := x^{k+1} − x^k, ∆y^{k+1} := y^{k+1} − y^k, ∆z^{k+1} := z^{k+1} − z^k,

and similarly ∆λ^{k+1} := λ^{k+1} − λ^k, ∆µ^{k+1} := µ^{k+1} − µ^k and ∆ν^{k+1} := ν^{k+1} − ν^k. According to the definition of z^{k+1}, y^{k+1} and x^{k+1} in (1.11), we have the optimality conditions

where σ_f denotes the modulus of convexity of f; we have σ_f > 0 as f is strongly convex. By taking the inner product of (3.3) with z^{k+1} − x† we have

Therefore we may use the definition of λ^{k+1}, µ^{k+1}, ν^{k+1} in (1.11) and the fact Ax† = b to further obtain

0 = ⟨λ^{k+1}, Az^{k+1} − b⟩ + ⟨µ^{k+1} + ρ_2∆y^{k+1}, Lz^{k+1} − y†⟩

Subtracting (3.8) from (3.7) gives

Note that, under the source condition (3.2), there exist µ† ∈ ∂f(y†) with y† := Lx†, ν† ∈ ∂ι_C(x†) and λ† such that L*µ† + ν† = A*λ†. Thus, it follows from the above equation and the last two equations in (1.11) that

By using (3.9), b = Ax† and the convexity of f, we can see that

Consequently, by using the fourth equation in (1.11), we have

By using the polarization identity and the last two equations in (1.11) we further have

Then

where E_k is defined by (3.1).
This together with (3.10) implies (3.11). From (3.11) it follows immediately that

which shows (3.12). By the non-negativity of E_{k+1} we then obtain from (3.12) that

Φ_{k+1} ≤ Φ_k + 2ρ_1 Φ_{k+1} δ, ∀k ≥ 0,

which clearly implies (3.13). □

In order to derive the estimate on Φ_k from (3.13), we need the following elementary result.

Lemma 3.4. Let {a_k} and {b_k} be two sequences of nonnegative numbers such that

Proof. We show the result by induction on k. The result is trivial for k = 0 since the given condition with k = 0 gives a_0 ≤ b_0. Assume that the result is valid for all 0 ≤ k ≤ l for some l ≥ 0. We show it is also true for k = l + 1. If a_{l+1} ≤ max{a_0, ..., a_l}, then a_{l+1} ≤ a_j for some 0 ≤ j ≤ l. Thus, by the induction hypothesis and the monotonicity of {b_k} we have

which implies that

Taking square roots shows a_{l+1} ≤ b_{l+1} + c(l + 1) again. □

Lemma 3.5. There hold

which shows (3.15). □

Now we are ready to complete the proof of Theorem 3.2.
Proof of Theorem 3.2. Let k_δ be an integer such that k_δ ∼ δ^{−1}. From (3.14) and (3.15) in Lemma 3.5 it follows that

where C_0 and C_1 are constants independent of k and δ. In order to use (3.11) in Lemma 3.3 to estimate ∥y^{k_δ} − y†∥, we first consider Φ_k − Φ_{k+1} for all k ≥ 0. By using the definition of Φ_k and the inequality ∥u∥² − ∥v∥² ≤ (∥u∥ + ∥v∥)∥u − v∥, we have for k ≥ 0 that

By virtue of the Cauchy-Schwarz inequality and the inequality (a + b)² ≤ 2(a² + b²) for any numbers a, b ∈ R we can further obtain

This together with (3.16) in particular implies

Therefore, it follows from (3.11) that

Thus

where C_2 is a constant independent of δ and k. By using this estimate, the definition of E_{k_δ}, and the last three equations in (1.11), we can see that

By virtue of (iii) in Assumption 3.1 on A and L we thus obtain

This means there is a constant C_3 independent of δ and k such that

Finally we obtain the asserted rates. The proof is thus complete. □

Remark 3.1. Under the benchmark source condition (3.2), we have obtained in Theorem 3.2 the convergence rate O(δ^{1/4}) for the proximal ADMM (1.11). This rate is not order optimal, and it is not yet clear if the order optimal rate O(δ^{1/2}) can be achieved. When L = I, the problem (1.9) reduces to (3.17) and the method (1.11) reduces to a correspondingly simpler scheme (3.18); in this case the source condition (3.2) reduces to the form

∃ µ† ∈ ∂f(x†) and ν† ∈ ∂ι_C(x†) such that µ† + ν† ∈ Ran(A*). (3.19)

If the unique solution x† of (3.17) satisfies the source condition (3.19), one may follow the proof of Theorem 3.2 with minor modification to deduce for the method (3.18) that ∥x^{k_δ} − x†∥ = O(δ^{1/4}) and ∥z^{k_δ} − x†∥ = O(δ^{1/4}) whenever the integer k_δ is chosen such that k_δ ∼ δ^{−1}.
We conclude this section by presenting a numerical result to illustrate the semi-convergence property of the proximal ADMM and the convergence rate. We consider finding a solution of (1.8) with minimal norm; this is equivalent to solving (3.17) with f(x) = ½∥x∥². For numerical implementation, we discretize the equation by the trapezoidal rule based on partitioning [0, 1] into N − 1 subintervals of equal length with N = 600. We then execute the method (3.20) with Q = 0, ρ_1 = 10, ρ_2 = 1 and the initial guess x^0 = λ^0 = ν^0 = 0, using the noisy data b^δ for several distinct values of δ. In Figure 1 (b) and (c) we plot the relative error ∥x^k − x†∥_{L²}/∥x†∥_{L²} versus k, the number of iterations, for δ = 10^{−2} and δ = 10^{−4} respectively. These plots demonstrate that the proximal ADMM always exhibits the semi-convergence phenomenon when used to solve ill-posed problems, no matter how small the noise level is. Therefore, properly terminating the iteration is important to produce useful approximate solutions; this has been done in [21,22].
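The experiment can be reproduced in spirit with a few lines of NumPy. Since the precise integral operator behind Figure 1 is not reproduced here, the sketch below substitutes a generic smoothing kernel and a smooth stand-in for x†, keeping N = 600, Q = 0, ρ_1 = 10, ρ_2 = 1 and zero initial guesses as stated; all problem-specific names are our assumptions.

```python
import numpy as np

np.random.seed(0)
N = 600
t = np.linspace(0.0, 1.0, N)
h = t[1] - t[0]
K = np.exp(-100.0 * (t[:, None] - t[None, :]) ** 2)  # stand-in smoothing kernel
w = np.full(N, h); w[0] = w[-1] = h / 2.0            # trapezoidal weights
A = K * w                                            # discretized integral operator

x_true = np.sin(np.pi * t)                           # stand-in for x^dagger
b = A @ x_true
delta = 1e-2
noise = np.random.randn(N)
b_delta = b + delta * noise / np.linalg.norm(noise)  # ||b_delta - b|| = delta

rho1, rho2 = 10.0, 1.0
x = lam = nu = np.zeros(N)
M = rho1 * A.T @ A + rho2 * np.eye(N)
errs = []
for k in range(2000):
    # z-update: quadratic subproblem (minimal-norm splitting, Q = 0)
    z = np.linalg.solve(M, A.T @ (rho1 * b_delta - lam) + rho2 * x - nu)
    x = (rho2 * z + nu) / (1.0 + rho2)               # prox of ||x||^2/2
    lam = lam + rho1 * (A @ z - b_delta)             # multiplier for Az = b
    nu = nu + rho2 * (z - x)                         # multiplier for z = x
    errs.append(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
print(int(np.argmin(errs)), min(errs))               # iter_min, err_min
```

Running this with decreasing delta reproduces the qualitative behavior reported below: the error history dips and then rises, and the minimal error decays with the noise level.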
In Table 1 we report further numerical results. For the noisy data b^δ with each noise level δ = 10^{−i}, i = 1, ..., 7, we execute the method and determine the smallest relative error, denoted by err_min, and the required number of iterations, denoted by iter_min. The ratios err_min/δ^{1/2} and err_min/δ^{1/4} are then calculated. Since x† satisfies the source condition (3.21), our theoretical result predicts the convergence rate O(δ^{1/4}). However, Table 1 illustrates that the value of err_min/δ^{1/2} does not change much, while the value of err_min/δ^{1/4} tends to decrease to 0 as δ → 0. This strongly suggests that the proximal ADMM admits the order optimal convergence rate O(δ^{1/2}) if the source condition (3.21) holds. However, how to derive this order optimal rate remains open.


Figure 1. (a) plots the true solution x†; (b) and (c) plot the relative errors versus the number of iterations for the method (3.20) using noisy data with noise levels δ = 10^{−2} and 10^{−4} respectively.

Table 1. Numerical results for the method (3.20) using noisy data with diverse noise levels, where err_min and iter_min denote respectively the smallest relative error and the required number of iterations.