1 Introduction

The alternating direction method of multipliers (ADMM) was introduced and developed in the 1970s by Glowinski and Marrocco [16] and Gabay and Mercier [15] for the numerical solution of partial differential equations. Due to its decomposability and superior flexibility, ADMM and its variants have gained renewed interest in recent years and have been widely used for solving large-scale optimization problems that arise in signal/image processing, statistics, machine learning, inverse problems and other fields, see [5, 17, 21]. Because of their popularity, many works have been devoted to the analysis of ADMM and its variants, see [5, 8, 10, 14, 19, 26, 33] for instance. This paper is devoted to deriving convergence rates of ADMM in two settings: its application to well-posed convex optimization problems and its use as a regularization method for linear ill-posed inverse problems.

In the first part of this paper we consider ADMM for solving linearly constrained two-block separable convex minimization problems. Let \(\mathcal X\), \(\mathcal Y\) and \(\mathcal Z\) be real Hilbert spaces with possibly infinite dimensions. We consider the convex minimization problem of the form

$$\begin{aligned} \begin{aligned} \text{ minimize }&\quad H(x, y) := f(x) + g(y) \\ \text{ subject } \text{ to }&\quad A x + By =c, \end{aligned} \end{aligned}$$
(1.1)

where \(c \in \mathcal Z\), \(A: \mathcal X\rightarrow \mathcal Z\) and \(B: \mathcal Y\rightarrow \mathcal Z\) are bounded linear operators, and \(f: \mathcal X\rightarrow (-\infty , \infty ]\) and \(g: \mathcal Y\rightarrow (-\infty , \infty ]\) are proper, lower semi-continuous, convex functions. The classical ADMM solves (1.1) approximately by constructing an iterative sequence via alternately minimizing the augmented Lagrangian function

$$\begin{aligned} {\mathscr {L}}_\rho (x, y, \lambda ):= f(x) + g(y) + \langle \lambda , A x + B y -c \rangle + \frac{\rho }{2} \Vert A x + B y - c\Vert ^2 \end{aligned}$$

with respect to the primal variables x and y and then updating the dual variable \(\lambda \); more precisely, starting from an initial guess \(y^0\in \mathcal Y\) and \(\lambda ^0\in \mathcal Z\), an iterative sequence \(\{(x^k, y^k, \lambda ^k)\}\) is defined by

$$\begin{aligned} \begin{aligned} x^{k+1}&= \arg \min _{x\in \mathcal X} \left\{ f(x) + \langle \lambda ^k, A x\rangle + \frac{\rho }{2} \Vert A x+B y^k -c\Vert ^2 \right\} ,\\ y^{k+1}&= \arg \min _{y\in \mathcal Y} \left\{ g(y) + \langle \lambda ^k, B y\rangle + \frac{\rho }{2} \Vert A x^{k+1}+By -c\Vert ^2 \right\} ,\\ \lambda ^{k+1}&= \lambda ^k + \rho (A x^{k+1}+B y^{k+1}-c), \end{aligned} \end{aligned}$$
(1.2)

where \(\rho >0\) is a given penalty parameter. The implementation of (1.2) requires determining \(x^{k+1}\) and \(y^{k+1}\) by solving two convex minimization problems during each iteration. Although f and g may have special structures so that their proximal mappings are easy to determine, solving the minimization problems in (1.2) is in general highly nontrivial due to the appearance of the terms \(\Vert A x\Vert ^2 \) and \(\Vert B y\Vert ^2\). In order to avoid this implementation issue, one may consider adding certain proximal terms to the x-subproblem and y-subproblem in (1.2) to remove the terms \(\Vert A x\Vert ^2\) and \(\Vert B y\Vert ^2\). For any bounded linear positive semi-definite self-adjoint operator D on a real Hilbert space \(\mathcal H\), we will use the notation

$$\begin{aligned} \Vert u\Vert _D^2: = \langle u, D u\rangle , \quad \forall u \in \mathcal H. \end{aligned}$$

By taking two bounded linear positive semi-definite self-adjoint operators \(P: \mathcal X\rightarrow \mathcal X\) and \(Q: \mathcal Y\rightarrow \mathcal Y\), we may add the terms \(\frac{1}{2} \Vert x-x^k\Vert _P^2\) and \(\frac{1}{2} \Vert y-y^k\Vert _Q^2\) to the x- and y-subproblems in (1.2) respectively to obtain the following proximal alternating direction method of multipliers ([4, 9, 19, 20, 22, 33])

$$\begin{aligned} \begin{aligned} x^{k+1}&= \arg \min _{x\in \mathcal X} \left\{ f(x) + \langle \lambda ^k, A x\rangle + \frac{\rho }{2} \left\| A x + B y^k -c \right\| ^2 + \frac{1}{2} \Vert x-x^k\Vert _P^2\right\} ,\\ y^{k+1}&= \arg \min _{y\in \mathcal Y} \left\{ g(y) + \langle \lambda ^k, B y\rangle + \frac{\rho }{2} \left\| A x^{k+1}+By-c \right\| ^2 + \frac{1}{2} \Vert y-y^k\Vert _Q^2\right\} ,\\ \lambda ^{k+1}&= \lambda ^k + \rho (A x^{k+1} + B y^{k+1} -c). \end{aligned} \end{aligned}$$
(1.3)

The advantage of (1.3) over (1.2) is that, with wise choices of P and Q, it is possible to remove the terms \(\Vert A x\Vert ^2\) and \(\Vert B y\Vert ^2\) and thus make the determination of \(x^{k+1}\) and \(y^{k+1}\) much easier.
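To make this mechanism concrete, the following is a minimal numerical sketch of the proximal ADMM (1.3), assuming the toy problem \(\min \Vert x\Vert _1 + \frac{1}{2}\Vert y-d\Vert ^2\) subject to \(A x - y = 0\) (so \(B=-I\) and \(c=0\)) together with the linearizing choice \(P = \tau I - \rho A^*A\) with \(\tau \ge \rho \Vert A\Vert ^2\) and \(Q=0\); all data and parameter values are illustrative. With this choice of P the x-subproblem reduces to a soft-thresholding step, which is precisely the simplification discussed above.

```python
# Minimal sketch of the proximal ADMM (1.3) for
#   min ||x||_1 + 0.5*||y - d||^2   s.t.  A x - y = 0   (B = -I, c = 0),
# with P = tau*I - rho*A^T A (tau >= rho*||A||^2) and Q = 0, so the x-step
# becomes a prox (soft-thresholding) step. All data here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 100
A = rng.standard_normal((m, n))
d = rng.standard_normal(m)

rho = 1.0
tau = rho * np.linalg.norm(A, 2) ** 2 + 1e-3   # ensures P is positive semi-definite

def soft_threshold(v, t):
    # prox of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x, y, lam = np.zeros(n), np.zeros(m), np.zeros(m)
for k in range(500):
    # x-step: the proximal term cancels rho*A^T A, leaving a prox evaluation
    v = x - A.T @ (lam + rho * (A @ x - y)) / tau
    x = soft_threshold(v, 1.0 / tau)
    # y-step: smooth quadratic subproblem, solved in closed form
    y = (d + lam + rho * (A @ x)) / (1.0 + rho)
    # dual update
    lam = lam + rho * (A @ x - y)

print("constraint error:", np.linalg.norm(A @ x - y))
```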

In recent years, various convergence rate results have been established for ADMM and its variants in either the ergodic or the non-ergodic sense. In [19, 25] the ergodic convergence rate

$$\begin{aligned} |H(\bar{x}^k, \bar{y}^k)-H_*| = O\left( \frac{1}{k}\right) \quad \text{ and } \quad \Vert A \bar{x}^k+B \bar{y}^k -c\Vert = O \left( \frac{1}{k}\right) \end{aligned}$$
(1.4)

has been derived in terms of the objective error and the constraint error, where \(H_*\) denotes the minimum value of (1.1), k denotes the number of iterations, and

$$\begin{aligned} \bar{x}^k:= \frac{1}{k} \sum _{j=1}^k x^j \qquad \text{ and } \qquad \bar{y}^k:= \frac{1}{k} \sum _{j=1}^k y^j \end{aligned}$$

denote the ergodic iterates of \(\{x^k\}\) and \(\{y^k\}\) respectively; see also [4, Theorem 15.4]. A criticism of ergodic results is that they may fail to capture the features of the sought solution of the underlying problem, because ergodic iterates tend to average out the expected properties and thus destroy the structure of the solution. To illustrate, if the iterates cycle through the standard basis vectors, \(x^j = e_{(j \bmod n)+1}\), then every iterate is 1-sparse while the ergodic average \(\bar{x}^k\) approaches the fully dense vector \((1/n, \ldots , 1/n)\). This is particularly undesirable in sparsity optimization and low-rank learning. In contrast, the non-ergodic iterate tends to share structural properties with the solution of the underlying problem. Therefore, the use of non-ergodic iterates is more favorable in practice. In [20] a non-ergodic convergence rate has been derived for the proximal ADMM (1.3) with \(Q=0\) and the result reads as

$$\begin{aligned} \Vert x^{k+1} - x^k\Vert _P^2 + \Vert B(y^{k+1}-y^k)\Vert ^2 + \Vert \lambda ^{k+1}-\lambda ^k\Vert ^2 = o \left( \frac{1}{k}\right) . \end{aligned}$$
(1.5)

By exploiting the connection with the Douglas-Rachford splitting algorithm, the non-ergodic convergence rate

$$\begin{aligned} |H(x^k, y^k)-H_*| = o\left( \frac{1}{\sqrt{k}}\right) \quad \text{ and } \quad \Vert A x^k+B y^k -c\Vert = o \left( \frac{1}{\sqrt{k}}\right) \end{aligned}$$
(1.6)

in terms of the objective error and the constraint error has been established in [8] for the ADMM (1.2), and an example has been provided to demonstrate that the estimates in (1.6) are sharp. However, the derivation of (1.6) in [8] relies on some unnatural technical conditions involving the convex conjugates of f and g, see Remark 2.1. Note that the estimate (1.5) implies the second estimate in (1.6); however, it does not directly imply the first estimate in (1.6). In Sect. 2 we will show, by a simpler argument, that an estimate similar to (1.5) can be established for the proximal ADMM (1.3) with arbitrary positive semi-definite Q. Based on this result and some additional properties of the method, we will further show that the non-ergodic rate (1.6) holds for the proximal ADMM (1.3) with arbitrary positive semi-definite P and Q. Our result does not require the technical conditions assumed in [8].

In order to obtain faster convergence rates for the proximal ADMM (1.3), certain regularity conditions should be imposed. In the finite-dimensional setting, a number of linear convergence results have been established. In [9] linear convergence results for the proximal ADMM have been provided under a number of scenarios involving the strong convexity of f and/or g, the Lipschitz continuity of \(\nabla f\) and/or \(\nabla g\), together with further full row/column rank assumptions on A and/or B. Under a bounded metric subregularity condition, in particular under the assumption that both f and g are convex piecewise linear-quadratic functions, a global linear convergence rate has been established in [32] for the proximal ADMM (1.3) with

$$\begin{aligned} P:=\tau _1 I - \rho A^*A \succ 0 \quad \text{ and } \quad Q:= \tau _2 I - \rho B^*B\succ 0, \end{aligned}$$
(1.7)

where \(A^*\) and \(B^*\) denote the adjoints of A and B respectively; the condition (1.7) plays an essential role in the convergence analysis in [32]. We will derive faster convergence rates for the proximal ADMM (1.3) in the general Hilbert space setting. To this end, we first need to consider the weak convergence of \(\{(x^k, y^k, \lambda ^k)\}\) and demonstrate that every weak cluster point of this sequence is a KKT point of (1.1). This may not be an issue in finite dimensions. However, it is nontrivial in infinite-dimensional spaces because extra care is required in dealing with weak convergence. In [6] the weak convergence of the proximal ADMM (1.3) has been considered by transforming the method into a proximal point method, and the result there requires restrictive conditions, see [6, Lemma 3.4 and Theorem 3.1]. These restrictive conditions have been weakened later in [31] by using machinery from maximal monotone operator theory. We will explore the structure of the proximal ADMM and show by an elementary argument that every weak cluster point of \(\{(x^k, y^k, \lambda ^k)\}\) is indeed a KKT point of (1.1) without any additional conditions. We will then consider the linear convergence of the proximal ADMM under a bounded metric subregularity condition and obtain linear convergence for any positive semi-definite P and Q; in particular, we obtain the linear convergence of \(|H(x^k, y^k) - H_*|\) and \(\Vert A x^k + B y^k - c\Vert \). We also consider deriving convergence rates under a bounded Hölder metric subregularity condition which is weaker than bounded metric subregularity. This weaker condition holds if both f and g are semi-algebraic functions and thus a wider range of applications can be covered. We show that, under a bounded Hölder metric subregularity condition, among other things the convergence rates in (1.6) can be improved to

$$\begin{aligned} \Vert A x^k + B y^k - c\Vert = O(k^{-\beta }) \quad \text{ and } \quad |H(x^k, y^k) - H_*| = O(k^{-\beta }) \end{aligned}$$

for some number \(\beta >1/2\); the value of \(\beta \) depends on the properties of f and g. In fact, under the bounded Hölder metric subregularity condition with exponent \(\alpha \in (0, 1)\) considered in Theorem 2.9 below, one obtains \(\beta = \frac{1}{2(1-\alpha )}\). To further weaken the bounded (Hölder) metric subregularity assumption, we introduce an iteration based error bound condition which extends the one in [27] to the general proximal ADMM (1.3). It is interesting to observe that this error bound condition holds under any one of the scenarios proposed in [9]. Hence, we provide a unified analysis for deriving convergence rates under bounded (Hölder) metric subregularity or the scenarios in [9]. Furthermore, we extend the scenarios in [9] to the general Hilbert space setting and demonstrate that some conditions can be weakened and the convergence result can be strengthened; see Theorem 2.11.

In the second part of this paper, we consider using ADMM as a regularization method to solve linear ill-posed inverse problems in Hilbert spaces. Linear inverse problems have a wide range of applications, including medical imaging, geophysics, astronomy, signal processing, and more [12, 18, 28]. We consider linear inverse problems of the form

$$\begin{aligned} Ax = b, \quad x\in \mathcal C, \end{aligned}$$
(1.8)

where \(A: \mathcal X\rightarrow \mathcal H\) is a compact linear operator between two Hilbert spaces \(\mathcal X\) and \(\mathcal H\), \(\mathcal C\) is a closed convex set in \(\mathcal X\), and \(b \in \text{ Ran }(A)\), the range of A. In order to find a solution of (1.8) with desired properties, a priori available information on the sought solution should be incorporated into the problem. Assume that, under a suitable linear transform L from \(\mathcal X\) to another Hilbert space \(\mathcal Y\) with domain \(\text{ dom }(L)\), the features of the sought solution can be captured by a proper convex penalty function \(f: \mathcal Y\rightarrow (-\infty , \infty ]\). One may then consider, instead of (1.8), the constrained optimization problem

$$\begin{aligned} \min \{f (Lx): Ax = b, \ x\in \mathcal C, \ x\in \text{ dom }(L)\}. \end{aligned}$$
(1.9)

A challenging issue related to the numerical resolution of (1.9) is its ill-posedness in the sense that the solution of (1.9) does not depend continuously on the data, so a small perturbation of the data can lead to a large deviation in the solutions. In practical applications, the exact data b is usually unavailable; instead, only noisy data \(b^\delta \) is at hand with

$$\begin{aligned} \Vert b^\delta - b\Vert \le \delta \end{aligned}$$

for some small noise level \(\delta >0\). To overcome ill-posedness, regularization methods should be introduced to produce reasonable approximate solutions; one may refer to [7, 12, 23, 29] for various regularization methods.

The common use of ADMM to solve (1.9) with noisy data \(b^\delta \) first considers the variational regularization

$$\begin{aligned} \min _{x\in \mathcal C} \left\{ \frac{1}{2} \Vert A x - b^\delta \Vert ^2 + \alpha f(Lx)\right\} , \end{aligned}$$
(1.10)

then uses a splitting technique to rewrite (1.10) in the form (1.1), and finally applies the ADMM procedure to produce approximate solutions. The parameter \(\alpha >0\) is the so-called regularization parameter, which should be adjusted carefully to achieve reasonably good performance; consequently one has to run ADMM to solve (1.10) for many different values of \(\alpha \), which can be time consuming.
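To make the splitting step explicit, one standard reformulation, stated here for illustration, introduces an auxiliary variable \(y = L x\) and writes (1.10) in the form (1.1) as

$$\begin{aligned} \min \left\{ \tilde{f}(x) + \tilde{g}(y): L x - y = 0\right\} , \quad \tilde{f}(x):= \frac{1}{2} \Vert A x - b^\delta \Vert ^2 + \iota _{\mathcal C}(x), \quad \tilde{g}(y):= \alpha f(y), \end{aligned}$$

so that the x-subproblem involves only the quadratic data-fit term and the constraint set \(\mathcal C\), while the y-subproblem reduces to a proximal step of \(\alpha f\).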

In [21, 22] the ADMM has been considered to solve (1.9) directly to reduce the computational load. Note that (1.9) can be written as

$$\begin{aligned} \left\{ \begin{array}{lll} \min f(y) + \iota _{\mathcal C}(x)\\ \text{ subject } \text{ to } A z = b, \ L z - y = 0, \ z - x = 0, \ z \in \text{ dom }(L), \end{array}\right. \end{aligned}$$

where \(\iota _{\mathcal C}\) denotes the indicator function of \(\mathcal C\). With the noisy data \(b^\delta \) we introduce the augmented Lagrangian function

$$\begin{aligned} {\mathscr {L}}_{\rho _1, \rho _2, \rho _3}(z, y, x, \lambda , \mu , \nu )&:= f(y) + \iota _{\mathcal C}(x) + \langle \lambda , A z - b^\delta \rangle + \langle \mu , L z - y\rangle + \langle \nu , z - x \rangle \\&\quad \, + \frac{\rho _1}{2} \Vert A z - b^\delta \Vert ^2 + \frac{\rho _2}{2} \Vert L z - y\Vert ^2 + \frac{\rho _3}{2} \Vert z -x\Vert ^2, \end{aligned}$$

where \(\rho _1\), \(\rho _2\) and \(\rho _3\) are preassigned positive numbers. The proximal ADMM proposed in [22] for solving (1.9) then takes the form

$$\begin{aligned} \begin{aligned}&z^{k+1} = \arg \min _{z\in \text{ dom }(L)} \left\{ {\mathscr {L}}_{\rho _1, \rho _2, \rho _3}(z, y^k, x^k, \lambda ^k, \mu ^k, \nu ^k) + \frac{1}{2} \Vert z-z^k\Vert _Q^2\right\} ,\\&y^{k+1} = \arg \min _{y\in \mathcal Y} \left\{ {\mathscr {L}}_{\rho _1, \rho _2, \rho _3}(z^{k+1}, y, x^k, \lambda ^k, \mu ^k, \nu ^k) \right\} ,\\&x^{k+1} = \arg \min _{x\in \mathcal X} \left\{ {\mathscr {L}}_{\rho _1, \rho _2, \rho _3}(z^{k+1}, y^{k+1}, x, \lambda ^k, \mu ^k, \nu ^k)\right\} , \\&\lambda ^{k+1} = \lambda ^k + \rho _1 (A z^{k+1} - b^\delta ), \\&\mu ^{k+1} = \mu ^k + \rho _2 (L z^{k+1} - y^{k+1}), \\&\nu ^{k+1} = \nu ^k + \rho _3 (z^{k+1} - x^{k+1}), \end{aligned} \end{aligned}$$
(1.11)

where Q is a bounded linear positive semi-definite self-adjoint operator. Despite its appearance, the method (1.11) is not a 3-block ADMM. Note that the variables y and x are not coupled in \({\mathscr {L}}_{\rho _1, \rho _2, \rho _3}(z, y, x, \lambda , \mu , \nu )\). Thus, \(y^{k+1}\) and \(x^{k+1}\) can be updated simultaneously, i.e.

$$\begin{aligned} (y^{k+1}, x^{k+1}) = \arg \min _{y\in \mathcal Y, x \in \mathcal X} \left\{ {\mathscr {L}}_{\rho _1, \rho _2, \rho _3}(z^{k+1}, y, x, \lambda ^k, \mu ^k, \nu ^k) \right\} . \end{aligned}$$

This demonstrates that (1.11) is a 2-block proximal ADMM.
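To illustrate how the subproblems in (1.11) decompose, the following is a minimal finite-dimensional sketch, assuming \(f = \Vert \cdot \Vert _1\), \(L = I\), \(\mathcal C = \{x \ge 0\}\) and \(Q = 0\): the z-subproblem then reduces to a symmetric positive definite linear solve, the y-subproblem to soft-thresholding, and the x-subproblem to a projection. All data and parameter values are illustrative and not prescriptions from [21, 22].

```python
# Minimal sketch of method (1.11), assuming f = ||.||_1, L = I, C = {x >= 0}
# and Q = 0. All data, sizes and penalty parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 80
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.maximum(rng.standard_normal(n), 0.0)
b_delta = A @ x_true + 1e-3 * rng.standard_normal(m)     # noisy data

rho1 = rho2 = rho3 = 1.0
L = np.eye(n)                                            # transform L = I
M = rho1 * A.T @ A + rho2 * L.T @ L + rho3 * np.eye(n)   # z-subproblem matrix (SPD)

z, y, x = np.zeros(n), np.zeros(n), np.zeros(n)
lam, mu, nu = np.zeros(m), np.zeros(n), np.zeros(n)
for k in range(200):
    # z-step: first-order optimality condition of the quadratic subproblem
    rhs = (rho1 * A.T @ b_delta - A.T @ lam
           + L.T @ (rho2 * y - mu) + rho3 * x - nu)
    z = np.linalg.solve(M, rhs)
    # y-step: prox of f/rho2, i.e. soft-thresholding for f = ||.||_1
    w = L @ z + mu / rho2
    y = np.sign(w) * np.maximum(np.abs(w) - 1.0 / rho2, 0.0)
    # x-step: projection onto C = {x >= 0}
    x = np.maximum(z + nu / rho3, 0.0)
    # dual updates
    lam = lam + rho1 * (A @ z - b_delta)
    mu = mu + rho2 * (L @ z - y)
    nu = nu + rho3 * (z - x)
```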

It should be highlighted that none of the well-established convergence results on the proximal ADMM for well-posed optimization problems applies directly to (1.11). Note that (1.11) uses the noisy data \(b^\delta \). If the convergence theory for well-posed optimization problems were applicable, one would obtain a solution of the perturbed problem

$$\begin{aligned} \min \left\{ f(Lx): A x = b^\delta , \ x \in \mathcal C, \ x \in \text{ dom }(L)\right\} \end{aligned}$$
(1.12)

of (1.9). Because A is compact, it is very likely that \(b^\delta \not \in \text{ Ran }(A^*)\) and thus (1.12) makes no sense as the feasible set is empty. Even if \(b^\delta \in \text{ Ran }(A^*)\) and (1.12) has a solution, this solution could be far away from the solution of (1.9) because of the ill-posedness.

Therefore, if (1.11) is used to solve (1.9), a better result cannot be expected even if a larger number of iterations is performed. Instead, like all other iterative regularization methods, when (1.11) is used to solve (1.9) it exhibits the semi-convergence property: the iterates approach the sought solution at the beginning; however, after a critical number of iterations, they move away from the sought solution as the iteration proceeds. Thus, properly terminating the iteration is important to produce acceptable approximate solutions. One may hope to determine a stopping index \(k_\delta \), depending on \(\delta \) and/or \(b^\delta \), such that \(\Vert x^{k_\delta }-x^\dag \Vert \) is as small as possible and \(\Vert x^{k_\delta }-x^\dag \Vert \rightarrow 0\) as \(\delta \rightarrow 0\), where \(x^\dag \) denotes the solution of (1.9). This has been done in our previous work [21, 22], in which early stopping rules have been proposed for the method (1.11) to render it a regularization method, and numerical results have been reported to demonstrate its performance. However, the work in [21, 22] does not provide convergence rates, i.e. estimates of \(\Vert x^{k_\delta } - x^\dag \Vert \) in terms of \(\delta \). Deriving convergence rates for iterative regularization methods involving general convex regularization terms is a challenging question and only a limited number of results are available. In order to derive a convergence rate of a regularization method for ill-posed problems, a certain source condition should be imposed on the sought solution. In Sect. 3, under a benchmark source condition on the sought solution, we will provide a partial answer to this question by establishing a convergence rate result for (1.11) when the iteration is terminated by an a priori stopping rule.
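For illustration only, the following sketch shows how a stopping index might be monitored in practice with a discrepancy-style criterion; this is a common heuristic in iterative regularization and is not one of the specific stopping rules proposed in [21, 22]. The callable `step`, which performs one sweep of the updates in (1.11) and returns the current iterate, is a hypothetical placeholder.

```python
# Illustration only: early stopping by a discrepancy-style criterion (a common
# heuristic, not the stopping rules of [21, 22]). `step` is a hypothetical
# callable performing one sweep of (1.11) and returning the current iterate.
import numpy as np

def run_with_early_stopping(step, A, b_delta, delta, tau=1.05, max_iter=1000):
    """Stop at the first k with ||A z^k - b_delta|| <= tau * delta."""
    z = None
    for k in range(1, max_iter + 1):
        z = step()
        if np.linalg.norm(A @ z - b_delta) <= tau * delta:
            return z, k      # k plays the role of the stopping index k_delta
    return z, max_iter
```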

We conclude this section with some notation and terminology. Let \(\mathcal V\) be a real Hilbert space. We use \(\langle \cdot , \cdot \rangle \) and \(\Vert \cdot \Vert \) to denote its inner product and the induced norm. We also use “\(\rightarrow \)” and “\(\rightharpoonup \)” to denote strong convergence and weak convergence respectively. For a function \(\varphi : \mathcal V\rightarrow (-\infty , \infty ]\) its domain is defined as \(\text{ dom }(\varphi ):= \{x\in \mathcal V: \varphi (x) <\infty \}\). If \(\text{ dom }(\varphi )\ne \emptyset \), \(\varphi \) is called proper. For a proper convex function \(\varphi : \mathcal V\rightarrow (-\infty , \infty ]\), its modulus of convexity, denoted by \(\sigma _\varphi \), is defined to be the largest number c such that

$$\begin{aligned} \varphi (t x + (1-t) y) + c t(1-t) \Vert x-y\Vert ^2 \le t \varphi (x) + (1-t) \varphi (y) \end{aligned}$$

for all \(x, y \in \text{ dom }(\varphi )\) and \(0\le t\le 1\). We always have \(\sigma _\varphi \ge 0\). If \(\sigma _\varphi >0\), \(\varphi \) is called strongly convex. For a proper convex function \(\varphi : \mathcal V\rightarrow (-\infty , \infty ]\), we use \(\partial \varphi \) to denote its subdifferential, i.e.

$$\begin{aligned} \partial \varphi (x):=\{\xi \in \mathcal V: \varphi (y) \ge \varphi (x) + \langle \xi , y-x\rangle \text{ for } \text{ all } y \in \mathcal V\}, \quad x \in \mathcal V. \end{aligned}$$

Let \(\text{ dom }(\partial \varphi ):=\{x\in \mathcal V: \partial \varphi (x) \ne \emptyset \}\). It is easy to see that

$$\begin{aligned} \varphi (y) - \varphi (x) - \langle \xi , y-x\rangle \ge \sigma _\varphi \Vert y-x\Vert ^2 \end{aligned}$$

for all \(y\in \mathcal V\), \(x\in \text{ dom }(\partial \varphi )\) and \(\xi \in \partial \varphi (x)\), which in particular implies the monotonicity of \(\partial \varphi \), i.e.

$$\begin{aligned} \langle \xi - \eta , x - y\rangle \ge 2 \sigma _\varphi \Vert x-y\Vert ^2 \end{aligned}$$

for all \(x, y \in \text{ dom }(\partial \varphi )\), \(\xi \in \partial \varphi (x)\) and \(\eta \in \partial \varphi (y)\).
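As a simple illustration of these notions, consider \(\varphi (x) = \frac{1}{2} \Vert x\Vert ^2\) on \(\mathcal V\). A direct computation gives

$$\begin{aligned} t \varphi (x) + (1-t) \varphi (y) - \varphi (t x + (1-t) y) = \frac{1}{2} t (1-t) \Vert x - y\Vert ^2, \end{aligned}$$

so \(\sigma _\varphi = 1/2\); moreover \(\partial \varphi (x) = \{x\}\) and the monotonicity inequality holds with equality since \(\langle \xi - \eta , x-y\rangle = \Vert x-y\Vert ^2 = 2 \sigma _\varphi \Vert x-y\Vert ^2\).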

2 Proximal ADMM for Convex Optimization Problems

In this section we will consider the proximal ADMM (1.3) for solving the linearly constrained convex minimization problem (1.1). For the convergence analysis, we will make the following standard assumptions.

Assumption 1

\(\mathcal X\), \(\mathcal Y\) and \(\mathcal Z\) are real Hilbert spaces, \(A: \mathcal X\rightarrow \mathcal Z\) and \(B: \mathcal Y\rightarrow \mathcal Z\) are bounded linear operators, \(P: \mathcal X\rightarrow \mathcal X\) and \(Q: \mathcal Y\rightarrow \mathcal Y\) are bounded linear positive semi-definite self-adjoint operators, and \(f: \mathcal X\rightarrow (-\infty , \infty ]\) and \(g: \mathcal Y\rightarrow (-\infty , \infty ]\) are proper, lower semi-continuous, convex functions.

Assumption 2

The problem (1.1) has a Karush-Kuhn-Tucker (KKT) point, i.e. there exists \((\bar{x}, \bar{y}, \bar{\lambda }) \in \mathcal X\times \mathcal Y\times \mathcal Z\) such that

$$\begin{aligned} -A^* \bar{\lambda }\in \partial f(\bar{x}), \quad -B^* \bar{\lambda }\in \partial g (\bar{y}), \quad A \bar{x} + B \bar{y} =c. \end{aligned}$$

It should be mentioned that, to guarantee that the proximal ADMM (1.3) is well-defined, certain additional conditions need to be imposed to ensure that the x- and y-subproblems do have minimizers. Since the well-definedness can be easily seen in concrete applications, to make the presentation more succinct we will not state these conditions explicitly.

By the convexity of f and g, it is easy to see that, for any KKT point \((\bar{x}, \bar{y}, \bar{\lambda })\) of (1.1), there hold

$$\begin{aligned} 0&\le f(x) - f(\bar{x}) + \langle \bar{\lambda }, A(x-\bar{x})\rangle , \quad \forall x \in \mathcal X,\\ 0&\le g(y) - g(\bar{y}) + \langle \bar{\lambda }, B(y-\bar{y})\rangle , \quad \forall y \in \mathcal Y. \end{aligned}$$

Adding these two inequalities and using \(A \bar{x}+ B\bar{y}-c =0\), it follows that

$$\begin{aligned} 0 \le H(x, y) - H(\bar{x}, \bar{y}) + \langle \bar{\lambda }, A x + B y-c \rangle , \quad \forall (x, y) \in \mathcal X\times \mathcal Y. \end{aligned}$$
(2.1)

This in particular implies that \((\bar{x}, \bar{y})\) is a solution of (1.1) and thus \(H_*:= H(\bar{x}, \bar{y})\) is the minimum value of (1.1).

Based on Assumptions 1 and 2 we will analyze the proximal ADMM (1.3). For ease of exposition, we set \(\widehat{Q}:= \rho B^* B + Q\) and define

$$\begin{aligned} G u:= (P x, {\widehat{Q}} y, \lambda /\rho ), \quad \forall u:= (x, y, \lambda ) \in \mathcal X\times \mathcal Y\times \mathcal Z\end{aligned}$$

which is a bounded linear positive semi-definite self-adjoint operator on \(\mathcal X\times \mathcal Y\times \mathcal Z\). Then, for any \(u:=(x, y,\lambda ) \in \mathcal X\times \mathcal Y\times \mathcal Z\) we have

$$\begin{aligned} \Vert u\Vert _G^2:=\langle u, G u\rangle = \Vert x\Vert _P^2 + \Vert y\Vert _{{\widehat{Q}}}^2 + \frac{1}{\rho } \Vert \lambda \Vert ^2. \end{aligned}$$

For the sequence \(\{u^k:=(x^k, y^k, \lambda ^k)\}\) defined by the proximal ADMM (1.3), we use the notation

$$\begin{aligned} \Delta x^k := x^k-x^{k-1}, \ \ \Delta y^k := y^k-y^{k-1}, \ \ \Delta \lambda ^k := \lambda ^k-\lambda ^{k-1}, \ \ \Delta u^k := u^k - u^{k-1}. \end{aligned}$$

We start from the first-order optimality conditions for \(x^{k+1}\) and \(y^{k+1}\), which by definition can be stated as

$$\begin{aligned} \begin{aligned} -A^* \lambda ^k - \rho A^* (A x^{k+1} + B y^k -c) - P (x^{k+1}-x^k)&\in \partial f(x^{k+1}), \\ -B^* \lambda ^k - \rho B^* (A x^{k+1} + B y^{k+1} -c) - Q(y^{k+1}-y^k)&\in \partial g(y^{k+1}). \end{aligned} \end{aligned}$$
(2.2)

By using \(\lambda ^{k+1}=\lambda ^k + \rho (A x^{k+1} + B y^{k+1} -c)\) we may rewrite (2.2) as

$$\begin{aligned} \begin{aligned} -A^* (\lambda ^{k+1} - \rho B \Delta y^{k+1}) -P \Delta x^{k+1}&\in \partial f(x^{k+1}), \\ -B^* \lambda ^{k+1} - Q \Delta y^{k+1}&\in \partial g(y^{k+1}) \end{aligned} \end{aligned}$$
(2.3)

which will be frequently used in the following analysis. We first prove the following important result which is inspired by [19, Lemma 3.1] and [4, Theorem 15.4].

Proposition 2.1

Let Assumption 1 hold. Then for the proximal ADMM (1.3) there holds

$$\begin{aligned}&\sigma _f \Vert x^{k+1} - x\Vert ^2 + \sigma _g \Vert y^{k+1} - y\Vert ^2 \\&\le H(x, y) - H(x^{k+1}, y^{k+1}) + \langle \lambda ^{k+1} - \rho B \Delta y^{k+1}, A x + B y -c \rangle \\&\quad \, - \langle \lambda , A x^{k+1} + B y^{k+1}-c\rangle + \frac{1}{2} \left( \Vert u^k-u\Vert _G^2 - \Vert u^{k+1}-u\Vert _G^2\right) \\&\quad \, -\frac{1}{2\rho } \Vert \Delta \lambda ^{k+1} - \rho B \Delta y^{k+1}\Vert ^2 -\frac{1}{2} \Vert \Delta x^{k+1}\Vert _P^2- \frac{1}{2} \Vert \Delta y^{k+1}\Vert _Q^2 \end{aligned}$$

for all \(u:= (x, y, \lambda ) \in \mathcal X\times \mathcal Y\times \mathcal Z\), where \(\sigma _f\) and \(\sigma _g\) denote the moduli of convexity of f and g respectively.

Proof

Let \(\tilde{\lambda }^{k+1}:= \lambda ^{k+1} - \rho B \Delta y^{k+1}\). By using (2.3) and the convexity of f and g we have for any \((x, y, \lambda ) \in \mathcal X\times \mathcal Y\times \mathcal Z\) that

$$\begin{aligned}&\sigma _f \Vert x^{k+1} - x\Vert ^2 + \sigma _g \Vert y^{k+1} - y\Vert ^2 \\&\le f(x) - f(x^{k+1}) + \langle \lambda ^{k+1} -\rho B \Delta y^{k+1}, A (x-x^{k+1})\rangle + \langle P\Delta x^{k+1}, x -x^{k+1}\rangle \\&\quad \, + g(y) - g(y^{k+1}) + \langle \lambda ^{k+1}, B(y-y^{k+1})\rangle + \langle Q \Delta y^{k+1}, y-y^{k+1}\rangle \\&= H(x, y) - H(x^{k+1}, y^{k+1}) + \langle \tilde{\lambda }^{k+1}, A(x-x^{k+1}) + B(y - y^{k+1})\rangle \\&\quad \, + \langle P\Delta x^{k+1}, x -x^{k+1}\rangle + \langle {\widehat{Q}} \Delta y^{k+1}, y-y^{k+1}\rangle \\&= H(x, y) - H(x^{k+1}, y^{k+1}) + \langle \tilde{\lambda }^{k+1}, A x + B y -c\rangle \\&\quad \, - \langle \lambda , A x^{k+1} + B y^{k+1} -c \rangle + \langle \lambda - \tilde{\lambda }^{k+1}, A x^{k+1} + B y^{k+1} - c\rangle \\&\quad \, + \langle P\Delta x^{k+1}, x -x^{k+1}\rangle + \langle {\widehat{Q}} \Delta y^{k+1}, y-y^{k+1}\rangle . \end{aligned}$$

Since \(\rho (A x^{k+1} + B y^{k+1} - c) = \Delta \lambda ^{k+1}\) we then obtain

$$\begin{aligned}&\sigma _f \Vert x^{k+1} - x\Vert ^2 + \sigma _g \Vert y^{k+1} - y\Vert ^2 \\&\le H(x, y) - H(x^{k+1}, y^{k+1}) + \langle \tilde{\lambda }^{k+1}, A x + By -c\rangle - \langle \lambda , A x^{k+1} + B y^{k+1} -c \rangle \\&\quad \, + \frac{1}{\rho } \langle \lambda - \lambda ^{k+1}, \Delta \lambda ^{k+1}\rangle + \frac{1}{\rho }\langle \lambda ^{k+1} - \tilde{\lambda }^{k+1}, \Delta \lambda ^{k+1} \rangle \\&\quad \, + \langle P\Delta x^{k+1}, x -x^{k+1}\rangle + \langle {\widehat{Q}} \Delta y^{k+1}, y-y^{k+1}\rangle . \end{aligned}$$

By using the polarization identity and the definition of G, it follows that

$$\begin{aligned}&\sigma _f \Vert x^{k+1} - x\Vert ^2 + \sigma _g \Vert y^{k+1} - y\Vert ^2 \\&\le H(x, y) - H(x^{k+1}, y^{k+1}) + \langle \tilde{\lambda }^{k+1}, A x + B y -c\rangle - \langle \lambda , A x^{k+1} + B y^{k+1} -c \rangle \\&\quad \, + \frac{1}{2\rho } \left( \Vert \lambda ^k-\lambda \Vert ^2 - \Vert \lambda ^{k+1} -\lambda \Vert ^2 - \Vert \Delta \lambda ^{k+1}\Vert ^2 \right) \\&\quad \, - \frac{1}{2\rho } \left( \Vert \lambda ^k - \tilde{\lambda }^{k+1}\Vert ^2 - \Vert \lambda ^{k+1} - \tilde{\lambda }^{k+1}\Vert ^2 - \Vert \Delta \lambda ^{k+1}\Vert ^2\right) \\&\quad \, + \frac{1}{2} \left( \Vert x^k-x\Vert _P^2 -\Vert x^{k+1}-x\Vert _P^2 - \Vert \Delta x^{k+1}\Vert _P^2 \right) \\&\quad \, + \frac{1}{2} \left( \Vert y^k-y\Vert _{{\widehat{Q}}}^2 - \Vert y^{k+1}-y\Vert _{{\widehat{Q}}}^2 -\Vert \Delta y^{k+1}\Vert _{{\widehat{Q}}}^2 \right) \\&= H(x, y) - H(x^{k+1}, y^{k+1}) + \langle \tilde{\lambda }^{k+1}, A x + B y -c\rangle - \langle \lambda , A x^{k+1} + B y^{k+1} -c \rangle \\&\quad \, + \frac{1}{2} \left( \Vert u^k-u\Vert _G^2 - \Vert u^{k+1} - u\Vert _G^2 \right) -\frac{1}{2\rho } \left( \Vert \lambda ^k - \tilde{\lambda }^{k+1}\Vert ^2 - \Vert \lambda ^{k+1} - \tilde{\lambda }^{k+1}\Vert ^2\right) \\&\quad \, - \frac{1}{2} \Vert \Delta x^{k+1}\Vert _P^2 - \frac{1}{2} \Vert \Delta y^{k+1}\Vert _{{\widehat{Q}}}^2. \end{aligned}$$

Using the definition of \(\tilde{\lambda }^{k+1}\) gives

$$\begin{aligned} \lambda ^k - \tilde{\lambda }^{k+1} = - \Delta \lambda ^{k+1} + \rho B \Delta y^{k+1}, \quad \lambda ^{k+1} -\tilde{\lambda }^{k+1} = \rho B \Delta y^{k+1}. \end{aligned}$$

Therefore

$$\begin{aligned}&\sigma _f \Vert x^{k+1} - x\Vert ^2 + \sigma _g \Vert y^{k+1} - y\Vert ^2 \\&\le H(x, y) - H(x^{k+1}, y^{k+1}) + \langle \tilde{\lambda }^{k+1}, A x + B y -c\rangle - \langle \lambda , A x^{k+1} + B y^{k+1} -c \rangle \\&\quad \, + \frac{1}{2} \left( \Vert u^k-u\Vert _G^2 - \Vert u^{k+1} - u\Vert _G^2 \right) -\frac{1}{2\rho } \Vert \Delta \lambda ^{k+1} - \rho B \Delta y^{k+1}\Vert ^2 \\&\quad \, - \frac{1}{2} \Vert \Delta x^{k+1}\Vert _P^2 - \frac{1}{2} \Vert \Delta y^{k+1}\Vert _{{\widehat{Q}}}^2 + \frac{\rho }{2} \Vert B \Delta y^{k+1}\Vert ^2. \end{aligned}$$

Since \(\rho \Vert B \Delta y^{k+1}\Vert ^2 - \Vert \Delta y^{k+1}\Vert _{\widehat{Q}}^2 = - \Vert \Delta y^{k+1}\Vert _Q^2\), we thus complete the proof. \(\square \)

Corollary 2.2

Let Assumptions 1 and 2 hold and let \(\bar{u}:=(\bar{x}, \bar{y}, \bar{\lambda })\) be any KKT point of (1.1). Then for the proximal ADMM (1.3) there holds

$$\begin{aligned} \sigma _f \Vert x^{k+1} - \bar{x}\Vert ^2 + \sigma _g \Vert y^{k+1} - \bar{y}\Vert ^2&\le H_* - H(x^{k+1}, y^{k+1}) - \langle \bar{\lambda }, A x^{k+1} + B y^{k+1} -c \rangle \nonumber \\&\quad \, + \frac{1}{2} \left( \Vert u^k-\bar{u}\Vert _G^2 - \Vert u^{k+1} - \bar{u}\Vert _G^2 \right) \end{aligned}$$
(2.4)

for all \(k\ge 0\). Moreover, the sequence \(\{\Vert u^k-\bar{u}\Vert _G^2\}\) is monotonically decreasing.

Proof

By taking \(u = \bar{u}\) in Proposition 2.1 and using \(A \bar{x} + B \bar{y} - c =0\) we immediately obtain (2.4). According to (2.1) we have

$$\begin{aligned} H(x^{k+1}, y^{k+1}) - H_* + \langle \bar{\lambda }, A x^{k+1} + B y^{k+1} -c \rangle \ge 0. \end{aligned}$$

Thus, from (2.4) we can obtain

$$\begin{aligned} \sigma _f \Vert x^{k+1} - \bar{x}\Vert ^2 + \sigma _g \Vert y^{k+1} - \bar{y}\Vert ^2 \le \frac{1}{2} \left( \Vert u^k-\bar{u}\Vert _G^2 - \Vert u^{k+1} - \bar{u}\Vert _G^2 \right) \end{aligned}$$
(2.5)

which implies the monotonicity of the sequence \(\{\Vert u^k - \bar{u}\Vert _G^2\}\). \(\square \)

We next show that \(\Vert \Delta u^k\Vert _G^2 = o(1/k)\) as \(k\rightarrow \infty \). This result for the proximal ADMM (1.3) with \(Q =0\) has been established in [20] based on a variational inequality approach. We will establish this result for the proximal ADMM (1.3) with general bounded linear positive semi-definite self-adjoint operators P and Q by a simpler argument.

Lemma 2.3

Let Assumption 1 hold. For the proximal ADMM (1.3), the sequence \(\{\Vert \Delta u^k\Vert _G^2\}\) is monotonically decreasing.

Proof

By using (2.3) and the monotonicity of \(\partial f\) and \(\partial g\), we can obtain

$$\begin{aligned} 0&\le \left\langle -A^*(\Delta \lambda ^{k+1} - \rho B \Delta y^{k+1} + \rho B \Delta y^k) - P \Delta x^{k+1} + P \Delta x^k, \Delta x^{k+1}\right\rangle \\&\quad + \left\langle - B^* \Delta \lambda ^{k+1}-Q \Delta y^{k+1} + Q\Delta y^k, \Delta y^{k+1} \right\rangle \\&= - \langle \Delta \lambda ^{k+1}, A \Delta x^{k+1} + B \Delta y^{k+1} \rangle + \rho \langle B (\Delta y^{k+1} -\Delta y^k), A \Delta x^{k+1}\rangle \\&\quad - \langle P (\Delta x^{k+1} - \Delta x^k), \Delta x^{k+1}\rangle - \langle Q(\Delta y^{k+1} - \Delta y^k), \Delta y^{k+1}\rangle . \end{aligned}$$

Note that

$$\begin{aligned} A \Delta x^{k+1} + B \Delta y^{k+1} = \frac{1}{\rho } (\Delta \lambda ^{k+1} -\Delta \lambda ^k). \end{aligned}$$

We therefore have

$$\begin{aligned} 0&\le - \frac{1}{\rho } \langle \Delta \lambda ^{k+1}, \Delta \lambda ^{k+1} -\Delta \lambda ^k\rangle - \rho \langle B (\Delta y^{k+1} -\Delta y^k), B \Delta y^{k+1}\rangle \\&\quad \, + \langle B(\Delta y^{k+1} - \Delta y^k), \Delta \lambda ^{k+1} - \Delta \lambda ^k\rangle \\&\quad \, - \langle P (\Delta x^{k+1} - \Delta x^k), \Delta x^{k+1}\rangle - \langle Q(\Delta y^{k+1} - \Delta y^k), \Delta y^{k+1}\rangle . \end{aligned}$$

By the polarization identity we then have

$$\begin{aligned} 0&\le \frac{1}{2\rho } \left( \Vert \Delta \lambda ^k\Vert ^2 - \Vert \Delta \lambda ^{k+1}\Vert ^2 - \Vert \Delta \lambda ^k -\Delta \lambda ^{k+1}\Vert ^2 \right) \\&\quad \, + \frac{\rho }{2} \left( \Vert B \Delta y^k\Vert ^2 - \Vert B \Delta y^{k+1}\Vert ^2 - \Vert B(\Delta y^k-\Delta y^{k+1})\Vert ^2 \right) \\&\quad \, + \frac{1}{2} \left( \Vert \Delta x^k \Vert _P^2 - \Vert \Delta x^{k+1}\Vert _P^2 - \Vert \Delta x^k-\Delta x^{k+1}\Vert _P^2\right) \\&\quad \, + \frac{1}{2} \left( \Vert \Delta y^k\Vert _Q^2 - \Vert \Delta y^{k+1}\Vert _Q^2 - \Vert \Delta y^k-\Delta y^{k+1}\Vert _Q^2 \right) \\&\quad \, + \langle B(\Delta y^{k+1} - \Delta y^k), \Delta \lambda ^{k+1} - \Delta \lambda ^k\rangle . \end{aligned}$$

With the help of the definition of G, we obtain

$$\begin{aligned} 0&\le \Vert \Delta u^k\Vert _G^2 - \Vert \Delta u^{k+1}\Vert _G^2 - \Vert \Delta x^k-\Delta x^{k+1}\Vert _P^2 - \Vert \Delta y^k-\Delta y^{k+1}\Vert _Q^2 \\&\quad \, - \frac{\rho }{2} \left\| B(\Delta y^{k+1} -\Delta y^k) - \frac{1}{\rho } (\Delta \lambda ^{k+1} -\Delta \lambda ^k) \right\| ^2 \end{aligned}$$

which completes the proof. \(\square \)

Lemma 2.4

Let Assumptions 1 and 2 hold and let \(\bar{u}:=(\bar{x}, \bar{y}, \bar{\lambda })\) be any KKT point of (1.1). For the proximal ADMM (1.3) there holds

$$\begin{aligned}&\Vert \Delta u^{k+1}\Vert _G^2 \le \left( \Vert u^k-\bar{u}\Vert _G^2 + \Vert \Delta y^k\Vert _Q^2\right) -\left( \Vert u^{k+1}-\bar{u}\Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2\right) \end{aligned}$$

for all \(k \ge 1\).

Proof

We will use (2.3) together with \(-A^* \bar{\lambda }\in \partial f(\bar{x})\) and \(-B^* \bar{\lambda }\in \partial g(\bar{y})\). By using the monotonicity of \(\partial f\) and \(\partial g\) we have

$$\begin{aligned} 0&\le \left\langle -A^* (\lambda ^{k+1} -\bar{\lambda }- \rho B \Delta y^{k+1}) - P \Delta x^{k+1}, x^{k+1}-\bar{x}\right\rangle \\&\quad \, + \left\langle - B^* (\lambda ^{k+1} - \bar{\lambda }) -Q \Delta y^{k+1}, y^{k+1} - \bar{y}\right\rangle \\&= \langle \bar{\lambda }-\lambda ^{k+1}, A x^{k+1} + B y^{k+1} -c \rangle + \rho \langle B \Delta y^{k+1}, A (x^{k+1}-\bar{x})\rangle \\&\quad \, - \langle P \Delta x^{k+1}, x^{k+1}-\bar{x}\rangle - \langle Q \Delta y^{k+1}, y^{k+1} - \bar{y}\rangle . \end{aligned}$$

By virtue of \(\rho (A x^{k+1} + B y^{k+1} - c) = \Delta \lambda ^{k+1}\) we further have

$$\begin{aligned} 0&\le \frac{1}{\rho } \langle \bar{\lambda }-\lambda ^{k+1}, \Delta \lambda ^{k+1} \rangle - \rho \langle B \Delta y^{k+1}, B(y^{k+1} - \bar{y}) \rangle + \langle B \Delta y^{k+1}, \Delta \lambda ^{k+1}\rangle \\&\quad \, - \langle P \Delta x^{k+1}, x^{k+1}-\bar{x}\rangle - \langle Q \Delta y^{k+1}, y^{k+1} - \bar{y}\rangle . \end{aligned}$$

By using the second equation in (2.3) and the monotonicity of \(\partial g\) we have

$$\begin{aligned} 0&\le \left\langle -B^* \Delta \lambda ^{k+1} - Q \Delta y^{k+1} + Q \Delta y^k, \Delta y^{k+1}\right\rangle \\&= - \langle \Delta \lambda ^{k+1}, B \Delta y^{k+1}\rangle - \langle Q(\Delta y^{k+1} -\Delta y^k), \Delta y^{k+1}\rangle \end{aligned}$$

which shows that

$$\begin{aligned} \langle \Delta \lambda ^{k+1}, B \Delta y^{k+1}\rangle \le - \langle Q(\Delta y^{k+1} -\Delta y^k), \Delta y^{k+1}\rangle . \end{aligned}$$

Therefore

$$\begin{aligned} 0&\le \frac{1}{\rho } \langle \bar{\lambda }-\lambda ^{k+1}, \Delta \lambda ^{k+1} \rangle - \langle {\widehat{Q}} \Delta y^{k+1}, y^{k+1} - \bar{y}\rangle - \langle P \Delta x^{k+1}, x^{k+1}-\bar{x}\rangle \\&\quad - \langle Q(\Delta y^{k+1} -\Delta y^k), \Delta y^{k+1}\rangle . \end{aligned}$$

By using the polarization identity we then obtain

$$\begin{aligned} 0&\le \frac{1}{2\rho } \left( \Vert \lambda ^k-\bar{\lambda }\Vert ^2 -\Vert \lambda ^{k+1} -\bar{\lambda }\Vert ^2 -\Vert \Delta \lambda ^{k+1}\Vert ^2 \right) \\&\quad + \frac{1}{2} \left( \Vert y^k-\bar{y}\Vert _{{\widehat{Q}}}^2 - \Vert y^{k+1}-\bar{y}\Vert _{{\widehat{Q}}}^2 -\Vert \Delta y^{k+1}\Vert _{{\widehat{Q}}}^2\right) \\&\quad + \frac{1}{2} \left( \Vert x^k-\bar{x}\Vert _P^2 - \Vert x^{k+1}-\bar{x}\Vert _P^2 - \Vert \Delta x^{k+1}\Vert _P^2 \right) \\&\quad + \frac{1}{2} \left( \Vert \Delta y^k\Vert _Q^2 - \Vert \Delta y^{k+1}\Vert _Q^2 - \Vert \Delta y^{k+1} -\Delta y^k\Vert _Q^2 \right) . \end{aligned}$$

Recalling the definition of G we then complete the proof. \(\square \)

Proposition 2.5

Let Assumptions 1 and 2 hold. Then for the proximal ADMM (1.3) there holds \(\Vert \Delta u^k\Vert _G^2 = o(1/k)\) as \(k\rightarrow \infty \).

Proof

Let \(\bar{u}\) be a KKT point of (1.1). From Lemma 2.4 it follows that

$$\begin{aligned} \sum _{j=1}^k \Vert \Delta u^{j+1}\Vert _G^2&\le \sum _{j=1}^k \left( \left( \Vert u^j-\bar{u}\Vert _G^2 + \Vert \Delta y^j\Vert _Q^2\right) - \left( \Vert u^{j+1}-\bar{u}\Vert _G^2+\Vert \Delta y^{j+1}\Vert _Q^2\right) \right) \nonumber \\&\le \Vert u^1-\bar{u}\Vert _G^2 +\Vert \Delta y^1 \Vert _Q^2 \end{aligned}$$
(2.6)

for all \(k \ge 1\). By Lemma 2.3, \(\{\Vert \Delta u^{j+1}\Vert _G^2\}\) is monotonically decreasing. Thus

$$\begin{aligned} \left( \frac{k}{2} +1\right) \Vert \Delta u^{k+1}\Vert _G^2 \le \sum _{j=[k/2]}^k \Vert \Delta u^{j+1}\Vert _G^2, \end{aligned}$$
(2.7)

where [k/2] denotes the largest integer \(\le k/2\). Since (2.6) shows that

$$\begin{aligned} \sum _{j=1}^\infty \Vert \Delta u^{j+1}\Vert _G^2 <\infty , \end{aligned}$$

the right hand side of (2.7) must converge to 0 as \(k\rightarrow \infty \). Thus \((k+1) \Vert \Delta u^{k+1}\Vert _G^2 =o(1)\) and hence \(\Vert \Delta u^{k}\Vert _G^2 =o(1/k)\) as \(k\rightarrow \infty \). \(\square \)

As a byproduct of Proposition 2.5 and Corollary 2.2, we can prove the following non-ergodic convergence rate result for the proximal ADMM (1.3) in terms of the objective error and the constraint error.

Theorem 2.6

Let Assumptions 1 and 2 hold. Consider the proximal ADMM (1.3) for solving (1.1). Then

$$\begin{aligned} |H(x^k, y^k)-H_*| = o\left( \frac{1}{\sqrt{k}}\right) \quad \text{ and } \quad \Vert A x^k+B y^k -c\Vert = o \left( \frac{1}{\sqrt{k}}\right) \end{aligned}$$
(2.8)

as \(k \rightarrow \infty \).

Proof

Since

$$\begin{aligned} \rho (A x^{k} + B y^{k}-c) = \Delta \lambda ^{k} \quad \text{ and } \quad \Vert \Delta \lambda ^{k}\Vert ^2 \le \rho \Vert \Delta u^{k}\Vert _G^2 \end{aligned}$$
(2.9)

we may use Proposition 2.5 to obtain the estimate \(\Vert A x^{k}+B y^{k} -c\Vert = o (1/\sqrt{k})\) as \(k\rightarrow \infty \).

In the following we will focus on deriving the estimate of \(|H(x^k, y^k) - H_*|\). Let \(\bar{u}:= (\bar{x}, \bar{y}, \bar{\lambda })\) be a KKT point of (1.1). By using (2.4) we have

$$\begin{aligned} H(x^{k}, y^{k}) - H_*&\le - \langle \bar{\lambda }, A x^{k}+By^{k}-c\rangle + \frac{1}{2} \left( \Vert u^{k-1}-\bar{u}\Vert _G^2 - \Vert u^{k}-\bar{u}\Vert _G^2\right) \nonumber \\&= -\frac{1}{\rho } \langle \bar{\lambda }, \Delta \lambda ^{k}\rangle -\langle u^{k-1}-\bar{u}, G \Delta u^{k}\rangle - \frac{1}{2} \Vert \Delta u^{k}\Vert _G^2 \nonumber \\&\le \frac{\Vert \bar{\lambda }\Vert }{\rho } \Vert \Delta \lambda ^{k}\Vert + \Vert u^{k-1}-\bar{u}\Vert _G \Vert \Delta u^{k}\Vert _G. \end{aligned}$$
(2.10)

By virtue of the monotonicity of \(\{\Vert u^k-\bar{u}\Vert _G^2\}\) given in Corollary 2.2 we then obtain

$$\begin{aligned} H(x^{k}, y^{k}) - H_*&\le \frac{\Vert \bar{\lambda }\Vert }{\rho } \Vert \Delta \lambda ^{k}\Vert + \Vert u^0-\bar{u}\Vert _G \Vert \Delta u^{k}\Vert _G \nonumber \\&\le \left( \Vert u^0-\bar{u}\Vert _G + \frac{\Vert \bar{\lambda }\Vert }{\sqrt{\rho }}\right) \Vert \Delta u^{k}\Vert _G. \end{aligned}$$

On the other hand, by using (2.1) we have

$$\begin{aligned} H(x^{k}, y^{k}) - H_*&\ge -\langle \bar{\lambda }, A x^{k} + B y^{k} -c\rangle = -\frac{1}{\rho } \langle \bar{\lambda }, \Delta \lambda ^{k}\rangle \\&\ge -\frac{\Vert \bar{\lambda }\Vert }{\rho } \Vert \Delta \lambda ^{k}\Vert \ge - \frac{\Vert \bar{\lambda }\Vert }{\sqrt{\rho }} \Vert \Delta u^{k}\Vert _G. \end{aligned}$$

Therefore

$$\begin{aligned} \left| H(x^{k}, y^{k}) - H_*\right| \le \left( \Vert u^0-\bar{u}\Vert _G + \frac{\Vert \bar{\lambda }\Vert }{\sqrt{\rho }}\right) \Vert \Delta u^{k}\Vert _G. \end{aligned}$$
(2.11)

Now we can use Proposition 2.5 to conclude the proof. \(\square \)

Remark 2.1

By exploiting the connection between the Douglas-Rachford splitting algorithm and the classical ADMM (1.2), the non-ergodic convergence rate (2.8) has been established in [8] under the conditions that

$$\begin{aligned} \text{ zero }(\partial d_f + \partial d_g) \ne \emptyset \end{aligned}$$
(2.12)

and

$$\begin{aligned} \partial d_f = A\circ \partial f^* \circ A^*, \qquad \partial d_g = B\circ \partial g^* \circ B^* - c, \end{aligned}$$
(2.13)

where \(d_f (\lambda ):= f^*(A^* \lambda )\) and \(d_g(\lambda ):= g^*(B^*\lambda )-\langle \lambda , c\rangle \) with \(f^*\) and \(g^*\) denoting the convex conjugates of f and g respectively. The conditions (2.12) and (2.13) seem strong and unnatural because they are imposed on the convex conjugates \(f^*\) and \(g^*\) instead of f and g themselves. In Theorem 2.6 we establish the non-ergodic convergence rate (2.8) for the proximal ADMM (1.3) with any positive semi-definite P and Q without requiring the conditions (2.12) and (2.13); therefore our result extends and improves the one in [8].

Next we will consider establishing faster convergence rates under suitable regularity conditions. As a basis, we first prove the following result, which shows that every weak cluster point of \(\{u^k\}\) is a KKT point of (1.1). This result can be easily established for ADMM in finite-dimensional spaces; however, it is nontrivial for the proximal ADMM (1.3) in infinite-dimensional Hilbert spaces due to the required treatment of weak convergence. Proposition 2.1 plays a crucial role in our proof.

Theorem 2.7

Let Assumptions 1 and 2 hold. Consider the sequence \(\{u^k:=(x^k, y^k, \lambda ^k)\}\) generated by the proximal ADMM (1.3). Assume \(\{u^k\}\) is bounded and let \(u^\dag :=(x^\dag , y^\dag , \lambda ^\dag )\) be a weak cluster point of \(\{u^k\}\). Then \(u^\dag \) is a KKT point of (1.1). Moreover, for any weak cluster point \(u^*\) of \(\{u^k\}\) there holds \( \Vert u^*-u^\dag \Vert _G =0.\)

Proof

We first show that \(u^\dag \) is a KKT point of (1.1). According to Proposition 2.5 we have \(\Vert \Delta u^k\Vert _G^2 \rightarrow 0\) which means

$$\begin{aligned} \Delta \lambda ^k \rightarrow 0, \quad P \Delta x^k \rightarrow 0, \quad B \Delta y^k \rightarrow 0, \quad Q\Delta y^k\rightarrow 0 \end{aligned}$$
(2.14)

as \(k\rightarrow \infty \). According to Theorem 2.6 we also have

$$\begin{aligned} A x^k + B y^k -c \rightarrow 0 \quad \text{ and } \quad H(x^k, y^k) \rightarrow H_* \quad \text{ as } k \rightarrow \infty . \end{aligned}$$
(2.15)

Since \(u^\dag \) is a weak cluster point of the sequence \(\{u^k\}\), there exists a subsequence \(\{u^{k_j}:= (x^{k_j}, y^{k_j},\lambda ^{k_j})\}\) of \(\{u^k\}\) such that \(u^{k_j} \rightharpoonup u^\dag \) as \(j \rightarrow \infty \). By using the first equation in (2.15) we immediately obtain

$$\begin{aligned} A x^\dag + B y^\dag -c =0. \end{aligned}$$
(2.16)

By using Proposition 2.1 with \(k = k_j -1\) we have for any \(u:= (x, y, \lambda ) \in \mathcal X\times \mathcal Y\times \mathcal Z\) that

$$\begin{aligned} 0&\le H(x, y) - H(x^{k_j}, y^{k_j}) + \langle \lambda ^{k_j} - \rho B \Delta y^{k_j}, A x + B y - c\rangle \nonumber \\&\quad - \langle \lambda , A x^{k_j} + B y^{k_j} -c\rangle + \frac{1}{2} \left( \Vert u^{k_j-1} - u\Vert _G^2 - \Vert u^{k_j} - u\Vert _G^2\right) . \end{aligned}$$
(2.17)

According to Corollary 2.2, \(\{\Vert u^k\Vert _G\}\) is bounded. Thus we may use Proposition 2.5 to conclude

$$\begin{aligned} \left| \Vert u^{k_j-1} - u\Vert _G^2 - \Vert u^{k_j} - u\Vert _G^2\right| \le \left( \Vert u^{k_j-1} - u\Vert _G + \Vert u^{k_j} - u\Vert _G\right) \Vert \Delta u^{k_j}\Vert _G \rightarrow 0 \end{aligned}$$

as \(j \rightarrow \infty \). Therefore, by taking \(j \rightarrow \infty \) in (2.17) and using (2.14), (2.15) and \(\lambda ^{k_j} \rightharpoonup \lambda ^\dag \) we can obtain

$$\begin{aligned} 0 \le H(x, y) - H_* + \langle \lambda ^\dag , A x + B y -c\rangle \end{aligned}$$
(2.18)

for all \((x, y) \in \mathcal X\times \mathcal Y\). Since f and g are convex and lower semi-continuous, they are also weakly lower semi-continuous (see [11, Chapter 1, Corollary 2.2]). Thus, by using \(x^{k_j} \rightharpoonup x^\dag \) and \(y^{k_j} \rightharpoonup y^\dag \) we obtain

$$\begin{aligned} H(x^\dag , y^\dag )&= f(x^\dag ) + g(y^\dag ) \le \liminf _{j\rightarrow \infty } f(x^{k_j}) + \liminf _{j\rightarrow \infty } g(y^{k_j}) \\&\le \liminf _{j\rightarrow \infty } \left( f(x^{k_j}) + g(y^{k_j}) \right) \\&= \liminf _{j \rightarrow \infty } H(x^{k_j}, y^{k_j}) = H_*. \end{aligned}$$

Since \((x^\dag , y^\dag )\) satisfies (2.16), we also have \(H(x^\dag , y^\dag ) \ge H_*\). Therefore \(H(x^\dag , y^\dag ) = H_*\) and then it follows from (2.18) and (2.16) that

$$\begin{aligned} 0 \le H(x, y) - H(x^\dag , y^\dag ) + \langle \lambda ^\dag , A(x - x^\dag ) + B (y - y^\dag )\rangle \end{aligned}$$

for all \((x, y) \in \mathcal X\times \mathcal Y\). Using the definition of H we can immediately see that \(-A^* \lambda ^\dag \in \partial f(x^\dag )\) and \(-B^* \lambda ^\dag \in \partial g(y^\dag )\). Therefore \(u^\dag \) is a KKT point of (1.1).

Let \(u^*\) be another weak cluster point of \(\{u^k\}\). Then there exists a subsequence \(\{u^{l_j}\}\) of \(\{u^k\}\) such that \(u^{l_j} \rightharpoonup u^*\) as \(j \rightarrow \infty \). Note the identity

$$\begin{aligned} 2\langle u^k, G(u^* - u^\dag )\rangle = \Vert u^k - u^\dag \Vert _G^2 - \Vert u^k - u^*\Vert _G^2 - \Vert u^\dag \Vert _G^2 + \Vert u^*\Vert _G^2. \end{aligned}$$
(2.19)

Since both \(u^*\) and \(u^\dag \) are KKT points of (1.1) as shown above, it follows from Corollary 2.2 that both \(\{\Vert u^k - u^\dag \Vert _G^2\}\) and \(\{\Vert u^k - u^*\Vert _G^2\}\) are monotonically decreasing and thus converge as \(k \rightarrow \infty \). By taking \(k = k_j\) and \(k=l_j\) in (2.19) respectively and letting \(j \rightarrow \infty \) we can see that, in both cases, the right hand side tends to the same limit. Therefore

$$\begin{aligned} \langle u^*, G(u^*-u^\dag )\rangle&= \lim _{j\rightarrow \infty } \langle u^{l_j}, G(u^*-u^\dag )\rangle \\&= \lim _{j\rightarrow \infty } \langle u^{k_j}, G(u^*-u^\dag )\rangle \\&= \langle u^\dag , G(u^*-u^\dag )\rangle \end{aligned}$$

which implies \(\Vert u^*-u^\dag \Vert _G^2 =0\). \(\square \)

Remark 2.2

Theorem 2.7 requires \(\{u^k\}\) to be bounded. According to Corollary 2.2, \(\{\Vert u^k\Vert _G^2\}\) is bounded, which implies the boundedness of \(\{\lambda ^k\}\). In the following we will provide sufficient conditions to guarantee the boundedness of \(\{(x^k, y^k)\}\).

  1. (i)

    From (2.5) it follows that \(\{\sigma _f \Vert x^k\Vert ^2 + \sigma _g \Vert y^k\Vert ^2 + \Vert u^k\Vert _G^2\}\) is bounded. By the definition of G, this in particular implies the boundedness of \(\{\lambda ^k\}\) and \(\{B y^k\}\). Consequently, it follows from \(\Delta \lambda ^k = \rho (A x^k + B y^k - c)\) that \(\{A x^k\}\) is bounded. Putting the above together we can conclude that both \(\{(\sigma _f I + P + A^*A) x^k\}\) and \(\{(\sigma _g I + Q + B^* B) y^k\}\) are bounded. Therefore, if both the bounded linear self-adjoint operators

    $$\begin{aligned} \sigma _f I + P + A^*A \quad \text{ and } \quad \sigma _g I + Q + B^*B \end{aligned}$$

    are coercive, we can conclude the boundedness of \(\{x^k\}\) and \(\{y^k\}\). Here a linear operator \(L: \mathcal V\rightarrow \mathcal H\) between two Hilbert spaces \(\mathcal V\) and \(\mathcal H\) is called coercive if \(\Vert Lv\Vert \rightarrow \infty \) whenever \(\Vert v\Vert \rightarrow \infty \). It is easy to see that L is coercive if and only if there is a constant \(c>0\) such that \(c\Vert v\Vert \le \Vert Lv\Vert \) for all \(v\in \mathcal V\); a finite-dimensional check of this criterion is sketched after this remark.

  2. (ii)

    If there exist \(\beta >H_*\) and \(\sigma >0\) such that the set

    $$\begin{aligned} \{(x, y) \in \mathcal X\times \mathcal Y: H(x, y)\le \beta \text{ and } \Vert A x + B y -c\Vert \le \sigma \} \end{aligned}$$

    is bounded, then \(\{(x^k,y^k)\}\) is bounded. In fact, since \(H(x^k, y^k)\rightarrow H_*\) and \(A x^k + B y^k - c\rightarrow 0\) as shown in Theorem 2.6, the sequence \(\{(x^k, y^k)\}\) is contained in the above set except for finitely many terms. Thus \(\{(x^k, y^k)\}\) is bounded.
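As a quick finite-dimensional check of the coercivity criterion in item (i), note that there a bounded positive semi-definite self-adjoint operator is coercive precisely when its smallest eigenvalue is positive; the matrices and sizes below are illustrative.

```python
# Finite-dimensional check of the coercivity criterion from Remark 2.2(i):
# a symmetric positive semi-definite matrix T is coercive iff its smallest
# eigenvalue is positive. All matrices and sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 20
A = rng.standard_normal((m, n))
P = np.zeros((n, n))        # no proximal term on the x-subproblem
sigma_f = 0.0               # f not strongly convex

T = sigma_f * np.eye(n) + P + A.T @ A
print("coercive:", np.linalg.eigvalsh(T).min() > 1e-12)  # False: rank(A^T A) <= m < n
```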

Remark 2.3

It is interesting to investigate under what conditions \(\{u^k\}\) has a unique weak cluster point. According to Theorem 2.7, for any two weak cluster points \(u^*:= (x^*, y^*, \lambda ^*)\) and \(u^\dag := (x^\dag , y^\dag , \lambda ^\dag )\) of \(\{u^k\}\) there hold

$$\begin{aligned}&\Vert u^* - u^\dag \Vert _G^2 =0, \quad Ax^* + By^* = c, \quad -A^* \lambda ^* \in \partial f(x^*),\quad -B^* \lambda ^* \in \partial g(y^*), \\&A x^\dag + B y^\dag = c, \quad -A^* \lambda ^\dag \in \partial f(x^\dag ), \quad - B^* \lambda ^\dag \in \partial g(y^\dag ). \end{aligned}$$

By using the definition of G and the monotonicity of \(\partial f\) and \(\partial g\) we can deduce that

$$\begin{aligned}&\lambda ^* = \lambda ^\dag , \quad P(x^*-x^\dag ) =0, \quad Q(y^*-y^\dag ) = 0, \quad B(y^* - y^\dag ) = 0, \\&A(x^* - x^\dag ) = 0, \quad \sigma _f \Vert x^* - x^\dag \Vert ^2 =0, \quad \sigma _g \Vert y^* - y^\dag \Vert ^2=0. \end{aligned}$$

Consequently

$$\begin{aligned} (\sigma _f I + P + A^*A) (x^*-x^\dag ) =0 \quad \text{ and } \quad (\sigma _g I + Q + B^* B) (y^* - y^\dag ) = 0. \end{aligned}$$

Therefore, if both \(\sigma _f I + P + A^* A\) and \(\sigma _g I + Q + B^* B\) are injective, then \(x^* = x^\dag \) and \(y^* = y^\dag \) and hence \(\{u^k\}\) has a unique weak cluster point, say \(u^\dag \); consequently \(u^k \rightharpoonup u^\dag \) as \(k \rightarrow \infty \).

Remark 2.4

In [31] the proximal ADMM (with relaxation) has been considered under the condition that

$$\begin{aligned} P + \rho A^*A + \partial f \text{ and } Q + \rho B^*B + \partial g \text{ are } \text{ strongly } \text{ maximal } \text{ monotone }. \end{aligned}$$
(2.20)

This requires that both \((P + \rho A^*A + \partial f)^{-1}\) and \((Q + \rho B^*B + \partial g)^{-1}\) exist as single-valued mappings and are Lipschitz continuous. It has been shown that the iterative sequence converges weakly to a KKT point, which is its unique weak cluster point. The argument in [31] used the facts that the KKT mapping F(u), defined in (2.21) below, is maximal monotone and that maximal monotone operators are closed in the weak-strong topology [2, 3]. Our argument, which is essentially based on Proposition 2.1, is elementary and does not rely on any machinery from maximal monotone operator theory.

Based on Theorem 2.7, we now turn to deriving convergence rates of the proximal ADMM (1.3) under certain regularity conditions. To this end, we introduce the multifunction \(F: \mathcal X\times \mathcal Y\times \mathcal Z\rightrightarrows \mathcal X\times \mathcal Y\times \mathcal Z\) defined by

$$\begin{aligned} F(u):= \left( \begin{array}{ccc} \partial f(x) + A^* \lambda \\ \partial g(y) + B^* \lambda \\ A x + By -c \end{array}\right) , \quad \forall u = (x, y, \lambda )\in \mathcal X\times \mathcal Y\times \mathcal Z. \end{aligned}$$
(2.21)

Then \(\bar{u}\) is a KKT point of (1.1) if and only if \(0 \in F(\bar{u})\) or, equivalently, \(\bar{u}\in F^{-1}(0)\), where \(F^{-1}\) denotes the inverse multifunction of F. We will achieve our goal under certain bounded (Hölder) metric subregularity conditions on F. We need the following calculus lemma.

Lemma 2.8

Let \(\{\Delta _k\}\) be a sequence of nonnegative numbers satisfying

$$\begin{aligned} \Delta _k^\theta \le C (\Delta _{k-1}-\Delta _k) \end{aligned}$$
(2.22)

for all \(k\ge 1\), where \(C>0\) and \(\theta >1\) are constants. Then there is a constant \(\tilde{C}>0\) such that

$$\begin{aligned} \Delta _k \le \tilde{C} (1+ k)^{-\frac{1}{\theta -1}} \end{aligned}$$

for all \(k\ge 0\).

Proof

Please refer to the proof of [1, Theorem 2]. \(\square \)
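Heuristically, the rate asserted in Lemma 2.8 is consistent with (2.22): for \(\Delta _k = (1+k)^{-\frac{1}{\theta -1}}\) one has

$$\begin{aligned} \Delta _{k-1} - \Delta _k \approx \frac{1}{\theta -1} (1+k)^{-\frac{\theta }{\theta -1}} = \frac{1}{\theta -1}\, \Delta _k^{\theta }, \end{aligned}$$

so the decrement required in (2.22) is matched up to a constant factor.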

Theorem 2.9

Let Assumptions 1 and 2 hold. Consider the sequence \(\{u^k:=(x^k, y^k, \lambda ^k)\}\) generated by the proximal ADMM (1.3). Assume \(\{u^k\}\) is bounded and let \(u^\dag :=(x^\dag , y^\dag , \lambda ^\dag )\) be a weak cluster point of \(\{u^k\}\). Let R be a number such that \(\Vert u^k - u^\dag \Vert \le R\) for all k and assume that there exist \(\kappa >0\) and \(\alpha \in (0, 1]\) such that

$$\begin{aligned} d(u, F^{-1}(0)) \le \kappa [d(0, F(u))]^\alpha , \quad \forall u \in B_R(u^\dag ). \end{aligned}$$
(2.23)
  1. (i)

    If \(\alpha = 1\), then there exists a constant \(0<q<1\) such that

    $$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2 \le q^2 \left( \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2\right) \end{aligned}$$
    (2.24)

    for all \(k \ge 0\) and consequently there exist \(C>0\) and \(0<q<1\) such that

    $$\begin{aligned} \begin{aligned} \Vert u^k - u^\dag \Vert _G, \, \Vert \Delta u^k\Vert _G&\le C q^k, \\ \Vert A x^k + B y^k - c\Vert&\le C q^k, \\ |H(x^k, y^k) - H_*|&\le C q^k \end{aligned} \end{aligned}$$
    (2.25)

    for all \(k \ge 0\).

  2. (ii)

    If \(\alpha \in (0, 1)\) then there is a constant C such that

    $$\begin{aligned} \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 \le C (k+1)^{-\frac{\alpha }{1-\alpha }} \end{aligned}$$
    (2.26)

    and consequently

    $$\begin{aligned} \begin{aligned} \Vert \Delta u^k \Vert _G&\le C (k+1)^{-\frac{1}{2(1-\alpha )}}, \\ \Vert A x^k + B y^k - c\Vert&\le C (k+1)^{-\frac{1}{2(1-\alpha )}}, \\ |H(x^k, y^k) - H_*|&\le C (k+1)^{- \frac{1}{2(1-\alpha )}} \end{aligned} \end{aligned}$$
    (2.27)

    for all \(k \ge 0\).

Proof

According to Theorem 2.7, \(u^\dag \) is a KKT point of (1.1). Therefore we may use Lemma 2.4 with \(\bar{u} = u^\dag \) to obtain

$$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2&\le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - \Vert \Delta u^{k+1}\Vert _G^2 \nonumber \\&= \Vert u^k- u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - \eta \Vert \Delta u^{k+1}\Vert _G^2 \nonumber \\&\quad \, - (1-\eta ) \Vert \Delta u^{k+1}\Vert _G^2, \end{aligned}$$
(2.28)

where \(\eta \in (0, 1)\) is any number. According to (2.3),

$$\begin{aligned} \begin{pmatrix} \rho A^* B \Delta y^{k+1} - P \Delta x^{k+1}\\ - Q \Delta y^{k+1} \\ A x^{k+1} + B y^{k+1} - c \end{pmatrix} \in F(u^{k+1}). \end{aligned}$$

Thus, by using \(\Delta \lambda ^{k+1} = \rho (A x^{k+1} + B y^{k+1}-c)\) we can obtain

$$\begin{aligned} d^2(0, F(u^{k+1}))&\le \Vert \rho A^* B \Delta y^{k+1} - P \Delta x^{k+1}\Vert ^2 + \Vert -Q \Delta y^{k+1}\Vert ^2 \nonumber \\&\quad \, + \Vert A x^{k+1} + B y^{k+1}-c\Vert ^2 \nonumber \\&\le 2 \Vert P \Delta x^{k+1}\Vert ^2 + 2 \rho ^2 \Vert A\Vert ^2 \Vert B \Delta y^{k+1}\Vert ^2 \nonumber \\&\quad \, + \Vert Q \Delta y^{k+1}\Vert ^2 + \frac{1}{\rho ^2} \Vert \Delta \lambda ^{k+1}\Vert ^2 \nonumber \\&\le \gamma \Vert \Delta u^{k+1}\Vert _G^2, \end{aligned}$$
(2.29)

where

$$\begin{aligned} \gamma := \max \left\{ 2 \Vert P\Vert , 2 \rho \Vert A\Vert ^2, \Vert Q\Vert , \frac{1}{\rho }\right\} . \end{aligned}$$

Combining this with (2.28) gives

$$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1} \Vert _Q^2&\le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - \eta \Vert \Delta u^{k+1}\Vert _G^2 \\&\quad \, - \frac{1-\eta }{\gamma } d^2(0, F(u^{k+1})). \end{aligned}$$

Since \(\Vert u^k - u^\dag \Vert \le R\) for all k and F satisfies (2.23), one can see that

$$\begin{aligned} d(u^{k+1}, F^{-1}(0)) \le \kappa [d(0, F(u^{k+1}))]^\alpha , \quad \forall k\ge 0. \end{aligned}$$

Consequently

$$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2&\le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - \eta \Vert \Delta u^{k+1}\Vert _G^2 \\&\quad \, - \frac{1-\eta }{\gamma \kappa ^{2/\alpha }} [d(u^{k+1}, F^{-1}(0))]^{2/\alpha }. \end{aligned}$$

For any \(u =(x, y, \lambda ) \in \mathcal X\times \mathcal Y\times \mathcal Z\) let

$$\begin{aligned} d_G(u, F^{-1}(0)):= \inf _{\bar{u}\in F^{-1}(0)} \Vert u-\bar{u}\Vert _G \end{aligned}$$

which measures the “distance” from u to \(F^{-1}(0)\) under the semi-norm \(\Vert \cdot \Vert _G\). It is easy to see that

$$\begin{aligned} d_G^2(u, F^{-1}(0)) \le \Vert G\Vert d^2(u, F^{-1}(0)), \end{aligned}$$

where \(\Vert G\Vert \) denotes the norm of the operator G. Then we have

$$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2&\le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - \eta \Vert \Delta u^{k+1}\Vert _G^2 \\&\quad \, - \frac{1-\eta }{\gamma (\kappa ^2 \Vert G\Vert )^{1/\alpha }} [d_G(u^{k+1}, F^{-1}(0))]^{2/\alpha }. \end{aligned}$$

Now let \(\bar{u}\in F^{-1}(0)\) be any point. Then

$$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G \le \Vert u^{k+1} - \bar{u}\Vert _G + \Vert u^\dag - \bar{u}\Vert _G. \end{aligned}$$

Since \(u^\dag \) is a weak cluster point of \(\{u^k\}\), there is a subsequence \(\{u^{k_j}\}\) of \(\{u^k\}\) such that \(u^{k_j} \rightharpoonup u^\dag \). Thus

$$\begin{aligned} \Vert u^\dag - \bar{u}\Vert _G^2 = \lim _{j \rightarrow \infty } \langle u^{k_j} - \bar{u}, G(u^\dag - \bar{u})\rangle \le \liminf _{j\rightarrow \infty } \Vert u^{k_j} - \bar{u}\Vert _G \Vert u^\dag - \bar{u}\Vert _G \end{aligned}$$

which implies \(\Vert u^\dag - \bar{u}\Vert _G \le \liminf _{j\rightarrow \infty } \Vert u^{k_j} - \bar{u}\Vert _G\). From Corollary 2.2 we know that \(\{\Vert u^k - \bar{u}\Vert _G^2\}\) is monotonically decreasing. Thus

$$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G \le \Vert u^{k+1} - \bar{u}\Vert _G + \liminf _{j\rightarrow \infty } \Vert u^{k_j} - \bar{u}\Vert _G \le 2 \Vert u^{k+1} - \bar{u}\Vert _G. \end{aligned}$$

Since \(\bar{u}\in F^{-1}(0)\) is arbitrary, we thus have

$$\begin{aligned} \Vert u^{k+1}- u^\dag \Vert _G \le 2 d_G(u^{k+1}, F^{-1}(0)). \end{aligned}$$

Therefore

$$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2&\le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - \eta \Vert \Delta u^{k+1}\Vert _G^2 \\&\quad \, - \frac{1-\eta }{\gamma (4\kappa ^2 \Vert G\Vert )^{1/\alpha }} \Vert u^{k+1} - u^\dag \Vert _G^{2/\alpha }. \end{aligned}$$

By using the fact \(\Vert \Delta u^k\Vert _G\rightarrow 0\) established in Proposition 2.5, we can find a constant \(C>0\) such that

$$\begin{aligned} \Vert \Delta u^{k+1}\Vert _G^2 \ge C \Vert \Delta u^{k+1}\Vert _G^{2/\alpha }. \end{aligned}$$

Note that \(\Vert \Delta u^{k+1}\Vert _G^2 \ge \Vert \Delta y^{k+1}\Vert _Q^2\). Thus

$$\begin{aligned} \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2&\le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - C \eta \Vert \Delta y^{k+1}\Vert _Q^{2/\alpha } \\&\quad \, - \frac{1-\eta }{\gamma (4\kappa ^2 \Vert G\Vert )^{1/\alpha }} \Vert u^{k+1} - u^\dag \Vert _G^{2/\alpha }. \end{aligned}$$

Choose \(\eta \) such that

$$\begin{aligned} \eta = \frac{1}{1 + C \gamma (4 \kappa ^2 \Vert G\Vert )^{1/\alpha }}. \end{aligned}$$

Then

$$\begin{aligned}&\Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2 \\&\le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - C \eta \left( \Vert \Delta y^{k+1}\Vert _Q^{2/\alpha } + \Vert u^{k+1} - u^\dag \Vert _G^{2/\alpha }\right) . \end{aligned}$$

Using the inequality \((a+b)^p \le 2^{p-1}(a^p + b^p)\) for \(a,b\ge 0\) and \(p\ge 1\), we then obtain

$$\begin{aligned}&\Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2 \nonumber \\&\le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 - 2^{1-1/\alpha } C \eta \left( \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2\right) ^{1/\alpha }. \end{aligned}$$
(2.30)

(i) If \(\alpha =1\), then we obtain the linear convergence

$$\begin{aligned} (1 + C\eta ) \left( \Vert u^{k+1} - u^\dag \Vert _G^2 + \Vert \Delta y^{k+1}\Vert _Q^2\right) \le \Vert u^k - u^\dag \Vert _G^2 + \Vert \Delta y^k\Vert _Q^2 \end{aligned}$$

which is (2.24) with \(q = (1 + C \eta )^{-1/2}\). By using Lemma 2.4 and (2.24) we immediately obtain the first estimate in (2.25). By using (2.9) and (2.11) we then obtain the last two estimates in (2.25).

(ii) If \(\alpha \in (0, 1)\), we may use (2.30) and Lemma 2.8 to obtain (2.26). To derive the first estimate in (2.27), we may use Lemma 2.4 to obtain

$$\begin{aligned} \sum _{j=l+1}^k \Vert \Delta u^j\Vert _G^2 \le \Vert u^l - u^\dag \Vert _G^2 + \Vert \Delta y^l\Vert _Q^2 \end{aligned}$$

for all integers \(1 \le l < k\). By using the monotonicity of \(\{\Vert \Delta u^j\Vert _G^2\}\) shown in Lemma 2.3 and the estimate (2.26) we have

$$\begin{aligned} (k-l) \Vert \Delta u^k\Vert _G^2 \le C (l+1)^{-\frac{\alpha }{1-\alpha }}. \end{aligned}$$

Taking \(l = [k/2]\), the largest integer \(\le k/2\), gives

$$\begin{aligned} \Vert \Delta u^k\Vert _G^2 \le C (k + 1)^{-\frac{\alpha }{1-\alpha }-1} = C (k+1)^{-\frac{1}{1-\alpha }} \end{aligned}$$

with a possibly different generic constant C. This shows the first estimate in (2.27). Based on this, we can use (2.9) and (2.11) to obtain the last two estimates in (2.27). The proof is therefore complete. \(\square \)
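Although the proof is purely analytic, the decay rate that Lemma 2.8 extracts from a recursion of the form (2.30) is easy to check numerically. The following Python sketch is only an illustration: the values of \(\alpha \), the constant c and the initial value are arbitrary choices, not quantities from the theorem. It iterates the extremal case \(e_{k+1} + c\, e_{k+1}^{1/\alpha } = e_k\) and normalizes \(e_k\) by the predicted order \((k+1)^{-\alpha /(1-\alpha )}\).

```python
# Extremal case of the recursion (2.30): e_{k+1} + c * e_{k+1}**(1/alpha) = e_k.
# Lemma 2.8 then yields e_k = O((k+1)**(-alpha/(1-alpha))) for alpha in (0, 1).
alpha, c, e = 0.5, 1.0, 1.0          # illustrative values only
rate = -alpha / (1.0 - alpha)        # predicted decay exponent (= -1 here)

for k in range(1, 10001):
    # e_{k+1} is the unique root in (0, e_k) of t + c*t**(1/alpha) = e_k;
    # the left-hand side is increasing in t, so bisection applies.
    lo, hi = 0.0, e
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mid + c * mid ** (1.0 / alpha) > e:
            hi = mid
        else:
            lo = mid
    e = lo
    if k % 2000 == 0:
        print(f"k = {k:5d}  e_k = {e:.3e}  e_k*(k+1)**{-rate:.2f} = {e * (k + 1) ** (-rate):.3f}")
```

With \(\alpha = 1/2\) the normalized quantity should stabilize near a constant, consistent with the \(O((k+1)^{-1})\) decay that (2.26) predicts in this case.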

Remark 2.5

Let us give some comments on the condition (2.23). In finite dimensional Euclidean spaces, it has been proved in [30] that for every polyhedral multifunction \(\Psi : {\mathbb R}^m \rightrightarrows {\mathbb R}^n\) there is a constant \(\kappa >0\) such that for any \(y \in {\mathbb R}^n\) there is a number \(\varepsilon >0\) such that

$$\begin{aligned} d(x, \Psi ^{-1}(y)) \le \kappa d(y, \Psi (x)), \quad \forall x \text{ satisfying } d(y, \Psi (x))<\varepsilon . \end{aligned}$$

This result in particular implies the bounded metric subregularity of \(\Psi \), i.e. for any \(r>0\) and any \(y \in {\mathbb R}^n\) there is a number \(C>0\) such that

$$\begin{aligned} d(x, \Psi ^{-1}(y)) \le C d(y, \Psi (x)), \quad \forall x \in B_r(0). \end{aligned}$$

Therefore, if \(\partial f\) and \(\partial g\) are polyhedral multifunctions, then the multifunction F defined by (2.21) is also polyhedral and thus (2.23) with \(\alpha =1\) holds. The bounded metric subregularity of polyhedral multifunctions in arbitrary Banach spaces has been established in [34].

On the other hand, if \(\mathcal X\), \(\mathcal Y\) and \(\mathcal Z\) are finite dimensional Euclidean spaces, and if f and g are semi-algebraic convex functions, then the multifunction F satisfies (2.23) for some \(\alpha \in (0, 1]\). Indeed, the semi-algebraicity of f and g implies that their subdifferentials \(\partial f\) and \(\partial g\) are semi-algebraic multifunctions with closed graph; consequently F is semi-algebraic with closed graph. According to [24, Proposition 3.1], F is bounded Hölder metrically subregular at any point \((\bar{u}, \bar{\xi })\) on its graph, i.e. for any \(r>0\) there exist \(\kappa >0\) and \(\alpha \in (0,1]\) such that

$$\begin{aligned} d(u, F^{-1}(\bar{\xi })) \le \kappa [d(\bar{\xi }, F(u))]^\alpha , \quad \forall u \in B_r(\bar{u}) \end{aligned}$$

which in particular implies (2.23).

By inspecting the proof of Theorem 2.9, it is easy to see that the same convergence rate results can be derived with the condition (2.23) replaced by the weaker condition: there exist \(\kappa >0\) and \(\alpha \in (0,1]\) such that

$$\begin{aligned} d_G(u^k, F^{-1}(0)) \le \kappa \left\| \Delta u^k\right\| _G^\alpha , \quad \forall k \ge 1. \end{aligned}$$
(2.31)

Therefore we have the following result.

Theorem 2.10

Let Assumptions 1 and 2 hold. Consider the sequence \(\{u^k:=(x^k, y^k, \lambda ^k)\}\) generated by the proximal ADMM (1.3). Assume \(\{u^k\}\) is bounded. If there exist \(\kappa >0\) and \(\alpha \in (0, 1]\) such that (2.31) holds, then, for any weak cluster point \(u^\dag \) of \(\{u^k\}\), the same convergence rate results in Theorem 2.9 hold.

Remark 2.6

Note that the condition (2.31) is based on the iterative sequence itself. It therefore becomes possible to check the condition by exploiting not only the properties of the multifunction F but also the structure of the algorithm. The condition (2.31) with \(\alpha = 1\) has been introduced in [27] as an iteration based error bound condition to study the linear convergence of the proximal ADMM (1.3) with \(Q =0\) in finite dimensions.

Remark 2.7

The condition (2.31) is strongly motivated by the proof of Theorem 2.9. We would like to provide here an alternative motivation. Consider the proximal ADMM (1.3). We can show that if \(\Vert \Delta u^k\Vert _G =0\) then \(u^k\) must be a KKT point of (1.1). Indeed, \(\Vert \Delta u^k\Vert _G^2=0\) implies \(P\Delta x^k =0\), \({{\widehat{Q}}} \Delta y^k=0\) and \(\Delta \lambda ^k=0\). Since \({{\widehat{Q}}} = Q +\rho B^* B\) with Q positive semi-definite and \(\Delta \lambda ^k = \rho (A x^k + B y^k-c)\), we also have \(B \Delta y^k=0\), \(Q\Delta y^k=0\) and \(A x^k+B y^k-c=0\). Thus, it follows from (2.3) that

$$\begin{aligned} -A^* \lambda ^k \in \partial f(x^k), \quad -B^* \lambda ^k \in \partial g(y^k), \quad A x^k + B y^k =c \end{aligned}$$

which shows that \(u^k=(x^k, y^k, \lambda ^k)\) is a KKT point, i.e., \(u^k \in F^{-1}(0)\). It is therefore natural to ask: if \(\Vert \Delta u^k\Vert _G\) is small, can we guarantee that \(d_G(u^k, F^{-1}(0))\) is small as well? This motivates us to propose a condition of the form

$$\begin{aligned} d_G(u^k, F^{-1}(0)) \le \varphi (\Vert \Delta u^k\Vert _G), \quad \forall k \ge 1 \end{aligned}$$

for some function \(\varphi : [0, \infty ) \rightarrow [0, \infty )\) with \(\varphi (0) =0\). The condition (2.31) corresponds to \(\varphi (s) = \kappa s^\alpha \) for some \(\kappa >0\) and \(\alpha \in (0, 1]\).

In finite dimensional Euclidean spaces some linear convergence results on the proximal ADMM (1.3) have been established in [9] under various scenarios involving strong convexity of f and/or g, Lipschitz continuity of \(\nabla f\) and/or \(\nabla g\), together with further conditions on A and/or B, see [9, Theorem 3.1 and Table 1]. In the following theorem we will show that (2.31) with \(\alpha = 1\) holds under any one of these scenarios and thus the linear convergence in [9, Theorem 3.1 and Theorem 3.4] can be established by using Theorem 2.10. Therefore, the linear convergence results based on the bounded metric subregularity of F or the scenarios in [9] can be treated in a unified manner.

In fact, our next theorem improves the results in [9] by establishing the linear convergence of \(\{u^k\}\) and \(\{H(x^k, y^k)\}\) and by relaxing Lipschitz continuity of the gradient(s) to local Lipschitz continuity. Furthermore, our result is established in general Hilbert spaces. To formulate the scenarios from [9] in this general setting, we need to replace the full row/column rank conditions on matrices by the coercivity of linear operators. We also need the linear operator \(M: \mathcal X\times \mathcal Y\rightarrow \mathcal Z\) defined by

$$\begin{aligned} M(x,y):=Ax+By, \quad \forall (x,y)\in \mathcal X\times \mathcal Y\end{aligned}$$

which is constructed from A and B. It is easy to see that the adjoint of M is \(M^* z = (A^* z, B^* z)\) for any \(z \in \mathcal Z\).

Theorem 2.11

Let Assumptions 1 and 2 hold. Let \(\{u^k\}\) be the sequence generated by the proximal ADMM (1.3). Then \(\{u^k\}\) is bounded and there exists a constant \(C>0\) such that

$$\begin{aligned} d_G(u^k, F^{-1}(0)) \le C \Vert \Delta u^k\Vert _G \end{aligned}$$
(2.32)

for all \(k\ge 1\), provided any one of the following conditions holds:

  1. (i)

    \(\sigma _g>0\), A and \(B^*\) are coercive, g is differentiable and its gradient is Lipschitz continuous over bounded sets;

  2. (ii)

    \(\sigma _f>0\), \(\sigma _g>0\), \(B^*\) is coercive, g is differentiable and its gradient is Lipschitz continuous over bounded sets;

  3. (iii)

\(\lambda ^0=0\), \(\sigma _f>0\), \(\sigma _g>0\), \(M^*\) restricted to \(\mathcal N(M^*)^\perp \) is coercive, both f and g are differentiable and their gradients are Lipschitz continuous over bounded sets;

  4. (iv)

\(\lambda ^0=0\), \(\sigma _g>0\), A is coercive, \(M^*\) restricted to \(\mathcal N(M^*)^\perp \) is coercive, both f and g are differentiable and their gradients are Lipschitz continuous over bounded sets;

where \(\mathcal N(M^*)\) denotes the null space of \(M^*\). Consequently, there exist \(C>0\) and \(0< q<1\) such that

$$\begin{aligned} \Vert u^k - u^\dag \Vert \le C q^k \quad \text{ and } \quad |H(x^k, y^k) - H_*| \le C q^{k} \end{aligned}$$

for all \(k \ge 0\), where \(u^\dag :=(x^\dag , y^\dag , \lambda ^\dag )\) is a KKT point of (1.1).

Proof

We consider only scenario (i) since the proofs for the other scenarios are similar. In the following we use C to denote a generic constant which may change from line to line but is independent of k.

We first show the boundedness of \(\{u^k\}\). According to Corollary 2.2, \(\{\Vert u^k\Vert _G^2\}\) is bounded, which implies the boundedness of \(\{\lambda ^k\}\). Since \(\sigma _g>0\), it follows from (2.5) that \(\{y^k\}\) is bounded. Consequently, it follows from \(\Delta \lambda ^k = \rho (A x^k + B y^k -c)\) that \(\{A x^k\}\) is bounded. Since A is coercive, \(\{x^k\}\) must be bounded.

Next we show (2.32). Let \(u^\dag :=(x^\dag , y^\dag , \lambda ^\dag )\) be a weak cluster point of \(\{u^k\}\) whose existence is guaranteed by the boundedness of \(\{u^k\}\). According to Theorem 2.7, \(u^\dag \) is a KKT point of (1.1). Let \((\xi , \eta , \tau ) \in F(u^k)\) be any element. Then

$$\begin{aligned} \xi - A^* \lambda ^k \in \partial f(x^k), \quad \eta - B^* \lambda ^k \in \partial g(y^k), \quad \tau = A x^k + B y^k - c. \end{aligned}$$

By using the monotonicity of \(\partial f\) and \(\partial g\) we have

$$\begin{aligned}&\sigma _f \Vert x^k - x^\dag \Vert ^2 + \sigma _g \Vert y^k - y^\dag \Vert ^2 \nonumber \\&\le \langle \xi - A^* \lambda ^k + A^* \lambda ^\dag , x^k - x^\dag \rangle + \langle \eta - B^* \lambda ^k + B^* \lambda ^\dag , y^k - y^\dag \rangle \nonumber \\&= \langle \xi , x^k - x^\dag \rangle + \langle \eta , y^k - y^\dag \rangle + \langle \lambda ^\dag - \lambda ^k, A(x^k-x^\dag ) + B(y^k - y^\dag )\rangle \nonumber \\&= \langle \xi , x^k - x^\dag \rangle + \langle \eta , y^k - y^\dag \rangle + \langle \lambda ^\dag - \lambda ^k, \tau \rangle . \end{aligned}$$
(2.33)

Since \(\sigma _g>0\), it follows from (2.33) and the Cauchy-Schwarz inequality that

$$\begin{aligned} \Vert y^k - y^\dag \Vert ^2 \le C \left( \Vert \eta \Vert ^2 + \Vert \xi \Vert \Vert x^k - x^\dag \Vert + \Vert \tau \Vert \Vert \lambda ^k - \lambda ^\dag \Vert \right) . \end{aligned}$$
(2.34)

Note that \(A (x^k - x^\dag ) = - B(y^k - y^\dag ) + \frac{1}{\rho } \Delta \lambda ^k\). Since A is coercive, we have

$$\begin{aligned} \Vert x^k - x^\dag \Vert ^2 \le C \Vert A(x^k - x^\dag )\Vert ^2 \le C\left( \Vert y^k - y^\dag \Vert ^2 + \Vert \Delta \lambda ^k\Vert ^2\right) . \end{aligned}$$
(2.35)

By the differentiability of g we have \(-B^*\lambda ^\dag = \nabla g(y^\dag )\) and \(-B^* \lambda ^k - Q \Delta y^k = \nabla g(y^k)\). Since \(B^*\) is coercive and \(\nabla g\) is Lipschitz continuous over bounded sets, we thus obtain

$$\begin{aligned} \Vert \lambda ^k - \lambda ^\dag \Vert ^2&\le C \Vert B^*(\lambda ^k - \lambda ^\dag )\Vert ^2 = C \Vert Q\Delta y^k + \nabla g(y^k) - \nabla g(y^\dag )\Vert ^2 \nonumber \\&\le C \left( \Vert \Delta y^k\Vert _Q^2 + \Vert y^k - y^\dag \Vert ^2\right) . \end{aligned}$$
(2.36)

Adding (2.35) and (2.36) and then using (2.34), we obtain

$$\begin{aligned} \Vert x^k - x^\dag \Vert ^2 + \Vert \lambda ^k - \lambda ^\dag \Vert ^2&\le C \left( \Vert \eta \Vert ^2 + \Vert \Delta u^k\Vert _G^2 + \Vert \xi \Vert \Vert x^k - x^\dag \Vert + \Vert \tau \Vert \Vert \lambda ^k - \lambda ^\dag \Vert \right) \end{aligned}$$

which together with the Cauchy-Schwarz inequality then implies

$$\begin{aligned} \Vert x^k - x^\dag \Vert ^2 + \Vert \lambda ^k - \lambda ^\dag \Vert ^2 \le C\left( \Vert \xi \Vert ^2 + \Vert \eta \Vert ^2 +\Vert \tau \Vert ^2 + \Vert \Delta u^k\Vert _G^2\right) . \end{aligned}$$
(2.37)

Combining (2.34) and (2.37) we can obtain

$$\begin{aligned} \Vert x^k - x^\dag \Vert ^2 + \Vert y^k - y^\dag \Vert ^2 + \Vert \lambda ^k - \lambda ^\dag \Vert ^2 \le C \left( \Vert \xi \Vert ^2 + \Vert \eta \Vert ^2 + \Vert \tau \Vert ^2 + \Vert \Delta u^k\Vert _G^2\right) . \end{aligned}$$

Since \((\xi , \eta , \tau ) \in F(u^k)\) is arbitrary, we therefore have

$$\begin{aligned} \Vert u^k - u^\dag \Vert ^2 \le C \left( [d(0, F(u^k))]^2 + \Vert \Delta u^k\Vert _G^2\right) . \end{aligned}$$

With the help of (2.29), we then obtain

$$\begin{aligned} \Vert u^k - u^\dag \Vert ^2 \le C \Vert \Delta u^k\Vert _G^2. \end{aligned}$$
(2.38)

Thus

$$\begin{aligned}{}[d_G(u^k, F^{-1}(0))]^2 \le C [d(u^k, F^{-1}(0))]^2 \le C\Vert u^k - u^\dag \Vert ^2 \le C \Vert \Delta u^k\Vert _G^2 \end{aligned}$$

which shows (2.32).

Because \(\{u^k\}\) is bounded and (2.32) holds, we may use Theorem 2.10 to conclude the existence of a constant \(q \in (0, 1)\) such that

$$\begin{aligned} \Vert \Delta u^k\Vert _G \le C q^k \quad \text{ and } \quad |H(x^k, y^k) - H_*|\le C q^{k}. \end{aligned}$$

Finally we may use (2.38) to obtain \(\Vert u^k-u^\dag \Vert \le C q^k\). \(\square \)

Remark 2.8

If \(\mathcal Z\) is finite-dimensional, the coercivity of \(M^*\) restricted to \(\mathcal N(M^*)^\perp \) required in scenarios (iii) and (iv) holds automatically. Indeed, if this coercivity failed, there would exist a sequence \(\{z^k\}\subset \mathcal N(M^*)^\perp {\setminus }\{0\}\) such that

$$\begin{aligned} \Vert z^k\Vert \ge k \Vert M^* z^k\Vert , \quad k = 1, 2, \cdots . \end{aligned}$$

By rescaling we may assume \(\Vert z^k\Vert =1\) for all k. Since \(\mathcal Z\) is finite-dimensional, by taking a subsequence if necessary, we may assume \(z^k \rightarrow z\) for some \(z \in \mathcal Z\). Clearly \(z\in \mathcal N(M^*)^\perp \) and \(\Vert z\Vert =1\). Since \(\Vert M^* z^k\Vert \le 1/k\) for all k, we have \(\Vert M^* z\Vert = \lim _{k\rightarrow \infty } \Vert M^* z^k\Vert =0\), which means \(z \in \mathcal N(M^*)\). Thus \(z \in \mathcal N(M^*) \cap \mathcal N(M^*)^\perp = \{0\}\), which is a contradiction.

3 Proximal ADMM for Linear Inverse Problems

In this section we will consider the method (1.11) as a regularization method for solving (1.9) and establish a convergence rate result under a benchmark source condition on the sought solution. Throughout this section we will make the following assumptions on the operators Q, L, A, the constraint set \(\mathcal C\) and the function f:

Assumption 3

  1. (i)

    \(A: \mathcal X\rightarrow \mathcal H\) is a bounded linear operator, \(Q: \mathcal X\rightarrow \mathcal X\) is a bounded linear positive semi-definite self-adjoint operator, and \(\mathcal C\subset \mathcal X\) is a closed convex subset.

  2. (ii)

    L is a densely defined, closed, linear operator from \(\mathcal X\) to \(\mathcal Y\) with domain \(\textrm{dom}(L)\).

  3. (iii)

    There is a constant \(c_0>0\) such that

    $$\begin{aligned} \Vert A x\Vert ^2 + \Vert L x\Vert ^2 \ge c_0 \Vert x\Vert ^2, \qquad \forall x\in \textrm{dom}(L). \end{aligned}$$
  4. (iv)

    \(f: \mathcal {Y}\rightarrow (-\infty , \infty ]\) is proper, lower semi-continuous, and strongly convex.

This assumption is standard in the literature on regularization methods and has been used in [21, 22]. Since L is densely defined and closed, its adjoint \(L^*\) is well-defined and is also closed and densely defined; moreover, \(z\in \text{ dom }(L^*)\) if and only if there exists \(w \in \mathcal X\) such that \(\langle w, x\rangle =\langle z, L x\rangle \) for all \(x\in \text{ dom }(L)\), in which case \(L^* z = w\). Under Assumption 3, it has been shown in [21, 22] that the proximal ADMM (1.11) is well-defined and that, if the exact data b is consistent in the sense that there exists \({\hat{x}}\in \mathcal X\) such that

$$\begin{aligned} {\hat{x}} \in \text{ dom }(L) \cap \mathcal C, \quad L {\hat{x}} \in \text{ dom }(f) \quad \text{ and } \quad A {\hat{x}} = b, \end{aligned}$$

then the problem (1.9) has a unique solution, denoted by \(x^\dag \). Furthermore, there holds the following monotonicity result, see [22, Lemma 2.3]; alternatively, it can also be derived from Lemma 2.3.

Lemma 3.1

Let \(\{z^k, y^k, x^k, \lambda ^k, \mu ^k, \nu ^k\}\) be defined by the proximal ADMM (1.11) with noisy data and let

$$\begin{aligned} E_k&:= \frac{1}{2\rho _1} \Vert \Delta \lambda ^k\Vert ^2 + \frac{1}{2\rho _2} \Vert \Delta \mu ^k\Vert ^2 + \frac{1}{2\rho _3} \Vert \Delta \nu ^k\Vert ^2 \nonumber \\&\quad \ + \frac{\rho _2}{2} \Vert \Delta y^k\Vert ^2 + \frac{\rho _3}{2} \Vert \Delta x^k\Vert ^2 + \frac{1}{2} \Vert \Delta z^k\Vert _Q^2. \end{aligned}$$
(3.1)

Then \(\{E_k\}\) is monotonically decreasing with respect to k.

In the following we will always assume the exact data b is consistent. We will derive a convergence rate of \(x^k\) to the unique solution \(x^\dag \) of (1.9) under the source condition

$$\begin{aligned} \exists \mu ^\dag \in \partial f(L x^\dag ) \cap \text{ dom }(L^*) \text{ and } \nu ^\dag \in \partial \iota _\mathcal C(x^\dag ) \text{ such } \text{ that } L^* \mu ^\dag + \nu ^\dag \in \text{ Ran }(A^*). \end{aligned}$$
(3.2)

Note that when \(L = I\) and \(\mathcal C= \mathcal X\), (3.2) becomes the benchmark source condition

$$\begin{aligned} \partial f(x^\dag ) \cap \text{ Ran }(A^*) \ne \emptyset \end{aligned}$$

which has been widely used to derive convergence rates for regularization methods, see [7, 13, 23, 29] for instance. We have the following convergence rate result.

Theorem 3.2

Let Assumption 3 hold, let the exact data b be consistent, and let the sequence \(\{z^k, y^k, x^k, \lambda ^k, \mu ^k, \nu ^k\}\) be defined by the proximal ADMM (1.11) with noisy data \(b^\delta \) satisfying \(\Vert b^\delta - b\Vert \le \delta \). Assume the unique solution \(x^\dag \) of (1.9) satisfies the source condition (3.2). Then for the integer \(k_\delta \) chosen by \(k_\delta \sim \delta ^{-1}\) there hold

$$\begin{aligned} \Vert x^{k_\delta } - x^\dag \Vert = O(\delta ^{1/4}), \quad \Vert y^{k_\delta } - L x^\dag \Vert = O(\delta ^{1/4}) \quad \text{ and } \quad \Vert z^{k_\delta } - x^\dag \Vert = O(\delta ^{1/4}) \end{aligned}$$

as \(\delta \rightarrow 0\).

In order to prove this result, let us start from the formulation of the algorithm (1.11) to derive some useful estimates. For simplicity of exposition, we set

$$\begin{aligned}&\Delta x^{k+1}:= x^{k+1} - x^k, \quad \Delta y^{k+1}:= y^{k+1} - y^k, \quad \Delta z^{k+1}:= z^{k+1} - z^k, \\&\Delta \lambda ^{k+1}:= \lambda ^{k+1} - \lambda ^k, \quad \Delta \mu ^{k+1}:= \mu ^{k+1} - \mu ^k, \quad \Delta \nu ^{k+1}:= \nu ^{k+1} - \nu ^k. \end{aligned}$$

According to the definition of \(z^{k+1}\), \(y^{k+1}\) and \(x^{k+1}\) in (1.11), we have the optimality conditions

$$\begin{aligned}&0 = A^* \lambda ^k + \nu ^k + \rho _1 A^* (A z^{k+1} - b^\delta ) + L^* (\mu ^k + \rho _2(L z^{k+1} -y^k)) \nonumber \\&\quad \quad + \rho _3 (z^{k+1} - x^k) + Q(z^{k+1} - z^k), \end{aligned}$$
(3.3)
$$\begin{aligned}&0 \in \partial f(y^{k+1}) - \mu ^k - \rho _2(L z^{k+1} - y^{k+1}), \end{aligned}$$
(3.4)
$$\begin{aligned}&0 \in \partial \iota _C(x^{k+1}) -\nu ^k - \rho _3 (z^{k+1} - x^{k+1}). \end{aligned}$$
(3.5)

By using the last two equations in (1.11), we have from (3.4) and (3.5) that

$$\begin{aligned} \mu ^{k+1} \in \partial f(y^{k+1}) \quad \text{ and } \quad \nu ^{k+1} \in \partial \iota _{\mathcal C}(x^{k+1}). \end{aligned}$$
(3.6)

Let \(y^\dag := L x^\dag \). From the strong convexity of f, the convexity of \(\iota _C\), and (3.6) it follows that

$$\begin{aligned} \sigma _f \Vert y^{k+1} - y^\dag \Vert ^2&\le f(y^\dag ) - f(y^{k+1}) - \langle \mu ^{k+1}, y^\dag - y^{k+1}\rangle \nonumber \\&\quad \, + \langle \nu ^{k+1}, x^{k+1} - x^\dag \rangle . \end{aligned}$$
(3.7)

where \(\sigma _f\) denotes the modulus of convexity of f; we have \(\sigma _f>0\) as f is strongly convex. By taking the inner product of (3.3) with \(z^{k+1} - x^\dag \) we have

$$\begin{aligned} 0&= \langle \lambda ^k + \rho _1 (A z^{k+1} - b^\delta ), A (z^{k+1} - x^\dag )\rangle \\&\quad \, + \langle \mu ^k + \rho _2(L z^{k+1} - y^k), L(z^{k+1} - x^\dag )\rangle \\&\quad \, + \langle \nu ^k + \rho _3 (z^{k+1} - x^k), z^{k+1} - x^\dag \rangle \\&\quad \, + \langle Q(z^{k+1} - z^k), z^{k+1} - x^\dag \rangle . \end{aligned}$$

Therefore we may use the definition of \(\lambda ^{k+1}, \mu ^{k+1}, \nu ^{k+1}\) in (1.11) and the fact \(A x^\dag = b\) to further obtain

$$\begin{aligned} 0&= \langle \lambda ^{k+1}, A z^{k+1} - b\rangle + \langle \mu ^{k+1} + \rho _2 \Delta y^{k+1}, L z^{k+1} - y^\dag \rangle \nonumber \\&\quad \, + \langle \nu ^{k+1} + \rho _3 \Delta x^{k+1}, z^{k+1} - x^\dag \rangle \nonumber \\&\quad \, + \langle Q\Delta z^{k+1}, z^{k+1} - x^\dag \rangle . \end{aligned}$$
(3.8)

Subtracting (3.8) from (3.7) gives

$$\begin{aligned} \sigma _f \Vert y^{k+1} - y^\dag \Vert ^2&\le f(y^\dag ) - f(y^{k+1}) - \langle \lambda ^{k+1}, A z^{k+1} - b\rangle + \langle \mu ^{k+1}, y^{k+1} - L z^{k+1}\rangle \\&\quad \, - \rho _2 \langle \Delta y^{k+1}, L z^{k+1} - y^\dag \rangle + \langle \nu ^{k+1}, x^{k+1} - z^{k+1}\rangle \\&\quad \, -\rho _3 \langle \Delta x^{k+1}, z^{k+1} - x^\dag \rangle - \langle Q\Delta z^{k+1}, z^{k+1} - x^\dag \rangle . \end{aligned}$$

Note that under the source condition (3.2), there exist \(\mu ^\dag \), \(\nu ^\dag \) and \(\lambda ^\dag \) such that

$$\begin{aligned} \mu ^\dag \in \partial f(y^\dag ), \quad \nu ^\dag \in \partial \iota _\mathcal C(x^\dag ) \quad \text{ and } \quad L^* \mu ^\dag + \nu ^\dag + A^* \lambda ^\dag = 0. \end{aligned}$$
(3.9)

Thus, it follows from the above inequality and the last two equations in (1.11) that

$$\begin{aligned}&\sigma _f \Vert y^{k+1} - y^\dag \Vert ^2 \\&\le f(y^\dag ) - f(y^{k+1}) - \langle \lambda ^\dag , A z^{k+1} - b\rangle - \langle \mu ^\dag , L z^{k+1} - y^{k+1}\rangle - \langle \nu ^\dag , z^{k+1} - x^{k+1}\rangle \\&\quad \ - \langle \lambda ^{k+1} - \lambda ^\dag , A z^{k+1} - b^\delta + b^\delta - b\rangle \\&\quad \ -\frac{1}{\rho _2} \langle \mu ^{k+1} - \mu ^\dag , \Delta \mu ^{k+1} \rangle - \rho _2 \langle \Delta y^{k+1}, L z^{k+1} - y^\dag \rangle \\&\quad \ - \frac{1}{\rho _3} \langle \nu ^{k+1} - \nu ^\dag , \Delta \nu ^{k+1}\rangle - \rho _3 \langle \Delta x^{k+1}, z^{k+1} - x^\dag \rangle \\&\quad \, - \langle Q\Delta z^{k+1}, z^{k+1} - x^\dag \rangle . \end{aligned}$$

By using (3.9), \(b = A x^\dag \) and the convexity of f, we can see that

$$\begin{aligned}&f(y^\dag ) - f(y^{k+1}) - \langle \lambda ^\dag , A z^{k+1} - b\rangle - \langle \mu ^\dag , L z^{k+1} - y^{k+1}\rangle - \langle \nu ^\dag , z^{k+1} - x^{k+1}\rangle \\&= f(y^\dag ) - f(y^{k+1}) + \langle \lambda ^\dag , b\rangle + \langle \mu ^\dag , y^{k+1}\rangle + \langle \nu ^\dag , x^{k+1}\rangle \\&= f(y^\dag ) - f(y^{k+1}) + \langle A^* \lambda ^\dag , x^\dag \rangle + \langle \mu ^\dag , y^{k+1}\rangle + \langle \nu ^\dag , x^{k+1}\rangle \\&= f(y^\dag ) - f(y^{k+1}) - \langle L^* \mu ^\dag , x^\dag \rangle + \langle \mu ^\dag , y^{k+1}\rangle + \langle \nu ^\dag , x^{k+1} - x^\dag \rangle \\&= f(y^\dag ) - f(y^{k+1}) + \langle \mu ^\dag , y^{k+1} - y^\dag \rangle + \langle \nu ^\dag , x^{k+1} - x^\dag \rangle \le 0. \end{aligned}$$

Consequently, by using the fourth equation in (1.11), we have

$$\begin{aligned} \sigma _f \Vert y^{k+1} - y^\dag \Vert ^2&\le - \langle \lambda ^{k+1} - \lambda ^\dag , b^\delta - b\rangle - \frac{1}{\rho _1} \langle \lambda ^{k+1} - \lambda ^\dag , \Delta \lambda ^{k+1}\rangle \\&\quad \ - \frac{1}{\rho _2} \langle \mu ^{k+1} - \mu ^\dag , \Delta \mu ^{k+1}\rangle - \frac{1}{\rho _3} \langle \nu ^{k+1} - \nu ^\dag , \Delta \nu ^{k+1}\rangle \\&\quad \ - \rho _2 \langle \Delta y^{k+1}, y^{k+1}-y^\dag + L z^{k+1} - y^{k+1}\rangle \\&\quad \ - \rho _3 \langle \Delta x^{k+1}, x^{k+1} - x^\dag + z^{k+1} - x^{k+1}\rangle \\&\quad \ - \langle Q\Delta z^{k+1}, z^{k+1} - x^\dag \rangle . \end{aligned}$$

By using the polarization identity \(2\langle a-b, a-c\rangle = \Vert a-b\Vert ^2 + \Vert a-c\Vert ^2 - \Vert b-c\Vert ^2\) and the last two equations in (1.11) we further have

$$\begin{aligned} \sigma _f \Vert y^{k+1} - y^\dag \Vert ^2&\le - \langle \lambda ^{k+1} - \lambda ^\dag , b^\delta - b\rangle \\&\quad \ + \frac{1}{2\rho _1} \left( \Vert \lambda ^k -\lambda ^\dag \Vert ^2 - \Vert \lambda ^{k+1} - \lambda ^\dag \Vert ^2 - \Vert \Delta \lambda ^{k+1}\Vert ^2\right) \\&\quad \ + \frac{1}{2\rho _2} \left( \Vert \mu ^k - \mu ^\dag \Vert ^2 - \Vert \mu ^{k+1} - \mu ^\dag \Vert ^2 - \Vert \Delta \mu ^{k+1}\Vert ^2\right) \\&\quad \ + \frac{1}{2\rho _3} \left( \Vert \nu ^k - \nu ^\dag \Vert ^2 - \Vert \nu ^{k+1} - \nu ^\dag \Vert ^2 - \Vert \Delta \nu ^{k+1}\Vert ^2 \right) \\&\quad \ + \frac{1}{2} \left( \Vert z^k - x^\dag \Vert _Q^2 - \Vert z^{k+1} - x^\dag \Vert _Q^2 - \Vert \Delta z^{k+1}\Vert _Q^2\right) \\&\quad \ + \frac{\rho _2}{2} \left( \Vert y^k - y^\dag \Vert ^2 - \Vert y^{k+1} - y^\dag \Vert ^2 - \Vert \Delta y^{k+1}\Vert ^2\right) \\&\quad \ + \frac{\rho _3}{2} \left( \Vert x^k - x^\dag \Vert ^2 - \Vert x^{k+1} - x^\dag \Vert ^2 - \Vert \Delta x^{k+1}\Vert ^2\right) \\&\quad \ - \langle \Delta y^{k+1}, \Delta \mu ^{k+1} \rangle - \langle \Delta x^{k+1}, \Delta \nu ^{k+1} \rangle . \end{aligned}$$

Let

$$\begin{aligned} \Phi _k&:= \frac{1}{2\rho _1} \Vert \lambda ^k -\lambda ^\dag \Vert ^2 + \frac{1}{2\rho _2} \Vert \mu ^k - \mu ^\dag \Vert ^2 + \frac{1}{2\rho _3} \Vert \nu ^k - \nu ^\dag \Vert ^2 \\&\quad \ + \frac{1}{2} \Vert z^k - x^\dag \Vert _Q^2 + \frac{\rho _2}{2} \Vert y^k - y^\dag \Vert ^2 + \frac{\rho _3}{2} \Vert x^k - x^\dag \Vert ^2. \end{aligned}$$

Then

$$\begin{aligned} \sigma _f \Vert y^{k+1} - y^\dag \Vert ^2&\le \Phi _k - \Phi _{k+1} - \langle \lambda ^{k+1} - \lambda ^\dag , b^\delta - b\rangle - E_{k+1} \nonumber \\&\quad \ - \langle \Delta y^{k+1}, \Delta \mu ^{k+1}\rangle - \langle \Delta x^{k+1}, \Delta \nu ^{k+1}\rangle , \end{aligned}$$
(3.10)

where \(E_k\) is defined by (3.1).

Lemma 3.3

For all \(k = 0, 1, \cdots \) there hold

$$\begin{aligned}&\sigma _f \Vert y^{k+1} - y^\dag \Vert ^2 \le \Phi _k - \Phi _{k+1} - \langle \lambda ^{k+1} - \lambda ^\dag , b^\delta - b\rangle - E_{k+1}, \end{aligned}$$
(3.11)
$$\begin{aligned}&E_{k+1} \le \Phi _k - \Phi _{k+1} + \sqrt{2\rho _1 \Phi _{k+1}} \delta \end{aligned}$$
(3.12)

and

$$\begin{aligned} \Phi _{k+1} \le \Phi _0 + \left( \sum _{j=1}^{k+1} \sqrt{2\rho _1 \Phi _j}\right) \delta . \end{aligned}$$
(3.13)

Proof

By using (3.6) and the monotonicity of the subdifferentials \(\partial f\) and \(\partial \iota _{\mathcal C}\) we have

$$\begin{aligned} 0 \le \sigma _f \Vert \Delta y^{k+1}\Vert ^2 \le \langle \Delta \mu ^{k+1}, \Delta y^{k+1}\rangle + \langle \Delta \nu ^{k+1}, \Delta x^{k+1}\rangle . \end{aligned}$$

This together with (3.10) implies (3.11). From (3.11) it follows immediately that

$$\begin{aligned} E_{k+1}&\le \Phi _k - \Phi _{k+1} - \langle \lambda ^{k+1} - \lambda ^\dag , b^\delta - b\rangle \nonumber \\&\le \Phi _k - \Phi _{k+1} + \Vert \lambda ^{k+1} - \lambda ^\dag \Vert \delta \nonumber \\&\le \Phi _k - \Phi _{k+1} + \sqrt{2\rho _1 \Phi _{k+1}} \delta \end{aligned}$$

which shows (3.12). By the non-negativity of \(E_{k+1}\) we then obtain from (3.12) that

$$\begin{aligned} \Phi _{k+1} \le \Phi _k + \sqrt{2 \rho _1 \Phi _{k+1}} \delta , \quad \forall k \ge 0 \end{aligned}$$

which clearly implies (3.13). \(\square \)

In order to derive the estimate on \(\Phi _k\) from (3.13), we need the following elementary result.

Lemma 3.4

Let \(\{a_k\}\) and \(\{b_k\}\) be two sequences of nonnegative numbers such that

$$\begin{aligned} a_k^2 \le b_k^2 + c \sum _{j=1}^{k} a_j, \quad k=0, 1, \cdots , \end{aligned}$$

where \(c \ge 0\) is a constant. If \(\{b_k\}\) is non-decreasing, then

$$\begin{aligned} a_k \le b_k + c k, \quad k=0, 1, \cdots . \end{aligned}$$

Proof

We show the result by induction on k. The result is trivial for \(k =0\) since the given condition with \(k=0\) gives \(a_0 \le b_0\). Assume that the result is valid for all \(0\le k \le l\) for some \(l\ge 0\). We show it is also true for \(k = l+1\). If \(a_{l+1} \le \max \{a_0, \cdots , a_l\}\), then \(a_{l+1} \le a_j\) for some \(0\le j\le l\). Thus, by the induction hypothesis and the monotonicity of \(\{b_k\}\) we have

$$\begin{aligned} a_{l+1} \le a_j \le b_j + c j \le b_{l+1} + c (l+1). \end{aligned}$$

If \(a_{l+1} > \max \{a_0, \cdots , a_l\}\), then

$$\begin{aligned} a_{l+1}^2 \le b_{l+1}^2 + c \sum _{j=1}^{l+1} a_j \le b_{l+1}^2 + c(l+1) a_{l+1} \end{aligned}$$

which implies that

$$\begin{aligned} \left( a_{l+1} - \frac{1}{2} c (l+1)\right) ^2&= a_{l+1}^2 - c (l+1) a_{l+1} + \frac{1}{4} c^2 (l+1)^2\\&\le b_{l+1}^2 + \frac{1}{4} c^2 (l+1)^2 \\&\le \left( b_{l+1} + \frac{1}{2} c (l+1)\right) ^2. \end{aligned}$$

Taking square roots shows \(a_{l+1} \le b_{l+1} + c (l+1)\) again. \(\square \)
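As a quick sanity check of Lemma 3.4, one can generate the extremal sequence attaining equality in the hypothesis (with constant \(b_k \equiv b\)) and verify the asserted bound \(a_k \le b + ck\). The following Python sketch is purely illustrative; the constants b and c are arbitrary choices.

```python
# Extremal sequence for Lemma 3.4 with b_k = b: equality a_k^2 = b^2 + c * sum_{j=1}^k a_j.
# For k >= 1 this makes a_k the positive root of a^2 - c*a - (b^2 + c*S) = 0,
# where S = a_1 + ... + a_{k-1}; for k = 0 the sum is empty, so a_0 = b.
b, c = 2.0, 0.3          # illustrative constants
S = 0.0                  # running sum a_1 + ... + a_{k-1}
a = b                    # k = 0
assert a <= b + c * 0
for k in range(1, 51):
    a = 0.5 * (c + (c * c + 4.0 * (b * b + c * S)) ** 0.5)
    S += a
    assert a <= b + c * k + 1e-12, (k, a)
print("a_k <= b + c*k verified for k = 0, ..., 50")
```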

Lemma 3.5

There hold

$$\begin{aligned} \Phi _k^{1/2} \le \Phi _0^{1/2} + \sqrt{2 \rho _1} k \delta , \quad \forall k \ge 0 \end{aligned}$$
(3.14)

and

$$\begin{aligned} E_k \le \frac{2\Phi _0}{k} + \frac{5}{2} \rho _1 k \delta ^2, \quad \forall k \ge 1. \end{aligned}$$
(3.15)

Proof

Based on (3.13), we may use Lemma 3.4 with \(a_k = \Phi _k^{1/2}\), \(b_k = \Phi _0^{1/2}\) and \(c = (2\rho _1)^{1/2} \delta \) to obtain (3.14) directly. Next, by using the monotonicity of \(\{E_k\}\), (3.12) and (3.14) we have

$$\begin{aligned} k E_k&\le \sum _{j=1}^k E_j \le \sum _{j=1}^k \left( \Phi _{j-1} - \Phi _j + \sqrt{2\rho _1 \Phi _j}\delta \right) \\&\le \Phi _0 - \Phi _k + \sum _{j=1}^k \sqrt{2\rho _1 \Phi _j} \delta \\&\le \Phi _0 + \sum _{j=1}^k \sqrt{2\rho _1} \left( \sqrt{\Phi _0} + \sqrt{2\rho _1} j \delta \right) \delta \\&= \Phi _0 + \sqrt{2\rho _1 \Phi _0} k \delta + \rho _1 k(k+1) \delta ^2 \\&\le 2 \Phi _0 + \frac{5}{2} \rho _1 k^2 \delta ^2 \end{aligned}$$

which shows (3.15). \(\square \)

Now we are ready to complete the proof of Theorem 3.2.

Proof (Proof of Theorem 3.2)

Let \(k_\delta \) be an integer such that \(k_\delta \sim \delta ^{-1}\). From (3.14) and (3.15) in Lemma 3.5 it follows that

$$\begin{aligned} E_{k_\delta } \le C_0 \delta \quad \text{ and } \quad \Phi _k \le C_1 \text{ for } \text{ all } k\le k_\delta , \end{aligned}$$
(3.16)

where \(C_0\) and \(C_1\) are constants independent of k and \(\delta \). In order to use (3.11) in Lemma 3.3 to estimate \(\Vert y^{k_\delta } - y^\dag \Vert \), we first consider \(\Phi _k - \Phi _{k+1}\) for all \(k\ge 0\). By using the definition of \(\Phi _k\) and the inequality \(\Vert u\Vert ^2 - \Vert v\Vert ^2 \le (\Vert u\Vert + \Vert v\Vert ) \Vert u - v\Vert \), we have for \(k \ge 0\) that

$$\begin{aligned} \Phi _{k} - \Phi _{k+1}&\le \frac{1}{2\rho _1} \left( \Vert \lambda ^{k} - \lambda ^\dag \Vert + \Vert \lambda ^{k+1} - \lambda ^\dag \Vert \right) \Vert \Delta \lambda ^{k+1}\Vert \\&\quad \ + \frac{1}{2\rho _2} \left( \Vert \mu ^{k} - \mu ^\dag \Vert + \Vert \mu ^{k+1} - \mu ^\dag \Vert \right) \Vert \Delta \mu ^{k+1}\Vert \\&\quad \ + \frac{1}{2\rho _3} \left( \Vert \nu ^{k} - \nu ^\dag \Vert + \Vert \nu ^{k+1} - \nu ^\dag \Vert \right) \Vert \Delta \nu ^{k+1}\Vert \\&\quad \ + \frac{1}{2}\left( \Vert z^k - x^\dag \Vert _Q + \Vert z^{k+1}-x^\dag \Vert _Q\right) \Vert \Delta z^{k+1}\Vert _Q \\&\quad \ + \frac{\rho _2}{2} \left( \Vert y^k - y^\dag \Vert + \Vert y^{k+1} - y^\dag \Vert \right) \Vert \Delta y^{k+1}\Vert \\&\quad \ + \frac{\rho _3}{2} \left( \Vert x^k - x^\dag \Vert + \Vert x^{k+1} - x^\dag \Vert \right) \Vert \Delta x^{k+1}\Vert . \end{aligned}$$

By virtue of the Cauchy-Schwarz inequality and the inequality \((a+b)^2 \le 2(a^2 + b^2)\) for any numbers \(a, b \in {\mathbb R}\) we can further obtain

$$\begin{aligned} \Phi _k - \Phi _{k+1} \le \sqrt{2(\Phi _k + \Phi _{k+1}) E_{k+1}}, \quad \forall k\ge 0. \end{aligned}$$

This together with (3.16) in particular implies

$$\begin{aligned} \Phi _{k_\delta -1} - \Phi _{k_\delta } \le \sqrt{4 C_0 C_1 \delta }. \end{aligned}$$

Therefore, it follows from (3.11) that

$$\begin{aligned} \sigma _f \Vert y^{k_\delta }-y^\dag \Vert ^2&\le \Phi _{k_\delta -1} - \Phi _{k_\delta } + \Vert \lambda ^{k_\delta }-\lambda ^\dag \Vert \delta \\&\le \sqrt{4 C_0C_1 \delta } + \sqrt{2\rho _1 \Phi _{k_\delta }} \delta \\&\le \sqrt{4 C_0 C_1 \delta } + \sqrt{2\rho _1 C_1} \delta . \end{aligned}$$

Thus

$$\begin{aligned} \Vert y^{k_\delta } - y^\dag \Vert ^2 \le C_2 \delta ^{1/2}, \end{aligned}$$

where \(C_2\) is a constant independent of \(\delta \) and k. By using the estimate \(E_{k_\delta } \le C_0 \delta \) in (3.16), the definition of \(E_{k_\delta }\), and the last three equations in (1.11), we can see that

$$\begin{aligned}&\Vert A z^{k_\delta }- b^\delta \Vert ^2 = \frac{1}{\rho _1^2} \Vert \Delta \lambda ^{k_\delta }\Vert ^2 \le \frac{2}{\rho _1} E_{k_\delta } \le \frac{2 C_0}{\rho _1} \delta ,\\&\Vert L z^{k_\delta } - y^{k_\delta }\Vert ^2 = \frac{1}{\rho _2^2} \Vert \Delta \mu ^{k_\delta }\Vert ^2 \le \frac{2}{\rho _2} E_{k_\delta } \le \frac{2C_0}{\rho _2} \delta ,\\&\Vert z^{k_\delta } - x^{k_\delta }\Vert ^2 = \frac{1}{\rho _3^2} \Vert \Delta \nu ^{k_\delta }\Vert ^2 \le \frac{2}{\rho _3} E_{k_\delta } \le \frac{2 C_0}{\rho _3} \delta . \end{aligned}$$

Therefore

$$\begin{aligned}&\Vert L(z^{k_\delta }-x^\dag )\Vert ^2 \le 2\left( \Vert L z^{k_\delta } - y^{k_\delta }\Vert ^2 + \Vert y^{k_\delta } - y^\dag \Vert ^2\right) \le \frac{4 C_0}{\rho _2} \delta + 2 C_2 \delta ^{1/2}, \\&\Vert A(z^{k_\delta }- x^\dag )\Vert ^2 \le 2\left( \Vert A z^{k_\delta } - b^\delta \Vert ^2 + \Vert b^\delta - b\Vert ^2\right) \le 2 \left( \frac{2 C_0}{\rho _1} + 1\right) \delta . \end{aligned}$$

By virtue of (iii) in Assumption 3 on A and L we thus obtain

$$\begin{aligned} c_0 \Vert z^{k_\delta }-x^\dag \Vert ^2&\le \Vert A(z^{k_\delta } - x^\dag )\Vert ^2 + \Vert L(z^{k_\delta }- x^\dag )\Vert ^2 \\&\le 2 \left( \frac{2 C_0}{\rho _1} + \frac{2 C_0}{\rho _2} + 1\right) \delta + 2 C_2 \delta ^{1/2}. \end{aligned}$$

This means there is a constant \(C_3\) independent of \(\delta \) and k such that

$$\begin{aligned} \Vert z^{k_\delta } - x^\dag \Vert ^2 \le C_3 \delta ^{1/2}. \end{aligned}$$

Finally we obtain

$$\begin{aligned} \Vert x^{k_\delta } - x^\dag \Vert ^2 \le 2\left( \Vert x^{k_\delta }- z^{k_\delta }\Vert ^2 + \Vert z^{k_\delta } - x^\dag \Vert ^2\right) \le \frac{4 C_0}{\rho _3} \delta + 2 C_3 \delta ^{1/2}. \end{aligned}$$

The proof is thus complete. \(\square \)

Remark 3.1

Under the benchmark source condition (3.2), we have obtained in Theorem 3.2 the convergence rate \(O(\delta ^{1/4})\) for the proximal ADMM (1.11). This rate is not order optimal. It is not yet clear if the order optimal rate \(O(\delta ^{1/2})\) can be achieved.

Remark 3.2

When using the proximal ADMM to solve (1.9) with \(L = I\), i.e.

$$\begin{aligned} \min \left\{ f(x): A x = b \text{ and } x \in \mathcal C\right\} , \end{aligned}$$
(3.17)

it is not necessary to introduce the y-variable as is done in (1.11) and thus (1.11) can be simplified to the scheme

$$\begin{aligned}&z^{k+1} = \arg \min _{z\in \mathcal X} \left\{ {\mathscr {L}}_{\rho _1, \rho _2}(z, x^k, \lambda ^k, \nu ^k) + \frac{1}{2} \Vert z-z^k\Vert _Q^2\right\} , \nonumber \\&x^{k+1} = \arg \min _{x\in \mathcal X} \left\{ {\mathscr {L}}_{\rho _1, \rho _2}(z^{k+1}, x, \lambda ^k, \nu ^k)\right\} , \nonumber \\&\lambda ^{k+1} = \lambda ^k + \rho _1 (A z^{k+1} - b^\delta ), \nonumber \\&\nu ^{k+1} = \nu ^k + \rho _2 (z^{k+1} - x^{k+1}), \end{aligned}$$
(3.18)

where

$$\begin{aligned} {\mathscr {L}}_{\rho _1, \rho _2}(z, x, \lambda , \nu )&:= f(z) + \iota _\mathcal C(x) + \langle \lambda , A z- b^\delta \rangle + \langle \nu , z-x\rangle \\&\quad \, + \frac{\rho _1}{2}\Vert A z- b^\delta \Vert ^2 + \frac{\rho _2}{2} \Vert z-x\Vert ^2. \end{aligned}$$

The source condition (3.2) then reduces to the form

$$\begin{aligned} \exists \mu ^\dag \in \partial f(x^\dag ) \text{ and } \nu ^\dag \in \partial \iota _\mathcal C(x^\dag ) \text{ such } \text{ that } \mu ^\dag + \nu ^\dag \in \text{ Ran }(A^*). \end{aligned}$$
(3.19)

If the unique solution \(x^\dag \) of (3.17) satisfies the source condition (3.19), one may follow the proof of Theorem 3.2 with minor modification to deduce for the method (3.18) that

$$\begin{aligned} \Vert x^{k_\delta } - x^\dag \Vert = O(\delta ^{1/4}) \quad \text{ and } \quad \Vert z^{k_\delta } - x^\dag \Vert = O(\delta ^{1/4}) \end{aligned}$$

whenever the integer \(k_\delta \) is chosen such that \(k_\delta \sim \delta ^{-1}\).

We conclude this section by presenting a numerical result to illustrate the semi-convergence property of the proximal ADMM and the convergence rate. We consider finding the solution of (1.8) with minimal norm, which is equivalent to solving (3.17) with \(f(x) = \frac{1}{2} \Vert x\Vert ^2\). With noisy data \(b^\delta \) satisfying \(\Vert b^\delta - b\Vert \le \delta \), the corresponding proximal ADMM (3.18) takes the form

$$\begin{aligned} z^{k+1}&= \left( (1 + \rho _2) I + Q + \rho _1 A^*A\right) ^{-1} \left( \rho _1 A^* b^\delta + \rho _2 x^k + Q z^k - A^* \lambda ^k - \nu ^k\right) , \nonumber \\ x^{k+1}&= P_\mathcal C\left( z^{k+1} + \nu ^k/\rho _2\right) , \nonumber \\ \lambda ^{k+1}&= \lambda ^k + \rho _1(A z^{k+1} - b^\delta ), \nonumber \\ \nu ^{k+1}&= \nu ^k + \rho _2(z^{k+1} - x^{k+1}), \end{aligned}$$
(3.20)

where \(P_\mathcal C\) denotes the orthogonal projection of \(\mathcal X\) onto \(\mathcal C\). The source condition (3.19) now takes the form

$$\begin{aligned} \exists \nu ^\dag \in \partial \iota _\mathcal C(x^\dag ) \text{ such } \text{ that } x^\dag + \nu ^\dag \in \text{ Ran }(A^*) \end{aligned}$$
(3.21)

which is equivalent to the projected source condition \(x^\dag \in P_\mathcal C(\text{ Ran }(A^*))\); indeed, \(\nu ^\dag \in \partial \iota _\mathcal C(x^\dag )\) holds precisely when \(x^\dag = P_\mathcal C(x^\dag + \nu ^\dag )\).

Example 3.6

In our numerical simulation we consider the first kind integral equation

$$\begin{aligned} (A x)(t) := \int _0^1 \kappa (s, t) x(s) ds = b(t), \quad t \in [0,1] \end{aligned}$$
(3.22)

on \(L^2[0,1]\), where the kernel \(\kappa \) is continuous on \([0,1]\times [0,1]\). Such equations arise naturally in many linear ill-posed inverse problems, see [12, 18]. Clearly A is a compact linear operator from \(L^2[0,1]\) to itself. We will use

$$\begin{aligned} \kappa (s, t) = d \left( d^2 + (s-t)^2\right) ^{-3/2} \end{aligned}$$

with \(d = 0.1\). The corresponding equation is a 1-D model problem in gravity surveying. Assume the equation (3.22) has a nonnegative solution. We will employ the method (3.20) to determine the unique nonnegative solution of (3.22) with minimal norm in case the data is corrupted by noise. Here \(\mathcal C:=\{x \in L^2[0,1]: x\ge 0 \text{ a.e. }\}\) and thus \(P_\mathcal C(x) = \max \{x, 0\}\).

Fig. 1 (a) The true solution \(x^\dag \); (b), (c) the relative errors versus the number of iterations for the method (3.20) using noisy data with noise levels \(\delta = 10^{-2}\) and \(10^{-4}\), respectively

In order to investigate the convergence rate of the method, we generate our data as follows. First take \(\omega ^\dag \in L^2[0,1]\), set \(x^\dag := \max \{A^* \omega ^\dag , 0\}\) and define \(b:= A x^\dag \). Thus \(x^\dag \) is a nonnegative solution of \(A x = b\) satisfying \(x^\dag = P_\mathcal C(A^* \omega ^\dag )\), i.e. the source condition (3.21) holds. We use \(\omega ^\dag = t^3(0.9-t)(t-0.35)\); the corresponding \(x^\dag \) is plotted in Fig. 1a. We then pick a random element \(\xi \) with \(\Vert \xi \Vert _{L^2[0,1]} = 1\) and generate the noisy data \(b^\delta \) by \(b^\delta := b + \delta \xi \). Clearly \(\Vert b^\delta - b\Vert _{L^2[0,1]} = \delta \).

For the numerical implementation, we discretize the equation by the trapezoidal rule based on partitioning [0, 1] into \(N-1\) subintervals of equal length with \(N = 600\). We then execute the method (3.20) with \(Q =0\), \(\rho _1 = 10\), \(\rho _2=1\) and the initial guess \(x^0 = \lambda ^0 = \nu ^0 = 0\) using the noisy data \(b^\delta \) for several distinct values of \(\delta \). In Fig. 1b and c we plot the relative error \(\Vert x^k- x^\dag \Vert _{L^2}/\Vert x^\dag \Vert _{L^2}\) versus the number of iterations k for \(\delta = 10^{-2}\) and \(\delta = 10^{-4}\) respectively. These plots demonstrate that the proximal ADMM exhibits the semi-convergence phenomenon when used to solve ill-posed problems, no matter how small the noise level is. Therefore, properly terminating the iteration is important for producing useful approximate solutions; this has been done in [21, 22].
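For the reader's convenience, the following self-contained NumPy sketch implements the iteration (3.20) for this example. It is a minimal illustration rather than the exact code behind Fig. 1 and Table 1: the matrix transpose serves as the adjoint in the discrete (Euclidean) setting, so the discrete analogue of (3.21) holds by construction, the noise \(\xi \) is drawn from a fixed random seed, the iteration is capped at 5000 steps, and only the relative-error history is reported.

```python
import numpy as np

# Trapezoidal discretization of the integral operator (3.22) on [0, 1]
N = 600
t = np.linspace(0.0, 1.0, N)
h = t[1] - t[0]
w = np.full(N, h); w[0] = w[-1] = h / 2.0                  # quadrature weights

d = 0.1
K = d * (d**2 + (t[:, None] - t[None, :])**2) ** (-1.5)    # kernel kappa(s, t), symmetric
A = K * w[None, :]                                         # (A x)_i ~ sum_j kappa(s_j, t_i) x_j w_j

# Exact data: x† = P_C(A* omega†), so the (discrete) source condition (3.21) holds
omega = t**3 * (0.9 - t) * (t - 0.35)
x_dag = np.maximum(A.T @ omega, 0.0)
b = A @ x_dag

# Noisy data with ||b^delta - b|| = delta in the discrete L^2 norm
delta = 1e-4
xi = np.random.default_rng(0).standard_normal(N)
xi /= np.sqrt(np.sum(w * xi**2))
b_delta = b + delta * xi

# Proximal ADMM (3.20) with Q = 0, rho1 = 10, rho2 = 1 and zero initial guesses
rho1, rho2 = 10.0, 1.0
Minv = np.linalg.inv((1.0 + rho2) * np.eye(N) + rho1 * (A.T @ A))  # z-update matrix, inverted once
x = np.zeros(N); lam = np.zeros(N); nu = np.zeros(N)
errs = []
for k in range(5000):
    z = Minv @ (rho1 * (A.T @ b_delta) + rho2 * x - A.T @ lam - nu)
    x = np.maximum(z + nu / rho2, 0.0)                     # projection onto C = {x >= 0}
    lam += rho1 * (A @ z - b_delta)
    nu += rho2 * (z - x)
    errs.append(np.linalg.norm(x - x_dag) / np.linalg.norm(x_dag))

k_min = int(np.argmin(errs))
print(f"err_min = {errs[k_min]:.3e} attained at iteration {k_min + 1}")
```

Varying delta and recording the minimizer of the error history should reproduce, qualitatively, the semi-convergence behaviour described above.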

Table 1 Numerical results for the method (3.20) using noisy data with diverse noise levels, where \(\texttt {err}_{\min }\) and \(\texttt {iter}_{\min }\) denote respectively the smallest relative error and the required number of iterations

In Table 1 we report further numerical results. For the noisy data \(b^\delta \) with each noise level \(\delta = 10^{-i}\), \(i = 1, \cdots , 7\), we execute the method and determine the smallest relative error, denoted by \(\texttt {err}_{\min }\), and the number of iterations required to attain it, denoted by \(\texttt {iter}_{\min }\). The ratios \(\texttt {err}_{\min }/\delta ^{1/2}\) and \(\texttt {err}_{\min }/\delta ^{1/4}\) are then calculated. Since \(x^\dag \) satisfies the source condition (3.21), our theoretical result predicts the convergence rate \(O(\delta ^{1/4})\). However, Table 1 shows that the value of \(\texttt {err}_{\min }/\delta ^{1/2}\) does not change much, while the value of \(\texttt {err}_{\min }/\delta ^{1/4}\) decreases toward 0 as \(\delta \rightarrow 0\). This strongly suggests that the proximal ADMM admits the order optimal convergence rate \(O(\delta ^{1/2})\) whenever the source condition (3.21) holds. However, how to derive this order optimal rate remains open.