1 Introduction

Many linear inverse problems can be formulated as the minimization problem

$$\begin{aligned} \min \{{\mathcal {R}}(x) : x \in X \text{ and } Ax = y\}, \end{aligned}$$
(1.1)

where \(A : X \rightarrow Y\) is a bounded linear operator from a Banach space X to a Hilbert space Y, \(y \in \text{ Ran }(A)\), the range of A, and \({\mathcal {R}}: X \rightarrow (-\infty , \infty ]\) is a proper, lower semi-continuous, convex function that is used to select a solution with the desired features. Throughout the paper, all spaces are assumed to be real vector spaces; however, all results still hold for complex vector spaces with minor modifications adapted to the complex setting. The norms in X and Y are denoted by the same notation \(\Vert \cdot \Vert \). We also use the same notation \(\langle \cdot , \cdot \rangle \) to denote the duality pairing in Banach spaces and the inner product in Hilbert spaces. When the operator A does not have closed range, the problem (1.1) is in general ill-posed; thus, if instead of y we only have noisy data \(y^\delta \) satisfying

$$\begin{aligned} \Vert y^\delta -y\Vert \le \delta \end{aligned}$$

with a small noise level \(\delta >0\), then replacing y in (1.1) by \(y^\delta \) may lead to a problem that is not well-defined; even if it is well-defined, the solution may not depend continuously on the data. In order to use noisy data to find an approximate solution of (1.1), a regularization technique should be employed to remove the instability [13, 42].

In this paper we will consider a dual gradient method for solving (1.1). This method is based on applying the gradient method to its dual problem. For a better understanding, we provide a brief derivation of this method, which is well known in the optimization community [5, 43]; the facts from convex analysis that are used will be reviewed in Sect. 2. Assume that we only have noisy data \(y^\delta \) and consider the problem (1.1) with y replaced by \(y^\delta \). The associated Lagrangian function is

$$\begin{aligned} {{\mathcal {L}}}(x, \lambda ) = {\mathcal {R}}(x) - \langle \lambda , Ax -y^\delta \rangle , \quad x \in X \text{ and } \lambda \in Y \end{aligned}$$

which induces the dual function

$$\begin{aligned} \inf _{x\in X} \left\{ {\mathcal {R}}(x) - \langle \lambda , Ax -y^\delta \rangle \right\} = -{\mathcal {R}}^*(A^*\lambda ) + \langle \lambda , y^\delta \rangle , \end{aligned}$$

where \(A^*: Y \rightarrow X^*\) denotes the adjoint of A and \({\mathcal {R}}^*: X^*\rightarrow (-\infty , \infty ]\) denotes the Legendre-Fenchel conjugate of \({\mathcal {R}}\). Thus the corresponding dual problem is

$$\begin{aligned} \min _{\lambda \in Y} \left\{ d_{y^\delta }(\lambda ): = {\mathcal {R}}^*(A^*\lambda )- \langle \lambda , y^\delta \rangle \right\} . \end{aligned}$$
(1.2)

If \({\mathcal {R}}\) is strongly convex, then \({\mathcal {R}}^*\) is continuously differentiable with \(\nabla {\mathcal {R}}^*: X^*\rightarrow X\), and so is the function \(\lambda \rightarrow d_{y^\delta }(\lambda )\) on Y. Therefore, we may apply a gradient method to solve (1.2), which leads to

$$\begin{aligned} \lambda _{n+1} = \lambda _n - \gamma \left( A \nabla {\mathcal {R}}^*(A^*\lambda _n)- y^\delta \right) , \end{aligned}$$

where \(\gamma >0\) is a step-size. Let \(x_n:= \nabla {\mathcal {R}}^*(A^*\lambda _n)\). Then by the properties of subdifferential we have \(A^*\lambda _n\in \partial {\mathcal {R}}(x_n)\) and hence

$$\begin{aligned} x_n \in \arg \min _{x\in X} \left\{ {\mathcal {R}}(x) -\langle \lambda _n, A x- y^\delta \rangle \right\} . \end{aligned}$$

Combining the above two equations results in the following dual gradient method

$$\begin{aligned} \begin{aligned} x_n&= \arg \min _{x\in X} \left\{ {\mathcal {R}}(x) - \langle \lambda _n, A x- y^\delta \rangle \right\} , \\ \lambda _{n+1}&= \lambda _n - \gamma (A x_n - y^\delta ). \end{aligned} \end{aligned}$$
(1.3)

Note that when X is a Hilbert space and \({\mathcal {R}}(x) = \Vert x\Vert ^2/2\), the method (1.3) becomes the standard linear Landweber iteration in Hilbert spaces [13].
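Indeed, in that case (identifying X with its dual via the Riesz representation) the first equation in (1.3) gives \(x_n = A^*\lambda _n\), and the second equation then yields

$$\begin{aligned} x_{n+1} = A^*\lambda _{n+1} = A^*\lambda _n - \gamma A^*(A x_n - y^\delta ) = x_n - \gamma A^*(A x_n - y^\delta ), \end{aligned}$$

which is exactly the Landweber iteration.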

By setting \(\xi _n := A^*\lambda _n\), we can obtain from (1.3) the algorithm

$$\begin{aligned} \begin{aligned} x_n&= \arg \min _{x\in X} \left\{ {\mathcal {R}}(x) -\langle \xi _n, x\rangle \right\} , \\ \xi _{n+1}&= \xi _n - \gamma A^*\left( Ax_n - y^\delta \right) . \end{aligned} \end{aligned}$$
(1.4)

Actually the method (1.4) is equivalent to (1.3) when the initial guess \(\xi _0\) is chosen from \(\text{ Ran }(A^*)\), the range of \(A^*\). Indeed, under the given condition on \(\xi _0\), we can conclude from (1.4) that \(\xi _n \in \text{ Ran }(A^*)\) for all n. Assuming \(\xi _n = A^* \lambda _n\) for some \(\lambda _n \in Y\), we can easily see that \(x_n\) defined by the first equation in (1.4) satisfies the first equation in (1.3). Furthermore, from the second equation in (1.4) we have

$$\begin{aligned} \xi _{n+1} = A^*\left( \lambda _n - \gamma (A x_n - y^\delta )\right) \end{aligned}$$

which means \(\xi _{n+1} = A^* \lambda _{n+1}\) with \(\lambda _{n+1}\) defined by the second equation in (1.3).

The method (1.4), as well as its generalizations to linear and nonlinear ill-posed problems in Banach spaces, has been considered in [9, 29, 33, 34, 40, 41], and convergence has been established when the method is terminated by the discrepancy principle. However, except for the linear and nonlinear Landweber iteration in Hilbert spaces [13, 22], convergence rates are in general missing from the existing theory. In this paper we will consider the dual gradient method (1.3), and hence the method (1.4), under the discrepancy principle

$$\begin{aligned} \Vert A x_{n_\delta } - y^\delta \Vert \le \tau \delta < \Vert A x_n - y^\delta \Vert , \quad 0 \le n < n_\delta \end{aligned}$$
(1.5)

with a constant \(\tau >1\) and derive the convergence rate when the sought solution satisfies a variational source condition. This is the main contribution of the present paper. We also consider accelerating the dual gradient method by Nesterov’s acceleration strategy and provide a convergence rate result when the method is terminated by an a priori stopping rule. Furthermore, we discuss various applications of our convergence theory: we provide a rather complete analysis of the dual projected Landweber iteration for solving linear ill-posed problems in Hilbert spaces with convex constraint which was proposed in [12] with only preliminary results; we also propose an entropic dual gradient method using Boltzmann-Shannon entropy to solve linear ill-posed problems whose solutions are probability density functions.
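To fix ideas, the following minimal sketch (in Python, with hypothetical names, on a finite-dimensional discretization where A is a matrix) illustrates the iteration (1.4) combined with the discrepancy principle (1.5); the callable grad_R_star stands for \(\nabla {\mathcal {R}}^*\), which is single-valued and Lipschitz by the strong convexity of \({\mathcal {R}}\) (see Sect. 2). It is only an illustration of the scheme, not a prescription of parameters.

```python
import numpy as np

def dual_gradient(A, y_delta, grad_R_star, gamma, delta, tau=2.0, max_iter=100000):
    """Dual gradient method (1.4) stopped by the discrepancy principle (1.5).

    A           : (m, n) array, discretized forward operator
    y_delta     : (m,) array of noisy data with ||y_delta - y|| <= delta
    grad_R_star : callable xi -> argmin_x { R(x) - <xi, x> } = grad R^*(xi)
    gamma       : step size; Theorem 3.1 uses 0 < gamma <= 2*sigma/||A||^2 and, for the
                  discrepancy principle, 1 - 1/tau - gamma*||A||^2/(2*sigma) > 0
    """
    xi = np.zeros(A.shape[1])                  # xi_0 = A^* lambda_0 with lambda_0 = 0
    for n in range(max_iter):
        x = grad_R_star(xi)                    # x_n
        residual = A @ x - y_delta
        if np.linalg.norm(residual) <= tau * delta:
            return x, n                        # stopping index n_delta from (1.5)
        xi = xi - gamma * (A.T @ residual)     # xi_{n+1} = xi_n - gamma A^*(A x_n - y^delta)
    return grad_R_star(xi), max_iter

# With R(x) = ||x||^2/2 (so grad_R_star is the identity and sigma = 1/2) this reduces
# to the classical Landweber iteration; a safe choice is then, e.g.,
#   gamma = 0.4 / np.linalg.norm(A, 2)**2  together with  tau = 2.
```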

A number of regularization methods for solving (1.1) exist in the literature, including the Tikhonov regularization method, the augmented Lagrangian method, and the nonstationary iterated Tikhonov regularization [19, 32, 42]. In particular, we would like to mention that the augmented Lagrangian method

$$\begin{aligned} \begin{aligned}&x_n \in \arg \min _{x\in X} \left\{ {\mathcal {R}}(x) - \langle \lambda _n, A x- y^\delta \rangle + \frac{\gamma _n}{2} \Vert Ax - y^\delta \Vert ^2 \right\} , \\&\lambda _{n+1} = \lambda _n - \gamma _n (A x_n - y^\delta ) \end{aligned} \end{aligned}$$
(1.6)

has been considered in [17–19, 30] for solving the ill-posed problem (1.1) as a regularization method. This method can be viewed as a modification of the dual gradient method (1.3) obtained by adding the augmented term \(\frac{\gamma _n}{2} \Vert A x-y^\delta \Vert ^2\) to the definition of \(x_n\). Although the addition of this extra term makes it possible to establish the regularization property of the augmented Lagrangian method under quite general conditions on \({\mathcal {R}}\), it destroys the decomposability structure, and thus extra work has to be done to determine \(x_n\) at each iteration step. In contrast, the convergence analysis of the dual gradient method requires \({\mathcal {R}}\) to be strongly convex; however, the determination of \(x_n\) is in general much easier. In fact, \(x_n\) is given by a closed formula in many interesting cases; even if \(x_n\) does not have a closed formula, there exist fast algorithms for solving the minimization problem defining \(x_n\), since it does not involve the operator A; see Sect. 4 and [9, 29, 33] for instance. This can significantly reduce the computational time.

The paper is organized as follows. In Sect. 2, we give a brief review of some basic facts from convex analysis in Banach spaces. In Sect. 3, after a quick account of convergence, we focus on deriving convergence rates of the dual gradient method under variational source conditions on the sought solution when the method is terminated by either an a priori stopping rule or the discrepancy principle; we also discuss the acceleration of the method by Nesterov’s strategy. Finally, in Sect. 4, we address various applications of our convergence theory.

2 Preliminaries

In this section, we will collect some basic facts on convex analysis in Banach spaces which will be used in the analysis of the dual gradient method (1.3); for more details one may refer to [8, 44] for instance.

Let X be a Banach space with norm \(\Vert \cdot \Vert \) and let \(X^*\) denote its dual space. Given \(x\in X\) and \(\xi \in X^*\) we write \(\langle \xi , x\rangle = \xi (x)\) for the duality pairing. For a convex function \(f : X \rightarrow (-\infty , \infty ]\), we use

$$\begin{aligned} \text{ dom }(f) := \{x \in X : f(x) < \infty \} \end{aligned}$$

to denote its effective domain. If \(\text{ dom }(f) \ne \emptyset \), f is called proper. Given \(x\in \text{ dom }(f)\), an element \(\xi \in X^*\) is called a subgradient of f at x if

$$\begin{aligned} f({\bar{x}}) \ge f(x) + \langle \xi , {\bar{x}} - x\rangle , \quad \forall \bar{x} \in X. \end{aligned}$$

The collection of all subgradients of f at x is denoted by \(\partial f(x)\) and is called the subdifferential of f at x. If \(\partial f(x) \ne \emptyset \), then f is called subdifferentiable at x. Thus \(x \rightarrow \partial f(x)\) defines a set-valued mapping \(\partial f\) whose domain is defined as

$$\begin{aligned} \text{ dom }(\partial f) := \{x \in \text{ dom }(f) : \partial f(x) \ne \emptyset \}. \end{aligned}$$

Given \(x\in \text{ dom }(\partial f)\) and \(\xi \in \partial f(x)\), the Bregman distance induced by f at x in the direction \(\xi \) is defined by

$$\begin{aligned} D_f^\xi ({\bar{x}}, x) := f({\bar{x}}) - f(x) - \langle \xi , {\bar{x}} - x\rangle , \quad \forall {\bar{x}} \in X \end{aligned}$$

which is always nonnegative.

For a proper function \(f : X\rightarrow (-\infty , \infty ]\), its Legendre–Fenchel conjugate is defined by

$$\begin{aligned} f^*(\xi ) := \sup _{x\in X} \{\langle \xi , x\rangle - f(x)\}, \quad \xi \in X^* \end{aligned}$$

which is a convex function taking values in \((-\infty , \infty ]\). According to the definition we immediately have the Fenchel–Young inequality

$$\begin{aligned} f^*(\xi ) + f(x) \ge \langle \xi , x\rangle \end{aligned}$$
(2.1)

for all \(x\in X\) and \(\xi \in X^*\). If \(f : X\rightarrow (-\infty , \infty ]\) is proper, lower semi-continuous and convex, \(f^*\) is also proper and

$$\begin{aligned} \xi \in \partial f(x) \Longleftrightarrow x\in \partial f^*(\xi ) \Longleftrightarrow f(x) + f^*(\xi ) = \langle \xi , x\rangle . \end{aligned}$$
(2.2)
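As a simple illustration of these notions, consider \(f(x) = \Vert x\Vert ^2/2\) on a Hilbert space X (identified with \(X^*\)). Then

$$\begin{aligned} f^*(\xi ) = \sup _{x\in X} \left\{ \langle \xi , x\rangle - \tfrac{1}{2} \Vert x\Vert ^2\right\} = \tfrac{1}{2} \Vert \xi \Vert ^2 \quad \text{ and } \quad \partial f(x) = \{x\}, \end{aligned}$$

so (2.1) is the elementary inequality \(\frac{1}{2}\Vert \xi \Vert ^2 + \frac{1}{2}\Vert x\Vert ^2 \ge \langle \xi , x\rangle \), with equality precisely when \(\xi = x\), in accordance with (2.2); moreover, the Bregman distance becomes \(D_f^x({\bar{x}}, x) = \frac{1}{2}\Vert {\bar{x}} - x\Vert ^2\).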

We will use the following version of the Fenchel–Rockafellar duality formula (see [8,  Theorem 4.4.3]).

Proposition 2.1

Let X and Y be Banach spaces, let \(f : X \rightarrow (-\infty , \infty ]\) and \(g : Y \rightarrow (-\infty , \infty ]\) be proper, convex functions, and let \(A : X \rightarrow Y\) be a bounded linear operator. If there is \(x_0 \in \text{ dom }(f)\) such that \(A x_0 \in \text{ dom }(g)\) and g is continuous at \(A x_0\), then

$$\begin{aligned} \inf _{x\in X} \{f(x) + g(Ax)\} = \sup _{\eta \in Y^*} \{-f^*(A^*\eta ) - g^*(-\eta ) \}. \end{aligned}$$
(2.3)

A proper function \(f : X \rightarrow (-\infty , \infty ]\) is called strongly convex if there exists a constant \(\sigma >0\) such that

$$\begin{aligned} f(t{\bar{x}} + (1-t) x) + \sigma t(1-t) \Vert {\bar{x}} -x\Vert ^2 \le tf({\bar{x}}) + (1-t)f(x) \end{aligned}$$
(2.4)

for all \({\bar{x}}, x\in \text{ dom }(f)\) and \(t\in [0, 1]\). The largest number \(\sigma >0\) such that (2.4) holds true is called the modulus of convexity of f. It can be shown that a proper, lower semi-continuous, convex function \(f : X \rightarrow (-\infty , \infty ]\) is strongly convex with modulus of convexity \(\sigma >0\) if and only if

$$\begin{aligned} D_f^\xi ({\bar{x}}, x) \ge \sigma \Vert x-{\bar{x}}\Vert ^2 \end{aligned}$$
(2.5)

for all \({\bar{x}}\in \text{ dom }(f)\), \(x\in \text{ dom }(\partial f)\) and \(\xi \in \partial f(x)\); see [44,  Corollary 3.5.11]. Furthermore, [44,  Corollary 3.5.11] also contains the following important result which in particular shows that the strong convexity of f implies the continuous differentiability of \(f^*\).

Proposition 2.2

Let X be a Banach space and let \(f : X \rightarrow (-\infty , \infty ]\) be a proper, lower semi-continuous, strongly convex function with modulus of convexity \(\sigma >0\). Then \(\text{ dom }(f^*) = X^*\), \(f^*\) is Fréchet differentiable and its gradient \(\nabla f^*: X^*\rightarrow X\) satisfies

$$\begin{aligned} \Vert \nabla f^*(\xi ) -\nabla f^*(\eta ) \Vert \le \frac{\Vert \xi -\eta \Vert }{2\sigma } \end{aligned}$$

for all \(\xi , \eta \in X^*\).

It should be emphasized that X in Proposition 2.2 can be an arbitrary Banach space. The gradient \(\nabla f^*\) of \(f^*\) is in general a mapping from \(X^*\) to \(X^{**}\), the second dual space of X. Proposition 2.2 actually asserts that, for each \(\xi \in X^*\), \(\nabla f^*(\xi )\) is an element of \(X^{**}\) that can be identified with an element of X via the canonical embedding \(X \rightarrow X^{**}\), and thus \(\nabla f^*\) is a mapping from \(X^*\) to X.
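For instance, in the spirit of the sparsity applications mentioned in Sect. 4, one may take \({\mathcal {R}}(x) = \alpha \Vert x\Vert _{\ell ^1} + \frac{1}{2} \Vert x\Vert _{\ell ^2}^2\) on \(X = \ell ^2\) with \(\alpha >0\). This function is strongly convex with modulus of convexity \(\sigma = 1/2\), and \(\nabla {\mathcal {R}}^*\) is the componentwise soft-thresholding map, which is 1-Lipschitz, in agreement with the bound \(1/(2\sigma )\) of Proposition 2.2. A minimal sketch (an illustrative choice, using the hypothetical naming of the earlier code fragment):

```python
import numpy as np

# grad R^* for R(x) = alpha*||x||_1 + (1/2)*||x||_2^2 on l^2 (sigma = 1/2):
# componentwise soft-thresholding, obtained by minimizing
# alpha*|x_i| + x_i**2/2 - xi_i*x_i over x_i.
def grad_R_star_soft(xi, alpha):
    return np.sign(xi) * np.maximum(np.abs(xi) - alpha, 0.0)
```

Such a closed-form \(\nabla {\mathcal {R}}^*\) is exactly what makes each step of the dual gradient method cheap, as discussed in Sect. 1.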

3 Main results

This section focuses on the study of the dual gradient method (1.3). We will make the following assumption.

Assumption 1

(i) X is a Banach space, Y is a Hilbert space, and \(A : X \rightarrow Y\) is a bounded linear operator;

(ii) \({\mathcal {R}}: X\rightarrow (-\infty , \infty ]\) is a proper, lower semi-continuous, strongly convex function with modulus of convexity \(\sigma >0\);

(iii) the equation \(Ax = y\) has a solution in \(\text{ dom }({\mathcal {R}})\).

Under Assumption 1, one can use [44,  Proposition 3.5.8] to conclude that (1.1) has a unique solution \(x^\dag \) and, for each n, the minimization problem involved in the method (1.3) has a unique minimizer \(x_n\) and thus the method (1.3) is well-defined. By the definition of \(x_n\) we have

$$\begin{aligned} A^* \lambda _n\in \partial {\mathcal {R}}(x_n). \end{aligned}$$
(3.1)

By virtue of (2.2) and Proposition 2.2 we further have

$$\begin{aligned} x_n = \nabla {\mathcal {R}}^*(A^*\lambda _n) \end{aligned}$$
(3.2)

for all \(n\ge 0\).

3.1 Convergence

The regularization property of a family of gradient type methods, including (1.4) as a special case, has been considered in [33] for solving ill-posed problems in Banach spaces. Adapting the corresponding result to the dual gradient method (1.3), we obtain the following convergence result.

Theorem 3.1

Let Assumption 1 hold and let \(L := \Vert A\Vert ^2/(2\sigma )\). Consider the dual gradient method (1.3) with \(\lambda _0=0\) for solving (1.1).

(i) If \(0<\gamma \le 1/L\), then for the integer \(n_\delta \) chosen such that \(n_\delta \rightarrow \infty \) and \(\delta ^2 n_\delta \rightarrow 0\) as \(\delta \rightarrow 0\) there hold

    $$\begin{aligned} {\mathcal {R}}(x_{n_\delta }) \rightarrow {\mathcal {R}}(x^\dag ) \quad \text{ and } \quad D_{{\mathcal {R}}}^{A^*\lambda _{n_\delta }}(x^\dag , x_{n_\delta }) \rightarrow 0 \end{aligned}$$

    and hence \(\Vert x_{n_\delta }-x^\dag \Vert \rightarrow 0\) as \(\delta \rightarrow 0\).

(ii) If \(\tau >1\) and \(\gamma >0\) are chosen such that \(1-1/\tau -L\gamma >0\), then the discrepancy principle (1.5) defines a finite integer \(n_\delta \) with

    $$\begin{aligned} {\mathcal {R}}(x_{n_\delta }) \rightarrow {\mathcal {R}}(x^\dag ) \quad \text{ and } \quad D_{{\mathcal {R}}}^{A^*\lambda _{n_\delta }}(x^\dag , x_{n_\delta }) \rightarrow 0 \end{aligned}$$

    and hence \(\Vert x_{n_\delta } - x^\dag \Vert \rightarrow 0\) as \(\delta \rightarrow 0\).

Theorem 3.1 gives the convergence results on the method (1.3) with \(\lambda _0=0\). The convergence result actually holds for any initial guess \(\lambda _0\) with the iterative sequence defined by (1.3) converging to a solution \(x^\dag \) of \(A x = y\) with the property

$$\begin{aligned} D_{\mathcal {R}}^{A^*\lambda _0}(x^\dag , x_0) = \min \left\{ D_{\mathcal {R}}^{A^*\lambda _0}(x, x_0): x\in X \text{ and } A x = y\right\} , \end{aligned}$$

where \(x_0 =\arg \min _{x\in X} \{{\mathcal {R}}(x)-\langle \lambda _0,x \rangle \}\); this can be seen from Theorem 3.1 by replacing \({\mathcal {R}}(x)\) by \(D_{\mathcal {R}}^{A^*\lambda _0}(x, x_0)\). This same remark applies to the convergence rate results in the forthcoming subsection. For simplicity of exposition, in the following we will consider only the method (1.3) with \(\lambda _0=0\).

In [33] the convergence result was stated for X to be a reflexive Banach space. The reflexivity of X was only used in [33] to show the well-definedness of each \(x_n\) by the procedure of extracting a weakly convergent subsequence from a bounded sequence. Under Assumption 1 (ii) the reflexivity of X is unnecessary as the strong convexity of \({\mathcal {R}}\) guarantees that each \(x_n\) is well-defined in an arbitrary Banach space, see [44,  Proposition 3.5.8]. This relaxation on X allows the convergence result to be used in a wider range of applications, see Sect. 4.2 for instance.

The work in [33] actually concentrates on proving part (ii) of Theorem 3.1, i.e. the regularization property of the method terminated by the discrepancy principle; part (i) was not explicitly stated there. However, the argument can easily be adapted to obtain part (i) of Theorem 3.1, i.e. the regularization property of the method under an a priori stopping rule.

It should be mentioned that the convergence \({\mathcal {R}}(x_{n_\delta }) \rightarrow {\mathcal {R}}(x^\dag )\) was not established in [33]. However, if the residual \(\Vert A x_n - y^\delta \Vert \) is monotonically decreasing with respect to n, then, following the proof in [33], one can easily establish the convergence \({\mathcal {R}}(x_{n_\delta }) \rightarrow {\mathcal {R}}(x^\dag )\) as \(\delta \rightarrow 0\). For the dual gradient method (1.3), the monotonicity of \(\Vert A x_n - y^\delta \Vert \) is established in the following result, which is also useful in the forthcoming analysis of convergence rates.

Lemma 3.2

Let Assumption 1 hold and let \(0<\gamma \le 4 \sigma /\Vert A\Vert ^2\). Then for the sequence \(\{x_n\}\) defined by (1.3) there holds

$$\begin{aligned} \Vert A x_{n+1} - y^\delta \Vert \le \Vert A x_n -y^\delta \Vert \end{aligned}$$

for all integers \(n \ge 0\).

Proof

Recall from (3.1) that \(A^*\lambda _n\in \partial {\mathcal {R}}(x_n)\) for each \(n \ge 0\). By using (2.5) and the equation \(\lambda _{n+1} = \lambda _n - \gamma (A x_n - y^\delta )\), we have

$$\begin{aligned} 2 \sigma \Vert x_{n+1} - x_n\Vert ^2&\le D_{{\mathcal {R}}}^{A^*\lambda _n}(x_{n+1}, x_n) + D_{{\mathcal {R}}}^{A^*\lambda _{n+1}} (x_n, x_{n+1}) \\&= \langle A^*\lambda _{n+1} - A^*\lambda _n, x_{n+1} - x_n\rangle \\&= \langle \lambda _{n+1} - \lambda _n, A x_{n+1} - A x_n\rangle \\&= \gamma \langle A x_n - y^\delta , A x_n - A x_{n+1}\rangle . \end{aligned}$$

In view of the polarization identity \(\langle u, u - v\rangle = \frac{1}{2} \left( \Vert u\Vert ^2 - \Vert v\Vert ^2 + \Vert u - v\Vert ^2\right) \) in Hilbert spaces, applied with \(u = A x_n - y^\delta \) and \(v = A x_{n+1} - y^\delta \), we further have

$$\begin{aligned} 2 \sigma \Vert x_{n+1} - x_n\Vert ^2&\le \frac{\gamma }{2} \left( \Vert A x_n - y^\delta \Vert ^2 - \Vert A x_{n+1} - y^\delta \Vert ^2 + \Vert A(x_{n+1} - x_n)\Vert ^2\right) \\&\le \frac{\gamma }{2} \left( \Vert A x_n - y^\delta \Vert ^2 - \Vert A x_{n+1} - y^\delta \Vert ^2\right) +\frac{\gamma \Vert A\Vert ^2}{2} \Vert x_{n+1} - x_n\Vert ^2. \end{aligned}$$

Since \(0<\gamma \le 4 \sigma /\Vert A\Vert ^2\), we thus obtain the monotonicity of \(\Vert A x_n - y^\delta \Vert ^2\) with respect to n. \(\square \)

3.2 Convergence rates

In this subsection we will derive the convergence rates of the dual gradient method (1.3) when the sought solution satisfies certain variational source conditions. The following result plays a crucial role for achieving this purpose.

Proposition 3.3

Let Assumption 1 hold and let \(d_{y^\delta }(\lambda ):= {\mathcal {R}}^*(A^*\lambda ) - \langle \lambda , y^\delta \rangle \). Let \(L := \Vert A\Vert ^2/(2\sigma )\). Consider the dual gradient method (1.3) with \(\lambda _0 = 0\). If \(0<\gamma \le 1/L\) then for any \(\lambda \in Y\) there holds

$$\begin{aligned} d_{y^\delta }(\lambda ) - d_{y^\delta }(\lambda _{n+1})&\ge \frac{1}{2\gamma (n+1)} \left( \Vert \lambda _{n+1}-\lambda \Vert ^2 - \Vert \lambda \Vert ^2\right) \\&\quad \, + \left\{ \left( \frac{1}{2} - \frac{L\gamma }{4}\right) n + \left( \frac{1}{2} - \frac{L\gamma }{2}\right) \right\} \gamma \Vert A x_n - y^\delta \Vert ^2 \end{aligned}$$

for all \(n \ge 0\).

Proof

Since \({\mathcal {R}}\) is strongly convex with modulus of convexity \(\sigma >0\), it follows from Proposition 2.2 that \({\mathcal {R}}^*\) is continuously differentiable and

$$\begin{aligned} \Vert \nabla {\mathcal {R}}^*(\xi ) -\nabla {\mathcal {R}}^*(\eta )\Vert \le \frac{\Vert \xi -\eta \Vert }{2\sigma }, \quad \forall \xi , \eta \in X^*. \end{aligned}$$

Consequently, the function \(\lambda \rightarrow d_{y^\delta }(\lambda )\) is differentiable on Y and its gradient is given by

$$\begin{aligned} \nabla d_{y^\delta }(\lambda ) = A \nabla {\mathcal {R}}^*(A^* \lambda ) - y^\delta \end{aligned}$$

with

$$\begin{aligned} \Vert \nabla d_{y^\delta }(\tilde{\lambda }) - \nabla d_{y^\delta }(\lambda ) \Vert \le L \Vert \tilde{\lambda }-\lambda \Vert , \quad \forall \tilde{\lambda }, \lambda \in Y, \end{aligned}$$

where \(L= \Vert A\Vert ^2/(2\sigma )\). Therefore

$$\begin{aligned} d_{y^\delta }(\lambda _{n+1}) \le d_{y^\delta }(\lambda _n) + \langle \nabla d_{y^\delta }(\lambda _n), \lambda _{n+1} -\lambda _n\rangle + \frac{L}{2} \Vert \lambda _{n+1} - \lambda _n\Vert ^2. \end{aligned}$$

By the convexity of \(d_{y^\delta }\) we have for any \(\lambda \in Y\) that

$$\begin{aligned} d_{y^\delta }(\lambda _n) \le d_{y^\delta }(\lambda ) + \langle \nabla d_{y^\delta }(\lambda _n), \lambda _n -\lambda \rangle . \end{aligned}$$

Combining the above equations we thus obtain

$$\begin{aligned} d_{y^\delta }(\lambda _{n+1}) \le d_{y^\delta }(\lambda ) + \langle \nabla d_{y^\delta }(\lambda _n), \lambda _{n+1}-\lambda \rangle + \frac{L}{2} \Vert \lambda _{n+1}-\lambda _n\Vert ^2. \end{aligned}$$

By using (3.2) we can see \(\nabla d_{y^\delta }(\lambda _n) = A x_n - y^\delta \) which together with the equation \(\lambda _{n+1} - \lambda _n = -\gamma (A x_n - y^\delta )\) shows that \(\nabla d_{y^\delta }(\lambda _n) = (\lambda _n- \lambda _{n+1})/\gamma \). Consequently

$$\begin{aligned} d_{y^\delta }(\lambda _{n+1}) \le d_{y^\delta }(\lambda ) + \frac{1}{\gamma }\langle \lambda _n - \lambda _{n+1}, \lambda _{n+1} - \lambda \rangle + \frac{L}{2} \Vert \lambda _{n+1} - \lambda _n\Vert ^2. \end{aligned}$$

Note that

$$\begin{aligned} \langle \lambda _n -\lambda _{n+1}, \lambda _{n+1} - \lambda \rangle = \frac{1}{2} \left( \Vert \lambda _n -\lambda \Vert ^2 -\Vert \lambda _{n+1} - \lambda \Vert ^2 - \Vert \lambda _{n+1} - \lambda _n\Vert ^2\right) . \end{aligned}$$

Therefore

$$\begin{aligned} d_{y^\delta }(\lambda ) - d_{y^\delta }(\lambda _{n+1})&\ge \frac{1}{2\gamma } \left( \Vert \lambda _{n+1}-\lambda \Vert ^2 - \Vert \lambda _n-\lambda \Vert ^2 \right) \nonumber \\&\quad + \left( \frac{1}{2\gamma }- \frac{L}{2} \right) \Vert \lambda _{n+1} - \lambda _n\Vert ^2. \end{aligned}$$
(3.3)

Let \(m\ge 0\) be any number. By summing (3.3) over n from \(n =0\) to \(n = m\) and using \(\lambda _0 = 0\) we can obtain

$$\begin{aligned} \sum _{n=0}^m \left( d_{y^\delta }(\lambda )-d_{y^\delta }(\lambda _{n+1})\right)&\ge \frac{1}{2\gamma } \left( \Vert \lambda _{m+1} -\lambda \Vert ^2 - \Vert \lambda \Vert ^2\right) \nonumber \\&\quad + \left( \frac{1}{2\gamma } - \frac{L}{2} \right) \sum _{n=0}^m \Vert \lambda _{n+1} - \lambda _n\Vert ^2. \end{aligned}$$
(3.4)

Next we take \(\lambda = \lambda _n\) in (3.3) to obtain

$$\begin{aligned} d_{y^\delta }(\lambda _n) - d_{y^\delta }(\lambda _{n+1}) \ge \left( \frac{1}{\gamma }- \frac{L}{2} \right) \Vert \lambda _{n+1} - \lambda _n\Vert ^2. \end{aligned}$$

Multiplying this inequality by n and then summing over n from \(n=0\) to \(n = m\) we can obtain

$$\begin{aligned} \sum _{n=0}^m n \left( d_{y^\delta }(\lambda _n) - d_{y^\delta }(\lambda _{n+1})\right) \ge \left( \frac{1}{\gamma }-\frac{L}{2} \right) \sum _{n=0}^m n \Vert \lambda _{n+1} -\lambda _n\Vert ^2. \end{aligned}$$

Note that

$$\begin{aligned} \sum _{n=0}^m n \left( d_{y^\delta }(\lambda _n) - d_{y^\delta }(\lambda _{n+1})\right) = - (m+1) d_{y^\delta }(\lambda _{m+1}) + \sum _{n=0}^m d_{y^\delta }(\lambda _{n+1}). \end{aligned}$$

Thus

$$\begin{aligned} -(m+1) d_{y^\delta }(\lambda _{m+1})+ \sum _{n=0}^m d_{y^\delta }(\lambda _{n+1}) \ge \left( \frac{1}{\gamma }-\frac{L}{2}\right) \sum _{n=0}^m n \Vert \lambda _{n+1} - \lambda _n\Vert ^2. \end{aligned}$$

Adding this inequality to (3.4) gives

$$\begin{aligned} (m+1) \left( d_{y^\delta }(\lambda ) - d_{y^\delta }(\lambda _{m+1})\right)&\ge \frac{1}{2\gamma } \left( \Vert \lambda _{m+1}-\lambda \Vert ^2 - \Vert \lambda \Vert ^2\right) \\&\quad + \left( \frac{1}{2\gamma } - \frac{L}{2} \right) \sum _{n=0}^m \Vert \lambda _{n+1} - \lambda _n\Vert ^2 \\&\quad + \left( \frac{1}{\gamma }-\frac{L}{2} \right) \sum _{n=0}^m n \Vert \lambda _{n+1} - \lambda _n\Vert ^2. \end{aligned}$$

Recall that \(\lambda _n -\lambda _{n+1} = \gamma (A x_n- y^\delta )\). By using the monotonicity of \(\Vert A x_n - y^\delta \Vert \) shown in Lemma 3.2 we then obtain

$$\begin{aligned}&(m+1) \left( d_{y^\delta }(\lambda ) - d_{y^\delta }(\lambda _{m+1})\right) \\&\quad \ge \frac{1}{2\gamma } \left( \Vert \lambda _{m+1} - \lambda \Vert ^2 -\Vert \lambda \Vert ^2\right) + \left( \frac{1}{2\gamma } - \frac{L}{2} \right) \gamma ^2 \sum _{n=0}^m \Vert A x_n - y^\delta \Vert ^2 \\&\qquad \, + \left( \frac{1}{\gamma }- \frac{L}{2} \right) \gamma ^2 \sum _{n=0}^m n \Vert A x_n - y^\delta \Vert ^2 \\&\quad \ge \frac{1}{2\gamma } \left( \Vert \lambda _{m+1}- \lambda \Vert ^2 -\Vert \lambda \Vert ^2 \right) + \left( \frac{1}{2} - \frac{L\gamma }{2}\right) \gamma (m+1) \Vert A x_m - y^\delta \Vert ^2 \\&\qquad + \left( 1- \frac{L\gamma }{2}\right) \gamma \frac{m(m+1)}{2} \Vert A x_m - y^\delta \Vert ^2. \end{aligned}$$

The proof is therefore complete. \(\square \)

We now assume that the unique solution \(x^\dag \) satisfies a variational source condition specified in the following assumption.

Assumption 2

For the unique solution \(x^\dag \) of (1.1) there is an error measure function \({{\mathcal {E}}}^\dag : \text{ dom }({\mathcal {R}}) \rightarrow [0, \infty )\) with \({{\mathcal {E}}}^\dag (x^\dag ) = 0\) such that

$$\begin{aligned} {{\mathcal {E}}}^\dag (x) \le {\mathcal {R}}(x) - {\mathcal {R}}(x^\dag ) + M \Vert A x - y\Vert ^q, \quad \forall x \in \text{ dom }({\mathcal {R}}) \end{aligned}$$

for some \(0<q\le 1\) and some constant \(M >0\).

Variational source conditions were first introduced in [23], as a generalization of the spectral source conditions in Hilbert spaces, to derive convergence rates for Tikhonov regularization in Banach spaces. This kind of source condition has been further generalized, refined and verified; see [15, 17, 20, 24–26] for instance. The error measure function \({{\mathcal {E}}}^\dag \) in Assumption 2 is used to measure the speed of convergence; it can be taken in various forms, and the usual choice of \({{\mathcal {E}}}^\dag \) is the Bregman distance induced by \({\mathcal {R}}\). Use of a general error measure function has the advantage of covering a wider range of applications. For instance, in reconstructing sparse solutions of ill-posed problems, one may consider the sought solution in the \(\ell ^1\) space and take \({\mathcal {R}}(x)=\Vert x\Vert _{\ell ^1}\). In this situation, convergence in the Bregman distance induced by \({\mathcal {R}}\) may not provide a useful approximation result because two points with zero Bregman distance may have arbitrarily large \(\ell ^1\)-distance. However, under certain natural conditions, the variational source condition can be verified with \({{\mathcal {E}}}^\dag (x)=\Vert x-x^\dag \Vert _{\ell ^1}\); see [15, 16].

We first derive the convergence rates for the dual gradient method (1.3) under an a priori stopping rule when \(x^\dag \) satisfies the variational source conditions specified in Assumption 2.

Theorem 3.4

Let Assumption 1 hold and let \(L: = \Vert A\Vert ^2/(2\sigma )\). If \(0<\gamma \le 1/L\) and \(x^\dag \) satisfies the variational source conditions specified in Assumption 2, then for the dual gradient method (1.3) with the initial guess \(\lambda _0= 0\) there holds

$$\begin{aligned} {{\mathcal {E}}}^\dag (x_n) \le C\left( n^{-\frac{q}{2-q}} + \delta ^q + n^{\frac{1-q}{2-q}}\delta + n \delta ^2\right) \end{aligned}$$

for all \(n \ge 1\), where C is a generic positive constant independent of n and \(\delta \). Consequently, by choosing an integer \(n_\delta \) with \(n_\delta \sim \delta ^{q-2}\) we have

$$\begin{aligned} {{\mathcal {E}}}^\dag (x_{n_\delta }) = O(\delta ^q). \end{aligned}$$

Proof

Let \(d_y(\lambda ):={\mathcal {R}}^*(A^*\lambda )-\langle \lambda , y\rangle \). Since \(0 <\gamma \le 1/L\), from Proposition 3.3 it follows that

$$\begin{aligned}&\left\{ \left( \frac{1}{2}-\frac{L\gamma }{4}\right) n +\left( \frac{1}{2}-\frac{L\gamma }{2}\right) \right\} \gamma \Vert A x_n - y^\delta \Vert ^2 \nonumber \\&\quad \le d_{y^\delta }(\lambda ) - d_{y^\delta }(\lambda _{n+1}) - \frac{1}{2\gamma (n+1)} \left( \Vert \lambda _{n+1}-\lambda \Vert ^2 -\Vert \lambda \Vert ^2\right) \nonumber \\&\quad = d_y(\lambda ) - d_y(\lambda _{n+1}) + \langle \lambda _{n+1}-\lambda , y^\delta -y\rangle \nonumber \\&\qquad - \frac{1}{2\gamma (n+1)} \left( \Vert \lambda _{n+1}-\lambda \Vert ^2-\Vert \lambda \Vert ^2\right) \end{aligned}$$
(3.5)

for all \(\lambda \in Y\). By the Cauchy-Schwarz inequality we have

$$\begin{aligned} \langle \lambda _{n+1} -\lambda , y^\delta -y\rangle \le \delta \Vert \lambda _{n+1} -\lambda \Vert \le \frac{1}{4 \gamma (n+1)} \Vert \lambda _{n+1}- \lambda \Vert ^2 + \gamma (n+1) \delta ^2. \end{aligned}$$

Thus, it follows from (3.5) that

$$\begin{aligned}&c_0 n \Vert A x_n - y^\delta \Vert ^2 + \frac{1}{4\gamma (n + 1)} \Vert \lambda _{n+1} - \lambda \Vert ^2 \\&\quad \le d_y(\lambda ) - d_y(\lambda _{n+1}) + \frac{\Vert \lambda \Vert ^2}{2\gamma (n + 1)} + \gamma (n + 1)\delta ^2, \end{aligned}$$

where \(c_0:= (1/2-L\gamma /4) \gamma >0\). By virtue of the inequality \(\Vert \lambda _{n+1}\Vert ^2 \le 2 (\Vert \lambda \Vert ^2 + \Vert \lambda _{n+1}-\lambda \Vert ^2)\) we then have

$$\begin{aligned}&c_0 n \Vert A x_n - y^\delta \Vert ^2 + \frac{1}{8\gamma (n + 1)} \Vert \lambda _{n+1}\Vert ^2 \\&\quad \le d_y(\lambda ) - d_y(\lambda _{n+1}) + \frac{3\Vert \lambda \Vert ^2}{4\gamma (n + 1)} + \gamma (n + 1)\delta ^2. \end{aligned}$$

By the Fenchel-Young inequality (2.1) and \(A x^\dag = y\) we have

$$\begin{aligned} d_y(\lambda _{n+1})&= {\mathcal {R}}^*(A^*\lambda _{n+1}) - \langle \lambda _{n+1}, A x^\dag \rangle \nonumber \\&= {\mathcal {R}}^*(A^*\lambda _{n+1}) - \langle A^* \lambda _{n+1}, x^\dag \rangle \nonumber \\&\ge - {\mathcal {R}}(x^\dag ) . \end{aligned}$$
(3.6)

Therefore

$$\begin{aligned}&c_0 n \Vert A x_n - y^\delta \Vert ^2 + \frac{1}{8 \gamma (n+1)} \Vert \lambda _{n+1}\Vert ^2 \\&\quad \le {\mathcal {R}}^*(A^* \lambda ) - \langle \lambda , y\rangle + {\mathcal {R}}(x^\dag ) + \frac{3\Vert \lambda \Vert ^2}{4 \gamma (n+1)} + \gamma (n+1) \delta ^2 \end{aligned}$$

for all \(\lambda \in Y\). Consequently

$$\begin{aligned}&c_0 n \Vert A x_n - y^\delta \Vert ^2 + \frac{1}{8 \gamma (n+1)} \Vert \lambda _{n+1}\Vert ^2 \\&\quad \le \inf _{\lambda \in Y} \left\{ {\mathcal {R}}^*(A^* \lambda ) - \langle \lambda , y\rangle + {\mathcal {R}}(x^\dag ) + \frac{3\Vert \lambda \Vert ^2}{4 \gamma (n+1)}\right\} + \gamma (n+1) \delta ^2 \\&\quad = {\mathcal {R}}(x^\dag ) - \sup _{\lambda \in Y} \left\{ -{\mathcal {R}}^*(A^* \lambda ) + \langle \lambda , y\rangle - \frac{3\Vert \lambda \Vert ^2}{4 \gamma (n+1)}\right\} + \gamma (n+1) \delta ^2. \end{aligned}$$

According to the Fenchel–Rockafellar duality formula given in Proposition 2.1, we have

$$\begin{aligned} \sup _{\lambda \in Y} \left\{ -{\mathcal {R}}^*(A^* \lambda ) + \langle \lambda , y\rangle - \frac{3\Vert \lambda \Vert ^2}{4 \gamma (n+1)}\right\} = \inf _{x\in X} \left\{ {\mathcal {R}}(x) + \frac{1}{3} \gamma (n+1) \Vert A x- y\Vert ^2\right\} . \end{aligned}$$

Indeed, by taking \(f(x) = {\mathcal {R}}(x)\) for \(x \in X\) and \(g(z) = \frac{1}{3} \gamma (n+1) \Vert z-y\Vert ^2\) for \(z \in Y\), we can obtain this identity immediately from (2.3) by noting that

$$\begin{aligned} g^*(\lambda ) = \frac{3}{4\gamma (n+1)} \Vert \lambda \Vert ^2 + \langle \lambda , y\rangle , \quad \lambda \in Y. \end{aligned}$$

Therefore

$$\begin{aligned}&c_0 n \Vert A x_n - y^\delta \Vert ^2 + \frac{1}{8 \gamma (n+1)} \Vert \lambda _{n+1}\Vert ^2 \le \eta _n + \gamma (n+1) \delta ^2, \end{aligned}$$
(3.7)

where

$$\begin{aligned} \eta _n:= \sup _{x\in X} \left\{ {\mathcal {R}}(x^\dag ) - {\mathcal {R}}(x) -\frac{1}{3} \gamma (n+1) \Vert A x - y\Vert ^2 \right\} . \end{aligned}$$
(3.8)

We now estimate \(\eta _n\) when \(x^\dag \) satisfies the variational source condition given in Assumption 2. By the nonnegativity of \({{\mathcal {E}}}^\dag \) we have \({\mathcal {R}}(x^\dag ) - {\mathcal {R}}(x) \le M \Vert A x- y\Vert ^q\). Thus

$$\begin{aligned} \eta _n&\le \sup _{x\in X} \left\{ M \Vert A x - y\Vert ^q - \frac{1}{3} \gamma (n+1) \Vert A x- y\Vert ^2 \right\} \nonumber \\&\le \sup _{s\ge 0} \left\{ M s^q - \frac{1}{3} \gamma (n+1) s^2\right\} = c_1 (n+1)^{-\frac{q}{2-q}}, \end{aligned}$$
(3.9)

where \(c_1:= \left( 1-\frac{q}{2}\right) \left( \frac{3qM}{2\gamma }\right) ^{\frac{q}{2-q}} M>0\). Combining this with (3.7) gives

$$\begin{aligned}&c_0 n \Vert A x_n - y^\delta \Vert ^2 + \frac{1}{8 \gamma (n+1)} \Vert \lambda _{n+1}\Vert ^2 \le c_1 (n+1)^{-\frac{q}{2-q}} + \gamma (n+1) \delta ^2 \end{aligned}$$

which implies that

$$\begin{aligned} \Vert A x_n - y^\delta \Vert \le C \left( n^{-\frac{1}{2-q}} + \delta \right) \quad \text{ and } \quad \Vert \lambda _n\Vert \le C \left( n^{\frac{1-q}{2-q}} + n \delta \right) . \end{aligned}$$
(3.10)

Recalling \(A^* \lambda _n \in \partial {\mathcal {R}}(x_n)\) from (3.1), we have

$$\begin{aligned} {\mathcal {R}}(x_n)- {\mathcal {R}}(x^\dag ) \le \langle A^* \lambda _n, x_n -x^\dag \rangle = \langle \lambda _n, A x_n - y\rangle . \end{aligned}$$

Therefore, by using the variational source condition specified in Assumption 2, we obtain

$$\begin{aligned} {{\mathcal {E}}}^\dag (x_n)&\le {\mathcal {R}}(x_n) -{\mathcal {R}}(x^\dag ) + M \Vert A x_n -y\Vert ^q \\&\le \langle \lambda _n, A x_n - y\rangle + M \Vert A x_n - y\Vert ^q \\&\le \Vert \lambda _n\Vert \Vert A x_n - y\Vert + M \Vert A x_n - y\Vert ^q. \end{aligned}$$

Thus, it follows from (3.10) that

$$\begin{aligned} {{\mathcal {E}}}^\dag (x_n)&\le \Vert \lambda _n\Vert \left( \Vert A x_n - y^\delta \Vert + \delta \right) + M \left( \Vert A x_n - y^\delta \Vert + \delta \right) ^q \\&\le C \left( n^{\frac{1-q}{2-q}} + n\delta \right) \left( n^{-\frac{1}{2-q}} + \delta \right) + C \left( n^{-\frac{1}{2-q}} + \delta \right) ^q\\&\le C \left( n^{-\frac{q}{2-q}} + \delta ^q + n^{\frac{1-q}{2-q}} \delta + n \delta ^2\right) . \end{aligned}$$

The proof is thus complete. \(\square \)
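For the reader's convenience we record the elementary maximization behind (3.9): for constants \(M, c>0\) and \(0<q\le 1\), the function \(s \rightarrow M s^q - c s^2\) on \([0, \infty )\) attains its maximum at \(s_* = (qM/(2c))^{1/(2-q)}\), so that

$$\begin{aligned} \sup _{s\ge 0} \left\{ M s^q - c s^2\right\} = \left( 1-\frac{q}{2}\right) M \left( \frac{qM}{2c}\right) ^{\frac{q}{2-q}}; \end{aligned}$$

taking \(c = \frac{1}{3} \gamma (n+1)\) yields the bound \(c_1 (n+1)^{-\frac{q}{2-q}}\) with \(c_1\) as defined after (3.9).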

In the proof of Theorem 3.4 we have introduced the quantity \(\eta _n\) defined by (3.8). Taking \(x = x^\dag \) in (3.8) shows \(\eta _n \ge 0\). As can be seen from the proof of Theorem 3.4, we have

$$\begin{aligned} \eta _n = \inf _{\lambda \in Y} \left\{ {\mathcal {R}}^*(A^* \lambda ) - \langle \lambda , y \rangle + {\mathcal {R}}(x^\dag ) + \frac{3\Vert \lambda \Vert ^2}{4\gamma (n+1)}\right\} \end{aligned}$$

by the Fenchel-Rockafellar duality formula. Taking \(\lambda =0\) in this equation gives \(0\le \eta _n \le {\mathcal {R}}(x^\dag ) + {\mathcal {R}}^*(0)<\infty \). The proof of Theorem 3.4 demonstrates that \(\eta _n\) can decay to 0 at certain rate if \(x^\dag \) satisfies a variational source condition.

Corollary 3.5

Let Assumption 1 hold and let \(L := \Vert A\Vert ^2/(2\sigma )\). If \(0<\gamma \le 1/L\) and if there is \(\lambda ^\dag \in Y\) such that \(A^*\lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\), then for the dual gradient method (1.3) with the initial guess \(\lambda _0 =0\) there holds

$$\begin{aligned} D_{{\mathcal {R}}}^{A^*\lambda ^\dag } (x_n, x^\dag ) \le C \left( n^{-1} + \delta + n \delta ^2\right) \end{aligned}$$
(3.11)

for all \(n \ge 1\), where C is a generic positive constant independent of n and \(\delta \). Consequently, by choosing an integer \(n_\delta \) with \(n_\delta \sim \delta ^{-1}\) we have

$$\begin{aligned} D_{{\mathcal {R}}}^{A^*\lambda ^\dag } (x_{n_\delta }, x^\dag ) = O(\delta ) \end{aligned}$$
(3.12)

and hence \(\Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{1/2})\).

Proof

We show that \(x^\dag \) satisfies the variational source condition specified in Assumption 2 with \(q = 1\). The argument is well-known, see [23] for instance. Since \(A^*\lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\) for some \(\lambda ^\dag \in Y\), we have for all \(x \in \text{ dom }({\mathcal {R}})\) that

$$\begin{aligned} D_{{\mathcal {R}}}^{A^*\lambda ^\dag } (x, x^\dag )&= {\mathcal {R}}(x) - {\mathcal {R}}(x^\dag ) -\langle \lambda ^\dag , A x - y\rangle \\&\le {\mathcal {R}}(x) - {\mathcal {R}}(x^\dag ) + \Vert \lambda ^\dag \Vert \Vert A x- y\Vert \end{aligned}$$

which shows that Assumption 2 holds with \({{\mathcal {E}}}^\dag (x) = D_{{\mathcal {R}}}^{A^*\lambda ^\dag } (x, x^\dag )\), \(M = \Vert \lambda ^\dag \Vert \) and \(q = 1\). Thus by invoking Theorem 3.4, we immediately obtain (3.11) which together with the choice \(n_\delta \sim \delta ^{-1}\) implies (3.12). By using (2.5) we then obtain \(\Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{1/2})\). \(\square \)

We next turn to deriving convergence rates of the dual gradient method (1.3) under the variational source condition given in Assumption 2 when the method is terminated by the discrepancy principle (1.5). We will use the following consequence of Proposition 3.3.

Lemma 3.6

Let Assumption 1 hold and let \(L := \Vert A\Vert ^2/(2\sigma )\). Consider the dual gradient method (1.3) with \(\lambda _0=0\). If \(\tau >1\) and \(\gamma >0\) are chosen such that \(1-1/\tau ^2- L\gamma >0\), then there is a constant \(c_2>0\) such that

$$\begin{aligned} c_2 (n + 1)\delta ^2\le \eta _n \quad \text{ and } \quad \frac{1}{8\gamma (n+1)} \Vert \lambda _{n+1}\Vert ^2 \le \eta _n + \gamma (n + 1) \delta ^2 \end{aligned}$$

for all integers \(0\le n < n_{\delta }\), where \(n_\delta \) is the integer determined by the discrepancy principle (1.5) and \(\eta _n\) is the quantity defined by (3.8).

Proof

The second estimate follows directly from (3.7); in fact it holds for all integers \(n\ge 0\). It remains only to show the first estimate. For any \(n<n_\delta \) we have \(\Vert A x_n - y^\delta \Vert >\tau \delta \). Therefore from (3.5) it follows for all \(\lambda \in Y\) that

$$\begin{aligned}&\left\{ \left( \frac{1}{2}-\frac{L\gamma }{4}\right) n + \left( \frac{1}{2}-\frac{L\gamma }{2}\right) \right\} \gamma \tau ^2 \delta ^2 \\&\quad \le d_y(\lambda ) - d_y(\lambda _{n+1}) +\langle \lambda _{n+1}-\lambda , y^\delta - y\rangle - \frac{1}{2\gamma (n+1)} \left( \Vert \lambda _{n+1}-\lambda \Vert ^2 - \Vert \lambda \Vert ^2\right) . \end{aligned}$$

By the Cauchy-Schwarz inequality we have

$$\begin{aligned}&\langle \lambda _{n+1}-\lambda , y^\delta - y\rangle \\&\quad \le \delta \Vert \lambda _{n+1}-\lambda \Vert \le \frac{1}{2\gamma (n+1)} \Vert \lambda _{n+1}-\lambda \Vert ^2 + \frac{1}{2} \gamma (n+1) \delta ^2. \end{aligned}$$

Therefore

$$\begin{aligned}&\left[ \left\{ \left( \frac{1}{2}-\frac{L\gamma }{4}\right) n + \left( \frac{1}{2}-\frac{L\gamma }{2}\right) \right\} \tau ^2 -\frac{1}{2}(n+1)\right] \gamma \delta ^2 \\&\quad \le d_y(\lambda ) - d_y(\lambda _{n+1}) + \frac{1}{2\gamma (n+1)} \Vert \lambda \Vert ^2. \end{aligned}$$

By the conditions on \(\gamma \) and \(\tau \), it is easy to see that

$$\begin{aligned} \left[ \left\{ \left( \frac{1}{2}-\frac{L\gamma }{4}\right) n + \left( \frac{1}{2}-\frac{L\gamma }{2}\right) \right\} \tau ^2 -\frac{1}{2}(n+1)\right] \gamma \ge c_2 (n+1), \end{aligned}$$

where \(c_2:= \left( (1/2-L\gamma /2)\tau ^2-1/2\right) \gamma >0\). Therefore

$$\begin{aligned} c_2(n+1) \delta ^2 \le d_y(\lambda ) - d_y(\lambda _{n+1}) + \frac{1}{2\gamma (n+1)} \Vert \lambda \Vert ^2. \end{aligned}$$

According to (3.6) we have \(d_y(\lambda _{n+1}) \ge -{\mathcal {R}}(x^\dag )\). Thus

$$\begin{aligned} c_2(n+1) \delta ^2 \le {\mathcal {R}}^*(A^*\lambda ) -\langle \lambda , y\rangle + {\mathcal {R}}(x^\dag ) + \frac{1}{2\gamma (n+1)} \Vert \lambda \Vert ^2 \end{aligned}$$

which is valid for all \(\lambda \in Y\). Consequently

$$\begin{aligned} c_2(n+1) \delta ^2&\le \inf _{\lambda \in Y} \left\{ {\mathcal {R}}^*(A^*\lambda ) -\langle \lambda , y\rangle + {\mathcal {R}}(x^\dag ) + \frac{1}{2\gamma (n+1)} \Vert \lambda \Vert ^2\right\} \\&= {\mathcal {R}}(x^\dag ) - \sup _{\lambda \in Y} \left\{ -{\mathcal {R}}^*(A^*\lambda ) +\langle \lambda , y\rangle - \frac{1}{2\gamma (n+1)} \Vert \lambda \Vert ^2\right\} . \end{aligned}$$

According to the Fenchel-Rockafellar duality formula given in Proposition 2.1, we can further obtain

$$\begin{aligned} c_2 (n+1) \delta ^2&\le {\mathcal {R}}(x^\dag ) - \inf _{x\in X} \left\{ {\mathcal {R}}(x) + \frac{1}{2} \gamma (n+1) \Vert A x- y\Vert ^2\right\} \\&= \sup _{x\in X} \left\{ {\mathcal {R}}(x^\dag ) - {\mathcal {R}}(x) - \frac{1}{2} \gamma (n+1) \Vert A x- y\Vert ^2\right\} \\&\le \eta _n \end{aligned}$$

which shows the first estimate. \(\square \)

Now we are ready to show the convergence rate result for the dual gradient method (1.3) under Assumption 2 when the method is terminated by the discrepancy principle (1.5).

Theorem 3.7

Let Assumption 1 hold and let \(L := \Vert A\Vert ^2/(2\sigma )\). Consider the dual gradient method (1.3) with the initial guess \(\lambda _0= 0\). Assume that \(\tau >1\) and \(\gamma >0\) are chosen such that \(1- 1/\tau ^2 - L \gamma >0\) and let \(n_\delta \) be the integer determined by the discrepancy principle (1.5). If \(x^\dag \) satisfies the variational source condition specified in Assumption 2, then

$$\begin{aligned} {{\mathcal {E}}}^\dag (x_{n_\delta }) = O(\delta ^q). \end{aligned}$$
(3.13)

Consequently, if there is \(\lambda ^\dag \in Y\) such that \(A^* \lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\), then

$$\begin{aligned} D_{{\mathcal {R}}}^{A^*\lambda ^\dag } (x_{n_\delta }, x^\dag ) = O(\delta ) \end{aligned}$$
(3.14)

and hence \(\Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{1/2})\).

Proof

By using the variational source condition on \(x^\dag \) specified in Assumption 2, the convexity of \({\mathcal {R}}\), and the fact \(A^*\lambda _{n_\delta } \in \partial {\mathcal {R}}(x_{n_\delta })\) we have

$$\begin{aligned} {{\mathcal {E}}}^\dag (x_{n_\delta })&\le {\mathcal {R}}(x_{n_\delta }) - {\mathcal {R}}(x^\dag ) + M\Vert A x_{n_\delta } - y\Vert ^q \\&\le \langle \lambda _{n_\delta }, A x_{n_\delta } - y\rangle + M \Vert A x_{n_\delta }-y\Vert ^q \\&\le \Vert \lambda _{n_\delta }\Vert \Vert A x_{n_\delta } - y\Vert + M \Vert A x_{n_\delta }-y\Vert ^q. \end{aligned}$$

By the definition of \(n_\delta \) we have \(\Vert A x_{n_\delta } - y^\delta \Vert \le \tau \delta \) and thus

$$\begin{aligned} \Vert A x_{n_\delta }-y\Vert \le \Vert A x_{n_\delta }-y^\delta \Vert + \Vert y^\delta - y\Vert \le (\tau +1) \delta . \end{aligned}$$

Therefore

$$\begin{aligned} {{\mathcal {E}}}^\dag (x_{n_\delta }) \le (\tau +1) \Vert \lambda _{n_\delta }\Vert \delta + M (\tau +1)^q \delta ^q. \end{aligned}$$
(3.15)

If \(n_\delta = 0\), then we have \(\lambda _{n_\delta } = 0\) and hence \({{\mathcal {E}}}^\dag (x_{n_\delta }) \le M (\tau +1)^q \delta ^q\). In the following we consider the case \(n_\delta \ge 1\). We will use Lemma 3.6 to estimate \(\Vert \lambda _{n_\delta }\Vert \). By virtue of Assumption 2 we have \(\eta _n \le c_1 (n + 1)^{-\frac{q}{2-q}}\), see (3.9). Combining this with the estimates in Lemma 3.6 we can obtain

$$\begin{aligned} c_2 (n+1)^{\frac{2}{2-q}} \delta ^2 \le c_1 \end{aligned}$$
(3.16)

and

$$\begin{aligned} \Vert \lambda _{n+1}\Vert ^2 \le 8 \gamma c_1 (n+1)^{\frac{2(1-q)}{2-q}} + 8\gamma ^2 (n+1)^2 \delta ^2 \end{aligned}$$
(3.17)

for all \(0 \le n <n_\delta \). Taking \(n = n_\delta - 1\) in (3.16) gives

$$\begin{aligned} n_\delta \le \left( \frac{c_1}{c_2 \delta ^2}\right) ^{\frac{2-q}{2}} \end{aligned}$$

which together with (3.17) with \(n = n_\delta -1\) shows that

$$\begin{aligned} \Vert \lambda _{n_\delta }\Vert \le c_3 \delta ^{q-1}, \end{aligned}$$

where \(c_3 := \sqrt{8\gamma (\gamma +c_2)(c_1/c_2)^{2-q}}\). Combining this estimate with (3.15) we finally obtain

$$\begin{aligned} {{\mathcal {E}}}^\dag (x_{n_\delta }) \le \left( c_3(\tau +1) + M(\tau +1)^q\right) \delta ^q \end{aligned}$$

which shows (3.13).

When \(A^* \lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\) for some \(\lambda ^\dag \in Y\), we know from the proof of Corollary 3.5 that Assumption 2 is satisfied with \({{\mathcal {E}}}^\dag (x) = D_{\mathcal {R}}^{A^*\lambda ^\dag }(x, x^\dag )\) and \(q = 1\). Thus, we may use (3.13) to conclude (3.14). \(\square \)

3.3 Acceleration

The dual gradient method, which generalizes the linear Landweber iteration in Hilbert spaces, is in general a slowly convergent method. To make it more useful in practice, it is necessary to accelerate it. Since the dual gradient method is obtained by applying the gradient method to the dual problem, one may consider accelerating it by applying any available acceleration strategy for gradient methods, among which Nesterov’s acceleration strategy [2, 6, 37] is the most prominent. Applying Nesterov’s accelerated gradient method to minimize the function \(d_{y^\delta }(\lambda ) = {\mathcal {R}}^*(A^*\lambda ) - \langle \lambda , y^\delta \rangle \) leads to the iteration scheme

$$\begin{aligned} \hat{\lambda }_n&= \lambda _n + \frac{n-1}{n+\alpha } (\lambda _n-\lambda _{n-1}), \\ \lambda _{n+1}&= \hat{\lambda }_n - \gamma \nabla d_{y^\delta }(\hat{\lambda }_n). \end{aligned}$$

Let \({\hat{x}}_n = \nabla {\mathcal {R}}^*(A^* \hat{\lambda }_n)\). Then \(\nabla d_{y^\delta }(\hat{\lambda }_n) = A {\hat{x}}_n - y^\delta \) and \(A^*\hat{\lambda }_n \in \partial {\mathcal {R}}({\hat{x}}_n)\) which imply that

$$\begin{aligned} \begin{aligned} \hat{\lambda }_n&= \lambda _n + \frac{n-1}{n+\alpha } (\lambda _n -\lambda _{n-1}),\\ {\hat{x}}_n&= \arg \min _{x\in X} \left\{ {\mathcal {R}}(x) - \langle \hat{\lambda }_n, A x- y^\delta \rangle \right\} , \\ \lambda _{n+1}&= \hat{\lambda }_n - \gamma (A {\hat{x}}_n - y^\delta ), \\ x_{n+1}&= \arg \min _{x\in X} \left\{ {\mathcal {R}}(x) - \langle \lambda _{n+1}, A x - y^\delta \rangle \right\} , \end{aligned} \end{aligned}$$
(3.18)

where \(\lambda _{-1} = \lambda _0=0\), \(\alpha \ge 2\) is a given number, and \(\gamma >0\) is a step size. We have the following convergence rate result when the method is terminated by an a priori stopping rule.
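For illustration, a minimal sketch of (3.18) in the same finite-dimensional setting and with the same hypothetical grad_R_star as in the code fragment of Sect. 1, terminated by an a priori choice of the stopping index as in Theorem 3.8 below, could read:

```python
import numpy as np

def accelerated_dual_gradient(A, y_delta, grad_R_star, gamma, n_stop, alpha=3.0):
    """Accelerated dual gradient method (3.18) with an a priori stopping index n_stop.

    gamma : step size with 0 < gamma <= 1/L, L = ||A||^2/(2*sigma)
    alpha : extrapolation parameter, alpha >= 2
    """
    lam_prev = np.zeros(A.shape[0])   # lambda_{-1} = 0
    lam = np.zeros(A.shape[0])        # lambda_0 = 0
    for n in range(n_stop):
        lam_hat = lam + (n - 1.0) / (n + alpha) * (lam - lam_prev)    # \hat\lambda_n
        x_hat = grad_R_star(A.T @ lam_hat)                            # \hat x_n
        lam_prev, lam = lam, lam_hat - gamma * (A @ x_hat - y_delta)  # \lambda_{n+1}
    return grad_R_star(A.T @ lam)                                     # x_{n_stop}

# Under the source condition of Theorem 3.8, n_stop of the order delta**(-1/2)
# already yields the rate O(delta**(1/2)) in the norm, whereas the basic method
# (1.3) needs of the order delta**(-1) iterations for the same rate.
```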

Theorem 3.8

Let Assumption 1 hold and let \(L := \Vert A\Vert ^2/(2\sigma )\). Consider the accelerated dual gradient method (3.18) with noisy data \(y^\delta \) satisfying \(\Vert y^\delta -y\Vert \le \delta \). Assume that \(0 <\gamma \le 1/L\) and \(\alpha \ge 2\). If \(x^\dag \) satisfies the source condition \(A^*\lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\) for some \(\lambda ^\dag \in Y\), then there exist positive constants \(c_4\) and \(c_5\) depending only on \(\gamma \) and \(\alpha \) such that

$$\begin{aligned} D_{{\mathcal {R}}}^{A^*\lambda _n} (x^\dag , x_n)\le \left( \frac{c_4 \Vert \lambda ^\dag \Vert }{n} + c_5 n\delta \right) ^2 \end{aligned}$$
(3.19)

for all \(n \ge 1\). Consequently by choosing an integer \(n_\delta \) with \(n_\delta \sim \delta ^{-1/2}\) we have

$$\begin{aligned} D_{{\mathcal {R}}}^{A^*\lambda _{n_\delta }} (x^\dag , x_{n_\delta }) = O(\delta ) \end{aligned}$$

and hence \(\Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{1/2})\) as \(\delta \rightarrow 0\).

Proof

According to the definition of \(x_n\) we have \(A^*\lambda _n\in \partial {\mathcal {R}}(x_n)\) for all \(n \ge 1\). From this fact and the condition \(A^*\lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\) it follows from (2.2) that

$$\begin{aligned} D_{\mathcal {R}}^{A^*\lambda _n}(x^\dag , x_n)&= {\mathcal {R}}(x^\dag ) - {\mathcal {R}}(x_n) - \langle \lambda _n, y -A x_n\rangle \nonumber \\&= \left\{ \langle A^* \lambda ^\dag , x^\dag \rangle - {\mathcal {R}}^*(A^* \lambda ^\dag )\right\} - \left\{ \langle A^* \lambda _n, x_n\rangle - {\mathcal {R}}^*(A^*\lambda _n)\right\} \nonumber \\&\quad - \langle \lambda _n, y - A x_n\rangle \nonumber \\&= {\mathcal {R}}^*(A^*\lambda _n) - {\mathcal {R}}^*(A^*\lambda ^\dag ) - \langle \lambda _n, y\rangle + \langle \lambda ^\dag , y\rangle \nonumber \\&= d_y(\lambda _n) - d_y(\lambda ^\dag ), \end{aligned}$$
(3.20)

where \(d_y(\lambda ) := {\mathcal {R}}^*(A^*\lambda ) - \langle \lambda , y\rangle \). We need to estimate \(d_y(\lambda _n) - d_y(\lambda ^\dag )\). This can be done by using a perturbation analysis of the accelerated gradient method, see [2, 3]. For completeness, we include a derivation here. Because \(A^*\lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\), we have \(x^\dag = \nabla {\mathcal {R}}^*(A^* \lambda ^\dag )\). Thus

$$\begin{aligned} \nabla d_y(\lambda ^\dag ) = A \nabla {\mathcal {R}}^*(A^* \lambda ^\dag ) - y = A x^\dag - y = 0. \end{aligned}$$

Since \(d_y\) is convex, this shows that \(\lambda ^\dag \) is a global minimizer of \(d_y\) over Y. Note that \(\nabla d_{y^\delta }(\lambda ) = \nabla d_y(\lambda ) + y- y^\delta \). Thus, it follows from the definition of \(\lambda _{n+1}\) that

$$\begin{aligned} \lambda _{n+1} = \hat{\lambda }_n - \gamma \left( \nabla d_y(\hat{\lambda }_n) + y - y^\delta \right) . \end{aligned}$$

Based on this, the Lipschitz continuity of \(\nabla d_y\) and the convexity of \({\mathcal {R}}\), we may use an argument similar to that in the proof of Proposition 3.3 to obtain for any \(\lambda \in Y\) that

$$\begin{aligned} d_y(\lambda _{n+1})&\le d_y(\lambda ) + \langle \nabla d_y(\hat{\lambda }_n), \lambda _{n+1}-\lambda \rangle + \frac{L}{2} \Vert \lambda _{n+1} - \hat{\lambda }_n\Vert ^2 \\&= d_y(\lambda ) + \left\langle \frac{1}{\gamma } (\hat{\lambda }_n - \lambda _{n+1}) - (y - y^\delta ), \lambda _{n+1} - \lambda \right\rangle + \frac{L}{2} \Vert \lambda _{n+1} - \hat{\lambda }_n\Vert ^2 \\&= d_y(\lambda ) + \frac{1}{2\gamma } \left( \Vert \hat{\lambda }_n-\lambda \Vert ^2 -\Vert \lambda _{n+1}-\lambda \Vert ^2 \right) -\langle y-y^\delta , \lambda _{n+1} - \lambda \rangle \\&\quad - \left( \frac{1}{2\gamma } - \frac{L}{2} \right) \Vert \lambda _{n+1} -\hat{\lambda }_n\Vert ^2. \end{aligned}$$

Since \(0<\gamma \le 1/L\) and \(\Vert y^\delta - y\Vert \le \delta \), we have

$$\begin{aligned} d_y(\lambda _{n+1}) \le d_y(\lambda ) + \frac{1}{2\gamma } \left( \Vert \hat{\lambda }_n-\lambda \Vert ^2 -\Vert \lambda _{n+1}-\lambda \Vert ^2 \right) + \delta \Vert \lambda _{n+1} - \lambda \Vert . \end{aligned}$$
(3.21)

Note that \(\frac{n-1}{n+\alpha } = \frac{t_n-1}{t_{n+1}}\) with \(t_n = \frac{n+\alpha -1}{\alpha }\). Now we take \(\lambda = \left( 1-\frac{1}{t_{n+1}}\right) \lambda _n + \frac{1}{t_{n+1}} \lambda ^\dag \) in (3.21) and use the convexity of \(d_y\) to obtain

$$\begin{aligned} d_y(\lambda _{n+1})&\le \left( 1-\frac{1}{t_{n+1}}\right) d_y(\lambda _n) + \frac{1}{t_{n+1}} d_y(\lambda ^\dag ) \\&\quad + \frac{1}{2\gamma t_{n+1}^2} \left\| \lambda ^\dag - \left( \lambda _n + t_{n+1}(\hat{\lambda }_n - \lambda _n)\right) \right\| ^2 \\&\quad - \frac{1}{2 \gamma t_{n+1}^2} \left\| \lambda ^\dag - \left( \lambda _n + t_{n+1}(\lambda _{n+1}-\lambda _n)\right) \right\| ^2 \\&\quad + \frac{\delta }{t_{n+1}} \left\| \lambda ^\dag - \left( \lambda _n + t_{n+1} (\lambda _{n+1} - \lambda _n)\right) \right\| . \end{aligned}$$

Let \(u_n = \lambda _{n-1} + t_n (\lambda _n - \lambda _{n-1})\). Then it follows from \(\hat{\lambda }_n = \lambda _n + \frac{t_n-1}{t_{n+1}} (\lambda _n -\lambda _{n-1})\) that \(\lambda _n + t_{n+1} (\hat{\lambda }_n - \lambda _n) = u_n\). Therefore

$$\begin{aligned} d_y(\lambda _{n+1})&\le \left( 1-\frac{1}{t_{n+1}}\right) d_y(\lambda _n) + \frac{1}{t_{n+1}} d_y(\lambda ^\dag ) + \frac{1}{2\gamma t_{n+1}^2} \Vert \lambda ^\dag - u_n\Vert ^2 \\&\quad - \frac{1}{2\gamma t_{n+1}^2} \Vert \lambda ^\dag - u_{n+1}\Vert ^2 + \frac{\delta }{t_{n+1}} \Vert \lambda ^\dag - u_{n+1}\Vert . \end{aligned}$$

Multiplying both sides by \(2\gamma t_{n+1}^2\), regrouping the terms and setting \(w_n := d_y(\lambda _n) - d_y(\lambda ^\dag )\), we obtain

$$\begin{aligned} 2 \gamma t_{n+1}^2 w_{n+1} - 2 \gamma t_n^2 w_n&\le 2 \gamma \rho _n w_n + \Vert \lambda ^\dag - u_n\Vert ^2 - \Vert \lambda ^\dag - u_{n+1}\Vert ^2 \nonumber \\&\quad + 2 \gamma \delta t_{n+1} \Vert \lambda ^\dag - u_{n+1}\Vert \end{aligned}$$
(3.22)

for all \(n \ge 0\), where \(\rho _n:= t_{n+1}^2 - t_{n+1} - t_n^2\). Note that \(\alpha \ge 2\) implies \(\rho _n \le 0\) for \(n \ge 1\). Let \(m \ge 1\) be any integer. Summing the above inequality over n from \(n = 1\) to \(n = m - 1\) and using \(w_n \ge 0\), we can obtain

$$\begin{aligned} 2\gamma t_m^2 w_m + \Vert \lambda ^\dag - u_m\Vert ^2 \le 2\gamma t_1^2 w_1 + \Vert \lambda ^\dag - u_1\Vert ^2 + 2 \gamma \delta \sum _{k=2}^m t_k \Vert \lambda ^\dag -u_k\Vert . \end{aligned}$$

Using (3.22) with \(n = 0\) and noting that \(t_0^2 +\rho _0 =0\) and \(u_0= 0\) we can further obtain

$$\begin{aligned}&2\gamma t_m^2 w_m + \Vert \lambda ^\dag - u_m\Vert ^2 \nonumber \\&\quad \le 2 \gamma (t_0^2 + \rho _0) w_0 + \Vert \lambda ^\dag \Vert ^2 + 2 \gamma \delta t_1 \Vert \lambda ^\dag - u_1\Vert + 2 \gamma \delta \sum _{k=2}^m t_k \Vert \lambda ^\dag - u_k\Vert \nonumber \\&\quad = \Vert \lambda ^\dag \Vert ^2 + 2 \gamma \delta \sum _{k=1}^m t_k \Vert \lambda ^\dag - u_k\Vert . \end{aligned}$$
(3.23)

According to (3.23), we have

$$\begin{aligned} \Vert \lambda ^\dag - u_m\Vert ^2 \le \Vert \lambda ^\dag \Vert ^2 + 2 \gamma \delta \sum _{k=1}^m t_k \Vert \lambda ^\dag - u_k\Vert \end{aligned}$$
(3.24)

from which we may use an induction argument to obtain

$$\begin{aligned} \Vert \lambda ^\dag - u_m\Vert \le \Vert \lambda ^\dag \Vert + 2 \gamma \delta \sum _{k=1}^m t_k \end{aligned}$$
(3.25)

for all integers \(m \ge 0\). Indeed, since \(u_0 = 0\), (3.25) holds trivially for \(m = 0\). Assume next that (3.25) holds for all \(0 \le m \le n\) for some \(n \ge 0\). We show (3.25) also holds for \(m = n+1\). If there is \(0 \le m \le n\) such that \(\Vert \lambda ^\dag - u_{n+1}\Vert \le \Vert \lambda ^\dag - u_m\Vert \), then by the induction hypothesis we have

$$\begin{aligned} \Vert \lambda ^\dag - u_{n+1}\Vert \le \Vert \lambda ^\dag \Vert + 2\gamma \delta \sum _{k=1}^m t_k \le \Vert \lambda ^\dag \Vert + 2 \gamma \delta \sum _{k=1}^{n+1} t_k. \end{aligned}$$

So we may assume \(\Vert \lambda ^\dag - u_{n+1}\Vert > \Vert \lambda ^\dag - u_m\Vert \) for all \(0 \le m \le n\). It then follows from (3.24) that

$$\begin{aligned} \Vert \lambda ^\dag - u_{n+1}\Vert ^2 \le \Vert \lambda ^\dag \Vert ^2 + 2\gamma \delta \left( \sum _{k=1}^{n+1} t_k\right) \Vert \lambda ^\dag - u_{n+1}\Vert . \end{aligned}$$

By using the elementary inequality “\(a^2 \le b^2 + ca \Longrightarrow a \le b+c\) for \(a, b, c \ge 0\)”, we again obtain

$$\begin{aligned} \Vert \lambda ^\dag - u_{n+1}\Vert \le \Vert \lambda ^\dag \Vert + 2\gamma \delta \sum _{k=1}^{n+1} t_k. \end{aligned}$$

By the induction principle, we thus obtain (3.25). Based on (3.23) and (3.25) we have

$$\begin{aligned} 2 \gamma t_m^2 w_m&\le \Vert \lambda ^\dag \Vert ^2 + 2 \gamma \delta \sum _{k=1}^m t_k \Vert \lambda ^\dag - u_k\Vert \\&\le \Vert \lambda ^\dag \Vert ^2 + 2 \gamma \delta \left( \sum _{k=1}^m t_k\right) \left( \Vert \lambda ^\dag \Vert + 2 \gamma \delta \sum _{k=1}^m t_k\right) \\&\le \left( \Vert \lambda ^\dag \Vert + 2 \gamma \delta \sum _{k=1}^m t_k\right) ^2. \end{aligned}$$

Thus, by the definition of \(t_n\) it is straightforward to see that

$$\begin{aligned} d_y(\lambda _m) - d_y(\lambda ^\dag ) \le \frac{1}{2\gamma t_m^2} \left( \Vert \lambda ^\dag \Vert + 2\gamma \delta \sum _{k=1}^m t_k\right) ^2 \le \left( \frac{c_4 \Vert \lambda ^\dag \Vert }{m} + c_5 m \delta \right) ^2, \end{aligned}$$

where \(c_4\) and \(c_5\) are two positive constants depending only on \(\gamma \) and \(\alpha \). Combining this with (3.20) we thus complete the proof of (3.19). \(\square \)

From Theorem 3.8 it follows that, under the source condition \(A^*\lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\), the method (3.18) achieves the convergence rate \(\Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{1/2})\) within \(O(\delta ^{-1/2})\) iterations. For the dual gradient method (1.3), however, we need to perform \(O(\delta ^{-1})\) iterations to achieve the same convergence rate; see Corollary 3.5. This demonstrates that the method (3.18) indeed has an acceleration effect.
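The iteration count \(O(\delta ^{-1/2})\) in this comparison can be read off from the final estimate in the above proof: treating \(\Vert \lambda ^\dag \Vert \) as a fixed constant and balancing the two terms in the bound on the dual gap gives

$$\begin{aligned} \frac{c_4 \Vert \lambda ^\dag \Vert }{m} \sim c_5 m \delta \ \Longleftrightarrow \ m \sim \delta ^{-1/2}, \quad \text{ in } \text{ which } \text{ case } \quad d_y(\lambda _m) - d_y(\lambda ^\dag ) = O(\delta ), \end{aligned}$$

which, combined with (3.20) as in the proof, is consistent with the rate \(\Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{1/2})\) stated above.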

We remark that Nesterov’s acceleration strategy was first proposed in [29] to accelerate gradient type regularization methods for linear as well as nonlinear ill-posed problems in Banach spaces, and various numerical results were reported there which demonstrate its striking performance; see also [27, 28, 35, 39, 45] for further numerical simulations. Although we have proved in Theorem 3.8 a convergence rate result for the method (3.18) under an a priori stopping rule, the regularization property of the method under the discrepancy principle has not yet been established for general strongly convex \({\mathcal {R}}\). However, when X is a Hilbert space and \({\mathcal {R}}(x) = \Vert x\Vert ^2/2\), the regularization property of the corresponding method has been established in [35, 39] based on a general acceleration framework in [21] using orthogonal polynomials; in particular, it was observed in [35] that the parameter \(\alpha \) plays an interesting role in deriving order optimal convergence rates. For an analysis of Nesterov’s acceleration for nonlinear ill-posed problems in Hilbert spaces, one may refer to [28].

4 Applications

Various applications of the dual gradient method (1.3), or equivalently the method (1.4), have been considered in [9, 29, 33] for sparsity recovery and image reconstruction through choices of \({\mathcal {R}}\) as strongly convex perturbations of the \(L^1\) and total variation functionals, and the numerical results demonstrate its good performance. In the following we provide some additional applications.

4.1 Dual projected Landweber iteration

We first consider the application of our convergence theory to linear ill-posed problems in Hilbert spaces with a convex constraint. Such problems arise in a number of real applications, including computed tomography [36], in which the sought solutions are nonnegative.

Let \(A : X \rightarrow Y\) be a bounded linear operator between two Hilbert spaces X and Y and let \(C\subset X\) be a closed convex set. Given \(y\in Y\) and assuming that \(Ax = y\) has a solution in C, we consider finding the unique solution \(x^\dag \) of \(Ax = y\) in C with minimal norm which can be stated as the minimization problem

$$\begin{aligned} \min \left\{ \frac{1}{2} \Vert x\Vert ^2 : x \in C \text{ and } Ax = y\right\} . \end{aligned}$$
(4.1)

This problem takes the form (1.1) with \({\mathcal {R}}(x) := \frac{1}{2} \Vert x\Vert ^2 + \delta _C(x)\), where \(\delta _C\) denotes the indicator function of C, i.e. \(\delta _C(x) = 0\) if \(x\in C\) and \(\infty \) otherwise. Clearly \({\mathcal {R}}\) satisfies Assumption 1 (ii). It is easy to see that for any \(\xi \in X\) the unique solution of

$$\begin{aligned} \min _{x\in X} \left\{ {\mathcal {R}}(x) -\langle \xi , x\rangle \right\} \end{aligned}$$

is given by \(P_C(\xi )\), where \(P_C\) denotes the metric projection of X onto C. Therefore, applying the algorithm (1.3) to (4.1) leads to the dual projected Landweber iteration

$$\begin{aligned} \begin{aligned} x_n&= P_C (A^*\lambda _n), \\ \lambda _{n+1}&= \lambda _n - \gamma (A x_n - y^\delta ) \end{aligned} \end{aligned}$$
(4.2)

that has been considered in [12]. Besides a stability estimate, it has been shown in [12] that, for the method (4.2) with exact data \(y^\delta =y\), if \(x^\dag \in P_C (A^*Y)\) then

$$\begin{aligned} \sum _{n=1}^\infty \Vert x_n - x^\dag \Vert ^2 <\infty \end{aligned}$$

which implies \(\Vert x_n - x^\dag \Vert \rightarrow 0\) as \(n \rightarrow \infty \) but does not provide an error estimate unless \(\Vert x_n -x^\dag \Vert \) is monotonically decreasing, which is unfortunately unknown. Therefore, the work in [12] provides little information about the regularization property of the method (4.2). It is natural to ask whether the method (4.2) yields a regularization method under a priori or a posteriori stopping rules and whether it is possible to derive error estimates under suitable source conditions on the sought solution. Applying our convergence theory provides satisfactory answers to these questions with a rather complete analysis of the method (4.2), see Corollary 4.1 below, which goes far beyond the result in [12]. In particular, we obtain convergence and convergence rates when the method (4.2) is terminated by either an a priori stopping rule or the discrepancy principle (1.5).
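To illustrate how (4.2) may be used in practice, we include a minimal Python sketch for a discretized problem; the matrix A, the projection project_C, the data y_delta and all parameter values are placeholders rather than part of the analysis, and the stopping rule is taken in the standard form \(\Vert A x_n - y^\delta \Vert \le \tau \delta \), which we assume to be the form of the discrepancy principle (1.5).

```python
import numpy as np

def dual_projected_landweber(A, y_delta, delta, project_C, gamma, tau=2.0,
                             max_iter=100000):
    """Minimal sketch of iteration (4.2) with lambda_0 = 0, stopped by the
    discrepancy criterion ||A x_n - y^delta|| <= tau * delta."""
    lam = np.zeros(A.shape[0])            # dual variable lambda_0 = 0 in Y
    for n in range(max_iter):
        x = project_C(A.T @ lam)          # x_n = P_C(A^* lambda_n)
        residual = A @ x - y_delta
        if np.linalg.norm(residual) <= tau * delta:
            return x, n                   # stopping index n_delta
        lam = lam - gamma * residual      # dual gradient step
    return x, max_iter

# Hypothetical usage with C = {x : x >= 0}, so that P_C is the componentwise
# positive part; tau and gamma are chosen so that 1 - 1/tau - gamma*||A||^2 > 0,
# as required in Corollary 4.1 (ii):
#   gamma = 0.2 / np.linalg.norm(A, 2) ** 2
#   x_rec, n_stop = dual_projected_landweber(A, y_delta, delta,
#                                            lambda z: np.maximum(z, 0.0), gamma)
```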

Corollary 4.1

For the linear ill-posed problem (4.1) in Hilbert spaces constrained by a closed convex set C, consider the dual projected Landweber iteration (4.2) with \(\lambda _0 =0\) and with noisy data \(y^\delta \) satisfying \(\Vert y^\delta - y\Vert \le \delta \).

  1. (i)

    If \(0<\gamma \le 1/\Vert A\Vert ^2\) then for the integer \(n_\delta \) satisfying \(n_\delta \rightarrow \infty \) and \(\delta ^2 n_\delta \rightarrow 0\) as \(\delta \rightarrow 0\) there holds \(\Vert x_{n_\delta }-x^\dag \Vert \rightarrow 0\) as \(\delta \rightarrow 0\). If in addition \(x^\dag \) satisfies the projected source condition

    $$\begin{aligned} x^\dag = P_C((A^*A)^{\nu /2} \omega ) \text{ for } \text{ some } 0 < \nu \le 1 \text{ and } \omega \in X, \end{aligned}$$
    (4.3)

    then with the choice \(n_\delta \sim \delta ^{-\frac{2}{1+\nu }}\) we have \(\Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{\frac{\nu }{1+\nu }})\).

  2. (ii)

    If \(\tau >1\) and \(\gamma >0\) are chosen such that \(1 - 1/\tau - \Vert A\Vert ^2 \gamma >0\), then the discrepancy principle (1.5) defines a finite integer \(n_\delta \) with \(\Vert x_{n_\delta } - x^\dag \Vert \rightarrow 0\) as \(\delta \rightarrow 0\). If in addition \(x^\dag \) satisfies the projected source condition (4.3), then \(\Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{\frac{\nu }{1+\nu }})\).

Proof

According to Theorems 3.1, 3.4 and 3.7, it remains only to show that, under the projected source condition (4.3), \(x^\dag \) satisfies the variational source condition

$$\begin{aligned} \frac{1}{4} \Vert x-x^\dag \Vert ^2 \le {\mathcal {R}}(x) - {\mathcal {R}}(x^\dag ) + c_\nu \Vert \omega \Vert ^{\frac{2}{1+\nu }} \Vert A x - y\Vert ^{\frac{2\nu }{1+\nu }} \end{aligned}$$
(4.4)

for all \(x \in \text{ dom }({\mathcal {R}}) = C\), where \(c_\nu := 2^{-\frac{2\nu }{1+\nu }} (1+\nu ) (1-\nu )^{\frac{1-\nu }{1+\nu }}\). To see this, note first that for any \(x \in C\) there holds

$$\begin{aligned} \frac{1}{2} \Vert x-x^\dag \Vert ^2 - {\mathcal {R}}(x) + {\mathcal {R}}(x^\dag )&= \frac{1}{2} \Vert x-x^\dag \Vert ^2 - \frac{1}{2} \Vert x\Vert ^2 + \frac{1}{2} \Vert x^\dag \Vert ^2 \\&= \langle x^\dag , x^\dag -x\rangle . \end{aligned}$$

By using \(x^\dag = P_C ((A^* A)^{\nu /2} \omega )\) and the property of the projection \(P_C\) we have

$$\begin{aligned} \langle (A^*A)^{\nu /2}\omega - x^\dag , x - x^\dag \rangle \le 0, \quad \forall x \in C \end{aligned}$$

which implies that

$$\begin{aligned} \langle x^\dag , x^\dag -x\rangle \le \langle (A^*A)^{\nu /2} \omega , x^\dag - x\rangle = \langle \omega , (A^*A)^{\nu /2} (x^\dag -x)\rangle . \end{aligned}$$

Therefore

$$\begin{aligned} \frac{1}{2} \Vert x-x^\dag \Vert ^2 - {\mathcal {R}}(x) + {\mathcal {R}}(x^\dag )&\le \langle \omega , (A^*A)^{\nu /2} (x^\dag -x)\rangle \\&\le \Vert \omega \Vert \Vert (A^*A)^{\nu /2} (x^\dag -x)\Vert . \end{aligned}$$

By invoking the interpolation inequality [13] and Young’s inequality we can further obtain

$$\begin{aligned} \frac{1}{2} \Vert x-x^\dag \Vert ^2 - {\mathcal {R}}(x) + {\mathcal {R}}(x^\dag )&\le \Vert \omega \Vert \Vert x-x^\dag \Vert ^{1-\nu } \Vert A(x-x^\dag )\Vert ^\nu \\&\le c_\nu \Vert \omega \Vert ^{\frac{2}{1+\nu }} \Vert A x - y\Vert ^{\frac{2\nu }{1+\nu }} + \frac{1}{4} \Vert x-x^\dag \Vert ^2 \end{aligned}$$

which shows (4.4). The proof is therefore complete. \(\square \)

We remark that the projected source condition (4.3) with \(\nu =1\), i.e. \(x^\dag \in P_C(A^* Y)\), was first used in [38] to derive the convergence rate of Tikhonov regularization in Hilbert spaces with convex constraint.

The dual projected Landweber iteration (4.2) can be accelerated by Nesterov’s acceleration strategy. As was derived in Sect. 3.3, the accelerated scheme takes the form

$$\begin{aligned} \begin{aligned} \hat{\lambda }_n&= \lambda _n + \frac{n-1}{n+\alpha } (\lambda _n -\lambda _{n-1}), \qquad {\hat{x}}_n = P_C(A^* \hat{\lambda }_n), \\ \lambda _{n+1}&= \hat{\lambda }_n - \gamma (A {\hat{x}}_n - y^\delta ), \qquad x_{n+1} = P_C (A^* \lambda _{n+1}). \end{aligned} \end{aligned}$$
(4.5)
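For orientation, a minimal Python sketch of the accelerated scheme (4.5) is given below; as before, the matrix A, the projection project_C and the data y_delta are placeholders, and the iteration is simply run for a prescribed number of steps, e.g. \(n_{\text {stop}} \sim \delta ^{-1/2}\) as suggested by Corollary 4.2 below.

```python
import numpy as np

def accelerated_dual_projected_landweber(A, y_delta, project_C, gamma,
                                         alpha=3.0, n_stop=100):
    """Minimal sketch of scheme (4.5) with lambda_0 = lambda_{-1} = 0,
    run for a prescribed number of iterations n_stop."""
    lam_prev = np.zeros(A.shape[0])       # lambda_{-1} = 0
    lam = np.zeros(A.shape[0])            # lambda_0 = 0
    for n in range(n_stop):
        lam_hat = lam + (n - 1.0) / (n + alpha) * (lam - lam_prev)
        x_hat = project_C(A.T @ lam_hat)                    # hat{x}_n
        lam_prev, lam = lam, lam_hat - gamma * (A @ x_hat - y_delta)
    return project_C(A.T @ lam)           # final primal iterate P_C(A^* lambda)
```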

By noting that \(\partial {\mathcal {R}}(x) = x + \partial \delta _C(x)\), it is easy to see that an element \(\lambda ^\dag \in Y\) satisfies \(A^* \lambda ^\dag \in \partial {\mathcal {R}}(x^\dag )\) if and only if \(A^*\lambda ^\dag - x^\dag \in \partial \delta _C(x^\dag )\), i.e. if and only if \(x^\dag = P_C(A^*\lambda ^\dag )\). Therefore, by using Theorem 3.8, we can obtain the following convergence rate result for the method (4.5).

Corollary 4.2

For the problem (4.1) in Hilbert spaces constrained by a closed convex set C, consider the method (4.5) with \(\lambda _0 = \lambda _{-1} = 0\). If \(0 <\gamma \le 1/\Vert A\Vert ^2\), \(\alpha \ge 2\) and \(x^\dag \in P_C(A^* Y)\), then with the choice \(n_\delta \sim \delta ^{-1/2}\) we have

$$\begin{aligned} \Vert x_{n_\delta } - x^\dag \Vert = O(\delta ^{1/2}) \end{aligned}$$

as \(\delta \rightarrow 0\).

4.2 An entropic dual gradient method

Let \(\varOmega \subset {{\mathbb {R}}}^d\) be a bounded domain and let \(A : L^1(\varOmega ) \rightarrow Y\) be a bounded linear operator, where Y is a Hilbert space. For an element \(y \in Y\) in the range of A, we consider the equation \(Ax = y\). We assume that the sought solution \(x^\dag \) is a probability density function, i.e. \(x^\dag \ge 0\) a.e. on \(\varOmega \) and \(\int _\varOmega x^\dag = 1\). We may find such a solution by considering the convex minimization problem

$$\begin{aligned} \min \left\{ {\mathcal {R}}(x) : = f(x) + \delta _\varDelta (x): x\in L^1(\varOmega ) \text{ and } A x = y\right\} , \end{aligned}$$
(4.6)

where \(\delta _\varDelta \) denotes the indicator function of the closed convex set

$$\begin{aligned} \varDelta := \left\{ x\in L^1(\varOmega ): x\ge 0 \text{ a.e. } \text{ on } \varOmega \text{ and } \int _\varOmega x =1\right\} \end{aligned}$$

in \(L^1(\varOmega )\) and f denotes the negative of the Boltzmann-Shannon entropy, i.e.

$$\begin{aligned} f(x) := \left\{ \begin{array}{lll} \int _\varOmega x \log x &{} \text{ if } x \in L_+^1(\varOmega ) \text{ and } x \log x \in L^1(\varOmega ), \\ \infty &{} \text{ otherwise } \end{array}\right. \end{aligned}$$

where, here and below, \(L_+^p(\varOmega ):= \{x \in L^p(\varOmega ): x \ge 0 \text{ a.e. } \text{ on } \varOmega \}\) for each \(1\le p\le \infty \). The Boltzmann-Shannon entropy has been used in Tikhonov regularization as a stable functional to determine nonnegative solutions; see [1, 11, 14, 31] for instance.

In the following we summarize some useful properties of the negative of the Boltzmann-Shannon entropy f:

  1. (i)

    f is proper, lower semi-continuous and convex on \(L^1(\varOmega )\); see [1, 11].

  2. (ii)

    f is subdifferentiable at \(x\in L^1(\varOmega )\) if and only if \(x \in L_+^\infty (\varOmega )\) and is bounded away from zero, i.e.

    $$\begin{aligned} \text{ dom }(\partial f) = \{x \in L_+^\infty (\varOmega ): x\ge \beta \text{ on } \varOmega \text{ for } \text{ some } \text{ constant } \beta >0\}. \end{aligned}$$

    Moreover for each \(x\in \text{ dom }(\partial f)\) there holds \(\partial f(x) = \{1 + \log x\}\); see [4,  Proposition 2.53].

  3. (iii)

By a straightforward calculation (recorded after this list) one can see that for any \(x \in \text{ dom }(\partial f)\) and \({\tilde{x}}\in \text{ dom }(f)\), the Bregman distance induced by f is the Kullback-Leibler functional

    $$\begin{aligned} D({\tilde{x}}, x) := \int _\varOmega \left( {\tilde{x}} \log \frac{\tilde{x}}{x} - {\tilde{x}} + x\right) . \end{aligned}$$
  4. (iv)

    For any \(x \in \text{ dom }(\partial f)\) and \({\tilde{x}} \in \text{ dom }(f)\) there holds (see [7,  Lemma 2.2])

    $$\begin{aligned} \Vert x-{\tilde{x}}\Vert _{L^1(\varOmega )}^2 \le \left( \frac{4}{3} \Vert x\Vert _{L^1(\varOmega )} + \frac{2}{3} \Vert {\tilde{x}}\Vert _{L^1(\varOmega )}\right) D({\tilde{x}}, x). \end{aligned}$$
    (4.7)
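For completeness, the calculation behind item (iii) is as follows: since \(\partial f(x) = \{1 + \log x\}\) by item (ii), the Bregman distance induced by f is

$$\begin{aligned} D({\tilde{x}}, x)&= f({\tilde{x}}) - f(x) - \langle 1 + \log x, {\tilde{x}} - x\rangle \\&= \int _\varOmega \left( {\tilde{x}} \log {\tilde{x}} - x \log x - (1+\log x)({\tilde{x}}-x)\right) \\&= \int _\varOmega \left( {\tilde{x}} \log \frac{{\tilde{x}}}{x} - {\tilde{x}} + x\right) , \end{aligned}$$

which is exactly the Kullback-Leibler functional stated in item (iii).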

Based on these facts, we can see that the function \({\mathcal {R}}\) defined in (4.6) satisfies Assumption 1. In order to apply the dual gradient method (1.3) to solve (4.6), we need to determine a closed-form expression for the solution of the minimization problem involved in the algorithm. By the Karush-Kuhn-Tucker theory, it is easy to see that, for any \(\ell \in L^\infty (\varOmega )\), the unique minimizer of

$$\begin{aligned} \min \left\{ \int _\varOmega (x\log x - \ell x): x \ge 0 \text{ a.e. } \text{ on } \varOmega \text{ and } \int _\varOmega x = 1\right\} \end{aligned}$$

is given by \({\hat{x}} = e^\ell /\int _\varOmega e^\ell \). Therefore we can obtain from the algorithm (1.3) the following entropic dual gradient method

$$\begin{aligned} \begin{aligned} x_n&= \frac{1}{\int _\varOmega e^{A^*\lambda _n}} e^{A^*\lambda _n}, \\ \lambda _{n+1}&= \lambda _n - \gamma (A x_n - y^\delta ). \end{aligned} \end{aligned}$$
(4.8)
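Before stating the convergence result, we record a minimal Python sketch of (4.8) for a discretization of \(\varOmega \) by a grid with cell volume h; the matrix A, the data y_delta and all parameter values are placeholders, and the exponential is shifted by its maximum for numerical stability, which does not change \(x_n\) after normalization.

```python
import numpy as np

def entropic_dual_gradient(A, y_delta, delta, h, gamma, tau=2.0,
                           max_iter=100000):
    """Minimal sketch of (4.8) with lambda_0 = 0 on a grid with cell volume h,
    stopped by the discrepancy criterion ||A x_n - y^delta|| <= tau * delta."""
    lam = np.zeros(A.shape[0])
    for n in range(max_iter):
        ell = A.T @ lam                   # A^* lambda_n on the grid
        w = np.exp(ell - ell.max())       # stable evaluation of exp(A^* lambda_n)
        x = w / (h * w.sum())             # x_n with h * sum(x) = 1, i.e. int x_n = 1
        residual = A @ x - y_delta
        if np.linalg.norm(residual) <= tau * delta:
            return x, n
        lam = lam - gamma * residual      # dual gradient step
    return x, max_iter
```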

We have the following convergence result.

Corollary 4.3

For the convex problem (4.6), consider the entropic dual gradient method (4.8) with \(\lambda _0 = 0\) and with noisy data \(y^\delta \) satisfying \(\Vert y^\delta - y\Vert \le \delta \).

  1. (i)

    If \(0 <\gamma \le 1/\Vert A\Vert ^2\) then for the integer \(n_\delta \) satisfying \(n_\delta \rightarrow \infty \) and \(\delta ^2 n_\delta \rightarrow 0\) as \(\delta \rightarrow 0\) there holds \(\Vert x_{n_\delta } - x^\dag \Vert \rightarrow 0\) as \(\delta \rightarrow 0\). If in addition \(x^\dag \) satisfies the source condition

    $$\begin{aligned} 1 + \log x^\dag = A^*\lambda ^\dag \quad \text{ for } \text{ some } \lambda ^\dag \in Y, \end{aligned}$$
    (4.9)

    then with the choice \(n_\delta \sim \delta ^{-1}\) we have \(\Vert x_{n_\delta } -x^\dag \Vert _{L^1(\varOmega )} = O(\delta ^{1/2})\).

  2. (ii)

    If \(\tau >1\) and \(\gamma >0\) are chosen such that \(1-1/\tau -\Vert A\Vert ^2 \gamma >0\), then the discrepancy principle (1.5) defines a finite integer \(n_\delta \) with \(\Vert x_{n_\delta } - x^\dag \Vert _{L^1(\varOmega )} \rightarrow 0\) as \(\delta \rightarrow 0\). If in addition \(x^\dag \) satisfies the source condition (4.9), then \(\Vert x_{n_\delta } - x^\dag \Vert _{L^1(\varOmega )} = O(\delta ^{1/2})\).

Proof

Under (4.9) there holds \(A^* \lambda ^\dag \in \partial f(x^\dag )\). Therefore, by using (4.7), we have for any \(x\in \text{ dom }({\mathcal {R}})\) that

$$\begin{aligned} \frac{1}{2} \Vert x-x^\dag \Vert _{L^1(\varOmega )}^2&\le D(x, x^\dag ) = f(x) - f(x^\dag ) - \langle A^* \lambda ^\dag , x-x^\dag \rangle \\&= {\mathcal {R}}(x)-{\mathcal {R}}(x^\dag ) - \langle \lambda ^\dag , A x - y\rangle \\&\le {\mathcal {R}}(x) - {\mathcal {R}}(x^\dag ) + \Vert \lambda ^\dag \Vert \Vert A x - y\Vert , \end{aligned}$$

where we used \({\mathcal {R}}(x) = f(x) \) and \(\int _\varOmega x =1\) for \(x \in \text{ dom }({\mathcal {R}})\). Thus \(x^\dag \) satisfies the variational source condition specified in Assumption 2 with \({{\mathcal {E}}}^\dag (x) = \frac{1}{2} \Vert x-x^\dag \Vert _{L^1(\varOmega )}^2\), \(M =\Vert \lambda ^\dag \Vert \) and \(q =1\). Now we can complete the proof by applying Theorems 3.1, 3.7 and Corollary 3.5 to the method (4.8). \(\square \)

The source condition (4.9) has been used in [11, 14] in which one may find further discussions. We would like to mention that an entropic Landweber method of the form

$$\begin{aligned} x_{n+1} = \frac{x_n e^{\gamma A^*(y^\delta -A x_n)}}{\int _\varOmega x_n e^{\gamma A^*(y^\delta - A x_n)}} \end{aligned}$$
(4.10)

has been proposed and studied in the recent paper [10], where weak convergence in \(L^1(\varOmega )\) is proved without relying on source conditions and, under the source condition (4.9), an error estimate is derived when the method is terminated by an a priori stopping rule. Our method (4.8) differs from (4.10) due to its primal-dual nature. As stated in Corollary 4.3, our method (4.8) enjoys stronger convergence properties: it admits strong convergence in \(L^1(\varOmega )\) in general and, when the source condition (4.9) is satisfied, an error estimate can be derived when the method is terminated by either an a priori stopping rule or the discrepancy principle.

Applying Nesterov’s acceleration strategy, we can accelerate the entropic dual gradient method (4.8) by the following scheme

$$\begin{aligned} \begin{aligned} \hat{\lambda }_n&= \lambda _n + \frac{n-1}{n+\alpha } (\lambda _n - \lambda _{n-1}), \qquad {\hat{x}}_n = \frac{1}{\int _\varOmega e^{A^*\hat{\lambda }_n}} e^{A^*\hat{\lambda }_n}, \\ \lambda _{n+1}&= \hat{\lambda }_n - \gamma (A {\hat{x}}_n - y^\delta ), \qquad x_{n+1} = \frac{1}{\int _\varOmega e^{A^*\lambda _{n+1}}} e^{A^*\lambda _{n+1}}. \end{aligned} \end{aligned}$$
(4.11)
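Analogously to (4.5), the scheme (4.11) admits the following minimal Python sketch on the same placeholder discretization; \(n_{\text {stop}} \sim \delta ^{-1/2}\) corresponds to the a priori choice in Corollary 4.4 below.

```python
import numpy as np

def primal_from_dual(A, lam, h):
    """x = exp(A^* lambda) / int exp(A^* lambda) on a grid with cell volume h."""
    ell = A.T @ lam
    w = np.exp(ell - ell.max())
    return w / (h * w.sum())

def accelerated_entropic_dual_gradient(A, y_delta, h, gamma, alpha=3.0,
                                       n_stop=100):
    """Minimal sketch of scheme (4.11) with lambda_0 = lambda_{-1} = 0."""
    lam_prev = np.zeros(A.shape[0])
    lam = np.zeros(A.shape[0])
    for n in range(n_stop):
        lam_hat = lam + (n - 1.0) / (n + alpha) * (lam - lam_prev)
        x_hat = primal_from_dual(A, lam_hat, h)             # hat{x}_n
        lam_prev, lam = lam, lam_hat - gamma * (A @ x_hat - y_delta)
    return primal_from_dual(A, lam, h)                      # final primal iterate
```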

By using Theorem 3.8 we can obtain the following convergence rate result on the method (4.11) with noisy data.

Corollary 4.4

For the minimization problem (4.6), consider the method (4.11) with \(\lambda _0 = \lambda _{-1} = 0\). If \(0 <\gamma \le 1/\Vert A\Vert ^2\), \(\alpha \ge 2\) and \(x^\dag \) satisfies the source condition (4.9), then with the choice \(n_\delta \sim \delta ^{-1/2}\) we have

$$\begin{aligned} \Vert x_{n_\delta } - x^\dag \Vert _{L^1(\varOmega )} = O(\delta ^{1/2}) \end{aligned}$$

as \(\delta \rightarrow 0\).

5 Conclusion

Due to its simplicity and relatively low per-iteration complexity, Landweber iteration has received extensive attention in the inverse problems community. In recent years, Landweber iteration has been extended to solve inverse problems in Banach spaces with general uniformly convex regularization terms, and various convergence properties have been established. However, except for the linear and nonlinear Landweber iteration in Hilbert spaces, convergence rates are in general missing from the existing convergence theory.

This paper attempts to fill this gap by providing a novel technique to derive convergence rates for a class of Landweber type methods. We considered a class of ill-posed problems defined by a bounded linear operator from a Banach space to a Hilbert space and used a strongly convex regularization functional to select the sought solution. The dual problem turns out to have a smooth objective function and thus can be solved by the usual gradient method. The resulting method is called a dual gradient method, which is a special case of the Landweber type method in Banach spaces. Applying gradient methods to the dual problem allows us to interpret the method from a new perspective, which enables us to use tools from convex analysis and optimization to carry out the analysis. We obtained the convergence and convergence rates of the dual gradient method when it is terminated by either an a priori stopping rule or the discrepancy principle. Furthermore, by applying Nesterov’s acceleration strategy to the dual problem we proposed an accelerated dual gradient method and established a convergence rate result under an a priori stopping rule. We also discussed some applications; in particular, as a direct application of our convergence theory, we provided a rather complete analysis of the dual projected Landweber iteration of Eicke, for which only a preliminary result is available in the existing literature.

There are a few questions which might be interesting for future development.

  1. (i)

    We established convergence rate results for the dual gradient method (1.3) which require A to be a bounded linear operator and Y a Hilbert space. Is it possible to establish a general convergence rate result for Landweber iteration for solving linear as well as nonlinear ill-posed problems in Banach spaces?

  2. (ii)

For the dual gradient method (1.3), the analysis under the a priori stopping rule allows the step-size to be taken as \(0<\gamma \le 1/L\), while the analysis under the discrepancy principle (1.5) requires \(\tau >1\) and \(\gamma >0\) to satisfy \(1-1/\tau ^2-L\gamma >0\), which means that either \(\tau \) has to be large or \(\gamma \) has to be small. Is it possible to develop a convergence theory for the discrepancy principle under merely the conditions \(\tau >1\) and \(0<\gamma \le 1/L\)?

  3. (iii)

    In Sect. 3.3 we considered the accelerated dual gradient method (3.18) and established a convergence rate result under an a priori stopping rule. Is it possible to establish the convergence and convergence rate result of (3.18) under the discrepancy principle?