1 Introduction

Consider an unconstrained nonsmooth convex optimization problem of the form

$$ \min_{x\in \mathbb {R}^n} F(x)=f_1(x)+f_2(x), $$
(1)

where \(f_1\) is a nonsmooth convex function given by

$$ f_1(x)=\sum_{J\in\mathcal{J}}w_J \|x_J\|+\lambda\|x\|_1, $$
(2)

with \(\mathcal{J}\) a partition of {1,⋯,n} and \(\lambda,\;\{ w_{J}\}_{J\in\mathcal{J}}\) given nonnegative constants; \(f_2(x)\) is a composite convex function

$$ f_2(x)=h(Ax), $$
(3)

where \(h:\mathbb {R}^{m}\mapsto \mathbb {R}\) is a continuously differentiable strongly convex function and \(A\in \mathbb {R}^{m\times n}\) is a given matrix. Notice that unless A has full column rank (i.e., \({\rm rank}(A)=n\)), the composite function \(f_2(x)\) is not strongly convex.

1.1 Motivating Applications

Nonsmooth convex optimization problems of the form (1) arise in many contemporary statistical and signal processing applications [4, 24], including signal denoising, compressive sensing, sparse linear regression and high-dimensional multinomial classification. To motivate the nonsmooth convex optimization problem (1), we briefly outline some application examples below.

Example 1

Suppose that we have a noisy measurement vector \(d\in \mathbb {R}^{m}\) about an unknown sparse vector \(x\in \mathbb {R}^{n}\), where the signal model is linear and given by

$$d\approx Ax $$

for some given matrix \(A \in \mathbb {R}^{m\times n}\). A popular technique to estimate the sparse vector x is called Lasso [17, 23] which performs simultaneous estimation and variable selection. Furthermore, a related technique called group Lasso [22] acts like Lasso at the group level. Since the group Lasso does not yield sparsity within a group, a generalized model that yields sparsity at both the group and individual feature levels was proposed in [5]. This sparse group Lasso criterion is formulated as

$$ \min_{x\in \mathbb {R}^n} \frac{1}{2}\|Ax-d \|^2+\sum_{J\in\mathcal{J}}w_J \|x_J\|+\lambda\|x\|_1, $$
(4)

where the minimization of \(\|Ax-d\|^2\) has a denoising effect, the middle term promotes sparsity at the group level, and \(\mathcal{J}\) is a partition of {1,2,⋯,n} into groups. The \(\ell_1\)-norm term sparsifies the solution x and effectively selects the most significant components of x.

Obviously, the sparse group Lasso problem (4) is in the form of the nonsmooth convex optimization problem (1) with \(h(\cdot)=\frac{1}{2}\|\cdot-d\|^{2}\). Moreover, if λ=0, (4) reduces to the group Lasso; if \(w_J=0\) for all \(J\in \mathcal{J}\), (4) is exactly the Lasso problem. We refer readers to [1, 9] for recent applications of the group Lasso technique.
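As a concrete illustration (not part of the original text), the sparse group Lasso objective (4) can be evaluated as in the following Python sketch; the group partition, the weights \(w_J\) and λ are hypothetical inputs.

```python
import numpy as np

def sparse_group_lasso_objective(x, A, d, groups, w, lam):
    """Evaluate (4): 0.5*||Ax - d||^2 + sum_J w_J*||x_J|| + lam*||x||_1.

    groups : list of index arrays forming a partition of {0, ..., n-1}
    w      : nonnegative group weights w_J (one per group)
    lam    : nonnegative l1 parameter lambda
    """
    residual = A @ x - d
    smooth = 0.5 * residual @ residual        # f_2(x) = h(Ax) with h = 0.5*||. - d||^2
    group_term = sum(wJ * np.linalg.norm(x[J]) for J, wJ in zip(groups, w))
    return smooth + group_term + lam * np.sum(np.abs(x))
```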

Example 2

In logistic regression, we are given a set of n-dimensional feature vectors \(a_i\) (i=1,2,⋯,m) and the corresponding class labels \(d_i\in\{0,1\}\). The probability distribution of the class label d given a feature vector a and a logistic regression coefficient vector \(x\in \mathbb {R}^{n}\) can be described by

$$p(d = 1\mid a; x) = {\frac{\exp(a^Tx)}{1+ \exp(a^Tx)}.} $$

The logistic group Lasso technique [10] corresponds to selecting x by

$$\min_{x\in \mathbb {R}^n} \sum_{i=1}^m \bigl(\log\bigl(1+\exp\bigl(a_i^Tx\bigr) \bigr)-d_ia_i^Tx \bigr) +\sum_{J\in\mathcal{J}}w_J\|x_J\|. $$

Again, this is in the form of the nonsmooth convex optimization problem (1) with λ=0 and

$$h(u)=\sum_{i=1}^m \bigl(\log\bigl(1+ \exp(u_i)\bigr)-d_iu_i \bigr), $$

which is strongly convex in u. We refer readers to [6, 7, 13, 16, 18, 20–22] for further studies on group Lasso-type statistical techniques.
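For reference, a minimal sketch (with illustrative names, not from the original text) of this h, its gradient, and the resulting \(\nabla f_2(x)=A^T\nabla h(Ax)\), where the rows of A are the feature vectors \(a_i^T\):

```python
import numpy as np

def h_logistic(u, d):
    """h(u) = sum_i [ log(1 + exp(u_i)) - d_i * u_i ], evaluated at u = Ax."""
    return np.sum(np.logaddexp(0.0, u) - d * u)

def grad_h_logistic(u, d):
    """Componentwise gradient of h: sigmoid(u_i) - d_i."""
    return 1.0 / (1.0 + np.exp(-u)) - d

def grad_f2(x, A, d):
    """Chain rule for the composite f_2(x) = h(Ax): grad f_2(x) = A^T grad h(Ax)."""
    return A.T @ grad_h_logistic(A @ x, d)
```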

Example 3

Consider a high-dimensional multinomial classification problem with K classes, N samples, and p covariates. Denote the data set as \((t_1,y_1),\cdots,(t_N,y_N)\), where for all i=1,⋯,N, \(t_{i}\in\mathbb{R}^{p}\) is the observed covariate vector and \(y_i\in\{1,\cdots,K\}\) is the categorical response. The covariate vectors can be organized in the N×p design matrix

$$T = (t_1\quad t_2\quad \cdots\quad t_N)^T, $$

and the model parameters can be grouped in a K×p matrix

$$x = [x_{1}\quad x_2\quad \cdots\quad x_p], $$

where \(x_{i}\in\mathbb{R}^{K}\) denotes the parameter vector associated with the ith covariate.

Let \(x_{0}\in\mathbb{R}^{K}\). For i=1,⋯,N, we define

$$\eta^{(i)}=x_0 + xt_i, \quad \mbox{and}\quad q\bigl(\ell,\eta^{(i)}\bigr) = \frac{\mbox{exp}(\eta^{(i)}_\ell)}{\sum_{k=1}^K\mbox{exp}(\eta^{(i)}_k)}. $$

The log-likelihood function is

$$\ell(x_0, x) = \sum_{i=1}^N \log q\bigl(y_i, \eta^{(i)}\bigr). $$

The so-called multinomial sparse group Lasso classifier [19] is given by the following optimization problem:

$$ \min_{x_0\in\mathbb{R}^K, x\in\mathbb {R}^{K\times p}} F(x) = -\ell(x_0, x) + \lambda\|x\|_1+\sum_{J=1}^pw_J \|x_J\|. $$
(5)

The above problem (5) can be cast in the form of (1) by adding an extra (vacuous) term \(0\cdot\|x_0\|_1+0\cdot\|x_0\|\).

1.2 Proximal Gradient Method

A popular approach to solve the nonsmooth convex optimization problem (1) is the so-called proximal gradient method (PGM). For any convex function φ(x) (possibly nonsmooth), the Moreau–Yosida proximity operator [15] \(\operatorname{prox}_{\varphi}:\mathbb {R}^{n}\mapsto \mathbb {R}^{n}\) is defined as

$$ \operatorname{prox}_\varphi(x)=\mbox{arg}\min_{y\in \mathbb {R}^n}\varphi(y)+\frac{1}{2}\|y-x\|^2. $$
(6)

Since \(\frac{1}{2}\|\cdot-x\|^{2}\) is strongly convex and φ(⋅) is convex, the minimizer of (6) exists and is unique, so the prox-operator \(\operatorname{prox}_{\varphi}\) is well-defined. The prox-operator is known to be non-expansive,

$$\bigl\|\operatorname{prox}_\varphi(x)-\operatorname{prox}_\varphi(y)\bigr\|\leqslant \|x-y\|,\quad \forall x,y\in \mathbb {R}^n $$

and is therefore Lipschitz continuous.

Notice that if φ(x) is the indicator function \(i_C(x)\) of a closed convex set C, then the corresponding proximity operator \(\operatorname{prox}_{\varphi}\) becomes the standard projection operator onto the convex set C. Thus the prox-operator is a natural extension of the projection operator onto a convex set. For problems of large dimension, the computation of the proximity operator can be difficult due to the nonsmoothness of φ(⋅). However, if φ has a separable structure, then the computation of the proximity operator decomposes naturally, yielding substantial efficiency. For instance, for the nonsmooth convex function \(f_1(x)\) defined by (2), the proximity operator \(\operatorname{prox}_{f_{1}}\) can be computed efficiently (e.g., in closed form) via the so-called (group) shrinkage operator [12].
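For example, when \(\varphi(x)=\lambda\|x\|_1\) the proximity operator reduces to componentwise soft-thresholding (shrinkage); a minimal sketch, assuming the standard closed form:

```python
import numpy as np

def prox_l1(x, lam):
    """prox_{lam*||.||_1}(x): componentwise soft-thresholding at level lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```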

Using the proximity operator, we can write down the optimality condition for (1) as a fixed point equation

$$ x=\operatorname{prox}_{\alpha f_1}\bigl(x-\alpha\nabla f_2(x)\bigr), $$
(7)

for some α>0. The proximal gradient method (PGM) is to solve this fixed point equation via the iteration

$$ x^{k+1}=\operatorname{prox}_{\alpha_kf_1} \bigl(x^k-\alpha_k\nabla f_2 \bigl(x^k\bigr)\bigr), \quad k=0,1,2,\cdots, $$
(8)

where \(\alpha_k>0\) is a stepsize. Since the nonsmooth function \(f_1\) (cf. (2)) has a separable structure, the resulting proximal step \(\operatorname{prox}_{\alpha_{k}f_{1}}(\cdot)\) decomposes naturally across groups (and/or coordinates) and can be computed efficiently via (group) shrinkage (see Sect. 2).
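A schematic version of iteration (8) is sketched below, with the prox of \(\alpha_k f_1\) supplied as a function handle (its closed form is derived in Sect. 2); the constant stepsize, stopping rule and names are illustrative assumptions.

```python
import numpy as np

def proximal_gradient(x0, grad_f2, prox_alpha_f1, alpha, max_iter=1000, tol=1e-8):
    """PGM iteration (8): x^{k+1} = prox_{alpha_k f_1}(x^k - alpha_k * grad f_2(x^k)).

    grad_f2       : callable, x -> gradient of f_2 at x
    prox_alpha_f1 : callable, (v, alpha) -> prox of alpha*f_1 evaluated at v
    alpha         : constant stepsize (one admissible choice of alpha_k)
    """
    x = x0.copy()
    for _ in range(max_iter):
        x_new = prox_alpha_f1(x - alpha * grad_f2(x), alpha)
        if np.linalg.norm(x_new - x) <= tol:   # stop once the iterates stall
            return x_new
        x = x_new
    return x
```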

Despite its popularity, the convergence analysis of the proximal gradient method is still rather limited. For instance, it is only known [3, Theorem 3.4; or 2, Proposition 2] that if the stepsize \(\alpha_k\) satisfies

$$ 0<\underbar{\mbox{$\alpha$}}\leqslant \alpha_k\leqslant \bar{\alpha}<{\frac{2}{L}},\quad k=0,1,2,\cdots, $$
(9)

where L is the Lipschitz constant of the gradient \(\nabla f_2(x)\):

$$\bigl\|\nabla f_2(x)-\nabla f_2(y)\bigr\|\leqslant L\|x-y\|,\quad\forall x,y\in \mathbb {R}^n, $$

then every sequence \(\{x^k\}_{k\geqslant0}\) generated by the proximal gradient algorithm (8) converges to a solution of (1). The rate of convergence is typically sublinear, O(1/k) [11]. A linear rate of convergence is still unknown except for some special cases. For instance, when \(f_1(x)=i_C(x)\), the indicator function of a polyhedron C, the proximal gradient method (8) has been shown [8] to be globally linearly convergent to an optimal solution of (1), so long as the function \(f_2\) has the composite structure (3). The significance of this convergence analysis lies in the fact that it does not require strong convexity of \(f_2\). More recently, Tseng [12] has proved that the PGM is linearly convergent for the case \(f_{1}(x)=\sum_{J\in\mathcal{J}}w_{J}\| x_{J}\|\), again without assuming the strong convexity of \(f_2\). The latter is particularly important for the applications described in Sect. 1.1. In particular, for the Lasso, group Lasso and sparse group Lasso, the number of measurements is far less than the number of unknowns. Therefore, we have \(m\ll n\), so the matrix A cannot have full column rank, implying that \(f_2\) cannot be strongly convex.

In this paper, we extend Tseng’s results of [12] to the case where \(f_1\) is given by (2), namely, \(f_{1}(x)=\sum_{J\in \mathcal{J}}w_{J}\|x_{J}\|+\lambda\|x\|_{1}\). In particular, we establish the linear convergence of the proximal gradient method (8) for the class of nonsmooth convex minimization problems (1). Our result implies the linear convergence of the PGM (8) for the sparse group Lasso problem (4) even if A does not have full column rank. This significantly strengthens the known sublinear convergence rate of the PGM in the absence of strong convexity. Similar to the analysis of [8, 12], the key step in the linear convergence proof lies in the establishment of a local error bound that bounds the distance from an iterate to the optimal solution set in terms of the optimality residual \(\|x-\operatorname{prox}_{f_{1}}(x-\nabla f_{2}(x))\|\).

2 Preliminaries

We now develop some technical preliminaries needed for the subsequent convergence analysis in the next section.

For any vector \(a\in \mathbb {R}^{n}\), we use sign(a) to denote the vector whose ith component is

$$ \mbox{sign}(a_i):=\left \{\begin{array}{l@{\quad}l} 1,&\mbox{if } a_i>0,\\ -1,&\mbox{if } a_i<0,\\ {[}-1,1],&\mbox{if } a_i=0. \end{array} \right . $$

With this notation, the subdifferential of \(f_1\) (cf. (2)) [14] can be written as \(\partial f_1=(\cdots,(\partial f_1)_J,\cdots)\) with

$$ \bigl(\partial f_1(x)\bigr)_J=\left \{\begin{array}{l@{\quad}l} w_J\dfrac{x_J}{\|x_J\|}+\lambda\,\mbox{sign}(x_J),&\mbox{if } x_J\neq0,\\[6pt] w_J\mathcal {B}+\lambda \mathcal {B}_\infty,&\mbox{if } x_J=0, \end{array} \right . $$
(10)

for any \(J\in\mathcal{J}\), where \(\mathcal {B}\) and \(\mathcal {B}_{\infty}\) are the 2-norm and ∞-norm unit balls in \(\mathbb{R}^{|J|}\), respectively,

$$\mathcal {B}=\bigl\{s\in \mathbb {R}^{|J|}\mid\|s\|\leqslant 1\bigr\},\qquad \mathcal {B}_\infty=\bigl\{t\in \mathbb {R}^{|J|}\mid\| t\|_{\infty} \leqslant 1\bigr\}. $$

Let us now consider a generic iteration of the PGM (8). For convenience, we use x, \(x^+\) and α to denote \(x^k\), \(x^{k+1}\) and \(\alpha_k\), respectively. In light of the definition of the prox-operator (6), we can equivalently express the PGM iteration (8) in terms of the optimality condition for the prox operator

$$x-\alpha\nabla f_2(x)\in\alpha\partial f_1\bigl(x^+ \bigr)+x^+. $$

Using the separable structure of \(f_1\) (cf. (2)), we can break this optimality condition down to the group level:

$$ x_J-\alpha{\bigl(\nabla f_2(x) \bigr)_J}\in\alpha\bigl(\partial f_1\bigl(x^+\bigr) \bigr)_J+x^+_J,\quad\mbox{for all } J\in\mathcal{J}, $$
(11)

where \((\partial f_1(x^+))_J\) is given by (10). Notice that for any \(J\in\mathcal{J}\), the component vector \(x_{J}^{+}\) is uniquely defined by the above optimality condition (11).

Fix any x and any \(J\in\mathcal{J}\). For each j∈J, let us denote

$$ \beta_j(\alpha)=\left \{ \begin{array}{l} 0, \quad \mbox{if }(x-\alpha\nabla f_2(x))_J\in\alpha(w_J\mathcal {B}+\lambda \mathcal {B}_\infty)\\\noalign{\vspace{3pt}} \phantom{0,}\quad \mbox{or } |(x-\alpha\nabla f_2(x))_j|\leqslant \alpha\lambda,\\\noalign{\vspace{3pt}} (x-\alpha\nabla f_2(x))_j-\alpha\lambda\operatorname{sign}((x-\alpha\nabla f_2(x))_j), \quad \mbox{else}. \end{array} \right . $$
(12)

Notice that, in the second case of (12), \(\beta_j(\alpha)\) is simply equal to \(\mathbf{Shrink}_{[-\alpha\lambda,\alpha\lambda]}((x-\alpha\nabla f_2(x))_j)\), where the shrinkage operator is the same as the one used in compressive sensing algorithms. Namely, for any γ>0, the shrinkage operator over the interval [−γ,γ] is given by

$$\mathbf{Shrink}_{[-\gamma,\gamma]}(u)=\left \{ \begin{array}{l@{\quad}l} 0,& \mbox{if } |u|\leqslant \gamma,\\ u+\gamma, & \mbox{if }u\leqslant -\gamma,\\ u-\gamma, & \mbox{if }u\geqslant \gamma. \end{array} \right . $$

We now provide a complete characterization of the PGM iterate (8) by further simplifying the optimality condition (11).

Proposition 1

The PGM iterate x + can be computed explicitly according to

$$ x^+_J=\left \{ \begin{array}{l@{\quad}l} 0,& (x-\alpha\nabla f_2(x))_J\in\alpha(w_J\mathcal {B}+\lambda \mathcal {B}_\infty),\\ \beta_J(\alpha)(1-w_J\alpha/\|\beta_J(\alpha)\|), & \mbox{\textit{else}}. \end{array} \right . $$
(13)

Proof

Fix any x. If \((x-\alpha\nabla f_{2}(x))_{J}\in\alpha(w_{J}\mathcal {B}+\lambda \mathcal {B}_{\infty})\), then it follows from (10) that the optimality condition (11) is satisfied at 0, implying \(x^{+}_{J}=0\) (by the uniqueness of \(x^{+}_{J}\)). The converse is also true: if \((x-\alpha\nabla f_{2}(x))_{J}\not\in\alpha(w_{J}\mathcal {B}+\lambda \mathcal {B}_{\infty})\), then \(x^{+}_{J}\neq0\), because otherwise the optimality condition (11) would be violated.

Next we assume \((x-\alpha\nabla f_{2}(x))_{J}\not\in\alpha(w_{J}\mathcal {B}+\lambda \mathcal {B}_{\infty})\) so that \(x^{+}_{J}\neq0\). If, in addition, \((x-\alpha\nabla f_2(x))_j\in[-\alpha\lambda,\alpha\lambda]\), then the optimality condition (11) implies \(x^{+}_{j}=0\) (simply check that the optimality condition is satisfied at the point 0, and use the uniqueness of \(x^{+}_{j}\)).

The remaining case is that both \((x-\alpha\nabla f_{2}(x))_{J}\not\in\alpha (w_{J}\mathcal {B}+\lambda \mathcal {B}_{\infty})\) and \(|(x-\alpha\nabla f_2(x))_j|>\alpha\lambda\) hold. In this case, \(x^{+}_{j}\neq0\) and the optimality condition (11) implies

$$\bigl(x-\alpha\nabla f_2(x)\bigr)_j=x^+_j+ \alpha w_J\frac{x^+_j}{\|x^+_J\|}+\alpha\lambda\operatorname{sign} \bigl(x^+_j\bigr). $$

Since the terms on the right hand side have the same sign, it follows that \(\mbox{sign}((x-\alpha\nabla f_{2}(x))_{j})=\mbox{sign}(x^{+}_{j})\). Replacing the last term by \(\operatorname{sign}((x-\alpha\nabla f_2(x))_j)\) and rearranging the terms, we obtain

$$ \beta_j(\alpha):=\bigl(x-\alpha\nabla f_2(x)\bigr)_j-\alpha\lambda \operatorname{sign}\bigl(\bigl(x- \alpha\nabla f_2(x)\bigr)_j\bigr)=x^+_j+ \alpha w_J\frac{x^+_j}{\|x^+_J\|} $$
(14)

which further implies

$$\sqrt{\sum_{j\in J: x^+_j\neq0 } \beta^2_j(\alpha)}=\bigl\|\beta_J(\alpha)\bigr\|= \bigl\|x^+_J\bigr\| \biggl(1+\alpha w_J\frac{1}{\|x^+_J\|} \biggr), $$

where we have used the fact that β j (α)=0 whenever \(x_{j}^{+}=0\) (see the definition of β j (α) (12)). Hence, we have

$$\bigl\|x^+_J\bigr\|=\bigl\|\beta_J(\alpha)\bigr\|-\alpha w_J. $$

Substituting this relation into (14) yields

$$x^+_j=\beta_j(\alpha) \bigl(1-w_J\alpha/ \bigl\|\beta_J(\alpha)\bigr\|\bigr) $$

which establishes the proposition. □

Proposition 1 explicitly specifies how the PGM iterate x + can be computed. The only part that still requires further checking is to see whether the first condition in (12) holds. This can be accomplished easily by solving the following convex quadratic programming problem:

$$ \min_{t\in \mathcal {B}_\infty}\sum _{j\in J}\bigl(x_j-\alpha\nabla f_2(x)_j- \alpha\lambda t_j\bigr)^2. $$
(15)

By Proposition 1, if the minimum value of (15) is less than or equal to \(\alpha^{2}w^{2}_{J}\), then we set \(x^{+}_{J}=0\); else set

$$ x^{+}_J=\beta_J(\alpha) \bigl(1-w_J\alpha/\bigl\|\beta_J(\alpha)\bigr\|\bigr), $$
(16)

where β J (α) is defined by (12). In fact, due to the separable structure of the cost function, the minimum of (15) is attained at

$$t_j=\mbox{proj}_{[-1,1]} \bigl(\bigl(x-\alpha\nabla f_2(x)\bigr)_j/\alpha\lambda\bigr),\quad j\in J $$

and the minimum value is simply

$$\sum_{j\in{J}}\beta_j^2( \alpha)=\bigl\|\beta_J(\alpha)\bigr\|^2, $$

where β j (α) is defined by (12).

In light of the preceding discussion, the updating formula (13) in Proposition 1 can be rewritten as

$$ x^{+}_J=\left \{ \begin{array}{l@{\quad}l} 0,&\mbox{if }\|\beta_J(\alpha)\|\leqslant \alpha w_J, \\ \beta_J(\alpha)(1-w_J\alpha/\|\beta_J(\alpha)\|),&\mbox{otherwise}. \end{array} \right . $$
(17)
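The group-wise update (17) can be implemented as in the following sketch (illustrative names, not from the original text): componentwise soft-thresholding produces \(\beta_J(\alpha)\) as in (12), and the group is then either zeroed out or rescaled. It matches the prox handle assumed in the PGM sketch of Sect. 1.2.

```python
import numpy as np

def prox_alpha_f1(v, alpha, groups, w, lam):
    """Apply (17) to v = x - alpha*grad f_2(x), one group at a time.

    Step 1: form beta_J(alpha) by soft-thresholding v_J at level alpha*lam (cf. (12)).
    Step 2: set x^+_J = 0 if ||beta_J(alpha)|| <= alpha*w_J; otherwise rescale beta_J(alpha).
    """
    x_plus = np.zeros_like(v)
    for J, wJ in zip(groups, w):
        beta = np.sign(v[J]) * np.maximum(np.abs(v[J]) - alpha * lam, 0.0)
        norm_beta = np.linalg.norm(beta)
        if norm_beta > alpha * wJ:              # otherwise x^+_J remains 0
            x_plus[J] = beta * (1.0 - alpha * wJ / norm_beta)
    return x_plus
```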

Hence, we can summarize the PGM iteration as follows:

[Algorithm box summarizing the PGM iteration with the group-wise update (17); not reproduced here.]

Another useful property in the analysis of PGM is the fact that Ax is invariant over the optimal solution set of (1). Denote the optimal solution set of (1) by

$$ \bar{X}= \Bigl\{x^*\in \mathbb {R}^n\mid F\bigl (x^*\bigr)= \min_xF(x) \Bigr\}. $$
(18)

Proposition 2

Consider the nonsmooth convex minimization problem \(\min_x F(x)=f_1(x)+f_2(x)\), where \(f_1\) and \(f_2\) are given by (2) and (3), respectively. Then \(Ax^*\) is invariant over \(\bar{X}\) in the sense that there exists \(\bar{y} \in \operatorname {dom}h\) such that

$$ Ax^*=\bar{y},\quad\forall x^*\in\bar{X}. $$
(19)

Proof

The argument is similar to Lemma 2.1 in [8]. Since \(F(x)=f_1(x)+f_2(x)=f_1(x)+h(Ax)\) is continuous and convex, the optimal solution set \(\bar{X}\) must be closed and convex. For any \(x^{*},~\tilde{x}\in\bar{X}\), we have by the convexity of \(\bar{X}\) that \((x^{*}+\tilde{x})/2\in\bar{X}\). It follows that

$$F\bigl(x^*\bigr)=F(\tilde{x})=\frac{1}{2}\bigl(F\bigl(x^*\bigr )+F(\tilde{x})\bigr)=F \biggl(\frac{x^*+\tilde{x}}{2} \biggr) $$

which further implies

$$\frac{1}{2} \bigl(f_1\bigl(x^*\bigr)+h\bigl(Ax^*\bigr)+f_1( \tilde{x})+h(A\tilde{x}) \bigr) =f_1 \biggl(\frac{x^*+\tilde{x}}{2} \biggr)+h \biggl(A \biggl(\frac{x^*+\tilde {x}}{2} \biggr) \biggr). $$

By the convexity of h(⋅), we have

$$\frac{h(Ax^*)+h(A\tilde{x})}{2}\geqslant h \biggl(A \biggl(\dfrac {x^*+\tilde {x}}{2} \biggr) \biggr). $$

Combining the above two relations, we obtain

$$\frac{f_1(x^*)+f_1(\tilde{x})}{2}\leqslant f_1 \biggl(\frac{x^*+\tilde {x}}{2} \biggr). $$

By the convexity of f 1(x), it follows that

$$\frac{f_1(x^*)+f_1(\tilde{x})}{2}\geqslant f_1 \biggl(\frac{x^*+\tilde {x}}{2} \biggr). $$

This implies that \(\frac{f_{1}(x^{*})+f_{1}(\tilde{x})}{2}=f_{1} (\frac{x^{*}+\tilde{x}}{2} )\). Therefore, we obtain

$$\frac{h(Ax^*)+h(A\tilde{x})}{2}=h \biggl(A \biggl(\frac{x^*+\tilde {x}}{2} \biggr) \biggr). $$

By the strict convexity of h(⋅), we must have \(Ax^{*}=A\tilde{x}\). □

3 Error Bound Condition

The global convergence of PGM is given by [3, Theorem 3.4]. In particular, under the following three assumptions:

[Assumptions (A1)–(A3) on \(f_1\) and \(f_2\); box not reproduced here.]

and if the stepsize α k satisfies

$$0<\underbar{\mbox{$\alpha$}}\leqslant \alpha_k\leqslant \bar{\alpha}< {\frac{2}{L}}, \quad k=0,1,2,\cdots, $$

then every sequence \(\{x^k\}_{k\geqslant0}\) generated by the proximal gradient algorithm converges to a solution of (1).

Let us now focus on the linear convergence of the PGM. Traditionally, linear convergence of a first-order optimization method is only possible under strong convexity and smoothness assumptions. Unfortunately, in our case, the objective function F(x) in (1) is neither smooth nor strongly convex. To establish linear convergence in the absence of strong convexity, we rely on the following error bound condition, which estimates the distance from an iterate to the optimal solution set.

Error Bound Condition

Let us define a distance function for the optimal solution set \(\bar {X}\) (cf. (18)) as

$$\operatorname{dist}_{\bar{X}}(x)=\inf_{y\in\bar{X}}\|x-y\| $$

and define a residual function

$$ r(x)=\operatorname{prox}_{f_1} \bigl(x-\nabla f_2(x) \bigr)-x. $$
(20)

Since the prox-operator is Lipschitz continuous (in fact non-expansive), it follows that the residual function r(x) is continuous on \(\operatorname {dom}f_{2}\).

We say a local error bound holds around the optimal solution set \(\bar{X}\) of (1) if for any \(\xi\geqslant\min_x F\), there exist scalars κ>0 and ε>0 such that

$$ \operatorname{dist}_{\bar{X}}(x)\leqslant \kappa\bigl\|r(x)\bigr\|,\quad \mbox{whenever } F(x)\leqslant \xi,\ \bigl\|r(x)\bigr\|\leqslant \varepsilon. $$
(21)

To simplify notation, we denote \(g=\nabla f_2(x)\) and \(\beta_J:=\beta_J(1)\) (cf. (12)). In light of Proposition 1 and specializing the update formula (13) for the proximal step with α=1, we can write the residual function r(x) as

$$ r(x)_J=\left \{ \begin{array}{l@{\quad}l} -x_J, & \mbox{if }\|\beta_J\|\leqslant w_J,\\ \beta_J-x_{J}-w_J \beta_J/\|\beta_J\|, &\mbox{if }\|\beta_J\|> {w_J} \end{array} \right . $$
(22)

for all \(J\in\mathcal{J}\).

Theorem 1

Consider the nonsmooth convex minimization problem (1) with \(f_1(x)\) and \(f_2(x)\) defined by (2) and (3), respectively. Suppose \(f_1(x)\) and \(f_2(x)\) satisfy the assumptions (A1)–(A3). Then the error bound condition (21) holds.

The proof of Theorem 1 is rather technical and extends the analysis of Tseng [12]. In particular, we need two intermediate lemmas described below. For simplicity, for any sequence \(\{x^k\}_{k\geqslant0}\) in \(\mathbb {R}^{n}\setminus \bar{X}\), we adopt the following shorthand notation:

$$r^k:=r\bigl(x^k\bigr), \qquad \delta_k:= \bigl\|x^k-\bar{x}^k\bigr\|, \qquad \bar{x}^k:=\mathop{\mathrm{argmin}}\limits_{\bar{x}\in\bar{X}}\bigl\|x^k-\bar{x}\bigr\| $$

and

$$ u^k:=\frac{x^k-\bar{x}^k}{\delta_k}\rightarrow\bar {u}\neq0. $$
(23)

Lemma 1

Consider the nonsmooth convex minimization problem (1) with \(f_1\) and \(f_2\) defined by (2) and (3), respectively. Suppose \(f_1(x)\) and \(f_2(x)\) satisfy assumptions (A1)–(A3). Furthermore, suppose there exists a sequence \(x^{1},x^{2},\cdots\in \mathbb {R}^{n}\setminus\bar{X}\) satisfying

$$ F\bigl(x^k\bigr)\leqslant \zeta, \quad \forall k\quad \mbox{\textit{and}}\quad \bigl\{r^k \bigr\}\rightarrow0,\qquad \biggl\{\frac{r^k}{\delta_k} \biggr\} \rightarrow0 $$
(24)

and \(A\bar{u}=0. \) Let

$$ \hat{x}^k:=\bar{x}^k+ \delta_k^2\bar{u}. $$
(25)

Then there exists a subsequence of \(\{\hat{x}^{k}\}\) along which the inclusion

$$ 0\in\bar{g}_J+w_J\partial\bigl\| \hat{x}^k_J\bigr\|+\lambda\partial\bigl\|\hat{x}^k_J\bigr\|_1 $$
(26)

is satisfied for all \(J\in\mathcal{J}\).

Lemma 2

Suppose \(f_1(x)\) and \(f_2(x)\) satisfy assumptions (A1)–(A3). Moreover, suppose there exists a sequence \(x^{1},x^{2},\cdots\in \mathbb {R}^{n}\setminus \bar{X}\) satisfying (24). Then there exists a κ>0 such that

$$ \bigl\|x^k-\bar{x}^k\bigr\|\leqslant \kappa\bigl\|Ax^k- \bar{y}\bigr\|\quad \forall k. $$
(27)

The proofs of Lemmas 1 and 2 are relegated to Appendices A and B, respectively. Assuming these lemmas hold, we can proceed to prove Theorem 1.

Proof of Theorem 1

We argue by contradiction. Suppose there exists a \(\zeta\geqslant\min F\) such that (21) fails to hold for all κ>0 and ε>0. Then there exists a sequence \(x^{1},x^{2},\cdots\in \mathbb {R}^{n}\setminus\bar{X}\) satisfying (24).

Let \(\bar{y} =Ax^{*}\) for any \(x^{*}\in\bar{X}\) (note that \(\bar{y}\) is independent of \(x^*\), cf. Proposition 2) and let

$$ g^k:=\nabla f_2\bigl(x^k \bigr)=A^{T}\nabla h\bigl(Ax^k\bigr), \qquad \bar{g}:=A^{T}\nabla h(\bar{y}). $$
(28)

By Proposition 2, \(\bar{g}^{k}=A^{T}\nabla h(A\bar {x}^{k})=A^{T}\nabla h(\bar{y})=\bar{g}\) for all k. Since

$$r^k\in\mbox{arg}\min_df_1 \bigl(x^k+d\bigr)+ \frac{1}{2}\bigl\|d+g^k\bigr\|^2, $$

it follows from the convexity that

$$0\in\partial f_1\bigl(x^k+r^k \bigr)+r^k+g^k. $$

The latter is also the optimality condition for

$$ r^k\in\mbox{arg}\min_d\bigl\langle g^k+r^k,d\bigr\rangle+f_1 \bigl(x^k+d\bigr). $$
(29)

We use an argument similar to that of [12]. In particular, by evaluating the right hand side of (29) at \(r^k\) and \(\bar {x}^{k}-x^{k}\), respectively, we have

$$ \bigl\langle g^k+r^k,r^k\bigr \rangle+f_1\bigl(x^k+r^k\bigr)\leqslant \bigl \langle g^k+r^k,\bar{x}^k-x^k \bigr\rangle+f_1\bigl(\bar{x}^k\bigr). $$
(30)

Similarly, since \(\bar{x}^{k}\in\bar{X}\) and \(\bar{g}^{k}=\bar{g}\), it follows that

$$0\in\mbox{arg}\min_df_1\bigl( \bar{x}^k+d\bigr)+\frac{1}{2}\bigl\|d+\bar{g}^k\bigr\|^2, $$

which is further equivalent to

$$ 0\in\mbox{arg}\min_d\langle\bar{g},d \rangle+f_1\bigl({\bar{x}}^k+d\bigr). $$
(31)

By evaluating the right hand side of (31) at 0 and \(x^{k}+r^{k}-\bar{x}^{k}\), respectively, we obtain

$$\langle\bar{g}+0,0\rangle+f_1\bigl(\bar{x}^k+0\bigr) \leqslant \bigl\langle\bar{g},x^k+r^k-\bar{x}^k \bigr\rangle+f_1\bigl(\bar{x}^k+x^k+r^k- \bar{x}^k\bigr), $$

i.e.

$$ f_1\bigl(\bar{x}^k\bigr)\leqslant \bigl\langle \bar{g},x^k+r^k-\bar{x}^k\bigr\rangle +f_1\bigl(x^k+r^k\bigr). $$
(32)

Adding (30) and (32) yields

$$\bigl\langle g^k-\bar{g},x^k-\bar{x}^k\bigr \rangle+\bigl\|r^k\bigr\|^2\leqslant \bigl\langle\bar{g}-g^k,r^k \bigr\rangle+\bigl\langle r^k,\bar{x}^k-x^k \bigr\rangle. $$

By (28), the strong convexity of h and Lemma 2 (cf. (27)), we obtain

$$\bigl\langle g^k-\bar{g},x^k-\bar{x}^k\bigr \rangle=\bigl\langle\nabla h\bigl(Ax^k\bigr)-\nabla h( \bar{y}),Ax^k-\bar{y}\bigr\rangle \geqslant \sigma\bigl\|Ax^k-\bar{y} \bigr\|^2\geqslant \frac {\sigma}{\kappa^2}\bigl\|x^k-\bar{x}^k \bigr\|^2. $$

Moreover, since \(\|A\|:=\max_{\|d\|=1}\|Ad\|\), it follows that

$$\bigl\langle\bar{g}-g^k,r^k\bigr\rangle=\bigl\langle \nabla h(\bar{y})-\nabla h\bigl(Ax^k\bigr),Ar^k\bigr \rangle \leqslant L\|A\|^2\bigl\|x^k-\bar{x}^k\bigr\| \bigl\|r^k\bigr\|. $$

Combining the above three inequalities gives

$$\frac{\sigma}{\kappa^2}\bigl\|x^k-\bar{x}^k\bigr\|^2+ \bigl\|r^k\bigr\|^2\leqslant L\|A\|^2\big\| x^k- \bar{x}^k\bigr\|\bigl\|r^k\bigr\|+\bigl\|r^k\bigr\|\bigl\|x^k- \bar{x}^k\bigr\|, $$

which further implies

$$\frac{\sigma}{\kappa^2}\bigl\|x^k-\bar{x}^k\bigr\|^2\leqslant \bigl(L\|A\|^2+1\bigr)\bigl\|x^k-\bar{x}^k\bigr\| \bigl\|r^k\bigr\|, \quad \forall k. $$

Canceling out a factor of \(\|x^{k}-\bar{x}^{k}\|\) yields

$$\frac{\sigma}{\kappa^2}\bigl\|x^k-\bar{x}^k\bigr\|\leqslant \bigl(L\|A \|^2+1\bigr)\bigl\|r^k\bigr\|,\quad \forall k, $$

which contradicts (24). This completes the proof of Theorem 1. □

4 Linear Convergence

We now establish the linear convergence of the PGM (8) under the local error bound condition (21). Let \(F(x)=f_1(x)+f_2(x)\), where \(f_1\) and \(f_2\) are defined by (2) and (3), respectively. Suppose that \(\nabla f_2\) is Lipschitz continuous with modulus L. Let \(\{x^k\}_{k\geqslant0}\) be a sequence generated by the PGM (8). There are three key steps in the linear convergence proof, which we outline below. The framework was first established by Luo and Tseng in 1992 [8].

Step 1 :

Sufficient decrease. Suppose the step size \(\alpha_k\) is chosen according to (9), but with \(\bar{\alpha}<\frac{1}{L}\). Then for all k⩾0, we have

$$ {F\bigl(x^{k}\bigr)}- F\bigl(x^{k+1}\bigr)\geqslant c_1\bigl\|x^{k+1} - x^k\bigr\|^2,\quad\mbox{for some } c_1 > 0. $$
(33)
Step 2 :

Local error bound. Let \(\bar{X}\) denote the set of optimal solutions satisfying (7) and let \(\operatorname{dist}_{\bar{X}}(x) := \min_{x^{*}\in\bar{X}}\|x-x^{*}\|\). Then for any \(\xi\geqslant\min F(x)\), there exist some κ, ε>0 such that

$$ \operatorname{dist}_{\bar{X}}(x)\leqslant \kappa\bigl\|x- \operatorname{prox}_{f_1}\bigl[x-\nabla f_2(x)\bigr]\bigr\|, $$
(34)

for all x such that \(\|x-\operatorname{prox}_{f_{1}}[x-\nabla f_{2}(x)]\| \leqslant \varepsilon\).

Step 3 :

Cost-to-go estimate. There exists a constant \(c_2>0\) such that

$$ F\bigl(x^k\bigr) - F^* \leqslant c_2\bigl(\operatorname{dist}^2_{\bar{X}} \bigl( x^k\bigr)+\bigl\|x^{k+1}-x^k\bigr\|^2 \bigr),\quad\forall k. $$
(35)

We first establish the sufficient decrease property (33). Notice that the PGM iteration (8) can be equivalently written as

$$x^{k+1}=\mathop{\mathrm{argmin}}\limits_{x} \biggl \{f_1(x)+\bigl\langle\nabla f_2\bigl(x^k \bigr),x-x^k\bigr\rangle+\frac{1}{2\alpha_k}\bigl\|x- x^k\bigr\|^2 \biggr\}, \quad k=0,1,2,\cdots. $$

Plugging \(x=x^{k+1}\) and \(x=x^k\), respectively, into the right hand side yields

$$f_1\bigl(x^{k+1}\bigr)+\bigl\langle\nabla f_2\bigl(x^k\bigr),x^{k+1}-x^k \bigr\rangle+\frac {1}{2\alpha_k}\bigl\|x^{k+1}-x^k \bigr\|^2 \leqslant f_1\bigl(x^k\bigr). $$

Since L is the Lipschitz constant of \(\nabla f_2\), it follows from the Taylor expansion of \(f_2\) that

$$\begin{aligned} F\bigl(x^{k+1}\bigr)&\leqslant f_1\bigl(x^{k+1}\bigr)+f_2\bigl(x^k\bigr)+\bigl\langle\nabla f_2\bigl(x^k\bigr),x^{k+1}-x^k\bigr\rangle+\frac{L}{2}\bigl\|x^{k+1}-x^k\bigr\|^2\\ &\leqslant F\bigl(x^k\bigr)- \biggl(\frac{1}{2\alpha_k}-\frac{L}{2} \biggr)\bigl\|x^{k+1}-x^k\bigr\|^2 \leqslant F\bigl(x^k\bigr)-\frac{1-\bar{\alpha}L}{2\bar{\alpha}}\bigl\|x^{k+1}-x^k\bigr\|^2, \end{aligned}$$

where the second inequality uses the preceding relation and the last step is due to (9). Since \(\bar {\alpha}<1/L\), it follows that the sufficient decrease condition (33) holds for all k⩾0 with

$$c_1=\frac{1-\bar{\alpha}L}{2\bar{\alpha}}>0. $$

The local error bound condition holds due to Theorem 1, so it remains to establish the cost-to-go estimate (35). Let \(\bar{x}^{k}\in\bar{X}\) be such that \(\operatorname{dist}_{\bar{X}}(x^{k})=\|x^{k}-\bar{x}^{k}\|\). The optimality of \(x^{k+1}\) implies

$$f_1\bigl(x^{k+1}\bigr)+\bigl\langle\nabla f_2\bigl(x^k\bigr),x^{k+1}-x^k\bigr\rangle+\frac{1}{2\alpha_k}\bigl\|x^{k+1}-x^k\bigr\|^2 \leqslant f_1\bigl(\bar{x}^k\bigr)+\bigl\langle\nabla f_2\bigl(x^k\bigr),\bar{x}^k-x^k\bigr\rangle+\frac{1}{2\alpha_k}\bigl\|\bar{x}^k-x^k\bigr\|^2, $$

implying

$$\bigl\langle\nabla f_2\bigl( x^k\bigr),x^{k+1}-\bar{x}^k\bigr\rangle + f_1\bigl(x^{k+1}\bigr)\!-\!f_1\bigl( \bar{x}^k\bigr)\leqslant \frac{1}{2\alpha_k}\operatorname{dist}^2_{\bar{X}} \bigl(x^k\bigr)\leqslant \frac{1}{2\underbar{\mbox{$\alpha$}}}\operatorname{dist}^2_{\bar{X}} \bigl(x^k\bigr). $$

Also, the mean value theorem shows

$$f_2\bigl(x^{k+1}\bigr)-f_2 \bigl(\bar{x}^k\bigr)=\bigl\langle\nabla f_2\bigl(\eta ^k\bigr),x^{k+1}-\bar{x}^k\bigr\rangle $$

for some \(\eta^k\) in the line segment joining \(x^{k+1}\) and \(\bar{x}^{k}\). Combining the above two relations and using the triangle inequality

$$\bigl\|\eta^k-x^k\bigr\|\leqslant \bigl\|x^{k+1}- x^k\bigr\|+\bigl\|\bar{x}^k-x^k\bigr\|=\bigl\|x^{k+1}-x^k\bigr\|+\operatorname{dist}_{\bar{X}}\bigl(x^k\bigr) $$

yields

where the first inequality is due to the Lipschitz continuity of \(\nabla f_2\) and the second inequality follows from the triangle inequality. This establishes the cost-to-go estimate (35). We are now ready to combine the three steps outlined above to establish the following main linear convergence result.

Theorem 2

Assume that \(f_1\) is convex and \(f_2\) is convex and differentiable with a Lipschitz continuous gradient \(\nabla f_2\). Moreover, suppose \(\bar{X}\) is nonempty, a local error bound (34) holds around the solution set \(\bar{X}\), and the step size \(\alpha_k\) is chosen according to

$$0<\underbar{\mbox{$\alpha$}}\leqslant \alpha_k\leqslant \bar{\alpha}<1/L,\quad k=0,1,2,\cdots. $$

Then the PGM algorithm (8) generates a sequence of iterates \(x^0,x^1,\cdots,x^k,\cdots\) that converges linearly to a solution in \(\bar{X}\).

Proof

First, the sufficient decrease condition (33) implies \(\|x^{k+1}-x^k\|^2\to0\). Since

where we used the fact \({\liminf_{k}}\alpha_{k}\geqslant \underbar{\mbox{$\alpha$}}\), it follows that

$$\bigl\|x^k - \operatorname{prox}_{f_1}\bigl[x^k - \nabla f_2\bigl(x^k\bigr)\bigr]\bigr\| \to0. $$

Since the function values \(F(x^k)\) are monotonically decreasing (cf. (33)), it follows that the local error bound (34) holds for some κ and ε. In particular, for sufficiently large k, we have

$$\operatorname{dist}_{\bar{X}}\bigl(x^k\bigr)\leqslant \kappa\bigl\| x^k - \operatorname{prox}_{f_1}\bigl[x^k - \nabla f_2\bigl(x^k\bigr)\bigr]\bigr\| $$

implying \(\operatorname{dist}_{\bar{X}}(x^{k})\to0\). Consequently, by the cost-to-go estimate (35) we have

$$F\bigl(x^k\bigr)\to F^*. $$

Now we use the local error bound (34) and the cost-to-go estimate (35) to obtain

Hence, we have

$$F\bigl(x^{k+1}\bigr)-F^*\leqslant \frac{c_3}{1+c_3} \bigl( F\bigl( x^{k}\bigr)-F^* \bigr), $$

where

$$c_3= \frac{\kappa^2 c_2 +c_2}{c_1\min\{1,\underbar{\mbox{$\alpha$}}^2\}}. $$

This implies the Q-linear convergence of \(F(x^k)\to F^*\). In light of (33), this further implies the R-linear convergence of \(\|x^{k+1}-x^k\|^2\) to zero. Thus, \(\{x^k\}\) converges linearly to an optimal solution in \(\bar{X}\). □
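To illustrate Theorem 2 numerically, one can run the PGM on a small synthetic sparse group Lasso instance (4) with m<n, so that A cannot have full column rank, and watch the optimality gap \(F(x^k)-F^*\) shrink geometrically. The following self-contained sketch (all data, parameter values and names are hypothetical, and \(F^*\) is approximated by the last iterate) is one way to do this.

```python
import numpy as np

def run_pgm_sparse_group_lasso(m=40, n=100, group_size=10, lam=0.1, w_all=0.1,
                               iters=3000, seed=0):
    """Run PGM (8) on a random instance of (4) and print F(x^k) - F* for a few k."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    d = rng.standard_normal(m)
    groups = [np.arange(j, j + group_size) for j in range(0, n, group_size)]
    w = [w_all] * len(groups)

    def F(x):                                   # objective (4)
        r = A @ x - d
        return (0.5 * r @ r
                + sum(wJ * np.linalg.norm(x[J]) for J, wJ in zip(groups, w))
                + lam * np.sum(np.abs(x)))

    def grad_f2(x):                             # gradient of 0.5*||Ax - d||^2
        return A.T @ (A @ x - d)

    def prox(v, alpha):                         # group-wise update (17)
        xp = np.zeros_like(v)
        for J, wJ in zip(groups, w):
            beta = np.sign(v[J]) * np.maximum(np.abs(v[J]) - alpha * lam, 0.0)
            nb = np.linalg.norm(beta)
            if nb > alpha * wJ:
                xp[J] = beta * (1.0 - alpha * wJ / nb)
        return xp

    L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of grad f_2
    alpha = 0.9 / L                             # stepsize with alpha < 1/L
    x, vals = np.zeros(n), []
    for _ in range(iters):
        x = prox(x - alpha * grad_f2(x), alpha)
        vals.append(F(x))
    F_star = vals[-1]                           # surrogate for the optimal value
    for k in range(0, 201, 40):                 # print the optimality gap at a few k
        print(k, vals[k] - F_star)

run_pgm_sparse_group_lasso()
```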

5 Closing Remarks

Motivated by recent applications in sparse group Lasso, we have considered in this paper a class of nonsmooth convex minimization problems whose objective function is the sum of a smooth convex function, a nonsmooth \(\ell_1\)-norm regularization term and a (group) \(\ell_2\)-norm regularization term. We have derived a proximal gradient method for this problem whose subproblem can be solved efficiently (in closed form). Moreover, we have established linear convergence of this method when the smooth part of the objective function consists of a strongly convex function composed with a linear mapping, even though the overall objective function is not strongly convex and the problem may have multiple solutions. The key step in the analysis is a local error bound condition which provides an estimate of the distance to the optimal solution set in terms of the size of the proximal gradient vector.