Abstract
We consider a class of nonsmooth convex optimization problems in which the objective function is the composition of a strongly convex differentiable function with a linear mapping, regularized by the sum of both the ℓ 1-norm and the ℓ 2-norm of the optimization variables. This class of problems arises naturally from applications in sparse group Lasso, a popular technique for variable selection. An effective approach to solving such problems is the Proximal Gradient Method (PGM). In this paper we prove a local error bound around the optimal solution set of this problem and use it to establish the linear convergence of the PGM without assuming strong convexity of the overall objective function.
1 Introduction
Consider an unconstrained nonsmooth convex optimization problem of the form
$$ \min_{x\in \mathbb {R}^{n}}\ F(x):=f_{1}(x)+f_{2}(x), $$(1)
where f 1 is a nonsmooth convex function given by
$$ f_{1}(x)=\sum_{J\in\mathcal{J}}w_{J}\|x_{J}\|+\lambda\|x\|_{1}, $$(2)
with \(\mathcal{J}\) a partition of {1,⋯,n} and \(\lambda,\;\{ w_{J}\}_{J\in\mathcal{J}}\) some given nonnegative constants; f 2(x) is a composite convex function
$$ f_{2}(x)=h(Ax), $$(3)
where \(h:\mathbb {R}^{m}\mapsto \mathbb {R}\) is a continuously differentiable strongly convex function and \(A\in \mathbb {R}^{m\times n}\) is a given matrix. Notice that unless A has full column rank (i.e., \({\rm rank}(A)=n\)), the composite function f 2(x) is not strongly convex.
1.1 Motivating Applications
Nonsmooth convex optimization problems of the form (1) arise in many contemporary statistical and signal processing applications [4, 24], including signal denoising, compressive sensing, sparse linear regression and high dimensional multinomial classification. To motivate the nonsmooth convex optimization problem (1), we briefly outline some application examples below.
Example 1
Suppose that we have a noisy measurement vector \(d\in \mathbb {R}^{m}\) about an unknown sparse vector \(x\in \mathbb {R}^{n}\), where the signal model is linear and given by
for some given matrix \(A \in \mathbb {R}^{m\times n}\). A popular technique to estimate the sparse vector x is called Lasso [17, 23], which performs simultaneous estimation and variable selection. Furthermore, a related technique called group Lasso [22] acts like Lasso at the group level. Since the group Lasso does not yield sparsity within a group, a generalized model that yields sparsity at both the group and individual feature levels was proposed in [5]. This sparse group Lasso criterion is formulated as
$$ \min_{x\in\mathbb{R}^{n}}\ \frac{1}{2}\|Ax-d\|^{2}+\sum_{J\in\mathcal{J}}w_{J}\|x_{J}\|+\lambda\|x\|_{1}, $$(4)
where the minimization of ∥Ax−d∥2 has a denoising effect, the middle term promotes sparsity at the group level, and \(\mathcal{J}\) is a partition of {1,2,⋯,n} into groups. The ℓ 1-norm term sparsifies the solution x and effectively selects the most significant individual components of x.
Obviously, the sparse group Lasso problem (4) is in the form of the nonsmooth convex optimization problem (1) with \(h(\cdot)=\frac{1}{2}\|\cdot-d\|^{2}\). Moreover, if λ=0, then (4) reduces to the group Lasso; if w J =0 for all \(J\in \mathcal{J}\), then (4) is exactly the Lasso problem. We refer readers to [1, 9] for recent applications of the group Lasso technique.
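To make the structure of (4) concrete, the following minimal Python sketch evaluates the sparse group Lasso objective. The names `A`, `d`, `groups`, `w` and `lam` are illustrative placeholders (not notation fixed in the paper), and the sketch assumes the groups form a partition of the coordinate indices.

```python
import numpy as np

def sparse_group_lasso_objective(x, A, d, groups, w, lam):
    """Evaluate F(x) = 0.5*||Ax - d||^2 + sum_J w_J*||x_J|| + lam*||x||_1.

    groups : list of index arrays forming a partition of {0, ..., n-1}
    w      : list of nonnegative group weights w_J (same order as groups)
    lam    : nonnegative l1 regularization parameter
    """
    residual = A @ x - d
    smooth = 0.5 * np.dot(residual, residual)   # f_2(x) = h(Ax) with h = 0.5*||. - d||^2
    group_pen = sum(wj * np.linalg.norm(x[J]) for wj, J in zip(w, groups))
    l1_pen = lam * np.linalg.norm(x, 1)
    return smooth + group_pen + l1_pen          # f_2(x) + f_1(x)
```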
Example 2
In logistic regression, we are given a set of n-dimensional feature vectors a i (i=1,2,⋯,m), and the corresponding class labels d i ∈{0,1}. The probability distribution of the class label d given a feature vector a and a logistic regression coefficient vector \(x\in \mathbb {R}^{n}\) can be described by
The logistic group Lasso technique [10] corresponds to selecting x by
Again, this is in the form of the nonsmooth convex optimization problem (1) with λ=0 and
which is strongly convex in u. We refer readers to [6, 7, 13, 16, 18, 20–22] for further studies on group Lasso type of statistical techniques.
Example 3
Consider a high dimensional multinomial classification problem with K classes, N samples, and p covariates. Denote the data set as (t 1,y 1),⋯,(t N ,y N ), where for all i=1,⋯,N, \(t_{i}\in\mathbb{R}^{p}\) is the observed covariate vector and y i ∈{1,⋯,K} is the categorical response. The covariate vectors can be organized in the N×p design matrix
and the model parameters can be grouped in a K×p matrix
where \(x_{i}\in\mathbb{R}^{K}\) denotes the parameter vector associated with the ith covariate.
Let \(x_{0}\in\mathbb{R}^{K}\). For i=1,⋯,N, we define
The log-likelihood function is
The so called multinomial sparse group Lasso classifier [19] is given by the following optimization problem:
The above problem (5) can be cast in the form of (1) by adding an extra (vacuous) term 0∥x 0∥1+0∥x 0∥.
1.2 Proximal Gradient Method
A popular approach to solving the nonsmooth convex optimization problem (1) is the so-called proximal gradient method (PGM). For any convex function φ(x) (possibly nonsmooth), the Moreau–Yosida proximity operator [15] \(\operatorname{prox}_{\varphi}:\mathbb {R}^{n}\mapsto \mathbb {R}^{n}\) is defined as
$$ \operatorname{prox}_{\varphi}(x):=\mathop{\mbox{argmin}}_{y\in\mathbb{R}^{n}}\ \Bigl\{\varphi(y)+\frac{1}{2}\|y-x\|^{2}\Bigr\}. $$(6)
Since \(\frac{1}{2}\|\cdot-x\|^{2}\) is strongly convex and φ(⋅) is convex, the minimizer of (6) exists and is unique, so the prox-operator \(\operatorname{prox}_{\varphi}\) is well-defined. The prox-operator is known to be non-expansive, i.e., \(\|\operatorname{prox}_{\varphi}(x)-\operatorname{prox}_{\varphi}(y)\|\leqslant\|x-y\|\) for all \(x,y\in\mathbb{R}^{n}\), and is therefore Lipschitz continuous.
Notice that if φ(x) is the indicator function i C (x) of a closed convex set C, then the corresponding proximity operator \(\operatorname{prox}_{\varphi}\) becomes the standard projection operator onto the convex set C. Thus the prox-operator is a natural extension of the projection operator onto a convex set. For problems of large dimension, the computation of the proximity operator can be difficult due to the nonsmoothness of φ(⋅). However, if φ has a separable structure, then the computation of the proximity operator decomposes naturally, yielding substantial efficiency. For instance, for the nonsmooth convex function f 1(x) defined by (2), the proximity operator \(\operatorname{prox}_{f_{1}}\) can be computed efficiently (e.g., in closed form) via the so-called (group) shrinkage operator [12].
Using the proximity operator, we can write down the optimality condition for (1) as a fixed point equation
$$ x=\operatorname{prox}_{\alpha f_{1}}\bigl(x-\alpha\nabla f_{2}(x)\bigr) $$(7)
for some α>0. The proximal gradient method (PGM) solves this fixed point equation via the iteration
$$ x^{k+1}=\operatorname{prox}_{\alpha_{k} f_{1}}\bigl(x^{k}-\alpha_{k}\nabla f_{2}\bigl(x^{k}\bigr)\bigr),\quad k=0,1,2,\cdots, $$(8)
where α k >0 is a stepsize. Since the nonsmooth function f 1 (cf. (2)) has a separable structure, the resulting proximal step \(\operatorname{prox}_{\alpha_{k}f_{1}}(\cdot)\) decomposes naturally across groups (and/or coordinates) and can be computed efficiently via (group) shrinkage (see Sect. 2).
Despite its popularity, the convergence analysis of the proximal gradient method is still rather limited. For instance, it is only known [3, Theorem 3.4; or 2, Proposition 2] that if the stepsize α k satisfies
where L is the Lipschitz constant of the gradient ∇f 2(x):
$$ \bigl\|\nabla f_{2}(x)-\nabla f_{2}(y)\bigr\|\leqslant L\|x-y\|,\quad\forall\, x,y\in\mathbb{R}^{n}, $$
then every sequence {x k} k⩾0 generated by the proximal gradient algorithm (8) converges to a solution to (1). The rate of convergence is typically sublinear, O(1/k) [11]. A linear rate of convergence is known only in some special cases. For instance, when f 1(x)=i C (x), the indicator function of a polyhedron C, the proximal gradient method (8) has been shown [8] to be globally linearly convergent to an optimal solution of (1), so long as the function f 2 has the composite structure (3). The significance of this convergence analysis lies in the fact that it does not require strong convexity of f 2. More recently, Tseng [12] proved that the PGM is linearly convergent for the case \(f_{1}(x)=\sum_{J\in\mathcal{J}}w_{J}\| x_{J}\|\), again without assuming strong convexity of f 2. The latter point is particularly important for the applications described in Sect. 1.1: for Lasso, group Lasso and sparse group Lasso alike, the number of measurements is far less than the number of unknowns. Therefore we have m≪n, so the matrix A cannot have full column rank, implying that f 2 cannot be strongly convex.
In this paper, we extend Tseng’s results in [12] to the case where f 1 is given by (2), namely, \(f_{1}(x)=\sum_{J\in \mathcal{J}}w_{J}\|x_{J}\|+\lambda\|x\|_{1}\). In particular, we establish the linear convergence of the proximal gradient method (8) for the class of nonsmooth convex minimization problems (1). Our result implies the linear convergence of the PGM (8) for the sparse group Lasso problem (4) even if A does not have full column rank. This significantly strengthens the known sublinear convergence rate of the PGM in the absence of strong convexity. Similar to the analyses in [8, 12], the key step in the linear convergence proof is the establishment of a local error bound that bounds the distance from an iterate to the optimal solution set in terms of the optimality residual \(\|x-\operatorname{prox}_{f_{1}}(x-\nabla f_{2}(x))\|\).
2 Preliminaries
We now develop some technical preliminaries needed for the subsequent convergence analysis in the next section.
For any vector \(a\in \mathbb {R}^{n}\), we use sign(a) to denote the vector whose ith component is
With this notation, the subdifferential of f 1 (cf. (2)) [14] can be written as ∂f 1=(⋯,(∂f 1) J ,⋯) with
for any \(J\in\mathcal{J}\), where \(\mathcal {B}\) and \(\mathcal {B}_{\infty}\) denote the ℓ 2-norm and ℓ ∞-norm unit balls, respectively,
$$ \mathcal{B}:=\bigl\{z:\|z\|\leqslant1\bigr\},\qquad \mathcal{B}_{\infty}:=\bigl\{z:\|z\|_{\infty}\leqslant1\bigr\}. $$
Let us now consider a generic iteration of the PGM (8). For convenience, we use x, x + and α to denote x k, x k+1 and α k , respectively. In light of the definition of the prox-operator (6), we can equivalently express the PGM iteration (8) in terms of the optimality condition for the prox-operator
Using the separable structure of f 1 (cf. (2)), we can break up this optimality condition to the group level:
where (∂f 1(x +)) J is given by (10). Notice that for any \(J\in\mathcal{J}\), the component vector \(x_{J}^{+}\) is uniquely defined by the above optimality condition (11).
Fix any x and any \(J\in\mathcal{J}\). For each j∈J, let us denote
Notice that, in the second case of (12), β j (α) is simply equal to \(\mbox{Shrink}_{[-\alpha\lambda,\alpha\lambda]}((x-\alpha\nabla f_{2}(x))_{j})\), where the shrinkage operator is the same as that used in compressive sensing algorithms. Namely, for any γ>0, the shrinkage operator over the interval [−γ,γ] is given by
$$ \mbox{Shrink}_{[-\gamma,\gamma]}(z)=\operatorname{sign}(z)\max\bigl\{|z|-\gamma,0\bigr\}. $$
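For reference, this shrinkage (soft-thresholding) operation admits a one-line implementation; the Python sketch below is purely illustrative and applies the operator componentwise when its argument is a vector.

```python
import numpy as np

def shrink(v, gamma):
    """Shrinkage over [-gamma, gamma]: move v toward zero by gamma, truncating at zero.

    Componentwise, Shrink_{[-gamma,gamma]}(v) = sign(v) * max(|v| - gamma, 0).
    """
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)
```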
We now provide a complete characterization of the PGM iterate (8) by further simplifying the optimality condition (11).
Proposition 1
The PGM iterate x + can be computed explicitly according to
Proof
Fix any x. If \((x-\alpha\nabla f_{2}(x))_{J}\in\alpha(w_{J}\mathcal {B}+\lambda \mathcal {B}_{\infty})\), then it follows from (10) that the optimality condition (11) is satisfied at 0, implying \(x^{+}_{J}=0\) (by the uniqueness of \(x^{+}_{J}\)). The converse is also true: if \((x-\alpha\nabla f_{2}(x))_{J}\not\in\alpha(w_{J}\mathcal {B}+\lambda \mathcal {B}_{\infty})\), then \(x^{+}_{J}\neq0\), because otherwise the optimality condition (11) would be violated.
Next we assume \((x-\alpha\nabla f_{2}(x))_{J}\not\in\alpha(w_{J}\mathcal {B}+\lambda \mathcal {B}_{\infty})\) so \(x^{+}_{J}\neq0\). If, in addition, (x−α∇f 2(x)) j ∈[−αλ,αλ], then the optimality condition (11) implies \(x^{+}_{j}=0\) (simply check that the optimality condition is satisfied at the point 0, and use the uniqueness of \(x^{+}_{j}\)).
The remaining case is both \((x-\alpha\nabla f_{2}(x))_{J}\not\in\alpha (w_{J}\mathcal {B}+\lambda \mathcal {B}_{\infty})\) and |(x−α∇f 2(x)) j |>αλ. In this case, \(x^{+}_{j}\neq0\) and the optimality condition (11) implies
Since the terms on the right hand side have the same sign, it follows that \(\mbox{sign}((x-\alpha\nabla f_{2}(x))_{j})=\mbox{sign}(x^{+}_{j})\). Replacing the last term by sign((x−α∇f 2(x)) j ) and rearranging the terms, we obtain
which further implies
where we have used the fact that β j (α)=0 whenever \(x_{j}^{+}=0\) (see the definition of β j (α) (12)). Hence, we have
Substituting this relation into (14) yields
which establishes the proposition. □
Proposition 1 explicitly specifies how the PGM iterate x + can be computed. The only part that still requires further checking is to see whether the first condition in (12) holds. This can be accomplished easily by solving the following convex quadratic programming problem:
By Proposition 1, if the minimum value of (15) is less than or equal to \(\alpha^{2}w^{2}_{J}\), then we set \(x^{+}_{J}=0\); else set
where β J (α) is defined by (12). In fact, due to the separable structure of the cost function, the minimum of (15) is attained at
and the minimum value is simply
where β j (α) is defined by (12).
In light of the preceding discussion, the updating formula (13) in Proposition 1 can be rewritten as
Hence, we can summarize the PGM iteration as follows:
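In particular, one full PGM iteration (8) with the groupwise closed form of Proposition 1 can be sketched in Python as below. This is a minimal illustrative sketch, not the paper's own pseudocode: `grad_f2` is any routine returning ∇f 2(x), and `groups`, `w`, `lam`, `alpha` are our placeholder names for the partition \(\mathcal{J}\), the weights w J , λ and the step size α.

```python
import numpy as np

def pgm_step(x, grad_f2, groups, w, lam, alpha):
    """One proximal gradient step x^+ = prox_{alpha*f_1}(x - alpha*grad f_2(x)).

    The prox is evaluated group by group: first a componentwise shrinkage by
    alpha*lam (the beta_J(alpha) of (12)), then a group shrinkage by alpha*w_J.
    """
    z = x - alpha * grad_f2(x)                    # forward (gradient) step
    x_plus = np.zeros_like(x)
    for wj, J in zip(w, groups):
        beta_J = np.sign(z[J]) * np.maximum(np.abs(z[J]) - alpha * lam, 0.0)
        norm_beta = np.linalg.norm(beta_J)
        if norm_beta > alpha * wj:                # otherwise the whole group is set to zero
            x_plus[J] = (1.0 - alpha * wj / norm_beta) * beta_J
    return x_plus
```

Iterating x ← pgm_step(x, …) with a step size satisfying (9) produces the sequence analyzed in the rest of the paper.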
Another useful property in the analysis of the PGM is the fact that Ax is invariant over the optimal solution set of (1). Denote the optimal solution set of (1) by
$$ \bar{X}:=\mathop{\mbox{argmin}}_{x\in \mathbb {R}^{n}} F(x). $$(18)
Proposition 2
Consider the nonsmooth convex minimization problem min x F(x)=f 1(x)+f 2(x), where f 1 and f 2 are given by (2) and (3), respectively. Then Ax ∗ is invariant over \(\bar{X}\) in the sense that there exists \(\bar{y} \in \operatorname {dom}h\) such that
$$ Ax^{*}=\bar{y},\quad \forall\, x^{*}\in\bar{X}. $$(19)
Proof
The argument is similar to Lemma 2.1 in [8]. Since F(x)=f 1(x)+f 2(x)=f 1(x)+h(Ax) is continuous and convex, the optimal solution set \(\bar{X}\) must be closed and convex. For any \(x^{*},~\tilde{x}\in\bar{X}\), we have by the convexity of \(\bar{X}\) that \((x^{*}+\tilde{x})/2\in\bar{X}\). It follows that
which further implies
By the convexity of h(⋅), we have
Combining the above two relations, we obtain
By the convexity of f 1(x), it follows that
This implies that \(\frac{f_{1}(x^{*})+f_{1}(\tilde{x})}{2}=f_{1} (\frac{x^{*}+\tilde{x}}{2} )\). Therefore, we obtain
By the strict convexity of h(⋅), we must have \(Ax^{*}=A\tilde{x}\). □
3 Error Bound Condition
The global convergence of PGM is given by [3, Theorem 3.4]. In particular, under the following three assumptions:
and if the stepsize α k satisfies
then every sequence {x k} k⩾0 generated by the proximal gradient algorithm converges to a solution to (1).
Let us now focus on the linear convergence of the PGM. Traditionally, linear convergence of a first order optimization method is established only under strong convexity and smoothness assumptions. Unfortunately, in our case the objective function F(x) in (1) is neither smooth nor strongly convex. To establish linear convergence in the absence of strong convexity, we rely on the following error bound condition, which estimates the distance from an iterate to the optimal solution set.
Error Bound Condition
Let us define a distance function to the optimal solution set \(\bar {X}\) (cf. (18)) as
$$ \operatorname{dist}_{\bar{X}}(x):=\min_{x^{*}\in\bar{X}}\|x-x^{*}\|, $$
and define a residual function
$$ r(x):=\operatorname{prox}_{f_{1}}\bigl(x-\nabla f_{2}(x)\bigr)-x. $$(20)
Since the prox-operator is Lipschitz continuous (in fact non-expansive), it follows that the residual function r(x) is continuous on \(\operatorname {dom}f_{2}\).
We say a local error bound holds around the optimal solution set \(\bar{X}\) of (1) if for any ξ⩾min x F, there exist scalars κ>0 and ε>0 such that
$$ \operatorname{dist}_{\bar{X}}(x)\leqslant\kappa\bigl\|r(x)\bigr\| \quad\mbox{for all } x \mbox{ with } F(x)\leqslant\xi \mbox{ and } \bigl\|r(x)\bigr\|\leqslant\varepsilon. $$(21)
To simplify the notation, we denote g=∇f 2(x) and β J :=β J (1) (cf. (12)). In light of Proposition 1, specializing the update formula (13) to the proximal step with α=1, we can write the residual function r(x) as
for all \(J\in\mathcal{J}\).
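In implementation terms, the residual r(x) is simply the proximal step taken from x with unit step size; its norm is the optimality measure appearing in the error bound and serves as a natural stopping criterion. The sketch below reuses the illustrative `pgm_step` routine from Sect. 2 (only the norm of the returned vector matters for the bound).

```python
def residual(x, grad_f2, groups, w, lam):
    """Compute r(x) = prox_{f_1}(x - grad f_2(x)) - x, i.e. the proximal step with alpha = 1.

    r(x) vanishes exactly at the solutions of (1), and ||r(x)|| is the quantity
    that the local error bound compares against the distance to the solution set.
    """
    return pgm_step(x, grad_f2, groups, w, lam, alpha=1.0) - x
```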
Theorem 1
Consider the nonsmooth convex minimization problem (1) with f 1(x) and f 2(x) defined by (2) and (3). Suppose f 1(x) and f 2(x) satisfy the assumptions (A1)–(A3). Then the error bound condition (21) holds.
The proof of Theorem 1 is rather technical and extends the analysis of Tseng [12]. In particular, we need the two intermediate lemmas described below. For simplicity, for any sequence {x k} k⩾0 in \(\mathbb {R}^{n}\setminus \bar{X}\), we adopt the following shorthand notation:
and
Lemma 1
Consider the nonsmooth convex minimization problem (1) with f 1 and f 2 defined by (2) and (3), respectively. Suppose f 1(x) and f 2(x) satisfy assumptions (A1)–(A3). Furthermore, suppose there exists a sequence \(x^{1},x^{2},\cdots\in \mathbb {R}^{n}\setminus\bar{X}\) satisfying
and \(A\bar{u}=0. \) Let
Then there exists a subsequence of \(\{\hat{x}^{k}\}\) along which the following:
is satisfied for all \(J\in\mathcal{J}\).
Lemma 2
Suppose f 1(x) and f 2(x) satisfy assumptions (A1)–(A3). Moreover, suppose there exists a sequence \(x^{1},x^{2},\cdots\in \mathbb {R}^{n}\setminus \bar{X}\) satisfying (24). Then there exists a κ>0 such that
The proofs of Lemmas 1 and 2 are relegated to Appendices A and B, respectively. Assuming these lemmas hold, we can proceed to prove Theorem 1.
Proof of Theorem 1
We argue by contradiction. Suppose there exists a ζ⩾minF such that (21) fails to hold for all κ>0 and ε>0. Then there exists a sequence \(x^{1},x^{2},\cdots\in \mathbb {R}^{n}\setminus\bar{X}\) satisfying (24).
Let \(\bar{y} =Ax^{*}\) for any \(x^{*}\in\bar{X}\) (note that \(\bar{y}\) is independent of x ∗, cf. Proposition 2) and let
By Proposition 2, \(\bar{g}^{k}=A^{T}\nabla h(A\bar {x}^{k})=A^{T}\nabla h(\bar{y})=\bar{g}\) for all k. Since
it follows from the convexity that
The latter is also the optimality condition for
We use an argument similar to that of [12]. In particular, by evaluating the right hand side of (29) at r k and \(\bar {x}^{k}-x^{k}\), respectively, we have
Similarly, since \(\bar{x}^{k}\in\bar{X}\) and \(\bar{g}^{k}=\bar{g}\), it follows that
which is further equivalent to
By evaluating the right hand side of (31) at 0 and \(x^{k}+r^{k}-\bar{x}^{k}\), respectively, we obtain
i.e.
By (28), the strong convexity of h and Lemma 2 (cf. (27)), we obtain
Moreover, since \(\|A\|:=\max_{\|d\|=1}\|Ad\|\), it follows that
Combining the above three inequalities gives
which further implies
Canceling out a factor of \(\|x^{k}-\bar{x}^{k}\|\) yields
which contradicts (24). This completes the proof of Theorem 1. □
4 Linear Convergence
We now establish the linear convergence of the PGM (8) under the local error bound condition (21). Let F(x)=f 1(x)+f 2(x) where f 1 and f 2 are defined by (2) and (3), respectively. Suppose that ∇f 2 is Lipschitz continuous with modulus L. Let {x k} k⩾0 be a sequence generated by the PGM (8). There are three key steps in the linear convergence proof which we outline below. The framework was first established by Luo and Tseng in 1992 [8].
- Step 1 (Sufficient decrease). Suppose the step size α k is chosen according to (9), but with \(\bar{\alpha}<\frac{1}{L}\). Then for all k⩾0, we have
$$ {F\bigl(x^{k}\bigr)}- F\bigl(x^{k+1}\bigr)\geqslant c_1\bigl\|x^{k+1} - x^k\bigr\|^2,\quad\mbox{for some } c_1 > 0. $$(33)
- Step 2 (Local error bound). Let \(\bar{X}\) denote the set of optimal solutions satisfying (7) and let \(\operatorname{dist}_{\bar{X}}(x) := \min_{x^{*}\in\bar{X}}\|x-x^{*}\|\). Then for any ξ⩾minF(x), there exist some κ, ε>0 such that
$$ \operatorname{dist}_{\bar{X}}(x)\leqslant \kappa\bigl\|x- \operatorname{prox}_{f_1}\bigl[x-\nabla f_2(x)\bigr]\bigr\|, $$(34)
for all x such that \(\|x-\operatorname{prox}_{f_{1}}[x-\nabla f_{2}(x)]\| \leqslant \varepsilon\).
- Step 3 (Cost-to-go estimate). There exists a constant c 2>0 s.t.
$$ F\bigl(x^k\bigr) - F^* \leqslant c_2\bigl(\operatorname{dist}^2_{\bar{X}} \bigl( x^k\bigr)+\bigl\|x^{k+1}-x^k\bigr\|^2 \bigr),\quad\forall k. $$(35)
We first establish the sufficient decrease property (33). Notice that the PGM iteration (8) can be equivalently written as
Plugging the values of x=x k+1 and x k, respectively, into the right hand side yields
Since L is the Lipschitz constant of ∇f 2, it follows from the Taylor expansion of f 2 that
where the last step is due to (9). Since \(\bar {\alpha}<1/L\), it follows that the sufficient decrease condition (33) holds for all k⩾0 with
The local error bound condition (34) holds due to Theorem 1. It remains to establish the cost-to-go estimate (35). Let \(\bar{x}^{k}\in\bar{X}\) be such that \(\operatorname{dist}_{\bar{X}}(x^{k})=\|x^{k}-\bar{x}^{k}\|\). The optimality of x k+1 implies
implying
Also, the mean value theorem shows
for some η k in the line segment joining x k+1 and \(\bar{x}^{k}\). Combining the above two relations and using the triangular inequality
yields
where the first inequality is due to the Lipschitz continuity of ∇f 2 and the second inequality follows from the triangle inequality. This establishes the cost-to-go estimate (35). We are now ready to combine the three steps outlined above to establish the following main linear convergence result.
Theorem 2
Assume that f 1 is convex and f 2 is convex differentiable with a Lipschitz continuous ∇f 2. Moreover, suppose \(\bar{X}\) is nonempty and a local error bound (34) holds around the solution set \(\bar{X}\), and that the step size α k is chosen according to
Then the PGM algorithm (8) generates a sequence of iterates x 0,x 1,⋯,x k,⋯ that converges linearly to a solution in \(\bar{X}\).
Proof
First, the sufficient decrease condition (33) implies ∥x k+1−x k∥2→0. Since
where we used the fact that \({\liminf_{k}}\alpha_{k}\geqslant \underbar{\mbox{$\alpha$}}\), it follows that
Since the function values F(x k) are monotonically decreasing (cf. (33)), it follows that the local error bound (34) holds for some κ and ε. In particular, for sufficiently large k, we have
implying \(\operatorname{dist}_{\bar{X}}(x^{k})\to0\). Consequently, by the cost-to-go estimate (35) we have
Now we use the local error bound (34) and the cost-to-go estimate (35) to obtain
Hence, we have
where
This implies the Q-linear convergence of F(x k)→F ∗. In light of (33), this further implies the R-linear convergence of ∥x k+1−x k∥2. Thus, {x k} converges linearly to an optimal solution in \(\bar{X}\). □
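As an illustration of Theorem 2, the following sketch runs the PGM with a constant step size on a synthetic sparse group Lasso instance with m≪n, so that A cannot have full column rank and f 2 is not strongly convex. The instance, the step size 0.5/L and all variable names are our illustrative choices (reusing the `sparse_group_lasso_objective` and `pgm_step` sketches above); the point is only that the gap F(x k)−F ∗ eventually decreases by a constant factor per iteration, as the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, group_size = 40, 200, 10
A = rng.standard_normal((m, n))
d = rng.standard_normal(m)
groups = [np.arange(j, j + group_size) for j in range(0, n, group_size)]
w = [1.0] * len(groups)
lam = 0.1

L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f_2(x) = A^T(Ax - d)
alpha = 0.5 / L                        # constant step size, safely below 1/L
grad_f2 = lambda x: A.T @ (A @ x - d)

x = np.zeros(n)
values = []
for k in range(500):
    values.append(sparse_group_lasso_objective(x, A, d, groups, w, lam))
    x = pgm_step(x, grad_f2, groups, w, lam, alpha)

# values[k] - min(values) should eventually shrink geometrically, reflecting
# the Q-linear convergence of F(x^k) established in Theorem 2.
```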
5 Closing Remarks
Motivated by the recent applications in sparse group Lasso, we have considered in this paper a class of nonsmooth convex minimization problems whose objective function is the sum of a smooth convex function, a nonsmooth ℓ 1-norm regularization term and an ℓ 2-norm regularization term. We have derived a proximal gradient method for this problem whose subproblem can be solved efficiently (in closed form). Moreover, we have established linear convergence of this method when the smooth part of the objective function consists of a strongly convex function composed with a linear mapping, even though the overall objective function is not strongly convex and the problem may have multiple solutions. The key step in the analysis is a local error bound condition which provides an estimate of the distance to the optimal solution set in terms of the size of the proximal gradient vector.
References
Bach, F.: Consistency of the group Lasso and multiple kernel learning. J. Mach. Learn. Res. 9, 1179–1225 (2008)
Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. arXiv:0912.3522v4 [math.OC], 18 May 2010
Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4, 1168–1200 (2005)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1359 (2001)
Friedman, J., Hastie, T., Tibshirani, R.: A note on the group Lasso and a sparse group Lasso. arXiv:1001.0736v1 [math.ST], 5 Jan 2010
Kim, D., Sra, S., Dhillon, I.: A scalable trust-region algorithm with application to mixed-norm regression. In: International Conference on Machine Learning (ICML), vol. 1 (2010)
Liu, J., Ji, S., Ye, J.: SLEP: sparse learning with efficient projections. Arizona State University (2009)
Luo, Z.Q., Tseng, P.: On the linear convergence of descent methods for convex essentially smooth minimization. SIAM J. Control Optim. 30(2), 408–425 (1992)
Ma, S., Song, X., Huang, J.: Supervised group Lasso with applications to microarray data analysis. BMC Bioinform. 8(1), 60 (2007)
Meier, L., Van de Geer, S., Buhlmann, P.: The group Lasso for logistic regression. J. R. Stat. Soc., Ser. B, Stat. Methodol. 70(1), 53–71 (2008)
Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer, Boston (2004)
Tseng, P.: Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125(2), 263–295 (2010)
Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)
Rockafellar, R.T.: Convex Analysis. Princeton Univ. Press, Princeton (1970)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, New York (1998)
Roth, V., Fischer, B.: The group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In: Proceedings of the 25th International Conference on Machine Learning, pp. 848–855. ACM, New York (2008)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
van den Berg, E., Schmidt, M., Friedlander, M., Murphy, K.: Group sparsity via linear-time projection. Technical Report TR-2008-09, Department of Computer Science, University of British Columbia (2008)
Vincent, M., Hansen, N.R.: Sparse group Lasso and high dimensional multinomial classification. J. Comput. Stat. Data Anal. arXiv:1205.1245v1 [stat.ML], 6 May 2012
Wright, S., Nowak, R., Figueiredo, M.: Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57(7), 2479–2493 (2009)
Yang, H., Xu, Z., King, I., Lyu, M.: Online learning for group Lasso. In: 27th International Conference on Machine Learning (ICML2010) (2010)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc., Ser. B, Stat. Methodol. 68(1), 49–67 (2006)
Zou, H.: The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67(2), 301–320 (2005)
Additional information
This work was partially supported by the National Natural Science Foundation of China (Nos. 61179033, DMS-1015346). Part of this work was performed during a research visit by the first author to the University of Minnesota, with support from the Education Commission of Beijing Municipal Government.
Appendices
Appendix A: Proof of Lemma 1
For each \(J\in\mathcal{J}\), if \(\bar{u}_{J}=0\), then \(\hat{x}^{k}_{J}=\bar{x}^{k}_{J}\) and (26) holds automatically (since \(\bar{x}^{k}\in\bar{X}\)). In the remainder of the proof, we assume \({\bar{u}_{J}}\neq0\).
Since f 2 is given by (3), it follows from (1), (18), and (19) that
and since λ>0 or w J >0 for all \(J\in\mathcal{J}\) (so that f 1 is coercive), \(\bar{X}\) must be compact. In fact, the level sets of F(x) must also be compact.
Let
By (19) and (36), \(A\bar{x}^{k}=\bar{y}\) and \(\nabla f_{2}(\bar{x}^{k})=\bar{g}\) for all k, where \(\bar{x}^{k}=\mbox {argmin}_{\bar{x}\in\bar{X}}\|\bar{x}-x^{k}\|\).
Since the level sets of F are compact, it follows from (24) that the sequence {x k} must be bounded. By further passing to a subsequence if necessary, we can assume that \(x^{k}\to\bar{x}\) for some \(\bar{x}\). By assumption (24), the sequence {r(x k)} converges to zero. Since the residual function r(x) is continuous (cf. (20)), this implies \(r(\bar{x})=0\), so \(\bar{x}\in\bar{X}\). Hence \(\delta_{k}=\|x^{k}-\bar{x}^{k}\|\leqslant \|x^{k}-\bar{x}\|\rightarrow0\), so that \(\bar{x}^{k}\rightarrow\bar{x}\). Also, by (36), \(g^{k}=\nabla f_{2}(x^{k})\rightarrow\nabla f_{2}(\bar{x})=\bar{g}\). Since f 1(x k)⩾0, it follows that h(Ax k)=F(x k)−f 1(x k)⩽F(x k)⩽ζ for all k. Since h is strongly convex, its level sets must be compact. This implies that {Ax k} and \(\bar{y}\) lie in some compact convex subset Y of the open convex set \(\operatorname {dom}h\). By the strong convexity of h and the assumption that ∇h is Lipschitz continuous on Y, we have
Since by assumption
it follows that \(\|Ax^{k}-\bar{y}\|=o(\delta_{k})\). Since Ax k and \(\bar{y}\) are in Y, the Lipschitz continuity of ∇h on Y (see (37)) and (36) yield
Consider a group \(J\in\mathcal{J}\). We decompose \({J}=J^{k}_{0}\cup J^{k}_{1}\), where \(\bar{x}^{k}_{j}=0\) if and only if \(j\in J^{k}_{0}\), and \(\bar{x}^{k}_{j}\neq0\) if and only if \(j\in J^{k}_{1}\). In general, \(J^{k}_{0}\) and \(J^{k}_{1}\) vary with the iteration index k. Since there are only finitely many choices for \(J^{k}_{0}\) and \(J^{k}_{1}\), by passing to a subsequence \(\mathcal{K}_{0}\) if necessary, we can assume that \(J^{k}_{0}\) and \(J^{k}_{1}\) are fixed. Let us denote them simply as J 0 and J 1, respectively. Then we have for all \(k\in\mathcal{K}_{0}\)
By further passing to a subsequence if necessary, we consider the following three cases:
- (a) \(\|\beta^{k}_{J}\|\leqslant w_{J}\) for all k;
- (b) \(\|\beta^{k}_{J}\|> w_{J}\) and \(\bar{x}^{k}_{J}\neq0\) for all k;
- (c) \(\|\beta^{k}_{J}\|> w_{J}\) and \(\bar{x}^{k}_{J}=0\) for all k.
Case (a). In this case, the formula (22) implies that \(r^{k}_{J}:=(r(x^{k}))_{J}=-x^{k}_{J}\) for all k. Since r k→0 and \(x^{k}\rightarrow\bar{x}\), it follows that \(\bar{x}_{J}=0\). Also, by (23) and (24),
Since \(\bar{u}_{J}\neq0\), it follows that \(\bar{x}^{k}_{J}\neq0\) for sufficiently large \(k\in\mathcal{K}_{0}\), so J 1≠∅. By \(\nabla f_{2}(\bar{x}^{k})=\bar{g}\), \(\bar{x}^{k}\in\bar{X}\) and the optimality condition, we have
We first consider the entries in J 0=J∖J 1. Since \(\bar {x}^{k}_{J_{0}}=0\), it follows from (40) that \(\bar{u}_{J_{0}}=0\). By (25), we have
Also, by (41), we have
Since \(\|\bar{x}^{k}_{J}\|\neq0\), it follows from (42)
Letting \(C=\frac{\|\hat{x}^{k}_{J}\|}{w_{J}}\), we have
It remains to consider the entries in J 1. Since for j∈J 1 and \(k\in\mathcal{K}_{0}\), \(\bar{x}^{k}_{j}\neq0\), there exist a subsequence \(\mathcal{K}_{1}\subseteq\mathcal{K}_{0}\) and a constant vector \(s_{J_{1}}\),
and by (41) we have
implying that \(\bar{x}^{k}_{J_{1}}/\|\bar{x}^{k}_{J}\|\) is constant and parallel to \({\bar{u}_{J_{1}}}\) (cf. (40)). Hence, we have
Combining the above equation with (41), that is, \(\bar {x}^{k}_{J_{1}}=-\frac{\|\bar{x}^{k}_{J}\|}{w_{J}}(\bar{g}_{J_{1}}+\lambda \operatorname{sign}(\bar{x}^{k}_{J_{1}}))\), we have
Because \(\bar{u}_{J_{1}}=-\lim_{k\rightarrow\infty}\bar {x}^{k}_{J_{1}}/\delta_{k}\) and \(\bar{u}_{J_{1}}\neq0\) (from \(\bar{u}_{J}\neq0\) and \(\bar{u}_{J_{0}}=0\)), we have \(\delta_{k}=O(\|\bar{x}^{k}_{J_{1}}\|)\) for sufficiently large \(k\in\mathcal{K}_{1}\); therefore, \(\|\bar{x}^{k}_{J}\| -\delta ^{2}_{k}\|\bar{u}_{J}\|>0\) and \(\mbox{sign}(\hat{x}^{k}_{J_{1}})=\mbox{sign}(\bar {x}^{k}_{J_{1}})=s_{J_{1}}\). Then, for sufficiently large \(k\in\mathcal{K}_{1}\), by (46) and (47), the two vectors \(\bar{x}^{k}_{J_{1}}\) and \(\hat{x}^{k}_{J_{1}}\) are parallel, so that
Since \(\|\hat{x}^{k}_{J}\|=\|\hat{x}^{k}_{J_{1}}\|\) and \(\|\bar{x}^{k}_{J_{1}}\| =\| \bar{x}^{k}_{J}\|\), we have
Substituting this into (46) and using \(\mbox{sign}(\hat {x}^{k}_{J_{1}})=s_{J_{1}}\) yields
By (44) and (48), we obtain (26).
Case (b). Similar to Case (a), we will show that the two vectors \(\bar{x}^{k}_{J}\) and \(\bar{u}_{J}\) are parallel to the direction \(\bar{g}_{J}+\lambda \operatorname{sign}(\bar{x}^{k}_{J})\), and that \(\bar {x}^{k}_{J_{1}}\) can be written as (47), while \(\hat {x}^{k}_{J_{0}}=\bar{x}^{k}_{J_{0}}=\bar{u}_{J_{0}}=0\).
First, since \(\bar{x}^{k}\in\bar{X}\) and \(\bar{x}^{k}_{J}\neq0\), it follows that J 1≠∅ for \(k\in \mathcal{K}_{0}\). The optimality condition and \(\nabla f_{2}(\bar{x}^{k})=\bar{g}\) imply that (41), (45) and (46) hold for \(k\in\mathcal{K}_{1}\). According to (45) and (46), the signs of various quantities are fixed for all sufficiently large \(k\in\mathcal{K}_{1}\)
In light of (41), we can see
We next use a limiting argument to show that \(\bar{u}_{J_{1}}\) is also parallel to the direction \((\bar{g}_{J_{1}}+\lambda \operatorname{sign}(\bar {x}^{k}_{J_{1}}))\). Denote
and note that \(\beta^{k}_{J}=x^{k}_{J}-g^{k}_{J}-\lambda \operatorname{sign}(x^{k}_{J}-g^{k}_{J})\) (cf. (12)). It follows from (23) and (38) that
Using the optimality condition (41) and the property (49), we can simplify (51) as
which, by \(\|\bar{x}^{k}_{J_{1}}\|=\|\bar{x}^{k}_{J}\|\), further implies
Hence, we have \(1-w_{J}/\|\bar{\beta}^{k}_{J}\|\neq0\). By \(\bar{x}^{k}\in\bar{X}\) and (22), we obtain
and by \(\bar{x}^{k}_{J_{0}}=0\), we have from (51)
From (53), we have
Moreover, it follows from \(\bar{x}^{k}_{J_{0}}=0\) and (54),
Therefore, we have
Recall that \(\bar{r}^{k}_{J}:=r(\bar{x}^{k}_{J})=0\). It follows from (22) and (12) with α=1 that for sufficiently large \(k\in\mathcal{K}_{1}\),
Now, using (51), a Taylor expansion and (55), we obtain
Multiplying both sides by \(\frac{\|\bar{\beta}^{k}_{J}\|}{\delta_{k}}\), using (24) and (56), yields in the limit:
This shows that \(\bar{u}_{J_{1}}\) is parallel to the vector \(({\bar {g}_{J_{1}}+\lambda \operatorname{sign}(\bar{x}^{k}_{J_{1}})} )\). Since \(\|\bar{g}_{J_{1}}+\lambda \operatorname{sign}(\bar{x}^{k}_{J_{1}})\|=w_{J}\) (by (46)), we have
Combining this with (50), we obtain
where \(\|\bar{x}^{k}_{J}\|-\delta^{2}_{k}\|\bar{u}_{J_{1}}\|>0\) for sufficiently large \(k\in\mathcal{K}_{1}\) (indeed, if \(\bar{x}_{J}\neq0\), the claim is obvious; otherwise \(\bar{x}_{J}=0\), and since \(\bar{u}_{J}=-\lim_{k\rightarrow\infty}\bar{x}^{k}_{J}/\delta_{k}\) with \(\bar{u}_{J}\neq0\), we have \(\delta_{k}=O(\|\bar{x}^{k}_{J}\|)\)).
Next we consider the entries in J 0. Clearly, we have \(\bar {x}^{k}_{J_{0}}=0\) by definition. We now show by a limiting argument that \(\bar{u}_{J_{0}}\) is also zero. By (22), substituting \(\beta ^{k}_{J_{0}}\) into \(-r^{k}_{J_{0}}\) and collecting similar terms, we have
Moreover, by (38), we have
where the second step follows from (54). Since the signs are fixed for sufficiently large \(k\in\mathcal{K}_{1}\) (cf. (49)), we have \(-\lambda \operatorname{sign}(\bar {x}_{J_{0}}-\bar {g}_{J_{0}})+\lambda \operatorname{sign}(x^{k}_{J_{0}}-g^{k}_{J_{0}})=0\), therefore,
Multiplying both sides of the above equation by \(\frac{1}{\delta_{k}}\) and letting k→+∞, we obtain by (24) that \(x^{k}_{J_{0}}/\delta_{k}\rightarrow0\). This further implies
So, \(\hat{x}^{k}_{J_{0}}=\bar{x}^{k}_{J_{0}}+\delta_{k}^{2}\bar{u}_{J_{0}}=0\). Combining this with (58) yields
Using an argument identical to that for Case (a), we can show that \(\hat{x}^{k}_{J}\) satisfies (48) for sufficiently large \(k\in \mathcal {K}_{1}\), thus establishing the desired property (26).
Case (c). In this case, \(0=\bar{x}^{k}_{J}\rightarrow\bar{x}_{J}\), and it follows that \(\bar {x}_{J}=0\). Since \(\|\beta_{J}^{k}\|>w_{J}\) (the assumption for Case (c)), it follows that \(x^{k}_{J}+r^{k}_{J}\neq0\). By the optimality condition of the prox-operator, we have
Notice that
where we have used the fact that \(\bar{x}_{J}^{k}=0\) in this case. Hence, we have
Moreover, notice that
Thus, by taking the limit k→∞ in (59), we obtain
implying
This is precisely the optimality condition for \(\hat{x}^{k}_{J}=\bar {x}^{k}_{J}+\delta^{2}_{k}\bar{u}_{J}=\delta^{2}_{k}\bar{u}_{J}\). In other words, we have (26) as desired.
Appendix B: Proof of Lemma 2
We argue by contradiction. Suppose this is false. Then, by passing to a subsequence if necessary, we can assume that
Since \(\bar{y}=A\bar{x}^{k}\), this is equivalent to {Au k}→0, where u k is defined by (23). Note that ∥u k∥=1 for all k. By further passing to a subsequence if necessary, we may assume that \(u^{k}\to\bar{u}\) for some \(\bar{u}\) with \(\|\bar{u}\|=1\). Then \(A\bar{u}=0\). Moreover,
Since \(\{u^{k}\}\rightarrow\bar{u}\) and \(\|\bar{u}\|=1\), we have \(\langle u^{k},\bar{u}\rangle \geqslant \frac{1}{2}\) for all k sufficiently large. Fix any such k and consider \(\hat{x}^{k}\) defined by (25), namely, \(\hat{x}^{k}=\bar{x}^{k}+\delta^{2}_{k}\bar{u}\).
Since \(A\bar{u}=0\), it follows \(A\hat{x}^{k}=A\bar{x}^{k}\), which further implies \(\nabla f_{2}(\hat{x}^{k})= A^{T}\nabla h(A\hat{x}^{k})=A^{T}\nabla h(A\bar {x}^{k})=\nabla f_{2}(\bar{x}^{k})=\bar{g}\). By Lemma 1, since \(\hat{x}^{k}\) satisfies (25), it follows that
for all \(J\in\mathcal{J}\). Hence \(\hat{x}^{k}\in\bar{X}\). Since \(\langle x^{k}-\bar{x}^{k},\bar{u}\rangle=\delta_{k}\langle u^{k},\bar{u}\rangle >\delta _{k}/2\) and \(\|\bar{u}\|=1\), it follows that
which contradicts \(\bar{x}^{k}\) being the point in \(\bar{X}\) nearest to x k. This proves (27).