Abstract
In this paper, we propose new methods to efficiently solve convex optimization problems arising in sparse estimation. These methods include a new quasi-Newton method that avoids computing the Hessian matrix and improves efficiency, for which we prove fast (superlinear) convergence. We also prove local convergence of the Newton method under the assumption of strong convexity. The proposed methods are particularly effective for \(L_1\)-regularization and group-regularization problems because they incorporate variable selection into each update. Through numerical experiments, we demonstrate the efficiency of our methods on problems arising in sparse estimation. Our contributions include both theoretical guarantees and practical algorithms for various kinds of problems.
Data Availability
The data used in the numerical experiments were the cod-RNA and ijcnn1 datasets available on the LIBSVM website.
References
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67
Xiao X, Li Y, Wen Z et al (2018) A regularized semismooth Newton method with projection steps for composite convex programs. J Sci Comput 76(1):364–389
Patrinos P, Bemporad A (2013) Proximal Newton methods for convex composite optimization. In: 52nd IEEE Conference on Decision and Control, IEEE, pp 2358–2363
Patrinos P, Stella L, Bemporad A (2014) Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655
Stella L, Themelis A, Patrinos P (2017) Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput Optim Appl 67(3):443–487
Milzarek A, Xiao X, Cen S et al (2019) A stochastic semismooth Newton method for nonsmooth nonconvex optimization. SIAM J Optim 29(4):2916–2948
Yang M, Milzarek A, Wen Z et al (2021) A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math Program pp 1–47
Li Y, Wen Z, Yang C et al (2018) A semi-smooth Newton method for semidefinite programs and its applications in electronic structure calculations. SIAM J Sci Comput 40(6):A4131–A4157
Ali A, Wong E, Kolter JZ (2017) A semismooth Newton method for fast, generic convex programming. In: International Conference on Machine Learning, PMLR, pp 70–79
Bauschke HH, Combettes PL (2011) Convex analysis and monotone operator theory in Hilbert spaces. Springer
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci 2(1):183–202
Ulbrich M (2011) Semismooth Newton methods for variational inequalities and constrained optimization problems in function spaces. SIAM
Facchinei F, Pang JS (2003) Finite-dimensional variational inequalities and complementarity problems. Springer
Zhang Y, Zhang N, Sun D et al (2020) An efficient Hessian based algorithm for solving large-scale sparse group lasso problems. Math Program 179:223–263
Qi L, Sun J (1993) A nonsmooth version of Newton's method. Math Program 58(1–3):353–367
Facchinei F, Fischer A, Kanzow C (1996) Inexact Newton methods for semismooth equations with applications to variational inequality problems. Nonconvex Optim Appl pp 125–139
Sun D, Han J (1997) Newton and quasi-Newton methods for a class of nonsmooth equations and related problems. SIAM J Optim 7(2):463–480
Hintermüller M (2010) Semismooth Newton methods and applications. Humboldt-University of Berlin, Department of Mathematics
Uzilov AV, Keegan JM, Mathews DH (2006) Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinf 7(1):1–30
Prokhorov D (2001) IJCNN 2001 neural network competition. Slide presentation in IJCNN 1(97):38
Roth V, Fischer B (2008) The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In: Proceedings of the 25th international conference on Machine learning, pp 848–855
Pavlidis P, Weston J, Cai J et al (2001) Gene functional classification from heterogeneous data. In: Proceedings of the fifth annual international conference on Computational biology, pp 249–255
Ortega JM, Rheinboldt WC (2000) Iterative solution of nonlinear equations in several variables. SIAM
Funding
This work was supported by JSPS KAKENHI Grant Number JP23KJ1458 and the Grant-in-Aid for Scientific Research (C) 22K11931.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Appendices
Appendix A. Proof of Proposition 3
Lemma 1
Suppose \(A,B \in {\mathbb R}^{n \times n}\) are symmetric and A is positive semidefinite. Then, any eigenvalue \(\lambda\) of AB satisfies \(\min \{\Vert A \Vert \lambda _{\min } (B),0\} \le \lambda \le \max \{ \Vert A\Vert \lambda _{\max } (B),0 \}\), where \(\lambda _{\min } (B)\) and \(\lambda _{\max } (B)\) are the minimum and maximum eigenvalues, respectively, of B.
Proof
Since the eigenvalues of AB coincide with those of BA, we consider the eigenvalues of BA. Let \(\lambda \in {\mathbb R}\) and \(x \in {\mathbb R}^n\) with \(x\ne 0\) be such that \(BAx = \lambda x\). By multiplying by \(x^TA\) from the left, we obtain
\(x^T A B A x = \lambda x^T A x. \qquad (41)\)
If \(x^T A x = 0\), then \(Ax = 0\) because A is positive semidefinite, and hence \(\lambda = 0\). Next, we consider the case \(x^TAx > 0\). Since A is a symmetric positive semidefinite matrix, its square root \(A^{\frac{1}{2}}\) exists and satisfies \(A = A^{\frac{1}{2}} A^{\frac{1}{2}}\).
We can rewrite (41) as
\(\lambda = \frac{y^T A^{\frac{1}{2}} B A^{\frac{1}{2}} y}{\Vert y \Vert _2^2},\)
where \(y = A^{\frac{1}{2}} x\ne 0\). Thus, since \(0<\Vert A^{\frac{1}{2}} y \Vert _2 \le \Vert A\Vert ^{\frac{1}{2}} \Vert y\Vert _2\), we obtain
\(\min \{\Vert A\Vert \lambda _{\min }(B), 0\} \le \lambda \le \max \{\Vert A\Vert \lambda _{\max }(B), 0\}.\)
Therefore, if \(x^T A x>0\), then \(\min \{\Vert A\Vert \lambda _{\min }(B), 0\} \le \lambda \le \max \{\Vert A\Vert \lambda _{\max }(B), 0\}\). Combining this with the case \(x^T A x = 0\), Lemma 1 holds.\(\square\)
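As a numerical sanity check that we add here (not part of the original proof), the bound of Lemma 1 can be verified on random matrices; the sketch below assumes NumPy and uses the spectral norm for \(\Vert A \Vert\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T                    # symmetric positive semidefinite by construction
S = rng.standard_normal((n, n))
B = (S + S.T) / 2              # symmetric

# Eigenvalues of AB are real: AB is similar to the symmetric A^{1/2} B A^{1/2}
eigs_AB = np.linalg.eigvals(A @ B).real
normA = np.linalg.norm(A, 2)   # spectral norm ||A||
lo = min(normA * np.linalg.eigvalsh(B).min(), 0.0)
hi = max(normA * np.linalg.eigvalsh(B).max(), 0.0)
assert np.all(lo - 1e-8 <= eigs_AB) and np.all(eigs_AB <= hi + 1e-8)
```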
Theorem 1
([6], Theorem 3.2). Suppose \(g:{\mathbb R}^n \rightarrow \mathbb {R} \cup \{+ \infty \}\) is a closed convex function. Every \(V\in \partial _B \textrm{prox}_{\nu g}(x)\) is a symmetric positive semidefinite matrix that satisfies \(\Vert V\Vert \le 1\) for all \(x\in {\mathbb R}^n\).
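For intuition, consider the special case \(g = \lambda \Vert \cdot \Vert _1\) (an illustration we add here): \(\textrm{prox}_{\nu g}\) is componentwise soft-thresholding, and the elements of its B-subdifferential are diagonal matrices with entries in \(\{0,1\}\), which are indeed symmetric, positive semidefinite, and of norm at most 1.

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

nu = 0.5
x = np.array([2.0, 0.1, 0.7])
# One element of the B-subdifferential of soft-thresholding at x:
# a diagonal 0/1 matrix marking the components above the threshold
V = np.diag((np.abs(x) > nu).astype(float))
print(soft_threshold(x, nu))   # -> [1.5 0.  0.2]
print(np.linalg.norm(V, 2))    # spectral norm is 1.0, so ||V|| <= 1 holds
```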
Proof of Proposition 3
By assumption, \(\lambda _{\min }(\nabla ^2 f(x)) \ge \mu\) for any \(x\in {\mathbb R}^n\), so
\(\lambda _{\max } \left( I-\nu \nabla ^2 f(x) \right) \le 1 - \nu \mu .\)
Since every \(V\in \partial _B \textrm{prox}_{\nu g}(x)\) is a symmetric positive semidefinite matrix that satisfies \(\Vert V\Vert \le 1\) for all \(x\in {\mathbb R}^n\) by Theorem 1, Lemma 1 (applied with \(A = V\) and \(B = I-\nu \nabla ^2 f(x)\)) gives that every eigenvalue \(\lambda\) of \(V \left( I-\nu \nabla ^2 f(x) \right)\) satisfies
\(\lambda \le \max \{ \Vert V \Vert (1-\nu \mu ), 0 \} \le \max \{ 1-\nu \mu , 0 \}.\)
Thus, every eigenvalue of \(I - V \left( I-\nu \nabla ^2 f(x) \right)\) is a real number that is greater than or equal to \(\min \{\nu \mu , 1\}\), and \(I - V \left( I-\nu \nabla ^2 f(x) \right)\) is a nonsingular matrix.\(\square\)
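The conclusion of Proposition 3 can be checked numerically; the following sketch (our addition, assuming NumPy) uses a strongly convex quadratic \(f\) with Hessian \(Q\) and a diagonal 0–1 matrix \(V\), as arises for the \(L_1\) prox.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nu, mu = 5, 0.1, 0.5
# Strongly convex quadratic f: Hessian Q with lambda_min(Q) >= mu
P = rng.standard_normal((n, n))
Q = P @ P.T + mu * np.eye(n)
# V: an element of the B-subdifferential of the l1-prox (diagonal, 0/1)
V = np.diag(rng.integers(0, 2, n).astype(float))

J = np.eye(n) - V @ (np.eye(n) - nu * Q)   # the matrix of Proposition 3
xi = min(nu * mu, 1.0)
eigs = np.linalg.eigvals(J).real
assert np.all(eigs >= xi - 1e-8)           # all eigenvalues >= min{nu*mu, 1}
```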
Appendix B. Proof of Theorem 2
Lemma 2
([24], Lemma 2.3.2). Let \(A, C \in {\mathbb R}^{n \times n}\), and assume that A is nonsingular with \(\Vert A^{-1} \Vert \le \alpha\). If \(\Vert A - C \Vert \le \beta\) and \(\alpha \beta < 1\), then C is also nonsingular, and
\(\Vert C^{-1} \Vert \le \frac{\alpha }{1-\alpha \beta }.\)
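Lemma 2 is the standard Banach perturbation lemma, with bound \(\Vert C^{-1}\Vert \le \alpha /(1-\alpha \beta )\); a quick numerical check (our addition, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # nonsingular
alpha = np.linalg.norm(np.linalg.inv(A), 2)         # ||A^{-1}||
E = rng.standard_normal((n, n))
E *= 0.5 / (alpha * np.linalg.norm(E, 2))           # scale so alpha*beta = 0.5 < 1
C = A + E
beta = np.linalg.norm(A - C, 2)
# C is nonsingular and satisfies the perturbation bound
assert np.linalg.norm(np.linalg.inv(C), 2) <= alpha / (1 - alpha * beta) + 1e-10
```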
Proof of Theorem 2
Let \(V^{(k)} \in \partial _B \textrm{prox}_{\nu g}(x^{(k)} - \nu \nabla f(x^{(k)}))\), \(U^{(k)}:= I - V^{(k)}\left( I-\nu \nabla ^2 f(x^{(k)}) \right) \in \partial F_\nu (x^{(k)})\), and \(W^{(k)}:= I - V^{(k)}\left( I-\nu B^{(k)} \right) \in \hat{\partial }^{(k)}F_\nu (x^{(k)})\). From Proposition 3, every eigenvalue of \(U^{(k)}\) is a real number that is greater than or equal to \(\xi := \min \{\nu \mu , 1\}\), and
\(\Vert (U^{(k)})^{-1} \Vert \le \frac{1}{\xi }.\)
Let \(\Delta = \frac{\xi }{5 \nu \sqrt{n}}\). Since \(\partial F_\nu\) is the LNA of \(F_\nu\) at \(x^*\), there exists \(\epsilon > 0\) such that
\(\Vert F_\nu (x) - F_\nu (x^*) - U(x - x^*) \Vert _2 \le \Delta \Vert x - x^*\Vert _2\)
for any \(x\in B(x^*, \epsilon ):= \{y\mid \Vert x^*-y\Vert _2 < \epsilon \}\) and \(U\in \partial F_\nu (x)\). Since \(W^{(k)} - U^{(k)} = \nu V^{(k)}\left( B^{(k)} - \nabla ^2 f(x^{(k)}) \right)\) and \(\Vert B^{(k)} - \nabla ^2 f(x^{(k)})\Vert < \Delta\), we obtain \(\Vert W^{(k)} - U^{(k)}\Vert \le \nu \Delta\). By Lemma 2, \(W^{(k)}\) is nonsingular and
\(\Vert (W^{(k)})^{-1} \Vert \le \frac{1/\xi }{1 - \nu \Delta /\xi }.\)
Thus, if \(\Vert x^{(k)} - x^* \Vert _2< \epsilon\), then we have
\(\Vert x^{(k+1)} - x^* \Vert _2 = \Vert (W^{(k)})^{-1} \left( W^{(k)}(x^{(k)} - x^*) - F_\nu (x^{(k)}) + F_\nu (x^*) \right) \Vert _2 \le \Vert (W^{(k)})^{-1} \Vert \left( \Vert W^{(k)} - U^{(k)} \Vert + \Delta \right) \Vert x^{(k)} - x^* \Vert _2 \le \Vert (W^{(k)})^{-1} \Vert (\nu + 1) \Delta \Vert x^{(k)} - x^* \Vert _2 .\)
Therefore, there exist \(\epsilon\) and \(\Delta\) such that the sequence generated by Algorithm 2 converges locally linearly to \(x^*\).\(\square\)
Appendix C. Proof of Theorem 3
Proof
Let \(V^{(k)} \in \partial _B \textrm{prox}_{\nu g}(x^{(k)} - \nu \nabla f(x^{(k)}))\), \(U^{(k)}:= I - V^{(k)}\left( I-\nu \nabla ^2 f(x^{(k)}) \right) \in \partial F_\nu (x^{(k)})\), and \(W^{(k)}:= I - V^{(k)}\left( I-\nu B^{(k)} \right) \in \hat{\partial }^{(k)}F_\nu (x^{(k)})\). Let \(e^{(k)} = x^{(k)}-x^*\) and \(s^{(k)}=x^{(k+1)}-x^{(k)}\). We note that \(s^{(k)} = e^{(k+1)} - e^{(k)}\), and that \(\{e^{(k)}\}\) and \(\{s^{(k)}\}\) converge to 0 since \(\{x^{(k)}\}\) converges to \(x^*\). From the update rule of Algorithm 2, we have
\(W^{(k)} s^{(k)} = -F_\nu (x^{(k)}).\)
Since \(F_\nu (x^*) = 0\) and \(U^{(k)}\) is a nonsingular matrix,
\(U^{(k)} e^{(k+1)} = U^{(k)} e^{(k)} + U^{(k)} s^{(k)} = \left( U^{(k)} e^{(k)} - F_\nu (x^{(k)}) + F_\nu (x^*) \right) + (U^{(k)} - W^{(k)}) s^{(k)} = (U^{(k)} - W^{(k)}) s^{(k)} + o(\Vert e^{(k)} \Vert _2),\)
where the last equality uses the LNA property of \(\partial F_\nu\) at \(x^*\).
By assumption, since \(\Vert (W^{(k)} - U^{(k)})s^{(k)}\Vert _2 = \Vert \nu V^{(k)} (\nabla ^2 f(x^{(k)}) - B^{(k)}) s^{(k)}\Vert _2\) and \(\Vert \nabla ^2 f(x^{*}) - \nabla ^2 f(x^{(k)}) \Vert \rightarrow 0\) as \(k \rightarrow \infty\), we have \(\Vert (W^{(k)} - U^{(k)})s^{(k)}\Vert _2 = o( \Vert s^{(k)} \Vert _2 )\). Therefore,
\(\Vert U^{(k)} e^{(k+1)} \Vert _2 = o(\Vert s^{(k)} \Vert _2) + o(\Vert e^{(k)} \Vert _2) \le o(\Vert e^{(k+1)} \Vert _2) + o(\Vert e^{(k)} \Vert _2).\)
Thus, we obtain \(\Vert e^{(k+1)}\Vert _2 = o(\Vert e^{(k)}\Vert _2)\), and since \(e^{(k)} = x^{(k)} - x^*\), the sequence \(\{x^{(k)}\}\) generated by Algorithm 2 superlinearly converges to \(x^*\).\(\square\)
Cite this article
Shimmura, R., Suzuki, J. Newton-Type Methods with the Proximal Gradient Step for Sparse Estimation. Oper. Res. Forum 5, 27 (2024). https://doi.org/10.1007/s43069-024-00307-x