Newton-Type Methods with the Proximal Gradient Step for Sparse Estimation

Abstract

In this paper, we propose new methods to efficiently solve convex optimization problems encountered in sparse estimation. These methods include a new quasi-Newton method that avoids computing the Hessian matrix and improves efficiency, and we prove its fast convergence. We also prove the local convergence of the Newton method under the assumption of strong convexity. Our proposed methods are particularly efficient for \(L_1\)-regularization and group-regularization problems because they incorporate variable selection into each update. Through numerical experiments, we demonstrate the efficiency of our methods in solving problems encountered in sparse estimation. Our contributions include both theoretical guarantees and practical applications for various kinds of problems.

Data Availability

The data used in the numerical experiments were the cod-RNA and ijcnn1 datasets available on the LIBSVM website.

Notes

  1. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

References

  1. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288

  2. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67

  3. Xiao X, Li Y, Wen Z et al (2018) A regularized semismooth Newton method with projection steps for composite convex programs. J Sci Comput 76(1):364–389

  4. Patrinos P, Bemporad A (2013) Proximal Newton methods for convex composite optimization. In: 52nd IEEE Conference on Decision and Control, IEEE, pp 2358–2363

  5. Patrinos P, Stella L, Bemporad A (2014) Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655

  6. Stella L, Themelis A, Patrinos P (2017) Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput Optim Appl 67(3):443–487

  7. Milzarek A, Xiao X, Cen S et al (2019) A stochastic semismooth Newton method for nonsmooth nonconvex optimization. SIAM J Optim 29(4):2916–2948

  8. Yang M, Milzarek A, Wen Z et al (2021) A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math Program pp 1–47

  9. Li Y, Wen Z, Yang C et al (2018) A semi-smooth Newton method for semidefinite programs and its applications in electronic structure calculations. SIAM J Sci Comput 40(6):A4131–A4157

  10. Ali A, Wong E, Kolter JZ (2017) A semismooth Newton method for fast, generic convex programming. In: International Conference on Machine Learning, PMLR, pp 70–79

  11. Bauschke HH, Combettes PL (2011) Convex analysis and monotone operator theory in Hilbert spaces. Springer

  12. Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202

  13. Ulbrich M (2011) Semismooth Newton methods for variational inequalities and constrained optimization problems in function spaces. SIAM

  14. Facchinei F, Pang JS (2003) Finite-dimensional variational inequalities and complementarity problems. Springer

  15. Zhang Y, Zhang N, Sun D et al (2020) An efficient Hessian based algorithm for solving large-scale sparse group lasso problems. Math Program 179:223–263

  16. Qi L, Sun J (1993) A nonsmooth version of Newton's method. Math Program 58(1–3):353–367

  17. Facchinei F, Fischer A, Kanzow C (1996) Inexact Newton methods for semismooth equations with applications to variational inequality problems. Nonconvex Optim Appl pp 125–139

  18. Sun D, Han J (1997) Newton and quasi-Newton methods for a class of nonsmooth equations and related problems. SIAM J Optim 7(2):463–480

  19. Hintermüller M (2010) Semismooth Newton methods and applications. Humboldt-University of Berlin, Department of Mathematics

  20. Uzilov AV, Keegan JM, Mathews DH (2006) Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinf 7(1):1–30

  21. Prokhorov D (2001) IJCNN 2001 neural network competition. Slide presentation in IJCNN 1(97):38

  22. Roth V, Fischer B (2008) The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In: Proceedings of the 25th international conference on Machine learning, pp 848–855

  23. Pavlidis P, Weston J, Cai J et al (2001) Gene functional classification from heterogeneous data. In: Proceedings of the fifth annual international conference on Computational biology, pp 249–255

  24. Ortega JM, Rheinboldt WC (2000) Iterative solution of nonlinear equations in several variables. SIAM

Funding

This work was supported by JSPS KAKENHI Grant Number JP23KJ1458 and the Grant-in-Aid for Scientific Research (C) 22K11931.

Author information

Corresponding author

Correspondence to Ryosuke Shimmura.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A. Proof of Proposition 3

Lemma 1

Suppose \(A,B \in {\mathbb R}^{n \times n}\) are symmetric and A is positive semidefinite. Then, any eigenvalue \(\lambda\) of AB satisfies \(\min \{\Vert A \Vert \lambda _{\min } (B),0\} \le \lambda \le \max \{ \Vert A\Vert \lambda _{\max } (B),0 \}\), where \(\lambda _{\min } (B)\) and \(\lambda _{\max } (B)\) are the minimum and maximum eigenvalues, respectively, of B.

Proof

Since the eigenvalues of AB are the same as those of BA, we consider the eigenvalues of BA. Let \(\lambda \in {\mathbb R}\) and \(x \in {\mathbb R}^n\) be such that \(BAx = \lambda x\) and \(x\ne 0\). Multiplying on the left by \(x^TA\), we obtain

$$\begin{aligned} x^TABA x= \lambda x^T A x. \end{aligned}$$

Since A is a symmetric positive semidefinite matrix, it has a symmetric positive semidefinite square root \(A^{\frac{1}{2}}\). If \(x^T A x = 0\), then \(A^{\frac{1}{2}}x = 0\), hence \(Ax = 0\) and \(\lambda = 0\) because \(BAx = \lambda x\) with \(x \ne 0\). Next, we consider the case \(x^TAx > 0\), where we obtain

$$\begin{aligned} \begin{aligned} \frac{x^TA B A x}{x^T A x}&= \lambda \\ \frac{x^TA^{\frac{1}{2}} A^{\frac{1}{2}}B A^{\frac{1}{2}} A^{\frac{1}{2}}x}{x^T A^{\frac{1}{2}} A^{\frac{1}{2}} x}&= \lambda . \end{aligned} \end{aligned}$$
(41)

We can rewrite (41) as

$$\begin{aligned} \frac{y^TA^{\frac{1}{2}} B A^{\frac{1}{2}} y}{y^T y} = \lambda , \end{aligned}$$

where \(y = A^{\frac{1}{2}} x\ne 0\). Thus, since \(0<\Vert A^{\frac{1}{2}} y \Vert _2 \le \Vert A\Vert ^{\frac{1}{2}} \Vert y\Vert _2\), we obtain

$$\begin{aligned} \min \{\Vert A\Vert \lambda _{\min }(B), 0\} \le \frac{y^TA^{\frac{1}{2}} B A^{\frac{1}{2}} y}{y^T y} \le \max \{\Vert A\Vert \lambda _{\max }(B), 0\}. \end{aligned}$$

Therefore, if \(x^T A x>0\), then \(\min \{\Vert A\Vert \lambda _{\min }(B), 0\} \le \lambda \le \max \{\Vert A\Vert \lambda _{\max }(B), 0\}\). Combining this with the case \(x^T A x = 0\), Lemma 1 holds.\(\square\)
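
Lemma 1 is easy to probe numerically. The following NumPy snippet is our own sanity check, not part of the paper: it draws random symmetric positive semidefinite A and symmetric B, and verifies that every eigenvalue of AB lies within the stated bounds.

```python
# Numerical sanity check of Lemma 1 (ours, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
n = 8
for _ in range(1000):
    M = rng.standard_normal((n, n))
    A = M @ M.T                               # symmetric positive semidefinite
    S = rng.standard_normal((n, n))
    B = (S + S.T) / 2                         # symmetric (indefinite in general)
    eigs = np.linalg.eigvals(A @ B).real      # eigenvalues of AB are real here
    normA = np.linalg.norm(A, 2)              # spectral norm ||A||
    lo = min(normA * np.linalg.eigvalsh(B)[0], 0.0)   # min{||A|| lam_min(B), 0}
    hi = max(normA * np.linalg.eigvalsh(B)[-1], 0.0)  # max{||A|| lam_max(B), 0}
    assert lo - 1e-8 <= eigs.min() and eigs.max() <= hi + 1e-8
print("Lemma 1 bounds held on all trials")
```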

Theorem 1

([6], Theorem 3.2). Suppose \(g:{\mathbb R}^n \rightarrow \mathbb {R} \cup \{+ \infty \}\) is a closed convex function. Every \(V\in \partial _B \textrm{prox}_{\nu g}(x)\) is a symmetric positive semidefinite matrix that satisfies \(\Vert V\Vert \le 1\) for all \(x\in {\mathbb R}^n\).

Proof of Proposition 3

By assumption, since \(\lambda _{\min }(\nabla ^2 f(x)) \ge \mu\) for any \(x\in {\mathbb R}^n\),

$$\begin{aligned} \lambda _{\max } (I- \nu \nabla ^2 f(x)) \le 1-\nu \mu . \end{aligned}$$

Since every \(V\in \partial _B \textrm{prox}_{\nu g}(x)\) is a symmetric positive semidefinite matrix that satisfies \(\Vert V\Vert \le 1\) for all \(x\in {\mathbb R}^n\) by Theorem 1, Lemma 1 yields

$$\begin{aligned} \lambda _{\max }\left( V \left( I-\nu \nabla ^2 f(x) \right) \right) \le \max \{1- \nu \mu , 0\}. \end{aligned}$$

Thus, every eigenvalue of \(I - V \left( I-\nu \nabla ^2 f(x) \right)\) is a real number that is greater than or equal to \(\min \{\nu \mu , 1\}\), and \(I - V \left( I-\nu \nabla ^2 f(x) \right)\) is a nonsingular matrix.\(\square\)
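
To make Proposition 3 concrete, the following sketch is our own illustration under an assumed lasso instantiation \(f(x)=\frac{1}{2}\Vert Ax-b\Vert _2^2\), \(g(x)=\lambda \Vert x\Vert _1\) (names and problem choice are ours, not taken from the paper). For soft-thresholding, a generic element of \(\partial _B \textrm{prox}_{\nu g}\) is a 0/1 diagonal matrix, and the smallest eigenvalue of \(I - V(I-\nu \nabla ^2 f(x))\) should be at least \(\min \{\nu \mu , 1\}\).

```python
# Illustration of Proposition 3 for the lasso (our instantiation):
# f(x) = 0.5*||Ax - b||^2, g(x) = lam*||x||_1, prox_{nu g} = soft-thresholding.
import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 50, 10, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
H = A.T @ A                                # Hessian of f (positive definite here)
mu = np.linalg.eigvalsh(H)[0]              # strong-convexity constant
nu = 1.0 / np.linalg.eigvalsh(H)[-1]       # step size nu = 1/L

x = rng.standard_normal(n)
z = x - nu * (H @ x - A.T @ b)             # forward (gradient) step
# At a generic z, the B-subdifferential of soft-thresholding is the 0/1
# diagonal matrix selecting the coordinates with |z_i| > nu*lam.
V = np.diag((np.abs(z) > nu * lam).astype(float))
U = np.eye(n) - V @ (np.eye(n) - nu * H)

print(np.linalg.eigvals(U).real.min(), ">=", min(nu * mu, 1.0))
```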

Appendix B. Proof of Theorem 2

Lemma 2

([24], Lemma 2.3.2). Let \(A, C \in {\mathbb R}^{n \times n}\), and assume that A is nonsingular with \(\Vert A^{-1} \Vert \le \alpha\). If \(\Vert A - C \Vert \le \beta\) and \(\alpha \beta < 1\), then C is also nonsingular, and

$$\begin{aligned} \Vert C^{-1}\Vert \le \frac{\alpha }{1-\alpha \beta }. \end{aligned}$$
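
A quick numerical check of Lemma 2 (ours, not the paper's): perturb a nonsingular matrix by E scaled so that \(\alpha \beta = 0.5 < 1\), then compare \(\Vert C^{-1}\Vert\) against the bound \(\alpha /(1-\alpha \beta )\).

```python
# Numerical check of Lemma 2 (ours): ||C^{-1}|| <= alpha / (1 - alpha*beta).
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # nonsingular
alpha = np.linalg.norm(np.linalg.inv(A), 2)
E = rng.standard_normal((n, n))
E *= 0.5 / (alpha * np.linalg.norm(E, 2))           # scale so alpha*beta = 0.5
beta = np.linalg.norm(E, 2)
C = A - E                                           # ||A - C|| = beta
print(np.linalg.norm(np.linalg.inv(C), 2), "<=", alpha / (1 - alpha * beta))
```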

Proof of Theorem 2

Let \(V^{(k)} \in \partial _B \textrm{prox}_{\nu g}(x^{(k)} - \nu \nabla f(x^{(k)}))\), \(U^{(k)}:= I - V^{(k)}\left( I-\nu \nabla ^2 f(x^{(k)}) \right) \in \partial F_\nu (x^{(k)})\) and \(W^{(k)}:= I - V^{(k)}\left( I-\nu B^{(k)} \right) \in \hat{\partial }^{(k)}F_\nu (x^{(k)})\). From Proposition 3, every eigenvalue of \(U^{(k)}\) is a real number that is greater than or equal to \(\xi := \min \{\nu \mu , 1\}\), and

$$\begin{aligned} \left\| \left( U^{(k)}\right) ^{-1} \right\| \le \frac{\sqrt{n}}{\xi }. \end{aligned}$$

Let \(\Delta = \frac{\xi }{5 \nu \sqrt{n}}\). Since \(\partial F_\nu\) is the LNA of \(F_\nu\) at \(x^*\), there exists \(\epsilon > 0\) such that

$$\begin{aligned} \Vert F_\nu (x) - F_\nu (x^*) - U(x-x^*)\Vert _2 \le \nu \Delta \Vert x-x^*\Vert _2 \end{aligned}$$

for any \(x\in B(x^*, \epsilon ):= \{y\mid \Vert x^*-y\Vert _2 < \epsilon \}\) and any \(U\in \partial F_\nu (x)\). Since \(W^{(k)} - U^{(k)} = \nu V^{(k)}\left( B^{(k)} - \nabla ^2 f(x^{(k)}) \right)\), \(\Vert V^{(k)}\Vert \le 1\), and \(\Vert B^{(k)} - \nabla ^2 f(x^{(k)})\Vert < \Delta\), we obtain \(\Vert W^{(k)} - U^{(k)}\Vert \le \nu \Delta\). By Lemma 2, \(W^{(k)}\) is nonsingular and

$$\begin{aligned} \left\| \left( W^{(k)} \right) ^{-1} \right\|&\le \frac{\sqrt{n}/\xi }{1-(\sqrt{n}/\xi )\nu \Delta }\\&= \frac{5}{4} \frac{\sqrt{n}}{\xi }. \end{aligned}$$

Thus, if \(\Vert x^{(k)} - x^* \Vert _2< \epsilon\), then we have

$$\begin{aligned} \Vert x^{(k+1)}-x^*\Vert _2&= \Vert x^{(k)}-(W^{(k)})^{-1}F_\nu (x^{(k)})-x^*\Vert _2 \\&\le \Vert (W^{(k)})^{-1}\Vert \Vert F_\nu (x^{(k)})-F_\nu (x^*)-W^{(k)}(x^{(k)} - x^*)\Vert _2 \\&\le \Vert (W^{(k)})^{-1}\Vert \left[ \Vert F_\nu (x^{(k)})-F_\nu (x^*)-U^{(k)}(x^{(k)} - x^*)\Vert _2 + \Vert W^{(k)}-U^{(k)}\Vert \Vert x^{(k)}-x^*\Vert _2\right] \\&\le \frac{5}{4}\frac{\sqrt{n}}{\xi } (2\nu \Delta \Vert x^{(k)} - x^*\Vert _2)\\&= \frac{1}{2}\Vert x^{(k)} - x^*\Vert _2. \end{aligned}$$

Therefore, there exist \(\epsilon\) and \(\Delta\) such that the sequence generated by Algorithm 2 converges locally linearly to \(x^*\).\(\square\)
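
Algorithm 2 itself is not reproduced on this page, but the update rule analyzed above, \(x^{(k+1)} = x^{(k)} - (W^{(k)})^{-1}F_\nu (x^{(k)})\) with the residual mapping read off from the subdifferential formula as \(F_\nu (x) = x - \textrm{prox}_{\nu g}(x - \nu \nabla f(x))\), can be sketched as follows for the lasso with \(B^{(k)}\) taken to be the exact Hessian. This is a minimal sketch under those assumptions; the function names and stopping rule are ours, not the paper's.

```python
# Sketch of the semismooth Newton update analyzed above, for the lasso with
# B^k = exact Hessian (function names and stopping rule are ours).
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def newton_prox_grad(A, b, lam, nu, x0, tol=1e-10, max_iter=50):
    """Solve min_x 0.5*||Ax-b||^2 + lam*||x||_1 via x <- x - W^{-1} F_nu(x)."""
    n = x0.size
    H = A.T @ A                                  # B^k := exact Hessian of f
    x = x0.copy()
    for _ in range(max_iter):
        z = x - nu * (H @ x - A.T @ b)           # gradient step
        F = x - soft_threshold(z, nu * lam)      # residual F_nu(x)
        if np.linalg.norm(F) < tol:
            break
        V = np.diag((np.abs(z) > nu * lam).astype(float))
        W = np.eye(n) - V @ (np.eye(n) - nu * H) # element of hat-partial F_nu
        x = x - np.linalg.solve(W, F)            # semismooth Newton step
    return x
```

Note how the 0/1 pattern in V realizes the variable selection mentioned in the abstract: for a coordinate with \(|z_i| \le \nu \lambda\), the corresponding row of \(W\) is a row of the identity, so the Newton step sets that coordinate exactly to zero.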

Appendix C. Proof of Theorem 3

Proof

Let \(V^{(k)} \in \partial _B \textrm{prox}_{\nu g}(x^{(k)} - \nu \nabla f(x^{(k)}))\), \(U^{(k)}:= I - V^{(k)}\left( I-\nu \nabla ^2 f(x^{(k)}) \right) \in \partial F_\nu (x^{(k)})\), and \(W^{(k)}:= I - V^{(k)}\left( I-\nu B^{(k)} \right) \in \hat{\partial }^{(k)}F_\nu (x^{(k)})\). We also let \(e^{(k)} = x^{(k)}-x^*\) and \(s^{(k)}=x^{(k+1)}-x^{(k)}\). Note that \(s^{(k)} = e^{(k+1)} - e^{(k)}\), and that \(\{e^{(k)}\}\) and \(\{s^{(k)}\}\) converge to 0 since \(\{x^{(k)}\}\) converges to \(x^*\). From the update rule of Algorithm 2, we have

$$\begin{aligned} F_\nu (x^*)&= \left[ F_\nu (x^{(k)}) + W^{(k)}s^{(k)}\right] + \left[ \left( U^{(k)} - W^{(k)}\right) s^{(k)}\right] - \left[ F_\nu (x^{(k)}) - F_\nu (x^*) - U^{(k)} e^{(k)}\right] - U^{(k)}e^{(k+1)} \\&=\left[ \left( U^{(k)} - W^{(k)}\right) s^{(k)}\right] - \left[ F_\nu (x^{(k)}) - F_\nu (x^*) - U^{(k)} e^{(k)}\right] - U^{(k)}e^{(k+1)}. \end{aligned}$$

Since \(F_\nu (x^*) = 0\) and \(U^{(k)}\) is a nonsingular matrix,

$$\begin{aligned} U^{(k)}e^{(k+1)}&=\left[ \left( U^{(k)} - W^{(k)}\right) s^{(k)}\right] - \left[ F_\nu (x^{(k)}) - F_\nu (x^*) - U^{(k)} e^{(k)}\right] \\ e^{(k+1)}&= \left( U^{(k)} \right) ^{-1}\left[ \left( U^{(k)} - W^{(k)}\right) s^{(k)}\right] - \left( U^{(k)} \right) ^{-1} \left[ F_\nu (x^{(k)}) - F_\nu (x^*) - U^{(k)} e^{(k)}\right] . \end{aligned}$$

By assumption, since \(\Vert (W^{(k)} - U^{(k)})s^{(k)}\Vert _2 = \Vert \nu V^{(k)} (\nabla ^2 f(x^{(k)}) - B^{(k)}) s^{(k)}\Vert _2\) and \(\Vert \nabla ^2 f(x^{*}) - \nabla ^2 f(x^{(k)}) \Vert \rightarrow 0\) as \(k \rightarrow \infty\), we have \(\Vert (W^{(k)} - U^{(k)})s^{(k)}\Vert _2 = o( \Vert s^{(k)} \Vert _2 )\). Therefore,

$$\begin{aligned} \Vert e^{(k+1)}\Vert _2 = o(\Vert s^{(k)}\Vert _2) + o(\Vert e^{(k)}\Vert _2) = o(\Vert e^{(k+1)}\Vert _2) + o(\Vert e^{(k)} \Vert _2). \end{aligned}$$

Thus, we obtain \(\Vert e^{(k+1)}\Vert _2 = o(\Vert e^{(k)}\Vert _2)\), and since \(e^{(k)} = x^{(k)} - x^*\), the sequence \(\{x^{(k)}\}\) generated by Algorithm 2 superlinearly converges to \(x^*\).\(\square\)
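
As an empirical companion to Theorem 3 (our experiment, reusing soft_threshold from the sketch in Appendix B), the error ratios \(\Vert e^{(k+1)}\Vert _2/\Vert e^{(k)}\Vert _2\) should fall toward zero near the solution; here the reference solution is computed by plain proximal gradient run to high accuracy.

```python
# Empirical look at the superlinear rate in Theorem 3 (our experiment,
# reusing soft_threshold from the sketch in Appendix B).
import numpy as np

rng = np.random.default_rng(2)
m, n, lam = 50, 10, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
H = A.T @ A
nu = 1.0 / np.linalg.eigvalsh(H)[-1]

x_star = np.zeros(n)                 # reference solution by plain proximal gradient
for _ in range(20000):
    x_star = soft_threshold(x_star - nu * (H @ x_star - A.T @ b), nu * lam)

x = x_star + 0.1 * rng.standard_normal(n)        # start near x*
for k in range(5):
    e_prev = np.linalg.norm(x - x_star)
    z = x - nu * (H @ x - A.T @ b)
    F = x - soft_threshold(z, nu * lam)
    V = np.diag((np.abs(z) > nu * lam).astype(float))
    W = np.eye(n) - V @ (np.eye(n) - nu * H)
    x = x - np.linalg.solve(W, F)
    # Ratios are meaningful only until the error hits the reference accuracy.
    print(k, np.linalg.norm(x - x_star) / max(e_prev, 1e-300))
```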

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shimmura, R., Suzuki, J. Newton-Type Methods with the Proximal Gradient Step for Sparse Estimation. Oper. Res. Forum 5, 27 (2024). https://doi.org/10.1007/s43069-024-00307-x
