Abstract
In this paper, we propose new methods to efficiently solve convex optimization problems arising in sparse estimation. These methods include a new quasi-Newton method that avoids computing the Hessian matrix and improves efficiency, for which we prove fast (superlinear) convergence. We also prove local convergence of the Newton method under the assumption of strong convexity. The proposed methods are particularly effective for \(L_1\)-regularization and group-regularization problems because they incorporate variable selection into each update. Through numerical experiments, we demonstrate the efficiency of our methods on problems arising in sparse estimation. Our contributions include both theoretical guarantees and practical algorithms for various kinds of problems.
Data Availability
The data used in the numerical experiments were the cod-RNA and ijcnn1 datasets available on the LIBSVM website.
References
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol) 58(1):267–288
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67
Xiao X, Li Y, Wen Z et al (2018) A regularized semismooth Newton method with projection steps for composite convex programs. J Sci Comput 76(1):364–389
Patrinos P, Bemporad A (2013) Proximal Newton methods for convex composite optimization. In: 52nd IEEE Conference on Decision and Control, IEEE, pp 2358–2363
Patrinos P, Stella L, Bemporad A (2014) Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655
Stella L, Themelis A, Patrinos P (2017) Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput Optim Appl 67(3):443–487
Milzarek A, Xiao X, Cen S et al (2019) A stochastic semismooth Newton method for nonsmooth nonconvex optimization. SIAM J Optim 29(4):2916–2948
Yang M, Milzarek A, Wen Z et al (2021) A stochastic extra-step quasi-Newton method for nonsmooth nonconvex optimization. Math Program pp 1–47
Li Y, Wen Z, Yang C et al (2018) A semi-smooth Newton method for semidefinite programs and its applications in electronic structure calculations. SIAM J Sci Comput 40(6):A4131–A4157
Ali A, Wong E, Kolter JZ (2017) A semismooth Newton method for fast, generic convex programming. In: International Conference on Machine Learning, PMLR, pp 70–79
Bauschke HH, Combettes PL (2011) Convex analysis and monotone operator theory in Hilbert spaces. Springer
Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imag Sci 2(1):183–202
Ulbrich M (2011) Semismooth Newton methods for variational inequalities and constrained optimization problems in function spaces. SIAM
Facchinei F, Pang JS (2003) Finite-dimensional variational inequalities and complementarity problems. Springer
Zhang Y, Zhang N, Sun D et al (2020) An efficient Hessian based algorithm for solving large-scale sparse group lasso problems. Math Program 179:223–263
Qi L, Sun J (1993) A nonsmooth version of Newton's method. Math Program 58(1–3):353–367
Facchinei F, Fischer A, Kanzow C (1996) Inexact Newton methods for semismooth equations with applications to variational inequality problems. Nonconvex Optim Appl pp 125–139
Sun D, Han J (1997) Newton and quasi-Newton methods for a class of nonsmooth equations and related problems. SIAM J Optim 7(2):463–480
Hintermüller M (2010) Semismooth Newton methods and applications. Humboldt-University of Berlin, Department of Mathematics
Uzilov AV, Keegan JM, Mathews DH (2006) Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinf 7(1):1–30
Prokhorov D (2001) IJCNN 2001 neural network competition. Slide presentation in IJCNN 1(97):38
Roth V, Fischer B (2008) The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In: Proceedings of the 25th international conference on Machine learning, pp 848–855
Pavlidis P, Weston J, Cai J et al (2001) Gene functional classification from heterogeneous data. In: Proceedings of the fifth annual international conference on Computational biology, pp 249–255
Ortega JM, Rheinboldt WC (2000) Iterative solution of nonlinear equations in several variables. SIAM
Funding
This work was supported by JSPS KAKENHI Grant Number JP23KJ1458 and the Grant-in-Aid for Scientific Research (C) 22K11931.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Appendices
Appendix A. Proof of Proposition 3
Lemma 1
Suppose \(A,B \in {\mathbb R}^{n \times n}\) are symmetric and A is positive semidefinite. Then, any eigenvalue \(\lambda\) of AB satisfies \(\min \{\Vert A \Vert \lambda _{\min } (B),0\} \le \lambda \le \max \{ \Vert A\Vert \lambda _{\max } (B),0 \}\), where \(\lambda _{\min } (B)\) and \(\lambda _{\max } (B)\) are the minimum and maximum eigenvalues, respectively, of B.
Proof
Since the eigenvalues of AB coincide with those of BA, we consider the eigenvalues of BA. Let \(\lambda \in {\mathbb R}\) and \(x \in {\mathbb R}^n\) with \(x\ne 0\) be such that \(BAx = \lambda x\). By multiplying by \(x^TA\) from the left, we obtain
\(x^T A B A x = \lambda x^T A x. \qquad (41)\)
If \(x^T A x = 0\), then \(Ax = 0\) because A is positive semidefinite, and hence \(\lambda = 0\). Next, we consider the case \(x^TAx > 0\). Since A is a symmetric positive semidefinite matrix, its square root \(A^{\frac{1}{2}}\) exists and satisfies \(A = A^{\frac{1}{2}} A^{\frac{1}{2}}\).
We can rewrite (41) as
\(\lambda = \frac{y^T A^{\frac{1}{2}} B A^{\frac{1}{2}} y}{\Vert y \Vert _2^2},\)
where \(y = A^{\frac{1}{2}} x\ne 0\). Thus, since \(0<\Vert A^{\frac{1}{2}} y \Vert _2 \le \Vert A\Vert ^{\frac{1}{2}} \Vert y\Vert _2\), we obtain
\(\min \{\Vert A\Vert \lambda _{\min }(B), 0\} \le \lambda \le \max \{\Vert A\Vert \lambda _{\max }(B), 0\}.\)
Therefore, if \(x^T A x>0\), then \(\min \{\Vert A\Vert \lambda _{\min }(B), 0\} \le \lambda \le \max \{\Vert A\Vert \lambda _{\max }(B), 0\}\). Combining this with the case \(x^T A x = 0\), Lemma 1 holds.\(\square\)
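As a numerical sanity check that we add here (not part of the original proof), the bound of Lemma 1 can be verified on random matrices; the sketch below assumes NumPy and uses the spectral norm for \(\Vert A \Vert\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T                    # symmetric positive semidefinite by construction
S = rng.standard_normal((n, n))
B = (S + S.T) / 2              # symmetric

# Eigenvalues of AB are real: AB is similar to the symmetric A^{1/2} B A^{1/2}
eigs_AB = np.linalg.eigvals(A @ B).real
normA = np.linalg.norm(A, 2)   # spectral norm ||A||
lo = min(normA * np.linalg.eigvalsh(B).min(), 0.0)
hi = max(normA * np.linalg.eigvalsh(B).max(), 0.0)
assert np.all(lo - 1e-8 <= eigs_AB) and np.all(eigs_AB <= hi + 1e-8)
```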
Theorem 1
([6], Theorem 3.2). Suppose \(g:{\mathbb R}^n \rightarrow \mathbb {R} \cup \{+ \infty \}\) is a closed convex function. Every \(V\in \partial _B \textrm{prox}_{\nu g}(x)\) is a symmetric positive semidefinite matrix that satisfies \(\Vert V\Vert \le 1\) for all \(x\in {\mathbb R}^n\).
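For intuition, consider the special case \(g = \lambda \Vert \cdot \Vert _1\) (an illustration we add here): \(\textrm{prox}_{\nu g}\) is componentwise soft-thresholding, and the elements of its B-subdifferential are diagonal matrices with entries in \(\{0,1\}\), which are indeed symmetric, positive semidefinite, and of norm at most 1.

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

nu = 0.5
x = np.array([2.0, 0.1, 0.7])
# One element of the B-subdifferential of soft-thresholding at x:
# a diagonal 0/1 matrix marking the components above the threshold
V = np.diag((np.abs(x) > nu).astype(float))
print(soft_threshold(x, nu))   # -> [1.5 0.  0.2]
print(np.linalg.norm(V, 2))    # spectral norm is 1.0, so ||V|| <= 1 holds
```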
Proof of Proposition 3
By assumption, \(\lambda _{\min }(\nabla ^2 f(x)) \ge \mu\) for any \(x\in {\mathbb R}^n\), so
\(\lambda _{\max } \left( I-\nu \nabla ^2 f(x) \right) \le 1 - \nu \mu .\)
Since every \(V\in \partial _B \textrm{prox}_{\nu g}(x)\) is a symmetric positive semidefinite matrix that satisfies \(\Vert V\Vert \le 1\) for all \(x\in {\mathbb R}^n\) by Theorem 1, Lemma 1 (applied with \(A = V\) and \(B = I-\nu \nabla ^2 f(x)\)) gives that every eigenvalue \(\lambda\) of \(V \left( I-\nu \nabla ^2 f(x) \right)\) satisfies
\(\lambda \le \max \{ \Vert V \Vert (1-\nu \mu ), 0 \} \le \max \{ 1-\nu \mu , 0 \}.\)
Thus, every eigenvalue of \(I - V \left( I-\nu \nabla ^2 f(x) \right)\) is a real number that is greater than or equal to \(\min \{\nu \mu , 1\}\), and \(I - V \left( I-\nu \nabla ^2 f(x) \right)\) is a nonsingular matrix.\(\square\)
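The conclusion of Proposition 3 can be checked numerically; the following sketch (our addition, assuming NumPy) uses a strongly convex quadratic \(f\) with Hessian \(Q\) and a diagonal 0–1 matrix \(V\), as arises for the \(L_1\) prox.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nu, mu = 5, 0.1, 0.5
# Strongly convex quadratic f: Hessian Q with lambda_min(Q) >= mu
P = rng.standard_normal((n, n))
Q = P @ P.T + mu * np.eye(n)
# V: an element of the B-subdifferential of the l1-prox (diagonal, 0/1)
V = np.diag(rng.integers(0, 2, n).astype(float))

J = np.eye(n) - V @ (np.eye(n) - nu * Q)   # the matrix of Proposition 3
xi = min(nu * mu, 1.0)
eigs = np.linalg.eigvals(J).real
assert np.all(eigs >= xi - 1e-8)           # all eigenvalues >= min{nu*mu, 1}
```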
Appendix B. Proof of Theorem 2
Lemma 2
([24], Lemma 2.3.2). Let \(A, C \in {\mathbb R}^{n \times n}\), and assume that A is nonsingular with \(\Vert A^{-1} \Vert \le \alpha\). If \(\Vert A - C \Vert \le \beta\) and \(\alpha \beta < 1\), then C is also nonsingular, and
\(\Vert C^{-1} \Vert \le \frac{\alpha }{1-\alpha \beta }.\)
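Lemma 2 is the standard Banach perturbation lemma, with bound \(\Vert C^{-1}\Vert \le \alpha /(1-\alpha \beta )\); a quick numerical check (our addition, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # nonsingular
alpha = np.linalg.norm(np.linalg.inv(A), 2)         # ||A^{-1}||
E = rng.standard_normal((n, n))
E *= 0.5 / (alpha * np.linalg.norm(E, 2))           # scale so alpha*beta = 0.5 < 1
C = A + E
beta = np.linalg.norm(A - C, 2)
# C is nonsingular and satisfies the perturbation bound
assert np.linalg.norm(np.linalg.inv(C), 2) <= alpha / (1 - alpha * beta) + 1e-10
```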
Proof of Theorem 2
Let \(V^{(k)} \in \partial _B \textrm{prox}_{\nu g}(x^{(k)} - \nu \nabla f(x^{(k)}))\), \(U^{(k)}:= I - V^{(k)}\left( I-\nu \nabla ^2 f(x^{(k)}) \right) \in \partial F_\nu (x^{(k)})\), and \(W^{(k)}:= I - V^{(k)}\left( I-\nu B^{(k)} \right) \in \hat{\partial }^{(k)}F_\nu (x^{(k)})\). From Proposition 3, every eigenvalue of \(U^{(k)}\) is a real number that is greater than or equal to \(\xi := \min \{\nu \mu , 1\}\), and
\(\Vert (U^{(k)})^{-1} \Vert \le \frac{1}{\xi }.\)
Let \(\Delta = \frac{\xi }{5 \nu \sqrt{n}}\). Since \(\partial F_\nu\) is the LNA of \(F_\nu\) at \(x^*\), there exists \(\epsilon > 0\) such that
\(\Vert F_\nu (x) - F_\nu (x^*) - U(x - x^*) \Vert _2 \le \Delta \Vert x - x^*\Vert _2\)
for any \(x\in B(x^*, \epsilon ):= \{y\mid \Vert x^*-y\Vert _2 < \epsilon \}\) and \(U\in \partial F_\nu (x)\). Since \(W^{(k)} - U^{(k)} = \nu V^{(k)}\left( B^{(k)} - \nabla ^2 f(x^{(k)}) \right)\) and \(\Vert B^{(k)} - \nabla ^2 f(x^{(k)})\Vert < \Delta\), we obtain \(\Vert W^{(k)} - U^{(k)}\Vert \le \nu \Delta\). By Lemma 2, \(W^{(k)}\) is nonsingular and
\(\Vert (W^{(k)})^{-1} \Vert \le \frac{1/\xi }{1 - \nu \Delta /\xi }.\)
Thus, if \(\Vert x^{(k)} - x^* \Vert _2< \epsilon\), then we have
\(\Vert x^{(k+1)} - x^* \Vert _2 = \Vert (W^{(k)})^{-1} \left( W^{(k)}(x^{(k)} - x^*) - F_\nu (x^{(k)}) + F_\nu (x^*) \right) \Vert _2 \le \Vert (W^{(k)})^{-1} \Vert \left( \Vert W^{(k)} - U^{(k)} \Vert + \Delta \right) \Vert x^{(k)} - x^* \Vert _2 \le \Vert (W^{(k)})^{-1} \Vert (\nu + 1) \Delta \Vert x^{(k)} - x^* \Vert _2 .\)
Therefore, there exist \(\epsilon\) and \(\Delta\) such that the sequence generated by Algorithm 2 converges locally linearly to \(x^*\).\(\square\)
Appendix C. Proof of Theorem 3
Proof
Let \(V^{(k)} \in \partial _B \textrm{prox}_{\nu g}(x^{(k)} - \nu \nabla f(x^{(k)}))\), \(U^{(k)}:= I - V^{(k)}\left( I-\nu \nabla ^2 f(x^{(k)}) \right) \in \partial F_\nu (x^{(k)})\), and \(W^{(k)}:= I - V^{(k)}\left( I-\nu B^{(k)} \right) \in \hat{\partial }^{(k)}F_\nu (x^{(k)})\). Let \(e^{(k)} = x^{(k)}-x^*\) and \(s^{(k)}=x^{(k+1)}-x^{(k)}\). We note that \(s^{(k)} = e^{(k+1)} - e^{(k)}\), and that \(\{e^{(k)}\}\) and \(\{s^{(k)}\}\) converge to 0 since \(\{x^{(k)}\}\) converges to \(x^*\). From the update rule of Algorithm 2, we have
\(W^{(k)} s^{(k)} = -F_\nu (x^{(k)}).\)
Since \(F_\nu (x^*) = 0\) and \(U^{(k)}\) is a nonsingular matrix,
\(U^{(k)} e^{(k+1)} = U^{(k)} e^{(k)} + U^{(k)} s^{(k)} = \left( U^{(k)} e^{(k)} - F_\nu (x^{(k)}) + F_\nu (x^*) \right) + (U^{(k)} - W^{(k)}) s^{(k)} = (U^{(k)} - W^{(k)}) s^{(k)} + o(\Vert e^{(k)} \Vert _2),\)
where the last equality uses the LNA property of \(\partial F_\nu\) at \(x^*\).
By assumption, since \(\Vert (W^{(k)} - U^{(k)})s^{(k)}\Vert _2 = \Vert \nu V^{(k)} (\nabla ^2 f(x^{(k)}) - B^{(k)}) s^{(k)}\Vert _2\) and \(\Vert \nabla ^2 f(x^{*}) - \nabla ^2 f(x^{(k)}) \Vert \rightarrow 0\) as \(k \rightarrow \infty\), we have \(\Vert (W^{(k)} - U^{(k)})s^{(k)}\Vert _2 = o( \Vert s^{(k)} \Vert _2 )\). Therefore,
\(\Vert U^{(k)} e^{(k+1)} \Vert _2 = o(\Vert s^{(k)} \Vert _2) + o(\Vert e^{(k)} \Vert _2) \le o(\Vert e^{(k+1)} \Vert _2) + o(\Vert e^{(k)} \Vert _2).\)
Thus, we obtain \(\Vert e^{(k+1)}\Vert _2 = o(\Vert e^{(k)}\Vert _2)\), and since \(e^{(k)} = x^{(k)} - x^*\), the sequence \(\{x^{(k)}\}\) generated by Algorithm 2 superlinearly converges to \(x^*\).\(\square\)
Cite this article
Shimmura, R., Suzuki, J. Newton-Type Methods with the Proximal Gradient Step for Sparse Estimation. Oper. Res. Forum 5, 27 (2024). https://doi.org/10.1007/s43069-024-00307-x