Abstract
The paper deals with optimal parameter tuning for the elastic net. We formulate this tuning process as an optimization problem over a Pareto set. The Pareto set is associated with a convex multi-objective optimization problem, and, based on the scalarization theorem, we give a parametric representation of it. The problem thus becomes a bilevel optimization problem with a unique follower response (a strong Stackelberg game). We then apply this strategy to parameter tuning for the elastic net and propose a new algorithm, called Ensalg, to compute the optimal regularization path of the elastic net w.r.t. the sparsity-inducing term in the objective. In contrast to existing algorithms, our method also handles the so-called “many-at-a-time” case, in which more than one variable becomes zero, or nonzero, at the same time. We demonstrate the effectiveness of the algorithm on examples involving real-world data.
Notes
For a map \(\varPhi :A\rightarrow B\) and a subset \(S \subseteq A\), we denote by \(\varPhi (S) := \{ \varPhi (x) :x\in S\}\) the image of S under \(\varPhi \).
This definition holds without the convexity assumption.
\(\mathbb {R}^r_+=\{ (\alpha _1, \ldots , \alpha _r)\in \mathbb {R}^r :\alpha _i \ge 0 \text { for all } i\}\) and \( \mathbb {R}^r_{++}=\{(\alpha _1, \ldots , \alpha _r)\in \mathbb {R}^r :\alpha _i > 0 \text { for all } i\}\)
\(f_i\) coercive means \(\lim _{\Vert x\Vert \rightarrow +\infty }f_i(x) = +\infty \).
In these limit cases, in order to satisfy the hypothesis \(({\mathrm {H}}_{{\mathrm {we}}})\), i.e., to have uniqueness of the solution \(x(0, \beta )\), we have to assume that \(p \ge n\) and A has full rank; hence, \(f_1\) is strictly convex.
The results presented in this section work also for \(\alpha =0\) under the additional assumption that \(p\ge n\) and the matrix A has full rank.
For a matrix \(M \in \mathbb {R}^{p\times q}\) and \(I = \{ i_1< \ldots < i_k \} \subseteq \{ 1,\ldots , p\}\), \(J = \{ j_1< \ldots < j_\ell \} \subseteq \{ 1,\ldots , q\}\), we denote
$$\begin{aligned} M_{IJ} = (M_{ij})_{(i,j)\in I\times J} \in \mathbb {R}^{k\times \ell }\,. \end{aligned}$$When \(k = p\) (resp. \(\ell = q\)), we set
$$\begin{aligned} M_{\cdot J} = M_{IJ} \quad (\text {resp. } M_{I\cdot }=M_{IJ}) \end{aligned}$$For a vector \(u=(u_1, \ldots , u_n)\in \mathbb {R}^n\), we denote \({{\,\mathrm{{\mathrm {sign}}}\,}}(u)=({{\,\mathrm{{\mathrm {sign}}}\,}}(u_1), \ldots , {{\,\mathrm{{\mathrm {sign}}}\,}}(u_n)).\)
Notice that \(J_1\) and \(J_2\) are not necessarily disjoint.
For a set \(I\subseteq \{1, \ldots , n\}\), we denote by \(I^c= \{1, \ldots , n\}\setminus I\) its complement.
Notice that for all \(i\in I_m^s\), we have \(G_i^{m-1} \in \{ -1,\, 1\}\).
Because the matrix \(R_{ I_{m_0} I_{m_0}}\) is invertible
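The submatrix notation \(M_{IJ}\), \(M_{\cdot J}\), \(M_{I\cdot }\) introduced in the notes translates directly into NumPy indexing. A minimal sketch with hypothetical data (0-based indices here, 1-based in the text):

```python
import numpy as np

# Hypothetical 3x4 matrix M and index sets I, J (0-based).
M = np.arange(12).reshape(3, 4)
I = [0, 2]                      # rows i_1 < i_2
J = [1, 3]                      # columns j_1 < j_2

M_IJ = M[np.ix_(I, J)]          # the submatrix (M_ij) for (i, j) in I x J
M_dotJ = M[:, J]                # M_{.J}: all rows, columns J
M_Idot = M[I, :]                # M_{I.}: rows I, all columns

assert M_IJ.shape == (2, 2)
assert np.array_equal(M_IJ, np.array([[1, 3], [9, 11]]))

# The componentwise sign map used for vectors:
u = np.array([-2.5, 0.0, 4.0])
assert np.array_equal(np.sign(u), np.array([-1.0, 0.0, 1.0]))
```

`np.ix_` builds the outer product of the two index sets, which is exactly the \((i,j)\in I\times J\) convention above.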
References
Kuhn, H.W., Tucker, A.W.: Nonlinear programming. In: Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pp. 481–492 (1951)
Pareto, V.: Manuale di economia politica. Società Editrice Libraria (1906)
Edgeworth, F.Y.: Mathematical Psychics: An Essay on the Application of Mathematics to the Moral Sciences. C.K. Paul & co, London (1881)
Philip, J.: Algorithms for the vector maximization problem. Math. Program. 2(1), 207–229 (1972)
Benson, H.P.: Optimization over the efficient set. J. Math. Anal. Appl. 98(2), 562–580 (1984)
Dauer, J.P.: Optimization over the efficient set using an active constraint approach. Zeitschrift für Oper. Res. 35(3), 185–195 (1991)
Craven, B.D.: Aspects of multicriteria optimization. In: Recent Developments in Mathematical Programming, pp. 93–100 (1991)
Benson, H.P.: A finite, non-adjacent extreme point search algorithm for optimization over the efficient set. J. Optim. Theory Appl. 73(1), 47–64 (1992)
Bolintinéanu, S.: Necessary conditions for nonlinear suboptimization over the weakly-efficient set. J. Optim. Theory Appl. 78(2), 579–598 (1993)
Bolintinéanu, S.: Minimization of a quasi-concave function over an efficient set. Math. Program. 61(1–3), 89–110 (1993)
Fülöp, J.: A cutting plane algorithm for linear optimization over the efficient set. In: Generalized Convexity, pp. 374–385 (1994)
Dauer, J.P., Fosnaugh, T.A.: Optimization over the efficient set. J. Global Optim. 7(3), 261–277 (1995)
An, L.T.H., Tao, P.D., Muu, L.D.: Numerical solution for optimization over the efficient set by D.C. optimization algorithms. Oper. Res. Lett. 19(3), 117–128 (1996)
Horst, R., Thoai, N.V.: Maximizing a concave function over the efficient or weakly-efficient set. Eur. J. Oper. Res. 117(2), 239–252 (1999)
Horst, R., Thoai, N.V., Yamamoto, Y., Zenke, D.: On optimization over the efficient set in linear multicriteria programming. J. Optim. Theory Appl. 134(3), 433–443 (2007)
Kim, N.T.B., Ngoc, T.T.: Optimization over the efficient set of a bicriteria convex programming problem. Pac. J. Optim. 9(1), 103–115 (2013)
Yamamoto, Y.: Optimization over the efficient set: overview. J. Global Optim. 22(1–4), 285–317 (2002)
Bolintinéanu, S.: Optimality conditions for minimization over the (weakly or properly) efficient set. J. Math. Anal. Appl. 173(2), 523–541 (1993)
Bonnel, H., Kaya, C.Y.: Optimization over the efficient set of multi-objective control problems. J. Optim. Theory Appl. 147(1), 93–112 (2010)
Bonnel, H., Pham, N.S.: Nonsmooth optimization over the (weakly or properly) Pareto set of a linear-quadratic multi-objective control problem: explicit optimality conditions. J. Ind. Manage. Optim. 7(4), 789–809 (2011)
Bonnel, H.: Post-Pareto analysis for multiobjective parabolic control systems. Ann. Acad. Romanian Sci. Ser. Math. Appl. 5(1–2), 13–34 (2013)
Bonnel, H., Collonge, J.: Stochastic optimization over a pareto set associated with a stochastic multi-objective optimization problem. J. Optim. Theory Appl. 162(2), 405–427 (2014)
Bonnel, H., Collonge, J.: Optimization over the Pareto outcome set associated with a convex bi-objective optimization problem: theoretical results, deterministic algorithm and application to the stochastic case. J. Global Optim. 62(3), 481–505 (2015)
Bonnel, H., Morgan, J.: Semivectorial bilevel optimization problem: penalty approach. J. Optim. Theory Appl. 131(3), 365–382 (2006)
Bonnel, H.: Optimality conditions for the semivectorial bilevel optimization problem. Pac. J. Optim. 2(3), 447–468 (2006)
Ankhili, Z., Mansouri, A.: An exact penalty on bilevel programs with linear vector optimization lower level. Eur. J. Oper. Res. 197(1), 36–41 (2009)
Bonnel, H., Morgan, J.: Semivectorial bilevel convex optimal control problems: existence results. SIAM J. Control Optim. 50(6), 3224–3241 (2012)
Eichfelder, G.: Multiobjective bilevel optimization. Math. Program. 123(2), 419–449 (2010)
Zheng, Y., Wan, Z.: A solution method for semivectorial bilevel programming problem via penalty method. J. Appl. Math. Comput. 37(1–2), 207–219 (2011)
Bonnel, H., Morgan, J.: Optimality conditions for semivectorial bilevel convex optimal control problems. In: Computational and Analytical Mathematics, pp. 45–78 (2013)
Dempe, S., Gadhi, N., Zemkoho, A.B.: New optimality conditions for the semivectorial bilevel optimization problem. J. Optim. Theory Appl. 157(1), 54–74 (2013)
Bonnel, H., Todjihoundé, L., Udrişte, C.: Semivectorial bilevel optimization on Riemannian manifolds. J. Optim. Theory Appl. 167(2), 464–486 (2015)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. 67(2), 301–320 (2005)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 58(1), 267–288 (1996)
Giesen, J., Müller, J.K., Laue, S., Swiercy, S.: Approximating concavely parameterized optimization problems. In: Advances in Neural Information Processing Systems (NIPS), pp. 2114–2122 (2012)
Giesen, J., Löhne, A., Laue, S., Schneider, C.: Using Benson's algorithm for regularization parameter tracking. Proc. AAAI Conf. Artif. Intell. 33(01), 3689–3696 (2019)
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–499 (2004)
Rosset, S., Zhu, J.: Piecewise linear regularized solution paths. Ann. Stat. 35(3), 1012–1030 (2007)
Osborne, M.R., Presnell, B., Turlach, B.A.: A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20(3), 389–403 (2000)
Mairal, J., Yu, B.: Complexity analysis of the lasso regularization path. In: International Conference on Machine Learning (ICML), pp. 353–360 (2012)
Jahn, J.: Vector Optimization: Theory, Applications, and Extensions. Springer, Berlin (2011)
Luc, D.T.: Theory of Vector Optimization. Springer, Berlin (1989)
Miettinen, K.: Nonlinear Multiobjective Optimization. Springer, Berlin (1998)
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Murty, K.G.: Linear Complementarity, Linear and Nonlinear Programming. Internet edn. (1997)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2009)
Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1226–1238 (2005)
Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., Yang, N.: Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. J. Urol. 141(5), 1076–1083 (1989)
Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, New York (1998)
Bonnans, J.F., Shapiro, A.: Perturbation Analysis of Optimization Problems. Springer, New York (2000)
Acknowledgements
The second author acknowledges financial support by Carl-Zeiss-Stiftung. The first author is grateful to Dr. V. Dragan from the Institute of Mathematics of the Romanian Academy for informing him about the result known as the "Schur complement". The authors are grateful to the referees for their comments and suggestions.
Additional information
Communicated by Xiaoqi Yang.
A Appendix: Proofs of Theorem 3.1, Proposition 3.1, and Lemma 3.2
Proof of Theorem 3.1
-
(a)
A simple computation shows that
$$\begin{aligned} {{\,\mathrm{{\mathrm {grad}}}\,}}\left( x \mapsto \frac{1}{2} \Vert Ax-b\Vert _2^2 + \frac{\alpha }{2}\Vert x\Vert _2^2\right) ({\bar{x}}) = A^{\mathsf {T}}A{\bar{x}} - A^{\mathsf {T}}b + \alpha {\bar{x}}\,, \end{aligned}$$(60)and hence, the subdifferential of the convex function \(J_{\alpha \beta }\) at any \(x\in \mathbb {R}^n\), denoted \(\partial J_{\alpha \beta }(x)\), is given by the set of all vectors of the form
$$\begin{aligned} \left( A^{\mathsf {T}}A + \alpha I_n\right) x - A^{\mathsf {T}}b + \beta \xi \,, \end{aligned}$$where \(\xi \in [-1,1]^n\) verifies
$$\begin{aligned} \xi _i {\left\{ \begin{array}{ll} = {{\,\mathrm{{\mathrm {sign}}}\,}}(x_i) &{}\text {if } x_i\ne 0\,,\\ \in [-1,1] &{}\text {if } x_i=0\,, \end{array}\right. } \qquad i = 1,\ldots , n\,. \end{aligned}$$On the other hand, the optimality of \(x(\alpha , \beta )\) is equivalent to the relation
$$\begin{aligned} 0 \in \partial J_{\alpha \beta }(x(\alpha , \beta ))\,, \end{aligned}$$and hence, \(\xi (\alpha , \beta )\) is the unique vector such that
$$\begin{aligned} \left( A^{\mathsf {T}}A + \alpha I_n\right) x(\alpha , \beta ) - A^{\mathsf {T}}b + \beta \xi (\alpha , \beta )=0 \end{aligned}$$and
$$\begin{aligned} \xi _i(\alpha , \beta ) {\left\{ \begin{array}{ll} = {{\,\mathrm{{\mathrm {sign}}}\,}}(x_i(\alpha ,\beta )) &{}\text {if } x_i(\alpha ,\beta ) \ne 0\,,\\ \in [-1,1] &{}\text {if } x_i(\alpha , \beta ) =0\,, \end{array}\right. } \qquad \text {for } i \in \{ 1, \ldots , n \}\,. \end{aligned}$$Relation (6) then follows.
-
(b)
The proof of the converse is obvious.
-
(d)
If \(\beta = 0\), then
$$\begin{aligned} {{\,\mathrm{{\mathrm {grad}}}\,}}\left( J_{\alpha ,0} \right) (x(\alpha ,0)) = {{\,\mathrm{{\mathrm {grad}}}\,}}\left( x \mapsto \frac{1}{2} \Vert Ax-b\Vert _2^2 + \frac{\alpha }{2} \Vert x\Vert _2^2\right) (x(\alpha , 0)) = 0\,, \end{aligned}$$ -
(c)
The continuity of the function \(x(\cdot , \cdot )\) on \(\mathbb {R}_{++}\times \mathbb {R}_+\) is a consequence of [50, Theorem 1.17] (see also [51, Proposition 4.4]). Indeed, it is easy to see that [50, Theorem 1.17] holds also if we replace \(\mathbb {R}^m\) with an open subset U of \(\mathbb {R}^m\). Thus, with \(m = 2\), put \(U := \mathbb {R}_{++} \times \mathbb {R}\). Using the notations of [50, Theorem 1.17], for \(u := (\alpha , \beta ) \in U \subseteq \mathbb {R}^2\) and \(x\in \mathbb {R}^n\), we denote
$$\begin{aligned} f(x,u) := J_{\alpha \beta }(x)\,, \quad p(u) := \inf _x f(x,u)\,, \quad P(u) := {{\,\mathrm{{\mathrm {arg\,min}}}\,}}_x f(x,u)\,. \end{aligned}$$The function \(f(\cdot , \cdot )\) obviously is continuous on \(\mathbb {R}^n\times U\) and hence lower semi-continuous. Using the equivalence of norms in \(\mathbb {R}^n\), it is easy to see that \((x,u) \mapsto f(x,u)\) is level-bounded in x uniformly in u (see [50, Definition 1.16]). So, all the hypotheses of [50, Theorem 1.17] are fulfilled. Therefore, since f is finite everywhere, we have \({{\,\mathrm{{\mathrm {dom}}}\,}}(p) = U\). Also, for each \(u \in U\), we have \(p(u) = \min _x f(x,u)\) and P(u) is nonempty and compact. Finally, by (c) of [50, Theorem 1.17], we obtain that p is continuous on U, and hence, (b) holds. Since for each \({\bar{u}} = (\alpha , \beta ) \in U\) with \(\beta \ge 0\), the function \(f(\cdot , {\bar{u}})\) is strictly convex on \(\mathbb {R}^n\), we have that the set \(P({\bar{u}})\) is a singleton. Thus, for such \(\bar{u}\) (with \(\beta \ge 0\)), the conclusion (b) of [50, Theorem 1.17] implies that \(x(\cdot , \cdot ) :\mathbb {R}_{++}\times \mathbb {R}_+\rightarrow \mathbb {R}^n\) is continuous at \({\bar{u}}=(\alpha , \beta )\). Now, from (6) we have for each \((\alpha , \beta ) \in \mathbb {R}_{++}^2\)
$$\begin{aligned} \xi (\alpha ,\beta ) = \beta ^{-1} \left[ \left( A^{\mathsf {T}}A + \alpha I_n\right) x(\alpha ,\beta ) - A^{\mathsf {T}}b \right] , \end{aligned}$$which proves the uniqueness of \(\xi (\alpha , \beta )\) and the continuity of the function \(\xi (\cdot , \cdot )\) on \(\mathbb {R}_{++}^2\).
\(\square \)
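The stationarity condition (6) can be checked numerically: solve the elastic net problem with a generic proximal-gradient (ISTA) iteration, recover \(\xi (\alpha ,\beta )\) from the stationarity equation, and inspect its sign pattern. This is an illustrative sketch on random data, using a standard solver rather than the paper's algorithm:

```python
import numpy as np

def elastic_net_ista(A, b, alpha, beta, iters=5000):
    """Minimize 0.5*||Ax-b||^2 + (alpha/2)*||x||^2 + beta*||x||_1 by proximal gradient."""
    n = A.shape[1]
    L = np.linalg.norm(A, 2) ** 2 + alpha          # Lipschitz constant of the smooth part
    x = np.zeros(n)
    for _ in range(iters):
        g = A.T @ (A @ x - b) + alpha * x          # gradient of the smooth part, cf. (60)
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - beta / L, 0.0)  # soft-thresholding
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
alpha, beta = 0.5, 1.0
x = elastic_net_ista(A, b, alpha, beta)

# Recover xi from the stationarity condition
# (A^T A + alpha I) x - A^T b + beta * xi = 0:
xi = (A.T @ b - (A.T @ A + alpha * np.eye(8)) @ x) / beta

assert np.all(np.abs(xi) <= 1 + 1e-6)              # xi lies in [-1, 1]^n
nz = np.abs(x) > 1e-8
assert np.allclose(xi[nz], np.sign(x[nz]), atol=1e-4)  # xi_i = sign(x_i) on the support
```

At the (unique) minimizer, \(\xi \) equals the sign of \(x_i(\alpha ,\beta )\) on the support and lies in \([-1,1]\) elsewhere, exactly as stated in (6).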
Proof of Proposition 3.1
-
(a)
Let \(0 \le \lambda _1 \le \ldots \le \lambda _n\) be the eigenvalues of the positive semi-definite matrix \(A^{\mathsf {T}}A\). Then, for any \(\alpha >0\), the matrix \(A^{\mathsf {T}}A + \alpha I_n\) has the eigenvalues \(\lambda _1 + \alpha \le \ldots \le \lambda _n + \alpha \). Therefore, the eigenvalues of \(R(\alpha )\) are \(\frac{1}{\lambda _n + \alpha } \le \ldots \le \frac{1}{\lambda _1 + \alpha }\). Denote the matrix norm induced by the Euclidean norm also by \(\Vert \cdot \Vert _2\). Then, it is well known that \(\Vert R(\alpha ) \Vert _2 = \frac{1}{\lambda _1 + \alpha }\), and, by the Rayleigh quotient inequality, we have that for any vector \(v\in \mathbb {R}^n\), \(v^{\mathsf {T}}R(\alpha ) v \ge \frac{1}{\lambda _n + \alpha } \Vert v\Vert _2^2\). Thus, multiplying relation (10) on the left by \(\xi ^{\mathsf {T}}(\alpha , \beta )\), we obtain
$$\begin{aligned} \xi ^{\mathsf {T}}(\alpha ,\beta ) u(\alpha ) - \beta \, \xi ^{\mathsf {T}}(\alpha ,\beta ) R(\alpha ) \xi (\alpha , \beta )= & {} \xi ^{\mathsf {T}}(\alpha ,\beta ) \cdot x(\alpha ,\beta )\\= & {} \left\| x(\alpha , \beta ) \right\| _1. \end{aligned}$$Therefore, using the Cauchy–Schwarz inequality, we obtain
$$\begin{aligned} \left\| x(\alpha ,\beta ) \right\| _1\le & {} \left\| \xi (\alpha ,\beta ) \right\| _2 \cdot \left\| u(\alpha ) \right\| _2 - \beta \, \xi ^{\mathsf {T}}(\alpha ,\beta ) R(\alpha ) \xi (\alpha ,\beta )\\\le & {} \left\| \xi (\alpha ,\beta ) \right\| _2 \cdot \left\| u(\alpha ) \right\| _2 - \frac{\beta }{\lambda _n + \alpha } \left\| \xi (\alpha ,\beta ) \right\| _2^2. \end{aligned}$$Since
$$\begin{aligned} \left\| u(\alpha ) \right\| _2 = \left\| R(\alpha ) A^{\mathsf {T}}b \right\| _2 \le \left\| R(\alpha ) \right\| _2\cdot \left\| A^{\mathsf {T}}b \right\| _2 = \frac{1}{\lambda _1 + \alpha } \left\| A^{\mathsf {T}}b \right\| _2, \end{aligned}$$we obtain that
$$\begin{aligned} 0 \le \left\| x(\alpha ,\beta ) \right\| _1 \le \frac{1}{\lambda _1 + \alpha } \left\| \xi (\alpha ,\beta ) \right\| _2 \cdot \left\| A^{\mathsf {T}}b \right\| _2 - \frac{\beta }{\lambda _n + \alpha } \left\| \xi (\alpha ,\beta ) \right\| _2^2. \end{aligned}$$(61)This implies that for all \((\alpha , \beta )\in \mathbb {R}^2_{++}\), we have
$$\begin{aligned} \left\| \xi (\alpha ,\beta ) \right\| _2 \le \frac{\lambda _n + \alpha }{\beta \left( \lambda _1 + \alpha \right) } \left\| A^{\mathsf {T}}b \right\| _2. \end{aligned}$$(62)Notice that
$$\begin{aligned} \left\| \xi (\alpha ,\beta ) \right\| _2 < 1 \quad \Longrightarrow \quad x(\alpha ,\beta ) =0\,. \end{aligned}$$On the other hand, for all \(\alpha > 0\) we have
$$\begin{aligned} 1 \le \frac{\lambda _n + \alpha }{\lambda _1 + \alpha } \le \frac{\lambda _n}{\lambda _1}\,. \end{aligned}$$Therefore, from (62) we obtain that for all \(\alpha >0\) and \(\beta >\frac{\lambda _n}{\lambda _1}\Vert A^{\mathsf {T}}b\Vert _2\), we have \(\Vert \xi (\alpha ,\beta ) \Vert _2 < 1\), so part (a) is proved.
-
(b)
Let
$$\begin{aligned} \beta \in \left] \left\| A^{\mathsf {T}}b \right\| _2, \frac{\lambda _n}{\lambda _1} \left\| A^{\mathsf {T}}b \right\| _2 \right[. \end{aligned}$$This is equivalent to
$$\begin{aligned} \frac{\lambda _1}{\lambda _n}< \frac{ \left\| A^{\mathsf {T}}b \right\| _2}{\beta } < 1\,. \end{aligned}$$The function \(\alpha \mapsto \psi (\alpha ) := \frac{\lambda _1 + \alpha }{\lambda _n + \alpha }\) is increasing on \(\mathbb {R}_{++}\), and hence, for all \(\alpha >0 \),
$$\begin{aligned} \frac{\lambda _1}{\lambda _n}< \psi (\alpha ) < 1\,, \end{aligned}$$and for \(\alpha _0 = \frac{\lambda _n \Vert A^{\mathsf {T}}b \Vert _2 - \beta \lambda _1}{\beta - \Vert A^{\mathsf {T}}b \Vert _2}\), we have
$$\begin{aligned} \psi (\alpha _0) = \frac{\left\| A^{\mathsf {T}}b \right\| _2}{\beta }\,. \end{aligned}$$Therefore, for all \(\alpha > \alpha _0\), we have \(\psi (\alpha ) > \frac{\Vert A^{\mathsf {T}}b \Vert _2}{\beta }\), and hence,
$$\begin{aligned} \frac{(\lambda _n + \alpha )}{\beta \left( \lambda _1 + \alpha \right) } \left\| A^{\mathsf {T}}b \right\| _2 < 1\,, \end{aligned}$$which implies by (62) that \(\Vert \xi (\alpha ,\beta ) \Vert _2 < 1\). Thus, \(x(\alpha ,\beta ) = 0\).
-
(c)
The proof of part (c) follows from the fact that \(\Vert \xi (\alpha ,\beta ) \Vert _2 \le \sqrt{n}\) for all \((\alpha ,\beta ) \in \mathbb {R}^2_{++}\), and hence, (61) implies that
$$\begin{aligned} \left\| x(\alpha ,\beta ) \right\| _1 \le \frac{1}{\lambda _1+\alpha } \left\| A^{\mathsf {T}}b \right\| _2 - \frac{\beta \sqrt{n}}{\lambda _n + \alpha } \end{aligned}$$and the last expression goes to 0 when \(\alpha \rightarrow +\infty \).
\(\square \)
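The threshold in part (a) admits a solver-free numerical check: if \(x(\alpha ,\beta )=0\), condition (6) forces \(\xi = A^{\mathsf {T}}b/\beta \), which is a valid subgradient of the \(\ell _1\) norm at the origin exactly when \(\Vert A^{\mathsf {T}}b\Vert _\infty \le \beta \). A small NumPy sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 4))
b = rng.standard_normal(10)

lam = np.linalg.eigvalsh(A.T @ A)            # eigenvalues lambda_1 <= ... <= lambda_n
beta = (lam[-1] / lam[0]) * np.linalg.norm(A.T @ b)  # the threshold from part (a)

# If x(alpha, beta) = 0, condition (6) gives xi = A^T b / beta; x = 0 is optimal
# precisely when every |xi_i| <= 1, i.e., ||A^T b||_inf <= beta.
xi = A.T @ b / beta
assert np.max(np.abs(xi)) <= 1.0             # so x = 0 is indeed optimal at this beta
```

Since \(\lambda _n/\lambda _1 \ge 1\) and \(\Vert \cdot \Vert _2 \ge \Vert \cdot \Vert _\infty \), the assertion always holds: the threshold of part (a) is sufficient (though generally not tight) for a fully zero solution.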
Proof of Lemma 3.2
-
(i)
We claim that
$$\begin{aligned} J_1\cup J_2=\{ 1, \ldots , n\}\,. \end{aligned}$$(63)Indeed, if (63) does not hold, then the set \(J_1^c\cap J_2^c\) is nonempty. For each \(i\in J_1^c\cap J_2^c\), there exist decreasing sequences \((\gamma _k^{(i)}), (\delta _k^{(i)})\) tending to \({{\hat{\beta }}}\) such that \(x_i(\gamma _k^{(i)})\ne 0\) and \(|\xi _i(\delta _k^{(i)})|<1\), and hence, \(x_i(\delta _k^{(i)})=0\). By the continuity of \(x_i\) and \(\xi _i\) and by (7), we can find open intervals \({]}a_k^{(i)},b_k^{(i)}{[}\) and \({]}c_k^{(i)},d_k^{(i)}{[}\) around \(\gamma _k^{(i)}\) (resp. \(\delta _k^{(i)}\)) such that \(x_i(\beta )\ne 0\) for all \(\beta \in {]}a_k^{(i)},b_k^{(i)}{[}\) and \(x_i(\beta )= 0\) for all \(\beta \in {]}c_k^{(i)},d_k^{(i)}{[}\). Based on this property, we can easily find a sequence of mutually disjoint open intervals \((I_k = {]}\nu _k,\mu _k{[})_{k\ge 0}\) such that \(\mu _k \searrow {{\hat{\beta }}}\), \(\nu _k \searrow {{\hat{\beta }}}\) and for each \(i \in J_1^c\cap J_2^c\) and for each integer \(k\ge 0\) we have
$$\begin{aligned} \left( x_i(\beta )\ne 0 \text { for all } \beta \in I_k \right) \quad \text {OR} \quad \left( x_i(\beta )= 0 \text { for all } \beta \in I_k \right) . \end{aligned}$$Moreover, we can assume that these intervals are maximal with this property. This implies that for each k, there exists \(i\in J_1^c\cap J_2^c\) such that,
$$\begin{aligned} \begin{aligned}&\text {if } x_i(\beta )\ne 0 \text { for all } \beta \in I_k, \text { then } x_i(\nu _k)=0\\&\qquad \qquad \qquad \qquad \text {OR}\\&\text {if } |\xi _i(\beta )| < 1 \text { for all } \beta \in I_k, \text { then } |\xi _i(\nu _k)| = 1\,. \end{aligned} \end{aligned}$$(64)Consider now the index sets \(L_k=\{ i \in J_1^c\cap J_2^c :x_i(\beta ) =0 \text { for all } \beta \in I_k\}\), \(J^* =\{ 1\le i \le n :|\xi _i({{\hat{\beta }}})| <1\}\). So, for \(\beta >{{\hat{\beta }}}\) near \({{\hat{\beta }}}\) and for all \(i\in J^*\), \( |\xi _i( \beta )| <1\), and hence, \(x_i(\beta )=0\). Thus, for sufficiently large k and for all \(i\in M_k := J^*\cup L_k\), we have \(x_i(\beta ) = 0\) for all \(\beta \in I_k\). On the other hand, for all \(i\in M_k^c\) we must have \(| \xi _i(\beta )|=1\) for all \(\beta \in I_k\). In other words, \(\xi _{M_k^c}(\beta )\) is constant on \(I_k\) having the coordinates 1 or \(-1\). Since for all \(\beta \in I_k\) we have \(x_{M_k}(\beta )=0\), using (10) we obtain
$$\begin{aligned} 0&=u_{M_k}-\beta R_{M_k\cdot }\xi (\beta )\nonumber \\&=u_{M_k}-\beta \Big (R_{M_kM_k} \xi _{M_k}(\beta ) +R_{M_kM_k^c} \xi _{M_k^c}(\beta )\Big ) \nonumber \\&=u_{M_k}-\beta \Big (R_{M_kM_k} \xi _{M_k}(\beta ) +R_{M_kM_k^c}G^k_{M_k^c}\Big ), \end{aligned}$$(65)where \(G^k\in \mathbb {R}^n\) is the constant vector such that \(G^k_{M_k^c}:=\xi _{M_k^c}(\beta ) \) (for all \(\beta \in I_k\)). Multiplying Eq. (65) on the left with \(\frac{1}{\beta } (R_{M_kM_k}) ^{-1}\), we get easily
$$\begin{aligned} \xi (\beta )=\frac{1}{\beta } F^k+G^k \qquad \text {for all } \beta \in I_k\,, \end{aligned}$$(66)where
$$\begin{aligned} F^k_{M_k}&=(R_{M_kM_k}) ^{-1}u_{M_k} \\ G^k_{M_k}&=-(R_{M_kM_k}) ^{-1}R_{M_kM_k^c}G^k_{M_k^c}\\ F^k_{M_k^c}&=0\,. \end{aligned}$$Substituting (66) into (10), we obtain
$$\begin{aligned} x(\beta )=u-RF^k-\beta RG^k \qquad \text {for all } \beta \in I_k\,. \end{aligned}$$(67)Since \(M_k\subseteq \{1, \ldots ,n\}\), the number of distinct sets \(M_k\) is upper bounded by \(2^n\). Hence, we can find a constant subsequence of the sequence \((M_k)_{k\ge 0}\), which—to simplify notation—we denote \((M_{k'})_{k'\ge 0}\). So, let M be such that \(M_{k'}=M\) for all \(k'\). Therefore, there exist F and G such that \(F^{k'}=F\), \(G^{k'}=G\) for all \(k'\). Finally, we have for all \(k'\)
$$\begin{aligned} x(\beta )&=u-RF-\beta RG&\text {for all } \beta \in I_{k'}\\ \xi (\beta )&=\frac{1}{\beta } F+G&\text {for all } \beta \in I_{k'}\,. \end{aligned}$$By (64) and using the fact that the set \(J_1^c\cap J_2^c\) is finite, we can find a subsequence \((I_{k''})_{k''\ge 0}\) such that there exists \(i\in J_1^c\cap J_2^c\) verifying \(\Big (x_i(\beta )\ne 0\) for all \(\beta \in I_{k''}\) and \(x_i(\nu _{k''})=0\Big )\) OR \(\Big (|\xi _i(\beta )|<1\) for all \(\beta \in I_{k''}\) and \(| \xi _i(\nu _{k''})|=1\Big )\) for all \(k''\). By the continuity of \(x_i(\cdot )\) and \(\xi _i(\cdot )\), we get for all \(k''\)
$$\begin{aligned}&\Big (u_i-R_{i\cdot }F-\nu _{k''} R_{i\cdot }G=0 \quad \text {with }u_i-R_{i\cdot }F\ne 0\Big ) \\&\quad \text { OR } \quad \Big (\Big |\frac{1}{\nu _{k''}} F_i+G_i\Big |=1\quad \text {with } F_i\ne 0\Big ) \end{aligned}$$which is impossible. This contradiction proves that (63) is satisfied.
-
(ii)
By (63) we have \(x_{J_1}(\beta )=0\) and \(\xi _{J_2}(\beta )=\xi _{J_2}({{\hat{\beta }}})\) for all \(\beta >{{\hat{\beta }}}\) near \({{\hat{\beta }}}\). Since \(J_1^c\subset J_2\) we also have \(\xi _{J_1^c}(\beta )=\xi _{J_1^c}({{\hat{\beta }}})\) for all \(\beta >{{\hat{\beta }}}\) near \({{\hat{\beta }}}\). From (10)
$$\begin{aligned} 0&=x_{J_1}(\beta ) \\&= u_{J_1}-\beta R_{J_1\cdot }\xi (\beta )\\&= u_{J_1}-\beta \Big (R_{J_1J_1}\xi _{J_1}(\beta )+R_{J_1J_1^c}\xi _{J_1^c}(\beta )\Big )\\&=u_{J_1}-\beta \Big (R_{J_1J_1}\xi _{J_1}(\beta )+R_{J_1J_1^c}\xi _{J_1^c}({{\hat{\beta }}})\Big ) \end{aligned}$$for all \(\beta >{{\hat{\beta }}}\) near \({{\hat{\beta }}}\). It follows immediately that
$$\begin{aligned} \xi _{J_1}(\beta )=\frac{1}{\beta }(R_{J_1J_1})^{-1}u_{J_1}-(R_{J_1J_1})^{-1}R_{J_1J_1^c}\xi _{J_1^c}({{\hat{\beta }}})\,, \end{aligned}$$and hence, defining F, \(G\in \mathbb {R}^n\) by
$$\begin{aligned}&F_{J_1}=(R_{J_1J_1})^{-1}u_{J_1}\,, \quad G_{J_1}=-(R_{J_1J_1})^{-1}R_{J_1J_1^c}\xi _{J_1^c}({{\hat{\beta }}})\,, \quad F_{J_1^c}=0\,,\\&\quad G_{J_1^c} = \xi _{J_1^c}({{\hat{\beta }}})\,, \end{aligned}$$we obtain \(\xi (\beta )=\frac{1}{\beta }F+G\) for all \(\beta >{{\hat{\beta }}}\) near \({{\hat{\beta }}}\), which proves the claim.
\(\square \)
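The piecewise structure established in the lemma, namely that \(x(\beta )\) is affine in \(\beta \) on intervals where the active set stays fixed, can be observed numerically. A sketch using a generic proximal-gradient solver (not the paper's Ensalg) on random data:

```python
import numpy as np

def solve_en(A, b, alpha, beta, iters=20000):
    """Generic proximal-gradient solver for 0.5*||Ax-b||^2 + (alpha/2)*||x||^2 + beta*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2 + alpha            # step size from a Lipschitz bound
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - (A.T @ (A @ x - b) + alpha * x) / L
        x = np.sign(z) * np.maximum(np.abs(z) - beta / L, 0.0)  # soft-thresholding
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
alpha = 0.3

betas = np.array([0.10, 0.11, 0.12, 0.13])           # an equally spaced beta grid
X = np.stack([solve_en(A, b, alpha, t) for t in betas])

supports = {tuple(np.abs(x) > 1e-7) for x in X}
if len(supports) == 1:
    # Same active set on the whole grid, so x(beta) should be affine in beta:
    second_diff = X[2:] - 2 * X[1:-1] + X[:-2]       # vanishes for an affine map
    assert np.max(np.abs(second_diff)) < 1e-5
```

On an equally spaced grid, an affine path has vanishing second differences; a kink (change of active set) inside the grid would break the check, which is exactly the "breakpoint" behavior the path algorithm tracks.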
Cite this article
Bonnel, H., Schneider, C. Post-Pareto Analysis and a New Algorithm for the Optimal Parameter Tuning of the Elastic Net. J Optim Theory Appl 183, 993–1027 (2019). https://doi.org/10.1007/s10957-019-01592-x
Keywords
- Post-Pareto analysis
- Multi-objective optimization
- Bilevel optimization
- Linear regression
- Sparsity
- Elastic net
- Linear complementarity problem