Abstract
The TREX is a recently introduced approach to sparse linear regression. In contrast to most well-known approaches to penalized regression, the TREX can be formulated without the use of tuning parameters. In this paper, we establish the first known prediction error bounds for the TREX. Additionally, we introduce extensions of the TREX to a more general class of penalties, and we provide a bound on the prediction error in this generalized setting. These results deepen the understanding of the TREX from a theoretical perspective and provide new insights into penalized regression in general.
References
Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79
Arlot S, Celisse A (2011) Segmentation of the mean of heteroscedastic data via cross-validation. Stat Comput 21(4):613–632
Bach F (2008) Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th international conference on machine learning, pp 33–40
Baraud Y, Giraud C, Huet S (2009) Gaussian model selection with an unknown variance. Ann Stat 37(2):630–672
Barber R, Candès E (2015) Controlling the false discovery rate via knockoffs. Ann Stat 43(5):2055–2085
Belloni A, Chernozhukov V, Wang L (2011) Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4):791–806
Bickel P, Ritov Y, Tsybakov A (2009) Simultaneous analysis of lasso and Dantzig selector. Ann Stat 37(4):1705–1732
Bien J, Gaynanova I, Lederer J, Müller C (2018) Non-convex global minimization and false discovery rate control for the TREX. J Comput Graph Stat 27(1):23–33. https://doi.org/10.1080/10618600.2017.1341414
Boucheron S, Lugosi G, Massart P (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford
Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin
Bunea F, Lederer J, She Y (2014) The group square-root lasso: theoretical properties and fast algorithms. IEEE Trans Inf Theory 60(2):1313–1325
Bunea F, Tsybakov A, Wegkamp M (2006) Aggregation and sparsity via \(\ell _1\)-penalized least squares. In: Proceedings of 19th annual conference on learning theory, pp 379–391
Candès E, Plan Y (2009) Near-ideal model selection by \(\ell _1\) minimization. Ann Stat 37(5):2145–2177
Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351
Chatterjee S, Jafarov J (2015) Prediction error of cross-validated lasso. arXiv:1502.06291
Chételat D, Lederer J, Salmon J (2017) Optimal two-step prediction in regression. Electron J Stat 11(1):2519–2546
Chichignoud M, Lederer J, Wainwright M (2016) A practical scheme and fast algorithm to tune the lasso with optimality guarantees. J Mach Learn Res 17:1–20
Combettes P, Müller C (2016) Perspective functions: proximal calculus and applications in high-dimensional statistics. J Math Anal Appl 457(2):1283–1306
Dalalyan A, Tsybakov A (2012) Mirror averaging with sparsity priors. Bernoulli 18(3):914–944
Dalalyan A, Tsybakov A (2012) Sparse regression learning by aggregation and Langevin Monte-Carlo. J Comput Syst Sci 78(5):1423–1443
Dalalyan A, Hebiri M, Lederer J (2017) On the prediction performance of the lasso. Bernoulli 23(1):552–581
Dalalyan A, Tsybakov A (2007) Aggregation by exponential weighting and sharp oracle inequalities. In: Proceedings of 19th annual conference on learning theory, pp 97–111
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Giraud C, Huet S, Verzelen N (2012) High-dimensional regression with unknown variance. Stat Sci 27(4):500–518
Hebiri M, Lederer J (2013) How correlations influence lasso prediction. IEEE Trans Inf Theory 59(3):1846–1854
Huang C, Cheang G, Barron A (2008) Risk of penalized least squares, greedy selection and L1 penalization for flexible function libraries. Manuscript
Koltchinskii V, Lounici K, Tsybakov A (2011) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann Stat 39(5):2302–2329
Lederer J, van de Geer S (2014) New concentration inequalities for empirical processes. Bernoulli 20(4):2020–2038
Lederer J, Müller C (2014) Topology adaptive graph estimation in high dimensions. arXiv:1410.7279
Lederer J, Müller C (2015) Don’t fall for tuning parameters: tuning-free variable selection in high dimensions with the TREX. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence
Lederer J, Yu L, Gaynanova I (2016) Oracle inequalities for high-dimensional prediction. arXiv:1608.00624
Lim N, Lederer J (2016) Efficient feature selection with large and high-dimensional data. arXiv:1609.07195
Massart P, Meynet C (2011) The Lasso as an \(\ell _1\)-ball model selection procedure. Electron J Stat 5:669–687
Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Ser B 72(4):417–473
Raskutti G, Wainwright M, Yu B (2010) Restricted eigenvalue properties for correlated Gaussian designs. J Mach Learn Res 11:2241–2259
Rigollet P, Tsybakov A (2011) Exponential screening and optimal rates of sparse estimation. Ann Stat 39(2):731–771
Sabourin J, Valdar W, Nobel A (2015) A permutation approach for selecting the penalty parameter in penalized model selection. Biometrics 71(4):1185–1194
Shah R, Samworth R (2013) Variable selection with error control: another look at stability selection. J R Stat Soc Ser B 75(1):55–80
Sun T, Zhang CH (2012) Scaled sparse linear regression. Biometrika 99(4):879–898
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288
van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the lasso. Electron J Stat 3:1360–1392
van de Geer S, Lederer J (2013) The Bernstein-Orlicz norm and deviation inequalities. Probab Theory Relat Fields 157(1–2):225–250
van de Geer S, Lederer J (2013) The Lasso, correlated design, and improved oracle inequalities. IMS Collections 9:303–316
van der Vaart A, Wellner J (1996) Weak convergence and empirical processes. Springer, Berlin
van de Geer S (2007) The deterministic lasso. In: Joint statistical meetings proceedings
van de Geer S (2000) Empirical processes in M-estimation. Cambridge University Press, Cambridge
Wainwright M (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (lasso). IEEE Trans Inf Theory 55(4):2183–2202
Wellner J (2017) The Bennett-Orlicz norm. Sankhya A 79(2):355–383
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67
Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942
Zhuang R, Lederer J (2017) Maximum regularized likelihood estimators: a general prediction theory and applications. arXiv:1710.02950
Acknowledgements
We thank the editor and the reviewers for their insightful comments.
Appendices
Appendix A: Proof of a generalization of Theorem 1
We first consider a generalization of Assumption 1.
Assumption 4
The signal \(\beta ^*\) is sufficiently small such that for some \(\kappa _{1}>1\) and \(\kappa _{2}>2\) with \(1/\kappa _1 + 2/\kappa _2 <1\),
As a first step toward the proof of Theorem 1, we show that any TREX solution has an \(\ell _{1}\)-norm at least as large as that of any LASSO solution with tuning parameter \(\lambda =\hat{u}\).
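For orientation, we recall the form of the two estimators compared here; the display below is only a sketch and may differ from Eqs. (2) and (4) in the main text in its normalization constants (cf. Tibshirani 1996; Lederer and Müller 2015):
\[\tilde{\beta }(\lambda )\in \mathop {\mathrm{arg\,min}}\limits _{\beta \in \mathbb {R}^p}\Bigl \{\Vert Y-X\beta \Vert _2^2+2\lambda \Vert \beta \Vert _1\Bigr \},\qquad \hat{\beta }\in \mathop {\mathrm{arg\,min}}\limits _{\beta \in \mathbb {R}^p}\Bigl \{\frac{\Vert Y-X\beta \Vert _2^2}{c\Vert X^\top (Y-X\beta )\Vert _\infty }+\Vert \beta \Vert _1\Bigr \},\]
with \(c\in (0,2)\) and \(\hat{u}:=\Vert X^\top (Y-X\hat{\beta })\Vert _\infty \).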
Lemma 3
Any TREX solution (4) satisfies
\[\Vert \hat{\beta }\Vert _{1}\ \ge \ \Vert \tilde{\beta }(\hat{u})\Vert _{1},\]
where \(\tilde{\beta }(\hat{u})\) is any LASSO solution as in (2) with tuning parameter \(\lambda =\hat{u}\).
Proof (of Lemma 3)
If \(\tilde{\beta }(\hat{u})=0\), the statement holds trivially. Now for \(\tilde{\beta }(\hat{u})\ne 0\), the KKT conditions for LASSO imply that
Together with the definition of \(\hat{\beta }\), this yields
On the other hand, the definition of the LASSO implies
Combining these two displays gives us
The claim now follows from \(c< 2\).
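The proof of Theorem 4 below also uses the standard fact that the \(\ell _1\)-norm of a LASSO solution is non-increasing in the tuning parameter. Under the indicative normalization sketched before Lemma 3 (the conclusion does not depend on this normalization), this follows from comparing the optimality of \(\tilde{\beta }(\lambda )\) and \(\tilde{\beta }(\lambda ')\) at their respective tuning parameters \(\lambda \le \lambda '\):
\[\Vert Y-X\tilde{\beta }(\lambda )\Vert _2^2+2\lambda \Vert \tilde{\beta }(\lambda )\Vert _1\le \Vert Y-X\tilde{\beta }(\lambda ')\Vert _2^2+2\lambda \Vert \tilde{\beta }(\lambda ')\Vert _1,\]
\[\Vert Y-X\tilde{\beta }(\lambda ')\Vert _2^2+2\lambda '\Vert \tilde{\beta }(\lambda ')\Vert _1\le \Vert Y-X\tilde{\beta }(\lambda )\Vert _2^2+2\lambda '\Vert \tilde{\beta }(\lambda )\Vert _1.\]
Adding these two inequalities and canceling the quadratic terms gives \((\lambda '-\lambda )\bigl (\Vert \tilde{\beta }(\lambda ')\Vert _1-\Vert \tilde{\beta }(\lambda )\Vert _1\bigr )\le 0\), so that \(\lambda '\ge \lambda \) implies \(\Vert \tilde{\beta }(\lambda ')\Vert _1\le \Vert \tilde{\beta }(\lambda )\Vert _1\).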
We are now ready to prove a generalization of Theorem 1.
Theorem 4
Let Assumption 4 be fulfilled, and let \(\tilde{\lambda }:=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\). Then, for any \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the prediction loss of the TREX satisfies
Theorem 1 follows from Theorem 4 by setting \(\kappa _{1}=2\) and \(\kappa _{2}=8\).
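These choices indeed satisfy the explicit constraints of Assumption 4, since
\[\kappa _1=2>1,\qquad \kappa _2=8>2,\qquad \frac{1}{\kappa _1}+\frac{2}{\kappa _2}=\frac{1}{2}+\frac{1}{4}=\frac{3}{4}<1.\]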
Proof (of Theorem 4)
Assume first \(\tilde{\beta }(\tilde{\lambda })=0\). Then, since \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the definition of the TREX implies
Assume now \(\tilde{\beta }(\tilde{\lambda })\ne 0\). In view of the KKT conditions for LASSO, \(\tilde{\beta }(\tilde{\lambda })\) fulfills
The definition of the TREX therefore yields
We now observe that since \(\kappa _1> 1\) by assumption, we have \(\tilde{\lambda }\ge \hat{u}\), and one can easily verify that this implies \(\Vert \tilde{\beta }(\hat{u})\Vert _{1}\ge \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\). At the same time, Lemma 3 ensures \(\Vert \hat{\beta }\Vert _1\ge \Vert \tilde{\beta }(\hat{u})\Vert _1\). Thus, \(c\hat{u}\Vert \hat{\beta }\Vert _{1}\ge c\hat{u}\Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\) and, therefore, we find again
Invoking the model, this results in
We can now use Hölder’s inequality and the triangle inequality to deduce
Next, we observe that by the definition of our estimator \(\hat{\beta }\) and of \(\tilde{\lambda }\),
Combining these two displays and using the model assumption then gives
We can now use Hölder’s inequality and rearrange the terms to get
The last step is to use Assumption 4, which ensures
and, therefore,
The above display therefore yields
Using the triangle inequality, we finally obtain
as desired.
Corollary 4
Let Assumption 4 be fulfilled, and let \(\tilde{\lambda }:=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\). Furthermore, let \(\kappa _1,\kappa _2>0\) be such that
Then, for any \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the prediction loss of the TREX satisfies
where \(\nu \) is the compatibility constant defined in (5).
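For concreteness, one standard formulation of a compatibility constant over the support \(S:=\{j:\beta ^*_j\ne 0\}\) is (Bühlmann and van de Geer 2011)
\[\nu ^2:=\min \Bigl \{\frac{|S|\,\Vert X\delta \Vert _2^2}{n\Vert \delta _S\Vert _1^2}\,:\,\delta \ne 0,\ \Vert \delta _{S^c}\Vert _1\le 3\Vert \delta _S\Vert _1\Bigr \},\]
where \(\delta _S\) and \(\delta _{S^c}\) denote the restrictions of \(\delta \) to \(S\) and to its complement; the normalization and the cone constant in (5) may differ from this sketch.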
Corollary 3 follows from Corollary 4 by setting \(\kappa _{1}=2\) and \(\kappa _{2}=8\), which satisfy the requirement for any \(c\in (0,2)\).
Proof (of Corollary 4)
Using Theorem 4 and the definition of \(\tilde{\lambda }\), the TREX prediction loss satisfies
On the other hand, the LASSO estimator \(\tilde{\beta }(\tilde{\lambda })\) satisfies (Bühlmann and van de Geer 2011, Theorem 6.1)
Since by assumption
the LASSO bound implies
as desired.
Appendix B: Proof of a generalization of Theorem 2
We consider a generalization of the TREX according to
\[\hat{\beta }\in \mathop {\mathrm{arg\,min}}\Bigl \{\frac{\Vert Y-X\beta \Vert _{2}^{2}}{c\,\Omega ^{*}\bigl (X^{\top }(Y-X\beta )\bigr )}+\Omega (\beta )\Bigr \},\]
where \(\Omega \) is a norm on \(\mathbb {R}^p\), \(\Omega ^{*}(\eta ):=\sup \{\eta ^{\top }\beta :\Omega (\beta )\le 1\}\) is the dual of \(\Omega \), \(0<c<2\), and the minimum is taken over all \(\beta \in \mathbb {R}^p\). We also set \(\hat{u}:=\Omega ^{*}(X^{\top }(Y-X\hat{\beta }))\) with some abuse of notation. The corresponding generalization of Assumption 2 then reads as follows.
Assumption 5
The regression vector \(\beta ^*\) is sufficiently large such that
We now prove a generalization of Theorem 2.
Theorem 5
Let Assumption 5 be fulfilled. If \(\hat{u}\le \Omega ^{*}(X^{\top }Y)\), then the prediction loss of the TREX satisfies
Theorem 2 follows from Theorem 5 by setting \(\Omega (\cdot ):=\Vert \cdot \Vert _{1}.\)
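Indeed, for \(\Omega (\cdot )=\Vert \cdot \Vert _{1}\), the dual norm defined above is the \(\ell _\infty \)-norm,
\[\Omega ^{*}(\eta )=\sup \{\eta ^{\top }\beta :\Vert \beta \Vert _1\le 1\}=\Vert \eta \Vert _\infty ,\]
so that \(\Omega ^{*}(X^{\top }\varepsilon )=\Vert X^{\top }\varepsilon \Vert _\infty \) and \(\hat{u}\) coincides with the quantity used in Appendix A. More generally, the definition of \(\Omega ^{*}\) directly yields the dual-norm form of Hölder's inequality, \(\eta ^{\top }\beta \le \Omega ^{*}(\eta )\Omega (\beta )\), which is used repeatedly in the proof below.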
Proof (of Theorem 5)
The definition of the estimator implies
which yields together with the model assumptions
Rearranging the terms and using Hölder's inequality in the form of \(2\varepsilon ^\top X\hat{\beta }\le 2\Omega ^*(X^\top \varepsilon )\Omega (\hat{\beta })\) and \(\Vert X\beta ^*\Vert ^{2}_{2}=\beta ^*{}^{\top }X^{\top }X\beta ^*\le \Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)\) then gives
Case 1: We first consider the case \(2\Omega ^{*}(X^{\top }\varepsilon )\le c\hat{u}\).
In this case, we first note that \(\hat{u} \le \Omega ^{*}(X^{\top }Y)\) by assumption. Using this and \(2\Omega ^{*}(X^{\top }\varepsilon )\le c\hat{u}\) allows us to remove the first two terms on the right-hand side of Inequality (9), so that
Since \(2\varepsilon ^\top X\beta ^*\le 2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)\) due to Hölder’s Inequality, and since \(\hat{u}\le \Omega ^*(X^\top Y)\) by definition of our estimator, we therefore obtain from the above display and the model assumptions
Next, we note that the triangle inequality gives
Plugging this into the previous display finally yields
which concludes the proof for Case 1.
Case 2: We now consider the case \(2\Omega ^{*}(X^{\top }\varepsilon )\ge c\hat{u}\).
As before, we start with the definition of the estimator, which yields in particular
Invoking the model assumptions and Hölder’s inequality then gives
We can now plug this into Inequality (9) to obtain
Rearranging the terms then gives
We now observe that Assumption 5 implies via the triangle inequality and the model assumptions that
Using this, Hölder’s inequality, and the triangle inequality, we then find
This concludes the proof for Case 2 and, therefore, the proof of Theorem 5.
Appendix C: Prediction error results for \(p=128\)
See Fig. 2.