Prediction error bounds for linear regression with the TREX

Abstract

The TREX is a recently introduced approach to sparse linear regression. In contrast to most well-known approaches to penalized regression, the TREX can be formulated without the use of tuning parameters. In this paper, we establish the first known prediction error bounds for the TREX. Additionally, we introduce extensions of the TREX to a more general class of penalties, and we provide a bound on the prediction error in this generalized setting. These results deepen the understanding of the TREX from a theoretical perspective and provide new insights into penalized regression in general.

Fig. 1

Notes

  1.

    The right-hand side in Corollary 2 has a minimal magnitude of \(4\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1/n\), while the right-hand side in Theorem 2 has a minimal magnitude of \((2+2/c)\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1/n\).

References

  1. Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79

  2. Arlot S, Celisse A (2011) Segmentation of the mean of heteroscedastic data via cross-validation. Stat Comput 21(4):613–632

  3. Bach F (2008) Bolasso: model consistent Lasso estimation through the bootstrap. In: Proceedings of the 25th international conference on machine learning, pp 33–40

  4. Baraud Y, Giraud C, Huet S (2009) Gaussian model selection with an unknown variance. Ann Stat 37(2):630–672

  5. Barber R, Candès E (2015) Controlling the false discovery rate via knockoffs. Ann Stat 43(5):2055–2085

  6. Belloni A, Chernozhukov V, Wang L (2011) Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4):791–806

  7. Bickel P, Ritov Y, Tsybakov A (2009) Simultaneous analysis of lasso and Dantzig selector. Ann Stat 37(4):1705–1732

  8. Bien J, Gaynanova I, Lederer J, Müller C (2018) Non-convex global minimization and false discovery rate control for the TREX. J Comput Graph Stat 27(1):23–33. https://doi.org/10.1080/10618600.2017.1341414

  9. Boucheron S, Lugosi G, Massart P (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford

  10. Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin

  11. Bunea F, Lederer J, She Y (2014) The group square-root lasso: theoretical properties and fast algorithms. IEEE Trans Inf Theory 60(2):1313–1325

  12. Bunea F, Tsybakov A, Wegkamp M (2006) Aggregation and sparsity via \(\ell _1\)-penalized least squares. In: Proceedings of 19th annual conference on learning theory, pp 379–391

  13. Candès E, Plan Y (2009) Near-ideal model selection by \(\ell _1\) minimization. Ann Stat 37(5):2145–2177

  14. Candès E, Tao T (2007) The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann Stat 35(6):2313–2351

  15. Chatterjee S, Jafarov J (2015) Prediction error of cross-validated lasso. arXiv:1502.06291

  16. Chételat D, Lederer J, Salmon J (2017) Optimal two-step prediction in regression. Electron J Stat 11(1):2519–2546

  17. Chichignoud M, Lederer J, Wainwright M (2016) A practical scheme and fast algorithm to tune the lasso with optimality guarantees. J Mach Learn Res 17:1–20

  18. Combettes P, Müller C (2016) Perspective functions: proximal calculus and applications in high-dimensional statistics. J Math Anal Appl 457(2):1283–1306

  19. Dalalyan A, Tsybakov A (2012) Mirror averaging with sparsity priors. Bernoulli 18(3):914–944

  20. Dalalyan A, Tsybakov A (2012) Sparse regression learning by aggregation and Langevin Monte Carlo. J Comput Syst Sci 78(5):1423–1443

  21. Dalalyan A, Hebiri M, Lederer J (2017) On the prediction performance of the lasso. Bernoulli 23(1):552–581

  22. Dalalyan A, Tsybakov A (2007) Aggregation by exponential weighting and sharp oracle inequalities. In: Proceedings of 19th annual conference on learning theory, pp 97–111

  23. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  24. Giraud C, Huet S, Verzelen N (2012) High-dimensional regression with unknown variance. Stat Sci 27(4):500–518

  25. Hebiri M, Lederer J (2013) How correlations influence lasso prediction. IEEE Trans Inf Theory 59(3):1846–1854

  26. Huang C, Cheang G, Barron A (2008) Risk of penalized least squares, greedy selection and \(\ell _1\) penalization for flexible function libraries. Manuscript

  27. Koltchinskii V, Lounici K, Tsybakov A (2011) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann Stat 39(5):2302–2329

  28. Lederer J, van de Geer S (2014) New concentration inequalities for empirical processes. Bernoulli 20(4):2020–2038

  29. Lederer J, Müller C (2014) Topology adaptive graph estimation in high dimensions. arXiv:1410.7279

  30. Lederer J, Müller C (2015) Don’t fall for tuning parameters: tuning-free variable selection in high dimensions with the TREX. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence

  31. Lederer J, Yu L, Gaynanova I (2016) Oracle inequalities for high-dimensional prediction. arXiv:1608.00624

  32. Lim N, Lederer J (2016) Efficient feature selection with large and high-dimensional data. arXiv:1609.07195

  33. Massart P, Meynet C (2011) The Lasso as an \(\ell _1\)-ball model selection procedure. Electron J Stat 5:669–687

  34. Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Ser B 72(4):417–473

  35. Raskutti G, Wainwright M, Yu B (2010) Restricted eigenvalue properties for correlated Gaussian designs. J Mach Learn Res 11:2241–2259

  36. Rigollet P, Tsybakov A (2011) Exponential screening and optimal rates of sparse estimation. Ann Stat 39(2):731–771

  37. Sabourin J, Valdar W, Nobel A (2015) A permutation approach for selecting the penalty parameter in penalized model selection. Biometrics 71(4):1185–1194

  38. Shah R, Samworth R (2013) Variable selection with error control: another look at stability selection. J R Stat Soc Ser B 75(1):55–80

  39. Sun T, Zhang CH (2012) Scaled sparse linear regression. Biometrika 99(4):879–898

  40. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288

  41. van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the lasso. Electron J Stat 3:1360–1392

  42. van de Geer S, Lederer J (2013) The Bernstein-Orlicz norm and deviation inequalities. Probab Theory Relat Fields 157(1–2):225–250

  43. van de Geer S, Lederer J (2013) The Lasso, correlated design, and improved oracle inequalities. IMS Collections 9:303–316

  44. van der Vaart A, Wellner J (1996) Weak convergence and empirical processes. Springer, Berlin

  45. van de Geer S (2007) The deterministic lasso. In: Joint statistical meetings proceedings

  46. van de Geer S (2000) Empirical processes in M-estimation. Cambridge University Press, Cambridge

  47. Wainwright M (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (lasso). IEEE Trans Inf Theory 55(4):2183–2202

  48. Wellner J (2017) The Bennett-Orlicz norm. Sankhya A 79(2):355–383

  49. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67

  50. Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942

  51. Zhuang R, Lederer J (2017) Maximum regularized likelihood estimators: a general prediction theory and applications. arXiv:1710.02950

Acknowledgements

We thank the editor and the reviewers for their insightful comments.

Author information

Corresponding author

Correspondence to Johannes Lederer.

Appendices

Appendix A: proof of a generalization of Theorem 1

We first consider a generalization of Assumption 1.

Assumption 4

The signal \(\beta ^*\) is sufficiently small such that for some \(\kappa _{1}>1\) and \(\kappa _{2}>2\) with \(1/\kappa _1 + 2/\kappa _2 <1\),

$$\begin{aligned} \Vert \beta ^*\Vert _1\le \frac{1}{4}\left( 1-\frac{1}{\kappa _{1}}-\frac{2}{\kappa _{2}}\right) \frac{\Vert \varepsilon \Vert _2^2}{\Vert X^{\top }\varepsilon \Vert _{\infty }}. \end{aligned}$$

As a first step toward the proof of Theorem 1, we show that any TREX solution has an \(\ell _{1}\)-norm at least as large as that of any LASSO solution with tuning parameter \(\lambda =\hat{u}\).

Lemma 3

Any TREX solution (4) satisfies

$$\begin{aligned} \Vert \hat{\beta }\Vert _{1}\ge \Vert \tilde{\beta }(\hat{u})\Vert _{1}, \end{aligned}$$

where \(\tilde{\beta }(\hat{u})\) is any LASSO solution as in (2) with tuning parameter \(\lambda =\hat{u}\).

Proof (of Lemma 3)

If \(\tilde{\beta }(\hat{u})=0\), the statement holds trivially. Now for \(\tilde{\beta }(\hat{u})\ne 0\), the KKT conditions for LASSO imply that

$$\begin{aligned} \Vert X^{\top }(Y-X\tilde{\beta }(\hat{u}))\Vert _{\infty }=\hat{u}. \end{aligned}$$

Together with the definition of \(\hat{\beta }\), this yields

$$\begin{aligned} \Vert Y-X\hat{\beta }\Vert ^{2}_{2}+c\hat{u}\Vert \hat{\beta }\Vert _{1}\le \Vert Y-X\tilde{\beta }(\hat{u})\Vert ^{2}_{2}+c\hat{u}\Vert \tilde{\beta }(\hat{u})\Vert _{1}. \end{aligned}$$

On the other hand, the definition of the LASSO implies

$$\begin{aligned} \Vert Y-X\tilde{\beta }(\hat{u})\Vert ^{2}_{2}+2\hat{u}\Vert \tilde{\beta }(\hat{u})\Vert _{1}\le \Vert Y-X\hat{\beta }\Vert ^{2}_{2}+2\hat{u}\Vert \hat{\beta }\Vert _{1}. \end{aligned}$$

Combining these two displays gives us

$$\begin{aligned} (c-2)\hat{u}\Vert \hat{\beta }\Vert _{1}\le (c-2)\hat{u}\Vert \tilde{\beta }(\hat{u})\Vert _{1}. \end{aligned}$$

The claim now follows from \(c<2\): dividing both sides by the negative factor \((c-2)\hat{u}\) reverses the inequality.
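As a purely illustrative sanity check (ours, not part of the formal argument), Lemma 3 can be probed numerically on a tiny \(p=2\) example by brute-force minimization over a grid; all names below are ours, and the grid search only approximates the exact minimizers:

```python
import numpy as np

# Numerical illustration of Lemma 3: on a small p = 2 problem, a
# grid-approximate TREX solution should have an l1-norm at least as
# large as that of a lasso solution with tuning parameter lambda = u_hat.
rng = np.random.default_rng(1)
n, p, c = 30, 2, 0.5                       # c in (0, 2), as required
X = rng.standard_normal((n, p))
Y = X @ np.array([1.5, -1.0]) + 0.5 * rng.standard_normal(n)

grid = np.linspace(-3.0, 3.0, 301)
B = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)  # candidate betas
R = Y[None, :] - B @ X.T                   # residuals Y - X beta
rss = np.sum(R ** 2, axis=1)               # ||Y - X beta||_2^2
dual = np.max(np.abs(R @ X), axis=1)       # ||X^T (Y - X beta)||_inf
l1 = np.sum(np.abs(B), axis=1)             # ||beta||_1

beta_trex = B[np.argmin(rss / (c * dual) + l1)]    # TREX objective
u_hat = np.max(np.abs(X.T @ (Y - X @ beta_trex)))
# Lasso objective as in the proof: ||Y - X beta||_2^2 + 2 lambda ||beta||_1
beta_lasso = B[np.argmin(rss + 2.0 * u_hat * l1)]

gap = np.sum(np.abs(beta_trex)) - np.sum(np.abs(beta_lasso))
# Lemma 3 predicts gap >= 0, up to the grid resolution of 0.02.
```

The intuition matches the proof: the LASSO at \(\lambda =\hat{u}\) penalizes the \(\ell _1\)-norm with weight \(2\hat{u}\), whereas the TREX effectively penalizes it with the smaller weight \(c\hat{u}\), so it shrinks less.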

We are now ready to prove a generalization of Theorem 1.

Theorem 4

Let Assumption 4 be fulfilled, and let \(\tilde{\lambda }:=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\). Then, for any \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the prediction loss of the TREX satisfies

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le & {} \left( \frac{1}{\kappa _{1}}+\frac{2}{\kappa _{2}}\right) \Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2} \\&+\left( 2+\frac{2}{\kappa _{1}}+\frac{4}{\kappa _{2}}\right) \Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1. \end{aligned}$$

Theorem 1 follows from Theorem 4 by setting \(\kappa _{1}=2\), and \(\kappa _{2}=8\).
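For concreteness, plugging \(\kappa _{1}=2\) and \(\kappa _{2}=8\) into the constants of Theorem 4 gives

$$\begin{aligned} \frac{1}{\kappa _{1}}+\frac{2}{\kappa _{2}}=\frac{1}{2}+\frac{1}{4}=\frac{3}{4}, \qquad 2+\frac{2}{\kappa _{1}}+\frac{4}{\kappa _{2}}=2+1+\frac{1}{2}=\frac{7}{2}. \end{aligned}$$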

Proof (of Theorem 4)

Assume first \(\tilde{\beta }(\tilde{\lambda })=0\). Then, since \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the definition of the TREX implies

$$\begin{aligned} \Vert Y-X\hat{\beta }\Vert ^{2}_{2}+c\hat{u}\Vert \hat{\beta }\Vert _{1}\le \Vert Y\Vert _2^2/\kappa _1 =\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert ^{2}_{2}/\kappa _1\,. \end{aligned}$$

Assume now \(\tilde{\beta }(\tilde{\lambda })\ne 0\). In view of the KKT conditions for LASSO, \(\tilde{\beta }(\tilde{\lambda })\) fulfills

$$\begin{aligned} \Vert X^\top (Y-X\tilde{\beta }(\tilde{\lambda }))\Vert _\infty =\tilde{\lambda }=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\ge \kappa _{1}\hat{u}. \end{aligned}$$

The definition of the TREX therefore yields

$$\begin{aligned} \Vert Y-X\hat{\beta }\Vert ^{2}_{2}+c\hat{u}\Vert \hat{\beta }\Vert _{1}\le \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert ^{2}_{2}}{{\kappa _{1}}}+c\hat{u}\Vert \tilde{\beta }(\tilde{\lambda })\Vert _1. \end{aligned}$$

We now observe that since \(\kappa _1> 1\) by assumption, we have \(\tilde{\lambda }\ge \hat{u}\), and one can easily verify that this implies \(\Vert \tilde{\beta }(\hat{u})\Vert _{1}\ge \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\). At the same time, Lemma 3 ensures \(\Vert \hat{\beta }\Vert _1\ge \Vert \tilde{\beta }(\hat{u})\Vert _1\). Thus, \(c\hat{u}\Vert \hat{\beta }\Vert _{1}\ge c\hat{u}\Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\) and, therefore, we find again

$$\begin{aligned} \Vert Y-X\hat{\beta }\Vert ^{2}_{2}\le \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert ^{2}_{2}}{\kappa _{1}}. \end{aligned}$$

Invoking the model, this results in

$$\begin{aligned}&\kappa _1\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( 1-\kappa _{1}\right) \Vert \varepsilon \Vert ^{2}_{2}+2\varepsilon ^\top (X\beta ^*- X\tilde{\beta }(\tilde{\lambda }))+2\kappa _1\varepsilon ^\top (X\hat{\beta }- X\beta ^*)+\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}. \end{aligned}$$

We can now use Hölder’s inequality and the triangle inequality to deduce

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( \frac{1-\kappa _{1}}{\kappa _{1}}\right) \Vert \varepsilon \Vert _2^2+\frac{2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1}{\kappa _{1}}\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \Vert \hat{\beta }\Vert _1+2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1+ \frac{\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}{\kappa _1} \\&\quad \le \left( \frac{1-\kappa _{1}}{\kappa _{1}}\right) \Vert \varepsilon \Vert _2^2+\frac{2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1}{\kappa _{1}}\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \left( \frac{\Vert Y-X\hat{\beta }\Vert _2^2}{c\hat{u}}+ \Vert \hat{\beta }\Vert _1\right) +2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1+ \frac{\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}{\kappa _1}. \end{aligned}$$

Next, we observe that by the definition of our estimator \(\hat{\beta }\) and of \(\tilde{\lambda }\),

$$\begin{aligned}&\frac{\Vert Y-X\hat{\beta }\Vert _2^2}{c\hat{u}}+ \Vert \hat{\beta }\Vert _1\le \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert _2^2}{c\tilde{\lambda }}+ \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\le \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert _2^2}{\kappa _2\Vert X^\top \varepsilon \Vert _\infty }+ \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1. \end{aligned}$$

Combining these two displays and using the model assumption then gives

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( \frac{1-\kappa _{1}}{\kappa _{1}}\right) \Vert \varepsilon \Vert _2^2+\frac{2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1}{\kappa _{1}}\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \left( \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert _2^2}{\kappa _2\Vert X^\top \varepsilon \Vert _\infty }+ \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\right) +2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1 + \frac{\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}{\kappa _1}\\&\quad = \left( \frac{1-\kappa _{1}}{\kappa _{1}}+\frac{2}{\kappa _2}\right) \Vert \varepsilon \Vert _2^2+\frac{2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1}{\kappa _{1}}\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \left( \frac{2\varepsilon ^\top X(\beta ^*-\tilde{\beta }(\tilde{\lambda }))}{\kappa _2\Vert X^\top \varepsilon \Vert _\infty }+ \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\right) +2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1\\&\qquad +\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}. \end{aligned}$$

We can now use Hölder’s inequality and rearrange the terms to get

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( \frac{1-\kappa _{1}}{\kappa _{1}}+\frac{2}{\kappa _2}\right) \Vert \varepsilon \Vert _2^2+\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) 2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1+2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1+\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}. \end{aligned}$$

The last step is to use Assumption 4, which ensures

$$\begin{aligned} \Vert \beta ^*\Vert _1\le \left( \frac{1}{4}-\frac{1}{4\kappa _{1}}-\frac{1}{2\kappa _{2}}\right) \frac{\Vert \varepsilon \Vert _2^2}{\Vert X^{\top }\varepsilon \Vert _{\infty }} \end{aligned}$$

and, therefore,

$$\begin{aligned} 4\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1\le \left( \frac{\kappa _{1}-1}{\kappa _{1}}-\frac{2}{\kappa _{2}}\right) \Vert \varepsilon \Vert _2^2. \end{aligned}$$

The above display therefore yields

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}&\le -\,4\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1 +\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) 2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1\\&\quad + 2\Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1+2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1\\&\quad +\left( \frac{1}{\kappa _1} +\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}. \end{aligned}$$

Using the triangle inequality, we finally obtain

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le & {} \left( 2+\frac{2}{\kappa _1}+\frac{4}{\kappa _2}\right) \Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1\\&+\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}} \end{aligned}$$

as desired.

Corollary 4

Let Assumption 4 be fulfilled, and let \(\tilde{\lambda }:=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\). Furthermore, let \(\kappa _1,\kappa _2>0\) be such that

$$\begin{aligned} \frac{1}{\kappa _2}+\frac{\kappa _1}{\kappa _2 + 2\kappa _1} \le \frac{1}{c}. \end{aligned}$$

Then, for any \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the prediction loss of the TREX satisfies

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le \left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) \frac{16s\tilde{\lambda }^2}{\nu ^2 n}, \end{aligned}$$

where \(\nu \) is the compatibility constant defined in (5).

Corollary 3 follows from Corollary 4 by setting \(\kappa _{1}=2\), and \(\kappa _{2}=8\), which satisfy the requirement for any \(c\in (0,2)\).
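Indeed, for \(\kappa _{1}=2\) and \(\kappa _{2}=8\) the condition of Corollary 4 reduces to the numeric check

$$\begin{aligned} \frac{1}{\kappa _2}+\frac{\kappa _1}{\kappa _2 + 2\kappa _1}=\frac{1}{8}+\frac{2}{12}=\frac{7}{24}<\frac{1}{2}<\frac{1}{c} \quad \text {for all } c\in (0,2). \end{aligned}$$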

Proof (of Corollary 4)

Using Theorem 4 and the definition of \(\tilde{\lambda }\), the TREX prediction loss satisfies

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}&\le \left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}\\&\quad + \left( 2+\frac{2}{\kappa _1}+\frac{4}{\kappa _2}\right) \Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1\\&\le \left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) \left[ \Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2} + \left( 2+\frac{2\kappa _1\kappa _2}{\kappa _2 + 2\kappa _1}\right) \frac{c}{{\kappa _2 }}\tilde{\lambda }\Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1\right] . \end{aligned}$$

On the other hand, the LASSO estimator \(\tilde{\beta }(\tilde{\lambda })\) satisfies (Bühlmann and van de Geer 2011, Theorem 6.1)

$$\begin{aligned} \Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2} +2\tilde{\lambda }\Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1 \le \frac{16s\tilde{\lambda }^2}{\nu ^2n}. \end{aligned}$$

Since by assumption

$$\begin{aligned} \left( 2 + \frac{2\kappa _1\kappa _2}{\kappa _2 + 2\kappa _1}\right) \frac{c}{{\kappa _2} }\le 2, \end{aligned}$$

the LASSO bound implies

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}&\le \left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) \frac{16s\tilde{\lambda }^2}{\nu ^2 n} \end{aligned}$$

as desired.

Appendix B: proof of a generalization of Theorem 2

We consider a generalization of the TREX according to

$$\begin{aligned} \hat{\beta }\in \mathop {\mathrm {arg\,min}}\left\{ \frac{\Vert Y-X\beta \Vert ^2_2}{c\Omega ^*(X^\top (Y-X\beta ))}+\Omega (\beta )\right\} , \end{aligned}$$

where \(\Omega \) is a norm on \(\mathbb {R}^p\), \(\Omega ^{*}(\eta ):=\sup \{\eta ^{\top }\beta :\Omega (\beta )\le 1\}\) is the dual of \(\Omega \), \(0<c<2\), and the minimum is taken over all \(\beta \in \mathbb {R}^p\). We also set \(\hat{u}:=\Omega ^{*}(X^{\top }(Y-X\hat{\beta }))\) with some abuse of notation. The corresponding generalization of Assumption 2 then reads as follows.

Assumption 5

The regression vector \(\beta ^*\) is sufficiently large such that

$$\begin{aligned} \Omega ^{*}(X^{\top }X\beta ^*)\ge \left( 1+\frac{2}{c}\right) \Omega ^{*}(X^{\top }\varepsilon ). \end{aligned}$$

We now prove a generalization of Theorem 2.

Theorem 5

Let Assumption 5 be fulfilled. If \(\hat{u}\le \Omega ^{*}(X^{\top }Y)\), then the prediction loss of the TREX satisfies

$$\begin{aligned} \frac{\Vert X\hat{\beta }-X\beta ^*\Vert _{2}^{2}}{n}\le \frac{\left( 2\Omega ^{*}(X^{\top }\varepsilon )+\max \left\{ \hat{u},\frac{2}{c}\Omega ^{*}(X^{\top }\varepsilon )\right\} \right) \Omega (\beta ^*)}{n}. \end{aligned}$$

Theorem 2 follows from Theorem 5 by setting \(\Omega (\cdot ):=\Vert \cdot \Vert _{1}.\)
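To make this specialization concrete, the following sketch (our own illustrative code, not from any TREX implementation) evaluates the generalized objective with \(\Omega =\Vert \cdot \Vert _1\), whose dual is \(\Omega ^*=\Vert \cdot \Vert _\infty \); all function names are ours:

```python
import numpy as np

# Generalized TREX objective with Omega = l1 norm (dual: l-infinity norm).
def omega(beta):
    return np.sum(np.abs(beta))                 # Omega(beta) = ||beta||_1

def omega_dual(eta):
    return np.max(np.abs(eta))                  # Omega*(eta) = ||eta||_inf

def trex_objective(beta, X, Y, c=0.5):
    r = Y - X @ beta                            # residual Y - X beta
    return (r @ r) / (c * omega_dual(X.T @ r)) + omega(beta)

rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.standard_normal((n, p))
Y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(n)

# At beta = 0 the objective reduces to ||Y||_2^2 / (c ||X^T Y||_inf),
# the quantity that bounds the estimator at the start of the proof below.
at_zero = trex_objective(np.zeros(p), X, Y)
```

For a different penalty, only `omega` and `omega_dual` need to be swapped for another norm and its dual (e.g., a group norm).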

Proof (of Theorem 5)

The definition of the estimator implies

$$\begin{aligned} \frac{\Vert Y-X\hat{\beta }\Vert ^2_2}{c\cdot \hat{u}}+\Omega (\hat{\beta })\le \frac{\Vert Y\Vert ^2_2}{c\cdot \Omega ^*(X^\top Y)}, \end{aligned}$$

which yields together with the model assumptions

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}+\Vert \varepsilon \Vert ^{2}_{2}+2\varepsilon ^{\top }(X\beta ^*-X\hat{\beta })+c\hat{u}\Omega (\hat{\beta })\\&\quad \le \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}\left( \Vert \varepsilon \Vert ^{2}_{2}+\Vert X\beta ^*\Vert _{2}^{2}+2\varepsilon ^{\top }X\beta ^*\right) . \end{aligned}$$

Rearranging the terms and applying Hölder’s inequality in the form \(2\varepsilon ^\top X\hat{\beta }\le 2\Omega ^*(X^\top \varepsilon )\Omega (\hat{\beta })\) and \(\Vert X\beta ^*\Vert ^{2}_{2}=\beta ^*{}^{\top }X^{\top }X\beta ^*\le \Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)\) then gives

$$\begin{aligned} \begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le&\left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) \Vert \varepsilon \Vert ^{2}_{2}+(2\Omega ^{*}(X^{\top }\varepsilon )-c\hat{u})\Omega (\hat{\beta })\\&+\frac{\hat{u}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)}{\Omega ^{*}(X^{\top }Y)}+\left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) 2\varepsilon ^{\top }X\beta ^*. \end{aligned} \end{aligned}$$
(9)

Case 1: We first consider the case \(2\Omega ^{*}(X^{\top }\varepsilon )\le c\hat{u}\).

For this, we first note that \(\hat{u} \le \Omega ^{*}(X^{\top }Y)\) by assumption. Using this and \(2\Omega ^{*}(X^{\top }\varepsilon )\le c\hat{u}\) allows us to remove the first two terms on the right-hand side of Inequality (9) so that

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le \frac{\hat{u}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)}{\Omega ^{*}(X^{\top }Y)}+\left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) 2\varepsilon ^{\top }X\beta ^*. \end{aligned}$$

Since \(2\varepsilon ^\top X\beta ^*\le 2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)\) due to Hölder’s inequality, and since \(\hat{u}\le \Omega ^*(X^\top Y)\) by assumption, we obtain from the above display and the model assumptions

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}&\le \frac{\hat{u}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)}{\Omega ^{*}(X^{\top }Y)}+\left( 1-\frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}\right) 2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)\\&=2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)+\hat{u}\left( \frac{\Omega ^{*}(X^{\top }X\beta ^*)-2\Omega ^{*}(X^{\top }\varepsilon )}{\Omega ^{*}(X^{\top }Y)}\right) \Omega (\beta ^*)\\&=2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)+\hat{u}\left( \frac{\Omega ^{*}(X^{\top }X\beta ^*)-2\Omega ^{*}(X^{\top }\varepsilon )}{\Omega ^{*}(X^{\top }X\beta ^*+X^{\top }\varepsilon )}\right) \Omega (\beta ^*). \end{aligned}$$

Next, we note that the triangle inequality gives

$$\begin{aligned} \Omega ^{*}(X^{\top }X\beta ^*+X^{\top }\varepsilon )\ge \Omega ^{*}(X^{\top }X\beta ^*)-\Omega ^{*}(X^{\top }\varepsilon ). \end{aligned}$$

Plugging this into the previous display finally yields

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le 2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)+\hat{u}\Omega (\beta ^*), \end{aligned}$$

which concludes the proof for Case 1.

Case 2: We now consider the case \(2\Omega ^{*}(X^{\top }\varepsilon )\ge c\hat{u}\).

Similarly as before, we start with the definition of the estimator, which yields in particular

$$\begin{aligned} \Omega (\hat{\beta })\le \frac{\Vert Y\Vert ^2_2}{c\cdot \Omega ^*(X^\top Y)}. \end{aligned}$$

Invoking the model assumptions and Hölder’s inequality then gives

$$\begin{aligned} \Omega (\hat{\beta })&\le \frac{1}{c\Omega ^{*}(X^{\top }Y)}\left( \Vert \varepsilon \Vert ^{2}_{2}+\Vert X\beta ^*\Vert ^{2}_{2}+2\varepsilon ^{\top }X\beta ^*\right) \\&\le \frac{1}{c\Omega ^{*}(X^{\top }Y)}\left( \Vert \varepsilon \Vert ^{2}_{2}+\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)+2\varepsilon ^\top X\beta ^*\right) . \end{aligned}$$

We can now plug this into Inequality (9) to obtain

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le \left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) \Vert \varepsilon \Vert ^{2}_{2}\\&\quad +(2\Omega ^*(X^\top \varepsilon )-c\hat{u})\frac{1}{c\Omega ^{*}(X^{\top }Y)}\left( \Vert \varepsilon \Vert ^{2}_{2}+\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)+2\varepsilon ^{\top }X\beta ^*\right) \\&\quad +\frac{\hat{u}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)}{\Omega ^{*}(X^{\top }Y)}+\left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) 2\varepsilon ^{\top }X\beta ^*. \end{aligned}$$

We can now rearrange the terms to get

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le&\left( \frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c\Omega ^{*}(X^{\top }Y)}-1\right) \Vert \varepsilon \Vert ^{2}_{2}+\left( \frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c\Omega ^{*}(X^{\top }Y)}-1\right) 2\varepsilon ^{\top }X\beta ^*\\&+\frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c\Omega ^{*}(X^{\top }Y)}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*). \end{aligned}$$

We now observe that Assumption 5 implies via the triangle inequality and the model assumptions that

$$\begin{aligned} \Omega ^{*}(X^{\top }Y)\ge \frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c}. \end{aligned}$$

Using this, Hölder’s inequality, and the triangle inequality, we then find

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( 1-\frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c\Omega ^{*}(X^{\top }Y)}\right) 2\Omega ^*(X^{\top }\varepsilon )\Omega (\beta ^*)\\&\qquad +2\Omega ^*(X^\top \varepsilon )\Omega (\beta ^*)/c+\frac{2\Omega ^*(X^\top \varepsilon )}{c\Omega ^*(X^\top Y)}\Omega ^*(X^\top \varepsilon )\Omega (\beta ^*)\\&\quad = 2\Omega ^*(X^{\top }\varepsilon )\Omega (\beta ^*)+2\Omega ^*(X^\top \varepsilon )\Omega (\beta ^*)/c-\frac{2\Omega ^*(X^\top \varepsilon )}{c\Omega ^*(X^\top Y)}\Omega ^*(X^\top \varepsilon )\Omega (\beta ^*)\\&\quad \le (2+2/c)\Omega ^*(X^{\top }\varepsilon )\Omega (\beta ^*). \end{aligned}$$

This concludes the proof for Case 2 and, therefore, the proof of Theorem 5.

Appendix C: prediction error results for \(p=128\)

See Fig. 2.

Fig. 2

Prediction error versus relative regularization path \(\rho \) for \(p=128\) over \(N=51\) repetitions. LASSO prediction errors for \(\lambda \ge \lambda _\text {lb}\) are shown in solid blue and TREX prediction errors for \(c \ge c_\text {lb}\) in solid red. The blue and red intervals at the bottom mark the range into which the \(\lambda _\text {lb}\) and \(c_\text {lb}\) values fall across the \(N\) repetitions. Dashed transparent lines show prediction errors without theoretical guarantees. For the LASSO, we also show the locations of the regularization parameters selected by cross-validation with the MSE (orange) and 1SE (purple) rules, as well as by BIC (green) (color figure online)

About this article

Cite this article

Bien, J., Gaynanova, I., Lederer, J. et al. Prediction error bounds for linear regression with the TREX. TEST 28, 451–474 (2019). https://doi.org/10.1007/s11749-018-0584-4

Keywords

  • TREX
  • High-dimensional regression
  • Tuning parameters
  • Oracle inequalities

Mathematics Subject Classification

  • 62J07