Prediction error bounds for linear regression with the TREX

Abstract

The TREX is a recently introduced approach to sparse linear regression. In contrast to most well-known approaches to penalized regression, the TREX can be formulated without the use of tuning parameters. In this paper, we establish the first known prediction error bounds for the TREX. Additionally, we introduce extensions of the TREX to a more general class of penalties, and we provide a bound on the prediction error in this generalized setting. These results deepen the understanding of the TREX from a theoretical perspective and provide new insights into penalized regression in general.

Notes

  1. The right-hand side in Corollary 2 has a minimal magnitude of \(4\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1/n\), while the right-hand side in Theorem 2 has a minimal magnitude of \((2+2/c)\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1/n\).
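To make the comparison concrete, the following small Python snippet (our illustration, not part of the original paper) evaluates the two prefactors, \(4\) and \(2+2/c\), over the admissible range \(c\in (0,2)\):

import numpy as np

# Prefactors of the minimal magnitudes in Note 1: Corollary 2 gives 4,
# Theorem 2 gives 2 + 2/c, where 0 < c < 2 is the TREX constant.
for c in np.linspace(0.25, 1.75, 7):
    theorem2 = 2 + 2 / c
    smaller = "Theorem 2" if theorem2 < 4 else "Corollary 2"
    print(f"c = {c:.2f}:  2 + 2/c = {theorem2:.2f}  vs  4  (smaller: {smaller})")
# For c > 1 the Theorem 2 constant is below 4; for c <= 1 it is at least 4.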

References

  • Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79

  • Arlot S, Celisse A (2011) Segmentation of the mean of heteroscedastic data via cross-validation. Stat Comput 21(4):613–632

  • Bach F (2008) Bolasso: model consistent lasso estimation through the bootstrap. In: Proceedings of the 25th international conference on machine learning, pp 33–40

  • Baraud Y, Giraud C, Huet S (2009) Gaussian model selection with an unknown variance. Ann Stat 37(2):630–672

  • Barber R, Candès E (2015) Controlling the false discovery rate via knockoffs. Ann Stat 43(5):2055–2085

  • Belloni A, Chernozhukov V, Wang L (2011) Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4):791–806

  • Bickel P, Ritov Y, Tsybakov A (2009) Simultaneous analysis of lasso and Dantzig selector. Ann Stat 37(4):1705–1732

  • Bien J, Gaynanova I, Lederer J, Müller C (2018) Non-convex global minimization and false discovery rate control for the TREX. J Comput Graph Stat 27(1):23–33. https://doi.org/10.1080/10618600.2017.1341414

  • Boucheron S, Lugosi G, Massart P (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, Oxford

  • Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin

  • Bunea F, Lederer J, She Y (2014) The group square-root lasso: theoretical properties and fast algorithms. IEEE Trans Inf Theory 60(2):1313–1325

  • Bunea F, Tsybakov A, Wegkamp M (2006) Aggregation and sparsity via \(\ell _1\)-penalized least squares. In: Proceedings of 19th annual conference on learning theory, pp 379–391

  • Candès E, Plan Y (2009) Near-ideal model selection by \(\ell _1\) minimization. Ann Stat 37(5):2145–2177

  • Candès E, Tao T (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann Stat 35(6):2313–2351

  • Chatterjee S, Jafarov J (2015) Prediction error of cross-validated lasso. arXiv:1502.06291

  • Chételat D, Lederer J, Salmon J (2017) Optimal two-step prediction in regression. Electron J Stat 11(1):2519–2546

  • Chichignoud M, Lederer J, Wainwright M (2016) A practical scheme and fast algorithm to tune the lasso with optimality guarantees. J Mach Learn Res 17:1–20

  • Combettes P, Müller C (2016) Perspective functions: proximal calculus and applications in high-dimensional statistics. J Math Anal Appl 457(2):1283–1306

  • Dalalyan A, Tsybakov A (2012) Mirror averaging with sparsity priors. Bernoulli 18(3):914–944

  • Dalalyan A, Tsybakov A (2012) Sparse regression learning by aggregation and Langevin Monte-Carlo. J Comput Syst Sci 78(5):1423–1443

  • Dalalyan A, Hebiri M, Lederer J (2017) On the prediction performance of the lasso. Bernoulli 23(1):552–581

  • Dalalyan A, Tsybakov A (2007) Aggregation by exponential weighting and sharp oracle inequalities. In: Proceedings of 19th annual conference on learning theory, pp 97–111

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  • Giraud C, Huet S, Verzelen N (2012) High-dimensional regression with unknown variance. Stat Sci 27(4):500–518

  • Hebiri M, Lederer J (2013) How correlations influence lasso prediction. IEEE Trans Inf Theory 59(3):1846–1854

  • Huang C, Cheang G, Barron A (2008) Risk of penalized least squares, greedy selection and L1 penalization for flexible function libraries. Manuscript

  • Koltchinskii V, Lounici K, Tsybakov A (2011) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann Stat 39(5):2302–2329

  • Lederer J, van de Geer S (2014) New concentration inequalities for empirical processes. Bernoulli 20(4):2020–2038

  • Lederer J, Müller C (2014) Topology adaptive graph estimation in high dimensions. arXiv:1410.7279

  • Lederer J, Müller C (2015) Don’t fall for tuning parameters: tuning-free variable selection in high dimensions with the TREX. In: Proceedings of the twenty-ninth AAAI conference on artificial intelligence

  • Lederer J, Yu L, Gaynanova I (2016) Oracle inequalities for high-dimensional prediction. arXiv:1608.00624

  • Lim N, Lederer J (2016) Efficient feature selection with large and high-dimensional data. arXiv:1609.07195

  • Massart P, Meynet C (2011) The Lasso as an \(\ell _1\)-ball model selection procedure. Electron J Stat 5:669–687

  • Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Ser B 72(4):417–473

  • Raskutti G, Wainwright M, Yu B (2010) Restricted eigenvalue properties for correlated Gaussian designs. J Mach Learn Res 11:2241–2259

  • Rigollet P, Tsybakov A (2011) Exponential screening and optimal rates of sparse estimation. Ann Stat 39(2):731–771

  • Sabourin J, Valdar W, Nobel A (2015) A permutation approach for selecting the penalty parameter in penalized model selection. Biometrics 71(4):1185–1194

  • Shah R, Samworth R (2013) Variable selection with error control: another look at stability selection. J R Stat Soc Ser B 75(1):55–80

  • Sun T, Zhang CH (2012) Scaled sparse linear regression. Biometrika 99(4):879–898

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58(1):267–288

  • van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the lasso. Electron J Stat 3:1360–1392

  • van de Geer S, Lederer J (2013) The Bernstein-Orlicz norm and deviation inequalities. Probab Theory Relat Fields 157(1–2):225–250

  • van de Geer S, Lederer J (2013) The Lasso, correlated design, and improved oracle inequalities. IMS Collections 9:303–316

  • van der Vaart A, Wellner J (1996) Weak convergence and empirical processes. Springer, Berlin

  • van de Geer S (2007) The deterministic lasso. In: Joint statistical meetings proceedings

  • van de Geer S (2000) Empirical processes in M-estimation. Cambridge University Press, Cambridge

  • Wainwright M (2009) Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell _1\)-constrained quadratic programming (lasso). IEEE Trans Inf Theory 55(4):2183–2202

  • Wellner J (2017) The Bennett-Orlicz norm. Sankhya A 79(2):355–383

  • Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 68(1):49–67

  • Zhang C (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942

  • Zhuang R, Lederer J (2017) Maximum regularized likelihood estimators: a general prediction theory and applications. arXiv:1710.02950

Acknowledgements

We thank the editor and the reviewers for their insightful comments.

Author information

Corresponding author

Correspondence to Johannes Lederer.

Appendices

Appendix A: Proof of a generalization of Theorem 1

We first consider a generalization of Assumption 1.

Assumption 4

The signal \(\beta ^*\) is sufficiently small such that for some \(\kappa _{1}>1\) and \(\kappa _{2}>2\) with \(1/\kappa _1 + 2/\kappa _2 <1\),

$$\begin{aligned} \Vert \beta ^*\Vert _1\le \frac{1}{4}\left( 1-\frac{1}{\kappa _{1}}-\frac{2}{\kappa _{2}}\right) \frac{\Vert \varepsilon \Vert _2^2}{\Vert X^{\top }\varepsilon \Vert _{\infty }}. \end{aligned}$$
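For a sense of scale, the following sketch (our own illustration with an arbitrary Gaussian design and noise level; the constants \(\kappa _1=2\) and \(\kappa _2=8\) are those used below for Theorem 1) evaluates the right-hand side of Assumption 4 on simulated data:

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 500, 1.0                  # arbitrary sample size, dimension, noise level
X = rng.standard_normal((n, p))              # Gaussian design
eps = sigma * rng.standard_normal(n)         # noise vector epsilon

kappa1, kappa2 = 2, 8                        # the choices behind Theorem 1
prefactor = 0.25 * (1 - 1 / kappa1 - 2 / kappa2)        # equals 1/16 for these kappas
rhs = prefactor * np.sum(eps ** 2) / np.max(np.abs(X.T @ eps))

# Assumption 4 requires ||beta*||_1 to be at most this value.
print(f"prefactor = {prefactor:.4f},  admissible ||beta*||_1 <= {rhs:.3f}")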

As a first step toward the proof of Theorem 1, we show that any TREX solution has an \(\ell _{1}\)-norm at least as large as that of any LASSO solution with tuning parameter \(\lambda =\hat{u}\).

Lemma 3

Any TREX solution (4) satisfies

$$\begin{aligned} \Vert \hat{\beta }\Vert _{1}\ge \Vert \tilde{\beta }(\hat{u})\Vert _{1}, \end{aligned}$$

where \(\tilde{\beta }(\hat{u})\) is any LASSO solution as in (2) with tuning parameter \(\lambda =\hat{u}\).

Proof (of Lemma 3)

If \(\tilde{\beta }(\hat{u})=0\), the statement holds trivially. Now for \(\tilde{\beta }(\hat{u})\ne 0\), the KKT conditions for LASSO imply that

$$\begin{aligned} \Vert X^{\top }(Y-X\tilde{\beta }(\hat{u}))\Vert _{\infty }=\hat{u}. \end{aligned}$$

Together with the definition of \(\hat{\beta }\), this yields

$$\begin{aligned} \Vert Y-X\hat{\beta }\Vert ^{2}_{2}+c\hat{u}\Vert \hat{\beta }\Vert _{1}\le \Vert Y-X\tilde{\beta }(\hat{u})\Vert ^{2}_{2}+c\hat{u}\Vert \tilde{\beta }(\hat{u})\Vert _{1}. \end{aligned}$$

On the other hand, the definition of the LASSO implies

$$\begin{aligned} \Vert Y-X\tilde{\beta }(\hat{u})\Vert ^{2}_{2}+2\hat{u}\Vert \tilde{\beta }(\hat{u})\Vert _{1}\le \Vert Y-X\hat{\beta }\Vert ^{2}_{2}+2\hat{u}\Vert \hat{\beta }\Vert _{1}. \end{aligned}$$

Combining these two displays gives us

$$\begin{aligned} (c-2)\hat{u}\Vert \hat{\beta }\Vert _{1}\le (c-2)\hat{u}\Vert \tilde{\beta }(\hat{u})\Vert _{1}. \end{aligned}$$

The claim now follows from \(c< 2\): the factor \((c-2)\hat{u}\) is negative, so dividing both sides by it reverses the inequality and yields \(\Vert \hat{\beta }\Vert _{1}\ge \Vert \tilde{\beta }(\hat{u})\Vert _{1}\).
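The cancellation in the last step can also be checked symbolically; the following sketch (our own addition, with shorthand symbols for the squared residuals and \(\ell _1\)-norms) reproduces the display above:

import sympy as sp

# Shorthands: r_hat = ||Y - X beta_hat||_2^2, r_til = ||Y - X beta_tilde(u)||_2^2,
# b_hat = ||beta_hat||_1, b_til = ||beta_tilde(u)||_1, u = u_hat, c = TREX constant.
r_hat, r_til, b_hat, b_til, u, c = sp.symbols('r_hat r_til b_hat b_til u c', positive=True)

trex_side = (r_hat + c * u * b_hat) - (r_til + c * u * b_til)     # <= 0 by the TREX definition
lasso_side = (r_til + 2 * u * b_til) - (r_hat + 2 * u * b_hat)    # <= 0 by the LASSO definition

# Adding the two nonpositive quantities cancels the residual terms:
print(sp.factor(trex_side + lasso_side))    # u*(c - 2)*(b_hat - b_til), up to factor ordering
# Since c < 2, the factor (c - 2)*u is negative, so b_hat >= b_til, which is the claim.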

We are now ready to prove a generalization of Theorem 1.

Theorem 4

Let Assumption 4 be fulfilled, and let \(\tilde{\lambda }:=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\). Then, for any \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the prediction loss of the TREX satisfies

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le & {} \left( \frac{1}{\kappa _{1}}+\frac{2}{\kappa _{2}}\right) \Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2} \\&+\left( 2+\frac{2}{\kappa _{1}}+\frac{4}{\kappa _{2}}\right) \Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1. \end{aligned}$$

Theorem 1 follows from Theorem 4 by setting \(\kappa _{1}=2\) and \(\kappa _{2}=8\).
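For reference, the constants that this choice produces in the bound of Theorem 4 can be computed exactly (a small check added here):

from fractions import Fraction

kappa1, kappa2 = Fraction(2), Fraction(8)
quad_coef = 1 / kappa1 + 2 / kappa2            # coefficient of the squared prediction term
cross_coef = 2 + 2 / kappa1 + 4 / kappa2       # coefficient of the l1 cross term
print(quad_coef, cross_coef)                   # 3/4 and 7/2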

Proof (of Theorem 4)

Assume first \(\tilde{\beta }(\tilde{\lambda })=0\). Then, since \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the definition of the TREX implies

$$\begin{aligned} \Vert Y-X\hat{\beta }\Vert ^{2}_{2}+c\hat{u}\Vert \hat{\beta }\Vert _{1}\le \Vert Y\Vert _2^2/\kappa _1 =\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert ^{2}_{2}/\kappa _1\,. \end{aligned}$$

Assume now \(\tilde{\beta }(\tilde{\lambda })\ne 0\). In view of the KKT conditions for LASSO, \(\tilde{\beta }(\tilde{\lambda })\) fulfills

$$\begin{aligned} \Vert X^\top (Y-X\tilde{\beta }(\tilde{\lambda }))\Vert _\infty =\tilde{\lambda }=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\ge \kappa _{1}\hat{u}. \end{aligned}$$

The definition of the TREX therefore yields

$$\begin{aligned} \Vert Y-X\hat{\beta }\Vert ^{2}_{2}+c\hat{u}\Vert \hat{\beta }\Vert _{1}\le \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert ^{2}_{2}}{{\kappa _{1}}}+c\hat{u}\Vert \tilde{\beta }(\tilde{\lambda })\Vert _1. \end{aligned}$$

We now observe that since \(\kappa _1> 1\) by assumption, we have \(\tilde{\lambda }\ge \hat{u}\), and one can easily verify that this implies \(\Vert \tilde{\beta }(\hat{u})\Vert _{1}\ge \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\). At the same time, Lemma 3 ensures \(\Vert \hat{\beta }\Vert _1\ge \Vert \tilde{\beta }(\hat{u})\Vert _1\). Thus, \(c\hat{u}\Vert \hat{\beta }\Vert _{1}\ge c\hat{u}\Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\) and, therefore, we find again

$$\begin{aligned} \Vert Y-X\hat{\beta }\Vert ^{2}_{2}\le \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert ^{2}_{2}}{\kappa _{1}}. \end{aligned}$$

Invoking the model, this results in

$$\begin{aligned}&\kappa _1\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( 1-\kappa _{1}\right) \Vert \varepsilon \Vert ^{2}_{2}+2\varepsilon ^\top (X\beta ^*- X\tilde{\beta }(\tilde{\lambda }))+2\kappa _1\varepsilon ^\top (X\hat{\beta }- X\beta ^*)+\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}. \end{aligned}$$

We can now use Hölder’s inequality and the triangle inequality to deduce

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( \frac{1-\kappa _{1}}{\kappa _{1}}\right) \Vert \varepsilon \Vert _2^2+\frac{2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1}{\kappa _{1}}\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \Vert \hat{\beta }\Vert _1+2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1+ \frac{\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}{\kappa _1} \\&\quad \le \left( \frac{1-\kappa _{1}}{\kappa _{1}}\right) \Vert \varepsilon \Vert _2^2+\frac{2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1}{\kappa _{1}}\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \left( \frac{\Vert Y-X\hat{\beta }\Vert _2^2}{c\hat{u}}+ \Vert \hat{\beta }\Vert _1\right) +2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1+ \frac{\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}{\kappa _1}. \end{aligned}$$

Next, we observe that by the definition of our estimator \(\hat{\beta }\) and of \(\tilde{\lambda }\),

$$\begin{aligned}&\frac{\Vert Y-X\hat{\beta }\Vert _2^2}{c\hat{u}}+ \Vert \hat{\beta }\Vert _1\le \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert _2^2}{c\tilde{\lambda }}+ \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\le \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert _2^2}{\kappa _2\Vert X^\top \varepsilon \Vert _\infty }+ \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1. \end{aligned}$$

Combining these two displays and using the model assumption then gives

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( \frac{1-\kappa _{1}}{\kappa _{1}}\right) \Vert \varepsilon \Vert _2^2+\frac{2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1}{\kappa _{1}}\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \left( \frac{\Vert Y-X\tilde{\beta }(\tilde{\lambda })\Vert _2^2}{\kappa _2\Vert X^\top \varepsilon \Vert _\infty }+ \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\right) +2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1 + \frac{\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}{\kappa _1}\\&\quad = \left( \frac{1-\kappa _{1}}{\kappa _{1}}+\frac{2}{\kappa _2}\right) \Vert \varepsilon \Vert _2^2+\frac{2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1}{\kappa _{1}}\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \left( \frac{2\varepsilon ^\top X(\beta ^*-\tilde{\beta }(\tilde{\lambda }))}{\kappa _2\Vert X^\top \varepsilon \Vert _\infty }+ \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1\right) +2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1\\&\qquad +\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}. \end{aligned}$$

We can now use Hölder’s inequality and rearrange the terms to get

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( \frac{1-\kappa _{1}}{\kappa _{1}}+\frac{2}{\kappa _2}\right) \Vert \varepsilon \Vert _2^2+\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) 2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1\\&\qquad + 2\Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1+2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1+\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}. \end{aligned}$$

The last step is to use Assumption 4, which ensures

$$\begin{aligned} \Vert \beta ^*\Vert _1\le \left( \frac{1}{4}-\frac{1}{4\kappa _{1}}-\frac{1}{2\kappa _{2}}\right) \frac{\Vert \varepsilon \Vert _2^2}{\Vert X^{\top }\varepsilon \Vert _{\infty }} \end{aligned}$$

and, therefore,

$$\begin{aligned} 4\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1\le \left( \frac{\kappa _{1}-1}{\kappa _{1}}-\frac{2}{\kappa _{2}}\right) \Vert \varepsilon \Vert _2^2. \end{aligned}$$

The above display therefore yields

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}&\le -\,4\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1 +\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) 2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*-\tilde{\beta }(\tilde{\lambda })\Vert _1\\&\quad + 2\Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })\Vert _1+2\Vert X^\top \varepsilon \Vert _\infty \Vert \beta ^*\Vert _1\\&\quad +\left( \frac{1}{\kappa _1} +\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}. \end{aligned}$$

Using the triangle inequality, we finally obtain

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le & {} \left( 2+\frac{2}{\kappa _1}+\frac{4}{\kappa _2}\right) \Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1\\&+\left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}} \end{aligned}$$

as desired.

Corollary 4

Let Assumption 4 be fulfilled, and let \(\tilde{\lambda }:=\max \{\kappa _{1} \hat{u},\frac{\kappa _{2}}{c}\Vert X^\top \varepsilon \Vert _\infty \}\). Furthermore, let \(\kappa _1,\kappa _2>0\) be such that

$$\begin{aligned} \frac{1}{\kappa _2}+\frac{\kappa _1}{\kappa _2 + 2\kappa _1} \le \frac{1}{c}. \end{aligned}$$

Then for any \(\hat{u}\le \Vert X^\top Y\Vert _\infty /\kappa _1\), the prediction loss of the TREX satisfies

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le \left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) \frac{16s\tilde{\lambda }^2}{\nu ^2 n}, \end{aligned}$$

where \(\nu \) is the compatibility constant defined in (5).

Corollary 3 follows from Corollary 4 by setting \(\kappa _{1}=2\) and \(\kappa _{2}=8\), which satisfy the requirement for any \(c\in (0,2)\).
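The side condition can be verified directly for this choice; the following check (our addition) confirms that \(1/\kappa _2+\kappa _1/(\kappa _2+2\kappa _1)=7/24\), which is below \(1/c\) for every \(c\in (0,2)\):

from fractions import Fraction

kappa1, kappa2 = 2, 8
lhs = Fraction(1, kappa2) + Fraction(kappa1, kappa2 + 2 * kappa1)   # = 7/24
print(lhs, lhs <= Fraction(1, 2))   # 7/24 True; and 1/c > 1/2 for every c in (0, 2)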

Proof (of Corollary 4)

Using Theorem 4 and the definition of \(\tilde{\lambda }\), the TREX prediction loss satisfies

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}&\le \left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) {\Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2}}\\&\quad + \left( 2+\frac{2}{\kappa _1}+\frac{4}{\kappa _2}\right) \Vert X^\top \varepsilon \Vert _\infty \Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1\\&\le \left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) \left[ \Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2} + \left( 2+\frac{2\kappa _1\kappa _2}{\kappa _2 + 2\kappa _1}\right) \frac{c}{{\kappa _2 }}\tilde{\lambda }\Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1\right] . \end{aligned}$$

On the other hand, the LASSO estimator \(\tilde{\beta }(\tilde{\lambda })\) satisfies (Bühlmann and van de Geer 2011, Theorem 6.1)

$$\begin{aligned} \Vert X\tilde{\beta }(\tilde{\lambda })-X\beta ^*\Vert ^{2}_{2} +2\tilde{\lambda }\Vert \tilde{\beta }(\tilde{\lambda })-\beta ^*\Vert _1 \le \frac{16s\tilde{\lambda }^2}{\nu ^2n}. \end{aligned}$$

Since by assumption

$$\begin{aligned} \left( 2 + \frac{2\kappa _1\kappa _2}{\kappa _2 + 2\kappa _1}\right) \frac{c}{{\kappa _2} }\le 2, \end{aligned}$$

the LASSO bound implies

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}&\le \left( \frac{1}{\kappa _1}+\frac{2}{\kappa _2}\right) \frac{16s\tilde{\lambda }^2}{\nu ^2 n} \end{aligned}$$

as desired.

Appendix B: Proof of a generalization of Theorem 2

We consider a generalization of the TREX according to

$$\begin{aligned} \hat{\beta }\in \mathop {\mathrm {arg\,min}}\left\{ \frac{\Vert Y-X\beta \Vert ^2_2}{c\Omega ^*(X^\top (Y-X\beta ))}+\Omega (\beta )\right\} , \end{aligned}$$

where \(\Omega \) is a norm on \(\mathbb {R}^p\), \(\Omega ^{*}(\eta ):=\sup \{\eta ^{\top }\beta :\Omega (\beta )\le 1\}\) is the dual of \(\Omega \), \(0<c<2\), and the minimum is taken over all \(\beta \in \mathbb {R}^p\). We also set \(\hat{u}:=\Omega ^{*}(X^{\top }(Y-X\hat{\beta }))\) with some abuse of notation. The corresponding generalization of Assumption 2 then reads as follows.
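To make this generalized estimator concrete, here is a minimal numerical sketch for the special case \(\Omega =\Vert \cdot \Vert _1\), whose dual norm is \(\Omega ^*=\Vert \cdot \Vert _\infty \). It is our own illustration: the function name trex_objective, the toy data, and the crude multi-start search are ours and are not the algorithms of the paper or of Bien et al. (2018).

import numpy as np
from scipy.optimize import minimize

def trex_objective(beta, X, Y, c=0.5):
    """Generalized TREX objective with Omega = ||.||_1 and Omega* = ||.||_infty."""
    residual = Y - X @ beta
    dual = np.max(np.abs(X.T @ residual))     # Omega*(X^T (Y - X beta))
    if dual == 0.0:                           # guard: objective undefined at an exact fit
        return np.inf
    return residual @ residual / (c * dual) + np.sum(np.abs(beta))

# Toy data: sparse beta*, Gaussian design and noise (arbitrary settings).
rng = np.random.default_rng(1)
n, p, s = 80, 20, 3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 2.0
Y = X @ beta_star + 0.5 * rng.standard_normal(n)

# Crude multi-start local search; the objective is non-smooth and non-convex.
best = None
for _ in range(10):
    start = 0.1 * rng.standard_normal(p)
    res = minimize(trex_objective, start, args=(X, Y), method="Nelder-Mead",
                   options={"maxiter": 20000, "fatol": 1e-8, "xatol": 1e-6})
    if best is None or res.fun < best.fun:
        best = res

beta_hat = best.x
u_hat = np.max(np.abs(X.T @ (Y - X @ beta_hat)))      # the quantity u_hat from the text
print("objective value:", best.fun, " u_hat:", u_hat)

In practice, dedicated formulations such as the perspective-function calculus of Combettes and Müller (2016) or the global-minimization approach of Bien et al. (2018) are the appropriate tools; the sketch only illustrates the objective and the quantity \(\hat{u}\).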

Assumption 5

The regression vector \(\beta ^*\) is sufficiently large such that

$$\begin{aligned} \Omega ^{*}(X^{\top }X\beta ^*)\ge \left( 1+\frac{2}{c}\right) \Omega ^{*}(X^{\top }\varepsilon ). \end{aligned}$$

We now prove a generalization of Theorem 2.

Theorem 5

Let Assumption 5 be fulfilled. If \(\hat{u}\le \Omega ^{*}(X^{\top }Y)\), then the prediction loss of the TREX satisfies

$$\begin{aligned} \frac{\Vert X\hat{\beta }-X\beta ^*\Vert _{2}^{2}}{n}\le \frac{\left( 2\Omega ^{*}(X^{\top }\varepsilon )+\max \left\{ \hat{u},\frac{2}{c}\Omega ^{*}(X^{\top }\varepsilon )\right\} \right) \Omega (\beta ^*)}{n}. \end{aligned}$$

Theorem 2 follows from Theorem 5 by setting \(\Omega (\cdot ):=\Vert \cdot \Vert _{1}.\)

Proof (of Theorem 5)

The definition of the estimator implies

$$\begin{aligned} \frac{\Vert Y-X\hat{\beta }\Vert ^2_2}{c\cdot \hat{u}}+\Omega (\hat{\beta })\le \frac{\Vert Y\Vert ^2_2}{c\cdot \Omega ^*(X^\top Y)}, \end{aligned}$$

which yields together with the model assumptions

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}+\Vert \varepsilon \Vert ^{2}_{2}+2\varepsilon ^{\top }(X\beta ^*-X\hat{\beta })+c\hat{u}\Omega (\hat{\beta })\\&\quad \le \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}\left( \Vert \varepsilon \Vert ^{2}_{2}+\Vert X\beta ^*\Vert _{2}^{2}+2\varepsilon ^{\top }X\beta ^*\right) . \end{aligned}$$

Rearranging the terms and using Hölder’s inequality in the form \(2\varepsilon ^\top X\hat{\beta }\le 2\Omega ^*(X^\top \varepsilon )\Omega (\hat{\beta })\) and \(\Vert X\beta ^*\Vert ^{2}_{2}=\beta ^*{}^{\top }X^{\top }X\beta ^*\le \Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)\) then gives

$$\begin{aligned} \begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le&\left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) \Vert \varepsilon \Vert ^{2}_{2}+(2\Omega ^{*}(X^{\top }\varepsilon )-c\hat{u})\Omega (\hat{\beta })\\&+\frac{\hat{u}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)}{\Omega ^{*}(X^{\top }Y)}+\left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) 2\varepsilon ^{\top }X\beta ^*. \end{aligned} \end{aligned}$$
(9)

Case 1: We first consider the case \(2\Omega ^{*}(X^{\top }\varepsilon )\le c\hat{u}\).

For this, we first note that \(\hat{u} \le \Omega ^{*}(X^{\top }Y)\) by assumption. Using this and \(2\Omega ^{*}(X^{\top }\varepsilon )\le c\hat{u}\) allows us to remove the first two terms on the right-hand side of Inequality (9) so that

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le \frac{\hat{u}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)}{\Omega ^{*}(X^{\top }Y)}+\left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) 2\varepsilon ^{\top }X\beta ^*. \end{aligned}$$

Since \(|2\varepsilon ^\top X\beta ^*|\le 2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)\) by Hölder’s inequality, and since \(\hat{u}\le \Omega ^*(X^\top Y)\) by assumption, we obtain from the above display and the model assumptions

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}&\le \frac{\hat{u}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)}{\Omega ^{*}(X^{\top }Y)}+\left( 1-\frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}\right) 2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)\\&=2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)+\hat{u}\left( \frac{\Omega ^{*}(X^{\top }X\beta ^*)-2\Omega ^{*}(X^{\top }\varepsilon )}{\Omega ^{*}(X^{\top }Y)}\right) \Omega (\beta ^*)\\&=2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)+\hat{u}\left( \frac{\Omega ^{*}(X^{\top }X\beta ^*)-2\Omega ^{*}(X^{\top }\varepsilon )}{\Omega ^{*}(X^{\top }X\beta ^*+X^{\top }\varepsilon )}\right) \Omega (\beta ^*). \end{aligned}$$

Next, we note that the triangle inequality gives

$$\begin{aligned} \Omega ^{*}(X^{\top }X\beta ^*+X^{\top }\varepsilon )\ge \Omega ^{*}(X^{\top }X\beta ^*)-\Omega ^{*}(X^{\top }\varepsilon ). \end{aligned}$$

Plugging this into the previous display finally yields

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le 2\Omega ^{*}(X^{\top }\varepsilon )\Omega (\beta ^*)+\hat{u}\Omega (\beta ^*), \end{aligned}$$

which concludes the proof for Case 1.

Case 2: We now consider the case \(2\Omega ^{*}(X^{\top }\varepsilon )\ge c\hat{u}\).

As before, we start with the definition of the estimator, which yields in particular

$$\begin{aligned} \Omega (\hat{\beta })\le \frac{\Vert Y\Vert ^2_2}{c\cdot \Omega ^*(X^\top Y)}. \end{aligned}$$

Invoking the model assumptions and Hölder’s inequality then gives

$$\begin{aligned} \Omega (\hat{\beta })&\le \frac{1}{c\Omega ^{*}(X^{\top }Y)}\left( \Vert \varepsilon \Vert ^{2}_{2}+\Vert X\beta ^*\Vert ^{2}_{2}+2\varepsilon ^{\top }X\beta ^*\right) \\&\le \frac{1}{c\Omega ^{*}(X^{\top }Y)}\left( \Vert \varepsilon \Vert ^{2}_{2}+\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)+2\varepsilon ^\top X\beta ^*\right) . \end{aligned}$$

We can now plug this into Inequality (9) to obtain

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le \left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) \Vert \varepsilon \Vert ^{2}_{2}\\&\quad +(2\Omega ^*(X^\top \varepsilon )-c\hat{u})\frac{1}{c\Omega ^{*}(X^{\top }Y)}\left( \Vert \varepsilon \Vert ^{2}_{2}+\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)+2\varepsilon ^{\top }X\beta ^*\right) \\&\quad +\frac{\hat{u}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*)}{\Omega ^{*}(X^{\top }Y)}+\left( \frac{\hat{u}}{\Omega ^{*}(X^{\top }Y)}-1\right) 2\varepsilon ^{\top }X\beta ^*. \end{aligned}$$

We can now rearrange the terms to get

$$\begin{aligned} \Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\le&\left( \frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c\Omega ^{*}(X^{\top }Y)}-1\right) \Vert \varepsilon \Vert ^{2}_{2}+\left( \frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c\Omega ^{*}(X^{\top }Y)}-1\right) 2\varepsilon ^{\top }X\beta ^*\\&+\frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c\Omega ^{*}(X^{\top }Y)}\Omega ^{*}(X^{\top }X\beta ^*)\Omega (\beta ^*). \end{aligned}$$

We now observe that Assumption 5 implies via the triangle inequality and the model assumptions that

$$\begin{aligned} \Omega ^{*}(X^{\top }Y)\ge \frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c}. \end{aligned}$$

Using this, Hölder’s inequality, and the triangle inequality, we then find

$$\begin{aligned}&\Vert X\hat{\beta }-X\beta ^*\Vert ^{2}_{2}\\&\quad \le \left( 1-\frac{2\Omega ^{*}(X^{\top }\varepsilon )}{c\Omega ^{*}(X^{\top }Y)}\right) 2\Omega ^*(X^{\top }\varepsilon )\Omega (\beta ^*)\\&\qquad +2\Omega ^*(X^\top \varepsilon )\Omega (\beta ^*)/c+\frac{2\Omega ^*(X^\top \varepsilon )}{c\Omega ^*(X^\top Y)}\Omega ^*(X^\top \varepsilon )\Omega (\beta ^*)\\&\quad = 2\Omega ^*(X^{\top }\varepsilon )\Omega (\beta ^*)+2\Omega ^*(X^\top \varepsilon )\Omega (\beta ^*)/c-\frac{2\Omega ^*(X^\top \varepsilon )}{c\Omega ^*(X^\top Y)}\Omega ^*(X^\top \varepsilon )\Omega (\beta ^*)\\&\quad \le (2+2/c)\Omega ^*(X^{\top }\varepsilon )\Omega (\beta ^*). \end{aligned}$$

This concludes the proof for Case 2 and, therefore, the proof of Theorem 5.

Appendix C: Prediction error results for \(p=128\)

See Fig. 2.

Fig. 2

Prediction error versus relative regularization path \(\rho \) for \(p=128\) and \(N=51\) repetitions. LASSO prediction errors for \(\lambda \ge \lambda _\text {lb}\) are shown in solid blue and TREX prediction errors for \(c \ge c_\text {lb}\) in solid red. The blue and red intervals at the bottom mark the ranges over which the \(\lambda _\text {lb}\) and \(c_\text {lb}\) values fall across the \(N\) repetitions. Dashed transparent lines show prediction errors without theoretical guarantees. For the LASSO, we also mark the regularization parameters selected by cross-validation with the MSE (orange) and 1SE (purple) rules, as well as by BIC (green) (color figure online)
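To reproduce the flavor of the LASSO part of this comparison, the following sketch (our own, with arbitrary simulation settings and scikit-learn in place of the authors' code; the TREX curve would additionally require a TREX solver such as that of Bien et al. 2018) computes prediction errors along a LASSO path and the tuning parameters selected by cross-validation with the MSE and 1SE rules and by BIC:

import numpy as np
from sklearn.linear_model import LassoCV, LassoLarsIC, lasso_path

rng = np.random.default_rng(2)
n, p, s, sigma = 300, 128, 5, 1.0            # arbitrary settings; p = 128 as in Appendix C
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 1.0
Y = X @ beta_star + sigma * rng.standard_normal(n)

# Prediction error ||X beta_hat(lambda) - X beta*||_2^2 / n along a LASSO path.
alphas, coefs, _ = lasso_path(X, Y, n_alphas=50)
pred_err = np.array([np.mean((X @ b - X @ beta_star) ** 2) for b in coefs.T])

# Tuning parameters selected by cross-validation (MSE rule and 1SE rule) and by BIC.
cv = LassoCV(alphas=alphas, cv=5, fit_intercept=False).fit(X, Y)
mse_mean = cv.mse_path_.mean(axis=1)
mse_se = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])
i_min = int(np.argmin(mse_mean))
alpha_1se = np.max(cv.alphas_[mse_mean <= mse_mean[i_min] + mse_se[i_min]])
alpha_bic = LassoLarsIC(criterion="bic", fit_intercept=False).fit(X, Y).alpha_

print("alpha selected by CV (MSE rule):", cv.alpha_)
print("alpha selected by CV (1SE rule):", alpha_1se)
print("alpha selected by BIC:          ", alpha_bic)
print("smallest prediction error along the path:", pred_err.min())

Plotting pred_err against the path and marking the three selected parameters gives a LASSO-only analogue of the comparison in Fig. 2.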

Cite this article

Bien, J., Gaynanova, I., Lederer, J. et al. Prediction error bounds for linear regression with the TREX. TEST 28, 451–474 (2019). https://doi.org/10.1007/s11749-018-0584-4
